ColumnStore Bulk Data Loading

Overview

cpimport is a high-speed bulk-load utility that imports data into ColumnStore tables in a fast and efficient manner. It accepts as input any flat file in which a delimiter separates the fields of data (i.e., the columns in a table). The default delimiter is the pipe ('|') character, but other delimiters, such as commas, may be used as well. The data values must be in the same order as in the CREATE TABLE statement, i.e., column 1 of the file matches the first column in the table, and so on. Date values must be specified in the format 'yyyy-mm-dd'.

cpimport performs the following operations when importing data into a MariaDB ColumnStore database:

  • Data is read from specified flat files.

  • Data is transformed to fit ColumnStore’s column-oriented storage design.

  • Redundant data is tokenized and logically compressed.

  • Data is written to disk.

It is important to note that:

  • The bulk loads are an append operation to a table, so they allow existing data to be read and remain unaffected during the process.

  • The bulk loads do not write their data operations to the transaction log; they are not transactional in nature but are considered an atomic operation at this time. Information markers, however, are placed in the transaction log so the DBA is aware that a bulk operation did occur.

  • Upon completion of the load operation, a high-water mark in each column file is moved in an atomic operation that allows any subsequent queries to read the newly loaded data. This append operation provides a consistent read but does not incur the overhead of logging the data.

There are two primary steps to using the cpimport utility:

  1. Optionally create a job file that is used to load data from a flat file into multiple tables.

  2. Run the cpimport utility to perform the data import.

Syntax

The simplest form of the cpimport command takes only the database name, the table name, and the input file.

The full syntax is like this:
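As a sketch (a representative subset of options; run cpimport -h on your version for the authoritative list):

```shell
# Simplest form: load loadFile into dbName.tblName using default options
cpimport dbName tblName [loadFile]

# Representative options (not exhaustive):
cpimport dbName tblName [loadFile]
    [-m mode]        # 1, 2, or 3 (see the modes below)
    [-j jobID]       # use an XML job file generated by colxml
    [-s delimiter]   # field delimiter, default '|'
    [-E enclosedBy]  # quote character for enclosed fields
    [-C escapeChar]  # escape character
    [-e maxErrors]   # maximum number of allowable error rows
    [-I binaryMode]  # 1 or 2, for binary fixed-width input
    [-P pmList]      # restrict the import to specific PMs
```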

cpimport modes

Mode 1: Bulk Load from a central location with single data source file

In this mode, you run cpimport from your primary node (mcs1). The source file resides on this primary node, and cpimport distributes the data across all the nodes. If no mode is specified, this is the default.

[Figure: cpimport mode 1 data flow]

Example:
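A minimal mode 1 invocation might look like this (assuming a database mytest, a table mytable, and a source file mytable.tbl on the primary node):

```shell
cpimport -m1 mytest mytable mytable.tbl
```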

Mode 2: Bulk load from central location with distributed data source files

In this mode, you run cpimport from your primary node (mcs1). The source data is in already-partitioned data files residing on the PMs. Each PM should have a source data file of the same name, containing the partition of the data for that PM.

[Figure: cpimport mode 2 data flow]

Example:
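A sketch of a mode 2 invocation, assuming each PM holds /home/mydata/mytable.tbl containing its partition of the data (the -l option naming the load file on the PMs is an assumption here; check cpimport -h on your version):

```shell
cpimport -m2 mytest mytable -l /home/mydata/mytable.tbl
```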

Mode 3: Parallel distributed bulk load

In this mode, you run cpimport on each node independently; each node imports the source file that resides on that node. Concurrent imports can be executed on every node for the same table.

[Figure: cpimport mode 3 data flow]

Example:
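A sketch of a mode 3 invocation, run on each node against that node's local file (paths and names assumed for illustration):

```shell
cpimport -m3 mytest mytable /home/mydata/mytable.tbl
```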


Bulk loading data from STDIN

Data can be loaded from STDIN into ColumnStore by simply omitting the loadFile parameter.

Example:
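For example, piping a file through STDIN (database, table, and file names assumed for illustration):

```shell
cat mytable.tbl | cpimport mytest mytable
```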

Bulk loading from AWS S3

Similarly, the AWS CLI utility can be used to read data from an S3 bucket and pipe the output into cpimport, allowing direct loading from S3. This assumes the aws CLI program has been installed and configured on the host:

Example:
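A sketch, with a hypothetical bucket and object name (-s ',' because the object is CSV):

```shell
aws s3 cp --quiet s3://my-bucket/mytable.csv - | cpimport mytest mytable -s ','
```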

For troubleshooting connectivity problems, remove the --quiet option, which suppresses client logging, including permission errors.

Bulk loading output of SELECT FROM Table(s)

Standard input can also be used to pipe the output of an arbitrary SELECT statement directly into cpimport. The SELECT statement may select from non-ColumnStore tables such as MyISAM or InnoDB. In the example below, db2.source_table is selected from, using the -N flag to remove non-data formatting. The -q flag tells the mysql client not to cache results, which avoids possible timeouts that could cause the load to fail.

Example:
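A sketch, assuming a source table db2.source_table and a target table db1.target_table; the client's batch output is tab-delimited, hence -s '\t':

```shell
mariadb -q -e "select * from source_table;" -N db2 | cpimport db1 target_table -s '\t'
```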

Bulk loading from JSON

Let's create a sample ColumnStore table:
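A hypothetical products table for this example (column names and types assumed for illustration):

```sql
CREATE TABLE products (
    name VARCHAR(40),
    price DECIMAL(6,2),
    quantity INT
) ENGINE=ColumnStore;
```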

Now let's create a sample products.json file like this:
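A hypothetical products.json, one JSON object per line, with fields matching the sample table's columns:

```json
{"name": "widget", "price": 2.50, "quantity": 100}
{"name": "gadget", "price": 9.99, "quantity": 30}
```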

We can then bulk load data from JSON into ColumnStore by piping the data first to jq and then to cpimport, using a one-line command.

Example:
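A sketch of the pipeline, assuming the sample table has name, price, and quantity columns and products.json holds one object per line; jq's @csv filter quotes string values, hence -E '"' on the cpimport side:

```shell
cat products.json | jq -r '[.name, .price, .quantity] | @csv' | cpimport -s ',' -E '"' mydb products
```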

In this example, the JSON data comes from a static JSON file, but the same method works for output streamed from any data source that produces JSON, such as an API or a NoSQL database. For more information on jq, please consult its manual.

Bulk loading into multiple tables

There are two ways multiple tables can be loaded:

  1. Run multiple cpimport jobs simultaneously. The tables per import should be unique, or, if using mode 3, the PMs for each import should be unique.

  2. Use the colxml utility: colxml creates an XML job file for your database schema before you import data. Multiple tables may be imported by either importing all tables within a schema or listing specific tables using the -t option in colxml. Then run cpimport, which uses the job file generated by colxml. Here is an example of how to use colxml and cpimport to import data into all the tables in a database schema.

colxml syntax
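A hedged sketch of the colxml invocation (a representative subset of options; run colxml with no arguments on your version for the full list):

```shell
colxml [options] dbName

# Commonly used options:
#   -d  field delimiter (default '|')
#   -e  maximum number of allowable error rows
#   -j  job ID; the generated file is named Job_<jobID>.xml
#   -p  path for the generated XML job description file
#   -t  table name; repeat to restrict the job to specific tables
```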

Example usage of colxml

The following tables comprise a database named 'tpch2':

  1. First, put a delimited input data file for each table in /usr/local/mariadb/columnstore/data/bulk/data/import. Each file should be named <tblname>.tbl.

  2. Run colxml for the load job for the ‘tpch2’ database as shown here:
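A sketch, assuming job ID 500 (the generated job file would land in the bulk job folder as Job_500.xml):

```shell
colxml tpch2 -j500
```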

Now run cpimport to use the job file generated by the colxml execution:
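A sketch, again assuming job ID 500:

```shell
cpimport -j 500
```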

Handling Differences in Column Order and Values

If there are differences between the input file and the table definition, the colxml utility can be used to handle these cases:

  • Different order of columns in the input file from table order

  • Input file column values to be skipped / ignored.

  • Target table columns to be defaulted.

In this case, run the colxml utility (the -t argument can be useful for producing a job file for a single table if preferred) to produce the job XML file, use it as a template for editing, and then use the edited job file when running cpimport.

Consider the following simple table example:
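A hypothetical emp table of the kind this section discusses (names and types assumed for illustration):

```sql
CREATE TABLE emp (
    emp_id INT,
    dept_id INT,
    name VARCHAR(30),
    salary DECIMAL(8,2),
    hire_date DATE
) ENGINE=ColumnStore;
```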

This would produce a colxml file with the following table element:
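A sketch of the generated table element, assuming an emp table with emp_id, dept_id, name, salary, and hire_date columns (attribute values are illustrative):

```xml
<Table tblName="mydb.emp"
       loadName="emp.tbl" maxErrRow="10">
  <Column colName="emp_id"/>
  <Column colName="dept_id"/>
  <Column colName="name"/>
  <Column colName="salary"/>
  <Column colName="hire_date"/>
</Table>
```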

If your input file has the data such that hire_date comes before salary, then the following modification allows correct loading of that data into the original table definition (note that the last two Column elements are swapped):
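A sketch, assuming a hypothetical emp table whose input file lists hire_date before salary (attribute values illustrative):

```xml
<Table tblName="mydb.emp"
       loadName="emp.tbl" maxErrRow="10">
  <Column colName="emp_id"/>
  <Column colName="dept_id"/>
  <Column colName="name"/>
  <Column colName="hire_date"/>
  <Column colName="salary"/>
</Table>
```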

The following example would ignore the last entry in the file and default salary to its default value (in this case NULL):
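A sketch for that case, again assuming the hypothetical emp table (attribute values illustrative):

```xml
<Table tblName="mydb.emp"
       loadName="emp.tbl" maxErrRow="10">
  <Column colName="emp_id"/>
  <Column colName="dept_id"/>
  <Column colName="name"/>
  <DefaultColumn colName="salary"/>
  <Column colName="hire_date"/>
  <IgnoreField/>
</Table>
```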

  • IgnoreFields instructs cpimport to ignore and skip the particular value at that position in the file.

  • DefaultColumn instructs cpimport to default the current table column and not move the column pointer forward to the next delimiter.

Both instructions can be used independently and as many times as makes sense for your data and table definition.

Binary Source Import

It is possible to import from a binary file instead of a CSV file, using fixed-length rows of binary data. This is done with the '-I' flag, which has two modes:

  • -I1 - binary mode with NULLs accepted. Numeric fields containing NULL will be treated as NULL unless the column has a default value.

  • -I2 - binary mode with NULLs saturated. NULLs in numeric fields will be saturated.

The following table shows how to represent the data in the binary format:

Datatype                     Description
---------------------------  ------------------------------------------------
INT/TINYINT/SMALLINT/BIGINT  Little-endian format for the numeric data
FLOAT/DOUBLE                 IEEE format native to the computer
CHAR/VARCHAR                 Data padded with '\0' for the length of the
                             field. An entry that is all '\0' is treated
                             as NULL.
DATE                         Using the Date struct below
DATETIME                     Using the DateTime struct below
DECIMAL                      Stored using an integer representation of the
                             DECIMAL without the decimal point. With a
                             precision/width of 2 or less, 2 bytes should be
                             used; 3-4 should use 3 bytes; 4-9 should use
                             4 bytes; and 10+ should use 8 bytes.

For NULL values the following table should be used:

Datatype      Signed NULL            Unsigned NULL
------------  ---------------------  ---------------------
BIGINT        0x8000000000000000ULL  0xFFFFFFFFFFFFFFFEULL
INT           0x80000000             0xFFFFFFFE
SMALLINT      0x8000                 0xFFFE
TINYINT       0x80                   0xFE
DECIMAL       As equiv. INT          As equiv. INT
FLOAT         0xFFAAAAAA             N/A
DOUBLE        0xFFFAAAAAAAAAAAAAULL  N/A
DATE          0xFFFFFFFE             N/A
DATETIME      0xFFFFFFFFFFFFFFFEULL  N/A
CHAR/VARCHAR  Fill with '\0'         N/A

Date Struct

The spare bits in the Date struct must be set to 0x3E.
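A C sketch of the layout, assuming the conventional bit-field representation of ColumnStore's binary DATE format (32 bits total):

```c
/* DATE bit-field layout for cpimport binary mode (32 bits total). */
struct Date {
    unsigned spare : 6;   /* must be set to 0x3E */
    unsigned day   : 6;   /* day of month, 1-31 */
    unsigned month : 4;   /* 1-12 */
    unsigned year  : 16;  /* calendar year, e.g. 2024 */
};
```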

DateTime Struct
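A C sketch of the layout, assuming the conventional bit-field representation of ColumnStore's binary DATETIME format (64 bits total):

```c
/* DATETIME bit-field layout for cpimport binary mode (64 bits total). */
struct DateTime {
    unsigned msecond : 20;  /* sub-second component */
    unsigned second  : 6;
    unsigned minute  : 6;
    unsigned hour    : 6;
    unsigned day     : 6;   /* day of month, 1-31 */
    unsigned month   : 4;   /* 1-12 */
    unsigned year    : 16;  /* calendar year */
};
```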

Working Folders & Logging

As of version 1.4, cpimport uses the /var/lib/columnstore/bulk folder for all work being done. This folder contains:

  1. Logs

  2. Rollback info

  3. Job info

  4. A staging folder

The log folder typically contains:

A typical log might look like this:

Prior to version 1.4, this folder was located at /usr/local/mariadb/columnstore/bulk.

This page is: Copyright © 2025 MariaDB. All rights reserved.
