cpimport for MariaDB Enterprise ColumnStore 1.4

Overview

MariaDB Enterprise ColumnStore 1.4 includes the cpimport utility, which efficiently bulk loads data into ColumnStore tables.

If the LOAD DATA [ LOCAL ] INFILE or INSERT INTO .. SELECT FROM .. statements are used with a ColumnStore table, MariaDB Enterprise ColumnStore 1.4 transforms the operation into a bulk load that uses cpimport by default.

Reference material is available for MariaDB Enterprise ColumnStore 1.4.

MariaDB Enterprise ColumnStore 1.4 is included with MariaDB Enterprise Server 10.4.

USAGE

The syntax for cpimport is:

$ cpimport [OPTION ...] DATABASE_NAME TABLE_NAME [INPUT_FILE]

DETAILS

MariaDB Enterprise ColumnStore 1.4 includes the cpimport tool for bulk data loads:

  • Bypasses the SQL layer to decrease overhead

  • Does not block read queries

  • Requires a write metadata lock on the table, which can be monitored with the METADATA_LOCK_INFO plugin

  • Appends the new data to the table. While the bulk load is in progress, the newly appended data is temporarily hidden from queries. After the bulk load is complete, the newly appended data is visible to queries.

  • Inserts each row in the order the rows are read from the source file. Users can optimize data loads for Enterprise ColumnStore's automatic partitioning by loading presorted data files.

  • Supports parallel distributed bulk loads

  • Imports data from text files

  • Imports data from binary files

  • Imports data from standard input (stdin)

For additional information, see "Data Loading with cpimport".

cpimport Options for MariaDB Enterprise ColumnStore 1.4

Option

Description

-B

I/O library read buffer size (in bytes)

-b

Number of read buffers

-C

Escape character used in conjunction with 'enclosed by'

-c

Application read buffer size(in bytes)

-d

Print different level(1-3) debug message

-E

Enclosed by character if field values are enclosed.

-e

Max number of allowable error per table per PM

-f

Data file directory path.

-g

S3 Region (for S3 imports)

-H

S3 Hostname (for S3 imports, Amazon's S3 default)

-h

Print this message.

-I

Import binary data; how to treat NULL values:

-i

Print extended info to console in Mode 3.

-j

Job ID. In simple usage, default is the table OID.

-K

S3 Authentication Secret (for S3 imports)

-l

Name of import file to be loaded, relative to -f path,

-m

mode

-n

NullOption (0-treat the string NULL as data (default);

-P

List of PMs ex: -P 1,2,3. Default is all PMs.

-p

Path for XML job description file.

-q

Batch Quantity, Number of rows distributed per batch in Mode 1

-r

Number of readers.

-S

Treat string truncations as errors.

-s

'c' is the delimiter between column values.

-T

Timezone used for TIMESTAMP datatype.

-t

S3 Bucket (for S3 imports)

-w

Number of parsers.

-y

S3 Authentication Key (for S3 imports)

To see options supported in other versions, see "cpimport Options by MariaDB Enterprise ColumnStore Version".

Batch Insert Mode

MariaDB Enterprise ColumnStore 1.4 enables batch insert mode by default.

When batch insert mode is enabled, MariaDB Enterprise ColumnStore, MariaDB Enterprise ColumnStore has special handling for the following statements:

Enterprise ColumnStore uses the following rules:

  • If the statement is executed outside of a transaction, Enterprise ColumnStore loads the data using cpimport. It executes cpimport using a wrapper called cpimport.bin.

  • If the statement is executed inside of a transaction, Enterprise ColumnStore loads the data using the DML interface, which is slower.

Batch insert mode can be disabled by setting the columnstore_use_import_for_batchinsert system variable. When batch insert mode is disabled, Enterprise ColumnStore executes the statements using the DML interface, which is slower.

Appends Data

When MariaDB Enterprise ColumnStore 1.4 performs a bulk data load, it appends data to the table in the order in which the data is read. Appending data reduces the I/O requirements of bulk data loads, so that larger data sets can be loaded very efficiently.

While the bulk load is in progress, the newly appended data is temporarily hidden from queries.

After the bulk load is complete, the newly appended data is visible to queries.

Sort the Input File

When MariaDB Enterprise ColumnStore 1.4 performs a bulk data load, it appends data to the table in the order in which the data is read.

The order of data can have a significant effect on performance with Enterprise ColumnStore, so it can be helpful to sort the data in the input file prior to importing it.

For additional information, see "Load Ordered Data in Proper Order".

Locking

MariaDB Enterprise ColumnStore 1.4 requires a write metadata lock (MDL) on the table when a bulk data load is performed with cpimport.

When a bulk data load is running:

  • Read queries will not be blocked.

  • Write queries and concurrent bulk data loads on the same table will be blocked until the bulk data load operation is complete, and the write metadata lock on the table has been released.

  • The write metadata lock (MDL) can be monitored with the METADATA_LOCK_INFO plugin.

Logging

In MariaDB Enterprise ColumnStore 1.4, the cpimport tool writes bulk load logs to /usr/local/mariadb/columnstore/bulk/.

EXAMPLES

Import from Text Files

The cpimport tool can import data from a text file if a file is provided as an argument after the database and table name.

For example, to import the file inventory-products.txt into the products table in the inventory database:

$ sudo cpimport \
   inventory products \
   inventory-products.txt

Import from Remote MariaDB Server

The cpimport tool can import data from a remote MariaDB Server. You can use MariaDB Client to query the table using the SELECT statement, and then pipe the results into the standard input of the cpimport tool:

$ mariadb --quick \
   --skip-column-names \
   --execute="SELECT * FROM inventory.products" \
   | cpimport -s '\t' inventory products

Import from an S3 Bucket

The cpimport tool can import data from a file stored in a remote S3 bucket. You can use the AWS CLI to copy the file from S3, and then pipe the contents into the standard input of the cpimport tool:

$ aws s3 cp --quiet s3://columnstore-test/inventory-products.csv - \
   | cpimport -s ',' inventory products

Set the Field Delimiter

The default field delimiter for the cpimport tool is a pipe (|).

If your data file uses a different field delimiter, you can specify the field delimiter with the -s option.

For a TSV (tab-separated values) file:

$ sudo cpimport -s '\t' \
   inventory products \
   inventory-products.tsv

For a CSV (comma-separated values) file:

$ sudo cpimport -s ',' \
   inventory products \
   inventory-products.csv

Set the Quoting Style

By default, the cpimport tool does not expect fields to be quoted.

If your data file uses quotes around fields, you can specify the quote character with the -E option.

To load a TSV (tab-separated values) file that uses double quotes:

$ sudo cpimport -s '\t' -E '"' \
   inventory products \
   inventory-products.tsv

To load a CSV (comma-separated values) file that uses optional single quotes:

$ sudo cpimport -s ',' -E "'" \
   inventory products \
   inventory-products.csv