Bulk Data Loading
Overview
cpimport is a high-speed bulk load utility that imports data into ColumnStore tables efficiently. It accepts as input any flat file in which a delimiter separates the fields of data (i.e. the columns of a table). The default delimiter is the pipe (‘|’) character, but other delimiters such as commas may be used as well (see the example following the list below). cpimport performs the following operations when importing data into a MariaDB ColumnStore database:
- Data is read from specified flat files
- Data is transformed to fit ColumnStore’s column-oriented storage design
- Redundant data is tokenized and logically compressed
- Data is written to disk
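For example, given the default pipe delimiter, a minimal load of a three-column table might look like the following. The `mytest` schema, `mytable` table, and file contents here are illustrative:

```
$ cat mytable.tbl
1|Alice|2023-01-15
2|Bob|2023-02-20

$ cpimport mytest mytable mytable.tbl
```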
It is important to note that:
- Bulk loads are append operations to a table, so existing data can still be read and remains unaffected while the load is in progress.
- Bulk loads do not write their data operations to the transaction log; they are not transactional in nature, but are considered an atomic operation at this time. Information markers, however, are placed in the transaction log so the DBA is aware that a bulk operation occurred.
- Upon completion of the load, a high water mark in each column file is moved in an atomic operation that allows any subsequent queries to read the newly loaded data. This append operation provides consistent reads without incurring the overhead of logging the data.
There are two primary steps to using the cpimport utility:
- Optionally create a job file that is used to load data from a flat file into multiple tables (see the sketch after this list)
- Run the cpimport utility to perform the data import
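A minimal sketch of that two-step workflow, assuming the colxml utility bundled with ColumnStore is used to generate the job file (the schema name and job ID here are illustrative):

```
# Generate Job_500.xml describing the tables of the mytest schema
colxml mytest -j 500

# Run the import using the generated job file
cpimport -m1 -j 500
```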
Syntax
The simplest form of the cpimport command is:

```
cpimport dbName tblName [loadFile]
```

The full syntax is:

```
cpimport dbName tblName [loadFile] [-h] [-m mode] [-f filepath] [-d DebugLevel]
         [-c readBufferSize] [-b numBuffers] [-r numReaders] [-e maxErrors]
         [-B libBufferSize] [-s colDelimiter] [-E EnclosedByChar] [-C escChar]
         [-j jobID] [-p jobFilePath] [-w numParsers] [-n nullOption]
         [-P pmList] [-i] [-S] [-q batchQty]
```
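For instance, to load a comma-separated file whose text fields are enclosed in double quotes, stopping after 50 rejected rows, an invocation might look like the following (schema, table, and file names are illustrative):

```
cpimport -s ',' -E '"' -e 50 mytest mytable mytable.csv
```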
cpimport Modes
Mode 1: Bulk Load from a Central Location with a Single Data Source File
In this mode, you run cpimport from a central location. The source file resides at that central location, and cpimport distributes its data across all the PM nodes.
Example:

```
cpimport -m1 mytest mytable mytable.tbl
```
Mode 2: Bulk Load from a Central Location with Distributed Data Source Files
In this mode, you run cpimport from a central location (either a UM or a PM). The source data is already partitioned into data files residing on the PMs. Each PM should have a source data file of the same name, containing only that PM's portion of the data.
Example:

```
cpimport -m2 mytest mytable /home/mydata/mytable.tbl
```
Mode 3: Parallel Distributed Bulk Load
In this mode, you run cpimport on each PM node independently; each instance imports the source file that exists on that PM. Concurrent imports can be executed on every PM for the same table.
Example:

```
cpimport -m3 mytest mytable /home/mydata/mytable.tbl
```
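Because mode 3 imports run independently on each PM, they are often launched in parallel from a central host. A minimal sketch, assuming passwordless SSH access to PM hosts named pm1 through pm3 and the partitioned data file already present on each node:

```
# Start a mode 3 import on every PM concurrently, then wait for all to finish
for pm in pm1 pm2 pm3; do
  ssh "$pm" "cpimport -m3 mytest mytable /home/mydata/mytable.tbl" &
done
wait
```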