ColumnStore remote bulk data import: mcsimport
Overview
mcsimport is a high-speed bulk load utility that imports data into ColumnStore tables in a fast and efficient manner utilizing ColumnStore's Bulk Write SDK. Unlike cpimport, mcsimport was designed to be executed from a remote machine that doesn't necessarily needs to be a UM or PM. mcsimport is further executable from Windows and Linux operating systems.
Similar to cpimport, mcsimport accepts as input any flat file that contains a delimiter between fields of data (i.e. columns in a table). The default delimiter is a comma (‘,’), but other delimiters such as pipes may also be used. By default mcsimport expects the data values to be in the same order as the create table statement, and a date format of ‘YYYY-MM-DD HH:MM:SS’. But, these settings can be overwritten in a mapping file which allows customizeable input column to ColumnStore column mappings, the usage of individual input column specific date formats utilizing the strptime format, and the specification of default values for non mapped target columns.
It is important to note that:
- The bulk loads are an append operation to a table so they allow existing data to be read and remain unaffected during the process.
- The bulk loads do not write their data operations to the transaction log; they are not transactional in nature but are considered an atomic operation at this time. Information markers, however, are placed in the transaction log so the DBA is aware that a bulk operation did occur.
- Upon completion of the load operation, a high water mark in each column file is moved in an atomic operation that allows for any subsequent queries to read the newly loaded data. This append operation provides for consistent read but does not incur the overhead of logging the data.
There are three primary steps to using the mcsimport utility:
- Create the Columnstore.xml configuration file that holds the information of the ColumnStore instance to connect to.
- Optionally create a mapping file that defines the mapping between input file and target ColumnStore table.
- Run the mcsimport utility to perform the data import.
Syntax
mcsimport database table input_file [-m mapping_file] [-c Columnstore.xml] [-d delimiter] [-df date_format] [-default_non_mapped]
-m mapping_file
The mapping file is used to define the mapping between source csv columns and target ColumnStore columns, to define column specific input date formats, and to set default values for ignored target columns. It follows the Yaml 1.2 standard and can address the source csv columns implicit and explicit.
Source csv columns can only be identified by their position in the csv file starting with 0, and target ColumnStore columns can be identified either by their position or name.
Following snippet is an example for an implicit mapping file.
- column: target: 0 - column: - ignore - column: target: id - column: target: occurred format: "%d %b %Y %H:%M:%S" - target: 2 value: default - target: salary value: 20000
It defines that the first csv column (#0) is mapped to the first column in the ColumnStore table, that the second csv column (#1) is ignored and won't be injected into the target table, that the third csv column (#2) is mapped to the ColumnStore column with the name id, and that the fourth csv column (#3) is mapped to the ColumnStore column with the name occurred and uses a specific date format. (defined using the strptime format) The mapping file further defines that for the third ColumnStore target column (#2) its default value will be used, and that the ColumnStore target column with the name salary will be set to 20000 for all injections.
Explicit mapping is also possible.
- column: 0 target: id - column: 4 target: salary - target: timestamp value: 2018-09-13 12:00:00
Using this variant the first (#0) csv source column is mapped to the target ColumnStore column with the name id, and the fifth source csv column (#4) is mapped to the target ColumnStore column with the name salary. It further defines that the target ColumnStore column timestamp uses a default value of 2018-09-13 12:00:00 for the injection.
-c Columnstore.xml
As mcsimport is built upon ColumnStore's Bulk Write SDK it inherits its methods to connect to ColumnStore instances to ingest data. By default mcsimport uses the standard configuration file /usr/local/mariadb/ColumnStore/etc/Columnstore.xml or if set the one defined through the environment variable COLUMNSTORE_INSTALL_DIR to connect to the remote Columnstore instance. Individual configurations can be defined through the command line parameter -c. Instructions on how to prepare Columnstore.xml for remote ingestion can be found here.
-d delimiter
The default delimiter of the CSV input file is a comma (‘,’) and can be changed through the command line parameter -d. Only one character delimiters are currently supported.
-df date_format
By default mcsimport uses YYYY-MM-DD HH:MM:SS as input date format. An individual global date format can be specified via the command line parameter -df using the strptime format. Column specific input date formats can be defined in the mapping file and overwrite the global date format.
-default_non_mapped
mcsimport needs to inject values for all ColumnStore columns of the target table. In order to use the ColumnStore column's default values for all non mapped target columns the global parameter default_non_mapped can be used. Target column specific default values in the mapping file overwrite the global default values of this parameter.