Bulk Data Loading


Overview

cpimport is a high-speed bulk load utility that imports data into ColumnStore tables in a fast and efficient manner. It accepts as input any flat file in which a delimiter separates the fields of data (i.e. the columns in a table). The default delimiter is the pipe (‘|’) character, but other delimiters such as commas may be used as well. cpimport performs the following operations when importing data into a MariaDB ColumnStore database:

  • Data is read from specified flat files
  • Data is transformed to fit ColumnStore’s column-oriented storage design
  • Redundant data is tokenized and logically compressed
  • Data is written to disk

The two most common ways to use cpimport are: 1) from the UM, where cpimport distributes rows to all Performance Modules (PMs); and 2) from a PM, where cpimport loads the imported rows only on the PM from which it was invoked.

There are two primary steps to using the cpimport utility:

  1. Optionally create a job file that is used to load data from a flat file into multiple tables
  2. Run the cpimport utility to perform the data import
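As a minimal sketch of the second step in its simple (single-table) form, the following prepares a pipe-delimited flat file and shows the corresponding invocation. The database name `tpch`, table name `orders`, and sample rows are hypothetical:

```shell
# Create a pipe-delimited flat file whose fields match the table's column order.
printf '1|2016-01-15|100.00\n2|2016-01-16|250.50\n' > orders.tbl

# Load it with the simple cpimport usage (database and table names are examples):
# cpimport tpch orders orders.tbl
```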

Note:

  • The bulk loads are an append operation to a table so they allow existing data to be read and remain unaffected during the process.
  • The bulk loads do not write their data operations to the transaction log; they are not transactional in nature but are considered an atomic operation at this time. Information markers, however, are placed in the transaction log so the DBA is aware that a bulk operation did occur.
  • Upon completion of the load operation, a high water mark in each column file is moved in an atomic operation that allows for any subsequent queries to read the newly loaded data. This append operation provides for consistent read but does not incur the overhead of logging the data.

Single Table Data Loading

cpimport dbName tblName [loadFile]
[-h] [-m mode] [-f filepath] [-d DebugLevel]
[-c readBufferSize] [-b numBuffers] [-r numReaders]
[-e maxErrors] [-B libBufferSize] [-s colDelimiter] [-E EnclosedByChar]
[-C escChar] [-j jobID] [-p jobFilePath] [-w numParsers]
[-n nullOption] [-P pmList] [-i] [-S] [-q batchQty]

Where:
-d debug level
Set the debug level of the cpimport process. It ranges from 1 to 3. 
The default value is 0, which means no debugging messages.
-f import path
Alternate import file(s) directory path. If an alternate path of “STDIN” (must be all caps) is specified, then cpimport will read the input table data from STDIN, and in this case only one table can be imported by the cpimport job.
This option is not applicable with the simple cpimport usage if the loadFile parameter is omitted. Omitting the loadFile name implies that the input is coming from STDIN, which conflicts with the use of the -f argument.
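As a sketch of the STDIN path (the database name `tpch` and table name `items` are hypothetical), data can be piped straight into cpimport, which is useful when the source is another command:

```shell
# Generate sample rows, then pipe them to cpimport via STDIN.
printf '1|apple\n2|banana\n' > items.tbl

# Only one table can be loaded per job when reading from STDIN:
# cat items.tbl | cpimport -f STDIN tpch items
```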
-h
Print help.
-i
Print extended information to the console for Mode 3; otherwise extended information will only go to the log file.
-j jobid
The jobid is mandatory when executing the traditional cpimport usage. The jobid is used to look up the job description file (Job_jobid.xml) in /BulkRoot/bulk/job. Job_jobid.xml is created by running colxml prior to the bulk load, which is required for the traditional cpimport usage.
The jobid is also used to identify the job in any system logging. If no jobid parameter is provided when calling the simple cpimport usage, a default will be provided. The maximum value is 2147483647.
-l loadFile
Alternate names for the import file(s) being loaded. The default name is tablename.tbl. The first -l specifies the file(s) corresponding to the first table referenced in the job description file; the second -l specifies the file(s) for the second table, and so on.
A -l argument can specify a list of files (separated by comma, space, or vertical bar) to be imported consecutively into the relevant database table.
The -l arguments may contain a fully qualified pathname (in which case it will override the -f entry) or a path relative to the -f pathname entry.
When this option is provided to cpimport, it will override any previous setting in the job description file. This option is only applicable when calling the traditional cpimport usage.
-n NULL option
Controls the handling of any data values having a string value of NULL. Valid NULL option values are 0 and 1.
* A NULL option of 0 indicates that any data values read in as the string NULL will be treated as non-null values having the string value of NULL. This is the default mode.
* A NULL option of 1 indicates that any data values read in as the string NULL will be interpreted and stored in the database as null values.
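The difference can be seen with a file containing the literal string NULL; the database name `tpch` and table name `parts` below are hypothetical:

```shell
# Row 1 has the literal string NULL in its second field.
printf '1|NULL\n2|widget\n' > parts.tbl

# With -n1 the string NULL is stored as a SQL NULL; with the default -n0
# it is stored as the four-character string "NULL":
# cpimport -n1 tpch parts parts.tbl
```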
-p jobxml path
Overrides the default path for the job description file. The default is /BulkRoot/bulk/job. If the specified path begins with a ‘/’, then it is treated as an absolute path, else it is treated as a relative path that is relative to the current working directory. This option is only applicable when calling the traditional cpimport usage.
-r number of read threads
Number of read threads to spawn that will read data from input data files. This parameter can be used to tune the bulk load throughput. The default value is 1.
-s column delimiter
Specifies the character used in delimiting the fields or columns of the input data file. The default is ‘|’. The same option is available when running colxml to construct the job description file. When this option is provided to cpimport, it will override any previous setting in the job description file.
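For example, to load a comma-delimited file (the database name `tpch` and table name `colors` are hypothetical):

```shell
# A comma-delimited source file:
printf '1,red\n2,blue\n' > colors.csv

# Override the default '|' delimiter with -s:
# cpimport -s ',' tpch colors colors.csv
```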
-S
By default, string truncations are treated as warnings. This option causes string truncations to be treated as errors, subject to the maximum errors allowed before terminating the job. They will be reported like other errors and will be found in the .err and .bad files.
-w number of parse threads
Number of parse threads to spawn. Parse threads parse data from the buffers filled up by the read threads and load it into the column files. This parameter can be used to tune the bulk load throughput. The default value is 3.
-E enclosedByCharacter
If the import file(s) contain any columns which are enclosed by a designated character (such as a double quote), then the –E option can be used to specify this character. The enclosing character will be stripped from the start and end of the column value prior to being loaded into the database. The same option is available when running colxml to construct the job description file. When this option is provided to cpimport, it will override any previous setting in the job description file.
-C escapeCharacter or NULL escape sequence
Escape character used in conjunction with ‘enclosed by’ character, or as part of NULL escape sequence ('\N'). 
For use as the escape character: If the –E option is enabled, then the –C option specifies an escape character to be used when the enclosedByCharacter is present in a column value. 
An enclosedByCharacter can be escaped by preceding it with the escapeCharacter or by repeating the enclosedByCharacter itself. 
The default escape character is the backslash (‘\’). The same option is available when running colxml to construct the job description file. When this option is provided to cpimport, it will override any previous setting in the job description file.
For use as the NULL escape sequence: The default NULL escape sequence is ‘\N’ and may be overridden using this option. For example, -CZ will override the NULL interpretation from ‘\N’ to ‘ZN’.
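A sketch of the enclosed-by and escape behavior together (the database name `tpch` and table name `notes` are hypothetical):

```shell
# The second field is enclosed in double quotes and contains an embedded
# quote escaped with the default escape character, the backslash:
printf '1|"say \\"hi\\""\n' > notes.tbl

# Tell cpimport that fields may be enclosed by double quotes:
# cpimport -E '"' tpch notes notes.tbl
```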
-b number of read buffers
cpimport reads data from input data files into read buffers. This parameter affects the number of read buffers that will be used by cpimport for each table that gets loaded. 
The recommended number of read buffers is 1 more than the number of parse threads. This parameter can be used to tune the bulk load throughput. The default is 5.
-c read buffer size
The size or capacity of the read buffer determines how much of the data from the input files will be read at one time. The default is 1048576 bytes.
-e number
The maximum number of error rows above which cpimport will abort the bulkload process. The default is 10. For a distributed Mode 1 or Mode 2 import, this represents the maximum number of error rows for each PM.
-B i/o buffer size
Sets the size of the internal library buffer passed to the setvbuf call. Currently this impacts only the file handles that are used by the read threads. This can be tuned to improve the speed of the bulk load process. Default is 10485760 bytes.
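The thread and buffer flags interact: the recommended read-buffer count is one more than the number of parse threads. A sketch of a tuned invocation follows; the thread counts and buffer sizes are illustrative, not benchmarked recommendations:

```shell
# Derive the recommended read-buffer count from the parse-thread count.
PARSE_THREADS=8
READ_BUFFERS=$((PARSE_THREADS + 1))   # recommended: parse threads + 1
echo "$READ_BUFFERS" > buffers.txt

# 4 read threads, 8 parse threads, 9 read buffers, 20 MB library buffer:
# cpimport -r4 -w"$PARSE_THREADS" -b"$READ_BUFFERS" -B 20971520 tpch orders orders.tbl
```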
-m mode
This represents the mode of cpimport to execute:
1. Distributed import (single command, central source)
2. Distributed import (single command, distributed source)
3. Independent import (partitioned source)
For more information on these modes and their defaults, see the “Distributed Imports” section above.
-P PM list
This specifies which PM (or a comma-separated list of PMs) will run the Mode 1 or Mode 2 distributed import. If not specified, the default is all PMs.
-q batch quantity
The number of rows that cpimport will batch up and distribute at a time. It is only applicable for a Mode 1 import. Default is 10000.
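The modes above can be sketched as follows. For Mode 3, each PM imports its own slice of the source data, so the file is split beforehand; the database name `tpch`, table name `orders`, and all flag values are illustrative:

```shell
# Mode 3 imports each PM's slice independently; split the source file first:
printf '%s\n' '1|a' '2|b' '3|c' '4|d' > orders.tbl
split -l 2 orders.tbl orders_part_

# On PM1:  cpimport -m3 tpch orders orders_part_aa
# On PM2:  cpimport -m3 tpch orders orders_part_ab

# Mode 1 alternative, run from the UM, targeting PMs 1 and 2
# and distributing rows in 50000-row batches:
# cpimport -m1 -P1,2 -q 50000 tpch orders orders.tbl
```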
