How MariaDB ColumnStore Handles Big Data Workloads – Data Loading and Manipulation

MariaDB ColumnStore is a massively parallel, scale-out columnar database. Data loading and modification behave somewhat differently than they do in a row-based storage engine. This article outlines the options available and how each affects performance.

Data Loading and Manipulation Options

MariaDB ColumnStore provides several options for writing data:

  1. DML operations: INSERT, UPDATE, DELETE
  2. Bulk DML: INSERT INTO … SELECT
  3. MariaDB Server bulk file load: LOAD DATA INFILE
  4. ColumnStore bulk data load: cpimport
  5. ColumnStore bulk delete: partition drop


DML Operations

ColumnStore supports transactionally consistent INSERT, UPDATE, and DELETE statements using standard syntax. Individual statements will be significantly slower than you would expect from a row-based engine such as InnoDB. This is because the system is optimized for block writes and because a change to a single row must touch multiple underlying column files. Updates are generally the fastest of these operations, since they are performed in place and touch only the files for the columns being updated.
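As a minimal illustration, the statements below use standard syntax; the orders table and its columns are hypothetical names for this example only:

    -- Standard DML works as on any other engine, but each statement
    -- touches one underlying file per affected column, so prefer the
    -- bulk options below for large data volumes.
    INSERT INTO orders (order_id, region, order_date, amount, status)
        VALUES (1001, 'EMEA', '2017-03-01', 49.99, 'NEW');

    -- Only the files for the status column are rewritten in place.
    UPDATE orders SET status = 'SHIPPED' WHERE order_id = 1001;

    DELETE FROM orders WHERE order_date < '2015-01-01';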


Bulk DML

INSERT INTO … SELECT statements where the destination table is a ColumnStore table are, by default, internally converted to use the cpimport utility in mode 1, which offers much greater performance than raw DML operations. This can be a useful capability for migrating data from a non-ColumnStore table. For further details please refer to the following knowledge base article: https://mariadb.com/kb/en/mariadb/columnstore-batch-insert-mode/
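A sketch of such a migration, assuming a hypothetical orders_innodb source table (per the batch insert mode article above, the cpimport optimization applies with autocommit enabled):

    -- Destination table with matching columns, using the ColumnStore engine.
    CREATE TABLE orders_cs (
        order_id   INT,
        region     VARCHAR(8),
        order_date DATE,
        amount     DECIMAL(10,2)
    ) ENGINE=ColumnStore;

    -- With autocommit on, this is internally executed via cpimport mode 1
    -- rather than row-by-row DML.
    INSERT INTO orders_cs
        SELECT order_id, region, order_date, amount FROM orders_innodb;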


LOAD DATA INFILE

The LOAD DATA INFILE statement can also be used and is similarly optimized by default to use cpimport mode 1 rather than DML operations, providing better performance. However, still greater performance (approximately 2x) and flexibility are available by using cpimport directly; LOAD DATA INFILE is mainly useful for compatibility purposes. For further details please refer to the following knowledge base article: https://mariadb.com/kb/en/mariadb/columnstore-load-data-infile/
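A minimal example, assuming a tab-delimited file at a hypothetical path:

    -- Loaded through the ColumnStore optimization (cpimport mode 1) by default.
    LOAD DATA INFILE '/data/orders.tsv'
        INTO TABLE orders
        FIELDS TERMINATED BY '\t';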


cpimport

The cpimport utility is the fastest and most flexible data loading utility for ColumnStore. It works directly with the PM WriteEngine processes, eliminating much of the overhead of the prior options. cpimport is designed to work with either delimited text files or delimited data provided via stdin. The latter enables simple integration capabilities such as streaming a query result from another database directly into cpimport. Multiple tables can be loaded in parallel, and a separate utility, colxml, is provided to help automate this. For further details please refer to the following knowledge base article: https://mariadb.com/kb/en/mariadb/columnstore-bulk-data-loading/
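For example (the schema, table, host, and file names are assumptions for illustration):

    # Load a pipe-delimited file into the orders table of the mydb schema.
    cpimport -s '|' mydb orders /data/orders.tbl

    # Stream a query result from another MariaDB server straight into
    # cpimport via stdin (mysql -B -N emits tab-delimited rows, no headers).
    mysql -h source_host -B -N -e "SELECT * FROM mydb.orders" | cpimport -s '\t' mydb orders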


The utility can operate in different modes as designated by the -m flag (default 1):


Mode 1 – Centralized Trigger, Distributed Loading

The data to be loaded is provided as one input file on a single server. The data is then divided and distributed evenly to each of the PM nodes for loading, referencing the extent map to aim for even data distribution. In addition, the -P argument can be used to send the data to specific PM nodes, which allows for the finer-grained control of modes 2 and 3 while preserving centralized loading.

Mode 2 – Centralized Trigger, Local Loading

In this mode, the data must be pre-divided and pushed to each PM server. The load is started from a central location, which triggers a local cpimport on each PM server.

Mode 3 – Local Trigger, Local Loading

This mode allows data to be loaded individually per PM node, across some or all of the nodes. The load is triggered from the PM server itself and runs cpimport locally on that PM node only.

Modes 2 and 3 allow more direct control of where data is loaded, and in what order within the corresponding extents; however, care must be taken to maintain an even distribution across nodes. This direct control does allow for explicit partitioning by PM node: for example, with 3 nodes you could have one node holding only Americas data, one EMEA, and one APAC. Local query can be enabled to allow querying a PM directly, limited to that region's data, while still allowing all data to be queried from the UM level. Illustrative invocations of each mode follow below.
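In these sketches the schema, table, and file names are hypothetical, and the exact file placement expected for modes 2 and 3 should be checked against the bulk loading article above:

    # Mode 1 (default): one central input file, split and distributed to all PMs.
    cpimport -m 1 mydb orders /data/orders.tbl

    # Mode 2: triggered centrally; each PM loads its own pre-divided file.
    cpimport -m 2 mydb orders orders_local.tbl

    # Mode 3: run on a PM node itself; loads that node only.
    cpimport -m 3 mydb orders /data/orders_apac.tbl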


Partition Drop

ColumnStore provides a mechanism to support bulk delete by extents. An extent can be dropped by partition id or by a value range corresponding to the minimum and maximum values of the extents to be dropped. This provides an effective and fast purging mechanism: if the data has an increasing date-based column, then the minimum and maximum values of that column's extents form a (potentially overlapping) range-based partitioning scheme, and data can be dropped by specifying the range of values to be removed. Removing old data by partition range in this way can form a very effective information lifecycle management strategy. For further details please refer to the following knowledge base article: https://mariadb.com/kb/en/mariadb/columnstore-partition-management/
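A sketch using the partition management functions documented in the article above (table and column names are hypothetical):

    -- Show the partitions (extents) for the order_date column.
    SELECT calShowPartitions('orders', 'order_date');

    -- Drop the extents whose order_date values fall within 2015,
    -- bulk-deleting a year of history without row-by-row DELETEs.
    SELECT calDropPartitionsByValue('orders', 'order_date', '2015-01-01', '2015-12-31');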


Transactional Consistency

MariaDB ColumnStore provides read-committed transaction isolation. Changes to the data, whether performed through DML or bulk import, are applied such that reads are never blocked and other transactions maintain a consistent view of the prior data until the change is successfully committed.

The cpimport utility interacts with a high water mark value for each column. Queries read only below the high water mark, while cpimport inserts new rows above it. When the load completes, the high water mark is updated atomically.

SQL DML operations use a block-based MVCC architecture to provide transactional consistency. Other transactions read blocks at a particular version while the uncommitted version is maintained in a version buffer.