Quickstart Guides

MariaDB ColumnStore Quickstart Guides provide concise, Docker-friendly steps to quickly set up, configure, and explore the ColumnStore analytic engine.

Analytics

MariaDB Enterprise offers powerful solutions to break down the barriers to insight, whether you need to run ad hoc queries on massive datasets or power the most demanding AI workloads.

MariaDB ColumnStore

For fast, ad hoc analytics at scale, MariaDB ColumnStore is a powerful columnar database that can be deployed as a standalone analytics solution or integrated with MariaDB Enterprise Server to act as a powerful query accelerator. It stores data in a columnar format and can be distributed across a cluster of servers, allowing it to execute complex queries in parallel on petabytes of data.

This integration allows you to access your InnoDB data in near-real time, processing it directly in the ColumnStore engine to run fast, parallel OLAP queries straight from your transactional data. This eliminates the need to maintain a separate pipeline or use delayed batch inserts to analyze your live data.


MariaDB Exa

For the ultimate in analytical performance, the joint solution between MariaDB and Exasol connects your mission-critical transactional data to the world’s fastest analytics engine. Available on-premises or in the cloud on platforms like AWS and Microsoft Azure, this solution brings high-speed analytics to any environment.

MariaDB Exa erases the barrier between live operational data and high-speed analytics, leveraging Exasol’s massively parallel processing (MPP) and in-memory engine. It is the ideal solution for powering your most demanding analytics and AI/ML workloads with unmatched speed and efficiency.

MariaDB ColumnStore

Discover MariaDB ColumnStore, the powerful columnar storage engine for analytical workloads. Learn about its architecture, features, and how it enables high-performance data warehousing and analytics.


MariaDB ColumnStore Hardware Guide

Quickstart guide for MariaDB ColumnStore hardware requirements

Overview

MariaDB ColumnStore is designed for analytical workloads and scales linearly with hardware resources. While performance generally improves with more CPU cores, memory, and servers, understanding the minimum hardware specifications is crucial for successful deployment in both development and production environments.

MariaDB ColumnStore's performance directly benefits from additional hardware resources. More CPU cores enable greater parallel processing, increased memory allows for more data caching (reducing I/O), and more servers enable a larger distributed architecture.

Minimum Hardware Recommendations

The specifications differentiate between a basic development environment and a production-ready setup:

1. For Development Environments:

  • CPU: A minimum of 8 CPU cores.

  • Memory (RAM): A minimum of 32 GB.

  • Storage: Local disk storage is acceptable for development purposes.

2. For Production Environments:

  • CPU: A minimum of 64 CPU cores.

    • Note: This recommendation underscores the highly parallel nature of ColumnStore, which can effectively utilize a large number of cores for analytical processing.

  • Memory (RAM): A minimum of 128 GB.

    • Note: Adequate memory is critical for caching data and intermediate results, directly impacting query performance.

  • Storage: StorageManager (S3) is recommended.

    • Note: This implies leveraging cloud object storage (such as AWS S3 or compatible services) for scalable and durable data persistence in production.

Network Interconnectivity (for Multi-Server Deployments)

  • Minimum Network: For multi-server ColumnStore deployments, a minimum of a 1 Gigabit (1G) network is recommended.

    • Note: This facilitates efficient data transfer between nodes via TCP/IP for replication and query processing across the distributed architecture. For optimal performance in heavy-load scenarios, higher bandwidth (e.g., 10G or more) is highly beneficial.

Adhering to these minimum specifications will provide a baseline for ColumnStore functionality. For specific workload requirements, it's always advisable to conduct performance testing and scale hardware accordingly.

See Also

  • MariaDB ColumnStore Minimum Hardware Specification Documentation

  • MariaDB ColumnStore Overview

  • MariaDB documentation: MariaDB ColumnStore

Deployment

Installing ColumnStore

This section provides instructions for installing and configuring MariaDB ColumnStore. It covers various deployment scenarios, including single- and multi-node setups with both local and S3 storage.

ColumnStore Architecture

MariaDB ColumnStore uses a shared-nothing, distributed architecture with separate modules for SQL and storage, enabling scalable, high-performance analytics.

Managing ColumnStore

Managing MariaDB ColumnStore involves setup, configuration, and tools like mcsadmin and cpimport for efficient analytics.

Use Cases

MariaDB ColumnStore is ideal for real-time analytics and complex queries on large datasets across industries.

Security

MariaDB ColumnStore uses MariaDB Server’s security—encryption, access control, auditing, and firewall—for secure analytics.

Upgrading ColumnStore

This section covers upgrading MariaDB ColumnStore, including major release upgrades.

High Availability

MariaDB ColumnStore ensures high availability with multi-node setups and shared storage, while MaxScale adds monitoring and failover for continuous analytics.

Clients & Tools

MariaDB ColumnStore supports standard MariaDB tools, BI connectors (e.g., Tableau, Power BI), data ingestion (cpimport, Kafka), and REST APIs for admin.

Query Plans and Optimizer Trace

MariaDB ColumnStore's query plans and Optimizer Trace show how analytical queries run in parallel across its distributed, columnar architecture, aiding performance tuning.

Backup & Restore

MariaDB ColumnStore backup and restore manage distributed data using snapshots or tools like mariadb-backup, with restoration ensuring cluster sync via cpimport or file system recovery.

Query Tuning

MariaDB ColumnStore query tuning optimizes analytics using data types, joins, projection elimination, WHERE clauses, and EXPLAIN for performance insights.


    MariaDB ColumnStore Guide

    Quickstart Guide: MariaDB ColumnStore

    MariaDB ColumnStore is a specialized columnar storage engine designed for high-performance analytical processing and big data workloads. Unlike traditional row-based storage engines, ColumnStore organizes data by columns, which is highly efficient for analytical queries that often access only a subset of columns across vast datasets.

    What is MariaDB ColumnStore?

    MariaDB ColumnStore is a columnar storage engine that integrates with MariaDB Server. It employs a massively parallel distributed data architecture, making it ideal for processing petabytes of data with linear scalability. It was originally ported from InfiniDB and is released under the GPL license.

    Key Benefits

    • Exceptional Analytical Performance: Delivers superior performance for complex analytical queries (OLAP) due to its columnar nature, which minimizes disk I/O by reading only necessary columns.

    • High Data Compression: Columnar storage allows for much higher compression ratios compared to row-based storage, reducing disk space usage and improving query speed.

    • Massive Scalability: Designed to scale horizontally across multiple nodes, processing petabytes of data with ease.

    • Just-in-Time Projection: Only the required columns are processed and returned, further optimizing query execution.

    • Real-time Analytics: Capable of handling real-time analytical queries efficiently.

    Architecture Concepts (Simplified)

    MariaDB ColumnStore utilizes a distributed architecture with different components working together:

    • User Module (UM): Handles incoming SQL queries, optimizes them for columnar processing, and distributes tasks.

    • Performance Module (PM): Manages data storage, compression, and execution of query fragments on the data segments.

    • Data Files: Data is stored in column-segments across the nodes, highly compressed.

    Installation Overview

    MariaDB ColumnStore is installed as a separate package that integrates with MariaDB Server. The exact installation steps vary depending on your operating system and desired deployment type (single server or distributed cluster).

    General Steps (conceptual):

    1. Install MariaDB Server: Ensure you have a compatible MariaDB Server version installed (e.g., MariaDB 10.5.4 or later).

    2. Install ColumnStore Package: Download and install the specific MariaDB ColumnStore package for your OS. This package includes the ColumnStore storage engine and its associated tools.

      • Linux (e.g., Debian/Ubuntu): You would typically add the MariaDB repository configured for ColumnStore and then install mariadb-plugin-columnstore.

    3. Single Server vs. Distributed: For a single-server setup, you install all ColumnStore components on one machine. For a distributed setup, you install and configure components across multiple machines.

    4. Configure MariaDB: After installation, you might need to adjust your MariaDB server configuration (my.cnf or equivalent) to properly load and manage the ColumnStore engine.

    5. Initialize ColumnStore: Run a specific columnstore-setup or post-install script to initialize the ColumnStore environment.
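
    As a minimal sketch of the Debian/Ubuntu flow described above (assuming the MariaDB repository is already configured for ColumnStore; package names can vary by release):

    # Install the ColumnStore storage engine plugin on Debian/Ubuntu
    sudo apt update
    sudo apt install mariadb-plugin-columnstore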

    Basic Usage

    Once MariaDB ColumnStore is installed and configured, you can create and interact with ColumnStore tables using standard SQL.

    Creating a ColumnStore Table

    Specify ENGINE=ColumnStore when creating your table. Note that ColumnStore tables do not support primary keys in the same way as InnoDB, as their primary focus is analytical processing.

    CREATE TABLE sales_data (
        sale_id INT,
        product_name VARCHAR(255),
        category VARCHAR(100),
        sale_date DATE,
        quantity INT,
        price DECIMAL(10, 2)
    ) ENGINE=ColumnStore;

    Inserting Data

    You can insert data using standard INSERT statements. For large datasets, bulk loading utilities (for instance, LOAD DATA INFILE) are highly recommended for performance.

    INSERT INTO sales_data (sale_id, product_name, category, sale_date, quantity, price) VALUES
    (1, 'Laptop', 'Electronics', '2023-01-15', 1, 1200.00),
    (2, 'Mouse', 'Electronics', '2023-01-15', 2, 25.00),
    (3, 'Keyboard', 'Electronics', '2023-01-16', 1, 75.00);

    Querying Data

    Perform analytical queries. ColumnStore will efficiently process these, often leveraging its columnar nature and parallelism.

    -- Get total sales per category
    SELECT category, SUM(quantity * price) AS total_sales
    FROM sales_data
    WHERE sale_date BETWEEN '2023-01-01' AND '2023-01-31'
    GROUP BY category
    ORDER BY total_sales DESC;

    -- Count distinct products
    SELECT COUNT(DISTINCT product_name) FROM sales_data;

    See Also

    • MariaDB ColumnStore Overview

    • DigitalOcean: How to Install MariaDB ColumnStore on Ubuntu 20.04

    ColumnStore System Databases

    When using ColumnStore, MariaDB Server creates a series of system databases used for operational purposes.

    • calpontsys: maintains table metadata about ColumnStore tables.

    • infinidb_querystats: maintains information about query performance. For more information, see Query Analysis.

    • columnstore_info: contains the stored procedures used to retrieve information about ColumnStore usage. For more information, see the ColumnStore Information Schema tables.

    ColumnStore Query Processing

    Clients issue a query to the MariaDB Server, which has the ColumnStore storage engine installed. MariaDB Server parses the SQL, identifies the involved ColumnStore tables, and creates an initial logical query execution plan.

    Using the ColumnStore storage engine interface (ha_columnstore), MariaDB Server converts involved table references into ColumnStore internal objects. These are then handed off to the ExeMgr, which is responsible for managing and orchestrating query execution across the cluster.

    The ExeMgr analyzes the query plan and translates it into a distributed ColumnStore execution plan. It determines the necessary query steps and the execution order, including any required parallelization.

    The ExeMgr then references the extent map to identify which PrimProc instances hold the relevant data segments. It applies extent elimination to exclude any PrimProc nodes whose extents do not match the query’s filter criteria.

    The ExeMgr then dispatches commands to the selected PrimProc instances to perform data block I/O operations.

    The PrimProc components perform operations such as:

    • Predicate filtering

    • Join processing

    • Initial aggregation

    • Data retrieval from local disk or external storage (e.g., S3 or cloud object storage)

    They then return intermediate result sets to the ExeMgr.

    The ExeMgr handles:

    • Final-stage aggregation

    • Window function evaluation

    • Result-set sorting and shaping

    The completed result set is returned to the MariaDB Server, which performs any remaining SQL operations like ORDER BY, LIMIT, or computed expressions in the SELECT list.

    Finally, the MariaDB Server returns the result set to the client.

    ColumnStore Table Size Limitations

    MariaDB ColumnStore has a hard limit of 4096 columns per table.

    However, it's likely that you run into other limitations before hitting that limit, including:

    • The row size limit of tables. This varies depending on the storage engine you're using, and it indirectly limits the number of columns.

    • Size limit of .frm files. Those files hold the column description of tables. Column descriptions vary in length. Once all column descriptions combined reach a length of 64KB, the table's .frm file is full, limiting the number of columns you can have in a table.

    Given that, the maximum number of columns a ColumnStore table can effectively have is around 2000 columns.

    Node Maintenance for MariaDB Enterprise ColumnStore

    Managing ColumnStore Database Environment

    Managing MariaDB ColumnStore means deploying its architecture, scaling modules, and maintaining performance through monitoring, optimization, and backups.

    StorageManager

    The ColumnStore StorageManager manages columnar data storage and retrieval, optimizing analytical queries.




    Step 5: Bulk Import of Data

    Overview

    This page details step 5 of a 5-step procedure for deploying Single-Node Enterprise ColumnStore with Local storage.

    This step bulk imports data to Enterprise ColumnStore.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Import the Schema

    Before data can be imported into the tables, create a matching schema.

    On the primary server, create the schema:

    1. For each database that you are importing, create the database with the CREATE DATABASE statement:

    2. For each table that you are importing, create the table with the CREATE TABLE statement:

    Import the Data

    Enterprise ColumnStore supports multiple methods to import data into ColumnStore tables.

    cpimport

    MariaDB Enterprise ColumnStore includes cpimport, which is a command-line utility designed to efficiently load data in bulk. Alternative methods are available.

    To import your data from a TSV (tab-separated values) file, on the primary server run cpimport:
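
    For example (mydb, mytable, and the file path are placeholders):

    cpimport mydb mytable /path/to/data.tsv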

    LOAD DATA INFILE

    When data is loaded with the LOAD DATA INFILE statement, MariaDB Enterprise ColumnStore loads the data using cpimport, which is a command-line utility designed to efficiently load data in bulk. Alternative methods are available.

    To import your data from a TSV (tab-separated values) file, on the primary server use the LOAD DATA INFILE statement:
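
    For example (placeholder names again; tab is the default field terminator, which matches TSV):

    LOAD DATA INFILE '/path/to/data.tsv'
    INTO TABLE mydb.mytable;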

    Import from Remote Database

    MariaDB Enterprise ColumnStore can also import data directly from a remote database. A simple method is to query the table using the SELECT statement, and then pipe the results into cpimport, which is a command-line utility that is designed to efficiently load data in bulk. Alternative methods are available.

    To import your data from a remote MariaDB database:
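
    A minimal sketch, where the remote host, user, and table names are placeholders:

    # Stream rows from the remote server straight into cpimport
    mariadb --quick --skip-column-names \
       --host=remote-db.example.com --user=app_user --password \
       --execute="SELECT * FROM mydb.mytable" \
       | cpimport -s '\t' mydb mytable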

    Next Step

    Navigation in the Single-Node Enterprise ColumnStore topology with Local storage deployment procedure:

    This page was step 5 of 5.

    This procedure is complete.

    Step 6: Install MariaDB MaxScale

    Overview

    This page details step 6 of the 9-step procedure "Deploy ColumnStore Object Storage Topology".

    This step installs MariaDB MaxScale 22.08.

    ColumnStore Object Storage requires 1 or more MaxScale nodes.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Retrieve Customer Download Token

    MariaDB Corporation provides package repositories for CentOS / RHEL (YUM) and Debian / Ubuntu (APT). A download token is required to access the MariaDB Enterprise Repository.

    Customer Download Tokens are customer-specific and are available through the MariaDB Customer Portal.

    To retrieve the token for your account:

    1. Navigate to https://customers.mariadb.com/downloads/token/

    2. Log in.

    3. Copy the Customer Download Token.

    Substitute your token for CUSTOMER_DOWNLOAD_TOKEN when configuring the package repositories.

    Set Up Repository

    1. On the MaxScale node, install the prerequisites for downloading the software from the Web. Install on CentOS / RHEL (YUM):

    Install on Debian / Ubuntu (APT):

    2. On the MaxScale node, configure package repositories and specify MariaDB MaxScale 22.08:

    Checksums of the various releases of the mariadb_es_repo_setup script can be found in the section at the bottom of the page. Substitute ${checksum} in the example above with the latest checksum.
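
    A minimal sketch of the download-and-verify flow (the script URL follows MariaDB's usual download location; verify against the checksum you obtained):

    # Download the repository setup script
    curl -LO https://dlm.mariadb.com/enterprise-release-helpers/mariadb_es_repo_setup

    # Verify the script against the published checksum
    echo "${checksum} mariadb_es_repo_setup" | sha256sum --check

    # Configure the repository for MariaDB MaxScale 22.08
    sudo bash mariadb_es_repo_setup --token="CUSTOMER_DOWNLOAD_TOKEN" \
       --apply --mariadb-maxscale-version="22.08"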

    Install MaxScale

    On the MaxScale node, install MariaDB MaxScale.

    Install on CentOS / RHEL (YUM):

    Install on Debian / Ubuntu (APT):
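
    A minimal sketch of the install commands (assuming the repository configured in the previous step):

    # CentOS / RHEL
    sudo yum install maxscale

    # Debian / Ubuntu
    sudo apt install maxscale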

    Next Step

    Navigation in the procedure "Deploy ColumnStore Object Storage Topology":

    This page was step 6 of 9.


    Backup and Restore Overview

    Overview

    MariaDB Enterprise ColumnStore supports backup and restore.

    System of Record

    Before you determine a backup strategy for your Enterprise ColumnStore deployment, it is a good idea to determine the system of record for your Enterprise ColumnStore data.

    A system of record is the authoritative data source for a given piece of information. Organizations often store duplicate information in several systems, but only a single system can be the authoritative data source.

    Enterprise ColumnStore is designed to handle analytical processing for OLAP, data warehousing, DSS, and hybrid workloads on very large data sets. Analytical processing does not generally happen on the system of record. Instead, analytical processing generally occurs on a specialized database that is loaded with data from the separate system of record. Additionally, very large data sets can be difficult to back up. Therefore, it may be beneficial to only backup the system of record.

    If Enterprise ColumnStore is not acting as the system of record for your data, you should determine how the system of record affects your backup plan:

    • If your system of record is another database server, you should ensure that the other database server is properly backed up and that your organization has procedures to reload Enterprise ColumnStore from the other database server.

    • If your system of record is a set of data files, you should ensure that the set of data files is properly backed up and that your organization has procedures to reload Enterprise ColumnStore from the set of data files.

    Full Backup and Restore

    MariaDB Enterprise ColumnStore supports full backup and restore for all storage types. A full backup includes:

    • Enterprise ColumnStore's data and metadata

      • With S3: an S3 snapshot of the S3-compatible object storage and a file system snapshot or copy of the Storage Manager directory.

      • Without S3: a file system snapshot or copy of the DB Root directories.

    • The MariaDB data directory from the primary node

    To see the procedure to perform a full backup and restore, choose the storage type:

    • Enterprise ColumnStore with Object Storage

    • Enterprise ColumnStore with Shared Local Storage

    Step 5: Bulk Import of Data

    Overview

    This page details step 5 of a 5-step procedure for deploying Single-Node Enterprise ColumnStore with Object storage.

    This step bulk imports data to Enterprise ColumnStore.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Import the Schema

    Before data can be imported into the tables, create a matching schema.

    On the primary server, create the schema:

    1. For each database that you are importing, create the database with the CREATE DATABASE statement:

    2. For each table that you are importing, create the table with the CREATE TABLE statement:

    Import the Data

    Enterprise ColumnStore supports multiple methods to import data into ColumnStore tables.

    cpimport

    MariaDB Enterprise ColumnStore includes cpimport, which is a command-line utility designed to efficiently load data in bulk. Alternative methods are available.

    To import your data from a TSV (tab-separated values) file, on the primary server run cpimport:

    LOAD DATA INFILE

    When data is loaded with the LOAD DATA INFILE statement, MariaDB Enterprise ColumnStore loads the data using cpimport, which is a command-line utility designed to efficiently load data in bulk. Alternative methods are available.

    To import your data from a TSV (tab-separated values) file, on the primary server use the LOAD DATA INFILE statement:

    Import from Remote Database

    MariaDB Enterprise ColumnStore can also import data directly from a remote database. A simple method is to query the table using the SELECT statement, and then pipe the results into cpimport, which is a command-line utility that is designed to efficiently load data in bulk. Alternative methods are available.

    To import your data from a remote MariaDB database:

    Next Step

    Navigation in the Single-Node Enterprise ColumnStore topology with Object storage deployment procedure:

    This page was step 5 of 5.

    This procedure is complete.

    Switchover of the Primary Node

    To switch over to a new primary node with Enterprise ColumnStore, perform the following procedure.

    Performing Switchover in MaxScale

    The primary node can be switched in MaxScale using maxctrl:

    • Use maxctrl or another supported REST client.

    • Call a module command using the call command command.

    • As the first argument, provide the name of the module, which is mariadbmon.

    • As the second argument, provide the module command, which is switchover.

    • As the third argument, provide the name of the monitor.

    For example:

    maxctrl call command \
       mariadbmon \
       switchover \
       mcs_monitor

    With the above syntax, MaxScale will choose the most up-to-date replica to be the new primary.

    If you want to manually select a new primary, provide the server name of the new primary as the fourth argument:
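
    For example, where mcs2 is a placeholder for the server name of the desired new primary:

    maxctrl call command \
       mariadbmon \
       switchover \
       mcs_monitor \
       mcs2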

    Checking the Replication Status with MaxScale

    MaxScale is capable of checking the status of replication using maxctrl:

    • List the servers using the list servers command, like this:
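
    maxctrl list servers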

    If switchover was properly performed, the State column of the new primary shows Master, Running.

    ColumnStore Security Vulnerabilities


    This page is about security vulnerabilities that have been fixed for or still affect MariaDB ColumnStore. In addition, links are included to fixed security vulnerabilities in MariaDB Server since MariaDB ColumnStore is based on MariaDB Server.

    Sensitive security issues can be sent directly to the persons responsible for MariaDB security: security [AT] mariadb (dot) org.

    About CVEs

    CVE® stands for "Common Vulnerabilities and Exposures". It is a publicly available and free-to-use database of known software vulnerabilities, maintained at cve.org.

    CVEs fixed in ColumnStore

    The appropriate release notes document the CVEs fixed within a given release. Additional information can also be found in Security Vulnerabilities Fixed in MariaDB.

    There are no known CVEs on ColumnStore-specific infrastructure outside of the MariaDB server at this time.

    Credentials Management

    Overview

    Starting with MariaDB Enterprise ColumnStore 6.2.3, ColumnStore supports encryption for user passwords stored in Columnstore.xml:

    • Encryption keys are created with the cskeys utility

    • Passwords are encrypted using the cspasswd utility

    Compatibility

    • MariaDB Enterprise ColumnStore 6

    • MariaDB Enterprise ColumnStore 22.08

    • MariaDB Enterprise ColumnStore 23.02

    Encryption Keys

    MariaDB Enterprise ColumnStore stores its password encryption keys in the plain-text file /var/lib/columnstore/.secrets.

    The encryption keys are not created by default, but can be generated by executing the cskeys utility:
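
    For example (run on the primary node; this creates /var/lib/columnstore/.secrets):

    cskeys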

    In a multi-node Enterprise ColumnStore cluster, every ColumnStore node should have the same encryption keys. Therefore, it is recommended to execute cskeys on the primary server and then copy /var/lib/columnstore/.secrets to every other ColumnStore node and fix the file's permissions:
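
    A minimal sketch of the copy, assuming SSH access between nodes; the node name mcs2 is a placeholder, and the mysql ownership and 0400 mode are assumptions about the expected restrictive permissions:

    # On the primary node: copy the key file to another node
    scp /var/lib/columnstore/.secrets mcs2:/var/lib/columnstore/.secrets

    # On the receiving node: restrict the file to the mysql user
    sudo chown mysql:mysql /var/lib/columnstore/.secrets
    sudo chmod 0400 /var/lib/columnstore/.secrets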

    Encrypt a Password

    To encrypt a password:

    Generate an encrypted password using the cspasswd utility:

    • If the --interactive command-line option is specified, cspasswd prompts for the password.

    Set the encrypted password in Columnstore.xml using the mcsSetConfig utility:
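
    A minimal sketch (CrossEngineSupport Password is one common place such an encrypted password is stored; treat the exact section and setting names as assumptions for your configuration):

    # Encrypt a password; --interactive prompts instead of taking it as an argument
    cspasswd --interactive

    # Store the encrypted value in Columnstore.xml
    mcsSetConfig CrossEngineSupport Password "<encrypted-password>"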

    Decrypt a Password

    To decrypt a password, execute the cspasswd utility and specify the --decrypt command-line option:
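
    For example:

    cspasswd --decrypt "<encrypted-password>"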

    Data Ingestion Methods & Tools

    Learn about data ingestion for MariaDB ColumnStore. This section covers various methods and tools for efficiently loading large datasets into your columnar database for analytical workloads.

    ColumnStore provides several mechanisms to ingest data:

    • cpimport provides the fastest performance for inserting data and the ability to route data to particular PrimProc nodes. Normally, this should be the default choice for loading data.

    • LOAD DATA INFILE provides another means of bulk inserting data.

      • By default, with autocommit on, it internally streams the data to an instance of the cpimport process.

      • In transactional mode, DML inserts are performed, which is significantly slower and also consumes both binlog transaction files and ColumnStore VersionBuffer files.

    • DML, i.e. INSERT, UPDATE, and DELETE, provide row-level changes. ColumnStore is optimized towards bulk modifications, so these operations are slower than they would be in, for instance, InnoDB.

      • Currently ColumnStore does not support operating as a replication replica target.

      • Bulk DML operations will in general perform better than multiple individual statements.

    • Using the ColumnStore Bulk Write SDK or the ColumnStore Streaming Data Adapters.

    • INSERT INTO ... SELECT with autocommit behaves similarly to LOAD DATA INFILE because, internally, it is mapped to cpimport for higher performance.

    • Bulk update operations based on a join with a small staging table can be relatively fast, especially if updating a single column.

    Certified S3 Object Storage Providers

    Hardware (On Premises)

    • Quantum ActiveScale

    • IBM Cloud Object Storage (Formerly known as CleverSafe)

    • DELL EMC

    Cloud (IaaS)

    • AWS S3

    • Google GCS

    Software-Based

    Due to the frequent code changes and deviation from the AWS standards, none are approved at this time.

    Execution Plan (CSEP)

    Overview

    The ColumnStore storage engine uses a ColumnStore Execution Plan (CSEP) to represent a query plan internally.

    When the select handler receives the SELECT_LEX object, it transforms it into a CSEP as part of the query planning and optimization process. For additional information, see "MariaDB Enterprise ColumnStore Query Evaluation."

    Viewing the CSEP

    The CSEP for a given query can be viewed by performing the following:

    1. Calling the calSetTrace(1) function:

    SELECT calSetTrace(1);

    2. Executing the query:

    SELECT column1, column2
    FROM columnstore_tab
    WHERE column1 > '2020-04-01'
    AND column1 < '2020-11-01';

    3. Calling the calGetTrace() function:
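
    SELECT calGetTrace();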

    Sample storagemanager.cnf

    # Sample storagemanager.cnf
    
    [ObjectStorage]
    service = S3
    object_size = 5M
    metadata_path = /var/lib/columnstore/storagemanager/metadata
    journal_path = /var/lib/columnstore/storagemanager/journal
    max_concurrent_downloads = 21
    max_concurrent_uploads = 21
    common_prefix_depth = 3
    
    [S3]
    region = us-west-1
    bucket = my_columnstore_bucket
    endpoint = s3.amazonaws.com
    aws_access_key_id = AKIAR6P77BUKULIDIL55
    aws_secret_access_key = F38aR4eLrgNSWPAKFDJLDAcax0gZ3kYblU79
    
    [LocalStorage]
    path = /var/lib/columnstore/storagemanager/fake-cloud
    fake_latency = n
    max_latency = 50000
    
    [Cache]
    cache_size = 2g
    path = /var/lib/columnstore/storagemanager/cache

    Note: A region is required even when using an on-premises solution like ActiveScale due to header expectations within the API.

    Collecting Statistics with ANALYZE TABLE

    Overview

    In MariaDB Enterprise ColumnStore 6, the ExeMgr process uses optimizer statistics in its query planning process.

    ColumnStore uses the optimizer statistics to add support for queries that contain circular inner joins.

    In Enterprise ColumnStore 5 and before, ColumnStore would raise the following error when a query containing a circular inner join was executed:

    ERROR 1815 (HY000): Internal error: IDB-1003: Circular joins are not supported.

    The optimizer statistics store each column's NDV (Number of Distinct Values), which can help the ExeMgr process choose the optimal join order for queries with circular joins. When Enterprise ColumnStore executes a query with a circular join, the query's execution can take longer if ColumnStore chooses a sub-optimal join order. When you collect optimizer statistics for your ColumnStore tables, the ExeMgr process is less likely to choose a sub-optimal join order.

    Enterprise ColumnStore's optimizer statistics can be collected for ColumnStore tables by executing ANALYZE TABLE:

    ANALYZE TABLE columnstore_tab;

    Enterprise ColumnStore's optimizer statistics are not updated automatically. To update the optimizer statistics for a ColumnStore table, ANALYZE TABLE must be re-executed.

    Enterprise ColumnStore does not implement an interface to show optimizer statistics.

    View and Clear Table Locks

    MariaDB Enterprise ColumnStore acquires table locks for some operations, and it provides utilities to view and clear those locks.

    MariaDB Enterprise ColumnStore acquires table locks for some operations, such as:

    • DDL statements

    • DML statements

    • Bulk data loads

    If an operation fails, the table lock does not always get released. If you try to access the table, you can see errors like the following:

    ERROR 1815 (HY000): Internal error: CAL0009: Drop table failed due to IDB-2009: Unable to perform the drop table operation because cpimport with PID 16301 is currently holding the table lock for session -1.

    To solve this problem, MariaDB Enterprise ColumnStore provides two utilities to view and clear the table locks:

    • cleartablelock

    • viewtablelock

    Viewing Table Locks

    The viewtablelock utility shows table locks currently held by MariaDB Enterprise ColumnStore:

    To view all table locks:

    To view table locks for a specific table, specify the database and table:
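
    A minimal sketch (the database and table names are placeholders):

    # View all table locks
    viewtablelock

    # View locks for a specific table
    viewtablelock mydb mytable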

    Clearing Table Locks

    The cleartablelock utility clears table locks currently held by MariaDB Enterprise ColumnStore.

    To clear a table lock, specify the lock ID shown by the viewtablelock utility:
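
    For example, if viewtablelock reports lock ID 42 (a placeholder value):

    cleartablelock 42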

    ColumnStore Minimum Hardware Specification

    The following table outlines the minimum recommended production server specifications, which can be followed for both on-premises and cloud deployments:

    Per Server

    Item: Physical Server
      Development Environment: 8-core CPU, 32 GB memory
      Production Environment: 64-core CPU, 128 GB memory

    Item: Storage
      Development Environment: Local disk
      Production Environment: StorageManager (S3)

    Item: Network Interconnect
      In a multi-server deployment, data is passed between nodes via TCP/IP networking. At least a 1G network is recommended.

    Details

    These are minimum recommendations, and in general the system will perform better with more hardware:

    • More CPU cores and servers will improve query processing response time.

    • More memory will allow the system to cache more data blocks in memory. We have users running systems with anywhere from 64 GB RAM to 2 TB RAM.

    • A faster network will allow data to flow faster between PrimProc nodes.

    • SSDs may be used; however, the system is optimized towards block streaming, which may perform well enough with HDDs for lower cost.

    • Where it is an option, it is recommended to use bare metal servers for additional performance, since ColumnStore will fully consume CPU cores and memory.

    • In general, it makes more sense to use a higher core count / higher memory server for single-server or two-server combined deployments.

    AWS Instance Sizes

    For AWS, our own internal testing generally uses m4.4xlarge instance types as a cost-effective middle ground. The r4.8xlarge has also been tested and performs about twice as fast for about twice the price.

    Step 9: Import Data

    Overview

    This page details step 9 of the 9-step procedure "Deploy ColumnStore Object Storage Topology".

    This step bulk imports data to Enterprise ColumnStore.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Import the Schema

    Before data can be imported into the tables, create a matching schema.

    On the primary server, create the schema:

    1. For each database that you are importing, create the database with the CREATE DATABASE statement:

    2. For each table that you are importing, create the table with the CREATE TABLE statement:

    Import the Data

    Enterprise ColumnStore supports multiple methods to import data into ColumnStore tables:

    • cpimport (Shell interface): SQL access is not required

    • LOAD DATA INFILE (SQL interface): Shell access is not required

    • Import from a remote database: use a normal database client and avoid dumping data to intermediate files

    cpimport

    MariaDB Enterprise ColumnStore includes cpimport, which is a command-line utility designed to efficiently load data in bulk. Alternative methods are available.

    To import your data from a TSV (tab-separated values) file, on the primary server run cpimport:

    LOAD DATA INFILE

    When data is loaded with the LOAD DATA INFILE statement, MariaDB Enterprise ColumnStore loads the data using cpimport, which is a command-line utility designed to efficiently load data in bulk. Alternative methods are available.

    To import your data from a TSV (tab-separated values) file, on the primary server use the LOAD DATA INFILE statement:

    Import from Remote Database

    MariaDB Enterprise ColumnStore can also import data directly from a remote database. A simple method is to query the table using the SELECT statement, and then pipe the results into cpimport, which is a command-line utility that is designed to efficiently load data in bulk. Alternative methods are available.

    To import your data from a remote MariaDB database:

    Next Step

    Navigation in the procedure "Deploy ColumnStore Object Storage Topology":

    This page was step 9 of 9.

    This procedure is complete.

    Step 2: Install Enterprise ColumnStore

    Overview

    This page details step 2 of a 5-step procedure for deploying Single-Node Enterprise ColumnStore with Object storage.

    This step installs MariaDB Enterprise Server and MariaDB Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Retrieve Download Token

    MariaDB Corporation provides package repositories for CentOS / RHEL (YUM) and Debian / Ubuntu (APT). A download token is required to access the MariaDB Enterprise Repository.

    Customer Download Tokens are customer-specific and are available through the MariaDB Customer Portal.

    To retrieve the token for your account:

    1. Navigate to https://customers.mariadb.com/downloads/token/

    2. Log in.

    3. Copy the Customer Download Token.

    Substitute your token for CUSTOMER_DOWNLOAD_TOKEN when configuring the package repositories.

    Set Up Repository

    1. On the Enterprise ColumnStore node, install the prerequisites for downloading the software from the Web.

    Install on CentOS / RHEL (YUM):

    Install on Debian / Ubuntu (APT):

    2. On the Enterprise ColumnStore node, configure package repositories and specify Enterprise Server:

    Checksums of the various releases of the mariadb_es_repo_setup script can be found in the section at the bottom of the page. Substitute ${checksum} in the example above with the latest checksum.

    Install Enterprise ColumnStore

    1. Install additional dependencies:

    Install on CentOS / RHEL (YUM):

    Install on Debian 10 and Ubuntu 20.04 (APT):

    Install on Debian 9 and Ubuntu 18.04 (APT):

    2. Install MariaDB Enterprise Server and MariaDB Enterprise ColumnStore:

    Install on CentOS / RHEL (YUM):

    Install on Debian / Ubuntu (APT):

    Next Step

    Navigation in the Single-Node Enterprise ColumnStore topology with Object storage deployment procedure:

    This page was step 2 of 5.

    Next: Step 3: Start and Configure MariaDB Enterprise ColumnStore.

    Major Release Upgrades for MariaDB Enterprise ColumnStore

    This page provides a major release upgrade procedure for MariaDB Enterprise ColumnStore. A major release upgrade is an upgrade from an older major release to a newer major release, such as an upgrade from MariaDB Enterprise ColumnStore 5 to MariaDB Enterprise ColumnStore 22.08.

    Compatibility

    • Enterprise ColumnStore 5

    • Enterprise ColumnStore 6

    • Enterprise ColumnStore 22.08

    Prerequisites

    This procedure assumes that the new Enterprise ColumnStore version will be installed onto new servers.

    To reuse existing servers for the new Enterprise ColumnStore version, you must adapt the procedure detailed below. After step 1, confirm all data has been backed up and verify the backups. The old version of Enterprise ColumnStore should then be uninstalled, and all Enterprise ColumnStore files should be deleted before continuing with step 2.

    Step 1: Backup/Export Schemas and Data

    On the old ColumnStore cluster, perform a full backup.

    MariaDB recommends backing up the table schemas to a single SQL file and backing up the table data to table-specific CSV files.

    1. For each table, obtain the table's schema by executing the SHOW CREATE TABLE statement:

      Back up the table schemas by copying the output to an SQL file. This procedure assumes that the SQL file is named schema-backup.sql.

    2. For each table, back up the table data to a CSV file using the SELECT ... INTO OUTFILE statement:

    3. Copy the SQL file containing the table schemas and the CSV files containing the table data to the primary node of the new ColumnStore cluster.

    Step 2: Install New Major Release

    On the new ColumnStore cluster, follow the deployment instructions of the desired topology for the new ColumnStore version.

    Step 3: Restore/Import Data

    On the new ColumnStore cluster, restore the table schemas and data.

    1. Restore the schema backup using the mariadb client:

      • HOST and PORT should refer to the following:

        • If you are connecting with MaxScale as a proxy, they should refer to the host and port of the MaxScale listener.

        • If you are connecting directly to a multi-node ColumnStore cluster, they should refer to the host and port of the primary ColumnStore node.

        • If you are connecting directly to single-node ColumnStore, they should refer to the host and port of the ColumnStore node.

      • When the command is executed, the mariadb client prompts for the user password.

    2. For each table, restore the data from the table's CSV file by executing cpimport on the primary ColumnStore node:

    Step 4: Test

    On the new ColumnStore cluster, verify that the table schemas and data have been restored.

    1. For each table, verify the table's definition by executing the SHOW CREATE TABLE statement:

    2. For each table, verify the number of rows in the table by executing SELECT COUNT(*):

    3. For each table, verify the data in the table by executing a SELECT statement.

      If the table is very large, you can limit the number of rows in the result set by adding a LIMIT clause:

    MariaDB Enterprise Columnstore Locking

    Overview

    MariaDB Enterprise ColumnStore minimizes locking for analytical workloads, bulk data loads, and online schema changes.

    Lockless Reads

    MariaDB Enterprise ColumnStore supports lockless reads.

    Locking for Writes

    MariaDB Enterprise ColumnStore requires a table lock for write operations.

    Locking for Data Loading

    MariaDB Enterprise ColumnStore requires a write metadata lock (MDL) on the table when a bulk data load is performed with cpimport.

    When a bulk data load is running:

    • Read queries will not be blocked.

    • Write queries and concurrent bulk data loads on the same table will be blocked until the bulk data load operation is complete and the write metadata lock on the table has been released.

    • The write metadata lock (MDL) can be monitored with the metadata_lock_info plugin.

    For additional information, see "MariaDB Enterprise ColumnStore Data Loading".

    Online Schema Changes

    MariaDB Enterprise ColumnStore supports online schema changes, so that supported DDL operations can be performed without blocking reads. The supported DDL operations only require a write metadata lock (MDL) on the target table.

    Rejoining a Node

    To rejoin a node with Enterprise ColumnStore, perform the following procedure.

    Performing Rejoin in MaxScale

    The node can be configured to rejoin in MaxScale using maxctrl:

    • Use maxctrl or another supported REST client.

    • Call a module command using the call command command.

    • As the first argument, provide the name of the module, which is mariadbmon.

    • As the second argument, provide the module command, which is rejoin.

    • As the third argument, provide the name of the monitor.

    • As the fourth argument, provide the name of the server.

    For example:

    maxctrl call command \
       mariadbmon \
       rejoin \
       mcs_monitor \
       mcs3

    Checking Replication Status with MaxScale

    MaxScale is capable of checking the status of replication using maxctrl:

    • List the servers using the list servers command, like this:

    If the node properly rejoined, the State column of the node shows Slave, Running.

    Step 6: Install MariaDB MaxScale

    Overview

    This page details step 6 of the 9-step procedure "Deploy ColumnStore Shared Local Storage Topology".

    This step installs MariaDB MaxScale 22.08. ColumnStore Shared Local Storage requires 1 or more MaxScale nodes.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Retrieve Customer Download Token

    MariaDB Corporation provides package repositories for CentOS / RHEL (YUM) and Debian / Ubuntu (APT). A download token is required to access the MariaDB Enterprise Repository.

    Customer Download Tokens are customer-specific and are available through the MariaDB Customer Portal.

    To retrieve the token for your account:

    1. Navigate to https://customers.mariadb.com/downloads/token/

    2. Log in.

    3. Copy the Customer Download Token.

    Substitute your token for CUSTOMER_DOWNLOAD_TOKEN when configuring the package repositories.

    Set Up Repository

    1. On the MaxScale node, install the prerequisites for downloading the software from the Web.

    Install on CentOS / RHEL (YUM):

    Install on Debian / Ubuntu (APT):

    2. On the MaxScale node, configure package repositories and specify MariaDB MaxScale 22.08:

    Checksums of the various releases of the mariadb_es_repo_setup script can be found in the section at the bottom of the page. Substitute ${checksum} in the example above with the latest checksum.

    Install MaxScale

    On the MaxScale node, install MariaDB MaxScale.

    Install on CentOS / RHEL (YUM):

    Install on Debian / Ubuntu (APT):

    Next Step

    Navigation in the procedure "Deploy ColumnStore Shared Local Storage Topology":

    This page was step 6 of 9.

    Next: Step 7: Start and Configure MariaDB MaxScale.

    Using StorageManager With IAM Role

    AWS IAM Role Configuration

    From ColumnStore 5.5.2, you can use AWS IAM roles to connect to S3 buckets without explicitly entering credentials into the storagemanager.cnf config file.

    You need to modify the IAM role of your Amazon EC2 instance to allow for this. Please follow the AWS documentation before beginning this process.

    It is important to note that you must update the AWS S3 endpoint based on your chosen region; otherwise, you might face delays in propagation.

    For a complete list of AWS service endpoints, visit the AWS reference guide.









    Sample Configuration

    Edit your Storage Manager configuration file, located at /etc/columnstore/storagemanager.cnf, so that it looks similar to the example below (replacing the values in the [S3] section with your own):
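
    A minimal sketch (treat the ec2_iam_mode setting as an assumption to verify against your ColumnStore version; the region, bucket, and endpoint values are placeholders):

    [S3]
    ec2_iam_mode = enabled
    region = us-west-1
    bucket = my_columnstore_bucket
    endpoint = s3.amazonaws.com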

    Note: This is an AWS-only feature. For other deployment methods, see the sample storagemanager.cnf shown earlier.




    Step 3: Install MariaDB Enterprise Server

    Overview

    This page details step 3 of the 9-step procedure "Deploy ColumnStore Shared Local Storage Topology".

    This step installs MariaDB Enterprise Server, MariaDB Enterprise ColumnStore 23.10, CMAPI, and dependencies.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Retrieve Download Token

    MariaDB Corporation provides package repositories for CentOS / RHEL (YUM) and Debian / Ubuntu (APT). A download token is required to access the MariaDB Enterprise Repository.

    Customer Download Tokens are customer-specific and are available through the MariaDB Customer Portal.

    To retrieve the token for your account:

    1. Navigate to https://customers.mariadb.com/downloads/token/

    2. Log in.

    3. Copy the Customer Download Token.

    Substitute your token for CUSTOMER_DOWNLOAD_TOKEN when configuring the package repositories.

    Set Up Repository

    1. On each Enterprise ColumnStore node, install the prerequisites for downloading the software from the Web. Install on CentOS / RHEL (YUM):

    Install on Debian / Ubuntu (APT):

    2. On each Enterprise ColumnStore node, configure package repositories and specify Enterprise Server:

    Checksums of the various releases of the mariadb_es_repo_setup script can be found in the section at the bottom of the page. Substitute ${checksum} in the example above with the latest checksum.

    Install Enterprise Server and Enterprise ColumnStore

    1. On each Enterprise ColumnStore node, install additional dependencies:

    Install on CentOS and RHEL (YUM):

    Install on Debian 9 and Ubuntu 18.04 (APT):

    Install on Debian 10 and Ubuntu 20.04 (APT):

    2. On each Enterprise ColumnStore node, install MariaDB Enterprise Server and MariaDB Enterprise ColumnStore:

    Install on CentOS / RHEL (YUM):

    Install on Debian / Ubuntu (APT):

    Next Step

    Navigation in the procedure "Deploy ColumnStore Shared Local Storage Topology".

    This page was step 3 of 9.

    Step 1: Prepare Systems for Enterprise ColumnStore Nodes

    Overview

    This page details step 1 of a 5-step procedure for deploying Single-Node Enterprise ColumnStore with Local storage.

    This step prepares the system to host MariaDB Enterprise Server and MariaDB Enterprise ColumnStore.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Optimize Linux Kernel Parameters

    MariaDB Enterprise ColumnStore performs best with Linux kernel optimizations.

    On each server to host an Enterprise ColumnStore node, optimize the kernel:

    1. Set the relevant kernel parameters in a sysctl configuration file. To ensure proper change management, use an Enterprise ColumnStore-specific configuration file.

    Create a /etc/sysctl.d/90-mariadb-enterprise-columnstore.conf file:

    2. Use the sysctl command to set the kernel parameters at runtime:
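
    The exact parameter set depends on your environment; as a minimal sketch of the two steps (the vm.swappiness value is a commonly used setting for database hosts, not a requirement stated here):

    # /etc/sysctl.d/90-mariadb-enterprise-columnstore.conf
    # Minimize swapping so database caches stay resident
    vm.swappiness = 1

    # Apply the settings at runtime without a reboot
    sudo sysctl --load=/etc/sysctl.d/90-mariadb-enterprise-columnstore.conf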

    Temporarily Configure Linux Security Modules (LSM)

    The Linux Security Modules (LSM) should be temporarily disabled on each Enterprise ColumnStore node during installation.

    The LSM will be configured and re-enabled later in this deployment procedure.

    The steps to disable the LSM depend on the specific LSM used by the operating system.

    CentOS / RHEL: Stop SELinux

    SELinux must be set to permissive mode before installing MariaDB Enterprise ColumnStore.

    To set SELinux to permissive mode:

    1. Set SELinux to permissive mode at runtime:
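
    For example (0 selects permissive mode):

    sudo setenforce 0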

    2. Make the change persistent by setting SELINUX=permissive in /etc/selinux/config.

    For example, the file will usually look like this after the change:
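
    # Typical /etc/selinux/config after the change (SELINUXTYPE may differ)
    SELINUX=permissive
    SELINUXTYPE=targeted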

    3. Confirm that SELinux is in permissive mode:
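
    $ getenforce
    Permissive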

    SELinux will be configured and re-enabled later in this deployment procedure. This configuration is not persistent. If you restart the server before configuring and re-enabling SELinux later in the deployment procedure, you must reset the enforcement to permissive mode.

    Debian / Ubuntu AppArmor

    AppArmor must be disabled before installing MariaDB Enterprise ColumnStore.

    1. Disable AppArmor:
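
    For example:

    sudo systemctl disable apparmor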

    2. Reboot the system.

    3. Confirm that no AppArmor profiles are loaded using aa-status:
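
    sudo aa-status
    # Expect the output to report that 0 profiles are loaded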

    AppArmor will be configured and re-enabled later in this deployment procedure.

    Configure Character Encoding

    When using MariaDB Enterprise ColumnStore, it is recommended to set the system's locale to UTF-8.

    1. On RHEL 8, install additional dependencies:

    2. Set the system's locale to en_US.UTF-8 by executing localedef:
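
    A minimal sketch of both steps (the RHEL 8 package names are an assumption on my part):

    # Step 1 (RHEL 8): install locale sources and the English langpack
    sudo yum install glibc-locale-source glibc-langpack-en

    # Step 2: define the en_US.UTF-8 locale
    sudo localedef -i en_US -f UTF-8 en_US.UTF-8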

    Next Step

    Navigation in the Single-Node Enterprise ColumnStore topology with Local storage deployment procedure:

    This page was step 1 of 5.

    About MariaDB ColumnStore

    MariaDB ColumnStore is a columnar storage engine that utilizes a massively parallel distributed data architecture. It's a columnar storage system built by porting InfiniDB 4.6.7 to MariaDB and released under the GPL license.

    From MariaDB 10.5.4, ColumnStore is available as a storage engine for MariaDB Server. Before then, it was available as a separate download.

    Release notes and other documentation for ColumnStore are also available in the Enterprise docs section of the MariaDB website.

    It is designed for big data scaling to process petabytes of data, linear scalability, and exceptional performance with real-time response to analytical queries. It leverages the I/O benefits of columnar storage, compression, just-in-time projection, and horizontal and vertical partitioning to deliver tremendous performance when analyzing large data sets.

    Links:

    • A Google Group exists for MariaDB ColumnStore that can be used to discuss ideas and issues and to communicate with the community: send email to mariadb-columnstore@googlegroups.com.

    • Bugs can be reported in MariaDB Jira under the MCOL project.

    MariaDB ColumnStore is released under the GPL license.

    Setting a Node to Maintenance Mode

    To set a node to maintenance mode with Enterprise ColumnStore, perform the following procedure.

    Setting the Server State in MaxScale

    The server object for the node can be set to maintenance mode in MaxScale using maxctrl:

    • Use maxctrl or another supported REST client.

    • Set the server object to maintenance mode using the set server command.

    • As the first argument, provide the name for the server.

    • As the second argument, provide maintenance as the state.

    For example:
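
    Here mcs1 is a placeholder for the server name:

    maxctrl set server mcs1 maintenance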

    If the specified server is a primary server, then MaxScale will allow open transactions to complete before closing any connections.

    If you would like MaxScale to immediately close all connections, the --force option can be provided as a third argument:
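
    maxctrl set server mcs1 maintenance --force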

    Confirming Maintenance Mode is Set with MaxScale

    Confirm the state of the server object in MaxScale using maxctrl:

    • List the servers using the list servers command, like this:

    If the node is properly in maintenance mode, then the State column will show Maintenance as one of the states.

    Performing Maintenance

    Now that the server is in maintenance mode in MaxScale, you can perform your maintenance.

    While the server is in maintenance mode:

    • MaxScale doesn't route traffic to the node.

    • MaxScale doesn't select the node to be primary during failover.

    • The node can be rebooted.

    • The node's services can be restarted.

    Clear the Server State in MaxScale

    Maintenance mode for the server object for the node can be cleared in MaxScale using maxctrl:

    • Use maxctrl or another supported REST client.

    • Clear the server object's state using the clear server command.

    • As the first argument, provide the name for the server.

    • As the second argument, provide maintenance as the state.

    For example:
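
    Again with mcs1 as the placeholder server name:

    maxctrl clear server mcs1 maintenance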

    Confirming Maintenance Mode is Cleared with MaxScale

    Confirm the state of the server object in MaxScale using maxctrl:

    • List the servers using the list servers command, like this:

    If the node is no longer in maintenance mode, the State column no longer shows Maintenance as one of the states.

    Step 9: Import Data

    Overview

    This page details step 9 of the 9-step procedure "Deploy ColumnStore Shared Local Storage Topology".

    This step bulk imports data to Enterprise ColumnStore.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Import the Schema

    Before data can be imported into the tables, create a matching schema.

    On the primary server, create the schema:

    1. For each database that you are importing, create the database with the CREATE DATABASE statement:
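CREATE DATABASE inventory;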

2. For each table that you are importing, create the table with the CREATE TABLE statement:
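CREATE TABLE inventory.products (
   product_name VARCHAR(11) NOT NULL DEFAULT '',
   supplier VARCHAR(128) NOT NULL DEFAULT '',
   quantity VARCHAR(128) NOT NULL DEFAULT '',
   unit_cost VARCHAR(128) NOT NULL DEFAULT ''
) ENGINE=Columnstore DEFAULT CHARSET=utf8;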

    Import the Data

    Enterprise ColumnStore supports multiple methods to import data into ColumnStore tables.

Interface        Method                  Benefits
Shell            cpimport                SQL access is not required
SQL              LOAD DATA INFILE        Shell access is not required
Remote Database  Remote Database Import  Use a normal database client; avoid dumping data to an intermediate file

    cpimport

MariaDB Enterprise ColumnStore includes cpimport, a command-line utility designed to efficiently load data in bulk. Alternative methods are available.

To import your data from a TSV (tab-separated values) file, run cpimport on the primary server:
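$ sudo cpimport -s '\t' inventory products /tmp/inventory-products.tsv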

    LOAD DATA INFILE

When data is loaded with the LOAD DATA INFILE statement, MariaDB Enterprise ColumnStore loads the data using cpimport, a command-line utility designed to efficiently load data in bulk. Alternative methods are available.

To import your data from a TSV (tab-separated values) file, use the LOAD DATA INFILE statement on the primary server:
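LOAD DATA INFILE '/tmp/inventory-products.tsv'
INTO TABLE inventory.products;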

    Import from Remote Database

MariaDB Enterprise ColumnStore can also import data directly from a remote database. A simple method is to query the table using the SELECT statement and pipe the results into cpimport, a command-line utility designed to efficiently load data in bulk. Alternative methods are available.

    To import your data from a remote MariaDB database:
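$ mariadb --quick \
   --skip-column-names \
   --execute="SELECT * FROM inventory.products" \
   | cpimport -s '\t' inventory products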

    Next Step

    Navigation in the procedure "Deploy ColumnStore Shared Local Storage Topology".

    This page was step 9 of 9.

    This procedure is complete.

    Step 3: Install MariaDB Enterprise Server

    Step 3: Install MariaDB Enterprise Server

    Overview

    This page details step 3 of the 9-step procedure "Deploy ColumnStore Object Storage Topology".

    This step installs MariaDB Enterprise Server, MariaDB Enterprise ColumnStore 23.10, CMAPI, and dependencies.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Retrieve Download Token

    MariaDB Corporation provides package repositories for CentOS / RHEL (YUM) and Debian / Ubuntu (APT). A download token is required to access the MariaDB Enterprise Repository.

    Customer Download Tokens are customer-specific and are available through the MariaDB Customer Portal.

    To retrieve the token for your account:

1. Navigate to https://customers.mariadb.com/downloads/token/

    2. Log in.

    3. Copy the Customer Download Token.

    Substitute your token for CUSTOMER_DOWNLOAD_TOKEN when configuring the package repositories.

    Set Up Repository

    1. On each Enterprise ColumnStore node, install the prerequisites for downloading the software from the Web. Install on CentOS / RHEL (YUM):
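$ sudo yum install curl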

    Install on Debian / Ubuntu (APT):
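$ sudo apt install curl apt-transport-https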

2. On each Enterprise ColumnStore node, configure package repositories and specify Enterprise Server:
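$ curl -LsSO https://dlm.mariadb.com/enterprise-release-helpers/mariadb_es_repo_setup
$ echo "${checksum}  mariadb_es_repo_setup" \
      | sha256sum -c -
$ chmod +x mariadb_es_repo_setup
$ sudo ./mariadb_es_repo_setup --token="CUSTOMER_DOWNLOAD_TOKEN" --apply \
      --skip-maxscale \
      --skip-tools \
      --mariadb-server-version="11.4"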

    Checksums of the various releases of the mariadb_es_repo_setup script can be found in the section at the bottom of the page. Substitute ${checksum} in the example above with the latest checksum.

    Install Enterprise Server and Enterprise ColumnStore

    1. On each Enterprise ColumnStore node, install additional dependencies:

    Install on CentOS and RHEL (YUM):
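$ sudo yum install epel-release
$ sudo yum install jemalloc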

Install on Debian 9 and Ubuntu 18.04 (APT):
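$ sudo apt install libjemalloc1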

    Install on Debian 10 and Ubuntu 20.04 (APT):
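$ sudo apt install libjemalloc2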

2. On each Enterprise ColumnStore node, install MariaDB Enterprise Server and MariaDB Enterprise ColumnStore:

    Install on CentOS / RHEL (YUM):
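$ sudo yum install MariaDB-server \
   MariaDB-backup \
   MariaDB-shared \
   MariaDB-client \
   MariaDB-columnstore-engine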

    Install on Debian / Ubuntu (APT):
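$ sudo apt install mariadb-server \
   mariadb-backup \
   libmariadb3 \
   mariadb-client \
   mariadb-plugin-columnstore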

    Next Step

    Navigation in the procedure "Deploy ColumnStore Object Storage Topology".

    This page was step 3 of 9.
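Next: Step 4: Start and Configure MariaDB Enterprise Server.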

    Extent Map Backup & Recovery

    Overview

MariaDB ColumnStore utilizes an Extent Map to manage data distribution across extents: logical blocks within physical segment files ranging from 8 to 64 MB. Each extent holds a consistent number of rows, with the Extent Map cataloging these extents, their corresponding block identifiers (LBIDs), and the minimum and maximum values for each column's data within the extent.

The primary node maintains the master copy of the Extent Map. Upon system startup, this map is loaded into memory and propagated to other nodes for redundancy and quick access. Corruption of the master Extent Map can render the system unusable and lead to data loss.

    Purpose

    ColumnStore's extent map is a smart structure that underpins its performance. By providing a logical partitioning scheme, it avoids the overhead associated with indexing and other common row-based database optimizations.

    The primary node in a ColumnStore cluster holds the master copy of the extent map. Upon system startup, this master copy is read into memory and then replicated to all other participating nodes for high availability and disaster recovery. Nodes keep the extent map in memory for rapid access during query processing. As data within extents is modified, these updates are broadcast to all participating nodes to maintain consistency.

    If the master copy of the extent map becomes corrupted, the entire system could become unusable, potentially leading to data loss. Having a recent backup of the extent map allows for a much faster recovery compared to reloading the entire database in such a scenario.

    Backup Procedure

    Note: MariaDB recommends implementing regular backups to ensure data integrity and recovery. A common default is to back up every 3 hours and retain backups for at least 10 days.

    To safeguard against potential Extent Map corruption, regularly back up the master copy:

1. Lock Table:

2. Save BRM:

3. Create Backup Directory:

4. Copy Extent Map:

5. Unlock Tables:
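A minimal sketch of these five steps, assuming the default single-node dbrm path (/var/lib/columnstore/data1/systemFiles/dbrm) and an illustrative backup directory /extent-map-backup; the table lock must be held in an open client session while steps 2-4 run:

# 1. In an open mariadb client session (keep it open until step 5):
#      FLUSH TABLES WITH READ LOCK;
# 2. Save the BRM, writing the current Extent Map to disk:
$ sudo save_brm
# 3. Create the backup directory (path is illustrative):
$ sudo mkdir -p /extent-map-backup
# 4. Copy the Extent Map:
$ sudo cp -f /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_em /extent-map-backup/
# 5. In the same client session:
#      UNLOCK TABLES;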

    Recovery Procedures

    Single-Node System

1. Stop ColumnStore:

2. Rename Corrupted Map:

3. Clear Versioning Files:

4. Restore Backup:

5. Set Ownership:

6. Start ColumnStore:
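A sketch of the six steps, assuming default paths, the mariadb-columnstore systemd unit, and the versioning file names found in the default dbrm directory (verify these against your deployment):

# 1. Stop ColumnStore:
$ sudo systemctl stop mariadb-columnstore
# 2. Rename the corrupted Extent Map:
$ sudo mv /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_em \
          /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_em.bad
# 3. Clear the versioning files:
$ sudo truncate -s 0 /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_vbbm
$ sudo truncate -s 0 /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_vss
# 4. Restore the backup:
$ sudo cp /extent-map-backup/BRM_saves_em /var/lib/columnstore/data1/systemFiles/dbrm/
# 5. Set ownership:
$ sudo chown mysql:mysql /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_em
# 6. Start ColumnStore:
$ sudo systemctl start mariadb-columnstore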

    Clustered System

1. Shutdown Cluster:

2. Rename Corrupted Map:

3. Clear Versioning Files:

4. Restore Backup:

5. Set Ownership:

6. Start Cluster:
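The clustered procedure mirrors the single-node file operations, performed on the primary node, with the cluster stopped and started through CMAPI; the mcs CLI invocations below are an assumption (check your version's cluster management documentation):

# 1. Shut down the cluster:
$ sudo mcs cluster stop
# 2.-5. Perform the rename, clear, restore, and chown steps from the
#       single-node procedure on the primary node's dbrm directory.
# 6. Start the cluster:
$ sudo mcs cluster start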

    Automation Recommendation

    Incorporate the save_brm command into your data import scripts (e.g., those using cpimport) to automate Extent Map backups. This practice ensures regular backups without manual intervention.

Refer to the MariaDB ColumnStore Backup Script for an example implementation.

    Step 3: Start and Configure Enterprise ColumnStore

    Step 3: Start and Configure Enterprise ColumnStore

    Overview

This page details step 3 of a 5-step procedure for deploying Single-Node Enterprise ColumnStore.

    This step starts and configures MariaDB Enterprise Server and MariaDB Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Step 4: Test Enterprise ColumnStore

    Overview

This page details step 4 of a 5-step procedure for deploying Single-Node Enterprise ColumnStore.

    This step tests MariaDB Enterprise Server and MariaDB Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    ColumnStore Read Replicas

    The ColumnStore Read Replica topology is an Alpha release. Do not use it in production without testing in your development environment first.

    Overview

The Read Replicas feature in MariaDB ColumnStore enables horizontal scaling of read performance by incorporating read-only nodes into a multi-node cluster. These replicas differ from standard ColumnStore nodes in that they don't run the WriteEngineServer process. This means Read Replica nodes cannot handle write operations directly; instead, any write queries attempted on a replica are automatically forwarded to a read-write (RW) node.

    Step 1: Prepare Systems for Enterprise ColumnStore Nodes

    Step 1: Prepare Systems for Enterprise ColumnStore Nodes

    Overview

    This page details step 1 of a 5-step procedure for deploying Single-Node Enterprise ColumnStore with Object storage.

    This step prepares the system to host MariaDB Enterprise Server and MariaDB Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Job Steps

    Overview

When Enterprise ColumnStore executes a query, the ExeMgr process on the initiator/aggregator node translates the ColumnStore execution plan (CSEP) into a job list. A job list is a sequence of job steps.

    Enterprise ColumnStore uses many different types of job steps that provide different scalability benefits:

• Some types of job steps perform operations in a distributed manner, using multiple nodes to operate on different extents. Distributed operations provide horizontal scalability.

    Step 4: Test Enterprise ColumnStore

    Step 4: Test Enterprise ColumnStore

    Overview

    This page details step 4 of a 5-step procedure for deploying Single-Node Enterprise ColumnStore with Object storage.

    This step tests MariaDB Enterprise Server and MariaDB Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Step 2: Install Enterprise ColumnStore

    Step 2: Install Enterprise ColumnStore

    Overview

This page details step 2 of a 5-step procedure for deploying Single-Node Enterprise ColumnStore.

    This step installs MariaDB Enterprise Server and MariaDB Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    SHOW CREATE TABLE DATABASE_NAME.TABLE_NAME\G
    SELECT * INTO OUTFILE '/path/to/DATABASE_NAME-TABLE_NAME.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    FROM DATABASE_NAME.TABLE_NAME;
    mariadb --host HOST --port PORT --user USER --password < schema-backup.sql
    SHOW CREATE TABLE DATABASE_NAME.TABLE_NAME\G
    SELECT COUNT(*) FROM DATABASE_NAME.TABLE_NAME;
    maxctrl list servers
    $ sudo yum install curl
    $ sudo apt install curl apt-transport-https
    $ curl -LsSO https://dlm.mariadb.com/enterprise-release-helpers/mariadb_es_repo_setup
    
    $ echo "${checksum}  mariadb_es_repo_setup" \
           | sha256sum -c -
    
    $ chmod +x mariadb_es_repo_setup
    
    $ sudo ./mariadb_es_repo_setup --token="CUSTOMER_DOWNLOAD_TOKEN" --apply \
          --skip-server \
          --skip-tools \
          --mariadb-maxscale-version="22.08"
    $ sudo yum install maxscale
    $ sudo apt install maxscale
    [ObjectStorage]
    service = S3
    object_size = 5M
    metadata_path = /var/lib/columnstore/storagemanager/metadata
    journal_path = /var/lib/columnstore/storagemanager/journal
    max_concurrent_downloads = 21
    max_concurrent_uploads = 21
    common_prefix_depth = 3
    
    [S3]
    ec2_iam_mode=enabled
    bucket = my_mcs_bucket
    region = us-west-2
    endpoint = s3.us-west-2.amazonaws.com
    
    [LocalStorage]
    path = /var/lib/columnstore/storagemanager/fake-cloud
    fake_latency = n
    max_latency = 50000
    
    [Cache]
    cache_size = 2g
    path = /var/lib/columnstore/storagemanager/cache
    CREATE DATABASE inventory;
    CREATE TABLE inventory.products (
       product_name VARCHAR(11) NOT NULL DEFAULT '',
       supplier VARCHAR(128) NOT NULL DEFAULT '',
       quantity VARCHAR(128) NOT NULL DEFAULT '',
       unit_cost VARCHAR(128) NOT NULL DEFAULT ''
    ) ENGINE=Columnstore DEFAULT CHARSET=utf8;
    $ sudo cpimport -s '\t' inventory products /tmp/inventory-products.tsv
    LOAD DATA INFILE '/tmp/inventory-products.tsv'
    INTO TABLE inventory.products;
    $ mariadb --quick \
       --skip-column-names \
       --execute="SELECT * FROM inventory.products" \
       | cpimport -s '\t' inventory products
    $ sudo yum install curl
    $ sudo apt install curl apt-transport-https
    $ curl -LsSO https://dlm.mariadb.com/enterprise-release-helpers/mariadb_es_repo_setup
    
    $ echo "${checksum}  mariadb_es_repo_setup" \
           | sha256sum -c -
    
    $ chmod +x mariadb_es_repo_setup
    
    $ sudo ./mariadb_es_repo_setup --token="CUSTOMER_DOWNLOAD_TOKEN" --apply \
          --skip-server \
          --skip-tools \
          --mariadb-maxscale-version="22.08"
    $ sudo yum install maxscale
    $ sudo apt install maxscale
    CREATE DATABASE inventory;
    CREATE TABLE inventory.products (
       product_name VARCHAR(11) NOT NULL DEFAULT '',
       supplier VARCHAR(128) NOT NULL DEFAULT '',
       quantity VARCHAR(128) NOT NULL DEFAULT '',
       unit_cost VARCHAR(128) NOT NULL DEFAULT ''
    ) ENGINE=Columnstore DEFAULT CHARSET=utf8;
    $ sudo cpimport -s '\t' inventory products /tmp/inventory-products.tsv
    LOAD DATA INFILE '/tmp/inventory-products.tsv'
    INTO TABLE inventory.products;
    $ mariadb --quick \
       --skip-column-names \
       --execute="SELECT * FROM inventory.products" \
       | cpimport -s '\t' inventory products
    maxctrl call command \
       mariadbmon \
       switchover \
       mcs_monitor \
       mcs2
    maxctrl list servers
    $ cskeys
    $ scp 192.0.2.1:/var/lib/columnstore/.secrets /var/lib/columnstore/.secrets
    $ sudo chown mysql:mysql /var/lib/columnstore/.secrets
    $ sudo chmod 0400 /var/lib/columnstore/.secrets
    $ cspasswd util_user_passwd
    $ sudo mcsSetConfig CrossEngineSupport Password util_user_encrypted_passwd
    $ cspasswd --decrypt util_user_encrypted_passwd
    SELECT calGetTrace();
    viewtablelock
     There is 1 table lock
    
      Table                     LockID  Process   PID    Session   Txn  CreationTime               State    DBRoots
      hq_sales.invoices         1       cpimport  16301  BulkLoad  n/a  Wed April 7 14:20:42 2021  LOADING  1
    viewtablelock hq_sales invoices
     There is 1 table lock
    
      Table                     LockID  Process   PID    Session   Txn  CreationTime               State    DBRoots
      hq_sales.invoices         1       cpimport  16301  BulkLoad  n/a  Wed April 7 14:20:42 2021  LOADING  1
    cleartablelock 1
    Configure Enterprise ColumnStore

    Mandatory system variables and options for Single-Node Enterprise ColumnStore include:

character_set_server

Set this system variable to utf8.

collation_server

Set this system variable to utf8_general_ci.

columnstore_use_import_for_batchinsert

Set this system variable to ALWAYS to always use cpimport for LOAD DATA INFILE and INSERT INTO ... SELECT statements.

    Example Configuration
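[mariadb]
log_error                              = mariadbd.err
character_set_server                   = utf8
collation_server                       = utf8_general_ci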

    Start the Enterprise ColumnStore Services

    Start and enable the MariaDB Enterprise Server service, so that it starts automatically upon reboot:
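$ sudo systemctl start mariadb
$ sudo systemctl enable mariadb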

    Start and enable the MariaDB Enterprise ColumnStore service, so that it starts automatically upon reboot:
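Assuming the mariadb-columnstore unit name used by recent Enterprise ColumnStore packages:

$ sudo systemctl start mariadb-columnstore
$ sudo systemctl enable mariadb-columnstore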

    Create the Utility User

    Enterprise ColumnStore requires a mandatory utility user account. By default, it connects to the server using the root user with no password. MariaDB Enterprise Server 10.6 will reject this login attempt by default, so you will need to configure Enterprise ColumnStore to use a different user account and password and create this user account on Enterprise Server.

1. On the Enterprise ColumnStore node, create the user account with the CREATE USER statement:

2. On the Enterprise ColumnStore node, grant the user account SELECT privileges on all databases with the GRANT statement:

3. Configure Enterprise ColumnStore to use the utility user:

4. Set the password:
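A sketch of these four steps, assuming a utility user named util_user connecting over 127.0.0.1 (names and password are illustrative):

CREATE USER 'util_user'@'127.0.0.1' IDENTIFIED BY 'util_user_passwd';

GRANT SELECT ON *.* TO 'util_user'@'127.0.0.1';

$ sudo mcsSetConfig CrossEngineSupport User util_user
$ sudo mcsSetConfig CrossEngineSupport Password util_user_passwd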

    For details about how to encrypt the password, see "Credentials Management for MariaDB Enterprise ColumnStore".

    Passwords should meet your organization's password policies. If your MariaDB Enterprise Server instance has a password validation plugin installed, then the password should also meet the configured requirements.

    Configure Linux Security Modules (LSM)

    The specific steps to configure the security module depend on the operating system.

    Configure SELinux (CentOS, RHEL)

    Configure SELinux for Enterprise ColumnStore:

1. To configure SELinux, you have to install the packages required for audit2allow. On CentOS 7 and RHEL 7, install the following:

On RHEL 8, install the following:

2. Allow the system to run under load for a while to generate SELinux audit events.

3. After the system has taken some load, generate an SELinux policy from the audit events using audit2allow:

If no audit events were found, this will print the following:

4. If audit events were found, the new SELinux policy can be loaded using semodule:

5. Set SELinux to enforcing mode by setting SELINUX=enforcing in /etc/selinux/config.

For example, the file will usually look like this after the change:

6. Set SELinux to enforcing mode:
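A sketch of the commands for these steps, with an illustrative policy module name (mariadb_local); the grep pattern assumes the audit events of interest come from mysqld and the ColumnStore processes:

# Packages for audit2allow on CentOS 7 / RHEL 7:
$ sudo yum install policycoreutils policycoreutils-python
# Packages on RHEL 8:
$ sudo yum install policycoreutils python3-policycoreutils policycoreutils-python-utils

# Generate a policy module from collected audit events:
$ sudo grep mysqld /var/log/audit/audit.log | sudo audit2allow -M mariadb_local

# Load the generated policy:
$ sudo semodule -i mariadb_local.pp

# Switch to enforcing mode at runtime:
$ sudo setenforce enforcing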

    Configure AppArmor (Ubuntu)

    For information on how to create a profile, see How to create an AppArmor Profile on ubuntu.com.

    Next Step

    Navigation in the Single-Node Enterprise ColumnStore topology with Local storage deployment procedure:

    This page was step 3 of 5.

    Next: Step 4: Test MariaDB Enterprise ColumnStore.

    Test Local Connection

    Connect to the server using MariaDB Client using the root@localhost user account:
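For example, assuming socket (unix_socket) authentication for root:

$ sudo mariadb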

    Test ColumnStore Plugin Status

    Query and confirm that the ColumnStore storage engine plugin is ACTIVE:
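One way to check is via the information_schema.PLUGINS table:

SELECT PLUGIN_NAME, PLUGIN_STATUS
FROM information_schema.PLUGINS
WHERE PLUGIN_NAME = 'Columnstore';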

    Test ColumnStore Table Creation

1. Create a test database, if it does not exist:

2. Create a ColumnStore table:

3. Add sample data into the table:

4. Read data from the table:

    Test Cross Engine Join

1. Create an InnoDB table:

2. Add data to the table:

3. Perform a cross-engine join:
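A minimal end-to-end sketch of these tests; the database, tables, and sample values are illustrative, and the cross-engine join requires the utility user configured earlier:

-- ColumnStore side:
CREATE DATABASE IF NOT EXISTS test;
CREATE TABLE test.sample_cs (id INT, sample_value VARCHAR(64)) ENGINE=ColumnStore;
INSERT INTO test.sample_cs VALUES (1, 'cs-row-1'), (2, 'cs-row-2');
SELECT * FROM test.sample_cs;

-- InnoDB side and cross-engine join:
CREATE TABLE test.contacts (id INT, name VARCHAR(64)) ENGINE=InnoDB;
INSERT INTO test.contacts VALUES (1, 'Alice'), (2, 'Bob');
SELECT c.name, s.sample_value
FROM test.contacts c
JOIN test.sample_cs s ON s.id = c.id;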

    Next Step

    Navigation in the Single-Node Enterprise ColumnStore topology with Local storage deployment procedure:

    This page was step 4 of 5.

    Next: Step 5: Bulk Import of Data.


    Replicas utilize shared storage with other nodes in the cluster, ensuring data consistency without duplication. A key requirement is maintaining at least one RW node — a cluster consisting solely of read replicas is not operational and cannot process reads or writes.

    Read-only nodes are incompatible with S3 as the storage backend.

    Additionally, there is no automatic promotion of a read replica to RW mode if the only RW node fails, which could lead to temporary downtime until manual intervention.

    Key Features

    • Horizontal Read Scaling: Adds compute power for handling more read-intensive queries without impacting write performance.

    • Write Forwarding: Ensures writes on replicas are redirected to RW nodes, maintaining data integrity.

    • Shared Storage: Replicas access the same DBRoots as RW nodes, promoting efficiency and reducing storage overhead.

    Key Commands

    These commands require CMAPI.

• Add Read Replica. To introduce a read-only node for scaling reads, add the node through CMAPI (see the sketch after this list).

• Remove Node. To safely remove any node (RW or replica) from the cluster, use the node remove command (see below).

This reassigns resources as needed without cluster disruption.

• Verify Status. To monitor the cluster's health and node roles, check the cluster status:
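A sketch using the mcs CLI that ships with CMAPI; the IP address is illustrative, and the exact option for designating a read replica varies by release (check mcs cluster node add --help):

# Verify cluster status and node roles:
$ sudo mcs cluster status

# Remove a node (RW or replica):
$ sudo mcs cluster node remove --node 192.0.2.102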

    Limitations

    • Node addition is restricted to private IPs only.

    • Incompatible with S3 storage, limiting use to shared file systems.

    • No automatic failover or promotion mechanism if the sole RW node goes down, requiring manual recovery.

    • At least one RW node must always be present for the cluster to function properly, supporting both read and write operations.

    How-To

    Prerequisites

Ensure shared storage is mounted on all nodes (at /var/lib/columnstore/data1 for non-S3 configurations) to keep data consistent across RW nodes and read replicas.

Refer to the shared storage setup documentation for exact mount point details.

    Installation and Setup

    1

    Set Up MariaDB Repository

    Run the following to add the MariaDB repository (adjust "11.4" to the latest stable version):
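For example, with the mariadb_es_repo_setup script used elsewhere in this document (substitute your customer download token):

$ curl -LsSO https://dlm.mariadb.com/enterprise-release-helpers/mariadb_es_repo_setup
$ chmod +x mariadb_es_repo_setup
$ sudo ./mariadb_es_repo_setup --token="CUSTOMER_DOWNLOAD_TOKEN" --apply \
      --mariadb-server-version="11.4"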

See the MariaDB Enterprise Repository documentation for additional details about the ES repo setup.

    2

    Install Packages

    Run the following commands on all nodes.

    For RPM-based systems, run this command:
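A sketch, assuming the package names used elsewhere in this document plus the CMAPI package:

$ sudo yum install MariaDB-server \
   MariaDB-backup \
   MariaDB-shared \
   MariaDB-client \
   MariaDB-columnstore-engine \
   MariaDB-columnstore-cmapi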

Refer to the package installation documentation for additional information.

    For DEB-based systems, run these commands:
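Again assuming the Debian package names used elsewhere in this document, plus CMAPI:

$ sudo apt install mariadb-server \
   mariadb-backup \
   libmariadb3 \
   mariadb-client \
   mariadb-plugin-columnstore \
   mariadb-columnstore-cmapi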

    3

    Start and Enable Services
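A sketch, assuming the systemd unit names used by Enterprise ColumnStore packages; run on all nodes:

$ sudo systemctl enable --now mariadb
$ sudo systemctl enable --now mariadb-columnstore
$ sudo systemctl enable --now mariadb-columnstore-cmapi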

    4

    Configure the Initial RW Node

    On the primary RW node, set up the cluster API key (use a secure API key):
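The mcs CLI (installed with CMAPI) can store the key; the key value below is illustrative:

$ sudo mcs cluster set api-key --key "93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd"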

    5

    Add the Initial RW Node to the Cluster

    Run this from the primary RW node:
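Assuming the primary node's private IP is 192.0.2.101:

$ sudo mcs cluster node add --node 192.0.2.101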

    6

    Add Read Replica Nodes

    From the primary RW node, add each read replica:
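The command below is a sketch; the option that marks the node as a read replica varies by release, so check mcs cluster node add --help for your version:

$ sudo mcs cluster node add --node 192.0.2.102   # plus your version's read-replica option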

    7

    Verify the Cluster

    Check the status to ensure nodes are added and the cluster is healthy:
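$ sudo mcs cluster status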

    8

    Configure Replication Between Nodes

See the MariaDB replication documentation for instructions on how to set up replication, and the multi-node shared local storage procedure for how to create user accounts and configure replication.

    9

    Configure MaxScale

See the MaxScale configuration documentation for instructions.

    Optimize Linux Kernel Parameters

    MariaDB Enterprise ColumnStore performs best with Linux kernel optimizations.

    On each server to host an Enterprise ColumnStore node, optimize the kernel:

    1. Set the relevant kernel parameters in a sysctl configuration file. To ensure proper change management, use an Enterprise ColumnStore-specific configuration file.

    Create a /etc/sysctl.d/90-mariadb-enterprise-columnstore.conf file:

2. Use the sysctl command to set the kernel parameters at runtime:
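Representative contents for the file and the command to load it; the values mirror the kernel guidance published for Enterprise ColumnStore, but verify them against your release's deployment notes:

# minimize swapping
vm.swappiness = 1

# optimize Linux to cache directories and inodes
vm.vfs_cache_pressure = 10

# increase the TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# increase the TCP buffer limits (min, default, max bytes)
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

$ sudo sysctl --load=/etc/sysctl.d/90-mariadb-enterprise-columnstore.conf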

    Temporarily Configure Linux Security Modules (LSM)

    The Linux Security Modules (LSM) should be temporarily disabled on each Enterprise ColumnStore node during installation.

    The LSM will be configured and re-enabled later in this deployment procedure.

    The steps to disable the LSM depend on the specific LSM used by the operating system.

    CentOS / RHEL Stop SELinux

    SELinux must be set to permissive mode before installing MariaDB Enterprise ColumnStore.

    To set SELinux to permissive mode:

1. Set SELinux to permissive mode:

2. Set SELinux to permissive mode persistently by setting SELINUX=permissive in /etc/selinux/config.

For example, the file will usually look like this after the change:

3. Confirm that SELinux is in permissive mode:
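A sketch of these steps; the config file excerpt shows the lines that typically matter:

# 1. Runtime switch:
$ sudo setenforce permissive

# 2. Persistent setting in /etc/selinux/config:
#      SELINUX=permissive
#      SELINUXTYPE=targeted

# 3. Confirm (prints "Permissive"):
$ sudo getenforce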

    SELinux will be configured and re-enabled later in this deployment procedure. This configuration is not persistent. If you restart the server before configuring and re-enabling SELinux later in the deployment procedure, you must reset the enforcement to permissive mode.

    Debian / Ubuntu AppArmor

    AppArmor must be disabled before installing MariaDB Enterprise ColumnStore.

    1. Disable AppArmor:

2. Reboot the system.

3. Confirm that no AppArmor profiles are loaded using aa-status:
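A sketch of these steps; after the reboot, aa-status should report that zero profiles are loaded:

$ sudo systemctl disable apparmor
$ sudo reboot
$ sudo aa-status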

    AppArmor will be configured and re-enabled later in this deployment procedure.

    Configure Character Encoding

    When using MariaDB Enterprise ColumnStore, it is recommended to set the system's locale to UTF-8.

    1. On RHEL 8, install additional dependencies:

2. Set the system's locale to en_US.UTF-8 by executing localedef:
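A sketch of both steps; the RHEL 8 package names are the usual glibc locale packages (verify for your image):

$ sudo yum install glibc-locale-source glibc-langpack-en
$ sudo localedef -i en_US -f UTF-8 en_US.UTF-8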

    Create an S3 Bucket

    If you want to use S3-compatible storage, it is important to create the S3 bucket before you start ColumnStore. If you already have an S3 bucket, confirm that the bucket is empty.

    S3 bucket configuration will be performed later in this procedure.

    Next Step

    Navigation in the Single-Node Enterprise ColumnStore topology with Object storage deployment procedure:

    This page was step 1 of 5.

    Next: Step 2: Install MariaDB Enterprise ColumnStore.

• Some types of job steps perform operations in a multi-threaded manner, using a thread pool. Multi-threaded operations provide vertical scalability.

    As you increase the number of ColumnStore nodes or the number of cores on each node, Enterprise ColumnStore can use those resources to more efficiently execute job steps.

    For additional information, see "MariaDB Enterprise ColumnStore Query Evaluation.".

    Batch Primitive Step (BPS)

    Enterprise ColumnStore defines a batch primitive step to handle many types of tasks, such as scanning/filtering columns, JOIN operations, aggregation, functional filtering, and projecting (putting values into a SELECT list).

    In calGetTrace() output, a batch primitive step is abbreviated BPS.

    Batch primitive steps are evaluated on multiple nodes in parallel. The PrimProc process on each node evaluates the batch primitive step to one extent at a time. The PrimProc process uses a thread pool to operate on individual blocks within the extent in parallel.

    Cross Engine Step (CES)

    Enterprise ColumnStore defines a cross-engine step to perform cross-engine joins, in which a ColumnStore table is joined with a table that uses a different storage engine.

    In calGetTrace() output, a cross-engine step is abbreviated CES.

    Cross-engine steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

    Enterprise ColumnStore can perform cross-engine joins when the mandatory utility user is properly configured.

For additional information, refer to the "Mandatory Utility User Account" documentation.

    Dictionary Structure Step (DSS)

    Enterprise ColumnStore defines a dictionary structure step to scan the dictionary extents that ColumnStore uses to store variable-length string values.

    In calGetTrace() output, a dictionary structure step is abbreviated DSS.

    Dictionary structure steps are evaluated on multiple nodes in parallel. The PrimProc process on each node evaluates the dictionary structure step to one extent at a time. It uses a thread pool to operate on individual blocks within the extent in parallel.

    Dictionary structure steps can require a lot of I/O for a couple of reasons:

    • Dictionary structure steps do not support extent elimination, so all extents for the column must be scanned.

    • Dictionary structure steps must read the column extents to find each pointer and the dictionary extents to find each value, so it doubles the number of extents to scan.

    It is generally recommended to avoid queries that will cause dictionary scans.

    For additional information, see "Avoid Creating Long String Columns".

    Hash Join Step (HJS)

    Enterprise ColumnStore defines a hash join step to perform a hash join between two tables.

    In calGetTrace() output, a hash join step is abbreviated HJS.

    Hash join steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

Enterprise ColumnStore performs the hash join in memory by default. If you perform large joins, you may be able to get better performance by changing some configuration defaults with mcsSetConfig:

    • Enterprise ColumnStore can be configured to use more memory for in-memory hash joins.

    • Enterprise ColumnStore can be configured to use disk-based joins.

    For additional information, see "Configure in-memory joins" and "Configure Disk-Based Joins".

    Having Step (HVS)

    Enterprise ColumnStore defines a having step to evaluate a HAVING clause on a result set.

    In calGetTrace() output, a having step is abbreviated HVS.

    Subquery Step (SQS)

    Enterprise ColumnStore defines a subquery step to evaluate a subquery.

    In calGetTrace() output, a subquery step is abbreviated SQS.

    Tuple Aggregation Step (TAS)

    Enterprise ColumnStore defines a tuple aggregation step to collect intermediate aggregation prior to the final aggregation and evaluation of the results.

    In calGetTrace() output, a tuple aggregation step is abbreviated TAS.

    Tuple aggregation steps are primarily evaluated by the ExeMgr process on the initiator/aggregator node. However, the PrimProc process on each node also plays a role, since the PrimProc process on each node provides the intermediate aggregation results to the ExeMgr process on the initiator/aggregator node.

    Tuple Annexation Step (TNS)

    Enterprise ColumnStore defines a tuple annexation step to perform the final aggregation and evaluation of the results.

    In calGetTrace() output, a tuple annexation step is abbreviated TNS.

    Tuple annexation steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

    Enterprise ColumnStore 5 performs aggregation operations in memory. As a consequence, more complex aggregation operations require more memory in that version.

    In Enterprise ColumnStore 6, disk-based aggregations can be enabled.

    For additional information, see "Configure Disk-Based Aggregations".

    Tuple Union Step (TUS)

    Enterprise ColumnStore defines a tuple union step to perform a union of two subqueries.

    In calGetTrace() output, a tuple union step is abbreviated TUS.

    Tuple union steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

    Tuple Constant Step (TCS)

    Enterprise ColumnStore defines a tuple constant step to evaluate constant values.

    In calGetTrace() output, a tuple constant step is abbreviated TCS.

    Tuple constant steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

    Window Function Step (WFS)

    Enterprise ColumnStore defines a window function step to evaluate window functions.

    In calGetTrace() output, a window function step is abbreviated WFS.

    Window function steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

    Test S3 Connection

    MariaDB Enterprise ColumnStore 23.10 includes a testS3Connection command to test the S3 configuration, permissions, and connectivity.

    On each Enterprise ColumnStore node, test the S3 configuration:
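For example:

$ sudo testS3Connection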

    If the testS3Connection command does not return OK, investigate the S3 configuration.

    Test Local Connection

Connect to the server using MariaDB Client using the root@localhost user account:
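For example, assuming socket (unix_socket) authentication for root:

$ sudo mariadb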

    Test ColumnStore Plugin Status

    Query and confirm that the ColumnStore storage engine plugin is ACTIVE:
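One way to check is via the information_schema.PLUGINS table:

SELECT PLUGIN_NAME, PLUGIN_STATUS
FROM information_schema.PLUGINS
WHERE PLUGIN_NAME = 'Columnstore';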

    Test ColumnStore Table Creation

1. Create a test database, if it does not exist:

2. Create a ColumnStore table:

3. Add sample data into the table:

4. Read data from the table:

    Test Cross Engine Join

1. Create an InnoDB table:

2. Add data to the table:

3. Perform a cross-engine join:
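A minimal end-to-end sketch of these tests; the database, tables, and sample values are illustrative, and the cross-engine join requires the utility user configured earlier:

-- ColumnStore side:
CREATE DATABASE IF NOT EXISTS test;
CREATE TABLE test.sample_cs (id INT, sample_value VARCHAR(64)) ENGINE=ColumnStore;
INSERT INTO test.sample_cs VALUES (1, 'cs-row-1'), (2, 'cs-row-2');
SELECT * FROM test.sample_cs;

-- InnoDB side and cross-engine join:
CREATE TABLE test.contacts (id INT, name VARCHAR(64)) ENGINE=InnoDB;
INSERT INTO test.contacts VALUES (1, 'Alice'), (2, 'Bob');
SELECT c.name, s.sample_value
FROM test.contacts c
JOIN test.sample_cs s ON s.id = c.id;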

    Next Step

    Navigation in the Single-Node Enterprise ColumnStore topology with Object storage deployment procedure:

    This page was step 4 of 5.

    Next: Step 5: Bulk Import of Data.

    Retrieve Download Token

    MariaDB Corporation provides package repositories for CentOS / RHEL (YUM) and Debian / Ubuntu (APT). A download token is required to access the MariaDB Enterprise Repository.

    Customer Download Tokens are customer-specific and are available through the MariaDB Customer Portal.

    To retrieve the token for your account:

    1. Navigate to https://customers.mariadb.com/downloads/token/

    2. Log in.

    3. Copy the Customer Download Token.

    Substitute your token for CUSTOMER_DOWNLOAD_TOKEN when configuring the package repositories.

    Set Up Repository

    1. On each Enterprise ColumnStore node, install the prerequisites for downloading the software from the Web. Install on CentOS / RHEL (YUM):

    Install on Debian / Ubuntu (APT):

2. On each Enterprise ColumnStore node, configure package repositories and specify Enterprise Server:

    Checksums of the various releases of the mariadb_es_repo_setup script can be found in the section at the bottom of the page. Substitute ${checksum} in the example above with the latest checksum.

    Install Enterprise ColumnStore

    Install additional dependencies:

Install on CentOS / RHEL (YUM):

Install on Debian 10 and Ubuntu 20.04 (APT):

Install on Debian 9 and Ubuntu 18.04 (APT):

    Install MariaDB Enterprise Server and MariaDB Enterprise ColumnStore:

    Install on CentOS / RHEL (YUM):

    Install on Debian / Ubuntu (APT):

    Next Step

    Navigation in the Single-Node Enterprise ColumnStore topology with Local storage deployment procedure:

    This page was step 2 of 5.

    Next: Step 3: Start and Configure MariaDB Enterprise ColumnStore.


    Step 2: Configure Shared Local Storage

    Step 2: Configure Shared Local Storage

    Overview

    This page details step 2 of the 9-step procedure "Deploy ColumnStore Object Storage Topology".

    This step configures shared local storage on systems hosting Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Directories for Shared Local Storage

    In a ColumnStore Object Storage topology, MariaDB Enterprise ColumnStore requires the Storage Manager directory to be located on shared local storage.

    The Storage Manager directory is at the following path:

    • /var/lib/columnstore/storagemanager

    The Storage Manager directory must be mounted on every ColumnStore node.

    Choose a Shared Local Storage Solution

    Select a Shared Local Storage solution for the Storage Manager directory:

    For additional information, see "".

    Configure EBS Multi-Attach

    EBS is a high-performance block-storage service for AWS (Amazon Web Services). EBS Multi-Attach allows an EBS volume to be attached to multiple instances in AWS. Only clustered file systems, such as GFS2, are supported.

    For Enterprise ColumnStore deployments in AWS:

    • EBS Multi-Attach is a recommended option for the Storage Manager directory.

    • Amazon S3 storage is the recommended option for data.

    • Consult the vendor documentation for details on how to configure EBS Multi-Attach.

    Configure Elastic File System (EFS)

EFS is a scalable, elastic, cloud-native NFS file system for AWS (Amazon Web Services).

    For deployments in AWS:

    • EFS is a recommended option for the Storage Manager directory.

    • Amazon S3 storage is the recommended option for data.

    • Consult the vendor documentation for details on how to configure EFS.

    Configure Filestore

    Filestore is high-performance, fully managed storage for GCP (Google Cloud Platform).

    For Enterprise ColumnStore deployments in GCP:

    • Filestore is the recommended option for the Storage Manager directory.

    • Google Object Storage (S3-compatible) is the recommended option for data.

    • Consult the vendor documentation for details on how to configure Filestore.

    Configure GlusterFS

    GlusterFS is a distributed file system. GlusterFS is a shared local storage option, but it is not one of the recommended options.

    For more information, see "".

    Install GlusterFS

    On each Enterprise ColumnStore node, install GlusterFS.

    Install on CentOS / RHEL 8 (YUM):

    Install on CentOS / RHEL 7 (YUM):

    Install on Debian (APT):

    Install on Ubuntu (APT):

    Start the GlusterFS Daemon

    Start the GlusterFS daemon:

    Probe the GlusterFS Peers

    Before you can create a volume with GlusterFS, you must probe each node from a peer node.

1. On the primary node, probe all of the other cluster nodes:

2. On one of the replica nodes, probe the primary node to confirm that it is connected:

3. On the primary node, check the peer status:
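A sketch of the probe sequence, assuming three nodes with the illustrative hostnames mcs1 (primary), mcs2, and mcs3:

# On the primary node:
$ sudo gluster peer probe mcs2
$ sudo gluster peer probe mcs3
# On one of the replica nodes:
$ sudo gluster peer probe mcs1
# On the primary node:
$ sudo gluster peer status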

    Configure and Mount GlusterFS Volumes

    Create the GlusterFS volumes for MariaDB Enterprise ColumnStore. Each volume must have the same number of replicas as the number of Enterprise ColumnStore nodes.

1. On each Enterprise ColumnStore node, create the directory for each brick in the /brick directory:

2. On the primary node, create the GlusterFS volumes:

3. On the primary node, start the volume:

4. On each Enterprise ColumnStore node, create mount points for the volumes:

5. On each Enterprise ColumnStore node, add the mount points to /etc/fstab:

6. On each Enterprise ColumnStore node, mount the volumes:
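A sketch of these steps for the Storage Manager volume, using the same illustrative hostnames; the brick paths and fstab line are representative:

# 1. On each node:
$ sudo mkdir -p /brick/storagemanager
# 2. On the primary node:
$ sudo gluster volume create storagemanager replica 3 \
      mcs1:/brick/storagemanager mcs2:/brick/storagemanager mcs3:/brick/storagemanager
# 3. On the primary node:
$ sudo gluster volume start storagemanager
# 4. On each node:
$ sudo mkdir -p /var/lib/columnstore/storagemanager
# 5. On each node, add to /etc/fstab:
#      localhost:/storagemanager /var/lib/columnstore/storagemanager glusterfs defaults,_netdev 0 0
# 6. On each node:
$ sudo mount -a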

    Configure Network File System (NFS)

    NFS is a distributed file system. NFS is available in most Linux distributions. If NFS is used for an Enterprise ColumnStore deployment, the storage must be mounted with the sync option to ensure that each node flushes its changes immediately.

    For on-premises deployments:

    • NFS is the recommended option for the Storage Manager directory.

    • Any S3-compatible storage is the recommended option for data.

    Consult the documentation for your NFS implementation for details on how to configure NFS.

    Next Step

    Navigation in the procedure "Deploy ColumnStore Object Storage Topology":

    This page was step 2 of 9.

    ColumnStore Storage Engine

    Overview

    MariaDB Enterprise ColumnStore integrates with MariaDB Enterprise Server using the ColumnStore storage engine plugin. The ColumnStore storage engine plugin enables MariaDB Enterprise Server to interact with ColumnStore tables.

    For deployment instructions and available documentation, see "MariaDB Enterprise ColumnStore."

    The ColumnStore storage engine has the following features:

    Feature
    Detail
    Resources

    Examples

    Creating a ColumnStore Table

To create a ColumnStore table, use the CREATE TABLE statement with the ENGINE=ColumnStore option:
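For example (table and columns are illustrative):

CREATE TABLE orders (
   order_id BIGINT,
   customer_id BIGINT,
   total DECIMAL(12,2),
   ordered_at DATETIME
) ENGINE=ColumnStore;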

    Multi-Node Configuration

    To deploy a multi-node Enterprise ColumnStore deployment, a configuration similar to below is required:
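A sketch combining the mandatory ColumnStore settings shown earlier with typical replication settings for a multi-node deployment; all values are illustrative:

[mariadb]
bind_address                           = 0.0.0.0
character_set_server                   = utf8
collation_server                       = utf8_general_ci
columnstore_use_import_for_batchinsert = ALWAYS

# Replication settings (multi-node); values illustrative:
server_id                              = 1
log_bin                                = mariadb-bin
log_slave_updates                      = ON
gtid_strict_mode                       = ON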

    Configure the Mandatory Utility User Account

    To configure the mandatory utility user account, use the mcsSetConfig command:
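A sketch based on the CrossEngineSupport parameters used elsewhere in this document; the user name and password are illustrative:

$ sudo mcsSetConfig CrossEngineSupport Host 127.0.0.1
$ sudo mcsSetConfig CrossEngineSupport Port 3306
$ sudo mcsSetConfig CrossEngineSupport User util_user
$ sudo mcsSetConfig CrossEngineSupport Password util_user_encrypted_passwd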

    Topologies Overview

    MariaDB offers varied deployment topologies by workload and technology, each named and diagrammed with benefits listed. Custom configurations are also supported.

    MariaDB products can be deployed in many different topologies. The topologies described in this section are representative of the overall structure. MariaDB products can be deployed to form other topologies, leverage advanced product capabilities, or combine the capabilities of multiple topologies.

    Topologies are the arrangements of nodes and links to achieve a purpose. This documentation describes a few of the many topologies that can be deployed using MariaDB database products.

    We group topologies by workload (transactional, analytical, or hybrid) and technologies (Enterprise Spider). Single-node topologies are listed separately.

    To help you select the correct topology:

    • Each topology is named, and this name is used consistently throughout the documentation.

    • A thumbnail diagram provides a small-scale summary of the topology's architecture.

    • Finally, we provide a list of the benefits of the topology.

    Although multiple topologies are listed on this page, the listed topologies are not the only options. MariaDB products are flexible, configurable, and extensible, so it is possible to deploy different topologies that combine the capabilities of multiple topologies listed on this page. The topologies listed on this page are primarily intended to be representative of the most commonly requested use cases.

    Transactional (OLTP)

    Primary/Replica Topology

    Diagram
    Features

    Galera Cluster Topology

    Diagram
    Features

    Analytical (OLAP, Data Warehousing, DSS)

    ColumnStore Shared Local Storage Topology

    Diagram
    Features

    ColumnStore Object Storage Topology

    Diagram
    Features

    Hybrid Workloads

    HTAP Topology

    Diagram
    Features

    Optimizing Linux Kernel Parameters for MariaDB ColumnStore

    This page provides information on optimizing Linux kernel parameters for improved performance with MariaDB ColumnStore.

    Introduction

    MariaDB ColumnStore is a high-performance columnar database designed for analytical workloads. By optimizing the Linux kernel parameters, you can further enhance the performance of your MariaDB ColumnStore deployments.

    Recommended Parameters

    The following table lists the recommended optimized Linux kernel parameters for MariaDB ColumnStore:

For more information, refer to the Linux kernel sysctl documentation.

    Parameter
    Recommended Value
    Explanation

    Configuration Example

    To configure these parameters, you can add them to the /etc/sysctl.conf file. For example:
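The values below are representative; they mirror the Enterprise ColumnStore kernel guidance used earlier in this document:

vm.swappiness = 1
vm.vfs_cache_pressure = 10
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216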

    After making changes to the /etc/sysctl.conf file, you need to apply the changes by running the following command:
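$ sudo sysctl -p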

    Increase the Limit for Memory-Mapped Areas
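The relevant parameter is vm.max_map_count, which bounds the number of memory-mapped areas a process may use; the value below is illustrative and should be sized to your deployment:

vm.max_map_count = 1048576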

    Common Use Cases

    These optimized parameters are recommended for all MariaDB ColumnStore deployments, regardless of the specific workload. They can improve performance for various use cases, including:

    • Large-scale data warehousing

    • Real-time analytics

    • Business intelligence

    • Machine learning

    Related Links

    Conclusion

    By optimizing the Linux kernel parameters, you can significantly improve the performance of your MariaDB ColumnStore deployments. These recommendations provide a starting point for optimizing your system, and you may need to adjust the values based on your specific hardware and workload.

    Query Tuning Recommendations

    When tuning queries for MariaDB Enterprise ColumnStore, there are some important details to consider.

    Avoid Selecting Unnecessary Columns

    Enterprise ColumnStore only reads the columns that are necessary to resolve a query.

    For example, the following query selects every column in the table:
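For instance, with a hypothetical orders table:

SELECT * FROM orders;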

    Whereas the following query only selects two columns in the table, so it requires less I/O:
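SELECT order_id, total FROM orders;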

    For best performance, only select the columns that are necessary to resolve a query.

    Avoid Large Sorts

    When Enterprise ColumnStore performs ORDER BY and LIMIT operations, the operations are performed in a single-threaded manner after the rest of the query processing has been completed, and the full unsorted result-set has been retrieved. For large data sets, the performance overhead can be significant.

    Avoid Excessive Aggregations

    When Enterprise ColumnStore 5 performs aggregations (i.e., DISTINCT, GROUP BY, COUNT(*), etc.), all of the aggregation work happens in-memory by default. As a consequence, more complex aggregation operations require more memory in that version.

    For example, the following query could require a lot of memory in Enterprise ColumnStore 5, since it has to calculate many distinct values in memory:
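For instance (the events table and its columns are hypothetical):

SELECT COUNT(DISTINCT user_id), COUNT(DISTINCT session_id), COUNT(DISTINCT page_url)
FROM events;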

    Whereas the following query could require much less memory in Enterprise ColumnStore 5, since it has to calculate fewer distinct values:
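SELECT COUNT(DISTINCT country)
FROM events;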

    In Enterprise ColumnStore 6, disk-based aggregations can be enabled.

    For best performance, avoid excessive aggregations or enable disk-based aggregations.

    For additional information, see "".

    Avoid Non-Distributed Functions

    When Enterprise ColumnStore evaluates built-in functions and aggregate functions, it can often evaluate the function in a distributed manner. Distributed evaluation of functions can significantly improve performance.

    Enterprise ColumnStore supports distributed evaluation for some built-in functions. For other built-in functions, the function must be evaluated serially on the final result set.

Enterprise ColumnStore also supports distributed evaluation for user-defined aggregate functions developed with the Distributed User Defined Aggregate Functions (UDAF) C++ API. For functions developed with Enterprise Server's standard User-Defined Function (UDF) API, the function must be evaluated serially on the final result set.

    For best performance, avoid non-distributed functions.

    Optimize Large Joins

    By default, Enterprise ColumnStore performs all joins as in-memory hash joins.

    If the joined tables are very large, the in-memory hash join can require too much memory for the default configuration. There are a couple options to work around this:

    • Enterprise ColumnStore can be configured to use more memory for in-memory hash joins.

    • Enterprise ColumnStore can be configured to use disk-based joins.

    • Enterprise ColumnStore can use optimizer statistics to better optimize the join order.

    For additional information, see "", "", and "".

    Load Ordered Data in Proper Order

Enterprise ColumnStore uses extent elimination to optimize queries. Extent elimination uses the minimum and maximum values in the extent map to determine which extents can be skipped for a query.

    When data is loaded into Enterprise ColumnStore, it appends the data to the latest extent. When an extent reaches the maximum number of column values, Enterprise ColumnStore creates a new extent. As a consequence, if ordered data is loaded in its proper order, then similar values will be clustered together in the same extent. This can improve query performance, because extent elimination performs best when similar values are clustered together.

    For example, if you expect to query a table with a filter on a timestamp column, you should sort the data using the timestamp column before loading it into Enterprise ColumnStore. Later, when the table is queried with a filter on the timestamp column, Enterprise ColumnStore would be able to skip many extents using extent elimination.

    For best performance, load ordered data in proper order.

    Enable Decimal Overflow Checks

When Enterprise ColumnStore performs mathematical operations with very big values using the DECIMAL, NUMERIC, and FIXED data types, the operation can sometimes overflow ColumnStore's maximum precision or scale. The maximum precision and scale depend on the version of Enterprise ColumnStore:

    • In Enterprise ColumnStore 6, the maximum precision (M) is 38, and the maximum scale (D) is 38.

    • In Enterprise ColumnStore 5, the maximum precision (M) is 18, and the maximum scale (D) is 18.

    In Enterprise ColumnStore 6, applications can configure Enterprise ColumnStore to check for decimal overflows by setting the columnstore_decimal_overflow_check system variable, but only when the column has a decimal precision that is 18 or more:
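For example, for the current session:

SET SESSION columnstore_decimal_overflow_check = ON;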

    When decimal overflow checks are enabled, math operations have extra overhead.

When the decimal overflow check fails, MariaDB Enterprise ColumnStore raises an error with the ER_INTERNAL_ERROR SQL error code and writes detailed information about the overflow check failure to the ColumnStore system logs.

    User-Defined Aggregate Function (UDAF) C++ API

    MariaDB Enterprise ColumnStore supports Enterprise Server's standard User-Defined Function (UDF) API. However, UDFs developed using that API cannot be executed in a distributed manner.

    To support distributed execution of custom SQL, MariaDB Enterprise ColumnStore supports a Distributed User Defined Aggregate Functions (UDAF) C++ API:

    • The Distributed User Defined Aggregate Functions (UDAF) C++ API allows anyone to create aggregate functions of arbitrary complexity for distributed execution in the ColumnStore storage engine.

    • These functions can also be used as Analytic (Window) functions just like any built-in aggregate function.

    Performance Related Configuration Settings

    MariaDB ColumnStore

    Introduction

    A number of system configuration variables exist to allow fine tuning of the system to suit the physical hardware and query characteristics. In general the default values will work relatively well for many cases.

The configuration parameters are maintained in the /etc/Columnstore.xml file. In a multiple-server deployment, these should only be edited on the PM1 server, as the file is automatically replicated to the other servers by the system. A system restart is required for a configuration change to take effect.

Convenience utility programs getConfig and setConfig are available to safely update Columnstore.xml without needing to edit the XML by hand. The -h argument displays usage information.

    Memory Management

NumBlocksPct

The NumBlocksPct configuration parameter specifies the percentage of physical memory to utilize for disk block caching. Depending on the version and deployment type, the default value is 25 or 50; the goal is to leave enough physical memory for other processes.

TotalUmMemory

The TotalUmMemory configuration parameter specifies the percentage of physical memory to utilize for joins, intermediate results, and set operations. This is an upper limit for small-table results in joins rather than a pre-allocation of memory. Depending on the version and deployment type, the default value is 25 or 50.

    Memory Requirements

    In a single server or combined deployment, the sum of NumBlocksPct and TotalUmMemory should typically not exceed 75% of physical memory. With very large memory servers this could be raised but the key point is to leave enough memory for other processes including mariadbd.

    From version 1.2.2, these can be set to static numeric limits instead of percentages by entering a number with 'M' or 'G' at the end to signify MiB or GiB.

    Query Concurrency - MaxOutstandingRequests

ColumnStore handles concurrent query execution by managing the rate of concurrent batch primitive steps. This is configured using the MaxOutstandingRequests parameter, which has a default value of 20. Each batch primitive step is executed within the context of one column extent, according to this high-level process:

    • ColumnStore issues up to MaxOutstandingRequests number of batch primitive steps.

• PrimProc processes the request using many threads and returns its response. These responses generally take from a fraction of a second up to a few seconds, depending on the amount of physical I/O and the performance of that storage.

    • ColumnStore issues new requests as prior requests complete maintaining the maximum number of outstanding requests.

This scheme allows large queries to use all available resources when they are not otherwise being consumed, and allows smaller queries to execute with minimal delay. Lower values optimize for higher throughput of smaller queries, while larger values optimize for the response time of individual large queries. The default value should work well under most circumstances; however, the value should be increased as the number of nodes increases.

The number of queries currently running and the number currently queued can be checked with the ColumnStore administrative utilities.

    Join Tuning - PmMaxMemorySmallSide

ColumnStore maintains statistics for tables and uses them to determine which of the two tables in a join is larger, based both on the number of blocks in each table and on an estimate of predicate cardinality. The first step is to apply any filters to the smaller table and return that data set to memory. The size of this data set is compared against the PmMaxMemorySmallSide configuration parameter, which has a default value of 64 (MB) and can be set as high as 4 GB. The default allows approximately 1M rows on the small table side to be joined against billions (or trillions) of rows on the large table side. If the size of the small data set is less than PmMaxMemorySmallSide, the data set is sent to PrimProc for creation of a distributed hashmap. This setting is therefore important to join tuning, since it determines whether the join can be distributed. It should be set to support your largest expected small-table join size, up to available memory:

• Although this increases the amount of data sent between nodes to support the join, it means that the join and subsequent aggregates are pushed down and scaled out, and a smaller data set is returned back.

    • In a multiple server deployment, the sizing should be based from available physical memory on the servers, how much memory to reserve for block caching, and the number of simultaneous join operations that can be expected to run times the average small table join data size.

    Multi-Table Join Tuning

The above logic for a single-table join extrapolates to multi-table joins, where the small-table values are precalculated and performed as one single scan against the large table. This works well for the typical star schema case, joining multiple dimension tables with a large fact table. For some join scenarios it may be necessary to sequence joins to create intermediate data sets for joining; this would happen, for instance, with a snowflake schema structure. In some extreme cases it may be hard for the optimizer to determine the most optimal join path. In this case, a hint is available to force a join ordering. The INFINIDB_ORDERED hint forces the first table in the FROM clause to be treated as the largest table, overriding any statistics-based decision. For example:
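A sketch of the historical hint syntax (tables are illustrative; note the deprecation below):

SELECT /*! INFINIDB_ORDERED */ r.name, COUNT(*)
FROM region r
JOIN nation n ON n.region_id = r.id
GROUP BY r.name;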

Note: INFINIDB_ORDERED is deprecated and no longer works in ColumnStore 1.2 and above. For ColumnStore 1.2 and 1.3, use SET infinidb_ordered_only=ON;. For ColumnStore 1.4, use SET columnstore_ordered_only=ON;.

    Disk-Based Joins - AllowDiskBasedJoin

When the small side of a join exceeds the PmMaxMemorySmallSide setting, the join is performed in memory rather than distributed. For very large joins, this could exceed the available memory, in which case the condition is detected and a query error is reported. Several configuration parameters are available to enable and configure usage of disk overflow should this occur:

    • AllowDiskBasedJoin – Controls the option to use disk Based joins or not. Valid values are Y (enabled) or N (disabled). By default, this option is disabled.

    • TempFileCompression – Controls whether the disk join files are compressed or noncompressed. Valid values are Y (use compressed files) or N (use non-compressed files).

• TempFilePath – The directory path used for the disk joins. By default, this path is the tmp directory for your installation (i.e., /tmp/columnstore_tmp_files). Files named infinidb-join-data* are created in this directory as needed and are removed when the join completes.

    A MariaDB global or session variable is available to specify a memory limit at which point the query is switched over to disk-based joins:

    • infinidb_um_mem_limit - Memory limit in MB per user (i.e., switch to disk-based join if this limit is exceeded). By default, this limit is not set (value of 0).
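
    The following sketch enables disk overflow and points the temporary files at a dedicated partition (assuming these parameters live under the HashJoin section of Columnstore.xml; the path is a placeholder), then sets the per-user memory limit from a client session:

    $ sudo mcsSetConfig HashJoin AllowDiskBasedJoin Y
    $ sudo mcsSetConfig HashJoin TempFileCompression Y
    $ sudo mcsSetConfig HashJoin TempFilePath /columnstore-tmp/join-files

    -- per session: switch to disk-based joins beyond 4096 MB
    SET infinidb_um_mem_limit = 4096;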

    Step 3: Start and Configure Enterprise ColumnStore

    Overview

    This page details step 3 of a 5-step procedure for deploying the Single-Node Enterprise ColumnStore topology with Object storage.

    This step starts and configures MariaDB Enterprise Server and MariaDB Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Step 1: Prepare ColumnStore Nodes

    Overview

    This page details step 1 of the 9-step procedure "Deploy ColumnStore Shared Local Storage Topology".

    This step prepares systems to host MariaDB Enterprise Server and MariaDB Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Step 2: Configure Shared Local Storage

    Overview

    This page details step 2 of the 9-step procedure "Deploy ColumnStore Shared Local Storage Topology".

    This step configures shared local storage on systems hosting Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Backup and Restore with Object Storage

    Overview

    MariaDB Enterprise ColumnStore supports backup and restore. If Enterprise ColumnStore uses S3-compatible object storage for data and shared local storage for the Storage Manager directory, the S3 bucket, the Storage Manager directory, and the MariaDB data directory must be backed up separately.

    Recovery Planning

    Step 1: Prepare ColumnStore Nodes

    Overview

    This page details step 1 of the 9-step procedure "Deploy ColumnStore Object Storage Topology".

    This step prepares systems to host MariaDB Enterprise Server and MariaDB Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Backup and Restore with Shared Local Storage

    Overview

    MariaDB Enterprise ColumnStore supports backup and restore. If Enterprise ColumnStore uses shared local storage for the DB Root directories, the DB Root directories and the MariaDB data directory must be backed up separately.

    Recovery Planning

    Data Import

    Learn how to import data into MariaDB ColumnStore. This section covers various methods and tools for efficiently loading large datasets into your columnar database for analytical workloads.

    Overview

    MariaDB Enterprise ColumnStore performs bulk data loads very efficiently using a variety of mechanisms, including the cpimport tool, specialized handling of certain SQL statements, and minimal locking during data import.

    maxctrl set server \
       mcs3 \
       maintenance
    maxctrl set server \
       mcs3 \
       maintenance \
       --force
    maxctrl list servers
    maxctrl clear server \
       mcs3 \
       maintenance
    maxctrl list servers
    [mariadb]
    log_error                              = mariadbd.err
    character_set_server                   = utf8
    collation_server                       = utf8_general_ci
    $ sudo systemctl start mariadb
    
    $ sudo systemctl enable mariadb
    $ sudo systemctl start mariadb-columnstore
    
    $ sudo systemctl enable mariadb-columnstore
    CREATE USER 'util_user'@'127.0.0.1'
    IDENTIFIED BY 'util_user_passwd';
    GRANT SELECT, PROCESS ON *.*
    TO 'util_user'@'127.0.0.1';
    $ sudo mcsSetConfig CrossEngineSupport Host 127.0.0.1
    
    $ sudo mcsSetConfig CrossEngineSupport Port 3306
    
    $ sudo mcsSetConfig CrossEngineSupport User util_user
    $ sudo mcsSetConfig CrossEngineSupport Password util_user_passwd
    $ sudo yum install policycoreutils policycoreutils-python
    $ sudo yum install policycoreutils python3-policycoreutils policycoreutils-python-utils
    $ sudo grep mysqld /var/log/audit/audit.log | audit2allow -M mariadb_local
    $ sudo grep mysqld /var/log/audit/audit.log | audit2allow -M mariadb_local
    
    Nothing to do
    $ sudo semodule -i mariadb_local.pp
    # This file controls the state of SELinux on the system.
    # SELINUX= can take one of these three values:
    #     enforcing - SELinux security policy is enforced.
    #     permissive - SELinux prints warnings instead of enforcing.
    #     disabled - No SELinux policy is loaded.
    SELINUX=enforcing
    # SELINUXTYPE= can take one of three values:
    #     targeted - Targeted processes are protected,
    #     minimum - Modification of targeted policy. Only selected processes are protected.
    #     mls - Multi Level Security protection.
    SELINUXTYPE=targeted
    $ sudo setenforce enforcing
    $ sudo mariadb
    Welcome to the MariaDB monitor.  Commands end with ; or \g.
    Your MariaDB connection id is 38
    Server version: 11.4.5-3-MariaDB-Enterprise MariaDB Enterprise Server
    
    Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
    
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    
    MariaDB [(none)]>
    SELECT PLUGIN_NAME, PLUGIN_STATUS
    FROM information_schema.PLUGINS
    WHERE PLUGIN_LIBRARY LIKE 'ha_columnstore%';
    +---------------------+---------------+
    | PLUGIN_NAME         | PLUGIN_STATUS |
    +---------------------+---------------+
    | Columnstore         | ACTIVE        |
    | COLUMNSTORE_COLUMNS | ACTIVE        |
    | COLUMNSTORE_TABLES  | ACTIVE        |
    | COLUMNSTORE_FILES   | ACTIVE        |
    | COLUMNSTORE_EXTENTS | ACTIVE        |
    +---------------------+---------------+
    CREATE DATABASE IF NOT EXISTS test;
    CREATE TABLE IF NOT EXISTS test.contacts (
       first_name VARCHAR(50),
       last_name VARCHAR(50),
       email VARCHAR(100)
    ) ENGINE=ColumnStore;
    INSERT INTO test.contacts (first_name, last_name, email)
       VALUES
       ("Kai", "Devi", "kai.devi@example.com"),
       ("Lee", "Wang", "lee.wang@example.com");
    SELECT * FROM test.contacts;
    +------------+-----------+----------------------+
    | first_name | last_name | email                |
    +------------+-----------+----------------------+
    | Kai        | Devi      | kai.devi@example.com |
    | Lee        | Wang      | lee.wang@example.com |
    +------------+-----------+----------------------+
    CREATE TABLE test.addresses (
       email VARCHAR(100),
       street_address VARCHAR(255),
       city VARCHAR(100),
       state_code VARCHAR(2)
    ) ENGINE = InnoDB;
    INSERT INTO test.addresses (email, street_address, city, state_code)
       VALUES
       ("kai.devi@example.com", "1660 Amphibious Blvd.", "Redwood City", "CA"),
       ("lee.wang@example.com", "32620 Little Blvd", "Redwood City", "CA");
    SELECT name AS "Name", addr AS "Address"
    FROM (SELECT CONCAT(first_name, " ", last_name) AS name,
       email FROM test.contacts) AS contacts
    INNER JOIN (SELECT CONCAT(street_address, ", ", city, ", ", state_code) AS addr,
       email FROM test.addresses) AS addr
    WHERE  contacts.email = addr.email;
    +----------+-----------------------------------------+
    | Name     | Address                                 |
    +----------+-----------------------------------------+
    | Kai Devi | 1660 Amphibious Blvd., Redwood City, CA |
    | Lee Wang | 32620 Little Blvd, Redwood City, CA     |
    +----------+-----------------------------------------+
    
    +-------------------+-------------------------------------+
    | Name              | Address                             |
    +-------------------+-------------------------------------+
    | Walker Percy      | 500 Thomas More Dr., Covington, LA  |
    | Flannery O'Connor | 300 Tarwater Rd., Milledgeville, GA |
    +-------------------+-------------------------------------+
    wget https://dlm.mariadb.com/enterprise-release-helpers/mariadb_es_repo_setup ;
    chmod +x mariadb_es_repo_setup;
    ./mariadb_es_repo_setup --token="xxxxx" --apply --mariadb-server-version="11.4"
    sudo mcs node add --read-replica --node <private-ip>
    sudo mcs node remove --node <private-ip>
    sudo mcs cluster status
    # minimize swapping
    vm.swappiness = 1
    
    # Increase the TCP max buffer size
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    
    # Increase the TCP buffer limits
    # min, default, and max number of bytes to use
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    
    # don't cache ssthresh from previous connection
    net.ipv4.tcp_no_metrics_save = 1
    
    # for 1 GigE, increase this to 2500
    # for 10 GigE, increase this to 30000
    net.core.netdev_max_backlog = 2500
    $ sudo sysctl --load=/etc/sysctl.d/90-mariadb-enterprise-columnstore.conf
    $ sudo setenforce permissive
    # This file controls the state of SELinux on the system.
    # SELINUX= can take one of these three values:
    #     enforcing - SELinux security policy is enforced.
    #     permissive - SELinux prints warnings instead of enforcing.
    #     disabled - No SELinux policy is loaded.
    SELINUX=permissive
    # SELINUXTYPE= can take one of three values:
    #     targeted - Targeted processes are protected,
    #     minimum - Modification of targeted policy. Only selected processes are protected.
    #     mls - Multi Level Security protection.
    SELINUXTYPE=targeted
    $ sudo getenforce
    Permissive
    $ sudo systemctl disable apparmor
    $ sudo aa-status
    apparmor module is loaded.
    0 profiles are loaded.
    0 profiles are in enforce mode.
    0 profiles are in complain mode.
    0 processes have profiles defined.
    0 processes are in enforce mode.
    0 processes are in complain mode.
    0 processes are unconfined but have a profile defined.
    $ sudo yum install glibc-locale-source glibc-langpack-en
    $ sudo localedef -i en_US -f UTF-8 en_US.UTF-8
    $ sudo testS3Connection
    StorageManager[26887]: Using the config file found at /etc/columnstore/storagemanager.cnf
    StorageManager[26887]: S3Storage: S3 connectivity & permissions are OK
    S3 Storage Manager Configuration OK
    $ sudo mariadb
    Welcome to the MariaDB monitor.  Commands end with ; or \g.
    Your MariaDB connection id is 38
    Server version: 11.4.5-3-MariaDB-Enterprise MariaDB Enterprise Server
    
    Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
    
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    
    MariaDB [(none)]>
    SELECT PLUGIN_NAME, PLUGIN_STATUS
    FROM information_schema.PLUGINS
    WHERE PLUGIN_LIBRARY LIKE 'ha_columnstore%';
    +---------------------+---------------+
    | PLUGIN_NAME         | PLUGIN_STATUS |
    +---------------------+---------------+
    | Columnstore         | ACTIVE        |
    | COLUMNSTORE_COLUMNS | ACTIVE        |
    | COLUMNSTORE_TABLES  | ACTIVE        |
    | COLUMNSTORE_FILES   | ACTIVE        |
    | COLUMNSTORE_EXTENTS | ACTIVE        |
    +---------------------+---------------+
    CREATE DATABASE IF NOT EXISTS test;
    CREATE TABLE IF NOT EXISTS test.contacts (
       first_name VARCHAR(50),
       last_name VARCHAR(50),
       email VARCHAR(100)
    ) ENGINE=ColumnStore;
    INSERT INTO test.contacts (first_name, last_name, email)
       VALUES
       ("Kai", "Devi", "kai.devi@example.com"),
       ("Lee", "Wang", "lee.wang@example.com");
    SELECT * FROM test.contacts;
    
    +------------+-----------+----------------------+
    | first_name | last_name | email                |
    +------------+-----------+----------------------+
    | Kai        | Devi      | kai.devi@example.com |
    | Lee        | Wang      | lee.wang@example.com |
    +------------+-----------+----------------------+
    CREATE TABLE test.addresses (
       email VARCHAR(100),
       street_address VARCHAR(255),
       city VARCHAR(100),
       state_code VARCHAR(2)
    ) ENGINE = InnoDB;
    INSERT INTO test.addresses (email, street_address, city, state_code)
       VALUES
       ("kai.devi@example.com", "1660 Amphibious Blvd.", "Redwood City", "CA"),
       ("lee.wang@example.com", "32620 Little Blvd", "Redwood City", "CA");
    SELECT name AS "Name", addr AS "Address"
    FROM (SELECT CONCAT(first_name, " ", last_name) AS name,
       email FROM test.contacts) AS contacts
    INNER JOIN (SELECT CONCAT(street_address, ", ", city, ", ", state_code) AS addr,
       email FROM test.addresses) AS addr
    WHERE  contacts.email = addr.email;
    +----------+-----------------------------------------+
    | Name     | Address                                 |
    +----------+-----------------------------------------+
    | Kai Devi | 1660 Amphibious Blvd., Redwood City, CA |
    | Lee Wang | 32620 Little Blvd, Redwood City, CA     |
    +----------+-----------------------------------------+
    
    +-------------------+-------------------------------------+
    | Name              | Address                             |
    +-------------------+-------------------------------------+
    | Walker Percy      | 500 Thomas More Dr., Covington, LA  |
    | Flannery O'Connor | 300 Tarwater Rd., Milledgeville, GA |
    +-------------------+-------------------------------------+
    $ sudo yum install curl
    $ sudo apt install curl apt-transport-https
    $ curl -LsSO https://dlm.mariadb.com/enterprise-release-helpers/mariadb_es_repo_setup
    $ echo "${checksum}  mariadb_es_repo_setup" \
          
     | sha256sum -c -
    $ chmod +x mariadb_es_repo_setup
    $ sudo ./mariadb_es_repo_setup --token="CUSTOMER_DOWNLOAD_TOKEN" --apply \
          --skip-maxscale \
          --skip-tools \
          --mariadb-server-version="11.4"
    $ sudo yum install epel-release
    
    $ sudo yum install jemalloc
    $ sudo apt install libjemalloc2
    $ sudo apt install libjemalloc1
    $ sudo yum install MariaDB-server \
       MariaDB-backup \
       MariaDB-shared \
       MariaDB-client \
       MariaDB-columnstore-engine
    $ sudo apt install mariadb-server \
       mariadb-backup \
       libmariadb3 \
       mariadb-client \
       mariadb-plugin-columnstore
    sudo cpimport -s ',' \
       DATABASE_NAME \
       TABLE_NAME \
       /path/to/DATABASE_NAME-TABLE_NAME.csv
    SELECT * FROM DATABASE_NAME.TABLE_NAME LIMIT 100;
    $ sudo yum install curl
    $ sudo apt install curl apt-transport-https
    $ curl -LsSO https://dlm.mariadb.com/enterprise-release-helpers/mariadb_es_repo_setup
    $ echo "${checksum}  mariadb_es_repo_setup" \
           | sha256sum -c -
    $ chmod +x mariadb_es_repo_setup
    $ sudo ./mariadb_es_repo_setup --token="CUSTOMER_DOWNLOAD_TOKEN" --apply \
          --skip-maxscale \
          --skip-tools \
          --mariadb-server-version="11.4"
    $ sudo yum install jemalloc jq curl
    $ sudo apt install libjemalloc1 jq curl
    $ sudo apt install libjemalloc2 jq curl
    $ sudo yum install MariaDB-server \
       MariaDB-backup \
       MariaDB-shared \
       MariaDB-client \
       MariaDB-columnstore-engine \
       MariaDB-columnstore-cmapi
    $ sudo apt install mariadb-server \
       mariadb-backup \
       libmariadb3 \
       mariadb-client \
       mariadb-plugin-columnstore \
       mariadb-columnstore-cmapi
    # minimize swapping
    vm.swappiness = 1
    
    # Increase the TCP max buffer size
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    
    # Increase the TCP buffer limits
    # min, default, and max number of bytes to use
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    
    # don't cache ssthresh from previous connection
    net.ipv4.tcp_no_metrics_save = 1
    
    # for 1 GigE, increase this to 2500
    # for 10 GigE, increase this to 30000
    net.core.netdev_max_backlog = 2500
    $ sudo sysctl --load=/etc/sysctl.d/90-mariadb-enterprise-columnstore.conf
    $ sudo setenforce permissive
    # This file controls the state of SELinux on the system.
    # SELINUX= can take one of these three values:
    #     enforcing - SELinux security policy is enforced.
    #     permissive - SELinux prints warnings instead of enforcing.
    #     disabled - No SELinux policy is loaded.
    SELINUX=permissive
    # SELINUXTYPE= can take one of three values:
    #     targeted - Targeted processes are protected,
    #     minimum - Modification of targeted policy. Only selected processes are protected.
    #     mls - Multi Level Security protection.
    SELINUXTYPE=targeted
    $ sudo getenforce
    Permissive
    $ sudo systemctl disable apparmor
    $ sudo aa-status
    apparmor module is loaded.
    0 profiles are loaded.
    0 profiles are in enforce mode.
    0 profiles are in complain mode.
    0 processes have profiles defined.
    0 processes are in enforce mode.
    0 processes are in complain mode.
    0 processes are unconfined but have a profile defined.
    $ sudo yum install glibc-locale-source glibc-langpack-en
    $ sudo localedef -i en_US -f UTF-8 en_US.UTF-8
    CREATE DATABASE inventory;
    CREATE TABLE inventory.products (
       product_name VARCHAR(11) NOT NULL DEFAULT '',
       supplier VARCHAR(128) NOT NULL DEFAULT '',
       quantity VARCHAR(128) NOT NULL DEFAULT '',
       unit_cost VARCHAR(128) NOT NULL DEFAULT ''
    ) ENGINE=Columnstore DEFAULT CHARSET=utf8;
    $ sudo cpimport -s '\t' inventory products /tmp/inventory-products.tsv
    LOAD DATA INFILE '/tmp/inventory-products.tsv'
    INTO TABLE inventory.products;
    $ mariadb --quick \
       --skip-column-names \
       --execute="SELECT * FROM inventory.products" \
       | cpimport -s '\t' inventory products
    $ sudo yum install curl
    $ sudo apt install curl apt-transport-https
    $ curl -LsSO https://dlm.mariadb.com/enterprise-release-helpers/mariadb_es_repo_setup
    $ echo "${checksum}  mariadb_es_repo_setup" \
           | sha256sum -c -
    $ chmod +x mariadb_es_repo_setup
    $ sudo ./mariadb_es_repo_setup --token="CUSTOMER_DOWNLOAD_TOKEN" --apply \
          --skip-maxscale \
          --skip-tools \
          --mariadb-server-version="11.4"
    $ sudo yum install jemalloc jq curl
    $ sudo apt install libjemalloc1 jq curl
    $ sudo apt install libjemalloc2 jq curl
    $ sudo yum install MariaDB-server \
       MariaDB-backup \
       MariaDB-shared \
       MariaDB-client \
       MariaDB-columnstore-engine \
       MariaDB-columnstore-cmapi
    $ sudo apt install mariadb-server \
       mariadb-backup \
       libmariadb3 \
       mariadb-client \
       mariadb-plugin-columnstore \
       mariadb-columnstore-cmapi
    mariadb -e "FLUSH TABLES WITH READ LOCK;"
    save_brm
    mkdir -p /extent_map_backup
    cp -f /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_em /extent_map_backup
    mariadb -e "UNLOCK TABLES;"
    systemctl stop mariadb-columnstore
    mv /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_em /tmp/BRM_saves_em.bad
    > /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_vbbm
    > /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_vss
    cp -f /extent_map_backup/BRM_saves_em /var/lib/columnstore/data1/systemFiles/dbrm/
    chown -R mysql:mysql /var/lib/columnstore/data1/systemFiles/dbrm/
    systemctl start mariadb-columnstore
    curl -s -X PUT https://127.0.0.1:8640/cmapi/0.4.0/cluster/shutdown \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:your_api_key' \
       --data '{"timeout":60}' -k
    mv /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_em /tmp/BRM_saves_em.bad
    > /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_vbbm
    > /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_vss
    cp -f /extent_map_backup/BRM_saves_em /var/lib/columnstore/data1/systemFiles/dbrm/
    chown -R mysql:mysql /var/lib/columnstore/data1/systemFiles/dbrm
    curl -s -X PUT https://127.0.0.1:8640/cmapi/0.4.0/cluster/start \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:your_api_key' \
       --data '{"timeout":60}' -k
    SELECT * FROM tab;
    SELECT col1, col2 FROM tab;

    Compression: Yes
    High Availability (HA): Yes
    Main Memory Caching: Yes
    Transaction Logging: Yes
    Garbage Collection: Yes
    Online Schema Changes: Yes
    Non-locking Reads: Yes
    Storage Engine: ColumnStore
    Availability: ES 10.5+, CS 10.5+ (MariaDB Enterprise Server)
    Workload Optimization: OLAP and Hybrid Workloads
    Table Orientation: Columnar
    ACID-compliant: Yes
    Indexes: Unnecessary

    net.core.netdev_max_backlog = 2500
    Sets the maximum number of packets that can be queued for a network device. A higher value allows more packets to be queued, improving performance.

    net.core.rmem_max = 16777216
    Sets the maximum receive buffer size for TCP sockets. A higher value allows for larger buffers, improving performance.

    net.core.wmem_max = 16777216
    Sets the maximum send buffer size for TCP sockets. A higher value allows for larger buffers, improving performance.

    net.ipv4.tcp_max_syn_backlog = 8192
    Sets the maximum number of queued SYN requests. A higher value allows for more queued requests, improving performance.

    net.ipv4.tcp_timestamps = 0
    Disables TCP timestamps, reducing overhead and improving performance.

    vm.max_map_count = 4262144
    Increases the maximum number of memory map areas a process may have. The default is 65530, which can be too low for workloads like MariaDB ColumnStore. Raising this prevents mapping errors for processes that need large address spaces.

    kernel.pid_max = 4194304
    Defines the maximum process ID value. Older Linux versions defaulted to 32768; newer versions default to 4194304. Raising this ensures support for systems running a very large number of processes concurrently.

    kernel.threads-max = 2000000
    Specifies the maximum number of threads allowed on the system. The default varies depending on available RAM. A value of 2 million is suitable for systems with 32-64 GB RAM. Increase further if running with more RAM or requiring more threads.

    vm.overcommit_memory = 1
    Allows the kernel to overcommit memory without heuristic checks, which prevents spurious allocation failures for the large allocations MariaDB ColumnStore can request.

    vm.dirty_background_ratio = 5
    Sets the percentage of memory that can be dirty before background writeback to disk begins. A lower value reduces the amount of dirty memory, improving performance.

    vm.dirty_ratio = 10
    Sets the percentage of memory that can be dirty before processes are forced to write dirty pages to disk synchronously. A lower value reduces the amount of dirty memory, improving performance.

    vm.vfs_cache_pressure = 50
    Sets the pressure level for the kernel's VFS cache. A lower value makes the kernel less aggressive about reclaiming dentry and inode cache memory, improving performance.

    For more information, see:
    MariaDB ColumnStore Documentation
    Linux Kernel Documentation
    MCOL-5165: Add optimized Linux kernel parameters for MariaDB ColumnStore

    Configure Enterprise ColumnStore

    Mandatory system variables and options for Single-Node Enterprise ColumnStore include:


    character_set_server

    Set this system variable to utf8.

    collation_server

    Set this system variable to utf8_general_ci.

    columnstore_use_import_for_batchinsert

    Set this system variable to ALWAYS to always use cpimport for LOAD DATA [ LOCAL ] INFILE and INSERT ... SELECT statements.

    Example Configuration
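
    A sketch of such a configuration file follows, combining the mandatory system variables above (the log_error setting is optional and shown only for completeness); per the quick reference, place it in an include directory with a z- prefix so that it is read last:

    [mariadb]
    log_error                              = mariadbd.err
    character_set_server                   = utf8
    collation_server                       = utf8_general_ci
    columnstore_use_import_for_batchinsert = ALWAYS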

    Configure the S3 Storage Manager

    Configure Enterprise ColumnStore S3 Storage Manager to use S3-compatible storage by editing the /etc/columnstore/storagemanager.cnf configuration file:

    The S3-compatible object storage options are configured under [S3]:

    • The bucket option must be set to the name of the bucket that you created in "Create an S3 Bucket".

    • The endpoint option must be set to the endpoint for the S3-compatible object storage.

    • The aws_access_key_id and aws_secret_access_key options must be set to the access key ID and secret access key for the S3-compatible object storage.

    • To use a specific IAM role, you must uncomment and set iam_role_name, sts_region, and sts_endpoint.

    • To use the IAM role assigned to an EC2 instance, you must uncomment ec2_iam_mode=enabled.

    The local cache options are configured under [Cache]:

    • The cache_size option is set to 2 GB by default.

    • The path option is set to /var/lib/columnstore/storagemanager/cache by default.

    Ensure that the specified path has sufficient storage space for the specified cache size.
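
    A minimal sketch of the relevant sections of /etc/columnstore/storagemanager.cnf is shown below; the bucket name, endpoint, and credentials are placeholders to be replaced with your own values:

    [ObjectStorage]
    service = S3

    [S3]
    bucket = your-columnstore-bucket
    endpoint = s3.us-west-2.amazonaws.com
    aws_access_key_id = YOUR_ACCESS_KEY_ID
    aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
    # iam_role_name = your_iam_role
    # sts_region = us-west-2
    # sts_endpoint = sts.amazonaws.com

    [Cache]
    cache_size = 2g
    path = /var/lib/columnstore/storagemanager/cache

    After editing the file, connectivity can be verified with the testS3Connection utility.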

    Start the Enterprise ColumnStore Services

    Start and enable the MariaDB Enterprise Server service, so that it starts automatically upon reboot:

    Start and enable the MariaDB Enterprise ColumnStore service, so that it starts automatically upon reboot:

    Create the Utility User

    Enterprise ColumnStore requires a mandatory utility user account to perform cross-engine joins and similar operations.

    1. Create the user account with the CREATE USER statement:

    2. Grant the user account SELECT privileges on all databases with the GRANT statement:

    3. Configure Enterprise ColumnStore to use the utility user:

    4. Set the password:

    For details about how to encrypt the password, see "Credentials Management for MariaDB Enterprise ColumnStore".

    Passwords should meet your organization's password policies. If your MariaDB Enterprise Server instance has a password validation plugin installed, then the password should also meet the configured requirements.

    Configure Linux Security Modules (LSM)

    The specific steps to configure the security module depend on the operating system.

    Configure SELinux (CentOS, RHEL)

    Configure SELinux for Enterprise ColumnStore:

    1. To configure SELinux, you have to install the packages required for audit2allow. On CentOS 7 and RHEL 7, install the following:

    On RHEL 8, install the following:

    2. Allow the system to run under load for a while to generate SELinux audit events.

    3. After the system has taken some load, generate an SELinux policy from the audit events using audit2allow:

    If no audit events were found, this will print the following:

    4. If audit events were found, the new SELinux policy can be loaded using semodule:

    5. Set SELinux to enforcing mode by setting SELINUX=enforcing in /etc/selinux/config.

    For example, the file will usually look like this after the change:

    6. Set SELinux to enforcing mode:

    Configure AppArmor (Ubuntu)

    For information on how to create a profile, see How to create an AppArmor Profile on ubuntu.com.

    Next Step

    Navigation in the Single-Node Enterprise ColumnStore topology with Object storage deployment procedure:

    This page was step 3 of 5.

    Next: Step 4: Test MariaDB Enterprise ColumnStore.

    Single-Node Enterprise ColumnStore with Object storage
    Optimize Linux Kernel Parameters

    MariaDB Enterprise ColumnStore performs best with Linux kernel optimizations.

    On each server to host an Enterprise ColumnStore node, optimize the kernel:

    1. Set the relevant kernel parameters in a sysctl configuration file. To ensure proper change management, use an Enterprise ColumnStore-specific configuration file. Create a /etc/sysctl.d/90-mariadb-enterprise-columnstore.conf file:

    2. Use the sysctl command to set the kernel parameters at runtime:

    Temporarily Configure Linux Security Modules (LSM)

    The Linux Security Modules (LSM) should be temporarily disabled on each Enterprise ColumnStore node during installation.

    The LSM will be configured and re-enabled later in this deployment procedure.

    The steps to disable the LSM depend on the specific LSM used by the operating system.

    CentOS / RHEL Stop SELinux

    SELinux must be set to permissive mode before installing MariaDB Enterprise ColumnStore.

    To set SELinux to permissive mode:

    1. Set SELinux to permissive mode:

    2. Set SELinux to permissive mode by setting SELINUX=permissive in /etc/selinux/config.

    For example, the file will usually look like this after the change:

    3. Confirm that SELinux is in permissive mode:

    SELinux will be configured and re-enabled later in this deployment procedure. This configuration is not persistent. If you restart the server before configuring and re-enabling SELinux later in the deployment procedure, you must reset the enforcement to permissive mode.

    Debian / Ubuntu AppArmor

    AppArmor must be disabled before installing MariaDB Enterprise ColumnStore.

    1. Disable AppArmor:

    2. Reboot the system.

    3. Confirm that no AppArmor profiles are loaded using aa-status:

    AppArmor will be configured and re-enabled later in this deployment procedure.

    Temporarily Configure Firewall for Installation

    MariaDB Enterprise ColumnStore requires the following TCP ports:

    3306: Port used for MariaDB Client traffic
    8600-8630: Port range used for inter-node communication
    8640: Port used by CMAPI
    8700: Port used for inter-node communication
    8800: Port used for inter-node communication

    The firewall should be temporarily disabled on each Enterprise ColumnStore node during installation.

    The firewall will be configured and re-enabled later in this deployment procedure.

    The steps to disable the firewall depend on the specific firewall used by the operating system.

    CentOS / RHEL Stop firewalld

    1. Check if the firewalld service is running:

    2. If the firewalld service is running, stop it:

    Firewalld will be configured and re-enabled later in this deployment procedure.

    Ubuntu Stop UFW

    1. Check if the UFW service is running:

    2. If the UFW service is running, stop it:

    UFW will be configured and re-enabled later in this deployment procedure.

    Configure the AWS Security Group

    To install Enterprise ColumnStore on Amazon Web Services (AWS), the security group must be modified prior to installation.

    Enterprise ColumnStore requires all internal communications to be open between Enterprise ColumnStore nodes. Therefore, the security group should allow all protocols and all ports to be open between the Enterprise ColumnStore nodes and the MaxScale proxy.

    Configure Character Encoding

    When using MariaDB Enterprise ColumnStore, it is recommended to set the system's locale to UTF-8.

    1. On RHEL 8, install additional dependencies:

    2. Set the system's locale to en_US.UTF-8 by executing localedef:

    Configure DNS

    MariaDB Enterprise ColumnStore requires all nodes to have host names that are resolvable on all other nodes. If your infrastructure does not configure DNS centrally, you may need to configure static DNS entries in the /etc/hosts file of each server.

    On each Enterprise ColumnStore node, edit the /etc/hosts file to map host names to the IP address of each Enterprise ColumnStore node:
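
    For example, a 3-node deployment might use entries like the following; the host names (mcs1 through mcs3) and the 192.0.2.x addresses are placeholders:

    192.0.2.101 mcs1
    192.0.2.102 mcs2
    192.0.2.103 mcs3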

    Replace the IP addresses with the addresses in your own environment.

    Next Step

    Navigation in the procedure "Deploy ColumnStore Shared Local Storage Topology".

    This page was step 1 of 9.

    Multi-Node Localstorage
    Directories for Shared Local Storage

    In a ColumnStore Object Storage topology, MariaDB Enterprise ColumnStore requires the Storage Manager directory to be located on shared local storage.

    The Storage Manager directory is at the following path:

    • /var/lib/columnstore/storagemanager

    In a ColumnStore Shared Local Storage topology, Enterprise ColumnStore additionally requires the DB Root directories to be located on shared local storage. The DB Root directories are at the path /var/lib/columnstore/dataN, where N represents a range of integers that starts at 1 and stops at the number of nodes in the deployment. For example, with a 3-node Enterprise ColumnStore deployment, this would refer to the following directories:

    • /var/lib/columnstore/data1

    • /var/lib/columnstore/data2

    • /var/lib/columnstore/data3

    The DB Root directories must be mounted on every ColumnStore node.

    Choose a Shared Local Storage Solution

    Select a Shared Local Storage solution for the Storage Manager directory:

    • EBS (Elastic Block Store) Multi-Attach

    • EFS (Elastic File System)

    • Filestore

    • GlusterFS

    • NFS (Network File System)

    For additional information, see "Shared Local Storage Options".

    Configure EBS Multi-Attach

    EBS is a high-performance block-storage service for AWS (Amazon Web Services). EBS Multi-Attach allows an EBS volume to be attached to multiple instances in AWS. Only clustered file systems, such as GFS2, are supported.

    For Enterprise ColumnStore deployments in AWS:

    • EBS Multi-Attach is a recommended option for the Storage Manager directory.

    • Amazon S3 storage is the recommended option for data.

    • Consult the vendor documentation for details on how to configure EBS Multi-Attach.

    Configure Elastic File System (EFS)

    EFS is a scalable, elastic, cloud-native NFS file system for AWS (Amazon Web Services)

    For deployments in AWS:

    • EFS is a recommended option for the Storage Manager directory.

    • Amazon S3 storage is the recommended option for data.

    • Consult the vendor documentation for details on how to configure EFS.

    Configure Filestore

    Filestore is high-performance, fully managed storage for GCP (Google Cloud Platform).

    For Enterprise ColumnStore deployments in GCP:

    • Filestore is the recommended option for the Storage Manager directory.

    • Google Object Storage (S3-compatible) is the recommended option for data.

    • Consult the vendor documentation for details on how to configure Filestore.

    Configure GlusterFS

    GlusterFS is a distributed file system.

    GlusterFS is a shared local storage option, but it is not one of the recommended options.

    For more information, see "Recommended Storage Options".

    Install GlusterFS

    On each Enterprise ColumnStore node, install GlusterFS.

    Install on CentOS / RHEL 8 (YUM):

    Install on CentOS / RHEL 7 (YUM):

    Install on Debian (APT):

    Install on Ubuntu (APT):

    Start the GlusterFS Daemon

    Start the GlusterFS daemon:

    Probe the GlusterFS Peers

    Before you can create a volume with GlusterFS, you must probe each node from a peer node.

    1. On the primary node, probe all of the other cluster nodes:

    2. On one of the replica nodes, probe the primary node to confirm that it is connected:

    3. On the primary node, check the peer status:

    Number of Peers: 2

    Configure and Mount GlusterFS Volumes

    Create the GlusterFS volumes for MariaDB Enterprise ColumnStore. Each volume must have the same number of replicas as the number of Enterprise ColumnStore nodes.

    1. On each Enterprise ColumnStore node, create the directory for each brick in the /brick directory:

    2. On the primary node, create the GlusterFS volumes:

    3. On the primary node, start the volume:

    4. On each Enterprise ColumnStore node, create mount points for the volumes:

    5. On each Enterprise ColumnStore node, add the mount points to /etc/fstab:

    6. On each Enterprise ColumnStore node, mount the volumes:
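
    As an illustrative sketch of the probe, volume, and mount steps for a 3-node cluster (the host names mcs1 through mcs3, the brick paths, and the storagemanager volume name are placeholders):

    $ sudo gluster peer probe mcs2
    $ sudo gluster peer probe mcs3
    $ sudo gluster volume create storagemanager replica 3 \
       mcs1:/brick/storagemanager mcs2:/brick/storagemanager mcs3:/brick/storagemanager
    $ sudo gluster volume start storagemanager
    $ sudo mount -t glusterfs mcs1:/storagemanager /var/lib/columnstore/storagemanager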

    Configure Network File System (NFS)

    NFS is a distributed file system. NFS is available in most Linux distributions. If NFS is used for an Enterprise ColumnStore deployment, the storage must be mounted with the sync option to ensure that each node flushes its changes immediately.

    For on-premises deployments:

    • NFS is the recommended option for the Storage Manager directory.

    • Any S3-compatible storage is the recommended option for data.

    Consult the documentation for your NFS implementation for details on how to configure NFS.
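
    As a sketch, an /etc/fstab entry for the Storage Manager directory might look like the following (the server name and export path are placeholders); note the sync option required above:

    nfs-server:/exports/columnstore   /var/lib/columnstore/storagemanager   nfs   sync,hard   0   0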

    Next Step

    Navigation in the procedure "Deploy ColumnStore Shared Local Storage Topology".

    This page was step 2 of 9.

    Next: Step 3: Install MariaDB Enterprise Server.

    MariaDB Enterprise ColumnStore supports multiple storage options.

    This page discusses how to backup and restore Enterprise ColumnStore when it uses S3-compatible object storage for data and shared local storage (such as NFS) for the Storage Manager directory.

    Any file can become corrupt due to hardware issues, crashes, power loss, and other reasons. If the Enterprise ColumnStore data or metadata become corrupt, Enterprise ColumnStore could become unusable, resulting in data loss.

    If Enterprise ColumnStore is your system of record, it should be backed up regularly.

    If Enterprise ColumnStore uses S3-compatible object storage for data and shared local storage for the Storage Manager directory, the following items must be backed up:

    • The MariaDB data directory is backed up using MariaDB Enterprise Backup.

    • The S3 bucket must be backed up using the vendor's snapshot procedure.

    • The Storage Manager directory must be backed up.

    See the instructions below for more details.

    Backup

    (Flowchart: Enterprise ColumnStore backup with S3)

    Use the following process to take a backup:

    1. Determine which node is the primary server using curl to send the status command to the CMAPI Server:
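
    A sketch of the status call, assuming the CMAPI server listens on port 8640 and accepts the same x-api-key header as the other cluster commands shown in this document (the API key is a placeholder):

    $ curl -s -k https://127.0.0.1:8640/cmapi/0.4.0/cluster/status \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:your_api_key' | jq .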

    The output will show "dbrm_mode": "master" for the primary server:

    2. Connect to the primary server using MariaDB Client as a user account that has privileges to lock the database:

    3. Lock the database with the FLUSH TABLES WITH READ LOCK statement:

    Ensure that the client remains connected to the primary server, so that the lock is held for the remaining steps.

    4. Make a copy or snapshot of the Storage Manager directory. By default, it is located at /var/lib/columnstore/storagemanager.

    For example, to make a copy of the directory with rsync:
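
    A sketch of such a copy (the destination path /backups is a placeholder):

    $ sudo rsync -av /var/lib/columnstore/storagemanager/ /backups/columnstore/storagemanager/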

    5. Use MariaDB Enterprise Backup to back up the MariaDB data directory:

    6. Use MariaDB Enterprise Backup to prepare the backup:

    7. Create a snapshot of the S3-compatible storage. Consult the storage vendor's manual for details on how to do this.

    8. Ensure that all previous operations are complete.

    9. In the original client connection to the primary server, unlock the database with the UNLOCK TABLES statement:

    Restore

    Use the following process to restore a backup:

    1. Deploy Enterprise ColumnStore, so that you can restore the backup to an empty deployment.

    2. Ensure that all services are stopped on each node:

    3. Restore the backup of the Storage Manager directory. By default, it is located at /var/lib/columnstore/storagemanager.

    For example, to restore the backup with rsync:

    4. Use MariaDB Enterprise Backup to restore the backup of the MariaDB data directory:

    5. Restore the snapshot of your S3-compatible storage to the new S3 bucket that you plan to use. Consult the storage vendor's manual for details on how to do this.

    6. Update storagemanager.cnf to configure Enterprise ColumnStore to use the S3 bucket. By default, it is located at /etc/columnstore/storagemanager.cnf.

    For example:

    • The default local cache size is 2 GB.

    • The default local cache path is /var/lib/columnstore/storagemanager/cache.

    • Ensure that the local cache path has sufficient store space to store the local cache.

    • The bucket option must be set to the name of the bucket that you created from your snapshot in the previous step.

    • To use an IAM role, you must also uncomment and set iam_role_name, sts_region, and sts_endpoint.

    7. Start the services on each node:

    Optimize Linux Kernel Parameters

    MariaDB Enterprise ColumnStore performs best with Linux kernel optimizations.

    On each server to host an Enterprise ColumnStore node, optimize the kernel:

    1. Set the relevant kernel parameters in a sysctl configuration file. To ensure proper change management, use an Enterprise ColumnStore-specific configuration file.

    Create a /etc/sysctl.d/90-mariadb-enterprise-columnstore.conf file:

    2. Use the sysctl command to set the kernel parameters at runtime:

    Temporarily Configure Linux Security Modules (LSM)

    The Linux Security Modules (LSM) should be temporarily disabled on each Enterprise ColumnStore node during installation.

    The LSM will be configured and re-enabled later in this deployment procedure.

    The steps to disable the LSM depend on the specific LSM used by the operating system.

    CentOS / RHEL Stop SELinux

    SELinux must be set to permissive mode before installing MariaDB Enterprise ColumnStore.

    To set SELinux to permissive mode:

    1. Set SELinux to permissive mode:

    2. Set SELinux to permissive mode by setting SELINUX=permissive in /etc/selinux/config.

    For example, the file will usually look like this after the change:

    3. Confirm that SELinux is in permissive mode:

    SELinux will be configured and re-enabled later in this deployment procedure. This configuration is not persistent. If you restart the server before configuring and re-enabling SELinux later in the deployment procedure, you must reset the enforcement to permissive mode.

    Debian / Ubuntu AppArmor

    AppArmor must be disabled before installing MariaDB Enterprise ColumnStore.

    1. Disable AppArmor:

    2. Reboot the system.

    3. Confirm that no AppArmor profiles are loaded using aa-status:

    AppArmor will be configured and re-enabled later in this deployment procedure.

    Temporarily Configure Firewall for Installation

    MariaDB Enterprise ColumnStore requires the following TCP ports:

    3306: Port used for MariaDB Client traffic
    8600-8630: Port range used for inter-node communication
    8640: Port used by CMAPI
    8700: Port used for inter-node communication
    8800: Port used for inter-node communication

    The firewall should be temporarily disabled on each Enterprise ColumnStore node during installation.

    The firewall will be configured and re-enabled later in this deployment procedure.

    The steps to disable the firewall depend on the specific firewall used by the operating system.

    CentOS / RHEL Stop firewalld

    1. Check if the firewalld service is running:

    2. If the firewalld service is running, stop it:

    Firewalld will be configured and re-enabled later in this deployment procedure.

    Ubuntu Stop UFW

    1. Check if the UFW service is running:

    2. If the UFW service is running, stop it:

    UFW will be configured and re-enabled later in this deployment procedure.

    Configure the AWS Security Group

    To install Enterprise ColumnStore on Amazon Web Services (AWS), the security group must be modified prior to installation.

    Enterprise ColumnStore requires all internal communications to be open between Enterprise ColumnStore nodes. Therefore, the security group should allow all protocols and all ports to be open between the Enterprise ColumnStore nodes and the MaxScale proxy.

    Configure Character Encoding

    When using MariaDB Enterprise ColumnStore, it is recommended to set the system's locale to UTF-8.

    1. On RHEL 8, install additional dependencies:

    2. Set the system's locale to en_US.UTF-8 by executing localedef:

    Configure DNS

    MariaDB Enterprise ColumnStore requires all nodes to have host names that are resolvable on all other nodes. If your infrastructure does not configure DNS centrally, you may need to configure static DNS entries in the /etc/hosts file of each server.

    On each Enterprise ColumnStore node, edit the /etc/hosts file to map host names to the IP address of each Enterprise ColumnStore node:

    Replace the IP addresses with the addresses in your own environment.

    Create an S3 Bucket

    With the ColumnStore Object Storage topology, it is important to create the S3 bucket before you start ColumnStore. All Enterprise ColumnStore nodes access data from the same bucket.

    If you already have an S3 bucket, confirm that the bucket is empty.

    S3 bucket configuration will be performed later in this procedure.

    Next Step

    Navigation in the procedure "Deploy ColumnStore Object Storage Topology":

    This page was step 1 of 9.

    Next: Step 2: Configure Shared Local Storage.

    MariaDB Enterprise ColumnStore supports multiple storage options.

    This page discusses how to backup and restore Enterprise ColumnStore when it uses shared local storage (such as NFS) for the DB Root directories.

    Any file can become corrupt due to hardware issues, crashes, power loss, and other reasons. If the Enterprise ColumnStore data or metadata become corrupt, Enterprise ColumnStore could become unusable, resulting in data loss.

    If Enterprise ColumnStore is your system of record, it should be backed up regularly.

    If Enterprise ColumnStore uses shared local storage for the DB Root directories, the following items must be backed up:

    • The MariaDB data directory is backed up using MariaDB Enterprise Backup

    • The Storage Manager directory must be backed up

    • Each DB Root directory must be backed up

    See the instructions below for more details.

    Backup

    Use the following process to take a backup:

    1. Determine which node is the primary server using curl to send the status command to the CMAPI Server:

    The output will show dbrm_mode: master for the primary server:

    2. Connect to the primary server using MariaDB Client as a user account that has privileges to lock the database:

    3. Lock the database with the FLUSH TABLES WITH READ LOCK statement:

    Ensure that the client remains connected to the primary server, so that the lock is held for the remaining steps.

    4. Make a copy or snapshot of the Storage Manager directory. By default, it is located at /var/lib/columnstore/storagemanager.

    For example, to make a copy of the directory with rsync:

    5. Make a copy or snapshot of the DB Root directories. By default, they are located at /var/lib/columnstore/dataN, where the N in dataN represents a range of integers that starts at 1 and stops at the number of nodes in the deployment.

    For example, to make a copy of the directories with rsync in a 3-node deployment:
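
    A sketch of such a copy for a 3-node deployment (the destination path /backups is a placeholder):

    $ sudo rsync -av /var/lib/columnstore/data1/ /backups/columnstore/data1/
    $ sudo rsync -av /var/lib/columnstore/data2/ /backups/columnstore/data2/
    $ sudo rsync -av /var/lib/columnstore/data3/ /backups/columnstore/data3/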

    6. Use MariaDB Enterprise Backup to back up the MariaDB data directory:

    7. Use MariaDB Enterprise Backup to prepare the backup:

    8. Ensure that all previous operations are complete.

    9. In the original client connection to the primary server, unlock the database with the UNLOCK TABLES statement:

    Restore

    Use the following process to restore a backup:

    1. Deploy Enterprise ColumnStore, so that you can restore the backup to an empty deployment.

    2. Ensure that all services are stopped on each node:

    3. Restore the backup of the Storage Manager directory. By default, it is located at /var/lib/columnstore/storagemanager.

    For example, to restore the backup with rsync:

    4. Restore the backup of the DB Root directories. By default, they are located at /var/lib/columnstore/dataN, where the N in dataN represents a range of integers that starts at 1 and stops at the number of nodes in the deployment.

    For example, to restore the backup with rsync in a 3-node deployment:

    5. Use MariaDB Enterprise Backup to restore the backup of the MariaDB data directory:

    6. Start the services on each node:

    cpimport

    MariaDB Enterprise ColumnStore includes a bulk data loading tool called cpimport, which provides several benefits:

    • Bypasses the SQL layer to decrease overhead

    • Does not block read queries

    • Requires a write metadata lock on the table, which can be monitored with the METADATA_LOCK_INFO plugin

    • Appends the new data to the table. While the bulk load is in progress, the newly appended data is temporarily hidden from queries. After the bulk load is complete, the newly appended data is visible to queries.

    • Inserts each row in the order the rows are read from the source file. Users can optimize data loads for Enterprise ColumnStore's automatic partitioning by loading presorted data files.

    • Supports parallel distributed bulk loads

    • Imports data from text files

    • Imports data from binary files

    • Imports data from standard input (stdin)

    Batch Insert Mode

    MariaDB Enterprise ColumnStore enables batch insert mode by default.

    When batch insert mode is enabled, MariaDB Enterprise ColumnStore has special handling for the following statements:

    • LOAD DATA [ LOCAL ] INFILE

    • INSERT ... SELECT

    Enterprise ColumnStore uses the following rules:

    • If the statement is executed outside of a transaction, Enterprise ColumnStore loads the data using cpimport, which is a command-line utility that is designed to efficiently load data in bulk. It executes cpimport using a wrapper called cpimport.bin.

    • If the statement is executed inside of a transaction, Enterprise ColumnStore loads the data using the DML interface, which is slower.

    Batch insert mode can be disabled by setting the columnstore_use_import_for_batchinsert system variable to OFF. When batch insert mode is disabled, Enterprise ColumnStore executes the statements using the DML interface, which is slower.
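
    As a sketch, batch insert mode can be toggled for a single session before a load; the table and file names are illustrative:

    SET SESSION columnstore_use_import_for_batchinsert = OFF;
    LOAD DATA LOCAL INFILE '/tmp/contacts.csv' INTO TABLE test.contacts;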

    Locking

    MariaDB Enterprise ColumnStore requires a write metadata lock (MDL) on the table when a bulk data load is performed with cpimport.

    When a bulk data load is running:

    • Read queries will not be blocked.

    • Write queries and concurrent bulk data loads on the same table will be blocked until the bulk data load operation is complete, and the write metadata lock on the table has been released.

    • The write metadata lock (MDL) can be monitored with the METADATA_LOCK_INFO plugin (see the sketch below).
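
    Assuming the METADATA_LOCK_INFO plugin is installed, the lock can be observed from another session while a bulk load runs; this is a sketch, not output from a live system:

    INSTALL SONAME 'metadata_lock_info';
    SELECT * FROM information_schema.METADATA_LOCK_INFO;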

    Choose a Data Load Method

    Fastest: cpimport

    • Interface: Shell
    • Format(s): Text file, binary file, standard input (stdin)
    • Location(s): Server file system
    • Benefits: Lowest latency; bypasses SQL layer; non-blocking

    Fast: columnstore_info.load_from_s3

    • Interface: SQL


    MariaDB Replication

    • Highly available

    • Asynchronous or semi-synchronous replication

    • Automatic failover via MaxScale

    • Manual provisioning of new nodes from backup

    • Scales reads via MaxScale

    • Enterprise Server 10.3+, MaxScale 2.5+

    Galera Cluster Topology: Multi-Primary Cluster Powered by Galera for Transactional/OLTP Workloads

    • InnoDB Storage Engine

    • Highly available

    • Virtually synchronous, certification-based replication

    • Automated provisioning of new nodes (IST/SST)

    • Scales reads via MaxScale

    • Enterprise Server 10.3+, MariaDB Enterprise Cluster (powered by Galera), MaxScale 2.5+

    Columnar storage engine with shared local storage

    • Highly available

    • Automatic failover via MaxScale and CMAPI

    • Scales reads via MaxScale

    • Bulk data import

    • Enterprise Server, Enterprise ColumnStore, MaxScale

    • Optional

    Columnar storage engine with S3-compatible object storage

    • Highly available

    • Automatic failover via MaxScale and CMAPI

    • Scales reads via MaxScale

    • Bulk data import

    • Enterprise Server, Enterprise ColumnStore, MaxScale

    • Single-stack hybrid transactional/analytical workloads

    • ColumnStore for analytics with scalable S3-compatible object storage

    • InnoDB for transactions

    • Cross-engine JOINs

    • Enterprise Server, Enterprise ColumnStore, MaxScale


    Single-Node Localstorage

    This guide provides steps for deploying a single-node ColumnStore, setting up the environment, installing the software, and bulk importing data for online analytical processing (OLAP) workloads.

    This procedure describes the deployment of the Single-Node Enterprise ColumnStore topology with Local storage.

    MariaDB Enterprise ColumnStore 23.10 is a columnar storage engine for MariaDB Enterprise Server 10.6. Enterprise ColumnStore is best suited for Online Analytical Processing (OLAP) workloads.

    This procedure has 5 steps, which are executed in sequence.

    This page provides an overview of the topology, requirements, and deployment procedures.

    Please read and understand this procedure before executing.

    Procedure Steps

    Step
    Description

    Support

    Customers can obtain support by contacting MariaDB Support.

    Components

    The following components are deployed during this procedure:

    Component
    Function

    MariaDB Enterprise Server Components

    Component
    Description

    Topology

    The Single-Node Enterprise ColumnStore topology provides support for Online Analytical Processing (OLAP) workloads to MariaDB Enterprise Server.

    The Enterprise ColumnStore node:

    • Receives queries from the application

    • Executes queries

    • Uses the local disk for storage.

    High Availability

    Single-Node Enterprise ColumnStore does not provide high availability (HA) for Online Analytical Processing (OLAP). If you would like to deploy Enterprise ColumnStore with high availability, see the multi-node ColumnStore Object Storage and Shared Local Storage deployment procedures.

    Requirements

    These requirements are for the Single-Node Enterprise ColumnStore, when deployed with MariaDB Enterprise Server 10.6 and MariaDB Enterprise ColumnStore 23.10.

    Operating System

    • Debian 11 (x86_64, ARM64)

    • Debian 12 (x86_64, ARM64)

    • Red Hat Enterprise Linux 8 (x86_64, ARM64)

    • Red Hat Enterprise Linux 9 (x86_64, ARM64)

    Minimum Hardware Requirements

    MariaDB Enterprise ColumnStore's minimum hardware requirements are not intended for production environments, but they can be appropriate for development and test environments. For production environments, see the recommended hardware requirements instead.

    The minimum hardware requirements are:

    Component
    CPU
    Memory

    MariaDB Enterprise ColumnStore will refuse to start if the system has less than 3 GB of memory.

    If Enterprise ColumnStore is started on a system with less memory, the following error message will be written to the ColumnStore system log called crit.log:

    And the following error message will be raised to the client:

    Recommended Hardware Requirements

    MariaDB Enterprise ColumnStore's recommended hardware requirements are intended for production analytics.

    The recommended hardware requirements are:

    Component
    CPU
    Memory

    Quick Reference

    MariaDB Enterprise Server Configuration Management

    Method
    Description

    MariaDB Enterprise Server packages are configured to read configuration files from different paths, depending on the operating system. Making custom changes to Enterprise Server default configuration files is not recommended because custom changes may be overwritten by other default configuration files that are loaded later.

    To ensure that your custom changes will be read last, create a custom configuration file with the z- prefix in one of the include directories.

    Distribution
    Example Configuration File Path

    MariaDB Enterprise Server Service Management

    The systemctl command is used to start and stop the MariaDB Enterprise Server service.

    Operation
    Command

    Next Step

    Navigation in the Single-Node Enterprise ColumnStore topology with Local storage deployment procedure:

    • Next: Step 1: Install MariaDB Enterprise ColumnStore 23.10.

    Performance Concepts

    Introduction

    The high level components of the ColumnStore architecture are:

    • PrimProc: PrimProc (Primitives Processor) is responsible for parsing SQL requests into an optimized set of primitive job steps executed by one or more servers, and is thus responsible for query optimization and orchestration of query execution. While every instance has its own PrimProc in a multi-server deployment, each query begins and ends on the same PrimProc where it originated. A database load balancer such as MariaDB MaxScale can be deployed to balance external requests across individual servers. PrimProc also executes granular job steps received from the server (mariadbd) in a multi-threaded manner. ColumnStore allows distribution of the work across many servers.

• Extent Maps: ColumnStore maintains metadata about each column in a shared distributed object known as the Extent Map. The primary node references the Extent Map both to generate the correct primitive job steps and to identify the correct disk blocks to read. Each column is made up of one or more files, and each file can contain multiple extents. As much as possible, the system attempts to allocate contiguous physical storage to improve read performance.

• Storage: ColumnStore can use either local storage or shared storage (e.g., SAN or EBS) to store data. Using shared storage allows data processing to fail over to another node automatically if a server fails.

    Data Loading

The system supports full MVCC ACID transactional logic via INSERT, UPDATE, and DELETE statements. The MVCC architecture allows concurrent queries and DML / batch loads. Although DML is supported, the system is optimized for batch inserts, so larger data loads should be performed as batch loads. The most flexible and optimal way to load data is via the cpimport tool, which optimizes the load path and can be run centrally or in parallel on each server.

If the data contains a time (or time-correlated, ascending) column, significant performance gains are achieved if the data is sorted by this field and typically queried with a WHERE clause on that column. This is because the system records a minimum and maximum value for each extent, providing a system-maintained range partitioning scheme. This allows the system to completely skip scanning an extent if the query includes a WHERE clause on that field limiting the results to a subset of extents.
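A sketch of the effect (table and column names are hypothetical):

-- With rows loaded in order_date order, each extent covers a narrow
-- [min, max] date range, so this predicate lets the system skip every
-- extent whose range falls outside January 2024.
SELECT COUNT(*)
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';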

    Query Execution

    MariaDB ColumnStore has its own query optimizer and execution engine distinct from the MariaDB server implementation. This allows for scaling out query execution to multiple servers, and to optimize for handling data stored as columns rather than rows. As such, the factors influencing query performance are very different:

A query is first parsed by the MariaDB server (mariadbd) process and passed through to the ColumnStore storage engine, which passes the request on to the PrimProc process responsible for optimizing and orchestrating execution of the query. The PrimProc module's optimizer creates a series of batch primitive steps that are executed on all nodes in the cluster. Since multiple servers can be deployed, this allows for scale-out execution of queries. The optimizer attempts to process query execution in parallel; however, certain operations, such as final result ordering, inherently must be executed centrally. Filtering, joins, aggregates, and GROUP BY clauses are generally pushed down and executed in parallel in PrimProc on all servers. In PrimProc, batch primitive steps are performed at a granular level, where individual threads operate on individual 1K-8K blocks within an extent. This enables a large multi-core server to be fully utilized and allows scaling both within a single server and across servers. The current batch primitive steps available in the system include:

• Single Column Scan: Scan one or more extents for a given column based on a single-column predicate, including operators such as =, <>, IN (list), BETWEEN, and IS NULL. See the first scan section of performance configuration for additional details on tuning this.

• Additional Single Column Filters: Project additional columns for any rows found by a previous scan and apply additional single-column predicates as needed. Access of blocks is based on row identifier, going directly to the blocks. See the additional column read section of performance configuration for additional details on tuning this.

• Table Level Filters: Project additional columns as required for any table-level filters such as column1 < column2, or more advanced functions and expressions. Access of blocks is again based on row identifier, going directly to the blocks.

• Project Join Columns for Joins: Project additional join columns as needed for any join operations. Access of blocks is again based on row identifier, going directly to the blocks. See the join tuning section of performance configuration for additional details on tuning this.

• Execute Multi-Join: Apply one or more hash join operations against projected join columns, and use that value to probe a previously built hash map. Build out tuples as needed to satisfy inner or outer join requirements. See the multi-table join section of performance configuration for additional details on tuning this.

• Cross-Table Level Filters: Project additional columns from the range of rows for the Primitive Step as needed for any cross-table level filters such as table1.column1 < table2.column2, or more advanced functions and expressions. Access of blocks is again based on row identifier, going directly to the blocks.

• Aggregation/Distinct Operation Part 1: Apply any local group by, distinct, or aggregation operation against the set of joined rows assigned to a given Batch Primitive. Part 1 of this process is handled by PrimProc.

• Aggregation/Distinct Operation Part 2: Apply any final group by, distinct, or aggregation operation against the set of joined rows assigned to a given Batch Primitive. This processing is handled by PrimProc. See the memory management section of performance configuration for additional details on tuning this.

    ColumnStore Query Execution Paradigms

The following items should be considered when thinking about query execution in ColumnStore versus a row-based store such as InnoDB.

    Data Scanning and Filtering

ColumnStore is optimized for large-scale aggregation / OLAP queries over large data sets. As such, the indexes typically used to optimize query access in row-based systems do not make sense, since selectivity is low for such queries. Instead, ColumnStore gains performance by scanning only the necessary columns, utilizing system-maintained partitioning, and utilizing multiple threads and servers to scale query response time.

Since ColumnStore only reads the columns necessary to resolve a query, include only the columns you actually need. For example, SELECT col1, col2 FROM tbl is significantly faster than SELECT * FROM tbl.

Datatype size is important. If, say, a column can only contain values 0 through 100, declare it as TINYINT: each value is then represented with 1 byte rather than the 4 bytes of an INT, reducing the I/O cost by a factor of four.
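A brief sketch (hypothetical table):

-- score can only hold 0 through 100, so TINYINT stores each value in
-- 1 byte; declaring it INT would cost 4 bytes per value for the same data.
CREATE TABLE metrics (
   id BIGINT,
   score TINYINT
) ENGINE = ColumnStore;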

For string types, an important threshold is CHAR(9) and VARCHAR(8) or greater. Each column storage file uses a fixed number of bytes per value, which enables fast positional lookup of other columns to form the row. Currently, the upper limit for columnar data storage is 8 bytes, so for strings longer than this the system maintains an additional 'dictionary' extent where the values are stored, and the columnar extent file stores a pointer into the dictionary. For example, it is more expensive to read and process a VARCHAR(8) column than a CHAR(8) column. Where possible, you get better performance if you can utilize shorter strings, especially if you avoid the dictionary lookup. All TEXT/BLOB data types in ColumnStore 1.1 onward utilize a dictionary and do a multi-block 8KB lookup to retrieve that data if required. The longer the data, the more blocks are retrieved, and the greater the potential performance impact.
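For example (hypothetical table):

-- country_code fits within the 8-byte columnar limit; country_name exceeds
-- it, so every read of that column adds a dictionary lookup.
CREATE TABLE countries (
   country_code CHAR(2),
   country_name VARCHAR(50)
) ENGINE = ColumnStore;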

In a row-based system, adding redundant columns adds to the overall query cost, but in a columnar system a cost is only incurred if the column is referenced. Therefore, additional columns can be created to support different access paths. For instance, store a leading portion of a field in one column to allow for faster lookups, and additionally store the long-form value as another column. Scans on a shorter code or leading-portion column are faster.
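A sketch of this pattern (hypothetical names):

-- sku_prefix duplicates the first characters of sku; scans filter on the
-- short fixed-width column, and the full value is projected only when needed.
CREATE TABLE products (
   sku_prefix CHAR(8),
   sku VARCHAR(64)
) ENGINE = ColumnStore;

SELECT sku FROM products WHERE sku_prefix = 'AB-2024-';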

ColumnStore distributes function application across all nodes for greater performance, but this requires a distributed implementation of the function in addition to the MariaDB server implementation. See Distributed Functions for the full list.

It's important to note that ColumnStore does not have a cost-based optimizer, so for optimal extent elimination and performance, the order of your WHERE clause predicates should match the column order by which the data was imported. For example, most use cases with a date column benefit from a natural sort (today's data is inserted after yesterday's data), so filtering first by date efficiently narrows the records: WHERE DATE='x' outperforms a query whose first predicate is on a column with random values. Compare different query plans using calSetTrace and calGetTrace, optimizing for the lowest PIO/LIO and the highest PBE. See also CSEP.
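For example, a single statement can be traced like this (the middle query is a placeholder):

SELECT calSetTrace(1);
SELECT col1 FROM tbl WHERE date_col = '2024-01-01';
SELECT calGetTrace();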

    Joins

Hash joins are utilized by ColumnStore to optimize for large-scale joins and to avoid the need for indexes and the overhead of nested-loop processing. ColumnStore maintains table statistics to determine the optimal join order. This is implemented by first identifying the smaller table side (based on Extent Map data) and materializing the necessary rows from that table for the join. If the size of this is less than the configuration setting PmMaxMemorySmallSide, the join is pushed down into PrimProc for distributed in-memory processing. Otherwise, the rows from the larger side are not processed in a distributed manner for joining, and only the WHERE clause on that side is executed across all PrimProc modules in the cluster. If the join is too large for memory, disk-based joins can be enabled to allow the query to complete.
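Both settings live in Columnstore.xml and can be adjusted with mcsSetConfig, in the same way as other settings in this document. This is only a sketch: the AllowDiskBasedJoin setting name is an assumption to verify against your Columnstore.xml, and the values are examples.

$ sudo mcsSetConfig HashJoin PmMaxMemorySmallSide 1G
$ sudo mcsSetConfig HashJoin AllowDiskBasedJoin Y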

    Aggregations

Similarly to scalar functions, ColumnStore distributes aggregate evaluation as much as possible; however, some post-processing is required to combine the final results. Enough memory must exist to handle queries with a very large number of distinct values in the aggregated columns.

Aggregation performance is also influenced by the number of distinct values in the aggregated columns. Generally, the same number of rows computes faster with 100 distinct values than with 10,000 distinct values, due to increased memory management and transfer overhead.

SELECT COUNT(*) is internally optimized to SELECT COUNT(COL-N), where COL-N is the column that uses the fewest bytes of storage. For example, it would pick a CHAR(1) column over an INT column, because CHAR(1) uses 1 byte of storage while INT uses 4 bytes. The implementation still honors ANSI semantics: SELECT COUNT(*) includes NULLs in the total count, while an explicit SELECT COUNT(COL-N) excludes NULL values from the count.
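For example:

-- Counts all rows; internally ColumnStore reads the narrowest column.
SELECT COUNT(*) FROM tbl;

-- Counts only rows where col1 IS NOT NULL, per ANSI semantics.
SELECT COUNT(col1) FROM tbl;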

    ORDER BY and LIMIT

ORDER BY and LIMIT are implemented at the very end by the mariadbd server process on the temporary result-set table. This means that the unsorted results must be fully retrieved before either is applied. The performance overhead of this is minimal for small to medium result sets, but it can be significant for larger ones.

    Complex Queries

Subqueries are executed in sequence: the subquery's intermediate results are materialized, and then the join logic is applied with the outer query.

Window functions are executed as part of final aggregation in PrimProc due to the need for ordering of the window results. The ColumnStore window function engine uses a dedicated, faster sort process.

    Partitioning

ColumnStore provides automated system partitioning of columns. As data is loaded into extents, the system captures and maintains the min/max values of the column data in each extent. New rows are appended to the current extent until it is full, at which point a new extent is created. For column values that are ordered or semi-ordered, this allows for very effective data partitioning: using the min and max values, entire extents can be eliminated and never read when filtering data. This generally works particularly well for time dimension / series data or similar values that increase over time.

    Removing a Node

    To remove a node from Enterprise ColumnStore, perform the following procedure.

    Unlinking from Service in MaxScale

The server object for the node must be unlinked from the service using MaxCtrl:

    • Unlink the server object from the service using the unlink service command.

    • As the first argument, provide the name of the service.

    • As the second argument, provide the name of the server.

    Checking the Service in MaxScale

To confirm that the server object was properly unlinked from the service, the service should be checked using MaxCtrl:

    • Show the services using the show services command, like this:

    Unlinking from Monitor in MaxScale

The server object for the node must be unlinked from the monitor using MaxCtrl:

    • Unlink a server object from the monitor using the unlink monitor command.

    • As the first argument, provide the name of the monitor.

    • As the second argument, provide the name of the server.

    Checking the Monitor in MaxScale

To confirm that the server object was properly unlinked from the monitor, the monitor should be checked using MaxCtrl:

    • Show the monitors using the show monitors command, like this:

    Removing the Server from MaxScale

The server object for the node must also be removed from MaxScale using MaxCtrl:

• Use MaxCtrl or another supported REST client.

    • Remove the server object using the destroy server command.

    • As the first argument, provide the name for the server.

    For example:

    Checking the Server in MaxScale

To confirm that the server object was properly removed, the server objects should be checked using MaxCtrl:

    • Show the server objects using the show servers command, like this:

    Stopping the Enterprise ColumnStore Services

The Enterprise Server, Enterprise ColumnStore, and CMAPI services can be stopped using the systemctl command.

    Perform the following procedure on the node:

    1. Stop the MariaDB Enterprise Server service:

    2. Stop the MariaDB Enterprise ColumnStore service:

    3. Stop the CMAPI service:
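The corresponding commands for steps 1-3, as used elsewhere in this document:

$ sudo systemctl stop mariadb
$ sudo systemctl stop mariadb-columnstore
$ sudo systemctl stop mariadb-columnstore-cmapi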

    Removing the Node from Enterprise ColumnStore

The node must be removed from Enterprise ColumnStore using CMAPI:

• Remove the node using the remove-node endpoint path.

• Use a supported REST client, such as curl.

• Authenticate using the configured API key.

• Include the required headers.

• Format the JSON output using jq for enhanced readability.

    For example, if the primary node's host name is mcs1 and the IP address for the node to remove is 192.0.2.3:

    • In ES 10.5.10-7 and later:

    • In ES 10.5.9-6 and earlier:
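A sketch of the newer form (ES 10.5.10-7 and later), assuming the cluster/remove-node endpoint and the request body shown here, with the API key used elsewhere in this document:

$ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/remove-node \
   --header 'Content-Type:application/json' \
   --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
   --data '{"timeout": 120, "node": "192.0.2.3"}' \
   | jq .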

    Example output:

    Checking the Enterprise ColumnStore Status

To confirm that the node was properly removed, the status of Enterprise ColumnStore should be checked using CMAPI:

• Check the status using the status endpoint path.

    For example, if the primary node's host name is mcs1:
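The request below mirrors the cluster status call used elsewhere in this document:

$ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
   --header 'Content-Type:application/json' \
   --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
   | jq .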

    Example output:

Step 5: Test MariaDB Enterprise Server

    Overview

    This page details step 5 of the 9-step procedure "Deploy ColumnStore Object Storage Topology".

    This step tests MariaDB Enterprise Server and MariaDB Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

Step 7: Start and Configure MariaDB MaxScale

    Overview

    This page details step 7 of the 9-step procedure "Deploy ColumnStore Shared Local Storage Topology".

    This step starts and configures MariaDB MaxScale 22.08.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

Step 7: Start and Configure MariaDB MaxScale

    Overview

    This page details step 7 of the 9-step procedure "Deploy ColumnStore Object Storage Topology".

    This step starts and configures MariaDB MaxScale 22.08.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    ColumnStore and Recursive CTE Limitations

    The ColumnStore engine does not fully support recursive Common Table Expressions (CTEs). Attempting to use recursive CTEs directly against ColumnStore tables typically results in an error.

    The purpose of the following examples is to demonstrate three potential workarounds for this issue. The best fit for your organization will depend on your specific needs and ability to refactor queries and adjust your approach.

    Setup: Simulating an Org Chart

This setup simulates a simple organizational chart with employees and managers to illustrate the problem and the workarounds.

    First, an InnoDB table for comparison:
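A minimal sketch of such a table (names assumed for illustration):

CREATE TABLE employees_innodb (
   id INT,
   name VARCHAR(100),
   manager_id INT
) ENGINE = InnoDB;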

    ColumnStore Streaming Data Adapters

The MariaDB ColumnStore Bulk Data API enables the creation of higher-performance adapters for ETL integration and data ingestion. The Streaming Data Adapters are out-of-the-box adapters that use this API for specific data sources and use cases.

• The MaxScale CDC Data Adapter integrates MaxScale CDC streams into MariaDB ColumnStore.

• The Kafka Data Adapter integrates Kafka streams into MariaDB ColumnStore.

    Query Accelerator

MariaDB Query Accelerator is an Alpha release. Do not use it in production environments. Query Accelerator works only with ColumnStore 25.10.0 and MariaDB Enterprise Server 11.8.3 or later.

    What is Query Accelerator

Query Accelerator allows MariaDB to use ColumnStore to execute queries that would otherwise be executed by InnoDB. Under the hood, ColumnStore:

    $ sudo yum install --enablerepo=PowerTools glusterfs-server
    $ sudo yum install centos-release-gluster
    $ sudo yum install glusterfs-server
    $ wget -O - https://download.gluster.org/pub/gluster/glusterfs/LATEST/rsa.pub | apt-key add -
    
    $ DEBID=$(grep 'VERSION_ID=' /etc/os-release | cut -d '=' -f 2 | tr -d '"')
    $ DEBVER=$(grep 'VERSION=' /etc/os-release | grep -Eo '[a-z]+')
    $ DEBARCH=$(dpkg --print-architecture)
    $ echo deb https://download.gluster.org/pub/gluster/glusterfs/LATEST/Debian/${DEBID}/${DEBARCH}/apt ${DEBVER} main > /etc/apt/sources.list.d/gluster.list
    $ sudo apt update
    $ sudo apt install glusterfs-server
    $ sudo apt update
    $ sudo apt install glusterfs-server
    $ sudo systemctl start glusterd
    $ sudo systemctl enable glusterd
    $ sudo gluster peer probe mcs2
    $ sudo gluster peer probe mcs3
    $ sudo gluster peer probe mcs1
    peer probe: Host mcs1 port 24007 already in peer list
    $ sudo gluster peer status
    Number of Peers: 2
    
    Hostname: mcs2
    Uuid: 3c8a5c79-22de-45df-9034-8ae624b7b23e
    State: Peer in Cluster (Connected)
    
    Hostname: mcs3
    Uuid: 862af7b2-bb5e-4b1c-8311-630fa32ed451
    State: Peer in Cluster (Connected)
    $ sudo mkdir -p /brick/storagemanager
    $ sudo gluster volume create storagemanager \
          replica 3 \
          mcs1:/brick/storagemanager \
          mcs2:/brick/storagemanager \
          mcs3:/brick/storagemanager \
          force
    $ sudo gluster volume start storagemanager
    $ sudo mkdir -p /var/lib/columnstore/storagemanager
    127.0.0.1:storagemanager /var/lib/columnstore/storagemanager glusterfs defaults,_netdev 0 0
    $ sudo mount -a
    CREATE DATABASE columnstore_db;
    
    CREATE TABLE columnstore_db.analytics_test (
       id INT,
       str VARCHAR(50)
    ) ENGINE = ColumnStore;
    [mariadb]
    log_error                              = mariadbd.err
    character_set_server                   = utf8
    collation_server                       = utf8_general_ci
    log_bin                                = mariadb-bin
    log_bin_index                          = mariadb-bin.index
    relay_log                              = mariadb-relay
    relay_log_index                        = mariadb-relay.index
    log_slave_updates                      = ON
    gtid_strict_mode                       = ON
    
    # This must be unique on each cluster node
    server_id                              = 1
    sudo mcsSetConfig CrossEngineSupport Host 127.0.0.1
    sudo mcsSetConfig CrossEngineSupport Port 3306
    sudo mcsSetConfig CrossEngineSupport User cross_engine
    sudo mcsSetConfig CrossEngineSupport Password cross_engine_passwd
    vm.overcommit_memory=1 
    vm.dirty_background_ratio=5 
    vm.dirty_ratio=10 
    vm.vfs_cache_pressure=50 
    net.core.netdev_max_backlog=2500 
    net.core.rmem_max=16777216 
    net.core.wmem_max=16777216 
    net.ipv4.tcp_max_syn_backlog=8192 
    net.ipv4.tcp_timestamps=0
    sudo sysctl -p
    cat /proc/sys/kernel/threads-max
    cat /proc/sys/kernel/pid_max
    cat /proc/sys/vm/max_map_count
    
    
# RHEL: /etc/sysctl.conf
# Note: "sudo echo ... >> file" does not work, because the redirect is performed
# by the unprivileged shell; pipe through "sudo tee" instead.
echo "vm.max_map_count=4262144" | sudo tee -a /etc/sysctl.conf
echo "kernel.pid_max = 4194304" | sudo tee -a /etc/sysctl.conf
echo "kernel.threads-max = 2000000" | sudo tee -a /etc/sysctl.conf

# There may be a file called 50-pid-max.conf or something similar. If so, modify it.
echo "vm.max_map_count=4262144" | sudo tee /usr/lib/sysctl.d/50-max_map_count.conf
echo "kernel.pid_max = 4194304" | sudo tee /usr/lib/sysctl.d/50-pid-max.conf
    sudo sysctl -p
    SELECT calgetsqlcount();
    SELECT /*! INFINIDB_ORDERED */ r_regionkey     
    FROM region r, customer c, nation n    
    WHERE r.r_regionkey = n.n_regionkey      
    AND n.n_nationkey = c.c_nationkey
    [mariadb]
    log_error                              = mariadbd.err
    character_set_server                   = utf8
    collation_server                       = utf8_general_ci
    [ObjectStorage]
    …
    service = S3
    …
    [S3]
    bucket                = your_columnstore_bucket_name
    endpoint              = your_s3_endpoint
    aws_access_key_id     = your_s3_access_key_id
    aws_secret_access_key = your_s3_secret_key
    # iam_role_name       = your_iam_role
    # sts_region          = your_sts_region
    # sts_endpoint        = your_sts_endpoint
    # ec2_iam_mode        = enabled
    
    [Cache]
    cache_size = your_local_cache_size
    path       = your_local_cache_path
    $ sudo systemctl start mariadb
    
    $ sudo systemctl enable mariadb
    $ sudo systemctl start mariadb-columnstore
    
    $ sudo systemctl enable mariadb-columnstore
    CREATE USER 'util_user'@'127.0.0.1'
    IDENTIFIED BY 'util_user_passwd';
    GRANT SELECT, PROCESS ON *.*
    TO 'util_user'@'127.0.0.1';
    $ sudo mcsSetConfig CrossEngineSupport Host 127.0.0.1
    
    $ sudo mcsSetConfig CrossEngineSupport Port 3306
    
    $ sudo mcsSetConfig CrossEngineSupport User util_user
    $ sudo mcsSetConfig CrossEngineSupport Password util_user_passwd
    $ sudo yum install policycoreutils policycoreutils-python
    $ sudo yum install policycoreutils python3-policycoreutils policycoreutils-python-utils
    $ sudo grep mysqld /var/log/audit/audit.log | audit2allow -M mariadb_local
    $ sudo grep mysqld /var/log/audit/audit.log | audit2allow -M mariadb_local
    
    Nothing to do
    $ sudo semodule -i mariadb_local.pp
    # This file controls the state of SELinux on the system.
    # SELINUX= can take one of these three values:
    #     enforcing - SELinux security policy is enforced.
    #     permissive - SELinux prints warnings instead of enforcing.
    #     disabled - No SELinux policy is loaded.
    SELINUX=enforcing
    # SELINUXTYPE= can take one of three values:
    #     targeted - Targeted processes are protected,
    #     minimum - Modification of targeted policy. Only selected processes are protected.
    #     mls - Multi Level Security protection.
    SELINUXTYPE=targeted
    $ sudo setenforce enforcing
    # minimize swapping
    vm.swappiness = 1
    
    # Increase the TCP max buffer size
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    
    # Increase the TCP buffer limits
    # min, default, and max number of bytes to use
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    
    # don't cache ssthresh from previous connection
    net.ipv4.tcp_no_metrics_save = 1
    
    # for 1 GigE, increase this to 2500
    # for 10 GigE, increase this to 30000
    net.core.netdev_max_backlog = 2500
    $ sudo sysctl --load=/etc/sysctl.d/90-mariadb-enterprise-columnstore.conf
    $ sudo setenforce permissive
    # This file controls the state of SELinux on the system.
    # SELINUX= can take one of these three values:
    #     enforcing - SELinux security policy is enforced.
    #     permissive - SELinux prints warnings instead of enforcing.
    #     disabled - No SELinux policy is loaded.
    SELINUX=permissive
    # SELINUXTYPE= can take one of three values:
    #     targeted - Targeted processes are protected,
    #     minimum - Modification of targeted policy. Only selected processes are protected.
    #     mls - Multi Level Security protection.
    SELINUXTYPE=targeted
    $ sudo getenforce
    Permissive
    $ sudo systemctl disable apparmor
    $ sudo aa-status
    apparmor module is loaded.
    0 profiles are loaded.
    0 profiles are in enforce mode.
    0 profiles are in complain mode.
    0 processes have profiles defined.
    0 processes are in enforce mode.
    0 processes are in complain mode.
    0 processes are unconfined but have a profile defined.
    $ sudo systemctl status firewalld
    $ sudo systemctl stop firewalld
    $ sudo ufw status verbose
    $ sudo ufw disable
    $ sudo yum install glibc-locale-source glibc-langpack-en
    $ sudo localedef -i en_US -f UTF-8 en_US.UTF-8
    192.0.2.1     mcs1
    192.0.2.2     mcs2
    192.0.2.3     mcs3
    192.0.2.100   mxs1
    $ sudo yum install --enablerepo=PowerTools glusterfs-server
    $ sudo yum install centos-release-gluster
    $ sudo yum install glusterfs-server
    $ wget -O - https://download.gluster.org/pub/gluster/glusterfs/LATEST/rsa.pub | apt-key add -
    
    $ DEBID=$(grep 'VERSION_ID=' /etc/os-release | cut -d '=' -f 2 | tr -d '"')
    $ DEBVER=$(grep 'VERSION=' /etc/os-release | grep -Eo '[a-z]+')
    $ DEBARCH=$(dpkg --print-architecture)
    $ echo deb https://download.gluster.org/pub/gluster/glusterfs/LATEST/Debian/${DEBID}/${DEBARCH}/apt ${DEBVER} main > /etc/apt/sources.list.d/gluster.list
    $ sudo apt update
    $ sudo apt install glusterfs-server
    $ sudo apt update
    $ sudo apt install glusterfs-server
    $ sudo systemctl start glusterd
    $ sudo systemctl enable glusterd
    $ sudo gluster peer probe mcs2
    $ sudo gluster peer probe mcs3
$ sudo gluster peer probe mcs1
peer probe: Host mcs1 port 24007 already in peer list
$ sudo gluster peer status
Number of Peers: 2

Hostname: mcs2
    Uuid: 3c8a5c79-22de-45df-9034-8ae624b7b23e
    State: Peer in Cluster (Connected)
    
    Hostname: mcs3
    Uuid: 862af7b2-bb5e-4b1c-8311-630fa32ed451
    State: Peer in Cluster (Connected)
    $ sudo mkdir -p /brick/storagemanager
    $ sudo gluster volume create storagemanager \
          replica 3 \
          mcs1:/brick/storagemanager \
          mcs2:/brick/storagemanager \
          mcs3:/brick/storagemanager \
          force
    $ sudo gluster volume start storagemanager
    $ sudo mkdir -p /var/lib/columnstore/storagemanager
    127.0.0.1:storagemanager /var/lib/columnstore/storagemanager glusterfs defaults,_netdev 0 0
    $ sudo mount -a
$ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       | jq .
    {
      "timestamp": "2020-12-15 00:40:34.353574",
      "192.0.2.1": {
        "timestamp": "2020-12-15 00:40:34.362374",
        "uptime": 11467,
        "dbrm_mode": "master",
        "cluster_mode": "readwrite",
        "dbroots": [
          "1"
        ],
        "module_id": 1,
        "services": [
          {
            "name": "workernode",
            "pid": 19202
          },
          {
            "name": "controllernode",
            "pid": 19232
          },
          {
            "name": "PrimProc",
            "pid": 19254
          },
          {
            "name": "ExeMgr",
            "pid": 19292
          },
          {
            "name": "WriteEngine",
            "pid": 19316
          },
          {
            "name": "DMLProc",
            "pid": 19332
          },
          {
            "name": "DDLProc",
            "pid": 19366
          }
        ]
  }
}
    $ mariadb --host=192.0.2.1 \
       --user=root \
       --password
    FLUSH TABLES WITH READ LOCK;
    $ sudo mkdir -p /backups/columnstore/202101291600/
    $ sudo rsync -av /var/lib/columnstore/storagemanager /backups/columnstore/202101291600/
    $ sudo mkdir -p /backups/mariadb/202101291600/
    $ sudo mariadb-backup --backup \
       --target-dir=/backups/mariadb/202101291600/ \
       --user=mariadb-backup \
       --password=mbu_passwd
    $ sudo mariadb-backup --prepare \
       --target-dir=/backups/mariadb/202101291600/
    UNLOCK TABLES;
    $ sudo systemctl stop mariadb-columnstore-cmapi
    $ sudo systemctl stop mariadb-columnstore
    $ sudo systemctl stop mariadb
    $ sudo rsync -av /backups/columnstore/202101291600/storagemanager/ /var/lib/columnstore/storagemanager/
    $ sudo chown -R mysql:mysql /var/lib/columnstore/storagemanager
    $ sudo mariadb-backup --copy-back \
       --target-dir=/backups/mariadb/202101291600/
    $ sudo chown -R mysql:mysql /var/lib/mysql
    [ObjectStorage]
    …
    service = S3
    …
    [S3]
    bucket = your_columnstore_bucket_name
    endpoint = your_s3_endpoint
    aws_access_key_id = your_s3_access_key_id
    aws_secret_access_key = your_s3_secret_key
    # iam_role_name = your_iam_role
    # sts_region = your_sts_region
    # sts_endpoint = your_sts_endpoint
    
    [Cache]
    cache_size = your_local_cache_size
    path = your_local_cache_path
    $ sudo systemctl start mariadb
    $ sudo systemctl start mariadb-columnstore-cmapi
    # minimize swapping
    vm.swappiness = 1
    
    # Increase the TCP max buffer size
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    
    # Increase the TCP buffer limits
    # min, default, and max number of bytes to use
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    
    # don't cache ssthresh from previous connection
    net.ipv4.tcp_no_metrics_save = 1
    
    # for 1 GigE, increase this to 2500
    # for 10 GigE, increase this to 30000
    net.core.netdev_max_backlog = 2500
    $ sudo sysctl --load=/etc/sysctl.d/90-mariadb-enterprise-columnstore.conf
    $ sudo setenforce permissive
    # This file controls the state of SELinux on the system.
    # SELINUX= can take one of these three values:
    #     enforcing - SELinux security policy is enforced.
    #     permissive - SELinux prints warnings instead of enforcing.
    #     disabled - No SELinux policy is loaded.
    SELINUX=permissive
    # SELINUXTYPE= can take one of three values:
    #     targeted - Targeted processes are protected,
    #     minimum - Modification of targeted policy. Only selected processes are protected.
    #     mls - Multi Level Security protection.
    SELINUXTYPE=targeted
    $ sudo getenforce
    Permissive
    $ sudo systemctl disable apparmor
    $ sudo aa-status
    apparmor module is loaded.
    0 profiles are loaded.
    0 profiles are in enforce mode.
    0 profiles are in complain mode.
    0 processes have profiles defined.
    0 processes are in enforce mode.
    0 processes are in complain mode.
    0 processes are unconfined but have a profile defined.
    $ sudo systemctl status firewalld
    $ sudo systemctl stop firewalld
    $ sudo ufw status verbose
    $ sudo ufw disable
    $ sudo yum install glibc-locale-source glibc-langpack-en
    $ sudo localedef -i en_US -f UTF-8 en_US.UTF-8
    192.0.2.1     mcs1
    192.0.2.2     mcs2
    192.0.2.3     mcs3
    192.0.2.100   mxs1
    $ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       | jq .
    {
      "timestamp": "2020-12-15 00:40:34.353574",
      "192.0.2.1": {
        "timestamp": "2020-12-15 00:40:34.362374",
        "uptime": 11467,
        "dbrm_mode": "master",
        "cluster_mode": "readwrite",
        "dbroots": [
          "1"
        ],
        "module_id": 1,
        "services": [
          {
            "name": "workernode",
            "pid": 19202
          },
          {
            "name": "controllernode",
            "pid": 19232
          },
          {
            "name": "PrimProc",
            "pid": 19254
          },
          {
            "name": "ExeMgr",
            "pid": 19292
          },
          {
            "name": "WriteEngine",
            "pid": 19316
          },
          {
            "name": "DMLProc",
            "pid": 19332
          },
          {
            "name": "DDLProc",
            "pid": 19366
          }
        ]
  }
}
    $ mariadb --host=192.0.2.1 \
       --user=root \
       --password
    FLUSH TABLES WITH READ LOCK;
    $ sudo mkdir -p /backups/columnstore/202101291600/
    $ sudo rsync -av /var/lib/columnstore/storagemanager /backups/columnstore/202101291600/
    $ sudo rsync -av /var/lib/columnstore/data1 /backups/columnstore/202101291600/
    $ sudo rsync -av /var/lib/columnstore/data2 /backups/columnstore/202101291600/
    $ sudo rsync -av /var/lib/columnstore/data3 /backups/columnstore/202101291600/
    $ sudo mkdir -p /backups/mariadb/202101291600/
    $ sudo mariadb-backup --backup \
       --target-dir=/backups/mariadb/202101291600/ \
       --user=mariadb-backup \
       --password=mbu_passwd
    $ sudo mariadb-backup --prepare \
       --target-dir=/backups/mariadb/202101291600/
    UNLOCK TABLES;
    $ sudo systemctl stop mariadb-columnstore-cmapi
    $ sudo systemctl stop mariadb-columnstore
    $ sudo systemctl stop mariadb
    $ sudo rsync -av /backups/columnstore/202101291600/storagemanager/ /var/lib/columnstore/storagemanager/
    $ sudo chown -R mysql:mysql /var/lib/columnstore/storagemanager
    $ sudo rsync -av /backups/columnstore/202101291600/data1/ /var/lib/columnstore/data1/
    $ sudo rsync -av /backups/columnstore/202101291600/data2/ /var/lib/columnstore/data2/
    $ sudo rsync -av /backups/columnstore/202101291600/data3/ /var/lib/columnstore/data3/
    $ sudo chown -R mysql:mysql /var/lib/columnstore/data1
    $ sudo chown -R mysql:mysql /var/lib/columnstore/data2
    $ sudo chown -R mysql:mysql /var/lib/columnstore/data3
    $ sudo mariadb-backup --copy-back \
       --target-dir=/backups/mariadb/202101291600/
    $ sudo chown -R mysql:mysql /var/lib/mysql
    $ sudo systemctl start mariadb
    $ sudo systemctl start mariadb-columnstore-cmapi
    sudo dnf install -y \
    MariaDB-server MariaDB-columnstore-engine MariaDB-columnstore-cmapi
    sudo apt update
    sudo apt install -y mariadb-server mariadb-plugin-columnstore mariadb-columnstore-cmapi
    sudo systemctl start mariadb
    sudo systemctl enable mariadb
    sudo systemctl start mariadb-columnstore-cmapi
    sudo systemctl enable mariadb-columnstore-cmapi
    sudo mcs cluster set api-key --key <your-api-key-here>
    sudo mcs node add --node <private-ip-of-rw-node>
    sudo mcs node add --read-replica --node <private-ip-of-replica>
    sudo mcs cluster status
    SELECT DISTINCT col1 FROM tab LIMIT 10000;
    SELECT DISTINCT col1 FROM tab LIMIT 100;
    SET SESSION columnstore_decimal_overflow_check=ON;
    
    SELECT (big_decimal1 * big_decimal2) AS product
    FROM columnstore_tab;

• Step 1: Prepare System for Enterprise ColumnStore

• Step 2: Install Enterprise ColumnStore

• Step 3: Start and Configure Enterprise ColumnStore

• Step 4: Test Enterprise ColumnStore

• Step 5: Bulk Import Data to Enterprise ColumnStore

MariaDB Enterprise Server: a modern SQL RDBMS with high availability, pluggable storage engines, hot online backups, and audit logging.

MariaDB Enterprise ColumnStore: a columnar storage engine optimized for Online Analytical Processing (OLAP) workloads.


    Test S3 Connection

    MariaDB Enterprise ColumnStore 23.10 includes a testS3Connection command to test the S3 configuration, permissions, and connectivity.

    This action is performed on each Enterprise ColumnStore node.

    Test the S3 configuration by executing the following:
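A sketch of the invocation (the exact path and required privileges may vary by installation):

$ sudo testS3Connection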

    If the testS3Connection command does not return OK, investigate the S3 configuration.

    Test Enterprise Server Service

    Use Systemd to test whether the MariaDB Enterprise Server service is running.

    This action is performed on each Enterprise ColumnStore node.

    Check if the MariaDB Enterprise Server service is running by executing the following:

    If the service is not running on any node, start the service by executing the following on that node:
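A sketch of both commands (check first, then start if needed):

$ systemctl status mariadb
$ sudo systemctl start mariadb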

    Test Local Client Connections

Use MariaDB Client to test the local connection to the Enterprise Server node.

    This action is performed on each Enterprise ColumnStore node:
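For example:

$ sudo mariadb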

    The sudo command is used here to connect to the Enterprise Server node using the root@localhost user account, which authenticates using the unix_socket authentication plugin. Other user accounts can be used by specifying the --user and --password command-line options.

    Test ColumnStore Storage Engine Plugin

Query the information_schema.PLUGINS table to confirm that the ColumnStore storage engine is loaded.

    This action is performed on each Enterprise ColumnStore node.

    Execute the following query:
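A sketch of such a query (the filter pattern is an assumption):

SELECT PLUGIN_NAME, PLUGIN_STATUS
FROM information_schema.PLUGINS
WHERE PLUGIN_NAME LIKE '%columnstore%';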

    The PLUGIN_STATUS column for each ColumnStore-related plugin should contain ACTIVE.

    Test CMAPI Service

    Use Systemd to test whether the CMAPI service is running.

    This action is performed on each Enterprise ColumnStore node.

    Check if the CMAPI service is running by executing the following:

    If the service is not running on any node, start the service by executing the following on that node:
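A sketch of both commands (check first, then start if needed):

$ systemctl status mariadb-columnstore-cmapi
$ sudo systemctl start mariadb-columnstore-cmapi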

    Test ColumnStore Status

Use CMAPI to request the ColumnStore status. The API key needs to be provided as part of the X-API-key HTTP header.

    This action is performed with the CMAPI service on the primary server.

    Check the ColumnStore status using curl by executing the following:
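This mirrors the cluster status call used elsewhere in this document:

$ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
   --header 'Content-Type:application/json' \
   --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
   | jq .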

    Test DDL

    Use MariaDB Client to test DDL.

    1. On the primary server, use the MariaDB Client to connect to the node:

2. Create a test database and ColumnStore table:
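The DDL used elsewhere in this document for this test:

CREATE DATABASE columnstore_db;

CREATE TABLE columnstore_db.analytics_test (
   id INT,
   str VARCHAR(50)
) ENGINE = ColumnStore;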

3. On each replica server, use the MariaDB Client to connect to the node:

4. Confirm that the database and table exist:
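For example (a sketch):

SHOW DATABASES;
SHOW TABLES IN columnstore_db;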

    If the database or table do not exist on any node, then check the replication configuration.

    Test DML

    Use MariaDB Client to test DML.

    1. On the primary server, use the MariaDB Client to connect to the node:

2. Insert sample data into the table created in the DDL test:
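A sketch of sample data for the analytics_test table created above:

INSERT INTO columnstore_db.analytics_test
VALUES (1, 'foo'), (2, 'bar');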

3. On each replica server, use the MariaDB Client to connect to the node:

4. Execute a SELECT query to retrieve the data:
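For example:

SELECT * FROM columnstore_db.analytics_test;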

    If the data is not returned on any node, check the ColumnStore status and the storage configuration.

    Next Step

Navigation in the procedure "Deploy ColumnStore Object Storage Topology":

    This page was step 5 of 9.

    Next: Step 6: Install MariaDB MaxScale.

    Replace the Default Configuration File

    MariaDB MaxScale installations include a configuration file with some example objects. This configuration file should be replaced.

    On the MaxScale node, replace the default /etc/maxscale.cnf with the following configuration:

    For additional information, see "Global Parameters".

    Restart MaxScale

    On the MaxScale node, restart the MaxScale service to ensure that MaxScale picks up the new configuration:

    For additional information, see "Start and Stop Services".

    Configure Server Objects

On the MaxScale node, use the maxctrl create server command to create a server object for each Enterprise ColumnStore node:
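A sketch using the host names and IP addresses from this document (the port is assumed to be 3306):

$ maxctrl create server mcs1 192.0.2.1 3306
$ maxctrl create server mcs2 192.0.2.2 3306
$ maxctrl create server mcs3 192.0.2.3 3306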

    Configure MariaDB Monitor

    MaxScale uses monitors to retrieve additional information from the servers. This information is used by other services in filtering and routing connections based on the current state of the node. For MariaDB Enterprise ColumnStore, use the MariaDB Monitor (mariadbmon).

    On the MaxScale node, use maxctrl create monitor to create a MariaDB Monitor:
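A sketch assembled from the parameters documented below (the uppercase values are placeholders, and the server names are taken from this document):

$ maxctrl create monitor columnstore_monitor mariadbmon \
   user=MAXSCALE_USER \
   password='MAXSCALE_USER_PASSWORD' \
   replication_user=REPLICATION_USER \
   replication_password='REPLICATION_USER_PASSWORD' \
   --servers mcs1 mcs2 mcs3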

    In this example:

    • columnstore_monitor is an arbitrary name that is used to identify the new monitor.

    • mariadbmon is the name of the module that implements the MariaDB Monitor.

    • user=MAXSCALE_USER sets the user parameter to the database user account that MaxScale uses to monitor the ColumnStore nodes.

    • password='MAXSCALE_USER_PASSWORD' sets the password parameter to the password used by the database user account that MaxScale uses to monitor the ColumnStore nodes.

• replication_user=REPLICATION_USER sets the replication_user parameter to the database user account that MaxScale uses to set up replication.

• replication_password='REPLICATION_USER_PASSWORD' sets the replication_password parameter to the password used by the database user account that MaxScale uses to set up replication.

    • --servers sets the servers parameter to the set of nodes that MaxScale should monitor. All non-option arguments after --servers are interpreted as server names.

    • Other Module Parameters supported by mariadbmon in MaxScale 22.08 can also be specified.

    Choose a MaxScale Router

    Routers control how MaxScale balances the load between Enterprise ColumnStore nodes. Each router uses a different approach to routing queries. Consider the specific use case of your application and database load and select the router that best suits your needs.

    Router
    Configuration Procedure
    Description

    Connection-based load balancing

    • Routes connections to Enterprise ColumnStore nodes designated as replica servers for a read-only pool

    • Routes connections to an Enterprise ColumnStore node designated as the primary server for a read-write pool.

    Query-based load balancing

    • Routes write queries to an Enterprise ColumnStore node designated as the primary server

• Routes read queries to Enterprise ColumnStore nodes designated as replica servers

    • Automatically reconnects after node failures

    Configure Read Connection Router

    Use MaxScale Read Connection Router (readconnroute) to route connections to replica servers for a read-only pool.

    On the MaxScale node, use maxctrl create service to create a router:
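A sketch assembled from the parameters documented below (uppercase values are placeholders; server names are taken from this document):

$ maxctrl create service connection_router_service readconnroute \
   user=MAXSCALE_USER \
   password=MAXSCALE_USER_PASSWORD \
   router_options=slave \
   --servers mcs1 mcs2 mcs3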

    In this example:

    • connection_router_service is an arbitrary name that is used to identify the new service.

    • readconnroute is the name of the module that implements the Read Connection Router.

    • user=MAXSCALE_USER sets the user parameter to the database user account that MaxScale uses to connect to the ColumnStore nodes.

    • password=MAXSCALE_USER_PASSWORD sets the password parameter to the password used by the database user account that MaxScale uses to connect to the ColumnStore nodes.

    • router_options=slave sets the router_options parameter to slave, so that MaxScale only routes connections to the replica nodes.

    • --servers sets the servers parameter to the set of nodes to which MaxScale should route connections. All non-option arguments after --servers are interpreted as server names.

    • Other Module Parameters supported by readconnroute in MaxScale 22.08 can also be specified.

    Configure Listener for the Read Connection Router

    These instructions reference TCP port 3308. You can use a different TCP port. The TCP port used must not be bound by any other listener.

    On the MaxScale node, use the maxctrl create listener command to configure MaxScale to use a listener for the Read Connection Router (readconnroute):
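A sketch assembled from the parameters documented below:

$ maxctrl create listener connection_router_service connection_router_listener 3308 \
   protocol=MariaDBClient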

    In this example:

    • connection_router_service is the name of the readconnroute service that was previously created.

    • connection_router_listener is an arbitrary name that is used to identify the new listener.

    • 3308 is the TCP port.

    • protocol=MariaDBClient sets the protocol parameter.

    • Other Module Parameters supported by listeners in MaxScale 22.08 can also be specified.

    Configure Read/Write Split Router for Queries

    MaxScale Read/Write Split Router (readwritesplit) performs query-based load balancing. The router routes write queries to the primary and read queries to the replicas.

    On the MaxScale node, use the maxctrl create service command to configure MaxScale to use the Read/Write Split Router (readwritesplit):
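A sketch assembled from the parameters documented below (uppercase values are placeholders; server names are taken from this document):

$ maxctrl create service query_router_service readwritesplit \
   user=MAXSCALE_USER \
   password=MAXSCALE_USER_PASSWORD \
   --servers mcs1 mcs2 mcs3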

    In this example:

    • query_router_service is an arbitrary name that is used to identify the new service.

    • readwritesplit is the name of the module that implements the Read/Write Split Router.

    • user=MAXSCALE_USER sets the user parameter to the database user account that MaxScale uses to connect to the ColumnStore nodes.

    • password=MAXSCALE_USER_PASSWORD sets the password parameter to the password used by the database user account that MaxScale uses to connect to the ColumnStore nodes.

    • --servers sets the servers parameter to the set of nodes to which MaxScale should route queries. All non-option arguments after --servers are interpreted as server names.

    • Other Module Parameters supported by readwritesplit in MaxScale 22.08 can also be specified.

    Configure a Listener for the Read/Write Split Router

    These instructions reference TCP port 3307. You can use a different TCP port. The TCP port used must not be bound by any other listener.

    On the MaxScale node, use the maxctrl create listener command to configure MaxScale to use a listener for the Read/Write Split Router (readwritesplit):
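A sketch assembled from the parameters documented below:

$ maxctrl create listener query_router_service query_router_listener 3307 \
   protocol=MariaDBClient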

    In this example:

    • query_router_service is the name of the readwritesplit service that was previously created.

    • query_router_listener is an arbitrary name that is used to identify the new listener.

    • 3307 is the TCP port.

    • protocol=MariaDBClient sets the protocol parameter.

    • Other Module Parameters supported by listeners in MaxScale 22.08 can also be specified.

    Start Services

    To start the services and monitors, on the MaxScale node use maxctrl start services:
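For example:

$ maxctrl start services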

    Next Step

    Navigation in the procedure "Deploy ColumnStore Shared Local Storage Topology".

    This page was step 7 of 9.

Next: Step 8: Test MariaDB MaxScale.

    Replace the Default Configuration File

    MariaDB MaxScale installations include a configuration file with some example objects. This configuration file should be replaced.

    On the MaxScale node, replace the default /etc/maxscale.cnf with the following configuration:

    For additional information, see "Global Parameters".

    Restart MaxScale

    On the MaxScale node, restart the MaxScale service to ensure that MaxScale picks up the new configuration:

    For additional information, see "Start and Stop Services".

    Configure Server Objects

On the MaxScale node, use the maxctrl create server command to create a server object for each Enterprise ColumnStore node:

    Configure MariaDB Monitor

    MaxScale uses monitors to retrieve additional information from the servers. This information is used by other services in filtering and routing connections based on the current state of the node. For MariaDB Enterprise ColumnStore, use the MariaDB Monitor (mariadbmon).

    On the MaxScale node, use maxctrl create monitor to create a MariaDB Monitor:

    In this example:

    • columnstore_monitor is an arbitrary name that is used to identify the new monitor.

    • mariadbmon is the name of the module that implements the MariaDB Monitor.

    • user=MAXSCALE_USER sets the user parameter to the database user account that MaxScale uses to monitor the ColumnStore nodes.

    • password='MAXSCALE_USER_PASSWORD' sets the password parameter to the password used by the database user account that MaxScale uses to monitor the ColumnStore nodes.

• replication_user=REPLICATION_USER sets the replication_user parameter to the database user account that MaxScale uses to set up replication.

• replication_password='REPLICATION_USER_PASSWORD' sets the replication_password parameter to the password used by the database user account that MaxScale uses to set up replication.

    • --servers sets the servers parameter to the set of nodes that MaxScale should monitor. All non-option arguments after --servers are interpreted as server names.

    • Other Module Parameters supported by mariadbmon in MaxScale 22.08 can also be specified.

    Choose a MaxScale Router

    Routers control how MaxScale balances the load between Enterprise ColumnStore nodes. Each router uses a different approach to routing queries. Consider the specific use case of your application and database load and select the router that best suits your needs.

    Router
    Configuration Procedure
    Description

    Connection-based load balancing

    • Routes connections to Enterprise ColumnStore nodes designated as replica servers for a read-only pool

• Routes connections to an Enterprise ColumnStore node designated as the primary server for a read-write pool.

    Query-based load balancing

    • Routes write queries to an Enterprise ColumnStore node designated as the primary server

• Routes read queries to Enterprise ColumnStore nodes designated as replica servers

    • Automatically reconnects after node failures

    Configure Read Connection Router

    Use MaxScale Read Connection Router (readconnroute) to route connections to replica servers for a read-only pool.

    On the MaxScale node, use maxctrl create service to create a router:

    In this example:

    • connection_router_service is an arbitrary name that is used to identify the new service.

    • readconnroute is the name of the module that implements the Read Connection Router.

    • user=MAXSCALE_USER sets the user parameter to the database user account that MaxScale uses to connect to the ColumnStore nodes.

    • password=MAXSCALE_USER_PASSWORD sets the password parameter to the password used by the database user account that MaxScale uses to connect to the ColumnStore nodes.

    • router_options=slave sets the router_options parameter to slave, so that MaxScale only routes connections to the replica nodes.

    • --servers sets the servers parameter to the set of nodes to which MaxScale should route connections. All non-option arguments after --servers are interpreted as server names.

    • Other Module Parameters supported by readconnroute in MaxScale 22.08 can also be specified.

    Configure Listener for the Read Connection Router

    These instructions reference TCP port 3308. You can use a different TCP port. The TCP port used must not be bound by any other listener.

    On the MaxScale node, use the maxctrl create listener command to configure MaxScale to use a listener for the Read Connection Router (readconnroute):

    In this example:

    • connection_router_service is the name of the readconnroute service that was previously created.

    • connection_router_listener is an arbitrary name that is used to identify the new listener.

    • 3308 is the TCP port.

    • protocol=MariaDBClient sets the protocol parameter.

    • Other Module Parameters supported by listeners in MaxScale 22.08 can also be specified.

    Configure Read/Write Split Router for Queries

    MaxScale Read/Write Split Router (readwritesplit) performs query-based load balancing. The router routes write queries to the primary and read queries to the replicas.

    On the MaxScale node, use the maxctrl create service command to configure MaxScale to use the Read/Write Split Router (readwritesplit):

    In this example:

    • query_router_service is an arbitrary name that is used to identify the new service.

    • readwritesplit is the name of the module that implements the Read/Write Split Router.

    • user=MAXSCALE_USER sets the user parameter to the database user account that MaxScale uses to connect to the ColumnStore nodes.

    • password=MAXSCALE_USER_PASSWORD sets the password parameter to the password used by the database user account that MaxScale uses to connect to the ColumnStore nodes.

    • --servers sets the servers parameter to the set of nodes to which MaxScale should route queries. All non-option arguments after --servers are interpreted as server names.

    • Other Module Parameters supported by readwritesplit in MaxScale 22.08 can also be specified.

    Configure a Listener for the Read/Write Split Router

    These instructions reference TCP port 3307. You can use a different TCP port. The TCP port used must not be bound by any other listener.

    On the MaxScale node, use the maxctrl create listener command to configure MaxScale to use a listener for the Read/Write Split Router (readwritesplit):

    In this example:

    • query_router_service is the name of the readwritesplit service that was previously created.

    • query_router_listener is an arbitrary name that is used to identify the new listener.

    • 3307 is the TCP port.

    • protocol=MariaDBClient sets the protocol parameter.

    • Other Module Parameters supported by listeners in MaxScale 22.08 can also be specified.

    Start Services

    To start the services and monitors, on the MaxScale node use maxctrl start services:

    Next Step

    Navigation in the procedure "Deploy ColumnStore Object Storage Topology":

    This page was step 7 of 9.

    Next: Step 8: Test MariaDB MaxScale

    Next, the ColumnStore table, which is where the CTE issue arises:

    Attempting to run a recursive CTE directly on the employees (ColumnStore) table:

    This will result in the aforementioned error:

    Workarounds

    Here are three potential workarounds to address the recursive CTE limitation with MariaDB ColumnStore.

    Option 1: Toggle ColumnStore Select Handler

    You can temporarily bypass ColumnStore's SELECT handler by disabling it at the session level before executing your recursive CTE and then re-enabling it afterwards.

    Note: This workaround may not always be effective, as its success can depend on the specific MariaDB server version and table definitions.
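A sketch of the toggle, assuming the columnstore_select_handler session variable:

SET SESSION columnstore_select_handler = OFF;
-- ... run the recursive CTE here ...
SET SESSION columnstore_select_handler = ON;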

    Option 2: Use Procedural Simulation via Temporary Table

    If direct recursive CTEs fail or cause server crashes, you can simulate the recursive logic using a stored procedure and a temporary table. This approach iteratively populates the hierarchy.

    First, create a temporary table to store the hierarchical data:

    Next, create a stored procedure to iteratively populate the temp_org_chart table:

    Finally, call the stored procedure and then select from the populated temporary table:

    Option 3: Clone Data into InnoDB

    Another robust workaround is to clone the structure and data of the ColumnStore table into an InnoDB table. Once the data resides in an InnoDB table, you can execute the recursive CTE as usual, as InnoDB fully supports them.

    This approach involves a few steps, often executed via shell commands interacting with the MariaDB client:

    1. Extract and Modify CREATE TABLE Statement: Use SHOW CREATE TABLE to get the definition of your ColumnStore table, then modify it to change the engine to InnoDB and give the new table a different name (e.g., employees2).

    2. Create New Table and Copy Data: Execute the modified CREATE TABLE script to create the new InnoDB table, then insert all data from the original ColumnStore table into it.

    3. Run Recursive CTE on the InnoDB Table: Now, with the data in employees2 (an InnoDB table), you can run your recursive CTE without issues.
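    A combined sketch of the three steps, assuming the employees table from the earlier example:

    # 1. Extract the DDL, switch the engine to InnoDB, and rename the table
    mariadb test -qsNe "SHOW CREATE TABLE employees" \
      | awk -F '\t' '{print $2}' \
      | sed -e 's/ENGINE=Columnstore/ENGINE=InnoDB/' \
            -e 's/CREATE TABLE `employees`/CREATE TABLE `employees2`/' \
      > create_employees2.sql

    # 2. Create the InnoDB table and copy the data
    mariadb test < create_employees2.sql
    mariadb test -e "INSERT INTO employees2 SELECT * FROM employees"

    # 3. The recursive CTE can now be run against employees2 as usual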

    MaxScale CDC Data Adapter

    The MaxScale CDC Data Adapter has been deprecated.

    The MaxScale CDC Data Adapter allows streaming of change data events (binary log events) from a MariaDB master server hosting non-ColumnStore engines (InnoDB, MyRocks, MyISAM) to MariaDB ColumnStore. In other words, it replicates data from a MariaDB master server to MariaDB ColumnStore. It acts as a CDC client for MaxScale and uses the events received from MaxScale as input to the MariaDB ColumnStore Bulk Data API to push the data to MariaDB ColumnStore.

    It registers with MariaDB MaxScale as a CDC client using the MaxScale CDC Connector API, and receives change data records from MariaDB MaxScale (converted from binlog events received from the master on MariaDB TX) in JSON format. Then, using the MariaDB ColumnStore bulk write SDK, it converts the JSON data into API calls and streams them to a MariaDB PM node. The adapter can either insert the events using the same schema as the source database table, or insert each event with metadata as well as table data. The event metadata includes the event timestamp, the GTID, the event sequence, and the event type (insert, update, or delete).

    Installation

    Pre-requisite:

    • Download and install the MaxScale CDC Connector API.

    • Download and install the MariaDB ColumnStore bulk write SDK.

    CentOS 7:
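    sudo yum -y install epel-release
    sudo yum -y install <data adapter>.rpm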

    Debian 9/Ubuntu Xenial:
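    sudo apt-get update
    sudo dpkg -i <data adapter>.deb
    sudo apt-get -f install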

    Debian 8:
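    # append the backports repo (tee is used so the redirect runs with elevated privileges)
    echo "deb http://httpredir.debian.org/debian jessie-backports main contrib non-free" | sudo tee -a /etc/apt/sources.list
    sudo apt-get update
    sudo dpkg -i <data adapter>.deb
    sudo apt-get -f install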

    Usage

    Streaming Multiple Tables

    To stream multiple tables, use the -f parameter to define a path to a TSV formatted file. The file must have one database and one table name per line. The database and table must be separated by a TAB character and the line must be terminated in a newline (\n).

    Here is an example file with two tables, t1 and t2 both in the test database:
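    test	t1
    test	t2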

    Automated Table Creation on ColumnStore

    You can have the adapter automatically create the tables on the ColumnStore instance with the -a option. In this case, the user used for cross-engine queries (the values in Columnstore.CrossEngineSupport) will be used to create the table. This user requires CREATE privileges on all streamed databases and tables.

    Data Transformation Mode

    The -z option enables the data transformation mode. In this mode, the data is converted from historical, append-only data to the current version of the data. In practice, this replicates changes from a MariaDB master server to ColumnStore via the MaxScale CDC.

    This mode is not as fast as the append-only mode and might not be suitable for heavy workloads. This is due to the fact that the data transformation is done via various DML statements.

    Quick Start

    Download and install both MaxScale and ColumnStore.

    Copy /usr/local/mariadb/columnstore/etc/Columnstore.xml from one of the ColumnStore PrimProc nodes to the server where the adapter is installed.

    Configure MaxScale for CDC according to the MaxScale (avrorouter) documentation.

    Create a CDC user by executing the following MaxAdmin command on the MaxScale server. Replace the <service> with the name of the avrorouter service and <user> and <password> with the credentials that are to be created.
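    maxadmin call command cdc add_user <service> <user> <password>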

    Then we can start the adapter by executing the following command.
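    mxs_adapter -u <user> -p <password> -h <host> -P <port> -c <path to Columnstore.xml> <database> <table>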

    The <database> and <table> define the table that is streamed to ColumnStore. This table should exist on the master server where MaxScale is reading events from. If the table is not created on ColumnStore, the adapter will print instructions on how to define it in the correct way.

    The <user> and <password> are the credentials created for the CDC user, <host> is the MaxScale address, and <port> is the port where the CDC service listener is listening.

    The -c flag is optional if you are running the adapter on the server where ColumnStore is located.

    Kafka to ColumnStore Adapter

    The Kafka data adapter streams all messages published to Apache Kafka topics in Avro format to MariaDB ColumnStore automatically and continuously, enabling data from many sources to be streamed and collected for analysis without complex code. The Kafka adapter is built using librdkafka and the MariaDB ColumnStore bulk write SDK.

    A tutorial for the Kafka adapter for ingesting Avro formatted data can be found in the kafka-to-columnstore-data-adapter document.

    ColumnStore - Pentaho Data Integration - Data Adapter

    Starting with MariaDB ColumnStore 1.1.4, a data adapter for Pentaho Data Integration (PDI) / Kettle is available to import data directly into ColumnStore’s WriteEngine. It is built on MariaDB’s rapid-paced Bulk Write SDK.

    PDI Plugin Block info graphic

    Compatibility notice

    The plugin was designed for the following software composition:

    • Operating system: Windows 10 / Ubuntu 16.04 / RHEL/CentOS 7+

    • MariaDB ColumnStore >= 1.1.4

    • MariaDB Java Database client* >= 2.2.1

    • Java >= 8

    • Pentaho Data Integration >= 7 (not officially supported by Pentaho)

    *Only needed if you want to execute DDL.

    Installation

    The following steps are necessary to install the ColumnStore Data adapter (bulk loader plugin):

    1. Build the plugin from source or download it from our website

    2. Extract the archive mariadb-columnstore-kettle-bulk-exporter-plugin-*.zip into your PDI installation directory $PDI-INSTALLATION/plugins.

    3. Copy MariaDB's JDBC Client mariadb-java-client-2.2.x.jar into PDI's lib directory $PDI-INSTALLATION/lib.

    4. Install the additional library dependencies

    Ubuntu dependencies
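    sudo apt-get install libuv1 libxml2 libsnappy1v5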

    CentOS dependencies
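    sudo yum install epel-release
    sudo yum install libuv libxml2 snappy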

    Windows 10 dependencies

    On Windows the installation of the Visual Studio 2015/2017 C++ Redistributable (x64) is required.

    Configuration

    Each MariaDB ColumnStore Bulk Loader block needs to be configured. On the one hand, it needs to know how to connect to the underlying Bulk Write SDK to inject data into ColumnStore, and on the other hand, it needs to have a proper JDBC connection to execute DDL.

    Both configurations can be set in each block’s settings tab.

    PDI Plugin Block settings info graphic

    The database connection configuration follows PDI’s default schema.

    By default, the plugin tries to use ColumnStore's default configuration /usr/local/mariadb/columnstore/etc/Columnstore.xml to connect to the ColumnStore instance through the Bulk Write SDK. In addition, individual paths or variables can be used too.

    Information on how to prepare the Columnstore.xml configuration file is available in the ColumnStore documentation.

    Usage

    PDI Plugin Block mapping info graphic

    Once a block is configured and all inputs are connected in PDI, the inputs have to be mapped to ColumnStore’s table format.

    One can either choose “Map all inputs”, which sets target columns of adequate type, or choose a custom mapping based on the structure of the existing table.

    The SQL button can be used to generate DDL based on the defined mapping and to execute it.

    Limitations

    This plugin is a beta release. It can't handle blob data types, and it only supports multiple inputs to one block if the input field names are equal for all input sources.

    ColumnStore Bulk Data API

    With Query Accelerator, the ColumnStore optimizer:

    • receives a query;

    • searches for applicable Engine Independent statistics for an InnoDB table index column;

    • applies an RBO (rule-based optimizer) rule that transforms the query's InnoDB table accesses into a number of UNION queries over non-overlapping ranges of a suitable InnoDB table index;

    • retrieves the data from MariaDB in parallel and runs the query using the ColumnStore runtime.

    Queries Benefitting From Query Accelerator

    Query Accelerator improves the performance of queries that use aggregation functions such as SUM, AVG, MIN, and MAX with GROUP BY, where the overhead of pulling the data out of InnoDB is outweighed by the performance gain of running the query in the ColumnStore engine.

    This avoids the bottleneck of maintaining a pipeline to move data out of InnoDB and into ColumnStore. Query Accelerator parallelizes reads from InnoDB by using table statistics to assign multiple threads to distinct data ranges on disk. If the InnoDB table in question has a suitable index, Query Accelerator can retrieve the data much faster.

    Example of a query benefitting from Query Accelerator (assuming column_a is indexed):
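    SELECT column_a, SUM(column_b) FROM innodb_table GROUP BY column_a;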

    The effectiveness of Query Accelerator can vary depending on the type of queries you run and the specific characteristics of your database schema. Certain types of queries or configurations may not benefit from Query Accelerator, or could potentially experience decreased performance. It's essential to understand when Query Accelerator is most advantageous and when traditional InnoDB operations might be more efficient. Consider the following points to optimize query performance with Query Accelerator:

    • Make sure your query uses indexed tables, and that the first column of the index key is an integer column.

    • Also, run ANALYZE TABLE before running Query Accelerator.

    Queries not to run in Query Accelerator

    Performance Issues

    Performance issues occur for queries like this:
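    SELECT column_a FROM tbl WHERE column_a = column_b;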

    InnoDB generally handles such column-to-column comparisons much better than ColumnStore, and under Query Accelerator the performance would be even worse.

    • Generally, if your query takes longer than a minute in InnoDB, try Query Accelerator.

    Queries not Working in Query Accelerator

    Query Accelerator has the same limitations as ColumnStore in general, in that it has a limited set of functions and data types it can handle. Therefore, be aware of

    • syntax or functions that Columnstore does not support;

    • data types ColumnStore does not support.

    Enabling Query Accelerator

    1

    Edit the MariaDB configuration file (my.cnf or my.ini)

    Locate (or create) the mariadb section, and add a line enabling Query Accelerator, like this:
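    [mariadb]
    columnstore_innodb_queries_use_mcs = on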

    Restart MariaDB Server for the change to take effect.

    2

    Run queries to turn on Query Accelerator

    Set these parameters in a client session:
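    SET columnstore_unstable_optimizer=ON;
    SET optimizer_switch="index_merge=off,index_merge_union=off,index_merge_sort_union=off,index_merge_intersection=off,index_merge_sort_intersection=off,index_condition_pushdown=off,derived_merge=off,derived_with_keys=off,firstmatch=off,loosescan=off,materialization=on,in_to_exists=off,semijoin=off,partial_match_rowid_merge=off,partial_match_table_scan=off,subquery_cache=off,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=off,semijoin_with_cache=off,join_cache_incremental=off,join_cache_hashed=off,join_cache_bka=off,optimize_join_buffer_size=off,table_elimination=off,extended_keys=off,exists_to_in=off,orderby_uses_equalities=off,condition_pushdown_for_derived=on,split_materialized=off,condition_pushdown_for_subquery=off,rowid_filter=off,condition_pushdown_from_having=on,not_null_range_scan=off,hash_join_cardinality=off,cset_narrowing=off,sargable_casefold=off";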

    In future versions of Query Accelerator, those SET statements will be wrapped in stored procedures, allowing you to turn Query Accelerator on and off with simpler commands.

    To use Query Accelerator just for one query, you have to run those SET statements per query, not per session. Setting them per session effectively disables the MariaDB Optimizer for subsequent queries that ColumnStore cannot execute.

    Enabling Processing for InnoDB Tables

    There must be engine-independent statistics for an InnoDB table index column so that it can be used for Query Accelerator.
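    For example (table_name and column_name are placeholders):

    ANALYZE TABLE table_name PERSISTENT FOR COLUMNS (column_name) indexes();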

    Control Client Session Variables and Parameters

    • columnstore_unstable_optimizer enables unstable optimizer that is required for Query Accelerator RBO rule.

    • columnstore_select_handler enables/disables ColumnStore processing for InnoDB tables.

    • columnstore_query_accel_parallel_factor controls the number of parallel ranges to be used for Query Accelerator.

    Watch out for max_connections. If you set columnstore_query_accel_parallel_factor to a high value, you may need to increase max_connections to avoid connection pool exhaustion.

    Verifying That Query Accelerator is Being Used

    There are two ways to verify Query Accelerator is being used:

    1. Use select mcs_get_plan('rules') to get a list of the rules that were applied to the query.

    2. Look for patterns like derived table - $added_sub_#db_name_#table_name_X in the optimized plan using select mcs_get_plan('optimized').
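    For example:

    SELECT mcs_get_plan('rules');
    SELECT mcs_get_plan('optimized');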

    Query Accelerator Quick Start

    This example shows a SUM(x) GROUP BY y query that runs in ~2.6 s on InnoDB with indexes, and about 3x faster (~0.7 s) via ColumnStore query acceleration, provided there is enough CPU and a high enough parallel factor.

    1

    In mariadb (MariaDB command-line client), run these statements:

    2

    Turn on Query Accelerator - On CLI:
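    # enable the setting in the ColumnStore configuration file, then restart the server
    sed -i 's/^#columnstore_innodb_queries_use_mcs = on/columnstore_innodb_queries_use_mcs = on/' /etc/my.cnf.d/columnstore.cnf
    systemctl restart mariadb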

    3

    In mariadb (MariaDB command-line client), run these statements:

    4

    Log out of mariadb (MariaDB command-line client), and log in again.

    5

    In mariadb (MariaDB command-line client), run these statements:

    6

    Turn off Query Accelerator - On CLI:
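    A sketch that re-comments the setting enabled in step 2, then restarts the server:

    sed -i 's/^columnstore_innodb_queries_use_mcs = on/#columnstore_innodb_queries_use_mcs = on/' /etc/my.cnf.d/columnstore.cnf
    systemctl restart mariadb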

    Quick Verifications

    1

    Tail the ColumnStore log debug.log, and confirm parallel access to InnoDB:
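    tail -f /var/log/mariadb/columnstore/debug.log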

    Increase or decrease parallelism with columnstore_ces_optimization_parallel_factor. Keep in mind you need enough max_connections in MariaDB server:
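    SET columnstore_ces_optimization_parallel_factor=100;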

    2

    Check the execution plan via EXPLAIN FORMAT=JSON. It should say Pushed select:
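    For example, with the demo table from this Quick Start:

    EXPLAIN FORMAT=JSON SELECT c_zip, SUM(c_payment_cnt) FROM test.customer_indexed GROUP BY c_zip ORDER BY c_zip;

    {
      "query_block": {
        "select_id": 1,
        "table": {
          "message": "Pushed select"
        }
      }
    }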

    3

    Verify that mcs_get_plan shows parallel_ces, and that the detailed ColumnStore execution plan shows derived table:

    Enterprise ColumnStore supports several methods for loading data:

    Method
    Data Source
    Location
    Notes
    Speed
    Interface

    columnstore_info.load_from_s3

    • Text file.

    • S3-compatible object storage

    • Loads data from the cloud. • Translates operation to cpimport command. • Non-blocking

    Fast

    SQL

    LOAD DATA INFILE

    • Text file.

    • Server file system • Client file system

    • Translates operation to cpimport command. • Non-blocking

    Slow

    SQL

    INSERT...SELECT

    • Other table(s).

    • Same MariaDB server

    • Translates operation to cpimport command. • Non-blocking

    Slow

    SQL


    ColumnStore Partition Management

    Introduction

    MariaDB ColumnStore automatically creates logical horizontal partitions across every column. For ordered or semi-ordered data fields such as an order date this will result in a highly effective partitioning scheme based on that column. This allows for increased performance of queries filtering on that column since partition elimination can be performed. It also allows for data lifecycle management as data can be disabled or dropped by partition cheaply. Caution should be used when disabling or dropping partitions as these commands are destructive.

    It is important to understand that a Partition in ColumnStore terms is actually 2 extents (16 million rows) and that extents & partitions are created according to the following algorithm in 1.0.x:

    1. Create 4 extents in 4 files

    2. When these are filled up (after 32M rows), create 4 more extents in the 4 files created in step 1.

    3. When these are filled up (after 64M rows), create a new partition.

    Managing Partitions by Partition Number

    Displaying Partitioning Information

    Information about all partitions for a given column can be retrieved using the calShowPartitions stored procedure which takes either two or three mandatory parameters: [database_name], table_name, and column_name. If two parameters are provided the current database is assumed. For example:
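    -- hypothetical names: table orders, column o_orderdate, in the current database
    SELECT calShowPartitions('orders', 'o_orderdate');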

    Enabling Partitions

    The calEnablePartitions stored procedure allows for enabling of one or more partitions. The procedure takes the same set of parameters as calDisablePartitions.

    For example:

    The result showing the first partition has been enabled:

    Disabling Partitions

    The calDisablePartitions stored procedure allows for disabling of one or more partitions. A disabled partition still exists on the file system (and can be enabled again at a later time) but will not participate in any query, DML or import activity. The procedure takes either two or three mandatory parameters: [database_name], table_name, and partition_numbers separated by commas. If two parameters are provided the current database is assumed.

    For example:

    The result showing the first partition has been disabled:

    Dropping Partitions

    The calDropPartitions stored procedure allows for dropping of one or more partitions. Dropping means that the underlying storage is deleted and the partition is completely removed. A partition can be dropped from either enabled or disabled state. The procedure takes the same set of parameters as calDisablePartitions. Extra caution should be used with this procedure since it is destructive and cannot be reversed.

    For example:

    The result showing the first partition has been dropped:

    Managing Partitions by Column Value

    Displaying Partitioning Information

    Information about a range of partitions for a given column can be retrieved using the calShowPartitionsByValue stored procedure. This procedure takes either four or five mandatory parameters: [database_name], table_name, column_name, start_value, and end_value. If four parameters are provided, the current database is assumed. Only casual partition column types (integer types, DECIMAL, DATE, DATETIME, CHAR up to 8 bytes, and VARCHAR up to 7 bytes) are supported for this function.

    The function returns a list of partitions whose minimum and maximum values for the given column fall completely within the range of start_value and end_value.

    For example:

    Enabling Partitions

    The calEnablePartitionsByValue stored procedure allows for enabling of one or more partitions by value. The procedure takes the same set of arguments as calShowPartitionsByValue.

    A good practice is to use calShowPartitionsByValue to identify the partitions to be enabled, and then use the same argument values to construct the calEnablePartitionsByValue call.

    For example:

    The result showing the first partition has been enabled:

    Disabling Partitions

    The calDisablePartitionsByValue stored procedure allows for disabling of one or more partitions by value. A disabled partition still exists on the file system (and can be enabled again at a later time) but will not participate in any query, DML or import activity. The procedure takes the same set of arguments as calShowPartitionsByValue.

    A good practice is to use calShowPartitionsByValue to identify the partitions to be disabled, and then use the same argument values to construct the calDisablePartitionsByValue call. For example:

    The result showing the first partition has been disabled:

    Dropping Partitions

    The calDropPartitionsByValue stored procedure allows for dropping of one or more partitions by value. Dropping means that the underlying storage is deleted and the partition is completely removed. A partition can be dropped from either enabled or disabled state. The procedure takes the same set of arguments as calShowPartitionsByValue. A good practice is to use calShowPartitionsByValue to identify the partitions to be dropped, and then use the same argument values to construct the calDropPartitionsByValue call. Extra caution should be used with this procedure since it is destructive and cannot be reversed.

    For example:

    The result showing the first partition has been dropped:

    Dropping Data Outside of Partitions

    Since the partitioning scheme is system-maintained, the minimum and maximum values are not directly specified, but influenced by the order of data loading. If you want to drop a specific date range, additional deletes are required to achieve this. The following cases may occur:

    • For semi-ordered data, there may be overlap between minimum and maximum values between partitions.

    • As in the example above, the partition ranges from 1992-01-01 to 1998-08-02. It may be desirable to drop the remaining 1998 rows.

    A bulk-delete statement can be used to delete the remaining rows that do not fall exactly within partition ranges. The partition drops will be fastest; however, the system optimizes bulk-delete statements to delete by block internally. This is still relatively fast.

    MariaDB Enterprise ColumnStore Query Evaluation

    Overview

    MariaDB Enterprise ColumnStore is a smart storage engine designed to efficiently execute analytical queries using distributed query execution and massively parallel processing (MPP) techniques.

    Scalability

    Multi-Node S3

    This guide provides steps for deploying a multi-node S3 ColumnStore, setting up the environment, installing the software, and bulk importing data for online analytical processing (OLAP) workloads.

    Overview

    This procedure describes the deployment of the Single-Node Enterprise ColumnStore topology with Object storage.

    MariaDB Enterprise ColumnStore 23.10 is a columnar storage engine for MariaDB Enterprise Server. Enterprise ColumnStore is best suited for Online Analytical Processing (OLAP) workloads.

    This procedure has 5 steps, which are executed in sequence.

    This page provides an overview of the topology, requirements, and deployment procedures.

    Please read and understand this procedure before executing.

    Upgrade Multi-Node MariaDB Enterprise ColumnStore from 6 to 23.10

    These instructions detail the upgrade from MariaDB Enterprise ColumnStore 6 to MariaDB Enterprise ColumnStore 23.10 in a Multi-Node topology on a range of supported operating systems.

    Set Replicas to Maintenance Mode

    This action is performed for each replica server on the MaxScale node.

    Prior to upgrading, the replica servers must be placed in maintenance mode in MaxScale. If you are using MaxCtrl, the replicas can be set to maintenance mode using the set server command.
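    For example, using MaxCtrl with a hypothetical replica server named mcs2:

    maxctrl set server mcs2 maintenance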

    Step 5: Test MariaDB Enterprise Server

    Overview

    This page details step 5 of the 9-step procedure "Deploy ColumnStore Shared Local Storage Topology".

    This step tests MariaDB Enterprise Server and MariaDB Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Apr 30 21:54:35 a1ebc96a2519 PrimProc[1004]: 35.668435 |0|0|0| C 28 CAL0000: Error total memory available is less than 3GB.
    ERROR 1815 (HY000): Internal error: System is not ready yet. Please try again.
    maxctrl unlink service \
       mcs_service \
       mcs3
    maxctrl show services
    maxctrl unlink monitor \
       mcs_monitor \
       mcs3
    maxctrl show monitors
    maxctrl destroy server \
       mcs3
    maxctrl show servers
    sudo systemctl stop mariadb
    sudo systemctl stop mariadb-columnstore
    sudo systemctl stop mariadb-columnstore-cmapi
    curl -k -s -X DELETE https://mcs1:8640/cmapi/0.4.0/cluster/node \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       --data '{"timeout":20, "node": "192.0.2.3"}' \
       | jq .
    curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/remove-node \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       --data '{"timeout":20, "node": "192.0.2.3"}' \
       | jq .
    {
      "timestamp": "2020-10-28 00:39:14.672142",
      "node_id": "192.0.2.3"
    }
    curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       | jq .
    {
      "timestamp": "2020-12-15 00:40:34.353574",
      "192.0.2.1": {
        "timestamp": "2020-12-15 00:40:34.362374",
        "uptime": 11467,
        "dbrm_mode": "master",
        "cluster_mode": "readwrite",
        "dbroots": [
          "1"
        ],
        "module_id": 1,
        "services": [
          {
            "name": "workernode",
            "pid": 19202
          },
          {
            "name": "controllernode",
            "pid": 19232
          },
          {
            "name": "PrimProc",
            "pid": 19254
          },
          {
            "name": "ExeMgr",
            "pid": 19292
          },
          {
            "name": "WriteEngine",
            "pid": 19316
          },
          {
            "name": "DMLProc",
            "pid": 19332
          },
          {
            "name": "DDLProc",
            "pid": 19366
          }
        ]
      },
      "192.0.2.2": {
        "timestamp": "2020-12-15 00:40:34.428554",
        "uptime": 11437,
        "dbrm_mode": "slave",
        "cluster_mode": "readonly",
        "dbroots": [
          "2"
        ],
        "module_id": 2,
        "services": [
          {
            "name": "workernode",
            "pid": 17789
          },
          {
            "name": "PrimProc",
            "pid": 17813
          },
          {
            "name": "ExeMgr",
            "pid": 17854
          },
          {
            "name": "WriteEngine",
            "pid": 17877
          }
        ]
      },
      "num_nodes": 2
    }
    $ sudo testS3Connection
    StorageManager[26887]: Using the config file found at /etc/columnstore/storagemanager.cnf
    StorageManager[26887]: S3Storage: S3 connectivity & permissions are OK
    S3 Storage Manager Configuration OK
    $ systemctl status mariadb
    $ sudo systemctl start mariadb
    $ sudo mariadb
    Welcome to the MariaDB monitor.  Commands end with ; or \g.
    Your MariaDB connection id is 38
    Server version: 11.4.5-3-MariaDB-Enterprise MariaDB Enterprise Server
    
    Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
    
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    
    MariaDB [(none)]>
    SELECT PLUGIN_NAME, PLUGIN_STATUS
    FROM information_schema.PLUGINS
    WHERE PLUGIN_LIBRARY LIKE 'ha_columnstore%';
    
    +---------------------+---------------+
    | PLUGIN_NAME         | PLUGIN_STATUS |
    +---------------------+---------------+
    | Columnstore         | ACTIVE        |
    | COLUMNSTORE_COLUMNS | ACTIVE        |
    | COLUMNSTORE_TABLES  | ACTIVE        |
    | COLUMNSTORE_FILES   | ACTIVE        |
    | COLUMNSTORE_EXTENTS | ACTIVE        |
    +---------------------+---------------+
    $ systemctl status mariadb-columnstore-cmapi
    $ sudo systemctl start mariadb-columnstore-cmapi
    $ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       | jq .
    
    {
      "timestamp": "2020-12-15 00:40:34.353574",
      "192.0.2.1": {
        "timestamp": "2020-12-15 00:40:34.362374",
        "uptime": 11467,
        "dbrm_mode": "master",
        "cluster_mode": "readwrite",
        "dbroots": [
          "1"
        ],
        "module_id": 1,
        "services": [
          {
            "name": "workernode",
            "pid": 19202
          },
          {
            "name": "controllernode",
            "pid": 19232
          },
          {
            "name": "PrimProc",
            "pid": 19254
          },
          {
            "name": "ExeMgr",
            "pid": 19292
          },
          {
            "name": "WriteEngine",
            "pid": 19316
          },
          {
            "name": "DMLProc",
            "pid": 19332
          },
          {
            "name": "DDLProc",
            "pid": 19366
          }
        ]
      },
      "192.0.2.2": {
        "timestamp": "2020-12-15 00:40:34.428554",
        "uptime": 11437,
        "dbrm_mode": "slave",
        "cluster_mode": "readonly",
        "dbroots": [
          "2"
        ],
        "module_id": 2,
        "services": [
          {
            "name": "workernode",
            "pid": 17789
          },
          {
            "name": "PrimProc",
            "pid": 17813
          },
          {
            "name": "ExeMgr",
            "pid": 17854
          },
          {
            "name": "WriteEngine",
            "pid": 17877
          }
        ]
      },
      "192.0.2.3": {
        "timestamp": "2020-12-15 00:40:34.428554",
        "uptime": 11437,
        "dbrm_mode": "slave",
        "cluster_mode": "readonly",
        "dbroots": [
          "2"
        ],
        "module_id": 2,
        "services": [
          {
            "name": "workernode",
            "pid": 17789
          },
          {
            "name": "PrimProc",
            "pid": 17813
          },
          {
            "name": "ExeMgr",
            "pid": 17854
          },
          {
            "name": "WriteEngine",
            "pid": 17877
          }
        ]
      },
      "num_nodes": 3
    }
    $ sudo mariadb
    CREATE DATABASE IF NOT EXISTS test;
    
    CREATE TABLE IF NOT EXISTS test.contacts (
       first_name VARCHAR(50),
       last_name VARCHAR(50),
       email VARCHAR(100)
    ) ENGINE = ColumnStore;
    $ sudo mariadb
    SHOW CREATE TABLE test.contacts\G
    $ sudo mariadb
    INSERT INTO test.contacts (first_name, last_name, email)
       VALUES
       ("Kai", "Devi", "kai.devi@example.com"),
       ("Lee", "Wang", "lee.wang@example.com");
    $ sudo mariadb
    SELECT * FROM test.contacts;
    
    +------------+-----------+----------------------+
    | first_name | last_name | email                |
    +------------+-----------+----------------------+
    | Kai        | Devi      | kai.devi@example.com |
    | Lee        | Wang      | lee.wang@example.com |
    +------------+-----------+----------------------+
    [maxscale]
    threads          = auto
    admin_host       = 0.0.0.0
    admin_secure_gui = false
    $ sudo systemctl restart maxscale
    $ maxctrl create server mcs1 192.0.2.101
    $ maxctrl create server mcs2 192.0.2.102
    $ maxctrl create server mcs3 192.0.2.103
    $ maxctrl create monitor columnstore_monitor mariadbmon \
         user=mxs \
         password='MAXSCALE_USER_PASSWORD' \
         replication_user=repl \
         replication_password='REPLICATION_USER_PASSWORD' \
         --servers mcs1 mcs2 mcs3
    $ maxctrl create service connection_router_service readconnroute \
         user=mxs \
         password='MAXSCALE_USER_PASSWORD' \
         router_options=slave \
         --servers mcs1 mcs2 mcs3
    $ maxctrl create listener connection_router_service connection_router_listener 3308 \
         protocol=MariaDBClient
    $ maxctrl create service query_router_service readwritesplit  \
         user=mxs \
         password='MAXSCALE_USER_PASSWORD' \
         --servers mcs1 mcs2 mcs3
    $ maxctrl create listener query_router_service query_router_listener 3307 \
         protocol=MariaDBClient
    $ maxctrl start services
    CREATE TABLE employees_innodb (
        id INT PRIMARY KEY,
        name VARCHAR(100),
        manager_id INT  -- references employees.id (nullable for top-level)
    );
    
    INSERT INTO employees_innodb (id, name, manager_id) VALUES
    (1, 'CEO', NULL),
    (2, 'VP of Sales', 1),
    (3, 'Sales Rep A', 2),
    (4, 'VP of Eng', 1),
    (5, 'Eng A', 4),
    (6, 'Eng B', 4);
    
    CREATE TABLE employees (
        id INT,
        name VARCHAR(100),
        manager_id INT  -- references employees.id (nullable for top-level)
    ) engine=columnstore;
    
    INSERT INTO employees (id, name, manager_id) VALUES
    (1, 'CEO', NULL),
    (2, 'VP of Sales', 1),
    (3, 'Sales Rep A', 2),
    (4, 'VP of Eng', 1),
    (5, 'Eng A', 4),
    (6, 'Eng B', 4);
    
    WITH RECURSIVE org_chart AS (
        -- Anchor: start with the top-level manager (CEO)
        SELECT id, name, manager_id, 0 AS level
        FROM employees
        WHERE id = 1
    
        UNION ALL
    
        -- Recursive step: find employees who report to the previous level
        SELECT e.id, e.name, e.manager_id, oc.level + 1
        FROM employees e
        JOIN org_chart oc ON e.manager_id = oc.id
    )
    SELECT * FROM org_chart;
    
    ERROR 1178 (42000): The storage engine for the table doesn't support Recursive CTE
    SET SESSION columnstore_select_handler=OFF;
    
    WITH RECURSIVE org_chart AS (
        -- Anchor: start with the top-level manager (CEO)
        SELECT id, name, manager_id, 0 AS level
        FROM employees
        WHERE id = 1
    
        UNION ALL
    
        -- Recursive step: find employees who report to the previous level
        SELECT e.id, e.name, e.manager_id, oc.level + 1
        FROM employees e
        JOIN org_chart oc ON e.manager_id = oc.id
    )
    SELECT * FROM org_chart;
    
    SET SESSION columnstore_select_handler=ON;
    
    CREATE TABLE temp_org_chart (
        id INT,
        name VARCHAR(100),
        manager_id INT,
        level INT
    );
    
    -- Initialize the temporary table with the top-level employees
    INSERT INTO temp_org_chart (id, name, manager_id, level)
    SELECT id, name, manager_id, 0 AS level FROM employees WHERE manager_id IS NULL;
    DELIMITER //
    
    CREATE OR REPLACE PROCEDURE populate_org_chart()
    BEGIN
      DECLARE v_level INT DEFAULT 1;
      DECLARE rows_inserted INT DEFAULT 1;
    
      -- Loop until no more rows are inserted, indicating the hierarchy is fully traversed
      WHILE rows_inserted > 0 DO
    
        -- Insert employees who report to the previous level
        INSERT INTO temp_org_chart (id, name, manager_id, level)
        SELECT e.id, e.name, e.manager_id, v_level
        FROM employees e
        JOIN temp_org_chart t ON e.manager_id = t.id
        WHERE t.level = v_level - 1
          AND NOT EXISTS (
              SELECT 1 FROM temp_org_chart x WHERE x.id = e.id
          );
    
        -- Get the number of rows inserted in the current iteration
        SET rows_inserted = ROW_COUNT();
        -- Increment the level for the next iteration
        SET v_level = v_level + 1;
    
      END WHILE;
    END //
    
    DELIMITER ;
    CALL populate_org_chart();
    SELECT * FROM temp_org_chart;
    mariadb test -qsNe "SHOW CREATE TABLE employees" \
      | awk -F '\t' '{print $2}' \
      | sed -e 's/ENGINE=Columnstore/ENGINE=InnoDB/' \
            -e 's/CREATE TABLE `employees`/CREATE TABLE `employees2`/' \
      > create_employees2.sql
    
    mariadb test < create_employees2.sql
    mariadb test -e "INSERT INTO employees2 SELECT * FROM employees"
    WITH RECURSIVE org_chart AS (
        -- Anchor: start with the top-level manager (CEO)
        SELECT id, name, manager_id, 0 AS level
        FROM employees2
        WHERE id = 1
    
        UNION ALL
    
        -- Recursive step: find employees who report to the previous level
        SELECT e.id, e.name, e.manager_id, oc.level + 1
        FROM employees2 e
        JOIN org_chart oc ON e.manager_id = oc.id
    )
    SELECT * FROM org_chart;
    sudo yum -y install epel-release
    sudo yum -y install <data adapter>.rpm
    sudo apt-get update
    sudo dpkg -i <data adapter>.deb
    sudo apt-get -f install
    echo "deb http://httpredir.debian.org/debian jessie-backports main contrib non-free" | sudo tee -a /etc/apt/sources.list
    sudo apt-get update
    sudo dpkg -i <data adapter>.deb
    sudo apt-get -f install
    Usage: mxs_adapter [OPTION]... DATABASE TABLE
    
     -f FILE      TSV file with database and table names to stream (must be in `database TAB table NEWLINE` format)
      -h HOST      MaxScale host (default: 127.0.0.1)
      -P PORT      Port number where the CDC service listens (default: 4001)
      -u USER      Username for the MaxScale CDC service (default: admin)
      -p PASSWORD  Password of the user (default: mariadb)
      -c CONFIG    Path to the Columnstore.xml file (default: '/usr/local/mariadb/columnstore/etc/Columnstore.xml')
      -a           Automatically create tables on ColumnStore
      -z           Transform CDC data stream from historical data to current data (implies -n)
      -s           Directory used to store the state files (default: '/var/lib/mxs_adapter')
      -r ROWS      Number of events to group for one bulk load (default: 1)
      -t TIME      Connection timeout (default: 10)
      -n           Disable metadata generation (timestamp, GTID, event type)
      -i TIME      Flush data every TIME seconds (default: 5)
      -l FILE      Log output to FILE instead of stdout
      -v           Print version and exit
      -d           Enable verbose debug output
    test	t1
    test	t2
    maxadmin call command cdc add_user <service> <user> <password>
    mxs_adapter -u <user> -p <password> -h <host> -P <port> -c <path to Columnstore.xml> <database> <table>
    sudo apt-get install libuv1 libxml2 libsnappy1v5
    sudo yum install epel-release
    sudo yum install libuv libxml2 snappy
    [mariadb]
    columnstore_innodb_queries_use_mcs = on
    SET columnstore_unstable_optimizer=ON;
    SET optimizer_switch="index_merge=off,index_merge_union=off,index_merge_sort_union=off,index_merge_intersection=off,index_merge_sort_intersection=off,index_condition_pushdown=off,derived_merge=off,derived_with_keys=off,firstmatch=off,loosescan=off,materialization=on,in_to_exists=off,semijoin=off,partial_match_rowid_merge=off,partial_match_table_scan=off,subquery_cache=off,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=off,semijoin_with_cache=off,join_cache_incremental=off,join_cache_hashed=off,join_cache_bka=off,optimize_join_buffer_size=off,table_elimination=off,extended_keys=off,exists_to_in=off,orderby_uses_equalities=off,condition_pushdown_for_derived=on,split_materialized=off,condition_pushdown_for_subquery=off,rowid_filter=off,condition_pushdown_from_having=on,not_null_range_scan=off,hash_join_cardinality=off,cset_narrowing=off,sargable_casefold=off";
    CREATE DATABASE IF NOT EXISTS test; USE test;
    CREATE TABLE IF NOT EXISTS test.customer_indexed (  `c_d_id` int(2) NOT NULL, `c_w_id` int(6) NOT NULL, `c_first` varchar(16) , `c_middle` char(2) , `c_last` varchar(16) , `c_street_1` varchar(20) , `c_street_2` varchar(20) , `c_city` varchar(20) , `c_state` char(2) , `c_zip` int(5) , `c_phone` char(16) , `c_since` datetime DEFAULT NULL, `c_credit` char(2) , `c_credit_lim` decimal(12,2) DEFAULT NULL, `c_discount` decimal(4,4) DEFAULT NULL, `c_balance` decimal(12,2) DEFAULT NULL, `c_ytd_payment` decimal(12,2) DEFAULT NULL, `c_payment_cnt` int(8) DEFAULT NULL, `c_delivery_cnt` int(8) DEFAULT NULL, `c_data` varchar(500)) ENGINE=InnoDB DEFAULT CHARSET=latin1;
    INSERT INTO test.customer_indexed  SELECT  ROUND(RAND() * 42000, 0), ROUND(RAND() * 42000, 0), substring(MD5(RAND()*1000000000),1,16), substring(MD5(RAND()),1,2), substring(MD5(RAND()*1000000000),1,16), substring(MD5(RAND()*1000000000),1,20), substring(MD5(RAND()*1000000000),1,20), substring(MD5(RAND()*1000000000),1,20), substring(MD5(RAND()),1,2), ROUND(RAND() * 42000, 0), substring(MD5(RAND()),1,16), CURRENT_TIMESTAMP - INTERVAL FLOOR(RAND() * 365 * 24 * 60 *60) SECOND, substring(MD5(RAND()),1,2), ROUND(RAND() * 9999999999, 2), ROUND(RAND() * 0, 4), ROUND(RAND() * 9999999999, 2), ROUND(RAND() * 9999999999, 2), ROUND(RAND() * 42000, 0), ROUND(RAND() * 42000, 0), substring(MD5(RAND()*1000000000),1,500) FROM seq_1_to_8000000; -- 3.5 min
    ALTER TABLE test.customer_indexed ADD INDEX idx_fast (`c_zip`, `c_payment_cnt`); -- ~1.5 min
    -- baseline 
    SELECT c_zip, sum(c_payment_cnt)  FROM test.customer_indexed GROUP BY c_zip ORDER BY c_zip ;  --2.6s 
    sed -i 's/^#columnstore_innodb_queries_use_mcs = on/columnstore_innodb_queries_use_mcs = on/' /etc/my.cnf.d/columnstore.cnf
    systemctl restart mariadb
    # In mariadb (MariaDB command-line client)
    USE test;
    ANALYZE table test.customer_indexed PERSISTENT FOR COLUMNS (c_zip,c_payment_cnt) indexes(); --8s
    SELECT table_name, column_name, hist_type FROM mysql.column_stats WHERE table_name="customer_indexed"; 
    SHOW VARIABLES LIKE "%columnstore_innodb_queries_use_mcs%";
    tail -f /var/log/mariadb/columnstore/debug.log
    SET columnstore_ces_optimization_parallel_factor=100;
    EXPLAIN FORMAT=JSON SELECT c_zip, SUM(c_payment_cnt) FROM test.customer_indexed GROUP BY c_zip ORDER BY c_zip ;
    ...
    | {
      "query_block": {
        "select_id": 1,
        "table": {
          "message": "Pushed select"
        }
      }
    } |
    ...
    SELECT column_a, SUM(column_b) FROM innodb_table GROUP BY column_a
     SELECT column_a FROM tbl WHERE column_a = column_b 
    ANALYZE TABLE table_name PERSISTENT FOR COLUMNS (column_name) indexes();
  • Read/Write Split (readwritesplit): automatically replays transactions after node failures and optionally enforces causal reads. See "Configure Read/Write Split".

  • Read Connection (readconnroute): see "Configure Read Connection Router".
    MariaDB Enterprise ColumnStore is designed to achieve vertical and horizontal scalability for production analytics using distributed query execution and massively parallel processing (MPP) techniques.

    Enterprise ColumnStore evaluates each query as a sequence of job steps using sophisticated techniques to get the best performance for complex analytical queries. Some types of job steps are designed to scale with the system's resources. As you increase the number of ColumnStore nodes or the number of cores on each node, Enterprise ColumnStore can use those resources to more efficiently execute those types of job steps.

    Enterprise ColumnStore stores each column on disk in extents. The storage format is designed to maintain scalability, even as the table grows. If an operation does not read parts of a large table, I/O costs are reduced. Enterprise ColumnStore uses a technique called extent elimination that compares the maximum and minimum values in the extent map to the query's conditions, and it avoids scanning extents that don't satisfy the conditions.

    Enterprise ColumnStore provides exceptional scalability for analytical queries. Enterprise ColumnStore's design supports targeted scale-out to address increased workload requirements, whether it is a larger query load or increased storage and query processing capacity.

    Horizontal Scalability

    MariaDB Enterprise ColumnStore provides horizontal scalability by executing some types of job steps in a distributed manner using multiple nodes.

    When Enterprise ColumnStore is evaluating a job step, the ExeMgr process or facility on the initiator/aggregator node requests the PrimProc process on each node to perform the job step on different extents in parallel. As more nodes are added, Enterprise ColumnStore can perform more work in parallel.

    Enterprise ColumnStore also uses massively parallel processing (MPP) techniques to speed up some types of job steps. For some types of aggregation operations, each node can perform an initial local aggregation, and then the initiator/aggregator node only needs to combine the local results and perform a final aggregation. This technique can be very efficient for some types of aggregation operations, such as for queries that use the AVG(), COUNT(), or SUM() aggregate functions.

    Vertical Scalability

    MariaDB Enterprise ColumnStore provides vertical scalability by executing some types of job steps in a multi-threaded manner using a thread pool.

    When the PrimProc process on a node receives work, it executes the job step on an extent in a multi-threaded manner using a thread pool. Each thread operates on a different block within the extent. As more CPUs are added, Enterprise ColumnStore can work on more blocks in parallel.

    Extent Elimination

    ECStore-QueryExecutionExtentElimination

    MariaDB Enterprise ColumnStore uses extent elimination to scale query evaluation as table size increases.

    Most databases are row-based databases that use manually-created indexes to achieve high performance on large tables. This works well for transactional workloads. However, analytical queries tend to have very low selectivity, so traditional indexes are not typically effective for analytical queries.

    Enterprise ColumnStore uses extent elimination to achieve high performance, without requiring manually created indexes. Enterprise ColumnStore automatically partitions all data into extents. Enterprise ColumnStore stores the minimum and maximum values for each extent in the extent map. Enterprise ColumnStore uses the minimum and maximum values in the extent map to perform extent elimination.

    When Enterprise ColumnStore performs extent elimination, it compares the query's join conditions and filter conditions (i.e., WHERE clause) to the minimum and maximum values for each extent in the extent map. If the extent's minimum and maximum values fall outside the bounds of the query's conditions, Enterprise ColumnStore skips that extent for the query.

    Extent elimination is automatically performed for every query. It can significantly decrease I/O for columns with clustered values. For example, extent elimination works effectively for series, ordered, patterned, and time-based data.

    Custom Select Handler

    The ColumnStore storage engine plugin implements a custom select handler to fully take advantage of Enterprise ColumnStore's capabilities.

    All storage engines interact with ES using an internal handler API, which is highly extensible. Storage engines can implement different features by implementing different methods within the handler API.

    For select statements, the handler API transforms each query into a SELECT_LEX object, which is provided to the select handler.

    The generic select handler is not optimal for Enterprise ColumnStore, because:

    • Enterprise ColumnStore selects data by column, but the generic select handler selects data by row

    • Enterprise ColumnStore supports parallel query evaluation, but the generic select handler does not

    • Enterprise ColumnStore supports distributed aggregations, but the generic select handler does not

    • Enterprise ColumnStore supports distributed functions, but the generic select handler does not

    • Enterprise ColumnStore supports extent elimination, but the generic select handler does not

    • Enterprise ColumnStore has its own query planner, but the generic select handler cannot use it

    Smart Storage Engine

    A storage engine that implements a custom select handler is known as a smart storage engine, and the ColumnStore storage engine plugin is one such engine. MariaDB Enterprise ColumnStore integrates with MariaDB Enterprise Server through the ColumnStore storage engine plugin, which enables Enterprise Server to interact with ColumnStore tables.

    As a smart storage engine, the ColumnStore storage engine plugin tightly integrates Enterprise ColumnStore with ES, while retaining enough independence to efficiently execute analytical queries using a completely different approach.

    Configure the Select Handler

    The ColumnStore storage engine can use either the custom select handler or the generic select handler. The select handler can be configured using the columnstore_select_handler system variable:

    Value
    Description

    AUTO

    • When set to AUTO, Enterprise ColumnStore automatically chooses the best select handler for a given SELECT query.

    • AUTO was added in Enterprise ColumnStore 6.

    OFF

    • When set to OFF, Enterprise ColumnStore uses the generic select handler for all SELECT queries.

    • It is not recommended to use this value, unless recommended by MariaDB Support.

    ON

    • When set to ON, Enterprise ColumnStore uses the custom select handler for all SELECT queries.

    • ON is the default in Enterprise ColumnStore 5 and Enterprise ColumnStore 6.
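    For example, to let Enterprise ColumnStore choose the handler for each query in the current session:

    SET SESSION columnstore_select_handler=AUTO;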

    Joins

    MariaDB Enterprise ColumnStore performs join operations using hash joins.

    By default, hash joins are performed in memory.

    Configure In-Memory Joins

    MariaDB Enterprise ColumnStore can be configured to allocate more memory for hash joins.

    The relevant configuration options are:

    Section
    Option
    Description

    HashJoin

    PmMaxMemorySmallSide

    • Configures the amount of memory available for a single join.

    • Valid values are from 0 to 4 GB.

    • Default value is 1 GB.

    HashJoin

    TotalUmMemory

    • Configures the amount of memory available for all joins.

    • Values can be specified as a percentage of total system memory or as a specific amount of memory.

    • Valid percentage values are from 0 to 100%

    For example, to configure Enterprise ColumnStore to use more memory for hash joins using the mcsSetConfig utility:
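    # illustrative values; syntax: mcsSetConfig <section> <parameter> <value>
    mcsSetConfig HashJoin PmMaxMemorySmallSide 2G
    mcsSetConfig HashJoin TotalUmMemory '25%'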

    Configure Disk-Based Joins

    MariaDB Enterprise ColumnStore can be configured to perform disk-based joins.

    The relevant configuration options are:

    Section
    Option
    Description

    HashJoin

    AllowDiskBasedJoin

    • Enables disk-based joins

    • Valid values are Y and N

    • Default value is N

    HashJoin

    TempFileCompression

    • Enables compression for temporary files used by disk-based joins

    • Valid values are Y and N

    • Default value is N

    SystemConfig

    SystemTempFileDir

    • Configures the directory used for temporary files used by disk-based joins and aggregations

    • Default value is /tmp/columnstore_tmp_files

    For example, to configure Enterprise ColumnStore to perform disk-based joins using the mcsSetConfig utility:
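    # illustrative; SystemTempFileDir is shown with its default value
    mcsSetConfig HashJoin AllowDiskBasedJoin Y
    mcsSetConfig HashJoin TempFileCompression Y
    mcsSetConfig SystemConfig SystemTempFileDir /tmp/columnstore_tmp_files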

    Aggregations

    MariaDB Enterprise ColumnStore performs aggregation operations on all nodes in a distributed manner, and then all nodes send their results to a single node, which combines the results and performs the final aggregation.

    By default, aggregation operations are performed in memory.

    Configure Disk-Based Aggregations

    In Enterprise ColumnStore 5.6.1 and later, disk-based aggregations can be configured.

    The relevant configuration options are:

    Section
    Option
    Description

    RowAggregation

    AllowDiskBasedAggregation

    • Enables disk-based aggregations

    • Valid values are Y and N

    • Default value is N

    RowAggregation

    Compression

    • Enables compression for temporary files used by disk-based aggregations

    • Valid values are Y and N

    • Default value is N

    SystemConfig

    SystemTempFileDir

    • Configures the directory used for temporary files used by disk-based joins and aggregations

    • Default value is /tmp/columnstore_tmp_files

    For example, to configure Enterprise ColumnStore to perform disk-based aggregations using the mcsSetConfig utility:
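    # illustrative values
    mcsSetConfig RowAggregation AllowDiskBasedAggregation Y
    mcsSetConfig RowAggregation Compression Y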

    Query Planning

    The ColumnStore storage engine plugin is a smart storage engine, so MariaDB Enterprise ColumnStore plans its own queries using the custom select handler.

    MariaDB Enterprise ColumnStore's query planning is divided into two steps:

    • ES provides the query's SELECT_LEX object to the custom select handler. The custom select handler builds a ColumnStore Execution Plan (CSEP).

    • The custom select handler provides the CSEP to the ExeMgr process or facility on the same node. ExeMgr performs extent elimination and creates a job list.

    ExeMgr Process/Facility

    The ColumnStore storage engine provides the CSEP to the ExeMgr process or facility on the same node, which will act as the initiator/aggregator node for the query.

    Starting with MariaDB Enterprise ColumnStore 22.08, the ExeMgr facility has been integrated into the PrimProc process, so it is no longer a separate process.

    ExeMgr performs multiple tasks:

    • Performs extent elimination.

    • Views the optimizer statistics.

    • Transforms the CSEP to a job list, which consists of job steps.

    • Assigns distributed job steps to the PrimProc process on each node.

    • Evaluates non-distributed job steps itself.

    • Provides final query results to ES.

    Query Evaluation Process

    ECStore-QueryExecutionwith-S3-FlowChart

    When Enterprise ColumnStore executes a query, it goes through the following process:

    1. The client or application sends the query to MariaDB MaxScale's listener port.

    2. The query is processed by the Read/Write Split Router (readwritesplit) service associated with the listener.

    3. The service routes the query to the ES TCP port on a ColumnStore node.

    4. MariaDB Enterprise Server (ES) evaluates the query using the handler interface.

    • The handler interface builds a SELECT_LEX object to represent the query.

    • The handler interface provides the SELECT_LEX object to the ColumnStore storage engine's select handler.

    • The select handler transforms the SELECT_LEX object into a ColumnStore Execution Plan (CSEP).

    • The select handler provides the CSEP to the ExeMgr facility on the same node, which will act as the initiator/aggregator node for the query.

    5. ExeMgr transforms the CSEP into a job list, which consists of job steps.

    6. ExeMgr evaluates each job step sequentially.

    • If it is a non-distributed job step, ExeMgr evaluates the job step itself.

    • If it is a distributed job step, ExeMgr provides the job step to the PrimProc process on each node. The PrimProc process on each node evaluates the job step in a multi-threaded manner using a thread pool. After the PrimProc process on each node evaluates its job step, the results are returned to ExeMgr on the initiator/aggregator node as a Row Group.

    7. After all job steps are evaluated, ExeMgr returns the results to ES.

    8. ES returns the results to MaxScale.

    9. MaxScale returns the results to the client or application.

    Procedure Steps

    Step
    Description

    Step 1

    Step 2

    Step 3

    Step 4

    Step 5

    Support

    Customers can obtain support by submitting a support case.

    Components

    The following components are deployed during this procedure:

    Component
    Function

    MariaDB Enterprise Server

    Modern SQL RDBMS with high availability, pluggable storage engines, hot online backups, and audit logging.

    MariaDB Enterprise Server Components

    Component
    Description

    MariaDB Enterprise ColumnStore

    • Columnar Storage Engine

    • Optimized for Online Analytical Processing (OLAP) workloads

    • S3-compatible object storage

    Topology

    The Single-Node Enterprise ColumnStore topology provides support for Online Analytical Processing (OLAP) workloads to MariaDB Enterprise Server.

    The Enterprise ColumnStore node:

    • Receives queries from the application

    • Executes queries

    • Uses S3-compatible object storage for data

    High Availability

    Single-Node Enterprise ColumnStore does not provide high availability (HA) for Online Analytical Processing (OLAP). If you would like to deploy Enterprise ColumnStore with high availability, see Enterprise ColumnStore with Object storage.

    Requirements

    These requirements are for the Single-Node Enterprise ColumnStore, when deployed with MariaDB Enterprise Server and MariaDB Enterprise ColumnStore.

    Operating System

    • Debian 11 (x86_64, ARM64)

    • Debian 12 (x86_64, ARM64)

    • Red Hat Enterprise Linux 8 (x86_64, ARM64)

    • Red Hat Enterprise Linux 9 (x86_64, PPC64LE, ARM64)

    • Red Hat UBI 8 (x86_64, ARM64)

    • Rocky Linux 8 (x86_64, ARM64)

    • Rocky Linux 9 (x86_64, ARM64)

    • Ubuntu 20.04 LTS (x86_64, ARM64)

    • Ubuntu 22.04 LTS (x86_64, ARM64)

    • Ubuntu 24.04 LTS (x86_64, ARM64)

    Minimum Hardware Requirements

    MariaDB Enterprise ColumnStore's minimum hardware requirements are not intended for production environments, but the minimum hardware requirements can be appropriate for development and test environments. For production environments, see the recommended hardware requirements instead.

    The minimum hardware requirements are:

    Component
    CPU
    Memory

    Enterprise ColumnStore node

    4+ cores

    16+ GB

    MariaDB Enterprise ColumnStore will refuse to start if the system has less than 3 GB of memory.

    If Enterprise ColumnStore is started on a system with less memory, the following error message will be written to the ColumnStore system log called crit.log:

    And the following error message will be raised to the client:

    Recommended Hardware Requirements

    MariaDB Enterprise ColumnStore's recommended hardware requirements are intended for production analytics.

    The recommended hardware requirements are:

    Component
    CPU
    Memory

    Enterprise ColumnStore node

    64+ cores

    128+ GB

    Storage Requirements

    Single-node Enterprise ColumnStore with Object Storage requires the following storage type:

    Storage Type
    Description

    S3-Compatible Object Storage

    Single-node Enterprise ColumnStore with Object Storage uses S3-compatible object storage to store data.

    S3-Compatible Object Storage Requirements

    Single-node Enterprise ColumnStore with Object Storage uses S3-compatible object storage to store data.

    Many S3-compatible object storage services exist. MariaDB Corporation cannot make guarantees about all S3-compatible object storage services, because different services provide different functionality.

    For the preferred S3-compatible object storage providers that provide cloud and hardware solutions, see the following sections:

    • Cloud

    • Hardware

    The use of non-cloud and non-hardware providers is at your own risk.

    If you have any questions about using specific S3-compatible object storage with MariaDB Enterprise ColumnStore, contact us.

    Preferred Object Storage Providers: Cloud

    • Amazon Web Services (AWS) S3

    • Google Cloud Storage

    • Azure Storage

    • Alibaba Cloud Object Storage Service

    Preferred Object Storage Providers: Hardware

    • Cloudian HyperStore

    • Dell EMC

    • Seagate Lyve Rack

    • Quantum ActiveScale

    • IBM Cloud Object Storage

    Quick Reference

    MariaDB Enterprise Server Configuration Management

    Method
    Description

    Configuration File

    Configuration files (such as /etc/my.cnf) can be used to set options and system variables. The server must be restarted to apply changes made to configuration files.

    Command-line

    The server can be started with command-line options that set options and system variables.

    SQL

    Users can set system variables that support dynamic changes on-the-fly using the SET statement.

    MariaDB Enterprise Server packages are configured to read configuration files from different paths, depending on the operating system. Making custom changes to Enterprise Server default configuration files is not recommended because custom changes may be overwritten by other default configuration files that are loaded later.

    To ensure that your custom changes will be read last, create a custom configuration file with the z- prefix in one of the include directories.

    Distribution
    Example Configuration File Path
    • CentOS

    • Red Hat Enterprise Linux (RHEL)

    /etc/my.cnf.d/z-custom-mariadb.cnf

    • Debian

    • Ubuntu

    /etc/mysql/mariadb.conf.d/z-custom-mariadb.cnf
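    For example, a minimal custom configuration file on Debian or Ubuntu (the variable values are illustrative only):

    # /etc/mysql/mariadb.conf.d/z-custom-mariadb.cnf
    [mariadb]
    character_set_server = utf8mb4
    log_error = mariadbd.err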

    MariaDB Enterprise Server Service Management

    The systemctl command is used to start and stop the MariaDB Enterprise Server service.

    Operation
    Command

    Start

    sudo systemctl start mariadb

    Stop

    sudo systemctl stop mariadb

    Restart

    sudo systemctl restart mariadb

    Enable during startup

    sudo systemctl enable mariadb

    Disable during startup

    sudo systemctl disable mariadb

    Status

    sudo systemctl status mariadb

    Next Step

    Navigation in the Single-Node Enterprise ColumnStore topology with Object storage deployment procedure:

    Next: Step 1: Install MariaDB Enterprise ColumnStore.

    Set Maintenance Mode for Replicas

    This action is performed for each replica server on the MaxScale node.

    Before the upgrade, set each replica to maintenance mode in MaxScale using MaxScale's REST API. If you are using MaxCtrl, maintenance mode can be set using the set server command:

    • As the first argument, provide the name for the server

    • As the second argument, provide maintenance as the state

    Confirm Maintenance Mode is Set for Replicas

    This action is performed on the MaxScale node.

    Confirm that the replicas are set to maintenance mode in MaxScale using MaxScale's REST API. If you are using MaxCtrl, the state of the replicas can be viewed using the list servers command:

    If the node is properly in maintenance mode, then the State column will show Maintenance as one of the states.

    Disable GTID Strict Mode

    This action is performed on each replica server.

    The gtid_strict_mode system variable must be disabled for this upgrade procedure. If the gtid_strict_mode system variable is enabled in any configuration files, disable it temporarily until the upgrade procedure is complete.

    You can check if the gtid_strict_mode system variable is set in a configuration file by executing my_print_defaults command with the mysqld option:

    If the gtid_strict_mode system variable is set, you can temporarily disable it by adding # in front of it in the configuration file, so that it will be treated as a comment and ignored:

    Shutdown ColumnStore

    Prior to upgrading, MariaDB Enterprise ColumnStore must be shut down.

    Stop Services

    This action is performed on each ColumnStore node.

    Prior to upgrading, several services must be stopped on each ColumnStore node:

    1. Stop the CMAPI service:

    2. Stop the MariaDB Enterprise ColumnStore service:

    3. Stop the MariaDB Enterprise Server service:

    Upgrade to the New Version

    MariaDB Corporation provides package repositories for YUM (RHEL, CentOS, Rocky Linux) and APT (Debian, Ubuntu).

    Upgrade via YUM (RHEL, CentOS, Rocky Linux)

    1. Retrieve your Customer Download Token at https://customers.mariadb.com/downloads/token/ and substitute for CUSTOMER_DOWNLOAD_TOKEN in the following directions.

    2. Configure the YUM package repository.

      Enterprise ColumnStore 23.10 is included with MariaDB Enterprise Server 11.4. Pass the version to install using the --mariadb-server-version flag to mariadb_es_repo_setup.

      To configure YUM package repositories:

      1. Checksums of the various releases of the mariadb_es_repo_setup script can be found in the section at the bottom of the page. Substitute ${checksum} in the example above with the latest checksum.

    3. Update MariaDB Enterprise Server and package dependencies:

    Upgrade via APT (Debian, Ubuntu)

    1. Retrieve your Customer Download Token at https://customers.mariadb.com/downloads/token/ and substitute for CUSTOMER_DOWNLOAD_TOKEN in the following directions.

    2. Configure the APT package repository.

      Enterprise ColumnStore 23.10 is included with MariaDB Enterprise Server 11.4. Pass the version to install using the --mariadb-server-version flag to mariadb_es_repo_setup.

      To configure APT package repositories:

      1. Checksums of the various releases of the mariadb_es_repo_setup script can be found in the section at the bottom of the page. Substitute ${checksum} in the example above with the latest checksum.

    3. Update MariaDB Enterprise Server and package dependencies.

      The update command depends on the installed APT version, which can be determined by executing the following command:

      For versions prior to APT 2.0, execute the following command:

      For APT 2.0 and later, execute the following command:

    Disable ColumnStore Service

    This action is performed on each ColumnStore node.

    After upgrading, the MariaDB Enterprise ColumnStore service should be stopped, since it will be controlled by CMAPI:

    CMAPI disables the Enterprise ColumnStore service in a multi-node deployment. The Enterprise ColumnStore service will be started as-needed by the CMAPI service, so it does not need to start automatically upon reboot.

    Start Services

    This action is performed on each ColumnStore node.

    After upgrading, the CMAPI service and the MariaDB Enterprise Server service must be started on each ColumnStore node:

    1. Start the CMAPI service:

    2. Start the MariaDB Enterprise Server service:

    Write Binary Log

    On the primary server, run mariadb-upgrade with binary logging enabled to update the system tables in the data directory:

    Start ColumnStore

    After upgrading, MariaDB Enterprise ColumnStore must be started.

    Enable GTID Strict Mode

    This action is performed on each replica server.

    If you temporarily disabled the gtid_strict_mode system variable in the Disable GTID Strict Mode step, re-enable it now by removing the temporary comment from the relevant configuration files.

    Confirm ColumnStore Version

    This action is performed on each ColumnStore node.

    After upgrading, it is recommended to confirm the Enterprise ColumnStore version on each ColumnStore node. Connect to the node using MariaDB Client and query the Columnstore_version status variable with SHOW GLOBAL STATUS:

    Confirm ES Version

    This action is performed on each ColumnStore node.

    After upgrading, it is recommended to confirm the ES version on each ColumnStore node. Connect to the node using MariaDB Client and query the version system variable with SHOW GLOBAL VARIABLES:

    Clear Maintenance Mode for Replicas

    This action is performed for each replica server on the MaxScale node.

    After the upgrade, clear maintenance mode for each replica in MaxScale using MaxScale's REST API. If you are using MaxCtrl, maintenance mode can be cleared using the clear server command:

    • As the first argument, provide the name for the server

    • As the second argument, provide maintenance as the state

    Confirm Maintenance Mode is Cleared for Replicas

    This action is performed for each replica server on the MaxScale node.

    Confirm that maintenance mode in MaxScale has been cleared for each replica using MaxScale's REST API. If you are using MaxCtrl, the state of the replicas can be viewed using the list servers command:

    If the node is no longer in maintenance mode, then the State column will no longer show Maintenance as one of the states.

    Test Enterprise Server Service

    Use Systemd to test whether the MariaDB Enterprise Server service is running. This action is performed on each Enterprise ColumnStore node.

    Check if the MariaDB Enterprise Server service is running by executing the following:

    If the service is not running on any node, start the service by executing the following on that node:

    Test Local Client Connections

    Use MariaDB Client to test the local connection to the Enterprise Server node.

    This action is performed on each Enterprise ColumnStore node:

    The sudo command is used here to connect to the Enterprise Server node using the root@localhost user account, which authenticates using the unix_socket authentication plugin. Other user accounts can be used by specifying the --user and --password command-line options.

    Test ColumnStore Storage Engine Plugin

    Query the information_schema.PLUGINS table to confirm that the ColumnStore storage engine is loaded.

    This action is performed on each Enterprise ColumnStore node.

    Execute the following query:

    The PLUGIN_STATUS column for each ColumnStore-related plugin should contain ACTIVE.

    Test CMAPI Service

    Use Systemd to test whether the CMAPI service is running. This action is performed on each Enterprise ColumnStore node.

    Check if the CMAPI service is running by executing the following:

    If the service is not running on any node, start the service by executing the following on that node:

    Test ColumnStore Status

    Use CMAPI to request the ColumnStore status. The API key needs to be provided as part of the X-API-key HTTP header.

    This action is performed with the CMAPI service on the primary server.

    Check the ColumnStore status using curl by executing the following:

    Test DDL

    Use MariaDB Client to test DDL.

    1. On the primary server, use the MariaDB Client to connect to the node:

    2. Create a test database and ColumnStore table:

    3. On each replica server, use the MariaDB Client to connect to the node:

    4. Confirm that the database and table exist:

    If the database or table do not exist on any node, then check the replication configuration.

    Test DML

    Use MariaDB Client to test DML.

    1. On the primary server, use the MariaDB Client to connect to the node:

    2. Insert sample data into the table created in the DDL test:

    3. On each replica server, use the MariaDB Client to connect to the node:

    4. Execute a query to retrieve the data:

    If the data is not returned on any node, check the ColumnStore status and the storage configuration.

    Next Step

    Navigation in the procedure "Deploy ColumnStore Shared Local Storage Topology".

    This page was step 5 of 9.

    Next: Step 6: Install MariaDB MaxScale.

    Adding a Node

    Adding a Node to MariaDB Enterprise ColumnStore

    To add a new node to Enterprise ColumnStore, perform the following procedure.

    Deploying Enterprise ColumnStore

    Before you can add a node to Enterprise ColumnStore, confirm that the Enterprise ColumnStore software has been deployed on the node in the desired topology.

    For additional information, see the deployment procedure for your topology.

    Backing Up MariaDB Data Directory on the Primary Server

    Before the new node can be added, its MariaDB data directory must be consistent with the Primary Server. To ensure that it is consistent, take a backup of the Primary Server:

    The instructions below show how to perform a backup using MariaDB Backup (mariadb-backup).

    1. On the Primary Server, take a full backup:

      Confirm successful completion of the backup operation.

    2. On the Primary Server, prepare the backup:

      Confirm successful completion of the prepare operation.

    Restoring the Backup on the New Node

    To make the new node consistent with the Primary Server, restore the new backup on the new node:

    1. On the Primary Server, copy the backup to the new node:

    2. On the new node, restore the backup using mariadb-backup.

    3. On the new node, fix the file permissions of the restored backup:
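    A condensed sketch of the backup and restore steps above (host names, credentials, and paths are illustrative):

    $ sudo mariadb-backup --backup --target-dir=/tmp/backup \
       --user=backup_user --password=backup_password         # on the Primary Server
    $ sudo mariadb-backup --prepare --target-dir=/tmp/backup  # on the Primary Server
    $ rsync -av /tmp/backup/ 192.0.2.3:/tmp/backup/           # copy to the new node
    $ sudo mariadb-backup --copy-back --target-dir=/tmp/backup   # on the new node
    $ sudo chown -R mysql:mysql /var/lib/mysql                # fix file permissions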

    Starting the Enterprise ColumnStore Services

    The Enterprise Server, Enterprise ColumnStore, and CMAPI services can be started using the systemctl command. If the services were already started during the installation process, use the restart command.

    Perform the following procedure on the new node:

    1. Start and enable the MariaDB Enterprise Server service, so that it starts automatically upon reboot:

    2. Start and disable the MariaDB Enterprise ColumnStore service, so that it does not start automatically upon reboot:

      Note

      The Enterprise ColumnStore service should not be enabled in a multi-node deployment. The Enterprise ColumnStore service will be started as-needed by the CMAPI service, so it does not require starting automatically upon reboot.

    3. Start and enable the CMAPI service, so that it starts automatically upon reboot:

    Configuring MariaDB Replication

    MariaDB Enterprise ColumnStore requires MariaDB Replication, which must be configured.

    1. Get the GTID position that corresponds to the restored backup.

      If the backup was taken with mariadb-backup, this position will be located in xtrabackup_binlog_info:

      The GTID position from the above output is 0-1-2001,1-2-5139.

    2. Connect to the Replica Server using MariaDB Client as the root@localhost user account:

    3. Set the gtid_slave_pos system variable to the GTID position:

    4. Execute the CHANGE MASTER TO statement to configure the new node to connect to the Primary Server at this position:

      The above statement configures the Replica Server to connect to a Primary Server located at 192.0.2.1 using the repl user account.

    5. Start replication using the START REPLICA command:

      The above statement configures the new node to connect to the Primary Server to retrieve new binary log events and replicate them into the local database. A combined sketch of steps 3-5 follows.
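    A combined sketch of steps 3-5 (the GTID position, host, and credentials are illustrative):

    SET GLOBAL gtid_slave_pos = '0-1-2001,1-2-5139';
    CHANGE MASTER TO
       MASTER_HOST='192.0.2.1',
       MASTER_USER='repl',
       MASTER_PASSWORD='repl_password',
       MASTER_USE_GTID=slave_pos;
    START REPLICA;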

    Adding the Node to Enterprise ColumnStore

    The new node must be added to Enterprise ColumnStore using CMAPI:

    • Add the node using the add-node endpoint path

    • Use a supported REST client, such as curl

    • Authenticate using the configured API key

    • Include the required headers

    • Format the JSON output using jq for enhanced readability

    For example, if the primary node's host name is mcs1 and the new node's IP address is 192.0.2.3:

    • In ES 10.5.10-7 and later:

    • In ES 10.5.9-6 and earlier:
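    A sketch of such a request for ES 10.5.10-7 and later (the endpoint path and JSON payload are assumptions modeled on the CMAPI status request shown later on this page; the API key is a placeholder):

    $ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/node \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       --data '{"timeout": 120, "node": "192.0.2.3"}' \
       | jq .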

    Example output:

    Checking Enterprise ColumnStore Status

    To confirm that the node was properly added, the status of Enterprise ColumnStore should be checked using CMAPI:

    • Check the status using the status endpoint path

    For example, if the primary node's host name is mcs1:

    Example output:

    Adding a Server to MaxScale

    A server object for the new node must also be added to MaxScale using MaxCtrl:

    • Use MaxCtrl or another supported REST client

    • Add the server object using the create server command

    • As the first argument, provide a name for the server

    • As the second argument, provide the IP address for the node

    For example, assuming the new node's server object is named mcs4 and its IP address is 192.0.2.3 (both illustrative):
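    maxctrl create server mcs4 192.0.2.3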

    Verifying the Server in MaxScale

    To confirm that the server object was properly added, the server objects should be checked using MaxCtrl:

    • Show the server objects using the show servers command

    For example:

    Linking to Monitor in MaxScale

    The server object for the new node must be linked to the monitor using MaxCtrl:

    • Link a server object to the monitor using the link monitor command

    • As the first argument, provide the name of the monitor

    • As the second argument, provide the name of the server

    Checking the Monitor in MaxScale

    To confirm that the server object was properly linked to the monitor, the monitor should be checked using MaxCtrl:

    • Show the monitors using the show monitors command

    For example:

    Linking to Service in MaxScale

    The server object for the new node must be linked to the service using MaxCtrl:

    • Link the server object to the service using the link service command

    • As the first argument, provide the name of the service

    • As the second argument, provide the name of the server

    Checking the Service in MaxScale

    To confirm that the server object was properly linked to the service, the service should be checked using MaxCtrl:

    • Show the services using the show services command

    For example:

    Checking the Replication Status with MaxScale

    MaxScale is capable of checking the status of replication using MaxCtrl:

    • List the servers using the list servers command

    For example:

    If the new node is properly replicating, then the State column will show Slave, Running.

    Upgrading MariaDB Enterprise ColumnStore (Alpha)

    This page documents an Alpha version of the upgrade procedure using the mcs install_es command. Behavior may change. Validate in a non‑production environment first.

    This guide explains how to upgrade MariaDB Enterprise Server (ES) and MariaDB Enterprise ColumnStore across all nodes in a cluster using the unified mcs command-line tool, which you run only once.

    The mcs command must be run as root. Either become root, or prefix the mcs commands on this page with sudo.

    The mcs install_es command:

    • Validates your MariaDB Enterprise Repository access using an ES API token.

    • Stops ColumnStore and MariaDB services in a controlled sequence.

    • Installs/configures the ES repository for the target version.

    • Creates a pre-upgrade backup of ColumnStore DBRM and config files on each node.

    • Upgrades MariaDB Enterprise Server, ColumnStore, and CMAPI.

    • Waits for CMAPI to come back online on each node and, for upgrades, automatically restarts services.

    Prerequisites

    • Administrative privileges on all cluster nodes (package installation and service management required).

    • A valid ES API token with access to the MariaDB Enterprise Repository.

    • Network access from the nodes to the MariaDB Enterprise Repository endpoints.

    • A maintenance window: the upgrade will stop ColumnStore and MariaDB services.

    • Recent backups:

      • At a minimum, ensure Extent Map and configuration backups exist.

      • Recommended: take a full backup with the mcs backup command.

    Related docs:

    • General backup and restore guidance:

    Always back up your data before upgrading. While the tool performs a pre‑upgrade backup of DBRM and configs, it is not a substitute for a full database backup.

    Command Overview

    The command can target a specific ES version, or use the latest tested version (currently latest 10.6 version).

    • Install latest tested version (if you omit the --version option, mcs uses the latest version):

    • Install a specific version:

    • Proceed even if nodes report different installed package versions (use the majority version as baseline):

    Options summary:

    • --token TEXT: ES API Token to use for the upgrade (required).

    • -v, --version TEXT: ES version to install; if omitted or set to latest, upgrades to the latest tested version.

      • For a different version, specify something like --version 10.6.23-19 or --version 11.4.8-5.

    • --ignore-mismatch: Continue even if cluster nodes report different package versions; uses majority versions as the baseline.
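    For illustration, the invocations described above might look like this (a sketch; the token value is a placeholder):

    $ sudo mcs install_es --token "ES_API_TOKEN"
    $ sudo mcs install_es --token "ES_API_TOKEN" --version 10.6.23-19
    $ sudo mcs install_es --token "ES_API_TOKEN" --ignore-mismatch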

    Before you Begin

    • Stop or pause write workloads and heavy ingestion (e.g., cpimport, large INSERT/LOAD DATA jobs).

    • Drain or put traffic managers/proxies (for example, MaxScale) into maintenance/drain mode.

    • Ensure you have administrative/SSH and package manager access on all nodes.

    • Verify time synchronization across all nodes (NTP/Chrony) to avoid coordination issues.

    • Confirm recent backups are complete and restorable.

    What mcs install_es Does

    1

    Validate token and target version.

    • If --version=latest, the tool resolves the latest tested ES version.

    • If a specific version is requested, it is validated against the repository. Some versions may exist only for specific operating systems.

    2

    Stop services.

    • Gracefully stops the ColumnStore cluster.

    • Stops the MariaDB server.

    3

    Configure repository.

    • Installs/configures the MariaDB Enterprise Server repository for the chosen version on each node automatically.

    • Validates the installed repository on each node separately.

    4

    Pre-upgrade backups (per node).

    Creates a backup of DBRM and key configuration files named preupgrade_dbrm_backup in the default backup directory.

    5

    Upgrade packages (per node).

    • Upgrades MariaDB Enterprise Server and ColumnStore packages.

    • Upgrades CMAPI and waits for it to become ready again on each node (up to 5 minutes).

    6

    Service handling after upgrade.

    • On upgrades: automatically restarts MariaDB and the ColumnStore cluster.

    • On downgrades: automatic restarts are intentionally skipped; manual steps are required.

    Post-Upgrade Checks

    • Run mcs cluster status to verify all services are up and the cluster is healthy. In case of a failure:

      • Verify CMAPI readiness on all nodes (for example, via mcs or an external monitoring tool).

      • Check for errors in server/ColumnStore logs.

      • Review /var/tmp/mcs_cli_install_es.log for the full sequence, and ensure no errors were reported.

    • Run a quick smoke test: create a small ColumnStore table, insert a few rows, and run a SELECT query (a sketch follows).
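    A minimal smoke test, assuming a scratch database named test (all names are illustrative):

    CREATE DATABASE IF NOT EXISTS test;
    CREATE TABLE test.smoke (id INT, note VARCHAR(50)) ENGINE = ColumnStore;
    INSERT INTO test.smoke VALUES (1, 'ok'), (2, 'still ok');
    SELECT * FROM test.smoke;
    DROP TABLE test.smoke;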

    Downgrades

    • Downgrades are supported down to MariaDB 10.6.9-5 and ColumnStore 22.08.4.

    • When downgrading, the tool doesn't automatically restart services. Complete these steps manually:

      1. Start MariaDB on each node (for example, via your service manager).

      2. Start the ColumnStore cluster (for example, using the mcs cluster start command).

      3. Verify cluster health before resuming traffic.

    Downgrades can cause data loss or cluster inconsistency if not planned and validated. Always test and ensure backups are restorable.

    Verification and Logs

    After a successful upgrade, or after downgrading and a manual restart:

    • Validate that CMAPI is ready on all nodes: mcs cmapi is-ready

    • Check ColumnStore and MariaDB services are running and the cluster is healthy: mcs cluster status

    The mcs install_es command writes a detailed run log to:

    • /var/tmp/mcs_cli_install_es.log

    If CMAPI readiness times out or services do not start cleanly, review:

    • CMAPI logs: /var/log/mariadb/columnstore/cmapi_server.log

    • Service logs on each node: /var/log/mariadb/columnstore/

    • The install_es log file (/var/tmp/mcs_cli_install_es.log) for the full sequence and any errors

    Known Issues and Limitations (Alpha/Beta)

    • Mixed package versions across nodes.

      • If nodes report different installed versions of Server/ColumnStore/CMAPI, the command fails with a mismatch message.

      • You can force continuation with --ignore-mismatch; the tool uses the majority version per package as the baseline, but this carries risk; align versions whenever possible.

    • CMAPI readiness timeout.

      • After upgrading CMAPI, the command waits up to 300 seconds per node for readiness.

      • On slow nodes or constrained environments, this timeout may be insufficient, and the command exits with a failure; verify services manually and adjust operational expectations.

    • Downgrade restarts are skipped by design.

      • After a downgrade, automatic restarts are not performed; you must start MariaDB and the ColumnStore cluster manually and validate health.

      • ColumnStore skips automatic restarts because it cannot guarantee that all the expected API endpoints exist or are backward-compatible.

    • MaxScale maintenance handling is not automated.

      • Transitioning MaxScale to maintenance/normal mode during upgrades is not automated at this time; manage traffic routing and maintenance state manually if applicable.

    • Repository access and version validation.

      • Invalid tokens, network restrictions, or unsupported version strings can result in validation errors (for example, HTTP 422). Ensure the token has the correct entitlements and the requested version exists for your platform.

    • Single-node detection.

      • If no active nodes are detected, the tool falls back to localhost only; ensure this matches your topology.

    • Downgrading to 22.08.4 (10.6.9-5) technically works but finishes with known issues:

      • The command may report an error while waiting for CMAPI to become ready, even though CMAPI actually starts and works fine (check mcs status and systemctl status mariadb-columnstore-cmapi on each node).

      • Running the mariadb command may fail with an unknown-option error: the tool preserves the current configuration files while installing packages, and the older MariaDB version does not recognize newer options. To fix this, remove the offending option from the configuration file, or restore the configuration shipped with the installed package.

    • The tool currently supports a limited set of packages.

      • Only the MariaDB-server (and dependencies), MariaDB-columnstore-engine (MariaDB-plugin-columnstore), and MariaDB-columnstore-cmapi packages are removed and installed. Packages such as MariaDB-backup are currently not handled and must be upgraded or downgraded manually.

    Troubleshooting

    • Re‑run with -v/--verbose to enable console debug logging.

    • Inspect /var/tmp/mcs_cli_install_es.log for the complete sequence and API responses.

    • If package repository installation fails, verify token validity and outbound access from all nodes.

    • If CMAPI does not become ready, check service logs on each node.

    • For mismatched node versions, align package versions before re-running, or proceed with --ignore-mismatch, but only after assessing the risk.

    Environment and Network Requirements

    • Cluster state: ColumnStore cluster should be healthy before starting.

    • Node access: All nodes must be reachable (SSH/admin access) and responsive.

    • Disk space: Ensure sufficient free space for package downloads and pre-upgrade backups.

    • Internet access: Nodes must reach MariaDB Enterprise repositories (per your operating system).

    Additional Usage Example (Downgrade)

    Downgrades can be destructive.
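    A sketch of a downgrade invocation (the token is a placeholder; 10.6.9-5 is the oldest supported target noted above):

    $ sudo mcs install_es --token "ES_API_TOKEN" --version 10.6.9-5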

    This prompts for confirmation. After downgrade, services are not restarted automatically; start MariaDB and the ColumnStore cluster manually and verify health.

    Recovery Procedures

    If the upgrade fails or CMAPI does not become ready on all nodes:

    1. Review the detailed log at /var/tmp/mcs_cli_install_es.log for errors.

    2. Check service status on each node:

      • systemctl status mariadb

    Best Practices

    • Prior to upgrading:

      • Create a full backup and verify restore procedures.

      • Test the process in staging with similar topology/data.

      • Document current package versions and configs.

    Support and Reporting Issues

    Contact MariaDB Support if you encounter unexpected failures, data issues, or performance regressions. Provide:

    • The complete log file: /var/tmp/mcs_cli_install_es.log.

    • The mcs review logs: mcs review --logs.

    • The exact command used (with parameters, masking sensitive values).

    See Also

    • Command reference: mcs install_es in the command-line tool help and tool README.

    • Backups: mcs backup and Extent Map backup guidance.

    • Cluster management: mcs cluster start|stop|status .

    Using MariaDB With R

    Introduction to R

    R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, …), graphical techniques, machine learning packages and is highly extensible.

    One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

    The R Environment

    R is an integrated suite of software facilities for data manipulation, calculation, and graphical display.

    It includes:

    • an effective data handling and storage facility,

    • a suite of operators for calculations on arrays, in particular matrices,

    • a large, coherent, integrated collection of intermediate tools for data analysis,

    • graphical facilities for data analysis and display either on-screen or on hardcopy, and

    • a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

    Using R with MariaDB

    R Installation

    Some basic notions / tips on how to use R along with MariaDB are the following:

    A. The recommended R distribution is “Base R”:

    B. The recommended R GUIs are RStudio Desktop, or RStudio Server:

    Alternative GUIs would be:

    • RCode (PGM Solutions)

    “R” and “MariaDB Server” can be installed either in the same server, or in different servers, as an ODBC communication protocol will be used for the exchange of data between the two environments.

    Data Transfer between R and MariaDB

    Package: "odbc"

    For the transfer of data between MariaDB Server and the R environment, R's "odbc" package is recommended:

    • “odbc" is a new R package available on CRAN (Since 2017-02-05), and maintained by RStudio, which is designed to comply with the DBI specification.

    • Tutorials on how to use R's "odbc" package can be found here:

      • Setting up ODBC Drivers:

      • "odbc" R Package:

    The "odbc" package requires to have previously installed the MariaDB or MySQL ODBC connector:

    For installing the "odbc" package from CRAN, execute in R:
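    install.packages("odbc")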

    Package: "RMariaDB"

    The “RMariaDB” R library is a modern 'MariaDB' client based on 'Rcpp'.

    For installing the RMariaDB package through CRAN, execute the following R statement:
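    install.packages("RMariaDB")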

    And for connecting to MariaDB:
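    A minimal connection sketch using DBI (host, credentials, and database name are placeholders):

    library(DBI)
    con <- dbConnect(
       RMariaDB::MariaDB(),
       host = "192.0.2.1",
       user = "app_user",
       password = "app_password",
       dbname = "test"
    )
    dbListTables(con)   # verify the connection
    dbDisconnect(con)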

    Other Packages: "readr", "RODBC"

    There are other alternatives for data transfer between R and MariaDB:

    • “readr” R package, for writing / reading CSV files. To be used in MariaDB along with “LOAD DATA INFILE”.

    • "RODBC" R package: Robust and well-tested (Since 2000-05-24) package which enables data transfer between R and MariaDB by means of an ODBC connector:

      • It is slightly slower than RStudio's new "odbc" package (See benchmarks):

      • For bug report to the RODBC package maintainer, use the following R statement: bug.report(package = "RODBC")

    R Programming Resources

    A) Programming

    Recommended resources for learning how to program in R are the following:

    B) Statistics

    A recommended book for understanding the underlying statistics in the R packages is:

    C) Cheatsheets: Concept Summary

    • Rstudio Cheatsheets are a recommended and valuable resource:

    • Along with the following Base R reference card:

    D) Search Engine & R Package Spotlight

    • Search Engines:

    • Information on new R packages is regularly published in the following websites:

    E) Statistical / Unsupervised Machine Learning, Deep Learning and Artificial Intelligence

    H2O.AI

    The R programming language has support for the H2O.ai library, which enables the creation of in-memory, multi-cluster, GPU-powered machine learning models.

    For installing H2O.ai through CRAN, execute:
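    install.packages("h2o")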

    The following R Statements can be used for importing a MariaDB table to H2O.ai using the R Front End:

    • import_sql_table: "This function imports a SQL table to H2OFrame in memory".

    • import_sql_select: "This function imports the SQL table that is the result of the specified SQL query to H2OFrame in memory".

    NOTE: Be sure to start the h2o.jar in the terminal with your downloaded JDBC driver in the classpath:
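    For example, a sketch of such an invocation (jar locations are illustrative):

    $ java -cp /path/to/mariadb-java-client.jar:h2o.jar water.H2OApp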

    KERAS

    The keras R package offers an interface to Keras, a high-level neural networks API.

    'Keras' was developed with a focus on enabling fast experimentation, supports both convolution based networks and recurrent networks (as well as combinations of the two), and runs seamlessly on both 'CPU' and 'GPU' devices.

    R LIBRARIES: CARET

    A book which introduces core Machine Learning concepts:

    F) Text Mining

    Documentation on how to perform Text Mining in R can be found in the book "Text Mining With R":

    G) Shiny Web Apps & RMarkdown Documents

    SHINY WEB APPS

    The shiny R package makes it incredibly easy to build interactive web applications with R.

    Automatic "reactive" binding between inputs and outputs and extensive prebuilt widgets make it possible to build beautiful, responsive, and powerful applications with minimal effort.

    To deploy Shiny web applications using open-source alternatives, you can use either:

    RMARKDOWN DOCUMENTS

    H) Advanced R Resources

    Some of the most advanced R resources for fully understanding the internals and nuances of the R Programming Language are the following:

    Data Loading with cpimport

    Overview

    MariaDB Enterprise ColumnStore includes a bulk data loading tool called cpimport, which bypasses the SQL layer to decrease the overhead of bulk data loading.

    Refer to the cpimport modes for additional information and to ColumnStore Bulk Data Loading.

    The cpimport tool:

    • Bypasses the SQL layer to decrease overhead;

    • Does not block read queries;

    • Requires a write metadata lock on the table, which can be monitored with the METADATA_LOCK_INFO plugin;

    • Appends the new data to the table. While the bulk load is in progress, the newly appended data is temporarily hidden from queries. After the bulk load is complete, the newly appended data is visible to queries;

    • Inserts each row in the order the rows are read from the source file. Users can optimize data loads for Enterprise ColumnStore's automatic partitioning by loading presorted data files;

    • Supports parallel distributed bulk loads;

    • Imports data from text files;

    • Imports data from binary files;

    • Imports data from standard input (stdin).

    Intended Use Cases

    You can load data using the cpimport tool in the following cases:

    • You are loading data into a ColumnStore table from a text file stored on the primary node's file system.

    • You are loading data into a ColumnStore table from a binary file stored on the primary node's file system.

    • You are loading data into a ColumnStore table from the output of a command running on the primary node.

    Locking

    MariaDB Enterprise ColumnStore requires a write metadata lock (MDL) on the table when a bulk data load is performed with cpimport.

    When a bulk data load is running:

    • Read queries will not be blocked.

    • Write queries and concurrent bulk data loads on the same table will be blocked until the bulk data load operation is complete, and the write metadata lock on the table has been released.

    • The write metadata lock (MDL) can be monitored with the METADATA_LOCK_INFO plugin.

    Importing the Schema

    Before data can be imported into the tables, the schema must be created.

    1. Connect to the primary server using MariaDB Client:

    After the command is executed, it prompts for a password.

    2. For each imported database, create the database with the CREATE DATABASE statement:

    3. For each imported table, create the table with the CREATE TABLE statement:
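    For example, a minimal schema sketch matching the import examples below (database, table, and column names are illustrative):

    CREATE DATABASE inventory;

    CREATE TABLE inventory.products (
       product_name VARCHAR(50) NOT NULL,
       supplier VARCHAR(50) NOT NULL,
       quantity INT
    ) ENGINE = ColumnStore;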

    To get the best performance from Enterprise ColumnStore, make sure to follow Enterprise ColumnStore's best practices for schema design.

    Appending Data

    When MariaDB Enterprise ColumnStore performs a bulk data load, it appends data to the table in the order in which the data is read. Appending data reduces the I/O requirements of bulk data loads, so that larger data sets can be loaded very efficiently.

    While the bulk load is in progress, the newly appended data is temporarily hidden from queries.

    After the bulk load is complete, the newly appended data is visible to queries.

    Sorting the Input File

    When MariaDB Enterprise ColumnStore performs a bulk data load, it appends data to the table in the order in which the data is read.

    The order of data can have a significant effect on performance with Enterprise ColumnStore, so it can be helpful to sort the data in the input file prior to importing it.

    For additional information, see the Enterprise ColumnStore best practices.

    Confirming the Field Delimiter

    Before importing a file into MariaDB Enterprise ColumnStore, confirm that the field delimiter is not present in the data.

    The default field delimiter for the cpimport tool is a pipe (|).

    To use a different delimiter, you can set the field delimiter.

    Importing from Text Files

    The cpimport tool can import data from a text file if a file is provided as an argument after the database and table name.

    For example, to import the file inventory-products.txt into the products table in the inventory database:
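    A sketch of the command (assuming the file uses the default pipe delimiter):

    $ sudo cpimport inventory products inventory-products.txt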

    Importing from Binary Files

    The cpimport tool can import data from a binary file if the -I1 or -I2 option is provided and a file is provided as an argument after the database and table name.

    For example, to import the file inventory-products.bin into the products table in the inventory database:
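    A sketch of the command (using binary mode 1):

    $ sudo cpimport -I1 inventory products inventory-products.bin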

    The -I1 and -I2 options allow two different binary import modes to be selected:

    Option
    Description

    The binary file should use the following format for data:

    Data Type(s)
    Format

    Binary DATE Format

    In binary input files, the cpimport tool expects DATE columns to be in the following format:

    Binary DATETIME Format

    In binary input files, the cpimport tool expects DATETIME columns to be in the following format:

    Importing from Standard Input

    The cpimport tool can import data from standard input (stdin) if no file is provided as an argument.

    Importing from standard input is useful in many scenarios.

    One scenario is when you want to import data from a remote database. You can use MariaDB Client to query the table using the SELECT statement, and then pipe the results into the standard input of the cpimport tool:
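    A sketch of such a pipeline (the remote host and credentials are placeholders; -s '\t' matches the client's tab-separated batch output):

    $ mariadb --quick --batch --skip-column-names \
       --host=192.0.2.10 --user=app_user --password \
       --execute='SELECT * FROM inventory.products' \
       | cpimport -s '\t' inventory products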

    Importing from S3 Using AWS CLI

    The cpimport tool can import data from a file stored in a remote S3 bucket.

    You can use the AWS CLI to copy the file from S3, and then pipe the contents into the standard input of the cpimport tool:
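    A sketch of such a pipeline (the bucket and object names are illustrative):

    $ aws s3 cp s3://example-bucket/inventory-products.tsv - \
       | cpimport -s '\t' inventory products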

    Alternatively, the columnstore_info.load_from_s3 stored procedure can import data from S3-compatible cloud object storage.

    Setting the Field Delimiter

    The default field delimiter for the cpimport tool is a pipe sign (|).

    If your data file uses a different field delimiter, you can specify the field delimiter with the -s option.

    For a TSV (tab-separated values) file:

    For a CSV (comma-separated values) file:
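    For example (file names are illustrative):

    $ sudo cpimport -s '\t' inventory products inventory-products.tsv
    $ sudo cpimport -s ',' inventory products inventory-products.csv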

    Setting the Quoting Style

    By default, the cpimport tool does not expect fields to be quoted.

    If your data file uses quotes around fields, you can specify the quote character with the -E option.

    To load a TSV (tab-separated values) file that uses double quotes:

    To load a CSV (comma-separated values) file that uses optional single quotes:
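    For example (file names are illustrative):

    $ sudo cpimport -s '\t' -E '"' inventory products inventory-products.tsv
    $ sudo cpimport -s ',' -E "'" inventory products inventory-products.csv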

    Logging

    The cpimport tool writes logs to different directories, depending on the Enterprise ColumnStore version:

    • In Enterprise ColumnStore 5.5.2 and later, logs are written to /var/log/mariadb/columnstore/bulk/

    • In Enterprise ColumnStore 5 releases before 5.5.2, logs are written to /var/lib/columnstore/data/bulk/

    • In Enterprise ColumnStore 1.4, logs are written to /usr/local/mariadb/columnstore/bulk/

    Special Handling

    Column Order

    The cpimport tool requires column values to be in the same order in the input file as the columns in the table definition.

    Date Format

    The cpimport tool requires DATE values to be specified in the format YYYY-MM-DD.

    Transaction Log

    The cpimport tool does not write bulk data loads to the transaction log, so they are not transactional.

    Binary Log

    The cpimport tool does not write bulk data loads to the binary log, so they cannot be replicated using MariaDB replication.

    EFS Storage

    When Enterprise ColumnStore uses object storage and the Storage Manager directory uses EFS in the default Bursting Throughput mode, the cpimport tool can have performance problems if multiple data load operations are executed consecutively. The performance problems can occur because the Bursting Throughput mode scales the rate relative to the size of the file system, so the burst credits for a small Storage Manager volume can be fully consumed very quickly.

    When this problem occurs, some solutions are:

    • Avoid using burst credits by using Provisioned Throughput mode instead of Bursting Throughput mode

    • Monitor burst credit balances in AWS and run data load operations when burst credits are available

    • Increase the burst credit balance by increasing the file system size (for example, by creating a dummy file)

    Additional information is available .

    $ mcsSetConfig HashJoin PmMaxMemorySmallSide 2G
    $ mcsSetConfig HashJoin TotalUmMemory '40%'
    $ mcsSetConfig HashJoin AllowDiskBasedJoin Y
    $ mcsSetConfig HashJoin TempFileCompression Y
    $ mcsSetConfig SystemConfig SystemTempFileDir /mariadb/tmp
    $ mcsSetConfig RowAggregation AllowDiskBasedAggregation Y
    $ mcsSetConfig RowAggregation Compression SNAPPY
    $ mcsSetConfig SystemConfig SystemTempFileDir /mariadb/tmp
    Apr 30 21:54:35 a1ebc96a2519 PrimProc[1004]: 35.668435 |0|0|0| C 28 CAL0000: Error total memory available is less than 3GB.
    ERROR 1815 (HY000): Internal error: System is not ready yet. Please try again.
    sudo systemctl stop mariadb-columnstore-cmapi
    sudo systemctl stop mariadb-columnstore
    sudo systemctl stop mariadb
    sudo yum install curl
    curl -LsSO https://dlm.mariadb.com/enterprise-release-helpers/mariadb_es_repo_setup
    echo "${checksum} mariadb_es_repo_setup" | sha256sum -c -
    chmod +x mariadb_es_repo_setup
    sudo ./mariadb_es_repo_setup --token="CUSTOMER_DOWNLOAD_TOKEN" --apply \
       --mariadb-server-version="11.4"
    sudo apt install curl
    curl -LsSO https://dlm.mariadb.com/enterprise-release-helpers/mariadb_es_repo_setup
    echo "${checksum}  mariadb_es_repo_setup" sha256sum -c -
    chmod +x mariadb_es_repo_setup
    sudo ./mariadb_es_repo_setup --token="CUSTOMER_DOWNLOAD_TOKEN" --apply \
       --mariadb-server-version="11.4"
    sudo apt update
    sudo systemctl start mariadb-columnstore-cmapi
    sudo systemctl start mariadb
    maxctrl set server \
       mcs2 \
       maintenance
    maxctrl list servers
    ┌────────┬───────────────┬──────┬─────────────┬──────────────────────┬────────┐
    │ Server │ Address       │ Port │ Connections │ State                │ GTID   │
    ├────────┼───────────────┼──────┼─────────────┼──────────────────────┼────────┤
    │ mcs3   │ 192.0.2.3     │ 3306 │ 0           │ Maintenance, Running │ 0-1-17 │
    ├────────┼───────────────┼──────┼─────────────┼──────────────────────┼────────┤
    │ mcs2   │ 192.0.2.2     │ 3306 │ 0           │ Maintenance, Running │ 0-1-17 │
    ├────────┼───────────────┼──────┼─────────────┼──────────────────────┼────────┤
    │ mcs1   │ 192.0.2.1     │ 3306 │ 0           │ Master, Running      │ 0-1-17 │
    └────────┴───────────────┴──────┴─────────────┴──────────────────────┴────────┘
    my_print_defaults --mysqld \
       | grep "gtid[-_]strict[-_]mode"
    --gtid_strict_mode=1
    [mariadb]
    ...
    # temporarily commented out for upgrade
    # gtid_strict_mode=1
    mcs cluster stop
    sudo systemctl stop mariadb-columnstore
    sudo systemctl disable mariadb-columnstore
    mariadb-upgrade --write-binlog
    mcs cluster start
    SHOW GLOBAL STATUS LIKE 'Columnstore_version';
    +---------------------+---------+
    | Variable_name       | Value   |
    +---------------------+---------+
    | Columnstore_version | 23.10.0 |
    +---------------------+---------+
    SHOW GLOBAL VARIABLES LIKE 'version';
    +---------------+----------------------------------+
    | Variable_name | Value                            |
    +---------------+----------------------------------+
    | version       | 10.6.9-5-MariaDB-enterprise-log  |
    +---------------+----------------------------------+
    maxctrl clear server \
       mcs2 \
       maintenance
    maxctrl list servers
    ┌────────┬───────────────┬──────┬─────────────┬─────────────────┬─────────┐
    │ Server │ Address       │ Port │ Connections │ State           │ GTID    │
    ├────────┼───────────────┼──────┼─────────────┼─────────────────┼─────────┤
    │ mcs3   │ 192.0.2.3     │ 3306 │ 0           │ Slave, Running  │ 0-3-159 │
    ├────────┼───────────────┼──────┼─────────────┼─────────────────┼─────────┤
    │ mcs2   │ 192.0.2.2     │ 3306 │ 0           │ Slave, Running  │ 0-1-88  │
    ├────────┼───────────────┼──────┼─────────────┼─────────────────┼─────────┤
    │ mcs1   │ 192.0.2.1     │ 3306 │ 0           │ Master, Running │ 0-1-88  │
    └────────┴───────────────┴──────┴─────────────┴─────────────────┴─────────┘
    $ systemctl status mariadb
    $ sudo systemctl start mariadb
    $ sudo mariadb
    Welcome to the MariaDB monitor.  Commands end with ; or \g.
    Your MariaDB connection id is 38
    Server version: 11.4.5-3-MariaDB-Enterprise MariaDB Enterprise Server
    
    Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
    
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    
    MariaDB [(none)]>
    SELECT PLUGIN_NAME, PLUGIN_STATUS
    FROM information_schema.PLUGINS
    WHERE PLUGIN_LIBRARY LIKE 'ha_columnstore%';
    
    +---------------------+---------------+
    | PLUGIN_NAME         | PLUGIN_STATUS |
    +---------------------+---------------+
    | Columnstore         | ACTIVE        |
    | COLUMNSTORE_COLUMNS | ACTIVE        |
    | COLUMNSTORE_TABLES  | ACTIVE        |
    | COLUMNSTORE_FILES   | ACTIVE        |
    | COLUMNSTORE_EXTENTS | ACTIVE        |
    +---------------------+---------------+
    $ systemctl status mariadb-columnstore-cmapi
    $ sudo systemctl start mariadb-columnstore-cmapi
    $ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       | jq .
    {
      "timestamp": "2020-12-15 00:40:34.353574",
      "192.0.2.1": {
        "timestamp": "2020-12-15 00:40:34.362374",
        "uptime": 11467,
        "dbrm_mode": "master",
        "cluster_mode": "readwrite",
        "dbroots": [
          "1"
        ],
        "module_id": 1,
        "services": [
          {
            "name": "workernode",
            "pid": 19202
          },
          {
            "name": "controllernode",
            "pid": 19232
          },
          {
            "name": "PrimProc",
            "pid": 19254
          },
          {
            "name": "ExeMgr",
            "pid": 19292
          },
          {
            "name": "WriteEngine",
            "pid": 19316
          },
          {
            "name": "DMLProc",
            "pid": 19332
          },
          {
            "name": "DDLProc",
            "pid": 19366
          }
        ]
      },
      "192.0.2.2": {
        "timestamp": "2020-12-15 00:40:34.428554",
        "uptime": 11437,
        "dbrm_mode": "slave",
        "cluster_mode": "readonly",
        "dbroots": [
          "2"
        ],
        "module_id": 2,
        "services": [
          {
            "name": "workernode",
            "pid": 17789
          },
          {
            "name": "PrimProc",
            "pid": 17813
          },
          {
            "name": "ExeMgr",
            "pid": 17854
          },
          {
            "name": "WriteEngine",
            "pid": 17877
          }
        ]
      },
      "192.0.2.3": {
        "timestamp": "2020-12-15 00:40:34.428554",
        "uptime": 11437,
        "dbrm_mode": "slave",
        "cluster_mode": "readonly",
        "dbroots": [
          "2"
        ],
        "module_id": 2,
        "services": [
          {
            "name": "workernode",
            "pid": 17789
          },
          {
            "name": "PrimProc",
            "pid": 17813
          },
          {
            "name": "ExeMgr",
            "pid": 17854
          },
          {
            "name": "WriteEngine",
            "pid": 17877
          }
        ]
      },
      "num_nodes": 3
    }
    $ sudo mariadb
    CREATE DATABASE IF NOT EXISTS test;
    
    CREATE TABLE IF NOT EXISTS test.contacts (
       first_name VARCHAR(50),
       last_name VARCHAR(50),
       email VARCHAR(100)
    ) ENGINE = ColumnStore;
    $ sudo mariadb
    SHOW CREATE TABLE test.contacts\G
    $ sudo mariadb
    INSERT INTO test.contacts (first_name, last_name, email)
       VALUES
       ("Kai", "Devi", "kai.devi@example.com"),
       ("Lee", "Wang", "lee.wang@example.com");
    $ sudo mariadb
    SELECT * FROM test.contacts;
    
    +------------+-----------+----------------------+
    | first_name | last_name | email                |
    +------------+-----------+----------------------+
    | Kai        | Devi      | kai.devi@example.com |
    | Lee        | Wang      | lee.wang@example.com |
    +------------+-----------+----------------------+
    SET columnstore_unstable_optimizer=ON;
    SET optimizer_switch='index_merge=off,index_merge_union=off,index_merge_sort_union=off,index_merge_intersection=off,index_merge_sort_intersection=off,index_condition_pushdown=off,derived_merge=off,derived_with_keys=off,firstmatch=off,loosescan=off,materialization=on,in_to_exists=off,semijoin=off,partial_match_rowid_merge=off,partial_match_table_scan=off,subquery_cache=off,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=off,semijoin_with_cache=off,join_cache_incremental=off,join_cache_hashed=off,join_cache_bka=off,optimize_join_buffer_size=off,table_elimination=off,extended_keys=off,exists_to_in=off,orderby_uses_equalities=off,condition_pushdown_for_derived=on,split_materialized=off,condition_pushdown_for_subquery=off,rowid_filter=off,condition_pushdown_from_having=on,not_null_range_scan=off,hash_join_cardinality=off,cset_narrowing=off,sargable_casefold=off';
    SELECT c_zip, sum(c_payment_cnt)  FROM test.customer_indexed GROUP BY c_zip ORDER BY c_zip ; -- 0.7s
    sed -i 's/^columnstore_innodb_queries_use_mcs = on/#columnstore_innodb_queries_use_mcs = on/' /etc/my.cnf.d/columnstore.cnf
    systemctl restart mariadb
    SELECT mcs_get_plan('rules');
    +-----------------------+
    | mcs_get_plan('rules') |
    +-----------------------+
    | parallel_ces          |
    +-----------------------+
    
    SELECT mcs_get_plan('optimized');
    +---------------------------+
    | mcs_get_plan('optimized') |
    +---------------------------+
    ...
    >>From Tables
      derived table - $added_sub_test_customer_indexed_0
    select calShowPartitions('orders','orderdate');
    +-----------------------------------------+
    | calShowPartitions('orders','orderdate') |
    +-----------------------------------------+
    | Part# Min        Max        Status
      0.0.1 1992-01-01 1998-08-02 Enabled
      0.1.2 1998-08-03 2004-05-15 Enabled
      0.2.3 2004-05-16 2010-07-24 Enabled |
    +-----------------------------------------+
    
    1 row in set (0.05 sec)
    select calEnablePartitions('orders', '0.0.1');
    +----------------------------------------+
    | calEnablePartitions('orders', '0.0.1') |
    +----------------------------------------+
    | Partitions are enabled successfully.   |
    +----------------------------------------+
    1 row in set (0.28 sec)
    select calShowPartitions('orders','orderdate');
    +-----------------------------------------+
    | calShowPartitions('orders','orderdate') |
    +-----------------------------------------+
    | Part# Min        Max        Status
      0.0.1 1992-01-01 1998-08-02 Enabled
      0.1.2 1998-08-03 2004-05-15 Enabled
      0.2.3 2004-05-16 2010-07-24 Enabled |
    +-----------------------------------------+
    1 row in set (0.05 sec)
    select calDisablePartitions('orders','0.0.1');
    +----------------------------------------+
    | calDisablePartitions('orders','0.0.1') |
    +----------------------------------------+
    | Partitions are disabled successfully.  |
    +----------------------------------------+
    1 row in set (0.28 sec)
    select calShowPartitions('orders','orderdate');
    +-----------------------------------------+
    | calShowPartitions('orders','orderdate') |
    +-----------------------------------------+
    | Part# Min        Max        Status
      0.0.1 1992-01-01 1998-08-02 Disabled
      0.1.2 1998-08-03 2004-05-15 Enabled
      0.2.3 2004-05-16 2010-07-24 Enabled |
    +-----------------------------------------+
    1 row in set (0.05 sec)
    select calDropPartitions('orders', '0.0.1');
    +--------------------------------------+
    | calDropPartitions('orders', '0.0.1') |
    +--------------------------------------+
    | Partitions are dropped successfully  |
    +--------------------------------------+
    1 row in set (0.28 sec)
    select calShowPartitions('orders','orderdate');
    +-----------------------------------------+
    | calShowPartitions('orders','orderdate') |
    +-----------------------------------------+
    | Part# Min        Max        Status
      0.1.2 1998-08-03 2004-05-15 Enabled
      0.2.3 2004-05-16 2010-07-24 Enabled |
    +-----------------------------------------+
    1 row in set (0.05 sec)
    select calShowPartitionsByValue('orders','orderdate', '1992-01-01', '2010-07-24');
    +----------------------------------------------------------------------------+
    | calShowPartitionsbyvalue('orders','orderdate', '1992-01-01', '2010-07-24') |
    +----------------------------------------------------------------------------+
    | Part# Min        Max        Status
      0.0.1 1992-01-01 1998-08-02 Enabled
      0.1.2 1998-08-03 2004-05-15 Enabled
      0.2.3 2004-05-16 2010-07-24 Enabled |
    +----------------------------------------------------------------------------+
    1 row in set (0.05 sec)
    select calEnablePartitionsByValue('orders','orderdate', '1992-01-01', '1998-08-02');
    +--------------------------------------------------------------------------------+
    | calenablepartitionsbyvalue ('orders', 'orderdate','1992-01-01','1998-08-02')   |
    +--------------------------------------------------------------------------------+
    | Partitions are enabled successfully                                            |
    +--------------------------------------------------------------------------------+
    1 row in set (0.28 sec)
    select calShowPartitionsByValue('orders','orderdate', '1992-01-01', '2010-07-24');
    +----------------------------------------------------------------------------+
    | calShowPartitionsbyvalue('orders','orderdate', '1992-01-01','2010-07-24' ) |
    +----------------------------------------------------------------------------+
    | Part# Min        Max        Status
      0.0.1 1992-01-01 1998-08-02 Enabled
      0.1.2 1998-08-03 2004-05-15 Enabled
      0.2.3 2004-05-16 2010-07-24 Enabled |
    +----------------------------------------------------------------------------+
    1 row in set (0.05 sec)
    select calDisablePartitionsByValue('orders','orderdate', '1992-01-01', '1998-08-02');
    +---------------------------------------------------------------------------------+
    | caldisablepartitionsbyvalue ('orders', 'orderdate','1992-01-01','1998-08-02')   |
    +---------------------------------------------------------------------------------+
    | Partitions are disabled successfully                                            |
    +---------------------------------------------------------------------------------+
    1 row in set (0.28 sec)
    select calShowPartitionsByValue('orders','orderdate', '1992-01-01', '2010-07-24');
    +----------------------------------------------------------------------------+
    | calShowPartitionsbyvalue('orders','orderdate', '1992-01-01','2010-07-24' ) |
    +----------------------------------------------------------------------------+
    | Part# Min        Max        Status
      0.0.1 1992-01-01 1998-08-02 Disabled
      0.1.2 1998-08-03 2004-05-15 Enabled
      0.2.3 2004-05-16 2010-07-24 Enabled |
    +----------------------------------------------------------------------------+
    1 row in set (0.05 sec)
    select calDropPartitionsByValue('orders','orderdate', '1992-01-01', '1998-08-02');
    +------------------------------------------------------------------------------+
    | caldroppartitionsbyvalue ('orders', 'orderdate','1992-01-01','1998-08-02')   |
    +------------------------------------------------------------------------------+
    | Partitions are dropped successfully.                                         |
    +------------------------------------------------------------------------------+
    1 row in set (0.28 sec)
    select calShowPartitionsByValue('orders','orderdate', '1992-01-01', '2010-07-24');
    +----------------------------------------------------------------------------+
    | calShowPartitionsbyvalue('orders','orderdate', '1992-01-01','2010-07-24' ) |
    +----------------------------------------------------------------------------+
    | Part# Min        Max        Status
      0.1.2 1998-08-03 2004-05-15 Enabled
      0.2.3 2004-05-16 2010-07-24 Enabled |
    +----------------------------------------------------------------------------+
    1 row in set (0.05 sec)
    DELETE FROM orders WHERE orderdate <= '1998-12-31';
    Set the system variable to the GTID position:
  • Execute the statement to configure the new node to connect to the Primary Server at this position:

    The above statement configures the Replica Server to connect to a Primary Server located at 192.0.2.1 using the repl user account.

  • Start replication using the command:

    The above statement configures the new node to connect to the Primary Server to retrieve new binary log events and replicate them into the local database.

  • Authenticate using the configured
  • Include the required headers

  • CMAPI
    add-node
    supported REST client
    CMAPI
    status
    API key

    Upgrades MariaDB Enterprise Server, ColumnStore, and CMAPI.

  • Waits for CMAPI to come back online on each node and, for upgrades, automatically restarts services.

  • Recent backups:

    • At a minimum, ensure Extent Map and configuration backups exist.

    • Recommended: take a full backup with the mcs backup command.

    or
    --version 11.4.8-5
    .
  • --ignore-mismatch: Continue even if cluster nodes report different package versions; uses majority versions as the baseline.

  • Verify time synchronization across all nodes (NTP/Chrony) to avoid coordination issues.

  • Confirm recent backups are complete and restorable.

  • Stop services.

    • Gracefully stops the ColumnStore cluster.

    • Stops the MariaDB server.

    3. Configure repository.

    • Installs/configures the MariaDB Enterprise Server repository for the chosen version on each node automatically.

    • Validate the installed repository on each node separately.

    4. Pre-upgrade backups (per node).

    Creates a backup of the DBRM and key configuration files, named preupgrade_dbrm_backup, in the default backup directory.

    5. Upgrade packages (per node).

    • Upgrades MariaDB Enterprise Server and ColumnStore packages.

    • Upgrades CMAPI and waits for it to become ready again on each node (up to 5 minutes).

    6. Service handling after upgrade.

    • On upgrades: automatically restarts MariaDB and the ColumnStore cluster.

    • On downgrades: automatic restarts are intentionally skipped; manual steps are required.

    Create a small ColumnStore table, insert a few rows, and run a SELECT query.
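
    As a quick smoke-test sketch (the database and table names below are illustrative, not part of the upgrade tool):

    CREATE DATABASE IF NOT EXISTS smoke;
    CREATE TABLE smoke.t1 (id INT, note VARCHAR(20)) ENGINE=ColumnStore;
    INSERT INTO smoke.t1 VALUES (1, 'ok'), (2, 'ok');
    SELECT COUNT(*) FROM smoke.t1;
    DROP DATABASE smoke;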

  • Check for errors in server/ColumnStore logs.

  • Review /var/tmp/mcs_cli_install_es.log for the full sequence, and ensure no errors were reported.

  • Verify cluster health before resuming traffic (for example, with the mcs cluster status command).

  • CMAPI readiness timeout
    • After upgrading CMAPI, the command waits up to 300 seconds per node for readiness.

    • On slow nodes or constrained environments, this timeout may be insufficient, and the command exits with a failure; verify services manually and adjust operational expectations.

  • Downgrade restarts are skipped by design.

    • After a downgrade, automatic restarts are not performed; you must start MariaDB and the ColumnStore cluster manually and validate health.

    • ColumnStore skips automatic restarts because it cannot guarantee that all the expected API endpoints exist or are backward-compatible.

  • MaxScale maintenance handling not automated.

    • Transitioning MaxScale to maintenance/normal mode during upgrades is not automated at this time; manage traffic routing and maintenance state manually if applicable.

  • Repository access and version validation.

    • Invalid tokens, network restrictions, or unsupported version strings can result in validation errors (for example, HTTP 422). Ensure the token has the correct entitlements and the requested version exists for your platform.

  • Single‑node detection.

    • If no active nodes are detected, the tool falls back to localhost only; ensure this matches your topology.

  • Downgrading to 22.08.4 (10.6.9-5) technically works, but finishes with known issues:

    • The tool may report an ERROR while waiting for CMAPI to become ready, but CMAPI actually starts and works fine (check mcs status and systemctl status mariadb-columnstore-cmapi on each node).

    • Running a mariadb command may then fail with an error about an unknown configuration flag. The tool forces the current configuration files to be kept while installing packages, and the older MariaDB version does not support the newer flag. To fix this, remove the flag from the configuration file, or restore the configuration from the last installed package.

  • The tool currently supports a limited set of packages.

    • Only the MariaDB-server (and dependencies), MariaDB-columnstore-engine (MariaDB-plugin-columnstore), and MariaDB-columnstore-cmapi packages are supported for removal and installation. Packages such as MariaDB-backup are currently not supported and must be upgraded/downgraded manually.

  • If CMAPI does not become ready, check service logs on each node.
  • For mismatched node versions, align package versions before re-running, or proceed with --ignore-mismatch, but only after assessing the risk.

  • CMAPI communication: Port 8640 (default) must be reachable between nodes.

  • Time sync: Keep NTP/Chrony synchronized across nodes.

  • systemctl status mariadb-columnstore-cmapi
  • Verify network/ports (CMAPI 8640) and repository reachability.

  • Manually start services if safe to do so:

    • systemctl start mariadb

    • mcs start (or mcs cluster start)

  • If corruption is suspected, follow your backup recovery plan (for example, restore from a recent backup and/or extent map backup).

  • Schedule a maintenance window and inform stakeholders.

  • During upgrading:

    • Monitor the console output and /var/tmp/mcs_cli_install_es.log.

    • Avoid interrupting the process; ensure network stability.

  • After upgrading:

    • Validate services and cluster health (mcs cluster status).

    • Run basic data integrity and application smoke tests.

    • Monitor performance and logs for regressions.

  • Cluster topology (nodes, versions, operating system, network).
  • Source and target versions (Server, ColumnStore, CMAPI).

  • Exact error messages and timestamps.

  • Extent Map backup and recovery
    ColumnStore Backup and Restore
  • A vignette on how to use the RODBC package can be found here: RODBC CRAN Vignette

  • Mastering Spark with R (O'Reilly; Javier Luraschi, Kevin Kuo, Edgar Ruiz)
  • R Packages (Hadley Wickham; O’Reilly)

  • R-bloggers

  • Towards Data Science

  • MRAN: Package Spotlight

  • Machine Learning with R and H2O (Mark Landry): Booklet Online Version
  • Deep Learning with H2O: Vignette

  • CRAN
    RStudio
    RCode
    CRAN odbc
    DB RStudio Drivers
    DB RStudio odbc Usage
    MariaDB ODBC Connector
    MySQL ODBC Connector
    CRAN RODBC
    RStudio odbc
    R Cookbook Second Edition (O’Reilly Media; Paul Teetor; James (JD) Long)
    R Graphics Cookbook Second Edition (O’Reilly Media; Winston Chang)
    R for Data Science (O’Reilly Media; Garrett Grolemund, Hadley Wickham)
    Advanced R Second Edition (CRC R Series; Hadley Wickham)
    Practical Statistics for Data Scientists (O’Reilly Media; Peter Bruce, Andrew Bruce)
    RStudio Cheatsheets: Webpage
    R Reference Card v2
    RSeek: For searching any R related information (Based on Google).
    RPackages: Search and stats for CRAN packages.
    h2o
    H2O.ai: Webpage
    H2O.ai Algorithms: Cheatsheet
    h2o R Package Functions: Cheatsheet
    Practical Machine Learning with H2O (O'Reilly Media; Darren Cook)
    R package keras
    Python's 'Keras'
    R interface to Keras: Webpage
    Deep Learning With R (François Chollet with J. J. Allaire, Manning)
    Keras Rstudio Cheatsheet
    Introduction to Machine Learning with R (O'Reilly; Scott Burger)
    Text Mining With R: A Tidy Approach (O’Reilly Media; Julia Silge and David Robinson): Book Online Version
    Shiny
    Shiny Written Tutorials
    Shiny R Package Cheatsheet
    RInno: CRAN Webpage (Windows)
    ShinyProxy: Webpage
    Shiny Server (Open Source Edition): Webpage
    R Markdown: The Definitive Guide (Book).
    R Markdown Cheatsheet.
    Chapman & Hall/CRC The R Series: Subject-specific Books

    FLOAT

    Native IEEE floating point format. NULL: 0xFFAAAAAA

    INT

    Little-endian integer format. Signed NULL: 0x80000000. Unsigned NULL: 0xFFFFFFFE

    SMALLINT

    Little-endian integer format. Signed NULL: 0x8000. Unsigned NULL: 0xFFFE

    TINYINT

    Little-endian integer format. Signed NULL: 0x80. Unsigned NULL: 0xFE

    VARCHAR

    String padded with '0' to match the length of the field. NULL: '0' for the full length of the field

    -I1

    Numeric fields containing NULL will be treated as NULL unless the column has a default value

    -I2

    Numeric fields containing NULL will be saturated

    BIGINT

    Little-endian integer format. Signed NULL: 0x8000000000000000ULL. Unsigned NULL: 0xFFFFFFFFFFFFFFFEULL

    CHAR

    String padded with '0' to match the length of the field. NULL: '0' for the full length of the field

    DATE

    Use the format represented by the struct Date. NULL: 0xFFFFFFFE

    DATETIME

    Use the format represented by the struct DateTime. NULL: 0xFFFFFFFFFFFFFFFEULL

    DECIMAL

    Use an integer representation of the value without the decimal point. Sizing depends on precision:

    • 1-2: use 2 bytes
    • 3-4: use 3 bytes
    • 4-9: use 4 bytes
    • 10+: use 8 bytes

    Signed and unsigned NULL: see the equivalent-sized integer

    DOUBLE

    Native IEEE floating point format. NULL: 0xFFFAAAAAAAAAAAAAULL

    Load Ordered Data in Proper Order
    Prepare System for Enterprise ColumnStore
    Install Enterprise ColumnStore
    Start and Configure Enterprise ColumnStore
    Test Enterprise ColumnStore
    Bulk Import Data to Enterprise ColumnStore
    MariaDB Enterprise ColumnStore
    S3-Compatible Object Storage
    MariaDB Enterprise Server

    Analyzing Queries

    Determining Active Queries

    SHOW PROCESSLIST

    The MariaDB SHOW PROCESSLIST statement is used to see a list of active queries on that User Module (UM):

    getActiveSQLStatements

    getActiveSQLStatements is an mcsadmin command that shows which SQL statements are currently being executed on the database:

    Analysis of Individual Queries

    Query Statistics

    The calGetStats function provides statistics about node and network resources used by the last query run. Example:

    The output contains information on:

    • MaxMemPct - Peak memory utilization on the User Module (UM), likely in support of a large UM-based hash join operation.

    • NumTempFiles - Report on any temporary files created in support of query operations larger than available memory, typically for unusual join operations where the smaller table join cardinality exceeds some configurable threshold.

    • TempFileSpace - Report on space used by temporary files created in support of query operations larger than available memory, typically for unusual join operations where the smaller table join cardinality exceeds some configurable threshold.

    The output is useful to determine how much physical I/O was required, how much data was cached, and how many partition blocks were eliminated through use of Extent Map elimination. The system maintains min/max values for each extent and uses these to implement WHERE clause filters that completely bypass extents where the value is outside the min/max range. When a column is ordered (or semi-ordered) during load, such as a time column, this offers very large performance gains, as the system can avoid scanning many extents for that column.

    Query Plan / Trace

    While the MariaDB Server's EXPLAIN utility can be used to look at the query plan, it is somewhat less helpful for ColumnStore tables, as ColumnStore does not use indexes or make use of MariaDB I/O functionality. The execution plan for a query on a ColumnStore table is made up of multiple steps. Each step in the query plan performs a set of operations that are issued from the User Module to the set of Performance Modules in support of a given step in a query.

    • Full Column Scan - an operation that scans each entry in a column using all available threads on the Performance Modules. Speed of operation is generally related to the size of the data type and the total number of rows in the column. The closest analogy for a traditional system is an index scan operation.

    • Partitioned Column Scan - an operation that uses the Extent Map to identify that certain portions of the column do not contain any matching values for a given set of filters. The closest analogy for a traditional row-based DBMS is a partitioned index scan, or partitioned table scan operation.

    • Column lookup by row offset - once the matching filters have been applied and the minimal set of rows has been identified, additional blocks are requested using a calculation that determines exactly which block is required. The closest analogy for a traditional system is a lookup by rowid.

    These operations are automatically executed together to apply the appropriate filters and perform column lookup by row offset.

    Viewing the ColumnStore Query Plan

    In MariaDB ColumnStore there is a set of SQL tracing stored functions provided to see the distributed query execution plan between the nodes.

    The basic steps to using these SQL tracing stored functions are:

    1. Start the trace for the particular session.

    2. Execute the SQL statement in question.

    3. Review the trace collected for the statement. As an example, the following session starts a trace, issues a query against a 6 million row fact table and 300,000 row dimension table, and then reviews the output from the trace:

    The columns headings in the output are as follows:

    • Desc – Operation being executed. Possible values:

      • BPS - Batch Primitive Step: scanning or projecting the column blocks.

      • CES - Cross Engine Step: Performing Cross engine join

      • DSS - Dictionary Structure Step: a dictionary scan for a particular variable length string value.

    Note: The time recorded is the time from PrimProc and ExeMgr. Execution time from within mysqld is not tracked here. There could be extra processing time in mysqld due to a number of factors, such as ORDER BY.

    Cache Clearing to Enable Cold Testing

    Sometimes it can be useful to clear caches to allow understanding of un-cached and cached query access. The calFlushCache() function will clear caches on all servers. This is only really useful for testing query performance:

    Viewing Extent Map Information

    It can be useful to view details about the Extent Map for a given column. This can be achieved using the editem utility on any ColumnStore server. Available arguments can be listed using the -h flag. The most common use is to provide the column object id with the -o argument, which will output details for the column; in this case the -t argument is also provided to show min/max values as dates:

    Here it can be seen that the extent maps for the o_orderdate (object id 3032) column are well partitioned, since the orders table source data was sorted by the order date. This example shows 2 separate DBRoot values, as the environment was a 2-node combined deployment.

    Column object ids may be found by querying the calpontsys.syscolumn metadata table (deprecated) or information_schema.columnstore_columns table (version 1.0.6+).

    Query Statistics History

    MariaDB ColumnStore query statistics history can be retrieved for analysis. By default, query stats collection is disabled. To enable the collection of query stats, the <Enabled> element within the <QueryStats> section of the ColumnStore.xml configuration file should be set to Y (default is N).

    Cross Engine Support must also be enabled before enabling Query Statistics. See the Cross Engine Configuration section.

    For query statistics, the cross engine user needs the INSERT privilege on the querystats table.

    Example:

    When enabled, the history of query statistics across all sessions, along with execution time and the stats provided by calGetStats(), is stored in a table in the infinidb_querystats schema. Only queries using the following ColumnStore syntax are available for statistics monitoring:

    • SELECT

    • INSERT

    • UPDATE

    • DELETE

    Query Statistics Table

    When QueryStats is enabled, the query statistics history is collected in the querystats table in the infinidb_querystats schema.

    The columns of this table are:

    • queryID - A unique identifier assigned to the query

    • Session ID (sessionID) - The session number that executed the statement.

    • queryType - The type of the query: INSERT, UPDATE, DELETE, SELECT, INSERT SELECT, or LOAD DATA INFILE

    • query - The text of the query

    Query Statistics Viewing

    Users can view the query statistics by selecting the rows from the query stats table in the infinidb_querystats schema. Examples listed below:

    • Example 1: List execution time, rows returned for all the select queries within the past 12 hours:

    • Example 2: List the three slowest running select queries of session 2 within the past 12 hours:

    • Example 3: List the average, min and max running time of all the INSERT SELECT queries within the past 12 hours:

    CHANGE MASTER TO
       MASTER_USER = "repl",
       MASTER_HOST = "192.0.2.1",
       MASTER_PASSWORD = "repl_passwd",
       MASTER_USE_GTID=slave_pos;
    START SLAVE;
    sudo mariadb-backup --backup \
          --user=mariabackup_user \
          --password=mariabackup_passwd \
          --target-dir=/data/backup/replica_backup
    sudo mariadb-backup --prepare \
          --target-dir=/data/backup/replica_backup
    sudo rsync -av /data/backup/replica_backup 192.0.2.3:/data/backup/
    sudo mariadb-backup --copy-back \
       --target-dir=/data/backup/replica_backup
    sudo chown -R mysql:mysql /var/lib/mysql
    sudo systemctl restart mariadb
    sudo systemctl enable mariadb
    sudo systemctl restart mariadb-columnstore
    sudo systemctl disable mariadb-columnstore
    sudo systemctl restart mariadb-columnstore-cmapi
    sudo systemctl enable mariadb-columnstore-cmapi
    cat xtrabackup_binlog_info
    mariadb-bin.000096 568 0-1-2001,1-2-5139
    sudo mariadb
    curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/node \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       --data '{"timeout":20, "node": "192.0.2.3"}' \
       | jq .
    curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/add-node \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       --data '{"timeout":20, "node": "192.0.2.3"}' \
       | jq .
    {
      "timestamp": "2020-10-28 00:39:14.672142",
      "node_id": "192.0.2.3"
    }
    curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       | jq .
    {
      "timestamp": "2020-12-15 00:40:34.353574",
      "192.0.2.1": {
        "timestamp": "2020-12-15 00:40:34.362374",
        "uptime": 11467,
        "dbrm_mode": "master",
        "cluster_mode": "readwrite",
        "dbroots": [
          "1"
        ],
        "module_id": 1,
        "services": [
          {
            "name": "workernode",
            "pid": 19202
          },
          {
            "name": "controllernode",
            "pid": 19232
          },
          {
            "name": "PrimProc",
            "pid": 19254
          },
          {
            "name": "ExeMgr",
            "pid": 19292
          },
          {
            "name": "WriteEngine",
            "pid": 19316
          },
          {
            "name": "DMLProc",
            "pid": 19332
          },
          {
            "name": "DDLProc",
            "pid": 19366
          }
        ]
      },
      "192.0.2.2": {
        "timestamp": "2020-12-15 00:40:34.428554",
        "uptime": 11437,
        "dbrm_mode": "slave",
        "cluster_mode": "readonly",
        "dbroots": [
          "2"
        ],
        "module_id": 2,
        "services": [
          {
            "name": "workernode",
            "pid": 17789
          },
          {
            "name": "PrimProc",
            "pid": 17813
          },
          {
            "name": "ExeMgr",
            "pid": 17854
          },
          {
            "name": "WriteEngine",
            "pid": 17877
          }
        ]
      },
      "192.0.2.3": {
        "timestamp": "2020-12-15 00:40:34.428554",
        "uptime": 11437,
        "dbrm_mode": "slave",
        "cluster_mode": "readonly",
        "dbroots": [
          "2"
        ],
        "module_id": 2,
        "services": [
          {
            "name": "workernode",
            "pid": 17789
          },
          {
            "name": "PrimProc",
            "pid": 17813
          },
          {
            "name": "ExeMgr",
            "pid": 17854
          },
          {
            "name": "WriteEngine",
            "pid": 17877
          }
        ]
      },
      "num_nodes": 3
    }
    maxctrl create server \
       mcs3 \
       192.0.2.3
    maxctrl show servers
    maxctrl link monitor \
       mcs_monitor \
       mcs3
    maxctrl show monitors
    maxctrl link service \
       mcs_service \
       mcs3
    maxctrl show services
    maxctrl list servers
    SET GLOBAL gtid_slave_pos='0-1-2001,1-2-5139';
    mcs install_es --token <ES_API_TOKEN> --version latest
    mcs install_es --token <ES_API_TOKEN> --version <ES_VERSION>
    mcs install_es --token <ES_API_TOKEN> --version <ES_VERSION> --ignore-mismatch
    mcs install_es --token <ES_API_TOKEN> --version 10.6.15-10
    install.packages("odbc")
    install.packages("RMariaDB")
    library(RMariaDB)
    
    con <- dbConnect(
      drv = RMariaDB::MariaDB(), 
      username = NULL,
      password = NULL, 
      host = NULL, 
      port = 3306
    )
    install.packages("h2o")
    connection_url <- "jdbc:mariadb://172.16.2.178:3306/ingestSQL?useSSL=false"
    username <- "root"
    password <- "abc123"
    
    # Whole Table:
    table <- "citibike20k"
    my_citibike_data <- h2o.import_sql_table(connection_url, table, username, password)
    
    # SELECT Query:
    select_query <-  "SELECT  bikeid  FROM citibike20k"
    my_citibike_data <- h2o.import_sql_select(connection_url, select_query, username, password)
    java -cp <path_to_h2o_jar>:<path_to_jdbc_driver_jar> water.H2OApp
    $ mariadb --host 192.168.0.100 --port 5001 \
              --user db_user --password \
              --default-character-set=utf8
    CREATE DATABASE inventory;
    CREATE TABLE inventory.products (
       product_name VARCHAR(11) NOT NULL DEFAULT '',
       supplier VARCHAR(128) NOT NULL DEFAULT '',
       quantity VARCHAR(128) NOT NULL DEFAULT '',
       unit_cost VARCHAR(128) NOT NULL DEFAULT ''
    ) ENGINE=Columnstore DEFAULT CHARSET=utf8;
    $ sudo cpimport \
       inventory products \
       inventory-products.txt
    $ sudo cpimport -I1 \
       inventory products \
       inventory-products.bin
    struct Date
    {
      unsigned spare : 6;
      unsigned day : 6;
      unsigned month : 4;
      unsigned year : 16;
    };
    struct DateTime
    {
      unsigned msecond : 20;
      unsigned second : 6;
      unsigned minute : 6;
      unsigned hour : 6;
      unsigned day : 6;
      unsigned month : 4;
      unsigned year : 16;
    };
    $ mariadb --quick \
       --skip-column-names \
       --execute="SELECT * FROM inventory.products" \
       | cpimport -s '\t' inventory products
    $ aws s3 cp --quiet s3://columnstore-test/inventory-products.csv - \
       | cpimport -s ',' inventory products
    $ sudo cpimport -s '\t' \
       inventory products \
       inventory-products.tsv
    $ sudo cpimport -s ',' \
       inventory products \
       inventory-products.csv
    $ sudo cpimport -s '\t' -E '"' \
       inventory products \
       inventory-products.tsv
    $ sudo cpimport -s ',' -E "'" \
       inventory products \
       inventory-products.csv
    sudo yum update "MariaDB-*" "MariaDB-columnstore-engine" "MariaDB-columnstore-cmapi"
    apt --version
    apt 2.0.9 (amd64)
    sudo apt install --only-upgrade "mariadb*"
    sudo apt install --only-upgrade '?upgradable ?name(mariadb.*)'
    MariaDB [test]> SHOW PROCESSLIST;
    +----+------+-----------+-------+---------+------+-------+------------------+
    | Id | User | Host      | db    | Command | Time | State | Info             |
    +----+------+-----------+-------+---------+------+-------+------------------+
    | 73 | root | localhost | ssb10 | Query   |    0 | NULL  | show processlist |
    +----+------+-----------+-------+---------+------+-------+------------------+
    1 row in set (0.01 sec)
  • PhyI/O - Number of 8k blocks read from disk, SSD, or other persistent storage.
  • CacheI/O - Approximate number of 8k blocks processed in memory, adjusted down by the number of discrete PhyI/O calls required.

  • BlocksTouched - Approximate number of 8k blocks processed in memory.

  • PartitionBlocksEliminated - The number of block touches eliminated via the Extent Map elimination behavior.

  • MsgBytesIn, MsgBytesOut - Message sizes in MB sent between nodes in support of the query.

  • HJS - Hash Join Step: Performing a hash join between 2 tables

  • HVS - Having Step: Performing the HAVING clause on the result set

  • SQS - Sub Query Step: Performing a sub query

  • TAS - Tuple Aggregation Step: the process of receiving intermediate aggregation results from other nodes

  • TNS - Tuple Annexation Step: Query result finishing, e.g., filling in constant columns, LIMIT, ORDER BY, and final DISTINCT cases

  • TUS - Tuple Union Step: Performing a SQL union of 2 sub queries

  • TCS - Tuple Constant Step: Processing constant value columns

  • WFS - Window Function Step: Performing a window function

  • Mode – Where the operation was performed within the PrimProc library

  • Table – Table for which columns may be scanned/projected.

  • TableOID – ObjectID for the table being scanned.

  • ReferencedOIDs – ObjectIDs for the columns required by the query.

  • PIO – Physical I/O (reads from storage) executed for the query.

  • LIO – Logical I/O executed for the query, also known as Blocks Touched.

  • PBE – Partition Blocks Eliminated identifies blocks eliminated by Extent Map min/max.

  • Elapsed – Elapsed time for a given step.

  • Rows – Intermediate rows returned.

  • INSERT ... SELECT

  • LOAD DATA INFILE

  • Host (host) - The host that executed the statement.

  • User ID (user) - The user that executed the statement.

  • Priority (priority) - The priority the user has for this statement.

  • Query Execution Times (startTime, endTime) - Calculated as end time – start time.

    • start time - the time that the query gets to ExeMgr, DDLProc, or DMLProc

    • end time - the time that the last result packet exits ExeMgr, DDLProc or DMLProc

  • Rows returned or affected (rows) - The number of rows returned for SELECT queries, or the number of rows affected by DML queries. Not valid for DDL and other query types.

  • Error Number (errNo) - The IDB error number if this query failed, 0 if it succeeded.

  • Physical I/O (phyIO) - The number of blocks that the query accessed from the disk, including the pre-fetch blocks. This statistic is only valid for the queries that are processed by ExeMgr, i.e. SELECT, DML with WHERE clause, and INSERT SELECT.

  • Cache I/O (cacheIO) - The number of blocks that the query accessed from the cache. This statistic is only valid for queries that are processed by ExeMgr, i.e. SELECT, DML with WHERE clause, and INSERT SELECT.

  • Blocks Touched (blocksTouched) - The total number of blocks that the query accessed physically and from the cache. This should be equal or less than the sum of physical I/O and cache I/O. This statistic is only valid for queries that are processed by ExeMgr, i.e. SELECT, DML with WHERE clause, and INSERT SELECT.

  • Partition Blocks Eliminated (CPBlocksSkipped) - The number of blocks being eliminated by the extent map casual partition. This statistic is only valid for queries that are processed by ExeMgr, i.e. SELECT, DML with WHERE clause, and INSERT SELECT.

  • Messages to other nodes (msgOutUM) - The number of messages in bytes that ExeMgr sends to the PrimProc. If a message needs to be distributed to all the PMs, the sum of all the distributed messages will be counted. Only valid for queries that are processed by ExeMgr, i.e. SELECT, DML with WHERE clause, and INSERT SELECT.

  • Messages from other nodes (msgInUM) - The number of messages in bytes that PrimProc sends to the ExeMgr. Only valid for queries that are processed by ExeMgr, i.e. SELECT, DML with where clause, and INSERT SELECT.

  • Memory Utilization (maxMemPct) - This field shows memory utilization in support of any join, group by, aggregation, distinct, or other operation.

  • Blocks Changed (blocksChanged) - Total number of blocks that queries physically changed on disk. This is only for delete/update statements.

  • Temp Files (numTempFiles) - This field shows any temporary file utilization in support of any join, group by, aggregation, distinct, or other operation.

  • Temp File Space (tempFileSpace) - This shows the size of any temporary file utilization in support of any join, group by, aggregation, distinct, or other operation.

    mcsadmin> getActiveSQLStatements
    getactivesqlstatements Wed Oct 7 08:38:32 2015
    Get List of Active SQL Statements
    =================================
    Start Time    Time (hh:mm:ss) Session ID SQL Statement
    ---------------- ---------------- -------------------- ------------------------------------------------------------
    Oct 7 08:38:30    00:00:03       73 select c_name,sum(lo_revenue) from customer, lineorder where lo_custkey = c_custkey and c_custkey = 6 group by c_name
    MariaDB [test]> SELECT count(*) FROM wide2;
    +----------+                                       
    | count(*) |
    +----------+
    |  5000000 |
    +----------+
    1 row in set (0.22 sec)
    
    MariaDB [test]> SELECT calGetStats();
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | calGetStats()                                                                                                                                                                                     |
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Query Stats: MaxMemPct-0; NumTempFiles-0; TempFileSpace-0B; ApproxPhyI/O-1931; CacheI/O-2446; BlocksTouched-2443; PartitionBlocksEliminated-0; MsgBytesIn-73KB; MsgBytesOut-1KB; Mode-Distributed |
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    1 row in set (0.01 sec)
    MariaDB [test]> SELECT calSetTrace(1);
    +----------------+
    | calSetTrace(1) |
    +----------------+
    |              0 |
    +----------------+
    1 row in set (0.00 sec)
    
    MariaDB [test]> SELECT c_name, sum(o_totalprice)
        -> FROM customer, orders
        -> WHERE o_custkey = c_custkey
        -> AND c_custkey = 5
        -> GROUP BY c_name;
    +--------------------+-------------------+
    | c_name             | sum(o_totalprice) |
    +--------------------+-------------------+
    | Customer#000000005 |         684965.28 |
    +--------------------+-------------------+
    1 row in set, 1 warning (0.34 sec)
    
    MariaDB [test]> SELECT calGetTrace();
    +------------------------------------------------------------------------------------------------+
    | calGetTrace()                                                                                  |
    +------------------------------------------------------------------------------------------------+
    |
    Desc Mode Table           TableOID ReferencedColumns        PIO LIO PBE Elapsed Rows
    BPS  PM   customer        3024     (c_custkey,c_name)       0   43  36  0.006   1
    BPS  PM   orders          3038     (o_custkey,o_totalprice) 0   766 0   0.032   3
    HJS  PM   orders-customer 3038     -                        -   -   -   -----   -
    TAS  UM   -               -        -                        -   -   -   0.021   1
     |
    +------------------------------------------------------------------------------------------------+
    1 row in set (0.00 sec)
    MariaDB [test]> SELECT calFlushCache();
    editem -o 3032 -t
    Col OID = 3032, NumExtents = 10, width = 4
    428032 - 432127 (4096) min: 1992-01-01, max: 1993-06-21, seqNum: 1, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 0, HWM: 0; status: avail
    502784 - 506879 (4096) min: 1992-01-01, max: 1993-06-22, seqNum: 1, state: valid, fbo: 0, DBRoot: 2, part#: 0, seg#: 1, HWM: 0; status: unavail
    708608 - 712703 (4096) min: 1993-06-21, max: 1994-12-11, seqNum: 1, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 2, HWM: 0; status: unavail
    766976 - 771071 (4096) min: 1993-06-22, max: 1994-12-12, seqNum: 1, state: valid, fbo: 0, DBRoot: 2, part#: 0, seg#: 3, HWM: 0; status: unavail
    989184 - 993279 (4096) min: 1994-12-11, max: 1996-06-01, seqNum: 1, state: valid, fbo: 4096, DBRoot: 1, part#: 0, seg#: 0, HWM: 8191; status: avail
    1039360 - 1043455 (4096) min: 1994-12-12, max: 1996-06-02, seqNum: 1, state: valid, fbo: 4096, DBRoot: 2, part#: 0, seg#: 1, HWM: 8191; status: avail
    1220608 - 1224703 (4096) min: 1996-06-01, max: 1997-11-22, seqNum: 1, state: valid, fbo: 4096, DBRoot: 1, part#: 0, seg#: 2, HWM: 8191; status: avail
    1270784 - 1274879 (4096) min: 1996-06-02, max: 1997-11-22, seqNum: 1, state: valid, fbo: 4096, DBRoot: 2, part#: 0, seg#: 3, HWM: 8191; status: avail
    1452032 - 1456127 (4096) min: 1997-11-22, max: 1998-08-02, seqNum: 1, state: valid, fbo: 0, DBRoot: 1, part#: 1, seg#: 0, HWM: 1930; status: avail
    1510400 - 1514495 (4096) min: 1997-11-22, max: 1998-08-02, seqNum: 1, state: valid, fbo: 0, DBRoot: 2, part#: 1, seg#: 1, HWM: 1930; status: avail
    <QueryStats>
    <Enabled>Y</Enabled>
    </QueryStats>
    grant INSERT on infinidb_querystats.querystats to 'cross_engine'@'127.0.0.1';
    grant INSERT on infinidb_querystats.querystats to 'cross_engine'@'localhost';
    MariaDB [infinidb_querystats]> select queryid, query, endtime-starttime, rows from querystats 
    where starttime >= now() - interval 12 hour and querytype = 'SELECT';
    MariaDB [infinidb_querystats]> select a.* from (select endtime-starttime execTime, query from queryStats 
    where sessionid = 2 and querytype = 'SELECT' and starttime >= now()-interval 12 hour
    order by 1 limit 3) a;
    MariaDB [infinidb_querystats]> select min(endtime-starttime), max(endtime-starttime), avg(endtime-starttime) from querystats 
    where querytype='INSERT SELECT' and starttime >= now() - interval 12 hour;

    ColumnStore System Variables

    Variables

    columnstore_diskjoin_force_run

    • Controls whether disk joins are forced to run even if they are not estimated to be the most efficient execution plan. This can be useful for debugging purposes or for situations where the optimizer's estimates are not accurate.

    • Scope: global, session

    • Data type: boolean

    • Default value: OFF

    • Range: ON, OFF

    • Introduced in: MariaDB Enterprise Server 10.6

    columnstore_diskjoin_max_partition_tree_depth

    • Sets the maximum depth of the partition tree that can be used for disk joins. A higher value allows for more complex joins, but may also increase the memory usage and execution time.

    • Scope: global, session

    • Data type: numeric

    • Default value: 10

    • Introduced in: MariaDB Enterprise Server 10.6

    columnstore_max_allowed_in_values

    • Sets the maximum number of values that can be used in an IN predicate on a Columnstore table. This limit helps to prevent performance issues caused by queries with a large number of IN values.

    • Scope: global, session

    • Data type: numeric

    • Default value: 10000

    • Introduced in: MariaDB Enterprise Server 10.6

    columnstore_max_pm_join_result_count

    • Sets the maximum number of rows that can be returned by a parallel merge join on a Columnstore table. This limit helps to prevent memory issues caused by joins that return a large number of rows.

    • Scope: global, session

    • Data type: numeric

    • Default value: 1000000

    • Introduced in: MariaDB Enterprise Server 10.6

    infinidb_compression_type

    • Command line: Yes

    • Scope: global, session

    • Data type: numeric

    • Default value: 2

    • Range: 0, 2

    infinidb_decimal_scale

    • Command line: Yes

    • Scope: global, session

    • Data type: numeric

    • Default value: 8

    infinidb_diskjoin_bucketsize

    • Command line: Yes

    • Scope: global, session

    • Data type: numeric

    • Default value: 100

    infinidb_diskjoin_largesidelimit

    • Command line: Yes

    • Scope: global, session

    • Data type: numeric

    • Default value: 0

    infinidb_diskjoin_smallsidelimit

    • Command line: Yes

    • Scope: global, session

    • Data type: numeric

    • Default value: 0

    infinidb_double_for_decimal_math

    • Command line: Yes

    • Scope: global, session

    • Data type: boolean

    • Default value: OFF

    • Range: OFF, ON

    infinidb_import_for_batchinsert_delimiter

    • Command line: Yes

    • Scope: global, session

    • Data type: numeric

    • Default value: 7

    infinidb_import_for_batchinsert_enclosed_by

    • Command line: Yes

    • Scope: global, session

    • Data type: numeric

    • Default value: 17

    infinidb_local_query

    • Command line: Yes

    • Scope: global, session

    • Data type: numeric

    • Default value: 0

    • Range: 0, 1

    infinidb_ordered_only

    • Command line: Yes

    • Scope: global, session

    • Data type: boolean

    • Default value: OFF

    • Range: OFF, ON

    infinidb_string_scan_threshold

    • Command line: Yes

    • Scope: global, session

    • Data type: numeric

    • Default value: 10

    infinidb_stringtable_threshold

    • Command line: Yes

    • Scope: global, session

    • Data type: numeric

    • Default value: 20

    infinidb_um_mem_limit

    • Command line: Yes

    • Scope: global, session

    • Data type: numeric

    • Default value: 0

    infinidb_use_decimal_scale

    • Command line: Yes

    • Scope: global, session

    • Data type: boolean

    • Default value: OFF

    • Range: OFF, ON

    infinidb_use_import_for_batchinsert

    • Command line: Yes

    • Scope: global, session

    • Data type: boolean

    • Default value: ON

    • Range: OFF, ON

    infinidb_varbin_always_hex

    • Command line: Yes

    • Scope: global, session

    • Data type: boolean

    • Default value: ON

    • Range: OFF, ON

    infinidb_vtable_mode

    • Command line: Yes

    • Scope: global, session

    • Data type: numeric

    • Default value: 1

    • Range: 0, 1, 2

    Compression Mode

    MariaDB ColumnStore has the ability to compress data. This is controlled through a compression mode, which can be set as a default for the instance or set at the session level.

    To set the compression mode at the session level, the following command is used. Once the session has ended, any subsequent session will return to the default for the instance:
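
    A minimal sketch of the session-level command, using the infinidb_compression_type variable listed above:

    SET SESSION infinidb_compression_type = n;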

    where n is:

    • 0 - compression is turned off. Any subsequent table create statements run will have compression turned off for that table unless a statement override has been performed. Any alter statements run to add a column will have compression turned off for that column unless a statement override has been performed.

    • 2 - compression is turned on. Any subsequent table create statements run will have compression turned on for that table unless a statement override has been performed. Any alter statements run to add a column will have compression turned on for that column unless a statement override has been performed. ColumnStore uses snappy compression in this mode.

    ColumnStore Decimal-to-Double Math

    MariaDB ColumnStore has the ability to change intermediate decimal mathematical results from the decimal type to the double type. The decimal type has approximately 17-18 digits of precision but a smaller maximum range, whereas the double type has approximately 15-16 digits of precision but a much larger maximum range.

    In typical mathematical and scientific applications, the ability to avoid overflow in intermediate results with double math is likely more beneficial than the additional two digits of precision. In banking applications, however, it may be more appropriate to keep the default decimal setting to ensure accuracy to the least significant digit.

    Enable/Disable Decimal-to-Double Math

    The infinidb_double_for_decimal_math variable is used to control the data type for intermediate decimal results. Decimal-to-double math may be set as a default for the instance, set at the session level, or set at the statement level by toggling this variable on and off.

    To enable/disable the use of the decimal to double math at the session level, the following command is used. Once the session has ended, any subsequent session will return to the default for the instance:
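
    A minimal sketch, assuming the infinidb_double_for_decimal_math variable:

    SET SESSION infinidb_double_for_decimal_math = n;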

    where n is:

    • off (disabled, default)

    • on (enabled)

    ColumnStore Decimal Scale

    ColumnStore has the ability to support varied internal precision on decimal calculations. infinidb_decimal_scale is used internally by the ColumnStore engine to control how many significant digits to the right of the decimal point are carried through in suboperations on calculated columns. If, while running a query, you receive the message 'aggregate overflow', try reducing infinidb_decimal_scale and running the query again.

    Note that, as you decrease infinidb_decimal_scale, you may see reduced accuracy in the least significant digit(s) of a returned calculated column. infinidb_use_decimal_scale is used internally by the ColumnStore engine to turn the use of this internal precision on and off. These two system variables can be set as a default for the instance or at session level.

    Enable/Disable Decimal Scale

    To enable/disable the use of the decimal scale at the session level, the following command is used. Once the session has ended, any subsequent session will return to the default for the instance:
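
    Sketched below, assuming the infinidb_use_decimal_scale variable:

    SET SESSION infinidb_use_decimal_scale = n;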

    where n is off (disabled) or on (enabled).

    Set Decimal Scale Level

    To set the decimal scale at the session level, the following command is used. Once the session has ended, any subsequent session will return to the default for the instance.
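
    Sketched below, assuming the infinidb_decimal_scale variable:

    SET SESSION infinidb_decimal_scale = n;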

    where n is the amount of precision desired for calculations.

    Disk-Based Joins

    Introduction

    By default, joins are performed in memory. When a join operation exceeds the memory allocated for query joins, the query is aborted with error code IDB-2001.

    Disk-based joins enable such queries to use disk for intermediate join data when the memory needed for the join exceeds the memory limit. Although slower than a fully in-memory join, and bound by the temporary space available on disk, disk-based joins allow such queries to complete.

    Disk-based joins do not include aggregation and DML joins.

    The following variables in the HashJoin element in the Columnstore.xml configuration file relate to disk-based joins. Columnstore.xml resides in /usr/local/mariadb/columnstore/etc/.

    • AllowDiskBasedJoin – Option to use disk-based joins. Valid values are Y (enabled) or N (disabled). Default is disabled.

    • TempFileCompression – Option to use compression for disk join files. Valid values are Y (use compressed files) or N (use non-compressed files).

    • TempFilePath – The directory path used for the disk joins. By default, this path is the tmp directory for your installation (i.e., /usr/local/mariadb/columnstore/tmp). Files (named infinidb-join-data*) in this directory will be created and cleaned on an as-needed basis. The entire directory is removed and recreated by ExeMgr at startup.

    When using disk-based joins, it is strongly recommended that the TempFilePath reside on its own partition as the partition may fill up as queries are executed.
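
    For reference, a sketch of how these settings might appear within the HashJoin element of Columnstore.xml (the values shown are illustrative, not recommendations):

    <HashJoin>
      <AllowDiskBasedJoin>N</AllowDiskBasedJoin>
      <TempFileCompression>Y</TempFileCompression>
      <TempFilePath>/usr/local/mariadb/columnstore/tmp</TempFilePath>
    </HashJoin>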

    Per user join memory limit

    In addition to the system-wide flags, the following system variable exists at the SQL global and session level for managing the per-user memory limit for joins.

    • infinidb_um_mem_limit - A value for memory limit in MB per user. When this limit is exceeded by a join, it will switch to a disk-based join. By default, the limit is not set (value of 0).

    For modification at the global level, add the setting to the my.cnf file (typically located under /usr/local/mariadb/columnstore/mysql):
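
    A sketch of the my.cnf entry (placement under the [mysqld] group is an assumption):

    [mysqld]
    infinidb_um_mem_limit = value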

    where value is the in-memory limit per user, in MB.

    For modification at the session level, before issuing your join query from the SQL client, set the session variable as follows.
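
    For example, a sketch that limits the current session to 500 MB (the value is illustrative):

    SET SESSION infinidb_um_mem_limit = 500;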

    Batch Insert Mode for INSERT Statements

    Introduction

    MariaDB ColumnStore has the ability to utilize the cpimport fast data import tool for non-transactional LOAD DATA INFILE and INSERT INTO ... SELECT FROM SQL statements. Using this method results in a significant increase in performance when loading data through these two SQL statements. This optimization is independent of the storage engine used for the tables in the select statement.

    Enable/Disable Using cpimport for Batch Insert

    The infinidb_use_import_for_batchinsert variable is used to control if cpimport is used for these statements. This variable may be set as a default for the instance, set at the session level, or at the statement level by toggling this variable on and off.

    To enable/disable the use of cpimport for batch insert at the session level, the following command is used. Once the session has ended, any subsequent session will return to the default for the instance.
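
    A minimal sketch, using the infinidb_use_import_for_batchinsert variable:

    SET SESSION infinidb_use_import_for_batchinsert = n;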

    where n is:

    • 0 (disabled)

    • 1 (enabled)

    Changing Default Delimiter for INSERT SELECT

    • The infinidb_import_for_batchinsert_delimiter variable is used internally by MariaDB ColumnStore on a non-transactional INSERT INTO ... SELECT FROM statement as the default delimiter passed to the cpimport tool. With a default value of ASCII 7, there should be no need to change this value unless your data contains ASCII 7 values.

    To change this variable value at the session level, the following command is used. Once the session has ended, any subsequent session will return to the default for the instance.
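
    A minimal sketch, assuming the infinidb_import_for_batchinsert_delimiter variable:

    SET SESSION infinidb_import_for_batchinsert_delimiter = ascii_value;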

    where ascii_value is an ASCII value representation of the delimiter desired.

    Note that this setting may cause issues with multi-byte character set data. It is recommended to use UTF8 files directly with cpimport.

    Version Buffer File Management

    If the following error is received, most likely with a transactional LOAD DATA INFILE or INSERT INTO ... SELECT statement, it is recommended to break the load into multiple smaller chunks, increase the VersionBufferFileSize setting, consider a non-transactional LOAD DATA INFILE, or use cpimport.

    The VersionBufferFileSize setting is updated in the ColumnStore.xml typically located under /usr/local/mariadb/columnstore/etc. This dictates the size of the version buffer file on disk which provides DML transactional consistency. The default value is '1GB' which reserves up to a 1 Gigabyte file size. Modify this on the primary node and restart the system if you require a larger value.
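
    For reference, a sketch of the setting as it might appear in ColumnStore.xml (check the installed file for its exact placement):

    <VersionBufferFileSize>1GB</VersionBufferFileSize>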

    Local PrimProc Query Mode

    MariaDB ColumnStore has the ability to query data from just a single node instead of the whole cluster. To accomplish this, the infinidb_local_query variable in the my.cnf configuration file is used; it may be set as a system-wide default or at the session level.

    Enable Local PrimProc Query During Installation

    Local PrimProc query can be enabled system wide during the install process when running the install script postConfigure. Answer 'y' to this prompt during the install process:

    Enable Local PrimProc Query System-Wide

    To enable the use of the local PrimProc query at the instance level, specify infinidb_local_query =1 (enabled) in the my.cnf configuration file at /usr/local/mariadb/columnstore/mysql. The default is 0 (disabled).

    Enable/Disable Local PrimProc Query at the Session Level

    To enable/disable the use of the local PrimProc query at the session level, the following statement is used. Once the session has ended, any subsequent session will return to the default for the instance:
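
    A minimal sketch, using the infinidb_local_query variable:

    SET SESSION infinidb_local_query = n;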

    where n is:

    • 0 (disabled)

    • 1 (enabled)

    At the session level, this variable applies only to executing a query on an individual PrimProc node. The PrimProc must be set up with the local query option during installation.

    Local PrimProc Query Examples

    Example 1 - SELECT from a single table on local PrimProc to import back on local PrimProc:

    With the infinidb_local_query variable set to 1 (default with local PrimProc Query):

    Example 2 - SELECT involving a join between a fact table on the PrimProc node and dimension table across all the nodes to import back on local PrimProc:

    With the infinidb_local_query variable set to 0 (overriding the local-query default of 1):

    Create a script (e.g., extract_query_script.sql in our example) similar to the following:

    The infinidb_local_query is set to 0 to allow querying across all PrimProc nodes.

    The query is structured so PrimProc gets the fact table data locally from the PrimProc node (as indicated by the use of the idbLocalPm() function), while the dimension table data is extracted from all the PrimProc nodes.

    Then you can execute the script to pipe it directly into cpimport:

    Operating Mode

    ColumnStore has the ability to support full MariaDB query syntax through an operating mode. This operating mode may be set as a default for the instance or set at the session level. To set the operating mode at the session level, the following command is used. Once the session has ended, any subsequent session will return to the default for the instance.
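
    A minimal sketch of the session-level command, using the infinidb_vtable_mode variable:

    SET SESSION infinidb_vtable_mode = n;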

    where n is:

    • 0 - a generic, highly compatible row-by-row processing mode. Some WHERE clause components can be processed by ColumnStore, but joins are processed entirely by MySQL using a nested loop join mechanism.

    • 1 - (the default) query syntax is evaluated by ColumnStore for compatibility with distributed execution, and incompatible queries are rejected. Queries executed in this mode take advantage of distributed execution and typically result in higher performance.

    • 2 - auto-switch mode: ColumnStore will attempt to process the query internally; if it cannot, it will automatically switch the query to run in row-by-row mode.


    ColumnStore Storage Architecture

    Overview

    MariaDB Enterprise ColumnStore's storage architecture is designed to provide great performance for analytical queries.

    Columnar Storage Engine

    MariaDB Enterprise ColumnStore is a columnar storage engine for MariaDB Enterprise Server (ES). MariaDB Enterprise ColumnStore enables ES to perform analytical workloads, including online analytical processing (OLAP), data warehousing, decision support systems (DSS), and hybrid transactional-analytical processing (HTAP) workloads.

    Most traditional relational databases use row-based storage engines. In row-based storage engines, all columns of a row are stored contiguously. Row-based storage engines perform very well for transactional workloads but are less performant for analytical workloads.

    Columnar storage engines store each column separately. Columnar storage engines perform very well for analytical workloads. Analytical workloads are characterized by ad hoc queries on very large data sets by relatively few users.

    MariaDB Enterprise ColumnStore automatically partitions each column into extents, which helps improve query performance without using indexes.

    OLAP Workloads

    MariaDB Enterprise ColumnStore enables MariaDB Enterprise Server to perform analytical or online analytical processing (OLAP) workloads.

    OLAP workloads are generally characterized by ad hoc queries on very large data sets. Some other typical characteristics are:

    • Each query typically reads a subset of columns in the table

    • Most activity typically consists of read-only queries that perform aggregations, window functions, and various calculations

    • Analytical applications typically require only a few concurrent queries

    • Analytical applications typically require the scalability of large, complex queries

    OLAP workloads are typically required for:

    • Business intelligence (BI)

    • Health informatics

    • Historical data mining

    Row-based storage engines have a disadvantage for OLAP workloads. Indexes are not usually very useful for OLAP workloads, because the large size of the data set and the ad hoc nature of the queries preclude the use of indexes to optimize queries.

    Columnar storage engines are much better suited for OLAP workloads. MariaDB Enterprise ColumnStore is a columnar storage engine that is designed for OLAP workloads:

    • When a query reads a subset of columns in the table, Enterprise ColumnStore can reduce I/O by reading those columns and ignoring all others, because each column is stored separately

    • When most activity consists of read-only queries that perform aggregations, window functions, and various calculations, Enterprise ColumnStore is able to efficiently execute those queries using extent elimination, distributed query execution, and massively parallel processing (MPP) techniques

    • When only a few concurrent queries are required, Enterprise ColumnStore is able to maximize the use of system resources by using multiple threads and multiple nodes to perform work for each query

    OLTP Workloads

    MariaDB Enterprise Server has had excellent performance for transactional or online transactional processing (OLTP) workloads since the beginning.

    OLTP workloads are generally characterized by a fixed set of queries using a relatively small data set. Some other typical characteristics are:

    • Each query typically reads and/or writes many columns in the table.

    • Most activity typically consists of small transactions that only read and/or write a small number of rows.

    • Transactional applications typically require many concurrent transactions.

    • Transactional applications typically require a fast response time and low latency.

    OLTP workloads are typically required for:

    • Financial transactions performed by financial institutions and e-commerce sites.

    • Store inventory changes performed by brick-and-mortar stores and e-commerce sites.

    • Account metadata changes performed by many sites that store personal data.

    Row-based storage engines have several advantages for OLTP workloads:

    • When a query reads and/or writes many columns in the table, row-based storage engines can find all columns on a single page, so the I/O costs of the operation are low.

    • When a transaction reads/writes a small number of rows, row-based storage engines can use an index to find the page for each row without a full table scan.

    • When many concurrent transactions are operating, row-based storage engines can implement transactional isolation by storing multiple versions of changed rows.

    • When a fast response time and low latency are required, row-based storage engines can use indexes to optimize the most common queries.

InnoDB is ES's default storage engine, and it is a highly performant row-based storage engine.

    Hybrid Workloads

MariaDB Enterprise ColumnStore enables MariaDB Enterprise Server to function as a single-stack solution for hybrid workloads.

Hybrid workloads are characterized by a mix of transactional and analytical queries. Hybrid workloads are also known as "Smart Transactions", "Augmented Transactions", "Translytical", or "Hybrid Operational-Analytical Processing (HOAP)".

    Hybrid workloads are typically required for applications that require real-time analytics that lead to immediate action:

    • Financial institutions use transactional queries to handle financial transactions and analytical queries to analyze the transactions for business intelligence.

    • Insurance companies use transactional queries to accept/process claims and analytical queries to analyze those claims for business opportunities or risks.

    • Health providers use transactional queries to track electronic health records (EHR) and analytical queries to analyze the EHRs to discover health trends or prevent adverse drug interactions.

    MariaDB Enterprise Server provides multiple components to perform hybrid workloads:

    • For analytical queries, the Enterprise ColumnStore storage engine can be used.

    • For transactional queries, row-based storage engines, such as InnoDB, can be used.

    • For queries that reference both analytical and transactional data, ES's cross-engine join functionality can be used to join Enterprise ColumnStore tables with InnoDB tables.

• MariaDB MaxScale is a high-performance database proxy that can dynamically route analytical queries to Enterprise ColumnStore and transactional queries to the transactional storage engine.

    Storage Options

MariaDB Enterprise ColumnStore supports multiple storage types:

• S3-Compatible Object Storage: Optional but recommended. Enterprise ColumnStore can use S3-compatible object storage to store data. With multi-node Enterprise ColumnStore, the Storage Manager directory should use shared local storage for high availability.

• Shared Local Storage: Required for multi-node Enterprise ColumnStore with high availability. Enterprise ColumnStore can use shared local storage to store data and metadata. If S3-compatible storage is used for data, the shared local storage is only used for the Storage Manager directory.

• Non-Shared Local Storage: Appropriate for single-node Enterprise ColumnStore. Enterprise ColumnStore can use non-shared local storage to store data and metadata.

    S3-Compatible Object Storage

    MariaDB Enterprise ColumnStore supports S3-compatible object storage.

S3-compatible object storage is optional, but highly recommended. If S3-compatible object storage is used, Enterprise ColumnStore requires the Storage Manager directory to use shared local storage (such as NFS) for high availability.

    S3-compatible object storage is:

    • Compatible: Many object storage services are compatible with the Amazon S3 API.

    • Economical: S3-compatible object storage is often very low cost.

    • Flexible: S3-compatible object storage is available for both cloud and on-premises deployments.

• Limitless: S3-compatible object storage is often virtually limitless.

• Resilient: S3-compatible object storage is often low maintenance and highly available, since many services use resilient cloud infrastructure.

• Scalable: S3-compatible object storage is often highly optimized for read and write scaling.

• Secure: S3-compatible object storage is often encrypted-at-rest.

    Many S3-compatible object storage services exist. MariaDB Corporation cannot make guarantees about all S3-compatible object storage services, because different services provide different functionality.

    If you have any questions about using specific S3-compatible object storage with MariaDB Enterprise ColumnStore, contact us.

    S3 API

    MariaDB Enterprise ColumnStore can use any object store that is compatible with the Amazon S3 API.

    Many object storage services are compatible with the Amazon S3 API, and compatible object storage services are available for cloud deployments and on-premises deployments, so vendor lock-in is not a concern.

    Storage Manager

    MariaDB Enterprise ColumnStore's Storage Manager enables remote S3-compatible object storage to be efficiently used. The Storage Manager uses a persistent local disk cache for read/write operations, so that network latency has minimal performance impact on Enterprise ColumnStore. In some cases, it will even perform better than local disk operations.

    Enterprise ColumnStore only uses the Storage Manager when S3-compatible storage is used for data.

Storage Manager is configured using storagemanager.cnf.

    Storage Manager Directory

    MariaDB Enterprise ColumnStore's Storage Manager directory is at the following path by default:

    /var/lib/columnstore/storagemanager

To enable high availability when S3-compatible object storage is used, the Storage Manager directory should use shared local storage and be mounted on every ColumnStore node.

    Configure the S3 Storage Manager

    When you want to use S3-compatible storage for Enterprise ColumnStore, you must configure Enterprise ColumnStore's S3 Storage Manager to use S3-compatible storage.

    To configure Enterprise ColumnStore to use S3-compatible storage, edit /etc/columnstore/storagemanager.cnf:

[ObjectStorage]
…
service = S3
…
[S3]
region = your_columnstore_bucket_region
bucket = your_columnstore_bucket_name
endpoint = your_s3_endpoint
aws_access_key_id = your_s3_access_key_id
aws_secret_access_key = your_s3_secret_key
# iam_role_name = your_iam_role
# sts_region = your_sts_region
# sts_endpoint = your_sts_endpoint
# ec2_iam_mode=enabled
# port_number = your_port_number

[Cache]
cache_size = your_local_cache_size
path = your_local_cache_path

The S3-compatible object storage options are configured under [S3]:

    • The bucket option must be set to the name of the bucket.

    • The endpoint option must be set to the endpoint for the S3-compatible object storage.

• The aws_access_key_id and aws_secret_access_key options must be set to the access key ID and secret access key for the S3-compatible object storage.

• To use a specific IAM role, you must uncomment and set iam_role_name, sts_region, and sts_endpoint.

• To use the IAM role assigned to an EC2 instance, you must uncomment ec2_iam_mode=enabled.

• To use a non-default port number, you must set port_number to the desired port.

    The local cache options are configured under [Cache]:

    • The cache_size option is set to 2 GB by default.

    • The path option is set to /var/lib/columnstore/storagemanager/cache by default.

    Ensure that the specified path has sufficient storage space for the specified cache size.

    Shared Local Storage

    MariaDB Enterprise ColumnStore can use shared local storage.

Shared local storage is required for high availability. The specific requirements depend on whether Enterprise ColumnStore is configured to use S3-compatible object storage:

• When S3-compatible object storage is used, Enterprise ColumnStore requires the Storage Manager directory to use shared local storage for high availability.

• When S3-compatible object storage is not used, Enterprise ColumnStore requires the DB Root directories to use shared local storage for high availability.

    The most common shared local storage options for on-premises and cloud deployments are:

    • NFS (Network File System)

    • GlusterFS

    The most common shared local storage options for AWS (Amazon Web Services) deployments are:

    • EBS (Elastic Block Store) Multi-Attach

    • EFS (Elastic File System)

    The most common shared local storage option for GCP (Google Cloud Platform) deployments is:

    • Filestore

Shared Local Storage Options

The most common options for shared local storage are:

• EBS (Elastic Block Store) Multi-Attach: EBS is a high-performance block storage service for AWS (Amazon Web Services). EBS Multi-Attach allows an EBS volume to be attached to multiple instances in AWS; only clustered file systems, such as GFS2, are supported. For deployments in AWS, EBS Multi-Attach is a recommended option for the Storage Manager directory, and Amazon S3 storage is the recommended option for data.

• EFS (Elastic File System): EFS is a scalable, elastic, cloud-native NFS file system for AWS (Amazon Web Services). For deployments in AWS, EFS is a recommended option for the Storage Manager directory, and Amazon S3 storage is the recommended option for data.

• Filestore: Filestore is high-performance, fully managed storage for GCP (Google Cloud Platform). For deployments in GCP, Filestore is the recommended option for the Storage Manager directory, and Google Object Storage (S3-compatible) is the recommended option for data.

• NFS (Network File System): NFS is a distributed file system. If NFS is used, the storage should be mounted with the sync option to ensure that each node flushes its changes immediately. For on-premises deployments, NFS is the recommended option for the Storage Manager directory, and any S3-compatible storage is the recommended option for data.

• GlusterFS: GlusterFS is a distributed file system that supports replication and failover.

Directories Requiring Shared Local Storage for HA

Multi-node MariaDB Enterprise ColumnStore requires some directories to use shared local storage for high availability. The specific requirements depend on whether MariaDB Enterprise ColumnStore is configured to use S3-compatible object storage:

• Using S3-compatible object storage: the Storage Manager directory must use shared local storage.

• Not using S3-compatible object storage: the DB Root directories must use shared local storage.

Recommended Storage Options

For best results, MariaDB Corporation recommends the following storage options:

• AWS: Amazon S3 storage for data, with EBS Multi-Attach or EFS for the Storage Manager directory.

• GCP: Google Object Storage (S3-compatible) for data, with Filestore for the Storage Manager directory.

• On-premises: Any S3-compatible object storage for data, with NFS for the Storage Manager directory.

    Storage Format

    MariaDB Enterprise ColumnStore's storage format is optimized for analytical queries.

    DB Root Directories

    MariaDB Enterprise ColumnStore stores data in DB Root directories when S3-compatible object storage is not configured.

    In a multi-node Enterprise ColumnStore, each node has its own DB Root directory.

    The DB Root directories are at the following path by default:

    • /var/lib/columnstore/dataN

    The N in dataN represents a range of integers that starts at 1 and stops at the number of nodes in the deployment. For example, with a 3-node Enterprise ColumnStore deployment, this would refer to the following directories:

    • /var/lib/columnstore/data1

    • /var/lib/columnstore/data2

    • /var/lib/columnstore/data3

To enable high availability for the DB Root directories, each directory should be mounted on every ColumnStore node using shared local storage.

    Extents

    Each column in a table is stored in units called extents.

By default, each extent contains the column values for 8 million rows. The physical size of each extent can range from 8 MB to 64 MB; for example, an extent holding 8 million 1-byte values occupies roughly 8 MB, while one holding 8 million 8-byte values occupies roughly 64 MB. When an extent reaches the maximum number of column values, Enterprise ColumnStore creates a new extent.

    Each extent is stored in 8 KB blocks, and each block has a logical block identifier (LBID).

    If a string column is longer than 8 characters, the value is stored in a separate dictionary file, and a pointer to the value is stored in the extent.

    Segment Files

    A segment file is used to store Enterprise ColumnStore data within a DB Root directory.

By default, a segment file contains two extents, but you can configure Enterprise ColumnStore to store more extents in each segment file. When a segment file reaches its maximum size, Enterprise ColumnStore creates a new segment file.

The relevant configuration option is:

• ExtentsPerSegmentFile: Configures the maximum number of extents that can be stored in each segment file. Default value is 2.

For example, to configure Enterprise ColumnStore to store more extents in each segment file using the mcsSetConfig utility:

$ mcsSetConfig ExtentMap ExtentsPerSegmentFile 4

    Column Partitions

    Enterprise ColumnStore automatically groups a column's segment files into column partitions.

    On disk, each column partition is represented by a directory in the DB Root. The directory contains the segment files for the column partition.

    By default, a column partition can contain four segment files, but you can configure Enterprise ColumnStore to store more segment files in each column partition. When a column partition reaches the maximum number of segment files, Enterprise ColumnStore creates a new column partition.

The relevant configuration option is:

• FilesPerColumnPartition: Configures the maximum number of segment files that can be stored in each column partition. Default value is 4.

For example, to configure Enterprise ColumnStore to store more segment files in each column partition using the mcsSetConfig utility:

$ mcsSetConfig ExtentMap FilesPerColumnPartition 8

    Extent Map

    Enterprise ColumnStore maintains an Extent Map to determine which values are located in each extent.

    The Extent Map identifies each extent using its logical block identifier (LBID) values, and it maintains the minimum and maximum values within each extent.

The Extent Map is used to implement a performance optimization called extent elimination.

    The primary node has a master copy of the Extent Map. When Enterprise ColumnStore is started, the primary node copies the Extent Map to the replica nodes.

    While Enterprise ColumnStore is running, each node maintains a copy of the Extent Map in its main memory, so that it can be accessed quickly without additional I/O.

    If the Extent Map gets corrupted, the mcsRebuildEM utility can rebuild the Extent Map from the contents of the database file system. The mcsRebuildEM utility is available starting in MariaDB Enterprise ColumnStore 6.2.2.

    Compression

    Enterprise ColumnStore automatically compresses all data on disk using either Snappy or LZ4 compression. See the columnstore_compression_type system variable for how to select the desired compression type.

    Since Enterprise ColumnStore stores a single column's data in each segment file, the data in each segment file tends to be very similar. Similar data usually allows for excellent compressibility. However, the specific data compression ratio will depend on a lot of factors, such as the randomness of the data and the number of distinct values.

    Enterprise ColumnStore's compression strategy is tuned to optimize the performance of I/O-bound queries, because the decompression rate is optimized to maximize read performance.

    Version Buffer

    Enterprise ColumnStore uses the version buffer to store blocks that are being modified.

    The version buffer is used for multiple tasks:

    • It is used to roll back a transaction.

    • It is used for multi-version concurrency control (MVCC). With MVCC, Enterprise ColumnStore can implement read snapshots, which allows a statement to have a consistent view of the database, even if some of the underlying rows have changed. The snapshot for a given statement is identified by the system change number (SCN).

    The version buffer is split between data structures that are in-memory and on-disk.

The in-memory data structures are hash tables that keep track of in-flight transactions. The hash tables store the LBIDs for each block that is being modified by a transaction. The in-memory hash tables start at 4 MB, and they grow as needed. The size of the hash tables increases as the number of modified blocks increases.

An on-disk version buffer file is stored in each DB Root. By default, the on-disk version buffer file is 1 GB, but you can configure Enterprise ColumnStore to use a different file size. The relevant configuration option is:

• VersionBufferFileSize: Configures the size of the on-disk version buffer file in each DB Root. Default value is 1 GB.

For example, to configure Enterprise ColumnStore to use a larger on-disk version buffer file using the mcsSetConfig utility:

$ mcsSetConfig VersionBuffer VersionBufferFileSize 2GB

    Extent Elimination

Using the Extent Map, ColumnStore can perform logical range partitioning and retrieve only the blocks needed to satisfy the query. This is done through extent elimination, the process of eliminating extents that cannot match the query's join and filter conditions, which reduces overall I/O.

During extent elimination, ColumnStore scans the columns referenced in join and filter conditions. It extracts the logical horizontal partitioning information of each extent, along with the minimum and maximum values for the column, to eliminate extents. When a column scan involves a filter, that filter is compared to the minimum and maximum values stored in each extent for the column. If the filter value falls outside the extent's minimum and maximum value range, ColumnStore eliminates the extent.

This behavior is automatic and well suited for series, ordered, patterned, and time-based data, where the data is loaded frequently and often referenced by time. Any column with clustered values is a good candidate for extent elimination.
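For example, assuming a hypothetical orders table with an order_date column, a time-range filter lets ColumnStore skip every extent whose stored minimum and maximum order_date values fall outside the range:

SELECT COUNT(*)
FROM orders
WHERE order_date BETWEEN '2023-01-01' AND '2023-01-31';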




Step 8: Test MariaDB MaxScale

    Overview

    This page details step 8 of the 9-step procedure "Deploy ColumnStore Shared Local Storage Topology".

    This step tests MariaDB MaxScale 22.08.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.


    Check Global Configuration

Use the maxctrl show maxscale command to view the global MaxScale configuration.

    This action is performed on the MaxScale node:
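For example:

$ maxctrl show maxscale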

    Output should align to the global MaxScale configuration in the new configuration file you created.

    Check Server Configuration

    Use the maxctrl list servers and maxctrl show server commands to view the configured server objects.

    This action is performed on the MaxScale node:

1. Obtain the full list of server objects:

2. For each server object, view its configuration:
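For example, assuming a hypothetical server object named mcs1:

$ maxctrl list servers
$ maxctrl show server mcs1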

    Output should align to the Server Object configuration you performed.

    Check Monitor Configuration

    Use the maxctrl list monitors and maxctrl show monitor commands to view the configured monitors.

    This action is performed on the MaxScale node:

1. Obtain the full list of monitors:

2. For each monitor, view the monitor configuration:
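For example, assuming a hypothetical monitor named mcs_monitor:

$ maxctrl list monitors
$ maxctrl show monitor mcs_monitor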

    Output should align to the MariaDB Monitor (mariadbmon) configuration you performed.

    Check Service Configuration

    Use the maxctrl list services and maxctrl show service commands to view the configured routing services.

    This action is performed on the MaxScale node:

1. Obtain the full list of routing services:

2. For each service, view the service configuration:
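For example, assuming a hypothetical service named mcs_service:

$ maxctrl list services
$ maxctrl show service mcs_service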

    Output should align to the Read Connection Router (readconnroute) or Read/Write Split Router (readwritesplit) configuration you performed.

    Test Application User

    Applications should use a dedicated user account. The user account must be created on the primary server.

    When users connect to MaxScale, MaxScale authenticates the user connection before routing it to an Enterprise Server node. Enterprise Server authenticates the connection as originating from the IP address of the MaxScale node.

    The application users must have one user account with the host IP address of the application server and a second user account with the host IP address of the MaxScale node.

    The requirement of a duplicate user account can be avoided by enabling the proxy_protocol parameter for MaxScale and the proxy_protocol_networks for Enterprise Server.

    Create a User to Connect from MaxScale

    This action is performed on the primary Enterprise ColumnStore node:

1. Connect to the primary Enterprise ColumnStore node:

2. Create the database user account for your MaxScale node:

Replace 192.0.2.10 with the relevant IP address specification for your MaxScale node.

Passwords should meet your organization's password policies.

3. Grant the privileges required by your application to the database user account for your MaxScale node (a sketch follows this procedure):

    The privileges shown are designed to allow the tests in the subsequent sections to work. The user account for your production application may require different privileges.
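A minimal sketch, assuming a hypothetical user named app_user and a test database used by the tests below; the account for the application server in the next section is created the same way with its own host IP:

CREATE USER 'app_user'@'192.0.2.10' IDENTIFIED BY 'app_user_passwd';
GRANT ALL PRIVILEGES ON test.* TO 'app_user'@'192.0.2.10';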

    Create a User to Connect from the Application Server

    This action is performed on the primary Enterprise ColumnStore node:

1. Create the database user account for your application server:

Replace 192.0.2.11 with the relevant IP address specification for your application server.

Passwords should meet your organization's password policies.

2. Grant the privileges required by your application to the database user account for your application server:

    The privileges shown are designed to allow the tests in the subsequent sections to work. The user account for your production application may require different privileges.

    Test Connection with Application User

    To test the connection, use the MariaDB Client from your application server to connect to an Enterprise ColumnStore node through MaxScale.

    This action is performed on a client connected to the MaxScale node:
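For example, assuming a hypothetical MaxScale address of 192.0.2.12 and a listener on port 3307:

$ mariadb --host 192.0.2.12 --port 3307 --user app_user --password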

    Test Connection with Read Connection Router

    If you configured the Read Connection Router, confirm that MaxScale routes connections to the replica servers.

1. On the MaxScale node, use the maxctrl list listeners command to view the available listeners and ports:

2. Open multiple terminals connected to your application server. In each, use MariaDB Client to connect to the listener port for the Read Connection Router (in the example, 3308):

Use the application user credentials you created for the --user and --password options.

3. In each terminal, query the hostname and server_id values to identify the server to which you're connected:
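For example:

SELECT @@hostname, @@server_id;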

    Different terminals should return different values since MaxScale routes the connections to different nodes.

    Since the router was configured with the slave router option, the Read Connection Router only routes connections to replica servers.

    Test Write Queries with Read/Write Split Router

    If you configured the Read/Write Split Router, confirm that MaxScale routes write queries on this router to the primary Enterprise ColumnStore node.

1. On the MaxScale node, use the maxctrl list listeners command to view the available listeners and ports:

2. Open multiple terminals connected to your application server. In each, use MariaDB Client to connect to the listener port for the Read/Write Split Router (in the example, 3307):

Use the application user credentials you created for the --user and --password options.

3. In one terminal, create the test table:

4. In each terminal, issue an INSERT statement to add a row to the example table with the values of the hostname and server_id system variables:

5. In one terminal, issue a SELECT statement to query the results:

While MaxScale is handling multiple connections from different terminals, it routes all connections to the current primary Enterprise ColumnStore node, which in the example is mcs1. A minimal sketch of these statements appears after this section.
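A minimal sketch of the statements used in this test, assuming a hypothetical test.t1 table; the names and types are illustrative only:

CREATE TABLE test.t1 (hostname VARCHAR(64), server_id INT) ENGINE=InnoDB;
INSERT INTO test.t1 VALUES (@@hostname, @@server_id);
SELECT * FROM test.t1;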

    Test Read Queries with Read/Write Split Router

    If you configured the Read/Write Split Router (readwritesplit), confirm that MaxScale routes read queries on this router to replica servers.

1. On the MaxScale node, use the maxctrl list listeners command to view the available listeners and ports:

2. In a terminal connected to your application server, use MariaDB Client to connect to the listener port for the Read/Write Split Router (readwritesplit) (in the example, 3307):

Use the application user credentials you created for the --user and --password options.

3. Query the hostname and server_id to identify which server MaxScale routed you to.

4. Resend the query:

    Confirm that MaxScale routes the SELECT statements to different replica servers.

For more information on different routing criteria, see slave_selection_criteria.
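For example, a sketch of a readwritesplit service section in maxscale.cnf; the service name, server list, and credentials are hypothetical:

[Query-Router-Service]
type=service
router=readwritesplit
servers=mcs1,mcs2,mcs3
user=mxs_user
password=mxs_passwd
slave_selection_criteria=ADAPTIVE_ROUTING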

    Next Step

    "Deploy ColumnStore Shared Local Storage Topology".

    This page was step 8 of 9.

    Next: Step 9: Import Data.

Step 4: Start and Configure MariaDB Enterprise Server

Overview

This page details step 4 of the 9-step procedure "Deploy ColumnStore Shared Local Storage Topology".

This step starts and configures MariaDB Enterprise Server and MariaDB Enterprise ColumnStore 23.10.

Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

Stop the Enterprise ColumnStore Services

    The installation process might have started some of the ColumnStore services. The services should be stopped prior to making configuration changes.

1. On each Enterprise ColumnStore node, stop the MariaDB Enterprise Server service:

2. On each Enterprise ColumnStore node, stop the MariaDB Enterprise ColumnStore service:

3. On each Enterprise ColumnStore node, stop the CMAPI service:

    Configure Enterprise ColumnStore

    On each Enterprise ColumnStore node, configure Enterprise Server.

Mandatory system variables and options for ColumnStore Shared Local Storage include:

• character_set_server: Set this system variable to utf8.

• collation_server: Set this system variable to utf8_general_ci.

• columnstore_use_import_for_batchinsert: Set this system variable to ALWAYS to always use cpimport for LOAD DATA INFILE and INSERT...SELECT statements.

• gtid_strict_mode: Set this system variable to ON.

• log_bin: Set this option to the file you want to use for the Binary Log. Setting this option enables binary logging.

• log_bin_index: Set this option to the file you want to use to track binlog filenames.

Example Configuration
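A minimal example, assuming hypothetical log paths and a server_id of 1 (each node needs a unique server_id for replication):

[mariadb]
character_set_server = utf8
collation_server = utf8_general_ci
columnstore_use_import_for_batchinsert = ALWAYS
gtid_strict_mode = ON
log_bin = /var/lib/mysql/mariadb-bin
log_bin_index = /var/lib/mysql/mariadb-bin.index
server_id = 1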

    Start the Enterprise ColumnStore Services

1. On each Enterprise ColumnStore node, start and enable the MariaDB Enterprise Server service, so that it starts automatically upon reboot:

2. On each Enterprise ColumnStore node, stop the MariaDB Enterprise ColumnStore service:

After the CMAPI service is installed in the next step, CMAPI will start the Enterprise ColumnStore service as needed on each node. CMAPI disables the Enterprise ColumnStore service to prevent systemd from automatically starting Enterprise ColumnStore upon reboot.

3. On each Enterprise ColumnStore node, start and enable the CMAPI service, so that it starts automatically upon reboot:

    For additional information, see "Start and Stop Services".

    Create User Accounts

    The ColumnStore Object Storage topology requires several user accounts. Each user account should be created on the primary server, so that it is replicated to the replica servers.

    Create the Utility User

    Enterprise ColumnStore requires a mandatory utility user account to perform cross-engine joins and similar operations.

1. On the primary server, create the user account with the CREATE USER statement:

2. On the primary server, grant the user account SELECT privileges on all databases with the GRANT statement:

3. On each Enterprise ColumnStore node, configure the ColumnStore utility user:

4. On each Enterprise ColumnStore node, set the password (a sketch of these steps follows below):

    For details about how to encrypt the password, see "Credentials Management for MariaDB Enterprise ColumnStore".

    Passwords should meet your organization's password policies. If your MariaDB Enterprise Server instance has a password validation plugin installed, then the password should also meet the configured requirements.
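A minimal sketch, assuming a hypothetical utility user named cross_engine; the CrossEngineSupport configuration keys are set with the mcsSetConfig utility:

CREATE USER 'cross_engine'@'127.0.0.1' IDENTIFIED BY 'cross_engine_passwd';
GRANT SELECT ON *.* TO 'cross_engine'@'127.0.0.1';

$ mcsSetConfig CrossEngineSupport User cross_engine
$ mcsSetConfig CrossEngineSupport Password cross_engine_passwd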

    Create the Replication User

    ColumnStore Object Storage uses MariaDB Replication to replicate writes between the primary and replica servers. As MaxScale can promote a replica server to become a new primary in the event of node failure, all nodes must have a replication user.

This action is performed on the primary server.

Create the replication user and grant it the required privileges:

1. Use the CREATE USER statement to create the replication user.

Replace the referenced IP address with the relevant address for your environment.

Ensure that the user account can connect to the primary server from each replica.

2. Grant the user account the required privileges with the GRANT statement (see the sketch below).
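A minimal sketch, assuming a hypothetical user named repl that can connect from the cluster subnet:

CREATE USER 'repl'@'192.0.2.%' IDENTIFIED BY 'repl_passwd';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'192.0.2.%';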

    Create MaxScale User

    ColumnStore Object Storage 23.10 uses MariaDB MaxScale 22.08 to load balance between the nodes.

    This action is performed on the primary server.

1. Use the CREATE USER statement to create the MaxScale user:

Replace the referenced IP address with the relevant address for your environment.

Ensure that the user account can connect from the IP address of the MaxScale instance.

2. Use the GRANT statement to grant the privileges required by the router:

3. Use the GRANT statement to grant the privileges required by the MariaDB Monitor.

    Configure MariaDB Replication

    On each replica server, configure MariaDB Replication:

1. Use the CHANGE MASTER TO statement to configure the connection to the primary server:

2. Start replication using the START REPLICA statement:

3. Confirm that replication is working using the SHOW REPLICA STATUS statement:

4. Ensure that the replica server cannot accept local writes by setting the read_only system variable to ON using the SET GLOBAL statement. A minimal sketch of these statements appears below:
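A minimal sketch, assuming the primary at hypothetical address 192.0.2.1 and the repl user created earlier:

CHANGE MASTER TO
   MASTER_HOST='192.0.2.1',
   MASTER_USER='repl',
   MASTER_PASSWORD='repl_passwd',
   MASTER_USE_GTID=slave_pos;
START REPLICA;
SHOW REPLICA STATUS\G
SET GLOBAL read_only=ON;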

    Initiate the Primary Server with CMAPI

    Initiate the primary server using CMAPI.

    1. Create an API key for the cluster. This API key should be stored securely and kept confidential, because it can be used to add cluster nodes to the multi-node Enterprise ColumnStore deployment.

    For example, to create a random 256-bit API key using openssl rand:
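$ openssl rand -hex 32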

    This document will use the following API key in further examples, but users should create their own:

2. Use CMAPI to add the primary server to the cluster and set the API key. The new API key needs to be provided as part of the X-API-key HTTP header.

For example, if the primary server's host name is mcs1 and its IP address is 192.0.2.1, use the following command:
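A sketch, assuming CMAPI's default port 8640 and API version 0.4.0; substitute your own API key:

$ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/node \
   --header 'Content-Type:application/json' \
   --header 'x-api-key:your_api_key' \
   --data '{"timeout": 120, "node": "192.0.2.1"}'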

3. Use CMAPI to check the status of the cluster node:
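A sketch, under the same assumptions as above:

$ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
   --header 'x-api-key:your_api_key'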

    Add Replica Servers with CMAPI

    Add the replica servers with CMAPI:

1. For each replica server, use CMAPI to add the replica server to the cluster. The previously set API key needs to be provided as part of the X-API-key HTTP header.

For example, if the primary server's host name is mcs1 and the replica server's IP address is 192.0.2.2, use the following command:

2. After all replica servers have been added, use CMAPI to confirm that all cluster nodes have been successfully added:

    Configure Linux Security Modules (LSM)

    The specific steps to configure the security module depend on the operating system.

    Configure SELinux (CentOS, RHEL)

    Configure SELinux for Enterprise ColumnStore:

    1. To configure SELinux, you have to install the packages required for audit2allow. On CentOS 7 and RHEL 7, install the following:

    On RHEL 8, install the following:

2. Allow the system to run under load for a while to generate SELinux audit events.

3. After the system has taken some load, generate an SELinux policy from the audit events using audit2allow:
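A sketch, assuming a hypothetical module name; audit2allow -a reads the audit log and -M builds a loadable policy module:

$ audit2allow -a -M mariadb_columnstore_local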

    If no audit events were found, this will print the following:

4. If audit events were found, the new SELinux policy can be loaded using semodule:
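Continuing the sketch above:

$ semodule -i mariadb_columnstore_local.pp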

5. Set SELinux to enforcing mode:
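For example:

$ setenforce 1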

6. Make the change persistent across reboots by setting SELINUX=enforcing in /etc/selinux/config.

    For example, the file will usually look like this after the change:
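SELINUX=enforcing
SELINUXTYPE=targeted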

7. Confirm that SELinux is in enforcing mode:
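For example:

$ getenforce
Enforcing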

    Configure AppArmor (Ubuntu)

    For information on how to create a profile, see How to create an AppArmor Profile on Ubuntu.com.

    Configure Firewalls

    The specific steps to configure the firewall service depend on the platform.

    Configure firewalld (CentOS, RHEL)

    Configure firewalld for Enterprise Cluster on CentOS and RHEL:

    1. Check if the firewalld service is running:

2. If the firewalld service was stopped to perform the installation, start it now:

3. Open up the relevant ports using firewall-cmd. For example, if your cluster nodes are in the 192.0.2.0/24 subnet:
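A sketch, assuming the ports commonly used by Enterprise ColumnStore deployments (3306 for MariaDB clients and replication, 8600-8630 for inter-node communication, and 8640 for CMAPI); confirm the port list against your deployment:

$ firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="192.0.2.0/24" port port="3306" protocol="tcp" accept'
$ firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="192.0.2.0/24" port port="8600-8630" protocol="tcp" accept'
$ firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="192.0.2.0/24" port port="8640" protocol="tcp" accept'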

4. Reload the runtime configuration:
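$ firewall-cmd --reload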

    Configure UFW (Ubuntu)

    Configure UFW for Enterprise ColumnStore on Ubuntu:

    1. Check if the UFW service is running:

2. If the UFW service was stopped to perform the installation, start it now:

3. Open up the relevant ports using ufw. For example, if your cluster nodes are in the 192.0.2.0/24 subnet in the range 192.0.2.1 - 192.0.2.3:
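A sketch, under the same port assumptions as the firewalld example above:

$ ufw allow proto tcp from 192.0.2.0/24 to any port 3306
$ ufw allow proto tcp from 192.0.2.0/24 to any port 8600:8630
$ ufw allow proto tcp from 192.0.2.0/24 to any port 8640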

4. Reload the runtime configuration:
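$ ufw reload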

    Next Step

    Navigation in the procedure "Deploy ColumnStore Shared Local Storage Topology".

    This page was step 4 of 9.

    Next: Step 5: Test MariaDB Enterprise Server.

    ColumnStore Architectural Overview

    MariaDB ColumnStore enhances MariaDB Enterprise Server with a columnar engine for OLAP and HTAP workloads, using MPP for scalability. It supports cross-engine JOINs, integrates with S3 storage, and provides high-speed bulk loading with multi-node management via REST API.

MariaDB ColumnStore is a columnar storage engine designed for distributed massively parallel processing (MPP), such as for big data analysis. Deployments can be composed of several MariaDB servers or just one, each running multiple subprocesses that work together to provide linear scalability and exceptional performance with real-time response to analytical queries.

    It provides a highly available, fault tolerant, and performant columnar storage engine for MariaDB Enterprise Server. MariaDB Enterprise ColumnStore is designed for data warehousing, decision support systems (DSS), online analytical processing (OLAP), and hybrid transactional-analytical processing (HTAP).

    Benefits

    • Columnar storage engine that enables MariaDB Enterprise Server to perform new workloads

    • Optimized for online analytical process (OLAP) workloads including data warehousing, decision support systems, and business intelligence

    • Single-stack solution for hybrid transactional-analytical workloads to eliminate barriers and prevent data silos

• Implements cross-engine JOINs to join Enterprise ColumnStore tables with tables using row-based storage engines, such as InnoDB

    • Smart storage engine that plans and optimizes its own queries using a custom select handler

    • Scalable query execution using massively parallel processing (MPP) strategies, parallel query execution, and distributed function evaluation

    • S3-compatible object storage can be used for highly available, low-cost, multi-regional, resilient, scalable, secure, and virtually limitless data storage

• High availability and automatic failover by leveraging MariaDB MaxScale

    • REST API for multi-node administration with the Cluster Management API (CMAPI) server

    • Connectors for popular BI platforms such as Microsoft Power BI and Tableau

    • High-speed bulk data loading that bypasses the SQL layer and does not block concurrent read-only queries

    Topologies

    MariaDB Enterprise ColumnStore supports multiple topologies. Several options are described below. MariaDB Enterprise ColumnStore can be deployed in other topologies. The topologies on this page are representative of basic product capabilities.

    MariaDB products can be deployed to form other topologies that leverage advanced product capabilities and combine the capabilities of multiple topologies.

    Enterprise ColumnStore with Object Storage

    The MariaDB Enterprise ColumnStore topology with Object Storage delivers production analytics with high availability, fault tolerance, and limitless data storage by leveraging S3-compatible storage.

    The topology consists of:

    • One or more MaxScale nodes

    • An odd number of ColumnStore nodes (minimum of 3) running ES, Enterprise ColumnStore, and CMAPI

    The MaxScale nodes:

    • Monitor the health and availability of each ColumnStore node using the MariaDB Monitor (mariadbmon)

    • Accept client and application connections

    • Route queries to ColumnStore nodes using the Read/Write Split Router (readwritesplit)

    The ColumnStore nodes:

    • Receive queries from MaxScale

    • Execute queries

• Use S3-compatible object storage for data

• Use shared local storage for the Storage Manager directory.

    Enterprise ColumnStore with Shared Local Storage

    The MariaDB Enterprise ColumnStore topology with Shared Local Storage delivers production analytics with high availability and fault tolerance by leveraging shared local storage, such as NFS.

    The topology consists of:

    • One or more MaxScale nodes

    • An odd number of ColumnStore nodes (minimum of 3) running ES, Enterprise ColumnStore, and CMAPI

    The MaxScale nodes:

    • Monitor the health and availability of each ColumnStore node using the MariaDB Monitor (mariadbmon)

    • Accept client and application connections

    • Route queries to ColumnStore nodes using the Read/Write Split Router (readwritesplit)

    The ColumnStore nodes:

    • Receive queries from MaxScale

    • Execute queries

• Use shared local storage for the DB Root directories.

    Software Architecture

MariaDB Enterprise ColumnStore deployments are composed of the following software components:

    MariaDB Enterprise ColumnStore

    MariaDB Enterprise ColumnStore is the columnar storage engine that handles data storage and query optimization/execution.

MariaDB Enterprise ColumnStore is a columnar storage engine that is optimized for analytical or online analytical processing (OLAP) workloads, data warehousing, and DSS. MariaDB Enterprise ColumnStore can be used for hybrid transactional-analytical processing (HTAP) workloads when paired with a row-based storage engine, like InnoDB.

    MariaDB Enterprise Server

MariaDB Enterprise ColumnStore is built on top of MariaDB Enterprise Server. MariaDB Enterprise ColumnStore 5 is included with the standard MariaDB Enterprise Server 10.5 releases, while MariaDB Enterprise ColumnStore 6 is included with the standard MariaDB Enterprise Server 10.6 releases.

    Enterprise ColumnStore interfaces with the Enterprise Server SQL engine through the ColumnStore storage engine plugin.

    MariaDB has been continually improving the integration of MariaDB Enterprise ColumnStore with MariaDB Enterprise Server:

• MariaDB ColumnStore originally required special custom-built releases of MariaDB Server.

• MariaDB Enterprise ColumnStore 5 was the first release to replace the Operations/Administration/Maintenance (OAM) API with the more modern Cluster Management API (CMAPI), which is still in use.

• Starting with ES 10.5.6-4, MariaDB Enterprise ColumnStore is included with the standard MariaDB Enterprise Server 10.5 releases.

    ColumnStore Storage Engine Plugin

    MariaDB Enterprise ColumnStore integrates with MariaDB Enterprise Server using the ColumnStore storage engine plugin. The ColumnStore storage engine plugin enables MariaDB Enterprise Server to interact with ColumnStore tables.

    The ColumnStore storage engine plugin is a smart storage engine that implements a custom select handler to fully take advantage of Enterprise ColumnStore's capabilities, such as:

    • Using a custom query planner

    • Selecting data by column instead of by row

    • Parallel query evaluation

    • Distributed aggregations

    As a smart storage engine, the ColumnStore storage engine plugin tightly integrates Enterprise ColumnStore with ES, but it has enough independence to efficiently execute analytical queries using a completely unique approach.


    Cluster Management API (CMAPI) Server

    The server provides a REST API that can be used to configure and manage Enterprise ColumnStore.

    CMAPI must run on every ColumnStore node in a multi-node deployment but is not required in a single-node deployment.

    The REST API can be used to perform multiple operations:

    • Add ColumnStore nodes

    • Remove ColumnStore nodes

    • Start Enterprise ColumnStore

    • Shutdown Enterprise ColumnStore

    MariaDB MaxScale

    MariaDB Enterprise ColumnStore leverages as an advanced database proxy and query router.

Multi-node Enterprise ColumnStore deployments must have one or more MaxScale nodes. MaxScale performs many different roles:

• Routing write queries to the primary server

    • Load balancing read queries on replica servers

    • Monitoring node health

    • Performing automatic failover if a node fails

    Storage Architecture

    MariaDB Enterprise ColumnStore's storage architecture provides a columnar storage engine with high availability, fault tolerance, compression, and automatic partitioning for production analytics and data warehousing.


    Columnar Storage Engine

MariaDB Enterprise ColumnStore is a columnar storage engine for MariaDB Enterprise Server (ES). MariaDB Enterprise ColumnStore enables ES to perform analytical workloads, including online analytical processing (OLAP), data warehousing, decision support systems (DSS), and hybrid transactional-analytical processing (HTAP) workloads.

    Most traditional relational databases use row-based storage engines. In row-based storage engines, all columns for a table are stored contiguously. Row-based storage engines perform very well for transactional workloads but are less performant for analytical workloads.

    Columnar storage engines store each column separately. Columnar storage engines perform very well for analytical workloads. Analytical workloads are characterized by ad hoc queries on very large data sets by relatively few users.

    MariaDB Enterprise ColumnStore automatically partitions each column into extents, which helps improve query performance without using indexes.

    S3-Compatible Object Storage

    MariaDB Enterprise ColumnStore supports S3-compatible object storage.

S3-compatible object storage is optional, but highly recommended. If S3-compatible object storage is used, Enterprise ColumnStore requires the Storage Manager directory to use shared local storage (such as NFS) for high availability.

    S3-compatible object storage is:

    • Compatible: Many object storage services are compatible with the Amazon S3 API.

    • Economical: S3-compatible object storage is often very low cost.

    • Flexible: S3-compatible object storage is available for both cloud and on-premises deployments.

    • Limitless: S3-compatible object storage is often virtually limitless.

    Many S3-compatible object storage services exist. MariaDB Corporation cannot make guarantees about all S3-compatible object storage services, because different services provide different functionality.

    If you have any questions about using specific S3-compatible object storage with MariaDB Enterprise ColumnStore, contact us.

    Shared Local Storage

    MariaDB Enterprise ColumnStore can use shared local storage.

Shared local storage is required for high availability. The specific requirements depend on whether Enterprise ColumnStore is configured to use S3-compatible object storage:

• When S3-compatible object storage is used, Enterprise ColumnStore requires the Storage Manager directory to use shared local storage for high availability.

• When S3-compatible object storage is not used, Enterprise ColumnStore requires the DB Root directories to use shared local storage for high availability.

    The most common shared local storage options for on-premises and cloud deployments are:

    • NFS (Network File System)

    • GlusterFS

    The most common shared local storage options for AWS (Amazon Web Services) deployments are:

    • EBS (Elastic Block Store) Multi-Attach

    • EFS (Elastic File System)

    The most common shared local storage option for GCP (Google Cloud Platform) deployments is:

    • Filestore

    Query Evaluation Architecture

    MariaDB Enterprise ColumnStore uses distributed query execution and massively parallel processing (MPP) techniques to achieve vertical and horizontal scalability for production analytics and data warehousing.


    Extent Elimination

    MariaDB Enterprise ColumnStore uses extent elimination to scale query evaluation as the table size increases.

    Most databases are row-based, utilizing manually created indexes to achieve high performance on large tables. This works well for transactional workloads. However, analytical queries tend to have very low selectivity, so traditional indexes are not typically effective for analytical queries.

Enterprise ColumnStore uses extent elimination to achieve high performance, without requiring manually created indexes. Enterprise ColumnStore automatically partitions all data into extents. Enterprise ColumnStore stores the minimum and maximum values for each extent in the extent map. Enterprise ColumnStore uses the minimum and maximum values in the extent map to perform extent elimination.

    When Enterprise ColumnStore performs extent elimination, it compares the query's join conditions and filter conditions (i.e., WHERE clause) to the minimum and maximum values for each extent in the extent map. If the extent's minimum and maximum values fall outside the bounds of the query's conditions, Enterprise ColumnStore skips that extent for the query.

    Extent elimination is automatically performed for every query. It can significantly decrease I/O for columns with clustered values. For example, extent elimination works effectively for series, ordered, patterned, and time-based data.

    Custom Select Handler

    The ColumnStore storage engine plugin implements a custom select handler to fully take advantage of Enterprise ColumnStore's capabilities.

    All storage engines interact with ES using an internal handler API, which is highly extensible. Storage engines can implement different features by implementing different methods within the handler API.

For SELECT statements, the handler API transforms each query into a SELECT_LEX object, which is provided to the select handler.

    The generic select handler is not optimal for Enterprise ColumnStore, because:

• Enterprise ColumnStore selects data by column, but the generic select handler selects data by row.

    • Enterprise ColumnStore supports parallel query evaluation, but the generic select handler does not.

    • Enterprise ColumnStore supports distributed aggregations, but the generic select handler does not.

    • Enterprise ColumnStore supports distributed functions, but the generic select handler does not.

    Smart Storage Engine

The ColumnStore storage engine plugin is known as a smart storage engine, because it implements a custom select handler. Any storage engine that implements a custom select handler is known as a smart storage engine.

    As a smart storage engine, the ColumnStore storage engine plugin tightly integrates Enterprise ColumnStore with ES, but it has enough independence to efficiently execute analytical queries using a completely unique approach.

    Query Planning

The ColumnStore storage engine plugin is a smart storage engine, so MariaDB Enterprise ColumnStore is able to plan its own queries using a custom query planner.

    MariaDB Enterprise ColumnStore's query planning is divided into two steps:

1. ES provides the query's SELECT_LEX object to the custom select handler. The custom select handler builds a ColumnStore execution plan (CSEP).

2. The custom select handler provides the CSEP to the ExeMgr process on the same node. The ExeMgr process performs query optimization and creates a job list.

    Job Steps

    When Enterprise ColumnStore executes a query, the ExeMgr process on the initiator/aggregator node translates the ColumnStore execution plan (CSEP) into a job list. A job list is a sequence of job steps.

    Enterprise ColumnStore uses many different types of job steps that provide different scalability benefits:

    • Some types of job steps perform operations in a distributed manner using multiple nodes to operate on different extents. Distributed operations provide horizontal scalability.

    • Some types of job steps perform operations in a multi-threaded manner using a thread pool. Performing multi-threaded operations provides vertical scalability.

    As you increase the number of ColumnStore nodes or the number of cores on each node, Enterprise ColumnStore can use those resources to more efficiently execute job steps.

    High Availability and Failover

    MariaDB Enterprise ColumnStore leverages common technologies to provide highly available production analytics with automatic failover:


    Shared Local Storage

    MariaDB Enterprise ColumnStore can use shared local storage.

Shared local storage is required for high availability. The specific requirements depend on whether Enterprise ColumnStore is configured to use S3-compatible object storage:

• When S3-compatible object storage is used, Enterprise ColumnStore requires the Storage Manager directory to use shared local storage for high availability.

• When S3-compatible object storage is not used, Enterprise ColumnStore requires the DB Root directories to use shared local storage for high availability.

    The most common shared local storage options for on-premises and cloud deployments are:

    • NFS (Network File System)

    • GlusterFS

    The most common shared local storage options for AWS (Amazon Web Services) deployments are:

    • EBS (Elastic Block Store) Multi-Attach

    • EFS (Elastic File System)

    The most common shared local storage option for GCP (Google Cloud Platform) deployments is:

    • Filestore

    MariaDB Replication

MariaDB Enterprise ColumnStore requires MariaDB Replication to synchronize various database objects on multiple nodes for high availability.

    MariaDB replication synchronizes:

    • The schemas for all ColumnStore tables on all nodes

    • The schemas and data for all non-ColumnStore tables on all nodes

• All other database objects (stored procedures, stored functions, user accounts, and other objects) on all nodes

    MaxScale

MariaDB Enterprise ColumnStore requires MariaDB MaxScale to achieve high availability, automatic failover, and load balancing.

    MariaDB Monitor (mariadbmon) in MaxScale monitors the health of each Enterprise ColumnStore node.

    MaxScale provides load balancing by routing queries and/or connections to healthy nodes by:

    • Providing query-based routing using Read/Write Split Router (readwritesplit).

    • Providing connection-based routing using Read Connection Router (readconnroute).

When MaxScale's MariaDB Monitor detects that the primary node has failed, MariaDB Monitor performs automatic failover by:

    • Promoting a replica node to become the new primary node.

    • Re-configuring all replica nodes to replicate from the new primary node.

    Cluster Management API (CMAPI) Server

    MariaDB Enterprise ColumnStore requires the Cluster Management API (CMAPI) Server for high availability.

    The CMAPI server provides a REST API that can be used to manage and configure Enterprise ColumnStore.

    The CMAPI server has a role in automatic failover. After MaxScale performs automatic failover, the CMAPI server detects the topology change and automatically re-configures the roles of each Enterprise ColumnStore node.

    Data Loading

    MariaDB Enterprise ColumnStore performs bulk data loads very efficiently using a variety of mechanisms including the cpimport tool, specialized handling of certain SQL statements, and minimal locking during data import.

    For additional information, see "".

    cpimport

    MariaDB Enterprise ColumnStore includes a bulk data loading tool called cpimport, which provides several benefits:

    • Bypasses the SQL layer to decrease overhead

    • Does not block read queries

• Requires a write metadata lock (MDL) on the table, which can be monitored with the METADATA_LOCK_INFO plugin.

    • Appends the new data to the table. While the bulk load is in progress, the newly appended data is temporarily hidden from queries. After the bulk load is complete, the newly appended data is visible to queries.
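For example, a sketch of a bulk load with cpimport, assuming a hypothetical db1.orders table and a pipe-delimited data file:

$ cpimport db1 orders /tmp/orders.tbl -s '|'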

    Batch Insert Mode

    MariaDB Enterprise ColumnStore enables batch insert mode by default.

When batch insert mode is enabled, MariaDB Enterprise ColumnStore has special handling for the following statements:

• LOAD DATA INFILE

• INSERT ... SELECT

    Enterprise ColumnStore uses the following rules:

    • If the statement is executed outside of a transaction, Enterprise ColumnStore loads the data using cpimport, which is a command-line utility that is designed to efficiently load data in bulk. Enterprise ColumnStore executes cpimport using a wrapper called cpimport.bin.

    • If the statement is executed inside of a transaction, Enterprise ColumnStore loads the data using the DML interface, which is slower.

Batch insert mode can be disabled by setting the columnstore_use_import_for_batchinsert system variable to OFF. When batch insert mode is disabled, Enterprise ColumnStore executes the statements using the DML interface, which is slower.
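For example, a sketch that disables batch insert mode for the current session, assuming OFF is the desired value:

SET SESSION columnstore_use_import_for_batchinsert = OFF;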

    Locking

    MariaDB Enterprise ColumnStore requires a write metadata lock (MDL) on the table when a bulk data load is performed with cpimport.

    When a bulk data load is running:

    • Read queries will not be blocked.

    • Write queries and concurrent bulk data loads on the same table will be blocked until the bulk data load operation is complete, and the write metadata lock on the table has been released.

• The write metadata lock (MDL) can be monitored with the METADATA_LOCK_INFO plugin.
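
For example, assuming the METADATA_LOCK_INFO plugin is available, the lock can be observed with:

-- One-time setup: install the plugin that exposes metadata locks
INSTALL SONAME 'metadata_lock_info';

-- List current metadata locks, including cpimport's write MDL
SELECT * FROM information_schema.METADATA_LOCK_INFO;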

    Backup and Restore

    MariaDB Enterprise ColumnStore supports backup and restore using well-known tools and methods.

    Component
    Backup Methods

    For additional information, see "".

    S3-Compatible Object Storage

MariaDB Enterprise ColumnStore can leverage S3 snapshots to back up S3-compatible object storage when it is used for Enterprise ColumnStore's data.

    The S3-compatible object storage can be backed up by:

    1. Locking the database on the primary node

    2. Performing an S3 snapshot using the vendor's standard snapshot functionality
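
A minimal sketch of this sequence, assuming the snapshot is taken with your storage vendor's tooling while the lock is held:

FLUSH TABLES WITH READ LOCK;
-- perform the S3 snapshot with the vendor's tooling while this session stays open
UNLOCK TABLES;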

    Shared Local Storage

MariaDB Enterprise ColumnStore can leverage file system snapshots or file copy tools (such as rsync) to back up shared local storage when it is used for the Storage Manager directory or the DB Root directories.

    The shared local storage can be backed up by:

    1. Locking the database on the primary node

2. Performing a file system snapshot or using a file copy tool (such as rsync) to copy the contents of the Storage Manager directory or the DB Root directories.
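
For example, with rsync (the paths are illustrative and should match the directories configured for your deployment):

$ rsync -a /var/lib/columnstore/ backup-host:/backups/columnstore/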

    Enterprise Server Data Directory

MariaDB Enterprise ColumnStore can leverage the standard MariaDB Backup utility to back up the Enterprise Server data directory.

    The backup contains:

    • All ColumnStore schemas

    • All non-ColumnStore schemas and data

    • All other database objects

    It does not contain:

    • ColumnStore data
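
A minimal sketch of taking and preparing such a backup with MariaDB Backup (the target directory is illustrative):

$ sudo mariadb-backup --backup --target-dir=/backups/mariadb
$ sudo mariadb-backup --prepare --target-dir=/backups/mariadb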

Step 4: Start and Configure MariaDB Enterprise Server

    Overview

    This page details step 4 of the 9-step procedure "Deploy ColumnStore Object Storage Topology".

This step starts and configures MariaDB Enterprise Server and MariaDB Enterprise ColumnStore 23.10.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Stop the Enterprise ColumnStore Services

    The installation process might have started some of the ColumnStore services. The services should be stopped prior to making configuration changes.

1. On each Enterprise ColumnStore node, stop the MariaDB Enterprise Server service:

2. On each Enterprise ColumnStore node, stop the MariaDB Enterprise ColumnStore service:

3. On each Enterprise ColumnStore node, stop the CMAPI service:

    Configure Enterprise ColumnStore

    On each Enterprise ColumnStore node, configure Enterprise Server.


    Mandatory system variables and options for ColumnStore Object Storage include:

    Example Configuration

    Configure the S3 Storage Manager

    On each Enterprise ColumnStore node, configure S3 Storage Manager to use S3-compatible storage by editing the /etc/columnstore/storagemanager.cnf configuration file:

    The S3-compatible object storage options are configured under [S3]:

    • The bucket option must be set to the name of the bucket that you created in "Create an S3 Bucket".

    • The endpoint option must be set to the endpoint for the S3-compatible object storage.

    • The aws_access_key_id and aws_secret_access_key options must be set to the access key ID and secret access key for the S3-compatible object storage.

• To use a specific IAM role, you must uncomment and set the IAM-related options (iam_role_name, sts_region, and sts_endpoint).

    The local cache options are configured under [Cache]:

    • The cache_size option is set to 2 GB by default.

    • The path option is set to /var/lib/columnstore/storagemanager/cache by default.

    Ensure that the specified path has sufficient storage space for the specified cache size.
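
Putting these options together, the relevant sections of /etc/columnstore/storagemanager.cnf might look like the following sketch (the bucket, endpoint, and credentials are placeholders):

[ObjectStorage]
service = S3

[S3]
bucket = your_columnstore_bucket
endpoint = s3.amazonaws.com
aws_access_key_id = your_access_key_id
aws_secret_access_key = your_secret_access_key

[Cache]
cache_size = 2g
path = /var/lib/columnstore/storagemanager/cache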

    Start the Enterprise ColumnStore Services

    1. On each Enterprise ColumnStore node, start and enable the MariaDB Enterprise Server service, so that it starts automatically upon reboot:

2. On each Enterprise ColumnStore node, stop the MariaDB Enterprise ColumnStore service:

Once the cluster is initiated with CMAPI in a later step, CMAPI will start the Enterprise ColumnStore service as needed on each node. CMAPI disables the Enterprise ColumnStore service to prevent systemd from automatically starting Enterprise ColumnStore upon reboot.

3. On each Enterprise ColumnStore node, start and enable the CMAPI service, so that it starts automatically upon reboot:

    For additional information, see "".

    Create User Accounts

    The ColumnStore Object Storage topology requires several user accounts. Each user account should be created on the primary server, so that it is replicated to the replica servers.

    Create the Utility User

Enterprise ColumnStore requires a utility user account to perform cross-engine joins and similar operations.

    1. On the primary server, create the user account with the CREATE USER statement:

2. On the primary server, grant the user account SELECT privileges on all databases with the GRANT statement:

3. On each Enterprise ColumnStore node, configure the ColumnStore utility user:

4. On each Enterprise ColumnStore node, set the password:

    For details about how to encrypt the password, see "".

    Passwords should meet your organization's password policies. If your MariaDB Enterprise Server instance has a password validation plugin installed, then the password should also meet the configured requirements.

    Create the Replication User

    ColumnStore Object Storage uses MariaDB Replication to replicate writes between the primary and replica servers. As MaxScale can promote a replica server to become a new primary in the event of node failure, all nodes must have a replication user.

    The action is performed on the primary server.

    Create the replication user and grant it the required privileges:

1. Use the CREATE USER statement to create the replication user.

    Replace the referenced IP address with the relevant address for your environment.

    Ensure that the user account can connect to the primary server from each replica.

2. Grant the user account the required privileges with the GRANT statement.

Create the MaxScale User

    ColumnStore Object Storage 23.10 uses MariaDB MaxScale 22.08 to load balance between the nodes.

    This action is performed on the primary server.

1. Use the CREATE USER statement to create the MaxScale user:

    Replace the referenced IP address with the relevant address for your environment.

    Ensure that the user account can connect from the IP address of the MaxScale instance.

2. Use the GRANT statement to grant the privileges required by the router:

3. Use the GRANT statement to grant the privileges required by MariaDB Monitor:

    Configure MariaDB Replication

    On each replica server, configure MariaDB Replication:

    1. Use the CHANGE MASTER TO statement to configure the connection to the primary server:

2. Start replication using the START REPLICA statement:

3. Confirm that replication is working using the SHOW REPLICA STATUS statement:

4. Ensure that the replica server cannot accept local writes by setting the read_only system variable to ON using the SET GLOBAL statement:

    Initiate the Primary Server with CMAPI

    Initiate the primary server using CMAPI.

    1. Create an API key for the cluster. This API key should be stored securely and kept confidential, because it can be used to add cluster nodes to the multi-node Enterprise ColumnStore deployment.

    For example, to create a random 256-bit API key using openssl rand:

    This document will use the following API key in further examples, but users should create their own:

2. Use CMAPI to add the primary server to the cluster and set the API key. The new API key needs to be provided in the X-API-key HTTP header.

For example, if the primary server's host name is mcs1 and its IP address is 192.0.2.1, use the following curl command:

3. Use CMAPI to check the status of the cluster node:

    Add Replica Servers with CMAPI

    Add the replica servers with CMAPI:

1. For each replica server, use CMAPI to add the replica server to the cluster. The previously set API key needs to be provided in the X-API-key HTTP header.

For example, if the primary server's host name is mcs1 and the replica server's IP address is 192.0.2.2, use the following curl command:

2. After all replica servers have been added, use CMAPI to confirm that all cluster nodes have been successfully added:

    Configure Linux Security Modules (LSM)

    The specific steps to configure the security module depend on the operating system.

    Configure SELinux (CentOS, RHEL)

    Configure SELinux for Enterprise ColumnStore:

    1. To configure SELinux, you have to install the packages required for audit2allow. On CentOS 7 and RHEL 7, install the following:

    On RHEL 8, install the following:

2. Allow the system to run under load for a while to generate SELinux audit events.

3. After the system has taken some load, generate an SELinux policy from the audit events using audit2allow:

    If no audit events were found, this will print the following:

4. If audit events were found, the new SELinux policy can be loaded using semodule:

5. Set SELinux to enforcing mode for the running system:

6. Make the change persistent across reboots by setting SELINUX=enforcing in /etc/selinux/config.

    For example, the file will usually look like this after the change:

7. Confirm that SELinux is in enforcing mode:
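
A sketch of the command sequence for steps 3 through 7 (the policy module name mariadb_local is arbitrary):

$ sudo audit2allow -a -M mariadb_local   # generate a policy module from audit events
$ sudo semodule -i mariadb_local.pp      # load the generated policy
$ sudo setenforce enforcing              # switch the running system to enforcing mode
$ sudo getenforce                        # should print: Enforcing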

    Configure AppArmor (Ubuntu)

For information on how to create a profile, see the AppArmor documentation on Ubuntu.com.

    Configure Firewalls

    The specific steps to configure the firewall service depend on the platform.

    Configure firewalld (CentOS, RHEL)

Configure firewalld for Enterprise ColumnStore on CentOS and RHEL:

    1. Check if the firewalld service is running:

2. If the firewalld service was stopped to perform the installation, start it now:

3. Open up the relevant ports using firewall-cmd. For example, if your cluster nodes are in the 192.0.2.0/24 subnet:

4. Reload the runtime configuration:
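
A sketch of the port-opening and reload commands; the ports shown (3306 for MariaDB, 8600-8630 for inter-node communication, 8640 for CMAPI) are commonly used by Enterprise ColumnStore, but you should confirm the list for your release:

$ sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.0.2.0/24" port port="3306" protocol="tcp" accept'
$ sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.0.2.0/24" port port="8600-8630" protocol="tcp" accept'
$ sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.0.2.0/24" port port="8640" protocol="tcp" accept'
$ sudo firewall-cmd --reload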

    Configure UFW (Ubuntu)

    Configure UFW for Enterprise ColumnStore on Ubuntu:

    1. Check if the UFW service is running:

2. If the UFW service was stopped to perform the installation, start it now:

3. Open up the relevant ports using ufw. For example, if your cluster nodes are in the 192.0.2.0/24 subnet (range 192.0.2.1 - 192.0.2.3):

4. Reload the runtime configuration:
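
A comparable sketch with ufw, under the same port assumptions as the firewalld example above:

$ sudo ufw allow proto tcp from 192.0.2.0/24 to any port 3306
$ sudo ufw allow proto tcp from 192.0.2.0/24 to any port 8600:8630
$ sudo ufw allow proto tcp from 192.0.2.0/24 to any port 8640
$ sudo ufw reload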

    Next Step

    Navigation in the procedure "Deploy ColumnStore Object Storage Topology":

    This page was step 4 of 9.

Step 8: Test MariaDB MaxScale

    Overview

    This page details step 8 of the 9-step procedure "Deploy ColumnStore Object Storage Topology".

    This step tests MariaDB MaxScale 22.08.

    Interactive commands are detailed. Alternatively, the described operations can be performed using automation.

    Check Global Configuration

Use the maxctrl show maxscale command to view the global MaxScale configuration.

    This action is performed on the MaxScale node:

    Output should align to the global MaxScale configuration in the new configuration file you created.

Check Server Configuration

Use the maxctrl list servers and maxctrl show server commands to view the configured server objects.

    This action is performed on the MaxScale node:

1. Obtain the full list of server objects:

2. For each server object, view the configuration:

    Output should align to the Server Object configuration you performed.

    Check Monitor Configuration

Use the maxctrl list monitors and maxctrl show monitor commands to view the configured monitors.

    This action is performed on the MaxScale node:

    1. Obtain the full list of monitors:

2. For each monitor, view the monitor configuration:

    Output should align to the MariaDB Monitor (mariadbmon) configuration you performed.

    Check Service Configuration

Use the maxctrl list services and maxctrl show service commands to view the configured routing services.

    This action is performed on the MaxScale node:

    1. Obtain the full list of routing services:

2. For each service, view the service configuration:

Output should align to the Read/Write Split Router (readwritesplit) or Read Connection Router (readconnroute) configuration you performed.

    Test Application User

    Applications should use a dedicated user account. The user account must be created on the primary server.

    When users connect to MaxScale, MaxScale authenticates the user connection before routing it to an Enterprise Server node. Enterprise Server authenticates the connection as originating from the IP address of the MaxScale node.

Application users must therefore have two accounts: one with the host IP address of the application server, and a second with the host IP address of the MaxScale node.

The requirement for duplicate user accounts can be avoided by enabling the proxy_protocol parameter for MaxScale and the proxy_protocol_networks system variable for Enterprise Server.
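
A sketch of the two settings (the network value is illustrative):

# In each MaxScale server object definition:
proxy_protocol=true

# In the Enterprise Server configuration ([mariadb] section):
proxy_protocol_networks=192.0.2.10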

    Create a User to Connect from MaxScale

    This action is performed on the primary Enterprise ColumnStore node:

    1. Connect to the primary Enterprise ColumnStore node:

2. Create the database user account for your MaxScale node:

    Replace 192.0.2.10 with the relevant IP address specification for your MaxScale node.

    Passwords should meet your organization's password policies.

3. Grant the privileges required by your application to the database user account for your MaxScale node:

    The privileges shown are designed to allow the tests in the subsequent sections to work. The user account for your production application may require different privileges.

    Create a User to Connect from the Application Server

    This action is performed on the primary Enterprise ColumnStore node:

    1. Create the database user account for your application server:

    Replace 192.0.2.11 with the relevant IP address specification for your application server.

    Passwords should meet your organization's password policies.

2. Grant the privileges required by your application to the database user account for your application server:

    The privileges shown are designed to allow the tests in the subsequent sections to work. The user account for your production application may require different privileges.

    Test Connection with Application User

    To test the connection, use the MariaDB Client from your application server to connect to an Enterprise ColumnStore node through MaxScale.

    This action is performed on a client connected to the MaxScale node:

    Test Connection with Read Connection Router

    If you configured the Read Connection Router, confirm that MaxScale routes connections to the replica servers.

1. On the MaxScale node, use the maxctrl list listeners command to view the available listeners and ports:

2. Open multiple terminals connected to your application server. In each, use MariaDB Client to connect to the listener port for the Read Connection Router (in the example, 3308):

    Use the application user credentials you created for the --user and --password options.

3. In each terminal, query the hostname and server_id system variables to identify which server you're connected to:

    Different terminals should return different values since MaxScale routes the connections to different nodes.

Since the router was configured with router_options=slave, the Read Connection Router only routes connections to replica servers.

    Test Write Queries with Read/Write Split Router

    If you configured the Read/Write Split Router, confirm that MaxScale routes write queries on this router to the primary Enterprise ColumnStore node.

1. On the MaxScale node, use the maxctrl list listeners command to view the available listeners and ports:

2. Open multiple terminals connected to your application server. In each, use MariaDB Client to connect to the listener port for the Read/Write Split Router (in the example, 3307):

    Use the application user credentials you created for the --user and --password options.

3. In one terminal, create the test table:

4. In each terminal, issue an INSERT statement to add a row to the example table with the values of the hostname and server_id system variables:

5. In one terminal, issue a SELECT statement to query the results:

Although MaxScale handled connections from multiple terminals, it routed all write queries to the current primary Enterprise ColumnStore node, which in the example is mcs1.

    Test Read Queries with Read/Write Split Router

If you configured the Read/Write Split Router, confirm that MaxScale routes read queries on this router to replica servers.

1. On the MaxScale node, use the maxctrl list listeners command to view the available listeners and ports:

2. In a terminal connected to your application server, use MariaDB Client to connect to the listener port for the Read/Write Split Router (in the example, 3307):

    Use the application user credentials you created for the --user and --password options.

3. Query the hostname and server_id to identify which server MaxScale routed you to.

4. Resend the query:

    Confirm that MaxScale routes the SELECT statements to different replica servers.

For more information on different routing criteria, see slave_selection_criteria.

    Next Step

    Navigation in the procedure "Deploy ColumnStore Object Storage Topology":

    This page was step 8 of 9.

    ColumnStore Bulk Data Loading

    Overview

cpimport is a high-speed bulk load utility that imports data into ColumnStore tables in a fast and efficient manner. It accepts as input any flat file with a delimiter between fields of data (i.e., columns in a table). The default delimiter is the pipe ('|') character, but other delimiters such as commas may be used as well. The data values must be in the same order as the CREATE TABLE statement, i.e., column 1 matches the first column in the table and so on. Date values must be specified in the format 'yyyy-mm-dd'.

cpimport performs the following operations when importing data into a MariaDB ColumnStore database:

    • Data is read from specified flat files.

    • Data is transformed to fit ColumnStore’s column-oriented storage design.

    • Redundant data is tokenized and logically compressed.

    • Data is written to disk.

    It is important to note that:

    • The bulk loads are an append operation to a table, so they allow existing data to be read and remain unaffected during the process.

    • The bulk loads do not write their data operations to the transaction log; they are not transactional in nature but are considered an atomic operation at this time. Information markers, however, are placed in the transaction log so the DBA is aware that a bulk operation did occur.

• Upon completion of the load operation, a high-water mark in each column file is moved in an atomic operation that allows for any subsequent queries to read the newly loaded data. This append operation provides for consistent reads without incurring the overhead of logging the data.

    There are two primary steps to using the cpimport utility:

    1. Optionally create a job file that is used to load data from a flat file into multiple tables.

    2. Run the cpimport utility to perform the data import.

    Syntax

The simplest form of the cpimport command takes the database name, the table name, and optionally a load file, along the lines of:
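
cpimport dbName tblName [loadFile]

When loadFile is omitted, cpimport reads from standard input.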

    The full syntax is like this:

    cpimport modes

    Mode 1: Bulk Load from a central location with single data source file

In this mode, you run cpimport from your primary node (mcs1). The source file is located on this primary node, and the data from cpimport is distributed across all the nodes. If no mode is specified, this is the default.

    Example:
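
A representative Mode 1 invocation (the database mytest, table mytable, and file mytable.tbl are placeholders):

$ cpimport -m1 mytest mytable mytable.tbl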

    Mode 2: Bulk load from central location with distributed data source files

In this mode, you run cpimport from your primary node (mcs1). The source data is in already-partitioned data files residing on the PMs. Each PM should have a source data file of the same name, each containing that PM's portion of the data.

    Example:
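
A representative Mode 2 invocation, assuming each PM holds /home/mydata/mytable.tbl with its portion of the data (check cpimport's help output for the exact flags in your version):

$ cpimport -m2 mytest mytable -l /home/mydata/mytable.tbl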

    Mode 3: Parallel distributed bulk load

    In this mode, you run cpimport from the individual nodes independently, which will import the source file that exists on that node. Concurrent imports can be executed on every node for the same table.

    Example:
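
A representative Mode 3 invocation, run separately on each node against that node's local file:

$ cpimport -m3 mytest mytable /home/mydata/mytable.tbl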


    Bulk loading data from STDIN

Data can be loaded from STDIN into ColumnStore by simply not including the loadFile parameter.

    Example:
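
For instance (names are placeholders):

$ cat mytable.tbl | cpimport mytest mytable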

    Bulk loading from AWS S3

Similarly, the AWS CLI utility can be used to read data from an S3 bucket and pipe the output into cpimport, allowing direct loading from S3. This assumes the aws CLI program has been installed and configured on the host:

    Example:
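
A sketch (the bucket and object names are placeholders; -s ',' sets the field delimiter for CSV input):

$ aws s3 cp --quiet s3://mybucket/mytable.csv - | cpimport -s ',' mytest mytable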

To troubleshoot connectivity problems, remove the --quiet option, which suppresses client logging, including permission errors.

    Bulk loading output of SELECT FROM Table(s)

Standard input can also be used to directly pipe the output from an arbitrary SELECT statement into cpimport. The SELECT statement may select from non-ColumnStore tables, such as InnoDB or MyISAM tables. In the example below, db2.source_table is selected from, using the -N flag to remove non-data formatting. The -q flag tells the client to not cache results, which avoids possible timeouts causing the load to fail.

    Example:
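
A sketch along those lines (database and table names are placeholders; the client's batch output is tab-separated, hence -s '\t'):

$ mariadb -q -N -e 'SELECT * FROM db2.source_table' | cpimport -s '\t' mytest mytable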

    Bulk loading from JSON

    Let's create a sample ColumnStore table:
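
The original DDL is not reproduced here; a hypothetical table consistent with the JSON example below might be:

CREATE TABLE products (
   id INT,
   name VARCHAR(100),
   price DECIMAL(10,2)
) ENGINE=columnstore;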

    Now let's create a sample products.json file like this:
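
For example, a products.json matching the hypothetical table above:

[
  {"id": 1, "name": "Widget", "price": 9.99},
  {"id": 2, "name": "Gadget", "price": 19.99}
]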

We can then bulk load data from JSON into ColumnStore by first piping the data to jq and then to cpimport, using a one-line command.

    Example:
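
A sketch of such a one-liner for the hypothetical table and file above (-E '"' tells cpimport that fields are enclosed in double quotes, as produced by jq's @csv):

$ cat products.json | jq -r '.[] | [.id, .name, .price] | @csv' | cpimport -s ',' -E '"' mytest products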

In this example, the JSON data comes from a static JSON file, but the same method works for output streamed from any data source that produces JSON, such as an API or NoSQL database. For more information on jq, see its manual.

    Bulk loading into multiple tables

    There are two ways multiple tables can be loaded:

1. Run multiple cpimport jobs simultaneously. Tables per import should be unique, or (if using mode 3) PMs for each import should be unique.

2. Use the colxml utility: colxml creates an XML job file for your database schema before you import data. Multiple tables may be imported either by importing all tables within a schema or by listing specific tables using the -t option in colxml. Then run cpimport, which uses the job file generated by colxml. Here is an example of how to use colxml and cpimport to import data into all the tables in a database schema.

    colxml syntax

    Example usage of colxml

The following tables comprise a database named 'tpch2':

1. First, put a delimited input data file for each table in /usr/local/mariadb/columnstore/data/bulk/data/import. Each file should be named <tablename>.tbl.

    2. Run colxml for the load job for the ‘tpch2’ database as shown here:
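
A sketch, assuming job ID 500 (the ID is arbitrary, but must match the cpimport invocation below):

$ colxml tpch2 -j500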

3. Now run cpimport to use the job file generated by the colxml execution:
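
For example, referencing the job ID used above:

$ cpimport -j500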

    Handling Differences in Column Order and Values

If there are differences between the input file and the table definition, the colxml utility can be used to handle these cases:

    • Different order of columns in the input file from table order

    • Input file column values to be skipped / ignored.

    • Target table columns to be defaulted.

In this case, run the colxml utility (the -t argument can be useful for producing a job file for a single table if preferred) to produce the job XML file, use that file as a template for editing, and then use the edited job file when running cpimport.

    Consider the following simple table example:
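
The DDL is not reproduced here; a hypothetical definition consistent with the column names discussed below might be:

CREATE TABLE emp (
   emp_id INT,
   dept_id INT,
   name VARCHAR(30),
   salary INT,
   hire_date DATE
) ENGINE=columnstore;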

    This would produce a colxml file with the following table element:
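
A sketch of such a table element, following the job file format colxml generates for the hypothetical table above:

<Table tblName="test.emp" loadName="emp.tbl">
  <Column colName="emp_id"/>
  <Column colName="dept_id"/>
  <Column colName="name"/>
  <Column colName="salary"/>
  <Column colName="hire_date"/>
</Table>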

    If your input file had the data such that hire_date comes before salary then the following modification will allow correct loading of that data to the original table definition (note the last 2 Column elements are swapped):

The following example would ignore the last entry in the file and default salary to its default value (in this case NULL):

    • IgnoreFields instructs cpimport to ignore and skip the particular value at that position in the file.

    • DefaultColumn instructs cpimport to default the current table column and not move the column pointer forward to the next delimiter.

Both instructions can be used independently and as many times as makes sense for your data and table definition.

    Binary Source Import

It is possible to import from a binary source file instead of a CSV file, using fixed-length rows in binary data. This can be done using the -I flag, which has two modes:

• -I1: binary mode with NULLs accepted. Numeric fields containing NULL will be treated as NULL unless the column has a default value.

• -I2: binary mode with NULLs saturated. NULLs in numeric fields will be saturated.

    The following table shows how to represent the data in the binary format:

    Datatype
    Description

    For NULL values the following table should be used:

    Datatype
    Signed NULL
    Unsigned NULL

    Date Struct

The spare bits in the Date struct must be set to 0x3E.
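
A sketch of the Date struct's bit layout in C bit-field notation:

struct Date
{
  unsigned spare : 6;   // filler bits; must be set to 0x3E
  unsigned day   : 6;   // day of month, 1-31
  unsigned month : 4;   // month, 1-12
  unsigned year  : 16;  // four-digit year
};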

    DateTime Struct
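
The DateTime struct extends the same layout with time-of-day fields; a sketch in the same notation:

struct DateTime
{
  unsigned msecond : 20;  // sub-second component
  unsigned second  : 6;   // 0-59
  unsigned minute  : 6;   // 0-59
  unsigned hour    : 6;   // 0-23
  unsigned day     : 6;   // day of month, 1-31
  unsigned month   : 4;   // month, 1-12
  unsigned year    : 16;  // four-digit year
};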

    Working Folders & Logging

    As of version 1.4, cpimport uses the /var/lib/columnstore/bulk folder for all work being done. This folder contains:

    1. Logs

    2. Rollback info

    3. Job info

    4. A staging folder

    The log folder typically contains:

    A typical log might look like this:

    Prior to version 1.4, this folder was located at /usr/local/mariadb/columnstore/bulk.

    Multi-Node Localstorage

    This guide provides steps for deploying a multi-node ColumnStore, setting up the environment, installing the software, and bulk importing data for online analytical processing (OLAP) workloads.

    Overview

    $ maxctrl show maxscale
    ┌──────────────┬───────────────────────────────────────────────────────┐
    │ Version      │ 22.08.15                                              │
    ├──────────────┼───────────────────────────────────────────────────────┤
    │ Commit       │ 3761fa7a52046bc58faad8b5a139116f9e33364c              │
    ├──────────────┼───────────────────────────────────────────────────────┤
    │ Started At   │ Thu, 05 Aug 2021 20:21:20 GMT                         │
    ├──────────────┼───────────────────────────────────────────────────────┤
    │ Activated At │ Thu, 05 Aug 2021 20:21:20 GMT                         │
    ├──────────────┼───────────────────────────────────────────────────────┤
    │ Uptime       │ 868                                                   │
    ├──────────────┼───────────────────────────────────────────────────────┤
    │ Config Sync  │ null                                                  │
    ├──────────────┼───────────────────────────────────────────────────────┤
    │ Parameters   │ {                                                     │
    │              │     "admin_auth": true,                               │
    │              │     "admin_enabled": true,                            │
    │              │     "admin_gui": true,                                │
    │              │     "admin_host": "0.0.0.0",                          │
    │              │     "admin_log_auth_failures": true,                  │
    │              │     "admin_pam_readonly_service": null,               │
    │              │     "admin_pam_readwrite_service": null,              │
    │              │     "admin_port": 8989,                               │
    │              │     "admin_secure_gui": false,                        │
    │              │     "admin_ssl_ca_cert": null,                        │
    │              │     "admin_ssl_cert": null,                           │
    │              │     "admin_ssl_key": null,                            │
    │              │     "admin_ssl_version": "MAX",                       │
    │              │     "auth_connect_timeout": "10000ms",                │
    │              │     "auth_read_timeout": "10000ms",                   │
    │              │     "auth_write_timeout": "10000ms",                  │
    │              │     "cachedir": "/var/cache/maxscale",                │
    │              │     "config_sync_cluster": null,                      │
    │              │     "config_sync_interval": "5000ms",                 │
    │              │     "config_sync_password": "*****",                  │
    │              │     "config_sync_timeout": "10000ms",                 │
    │              │     "config_sync_user": null,                         │
    │              │     "connector_plugindir": "/usr/lib64/mysql/plugin", │
    │              │     "datadir": "/var/lib/maxscale",                   │
    │              │     "debug": null,                                    │
    │              │     "dump_last_statements": "never",                  │
    │              │     "execdir": "/usr/bin",                            │
    │              │     "language": "/var/lib/maxscale",                  │
    │              │     "libdir": "/usr/lib64/maxscale",                  │
    │              │     "load_persisted_configs": true,                   │
    │              │     "local_address": null,                            │
    │              │     "log_debug": false,                               │
    │              │     "log_info": false,                                │
    │              │     "log_notice": true,                               │
    │              │     "log_throttling": {                               │
    │              │         "count": 10,                                  │
    │              │         "suppress": 10000,                            │
    │              │         "window": 1000                                │
    │              │     },                                                │
    │              │     "log_warn_super_user": false,                     │
    │              │     "log_warning": true,                              │
    │              │     "logdir": "/var/log/maxscale",                    │
    │              │     "max_auth_errors_until_block": 10,                │
    │              │     "maxlog": true,                                   │
    │              │     "module_configdir": "/etc/maxscale.modules.d",    │
    │              │     "ms_timestamp": false,                            │
    │              │     "passive": false,                                 │
    │              │     "persistdir": "/var/lib/maxscale/maxscale.cnf.d", │
    │              │     "piddir": "/var/run/maxscale",                    │
    │              │     "query_classifier": "qc_sqlite",                  │
    │              │     "query_classifier_args": null,                    │
    │              │     "query_classifier_cache_size": 289073971,         │
    │              │     "query_retries": 1,                               │
    │              │     "query_retry_timeout": "5000ms",                  │
    │              │     "rebalance_period": "0ms",                        │
    │              │     "rebalance_threshold": 20,                        │
    │              │     "rebalance_window": 10,                           │
    │              │     "retain_last_statements": 0,                      │
    │              │     "session_trace": 0,                               │
    │              │     "skip_permission_checks": false,                  │
    │              │     "sql_mode": "default",                            │
    │              │     "syslog": true,                                   │
    │              │     "threads": 1,                                     │
    │              │     "users_refresh_interval": "0ms",                  │
    │              │     "users_refresh_time": "30000ms",                  │
    │              │     "writeq_high_water": 16777216,                    │
    │              │     "writeq_low_water": 8192                          │
    │              │ }                                                     │
    └──────────────┴───────────────────────────────────────────────────────┘
    $ maxctrl list servers
    ┌────────┬────────────────┬──────┬─────────────┬─────────────────┬────────┐
    │ Server │ Address        │ Port │ Connections │ State           │ GTID   │
    ├────────┼────────────────┼──────┼─────────────┼─────────────────┼────────┤
    │ mcs1   │ 192.0.2.1      │ 3306 │ 1           │ Master, Running │ 0-1-25 │
    ├────────┼────────────────┼──────┼─────────────┼─────────────────┼────────┤
    │ mcs2   │ 192.0.2.2      │ 3306 │ 1           │ Slave, Running  │ 0-1-25 │
    ├────────┼────────────────┼──────┼─────────────┼─────────────────┼────────┤
    │ mcs3   │ 192.0.2.3      │ 3306 │ 1           │ Slave, Running  │ 0-1-25 │
    └────────┴────────────────┴──────┴─────────────┴─────────────────┴────────┘
    $ maxctrl show server mcs1
    ┌─────────────────────┬───────────────────────────────────────────┐
    │ Server              │ mcs1                                      │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Address             │ 192.0.2.1                                 │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Port                │ 3306                                      │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ State               │ Master, Running                           │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Version             │ 11.4.5-3-MariaDB-enterprise-log           │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Last Event          │ master_up                                 │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Triggered At        │ Thu, 05 Aug 2021 20:22:26 GMT             │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Services            │ connection_router_service                 │
    │                     │ query_router_service                      │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Monitors            │ columnstore_monitor                       │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Master ID           │ -1                                        │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Node ID             │ 1                                         │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Slave Server IDs    │                                           │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Current Connections │ 1                                         │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Total Connections   │ 1                                         │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Max Connections     │ 1                                         │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Statistics          │ {                                         │
    │                     │     "active_operations": 0,               │
    │                     │     "adaptive_avg_select_time": "0ns",    │
    │                     │     "connection_pool_empty": 0,           │
    │                     │     "connections": 1,                     │
    │                     │     "max_connections": 1,                 │
    │                     │     "max_pool_size": 0,                   │
    │                     │     "persistent_connections": 0,          │
    │                     │     "reused_connections": 0,              │
    │                     │     "routed_packets": 0,                  │
    │                     │     "total_connections": 1                │
    │                     │ }                                         │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Parameters          │ {                                         │
    │                     │     "address": "192.0.2.1",               │
    │                     │     "disk_space_threshold": null,         │
    │                     │     "extra_port": 0,                      │
    │                     │     "monitorpw": null,                    │
    │                     │     "monitoruser": null,                  │
    │                     │     "persistmaxtime": "0ms",              │
    │                     │     "persistpoolmax": 0,                  │
    │                     │     "port": 3306,                         │
    │                     │     "priority": 0,                        │
    │                     │     "proxy_protocol": false,              │
    │                     │     "rank": "primary",                    │
    │                     │     "socket": null,                       │
    │                     │     "ssl": false,                         │
    │                     │     "ssl_ca_cert": null,                  │
    │                     │     "ssl_cert": null,                     │
    │                     │     "ssl_cert_verify_depth": 9,           │
    │                     │     "ssl_cipher": null,                   │
    │                     │     "ssl_key": null,                      │
    │                     │     "ssl_verify_peer_certificate": false, │
    │                     │     "ssl_verify_peer_host": false,        │
    │                     │     "ssl_version": "MAX"                  │
    │                     │ }                                         │
    └─────────────────────┴───────────────────────────────────────────┘
    $ maxctrl list monitors
    ┌─────────────────────┬─────────┬──────────────────┐
    │ Monitor             │ State   │ Servers          │
    ├─────────────────────┼─────────┼──────────────────┤
    │ columnstore_monitor │ Running │ mcs1, mcs2, mcs3 │
    └─────────────────────┴─────────┴──────────────────┘
    $ maxctrl show monitor columnstore_monitor
    ┌─────────────────────┬─────────────────────────────────────┐
    │ Monitor             │ columnstore_monitor                 │
    ├─────────────────────┼─────────────────────────────────────┤
    │ Module              │ mariadbmon                          │
    ├─────────────────────┼─────────────────────────────────────┤
    │ State               │ Running                             │
    ├─────────────────────┼─────────────────────────────────────┤
    │ Servers             │ mcs1                                │
    │                     │ mcs2                                │
    │                     │ mcs3                                │
    ├─────────────────────┼─────────────────────────────────────┤
    │ Parameters          │ {                                   │
    │                     │     "backend_connect_attempts": 1,  │
    │                     │     "backend_connect_timeout": 3,   │
    │                     │     "backend_read_timeout": 3,      │
    │                     │     "backend_write_timeout": 3,     │
    │                     │     "disk_space_check_interval": 0, │
    │                     │     "disk_space_threshold": null,   │
    │                     │     "events": "all",                │
    │                     │     "journal_max_age": 28800,       │
    │                     │     "module": "mariadbmon",         │
    │                     │     "monitor_interval": 2000,       │
    │                     │     "password": "*****",            │
    │                     │     "script": null,                 │
    │                     │     "script_timeout": 90,           │
    │                     │     "user": "mxs"                   │
    │                     │ }                                   │
    ├─────────────────────┼─────────────────────────────────────┤
    │ Monitor Diagnostics │ {}                                  │
    └─────────────────────┴─────────────────────────────────────┘
    $ maxctrl list services
    ┌───────────────────────────┬────────────────┬─────────────┬───────────────────┬──────────────────┐
    │ Service                   │ Router         │ Connections │ Total Connections │ Servers          │
    ├───────────────────────────┼────────────────┼─────────────┼───────────────────┼──────────────────┤
│ connection_router_service │ readconnroute  │ 0           │ 0                 │ mcs1, mcs2, mcs3 │
    ├───────────────────────────┼────────────────┼─────────────┼───────────────────┼──────────────────┤
    │ query_router_service      │ readwritesplit │ 0           │ 0                 │ mcs1, mcs2, mcs3 │
    └───────────────────────────┴────────────────┴─────────────┴───────────────────┴──────────────────┘
    $ maxctrl show service query_router_service
    ┌─────────────────────┬─────────────────────────────────────────────────────────────┐
    │ Service             │ query_router_service                                        │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Router              │ readwritesplit                                              │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ State               │ Started                                                     │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Started At          │ Sat Aug 28 21:41:16 2021                                    │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Current Connections │ 0                                                           │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Total Connections   │ 0                                                           │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Max Connections     │ 0                                                           │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Cluster             │                                                             │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Servers             │ mcs1                                                        │
    │                     │ mcs2                                                        │
    │                     │ mcs3                                                        │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Services            │                                                             │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Filters             │                                                             │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Parameters          │ {                                                           │
    │                     │     "auth_all_servers": false,                              │
    │                     │     "causal_reads": "false",                                │
    │                     │     "causal_reads_timeout": "10000ms",                      │
    │                     │     "connection_keepalive": "300000ms",                     │
    │                     │     "connection_timeout": "0ms",                            │
    │                     │     "delayed_retry": false,                                 │
    │                     │     "delayed_retry_timeout": "10000ms",                     │
    │                     │     "disable_sescmd_history": false,                        │
    │                     │     "enable_root_user": false,                              │
    │                     │     "idle_session_pool_time": "-1000ms",                    │
    │                     │     "lazy_connect": false,                                  │
    │                     │     "localhost_match_wildcard_host": true,                  │
    │                     │     "log_auth_warnings": true,                              │
    │                     │     "master_accept_reads": false,                           │
    │                     │     "master_failure_mode": "fail_instantly",                │
    │                     │     "master_reconnection": false,                           │
    │                     │     "max_connections": 0,                                   │
    │                     │     "max_sescmd_history": 50,                               │
    │                     │     "max_slave_connections": 255,                           │
    │                     │     "max_slave_replication_lag": "0ms",                     │
    │                     │     "net_write_timeout": "0ms",                             │
    │                     │     "optimistic_trx": false,                                │
    │                     │     "password": "*****",                                    │
    │                     │     "prune_sescmd_history": true,                           │
    │                     │     "rank": "primary",                                      │
    │                     │     "retain_last_statements": -1,                           │
    │                     │     "retry_failed_reads": true,                             │
    │                     │     "reuse_prepared_statements": false,                     │
    │                     │     "router": "readwritesplit",                             │
    │                     │     "session_trace": false,                                 │
    │                     │     "session_track_trx_state": false,                       │
    │                     │     "slave_connections": 255,                               │
    │                     │     "slave_selection_criteria": "LEAST_CURRENT_OPERATIONS", │
    │                     │     "strict_multi_stmt": false,                             │
    │                     │     "strict_sp_calls": false,                               │
    │                     │     "strip_db_esc": true,                                   │
    │                     │     "transaction_replay": false,                            │
    │                     │     "transaction_replay_attempts": 5,                       │
    │                     │     "transaction_replay_max_size": 1073741824,              │
    │                     │     "transaction_replay_retry_on_deadlock": false,          │
    │                     │     "type": "service",                                      │
    │                     │     "use_sql_variables_in": "all",                          │
    │                     │     "user": "mxs",                                          │
    │                     │     "version_string": null                                  │
    │                     │ }                                                           │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Router Diagnostics  │ {                                                           │
    │                     │     "avg_sescmd_history_length": 0,                         │
    │                     │     "max_sescmd_history_length": 0,                         │
    │                     │     "queries": 0,                                           │
    │                     │     "replayed_transactions": 0,                             │
    │                     │     "ro_transactions": 0,                                   │
    │                     │     "route_all": 0,                                         │
    │                     │     "route_master": 0,                                      │
    │                     │     "route_slave": 0,                                       │
    │                     │     "rw_transactions": 0,                                   │
    │                     │     "server_query_statistics": []                           │
    │                     │ }                                                           │
    └─────────────────────┴─────────────────────────────────────────────────────────────┘
    $ sudo mariadb
    CREATE USER 'app_user'@'192.0.2.10' IDENTIFIED BY 'app_user_passwd';
    GRANT ALL ON test.* TO 'app_user'@'192.0.2.10';
    CREATE USER 'app_user'@'192.0.2.11' IDENTIFIED BY 'app_user_passwd';
    GRANT ALL ON test.* TO 'app_user'@'192.0.2.11';
$ mariadb --host 192.0.2.10 --port 3307 \
      --user app_user --password
    $ maxctrl list listeners
    ┌────────────────────────────┬──────┬──────┬─────────┬───────────────────────────┐
    │ Name                       │ Port │ Host │ State   │ Service                   │
    ├────────────────────────────┼──────┼──────┼─────────┼───────────────────────────┤
    │ connection_router_listener │ 3308 │ ::   │ Running │ connection_router_service │
    ├────────────────────────────┼──────┼──────┼─────────┼───────────────────────────┤
    │ query_router_listener      │ 3307 │ ::   │ Running │ query_router_service      │
    └────────────────────────────┴──────┴──────┴─────────┴───────────────────────────┘
    $ mariadb --host 192.0.2.10 --port 3308 \
          --user app_user --password
    SELECT @@global.hostname, @@global.server_id;
    
    +-------------------+--------------------+
    | @@global.hostname | @@global.server_id |
    +-------------------+--------------------+
    |              mcs2 |                  2 |
    +-------------------+--------------------+
    $ maxctrl list listeners
    ┌────────────────────────────┬──────┬──────┬─────────┬───────────────────────────┐
    │ Name                       │ Port │ Host │ State   │ Service                   │
    ├────────────────────────────┼──────┼──────┼─────────┼───────────────────────────┤
    │ connection_router_listener │ 3308 │ ::   │ Running │ connection_router_service │
    ├────────────────────────────┼──────┼──────┼─────────┼───────────────────────────┤
    │ query_router_listener      │ 3307 │ ::   │ Running │ query_router_service      │
    └────────────────────────────┴──────┴──────┴─────────┴───────────────────────────┘
    $ mariadb --host 192.0.2.10 --port 3307 \
          --user app_user --password
    CREATE TABLE test.load_balancing_test (
       id INT PRIMARY KEY AUTO_INCREMENT,
       hostname VARCHAR(256),
       server_id INT
    );
    INSERT INTO test.load_balancing_test (hostname, server_id)
    VALUES (@@global.hostname, @@global.server_id);
    SELECT * FROM test.load_balancing_test;
    +----+----------+-----------+
    | id | hostname | server_id |
    +----+----------+-----------+
    |  1 | mcs1     |         1 |
    |  2 | mcs1     |         1 |
    |  3 | mcs1     |         1 |
    +----+----------+-----------+
    $ maxctrl list listeners
    ┌────────────────────────────┬──────┬──────┬─────────┬───────────────────────────┐
    │ Name                       │ Port │ Host │ State   │ Service                   │
    ├────────────────────────────┼──────┼──────┼─────────┼───────────────────────────┤
    │ connection_router_listener │ 3308 │ ::   │ Running │ connection_router_service │
    ├────────────────────────────┼──────┼──────┼─────────┼───────────────────────────┤
    │ query_router_listener      │ 3307 │ ::   │ Running │ query_router_service      │
    └────────────────────────────┴──────┴──────┴─────────┴───────────────────────────┘
    $ mariadb --host 192.0.2.10 --port 3307 \
          --user app_user --password
    SELECT @@global.hostname, @@global.server_id;
    +-------------------+--------------------+
    | @@global.hostname | @@global.server_id |
    +-------------------+--------------------+
    |              mcs2 |                  2 |
    +-------------------+--------------------+
    SELECT @@global.hostname, @@global.server_id;
    +-------------------+--------------------+
    | @@global.hostname | @@global.server_id |
    +-------------------+--------------------+
    |              mcs3 |                  3 |
    +-------------------+--------------------+
    $ sudo systemctl stop mariadb
    $ sudo systemctl stop mariadb-columnstore
    $ sudo systemctl stop mariadb-columnstore-cmapi
    [mariadb]
    bind_address                           = 0.0.0.0
    log_error                              = mariadbd.err
    character_set_server                   = utf8
    collation_server                       = utf8_general_ci
    log_bin                                = mariadb-bin
    log_bin_index                          = mariadb-bin.index
    relay_log                              = mariadb-relay
    relay_log_index                        = mariadb-relay.index
    log_slave_updates                      = ON
    gtid_strict_mode                       = ON
    
    # This must be unique on each Enterprise ColumnStore node
    server_id                              = 1
    $ sudo systemctl start mariadb
    $ sudo systemctl enable mariadb
    $ sudo systemctl stop mariadb-columnstore
    $ sudo systemctl start mariadb-columnstore-cmapi
    $ sudo systemctl enable mariadb-columnstore-cmapi
    CREATE USER 'util_user'@'127.0.0.1'
    IDENTIFIED BY 'util_user_passwd';
    GRANT SELECT, PROCESS ON *.*
    TO 'util_user'@'127.0.0.1';
    $ sudo mcsSetConfig CrossEngineSupport Host 127.0.0.1
    $ sudo mcsSetConfig CrossEngineSupport Port 3306
    $ sudo mcsSetConfig CrossEngineSupport User util_user
    $ sudo mcsSetConfig CrossEngineSupport Password util_user_passwd
    CREATE USER 'repl'@'192.0.2.%' IDENTIFIED BY 'repl_passwd';
    GRANT REPLICA MONITOR,
       REPLICATION REPLICA,
       REPLICATION REPLICA ADMIN,
       REPLICATION MASTER ADMIN
    ON *.* TO 'repl'@'192.0.2.%';
    CREATE USER 'mxs'@'192.0.2.%'
    IDENTIFIED BY 'mxs_passwd';
    GRANT SHOW DATABASES ON *.* TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.columns_priv TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.db TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.procs_priv TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.proxies_priv TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.roles_mapping TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.tables_priv TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.user TO 'mxs'@'192.0.2.%';
    GRANT BINLOG ADMIN,
       READ_ONLY ADMIN,
       RELOAD,
       REPLICA MONITOR,
       REPLICATION MASTER ADMIN,
       REPLICATION REPLICA ADMIN,
       REPLICATION REPLICA,
       SHOW DATABASES,
       SELECT
    ON *.* TO 'mxs'@'192.0.2.%';
    CHANGE MASTER TO
       MASTER_HOST='192.0.2.1',
       MASTER_USER='repl',
       MASTER_PASSWORD='repl_passwd',
       MASTER_USE_GTID=slave_pos;
    START REPLICA;
    SHOW REPLICA STATUS;
    SET GLOBAL read_only=ON;
    $ openssl rand -hex 32
    
    93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd
    $ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/node \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       --data '{"timeout":120, "node": "192.0.2.1"}' \
       | jq .
    {
      "timestamp": "2020-10-28 00:39:14.672142",
      "node_id": "192.0.2.1"
    }
    $ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       | jq .
    {
      "timestamp": "2020-12-15 00:40:34.353574",
      "192.0.2.1": {
        "timestamp": "2020-12-15 00:40:34.362374",
        "uptime": 11467,
        "dbrm_mode": "master",
        "cluster_mode": "readwrite",
        "dbroots": [
          "1"
        ],
        "module_id": 1,
        "services": [
          {
            "name": "workernode",
            "pid": 19202
          },
          {
            "name": "controllernode",
            "pid": 19232
          },
          {
            "name": "PrimProc",
            "pid": 19254
          },
          {
            "name": "ExeMgr",
            "pid": 19292
          },
          {
            "name": "WriteEngine",
            "pid": 19316
          },
          {
            "name": "DMLProc",
            "pid": 19332
          },
          {
            "name": "DDLProc",
            "pid": 19366
          }
        ]
  }
}
    $ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/node \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       --data '{"timeout":120, "node": "192.0.2.2"}' \
       | jq .
    {
      "timestamp": "2020-10-28 00:42:42.796050",
      "node_id": "192.0.2.2"
    }
    $ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       | jq .
    {
      "timestamp": "2020-12-15 00:40:34.353574",
      "192.0.2.1": {
        "timestamp": "2020-12-15 00:40:34.362374",
        "uptime": 11467,
        "dbrm_mode": "master",
        "cluster_mode": "readwrite",
        "dbroots": [
          "1"
        ],
        "module_id": 1,
        "services": [
          {
            "name": "workernode",
            "pid": 19202
          },
          {
            "name": "controllernode",
            "pid": 19232
          },
          {
            "name": "PrimProc",
            "pid": 19254
          },
          {
            "name": "ExeMgr",
            "pid": 19292
          },
          {
            "name": "WriteEngine",
            "pid": 19316
          },
          {
            "name": "DMLProc",
            "pid": 19332
          },
          {
            "name": "DDLProc",
            "pid": 19366
          }
        ]
      },
      "192.0.2.2": {
        "timestamp": "2020-12-15 00:40:34.428554",
        "uptime": 11437,
        "dbrm_mode": "slave",
        "cluster_mode": "readonly",
        "dbroots": [
          "2"
        ],
        "module_id": 2,
        "services": [
          {
            "name": "workernode",
            "pid": 17789
          },
          {
            "name": "PrimProc",
            "pid": 17813
          },
          {
            "name": "ExeMgr",
            "pid": 17854
          },
          {
            "name": "WriteEngine",
            "pid": 17877
          }
        ]
      },
      "192.0.2.3": {
        "timestamp": "2020-12-15 00:40:34.428554",
        "uptime": 11437,
        "dbrm_mode": "slave",
        "cluster_mode": "readonly",
        "dbroots": [
          "2"
        ],
        "module_id": 2,
        "services": [
          {
            "name": "workernode",
            "pid": 17789
          },
          {
            "name": "PrimProc",
            "pid": 17813
          },
          {
            "name": "ExeMgr",
            "pid": 17854
          },
          {
            "name": "WriteEngine",
            "pid": 17877
          }
        ]
      },
      "num_nodes": 3
    }
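Because the status endpoint returns plain JSON, jq can reduce it to a one-line-per-node summary; for example, this sketch keeps only the node objects and prints each node's cluster_mode:

$ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
   --header 'Content-Type:application/json' \
   --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
   | jq 'with_entries(select(.value | type == "object")) | map_values(.cluster_mode)'
{
  "192.0.2.1": "readwrite",
  "192.0.2.2": "readonly",
  "192.0.2.3": "readonly"
}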
    $ sudo yum install policycoreutils policycoreutils-python
    $ sudo yum install policycoreutils python3-policycoreutils policycoreutils-python-utils
    $ sudo grep mysqld /var/log/audit/audit.log | audit2allow -M mariadb_local
    $ sudo grep mysqld /var/log/audit/audit.log | audit2allow -M mariadb_local
    
    Nothing to do
    $ sudo semodule -i mariadb_local.pp
    $ sudo setenforce enforcing
    # This file controls the state of SELinux on the system.
    # SELINUX= can take one of these three values:
    #     enforcing - SELinux security policy is enforced.
    #     permissive - SELinux prints warnings instead of enforcing.
    #     disabled - No SELinux policy is loaded.
    SELINUX=enforcing
    # SELINUXTYPE= can take one of three values:
    #     targeted - Targeted processes are protected,
    #     minimum - Modification of targeted policy. Only selected processes are protected.
    #     mls - Multi Level Security protection.
    SELINUXTYPE=targeted
    $ sudo getenforce
    Enforcing
    $ sudo systemctl status firewalld
    $ sudo systemctl start firewalld
    $ sudo firewall-cmd --permanent --add-rich-rule='
       rule family="ipv4"
       source address="192.0.2.0/24"
       destination address="192.0.2.0/24"
       port port="3306" protocol="tcp"
       accept'
    $ sudo firewall-cmd --permanent --add-rich-rule='
       rule family="ipv4"
       source address="192.0.2.0/24"
       destination address="192.0.2.0/24"
       port port="8600-8630" protocol="tcp"
       accept'
    $ sudo firewall-cmd --permanent --add-rich-rule='
       rule family="ipv4"
       source address="192.0.2.0/24"
       destination address="192.0.2.0/24"
       port port="8640" protocol="tcp"
       accept'
    $ sudo firewall-cmd --permanent --add-rich-rule='
       rule family="ipv4"
       source address="192.0.2.0/24"
       destination address="192.0.2.0/24"
       port port="8700" protocol="tcp"
       accept'
    $ sudo firewall-cmd --permanent --add-rich-rule='
       rule family="ipv4"
       source address="192.0.2.0/24"
       destination address="192.0.2.0/24"
       port port="8800" protocol="tcp"
       accept'
    $ sudo firewall-cmd --reload
    $ sudo ufw status verbose
    $ sudo ufw enable
    $ sudo ufw allow from 192.0.2.0/24 to 192.0.2.3 port 3306 proto tcp
    
    $ sudo ufw allow from 192.0.2.0/24 to 192.0.2.3 port 8600:8630 proto tcp
    
    $ sudo ufw allow from 192.0.2.0/24 to 192.0.2.3 port 8640 proto tcp
    
    $ sudo ufw allow from 192.0.2.0/24 to 192.0.2.3 port 8700 proto tcp
    
    $ sudo ufw allow from 192.0.2.0/24 to 192.0.2.3 port 8800 proto tcp
    $ sudo ufw reload
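After reloading, confirm the rules are active. Listing them numbered also makes later edits easier, since individual rules can be deleted by number:

$ sudo ufw status numbered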
  • Resilient: S3-compatible object storage is often low maintenance and highly available, since many services use resilient cloud infrastructure.

  • Scalable: S3-compatible object storage is often highly optimized for read and write scaling.

  • Secure: S3-compatible object storage is often encrypted-at-rest.

  • Enterprise ColumnStore supports extent elimination, but the generic select handler does not.

  • Enterprise ColumnStore has its own query planner, but the generic select handler cannot use it.

  • Inserts each row in the order the rows are read from the source file. Users can optimize data loads for Enterprise ColumnStore's automatic partitioning by loading presorted data files. For additional information, see "Load Ordered Data in Proper Order".

  • Supports parallel distributed bulk loads

  • Imports data from text files

  • Imports data from binary files

  • Imports data from standard input (stdin)

  • MariaDB Enterprise ColumnStore

    • Columnar storage engine

    • Query execution

    • Data storage

    MariaDB Enterprise Server

    • Enterprise-grade database server

    ColumnStore Storage Engine Plugin

    • Storage engine plugin

    • Integrates MariaDB Enterprise ColumnStore into MariaDB Enterprise Server

    Cluster Management API (CMAPI)

    • REST API

    • Used for administrative tasks

    MariaDB MaxScale

    • Database proxy

    • Accepts connections

    • Routes queries

    • Performs auto-failover

    S3-compatible object storage

    • HA for data

    • Optional.

    Shared Local Storage

    • With S3: HA for Storage Manager directory

    • Without S3: HA for DB Root directories

    MariaDB Replication

    • Schema replication (ColumnStore tables)

    • Schema and data replication (non-ColumnStore tables)

    • Database object replication

    MaxScale

    • Monitoring

    • Automatic failover

    • Load balancing

    Cluster Management API (CMAPI) Server

    • REST API

    • Administration

    • Add nodes

    • Remove nodes

    S3-compatible object storage

    • S3 snapshot

    Shared Local Storage

    • File system snapshot

    • File copy

    Enterprise Server Data Directory

    • MariaDB Enterprise Backup


log_slave_updates

Set this system variable to ON.

relay_log

Set this option to the file you want to use for the Relay Logs. Setting this option enables relay logging.

relay_log_index

Set this option to the file you want to use to index Relay Log filenames.

server_id

Sets the numeric Server ID for this MariaDB Enterprise Server. The value set on this option must be unique to each node.

• To use an IAM role instead of an access key, you must uncomment iam_role_name, sts_region, and sts_endpoint.

• To use the IAM role assigned to an EC2 instance, you must uncomment ec2_iam_mode=enabled.

character_set_server

Set this system variable to utf8.

collation_server

Set this system variable to utf8_general_ci.

columnstore_use_import_for_batchinsert

Set this system variable to ALWAYS to always use cpimport for LOAD DATA INFILE and INSERT...SELECT statements.

gtid_strict_mode

Set this system variable to ON.

log_bin

Set this option to the file you want to use for the Binary Log. Setting this option enables binary logging.

log_bin_index

Set this option to the file you want to use to track binlog filenames.


Binary source files represent each column value as follows:

Datatype
Description

INT/TINYINT/SMALLINT/BIGINT

Little-endian format for the numeric data

FLOAT/DOUBLE

IEEE format native to the computer

CHAR/VARCHAR

Data padded with '\0' for the length of the field. An entry that is all '\0' is treated as NULL

DATE

Using the Date struct below

DATETIME

Using the DateTime struct below

DECIMAL

Stored using an integer representation of the DECIMAL without the decimal point. With precision/width of 2 or less, 2 bytes should be used; 3-4 should use 3 bytes; 4-9 should use 4 bytes; and 10+ should use 8 bytes

The following values represent NULL and saturated values in binary source files:

Datatype         NULL                    Saturated
BIGINT           0x8000000000000000ULL   0xFFFFFFFFFFFFFFFEULL
INT              0x80000000              0xFFFFFFFE
SMALLINT         0x8000                  0xFFFE
TINYINT          0x80                    0xFE
DECIMAL          As equiv. INT           As equiv. INT
FLOAT            0xFFAAAAAA              N/A
DOUBLE           0xFFFAAAAAAAAAAAAAULL   N/A
DATE             0xFFFFFFFE              N/A
DATETIME         0xFFFFFFFFFFFFFFFEULL   N/A
CHAR/VARCHAR     Fill with '\0'          N/A

    This procedure describes the deployment of the ColumnStore Shared Local Storage topology with MariaDB Enterprise Server 10.5, MariaDB Enterprise ColumnStore 5, and MariaDB MaxScale 2.5.

    MariaDB Enterprise ColumnStore 5 is a columnar storage engine for MariaDB Enterprise Server 10.5. Enterprise ColumnStore is suitable for Online Analytical Processing (OLAP) workloads.

    This procedure has 9 steps, which are executed in sequence.

    This procedure represents basic product capability and deploys 3 Enterprise ColumnStore nodes and 1 MaxScale node.

    This page provides an overview of the topology, requirements, and deployment procedures.

    Please read and understand this procedure before executing.

    Procedure Steps

    Step
    Description

    Prepare ColumnStore Nodes

    Configure Shared Local Storage

    Install MariaDB Enterprise Server

    Start and Configure MariaDB Enterprise Server

    Test MariaDB Enterprise Server

    Install MariaDB MaxScale

    Support

    Customers can obtain support by submitting a support case.

    Components

    The following components are deployed during this procedure:

Component
Function

MariaDB Enterprise Server

Modern SQL RDBMS with high availability, pluggable storage engines, hot online backups, and audit logging.

MariaDB MaxScale

Database proxy that extends the availability, scalability, and security of MariaDB Enterprise Servers.

MariaDB Enterprise Server Components

Component
Description

MariaDB Enterprise ColumnStore

• Columnar storage engine

• Highly available

• Optimized for Online Analytical Processing (OLAP) workloads

• Scalable query execution

• Cluster Management API (CMAPI) provides a REST API for multi-node administration.

    MariaDB MaxScale Components

    Component
    Description

    Listener

    Listens for client connections to MaxScale then passes them to the router service

    MariaDB Monitor

    Tracks changes in the state of MariaDB Enterprise Servers.

    Read Connection Router

    Routes connections from the listener to any available Enterprise ColumnStore node

    Read/Write Split Router

    Routes read operations from the listener to any available Enterprise ColumnStore node, and routes write operations from the listener to a specific server that MaxScale uses as the primary server

    Server Module

    Connection configuration in MaxScale to an Enterprise ColumnStore node

    Topology

    The MariaDB Enterprise ColumnStore topology with Object Storage delivers production analytics with high availability, fault tolerance, and limitless data storage by leveraging S3-compatible storage.

    The topology consists of:

    • One or more MaxScale nodes

    • An odd number of ColumnStore nodes (minimum of 3) running ES, Enterprise ColumnStore, and CMAPI

    The MaxScale nodes:

    • Monitor the health and availability of each ColumnStore node using the MariaDB Monitor (mariadbmon)

    • Accept client and application connections

    • Route queries to ColumnStore nodes using the Read/Write Split Router (readwritesplit)

    The ColumnStore nodes:

    • Receive queries from MaxScale

    • Execute queries

      • Use shared local storage for the Storage Manager directory

    Requirements

    These requirements are for the ColumnStore Object Storage topology when deployed with MariaDB Enterprise Server 10.5, MariaDB Enterprise ColumnStore 5, and MariaDB MaxScale 2.5.

    • Node Count

    • Operating System

    • Minimum Hardware Requirements

    • Recommended Hardware Requirements

    • Storage Requirements

    • S3-Compatible Object Storage Requirements

    • Preferred Object Storage Providers: Cloud

    • Preferred Object Storage Providers: Hardware

    • Shared Local Storage Directories

    • Shared Local Storage Options

    • Recommended Storage Options

    Node Count

    • MaxScale nodes, 1 or more are required.

  • Enterprise ColumnStore nodes, 3 or more are required for high availability. You should always have an odd number of nodes in a multi-node ColumnStore deployment to avoid split-brain scenarios.

    Operating System

    In alignment with the enterprise lifecycle, the ColumnStore Object Storage topology with MariaDB Enterprise Server 10.5, MariaDB Enterprise ColumnStore 5, and MariaDB MaxScale 2.5 is provided for:

    • CentOS Linux 7 (x86_64)

    • Debian 10 (x86_64)

    • Red Hat Enterprise Linux 7 (x86_64)

    • Red Hat Enterprise Linux 8 (x86_64)

    • Ubuntu 18.04 LTS (x86_64)

    • Ubuntu 20.04 LTS (x86_64)

    Minimum Hardware Requirements

    MariaDB Enterprise ColumnStore's minimum hardware requirements are not intended for production environments, but the minimum hardware requirements can be appropriate for development and test environments. For production environments, see the recommended hardware requirements instead.

    The minimum hardware requirements are:

    Component
    CPU
    Memory

    MaxScale node

    4+ cores

    4+ GB

    Enterprise ColumnStore node

    4+ cores

    4+ GB

    MariaDB Enterprise ColumnStore will refuse to start if the system has less than 3 GB of memory.

    If Enterprise ColumnStore is started on a system with less memory, the following error message will be written to the ColumnStore system log called crit.log:

    And the following error message will be raised to the client:

    Recommended Hardware Requirements

    MariaDB Enterprise ColumnStore's recommended hardware requirements are intended for production analytics.

    The recommended hardware requirements are:

    Component
    CPU
    Memory

    MaxScale node

    8+ cores

    16+ GB

    Enterprise ColumnStore node

    64+ cores

    128+ GB

    Storage Requirements

    The ColumnStore Object Storage topology requires the following storage types:

    Storage Type
    Description

    The ColumnStore Object Storage topology uses shared local storage for the Storage Manager directory to store metadata.

    Shared Local Storage Directories

    The ColumnStore Object Storage topology uses shared local storage for the Storage Manager directory to store metadata.

    The Storage Manager directory is located at the following path by default:

    • /var/lib/columnstore/storagemanager
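To confirm that this directory actually resides on the shared file system (rather than on a local disk), check which mount backs it:

$ df -h /var/lib/columnstore/storagemanager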

    Shared Local Storage Options

    The most common shared local storage options for the ColumnStore Object Storage topology are:

    Shared Local Storage
    Common Usage
    Description

    EBS (Elastic Block Store) Multi-Attach

    AWS

    • EBS is a high-performance block-storage service for AWS (Amazon Web Services).

    • EBS Multi-Attach allows an EBS volume to be attached to multiple instances in AWS. Only clustered file systems, such as GFS2, are supported.

    • For deployments in AWS, EBS Multi-Attach is a recommended option for the Storage Manager directory, and Amazon S3 storage is the recommended option for data.

    EFS (Elastic File System)

    AWS

    • EFS is a scalable, elastic, cloud-native NFS file system for AWS (Amazon Web Services).

    • For deployments in AWS, EFS is a recommended option for the Storage Manager directory, and Amazon S3 storage is the recommended option for data.

    Filestore

    GCP

    • Filestore is high-performance, fully managed storage for GCP (Google Cloud Platform).

    • For deployments in GCP, Filestore is the recommended option for the Storage Manager directory, and Google Object Storage (S3-compatible) is the recommended option for data.

    GlusterFS

    On-premises

    • GlusterFS is a free and open source scalable network file system.

    • For on-premises deployments, GlusterFS is an option for the Storage Manager directory.

    Enterprise ColumnStore Management with CMAPI

    Enterprise ColumnStore's CMAPI (Cluster Management API) is a REST API that can be used to manage a multi-node Enterprise ColumnStore cluster.

    Many tools are capable of interacting with REST APIs. For example, the curl utility could be used to make REST API calls from the command-line.

    Many programming languages also have libraries for interacting with REST APIs.

    The examples below show how to use the CMAPI with curl.

    URL Endpoint Format for REST API

    For example:

    • shutdown

    • start

    • status

    Required Request Headers

    • 'x-api-key': '93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd'

    • 'Content-Type': 'application/json'

    x-api-key can be set to any value of your choice during the first call to the server. Subsequent connections will require this same key.

    Get Status

    Start Cluster

    Stop Cluster

    Add Node

    Remove Node

    Quick Reference

    MariaDB Enterprise Server Configuration Management

    Method
    Description

    Configuration File

    Configuration files (such as /etc/my.cnf) can be used to set system-variables and options. The server must be restarted to apply changes made to configuration files.

    Command-line

    The server can be started with command-line options that set system-variables and options.

    SQL

    Users can set system-variables that support dynamic changes on-the-fly using the SET statement.

    MariaDB Enterprise Server packages are configured to read configuration files from different paths, depending on the operating system. Making custom changes to Enterprise Server default configuration files is not recommended because custom changes may be overwritten by other default configuration files that are loaded later.

    To ensure that your custom changes will be read last, create a custom configuration file with the z- prefix in one of the include directories.

    Distribution
    Example Configuration File Path
    • CentOS

    • Red Hat Enterprise Linux (RHEL)

    /etc/my.cnf.d/z-custom-mariadb.cnf

    • Debian

    • Ubuntu

    /etc/mysql/mariadb.conf.d/z-custom-mariadb.cnf
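For example, a minimal custom configuration file on a CentOS or RHEL node could be created like this (the values shown are the ones used elsewhere in this procedure; adjust server_id per node):

$ sudo tee /etc/my.cnf.d/z-custom-mariadb.cnf <<'EOF'
[mariadb]
log_error = mariadbd.err

# This must be unique on each Enterprise ColumnStore node
server_id = 1
EOF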

    MariaDB Enterprise Server Service Management

    The systemctl command is used to start and stop the MariaDB Enterprise Server service.

    Operation
    Command

    Start

    sudo systemctl start mariadb

    Stop

    sudo systemctl stop mariadb

    Restart

    sudo systemctl restart mariadb

    Enable during startup

    sudo systemctl enable mariadb

    Disable during startup

    sudo systemctl disable mariadb

    Status

    sudo systemctl status mariadb

    For additional information, see "Start and Stop Services".

    MariaDB Enterprise Server Logs

    MariaDB Enterprise Server produces log data that can be helpful in problem diagnosis.

    Log filenames and locations may be overridden in the server configuration. The default location of logs is the data directory. The data directory is specified by the datadir system variable.

    Log
    System Variable/Option
    Default Filename

    Error Log

    log_error

    <hostname>.err

    Audit Log

    server_audit_file_path

    server_audit.log

    Slow Query Log

    slow_query_log_file

    <hostname>-slow.log
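To see where the running server is actually writing its logs, you can query the variables directly; a quick check with the mariadb client:

$ mariadb -e "SELECT @@datadir, @@log_error, @@slow_query_log_file\G"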

    Enterprise ColumnStore Service Management

    The systemctl command is used to start and stop the ColumnStore service.

    Operation
    Command

    Start

    sudo systemctl start mariadb-columnstore

    Stop

    sudo systemctl stop mariadb-columnstore

    Restart

    sudo systemctl restart mariadb-columnstore

    Enable during startup

    sudo systemctl enable mariadb-columnstore

    Disable during startup

    sudo systemctl disable mariadb-columnstore

    Status

    sudo systemctl status mariadb-columnstore

    In the ColumnStore Object Storage topology, the mariadb-columnstore service should not be enabled. The CMAPI service restarts Enterprise ColumnStore as needed, so it does not need to start automatically upon reboot.
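You can verify that the service will not start automatically at boot:

$ sudo systemctl is-enabled mariadb-columnstore
disabled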

    Enterprise ColumnStore CMAPI Service Management

    The systemctl command is used to start and stop the CMAPI service.

    Operation
    Command

    Start

    sudo systemctl start mariadb-columnstore-cmapi

    Stop

    sudo systemctl stop mariadb-columnstore-cmapi

    Restart

    sudo systemctl restart mariadb-columnstore-cmapi

    Enable during startup

    sudo systemctl enable mariadb-columnstore-cmapi

    Disable during startup

    sudo systemctl disable mariadb-columnstore-cmapi

    Status

    sudo systemctl status mariadb-columnstore-cmapi

    For additional information on endpoints, see "CMAPI".

    MaxScale Configuration Management

    MaxScale can be configured using several methods. These methods make use of MaxScale's REST API.

    Method
    Benefits

    Command-line utility to perform administrative tasks through the REST API. See MaxCtrl Commands.

    MaxGUI is a graphical utility that can perform administrative tasks through the REST API.

    The REST API can be used directly. For example, the curl utility could be used to make REST API calls from the command-line. Many programming languages also have libraries to interact with REST APIs.

    The procedure on these pages configures MaxScale using MaxCtrl.

    MaxScale Service Management

    The systemctl command is used to start and stop the MaxScale service.

    Operation
    Command

    Start

    sudo systemctl start maxscale

    Stop

    sudo systemctl stop maxscale

    Restart

    sudo systemctl restart maxscale

    Enable during startup

    sudo systemctl enable maxscale

    Disable during startup

    sudo systemctl disable maxscale

    Status

    sudo systemctl status maxscale

    For additional information, see "Start and Stop Services".

    Next Step

    Navigation in the Shared Local Storage topology deployment procedure:

    Next: Step 1: Prepare ColumnStore Nodes.

    • Enterprise Server 10.5

    • Enterprise Server 10.6

    • Enterprise Server 11.4

    Columnar storage engine with S3-compatible object storage

    • Highly available

    • Automatic failover via MaxScale and CMAPI

    • Scales read via MaxScale

    • Bulk data import

    log_slave_updates

    Set this system variable to ON.

    relay_log

    Set this option to the file you want to use for the Relay Logs. Setting this option enables relay logging.

    relay_log_index

    Set this option to the file you want to use to index Relay Log filenames.

    server_id

    Sets the numeric Server ID for this MariaDB Enterprise Server. The value set on this option must be unique to each node.

    maxctrl show maxscale
    maxctrl list servers
    maxctrl show server
    maxctrl list monitors
    maxctrl show monitor
    maxctrl list services
    maxctrl show service
    Read Connection Router (readconnroute)
    Read/Write Split Router (readwritesplit)
    maxctrl list listeners
    $ sudo systemctl stop mariadb
    $ sudo systemctl stop mariadb-columnstore
    $ sudo systemctl stop mariadb-columnstore-cmapi
    [mariadb]
    bind_address                           = 0.0.0.0
    log_error                              = mariadbd.err
    character_set_server                   = utf8
    collation_server                       = utf8_general_ci
    log_bin                                = mariadb-bin
    log_bin_index                          = mariadb-bin.index
    relay_log                              = mariadb-relay
    relay_log_index                        = mariadb-relay.index
    log_slave_updates                      = ON
    gtid_strict_mode                       = ON
    
    # This must be unique on each Enterprise ColumnStore node
    server_id                              = 1
    [ObjectStorage]
    …
    service = S3
    …
    [S3]
    bucket                = your_columnstore_bucket_name
    endpoint              = your_s3_endpoint
    aws_access_key_id     = your_s3_access_key_id
    aws_secret_access_key = your_s3_secret_key
    # iam_role_name       = your_iam_role
    # sts_region          = your_sts_region
    # sts_endpoint        = your_sts_endpoint
    # ec2_iam_mode        = enabled
    
    [Cache]
    cache_size = your_local_cache_size
    path       = your_local_cache_path
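After editing storagemanager.cnf, you can verify that the bucket, endpoint, and credentials work before starting the cluster; recent Enterprise ColumnStore packages include the testS3Connection utility for this purpose (assuming default packaging):

$ sudo testS3Connection

The utility should report that the S3 connection and permissions are OK; any error here should be resolved before starting ColumnStore.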
    $ sudo systemctl start mariadb
    $ sudo systemctl enable mariadb
    $ sudo systemctl stop mariadb-columnstore
    $ sudo systemctl start mariadb-columnstore-cmapi
    $ sudo systemctl enable mariadb-columnstore-cmapi
    CREATE USER 'util_user'@'127.0.0.1'
    IDENTIFIED BY 'util_user_passwd';
    GRANT SELECT, PROCESS ON *.*
    TO 'util_user'@'127.0.0.1';
    $ sudo mcsSetConfig CrossEngineSupport Host 127.0.0.1
    $ sudo mcsSetConfig CrossEngineSupport Port 3306
    $ sudo mcsSetConfig CrossEngineSupport User util_user
    $ sudo mcsSetConfig CrossEngineSupport Password util_user_passwd
    CREATE USER 'repl'@'192.0.2.%' IDENTIFIED BY 'repl_passwd';
    GRANT REPLICA MONITOR,
       REPLICATION REPLICA,
       REPLICATION REPLICA ADMIN,
       REPLICATION MASTER ADMIN
    ON *.* TO 'repl'@'192.0.2.%';
    CREATE USER 'mxs'@'192.0.2.%'
    IDENTIFIED BY 'mxs_passwd';
    GRANT SHOW DATABASES ON *.* TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.columns_priv TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.db TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.procs_priv TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.proxies_priv TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.roles_mapping TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.tables_priv TO 'mxs'@'192.0.2.%';
    
    GRANT SELECT ON mysql.user TO 'mxs'@'192.0.2.%';
    GRANT BINLOG ADMIN,
       READ_ONLY ADMIN,
       RELOAD,
       REPLICA MONITOR,
       REPLICATION MASTER ADMIN,
       REPLICATION REPLICA ADMIN,
       REPLICATION REPLICA,
       SHOW DATABASES,
       SELECT
    ON *.* TO 'mxs'@'192.0.2.%';
    CHANGE MASTER TO
       MASTER_HOST='192.0.2.1',
       MASTER_USER='repl',
       MASTER_PASSWORD='repl_passwd',
       MASTER_USE_GTID=slave_pos;
    START REPLICA;
    SHOW REPLICA STATUS;
    SET GLOBAL read_only=ON;
    $ openssl rand -hex 32
    
    93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd
    $ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/node \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       --data '{"timeout":120, "node": "192.0.2.1"}' \
       | jq .
    {
      "timestamp": "2020-10-28 00:39:14.672142",
      "node_id": "192.0.2.1"
    }
    $ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       | jq .
    {
      "timestamp": "2020-12-15 00:40:34.353574",
      "192.0.2.1": {
        "timestamp": "2020-12-15 00:40:34.362374",
        "uptime": 11467,
        "dbrm_mode": "master",
        "cluster_mode": "readwrite",
        "dbroots": [
          "1"
        ],
        "module_id": 1,
        "services": [
          {
            "name": "workernode",
            "pid": 19202
          },
          {
            "name": "controllernode",
            "pid": 19232
          },
          {
            "name": "PrimProc",
            "pid": 19254
          },
          {
            "name": "ExeMgr",
            "pid": 19292
          },
          {
            "name": "WriteEngine",
            "pid": 19316
          },
          {
            "name": "DMLProc",
            "pid": 19332
          },
          {
            "name": "DDLProc",
            "pid": 19366
          }
        ]
  }
}
    $ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/node \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       --data '{"timeout":120, "node": "192.0.2.2"}' \
       | jq .
    {
      "timestamp": "2020-10-28 00:42:42.796050",
      "node_id": "192.0.2.2"
    }
    $ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
       --header 'Content-Type:application/json' \
       --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
       | jq .
    {
      "timestamp": "2020-12-15 00:40:34.353574",
      "192.0.2.1": {
        "timestamp": "2020-12-15 00:40:34.362374",
        "uptime": 11467,
        "dbrm_mode": "master",
        "cluster_mode": "readwrite",
        "dbroots": [
          "1"
        ],
        "module_id": 1,
        "services": [
          {
            "name": "workernode",
            "pid": 19202
          },
          {
            "name": "controllernode",
            "pid": 19232
          },
          {
            "name": "PrimProc",
            "pid": 19254
          },
          {
            "name": "ExeMgr",
            "pid": 19292
          },
          {
            "name": "WriteEngine",
            "pid": 19316
          },
          {
            "name": "DMLProc",
            "pid": 19332
          },
          {
            "name": "DDLProc",
            "pid": 19366
          }
        ]
      },
      "192.0.2.2": {
        "timestamp": "2020-12-15 00:40:34.428554",
        "uptime": 11437,
        "dbrm_mode": "slave",
        "cluster_mode": "readonly",
        "dbroots": [
          "2"
        ],
        "module_id": 2,
        "services": [
          {
            "name": "workernode",
            "pid": 17789
          },
          {
            "name": "PrimProc",
            "pid": 17813
          },
          {
            "name": "ExeMgr",
            "pid": 17854
          },
          {
            "name": "WriteEngine",
            "pid": 17877
          }
        ]
      },
      "192.0.2.3": {
        "timestamp": "2020-12-15 00:40:34.428554",
        "uptime": 11437,
        "dbrm_mode": "slave",
        "cluster_mode": "readonly",
        "dbroots": [
          "2"
        ],
        "module_id": 2,
        "services": [
          {
            "name": "workernode",
            "pid": 17789
          },
          {
            "name": "PrimProc",
            "pid": 17813
          },
          {
            "name": "ExeMgr",
            "pid": 17854
          },
          {
            "name": "WriteEngine",
            "pid": 17877
          }
        ]
      },
      "num_nodes": 3
    }
    $ sudo yum install policycoreutils policycoreutils-python
    $ sudo yum install policycoreutils python3-policycoreutils policycoreutils-python-utils
    $ sudo grep mysqld /var/log/audit/audit.log | audit2allow -M mariadb_local
    $ sudo grep mysqld /var/log/audit/audit.log | audit2allow -M mariadb_local
    
    Nothing to do
    $ sudo semodule -i mariadb_local.pp
    $ sudo setenforce enforcing
    # This file controls the state of SELinux on the system.
    # SELINUX= can take one of these three values:
    #     enforcing - SELinux security policy is enforced.
    #     permissive - SELinux prints warnings instead of enforcing.
    #     disabled - No SELinux policy is loaded.
    SELINUX=enforcing
    # SELINUXTYPE= can take one of three values:
    #     targeted - Targeted processes are protected,
    #     minimum - Modification of targeted policy. Only selected processes are protected.
    #     mls - Multi Level Security protection.
    SELINUXTYPE=targeted
    $ sudo getenforce
    Enforcing
    $ sudo systemctl status firewalld
    $ sudo systemctl start firewalld
    $ sudo firewall-cmd --permanent --add-rich-rule='
       rule family="ipv4"
       source address="192.0.2.0/24"
       destination address="192.0.2.0/24"
       port port="3306" protocol="tcp"
       accept'
    $ sudo firewall-cmd --permanent --add-rich-rule='
       rule family="ipv4"
       source address="192.0.2.0/24"
       destination address="192.0.2.0/24"
       port port="8600-8630" protocol="tcp"
       accept'
    $ sudo firewall-cmd --permanent --add-rich-rule='
       rule family="ipv4"
       source address="192.0.2.0/24"
       destination address="192.0.2.0/24"
       port port="8640" protocol="tcp"
       accept'
    $ sudo firewall-cmd --permanent --add-rich-rule='
       rule family="ipv4"
       source address="192.0.2.0/24"
       destination address="192.0.2.0/24"
       port port="8700" protocol="tcp"
       accept'
    $ sudo firewall-cmd --permanent --add-rich-rule='
       rule family="ipv4"
       source address="192.0.2.0/24"
       destination address="192.0.2.0/24"
       port port="8800" protocol="tcp"
       accept'
    $ sudo firewall-cmd --reload
    $ sudo ufw status verbose
    $ sudo ufw enable
    $ sudo ufw allow from 192.0.2.0/24 to 192.0.2.3 port 3306 proto tcp
    
    $ sudo ufw allow from 192.0.2.0/24 to 192.0.2.3 port 8600:8630 proto tcp
    
    $ sudo ufw allow from 192.0.2.0/24 to 192.0.2.3 port 8640 proto tcp
    
    $ sudo ufw allow from 192.0.2.0/24 to 192.0.2.3 port 8700 proto tcp
    
    $ sudo ufw allow from 192.0.2.0/24 to 192.0.2.3 port 8800 proto tcp
    $ sudo ufw reload
    cpimport dbName tblName [loadFile]
    cpimport dbName tblName [loadFile]
    [-h] [-m mode] [-f filepath] [-d DebugLevel]
    [-c readBufferSize] [-b numBuffers] [-r numReaders]
    [-e maxErrors] [-B libBufferSize] [-s colDelimiter] [-E EnclosedByChar]
    [-C escChar] [-j jobID] [-p jobFilePath] [-w numParsers]
    [-n nullOption] [-P pmList] [-i] [-S] [-q batchQty]
    
    positional parameters:
    	dbName     Name of the database to load
    	tblName    Name of table to load
    	loadFile   Optional input file name in current directory,
    			unless a fully qualified name is given.
    			If not given, input read from STDIN.
    Options:
    	-b	Number of read buffers
    	-c	Application read buffer size(in bytes)
    	-d	Print different level(1-3) debug message
    	-e	Max number of allowable error per table per PM
    	-f	Data file directory path.
    			Default is current working directory.
    			In Mode 1, -f represents the local input file path.
    			In Mode 2, -f represents the PM based input file path.
    			In Mode 3, -f represents the local input file path.
    	-l	Name of import file to be loaded, relative to -f path. (Cannot be used with -p)
    	-h	Print this message.
    	-q	Batch Quantity, Number of rows distributed per batch in Mode 1
    	-i	Print extended info to console in Mode 3.
    	-j	Job ID. In simple usage, default is the table OID.
    			unless a fully qualified input file name is given.
    	-n	NullOption (0-treat the string NULL as data (default);
    			1-treat the string NULL as a NULL value)
    	-p	Path for XML job description file.
    	-r	Number of readers.
    	-s	The delimiter between column values.
    	-B	I/O library read buffer size (in bytes)
    	-w	Number of parsers.
    	-E	Enclosed by character if field values are enclosed.
    	-C	Escape character used in conjunction with 'enclosed by'
    			character, or as part of NULL escape sequence ('\N');
    			default is '\'
    	-I	Import binary data; how to treat NULL values:
    			1 - import NULL values
    			2 - saturate NULL values
    	-P	List of PMs ex: -P 1,2,3. Default is all PMs.
    	-S	Treat string truncations as errors.
    	-m	mode
    			1 - rows will be loaded in a distributed manner across PMs.
    			2 - PM based input files loaded onto their respective PM.
    			3 - input files will be loaded on the local PM.
    cpimport -m1 mytest mytable mytable.tbl
    cpimport -m2 mytest mytable -l /home/mydata/mytable.tbl
    cpimport -m3 mytest mytable /home/mydata/mytable.tbl
    cpimport db1 table1
    aws s3 cp --quiet s3://dthompson-test/trades_bulk.csv - | cpimport test trades -s ","
    mariadb -q -e 'select * from source_table;' -N <source-db> | cpimport -s '\t' <target-db> <target-table>
    CREATE DATABASE `json_columnstore`;
    
    USE `json_columnstore`;
    
    CREATE TABLE `products` (
      `product_name` VARCHAR(11) NOT NULL DEFAULT '',
      `supplier` VARCHAR(128) NOT NULL DEFAULT '',
      `quantity` VARCHAR(128) NOT NULL DEFAULT '',
      `unit_cost` VARCHAR(128) NOT NULL DEFAULT ''
    ) ENGINE=Columnstore DEFAULT CHARSET=utf8;
    [{
      "_id": {
        "$oid": "5968dd23fc13ae04d9000001"
      },
      "product_name": "Sildenafil Citrate",
      "supplier": "Wisozk Inc",
      "quantity": 261,
      "unit_cost": "$10.47"
    }, {
      "_id": {
        "$oid": "5968dd23fc13ae04d9000002"
      },
      "product_name": "Mountain Juniperus Ashei",
      "supplier": "Keebler-Hilpert",
      "quantity": 292,
      "unit_cost": "$8.74"
    }, {
      "_id": {
        "$oid": "5968dd23fc13ae04d9000003"
      },
      "product_name": "Dextromethorphan HBR",
      "supplier": "Schmitt-Weissnat",
      "quantity": 211,
      "unit_cost": "$20.53"
    }]
    cat products.json | jq -r '.[] | [.product_name,.supplier,.quantity,.unit_cost] | @csv' | cpimport json_columnstore products -s ',' -E '"'
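Once cpimport reports the rows inserted, a quick row count confirms that all three JSON documents landed in the table:

$ mariadb json_columnstore -e "SELECT COUNT(*) FROM products;"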
    colxml mytest -j299
    cpimport -m1 -j299
    Usage: colxml [options] dbName
    
    Options: 
       -d Delimiter (default '|')
       -e Maximum allowable errors (per table)
       -h Print this message
       -j Job id (numeric)
       -l Load file name
       -n "name in quotes"
       -p Path for XML job description file that is generated
       -s "Description in quotes"
       -t Table name
       -u User
       -r Number of read buffers
       -c Application read buffer size (in bytes)
       -w I/O library buffer size (in bytes), used to read files
       -x Extension of file name (default ".tbl")
       -E EnclosedByChar (if data has enclosed values)
       -C EscapeChar
       -b Debug level (1-3)
    MariaDB [tpch2]> show tables;
    +-----------------+
    | Tables_in_tpch2 |
    +-----------------+
    | customer        |
    | lineitem        |
    | nation          |
    | orders          |
    | part            |
    | partsupp        |
    | region          |
    | supplier        |
    +-----------------+
    8 rows in set (0.00 sec)
    /usr/local/mariadb/columnstore/bin/colxml tpch2 -j500
    Running colxml with the following parameters:
    2015-10-07 15:14:20 (9481) INFO :
    Schema: tpch2
    Tables:
    Load Files:
    -b 0
    -c 1048576
    -d |
    -e 10
    -j 500
    -n
    -p /usr/local/mariadb/columnstore/data/bulk/job/
    -r 5
    -s
    -u
    -w 10485760
    -x tbl
    File completed for tables:
    tpch2.customer
    tpch2.lineitem
    tpch2.nation
    tpch2.orders
    tpch2.part
    tpch2.partsupp
    tpch2.region
    tpch2.supplier
    Normal exit.
    /usr/local/mariadb/columnstore/bin/cpimport -j 500
    Bulkload root directory : /usr/local/mariadb/columnstore/data/bulk
    job description file : Job_500.xml
    2015-10-07 15:14:59 (9952) INFO : successfully load job file /usr/local/mariadb/columnstore/data/bulk/job/Job_500.xml
    2015-10-07 15:14:59 (9952) INFO : PreProcessing check starts
    2015-10-07 15:15:04 (9952) INFO : PreProcessing check completed
    2015-10-07 15:15:04 (9952) INFO : preProcess completed, total run time : 5 seconds
    2015-10-07 15:15:04 (9952) INFO : No of Read Threads Spawned = 1
    2015-10-07 15:15:04 (9952) INFO : No of Parse Threads Spawned = 3
    2015-10-07 15:15:06 (9952) INFO : For table tpch2.customer: 150000 rows processed and 150000 rows inserted.
    2015-10-07 15:16:12 (9952) INFO : For table tpch2.nation: 25 rows processed and 25 rows inserted.
    2015-10-07 15:16:12 (9952) INFO : For table tpch2.lineitem: 6001215 rows processed and 6001215 rows inserted.
    2015-10-07 15:16:31 (9952) INFO : For table tpch2.orders: 1500000 rows processed and 1500000 rows inserted.
    2015-10-07 15:16:33 (9952) INFO : For table tpch2.part: 200000 rows processed and 200000 rows inserted.
    2015-10-07 15:16:44 (9952) INFO : For table tpch2.partsupp: 800000 rows processed and 800000 rows inserted.
    2015-10-07 15:16:44 (9952) INFO : For table tpch2.region: 5 rows processed and 5 rows inserted.
    2015-10-07 15:16:45 (9952) INFO : For table tpch2.supplier: 10000 rows processed and 10000 rows inserted.
    CREATE TABLE emp (
      emp_id INT,
      dept_id INT,
      name VARCHAR(30),
      salary INT,
      hire_date DATE
    ) ENGINE=columnstore;
    <Table tblName="test.emp" 
          loadName="emp.tbl" maxErrRow="10">
       <Column colName="emp_id"/>
       <Column colName="dept_id"/>
       <Column colName="name"/>
       <Column colName="salary"/>
       <Column colName="hire_date"/>
     </Table>
    <Table tblName="test.emp" 
          loadName="emp.tbl" maxErrRow="10">
       <Column colName="emp_id"/>
       <Column colName="dept_id"/>
       <Column colName="name"/>
       <Column colName="hire_date"/>
       <Column colName="salary"/>
     </Table>
    <Table tblName="test.emp"        
               loadName="emp.tbl" maxErrRow="10">
          <Column colName="emp_id"/>
          <Column colName="dept_id"/>
          <Column colName="name"/>
          <Column colName="hire_date"/>
          <IgnoreField/>
          <DefaultColumn colName="salary"/>
        </Table>
    Example
    cpimport -I1 mytest mytable /home/mydata/mytable.bin
    struct Date
    {
      unsigned spare : 6;
      unsigned day : 6;
      unsigned month : 4;
      unsigned year : 16;
    };
    struct DateTime
    {
      unsigned msecond : 20;
      unsigned second : 6;
      unsigned minute : 6;
      unsigned hour : 6;
      unsigned day : 6;
      unsigned month : 4;
      unsigned year : 16;
    };
    -rw-r--r--. 1 root  root        0 Dec 29 06:41 cpimport_1229064143_21779.err
    -rw-r--r--. 1 root  root     1146 Dec 29 06:42 cpimport_1229064143_21779.log
    2020-12-29 06:41:44 (21779) INFO : Running distributed import (mode 1) on all PMs...
    2020-12-29 06:41:44 (21779) INFO2 : /usr/bin/cpimport.bin -s , -E " -R /tmp/columnstore_tmp_files/BrmRpt112906414421779.rpt -m 1 -P pm1-21779 -T SYSTEM -u388952c1-4ab8-46d6-9857-c44827b1c3b9 bts flights
    2020-12-29 06:41:58 (21779) INFO2 : Received a BRM-Report from 1
    2020-12-29 06:41:58 (21779) INFO2 : Received a Cpimport Pass from PM1
    2020-12-29 06:42:03 (21779) INFO2 : Received a BRM-Report from 2
    2020-12-29 06:42:03 (21779) INFO2 : Received a Cpimport Pass from PM2
    2020-12-29 06:42:03 (21779) INFO2 : Received a BRM-Report from 3
    2020-12-29 06:42:03 (21779) INFO2 : BRM updated successfully
    2020-12-29 06:42:03 (21779) INFO2 : Received a Cpimport Pass from PM3
    2020-12-29 06:42:04 (21779) INFO2 : Released Table Lock
    2020-12-29 06:42:04 (21779) INFO2 : Cleanup succeed on all PMs
    2020-12-29 06:42:04 (21779) INFO : For table bts.flights: 374573 rows processed and 374573 rows inserted.
    2020-12-29 06:42:04 (21779) INFO : Bulk load completed, total run time : 20.3052 seconds
    2020-12-29 06:42:04 (21779) INFO2 : Shutdown of all child threads Finished!!
    Apr 30 21:54:35 a1ebc96a2519 PrimProc[1004]: 35.668435 |0|0|0| C 28 CAL0000: Error total memory available is less than 3GB.
    ERROR 1815 (HY000): Internal error: System is not ready yet. Please try again.
    https://{server}:{port}/cmapi/{version}/{route}/{command}
    $ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
          --header 'Content-Type:application/json' \
          --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
          | jq .
    $ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/start \
          --header 'Content-Type:application/json' \
          --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
          --data '{"timeout":20}' \
          | jq .
    $ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/shutdown \
          --header 'Content-Type:application/json' \
          --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
          --data '{"timeout":20}' \
          | jq .
    $ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/node \
          --header 'Content-Type:application/json' \
          --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
          --data '{"timeout":20, "node": "192.0.2.2"}' \
          | jq .
    $ curl -k -s -X DELETE https://mcs1:8640/cmapi/0.4.0/cluster/node \
          --header 'Content-Type:application/json' \
          --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
          --data '{"timeout":20, "node": "192.0.2.2"}' \
          | jq .
    $ maxctrl show maxscale
    ┌──────────────┬───────────────────────────────────────────────────────┐
    │ Version      │ 22.08.15                                              │
    ├──────────────┼───────────────────────────────────────────────────────┤
    │ Commit       │ 3761fa7a52046bc58faad8b5a139116f9e33364c              │
    ├──────────────┼───────────────────────────────────────────────────────┤
    │ Started At   │ Thu, 05 Aug 2021 20:21:20 GMT                         │
    ├──────────────┼───────────────────────────────────────────────────────┤
    │ Activated At │ Thu, 05 Aug 2021 20:21:20 GMT                         │
    ├──────────────┼───────────────────────────────────────────────────────┤
    │ Uptime       │ 868                                                   │
    ├──────────────┼───────────────────────────────────────────────────────┤
    │ Config Sync  │ null                                                  │
    ├──────────────┼───────────────────────────────────────────────────────┤
    │ Parameters   │ {                                                     │
    │              │     "admin_auth": true,                               │
    │              │     "admin_enabled": true,                            │
    │              │     "admin_gui": true,                                │
    │              │     "admin_host": "0.0.0.0",                          │
    │              │     "admin_log_auth_failures": true,                  │
    │              │     "admin_pam_readonly_service": null,               │
    │              │     "admin_pam_readwrite_service": null,              │
    │              │     "admin_port": 8989,                               │
    │              │     "admin_secure_gui": false,                        │
    │              │     "admin_ssl_ca_cert": null,                        │
    │              │     "admin_ssl_cert": null,                           │
    │              │     "admin_ssl_key": null,                            │
    │              │     "admin_ssl_version": "MAX",                       │
    │              │     "auth_connect_timeout": "10000ms",                │
    │              │     "auth_read_timeout": "10000ms",                   │
    │              │     "auth_write_timeout": "10000ms",                  │
    │              │     "cachedir": "/var/cache/maxscale",                │
    │              │     "config_sync_cluster": null,                      │
    │              │     "config_sync_interval": "5000ms",                 │
    │              │     "config_sync_password": "*****",                  │
    │              │     "config_sync_timeout": "10000ms",                 │
    │              │     "config_sync_user": null,                         │
    │              │     "connector_plugindir": "/usr/lib64/mysql/plugin", │
    │              │     "datadir": "/var/lib/maxscale",                   │
    │              │     "debug": null,                                    │
    │              │     "dump_last_statements": "never",                  │
    │              │     "execdir": "/usr/bin",                            │
    │              │     "language": "/var/lib/maxscale",                  │
    │              │     "libdir": "/usr/lib64/maxscale",                  │
    │              │     "load_persisted_configs": true,                   │
    │              │     "local_address": null,                            │
    │              │     "log_debug": false,                               │
    │              │     "log_info": false,                                │
    │              │     "log_notice": true,                               │
    │              │     "log_throttling": {                               │
    │              │         "count": 10,                                  │
    │              │         "suppress": 10000,                            │
    │              │         "window": 1000                                │
    │              │     },                                                │
    │              │     "log_warn_super_user": false,                     │
    │              │     "log_warning": true,                              │
    │              │     "logdir": "/var/log/maxscale",                    │
    │              │     "max_auth_errors_until_block": 10,                │
    │              │     "maxlog": true,                                   │
    │              │     "module_configdir": "/etc/maxscale.modules.d",    │
    │              │     "ms_timestamp": false,                            │
    │              │     "passive": false,                                 │
    │              │     "persistdir": "/var/lib/maxscale/maxscale.cnf.d", │
    │              │     "piddir": "/var/run/maxscale",                    │
    │              │     "query_classifier": "qc_sqlite",                  │
    │              │     "query_classifier_args": null,                    │
    │              │     "query_classifier_cache_size": 289073971,         │
    │              │     "query_retries": 1,                               │
    │              │     "query_retry_timeout": "5000ms",                  │
    │              │     "rebalance_period": "0ms",                        │
    │              │     "rebalance_threshold": 20,                        │
    │              │     "rebalance_window": 10,                           │
    │              │     "retain_last_statements": 0,                      │
    │              │     "session_trace": 0,                               │
    │              │     "skip_permission_checks": false,                  │
    │              │     "sql_mode": "default",                            │
    │              │     "syslog": true,                                   │
    │              │     "threads": 1,                                     │
    │              │     "users_refresh_interval": "0ms",                  │
    │              │     "users_refresh_time": "30000ms",                  │
    │              │     "writeq_high_water": 16777216,                    │
    │              │     "writeq_low_water": 8192                          │
    │              │ }                                                     │
    └──────────────┴───────────────────────────────────────────────────────┘
    $ maxctrl list servers
    ┌────────┬────────────────┬──────┬─────────────┬─────────────────┬────────┐
    │ Server │ Address        │ Port │ Connections │ State           │ GTID   │
    ├────────┼────────────────┼──────┼─────────────┼─────────────────┼────────┤
    │ mcs1   │ 192.0.2.1      │ 3306 │ 1           │ Master, Running │ 0-1-25 │
    ├────────┼────────────────┼──────┼─────────────┼─────────────────┼────────┤
    │ mcs2   │ 192.0.2.2      │ 3306 │ 1           │ Slave, Running  │ 0-1-25 │
    ├────────┼────────────────┼──────┼─────────────┼─────────────────┼────────┤
    │ mcs3   │ 192.0.2.3      │ 3306 │ 1           │ Slave, Running  │ 0-1-25 │
    └────────┴────────────────┴──────┴─────────────┴─────────────────┴────────┘
    $ maxctrl show server mcs1
    ┌─────────────────────┬───────────────────────────────────────────┐
    │ Server              │ mcs1                                      │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Address             │ 192.0.2.1                                 │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Port                │ 3306                                      │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ State               │ Master, Running                           │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Version             │ 11.4.5-3-MariaDB-enterprise-log           │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Last Event          │ master_up                                 │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Triggered At        │ Thu, 05 Aug 2021 20:22:26 GMT             │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Services            │ connection_router_service                 │
    │                     │ query_router_service                      │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Monitors            │ columnstore_monitor                       │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Master ID           │ -1                                        │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Node ID             │ 1                                         │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Slave Server IDs    │                                           │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Current Connections │ 1                                         │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Total Connections   │ 1                                         │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Max Connections     │ 1                                         │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Statistics          │ {                                         │
    │                     │     "active_operations": 0,               │
    │                     │     "adaptive_avg_select_time": "0ns",    │
    │                     │     "connection_pool_empty": 0,           │
    │                     │     "connections": 1,                     │
    │                     │     "max_connections": 1,                 │
    │                     │     "max_pool_size": 0,                   │
    │                     │     "persistent_connections": 0,          │
    │                     │     "reused_connections": 0,              │
    │                     │     "routed_packets": 0,                  │
    │                     │     "total_connections": 1                │
    │                     │ }                                         │
    ├─────────────────────┼───────────────────────────────────────────┤
    │ Parameters          │ {                                         │
    │                     │     "address": "192.0.2.1",               │
    │                     │     "disk_space_threshold": null,         │
    │                     │     "extra_port": 0,                      │
    │                     │     "monitorpw": null,                    │
    │                     │     "monitoruser": null,                  │
    │                     │     "persistmaxtime": "0ms",              │
    │                     │     "persistpoolmax": 0,                  │
    │                     │     "port": 3306,                         │
    │                     │     "priority": 0,                        │
    │                     │     "proxy_protocol": false,              │
    │                     │     "rank": "primary",                    │
    │                     │     "socket": null,                       │
    │                     │     "ssl": false,                         │
    │                     │     "ssl_ca_cert": null,                  │
    │                     │     "ssl_cert": null,                     │
    │                     │     "ssl_cert_verify_depth": 9,           │
    │                     │     "ssl_cipher": null,                   │
    │                     │     "ssl_key": null,                      │
    │                     │     "ssl_verify_peer_certificate": false, │
    │                     │     "ssl_verify_peer_host": false,        │
    │                     │     "ssl_version": "MAX"                  │
    │                     │ }                                         │
    └─────────────────────┴───────────────────────────────────────────┘
    $ maxctrl list monitors
    ┌─────────────────────┬─────────┬──────────────────┐
    │ Monitor             │ State   │ Servers          │
    ├─────────────────────┼─────────┼──────────────────┤
    │ columnstore_monitor │ Running │ mcs1, mcs2, mcs3 │
    └─────────────────────┴─────────┴──────────────────┘
    $ maxctrl show monitor columnstore_monitor
    ┌─────────────────────┬─────────────────────────────────────┐
    │ Monitor             │ columnstore_monitor                 │
    ├─────────────────────┼─────────────────────────────────────┤
    │ Module              │ mariadbmon                          │
    ├─────────────────────┼─────────────────────────────────────┤
    │ State               │ Running                             │
    ├─────────────────────┼─────────────────────────────────────┤
    │ Servers             │ mcs1                                │
    │                     │ mcs2                                │
    │                     │ mcs3                                │
    ├─────────────────────┼─────────────────────────────────────┤
    │ Parameters          │ {                                   │
    │                     │     "backend_connect_attempts": 1,  │
    │                     │     "backend_connect_timeout": 3,   │
    │                     │     "backend_read_timeout": 3,      │
    │                     │     "backend_write_timeout": 3,     │
    │                     │     "disk_space_check_interval": 0, │
    │                     │     "disk_space_threshold": null,   │
    │                     │     "events": "all",                │
    │                     │     "journal_max_age": 28800,       │
    │                     │     "module": "mariadbmon",         │
    │                     │     "monitor_interval": 2000,       │
    │                     │     "password": "*****",            │
    │                     │     "script": null,                 │
    │                     │     "script_timeout": 90,           │
    │                     │     "user": "mxs"                   │
    │                     │ }                                   │
    ├─────────────────────┼─────────────────────────────────────┤
    │ Monitor Diagnostics │ {}                                  │
    └─────────────────────┴─────────────────────────────────────┘
    $ maxctrl list services
    ┌───────────────────────────┬────────────────┬─────────────┬───────────────────┬──────────────────┐
    │ Service                   │ Router         │ Connections │ Total Connections │ Servers          │
    ├───────────────────────────┼────────────────┼─────────────┼───────────────────┼──────────────────┤
    │ connection_router_service │ readconnroute  │ 0           │ 0                 │ mcs1, mcs2, mcs3 │
    ├───────────────────────────┼────────────────┼─────────────┼───────────────────┼──────────────────┤
    │ query_router_service      │ readwritesplit │ 0           │ 0                 │ mcs1, mcs2, mcs3 │
    └───────────────────────────┴────────────────┴─────────────┴───────────────────┴──────────────────┘
    $ maxctrl show service query_router_service
    ┌─────────────────────┬─────────────────────────────────────────────────────────────┐
    │ Service             │ query_router_service                                        │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Router              │ readwritesplit                                              │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ State               │ Started                                                     │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Started At          │ Sat Aug 28 21:41:16 2021                                    │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Current Connections │ 0                                                           │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Total Connections   │ 0                                                           │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Max Connections     │ 0                                                           │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Cluster             │                                                             │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Servers             │ mcs1                                                        │
    │                     │ mcs2                                                        │
    │                     │ mcs3                                                        │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Services            │                                                             │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Filters             │                                                             │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Parameters          │ {                                                           │
    │                     │     "auth_all_servers": false,                              │
    │                     │     "causal_reads": "false",                                │
    │                     │     "causal_reads_timeout": "10000ms",                      │
    │                     │     "connection_keepalive": "300000ms",                     │
    │                     │     "connection_timeout": "0ms",                            │
    │                     │     "delayed_retry": false,                                 │
    │                     │     "delayed_retry_timeout": "10000ms",                     │
    │                     │     "disable_sescmd_history": false,                        │
    │                     │     "enable_root_user": false,                              │
    │                     │     "idle_session_pool_time": "-1000ms",                    │
    │                     │     "lazy_connect": false,                                  │
    │                     │     "localhost_match_wildcard_host": true,                  │
    │                     │     "log_auth_warnings": true,                              │
    │                     │     "master_accept_reads": false,                           │
    │                     │     "master_failure_mode": "fail_instantly",                │
    │                     │     "master_reconnection": false,                           │
    │                     │     "max_connections": 0,                                   │
    │                     │     "max_sescmd_history": 50,                               │
    │                     │     "max_slave_connections": 255,                           │
    │                     │     "max_slave_replication_lag": "0ms",                     │
    │                     │     "net_write_timeout": "0ms",                             │
    │                     │     "optimistic_trx": false,                                │
    │                     │     "password": "*****",                                    │
    │                     │     "prune_sescmd_history": true,                           │
    │                     │     "rank": "primary",                                      │
    │                     │     "retain_last_statements": -1,                           │
    │                     │     "retry_failed_reads": true,                             │
    │                     │     "reuse_prepared_statements": false,                     │
    │                     │     "router": "readwritesplit",                             │
    │                     │     "session_trace": false,                                 │
    │                     │     "session_track_trx_state": false,                       │
    │                     │     "slave_connections": 255,                               │
    │                     │     "slave_selection_criteria": "LEAST_CURRENT_OPERATIONS", │
    │                     │     "strict_multi_stmt": false,                             │
    │                     │     "strict_sp_calls": false,                               │
    │                     │     "strip_db_esc": true,                                   │
    │                     │     "transaction_replay": false,                            │
    │                     │     "transaction_replay_attempts": 5,                       │
    │                     │     "transaction_replay_max_size": 1073741824,              │
    │                     │     "transaction_replay_retry_on_deadlock": false,          │
    │                     │     "type": "service",                                      │
    │                     │     "use_sql_variables_in": "all",                          │
    │                     │     "user": "mxs",                                          │
    │                     │     "version_string": null                                  │
    │                     │ }                                                           │
    ├─────────────────────┼─────────────────────────────────────────────────────────────┤
    │ Router Diagnostics  │ {                                                           │
    │                     │     "avg_sescmd_history_length": 0,                         │
    │                     │     "max_sescmd_history_length": 0,                         │
    │                     │     "queries": 0,                                           │
    │                     │     "replayed_transactions": 0,                             │
    │                     │     "ro_transactions": 0,                                   │
    │                     │     "route_all": 0,                                         │
    │                     │     "route_master": 0,                                      │
    │                     │     "route_slave": 0,                                       │
    │                     │     "rw_transactions": 0,                                   │
    │                     │     "server_query_statistics": []                           │
    │                     │ }                                                           │
    └─────────────────────┴─────────────────────────────────────────────────────────────┘
    $ sudo mariadb
    CREATE USER 'app_user'@'192.0.2.10' IDENTIFIED BY 'app_user_passwd';
    GRANT ALL ON test.* TO 'app_user'@'192.0.2.10';
    CREATE USER 'app_user'@'192.0.2.11' IDENTIFIED BY 'app_user_passwd';
    GRANT ALL ON test.* TO 'app_user'@'192.0.2.11';
    $ mariadb --host 192.0.2.10 --port 3307 \
          --user app_user --password
    $ maxctrl list listeners
    ┌────────────────────────────┬──────┬──────┬─────────┬───────────────────────────┐
    │ Name                       │ Port │ Host │ State   │ Service                   │
    ├────────────────────────────┼──────┼──────┼─────────┼───────────────────────────┤
    │ connection_router_listener │ 3308 │ ::   │ Running │ connection_router_service │
    ├────────────────────────────┼──────┼──────┼─────────┼───────────────────────────┤
    │ query_router_listener      │ 3307 │ ::   │ Running │ query_router_service      │
    └────────────────────────────┴──────┴──────┴─────────┴───────────────────────────┘
    $ mariadb --host 192.0.2.10 --port 3308 \
          --user app_user --password
    SELECT @@global.hostname, @@global.server_id;
    
    +-------------------+--------------------+
    | @@global.hostname | @@global.server_id |
    +-------------------+--------------------+
    |              mcs2 |                  2 |
    +-------------------+--------------------+
    $ maxctrl list listeners
    ┌────────────────────────────┬──────┬──────┬─────────┬───────────────────────────┐
    │ Name                       │ Port │ Host │ State   │ Service                   │
    ├────────────────────────────┼──────┼──────┼─────────┼───────────────────────────┤
    │ connection_router_listener │ 3308 │ ::   │ Running │ connection_router_service │
    ├────────────────────────────┼──────┼──────┼─────────┼───────────────────────────┤
    │ query_router_listener      │ 3307 │ ::   │ Running │ query_router_service      │
    └────────────────────────────┴──────┴──────┴─────────┴───────────────────────────┘
    $ mariadb --host 192.0.2.10 --port 3307 \
          --user app_user --password
    CREATE TABLE test.load_balancing_test (
       id INT PRIMARY KEY AUTO_INCREMENT,
       hostname VARCHAR(256),
       server_id INT
    );
    INSERT INTO test.load_balancing_test (hostname, server_id)
    VALUES (@@global.hostname, @@global.server_id);
    SELECT * FROM test.load_balancing_test;
    +----+----------+-----------+
    | id | hostname | server_id |
    +----+----------+-----------+
    |  1 | mcs1     |         1 |
    |  2 | mcs1     |         1 |
    |  3 | mcs1     |         1 |
    +----+----------+-----------+
    $ maxctrl list listeners
    ┌────────────────────────────┬──────┬──────┬─────────┬───────────────────────────┐
    │ Name                       │ Port │ Host │ State   │ Service                   │
    ├────────────────────────────┼──────┼──────┼─────────┼───────────────────────────┤
    │ connection_router_listener │ 3308 │ ::   │ Running │ connection_router_service │
    ├────────────────────────────┼──────┼──────┼─────────┼───────────────────────────┤
    │ query_router_listener      │ 3307 │ ::   │ Running │ query_router_service      │
    └────────────────────────────┴──────┴──────┴─────────┴───────────────────────────┘
    $ mariadb --host 192.0.2.10 --port 3307 \
          --user app_user --password
    SELECT @@global.hostname, @@global.server_id;
    +-------------------+--------------------+
    | @@global.hostname | @@global.server_id |
    +-------------------+--------------------+
    |              mcs2 |                  2 |
    +-------------------+--------------------+
    SELECT @@global.hostname, @@global.server_id;
    +-------------------+--------------------+
    | @@global.hostname | @@global.server_id |
    +-------------------+--------------------+
    |              mcs3 |                  3 |
    +-------------------+--------------------+


    Single-Node S3

    This guide provides steps for deploying a single-node MariaDB Enterprise ColumnStore server that uses S3-compatible object storage, including setting up the environment, installing the software, and bulk importing data for online analytical processing (OLAP) workloads.

    Overview

    This procedure describes the deployment of the ColumnStore Object Storage topology with MariaDB Enterprise Server 10.5, MariaDB Enterprise ColumnStore 5, and MariaDB MaxScale 2.5.

    MariaDB Enterprise ColumnStore 5 is a columnar storage engine for MariaDB Enterprise Server 10.5. Enterprise ColumnStore is suitable for Online Analytical Processing (OLAP) workloads.

    This procedure has 9 steps, which are executed in sequence.

    This procedure represents basic product capability and deploys 3 Enterprise ColumnStore nodes and 1 MaxScale node.

    This page provides an overview of the topology, requirements, and deployment procedures.

    Please read and understand this procedure before executing.

    Procedure Steps

    Step 1: Prepare ColumnStore Nodes
    Step 2: Configure Shared Local Storage
    Step 3: Install MariaDB Enterprise Server
    Step 4: Start and Configure MariaDB Enterprise Server
    Step 5: Test MariaDB Enterprise Server
    Step 6: Install MariaDB MaxScale
    Step 7: Start and Configure MariaDB MaxScale
    Step 8: Test MariaDB MaxScale
    Step 9: Import Data

    Support

    Customers can obtain support by submitting a support case.

    Components

    The following components are deployed during this procedure:

    Component
    Function

    MariaDB Enterprise Server

    Modern SQL RDBMS with high availability, pluggable storage engines, hot online backups, and audit logging.

    MariaDB MaxScale

    Database proxy that extends the availability, scalability, and security of MariaDB Enterprise Servers.

    MariaDB Enterprise Server Components

    Component
    Description

    MariaDB Enterprise ColumnStore

    • Columnar storage engine

    • Highly available

    • Optimized for Online Analytical Processing (OLAP) workloads

    • Scalable query execution

    Cluster Management API (CMAPI)

    Provides a REST API for multi-node administration.

    MariaDB MaxScale Components

    Component
    Description

    Listener

    Listens for client connections to MaxScale, then passes them to the router service.

    MariaDB Monitor

    Tracks changes in the state of MariaDB Enterprise Servers.

    Read Connection Router

    Routes connections from the listener to any available Enterprise ColumnStore node.

    Read/Write Split Router

    Routes read operations from the listener to any available Enterprise ColumnStore node, and routes write operations from the listener to a specific server that MaxScale uses as the primary server.

    Server Module

    Connection configuration in MaxScale for an Enterprise ColumnStore node.
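
    These components map to object sections in MaxScale's configuration file. The fragment below is a minimal sketch of how the objects shown in this guide's examples could be declared in /etc/maxscale.cnf; the password value is a placeholder, and your deployment's names and values will differ:

    # Illustrative /etc/maxscale.cnf fragment; names mirror the examples in this guide
    [mcs1]
    type=server
    address=192.0.2.1
    port=3306

    [columnstore_monitor]
    type=monitor
    module=mariadbmon
    servers=mcs1,mcs2,mcs3
    user=mxs
    # placeholder credential
    password=mxs_passwd
    monitor_interval=2000ms

    [query_router_service]
    type=service
    router=readwritesplit
    servers=mcs1,mcs2,mcs3
    user=mxs
    # placeholder credential
    password=mxs_passwd

    [query_router_listener]
    type=listener
    service=query_router_service
    protocol=MariaDBClient
    port=3307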

    Topology

    The MariaDB Enterprise ColumnStore topology with Object Storage delivers production analytics with high availability, fault tolerance, and limitless data storage by leveraging S3-compatible storage.

    The topology consists of:

    • One or more MaxScale nodes

    • An odd number of ColumnStore nodes (minimum of 3) running ES, Enterprise ColumnStore, and CMAPI

    The MaxScale nodes:

    • Monitor the health and availability of each ColumnStore node using the MariaDB Monitor (mariadbmon)

    • Accept client and application connections

    • Route queries to ColumnStore nodes using the Read/Write Split Router (readwritesplit)

    The ColumnStore nodes:

    • Receive queries from MaxScale

    • Execute queries

    • Use S3-compatible object storage for data

    • Use shared local storage for the Storage Manager directory

    Requirements

    These requirements are for the ColumnStore Object Storage topology when deployed with MariaDB Enterprise Server 10.5, MariaDB Enterprise ColumnStore 5, and MariaDB MaxScale 2.5.

    • Node Count

    • Operating System

    • Minimum Hardware Requirements

    • Recommended Hardware Requirements

    • Storage Requirements

    • S3-Compatible Object Storage Requirements

    • Preferred Object Storage Providers: Cloud

    • Preferred Object Storage Providers: Hardware

    • Shared Local Storage Directories

    • Shared Local Storage Options

    Node Count

    • MaxScale nodes: 1 or more are required.

    • Enterprise ColumnStore nodes: 3 or more are required for high availability. A multi-node ColumnStore deployment should always contain an odd number of nodes to avoid split-brain scenarios.

    Operating System

    In alignment with the enterprise lifecycle, the ColumnStore Object Storage topology with MariaDB Enterprise Server 10.5, MariaDB Enterprise ColumnStore 5, and MariaDB MaxScale 2.5 is provided for:

    • CentOS Linux 7 (x86_64)

    • Debian 10 (x86_64)

    • Red Hat Enterprise Linux 7 (x86_64)

    • Red Hat Enterprise Linux 8 (x86_64)

    • Ubuntu 18.04 LTS (x86_64)

    • Ubuntu 20.04 LTS (x86_64)

    Minimum Hardware Requirements

    MariaDB Enterprise ColumnStore's minimum hardware requirements are not intended for production environments, but they can be appropriate for development and test environments. For production environments, see the recommended hardware requirements instead.

    The minimum hardware requirements are:

    Component
    CPU
    Memory

    MaxScale node

    4+ cores

    4+ GB

    Enterprise ColumnStore node

    4+ cores

    4+ GB

    MariaDB Enterprise ColumnStore will refuse to start if the system has less than 3 GB of memory.

    If Enterprise ColumnStore is started on a system with less memory, the following error message will be written to the ColumnStore system log called crit.log:

    Apr 30 21:54:35 a1ebc96a2519 PrimProc[1004]: 35.668435 |0|0|0| C 28 CAL0000: Error total memory available is less than 3GB.

    And the following error message will be raised to the client:

    ERROR 1815 (HY000): Internal error: System is not ready yet. Please try again.
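
    To verify that a node meets the memory requirement before starting Enterprise ColumnStore, the available memory can be checked from the shell, for example:

    $ free -h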

    Recommended Hardware Requirements

    MariaDB Enterprise ColumnStore's recommended hardware requirements are intended for production analytics.

    The recommended hardware requirements are:

    Component
    CPU
    Memory

    MaxScale node

    8+ cores

    16+ GB

    Enterprise ColumnStore node

    64+ cores

    128+ GB

    Storage Requirements

    The ColumnStore Object Storage topology requires the following storage types:

    Storage Type
    Description

    S3-Compatible Object Storage

    The ColumnStore Object Storage topology uses S3-compatible object storage to store data.

    Shared Local Storage

    The ColumnStore Object Storage topology uses shared local storage for the Storage Manager directory to store metadata.

    S3-Compatible Object Storage Requirements

    The ColumnStore Object Storage topology uses S3-compatible object storage to store data.

    Many S3-compatible object storage services exist. MariaDB Corporation cannot make guarantees about all S3-compatible object storage services, because different services provide different functionality.

    For the preferred S3-compatible object storage providers that provide cloud and hardware solutions, see the following sections:

    • Cloud

    • Hardware

    Using S3-compatible object storage providers other than those listed is at your own risk.

    If you have any questions about using specific S3-compatible object storage with MariaDB Enterprise ColumnStore, contact us.

    Preferred Object Storage Providers: Cloud

    • Amazon Web Services (AWS) S3

    • Google Cloud Storage

    • Azure Storage

    • Alibaba Cloud Object Storage Service

    Preferred Object Storage Providers: Hardware

    • Cloudian HyperStore

    • Cohesity S3

    • Dell EMC

    • IBM Cloud Object Storage

    • Seagate Lyve Rack

    • Quantum ActiveScale

    Shared Local Storage Directories

    The ColumnStore Object Storage topology uses shared local storage for the Storage Manager directory to store metadata.

    The Storage Manager directory is located at the following path by default:

    • /var/lib/columnstore/storagemanager
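
    Object storage and the Storage Manager are configured in /etc/columnstore/storagemanager.cnf. The fragment below is an illustrative sketch of an S3 configuration; the region, bucket, and credential values are placeholders for your own:

    # Illustrative /etc/columnstore/storagemanager.cnf fragment
    [ObjectStorage]
    service = S3

    [S3]
    region = us-east-1
    bucket = my-columnstore-bucket
    # endpoint is typically only set for non-AWS, S3-compatible storage
    # endpoint = s3.example.com
    aws_access_key_id = PLACEHOLDER_KEY_ID
    aws_secret_access_key = PLACEHOLDER_SECRET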

    Shared Local Storage Options

    The most common shared local storage options for the ColumnStore Object Storage topology are:

    Shared Local Storage
    Common Usage
    Description

    EBS (Elastic Block Store) Multi-Attach

    AWS

    • EBS is a high-performance block-storage service for AWS (Amazon Web Services).

    • EBS Multi-Attach allows an EBS volume to be attached to multiple instances in AWS. Only clustered file systems, such as GFS2, are supported.

    • For deployments in AWS, EBS Multi-Attach is a recommended option for the Storage Manager directory, and Amazon S3 storage is the recommended option for data.

    EFS (Elastic File System)

    AWS

    • EFS is a scalable, elastic, cloud-native NFS file system for AWS (Amazon Web Services).

    • For deployments in AWS, EFS is a recommended option for the Storage Manager directory, and Amazon S3 storage is the recommended option for data.

    Filestore

    GCP

    • Filestore is high-performance, fully managed storage for GCP (Google Cloud Platform).

    • For deployments in GCP, Filestore is the recommended option for the Storage Manager directory, and Google Object Storage (S3-compatible) is the recommended option for data.

    GlusterFS

    On-premises

    • GlusterFS is a distributed file system.

    • GlusterFS supports replication and failover.

    NFS (Network File System)

    On-premises

    • NFS is a distributed file system.

    • If NFS is used, the storage should be mounted with the sync option to ensure that each node flushes its changes immediately.

    • For on-premises deployments, NFS is the recommended option for the Storage Manager directory, and any S3-compatible storage is the recommended option for data.

    Recommended Storage Options

    For best results, MariaDB Corporation recommends the following storage options:

    Environment
    Object Storage For Data
    Shared Local Storage For Storage Manager

    AWS

    Amazon S3 storage

    EBS Multi-Attach or EFS

    GCP

    Google Object Storage (S3-compatible)

    Filestore

    On-premises

    Any S3-compatible object storage

    NFS

    Enterprise ColumnStore Management with CMAPI

    Enterprise ColumnStore's CMAPI (Cluster Management API) is a REST API that can be used to manage a multi-node Enterprise ColumnStore cluster.

    Many tools are capable of interacting with REST APIs. For example, the curl utility could be used to make REST API calls from the command-line.

    Many programming languages also have libraries for interacting with REST APIs.

    The examples below show how to use the CMAPI with curl.

    URL Endpoint Format for REST API

    CMAPI endpoints follow this URL format:

    https://{server}:{port}/cmapi/{version}/{route}/{command}

    For example:

    • https://mcs1:8640/cmapi/0.4.0/cluster/shutdown

    • https://mcs1:8640/cmapi/0.4.0/cluster/start

    • https://mcs1:8640/cmapi/0.4.0/cluster/status

    With CMAPI 1.4 and later:

    • https://mcs1:8640/cmapi/0.4.0/cluster/node

    With CMAPI 1.3 and earlier:

    • https://mcs1:8640/cmapi/0.4.0/cluster/add-node

    • https://mcs1:8640/cmapi/0.4.0/cluster/remove-node

    Required Request Headers

    • 'x-api-key': '93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd'

    • 'Content-Type': 'application/json'

    x-api-key can be set to any value of your choice during the first call to the server. Subsequent connections will require this same key.

    Get Status

    The mcs command-line utility is the preferred interface for these operations; the curl examples shown alongside each command remain valid but are considered legacy.

    $ mcs cluster status

    $ curl -k -s https://mcs1:8640/cmapi/0.4.0/cluster/status \
          --header 'Content-Type:application/json' \
          --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
          | jq .

    Start Cluster

    $ mcs cluster start --timeout 20

    $ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/start \
          --header 'Content-Type:application/json' \
          --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
          --data '{"timeout":20}' \
          | jq .

    Stop Cluster

    $ mcs cluster shutdown --timeout 20

    $ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/shutdown \
          --header 'Content-Type:application/json' \
          --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
          --data '{"timeout":20}' \
          | jq .

    Add Node

    • With CMAPI 1.4 and later:

    $ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/node \
          --header 'Content-Type:application/json' \
          --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
          --data '{"timeout":20, "node": "192.0.2.2"}' \
          | jq .

    • With CMAPI 1.3 and earlier:

    $ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/add-node \
          --header 'Content-Type:application/json' \
          --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
          --data '{"timeout":20, "node": "192.0.2.2"}' \
          | jq .
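
    Recent releases of the mcs utility can perform the same operation. The exact subcommand depends on the installed version; the following invocation is illustrative and should be verified against your mcs version:

    $ mcs cluster node add --node 192.0.2.2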

    Remove Node

    • With CMAPI 1.4 and later:

    $ curl -k -s -X DELETE https://mcs1:8640/cmapi/0.4.0/cluster/node \
          --header 'Content-Type:application/json' \
          --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
          --data '{"timeout":20, "node": "192.0.2.2"}' \
          | jq .

    • With CMAPI 1.3 and earlier:

    $ curl -k -s -X PUT https://mcs1:8640/cmapi/0.4.0/cluster/remove-node \
          --header 'Content-Type:application/json' \
          --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
          --data '{"timeout":20, "node": "192.0.2.2"}' \
          | jq .
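
    As with adding a node, newer mcs releases offer an equivalent subcommand; the following invocation is illustrative and should be verified against your mcs version:

    $ mcs cluster node remove --node 192.0.2.2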

    Quick Reference

    MariaDB Enterprise Server Configuration Management

    Method
    Description

    Configuration File

    Configuration files (such as /etc/my.cnf) can be used to set system-variables and options. The server must be restarted to apply changes made to configuration files.

    Command-line

    The server can be started with command-line options that set system-variables and options.

    SQL

    Users can set system-variables that support dynamic changes on-the-fly using the SET statement.
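
    For example, a system variable that supports dynamic changes can be set at runtime without a restart (the variable and value here are illustrative):

    SET GLOBAL max_connections = 500;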

    MariaDB Enterprise Server packages are configured to read configuration files from different paths, depending on the operating system. Making custom changes to Enterprise Server default configuration files is not recommended because custom changes may be overwritten by other default configuration files that are loaded later.

    To ensure that your custom changes will be read last, create a custom configuration file with the z- prefix in one of the include directories.

    Distribution
    Example Configuration File Path

    • CentOS

    • Red Hat Enterprise Linux (RHEL)

    /etc/my.cnf.d/z-custom-mariadb.cnf

    • Debian

    • Ubuntu

    /etc/mysql/mariadb.conf.d/z-custom-mariadb.cnf
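
    For example, a custom configuration file might look like the following; the variable settings are illustrative, not required values:

    # /etc/my.cnf.d/z-custom-mariadb.cnf
    [mariadb]
    character_set_server = utf8
    collation_server = utf8_general_ci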

    MariaDB Enterprise Server Service Management

    The systemctl command is used to start and stop the MariaDB Enterprise Server service.

    Operation
    Command

    Start

    sudo systemctl start mariadb

    Stop

    sudo systemctl stop mariadb

    Restart

    sudo systemctl restart mariadb

    Enable during startup

    sudo systemctl enable mariadb

    Disable during startup

    sudo systemctl disable mariadb

    Status

    sudo systemctl status mariadb

    For additional information, see "Starting and Stopping MariaDB".

    MariaDB Enterprise Server Logs

    MariaDB Enterprise Server produces log data that can be helpful in problem diagnosis.

    Log filenames and locations may be overridden in the server configuration. The default location of logs is the data directory. The data directory is specified by the datadir system variable.
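
    The data directory of a running server can be confirmed with a query:

    SELECT @@global.datadir;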

    Log
    System Variable/Option
    Default Filename

    MariaDB Error Log
    log_error
    <hostname>.err

    MariaDB Enterprise Audit Log
    server_audit_file_path
    server_audit.log

    Slow Query Log
    slow_query_log_file
    <hostname>-slow.log

    General Query Log
    general_log_file
    <hostname>.log

    Binary Log
    log_bin
    <hostname>-bin

    Enterprise ColumnStore Service Management

    The systemctl command is used to start and stop the ColumnStore service.

    Operation
    Command

    Start

    sudo systemctl start mariadb-columnstore

    Stop

    sudo systemctl stop mariadb-columnstore

    Restart

    sudo systemctl restart mariadb-columnstore

    Enable during startup

    sudo systemctl enable mariadb-columnstore

    Disable during startup

    sudo systemctl disable mariadb-columnstore

    Status

    sudo systemctl status mariadb-columnstore

    In the ColumnStore Object Storage topology, the mariadb-columnstore service should not be enabled. The CMAPI service restarts Enterprise ColumnStore as needed, so it does not need to start automatically upon reboot.

    Enterprise ColumnStore CMAPI Service Management

    The systemctl command is used to start and stop the CMAPI service.

    Operation
    Command

    Start

    sudo systemctl start mariadb-columnstore-cmapi

    Stop

    sudo systemctl stop mariadb-columnstore-cmapi

    Restart

    sudo systemctl restart mariadb-columnstore-cmapi

    Enable during startup

    sudo systemctl enable mariadb-columnstore-cmapi

    Disable during startup

    sudo systemctl disable mariadb-columnstore-cmapi

    Status

    sudo systemctl status mariadb-columnstore-cmapi

    For additional information on endpoints, see "CMAPI".

    MaxScale Configuration Management

    MaxScale can be configured using several methods. These methods make use of MaxScale's REST API.

    Method
    Benefits

    MaxCtrl

    Command-line utility to perform administrative tasks through the REST API. See MaxCtrl Commands.

    MaxGUI

    Graphical utility that can perform administrative tasks through the REST API.

    REST API

    The REST API can be used directly. For example, the curl utility can make REST API calls from the command-line. Many programming languages also have libraries for interacting with REST APIs.

    The procedure on these pages configures MaxScale using MaxCtrl.
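
    If you prefer to call the REST API directly, a request can be made with curl. The example below assumes MaxScale's default REST API port (8989) and the default admin credentials, which should be changed in production:

    $ curl -s -u admin:mariadb http://localhost:8989/v1/servers | jq .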

    MaxScale Service Management

    The systemctl command is used to start and stop the MaxScale service.

    Operation
    Command

    Start

    sudo systemctl start maxscale

    Stop

    sudo systemctl stop maxscale

    Restart

    sudo systemctl restart maxscale

    Enable during startup

    sudo systemctl enable maxscale

    Disable during startup

    sudo systemctl disable maxscale

    Status

    sudo systemctl status maxscale


    Next Step

    Navigation in the procedure "Deploy ColumnStore Object Storage Topology":

    Next: Step 1: Prepare ColumnStore Nodes.





    This page is: Copyright © 2025 MariaDB. All rights reserved.