MariaDB Enterprise ColumnStore Query Evaluation

MariaDB Enterprise ColumnStore is a smart storage engine designed to efficiently execute analytical queries using distributed query execution and massively parallel processing (MPP) techniques.

Scalability

MariaDB Enterprise ColumnStore is designed to achieve vertical and horizontal scalability for production analytics using distributed query execution and massively parallel processing (MPP) techniques.

Enterprise ColumnStore evaluates each query as a sequence of job steps using sophisticated techniques to get the best performance for complex analytical queries. Some types of job steps are designed to scale with the system's resources. As you increase the number of ColumnStore nodes or the number of cores on each node, Enterprise ColumnStore can use those resources to more efficiently execute those types of job steps.

Enterprise ColumnStore stores each column on disk in extents. The storage format is designed to maintain scalability, even as the table grows. If an operation does not read parts of a large table, I/O costs are reduced. Enterprise ColumnStore uses a technique called extent elimination that compares the maximum and minimum values in the extent map to the query's conditions, and it avoids scanning extents that don't satisfy the conditions.

Enterprise ColumnStore provides exceptional scalability for analytical queries. Enterprise ColumnStore's design supports targeted scale-out to address increased workload requirements, whether it is a larger query load or increased storage and query processing capacity.

Horizontal Scalability

MariaDB Enterprise ColumnStore provides horizontal scalability by executing some types of job steps in a distributed manner using multiple nodes.

When Enterprise ColumnStore is evaluating a job step, the ExeMgr process on the initiator/aggregator node requests the PrimProc process on each node to perform the job step on different extents in parallel. As more nodes are added, Enterprise ColumnStore can perform more work in parallel.

Enterprise ColumnStore also uses massively parallel processing (MPP) techniques to speed up some types of job steps. For some types of aggregation operations, each node can perform an initial local aggregation, and then the initiator/aggregator node only needs to combine the local results and perform a final aggregation. This technique can be very efficient for some types of aggregation operations, such as for queries that use the AVG(), COUNT(), or SUM() aggregate functions.

Vertical Scalability

MariaDB Enterprise ColumnStore provides vertical scalability by executing some types of job steps in a multi-threaded manner using a thread pool.

When the PrimProc process on a node receives work, it executes the job step on an extent in a multi-threaded manner using a thread pool. Each thread operates on a different block within the extent. As more CPUs are added, Enterprise ColumnStore can work on more blocks in parallel.

Extent Elimination

Enterprise ColumnStore: Extent Elimination

MariaDB Enterprise ColumnStore uses extent elimination to scale query evaluation as table size increases.

Most databases are row-based databases that use manually-created indexes to achieve high performance on large tables. This works well for transactional workloads. However, analytical queries tend to have very low selectivity, so traditional indexes are not typically effective for analytical queries.

Enterprise ColumnStore uses extent elimination to achieve high performance, without requiring manually created indexes. Enterprise ColumnStore automatically partitions all data into extents. Enterprise ColumnStore stores the minimum and maximum values for each extent in the extent map. Enterprise ColumnStore uses the minimum and maximum values in the extent map to perform extent elimination.

When Enterprise ColumnStore performs extent elimination, it compares the query's join conditions and filter conditions (i.e., WHERE clause) to the minimum and maximum values for each extent in the extent map. If the extent's minimum and maximum values fall outside the bounds of the query's conditions, Enterprise ColumnStore skips that extent for the query.

Extent elimination is automatically performed for every query. It can significantly decrease I/O for columns with clustered values. For example, extent elimination works effectively for series, ordered, patterned, and time-based data.

Custom Select Handler

The ColumnStore storage engine plugin implements a custom select handler to fully take advantage of Enterprise ColumnStore's capabilities.

All storage engines interact with ES using an internal handler API, which is highly extensible. Storage engines can implement different features by implementing different methods within the handler API.

For select statements, the handler API transforms each query into a SELECT_LEX object, which is provided to the select handler.

Before ES 10.4, the handler API implemented a generic select handler that all storage engines used. Starting with ES 10.4, the handler API allows storage engines to implement a custom select handler. Implementing a custom select handler enables storage engines to take full advantage of unique features.

The generic select handler is not optimal for Enterprise ColumnStore, because:

  • Enterprise ColumnStore selects data by column, but the generic select handler selects data by row

  • Enterprise ColumnStore supports parallel query evaluation, but the generic select handler does not

  • Enterprise ColumnStore supports distributed aggregations, but the generic select handler does not

  • Enterprise ColumnStore supports distributed functions, but the generic select handler does not

  • Enterprise ColumnStore supports extent elimination, but the generic select handler does not

  • Enterprise ColumnStore has its own query planner, but the generic select handler cannot use it

Smart Storage Engine

The ColumnStore storage engine plugin is known as a smart storage engine, because it implements a custom select handler. MariaDB Enterprise ColumnStore integrates with MariaDB Enterprise Server using the ColumnStore storage engine plugin. The ColumnStore storage engine plugin enables MariaDB Enterprise Server to interact with ColumnStore tables.

If a storage engine implements a custom select handler, it is known as a smart storage engine.

As a smart storage engine, the ColumnStore storage engine plugin tightly integrates Enterprise ColumnStore with ES, but it has enough independence to efficiently execute analytical queries using a completely unique approach.

Configure the Select Handler

The ColumnStore storage engine can use either the custom select handler or the generic select handler. The select handler can be configured using the columnstore_select_handler system variable:

Value

Description

AUTO

  • When set to AUTO, Enterprise ColumnStore automatically chooses the best select handler for a given SELECT query.

  • AUTO was added in Enterprise ColumnStore 6.

OFF

  • When set to OFF, Enterprise ColumnStore uses the generic select handler for all SELECT queries.

  • It is not recommended to use this value, unless recommended by MariaDB Support.

ON

  • When set to ON, Enterprise ColumnStore uses the custom select handler for all SELECT queries.

  • ON is the default in Enterprise ColumnStore 5 and Enterprise ColumnStore 6.

Joins

MariaDB Enterprise ColumnStore performs join operations using hash joins.

By default, hash joins are performed in memory.

Configure In-Memory Joins

MariaDB Enterprise ColumnStore can be configured to allocate more memory for hash joins.

The relevant configuration options are:

Option

Description

PmMaxMemorySmallSide

  • Configures the amount of memory available for a single join.

  • Valid values are from 0 to 4 GB.

  • Default value is 1 GB.

TotalUmMemory

  • Configures the amount of memory available for all joins.

  • Values can be specified as a percentage of total system memory or as a specific amount of memory.

  • Valid percentage values are from 0 to 100%.

  • Default value is 25%.

For example, to configure Enterprise ColumnStore to use more memory for hash joins using the mcsSetConfig utility:

$ mcsSetConfig HashJoin PmMaxMemorySmallSide 2G
$ mcsSetConfig HashJoin TotalUmMemory '40%'

Configure Disk-Based Joins

MariaDB Enterprise ColumnStore can be configured to perform disk-based joins.

The relevant configuration options are:

Option

Description

AllowDiskBasedJoin

  • Enables disk-based joins.

  • Valid values are Y and N.

  • Default value is N.

SystemTempFileDir

  • Configures the directory used for temporary files used by disk-based joins and aggregations.

  • Default value is /tmp/columnstore_tmp_files.

TempFileCompression

  • Enables compression for temporary files used by disk-based joins.

  • Valid values are Y and N.

  • Default value is N.

For example, to configure Enterprise ColumnStore to perform disk-based joins using the mcsSetConfig utility:

$ mcsSetConfig HashJoin AllowDiskBasedJoin Y
$ mcsSetConfig HashJoin TempFileCompression Y
$ mcsSetConfig SystemConfig SystemTempFileDir /mariadb/tmp

Aggregations

MariaDB Enterprise ColumnStore performs aggregation operations on all nodes in a distributed manner, and then all nodes send their results to a single node, which combines the results and performs the final aggregation.

By default, aggregation operations are performed in memory.

Configure Disk-Based Aggregations

In Enterprise ColumnStore 5.6.1 and later, disk-based aggregations can be configured.

The relevant configuration options are:

Option

Description

AllowDiskBasedAggregation

  • Enables disk-based aggregations.

  • Valid values are Y and N.

  • Default value is N.

SystemTempFileDir

  • Configures the directory used for temporary files used by disk-based joins and aggregations.

  • Default value is /tmp/columnstore_tmp_files.

For example, to configure Enterprise ColumnStore to perform disk-based aggregations using the mcsSetConfig utility:

$ mcsSetConfig RowAggregation AllowDiskBasedAggregation Y
$ mcsSetConfig SystemConfig SystemTempFileDir /mariadb/tmp

Query Planning

The ColumnStore storage engine plugin is a smart storage engine, so MariaDB Enterprise ColumnStore to plan its own queries using the custom select handler.

MariaDB Enterprise ColumnStore's query planning is divided into two steps:

  1. ES provides the query's SELECT_LEX object to the custom select handler. The custom select handler builds a ColumnStore Execution Plan (CSEP).

  2. The custom select handler provides the CSEP to the ExeMgr process on the same node. The ExeMgr process performs extent elimination and creates a job list.

Optimizer Statistics

In MariaDB Enterprise ColumnStore 6, the ExeMgr process uses optimizer statistics in its query planning process.

ColumnStore uses the optimizer statistics to add support for queries that contain circular inner joins.

In Enterprise ColumnStore 5 and before, ColumnStore would raise the following error when a query containing a circular inner join was executed:

ERROR 1815 (HY000): Internal error: IDB-1003: Circular joins are not supported.

The optimizer statistics store each column's NDV (Number of Distinct Values), which can help the ExeMgr process choose the optimal join order for queries with circular joins. When Enterprise ColumnStore executes a query with a circular join, the query's execution can take longer if ColumnStore chooses a sub-optimal join order. When you collect optimizer statistics for your ColumnStore tables, the ExeMgr process is less likely to choose a sub-optimal join order.

Enterprise ColumnStore's optimizer statistics can be collected for ColumnStore tables by executing ANALYZE TABLE:

ANALYZE TABLE columnstore_tab;

Enterprise ColumnStore's optimizer statistics are not updated automatically. To update the optimizer statistics for a ColumnStore table, ANALYZE TABLE must be re-executed.

Enterprise ColumnStore does not implement an interface to show optimizer statistics.

ColumnStore Execution Plan (CSEP)

The ColumnStore storage engine uses a ColumnStore Execution Plan (CSEP) to internally represent a query plan.

When the select handler receives the SELECT_LEX object, the select handler transforms the SELECT_LEX object into a CSEP as part of the query planning and optimization process.

Viewing the CSEP

The CSEP for a given query can be viewed by performing the following:

  1. Calling the calSetTrace(1) function:

    SELECT calSetTrace(1);
    
  2. Executing the query:

    SELECT column1, column2
    FROM columnstore_tab
    WHERE column1 > '2020-04-01'
    AND column1 < '2020-11-01';
    
  3. Calling the calGetTrace() function:

    SELECT calGetTrace();
    

ExeMgr Process

The ColumnStore storage engine provides the CSEP to the ExeMgr process on the same node, which will act as the initiator/aggregator node for the query.

The ExeMgr process performs multiple tasks:

  • Performs extent elimination.

  • Views the optimizer statistics.

  • Transforms the CSEP to a job list, which consists of job steps.

  • Assigns distributed job steps to the PrimProc process on each node.

  • Evaluates non-distributed job steps itself.

  • Provides final query results to ES.

Job Steps

When Enterprise ColumnStore executes a query, the ExeMgr process on the initiator/aggregator node translates the ColumnStore execution plan (CSEP) into a job list. A job list is a sequence of job steps.

Enterprise ColumnStore uses many different types of job steps that provide different scalability benefits:

  • Some types of job steps perform operations in a distributed manner using multiple nodes to operate on different extents. Distributed operations provide horizontal scalability.

  • Some types of job steps perform operations in a multi-threaded manner using a thread pool. Performing multi-threaded operations provides vertical scalability.

As you increase the number of ColumnStore nodes or the number of cores on each node, Enterprise ColumnStore can use those resources to more efficiently execute job steps.

The specific types of job steps are listed in the sections below.

Batch Primitive Step (BPS)

Enterprise ColumnStore defines a batch primitive step to handle many types of tasks, such as scanning/filtering columns, JOIN operations, aggregation, functional filtering, projecting (putting values into SELECT list).

In calGetTrace() output, a batch primitive step is abbreviated BPS.

Batch primitive steps are evaluated on multiple nodes in parallel. The PrimProc process on each node evaluates the batch primitive step on one extent at a time. The PrimProc process uses a thread pool to operate on individual blocks within the extent in parallel.

Cross Engine Step (CES)

Enterprise ColumnStore defines a cross engine step to perform cross-engine joins, in which a ColumnStore table is joined with a table that uses a different storage engine.

In calGetTrace() output, a cross engine step is abbreviated CES.

Cross engine steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

Enterprise ColumnStore can perform cross engine joins when the mandatory utility user is properly configured.

For additional information, see "Mandatory Utility User Account".

Dictionary Structure Step (DSS)

Enterprise ColumnStore defines a dictionary structure step to scan the dictionary extents that ColumnStore uses to store variable-length string values.

In calGetTrace() output, a dictionary structure step is abbreviated DSS.

Dictionary structure steps are evaluated on multiple nodes in parallel. The PrimProc process on each node evaluates the dictionary structure step on one extent at a time. It uses a thread pool to operate on individual blocks within the extent in parallel.

Dictionary structure steps can require a lot of I/O for a couple reasons:

  • Dictionary structure steps do not support extent elimination, so all extents for the column must be scanned.

  • Dictionary structure steps must read the column extents to find each pointer and the dictionary extents to find each value, so it doubles the number of extents to scan.

It is generally recommended to avoid queries that will cause dictionary scans.

For additional information, see "Avoid Creating Long String Columns".

Hash Join Step (HJS)

Enterprise ColumnStore defines a hash join step to perform a hash join between two tables.

In calGetTrace() output, a hash join step is abbreviated HJS.

Hash join steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

Enterprise ColumnStore performs the hash join in memory by default. If you perform large joins, you may be able get better performance by changing some configuration defaults with mcsSetConfig:

  • Enterprise ColumnStore can be configured to use more memory for in-memory hash joins.

  • Enterprise ColumnStore can be configured to use disk-based joins.

For additional information, see "Configure In-Memory Joins" and "Configure Disk-Based Joins".

Having Step (HVS)

Enterprise ColumnStore defines a having step to evaluate a HAVING clause on a result set.

In calGetTrace() output, a having step is abbreviated HVS.

Subquery Step (SQS)

Enterprise ColumnStore defines a subquery step to evaluate a subquery.

In calGetTrace() output, a subquery step is abbreviated SQS.

Tuple Aggregation Step (TAS)

Enterprise ColumnStore defines a tuple aggregation step to collect intermediate aggregation prior to the final aggregation and evaluation of the results.

In calGetTrace() output, a tuple aggregation step is abbreviated TAS.

Tuple aggregation steps are primarily evaluated by the ExeMgr process on the initiator/aggregator node. However, the PrimProc process on each node also plays a role, since the PrimProc process on each node provides the intermediate aggregation results to the ExeMgr process on the initiator/aggregator node.

Tuple Annexation Step (TNS)

Enterprise ColumnStore defines a tuple annexation step to perform the final aggregation and evaluation of the results.

In calGetTrace() output, a tuple annexation step is abbreviated TNS.

Tuple annexation steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

Enterprise ColumnStore 5 performs aggregation operations in memory. As a consequence, more complex aggregation operations require more memory in that version.

In Enterprise ColumnStore 6, disk-based aggregations can be enabled.

For additional information, see "Configure Disk-Based Aggregations".

Tuple Union Step (TUS)

Enterprise ColumnStore defines a tuple union step to perform a union of two subqueries.

In calGetTrace() output, a tuple union step is abbreviated TUS.

Tuple union steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

Tuple Constant Step (TCS)

Enterprise ColumnStore defines a tuple constant step to evaluate constant values.

In calGetTrace() output, a tuple constant step is abbreviated TCS.

Tuple constant steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

Window Function Step (WFS)

Enterprise ColumnStore defines a window function step to evaluate window functions.

In calGetTrace() output, a window function step is abbreviated WFS.

Window function steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

Query Evaluation Process

Enterprise ColumnStore Query Execution

When Enterprise ColumnStore executes a query, it goes through the following process:

  1. The client or application sends the query to MariaDB MaxScale's listener port.

  2. The query is processed by the Read/Write Split Router (readwritesplit) service associated with the listener.

  3. The service routes the query to the ES TCP port on a ColumnStore node.

  4. MariaDB Enterprise Server (ES) evaluates the query using the handler interface.

    1. The handler interface builds a SELECT_LEX object to represent the query.

    2. The handler interface provides the SELECT_LEX object to the ColumnStore storage engine's select handler.

    3. The select handler transforms the SELECT_LEX object into a ColumnStore Execution Plan (CSEP).

    4. The select handler provides the CSEP to the ExeMgr process on the same node, which will act as the initiator/aggregator node for the query.

  5. The ExeMgr process transforms the CSEP into a job list, which consists of job steps.

  6. The ExeMgr process evaluates each job step sequentially.

    1. If it is a non-distributed job step, the ExeMgr process evaluates the job step itself.

    2. If it is a distributed job step, the ExeMgr process provides the job step to the PrimProc process on each node. The PrimProc process on each node evaluates the job step in a multi-threaded manner using a thread pool. After the PrimProc process on each node evaluates its job step, the results are returned to the ExeMgr process on the initiator/aggregator node as a Row Group.

  7. After all job steps are evaluated, the ExeMgr process returns the results to ES.

  8. ES returns the results to MaxScale.

  9. MaxScale returns the results to the client or application.