MariaDB ColumnStore query tuning optimizes analytics using data types, joins, projection elimination, WHERE clauses, and EXPLAIN for performance insights.
A number of system configuration variables exist to allow fine tuning of the system to suit the physical hardware and query characteristics. In general the default values will work relatively well for many cases.
The configuration parameters are maintained in the /etc/Columnstore.xml file. In a multiple server deployment these should only be edited on the PM1 server, as the file is automatically replicated to the other servers by the system. A system restart is required for a configuration change to take effect.
Convenience utility programs getConfig and setConfig are available to safely update the Columnstore.xml without needing to be comfortable with editing XML files. The -h argument will display usage information.
The NumBlocksPct configuration parameter specifies the percentage of physical memory to utilize for disk block caching. The default value is 50, which leaves enough physical memory for other processes.
The TotalUmMemory configuration parameter specifies the percentage of physical memory to utilize for joins, intermediate results and set operations. This specifies an upper limit for small table results in joins rather than a pre-allocation of memory. The default value is 25.
In a single server or combined deployment, the sum of NumBlocksPct and TotalUmMemory should typically not exceed 75% of physical memory. With very large memory servers this could be raised but the key point is to leave enough memory for other processes including mariadbd.
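As a minimal sketch of adjusting these two parameters with the convenience utilities (the DBBC and HashJoin section names are an assumption based on recent Columnstore.xml layouts, and newer releases name the tools mcsGetConfig/mcsSetConfig; verify against your version and configuration file):
# inspect the current values
getConfig DBBC NumBlocksPct
getConfig HashJoin TotalUmMemory
# illustrative values only: 40% block cache plus 30% join memory, then restart ColumnStore
setConfig DBBC NumBlocksPct 40
setConfig HashJoin TotalUmMemory 30%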
ColumnStore handles concurrent query execution by managing the rate of concurrent batch primitive steps. This is configured using the MaxOutstandingRequests parameter and has a default value of 20. Each batch primitive step is executed within the context of 1 extent column according to this high level process:
ColumnStore issues up to MaxOutstandingRequests number of batch primitive steps.
PrimProc processes the request, using many threads and returns its response. These generally take a fraction of a second up to a low number of seconds depending on the amount of Physical I/O and the performance of that storage.
ColumnStore issues new requests as prior requests complete maintaining the maximum number of outstanding requests.
This scheme allows large queries to use all available resources when they are not otherwise being consumed, and smaller queries to execute with minimal delay. Lower values optimize for higher throughput of smaller queries, while larger values optimize for the response time of a given large query. The default value should work well under most circumstances; however, the value should be increased as the number of nodes is increased.
The number of queries currently running and the number of queries currently queued can be checked with the calGetSQLCount() function:
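For example, assuming the calGetSQLCount() tracing function is installed (it also appears in the examples later on this page), the counts can be checked from any SQL session:
SELECT calGetSQLCount();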
ColumnStore maintains statistics for each table and utilizes these to determine which of the two tables is larger. This is based both on the number of blocks in the table and an estimate of the predicate cardinality. The first step is to apply any filters to the smaller table and return that data set to memory. The size of this data set is compared against the PmMaxMemorySmallSide configuration parameter, which has a default value of 64 (MB) and can be set as high as 4 GB. The default allows approximately 1M rows on the small table side to be joined against billions (or trillions) on the large table side. If the size of the small data set is less than PmMaxMemorySmallSide, the data set is sent to PrimProc for creation of a distributed hashmap. This setting is therefore important to the tuning of joins and to whether the operation can be distributed or not. It should be set to support your largest expected small-table join size, up to available memory:
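For example, a minimal sketch of raising the small-side join limit to 1 GB using setConfig (the HashJoin section name is assumed from recent Columnstore.xml layouts; a system restart is required afterwards):
setConfig HashJoin PmMaxMemorySmallSide 1G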
Although this increases the amount of data sent between nodes to support the join, it means that the join and subsequent aggregates are pushed down, scaled out, and a smaller data set is returned back.
In a multiple server deployment, the sizing should be based on the available physical memory on the servers, how much memory to reserve for block caching, and the number of simultaneous join operations expected to run multiplied by the average small-table join data size.
The above logic for a single-table join extrapolates to multi-table joins, where the small table values are precalculated and applied as one single scan against the large table. This works well for the typical star schema case of joining multiple dimension tables with a large fact table. For some join scenarios it may be necessary to sequence joins to create intermediate datasets for joining; this would happen, for instance, with a snowflake schema structure. In some extreme cases it may be hard for the optimizer to determine the optimal join path. In this case a hint is available to force a join ordering. The INFINIDB_ORDERED hint forces the first table in the FROM clause to be considered the largest table and overrides any statistics-based decision, for example:
Note: INFINIDB_ORDERED is deprecated and no longer works in ColumnStore 1.2 and above.
For ColumnStore 1.2 and 1.3, use: set infinidb_ordered_only=ON;
For ColumnStore 1.4 and later, use: set columnstore_ordered_only=ON;
When a join is very large and exceeds the PmMaxMemorySmallSide setting, the join is performed in memory on the UM rather than being distributed. For very large joins this could exceed the available memory, in which case the condition is detected and a query error is reported. Several configuration parameters are available to enable and configure the use of disk overflow should this occur:
AllowDiskBasedJoin – Controls whether disk-based joins may be used. Valid values are Y (enabled) and N (disabled). By default, this option is disabled.
TempFileCompression – Controls whether the disk join files are compressed or uncompressed. Valid values are Y (use compressed files) and N (use uncompressed files).
TempFilePath – The directory path used for the disk joins. By default, this path is the tmp directory for your installation (i.e., /tmp/columnstore_tmp_files/). Files named infinidb-join-data* are created and cleaned up in this directory as needed.
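The following sketch shows how disk-based joins might be enabled with setConfig, assuming these parameters live under the HashJoin section of Columnstore.xml (verify against your file), followed by a system restart:
setConfig HashJoin AllowDiskBasedJoin Y
setConfig HashJoin TempFileCompression Y
setConfig HashJoin TempFilePath /tmp/columnstore_tmp_files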
A MariaDB global or session variable is available to specify a memory limit at which point the query is switched over to disk-based joins:
infinidb_um_mem_limit - Memory limit in MB per user (i.e., switch to disk-based join if this limit is exceeded). By default, this limit is not set (value of 0).
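For example, to have joins for the current session spill to disk once roughly 2 GB of UM memory has been used (the value is illustrative only; in ColumnStore 1.4 and later the variable is named columnstore_um_mem_limit):
SET SESSION infinidb_um_mem_limit = 2048;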
The MariaDB SHOW PROCESSLIST statement is used to see a list of active queries on that UM:
MariaDB [test]> SHOW PROCESSLIST;
+----+------+-----------+-------+---------+------+-------+------------------+
| Id | User | Host      | db    | Command | Time | State | Info             |
+----+------+-----------+-------+---------+------+-------+------------------+
| 73 | root | localhost | ssb10 | Query   |    0 | NULL  | show processlist |
+----+------+-----------+-------+---------+------+-------+------------------+
1 row in set (0.01 sec)
getActiveSQLStatements is a mcsadmin command that shows which SQL statements are currently being executed on the database:
The calGetStats function provides statistics about resources used on the node, and network by the last run query. Example:
The output contains information on:
MaxMemPct - Peak memory utilization on the UM (User Module), likely in support of a large UM-based hash join operation.
NumTempFiles - Report on any temporary files created in support of query operations larger than available memory, typically for unusual join operations where the smaller table join cardinality exceeds some configurable threshold.
TempFileSpace - Report on space used by temporary files created in support of query operations larger than available memory, typically for unusual join operations where the smaller table join cardinality exceeds some configurable threshold.
ApproxPhyI/O - Approximate number of 8k blocks of physical I/O (reads from storage) performed for the query.
CacheI/O - Approximate number of 8k blocks processed in memory, adjusted down by the number of discrete PhyI/O calls required.
BlocksTouched - Approximate number of 8k blocks processed in memory.
PartitionBlocksEliminated - The number of block touches eliminated via the Extent Map elimination behavior.
MsgBytesIn, MsgBytesOut - Message size in MB sent between nodes in support of the query.
The output is useful to determine how much physical I/O was required, how much data was cached, and how many partition blocks were eliminated through use of extent map elimination. The system maintains min / max values for each extent and uses these to help implement where clause filters to completely bypass extents where the value is outside of the min/max range. When a column is ordered (or semi-ordered) during load, such as a time column, this offers very large performance gains as the system can avoid scanning many extents for the column.
While the MariaDB server's EXPLAIN utility can be used to look at the query plan, it is somewhat less helpful for ColumnStore tables as ColumnStore does not use indexes or make use of MariaDB I/O functionality. The execution plan for a query on a ColumnStore table is made up of multiple steps. Each step in the query plan performs a set of operations that are issued from the UM to the set of PMs in support of a given step in a query.
Full Column Scan - an operation that scans each entry in a column using all available threads on the Performance Modules. Speed of operation is generally related to the size of the data type and the total number of rows in the column. The closest analogy for a traditional system is an index scan operation.
Partitioned Column Scan - an operation that uses the Extent Map to identify that certain portions of the column do not contain any matching values for a given set of filters. The closest analogy for a traditional row-based DBMS is a partitioned index scan, or partitioned table scan operation.
Column lookup by row offset - once the set of matching filters has been applied and the minimal set of rows has been identified, additional blocks are requested using a calculation that determines exactly which block is required. The closest analogy for a traditional system is a lookup by rowid.
These operations are automatically executed together in order to execute appropriate filters and column lookup by row offset.
In MariaDB ColumnStore there is a set of SQL tracing stored functions provided to see the distributed query execution plan between the nodes.
The basic steps to using these SQL tracing stored functions are:
Start the trace for the particular session.
Execute the SQL statement in question.
Review the trace collected for the statement. As an example, the following session starts a trace, issues a query against a 6 million row fact table and 300,000 row dimension table, and then reviews the output from the trace:
The columns headings in the output are as follows:
Desc – Operation being executed. Possible values:
BPS - Batch Primitive Step: scanning or projecting the column blocks.
CES - Cross Engine Step: Performing a cross-engine join
DSS - Dictionary Structure Step: a dictionary scan for a particular variable length string value.
HJS - Hash Join Step: Performing a hash join between 2 tables
HVS - Having Step: Performing the having clause on the result set
SQS - Sub Query Step: Performing a sub query
TAS - Tuple Aggregation step: the process of receiving intermediate aggregation results from other nodes.
TNS - Tuple Annexation Step: Query result finishing, e.g. filling in constant columns, limit, order by and final distinct cases.
TUS - Tuple Union step: Performing a SQL union of 2 sub queries.
TCS - Tuple Constant Step: Process constant value columns.
WFS - Window Function Step: Performing a window function.
Mode – Where the operation was performed within the PrimProc library
Table – Table for which columns may be scanned/projected.
TableOID – ObjectID for the table being scanned.
ReferencedOIDs – ObjectIDs for the columns required by the query.
PIO – Physical I/O (reads from storage) executed for the query.
LIO – Logical I/O executed for the query, also known as Blocks Touched.
PBE – Partition Blocks Eliminated identifies blocks eliminated by Extent Map min/max.
Elapsed – Elapsed time for a given step.
Rows – Intermediate rows returned.
Sometimes it can be useful to clear caches to allow understanding of un-cached and cached query access. The calFlushCache() function will clear caches on all servers. This is only really useful for testing query performance:
It can be useful to view details about the extent map for a given column. This can be achieved using the editem utility on any ColumnStore server. Available arguments can be displayed using the -h flag. The most common use is to provide the column object id with the -o argument, which will output details for the column; in this case the -t argument is also provided to show min / max values as dates:
Here it can be seen that the extent maps for the o_orderdate (object id 3032) column are well partitioned since the order table source data was sorted by the order_date. This example shows 2 separate DBRoot values as the environment was a 2-node combined deployment.
Column object ids may be found by querying the calpontsys.syscolumn metadata table (deprecated) or information_schema.columnstore_columns table (version 1.0.6+).
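For example, an object id can typically be looked up with a query along these lines (the exact column names exposed by information_schema.columnstore_columns can vary slightly between versions):
SELECT table_schema, table_name, column_name, object_id
FROM information_schema.columnstore_columns
WHERE table_schema = 'test' AND column_name = 'o_orderdate';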
MariaDB ColumnStore query statistics history can be retrieved for analysis. By default, query stats collection is disabled. To enable the collection of query stats, the QueryStats Enabled element in the Columnstore.xml configuration file should be set to Y (the default is N).
Cross Engine Support must also be enabled before enabling Query Statistics. See the cross engine join configuration section.
For query statistics, the cross engine user needs the INSERT privilege on the querystats table.
Example:
When enabled, the history of query statistics across all sessions, along with execution times and the stats provided by calGetStats(), is stored in a table in the infinidb_querystats schema. Only the following ColumnStore statement types are available for statistics monitoring:
SELECT
INSERT
UPDATE
DELETE
INSERT ... SELECT
LOAD DATA INFILE
When QueryStats is enabled, the query statistics history is collected in the querystats table in the infinidb_querystats schema.
The columns of this table are:
queryID - A unique identifier assigned to the query
Session ID (sessionID) - The session number that executed the statement.
queryType - The type of the query: SELECT, INSERT, UPDATE, DELETE, INSERT SELECT, or LOAD DATA INFILE
query - The text of the query
Host (host) - The host that executed the statement.
User ID (user) - The user that executed the statement.
Priority (priority) - The priority the user has for this statement.
Query Execution Times (startTime, endTime) - Calculated as end time – start time.
start time - the time that the query gets to ExeMgr, DDLProc, or DMLProc
end time - the time that the last result packet exits ExeMgr, DDLProc or DMLProc
Rows returned or affected (rows) - The number of rows returned for SELECT queries, or the number of rows affected by DML queries. Not valid for DDL and other query types.
Error Number (errNo) - The IDB error number if this query failed, 0 if it succeeded.
Physical I/O (phyIO) - The number of blocks that the query accessed from the disk, including the pre-fetch blocks. This statistic is only valid for the queries that are processed by ExeMgr, i.e. SELECT, DML with WHERE clause, and INSERT SELECT.
Cache I/O (cacheIO) - The number of blocks that the query accessed from the cache. This statistic is only valid for queries that are processed by ExeMgr, i.e. SELECT, DML with WHERE clause, and INSERT SELECT.
Blocks Touched (blocksTouched) - The total number of blocks that the query accessed physically and from the cache. This should be equal to or less than the sum of physical I/O and cache I/O. This statistic is only valid for queries that are processed by ExeMgr, i.e. SELECT, DML with WHERE clause, and INSERT SELECT.
Partition Blocks Eliminated (CPBlocksSkipped) - The number of blocks eliminated by the extent map casual partitioning. This statistic is only valid for queries that are processed by ExeMgr, i.e. SELECT, DML with WHERE clause, and INSERT SELECT.
Messages to other nodes (msgOutUM) - The number of bytes in messages that ExeMgr sends to PrimProc. If a message needs to be distributed to all the PMs, the sum of all the distributed messages is counted. Only valid for queries that are processed by ExeMgr, i.e. SELECT, DML with WHERE clause, and INSERT SELECT.
Messages from other nodes (msgInUM) - The number of bytes in messages that PrimProc sends to ExeMgr. Only valid for queries that are processed by ExeMgr, i.e. SELECT, DML with WHERE clause, and INSERT SELECT.
Memory Utilization (maxMemPct) - This field shows memory utilization in support of any join, group by, aggregation, distinct, or other operation.
Blocks Changed (blocksChanged) - Total number of blocks that queries physically changed on disk. This is only for delete/update statements.
Temp Files (numTempFiles) - This field shows any temporary file utilization in support of any join, group by, aggregation, distinct, or other operation.
Temp File Space (tempFileSpace) - This shows the size of any temporary file utilization in support of any join, group by, aggregation, distinct, or other operation.
Users can view the query statistics by selecting the rows from the query stats table in the infinidb_querystats schema. Examples listed below:
Example 1: List execution time, rows returned for all the select queries within the past 12 hours:
Example 2: List the three slowest running select queries of session 2 within the past 12 hours:
Example 3: List the average, min and max running time of all the INSERT SELECT queries within the past 12 hours:
In MariaDB Enterprise ColumnStore 6, the ExeMgr process uses optimizer statistics in its query planning process.
ColumnStore uses the optimizer statistics to add support for queries that contain circular inner joins.
In Enterprise ColumnStore 5 and before, ColumnStore would raise the following error when a query containing a circular inner join was executed:
ERROR 1815 (HY000): Internal error: IDB-1003: Circular joins are not supported.
The optimizer statistics store each column's NDV (Number of Distinct Values), which can help the ExeMgr process choose the optimal join order for queries with circular joins. When Enterprise ColumnStore executes a query with a circular join, the query's execution can take longer if ColumnStore chooses a sub-optimal join order. When you collect optimizer statistics for your ColumnStore tables, the ExeMgr process is less likely to choose a sub-optimal join order.
Enterprise ColumnStore's optimizer statistics can be collected for ColumnStore tables by executing ANALYZE TABLE:
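For example, for a hypothetical ColumnStore table named columnstore_tab:
ANALYZE TABLE columnstore_tab;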
Enterprise ColumnStore's optimizer statistics are not updated automatically. To update the optimizer statistics for a ColumnStore table, ANALYZE TABLE must be re-executed.
Enterprise ColumnStore does not implement an interface to show optimizer statistics.
When tuning queries for MariaDB Enterprise ColumnStore, there are some important details to consider.
Enterprise ColumnStore only reads the columns that are necessary to resolve a query.
For example, the following query selects every column in the table:
Whereas the following query only selects two columns in the table, so it requires less I/O:
For best performance, only select the columns that are necessary to resolve a query.
When Enterprise ColumnStore executes a query, the ExeMgr process on the initiator/aggregator node translates the ColumnStore execution plan (CSEP) into a job list. A job list is a sequence of job steps.
Enterprise ColumnStore uses many different types of job steps that provide different scalability benefits:
Some types of job steps perform operations in a distributed manner, using multiple nodes to operate on different extents. Distributed operations provide horizontal scalability.
SELECT calgetsqlcount();
SELECT /*! INFINIDB_ORDERED */ r_regionkey
FROM region r, customer c, nation n
WHERE r.r_regionkey = n.n_regionkey
AND n.n_nationkey = c.c_nationkey
mcsadmin> getActiveSQLStatements
getactivesqlstatements Wed Oct 7 08:38:32 2015
Get List of Active SQL Statements
=================================
Start Time Time (hh:mm:ss) Session ID SQL Statement
---------------- ---------------- -------------------- ------------------------------------------------------------
Oct 7 08:38:30      00:00:03         73                   select c_name,sum(lo_revenue) from customer, lineorder where lo_custkey = c_custkey and c_custkey = 6 group by c_name
MariaDB [test]> SELECT count(*) FROM wide2;
+----------+
| count(*) |
+----------+
| 5000000 |
+----------+
1 row in set (0.22 sec)
MariaDB [test]> SELECT calGetStats();
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| calGetStats() |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Query Stats: MaxMemPct-0; NumTempFiles-0; TempFileSpace-0B; ApproxPhyI/O-1931; CacheI/O-2446; BlocksTouched-2443; PartitionBlocksEliminated-0; MsgBytesIn-73KB; MsgBytesOut-1KB; Mode-Distributed |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
MariaDB [test]> SELECT calSetTrace(1);
+----------------+
| calSetTrace(1) |
+----------------+
| 0 |
+----------------+
1 row in set (0.00 sec)
MariaDB [test]> SELECT c_name, sum(o_totalprice)
-> FROM customer, orders
-> WHERE o_custkey = c_custkey
-> AND c_custkey = 5
-> GROUP BY c_name;
+--------------------+-------------------+
| c_name | sum(o_totalprice) |
+--------------------+-------------------+
| Customer#000000005 | 684965.28 |
+--------------------+-------------------+
1 row in set, 1 warning (0.34 sec)
MariaDB [test]> SELECT calGetTrace();
+------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------+
| calGetTrace() |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------+
|
Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows
BPS PM customer 3024 (c_custkey,c_name) 0 43 36 0.006 1
BPS PM orders 3038 (o_custkey,o_totalprice) 0 766 0 0.032 3
HJS PM orders-customer 3038 - - - - ----- -
TAS UM - - - - - - 0.021 1
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
MariaDB [test]> SELECT calFlushCache();
editem -o 3032 -t
Col OID = 3032, NumExtents = 10, width = 4
428032 - 432127 (4096) min: 1992-01-01, max: 1993-06-21, seqNum: 1, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 0, HWM: 0; status: avail
502784 - 506879 (4096) min: 1992-01-01, max: 1993-06-22, seqNum: 1, state: valid, fbo: 0, DBRoot: 2, part#: 0, seg#: 1, HWM: 0; status: unavail
708608 - 712703 (4096) min: 1993-06-21, max: 1994-12-11, seqNum: 1, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 2, HWM: 0; status: unavail
766976 - 771071 (4096) min: 1993-06-22, max: 1994-12-12, seqNum: 1, state: valid, fbo: 0, DBRoot: 2, part#: 0, seg#: 3, HWM: 0; status: unavail
989184 - 993279 (4096) min: 1994-12-11, max: 1996-06-01, seqNum: 1, state: valid, fbo: 4096, DBRoot: 1, part#: 0, seg#: 0, HWM: 8191; status: avail
1039360 - 1043455 (4096) min: 1994-12-12, max: 1996-06-02, seqNum: 1, state: valid, fbo: 4096, DBRoot: 2, part#: 0, seg#: 1, HWM: 8191; status: avail
1220608 - 1224703 (4096) min: 1996-06-01, max: 1997-11-22, seqNum: 1, state: valid, fbo: 4096, DBRoot: 1, part#: 0, seg#: 2, HWM: 8191; status: avail
1270784 - 1274879 (4096) min: 1996-06-02, max: 1997-11-22, seqNum: 1, state: valid, fbo: 4096, DBRoot: 2, part#: 0, seg#: 3, HWM: 8191; status: avail
1452032 - 1456127 (4096) min: 1997-11-22, max: 1998-08-02, seqNum: 1, state: valid, fbo: 0, DBRoot: 1, part#: 1, seg#: 0, HWM: 1930; status: avail
1510400 - 1514495 (4096) min: 1997-11-22, max: 1998-08-02, seqNum: 1, state: valid, fbo: 0, DBRoot: 2, part#: 1, seg#: 1, HWM: 1930; status: avail
<QueryStats>
<Enabled>Y</Enabled>
</QueryStats>
grant INSERT on infinidb_querystats.querystats to 'cross_engine'@'127.0.0.1';
grant INSERT on infinidb_querystats.querystats to 'cross_engine'@'localhost';
MariaDB [infinidb_querystats]> select queryid, query, endtime-starttime, rows from querystats
where starttime >= now() - interval 12 hour and querytype = 'SELECT';
MariaDB [infinidb_querystats]> select a.* from (select endtime-starttime execTime, query from queryStats
where sessionid = 2 and querytype = 'SELECT' and starttime >= now()-interval 12 hour
order by 1 limit 3) a;
MariaDB [infinidb_querystats]> select min(endtime-starttime), max(endtime-starttime), avg(endtime-starttime) from querystats
where querytype='INSERT SELECT' and starttime >= now() - interval 12 hour;
When Enterprise ColumnStore performs ORDER BY and LIMIT operations, the operations are performed in a single-threaded manner after the rest of the query processing has been completed, and the full unsorted result-set has been retrieved. For large data sets, the performance overhead can be significant.
When Enterprise ColumnStore 5 performs aggregations (i.e., DISTINCT, GROUP BY, COUNT(*), etc.), all of the aggregation work happens in-memory by default. As a consequence, more complex aggregation operations require more memory in that version.
For example, the following query could require a lot of memory in Enterprise ColumnStore 5, since it has to calculate many distinct values in memory:
Whereas the following query could require much less memory in Enterprise ColumnStore 5, since it has to calculate fewer distinct values:
In Enterprise ColumnStore 6, disk-based aggregations can be enabled.
For best performance, avoid excessive aggregations or enable disk-based aggregations.
For additional information, see "Configure Disk-Based Aggregations".
When Enterprise ColumnStore evaluates built-in functions and aggregate functions, it can often evaluate the function in a distributed manner. Distributed evaluation of functions can significantly improve performance.
Enterprise ColumnStore supports distributed evaluation for some built-in functions. For other built-in functions, the function must be evaluated serially on the final result set.
Enterprise ColumnStore also supports distributed evaluation for user-defined functions developed with ColumnStore's User-Defined Aggregate Function (UDAF) C++ API. For functions developed with Enterprise Server's standard User-Defined Function (UDF) API, the function must be evaluated serially on the final result set.
For best performance, avoid non-distributed functions.
By default, Enterprise ColumnStore performs all joins as in-memory hash joins.
If the joined tables are very large, the in-memory hash join can require too much memory for the default configuration. There are a couple options to work around this:
Enterprise ColumnStore can be configured to use more memory for in-memory hash joins.
Enterprise ColumnStore can be configured to use disk-based joins.
Enterprise ColumnStore can use optimizer statistics to better optimize the join order.
For additional information, see "Configure In-Memory Joins", "Configure Disk-Based Joins", and "Optimizer Statistics".
Enterprise ColumnStore uses extent elimination to optimize queries. Extent elimination uses the minimum and maximum values in the extent map to determine which extents can be skipped for a query.
When data is loaded into Enterprise ColumnStore, it appends the data to the latest extent. When an extent reaches the maximum number of column values, Enterprise ColumnStore creates a new extent. As a consequence, if ordered data is loaded in its proper order, then similar values will be clustered together in the same extent. This can improve query performance, because extent elimination performs best when similar values are clustered together.
For example, if you expect to query a table with a filter on a timestamp column, you should sort the data using the timestamp column before loading it into Enterprise ColumnStore. Later, when the table is queried with a filter on the timestamp column, Enterprise ColumnStore would be able to skip many extents using extent elimination.
For best performance, load ordered data in proper order.
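As an illustrative sketch (the file names, database, table, sort key, and delimiter are all hypothetical), the data could be pre-sorted on its leading timestamp field and then loaded with cpimport:
sort -t ',' -k 1 /data/events_raw.csv > /data/events_sorted.csv
cpimport mydb events /data/events_sorted.csv -s ','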
When Enterprise ColumnStore performs mathematical operations with very big values using the DECIMAL, NUMERIC, and FIXED data types, the operation can sometimes overflow ColumnStore's maximum precision or scale. The maximum precision and scale depend on the version of Enterprise ColumnStore:
In Enterprise ColumnStore 6, the maximum precision (M) is 38, and the maximum scale (D) is 38.
In Enterprise ColumnStore 5, the maximum precision (M) is 18, and the maximum scale (D) is 18.
In Enterprise ColumnStore 6, applications can configure Enterprise ColumnStore to check for decimal overflows by setting the columnstore_decimal_overflow_check system variable, but only when the column has a decimal precision that is 18 or more:
When decimal overflow checks are enabled, math operations have extra overhead.
When the decimal overflow check fails, MariaDB Enterprise ColumnStore raises an error with the ER_INTERNAL_ERROR SQL error code, and it writes detailed information about the overflow check failure to the ColumnStore system logs.
MariaDB Enterprise ColumnStore supports Enterprise Server's standard User-Defined Function (UDF) API. However, UDFs developed using that API cannot be executed in a distributed manner.
To support distributed execution of custom SQL, MariaDB Enterprise ColumnStore supports a Distributed User Defined Aggregate Functions (UDAF) C++ API:
The Distributed User Defined Aggregate Functions (UDAF) C++ API allows anyone to create aggregate functions of arbitrary complexity for distributed execution in the ColumnStore storage engine.
These functions can also be used as Analytic (Window) functions just like any built-in aggregate function.
Some types of job steps perform operations in a multi-threaded manner using a thread pool. Performing multi-threaded operations provides vertical scalability.
As you increase the number of ColumnStore nodes or the number of cores on each node, Enterprise ColumnStore can use those resources to more efficiently execute job steps.
For additional information, see "MariaDB Enterprise ColumnStore Query Evaluation.".
Enterprise ColumnStore defines a batch primitive step to handle many types of tasks, such as scanning/filtering columns, JOIN operations, aggregation, functional filtering, and projecting (putting values into a SELECT list).
In calGetTrace() output, a batch primitive step is abbreviated BPS.
Batch primitive steps are evaluated on multiple nodes in parallel. The PrimProc process on each node evaluates the batch primitive step to one extent at a time. The PrimProc process uses a thread pool to operate on individual blocks within the extent in parallel.
Enterprise ColumnStore defines a cross-engine step to perform cross-engine joins, in which a ColumnStore table is joined with a table that uses a different storage engine.
In calGetTrace() output, a cross-engine step is abbreviated CES.
Cross-engine steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.
Enterprise ColumnStore can perform cross-engine joins when the mandatory utility user is properly configured.
For additional information, refer to the "Mandatory Utility User Account" section.
Enterprise ColumnStore defines a dictionary structure step to scan the dictionary extents that ColumnStore uses to store variable-length string values.
In calGetTrace() output, a dictionary structure step is abbreviated DSS.
Dictionary structure steps are evaluated on multiple nodes in parallel. The PrimProc process on each node evaluates the dictionary structure step to one extent at a time. It uses a thread pool to operate on individual blocks within the extent in parallel.
Dictionary structure steps can require a lot of I/O for a couple of reasons:
Dictionary structure steps do not support extent elimination, so all extents for the column must be scanned.
Dictionary structure steps must read the column extents to find each pointer and the dictionary extents to find each value, so it doubles the number of extents to scan.
It is generally recommended to avoid queries that will cause dictionary scans.
For additional information, see "Avoid Creating Long String Columns".
Enterprise ColumnStore defines a hash join step to perform a hash join between two tables.
In calGetTrace() output, a hash join step is abbreviated HJS.
Hash join steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.
Enterprise ColumnStore performs the hash join in memory by default. If you perform large joins, you may be able get better performance by changing some configuration defaults with mcsSetConfig:
Enterprise ColumnStore can be configured to use more memory for in-memory hash joins.
Enterprise ColumnStore can be configured to use disk-based joins.
For additional information, see "Configure in-memory joins" and "Configure Disk-Based Joins".
Enterprise ColumnStore defines a having step to evaluate a HAVING clause on a result set.
In calGetTrace() output, a having step is abbreviated HVS.
Enterprise ColumnStore defines a subquery step to evaluate a subquery.
In calGetTrace() output, a subquery step is abbreviated SQS.
Enterprise ColumnStore defines a tuple aggregation step to collect intermediate aggregation results prior to the final aggregation and evaluation of the results.
In calGetTrace() output, a tuple aggregation step is abbreviated TAS.
Tuple aggregation steps are primarily evaluated by the ExeMgr process on the initiator/aggregator node. However, the PrimProc process on each node also plays a role, since the PrimProc process on each node provides the intermediate aggregation results to the ExeMgr process on the initiator/aggregator node.
Enterprise ColumnStore defines a tuple annexation step to perform the final aggregation and evaluation of the results.
In calGetTrace() output, a tuple annexation step is abbreviated TNS.
Tuple annexation steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.
Enterprise ColumnStore 5 performs aggregation operations in memory. As a consequence, more complex aggregation operations require more memory in that version.
In Enterprise ColumnStore 6, disk-based aggregations can be enabled.
For additional information, see "Configure Disk-Based Aggregations".
Enterprise ColumnStore defines a tuple union step to perform a union of two subqueries.
In calGetTrace() output, a tuple union step is abbreviated TUS.
Tuple union steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.
Enterprise ColumnStore defines a tuple constant step to evaluate constant values.
In calGetTrace() output, a tuple constant step is abbreviated TCS.
Tuple constant steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.
Enterprise ColumnStore defines a window function step to evaluate window functions.
In calGetTrace() output, a window function step is abbreviated WFS.
Window function steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.
ANALYZE TABLE columnstore_tab;
The ColumnStore storage engine uses a ColumnStore Execution Plan (CSEP) to represent a query plan internally.
When the select handler receives the SELECT_LEX object, it transforms it into a CSEP as part of the query planning and optimization process. For additional information, see "MariaDB Enterprise ColumnStore Query Evaluation."
The CSEP for a given query can be viewed by performing the following:
Calling the calSetTrace(1) function:
Executing the query:
Calling the calGetTrace() function:
SELECT * FROM tab;
SELECT col1, col2 FROM tab;
SELECT DISTINCT col1 FROM tab LIMIT 10000;
SELECT DISTINCT col1 FROM tab LIMIT 100;
SET SESSION columnstore_decimal_overflow_check=ON;
SELECT (big_decimal1 * big_decimal2) AS product
FROM columnstore_tab;
SELECT calSetTrace(1);
SELECT column1, column2
FROM columnstore_tab
WHERE column1 > '2020-04-01'
AND column1 < '2020-11-01';
SELECT calGetTrace();
MariaDB ColumnStore's query plans and Optimizer Trace show how analytical queries run in parallel across its distributed, columnar architecture, aiding performance tuning.
The high level components of the ColumnStore architecture are:
PrimProc: PrimProc (Primitives Processor) is responsible for parsing the SQL requests into an optimized set of primitive job steps executed by one or more servers. PrimProc is thus responsible for query optimization and orchestration of query execution by the servers. While every instance has its own PrimProc in a multi-server deployment, each query begins and ends on the same PrimProc it originated from. A database load balancer such as MariaDB MaxScale can be deployed to appropriately balance external requests against individual servers. PrimProc also executes granular job steps received from the server (mariadbd) in a multi-threaded manner. ColumnStore allows distribution of the work across many servers.
Extent Maps: ColumnStore maintains metadata about each column in a shared distributed object known as the Extent Map. The primary node references the Extent Map both to help generate the correct primitive job steps and to identify the correct disk blocks to read. Each column is made up of one or more files, and each file can contain multiple extents. As much as possible the system attempts to allocate contiguous physical storage to improve read performance.
Storage: ColumnStore can use either local storage or shared storage (e.g. SAN or EBS) to store data. Using shared storage allows for data processing to fail over to another node automatically in case of a server failing.
The system supports full MVCC ACID transactional logic via Insert, Update, and Delete statements. The MVCC architecture allows for concurrent query and DML / batch load. Although DML is supported, the system is optimized more for batch inserts and so larger data loads should be achieved through a batch load. The most flexible and optimal way to load data is via the cpimport tool. This tool optimizes the load path and can be run centrally or in parallel on each server.
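For example, a basic bulk load of a delimited file into an existing ColumnStore table might look like this (the database, table, and file names are placeholders):
cpimport mydb orders /data/orders.tbl -s '|'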
If the data contains a time (or other time-correlated, ascending value) column, then significant performance gains are achieved if the data is sorted by this field and also typically queried with a WHERE clause on that column. This is because the system records a minimum and maximum value for each extent, providing a system-maintained range partitioning scheme. This allows the system to completely eliminate scanning an extent if the query includes a WHERE clause on that field limiting the results to a subset of extents.
MariaDB ColumnStore has its own query optimizer and execution engine distinct from the MariaDB server implementation. This allows for scaling out query execution to multiple servers, and to optimize for handling data stored as columns rather than rows. As such, the factors influencing query performance are very different:
A query is first parsed by the MariaDB server (mariadbd) process and passed through to the ColumnStore storage engine. This passes the request onto the PrimProc process, which is responsible for optimizing and orchestrating execution of the query. The PrimProc module's optimizer creates a series of batch primitive steps that are executed on all nodes in the cluster. Since multiple servers can be deployed, this allows for scale-out execution of the queries. The optimizer attempts to process query execution in parallel. However, certain operations inherently must be executed centrally, for example final result ordering. Filtering, joins, aggregates, and GROUP BY clauses are generally pushed down and executed in parallel in PrimProc on all servers. In PrimProc, batch primitive steps are performed at a granular level where individual threads operate on individual 1K-8K blocks within an extent. This enables a large multi-core server to be fully utilized and enables scale-up within a single server. The current batch primitive steps available in the system include:
Single Column Scan: Scan one or more Extents for a given column based on a single column predicate, including operators like =, <>, IN (list), BETWEEN, and ISNULL. See the first scan section of performance configuration for additional details on tuning this.
Additional Single Column Filters: Project additional columns for any rows found by a previous scan and apply additional single column predicates as needed. Access of blocks is based on row identifier, going directly to the blocks. See the additional column read section of performance configuration for additional details on tuning this.
The following items should be considered when thinking about query execution in ColumnStore vs a row based store such as InnoDB.
ColumnStore is optimized for large scale aggregation / OLAP queries over large data sets. As such indexes typically used to optimize query access for row based systems do not make sense since selectivity is low for such queries. Instead ColumnStore gains performance by only scanning necessary columns, utilizing system maintained partitioning, and utilizing multiple threads and servers to scale query response time.
Since ColumnStore only reads the necessary columns to resolve a query, only include the necessary columns required. For example, SELECT * is significantly slower than SELECT col1, col2 FROM tbl.
Datatype size is important. If, say, you have a column that can only have values 0 through 100, then declare it as a TINYINT, as this will be represented with 1 byte rather than 4 bytes for INT. This reduces the I/O cost by a factor of 4.
For string types, an important threshold is CHAR(9) and VARCHAR(8) or greater. Each column storage file uses a fixed number of bytes per value. This enables fast positional lookup of other columns to form the row. Currently, the upper limit for inline columnar storage is 8 bytes. So, for strings longer than this, the system maintains an additional 'dictionary' extent where the values are stored, and the columnar extent file then stores a pointer into the dictionary. For example, it is more expensive to read and process a VARCHAR(8) column than a CHAR(8) column. Where possible, you get better performance if you can utilize shorter strings, especially if you avoid the dictionary lookup. All TEXT/BLOB data types in ColumnStore 1.1 onward utilize a dictionary and do a multiple-block 8KB lookup to retrieve that data if required. The longer the data, the more blocks are retrieved and the greater the potential performance impact.
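To make the sizing guidance concrete, here is a hypothetical table definition (all names are illustrative only): a TINYINT is used for a 0-100 value instead of INT, and the short code stays within CHAR(8) so it avoids a dictionary extent, while the longer description will use one:
CREATE TABLE measurements (
  sensor_pct   TINYINT,      -- values 0-100: 1 byte per value instead of 4 for INT
  status_code  CHAR(8),      -- fits in 8 bytes: stored inline, no dictionary lookup
  status_desc  VARCHAR(60)   -- longer string: stored via a dictionary extent
) ENGINE=ColumnStore;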
In a row-based system, adding redundant columns adds to the overall query cost, but in a columnar system a cost is only incurred if the column is referenced. Therefore, additional columns can be created to support different access paths. For instance, store a leading portion of a field in one column to allow for faster lookups, and additionally store the long-form value as another column. Scans on the shorter code or leading-portion column are faster.
ColumnStore distributes function application across all nodes for greater performance, but this requires a distributed implementation of the function in addition to the MariaDB server implementation. See the documentation on distributed functions for the full list.
Hash joins are utilized by ColumnStore to optimize for large scale joins, avoiding the need for indexes and the overhead of nested loop processing. ColumnStore maintains table statistics so as to determine the optimal join order. This is implemented by first identifying the small table side (based on extent map data) and materializing the necessary rows from that table for the join. If the size of this is less than the configuration setting PmMaxMemorySmallSide, the join is pushed down into PrimProc for distributed in-memory processing. Otherwise, the larger side's rows are not processed in a distributed manner for joining, and only the WHERE clause on that side is executed across all PrimProc modules in the cluster. If the join is too large for memory, disk-based joins can be enabled to allow the query to complete.
Similarly to scalar functions, ColumnStore distributes aggregate evaluation as much as possible. However, some post-processing is required to combine the final results. Enough memory must exist to handle queries with a very large number of values in the aggregate columns.
Aggregation performance is also influenced by the number of distinct aggregate column values. Generally, the same number of rows with 100 distinct values computes faster than 10000 distinct values. This is due to increased memory management as well as transfer overhead.
ORDER BY and LIMIT are implemented at the very end by the mariadbd server process on the temporary result set table. This means that the unsorted results must be fully retrieved before either are applied. The performance overhead of this is minimal on small to medium results, but for larger results, it can be significant.
Subqueries are executed in sequence: the subquery's intermediate results must be materialized, and then the join logic is applied with the outer query.
Window functions are executed as part of final aggregation in PrimProc due to the need for ordering of the window results. The ColumnStore window function engine uses a dedicated, faster sort process.
Automated system partitioning of columns is provided by ColumnStore. As data is loaded, the system captures and maintains min/max values of the column data in each extent. New rows are appended to each extent until it is full, at which point a new extent is created. For column values that are ordered or semi-ordered, this allows for very effective data partitioning. By using the min and max values, entire extents can be eliminated and not read when filtering data. This generally works particularly well for time dimension / series data or similar values that increase over time.
Table Level Filters: Project additional columns as required for any table level filters such as column1 < column2, or more advanced functions and expressions. Access of blocks is again based on row identifier, going directly to the blocks.
Project Join Columns for Joins: Project additional join columns as needed for any join operations. Access of blocks is again based on row identifier, going directly to the blocks. See the join tuning section of performance configuration for additional details on tuning this.
Execute Multi-Join: Apply one or more hash join operation against projected join columns, and use that value to probe a previously built hash map. Build out tuples as needed to satisfy inner or outer join requirements. See the multi-table join section of performance configuration for additional details on tuning this.
Cross-Table Level Filters: Project additional columns from the range of rows for the Primitive Step as needed for any cross-table level filters such as table1.column1 < table2.column2, or more advanced functions and expressions. Access of blocks is again based on row identifier, going directly to the blocks.
Aggregation/Distinct Operation Part 1: Apply any local group by, distinct, or aggregation operation against the set of joined rows assigned to a given Batch Primitive. Part 1 of this process is handled by PrimProc.
Aggregation/Distinct Operation Part 2: Apply any final group by, distinct, or aggregation operation against the set of joined rows assigned to a given Batch Primitive. This processing is handled by PrimProc. See the memory management section of performance configuration for additional details on tuning this.
This page is: Copyright © 2025 MariaDB. All rights reserved.