How MariaDB ColumnStore Handles Big Data Workloads – Storage Architecture

Storage Overview

In this blog post, I will outline MariaDB ColumnStore’s storage architecture, which is designed to handle large datasets and to scale out across multiple nodes as your data grows.

 

[Figure: ColumnStore physical storage hierarchy – Columns, Partitions, Segments, Extents and Blocks]

  • Columns are the unit of storage, which is a key differentiator from a row-based storage engine such as InnoDB. A columnar system stores data per column rather than per row.
  • Partitions are used to store the data for a Column. Within MariaDB ColumnStore, a Partition is a logical concept for a grouping of Segments (default 4 per Partition).
  • A Segment is a storage file belonging to a Partition and containing a number of Extents (default 2). The system creates Segment files as needed.
  • An Extent is a collection of 8 million values for a given Column stored within a Segment. An Extent is made up of many Blocks.
  • A Block stores 8KB of data and is the unit of disk I/O.
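The hierarchy above implies some simple storage arithmetic. As a minimal sketch (not ColumnStore source code), here is what one Extent costs for a fixed-width column, using the defaults just described:

```python
# Illustrative sketch: derive how many Blocks and bytes one Extent
# occupies for a fixed-width column, using the defaults described above:
# 8 million values per Extent, 8KB per Block.

VALUES_PER_EXTENT = 8 * 1024 * 1024  # 8M values per Extent (default)
BLOCK_SIZE = 8 * 1024                # 8KB per Block

def extent_footprint(column_width_bytes: int) -> dict:
    """Uncompressed size of one Extent for a column 1-8 bytes wide."""
    assert 1 <= column_width_bytes <= 8
    values_per_block = BLOCK_SIZE // column_width_bytes
    total_bytes = VALUES_PER_EXTENT * column_width_bytes
    return {"values_per_block": values_per_block,
            "blocks_per_extent": total_bytes // BLOCK_SIZE,
            "bytes_per_extent": total_bytes}

# A 4-byte INT column: 2048 values per Block, 4096 Blocks, 32MB per Extent.
print(extent_footprint(4))
```

Doubling the column width doubles the bytes per Extent but leaves the value count fixed, which keeps the row-ordinal arithmetic described below uniform across columns.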

How MariaDB Utilizes Storage

Within each Extent and Block, MariaDB ColumnStore stores column values sequentially using a fixed-length datatype between 1 and 8 bytes wide. For string types longer than 8 bytes, a separate Dictionary extent is created to store the unique string values; the column extent then stores pointers to the strings within the Dictionary extent.

Because the system utilizes fixed-length datatypes (within the primary extent), it is possible to map directly between columns belonging to the same row. For example, if we have row 234 in an extent for column ‘Name’, the query engine can easily read the value for row 234 in column ‘Amount’. This allows the required columns to be recombined efficiently into a query result row.
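The row-to-offset arithmetic can be sketched in a few lines; the function name is illustrative, not a ColumnStore API:

```python
# Hypothetical sketch of why fixed-width storage makes row lookup cheap:
# with every value the same width, row N of any column lives at a
# computable Block and offset, so no per-row index is needed.

BLOCK_SIZE = 8 * 1024  # 8KB per Block

def locate(row: int, width: int) -> tuple:
    """Return (block_number, byte_offset_in_block) for a row ordinal."""
    values_per_block = BLOCK_SIZE // width
    return (row // values_per_block, (row % values_per_block) * width)

# Row 234 of a 4-byte column vs. an 8-byte column: pure arithmetic
# yields the location of the same logical row in either column.
print(locate(234, 4))  # (0, 936)
print(locate(234, 8))  # (0, 1872)
```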

By default, column and dictionary values are compressed within Extent storage. This trades off CPU for reduced I/O, which benefits query response time. MariaDB ColumnStore utilizes the Snappy library (https://google.github.io/snappy/) which provides high decompression speeds with reasonable compression. Many columns have repeating or low cardinality values which will compress extremely well, in some cases up to 10x.
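The effect of low cardinality on compressibility is easy to demonstrate. ColumnStore compresses extents with Snappy; in this sketch, zlib from the Python standard library stands in purely to illustrate the effect:

```python
# Sketch of why low-cardinality (and clustered) columns compress so well.
# zlib is a stand-in for Snappy here; the qualitative effect is the same.
import random
import zlib

random.seed(0)
codes = [b"US", b"DE", b"FR", b"GB", b"JP", b"CN", b"IN", b"BR", b"CA", b"AU"]

# 100,000 2-byte country codes: low cardinality, random order.
values = [random.choice(codes) for _ in range(100_000)]
shuffled = b"".join(values)
# The same values clustered together, as a semi-ordered load tends to produce.
clustered = b"".join(sorted(values))

r_shuffled = len(shuffled) / len(zlib.compress(shuffled))
r_clustered = len(clustered) / len(zlib.compress(clustered))
print(f"shuffled:  {r_shuffled:.1f}x")
print(f"clustered: {r_clustered:.1f}x")  # clustering boosts the ratio further
```

The same data compresses far better once identical values are adjacent, which is why semi-ordered columns such as dates tend to see the highest ratios.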

Segment files are physically managed within a DBRoot directory. A DBRoot encapsulates a physical storage unit and is assigned to one physical PM server at a point in time. A DBRoot contains Segment files containing Extents. In the installation directory, each “data<N>” directory corresponds to the given DBRoot identified by N. The system automatically distributes data to the available DBRoots across servers.
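The distribution of data across DBRoots can be pictured with a simple round-robin placement. This is an illustrative sketch, not ColumnStore internals, and the segment file names are hypothetical; only the data&lt;N&gt; directory convention comes from the text above:

```python
# Illustrative sketch: spread new Segment files across available DBRoots
# round-robin so data ends up balanced across PM servers. File names are
# hypothetical; the data<N> directory naming mirrors the convention above.
from itertools import cycle

def place_segments(num_segments: int, dbroots: list) -> dict:
    """Assign hypothetical segment files to DBRoot data directories."""
    placement = {n: [] for n in dbroots}
    for seg, root in zip(range(num_segments), cycle(dbroots)):
        placement[root].append(f"data{root}/segment_{seg:03d}")
    return placement

# Eight segment files over three DBRoots land 3 / 3 / 2.
print(place_segments(8, [1, 2, 3]))
```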

MariaDB ColumnStore allows use of internal (local) or external storage. If external storage (e.g., SAN, GlusterFS or EBS in AWS) is utilized then the system provides automated failover in a multi-node deployment should a server fail. This is possible because the failed server’s storage can be remounted by another server. With internal storage, automated failover is not possible since a given server’s data is not replicated or available to another server.

Extent Maps & Horizontal Partitioning

The system maintains a persistent distributed data structure called an Extent Map which provides necessary metadata on Extents. This includes tracking the minimum and maximum column values within that Extent. This allows MariaDB ColumnStore to provide a simple but effective horizontal partitioning scheme. At query time, the optimizer can eliminate reading Extents that fall outside of the WHERE clause predicate for that column, for example:

[Figure: Extent Map minimum and maximum values for COL1]

If a query is executed with a WHERE clause filter of “COL1 BETWEEN 220 and 250”, then the system can eliminate COL1 Extents 1, 2 and 4 from being scanned, saving ¾ of the I/O and many comparison operations. This can extend to multiple columns, for example, a query WHERE clause filter of “COL1 BETWEEN 220 AND 250 AND COL2 < 10000” can be fulfilled entirely from the Extent Map since no rows can ever match this query based on the minimum and maximum values across both COL1 and COL2.
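The elimination logic itself is a simple range-overlap test. In this sketch, the Extent Map entries are hypothetical values chosen so that, as in the example above, only one of four Extents survives the filter:

```python
# Sketch of Extent Map elimination. The (min, max) pairs are hypothetical
# stand-ins; the logic is the point: an Extent can be skipped whenever its
# value range cannot overlap the WHERE-clause predicate.

# Hypothetical Extent Map entries for COL1: extent number -> (min, max)
extent_map = {1: (100, 200), 2: (301, 500), 3: (201, 300), 4: (501, 800)}

def extents_to_scan(lo: int, hi: int) -> list:
    """Keep only Extents whose [min, max] range overlaps [lo, hi]."""
    return [n for n, (mn, mx) in extent_map.items() if mx >= lo and mn <= hi]

# WHERE COL1 BETWEEN 220 AND 250: only Extent 3 survives; Extents 1, 2
# and 4 are eliminated from the scan without touching disk.
print(extents_to_scan(220, 250))  # [3]
```

A predicate on a second column simply intersects two such survivor lists, which is how a query can sometimes be answered entirely from the Extent Map.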

This works well for time series and other semi-ordered date- or time-derived columns. For example, consider order tracking. Each order has an order date and a ship date. In most systems, order creation corresponds to the order date, so this value increases with each inserted record. The ship date, in contrast, depends on item availability and therefore does not strictly increase with each record. However, ship dates are normally semi-ordered, i.e., a ship date will usually fall within days to weeks of the order date. Ship dates therefore also form a natural partition, with some possible overlap between Extents.

In addition, the system allows for bulk deletion by extent using Extent Map minimum and maximum values. In the order management case, the administrator can drop an entire extent using order date values. This allows for a simple information lifecycle management strategy by disabling or purging old lower value data to make room for current higher value data.
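The same min/max metadata drives bulk deletion. In this sketch the order-date ranges are hypothetical; the rule mirrors the text: an Extent whose entire range falls before the cutoff can be dropped wholesale, with no row-by-row delete:

```python
# Sketch of extent-level bulk deletion for lifecycle management.
# Extent date ranges are hypothetical illustrations.
from datetime import date

# Hypothetical Extent Map entries for an 'order_date' column.
order_date_extents = {
    1: (date(2015, 1, 3), date(2015, 3, 30)),
    2: (date(2015, 3, 28), date(2015, 6, 25)),
    3: (date(2015, 6, 24), date(2015, 9, 20)),
}

def droppable(cutoff: date) -> list:
    """Extents that can be purged outright: every row predates the cutoff."""
    return [n for n, (_, mx) in order_date_extents.items() if mx < cutoff]

# Purge everything strictly before 2015-06-01: Extent 1 qualifies;
# Extent 2 straddles the cutoff, so it must be kept.
print(droppable(date(2015, 6, 1)))  # [1]
```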

In summary, the storage model for MariaDB ColumnStore provides the following benefits:

  • Columnar storage is optimized for analytical queries that access a subset of columns and a majority of rows.
  • Online schema changes can be made to ColumnStore tables with no table locking.
  • Disk I/O is optimized by utilizing contiguous block storage where possible, allowing for efficient I/O streaming. Data compression further reduces the amount of I/O required.
  • Storage is automatically distributed across nodes to provide optimal distributed query execution and scale-out capabilities.
  • Bulk data loading can be run in parallel with concurrent query access.
  • An automated horizontal partitioning scheme lets the query optimizer avoid reading Extents that cannot possibly contain matching rows.
  • Data can be bulk deleted online using the same horizontal partitioning scheme, which simplifies data lifecycle management.

In subsequent blogs I will provide details on how MariaDB ColumnStore builds upon this storage architecture to provide high-speed data ingestion, scale-out query performance, and rich ANSI SQL support.