ColumnStore Storage Architecture
Contents
Storage architecture
When a table is created on MariaDB ColumnStore, the system creates at least one file per column for the table. So, for example, a table created with three columns would have at a minimum, three separately-addressable logical objects created on a SAN or a server’s disk drives.
- Each column is stored independently in 8M rows logical measure called an Extent. An extent for 1 byte datatypes consumes 8MB; 2 byte datatypes consume 16MB; 4 byte datatypes consume 32MB; 8 byte datatypes consume 64MB; and variable size datatypes consume 64MB. Once an extent becomes full, a new extent is automatically created.
- An Extent is physically stored as a collection of blocks.
- A block is 8K bytes. Every database block in a database is uniquely identified by its Logical Block Identifier (LBID)
- String columns > 8 characters store indexes in the main column file and actual values in separate dictionary files.
- A segment file is a physical file on disk that holds a column’s data. Once a segment file reaches its maximum number of extents, a new segment file is automatically created. The max number of extents in a segment file is configured as ExtentsPerSegmentFile in ColumnStore.xml - It should be set to multiple of Db Roots. Default value is 4.
- Collectively, all the column's segment files for one or more extents form a partition - this is the horizontal partitioning in ColumnStore.
- Partitions are stored in a hierarchical structure organized by segments (i.e. folders).
- MariaDB ColumnStore meta-store maps file structure/location to DB schema as well as information used for partitioning. The maximum number of files per partition is configured as FilesPerColumnPartition in ColumnStore.xml. Default value is 2.
- By default the data is compressed.
Extent Map
MariaDB ColumnStore utilizes a smart structure known as the Extent Map which provides logical range partitioning and removes the need for indexing, manual table partitioning, materialized views, summary tables, and other structures and objects that row-based databases must implement for query performance.
An extent is a logical block of space that exists within a physical segment file, and is anywhere from 8-64MB in size. Each extent supports the same number of rows, with smaller data types using less space on disk.
The Extent Map catalogs all extents and their corresponding blocks (LBID’s) along with minimum and maximum values for a column’s data within an extent.
A master copy of the Extent Map exists on the primary Performance Module. Upon system startup, the file is read into memory and then physically copied to all other participating User and Performance Modules for disaster recovery and failover purposes. All nodes keep the Extent Map resident in memory for quick access. As extents are modified, updates are broadcast to all participating nodes as well.
How Extent Map works
Extent Map provides the ability to do logical range partitioning and the ability to only retrieve the blocks needed to satisfy a query. This is accomplished via what is called "extent elimination" in MariaDB ColumnStore. 'Extent Elimination' is the process of eliminating those extents for the purpose of I/O operation that do not meet the join and filter condition of a query.
Extent elimination is first accomplished in MariaDB ColumnStore via only scanning columns in join and filter conditions. Then the logical horizontal partitioning information of each extent along with the minimum and maximum value for the column is used for further elimination. To eliminate an extent when a column scan involves a filter, that filter value is compared to the minimum and maximum values stored in each extent for the column: If the filter value is outside of a extent's min-max value range, the extent is eliminated.
This automatic extent elimination behavior is well suited for series, ordered, patterned, or time-based data where data is loaded frequently and often referenced by time. Any column with clustered values is a good candidate for extent elimination.
Example:
Compression with real-time decompression
Columnar storage offers excellent compressibility because similar data is stored within each column file. Most data sets will show excellent compression rates, saving between 65% and 95% of space, however the actual space savings depends on the randomness of the data and the number of distinct values that exist.
MariaDB ColumnStore's compression strategy is optimized for read performance from disk while still offering excellent compression. It is tuned to accelerate the decompression rate, maximizing the performance benefits when reading from disk. This allows for systems that are I/O bound when reading from disk to improve performance.
Compression mode
Compression is on by default, but can be turned on or off at the table level or the column level and in addition can be controlled at a session level by setting a variable: infinidb_compression_type When turned on, MariaDB ColumnStore uses snappy compression.
Version Buffer
MariaDB ColumnStore utilizes a structure termed a Version Buffer to store the disk blocks which are being modified. It is also used to manage rollback activities for transactions as well as service the MVCC (multi-version concurrency control) or “snapshot read” function of the database so that it can offer a query consistent view of the database. All statements in MariaDB ColumnStore run at a particular version (or snapshot) of the database, which the system refers to as the System Change Number (SCN). Although it is called the Version Buffer, it is composed of both memory and disk structures.
How the Version Buffer files work
The Version Buffer utilizes in-memory hash tables to supply memory access to in-flight transaction information. The initial size upon startup is 4MB with the memory region growing from that amount to handle blocks that are being modified by a transaction. Each entry in the hash table is a 40-byte reference to an 8K block that is being modified.
The number of rows being updated is not the limiting factor for the Version Buffer, rather the number of disk blocks that are being updated. The size can be increased but caution should be used since updating more disk blocks can allow the update/delete statements to run for long periods of time and if a problem is encountered a rollback would also take a long period of time.
The Version Buffer files default to a 1GB-sized file on each DBRoot, which is configurable via the VersionBufferFileSize parameter. The Version Buffer files are spread across each DBRoot in the system.
NOTE: While MariaDB ColumnStore on HDFS has not been tested in the current release, it inherits this behavior from InfiniDB when configured to run as a Hadoop query engine over HDFS. In this configuration the MVCC function (and hence usage of the Version Buffer files) is disabled. HDFS is a write-only file system and hence block-level versioning in the MVCC model is not operationally practical. MariaDB ColumnStore supports statement level tracking and rollback capability for DML operations when running over HDFS. Queries over HDFS operate in more of an “eventual consistency” mode where queries will converge to the proper result at a point in time when updates are completed, but queries issued concurrently with updates will use whatever block happens to be current at the time the block-level execution occurs.
Transaction Log
MariaDB ColumnStore supports logging committed transactions to MariaDB Server's binary log.