MariaDB Xpand Architecture

MariaDB Xpand provides ACID-compliant distributed SQL, high availability, fault tolerance, write scaling, and horizontal scale-out for transactional workloads. MariaDB Xpand is a component of MariaDB Enterprise.

Distributed SQL databases consist of multiple database nodes working together with each node storing and querying a subset of the data.

MariaDB Xpand is a distributed SQL database that scales efficiently for modern massive-workload web applications that require strong consistency and data integrity. MariaDB Xpand achieves strong consistency through synchronous writes to replicas. MariaDB Xpand is designed for large data-set transactional workloads that require high performance, such as online transactional processing (OLTP).

This page describes the architecture of MariaDB Xpand and the role of each component.

Benefits

  • ACID-compliant distributed SQL for modern web applications with massive workloads that require high performance and strong consistency, such as online transactional processing (OLTP)

  • Elastic scale-out and scale-in to adapt for changes in workload or capacity requirements

  • High Availability (HA) and fault tolerance

  • Sophisticated data distribution partitions data horizontally into slices and distributes multiple copies of each slice among nodes as replicas, which are managed automatically for data protection and load rebalancing

  • Read and write scaling, with a shared-nothing architecture

  • Parallel query evaluation for efficient execution of complex queries such as JOINs and aggregate operations

  • Columnar indexes can be used in Xpand 6:

    • Columnar indexes are best suited for queries that "scan" much or all of a table. This includes queries that perform aggregations on large amounts of data, queries that select a subset of columns from large tables, and queries that perform a full table scan of a fact table. Queries tend to match this pattern in analytical, online-analytical processing (OLAP), and data warehousing workloads.

    • With composite (multi-column) Columnar indexes, filtering does not depend on column order. If a Columnar index is defined with COLUMNAR INDEX (a, b, c) and a query filters with WHERE b = 1 AND c = 2, the Columnar index can still be used to filter the results (see the sketch after this list).

  • Optimized topologies:

    • Xpand Performance topology for maximum throughput and lowest latency

    • Xpand Storage Engine topology to maximize compatibility with MariaDB Enterprise Server, gaining capabilities like Window functions, CTEs, and cross-engine JOINs
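
The following sketch illustrates the composite Columnar index behavior referenced in the list above. The table, column, and index names are illustrative, and the COLUMNAR INDEX clause is shown as described in the list item; confirm the exact syntax against the Xpand 6 documentation.

CREATE TABLE test.columnar_example (
   id BIGINT PRIMARY KEY,
   a INT,
   b INT,
   c INT,
   COLUMNAR INDEX cx1 (a, b, c)
);

-- The filter skips column a, but the Columnar index can still be used:
SELECT COUNT(*) FROM test.columnar_example WHERE b = 1 AND c = 2;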

Topologies

MariaDB Xpand supports multiple topologies. The topologies described on this page are representative: MariaDB products can also be deployed to form other topologies, leverage advanced product capabilities, or combine the capabilities of multiple topologies.

Xpand Performance Topology


Xpand Performance topology delivers maximum throughput and lowest latency.

The Xpand Performance topology consists of:

  • One or more MaxScale nodes

  • Three or more Xpand nodes

The MaxScale nodes:

  • Monitor the health and availability of each Xpand node using the Xpand Monitor (xpandmon)

  • Accept client and application connections

  • Route queries to Xpand nodes using the Read Connection (readconnroute) or Read/Write Split (readwritesplit) routers

The Xpand nodes:

  • Receive queries from MaxScale

  • Store data in a distributed manner

  • Execute queries using parallel query evaluation

Multi-Zone Xpand Performance Topology

The Xpand Performance topology can also be deployed in a multi-zone environment:

  • Three or more zones

  • Each zone should be connected by low latency network connections

  • Each zone must have the same number of Xpand nodes

Xpand Storage Engine Topology


The Xpand Storage Engine topology combines Xpand's benefits with maximum compatibility with MariaDB Enterprise Server (ES). It allows Xpand users to leverage ES features such as window functions, CTEs, and cross-engine JOINs.

The Xpand Storage Engine topology consists of:

  • One or more MaxScale nodes

  • Three or more ES nodes

  • Three or more Xpand nodes

The MaxScale nodes:

  • Monitor the health and availability of each ES node using the MariaDB Monitor (mariadbmon)

  • Accept client and application connections

  • Route queries to ES nodes using the Read Connection (readconnroute) or Read/Write Split (readwritesplit) routers

  • Optionally monitor the health and availability of each Xpand node using the Xpand Monitor (xpandmon)

  • Optionally route Xpand-only queries (such as ALTER CLUSTER) to Xpand nodes using the Read Connection Router (readconnroute)

The ES nodes:

  • Receive queries from MaxScale

  • Route queries to Xpand nodes using the Xpand storage engine

  • Perform final query evaluation to support ES features, like window functions, CTEs, and cross-engine JOINs

The Xpand nodes:

  • Receive queries from ES nodes

  • Store data in a distributed manner

  • Execute queries on Xpand tables using parallel query evaluation

  • Optionally receive Xpand-only queries from MaxScale

Multi-Zone Xpand Storage Engine Topology

The Xpand Storage Engine topology can also be deployed in a multi-zone environment:

  • Three or more zones

  • Each zone should be connected by low latency network connections

  • Each zone must have the same number of Xpand nodes

Distributed SQL

MariaDB Xpand provides distributed SQL capabilities that support modern massive-workload web applications which require strong consistency and data integrity.

A distributed SQL database consists of multiple database nodes working together, with each node storing and querying a subset of the data. Strong consistency is achieved by synchronously updating each copy of a row.

MariaDB Xpand uses a shared-nothing architecture. In a shared-nothing distributed computing architecture, nodes do not share memory or storage. Xpand's shared-nothing architecture provides important benefits, including read and write scaling and storage scalability.

Data Distribution

MariaDB Xpand uses its shared-nothing architecture to distribute data among all nodes, which provides read and write scaling as well as storage scalability.

Independent Index Distribution

Xpand distributes data between nodes using independent index distribution:

  • Each table is divided into representations that correspond to the table's Primary Key and each secondary index. Each representation has its own distribution key based on the index.

  • Each representation is divided into logical slices by hashing the distribution key.

  • Each slice is stored as multiple physical replicas. Each replica for a slice is stored on a different node to ensure fault tolerance.

Since each index has its own representation, Xpand's query optimizer can support a broad range of distributed query evaluation plans.

Example Table with Multiple Representations

The following example table consists of a primary key and two indexes, so Xpand creates three representations:

CREATE TABLE test.xpand_dist_example (
   id BIGINT AUTO_INCREMENT PRIMARY KEY,
   col1 INT,
   col2 INT,
   col3 VARCHAR(64),
   KEY k1 (col2),
   KEY k2 (col3, col1)
);

Let's assume that our example table contains some data:

INSERT INTO test.xpand_dist_example
   (col1, col2, col3)
   VALUES
   (16, 36, "january"),
   (17, 35, "february"),
   (18, 34, "march"),
   (19, 33, "april"),
   (20, 32,  "may");

SELECT * FROM test.xpand_dist_example;
+----+------+------+----------+
| id | col1 | col2 | col3     |
+----+------+------+----------+
| 1  | 16   | 36   | january  |
| 2  | 17   | 35   | february |
| 3  | 18   | 34   | march    |
| 4  | 19   | 33   | april    |
| 5  | 20   | 32   | may      |
+----+------+------+----------+

In the following sections, this example table is used to show how Xpand distributes and balances data.

Representations

Xpand divides each table into a representation for each index on the table. Each representation has a different distribution key defined by the index associated with the representation. The Primary Key produces the base representation, which contains all columns in the table. Other indexes produce other representations, which contain the indexed columns. Slices of each representation contain different rows. Xpand hashes the distribution key to determine which slice a row belongs to.

The example table consists of a primary key and two secondary indexes, so it has three representations:

  • The primary key representation uses the primary key column as the distribution key. The slices for this representation contain all table columns.

  • The k1 representation uses the col2 column as the distribution key. The slices for this representation also contain the primary key column.

  • The k2 representation uses the col3 and col1 columns as the distribution key. The slices for this representation also contain the primary key column.

Xpand - Representations

Slices

Xpand horizontally divides each representation into slices:

  • The slice for a given row is determined by applying a consistent hashing algorithm on the distribution key.

  • Each representation is sliced independently.

  • Each representation of the same table can have a different number of slices.

  • By default, each representation starts with one slice per node. However, the initial number of slices for a table can be configured using the SLICES table option (see the sketch after this list).

  • If a slice gets too large, Xpand automatically splits the slice into multiple slices. The maximum slice size is based on the rebalancer_split_threshold_kb system variable. By default, the maximum size is 8 GB.
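
A minimal sketch of these options, assuming the SLICES table option and a MySQL-compatible SHOW VARIABLES statement; the table name is illustrative:

-- Start each representation of this table with six slices instead of the default:
CREATE TABLE test.xpand_slices_example (
   id BIGINT PRIMARY KEY,
   col1 INT,
   KEY k1 (col1)
) SLICES = 6;

-- Inspect the slice-split threshold (in KB; 8 GB by default):
SHOW VARIABLES LIKE 'rebalancer_split_threshold_kb';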

The example table consists of three representations, and each representation has a different number of slices:

Xpand - Slices

Replicas

Xpand writes slices to disk as replicas. Xpand maintains multiple copies of each slice.

  • In the default configuration, Xpand automatically maintains two replicas of each slice on disk. However, Xpand can be configured to maintain more replicas using the REPLICAS table option or by setting MAX_FAILURES.

  • Each replica for a given slice is stored on a different node.

  • Replicas are distributed among the nodes for redundancy and to balance reads, writes, and disk usage.

  • Xpand can make new replicas available online, without suspending or blocking writes to the slice.

The example table consists of two replicas for each slice with the default configuration:

Xpand - Replicas

When the REPLICAS table option is set to ALLNODES, Xpand maintains a replica of each slice on each node. This provides the maximum in fault tolerance and read performance, but at the expense of write performance and disk usage.
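
A minimal sketch of the REPLICAS table option discussed here; the table name is illustrative:

-- Maintain three replicas of each slice instead of the default two:
CREATE TABLE test.xpand_replicas_example (
   id BIGINT PRIMARY KEY,
   col1 INT
) REPLICAS = 3;

-- Maintain a replica of each slice on every node:
ALTER TABLE test.xpand_replicas_example REPLICAS = ALLNODES;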

When Xpand is configured to maintain additional physical replicas, Xpand's overhead increases for updates to logical slices, because each additional replica is updated synchronously when the slice is updated. Each additional replica requires the transaction to perform:

  • Additional network communication to transfer the updated data to the node hosting the replica

  • Additional disk writes to update the replica on disk

In some cases, the additional overhead can decrease throughput and increase commit latency, which can have negative effects for the application. If you choose to configure Xpand to maintain additional replicas, it is recommended to perform performance testing.

Consistent Hashing

Xpand uses consistent hashing for data distribution. Consistent hashing allows Xpand to dynamically redistribute data without rehashing the entire data-set.

Xpand hashes each distribution key to a 64-bit number. It then divides the space into ranges. Each range is owned by a specific slice. The table below illustrates how consistent hashing assigns specific keys to specific slices.

+-------+------------+------------+
| Slice | Hash Range | Key Values |
+-------+------------+------------+
| 1     | min-100    | A, D, G    |
| 2     | 101-200    | B, E       |
| 3     | 201-max    | C, F, H    |
+-------+------------+------------+

Xpand assigns each slice to a node based on data capacity and to balance the load for reads, writes, and disk usage:

Xpand - Consistent Hashing

If a slice grows larger than rebalancer_split_threshold_kb (by default, 8 GB), the Rebalancer automatically splits the slice into multiple slices and distributes the original rows between the new slices.

For example, let's say that slice 2 has grown larger than rebalancer_split_threshold_kb:

+-------+------------+---------------+---------------------+
| Slice | Hash Range | Key Values    | Size                |
+-------+------------+---------------+---------------------+
| 1     | min-100    | A, D, G       | 768 MB              |
| 2     | 101-200    | B, E, I, J, K | 8354 MB (too large) |
| 3     | 201-max    | C, F, H       | 800 MB              |
+-------+------------+---------------+---------------------+

Note the increased size of slice 2 in the diagram:

Xpand - Reslicing (Before)

The Rebalancer automatically detects that the split threshold has been exceeded, so it schedules a slice-split operation. During the slice-split operation, the hash range for slice 2 is divided into two ranges, which are associated with two new slices called slice 4 and slice 5. After the data is moved to slice 4 and slice 5, slice 2 is removed:

+-------+------------+------------+---------+
| Slice | Hash Range | Key Values | Size    |
+-------+------------+------------+---------+
| 1     | min-100    | A, D, G    | 768 MB  |
| 3     | 201-max    | C, F, H    | 800 MB  |
| 4     | 101-150    | B, E       | 4670 MB |
| 5     | 151-200    | I, J, K    | 4684 MB |
+-------+------------+------------+---------+

The Rebalancer does not modify slice 1 or slice 3. The Rebalancer only modifies the slices being split, which allows for very large data reorganizations to proceed in small chunks.

Xpand - Reslicing (After)

Fault Tolerance

MariaDB Xpand is fault tolerant. By default, MariaDB Xpand is configured to tolerate a single node failure or a single zone failure.

In the default configuration, Xpand automatically maintains two replicas, or copies, of each slice on disk. Therefore, if a single node fails, no slices are lost, because at least one replica of each slice is stored on other nodes. As long as Xpand has sufficient replicas and a quorum of nodes is available, it can lose a single node without experiencing any data loss.

Maximum Failures

MariaDB Xpand's fault tolerance is configurable. By default, MariaDB Xpand is configured to tolerate a single node failure or a single zone failure. If the default configuration does not provide sufficient fault tolerance for your needs, Xpand can be configured to provide even more fault tolerance.

To configure Xpand to provide a higher level of fault tolerance, you can configure the maximum number of node or zone failures using the ALTER CLUSTER SET MAX_FAILURES statement.
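
For example, a sketch using the ALTER CLUSTER SET MAX_FAILURES statement named above to tolerate two simultaneous node or zone failures:

ALTER CLUSTER SET MAX_FAILURES = 2;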

When the maximum number of failures is increased, Xpand maintains additional replicas for each slice, ensuring that Xpand can handle a greater simultaneous loss of nodes or zones without experiencing data loss.

When Xpand is configured to maintain additional physical replicas, Xpand's overhead increases for updates to logical slices, because each additional replica is updated synchronously when the slice is updated. Each additional replica requires the transaction to perform:

  • Additional network communication to transfer the updated data to the node hosting the replica

  • Additional disk writes to update the replica on disk

In some cases, the additional overhead can decrease throughput and increase commit latency, which can have negative effects for the application. If you choose to configure Xpand to maintain additional replicas, it is recommended to perform performance testing.

Zones and Fault Tolerance

Xpand can be configured to be zone-aware. A zone is an arbitrary grouping of Xpand nodes. Different zones could represent different availability zones in a cloud environment, different server racks, different network switches, and/or different power sources. Different zones should not be used to represent different regions, because the distance between regions tends to cause higher network latency, which can negatively affect performance.

To configure Xpand to be zone-aware, you can configure the zone for each node using the ALTER CLUSTER ... ZONE statement.
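
A sketch of zone assignment based on the statement named above; the node IDs and zone numbers are illustrative:

-- Assign each node (by node ID) to a zone:
ALTER CLUSTER 1 ZONE 1;
ALTER CLUSTER 2 ZONE 2;
ALTER CLUSTER 3 ZONE 3;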

When Xpand is configured to use zones, Xpand's default zone-aware configuration can tolerate a single zone failure without experiencing any data loss. However, Xpand can be configured to tolerate more failures by setting MAX_FAILURES.

Effects of a Node or Zone Failure

When Xpand experiences a node or zone failure:

  1. A node or zone can no longer communicate with other nodes, so it fails a heartbeat check.

    Xpand with Failed Node
  2. A short group change occurs, during which Xpand updates membership. Once this is complete, Xpand resumes processing transactions.

  3. The Rebalancer starts a timer that counts down from the global rebalancer_reprotect_queue_interval_s value (default = 10 minutes).

  4. The Rebalancer establishes a reprotect queue and tracks all pending changes for the failed node's or zone's data in that queue. The reprotect queue is needed only if the failure is temporary.

  5. The next steps depend on whether the failed node or zone returns before the timer exceeds the rebalancer_reprotect_queue_interval_s value.

    If the failed node or zone returns within the interval:

    1. The Rebalancer applies the transactions in the reprotect queue to the returned node or zone.

    2. Operations resume normally.

    If the failed node or zone does not return within the interval:

    1. The Rebalancer discards the queued transactions.

    2. The Rebalancer reprotects slices from the failed node or zone by creating new replicas to replace the ones that were lost. If the failed node(s) contained any ranking replicas, Xpand assigns that role to another replica.

      Xpand Rebalancer
    3. When the reprotect process completes, Xpand sends an Email Alert indicating that full protection has been restored and writes an entry to query.log.

    4. The failed/unavailable node(s) can be safely removed with ALTER CLUSTER DROP.
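
For example, assuming node ID 4 is the failed node, a sketch of removing it with the statement named above:

ALTER CLUSTER DROP 4;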

Group Change Effects on Transactions

If a transaction is interrupted by a group change or encounters a non-fatal error, Xpand automatically retries the transaction in some cases.

Transactions will only be retried if the global value of the autoretry system variable is set to TRUE.
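
For example, a sketch that enables automatic retry by setting the global variable named above:

SET GLOBAL autoretry = TRUE;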

The following types of transactions can be retried:

  • A single statement transaction executed with autocommit = 1

  • The first statement in an explicit transaction

If Xpand retries a transaction and the transaction fails, Xpand returns an error to the application.

The following types of transactions cannot be retried:

  • Subsequent statements in an explicit transaction

  • Stored procedures

  • Stored functions

If Xpand cannot retry the transaction and your application connects through MaxScale's Read/Write Split (readwritesplit) router, you can configure MaxScale to automatically retry some transactions by configuring the delayed_retry and transaction_replay parameters.

Group Change Effects on Connections

When a group change occurs, connections can be affected:

  • If a connection was opened to a node still in quorum, the connection will remain open after the new group is formed with the available nodes.

  • If a connection was opened to a node that is no longer in quorum, the connection will be lost.

When the connection is lost, there are a couple of ways to automatically re-establish a connection to a valid node:

  • If your application is connecting to MaxScale's Read/Write Split (readwritesplit) router, you can configure MaxScale to automatically reconnect by configuring the master_reconnection parameter.

  • If your application is using a MariaDB Connector that supports automatically reconnecting, you can enable that feature in the connector.

    • If you are using MariaDB Connector/C, the MYSQL_OPT_RECONNECT option can be set with the mysql_optionsv() function:

      /* enable auto-reconnect */
      mysql_optionsv(mysql, MYSQL_OPT_RECONNECT, (void *)"1");
      
    • If you are using MariaDB Connector/J, the autoReconnect parameter can be set for the connection:

      Connection connection = DriverManager.getConnection("jdbc:mariadb://192.168.0.1/test?user=test_user&password=myPassword&autoReconnect=true");
      
    • If you are using MariaDB Connector/Python, the auto_reconnect parameter can be set for the connection:

      conn = mariadb.connect(
         user="test_user",
         password="myPassword",
         host="192.168.0.1",
         port=3306,
         auto_reconnect=True)
      

Rebalancer

MariaDB Xpand Rebalancer automatically maintains a healthy distribution of data:

  • Automated: Xpand Rebalancer automatically controls data distribution in the background.

  • Available: Xpand Rebalancer distributes data with minimal interruption to user operations, so that your database and application can remain available.

  • Protects Data: Xpand Rebalancer responds to unexpected node and zone failures. If a slice does not have enough replicas after a failure, the rebalancer will create new replicas on the remaining nodes.

  • Re-slices: Xpand Rebalancer ensures that a table has an optimal number of slices, even if the data set or load changes.

  • Prevents Load Imbalances: Xpand Rebalancer ensures that the slices are fairly distributed between nodes, which allows the load to be fairly distributed between nodes.

  • Prevents Storage Imbalances: Xpand Rebalancer ensures that the slices are fairly distributed between nodes, which allows the data set to be fairly distributed between nodes.

  • Enables Node Maintenance: Xpand allows nodes to be marked as soft-failed in preparation for maintenance or decommissioning. When a node is soft-failed, the rebalancer redistributes the node's replicas between the remaining nodes. You can monitor soft-failing nodes by querying the system.softfailing_containers table, as shown in the sketch below.
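
A monitoring sketch based on the system table named above. The ALTER CLUSTER SOFTFAIL syntax shown here is an assumption and should be confirmed against the ALTER CLUSTER documentation; the node ID is illustrative.

-- Mark node 4 as soft-failed in preparation for maintenance (assumed syntax):
ALTER CLUSTER SOFTFAIL 4;

-- Watch the replicas still being moved off soft-failed nodes:
SELECT * FROM system.softfailing_containers;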

For additional information, see "Xpand Rebalancer".

High Availability

MariaDB Xpand is highly available.

Group Membership and Quorum

Xpand nodes use a distributed group membership protocol. This protocol maintains a listing of two fundamental sets of nodes:

  • A static set of all known Xpand nodes.

  • A dynamic set of all Xpand nodes that can currently communicate with each other.

Xpand cannot remain operational unless more than half the nodes in the static set are able to communicate with each other, which forms a quorum.

When half or fewer of the nodes in the static set can communicate with each other, Xpand loses quorum and becomes non-operational. This prevents a split-brain scenario. Consider a deployment where MariaDB MaxScale routes queries between six Xpand nodes. A network partition between the Xpand nodes can separate them into two sets of three nodes:

Xpand with No Quorum

Neither set has more than half of the nodes, so neither one can form a quorum. Both sets become non-operational.

To avoid problems establishing quorum during a network partition, it is recommended to deploy an odd number of Xpand nodes. If you deploy nodes in multiple zones, it is recommended to use an odd number of zones.

Partial Availability

Xpand's default configuration balances fault tolerance and performance by tolerating a single node failure or a single zone failure. However, Xpand can be configured to tolerate more failures by setting MAX_FAILURES. If the number of simultaneous failures exceeds the configured maximum, Xpand's data will only be partially available.

Consider the example in which MariaDB MaxScale routes queries between five Xpand nodes where two of the Xpand nodes have been separated from the rest by a network partition:

Xpand with Partial Availability

In the example, each Xpand node contains two replicas of a slice from a table. One replica is designated as the ranking replica. The other replica is maintained for fault tolerance.

The network partition causes the following effects:

  • Slices 1 and 5 remain available and protected, because both replicas for each slice are still available.

  • Slice 2 remains available, but not protected, because the non-ranking replica for the slice is still available. Xpand promotes the remaining replica to be the ranking replica for the slice. The Rebalancer creates a new non-ranking replica for the slice to maintain fault tolerance.

  • Slice 3 remains available, but not protected, because the ranking replica is still available. The Rebalancer creates a new non-ranking replica for the slice to maintain fault tolerance.

  • Slice 4 is unavailable, because both the ranking and non-ranking replicas are unavailable.

Since Xpand cannot serve rows from slice 4, the table is only partially available. Xpand can still handle queries to the table, but it returns an error if any transaction attempts to access the unavailable slice.

Strongly Consistent

MariaDB Xpand is strongly consistent.

Synchronous Replication

When Xpand writes to a slice, multiple replicas of the slice are written to separate nodes to provide fault tolerance. Xpand writes all replicas for the slice in parallel. Before the write can complete, each participating node must provide an acknowledgment that its replica has been written.

This approach to synchronous replication is highly scalable, and it helps to provide strong consistency.

Paxos Protocol

Xpand uses the Paxos protocol to resolve distributed transactions.

The Paxos protocol allows each participating node to reach a consensus on whether a specific operation has been performed in a fault tolerant manner.

This approach to transaction resolution allows operations to be performed durably on all nodes.

Transaction Isolation

Xpand supports the Repeatable Read transaction isolation level. The Repeatable Read isolation level allows a transaction to see a consistent snapshot of data for a given table throughout the lifetime of a transaction, even if the underlying data has been modified.

This approach to transaction isolation prevents concurrent transactions from interfering with each other, while still maintaining strong consistency.

For additional information, see "Xpand Transaction Isolation Levels".

Multi-Version Concurrency Control (MVCC)

Xpand implements a distributed Multi-Version Concurrency Control (MVCC) scheme. It supports lockless reads, 2-phase locking for writes, and online schema changes.

This approach to concurrency allows both reads and writes to scale, while still maintaining strong consistency.

For additional information, see "Xpand Multi-Version Concurrency Control (MVCC)".

Parallel Query Evaluation

MariaDB Xpand executes different parts of a query in parallel:

  • Linear Scalability: Regardless of the data set size and the number of nodes, Xpand executes single-row selects and inserts with at most one hop. This provides linear scalability of reads and writes.

  • Query Fragments: Xpand's query compiler divides each query into fragments, and its optimizer determines which node needs to execute each fragment.

  • Join Scalability: Xpand uses a query optimizer to minimize data movement while executing JOINs, because it knows exactly which node stores each row.

  • Massively Parallel Processing (MPP): Xpand uses a massively parallel processing strategy, which allows multiple nodes to execute different parts of a query concurrently.

For additional information, see "Xpand Query Evaluation".

Concurrency Control

MariaDB Xpand is designed to support mixed read-write workloads with a lot of concurrency:

  • Multi-Version Concurrency Control (MVCC): Xpand uses MVCC to ensure that read and write operations never interfere with each other.

  • Lockless Reads: Xpand is able to execute read queries without locks.

  • 2-Phase Locking (2PL) for Writes: Xpand resolves write conflicts using 2-phase locking.

  • Distributed Lock Manager: Xpand implements a distributed lock manager to scale write access to hot tables. Each node maintains a portion of the lock domain. No single node holds all of the lock information for Xpand.

  • Row Locks: Xpand uses row-level locks for transactions that only touch a few rows.

  • Lock Promotion: For statements that affect a significant portion of a table, Xpand will promote row-level locks to a table lock.

For additional information, see "Xpand Multi-Version Concurrency Control (MVCC)" and "Xpand Locking".

Backing up and Restoring

MariaDB Xpand provides SQL statements to handle backup and restore operations. These statements back up replicas to a remote server and restore them from it in parallel.

Backing Up Xpand Tables

Backing up Xpand tables is handled using the BACKUP statement. BACKUP can send data to a remote server in parallel using either FTP or SFTP.
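
A hedged sketch of a parallel backup over SFTP; the exact BACKUP syntax, credentials, host, and path shown here are illustrative and should be confirmed against the BACKUP statement documentation:

BACKUP test.* TO 'sftp://backup_user:backup_password@backup.example.com/xpand/nightly';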

Xpand BACKUP

Restoring Xpand Tables from Backup

Backups can be restored to Xpand tables using the RESTORE statement. RESTORE retrieves data from a remote server in parallel using either FTP or SFTP.
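
A companion sketch to the backup example above; the RESTORE syntax and URL are illustrative and should be confirmed against the RESTORE statement documentation:

RESTORE test.* FROM 'sftp://backup_user:backup_password@backup.example.com/xpand/nightly';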

Xpand RESTORE

Data Loading

MariaDB Xpand includes the clustrix_import tool, which can efficiently bulk load data into Xpand. It supports loading data from uncompressed SQL files. The SQL files can be generated from an existing database using mariadb-dump or similar tools. The tool pre-slices data, and it writes each slice to multiple nodes in parallel.

Using clustrix_import, you can efficiently bulk load data into Xpand:

Xpand importing data with clustrix_import