Consistency, Fault Tolerance, and Availability with MariaDB Xpand
Many distributed databases have embraced eventual consistency over strong consistency to achieve scalability. However, eventual consistency comes at the cost of increased complexity for the application developer, who must handle anomalies that can arise from inconsistent data.
Xpand provides a consistency model that can scale using a combination of intelligent data distribution, multi-version concurrency control (MVCC), and Paxos. Our approach enables Xpand to scale writes, scale reads in the presence of write workloads, and provide strong ACID semantics. For an in-depth explanation of how Xpand scales reads and writes, see "MVCC (Multi-Version Concurrency Control) with MariaDB Xpand".
Xpand takes the following approach to consistency:
Synchronous replication within the cluster. All nodes participating in a write must provide an acknowledgment before a write is complete. Writes are performed in parallel.
The Paxos protocol is used for distributed transaction resolution.
Xpand supports the Read Committed and Repeatable Read (Snapshot) isolation levels, with limited support for Serializable.
Multi-Version Concurrency Control (MVCC) allows for lockless reads and ensures that writes will not block reads.
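To illustrate the last point, here is a minimal MVCC sketch (illustrative only, not Xpand's implementation): writers append new versions, and readers pick the newest version visible to their snapshot, so a concurrent write never blocks a read.

```python
# Minimal MVCC sketch: versioned values keyed by commit timestamp.
class MVCCStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value)
        self.ts = 0

    def write(self, key, value):
        """Append a new version; existing readers are unaffected."""
        self.ts += 1
        self.versions.setdefault(key, []).append((self.ts, value))
        return self.ts

    def read(self, key, snapshot_ts):
        """Return the newest version committed at or before snapshot_ts."""
        visible = [(t, v) for t, v in self.versions.get(key, []) if t <= snapshot_ts]
        return max(visible)[1] if visible else None

store = MVCCStore()
store.write("x", "a")
snap = store.ts            # a reader takes its snapshot here
store.write("x", "b")      # a later write neither blocks nor affects the reader
assert store.read("x", snap) == "a"        # snapshot (repeatable-read) view
assert store.read("x", store.ts) == "b"    # newest committed view
```

The reader never acquires a lock; it simply filters versions by its snapshot timestamp, which is the essence of lockless reads under MVCC.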
Xpand provides fault tolerance by maintaining multiple copies of data across the cluster. By default, Xpand can accommodate a single node failure and automatically recover with no loss of data. The degree of fault tolerance (nResiliency) is configurable and Xpand can be set up to handle multiple node failures and zone failure.
When Xpand writes to a slice, multiple replicas of the slice are written to separate nodes to provide fault tolerance. Xpand writes all replicas for the slice in parallel. Before the write can complete, each participating node must provide an acknowledgment that its replica has been written.
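This parallel, all-acknowledge write path can be sketched as follows (a simplified model with a hypothetical `write_replica` call standing in for the real network write; node names are also hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def write_replica(node, slice_id, data):
    # Placeholder for the real network write; returns an acknowledgment.
    return (node, "ack")

def write_slice(nodes, slice_id, data):
    """Write all replicas of a slice in parallel; complete only after
    every participating node has acknowledged its replica."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(write_replica, n, slice_id, data) for n in nodes]
        acks = [f.result() for f in futures]   # block until every ack arrives
    if not all(status == "ack" for _, status in acks):
        raise RuntimeError("write incomplete: missing acknowledgment")
    return True

assert write_slice(["node1", "node2"], slice_id=4, data=b"row") is True
```

The key property is that the replicas are written concurrently (latency of one round trip, not one per replica), yet the write does not complete until every acknowledgment has arrived.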
Xpand uses the Paxos protocol to resolve distributed transactions.
The Paxos protocol allows the participating nodes to reach consensus, in a fault-tolerant manner, on whether a specific operation has been performed.
This approach to transaction resolution allows operations to be performed durably on all nodes.
In order to understand Xpand's availability modes and failure cases, it is necessary to understand our group membership protocol.
Group Membership and Quorum
Xpand nodes use a distributed group membership protocol. This protocol maintains a listing of two fundamental sets of nodes:
A static set of all known Xpand nodes.
A dynamic set of all Xpand nodes that can currently communicate with each other.
Xpand cannot remain operational unless more than half the nodes in the static set are able to communicate with each other, which forms a quorum.
When half or fewer of the nodes can communicate with each other, Xpand loses the quorum and becomes non-operational. This prevents a split-brain scenario. Consider a deployment where MariaDB MaxScale routes queries between six Xpand nodes. A network partition between the Xpand nodes can separate them into two sets of three nodes:
Neither set has more than half of the nodes, so neither one can form a quorum. Both sets become non-operational.
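The quorum rule reduces to a strict-majority test over the static membership set, which can be sketched as:

```python
def has_quorum(partition_size, total_nodes):
    """A partition is operational only if it holds a strict majority
    of all known nodes in the static membership set."""
    return partition_size > total_nodes // 2

# Six-node cluster split 3/3 by a network partition: neither side has a
# strict majority, so both become non-operational (no split brain).
assert not has_quorum(3, 6)
# With five nodes, a 3/2 split leaves the three-node side operational.
assert has_quorum(3, 5)
assert not has_quorum(2, 5)
```

This is also why an odd node count is preferable: any partition of an odd-sized cluster leaves exactly one side with a strict majority, so at most one side can continue operating.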
To avoid problems establishing quorum during a network partition, it is recommended to deploy an odd number of Xpand nodes. If you deploy nodes in multiple zones, it is recommended to use an odd number of zones.
Xpand's default configuration finds a balance between optimal fault tolerance and optimal performance by only tolerating a single node failure or a single zone failure. However, Xpand can be configured to tolerate more failures by setting MAX_
Consider the example in which MariaDB MaxScale routes queries between five Xpand nodes where two of the Xpand nodes have been separated from the rest by a network partition:
In the example, each Xpand node contains two replicas of a slice from a table. One replica is designated as the ranking replica. The other replica is maintained for fault tolerance.
The network partition causes the following effects:
Slices 1 and 5 remain available and protected, because both replicas for each slice are still available.
Slice 2 remains available, but not protected, because only the non-ranking replica for the slice is still available. Xpand promotes the remaining replica to be the ranking replica for the slice. The Rebalancer creates a new non-ranking replica for the slice to restore fault tolerance.
Slice 3 remains available, but not protected, because only the ranking replica is still available. The Rebalancer creates a new non-ranking replica for the slice to restore fault tolerance.
Slice 4 is unavailable, because both the ranking and non-ranking replicas are unavailable.
Since Xpand cannot serve rows from slice 4, the table is only partially available. Xpand can still handle queries to the table, but it returns an error if any transaction attempts to access the unavailable slice.
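The per-slice outcomes in this example follow a simple rule, sketched below (replica placement here is hypothetical, mirroring the five-node example): a slice is protected if all of its replicas are reachable, available if at least one is, and unavailable if none are.

```python
def classify(replica_nodes, reachable):
    """Classify a slice by how many of its replicas are on reachable nodes."""
    alive = [n for n in replica_nodes if n in reachable]
    if len(alive) == len(replica_nodes):
        return "protected"
    return "available" if alive else "unavailable"

reachable = {"node1", "node2", "node3"}   # nodes 4 and 5 partitioned away
slices = {
    1: ["node1", "node2"],   # both replicas reachable
    3: ["node2", "node5"],   # only one replica reachable
    4: ["node4", "node5"],   # both replicas lost
}
assert classify(slices[1], reachable) == "protected"
assert classify(slices[3], reachable) == "available"
assert classify(slices[4], reachable) == "unavailable"
```

A query touching only slices 1 and 3 still succeeds; a transaction touching slice 4 receives an error, which is what makes the table partially available rather than fully unavailable.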
Xpand can provide availability even in the face of failure. In order to provide full availability, Xpand requires that:
A majority of nodes are able to form a cluster (i.e., quorum requirement).
The available nodes hold at least one replica for each set of data.
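The two requirements above can be combined into one check, sketched here with hypothetical node names and replica placement: full availability means the surviving group holds a quorum of the static node set and at least one replica of every slice.

```python
def fully_available(total_nodes, reachable, slice_replicas):
    """Full availability requires a strict-majority quorum AND at least
    one reachable replica for every slice of data."""
    quorum = len(reachable) > total_nodes // 2
    coverage = all(any(n in reachable for n in nodes)
                   for nodes in slice_replicas.values())
    return quorum and coverage

slice_replicas = {1: ["node1", "node2"], 2: ["node2", "node3"]}
assert fully_available(5, {"node1", "node2", "node3"}, slice_replicas)
# Quorum alone is not enough if some slice has lost all of its replicas:
assert not fully_available(5, {"node1", "node4", "node5"}, slice_replicas)
```

When the quorum condition holds but coverage does not, the cluster stays operational in a partially available state, as in the slice 4 example above.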