MariaDB Xpand Rebalancer

Overview

MariaDB Xpand's Rebalancer maintains healthy distribution of data within the cluster:

  • Xpand's Rebalancer automatically handles data distribution in the background

  • Xpand's Rebalancer distributes data with minimal interruption to user operations, so that your database and application can remain available

  • Xpand's Rebalancer responds to unexpected node and zone failures by creating new replicas on the remaining nodes

  • Xpand's Rebalancer re-slices data to ensure that a table has an optimal number of slices when the data set or load changes

  • Xpand's Rebalancer prevents load and storage imbalances by rebalancing data between nodes

The Xpand Rebalancer was designed to run automatically as a background process to rebalance data across the cluster. The following sections describe how the Rebalancer works. The default values for distribution and replicas are sufficient for most deployments and typically do not require changing.
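
The Rebalancer's behavior is controlled by a set of global variables, several of which are described in the sections below. As a rough sketch (the exact set of variables depends on the Xpand version), the rebalancer-related globals can be listed with standard SHOW VARIABLES syntax:

    -- List the Rebalancer-related global variables (names vary by Xpand version):
    SHOW VARIABLES LIKE 'rebalancer%';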

The Xpand Rebalancer has been awarded two patents for distributing and slicing data.

Compatibility

Information provided here applies to:

  • MariaDB Xpand 5.3

  • MariaDB Xpand 6

A Healthy Cluster

In Xpand, user tables are vertically partitioned into representations, which are horizontally partitioned into slices. When a new representation is created, the system tries to determine distribution and placement of the data such that:

  • The representation has an appropriate number of slices.

  • The representation has an appropriate distribution key, to fairly balance rows across its slices, but still allow fast queries to specific slices.

  • Replicas are well distributed around the cluster on storage devices that are not overfull.

  • Replicas are distributed across zones (if configured).

  • Replicas are not placed on decommissioned nodes.

  • Reads from each representation are balanced across the representation's nodes.

Over time, representations can lose these properties as their data changes or cluster membership changes. This section describes the various situations that the Rebalancer is able to remedy.

Under-protection

By default, Xpand keeps two copies (replicas) of every slice. If an unexpected node failure makes one of the replicas unavailable, the slice will still be accessible through the remaining replica. When only one replica of a slice exists, the data on that slice is vulnerable to being lost in the event of an additional failure. The number of replicas per slice can be specified via the global variable MAX_FAILURES.
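
As a hedged example, the current value can be inspected with standard SHOW VARIABLES syntax; the second statement below is only a sketch, since some Xpand versions change MAX_FAILURES through ALTER CLUSTER rather than SET GLOBAL:

    -- Inspect the current fault-tolerance setting:
    SHOW VARIABLES LIKE 'max_failures';

    -- Sketch only: raise the number of simultaneous failures the cluster should
    -- tolerate, which increases the number of replicas kept per slice.
    -- Some Xpand versions require ALTER CLUSTER syntax instead of SET GLOBAL.
    SET GLOBAL max_failures = 2;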

When a slice has fewer replicas than desired, the Rebalancer will create a new copy of an existing replica on a different node. The most common cause is a node failing or otherwise becoming unavailable. Initially, the cluster will create Recovery Queues for that node's replicas so that they can be made up-to-date when that node returns to quorum. However, if the node is unavailable for an extended period of time, the Rebalancer will begin making copies of replicas from elsewhere in the cluster and will retire the Recovery Queues.

If a node becomes permanently unavailable, the cluster has reduced storage capacity. If there is not enough remaining storage capacity to make new replicas, the Rebalancer will not be able to do so, and the slices will remain under-protected. The cluster does not automatically reserve capacity for re-protecting slices.

Load Imbalance

If the slices of a representation are not well-distributed across the cluster, the Rebalancer will try to move them to more optimal locations.

The Rebalancer evaluates the placement of each representation independently, in the following manner:

  • Each slice of a representation is assumed to exert load proportional to its share of the representation's key-space. For example, if a slice holds 10% of the representation's index space, it is assumed to account for 10% of the representation's load as well. The Rebalancer considers that anticipated activity level when placing each replica of a slice.

  • The representation is well-distributed when the difference between the "most loaded" and "least loaded" nodes is minimal.

Consider the following examples of a representation with three equal-size slices, S1, S2, and S3. Each slice has two replicas distributed across a five-node cluster.

In a poor distribution of this representation, each slice is protected against the failure of a node, but the majority of the representation is stored on node 2.

In a good distribution, the Rebalancer has relocated replicas to improve cluster balance. Although node 1 has one more replica than the other nodes, no node is under-loaded.

When a Node is Too Full

If a node in the cluster is holding more than its share of table data, the Rebalancer will try to move replicas from that node to a less utilized node.

Before moving any replicas, the Rebalancer computes the load imbalance of the cluster's storage devices. If this imbalance is below a configurable threshold, the Rebalancer will leave things alone. This is to prevent the Rebalancer from making small, unnecessary replica moves.
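
The name of the threshold variable depends on the Xpand version, so as an illustrative sketch it can be located with a wildcard query against the rebalancer globals:

    -- Look for the Rebalancer's imbalance threshold setting (exact name varies by version):
    SHOW VARIABLES LIKE 'rebalancer%threshold%';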

Balancing storage utilization is a secondary priority to maintaining representation distribution. In some situations, this may result in less optimal storage device utilization in exchange for better representation distribution.

When a Slice is Too Big

Representations are partitioned into slices, each of which is assigned a portion of the representation's rows. The larger a slice becomes, the more expensive it is to move or copy across the system. If a slice becomes too large, the Rebalancer will split it into several new slices and distribute the original slice's rows among them. The maximum desired slice size is configurable, but by default, the Rebalancer will split slices that utilize more than 1 GiB of storage. (Utilized space is slightly larger than the size of the user data because of storage overhead.)

Too many slices can also be a problem: more slices means more metadata for the cluster to manage, which can make queries and group changes take longer. The Rebalancer will not reverse a split, so the usual recommendation is to err on the side of fewer slices and allow the Rebalancer to split if there is a question about how many slices a representation needs. Splits can be manually reversed with an ALTER statement to change the slicing of a representation.
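
For example, an explicit slice count can be set on a representation with an ALTER statement of the following form. This is only a sketch: the table name t1 is hypothetical, and the available slicing options depend on the Xpand version (see Slices):

    -- Sketch: force a hypothetical table t1 back to a specific number of slices.
    ALTER TABLE t1 SLICES = 3;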

Because rows are hash-distributed among slices, if one slice is approaching the split threshold, it is likely that the other slices of the representation will also need to be split.

The global rebalancer_split_threshold_mb determines when a slice needs splitting. That global may be overridden on a per-table or per-index basis via DDL. See Slices.
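
As a hedged example, the split threshold can be inspected, and changed cluster-wide, with standard global-variable syntax; the value below is illustrative only:

    -- Inspect the current split threshold (in MiB):
    SHOW VARIABLES LIKE 'rebalancer_split_threshold_mb';

    -- Sketch: raise the cluster-wide split threshold to 4 GiB.
    SET GLOBAL rebalancer_split_threshold_mb = 4096;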

Read Imbalance

Xpand reads exclusively from one replica of each slice, and that replica is designated as the ranking replica. This allows the Rebalancer to better manage data distribution and load for both write operations, which are applied simultaneously to all replicas, and read operations, which consistently use only the ranking replica.

Consider a five-slice representation distributed across a five-node cluster. From a write perspective, the representation is well distributed: each node is doing an even share of the work.

For read operations, Xpand designates one replica of each slice as the ranking replica and always reads from that replica; the Rebalancer places ranking replicas so that read load is also balanced across the cluster.

Since all replicas of a slice are identical, directing reads to a non-ranking replica would produce the same results. However, to better utilize each node's memory, Xpand consistently reads from the designated ranking replica. If a ranking replica becomes unavailable, another replica takes its place.

Decommissioned Nodes

When a node is to be removed from the cluster, the administrator can designate it as softfailed. This directs the Rebalancer to stop placing new replicas on that node, and the node is no longer considered when assessing storage imbalance. The Rebalancer will begin making additional copies of the softfailed node's replicas and placing them on other nodes. Once there are sufficient replicas elsewhere, the softfailed node can be removed from the cluster without loss of data protection.
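
As a hedged sketch (the node id 4 is hypothetical, and the exact statements may vary by Xpand version), softfailing and then removing a node typically looks like this:

    -- Mark node 4 as softfailed; the Rebalancer begins copying its replicas to other nodes.
    ALTER CLUSTER SOFTFAIL 4;

    -- Once the Rebalancer has finished re-protecting the data, remove the softfailed node.
    ALTER CLUSTER REFORM;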

Rebalancer Components

The Rebalancer is a system of several components:

  1. Information is gathered to build a Model of the cluster's state

  2. Tasks examine the Model and decide if action is necessary

  3. Tasks post operations to the Queue that will schedule them

  4. When an operation graduates from the Queue, it is applied to the cluster

  5. When an operation completes, the Model is updated to reflect the new state

Distribution Model

MariaDB Xpand's Rebalancer uses a Distribution Model to determine if the data distribution is healthy:

  • Xpand's Distribution Model contains metadata about the location of every replica

  • Each node contains a copy of the Distribution Model

  • The size of each replica is only known to the node storing the replica

  • Xpand's Rebalancer periodically polls all nodes to ensure the Distribution Model is current

  • Between polls, the Distribution Model can be somewhat out of date

  • If the Rebalancer detects an issue in the Distribution Model, it schedules a Task to correct the issue

Tasks

MariaDB Xpand's Rebalancer schedules Tasks to correct issues with the Distribution Model.

For additional information, see "Rebalancer Tasks for MariaDB Xpand".

Priority Queue

When MariaDB Xpand's Rebalancer schedules a Task, the Task is added to a Priority Queue:

  • Each Task has a priority

  • A queued Task can be overtaken by a higher-priority Task

  • Each Task is scheduled independently, but all Tasks are aware of other scheduled Tasks

  • Once a Task starts running, the Task is not interrupted, even if higher priority Tasks are queued

  • To prevent multiple Tasks from changing the same slices simultaneously, the Priority Queue limits slices to one Task at a time

  • To prevent multiple Tasks from overloading a node, the Priority Queue limits the number of tasks that can run on one node simultaneously

    • The limit is rebalancer_vdev_task_limit (1 by default)

  • To prevent many complex Tasks from overloading Xpand, the Priority Queue limits the number of simultaneous executions of some complex Tasks

    • The limit applies to Rebalance and Rebalance Distribution

    • The limit is rebalancer_rebalance_task_limit (2 by default)

  • To prevent many simultaneous Tasks from overloading Xpand, the Priority Queue limits the number of tasks that can run on all nodes simultaneously

    • The limit is rebalancer_global_task_limit (16 by default); these limits can be inspected as shown in the example following this list

  • Tasks can sometimes fail

    • The Distribution Model changes frequently, so some failures are expected even in normal operations

    • If a Task fails, the changes are reverted

    • The Rebalancer can retry the failed Task

    • The Rebalancer expects that most types of failures are transient, so Tasks that repeatedly fail are repeatedly retried
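
The task-limit globals named in this list can be inspected, and adjusted as a sketch, with standard global-variable syntax; the value below is illustrative only:

    -- List the Rebalancer task limits described above:
    SHOW VARIABLES LIKE 'rebalancer%task_limit';

    -- Sketch: allow more Rebalancer operations to run cluster-wide at once.
    SET GLOBAL rebalancer_global_task_limit = 32;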

Versioned Metadata

MariaDB Xpand maintains versioned metadata for each representation using a Multi-Version Concurrency Control (MVCC) scheme, similar to that used for table data:

  • Metadata changes are transactional, allowing them to be made without much locking or coordination

  • When a representation changes, each transaction sees a different version of the metadata

  • If a transaction started before the metadata changed, the transaction sees the version of the metadata that existed when the transaction began

  • If an error occurs during the operation, the entire change rolls back to the state in which it began

  • The location of each replica is considered to be part of that metadata:

    • The Rebalancer moves replicas between nodes using a series of DDL changes

    • Between each DDL change is an epoch, during which other transactions may start

    • The Rebalancer performs the replica move online, with limited disruption to new or running transactions as the operation proceeds.

Recovery Queues

MariaDB Xpand uses a Recovery Queue to record the changes to a replica when the replica is unavailable:

  • When a node fails, a Recovery Queue will record any changes to the node's replicas so they can be applied if the node comes back online before the rebalancer_reprotect_queue_interval_s interval has expired; the interval can be inspected as shown after this list

  • When a replica is being moved to a different node, a Recovery Queue will record any changes to the replica

  • A Recovery Queue is different from the Rebalancer's Priority Queue that is used to schedule tasks
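
As a hedged example, the reprotect queue interval can be inspected, and adjusted as a sketch, with standard global-variable syntax; the value below is illustrative only:

    -- Inspect how long the cluster waits before re-protecting a failed node's replicas:
    SHOW VARIABLES LIKE 'rebalancer_reprotect_queue_interval_s';

    -- Sketch: wait 20 minutes before the Rebalancer begins making new copies.
    SET GLOBAL rebalancer_reprotect_queue_interval_s = 1200;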