MariaDB Xpand Rebalancer

Xpand's Rebalancer maintains a healthy distribution of data, as a way of achieving fault tolerance, high availability, and optimal performance. It works automatically in the background without affecting application traffic.

Healthy Data Distribution

Xpand distributes data in sophisticated ways. It divides tables into representations. It divides each representation into slices. It provides high availability by automatically creating replicas of each slice on multiple nodes. Each slice has a single ranking replica. If a write operation updates a slice, Xpand updates every replica of that slice in a synchronous manner. If a read operation accesses a slice, Xpand only accesses the ranking replica of that slice.

The data distribution is "healthy" when it meets several characteristics:

  • Each slice should be protected with a sufficient number of replicas. The required number of replicas is determined by the REPLICAS table option or the MAX_FAILURES setting.

  • If zones are configured, each slice should have replicas distributed across zones.

  • No slice should have replicas on decommissioned (soft-failed) nodes.

  • No slice should have excess replicas, meaning more replicas than required by the REPLICAS table option or the MAX_FAILURES setting.

  • No slice should be too large, meaning larger than the value of the rebalancer_split_threshold_kb system variable.

  • No node or zone should receive a disproportionate number of reads, which would generally happen when ranking replicas are not evenly distributed between nodes or zones.

  • Each representation should have sufficient ranking replicas to distribute read requests evenly across the nodes.

  • No node or zone should receive a disproportionate number of writes, which would generally happen when replicas are not evenly distributed between nodes or zones.

  • Each representation should have sufficient replicas to distribute write requests evenly across the nodes.

The state of a database is constantly changing. The Xpand Rebalancer maintains a "healthy" data distribution by regularly checking the distribution and automatically rebalancing data to counteract data growth, load changes, node failures, and node maintenance.
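
For example, the replica requirement can be raised for a single table with the REPLICAS table option, or for the whole deployment with the MAX_FAILURES setting. The statements below are a minimal sketch; the table name and columns are placeholders, and the exact syntax should be confirmed against your Xpand version:

    -- Keep 3 replicas of every slice of this table (table and columns are
    -- illustrative placeholders).
    CREATE TABLE orders (
        id BIGINT PRIMARY KEY,
        total DECIMAL(10,2)
    ) REPLICAS = 3;

    -- Raise the cluster-wide fault-tolerance target so that every slice keeps
    -- enough replicas to survive two simultaneous node failures.
    ALTER CLUSTER SET MAX_FAILURES = 2;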

Rebalancer Components

Xpand Rebalancer uses several components to maintain a healthy distribution and balance of data.

Distribution Model

Xpand's Rebalancer uses a distribution model to determine if the data distribution is healthy.

Xpand maintains metadata about the location of every replica. Each node contains a copy of this metadata. However, the metadata does not include the size of each replica. The node storing a replica is the only node that knows its size. Replicas frequently grow and shrink. New replicas are frequently created and destroyed. Given these factors, the distribution model must be updated often. Otherwise, the distribution can become unhealthy, and performance can be impacted.

Xpand's Rebalancer updates its distribution model in two ways:

  • It regularly gathers information from each node to create a distribution model that represents the current state.

  • When a task is applied from the Rebalance Queue, the distribution model is updated to reflect the new state.

If the Rebalancer detects an issue in the distribution model, it schedules a task to correct the issue.

Tasks

Xpand's Rebalancer uses tasks to correct issues with the distribution model.

It supports the following tasks:

Priority | Rate         | Name                   | Fixes
-------- | ------------ | ---------------------- | ---------------------------------
High     | Aggressive   | Reprotect              | Missing replicas
High     | Aggressive   | Zone Balance           | Slice imbalance for a zone
High     | Moderate     | Softfail               | Slices on decommissioned hardware
High     | Moderate     | Reap                   | Extra replicas/queues
Medium   | Moderate     | Split                  | Large slices
Low      | Conservative | Rerank                 | Node/zone read imbalance
Low      | Conservative | Rerank Distribution    | Representation read imbalance
Low      | Conservative | Rebalance              | Node/zone usage imbalance
Low      | Conservative | Rebalance Distribution | Representation write imbalance

All tasks rely on the distribution model. All tasks are scheduled independently, but they are aware of other scheduled tasks.

Each task is designed to avoid conflicts with other tasks. Let's use the "Split" task as an example. When the Rebalancer splits a slice into two new slices, it considers various details from the distribution model to decide where the new slices should be located, and which nodes should contain the ranking replicas. It takes precautions to avoid conflicting with other tasks, such as:

  • It considers how the representation's other slices are distributed, so that it does not need to perform a "Rerank" or "Rerank Distribution" task later.

  • It considers how the load is distributed among nodes, so that it does not need to perform a "Rebalance" or "Rebalance Distribution" task later.

Priority Queue

When the Rebalancer schedules a task, it is added to a priority queue. The priority queue takes some precautions to avoid various issues, such as:

  • It limits slices to one task at a time, so that multiple tasks do not try to change the same slice simultaneously.

  • It limits the number of tasks that can run on one node simultaneously, so that many simultaneous tasks do not overload a node. This limit is configured by the rebalancer_vdev_task_limit system variable, which is 1 by default.

  • It limits the number of simultaneous executions of some complex tasks, such as "Rebalance" and "Rebalance Distribution", so that many complex tasks do not overload Xpand. This limit is configured by the rebalancer_rebalance_task_limit system variable, which is 2 by default.

  • It limits the number of tasks that can run on all nodes simultaneously, so that many simultaneous tasks do not overload Xpand. This limit is configured by the rebalancer_global_task_limit system variable, which is 16 by default.
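
These limits are global system variables, so they can be inspected and, where supported, adjusted at runtime. A minimal sketch, assuming SET GLOBAL is permitted for these variables in your Xpand version:

    -- Inspect the current task limits (defaults noted above: 1, 2, and 16).
    SHOW VARIABLES LIKE 'rebalancer%task_limit';

    -- Reduce the global limit, for example to lower Rebalancer overhead
    -- during peak application traffic.
    SET GLOBAL rebalancer_global_task_limit = 8;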

Each queued task has a priority, and a higher-priority task can move ahead of lower-priority tasks in the queue. However, once a task starts running, it is not interrupted, even if higher-priority tasks are queued.

Tasks can sometimes fail. The distribution model changes frequently, so some failures are expected even in normal operation. If a task fails, its changes are reverted. The Rebalancer may retry the failed task. It does not keep a record of past failures, so tasks that repeatedly fail may be repeatedly retried. This is normal, because most types of failures are expected to be transient.
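
Completed and failed Rebalancer activity can typically be reviewed through Xpand's system tables. The query below assumes the system.rebalancer_activity_log table that exists in some ClustrixDB/Xpand releases; verify the table name against your version before relying on it:

    -- Review recent Rebalancer activity, including any failed tasks that may
    -- be retried (system table name is an assumption; confirm it exists in
    -- your release).
    SELECT * FROM system.rebalancer_activity_log LIMIT 20;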

Reprotect Task

The Rebalancer's Reprotect task ensures that each slice is protected with a sufficient number of replicas:

  • It is a high priority task in the queue.

  • It is executed with an aggressive rate.

  • It runs at intervals defined by the task_rebalancer_reprotect_interval_ms system variable (by default, 15000 ms).

By default, Xpand maintains a minimum of two replicas for every slice. However, the minimum number of replicas can be increased by setting the REPLICAS table option or the MAX_FAILURES setting. If a node fails, the failed node's replicas become unavailable. If a slice no longer has the minimum number of replicas, it is considered to be under-protected. When a slice is under-protected, it may be lost if any additional nodes fail.

The Reprotect task corrects this issue by creating new replicas for an under-protected slice. However, there is a chance that the failed node will become available again. Therefore, the Reprotect task is not executed unless the failed node is unavailable for longer than the value of the rebalancer_reprotect_queue_interval_s system variable (by default, 600 seconds). During this interval, the Rebalancer maintains a reprotect queue that stores any changes made to the replicas on the failed node.

If the failed node comes back online before the interval expires, the changes in the reprotect queue are applied to the replicas on the node.

If the failed node does not come back online before the interval expires, the Rebalancer begins creating new replicas to replace the ones on the failed node, and it discards the reprotect queue.

A node failure reduces the storage capacity of the deployment. If the storage capacity falls below what Xpand needs to store new replicas, the Rebalancer does not create new replicas, and the slices remain under-protected.
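
The reprotect window itself is tunable. A minimal sketch, assuming the variable can be changed at runtime with SET GLOBAL:

    -- How long the Rebalancer waits for a failed node to return before
    -- creating replacement replicas (default 600 seconds, as noted above).
    SHOW VARIABLES LIKE 'rebalancer_reprotect_queue_interval_s';

    -- Shorten the window, for example when failed nodes are expected to be
    -- replaced rather than repaired.
    SET GLOBAL rebalancer_reprotect_queue_interval_s = 300;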

Zone Balance Task

If zones are configured, the Rebalancer's Zone Balance task ensures that each slice has replicas distributed across zones:

  • It is a high priority task in the queue.

  • It is executed with an aggressive rate.

  • It runs at intervals defined by the task_rebalancer_zone_balance_interval_ms system variable (by default, 60000 ms).

Softfail Task

The Rebalancer's Softfail task ensures that no replicas are stored on decommissioned (soft-failed) nodes:

  • It is a high priority task in the queue.

  • It is executed with a moderate rate.

  • It does not run at any specific task interval.

  • It is run immediately when a node is decommissioned (or soft-failed) with the ALTER CLUSTER SOFTFAIL statement.

  • It is not affected by the per-task limit.
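
For reference, decommissioning is initiated with SQL. The statements below are a sketch; the node ID is a placeholder, and the exact syntax should be confirmed against your Xpand version:

    -- Soft-fail node 4; the Softfail task starts moving its replicas to the
    -- remaining nodes (node ID is a placeholder).
    ALTER CLUSTER SOFTFAIL 4;

    -- If the node should stay in service after all, reverse the softfail.
    ALTER CLUSTER UNSOFTFAIL 4;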

Reap Task

The Rebalancer's Reap task ensures that no slice has excessive replicas:

  • It is a high priority task in the queue.

  • It is executed with a moderate rate.

By default, Xpand maintains a minimum of two replicas for every slice. However, the minimum number of replicas can be increased by setting the REPLICAS table option or the MAX_FAILURES setting.

If a slice has more replicas than the minimum, the Reap task can remove the extra replicas.
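
For example, lowering a table's replica requirement leaves it with extra replicas, which the Reap task then removes in the background. A sketch, using an illustrative table name:

    -- Reduce the table's replica requirement; the Reap task removes the
    -- now-excess replicas (table name is illustrative).
    ALTER TABLE orders REPLICAS = 2;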

For additional information, see "Consistent Hashing".

Split Task

The Rebalancer's Split task ensures that no slices are too large:

  • It is a medium priority task in the queue.

  • It is executed with a moderate rate.

  • It runs at intervals defined by the task_rebalancer_split_interval_ms system variable (by default, 30000 ms).

  • It considers a slice to be too large if the slice's size is greater than the value of the rebalancer_split_threshold_kb system variable (by default, 8 GB).

The number of slices for a table can also be configured manually with the SLICES table option.

There is no inverse of the Split task. If a slice becomes too small, the Rebalancer does not automatically merge it with another slice. If you need to reduce the number of slices for a particular table, you must set them manually with the SLICES table option.
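
A minimal sketch of both adjustments (lowering the split threshold and manually setting a table's slice count), assuming the threshold can be changed at runtime with SET GLOBAL and using an illustrative table name:

    -- Halve the split threshold so slices are split at 4 GB instead of the
    -- default 8 GB (the value is in KB).
    SET GLOBAL rebalancer_split_threshold_kb = 4194304;

    -- Manually set the slice count for one table, for example to reduce the
    -- slice count after a large delete.
    ALTER TABLE orders SLICES = 4;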

Rerank Task

The Rebalancer's Rerank task ensures that no node or zone receives a disproportionate number of reads:

  • It is a low priority task in the queue.

  • It is executed with a conservative rate.

If a write operation updates a slice, Xpand synchronously updates every replica. If a read operation accesses a slice, Xpand only accesses the ranking replica. Since all replicas of a slice are identical, directing reads to a non-ranking replica would produce the same results. The distinction between ranking replicas and non-ranking replicas allows the Rebalancer to better manage data distribution and load for both read and write operations, and it allows Xpand to better utilize each node's memory.

If a given node receives a disproportionate number of ranking replicas, it will handle a disproportionate number of read operations. The Rerank task corrects this issue by ensuring that replicas are fairly ranked.

The following diagram shows a set of replicas that are fairly ranked:

[Figure: Balanced Writes with Xpand]

  • Each node has exactly 2 replicas total, so each node is likely to handle an equal number of write operations.

  • Each node has exactly 1 ranking replica (indicated in bold), so each node is likely to handle an equal number of read operations.

Rerank Distribution Task

The Rebalancer's Rerank Distribution task ensures that each representation has sufficient ranking replicas to distribute read requests evenly across the nodes:

  • It is a low priority task in the queue.

  • It is executed with a conservative rate.

Rebalance Task

The Rebalancer's Rebalance task ensures that no node or zone receives a disproportionate number of writes:

  • It is a low priority task in the queue.

  • It is executed with a conservative rate.

  • It runs at intervals defined by the task_rebalancer_rebalance_interval_ms system variable (by default, 30000 ms).

  • It considers a write load to be disproportionate if the write load variation is greater than the value of the rebalancer_rebalance_threshold system variable (by default, 0.05).

  • Simultaneous executions are limited by the value of the rebalancer_rebalance_task_limit system variable (by default, 2).
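
The threshold and concurrency limit above are ordinary global system variables. A minimal sketch of inspecting and adjusting them, assuming SET GLOBAL is permitted for them at runtime:

    -- The write-load variation that triggers rebalancing (default 0.05, as
    -- noted above).
    SHOW VARIABLES LIKE 'rebalancer_rebalance_threshold';

    -- A larger value makes the Rebalancer more tolerant of imbalance.
    SET GLOBAL rebalancer_rebalance_threshold = 0.10;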

The Rebalance task evaluates write load indirectly, based on the following assumptions about the representation:

  • Each slice of a representation is assumed to exert load proportional to its share of the representation's key-space.

    For example, if the index size of a slice constitutes 10% of the overall representation's index space, the Rebalancer assumes the slice comprises 10% of the load on the representation. The Rebalancer anticipates that activity level when placing replicas of that slice.

  • The representation is well-distributed when the difference between the "most loaded" and "least loaded" nodes is minimal.

Consider the following examples of a representation with three equal-size slices: Slice 1, Slice 2, and Slice 3. Each slice has two replicas distributed across a deployment of five nodes.

  • Here is an example of a poor distribution of the representation:

    [Figure: Example of an Unhealthy Cluster]

    While each slice is protected against the failure of a single node, the majority of the representation is stored on node2. This means that if node2 fails, it can create a significant workload for Xpand to restore fault tolerance.

    Xpand Rebalancer responds to this imbalance by automatically moving replicas off of node2 and onto other nodes.

  • Here is an example of the same representation, well-distributed across the nodes:

    [Figure: The same representation, well-distributed]

    To correct the imbalance, the Rebalancer moved the ranking replica of Slice 2 to node4 and the non-ranking replica of Slice 3 to node5. While node1 still has one more replica than the others, none of the Xpand nodes are under-loaded.

Rebalance Distribution Task

The Rebalancer's Rebalance Distribution task ensures that each representation has sufficient replicas to distribute write requests evenly across the nodes:

  • It is a low priority task in the queue.

  • It is executed with a conservative rate.

  • It runs at intervals defined by the task_rebalancer_rebalance_distribution_interval_ms system variable (by default, 30000 ms).

  • Simultaneous executions are limited by the value of the rebalancer_rebalance_task_limit system variable (by default, 2).

Versioned Metadata

Xpand maintains versioned metadata for each representation using a Multi-Version Concurrency Control (MVCC) scheme, similar to that used for table data.

Metadata changes are transactional:

  • When a representation changes, transactions that start before the change are isolated from the change and will never observe it. Older transactions are not canceled. Newer transactions don't wait for older transactions to complete. Instead, each transaction sees a different version of the metadata, allowing metadata changes to be made without coordination with current transactions.

  • If an error occurs during the operation, the entire change rolls back to the state in which it began.

The location of each replica is considered to be part of that metadata. When the Rebalancer moves replicas between nodes, it uses a specific series of DDL changes. Between each DDL change is an epoch, during which other transactions may start. The Rebalancer performs the replica move online, with limited disruption to new or running transactions as the operation proceeds.

Reprotect Queues

A reprotect queue is used to record the changes to a replica while the replica is temporarily unavailable.

Reprotect queues are used in two situations:

  • When a node fails and the interval defined by the rebalancer_reprotect_queue_interval_s system variable has not expired yet, a reprotect queue will record any changes to the node's replicas in case the node comes back online.

  • When a replica is being moved to a different node, a reprotect queue will record any changes to the replica.

A reprotect queue is different from the Rebalancer's priority queue that is used to schedule tasks.

The following example shows how a reprotect queue is used while moving a replica to a different node. Other Rebalancer operations are similar, but more complex:

  1. In Epoch A, replicas of Slice 1 reside on Xpand nodes node3 and node4.

    [Figure: Xpand Rebalancer Queue]

    Xpand decides to move the non-ranking replica from node4 to node1.

  2. In Epoch B, Xpand creates an empty replica on node1.

    [Figure: Xpand Rebalancer Queue]

    Xpand marks the empty replica as building, ensuring that no queries access it. The Rebalancer begins copying data from the non-ranking replica on node4 into the empty replica on node1.

    Xpand also starts the reprotect queue on node2 for the slice. Xpand treats the reprotect queue as another replica for queries that write to the slice, but instead of writing to disk it logs the writes so that they can be replayed later against the new replica.

    The reprotect queue enables Xpand to copy data without blocking updates to the slice.

    Once Xpand finishes copying the data from node4 to node1, it begins asynchronously replaying the transactions in the reprotect queue onto the new replica.

    [Figure: Xpand Rebalancer Queue]

    When Xpand reaches the end of the reprotect queue, it shifts from asynchronous to synchronous operation, applying new transactions to the new replica as they arrive.

  3. In Epoch C, Xpand disables the reprotect queue. It is no longer updated by new transactions, but older transactions continue to update it, and those updates are applied synchronously to the new replica.

    Xpand puts the new replica online and retires the old replica. Transactions before the update still see the old replica, but the old replica no longer receives new writes.

    [Figure: Xpand Rebalancer Queue]

  4. In Epoch D, Xpand removes the reprotect queue and the old replica.

    [Figure: Xpand Rebalancer Queue]