Rebalancer Tasks for MariaDB Xpand

Overview

MariaDB Xpand's Rebalancer uses Tasks to correct issues with the distribution model:

  • Xpand schedules Tasks independently

  • Xpand's Tasks are designed to avoid conflicts with other Tasks

  • Xpand implements many different kinds of Tasks with different priorities

Compatibility

Information provided here applies to:

  • MariaDB Xpand 5.3

  • MariaDB Xpand 6.0

  • MariaDB Xpand 6.1

Distribution Model

MariaDB Xpand's Rebalancer uses Tasks to correct issues with the distribution model:

  • Ranking replicas might not be distributed evenly, so some nodes receive more writes than others

  • Replicas might not be distributed evenly, so some nodes receive more reads than others

  • A slice might be excessively large

  • A slice might have excessive replicas

Scheduling

MariaDB Xpand's Rebalancer schedules Tasks in the Priority Queue.

Conflict Avoidance

MariaDB Xpand's Tasks are designed to avoid conflicts with other Tasks.

As an example, the Rebalancer uses the "Split" Task to split a slice into two new slices:

  • The Rebalancer considers various details from the distribution model to decide where the new slices should be located, and which nodes should contain the ranking replicas

  • The Rebalancer takes precautions to avoid conflicting with other Tasks, such as:

    • The Rebalancer considers how the representation's other slices are distributed, so that it does not need to perform a "Re-rank" or "Re-rank Distribution" task later

    • The Rebalancer considers how the load is distributed among nodes, so that it does not need to perform a "Rebalance" or "Rebalance Distribution" task later

Task List

MariaDB Xpand implements many different kinds of Tasks with different priorities:

  Priority | Rate         | Name                   | Fixes
  ---------|--------------|------------------------|----------------------------------
  High     | Aggressive   | Reprotect              | Missing replicas
  High     | Aggressive   | Zone Balance           | Slice imbalance for a zone
  High     | Moderate     | Softfail               | Slices on decommissioned hardware
  High     | Moderate     | Reap                   | Extra replicas/queues
  Medium   | Moderate     | Split                  | Large slices
  Low      | Conservative | Re-rank                | Node/zone read imbalance
  Low      | Conservative | Re-rank Distribution   | Representation read imbalance
  Low      | Conservative | Rebalance              | Node/zone usage imbalance
  Low      | Conservative | Rebalance Distribution | Representation write imbalance

Reprotect Task

The Rebalancer's Reprotect task ensures that each slice is protected with a sufficient number of replicas:

  • It is a high priority task in the queue.

  • It is executed with an aggressive rate.

  • It runs at intervals defined by the task_rebalancer_reprotect_interval_ms system variable (by default, 15000 ms).

By default, Xpand maintains a minimum of two replicas for every slice. However, the minimum number of replicas can be increased by setting the REPLICAS table option or the MAX_FAILURES setting. If a node fails, the failed node's replicas become unavailable. If a slice no longer has the minimum number of replicas, it is considered to be under-protected. When a slice is under-protected, it may be lost if any additional nodes fail.
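
For example, a minimal sketch of a table whose slices are kept with three replicas each (the table definition is illustrative):

    -- Hypothetical table; REPLICAS = 3 maintains three copies of every slice
    CREATE TABLE orders (
        id BIGINT PRIMARY KEY,
        total DECIMAL(10,2)
    ) REPLICAS = 3;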

The Reprotect task corrects this issue by creating new replicas for an under-protected slice. However, there is a chance that the failed node will become available again. Therefore, the Reprotect task is not executed unless the failed node is unavailable for longer than the value of the rebalancer_reprotect_queue_interval_s system variable (by default, 600 seconds). During this interval, the Rebalancer maintains a Recovery Queue that stores any changes made to the replicas on the failed node.
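
For example, the reprotect delay can be lengthened so that a briefly unavailable node does not trigger replica rebuilds (the value shown is illustrative):

    -- Wait 20 minutes, instead of the default 600 seconds, before discarding
    -- the Recovery Queue and creating replacement replicas
    SET GLOBAL rebalancer_reprotect_queue_interval_s = 1200;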

If the failed node comes back online before the interval expires, the changes in the Recovery Queue are applied to the replicas on the node.

If the failed node does not come back online before the interval expires, the Rebalancer begins creating new replicas to replace the ones on the failed node, and it discards the Recovery Queue.

A node failure reduces the storage capacity of the cluster. If the storage capacity falls below what Xpand needs to store new replicas, the Rebalancer does not create new replicas, and the slices remain under-protected.

Zone Balance Task

If zones are configured, the Rebalancer's Zone Balance task ensures that each slice has replicas distributed across zones:

  • It is a high priority task in the queue.

  • It is executed with an aggressive rate.

  • It runs at intervals defined by the task_rebalancer_zone_balance_interval_ms system variable (by default, 60000 ms); see the example after this list.
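
A minimal sketch of zone configuration and tuning, assuming the ALTER CLUSTER ... ZONE form of zone assignment (the node and zone IDs and the interval are illustrative):

    -- Assign nodes to zones so the Zone Balance task can spread replicas across them
    ALTER CLUSTER 1 ZONE 1;
    ALTER CLUSTER 2 ZONE 2;

    -- Check zone balance every 2 minutes instead of the default 60000 ms
    SET GLOBAL task_rebalancer_zone_balance_interval_ms = 120000;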

Softfail Task

The Rebalancer's Softfail task ensures that no replicas are stored on decommissioned (soft-failed) nodes:

  • It is a high priority task in the queue.

  • It is executed with a moderate rate.

  • It does not run at any specific task interval.

  • It is run immediately when a node is decommissioned (or soft-failed) with the ALTER CLUSTER SOFTFAIL statement (see the example after this list).

  • It is not affected by the per-task limit.
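
For example (the node ID is illustrative):

    -- Decommission node 3; the Softfail task immediately begins moving its replicas
    ALTER CLUSTER SOFTFAIL 3;

    -- Cancel the decommission if the node should be kept after all
    ALTER CLUSTER UNSOFTFAIL 3;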

Reap Task

The Rebalancer's Reap task ensures that no slice has excessive replicas:

  • It is a high priority task in the queue.

  • It is executed with a moderate rate.

By default, Xpand maintains a minimum of two replicas for every slice. However, the minimum number of replicas can be increased by setting the REPLICAS table option or the MAX_FAILURES setting.

If a slice has more replicas than the minimum, the Reap task can remove the extra replicas.
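
For example, assuming the REPLICAS option can also be changed with ALTER TABLE, lowering a table's replica count leaves extra replicas for the Reap task to remove (the table name is illustrative):

    -- Drop the replica count back to 2; the Reap task reclaims the extra replicas
    ALTER TABLE orders REPLICAS = 2;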

For additional information, see "Consistent Hashing".

Split Task

The Rebalancer's Split task ensures that no slices are too large:

  • It is a medium priority task in the queue.

  • It is executed with a moderate rate.

  • It runs at intervals defined by the task_rebalancer_split_interval_ms system variable (by default, 30000 ms).

  • It considers a slice to be too large if the slice's size is greater than the value of the rebalancer_split_threshold_kb system variable (by default, 8 GB).

The number of slices for a table can also be manually configured using the SLICES table option.

There is no inverse of the Split task. If a slice gets too small, the Rebalancer will not automatically merge the slice with another slice. If you need to reduce the number of slices for a particular table, you must manually configure the slices with the SLICES table option.
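
A minimal sketch of both controls (the threshold and slice count are illustrative):

    -- Treat slices larger than 4 GB (4194304 KB) as too large, instead of 8 GB
    SET GLOBAL rebalancer_split_threshold_kb = 4194304;

    -- Manually configure a table to use 12 slices
    ALTER TABLE orders SLICES = 12;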

Re-rank Task

The Rebalancer's Re-rank task ensures that no node or zone receives a disproportionate number of reads:

  • It is a low priority task in the queue.

  • It is executed with a conservative rate.

If a write operation updates a slice, Xpand synchronously updates every replica. If a read operation accesses a slice, Xpand only accesses the ranking replica. Since all replicas of a slice are identical, directing reads to a non-ranking replica would produce the same results. The distinction between ranking replicas and non-ranking replicas allows the Rebalancer to better manage data distribution and load for both read and write operations, and it allows Xpand to better utilize each node's memory.

If a given node receives a disproportionate number of ranking replicas, it will handle a disproportionate number of read operations. The Re-rank task corrects this issue by ensuring that replicas are fairly ranked.

The following diagram shows a set of replicas that are fairly ranked:

[Diagram: Balanced Writes with Xpand]

  • Each node has exactly 2 replicas total, so each node is likely to handle an equal number of write operations.

  • Each node has exactly 1 ranking replica (indicated in bold), so each node is likely to handle an equal number of read operations.

Re-rank Distribution Task

The Rebalancer's Re-rank Distribution task ensures that each representation's ranking replicas are placed to distribute read requests evenly across the nodes:

  • It is a low priority task in the queue.

  • It is executed with a conservative rate.

Rebalance Task

The Rebalancer's Rebalance task ensures that no node or zone receives a disproportionate number of writes:

  • It is a low priority task in the queue.

  • It is executed with a conservative rate.

  • It runs at intervals defined by the task_rebalancer_rebalance_interval_ms system variable (by default, 30000 ms).

  • It considers a write load to be disproportionate if the write load variation is greater than the value of the rebalancer_rebalance_threshold system variable (by default, 0.05).

  • Simultaneous executions are limited by the value of the rebalancer_rebalance_task_limit system variable (by default, 2); see the example after this list.
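
For example (the values shown are illustrative):

    -- Treat only a write-load variation above 10% as disproportionate (default 0.05)
    SET GLOBAL rebalancer_rebalance_threshold = 0.10;

    -- Allow up to 4 simultaneous Rebalance task executions (default 2)
    SET GLOBAL rebalancer_rebalance_task_limit = 4;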

The Rebalance task evaluates write load indirectly, estimating it from the layout of the representation:

  • Each slice of a representation is assumed to exert load proportional to its share of the representation's key-space.

    For example, if the index size of a slice constitutes 10% of the overall representation's index space, the Rebalancer assumes the slice comprises 10% of the load on the representation. The Rebalancer anticipates that activity level when placing replicas of that slice.

  • The representation is well-distributed when the difference between the "most loaded" and "least loaded" nodes is minimal.

Consider the following examples of a representation with three equal-size slices: Slice 1, Slice 2, and Slice 3. Each slice has two replicas distributed among five nodes.

  • Here is an example of a poor distribution of the representation:

    [Diagram: Example of an Unhealthy Cluster]

    While each slice is protected against the failure of a single node, the majority of the representation is stored on node2. This means that if node2 fails, it can create a significant workload for Xpand to restore fault tolerance.

    The Xpand Rebalancer responds to this imbalance by automatically moving replicas off of node2 and onto other nodes.

  • Here is an example of the same representation, well-distributed across the nodes:

    [Diagram: Example of a Healthy Cluster]

    To correct the imbalance, the Rebalancer moved the ranking replica of Slice 2 to node4 and the non-ranking replica of Slice 3 to node5. While node1 still has one more replica than the others, none of the Xpand nodes are under-loaded.

Rebalance Distribution Task

The Rebalancer's Rebalance Distribution task ensures that each representation's replicas are placed to distribute write requests evenly across the nodes:

  • It is a low priority task in the queue.

  • It is executed with a conservative rate.

  • It runs at intervals defined by the task_rebalancer_rebalance_distribution_interval_ms system variable (by default, 30000 ms).

  • Simultaneous executions are limited by the value of the rebalancer_rebalance_task_limit system variable (by default, 2); see the example below.
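
For example (the interval shown is illustrative):

    -- Run the Rebalance Distribution task every 60 seconds instead of every 30000 ms
    SET GLOBAL task_rebalancer_rebalance_distribution_interval_ms = 60000;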