Rebalancer Tasks for MariaDB Xpand
Overview
MariaDB Xpand's Rebalancer uses Tasks to correct issues with the distribution model:
Xpand schedules Tasks independently
Xpand's Tasks are designed to avoid conflicts with other Tasks
Xpand implements many different kinds of Tasks with different priorities
Compatibility
Information provided here applies to:
MariaDB Xpand 5.3
MariaDB Xpand 6.0
MariaDB Xpand 6.1
Distribution Model
MariaDB Xpand's Rebalancer uses Tasks to correct issues with the distribution model:
Ranking replicas might not be distributed evenly, so some nodes receive more reads than others
Replicas might not be distributed evenly, so some nodes receive more writes than others
A slice might be excessively large
A slice might have excessive replicas
Scheduling
MariaDB Xpand's Rebalancer schedules Tasks in the Priority Queue.
Conflict Avoidance
MariaDB Xpand's Tasks are designed to avoid conflicts with other Tasks.
As an example, the Rebalancer uses the "Split" Task to split a slice into two new slices:
The Rebalancer considers various details from the distribution model to decide where the new slices should be located, and which nodes should contain the ranking replicas
The Rebalancer takes precautions to avoid conflicting with other Tasks, such as:
The Rebalancer considers how the representation's other slices are distributed, so that it does not need to perform a "Re-rank" or "Re-rank Distribution" task later
The Rebalancer considers how the load is distributed among nodes, so that it does not need to perform a "Rebalance" or "Rebalance Distribution" task later
Task List
MariaDB Xpand implements many different kinds of Tasks with different priorities:
Priority | Rate | Name | Fixes |
---|---|---|---|
High | Aggressive | Reprotect | Missing replicas |
High | Aggressive | Zone Balance | Slice imbalance for a zone |
High | Moderate | Softfail | Slices on decommissioned hardware |
High | Moderate | Reap | Extra replicas/queues |
Medium | Moderate | Split | Large slices |
Low | Conservative | Re-rank | Node/zone read imbalance |
Low | Conservative | Re-rank Distribution | Representation read imbalance |
Low | Conservative | Rebalance | Node/zone usage imbalance |
Low | Conservative | Rebalance Distribution | Representation write imbalance |
Reprotect Task
The Rebalancer's Reprotect task ensures that each slice is protected with a sufficient number of replicas:
It is a high priority task in the queue.
It is executed with an aggressive rate.
It runs at intervals defined by the task_rebalancer_reprotect_interval_ms system variable (by default, 15000 ms).
By default, Xpand maintains a minimum of two replicas for every slice. However, the minimum number of replicas can be increased by setting the REPLICAS table option or the MAX_FAILURES setting. If a node fails, the failed node's replicas become unavailable. If a slice no longer has the minimum number of replicas, it is considered to be under-protected. When a slice is under-protected, it may be lost if any additional nodes fail.
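For example, a table can be created with a higher replica count so that its slices remain protected even if two nodes fail at once. The following is a minimal sketch; the table definition is hypothetical:

```sql
-- Hypothetical table; REPLICAS = 3 asks Xpand to keep three replicas of
-- every slice instead of the default two.
CREATE TABLE orders (
    id BIGINT NOT NULL,
    customer_id BIGINT,
    total DECIMAL(10, 2),
    PRIMARY KEY (id)
) REPLICAS = 3;
```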
The Reprotect task corrects this issue by creating new replicas for an under-protected slice. However, there is a chance that the failed node will become available again. Therefore, the Reprotect task is not executed unless the failed node is unavailable for longer than the value of the rebalancer_reprotect_queue_interval_s system variable (by default, 600 seconds). During this interval, the Rebalancer maintains a Recovery Queue that stores any changes made to the replicas on the failed node.
If the failed node comes back online before the interval expires, the changes in the Recovery Queue are applied to the replicas on the node.
If the failed node does not come back online before the interval expires, the Rebalancer begins creating new replicas to replace the ones on the failed node, and it discards the Recovery Queue.
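The timing described above can be inspected or tuned through the system variables already mentioned. As a rough sketch (the new value shown is only illustrative):

```sql
-- Check the current Reprotect settings.
SHOW VARIABLES LIKE 'task_rebalancer_reprotect_interval_ms';
SHOW VARIABLES LIKE 'rebalancer_reprotect_queue_interval_s';

-- Illustrative change: wait 20 minutes before reprotecting, giving a
-- failed node more time to rejoin before new replicas are created.
SET GLOBAL rebalancer_reprotect_queue_interval_s = 1200;
```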
A node failure reduces the storage capacity of the cluster. If the storage capacity falls below what Xpand needs to store new replicas, the Rebalancer does not create new replicas, and the slices remain under-protected.
Zone Balance Task
If zones are configured, the Rebalancer's Zone Balance task ensures that each slice has replicas distributed across zones:
It is a high priority task in the queue.
It is executed with an aggressive rate.
It runs at intervals defined by the task_rebalancer_zone_balance_interval_ms system variable (by default, 60000 ms).
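If zones are in use, the task's interval can be checked in the same way as the other Rebalancer intervals; for example:

```sql
-- Check how often the Zone Balance task runs (in milliseconds).
SHOW VARIABLES LIKE 'task_rebalancer_zone_balance_interval_ms';
```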
Softfail Task
The Rebalancer's Softfail task ensures that no replicas are stored on decommissioned (soft-failed) nodes:
It is a high priority task in the queue.
It is executed with a moderate rate.
It does not run at any specific task interval.
It is run immediately when a node is decommissioned (or soft-failed) with the ALTER CLUSTER SOFTFAIL statement.
It is not affected by the per-task limit.
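For example, a node can be decommissioned so that the Softfail task moves its replicas elsewhere. The statements below are a sketch; the node ID is hypothetical, and the UNSOFTFAIL statement is shown on the assumption that a decommission can be reversed before the node is removed:

```sql
-- Mark node 3 for decommissioning; the Softfail task then begins moving
-- its replicas to other nodes.
ALTER CLUSTER SOFTFAIL 3;

-- Reverse the decommission if it was issued by mistake.
ALTER CLUSTER UNSOFTFAIL 3;
```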
Reap Task
The Rebalancer's Reap task ensures that no slice has excessive replicas:
It is a high priority task in the queue.
It is executed with a moderate rate.
By default, Xpand maintains a minimum of two replicas for every slice. However, the minimum number of replicas can be increased by setting the REPLICAS table option or the MAX_FAILURES setting.
If a slice has more replicas than the minimum, the Reap task can remove the extra replicas.
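For example, if a table's replica count is lowered, the surplus replicas become candidates for the Reap task. This sketch assumes the hypothetical table from the earlier example and that the REPLICAS option can also be changed with ALTER TABLE:

```sql
-- The table was previously created with REPLICAS = 3. Lowering the option
-- leaves each slice with one more replica than required; the Reap task
-- removes the extras over time.
ALTER TABLE orders REPLICAS = 2;
```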
For additional information, see "Consistent Hashing".
Split Task
The Rebalancer's Split task ensures that no slices are too large:
It is a medium priority task in the queue.
It is executed with a moderate rate.
It runs at intervals defined by the task_rebalancer_split_interval_ms system variable (by default, 30000 ms).
It considers a slice to be too large if the slice's size is greater than the value of the rebalancer_split_threshold_kb system variable (by default, 8 GB).
The number of slices for a table can also be manually configured using the SLICE table option.
There is no inverse of the Split task. If a slice gets too small, the Rebalancer will not automatically merge the slice with another slice. If you need to reduce the number of slices for a particular table, you must manually configure the slices with the SLICE table option.
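As a sketch, the split threshold can be raised for clusters where larger slices are acceptable; the new value below is only illustrative:

```sql
-- Check the current split threshold (expressed in KB; the default
-- corresponds to 8 GB).
SHOW VARIABLES LIKE 'rebalancer_split_threshold_kb';

-- Illustrative change: only treat slices larger than 16 GB as too large
-- (16 GB expressed in KB).
SET GLOBAL rebalancer_split_threshold_kb = 16777216;
```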
Re-rank Task
The Rebalancer's Re-rank task ensures that no node or zone receives a disproportionate number of reads:
It is a low priority task in the queue.
It is executed with a conservative rate.
If a write operation updates a slice, Xpand synchronously updates every replica. If a read operation accesses a slice, Xpand only accesses the ranking replica. Since all replicas of a slice are identical, directing reads to a non-ranking replica would produce the same results. The distinction between ranking replicas and non-ranking replicas allows the Rebalancer to better manage data distribution and load for both read and write operations, and it allows Xpand to better utilize each node's memory.
If a given node receives a disproportionate number of ranking replicas, it will handle a disproportionate number of read operations. The Re-rank task corrects this issue by ensuring that replicas are fairly ranked.
The following diagram shows a set of replicas that are fairly ranked:
Each node has exactly 2 replicas total, so each node is likely to handle an equal number of write operations.
Each node has exactly 1 ranking replica (indicated in bold), so each node is likely to handle an equal number of read operations.
Re-rank Distribution Task
The Rebalancer's Re-rank Distribution task ensures that each representation has sufficient ranking replicas to distribute read requests evenly across the nodes:
It is a low priority task in the queue.
It is executed with a conservative rate.
Rebalance Task
The Rebalancer's Rebalance task ensures that no node or zone receives a disproportionate number of writes:
It is a low priority task in the queue.
It is executed with a conservative rate.
It runs at intervals defined by the task_rebalancer_rebalance_interval_ms system variable (by default, 30000 ms).
It considers a write load to be disproportionate if the write load variation is greater than the value of the rebalancer_rebalance_threshold system variable (by default, 0.05).
Simultaneous executions are limited by the value of the rebalancer_rebalance_task_limit system variable (by default, 2).
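These thresholds and limits can be inspected or adjusted like the other Rebalancer settings; the new value below is only illustrative:

```sql
-- Check the current write-load threshold and concurrent-task limit.
SHOW VARIABLES LIKE 'rebalancer_rebalance_threshold';
SHOW VARIABLES LIKE 'rebalancer_rebalance_task_limit';

-- Illustrative change: tolerate up to 10% write-load variation before
-- the Rebalance task moves replicas.
SET GLOBAL rebalancer_rebalance_threshold = 0.10;
```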
The Rebalance task evaluates write load indirectly by calculating some details about the representation:
Each slice of a representation is assumed to exert load proportional to its share of the representation's key-space.
For example, if the index size of a slice constitutes 10% of the overall representation's index space, the Rebalancer assumes the slice comprises 10% of the load on the representation. The Rebalancer anticipates that activity level when placing replicas of that slice.
The representation is well-distributed when the difference between the "most loaded" and "least loaded" nodes is minimal.
Consider the following examples of a representation with three equal-size slices: Slice 1, Slice 2, and Slice 3. Each slice has two replicas distributed between five nodes.
Here is an example of a poor distribution of the representation:
While each slice is protected against the failure of a single node, the majority of the representation is stored on node2. This means that if node2 fails, it can create a significant workload for Xpand to restore fault tolerance. Xpand's Rebalancer responds to this imbalance by automatically moving replicas off of node2 and onto other nodes.
Here is an example of the same representation, well-distributed across the nodes:
To correct the imbalance, the Rebalancer moved the ranking replica of Slice 2 to node4 and the non-ranking replica of Slice 3 to node5. While node1 still has one more replica than the others, none of the Xpand nodes are under-loaded.
Rebalance Distribution Task
The Rebalancer's Rebalance Distribution task ensures that each representation has sufficient replicas to distribute write requests evenly across the nodes:
It is a low priority task in the queue.
It is executed with a conservative rate.
It runs at intervals defined by the task_rebalancer_rebalance_distribution_interval_ms system variable (by default, 30000 ms).
Simultaneous executions are limited by the value of the rebalancer_rebalance_task_limit system variable (by default, 2).
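Finally, all of the Rebalancer-related intervals, thresholds, and limits referenced on this page can be listed together; for example:

```sql
-- List every Rebalancer-related system variable in one pass.
SHOW VARIABLES LIKE '%rebalanc%';
```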