Administering Failure and Recovery with MariaDB Xpand

Overview

This section describes how MariaDB Xpand handles various types of failures and tells you how best to recover from them if they occur.

Front-end Network Failure

If an Xpand node cannot communicate on its front-end Ethernet network port, for example, due to an inadvertent cable pull, switch misconfiguration, or NIC failure, no manual intervention is required. The deployment takes the following actions:

  • Additional connections are not assigned to the failed Xpand node.

  • If the failed Xpand node was handling replication slave connections, the connections are moved to another Xpand node.

Back-end Network Failure

Back-end network failures are handled in the same way as node failures. See Node Failure below.

Disk Failure

Xpand maintains two copies of your data (replicas) by default. When the deployment detects a disk failure, it automatically schedules reprotect operations to generate new copies of the data. The administrator does not need to take any action to reprotect the dataset. The deployment also sends email alerts when any table falls below the specified number of copies, and again when the resulting reprotect task completes.

In some situations, the system detects errors on a disk, but the errors stay below the threshold at which the drive is marked as failed. Some user queries may then receive occasional errors when reading from the failing device. In such cases, the administrator can manually deactivate the Xpand node without reducing the number of available copies: the system attempts to safely move all data on the affected devices to other devices in the deployment. To do this, follow the steps in Scale-In.

Node Failure

This section describes two types of Xpand node failure: transient, where the Xpand node is offline briefly (for example, due to a crash or power failure), and permanent, where an Xpand node has failed completely and is not expected to return (for example, due to a hardware failure).

Transient Node Failure and Rebalancer Reprotect

When the deployment loses contact with an individual Xpand node for any reason, the surviving Xpand nodes form a new group without that node and continue serving clients. All services, such as replication slaves, are reassigned across the surviving Xpand nodes. Clients that were distributed to the failed Xpand node must reconnect. Clients that were directly connected to the failed Xpand node are unable to query the database. You will receive an email alert, and a message like the following will appear in the clustrix.log for one of the Xpand nodes:

ALERT PROTECTION_LOST WARNING Full protection lost for some data;
queueing writes for down node; reprotection will begin in 600
seconds if node has not recovered

This message simply indicates that not all data has a full set of replicas available. The global variable rebalancer_reprotect_queue_interval_s specifies how long the Rebalancer should wait for a node to re-join the deployment before starting to create additional replicas.

  • If a node re-joins within rebalancer_reprotect_queue_interval_s:

    • Xpand replays the changes that were made since the last time the node was in quorum, thereby enabling the node to rejoin the deployment quickly.

    • The node rejoins the deployment and begins accepting work. No further action is necessary.

  • If a node has not re-joined by the time rebalancer_reprotect_queue_interval_s has passed:

    • The Rebalancer begins copying under-protected slices to create new replicas throughout the surviving nodes.

    • The deployment performs a group change.

    • Assuming there is sufficient disk space, Xpand will automatically handle the reprotect process and no manual intervention is required.
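As a minimal sketch, the wait interval described above can be inspected and changed with ordinary global-variable statements. The value 1200 is only an illustrative choice, not a recommendation.

```sql
-- Show the current wait interval (in seconds) before reprotect begins.
SHOW GLOBAL VARIABLES LIKE 'rebalancer_reprotect_queue_interval_s';

-- Example value only: give a down node 20 minutes to return
-- before the Rebalancer starts creating additional replicas.
SET GLOBAL rebalancer_reprotect_queue_interval_s = 1200;
```

A longer interval avoids copying data after brief outages, at the cost of leaving some data without a full set of replicas for longer.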

Use the following query on one of the Xpand nodes to view the Rebalancer reprotection tasks that have not been finished:

SELECT * FROM system.rebalancer_activity_log
WHERE finished IS NULL;

Once all data again has a sufficient number of replicas (either because the Xpand node recovered or because the Rebalancer finished making copies), you will receive an alert, and a message like the following will appear in the clustrix.log for one of the Xpand nodes:

ALERT PROTECTION_RESTORED WARNING Full protection restored
for all data after 20 minutes and 40 seconds

Softfailing a Node

If an Xpand node becomes unreliable and you would like to remove it from the deployment, MariaDB recommends marking it as softfailed (using the Scale-In procedure). You can simultaneously incorporate a replacement using the Scale-Out procedure. The high-level steps are:

  1. Provision replacement Xpand node(s) by installing Xpand and adding them to the deployment using ALTER CLUSTER ADD.

  2. Mark the Xpand node(s) in question as softfailed using ALTER CLUSTER SOFTFAIL.

  3. Once softfail operations complete, execute ALTER CLUSTER REFORM to remove the softfailed Xpand node(s).
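The steps above can be sketched as the following statements. The IP address and node ID are placeholders, and the argument forms shown are assumptions to be checked against the Scale-In and Scale-Out procedures. The progress check reuses the Rebalancer activity query from earlier in this section as a rough indicator; the Scale-In procedure remains the authoritative way to confirm that softfail has completed.

```sql
-- 1. Add the replacement Xpand node (placeholder IP address).
ALTER CLUSTER ADD '10.0.0.15';

-- 2. Mark the unreliable Xpand node as softfailed (placeholder node ID).
ALTER CLUSTER SOFTFAIL 3;

-- Rough progress check: unfinished Rebalancer tasks.
-- Wait until the softfail operations have completed before continuing.
SELECT * FROM system.rebalancer_activity_log
WHERE finished IS NULL;

-- 3. Remove the softfailed Xpand node(s) from the deployment.
ALTER CLUSTER REFORM;
```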

Permanent Node Failure

If an Xpand node has failed permanently, the Rebalancer will automatically create additional replicas as described above. The lost Xpand node is still considered to be a quorum participant until it is removed explicitly.

To remove a permanently failed Xpand node from the deployment, drop it manually, where nodeid is the ID of the failed node:

ALTER CLUSTER DROP nodeid;

This command results in a group change.

Note

Dropping a node before reprotect has completed can leave the deployment vulnerable to data loss.
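One cautious way to sequence the drop is sketched below. The use of system.membership to look up the node ID is an assumption, and the node ID 3 is a placeholder. The PROTECTION_RESTORED alert described earlier is the primary signal that reprotect has finished.

```sql
-- Assumption: system.membership lists each node and its current status,
-- which helps identify the ID of the failed node.
SELECT * FROM system.membership;

-- Check that no Rebalancer tasks are still unfinished.
SELECT COUNT(*) AS unfinished_tasks
FROM system.rebalancer_activity_log
WHERE finished IS NULL;

-- Only after reprotect has completed: drop the failed node (placeholder ID).
ALTER CLUSTER DROP 3;
```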

To incorporate a replacement Xpand node, follow the instructions for Scale-Out.

Multiple Node Failures

Xpand can be configured to withstand multiple failures by setting the value of MAX_FAILURES.

For a deployment to tolerate the configured value for MAX_FAILURES:

  • All representations must have sufficient replicas. If MAX_FAILURES is updated, all tables created previously must have their replicas updated manually.

  • There must be a quorum of Xpand nodes available (at least N/2+1, where N is the total number of Xpand nodes).

  • There should be enough free disk space for the deployment to reprotect after an unexpected failure, and MariaDB recommends provisioning for this in advance. See Allocating Disk Space for Fault Tolerance and Availability.
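As a hedged sketch of the requirements above: the statements below assume the ALTER CLUSTER SET MAX_FAILURES and ALTER TABLE ... REPLICAS forms, assume that tolerating two failures calls for three replicas (MAX_FAILURES + 1), and use a hypothetical table name. The final statement works through the quorum arithmetic for a five-node deployment.

```sql
-- Example value only: tolerate two simultaneous failures.
ALTER CLUSTER SET MAX_FAILURES = 2;

-- Tables created before the change must be updated manually
-- (sales.orders is a hypothetical table).
ALTER TABLE sales.orders REPLICAS = 3;

-- Quorum for N = 5 nodes: floor(5 / 2) + 1 = 3,
-- so the deployment stays available with up to two nodes down.
SELECT FLOOR(5 / 2) + 1 AS quorum_size;
```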

Zone Failure

When zones are configured, a failure of an entire zone is analogous to an Xpand node failure. Xpand will automatically resume operation with the Xpand nodes from available zones and automatically reprotect. To remove a zone from the deployment, mark all Xpand nodes in the zone as softfailed.

When Reprotect Cannot Complete

Insufficient Disk Space

If there is insufficient disk space for all replicas, the reprotect process cannot complete. Consider adding capacity using the Scale-Out procedure. See Managing File Space.

Missing Replicas

If the deployment has lost more Xpand nodes and/or zones than specified for MAX_FAILURES, the deployment will be unable to reprotect. The Rebalancer activity log (system.rebalancer_activity_log) will show "Representation not found" errors. To identify the affected databases, tables, indexes, and slices, run the following query:

SELECT `Database`, `Table`, `Index`, slice, status
FROM (SELECT `Database`, `Table`, `Index`, slice, MIN(status) AS status
      FROM system.table_replicas
      GROUP BY slice) AS x
WHERE x.status > 1;

If the unavailable Xpand nodes cannot be recovered, these tables must be restored from backup.