Rapid Node Recovery with IST and the GCache
Incremental State Transfer (IST)
An Incremental State Transfer (IST) is the fast and efficient process where a joining node receives only the missing transactions it needs to catch up with the cluster, rather than receiving a full copy of the entire database.
This is the preferred provisioning method because it is:
Fast: Transferring only the missing changes is significantly faster than copying the entire dataset.
Non-Blocking: The donor node can continue to serve read and write traffic while an IST is in progress.
Conditions for IST
IST is an automatic process, but it is only possible if the following conditions are met:
The joining node has previously been a member of the cluster (its state UUID matches the cluster's).
All of the write-sets that the joiner is missing are still available in the donor node's Write-set Cache (GCache).
If these conditions are not met, the cluster automatically falls back to performing a full State Snapshot Transfer (SST).
Skipping Foreign Key Checks
During normal operation in multi-active topologies, appliers need to verify foreign key constraints, so appliers are configured with FK checking enabled.
However, while a node is joining, during IST and the subsequent catch-up period, it is still idle with respect to local connections, and its only source of incoming transactions is the cluster, which sends certified write-sets for applying. Because IST uses parallel applying, foreign key checks can cause lock conflicts between appliers accessing FK child and parent tables. Excessive FK checking also slows down the IST process.
To address that issue, you can relax FK checks for appliers during IST and catch-up periods. The relaxed FK check mode is configurable by setting this flag:
wsrep_mode=SKIP_APPLIER_FK_CHECKS_IN_IST
When this operation mode is set, and the node is processing IST or catch-up, appliers skip FK checking.
The Write-Set Cache (GCache)
The GCache is a special cache on each node whose primary purpose is to store recent write-sets specifically to facilitate Incremental State Transfers. The size and configuration of the GCache are therefore critical for the cluster's recovery speed and high availability.
How the GCache Enables IST
When a node attempts to rejoin the cluster, it reports the sequence number (seqno) of the last transaction it successfully applied. The potential donor node then checks its GCache for the very next seqno in that sequence.
If the seqno is found: The donor has the necessary history. It streams all subsequent write-sets from its GCache to the joiner. The joiner applies them in order and quickly becomes Synced.
If the seqno is not found: The node was disconnected for too long, and the required history has been purged from the cache. IST is not possible, and an SST is initiated.
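The decision described above can be sketched in a few lines of Python. This is an illustrative model only, not Galera's actual implementation: the function name and all parameter names are hypothetical, and the donor's GCache is reduced to the lowest and highest seqno it still holds.

```python
def choose_state_transfer(joiner_uuid, joiner_last_seqno,
                          cluster_uuid, gcache_first_seqno, gcache_last_seqno):
    """Return "IST" or "SST" for a joining node (simplified model).

    All names are hypothetical; the real decision is made inside Galera.
    """
    # Condition 1: the joiner must already belong to this cluster's history.
    if joiner_uuid != cluster_uuid:
        return "SST"

    # Condition 2: the very next write-set the joiner needs must still be
    # in the donor's GCache (or the joiner is already fully caught up).
    next_needed = joiner_last_seqno + 1
    if gcache_first_seqno <= next_needed <= gcache_last_seqno + 1:
        return "IST"

    # The required history has been purged from the cache.
    return "SST"


# Hypothetical donor whose GCache holds seqnos 1000 through 5000.
print(choose_state_transfer("uuid-a", 1500, "uuid-a", 1000, 5000))  # IST
print(choose_state_transfer("uuid-a", 200, "uuid-a", 1000, 5000))   # SST
```

The second call falls back to SST because seqno 201 is older than anything the donor still caches, which is the "disconnected for too long" case.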
Configuring the GCache
You can control the GCache behavior with several parameters in the [galera] section of your configuration file (my.cnf).
gcache.size: Controls the size of the on-disk ring-buffer file. A larger GCache can hold more history, increasing the chance of a fast IST over SST.
gcache.dir: Specifies where GCache files are stored. Best practice is to place it on the fastest available storage, such as SSD or NVMe.
gcache.recover: Enabled by default in modern Galera versions, it allows a node to recover its GCache after a restart, so it can serve immediately as an IST donor.
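GCache parameters are Galera provider options, which are commonly passed as a single semicolon-separated string through wsrep_provider_options. The Python sketch below only assembles such a line for pasting into a configuration file. The helper name, the 2G size, and the directory path are hypothetical example values, not recommendations.

```python
def render_provider_options(options):
    """Join provider options into the 'key=value;key=value' form."""
    return ";".join(f"{key}={value}" for key, value in options.items())


gcache_options = {
    "gcache.size": "2G",                    # example value only
    "gcache.dir": "/var/lib/mysql-gcache",  # hypothetical path on fast storage
    "gcache.recover": "yes",
}

config_line = f'wsrep_provider_options="{render_provider_options(gcache_options)}"'
print(config_line)
```

Keeping the options in a dictionary makes it easy to add or change a single value without hand-editing one long string.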
Tuning gcache.size
The gcache.size parameter is the most critical setting for ensuring nodes can use IST. A GCache that is too small is the most common reason for a cluster falling back to a full SST.
The ideal size depends on your cluster's write rate and the amount of downtime you want to tolerate for a node before forcing an SST. For instance, do you want a node that is down for 1 hour for maintenance to recover instantly (IST), or can you afford a full SST?
Calculating Size Based on Write Rate
The most accurate way to size your GCache is to base it on your cluster's write rate.
Find your cluster's write rate:
You can calculate this using the wsrep_received_bytes status variable. First, check the value and note the time:
SHOW STATUS LIKE 'wsrep_received_bytes';
+------------------------+-----------+
| Variable_name          | Value     |
+------------------------+-----------+
| wsrep_received_bytes   | 6637093   |
+------------------------+-----------+
Wait for a significant interval during peak load (e.g., 3600 seconds, or 1 hour). Run the query again:
SHOW STATUS LIKE 'wsrep_received_bytes';
+------------------------+-----------+
| Variable_name          | Value     |
+------------------------+-----------+
| wsrep_received_bytes   | 79883093  |
+------------------------+-----------+
Now, calculate the rate (bytes per second):
write_rate = (79883093 - 6637093) / 3600 ≈ 20346 bytes per second
Calculate your desired GCache size:
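Decide on the time window you want to support (e.g., 2 hours = 7200 seconds), then multiply it by the write rate:
required_gcache_size = write_rate × time_window ≈ 20346 bytes per second × 7200 seconds ≈ 146.5 million bytes (about 140M)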
In this example, a gcache.size of 140M would allow a node to be down for 2 hours and still rejoin using IST.
Check your current GCache validity period:
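Conversely, you can use your write rate to see how long your current GCache size is valid:
period_of_validity = gcache_size / write_rate
For example, at about 20346 bytes per second, a 128M GCache (134217728 bytes) holds roughly 6597 seconds of history, or about 1 hour 50 minutes.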
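The three steps above can be worked through as a small Python sketch using the sample wsrep_received_bytes readings from this page. The function names are hypothetical, and the sketch assumes that the M suffix in gcache.size means mebibytes (1024 × 1024 bytes).

```python
MIB = 1024 * 1024  # assumption: "M" in gcache.size means mebibytes


def write_rate(bytes_start, bytes_end, interval_seconds):
    """Average replication write rate in bytes per second."""
    return (bytes_end - bytes_start) / interval_seconds


def required_gcache_bytes(rate, window_seconds):
    """GCache size needed to keep `window_seconds` of write-sets."""
    return rate * window_seconds


def validity_seconds(gcache_bytes, rate):
    """How long a GCache of the given size covers at this write rate."""
    return gcache_bytes / rate


# Two wsrep_received_bytes samples taken 3600 seconds apart.
rate = write_rate(6637093, 79883093, 3600)
size = required_gcache_bytes(rate, 7200)  # tolerate 2 hours of downtime

print(f"write rate: {rate:.0f} bytes/s")                  # 20346
print(f"GCache for 2 hours: {size / MIB:.0f} MiB")        # 140
print(f"128 MiB covers: {validity_seconds(128 * MIB, rate) / 60:.0f} minutes")  # 110
```

Because the write rate is an average over one sampling interval, it is worth repeating the measurement at peak load and rounding the resulting size up.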
A General Heuristic for Sizing
If you cannot calculate the write rate, you can use a simpler heuristic based on your data directory size as a starting point.
Start with the size of your data directory.
Subtract the size of the GCache's ring buffer file itself (default: galera.cache).
Consider your SST method:
If you use mysqldump for SST, you can also subtract the size of your InnoDB log files (as mysqldump does not copy them).
If you use rsync or xtrabackup, the log files are copied, so they should be part of the total size.
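As a rough illustration, the heuristic above can be written as a short Python function. The function name, parameter names, and the sizes in the example are all hypothetical; in practice the byte counts would be measured on the node's file system.

```python
GIB = 1024 ** 3


def heuristic_gcache_bytes(datadir_bytes, ring_buffer_bytes,
                           innodb_log_bytes, sst_method):
    """Starting-point GCache size derived from the data directory size."""
    # Start with the data directory and remove the GCache file itself.
    size = datadir_bytes - ring_buffer_bytes
    # mysqldump does not copy the InnoDB log files, so subtract them too.
    # rsync and xtrabackup copy them, so they stay in the total.
    if sst_method == "mysqldump":
        size -= innodb_log_bytes
    return max(size, 0)


# Hypothetical node: 50 GiB data directory, 1 GiB galera.cache, 2 GiB of logs.
for method in ("mysqldump", "rsync"):
    estimate = heuristic_gcache_bytes(50 * GIB, 1 * GIB, 2 * GIB, method)
    print(f"{method}: {estimate / GIB:.0f} GiB")
```

With these example numbers the estimate is 47 GiB for mysqldump and 49 GiB for rsync.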
These calculations are guidelines. If your cluster nodes frequently request SSTs, it is a clear sign your gcache.size is too small. In cases where you must avoid SSTs as much as possible, you should use a much larger GCache than suggested, assuming you have the available storage.
This page is licensed: CC BY-SA / Gnu FDL

