Understanding MariaDB Galera Cluster
In order to handle increasing load and especially when that load exceeds what a single server can process, it is best practice to deploy multiple MariaDB Servers with a replication solution to maintain data consistency between them. MariaDB Galera Cluster is a multi-primary replication solution that serves as an alternative to the single-primary MariaDB Replication.
MariaDB Galera Cluster is available on all versions of MariaDB Community Server.
How it Works
MariaDB Galera Cluster implements MariaDB Community Server with Galera Cluster.
As a multi-primary replication solution, any MariaDB Community Server can operate as a Primary Server. This means that changes made to any node in the cluster replicate to every other node in the cluster, using certification-based replication and global ordering of transactions for the InnoDB storage engine.
MariaDB Galera Cluster is only available for Linux operating systems. It requires use of the InnoDB storage engine and does not support XA transactions.
There are a few things to consider when planning the hardware, virtual machines, or containers for MariaDB Enterprise Cluster.
MariaDB Galera Cluster architecture involves deploying multiple instances of MariaDB Community Server. The Servers are configured to use multi-primary replication to maintain consistency between themselves.
Writes made to any node in this cluster replicate to all the other nodes of the cluster.
When MariaDB Community Servers start in a cluster:
Each Server attempts to establish network connectivity with the other Servers in the cluster
Groups of connected Servers form a component
When a Server establishes network connectivity with the Primary Component, it synchronizes its local database with that of the cluster
As a member of the Primary Component, the Server becomes operational — able to accept read and write queries from clients
During startup, the Primary Component is the Server bootstrapped to run as the Primary Component. Once the cluster is online, the Primary Component is any combination of Servers which includes a minimum of more than half the total number of Servers.
A Server or group of Servers that loses network connectivity with the majority of the cluster becomes non-operational.
In planning the number of systems to provision for MariaDB Galera Cluster, it is important to keep cluster operation in mind. Ensuring that it has enough disk space and that it is able to maintain a Primary Component in the event of outages.
Each Server requires the minimum amount of disk space needed to store the entire database. The upper storage limit for MariaDB Enterprise Cluster is that of the smallest disk in use.
Each switch in use should have an odd number of Servers above three.
In a cluster that spans multiple switches, each data center in use should have an odd number of switches above three.
In a cluster that spans multiple data centers, use an odd number of data centers above three.
Each data center in use should have at least one Server dedicated to backup operations. This can be another cluster node or a separate Replica Server kept in sync using MariaDB Replication.
In case of planning Servers to the switch, switches to the data center, and data centers in the cluster, this model helps preserve the Primary Component. A minimum of three in use means that a single Server or switch can fail without taking down the cluster.
Using an odd number above three reduces the risk of a split-brain situation (that is, a case where two separate groups of Servers believe that they are part of the Primary Component and remain operational).
Nodes in MariaDB Galera Cluster are individual MariaDB Community Servers configured to perform multi-primary cluster replication. This configuration is set using a series of system variables in the configuration file.
[mariadb] # General Configuration bind_address = 0.0.0.0 innodb_autoinc_lock_mode = 2 # Cluster Configuration wsrep_cluster_name = "accounting_cluster" wsrep_cluster_address = "gcomm://192.0.2.1,192.0.2.2,192.0.2.3" # wsrep Provider wsrep_provider = /usr/lib/galera/libgalera_smm.so wsrep_provider_options = "evs.suspect_timeout=PT10S"
Additional information on system variables is available in the Reference chapter.
innodb_autoinc_lock_mode system variable must be set to a value of
2 to enable interleaved lock mode. MariaDB Enterprise Cluster does not support other lock modes.
Ensure also that the
bind_address system variable is properly set to allow MariaDB Enterprise Server to listen for TCP/IP connections:
bind_address = 0.0.0.0 innodb_autoinc_lock_mode = 2
Cluster Name and Address
MariaDB Galera Cluster requires that you set a name for your cluster, using the
wsrep_cluster_name system variable. When nodes connect to each other, they check the cluster name to ensure that they've connected to the correct cluster before replicating data. All Servers in the cluster must have the same value for this system variable.
wsrep_cluster_address system variable, you can define the back-end protocol (always
gcomm) and comma-separated list of the IP addresses or domain names of the other nodes in the cluster.
wsrep_cluster_name = "accounting_cluster" wsrep_cluster_address = "gcomm://192.0.2.1,192.0.2.2,192.0.2.3"
It is best practice to list all nodes on this system variable, as this is the list the node searches when attempting to reestablish network connectivity with the primary component.
In certain environments, such as deployments in the cloud, you may also need to set the
wsrep_node_address system variable, so that MariaDB Community Server properly informs other Servers how to reach it.
Galera Replicator Plugin
MariaDB Community Server connects to other Servers and replicates data from the cluster through a wsrep Provider called the Galera Replicator plugin. In order to enable clustering, specify the path to the relevant
.so file using the
wsrep_provider system variable.
To enable MariaDB Galera Cluster, use the
wsrep_provider = /usr/lib/galera/libgalera_smm.so
In addition to system variables, there is a set of options that you can pass to the wsrep Provider to configure or to otherwise adjust its operations. This is done through the
wsrep_provider_options system variable:
wsrep_provider_options = "evs.suspect_timeout=PT10S"
Additional information is available in the Reference chapter.
MariaDB Galera Cluster implements a multi-primary replication solution.
When you write to a table on a node, the node collects the write into a write-set transaction, which it then replicates to the other nodes in the cluster.
Your application can write to any node in the cluster. Each node certifies the replicated write-set. If the transaction has no conflicts, the nodes apply it. If the transaction does have conflicts, it is rejected and all of the nodes revert the changes.
The first node you start in MariaDB Galera Cluster bootstraps the Primary Component. Each subsequent node that establishes a connection joins and synchronizes with the Primary Component. A cluster achieves a quorum when more than half the nodes are joined to the Primary Component.
When a component forms that has less than half the nodes in the cluster, it becomes non-operational, since it believes there is a running Primary Component to which it has lost network connectivity.
These quorum requirements, combined with the requisite number of odd nodes, avoid a split brain situation, or one in which two separate components believe they are each the Primary Component.
Dynamically Bootstrapping the Cluster
In cases where the cluster goes down and your nodes become non-operational, you can dynamically bootstrap the cluster.
First, find the most up-to-date node (that is, the node with the highest value for the
wsrep_last_committed status variable):
SHOW STATUS LIKE 'wsrep_last_committed';
Once you determine the node with the most recent transaction, you can designate it as the Primary Component by running the following on it:
SET GLOBAL wsrep_provider_options="pc.bootstrap=YES";
The node bootstraps the Primary Component onto itself. Other nodes in the cluster with network connectivity then submit state transfer requests to this node to bring their local databases into sync with what's available on this node.
From time to time a node can fall behind the cluster. This can occur due to expensive operations being issued to it or due to network connectivity issues that lead to write-sets backing up in the queue. Whatever the cause, when a node finds that it has fallen too far behind the cluster, it attempts to initiate a state transfer.
In a state transfer, the node connects to another node in the cluster and attempts to bring its local database back in sync with the cluster. There are two types of state transfers:
Incremental State Transfer (IST)
State Snapshot Transfer (SST)
When the donor node receives a state transfer request, it checks its write-set cache (that is, the GCache) to see if it has enough saved write-sets to bring the joiner into sync. If the donor node has the intervening write-sets, it performs an IST operation, where the donor node only sends the missing write-sets to the joiner. The joiner applies these write-sets following the global ordering to bring its local databases into sync with the cluster.
When the donor does not have enough write-sets cached for an IST, it runs an SST operation. In an SST, the donor uses a backup solution, like MariaDB Enterprise Backup, to copy its data directory to the joiner. When the joiner completes the SST, it begins to process the write-sets that came in during the transfer. Once it's in sync with the cluster, it becomes operational.
IST's provide the best performance for state transfers and the size of the GCache may need adjustment to facilitate their use.
MariaDB Galera Cluster uses Flow Control to sometimes throttle transactions and ensure that all nodes work equitably.
Write-sets that replicate to a node are collected by the node in its received queue. The node then processes the write-sets according to global ordering. Large transactions, expensive operations, or simple hardware limitations can lead to the received queue backing up over time.
When a node's received queue grows beyond certain limits, the node initiates Flow Control. In Flow Control, the node pauses replication to work through the write-sets it already has. Once it has worked the received queue down to a certain size, it re-initiates replication.
A node or nodes will be removed or evicted from a cluster if it becomes non-responsive.
In MariaDB Galera Cluster, each node monitors network connectivity and response times from every other node. MariaDB Enterprise Cluster evaluates network performance using the EVS Protocol.
When a node finds another to have poor network connectivity, it adds an entry to the delayed list. If the node becomes active again and the network performance improves for a certain amount of time, an entry for it is removed from the delayed list. That is, the longer a node has network problems the longer it has to be active again to be removed from the delayed list.
If the number of entries for a node in the delayed list exceeds a threshold established for the cluster, the EVS Protocol evicts the node from the cluster.
Evicted nodes become non-operational components. They cannot rejoin the cluster until you restart MariaDB Enterprise Server.
Under normal operation, huge transactions and long-running transactions are difficult to replicate. MariaDB Enterprise Cluster rejects conflicting transactions and rolls back the changes. A transaction that takes several minutes or longer to run can encounter issues if a small transaction is run on another node and attempts to write to the same table. The large transaction fails because it encounters a conflict when it attempts to replicate.
MariaDB Enterprise Server 10.4 introduces streaming replication for MariaDB Enterprise Cluster. In streaming replication, huge transactions are broken into transactional fragments, which are replicated and applied as the operation runs. This makes it more difficult for intervening sessions to introduce conflicts.
Initiate Streaming Replication
To initiate streaming replication, set the
wsrep_trx_fragment_size system variables. You can set the unit to
SET SESSION wsrep_trx_fragment_unit='STATEMENTS'; SET SESSION wsrep_trx_fragment_size=5;
Then, run your transaction.
Streaming replication works best with very large transactions where you don't expect to encounter conflicts. If the statement does encounter a conflict, the rollback operation is much more expensive than usual. As such, it's best practice to enable streaming replication at a session-level and to disable it by setting the wsrep_trx_fragment_size system variable to 0 when it's not needed.
SET SESSION wsrep_trx_fragment_size=0;
Deployments on mixed hardware can introduce issues where some MariaDB Enterprise Servers perform better than others. A Server in one part of the world might perform more reliably or be physically closer to most users than others. In cases where a particular MariaDB Enterprise Server holds logical significance for your cluster, you can weight its value in quorum calculations.
Galera Arbitrator is a separate process that runs alongside MariaDB Community Server. While the Arbitrator does not take part in replication, whenever the cluster performs quorum calculations it gives the Arbitrator a vote as though it were another MariaDB Enterprise Server. In effect this means that the system has the vote of MariaDB Enterprise Server plus any running Arbitrators in determining whether it's part of the Primary Component.
Bear in mind that the Galera Arbitrator is a separate package,
galera-arbitrator-4, which is not installed by default with MariaDB Enterprise Server.
MariaDB Community Servers that join a cluster attempt to connect to the IP addresses provided to the
wsrep_cluster_address system variable. This variable adjusts itself at runtime to include the addresses of all connected nodes.
To scale-out MariaDB Galera Cluster, start new MariaDB Community Servers with the appropriate
wsrep_cluster_address list and the same
wsrep_cluster_name value. The new nodes establish network connectivity with the running cluster and request a state transfer to bring their local database into sync with the cluster.
Once the MariaDB Community Server reports itself as being in sync with the cluster, MariaDB MaxScale can begin including it in the load distribution for the cluster.
Being a multi-primary replication solution means that any MariaDB Enterprise Server in the cluster can handle write operations, but write scale-out is minimal as every Server in the cluster needs to apply the changes.
MariaDB Galera Cluster does not provide failover capabilities on its own. MariaDB MaxScale is used to route client connections to MariaDB Enterprise Server.
Unlike a traditional load balancer, MariaDB MaxScale is aware of changes in the node and cluster states.
MaxScale takes nodes out of the distribution that initiate a blocking SST operation or Flow Control or otherwise go down, which allows them to recover or catch up without stopping service to the rest of the cluster.
With MariaDB Galera Cluster, each node contains a replica of all the data in the cluster. As such, you run MariaDB Backup on any node to backup the available data. The process for backing-up a node is the same as for a single MariaDB Community Server.
MariaDB Community Server supports data-at-rest encryption to secure data on disk, and data-in-transit encryption to secure data on the network.
For data-in-transit, MariaDB Galera Cluster supports encryption the same as MariaDB Server and additionally provides data-in-transit encryption for Galera replication traffic and for State Snapshot Transfer (SST) traffic.