October 5, 2016

MariaDB High Availability: Replication Manager

Note: Replication Manager is developed and maintained by community members. It can be used with MariaDB Server, but it is not supported with an enterprise subscription.

This is a follow-up blog post that expands on the subject of highly available clusters, discussed in MariaDB MaxScale High Availability: Active-Standby Cluster.

Replication Manager is a tool that manages MariaDB 10 clusters. It supports both interactive and automated failover of the master server. It verifies the integrity of the slave servers before promoting one of them as the replacement master, and it also protects the slaves by automatically setting them to read-only mode. You can find more information in the replication-manager GitHub repository.

Using Replication Manager allows us to automate the replication failover. This reduces the amount of manual work required to adapt to changes in the cluster topology and makes for a more highly available database cluster.

In this blog post, we'll cover the topic of backend database HA and use Replication Manager to create a complete HA solution. We build on the setup described in the earlier blog post and integrate Replication Manager into it. We're using CentOS 7 as our OS and version 0.7.0-rc2 of the replication-manager.

Setting Up Replication Manager

Replication Manager allows us to manage the replication topology of the cluster without having to change it manually. The easiest way to integrate it into our Corosync setup is to build it from source and use the systemd service file it provides.

sudo yum install go git
export GOPATH=~/gocode && mkdir ~/gocode && cd ~/gocode
go get github.com/tanji/replication-manager
go install github.com/tanji/replication-manager
sudo cp src/github.com/tanji/replication-manager/service/replication-manager.service /lib/systemd/system/replication-manager.service
sudo ln -s $GOPATH/bin/replication-manager /usr/bin/replication-manager
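
Depending on the systemd version, the copied unit file may not be picked up until the daemon is reloaded. As an optional sanity check (not part of the original steps), you can reload systemd and confirm the symlinked binary is on the PATH:

# Make systemd re-read its unit files so it sees replication-manager.service.
sudo systemctl daemon-reload
# Confirm the symlinked binary resolves.
which replication-manager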

Now, we should have a working replication-manager installation. The next step is to configure the servers that it manages. The replication-manager reads its configuration file from /etc/replication-manager/config.toml. Create the file and add the following lines to it.

logfile = "/var/log/replication-manager.log"
verbose = true
hosts = "192.168.56.1:3000,192.168.56.1:3001,192.168.56.1:3002,192.168.56.1:3003"
user = "maxuser:maxpwd"
rpluser = "maxuser:maxpwd"
interactive = false
failover-limit = 0
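
Before handing the service over to Pacemaker, it can be worth starting it by hand on one node and checking the log to confirm it can reach all four servers. This is an optional check, not part of the original setup; stop the service again before creating the cluster resource below.

# Optional: run replication-manager manually once and inspect its log.
sudo systemctl start replication-manager
sudo tail -n 20 /var/log/replication-manager.log
sudo systemctl stop replication-manager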

Once the configuration file is in place, we can add it as a resource and colocate it with the clusterip resource we created in the previous blog post.

sudo pcs resource create replication-manager systemd:replication-manager op monitor interval=1s
sudo pcs constraint colocation add replication-manager with clusterip INFINITY
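
If you want to double-check the colocation rule itself, pcs can also list the configured constraints (output omitted here):

sudo pcs constraint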

After the resource is added and configured, we can see that it has been started on node1.

[user@localhost ~]$ sudo pcs resource
 Clone Set: maxscale-clone [maxscale]
     Started: [ node1 node2 ]
 clusterip	(ocf::heartbeat:IPaddr2):	Started node1
 replication-manager	(systemd:replication-manager):	Started node1

Looking at the maxadmin output on the server where the active MaxScale is running, we see that the server at 192.168.56.1:3000 is currently the master.

[user@localhost ~]$ sudo maxadmin list servers 
Servers.
---------+-----------------+-------+-------------+--------------
Server   | Address         | Port  | Connections | Status              
---------+-----------------+-------+-------------+--------------
server1  | 192.168.56.1    |  3000 |           0 | Master, Running
server2  | 192.168.56.1    |  3001 |           0 | Slave, Running
server3  | 192.168.56.1    |  3002 |           0 | Slave, Running
server4  | 192.168.56.1    |  3003 |           0 | Slave, Running
---------+-----------------+-------+-------------+--------------

Now, if we kill the master, replication-manager should pick that up and perform a master failover. But first we need to insert some data to make sure the failover is performed correctly, and to make it a bit of a challenge, we'll do those inserts continuously. First, we create an extremely simple table.

CREATE TABLE test.t1 (id INT);
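
This statement can be run through the same MaxScale read-write split listener that the insert loop below uses, so it is routed to the current master; the host, port and credentials here are simply the ones from that loop:

mysql -u maxuser -pmaxpwd -h 192.168.56.220 -P 4006 -e "CREATE TABLE test.t1 (id INT)"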

For this, a mysql client has been started in a loop on a remote server.

i=0; while true; do mysql -ss -u maxuser -pmaxpwd -h 192.168.56.220 -P 4006 -e "INSERT INTO test.t1 VALUES (1);SELECT NOW()"; sleep 1; done

Now, we’ll have a constant stream of inserts going to our cluster and we can see what happens when we kill the current master at 192.168.56.1:3000.
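
How the master is brought down doesn't matter much for the test. One way to simulate a crash, assuming the master is the only mysqld listening on port 3000 on that host, is to kill the process holding the socket; this is just an illustration, not necessarily how it was done here:

# Sends SIGKILL to whatever owns TCP port 3000, i.e. the master mysqld in this setup.
sudo fuser -k 3000/tcp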

2016/10/01 19:56:05 WARN : Master Failure detected! Retry 1/5
2016/10/01 19:56:05 INFO : INF00001 Server 192.168.56.1:3000 is down
2016/10/01 19:56:07 WARN : Master Failure detected! Retry 2/5
2016/10/01 19:56:09 WARN : Master Failure detected! Retry 3/5
2016/10/01 19:56:11 WARN : Master Failure detected! Retry 4/5
2016/10/01 19:56:13 WARN : Master Failure detected! Retry 5/5
2016/10/01 19:56:13 WARN : Declaring master as failed
2016/10/01 19:56:13 INFO : Starting master switch
2016/10/01 19:56:13 INFO : Electing a new master
2016/10/01 19:56:13 INFO : Slave 192.168.56.1:3001 [0] has been elected as a new master
2016/10/01 19:56:13 INFO : Reading all relay logs on 192.168.56.1:3001
2016/10/01 19:56:13 INFO : Stopping slave thread on new master
2016/10/01 19:56:14 INFO : Resetting slave on new master and set read/write mode on
2016/10/01 19:56:14 INFO : Switching other slaves to the new master
2016/10/01 19:56:14 INFO : Change master on slave 192.168.56.1:3003
2016/10/01 19:56:14 INFO : Change master on slave 192.168.56.1:3002
2016/10/01 19:56:15 INFO : Master switch on 192.168.56.1:3001 complete

The replication-manager successfully detected the failure of the master and performed a failover. MaxScale will detect this and adapt accordingly.

[user@localhost ~]$ sudo maxadmin list servers 
Servers.
---------+---------------+-------+-------------+--------------
Server   | Address       | Port  | Connections | Status              
---------+---------------+-------+-------------+--------------
server1  | 192.168.56.1  |  3000 |           0 | Down
server2  | 192.168.56.1  |  3001 |           0 | Master, Running
server3  | 192.168.56.1  |  3002 |           0 | Slave, Running
server4  | 192.168.56.1  |  3003 |           0 | Slave, Running
---------+---------------+-------+-------------+--------------

From the remote server’s terminal, we can see that there was a small window where writes weren’t possible.

2016-10-01 19:56:01
2016-10-01 19:56:02
2016-10-01 19:56:03
2016-10-01 19:56:04
ERROR 2013 (HY000) at line 1: Lost connection to MySQL server during query
ERROR 1045 (28000): failed to create new session
ERROR 1045 (28000): failed to create new session
ERROR 1045 (28000): failed to create new session
ERROR 1045 (28000): failed to create new session
ERROR 1045 (28000): failed to create new session
ERROR 1045 (28000): failed to create new session
ERROR 1045 (28000): failed to create new session
ERROR 1045 (28000): failed to create new session
ERROR 1045 (28000): failed to create new session
ERROR 1045 (28000): failed to create new session
2016-10-01 19:56:17
2016-10-01 19:56:18
2016-10-01 19:56:19
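
The gap corresponds to the failure-detection retries and the master switch itself. An application that needs to ride through such a window has to retry on its own; as a rough sketch (connection details taken from the loop above, retry count arbitrary), the client loop could be extended along these lines:

# Retry each INSERT for up to 30 seconds before giving up on it.
while true; do
    for attempt in $(seq 1 30); do
        mysql -ss -u maxuser -pmaxpwd -h 192.168.56.220 -P 4006 \
              -e "INSERT INTO test.t1 VALUES (1); SELECT NOW()" && break
        sleep 1
    done
    sleep 1
done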

Now our whole cluster is highly available and ready for all kinds of disasters.

Summary

After integrating the replication-manager into our Corosync/Pacemaker setup, our cluster is highly available. If the server where the replication-manager is running were to go down, it would be started up on another node. Database server outages will be managed by the replication-manager, and access to the cluster will be handled by MaxScale.
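
One way to exercise the Pacemaker side of this claim without crashing a node is to put the node running replication-manager into standby and watch the resource move; the exact subcommand can vary between pcs versions:

# Put node1 into standby so its resources migrate, then check where they ended up.
sudo pcs cluster standby node1
sudo pcs resource
# Bring node1 back once the test is done.
sudo pcs cluster unstandby node1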

As was mentioned in the previous blog post, high availability is a critical part of any modern system. Even a comfortable Saturday afternoon can turn into a nightmare when a service isn't highly available.

About Markus Mäkelä

Markus Mäkelä is a Software Engineer working on MariaDB MaxScale. He graduated from Metropolia University of Applied Sciences in Helsinki, Finland.
