Understanding and Using In-Order Parallel Replication

spacer

In the world of databases, having multiple copies of your database helps to serve your business in a multitude of ways. To achieve this, replication is used. Replication copies the contents of one server called a Primary to one or more servers called Replicas.

One advantage of replication is the ability to separate the read requests from the write requests and direct those reads to a Replica, offloading that portion of a workload away from the Primary improving performance on the Primary. The main difference in basic replication, between a Primary and a Replica, is that the Primary must handle all the writes to a database. MariaDB Cluster is a multi-primary setup which is a different architecture and is excluded here. In order to keep the Primary and a Replica in sync, two replication threads are used.

Before MariaDB 10.0, replication was done with a single I/O thread. Each transaction in the Primary’s binlog is replicated to the relay logs on a Replica, by a single I/O thread. The SQL thread on a Replica would then apply those changes; one right after the other. Being single threaded, dealing with network latency, system resources and other components, replication can sometimes lag. The replication lag makes it difficult to keep the Primary and a Replica in sync. Replication can be started and stopped. Planned or unplanned, it can pick up where it left off. However, stopping replication for any period of time can and will cause replication lag, resulting in your data across the enterprise not being in sync. This was a major problem for single threaded replication. It is up to a Replica to catch up with the Primary.

Starting with MariaDB 10.0, parallel replication was introduced. Parallel replication creates multiple replication worker threads, working in parallel to apply transactions from the relay logs to a Replica. The SQL thread can now hand off transactions to be applied at the same time. This process potentially increases replication speed on a Replica and thereby reducing and eliminating replication lag. Let’s see how this works.

First, set up basic replication with MariaDB 10.x.x. The example below is using MariaDB Server 10.2.21. Next, enable parallel replication, on a Replica by setting the slave_parallel_threads variable. The slave_parallel_threads variable is dynamic, meaning it can be changed without restarting mysqld. Add this variable to the my.cnf configuration file to make it permanent. However, replication must be stopped on a Replica to enable it. The default value is zero, no worker threads are created and the SQL thread will handle this portion of the replication process, one at a time. Setting it to a non-zero value, specifies the number of threads a Replica will create. The slave_parallel_threads variable can be set between 0 and 16383. These threads will apply multiple transactions from the relay log in parallel. The number of parallel transactions that can be applied at the same time is limited by the slave_domain_parallel_threads parameter for a given replication domain on a Replica. This parameter is dynamic as well. Remember, replication on a Replica, must be stopped to enable it.

Example: (Dynamically done on a Replica.)

MariaDB [(none)]> stop slave; 
Query OK, 0 rows affected (0.00 sec) 

MariaDB [(none)]> set global slave_parallel_threads=8; 
Query OK, 0 rows affected (0.00 sec) 

MariaDB [(none)]> set global slave_domain_parallel_threads=8; 
Query OK, 0 rows affected (0.00 sec) 

MariaDB [(none)]> start slave; 
Query OK, 0 rows affected (0.01 sec)

Parallel replication is handled in two ways with MariaDB, In-Order and Out-of-Order replication. Out-of-Order replication deals with GTIDs and needs to be explicitly enabled by the application using domain ID. Out-of-Order replication will not be addressed here.

In-Order replication executes transactions from the relay logs on a Replica in the same order as they happen on the Primary. If it can be determined, that those transactions have no conflicts, then those transactions will be applied to a Replica in parallel. In-Order replication has five different modes to handle replication. They are conservative, optimistic, aggressive, minimal and none. The slave_parallel_mode global variable controls which mode. This variable is dynamic as well. Remember to stop replication on a Replica, before changing it.

Example:

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> set global slave_parallel_mode=conservative;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> show global variables like 'slave_parallel_mode';
+---------------------+--------------+
| Variable_name | Value |
+---------------------+--------------+
| slave_parallel_mode | conservative |
+---------------------+--------------+

MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.01 sec)

Conservative mode is the default mode for parallel replication and was introduced with MariaDB version 10.0. This mode uses group commit on the Primary. It reads ahead in the relay logs to find potential transactions to apply in parallel on a Replica. If transactions are committed together on the Primary, their commit id will be identical in the binlogs. These transactions then can be committed on a Replica in parallel using different slave parallel worker threads. They still have to maintain the same commit order as they happened on the Primary. If two transactions that were committed separately on the Primary and can potentially have a conflict, then the second transaction will wait for the first transaction to commit, before starting. These types of transactions are not committed in parallel.

Optimistic mode was introduced in MariaDB Server 10.1.3. It is just that, optimistically, it assumes that there will be few conflicts. It allows any transactional DML (INSERT/UPDATE/DELETE) commit to be applied in parallel on a Replica. If conflicts happen between any two commits, the second commit will be rolled back to allow the first commit to finish. Then the second commit will retry and commit. Keeping the same order on a Replica as it happened on the Primary. This takes valuable time. The extra time is offset by the time gained by using parallel replication. There are some heuristics in optimistic mode to prevent conflicts. One example of this is, if a transaction executed a row lock wait on the Primary, it will not run in parallel on a Replica.

DDL and non-transactional DML can not be applied in parallel. Simply because they cannot be rolled back. These types of transactions are applied individually and not in parallel. The different types of transactions and how they are applied can be seen in the binlogs. These logs are not readable in plain text. Mysqlbinlog is needed to read their contents.

Example:

# cd /binlog/directory
# mysqlbinlog mysql-binlog.00420 | more

Aggressive mode was introduced in MariaDB Server 10.1.3. It is very similar to optimistic mode. The main difference between aggressive mode and optimistic is that the heuristics are disabled in aggressive mode. This means more commits can be applied, but at the cost of greater amounts of conflicts.

Minimal mode was introduced in MariaDB Server 10.1.3. This mode only allows the commit step of a transaction to be applied in parallel.

To check the status of the worker threads on a Replica use the SHOW PROCESSLIST. The worker threads will be under “system user”. Their state will show the query they are currently working on or it will show one of the following messages:

Waiting for work from SQL threads –
This means that the worker thread is idle, no work is available for it at the moment.

Waiting for prior transaction to start commit before starting next transaction –
This means that the previous batch of transactions that committed together on the Primary has to complete first. This worker thread is waiting for that to happen before it can start working on the following batch.

Waiting for prior transaction to commit –
This means that the transaction has been executed by the worker thread. In order to ensure in-order commit, the worker thread is waiting to commit until the previous transaction is ready to commit before it.

For better performance with parallel replication on a Replica, it is best to have more group commits on the Primary, when using conservative mode of in-order replication. To calculate the number of transactions in a binlog group commit you divide binlog_commits , which is the total number of transactions committed to the binary log, by binlog_group_commits, which is the total number of groups committed to the binary log, in SHOW STATUS. To see how group commit is helping parallel replication, take two calculations. If your group commit ratio is close to 1, meaning having parallel replication threads may not help, then it may help to change your group commit frequency. To calculate the group commit frequency use the following formula:

group commit = (2nd Binlog_commits - 1stBinlog_commits)/(2nd Binlog_group_commits)–(1st Binlog_group_commits)

Example: 1st collection:

SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+----------+
| Variable_name | Value |
+----------------------+----------+
| Binlog_commits | 35275333 |
| Binlog_group_commits | 13818856 |
+----------------------+----------+

Commit some transactions (small workload).
2nd collection:

SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+----------+
| Variable_name | Value |
+----------------------+----------+
| Binlog_commits | 42787157 |
| Binlog_group_commits | 15244542 |
+----------------------+----------+

This gives the number of transactions in a binlog group commit.

group commit = ( 7511824 / 1425686 = 5.26 group commit

Starting with MariaDB 10.0. Group commits frequency can be changed by configuring the system variables binlog_commit_wait_usec, which is the given number of transactions that are ready to be flushed as a group, and the binlog_commit_wait_count, which delays flushing a committed transaction into the binary log until that transaction has waited the configured number of microseconds.

Binlog_commit_wait_count has a direct correlation to the slave_parallel_threads on a Replica. Bringing us back to the different variables on a Replica that influence parallel replication like slave_parallel_threads. Another variable that influence parallel replication on a Replica is slave_parallel_max_queued. In parallel replication the SQL thread reads the relay logs on a Replica, instead of applying those transactions itself, one at a time, it hands those commits off to worker threads to  apply in parallel. The slave_parallel_max_queued controls the maximum size of a parallel slave queue can have in memory. If set too high, there is replication lag. It could fill up the queue faster than the worker threads can commit. If set too low, then there would not be enough space in memory to keep all the worker threads busy. You would see this in the SHOW PROCESSLIST on a Replica, with Waiting for room in worker thread event queue.

In order to ensure durability with group commit, the “D” in ACID compliance, is accomplished by setting both the sync_binlog variable to one, and innodb_flush_log_at_trx_commit variable to one as well. Setting these both to one gives you the greatest level of fault tolerance. But, may come at a cost. Parallel replication may help alleviate that cost.

Let’s see how parallel replication affects these variables. I set up two database instances on two separate servers, both running MariaDB 10.2.21. Port 3502 will be known as the Primary and port 3503 will be a Replica.

Connect to the first server, log on to the first database instance and create a database for testing. This instance will be known as “bob”.

ssh <server>@.com
MariaDB [(none)]> CREATE DATABASE bob;

Create two users. One is a replication user and the other is for ptstalk. Ptstalk is a great tool to connect system statistics. Give them the appropriate privileges.
On Primary:

MariaDB [(none)]> CREATE user 'repl_bob'@'localhost' identified by 'repl_bob';
MariaDB [(none)]> GRANT PROCESS, REPLICATION SLAVE, REPLICATION CLIENT on *.* to 'repl_bob'@'localhost';
MariaDB [(none)]> CREATE user 'tracker'@'localhost' identified by "tracker";
MariaDB [(none)]> GRANT ALL on *.* to 'tracker'@'localhost';

Then,

MariaDB [(none)]> Show master status;

On Replica:

MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST=.com,
MASTER_USER='repl_bob', MASTER_PASSWORD='repl_bob', MASTER_PORT=3502, MASTER_LOG_FILE='mysql-binlog.0000010', MASTER_LOG_POS=123456789, MASTER_CONNECT_RETRY=10;

MariaDB [(none)]> Start slave;
MariaDB [(none)]> Show slave status\G

At this point, basic replication should be running. Back on the Primary we create tables to be populated and help with showing replication lag using this workload.
On Primary:

MariaDB [(none)]>
CREATE TABLE Tab1(id VARCHAR(30)PRIMARY KEY, firstname VARCHAR(30) NOT NULL, lastname VARCHAR(30) NOT NULL, email VARCHAR(50), reg_date TIMESTAMP );
CREATE TABLE Tab2(id VARCHAR(30)PRIMARY KEY, firstname VARCHAR(30) NOT NULL, lastname VARCHAR(30) NOT NULL, email VARCHAR(50), reg_date TIMESTAMP );
CREATE TABLE Tab3(id VARCHAR(30)PRIMARY KEY, firstname VARCHAR(30) NOT NULL, lastname VARCHAR(30) NOT NULL, email VARCHAR(50), reg_date TIMESTAMP );
CREATE TABLE Tab4(id VARCHAR(30)PRIMARY KEY, firstname VARCHAR(30) NOT NULL, lastname VARCHAR(30) NOT NULL, email VARCHAR(50), reg_date TIMESTAMP );
CREATE TABLE Tab5(id VARCHAR(30)PRIMARY KEY, firstname VARCHAR(30) NOT NULL, lastname VARCHAR(30) NOT NULL, email VARCHAR(50), reg_date TIMESTAMP );

Let’s do some testing:

I run a replication workload on the Primary, for fifteen minutes to get a baseline. Take a timestamp on how far behind replication gets on a Replica and how long it takes for replication lag to catch up. Shown below are the changing of different replication variables for each test and the results of those tests.

The following procedures were performed for each test.

Start pt-stalk on the Primary before running the replication workload script. This helps gather valuable information about what the database is doing.

pt-stalk --no-stalk --iterations 200 --sleep 5 --dest ./pt-stalk --user=tracker --
socket=/tmp/mysqld_test01_3502.sock port=3502 --password=tracker &

On a Replica start pt-stalk to collect a Replica’s statistics.

pt-stalk --no-stalk --iterations 200 --sleep 5 --dest ./pt-stalk --user=tracker --
socket=/tmp/mysqld_test01_3503.sock --port=3503 --password=tracker &

Start a workload on the Primary. The repl_wk_load.sh script contains one INSERT, one UPDATE and one DELETE statement for each table that continuously execute for fifteen minutes.

nohup ./repl_wk_load.sh nohuppy &

Manually monitoring the status of the replication workload script can be done withSHOW PROCESSLIST on the Primary. Again, this information is also being obtained by ptstalk.

MariaDB [(none)]> SHOW PROCESSLIST;

On a Replica, check the processes and replication lag with:

MariaDB [(none)]> SHOW PROCESSLIST;
MariaDB [(none)]> SHOW SLAVE STATUS\G

=============================================================================
Test 0 – Baseline

Primary -
sync_binlog = 0
Innodb_flush_log_at_trx_commit = 0
binlog_format = Mixed
binlog_commit_wait_count = 0
binlog_commit_wait_usec = 100000
Replica - Defaults
slave_parallel_threads = 0
slave_parallel_max_queued = 131072
slave_compressed_protocol = 0

Test 1

Primary -
sync_binlog = 1
Innodb_flush_log_at_trx_commit = 1
binlog_format = Mixed
binlog_commit_wait_count = 0
binlog_commit_wait_usec = 100000
Replica - no change
slave_parallel_threads = 0
slave_parallel_max_queued = 131072
slave_compressed_protocol = 0

Test 2

Primary -
sync_binlog = 1
Innodb_flush_log_at_trx_commit = 1
binlog_format = Mixed
binlog_commit_wait_count = 5
binlog_commit_wait_usec = 100000
Replica -
slave_parallel_threads = 5
slave_parallel_max_queued = 131072
slave_compressed_protocol = 0

Test 3

Primary -
sync_binlog = 1
Innodb_flush_log_at_trx_commit = 1
binlog_format = Mixed
binlog_commit_wait_count = 5
binlog_commit_wait_usec = 50000
Replica -
slave_parallel_threads = 5
slave_parallel_max_queued = 131072
slave_compressed_protocol = 0

Results:

Start Time/Workload 

(15 minutes)

Endtime/Workload, Stopped Sec Behind Master at Endtime Max Sec Behind Master Replication Catch Up Time
Baseline 10:33:27 10:48:27 673 2892 11:36:28 

(48 minutes)

Test 1 12:14:41 12:29:41 753 6397 14:06:33 

(97 minutes)

Test 2 08:14:41 08:29:40 491 1401 08:54:00 

(24 minutes)

Test 3 09:04:15 09:19:15 0 0 No Lag

Below are the measured group commits on the Primary.

Baseline

MariaDB [sample5]> SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+----------+
| Variable_name    	| Value	|
+----------------------+----------+
| Binlog_commits   	| 42180283 |
| Binlog_group_commits | 32312303 |
+----------------------+----------+
+----------------------+----------+
| Variable_name    	| Value	|
+----------------------+----------+
| Binlog_commits   	| 54229555 |
| Binlog_group_commits | 43653895 |
+----------------------+----------+

(54229555-42180283)/(43653895-32312303) = 12049272/11341592 = 1.06xxx

Test 1

MariaDB [(none)]> SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| Binlog_commits       | 5335214 |
| Binlog_group_commits | 2632599 |
+----------------------+---------+
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| Binlog_commits       | 13733155 |
| Binlog_group_commits | 6773225  |
+----------------------+----------+

(13733155-5335214)/(6773225-2632599) = 8397941/4140626 = 2.02xxx

Test 2

MariaDB [(none)]> SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+----------+
| Variable_name    	| Value	|
+----------------------+----------+
| Binlog_commits   	| 23916530 |
| Binlog_group_commits | 9063206  |
+----------------------+----------+
+----------------------+----------+
| Variable_name    	| Value	|
+----------------------+----------+
| Binlog_commits   	| 33913182 |
| Binlog_group_commits | 13566537 |
+----------------------+----------+

(13566537-9063206)/(33913182-23916530) = 9996652/4503331 = 2.21

Test 3

MariaDB [(none)]> SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+----------+
| Variable_name    	| Value	|
+----------------------+----------+
| Binlog_commits   	| 35275333 |
| Binlog_group_commits | 13818856 |
+----------------------+----------+
+----------------------+----------+
| Variable_name    	| Value	|
+----------------------+----------+
| Binlog_commits   	| 42787157 |
| Binlog_group_commits | 15244542 |
+----------------------+----------+
 
(15244542-13818856)/(42787157-35275333) = 7511824/1425686 = 5.26

Drawing conclusions from the above data you can see how using slave_parallel_threads with parallel replication can reduce or remove replication lag completely. We can also see what affect binlog_commit_wait_count and binlog_commit_wait_usec have on group commits. Each workload is different, just enabling parallel replication may not be enough. Test and evaluate which settings are right for your workload.