Evaluating MariaDB & MySQL Parallel Replication Part 2: Slave Group Commit

(The previous post, Better Parallel Replication for MySQL, is Part 1 of the series.) Parallel replication is a long-awaited feature of MySQL. It is already available in MariaDB 10.0 and in MySQL 5.7. In this post, a very nice side effect of the MariaDB implementation is presented: Slave Group Commit.

(If you are not familiar with parallel replication and its implementation, read the previous post in this series to better understand this post.)

To allow better parallelization on slaves, MariaDB 10.0 and MySQL 5.7 have parameters that control the commit group sizes on the master. Those parameters are binlog_commit_wait_count / binlog_commit_wait_usec in MariaDB and binlog_group_commit_sync_delay / binlog_group_commit_sync_no_delay_count in MySQL. The purpose of those parameters is to delay the commit of transactions on the master to allow more transactions to join a commit group. This increases the commit group size, and thus the number of transactions that can be run in parallel on slaves. Those parameters can be seen as a way to speed up slaves at the expense of slowing down the master.
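
For illustration, setting these parameters on the master might look like the following; the numeric values are examples only, and the right ones depend on the workload and on how much extra commit latency the master can afford:

   -- MariaDB 10.0 master: wait until 20 transactions are queued for commit,
   -- but never delay a commit by more than 2000 microseconds (2 ms).
   SET GLOBAL binlog_commit_wait_count = 20;
   SET GLOBAL binlog_commit_wait_usec  = 2000;

   -- MySQL 5.7 master: delay the binary log sync by up to 2000 microseconds,
   -- or stop waiting as soon as 20 transactions are ready to commit.
   SET GLOBAL binlog_group_commit_sync_delay          = 2000;
   SET GLOBAL binlog_group_commit_sync_no_delay_count = 20;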

When those parameters are used on a MariaDB slave with parallel replication enabled (slave_parallel_threads > 1), and when that slave replicates from a MariaDB 10.0 master running a single-threaded workload, transactions are run sequentially but their commits are delayed to try to group them, hence achieving group commit on the slave.

Let’s take the previous sentence one element at a time and explain each step in more detail.

In the replication hierarchy shown below:

  • X is the master (any of MySQL 5.5, MySQL 5.6, MariaDB 5.5 or MariaDB 10.0)
  • Y is an intermediate master running MariaDB 10.0
  • Z is a slave running MariaDB 10.0 with binary logs and log-slave-updates enabled
-----     -----     -----
| X | --> | Y | --> | Z |
-----     -----     -----

The Y intermediate master is configured with the following parameters (single threaded slave):

  • SET GLOBAL slave_parallel_threads = 0;
  • SET GLOBAL binlog_commit_wait_count = 0;
  • SET GLOBAL binlog_commit_wait_usec = 0;

And the Z slave is configured with the following parameters:

  • SET GLOBAL slave_parallel_threads = "value greater than 1";
  • SET GLOBAL binlog_commit_wait_count = "value greater than 1";
  • SET GLOBAL binlog_commit_wait_usec = "value greater than 0";
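
As a concrete sketch, the Z slave could be configured as follows; the numeric values are only examples, not recommendations, and MariaDB requires the slave to be stopped before slave_parallel_threads can be changed:

   -- On Z (example values only).
   STOP SLAVE;
   SET GLOBAL slave_parallel_threads   = 10;    -- several applier threads available
   SET GLOBAL binlog_commit_wait_count = 10;    -- try to group up to 10 transactions
   SET GLOBAL binlog_commit_wait_usec  = 5000;  -- but never delay a commit by more than 5 ms
   START SLAVE;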

Let’s follow two transactions (T1 and T2) on Y and Z. On Y, the transactions are executed sequentially as shown below (B stands for begin and C for commit).

   ------Time----->
T1:  B----C
T2:        B----C

Once those are in the binary logs of Y, Z will execute them. As Y is single-threaded, Z cannot run the transactions in parallel (T1 and T2 are each in their own commit group in the binary logs of Y). However, as Z is delaying commit (binlog_commit_wait_count > 1 and binlog_commit_wait_usec > 0), T2 will start executing in another thread once T1 is ready to commit (slave_parallel_threads > 1). If T2 completes before the delay expires, T1 and T2 will group commit as shown below.

   -----Time----->
T1:  B---- . . C
T2:       B----C

Group committing is not limited to two transactions; it can extend to the transactions that follow and can result in much larger groups, as shown below.

   ----------Time---------->
T1:  B-- . . . . . . . . C
T2:     B----- . . . . . C
T3:           B----- . . C
T4:                 B----C

Transaction grouping will stop in the following three situations:

  1. The group size reaches binlog_commit_wait_count: grouping stops and all the transactions in the group commit together.
  2. binlog_commit_wait_usec expires: grouping stops, and T1 together with all the following completed transactions commits together.
  3. While the commit of T1 is delayed, a subsequent transaction is blocked by a lock held by a previous transaction whose commit is being delayed: a transaction dependency has been found, and the group commit pipeline stalls until binlog_commit_wait_usec expires (MariaDB 10.0.16).

Situation 3 above must be understood clearly. Each time a transaction dependency is found, the slave stops applying transactions until binlog_commit_wait_usec expires. This means that if transaction dependencies are frequent and binlog_commit_wait_usec is too big, the slave throughput will suffer. This could result in slave lag as shown below (T3 depends on T1 or T2).

   ----------------Time----------->
T1:  B-- . . . . . . . . C
T2:     B----- . . . . . C
T3:                       B-----C
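
As an illustration, such a dependency appears as soon as two transactions modify the same row; the table and values below are invented for the example:

   -- Executed on the master and later replayed by Z from the binary logs of Y.
   -- T1:
   BEGIN;
   UPDATE accounts SET balance = balance - 10 WHERE id = 42;
   COMMIT;
   -- T3 touches the same row as T1:
   BEGIN;
   UPDATE accounts SET balance = balance + 10 WHERE id = 42;
   COMMIT;

While the commit of T1 is being delayed on Z to let a group form, T3 starts applying, hits the row lock still held by T1 and blocks; the group stops growing and everything waits for binlog_commit_wait_usec to expire before the group is committed.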

This could be optimized in one of the next versions of MariaDB by detecting the dependency and triggering the commit immediately instead of waiting [1].

(Note: in MariaDB 10.1, slave group commit can be enabled with the minimal slave_parallel_mode.)

Back to group committing on slaves, it allows the following:

  1. Identifying transaction parallelism on a slave.
  2. Speeding up the slave by reducing the number of disk syncs needed to write the binary logs when strong durability constraints are set (sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1).

With reference to 1 above, this way of identifying parallelism will be the starting point of the next post in this series. Part 3 of the series will present speedups of parallel applying on slaves using Booking.com production workload. Moreover, you now know how the graphs presented in part 1 of the series were obtained.

With reference to 2 above, this could be a solution to a problem we are facing at Booking.com. To understand this problem, let’s look at a Booking.com typical MySQL deployment:

-----
| M |
-----
  |
  +------+- ... -+---------------+
  |      |       |               |
-----  -----   -----           -----
| S1|  | S2|   | Sn|           | M1|
-----  -----   -----           -----
                                 |
                                 +- ... -+
                                 |       |
                               -----   -----
                               | T1|   | Tm|
                               -----   -----
  • M is the standard master
  • Si are the slaves of this master in the same datacenter
  • M1 is an intermediate master on a remote datacenter
  • Tj are the slaves of this intermediate master in the remote datacenter

To be able to survive the hardware failure of M1, the database is hosted on shared disks. Such shared disks could be implemented using DRBD or using filer-based storage accessed over fiber channel (SAN) or iSCSI (LAN). To be able to recover the database (and have the Tj slaves recover replication), high durability constraints need to be set on M1 (sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1).

However, high durability constraints on a single-threaded slave mean that, after each transaction, a sync to the binary logs and a sync to the REDO logs are required. When a sync takes 1 millisecond (network round-trip), we cannot run more than 500 (or 1000 [2]) transactions per second.
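
To make the arithmetic behind those numbers explicit (ignoring the time needed to actually apply the transactions):

   2 syncs per transaction x 1 ms per sync = 2 ms per transaction -> at most ~500 transactions/s
   1 sync per transaction  x 1 ms per sync = 1 ms per transaction -> at most ~1000 transactions/s

With group commit, the syncs are paid once per group instead of once per transaction, so a group of N transactions raises this ceiling by roughly a factor of N; this is what the test below tries to exploit.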

The other leaf slaves (Si and Tj) do not have this limitation as their databases are hosted on the local disks with a battery backed-up write cache on the RAID controller (syncs are in fact just writes in the controller memory, so they are less expensive). Moreover, those slaves do not need binary logs and we can relax durability on them (innodb_flush_log_at_trx_commit = 0 or 2).

So M1 is an embarrassing bottleneck in the replication, and bursts of transactions on M occasionally cause replication delay in the remote datacenter.

Slave group commit could be a solution to alleviate this bottleneck, but would this work in real life? To validate that, we did the following test:

-----     -----     -----
| A | --> | B | --> | C |
-----     -----     -----
  • A is our true production master running MySQL 5.6
  • B is an intermediate master running MariaDB 10.0.16 without parallel replication enabled (slave_parallel_threads = 0)
  • C is a MariaDB 10.0.16 slave with binary log and log-slave-updates enabled, and with high durability enforced (sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1)

The database on C is hosted on local disks with a battery backed-up write cache on the RAID controller. We ran the C slave with the following configuration:

  • Until 17:05 in the graphs below, C is running in single-threaded mode (slave_parallel_threads = 0)
  • At 17:05, slave group committing is enabled (slave_parallel_threads = 20, binlog_commit_wait_count = 15 and binlog_commit_wait_usec = 40000, i.e. 40 milliseconds)
  • At 17:10, the write cache of the RAID controller is disabled (the slave is still group committing)
  • At 17:15, group committing is disabled (slave_parallel_threads = 0, binlog_commit_wait_count = 0 and binlog_commit_wait_usec = 0)
  • Finally at 17:17, the write cache is re-enabled

During those tests, the BINLOG_COMMITS and BINLOG_GROUP_COMMITS global statuses are gathered regularly; a short sketch of how they can be read is shown below, followed by the resulting graphs.
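
(A minimal example of reading these counters from a MariaDB client; they are cumulative since server start, so the rates plotted in the graphs are typically obtained from the difference between consecutive samples.)

   SHOW GLOBAL STATUS LIKE 'Binlog_commits';
   SHOW GLOBAL STATUS LIKE 'Binlog_group_commits';

   -- Average number of transactions per group commit since server start.
   SELECT (SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS
            WHERE VARIABLE_NAME = 'BINLOG_COMMITS')
        / (SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS
            WHERE VARIABLE_NAME = 'BINLOG_GROUP_COMMITS')
          AS transactions_per_group_commit;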

[Graphs E1 to E4: commits and group commits over time]

As we can see on the graphs above, in single-threaded mode and with the write cache enabled, the slaves are able to keep up with the flow of transactions from the master. When group commit is enabled (at 17:05), the slaves are still able to keep up (binlog_commit_wait_usec does not cause a delay). We can see small behaviour changes after 17:10, when the write cache is disabled, but three out of four slaves are still able to keep up with the flow of transactions (the third graph shows some slowdown because lots of dirty page writes are happening at the same time, but no major delays occur). When the slaves are put back in single-threaded mode and stop group committing (at 17:15), their transaction processing capabilities are greatly impaired and they are no longer able to keep up (they commit much less than the master).

This shows that slave group commit is a good way of increasing throughput on slaves that have expensive disk sync (no write cache or SAN/LAN-attached storage) when binary logs, log-slave-updates and high durability settings are needed.

You might have noticed that the B intermediate master does not play a very important role in these tests. It is still needed because the parallel replication code path in MariaDB 10.0.16 is only used when the master is also running MariaDB 10.0. This restriction could be lifted in a future version of MariaDB, and then B could be removed.

This concludes this post on slave group commit. The next post will present speedup results of MariaDB 10.0 slaves running parallel replication on true production workload (with parallelism identified by a group committing intermediate master).

One last thing: if you want to know more about the cool things we do with MySQL at Booking.com and if you are at Percona Live Santa Clara in April, we are giving three talks.

You can also come to meet us at booth 315. We are looking for smart people to work with us on MySQL and help us to solve hard problems. See our careers page for more details.

 

[1] After completing this post, it was brought to our attention that the not yet released MariaDB versions 10.0.18 and 10.1.4 include MDEV-7249, which implements skipping the delay when a transaction dependency is found.

[2] 500 transactions per second if two syncs are needed per transaction, 1000 if only one is needed; the exact details are not obvious, so if you know more about that, please post a comment below.

 

Kristian Nielsen from MariaDB comments:

Thanks for doing this great writeup, happy to see all your work on this.

> [2] 500 transactions if two syncs are needed per transaction, 1000 transactions if only one is needed: the exact details are not obvious

Yes, two syncs are needed per (group) commit to ensure crash safety.

First, after the InnoDB prepare step, and before writing to the binlog, a sync of the InnoDB transaction log is needed. Without this, after a crash we might find a transaction written in the binlog, but be unable to recover that transaction inside InnoDB because the prepare step did not reach disk.

Second, after writing the binlog but before the commit step in InnoDB, a sync of the binlog is needed. Once the InnoDB commit step happens, the transaction can no longer be rolled back. If we crash after the commit hits disk, but before the binlog write does, we would have a transaction inside InnoDB that is missing in the binlog.

So without either of these two syncs, a crash might leave us with inconsistency between the binlog and InnoDB.

(It used to be that a third sync was done, but the need for that was removed in MariaDB 10.0 and MySQL 5.6).

This is a repost from the booking.com blog Evaluating MySQL Parallel Replication Part 2: Slave Group Commit.