Parallel Replication

MariaDB 10.0 can execute some queries in parallel on the slave. This article will explain how it works and how you can tune it.

Note that as of October 2013, parallel replication exists only in a development tree on Launchpad. It will be merged into 10.0-base and 10.0 shortly.

How to enable parallel slave

To enable parallel replication, specify slave-parallel-threads=# in your my.cnf file or as an argument to mysqld, where # is greater than 1.

The value (#) specifies how many threads will be used to run queries in parallel for *all* your slaves (this includes multi-source replication).

If slave-parallel-threads is 0, then each slave connection has a single thread of its own to execute queries.

slave_parallel_threads is a dynamic variable that can be changed without restarting mysqld. However, all slave connections must be stopped when changing the value.
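
As a minimal sketch of changing the value at runtime, assuming a server that includes parallel replication and an account with the SUPER privilege. The value 4 is an arbitrary illustration, not a recommendation:

```sql
-- Stop every replication connection first; the value cannot be
-- changed while any slave connection is running.
STOP ALL SLAVES;

-- Use a pool of 4 worker threads (illustrative value).
SET GLOBAL slave_parallel_threads = 4;

-- Resume replication and confirm the new setting.
START ALL SLAVES;
SELECT @@GLOBAL.slave_parallel_threads;
```

Note that START ALL SLAVES also starts connections that were already stopped before this sequence, so on a multi-source setup you may prefer to start the connections individually.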

What can be run in parallel

Currently the following things can be run in parallel on the slave:

  • Queries that are run on the master in one group commit.
  • Queries that are from different domains.
  • Queries from different masters (when using multi-source replication).
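
To illustrate the second case: with MariaDB 10.0 GTID, a session on the master can assign its transactions to a replication domain through the gtid_domain_id variable. The sketch below assumes GTID replication is in use and that the two statements touch unrelated data; the table and column names are hypothetical.

```sql
-- Session A on the master: a long-running change, tagged as domain 1.
SET SESSION gtid_domain_id = 1;
ALTER TABLE archive_2012 ADD INDEX idx_created (created_at);  -- hypothetical table

-- Session B on the master: ordinary traffic, tagged as domain 2.
SET SESSION gtid_domain_id = 2;
INSERT INTO orders (customer_id, total) VALUES (42, 19.99);   -- hypothetical table
```

Transactions in different domains are not ordered relative to each other on the slave, so this is only appropriate when the application knows that they are independent.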

Expected performance gain

Assuming that the slave has as many threads available for executing queries as the master, the slave should be almost as fast as the master. We have measured up to fourfold increases in speed when testing parallel replication.

Configuration variable --slave-parallel-max-queued

The variable @@slave_parallel_max_queued is only meaningful when parallel replication is used (@@slave_parallel_threads>0).

When parallel replication is used, the SQL threads will read ahead in the relay logs, queueing events in memory while looking for opportunities for executing events in parallel. The @@slave_parallel_max_queued variable sets a limit for how much memory the SQL threads will use for read-ahead in the relay logs looking for such opportunities.

Note that @@slave_parallel_max_queued is not a hard limit, since the binlog events that are currently executing always need to be held in-memory.

@@slave_parallel_max_queued is mainly needed when using GTID with different replication domain ids. If the binary log contains transactions in domain 1 followed by transactions in domain 2, then parallel replication can execute the domain 2 transactions in parallel with the domain 1 transactions. However, this requires that the SQL thread can read ahead past the domain 1 transactions while they are executing, so that it can queue the domain 2 transactions for parallel execution. Thus, the total size of the domain 1 transactions must be less than @@slave_parallel_max_queued, or parallel execution will not be possible. On the other hand, if @@slave_parallel_max_queued is set too large on a slave that is far behind the master, the SQL thread could queue up an excessive number of events in memory while vainly looking for opportunities for parallelism, which could lead to excessive memory consumption.

For parallel replication of transactions that group-committed together on the master, at most one transaction (or two, when moving to the next group commit) can be queued for one worker at a time. In this case, it is sufficient that @@slave_parallel_max_queued is larger than the event size of two normal transactions.
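
A minimal sketch of inspecting and raising the read-ahead limit. It assumes that the variable can be set at runtime with SET GLOBAL and that its value is a size in bytes; the figure used is illustrative only and should be chosen from the size of the transactions you expect in one domain.

```sql
-- Inspect the current parallel replication settings.
SHOW GLOBAL VARIABLES LIKE 'slave_parallel%';

-- Allow roughly 1 MiB of read-ahead (illustrative value, assumed to be bytes).
SET GLOBAL slave_parallel_max_queued = 1048576;
```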

Open issues/TODO

Parallel replication is still in the alpha/testing phase. It is planned to be stable when MariaDB 10.0 is stable.

When the code is in the 10.0 main tree, the expectation is that parallel slave should work, except when reading or executing an event fails, or when the slave runs out of memory for cached events. We are working on fixing these cases in good time before the 10.0 gamma release.

The following open issues are being worked on and should be resolved before MariaDB 10.0 is declared stable:

  • Error handling. If one of multiple parallel executions fails, we need to make a best effort to complete prior transactions and roll back following transactions, so that the slave binlog position will be correct. This also covers all the retry logic for temporary errors such as deadlocks.
  • Stopping the slave needs to handle stopping all parallel executions. And the logic in sql_slave_killed() that waits for current event group to complete needs to be extended appropriately...
  • Audit the use of Relay_log_info::data_lock. Make sure it is held correctly in all needed places also when using parallel replication.
  • We need some user-configurable limit on how far ahead the SQL thread will fetch and queue events for parallel execution (otherwise if the slave gets behind we will fill up memory with pending malloc()'ed events).
  • Fix update of relay-log.info and master.info. In non-GTID replication, they must be serialised to preserve correctness. In GTID replication, we should not update them at all except at slave thread stop.
  • All the waits (e.g. in struct wait_for_commit and in rpl_parallel_thread_pool::get_thread()) need to be killable. On kill, everything needs to be correctly rolled back and stopped in all threads, to ensure a consistent slave replication state.
  • Handle the case of a partial event group. This occurs when the master crashes in the middle of writing the event group to the binlog. The slave rolls back the transaction; parallel execution needs to be able to deal with this wrt. commit_orderer and such.
  • We should notice if the master doesn't support GTID, and then run in single threaded mode against that master. This is needed to be able to support multi-master-replication with old and new masters.
  • Retry of failed transactions is not yet implemented for the parallel case.

Where can one find details about the implementation

The implementation is described in MDEV-4506.
