Binlog Commit Optimization for Large Transactions

This is a guest post by Libing Song from Alibaba Cloud.

This blog post introduces a new feature which speeds up the commit of large transactions and makes the server more stable. It is useful for replication setups that need to support large transactions. The feature was first introduced in the MariaDB Server 11.7 release series. It is a contribution from the AliSQL Team.

Background

This work was motivated by an unexpected slowness of COMMIT queries on small transactions, which should have been completed quickly. The underlying cause was discovered to be a large transaction from another thread that started committing just before the small transaction, which forced the small transaction to wait for its completion, due to internal binlog ordering constraints. Figure 1 shows an example.

sysbench oltp_write with large (512 MB) transactions in background

Figure 1

When running sysbench oltp_write_only with a large UPDATE executed in the background every 5 seconds, transactions per second (TPS) drop sharply every 5 seconds, exactly when the background UPDATEs are committed, because copying their binlog caches blocks other threads.

In order to understand the phenomenon and explain the mechanics of our solution, let's take a look inside the binlogging phase of the transaction commit.

The binary log (binlog) contains “events” that describe database changes, both to data (DML) and to structure (DDL). It is used for incremental backup and replication. It consists of a set of binary log files and, since 11.4, an optional GTID index (MDEV-4991).

In binlog files, binlog events are organized in units of transactions: all events of a transaction must be written into the binlog file together. The binlog events generated during the execution phase of a transaction are temporarily stored in a cache (the binlog cache). The binlog cache has a memory buffer and a temporary file. Generated binlog events are first stored in the memory buffer, and are flushed to the temporary file if the buffer becomes full. When the transaction commits, these binlog events are copied from its cache to the binlog file and persisted to storage together.
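The buffer-then-spill behavior described above can be sketched in Python. This is a simplified illustration, not the server's implementation; the class name and the buffer size are arbitrary stand-ins (the real limit is governed by the binlog_cache_size variable):

```python
import tempfile

class BinlogCache:
    """Illustrative sketch of the binlog cache: events first go to an
    in-memory buffer and spill to a temporary file once the buffer
    exceeds its limit."""

    def __init__(self, buffer_size=32 * 1024):
        self.buffer = bytearray()
        self.buffer_size = buffer_size
        self.tmp_file = None  # created lazily on the first spill

    def write_event(self, event: bytes):
        self.buffer += event
        if len(self.buffer) > self.buffer_size:
            if self.tmp_file is None:
                self.tmp_file = tempfile.TemporaryFile()
            # flush the full memory buffer to the temporary file
            self.tmp_file.write(bytes(self.buffer))
            self.buffer.clear()
```

At commit time, the server copies whatever ended up in the buffer and the temporary file into the binlog file, which is exactly the step this feature optimizes.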


Figure 2

The commit phase is sequential, so only one transaction can copy its binlog events to the binlog file at a time. If a large transaction (e.g. one that generated 1GB of binlog events) is committing, it takes a fairly long time to copy and persist those events. Other transactions are blocked in the commit phase and take more time to finish, even though they are small and generally fast. Figure 2 depicts this, where trx_n is a large transaction that blocks trx_m, a small transaction. This situation should be suspected if DML or COMMIT statements appear in the slow log unexpectedly.

MariaDB [(none)]> select query_time, sql_text from mysql.slow_log;
+-----------------+---------------------------------+
| query_time | sql_text |
+-----------------+---------------------------------+
| 00:00:04.010262 | INSERT INTO t1 SELECT * FROM t2 |
| 00:00:02.004051 | COMMIT |
| 00:00:03.008076 | INSERT INTO t1 VALUES(2) |
+-----------------+---------------------------------+

or they stay in the Commit state for a long time.

MariaDB [(none)]> select ID, TIME, STATE, INFO from information_schema.processlist;
+----+------+--------+--------------------------+
| ID | TIME | STATE | INFO |
+----+------+--------+--------------------------+
| 7 | 2 | Commit | COMMIT |
| 6 | 3 | Commit | INSERT INTO t1 VALUES(2) |
+----+------+--------+--------------------------+

MDEV-32014 implemented a feature to minimize the time spent in the commit phase for large transactions. The solution and its usage are explained in the following sections.

Solution

Because transactions store their generated binlog events in the temporary file of the binlog cache, the solution is pretty straightforward: rename the temporary file of the binlog cache to a binlog file, with an extra rotation. Since copying is avoided, the commit phase is much faster for large transactions. Though concurrent transactions are still blocked during commit, the time they are blocked is much shorter. Figure 3 extends the scenario depicted in Figure 2 with this solution; it shows that the commit time of trx_n is greatly reduced, allowing trx_m to complete faster.


Figure 3

With this solution, the server does the following at commit instead of copying:

  • persist the temporary file of the binlog cache before going to the commit phase.
  • rotate the binlog file.
  • copy the “magic” and header events from the new binlog file to the binlog cache file.
  • delete the new binlog file.
  • rename the binlog cache file to the new binlog file.
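As a rough illustration, the steps above map to ordinary file operations. The following Python sketch makes simplified assumptions: the function and file names are ours, the header bytes are a stand-in for the real header events, and the cache file is assumed to already contain reserved space at the start (explained below):

```python
import os

BINLOG_MAGIC = b"\xfebin"  # the 4-byte magic at the start of every binlog file

def commit_by_rename(cache_path, next_binlog_path, header_events):
    """Sketch of the rename-based commit steps listed above (simplified;
    not the server's actual code)."""
    # 1. The binlog cache file is assumed to be persisted (synced) already.
    # 2. Rotation creates the next binlog file with the magic and header events.
    header = BINLOG_MAGIC + header_events
    with open(next_binlog_path, "wb") as f:
        f.write(header)
    # 3. Copy the magic and header events into the cache file's reserved space.
    with open(cache_path, "r+b") as cache:
        cache.seek(0)
        cache.write(header)
    # 4. Delete the freshly rotated binlog file ...
    os.remove(next_binlog_path)
    # 5. ... and rename the cache file into its place. No data copy happens,
    # so the cost does not depend on the transaction size.
    os.rename(cache_path, next_binlog_path)
```

The key point is step 5: a rename only updates directory metadata, so it costs the same for a 1MB file and a 1GB file.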

The server also needs to generate a Gtid_log_event and an Xid_log_event for the transaction, but that is not important for understanding the solution, so the details are omitted here. You can see that there is no longer any file copy for large transactions. Rotation and renaming are fast operations, no matter how big the file is, so the whole process can finish in a few milliseconds.

#binlog_cache_files directory

To support the rename operation, the binlog cache now puts its temporary file into the #binlog_cache_files directory, which is in the same directory as the binlog files. The cache files are created as regular files instead of system temporary files. They are deleted at session exit, or at server startup after an abnormal shutdown.

Cache files are named with the prefix ‘ML_’ (the same as the original temporary files), followed by a unique identifier.

$ ls var/mysqld.1/data/#binlog_cache_files
ML_140413554102520

Reserved Space

A binlog cache file contains only the generated binlog events of the transaction; it does not include the Gtid event or the events required at the beginning of a binlog file (e.g. Format_description_log_event, Gtid_list_log_event, Binlog_checkpoint_log_event, etc.). So some space must be reserved at the beginning of the cache file for these extra events when the rename happens.

The reserved space is aligned to 4KB, so at least 4KB is reserved. 4KB is big enough for most cases. However, the Gtid_list_log_event can be very large if the binlog has many different domains or server ids. Therefore, there is a mechanism that records the size the header events take when a rotation happens, and the binlog cache file reserves that much space before starting the next transaction.
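The reservation rule can be expressed as a small helper. This is an illustrative sketch of the rule as described, not the server's exact code:

```python
def reserved_space(header_size, page=4096):
    """Round the recorded header size up to a multiple of 4KB,
    reserving at least 4KB (one page) even for an empty header."""
    pages = max(1, -(-header_size // page))  # ceiling division, at least 1 page
    return pages * page
```

So a 100-byte header still reserves 4KB, while an oversized Gtid_list_log_event of, say, 10000 bytes bumps the reservation to 12KB for subsequent transactions.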

The header events usually take less than 4KB, so some space is left over after them. How is the leftover space handled? Thanks to the mechanism that allows a Gtid_log_event to be zero-padded at the end, the leftover space is absorbed into the transaction's Gtid_log_event. After the rename, the reserved space at the beginning of the new binlog file is exactly filled by the header events and the padded Gtid_log_event.
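The padding step can be sketched as follows; the sizes and the simplified layout are illustrative assumptions, not the real event format:

```python
def pad_gtid_event(gtid_event, header_size, reserved=4096):
    """Zero-pad the Gtid_log_event so that the header events plus the
    padded event exactly fill the reserved space at the start of the file."""
    leftover = reserved - header_size - len(gtid_event)
    # if the reserved space were too small, the server would fall back
    # to the regular copy path instead of renaming
    assert leftover >= 0
    return gtid_event + b"\x00" * leftover
```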

Performance

We tested the effectiveness of this optimization using the sysbench oltp_write_only workload while separately running large UPDATE queries in the background. The results show that the feature is very effective at minimizing the latency of other committing transactions while a large transaction is binlogging, which improves the overall throughput of the system. The overall benefit depends on the frequency and size of the transactions, so please experiment and let us know the benefits you observe.

Basic test information:

  • Hardware: X86 64 cores, 256GB memory, local SSD
  • Data size: 64 tables, 1 record per table
  • MariaDB: innodb-buffer-pool-size=48G, sync-binlog=1
  • Sysbench: 64 threads

Background table operations

CREATE TABLE t_large (a INT, b LONGTEXT) ENGINE = InnoDB;

# UPDATE includes both the before and the after image, so updating
# a row with 256MB of text generates a 512MB binlog event.
INSERT INTO t_large VALUES(1, REPEAT('a', 256000000));

UPDATE t_large SET a = a + 1;

First, we ran the test with large transactions of different sizes. The result is shown in Figure 4.

  • The blue line shows the max latencies with this feature disabled.
  • The orange line shows the max latencies with this feature enabled.
Results graph: sysbench oltp_write with large transactions in background.

Figure 4

You can see that without this feature, the max latency grows in step with the transaction size, while with the feature it stays low and stable. The benefit becomes obvious once the transaction size reaches 128MB. This test ran on a local SSD, which has low latency; on cloud storage, with its higher latency, transactions smaller than 128MB should see even more improvement. You may notice that with the feature enabled the latency increases at 1024MB. That is because an extra binlog rotation happens together with the rename, since the transaction is larger than max-binlog-size (1024MB by default).

Next, we ran the test with UPDATEs that generate 512MB binlog events every 5 seconds. The result is shown in Figure 5.

  • The blue line shows the TPS with this feature disabled.
  • The orange line shows the TPS with this feature enabled.

You can see the blue line drop sharply every 5 seconds, exactly when the background UPDATEs are committed, because copying their binlog caches blocks other threads. The orange line is much more stable, since the copying is avoided.


Figure 5

How to use it

The global variable binlog_large_commit_threshold was added to control the behavior. When a transaction’s binlog cache file is larger than binlog_large_commit_threshold (128MB by default), the rename method is used instead of copying.
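The decision amounts to a simple size comparison, sketched below; the function name is ours, not the server's:

```python
def commit_method(cache_file_size, threshold=128 * 1024 * 1024):
    """Caches larger than binlog_large_commit_threshold (128MB by
    default) commit via rename; smaller ones copy as before."""
    return "rename" if cache_file_size > threshold else "copy"
```

Since binlog_large_commit_threshold is a global system variable, it can be tuned at runtime with a SET GLOBAL statement to match your workload.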

Limitations

A binlog cache cannot be renamed to a binlog file in any of the situations below; in those cases the commit falls back to the regular behavior (copying to the binlog file) depicted in Figure 2.

  • encrypt_tmp_files is enabled.
  • encrypt_binlog is enabled.
  • both transactional and non-transactional tables are updated in the same statement.
  • the reserved space is not big enough.
if binlog_checksum is changed, transactions that started before the change keep the old behavior, and their binlog caches cannot be renamed to a binlog file.
  • wsrep_on is enabled in session (Galera Replication).
optimize_thread_scheduling is disabled. (This setting exists only for debugging and you should never disable it. It is deprecated and will be removed in the future, tracked via MDEV-33756.)

Conclusion

By renaming the binlog cache file to the binlog file instead of copying its data, this feature eliminates the slow IO operations in the binlog commit phase. A large transaction takes only a few milliseconds to commit, no matter how big it is, and it no longer blocks other transactions from committing for a long time, which makes the system more stable.

Download MariaDB Community Server 11.7 to try this new feature. Tell us what you think!