Crash-safe DDL of InnoDB tables in MariaDB Server 10.6

Posted on September 8, 2021 by Marko Mäkelä

Finally, in MariaDB Server 10.6, Data Definition Language (DDL) statements are crash-safe: When using a crash-safe storage engine like InnoDB, MyRocks or Aria, and the running server is killed at any time, it is expected to recover to a consistent state. With crash-unsafe storage engines such as MyISAM, DDL operations should be crash-safe.

Before MariaDB Server 10.6, there was no crash-safety guarantee for DDL operations; the common data dictionary (implemented as tablename.frm files) could easily get out of sync with the data files of the storage engine.

The server core implementation of this is called Atomic DDL. The idea is simple: in-progress DDL operations are written to a log file, and a DDL recovery phase is introduced, which will take care of things such rolling back or re-issuing DROP TABLE or RENAME TABLE in order to make things consistent. It is worth noting that DROP DATABASE is not atomic; it is internally implemented as a sequence of DROP TABLE (which themselves are atomic).

Note: Transactional DDL is not supported in many database servers, and MariaDB is no exception. In a future major release, MDEV-4259 might allow mixing DDL and DML statements in a single transaction. A follow-up task might replace some metadata locking with a multi-versioned data dictionary cache, so that a transaction could continue to read from a table using the name that was in effect when the transaction started.

Storage Engine API Changes

The initial expectation was that only the native ALTER TABLE implementation would require some changes in storage engines that support it, so that the DDL recovery subsystem could determine whether the operation had been committed inside the storage engine.

In MariaDB Server, InnoDB and MyRocks implement some ALTER TABLE operations natively, and both were extended.

The MyRocks data dictionary format was changed in order to support ALTER TABLE. The InnoDB implementation avoided that. We allocate an InnoDB transaction identifier upfront and report it to the DDL log. On recovery, InnoDB will be asked the question, “Was the latest definition of this table committed by this transaction ID?”

Known Issues with InnoDB DDL Operations

The first efforts in MariaDB to make some InnoDB DDL operations crash-safe occurred back in 2018, when TRUNCATE TABLE was rewritten to be compatible with Mariabackup. Internally, it is implemented as a combination of RENAME, CREATE and DROP. For this to work, we had to introduce a new undo log record type for renaming tables, so that if the server was killed after the time the rename was persistently written to the redo log, but before the operation had been durably committed, we would be able to roll back the RENAME operation (renaming to the original file name in the file system).

InnoDB had included some questionable design choices from the beginning:

In the undo log header, there was a field TRX_UNDO_TABLE_ID to store one SYS_TABLES.ID. If that field is nonzero for an incomplete transaction, then the table will be dropped by executing a separate transaction on recovery. This was supposed to handle the recovery of an incomplete CREATE TABLE or DROP TABLE.
For DELETE operations on the internal SYS_INDEXES table, there was a trigger that would free the underlying B-tree. Note: this would happen in the middle of the transaction! If the transaction was not committed (for example, the server was killed), the data would be lost. The above mentioned TRX_UNDO_TABLE_ID hack worked around some related trouble.
Soon after InnoDB was integrated in MySQL back in 2001, it turned out that sometimes, ha_innobase::delete_table() would be invoked while the table is still in use or locked by another transaction. Instead of fixing such ACID violations, a “background DROP TABLE queue” was introduced, so that such conflicting tables would be dropped later. Only starting with MariaDB Server 10.2.19, this queue was crash-safe; previously the “orphaned” tables could remain forever if the server had been killed.
Because only one TRX_UNDO_TABLE_ID could be stored per transaction, some DDL operations were split into multiple transactions.
When persistent statistics for InnoDB tables were introduced in MySQL 5.6 (and inherited to MariaDB Server 10.0), those tables were modified in multiple separate transactions during a DDL operation. Thus, the statistics could get out of sync with the table definition.
If the server was killed during the execution of ALTER TABLE, some intermediate #sql files could be left behind. After the recovery of RENAME was improved in MariaDB Server 10.2.19, most such files were removed on InnoDB startup.

Lessons to be Learned from DML Transactions

For DML transactions, the combination of InnoDB undo and redo logging always worked. The undo log facilitates ROLLBACK and multi-version concurrency control (MVCC).

Let us consider the DELETE statement. For each affected row, it will write an undo log record that identifies the row, and then update a delete-mark flag and make the system column DB_ROLL_PTR point to the undo log record. Even after the COMMIT of the DELETE, we must preserve the delete-marked records and the undo log if any transaction read view exists that had been created before the DELETE was committed. Once the last such read view has been closed, the history may be purged (both the index records and undo log records can be deleted).

Likewise, an INSERT statement will first write an undo log record that identifies the row, and then insert the record into each index tree. On ROLLBACK (possibly as part of crash recovery), we would delete the record from each index tree. On COMMIT, the records will remain.

The redo log is a write-ahead log: undo log or index pages may only be written after redo log has been written. Any change that has been durably written to the redo log will be replayed on recovery. Any transactions that are recovered in an incomplete state will be rolled back.

How We Made InnoDB DDL Atomic

Deleting files is analogous to deleting index records: Once they are deleted, there is no easy way to go back. So, DROP TABLE must not delete files before the transaction is committed. On the other hand, there is no problem to delete a file while rolling back CREATE TABLE.

In the end, we were able to address all the above mentioned limitations, without introducing any file format changes whatsoever. Any destruction of data would happen only during rollback, or after commit, or as part of the purge of history. If the server were killed between commit and the deletion of a file or freeing an index tree, the purge of history for that DDL transaction would delete the file or free the pages. The trigger that would drop index trees when a record was deleted from SYS_INDEXES was moved to rollback and purge. If the metadata for the clustered index record is deleted, we would delete the file.

We can now allow any number of tables to be created or deleted in a dictionary transaction. The TRX_UNDO_TABLE_ID field will always be written as 0.
There no longer is a drop table queue. We will still rename to-be-dropped tables to #sql-ib*.ibd files, but such table names never exist in committed versions of the data dictionary. For example, TRUNCATE TABLE t will within a single InnoDB transaction rename and drop the old copy of the table and create a new table. If the transaction is rolled back, the new file will be deleted (if it had been created) and the old file renamed back.
For DDL operations that involve FULLTEXT INDEX, all internal FTS_ tables will be created or dropped as part of the single DDL transaction.
Persistent statistics will be deleted or renamed as part of the single DDL transaction.

Faster and Safe `*.ibd` File Creation

We used to synchronously write a dummy page image to the start of a newly created *.ibd file so that recovery could determine the tablespace identifier.

That special handling was removed, and instead recovery was made smarter. The tablespace header page will eventually be written by the page cleaner thread, just like any other page. Depending on when the server was killed, recovery may encounter an empty file or a file where the first page is filled with zero bytes. The inability to immediately determine the tablespace identifier will no longer be treated as an error.

Recovery will keep track of files where it cannot find a valid page at the start to determine the tablespace identifier. Once all redo log records have been parsed, we must know the tablespace ID of each file, or we will abort unless innodb_force_recovery was set.

Before we would create a file (or create an index tree in the InnoDB system tablespace), we would have written an undo log record for inserting the clustered index metadata. The FILE_CREATE record and the data of the first page will be written in the same atomic mini-transaction. Only after everything has been durably written to the redo log, we will start creating and allocating the file. If the DDL transaction was not recovered in the committed state, recovery would roll back the transaction and delete the file if it had been created.

Cleaner Locking for DDL Operations and Statistics Tables

InnoDB followed a peculiar approach to prevent deadlocks in operations that involve its internal SQL parser, which is used for updating InnoDB data dictionary tables and persistent statistics tables. Any operation that could access the dictionary tables would acquire one or both of dict_sys_mutex or dict_operation_lock, as they are known in PERFORMANCE_SCHEMA. It should be impossible to wait for a table or record lock while holding those latches. However, because the system tables innodb_table_stats and innodb_index_stats are also accessible from SQL, such waits are in fact possible for those tables. To avoid potential hangs, MariaDB Server 10.6.2 would ‘fake’ an instant lock wait timeout. In MariaDB Server 10.6.5, MDEV-25919 fixed this by changing the logic of both DDL operations and the background tasks that update InnoDB statistics tables to follow a simple pattern:

The SQL layer acquires a metadata lock (MDL) on the affected table (say, exclusive MDL on DROP TABLE).
On DDL operations that change data, InnoDB will acquire an exclusive table lock.
If persistent statistics are going to be modified, shared MDL and exclusive InnoDB table locks on the statistics tables will be acquired.
On DDL operations, exclusive table locks on the affected InnoDB system tables will be acquired. This step is immune to timeouts or interrupts (including the KILL statement).
Exclusive dict_sys.latch will be acquired for invoking the internal SQL parser.

If any step before the final one times out or fails, nothing will have been changed yet, and the operation is trivial to abort or roll back.

Performance Surprises

As a preparation for the MDEV-25919 fix, MDEV-24258 in MariaDB Server 10.6.5 removed dict_sys_mutex. Table lookup only requires a shared dict_operation_lock unless the table definition needs to be loaded into the cache. In our tests, the result was mixed: both improved and degraded performance. It turned out that the mutex had acted as a throttle that reduced lower-level conflicts between tasks that purge the history of committed InnoDB transactions. We are working on making the purge more efficient or adaptive.

A car analogy would be that an attempt to optimize the traffic flow along a major road that leads to a city can lead to overall increased congestion, and making some low-priority traffic avoid rush hours would improve throughput.

Summary

With the MariaDB Server 10.6 release, inconsistencies due to the server being killed during a DDL operation should be a thing of the past. The only notable file format change compared to 10.5 is the introduction of the DDL recovery log file in the server core. While we do not test downgrading between major versions nor guarantee it to work, the ddl_recovery.log file should not cause trouble for downgrading (as long as it is logically empty).

In the InnoDB storage engine, almost all DDL operations will be executed in a single atomic transaction. Exceptions include MDEV-25920 DROP DATABASE and MDEV-21602 CREATE…SELECT.

Comments (2)

Marko Mäkelä

2021-11-01 11:27:57

Temporary tables should be completely unaffected by these changes. Unfortunately, temporary table names have to be registered in dict_sys because DROP TEMPORARY TABLE would refer to the tables by name rather than by handle. https://jira.mariadb.org/browse/MDEV-17805 has been filed to address that. In the upcoming MariaDB 10.6.5 release, dict_sys.mutex has been merged with dict_sys.latch (MDEV-24258). Thanks to that, multiple cached table definitions can be looked up in parallel. Modifying the cache (creating, dropping, loading, or evicting table definitions) will require exclusive access.

Ross Mohan

2021-10-26 12:12:47

How much benefit here in 10.6 for "lightweight" DDL such as CREATE/DROP TEMPORARY TABLE? For applications 'frozen' (for one reason or another) in earlier versions (say 10.4.x) would the intensive use of dict_sys and other mutex/MDL waits (say, in nested loops around create/drop of temp tables) be enough to merit converting such temp work to JSON functions?