Isolation level violation testing and debugging in MariaDB

spacer

Isolation is one of key transaction properties in database systems (the I in ACID), and its violation is a serious bug. This blog post is about the approach used by the InnoDB development team to find and debug such violations, and about new options to satisfy Snapshot isolation level requirements.

Transaction isolation levels

Isolation levels are some classification of rules for how concurrent transactions can affect each other’s results. Peter Alvaro & Kyle Kingsbury blog post contains detailed explanation of isolation levels, described in SQL-92 standard, as well as their criticism. Isolation levels defined in SQL-92 standard are criticized for their ambiguity and incompleteness. The academic work Generalized Isolation Level Definitions, which was published in 2000, around the same time with the InnoDB storage engine, gives a formal definition. The definitions are based on dependency graphs between transactions, and each isolation level allows or disallows cycles of certain types of dependencies. A formal, unambiguous specification facilitates automated checking whether an implementation conforms to the specification. The Jepsen testing library does this for transaction isolation levels in database management systems.

InnoDB isolation levels

The analysis from the above mentioned blog post showed that Read Uncommitted, Read Committed, and Serializable isolation levels of InnoDB satisfy PL-1 read uncommitted, PL-2 Read Committed, and PL-3 Serializable, respectively, according to Generalized Isolation Level Definitions.

But the default isolation level Repeatable Read does not satisfy PL-2.99 Repeatable Read, that is, by default, InnoDB is violating the ACID properties, specifically, the Consistency or Isolation of transactions. It exhibits G2-item anomalies: a cycle of write-write, write-read, and read-write dependency edges, where those edges do not involve predicates. It also exhibits G-single: cycles composed of write-write, write-read, and read-write edges, but where read-write edges are never adjacent to one another. The issue is that InnoDB does not detect write/write conflicts(see Section 4.4.3, Definition 6 in Generalized Isolation Level Definitions).

If InnoDB’s Repeatable Read is not a Repeatable Read from a generalized isolation level definition’s perspective, then which isolation level does it correspond to? If G-single was forbidden, it would correspond to Snapshot isolation level (see Section 4.2 of A Critique of ANSI SQL Isolation Levels).

Solution: New Snapshot isolation level in InnoDB

Locking reads inside InnoDB read the latest committed version, ignoring what should actually be visible to the transaction. A new option innodb_snapshot_isolation was introduced in MariaDB 10.6.18 and is enabled by default starting with MariaDB Server 11.6.2. If the option is on and an attempt to acquire a lock on a record that does not exist in the current read view is made, an error ER_CHECKREAD will be raised. This error will be treated in the same way as a deadlock: the transaction will be rolled back. Thus InnoDB’s “Repeatable read” isolation level with innodb_snapshot_isolation turned on satisfies “Snapshot” isolation level for REPEATABLE READ isolation level. For READ COMMITTED the option turns off semi-consistent read for UPDATE statements to find rows matched to the condition in the WHERE clause, and use locking read instead. For READ UNCOMMITTED it also turns off semi-consistent read and allows to read uncommitted versions of records.

InnoDB isolation levels testing

The blog post is not only interesting from InnoDB isolation levels analysis perspective. For isolation level violation testing, a test suite was developed. It is based on the Jepsen testing library. Originally, the test suite was developed to test distributed systems, and it’s supposed such systems have several nodes launched either in some containers or on separate machines. There is also a control node, which installs all necessary stuff to the other nodes, issues queries, gathers results, wipes out the nodes and analyzes the results.

Such configuration is redundant for InnoDB testing. That’s why the MariaDB InnoDB team changed the original test suite to run both MariaDB Server and testing environment in one docker container. The docker container accepts MariaDB Server branch name as an argument, builds it and launches testing.

Also it is not enough just to catch the fact of isolation level violations. It’s important to reproduce and debug it. That’s why the ability to launch MariaDB server under rr was added in the test suite. If an isolation violation is caught  during the testing, we can record it with rr, replay rr trace and debug it. It’s also important not only to replay the trace, but also to catch the beginning of isolation level violation in the debugger. For this purpose a query id was added as a comment to each query. The test suite reports contain the query ids, so we can set a conditional breakpoint in the debugger, and stop execution exactly at the moment we want to debug.

There are two ways to launch testing with the modified test suite. The first way is to launch it in a docker container which image is built with the Dockerfile. The build instructions can be found in the comments in the Dockerfile. To launch the testing we use the following command:

docker run docker_image_name mariadb_branch_name

The container will build MariaDB server from the branch and launch a set of tests.

The second way is to launch the testing locally on the host’s OS. Currently it’s Ubuntu 22.04. The host machine should satisfy some dependencies and have lein installed.The example of commands to start some tests can be found in the entry point of the Dockerfile. There are command line options to set MariaDB Server install, data and tmp directories. One more option is to launch MariaDB server under rr.

The test suite contains several tests. Let’s consider the tests we use for InnoDB testing.

List append

The test generates workload with write-write, write-read, and read-write dependencies between transactions. The workload consists of randomly generated transactions. Each transaction reads or appends unique integer elements to a collection of lists identified by primary key. The lists are represented with text fields of comma separated values. After the transactions are executed Elle checker looks for the cycles in the dependency graph.

To launch this testing we use the following command from the directory of the test suite:

lein run test --db maria-docker --nodes localhost --concurrency 64 --rate 1000 
--time-limit 60 --key-count 40 --no-ssh=true 
--mariadb-install-dir="/mariadb-bin" --mariadb-data-dir=”/mariadb-data” 
--innodb-snapshot-isolation=true  -w append -i repeatable-read

After the testing the results can be found in the store/latest directory. history.edn contains transactions history in the following format:

{:index 0, :time 113163705, :type :invoke, :process 0, :f :txn, :value [[:append 39 1] [:append 38 1]]}
{:index 1, :time 118261772, :type :invoke, :process 1, :f :txn, :value [[:append 32 1]]}
{:index 2, :time 118818033, :type :invoke, :process 2, :f :txn, :value [[:append 39 2] [:append 36 1] [:append 39 3]]}
{:index 3, :time 119093948, :type :invoke, :process 3, :f :txn, :value [[:r 39 nil] [:r 39 nil]]}
{:index 4, :time 119304328, :type :invoke, :process 4, :f :txn, :value [[:append 39 4]]}
{:index 5, :time 119504124, :type :invoke, :process 5, :f :txn, :value [[:r 37 nil]]}

The pair (index, time) is unique and is used to form query id in the comments of each query in the format /* index_time */. As it was mentioned above, this query id can be used to set a conditional breakpoint in debugger during replaying rr trace.

store/elle contains the results of Elle checker in the following format:

G2-item #0
Let:
T1 = {:index 2190, :time 11950176447, :type :ok, :process 193, :f :txn, :value [[:append 37 164] [:r 41 [2 4 5 6 7 11 24 28 8 44 49 50]] [:r 36 [3 2 1 10 4 13 9 15 17 14 18 19 21 26 33 35 38 39 40 42 57 68 69 76]]]}
T2 = {:index 2176, :time 11882729952, :type :ok, :process 351, :f :txn, :value [[:append 41 67] [:r 37 [1 4 8 12 11 13 14 2 9 10 15 24 23 25 28 5 6 19 20 21 22 37 39 40 42 45 30 46 48 33 51 58 65 66 67 72 74 68 79 80 86 93 94 97 87 89 103 114 96 118 122 149 152 158 162]]]}
Then:
- T1 < T2, because T1 did not observe T2's append of 67 to 41.
- However, T2 < T1, because T2 did not observe T1's append of 164 to 37: a contradiction!

It explains which exact transactions caused the G2-item anomaly. Moreover, store/latest/elle/G2-item/ directory contains the illustrations of the anomaly:
Illustration of the anomaly store/latest/elle/G2-item/ directory

If we remove --innodb-snapshot-isolation=true, we will also see G-single anomaly:

G2-item #1
Let:
T1 = …
T2 = …
T3 = …

Then:
- T1 < T2, because T2 observed T1's append of 25 to key 49.
- T2 < T3, because T2 did not observe T3's append of 57 to 49.
- However, T3 < T1, because T3 did not observe T1's append of 126 to 48: a contradiction!

The test can be launched for different isolation levels.

Non-repeatable read

For this test the following simple table of people identified by primary key is created and filled with a single row:

create table people (
  id     int not null,
  name   text not null,
  gender text not null,
  primary key (id))
);
insert into people (id, name, gender)
  values (0, 'male', 'female');

We then perform a series of write transactions which update only the row’s name. Concurrently, a second series of transactions each read the row’s name, update its gender field, and read the name again. Violations of Repeatable Read manifest as the row’s name changing between the two reads. We also perform deletions and re-insertions of row 0, in case they behave differently than plain updates.

The test can be launched with the same options as “append” test, but with a different workload name: -w nonrepeatable-read. store/current/results.end contains the following error report if --innodb-snapshot-isolation=false (by default) :

              #jepsen.history.Op{:index 76340,

                                   :time 100039118574,

                                   :type :ok,

                                   :process 25005,

                                   :f :read,

                                   :value {:id 0,

                                           :name1 "moss",

                                           :name2 "s'more"}}

We can see here that transaction 76340 reads “moss” name before updating “gender” field, and “s`more” name after it, which is indeed a violation of Repeatable Read manifest. Note, that we will not see the violation if we use --innodb-snapshot-isolation=true option.

Monotonic atomic view

Monotonic atomic view (see Section 5.1.2 of Highly Available Transactions: Virtues and Limitations)  is a consistency model, which satisfies the following conditions: once a write from transaction T1 is observed by transaction T2, then all effects of T1 should be visible to T2. Both repeatable read and snapshot isolation levels should meet the requirements of monotonic atomic view.

This workload creates a single table with two rows:

create table mav (
  id      int not null,
  val int not null,
  noop    int not null,
  primary key (id)
);
insert into mav (id, val, noop)
  values (0, 0, 0);
insert into mav (id, val, noop)
  values (1, 0, 0);

It performs a mix of write and read transactions. Each write increments the val of row 0, then increments row 1. Reads select the value of row 0, set the noop field of row 1 to a random value, then read the values of 1 and 0. Under Monotonic Atomic View, these reads should be monotonically increasing. For example, once a reader observes value 2, it should thereafter see every row’s value as 2 or higher.

To launch the test -w mav option should be used. If  --innodb-snapshot-isolation=true is missed, we can see the following error in store/current/results.end report:

               #jepsen.history.Op{:index 812,
                                   :time 3129212728,
                                   :type :ok,
                                   :process 152,
                                   :f :read,
                                   :value {:a1 28, :b2 29, :a2 28}}

I.e. the transaction 812 read 28 from val field of row 0, then updated noop field of row 1, then read 29 from val field of row 1, and 28 from val field of row 0. There are no errors if --innodb-snapshot-isolation=true is used.

Conclusion

Thanks to Alexey Gotsman and his talk we understood that InnoDB Repeatable Read isolation level does not correspond to any of generalized isolation level definitions, and the closest definition for it is Snapshot isolation level. Thanks to Peter Alvaro & Kyle Kingsbury blog post and Kyle’s and Alexey’s participation in email discussions MariaDB InnoDB development team not only developed a new option to satisfy Snapshot isolation level requirements, but also found a great tool for isolation violations testing. The tool was modified for convenient usage in docker container and for recording isolation violations with rr for further debugging. Some parts of the blog post were used in this blog post.