rows_examined UNION Handler% … ?!?

A customer recently asked if I could help clarify the relationship between rows_examined (in the slow query log) and the Handler% variables (in SHOW STATUS output) for a simple UNION. I didn’t know the answer off the top of my head, but I didn’t think it would be too terribly difficult to find it. Several days and many hours of work later, I think I finally have some idea of how this all works. Let’s first set up some sample data and get some information about it. (I’m using MariaDB 5.2.4 for these tests because that’s the version the customer was using. This should basically all apply to other versions/vendors, except specific changes I refer to in MariaDB 5.3.)

maria 5.2.4> create table t1 (id int, key(id));
Query OK, 0 rows affected (0.01 sec)

maria 5.2.4> ! seq -f '(%.0f)' -s, 1 10
(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)
maria 5.2.4> insert into t1 values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
Query OK, 10 rows affected (0.00 sec)
Records: 10  Duplicates: 0  Warnings: 0

maria 5.2.4> set global slow_query_log=1, log_output='table', long_query_time=0;
Query OK, 0 rows affected (0.03 sec)

 5.2.4> flush status;
Query OK, 0 rows affected (0.00 sec)

maria 5.2.4> select * from t1 where id=5 union all select * from t1 where id=5;
+------+
| id   |
+------+
|    5 |
|    5 |
+------+
2 rows in set (0.00 sec)

maria 5.2.4> show status like 'Handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 0     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 2     |
| Handler_read_next          | 2     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 3     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_update             | 0     |
| Handler_write              | 4     |
+----------------------------+-------+
15 rows in set (0.01 sec)

maria 5.2.4> select * from mysql.slow_log where sql_text like 'select%' order by start_time desc limit 1G
*************************** 1. row ***************************
    start_time: 2012-01-12 11:12:39
     user_host: root[root] @ localhost []
    query_time: 00:00:00
     lock_time: 00:00:00
     rows_sent: 2
 rows_examined: 7
            db: test
last_insert_id: 0
     insert_id: 0
     server_id: 0
      sql_text: select * from t1 where id=5 union all select * from t1 where id=5
1 row in set (0.01 sec)

So, we have one table (t1) with one column (id) and 10 rows. There’s a key on id. There are two different things I want to look at here: the Handler% variables and rows_examined in the slow query log. Initially, they seem to match up perfectly. Handler_read_key + Handler_read_next + Handler_read_rnd_next = 7 = rows_examined. Case closed, right? Not so fast. How many rows are actually being handled in order to work execute this query? There’s one row from each of the base tables in the UNION (well, there are two instances of the same table, but UNION doesn’t know or care about that). Those rows are written to a temporary table as part of the UNION, and then they’re read back from the temporary table and sent to the client. One row from each base table is two rows, and then two rows from the temporary table make four rows total. Why, then, is rows_examined=7? And why do the Handler_read% variables seem to support this value? There’s a lot more going on here than meets the eye. Not having much luck with the reference manual, I set off on an odyssey through the MariaDB source code to try to make sense out of all of this. If you want to follow along, go grab a copy of the MariaDB source. You can find it at http://downloads.askmonty.org/MariaDB/5.2.4/. I did lots of grepping and eventually found my way to st_select_lex_unit::exec in sql/sql_union.cc. This is the “meat” of where a UNION is executed. Right at the top of the function (line 443) you see this:

  ha_rows examined_rows= 0;

It looks like st_select_lex_unit::exec is going to be maintaining its own examined rows counter. The next mention of examined_rows is at line #515:

          examined_rows+= thd->examined_row_count;

Here, the examined_rows counter gets the current value that’s stored in the thread information structure. The next thing that happens is at line #629:

        thd->examined_row_count+= examined_rows;

This shows the thread information structure that we saw above getting examined_rows added back to it. That can’t be right… assigning that value back whence it came doesn’t make any sense. Sure enough, in MariaDB 5.3.3 this is done differently; the code we saw at line #515 above (it’s actually at line #660 in the MariaDB 5.3.3 source) looks like this, instead:

          examined_rows+= thd->examined_row_count;
          thd->examined_row_count= 0;

Instead of adding thd->examined_row_count back onto itself, thd->examined_row_count is reinitialized to 0. Later examined_rows is added back to thd->examined_row_count just as was done in MariaDB 5.2.4. That means that the value of rows_examined in the slow query log make a lot more sense in MariaDB 5.3.3 than they did in MariaDB 5.2.4:

mariadb 5.3.3> select * from t1 where id=5 union all select * from t1 where id=5;
+------+
| id   |
+------+
|    5 |
|    5 |
+------+
2 rows in set (0.07 sec)

mariadb 5.3.3> select * from mysql.slow_log order by start_time desc limit 1G
*************************** 1. row ***************************
    start_time: 2012-01-12 11:36:52.136361
     user_host: root[root] @ localhost []
    query_time: 00:00:00.067328
     lock_time: 00:00:00.024490
     rows_sent: 2
 rows_examined: 4
            db: test
last_insert_id: 0
     insert_id: 0
     server_id: 0
      sql_text: select * from t1 where id=5 union all select * from t1 where id=5
1 row in set (0.00 sec)

So, that solves the conundrum of why rows_examined didn’t match expectations for a UNION. There was a bug! But… rows_examined, even though it didn’t really make much sense, did match the Handler% variables. Those are unchanged in MariaDB 5.3.3:

mariadb 5.3.3> show status like "handler%";
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 0     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 2     |
| Handler_read_next          | 2     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 3     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 2     |
| Handler_update             | 0     |
| Handler_write              | 1     |
+----------------------------+-------+
17 rows in set (0.13 sec)

Whether we’re using MariaDB 5.2.4 or MariaDB 5.3.3, Handler_read_key, Handler_read_next, and Handler_read_rnd_next still need to be accounted for. With the assistance of some of the absolutely wonderful people at Monty Program AB, I eventually got a grasp on these. Handler_read_key means that the server is making a direct access to a specific index entry. In this case, it’s going straight to the first entry in the index for value “5” (id=5). There are two tables in the UNION, and we’re doing a very simple query against each of them. Each SELECT only matches a single row, so Handler_read_key is 1 for each table, and 2 in total. Handler_read_next means that the server is looking for the next value in an index. Even though MySQL uses a direct lookup (Handler_read_key) to find the first entry in the index for a particular value, there might be other entries for that same value, so Handler_read_next is required. In this case, there is only one row that matches id=5, so Handler_read_next returns something like EOF. Again, this happens twice in the UNION, so Handler_read_next is 2. If more rows were to match, Handler_read_key would still be 1, but Handler_read_next would increase:

mariadb 5.2.4> insert into t1 values (5),(5),(5);
Query OK, 3 rows affected (0.03 sec)
Records: 3  Duplicates: 0  Warnings: 0

mariadb 5.2.4> select * from t1 where id=5;
+------+
| id   |
+------+
|    5 |
|    5 |
|    5 |
|    5 |
+------+
4 rows in set (0.00 sec)

mariadb 5.2.4> show status like 'Handler_read%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Handler_read_first    | 0     |
| Handler_read_key      | 1     |
| Handler_read_next     | 4     |
| Handler_read_prev     | 0     |
| Handler_read_rnd      | 0     |
| Handler_read_rnd_next | 0     |
+-----------------------+-------+
6 rows in set (0.00 sec)

Handler_read_rnd_next indicates that the server is doing a table scan. This is happening for our UNION because of the temporary table. Even though there are only 2 rows in the temporary table, Handler_read_rnd_next is 3 for the same reason that Handler_read_next was set—the server must find the end of the data. The last time Handler_read_rnd_next is incremented is when the handler finds the end of the table. That wasn’t so bad, right? There’s a little extra wrinkle to look out for (of course). We’ve been doing UNION ALL, but what changes if we do UNION DISTINCT? I’ll use MariaDB 5.3.3 for this example, so that we can compare better values for rows_examined.

mariadb 5.3.3> select * from t1 where id=5 union all select * from t1 where id=5;
+------+
| id   |
+------+
|    5 |
|    5 |
+------+
2 rows in set (0.00 sec)

mariadb 5.3.3> show status like "handler%";
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 0     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 2     |
| Handler_read_next          | 2     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 3     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 2     |
| Handler_update             | 0     |
| Handler_write              | 1     |
+----------------------------+-------+
17 rows in set (0.00 sec)

mariadb 5.3.3> select * from mysql.slow_log order by start_time desc limit 1G
*************************** 1. row ***************************
    start_time: 2012-01-12 12:07:55.514920
     user_host: root[root] @ localhost []
    query_time: 00:00:00.000745
     lock_time: 00:00:00.000220
     rows_sent: 2
 rows_examined: 4
            db: test
last_insert_id: 0
     insert_id: 0
     server_id: 0
      sql_text: select * from t1 where id=5 union all select * from t1 where id=5
1 row in set (0.01 sec)

mariadb 5.3.3> flush status;
Query OK, 0 rows affected (0.00 sec)

mariadb 5.3.3> select * from t1 where id=5 union distinct select * from t1 where id=5;
+------+
| id   |
+------+
|    5 |
+------+
1 row in set (0.02 sec)

mariadb 5.3.3> show status like "handler%";
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 0     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 2     |
| Handler_read_next          | 2     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 3     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 2     |
| Handler_update             | 0     |
| Handler_write              | 1     |
+----------------------------+-------+
17 rows in set (0.00 sec)

mariadb 5.3.3> select * from mysql.slow_log order by start_time desc limit 1G
*************************** 1. row ***************************
    start_time: 2012-01-12 12:08:23.308634
     user_host: root[root] @ localhost []
    query_time: 00:00:00.023508
     lock_time: 00:00:00.000156
     rows_sent: 1
 rows_examined: 3
            db: test
last_insert_id: 0
     insert_id: 0
     server_id: 0
      sql_text: select * from t1 where id=5 union distinct select * from t1 where id=5
1 row in set (0.00 sec)

When you do UNION [DISTINCT] (DISTINCT is the default if you don’t specify either DISTINCT or ALL), the server creates the temporary table with a flag that tells it to create a unique key and thus reject duplicate rows. So, for UNION DISTINCT, why is Handler_read_rnd_next, the one that corresponds to reading from the temporary table, still equal to 3? If there is only 1 row in the table, shouldn’t this variable be equal to 2? One for the row, and one for EOF, right? It turns out that the row is actually created in the temporary table before the unique key check is performed. After the unique check fails, the row is marked as “deleted”, but it’s still there. When working through the table, Handler_read_rnd_next is incremented every time it’s called … even if it finds a deleted row or EOF. That’s an internal feature of how HEAP/MEMORY tables work. If we force a non-heap (MyISAM or Aria) temporary table, we have Handler_read_rnd_next of 2 for UNION DISTINCT, as we might have expected to begin with:

mariadb 5.3.3> alter table t1 add column t text;
Query OK, 10 rows affected (0.01 sec)              
Records: 10  Duplicates: 0  Warnings: 0

mariadb 5.3.3> flush status;
Query OK, 0 rows affected (0.00 sec)

mariadb 5.3.3> select * from t1 where id=5 union all select * from t1 where id=5;
+------+------+
| id   | t    |
+------+------+
|    5 | NULL |
|    5 | NULL |
+------+------+
2 rows in set (0.00 sec)

mariadb 5.3.3> show status like "Handler_read_rnd_next";
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Handler_read_rnd_next | 3     |
+-----------------------+-------+
1 row in set (0.00 sec)

mariadb 5.3.3> flush status;
Query OK, 0 rows affected (0.00 sec)

mariadb 5.3.3> select * from t1 where id=5 union distinct select * from t1 where id=5;
+------+------+
| id   | t    |
+------+------+
|    5 | NULL |
+------+------+
1 row in set (0.00 sec)

mariadb 5.3.3> show status like "Handler_read_rnd_next";
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Handler_read_rnd_next | 2     |
+-----------------------+-------+
1 row in set (0.00 sec)

The Handler_read% variables now taken care of, we ought to take a look at Handler_write. You may have noticed some differences in those between MariaDB 5.2.4 and MariaDB 5.3.3:

mariadb 5.2.4> flush status;
Query OK, 0 rows affected (0.00 sec)

mariadb 5.2.4> select * from t1 where id=5 union all select * from t1 where id=5;
+------+
| id   |
+------+
|    5 |
|    5 |
+------+
2 rows in set (0.07 sec)

mariadb 5.2.4> show status like 'Handler_write%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Handler_write | 4     |
+---------------+-------+
1 row in set (0.04 sec)
mariadb 5.3.3> flush status;
Query OK, 0 rows affected (0.00 sec)

mariadb 5.3.3> select * from t1 where id=5 union all select * from t1 where id=5;
+------+------+
| id   | t    |
+------+------+
|    5 | NULL |
|    5 | NULL |
+------+------+
2 rows in set (0.00 sec)

mariadb 5.3.3> show status like "Handler%write";
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Handler_tmp_write | 2     |
| Handler_write     | 1     |
+-------------------+-------+
2 rows in set (0.00 sec)

It’s easiest to look at MariaDB 5.3.3 first, since it helpfully breaks out temporary table writes for us. We know the UNION writes to the temporary table, so Handler_tmp_write of 2 is easy to explain. I was quite confused about Handler_write at first, until someone reminded me that I had the slow query log turned on with log_output=table!

mariadb 5.3.3> set global log_output='file';
Query OK, 0 rows affected (0.00 sec)

mariadb 5.3.3> select * from t1 where id=5 union all select * from t1 where id=5;
+------+------+
| id   | t    |
+------+------+
|    5 | NULL |
|    5 | NULL |
+------+------+
2 rows in set (0.01 sec)

mariadb 5.3.3> show status like "Handler%write";
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Handler_tmp_write | 2     |
| Handler_write     | 0     |
+-------------------+-------+
2 rows in set (0.01 sec)

Still, though, MariaDB 5.2.4 showed Handler_write of 4, while here we’ve only accounted for 3 (2 for writing to the temporary table, 1 for writing to the slow query log). Then I noticed/remembered/realized that SHOW STATUS itself actually increments Handler_write.

mariadb 5.2.4> show status like 'Handler_write%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Handler_write | 4     |
+---------------+-------+
1 row in set (0.04 sec)

mariadb 5.2.4> show status like 'Handler_write%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Handler_write | 5     |
+---------------+-------+
1 row in set (0.00 sec)

mariadb 5.2.4> show status like 'Handler_write%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Handler_write | 6     |
+---------------+-------+
1 row in set (0.00 sec)

mariadb 5.2.4> show status like 'Handler_write%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Handler_write | 7     |
+---------------+-------+
1 row in set (0.00 sec)

That seems to be fixed in MariaDB 5.3:

mariadb 5.3.3> show status like "Handler%write";
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Handler_tmp_write | 2     |
| Handler_write     | 0     |
+-------------------+-------+
2 rows in set (0.00 sec)

mariadb 5.3.3> show status like "Handler%write";
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Handler_tmp_write | 2     |
| Handler_write     | 0     |
+-------------------+-------+
2 rows in set (0.00 sec)

mariadb 5.3.3> show status like "Handler%write";
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Handler_tmp_write | 2     |
| Handler_write     | 0     |
+-------------------+-------+
2 rows in set (0.00 sec)

What I thought would be a simple investigation into rows_examined and the Handler% variables actually ended up being a really interesting trek through the MariaDB source code, a chance to use gdb, a bug in sql_union.cc, some unexpected ancient low-level behavior exposed via status variables, and a chance to learn about some nice new changes in MariaDB 5.3. I hope I’ve gotten my facts straight, and I hope you’ll let me know if I haven’t!