A hash-based GROUP BY strategy for MySQL

spacer

Sometimes MySQL just doesn’t choose the most efficient way to execute a query.GROUP BY is a good example. A customer recently wanted to add a unique key over some columns of a large (~50GB) table, and they first had to find duplicates. In this case, no suitable index was available that would help with the query. That means “Using temporary”. And of course MySQL’s GROUP BY also sorts by default, which means “Using file sort”. The goal here is just to find duplicates, or more specifically to find the lowest primary key value out of some group of rows with identical values for a subset of columns in the table. I think there is a much more efficient way to solve this problem. I proposed a temporary table with the primary key column and a hash of the concatenated columns to be used for the grouping. This ended up allowing the customer to finish in 10 minutes an operation that originally resulted in a 30GB temporary table and never completed. The hash-based grouping is a simple idea, but there are a few caveats in how it’s done and a few tricks for how to make it efficient. This blog post will provide a walk-through of how to put it into practice. They tried first using a query of this form:

 SELECT MIN(id) FROM tbl GROUP BY a, b, c, d, e HAVING COUNT(*)>1;

The table we were working with had more columns, but this is the structure of the part of the table we’re looking at:

CREATE TABLE `tbl` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `a` bigint(20) NOT NULL,
  `b` bigint(20) NOT NULL,
  `c` bigint(20) NOT NULL,
  `d` bigint(20) DEFAULT NULL,
  `e` varchar(255) CHARACTER SET utf8 NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB

Here’s the EXPLAIN output:

mysql> EXPLAIN SELECT MIN(id) FROM tbl GROUP BY a, b, c, d, e HAVING COUNT(*)>1;
+----+-------------+-------+------+---------------+------+---------+------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows     | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+----------+---------------------------------+
|  1 | SIMPLE      | tbl   | ALL  | NULL          | NULL | NULL    | NULL | 32950332 | Using temporary; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+----------+---------------------------------+
1 row in set (0.00 sec)

One of the columns is utf8 varchar(255) (3 bytes per character means 765 bytes must be reserved when sorting), and all the rest are bigint (8 bytes each). With about 36 million rows in the table, we were looking at grouping and sorting 36000000 * (255*3 + 5*8) = about 27GB of data. I mentioned above that MySQL by default also sorts when doing GROUP BY. If this query had included added ORDER BY NULL at the end of the query, the filesort could have been avoided. But we’d still be looking at a really large temporary table to handle the grouping.

mysql 5.5.28-MariaDB (root) [test]> EXPLAIN SELECT MIN(id) FROM tbl GROUP BY a, b, c, d, e HAVING COUNT(*)>1 ORDER BY NULL;
+------+-------------+-------+------+---------------+------+---------+------+----------+-----------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows     | Extra           |
+------+-------------+-------+------+---------------+------+---------+------+----------+-----------------+
|    1 | SIMPLE      | tbl   | ALL  | NULL          | NULL | NULL    | NULL | 36000125 | Using temporary |
+------+-------------+-------+------+---------------+------+---------+------+----------+-----------------+
1 row in set (0.01 sec)

So, how about the hash-based alternative. Here’s the whole process we’ll use:

 SET sql_log_bin=0;
 SET myisam_sort_buffer_size=1500*1024*1024;
 CREATE TEMPORARY TABLE tmp (id BIGINT UNSIGNED NOT NULL, h BINARY(16), KEY(h,id)) ENGINE=MyISAM MAX_ROWS=36000000;
 SET TRANSACTION ISOLATION LEVEL READ COMMITTED; 
 INSERT INTO tmp SELECT id, UNHEX(MD5(CONCAT_WS('~',a,b,c,d,e))) FROM tbl;
 SELECT MIN(id) FROM tmp GROUP BY h HAVING COUNT(*)>1 ORDER BY NULL;
 DROP TABLE tmp;

Why MyISAM? MyISAM will allow us to store this data in a bit more compact format than InnoDB would, and saving space is part of our goal. MyISAM will also be a bit faster for this task because there’s no locking or logging necessary. I actually found that selecting from InnoDB was faster than selecting from MyISAM for this task by more than 50%, when the key buffer and buffer pool are each large enough to hold the respective indexes; however, inserting into the InnoDB table took 5 to 8 times longer than inserting into the MyISAM table. (If you have memory to spare, you can use the MEMORY storage engine, which will give even better performance, but you will need to be sure to tell the engine to use a BTREE index. The MEMORY engine uses HASH indexes by default. The MEMORY engine uses really large pointers in its BTREE indexes, so you may need considerabley more memory than you need disk space when using MyISAM. In a test using the schema as above, the MEMORY engine required nearly twice the space to store the index as was required by MyISAM.) And why binary(16)? I chose to use MD5 for this task. MD5 is faster and produces a smaller hash than the other options, so I like it better for this purpose. Most people think of a 32-character string like “8a9c538c7f848d97d9d45736c4f709f3” when they think of MD5. The MD5 algorithm, of course, doesn’t really produce a 32-character string of letters and numbers, it produces a 128-bit value. MySQL’s MD5() function returns a human-readable hexadecimal version of that, which we can undo with the UNHEX() function to turn it back into a 16-byte binary value (16 bytes x 8 = 128 bits), which will save space. So, we just need to store the primary key value and the hash. One bigint plus binary(16) is only 24 bytes, hugely less than the 805 bytes we were working with before. Before we start creating tables and moving data around, we should increasekey_buffer_size and myisam_sort_buffer_size if there’s any memory to spare. Ideally, we want they key buffer to be big enough to hold the entire key file we’ll end up with. It’s real easy to calculate the size of that. We have 36,000,000 rows and one index that’s 24 bytes. Each index entry also includes a pointer to the offset for the row in the data file. MyISAM tables these days use a 6-byte row pointer by default (myisam_data_pointer_size). We already know we only have 36,000,000 rows, though, so we can use a smaller pointer for our table to save a bit of space. We do that later by specifying MAX_ROWS with CREATE TEMPORARY TABLE. But we can calculate on our own the number of bytes needed using a logarithm:

mysql> select ceiling(log(36000000*(8+16+1))/log(2)/8) as row_pointer_size;
+------------------+
| row_pointer_size |
+------------------+
|                4 |
+------------------+
1 row in set (0.00 sec)

So, 36,000,000 rows at 24 bytes per row plus a 4-byte pointer. And we’ll give 25% of extra space to be safe.

mysql> select 36000000 * (8+16+4) * 1.25 as new_key_buffer__size;
+----------------------+
| new_key_buffer__size |
+----------------------+
|        1260000000.00 |
+----------------------+
1 row in set (0.00 sec)

I’ll increase my key buffer size for this test, but you should make sure that your existing key buffer is not already bigger than this and/or that you are not low on memory before doing this.

 SET GLOBAL key_buffer_size=1260000000;

The calculation for determining an optimal value for myisam_sort_buffer_size is a bit different and truthfully I’m not completely sure how MySQL decides when to dump the MyISAM sort buffer to disk. That’ll be something for a future blog post. In this case, I found that setting myisam_sort_buffer_size to be equal to 25% over the index file size was not sufficient, but I got the results I wanted by setting it to 1500M.

 SET myisam_sort_buffer_size=1500*1024*1024;

If you have binary logging enabled, you may wish to set sql_log_bin=0 while creating the table and inserting into it. This will keep the temporary table from being replicated to slaves, which would be unnecessary in this case because we are just using it to resolve a query, not directly to modify any non-temporary table. This should only be relevant if you have binlog_format=STATEMENT. If you do change this, be sure to change it back after completing this operation, or better yet just close the session to set it back to the default value.

 SET sql_log_bin=0;

Then we can create our table:

 CREATE TEMPORARY TABLE tmp (id BIGINT UNSIGNED NOT NULL, h BINARY(16), KEY(h,id)) ENGINE=MyISAM MAX_ROWS=36000000;

When generating our hash, we’ll put an unlikely separator between each column during concatenation so that we won’t have “1” and “23” getting confused with “12” and “3” as in this example:

mysql> select concat(1,23), md5(concat(1,23));
+--------------+----------------------------------+
| concat(1,23) | md5(concat(1,23))                |
+--------------+----------------------------------+
| 123          | 202cb962ac59075b964b07152d234b70 |
+--------------+----------------------------------+
1 row in set (0.00 sec)

mysql> select concat(12,3), md5(concat(12,3));
+--------------+----------------------------------+
| concat(12,3) | md5(concat(12,3))                |
+--------------+----------------------------------+
| 123          | 202cb962ac59075b964b07152d234b70 |
+--------------+----------------------------------+
1 row in set (0.00 sec)

Instead, we’ll do this:

mysql> select concat_ws('~',1,23), md5(concat_ws('~',1,23));
+---------------------+----------------------------------+
| concat_ws('~',1,23) | md5(concat_ws('~',1,23))         |
+---------------------+----------------------------------+
| 1~23                | 037ef90202e1e89a23016e4b51489326 |
+---------------------+----------------------------------+
1 row in set (0.00 sec)

mysql> select concat_ws('~',12,3), md5(concat_ws('~',12,3));
+---------------------+----------------------------------+
| concat_ws('~',12,3) | md5(concat_ws('~',12,3))         |
+---------------------+----------------------------------+
| 12~3                | 4ba8224d8a784c8af2af98b4ceb034c6 |
+---------------------+----------------------------------+
1 row in set (0.00 sec)

We’ll use INSERT ... SELECT to load up our table. However, INSERT ... SELECT has an interesting property when you SELECT from an InnoDB table. From Locks Set by Different SQL Statements in InnoDB in the MySQL Reference Manual: “INSERT INTO T SELECT ... FROM S WHERE ... sets an exclusive index record without a gap lock on each row inserted into T. If the transaction isolation level is READ COMMITTED orinnodb_locks_unsafe_for_binlog is enabled, and the transaction isolation level is notSERIALIZABLE, InnoDB does the search on S as a consistent read (no locks).” We certainly do not want to set 36 million shared locks, so we’ll want to change the isolation level before doing the INSERT ... SELECT, like this:

 SET TRANSACTION ISOLATION LEVEL READ COMMITTED; 

Then we can load up our table:

 INSERT INTO tmp SELECT id, UNHEX(MD5(CONCAT_WS('~',a,b,c,d,e))) FROM tbl;

And then we can look for duplicates of the hash value in our new temporary table:

 SELECT MIN(id) FROM tmp GROUP BY h HAVING COUNT(*)>1 ORDER BY NULL;

We have a much nicer execution plan for this query than we had for the original query:

mysql> EXPLAIN SELECT MIN(id) FROM tmp GROUP BY h HAVING COUNT(*)>1 ORDER BY NULL;
+----+-------------+-------+-------+---------------+------+---------+------+----------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows     | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+----------+-------------+
|  1 | SIMPLE      | tmp   | index | h             | h    | 25      | NULL | 36000000 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+----------+-------------+
1 row in set (0.01 sec)

Note that I put ORDER BY NULL at the end. That tells MySQL that it doesn’t need to sort the results for our benefit. In this case, it’s unnecessary, because the optimizer can use the index to resolve the query. But what if it couldn’t?

mysql> ALTER TABLE tmp DROP KEY h;
Query OK, 36000000 rows affected (11.40 sec)
Records: 36000000  Duplicates: 0  Warnings: 0

mysql> EXPLAIN SELECT MIN(id) FROM tmp GROUP BY h HAVING COUNT(*)>1;
+----+-------------+-------+------+---------------+------+---------+------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows     | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+----------+---------------------------------+
|  1 | SIMPLE      | tmp   | ALL  | NULL          | NULL | NULL    | NULL | 36000000 | Using temporary; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+----------+---------------------------------+
1 row in set (0.02 sec)

mysql> EXPLAIN SELECT MIN(id) FROM tmp GROUP BY h HAVING COUNT(*)>1 ORDER BY NULL;
+----+-------------+-------+------+---------------+------+---------+------+----------+-----------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows     | Extra           |
+----+-------------+-------+------+---------------+------+---------+------+----------+-----------------+
|  1 | SIMPLE      | tmp   | ALL  | NULL          | NULL | NULL    | NULL | 36000000 | Using temporary |
+----+-------------+-------+------+---------------+------+---------+------+----------+-----------------+
1 row in set (0.00 sec)

And to show the benefit of using the covering index instead of just an index on “h”:

mysql> ALTER TABLE tmp ADD KEY (h);
Query OK, 36000000 rows affected (3 min 33.08 sec)
Records: 36000000  Duplicates: 0  Warnings: 0

mysql> EXPLAIN SELECT MIN(id) FROM tmp GROUP BY h HAVING COUNT(*)>1 ORDER BY NULL;
+----+-------------+-------+------+---------------+------+---------+------+----------+-----------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows     | Extra           |
+----+-------------+-------+------+---------------+------+---------+------+----------+-----------------+
|  1 | SIMPLE      | tmp   | ALL  | h             | NULL | NULL    | NULL | 36000000 | Using temporary |
+----+-------------+-------+------+---------------+------+---------+------+----------+-----------------+
1 row in set (0.00 sec)

mysql> ALTER TABLE tmp DROP KEY h, ADD KEY (h, id);
Query OK, 36000000 rows affected (3 min 51.46 sec)
Records: 36000000  Duplicates: 0  Warnings: 0

mysql> EXPLAIN SELECT MIN(id) FROM tmp GROUP BY h HAVING COUNT(*)>1 ORDER BY NULL;
+----+-------------+-------+-------+---------------+------+---------+------+----------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows     | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+----------+-------------+
|  1 | SIMPLE      | tmp   | index | h             | h    | 25      | NULL | 36000000 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+----------+-------------+
1 row in set (0.01 sec)

Instead of MySQL needing to build a whole separate temporary table to resolve the query, the covering index allows MySQL to just read rows in order from the index to do the grouping. The covering index of course means that we actually more than double the amount of storage necessary for the table (because a row pointer is also included in the index along with the id and hash values). If you happen to have a large enough key buffer, those indexes should already be cached after enabling keys, so it should not be necessary to even hit the disk. There is of course a very small chance of a hash collision when using this technique, so it might be prudent to verify the results after the fact to ensure that there are no false positives. If you find yourself some day wanting to look for duplicates in a large data set, you might find rolling your own hash-based temporary table to be a much better option than anything MySQL can do for you on its own at this point. I hope it helps!