Using the MySQL stack trace to isolate bugs

I came across an interesting error reported on #mysql the other day. When I went through it with the reporter it looks like we uncovered up to two bugs in InnoDB (or rather XtraDB as it was Percona Server). I thought it might be useful to go through the error message, including the stack trace, to show that you don’t need to be a developer to track down some useful information. If you look at http://pastie.org/3836611 you’ll see the error message.

23 Apr 2012 5 min read

120422 17:10:49 [ERROR] The transaction object for the table handle is at 0x2ac940042408, but for the current thread it is at 0x0

That 0x0 is a dead giveaway that we have a real problem. It is a null pointer, and these are deadly little beasts. So deadly that simply reporting on them, such as in this error, you can crash the server:

00:10:49 UTC - mysqld got signal 11 ;
This could be because you hit a bug...

Signal 11 is a Segmentation Violation or an invalid memory reference and was caused by the previous line where it was trying to helpfully print out the 200 bytes around the location reported for the transaction object in the current thread. However since that was null, printing it out tries to dereference the null pointer (look at the data that the pointer points to) and since null is invalid, it crashes. Bug #1, assuming that the pointer is valid in the debug message. For what it is worth, the code that was responsible is in storage/innobase/ha_innodb.cc and looks like this:

        if (prebuilt->trx != trx) {
          sql_print_error("The transaction object for the table handle is at "
                          "%p, but for the current thread it is at %p",
                          (const void*) prebuilt->trx, (const void*) trx);

                fputs("InnoDB: Dump of 200 bytes around prebuilt: ", stderr);
                ut_print_buf(stderr, ((const byte*)prebuilt) - 100, 200);
                fputs("n"
                        "InnoDB: Dump of 200 bytes around ha_data: ",
                        stderr);
                ut_print_buf(stderr, ((const byte*) trx) - 100, 200);
                putc('n', stderr);
                ut_error;
        }

The server crashed when it executed the ut_print_buf of the 200 bytes surrounding the null pointer. But why was trx (the null pointer) invalid? There must be another bug causing this. Now this could have been a system failure, broken memory, faulty disk or something odd, except that the server crashed multiple times with pretty much exactly the same stack trace. That isn’t a flaky system, that is a bug. So how to figure out what is going on? Luckily functions within MySQL and its forks are reasonably well named and give you a hint of what is going on. So lets look at the stack trace to figure out the sequence. Starting at the top we find the internal functions to print out the stack trace and the thread that started them, so we can ignore those, then at line 31 we start to see some useful information. The function names are C++ names so are “mangled” by the compiler, but still recognisable. So let’s look at the sequence starting from the bottom of the stack trace (which is the top of the stack) and work backwards to the point of failure. Note that not all the parameters are visible, and I’ve just pulled out those that are so you can see how the stack trace relates to internal functions.

handle_one_connection()
do_handle_one_connection()
dispatch_command(enum_server_command)
mysql_parse(Parser_state)

This sequence handles the incoming connection, determines the command coming in and dispatches it to the parser to parse the SQL.

mysql_execute_command()

Here we start executing the SQL

handle_select(select_result)
mysql_select(THD,Item,TABLE_LIST,List,st_order,select_result,st_select_lex_unit,st_select_lex)
JOIN::exec()
sub_select(JOIN,st_join_tableb)
select_insert::send_data(List,Item)

Now here is the bit that makes it easier to find the query triggering this bug. handle_select obviously is handling a SELECT statement, but we have a join, a sub_select and then a select_insert going on. So the statement that we are looking for will likely be of the form:

INSERT INTO ... SELECT ... FROM .. JOIN ...

write_record(THD,TABLE,st_copy_info)
handler::ha_write_row()

And this is where things go wrong. The write_record method hands the thread over to the storage engine that would handle the insert, and the handler, which in this case is the InnoDB/XtraDB storage engine, attempts to write the row out. Unfortunately at some point the THD (thread structure) has become corrupted, in that the previously built transaction has not been copied to the thread. So something about the particular query is conspiring with a quirk of the code to cause the transaction information to be invalid, in turn crashing the server when the situation is detected but not handled. This, along with the timings of the crashes, was enough to identify candidate queries that matched the pattern and when executed crashed the server. I’ll be interested to know when the bug gets lodged and what the fix is, however this piece is more about the wealth of information you can get from the error message, and the stack trace in particular. Without ever going into code, or a debugger, or have any programming knowledge whatsoever, you can still get valuable information from seemingly arcane error messages.