The MaxScale team have been working hard fixing bugs and improving performance. On Friday we released an update of MaxScale, the pluggable proxy for MySQL and MariaDB, and I wanted to write a little about a few of those changes. I will not mention every change; the release notes give the full list of bugs fixed in this version. Instead I want to highlight a couple of performance-related changes and describe the rationale behind them. Before I start on those two items, a quick note: this version introduces cmake as the means to build MaxScale. The previous Makefiles are still available in this version, but will be withdrawn in future versions as we transition to cmake for build, installation and packaging.
In one of the test scenarios I was running, I observed an issue that manifested itself as pauses in the result set returned via MaxScale. In this particular situation a very large, multi-megabyte result set was returned, and sometimes during the result set stream there would be a measurable pause during which the client received no new data. There was clearly some blocking or starvation issue within MaxScale that needed to be investigated and resolved.
I added some extra instrumentation to poll.c and noted that we often had no threads available to poll. Increasing the number of threads did not solve the problem, so clearly we had an issue with threads being blocked. This turned out to be the mutexes that had been added to prevent multiple threads processing reads and writes for the same DCB. The problem was that these caused other threads to block while the first thread worked on a read or write, starving other DCBs of the processor. So these mutexes had to be removed and replaced by a simple queuing mechanism that frees subsequent threads back to the polling loop immediately rather than blocking them.
This has made a big difference and is key to the MaxScale architecture. The guiding principle is that threads should not block; if they do, we will never be able to handle the load, as the number of threads we require will grow uncontrollably.
Now that threads were more available, and the number of threads configured made a bigger difference to performance, I went on to look at synchronisation issues. I added instrumentation to the spinlock to detect hot locks that might need decomposing, or that might indicate poor parallelisation in the code. What I found was interesting…
Looking at the Descriptor Control Block (DCB) that was being used to write the results back to the client, I saw that the write queue spinlock was the only lock suffering any obvious contention. However, this was isolated to a single DCB, so it was not an issue of contention between router sessions, but within a single session.
Write Queue Lock Statistics:
    Spinlock acquired                      1635627
    Total no. of spins                     69365889
    Average no. of spins (overall)         42
    Average no. of spins (when contended)  11336
    Maximum no. of spins                   68855
    Maximum no. of blocked threads         1
    Contended locks                        6119
    Contention percentage                  0
The spinlock was acquired often and mostly did not contend; however there was a significant number of calls, 6119 in this sample, that did. When there was contention the number of spins required was high, an average of 11336. The conclusion was that sometimes, when the queue is locked, a significant amount of work is being done with the lock held. There were two candidates: either the appending of buffers to the queue or the processing of the queue to send data. Looking at the show dcb output in the debug interface, the queue could become very large.
The next approach tried was to profile the code, with gprof, to see if this problem could be pinned down further. This yielded some more interesting data.
Each sample counts as 0.01 seconds.
  %    cumulative    self                self     total
 time    seconds    seconds      calls  ms/call  ms/call  name
94.00     314.29     314.29   26007877     0.01     0.01  gwbuf_append
 4.43     329.10      14.82 1116143246     0.00     0.00  atomic_add
 0.43     330.55       1.45                               dcb_drain_writeq
 0.28     331.49       0.94   26397028     0.00     0.00  spinlock_acquire
 0.21     332.18       0.69                               dcb_write
 0.16     332.73       0.55         34    16.18    16.18  gwbuf_alloc
 0.13     333.15       0.42                               frame_dummy
 0.12     333.56       0.41   42439118     0.00     0.00  gwbuf_consume
 0.07     333.80       0.24   33654544     0.00     0.00  gwbuf_free
 0.05     333.98       0.18   40316849     0.00     0.00  gw_write
 0.05     334.16       0.18   25997764     0.00     0.00  gwbuf_length
 0.03     334.25       0.09   26390425     0.00     0.00  spinlock_release
So it seems gwbuf_append was taking a significant amount of CPU time, at least in this scenario. I made a simple change to maintain a tail pointer in the linked list, removing the need to walk the list in order to insert an item. This made a difference, yielding the following profile data…
Each sample counts as 0.01 seconds.
  %    cumulative    self                self     total
 time    seconds    seconds      calls  ms/call  ms/call  name
46.26       0.87       0.87   73798949     0.00     0.00  atomic_add
11.77       1.09       0.22        226     0.97     0.97  gwbuf_alloc
10.70       1.29       0.20                               dcb_drain_writeq
 8.56       1.45       0.16   21028798     0.00     0.00  gwbuf_consume
 5.88       1.56       0.11                               dcb_write
 4.81       1.65       0.09   10261279     0.00     0.00  gwbuf_append
 3.48       1.71       0.07   10581718     0.00     0.00  spinlock_acquire
 2.14       1.75       0.04   21027433     0.00     0.00  gwbuf_free
 1.34       1.78       0.03                               frame_dummy
 1.07       1.80       0.02   21048962     0.00     0.00  gw_write
 1.07       1.82       0.02          1    20.00    33.49  poll_waitevents
This shows that gwbuf_append is no longer a major issue. The top function is now atomic_add, which is used for gathering statistics and for implementing the spinlock primitives. This is not too surprising, since atomic_add needs to lock the bus and is thus an expensive instruction.
It also had an impact on the write queue spinlock, with a sample of similar duration now showing no contention at all with a single session running.
Write Queue Lock Statistics:
    Spinlock acquired                      1176242
    Total no. of spins                     0
    Average no. of spins (overall)         0
    Maximum no. of spins                   0
    Maximum no. of blocked threads         0
    Contended locks                        0
    Contention percentage                  0
These changes are encouraging, but currently they relate only to heavy write loads on MaxScale. They do not give a universal improvement in performance, but they do yield some significant gains and reduce the CPU overhead of handling large result sets, turning a previously CPU-bound process into an IO-bound one.
MaxScale is designed to use several independent threads to process requests, and each connection has a descriptor control block associated with it. These DCBs can only be destroyed when MaxScale is sure that none of the threads are currently using the DCB; if a DCB is destroyed prematurely by one thread, there is a risk that another thread is still using it. Therefore MaxScale implements a two-stage process to destroy DCBs. When a thread wishes to destroy a DCB it marks it as a "zombie" and places it on a zombie queue. Part of the DCB is a bitmask, with a bit per thread running in MaxScale. This bit is set for all threads when the DCB is in a non-zombie state; once the DCB is placed in the zombie state, each thread clears the bit corresponding to its thread ID when it enters the polling loop. Once the last thread clears its bit, the DCB can finally be destroyed.
This mechanism relies on threads frequently entering the polling loop in order to clean up DCBs in a timely fashion. However, the change introduced above to queue requests for a DCB was preventing a thread from re-entering the polling loop while a large request or response was being sent via MaxScale. The result was that on systems with result sets of many gigabytes, a large number of DCBs would build up in the zombie queue and connections would not be closed. This was particularly noticeable when using the binlog router for MySQL replication.
Looking at this problem also highlighted a different issue that could occur: the polling scheme allowed a thread to pick up all events available at the time the poll call was made. That thread would then be responsible for processing that set of events while other threads polled and collected their own sets of events to work on. If the first event, or any event other than the last in the set, either took significant time to process or, more likely, belonged to one of these connections streaming a large quantity of data, the subsequent events in that set would not be processed for a considerable time.
It was decided that the best approach to both of these problems would be to separate the processing of events from the polling for events. The implementation now polls for events within a thread, adds those events to a queue that is common to all threads, and then picks the first event from this queue to process. If another event arrives for a DCB that is currently being processed, the DCB is simply marked as having pending events. When the current processing completes, that DCB is re-queued at the end of the event queue to be picked up later, as opposed to the previous scheme, which would allow the pending event to be processed immediately to the detriment of other DCBs. This also means that threads re-enter the polling loop more frequently, so zombie processing can be done more often under these workloads.
The result was that under high load a much fairer event scheduling took place: every connection tended to see more predictable performance and, surprisingly, the processor load MaxScale imposed on the system decreased.
These are just two of the changes that can be found in the current release of MaxScale; there are many others, some mundane, others more important. I would encourage anybody who has not already done so to take a look at the release notes to see what has changed. As with any software we still have open issues, and we continue to work on these, increase test coverage and respond to issues raised by users of MaxScale.