MaxScale, from proxy to replication relay. Part 2, the slave side
Part 1 of this blog post told the story of creating a binlog router for MaxScale that could connect to a MySQL replication master, download the binlog files from that master and store them locally on the MaxScale server. This post will concentrate on the other side of the router: the interaction with the MySQL slaves that will see MaxScale as the replication master.
In some ways the role of master for MaxScale is much closer to the style of interaction that MaxScale was designed to deliver: a connection originates from a client to a MaxScale service, MaxScale processes the request and returns a result to the client. The most obvious difference is of course that the processing does not involve forwarding the request on to another server; rather, it involves sending back a result already cached within MaxScale. However, if you delve a little deeper you find that this is too simplistic a view and, as with the master interaction, the way a slave interacts with MaxScale does not follow this simple pattern.
It is necessary to understand a little about how the slave interaction works in order to highlight the differences and illustrate why and how changes had to be made to the MaxScale model to facilitate the slave side of the binlog router. It should be noted, however, that the aim was not to create a MaxScale variant that would just act as a binlog relay server, or to have facilities in the core that would not be generally useful for other router plugins. The aim, as always with MaxScale, was to keep the specifics of the binlog routing problem within the router plugin, whilst enhancing the core with general functionality that could benefit other router, filter or protocol plugins.
The slave interaction can really be thought of as consisting of a number of phases; the first of these is the registration phase. The registration phase is where the slave and master exchange the information they need in order to set up the replication channel. Once it is set up, the master will stream replication events to the slave until such time as the connection is lost or the slave disconnects from the master.
The second phase is the catchup phase; this is when a slave connects to a master and requests binlog records from a position in the binlog file which is before the current leading edge of the binlog file. The master must stream the binlog events from this point up to the current insert point in the current binlog file. The slave itself sees no difference between these historic binlog events and new binlog events for current database updates; it is a mere convenience of implementation to consider this as a separate phase. Indeed this phase may not exist if the slave that connects is already up to date with the master. What makes this phase different from an implementation perspective, however, is that there is no external trigger to send these events other than the slave registration message. There is also potential for the number and size of messages to be massive. In testing, several tens or hundreds of gigabytes have been sent during this phase.
The final phase is the steady state phase; the client is at the leading edge of the binlog records and is merely sent new binlog events when database updates occur. In this case the sending of new events is triggered by the arrival of events from the real database master. This is an example of events on one connection, the connection to the master, causing a reaction on one or more other connections, the connections to every up-to-date slave. Once a slave has entered this third phase it is possible for it to go back to the catchup phase if that slave connection is unable to keep up with the rate at which events arrive from the master. It is therefore normal to see slave connections move between phase 3 and phase 2 for brief periods.
The first phase, registration, fits very easily into the MaxScale event driven model; a slave connects to MaxScale and sends requests in the form of queries. These requests are parsed by a "mini-parser" in the router plugin and answered with the stored responses that were obtained when MaxScale registered with the real master server. The router implements a state machine for this slave registration process, with each successful query exchange advancing the state of the slave connection until the state machine reaches the registered state. Upon reaching the registered state, either the catchup phase or the steady state phase is entered, depending upon the binlog position requested by the slave.
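The registration flow described above can be sketched roughly as follows. This is a minimal illustration and not the router's actual code: the names `Phase`, `SlaveSession` and `SAVED_RESPONSES` are hypothetical, the query strings and reply bytes are placeholders, and binlog positions are assumed to be (file name, offset) tuples whose file names sort in sequence order.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Phase(Enum):
    REGISTERING = auto()
    CATCHUP = auto()
    STEADY_STATE = auto()


# Hypothetical cache of replies saved when the relay itself registered
# with the real master. Keys and values here are placeholders.
SAVED_RESPONSES = {
    "SELECT UNIX_TIMESTAMP()": b"<saved timestamp reply>",
    "SHOW VARIABLES LIKE 'SERVER_ID'": b"<saved server id reply>",
}


@dataclass
class SlaveSession:
    phase: Phase = Phase.REGISTERING
    queries_answered: int = 0

    def handle_query(self, sql: str) -> bytes:
        """Answer one registration query from the cache and advance the state."""
        # Crude stand-in for the "mini-parser": collapse whitespace,
        # drop a trailing semicolon and ignore case.
        key = " ".join(sql.split()).rstrip(";").upper()
        reply = SAVED_RESPONSES[key]  # KeyError if the master never answered this
        self.queries_answered += 1
        return reply

    def handle_dump_request(self, requested, latest) -> Phase:
        """Finish registration and pick the next phase from the requested position."""
        # Assumes (file, offset) tuples compare in binlog order.
        self.phase = Phase.CATCHUP if requested < latest else Phase.STEADY_STATE
        return self.phase
```

A session that asks for a position behind the newest stored event ends up in `Phase.CATCHUP`; one that asks for the newest position goes straight to `Phase.STEADY_STATE`.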
The catchup phase is entered when a slave completes registration but asks for a binlog position and/or file which is before the latest available binlog event that MaxScale holds. MaxScale must send all of the events, starting from the requested position, up to the current latest position. These events are streamed by MaxScale to the slave server, with no messages being sent from the slave to MaxScale. The architecture of MaxScale, however, is event driven: it receives an event and then fully processes that event before returning the executing thread to the thread pool to process more events.
This model does not fit well with the streaming operation required when an out of date slave connects to MaxScale. The potential exists for the processing thread to read and stream vast amounts of data before it returns to the thread pool. This would mean that the thread would not be available to process other requests and would potentially starve those other requests. However, returning to the thread pool sooner is not possible, as the slave will not send any further events in order to receive the remainder of the binlog. Some alternate mechanism is required if MaxScale is to support this kind of operation without either using large numbers of threads or suffering starvation issues.
The solution chosen was to add a new mechanism to the descriptor control block (DCB) to allow the definition of low and high water marks for the queues of data waiting to be sent. The router sends binlog events, which creates a queue of outstanding write requests in the DCB. When this queue reaches a certain size, the high water mark, the router stops sending and the thread returns to the thread pool. Once the write queue drains to below the low water mark for the DCB, a synthetic event is generated for the DCB. This event is used to trigger the router to send more binlog events, up to the point of once again hitting the high water mark for the DCB or until all stored binlog events have been sent.
This approach allows the MaxScale processing model to be satisfied, gives a configurable mechanism for throttling the bandwidth used and also provides a way to limit the amount of memory each slave connection uses to buffer outgoing binlog events.
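The water mark mechanism can be illustrated with a small simulation. This is only a sketch under stated assumptions: `WriteQueue` stands in for the DCB's outbound queue, `CatchupSender` stands in for the router, `drain` simulates the socket accepting data, and none of these names come from MaxScale itself.

```python
from collections import deque


class WriteQueue:
    """Hypothetical stand-in for a DCB write queue with water marks."""

    def __init__(self, low_water, high_water, on_low_water):
        assert low_water < high_water
        self.low_water = low_water
        self.high_water = high_water
        self.on_low_water = on_low_water  # callback playing the synthetic event
        self.pending = deque()
        self.queued_bytes = 0
        self.above_high = False

    def write(self, data: bytes) -> bool:
        """Queue data; return False once the high water mark has been reached."""
        self.pending.append(data)
        self.queued_bytes += len(data)
        if self.queued_bytes >= self.high_water:
            self.above_high = True
        return not self.above_high

    def drain(self, nbytes: int) -> None:
        """Simulate the network accepting up to nbytes (whole buffers only)."""
        while self.pending and nbytes >= len(self.pending[0]):
            chunk = self.pending.popleft()
            nbytes -= len(chunk)
            self.queued_bytes -= len(chunk)
        # Fire the synthetic event only after the queue had filled up
        # and has now fallen below the low water mark.
        if self.above_high and self.queued_bytes < self.low_water:
            self.above_high = False
            self.on_low_water()


class CatchupSender:
    """Hypothetical router side: streams stored events in bounded bursts."""

    def __init__(self, events, low_water, high_water):
        self.events = events
        self.next_index = 0
        self.queue = WriteQueue(low_water, high_water, self.send_more)

    def send_more(self) -> None:
        # Send until the high water mark is hit or no stored events remain,
        # then return so the thread can go back to the pool.
        while self.next_index < len(self.events):
            event = self.events[self.next_index]
            self.next_index += 1
            if not self.queue.write(event):
                break
```

With ten 10-byte events, a low water mark of 20 and a high water mark of 50, the first call to `send_more()` queues five events and returns. Nothing more is sent until `drain` brings the queue below 20 bytes, at which point the callback queues the next burst. The gap between the two marks bounds both the memory held per slave and how long any one thread spends sending.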
Steady State Phase
In the steady state phase a slave connection is registered to receive binlog events and currently has the most up-to-date event that MaxScale is aware of. Any new event that arrives from the master must be saved to the MaxScale binlog file and then forwarded to each slave that is currently in the steady state phase. This is done by cycling around each slave and sending it a copy of the event if it is currently up to date. This happens on the thread that receives the event from the master server since, following the MaxScale event driven rule, there are no other events to trigger this transmission. Whilst this works well with a small number of slaves, as the number of slaves increases the processing time for each incoming binlog record also increases. As this occurs it becomes more likely that a new incoming event arrives before the processing of the previous one completes. This can eventually lead to starvation of the MaxScale thread pool and poor performance. Ideally a mechanism should exist to allow a single record to be sent to multiple slaves using multiple threads. To facilitate this, a mechanism that allows worker threads to be invoked from within an event processing thread is required. This is one issue that still needs to be resolved in the current proof of concept.
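The single-threaded distribution loop described above might look something like the following sketch. The `Slave` class, its `can_accept` flag and the `distribute` function are hypothetical names used for illustration only; `can_accept` is an assumed stand-in for whatever condition shows that a slave connection cannot keep up.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    name: str
    up_to_date: bool = True
    can_accept: bool = True  # hypothetical: False when the connection is backed up
    sent: list = field(default_factory=list)


def distribute(event: bytes, binlog: list, slaves: list) -> int:
    """Store one event from the master, then forward it to every up-to-date slave."""
    binlog.append(event)  # save first so that catchup can re-read it later
    forwarded = 0
    for slave in slaves:  # all of this runs on the thread that received the event
        if not slave.up_to_date:
            continue  # still catching up; it is served from the stored binlog
        if slave.can_accept:
            slave.sent.append(event)
            forwarded += 1
        else:
            slave.up_to_date = False  # fell behind, so back to the catchup phase
    return forwarded
```

Because the loop visits every slave in turn, the cost of handling one master event grows with the number of slaves, which is exactly the scaling concern raised above.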
As well as the lack of a worker thread mechanism within MaxScale, there are a few other limitations that need to be overcome in the proof of concept as it stands. It is not possible to connect a slave to MaxScale until after MaxScale has connected to a master server. This limitation exists because, until MaxScale has connected to a master, it does not have the responses it needs to answer the registration queries that the slave makes. MaxScale only supports slaves of the same version as the master to which it attaches; there is no provision for converting the binlog events and protocol between MySQL versions. Only MySQL 5.6 has currently been tested with the proof of concept.