October 20, 2016

Query Classification and Pluggable Parser

From the very beginning, a central feature of MariaDB MaxScale has been its ability to understand the SQL statements that flow through it. For that purpose, MaxScale is equipped with an SQL parser for parsing statements. However, the parser is not explicitly made available inside MaxScale, but the information it provides is accessed via the query classifier component.

Before looking at what functionality the query classifier provides, let's make a short historical detour to see how the parser's placement within MaxScale has changed over time.

History

Initially, the parser was the one available in the MariaDB embedded library and the query classifier was an integral part of the MaxScale core. That implied that the embedded library was also used in the internal communication between MaxScale and the backend servers. In addition to being a proxy between the client and the backend servers, MaxScale also has some communication of its own for monitoring the servers and for fetching account information. Client initiated communication is handled at the protocol level and is hence a different issue.

However, as we wanted to use MariaDB Connector/C in the internal communication, the explicit dependency on the embedded library had to be removed. Up until MaxScale 1.3, the query classifier component was a static library that defined both the API and its implementation, but in MaxScale 1.4 it was refactored so that the API is defined by MaxScale itself and the implementation is provided by a plugin that is loaded at runtime. The only implementation, called qc_mysqlembedded, still used the embedded library, but since the implementation was now a plugin that only exported the query classifier related API, the dependency on the embedded library was removed. Consequently, we could now link MaxScale itself with MariaDB Connector/C.

For MaxScale 2.0, we created a new query classifier implementation plugin, calledqc_sqlite, that is based upon a rather heavily modified parser from sqlite. The plugin based upon the embedded library is still included in the MaxScale source, but is no longer built as part of the regular build.

By default, MaxScale uses the qc_sqlite query classifier plugin, but it is possible to specify explicitly in the MaxScale configuration file, which query classifier to use.

    query_classifier=qc_sqlite

In other words, it is perfectly possible to implement a query classifier plugin using any technology, should that be relevant.

Information Provided

So, what information does the query classifier provide? A rather wide variety of information; the query classifier has been in MaxScale since the start and the provided functionality has evolved over time as new requirements have arisen.

In MaxScale, the data structure used for shuffling data (including SQL statements) around is called GWBUF, which is also what the query classifier API takes as input. The parsing information is stored as “out-of-band” data in the GWBUF itself, meaning that a statement in a GWBUF is parsed only once irrespective of how many times the parsing information about the statement is needed.

There is a function for parsing the buffer explicitly

    qc_parse_result_t qc_parse(GWBUF* buffer);

But, in general there is no need to do that. If the statement has not been parsed when some information is queried, it will be parsed.

You may want to call qc_parse() explicitly if it is imperative that all information about the statement is returned. For instance, the tokenization process will already tell whether a statement is a SELECT or an UPDATE, but what fields the statement affects is only known, provided the statement is fully parsed. Since the parser of MaxScale is not the same as that in the server, it is in principle possible that the MaxScale parser fails to parse completely a statement that the server will parse.

In most cases, the exact result of the parsing process can be ignored, but in some cases it must be known. For instance, the database firewall filter works in principle as follows:

   bool reject_statement;

   if (qc_parse(statement) == QC_QUERY_PARSED)
   {
       reject_statement = check_rules(statement);
   }
   else
   {
       reject_statement = true;
   }

That is, if the statement isn't fully parsed, it will unconditionally be rejected. Otherwise, we will check whether the firewall rules match. The reason is that unless the statement has been parsed completely we cannot know with certainty, for instance, what fields the statement affects and therefore the firewall cannot allow the statement to pass.

In other contexts we can be more relaxed and may be content with the statement having been classified based on tokenization; we will know it is a SELECT but we may not know what fields the statement accesses.

The two most top-level pieces of information that can be obtained about a statement is its type and its operation. The type basically defines whether a statement is a read (i.e. can be sent to slaves) or a write (i.e. should be sent to master).

  typedef enum
  {
      QUERY_TYPE_UNKNOWN    = 0x000000,
      QUERY_TYPE_LOCAL_READ = 0x000001,
      QUERY_TYPE_READ       = 0x000002,
      QUERY_TYPE_WRITE      = 0x000004,
      ...
  } qc_query_type_t;

  qc_query_type_t qc_get_type(GWBUF *buffer);

The operation then, rather expectedly, tells, for instance, whether the statement is a select or delete.

  typedef enum
  {
      QUERY_OP_UNDEFINED     = 0,
      QUERY_OP_SELECT        = (1 << 0),
      QUERY_OP_UPDATE        = (1 << 1),
      ...
  } qc_query_op_t;

  qc_query_op_t qc_get_operation(GWBUF *buffer);

However, due to historical reasons, there are overlaps between the qc_query_type_t and qc_query_op_t enumerations, which is something that will be addressed in the future.

In addition to being able to tell about the general characteristics of a statement, the query classifier can also provide specific information, such as what databases, tables or fields are accessed.

  char** qc_get_database_names(GWBUF *buffer, int* size);
  char** qc_get_table_names(GWBUF *buffer, int* tblsize, bool fullnames);
  char* qc_get_affected_fields(GWBUF *buffer);

However, as the query classifier only uses the provided statement as input, for instance, if the statement does not specifically address a particular database, one will not be returned even if there would be a default database. That is something that will have to be handled separately. We are working on making session information, such as the default database, whether a transaction is in process, etc., available so that filters and routers need not independently do that themselves.

Conclusion

The query classifier allows filters and routers to take action depending on the characteristics of a statement, and in the future, the information that is made available will be both richer and more easily accessible.

About Johan Wikman

Johan Wikman is a senior developer working on MaxScale at MariaDB Corporation. 

Read all posts by Johan Wikman