High Availability for Drupal Part 1 – Investigating the Issues

spacer

Drupal is one of the most popular Content Management Systems (CMS) and is used increasingly in high-visibility sites, such as www.whitehouse.gov. This has brought a lot of attention on how to get the most performance out of Drupal and how to improve the availability of such sites. In this blog series I’ll take you through the basics and on through to designing your own HA Drupal site.

But first, we need to understand what the challenges are in getting Drupal (or indeed any CMS) working on multiple servers in such a way as to ensure high availability and performance.

The Multi-Server Issue

The normal course of upgrades to high-availability goes roughly along these lines:

  • De-couple the web server and the database, putting each on its own server
  • Duplicate the web front end with more servers behind a load balancer
  • Scale out the database by splitting to a master and multiple slaves

The trouble with this path is that second step. Most CMS, and Drupal is not alone in this, have the ability to upload files, for example a user profile photo, and here is where the problem lies. If you have two servers, and a user updates their profile on one, their photo only exists on that one server, so you get a broken image on the second server. To get around this most solutions rely on a shared storage arrangment, such as a SAN. If you are looking at removing single points of failure (SPOF) this may not seem like a great idea, although the engineering required to provide a viable alternative can be tricky.

Drupal Version Differences

There are two major versions of Drupal out in the wild.  The newer version 7 (D7) and the older, but still quite popular, Drupal 6 (D6).  Later this year D8 is likely to reach beta so there will be yet another version to choose from.

The major difference between these versions as far as high availability goes is in the database abstraction layer. Database abstraction layers are useful when you need to support different underlying database systems, hiding the differences from the application layer. This allows Drupal to use MySQL and PostgreSQL, and for D7, Sqlite, all without having to know about the differences between these systems, at least for the application programmer.

D6 DATABASE ABSTRACTION LAYER

The D6 database abstraction layer is quite simple, and tries not to get in the way. The comments in the includes/database.inc file give the best description of its capabilities

/**
 * @defgroup database Database abstraction layer
 * @{
 * Allow the use of different database servers using the same code base.
 *
 * Drupal provides a slim database abstraction layer to provide developers with
 * the ability to support multiple database servers easily. The intent of this
 * layer is to preserve the syntax and power of SQL as much as possible, while
 * letting Drupal control the pieces of queries that need to be written
 * differently for different servers and provide basic security checks.
 *
 * Most Drupal database queries are performed by a call to db_query() or
 * db_query_range(). Module authors should also consider using pager_query() for
 * queries that return results that need to be presented on multiple pages, and
 * tablesort_sql() for generating appropriate queries for sortable tables.
 *
 * For example, one might wish to return a list of the most recent 10 nodes
 * authored by a given user. Instead of directly issuing the SQL query
 * @code
 *   SELECT n.nid, n.title, n.created FROM node n WHERE n.uid = $uid LIMIT 0, 10;
 * @endcode
 * one would instead call the Drupal functions:
 * @code
 *   $result = db_query_range('SELECT n.nid, n.title, n.created
 *     FROM {node} n WHERE n.uid = %d', $uid, 0, 10);
 *   while ($node = db_fetch_object($result)) {
 *     // Perform operations on $node->body, etc. here.
 *   }
 * @endcode
 * Curly braces are used around "node" to provide table prefixing via
 * db_prefix_tables(). The explicit use of a user ID is pulled out into an
 * argument passed to db_query() so that SQL injection attacks from user input
 * can be caught and nullified. The LIMIT syntax varies between database servers,
 * so that is abstracted into db_query_range() arguments. Finally, note the
 * common pattern of iterating over the result set using db_fetch_object().
 */

You’ll notice in that description that most queries use a simple query call (db_query or db_query_range). This means there is nothing to differentiate, at least at the call level, if the query is one that is reading data and one that is updating or creating data.

The abstraction layer does provide a mechanism that is not well documented for using multiple databases, which might make it useful in splitting read/write queries for use with master/slave configurations. This is by creating an array of database URLs in the sites/default/settings.php file and to use the db_set_active() call in your code. The unfortunate part is that not only is it not documented, it is also not used in the core code, so you would get little benefit from implementing the call. Nevertheless it does show that the Drupal team had been thinking ahead as we shall see in D7.

If you are interested in playing with the db_set_active call, you first need to set up your sites/default/settings.php. Normally the database url looks like the following

$db_url = 'mysqli://user:pass@host/dbname';

To take advantage of multi-database support you would change this to an array, with the default database using the key ‘default’

$db_url = array( 'default' => 'mysqli://user:pass@master_host/dbname',
  'other' => 'mysqli://user:pass@other_host/dbname' );

D7 DATABASE ABSTRACTION LAYER

The D7 abstraction layer is a much richer layer than that in D6. Firstly, rather than being a simple functional abstraction, it is based upon the PHP PDO (PHP Data Objects) database API and is object oriented. It includes support for transactions, splitting of read/write queries, and support for master/slave database configurations. So in terms of going HA, this looks like a far better candidate than D6, at least on the face of it.

Let’s take a look under the hood. Once again, reading the comments in the default.settings.php file gives us a good head start.

/**
 * Database settings:
 *
 * The $databases array specifies the database connection or
 * connections that Drupal may use.  Drupal is able to connect
 * to multiple databases, including multiple types of databases,
 * during the same request.
 *
 * Each database connection is specified as an array of settings,
 * similar to the following:
 * @code
 * array(
 *   'driver' => 'mysql',
 *   'database' => 'databasename',
 *   'username' => 'username',
 *   'password' => 'password',
 *   'host' => 'localhost',
 *   'port' => 3306,
 *   'prefix' => 'myprefix_',
 *   'collation' => 'utf8_general_ci',
 * );
 * @endcode
 *
 * The "driver" property indicates what Drupal database driver the
 * connection should use.  This is usually the same as the name of the
 * database type, such as mysql or sqlite, but not always.  The other
 * properties will vary depending on the driver.  For SQLite, you must
 * specify a database file name in a directory that is writable by the
 * webserver.  For most other drivers, you must specify a
 * username, password, host, and database name.
 *
 * Some database engines support transactions.  In order to enable
 * transaction support for a given database, set the 'transactions' key
 * to TRUE.  To disable it, set it to FALSE.  Note that the default value
 * varies by driver.  For MySQL, the default is FALSE since MyISAM tables
 * do not support transactions.
 *
 * For each database, you may optionally specify multiple "target" databases.
 * A target database allows Drupal to try to send certain queries to a
 * different database if it can but fall back to the default connection if not.
 * That is useful for master/slave replication, as Drupal may try to connect
 * to a slave server when appropriate and if one is not available will simply
 * fall back to the single master server.
 *
 * The general format for the $databases array is as follows:
 * @code
 * $databases['default']['default'] = $info_array;
 * $databases['default']['slave'][] = $info_array;
 * $databases['default']['slave'][] = $info_array;
 * $databases['extra']['default'] = $info_array;
 * @endcode
 *
 * In the above example, $info_array is an array of settings described above.
 * The first line sets a "default" database that has one master database
 * (the second level default).  The second and third lines create an array
 * of potential slave databases.  Drupal will select one at random for a given
 * request as needed.  The fourth line creates a new database with a name of
 * "extra".
 *
 * For a single database configuration, the following is sufficient:
 * @code
 * $databases['default']['default'] = array(
 *   'driver' => 'mysql',
 *   'database' => 'databasename',
 *   'username' => 'username',
 *   'password' => 'password',
 *   'host' => 'localhost',
 *   'prefix' => 'main_',
 *   'collation' => 'utf8_general_ci',
 * );
 * @endcode
 *
 * You can optionally set prefixes for some or all database table names
 * by using the 'prefix' setting. If a prefix is specified, the table
 * name will be prepended with its value. Be sure to use valid database
 * characters only, usually alphanumeric and underscore. If no prefixes
 * are desired, leave it as an empty string ''.
 *
 * To have all database names prefixed, set 'prefix' as a string:
 * @code
 *   'prefix' => 'main_',
 * @endcode
 * To provide prefixes for specific tables, set 'prefix' as an array.
 * The array's keys are the table names and the values are the prefixes.
 * The 'default' element is mandatory and holds the prefix for any tables
 * not specified elsewhere in the array. Example:
 * @code
 *   'prefix' => array(
 *     'default'   => 'main_',
 *     'users'     => 'shared_',
 *     'sessions'  => 'shared_',
 *     'role'      => 'shared_',
 *     'authmap'   => 'shared_',
 *   ),
 * @endcode
 * You can also use a reference to a schema/database as a prefix. This may be
 * useful if your Drupal installation exists in a schema that is not the default
 * or you want to access several databases from the same code base at the same
 * time.
 * Example:
 * @code
 *   'prefix' => array(
 *     'default'   => 'main.',
 *     'users'     => 'shared.',
 *     'sessions'  => 'shared.',
 *     'role'      => 'shared.',
 *     'authmap'   => 'shared.',
 *   );
 * @endcode
 * NOTE: MySQL and SQLite's definition of a schema is a database.
 *
 * Advanced users can add or override initial commands to execute when
 * connecting to the database server, as well as PDO connection settings. For
 * example, to enable MySQL SELECT queries to exceed the max_join_size system
 * variable, and to reduce the database connection timeout to 5 seconds:
 *
 * @code
 * $databases['default']['default'] = array(
 *   'init_commands' => array(
 *     'big_selects' => 'SET SQL_BIG_SELECTS=1',
 *   ),
 *   'pdo' => array(
 *     PDO::ATTR_TIMEOUT => 5,
 *   ),
 * );
 * @endcode
 *
 * WARNING: These defaults are designed for database portability. Changing them
 * may cause unexpected behavior, including potential data loss.
 *

Some interesting stuff in there. Firstly the concept of master/slave separation, along with the ability to have multiple slaves that are load-balanced at the Drupal level. There are some gotchas in there though. Look at the comment about transactions for MySQL: “For MySQL, the default is FALSE since MyISAM tables do not support transactions.” While true, it is also out of date. The default database engine in MySQL has for a long time been InnoDB, which does support transactions. If you are serious about high availability you would want to make sure you are using InnoDB, and transactions.

One problem with slave databases is that they need to replicate every change that is made on the master. This means that under most replication regimes the slave database will be out of date with respect to the master while the changes are being written. This slave lag can be quite a headache for application developers. Generally you want your application to write to the master but read from the slave, otherwise you get no benefit from splitting out the master from the slave. Take our example above of a user updating their profile. You write the data to the master, then display the profile again to the user. Reading it immediately from the slave gives you a potential for out-of-date data, which is not a good idea. Fortunately the Drupal developers have thought of this and added a function called db_ignore_slave, which can be used to ensure you read data back from the master in these situations. One thing to keep in mind is that the default ignore period is 5 minutes, so you may want to adjust that according to your conditions.

In addition to slave selection, the D7 code also does a reasonably good job of ensuring that database updates only go to the master. The core code uses transactions and slave hinting in a number of core modules, although not all contributed modules comply. So for the most part you should be able to leverage the database abstraction layer to simplify a database scale-out solution without having to make changes to the core code.

The Slave Lag Issue

As described above, slave lag can be a real issue, and if you are using D6 you don’t get any help from Drupal in ameliorating it. There are also other issues with slaves, for instance using temporary tables. While a temporary table creation is replicated on the slave, the table is only available to the connection that created it, meaning that you cannot write to a temporary table on a master and expect to read it on the slave. This means that whatever solution you come up with needs to be able to cope with this.

One solution to this problem is to use a cache such as memcached. In this method, when writing data you write to memcached as well as the master database, and read from memcached. On a miss you read from a slave. Given that the main time you have a problem with slave lag is just after a master write, this can be a very effective measure. However it would normally require changes to the database abstraction layer in order to achive this, unless you provide an intelligent query analyzer/load balancer combination outside of the application.

The Web Server

A lot of Drupal sites run on Apache. This is the workhorse of the Internet and is a very capable and well respected web server. There are a number of newer offerings around now, and one gaining a lot of traction is Nginx (pronounced engine-X). Nginx requires a bit of configuration to run with Drupal, or indeed any PHP-based application, but it can be well worth it. Where Nginx excels is in high-thoughput sites with lots of connections and especially where there are a lot of static files involved.

For this blog series I set up a number of AWS EC2 instances to try out various combinations. The first test was of the web server software and it is quite interesting when comparing Apache vs Nginx on a typical static file. In our case it was the robots.txt file from the D6 installation. At 10 concurrent connections they were pretty even, but once the connection count goes up the differences start to become stark. At 100 concurrent connections, Nginx wins out by about 50%. At 400 concurrent connections we really see some differences. Using ab (the Apache Benchmark tool) we get the following for Apache on an m1.small EC2 instance

Server Software:        Apache/2.2.22
Server Hostname:        localhost
Server Port:            80

Document Path:          /drupal-6.28/robots.txt
Document Length:        1521 bytes

Concurrency Level:      400
Time taken for tests:   34.464 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      17990000 bytes
HTML transferred:       15210000 bytes
Requests per second:    290.16 [#/sec] (mean)
Time per request:       1378.553 [ms] (mean)
Time per request:       3.446 [ms] (mean, across all concurrent requests)
Transfer rate:          509.76 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0  105 946.9      1   15041
Processing:     7  790 3426.7    115   34387
Waiting:        0  758 3426.4     32   34377
Total:          7  895 3677.9    123   34464

Percentage of the requests served within a certain time (ms)
  50%    123
  66%    133
  75%    139
  80%    143
  90%   1052
  95%   5142
  98%  11631
  99%  23958
 100%  34464 (longest request)

Now Nginx on the same system

Server Software:        nginx/1.1.19
Server Hostname:        localhost
Server Port:            80

Document Path:          /robots.txt
Document Length:        1521 bytes

Concurrency Level:      400
Time taken for tests:   4.389 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      17340000 bytes
HTML transferred:       15210000 bytes
Requests per second:    2278.57 [#/sec] (mean)
Time per request:       175.549 [ms] (mean)
Time per request:       0.439 [ms] (mean, across all concurrent requests)
Transfer rate:          3858.44 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   98 318.8      8    3099
Processing:     1   55  77.5     20    3110
Waiting:        0   44  75.4     16    3110
Total:          1  153 332.5     86    3118

Percentage of the requests served within a certain time (ms)
  50%     86
  66%    113
  75%    123
  80%    136
  90%    241
  95%   1030
  98%   1098
  99%   1549
 100%   3118 (longest request)

This is quite a big gap between the two!

Doing tests on the running Drupal default first page shows very little difference between the two web servers and could not achieve high concurrency rates without timing out. We will look at this more later in the series, however the gating issue in the case of Drupal is clearly the database.

Next Steps

Now that we have identified the basic issues and opportunities in building a HA Drupal site, in our next post we will look at the “state of the art” for components that can help us build our site, and will wrap up in the third post with two suggested solutions, one for the cloud and one for the data centre.