How ClustrixDB Accomplishes Horizontal Scaling of Both Writes & Reads Without Sharding

When people come up and ask, “Oh, so ClustrixDB is yet another MySQL-compatible database. What makes you different?” we have a terrific answer, and it’s all about horizontal scaling:

ClustrixDB is the only MySQL-compatible database that linearly scales out both write and read queries.

By horizontal scaling we mean that each node added to a ClustrixDB cluster increases both write and read capacity linearly. This is significantly different from what MySQL can provide: with its single-node master architecture, MySQL can add multiple read slaves, at best providing only linear read scale.

Other MySQL scaling strategies have challenges to write scale:

  • Single-master architectures such as MySQL, Percona, MariaDB, and Aurora can only scale up, i.e., vertically, until they reach the maximum available instance size
  • Read slaves offload read transactions, maximizing write throughput on the master, but ultimately cannot increase the master’s capacity
  • Master/master topologies (including certification-based replication as used in Percona XtraDB Cluster and MariaDB Galera Cluster) provide high availability, but not linear horizontal scaling of writes
  • Sharding is a common solution for write scale, but it comes with application, business, and DevOps challenges

So what makes ClustrixDB so different?

ClustrixDB’s Secret Sauce

In short, ClustrixDB solves linear write and read horizontal scaling via a three-pronged approach:

Horizontal Scaling Secret #1: Automatic Data Distribution

As tables and indexes are created, they are automatically “sliced” and distributed across all the nodes in the cluster. This distribution uses 64-bit consistent hashing on the data keys, so each slice’s location, and the data it contains, is predictable from a simple metadata map. Any transaction running on any node can therefore reach any data in the cluster at most a single hop away, and the “lookup” to find that data is a local in-memory table in the RDBMS. Our multi-patented Clustrix Rebalancer automatically distributes data across the cluster, partitioning it both horizontally (slices) and vertically (representations).
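To make the lookup idea concrete, here’s a minimal Python sketch of routing a key to a slice and then to a node via an in-memory metadata map. The table name, slice counts, node names, and the simple hash-modulo mapping are illustrative assumptions, not Clustrix internals (the real system uses 64-bit consistent hashing and its own metadata structures).

```python
import hashlib

# Hypothetical in-memory metadata map: (table, slice_id) -> node.
# Slice counts and node names are illustrative only.
SLICES_PER_TABLE = {"orders": 6}
SLICE_LOCATION = {("orders", i): f"node{i % 3 + 1}" for i in range(6)}

def slice_for_key(table: str, key: str) -> int:
    """Map a distribution-key value to a slice using 64 bits of a hash."""
    h64 = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return h64 % SLICES_PER_TABLE[table]

def node_for_key(table: str, key: str) -> str:
    """Purely local lookup: no network hop is needed to locate the data."""
    return SLICE_LOCATION[(table, slice_for_key(table, key))]

print(node_for_key("orders", "customer-42"))   # e.g. "node2"
```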

Horizontal Scaling Secret #2: Automatic Fan-out of SQL Queries

On the query side, SQL is a declarative language and is not trivially parallelized or scaled horizontally. ClustrixDB solves this problem by pre-parsing each query and distributing compiled query fragments directly to the specific cluster nodes containing the requisite data. Data processing is done locally, minimizing data movement, and only result sets are returned to the initiating node. This query-fragment forwarding is very fast, thanks to the metadata map of where all the data resides. We call this “bringing the query to the data, rather than the reverse.” It also allows massive parallelism: large queries get maximum parallelism, while many simultaneous queries get maximum concurrency across all the nodes in the cluster.
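Here’s a minimal sketch of the fan-out pattern, using a tiny in-memory stand-in for each node’s local slices: the same fragment (a filter, in this case) runs on every relevant node in parallel, and only the qualifying rows travel back to the initiating node. The node names, data, and thread-pool mechanics are illustrative assumptions, not the actual ClustrixDB execution engine.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-node slice contents; in the real system each node
# scans its own local storage rather than a Python dict.
NODE_DATA = {
    "node1": [{"id": 1, "amount": 40}, {"id": 4, "amount": 910}],
    "node2": [{"id": 2, "amount": 700}],
    "node3": [{"id": 3, "amount": 15}, {"id": 5, "amount": 350}],
}

def run_fragment(node: str, predicate) -> list:
    """Stand-in for a compiled query fragment executing locally on one node."""
    return [row for row in NODE_DATA[node] if predicate(row)]

def fan_out_query(predicate) -> list:
    """Run the fragment on every node in parallel; only results move."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda node: run_fragment(node, predicate), NODE_DATA))
    # The initiating node merely consolidates already-filtered rows.
    return [row for part in partials for row in part]

print(fan_out_query(lambda row: row["amount"] > 100))
```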

Horizontal Scaling Secret #3: Automatic Data Rebalancing

A big challenge of data distribution across shared-nothing systems is data imbalance and/or hotspots. A node’s storage can fill up, requiring the repartitioning of tables and the movement of data to a new node (called “re-sharding” in a sharded system) to create more space. Conversely, a node can experience CPU or network contention without its storage being exhausted. This kind of contention can happen if the data distribution isn’t granular enough, allowing too many simultaneous transactions to hit a small segment of data. This is called a “hotspot,” and it too is handled automatically by the Clustrix Rebalancer, which monitors usage patterns, re-splits heavily accessed slices, and moves half of the contended slice to another node.
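Below is a minimal sketch of that hotspot-splitting idea: when a slice’s access count crosses a threshold, split it and move one half to the least-loaded node. The threshold, counters, and node-selection policy are simplified assumptions, not the Rebalancer’s actual heuristics.

```python
HOT_THRESHOLD = 10_000   # accesses per interval (assumed value)

def plan_hot_slice_splits(slice_stats, node_load):
    """slice_stats: {slice_id: (node, accesses)}; node_load: {node: slice_count}.
    Returns (slice_id, source_node, target_node) actions for hot slices."""
    actions = []
    for slice_id, (node, accesses) in slice_stats.items():
        if accesses > HOT_THRESHOLD:
            target = min(node_load, key=node_load.get)   # least-loaded node
            actions.append((slice_id, node, target))     # split; move half to target
            node_load[target] += 1
    return actions

stats = {"orders_s3": ("node2", 25_000), "orders_s1": ("node1", 800)}
load = {"node1": 4, "node2": 5, "node3": 3}
print(plan_hot_slice_splits(stats, load))   # [('orders_s3', 'node2', 'node3')]
```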

Basically, ClustrixDB automatically handles the three main obstacles to the horizontal scaling of SQL, requiring neither changes to the MySQL application nor manual data maintenance by DevOps. Being able to horizontally scale your MySQL application without needing to shard represents a significant CAPEX and OPEX savings to your IT and DevOps budgets.

Let’s go another step deeper…

Under ClustrixDB’s Hood: Horizontal Scaling from the Inside

ClustrixDB is a single logical shared-nothing distributed RDBMS running on multiple peer-to-peer nodes.

If you’ve read about Google Spanner’s internals, some of this may sound familiar:

Like Spanner, ClustrixDB leverages tried-and-true technologies like two-phase locking (2PL) and multi-version concurrency control (MVCC) for transactional commit consistency, and Paxos for consensus across database servers. However, unlike Spanner, ClustrixDB is MySQL compatible and doesn’t require atomic clocks to run at very high transactional rates (faster than Aurora).

Here’s a high-level overview of the internals of ClustrixDB:

Shared-nothing Architecture ClustrixDB utilizes a shared-nothing architecture, because it’s the only known approach that allows linear horizontal scaling of large distributed systems. This approach is shared by scale-out RDBMSs like Google Spanner. Shared-disk approaches suffer from several architectural limitations inherent in coordinating access to a single central resource: as the number of nodes in the cluster increases, so does the coordination overhead. This is visible in the very high latency (155ms average, up to 450ms max) experienced by Aurora as it approaches 100k write TPS in its published Sysbench benchmarks.

Independent Index Distribution is the method ClustrixDB uses to give each index its own distribution. Independent index distribution supports a flexible and broad range of distributed query evaluation plans. Each table contains one or more indexes, which are representations of the table. Each representation has its own distribution key (a.k.a. a partition key), meaning that ClustrixDB uses multiple independent keys to slice the data in one table.
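As a rough illustration, the sketch below models one table with several representations, each sliced by its own distribution key, and picks the representation whose key the query covers so a lookup can be routed to a single slice instead of broadcast. The table, column names, and slice counts are made-up examples, not ClustrixDB metadata.

```python
# Hypothetical representations of one "orders" table, each with its own
# distribution key; names and slice counts are illustrative only.
REPRESENTATIONS = {
    "orders__base":        {"key_columns": ["order_id"],    "slices": 6},
    "orders__by_customer": {"key_columns": ["customer_id"], "slices": 4},
    "orders__by_date":     {"key_columns": ["order_date"],  "slices": 4},
}

def representation_for_lookup(where_columns):
    """Prefer a representation whose distribution key is fully covered by
    the query, so the lookup hits one slice instead of every slice."""
    for name, rep in REPRESENTATIONS.items():
        if set(rep["key_columns"]) <= set(where_columns):
            return name
    return "orders__base"   # otherwise fall back to the base representation

print(representation_for_lookup(["customer_id"]))   # "orders__by_customer"
print(representation_for_lookup(["status"]))        # "orders__base"
```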

ClustrixDB Rebalancer ensures optimal data distribution across all nodes for linear horizontal scaling. The multi-patented ClustrixDB Rebalancer is an automated system that maintains the healthy distribution of data in the cluster. The Rebalancer is an online process that effects change to the cluster with minimal interruption to user operations, relieving the cluster administrator of the burden of manually manipulating data placement. The Rebalancer ensures that:

  • Each representation has an appropriate number of slices and an appropriate distribution key, balancing rows across its slices while still allowing fast queries to target specific slices.
  • Slices are well distributed around the cluster on storage devices that are not overfull.
  • Slices are not placed on server(s) that are being flexed in.
  • Reads from each representation are balanced across the representation’s nodes.
  • Slices are kept from growing too big.
  • Distribution: The rows of a representation are distributed among its slices by the hash value of the distribution key. The distribution key is some prefix of the representation’s key columns, chosen so that the ratio of rows to unique values in the distribution key is neither too high nor too low (see the sketch below). The Rebalancer automatically monitors and corrects any distribution imbalance.
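The following sketch illustrates that prefix choice under stated assumptions: it computes the rows-per-distinct-value ratio for successively longer prefixes of the key columns and keeps the shortest prefix whose ratio falls under a made-up threshold. The sample rows, columns, and threshold are not Clustrix values.

```python
# Illustrative rows and key columns; the threshold is an assumed value.
ROWS = [
    {"country": "US", "city": "NYC", "order_id": 1},
    {"country": "US", "city": "NYC", "order_id": 2},
    {"country": "US", "city": "SF",  "order_id": 3},
    {"country": "DE", "city": "BER", "order_id": 4},
]
KEY_COLUMNS = ["country", "city", "order_id"]
MAX_ROWS_PER_VALUE = 1.5

def choose_distribution_key(rows, key_columns, max_ratio):
    """Return the shortest key-column prefix that hashes finely enough."""
    for i in range(1, len(key_columns) + 1):
        prefix = key_columns[:i]
        distinct = {tuple(r[c] for c in prefix) for r in rows}
        if len(rows) / len(distinct) <= max_ratio:
            return prefix
    return key_columns

print(choose_distribution_key(ROWS, KEY_COLUMNS, MAX_ROWS_PER_VALUE))
# ['country', 'city']  (country alone puts too many rows on each value)
```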

Query Optimizer executes each query with maximum parallelism and many simultaneous queries with maximum concurrency. ClustrixDB query optimization leverages a distributed query planner, compiler, and shared-nothing execution engine to achieve linear horizontal scalability.

The ClustrixDB Query Optimizer is modeled on the Cascades query optimization framework, which is known for:

  • Cost-based optimization
  • Extensibility via a rule-based mechanism
  • A top-down approach
  • Separation of logical and physical operators and properties
  • Branch-and-bound pruning

The Cascades framework is production-proven, underpinning several commercial DBMSs, notably Tandem’s NonStop SQL and Microsoft’s SQL Server.

The Query Optimizer must answer the following questions in order to scale horizontally:

  • In what order should the tables be joined? This can be the difference between the query executing in 1ms or 10 minutes (a join-ordering sketch follows this list).
  • Which indexes should be used? Not using a proper index on a join constraint could be catastrophic, causing broadcast messages and full reads of the second table for each row of the first.
  • Should the sort/aggregate be non-blocking? Should the sort/aggregate be done in stages, i.e., performed first on separate nodes and then re-aggregated/re-sorted later?
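As a toy example of the first question, here’s a small cost-based join-ordering sketch with branch-and-bound pruning, in the spirit of the Cascades approach described above. The table cardinalities, uniform selectivity, and cost model are assumptions for illustration, not the ClustrixDB optimizer’s actual statistics or cost functions.

```python
from itertools import permutations

# Toy cost-based join ordering with branch-and-bound pruning: enumerate
# candidate join orders, estimate intermediate result sizes, and abandon
# any order whose running cost already exceeds the best plan found so far.
CARDINALITY = {"customers": 10_000, "orders": 1_000_000, "items": 5_000_000}
SELECTIVITY = 0.001   # assumed selectivity for every join

def plan_cost(order, best_so_far=float("inf")):
    """Sum of estimated intermediate-result sizes for a left-deep join order."""
    rows, cost = CARDINALITY[order[0]], 0.0
    for table in order[1:]:
        rows = rows * CARDINALITY[table] * SELECTIVITY
        cost += rows
        if cost >= best_so_far:        # branch-and-bound prune: stop early
            return float("inf")
    return cost

def best_join_order(tables):
    best_cost, best_order = float("inf"), None
    for order in permutations(tables):
        cost = plan_cost(order, best_cost)
        if cost < best_cost:
            best_cost, best_order = cost, order
    return best_order, best_cost

print(best_join_order(["customers", "orders", "items"]))
```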

Evaluation Model parallelizes queries, which are horizontally distributed to the node(s) with the relevant data for scalability. Results are created as close to the data as possible, then routed back to the requesting node for consolidation and returned to the client. This is key to ClustrixDB’s ability to “send the query to the data.” Other RDBMS systems routinely move large amounts of data to the node that is processing the query, then eliminate all the data that doesn’t fit the query (often lots of data), whereas ClustrixDB appreciably reduces network traffic by moving only qualified data to the requesting node.

Processors on multiple nodes can additionally be brought to bear on the data selection process. The system produces results more quickly by selecting data on multiple nodes in parallel rather than selecting all data from a single node that must first gather the required data from the other nodes in the system. Overall, ClustrixDB’s parallel query evaluation for simple queries and Massively Parallel Processing (MPP) for analytic queries (akin to columnar stores) is a major contributor to ClustrixDB’s linear horizontal scalability.

Consistency and Concurrency Control uses a combination of multi-version concurrency control (MVCC) and two-phase locking (2PL); readers have lock-free snapshot isolation while writers use 2PL to manage conflicts. ClustrixDB implements a distributed MVCC scheme so that readers are lockless and readers and writers never interfere with each other. ClustrixDB maintains a version history of each row as writers modify rows within the system. Each statement within a transaction uses lock-free access to the data to retrieve the relevant version of the row.

Isolation ClustrixDB supports both read committed and repeatable read isolation.

Read committed (i.e., per-statement consistent snapshot read) isolation is similar to Oracle’s consistent-read isolation. Rows are visible when the statement’s invocation_id > the modifying transaction’s commit_id.

Repeatable read (i.e., per-transaction consistent snapshot read). Rows are visible when the transaction’s id > the modifying transaction’s commit_id.
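Here’s a minimal sketch of how snapshot visibility can work under the two levels, assuming each row keeps a version history tagged with the writer’s commit_id: under read committed each statement reads with a fresh snapshot id, while under repeatable read every statement reuses the transaction’s snapshot id. The data structures and id values are illustrative, not ClustrixDB’s internal format.

```python
# Each row keeps a version history: (commit_id of the writer, value).
ROW_VERSIONS = [(10, "v1"), (25, "v2"), (40, "v3")]

def visible_version(versions, snapshot_id):
    """Return the newest version committed before the reader's snapshot."""
    candidates = [(cid, val) for cid, val in versions if cid < snapshot_id]
    return max(candidates)[1] if candidates else None

# Read committed: each statement gets a fresh snapshot (its invocation_id).
print(visible_version(ROW_VERSIONS, snapshot_id=30))   # "v2"

# Repeatable read: every statement reuses the transaction's snapshot, so a
# transaction that started before commit_id 25 keeps seeing "v1".
print(visible_version(ROW_VERSIONS, snapshot_id=20))   # "v1"
```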

High Availability ClustrixDB maintains multiple copies (“replicas”) of data for fault tolerance and availability. By default, there are at least two physical replicas of each logical slice, stored on separate nodes. The number of replicas is selectable to increase DBMS resilience to drive or node failures, and a global nResilience configuration can be set to automatically support a maximum number of simultaneous node failures without loss of data or committed transactions (up to floor((#nodes - 1) / 2)). This allows up to a three-node failure in a seven-node cluster, a 24-node failure in a cluster of 50, etc. Alternate solutions such as K-safety only allow a maximum of two simultaneous node failures, regardless of cluster size.
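A quick arithmetic check of that limit, assuming the majority-quorum formula stated above:

```python
# Maximum simultaneous node failures a cluster can absorb while keeping a
# majority quorum, per the formula above: floor((nodes - 1) / 2).
def max_simultaneous_failures(nodes: int) -> int:
    return (nodes - 1) // 2

print(max_simultaneous_failures(7))    # 3
print(max_simultaneous_failures(50))   # 24
```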

ClustrixDB Horizontal Scaling: The Same Linear Scale-out as Sharding, Without the Challenges

Throughout this site, you will see the terms “flex out,” “flex in,” and “scale out,” but at the end of the day, we are talking about linear, horizontal scaling, and specifically, linearly scaling write queries as well as reads. ClustrixDB provides full writes and reads across all the nodes in the cluster, with linear, horizontal scale added by each additional node. Unlike sharding, applications see a single ClustrixDB server; there is no need to track different servers or split transactions across ranges. Applications using ClustrixDB don’t need to create or reference any kind of “sharding key”; all partition keys are handled transparently by the RDBMS. DBAs don’t need to manage data location across servers, moving shards around and/or “creating space”; Clustrix’s patented Rebalancer handles this transparently as well. And with fault tolerance already built in, ClustrixDB is much more resilient than sharding topologies to instance outages.