MariaDB Xpand Parallel Query Evaluation

MariaDB Xpand leverages its distributed SQL architecture for optimal query performance. MariaDB Xpand uses parallel query evaluation for simple queries and massively parallel processing (MPP) for complex queries that require slices from multiple nodes.

Parallelism

Xpand executes queries in parallel using several different kinds of parallelism.

Xpand achieves very efficient parallel query evaluation for multiple reasons:

  • While data is distributed among all nodes, every node knows the location of every slice. A node can forward an operation to the node that owns the relevant slice with at most 1 hop (or 0 hops, if it owns the slice itself).

  • The query optimizer compiles a query into fragments that are analogous to a collection of functions. The first function looks up where the value resides, and the second function reads the correct value from the correct slice.

The following graphic shows how this works for a simple SELECT statement:

Simple Query Compilation for Xpand
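
To make the "fragments as functions" analogy concrete, the following sketch models a lookup as two small functions: one that maps a key to a slice and its owning node, and one that reads the row from that slice. The slice map, distribution function, and node names are invented for illustration only and are not Xpand's internal API.

ROWS = {3: ("row-3",), 15: ("row-15",), 27: ("row-27",)}     # primary key -> row data
SLICE_COUNT = 3
SLICE_MAP = {0: "node1", 1: "node2", 2: "node3"}             # slice -> owner of ranking replica

def slice_for(key):
    """Stand-in distribution function; Xpand's real hashing is internal to the product."""
    return key % SLICE_COUNT

# Distribute the example rows across slices using the stand-in distribution function.
SLICE_DATA = {slice_id: {} for slice_id in SLICE_MAP}
for key, row in ROWS.items():
    SLICE_DATA[slice_for(key)][key] = row

def fragment_lookup(key):
    """Fragment 1: find the slice that holds `key` and the node that owns its ranking replica."""
    slice_id = slice_for(key)
    return slice_id, SLICE_MAP[slice_id]

def fragment_read(slice_id, key):
    """Fragment 2: read the row from that slice (executed on the owning node)."""
    return SLICE_DATA[slice_id].get(key)

slice_id, owner = fragment_lookup(15)
print(owner, fragment_read(slice_id, 15))    # node1 ('row-15',)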

Concurrent Parallelism

Concurrent Parallelism is when two entities can perform the exact same task in parallel.

Xpand uses concurrent parallelism to execute the same fragment of a query on different slices in parallel. A different node or CPU core operates on each slice.

For example, if a query reads every slice for a table, Xpand can read every ranking replica for every slice in parallel.
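
As a rough sketch of this behavior, the example below runs the same read fragment against several slices at once using a thread pool. The slice contents are invented, and the worker pool simply stands in for Xpand's nodes and CPU cores.

from concurrent.futures import ThreadPoolExecutor

# Invented slice contents; each worker plays the role of a separate node or CPU core.
SLICES = {
    "slice-1": [("alice", 10.0), ("bob", 25.0)],
    "slice-2": [("carol", 5.0)],
    "slice-3": [("dave", 50.0), ("erin", 7.5)],
}

def read_fragment(slice_name):
    """The same read fragment, executed independently for each slice."""
    return SLICES[slice_name]

# Every slice is read at the same time (concurrent parallelism).
with ThreadPoolExecutor(max_workers=len(SLICES)) as pool:
    per_slice_results = list(pool.map(read_fragment, SLICES))

rows = [row for chunk in per_slice_results for row in chunk]
print(rows)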

Pipeline Parallelism

Pipeline Parallelism is when two entities can perform different stages of a task in parallel.

Xpand uses pipeline parallelism to execute different fragments of a query on different slices in parallel. A different node or CPU core executes each fragment.

For example, if a query joins two tables, Xpand optimizes the operation by creating a pipeline. While Xpand is reading the next row from the first table, another node or CPU core will be joining the previous row in the first table with the corresponding row in the second table in parallel.
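
The pipelined join described above can be sketched as a small producer/consumer pair: one stage streams rows from the first table while a second stage joins each previously read row against the second table. The tables, queue, and join logic are invented for illustration and do not represent Xpand internals.

import queue
import threading

T1 = [(1, "alice"), (2, "bob"), (3, "carol")]     # invented rows: (id, name)
T2 = {1: 10.0, 2: 25.0, 3: 5.0}                   # invented rows: id -> amount

SENTINEL = object()
pipe = queue.Queue(maxsize=1)     # tiny buffer: the two stages overlap instead of running serially

def read_stage():
    """Stage 1: stream rows from the first table into the pipeline."""
    for row in T1:
        pipe.put(row)
    pipe.put(SENTINEL)

def join_stage(results):
    """Stage 2: join each previously read row against the second table."""
    while (row := pipe.get()) is not SENTINEL:
        row_id, name = row
        results.append((name, T2[row_id]))

joined = []
reader = threading.Thread(target=read_stage)
joiner = threading.Thread(target=join_stage, args=(joined,))
reader.start()
joiner.start()
reader.join()
joiner.join()
print(joined)     # [('alice', 10.0), ('bob', 25.0), ('carol', 5.0)]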

Massively Parallel Processing (MPP)

Xpand uses massively parallel processing (MPP) strategies. Massively parallel processing (MPP) is a programming strategy that allows an operation to be coordinated among multiple CPU cores on one or more hosts. The operation is split into multiple distinct tasks, and each task is handled by a different CPU core, so the tasks proceed in parallel.

Xpand uses MPP strategies to coordinate query execution between nodes. Different fragments of a query can be executed on different nodes. When a query is executed on a node, the Global Transaction Manager (GTM) on that node manages the concurrency of the query by communicating with other nodes that are executing fragments of the query. On each of those nodes, the Local Transaction Manager (LTM) manages the local part of the query.

Xpand also uses MPP strategies to minimize data transmission between nodes for most queries. When the query optimizer compiles a query into fragments, it creates two fragments for each database lookup. The first fragment looks up the node that contains the slice with the specified value, and the second fragment reads the value from the correct slice on that node. This method allows very complicated queries to scale with minimal network overhead:

  • A simple read-only query will always read the slice from the ranking replica. Every simple read-only query requires 0 or 1 hop.

  • A simple write query will always update every replica of the slice. Every simple write query requires a maximum number of hops equal to the number of replicas (by default, 2) plus 1.

  • A join query will always read the joined slice from the ranking replica. As the number of rows in a join increases, the number of hops per row does not increase. Each row in a join only requires a maximum of 3 hops.

Xpand also uses MPP strategies to optimize aggregate operations by performing as much of the aggregation as possible in a distributed manner. Each node performs a partial aggregation on its local data. At the end of the operation, the original node combines all of the data and performs the final aggregation.

Example Tables

The example tables below are used in the following sections to show how Xpand evaluates queries:

CREATE TABLE test.bundler (
   id INT PRIMARY KEY AUTO_INCREMENT,
   name CHAR(60) DEFAULT NULL
);

CREATE TABLE test.donation (
   id INT PRIMARY KEY AUTO_INCREMENT,
   bundler_id INT,
   amount DOUBLE,
   KEY bundler_key (bundler_id, amount)
);

Since the first table has a primary key and no secondary index, Xpand creates 1 representation. Since the second table has a primary key and a secondary index, Xpand creates 2 representations.

Representation          Index        Table          Distribution Key
_id_primary_bundler     Primary Key  test.bundler   id
_id_primary_donation    Primary Key  test.donation  id
_bundler_key_donation   Key          test.donation  bundler_id
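
As a rough illustration of how a distribution key determines where a row lives, the sketch below hashes the distribution key column to choose a slice and a node. The hash function, slice count, and node layout are invented; Xpand's actual distribution algorithm is internal to the product.

import hashlib

SLICE_COUNT = 3
NODES = ["node1", "node2", "node3"]

def slice_for(distribution_key_value):
    """Map a distribution key value to a slice number (stand-in hash, not Xpand's)."""
    digest = hashlib.sha1(repr(distribution_key_value).encode()).hexdigest()
    return int(digest, 16) % SLICE_COUNT

def node_for(slice_id):
    """Map a slice to the node that stores its ranking replica (invented layout)."""
    return NODES[slice_id % len(NODES)]

# _id_primary_donation is distributed on `id`, while _bundler_key_donation is distributed
# on `bundler_id`, so the same logical row can live on different nodes in each representation.
row = {"id": 15, "bundler_id": 4, "amount": 100.0}
print("primary representation:  ", node_for(slice_for(row["id"])))
print("secondary representation:", node_for(slice_for(row["bundler_id"])))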

Scaling Simple Queries

Xpand uses concurrent parallelism and massively parallel processing (MPP) strategies to scale simple queries:

  • Xpand uses concurrent parallelism to execute the same fragment of a query on different slices in parallel. A different node or CPU core operates on each slice.

  • Xpand uses massively parallel processing (MPP) strategies to coordinate query execution and minimize data transmission between nodes.

Simple queries include the following:

  • SELECT statements without join or aggregate operations

  • INSERT statements

Xpand evaluates simple queries on a single CPU core on a single node at a time. However, different CPU cores or different nodes can evaluate different simple queries concurrently. As more nodes are added to Xpand, more CPU cores and more memory are available to evaluate simple queries concurrently.

When more nodes are added, Xpand's network overhead does not increase for simple queries:

  • A simple read-only query will always read the slice from the ranking replica. Every simple read-only query requires 0 or 1 hop.

  • A simple write query will always update every replica of the slice. Every simple write query requires a maximum number of hops equal to the number of replicas (by default, 2) plus 1.

Due to this design, Xpand tends to scale linearly for simple queries as more nodes are added.

For example, consider the following SELECT statement:

SELECT id, amount
FROM test.donation
WHERE id = 15;

At a high level, Xpand evaluates the simple query in the following way:

  1. A node receives the query.

  2. The optimizer compiles a query plan composed of fragments.

  3. The first fragment determines the node that owns the ranking replica for the slice that contains the relevant row (id=15).

  4. The second fragment reads the row from the ranking replica of the slice.

    • If the original node owns the ranking replica, it reads the row from the slice.

    • If another node owns the ranking replica, the original node forwards the request to the other node. The other node reads the row from the slice and then returns the row to the original node.

  5. The original node returns the results.

The following graphic shows how the data flows:

Simple Query Evaluation for Xpand

Scaling Joins

Xpand uses concurrent parallelism, pipeline parallelism, and massively parallel processing (MPP) strategies to scale joins:

  • Xpand uses concurrent parallelism to execute the same fragment of a query on different slices in parallel. A different node or CPU core operates on each slice.

  • Xpand uses pipeline parallelism to execute different fragments of a query on different slices in parallel. A different node or CPU core executes each fragment.

  • Xpand uses massively parallel processing (MPP) strategies to coordinate query execution and minimize data transmission between nodes.

Xpand minimizes the data transmission for joins. As the number of rows in a join increases, the number of hops per row does not increase. Each row in a join only requires a maximum of 3 hops. The data transmission can have an impact on performance if many large joins are being executed concurrently, so it is very important to have sufficient network bandwidth and minimal network latency between nodes.

For example, if we are joining the tables t1 and t2, the data would be transmitted in the following way:

  1. The query is issued to node1. node1 determines that node2 owns the ranking replica for the current row in t1 and sends the request to node2.

  2. node2 reads the current row in t1 from the ranking replica. node2 also determines that node3 owns the ranking replica for the joined row in t2 and sends the request to node3.

  3. node3 reads the joined row in t2 from the ranking replica and returns the join results to node1.
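
The per-row hop accounting can be sketched as follows. The node names and replica ownership mirror the invented layout in the example steps above, not real cluster state.

# Toy walk-through of the worst-case 3 hops per joined row described above.
# Node names and replica ownership are invented to match the example steps.
OWNER_T1_ROW = "node2"     # owns the ranking replica for the current row in t1
OWNER_T2_ROW = "node3"     # owns the ranking replica for the joined row in t2

def hops_for_joined_row(receiving_node="node1"):
    hops = 0
    if OWNER_T1_ROW != receiving_node:    # hop 1: forward the t1 read
        hops += 1
    if OWNER_T2_ROW != OWNER_T1_ROW:      # hop 2: forward the t2 read and the join
        hops += 1
    if OWNER_T2_ROW != receiving_node:    # hop 3: return the joined row to the original node
        hops += 1
    return hops

print(hops_for_joined_row())    # 3 -- the maximum; fewer hops when ownership coincides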

The Global Transaction Manager (GTM) on the node that receives the query manages the concurrency of the query. It coordinates with the Local Transaction Manager (LTM) on any other node that participates in the query.

For example, consider the following SELECT statement with a join:

SELECT name, amount
FROM test.bundler AS b
JOIN test.donation AS d
   ON b.id = d.bundler_id
WHERE b.id = 15;

At a high level, Xpand evaluates the join in the following way:

  1. A node receives the query. The GTM on this node manages the concurrency.

  2. The optimizer compiles a query plan composed of fragments.

  3. The first fragment determines which node owns the ranking replica for the slice that contains the relevant row (b.id=15) in the representation (_id_primary_bundler).

  4. The second fragment reads the row from the ranking replica of the slice. If the original node owns the ranking replica, it reads the row from the slice. If another node owns the ranking replica, it forwards the request to the other node. The other node reads the row from the slice.

  5. The third fragment determines which node owns the ranking replica for the slice that contains the relevant row (d.bundler_id=15) in the representation (_bundler_key_donation).

  6. The fourth fragment reads the row. If the current node owns the ranking replica, it reads the row from the slice. If another node owns the ranking replica, it forwards the request to the other node and the other node reads the row. The node that reads the row joins the two rows and returns the joined rows to the GTM node.

  7. The original node returns the results.

The following graphic shows how the data flows:

Simple Joins with Xpand

If the WHERE clause is omitted, Xpand uses concurrent parallelism. Since the ranking replicas for the slices of the relevant representations are spread across the nodes, multiple nodes evaluate each fragment concurrently:

Parallel Joins for Xpand

Scaling Aggregates

Xpand uses concurrent parallelism, pipeline parallelism, and massively parallel processing (MPP) strategies to scale aggregates:

  • Xpand uses concurrent parallelism to execute the same fragment of a query on different slices in parallel. A different node or CPU core operates on each slice.

  • Xpand uses pipeline parallelism to execute different fragments of a query on different slices in parallel. A different node or CPU core executes each fragment.

  • Xpand uses massively parallel processing (MPP) strategies to coordinate query execution and minimize data transmission between nodes.

For some aggregate operations, Xpand performs the aggregation in a distributed manner. At the end, the partial aggregates from each node are combined.

For example, consider the following SELECT statement that aggregates data with SUM():

SELECT bundler_id, SUM(amount)
FROM test.donation AS d
GROUP BY bundler_id;

The donation table uses the id column as the distribution key. However, this query contains GROUP BY bundler_id, so the results are being aggregated on a value that is not part of the distribution key. This means that multiple slices may refer to the same bundler_id value. So, this aggregation operation can only be partially distributed.

Xpand performs as much of the aggregation as possible in a distributed manner. Each node performs a partial aggregation on its local data. At the end of the operation, the original node combines all of the data and performs the final aggregation.

The following graphic shows how the data flows:

Distributed aggregates with Xpand
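
A rough sketch of this two-phase aggregation is shown below: each node sums its local rows per bundler_id, and the originating node merges the partial sums into the final result. The rows and node layout are invented for illustration.

from collections import Counter

# Local donation rows (bundler_id, amount) on each node; a bundler_id can appear on
# several nodes because the table is distributed on `id`, not `bundler_id`.
LOCAL_ROWS = {
    "node1": [(4, 100.0), (7, 25.0)],
    "node2": [(4, 50.0), (9, 10.0)],
    "node3": [(7, 5.0)],
}

def partial_aggregate(rows):
    """Phase 1 (on each node): SUM(amount) grouped by bundler_id over local rows."""
    totals = Counter()
    for bundler_id, amount in rows:
        totals[bundler_id] += amount
    return totals

# Phase 2 (on the originating node): merge the partial sums into the final result.
final = Counter()
for node, rows in LOCAL_ROWS.items():
    final.update(partial_aggregate(rows))

print(dict(final))    # {4: 150.0, 7: 30.0, 9: 10.0}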

In some cases, Xpand can perform the entire aggregation operation in a distributed manner.

For example, consider the following SELECT DISTINCT statement:

SELECT DISTINCT bundler_id
FROM test.donation;

In this case, Xpand can use the secondary index's representation called _bundler_key_donation, which uses bundler_id as a distribution key. This has two implications:

  • A given bundler_id value appears in only one slice, so the distinct values found on different slices do not overlap.

  • Within each slice, bundler_id values are stored in sorted order.

At the end of the query execution, each node sends the original node a sorted list of the distinct bundler_id values from its slices. The original node does not have to perform any further aggregation; it only has to merge the sorted lists from each node.
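
The final merge step can be sketched as combining per-node result lists that are already sorted and non-overlapping. The per-node lists below are invented for illustration.

# Rough sketch of the fully distributed DISTINCT above: because _bundler_key_donation is
# distributed on bundler_id, each node returns a sorted list of bundler_id values that does
# not overlap with any other node's list. The originating node only merges the streams.
import heapq

PER_NODE_DISTINCT = {        # invented per-node results; each list is sorted and disjoint
    "node1": [1, 4, 9],
    "node2": [2, 7],
    "node3": [3, 5, 8],
}

merged = list(heapq.merge(*PER_NODE_DISTINCT.values()))
print(merged)    # [1, 2, 3, 4, 5, 7, 8, 9]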

Query Evaluation Procedure

Xpand's query evaluation procedure depends on the topology in use.

MariaDB Xpand supports multiple topologies. The topologies described below are representative; MariaDB products can also be deployed to form other topologies, leverage advanced product capabilities, or combine the capabilities of multiple topologies.

Xpand Performance Topology

In the Xpand Performance topology, the application sends read and write queries to MariaDB MaxScale, which routes them to three or more Xpand nodes:

Xpand Performance Topology

  1. MariaDB MaxScale receives an SQL query from the user or application.

  2. MaxScale uses the Xpand Monitor (xpandmon) to determine which Xpand nodes are healthy and able to receive queries.

  3. MaxScale uses the Read/Write Split router (readwritesplit) or the Read Connection router (readconnroute) to select a specific Xpand node to receive the query.

  4. Xpand associates the query with a session on the Xpand node that receives the query from MaxScale. This node acts as the Global Transaction Manager (GTM) for the query.

  5. Xpand parses the query, then sends it through the query optimizer and planner, choosing the optimal execution plan based on internal statistics.

  6. Xpand compiles the plan, breaking it into smaller query fragments, then runs code optimizations.

  7. Xpand runs the compiled query fragments in a distributed way across the Xpand nodes.

  8. The Xpand node that received the query collects the responses from the other nodes and returns the result-set.

  9. MaxScale receives the result-set and passes it back to the user or application.

Xpand Storage Engine Topology

In the Xpand Storage Engine topology, the application sends read and write queries to MariaDB MaxScale, which routes them to ES nodes. The ES nodes route any queries on Xpand tables to the Xpand nodes using the Xpand storage engine:

Xpand Storage Engine Topology

  1. MariaDB MaxScale receives an SQL query from the user or application.

  2. MaxScale uses the MariaDB Monitor (mariadbmon) to determine which ES nodes are healthy and able to receive queries.

  3. MaxScale uses the Read/Write Split router (readwritesplit) or the Read Connection router (readconnroute) to select a specific ES node to receive the query.

  4. The ES node receives the query and parses the SQL. The parts of the query that reference Xpand tables are routed to the Xpand storage engine.

  5. The Xpand storage engine uses a custom select handler to send queries to an Xpand node.

  6. Xpand associates the query with a session on the Xpand node that receives the query from the ES node. This Xpand node acts as the Global Transaction Manager (GTM) for the query.

  7. Xpand parses the query, then sends it through the query optimizer and planner, choosing the optimal execution plan based on internal statistics.

  8. Xpand compiles the plan, breaking it into smaller query fragments, then runs code optimizations.

  9. Xpand runs the compiled query fragments in a distributed way across the Xpand nodes.

  10. The Xpand node that received the query collects the responses from the other nodes and returns the result-set.

  11. The ES node receives the result-set and passes it through to MaxScale.

  12. MaxScale receives the result-set and passes it back to the user or application.