Overview

Shard-Query is a high performance MPP (massively parallel processing) query engine for MariaDB and MySQL which offers increased parallelism compared to stand-alone servers. This increased parallelism is achieved by taking advantage of MariaDB/MySQL partitioning, sharding, common query clauses like BETWEEN and IN, or some combination of the above.

Shard-Query is implemented with PHP and Gearman.

Shard-Query is targeted mainly at big data problems, and OLAP queries in general. The primary goal of Shard-Query is to enable low-latency query access to extremely large volumes of data utilizing commodity hardware and open source database software. Shard-Query is a federated query engine which is designed to perform as much work in parallel as possible over a sharded dataset, that is one that is split over multiple servers (shards) or partitioned tables.

Shard-Query behaves like READ-COMMITTED in terms of transaction isolation for a single query. Shard-Query works best with a star schema data set, and is therefore targeted towards large data marts. Shard-Query also works well with extremely large tables, such as machine generated logs or sensor data.

Shard-Query resources

The swanhart-tools repository, on GitHub, contains tools for MariaDB and MySQL from Justin Swanhart. Shard-Query is located in the shard-query directory. The whole project can be downloaded via the GitHub's Download button. It is possible to report bugs using the Issues tab.

A project on Google Code also exists, but it is obsolete. All bugs from Google Code and all downloads from Google Code are deprecated and will be removed.

There is also an official Shard-Query blog at http://shardquery.com/

A Facebook page is also available.

Requirements

  • At least one server for data storage
  • A database schema for storing the Shard-Query configuration
  • MariaDB or MySQL, version 5.1+
  • PHP 5.3 or newer
    • PEAR (any version)
    • mbstring module
  • gearmand 0.18 or newer (get it from epel, or compile from gearman.org)
  • cron (for automatic gearman management)
  • Lua and JSON for Lua (for the proxy)

No web server is required. However, Apache 2 has been tested and a configuration file is included in the project.

What kind of interfaces does Shard-Query have

  • A RESTful UI which allows you to submit queries and examine results as well as configure Shard-Query
  • A MySQL proxy script
  • A PHP Object Oriented interface

The proxy directory contains a MySQL/MariaDB proxy. It can be used to transmit data between Shard-Query and applications, so that client programs don't need to know Shard-Query. The proxy is a Lua script which uses the JSON for Lua library.

What kind of queries are supported?

You can run just about all SQL queries over your dataset.

For SELECT queries:

  • All aggregate functions are supported.
  • SUM, COUNT, MIN, MAX and AVG are the fastest aggregate operations
  • SUM/COUNT(DISTINCT ..) are supported, but are slower
  • STD/VAR/etc are supported but aggregation is not pushed down at all (slowest)
  • Custom aggregate functions are now also supported.
    • PERCENTILE(expr, N) - take a percentile, for example percentile(score,90)

JOINs are supported (no self joins, or joins of tables sharded by different keys)

  • ORDER BY, GROUP BY, HAVING, WITH ROLLUP, and LIMIT are supported

UNIONs are fully supported.

Also upports:

  • INSERT, UPDATE, DELETE, LOAD DATA INFILE
  • DDL such as CREATE TABLE, ALTER TABLE and DROP TABLE

Details about how these statements are performed are explained before.

Sharding Support

Shard-Query is designed to work with a sharded data set. Shard-Query is designed to make a group of shards behave as one virtual database server, even for complex queries. Shard-Query includes two shard "mappers", described below.

Hash mapper

Hash based sharding features very fast row to shard lookups (this is called mapping) because no network round trip is required to calculate the lookup. The downsides are significant though, because the number of shards is fixed.

Directory mapper

Directory based sharding uses a database table on a directory database server to map rows to shards. This is very scalable, because rows can be migrated between shards, and the number of shards is not fixed (it can be easily grown). Directory mapping can be a problem if the directory must be very large (too large for a single directory server.) Directory mapping is slightly slower than hash mapping because a round trip and database lookups must be performed.

Automatic Sharding

SELECT queries

Shard-Query uses one of the above mappers in order to determine to which shards a query should be sent based on the WHERE clause used in the query. It may be possible to reduce the number of shards to which to send a query when the WHERE clause includes a condition on the shard column.

Shard-Query automatically rewrites queries so that they return proper results when the query is executed over multiple shards. For example, when the COUNT aggregate function is used, it is executed as the SUM of the COUNT from each of the queried shards. Shard-Query elects a 'coordinator node' for each SELECT query, which is used for aggregating the results.

INSERT and loading

When loading data or inserting rows, Shard-Query examines the data being inserted and sends the row to the appropriate shard. The loader is massively parallel and will load delimited files in chunks in parallel.

DDL and other DML statements

All SQL except SELECT and INSERT is sent to all shards.

Un-Sharded Tables

Shard-Query supports both sharded and un-sharded tables. Un-sharded tables are tables which do not contain the shard column. Insertions into un-sharded tables go to all shards. If a query contains no sharded tables, then the query is sent to only a single shard to prevent duplication of data.

Built-In tools

Shard-Query includes

Command-line tools

  • bin/run_query - tool to run queries. use php ./run_query --help. This is the tool you should look at to see how to integrate Shard-Query in your own app.
  • loader - tool to load (or pre-split) flat files for sharded loading.

REST UI

The REST UI is installed in /usr/share/shard-query/ui by default.

You can reach the UI by visiting the url: http://your_server/shard_query

The default username is: user and the password is mpp.

Comments

Comments loading...