- Shard-Query resources
- What kind of interfaces does Shard-Query have
- What kind of queries are supported?
- Sharding Support
- Hash mapper
- Directory mapper
- Automatic Sharding
- Built-In tools
Shard-Query is a high performance MPP (massively parallel processing) query engine for MariaDB and MySQL which offers increased parallelism compared to stand-alone servers. This increased parallelism is achieved by taking advantage of MariaDB/MySQL partitioning, sharding, common query clauses like BETWEEN and IN, or some combination of the above.
Shard-Query is implemented with PHP and Gearman.
Shard-Query is targeted mainly at big data problems, and OLAP queries in general. The primary goal of Shard-Query is to enable low-latency query access to extremely large volumes of data utilizing commodity hardware and open source database software. Shard-Query is a federated query engine which is designed to perform as much work in parallel as possible over a sharded dataset, that is one that is split over multiple servers (shards) or partitioned tables.
Shard-Query behaves like
READ-COMMITTED in terms of transaction isolation for a single query. Shard-Query works best with a star schema data set, and is therefore targeted towards large data marts. Shard-Query also works well with extremely large tables, such as machine generated logs or sensor data.
The swanhart-tools repository, on GitHub, contains tools for MariaDB and MySQL from Justin Swanhart. Shard-Query is located in the
shard-query directory. The whole project can be downloaded via the GitHub's Download button. It is possible to report bugs using the Issues tab.
A project on Google Code also exists, but it is obsolete. All bugs from Google Code and all downloads from Google Code are deprecated and will be removed.
A Facebook page is also available.
- At least one server for data storage
- A database schema for storing the Shard-Query configuration
- MariaDB or MySQL, version 5.1+
- PHP 5.3 or newer
- PEAR (any version)
- mbstring module
- gearmand 0.18 or newer (get it from epel, or compile from gearman.org)
- cron (for automatic gearman management)
- Lua and JSON for Lua (for the proxy)
No web server is required. However, Apache 2 has been tested and a configuration file is included in the project.
What kind of interfaces does Shard-Query have
- A RESTful UI which allows you to submit queries and examine results as well as configure Shard-Query
- A MySQL proxy script
- A PHP Object Oriented interface
proxy directory contains a MySQL/MariaDB proxy. It can be used to transmit data between Shard-Query and applications, so that client programs don't need to know Shard-Query. The proxy is a Lua script which uses the JSON for Lua library.
What kind of queries are supported?
You can run just about all SQL queries over your dataset.
- All aggregate functions are supported.
AVGare the fastest aggregate operations
COUNT(DISTINCT ..)are supported, but are slower
VAR/etc are supported but aggregation is not pushed down at all (slowest)
- Custom aggregate functions are now also supported.
PERCENTILE(expr, N)- take a percentile, for example percentile(score,90)
JOINs are supported (no self joins, or joins of tables sharded by different keys)
WITH ROLLUP, and
UNIONs are fully supported.
DELETE, LOAD DATA INFILE
- DDL such as
Details about how these statements are performed are explained before.
Shard-Query is designed to work with a sharded data set. Shard-Query is designed to make a group of shards behave as one virtual database server, even for complex queries. Shard-Query includes two shard "mappers", described below.
Hash based sharding features very fast row to shard lookups (this is called mapping) because no network round trip is required to calculate the lookup. The downsides are significant though, because the number of shards is fixed.
Directory based sharding uses a database table on a directory database server to map rows to shards. This is very scalable, because rows can be migrated between shards, and the number of shards is not fixed (it can be easily grown). Directory mapping can be a problem if the directory must be very large (too large for a single directory server.) Directory mapping is slightly slower than hash mapping because a round trip and database lookups must be performed.
Shard-Query uses one of the above mappers in order to determine to which shards a query should be sent based on the
WHERE clause used in the query. It may be possible to reduce the number of shards to which to send a query when the
WHERE clause includes a condition on the shard column.
Shard-Query automatically rewrites queries so that they return proper results when the query is executed over multiple shards. For example, when the
COUNT aggregate function is used, it is executed as the
SUM of the
COUNT from each of the queried shards. Shard-Query elects a 'coordinator node' for each
SELECT query, which is used for aggregating the results.
INSERT and loading
When loading data or inserting rows, Shard-Query examines the data being inserted and sends the row to the appropriate shard. The loader is massively parallel and will load delimited files in chunks in parallel.
DDL and other DML statements
All SQL except
INSERT is sent to all shards.
Shard-Query supports both sharded and un-sharded tables. Un-sharded tables are tables which do not contain the shard column. Insertions into un-sharded tables go to all shards. If a query contains no sharded tables, then the query is sent to only a single shard to prevent duplication of data.
bin/run_query- tool to run queries. use
php ./run_query --help. This is the tool you should look at to see how to integrate Shard-Query in your own app.
loader- tool to load (or pre-split) flat files for sharded loading.
The REST UI is installed in
/usr/share/shard-query/ui by default.
You can reach the UI by visiting the url:
The default username is: user and the password is mpp.