Cassandra storage engine
What it is
The Cassandra Storage Engine allows MariaDB to access data stored in Cassandra. The overall architecture is shown in the picture below and is similar to that of the NDB Cluster storage engine.
You can access the same Cassandra cluster from multiple MariaDB instances, provided each of them runs Cassandra Storage Engine.
Establishing connection to Cassandra
In order to access Cassandra's data from MariaDB, you need to create a table with engine=cassandra. The table represents a view of a Column Family in Cassandra, and its definition looks like this:
```sql
set cassandra_default_thrift_host='192.168.0.10';  -- Cassandra's address. It can also
                                                   -- be specified as a startup parameter
                                                   -- or on a per-table basis.

create table cassandra_tbl            -- The table name can be chosen at will.
(
  rowkey type PRIMARY KEY,            -- Represents the Column Family's rowkey. The primary
                                      -- key must be defined over this column.
  column1 type,                       -- Cassandra's static columns can be mapped to
  column2 type,                       -- regular SQL columns.
  dynamic_cols blob DYNAMIC_COLUMN_STORAGE=yes
                                      -- If you need to access Cassandra's
                                      -- dynamic columns, you can define
                                      -- a blob which will receive all of
                                      -- them, packed as MariaDB's dynamic
                                      -- columns.
) engine=cassandra
  keyspace='cassandra_key_space'      -- The Cassandra keyspace.column_family we
  column_family='column_family_name'; -- are accessing.
```
Mapping for columns
- The row key. The Cassandra API does not consider it to be a column, but CQL and MariaDB do.
- If the Column Family defines a name for the rowkey (always true when using CQL), then the name of the primary key column in MariaDB must match it.
- If the Column Family doesn't define a name for the row key, MariaDB's column must be named "rowkey".
- Columns that are defined as part of Cassandra's "static column family" (this includes columns that one can define with CQL). These columns should have their counterparts in MariaDB: a static column named 'foo' in Cassandra must have a counterpart named 'foo' in MariaDB. The types must also match; type mapping is covered below.
- "Ad-hoc" columns that may be encountered in individual rows, as well as static columns that are not part of the MariaDB table definition, are mapped into one blob column holding Dynamic Columns content. Such a column must be a blob and must be marked with the boolean option DYNAMIC_COLUMN_STORAGE set to a "true" value (yes/on/1). If its name matches a Cassandra column's name, that Cassandra column will also be included in the blob. If no such column is defined, any columns not listed in the MariaDB table definition will not be read and cannot be changed.
- Cassandra's SuperColumns are not supported, and we have no plans to support them.
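Given a table defined as above with a DYNAMIC_COLUMN_STORAGE blob, ad-hoc columns can be unpacked with MariaDB's standard dynamic-column functions. A sketch (the column name 'price' and the key value are hypothetical, for illustration only):

```sql
-- Read an ad-hoc Cassandra column named 'price' out of the dynamic-columns blob:
SELECT rowkey,
       COLUMN_GET(dynamic_cols, 'price' AS DOUBLE) AS price
FROM cassandra_tbl;

-- List the names of all ad-hoc columns present in one row:
SELECT COLUMN_LIST(dynamic_cols)
FROM cassandra_tbl
WHERE rowkey = 'key1';
```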
Mapping for datatypes
There is no direct 1-to-1 mapping between Cassandra's datatypes and MySQL/MariaDB datatypes. Also, Cassandra's size limits are often more relaxed than MySQL/MariaDB's: for example, Cassandra's limit on rowkey length is about 2GB, while MySQL limits unique key length to about 1.5KB.
The mapping between Cassandra types and MariaDB types is as follows:
Cassandra | MariaDB |
---|---|
blob | VARBINARY(n) |
ascii | VARCHAR(n), use charset=latin1 |
text | VARCHAR(n), use charset=utf8 |
varint | VARBINARY(n) |
int | INT |
bigint | BIGINT, TINYINT, SMALLINT (pick the one that will fit the real data) |
uuid | CHAR(36), the UUID will be represented in text form on the MariaDB side |
timestamp | TIMESTAMP (second precision), TIMESTAMP(6) (microsecond precision), BIGINT (gets verbatim Cassandra's 64-bit milliseconds-since-epoch) |
boolean | BOOL |
float | FLOAT |
double | DOUBLE |
decimal | VARBINARY(n) |
counter | BIGINT, only reading is supported |
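As a sketch of how this mapping is applied in practice (the keyspace, column family, and column names below are hypothetical), a Column Family created with CQL could be mapped like this:

```sql
-- Suppose Cassandra has a column family created with CQL as:
--   CREATE TABLE users (id text PRIMARY KEY, age int, created timestamp);
-- A matching MariaDB table could look like (note the primary key column name
-- must match the CQL rowkey name, 'id'):
CREATE TABLE users
(
  id VARCHAR(64) CHARACTER SET utf8 PRIMARY KEY,  -- Cassandra 'text' -> VARCHAR(n), utf8
  age INT,                                        -- Cassandra 'int'  -> INT
  created TIMESTAMP                               -- Cassandra 'timestamp', second precision
) ENGINE=cassandra
  keyspace='my_keyspace'        -- hypothetical keyspace name
  column_family='users';
```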
Command mapping
INSERT
Cassandra doesn't provide any practical way to make INSERT behave differently from UPDATE. Therefore, INSERT works as INSERT-or-UPDATE and will overwrite existing data if necessary.
INSERT ... SELECT and multi-row INSERT statements will try to write data in batches. The batch size is controlled by the @@cassandra_insert_batch_size system variable, which specifies the maximum batch size in columns. The status variables Cassandra_row_inserts and Cassandra_row_insert_batches allow one to see whether inserts are actually batched.
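A sketch of tuning and observing insert batching, assuming the cassandra_tbl table from earlier (key and value literals are illustrative):

```sql
-- Set the maximum batch size (measured in columns, not rows):
SET cassandra_insert_batch_size = 100;

-- A multi-row INSERT that the engine will try to write in batches:
INSERT INTO cassandra_tbl (rowkey, column1)
VALUES ('k1', 'v1'), ('k2', 'v2'), ('k3', 'v3');

-- Check whether batching actually happened:
SHOW STATUS LIKE 'Cassandra_row_insert%';
```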
UPDATE
UPDATE works as one would expect SQL's UPDATE command to work (i.e. changing a primary key value results in the old record being deleted and a new record being inserted).
DELETE
DELETE FROM cassandra_table maps to the truncate(column_family) call.
- A DELETE with a WHERE clause will do per-row deletions.
SELECT
Generally, all SELECT statements work as one expects SQL to work. Conditions of the form primary_key=... allow the server to construct query plans that access Cassandra's rows with key lookups.
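For illustration, assuming the cassandra_tbl table from earlier (the key value is hypothetical), such a condition looks like:

```sql
-- An equality condition on the rowkey is executed as a Cassandra key lookup
-- rather than a full scan:
SELECT * FROM cassandra_tbl WHERE rowkey = 'key1';
```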
Full table scan
Full table scans are performed in a memory-efficient way. Cassandra SE performs a full table scan as a series of batches, each of which reads no more than @@cassandra_rnd_batch_size records.
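A sketch of controlling the scan batch size, again assuming the cassandra_tbl table (the value 1000 is illustrative):

```sql
-- Limit each scan batch to at most 1000 records to bound memory use:
SET cassandra_rnd_batch_size = 1000;

-- This statement is now executed as a series of batched reads:
SELECT COUNT(*) FROM cassandra_tbl;
```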
Batched Key Access support
Cassandra SE supports Batched Key Access in no-association mode. This means it requires the SQL layer to do the hashing, so the following settings are required:
- optimizer_switch='join_cache_hashed=on'
- join_cache_level=7|8
Cassandra SE is currently unable to make use of space in the join buffer (the one whose size is controlled by @@join_buffer_size). Instead, it will limit read batches to reading no more than @@cassandra_multiget_batch_size keys at a time, and memory will be allocated on the heap.
Note that the @@join_buffer_size buffer is still needed by the SQL layer, so its value should still be increased if you want to read in big batches.
It is possible to track the number of read batches, how many keys were looked-up, and how many results were produced with these status variables:
Variable_name | Value |
---|---|
Cassandra_multiget_reads | 0 |
Cassandra_multiget_keys_scanned | 0 |
Cassandra_multiget_rows_read | 0 |
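Pulling the settings above together, a session might be configured as follows (a sketch; the numeric values are illustrative, not recommendations):

```sql
-- Enable hashed join buffers and BKA for joins against Cassandra tables:
SET optimizer_switch = 'join_cache_hashed=on';
SET join_cache_level = 7;
SET join_buffer_size = 1048576;           -- still used by the SQL layer
SET cassandra_multiget_batch_size = 100;  -- caps keys read per multiget batch

-- After running a join, inspect the batching counters:
SHOW STATUS LIKE 'Cassandra_multiget%';
```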
How to try it out
See Building Cassandra Storage Engine.