LevelDB storage engine

You are viewing an old version of this article. View the current version here.

Basic feature list

  • single-statement transactions
  • secondary indexes
  • HANDLER implementation with extensions to support atomic multi-put (kind of like multi-statement transactions)
  • binlog XA on the master to be crash safe
  • crash-proof slave replication state
  • (almost) non blocking schema change
  • full test coverage via mysql-test-run
  • hot backup
  • possible options are to have LevelDB instance per mysqld, per schema or per table

Implementation overview

One leveldb instance

We consider using one LevelDB instance for mysqld process. LevelDB keys will be prefixed with 'dbname.table_name', 'dbname.table_name.index_name' (or their shorter equivalents). This will allow to store arbitrary number of tables/indexes in one LevelDB instance.

Transaction support

LevelDB supports

  • read snapshots
  • batch updates

when you have just those, there is no easy way to support full transactional semantics in the way it is required from MySQL table engine.

If we limit ourselves to single-statement transactions which touch limited numbers of rows, they could be implemented as follows:

  • updates done by the statement are accumulated in a batch
  • if the statement is committed, the batch is applied. LevelDB guarantees this will be an atomic operation
  • if the statement is rolled back, the batch is simply discarded.

(Note: the "Test implementation" uses exactly this approach. It presents itself to MySQL as a non-transactional engine which is able to roll back a statement)

(Note: According to Serg: Storage Engine API does not specify whether the changes made to table data should be immediately visible, or remain invisible until the end of the statement. Both kinds of behavior are allowed).

TODO: what if two transactions attempt to make conflicting changes? Will one of them get a conflict? A: NO, because LevelDB's operations cannot get in conflict. Delete() means "delete if exists" and Put() means "write, or overwrite". Therefore, no conflicts possible. TODO: is this ok? (more on this below)

Data formats

LevelDB compresses data with something called SnappyCompressor.

Does this mean we can just give it keys in MySQL's KeyTupleFormat, and all other columns in whatever format they happen to be in table->record[0] (except blobs)? Or we need to design a more compact representation?

(note: datatypes in the provided benchmark are: composite primary/secondary keys, INTs and VARCHARs (are they latin1 or utf-8?)).

Secondary Indexes

Unique secondary indexes

Unique secondary is implemented as a {KEY->VALUE} mapping in LevelDB, where index columns are used as KEY, and Primary Key columns are used as VALUE. This way,

  • "index-only" scans are possible
  • non-"index-only" scan is a two step process (access the index, access the primary index).

TODO: Is this needed at all? It is expensive to maintain a unique index. LevelDB's writes will silently overwrite duplicates. In order to maintain uniqueness, one will have to do a read-then-write, and somehow make sure (with locking?) that other threads don't get in the middle of it.

Q: Do we need to support unique secondary indexes [at all |in the first milestone]?

Non-unique secondary indexes

LevelDB stores {KEY->VALUE} mappings. Non-unique index will need to have some unique values for KEY. This is possible if we do

KEY = {index_columns, primary_key_columns}.   
VALUE = {nothing}

(todo: check if leveldb allows zero-sized values).

Using primary key as suffix will make DB::Get() useless. Instead, we will have to do lookups with:

get(secondary_index_key_val)
{
  open cursor for (secondary_index_key_val)
  read the first record
  if (record > secondary_index_key_val)
    return NOT_FOUND;
  else
    return FOUND;
}

Non-blocking schema changes

  • There is a requirement that doing schema changes does not block other queries from running.
  • Reclaiming space immediately after some parts of data were dropped is not important.

Possible approaches we could use:

  • Record format that support multiple versions. That way, adding/modifying/ dropping a non-indexed column may be instant. Note that this is applicable for records, not keys.
  • Background creation/dropping of indexes.

Hot backup

Hot backup will be made outside of this project. The idea is to hard-link the files so that they can't be deleted by compaction process, and then copy them over.

SQL Command mapping for LevelDB

INSERT

INSERTs will translate into DB::Put() calls. This means, one will not get an error if they are inserting data that's already there, and INSERT will work like REPLACE.

(Q: is this semantics ok? Probably not, because multi-value HANDLER WRITE was mentioned as an alternative?)

Single-line insert will translate into db->Put() call with WriteOptions(sync=true).

Multi-line INSERTs will be batched. Storage engine API provides start_bulk_insert()/end_bulk_insert() methods to aid insert batching. The batches will be not bigger than @@leveldb_max_batch_size, if INSERT is bigger, it will use multiple batches.

If an INSERT uses multiple batches and encounters some error on a non-first batch, it is not possible to fully roll it back. In other words, INSERTs bigger than certain size become non-transactional. (Q: is this ok?)

(TODO: can a MySQL storage engine tell between INSERT statement and REPLACE statement?)

UPDATE

UPDATE will do a read before write (if you just want to write, use INSERT).

TODO: Somebody may modify the record after we've done the read, but before we do the write. How can this be handled?

DELETE

MySQL handles DELETE command by reading the data, and possibly calling handler->delete_row() which deletes the row that was just read.

In LevelDB, DB::Delete() call has semantics of "delete if there is one" (TODO: which allows it to make some shortcuts and not read the data, correct?).

MySQL's SQL interpreter does not have a scenario there it makes "delete record X if there is one" calls to storage engine. The exact calling convention for handler::delete_row() is not defined precisely, and could be extended to handle "delete-if-exists". (TODO: more details here)

SELECT

Will use snapshot

SELECTs will allocate/use a snapshot for reading data. This way, the data will

Q: is this needed? Using snapshots has some cost.

Range scans

  • LevelDB cursors can be used for range scans.
  • DB::GetApproximateSizes() can be used to implement handler::records_in_range()
  • There is nothing for rec_per_key (index statistics)

ALTER TABLE

MySQL 5.6 should support online ALTER TABLE operations (as InnoDB now supports them).

TODO: what does the storage engine needs to do to inform the SQL layer that it is running a long DDL change which does not prevent other selects/updates from running?

Binlog XA on Master

This is about keeping binlog and LevelDB in sync on the master. MySQL does it as follows:

  • prepare transaction in the storage engine
  • write it into the binlog
  • commit it in the engine

If transactions are grouped, they are committed in the same order as they were written into the binary log.

Recovery proceeds as follows:

  • Read the last binlog file and note XIDs of transactions that are there.
  • for each storage engine,
    • scan the committed transactions and compare their XIDs to those we've found in the binlog.
    • If transaction is the binlog - commit, otherwise - roll it back)

(note that the order the transactions are applied in is determined from the engine, not from the binlog)

TODO: suggestions about how PREPARE/COMMIT/recovery should work for LevelDB. (got some ideas after discussion with Kristian, need to write them down)

Crash-proof slave

MySQL 5.6 stores information that used to be in relay_log.info in InnoDB. That way, InnoDB and relay_log.info (aka binlog position) are always in sync.

It seems, switching to storing relay_log.info in a LevelDB table is sufficient for crash-proof slave. (note: this implies that semantics of operations over LevelDB table is sufficiently close to that of a regular MySQL storage engine, like innodb).

Other details

  • The target version is MySQL 5.6 (good, because LevelDB API uses STL and 5.6-based versions support compiling with STL).
  • It is ok to make changes to LevelDB itself

Comments

Comments loading...
Content reproduced on this site is the property of its respective owners, and this content is not reviewed in advance by MariaDB. The views, information and opinions expressed by this content do not necessarily represent those of MariaDB or any other party.