LevelDB storage engine MS1


This page describes what will be implemented for milestone #1.

Feature Description

How the data is stored in LevelDB

One leveldb instance

We will use one LevelDB instance per mysqld process. LevelDB keys will be prefixed with 'dbname.table_name.PRIMARY' for the table itself, and 'dbname.table_name.index_name' for the secondary indexes. This allows storing an arbitrary number of tables/indexes in one LevelDB instance.
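As an illustration, the prefixing scheme can be sketched in Python (helper names are hypothetical; a plain dict stands in for the shared LevelDB instance, and using '.' to separate the prefix from the packed key bytes is an assumption):

```python
# A dict stands in for the single shared LevelDB instance.
store = {}

def make_key(dbname, table_name, index_name, packed_cols):
    # Keys for the table itself use index_name='PRIMARY'; secondary
    # indexes use their own index_name, so one LevelDB instance can
    # hold an arbitrary number of tables and indexes.
    prefix = f"{dbname}.{table_name}.{index_name}".encode()
    return prefix + b"." + packed_cols

store[make_key("linkdb", "linktable", "PRIMARY", b"\x00\x01")] = b"row bytes"
```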

Data encoding

We will rely on LevelDB's compression to make the storage compact. Data that goes into LevelDB's key will be stored in KeyTupleFormat (which allows MySQL's lookup/index-ordering functions to work).

Data that goes into LevelDB's value will be stored in table->record[0] format, except blobs. (Blobs will require special storage convention because they store a char* pointer in table->record[0]).

We will need to support blobs because table nodetable has a mediumtext field.
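One possible storage convention (a sketch, not the actual format) is to replace the char* pointer with a length prefix followed by the blob bytes stored inline in the value:

```python
import struct

# Hypothetical value layout: fixed-length fields first, then a 4-byte
# little-endian length prefix, then the blob bytes inline (instead of
# the char* pointer that table->record[0] holds).
def pack_value(fixed_part, blob):
    return fixed_part + struct.pack("<I", len(blob)) + blob

def unpack_blob(value, fixed_len):
    (blob_len,) = struct.unpack_from("<I", value, fixed_len)
    return value[fixed_len + 4 : fixed_len + 4 + blob_len]
```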

Secondary indexes

Non-unique secondary indexes will be supported.

LevelDB stores {KEY->VALUE} mappings. A non-unique index will still need unique values for KEY. This is possible if we do

KEY = {index_columns, primary_key_columns}.   
VALUE = {nothing}

(todo: check if leveldb allows zero-sized values).

Using the primary key as a suffix makes plain DB::Get() useless for secondary-index lookups. Instead, we will have to do lookups with:

get(secondary_index_key_val)
{
  open a cursor positioned at (secondary_index_key_val)
  read the first record
  if (record key does not start with secondary_index_key_val)
    return NOT_FOUND;
  else
    return FOUND;
}
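The lookup above can be sketched as follows, with a sorted list of keys standing in for a LevelDB iterator and bisect emulating the seek:

```python
import bisect

def secondary_index_get(keys, secondary_index_key_val):
    # Seek to the first key >= the lookup prefix, as an iterator would.
    pos = bisect.bisect_left(keys, secondary_index_key_val)
    if pos == len(keys):
        return "NOT_FOUND"
    # KEY = {index_columns, primary_key_columns}, so a hit is any key
    # that begins with the secondary-index columns, not an exact equal.
    if keys[pos].startswith(secondary_index_key_val):
        return "FOUND"
    return "NOT_FOUND"

# A sorted list of full index keys stands in for the LevelDB key space.
keys = sorted([b"db.t.idx.aa\x01", b"db.t.idx.ab\x02"])
```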

TODO: we will not support unique secondary indexes in MS1. ALTER/CREATE TABLE statements attempting to add a unique index will fail. Is this ok?

Concurrency handling

(TODO stuff from locking proposal here)

- Deadlock detection is not necessary. Deadlocks are caught by the lock timeout setting (which Facebook sets to 2 seconds). The lock timeout setting should also support a timeout of 0 (meaning: don't wait at all).

- For storing locks, we'll need a highly-concurrent hashtable. Current candidate is lf_hash.

1. Pessimistic locking proposal

Basic idea is:

LevelDB's operations do blind overwrites. In order to make sure we're not overwriting anything, we will use this pattern:

acquire record lock; read; modify as necessary; write; release record lock;

Record locks can be obtained for {table_name, primary_key}. Locks are exclusive: for a given record, only one lock can be held at a time. A lock can also be obtained when its record doesn't exist (see INSERT below).
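A minimal sketch of this pattern, using threading.Lock objects keyed by (table_name, primary_key) and a dict in place of LevelDB (the real implementation would use a highly-concurrent hash such as lf_hash, not a Python dict under a mutex):

```python
import threading

locks = {}                      # (table_name, primary_key) -> exclusive lock
locks_guard = threading.Lock()  # protects the lock table itself
store = {}                      # stands in for LevelDB

def record_lock(table, pk):
    # Lazily create one exclusive lock per (table_name, primary_key).
    with locks_guard:
        return locks.setdefault((table, pk), threading.Lock())

def update(table, pk, modify):
    lk = record_lock(table, pk)
    if not lk.acquire(timeout=2.0):       # lock wait timeout, no deadlock detection
        raise TimeoutError("lock wait timeout")
    try:
        old = store.get((table, pk))      # read
        store[(table, pk)] = modify(old)  # modify as necessary; write
    finally:
        lk.release()                      # release record lock
```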

1.1 UPDATE

UPDATE will use the above scheme. It will get row locks for the keys it is reading in order to prevent concurrent updates.

1.2 INSERT

INSERT will use a row lock to lock the record's non-existence.

A paste from yesterday's email:

If we want to insert a record, we need to call DB::Put(). The problem is that this function will either insert a record or overwrite an existing one, and its return value is the same in both cases. We will never know whether we wrote a new {key->value} pair or overwrote an old one.

If we want to execute a query

INSERT INTO linkdb.linktable VALUES ($primary_key_val, ... ) ON DUPLICATE KEY UPDATE ...

then the only way to do it is:

- Acquire a lock meaning "I want to write here" on value $primary_key_val. The lock will prevent anybody from writing or deleting the record with the given value of the primary key.

- Call DB::Get($primary_key_val) to see if the table has the record.

- If we've got a record, do what is specified in ON DUPLICATE KEY UPDATE; otherwise, use the VALUES(...) values.

- Call DB::Put(). We're holding a lock, so we're sure we will not overwrite somebody else's change.

- Release the lock on $primary_key_val
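The steps above can be sketched like this (single-threaded illustration only; a dict stands in for LevelDB and a set stands in for the row-lock table):

```python
store = {}      # stands in for LevelDB
locked = set()  # stands in for the row-lock table

def insert_on_dup_key_update(primary_key_val, values, on_dup_update):
    assert primary_key_val not in locked
    locked.add(primary_key_val)                # 1. "I want to write here" lock
    try:
        existing = store.get(primary_key_val)  # 2. DB::Get() - record there?
        if existing is not None:
            row = on_dup_update(existing)      # 3. ON DUPLICATE KEY UPDATE branch
        else:
            row = values                       #    ... else the VALUES(...) branch
        store[primary_key_val] = row           # 4. DB::Put() under the lock
    finally:
        locked.discard(primary_key_val)        # 5. release the lock
    return store[primary_key_val]
```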

1.3 DELETE

If we take a DELETE statement in form of

DELETE FROM tbl WHERE tbl.primary_key=const

it can be translated into a DB::Delete() call, which will delete the intended record (or do nothing if the record doesn't exist).

If the DELETE has any other form, we'll have to acquire locks before reading to make sure we're deleting the rows we intended to delete.

In the benchmark, the DELETE statement is used by assoc_delete and obj_delete.

1.4 SELECT

In linkbench, SELECTs appear only in read-only transactions, except for the first SELECT in assoc_delete.

It is tempting to use read snapshots and not do any locking. However, with assoc_delete we need to be sure that the record we've got from

SELECT visibility FROM linktable WHERE id1 = X AND id2 = Y AND link_type = Z;

is not changed before the next statement. Can we change this SELECT to be "SELECT .. FOR UPDATE"? That way, regular SELECTs will use no locking, and the above SELECT will use locking.

1.5 On locking mechanism

If somebody tries to acquire a lock and the lock is unavailable, there are two options: 1. wait for the lock to be released, or 2. abort immediately.

Waiting seems harder to implement; aborting seems easier. (TODO: if somebody has locked a record because they want to delete it, and we want to lock it because we want to update it, does it make sense to wait?)

With the current locking scheme, it is easy to construct deadlocks. However, linkbench transactions will not cause deadlocks (TODO: check again).

Command handling

(TODO stuff from locking proposal here)

What will not be supported

Non-blocking schema changes will not be supported at all in the first milestone. All DDL-changing operations will pump all the data from the original table into a table with the new DDL.

Binlog XA on the master will not be supported.

Crash-proof slave will not be supported.

Write-optimized INSERTs

We will need to do fast bulk data loads. During the bulk load, writes-without-reads are OK: the user knows they are not overwriting data and doesn't care about @@rows_affected.

These will be achieved as follows:

  • there will be a thread-local @@leveldb_bulk_load variable.
  • Its default value is FALSE.
  • When it is set to TRUE, INSERTs (which make ha_leveldb::write() calls) will work in bulk-loading mode.

Bulk-loading mode means:

  • INSERTs will be done in batches of @@leveldb_bulk_load_size each
  • INSERTs will take no locks and do writes without reads. In other words, they will silently overwrite data.
  • @@affected_rows will report that all records were successfully inserted.
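A sketch of bulk-loading mode (the variable names mirror @@leveldb_bulk_load and @@leveldb_bulk_load_size from above; a list stands in for a LevelDB WriteBatch and a dict for the database):

```python
leveldb_bulk_load = True       # thread-local @@leveldb_bulk_load
leveldb_bulk_load_size = 3     # @@leveldb_bulk_load_size
store, batch = {}, []

def bulk_insert(key, value):
    # No locks and no read-before-write: an existing key is silently
    # overwritten, which is acceptable during a bulk load.
    batch.append((key, value))
    if len(batch) >= leveldb_bulk_load_size:
        flush_batch()

def flush_batch():
    # One WriteBatch-style commit for the whole accumulated batch.
    for k, v in batch:
        store[k] = v
    batch.clear()

for i in range(5):
    bulk_insert(i, f"row{i}")
flush_batch()                  # flush the final partial batch
```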

Other details

  • The patch will be against mysql-5.6
  • There is no efficient way to run TRUNCATE TABLE. Is this ok?

Work items
