HBase storage engine
Data mapping from HBase to SQL

1. HBase data model and operations

1.1 HBase data model
- An HBase table consists of rows, which are identified by a row key.
- Each row has an arbitrary (potentially very large) number of columns.
- Columns are grouped into column families; a column family defines how its columns are stored (skipping column families that a query does not read is an optimization).
- Each (row, column) combination can have multiple versions of the data, identified by a timestamp.
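The model above can be pictured as nested dictionaries. A minimal sketch (the layout and all names here are ours, purely for illustration, not any real HBase API):

```python
# table -> row key -> "family:qualifier" -> {timestamp: value}
table = {
    b"row1": {b"colfam1:q1": {100: b"v1-old", 200: b"v1-new"}},
    b"row2": {b"colfam1:q2": {150: b"v2"}},
}

def latest(table, row_key, column):
    """Return the most recent version of one (row, column) cell."""
    versions = table[row_key][column]
    return versions[max(versions)]   # highest timestamp wins

print(latest(table, b"row1", b"colfam1:q1"))  # b'v1-new'
```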
1.2 HBase read operations
HBase API defines two ways to read data:
- Point lookup: get the record for a given row key.
- Range scan: read all records in a [startRow, stopRow) range.
Both kinds of reads allow one to restrict the result to:
- a column family we're interested in;
- a particular column we're interested in.
The default behavior for versioned columns is to return only the most recent version. The HBase API also allows one to request:
- versions of columns that were valid at some specific timestamp value;
- all versions that were valid within a specified [minStamp, maxStamp) interval;
- the N most recent versions.

We'll refer to the above as [VersionedDataConds].
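The read operations and [VersionedDataConds] can be modeled in a short sketch. The `scan` function below is an illustrative stand-in for the real client API (all names are ours), operating on a nested-dict layout {row -> {column -> {timestamp -> value}}}:

```python
def scan(table, start_row, stop_row, family=None, column=None,
         min_stamp=None, max_stamp=None, max_versions=1):
    """Range scan over [start_row, stop_row) with the read options
    described above (a model, not the real HBase client API)."""
    result = []
    for row_key in sorted(table):
        if not (start_row <= row_key < stop_row):
            continue
        for col, versions in sorted(table[row_key].items()):
            if family is not None and col.split(b":", 1)[0] != family:
                continue
            if column is not None and col != column:
                continue
            picked = sorted(versions.items(), reverse=True)  # newest first
            if min_stamp is not None:
                picked = [(t, v) for t, v in picked if t >= min_stamp]
            if max_stamp is not None:
                picked = [(t, v) for t, v in picked if t < max_stamp]
            for ts, value in picked[:max_versions]:
                result.append((row_key, col, ts, value))
    return result

table = {
    b"row1": {b"cf:a": {100: b"old", 200: b"new"}},
    b"row2": {b"cf:b": {150: b"x"}, b"other:c": {150: b"y"}},
}
```

A point lookup is then just a scan whose range covers a single row; by default only the newest version of each cell is returned, and [minStamp, maxStamp) narrows the versions considered.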
One can see two ways to map HBase tables to SQL tables:
2. Per-row mapping
Let each row in the HBase table be mapped to one row from the SQL point of view:
SELECT * FROM hbase_table;

row-id  column1  column2  column3  column4  ...
------  -------  -------  -------  -------
row1    data1    data2
row2                      data3
row3    data4                      data5
The problem is that the set of columns in an HBase table is not fixed and is potentially very large. The solution is to put all columns into one blob column and use Dynamic Columns (http://kb.askmonty.org/en/dynamic-columns) functions to pack/extract the values of individual columns:
row-id  dyn_columns
------  ------------------------------
row1    {column1=data1,column2=data2}
row2    {column3=data3}
row3    {column1=data4,column4=data5}
2.2 Mapping definition
Table DDL could look like this:
CREATE TABLE hbase_tbl_rows (
  row_id BINARY(MAX_HBASE_ROWID_LEN),
  columns BLOB,
  PRIMARY KEY (row_id)
) ENGINE=hbase_row;
(TODO: Does HBase have a MAX_HBASE_ROWID_LEN limit? What is it?)
The `columns` blob will hold values (and names) for all columns.
Access to name/value of individual columns is to be done with dynamic column functions (see http://kb.askmonty.org/en/dynamic-columns).
Functions for reading data:
COLUMN_GET(dynamic_column, column_nr AS type)
COLUMN_EXISTS(dynamic_column, column_nr)
COLUMN_LIST(dynamic_column)
Functions for data modification:
COLUMN_ADD(dynamic_column, column_nr, value [AS type], ...)
COLUMN_DELETE(dynamic_column, column_nr, column_nr, ...)
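The semantics of these functions can be sketched in a few lines. The real dynamic-column format is a compact binary encoding inside the blob; the sketch below uses JSON purely for illustration, and the function names mirror the SQL functions rather than any real C API:

```python
import json

def column_create(**cols):
    """Pack a set of (name, value) pairs into one blob."""
    return json.dumps(cols).encode()

def column_get(blob, name):
    return json.loads(blob).get(name)

def column_exists(blob, name):
    return name in json.loads(blob)

def column_list(blob):
    return sorted(json.loads(blob))

def column_add(blob, **cols):
    """Add or overwrite columns inside the blob."""
    d = json.loads(blob)
    d.update(cols)
    return json.dumps(d).encode()

def column_delete(blob, *names):
    d = json.loads(blob)
    for n in names:
        d.pop(n, None)
    return json.dumps(d).encode()

blob = column_create(column1="data1", column2="data2")
blob = column_add(blob, column3="data3")
blob = column_delete(blob, "column2")
```

Note how every operation reads the entire blob and (for modifications) writes the entire blob back; this is exactly the efficiency concern discussed in section 2.4.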
2.2.1 Problems to be solved
Dynamic column functions cannot be used as-is:
- Currently, Dynamic Columns functions use integer numbers to identify columns (see 'column_nr' in the parameters above). HBase identifies columns by name, and there is no way to get the full list of column names in use.
- There is no way to request "all columns from within a certain column family", although the HBase API allows that.
- There is no apparent way to access versioned data (see [VersionedDataConds] above). One possible solution is to have global/session @@hbase_timestamp variables that affect all queries.
2.3 Queries in per-row mapping
# Point-select:
SELECT COLUMN_GET(hbase_tbl.columns, 'column_name' AS INTEGER)
FROM hbase_tbl
WHERE row_id='hbase_row_id';

# Range select:
# (the example uses BETWEEN but we will support arbitrary predicates)
SELECT COLUMN_GET(hbase_tbl.columns, 'column_name' AS INTEGER)
FROM hbase_tbl
WHERE row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2';

# Update a column:
UPDATE hbase_tbl SET columns=COLUMN_ADD(columns, 'column_name', 'value');

# Add a column:
UPDATE hbase_tbl SET columns=COLUMN_ADD(columns, 'column_name', 'value');

# Insert a row with a column:
INSERT INTO hbase_tbl (row_id, columns)
VALUES ('hbase_row_id', COLUMN_CREATE('column_name', 'column-value'));

Q: It's not clear how to access versioned data. Can we go without versioned data for the first milestone (and then use a global @@hbase_timestamp for the second milestone)?

Q: It's not clear how to select "all columns from column family X".
2.4 Efficient execution for per-row mapping
The table declares:
row_id BINARY(MAX_HBASE_ROWID_LEN),
...
PRIMARY KEY (row_id)
which will allow the range/ref optimizer to construct point-lookups and range scans on the table.
Q: will we need joins, i.e. do I need to implement Multi-Range-Read and support Batched Key Access right away?
Dynamic column functions currently work as follows:
1. The entire blob is read into memory.
2. Dynamic column functions operate on the blob.
3. [If this is an UPDATE] the entire blob is written back.
With HBase, this will result in reading redundant columns and column families.
Solution #1: on-demand reads
Instead of returning the blob, we return a handle, and only read HBase column values when they are actually requested.

This ensures we don't read any unneeded data, but at the cost of extra roundtrips. This is probably too expensive.
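Solution #1 can be sketched as a lazy handle; `fetch_cell` below stands in for a real HBase point read, and the roundtrip counter makes the cost visible (all names are illustrative):

```python
class LazyRow:
    """A handle that fetches each column from the backend only when it
    is first requested (Solution #1: on-demand reads)."""

    def __init__(self, row_id, fetch_cell):
        self.row_id = row_id
        self._fetch = fetch_cell
        self._cache = {}
        self.roundtrips = 0

    def __getitem__(self, column):
        if column not in self._cache:
            self.roundtrips += 1   # one network roundtrip per new column
            self._cache[column] = self._fetch(self.row_id, column)
        return self._cache[column]

# Simulated backend: a dict of (row, column) -> value.
store = {("row1", "cf:a"): "x", ("row1", "cf:b"): "y"}
row = LazyRow("row1", lambda r, c: store[(r, c)])
row["cf:a"]; row["cf:a"]; row["cf:b"]   # two distinct columns touched
print(row.roundtrips)                    # 2
```

Unreferenced columns are never transferred, but each newly touched column costs a roundtrip, which is why a batched approach (Solution #2) looks preferable.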
Solution #2: List of reads
Possible solution for reads:
- Find all references to hbase_table.columns.
- Collect the names of the columns that are read, and retrieve only those columns.
This will still cause extra data to be read for queries like:
SELECT COLUMN_GET(hbase_tbl, 'column1' AS INTEGER)
FROM hbase_tbl
WHERE row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2' AND
      COLUMN_GET(hbase_tbl, 'column2' AS INTEGER)=1
here we will read column1 even for rows where the condition on column2 does not hold.

I consider this solution to be better than Solution #1.
There is a question of what should be done when the query has a reference to
COLUMN_GET(hbase_tbl, {non-const-item} AS ...)
and we don't know which columns will be requested. The possible approaches are:
- retrieve all columns, so that we're sure to have the data;
- fetch columns on demand;
- stop the query with an error.
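The column-collection step of Solution #2 can be sketched as a walk over the query's expression tree. The tuple-based tree representation below is a made-up stand-in for the server's real Item tree, and the non-constant case falls back to "read everything":

```python
# Expressions modeled as nested tuples: ('COLUMN_GET', arg) or
# (op, child, child, ...); anything that is not a tuple is a constant.

ALL_COLUMNS = object()   # sentinel: "we could not narrow the read"

def columns_to_read(expr):
    """Collect column names passed to COLUMN_GET with a constant
    argument; a non-constant argument yields the ALL_COLUMNS sentinel."""
    if not isinstance(expr, tuple):
        return set()
    if expr[0] == 'COLUMN_GET':
        arg = expr[1]
        return {arg} if isinstance(arg, str) else {ALL_COLUMNS}
    cols = set()
    for child in expr[1:]:
        cols |= columns_to_read(child)
    return cols

# WHERE COLUMN_GET(columns,'column2')=1 AND COLUMN_GET(columns,'column1')>0
where = ('AND',
         ('=', ('COLUMN_GET', 'column2'), 1),
         ('>', ('COLUMN_GET', 'column1'), 0))
print(columns_to_read(where))   # {'column1', 'column2'}
```

The caller would then pass the collected names to the HBase read, or request all columns (or raise an error) when the sentinel appears.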
3. Per-cell mapping
The HBase shell has a 'scan' command; here's an example of its output:
hbase(main):007:0> scan 'testtable'
ROW      COLUMN+CELL
myrow-1  column=colfam1:q1, timestamp=1297345476469, value=value-1
myrow-2  column=colfam1:q2, timestamp=1297345495663, value=value-2
myrow-2  column=colfam1:q3, timestamp=1297345508999, value=value-3
Here, one HBase row produces multiple rows in the output. Each output row represents one (row_id, column) combination, so rows with multiple columns (and multiple revisions of column data) can be easily represented.
3.1 Mapping definition
Mapping could be defined as follows:
CREATE TABLE hbase_tbl_cells (
  row_id BINARY(MAX_HBASE_ROWID_LEN),
  column_family BINARY(MAX_HBASE_COLFAM_LEN),
  column_name BINARY(MAX_HBASE_NAME_LEN),
  timestamp TIMESTAMP,
  value BLOB,
  PRIMARY KEY (row_id, column_family, column_name, timestamp)
) ENGINE=hbase_cell;
There is no need for dynamic columns in this mapping.
- TODO: Does HBase have a MAX_HBASE_ROWID_LEN limit? What is it?
- NOTE: It is nice to have the SQL table DDL independent of the content of the backend HBase table. This saves us from the need to synchronize table DDLs between HBase and MySQL (NDB Cluster had to do this, and they ended up implementing a very complex system for it).
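The expansion of one HBase row into per-cell SQL rows can be sketched directly; `to_cells` below (an illustrative name) produces tuples in the shape of the table above:

```python
def to_cells(row_id, columns):
    """Expand one HBase row {column -> {timestamp -> value}} into
    (row_id, column_family, column_name, timestamp, value) rows."""
    cells = []
    for col, versions in sorted(columns.items()):
        family, name = col.split(b":", 1)
        for ts, value in sorted(versions.items()):
            cells.append((row_id, family, name, ts, value))
    return cells

cells = to_cells(b"myrow-2", {
    b"colfam1:q2": {1297345495663: b"value-2"},
    b"colfam1:q3": {1297345508999: b"value-3"},
})
# one SQL row per (column, version) pair
```

Because versions become ordinary rows with a timestamp column, the [VersionedDataConds] of section 1.2 turn into plain WHERE predicates on `timestamp`.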
3.2 Queries in per-cell mapping
- Point-select:

  SELECT value FROM hbase_cell
  WHERE row_id='hbase_row_id' AND column_family='hbase_column_family' AND
        column_name='hbase_column' ...

- Range select (the example uses BETWEEN but we will support arbitrary predicates):

  SELECT value FROM hbase_cell
  WHERE row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2' AND
        column_family='hbase_column_family' AND column_name='hbase_column'

- Update a column:

  UPDATE hbase_cell SET value='value'
  WHERE row_id='hbase_row' AND column_family='col_family' AND
        column_name='col_name'

- Add a column (this will add a row if one doesn't exist yet):

  INSERT INTO hbase_cell VALUES ('hbase_row', 'col_family', 'col_name', 'value');
Note that:
- accessing versioned data (a particular version, versions within a range, etc.) is easy;
- it is also easy to select all columns from a certain column family.
3.3 Efficient execution in per-cell mapping
The table is defined as having a
PRIMARY KEY (row_id, column_family, column_name, timestamp)
which allows efficient point lookups for (row_id, column_family, column_name) tuples.

The order of columns in the index also allows reading all columns in a column family.
3.3.1 Scanning certain columns within a range
The HBase API allows scanning a range of rows while retrieving only a certain column name or certain column families. In SQL, this is written as:
SELECT value FROM hbase_cell
WHERE row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2' AND
      column_family='hbase_column_family'                          (*)
If we feed this into the range optimizer, it will produce a range:
('hbase_row_id1', 'hbase_column_family') <= (row_id, column_family) <= ('hbase_row_id2', 'hbase_column_family')
which includes all column families for records that satisfy

'hbase_row_id1' < row_id < 'hbase_row_id2'
This will cause extra data to be read.
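The over-read can be demonstrated with plain tuple comparison, which models how the multi-column range is evaluated over mem-comparable index keys (Python string tuples are a stand-in for the real key format):

```python
# The single range the optimizer produces for condition (*):
lo = ("hbase_row_id1", "hbase_column_family")
hi = ("hbase_row_id2", "hbase_column_family")

# A cell from a DIFFERENT column family, for a row_id strictly between
# the endpoints, still falls inside the range: once the first key part
# differs, lexicographic comparison never looks at column_family.
unwanted = ("hbase_row_id15", "some_other_family")
assert lo <= unwanted <= hi        # read from storage...
assert unwanted[1] != "hbase_column_family"   # ...then discarded by WHERE
```

Every such cell is fetched from HBase only to be filtered out afterwards, which is precisely the extra data transfer the solutions below try to avoid.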
Possible solutions:
- Extend the multi-range-read interface to walk the 'SEL_ARG graph' instead of a list of ranges. This will allow capturing the exact form of conditions like (*).
- Use table condition pushdown, and analyze the condition there.
- Define more indexes, so that ranges are "dense". But what about (row_id BETWEEN $X AND $Y) AND (timestamp BETWEEN $T1 AND $T2)? No matter which index you define, the range list will not be identical to the WHERE clause.
4. Comparison of the two mappings
If we select two columns from a certain row, per-cell mapping produces "vertical" result, while per-row mapping produces "horizontal" result.
- Per-cell:

  SELECT column_name, value FROM hbase_cell
  WHERE row_id='hbase_row_id1' AND column_family='col_fam' AND
        column_name IN ('column1','column2')

  +-------------+-------+
  | column_name | value |
  +-------------+-------+
  | column1     | val1  |
  | column2     | val2  |
  +-------------+-------+
- Per-row:

  SELECT COLUMN_GET(columns, 'col_fam:column1') AS col1,
         COLUMN_GET(columns, 'col_fam:column2') AS col2
  FROM hbase_row
  WHERE row_id='hbase_row_id1'

  +------+------+
  | col1 | col2 |
  +------+------+
  | val1 | val2 |
  +------+------+
Per-cell mapping:
- allows finer control over the selection of versioned data, column families, etc.;
- produces a "better" resultset when one selects arbitrarily many columns (the client just iterates over the SQL resultset, without the need to unpack dynamic-column-format blobs).
Per-row mapping:
- is easier to use when one is selecting a pre-defined set of columns;
- allows joins that involve multiple columns (in per-cell mapping, one needs an [inefficient?] self-join to join rows of an HBase table with something else).
5. Interfacing with HBase
HBase is written in Java, and its native client API is a Java library. We need to interface with it from C++ storage engine code. We could use:
1. Thrift (this requires the HBase installation to run a Thrift server).
2. Manually implement HBase's network protocol. It seems to be some RPC protocol (Q: is it a standard RPC protocol, or something custom-developed for HBase?).
3. Use JNI to move storage engine calls into Java (I've never used JNI, though, so I'm not sure whether it is easy to use for our task).
6. Consistency, transactions, etc
- HBase has single-row transactions. Does this mean that the HBase storage engine will have MyISAM-like characteristics? E.g. if we fail in the middle of a multi-row UPDATE, there is no way to roll back.
- How should writes be made? Should UPDATEs/INSERTs use the checkAndPut operation so that they don't overwrite data that's already there?
- Q: Are writes important at all? (E.g. if the first version provided read-only access, would that be useful?) A: Yes?
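The checkAndPut behaviour mentioned above amounts to an atomic compare-and-set within one row; a minimal model of its semantics (this is not the real HBase client API, just a sketch of the contract):

```python
def check_and_put(row, check_col, expected, put_col, new_value):
    """Write put_col only if check_col currently holds the expected
    value; the real operation does this atomically within one row."""
    if row.get(check_col) != expected:
        return False           # someone else changed the data: refuse
    row[put_col] = new_value
    return True

row = {"cf:status": "draft"}
ok = check_and_put(row, "cf:status", "draft", "cf:status", "published")
stale = check_and_put(row, "cf:status", "draft", "cf:status", "archived")
print(ok, stale)   # True False
```

An UPDATE implemented this way would not silently clobber a concurrent write, but a failed check still has to be surfaced to the user somehow (retry? error?), which ties back to the MyISAM-like-semantics question above.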