HBase storage engine
Contents
Data mapping from HBase to SQL
Hbase data model and operations
1.1 HBase data model
- An HBase table consists of rows, which are identified by row key.
- Each row has an arbitrary (potentially, very large) number of columns.
- Columns are split into column groups, column groups define how the columns are stored (not reading some column groups is an optimization).
- Each (row, column) combination can have multiple versions of the data, identified by timestamp.
1.2 Hbase read operations
HBase API defines two ways to read data:
- Point lookup: get record for a given row_key.
- Point scan: read all records in [startRow, stopRow) range.
Both kinds of scans allow to specify:
- A column family we're interested in
- A particular column we're interested in
The default behavior for versioned columns is to return only the most recent version. HBase API also allows to ask for
- versions of columns that were valid at some specific timestamp value;
- all versions that were valid within a specifed [minStamp, maxStamp) interval.
- N most recent versions We'll refer to the above as [VersionedDataConds].
One can see two ways to map HBase tables to SQL tables:
2. Per-row mapping
Let each row in HBase table be mapped into a row from SQL point of view:
SELECT * FROM hbase_table; row-id column1 column2 column3 column4 ... ------ ------- ------- ------- ------- row1 data1 data2 row2 data3 row3 data4 data5
The problem is that the set of columns in a HBase table is not fixed and is potentially is very large. The solution is to put all columns into one blob column and use Dynamic Columns (http://kb.askmonty.org/en/dynamic-columns) functions to pack/extract values of individual columns:
row-id dyn_columns ------ ------------------------------ row1 {column1=data1,column2=data2} row2 {column3=data3} row3 {column1=data4,column4=data5}
2.2 Mapping definition
Table DDL could look like this:
CREATE TABLE hbase_tbl_rows ( row_id BINARY(MAX_HBASE_ROWID_LEN), columns BLOB, PRIMARY KEY (row_id) ) ENGINE=hbase_row;
(TODO: Does Hbase have MAX_HBASE_ROWID_LEN limit? What is it?)
The `columns` blob will hold values (and names) for all columns.
Access to name/value of individual columns is to be done with dynamic column functions (see http://kb.askmonty.org/en/dynamic-columns).
Functions for reading data:
COLUMN_GET(dynamic_column, column_nr as type) COLUMN_EXISTS(dynamic_column, column_nr); COLUMN_LIST(dynamic_column);
Functions for data modification:
COLUMN_ADD(dynamic_column, column_nr, value [as type], ...) COLUMN_DELETE(dynamic_column, column_nr, column_nr, ...);