HBase storage engine
Contents
Data mapping from HBase to SQL
Hbase data model and operations
1.1 HBase data model
- An HBase table consists of rows, which are identified by row key.
- Each row has an arbitrary (potentially, very large) number of columns.
- Columns are split into column groups, column groups define how the columns are stored (not reading some column groups is an optimization).
- Each (row, column) combination can have multiple versions of the data, identified by timestamp.
1.2 Hbase read operations
HBase API defines two ways to read data:
- Point lookup: get record for a given row_key.
- Point scan: read all records in [startRow, stopRow) range.
Both kinds of scans allow to specify:
- A column family we're interested in
- A particular column we're interested in
The default behavior for versioned columns is to return only the most recent version. HBase API also allows to ask for
- versions of columns that were valid at some specific timestamp value;
- all versions that were valid within a specifed [minStamp, maxStamp) interval.
- N most recent versions We'll refer to the above as [VersionedDataConds].
One can see two ways to map HBase tables to SQL tables:
2. Per-row mapping
Let each row in HBase table be mapped into a row from SQL point of view:
SELECT * FROM hbase_table; row-id column1 column2 column3 column4 ... ------ ------- ------- ------- ------- row1 data1 data2 row2 data3 row3 data4 data5
The problem is that the set of columns in a HBase table is not fixed and is potentially is very large. The solution is to put all columns into one blob column and use Dynamic Columns (http://kb.askmonty.org/en/dynamic-columns) functions to pack/extract values of individual columns:
row-id dyn_columns ------ ------------------------------ row1 {column1=data1,column2=data2} row2 {column3=data3} row3 {column1=data4,column4=data5}
2.2 Mapping definition
Table DDL could look like this:
CREATE TABLE hbase_tbl_rows ( row_id BINARY(MAX_HBASE_ROWID_LEN), columns BLOB, PRIMARY KEY (row_id) ) ENGINE=hbase_row;
(TODO: Does Hbase have MAX_HBASE_ROWID_LEN limit? What is it?)
The `columns` blob will hold values (and names) for all columns.
Access to name/value of individual columns is to be done with dynamic column functions (see http://kb.askmonty.org/en/dynamic-columns).
Functions for reading data:
COLUMN_GET(dynamic_column, column_nr as type) COLUMN_EXISTS(dynamic_column, column_nr); COLUMN_LIST(dynamic_column);
Functions for data modification:
COLUMN_ADD(dynamic_column, column_nr, value [as type], ...) COLUMN_DELETE(dynamic_column, column_nr, column_nr, ...);
2.2.1 Problems to be solved
Dynamic column functions cannot be used as-is:
- Currently, Dynamic Columns functions use integer numbers to identify columns (see 'column_nr' in parameters above). HBase identifies columns by names, and there is no way to get a full list of used column names.
- There is no way to request "all columns from within a certain column family", while that is allowed by the HBase API.
- There is no apparent way to access versioned data (see [VersionedDataConds] above). One possible solution is to have global/session @@hbase_timestamp variables, which will affect all queries.
2.3 Queries in per-row mapping
- Point-select: SELECT COLUMN_GET(hbase_tbl.columns, 'column_name' AS INTEGER) FROM hbase_tbl WHERE row_id='hbase_row_id';
- Range select:
- (the example uses BETWEEN but we will support arbitrary predicates) SELECT COLUMN_GET(hbase_tbl.columns, 'column_name' AS INTEGER) FROM hbase_tbl WHERE row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2';
- Update a column UPDATE hbase_tbl SET columns=COLUMN_ADD(columns, 'column_name', 'value');
- Add a column UPDATE hbase_tbl SET columns=COLUMN_ADD(columns, 'column_name', 'value');
- Insert a row with a column INSERT INTO hbase_tbl (row_id, columns) VALUES ('hbase_row_id', COLUMN_CREATE('column_name', 'column-value'));
Q: It's not clear how to access versioned data? Can we go without versioned data for the first milestone? (and then, use global @@hbase_timestamp for the second milestone?)
Q: It's not clear how to select "all columns from column family X".