HBase storage engine
Contents
Data mapping from HBase to SQL
Hbase data model and operations
1.1 HBase data model
- An HBase table consists of rows, which are identified by row key.
- Each row has an arbitrary (potentially, very large) number of columns.
- Columns are split into column groups, column groups define how the columns are stored (not reading some column groups is an optimization).
- Each (row, column) combination can have multiple versions of the data, identified by timestamp.
1.2 Hbase read operations
HBase API defines two ways to read data:
- Point lookup: get record for a given row_key.
- Point scan: read all records in [startRow, stopRow) range.
Both kinds of scans allow to specify:
- A column family we're interested in
- A particular column we're interested in
The default behavior for versioned columns is to return only the most recent version. HBase API also allows to ask for
- versions of columns that were valid at some specific timestamp value;
- all versions that were valid within a specifed [minStamp, maxStamp) interval.
- N most recent versions We'll refer to the above as [VersionedDataConds].
One can see two ways to map HBase tables to SQL tables:
2. Per-row mapping
Let each row in HBase table be mapped into a row from SQL point of view:
SELECT * FROM hbase_table; row-id column1 column2 column3 column4 ... ------ ------- ------- ------- ------- row1 data1 data2 row2 data3 row3 data4 data5
The problem is that the set of columns in a HBase table is not fixed and is potentially is very large. The solution is to put all columns into one blob column and use Dynamic Columns (http://kb.askmonty.org/en/dynamic-columns) functions to pack/extract values of individual columns.
row-id columns ------ ------------------------------ row1 {column1=data1,column2=data2} row2 {column3=data3} row3 {column1=data4,column4=data5}
and use Dynamic Columns