HBase storage engine

You are viewing an old version of this article. View the current version here.

Data mapping from HBase to SQL

Hbase data model and operations

1.1 HBase data model

  • An HBase table consists of rows, which are identified by row key.
  • Each row has an arbitrary (potentially, very large) number of columns.
  • Columns are split into column groups, column groups define how the columns are stored (not reading some column groups is an optimization).
  • Each (row, column) combination can have multiple versions of the data, identified by timestamp.

1.2 Hbase read operations

HBase API defines two ways to read data:

  • Point lookup: get record for a given row_key.
  • Point scan: read all records in [startRow, stopRow) range.

Both kinds of scans allow to specify:

  • A column family we're interested in
  • A particular column we're interested in

The default behavior for versioned columns is to return only the most recent version. HBase API also allows to ask for

  • versions of columns that were valid at some specific timestamp value;
  • all versions that were valid within a specifed [minStamp, maxStamp) interval.
  • N most recent versions We'll refer to the above as [VersionedDataConds].

One can see two ways to map HBase tables to SQL tables:

2. Per-row mapping

Let each row in HBase table be mapped into a row from SQL point of view:

SELECT * FROM hbase_table;

row-id column1 column2  column3  column4  ...
------ ------- -------  -------  -------  
row1    data1   data2
row2                     data3    
row3    data4                      data5

The problem is that the set of columns in a HBase table is not fixed and is potentially is very large. The solution is to put all columns into one blob column and use Dynamic Columns (http://kb.askmonty.org/en/dynamic-columns) functions to pack/extract values of individual columns:

row-id dyn_columns
------ ------------------------------
row1   {column1=data1,column2=data2}
row2   {column3=data3}
row3   {column1=data4,column4=data5}

2.2 Mapping definition

Table DDL could look like this:

CREATE TABLE hbase_tbl_rows (
  row_id BINARY(MAX_HBASE_ROWID_LEN),
  columns BLOB,
  PRIMARY KEY (row_id)
) ENGINE=hbase_row;

(TODO: Does Hbase have MAX_HBASE_ROWID_LEN limit? What is it?)

The `columns` blob will hold values (and names) for all columns.

Access to name/value of individual columns is to be done with dynamic column functions (see http://kb.askmonty.org/en/dynamic-columns).

Functions for reading data:

  COLUMN_GET(dynamic_column, column_nr as type)
  COLUMN_EXISTS(dynamic_column, column_nr);
  COLUMN_LIST(dynamic_column);

Functions for data modification:

  COLUMN_ADD(dynamic_column, column_nr,  value [as type], ...)
  COLUMN_DELETE(dynamic_column, column_nr, column_nr, ...);

2.2.1 Problems to be solved ~~~~~~~~~~~~~~~~~~~~~~~~~~ Dynamic column functions cannot be used as-is:

1. Currently, Dynamic Columns functions use integer numbers to identify columns (see 'column_nr' in parameters above). HBase identifies columns by names, and there is no way to get a full list of used column names.

2. There is no way to request "all columns from within a certain column family", while that is allowed by the HBase API.

3. There is no apparent way to access versioned data (see [VersionedDataConds] above). One possible solution is to have global/session @@hbase_timestamp variables, which will affect all queries.

2.3 Queries in per-row mapping ------------------------------

  1. Point-select: SELECT COLUMN_GET(hbase_tbl.columns, 'column_name' AS INTEGER) FROM hbase_tbl WHERE row_id='hbase_row_id';
  1. Range select:
  2. (the example uses BETWEEN but we will support arbitrary predicates) SELECT COLUMN_GET(hbase_tbl.columns, 'column_name' AS INTEGER) FROM hbase_tbl WHERE row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2';
  1. Update a column UPDATE hbase_tbl SET columns=COLUMN_ADD(columns, 'column_name', 'value');
  1. Add a column UPDATE hbase_tbl SET columns=COLUMN_ADD(columns, 'column_name', 'value');
  1. Insert a row with a column INSERT INTO hbase_tbl (row_id, columns) VALUES ('hbase_row_id', COLUMN_CREATE('column_name', 'column-value'));

Q: It's not clear how to access versioned data? Can we go without versioned data for the first milestone? (and then, use global @@hbase_timestamp for the second milestone?)

Q: It's not clear how to select "all columns from column family X".

Comments

Comments loading...
Content reproduced on this site is the property of its respective owners, and this content is not reviewed in advance by MariaDB. The views, information and opinions expressed by this content do not necessarily represent those of MariaDB or any other party.