The HBase Storage Engine


Mapping HBase data to SQL

1. The HBase data model and operations

1.1 The HBase data model

  • An HBase table is made up of rows, each identified by its row key.
  • Each row has an arbitrary, potentially very large, number of columns.
  • Columns are grouped into column families; the families determine how the columns are stored physically (one optimization is to skip reading certain families).
  • Each (row, column) combination can hold several versions of the data, identified by a timestamp.

1.2 HBase read operations

The HBase API defines two ways to read data:

  • Point lookup: selects one record for a given row_key.
  • Range scan: reads all records in the interval [start_row, end_row).

Both kinds of reads allow one to specify:

  • a column family of interest
  • a particular column

The default behavior for versioned columns is to return only the most recent version. The HBase API also makes it possible to obtain:

  • the column versions that were current at a certain timestamp;
  • all versions that were current within a certain interval [ts_start, ts_end);
  • the N most recent versions.

From here on, these conditions will be called [VersionedDataConds].
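The data model and the two read operations can be simulated with a small in-memory sketch (all names here are hypothetical; this only mirrors the semantics described above, it does not use any real HBase API):

```python
# Hypothetical in-memory model of HBase storage:
#   {row_key: {family: {qualifier: {timestamp: value}}}}
# get() models the point lookup, scan() the range scan; the keyword
# arguments model the [VersionedDataConds] described above.

class MiniHBaseTable:
    def __init__(self):
        self.rows = {}

    def put(self, row, family, qualifier, timestamp, value):
        self.rows.setdefault(row, {}).setdefault(family, {}) \
                 .setdefault(qualifier, {})[timestamp] = value

    def _versions(self, versions, ts_start=None, ts_end=None, max_versions=1):
        # Default behavior: only the most recent version.
        # [ts_start, ts_end) selects versions within a half-open interval.
        newest_first = sorted(versions.items(), reverse=True)
        picked = [(ts, v) for ts, v in newest_first
                  if (ts_start is None or ts >= ts_start)
                  and (ts_end is None or ts < ts_end)]
        return picked[:max_versions]

    def get(self, row, family=None, qualifier=None, **version_conds):
        """Point lookup: all matching cells for a given row key."""
        out = []
        for fam, quals in sorted(self.rows.get(row, {}).items()):
            if family is not None and fam != family:
                continue
            for qual, versions in sorted(quals.items()):
                if qualifier is not None and qual != qualifier:
                    continue
                for ts, val in self._versions(versions, **version_conds):
                    out.append((row, fam, qual, ts, val))
        return out

    def scan(self, start_row, end_row, **conds):
        """Range scan over the half-open interval [start_row, end_row)."""
        out = []
        for row in sorted(self.rows):
            if start_row <= row < end_row:
                out.extend(self.get(row, **conds))
        return out
```

Note how both read operations can take the same optional family/qualifier and version filters, matching the API description above.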

There are two ways to map HBase tables to SQL tables:

2. Per-row mapping

In this mapping, each row of the HBase table is mapped to one SQL row:

SELECT * FROM hbase_table;

row-id  column1  column2  column3  column4  ...
------  -------  -------  -------  -------
row1    data1    data2
row2                      data3
row3    data4                      data5

The problem is that the set of columns of the HBase table is not fixed and is potentially very large. The solution is to put all columns into one blob column and use Dynamic Columns (http://kb.askmonty.org/en/dynamic-columns) functions to pack/extract the values of individual columns:

row-id dyn_columns
------ ------------------------------
row1   {column1=data1,column2=data2}
row2   {column3=data3}
row3   {column1=data4,column4=data5}
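The pack/extract round trip can be sketched with toy stand-ins for the dynamic-column functions (a simplified model: real MariaDB dynamic columns pack name/value pairs into a compact binary blob, while this sketch uses JSON purely for readability):

```python
# Toy stand-ins for the dynamic-column functions described below:
# pack an arbitrary set of named columns into one blob value and
# extract/update/delete individual columns from it.

import json

def column_create(**cols):
    return json.dumps(cols).encode()

def column_add(blob, **cols):       # adds new columns or updates existing ones
    d = json.loads(blob)
    d.update(cols)
    return json.dumps(d).encode()

def column_get(blob, name):         # extract one column's value
    return json.loads(blob).get(name)

def column_exists(blob, name):
    return name in json.loads(blob)

def column_delete(blob, *names):
    d = json.loads(blob)
    for n in names:
        d.pop(n, None)
    return json.dumps(d).encode()
```

Row 'row1' from the table above would then be stored as `column_create(column1='data1', column2='data2')`.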

2.2 Mapping definition

Table DDL could look like this:

CREATE TABLE hbase_tbl_rows (
  row_id binary(MAX_HBASE_ROWID_LEN),
  columns BLOB,
  PRIMARY KEY (row_id)
) ENGINE=hbase_row;

(TODO: Does HBase have a MAX_HBASE_ROWID_LEN limit? What is it?)

The `columns` blob will hold values (and names) for all columns.

Access to name/value of individual columns is to be done with dynamic column functions (see http://kb.askmonty.org/en/dynamic-columns).

Functions for reading data:

  COLUMN_GET(dynamic_column, column_nr as type)
  COLUMN_EXISTS(dynamic_column, column_nr);

Functions for data modification:

  COLUMN_ADD(dynamic_column, column_nr,  value [as type], ...)
  COLUMN_DELETE(dynamic_column, column_nr, column_nr, ...);

2.2.1 Required improvements in Dynamic Columns

Dynamic column functions cannot be used as-is:

  • HBase columns have string names, while Dynamic Columns are identified by numbers (see the column_nr parameter of the functions above). The set of column names in HBase is potentially very large and there is no way to get a list of all names, so we cannot solve this with an enum-style mapping; we will need real support for string names.
  • HBase has column families, Dynamic Columns do not. A column family is not just a ':' in the column name: for example, the HBase API allows requesting "all columns from within a certain column family".
  • HBase supports versioned data, Dynamic Columns do not. A possible simple solution is a global/session @@hbase_timestamp variable which would globally specify the required data version.
  • (See also note below about efficient execution)

2.3 Queries in per-row mapping

# Point-select:
SELECT COLUMN_GET(hbase_tbl.columns, 'column_name' AS INTEGER)
FROM hbase_tbl
WHERE row_id='hbase_row_id';

#  Range select:
#   (the example uses BETWEEN but we will support arbitrary predicates)
SELECT COLUMN_GET(hbase_tbl.columns, 'column_name' AS INTEGER)
FROM hbase_tbl
WHERE row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2';

# Update a column
UPDATE hbase_tbl SET columns=COLUMN_ADD(columns, 'column_name', 'value')
WHERE row_id='hbase_row_id';

# Add a column (COLUMN_ADD also adds the column if it is not present yet)
UPDATE hbase_tbl SET columns=COLUMN_ADD(columns, 'column_name', 'value')
WHERE row_id='hbase_row_id';

# Insert a row with a column
INSERT INTO hbase_tbl (row_id, columns) VALUES 
  ('hbase_row_id', COLUMN_CREATE('column_name', 'column-value'));

Q: It's not clear how to access versioned data. Can we go without versioned 
   data for the first milestone, and then use a global @@hbase_timestamp for 
   the second milestone?

Q: It's not clear how to select "all columns from column family X".

2.4 Efficient execution for per-row mapping

The table declares:

  PRIMARY KEY (row_id)

this will allow the range/ref optimizer to construct point-lookups and range scans for row_id.

Q: will we need joins, i.e. do I need to implement Multi-Range-Read and support Batched Key Access right away?

Current MariaDB works with Dynamic Columns with this scenario:

  1. When the record is read, the entire blob is read into memory
  2. Dynamic Column functions operate on the in-memory data (read and update some particular columns in it)
  3. [If this is an UPDATE] the entire blob is written back to the table

If we use this approach with HBase, we will end up reading lots of unneeded columns.

Solution #1: on-demand reads

  • When table record is read, don't read any columns, return a blob handle.
  • Dynamic Column functions will use the handle to read particular columns. The column is read from HBase only when its value is requested.

This scheme ensures there are no redundant data reads, at the cost of making extra mysqld<->HBase roundtrips (which are likely to be expensive).
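A minimal sketch of this on-demand scheme (names hypothetical; fetch_column stands in for one mysqld<->HBase roundtrip, and the counter makes the roundtrip cost visible):

```python
# Solution #1 sketch: the storage engine hands back a lazy handle instead of
# a packed blob, and each first access to a column triggers a separate fetch.

class LazyBlobHandle:
    def __init__(self, row_id, fetch_column):
        self.row_id = row_id
        self.fetch_column = fetch_column   # callable: (row_id, name) -> value
        self.cache = {}
        self.roundtrips = 0

    def column_get(self, name):
        if name not in self.cache:         # read from HBase only on first request
            self.cache[name] = self.fetch_column(self.row_id, name)
            self.roundtrips += 1
        return self.cache[name]
```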

Solution #2: List of reads

  • Walk through the query and find all references to hbase_table.columns.
  • Collect the names of columns that are read, and retrieve only these columns.

This may cause redundant data reads, for example for

  SELECT COLUMN_GET(hbase_tbl.columns, 'column1' AS INTEGER) 
  FROM hbase_tbl
  WHERE row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2' AND 
        COLUMN_GET(hbase_tbl.columns, 'column2' AS INTEGER)=1

column1 will be read for rows which have column2!=1. This still seems to be better than making extra roundtrips.

There is a question of what should be done when the query has references like

  COLUMN_GET(hbase_tbl.columns, {non-const-item} AS ...) 

where it is not possible to tell in advance which columns must be read. Possible approaches are:

  • retrieve all columns
  • fetch columns on demand
  • stop the query with an error.
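How Solution #2 could collect column names is sketched below (the tuple-based expression tree is invented for illustration; a real implementation would walk MariaDB's Item tree):

```python
# Solution #2 sketch: walk an expression tree and collect the column names
# passed to COLUMN_GET. A non-constant name argument means the column set is
# unknown in advance, which forces one of the three fallbacks listed above;
# this sketch signals that case by returning None.

CONST, NONCONST, CALL = 'const', 'nonconst', 'call'

def collect_column_reads(node, acc=None):
    """node: (CALL, func_name, [args]) | (CONST, value) | (NONCONST,).
    Returns the set of statically-known column names read by COLUMN_GET,
    or None if any COLUMN_GET takes a non-constant name."""
    if acc is None:
        acc = set()
    if node[0] == CALL:
        _, func, args = node
        if func == 'COLUMN_GET':
            name_arg = args[1]              # args: (blob_ref, column_name, ...)
            if name_arg[0] != CONST:
                return None                 # can't tell which column is read
            acc.add(name_arg[1])
        for a in args:
            if a[0] == CALL and collect_column_reads(a, acc) is None:
                return None
    return acc
```

Run on a model of the WHERE clause above, this would yield {'column1', 'column2'} as the set of columns to retrieve.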

3. Per-cell mapping

The HBase shell has a 'scan' command; here's an example of its output:

hbase(main):007:0> scan 'testtable'
  myrow-1 column=colfam1:q1, timestamp=1297345476469, value=value-1
  myrow-2 column=colfam1:q2, timestamp=1297345495663, value=value-2
  myrow-2 column=colfam1:q3, timestamp=1297345508999, value=value-3

Here, one HBase row produces multiple rows in the query output. Each output row represents one (row_id, column) combination, so rows with multiple columns (and multiple revisions of column data) can be easily represented.
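The fan-out from one nested HBase row into per-cell rows can be sketched as follows (the nested-dict layout is the same hypothetical model as before, not a real HBase structure):

```python
# Per-cell mapping sketch: one nested HBase row fans out into one SQL row
# per (row_id, column_family, column_name, timestamp) combination,
# mirroring the 'scan' output above.

def to_cells(row_id, row):
    """row: {family: {qualifier: {timestamp: value}}} -> list of cell tuples."""
    return [(row_id, fam, qual, ts, val)
            for fam, quals in sorted(row.items())
            for qual, versions in sorted(quals.items())
            for ts, val in sorted(versions.items())]
```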

3.1 Mapping definition

Mapping could be defined as follows:

CREATE TABLE hbase_tbl_cells (
  row_id binary(MAX_HBASE_ROWID_LEN),
  column_family binary(MAX_HBASE_COLFAM_LEN),
  column_name binary(MAX_HBASE_NAME_LEN),
  timestamp TIMESTAMP,
  value BLOB,
  PRIMARY KEY (row_id, column_family, column_name, timestamp)
) ENGINE=hbase_cell;

There is no need for dynamic columns in this mapping.

  • NOTE: It is nice to have the SQL table DDL independent of the contents of the backend HBase table. This saves us from having to synchronize table DDLs between HBase and MySQL (NDB Cluster had to do this, and they ended up implementing a very complex system for it).

3.2 Queries in per-cell mapping

# Point-select:
SELECT value 
FROM hbase_cell
WHERE row_id='hbase_row_id' AND 
      column_family='hbase_column_family' AND column_name='hbase_column'

#  Range select:
#   (the example uses BETWEEN but we will support arbitrary predicates)
SELECT value 
FROM hbase_cell
WHERE row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2' AND 
      column_family='hbase_column_family' AND column_name='hbase_column'

# Update a column
UPDATE hbase_cell SET value='value' 
WHERE row_id='hbase_row' AND 
      column_family='col_family' AND column_name='col_name'

# Add a column (this will add a row if one doesn't exist yet)
INSERT INTO hbase_cell (row_id, column_family, column_name, value) VALUES
  ('hbase_row', 'col_family', 'col_name', 'value');

Note that

  • accessing versioned data is easy: one can read a particular version, the versions within a date range, etc.;
  • it is also easy to select all columns from a certain column family.

3.3 Efficient execution in per-cell mapping

The table is defined as having a

  PRIMARY KEY (row_id, column_family, column_name, timestamp)

which allows efficient point lookups for (row_id, column_family, column_name) tuples.

The order of columns in the index also makes it possible to read all columns in a column family.

3.3.1 Scanning certain columns within a range

The HBase API allows scanning a range of rows while retrieving only a certain column name or certain column families. In SQL, this is written as:

SELECT value
FROM hbase_cell
WHERE row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2' AND
      column_family='hbase_column_family'                      (*)

If we feed this into the range optimizer, it will produce a range:

  ('hbase_row_id1', 'hbase_column_family') <= (row_id, column_family) <=
  ('hbase_row_id2', 'hbase_column_family')

which includes all column families for records that satisfy

  'hbase_row_id1' < row_id < 'hbase_row_id2'

This causes extra data to be read.
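The over-wide range can be demonstrated with plain lexicographic tuple comparison (the row ids are the placeholder names from the query above):

```python
# The range the optimizer builds compares (row_id, column_family) tuples
# lexicographically, so for any row strictly between the two endpoint row
# ids the column_family component never constrains anything.

lo = ('hbase_row_id1', 'hbase_column_family')
hi = ('hbase_row_id2', 'hbase_column_family')

def in_range(cell):
    return lo <= cell <= hi

# An interior row matches for *any* column family -> extra data is read:
assert in_range(('hbase_row_id15', 'some_other_family'))
# Only at the endpoint rows does the family actually filter:
assert not in_range(('hbase_row_id1', 'aaa_family'))
```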

Possible solutions:

  • Extend the multi-range-read interface to walk the 'SEL_ARG graph' instead of a list of ranges. This would allow capturing the exact form of conditions like (*).
  • Use table condition pushdown and analyze the condition there.
  • Define more indexes, so that ranges are "dense". But what about (row_id BETWEEN $X AND $Y) AND (timestamp BETWEEN $T1 AND $T2)? No matter which index is defined, the range list will not be identical to the WHERE clause.

4. Comparison of the two mappings

If we select two columns from a certain row, per-cell mapping produces "vertical" result, while per-row mapping produces "horizontal" result.

# Per-cell:
SELECT column_name, value 
FROM hbase_cell
WHERE row_id='hbase_row_id1' AND 
      column_family='col_fam' AND column_name IN ('column1','column2')

| column_name | value |
| column1     | val1  |
| column2     | val2  |

# Per-row:
SELECT
  COLUMN_GET(columns, 'col_fam:column1') AS col1,
  COLUMN_GET(columns, 'col_fam:column2') AS col2
FROM hbase_row

| col1 | col2 |
| val1 | val2 |

Per-cell mapping:

  • Allows a finer control over selection of versioned data, column families, etc.
  • Produces a "better" result set when one selects arbitrarily many columns (the client just iterates over the SQL result set, without needing to unpack dynamic-column-format blobs)

Per-row mapping:

  • is easier to use when one is selecting a pre-defined set of columns
  • allows joins that involve multiple columns (in per-cell mapping, one needs an [inefficient?] self-join to join rows in an HBase table with something else).

5. Interfacing with HBase

HBase is written in Java, and its native client API is a Java library. We need to interface with it from C++ storage engine code. Possible options are:

5.1 Use Thrift

This requires the HBase installation to run a Thrift server.

5.2 Re-implement HBase's network protocol

  • It seems to be a custom-made RPC protocol.
  • There is an independent re-implementation at https://github.com/stumbleupon/asynchbase. It is 10K lines of Java code, which gives an idea of the complexity of HBase's protocol.
    • It seems to support only a subset of features; e.g., I was unable to find any mention of pushed-down condition support.
    • Look in HBaseRpc.java for "Unofficial Hadoop / HBase RPC protocol documentation".

5.3 Use JNI+HBase client protocol

  • not sure how complex this is
  • Mark has mentioned this has an unacceptable overhead?

6. Consistency, transactions, etc

  • HBase has single-record transactions. Does this mean that the HBase storage engine will have MyISAM-like characteristics? E.g. if we fail in the middle of a multi-row UPDATE, there is no way to roll back.
  • How should writes be made? Should UPDATEs/INSERTs use checkAndPut operation so that they don't overwrite data that's already there?
  • Q: Are the writes important at all? (e.g. if we've had the first version with provide read-only access, would that be useful?) A: Yes?
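The checkAndPut semantics referred to above can be sketched as a compare-and-set on a single cell (hypothetical helper; real HBase guarantees this atomically per row, and this sketch models only the visible behavior):

```python
# checkAndPut sketch: the write is applied only if the current value matches
# the expected one, which lets an UPDATE avoid silently overwriting a
# concurrent write. Single-cell semantics only, no multi-row atomicity.

def check_and_put(store, key, expected, new_value):
    """store: dict mapping cell key -> value. Returns True iff the put was applied."""
    if store.get(key) == expected:
        store[key] = new_value
        return True
    return False
```

An INSERT that must not clobber existing data would pass expected=None (i.e. "apply only if the cell does not exist yet").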

