March 14, 2014

MariaDB 10 CONNECT Engine - A Better Way to Access External Data

Not long ago on a blog post Dean Ellis talked about the new MariaDB 10.0, describing some of the 'little things' it brings to the community. But I believe that MariaDB 10.0 also brings a lot of 'big things' to the table. Sheeri Cabral and myself started covering the CONNECT engine in OurSQL podcast's episode 174. In this article, I'll go through CONNECT engine indexing, which we didn't cover on the podcast.

Accessing The CSV File

Accessing a CSV file, as we explained on the podcast, is pretty easy. I'll cover some basics here to make it easier to follow the 2nd half of the article. I started by exporting a CSV file from an application to track expenses. The file includes a line with the column names and then a list of the expenses separated by commas (,) and the strings are enclosed in double quotes ("). Here are a few sample lines:

Timestamp,Merchant,Amount,MCC,Category,Tag,Comment,Reimbursable,"Original Currency","Original Amount"
2014-02-27,"Hotel la Airport 73los Angeles",492.22,0,Uncategorized,,,yes,USD,492.22
2014-02-26,"Hotel lax Andiamo 7los Angeles",209.31,0,Uncategorized,,,yes,USD,209.31
...
2014-02-18,Hotels,542.18,0,Uncategorized,,,yes,USD,542.18
...

For the file to be used by the CONNECT engine, the mysqld process has to have full access rights. So I copied it to the MySQL data directory and made sure it was owned by the user and group 'mysql'. I also removed the top line with the column names in order to allow for some manipulation later. The CREATE TABLE statement to access some of the columns looks like:

CREATE TABLE  test_csv (
  ts date NOT NULL date_format='YYYY-MM-DD' flag=1,
  merchant varchar(128) NOT NULL flag=2,
  amount decimal(10,2) NOT NULL flag=3,
  category varchar(65) NOT NULL flag=5
) ENGINE=CONNECT DEFAULT CHARSET=latin1 table_type=csv file_name='/var/lib/mysql/Export.csv' header=0 quoted=1;

You may check the full syntax and option definitions in the CSV and FMT Table Types section in the MariaDB Knowledge Base. Finally, a quick check using SELECT * FROM test_csv will reveal that all the data is accessible. Now we can execute simple queries like:

select * from test_csv_2 where category = 'Meals & Tips';

Consider its execution plan:

+------+-------------+------------+------+---------------+------+---------+------+------+-------------+
| id   | select_type | table      | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+------+-------------+------------+------+---------------+------+---------+------+------+-------------+
|    1 | SIMPLE      | test_csv_2 | ALL  | NULL          | NULL | NULL    | NULL | 1271 | Using where |
+------+-------------+------------+------+---------------+------+---------+------+------+-------------+

MariaDB will do a full table scan to find the corresponding values. Since I'm using a small file, this acceptable, however it might take a long time to scan a big file sequentially to only select a few rows.

Using Indexes with CONNECT

This is where the CONNECT engine starts showing its real potential beyond connecting to external data sources. One of the real interesting features is the possibility of managing its own indexes. Indexes for text files, in this case CSV, work best when the data is ordered or clustered by the column in the index. If the file is not in the desired order, we can fit it on Linux using 'sort' For this example it would look like this (sort comma separated file named Export.csv using the 5th column and write it to Export_sort.csv):

sort -t ',' -k 5 Export.csv -o Export_sort.csv

Now we can use a slightly different CREATE statement:

CREATE TABLE  test_csv_sorted (
  ts date NOT NULL date_format='YYYY-MM-DD' flag=1,
  merchant varchar(128) NOT NULL flag=2,
  amount decimal(10,2) NOT NULL flag=3,
  category varchar(65) NOT NULL flag=5,
  KEY ix_cat(category)
) ENGINE=CONNECT DEFAULT CHARSET=latin1 table_type=csv file_name='/var/lib/mysql/Export_sort.csv' header=0 quoted=1;

The execution plan now looks more promising:

explain select * from test_csv_sorted where category = 'Meals & Tips';
+------+-------------+-----------------+------+---------------+--------+---------+-------+------+-------------+
| id   | select_type | table           | type | possible_keys | key    | key_len | ref   | rows | Extra       |
+------+-------------+-----------------+------+---------------+--------+---------+-------+------+-------------+
|    1 | SIMPLE      | test_csv_sorted | ref  | ix_cat        | ix_cat | 67      | const |    5 | Using where |
+------+-------------+-----------------+------+---------------+--------+---------+-------+------+-------------+

Conclusion

In MySQL it is possible to import CSV files using commands like LOAD DATA INFILE into regular tables or use the CSV storage engine directly. With big text files, the first approach can take a long time, the 2nd one can be very simplistic. In those cases, using the CONNECT engine with an external index might be the quickest way to proceed. External indexing also works with other file types, download the latest MariaDB 10.0 binaries and run your own tests and give us your feedback.