Hands-on: MariaDB ColumnStore Spark Connector
In February with the release of MariaDB ColumnStore 1.1.3, we introduced a new Apache Spark connector (Beta) that exports data from Spark into MariaDB ColumnStore. The Spark connector is available as part of our MariaDB AX analytics solution and complements our suite of rapid-paced data ingestion tools such as a Kafka data adapter and MaxScale CDC data adapter. The connector empowers users to directly export machine learning results stored in Spark DataFrames to ColumnStore for high performance analytics. Internally, it utilizes ColumnStore’s Bulk Data Adapters to inject data directly into MariaDB ColumnStore’s WriteEngine.
In this blog, we’ll explain how to export the results of a simple machine learning pipeline on the classification example of the well known mnist handwritten digits dataset. Feel free to start your own copy of our lab environment by typing:
git clone https://github.com/mariadb-corporation/mariadb-columnstore-docker.git cd mariadb-columnstore-docker/columnstore_jupyter docker-compose up -d
This will spin up two docker containers, one with the latest version of MariaDB ColumnStore and the other with a pre-configured version of Jupyter, a web application to interactively execute Python and Scala code snippets.
Now you can access the demo notebook on port 8888 with the password “mariadb”.
In the first five blocks the whole machine learning takes place. For simplicity, a pre-trained random forest model is used to predict the labels of 10,000 handwritten digits of mnist’s test dataset.
The resulting DataFrame contains five columns, from whom we are interested to export three into ColumnStore. These are:
- The original label to validate the prediction
- The normalized probability vector of doubles representing the likelihood of being a certain digit
- The final prediction made by the random forest model
The probability vector indicates the likelihood of the vector index position value matching the input image. This is converted into 10 individual values so that these can be mapped to 10 columns, 1 for each value. The final DataFrame consists of twelve columns: one for the label, one for the final prediction, and ten for each digits’ probability.
This DataFrame can now be exported into ColumnStore by either Spark’s native (JDBC) write function, MariaDB’s Bulk Data Adapters, or MariaDB’s Spark connector. All options are outlined in the demo for easy verification. The MariaDB Spark connector is the easiest to use and provides the best performance.
You’ll get the schema of the DataFrame to export by calling its printSchema() function and can use standard SQL to create the table accordingly. Don’t forget to specify to use the ColumnStore engine, for example:
CREATE TABLE IF NOT EXISTS bulk_api_1 (label double, prediction double, prob_0 double, prob_1 double, prob_2 double, prob_3 double, prob_4 double, prob_5 double, prob_6 double, prob_7 double, prob_8 double, prob_9 double) ENGINE=columnstore;
The table schema needs to match the structure of the DataFrame otherwise the export will fail.
After the table is created it’s just a two-liner:
import columnStoreExporter columnStoreExporter.export("database","table",dataFrame)
Now you could use your favourite SQL tool to analyse and visualize your results. There is an excellent blog entry on how to connect MariaDB with Tableau that I highly recommend for further reading.
Feel free to check out the additional notebooks in the lab environment. Next to Python, they also show how to use the Spark connector with Scala.
Last but not least, there is further information on how to set-up the Spark connector in a production environment in our knowledge base. You can also catch our recent webinar that explains how our Spark and Kafka data connectors streamline and simplify the process of getting near real-time data for analysis. Download MariaDB AX, our high performance analytics solution to get started.
As always, we are thrilled to hear your feedback and suggestions through the usual channels.