MariaDB ColumnStore with Spark
Introduction
Apache Spark (http://spark.apache.org/) is a popular open source data processing engine. It can be integrated with MariaDB ColumnStore using the Spark SQL feature.
Configuration
The following steps outline how to install and configure the MariaDB Java Connector so that it is available to the Spark runtime:
- The latest version of the MariaDB Java Connector should be downloaded from https://mariadb.com/downloads/connector and copied to the master node, e.g. under /usr/share/java.
- The configuration file /usr/local/spark/conf/spark-defaults.conf should be created or updated to add the JDBC driver jar to the driver and executor classpaths:
spark.driver.extraClassPath /usr/share/java/mariadb-java-client-1.5.7.jar
spark.executor.extraClassPath /usr/share/java/mariadb-java-client-1.5.7.jar
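To confirm the connector is visible to the Spark runtime before running any queries, a quick check can be run from pyspark. This is a minimal sketch only; it uses the internal py4j gateway (sc._jvm), which is not a public API:

from pyspark import SparkContext

sc = SparkContext("local", "Connector classpath check")
# Raises a py4j error if the MariaDB driver jar is not on the classpath
sc._jvm.java.lang.Class.forName("org.mariadb.jdbc.Driver")
print("MariaDB JDBC driver found")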
Usage
Currently Spark does not correctly recognize MariaDB-specific JDBC connection strings, so the jdbc:mysql syntax must be used. The following shows a simple pyspark script that queries a table on the ColumnStore UM server columnstore_1 into a Spark DataFrame:
from pyspark import SparkContext
from pyspark.sql import DataFrameReader, SQLContext

url = 'jdbc:mysql://columnstore_1:3306/test'
properties = {'user': 'root', 'driver': 'org.mariadb.jdbc.Driver'}

sc = SparkContext("local", "ColumnStore Simple Query Demo")
sqlContext = SQLContext(sc)
df = DataFrameReader(sqlContext).jdbc(url='%s' % url, table='results', properties=properties)
df.show()
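For larger tables, the read can be parallelized across executors by passing partitioning parameters to jdbc(). The following is a sketch only, assuming the results table has an integer column named id with values ranging roughly from 0 to 1,000,000:

df = DataFrameReader(sqlContext).jdbc(
    url='%s' % url,
    table='results',
    column='id',          # assumed integer partitioning column
    lowerBound=0,         # assumed minimum value of id
    upperBound=1000000,   # assumed maximum value of id
    numPartitions=4,      # four concurrent JDBC connections
    properties=properties)

Each partition issues its own JDBC query over a slice of the id range, so the bounds should cover the actual data to avoid skewed partitions.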
Spark SQL currently offers very limited push down capabilities, so to take advantage of ColumnStore's ability to perform an efficient group by, an inline table (a subquery passed as the table argument) must be used, for example:
df = DataFrameReader(sqlContext).jdbc(
    url='%s' % url,
    table='(select year, sum(closed_roll_assess_land_value) sum_land_value from property_tax group by year) pt',
    properties=properties)
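Once the aggregate has been pushed down to ColumnStore, the resulting DataFrame can be registered and queried further with Spark SQL. A brief sketch continuing the example above (the temporary table name is arbitrary):

# Register the pushed-down aggregate for further Spark SQL queries
df.registerTempTable('land_value_by_year')
sqlContext.sql('select year, sum_land_value from land_value_by_year '
               'order by sum_land_value desc').show()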