Real-time Data Streaming with MariaDB AX
When we began working on big data and distributed columnar technology with MariaDB ColumnStore, our focus was on helping customers get the most value out of their data assets. Time to insight and time to action are competitive differentiators for our customers. To achieve faster time to insight and time to action, it’s critical that:
Organizations make data available for analysis as soon as it arrives, and
Applications stream data from data sources to the analytics platform seamlessly.
With this in mind, the latest MariaDB AX analytics solution, which introduces MariaDB ColumnStore 1.1.2 and MariaDB ColumnStore Data Adapters, enables easy integration with data from various sources such as web/mobile services, IoT, sensors, social networks, device logs and machine learning model output.
In this blog, we explore the two new data streaming capabilities of MariaDB AX and how they help users.
Bulk Data Adapters
Previously, data ingestion into MariaDB ColumnStore was done through high-speed bulk loading with cpimport or LOAD DATA INFILE for batch load operations. However, these approaches required manual operational processes and introduced delays: CSV files had to be generated from the data sources and then moved to a UM or PM node. The new bulk data adapter API introduced in MariaDB ColumnStore 1.1, available as an SDK, enables near real-time data analytics by streaming data programmatically from ETL and data source applications directly into MariaDB ColumnStore. The API is available as a C++ SDK, along with Python and Java bindings.
On startup, the API reads the MariaDB ColumnStore configuration file (ColumnStore.xml) to locate the distributed PM nodes. The application can then perform per-table writes by passing input data to the API calls as data structures. The API lets the application stream data row by row, buffering a configurable number of rows (100,000 by default) before flushing them from the application to the network. When the application commits, the data is written on the PM nodes. The application can, however, commit rows at any time; it does not have to wait for the 100,000-row buffer to fill. Because the API streams data over the network, applications using it can run outside the MariaDB ColumnStore UM and PM nodes. Applications can therefore run close to the data source and push data to MariaDB ColumnStore as it is generated, resulting in real-time streaming.
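As a rough illustration of the write pattern described above, here is a minimal stand-in class. This is not the real SDK: the class and method names are invented for illustration, and a plain list stands in for the network. It only shows the buffer-then-flush behavior and the fact that a commit may happen before the buffer fills.

```python
# Conceptual sketch (NOT the actual bulk data adapter SDK): rows are buffered
# client-side and flushed once a configurable threshold is reached, while
# commit() may be called at any time to flush whatever remains buffered.

class BulkInsertSketch:
    def __init__(self, flush_threshold=100_000):
        self.flush_threshold = flush_threshold  # 100,000 rows by default
        self.buffer = []
        self.flushed_batches = []  # stands in for data sent over the network

    def write_row(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        if self.buffer:
            self.flushed_batches.append(list(self.buffer))
            self.buffer.clear()

    def commit(self):
        # The application may commit at any time; it need not wait for the
        # buffer to fill. A commit flushes the remaining rows so they are
        # written on the PM nodes.
        self._flush()


bulk = BulkInsertSketch(flush_threshold=3)
for i in range(5):
    bulk.write_row((i, f"event-{i}"))
bulk.commit()
print(len(bulk.flushed_batches))  # 2: one auto-flush of 3 rows, one commit flush of 2
```

In the real API the buffered rows go over the network to the PM nodes rather than into a list, but the control flow an application follows (write rows, commit when a logical unit is complete) is the same.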
Users employ the bulk data API for use cases such as publishing data from Python machine learning models, ingesting data from collection points across IoT, computing and telecommunication networks, streaming data from ad engines, and feeding data from transactional databases or queuing systems such as Kafka. A detailed usage guide is available on our KnowledgeBase page. Source code examples of API usage can be found here.
Streaming Data Adapters
MaxScale CDC Data Adapter
Many MariaDB users who run both the MariaDB TX (OLTP) and MariaDB AX (analytics) solutions feed data from InnoDB tables in MariaDB TX into MariaDB AX. While the InnoDB tables in MariaDB TX serve transactional purposes such as daily financial transactions, on the MariaDB AX side users are interested in data from certain tables for analytics. The natural inclination is to replicate data from MariaDB Server in TX to MariaDB ColumnStore in AX. However, MariaDB ColumnStore is not optimized to act as a replication slave: replication executes individual SQL insert, update and delete statements, whereas MariaDB ColumnStore is optimized for bulk writes rather than row-based DML. MariaDB TX includes MariaDB MaxScale, which can stream change data events to external targets. We marry this with the new bulk data adapter API of MariaDB AX to provide continuous data streaming from MariaDB TX to MariaDB AX. This out-of-the-box integration of MaxScale CDC streams into MariaDB ColumnStore is available as the MaxScale CDC Data Adapter. No development is required to use it.
The MaxScale CDC Data Adapter registers with MariaDB MaxScale as a CDC client using the MaxScale CDC Connector API, receiving change data records from MariaDB MaxScale (converted from binlog events received from the master on MariaDB TX) in JSON format. It then uses the MariaDB ColumnStore bulk data adapter API to convert the JSON data into API calls and stream it to a MariaDB PM node. The adapter can either insert the events using the same schema as the source database table, or insert each event with metadata as well as table data. The event metadata includes the event timestamp, the GTID, the event sequence and the event type (insert, update or delete).
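To make the "metadata plus table data" option concrete, here is a small sketch of splitting one CDC-style JSON event. The field names are assumptions based on the metadata the adapter is described as capturing (timestamp, GTID, sequence, event type), not the exact MaxScale wire format, and the table columns are invented for illustration.

```python
import json

# Hypothetical CDC event in JSON form; field names are assumed, not the
# actual MaxScale CDC wire format.
raw_event = """{
    "timestamp": 1516370000,
    "gtid": "0-1000-45",
    "sequence": 45,
    "event_type": "insert",
    "id": 7,
    "amount": 19.99
}"""

event = json.loads(raw_event)

# Separate event metadata from the row's column data, as the adapter does
# when inserting each event with metadata alongside the table data.
metadata_keys = {"timestamp", "gtid", "sequence", "event_type"}
metadata = {k: v for k, v in event.items() if k in metadata_keys}
row = {k: v for k, v in event.items() if k not in metadata_keys}

print(metadata["event_type"], row)  # insert {'id': 7, 'amount': 19.99}
```

With the metadata kept, the ColumnStore table records not just the current row values but the full change history (every insert, update and delete as its own row), which is what makes the events useful for analytics.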
The usage guide for the adapter can be found here. Using the MaxScale CDC Data Adapter, you can now stream data directly from your OLTP MariaDB Servers to your analytics MariaDB ColumnStore servers.
Kafka Data Adapter
The Kafka data adapter streams all messages published to Apache Kafka topics to MariaDB AX automatically and continuously, enabling data from many sources to be streamed and collected for analysis without complex code. The Kafka adapter is built using librdkafka and the MariaDB ColumnStore bulk data adapter API.
So far, we have tested the Kafka data adapter with MariaDB MaxScale CDC events as the source of events published to the Kafka broker. Going forward, we will also support generic key-value events. The ability to stream data from Kafka opens the data adapter up to a variety of data sources such as websites, advertising engines, social network feeds, system logs and IoT events.
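The adapter's job can be sketched as a consume-then-stream loop. This is a conceptual outline only: the real adapter uses librdkafka and the bulk data adapter API, whereas here a plain list stands in for the Kafka topic and another for the bulk insert, so the control flow can be shown without a running broker.

```python
import json

# Conceptual sketch of the Kafka data adapter's loop: poll messages from a
# topic, turn each into a row, and stream the rows via a bulk insert.
# Stand-ins are used throughout -- no Kafka client or ColumnStore SDK here.

topic_messages = [  # stand-in for messages polled from a Kafka topic
    json.dumps({"id": 1, "source": "web"}),
    json.dumps({"id": 2, "source": "iot"}),
]

written_rows = []  # stand-in for rows streamed to a PM node


def consume_and_stream(messages, sink):
    for payload in messages:
        event = json.loads(payload)
        # one write per message, analogous to one writeRow() call
        sink.append((event["id"], event["source"]))
    # a real adapter would commit here, or on a batch-size/time threshold
    return len(sink)


print(consume_and_stream(topic_messages, written_rows))  # 2
```

In practice the batching decision (commit per message, per batch, or per time window) is the main tuning knob, since ColumnStore favors larger bulk writes over many small ones.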
The bulk data adapters allow users to build their own custom ETL applications, and the streaming data adapters provide out-of-the-box capabilities to continuously stream data from MariaDB TX and various other sources without any coding. Try the new data adapters today and let us know your feedback.
Learn more about MariaDB AX, our modern data warehousing solution for large scale advanced analytics.