January 10, 2018

MariaDB ColumnStore Distributed User Defined Aggregate Functions

MariaDB ColumnStore 1.1 introduces the Distributed User Defined Aggregate Functions (UDAF) C++ API. MariaDB Server has supported UDAF (a C API) for a while, but now we have extended it to the ColumnStore Engine. This new feature allows anyone to create aggregate functions of arbitrary complexity for distributed execution in the ColumnStore Engine. These functions can also be used as Analytic (Window) functions just like any built in aggregate. You should have a working understanding of C++ to use this API.

The UDAF API supports an arbitrary number of parameters to be defined for your Aggregate function. Version 1.1 of ColumnStore supports one parameter and will be enhanced to support any number of parameters in the next version.

For example, defining MEDIAN(<table.row>) is allowed, but PERCENTILE_CONT(<table.row>, <percent>) will be enabled in the next version. However, there are some workarounds for the short term, such as using a distributed MEDIAN which provides equivalent functionality to PERCENTILE_CONT(0.5). In this his example, MEDIAN can be extended to support other percentile values. If you want to support a 90th percentile function, then a PERCENTILE90(<table.row>) function can be implemented currently.

So, how does this work? MariaDB ColumnStore has had a distributed aggregate function capability since the start. With the 1.1 release, we’ve added new functionality that allows the engine to recognize when a UDAF has been called and perform callbacks at each level of execution. The code you write mustdefine the data structures and the work to be performed on those structures.

It’s relatively straight forward. There are two classes you need to become familiar with -- mcsv1Context and mcsv1_UDAF. mcsv1Context are the classes that holds the state of your function during execution. mcsv1_UDAF is the class you extend to write your code.

To write a UDAF, this requires extending the class mcsv1_UDAF and implementing a few functions. In many cases, these implementations are straightforward, but it really depends on what you’re trying to do. For use as an aggregate, some functions are called on the UM, some on the PM. For example, the UDAF ssq.cpp is a simple implementation that shows the basic implementation of each function.

UDAF.png

For use as Analytic functions, all calls are on the UM.

UDAnF.png

​In addition, you must write the same exact function in the MariaDB UDAF C API. This is required and can be a simple stub or a complete implementation, depending on whether you want your function to work as an aggregate for other engines. But, it is needed to tell the parser that your function exists.

Since memory must be allocated at each node for the work being performed there, MariaDB ColumnStore handles when and where memory is allocated. You may choose to provide a method to do that allocation which the engine calls when it needs to.

You may have a need for some complex data structure, hash table, vectors or other subobjects. In this situation, you need to become familiar with the Complex Data Model as described in the udaf_sdk Documentation. There is no limit to the complexity of the memory model you choose. It is important that you create a UserData derived class that can serialize and un-serialize itself. It’s destructor must clean up all allocated memory.

If all you need is a simple data structure, you may forgo all the memory allocation steps and rely on the base UserData class. Its default functionality is to allocate a fixed block of memory and to stream that block as a single binary value. You need do nothing except set the amount of memory in the init() method and overlay your structure in each callback method.

MariaDB ColumnStore 1.1 doesn’t support dynamic loading of plugins, so your UDAF must be compiled and linked in the MariaDB ColumnStore code tree in the ./utils.udfsdk directory. You must compile and link it into libudfsdk.so.1.1.0 along with all the other user defined functions. This library must be placed into the mariadb/columnstore/lib directory of each node. The MariaDB C UDAF code must be compiled and linked into libudf_mysql.so.1.0.0 and placed into the same place. There’s a symlink already there for mysqld to find it.

Then to activate your UDAF, in a mysql client, issue a command similar to:
CREATE AGGREGATE FUNCTION median returns REAL soname 'libudf_mysql.so'

In future blogs, I’ll delve deep into each step needed to create and use a UDAF.

User Defined Aggregates open up a whole new avenue to extract value from analytic data. We hope you enjoy using this new tool! MariaDB ColumnStore 1.1 is available for download as part of MariaDB AX, an enterprise open source solution for modern data analytics and data warehousing.