Extending the Power of MariaDB ColumnStore with User Defined Functions

spacer

Introduction

MariaDB ColumnStore 1.0 supports User Defined Functions (UDF) for query extensibility. This allows you to create custom filters and transformations to suit any need. This blog outlines adding support for distributed JSON query filtering.

An important MariaDB ColumnStore concept to grasp is that there are distributed and non-distributed functions. Distributed functions are executed at the PM nodes supporting query execution scale out. Non distributed functions are MariaDB Server functions that are executed within the UM node. As a result, MariaDB ColumnStore requires two distinct implementations of any function.

The next release, MariaDB ColumnStore 1.1, will bring support for User Defined Aggregate Functions and User Defined Window Functions.

Getting Started

To develop a User Defined Function requires familiarity with C, C++ and Linux.  At a high level the process involves:

  • Setting up a development environment to build MariaDB ColumnStore from source
  • Implementing the MariaDB Server UDF interface
  • Implementing the MariaDB ColumnStore Distributed UDF interface

You can find more information in the MariaDB knowledge base: User Defined Functions.

Example – JSON Pointer Query

In this example, I will create a simple UDF for JSON pointer queries using the RapidJSON library from Tencent. It will have the following syntax:

json_ptr(<json-text>, <json-ptr-string>)

The first argument is the JSON string. The second argument is the JSON pointer string.

Build Files

RapidJSON is a C++ header only library, so it must be included with the source code. I’ve chosen to add the RapidJSON source code as a submodule in the mariadb-columnstore-engine source code repository:

git submodule add https://github.com/miloyip/rapidjson utils/udfsdk/rapidjson

This will create a submodule rapidjson under the utils/udfsdk directory where the UDF SDK code is defined.

The rapidjson include directory is added to the cmake file utils/udfsdk/CMakeLists.txt to make it available for use (last line below)

include_directories( ${ENGINE_COMMON_INCLUDES}
                     ../../dbcon/mysql
                     rapidjson/include)

MariaDB UDF SDK

To implement the MariaDB Server UDF SDK, three procedures must be implemented in utils/udfsdk/udfmysql.cpp:

  • <UDF-NAME>_init  : initialization / parameter validation
  • <UDF-NAME> : function implementation.
  • <UDF-NAME>_deinit : any clean up needed, for example freeing of allocated memory

The init function, in this example, will validate the argument count:

my_bool json_ptr_init(UDF_INIT* initid, UDF_ARGS* args, char* message)
{
    if (args->arg_count != 2)
    {
        strcpy(message,"json_ptr() requires two arguments: expression, path");
        return 1;
    }

    return 0;
}

The function implementation will take the JSON string, parse this into an object structure, execute the JSON pointer path and return a string representation of the results:

 1 string json_ptr(UDF_INIT *initid, UDF_ARGS *args, char *is_null, char *error)
 2 {
 3     string json = cvtArgToString(args->arg_type[0], args->args[0]);
 4     string path = cvtArgToString(args->arg_type[1], args->args[1]);
 5     Document d;
 6     d.Parse(json.c_str());
 7    if (Value *v = Pointer(path.c_str()).Get(d)) {
 8         rapidjson::StringBuffer sb;
 9         Writer<rapidjson::StringBuffer> writer(sb);
10         v->Accept(writer);
11         return string(sb.GetString());
12     }
13     else {
14         return string();
15     }
16 }

Here is a brief explanation of the code:

  • Lines 3-4 : read the 2 arguments and convert to string.
  • Lines 5-6 : Parse the JSON string into a RapidJSON Document object .
  • Line 7 : Execute the JSON pointer path against the JSON document into the Value object.
  • Lines 8-11 : If the Value object is not null then serialize Value to a string. The Value class also offers strongly typed accessors that may also be used. Note that string values will be serialized with surrounding double quotes.
  • Line 14: If the Value object is null then return the empty string.

The following RapidJSON includes are required at the top of the file:

#include "rapidjson/document.h"
#include "rapidjson/pointer.h"
#include "rapidjson/stringbuffer.h"
#include "rapidjson/writer.h"

using namespace rapidjson;

The json_ptr_deinit function must be declared, but it is empty – there is nothing we need to do for clean up.

void json_ptr_deinit(UDF_INIT* initid)
{
}

MariaDB ColumnStore Distributed UDF SDK

The MariaDB ColumnStore distributed UDF SDK requires defining a class instance and registering an instance for it to be usable.

First the class must be declared in utils/udfsdk/udfsdk.h. The simplest approach is to clone one of the existing reference implementations:

class json_ptr : public funcexp::Func
{
  public:
    json_ptr() : Func("json_ptr") {}

    virtual ~json_ptr() {}

    execplan::CalpontSystemCatalog::ColType operationType(          
      funcexp::FunctionParm& fp, 
      execplan::CalpontSystemCatalog::ColType& resultType);

    virtual int64_t getIntVal(
      rowgroup::Row& row,
      funcexp::FunctionParm& fp,
      bool& isNull,
      execplan::CalpontSystemCatalog::ColType& op_ct);

    virtual double getDoubleVal(
      rowgroup::Row& row,
      funcexp::FunctionParm& fp,
      bool& isNull,
      execplan::CalpontSystemCatalog::ColType& op_ct);

     virtual float getFloatVal(
       rowgroup::Row& row,
       funcexp::FunctionParm& fp,
       bool& isNull,
       execplan::CalpontSystemCatalog::ColType& op_ct);

     virtual std::string getStrVal(
       rowgroup::Row& row,
       funcexp::FunctionParm& fp,
       bool& isNull,
       execplan::CalpontSystemCatalog::ColType& op_ct);

     virtual bool getBoolVal(
       rowgroup::Row& row,
       funcexp::FunctionParm& fp,
       bool& isNull,
       execplan::CalpontSystemCatalog::ColType& op_ct);

      virtual execplan::IDB_Decimal getDecimalVal(
        rowgroup::Row& row,
        funcexp::FunctionParm& fp,
        bool& isNull,                                                    
        execplan::CalpontSystemCatalog::ColType& op_ct);
   
      virtual int32_t getDateIntVal(
        rowgroup::Row& row,
        funcexp::FunctionParm& fp,
        bool& isNull,
        execplan::CalpontSystemCatalog::ColType& op_ct);

      virtual int64_t getDatetimeIntVal(
        rowgroup::Row& row,
        funcexp::FunctionParm& fp,
        bool& isNull,
        execplan::CalpontSystemCatalog::ColType& op_ct);

  private:
    void log_debug(std::string arg1, std::string arg2);

  };
}

It can be seen that the following methods are defined:

  • operationType which is used to indicate the return type of the function, which could be dynamic.
  • A number of get<Type>Val methods, which perform the operation for a given return type.
  • An optional private log_debug method, which could be used for debug logs in your implementation.

Next the class is implemented in utils/udfsdk/udfsdk.cpp.  First an entry must added to register the class by name (in lower class). This is added to the UDFMap function:

FuncMap UDFSDK::UDFMap() const
{
    FuncMap fm;
    fm["mcs_add"] = new MCS_add();
    fm["mcs_isnull"] = new MCS_isnull();
    fm["json_ptr"] = new json_ptr(); -- new entry for json_ptr function
    return fm;
}

The class is implemented, for brevity only a subset of methods is shown below:

    CalpontSystemCatalog::ColType json_ptr::operationType (
      FunctionParm& fp,                                                           
      CalpontSystemCatalog::ColType& resultType) {
        assert (fp.size() == 2);
        return fp[0]->data()->resultType();
    }

    void json_ptr::log_debug(string arg1, string arg2) {
        logging::LoggingID lid(28); // 28 = primproc
        logging::MessageLog ml(lid);

        logging::Message::Args args;
        logging::Message message(2);
        args.add(arg1);
        args.add(arg2);
        message.format( args );
        ml.logDebugMessage( message );
    }

    string json_ptr::getStrVal(Row& row,
                               FunctionParm& parm,
                               bool& isNull,
                               CalpontSystemCatalog::ColType& op_ct) {
        string json = parm[0]->data()->getStrVal(row, isNull);
        string path = parm[1]->data()->getStrVal(row, isNull);

        Document d;
        d.Parse(json.c_str());
        if (Value *v = Pointer(path.c_str()).Get(d)) {
            StringBuffer sb;
            Writer<StringBuffer> writer(sb);
            v->Accept(writer);
            return string(sb.GetString());
        }
        else {
            return string();
        }
    }

    double json_ptr::getDoubleVal(Row& row,
                                 FunctionParm& parm,
                                 bool& isNull,
                                 CalpontSystemCatalog::ColType& op_ct)
    {
        throw logic_error("Invalid API called json_ptr::getDoubleVal");
    }

The following methods are implemented:

  • operationType which validates there are exactly 2 arguments and specifies the return type as as string.
  • Optional log_debug method which illustrates how to log to the debug log.
  • getStrVal which performs the JSON evaluation. This is similar to the MariaDB UDF implementation with the exception of being strongly typed and the arguments are retrieved differently.
  • getDoubleVal illustrates throwing an error stating that this is not a supported operation. The method could be implemented of course but for simplicity this was not done.

The complete set of changes can be seen in github.

Building and Using the Example

This example can be built from a branch created from the 1.0.10 code. Once you have cloned the mariadb-columnstore-engine source tree, use the following instructions to build and install on the same server:

git checkout json_ptr_udf
git submodule update --init
cmake .
make -j4
cd utils/udfsdk/
sudo cp libudf_mysql.so.1.0.0 libudfsdk.so.1.0.0 /usr/local/mariadb/columnstore/lib

For a multi server setup, the library files should be copied to the same location on each server. After this, restart the MariaDB ColumnStore instance to start using the functions.

mcsadmin restartSystem

The function is registered using the create function syntax:

create function json_ptr returns string soname 'libudf_mysql.so';

Now a simple example will show the user defined function in action:

create table animal(
id int not null, 
creature varchar(30) not null, 
name varchar(30), 
age decimal(18) ,
attributes varchar(1000)
) engine=columnstore;

insert into animal(id, creature, name, age, attributes) 
values (1, 'tiger', 'roger', 10, '{"fierce": true, "colors": ["white", "orange", "black"]}'), 
       (2, 'tiger', 'sally', 2, '{"fierce": true, "colors": ["white", "orange", "black"]}'), 
       (3, 'lion', 'michael', 56, '{"fierce": false, "colors": ["grey"], "weight" : 9000}');

select name, json_ptr(attributes, '/fierce') is_fierce 
from animal 
where json_ptr(attributes, '/weight') = '9000';
+---------+-----------+
| name    | is_fierce |
+---------+-----------+
| michael | false     |
+---------+-----------+
1 row in set (0.05 sec)

In the example, you can see that the where clause uses the json_ptr function to filter on the weight element in the attributes JSON column and then the fierce element is retrieved in the select clause.

I hope this inspires you to further enhance this example or come up with your own user defined functions to extend the query capabilities of MariaDB ColumnStore.

Getting started? Download MariaDB ColumnStore today and learn how you can get started with MariaDB ColumnStore in 10 minutes.