MariaDB Bulk Load API

There are several ways to load data into MariaDB Platform, and some are better than others. The two basic ways are either to use LOAD DATA INFILE / LOAD DATA LOCAL INFILE, which is very fast, in particular the non-LOCAL one and then we have the plain INSERT statement. When using an INSERT statement you may pass an array to MariaDB Server, like this

INSERT (column1,column2) VALUES(<column 1 row 1>, <column 2 row 1>),(<column 1 row 2>, <column 2 row 2>)

Nothing special with that, but what we will cover in this blog is another way of doing INSERTs using arrays, one which uses the MariaDB API to pass a program array to MariaDB and which is actually a very fast way of loading data into MariaDB.

To begin with, let’s look at the two APIs that we use to access a MariaDB Server from a C program. The reason that the C API is relevant is that this API is a thin wrapper around the MariaDB protocol, so explaining the C API also covers what is possible with the protocol itself. The other connectors, such as JDBC, ODBC and Node.js have various levels of functionality and in some cases have other ways of interacting with MariaDB, but then this just happens inside the connector itself.

There are two API’s, one that is text-based and this is the original MariaDB API. In this API all data is sent and received as text. Let’s look at a sample table first before we go into looking at some code.

CREATE TABLE customers(id INTEGER NOT NULL PRIMARY KEY,
cust_name VARCHAR(255),
cust_regdate DATETIME NOT NULL,
cust_numorders INTEGER NOT NULL);

Now, let’s look at a simple program that insert some rows into that table, using the original text-based API:

#include <stdio.h>
#include <mysql.h>

int main(int argc, char *argv[])
   {
   MYSQL *conn;

   conn = mysql_init(NULL);
   if(mysql_real_connect(conn, "localhost", "root", NULL, "blog", 3306,
     "/var/lib/mysql/mysql.sock", CLIENT_INTERACTIVE) == NULL)
      {
      fprintf(stderr, "Error: %s\n", mysql_error(conn));
      return 1;
      }

   if(mysql_query(conn, "INSERT INTO customers VALUES(1, 'Joe Bloggs',"
     "'2019-03-05 14:30:00', 0)") != 0)
      {
      fprintf(stderr, "Error: %s\n", mysql_error(conn));
      return 1;
      }

   if(mysql_query(conn, "INSERT INTO customers VALUES(2, 'Homer Simpson',"
     "'2019-04-12 09:15:00', 0)") != 0)
      {
      fprintf(stderr, "Error: %s\n", mysql_error(conn));
      return 1;
      }

   mysql_close(conn);
   return 0;
   }

This is simple enough, we initialize a connection handle and connect and then we insert two rows using 2 INSERT statements. All columns we pass, be it strings, integers or dates are represented as strings. We can make this INSERT more effective by passing all rows in one single SQL statement, like this:

#include <stdio.h>
#include <mysql.h>

int main(int argc, char *argv[])
   {
   MYSQL *conn;

   conn = mysql_init(NULL);
   if(mysql_real_connect(conn, "localhost", "root", NULL, "blog", 3306,
     "/var/lib/mysql/mysql.sock", CLIENT_INTERACTIVE) == NULL)
      {
      fprintf(stderr, "Error: %s\n", mysql_error(conn));
      return 1;
      }

   if(mysql_query(conn, "INSERT INTO customers VALUES(1, 'Joe Bloggs',"
     "'2019-03-05 14:30:00', 0),(2, 'Homer Simpson',"
     "'2019-04-12 09:15:00', 0)") != 0)
      {
      fprintf(stderr, "Error: %s\n", mysql_error(conn));
      return 1;
      }

   mysql_close(conn);
   return 0;
   }

The Prepared statement MariaDB API

The prepared statement API is different from the text-based API but it is contained within the same library and the same connection functions, and many other functions are used in the same way. It is different in a couple of ways though. First, we don’t pass data as part of the SQL statement, rather the SQL statement contains placeholder where we want data to be and then we associate these placeholders with program variables, a process called binding, where we place the actual data.

The same SQL statement only needs to be prepared once, after which time we can execute it several times and just change the data in our program variables in between. For this to work, the bind process has to know not only a reference to the variable it is binding to, but also a few other things like the data type that is being referenced, the length of it and what is called an indicator variable is necessary. And indicator variable says something more about the referenced variables, such as if it is NULL and if the referenced string is NULL terminated or if the length is taken as the actual length of the string.

As an example, let’s see what the first program above would look like when using prepared statements:

#include <stdio.h>
#include <string.h>
#include <mysql.h>

int main(int argc, char *argv[])
   {
   MYSQL *conn;
   MYSQL_STMT *stmt;
   int id;
   char id_ind;
   char name[256];
   char name_ind;
   long name_len;
   MYSQL_TIME regdate;
   char regdate_ind;
   int numorders;
   char numorders_ind;
   MYSQL_BIND bind[4];

   conn = mysql_init(NULL);
   if(mysql_real_connect(conn, "localhost", "root", NULL, "blog", 3306,
     "/var/lib/mysql/mysql.sock", CLIENT_INTERACTIVE) == NULL)
      {
      fprintf(stderr, "Error: %s\n", mysql_error(conn));
      return 1;
      }

   stmt = mysql_stmt_init(conn);
   if(mysql_stmt_prepare(stmt, "INSERT INTO customers VALUES(?, ?, ?, ?)", -1) != 0)
      {
      fprintf(stderr, "Error: %s\n", mysql_stmt_error(stmt));
      return 1;
      }

   memset(bind, 0, sizeof(bind));
   bind[0].u.indicator = &id_ind;
   bind[0].buffer_type = MYSQL_TYPE_LONG;
   bind[0].buffer = &id;
   bind[1].u.indicator = &name_ind;
   bind[1].buffer_type = MYSQL_TYPE_STRING;
   bind[1].buffer = name;
   bind[1].length = &name_len;
   name_len = -1;
   bind[2].u.indicator = &regdate_ind;
   bind[2].buffer_type = MYSQL_TYPE_DATETIME;
   bind[2].buffer = &regdate;
   bind[3].u.indicator = &numorders_ind;
   bind[3].buffer_type = MYSQL_TYPE_LONG;
   bind[3].buffer = &numorders;

   if(mysql_stmt_bind_param(stmt, bind) != 0)
      {
      fprintf(stderr, "Error: %s\n", mysql_stmt_error(stmt));
      return 1;
      }
   id_ind = regdate_ind = numorders_ind = STMT_INDICATOR_NONE;
   name_ind = STMT_INDICATOR_NTS;

   id = 1;
   strcpy(name, "Joe Bloggs");
   regdate.year = 2019;
   regdate.month = 3;
   regdate.day = 5;
   regdate.hour = 14;
   regdate.minute = 30;
   numorders = 0;

   if(mysql_stmt_execute(stmt) != 0)
      {
      fprintf(stderr, "Error: %s\n", mysql_stmt_error(stmt));
      return 1;
      }

   id = 2;
   strcpy(name, "Homer Simpson");
   regdate.year = 2019;
   regdate.month = 4;
   regdate.day = 12;
   regdate.hour = 9;
   regdate.minute = 15;
   numorders = 0;

   if(mysql_stmt_execute(stmt) != 0)
      {
      fprintf(stderr, "Error: %s\n", mysql_stmt_error(stmt));
      return 1;
      }

   mysql_close(conn);
   return 0;
   }

So, what do you think, better or worse? Well one advantage is that we only need to parse the statement once so in the end it could be a bit faster. Maybe. On the other hand, if you are writing some piece of generic code that handles SQL-statements that aren’t specifically known in advance or maybe only parts of it are known, then this is kind of neat. To support this you can find out how many parameters you deal with by a call to the API after a statement has been prepared. The prepared statement API also handles statements that return data, such as a SELECT, in a similar way. All in all, prepared statements require a bit more code in the interface but is a fair bit more functional.

The example program explained

I will hold the full description on how Prepared Statements and the corresponding API works until another blog post, but the program above still needs some explanation.

After connecting to MariaDB using the usual mysql_real_connect function, we create a handle to work with prepared statements and then we prepare the SQL statement we are to use later using the mysql_stmt_prepare function. Notice the error handling at this point, and this is repeated everywhere a prepared statement API function is called, instead of calling mysql_error, you call mysql_stmt_error which takes the statement handle, not the connection handle, as an argument. The SQL statements that we prepare has a ? to indicate where we are to bind to a parameters.

Following this it is time to do the bind, which takes up most of the code. The bind of the type MYSQL_BIND has 4 members, as there are 4 parameters to bind. This can of course be dynamic and allocated on the heap, using malloc or similar, but in this case we are working with a predefined SQL statement and we know that there are 4 parameters.

We start by zeroing all members on all the bind parameters. Following this we fill only the MYSQL_BIND members that are strictly necessary and note that we are using different types for the different columns, to match the table columns. In particular the DATETIME column which is mapped to a MYSQL_TIME struct, but this is not strictly necessary as MariaDB will supply and necessary conversion, for example we could pass a valid datetime string for the cust_regdate column. Then we do the actual bind by calling the mysql_stmt_bind_param function.

Last we fill out the values that the parameters are bind to and we also set the indicator valiables, all of these are normal except the one for the string cust_name which is set to STMT_INDICATOR_NTS to indicate that this is a null-terminated string. Following this we call mysql_stmt_execute to execute the prepared statement.

Bulk loading – Prepared statements with input arrays

If you look at the prepared statement code above, you realize that if you are to insert two or more rows in one go, you would prepare and execute something like this:

INSERT INTO customers VALUES(?, ?, ?, ?),(?, ?, ?, ?)

To make this work you would then bind 8 program variables and this doesn’t really seem terribly flexible, right? You would need to prepare a different statement depending on how many rows you are inserting, and this is just as clumsy as when you have to do the same thing with the text-based interface. With MariaDB and using the MariaDB Connector, there is actually a better way, which is to use array binding.

The way this works is that every bind program variable is an array of values, and then set these properly, tell MariaDB how big the array is and then an arbitrary number of rows can be inserted with one statement. This is probably best explained with an example, again performing the same thing as the previous examples, but in yet another different way:

#include <stdio.h>
#include <string.h>
#include <mysql.h>

int main(int argc, char *argv[])
   {
   MYSQL *conn;
   MYSQL_STMT *stmt;
   int id[2];
   char id_ind[2];
   char *nameptr[2];
   char name[2][256];
   char name_ind[2];
   long name_len[2];
   MYSQL_TIME *regdateptr[2];
   MYSQL_TIME regdate[2];
   char regdate_ind[2];
   int numorders[2];
   char numorders_ind[2];
   MYSQL_BIND bind[4];
   unsigned int numrows;

   conn = mysql_init(NULL);
   if(mysql_real_connect(conn, "localhost", "root", NULL, "blog", 3306,
     "/var/lib/mysql/mysql.sock", CLIENT_INTERACTIVE) == NULL)
      {
      fprintf(stderr, "Error: %s\n", mysql_error(conn));
      return 1;
      }

   stmt = mysql_stmt_init(conn);
   if(mysql_stmt_prepare(stmt, "INSERT INTO customers VALUES(?, ?, ?, ?)", -1) != 0)
      {
      fprintf(stderr, "Error: %s\n", mysql_stmt_error(stmt));
      return 1;
      }

   memset(bind, 0, sizeof(bind));
   bind[0].u.indicator = id_ind;
   bind[0].buffer_type = MYSQL_TYPE_LONG;
   bind[0].buffer = id;
   bind[1].u.indicator = name_ind;
   bind[1].buffer_type = MYSQL_TYPE_STRING;
   bind[1].buffer = nameptr;
   bind[1].length = name_len;
   bind[2].u.indicator = regdate_ind;
   bind[2].buffer_type = MYSQL_TYPE_DATETIME;
   bind[2].buffer = regdateptr;
   bind[3].u.indicator = numorders_ind;
   bind[3].buffer_type = MYSQL_TYPE_LONG;
   bind[3].buffer = numorders;

   if(mysql_stmt_bind_param(stmt, bind) != 0)
      {
      fprintf(stderr, "Error: %s\n", mysql_stmt_error(stmt));
      return 1;
      }

   memset(&regdate, 0, sizeof(regdate));

   id_ind[0] = regdate_ind[0] = numorders_ind[0] = STMT_INDICATOR_NONE;
   name_ind[0] = STMT_INDICATOR_NTS;
   id[0] = 1;
   nameptr[0] = name[0];
   strcpy(name[0], "Joe Bloggs");
   name_len[0] = 256;
   regdateptr[0] = &regdate[0];
   regdate[0].year = 2019;
   regdate[0].month = 3;
   regdate[0].day = 5;
   regdate[0].hour = 14;
   regdate[0].minute = 30;
   numorders[0] = 0;

   id_ind[1] = regdate_ind[1] = numorders_ind[1] = STMT_INDICATOR_NONE;
   name_ind[1] = STMT_INDICATOR_NTS;
   id[1] = 2;
   nameptr[1] = name[1];
   strcpy(name[1], "Homer Simpson");
   name_len[1] = 256;
   regdateptr[1] = &regdate[1];
   regdate[1].year = 2019;
   regdate[1].month = 4;
   regdate[1].day = 12;
   regdate[1].hour = 9;
   regdate[1].minute = 15;
   numorders[1] = 0;

   numrows = 2;
   mysql_stmt_attr_set(stmt, STMT_ATTR_ARRAY_SIZE, &numrows);
   if(mysql_stmt_execute(stmt) != 0)
      {
      fprintf(stderr, "Error: %s\n", mysql_stmt_error(stmt));
      return 1;
      }

   mysql_close(conn);
   return 0;
   }

There are a couple of key points to note here. First, when we bind to an array any data type that is a char * string or a MYSQL_TIME has to be an array of pointers, and you see this in the code above. This makes this code look somewhat overcomplicated, but in the end, this is an advantage as the bound data can be anywhere (like each row can be a member of class or struct somewhere).

Secondly, to tell MariaDB that we are passing an array, we need to call mysql_stmt_attr_set and set the STMT_ATTR_ARRAY_SIZE attribute to the number of rows in the array.

The example program explained

The example above is not much different from the first prepared statement example, with a few exceptions. First, the bind process now points to our array values, we only have 2 values in the array but this should still illustrate my point. And for the cust_name and cust_regdate columns, we are also doing the bind to an array of pointers to the actual values. Before calling the single mysql_stmt_execute we also need to tell MariaDB how many rows to insert.

Conclusion

The ability to load data into MariaDB as program data arrays has several advantages, it is programmatically easier to deal with than a single array string, in particular if the latter consists of data for many rows. Aligning program data contained in classes or similar is also easier, allowing for better code integration.

Finally, performance is a bit better, in particular when there are many rows of data to INSERT.

Happy SQL’ing!