MariaDB Vector preview is out

MariaDB Vector is out! Only as a preview yet, not all bugs are fixed, not all optimizations implemented. But it performs well in ann-benchmarks and we want to give you a chance to try it out.

What is MariaDB Vector?

MariaDB Vector is a new feature of MariaDB Server that turns it into a relational vector database. I won’t explain what vectors are and why you need them, chances are you know all that (otherwise you likely wouldn’t be reading this) and even if not — there are plenty of posts explaining it far better than I ever could. Instead, I’ll show what vectors are in MariaDB Vector, how to store them, index them and search for them.

First thing to keep in mind — there is no dedicated VECTOR data type. At least, not yet — while we do plan to have it done before the first official release, the priority was on implementing a usable indexed search first. For now, one can use any data type that can store a sequence of bytes. VARBINARY or BLOB will do just fine. Also, avoid the CREATE VECTOR INDEX statement for now, it hasn’t been fully tested yet. Declare the index as a part of your CREATE TABLE statement instead.

But enough of limitations, they are expected in a preview, after all. Let’s see what does work. I’ll use snippets from my development branch of ann-benchmarks as an example. It’s written in Python.

Usage example

Storing the data

We start from creating a table:

self._cur.execute(f"""
  CREATE TABLE ann (
    id INT PRIMARY KEY,
    v BLOB NOT NULL,
    VECTOR INDEX (v)
) ENGINE={self._engine}
""")

This, as expected, creates a new table with two columns and a vector index over the second — BLOB — column. The index is implemented in a way that supports different storage engines and at least InnoDB and MyISAM are tested and known to work.

The vector column stores 32-bit IEEE 754 floating point numbers. MariaDB module in ann-benchmarks inserts them directly as

self._cur.execute("INSERT INTO ann VALUES (%d, %s)",
                  (i, numpy.array(v, 'float32').tobytes()))

But if you’re not using Numpy, you can do the same with

self._cur.execute("INSERT INTO ann VALUES (%d, %s)",
                  (i, array.array('f',v).tobytes()))

Or you can send a text representation of the vector and let MariaDB convert it back:

self._cur.execute("INSERT INTO ann VALUES (%d, Vec_FromText(%s))",
                  (i, str(v)))

MariaDB Server uses a compact representation of vectors, so an index with a million of vectors of a thousand of dimensions per vector will take only about 2GB. As with any index, it works the fastest when the complete index fits in memory. Pay attention to the server variable @@mhnsw_cache_size — it controls the size of the cache per index and you want it big enough to fit your largest vector index.

Querying the index

The most common operation with vectors is searching for vectors that are closest to a given vector, for example, to the vector generated from a user search query, image or a song fragment. In SQL you do that by ordering results by distance and selecting few top results. Indeed, the MariaDB module in ann-benchmarks does exactly that

self._cur.execute("SELECT id FROM ann ORDER BY VEC_Distance(v, %s) LIMIT %d",
                  (numpy.array(v, 'float32').tobytes(), n))

VEC_Distance() is an SQL function that calculates a Euclidean (L2) distance between two points. More distance metrics, in particular, cosine, are coming soon.

Configuration

MariaDB Vector implements the Modified HNSW algorithm for searching in the vector index (to solve the so-called Approximate Nearest Neighbor problem). Thus all server variables related to MariaDB Vector start from mhnsw_ — here you can read the letter “m” as MariaDB or Modified, whatever you prefer. There are only three configuration variables. You already know @@mhnsw_cache_size. Two others are @@mhnsw_max_edges_per_node and @@mhnsw_limit_multiplier (corresponding to M and ef parameters in the HNSW algorithm) and they allow you to tune your preferences. Because the search is approximate, you can prefer it to be slower, but closer to exact, or faster, but less precise. The first variable @@mhnsw_max_edges_per_node improves the search precision at the expense of a larger index, and slower INSERTs. It cannot be changed after the table was created. The second variable affects only SELECTs and improves the search precision at the expense of the SELECT speed. Two variables together allow you to specify how much precision you are ready to trade for performance and whether you’d rather spend more time on precision during INSERT or during SELECT.

Where to get it

You can download MariaDB Vector preview as a source tarball — make sure to build it with a recent compiler for a modern x86-compatible CPU — and as a prebuilt docker image. There is no generic Linux binary tarball this time because, being a generic “run everywhere” binary, it is compiled using a 10 year old compiler for a 20 year old CPU. It’s not how you’d want a modern computation intensive algorithm to be compiled.

What’s next

Next, we complete the development, add missing features, run it through our standard QA process and publish an official release. Of course, we will try to incorporate all your feedback into the release. It might not always be possible to do after the release, so send your feedback now.

And we will blog about the performance of MariaDB Vector preview, as soon as benchmarks finish running. Stay tuned!