1 of 12

Optimization and Indexes

Optimize MariaDB Server queries with indexes. This section covers index types, creation, and best practices for leveraging them to significantly improve query performance and data retrieval speed.

Building the best INDEX for a given SELECT

The problem

You have a SELECT and you want to build the best INDEX for it. This blog is a "cookbook" on how to do that task.

A short algorithm that works for many simpler SELECTs and helps in complex queries.
Examples of the algorithm, plus digressions into exceptions and variants
Finally a long list of "other cases".

The hope is that a newbie can quickly get up to speed, and his/her INDEXes will no longer smack of "newbie".

Many edge cases are explained, so even an expert may find something useful here.

Algorithm

Here's the way to approach creating an INDEX, given a SELECT. Follow the steps below, gathering columns to put in the INDEX in order. When the steps give out, you usually have the 'perfect' index.

Given a WHERE with a bunch of expressions connected by AND: Include the columns (if any), in any order, that are compared to a constant and not hidden in a function.
You get one more chance to add to the INDEX; do the first of these that applies:

2a. One column used in a 'range' -- BETWEEN, '>', LIKE w/o leading wildcard, etc.
2b. All columns, in order, of the GROUP BY.
2c. All columns, in order, of the ORDER BY if there is no mixing of ASC and DESC.

Digression

This blog assumes you know the basic idea behind having an INDEX. Here is a refresher on some of the key points.

Virtually all INDEXes in MySQL are structured as BTrees BTrees allow very efficient for

Given a key, find the corresponding row(s);
"Range scans" -- That is start at one value for the key and repeatedly find the "next" (or "previous") row.

A PRIMARY KEY is a UNIQUE KEY; a UNIQUE KEY is an INDEX. ("KEY" == "INDEX".)

InnoDB "clusters" the PRIMARY KEY with the data. Hence, given the value of the PK ("PRIMARY KEY"), after drilling down the BTree to find the index entry, you have all the columns of the row when you get there. A "secondary key" (any UNIQUE or INDEX other than the PK) in InnoDB first drills down the BTree for the secondary index, where it finds a copy of the PK. Then it drills down the PK to find the row.

Every InnoDB table has a PRIMARY KEY. While there is a default if you do not specify one, it is best to explicitly provide a PK.

For completeness: MyISAM works differently. All indexes (including the PK) are in separate BTrees. The leaf node of such BTrees have a pointer (usually a byte offset) into the data file.

All discussion here assumes InnoDB tables, however most statements apply to other Engines.

First, some examples

Think of a list of names, sorted by last_name, then first_name. You have undoubtedly seen such lists, and they often have other information such as address and phone number. Suppose you wanted to look me up. If you remember my full name ('James' and 'Rick'), it is easy to find my entry. If you remembered only my last name ('James') and first initial ('R'). You would quickly zoom in on the Jameses and find the Rs in them. There, you might remember 'Rick' and ignore 'Ronald'. But, suppose you remembered my first name ('Rick') and only my last initial ('J'). Now you are in trouble. You would be scanning all the Js -- Jones, Rick; Johnson, Rick; Jamison, Rick; etc, etc. That's much less efficient.

Those equate to

Think about this example as I talk about "=" versus "range" in the Algorithm, below.

Algorithm, step 1 (WHERE "column = const")

WHERE aaa = 123 AND ... : an INDEX starting with aaa is good.
WHERE aaa = 123 AND bbb = 456 AND ... : an INDEX starting with aaa and bbb is good. In this case, it does not matter whether aaa or bbb comes first in the INDEX.
xxx IS NULL : this acts like "= const" for this discussion.

Note that the expression must be of the form of column_name = (constant). These do not apply to this step in the Algorithm: DATE(dt) = '...', LOWER(s) = '...', CAST(s ...) = '...', x='...' COLLATE...

(If there are no "=" parts AND'd in the WHERE clause, move on to step 2 without any columns in your putative INDEX.)

Algorithm, step 2

Find the first of 2a / 2b / 2c that applies; use it; then quit. If none apply, then you are through gathering columns for the index.

In some cases it is optimal to do step 1 (all equals) plus step 2c (ORDER BY).

Algorithm, step 2a (one range)

A "range" shows up as

aaa >= 123 -- any of <, <=, >=, >; but not <>, !=
aaa BETWEEN 22 AND 44
sss LIKE 'blah%' -- but not sss LIKE '%blah'

If there are more parts to the WHERE clause, you must stop now.

Complete examples (assume nothing else comes after the snippet)

WHERE aaa >= 123 AND bbb = 1 ⇒ INDEX(bbb, aaa) (WHERE order does not matter; INDEX order does)
WHERE aaa >= 123 ⇒ INDEX(aaa)
WHERE aaa >= 123 AND ccc > 'xyz' ⇒ INDEX(aaa) or INDEX(ccc) (only one range)

Algorithm, step 2b (GROUP BY)

If there is a GROUP BY, all the columns of the GROUP BY should now be added, in the specified order, to the INDEX you are building. (I do not know what happens if one of the columns is already in the INDEX.)

If you are GROUPing BY an expression (including function calls), you cannot use the GROUP BY; stop.

Complete examples (assume nothing else comes after the snippet)

WHERE aaa = 123 AND bbb = 1 GROUP BY ccc ⇒ INDEX(bbb, aaa, ccc) or INDEX(aaa, bbb, ccc) (='s first, in any order; then the GROUP BY)
WHERE aaa >= 123 GROUP BY xxx ⇒ INDEX(aaa) (You should have stopped with Step 2a)
GROUP BY x,y ⇒ INDEX(x,y) (no WHERE)

Algorithm, step 2c (ORDER BY)

If there is a ORDER BY, all the columns of the ORDER BY should now be added, in the specified order, to the INDEX you are building.

If there are multiple columns in the ORDER BY, and there is a mixture of ASC and DESC, do not add the ORDER BY columns; they won't help; stop.

If you are ORDERing BY an expression (including function calls), you cannot use the ORDER BY; stop.

Complete examples (assume nothing else comes after the snippet)

WHERE aaa = 123 GROUP BY ccc ORDER BY ddd ⇒ INDEX(aaa, ccc) -- should have stopped with Step 2b
WHERE aaa = 123 GROUP BY ccc ORDER BY ccc ⇒ INDEX(aaa, ccc) -- the ccc will be used for both GROUP BY and ORDER BY
WHERE aaa = 123 ORDER BY xxx ASC, yyy DESC ⇒ INDEX(aaa) -- mixture of ASC and DESC.

The following are especially good. Normally a LIMIT cannot be applied until after lots of rows are gathered and then sorted according to the ORDER BY. But, if the INDEX gets all they way through the ORDER BY, only (OFFSET + LIMIT) rows need to be gathered. So, in these cases, you win the lottery with your new index:

WHERE aaa = 123 GROUP BY ccc ORDER BY ccc LIMIT 10 ⇒ INDEX(aaa, ccc)
WHERE aaa = 123 ORDER BY ccc LIMIT 10 ⇒ INDEX(aaa, ccc)
ORDER BY ccc LIMIT 10 ⇒ INDEX(ccc)

(It does not make much sense to have a LIMIT without an ORDER BY, so I do not discuss that case.)

Algorithm end

You have collected a few columns; put them in INDEX and ADD that to the table. That will often produce a "good" index for the SELECT you have. Below are some other suggestions that may be relevant.

An example of the Algorithm being 'wrong':

This would (according to the Algorithm) call for INDEX(flag). However, indexing a column that has two (or a small number of) values is almost always useless. This is called 'low cardinality'. The Optimizer would prefer to do a table scan than to bounce between the index BTree and the data.

On the other hand, the Algorithm is 'right' with

That would call for a compound index starting with a flag: INDEX(flag, date). Such an index is likely to be very beneficial. And it is likely to be more beneficial than INDEX(date).

If your resulting INDEX include column(s) that are likely to be UPDATEd, note that the UPDATE will have extra work to remove a 'row' from one place in the INDEX's BTree and insert a 'row' back into the BTree. For example:

There are too many variables to say whether it is better to keep the index or to toss it.

In this case, shortening the index may be beneficial:

Changing to INDEX(z) would make for less work for the UPDATE, but might hurt some SELECT. It depends on the frequency of each, plus many more factors.

Limitations

(There are exceptions to some of these.)

You may not create an index bigger than 3KB.
You may not include a column that equates to bigger than some value (767 bytes -- VARCHAR(255) CHARACTER SET utf8).
You can deal with big fields using "prefix" indexing; but see below.
You should not have more than 5 columns in an index. (This is just a Rule of Thumb; nothing prevents having more.)

Flags and low cardinality

INDEX(flag) is almost never useful if flag has very few values. More specifically, when you say WHERE flag = 1 and "1" occurs more than 20% of the time, such an index will be shunned. The Optimizer would prefer to scan the table instead of bouncing back and forth between the index and the data for more than 20% of the rows.

("20%" is really somewhere between 10% and 30%, depending on the phase of the moon.)

"Covering" indexes

A "Covering" index is an index that contains all the columns in the SELECT. It is special in that the SELECT can be completed by looking only at the INDEX BTree. (Since InnoDB's PRIMARY KEY is clustered with the data, "covering" is of no benefit when considering at the PRIMARY KEY.)

Mini-cookbook:

Gather the list of column(s) according to the "Algorithm", above.
Add to the end of the list the rest of the columns seen in the SELECT, in any order.

Examples:

SELECT x FROM t WHERE y = 5; ⇒ INDEX(y,x) -- The algorithm said just INDEX(y)
SELECT x,z FROM t WHERE y = 5 AND q = 7; ⇒ INDEX(y,q,x,z) -- y and q in either order (Algorithm), then x and z in either order (covering).
SELECT x FROM t WHERE y > 5 AND q > 7; ⇒ INDEX(y,q,x) -- y or q first (that's as far as the Algorithm goes), then the other two fields afterwards.

The speedup you get might be minor, or it might be spectacular; it is hard to predict.

But...

It is not wise to build an index with lots of columns. Let's cut it off at 5 (Rule of Thumb).
Prefix indexes cannot 'cover', so don't use them anywhere in a 'covering' index.
There are limits (3KB?) on how 'wide' an index can be, so "covering" may not be possible.

Redundant/excessive indexes

INDEX(a,b) can find anything that INDEX(a) could find. So you don't need both. Get rid of the shorter one.

If you have lots of SELECTs and they generate lots of INDEXes, this may cause a different problem. Each index must be updated (sooner or later) for each INSERT. More indexes ⇒ slower INSERTs. Limit the number of indexes on a table to about 6 (Rule of Thumb).

Notice in the cookbook how it says "in any order" in a few places. If, for example, you have both of these (in different SELECTs):

WHERE a=1 AND b=2 begs for either INDEX(a,b) or INDEX(b,a)
WHERE a>1 AND b=2 begs only for INDEX(b,a) Include only INDEX(b,a) since it handles both cases with only one INDEX.

Suppose you have a lot of indexes, including (a,b,c,dd) and (a,b,c,ee). Those are getting rather long. Consider either picking one of them, or having simply (a,b,c). Sometimes the selectivity of (a,b,c) is so good that tacking on 'dd' or 'ee' does make enough difference to matter.

Optimizer picks ORDER BY

The main cookbook skips over an important optimization that is sometimes used. The optimizer will sometimes ignore the WHERE and, instead, use an INDEX that matches the ORDER BY. This, of course, needs to be a perfect match -- all columns, in the same order. And all ASC or all DESC.

This becomes especially beneficial if there is a LIMIT.

But there is a problem. There could be two situations, and the Optimizer is sometimes not smart enough to see which case applies:

If the WHERE does very little filtering, fetching the rows in ORDER BY order avoids a sort and has little wasted effort (because of 'little filtering'). Using the INDEX matching the ORDER BY is better in this case.
If the WHERE does a lot of filtering, the ORDER BY is wasting a lot of time fetching rows only to filter them out. Using an INDEX matching the WHERE clause is better.

What should you do? If you think the "little filtering" is likely, then create an index with the ORDER BY columns in order and hope that the Optimizer uses it when it should.

OR

Cases...

WHERE a=1 OR a=2 -- This is turned into WHERE a IN (1,2) and optimized that way.
WHERE a=1 OR b=2 usually cannot be optimized.
WHERE x.a=1 OR y.b=2 This is even worse because of using two different tables.

A workaround is to use UNION. Each part of the UNION is optimized separately. For the second case:

Now the query can take good advantage of two different indexes. Note: "Index merge" might kick in on the original query, but it is not necessarily any faster than the UNION. Sister blog on compound indexes, including 'Index Merge'

The third case (OR across 2 tables) is similar to the second.

If you originally had a LIMIT, UNION gets complicated. If you started with ORDER BY z LIMIT 190, 10, then the UNION needs to be

TEXT / BLOB

You cannot directly index a TEXT or BLOB or large VARCHAR or large BINARY column. However, you can use a "prefix" index: INDEX(foo(20)). This says to index the first 20 characters of foo. But... It is rarely useful.

Example of a prefix index:

The index for me would contain 'Ja', 'Rick'. That's not useful for distinguishing between 'Jamison', 'Jackson', 'James', etc., so the index is so close to useless that the optimizer often ignores it.

Probably never do UNIQUE(foo(20)) because this applies a uniqueness constraint on the first 20 characters of the column, not the whole column!

Dates

DATE, DATETIME, etc. are tricky to compare against.

Some tempting, but inefficient, techniques:

date_col LIKE '2016-01%' -- must convert date_col to a string, so acts like a functionLEFT(date_col, 4) = '2016-01' -- hiding the column in functionDATE(date_col) = 2016 -- hiding the column in function

All must do a full scan. (On the other hand, it can handy to use GROUP BY LEFT(date_col, 7) for monthly grouping, but that is not an INDEX issue.)

This is efficient, and can use an index:

This case works because both right-hand values are converted to constants, then it is a "range". I like the design pattern with INTERVAL because it avoids computing the last day of the month. And it avoids tacking on '23:59:59', which is wrong if you have microsecond times. (And other cases.)

EXPLAIN Key_len

Perform EXPLAIN SELECT... (and EXPLAIN FORMAT=JSON SELECT... if you have 5.6.5). Look at the Key that it chose, and the Key_len. From those you can deduce how many columns of the index are being used for filtering. (JSON makes it easier to get the answer.) From that you can decide whether it is using as much of the INDEX as you thought. Caveat: Key_len only covers the WHERE part of the action; the non-JSON output won't easily say whether GROUP BY or ORDER BY was handled by the index.

IN

IN (1,99,3) is sometimes optimized as efficiently as "=", but not always. Older versions of MySQL did not optimize it as well as newer versions. (5.6 is possibly the main turning point.)

IN ( SELECT ... )

From version 4.1 through 5.5, IN ( SELECT ... ) was very poorly optimized. The SELECT was effectively re-evaluated every time. Often it can be transformed into a JOIN, which works much faster. Heres is a pattern to follow:

The SELECT expressions will need "a." prefixing the column names.

Alas, there are cases where the pattern is hard to follow.

5.6 does some optimizing, but probably not as good as the JOIN.

If there is a JOIN or GROUP BY or ORDER BY LIMIT in the subquery, that complicates the JOIN in new format. So, it might be better to use this pattern:

Caveat: If you end up with two subqueries JOINed together, note that neither has any indexes, hence performance can be very bad. (5.6 improves on it by dynamically creating indexes for subqueries.)

There is work going on in MariaDB and Oracle 5.7, in relation to "NOT IN", "NOT EXISTS", and "LEFT JOIN..IS NULL"; here is an old discussion on the topic So, what I say here may not be the final word.

Explode/Implode

When you have a JOIN and a GROUP BY, you may have the situation where the JOIN exploded more rows than the original query (due to many:many), but you wanted only one row from the original table, so you added the GROUP BY to implode back to the desired set of rows.

This explode + implode, itself, is costly. It would be better to avoid them if possible.

Sometimes the following will work.

Using DISTINCT or GROUP BY to counteract the explosion

When using second table just to check for existence:

Many-to-many mapping table

Do it this way.

Notes:

Lack of an AUTO_INCREMENT id for this table -- The PK given is the 'natural' PK; there is no good reason for a surrogate.
"MEDIUMINT" -- This is a reminder that all INTs should be made as small as is safe (smaller ⇒ faster). Of course the declaration here must match the definition in the table being linked to.
"UNSIGNED" -- Nearly all INTs may as well be declared non-negative
"NOT NULL" -- Well, that's true, isn't it?

To conditionally INSERT new links, use

Note that if you had an AUTO_INCREMENT in this table, IODKU would "burn" ids quite rapidly.

Subqueries and UNIONs

Each subquery SELECT and each SELECT in a UNION can be considered separately for finding the optimal INDEX.

Exception: In a "correlated" ("dependent") subquery, the part of the WHERE that depends on the outside table is not easily factored into the INDEX generation. (Cop out!)

JOINs

The first step is to decide what order the optimizer will go through the tables. If you cannot figure it out, then you may need to be pessimistic and create two indexes for each table -- one assuming the table will be used first, one assiming that it will come later in the table order.

The optimizer usually starts with one table and extracts the data needed from it. As it finds a useful (that is, matches the WHERE clause, if any) row, it reaches into the 'next' table. This is called NLJ ("Nested Loop Join"). The process of filtering and reaching to the next table continues through the rest of the tables.

The optimizer usually picks the "first" table based on these hints:

STRAIGHT_JOIN forces the table order.
The WHERE clause limits which rows needed (whether indexed or not).
The table to the "left" in a LEFT JOIN usually comes before the "right" table. (By looking at the table definitions, the optimizer may decide that "LEFT" is irrelevant.)
The current INDEXes will encourage an order.

Running EXPLAIN tells you the table order that the Optimizer is very likely to use today. After adding a new INDEX, the optimizer may pick a different table order. You should anticipate the order changing, guess at what order makes the most sense, and build the INDEXes accordingly. Then rerun EXPLAIN to see if the Optimizer's brain was on the same wavelength you were on.

You should build the INDEX for the "first" table based on any parts of the WHERE, GROUP BY, and ORDER BY clauses that are relevant to it. If a GROUP/ORDER BY mentions a different table, you should ignore that clause.

The second (and subsequent) table will be reached into based on the ON clause. (Instead of using commajoin, please write JOINs with the JOIN keyword and ON clause!) In addition, there could be parts of the WHERE clause that are relevant. GROUP/ORDER BY are not to be considered in writing the optimal INDEX for subsequent tables.

PARTITIONing

PARTITIONing is rarely a substitute for a good INDEX.

PARTITION BY RANGE is a technique that is sometimes useful when indexing fails to be good enough. In a two-dimensional situation such as nearness in a geographical sense, one dimension can partially be handled by partition pruning; then the other dimension can be handled by a regular index (preferrably the PRIMARY KEY). This goes into more detail: .

FULLTEXT

FULLTEXT is now implemented in InnoDB as well as MyISAM. It provides a way to search for "words" in TEXT columns. This is much faster (when it is applicable) than col LIKE '%word%'.

always(?) uses the FULLTEXT index first. That is, the whole Algorithm is invalidated when one of the ANDs is a MATCH.

Signs of a Newbie

No "compound" (aka "composite") indexes
No PRIMARY KEY
Redundant indexes (especially blatant is PRIMARY KEY(id), KEY(id))
Most or all columns individually indexes ("But I indexed everything")

Speeding up wp_postmeta

The published table (see Wikipedia) is

The problems:

The AUTO_INCREMENT provides no benefit; in fact it slows down most queries and clutters disk.
Much better is PRIMARY KEY(post_id, meta_key) -- clustered, handles both parts of usual JOIN.
BIGINT is overkill, but that can't be fixed without changing other tables.
VARCHAR(255) can be a problem in 5.6 with utf8mb4; see workarounds below.

The solutions:

Postlog

Initial posting: March, 2015; Refreshed Feb, 2016; Add DATE June, 2016; Add WP example May, 2017.

The tips in this document apply to MySQL, MariaDB, and Percona.

Compound (Composite) Indexes

A mini-lesson in "compound indexes" ("composite indexes")

This document starts out trivial and perhaps boring, but builds up to more interesting information, perhaps things you did not realize about how MariaDB and MySQL indexing works.

This also explains (to some extent).

(Most of this applies to other databases, too.)

Ignored Indexes

Ignored indexes allow indexes to be visible and maintained without being used by the optimizer. This feature is comparable to MySQL 8’s "invisible indexes."

This feature is available from MariaDB 10.6.

Ignored indexes are indexes that are visible and maintained, but which are not used by the optimizer. MySQL 8 has a similar feature which they call "invisible indexes".

Syntax

Index Statistics

Index statistics provide crucial insights to the MariaDB query optimizer, guiding it in executing queries efficiently. Up-to-date index statistics ensure optimized query performance.

How Index Statistics Help the Query Optimizer

Understanding index statistics is crucial for the MariaDB query optimizer to efficiently execute queries. Accurate and current statistics guide the optimizer in choosing the best way to access data, similar to using a personal address book for quicker searches rather than a larger phone book. Up-to-date index statistics ensure optimized query performance.

Latitude/Longitude Indexing

The problem

You want to find the nearest 10 pizza parlors, but you cannot figure out how to do it efficiently in your huge database. Database indexes are good at one-dimensional indexing, but poor at two-dimensions.

You might have tried

INDEX(lat), INDEX(lon) -- but the optimizer used only one
INDEX(lat,lon) -- but it still had to work too hard
Sometimes you ended up with a full table scan -- Yuck.

WHERE < ... -- No chance of using any index.

WHERE lat BETWEEN ... AND lng BETWEEN... -- This has some chance of using such indexes.

The goal is to look only at records "close", in both directions, to the target lat/lng.

A solution -- first, the principles

in MariaDB and MySQL sort of give you a way to have two clustered indexes. So, if we could slice up (partition) the globe in one dimension and use ordinary indexing in the other dimension, maybe we can get something approximating a 2D index. This 2D approach keeps the number of disk hits significantly lower than 1D approaches, thereby speeding up "find nearest" queries.

It works. Not perfectly, but better than the alternatives.

What to PARTITION on? It seems like latitude or longitude would be a good idea. Note that longitudes vary in width, from 69 miles (111 km) at the equator, to 0 at the poles. So, latitude seems like a better choice.

How many PARTITIONs? It does not matter a lot. Some thoughts:

90 partitions - 2 degrees each. (I don't like tables with too many partitions; 90 seems like plenty.)
50-100 - evenly populated. (This requires code. For 2.7M placenames, 85 partitions varied from 0.5 degrees to very wide partitions at the poles.)
Don't have more than 100 partitions, there are inefficiencies in the partition implementation.

How to PARTITION? Well, MariaDB and MySQL are very picky. So / are out. is out. So, we are stuck with some kludge. Essentially, we need to convert Lat/Lng to some size of and use PARTITION BY RANGE.

Representation choices

To get to a datatype that can be used in PARTITION, you need to "scale" the latitude and longitude. (Consider only the *INTs; the other datatypes are included for comparison)

(Sorted by resolution)

What these mean...

Deg*100 () -- you take the lat/lng, multiply by 100, round, and store into a SMALLINT. That will take 2 bytes for each dimension, for a total of 4 bytes. Two items might be 1570 meters apart, but register as having the same latitude and longitude.

for latitude and DECIMAL(5,2) for longitude will take 2+3 bytes and have no better resolution than Deg*100.

SMALLINT scaled -- Convert latitude into a SMALLINT SIGNED by doing (degrees / 90 * 32767) and rounding; longitude by (degrees / 180 * 32767).

has 24 significant bits; has 53. (They don't work with PARTITIONing but are included for completeness. Often people use DOUBLE without realizing how much an overkill it is, and how much space it takes.)

Sure, you could do DEG_1000 and other "in between" cases, but there is no advantage. DEG_1000 takes as much space as DEG*10000, but has less resolution.

So, go down the list to see how much resolution you need, then pick an encoding you are comfortable with. However, since we are about to use latitude as a "partition key", it must be limited to one of the INTs. For the sample code, I will use Deg*10000 ().

GCDist -- compute "great circle distance"

GCDist is a helper FUNCTION that correctly computes the distance between two points on the globe.

The code has been benchmarked at about 20 microseconds per call on a 2011-vintage PC. If you had to check a million points, that would take 20 seconds -- far too much for a web application. So, one goal of the Procedure that uses it will be to minimize the usage of this function. With the code presented here, the function need be called only a few dozen or few hundred times, except in pathological cases.

Sure, you could use the Pythagorean formula. And it would work for most applications. But it does not take extra effort to do the GC. Furthermore, GC works across a pole and across the dateline. And, a Pythagorean function is not that much faster.

For efficiency, GCDist understands the scaling you picked and has that stuff hardcoded. I am picking "Deg*10000", so the function expects 350000 for representing 35 degrees. If you choose a different scaling, you will need to change the code.

GCDist() takes 4 scaled DOUBLEs -- lat1, lon1, lat2, lon2 -- and returns a scaled number of "degrees" representing the distance.

The table of representation choices says 52 feet of resolution for Deg*10000 and DECIMAL(x,4). Here is how it was calculated: To measuring a diagonal between lat/lng (0,0) and (0.0001,00001) (one 'unit in the last place'): GCDist(0,0,1,1) * 69.172 / 10000 * 5280 = 51.65, where

69.172 miles/degree of latitude
10000 units per degree for the scaling chosen
5280 feet / mile.

(No, this function does not compensate for the Earth being an oblate spheroid, etc.)

Required table structure

There will be one table (plus normalization tables as needed). The one table must be partitioned and indexed as indicated below.

Fields and indexes

PARTITION BY RANGE(lat)
lat -- scaled latitude (see above)
lon -- scaled longitude
PRIMARY KEY(lon, lat, ...) -- lon must be first; something must be added to make it UNIQUE

For most of this discussion, lat is assumed to be MEDIUMINT -- scaled from -90 to +90 by multiplying by 10000. Similarly for lon and -180 to +180.

The PRIMARY KEY must

start with lon since the algorithm needs the "clustering" that InnoDB will provide, and
include lat somewhere, since it is the PARTITION key, and
contain something to make the key UNIQUE (lon+lat is unlikely to be sufficient).

The FindNearest PROCEDURE will do multiple SELECTs something like this:

The query planner will

Do PARTITION "pruning" based on the latitude; then
Within a PARTITION (which is effectively a table), use lon do a 'clustered' range scan; then
Use the "condition" to filter down to the rows you desire, plus recheck lat. This design leads to very few disk blocks needing to be read, which is the main goal of the design.

Note that this does not even call GCDist. That comes in the last pass when the ORDER BY and LIMIT are used.

The has a loop. At least two SELECTs will be executed, but with proper tuning; usually no more than about 6 SELECTs will be performed. Because of searching by the PRIMARY KEY, each SELECT hits only one block, sometimes more, of the table. Counting the number of blocks hit is a crude, but effective way, of comparing the performance of multiple designs. By comparison, a full table scan will probably touch thousands of blocks. A simple INDEX(lat) probably leads to hitting hundreds of blocks.

Filtering... An argument to the FindNearest procedure includes a boolean expression ("condition") for a WHERE clause. If you don't need any filtering, pass in "1". To avoid "SQL injection", do not let web users put arbitrary expressions; instead, construct the "condition" from inputs they provide, thereby making sure it is safe.

The algorithm

The algorithm is embodied in a because of its complexity.

You feed it a starting width for a "square" and a number of items to find.
It builds a "square" around where you are.
A SELECT is performed to see how many items are in the square.
Loop, doubling the width of the square, until enough items are found.

The next section ("Performance") should make this a bit clearer as it walks through some examples.

Performance

Because of all the variations, it is hard to get a meaningful benchmark. So, here is some hand-waving instead.

Each SELECT is constrained by a "square" defined by a latitude range and a longitude range. (See the WHERE clause mentioned above, or in the sample code below.) Because of the way longitude lines warp, the longitude range of the "square" will be more degrees than the latitude range. Let's say the latitude partitioning is 3 degrees wide in the area where you are searching. That is over 200 miles (over 300km), so you are very likely to have a latitude range smaller than the partition width. Still, if you are reaching from the edge of a latitude stripe, the square could span two partitions. After partition pruning down to one (sometimes more) partition, the query is then constrained by a longitude range. (Remember, the PRIMARY KEY starts with lon.) If an InnoDB data block contains 100 rows (a handy Rule of Thumb), the select will touch one (or a few) block. If the square spans two (or more) partitions, then the same logic applies to each partition.

So, scanning the square will involve as little as one block; rarely more than a few blocks. The number of blocks is mostly independent of the dataset size.

The primary use case for this algorithm is when the data is significantly larger than will fit into cache (the buffer_pool). Hence, the main goal is to minimize the number of disk hits.

Now let's look at some edge cases, and argue that the number of blocks is still better (usually) than with traditional indexing techniques.

What if you are looking for Starbucks in a dense city? There would be dozens, maybe hundreds per square mile. If you start the guess at 100 miles, the SELECTs would be hitting lots of blocks -- not efficient. In this case, the "starting distance" should be small, say, 2 miles. Let's say your app wants the closest 10 stores. In this example, you would probably find more than 10 Starbucks within 2 miles in 1 InnoDB block in one partition. Even though there is a second SELECT to finish off the query, it would be hitting the same block. Total: One block hit == cheap.

Let's say you start with a 5 mile square. Since there are upwards of 200 Starbucks within a 5-miles radius in some dense cities of the world, that might imply 300 in our "square". That maps to about 4 disk blocks, and a modest amount of CPU to chew through the 300 records. Still not bad.

Now, suppose you are on an ocean liner somewhere in the Pacific. And there is one Starbucks onboard, but you are looking for the nearest 10. If you again start with 2 miles, it will take several iterations to find 10 sites. But, let's walk through it anyway. The first probe will hit one partition (maybe 2), and find just one hit. The second probe doubles the width of the square; 4 miles will still give you one hit -- the same hit in the same block, which is now cached, so we won't count it as a second disk I/O. Eventually the square will be wide enough to span multiple partitions. Each extra partition will be one new disk hit to discover no sites in the square. Finally, the square will hit Chile or Hawaii or Fiji and find some more sites, perhaps enough to stop the iteration. Since the main criteria in determining the number of disk hits is the number of partitions hit, we do not want to split the world into too many partitions. If there are, say, 40 partitions, then I have just described a case where there might be 20 disk hits.

2-degree partitions might be good for a global table of stores or restaurants. A 5-mile starting distance might be good when filtering for Starbucks. 20 miles might be better for a department store.

Now, let's discuss the 'last' SELECT, wherein the square is expanded by SQRT(2) and it uses the Great Circle formula to precisely order the N results. The SQRT(2) is in case that the N items were all at the corners of the 'square'. Growing the square by this much allows us to catch any other sites that were just outside the old square.

First, note that this 'last' SELECT is hitting the same block(s) that the iteration hit, plus possibly hitting some more blocks. It is hard to predict how many extra blocks might be hit. Here's a pathological case. You are in the middle of a desert; the square grows and grows. Eventually it finds N sites. There is a big city just outside the final square from the iterating. Now the 'last' SELECT kicks in, and it includes lots of sites in this big city. "Lots of sites" --> lots of blocks --> lots of disk hits.

Discussion of reference code

Here's the gist of the FindNearest().

Make a guess at how close to "me" to look.
See how many items are in a 'square' around me, after filtering.
If not enough, repeat, doubling the width of the square.
After finding enough, or giving up because we are looking "too far", make one last pass to get all the data, ORDERed and LIMITed

Note that the loop merely uses 'squares' of lat/lng ranges. This is crude, but works well with the partitioning and indexing, and avoids calling to GCDist (until the last step). In the sample code, I picked 15 miles as starting value. Adjusting this will have some impact on the Procedure's performance, but the impact will vary with the use cases. A rough way to set the radius is to guess what will find the desired LIMIT about half the time. (This value is hardcoded in the PROCEDURE.)

Parameters passed into FindNearest():

your Latitude -- -90..90 (not scaled -- see hardcoded conversion in PROCEDURE)
your Longitude -- -180..180 (not scaled)
Start distance -- (miles or km) -- see discussion below
Max distance -- in miles or km -- see hardcoded conversion in PROCEDURE

The function will find the nearest items, up to Limit that meet the Condition. But it will give up at Max distance. (If you are at the South Pole, why bother searching very far for the tenth pizza parlor?)

Because of the "scaling", "hardcoding", "Condition", the table name, etc, this PROCEDURE is not truly generic; the code must be modified for each application. Yes, I could have designed it to pass all that stuff in. But what a mess.

The "_start_dist" gives some control over the performance. Making this too small leads to extra iterations; too big leads to more rows being checked. If you choose to tune the Stored Procedure, do the following. "SELECT @iterations" after calling the SP for a number of typical values. If the value is usually 1, then decrease _start_dist. If it is usually 2 or more, then increase it.

Timing: Under 10ms for "typical" usage; any dataset size. Slower for pathological cases (low min distance, high max distance, crossing dateline, bad filtering, cold cache, etc)

End-cases:

By using GC distance, not Pythagoras, distances are 'correct' even near poles.
Poles -- Even if the "nearest" is almost 360 degrees away (longitude), it can find it.
Dateline -- There is a small, 'contained', piece of code for crossing the Dateline. Example: you are at +179 deg longitude, and the nearest item is at -179.

The procedure returns one resultset, SELECT *, distance.

Only rows that meet your Condition, within Max distance are returned
At most Limit rows are returned
The rows will be ordered, "closest" first.
"dist" will be in miles or km (based on a hardcoded constant in the SP)

Reference code, assuming deg*10000 and 'miles'

This version is based on scaling "Deg*10000 (MEDIUMINT)".

Postlog

There is a "Haversine" algorithm that is twice as fast as the GCDist function here. But it has a fatal flaw of sometimes returning NULL for the distance between a point and itself. (This is because of computing a number slightly bigger than 1.0, then trying to take the ACOS of it.)

Primary Keys with Nullable Columns

MariaDB deals with primary keys over nullable columns according to the SQL standards.

Take the following table structure:

CREATE TABLE t1(
  c1 INT NOT NULL AUTO_INCREMENT, 
  c2 INT NULL DEFAULT NULL, 
  PRIMARY KEY(c1,c2)
);

Column c2 is part of a primary key, and thus it cannot be NULL.

Before , MariaDB (as well as versions of MySQL before MySQL 5.7) would silently convert it into a NOT NULL column with a default value of 0.

Since , the column is converted to NOT NULL, but without a default value. If we then attempt to insert a record without explicitly setting c2, a warning (or, in strict mode, an error), will be thrown, for example:

MySQL, since 5.7, will abort such a CREATE TABLE with an error.

The behavior adheres to the SQL 2003 standard.

SQL-2003, Part II, “Foundation” says:

**11.7 **Syntax Rules

…

5) If the specifies PRIMARY KEY, then for each in the explicit or implicit for which NOT NULL is not specified, NOT NULL is implicit in the .

Essentially this means that all PRIMARY KEY columns are automatically converted to NOT NULL. Furthermore:

11.5 General Rules

…

3) When a site S is set to its default value,

…

b) If the data descriptor for the site includes a , then S is set to the value specified by that .

…

e) Otherwise, S is set to the null value.

There is no concept of “no default value” in the standard. Instead, a column always has an implicit default value of NULL. On insertion it might however fail the NOT NULL constraint. MariaDB and MySQL instead mark such a column as “not having a default value”. The end result is the same — a value must be specified explicitly or an INSERT will fail.

MariaDB since 10.1.7 behaves in a standard compatible manner — being part of a PRIMARY KEY, the nullable column gets an automatic NOT NULL constraint, on insertion one must specify a value for such a column. MariaDB before 10.1.7 was automatically assigning a default value of 0 — this behavior was non-standard. Issuing an error at CREATE TABLE time is also non-standard.

Storage Engine Index Types

This refers to the index_type definition when creating an index, i.e. BTREE, HASH or RTREE.

For more information on general types of indexes, such as primary keys, unique indexes etc, go to Getting Started with Indexes.

Storage Engine

Permitted Indexes

BTREE, RTREE

BTREE

BTREE is generally the default index type. For tables, HASH is the default. uses a particular data structure called fractal trees, which is optimized for data that do not entirely fit memory.

Understanding the B-tree and hash data structures can help predict how different queries perform on different storage engines that use these data structures in their indexes, particularly for the MEMORY storage engine that lets you choose B-tree or hash indexes. B-Tree Index Characteristics

B-tree Indexes

B-tree indexes are used for column comparisons using the >, >=, =, >=, < or BETWEEN operators, as well as for LIKE comparisons that begin with a constant.

For example, the query SELECT * FROM Employees WHERE First_Name LIKE 'Maria%'; can make use of a B-tree index, while SELECT * FROM Employees WHERE First_Name LIKE '%aria'; cannot.

B-tree indexes also permit leftmost prefixing for searching of rows.

If the number or rows doesn't change, hash indexes occupy a fixed amount of memory, which is lower than the memory occupied by BTREE indexes.

Hash Indexes

Hash indexes, in contrast, can only be used for equality comparisons, so those using the = or <=> operators. They cannot be used for ordering, and provide no information to the optimizer on how many rows exist between two values.

Hash indexes do not permit leftmost prefixing - only the whole index can be used.

R-tree Indexes

See for more information.

_{This page is licensed: CC BY-SA / Gnu FDL}

Full-Text Indexes

Implement full-text indexes in MariaDB Server for efficient text search. This section guides you through creating and utilizing these indexes to optimize queries on large text datasets.

Full-Text Index Overview

MariaDB has support for full-text indexing and searching:

A full-text index in MariaDB is an index of type FULLTEXT, and it allows more options when searching for portions of text from a field.
Full-text indexes can be used only with MyISAM, Aria, InnoDB and Mroonga tables, and can be created only for CHAR, VARCHAR, or TEXT columns.
Partitioned tables cannot contain fulltext indexes, even if the storage engine supports them.
A FULLTEXT index definition can be given in the statement when a table is created, or added later using or .
For large data sets, it is much faster to load your data into a table that has no FULLTEXT index and then create the index after that, than to load data into a table that has an existing FULLTEXT index.

Full-text searching is performed using syntax. MATCH() takes a comma-separated list that names the columns to be searched. AGAINST takes a string to search for, and an optional modifier that indicates what type of search to perform. The search string must be a literal string, not a variable or a column name.

Excluded Results

Partial words are excluded.
Words less than 4 (MyISAM) or 3 (InnoDB) characters in length will not be stored in the fulltext index. This value can be adjusted by changing the system variable (or, for , ).
Words longer than 84 characters in length will also not be stored in the fulltext index. This values can be adjusted by changing the system variable (or, for , ).

Relevance

MariaDB calculates a relevance for each result, based on a number of factors, including the number of words in the index, the number of unique words in a row, the total number of words in both the index and the result, and the weight of the word. In English, 'cool' will be weighted less than 'dandy', at least at present! The relevance can be returned as part of a query simply by using the MATCH function in the field list.

Types of Full-Text search

IN NATURAL LANGUAGE MODE

IN NATURAL LANGUAGE MODE is the default type of full-text search, and the keywords can be omitted. There are no special operators, and searches consist of one or more comma-separated keywords.

Searches are returned in descending order of relevance.

IN BOOLEAN MODE

Boolean search permits the use of a number of special operators:

Operator

Description

Searches are not returned in order of relevance, and nor does the 50% limit apply. Stopwords and word minimum and maximum lengths still apply as usual.

WITH QUERY EXPANSION

A query expansion search is a modification of a natural language search. The search string is used to perform a regular natural language search. Then, words from the most relevant rows returned by the search are added to the search string and the search is done again. The query returns the rows from the second search. The IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION or WITH QUERY EXPANSION modifier specifies a query expansion search. It can be useful when relying on implied knowledge within the data, for example that MariaDB is a database.

Examples

Creating a table, and performing a basic search:

Multiple words:

Since 'Once' is a , no result is returned:

Inserting the word 'wicked' into more than half the rows excludes it from the results:

Using IN BOOLEAN MODE to overcome the 50% limitation:

Returning the relevance:

WITH QUERY EXPANSION. In the following example, 'MariaDB' is always associated with the word 'database', so it is returned when query expansion is used, even though not explicitly requested.

Partial word matching with IN BOOLEAN MODE:

Using boolean operators

Building the best INDEX for a given SELECT

The problem

You have a SELECT and you want to build the best INDEX for it. This blog is a "cookbook" on how to do that task.

A short algorithm that works for many simpler SELECTs and helps in complex queries.
Examples of the algorithm, plus digressions into exceptions and variants
Finally a long list of "other cases".

The hope is that a newbie can quickly get up to speed, and his/her INDEXes will no longer smack of "newbie".

Many edge cases are explained, so even an expert may find something useful here.

Algorithm

Here's the way to approach creating an INDEX, given a SELECT. Follow the steps below, gathering columns to put in the INDEX in order. When the steps give out, you usually have the 'perfect' index.

Given a WHERE with a bunch of expressions connected by AND: Include the columns (if any), in any order, that are compared to a constant and not hidden in a function.
You get one more chance to add to the INDEX; do the first of these that applies:

2a. One column used in a 'range' -- BETWEEN, '>', LIKE w/o leading wildcard, etc.
2b. All columns, in order, of the GROUP BY.
2c. All columns, in order, of the ORDER BY if there is no mixing of ASC and DESC.

Digression

This blog assumes you know the basic idea behind having an INDEX. Here is a refresher on some of the key points.

Virtually all INDEXes in MySQL are structured as BTrees BTrees allow very efficient for

Given a key, find the corresponding row(s);
"Range scans" -- That is start at one value for the key and repeatedly find the "next" (or "previous") row.

A PRIMARY KEY is a UNIQUE KEY; a UNIQUE KEY is an INDEX. ("KEY" == "INDEX".)

Every InnoDB table has a PRIMARY KEY. While there is a default if you do not specify one, it is best to explicitly provide a PK.

For completeness: MyISAM works differently. All indexes (including the PK) are in separate BTrees. The leaf node of such BTrees have a pointer (usually a byte offset) into the data file.

All discussion here assumes InnoDB tables, however most statements apply to other Engines.

First, some examples

Those equate to

Think about this example as I talk about "=" versus "range" in the Algorithm, below.

Algorithm, step 1 (WHERE "column = const")

WHERE aaa = 123 AND ... : an INDEX starting with aaa is good.
WHERE aaa = 123 AND bbb = 456 AND ... : an INDEX starting with aaa and bbb is good. In this case, it does not matter whether aaa or bbb comes first in the INDEX.
xxx IS NULL : this acts like "= const" for this discussion.

(If there are no "=" parts AND'd in the WHERE clause, move on to step 2 without any columns in your putative INDEX.)

Algorithm, step 2

Find the first of 2a / 2b / 2c that applies; use it; then quit. If none apply, then you are through gathering columns for the index.

In some cases it is optimal to do step 1 (all equals) plus step 2c (ORDER BY).

Algorithm, step 2a (one range)

A "range" shows up as

aaa >= 123 -- any of <, <=, >=, >; but not <>, !=
aaa BETWEEN 22 AND 44
sss LIKE 'blah%' -- but not sss LIKE '%blah'

If there are more parts to the WHERE clause, you must stop now.

Complete examples (assume nothing else comes after the snippet)

WHERE aaa >= 123 AND bbb = 1 ⇒ INDEX(bbb, aaa) (WHERE order does not matter; INDEX order does)
WHERE aaa >= 123 ⇒ INDEX(aaa)
WHERE aaa >= 123 AND ccc > 'xyz' ⇒ INDEX(aaa) or INDEX(ccc) (only one range)

Algorithm, step 2b (GROUP BY)

If you are GROUPing BY an expression (including function calls), you cannot use the GROUP BY; stop.

Complete examples (assume nothing else comes after the snippet)

WHERE aaa = 123 AND bbb = 1 GROUP BY ccc ⇒ INDEX(bbb, aaa, ccc) or INDEX(aaa, bbb, ccc) (='s first, in any order; then the GROUP BY)
WHERE aaa >= 123 GROUP BY xxx ⇒ INDEX(aaa) (You should have stopped with Step 2a)
GROUP BY x,y ⇒ INDEX(x,y) (no WHERE)

Algorithm, step 2c (ORDER BY)

If there is a ORDER BY, all the columns of the ORDER BY should now be added, in the specified order, to the INDEX you are building.

If there are multiple columns in the ORDER BY, and there is a mixture of ASC and DESC, do not add the ORDER BY columns; they won't help; stop.

If you are ORDERing BY an expression (including function calls), you cannot use the ORDER BY; stop.

Complete examples (assume nothing else comes after the snippet)

WHERE aaa = 123 GROUP BY ccc ORDER BY ddd ⇒ INDEX(aaa, ccc) -- should have stopped with Step 2b
WHERE aaa = 123 GROUP BY ccc ORDER BY ccc ⇒ INDEX(aaa, ccc) -- the ccc will be used for both GROUP BY and ORDER BY
WHERE aaa = 123 ORDER BY xxx ASC, yyy DESC ⇒ INDEX(aaa) -- mixture of ASC and DESC.

WHERE aaa = 123 GROUP BY ccc ORDER BY ccc LIMIT 10 ⇒ INDEX(aaa, ccc)
WHERE aaa = 123 ORDER BY ccc LIMIT 10 ⇒ INDEX(aaa, ccc)
ORDER BY ccc LIMIT 10 ⇒ INDEX(ccc)

(It does not make much sense to have a LIMIT without an ORDER BY, so I do not discuss that case.)

Algorithm end

You have collected a few columns; put them in INDEX and ADD that to the table. That will often produce a "good" index for the SELECT you have. Below are some other suggestions that may be relevant.

An example of the Algorithm being 'wrong':

On the other hand, the Algorithm is 'right' with

That would call for a compound index starting with a flag: INDEX(flag, date). Such an index is likely to be very beneficial. And it is likely to be more beneficial than INDEX(date).

There are too many variables to say whether it is better to keep the index or to toss it.

In this case, shortening the index may be beneficial:

Changing to INDEX(z) would make for less work for the UPDATE, but might hurt some SELECT. It depends on the frequency of each, plus many more factors.

Limitations

(There are exceptions to some of these.)

You may not create an index bigger than 3KB.
You may not include a column that equates to bigger than some value (767 bytes -- VARCHAR(255) CHARACTER SET utf8).
You can deal with big fields using "prefix" indexing; but see below.
You should not have more than 5 columns in an index. (This is just a Rule of Thumb; nothing prevents having more.)

Flags and low cardinality

("20%" is really somewhere between 10% and 30%, depending on the phase of the moon.)

"Covering" indexes

Mini-cookbook:

Gather the list of column(s) according to the "Algorithm", above.
Add to the end of the list the rest of the columns seen in the SELECT, in any order.

Examples:

SELECT x FROM t WHERE y = 5; ⇒ INDEX(y,x) -- The algorithm said just INDEX(y)
SELECT x,z FROM t WHERE y = 5 AND q = 7; ⇒ INDEX(y,q,x,z) -- y and q in either order (Algorithm), then x and z in either order (covering).
SELECT x FROM t WHERE y > 5 AND q > 7; ⇒ INDEX(y,q,x) -- y or q first (that's as far as the Algorithm goes), then the other two fields afterwards.

The speedup you get might be minor, or it might be spectacular; it is hard to predict.

But...

It is not wise to build an index with lots of columns. Let's cut it off at 5 (Rule of Thumb).
Prefix indexes cannot 'cover', so don't use them anywhere in a 'covering' index.
There are limits (3KB?) on how 'wide' an index can be, so "covering" may not be possible.

Redundant/excessive indexes

INDEX(a,b) can find anything that INDEX(a) could find. So you don't need both. Get rid of the shorter one.

Notice in the cookbook how it says "in any order" in a few places. If, for example, you have both of these (in different SELECTs):

WHERE a=1 AND b=2 begs for either INDEX(a,b) or INDEX(b,a)
WHERE a>1 AND b=2 begs only for INDEX(b,a) Include only INDEX(b,a) since it handles both cases with only one INDEX.

Optimizer picks ORDER BY

This becomes especially beneficial if there is a LIMIT.

But there is a problem. There could be two situations, and the Optimizer is sometimes not smart enough to see which case applies:

If the WHERE does very little filtering, fetching the rows in ORDER BY order avoids a sort and has little wasted effort (because of 'little filtering'). Using the INDEX matching the ORDER BY is better in this case.
If the WHERE does a lot of filtering, the ORDER BY is wasting a lot of time fetching rows only to filter them out. Using an INDEX matching the WHERE clause is better.

What should you do? If you think the "little filtering" is likely, then create an index with the ORDER BY columns in order and hope that the Optimizer uses it when it should.

OR

Cases...

WHERE a=1 OR a=2 -- This is turned into WHERE a IN (1,2) and optimized that way.
WHERE a=1 OR b=2 usually cannot be optimized.
WHERE x.a=1 OR y.b=2 This is even worse because of using two different tables.

A workaround is to use UNION. Each part of the UNION is optimized separately. For the second case:

The third case (OR across 2 tables) is similar to the second.

If you originally had a LIMIT, UNION gets complicated. If you started with ORDER BY z LIMIT 190, 10, then the UNION needs to be

TEXT / BLOB

Example of a prefix index:

The index for me would contain 'Ja', 'Rick'. That's not useful for distinguishing between 'Jamison', 'Jackson', 'James', etc., so the index is so close to useless that the optimizer often ignores it.

Probably never do UNIQUE(foo(20)) because this applies a uniqueness constraint on the first 20 characters of the column, not the whole column!

Dates

DATE, DATETIME, etc. are tricky to compare against.

Some tempting, but inefficient, techniques:

All must do a full scan. (On the other hand, it can handy to use GROUP BY LEFT(date_col, 7) for monthly grouping, but that is not an INDEX issue.)

This is efficient, and can use an index:

EXPLAIN Key_len

IN

IN (1,99,3) is sometimes optimized as efficiently as "=", but not always. Older versions of MySQL did not optimize it as well as newer versions. (5.6 is possibly the main turning point.)

IN ( SELECT ... )

The SELECT expressions will need "a." prefixing the column names.

Alas, there are cases where the pattern is hard to follow.

5.6 does some optimizing, but probably not as good as the JOIN.

If there is a JOIN or GROUP BY or ORDER BY LIMIT in the subquery, that complicates the JOIN in new format. So, it might be better to use this pattern:

Caveat: If you end up with two subqueries JOINed together, note that neither has any indexes, hence performance can be very bad. (5.6 improves on it by dynamically creating indexes for subqueries.)

Explode/Implode

This explode + implode, itself, is costly. It would be better to avoid them if possible.

Sometimes the following will work.

Using DISTINCT or GROUP BY to counteract the explosion

When using second table just to check for existence:

Many-to-many mapping table

Do it this way.

Notes:

Lack of an AUTO_INCREMENT id for this table -- The PK given is the 'natural' PK; there is no good reason for a surrogate.
"MEDIUMINT" -- This is a reminder that all INTs should be made as small as is safe (smaller ⇒ faster). Of course the declaration here must match the definition in the table being linked to.
"UNSIGNED" -- Nearly all INTs may as well be declared non-negative
"NOT NULL" -- Well, that's true, isn't it?

To conditionally INSERT new links, use

Note that if you had an AUTO_INCREMENT in this table, IODKU would "burn" ids quite rapidly.

Subqueries and UNIONs

Each subquery SELECT and each SELECT in a UNION can be considered separately for finding the optimal INDEX.

Exception: In a "correlated" ("dependent") subquery, the part of the WHERE that depends on the outside table is not easily factored into the INDEX generation. (Cop out!)

JOINs

The optimizer usually picks the "first" table based on these hints:

STRAIGHT_JOIN forces the table order.
The WHERE clause limits which rows needed (whether indexed or not).
The table to the "left" in a LEFT JOIN usually comes before the "right" table. (By looking at the table definitions, the optimizer may decide that "LEFT" is irrelevant.)
The current INDEXes will encourage an order.

PARTITIONing

PARTITIONing is rarely a substitute for a good INDEX.

FULLTEXT

FULLTEXT is now implemented in InnoDB as well as MyISAM. It provides a way to search for "words" in TEXT columns. This is much faster (when it is applicable) than col LIKE '%word%'.

always(?) uses the FULLTEXT index first. That is, the whole Algorithm is invalidated when one of the ANDs is a MATCH.

Signs of a Newbie

No "compound" (aka "composite") indexes
No PRIMARY KEY
Redundant indexes (especially blatant is PRIMARY KEY(id), KEY(id))
Most or all columns individually indexes ("But I indexed everything")

Speeding up wp_postmeta

The published table (see Wikipedia) is

The problems:

The AUTO_INCREMENT provides no benefit; in fact it slows down most queries and clutters disk.
Much better is PRIMARY KEY(post_id, meta_key) -- clustered, handles both parts of usual JOIN.
BIGINT is overkill, but that can't be fixed without changing other tables.
VARCHAR(255) can be a problem in 5.6 with utf8mb4; see workarounds below.

The solutions:

Postlog

Initial posting: March, 2015; Refreshed Feb, 2016; Add DATE June, 2016; Add WP example May, 2017.

The tips in this document apply to MySQL, MariaDB, and Percona.

Latitude/Longitude Indexing

The problem

You might have tried

INDEX(lat), INDEX(lon) -- but the optimizer used only one
INDEX(lat,lon) -- but it still had to work too hard
Sometimes you ended up with a full table scan -- Yuck.

WHERE < ... -- No chance of using any index.

WHERE lat BETWEEN ... AND lng BETWEEN... -- This has some chance of using such indexes.

The goal is to look only at records "close", in both directions, to the target lat/lng.

A solution -- first, the principles

It works. Not perfectly, but better than the alternatives.

How many PARTITIONs? It does not matter a lot. Some thoughts:

90 partitions - 2 degrees each. (I don't like tables with too many partitions; 90 seems like plenty.)
50-100 - evenly populated. (This requires code. For 2.7M placenames, 85 partitions varied from 0.5 degrees to very wide partitions at the poles.)
Don't have more than 100 partitions, there are inefficiencies in the partition implementation.

How to PARTITION? Well, MariaDB and MySQL are very picky. So / are out. is out. So, we are stuck with some kludge. Essentially, we need to convert Lat/Lng to some size of and use PARTITION BY RANGE.

Representation choices

To get to a datatype that can be used in PARTITION, you need to "scale" the latitude and longitude. (Consider only the *INTs; the other datatypes are included for comparison)

(Sorted by resolution)

What these mean...

for latitude and DECIMAL(5,2) for longitude will take 2+3 bytes and have no better resolution than Deg*100.

SMALLINT scaled -- Convert latitude into a SMALLINT SIGNED by doing (degrees / 90 * 32767) and rounding; longitude by (degrees / 180 * 32767).

Sure, you could do DEG_1000 and other "in between" cases, but there is no advantage. DEG_1000 takes as much space as DEG*10000, but has less resolution.

GCDist -- compute "great circle distance"

GCDist is a helper FUNCTION that correctly computes the distance between two points on the globe.

GCDist() takes 4 scaled DOUBLEs -- lat1, lon1, lat2, lon2 -- and returns a scaled number of "degrees" representing the distance.

69.172 miles/degree of latitude
10000 units per degree for the scaling chosen
5280 feet / mile.

(No, this function does not compensate for the Earth being an oblate spheroid, etc.)

Required table structure

There will be one table (plus normalization tables as needed). The one table must be partitioned and indexed as indicated below.

Fields and indexes

PARTITION BY RANGE(lat)
lat -- scaled latitude (see above)
lon -- scaled longitude
PRIMARY KEY(lon, lat, ...) -- lon must be first; something must be added to make it UNIQUE

For most of this discussion, lat is assumed to be MEDIUMINT -- scaled from -90 to +90 by multiplying by 10000. Similarly for lon and -180 to +180.

The PRIMARY KEY must

start with lon since the algorithm needs the "clustering" that InnoDB will provide, and
include lat somewhere, since it is the PARTITION key, and
contain something to make the key UNIQUE (lon+lat is unlikely to be sufficient).

The FindNearest PROCEDURE will do multiple SELECTs something like this:

The query planner will

Do PARTITION "pruning" based on the latitude; then
Within a PARTITION (which is effectively a table), use lon do a 'clustered' range scan; then
Use the "condition" to filter down to the rows you desire, plus recheck lat. This design leads to very few disk blocks needing to be read, which is the main goal of the design.

Note that this does not even call GCDist. That comes in the last pass when the ORDER BY and LIMIT are used.

The algorithm

The algorithm is embodied in a because of its complexity.

You feed it a starting width for a "square" and a number of items to find.
It builds a "square" around where you are.
A SELECT is performed to see how many items are in the square.
Loop, doubling the width of the square, until enough items are found.

The next section ("Performance") should make this a bit clearer as it walks through some examples.

Performance

Because of all the variations, it is hard to get a meaningful benchmark. So, here is some hand-waving instead.

So, scanning the square will involve as little as one block; rarely more than a few blocks. The number of blocks is mostly independent of the dataset size.

The primary use case for this algorithm is when the data is significantly larger than will fit into cache (the buffer_pool). Hence, the main goal is to minimize the number of disk hits.

Now let's look at some edge cases, and argue that the number of blocks is still better (usually) than with traditional indexing techniques.

2-degree partitions might be good for a global table of stores or restaurants. A 5-mile starting distance might be good when filtering for Starbucks. 20 miles might be better for a department store.

Discussion of reference code

Here's the gist of the FindNearest().

Make a guess at how close to "me" to look.
See how many items are in a 'square' around me, after filtering.
If not enough, repeat, doubling the width of the square.
After finding enough, or giving up because we are looking "too far", make one last pass to get all the data, ORDERed and LIMITed

Parameters passed into FindNearest():

your Latitude -- -90..90 (not scaled -- see hardcoded conversion in PROCEDURE)
your Longitude -- -180..180 (not scaled)
Start distance -- (miles or km) -- see discussion below
Max distance -- in miles or km -- see hardcoded conversion in PROCEDURE

Timing: Under 10ms for "typical" usage; any dataset size. Slower for pathological cases (low min distance, high max distance, crossing dateline, bad filtering, cold cache, etc)

End-cases:

By using GC distance, not Pythagoras, distances are 'correct' even near poles.
Poles -- Even if the "nearest" is almost 360 degrees away (longitude), it can find it.
Dateline -- There is a small, 'contained', piece of code for crossing the Dateline. Example: you are at +179 deg longitude, and the nearest item is at -179.

The procedure returns one resultset, SELECT *, distance.

Only rows that meet your Condition, within Max distance are returned
At most Limit rows are returned
The rows will be ordered, "closest" first.
"dist" will be in miles or km (based on a hardcoded constant in the SP)

Reference code, assuming deg*10000 and 'miles'

This version is based on scaling "Deg*10000 (MEDIUMINT)".

Optimization and Indexes

Building the best INDEX for a given SELECT

The problem

Algorithm

Digression

First, some examples

Algorithm, step 1 (WHERE "column = const")

Algorithm, step 2

Algorithm, step 2a (one range)

Algorithm, step 2b (GROUP BY)

Algorithm, step 2c (ORDER BY)

Algorithm end

Limitations

Flags and low cardinality

"Covering" indexes

Redundant/excessive indexes

Optimizer picks ORDER BY

OR

TEXT / BLOB

Dates

EXPLAIN Key_len

IN

Explode/Implode

Many-to-many mapping table

Subqueries and UNIONs

JOINs

PARTITIONing

FULLTEXT

Signs of a Newbie

Speeding up wp_postmeta

Postlog

See also

Compound (Composite) Indexes

A mini-lesson in "compound indexes" ("composite indexes")

Ignored Indexes

Syntax

Index Statistics

How Index Statistics Help the Query Optimizer

Latitude/Longitude Indexing

The problem

A solution -- first, the principles

Representation choices

GCDist -- compute "great circle distance"

Required table structure

The algorithm

Performance

Discussion of reference code

Reference code, assuming deg*10000 and 'miles'

Postlog

See also

Primary Keys with Nullable Columns

See Also

Storage Engine Index Types

B-tree Indexes

Hash Indexes

R-tree Indexes

Full-Text Indexes

Full-Text Index Overview

Excluded Results

Relevance

Types of Full-Text search

IN NATURAL LANGUAGE MODE

IN BOOLEAN MODE

WITH QUERY EXPANSION

Examples

See Also

Optimization and Indexes

Building the best INDEX for a given SELECT

The problem

Algorithm

Digression

First, some examples

Algorithm, step 1 (WHERE "column = const")

Algorithm, step 2

Algorithm, step 2a (one range)

Algorithm, step 2b (GROUP BY)

Algorithm, step 2c (ORDER BY)

Algorithm end

Limitations

Flags and low cardinality