A first second look at InnoDB spatial indexes in the MySQL 5.7 april lab release …

Warning: it turned out that I used a full debug build instead of a release one, and that seems to make a hell of a difference … most numbers have been updated now, the old debug results and conclusions are still there but striked out (to document my own stupidity forever 😉 The MySQL 5.7 April Labs release comes with a preview of spatial indexes for InnoDB, something that I’ve heard rumours about for quite a while but so far couldn’t find any kind of actual confirmation for. Spatial indexes for InnoDB would more or less get rid of the last MyISAM-only feature (after the addition of fulltext indexing to InnoDB in MySQL 5.6) So it’s about time to have a closer look at what has been brewing at “the cathedrals crypt” …

Where to get and how to build

The snapshot releases are available on labs.mysql.com in source form and as linux x86 64bit binaries. Unfortunately there doesn’t seem to be a mechanism to directly link to a specific labs snapshot, so you have to go to http://labs.mysql.com/ and select

MySQL GIS, InnoDB R-tree, Parser Refactoring

from the select box on that page. I usually prefer to build MySQL binaries myself, but had a hard time with this snapshot as it added a Boost dependency but at the time I tried the build instructions had not been published yet, and even with these I ran into some extra Boost-related gotchas (bugs 72172, 72185, 72186, 72188). None of these were really critical, but annoying nonetheless …

Test setup

To put things to a real world test I took the MyISAM tables imported from OpenStreetMap data for a previous post and converted them to InnoDB using “ALTER TABLE … ENGINE=InnoDB” Originally I planned to do tests with the OSM data for all Germany, using my usual “post boxes in my hometown” query, but that turned out to take way too much time when using InnoDB instead of MyISAM (round about 10 to 100 times as long as with MyISAM) … so all tests below are using the OSM data for Sweden instead, with a data set having:

   338 012 points of interest
   901 154 lines (roads and other linear objects)
    76 550 roads or road segments (a subset of lines)
 1 119 979 polygons
 ---------
 2 435 695 geometry objects

Data import

As a first test I just converted the existing MyISAM tables to InnoDB using “ALTER TABLE … ENGINE=innodb”. This lead to the first surprise already as just converting the tables took about as long as the initial import into MyISAM with all the extra OpenStreetMap parsing and processing overhead.

  Table           to InnoDB     back to   to InnoDB
                                 MyISAM   no SPATIAL
  ----------------------------------------------------
  sweden_point        168s  16s         33s 3.7s         83s 5.5s
  sweden_line         341s  47s         77s  11s        199s 2.9s
  sweden_roads         28s 3.5s          6s 0.7s         16s 1.9s
  sweden_polygon      432s  49s        107s  12s        272s  27s

So converting the table from MyISAM to InnoDB takes about four to five times as long as the other way round, and building the spatial index takes about 40-50% 200% of the time for the InnoBD step. My first theory was that the OSM table schema might be to blame as the tables don’t have explicit primary keys and a lot of TEXT columns (although most of them being NULL most of the time). Changing the tables to have an explicit PK (an extra auto_increment column) and just the two attribute columns I was interested in (as VARCHAR instead of TEXT) didn’t have much of an effect. (Tables are be available for download as MyISAM files and as InnoDB transportable tablespaces) Also interesting: converting the sweden_point table to InnoDB without SPATIAL index: 83s, then adding the index: 190s, so even longer than converting it from MyISAM to InnoDB with SPATIAL index right away. Might be that this is due to a combination of the index not being created online, so that another full table conversion has to happen, combined with reading the full original table taking longer than scanning the MyISAM version.

Disk usage

Disk usage for the OSM data for Sweden (~2.4 million geometries in total)

  MyISAM without spatial index:  586M
            with spatial index:  726M (+140M, +~24%)

  Innodb without spatial index:  845M
            with spatial index: 1076M (+231M, +~27%)

MyISAM taking less space in general was to be expected. The MyISAM R-Tree implementation seems to take about 60 bytes per row while InnoDBs implementation comes out at around 100 bytes per row, this may partially be due to InnoDB secondary indexes carrying around a copy of the rows primary key though …

Dump and restore

mysqldump results are identical, with the exception of the ENGINE=… in CREATE statements, as expected. mysqldump takes about 25s16s for MyISAM and about 160s20s for InnoDB, presence or absence of spatial indexes doesn’t make a difference, again as expected. Restore times do not differ as much as dump times, and the absolute overhead for creating the SPATIAL index even seems to be slightly lower with the new InnoDB implementation):

  MyISAM without spatial index:  308s 94s
            with spatial index:  725s 116s (+417s, +~135%) (+22s, +~23%)

  Innodb without spatial index:  892s 144s
            with spatial index: 1284s 216s (+392s, +~44%) (+72s, +~50%)

Spatial queries #1

So lets start with some actual spatial queries now, using my favourite example “All post boxes in $city”, here: find all post boxes in the city of Lund

  SELECT COUNT(*) 
    FROM sweden_point p 
    JOIN sweden_polygon c 
      ON ST_CONTAINS(c.way, p.way) 
   WHERE c.name='Lund' 
     AND p.amenity='post_box'
  ;

Results with warm caches are:

  MyISAM without index:  4.1s 1.05s
  MyISAM with index:     2.3s 0.69s
  InnoDB without index:  32s 1.03s
  InnoDB with index:     26s 0.66s

So while the relative gain from adding a spatial index is similar (around 1/3rd of the original query time saved) InnoDB query times are by about an order of magnitude worse than with MyISAM. Not very encouraging yet … So both MyISAM and InnoDB are actually playing in the same field …

Spatial queries #2

Part of the problem on the previous query was that most of the query time seems to be spent on scanning the polygon table to find the border polygon for the city of Lund. So lets try a simplified version of the above query where we pass in the MBR bounding box of the cities border as constant instead of looking it up in the polygon table (using the MBR to keep the POLYGON string short):

  SELECT COUNT(*) 
    FROM sweden_point 
   WHERE ST_CONTAINS( GeomFromText('POLYGON((13.1499887 55.678392
                                            ,13.2554854 55.678392
                                            ,13.2554854 55.7345506
                                            ,13.1499887 55.7345506
                                            ,13.1499887 55.678392))')
                    , way)
     AND amenity = 'post_box';

Results with warm caches:

  MyISAM without index:  0.76s 0.24s
  MyISAM with index:     0.05s 0.02s 
  InnoDB without index:  6.45s 0.24s
  InnoDB with index:     0.26s 0.02s

Here we see a much higher gain from spatial indexes, but InnoDB is still off by a factor of about five compared to MyISAM. Still not that encouraging yet … Again both results about equal … looks as if I need to redo the tests with the larger data set for Germany once again …

Spatial queries #3

So part of the story seems to be that InnoDB isn’t as good at full table scans as MyISAM. Lets try to rule that part out by adding indexes on the amenity and name columns and re-doing the query from part #1. Now things look a little better with MyISAM at 0.18s and InnoDB at 0.20s, but here the SPATIAL index doesn’t get used at all in favour of the new indexes, and the ST_Contains() condition is applied after the index scan only … Again results are pretty similar now …

Conclusion

InnoDB now having SPATIAL INDEX support is nice, but for now it can mostly only be seen as a proof of concept as it is not a match to the original MyISAM implementation performance wise yet. (Especially as import speed is the main pain point with OpenStreetMap imports anyway …) Turns out that query performance wise InnoDB is actually as fast as MyISAM now, give or take a few percent. So definitely not just proof-of-concept after all. My import performance pain point sort of remains, there is still a substantial difference between MyISAM and InnoDB here, but with the non-debug build the mysqldump import time is at least no longer as long as the total openstreetmap conversion plus import time … so it is not going to double the overall run time but probably by only about 20-30% … that’s still quite a bit with OpenStreetMap import times being in the range of ~20h for Germany in my current osm2pgsql port (and being slower than PostGIS by a factor of 5-10 with MyISAM already) …