Semi-join materialization strategy

You are viewing an old version of this article. View the current version here.

Semi-join Materialization is a special kind of subquery materialization that's used for Semi-join subqueries. It actually includes two strategies:

  • Materialization/lookup
  • Materialization/scan

The idea

Let us again look at the query that finds countries in Europe that have big cities:

select * from Country 
where Country.code IN (select City.Country 
                       from City 
                       where City.Population > 7*1000*1000)
      and Country.continent='Europe'

The subquery is uncorrelated, that is, we can run it independently of the upper query. The idea of semi-join materialization is to do just that, and fill a temporary table with possible values of City.country field of big cities, and then do a join with countries in Europe:

sj-materialization1

The join can be done in two directions:

  1. From Materialized table to countries in Europe
  2. From Countries in Europe to materialized table

The first way involves doing a full scan on the materialized table, so we call it "Materialization-scan".

If you run a join from Countries to materialized table, the cheapest way to find a match in materialized table is to make a lookup on its primary key (it has one: we used it to remove duplicates). Because of that, we call the strategy "Materialization-lookup".

Semi-join materialization in action

Materialization-Scan

If we chose to look for cities with population greater than 7 million, the optimizer will use Materialization-Scan and the EXPLAIN will show this:

MariaDB [world]> explain select * from Country where Country.code IN (select City.Country from City where  City.Population > 7*1000*1000);
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
| id | select_type  | table       | type   | possible_keys      | key        | key_len | ref                | rows | Extra                 |
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
|  1 | PRIMARY      | <subquery2> | ALL    | distinct_key       | NULL       | NULL    | NULL               |   15 |                       |
|  1 | PRIMARY      | Country     | eq_ref | PRIMARY            | PRIMARY    | 3       | world.City.Country |    1 |                       |
|  2 | MATERIALIZED | City        | range  | Population,Country | Population | 4       | NULL               |   15 | Using index condition |
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
3 rows in set (0.01 sec)

Here, you can see:

  • There are still two SELECTs (look for columns with id=1 and id=2)
  • The second select (with id=2) has select_type=MATERIALIZED. This means it will be executed and its results will be stored in a temporary table with a unique key over all columns. The unique key is there to prevent the table from containing any duplicate records.
  • The first select got table named <subquery2>. This is the table that we've got as a result of materialization of the select with id=2.

The optimizer chose to do a full scan over the materialized table, so this is an example of use of Materialization-Scan strategy.

As for execution costs, we're going to read 15 rows from table City, write 15 rows to materialized table, read them back (the optimizer assumes there won't be any duplicates), and then do 15 eq_ref accesses to table Country. In total, we'll do 45 reads and 15 writes.

If you run the EXPLAIN in MySQL, you'll get this:

MySQL [world]> explain select * from Country where Country.code IN (select City.Country from City where  City.Population > 7*1000*1000);
+----+--------------------+---------+-------+--------------------+------------+---------+------+------+------------------------------------+
| id | select_type        | table   | type  | possible_keys      | key        | key_len | ref  | rows | Extra                              |
+----+--------------------+---------+-------+--------------------+------------+---------+------+------+------------------------------------+
|  1 | PRIMARY            | Country | ALL   | NULL               | NULL       | NULL    | NULL |  239 | Using where                        |
|  2 | DEPENDENT SUBQUERY | City    | range | Population,Country | Population | 4       | NULL |   15 | Using index condition; Using where |
+----+--------------------+---------+-------+--------------------+------------+---------+------+------+------------------------------------+

which is a plan to do 239 + 239*15 = 3824 table reads.

Materialization-Lookup

Let's modify the query slightly and look for countries that have cities with population over one millon instead of seven:

MariaDB [world]> explain select * from Country where Country.code IN (select City.Country from City where  City.Population > 1*1000*1000) ;
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
| id | select_type  | table       | type   | possible_keys      | key          | key_len | ref  | rows | Extra                 |
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
|  1 | PRIMARY      | Country     | ALL    | PRIMARY            | NULL         | NULL    | NULL |  239 |                       |
|  1 | PRIMARY      | <subquery2> | eq_ref | distinct_key       | distinct_key | 3       | func |    1 |                       |
|  2 | MATERIALIZED | City        | range  | Population,Country | Population   | 4       | NULL |  238 | Using index condition |
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
3 rows in set (0.00 sec)

the EXPLAIN output is similar to one of Materialization-scan, except that

  • <subquery2> table is accessed with eq_ref access method
  • the access uses an index named distinct_key

This means that we're going to do index lookups into the materialized table, in other words, we're going to use Materialization-lookup strategy.

In MySQL (or with optimizer_switch='semijoin=off,materialization=off'), one will get this EXPLAIN:

MySQL [world]> explain select * from Country where Country.code IN (select City.Country from City where  City.Population > 1*1000*1000) ;
+----+--------------------+---------+----------------+--------------------+---------+---------+------+------+-------------+
| id | select_type        | table   | type           | possible_keys      | key     | key_len | ref  | rows | Extra       |
+----+--------------------+---------+----------------+--------------------+---------+---------+------+------+-------------+
|  1 | PRIMARY            | Country | ALL            | NULL               | NULL    | NULL    | NULL |  239 | Using where |
|  2 | DEPENDENT SUBQUERY | City    | index_subquery | Population,Country | Country | 3       | func |   18 | Using where |
+----+--------------------+---------+----------------+--------------------+---------+---------+------+------+-------------+

One can see that both plans with full scan on Country table. For the second step, MariaDB will fill the materialized table (238 rows read+written, once) and then do a 1-row lookup every time, which gives 239 unique key lookups. In total, second step will cost 237+238=477 reads and 239 temp.table writes.

MySQL's plan for the second step is to read 18 rows using index 'Country' every time. This gives 18*239=4302 reads. Had there been fewer subquery invocations, this plan would have been better than the one with Materialization. By the way, MariaDB considered this plan too (see FirstMatch Strategy, but did not choose it.

Factsheet

Semi-join materialization

  • Is shown in EXPLAIN as type=MATERIALIZED for the subquery, and a line with table=<derivedN> in the parent subquery.
  • Is enabled when one has both materialization=on and semijoin=on

Comments

Comments loading...
Content reproduced on this site is the property of its respective owners, and this content is not reviewed in advance by MariaDB. The views, information and opinions expressed by this content do not necessarily represent those of MariaDB or any other party.