La strategia Semi-join Materialization

Stai visualizzando una vecchia versione di questo article. Visualizza la versione più recente.

La Semi-join Materialization è un tipo particolare di materializzazione delle subquery utilizzata per le subquery di tipo semi-join. In realtà comprende due strategie:

  • Materializzazione/ricerca
  • Materializzazione/scansione

L'idea

Si consideri una query che trova i Paesi in Europa che contengono grandi città:

select * from Country 
where Country.code IN (select City.Country 
                       from City 
                       where City.Population > 7*1000*1000)
      and Country.continent='Europe'

La subquery non è correlata, quindi può essere eseguita indipendentemente dalla query esterna. L'idea della materializzazione delle semi-join consiste appunto nel materializzarle, popolare una tabella temporanea con i possibili valori del campo City.country e infine eseguire una join con gli Stati europei:

sj-materialization1

La join può essere eseguita in due direzioni:

  1. Dalla tabella materializzata agli Stati in Europa
  2. Dagli Stati in Europa alla tabella materializzata

Il primo modo implica l'esecuzione di una scansione completa della tabella materializzata, perciò viene chiamato "Materialization-scan".

Se si esegue una join da Countries alla tabella materializzata, il modo meno costoso per trovare una corrispondenza nella tabella materializzata è eseguire una ricerca sulla chiave primaria (ne ha una: è stata usata per eliminare i duplicati). Per questo motivo. questa strategia si chiama "Materialization-lookup".

Semi-join materialization in azione

Materialization-Scan

If we chose to look for cities with a population greater than 7 million, the optimizer will use Materialization-Scan and EXPLAIN will show this:

MariaDB [world]> explain select * from Country where Country.code IN (select City.Country from City where  City.Population > 7*1000*1000);
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
| id | select_type  | table       | type   | possible_keys      | key        | key_len | ref                | rows | Extra                 |
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
|  1 | PRIMARY      | <subquery2> | ALL    | distinct_key       | NULL       | NULL    | NULL               |   15 |                       |
|  1 | PRIMARY      | Country     | eq_ref | PRIMARY            | PRIMARY    | 3       | world.City.Country |    1 |                       |
|  2 | MATERIALIZED | City        | range  | Population,Country | Population | 4       | NULL               |   15 | Using index condition |
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
3 rows in set (0.01 sec)

Here, you can see:

  • There are still two SELECTs (look for columns with id=1 and id=2)
  • The second select (with id=2) has select_type=MATERIALIZED. This means it will be executed and its results will be stored in a temporary table with a unique key over all columns. The unique key is there to prevent the table from containing any duplicate records.
  • The first select received the table name <subquery2>. This is the table that we got as a result of the materialization of the select with id=2.

The optimizer chose to do a full scan over the materialized table, so this is an example of a use of the Materialization-Scan strategy.

As for execution costs, we're going to read 15 rows from table City, write 15 rows to materialized table, read them back (the optimizer assumes there won't be any duplicates), and then do 15 eq_ref accesses to table Country. In total, we'll do 45 reads and 15 writes.

By comparison, if you run the EXPLAIN in MySQL, you'll get this:

MySQL [world]> explain select * from Country where Country.code IN (select City.Country from City where  City.Population > 7*1000*1000);
+----+--------------------+---------+-------+--------------------+------------+---------+------+------+------------------------------------+
| id | select_type        | table   | type  | possible_keys      | key        | key_len | ref  | rows | Extra                              |
+----+--------------------+---------+-------+--------------------+------------+---------+------+------+------------------------------------+
|  1 | PRIMARY            | Country | ALL   | NULL               | NULL       | NULL    | NULL |  239 | Using where                        |
|  2 | DEPENDENT SUBQUERY | City    | range | Population,Country | Population | 4       | NULL |   15 | Using index condition; Using where |
+----+--------------------+---------+-------+--------------------+------------+---------+------+------+------------------------------------+

...which is a plan to do (239 + 239*15) = 3824 table reads.

Materialization-Lookup

Let's modify the query slightly and look for countries which have cities with a population over one millon (instead of seven):

MariaDB [world]> explain select * from Country where Country.code IN (select City.Country from City where  City.Population > 1*1000*1000) ;
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
| id | select_type  | table       | type   | possible_keys      | key          | key_len | ref  | rows | Extra                 |
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
|  1 | PRIMARY      | Country     | ALL    | PRIMARY            | NULL         | NULL    | NULL |  239 |                       |
|  1 | PRIMARY      | <subquery2> | eq_ref | distinct_key       | distinct_key | 3       | func |    1 |                       |
|  2 | MATERIALIZED | City        | range  | Population,Country | Population   | 4       | NULL |  238 | Using index condition |
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
3 rows in set (0.00 sec)

The EXPLAIN output is similar to the one which used Materialization-scan, except that:

  • the <subquery2> table is accessed with the eq_ref access method
  • the access uses an index named distinct_key

This means that the optimizer is planning to do index lookups into the materialized table. In other words, we're going to use the Materialization-lookup strategy.

In MySQL (or with optimizer_switch='semijoin=off,materialization=off'), one will get this EXPLAIN:

MySQL [world]> explain select * from Country where Country.code IN (select City.Country from City where  City.Population > 1*1000*1000) ;
+----+--------------------+---------+----------------+--------------------+---------+---------+------+------+-------------+
| id | select_type        | table   | type           | possible_keys      | key     | key_len | ref  | rows | Extra       |
+----+--------------------+---------+----------------+--------------------+---------+---------+------+------+-------------+
|  1 | PRIMARY            | Country | ALL            | NULL               | NULL    | NULL    | NULL |  239 | Using where |
|  2 | DEPENDENT SUBQUERY | City    | index_subquery | Population,Country | Country | 3       | func |   18 | Using where |
+----+--------------------+---------+----------------+--------------------+---------+---------+------+------+-------------+

One can see that both plans will do a full scan on the Country table. For the second step, MariaDB will fill the materialized table (238 rows read from table City and written to the temporary table) and then do a unique key lookup for each record in table Country, which works out to 238 unique key lookups. In total, the second step will cost (239+238) = 477 reads and 238 temp.table writes.

MySQL's plan for the second step is to read 18 rows using an index on City.Country for each record it receives for table Country. This works out to a cost of (18*239) = 4302 reads. Had there been fewer subquery invocations, this plan would have been better than the one with Materialization. By the way, MariaDB has an option to use such a query plan, too (see FirstMatch Strategy), but it did not choose it.

A note about grouping in subqueries

TODO

Factsheet

Semi-join materialization

  • Can be used for uncorrelated IN-subqueries. The subselect may use grouping and/or aggregate functions.
  • Is shown in EXPLAIN as type=MATERIALIZED for the subquery, and a line with table=<subqueryN> in the parent subquery.
  • Is enabled when one has both materialization=on and semijoin=on in their @@optimizer_switch.
  • The materialization=on|off flag is shared with Non-semijoin materialization.

Commenti

Sto caricando i commenti......
Content reproduced on this site is the property of its respective owners, and this content is not reviewed in advance by MariaDB. The views, information and opinions expressed by this content do not necessarily represent those of MariaDB or any other party.