La strategia DuplicateWeedout
DuplicateWeedout
è una strategia di esecuzione per le subquery di tipo semi-join.
L'idea
L'idea è eseguire la semi-join come se fosse una normale inner join, e poi eliminare le combinazioni di record duplicate usando una tabella temporanea.
Si supponga di avere una query con la quale si cercano i Paesi aventi più del 33% percento della loro popolazione in una grande città:
select * from Country where Country.code IN (select City.Country from City where City.Population > 0.33 * Country.Population and City.Population > 1*1000*1000);
Prima si esegue una normale inner join tra le tabelle City
e Country
:
La Inner join produce dei duplicati. Avremo la Germania tre volte, perché ha tre grandi città.
Ma vediamo lo schema con DuplicateWeedout
:
Come si può vedere, è stata usata una tabella temporanea con una chiave primaria per evitare di produrre diversi record con 'Germany'.
DuplicateWeedout in azione
Start temporary
e End temporary
, presenti nell'ultimo diagramma, vengono mostrati nell'output di EXPLAIN
:
MariaDB [world]> explain select * from Country where Country.code IN (select City.Country from City where City.Population > 0.33 * Country.Population and City.Population > 1*1000*1000)\G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: City type: range possible_keys: Population,Country key: Population key_len: 4 ref: NULL rows: 238 Extra: Using index condition; Start temporary *************************** 2. row *************************** id: 1 select_type: PRIMARY table: Country type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 3 ref: world.City.Country rows: 1 Extra: Using where; End temporary 2 rows in set (0.00 sec)
This query will read 238 rows from the City
table, and for each of them will make a primary key lookup in the Country
table, which gives another 238 rows. This gives a total of 476 rows, and you need to add 238 lookups in the temporary table (which are typically *much* cheaper since the temporary table is in-memory).
If we run the same EXPLAIN in MySQL, we'll get:
mysql> explain select * from Country where Country.code IN (select City.Country from City where City.Population > 0.33 * Country.Population and City.Population > 1*1000*1000) *************************** 1. row *************************** id: 1 select_type: PRIMARY table: Country type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 239 Extra: Using where *************************** 2. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: City type: index_subquery possible_keys: Population,Country key: Country key_len: 3 ref: func rows: 18 Extra: Using where 2 rows in set (0.00 sec)
This plan will read (239 + 239*18) = 4541
rows, which is much slower.
Factsheet
DuplicateWeedout
is shown as "Start temporary/End temporary" inEXPLAIN
.- The strategy can handle correlated subqueries.
- But it cannot be applied if the subquery has meaningful
GROUP BY
and/or aggregate functions. DuplicateWeedout
allows the optimizer to freely mix a subquery's tables and the parent select's tables.- There is no separate @@optimizer_switch flag for
DuplicateWeedout
. The strategy can be disabled by switching off all semi-join optimizations withSET @@optimizer_switch='optimizer_semijoin=off'
command.