Non-semi-join subquery optimizations
!!! WORK IN PROGRESS !!!
Certain kinds of IN-subqueries canot be flattened into semi-joins. These subqueries can be both correlated or non-correlated. In order to provide consistent performance in all cases, MariaDB provides several alternative strategies for these types of subqueries. Whenever several strategies are possible, the optimizer chooses the optimal one based on cost estimates.
The two primary non-semi-join strategies are materialization (also called outside-in materialization), and in-to-exists tranformation. Materialization is applicable only for non-correlated subqueries, while in-to-exist can be used both for correlated and non-correlated subqueries.
Applicability
An IN subquery cannot be flattened into a semi-join in the following cases. The examples below use the World database from the MariaDB regression test suite.
Subquery in a disjunction (OR)
The subquery is located directly or indirectly under an OR operation in the WHERE clause of the outer query.
Query pattern:
SELECT ... FROM ... WHERE (expr1, ..., exprN) [NOT] IN (SELECT ... ) OR expr;
Example:
SELECT Name FROM Country WHERE (Code IN (select Country from City where City.Population > 100000) OR Name LIKE 'L%') AND surfacearea > 1000000;
Negated subquery predicate (NOT IN)
The subquery predicate itself is negated.
Query pattern:
SELECT ... FROM ... WHERE ... (expr1, ..., exprN) NOT IN (SELECT ... ) ...;
Example:
SELECT Country.Name FROM Country, CountryLanguage WHERE Code NOT IN (SELECT Country FROM CountryLanguage WHERE Language = 'English') AND CountryLanguage.Language = 'French' AND Code = Country;
Subqueries with GROUP BY, HAVING, and aggregate functions
The subquery has at leas one of a: GROUP BY clause, HAVING clause, or aggregate functions.
Query pattern:
SELECT ... FROM ... WHERE ... (expr1, ..., exprN) [NOT] IN (SELECT field1, ..., aggr_func1 ... FROM ... WHERE ... GROUP BY group-expr HAVING having-expr) ...;
Example:
select * from Country, City where capital = id and (City.name in (SELECT name FROM City GROUP BY name HAVING Count(*) > 2);
Subquery in the SELECT or HAVING clause
The subquery is located in the SELECT or HAVING clauses of the outer query.
Query pattern:
SELECT field1, ..., (SELECT ...) WHERE ...; SELECT ... WHERE ... HAVING (SELECT ...);
Example:
select Name, City.id in (select capital from Country where capital is not null) as is_capital from City where City.population > 10000000;
Subquery with a UNION
The subquery itself is a UNION, while the IN predicate may be anywhere in the query where IN is allowed.
Query pattern:
... [NOT] IN (SELECT ... UNION SELECT ...)
Example:
SELECT * from City where (Name, 91) IN (SELECT Name, round(Population/1000) FROM City WHERE Country = "IND" AND Population > 2500000 UNION SELECT Name, round(Population/1000) FROM City WHERE Country = "IND" AND Population < 100000);
Materialization for non-correlated IN-subqueries
Materialization basics
The basic idea of subquery materialization is to execute the subquery and store its result in an internal temporary table. Naturally this is possible only when the subquery is non-correlated. Since an IN predicate is only interested whether its left operand is present in the subquery result, subquery materialization skips all duplicate result rows. The resulting temporary table is then indexed on all its fields. If the size of the temporary table is less than the tmp_table_size system variable, the table is a hash-indexed in-memory HEAP table. In the rare cases when the subquery exceeds this limit, the subquery result is stored on disk in an ARIA or MyISAM B-tree indexed table.
Subquery materialization happens on demand during the first execution of the IN predicate. Once the subquery is materialized, the IN predicate is checked very efficiently by index lookups of the outer expression into the index of the materialized temporary table.
NULL-aware efficient execution
Limitations
- Blob fields
- Incomparable fields
The IN-TO-EXISTS transformation
This optimization is the only subquery execution strategy that existed in older versions of MariaDB and MySQL prior to MariaDB 5.3. We have made various changes and fixed a number of bugs in this code as well, but in essense it remains the same.
For the time being we refer the reader to the MySQL documentation of this optimization.
Performance discussion
- Example improvement over MySQL 5.x
- General performance guidelines
Optimizer control
In certain cases it may be necessary to override the choice of the optimizer. Typically this is needed for benchmarking or testing purposes, or to mimic the behavior of an older version of the server, or if the optimizer made a poor choice.
All the above strategies can be controlled via the optimizer_switch system variable.
- materialization=on/off
In some very special cases, even if materialization was forced, it may still be impossible to
partial_match_rowid_merge=on/off
In addition to this switch, the system variable
rowid_merge_buff_size
controls the maximum memory available
to the Rowid-merge algorithm. By default it is: TODO.
partial_match_table_scan=on/off
- in_to_exists=on/off
Example: To tell the server that all non-semijoin ...
<<code>>
<</code>>
- tmp_table_size and max_heap_table_size