All pages
Powered by GitBook
1 of 10

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Subquery Optimizations

Optimize subqueries in MariaDB Server for improved performance. This section provides techniques and best practices to ensure your nested queries execute efficiently and enhance overall query speed.

Condition Pushdown Into IN subqueries

This article describes Condition Pushdown into IN subqueries as implemented in MDEV-12387.

optimizer_switch flag name: condition_pushdown_for_subquery.

This page is licensed: CC BY-SA / Gnu FDL

Conversion of Big IN Predicates Into Subqueries

Starting from , the optimizer converts certain big IN predicates into IN subqueries.

That is, an IN predicate in the form

is converted into an equivalent IN-subquery:

which opens new opportunities for the query optimizer.

The conversion happens if the following conditions are met:

  • the IN list has more than 1000 elements (One can control it through the parameter).

  • the [NOT] IN condition is at the top level of the WHERE/ON clause.

Controlling the Optimization
  • The optimization is on by default. (and debug builds prior to that) introduced the in_predicate_conversion_threshold variable. Set to 0 to disable the optimization.

Benefits of the Optimization

If column is a key-prefix, MariaDB optimizer will process the condition

by trying to construct a range access. If the list is large, the analysis may take a lot of memory and CPU time. The problem gets worse when column is a part of a multi-column index and the query has conditions on other parts of the index.

Conversion of IN predicates into subqueries bypass the range analysis, which means the query optimization phase will use less CPU and memory.

Possible disadvantages of the conversion are are:

  • The optimization may convert 'IN LIST elements' key accesses to a table scan (if there is no other usable index for the table)

  • The estimates for the number of rows matching the IN (...) are less precise.

See Also

  • IN operator

Links

MDEV-12176

This page is licensed: CC BY-SA / Gnu FDL

in_predicate_conversion_threshold

Subquery Optimizations Map

Below is a map showing all types of subqueries allowed in the SQL language, and the optimizer strategies available to handle them.

  • Uncolored areas represent different kinds of subqueries, for example:

    • Subqueries that have form x IN (SELECT ...)

    • Subqueries that are in the FROM clause

    • .. and so forth

  • The size of each uncolored area roughly corresponds to how important (i.e. frequently used) that kind of subquery is. For example, x IN (SELECT ...) queries are the most important, and EXISTS (SELECT ...) are relatively unimportant.

  • Colored areas represent optimizations/execution strategies that are applied to handle various kinds of subqueries.

  • The color of optimization indicates which version of MySQL/MariaDB it was available in (see legend below)

Some things are not on the map:

  • MariaDB doesn't evaluate expensive subqueries when doing optimization (this means, EXPLAIN is always fast). MySQL 5.6 has made a progress in this regard, but its optimizer will still evaluate certain kinds of subqueries (for example, scalar-context subqueries used in range predicates)

Links to pages about individual optimizations:

See also

This page is licensed: CC BY-SA / Gnu FDL

Optimizing GROUP BY and DISTINCT Clauses in Subqueries

A DISTINCT clause and a GROUP BY without a corresponding HAVING clause have no meaning in IN/ALL/ANY/SOME/EXISTS subqueries. The reason is that IN/ALL/ANY/SOME/EXISTS only check if an outer row satisfies some condition with respect to all or any row in the subquery result. Therefore is doesn't matter if the subquery has duplicate result rows or not - if some condition is true for some row of the subquery, this condition will be true for all duplicates of this row. Notice that GROUP BY without a corresponding HAVING clause is equivalent to a DISTINCT.

and later versions automatically remove DISTINCT and GROUP BY without HAVING if these clauses appear in an IN/ALL/ANY/SOME/EXISTS subquery. For instance:

SELECT * FROM t1
WHERE t1.a > ALL(SELECT DISTINCT b FROM t2 WHERE t2.c > 100)

is transformed to:

SELECT * FROM t1
WHERE t1.a > ALL(SELECT b FROM t2 WHERE t2.c > 100)

Removing these unnecessary clauses allows the optimizer to find more efficient query plans because it doesn't need to take care of post-processing the subquery result to satisfy DISTINCT / GROUP BY.

This page is licensed: CC BY-SA / Gnu FDL

EXISTS-to-IN Optimization

MySQL (including MySQL 5.6) has only one execution strategy for EXISTS subqueries. The strategy is essentially the straightforward, "naive" execution, without any rewrites.

introduced a rich set of optimizations for IN subqueries. Since then, it makes sense to convert an EXISTS subquery into an IN so that the new optimizations can be used.

EXISTS will be converted into IN in two cases:

  1. Trivially correlated EXISTS subqueries

  2. Semi-join EXISTS

COLUMN [NOT] IN (const1, const2, .... )
column [NOT] IN (select ... from temporary_table)
COLUMN [NOT] IN (const1, const2, .... )

FirstMatch

  • Materialization, +scan, +lookup

  • LooseScan

  • DuplicateWeedout execution strategy

  • Non-semi-join Materialization (including NULL-aware and partial matching)

  • Derived table optimizations

    • Derived table merge

    • Derived table with keys

  • IN->EXISTS
    Subquery Caching
    Semi-join optimizations
    Table pullout
    We will now describe these two cases in detail

    Trivially-correlated EXISTS subqueries

    Often, EXISTS subquery is correlated, but the correlation is trivial. The subquery has form

    and "outer_col" is the only place where the subquery refers to outside fields. In this case, the subquery can be re-written into uncorrelated IN:

    (NULL values require some special handling, see below). For uncorrelated IN subqueries, MariaDB is able a cost-based choice between two execution strategies:

    • IN-to-EXISTS (basically, convert back into EXISTS)

    • Materialization

    That is, converting trivially-correlated EXISTS into uncorrelated IN gives query optimizer an option to use Materialization strategy for the subquery.

    Currently, EXISTS->IN conversion works only for subqueries that are at top level of the WHERE clause, or are under NOT operation which is directly at top level of the WHERE clause.

    Semi-join EXISTS subqueries

    If EXISTS subquery is an AND-part of the WHERE clause:

    then it satisfies the main property of semi-join subqueries:

    with semi-join subquery, we're only interested in records of outer_tables that have matches in the subquery

    Semi-join optimizer offers a rich set of execution strategies for both correlated and uncorrelated subqueries. The set includes FirstMatch strategy which is an equivalent of how EXISTS suqueries are executed, so we do not lose any opportunities when converting an EXISTS subquery into a semi-join.

    In theory, it makes sense to convert all kinds of EXISTS subqueries: convert both correlated and uncorrelated ones, convert irrespectively of whether the subquery has inner=outer equality.

    In practice, the subquery will be converted only if it has inner=outer equality. Both correlated and uncorrelated subqueries are converted.

    Handling of NULL values

    TODO: rephrase this:

    • IN has complicated NULL-semantics. NOT EXISTS doesn't.

    • EXISTS-to-IN adds IS NOT NULL before the subquery predicate, when required

    Control

    The optimization is controlled by the exists_to_in flag in optimizer_switch. Before , the optimization was OFF by default. Since , it has been ON by default.

    Limitations

    EXISTS-to-IN doesn't handle

    • subqueries that have GROUP BY, aggregate functions, or HAVING clause

    • subqueries are UNIONs

    • a number of degenerate edge cases

    This page is licensed: CC BY-SA / Gnu FDL

    EXISTS (SELECT ...  FROM ... WHERE outer_col= inner_col AND inner_where)
    outer_col IN (SELECT inner_col FROM ... WHERE inner_where)
    SELECT ... FROM outer_tables WHERE EXISTS (SELECT ...) AND ...

    Table Pullout Optimization

    Table pullout is an optimization for Semi-join subqueries.

    The idea of Table Pullout

    Sometimes, a subquery can be re-written as a join. For example:

    If we know that there can be, at most, one country with a given value of Country.Code (we can tell that if we see that table Country has a primary key or unique index over that column), we can re-write this query as:

    Table pullout in action

    If one runs for the above query in MySQL 5.1-5.6 or -5.2, they'll get this plan:

    It shows that the optimizer is going to do a full scan on table City, and for each city it will do a lookup in table Country.

    If one runs the same query in , they will get this plan:

    The interesting parts are:

    • Both tables have select_type=PRIMARY, and id=1 as if they were in one join.

    • The Country table is first, followed by the City table.

    Indeed, if one runs EXPLAIN EXTENDED; SHOW WARNINGS, they will see that the subquery is gone and it was replaced with a join:

    Changing the subquery into a join allows feeding the join to the join optimizer, which can make a choice between two possible join orders:

    1. City -> Country

    2. Country -> City

    as opposed to the single choice of

    1. City->Country

    which we had before the optimization.

    In the above example, the choice produces a better query plan. Without pullout, the query plan with a subquery would read (4079 + 1*4079)=8158 table records. With table pullout, the join plan would read (37 + 37 * 18) = 703 rows. Not all row reads are equal, but generally, reading 10 times fewer table records is faster.

    Table pullout fact sheet

    • Table pullout is possible only in semi-join subqueries.

    • Table pullout is based on UNIQUE/PRIMARY key definitions.

    • Doing table pullout does not cut off any possible query plans, so MariaDB will always try to pull out as much as possible.

    Controlling table pullout

    There is no separate @@optimizer_switch flag for table pullout. Table pullout can be disabled by switching off all semi-join optimizations withSET @@optimizer_switch='semijoin=off' command.

    This page is licensed: CC BY-SA / Gnu FDL

    SELECT *
    FROM City 
    WHERE City.Country IN (SELECT Country.Code
                           FROM Country 
                           WHERE Country.Population < 100*1000);
    SELECT City.* 
    FROM 
      City, Country 
    WHERE
     City.Country=Country.Code AND Country.Population < 100*1000;
    Table pullout is able to pull individual tables out of subqueries to their parent selects. If all tables in a subquery have been pulled out, the subquery (i.e. its semi-join) is removed completely.
  • One common bit of advice for optimizing MySQL has been "If possible, rewrite your subqueries as joins". Table pullout does exactly that, so manual rewrites are no longer necessary.

  • EXPLAIN
    MySQL [world]> EXPLAIN SELECT * FROM City WHERE City.Country IN (SELECT Country.Code FROM Country WHERE Country.Population < 100*1000);
    +----+--------------------+---------+-----------------+--------------------+---------+---------+------+------+-------------+
    | id | select_type        | table   | type            | possible_keys      | key     | key_len | ref  | rows | Extra       |
    +----+--------------------+---------+-----------------+--------------------+---------+---------+------+------+-------------+
    |  1 | PRIMARY            | City    | ALL             | NULL               | NULL    | NULL    | NULL | 4079 | Using where |
    |  2 | DEPENDENT SUBQUERY | Country | unique_subquery | PRIMARY,Population | PRIMARY | 3       | func |    1 | Using where |
    +----+--------------------+---------+-----------------+--------------------+---------+---------+------+------+-------------+
    2 rows in set (0.00 sec)
    MariaDB [world]> EXPLAIN SELECT * FROM City WHERE City.Country IN (SELECT Country.Code FROM Country WHERE Country.Population < 100*1000);
    +----+-------------+---------+-------+--------------------+------------+---------+--------------------+------+-----------------------+
    | id | select_type | table   | type  | possible_keys      | key        | key_len | ref                | rows | Extra                 |
    +----+-------------+---------+-------+--------------------+------------+---------+--------------------+------+-----------------------+
    |  1 | PRIMARY     | Country | range | PRIMARY,Population | Population | 4       | NULL               |   37 | Using index condition |
    |  1 | PRIMARY     | City    | ref   | Country            | Country    | 3       | world.Country.Code |   18 |                       |
    +----+-------------+---------+-------+--------------------+------------+---------+--------------------+------+-----------------------+
    2 rows in set (0.00 sec)
    MariaDB [world]> SHOW warnings\G
    *************************** 1. row ***************************
      Level: Note
       Code: 1003
    Message: SELECT `world`.`City`.`ID` AS `ID`,`world`.`City`.`Name` AS 
    `Name`,`world`.`City`.`Country` AS `Country`,`world`.`City`.`Population` AS 
    `Population` 
    
      
       FROM `world`.`City` JOIN `world`.`Country` WHERE 
    
    
    ((`world`.`City`.`Country` = `world`.`Country`.`Code`) and (`world`.`Country`.
    `Population` < (100 * 1000)))
    1 row in set (0.00 sec)

    Subquery Cache

    The goal of the subquery cache is to optimize the evaluation of correlated subqueries by storing results together with correlation parameters in a cache and avoiding re-execution of the subquery in cases where the result is already in the cache.

    Administration

    The cache is on by default. One can switch it off using the optimizer_switch subquery_cache setting, like so:

    The efficiency of the subquery cache is visible in 2 statistical variables:

    • - Global counter for all subquery cache hits.

    • - Global counter for all subquery cache misses.

    The session variables and influence the size of in-memory temporary tables in the table used for caching. It cannot grow more than the minimum of the above variables values (see the section for details).

    Visibility

    Your usage of the cache is visible in EXTENDED EXPLAIN output (warnings) as"<expr_cache><//list of parameters//>(//cached expression//)". For example:

    In the example above the presence of"<expr_cache><test.t1.a>(...)" is how you know you are using the subquery cache.

    Implementation

    Every subquery cache creates a temporary table where the results and all parameters are stored. It has a unique index over all parameters. First the cache is created in a table (if doing this is impossible the cache becomes disabled for that expression). When the table grows up to the minimum oftmp_table_size and max_heap_table_size, the hit rate will be checked:

    • if the hit rate is really small (<0.2) the cache will be disabled.

    • if the hit rate is moderate (<0.7) the table will be cleaned (all records deleted) to keep the table in memory

    • if the hit rate is high the table will be converted to a disk table (for 5.3.0 it can only be converted to a disk table).

    Performance Impact

    Here are some examples that show the performance impact of the subquery cache (these tests were made on a 2.53 GHz Intel Core 2 Duo MacBook Pro with dbt-3 scale 1 data set).

    example
    cache on
    cache off
    gain
    hit
    miss
    hit rate
    1
    2
    3
    4

    Example 1

    Dataset from DBT-3 benchmark, a query to find customers with balance near top in their nation:

    Example 2

    DBT-3 benchmark, Query #17

    Example 3

    DBT-3 benchmark, Query #2

    Example 4

    DBT-3 benchmark, Query #20

    See Also

    • blog post describing impact of subquery cache optimization on queries used by DynamicPageList MediaWiki extension

    • Another use case from the real world

    This page is licensed: CC BY-SA / Gnu FDL

    SET optimizer_switch='subquery_cache=off';

    miss

    hit rate

    1

    1.01sec

    1 hour 31 min 43.33sec

    5445x

    149975

    25

    99.98%

    2

    0.21sec

    1.41sec

    6.71x

    6285

    220

    96.6%

    3

    2.54sec

    2.55sec

    1.00044x

    151

    461

    24.67%

    4

    1.87sec

    1.95sec

    0.96x

    0

    23026

    0%

    example

    cache on

    cache off

    gain

    hit

    Subquery_cache_hit
    Subquery_cache_miss
    tmp_table_size
    max_heap_table_size
    Implementation
    MEMORY
    Query cache
    mariadb-subquery-cache-in-real-use-case.html
    EXPLAIN EXTENDED SELECT * FROM t1 WHERE a IN (SELECT b FROM t2);
    +----+--------------------+-------+------+---------------+------+---------+------+------+----------+-------------+
    | id | select_type        | table | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
    +----+--------------------+-------+------+---------------+------+---------+------+------+----------+-------------+
    |  1 | PRIMARY            | t1    | ALL  | NULL          | NULL | NULL    | NULL |    2 |   100.00 | Using where |
    |  2 | DEPENDENT SUBQUERY | t2    | ALL  | NULL          | NULL | NULL    | NULL |    2 |   100.00 | Using where |
    +----+--------------------+-------+------+---------------+------+---------+------+------+----------+-------------+
    2 rows in set, 1 warning (0.00 sec)
    
    SHOW WARNINGS;
    +-------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Level | Code | Message                                                                                                                                                                                                    |
    +-------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Note  | 1003 | SELECT `test`.`t1`.`a` AS `a` from `test`.`t1` WHERE <expr_cache><`test`.`t1`.`a`>(<in_optimizer>(`test`.`t1`.`a`,<exists>(SELECT 1 FROM `test`.`t2` WHERE (<cache>(`test`.`t1`.`a`) = `test`.`t2`.`b`)))) |
    +-------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    1 row in set (0.00 sec)
    hit rate = hit / (hit + miss)
    SELECT COUNT(*) FROM customer 
    WHERE 
       c_acctbal > 0.8 * (SELECT MAX(c_acctbal) 
                          FROM customer C 
                          WHERE C.c_nationkey=customer.c_nationkey
                          GROUP BY c_nationkey);
    SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly 
    FROM lineitem, part 
    WHERE 
      p_partkey = l_partkey AND 
      p_brand = 'Brand#42' AND p_container = 'JUMBO BAG' AND 
      l_quantity < (SELECT 0.2 * AVG(l_quantity) FROM lineitem 
                    WHERE l_partkey = p_partkey);
    SELECT
            s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
    FROM
            part, supplier, partsupp, nation, region
    WHERE
            p_partkey = ps_partkey AND s_suppkey = ps_suppkey AND p_size = 33
            AND p_type LIKE '%STEEL' AND s_nationkey = n_nationkey
            AND n_regionkey = r_regionkey AND r_name = 'MIDDLE EAST'
            AND ps_supplycost = (
                    SELECT
                            MIN(ps_supplycost)
                    FROM
                            partsupp, supplier, nation, region
                    WHERE
                            p_partkey = ps_partkey AND s_suppkey = ps_suppkey
                            AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
                            AND r_name = 'MIDDLE EAST'
            )
    ORDER BY
            s_acctbal DESC, n_name, s_name, p_partkey;
    SELECT
            s_name, s_address
    FROM
            supplier, nation
    WHERE
            s_suppkey IN (
                    SELECT
                            DISTINCT (ps_suppkey)
                    FROM
                            partsupp, part
                    WHERE
                            ps_partkey=p_partkey
                            AND p_name LIKE 'indian%'
                            AND ps_availqty > (
                                    SELECT
                                            0.5 * SUM(l_quantity)
                                    FROM
                                            lineitem
                                    WHERE
                                            l_partkey = ps_partkey
                                            AND l_suppkey = ps_suppkey
                                            AND l_shipdate >= '1995-01-01'
                                            AND l_shipdate < date_ADD('1995-01-01',INTERVAL 1 YEAR)
                                    )
            )
            AND s_nationkey = n_nationkey AND n_name = 'JAPAN'
    ORDER BY
            s_name;

    Semi-join Subquery Optimizations

    MariaDB has a set of optimizations specifically targeted at semi-join subqueries.

    What is a Semi-Join Subquery

    A semi-join subquery has a form of

    that is, the subquery is an IN-subquery and it is located in the WHERE clause. The most important part here is

    with semi-join subquery, we're only interested in records of outer_tables that have matches in the subquery

    Let's see why this is important. Consider a semi-join subquery:

    One can execute it "naturally", by starting from countries in Europe and checking if they have populous Cities:

    The semi-join property also allows "backwards" execution: we can start from big cities, and check which countries they are in:

    To contrast, let's change the subquery to be non-semi-join:

    It is still possible to start from countries, and then check

    • if a country has any big cities

    • if it has a large surface area:

    The opposite, city-to-country way is not possible. This is not a semi-join.

    Difference from Inner Joins

    Semi-join operations are similar to regular relational joins. There is a difference though: with semi-joins, you don't care how many matches an inner table has for an outer row. In the above countries-with-big-cities example, Germany will be returned once, even if it has three cities with populations of more than one million each.

    Semi-Join Optimizations in MariaDB

    MariaDB uses semi-join optimizations to run IN subqueries.The optimizations are enabled by default. You can disable them by turning off their like so:

    MariaDB has five different semi-join execution strategies:

    See Also

    • blog post

    This page is licensed: CC BY-SA / Gnu FDL

    SELECT ... FROM outer_tables WHERE expr IN (SELECT ... FROM inner_tables ...) AND ...
    SELECT * FROM Country 
    WHERE 
      Continent='Europe' AND 
      Country.Code IN (SELECT City.country 
                       FROM City 
                       WHERE City.Population>1*1000*1000);
    DuplicateWeedout execution strategy
    optimizer_switch
    Table pullout optimization
    FirstMatch execution strategy
    Semi-join Materialization execution strategy
    LooseScan execution strategy
    Subquery Optimizations Map
    "Observations about subquery use cases"
    http:en.wikipedia.org/wiki/Semijoin
    semi-join-outer-to-inner
    semi-join-inner-to-outer
    non-semi-join-subquery
    SELECT * FROM Country 
    WHERE 
       Country.Continent='Europe' AND 
       (Country.Code IN (SELECT City.country 
                       FROM City WHERE City.Population>1*1000*1000) 
        OR Country.SurfaceArea > 100*1000  -- Added this part
       );
    SET optimizer_switch='semijoin=off'
    MariaDB 10.3
    MariaDB 10.3.18
    MariaDB 5.3
    Subquery optimizations in MariaDB 5.3
    MariaDB 5.3
    MariaDB 10.0.12
    MariaDB 10.0.12
    MariaDB 5.1
    MariaDB 5.3

    Non-semi-join Subquery Optimizations

    Certain kinds of IN-subqueries cannot be flattened into semi-joins. These subqueries can be both correlated or non-correlated. In order to provide consistent performance in all cases, MariaDB provides several alternative strategies for these types of subqueries. Whenever several strategies are possible, the optimizer chooses the optimal one based on cost estimates.

    The two primary non-semi-join strategies are materialization (also called outside-in materialization), and in-to-exists transformation. Materialization is applicable only for non-correlated subqueries, while in-to-exist can be used both for correlated and non-correlated subqueries.

    Applicability

    An IN subquery cannot be flattened into a semi-join in the following cases. The examples below use the World database from the MariaDB regression test suite.

    Subquery in a disjunction (OR)

    The subquery is located directly or indirectly under an OR operation in the WHERE clause of the outer query.

    Query pattern:

    Example:

    Negated subquery predicate (NOT IN)

    The subquery predicate itself is negated.

    Query pattern:

    Example:

    Subquery in the SELECT or HAVING clause

    The subquery is located in the SELECT or HAVING clauses of the outer query.

    Query pattern:

    Example:

    Subquery with a UNION

    The subquery itself is a UNION, while the IN predicate may be anywhere in the query where IN is allowed.

    Query pattern:

    Example:

    Materialization for non-correlated IN-subqueries

    Materialization basics

    The basic idea of subquery materialization is to execute the subquery and store its result in an internal temporary table indexed on all its columns. Naturally, this is possible only when the subquery is non-correlated. The IN predicate tests whether its left operand is present in the subquery result. Therefore it is not necessary to store duplicate subquery result rows in the temporary table. Storing only unique subquery rows provides two benefits - the size of the temporary table is smaller, and the index on all its columns can be unique.

    If the size of the temporary table is less than the tmp_table_size system variable, the table is a hash-indexed in-memory HEAP table. In the rare cases when the subquery result exceeds this limit, the temporary table is stored on disk in an ARIA or MyISAM B-tree indexed table (ARIA is the default).

    Subquery materialization happens on demand during the first execution of the IN predicate. Once the subquery is materialized, the IN predicate is evaluated very efficiently by index lookups of the outer expression into the unique index of the materialized temporary table. If there is a match, IN is TRUE, otherwise IN is FALSE.

    NULL-aware efficient execution

    An IN predicate may produce a NULL result if there is a NULL value in either of its arguments. Depending on its location in a query, a NULL predicate value is equivalent to FALSE. These are the cases when substituting NULL with FALSE would reject exactly the same result rows. A NULL result of IN is indistinguishable from a FALSE if the IN predicate is:

    • not negated,

    • not a function argument,

    • inside a WHERE or ON clause.

    In all these cases the evaluation of IN is performed as described in the previous paragraph via index lookups into the materialized subquery. In all remaining cases when NULL cannot be substituted with FALSE, it is not possible to use index lookups. This is not a limitation in the server, but a consequence of the NULL semantics in the ANSI SQL standard.

    Suppose an IN predicate is evaluated as

    , that is, the left operand of IN is a NULL value, and there are no NULLs in the subquery. In this case the value of IN is neither FALSE, nor TRUE. Instead it is NULL. If we were to perform an index lookup with the NULL as a key, such a value would not be found in not_null_col, and the IN predicate would incorrectly produce a FALSE.

    In general, an NULL value on either side of an IN acts as a "wildcard" that matches any value, and if a match exists, the result of IN is NULL. Consider the following example:

    If the left argument of IN is the row: (7, NULL, 9), and the result of the right subquery operand of IN is the table:

    The the IN predicate matches the row (7, 11, 9), and the result of IN is NULL. Matches where the differing values on either side of the IN arguments are matched by a NULL in the other IN argument, are called partial matches.

    In order to efficiently compute the result of an IN predicate in the presence of NULLs, MariaDB implements two special algorithms for .

    • Rowid-merge partial matching This technique is used when the number of rows in the subquery result is above a certain limit. The technique creates special indexes on some of the columns of the temporary table, and merges them by alternative scanning of each index thus performing an operation similar to set-intersection.

    • Table scan partial matching This algorithm is used for very small tables when the overhead of the rowid-merge algorithm is not justifiable. Then the server simply scans the materialized subquery, and checks for partial matches. Since this strategy doesn't need any in-memory buffers, it is also used when there is not enough memory to hold the indexes of the rowid-merge strategy.

    Limitations

    In principle the subquery materialization strategy is universal, however, due to some technical limitations in the MariaDB server, there are few cases when the server cannot apply this optimization.

    • BLOB fields Either the left operand of an IN predicate refers to a BLOB field, or the subquery selects one or more BLOBs.

    • Incomparable fields TODO

    In the above cases, the server reverts to the transformation.

    The IN-TO-EXISTS transformation

    This optimization is the only subquery execution strategy that existed in older versions of MariaDB and MySQL prior to . We have made various changes and fixed a number of bugs in this code as well, but in essence it remains the same.

    Performance discussion

    Example speedup over MySQL 5.x and /5.2

    Depending on the query and data, either of the two strategies described here may result in orders of magnitude better/worse plan than the other strategy. Older versions of MariaDB and any current MySQL version (including MySQL 5.5, and MySQL 5.6 DMR as of July 2011) implement only the IN-TO-EXISTS transformation. As illustrated below, this strategy is inferior in many common cases to subquery materialization.

    Consider the following query over the data of the DBT3 benchmark scale 10. Find customers with top balance in their nations:

    The times to run this query is as follows:

    • Execution time in /MySQL 5.x (any MySQL): > 1 h The query takes more than one hour (we didn't wait longer), which makes it impractical to use subqueries in such cases. The EXPLAIN below shows that the subquery was transformed into a correlated one, which indicates an IN-TO-EXISTS transformation.

    • Execution time in : 43 sec In it takes less than a minute to run the same query. The EXPLAIN shows that the subquery remains uncorrelated, which is an indication that it is being executed via subquery materialization.

    The speedup here is practically infinite, because both MySQL and older MariaDB versions cannot complete the query in any reasonable time.

    In order to show the benefits of partial matching we extended the customer table from the DBT3 benchmark with two extra columns:

    • c_pref_nationkey - preferred nation to buy from,

    • c_pref_brand - preferred brand.

    Both columns are prefixed with the percent NULL values in the column, that is, c_pref_nationkey_05 contains 5% NULL values.

    Consider the query "Find all customers that didn't buy from a preferred country, and from a preferred brand withing some date ranges":

    • Execution time in /MySQL 5.x (any MySQL): 40 sec

    • Execution time in : 2 sec

    The speedup for this query is 20 times.

    Performance guidelines

    TODO

    Optimizer control

    In certain cases it may be necessary to override the choice of the optimizer. Typically this is needed for benchmarking or testing purposes, or to mimic the behavior of an older version of the server, or if the optimizer made a poor choice.

    All the above strategies can be controlled via the following switches in system variable.

    • materialization=on/off In some very special cases, even if materialization was forced, the optimizer may still revert to the IN-TO-EXISTS strategy if materialization is not applicable. In the cases when materialization requres partial matching (because of the presense of NULL values), there are two subordinate switches that control the two partial matching strategies:

      • partial_match_rowid_merge=on/off This switch controls the Rowid-merge strategy. In addition to this switch, the system variable controls the maximum memory available to the Rowid-merge strategy.

      • partial_match_table_scan=on/off Controls the alternative partial match strategy that performs matching via a table scan.

    The two main optimizer switches - materialization and in_to_exists cannot be simultaneously off. If both are set to off, the server will issue an error.

    This page is licensed: CC BY-SA / Gnu FDL

    in_to_exists=on/off This switch controls the IN-TO-EXISTS transformation.

  • tmp_table_size and max_heap_table_size system variables The tmp_table_size system variable sets the upper limit for internal MEMORY temporary tables. If an internal temporary table exceeds this size, it is converted automatically into a Aria or MyISAM table on disk with a B-tree index. Notice however, that a MEMORY table cannot be larger than max_heap_table_size.

  • partial matching, described here in detail
    IN-TO-EXISTS
    optimizer_switch
    rowid_merge_buff_size
    MariaDB 5.3
    MariaDB 5.1
    MariaDB 5.2
    MariaDB 5.3
    MariaDB 5.3
    MariaDB 5.2
    MariaDB 5.3
    SELECT ... FROM ... WHERE (expr1, ..., exprN) [NOT] IN (SELECT ... ) OR expr;
    SELECT Name FROM Country
    WHERE (Code IN (SELECT Country FROM City WHERE City.Population > 100000) OR
           Name LIKE 'L%') AND
          surfacearea > 1000000;
    SELECT ... FROM ... WHERE ... (expr1, ..., exprN) NOT IN (SELECT ... ) ...;
    SELECT Country.Name
    FROM Country, CountryLanguage 
    WHERE Code NOT IN (SELECT Country FROM CountryLanguage WHERE LANGUAGE = 'English')
      AND CountryLanguage.Language = 'French'
      AND Code = Country;
    SELECT field1, ..., (SELECT ...)  WHERE ...;
    SELECT ...  WHERE ... HAVING (SELECT ...);
    SELECT Name, City.id IN (SELECT capital FROM Country WHERE capital IS NOT NULL) AS is_capital
    FROM City
    WHERE City.population > 10000000;
    ... [NOT] IN (SELECT ... UNION SELECT ...)
    SELECT * FROM City WHERE (Name, 91) IN
    (SELECT Name, round(Population/1000) FROM City WHERE Country = "IND" AND Population > 2500000
    UNION
     SELECT Name, round(Population/1000) FROM City WHERE Country = "IND" AND Population < 100000);
    NULL IN (SELECT
    not_null_col FROM t1)
    (7, 8, 10)
    (6, NULL, NULL)
    (7, 11, 9)
    SELECT * FROM part
    WHERE p_partkey IN
          (SELECT l_partkey FROM lineitem
           WHERE l_shipdate BETWEEN '1997-01-01' AND '1997-02-01')
    ORDER BY p_retailprice DESC LIMIT 10;
    +--+------------------+--------+--------------+-------------------+----+------+---------------------------+
    |id|select_type       |table   |type          |key                |ref |rows  |Extra                      |
    +--+------------------+--------+--------------+-------------------+----+------+---------------------------+
    | 1|PRIMARY           |part    |ALL           |NULL               |NULL|199755|Using where; Using filesort|
    | 2|DEPENDENT SUBQUERY|lineitem|index_subquery|i_l_suppkey_partkey|func|    14|Using where                |
    +--+------------------+--------+--------------+-------------------+----+------+---------------------------+
    +--+------------+-----------+------+------------------+----+------+-------------------------------+
    |id|select_type |table      |type  |key               |ref |rows  |Extra                          |
    +--+------------+-----------+------+------------------+----+------+-------------------------------+
    | 1|PRIMARY     |part       |ALL   |NULL              |NULL|199755|Using temporary; Using filesort|
    | 1|PRIMARY     |<subquery2>|eq_ref|distinct_key      |func|     1|                               |
    | 2|MATERIALIZED|lineitem   |range |l_shipdate_partkey|NULL|160060|Using where; Using index       |
    +--+------------+-----------+------+------------------+----+------+-------------------------------+
    SELECT COUNT(*)
    FROM customer
    WHERE (c_custkey, c_pref_nationkey_05, c_pref_brand_05) NOT IN
      (SELECT o_custkey, s_nationkey, p_brand
       FROM orders, supplier, part, lineitem
       WHERE l_orderkey = o_orderkey AND
             l_suppkey = s_suppkey AND
             l_partkey = p_partkey AND
             p_retailprice < 1200 AND
             l_shipdate >= '1996-04-01' AND l_shipdate < '1996-04-05' AND
             o_orderdate >= '1996-04-01' AND o_orderdate < '1996-04-05');