Histogram-Based Statistics

Histogram-based statistics are a mechanism to improve the query plan chosen by the optimizer in certain situations. Before their introduction, all conditions on non-indexed columns were ignored when searching for the best execution plan. Histograms can be collected for both indexed and non-indexed columns, and are made available to the optimizer.

Histogram statistics are stored in the mysql.column_stats table, which stores data for engine-independent table statistics, and so are essentially a subset of engine-independent table statistics.

Histograms are used by default from MariaDB 10.4.3 if they are available. However, histogram statistics are not automatically collected, as collection is expensive, requiring a full table scan. See Collecting Statistics with the ANALYZE TABLE Statement for details.

Consider this example, using the following query:

SELECT * FROM t1,t2 WHERE t1.a=t2.a AND t2.b BETWEEN 1 AND 3;

Let's assume that

  • table t1 contains 100 records

  • table t2 contains 1000 records

  • there is a primary index on t1(a)

  • there is a secondary index on t2(a)

  • there is no index defined on column t2.b

  • the selectivity of the condition t2.b BETWEEN (1,3) is high (~ 1%)

Before histograms were introduced, the optimizer would choose the plan that:

  • accesses t1 using a table scan

  • accesses t2 using index t2(a)

  • checks the condition t2.b BETWEEN 1 AND 3

This plan examines all rows of both tables and performs 100 index look-ups.

With histograms available, the optimizer can choose the following, more efficient plan:

  • accesses table t2 in a table scan

  • checks the condition t2.b BETWEEN 1 AND 3

  • accesses t1 using index t1(a)

This plan also examine all rows from t2, but it performs only 10 look-ups to access 10 rows of table t1.

System Variables

There are a number of system variables that affect histograms.

histogram_size

The histogram_size variable determines the size, in bytes, from 0 to 255, used for a histogram. This is effectively the number of bins for histogram_type=SINGLE_PREC_HB or number of bins/2 for histogram_type=DOUBLE_PREC_HB. If it is set to 0 (the default for MariaDB 10.4.2 and below), no histograms are created when running an ANALYZE TABLE.

histogram_type

The histogram_type variable determines whether single precision (SINGLE_PREC_HB) or double precision (DOUBLE_PREC_HB) height-balanced histograms are created. From MariaDB 10.4.3, double precision is the default. For MariaDB 10.4.2 and below, single precision is the default.

From MariaDB 10.8, JSON_HB, JSON-format histograms, are accepted.

optimizer_use_condition_selectivity

The optimizer_use_condition_selectivity controls which statistics can be used by the optimizer when looking for the best query execution plan.

  • 1 Use selectivity of predicates as in MariaDB 5.5.

  • 2 Use selectivity of all range predicates supported by indexes.

  • 3 Use selectivity of all range predicates estimated without histogram.

  • 4 Use selectivity of all range predicates estimated with histogram.

  • 5 Additionally use selectivity of certain non-range predicates calculated on record sample.

From MariaDB 10.4.1, the default is 4. Until MariaDB 10.4.0, the default is 1.

Example

Here is an example of the dramatic impact histogram-based statistics can make. The query is based on DBT3 Benchmark Q20 with 60 million records in the lineitem table.

First,

Next, a really bad plan, yet one sometimes chosen:

Persistent statistics don't improve matters:

The default flags for optimizer_switch do not help much:

Using statistics doesn't help either:

Now, taking into account the cost of the dependent subquery:

Finally, using join_buffer as well:

See Also

This page is licensed: CC BY-SA / Gnu FDL

Last updated

Was this helpful?