DISTINCT removal in aggregate functions
Basics
One can use DISTINCT
keyword to de-duplicate the arguments of an aggregate function. For example:
SELECT COUNT(DISTINCT col1) FROM tbl1;
In order to compute this, the optimizer has to de-duplicate the function arguments. This may be computationally expensive.
After fix for MDEV-30660 (available from MariaDB 10.5.25, MariaDB 10.6.18, MariaDB 10.11.8, MariaDB 11.0.6, MariaDB 11.1.5, MariaDB 11.2.4, MariaDB 11.4.2) , the optimizer is able to detect that set of aggregate function arguments is already distinct and so no de-duplication is necessary.
When one can skip de-duplication
The most basic example: if we're doing a select from one table, then the values of primary_key
are already distinct:
SELECT aggregate_func(DISTINCT tbl.primary_key, ...) FROM tbl;
If the SELECT has other constant tables, that's also ok, as they will not create duplicates.
The next step: a part of the primary key can be "bound" by the GROUP BY clause. Consider a query:
SELECT aggregate_func(DISTINCT t1.pk1, ...) FROM t1 GROUP BY t1.pk2;
Suppose the table has PRIMARY KEY(pk1, pk2)
. Grouping by pk2
fixes the value of pk2
within each group. Then, the values of pk1
must be unique within each group, and de-duplication is not necessary.
Observability
EXPLAIN
or EXPLAIN FORMAT=JSON
do not show any details about how aggregate functions are computed. One has to look at the Optimizer Trace. Search for aggregator_type
:
When de-duplication is necessary, it will show:
{ "prepare_sum_aggregators": { "function": "count(distinct t1.col1)", "aggregator_type": "distinct" } }
When de-duplication is not necessary, it will show:
{ "prepare_sum_aggregators": { "function": "count(distinct t1.pk1)", "aggregator_type": "simple" } }