DISTINCT removal in aggregate functions
Basics
One can use DISTINCT keyword to de-duplicate the arguments of an aggregate function. For example:
SELECT COUNT(DISTINCT col1) FROM tbl1;
In order to compute this, the optimizer has to de-duplicate the function arguments. This may be computationally expensive.
After fix for MDEV-30660 (available from MariaDB 10.5.25, 10.6.18, 10.11.8, 11.0.6, 11.1.5, 11.2.4, 11.4.2) , the optimizer is able to detect that set of aggregate function arguments is already distinct and so no de-duplication is necessary.
When one can skip de-duplication
The most basic example: if we're doing a select from one table, then the values of primary_key are already distinct:
SELECT aggregate_func(DISTINCT tbl.primary_key, ...) FROM tbl;
If the SELECT has other constant tables, that's also ok, as they will not create duplicates.
The next step: a part of the primary key can be "bound" by the GROUP BY clause. Consider a query:
SELECT aggregate_func(DISTINCT t1.pk1, ...) FROM t1 GROUP BY t1.pk2;
Suppose the table has PRIMARY KEY(pk1, pk2). GROUP BY pk2 fixes the value of pk2 within each. Then, the values of pk1 must be unique within each group, and de-duplication is not necessary.
Observability
EXPLAIN or EXPLAIN FORMAT=JSON do not show any details about how aggregate functions are computed. One has to look at the Optimizer Trace. Search for aggregator_type:
When de-duplication is necessary, it will show:
{
"prepare_sum_aggregators": {
"function": "count(distinct t1.col1)",
"aggregator_type": "distinct"
}
}
When de-duplication is not necessary, it will show:
{
"prepare_sum_aggregators": {
"function": "count(distinct t1.pk1)",
"aggregator_type": "simple"
}
}