Window Functions速览

MariaDB starting with 10.2

开窗函数是从MariaDB 10.2开始支持的新功能。

简介

通过开窗函数,可以为分组后的行执行一些普通SQL语句无法执行的计算。

语法

function (expression) OVER (
  [ PARTITION BY expression_list ]
  [ ORDER BY order_list [ frame_clause ] ] ) 

function:
  A valid window function

expression_list:
  expression | column_name [, expr_list ]

order_list:
  expression | column_name [ ASC | DESC ] 
  [, ... ]

frame_clause:

描述

在某种概念上来说,开窗函数类似于聚合函数,它们都是对一些行集合做计算。但和聚合函数不同的是,开窗函数的输出不是单行标量的。

MariaDB目前支持的开窗函数包括ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST, NTILE, COUNT, SUM, AVG, BIT_OR, BIT_ANDBIT_XOR

开窗函数最明显的特征是在计算函数之后使用OVER关键字。默认情况下,开窗函数将所有行作为行集(即所谓的"窗口"、"分区"),可以对其使用ORDER BY子句进行排序。如果使用PARTITION BY子句,可以指定按照哪列进行分组,每个组是一个窗口。

例如,给定如下数据:

CREATE TABLE student (name CHAR(10), test CHAR(10), score TINYINT); 

INSERT INTO student VALUES 
  ('Chun', 'SQL', 75), ('Chun', 'Tuning', 73), 
  ('Esben', 'SQL', 43), ('Esben', 'Tuning', 31), 
  ('Kaolin', 'SQL', 56), ('Kaolin', 'Tuning', 88), 
  ('Tatiana', 'SQL', 87), ('Tatiana', 'Tuning', 83);

下面的两个查询分别按照test字段和name字段分组,并分别返回这两个分组中的平均值:

SELECT name, test, score, AVG(score) OVER (PARTITION BY test) 
  AS average_by_test FROM student;
+---------+--------+-------+-----------------+
| name    | test   | score | average_by_test |
+---------+--------+-------+-----------------+
| Chun    | SQL    |    75 |         65.2500 |
| Chun    | Tuning |    73 |         68.7500 |
| Esben   | SQL    |    43 |         65.2500 |
| Esben   | Tuning |    31 |         68.7500 |
| Kaolin  | SQL    |    56 |         65.2500 |
| Kaolin  | Tuning |    88 |         68.7500 |
| Tatiana | SQL    |    87 |         65.2500 |
| Tatiana | Tuning |    83 |         68.7500 |
+---------+--------+-------+-----------------+

SELECT name, test, score, AVG(score) OVER (PARTITION BY name) 
  AS average_by_name FROM student;
+---------+--------+-------+-----------------+
| name    | test   | score | average_by_name |
+---------+--------+-------+-----------------+
| Chun    | SQL    |    75 |         74.0000 |
| Chun    | Tuning |    73 |         74.0000 |
| Esben   | SQL    |    43 |         37.0000 |
| Esben   | Tuning |    31 |         37.0000 |
| Kaolin  | SQL    |    56 |         72.0000 |
| Kaolin  | Tuning |    88 |         72.0000 |
| Tatiana | SQL    |    87 |         85.0000 |
| Tatiana | Tuning |    83 |         85.0000 |
+---------+--------+-------+-----------------+

使用范围

开窗函数是标准SQL:2003中开始引入的特性,在标准SQL的子版本中又对其做了些扩展。

MariaDB:

  • 支持ROWS和RANGE类型的frame计算
    • 支持所有frame边界,包括RANGE PRECEDING|FOLLOWING n(不像PostgreSQL或MS SQL Server一样不支持边界)
    • RANGE类型的frame边界不支持DATE[TIME]数据类型和算术类型(MDEV-9727)
  • 不支持GROUP类型的frame(主流数据库产品似乎都不支持该特性)
  • 不支持frame排除(其他数据库产品也不支持)(MDEV-9724)
  • 不支持显式的NULLS FIRSTNULLS LAST
  • 不支持窗口函数嵌套
  • MDEV-6115 is the main jira task for window functions development. Other tasks are are attached as sub-tasks
  • bb-10.2-mdev9543 is the feature tree for window functions. Development is ongoing, and this tree has the newest changes.
  • Testcases are in mysql-test/t/win*.test

示例

给定如下示例数据:

CREATE TABLE users (
  email VARCHAR(30), 
  first_name VARCHAR(30), 
  last_name VARCHAR(30), 
  account_type VARCHAR(30)
);

INSERT INTO users VALUES 
  ('admin@boss.org', 'Admin', 'Boss', 'admin'), 
  ('bob.carlsen@foo.bar', 'Bob', 'Carlsen', 'regular'),
  ('eddie.stevens@data.org', 'Eddie', 'Stevens', 'regular'),
  ('john.smith@xyz.org', 'John', 'Smith', 'regular'), 
  ('root@boss.org', 'Root', 'Chief', 'admin')

首先,对email按字母顺序排序,并让rnum值从1开始。可以使用窗口函数ROW_NUMBER实现:

SELECT row_number() OVER (ORDER BY email) AS rnum,
    email, first_name, last_name, account_type
FROM users ORDER BY email;
+------+------------------------+------------+-----------+--------------+
| rnum | email                  | first_name | last_name | account_type |
+------+------------------------+------------+-----------+--------------+
|    1 | admin@boss.org         | Admin      | Boss      | admin        |
|    2 | bob.carlsen@foo.bar    | Bob        | Carlsen   | regular      |
|    3 | eddie.stevens@data.org | Eddie      | Stevens   | regular      |
|    4 | john.smith@xyz.org     | John       | Smith     | regular      |
|    5 | root@boss.org          | Root       | Chief     | admin        |
+------+------------------------+------------+-----------+--------------

还可以根据account类型生成独立的序列号,只需使用PARTITION BY子句对account进行分区:

SELECT row_number() OVER (PARTITION BY account_type ORDER BY email) AS rnum, 
  email, first_name, last_name, account_type 
FROM users ORDER BY account_type,email;
+------+------------------------+------------+-----------+--------------+
| rnum | email                  | first_name | last_name | account_type |
+------+------------------------+------------+-----------+--------------+
|    1 | admin@boss.org         | Admin      | Boss      | admin        |
|    2 | root@boss.org          | Root       | Chief     | admin        |
|    1 | bob.carlsen@foo.bar    | Bob        | Carlsen   | regular      |
|    2 | eddie.stevens@data.org | Eddie      | Stevens   | regular      |
|    3 | john.smith@xyz.org     | John       | Smith     | regular      |
+------+------------------------+------------+-----------+--------------+

给定如下结构的表和数据,需求是检索出每个department中薪水前5的员工。

CREATE TABLE employee_salaries (dept VARCHAR(20), name VARCHAR(20), salary INT(11));

INSERT INTO employee_salaries VALUES
('Engineering', 'Dharma', 3500),
('Engineering', 'Bình', 3000),
('Engineering', 'Adalynn', 2800),
('Engineering', 'Samuel', 2500),
('Engineering', 'Cveta', 2200),
('Engineering', 'Ebele', 1800),
('Sales', 'Carbry', 500),
('Sales', 'Clytemnestra', 400),
('Sales', 'Juraj', 300),
('Sales', 'Kalpana', 300),
('Sales', 'Svantepolk', 250),
('Sales', 'Angelo', 200);

如果不使用窗口函数,可以使用下面的语句来实现:

select dept, name, salary
from employee_salaries as t1
where (select count(t2.salary)
       from employee_salaries as t2
       where t1.name != t2.name and
             t1.dept = t2.dept and
             t2.salary > t1.salary) < 5
order by dept, salary desc;

+-------------+--------------+--------+
| dept        | name         | salary |
+-------------+--------------+--------+
| Engineering | Dharma       |   3500 |
| Engineering | Bình         |   3000 |
| Engineering | Adalynn      |   2800 |
| Engineering | Samuel       |   2500 |
| Engineering | Cveta        |   2200 |
| Sales       | Carbry       |    500 |
| Sales       | Clytemnestra |    400 |
| Sales       | Juraj        |    300 |
| Sales       | Kalpana      |    300 |
| Sales       | Svantepolk   |    250 |
+-------------+--------------+--------+

但以上实现方式有如下缺点:

  • 如果employee_salary表没有索引且表较大,查询将耗费大量时间
  • 添加和维护索引会增加开销,且即使deptsalary上有索引,每次子查询执行时也都会执行索引查找(index lookup)从而加重开销。

尝试使用窗口函数来实现上面的需求。首先,使用RANK窗口函数为每个department中的employee生成排名序列。

select rank() over (partition by dept order by salary desc) as ranking,
    dept, name, salary
    from employee_salaries
    order by dept, ranking;
+---------+-------------+--------------+--------+
| ranking | dept        | name         | salary |
+---------+-------------+--------------+--------+
|       1 | Engineering | Dharma       |   3500 |
|       2 | Engineering | Bình         |   3000 |
|       3 | Engineering | Adalynn      |   2800 |
|       4 | Engineering | Samuel       |   2500 |
|       5 | Engineering | Cveta        |   2200 |
|       6 | Engineering | Ebele        |   1800 |
|       1 | Sales       | Carbry       |    500 |
|       2 | Sales       | Clytemnestra |    400 |
|       3 | Sales       | Juraj        |    300 |
|       3 | Sales       | Kalpana      |    300 |
|       5 | Sales       | Svantepolk   |    250 |
|       6 | Sales       | Angelo       |    200 |
+---------+-------------+--------------+--------+

每个department都有独立的排名序列。该特殊序列是rank()函数根据OVER()子句中的ORDER BY(对salary排序)生成的。为了结果易读性,在外部查询中还使用了ORDER BY子句对deptranking列进行排序。

现在,我们需要减少每个department中的员工数量,以便挑选出薪水前5的员工。但下面的语句会报错:

select
rank() over (partition by dept order by salary desc) as ranking,
dept, name, salary
from employee_salaries
where ranking <= 5
order by dept, ranking;

ERROR 1054 (42S22): Unknown column 'ranking' in 'where clause'

通过WHERE子句尝试从每个department窗口中挑选前ranking<=5的行将无法正确执行,因为ranking是窗口函数计算的来的。而窗口函数的计算是在WHERE子句、GROUP BY子句、HAVING子句之后才执行的,它仅在ORDER BY子句之前。因此窗口函数是在筛选之后,并对结果集分组之后才计算的,WHERE子句无法筛选窗口函数计算的结果。

要解决这各问题,需要将上面查询的结果导出到另一张表中,然后对该表进行WHERE筛选:

select *from (select rank() over (partition by dept order by salary desc) as ranking,
  dept, name, salary
from employee_salaries) as salary_ranks
where (salary_ranks.ranking <= 5)
  order by dept, ranking;
+---------+-------------+--------------+--------+
| ranking | dept        | name         | salary |
+---------+-------------+--------------+--------+
|       1 | Engineering | Dharma       |   3500 |
|       2 | Engineering | Bình         |   3000 |
|       3 | Engineering | Adalynn      |   2800 |
|       4 | Engineering | Samuel       |   2500 |
|       5 | Engineering | Cveta        |   2200 |
|       1 | Sales       | Carbry       |    500 |
|       2 | Sales       | Clytemnestra |    400 |
|       3 | Sales       | Juraj        |    300 |
|       3 | Sales       | Kalpana      |    300 |
|       5 | Sales       | Svantepolk   |    250 |
+---------+-------------+--------------+--------+

See also

Comments

Comments loading...
Content reproduced on this site is the property of its respective owners, and this content is not reviewed in advance by MariaDB. The views, information and opinions expressed by this content do not necessarily represent those of MariaDB or any other party.