Window Functions速览
MariaDB starting with 10.2
开窗函数是从MariaDB 10.2开始支持的新功能。
简介
通过开窗函数,可以为分组后的行执行一些普通SQL语句无法执行的计算。
语法
function (expression) OVER ( [ PARTITION BY expression_list ] [ ORDER BY order_list [ frame_clause ] ] ) function: A valid window function expression_list: expression | column_name [, expr_list ] order_list: expression | column_name [ ASC | DESC ] [, ... ] frame_clause:
描述
在某种概念上来说,开窗函数类似于聚合函数,它们都是对一些行集合做计算。但和聚合函数不同的是,开窗函数的输出不是单行标量的。
MariaDB目前支持的开窗函数包括ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST, NTILE, COUNT, SUM, AVG, BIT_OR, BIT_AND 和 BIT_XOR。
开窗函数最明显的特征是在计算函数之后使用OVER关键字。默认情况下,开窗函数将所有行作为行集(即所谓的"窗口"、"分区"),可以对其使用ORDER BY子句进行排序。如果使用PARTITION BY子句,可以指定按照哪列进行分组,每个组是一个窗口。
例如,给定如下数据:
CREATE TABLE student (name CHAR(10), test CHAR(10), score TINYINT); INSERT INTO student VALUES ('Chun', 'SQL', 75), ('Chun', 'Tuning', 73), ('Esben', 'SQL', 43), ('Esben', 'Tuning', 31), ('Kaolin', 'SQL', 56), ('Kaolin', 'Tuning', 88), ('Tatiana', 'SQL', 87), ('Tatiana', 'Tuning', 83);
下面的两个查询分别按照test字段和name字段分组,并分别返回这两个分组中的平均值:
SELECT name, test, score, AVG(score) OVER (PARTITION BY test) AS average_by_test FROM student; +---------+--------+-------+-----------------+ | name | test | score | average_by_test | +---------+--------+-------+-----------------+ | Chun | SQL | 75 | 65.2500 | | Chun | Tuning | 73 | 68.7500 | | Esben | SQL | 43 | 65.2500 | | Esben | Tuning | 31 | 68.7500 | | Kaolin | SQL | 56 | 65.2500 | | Kaolin | Tuning | 88 | 68.7500 | | Tatiana | SQL | 87 | 65.2500 | | Tatiana | Tuning | 83 | 68.7500 | +---------+--------+-------+-----------------+ SELECT name, test, score, AVG(score) OVER (PARTITION BY name) AS average_by_name FROM student; +---------+--------+-------+-----------------+ | name | test | score | average_by_name | +---------+--------+-------+-----------------+ | Chun | SQL | 75 | 74.0000 | | Chun | Tuning | 73 | 74.0000 | | Esben | SQL | 43 | 37.0000 | | Esben | Tuning | 31 | 37.0000 | | Kaolin | SQL | 56 | 72.0000 | | Kaolin | Tuning | 88 | 72.0000 | | Tatiana | SQL | 87 | 85.0000 | | Tatiana | Tuning | 83 | 85.0000 | +---------+--------+-------+-----------------+
使用范围
开窗函数是标准SQL:2003中开始引入的特性,在标准SQL的子版本中又对其做了些扩展。
MariaDB:
- 支持ROWS和RANGE类型的frame计算
- 支持所有frame边界,包括
RANGE PRECEDING|FOLLOWING n
(不像PostgreSQL或MS SQL Server一样不支持边界) - RANGE类型的frame边界不支持DATE[TIME]数据类型和算术类型(MDEV-9727)
- 支持所有frame边界,包括
- 不支持GROUP类型的frame(主流数据库产品似乎都不支持该特性)
- 不支持frame排除(其他数据库产品也不支持)(MDEV-9724)
- 不支持显式的
NULLS FIRST
和NULLS LAST
- 不支持窗口函数嵌套
- 支持以下窗口函数:
Links
- MDEV-6115 is the main jira task for window functions development. Other tasks are are attached as sub-tasks
- bb-10.2-mdev9543 is the feature tree for window functions. Development is ongoing, and this tree has the newest changes.
- Testcases are in
mysql-test/t/win*.test
示例
给定如下示例数据:
CREATE TABLE users ( email VARCHAR(30), first_name VARCHAR(30), last_name VARCHAR(30), account_type VARCHAR(30) ); INSERT INTO users VALUES ('admin@boss.org', 'Admin', 'Boss', 'admin'), ('bob.carlsen@foo.bar', 'Bob', 'Carlsen', 'regular'), ('eddie.stevens@data.org', 'Eddie', 'Stevens', 'regular'), ('john.smith@xyz.org', 'John', 'Smith', 'regular'), ('root@boss.org', 'Root', 'Chief', 'admin')
首先,对email按字母顺序排序,并让rnum值从1开始。可以使用窗口函数ROW_NUMBER实现:
SELECT row_number() OVER (ORDER BY email) AS rnum, email, first_name, last_name, account_type FROM users ORDER BY email; +------+------------------------+------------+-----------+--------------+ | rnum | email | first_name | last_name | account_type | +------+------------------------+------------+-----------+--------------+ | 1 | admin@boss.org | Admin | Boss | admin | | 2 | bob.carlsen@foo.bar | Bob | Carlsen | regular | | 3 | eddie.stevens@data.org | Eddie | Stevens | regular | | 4 | john.smith@xyz.org | John | Smith | regular | | 5 | root@boss.org | Root | Chief | admin | +------+------------------------+------------+-----------+--------------
还可以根据account类型生成独立的序列号,只需使用PARTITION BY子句对account进行分区:
SELECT row_number() OVER (PARTITION BY account_type ORDER BY email) AS rnum, email, first_name, last_name, account_type FROM users ORDER BY account_type,email; +------+------------------------+------------+-----------+--------------+ | rnum | email | first_name | last_name | account_type | +------+------------------------+------------+-----------+--------------+ | 1 | admin@boss.org | Admin | Boss | admin | | 2 | root@boss.org | Root | Chief | admin | | 1 | bob.carlsen@foo.bar | Bob | Carlsen | regular | | 2 | eddie.stevens@data.org | Eddie | Stevens | regular | | 3 | john.smith@xyz.org | John | Smith | regular | +------+------------------------+------------+-----------+--------------+
给定如下结构的表和数据,需求是检索出每个department中薪水前5的员工。
CREATE TABLE employee_salaries (dept VARCHAR(20), name VARCHAR(20), salary INT(11)); INSERT INTO employee_salaries VALUES ('Engineering', 'Dharma', 3500), ('Engineering', 'Bình', 3000), ('Engineering', 'Adalynn', 2800), ('Engineering', 'Samuel', 2500), ('Engineering', 'Cveta', 2200), ('Engineering', 'Ebele', 1800), ('Sales', 'Carbry', 500), ('Sales', 'Clytemnestra', 400), ('Sales', 'Juraj', 300), ('Sales', 'Kalpana', 300), ('Sales', 'Svantepolk', 250), ('Sales', 'Angelo', 200);
如果不使用窗口函数,可以使用下面的语句来实现:
select dept, name, salary from employee_salaries as t1 where (select count(t2.salary) from employee_salaries as t2 where t1.name != t2.name and t1.dept = t2.dept and t2.salary > t1.salary) < 5 order by dept, salary desc; +-------------+--------------+--------+ | dept | name | salary | +-------------+--------------+--------+ | Engineering | Dharma | 3500 | | Engineering | Bình | 3000 | | Engineering | Adalynn | 2800 | | Engineering | Samuel | 2500 | | Engineering | Cveta | 2200 | | Sales | Carbry | 500 | | Sales | Clytemnestra | 400 | | Sales | Juraj | 300 | | Sales | Kalpana | 300 | | Sales | Svantepolk | 250 | +-------------+--------------+--------+
但以上实现方式有如下缺点:
- 如果employee_salary表没有索引且表较大,查询将耗费大量时间
- 添加和维护索引会增加开销,且即使dept和salary上有索引,每次子查询执行时也都会执行索引查找(index lookup)从而加重开销。
尝试使用窗口函数来实现上面的需求。首先,使用RANK窗口函数为每个department中的employee生成排名序列。
select rank() over (partition by dept order by salary desc) as ranking, dept, name, salary from employee_salaries order by dept, ranking; +---------+-------------+--------------+--------+ | ranking | dept | name | salary | +---------+-------------+--------------+--------+ | 1 | Engineering | Dharma | 3500 | | 2 | Engineering | Bình | 3000 | | 3 | Engineering | Adalynn | 2800 | | 4 | Engineering | Samuel | 2500 | | 5 | Engineering | Cveta | 2200 | | 6 | Engineering | Ebele | 1800 | | 1 | Sales | Carbry | 500 | | 2 | Sales | Clytemnestra | 400 | | 3 | Sales | Juraj | 300 | | 3 | Sales | Kalpana | 300 | | 5 | Sales | Svantepolk | 250 | | 6 | Sales | Angelo | 200 | +---------+-------------+--------------+--------+
每个department都有独立的排名序列。该特殊序列是rank()函数根据OVER()子句中的ORDER BY(对salary排序)生成的。为了结果易读性,在外部查询中还使用了ORDER BY子句对dept和ranking列进行排序。
现在,我们需要减少每个department中的员工数量,以便挑选出薪水前5的员工。但下面的语句会报错:
select rank() over (partition by dept order by salary desc) as ranking, dept, name, salary from employee_salaries where ranking <= 5 order by dept, ranking; ERROR 1054 (42S22): Unknown column 'ranking' in 'where clause'
通过WHERE子句尝试从每个department窗口中挑选前ranking<=5
的行将无法正确执行,因为ranking是窗口函数计算的来的。而窗口函数的计算是在WHERE子句、GROUP BY子句、HAVING子句之后才执行的,它仅在ORDER BY子句之前。因此窗口函数是在筛选之后,并对结果集分组之后才计算的,WHERE子句无法筛选窗口函数计算的结果。
要解决这各问题,需要将上面查询的结果导出到另一张表中,然后对该表进行WHERE筛选:
select *from (select rank() over (partition by dept order by salary desc) as ranking, dept, name, salary from employee_salaries) as salary_ranks where (salary_ranks.ranking <= 5) order by dept, ranking; +---------+-------------+--------------+--------+ | ranking | dept | name | salary | +---------+-------------+--------------+--------+ | 1 | Engineering | Dharma | 3500 | | 2 | Engineering | Bình | 3000 | | 3 | Engineering | Adalynn | 2800 | | 4 | Engineering | Samuel | 2500 | | 5 | Engineering | Cveta | 2200 | | 1 | Sales | Carbry | 500 | | 2 | Sales | Clytemnestra | 400 | | 3 | Sales | Juraj | 300 | | 3 | Sales | Kalpana | 300 | | 5 | Sales | Svantepolk | 250 | +---------+-------------+--------------+--------+