As a data engineer, SQL window functions have become some of my most valuable tools. I use them regularly to explore data, validate assumptions, generate quick insights, and ensure data quality. In fact, I consider a solid understanding of window functions to be a foundational skill for any data engineer.
A window function is composed of two parts:
- function
- OVER clause
- PARTITION BY: divide the data into groups
- ORDER BY: specify the order of rows within each group
We’ll use the Employee dataset below to explore each window function in detail.
| empId | Name | Age | Department | Salary |
| 0001 | Daniela | 33 | Finance | 50,000 |
| 0002 | Arturo | 40 | Finance | 60,000 |
| 0003 | Mariana | 28 | Sales | 50,000 |
| 0004 | Bertha | 62 | Sales | 80,000 |
| 0005 | Alejandra | 37 | IT | 40,000 |
| 0006 | Jose | 25 | IT | 50,000 |
| 0007 | Ricardo | 33 | Human Resources | 55000 |
-- create
CREATE TABLE EMPLOYEE (
empId INTEGER PRIMARY KEY,
name TEXT NOT NULL,
age INTEGER NOT NULL,
department TEXT NOT NULL,
salary FLOAT NOT NULL
);
-- insert
INSERT INTO EMPLOYEE VALUES (0001, 'Daniela', 33, 'Finance', 50000);
INSERT INTO EMPLOYEE VALUES (0002, 'Arturo', 40, 'Finance', 60000);
INSERT INTO EMPLOYEE VALUES (0003, 'Mariana', 28, 'Sales', 50000);
INSERT INTO EMPLOYEE VALUES (0004, 'Bertha', 62, 'Sales', 80000);
INSERT INTO EMPLOYEE VALUES (0005, 'Alejandra', 37, 'IT', 40000);
INSERT INTO EMPLOYEE VALUES (0006, 'Jose', 25, 'IT', 50000);
INSERT INTO EMPLOYEE VALUES (0007, 'Ricardo', 33, 'Human Resources', 55000);
-- fetch
SELECT * FROM EMPLOYEE;
Aggregate Functions
SUM()
Sum values within a window
SELECT
name,
department,
SUM(salary) OVER (PARTITION BY department ) total_sal_dept
FROM
EMPLOYEE;
name | department | total_sal_dept
-----------+-----------------+----------------
Arturo | Finance | 110000
Daniela | Finance | 110000
Ricardo | Human Resources | 55000
Jose | IT | 90000
Alejandra | IT | 90000
Bertha | Sales | 130000
Mariana | Sales | 130000
(7 rows)
AVG()
Calculates the average value within a window
SELECT
name,
department,
AVG(age) OVER (PARTITION BY department ) avg_age_by_dept
FROM
EMPLOYEE;
name | department | avg_age_by_dept -----------+-----------------+--------------------- Arturo | Finance | 36.5000000000000000 Daniela | Finance | 36.5000000000000000 Ricardo | Human Resources | 33.0000000000000000 Jose | IT | 31.0000000000000000 Alejandra | IT | 31.0000000000000000 Bertha | Sales | 45.0000000000000000 Mariana | Sales | 45.0000000000000000 (7 rows)
COUNT()
Counts the rows within a window
SELECT
name,
department,
COUNT(*) OVER (PARTITION BY department ) num_emp_by_dept
FROM
EMPLOYEE;
name | department | num_emp_by_dept
-----------+-----------------+-----------------
Arturo | Finance | 2
Daniela | Finance | 2
Ricardo | Human Resources | 1
Jose | IT | 2
Alejandra | IT | 2
Bertha | Sales | 2
Mariana | Sales | 2
(7 rows)
MAX()
Returns the maximum value within a window
SELECT
name,
department,
salary,
MAX(salary) OVER (PARTITION BY department ) max_salary_by_dept
FROM
EMPLOYEE;
name | department | salary | max_salary_by_dept
-----------+-----------------+--------+--------------------
Arturo | Finance | 60000 | 60000
Daniela | Finance | 50000 | 60000
Ricardo | Human Resources | 55000 | 55000
Jose | IT | 50000 | 50000
Alejandra | IT | 40000 | 50000
Bertha | Sales | 80000 | 80000
Mariana | Sales | 50000 | 80000
(7 rows)
MIN()
Returns the maximum value within a window
SELECT
name,
department,
salary,
MIN(salary) OVER (PARTITION BY department ) min_salary_by_dept
FROM
EMPLOYEE;
name | department | salary | min_salary_by_dept
-----------+-----------------+--------+--------------------
Arturo | Finance | 60000 | 50000
Daniela | Finance | 50000 | 50000
Ricardo | Human Resources | 55000 | 55000
Jose | IT | 50000 | 40000
Alejandra | IT | 40000 | 40000
Bertha | Sales | 80000 | 50000
Mariana | Sales | 50000 | 50000
(7 rows)
Ranking Functions
RANK()
Assigns ranks to rows, skipping ranks for duplicates.
SELECT
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) rank_by_salary_per_dept,
RANK() OVER (ORDER BY salary DESC) rank_by_salary_org
FROM
EMPLOYEE
ORDER BY
department DESC;
name | department | salary | rank_sal_dept | rank_sal_org
-----------+-----------------+--------+---------------+--------------
Bertha | Sales | 80000 | 1 | 1
Mariana | Sales | 50000 | 2 | 4
Jose | IT | 50000 | 1 | 4
Alejandra | IT | 40000 | 2 | 7
Ricardo | Human Resources | 55000 | 1 | 3
Daniela | Finance | 50000 | 2 | 4
Arturo | Finance | 60000 | 1 | 2
(7 rows)
DENSE_RANK()
Assigns ranks to rows without skipping rank numbers for duplicates.
SELECT
name,
department,
salary,
DENSE_RANK() OVER (ORDER BY salary DESC) rank_by_salary_org
FROM
EMPLOYEE
ORDER BY
rank_by_salary_org;
name | department | salary | rank_by_salary_org -----------+-----------------+--------+-------------------- Bertha | Sales | 80000 | 1 Arturo | Finance | 60000 | 2 Ricardo | Human Resources | 55000 | 3 Mariana | Sales | 50000 | 4 Daniela | Finance | 50000 | 4 Jose | IT | 50000 | 4 Alejandra | IT | 40000 | 5 (7 rows)
Notice how with DENSE_RANK(), Alejandra gets assigned the rank 5 while with RANK(), she gets 7
ROW_NUMBER()
Assigns a unique number to each row in the result set.
SELECT
name,
department,
salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) rn_by_salary
FROM
EMPLOYEE
ORDER BY
rn_by_salary;
name | department | salary | rn_by_salary -----------+-----------------+--------+-------------- Bertha | Sales | 80000 | 1 Arturo | Finance | 60000 | 2 Ricardo | Human Resources | 55000 | 3 Mariana | Sales | 50000 | 4 Daniela | Finance | 50000 | 5 Jose | IT | 50000 | 6 Alejandra | IT | 40000 | 7 (7 rows)
Notice how with ROW_NUMBER(), there are no ties between Mariana, Daniela, and Jose
PERCENT_RANK()
Shows the relative rank of a row as a percentage between 0 and 1, excluding the first row (which is always 0).
The formula for PERCENT_RANK() is:
PERCENT_RANK = (rank - 1) / (total_rows - 1)
Where:
rank= the rank of the current row (usingRANK())total_rows= number of rows in the partition (or entire dataset if no partition)- The result is always between
0and1
SELECT
name,
department,
salary,
PERCENT_RANK() OVER (ORDER BY salary DESC) rn_by_salary
FROM
EMPLOYEE
ORDER BY
rn_by_salary;
salary | rnk | percent_rank = (rank - 1) / (7 - 1)
--------+-----+----------
40000 | 1 | (1 - 1) / 6 = 0.0000
50000 | 2 | (2 - 1) / 6 = 0.1667
50000 | 2 | (2 - 1) / 6 = 0.1667
50000 | 2 | (2 - 1) / 6 = 0.1667
55000 | 5 | (5 - 1) / 6 = 0.6667
60000 | 6 | (6 - 1) / 6 = 0.8333
80000 | 7 | (7 - 1) / 6 = 1.0000
name | department | salary | rn_by_salary -----------+-----------------+--------+--------------------- Alejandra | IT | 40000 | 0 Mariana | Sales | 50000 | 0.16666666666666666 Daniela | Finance | 50000 | 0.16666666666666666 Jose | IT | 50000 | 0.16666666666666666 Ricardo | Human Resources | 55000 | 0.6666666666666666 Arturo | Finance | 60000 | 0.8333333333333334 Bertha | Sales | 80000 | 1 (7 rows)
LAG()
Access to previous rows based on the offset argument.
SELECT
empId,
name,
department,
salary,
LAG(salary) OVER (ORDER BY salary DESC) AS previous_salary
FROM EMPLOYEE;
empid | name | department | salary | previous_salary
-------+-----------+-----------------+--------+-----------------
4 | Bertha | Sales | 80000 |
2 | Arturo | Finance | 60000 | 80000
7 | Ricardo | Human Resources | 55000 | 60000
3 | Mariana | Sales | 50000 | 55000
1 | Daniela | Finance | 50000 | 50000
6 | Jose | IT | 50000 | 50000
5 | Alejandra | IT | 40000 | 50000
(7 rows)
If the ORDER BY changes to ASC, the results also change
SELECT
empId,
name,
department,
salary,
LAG(salary) OVER (ORDER BY salary ) AS previous_salary
FROM EMPLOYEE;
empid | name | department | salary | previous_salary
-------+-----------+-----------------+--------+-----------------
5 | Alejandra | IT | 40000 |
3 | Mariana | Sales | 50000 | 40000
1 | Daniela | Finance | 50000 | 50000
6 | Jose | IT | 50000 | 50000
7 | Ricardo | Human Resources | 55000 | 50000
2 | Arturo | Finance | 60000 | 55000
4 | Bertha | Sales | 80000 | 60000
(7 rows)
LEAD()
Access the next row based on an offset argument
SELECT
empId,
name,
department,
salary,
LEAD(salary) OVER (ORDER BY salary ) AS next_salary
FROM EMPLOYEE;
empid | name | department | salary | next_salary
-------+-----------+-----------------+--------+-------------
5 | Alejandra | IT | 40000 | 50000
3 | Mariana | Sales | 50000 | 50000
1 | Daniela | Finance | 50000 | 50000
6 | Jose | IT | 50000 | 55000
7 | Ricardo | Human Resources | 55000 | 60000
2 | Arturo | Finance | 60000 | 80000
4 | Bertha | Sales | 80000 |
(7 rows)
Similar to LAG(), the ORDER BY modifies the output
SELECT
empId,
name,
department,
salary,
LEAD(salary) OVER (ORDER BY salary DESC) AS next_salary
FROM EMPLOYEE;
empid | name | department | salary | next_salary
-------+-----------+-----------------+--------+-------------
4 | Bertha | Sales | 80000 | 60000
2 | Arturo | Finance | 60000 | 55000
7 | Ricardo | Human Resources | 55000 | 50000
3 | Mariana | Sales | 50000 | 50000
1 | Daniela | Finance | 50000 | 50000
6 | Jose | IT | 50000 | 40000
5 | Alejandra | IT | 40000 |
(7 rows)
To avoid nulls, you can use:
LEAD(value_expression, offset, default)
LAG(value_expression, offset, default)
| Argument | Meaning |
|---|---|
value_expression | The column or expression to retrieve from the previous/next row (salary) |
offset | How many rows ahead to look (default is 1) |
default | What to return if there is no previous/next row (default is NULL) |