PostgreSQL: Window Functions, a Must-Have Skill for Data Engineers

As a data engineer, SQL window functions have become some of my most valuable tools. I use them regularly to explore data, validate assumptions, generate quick insights, and ensure data quality. In fact, I consider a solid understanding of window functions to be a foundational skill for any data engineer.

A window function is composed of two parts:

  1. function
  2. OVER clause
    • PARTITION BY: divide the data into groups
    • ORDER BY: specify the order of rows within each group

We’ll use the Employee dataset below to explore each window function in detail.

empIdNameAgeDepartmentSalary
0001Daniela33Finance50,000
0002Arturo40Finance60,000
0003Mariana28Sales50,000
0004Bertha62Sales80,000
0005Alejandra37IT40,000
0006Jose25IT50,000
0007Ricardo33Human Resources55000

-- create
CREATE TABLE EMPLOYEE (
  empId INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  age INTEGER NOT NULL,
  department TEXT NOT NULL,
  salary FLOAT NOT NULL
);

-- insert
INSERT INTO EMPLOYEE VALUES (0001, 'Daniela', 33, 'Finance', 50000);
INSERT INTO EMPLOYEE VALUES (0002, 'Arturo', 40, 'Finance', 60000);
INSERT INTO EMPLOYEE VALUES (0003, 'Mariana', 28, 'Sales', 50000);
INSERT INTO EMPLOYEE VALUES (0004, 'Bertha', 62, 'Sales', 80000);
INSERT INTO EMPLOYEE VALUES (0005, 'Alejandra', 37, 'IT', 40000);
INSERT INTO EMPLOYEE VALUES (0006, 'Jose', 25, 'IT', 50000);
INSERT INTO EMPLOYEE VALUES (0007, 'Ricardo', 33, 'Human Resources', 55000);

-- fetch 
SELECT * FROM EMPLOYEE;

Aggregate Functions

SUM()

Sum values within a window

SELECT 
  name,
  department,
  SUM(salary) OVER (PARTITION BY department )  total_sal_dept
FROM 
  EMPLOYEE;

   name    |   department    | total_sal_dept 
-----------+-----------------+----------------
Arturo | Finance | 110000
Daniela | Finance | 110000
Ricardo | Human Resources | 55000
Jose | IT | 90000
Alejandra | IT | 90000
Bertha | Sales | 130000
Mariana | Sales | 130000
(7 rows)

AVG()

Calculates the average value within a window

SELECT 
  name,
  department,
  AVG(age) OVER (PARTITION BY department )  avg_age_by_dept
FROM 
  EMPLOYEE;

   name    |   department    |   avg_age_by_dept   
-----------+-----------------+---------------------
 Arturo    | Finance         | 36.5000000000000000
 Daniela   | Finance         | 36.5000000000000000
 Ricardo   | Human Resources | 33.0000000000000000
 Jose      | IT              | 31.0000000000000000
 Alejandra | IT              | 31.0000000000000000
 Bertha    | Sales           | 45.0000000000000000
 Mariana   | Sales           | 45.0000000000000000
(7 rows)

COUNT()

Counts the rows within a window

SELECT 
  name,
  department,
  COUNT(*) OVER (PARTITION BY department )  num_emp_by_dept
FROM 
  EMPLOYEE;

   name    |   department    | num_emp_by_dept 
-----------+-----------------+-----------------
Arturo | Finance | 2
Daniela | Finance | 2
Ricardo | Human Resources | 1
Jose | IT | 2
Alejandra | IT | 2
Bertha | Sales | 2
Mariana | Sales | 2
(7 rows)

MAX()

Returns the maximum value within a window

SELECT 
  name,
  department,
  salary,
  MAX(salary) OVER (PARTITION BY department )  max_salary_by_dept
FROM 
  EMPLOYEE;

   name    |   department    | salary | max_salary_by_dept 
-----------+-----------------+--------+--------------------
Arturo | Finance | 60000 | 60000
Daniela | Finance | 50000 | 60000
Ricardo | Human Resources | 55000 | 55000
Jose | IT | 50000 | 50000
Alejandra | IT | 40000 | 50000
Bertha | Sales | 80000 | 80000
Mariana | Sales | 50000 | 80000
(7 rows)

MIN()

Returns the maximum value within a window

SELECT 
  name,
  department,
  salary,
  MIN(salary) OVER (PARTITION BY department )  min_salary_by_dept
FROM 
  EMPLOYEE;

   name    |   department    | salary | min_salary_by_dept 
-----------+-----------------+--------+--------------------
Arturo | Finance | 60000 | 50000
Daniela | Finance | 50000 | 50000
Ricardo | Human Resources | 55000 | 55000
Jose | IT | 50000 | 40000
Alejandra | IT | 40000 | 40000
Bertha | Sales | 80000 | 50000
Mariana | Sales | 50000 | 50000
(7 rows)

Ranking Functions

RANK()

 Assigns ranks to rows, skipping ranks for duplicates.

SELECT 
  name,
  department,
  salary,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC)  rank_by_salary_per_dept,
  RANK() OVER (ORDER BY salary DESC) rank_by_salary_org
FROM 
  EMPLOYEE
ORDER BY
  department DESC;

   name    |   department    | salary | rank_sal_dept | rank_sal_org 
-----------+-----------------+--------+---------------+--------------
Bertha | Sales | 80000 | 1 | 1
Mariana | Sales | 50000 | 2 | 4
Jose | IT | 50000 | 1 | 4
Alejandra | IT | 40000 | 2 | 7
Ricardo | Human Resources | 55000 | 1 | 3
Daniela | Finance | 50000 | 2 | 4
Arturo | Finance | 60000 | 1 | 2
(7 rows)

DENSE_RANK()

Assigns ranks to rows without skipping rank numbers for duplicates.

SELECT 
  name,
  department,
  salary,
  DENSE_RANK() OVER (ORDER BY salary DESC) rank_by_salary_org
FROM 
  EMPLOYEE
ORDER BY
  rank_by_salary_org;

   name    |   department    | salary | rank_by_salary_org 
-----------+-----------------+--------+--------------------
 Bertha    | Sales           |  80000 |                  1
 Arturo    | Finance         |  60000 |                  2
 Ricardo   | Human Resources |  55000 |                  3
 Mariana   | Sales           |  50000 |                  4
 Daniela   | Finance         |  50000 |                  4
 Jose      | IT              |  50000 |                  4
 Alejandra | IT              |  40000 |                  5
(7 rows)

Notice how with DENSE_RANK(), Alejandra gets assigned the rank 5 while with RANK(), she gets 7

ROW_NUMBER()

Assigns a unique number to each row in the result set.

SELECT 
  name,
  department,
  salary,
  ROW_NUMBER() OVER (ORDER BY salary DESC) rn_by_salary
FROM 
  EMPLOYEE
ORDER BY
  rn_by_salary;

   name    |   department    | salary | rn_by_salary 
-----------+-----------------+--------+--------------
 Bertha    | Sales           |  80000 |            1
 Arturo    | Finance         |  60000 |            2
 Ricardo   | Human Resources |  55000 |            3
 Mariana   | Sales           |  50000 |            4
 Daniela   | Finance         |  50000 |            5
 Jose      | IT              |  50000 |            6
 Alejandra | IT              |  40000 |            7
(7 rows)

Notice how with ROW_NUMBER(), there are no ties between Mariana, Daniela, and Jose

PERCENT_RANK()

Shows the relative rank of a row as a percentage between 0 and 1, excluding the first row (which is always 0).

The formula for PERCENT_RANK() is:

PERCENT_RANK = (rank - 1) / (total_rows - 1)

Where:

  • rank = the rank of the current row (using RANK())
  • total_rows = number of rows in the partition (or entire dataset if no partition)
  • The result is always between 0 and 1
SELECT 
  name,
  department,
  salary,
  PERCENT_RANK() OVER (ORDER BY salary DESC) rn_by_salary
FROM 
  EMPLOYEE
ORDER BY
  rn_by_salary;
salary | rnk | percent_rank = (rank - 1) / (7 - 1)
--------+-----+----------
40000 | 1 | (1 - 1) / 6 = 0.0000
50000 | 2 | (2 - 1) / 6 = 0.1667
50000 | 2 | (2 - 1) / 6 = 0.1667
50000 | 2 | (2 - 1) / 6 = 0.1667
55000 | 5 | (5 - 1) / 6 = 0.6667
60000 | 6 | (6 - 1) / 6 = 0.8333
80000 | 7 | (7 - 1) / 6 = 1.0000

   name    |   department    | salary |    rn_by_salary     
-----------+-----------------+--------+---------------------
 Alejandra | IT              |  40000 |                   0
 Mariana   | Sales           |  50000 | 0.16666666666666666
 Daniela   | Finance         |  50000 | 0.16666666666666666
 Jose      | IT              |  50000 | 0.16666666666666666
 Ricardo   | Human Resources |  55000 |  0.6666666666666666
 Arturo    | Finance         |  60000 |  0.8333333333333334
 Bertha    | Sales           |  80000 |                   1
(7 rows)

LAG()

 Access to previous rows based on the offset argument.

SELECT 
  empId,
  name,
  department,
  salary,
  LAG(salary) OVER (ORDER BY salary DESC)  AS previous_salary
FROM EMPLOYEE;

 empid |   name    |   department    | salary | previous_salary 
-------+-----------+-----------------+--------+-----------------
     4 | Bertha    | Sales           |  80000 |                
     2 | Arturo    | Finance         |  60000 |           80000
     7 | Ricardo   | Human Resources |  55000 |           60000
     3 | Mariana   | Sales           |  50000 |           55000
     1 | Daniela   | Finance         |  50000 |           50000
     6 | Jose      | IT              |  50000 |           50000
     5 | Alejandra | IT              |  40000 |           50000
(7 rows)

If the ORDER BY changes to ASC, the results also change

SELECT 
  empId,
  name,
  department,
  salary,
  LAG(salary) OVER (ORDER BY salary )  AS previous_salary
FROM EMPLOYEE;

 empid |   name    |   department    | salary | previous_salary 
-------+-----------+-----------------+--------+-----------------
5 | Alejandra | IT | 40000 |
3 | Mariana | Sales | 50000 | 40000
1 | Daniela | Finance | 50000 | 50000
6 | Jose | IT | 50000 | 50000
7 | Ricardo | Human Resources | 55000 | 50000
2 | Arturo | Finance | 60000 | 55000
4 | Bertha | Sales | 80000 | 60000
(7 rows)

LEAD()

Access the next row based on an offset argument

SELECT 
  empId,
  name,
  department,
  salary,
    LEAD(salary) OVER (ORDER BY salary )  AS next_salary
FROM EMPLOYEE;

 empid |   name    |   department    | salary | next_salary 
-------+-----------+-----------------+--------+-------------
     5 | Alejandra | IT              |  40000 |       50000
     3 | Mariana   | Sales           |  50000 |       50000
     1 | Daniela   | Finance         |  50000 |       50000
     6 | Jose      | IT              |  50000 |       55000
     7 | Ricardo   | Human Resources |  55000 |       60000
     2 | Arturo    | Finance         |  60000 |       80000
     4 | Bertha    | Sales           |  80000 |            
(7 rows)

Similar to LAG(), the ORDER BY modifies the output

SELECT 
  empId,
  name,
  department,
  salary,
    LEAD(salary) OVER (ORDER BY salary DESC)  AS next_salary
FROM EMPLOYEE;

 empid |   name    |   department    | salary | next_salary 
-------+-----------+-----------------+--------+-------------
     4 | Bertha    | Sales           |  80000 |       60000
     2 | Arturo    | Finance         |  60000 |       55000
     7 | Ricardo   | Human Resources |  55000 |       50000
     3 | Mariana   | Sales           |  50000 |       50000
     1 | Daniela   | Finance         |  50000 |       50000
     6 | Jose      | IT              |  50000 |       40000
     5 | Alejandra | IT              |  40000 |            
(7 rows)

To avoid nulls, you can use:

LEAD(value_expression, offset, default)

LAG(value_expression, offset, default)

ArgumentMeaning
value_expressionThe column or expression to retrieve from the previous/next row (salary)
offsetHow many rows ahead to look (default is 1)
defaultWhat to return if there is no previous/next row (default is NULL)