PostgreSQL: Window Functions, a Must-Have Skill for Data Engineers

As a data engineer, SQL window functions have become some of my most valuable tools. I use them regularly to explore data, validate assumptions, generate quick insights, and ensure data quality. In fact, I consider a solid understanding of window functions to be a foundational skill for any data engineer.

A window function is composed of two parts:

function
OVER clause
- PARTITION BY: divide the data into groups
- ORDER BY: specify the order of rows within each group

We’ll use the Employee dataset below to explore each window function in detail.

empId	Name	Age	Department	Salary
0001	Daniela	33	Finance	50,000
0002	Arturo	40	Finance	60,000
0003	Mariana	28	Sales	50,000
0004	Bertha	62	Sales	80,000
0005	Alejandra	37	IT	40,000
0006	Jose	25	IT	50,000
0007	Ricardo	33	Human Resources	55000


-- create
CREATE TABLE EMPLOYEE (
  empId INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  age INTEGER NOT NULL,
  department TEXT NOT NULL,
  salary FLOAT NOT NULL
);

-- insert
INSERT INTO EMPLOYEE VALUES (0001, 'Daniela', 33, 'Finance', 50000);
INSERT INTO EMPLOYEE VALUES (0002, 'Arturo', 40, 'Finance', 60000);
INSERT INTO EMPLOYEE VALUES (0003, 'Mariana', 28, 'Sales', 50000);
INSERT INTO EMPLOYEE VALUES (0004, 'Bertha', 62, 'Sales', 80000);
INSERT INTO EMPLOYEE VALUES (0005, 'Alejandra', 37, 'IT', 40000);
INSERT INTO EMPLOYEE VALUES (0006, 'Jose', 25, 'IT', 50000);
INSERT INTO EMPLOYEE VALUES (0007, 'Ricardo', 33, 'Human Resources', 55000);

-- fetch 
SELECT * FROM EMPLOYEE;

Aggregate Functions

SUM()

Sum values within a window

SELECT 
  name,
  department,
  SUM(salary) OVER (PARTITION BY department )  total_sal_dept
FROM 
  EMPLOYEE;

   name    |   department    | total_sal_dept 
-----------+-----------------+----------------
 Arturo    | Finance         |         110000
 Daniela   | Finance         |         110000
 Ricardo   | Human Resources |          55000
 Jose      | IT              |          90000
 Alejandra | IT              |          90000
 Bertha    | Sales           |         130000
 Mariana   | Sales           |         130000
(7 rows)

AVG()

Calculates the average value within a window

SELECT 
  name,
  department,
  AVG(age) OVER (PARTITION BY department )  avg_age_by_dept
FROM 
  EMPLOYEE;

   name    |   department    |   avg_age_by_dept   
-----------+-----------------+---------------------
 Arturo    | Finance         | 36.5000000000000000
 Daniela   | Finance         | 36.5000000000000000
 Ricardo   | Human Resources | 33.0000000000000000
 Jose      | IT              | 31.0000000000000000
 Alejandra | IT              | 31.0000000000000000
 Bertha    | Sales           | 45.0000000000000000
 Mariana   | Sales           | 45.0000000000000000
(7 rows)

COUNT()

Counts the rows within a window

SELECT 
  name,
  department,
  COUNT(*) OVER (PARTITION BY department )  num_emp_by_dept
FROM 
  EMPLOYEE;

   name    |   department    | num_emp_by_dept 
-----------+-----------------+-----------------
 Arturo    | Finance         |               2
 Daniela   | Finance         |               2
 Ricardo   | Human Resources |               1
 Jose      | IT              |               2
 Alejandra | IT              |               2
 Bertha    | Sales           |               2
 Mariana   | Sales           |               2
(7 rows)

MAX()

Returns the maximum value within a window

SELECT 
  name,
  department,
  salary,
  MAX(salary) OVER (PARTITION BY department )  max_salary_by_dept
FROM 
  EMPLOYEE;

   name    |   department    | salary | max_salary_by_dept 
-----------+-----------------+--------+--------------------
 Arturo    | Finance         |  60000 |              60000
 Daniela   | Finance         |  50000 |              60000
 Ricardo   | Human Resources |  55000 |              55000
 Jose      | IT              |  50000 |              50000
 Alejandra | IT              |  40000 |              50000
 Bertha    | Sales           |  80000 |              80000
 Mariana   | Sales           |  50000 |              80000
(7 rows)

MIN()

Returns the maximum value within a window

SELECT 
  name,
  department,
  salary,
  MIN(salary) OVER (PARTITION BY department )  min_salary_by_dept
FROM 
  EMPLOYEE;

   name    |   department    | salary | min_salary_by_dept 
-----------+-----------------+--------+--------------------
 Arturo    | Finance         |  60000 |              50000
 Daniela   | Finance         |  50000 |              50000
 Ricardo   | Human Resources |  55000 |              55000
 Jose      | IT              |  50000 |              40000
 Alejandra | IT              |  40000 |              40000
 Bertha    | Sales           |  80000 |              50000
 Mariana   | Sales           |  50000 |              50000
(7 rows)

Ranking Functions

RANK()

Assigns ranks to rows, skipping ranks for duplicates.

SELECT 
  name,
  department,
  salary,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC)  rank_by_salary_per_dept,
  RANK() OVER (ORDER BY salary DESC) rank_by_salary_org
FROM 
  EMPLOYEE
ORDER BY
  department DESC;

   name    |   department    | salary | rank_sal_dept | rank_sal_org 
-----------+-----------------+--------+---------------+--------------
 Bertha    | Sales           |  80000 |             1 |            1
 Mariana   | Sales           |  50000 |             2 |            4
 Jose      | IT              |  50000 |             1 |            4
 Alejandra | IT              |  40000 |             2 |            7
 Ricardo   | Human Resources |  55000 |             1 |            3
 Daniela   | Finance         |  50000 |             2 |            4
 Arturo    | Finance         |  60000 |             1 |            2
(7 rows)

DENSE_RANK()

Assigns ranks to rows without skipping rank numbers for duplicates.

SELECT 
  name,
  department,
  salary,
  DENSE_RANK() OVER (ORDER BY salary DESC) rank_by_salary_org
FROM 
  EMPLOYEE
ORDER BY
  rank_by_salary_org;

   name    |   department    | salary | rank_by_salary_org 
-----------+-----------------+--------+--------------------
 Bertha    | Sales           |  80000 |                  1
 Arturo    | Finance         |  60000 |                  2
 Ricardo   | Human Resources |  55000 |                  3
 Mariana   | Sales           |  50000 |                  4
 Daniela   | Finance         |  50000 |                  4
 Jose      | IT              |  50000 |                  4
 Alejandra | IT              |  40000 |                  5
(7 rows)

Notice how with DENSE_RANK(), Alejandra gets assigned the rank 5 while with RANK(), she gets 7

ROW_NUMBER()

Assigns a unique number to each row in the result set.

SELECT 
  name,
  department,
  salary,
  ROW_NUMBER() OVER (ORDER BY salary DESC) rn_by_salary
FROM 
  EMPLOYEE
ORDER BY
  rn_by_salary;

   name    |   department    | salary | rn_by_salary 
-----------+-----------------+--------+--------------
 Bertha    | Sales           |  80000 |            1
 Arturo    | Finance         |  60000 |            2
 Ricardo   | Human Resources |  55000 |            3
 Mariana   | Sales           |  50000 |            4
 Daniela   | Finance         |  50000 |            5
 Jose      | IT              |  50000 |            6
 Alejandra | IT              |  40000 |            7
(7 rows)

Notice how with ROW_NUMBER(), there are no ties between Mariana, Daniela, and Jose

PERCENT_RANK()

Shows the relative rank of a row as a percentage between 0 and 1, excluding the first row (which is always 0).

The formula for PERCENT_RANK() is:

PERCENT_RANK = (rank - 1) / (total_rows - 1)

Where:

rank = the rank of the current row (using RANK())
total_rows = number of rows in the partition (or entire dataset if no partition)
The result is always between 0 and 1

SELECT 
  name,
  department,
  salary,
  PERCENT_RANK() OVER (ORDER BY salary DESC) rn_by_salary
FROM 
  EMPLOYEE
ORDER BY
  rn_by_salary;

salary | rnk | percent_rank = (rank - 1) / (7 - 1)
--------+-----+----------
  40000 |   1 |        (1 - 1) / 6 = 0.0000
  50000 |   2 |        (2 - 1) / 6 = 0.1667
  50000 |   2 |        (2 - 1) / 6 = 0.1667
  50000 |   2 |        (2 - 1) / 6 = 0.1667
  55000 |   5 |        (5 - 1) / 6 = 0.6667
  60000 |   6 |        (6 - 1) / 6 = 0.8333
  80000 |   7 |        (7 - 1) / 6 = 1.0000

   name    |   department    | salary |    rn_by_salary     
-----------+-----------------+--------+---------------------
 Alejandra | IT              |  40000 |                   0
 Mariana   | Sales           |  50000 | 0.16666666666666666
 Daniela   | Finance         |  50000 | 0.16666666666666666
 Jose      | IT              |  50000 | 0.16666666666666666
 Ricardo   | Human Resources |  55000 |  0.6666666666666666
 Arturo    | Finance         |  60000 |  0.8333333333333334
 Bertha    | Sales           |  80000 |                   1
(7 rows)

LAG()

Access to previous rows based on the offset argument.

SELECT 
  empId,
  name,
  department,
  salary,
  LAG(salary) OVER (ORDER BY salary DESC)  AS previous_salary
FROM EMPLOYEE;

 empid |   name    |   department    | salary | previous_salary 
-------+-----------+-----------------+--------+-----------------
     4 | Bertha    | Sales           |  80000 |                
     2 | Arturo    | Finance         |  60000 |           80000
     7 | Ricardo   | Human Resources |  55000 |           60000
     3 | Mariana   | Sales           |  50000 |           55000
     1 | Daniela   | Finance         |  50000 |           50000
     6 | Jose      | IT              |  50000 |           50000
     5 | Alejandra | IT              |  40000 |           50000
(7 rows)

If the ORDER BY changes to ASC, the results also change

SELECT 
  empId,
  name,
  department,
  salary,
  LAG(salary) OVER (ORDER BY salary )  AS previous_salary
FROM EMPLOYEE;

 empid |   name    |   department    | salary | previous_salary 
-------+-----------+-----------------+--------+-----------------
     5 | Alejandra | IT              |  40000 |                
     3 | Mariana   | Sales           |  50000 |           40000
     1 | Daniela   | Finance         |  50000 |           50000
     6 | Jose      | IT              |  50000 |           50000
     7 | Ricardo   | Human Resources |  55000 |           50000
     2 | Arturo    | Finance         |  60000 |           55000
     4 | Bertha    | Sales           |  80000 |           60000
(7 rows)

LEAD()

Access the next row based on an offset argument

SELECT 
  empId,
  name,
  department,
  salary,
    LEAD(salary) OVER (ORDER BY salary )  AS next_salary
FROM EMPLOYEE;

 empid |   name    |   department    | salary | next_salary 
-------+-----------+-----------------+--------+-------------
     5 | Alejandra | IT              |  40000 |       50000
     3 | Mariana   | Sales           |  50000 |       50000
     1 | Daniela   | Finance         |  50000 |       50000
     6 | Jose      | IT              |  50000 |       55000
     7 | Ricardo   | Human Resources |  55000 |       60000
     2 | Arturo    | Finance         |  60000 |       80000
     4 | Bertha    | Sales           |  80000 |            
(7 rows)

Similar to LAG(), the ORDER BY modifies the output

SELECT 
  empId,
  name,
  department,
  salary,
    LEAD(salary) OVER (ORDER BY salary DESC)  AS next_salary
FROM EMPLOYEE;

 empid |   name    |   department    | salary | next_salary 
-------+-----------+-----------------+--------+-------------
     4 | Bertha    | Sales           |  80000 |       60000
     2 | Arturo    | Finance         |  60000 |       55000
     7 | Ricardo   | Human Resources |  55000 |       50000
     3 | Mariana   | Sales           |  50000 |       50000
     1 | Daniela   | Finance         |  50000 |       50000
     6 | Jose      | IT              |  50000 |       40000
     5 | Alejandra | IT              |  40000 |            
(7 rows)

To avoid nulls, you can use:

LEAD(value_expression, offset, default)

LAG(value_expression, offset, default)

Argument	Meaning
`value_expression`	The column or expression to retrieve from the previous/next row (`salary`)
`offset`	How many rows ahead to look (default is `1`)
`default`	What to return if there is no previous/next row (default is `NULL`)