How and When to Use SQL Window Functions
Musa Atlıhan
Posted on October 29, 2020
You may think SQL window functions are similar to aggregating but it actually defines a window of rows with a given length around the current row.
You can do,
- aggregate operations: SUM, COUNT, MAX, MIN
- ranking: RANK, DENSE_RANK, ROW_NUMBER, NTILE
- value: LAG, LEAD, FIRST_VALUE, LAST_VALUE
Let's give a concrete example to understand what exactly a window function is doing.
Consider you are a data scientist working on a classification task and you want to predict if an event will occur within the next three hours. In your data, the label is 1 if the event occurred at that exact time and 0 otherwise. Therefore you need a backfill operation to build the target. You can do it by using SQL window functions. Let's create the data frame first,
We don't need to create a database, we will use the pandasql
library for executing SQL queries over pandas
data frames (download via pip install pandas
and pip install pandasql
).
# import libraries
import pandas as pd
data = [
("2020-01-01 06:00:00", 0),
("2020-01-01 07:00:00", 0),
("2020-01-01 08:00:00", 0),
("2020-01-01 09:00:00", 0),
("2020-01-01 10:00:00", 0),
("2020-01-01 11:00:00", 1),
("2020-01-01 12:00:00", 0),
("2020-01-01 13:00:00", 0),
("2020-01-01 14:00:00", 0),
("2020-01-01 15:00:00", 0),
]
# create data frame
df = pd.DataFrame(data, columns=["datetime", "event"])
df.head(10)
Here is the dataframe:
Let's solve the problem:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
solution = pysqldf(
"""
SELECT datetime, event, MAX(event) OVER
(
ORDER BY datetime ASC
ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING
) as target
FROM df
"""
)
solution.head(10)
We ordered the rows using datetime
column in ascending order and filled the window rows with the MAX value. This is the data frame we get with the target column:
We labeled the previous 3 rows before 11am in the target
column.
Posted on October 29, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.