How and When to Use SQL Window Functions

musaatlihan

Musa Atlıhan

Posted on October 29, 2020

How and When to Use SQL Window Functions

You may think SQL window functions are similar to aggregating but it actually defines a window of rows with a given length around the current row.

You can do,

  1. aggregate operations: SUM, COUNT, MAX, MIN
  2. ranking: RANK, DENSE_RANK, ROW_NUMBER, NTILE
  3. value: LAG, LEAD, FIRST_VALUE, LAST_VALUE

Let's give a concrete example to understand what exactly a window function is doing.

Consider you are a data scientist working on a classification task and you want to predict if an event will occur within the next three hours. In your data, the label is 1 if the event occurred at that exact time and 0 otherwise. Therefore you need a backfill operation to build the target. You can do it by using SQL window functions. Let's create the data frame first,

We don't need to create a database, we will use the pandasql library for executing SQL queries over pandas data frames (download via pip install pandas and pip install pandasql).

# import libraries 
import pandas as pd

data = [
    ("2020-01-01 06:00:00", 0),
    ("2020-01-01 07:00:00", 0),
    ("2020-01-01 08:00:00", 0),
    ("2020-01-01 09:00:00", 0),
    ("2020-01-01 10:00:00", 0),
    ("2020-01-01 11:00:00", 1),
    ("2020-01-01 12:00:00", 0),
    ("2020-01-01 13:00:00", 0),
    ("2020-01-01 14:00:00", 0),
    ("2020-01-01 15:00:00", 0),
]

# create data frame
df = pd.DataFrame(data, columns=["datetime", "event"]) 

df.head(10)
Enter fullscreen mode Exit fullscreen mode

Here is the dataframe:

dataframe head

Let's solve the problem:

from pandasql import sqldf

pysqldf = lambda q: sqldf(q, globals())

solution = pysqldf(
"""
SELECT datetime, event, MAX(event) OVER
(
    ORDER BY datetime ASC
    ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING
) as target
FROM df
"""
)

solution.head(10)
Enter fullscreen mode Exit fullscreen mode

We ordered the rows using datetime column in ascending order and filled the window rows with the MAX value. This is the data frame we get with the target column:

solution

We labeled the previous 3 rows before 11am in the target column.

💖 💪 🙅 🚩
musaatlihan
Musa Atlıhan

Posted on October 29, 2020

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related