Stuff I Learned during Hanukkah of Data 2023


AJ Kerrigan

Posted on December 18, 2023


Background on Hanukkah of Data, motivation behind this post

Hanukkah of Data is a series of data-themed puzzles: you work your way through a holiday-themed story by solving puzzles against a fictional dataset. I think of it as "Advent of Code meets SQL Murder Mystery".

Last year I worked through the challenges using VisiData, Datasette, and Pandas. I walked through my thought process and solutions in a series of posts.

This year I decided to commit to a set of tools on day 1 (Polars and Jupyter) and use them for the whole challenge. It seemed silly to do a whole new meandering walkthrough, so instead I'll highlight a few things that stuck out after finishing the challenge and sitting on it for a few days.

Here are a few things that stuck with me after using Polars and Jupyter for Hanukkah of Data this year:

Parsing Dates at Read Time Can Be Fiddly

While I had no trouble parsing dates from an existing Polars DataFrame, I hit two bumps trying to parse dates from strings while reading a SQLite file with read_database():

Helping SQLite Help Me

First, I got an error that looked like this:

ComputeError: could not append value: "2017-01-31 00:11:08"
of type: str to the builder; make sure that all rows have
the same schema or consider increasing `infer_schema_length`

it might also be that a value overflows the data-type's
capacity

That turned out to be related to pola-rs/polars#11912, and this linked comment provided a deceptively simple solution - use PARSE_DECLTYPES when creating the connection:

con = sqlite3.connect(
-    "5784/noahs.sqlite"
+    "5784/noahs.sqlite", detect_types=sqlite3.PARSE_DECLTYPES
)
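
For context, here's a minimal sketch of what that flag changes at the sqlite3 layer. I'm assuming an orders table whose ordered column is declared as TIMESTAMP in the schema; with PARSE_DECLTYPES, sqlite3 applies its registered converters based on the declared column type, so values arrive as datetime objects instead of plain strings:

import sqlite3

# Hypothetical check, assuming an `orders` table with an `ordered` column
# declared as TIMESTAMP. Without detect_types, values come back as plain strings.
con = sqlite3.connect("5784/noahs.sqlite")
print(type(con.execute("SELECT ordered FROM orders LIMIT 1").fetchone()[0]))  # str

# With PARSE_DECLTYPES, sqlite3 converts columns declared as TIMESTAMP into
# datetime.datetime objects before Polars ever sees them.
con = sqlite3.connect("5784/noahs.sqlite", detect_types=sqlite3.PARSE_DECLTYPES)
print(type(con.execute("SELECT ordered FROM orders LIMIT 1").fetchone()[0]))  # datetime.datetime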

Falling Back to Explicit Post-Read Parsing

With that fixed, I still hit overflow errors on the birthdate column. As far as I've been able to tell from here and elsewhere, this seems to be because birthdate is a simple YYYY-MM-DD string rather than a full ISO-format datetime. In any case, I couldn't find a way to parse it at read time.

Note: I expect this is a failure in my brain and fingers, rather than Polars.

So I parsed the two datetime columns (ordered and shipped) at read time, and leaned on .str.strptime() to parse birthdate after reading:

df = pl.read_database(
    connection=con,
    query="""
        ...
    """,
    schema_overrides={
        "ordered": pl.Datetime,
        "shipped": pl.Datetime,
    },
).with_columns(birthdate=pl.col("birthdate").str.strptime(pl.Date))
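
For anyone who hasn't used it, .str.strptime() isn't specific to this read path; here's a tiny self-contained sketch (with made-up values) of parsing a plain YYYY-MM-DD string column into a pl.Date:

import polars as pl

# Made-up example data: birthdates stored as plain YYYY-MM-DD strings.
sample = pl.DataFrame({"birthdate": ["1989-05-14", "1993-11-02"]})

parsed = sample.with_columns(
    birthdate=pl.col("birthdate").str.strptime(pl.Date, format="%Y-%m-%d")
)
print(parsed.schema)  # birthdate is now a Date column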

nbdime Rocks

I remember hearing about nbdime and thinking it sounded useful, but I've never really needed it since I rarely use Jupyter in the first place. But then I made some changes to my Hanukkah of Data 2023 notebook to work with the follow-up "speed run" challenge (a new dataset and slightly tweaked clues), and the native Git diff was too noisy to be useful. nbdime came to the rescue! Here are the changes I had to make for days 2 and 3 during the speed run:

Snippet of a Notebook Diff with nbdime's nbdiff-web tool


Text representation of this diff
nbdiff 2023/Hanukkah of Data 2023.ipynb (656616004639c75824de43284ff34bbdc9d89f37) 2023/Hanukkah of Data 2023.ipynb
--- 2023/Hanukkah of Data 2023.ipynb (656616004639c75824de43284ff34bbdc9d89f37)  (no timestamp)
+++ 2023/Hanukkah of Data 2023.ipynb  2023-12-18 11:43:03.005108
## modified /cells/12/source:
@@ -1,7 +1,12 @@
-day2 = df.filter(
-    (pl.col("name").str.contains(r"J.*P"))
-    & (pl.col("desc").str.contains(r"(?i)(coffee|bagel)"))
-    & (pl.col("ordered").str.strptime(pl.Datetime).dt.year() == 2017)
+day2 = (
+    df.filter(
+        (pl.col("name").str.contains(r"D.*?S"))
+        & (pl.col("desc").str.contains(r"(?i)(coffee|bagel)"))
+        & (pl.col("ordered").dt.year() == 2017)
+    )
+    .group_by("phone", "citystatezip")
+    .agg(pl.col("orderid").count().alias("ordercount"))
+    .top_k(1, by="ordercount")
 )

-day2.select("phone").unique().item()
+day2.select("phone")

## modified /cells/14/source:
@@ -1,13 +1,8 @@
 neighborhood = day2.select("citystatezip").unique().item()

 day3 = df.filter(
-    (
-        pl.col("birthdate")
-        .str.strptime(pl.Datetime)
-        .dt.ordinal_day()
-        .is_between(173, 203)
-    )
-    & (pl.col("birthdate").str.strptime(pl.Datetime).dt.year() % 12 == 7)
+    (pl.col("birthdate").dt.ordinal_day().is_between(266, 296))
+    & (pl.col("birthdate").dt.year() % 12 == 11)
     & (pl.col("citystatezip") == neighborhood)
 )

Polars LazyFrames are Neat

For this eight-day challenge, I knew I would be reusing the same core dataset and just slicing it different ways. Reading the full data set into a starter DataFrame seemed fine for that use case, but I still wanted to fiddle with the Polars Lazy API a bit.

On Day 4, for example, I needed to rank customers by the number of bakery items they bought before 5am. That involves a few steps:

  • Filter: I only care about bakery orders (an order sku that starts with BKY) in the 0400-0459 time range.
  • Aggregate: Count the filtered orders by customer
  • Sort: Show the top customers by order count

Which I can do with an eager/non-lazy call like this:

day4 = (
    df.filter(
        (pl.col("sku").str.starts_with("BKY")) & (pl.col("ordered").dt.hour() == 4)
    )
    .group_by(["name", "phone"])
    .agg(pl.col("orderid").count().alias("ordercount"))
    .top_k(5, by="ordercount")
)

day4.select(["name", "phone", "ordercount"])

or this very similar-looking lazified version, which only requires two changes: calling df.lazy() up front to queue operations, and .collect() later to execute them:

day4_lazy = (
    df.lazy()
    .filter((pl.col("sku").str.starts_with("BKY")) & (pl.col("ordered").dt.hour() == 4))
    .group_by(["name", "phone"])
    .agg(pl.col("orderid").count().alias("ordercount"))
    .top_k(5, by="ordercount")
)

day4_lazy.collect().select(["name", "phone", "ordercount"])

The neat part is that before executing those queued operations, the explain() and show_graph() methods offer a peek at the query plan, and each lets you choose whether or not the displayed plan includes LazyFrame optimizations. Check it out, here's the unoptimized version of my day 4 operations:

Unoptimized Query Plan

Text version
SLICE[offset: 0, len: 5]
  SORT BY [col("ordercount")]
    AGGREGATE
        [col("orderid").count().alias("ordercount")] BY [col("name"), col("phone")] FROM
      FILTER [(col("sku").str.starts_with([Utf8(BKY)])) & ([(col("ordered").dt.hour()) == (4)])] FROM

      DF ["customerid", "name", "address", "citystatezip"]
 PROJECT */18 COLUMNS
 SELECTION: "None"

And here's the optimized version:

Optimized Query Plan

Text version
SORT BY [col("ordercount")]
  AGGREGATE
    [col("orderid").count().alias("ordercount")] BY [col("name"), col("phone")] FROM
    DF ["customerid", "name", "address", "citystatezip"]
 PROJECT 5/18 COLUMNS
 SELECTION: "[(col(\"sku\").str.starts_with([Utf8(BKY)])) & ([(col(\"ordered\").dt.hour()) == (4)])]"


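To reproduce that comparison, here's a minimal sketch using the day4_lazy frame from above; the optimized parameter is the knob that toggles whether the displayed plan includes LazyFrame optimizations:

# Print the text form of the plan, with and without optimizations applied.
print(day4_lazy.explain(optimized=False))  # the plan as written
print(day4_lazy.explain())                 # the plan Polars will actually run

# show_graph() renders the same comparison as a diagram (needs Graphviz installed).
day4_lazy.show_graph(optimized=False)
day4_lazy.show_graph()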

It's pretty cool to see what Polars does under the hood to combine operations and make them more efficient: in the optimized plan, the filter has been pushed down into the read step (the SELECTION line), and only the 5 columns the query needs get projected instead of all 18.

I didn't get the full benefit of all available lazy optimizations because of the data I had and how I was using it. It's still very neat stuff to be aware of though!
