Stuff I Learned during Hanukkah of Data 2023
AJ Kerrigan
Posted on December 18, 2023
Hanukkah of Data is a series of data-themed puzzles where you work your way through a holiday story by exploring a fictional dataset. I think of it as "Advent of Code meets SQL Murder Mystery". Last year I worked through the challenges using VisiData, Datasette, and Pandas, and walked through my thought process and solutions in a series of posts. This year I decided to commit to a set of tools on day 1 (Polars and Jupyter) and use them for the whole challenge. It seemed silly to do a whole new meandering walkthrough, so instead I'll highlight a few things that stuck out after finishing the challenge and sitting on it for a few days. Here we go!
Here are a few things that stuck with me after using Polars and Jupyter for Hanukkah of Data this year:
Parsing Dates at Read Time Can Be Fiddly
While I had no trouble parsing dates from an existing Polars DataFrame, I hit two bumps trying to parse dates from strings while reading a SQLite file with read_database():
Helping SQLite Help Me
First, I got an error that looked like this:
ComputeError: could not append value: "2017-01-31 00:11:08"
of type: str to the builder; make sure that all rows have
the same schema or consider increasing `infer_schema_length`
it might also be that a value overflows the data-type's
capacity
That turned out to be related to pola-rs/polars#11912, and this linked comment provided a deceptively simple solution - use PARSE_DECLTYPES when creating the connection:
 con = sqlite3.connect(
-    "5784/noahs.sqlite"
+    "5784/noahs.sqlite", detect_types=sqlite3.PARSE_DECLTYPES
 )
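To spell out why that one argument matters, here's a standalone sketch with a made-up table (not the noahs data): with detect_types=sqlite3.PARSE_DECLTYPES, sqlite3 looks at each column's declared type, runs the matching registered converter, and hands read_database() real datetime objects instead of strings it has to infer.

import sqlite3
from datetime import datetime

# sqlite3 already ships default converters for DATE/TIMESTAMP declared
# types; registering one explicitly just makes the mechanism visible.
sqlite3.register_converter(
    "TIMESTAMP", lambda b: datetime.fromisoformat(b.decode())
)

con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
con.execute("CREATE TABLE orders (orderid INTEGER, ordered TIMESTAMP)")
con.execute("INSERT INTO orders VALUES (1, '2017-01-31 00:11:08')")

row = con.execute("SELECT ordered FROM orders").fetchone()
print(type(row[0]))  # datetime.datetime rather than str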
Falling Back to Explicit Post-Read Parsing
With that fixed, I still hit overflow errors on the birthdate column. As far as I can tell from here and elsewhere, that's because birthdate is a plain YYYY-MM-DD string rather than a full ISO-format datetime. In any case, I couldn't find a way to parse it at read time.
Note: I expect this is a failure in my brain and fingers, rather than Polars.
So I parsed the two datetime columns at read time, and leaned on .str.strptime() to parse birthdate after reading:
df = pl.read_database(
    connection=con,
    query="""
    ...
    """,
    schema_overrides={
        "ordered": pl.Datetime,
        "shipped": pl.Datetime,
    },
).with_columns(birthdate=pl.col("birthdate").str.strptime(pl.Date))
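As a tiny standalone illustration of that fallback (made-up values, not the real data), this is all the post-read parse is doing:

import polars as pl

# Bare YYYY-MM-DD strings, like the birthdate column, parsed into a
# proper Date dtype after the read rather than during it.
toy = pl.DataFrame({"birthdate": ["1989-07-15", "1993-11-02"]})
print(toy.with_columns(pl.col("birthdate").str.strptime(pl.Date)))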
nbdime Rocks
I remember hearing about nbdime and thinking it sounded useful, but I'd never really needed it since I rarely use Jupyter in the first place. But then I made some changes to my Hanukkah of Data 2023 notebook to work with the follow-up "speed run" challenge (a new dataset and slightly tweaked clues), and the native Git diff was too noisy to be useful. nbdime came to the rescue! Here are the changes I had to make for days 2 and 3 during the speed run:
nbdiff 2023/Hanukkah of Data 2023.ipynb (656616004639c75824de43284ff34bbdc9d89f37) 2023/Hanukkah of Data 2023.ipynb
--- 2023/Hanukkah of Data 2023.ipynb (656616004639c75824de43284ff34bbdc9d89f37) (no timestamp)
+++ 2023/Hanukkah of Data 2023.ipynb 2023-12-18 11:43:03.005108
## modified /cells/12/source:
@@ -1,7 +1,12 @@
-day2 = df.filter(
-    (pl.col("name").str.contains(r"J.*P"))
-    & (pl.col("desc").str.contains(r"(?i)(coffee|bagel)"))
-    & (pl.col("ordered").str.strptime(pl.Datetime).dt.year() == 2017)
+day2 = (
+    df.filter(
+        (pl.col("name").str.contains(r"D.*?S"))
+        & (pl.col("desc").str.contains(r"(?i)(coffee|bagel)"))
+        & (pl.col("ordered").dt.year() == 2017)
+    )
+    .group_by("phone", "citystatezip")
+    .agg(pl.col("orderid").count().alias("ordercount"))
+    .top_k(1, by="ordercount")
 )
-day2.select("phone").unique().item()
+day2.select("phone")
## modified /cells/14/source:
@@ -1,13 +1,8 @@
 neighborhood = day2.select("citystatezip").unique().item()
 day3 = df.filter(
-    (
-        pl.col("birthdate")
-        .str.strptime(pl.Datetime)
-        .dt.ordinal_day()
-        .is_between(173, 203)
-    )
-    & (pl.col("birthdate").str.strptime(pl.Datetime).dt.year() % 12 == 7)
+    (pl.col("birthdate").dt.ordinal_day().is_between(266, 296))
+    & (pl.col("birthdate").dt.year() % 12 == 11)
     & (pl.col("citystatezip") == neighborhood)
 )
Polars LazyFrames Are Neat
For this eight-day challenge, I knew I would be reusing the same core dataset and just slicing it in different ways. Reading the full dataset into a starter DataFrame seemed fine for that use case, but I still wanted to fiddle with the Polars Lazy API a bit.
On Day 4, for example, I needed to rank customers by the number of bakery items they bought before 5am. That involves a few steps:
- Filter: I only care about bakery orders (an order sku that starts with BKY) in the 0400-0459 time range.
- Aggregate: Count the filtered orders by customer.
- Sort: Show the top customers by order count.
I can do that with an eager (non-lazy) call like this:
day4 = (
    df.filter(
        (pl.col("sku").str.starts_with("BKY")) & (pl.col("ordered").dt.hour() == 4)
    )
    .group_by(["name", "phone"])
    .agg(pl.col("orderid").count().alias("ordercount"))
    .top_k(5, by="ordercount")
)
day4.select(["name", "phone", "ordercount"])
Or I can use this very similar-looking lazified version, which only requires two changes: calling df.lazy() up front to queue the operations, and collect() later to execute them:
day4_lazy = (
    df.lazy()
    .filter((pl.col("sku").str.starts_with("BKY")) & (pl.col("ordered").dt.hour() == 4))
    .group_by(["name", "phone"])
    .agg(pl.col("orderid").count().alias("ordercount"))
    .top_k(5, by="ordercount")
)
day4_lazy.collect().select(["name", "phone", "ordercount"])
The neat part is that before executing those queued operations, the explain() and show_graph() methods offer a peek at the query plan, and they let you choose whether or not that plan display includes LazyFrame optimizations. Check it out: here's the unoptimized version of my day 4 operations:
SLICE[offset: 0, len: 5]
SORT BY [col("ordercount")]
AGGREGATE
[col("orderid").count().alias("ordercount")] BY [col("name"), col("phone")] FROM
FILTER [(col("sku").str.starts_with([Utf8(BKY)])) & ([(col("ordered").dt.hour()) == (4)])] FROM
DF ["customerid", "name", "address", "citystatezip"]
PROJECT */18 COLUMNS
SELECTION: "None"
And here's the optimized version:
SORT BY [col("ordercount")]
AGGREGATE
[col("orderid").count().alias("ordercount")] BY [col("name"), col("phone")] FROM
DF ["customerid", "name", "address", "citystatezip"]
PROJECT 5/18 COLUMNS
SELECTION: "[(col(\"sku\").str.starts_with([Utf8(BKY)])) & ([(col(\"ordered\").dt.hour()) == (4)])]"
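For reference, both plan dumps above come from explain(); the calls looked something like this (show_graph() would render the same plans as diagrams instead):

# The plan as written, before Polars applies its optimization passes
print(day4_lazy.explain(optimized=False))

# The optimized plan (the default), which is what actually runs on collect()
print(day4_lazy.explain())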
It's pretty cool to see what Polars does under the hood to combine operations and make them more efficient. I didn't get the full benefit of all the available lazy optimizations, since I had already read the whole dataset eagerly and was only slicing it afterward. It's still very neat stuff to be aware of though! For a rough idea of what the fuller benefit could look like, see the sketch below.
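Purely hypothetical (the challenge data lived in SQLite, and noahs-orders.parquet is a made-up file), but starting from a lazy scan would let those optimizations reach into the read itself:

import polars as pl

# Hypothetical: if the joined orders lived in a Parquet file, a lazy scan
# would let Polars push the 5-column projection (and much of the filter)
# down into the read, instead of loading all 18 columns up front.
day4_scan = (
    pl.scan_parquet("noahs-orders.parquet")  # made-up path
    .filter((pl.col("sku").str.starts_with("BKY")) & (pl.col("ordered").dt.hour() == 4))
    .group_by(["name", "phone"])
    .agg(pl.col("orderid").count().alias("ordercount"))
    .top_k(5, by="ordercount")
)
print(day4_scan.explain())

With a scan, the optimized plan shows the projection (and, where Polars can manage it, the filter) landing on the scan node itself rather than on an in-memory DataFrame.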