Hanukkah of Data 2022 - Puzzle 2
AJ Kerrigan
Posted on December 30, 2022
Hanukkah of Data is a series of data puzzles where you work your way through a holiday-themed story using a fictional dataset. See the introductory post for a bit more detail, but the pitch in my head is "Advent of Code meets SQL Murder Mystery". This post walks through my approach to the second puzzle.
Warning: This post unavoidably contains spoilers. If you'd rather do the puzzles on your own first, please close this browser tab and run away :).
Thinking
In the second puzzle, we find that the rug in our story was so dirty that the cleaners needed help from a special contractor. And we get a clue about how to track that contractor down:
As they’re right across the street from Noah’s, they usually talked about the project over coffee and bagels at Noah’s before handing off the item to be cleaned. The contractors would pick up the tab and expense it, along with their cleaning supplies.
“So this rug was apparently one of those special projects. The claim ticket said ‘2017 spec JD’. ‘2017’ is the year the item was brought in, and ‘JD’ is the initials of the contractor.
So extracting some facts...
- We're looking for a Noah's order from sometime in 2017
- The order has coffee and bagels
- The customer has initials JD
Tool Choices
In the pre-work section of the last post, I mentioned poking around at the data before working on any specific puzzles to get a feel for it. Looking at this puzzle's clues, it seemed clear that the approach would be very different from puzzle 1. Rather than focusing on a single table, we'd need to look at the full picture of an order:
- The customer placing the order
- The date and time it took place
- Which items were included
VisiData is great for quick exploration and can perform joins, but considering the number of tables involved I figured this would be more straightforward with SQL. And since SQLite was an available format, Datasette seemed like a great fit.
Doing (Datasette)
Joining and Selecting
Before thinking too hard about filters, I built up the joins one table at a time until I had access to all the fields I'd need:
- Customer name so I could look for initials
- Phone number to submit as an answer
- Ordered date to look for 2017
- Description to find bagels and coffee
The SELECT statement ended up looking like this:
select
c.name,
c.phone,
o.ordered,
p.desc
from
customers c
join orders o on c.customerid = o.customerid
join orders_items i on o.orderid = i.orderid
join products p on i.sku = p.sku
Filtering
Order Date
The easiest first slice to whittle down this result set seemed to be filtering by order date. SQLite's strftime() function handles that nicely:
where
strftime('%Y', o.ordered) = '2017'
Contractor Initials
I'm not a SQLite expert at all, so I hit the docs looking for the friendliest way to match the initials JD. LIKE? Regexes? Something else? I came up with this:
where
strftime('%Y', o.ordered) = '2017'
and c.name glob 'J* D*'
Because...
- like isn't case-sensitive by default (though that can be controlled with the case_sensitive_like pragma)
- glob is case-sensitive by default
- SQLite includes a REGEXP operator, but it errors out unless you define a regexp() application-defined function (or load an extension that does)
This seems like handy stuff to know!
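For the curious, here's a minimal sketch of that last point: registering a regexp() function through Python's standard sqlite3 module. I didn't need this for the puzzle (glob did the job), so treat the function name and pattern here as illustration rather than part of the solution:

import re
import sqlite3

con = sqlite3.connect('noahs.sqlite')

# SQLite rewrites "X REGEXP Y" as a call to regexp(Y, X), so the
# pattern arrives as the first argument and the value as the second.
con.create_function(
    'regexp',
    2,
    lambda pattern, value: value is not None
    and re.search(pattern, value) is not None,
)

# Roughly the same filter as glob 'J* D*' above:
rows = con.execute(
    "select name, phone from customers where name regexp 'J.* D.*'"
).fetchall()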
Coffee and Bagels
I wanted orders that had both bagels and coffee, which GROUP BY and HAVING cover well:
group by
c.name,
c.phone,
o.ordered
having
sum(
case
when desc like '%coffee%' then 1
else 0
end
) > 0
and sum(
case
when desc like '%bagel%' then 1
else 0
end
) > 0
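As an aside: comparisons in SQLite evaluate to 0 or 1, so those CASE expressions could be folded straight into the sums. A sketch of that variant (run from Python just to keep it self-contained, not what I actually used at the time):

import sqlite3

con = sqlite3.connect('noahs.sqlite')
rows = con.execute('''
    select c.name, c.phone, o.ordered
    from customers c
    join orders o on c.customerid = o.customerid
    join orders_items i on o.orderid = i.orderid
    join products p on i.sku = p.sku
    where strftime('%Y', o.ordered) = '2017'
      and c.name glob 'J* D*'
    group by c.name, c.phone, o.ordered
    -- like returns 0/1, so sum() counts matching line items directly
    having sum(p.desc like '%coffee%') > 0
       and sum(p.desc like '%bagel%') > 0
''').fetchall()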
All Together Now
After building that up bit by bit, the final query looks like this:
select
c.name,
c.phone,
o.ordered,
p.desc
from
customers c
join orders o on c.customerid = o.customerid
join orders_items i on o.orderid = i.orderid
join products p on i.sku = p.sku
where
strftime('%Y', o.ordered) = '2017'
and c.name glob 'J* D*'
group by
c.name,
c.phone,
o.ordered
having
sum(
case
when desc like '%coffee%' then 1
else 0
end
) > 0
and sum(
case
when desc like '%bagel%' then 1
else 0
end
) > 0
And it gives a clear single result 👍️.
Doing (Pandas)
It was rewarding to dig into SQLite a bit while solving this puzzle, and it seemed like a good opportunity to learn a bit more about pandas too! How would I adapt this working SQL solution to pandas?
Preparation
To start, it seemed reasonable to flesh out my initial SELECT query and use that to build a starting DataFrame:
import numpy as np
import pandas as pd
import sqlite3
con = sqlite3.connect('noahs.sqlite')
df = pd.read_sql_query('''
select
c.*,
o.orderid,
o.ordered,
o.shipped,
o.total,
i.qty,
i.unit_price,
p.sku,
p.desc,
p.wholesale_cost
from
customers c
join orders o on c.customerid = o.customerid
join orders_items i on o.orderid = i.orderid
join products p on i.sku = p.sku''',
con,
parse_dates=['ordered','shipped'])
Looking at counts with df.nunique(), it looked like most of the columns were repeated a bunch. And I've heard/read that categorical dtypes can be useful there. So why not give them a shot:
dtypes = {
    col: 'category'
    for col, dtype in df.dtypes.to_dict().items()
    if dtype != np.dtype('datetime64[ns]')
}
# df.astype(dtypes).memory_usage(deep=True).sum()
# df.memory_usage(deep=True).sum()
# ^^ ~10% of the original memory usage after type conversion!
df = df.astype(dtypes)
Note: I don't think this was actually a good move... first, it was textbook premature optimization. Second, I hit some errors and performance issues later that seemed to stem from this. But hey, that's what experimentation is for.
I then saved the DataFrame to a Parquet file just so it'd be quicker to reload later:
df.to_parquet('noahs.parquet')
Filtering
Using the .str accessor methods helped find the contractor initials and orders with bagels or coffee, while .dt.year handled the order year check:
df[
(df.name.str.contains(r"J.* D.*"))
& (df.ordered.dt.year == 2017)
& (df.desc.str.contains(r"bagel|coffee", case=False))
]
After some time in the pandas docs, groupby() and filter() looked like the right tools for finding orders containing both bagels and coffee. This seemed like a reasonable idea:
df[
(df.name.str.contains(r"J.* D.*"))
& (df.ordered.dt.year == 2017)
& (df.desc.str.contains(r"bagel|coffee", case=False))
].groupby(["orderid"]).filter(
lambda x: x.desc.str.contains("bagel", case=False).any()
and x.desc.str.contains("coffee", case=False).any()
)[
["name", "phone", "desc"]
]
But... BOOM. It just hung and killed my process.
I suspect that this isn't the best way to do this sort of thing. But separately, it seems that using groupby() on a column with a category dtype is bad news: by default, pandas materializes a group for every category value, not just the ones that survive the filter. Changing orderid's dtype to int made this run in a snap. Filtering for bagels and coffee twice feels a little dirty, but ruling out obviously non-matching orders before grouping seemed like a useful thing.
So after minor tweaks, I had the right answer but was also developing some suspicions.
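For the record, the tweak in question was one line. A sketch (the observed=True alternative in the comment is something I only ran across later, not part of the original session):

# Plain integer keys sidestep the categorical-groupby blowup:
df['orderid'] = df['orderid'].astype('int')

# Alternatively, keep the category dtype but tell groupby to only
# materialize groups that actually occur in the data:
# df.groupby('orderid', observed=True).filter(...)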
Refining / Reusing
Lots of pandas users seem to use notebooks, but I find it much more comfortable to explore in a REPL. So I wrote a quick script that would let ptipython -i hod_resume.py land me in a fresh REPL with my DataFrame loaded:
import pandas as pd
df = pd.read_parquet('noahs.parquet').astype({
'customerid': 'int',
'name': 'string',
'address': 'string',
'citystatezip': 'string',
'birthdate': 'datetime64[ns]',
'phone': 'string',
'orderid': 'int',
'ordered': 'datetime64[ns]',
'ordered_date': 'datetime64[ns]',
'shipped': 'datetime64[ns]',
'total': 'float',
'qty': 'int',
'unit_price': 'float',
'sku': 'string',
'desc': 'string',
'wholesale_cost': 'float'
})
I consider that list of data types pretty fluid. Getting rid of object dtypes seemed like a generally good idea, but category was a mixed bag that felt great at times and like a footgun at others. I think I need to understand them better to avoid doing silly things, and in the meantime I experimented and switched dtypes often.
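A cheap sanity check for that kind of experimenting (a throwaway sketch, not a rigorous benchmark) is comparing a column's memory footprint under each candidate dtype before committing to it:

# Side-by-side memory usage for string vs. category versions of a column:
for dtype in ('string', 'category'):
    print(dtype, df['desc'].astype(dtype).memory_usage(deep=True))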