Hanukkah of Data 2022 - Puzzle 8 and Wrap-Up
AJ Kerrigan
Posted on December 30, 2022
Hanukkah of Data is a series of data-themed puzzles, where you solve puzzles to move your way through a holiday-themed story using a fictional dataset. See the introductory post for a bit more detail, but the pitch in my head is "Advent of Code meets SQL Murder Mystery". This post walks through my approach to the eighth puzzle.
Warning: This post unavoidably contains spoilers. If you'd rather do the puzzles on your own first, please close this browser tab and run away :).
Thinking
For the last puzzle, we're looking for the person who still probably has the rug! Turns out he's a bit of a pack rat:
She wound up getting a newer and more expensive rug, so she gave it to an acquaintance of hers who collects all sorts of junk. Apparently he owns an entire set of Noah’s collectibles! He probably still has the rug, even.
The plan here will be find what counts as a Noah's collectible and see who has the most of them.
Doing (Datasette)
Based on a peek at the products table, it looks like collectibles all have a 'COL' sku prefix. So we should be able to find customers who have ordered the most distinct items with that prefix:
select
name,
phone,
count(distinct sku) as colcount
from
noahs_order_detail
where
sku like 'COL%'
group by
name,
phone
order by
colcount desc
And yes, Travis has more than 2x the collectible count of any other customer!
Doing (Pandas)
A pretty straight SQL-->pandas translation seems to work fine here:
df[
df.sku.str.contains('^COL')
].groupby(
['name','phone']
).agg(
{'sku':'nunique'}
).sort_values(by='sku').tail()
Wrap-Up
This was a lot of fun! Some specific takeaways:
- Huge thanks to The Devottys for putting this challenge together
- Saul & Anja pour a lot of heart and smarts into VisiData and the data/terminal-loving community
- I'm consistently impressed by Dwimmer's artwork, he's magic at the terminal
- There's something dangerous or funky about using
category
dtypes in pandas withgroupby()
- I got various errors or saw operations take 40x longer, though this is at least partly my inexperience / ignorance
- Issues like https://github.com/pandas-dev/pandas/issues/36698 suggest I'm not alone!
- Working through problems with multiple tools can sometimes help you rethink your approach in all of them
- You don't have to be Simon Willison to get a lot out of SQLite
I should also point out the official wrap-up post here, which links to other community posts.
Posted on December 30, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.