Hanukkah of Data 2022 - Puzzle 4
AJ Kerrigan
Posted on December 30, 2022
Hanukkah of Data is a series of data-themed puzzles, where you solve puzzles to move your way through a holiday-themed story using a fictional dataset. See the introductory post for a bit more detail, but the pitch in my head is "Advent of Code meets SQL Murder Mystery". This post walks through my approach to the fourth puzzle.
Warning: This post unavoidably contains spoilers. If you'd rather do the puzzles on your own first, please close this browser tab and run away :).
Thinking
On the fourth day, we find a woman who likes to eat pastries early in the morning:
“A few weeks later my bike chain broke on the way home, and I needed to get it fixed before work the next day. Thankfully, this woman I met on Tinder came over at 5am with her bike chain repair kit and some pastries from Noah’s. Apparently she liked to get up before dawn and claim the first pastries that came out of the oven.
Doing (Pandas)
This time I turned things around and started with Pandas. Because I had explored the data a bit prior to starting puzzles, I noticed that the sku
column started with a category/department prefix. Baked goods all seemed to have a bakery prefix of BKY
, so finding orders with that sku prefix seemed like a good start:
df[(df.sku.str.contains('BKY'))]
That was far too many results (as expected), so the next step was to find bakery orders early in the morning. Let's say between 4am and 9am, which should cover cases where someone got up before dawn and got the first pastries out of the oven:
df[(df.sku.str.contains('BKY')) & (df.ordered.dt.hour.between(4,9))]
That was still a bunch of rows! But the clues make it sound like this is a habit. So how about looking at which customers get those early morning pastries most often?
df[
(df.sku.str.contains('BKY'))
& (df.ordered.dt.hour.between(4,9))
].groupby(['name','phone']).size().sort_values().tail()
That suggests Cristina Booker as the clear winner, and narrowing that time range from 4-9am to 4-6am makes it even clearer.
Doing (Datasette)
Adapting the pandas logic to SQL left me with this query:
select
c.name,
c.phone,
count(*) as ordercount
from
customers c
join orders o on c.customerid = o.customerid
join orders_items i on o.orderid = i.orderid
join products p on i.sku = p.sku
where
p.sku like 'BKY%'
and cast(strftime('%H', o.ordered) as int) <= 5
group by
c.name,
c.phone
order by ordercount desc
Posted on December 30, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.