Hanukkah of Data 2022 - Puzzle 7
AJ Kerrigan
Posted on December 30, 2022
Hanukkah of Data is a series of data-themed puzzles, where you solve puzzles to move your way through a holiday-themed story using a fictional dataset. See the introductory post for a bit more detail, but the pitch in my head is "Advent of Code meets SQL Murder Mystery". This post walks through my approach to the seventh puzzle.
Warning: This post unavoidably contains spoilers. If you'd rather do the puzzles on your own first, please close this browser tab and run away :).
Thinking
It turns out that one day while Emily was robbing Noah's blind, another customer bought the same item in a different color:
โWell I turned around and sure enough this cute guy was holding something I had bought. He said โI got almost exactly the same thing!โ We laughed about it and wound up swapping items because he had wanted the color I got. We had a moment when our eyes met and my heart stopped for a second. I asked him to get some food with me and we spent the rest of the day together.
So we'll need to find orders on the same day as Emily, where the item is almost the same as what Emily bought...
Tackling this in pandas felt a bit daunting to me, so I figured I'd start with SQL. Looking through the data I saw that color usually showed up inside parentheses in a description. That seemed something worth pursuing.
Doing (Datasette)
Emily's Orders
Breaking the problem down into smaller pieces, I figured we'd want to have Emily's orders first. So I set up a common table expression for those:
with emily_orders as (
select
o.ordered,
p.sku,
p.desc
from
customers c
join orders o on c.customerid = o.customerid
join orders_items i on o.orderid = i.orderid
join products p on i.sku = p.sku
where
c.name = 'Emily Randolph'
)
Orders on Emily Days
Then I pulled in non-Emily orders that were placed on the same day as one of Emily's:
select
c.name,
c.address,
c.citystatezip,
c.phone,
o.ordered,
em.ordered as em_ordered,
p.sku,
em.sku as em_sku,
p.desc
from
customers c
join orders o on c.customerid = o.customerid
and date(o.ordered) in (
select
distinct date(ordered)
from
emily_orders
)
That was too many results, so I fine-tuned the order time to look for non-Emily orders placed within 5 minutes of an Emily order:
join emily_orders em on c.name != 'Emily Randolph'
and abs(
strftime('%s', o.ordered) - strftime('%s', em.ordered)
) <= 300
Matching by SKU Prefix
That was still too many results, so I tried filtering on orders that matched the sku prefix:
join emily_orders em on c.name != 'Emily Randolph'
join orders_items i on o.orderid = i.orderid
join products p on i.sku = p.sku
where
substr(p.sku, 1, 3) = substr(em.sku, 1, 3)
Fine, I'll Filter on Description
I was trying to avoid comparing descriptions, but ran out of other ideas before finding a definitive match. So I tried matching just the portion of a description in front of any parentheses by adding this to the select:
case
when instr(p.desc, '(') = 0 then p.desc
else substr(p.desc, 1, instr(p.desc, '(') - 1)
end as stripped_desc,
case
when instr(em.desc, '(') = 0 then em.desc
else substr(em.desc, 1, instr(em.desc, '(') - 1)
end as em_stripped_desc
And matching it in the WHERE
:
where
substr(p.sku, 1, 3) = substr(em.sku, 1, 3)
and stripped_desc = em_stripped_desc
This felt pretty awkward to me as any latent SQL Server or Postgres instincts were useless. But it did work!
I think there are better ways to do this with extensions and/or full-text search support. This felt like a really ugly way to work around a lack of fuzzy matching or string splitting operations.
The Whole is Uglier Than the Sum of Its Parts
That was a bumpy ride, and I ended up with this monstrosity:
with emily_orders as (
select
o.ordered,
p.sku,
p.desc
from
customers c
join orders o on c.customerid = o.customerid
join orders_items i on o.orderid = i.orderid
join products p on i.sku = p.sku
where
c.name = 'Emily Randolph'
)
select
c.name,
c.phone,
o.ordered,
em.ordered as em_ordered,
p.sku,
em.sku as em_sku,
p.desc,
case
when instr(p.desc, '(') = 0 then p.desc
else substr(p.desc, 1, instr(p.desc, '(') - 1)
end as stripped_desc,
em.desc,
case
when instr(em.desc, '(') = 0 then em.desc
else substr(em.desc, 1, instr(em.desc, '(') - 1)
end as em_stripped_desc
from
customers c
join orders o on c.customerid = o.customerid
and date(o.ordered) in (
select
distinct date(ordered)
from
emily_orders
) -- rough cut
join emily_orders em on c.name != 'Emily Randolph'
and abs(
strftime('%s', o.ordered) - strftime('%s', em.ordered)
) <= 300 -- fine tune
join orders_items i on o.orderid = i.orderid
join products p on i.sku = p.sku
where
substr(p.sku, 1, 3) = substr(em.sku, 1, 3)
and stripped_desc = em_stripped_desc
But also an answer, so I'll take it.
Doing (Pandas)
Because the SQL approach felt so gross to me, I tried to rethink it rather than just "porting" it to pandas. The start was the same - finding a group of not-Emily orders on the same day as Emily orders:
df['ordered_date'] = df.ordered.dt.date
emily = df[df.name == "Emily Randolph"]
not_emily = df[df.name != "Emily Randolph"]
merged = not_emily.merge(emily, on="ordered_date")
But then I tried to come up with a better way to indicate "descriptions that differ just by color". This was tricky! I experimented a bit with difflib which is fun for looking at close string matches. But in the end I leaned on a little helper function to split a description into a set of lowercased words with the parentheses and other punctuation discarded:
import re
stemset = lambda x: {val.lower() for val in re.findall(r"\w+", x)}
Which I could use in a transform()
, and then use apply()
to find the symmetric difference of those sets between each Emily and not-Emily order:
merged["desc_diff"] = merged.transform({"desc_x": stemset, "desc_y": stemset}).apply(
lambda x: x.desc_x ^ x.desc_y, axis=1
)
Because the color wasn't always present in an item description, I figured I could look for cases where the symmetric difference between description wordsets had either 1 or 2 words. Combining that with the "orders within 5 minutes of each other" condition, that left this:
merged[
merged.apply(
lambda x: abs((x.ordered_x - x.ordered_y).total_seconds()) <= 300
and len(x.desc_diff) in (1, 2),
axis=1,
)
]
Tricky Colors
I already had the answer by this point, but there was one thing nagging me: how would I check for specifically a color difference rather than just a 1-2 word difference? This is where I had to shake an angry yet appreciative fist at the puzzle design. Because while I could find lists of color names in packages like webcolors, matplotlib, Crayola APIs, etc... the relevant colors in this puzzle were outside the core set of Crayola/HTML/CSS colors. The best alternative I found was to pull out a distinct set of colors like this:
colors = set(
df.desc.transform(
lambda x: (match := re.search(r'\(([a-z]+)\)', x) ) and match.groups() or []
).explode().dropna().unique()
)
...and then check to be sure my desc_diff
column contained only words from that set. In this case that ended up not being necessary or useful, but hey I was curious.
Getting Chainy
Some pandas users like to chain their methods together rather than leaving a trail of intermediate variables. I don't have a consistent preference, but it feels like good practice to work both ways. So if I tried to chainify this pandas code, it would look something like this:
df[df.name != "Emily Randolph"].merge(
df[df.name == "Emily Randolph"], on="ordered_date"
).assign(
desc_diff=lambda x: x.transform({"desc_x": stemset, "desc_y": stemset}).apply(
lambda x: x.desc_x ^ x.desc_y, axis=1
)
).pipe(
lambda x: x[
x.apply(
lambda x: abs((x.ordered_x - x.ordered_y).total_seconds()) <= 300
and len(x.desc_diff) in (1, 2),
axis=1,
)
]
)
I hope that when I browse other people's solutions I find a cleaner chained version, because this looks gross to me ๐ .
Posted on December 30, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.