Tracking grocery price trends on AWS - Part 2 - Analytics

markymarkus

Markus Toivakka

Posted on July 28, 2022

Tracking grocery price trends on AWS - Part 2 - Analytics

In Part 1, we implemented ingestion pipeline for grocery receipts and if you have followed through the instructions, you should now have test data from two grocery receipts extracted to JSON files in S3.

In this second part, I'm going to show you basic steps on how to run analysis on the extracted grocery data. AWS services we are going to use are AWS Athena and Amazon Quicksight.

Setup Athena

First we are creating Athena table for extracted data. Data schema is a simple one. Just remember to change the bucket name to match your case and import partitioned data with MSCK REPAIR.

CREATE EXTERNAL TABLE grocery_items (
    name string,
    price float,
    currency string,
    unit string,
    date string
  )
PARTITIONED BY (store string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-grocery-tracking-output/' 

MSCK REPAIR TABLE grocery_items
Enter fullscreen mode Exit fullscreen mode

Query the data with SELECT * FROM grocery_items and you should get 37 rows of grocery data. Nice, that means the solution is end-to-end working.

Next we need more data.


Backfill the purchase history

If you were to start collecting receipts from this day onwards, it can be weeks or more likely months before there is a price change on the item. Luckily it's possible to get grocery purchase receipt history to see price changes leading to the current point.

S-Group and Kesko, biggest players in Finnish grocery market, both have applications with opt-in services for digitised receipts(apps: S-Mobiili, K-Ruoka). Both applications can export receipts in PDF. All we need to do is to shovel the PDF receipts into the ingestion pipeline and visualise the results. You can also scan/photo receipts into JPG and that is working as well.

Some words about the data

  • In analysis, I have treated grocery item name as unique identifier. For example, KAURAJUOMA(Oat Milk) in following trends means The Oat Milk. It is very possible that actual receipt line KAURAJUOMA can refer to BrandA Kaurajuoma(1.5e) or BrandB Kaurajuoma(1.95e). In such case, same item name in the data would have two different prices. That is very possible anomaly in the price item analysis if your buying pattern is not consistent.

  • If there is a data point missing on the graph that doesn't mean there has not been a price change. It just indicates there is no data on my receipts about the price change.


Query the grocery data

I have used S-Group's app and Omat Ostot service for about two years and on the following examples I am showing some trends based on that data. Basically I have first exported Omat Ostot - grocery receipts to PDF and then synced everything to the ingestion pipeline input - bucket.

I have about 6000 rows of data in grocery_items table.

First query is to check which grocery items are occurring most in the data. Which items are purchased most frequently:

SELECT name,COUNT(name) AS item_line_count FROM items GROUP BY name ORDER BY item_line_count DESC

1   KAURAJUOMA  161
2   LUOMU RASVATON MAITO    125
3   MAITOJUOMA LAKTON RASVATO   124
4   PAPRIKA PUNAINEN IRTO   87
5   PEHMEÄ MAITORAHKA  84
6   RUISPUIKULAT    79
7   KURKKU SUOMI    79
8   BANAANI LUOMU   74
9   TOMAATTI SUOMI  73
10  AVOKADOPUSSI    70
Enter fullscreen mode Exit fullscreen mode

Huh, no beer on that list. Just milk, rye bread and vegetables.

Visualize the price trends

Next, I created a dataset from Athena grocery data to Amazon Quicksight. Following graphs are created from the dataset, data filtered to S-Market Kaleva supermarket.

First price trends are for dairy and meat products:

Image description

Next, vegetables and fruits. Seasonal price fluctuation is clearly the trend with the fresh vegetables. Bananas though..

Image description

Cereal products, bread, flour, misc. Coffee was a first product that caught my attention of the price increases early spring.

Image description

Based on these, talk about grocery inflation 2022 is a real deal. Prices are increasing in almost every category of tracked grocery products. It is interesting to see how the trends will be a few months down the line.

Conclusion

By following the steps detailed in these posts, you can implement described receipt ingest pipeline and start doing analyses on the data.

For this quick exercise, I wanted to just focus on the grocery inflation. Another interesting angle for the data would be to analyse prices between grocery chains. That would require understanding of the grocery range in each market and how to match references in the data for meaningful comparison.. And much more receipts.

Thanks for reading!

💖 💪 🙅 🚩
markymarkus
Markus Toivakka

Posted on July 28, 2022

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related