AWS COVID ETL
Justin Wheeler
Posted on September 21, 2020
Overview
This month I saw that acloud.guru was starting a series of challenges to promote learning. The first challenge was to build an event-driven Python application around COVID-19 data. You can read about that here:
https://acloudguru.com/blog/engineering/cloudguruchallenge-python-aws-etl
My project didn't go as well as I had hoped, but I think the pitfalls I ran into made it a better learning experience.
Assets
GitHub Code:
https://github.com/wheelerswebservices/cgc-aws-covid-etl
Approach
I read the problem statement and quickly went to work. This was a huge mistake and I deeply regret it. I decided to go for a serverless approach and began coding the low-level steps in Python.
I thought I would use DynamoDB to build a serverless application. If I had spent more time planning maybe I would've avoided the issues that arose.
Issue 1
There was an ask to do an incremental load. That's reasonable. If your source data is large you don't want to update it over and over again. It sounds simple enough, but this is actually quite hard to do with a NoSQL DB like DynamoDB.
The only way I could see to provide incremental load functionality was to scan the entire table before each load, which is costly and time-consuming. Then I found that the PutItem method I was using overwrites an item if one already exists with the same primary key, and realized it would be quicker and cheaper to reload the entire dataset every day. So I continued...
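To make the trade-off concrete, here is a minimal sketch of both options. It is not the code from my repository. It assumes `table` behaves like a boto3 DynamoDB `Table` resource (only `scan` and `put_item` are used), and the key name `report_date` and the helper names are hypothetical.

```python
def existing_keys(table, key_name="report_date"):
    """Scan the whole table, page by page, and collect the keys already stored."""
    keys = set()
    kwargs = {"ProjectionExpression": key_name}  # only fetch the key attribute
    while True:
        page = table.scan(**kwargs)
        keys.update(item[key_name] for item in page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return keys
        # DynamoDB returns results in pages; continue from where the last page ended.
        kwargs["ExclusiveStartKey"] = last_key


def incremental_load(table, rows, key_name="report_date"):
    """Write only rows whose key is not in the table yet (needs a full scan first)."""
    seen = existing_keys(table, key_name)
    new_rows = [row for row in rows if row[key_name] not in seen]
    for row in new_rows:
        table.put_item(Item=row)
    return len(new_rows)


def full_reload(table, rows):
    """Write every row; PutItem replaces any item that has the same primary key."""
    count = 0
    for row in rows:
        table.put_item(Item=row)
        count += 1
    return count
```

The incremental version pays for a scan of the whole table on every run, while the full reload pays for one write per source row. For a small daily dataset the second is simpler, and re-running it is harmless because each write replaces the item with the same key.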
Issue 2
I decided to use QuickSight for my visualization layer, trying to stick with AWS-owned services. Besides, QuickSight was a lot cheaper than Tableau at first glance.
Unfortunately, QuickSight does not support DynamoDB as a native data source, meaning I would have to move my data again before I could visualize it.
I had initially looked at using DynamoDB Streams and Kinesis Firehose to feed the data into S3, but I decided it would be much simpler if I just updated my Lambda function to write a JSON file to S3 after updating DynamoDB.
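As a rough sketch of that extra step (again, not the exact code from my repository), the function below serializes the loaded rows and uploads them as a single JSON file. It assumes `s3_client` behaves like a boto3 S3 client, of which only `put_object` is used. The function name, bucket, and object key are hypothetical.

```python
import json


def export_to_s3(s3_client, bucket, key, rows):
    """Serialize rows as a JSON array and upload them as one S3 object."""
    # default=str covers values json cannot encode directly, such as the
    # Decimal numbers that DynamoDB reads return.
    body = json.dumps(rows, default=str).encode("utf-8")
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
    )
    return len(body)
```

Inside the Lambda handler this would run right after the DynamoDB writes, for example `export_to_s3(boto3.client("s3"), "my-covid-bucket", "covid/data.json", rows)`, where the bucket and key are placeholders.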
Surely that would solve everything!
Issue 3
More pain with QuickSight soon followed when I discovered that QuickSight does not allow public dashboards. If I wanted to make my dashboard visible to others I would need to:
- Write an application
- Configure authentication for that application with some service like Cognito
- Embed my QuickSight dashboard into that application
All of these tasks seemed like too much for this simple challenge. Thus, you get a static picture of a dashboard that only I get to enjoy.
A Better Way
The further I got into my application the more I started thinking about what I would've done if I had spent more time planning.
- Would I have noticed these issues before I started?
- Would I have gone in a different direction?
I'd like to think the answer to both of these questions is yes, yet I will never really know.
I've thought about using Aurora Serverless instead of DynamoDB, but the fact that it needs to run within a VPC complicates things:
- Lambda would need to be within a VPC
- I have to create a VPC
- I have to create Subnet(s)
- I have to create Security Group(s)
- I have to create Network Access Control List(s)
- I have to upgrade to QuickSight Enterprise to access VPC resources
https://docs.aws.amazon.com/quicksight/latest/user/working-with-aws-vpc.html
Conclusion
I'm really not sure why I went the way I did, although it was a fun experience regardless. I'm very intrigued to see what other people do for their challenge and even more intrigued to see what the October #CloudGuruChallenge will entail.
One thing is certain though: I will spend sufficient time planning before I get started on my next project, no matter what it may be.