AWS COVID ETL

Overview

This month I saw that acloud.guru was going to start running challenges to promote learning. This month's challenge was to build an event-driven Python ETL application around COVID-19 data. You can read about that here:
https://acloudguru.com/blog/engineering/cloudguruchallenge-python-aws-etl

My project didn't go as well as I would've hoped, but I think it fostered a better learning experience because of my pitfalls.

Assets

GitHub Code:
https://github.com/wheelerswebservices/cgc-aws-covid-etl

Architecture Diagram:

QuickSight Dashboard:

Approach

I read the problem statement and quickly went to work. This was a huge mistake and I deeply regret it. I decided to go for a serverless approach and began coding the low-level steps in Python.

I thought I would use DynamoDB to build a serverless application. If I had spent more time planning, maybe I would've avoided the issues that arose.

Issue 1

There was an ask to do an incremental load. That's reasonable. If your source data is large you don't want to update it over and over again. It sounds simple enough, but this is actually quite hard to do with a NoSQL DB like DynamoDB.
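To make the ask concrete: an incremental load only needs the rows that are newer than whatever was loaded last. Here is a minimal sketch, assuming each row carries a `date` field and that the most recent loaded date is already known. The function name `rows_to_load` and the sample numbers are made up for illustration and are not taken from my repository.

```python
from datetime import date


def rows_to_load(rows, last_loaded):
    """Return only the rows dated after the most recent date already loaded."""
    if last_loaded is None:
        # Nothing has been loaded yet, so the first run is a full load.
        return list(rows)
    return [row for row in rows if row["date"] > last_loaded]


# Made-up sample data, purely for illustration.
source = [
    {"date": date(2020, 9, 1), "cases": 100},
    {"date": date(2020, 9, 2), "cases": 120},
    {"date": date(2020, 9, 3), "cases": 150},
]
new_rows = rows_to_load(source, last_loaded=date(2020, 9, 2))
```

The filter itself is trivial. Knowing `last_loaded` is the part that gets awkward when the data lives in DynamoDB.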

The only way I found to provide incremental load functionality was to scan the entire table before each load, which is costly and time-consuming. Then I found that the PutItem method I was using overwrites an existing item with the same primary key, and realized it would be quicker and cheaper to reload the entire dataset every day. So I continued...
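The overwrite behaviour is what makes a daily full reload safe to repeat. Here is a rough sketch of the idea, using a hypothetical in-memory stand-in (`InMemoryTable`) rather than a real table so it runs anywhere. The `date` key and the sample rows are assumptions for illustration. With boto3, a `Table` resource exposes `put_item(Item=...)` with the same call shape, so `full_reload` should work against one unchanged.

```python
class InMemoryTable:
    """Hypothetical stand-in that mimics PutItem's replace-on-same-key behaviour."""

    def __init__(self, key_name):
        self.key_name = key_name
        self.items = {}

    def put_item(self, Item):
        # Like PutItem, an item with an existing key is replaced, not duplicated.
        self.items[Item[self.key_name]] = dict(Item)


def full_reload(table, rows):
    """Write every row; repeating the call with the same rows changes nothing."""
    count = 0
    for row in rows:
        table.put_item(Item=row)
        count += 1
    return count


table = InMemoryTable(key_name="date")
# Made-up sample data: day one loads two rows.
full_reload(table, [
    {"date": "2020-09-01", "cases": 100},
    {"date": "2020-09-02", "cases": 120},
])
# Day two reloads with one corrected row and one new row.
full_reload(table, [
    {"date": "2020-09-02", "cases": 125},
    {"date": "2020-09-03", "cases": 150},
])
```

After the second run the table holds three items, and the corrected row has replaced the old one. For a larger dataset, boto3's `batch_writer()` on the table would probably be a better fit than one `put_item` call per row.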

Issue 2

I decided to use QuickSight for my visualization layer, trying to stick with AWS-owned services. Besides, QuickSight looked a lot cheaper than Tableau at first glance.

Unfortunately, QuickSight does not support DynamoDB as a native data source, meaning I would have to move my data again before I could visualize it.

I had initially looked at using DynamoDB Streams and Kinesis Firehose to feed the data into S3, but I decided it would be much simpler if I just updated my Lambda function to write a JSON file to S3 after updating DynamoDB.

https://aws.amazon.com/blogs/database/automatically-archive-items-to-s3-using-dynamodb-time-to-live-with-aws-lambda-and-amazon-kinesis-firehose/
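The simpler route, writing a JSON file to S3 from the Lambda function, could look roughly like this. It is a sketch, not the code from my repository: the function name `export_rows_to_s3` and its parameters are assumptions, and the S3 client is passed in so the function can be exercised without AWS credentials. `put_object` with `Bucket`, `Key`, `Body` and `ContentType` is the standard boto3 S3 client call.

```python
import json


def export_rows_to_s3(s3_client, bucket, key, rows):
    """Serialise rows as a JSON array and upload them as a single S3 object."""
    # default=str turns values json cannot encode natively (dates, Decimal) into strings.
    body = json.dumps(rows, default=str).encode("utf-8")
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
    )
    # Return the payload size in bytes, which is handy for logging.
    return len(body)
```

Inside the Lambda handler this would be called with `boto3.client("s3")` and whatever bucket and key the QuickSight data source points at, right after the DynamoDB write succeeds.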

Surely that would solve everything!

Issue 3

More pain with QuickSight soon followed when I discovered that QuickSight does not allow public dashboards. If I wanted to make my dashboard visible to others I would need to:

Write an application
Configure authentication for that application with some service like Cognito
Embed my QuickSight dashboard into that application

All of these tasks seemed like too much for this simple challenge. Thus, you get a static picture of a dashboard that only I get to enjoy.

A Better Way

The further I got into my application the more I started thinking about what I would've done if I had spent more time planning.

Would I have noticed these issues before I started?
Would I have gone in a different direction?

I'd like to think the answer to both of these questions is yes, yet I will never really know.

I've thought about using Aurora Serverless instead of DynamoDB, but the fact that it needs to run within a VPC complicates things:

Lambda would need to be within a VPC
I would have to create a VPC
I would have to create Subnet(s)
I would have to create Security Group(s)
I would have to create Network Access Control List(s)
I would have to upgrade to QuickSight Enterprise to access VPC resources

https://docs.aws.amazon.com/quicksight/latest/user/working-with-aws-vpc.html

Conclusion

I'm really not sure why I went the way I did, although it was a fun experience regardless. I'm very intrigued to see what other people do for their challenge and even more intrigued to see what the October #CloudGuruChallenge will entail.

One thing is certain, though: I will spend sufficient time planning before I get started on my next project, no matter what it may be.

Blog

Justin Wheeler