Automate a Papertrail with AWS Lambda
Jakob Ondrey
Posted on April 5, 2022
Photo by Nana Smirnova on Unsplash
TLDR
The aws_lambda_python_alpha module (or probably any aws_lambda_<language>_alpha module) is really great for writing, building, packaging, and deploying serverless functions with the AWS CDK.
Import the modules:
import aws_cdk.aws_lambda as lambda_
import aws_cdk.aws_lambda_python_alpha as lambda_python
Create Lambda Infrastructure with the alpha module.
Build (w/ docker running) and Deploy:
cdk synth --no-staging
cdk deploy
The CDK will take care of the rest.
About Me
My name is Jakob and I am a DevOps Engineer. I used to be a lot of other things as well (Dish Washer, Retail Employee, Camp Counselor, Army Medic, Infectious Disease Researcher), but now I am a DevOps Engineer. I received no formal CS education but I'm not self taught, because I had thousands of instructors who taught me through their tutorials and blog posts. The culture of information sharing within the software engineering community is vital to everyone, especially those like me who didn't have other options. So, as I learn new things I will be documenting them through the eyes of someone learning for the first time, because those are the people most in need of a guide. Happy Learning! And don't be a stranger.
The Problem
A couple of months ago I caused quite a stir by pointing out that my local school district (the 3rd largest in the United States) was underrepresenting COVID-19 cases on their School Dashboards in the middle of the Omicron surge.
To summarize: They changed which cases were being displayed without changing the legends of the graphs and charts displaying the data. This resulted in thousands of cases missing that, based on the legends, should have been there. It was very bad work on their part.
One tool that helped me tell the story and counter the school district's excuses was the set of archived snapshots of the district's web pages that had been captured by the Internet Archive's Wayback Machine.
The Solution
If you haven't heard of the Wayback Machine, go check it out. It is essentially a time capsule of web pages, and it allowed me to prove that on (at least) certain dates the district graphs were not showing what they said they were. It was very useful for me.
So what does this have to do with anything? Well, in the spirit of accountability, and wanting to play with the AWS-CDK, I built a stack that you can deploy to automate the archival of web pages to the Internet Archive. Because it uses a serverless Lambda function, it is so cost effective that it is basically free: a short daily run uses a tiny fraction of the AWS Free Tier for Lambda, which covers one million requests and 400,000 GB-seconds of compute per month.
You can find the repo here if you want to take a look or clone it for your own use and I will also outline the juicy bits below, but briefly:
Using Python and the AWS-CDK, I have created a CloudFormation stack consisting primarily of a Lambda function that uses a Wayback Machine Python library to request the archival of a list of URLs, and an EventBridge rule that triggers the Lambda. The CDK also creates all the accessory roles and permission sets that are required for this to function.
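The "basically free" claim can be checked with some back-of-the-envelope arithmetic. The sketch below is only an estimate: the free tier allowances are the monthly figures AWS publishes for Lambda (worth re-checking against current pricing), the memory, timeout, and 5 second interval are the settings used later in this post, and the 10 URL list and the monthly_gb_seconds helper are made up for illustration.

```python
# Assumed figures: Lambda free tier allowances per month (verify against AWS pricing).
FREE_GB_SECONDS_PER_MONTH = 400_000
FREE_REQUESTS_PER_MONTH = 1_000_000

MEMORY_GB = 128 / 1024   # memory_size=128 in the stack
SECONDS_PER_URL = 5      # alarm interval used in the function
TIMEOUT_SECONDS = 300    # timeout=Duration.seconds(300) in the stack


def monthly_gb_seconds(url_count, runs_per_day=1, days=30):
    """Rough compute usage for a month of scheduled runs (hypothetical helper)."""
    # A run cannot last longer than the function timeout.
    seconds_per_run = min(url_count * SECONDS_PER_URL, TIMEOUT_SECONDS)
    return seconds_per_run * MEMORY_GB * runs_per_day * days


# With 5 seconds per URL, one invocation has time for this many URLs.
max_urls_per_run = TIMEOUT_SECONDS // SECONDS_PER_URL

# Example: a made-up list of 10 URLs archived once a day.
usage = monthly_gb_seconds(url_count=10)
share_of_free_tier = usage / FREE_GB_SECONDS_PER_MONTH

print(max_urls_per_run, usage, share_of_free_tier)
```

Under these assumptions a daily run over 10 URLs uses well under a tenth of a percent of the monthly compute allowance, and the 300 second timeout caps a single invocation at roughly 60 URLs.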
The Function
The function code (below) sits in a directory with a requirements.txt file. The CDK (with the help of docker) will create a container, install the packages listed in requirements.txt, and deploy that artifact to AWS for you. It's quite neat.
1  from waybackpy import WaybackMachineSaveAPI
2  import signal
3  import logging
4
5  logger = logging.getLogger()
6  logger.setLevel(logging.INFO)
7
8  user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
9  url_list = [
10     "https://api.cps.edu/health/help",
11 ]
12
13 def handler(signum, frame):
14     raise Exception("Request Sent")
15
16 def lambda_handler(event, context):
17     signal.signal(signal.SIGALRM, handler)
18     for url in url_list:
19         signal.alarm(5)  # resets alarm
20         try:
21             save_api = WaybackMachineSaveAPI(url, user_agent)
22             save_api.save()
23         except Exception:
24             logger.info("Request Sent: {}".format(url))
25             continue
At the top, the imports and logging are set up, along with the user_agent for the Wayback API. Then there is a list of all the URLs you want to save.
Skipping the handler function for now, the Lambda enters the lambda_handler function. The first thing it does, on line 17, is use the signal library to register the handler function as the callback for the alarm signal (SIGALRM), so that handler is called when an alarm reaches 0. This is what allows us to just throw requests at the Wayback Machine without having to wait the >30 seconds for a response that the archival is done.
Then we iterate over url_list, resetting the alarm to 5 seconds each time and then requesting that the URL be saved. Meanwhile the alarm is counting down, and when it reaches 0 the handler raises its exception, the request is logged, and we move on to the next iteration: reset the alarm and request the next URL.
I think it is a nice way to self rate limit without having to wait for a response. I suppose you could lower the timer to 1 second if you really wanted to.
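The alarm pattern can be tried locally without touching the Wayback Machine. What follows is a minimal sketch, not the code deployed in this post: pretend_save stands in for the real save request, AlarmFired and fire_and_move_on are names made up for this example, and signal.setitimer is swapped in for signal.alarm so the demo can use fractions of a second. It relies on SIGALRM, so it only works on Unix-like systems and only from the main thread.

```python
import signal
import time


class AlarmFired(Exception):
    """Raised by the signal handler when the timer expires (hypothetical name)."""


def _on_alarm(signum, frame):
    raise AlarmFired()


def fire_and_move_on(items, slow_call, seconds=5.0):
    """Start slow_call(item) for each item and abandon it after `seconds`."""
    abandoned = []
    signal.signal(signal.SIGALRM, _on_alarm)
    for item in items:
        # setitimer accepts fractional seconds; signal.alarm() takes whole seconds.
        signal.setitimer(signal.ITIMER_REAL, seconds)
        try:
            slow_call(item)
        except AlarmFired:
            # Only the timer lands here; any other error propagates to the caller.
            abandoned.append(item)
        finally:
            signal.setitimer(signal.ITIMER_REAL, 0)  # cancel a still-pending timer
    return abandoned


def pretend_save(url):
    time.sleep(1)  # stands in for a save request that takes a long time


abandoned = fire_and_move_on(
    ["https://example.com/a", "https://example.com/b"],
    pretend_save,
    seconds=0.1,
)
print(abandoned)
```

One difference from the function above is deliberate: catching a dedicated exception instead of a bare Exception means a genuine failure (a bad URL, a network error) is not logged as if the request had been sent.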
The Stack
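The CDK stack is maintained in /wayback/wayback_stack.py.
Note that both the aws_lambda and aws_lambda_python_alpha modules are imported, and that aws_lambda_python_alpha is imported on its own. That is important: the alpha module is distributed as its own package, separate from the main CDK library, so it has to be installed and imported separately.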
The stack contains two things:
wayback_function
The function is created with the aws_lambda_python_alpha module. It is used instead of aws_lambda because it packages the function and its dependencies automatically, and even allows local testing should we be interested in that.
16 wayback_function = lambda_python.PythonFunction(
17     self,
18     "wayback",
19     function_name="WaybackArchiver",
20     runtime=lambda_.Runtime.PYTHON_3_9,
21     entry="./wayback_app",
22     index="app.py",
23     handler="lambda_handler",
24     memory_size=128,
25     timeout=Duration.seconds(300),
26 )
Most of the parameters make sense when viewed in context, but I will explicitly point out that entry= on line 21 is the DIRECTORY that will be packaged up by the CDK. If there is a requirements.txt file in this directory, all the libraries and modules listed in it will be installed.
index= and handler= on lines 22-23 together tell the CDK which file and function in the entry directory should be the starting point at execution.
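To make the relationship between entry, index, and handler concrete, here is a rough illustration of what the three settings identify. It is not how Lambda or the CDK actually load code: resolve_handler is a hypothetical helper, and a temporary directory stands in for ./wayback_app.

```python
import importlib.util
import pathlib
import tempfile


def resolve_handler(entry, index, handler):
    """Load the file `index` from the `entry` directory and return its `handler` function."""
    path = pathlib.Path(entry) / index
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, handler)


# A throwaway directory standing in for the entry directory.
with tempfile.TemporaryDirectory() as entry:
    (pathlib.Path(entry) / "app.py").write_text(
        "def lambda_handler(event, context):\n"
        "    return 'archived'\n"
    )
    function = resolve_handler(entry, index="app.py", handler="lambda_handler")
    result = function({}, None)

print(result)
```

In other words, entry picks the directory, index picks the file inside it, and handler picks the function inside that file.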
Event Rule
The rule is not assigned to a variable because it isn't referenced anywhere else, but this section creates the EventBridge rule that triggers the Lambda. There are only two important things here: schedule and targets.
28 event.Rule(self, "WaybackRule",
29     rule_name="WaybackRule",
30     schedule=event.Schedule.cron(
31         minute="0",
32         hour="9",
33     ),
34     targets=[
35         targets.LambdaFunction(wayback_function),
36     ],
37 )
Line 30 indicates that you will be using the CDK's version of cron notation. Note that the timezone is UTC and every value is a string. In addition, parameters that are not stated default to *. So, for example, leaving out day= as above means every day of the month.
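The defaulting behaviour is easier to see when the fields are written out as a full expression. The helper below is a hypothetical, pure-Python sketch of how I understand the keyword arguments to be turned into an EventBridge cron expression; it is not the CDK's own code, and the name eventbridge_cron is made up. EventBridge wants a ? in one of the two day fields, which is why the day-of-week slot is not a * here.

```python
def eventbridge_cron(minute=None, hour=None, day=None, month=None,
                     week_day=None, year=None):
    """Build an EventBridge-style cron expression from optional string fields."""
    if day is not None and week_day is not None:
        raise ValueError("supply at most one of day and week_day")

    def fallback(value, default="*"):
        # Fields that are left out fall back to the wildcard.
        return default if value is None else value

    # One of the two day fields has to be "?".
    if week_day is not None:
        day_field, week_day_field = "?", week_day
    else:
        day_field, week_day_field = fallback(day), "?"

    fields = [fallback(minute), fallback(hour), day_field,
              fallback(month), week_day_field, fallback(year)]
    return "cron({})".format(" ".join(fields))


daily = eventbridge_cron(minute="0", hour="9")
mondays = eventbridge_cron(minute="0", hour="9", week_day="MON")
print(daily)    # cron(0 9 * * ? *)
print(mondays)  # cron(0 9 ? * MON *)
```

So the schedule in the stack reads as "09:00 UTC, every day of every month, every year".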
Line 34 is where you specify what should happen when this event occurs. In this case it references the wayback_function defined above.
Building and Deploying
The stack, including the packaged Lambda, can be built (assuming docker is running) and deployed with the following commands:
cdk synth --no-staging
cdk deploy
Once created you will have a function that will archive a list of web pages (that you don't have to manage storage of) without further direction from you.
Modifications to the stack or the application just need to be rebuilt and deployed, and the old scheme will be replaced with the modifications.
Now, figure out what web pages you think could use some oversight and start archiving them. You never know when an external record of their content at a given time will bear fruit!