Web scraping with Python and AWS Lambda: A modern approach
Adelmo Filho
Posted on June 13, 2021
In December 2020, AWS started supporting Lambda functions packaged as container images, a real breakthrough that lets us deploy far more complex projects with the same pay-only-for-what-you-use pricing and serverless architecture.
Web scraping workloads benefit directly from this upgrade, because it makes installing Selenium much easier.
Let's code!
The Dockerfile below is based on the official Lambda container image for Python 3.8 (building this image from scratch is really painful).
# Dockerfile
FROM public.ecr.aws/lambda/python:3.8
RUN yum install -y \
    Xvfb \
    wget \
    unzip
# Install google-chrome-stable
RUN wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm && \
    yum localinstall -y google-chrome-stable_current_x86_64.rpm
# Install chromedriver
RUN wget https://chromedriver.storage.googleapis.com/2.40/chromedriver_linux64.zip && \
    unzip chromedriver_linux64.zip && \
    chmod 775 chromedriver
# Install selenium
RUN pip3 install -U pip selenium
# Copy lambda's main script
COPY app.py .
CMD ["app.lambda_handler"]
The Python script below configures Selenium with headless Chrome. Note the path to chromedriver in the driver definition: it comes from the working directory of the base image (/var/task), which is where the Dockerfile unzipped it.
# app.py
from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument("--headless")
chromeOptions.add_argument("--remote-debugging-port=9222")
chromeOptions.add_argument('--no-sandbox')
driver = webdriver.Chrome('/var/task/chromedriver', chrome_options=chromeOptions)

def lambda_handler(event, context):
    driver.get("http://www.python.org")
    return {
        "statusCode": 200,
        "body": driver.title
    }
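The handler above always fetches the same page. Here is a minimal sketch of one way to make the target configurable, assuming the invocation event carries a `url` key. That key, the `resolve_url` and `make_handler` helpers, and the `FakeDriver` stand-in are all hypothetical names introduced for illustration; none of them come from the original script. The driver is passed in as a parameter so the logic can be tried without Chrome installed.

```python
# handler_sketch.py (hypothetical file, not part of the original project)

DEFAULT_URL = "http://www.python.org"


def resolve_url(event, default=DEFAULT_URL):
    """Return the URL to scrape, falling back to the default.

    Assumption: the event is a dict with an optional "url" key.
    """
    if isinstance(event, dict):
        url = event.get("url")
        if isinstance(url, str) and url.startswith(("http://", "https://")):
            return url
    return default


def make_handler(driver):
    """Build a Lambda-style handler around any object with get() and title."""
    def handler(event, context):
        driver.get(resolve_url(event))
        return {"statusCode": 200, "body": driver.title}
    return handler


class FakeDriver:
    """Stand-in for webdriver.Chrome, used only to exercise the handler."""

    def __init__(self):
        self.visited = []
        self.title = ""

    def get(self, url):
        # Record the visit and fake a page title instead of loading a page.
        self.visited.append(url)
        self.title = "Fake title for " + url
```

In app.py, something like `lambda_handler = make_handler(driver)` would wire this to the real Chrome driver, and the event payload could then be `{"url": "https://example.com"}` instead of `{}`.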
Finally, build and run the container image!
$ docker build -t scrapper:latest .
$ docker run -p 9000:8080 scrapper:latest
To test your new containerized web scraping Lambda function, run the following command.
$ curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'
{"statusCode": 200, "body": "Welcome to Python.org"}