Web scraping with Python and AWS Lambda: A modern approach

adelmofilho42

Adelmo Filho

Posted on June 13, 2021

In December 2020, AWS added support for deploying Lambda functions as container images. This is a real breakthrough: it lets us deploy far more complex projects while keeping the same pay-only-for-what-you-use pricing and serverless architecture.

Web scraping workloads benefit from this upgrade in particular, because it makes installing Selenium much easier.

Let's code!

The Dockerfile below is based on the official Lambda container image for Python 3.8 (creating this image from scratch is really painful).

# Dockerfile
FROM public.ecr.aws/lambda/python:3.8

RUN yum install -y \
    Xvfb \
    wget \
    unzip

# Install google-chrome-stable
RUN wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm && \
    yum localinstall -y google-chrome-stable_current_x86_64.rpm

# Install chromedriver (its version must be compatible with the installed Chrome)
RUN wget https://chromedriver.storage.googleapis.com/2.40/chromedriver_linux64.zip && \
    unzip chromedriver_linux64.zip && \
    chmod 775 chromedriver

# Install selenium
RUN pip3 install -U pip selenium

# Copy lambda's main script
COPY app.py .

CMD ["app.lambda_handler"]

The Python script below configures Selenium with headless Chrome. Note the chromedriver path in the driver definition: it comes from the working directory of the base image (/var/task), which is where the Dockerfile unzipped chromedriver.

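# app.py

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Run Chrome without a display (headless) and without its sandbox.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--remote-debugging-port=9222")
chrome_options.add_argument("--no-sandbox")

# /var/task is the working directory of the base image, where the Dockerfile
# unzipped chromedriver. Recent Selenium 4 releases take the driver path via
# Service and the options via `options=` instead of `chrome_options=`.
driver = webdriver.Chrome(
    service=Service("/var/task/chromedriver"),
    options=chrome_options,
)


def lambda_handler(event, context):
    # Load the page and return its title as the response body.
    driver.get("http://www.python.org")
    return {
        "statusCode": 200,
        "body": driver.title,
    }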
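
The script above creates the browser when the module is imported and always visits the same page. As a minimal sketch of a variation, the handler could create the driver on first use and read the target address from the event. The names `make_handler` and `default_driver_factory`, and the `"url"` event key, are assumptions made for illustration and are not part of the original script.

```python
# Sketch only: make_handler, default_driver_factory and the "url" event key
# are illustrative names, not part of the original app.py.

DEFAULT_URL = "http://www.python.org"


def default_driver_factory():
    """Build a headless Chrome driver (needs selenium and chromedriver)."""
    # Imported lazily so this module can be loaded without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    return webdriver.Chrome(
        service=Service("/var/task/chromedriver"), options=options
    )


def make_handler(driver_factory=default_driver_factory):
    """Return a Lambda handler that creates its driver on first use."""
    state = {"driver": None}

    def handler(event, context):
        # Create the browser once, then reuse it on later invocations.
        if state["driver"] is None:
            state["driver"] = driver_factory()
        driver = state["driver"]
        # Fall back to the default page when the event carries no "url".
        url = (event or {}).get("url", DEFAULT_URL)
        driver.get(url)
        return {"statusCode": 200, "body": driver.title}

    return handler


lambda_handler = make_handler()
```

Passing the factory as a parameter also makes it possible to exercise the handler with a stand-in driver object, without starting Chrome.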

Finally, build and run the container image!

$ docker build -t scrapper:latest .
$ docker run -p 9000:8080 scrapper:latest

With the container running, test your new containerized web scraping Lambda function by sending it an invocation request:

$ curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'

{"statusCode": 200, "body": "Welcome to Python.org"}
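
The same invocation can be sent from Python. The sketch below uses only the standard library and assumes the container is listening on localhost port 9000, as in the docker run command; the helper names `build_request` and `invoke` are made up for this example.

```python
import json
import urllib.request

# Same endpoint as the curl command; the port matches `docker run -p 9000:8080`.
INVOKE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def build_request(event, url=INVOKE_URL):
    """Build a POST request that carries the event as a JSON body."""
    data = json.dumps(event).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def invoke(event, url=INVOKE_URL, timeout=60):
    """Send the event to the local container and decode the JSON reply."""
    request = build_request(event, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

While the container is running, calling `invoke({})` should return the same response that curl printed, as a Python dictionary.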