Nazli Ander
Posted on August 2, 2020
While web scraping, I have come across many useful applications, such as listing historical prices of financial assets or gathering current news topics. Although those examples are interesting in themselves, there was usually one main goal at the end: creating a database with the scraped information.
Whenever I went a bit further with scraping, I ended up on websites that use JavaScript to display the data I needed. That is how I bumped into Selenium, a web testing and automation tool. In this short write-up, I aim to list some steps that I found quite useful while setting up Selenium within a Docker container.
Introduction to Selenium WebDriver
Selenium WebDriver is a web automation and testing tool. It was created by Simon Stewart in 2006 as the first cross-platform testing framework that could control the browser from the OS level.
So with Selenium, I can run automated actions in browsers (clicking, hovering, and filling forms) by communicating with them directly. There are language bindings for Java, C#, PHP, Python, Perl, Go, and Ruby. Since I am most familiar with Python, that is what I will be talking about.
To work with a browser, I need to choose among a set of browser options like Firefox, Chrome (Chromium), Edge, and Safari. In my experience, Chrome with the headless option (not generating a user interface) is the most performant one, hence I will be sticking to that.
Pulling the Image and Setting Up Google Chrome
To build my custom Selenium-Python image, I need a Python base image; in this write-up I picked version 3.8.
Then I can install Google Chrome on top of it. Remember, without Google Chrome itself, Selenium has no browser to drive. There are a few steps for setting up Google Chrome on Linux:
- Adding the Google Chrome signing key to apt
- Adding the Google Chrome stable version to the repositories
- Updating the repositories so that apt can see the stable version
- Installing google-chrome-stable
FROM python:3.8
# Adding trusting keys to apt for repositories
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
# Adding Google Chrome to the repositories
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
# Updating apt to see and install Google Chrome
RUN apt-get -y update
# Magic happens
RUN apt-get install -y google-chrome-stable
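As an optional sanity check (my own addition, not part of the original setup), printing the installed Chrome version during the build will fail the build early if the install went wrong:

```dockerfile
# Optional sanity check: fails the build if Chrome did not install correctly
RUN google-chrome-stable --version
```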
Installing Chrome Driver
Selenium requires a driver to interface with the chosen browser. Hence, I need a way to install Chrome Driver in our Linux image. Here are the steps to follow:
- Installing unzip, which we will need for the zipped Chrome Driver
- Downloading the Chrome Driver into a file called /tmp/chromedriver.zip (this name can be changed)
- Unzipping /tmp/chromedriver.zip into the Linux executable path
After those steps, I need to set the display port to 99, as Selenium uses it. This avoids some crashes.
# Installing Unzip
RUN apt-get install -yqq unzip
# Download the Chrome Driver
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip
# Unzip the Chrome Driver into /usr/local/bin directory
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/
# Set display port as an environment variable
ENV DISPLAY=:99
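One caveat worth flagging: ChromeDriver's major version must match the installed Chrome version, and the LATEST_RELEASE URL above always fetches the newest driver, which can drift ahead of the stable Chrome coming from apt. A combined version check (my own addition, not in the original Dockerfile) makes a mismatch easy to spot at build time:

```dockerfile
# Optional: print both versions so a Chrome/ChromeDriver mismatch is visible in the build log
RUN google-chrome-stable --version && chromedriver --version
```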
Preparing the Docker for a Run
All the steps above were only for setting up Chrome in our Dockerfile. To run my Python application (app.py) using Docker, I need the following lines in our Dockerfile.
COPY . /app
WORKDIR /app
RUN pip install --upgrade pip
RUN pip install -r requirements.txt
CMD ["python", "./app.py"]
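For completeness, the requirements.txt referenced above needs to include at least Selenium; a minimal example might look like the following (the pinned version is an illustrative assumption from around the time of writing, not a requirement of the setup):

```text
selenium==3.141.0
```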
Apart from those Docker settings, I would like to briefly mention some Docker-specific Chrome options for setting up the Chrome Driver via Python. I want to explicitly show those few options in one function, set_chrome_options, in the example code below. I need 4 specific arguments to run our Chrome Driver inside Docker:
- Explicitly saying that this is a headless application with --headless
- Explicitly bypassing the security level in Docker with --no-sandbox. There is a nice Stack Overflow thread about this: apparently, as the Docker daemon always runs as the root user, Chrome crashes without it.
- Explicitly disabling the usage of /dev/shm. The /dev/shm partition is too small in certain VM environments, causing Chrome to fail or crash.
- Disabling the images with chrome_prefs["profile.default_content_settings"] = {"images": 2}
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def set_chrome_options() -> Options:
    """Sets Chrome options for Selenium.

    Chrome options for a headless browser are enabled.
    """
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_prefs = {}
    chrome_options.experimental_options["prefs"] = chrome_prefs
    chrome_prefs["profile.default_content_settings"] = {"images": 2}
    return chrome_options


if __name__ == "__main__":
    driver = webdriver.Chrome(options=set_chrome_options())
    # Do stuff with your driver
    driver.close()
Last Words
Here is the Dockerfile that I took as an example. While creating it, I used the links that I shared to solve the problems that I faced. There might be other solutions to those problems; I am curious to hear about them.
Until now, I have used this setup to scrape web archives for asset prices, books, yellow pages, and judgment texts. Although Selenium is not designed for web scraping, I leveraged this nice tool for taming JavaScript-heavy websites. I should admit, though, that if the information I was looking for were not hidden behind JavaScript, I would have been a lot happier using only Requests, BeautifulSoup4, and/or Scrapy for Python, because all of those are simpler to set up and more performant.
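To illustrate that contrast, here is a minimal sketch of the simpler stack: parsing a static HTML snippet with BeautifulSoup4 alone, no browser or driver involved. The HTML and the selectors are made up for the example; a real scraper would first fetch the page with Requests.

```python
from bs4 import BeautifulSoup

# A static HTML snippet standing in for a fetched page (hypothetical content)
html = """
<html>
  <body>
    <ul id="prices">
      <li class="asset">Gold: 1975.50</li>
      <li class="asset">Silver: 24.10</li>
    </ul>
  </body>
</html>
"""

# No browser needed: BeautifulSoup parses the markup directly
soup = BeautifulSoup(html, "html.parser")
assets = [li.get_text() for li in soup.select("#prices .asset")]
print(assets)  # ['Gold: 1975.50', 'Silver: 24.10']
```

When the data is present in the server-rendered HTML like this, there is simply nothing for Selenium to do that a parser cannot.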
Happy Scraping!