The one where we build a web scraper and a slackbot - Part 1
Victory Akaniru
Posted on February 23, 2020
The problem
As software engineers, part of what we do revolves around making seemingly easy things a little bit easier. Who would imagine that doing these three things would be a chore?
- Visit Brainyquote
- Find and copy a random quote about excellence from the site.
- Post the quote to a slack channel.
It seems simple enough, but done every day for a year it becomes boring and tedious.
Python is a scripting language well suited to things like this! With Python, we can automate this whole process and not have to do the same thing every day.
The Product
Well, that's exactly what we will be doing. In this two-part series, we will build a slackbot that periodically sends a random quote about excellence to a specified slack channel. Some of our MVP features include
- A scraping tool: This is responsible for getting a whole lot of quotes and saving them to a JSON file for future use.
- A Slack bot: This is responsible for periodically (maybe every morning?) sending one random quote to a slack channel. This part of the project requires us to write some simple code for posting the message to a Slack channel at intervals.
Prerequisites
- A Python environment and some basic knowledge of Python. That's it.
Part 1: The scraping tool
First off, we need to get some groundwork done by creating a basic project setup, creating a virtual environment and installing some packages.
- cd newly_created_folder
- mkdir scrapping-tool
- cd scrapping-tool
- touch __init__.py main.py scroll.py selenium_driver.py
At this point, we're good to go, but I strongly recommend you create a virtual environment for this project. If you have virtualenv installed on your PC, all you have to do is run the following commands
- virtualenv --python=python3 venv
- source venv/bin/activate
If you don't, or if you have questions about what a virtualenv is, you may want to read this.
Next, install the following 3rd party packages
- BeautifulSoup (published on PyPI as beautifulsoup4) to help us scrape any website for data
- selenium to automate browser interactions while doing so
- lxml, the parser BeautifulSoup will use to read the page's HTML
Run the following command in your terminal
pip3 install beautifulsoup4 selenium lxml
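Before moving on, it can help to confirm that the packages actually landed in the active environment. The snippet below is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, and the import names `bs4`, `selenium` and `lxml` are the ones the code in this article relies on.

```python
import importlib.util


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# Import names differ from install names in one case: beautifulsoup4 is imported as bs4
missing = missing_packages(["bs4", "selenium", "lxml"])
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All packages are installed")
```

If anything is reported as missing, the usual cause is running pip outside the activated virtual environment.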
Finally, download chromedriver by following the basic instructions here. This enables us to run a headless version of Chrome when using selenium for automation. If you're on a Mac you can simply run (older Homebrew releases spelled this brew cask install chromedriver)
brew install --cask chromedriver
Setup sometimes may endure for a night but code comes in the morning... UNKNOWN
Let's write some code!
In the scrapping-tool folder you created, locate the selenium_driver.py file and paste the following code in
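from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Configure Chrome: ignore certificate errors, browse incognito, run without a window
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')

# Selenium 4 takes the chromedriver path via Service and the options via `options`
driver = webdriver.Chrome(service=Service("/usr/local/bin/chromedriver"), options=options)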
This piece of code imports webdriver from selenium and adds some configuration options such as incognito and headless mode. Finally, we make use of the chromedriver we installed earlier by pointing to the path where it was downloaded, and we save the resulting browser in a driver variable for future use.
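If you would rather switch headless mode on and off without editing the file, the same configuration can be wrapped in functions. This is only a sketch under a few assumptions: `chrome_arguments` and `build_driver` are hypothetical names that are not part of the original project, and `build_driver` assumes Selenium 4, where the chromedriver path is passed through a `Service` object.

```python
def chrome_arguments(headless=True):
    """Return the Chrome flags used in this article; the headless flag is optional."""
    arguments = ["--ignore-certificate-errors", "--incognito"]
    if headless:
        arguments.append("--headless")
    return arguments


def build_driver(chromedriver_path, headless=True):
    """Create a Chrome driver from the flags above (assumes Selenium 4)."""
    # Imported here so chrome_arguments can be used without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments(headless):
        options.add_argument(argument)
    return webdriver.Chrome(service=Service(chromedriver_path), options=options)


print(chrome_arguments(headless=False))
```

Calling `build_driver("/usr/local/bin/chromedriver", headless=False)` would open a visible browser window, which is handy while debugging selectors.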
By adding the __init__.py file to our folder, we told Python to treat that folder as a package. This means the modules in it, along with the functions and variables they define, can be imported from elsewhere in our app.
Part of the hassle that comes with automating web browsers shows up when human interaction is needed. For example, the website we are trying to scrape has some behaviours you will notice once you open the site.
- On the first visit to the website, you have to click and accept the privacy policy.
- After that, we see the page with all those quotes we would like to get, but this page implements an infinite scroll.
It wouldn't be much of an automation if we had to click that button ourselves or scroll for the browser every time it reaches the bottom of the page. These problems bring us to our next step, scroll.py.
The key to scraping a website properly lies in your ability to hit inspect and find the class or id with which you can access an element.
In the file scroll.py, copy and paste the code below.
import time

from selenium.webdriver.common.by import By


def scroll(driver, timeout):
    scroll_pause_time = timeout

    # wait for terms modal to popup and then click
    driver.implicitly_wait(timeout)
    # find_elements(By.CSS_SELECTOR, ...) is the Selenium 4 spelling of
    # find_elements_by_css_selector
    privacy_button = driver.find_elements(By.CSS_SELECTOR, ".qc-cmp-buttons > button:nth-child(2)")
    # The modal may not show up on every visit, so only click when it was found
    if privacy_button:
        privacy_button[0].click()
    time.sleep(2)

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(scroll_pause_time)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # If heights are the same it will exit the function
            break
        last_height = new_height
A few things to note
- We create a scroll function which takes two parameters, driver (the browser driver we configured earlier) and timeout (wait time).
- We make use of the driver's element lookup with a CSS selector (find_elements in current Selenium, find_elements_by_css_selector in older releases). This helps us locate elements, in our case the privacy button.
- We also make use of the execute_script method, which takes a string of JavaScript and runs it in the page. This is what lets us scroll the window, read the page height, etc.
- Notice the while loop? On every pass it scrolls to the bottom, waits for new content to load, and compares the new scroll height with the last one. If both heights are the same, nothing new was loaded, so we break out of the loop: we are at the end of the page.
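To see the stopping condition in isolation, here is a small simulation that needs no browser. `FakePage` is a hypothetical stand-in for the driver that pretends the page grows a couple of times and then stops, and `scroll_to_bottom` repeats the loop from scroll.py while counting how many scrolls it took.

```python
import time


class FakePage:
    """Hypothetical stand-in for a driver: each scroll loads the next height."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0

    def execute_script(self, script):
        # A scrollTo call "loads" the next chunk; any other call reports the height
        if script.startswith("window.scrollTo"):
            self._index = min(self._index + 1, len(self._heights) - 1)
            return None
        return self._heights[self._index]


def scroll_to_bottom(driver, pause_seconds):
    """Scroll until the reported height stops changing; return the scroll count."""
    scrolls = 0
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause_seconds)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return scrolls


page = FakePage([1000, 2000, 3000])
print(scroll_to_bottom(page, pause_seconds=0))  # prints 3
```

The page grows twice, and the third scroll is the one that finds no change. That last "wasted" scroll is how the loop learns it has reached the end, which is also why too short a pause on a slow connection can end the scroll early.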
Bringing it all together, we build the scraper itself.
In main.py, still within the scrapping-tool folder, add the following code
import re
import json

from bs4 import BeautifulSoup

from selenium_driver import driver  # here we import the driver we configured earlier
from scroll import scroll  # the scroll method


def get_quotes(url):
    try:
        # implicitly_wait tells the driver to wait before throwing an exception
        driver.implicitly_wait(30)
        # driver.get(url) opens the page
        driver.get(url)
        # This starts the scrolling by passing the driver and a timeout
        scroll(driver, 5)
        # Once scroll returns, bs4 parses the page_source
        soup = BeautifulSoup(driver.page_source, "lxml")
        # Then we close the driver as soup is storing the page source
        driver.close()

        # Match anchor tags by the start of their class names
        regex_quotes = re.compile('^b-qt')
        regex_authors = re.compile('^bq-aut')

        quotes_list = soup.find_all('a', attrs={'class': regex_quotes})
        authors_list = soup.find_all('a', attrs={'class': regex_authors})

        # Empty list to store the quotes
        quotes = []

        # Pair each quote tag with its author tag
        zipped_quotes = list(zip(quotes_list, authors_list))
        for i, x in enumerate(zipped_quotes):
            quote = x[0]
            author = x[1]

            quotes.append({
                "id": f"id-{i}",
                "quote": quote.get_text(),
                "author": author.get_text(),
                "author-link": author.get('href')
            })

        with open("quotes.json", 'w') as json_file:
            json.dump(quotes, json_file)

    except Exception as e:
        print(e, '>>>>>>>>>>>>>>>Exception>>>>>>>>>>>>>>')


get_quotes('https://www.brainyquote.com/topics/excellence-quotes')
What do we have here?
- We import the BeautifulSoup4 library and some built-in Python modules like re (regular expressions) and json.
- We also import the scroll function and the driver we created earlier.
- We create a get_quotes function that takes in a URL as a parameter.
- With implicitly_wait, we tell our browser to wait a little before throwing an error (sometimes network issues may slow things down).
- We call the scroll function to do its thing.
- Once that is done, we pass driver.page_source to BeautifulSoup4. Printing driver.page_source at this point would show a bunch of HTML tags.
- We call close to stop browser interactions; we have all we need now.
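The two compiled patterns in main.py, `^b-qt` and `^bq-aut`, are what pick out the quote and author links by the start of their class names. The sketch below shows what those patterns accept, using plain strings rather than a real page. The class names in the list are made up for illustration, and `classes_matching` is a hypothetical helper.

```python
import re

regex_quotes = re.compile('^b-qt')
regex_authors = re.compile('^bq-aut')


def classes_matching(pattern, class_names):
    """Return the class names that the compiled pattern matches from the start."""
    return [name for name in class_names if pattern.search(name)]


# Hypothetical class names, similar in shape to what inspect might show
class_names = ["b-qt", "b-qt_12345", "bq-aut", "oncl_q", "not-b-qt"]

print(classes_matching(regex_quotes, class_names))   # ['b-qt', 'b-qt_12345']
print(classes_matching(regex_authors, class_names))  # ['bq-aut']
```

The `^` anchor is what keeps `not-b-qt` out: the class has to begin with the prefix, not merely contain it.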
The goal is to scrape a quote, its author and a link to all of that author's quotes. At this point, we have all of that data, albeit in a format we cannot work with yet (HTML tags). Also notice from the code that we extract the authors and the quotes separately. How do we link each quote to its author? We also need to create a Python dictionary containing all those pieces of information, give each one a unique id and include the author's link.
Python's zip function to the rescue. To put it simply, this function takes two lists and generates a series of tuples containing one element from each list. We also made use of the enumerate function, which gives us an index alongside each tuple returned from zip. With that, we loop over the tuples, unpack each one, create a Python dictionary containing the data we want and append it to the quotes list.
We also called the BeautifulSoup4 method get_text() on the author and quote to return the actual text from our HTML tags, and get('href'), which returns whichever attribute of a tag we specify. In our case that is href, and this is how we get the link to the author's quotes. Finally, we save the contents of our quotes list by opening a quotes.json file and dumping our data into it with json.dump.
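Here is the pairing step on its own, as a minimal sketch with plain strings standing in for the HTML tags. The quote texts and author names are placeholders, and `pair_quotes` is a hypothetical helper. One thing worth knowing: zip stops at the shorter list, so a quote without a matching author is silently dropped.

```python
import json


def pair_quotes(texts, authors):
    """Pair each quote with its author and give every record a unique id."""
    records = []
    for i, (text, author) in enumerate(zip(texts, authors)):
        records.append({"id": f"id-{i}", "quote": text, "author": author})
    return records


# Placeholder data: three quotes but only two authors
quote_texts = ["First sample quote.", "Second sample quote.", "Third sample quote."]
author_names = ["Author A", "Author B"]

records = pair_quotes(quote_texts, author_names)
print(len(records))  # prints 2, because zip stops at the shorter list

# json.dumps produces the same text that json.dump would write to a file
as_json = json.dumps(records)
print(json.loads(as_json) == records)  # prints True
```

If the two lists on the real page ever differ in length, comparing `len(quotes_list)` with `len(authors_list)` before zipping would be a cheap way to notice.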
To run the scraper
python scrapping-tool/main.py
To see all this in action, you can comment out this piece of code
options.add_argument('--headless')
in the file selenium_driver.py.
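Once the scraper has run, a quick way to check the output is to load the JSON file and pick one record at random, which is roughly what the bot will need to do later. This is a sketch, not code from the project: `pick_random_quote` is a hypothetical helper, and the example writes made-up records to a temporary file so it can run without scraping anything.

```python
import json
import random
import tempfile
from pathlib import Path


def pick_random_quote(json_path):
    """Load the scraped records and return one of them at random."""
    with open(json_path) as json_file:
        quotes = json.load(json_file)
    return random.choice(quotes)


# Made-up records in the same shape main.py writes, saved to a temporary file
sample_quotes = [
    {"id": "id-0", "quote": "First sample quote.", "author": "Author A", "author-link": "/authors/author-a"},
    {"id": "id-1", "quote": "Second sample quote.", "author": "Author B", "author-link": "/authors/author-b"},
]
sample_path = Path(tempfile.mkdtemp()) / "quotes.json"
sample_path.write_text(json.dumps(sample_quotes))

picked = pick_random_quote(sample_path)
print(f'{picked["quote"]} ({picked["author"]})')
```

Pointing `pick_random_quote` at the real quotes.json should print one of the scraped quotes with its author.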
Yo! That's it for now. Feel free to leave feedback or opinions in the comments. In part two of this article, we will go through creating a slackbot that displays these scraped quotes on a slack channel. That also means configuring a Flask project that lets us run a server and implement a scheduler!
To view the full code for this article click here