Web Scraping with Python
Luciano Muñoz
Posted on June 14, 2022
Scraping is a technique for extracting data from a website when there’s no other source to obtain that data, like a REST API or RSS feed.
Google and other search engines have bots that scrape all the information you later see in your search results. And search engines aren’t the only ones that use scraping: other kinds of websites rely on it too, for example price, flight, and hotel comparison sites.
In this post I want to show you how we can develop our own bot. You could use the same approach to track the price of that product you want, or to get the latest news from your favorite news site; the limit is your imagination.
I recommend the book Web Scraping with Python if you want to go a bit deeper into the topic 🐍.
Let's go!
How does scraping work?
There are different ways to do it, but the most common technique is to obtain the HTML code of the target website and tell our bot which tags or attributes it has to search for, and where the information we want is stored.
Imagine a website with this HTML structure:
<section class="layout-articles">
  <!-- News 1 -->
  <article>
    <h1 class="title">
      <a href="/news-1" title="News 1">
        News 1
      </a>
    </h1>
    <img src="news-1.jpg">
  </article>
  <!-- News 2 -->
  <article>
    <h1 class="title">
      <a href="/news-2" title="News 2">
        News 2
      </a>
    </h1>
    <img src="news-2.jpg">
  </article>
</section>
If we want to get all the news titles from this page, we could search for the `section` element with the attribute `class="layout-articles"`, and from it get all the `a` tags, which contain the title and URL of each news item.
This is just a simple example to give you a better understanding of how scraping works.
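For example, with BeautifulSoup (the library we will install later in this post) those titles could be extracted in a few lines. This is just a sketch of the idea, using the HTML above as input:

from bs4 import BeautifulSoup

# The HTML structure from the example above
html_code = """
<section class="layout-articles">
  <article>
    <h1 class="title"><a href="/news-1" title="News 1">News 1</a></h1>
    <img src="news-1.jpg">
  </article>
  <article>
    <h1 class="title"><a href="/news-2" title="News 2">News 2</a></h1>
    <img src="news-2.jpg">
  </article>
</section>
"""

soup = BeautifulSoup(html_code, 'html.parser')
section = soup.find('section', {'class': 'layout-articles'})
for link in section.find_all('a'):
    print(link.text, link.attrs['href'])
# News 1 /news-1
# News 2 /news-2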
What are we going to build?
There’s a great site called Simple Desktops, with a cool collection of fancy wallpapers, and our bot will take care of browsing the pages of this site and downloading each wallpaper 👏👏👏.
First, let’s analyze the HTML structure of the website, which will help us understand the steps our bot must follow:
- The website pagination works as follows: `/browse/`, `/browse/1/`, `/browse/2/`, and so on.
- On each page, each wallpaper is a `div class="desktop"` containing an `img` tag whose `src` attribute has the URL to download the wallpaper.
- The site uses a thumbnail generator embedded in the URL of each wallpaper image, but if we delete the text that refers to the resize we get access to the original image: `Apple_Park.png`~~`.295x184_q100.png`~~ 😎.
- The URL of the next page is stored in the `<a class="more">` tag.
With the information collected, we can say that our algorithm must follow these steps:
1. Make a request to the `/browse/` URL
2. Get the wallpaper URLs from the `src` attribute of the `img` tag contained in each `div class="desktop"` tag
3. Remove the resize suffix from each wallpaper URL
4. Download the wallpapers
5. Get the URL of the next page of the site and repeat from step 2
Great, now that we know what to do… let’s code! 🎈
How to create a bot in Python?
These are the packages we will use:
- os: for handling file paths and folders
- re: for regular expressions
- shutil: for file operations
- requests: for HTTP requests
- BeautifulSoup: for parsing the HTML code, the heart of our bot ❤️
BeautifulSoup and requests are not built into Python, so we’re going to install them with `pip`:
$ pip install beautifulsoup4
$ pip install requests
We’re going to split our code into functions to make it easy to read and debug.
Create a directory and, inside it, a file called `simpledesktop-bot.py`. First, we import the packages:
import os
import re
import shutil
import requests
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup
At the entry point of our app we configure the initial data so that it can start running:
if __name__ == '__main__':
    # Run, run, run
    url = 'http://simpledesktops.com'
    first_path = '/browse/'
    download_directory = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'wallpapers')

    # Create the download directory if it does not exist
    if not os.path.exists(download_directory):
        os.makedirs(download_directory)

    # Start crawling
    processPage(url, first_path, download_directory)
First we set the initial data: the website URL, the path of the first page where our bot will start running, and a directory to store the downloaded wallpapers. If that directory doesn’t exist, we create it with the `os.makedirs` method.
Finally, we call the `processPage()` function to start the scraping process.
def processPage(url, path, download_directory):
    """
    Recursive function that delivers pages to request and wallpaper data to the other functions
    """
    print('\nPATH:', path)
    print('=========================')

    wallpapers = getPageContent(url + path)

    if wallpapers['images']:
        downloadWallpaper(wallpapers['images'], download_directory)
    else:
        print('This page does not contain any wallpaper')

    if wallpapers['next_page']:
        processPage(url, wallpapers['next_page'], download_directory)
    else:
        print('THIS IS THE END, BUDDY')
`processPage()` is a recursive function that acts as a wrapper to manage the calls to the other functions.
The first function called is `getPageContent()`, which makes the HTTP request, analyzes the HTML structure, and returns a dictionary with the following data:
- images: a list containing each wallpaper’s URL
- next_page: the URL path of the next page to process
If `wallpapers['images']` is not empty, we call `downloadWallpaper()`, which receives the list of image URLs and the download directory, and is in charge of processing each download.
Lastly, if `wallpapers['next_page']` exists, we recursively call `processPage()` with the path of the next page; otherwise, the program ends.
Now let’s see the code of each function that `processPage()` calls.
def getPageContent(url):
    """
    Get wallpaper and next page data from the requested page
    """
    images = []
    next_page = None

    html = requestPage(url)

    if html is not None:
        # Search for the wallpaper URLs
        wallpapers = html.find_all('div', {'class': 'desktop'})
        for wp in wallpapers:
            img = wp.find('img')
            images.append(img.attrs['src'])

        # Search for the next page URL
        more_button = html.find('a', {'class': 'more'})
        if more_button is not None:
            next_page = more_button.attrs['href']

    return {'images': images, 'next_page': next_page}
`getPageContent()` is the heart of our program: its goal is to make a request to the page received as a parameter and return the list of wallpaper URLs and the URL path of the next page.
First we initialize the `images` and `next_page` variables, which are going to store the return data.
Then we call `requestPage()`, which makes the HTTP request and returns the HTML content already parsed and ready to be manipulated. Here is where we see the black magic behind BeautifulSoup! Using the `find_all` method we get the list of `div class="desktop"` tags. Then we loop over that list and, using the `find` method, search for the `img` tag and extract the wallpaper URL from its `src` attribute. Each URL is stored in the `images` list.
Next, we search for the `a class="more"` tag and, if it exists, extract its `href` attribute and store it in the `next_page` variable.
Lastly, we return a dictionary containing `images` and `next_page`.
def requestPage(url):
    """
    Request a page and parse the HTML response
    """
    try:
        raw_html = requests.get(url)
        raw_html.raise_for_status()  # Turn 4xx/5xx responses into an HTTPError
        try:
            html = BeautifulSoup(raw_html.text, features='html.parser')
            return html
        except Exception:
            print('Error parsing HTML code')
            return None
    except HTTPError as e:
        print(e)
        return None
Now let’s see what `requestPage()` does. It requests the URL received as a parameter and stores the response in the `raw_html` variable; calling `raise_for_status()` makes any 4xx/5xx response raise an `HTTPError`. Then it parses the plain HTML with BeautifulSoup and returns the parsed content.
With `try/except` we intercept any error that may be raised.
def downloadWallpaper(wallpapers, directory):
    """
    Process wallpaper downloads
    """
    for url in wallpapers:
        match_url = re.match(r'^.+?\.(png|jpg)', url)
        if match_url:
            formatted_url = match_url.group(0)
            filename = formatted_url[formatted_url.rfind('/') + 1:]
            file_path = os.path.join(directory, filename)
            print(file_path)
            if not os.path.exists(file_path):
                with requests.get(formatted_url, stream=True) as wp_file:
                    with open(file_path, 'wb') as output_file:
                        shutil.copyfileobj(wp_file.raw, output_file)
        else:
            print('Wallpaper URL is invalid')
`downloadWallpaper()` receives the list of wallpaper URLs and processes each download. The first task this function does is delete from the URL the piece of text that works as a resize:
http://static.simpledesktops.com/uploads/desktops/2020/03/30/piano.jpg.300x189_q100.png
Deleting `.300x189_q100.png` from the end of the URL allows us to download the image at its original size. To accomplish this we use the regular expression `^.+?\.(png|jpg)`, which matches the URL from the start until the first occurrence of `.png` or `.jpg`. If there is no match, the URL is not valid.
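To see it in action, here is a quick standalone sketch applying that regex to the URL above:

import re

url = 'http://static.simpledesktops.com/uploads/desktops/2020/03/30/piano.jpg.300x189_q100.png'
match = re.match(r'^.+?\.(png|jpg)', url)
print(match.group(0))
# http://static.simpledesktops.com/uploads/desktops/2020/03/30/piano.jpg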
Then we extract the file name using `rfind('/')`, which finds the last slash in the string (the first one starting from the right); the filename starts right after it. With this value and the directory, we store in the `file_path` variable the destination on our computer where the wallpaper will be saved.
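Continuing the sketch above, this is how the filename and destination path come out (the `wallpapers` directory here is just illustrative):

import os

formatted_url = 'http://static.simpledesktops.com/uploads/desktops/2020/03/30/piano.jpg'
filename = formatted_url[formatted_url.rfind('/') + 1:]
print(filename)  # piano.jpg

file_path = os.path.join('wallpapers', filename)
print(file_path)  # wallpapers/piano.jpg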
In the next block of code, we first check whether the wallpaper already exists, to avoid downloading it again. If the file does not exist we execute the following steps:
- We download the file using `requests.get()` with `stream=True` and keep a reference to the response in the variable `wp_file`; the raw bytes of the image are available through `wp_file.raw`.
- Then we `open()` the local file in binary write mode and reference that file as `output_file`.
- The last step is to copy the content of `wp_file` (the downloaded image) into `output_file` (the file on disk) using `shutil.copyfileobj()`.
With that, we have downloaded the wallpaper and saved it to our disk.
There’s no need to close the opened files ourselves because we’re working inside `with` statements, which manage that automatically.
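Just to illustrate, this is roughly the manual version that the inner `with` blocks save us from writing (a sketch reusing the piano.jpg URL from before):

import shutil
import requests

response = requests.get(
    'http://static.simpledesktops.com/uploads/desktops/2020/03/30/piano.jpg',
    stream=True,
)
output_file = open('piano.jpg', 'wb')
try:
    shutil.copyfileobj(response.raw, output_file)
finally:
    output_file.close()  # Runs even if the copy fails
    response.close()     # Releases the connection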
And that’s all, we can now execute the program. To run it, just open the console and type `python3 simpledesktop-bot.py`:
$ python3 simpledesktop-bot.py
PATH: /browse/
=========================
/Users/MyUser/simple-desktop-scraper/wallpapers/sphericalharmonics1.png
/Users/MyUser/simple-desktop-scraper/wallpapers/Dinosaur_eye_2.png
/Users/MyUser/simple-desktop-scraper/wallpapers/trippin.png
...
PATH: /browse/2/
=========================
/Users/MyUser/simple-desktop-scraper/wallpapers/Apple_Park.png
/Users/MyUser/simple-desktop-scraper/wallpapers/triangles.png
/Users/MyUser/simple-desktop-scraper/wallpapers/thanksgiving_twelvewalls.png
...
PATH: /browse/3/
=========================
/Users/MyUser/simple-desktop-scraper/wallpapers/minimalistic_rubik_cube_2880.png
/Users/MyUser/simple-desktop-scraper/wallpapers/Nesting_Dolls.png
/Users/MyUser/simple-desktop-scraper/wallpapers/flat_bamboo_wallpaper.png
...
You can find the code in the SimpleDesktop-Bot repository on GitHub, and if you like it, give it a star 😉.
Thanks for reading; it means a lot to me that you took the time to read this post. I hope you learned something new, as I did while writing and coding it. If so, leave a comment or send me a tweet, I would love to hear about it.
See you soon! 😉
Thanks for reading. If you are interested in knowing more about me, you can give me a follow or contact me on Instagram, Twitter, or LinkedIn 💗.