Web Scraping with Python

luciano_dev

Luciano Muñoz

Posted on June 14, 2022

Scraping is a technique for extracting data from a website when there’s no other source for that data, such as a REST API or an RSS feed.

Google and other search engines have bots that scrape the information you later see in your search results. Search engines are not the only ones that use scraping: other kinds of websites rely on it too, for example to compare prices, flights, and hotels.

In this post I want to show you how we can develop our own bot. You could use the same approach to track the price of a product you want or to get the latest headlines from your favorite news website. The limit is your imagination.

I recommend the book Web Scraping with Python if you want to go a bit deeper into web scraping with Python 🐍.

Let's go!


How does scraping work?

There are different ways, but the most common technique is to fetch the HTML code of the target website and tell our bot which tags or attributes to search for, because that is where the information we want is stored.

Imagine a website with this HTML structure:

<section class="layout-articles">
    <!-- News 1 -->
    <article>
        <h1 class="title">
            <a href="/news-1" title="News 1">
                News 1
            </a>
        </h1>
        <img src="news-1.jpg">
    </article>
    <!-- News 2 -->
    <article>
        <h1 class="title">
            <a href="/news-2" title="News 2">
                News 2
            </a>
        </h1>
        <img src="news-2.jpg">
    </article>
</section>

If we want to get all the news titles on this page, we could search for the section element with the attribute class="layout-articles", and from it get all the a tags, which contain the title and URL of each news item.

This is just a simple example for you to have a better understanding of scraping.
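
To make the idea concrete, here is a minimal sketch that runs BeautifulSoup over a condensed copy of the sample HTML above. The function name extract_news and the SAMPLE_HTML constant are only illustrative names for this example, not part of the bot we build later.

```python
from bs4 import BeautifulSoup

# Condensed copy of the sample markup shown above (not a real website)
SAMPLE_HTML = """
<section class="layout-articles">
    <article>
        <h1 class="title"><a href="/news-1" title="News 1">News 1</a></h1>
        <img src="news-1.jpg">
    </article>
    <article>
        <h1 class="title"><a href="/news-2" title="News 2">News 2</a></h1>
        <img src="news-2.jpg">
    </article>
</section>
"""


def extract_news(html):
    """Return a list of (title, href) tuples found inside section.layout-articles."""
    soup = BeautifulSoup(html, 'html.parser')
    section = soup.find('section', {'class': 'layout-articles'})
    if section is None:
        # The page does not have the layout we expect
        return []
    return [(a.get_text(strip=True), a['href']) for a in section.find_all('a')]


print(extract_news(SAMPLE_HTML))
# [('News 1', '/news-1'), ('News 2', '/news-2')]
```

Returning an empty list when the section is missing keeps the caller simple: it can loop over the result without checking for None first.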


What are we going to build?

There’s a great site called Simple Desktops, with a cool collection of fancy wallpapers, and our bot will take care of browsing the pages of this site and downloading each wallpaper 👏👏👏.

First, let’s analyze the HTML structure of the website, so that we understand the steps our bot must follow for its task:

  • Website pagination works as follows: /browse/, /browse/1/, /browse/2/
  • On each page, each wallpaper is a div class="desktop" containing an img tag whose src attribute has the URL to download the wallpaper.
  • The site uses a thumbnail generator that is driven by the URL of each wallpaper image. If we delete the part of the URL that refers to the resize, we get access to the original image: Apple_Park.png.295x184_q100.png becomes Apple_Park.png 😎.
  • The URL to the next page is stored in the <a class="more"> tag.
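
The resize removal from the list above can be sketched in a few lines. Note that this is a different approach from the one the bot uses later: here the pattern describes the suffix itself, and it assumes the suffix always looks like .WIDTHxHEIGHT_qQUALITY followed by .png or .jpg, which is a guess based on the single example above. The name strip_resize_suffix is hypothetical.

```python
import re

# Assumed suffix shape: ".<width>x<height>_q<quality>.<extension>" at the end of the URL
RESIZE_SUFFIX = re.compile(r'\.\d+x\d+_q\d+\.(?:png|jpg)$')


def strip_resize_suffix(url):
    """Remove the thumbnail suffix, leaving the URL of the original image."""
    return RESIZE_SUFFIX.sub('', url)


print(strip_resize_suffix('Apple_Park.png.295x184_q100.png'))
# Apple_Park.png
```

A URL without the suffix is returned unchanged, so the function is safe to call on images that are already full size.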

With this information, we can say that our algorithm must follow these steps:

  1. Make a request to the /browse/ URL
  2. Get the wallpaper URLs from the src attribute of the img tag contained in each div class="desktop" tag
  3. Remove the resize part from each wallpaper URL
  4. Download the wallpapers
  5. Get the URL of the next page of the site, request it, and repeat from step 2

Great, now that we know what to do… let's code! 🎈


How to create a bot in Python?

These are the packages we will use:

  • os: for handling file paths and folders
  • re: for regular expressions
  • shutil: for file operations
  • requests: for HTTP requests
  • BeautifulSoup: for parsing the HTML code, the heart of our bot ❤️

BeautifulSoup and requests are not part of Python’s standard library, so we’re going to install them with pip:

$ pip install beautifulsoup4
$ pip install requests

We’re going to split our code into functions to make it easy to read and debug.

Create a directory and inside create a file called simpledesktop-bot.py. First, we start by importing the packages:

import os
import re
import shutil
import requests
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup

At the entry point of our app we configure the initial data so that it can start running:

if __name__ == '__main__':
    # Run, run, run
    url = 'http://simpledesktops.com'
    first_path = '/browse/'
    download_directory = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'wallpapers')

    # Create the download directory if it does not exist
    os.makedirs(download_directory, exist_ok=True)

    # Start crawling
    processPage(url, first_path, download_directory)

First we set the initial data: the website URL, the path of the first page where our bot will start running, and a directory to store the downloaded wallpapers. If that directory doesn’t exist, we create it with os.makedirs().

Finally, we call the function processPage() to start the scraping process.

def processPage(url, path, download_directory):
    """
    Recursive function that delivers pages to request and wallpaper data to the other functions
    """
    print('\nPATH:', path)
    print('=========================')

    wallpapers = getPageContent(url + path)
    if wallpapers['images']:
        downloadWallpaper(wallpapers['images'], download_directory)
    else:
        print('This page does not contain any wallpaper')

    if wallpapers['next_page']:
        processPage(url, wallpapers['next_page'], download_directory)
    else:
        print('THIS IS THE END, BUDDY')

processPage() is a recursive function that acts as a wrapper to manage the calls to the other functions.

The first called function is getPageContent(), which makes the HTTP request, analyzes the HTML structure, and returns a dictionary with the following data:

  • images: it’s a list containing each wallpaper’s URL
  • next_page: the URL path to the next page to process

If wallpapers['images'] is not empty, we call downloadWallpaper(), which receives the list of image URLs and the download directory, and is in charge of processing each download.

Lastly, if wallpapers['next_page'] exists, we call processPage() recursively with the path of the next page. Otherwise, the program ends.
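
Recursion works fine for a site with a modest number of pages, but Python limits how deep recursive calls can go, so a loop is a common alternative. The following is only a sketch of that variation, not the code from the repository: the crawl() function, its callback parameters, and the FAKE_SITE data are hypothetical names that let the example run without any HTTP request.

```python
def crawl(url, first_path, get_page_content, download):
    """Visit pages one after another until a page reports no next page."""
    path = first_path
    visited = []
    while path:
        page = get_page_content(url + path)
        if page['images']:
            download(page['images'])
        visited.append(path)
        # None (or an empty string) ends the loop
        path = page['next_page']
    return visited


# Stand-in for the real site: maps a full URL to what getPageContent() would return
FAKE_SITE = {
    'http://example.test/browse/': {'images': ['a.png'], 'next_page': '/browse/2/'},
    'http://example.test/browse/2/': {'images': [], 'next_page': None},
}

downloaded = []
visited = crawl('http://example.test', '/browse/', FAKE_SITE.__getitem__, downloaded.extend)
print(visited)
# ['/browse/', '/browse/2/']
print(downloaded)
# ['a.png']
```

Passing the page-reading and download functions as arguments also makes the crawling logic easy to test in isolation.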

Now let’s see the code of each function that processPage() calls.

def getPageContent(url):
    """
    Get wallpaper and next page data from requested page
    """
    images = []
    next_page = None

    html = requestPage(url)
    if html is not None:
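        # Collect the wallpaper URL from the img tag inside each div class="desktop"
        wallpapers = html.find_all('div', {'class': 'desktop'})
        for wp in wallpapers:
            img = wp.find('img')
            # Skip blocks without an image instead of raising an error
            if img is not None and img.has_attr('src'):
                images.append(img['src'])

        # The "more" link holds the path of the next page; the last page has none
        more_button = html.find('a', {'class': 'more'})
        if more_button is not None:
            next_page = more_button.get('href')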

    return {'images': images, 'next_page': next_page}

getPageContent() is the heart of our program: it requests the page it receives as a parameter and returns the list of wallpaper URLs together with the URL path of the next page.

First we initialize the images and next_page variables, which will hold the return data.

Then we call requestPage(), which makes the HTTP request and returns the HTML content already parsed and ready to be manipulated. Here is where we see the black magic behind BeautifulSoup! Using the find_all method we get a list of the div class="desktop" tags. Then we loop over the list and, using the find method, look for the img tag inside each one and extract the wallpaper URL from its src attribute. Each URL is stored in the images list.

Next, we search for the a class="more" tag, extract its href attribute, and store it in the next_page variable. If the page has no such tag, next_page stays None.

Lastly, we return a dictionary containing images and next_page.
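
Since next_page is a path rather than a full URL, it has to be combined with the site URL before the next request. The bot does this with plain string concatenation. As a hedged alternative, the standard library's urljoin() handles relative paths and already absolute URLs as well. The resolve() helper below is a hypothetical name used only for this sketch.

```python
from urllib.parse import urljoin


def resolve(base_url, href):
    """Combine the URL of the current page with an href found in its HTML."""
    return urljoin(base_url, href)


base = 'http://simpledesktops.com'

# Path starting with a slash: resolved against the site root
print(resolve(base, '/browse/2/'))
# http://simpledesktops.com/browse/2/

# Relative path: resolved against the current page
print(resolve(base + '/browse/', '3/'))
# http://simpledesktops.com/browse/3/

# Absolute URL: returned as it is
print(resolve(base, 'http://static.simpledesktops.com/uploads/a.png'))
# http://static.simpledesktops.com/uploads/a.png
```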

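def requestPage(url):
    """
    Request a page and return its HTML parsed by BeautifulSoup, or None on failure
    """
    try:
        # A timeout keeps the bot from hanging on an unresponsive server
        raw_html = requests.get(url, timeout=10)
        # requests only raises HTTPError for 4xx/5xx responses when we ask it to
        raw_html.raise_for_status()
    except requests.exceptions.RequestException as e:
        # RequestException covers HTTPError, connection errors, and timeouts
        print(e)
        return None

    try:
        return BeautifulSoup(raw_html.text, features='html.parser')
    except Exception:
        print('Error parsing HTML code')
        return None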

Now let’s see what requestPage() does. It requests the URL received as a parameter and stores the response in the raw_html variable. Then it parses the plain HTML with BeautifulSoup and returns the parsed content.

The try/except blocks are there so that a failed request or a parsing error makes the function return None instead of stopping the bot.
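
One detail worth knowing when handling errors with requests: requests.get() does not raise HTTPError for a 404 or 500 response by itself. The exception only appears when we call raise_for_status() on the response. The sketch below shows that behavior without touching the network by building a bare Response object by hand, which is done here purely for illustration. The status_error() helper is a hypothetical name.

```python
import requests
from requests.exceptions import HTTPError


def status_error(status_code):
    """Return the HTTPError message for a status code, or None if it is not an error."""
    # Hand-built response, used only to avoid a real HTTP request in this example
    response = requests.Response()
    response.status_code = status_code
    try:
        response.raise_for_status()
    except HTTPError as error:
        return str(error)
    return None


print(status_error(200))
print(status_error(404))
```

The first call prints None because a 200 response is not an error, while the second prints the message of the HTTPError raised for the 404.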

def downloadWallpaper(wallpapers, directory):
    """
    Process wallpaper downloads
    """
    for url in wallpapers:
        # Keep the URL up to the first .png or .jpg, which drops the resize suffix
        match_url = re.match(r'^.+?\.(?:png|jpg)', url)
        if match_url:
            formated_url = match_url.group(0)
            filename = formated_url[formated_url.rfind('/')+1:]
            file_path = os.path.join(directory, filename)
            print(file_path)

            # Skip wallpapers that were already downloaded in a previous run
            if not os.path.exists(file_path):
                with requests.get(formated_url, stream=True) as wp_file:
                    with open(file_path, 'wb') as output_file:
                        shutil.copyfileobj(wp_file.raw, output_file)
        else:
            print('Wallpaper URL is invalid')

downloadWallpaper() receives the list of wallpaper URLs and processes each download. The first thing this function does is delete from the URL the piece of text that triggers the resize.

http://static.simpledesktops.com/uploads/desktops/2020/03/30/piano.jpg.300x189_q100.png

Deleting .300x189_q100.png from the end of the URL allows us to download the image in its original size. To accomplish this we use the regular expression ^.+?\.(?:png|jpg), which matches the URL from the start through the first occurrence of .png or .jpg. The dot sits outside the group so that it applies to both extensions, and the pattern is written as a raw string so that Python does not treat \. as a string escape. If there is no match, the URL is considered invalid.

Then we extract the file name using rfind('/'), which returns the position of the last slash in the string. The file name starts right after it. With this value and the directory, we build file_path, the destination on our computer where the wallpaper will be saved.
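
Slicing after the last slash works for the URLs of this site. As a hedged alternative for URLs that may carry a query string, the standard library can split the URL first and take the last component of its path. The filename_from_url() helper and the example.test URL are hypothetical and only serve this sketch.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last component of the URL path, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


url = 'http://static.simpledesktops.com/uploads/desktops/2020/03/30/piano.jpg'

# Same result as the rfind('/') slicing used by the bot
print(filename_from_url(url))
# piano.jpg
print(url[url.rfind('/') + 1:])
# piano.jpg

# With a query string the two approaches differ
print(filename_from_url('http://example.test/images/photo.png?size=2'))
# photo.png
```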

In the next block of code, we first check whether the wallpaper already exists on disk, to avoid downloading it again. If the file does not exist, we execute the following steps:

  • We request the file using requests.get() with stream=True, so the body is not loaded into memory all at once. The response is stored in the variable wp_file, and wp_file.raw gives access to the underlying byte stream.
  • Then we open() the local file in binary write mode and reference it as output_file.
  • The last step is to copy the content of wp_file.raw (the downloaded image) into output_file (the file on disk) using shutil.copyfileobj().

At this point the wallpaper has been downloaded and saved to disk.

There’s no need to close the response or the file explicitly because we’re working inside with statements, which take care of that automatically.
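
The copy step can be tried on its own, without downloading anything. In this minimal sketch an in-memory io.BytesIO object stands in for the raw stream of the response, and the file goes to a temporary directory. The save_stream() helper, the payload, and the file name are made up for the example.

```python
import io
import os
import shutil
import tempfile


def save_stream(source, file_path):
    """Copy a binary file-like object to file_path and return the number of bytes written."""
    with open(file_path, 'wb') as output_file:
        shutil.copyfileobj(source, output_file)
    # The with block has closed the file, so its size on disk is final
    return os.path.getsize(file_path)


payload = b'fake image bytes'  # stands in for the downloaded image data

with tempfile.TemporaryDirectory() as directory:
    file_path = os.path.join(directory, 'example.png')
    size = save_stream(io.BytesIO(payload), file_path)
    with open(file_path, 'rb') as saved:
        matches = saved.read() == payload

print(size, matches)
# 16 True
```

shutil.copyfileobj() reads and writes in chunks, which is why it pairs well with a streamed response: the whole image never has to sit in memory.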

And that’s all, we can now execute the program. To run it just open the console and type python3 simpledesktop-bot.py:

$ python3 simpledesktop-bot.py

PATH: /browse/
=========================
/Users/MyUser/simple-desktop-scraper/wallpapers/sphericalharmonics1.png
/Users/MyUser/simple-desktop-scraper/wallpapers/Dinosaur_eye_2.png
/Users/MyUser/simple-desktop-scraper/wallpapers/trippin.png
...

PATH: /browse/2/
=========================
/Users/MyUser/simple-desktop-scraper/wallpapers/Apple_Park.png
/Users/MyUser/simple-desktop-scraper/wallpapers/triangles.png
/Users/MyUser/simple-desktop-scraper/wallpapers/thanksgiving_twelvewalls.png
...

PATH: /browse/3/
=========================
/Users/MyUser/simple-desktop-scraper/wallpapers/minimalistic_rubik_cube_2880.png
/Users/MyUser/simple-desktop-scraper/wallpapers/Nesting_Dolls.png
/Users/MyUser/simple-desktop-scraper/wallpapers/flat_bamboo_wallpaper.png
...

You can find the code in the GitHub repository, and if you like it give it a star 😉.

SimpleDesktop-Bot



It means a lot to me that you have read this post. I hope you learned something new, as I did while writing and coding it. If so, leave a comment or send me a tweet, because I would love to hear about it.

See you soon! 😉


Thanks for reading. If you are interested in knowing more about me, you can give me a follow or contact me on Instagram, Twitter, or LinkedIn 💗.
