Web scraping with Python

percivalvillal3

Percival Villalva

Posted on February 14, 2023

Explore some of the best Python libraries and frameworks available for web scraping and learn how to use them in your projects.

Getting started with web scraping in Python

Python is one of the most popular programming languages out there and is used across many different fields, such as AI, web development, automation, data science, and data extraction.

For years, Python has been the go-to language for data extraction, boasting a large community of developers as well as a wide range of web scraping tools to help scrapers extract almost any data they wish from the web.

This article will explore some of the best libraries and frameworks available for web scraping in Python and provide a quick sample of how to use them in different scraping scenarios.

Requirements

To fully understand the content and code samples showcased in this post, you should:

  • Have Python installed on your computer

  • Have a basic understanding of CSS selectors

  • Be comfortable navigating the browser DevTools to find and select page elements

HTTP Clients

In the context of web scraping, HTTP clients are used for sending requests to the target website and retrieving information such as the website's HTML code or JSON payload.

Requests

Requests is the most popular HTTP library for Python. It is supported by solid documentation and has been adopted by a huge community.

⚒️ Main Features

  • Keep-Alive & Connection Pooling

  • Browser-style SSL Verification

  • HTTP(S) Proxy Support

  • Connection Timeouts

  • Chunked Requests

⚙️ Installation

pip install requests

💡 Code Sample

Send a request to the target website, retrieve its HTML code, and print the result to the console.

import requests

response = requests.get('https://news.ycombinator.com/')

print(response.text)
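
The connection pooling and timeout features listed above matter as soon as a script sends more than one request. Here is a minimal sketch, assuming a helper named fetch_html (a hypothetical name, not part of Requests) and a 10-second timeout chosen purely for illustration:

```python
import requests


def fetch_html(url, session, timeout=10.0):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    # Requests applies no timeout by default, so pass one explicitly.
    response = session.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request):
# with requests.Session() as session:  # a Session reuses connections
#     html = fetch_html("https://news.ycombinator.com/", session)
#     print(len(html))
```

Passing the session in as an argument lets several calls share one connection pool, and makes the helper easy to exercise with a stand-in object.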

HTTPX

HTTPX is a fully featured HTTP client library for Python 3. It provides both sync and async APIs and includes an integrated command-line client.

⚒️ Main Features

  • A broadly requests-compatible API

  • An integrated command-line client

  • Standard synchronous interface, but with async support if you need it

  • Fully type annotated

⚙️ Installation

# Using pip
pip install httpx

# For Python 3 macOS users
pip3 install httpx

💡 Code Sample

Similar to the Requests example, we will send a request to the target website, retrieve the HTML of the page and print it to the console along with the request status code.

import httpx

response = httpx.get('https://news.ycombinator.com/')

status_code = response.status_code
html = response.text

print(status_code, html)

HTML and XML parsers

In web scraping, HTML and XML parsers are used to interpret the response we get back from our target website, often in the form of HTML code. A library such as Beautiful Soup will help us parse this response and extract data from it.

Beautiful Soup

Beautiful Soup (also known as BS4) is a Python library for pulling data out of HTML and XML files with just a few lines of code. BS4 is relatively easy to use and presents itself as a lightweight option for tackling simple scraping tasks with speed.

⚒️ Main features

  • Provides simple, Pythonic methods for navigating, searching, and modifying the parse tree.

  • Works on top of several parsers, such as Python's built-in html.parser, lxml, and html5lib.

  • Offers great flexibility, being able to parse nearly any HTML or XML document.
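
To make the parsing step concrete without touching the network, here is a minimal offline sketch. The SAMPLE_HTML string and the parse_articles helper are made up for illustration; the markup only imitates the class names (athing, rank, titleline) used on Hacker News.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup, not a real Hacker News response.
SAMPLE_HTML = """
<table>
  <tr class="athing">
    <td><span class="rank">1.</span></td>
    <td><span class="titleline"><a href="https://example.com/a">First story</a></span></td>
  </tr>
  <tr class="athing">
    <td><span class="rank">2.</span></td>
    <td><span class="titleline"><a href="item?id=1">Ask HN: Second story</a></span></td>
  </tr>
</table>
"""


def parse_articles(html):
    """Extract rank, title, and URL from every row with the `athing` class."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    # select() and select_one() accept CSS selectors.
    for row in soup.select("tr.athing"):
        link = row.select_one(".titleline a")
        articles.append({
            "rank": row.select_one(".rank").get_text(strip=True).rstrip("."),
            "title": link.get_text(strip=True),
            "URL": link["href"],
        })
    return articles


for article in parse_articles(SAMPLE_HTML):
    print(article)
```

Testing selectors against a saved or hand-written snippet like this is a cheap way to check them before sending real requests.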

⚙️ Installation

pip install beautifulsoup4

💡 Code Sample

Let's now see how we can use Beautiful Soup + HTTPX to extract the title, rank, and URL of every article on the first page of Hacker News.

from bs4 import BeautifulSoup
import httpx

response = httpx.get("https://news.ycombinator.com/news")
yc_web_page = response.content

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(yc_web_page, "html.parser")
articles = soup.find_all(class_="athing")

for article in articles:
    title_line = article.find(class_="titleline")
    data = {
        "URL": title_line.find("a").get("href"),
        "title": title_line.get_text(),
        "rank": article.find(class_="rank").get_text().replace(".", ""),
    }
    print(data)

A few seconds after running the script, we will see a dictionary containing each article's URL, ranking, and title printed on our console.

Output example:


{'URL': 'https://vpnoverview.com/news/wifi-routers-used-to-produce-3d-images-of-humans/', 'title': 'WiFi Routers Used to Produce 3D Images of Humans (vpnoverview.com)', 'rank': '1'}
{'URL': 'https://openjdk.org/jeps/8300786', 'title': 'JEP draft: No longer require super() and this() to appear first in a constructor (openjdk.org)', 'rank': '2'}
{'URL': 'item?id=34482433', 'title': 'Ask HN: Those making $500+/month on side projects in 2023 -- Show and tell', 'rank': '3'}
{'URL': 'https://www.solipsys.co.uk/new/ThePointOfTheBanachTarskiTheorem.html?wa22hn', 'title': 'The Point of the Banach-Tarski Theorem (solipsys.co.uk)', 'rank': '4'}
{'URL': 'https://initialcommit.com/blog/git-sim', 'title': 'Git-sim: Visually simulate Git operations in your own repos (initialcommit.com)', 'rank': '5'}
{'URL': 'https://www.cell.com/cell-reports-medicine/fulltext/S2666-3791(22)00474-8', 'title': 'Brief structured respiration enhances mood and reduces physiological arousal (cell.com)', 'rank': '6'}
{'URL': 'https://en.wikipedia.org/wiki/I,_Libertine', 'title': 'I, Libertine (wikipedia.org)', 'rank': '7'}
{'URL': 'item?id=34465956', 'title': 'Ask HN: Why did BASIC use line numbers instead of a full screen editor?', 'rank': '8'}
{'URL': 'https://arxiv.org/abs/2203.03456', 'title': 'Negative-weight single-source shortest paths in near-linear time (arxiv.org)', 'rank': '9'}
{'URL': 'https://onesignal.com/careers', 'title': 'OneSignal (YC S11) Is Hiring Engineers (onesignal.com)', 'rank': '10'}
{'URL': 'https://neelc.org/posts/chatgpt-gmail-spam/', 'title': "Bypassing Gmail's spam filters with ChatGPT (neelc.org)", 'rank': '11'}
{'URL': 'https://cyber.dabamos.de/88x31/', 'title': 'The 88x31 GIF Collection (dabamos.de)', 'rank': '12'}
{'URL': 'https://www.middleeasteye.net/opinion/david-graeber-vs-yuval-harari-forgotten-cities-myths-how-civilisation-began', 'title': 'The Dawn of Everything challenges a mainstream telling of prehistory (middleeasteye.net)', 'rank': '13'}
{'URL': 'https://blog.thinkst.com/2023/01/swipe-right-on-our-new-credit-card-tokens.html', 'title': 'Detect breaches with Canary credit cards (thinkst.com)', 'rank': '14'}
{'URL': 'https://www.atlasobscura.com/articles/heritage-appalachian-apples', 'title': "Appalachian Apple hunter who rescued 1k 'lost' varieties (2021) (atlasobscura.com)", 'rank': '15'}
{'URL': 'https://www.workingsoftware.dev/software-architecture-documentation-the-ultimate-guide/', 'title': 'The Guide to Software Architecture Documentation (workingsoftware.dev)', 'rank': '16'}
{'URL': 'https://arstechnica.com/tech-policy/2023/01/supreme-court-allows-reddit-mods-to-anonymously-defend-section-230/', 'title': 'Supreme Court allows Reddit mods to anonymously defend Section 230 (arstechnica.com)', 'rank': '17'}
{'URL': 'https://neurosciencenews.com/insula-empathy-pain-21818/', 'title': 'How do we experience the pain of other people? (neurosciencenews.com)', 'rank': '18'}
{'URL': 'https://lwn.net/SubscriberLink/920158/313ec4305df220bb/', 'title': 'Nolibc: A minimal C-library replacement shipped with the kernel (lwn.net)', 'rank': '19'}
{'URL': 'https://www.economist.com/1843/2017/05/04/the-body-in-the-buddha', 'title': 'The Body in the Buddha (2017) (economist.com)', 'rank': '20'}
{'URL': 'https://simonwillison.net/2023/Jan/13/semantic-search-answers/', 'title': 'How to implement Q&A against your docs with GPT3 embeddings and Datasette (simonwillison.net)', 'rank': '21'}
{'URL': 'https://destevez.net/2023/01/decoding-lunar-flashlight/', 'title': 'Decoding Lunar Flashlight (destevez.net)', 'rank': '22'}
{'URL': 'https://www.hampsteadheath.net/about', 'title': 'Hampstead Heath (hampsteadheath.net)', 'rank': '23'}
{'URL': 'https://www.otherlife.co/francisbacon/', 'title': 'The violent focus of Francis Bacon (otherlife.co)', 'rank': '24'}
{'URL': 'https://arstechnica.com/gaming/2019/10/explaining-how-fighting-games-use-delay-based-and-rollback-netcode/', 'title': 'How fighting games use delay-based and rollback netcode (2019) (arstechnica.com)', 'rank': '25'}
{'URL': 'https://essays.georgestrakhov.com/ai-is-not-a-horse/', 'title': 'AI Is Not a Horse (georgestrakhov.com)', 'rank': '26'}
{'URL': 'https://lawliberty.org/features/the-mystery-of-richard-posner/', 'title': 'The Mystery of Richard Posner (lawliberty.org)', 'rank': '27'}
{'URL': 'https://rodneybrooks.com/predictions-scorecard-2023-january-01/', 'title': 'Rodney Brooks Predictions Scorecard (rodneybrooks.com)', 'rank': '28'}
{'URL': 'https://www.notamonadtutorial.com/how-to-transform-code-into-arithmetic-circuits/', 'title': 'How to transform code into arithmetic circuits (notamonadtutorial.com)', 'rank': '29'}
{'URL': 'https://github.com/jhhoward/WolfensteinCGA', 'title': 'Wolfenstein 3D with a CGA Renderer (github.com/jhhoward)', 'rank': '30'}
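
Some entries in the output, such as 'item?id=34482433', are relative links to Hacker News itself. One possible way to normalize them is the standard library's urljoin; the absolutize helper name below is hypothetical:

```python
from urllib.parse import urljoin

# The page the links were scraped from.
BASE_URL = "https://news.ycombinator.com/news"


def absolutize(href, base=BASE_URL):
    """Resolve a scraped href against the page it came from."""
    # Absolute URLs pass through unchanged; relative ones are
    # resolved against `base`.
    return urljoin(base, href)


print(absolutize("item?id=34482433"))
print(absolutize("https://openjdk.org/jeps/8300786"))
```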

Browser automation tools

Browser automation libraries and frameworks have an off-label use for web scraping. Their ability to drive a real browser is essential for accessing data on websites that require JavaScript to load their content.

Selenium

Selenium is primarily a browser automation framework and ecosystem with an off-label use for web scraping. It uses the WebDriver protocol to control a browser, either headed or headless, and perform actions like clicking buttons, filling out forms, and scrolling.

Because of its ability to render JavaScript, Selenium can be used to scrape dynamically loaded content.

⚒️ Main features

  • Multi-Browser Support (Firefox, Chrome, Safari, Opera...)

  • Multi-Language Compatibility

  • Automate manual user interactions, such as UI testing, form submissions, and keyboard inputs.

  • Dynamic web elements handling

⚙️ Installation

# Install Selenium
pip install selenium

# We will also need to install webdriver-manager to run the code sample below
pip install webdriver-manager

💡 Code Sample

To demonstrate some of Selenium's capabilities, let's go to Amazon, scrape The Hitchhiker's Guide to the Galaxy product page, and save a screenshot of the accessed page.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Insert the website URL that we want to scrape
url = "https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C"

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)

# Create a dictionary with the scraped data
book = {
    "book_title": driver.find_element(By.ID, 'productTitle').text,
    "author": driver.find_element(By.CSS_SELECTOR, '.a-link-normal.contributorNameID').text,
    "edition": driver.find_element(By.ID, 'productSubtitle').text,
    "price": driver.find_element(By.CSS_SELECTOR, '.a-size-base.a-color-price').text,
}

# Save a screenshot from the accessed page and print the dictionary contents to the console
driver.save_screenshot('book.png')
print(book)

# Close the browser and end the WebDriver session
driver.quit()

After the script finishes its run, we will see a dictionary containing the book's title, author, edition, and price printed to the console, and a screenshot of the page saved as book.png.

Output example:

{
    "book_title": "The Hitchhiker's Guide to the Galaxy: The Illustrated Edition",
    "author": "Douglas Adams",
    "edition": "Kindle Edition",
    "price": "$7.99"
}

Playwright

Playwright is an open-source framework for web testing and automation developed and maintained by Microsoft.

Despite having many features in common with Selenium, Playwright is often considered a more modern and capable choice for automation, testing, and web scraping in Python.

⚒️ Main features

  • Auto-wait. Playwright, by default, waits for elements to be actionable before performing actions, eliminating the need for artificial timeouts.

  • Cross-browser support, being able to drive Chromium, WebKit, Firefox, and Microsoft Edge.

  • Cross-platform support. Available on Windows, Linux, and macOS, locally or on CI, headless, or headed.

⚙️ Installation

# Using pip
pip install playwright

# For Python 3 macOS users
pip3 install playwright

# Install the required browsers
playwright install

💡 Code Sample

To highlight Playwright's features as well as its similarities with Selenium, let's go back to Amazon's website and extract some data from The Hitchhiker's Guide to the Galaxy.

Playwright version:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.firefox.launch(
        headless=False
    )
    page = browser.new_page()
    page.goto("https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C")

    # Create a dictionary with the scraped data
    book = {
        "book_title": page.query_selector('#productTitle').inner_text().strip(),
        "author": page.query_selector('.author .a-link-normal.contributorNameID').inner_text().strip(),
        "edition": page.query_selector('#productSubtitle').inner_text().strip(),
        "price": page.query_selector('.a-size-base.a-color-price').inner_text().strip(),
    }

    print(book)
    page.screenshot(path="book.png")

    browser.close()

After the scraper finishes its run, the extracted data will be printed to the console, a screenshot of the page will be saved as book.png, and the Firefox browser controlled by Playwright will close.

Scrapy: a full-fledged Python web crawling framework

Scrapy

Scrapy is a fast, high-level web crawling and web scraping framework built on Twisted, a popular event-driven networking framework, which gives it asynchronous capabilities.

Unlike the tools mentioned earlier, Scrapy is a full-fledged web crawling framework designed specifically for data extraction, with built-in support for handling requests, processing responses, and exporting data.

Additionally, Scrapy provides handy out-of-the-box features, such as support for following links, handling multiple request types, and error handling, making it a powerful tool for web scraping projects of any size and complexity.

⚒️ Main features

  • Feed exports in multiple formats, such as JSON, CSV, and XML.

  • Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions

  • An interactive shell console for trying out the CSS and XPath expressions to scrape data and debug your spiders.

  • Built-in extensions and middlewares for handling cookies, HTTP authentication, caching, user-agent spoofing, and more.

⚙️ Installation

pip install scrapy

Project setup

To demonstrate some of Scrapy's features, we will once again extract data from articles displayed on Hacker News.

We will start by scraping the top 30 articles and then use Scrapy's CrawlSpider to follow the available page links and extract data from all the articles on the website.

To begin, let's create a new directory and install Scrapy to initialize the project and create a new spider:

# Create new directory and move into it
mkdir scrapy-project
cd scrapy-project

# Install Scrapy
pip install scrapy

# Initialize project and move into it
scrapy startproject scrapydemo
cd scrapydemo

# Generate spider
scrapy genspider demospider https://news.ycombinator.com/

After the spider is generated, let's set the encoding of the output file that will hold the scraped data by adding FEED_EXPORT_ENCODING = "utf-8" to our settings.py file.

💡 Code Sample

Finally, go to the demospider.py file and write some code:

import scrapy

class DemospiderSpider(scrapy.Spider):
    name = 'demospider'

    def start_requests(self):
        yield scrapy.Request(url='https://news.ycombinator.com/')

    def parse(self, response):
        for article in response.css('tr.athing'):
            yield {
                "URL": article.css(".titleline a::attr(href)").get(),
                "title": article.css(".titleline a::text").get(),
                "rank": article.css(".rank::text").get().replace(".", "")
            }

Then, let's use the following command to run the spider and store the scraped data in a results.json file:

scrapy crawl demospider -o results.json
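
Once the run finishes, the exported file can be read back with the standard library. Here is a minimal sketch, assuming the results.json layout produced by the spider above (a JSON array of objects with URL, title, and rank keys); the load_articles helper name is hypothetical:

```python
import json


def load_articles(path="results.json"):
    """Load the exported items and sort them by numeric rank."""
    with open(path, encoding="utf-8") as f:
        articles = json.load(f)
    # Ranks are exported as strings ("1", "2", ...), so convert before sorting.
    return sorted(articles, key=lambda article: int(article["rank"]))


# Usage, after the crawl has produced results.json:
# for article in load_articles():
#     print(article["rank"], article["title"])
```

Note that -o appends to an existing file, which can leave results.json as invalid JSON after a second run; deleting the file first, or using -O to overwrite it, avoids that.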

🕷️ Using Scrapy's CrawlSpider

Now that we know how to extract data from the articles on the first page of Hacker News, let's use Scrapy's CrawlSpider to follow the next page links and collect the data from all the articles on the website.

To do that, we will make some adjustments to our demospider.py file:

# Add imports CrawlSpider, Rule and LinkExtractor 👇
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Change the spider from "scrapy.Spider" to "CrawlSpider"
class DemospiderSpider(CrawlSpider):
    name = 'demospider'
    allowed_domains = ['news.ycombinator.com']
    start_urls = ['https://news.ycombinator.com/news?p=1']

    # Define a rule that should be followed by the link extractor.
    # In this case, Scrapy will follow all the links with the "morelink" class
    # And call the "parse_article" function on every crawled page
    rules = (
        Rule(LinkExtractor(restrict_css='.morelink'), callback='parse_article', follow=True),
    )

    # Rule callbacks are not applied to the start URLs themselves,
    # so route the first page through the same parser
    def parse_start_url(self, response, **kwargs):
        return self.parse_article(response)

    # When using the CrawlSpider we cannot use a parse function called "parse".
    # Otherwise, it will override the default function.
    # So, just rename it to something else, for example, "parse_article"
    def parse_article(self, response):
        for article in response.css('tr.athing'):
            yield {
                "URL": article.css(".titleline a::attr(href)").get(),
                "title": article.css(".titleline a::text").get(),
                "rank": article.css(".rank::text").get().replace(".", "")
            }

Finally, let's add a small delay between each of Scrapy's requests to avoid overloading the server. We can do that by adding DOWNLOAD_DELAY = 0.5 to our settings.py file.
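
Put together, the settings.py changes from this walkthrough would look roughly like the fragment below. The AUTOTHROTTLE_ENABLED line is an optional extra that is not part of the original setup; it is shown only as one possible way to be gentler on the server.

```python
# settings.py (fragment)

# Write exported feeds as UTF-8 instead of escaped ASCII.
FEED_EXPORT_ENCODING = "utf-8"

# Wait about half a second between requests to the same site.
DOWNLOAD_DELAY = 0.5

# Optional, not part of the original walkthrough: let Scrapy adapt
# the delay to how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
```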

Great! Now we are ready to run our scraper and get the data from all the articles displayed on Hacker News. Just run the command scrapy crawl demospider -o results.json and wait for the run to finish.

🎭 Using Playwright with Scrapy

Scrapy and Playwright make one of the most effective combinations for modern web scraping in Python.

Playwright renders the dynamically loaded content of a website and hands the resulting HTML back to Scrapy, which we can then use to extract data from it.

To integrate Playwright with Scrapy, we will use the scrapy-playwright library. Then, we will scrape https://www.mintmobile.com/product/google-pixel-7-pro-bundle/ to demonstrate how to extract data from a website using Playwright and Scrapy.

Mint Mobile requires JavaScript to load most of the content displayed on its product page, which makes it an ideal scenario for using Playwright in the context of web scraping.

[Screenshot: Mint Mobile product page with JavaScript disabled]

[Screenshot: Mint Mobile product page with JavaScript enabled]

⚙️ Project setup

Start by creating a directory to house our project and installing the necessary dependencies:

# Create new directory and move into it
mkdir scrapy-playwright
cd scrapy-playwright

Installation:

# Install Scrapy and scrapy-playwright
pip install scrapy scrapy-playwright

# Install the required browsers if you are running Playwright for the first time
playwright install

# Or install a subset of the available browsers you plan on using
playwright install firefox chromium

Next, start the Scrapy project, move into it, and generate a spider:

scrapy startproject pwsdemo
cd pwsdemo
scrapy genspider demospider https://www.mintmobile.com/

Now, let's activate scrapy-playwright by adding DOWNLOAD_HANDLERS and TWISTED_REACTOR to the scraper configuration in settings.py:

# scrapy-playwright configuration

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Great! We are now ready to write some code to scrape our target website.

💡 Code Sample

So, without further ado, let's use Playwright + Scrapy to extract data from Mint Mobile.

import scrapy
from scrapy_playwright.page import PageMethod

class DemospiderSpider(scrapy.Spider):
    name = 'demospider'

    def start_requests(self):
        yield scrapy.Request(
            'https://www.mintmobile.com/product/google-pixel-7-pro-bundle/',
            meta=dict(
                # Use Playwright
                playwright=True,
                # Use PageMethods to wait for the content we want to scrape to be properly loaded before extracting the data
                playwright_page_methods=[
                    PageMethod('wait_for_selector', 'div.m-productCard--device')
                ],
            ),
        )

    def parse(self, response):
        yield {
            "name": response.css("div.m-productCard__heading h1::text").get().strip(),
            "memory": response.css("div.composited_product_details_wrapper > div > div > div:nth-child(2) > div.label > span::text").get().replace(':', '').strip(),
            "pay_monthly_price": response.css("div.composite_price_monthly > span::text").get(),
            "pay_today_price": response.css("div.composite_price p.price span.amount::attr(aria-label)").get().split()[0],
        }

Finally, run the spider using the command scrapy crawl demospider -o results.json to scrape the target data and store it in a results.json file.

Expected output:

[
    {
        "name": "Google Pixel 7 Pro",
        "memory": "128GB",
        "pay_monthly_price": "50",
        "pay_today_price": "589"
    }
]

Learning resources 📚

If you want to dive deeper into some of the libraries and frameworks we presented during this post, here is a curated list of great videos and articles about the topic:

General web scraping

Beautiful Soup Tutorials

Browser automation tools

Scrapy

Discord

Finally, don't forget to join the Apify & Crawlee community on Discord to connect with other web scraping and automation enthusiasts. 🚀
