BeautifulSoup is so 2000-and-late: Web Scraping in 2020
Max Humber
Posted on October 9, 2020
BeautifulSoup (bs4) was created over a decade-and-a-half ago, and it's been the standard for web scraping ever since. But it's time for something new, because bs4 is so 2000-and-late.
In this post we'll explore 10 reasons why gazpacho is the future of web scraping, by scraping parts of this post!
1. No Dependencies
gazpacho is installed at the command line:
pip install gazpacho
With no extra dependencies:
pip freeze
# gazpacho==1.1
In contrast, bs4 is packaged with soupsieve and lxml. I won't tell you how to write software, but minimizing dependencies is usually a good idea...
2. Batteries Included
The html for this blog post can be fetched and made parse-able with Soup.get:
from gazpacho import Soup
url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
soup = Soup.get(url)
Unfortunately, you'll need requests on top of bs4 to do the same thing:
import requests
from bs4 import BeautifulSoup
url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
html = requests.get(url).text
bsoup = BeautifulSoup(html, "html.parser")  # an explicit parser avoids a warning
3. Simple find-ing
bs4 is a monster. There are 184 methods and attributes attached to every BeautifulSoup object, making it hard to know what to use and when to use it:
len(dir(BeautifulSoup()))
# 184
In contrast, Soup objects in gazpacho are simple; there are just seven methods and attributes to keep track of:
[method for method in dir(Soup()) if not method.startswith("_")]
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']
Looking at that list, it's clear that to find the title of this post (nested inside of an h1 tag), for example, we'll need to use .find:
soup.find('h1')
4. Prototyping to Production
gazpacho is awesome for prototyping and even better for production. By default, .find will return one Soup object if it finds just one element, or a list of Soup objects if it finds more than one.
To guarantee and enforce return types in production, the mode= argument in .find can be set manually:
title = (soup
.find("header", {'id': 'main-title'}, mode="first")
.find("h1", mode="all")[0]
.text
)
In contrast, bs4 has 27 find methods and they all return something different:
[method for method in dir(BeautifulSoup()) if 'find' in method]
5. PEP 561 Compliant
As of version 1.1, gazpacho is PEP 561 compliant, meaning that the entire library is typed and will work with your typed (or standard duck/un-typed!) code-base:
help(soup.find)
# Signature:
# soup.find(
# tag: str,
# attrs: Union[Dict[str, Any], NoneType] = None,
# *,
# partial: bool = True,
# mode: str = 'automatic',
# strict: Union[bool, NoneType] = None,
# ) -> Union[List[ForwardRef('Soup')], ForwardRef('Soup'), NoneType]
6. Automatic Formatting
The html on dev.to and this post is well formatted. But if it weren't:
header = soup.find("div", {'class': 'crayons-article__header__meta'})
html = str(header.find("div", {'class': 'mb-4 spec__tags'}))
bad_html = html.replace("\n", "") # remove new line characters
print(bad_html)
# <div class="mb-4 spec__tags"> <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B"> <span class="crayons-tag__prefix">#</span> python </a> <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:"> <span class="crayons-tag__prefix">#</span> webscraping </a> <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:"> <span class="crayons-tag__prefix">#</span> gazpacho </a> <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368"> <span class="crayons-tag__prefix">#</span> hacktoberfest </a></div>
gazpacho would be able to automatically format and indent the bad/malformed html:
tags = Soup(bad_html)
Making things easier to read:
print(tags)
# <div class="mb-4 spec__tags">
# <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B">
# <span class="crayons-tag__prefix">#</span>
# python
# </a>
# <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:">
# <span class="crayons-tag__prefix">#</span>
# webscraping
# </a>
# <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:">
# <span class="crayons-tag__prefix">#</span>
# gazpacho
# </a>
# <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368">
# <span class="crayons-tag__prefix">#</span>
# hacktoberfest
# </a>
# </div>
7. Speed
gazpacho is fast. It takes just 258 µs to scrape the tag links for this post:
%%timeit
tags = Soup(bad_html)
tags = tags.find("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 258 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
While bs4 takes nearly twice as long to do the same thing:
%%timeit
tags = BeautifulSoup(bad_html)
tags = tags.find_all("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 465 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
8. Partial Matching
gazpacho can partially match html element attributes. For instance, the sidebar for this page is displayed with the following html:
<aside class="crayons-layout__sidebar-right" aria-label="Right sidebar navigation">
And can be matched exactly with:
soup.find("aside", {"class": "crayons-layout__sidebar-right"}, partial=False)
Or partially (the default behaviour) with:
sidebar = soup.find("aside", {'aria-label': 'Right sidebar'}, partial=True)
# finding my name
sidebar.find("span", {'class': 'crayons-subtitle-2'}, partial=True).text
9. Debt-free
gazpacho is Python 3 first, formatted with Black, typed with mypy, and only ~400 sloc. It's easy to read through the source:
import inspect
source = inspect.getsource(Soup.find)
print(source)
And, unlike bs4, it isn't riddled with Python 2 technical debt.
10. Open (and Friendly)!
Most importantly, gazpacho is open-source, hosted on GitHub (instead of some clunky custom platform), and looking for contributors.
If you're participating in #hacktoberfest, we'd love to have you. There are a couple of open issues that could use some help!