BeautifulSoup (bs4) was created over a decade-and-a-half ago. And it's been the standard for web scraping ever since. But it's time for something new, because bs4 is so 2000-and-late.

In this post we'll explore 10 reasons why gazpacho is the future of web scraping, by scraping parts of this post!

1. No Dependencies

gazpacho is installed at the command line:

pip install gazpacho

With no extra dependencies:

pip freeze
# gazpacho==1.1

In contrast, bs4 pulls in soupsieve at install time, and is commonly paired with a third-party parser like lxml. I won't tell you how to write software, but minimizing dependencies is usually a good idea...
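If you'd like to check the dependency story for yourself, here's a minimal sketch that lists the requirements an installed package declares in its metadata. It leans on importlib.metadata from the standard library (Python 3.8+). The helper name declared_dependencies is made up for this post, and what it prints will depend on the versions installed in your environment:

```python
from importlib.metadata import PackageNotFoundError, requires
from typing import List, Optional


def declared_dependencies(package: str) -> Optional[List[str]]:
    """Return the requirement strings a package declares, or None if it isn't installed."""
    try:
        # requires() returns None when the package declares no requirements
        return list(requires(package) or [])
    except PackageNotFoundError:
        return None


for name in ("gazpacho", "beautifulsoup4"):
    print(name, declared_dependencies(name))
```

Entries that carry an `extra == ...` marker are optional extras rather than hard dependencies.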

2. Batteries Included

The html for this blog post can be fetched and made parse-able with Soup.get:

from gazpacho import Soup

url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
soup = Soup.get(url)

Unfortunately, you'll need requests on top of bs4 to do the same thing:

import requests
from bs4 import BeautifulSoup

url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
html = requests.get(url).text
bsoup = BeautifulSoup(html, "html.parser")  # name the parser to avoid a "no parser specified" warning
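For a sense of how a fetch like this can work without any third-party packages, here's a rough sketch built on urllib from the standard library. It's an illustration under my own assumptions (the fetch_html name, the User-Agent header and the timeout are all made up here), not a copy of what Soup.get does internally:

```python
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL with the standard library and return the decoded body."""
    # Some sites reject requests that lack a browser-like User-Agent
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset from the response headers, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling fetch_html(url) with the url defined above should hand back markup that can be passed straight to Soup(...) or BeautifulSoup(...).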

3. Simple `find`ing

bs4 is a monster. There are 184 methods and attributes attached to every BeautifulSoup object, which makes it hard to know what to use and when to use it:

len(dir(BeautifulSoup()))
# 184

In contrast, Soup objects in gazpacho are simple; there are just seven methods and attributes to keep track of:

[method for method in dir(Soup()) if not method.startswith('_')]  # skip dunder and private names
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']

Looking at that list it's clear that to find the title of this post (nested inside of an h1 tag), for example, we'll need to use .find:

soup.find('h1')
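To make the idea behind a single find concrete, here's a toy version built on html.parser from the standard library. It only collects the text inside every occurrence of one tag name, so treat it as a sketch of the concept rather than as gazpacho's implementation. TagText and find_text are names invented for this example:

```python
from html.parser import HTMLParser
from typing import List


class TagText(HTMLParser):
    """Collect the text inside every top-level occurrence of one tag name."""

    def __init__(self, tag: str) -> None:
        super().__init__()
        self.tag = tag
        self.depth = 0  # how many matching tags are currently open
        self.results: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self.results.append("")  # start a new match
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.results[-1] += data  # text nested inside child tags counts too


def find_text(html: str, tag: str) -> List[str]:
    parser = TagText(tag)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


print(find_text("<header><h1>Hello <em>world</em></h1><p>x</p></header>", "h1"))
# ['Hello world']
```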

4. Prototyping to Production

gazpacho is awesome for prototyping and even better for production. By default, .find will return one Soup object if it finds just one element, or a list of Soup objects if it finds more than one.

To guarantee and enforce return types in production, the mode= argument in .find can be set manually:

title = (soup
    .find("header", {'id': 'main-title'}, mode="first")
    .find("h1", mode="all")[0]
    .text
)

In contrast, bs4 has 27 find methods and they all return something different:

[method for method in dir(BeautifulSoup()) if 'find' in method]
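The return modes described for .find can be sketched as a small function over a list of matches. The "automatic" branch follows the behaviour described in this section; what "first" and "all" return when nothing matches is my assumption, and select is a hypothetical name, not part of gazpacho:

```python
from typing import List, Sequence, TypeVar, Union

T = TypeVar("T")


def select(matches: Sequence[T], mode: str = "automatic") -> Union[List[T], T, None]:
    """Pick a return shape for a sequence of matches based on mode."""
    if mode == "all":
        return list(matches)  # always a list, even for zero or one match
    if mode == "first":
        return matches[0] if matches else None  # always a single item (or None)
    if mode == "automatic":
        if not matches:
            return None
        # one match comes back bare, several come back as a list
        return matches[0] if len(matches) == 1 else list(matches)
    raise ValueError(f"unknown mode: {mode!r}")


print(select(["h1"]))                # h1
print(select(["h1", "h1"]))          # ['h1', 'h1']
print(select(["h1"], mode="all"))    # ['h1']
```

Pinning the mode means downstream code never has to check whether it got one object or a list.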

5. PEP 561 Compliant

As of version 1.1, gazpacho is PEP 561 compliant, meaning that the entire library is typed and will work with your typed (or standard duck/un-typed!) code-base:

help(soup.find)
# Signature:
# soup.find(
#     tag: str,
#     attrs: Union[Dict[str, Any], NoneType] = None,
#     *,
#     partial: bool = True,
#     mode: str = 'automatic',
#     strict: Union[bool, NoneType] = None,
# ) -> Union[List[ForwardRef('Soup')], ForwardRef('Soup'), NoneType]
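PEP 561 asks packages that ship inline type hints to include a py.typed marker file, which is how type checkers know to trust the annotations. Here's a small sketch for checking whether an installed package carries that marker. The ships_py_typed helper is a name made up for this post, and the result depends on what is installed in your environment:

```python
import importlib.util
from pathlib import Path


def ships_py_typed(package: str) -> bool:
    """Return True if an importable top-level package contains a py.typed marker."""
    spec = importlib.util.find_spec(package)
    # Missing packages and single-file modules have no package directory to inspect
    if spec is None or not spec.submodule_search_locations:
        return False
    return any(
        (Path(location) / "py.typed").is_file()
        for location in spec.submodule_search_locations
    )


print(ships_py_typed("gazpacho"))
```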

6. Automatic Formatting

The html on dev.to and this post is well formatted. But if it weren't:

header = soup.find("div", {'class': 'crayons-article__header__meta'})
html = str(header.find("div", {'class': 'mb-4 spec__tags'}))
bad_html = html.replace("\n", "") # remove new line characters
print(bad_html)
# <div class="mb-4 spec__tags">  <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B">    <span class="crayons-tag__prefix">#</span>    python  </a>  <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:">    <span class="crayons-tag__prefix">#</span>    webscraping  </a>  <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:">    <span class="crayons-tag__prefix">#</span>    gazpacho  </a>  <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368">    <span class="crayons-tag__prefix">#</span>    hacktoberfest  </a></div>

gazpacho can automatically format and indent the bad/malformed html:

tags = Soup(bad_html)

Making things easier to read:

print(tags)
# <div class="mb-4 spec__tags">
#   <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B">
#     <span class="crayons-tag__prefix">#</span>
#         python
#   </a>
#   <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:">
#     <span class="crayons-tag__prefix">#</span>
#         webscraping
#   </a>
#   <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:">
#     <span class="crayons-tag__prefix">#</span>
#         gazpacho
#   </a>
#   <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368">
#     <span class="crayons-tag__prefix">#</span>
#         hacktoberfest
#   </a>
# </div>
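If you're curious how re-indenting can work, here's a toy indenter built on html.parser from the standard library. It is a sketch under my own assumptions, not gazpacho's formatter: it puts every tag and text run on its own line, only knows about a handful of void tags, and the Indenter and indent_html names are invented for this example:

```python
from html.parser import HTMLParser

# Tags that never get a closing tag (a partial list, enough for a sketch)
VOID = {"br", "hr", "img", "input", "link", "meta"}


class Indenter(HTMLParser):
    def __init__(self, width: int = 2) -> None:
        super().__init__()
        self.width = width
        self.depth = 0
        self.lines = []

    def _emit(self, text: str) -> None:
        self.lines.append(" " * self.width * self.depth + text)

    def handle_starttag(self, tag, attrs):
        self._emit(self.get_starttag_text())  # original tag text, attributes included
        if tag not in VOID:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())  # self-closing tags don't change depth

    def handle_endtag(self, tag):
        self.depth = max(self.depth - 1, 0)
        self._emit(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only runs between tags
            self._emit(data.strip())


def indent_html(html: str, width: int = 2) -> str:
    parser = Indenter(width)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


print(indent_html('<div class="a">  <a href="/t/python">    <span>#</span>    python  </a></div>'))
```

Unlike the gazpacho output above, this version splits the span onto three lines, which is the kind of detail a real formatter has to get right.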

7. Speed

gazpacho is fast. It takes just 258 µs to scrape the tag links for this post:

%%timeit
tags = Soup(bad_html)
tags = tags.find("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 258 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

bs4, meanwhile, takes nearly twice as long to do the same thing:

%%timeit
tags = BeautifulSoup(bad_html)
tags = tags.find_all("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 465 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
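%%timeit is a Jupyter cell magic, so the snippets above won't run as a plain script. Here's a rough equivalent using the timeit module from the standard library. The per_loop_us helper is a name made up for this post, and the numbers it reports will vary from machine to machine:

```python
import timeit
from statistics import mean, stdev
from typing import Callable, Tuple


def per_loop_us(func: Callable[[], object], number: int = 1000, repeat: int = 7) -> Tuple[float, float]:
    """Return (mean, std. dev.) of the per-loop time in microseconds."""
    # Each entry in totals is the time in seconds for `number` calls
    totals = timeit.repeat(func, number=number, repeat=repeat)
    per_loop = [total / number * 1e6 for total in totals]
    return mean(per_loop), stdev(per_loop)


# Stand-in workload; swap in the scraping code you want to compare
avg, spread = per_loop_us(lambda: sorted(range(100)))
print(f"{avg:.1f} µs ± {spread:.1f} µs per loop")
```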

8. Partial Matching

gazpacho can partially match html element attributes. For instance, the sidebar for this page is displayed with the following html:

<aside class="crayons-layout__sidebar-right" aria-label="Right sidebar navigation">

And can be matched exactly with:

soup.find("aside", {"class": "crayons-layout__sidebar-right"}, partial=False)

Or partially (the default behaviour) with:

sidebar = soup.find("aside", {'aria-label': 'Right sidebar'}, partial=True)

# finding my name
sidebar.find("span", {'class': 'crayons-subtitle-2'}, partial=True).text
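The difference between exact and partial matching can be sketched with plain dictionaries. This version assumes that "partial" means the queried value is a substring of the element's value, which fits the examples above but is a simplification of my own. attrs_match is a hypothetical helper, not gazpacho's matching code:

```python
from typing import Dict


def attrs_match(query: Dict[str, str], attrs: Dict[str, str], partial: bool = True) -> bool:
    """Check every queried attribute against an element's attributes."""
    for key, wanted in query.items():
        actual = attrs.get(key)
        if actual is None:
            return False  # the element doesn't have this attribute at all
        if partial:
            if wanted not in actual:  # substring match
                return False
        elif wanted != actual:  # exact match
            return False
    return True


aside = {
    "class": "crayons-layout__sidebar-right",
    "aria-label": "Right sidebar navigation",
}

print(attrs_match({"aria-label": "Right sidebar"}, aside, partial=True))   # True
print(attrs_match({"aria-label": "Right sidebar"}, aside, partial=False))  # False
print(attrs_match({"class": "crayons-layout__sidebar-right"}, aside, partial=False))  # True
```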

9. Debt-free

gazpacho is Python 3 first, formatted with Black, typed with mypy, and about 400 sloc. It's easy to read through the source:

import inspect

source = inspect.getsource(Soup.find)
print(source)

And, unlike bs4, it isn't riddled with Python 2 technical debt.
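If you want to put a rough number on "easy to read", here's a crude line counter that pairs well with inspect.getsource. It counts every line that isn't blank or a comment, so docstrings are included and the result is only a ballpark. count_sloc is a name made up for this post:

```python
import inspect


def count_sloc(source: str) -> int:
    """Count lines that are neither blank nor pure comments."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


# Works on any pure-Python object; try Soup.find once gazpacho is imported
print(count_sloc(inspect.getsource(inspect.getsource)))
```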

10. Open (and Friendly)!

Most importantly, gazpacho is open-source, hosted on GitHub (instead of some clunky custom platform) and looking for contributors.

If you're participating in #hacktoberfest, we'd love to have you. There are a couple of open issues that could use some help!

BeautifulSoup is so 2000-and-late: Web Scraping in 2020

Max Humber
