BeautifulSoup is so 2000-and-late: Web Scraping in 2020

maxhumber

Max Humber

Posted on October 9, 2020

BeautifulSoup is so 2000-and-late: Web Scraping in 2020

BeautifulSoup (bs4) was created over a decade-and-a-half ago. And it's been the standard for web scraping ever since. But it's time for something new, because bs4 is so 2000-and-late.

In this post we'll explore 10 reasons why gazpacho is the future of web scraping, by scraping parts of this post!

1. No Dependencies

gazpacho is installed at command line:

pip install gazpacho 
Enter fullscreen mode Exit fullscreen mode

With no extra dependencies:

pip freeze
# gazpacho==1.1
Enter fullscreen mode Exit fullscreen mode

In contrast, bs4 is packaged with soupsieve and lxml. I won't tell you how to write software, but minimizing dependencies is usually a good idea...

2. Batteries Included

The html for this blog post can be fetched and made parse-able with Soup.get:

from gazpacho import Soup

url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
soup = Soup.get(url)
Enter fullscreen mode Exit fullscreen mode

Unfortunately, you'll need requests on top of bs4 to do the same thing:

import requests
from bs4 import BeautifulSoup

url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
html = requests.get(url).text
bsoup = BeautifulSoup(html)
Enter fullscreen mode Exit fullscreen mode

3. Simple finding

bs4 is a monster. There are 184 methods and attributes attached to every BeautifulSoup object. Making it hard to know what to use and when to use it:

len(dir(BeautifulSoup()))
# 184
Enter fullscreen mode Exit fullscreen mode

In contrast, Soup objects in gazpacho are simple; there are just seven methods and attributes to keep track of:

[method for method in dir(Soup())]
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']
Enter fullscreen mode Exit fullscreen mode

Looking at that list it's clear that to find the title of this post (nested inside of an h1 tag), for example, we'll need to use .find:

soup.find('h1')
Enter fullscreen mode Exit fullscreen mode

4. Prototyping to Production

gazpacho is awesome for prototyping and even better for production. By default, .find will return one Soup object if it finds just one element, or a list of Soup objects if it finds more than one.

To guarantee and enforce return types in production the mode= argument in .find can be set manually:

title = (soup
    .find("header", {'id': 'main-title'}, mode="first")
    .find("h1", mode="all")[0]
    .text
)
Enter fullscreen mode Exit fullscreen mode

In contrast, bs4 has 27 find methods and they all return something different:

[method for method in dir(BeautifulSoup()) if 'find' in method] 
Enter fullscreen mode Exit fullscreen mode

5. PEP 561 Compliant

As of version 1.1, gazpacho is PEP 561 compliant. Meaning that the entire library is typed and will work with your typed (or standard duck/un-typed!) code-base:

help(soup.find)
# Signature:
# soup.find(
#     tag: str,
#     attrs: Union[Dict[str, Any], NoneType] = None,
#     *,
#     partial: bool = True,
#     mode: str = 'automatic',
#     strict: Union[bool, NoneType] = None,
# ) -> Union[List[ForwardRef('Soup')], ForwardRef('Soup'), NoneType]

Enter fullscreen mode Exit fullscreen mode

6. Automatic Formatting

The html on dev.to and this post is well formatted. But if it weren't:

header = soup.find("div", {'class': 'crayons-article__header__meta'})
html = str(header.find("div", {'class': 'mb-4 spec__tags'}))
bad_html = html.replace("\n", "") # remove new line characters
print(bad_html)
# <div class="mb-4 spec__tags">  <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B">    <span class="crayons-tag__prefix">#</span>    python  </a>  <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:">    <span class="crayons-tag__prefix">#</span>    webscraping  </a>  <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:">    <span class="crayons-tag__prefix">#</span>    gazpacho  </a>  <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368">    <span class="crayons-tag__prefix">#</span>    hacktoberfest  </a></div>
Enter fullscreen mode Exit fullscreen mode

gazpacho would be able to automatically format and indent the bad/malformed html:

tags = Soup(bad_html)
Enter fullscreen mode Exit fullscreen mode

Making things easier to read:

print(tags)
# <div class="mb-4 spec__tags">
#   <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B">
#     <span class="crayons-tag__prefix">#</span>
#         python
#   </a>
#   <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:">
#     <span class="crayons-tag__prefix">#</span>
#         webscraping
#   </a>
#   <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:">
#     <span class="crayons-tag__prefix">#</span>
#         gazpacho
#   </a>
#   <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368">
#     <span class="crayons-tag__prefix">#</span>
#         hacktoberfest
#   </a>
# </div>
Enter fullscreen mode Exit fullscreen mode

7. Speed

gazpacho is fast. It takes just 258 µs to scrape the tag links for this post:

%%timeit
tags = Soup(bad_html)
tags = tags.find("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 258 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Enter fullscreen mode Exit fullscreen mode

While bs4 takes nearly twice as long to do the same thing:

%%timeit
tags = BeautifulSoup(bad_html)
tags = tags.find_all("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 465 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Enter fullscreen mode Exit fullscreen mode

8. Partial Matching

gazpacho can partially match html element attributes. For instance, the sidebar for this page is displayed with the following html:

<aside class="crayons-layout__sidebar-right" aria-label="Right sidebar navigation">
Enter fullscreen mode Exit fullscreen mode

And can be matched exactly with:

soup.find("aside", {"class": "crayons-layout__sidebar-right"}, partial=False)
Enter fullscreen mode Exit fullscreen mode

Or partially (the default behaviour) with:

sidebar = soup.find("aside", {'aria-label': 'Right sidebar'}, partial=True)

# finding my name
sidebar.find("span", {'class': 'crayons-subtitle-2'}, partial=True).text 
Enter fullscreen mode Exit fullscreen mode

9. Debt-free

gazpacho is Python 3 first, Black, typed with mypy, and about ~400 sloc. It's easy to read through the source:

import inspect

source = inspect.getsource(Soup.find)
print(source)
Enter fullscreen mode Exit fullscreen mode

And like bs4 isn't riddled with Python 2 technical debt.

10. Open (and Friendly)!

Most importantly, gazpacho is open-source, hosted on GitHub (instead of some clunky custom platform) and looking for contributors.

If you're participating in #hacktoberfest, we'd love to have you. There's a couple of open issues that could use some help!

💖 💪 🙅 🚩
maxhumber
Max Humber

Posted on October 9, 2020

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related