Introduction

Web scraping means writing code that programmatically pulls data from a web page. That data usually arrives as HTML, and the scraper is written to extract the pieces you need from it. This article covers what web scraping is, the ethics of using a scraper, and the tools needed to scrape.

There are three main steps to web scraping:

1. Get the HTML
2. Extract data from the HTML
3. Save and use the scraped data

Use Case and Ethics

Why would you need a web scraper? Websites often do not expose the information in their databases to the public in an easy-to-use format. That can make it difficult to get just the data you want, which is where web scraping comes in.

Say you are looking at a list of apartments on a site like Zillow and want to chart the change in rent prices over five years to get a sense of when the best time to rent is. Now, Zillow does not have a public API for easy access to the properties in its database. An alternative is web scraping: you write code that extracts the data you want from the HTML of a web page. This is much more efficient than manually copying and pasting the data yourself.

The obvious question here is: "Is web scraping ethical?"
The answer: it depends. Generally speaking, there is no blanket law that says you cannot scrape, and the HTML of a web page is publicly available to anyone who has access to the page.

If you scrape simple data for your personal use, there is usually no real problem. However, if you use a scraper to clone a website and profit from it, and the original website catches on, that could spell legal trouble!

The convention is to be polite and respect what the website owners want if they have a position on their data's privacy.
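One common way site owners publish their preferences is a robots.txt file, and Python's standard library can read one. The sketch below is only an illustration of "being polite": the robots.txt text, the user agent name my-scraper, the example.com URLs, and the helper is_allowed are all made up for this example, and passing this check does not settle the legal or ethical question on its own.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, written by hand for this example.
example_robots = """\
User-agent: *
Disallow: /private/
"""

allowed = is_allowed(example_robots, "my-scraper", "https://example.com/listings")
blocked = is_allowed(example_robots, "my-scraper", "https://example.com/private/data")
```

For a live site you would download its robots.txt first (RobotFileParser also offers set_url() and read() for that) and skip any page the check rejects.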

Tools: Beautiful Soup

To actually scrape data from a site, you can use web scraping tools, which are programmed to parse a page and grab data.

First, you make a request to get the HTML that you will parse. This is done with a get() request to the site, which returns data that you then pass to the web scraping software of your choice.
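The request step might look something like the sketch below. The get() call mentioned above most likely refers to requests.get() from the third-party requests package; to stay dependency-free, this example assumes the standard library's urllib.request instead. The helper name fetch_html and the User-Agent string are hypothetical.

```python
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Send a GET request to url and return the response body as text."""
    # Identify the scraper; the User-Agent value here is a made-up placeholder.
    request = Request(url, headers={"User-Agent": "quick-look-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example usage (needs network access, so it is left commented out):
# html = fetch_html("https://example.com")
```

With the requests package installed, the rough equivalent would be requests.get(url).text.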

Installation

Second, you need web scraping software to help parse the HTML returned from the request. Beautiful Soup, for example, is a Python package you can install in your local environment that extracts data from the HTML returned by your request.

To install Beautiful Soup, run this in your command line:

python3 -m pip install beautifulsoup4

Utilization

Third, assuming you have an HTML document returned from the request, you can now use Beautiful Soup to extract whatever information you want. To use it with your returned HTML, you must first import it in your Python script:

from bs4 import BeautifulSoup

Lastly, turn the returned HTML into a BeautifulSoup object. To do this, pass the HTML and the name of a parser to the BeautifulSoup constructor and assign the result to a variable. Here is one way to do that:

souped = BeautifulSoup(html, "html.parser")

This way, you have access to BeautifulSoup methods such as find_all(), find_parent(), find_previous_sibling(), and replace_with(), which you can invoke on the Beautiful Soup instance.
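As a rough sketch of the extraction step, the example below parses a small hand-written HTML snippet instead of a live page. The markup, the class names (listing, address, rent), and the helpers extract_listings and to_csv are hypothetical and made up for illustration; a real site's HTML will look different. It uses find_all() and find() to pull out each listing, then writes the rows as CSV text with Python's csv module as one possible way to handle the save step.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hand-written stand-in for the HTML a request would return.
html = """
<ul id="listings">
  <li class="listing"><span class="address">12 Elm St</span><span class="rent">$1,450</span></li>
  <li class="listing"><span class="address">98 Oak Ave</span><span class="rent">$1,725</span></li>
</ul>
"""

souped = BeautifulSoup(html, "html.parser")


def extract_listings(soup):
    """Collect the address and monthly rent from every listing item."""
    rows = []
    for item in soup.find_all("li", class_="listing"):
        address = item.find("span", class_="address").get_text(strip=True)
        rent_text = item.find("span", class_="rent").get_text(strip=True)
        # Turn "$1,450" into the integer 1450.
        rent = int(rent_text.replace("$", "").replace(",", ""))
        rows.append({"address": address, "rent": rent})
    return rows


def to_csv(rows):
    """Serialize the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["address", "rent"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


listings = extract_listings(souped)
csv_text = to_csv(listings)

# find_parent() walks up the tree: from a rent <span> to its enclosing <ul>.
parent_id = souped.find("span", class_="rent").find_parent("ul")["id"]
```

To save the result to disk, you could write csv_text to a file, or pass an open file object to csv.DictWriter instead of the in-memory buffer.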

Takeaways

Web scraping is an efficient way of collecting information from web pages in a structured form. However, questions have been raised about the ethics of its use. The convention is to be polite and aware of the site owner's preferences.

Quick Look: Web Scraping

Blujay0
