Effortlessly scrape HTML tables into Python using pd.read_html!

Chris Greening

Posted on August 8, 2023

In today's data-driven world, the ability to extract and synthesize information from various online sources is not just a powerful skill - it's often a necessity!

And often, that data comes in the form of HTML <table> elements scattered across the web. The challenge then becomes: How do we extract and transform this data into a form that's easily accessible in Python?

With the pandas.read_html function, we're offered a convenient way to extract our data into the highly versatile pandas.DataFrame and get our analyses running quickly and efficiently!

from pandas import read_html

Prerequisites

What is pd.read_html?

pd.read_html is a function within pandas, a popular data manipulation library in Python. Its purpose is to scrape an HTML page (either from a URL or as a string) and extract all the tables found on the page

Here's a quick breakdown of how it works:

  • Specify the source: We tell pd.read_html where to find the HTML content. This could be a URL pointing to a webpage or a string containing raw HTML code
  • Scrape the tables: pd.read_html scans the HTML content and identifies the tables within it
  • Transform into pd.DataFrames: Once the tables are found, pd.read_html converts them into pd.DataFrames for easy analysis and manipulation

So with just one line of code we can scrape all the tables on a webpage and get right into our analyses without having to worry about manual entry or extraction
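To make the three steps above concrete without touching the network, here's a minimal sketch that parses a small HTML string. The table contents are made up purely for illustration, and the string is wrapped in StringIO because recent pandas versions warn when literal HTML is passed directly. It also assumes an HTML parser such as lxml is installed alongside pandas:

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML, invented only for this example
html = """
<table>
  <thead>
    <tr><th>fruit</th><th>price</th></tr>
  </thead>
  <tbody>
    <tr><td>apple</td><td>1.25</td></tr>
    <tr><td>banana</td><td>0.5</td></tr>
  </tbody>
</table>
"""

# 1. Specify the source (here, a file-like wrapper around the string)
# 2. + 3. pd.read_html finds every <table> and returns a list of DataFrames
tables = pd.read_html(StringIO(html))

# One <table> in the HTML means one DataFrame in the list
df = tables[0]
print(len(tables))
print(df)
```

Even when the page only has a single table, we still get a list back, so we always index into it.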

Using pd.read_html in practice

Leveraging pd.read_html is a straightforward process that can save us significant time and effort

Here's a step-by-step guide to using this function to get tables from a webpage right into our Python environments:

Import pandas: First let's import pandas into our script:

import pandas as pd

Specify the source and call pd.read_html: Determine where pd.read_html should look for the HTML content. It could be a URL or a string containing HTML code. For this example let's pull some tables off of the Python Wikipedia page:

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
tables = pd.read_html(url)

Table on the Python wiki page listing the different data types in Python

Access the tables: The result is a list of pd.DataFrames, each representing a table found on the page. We can access them by their index:

df = tables[0]
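On a busy page, the table we want isn't always at index 0, and positions can shift when the page is edited. One option is the match parameter of pd.read_html, which keeps only tables containing text that matches a string or regex. Here's a small sketch on made-up HTML (both tables and the search text are hypothetical):

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two unrelated tables
html = """
<table>
  <thead><tr><th>city</th><th>population</th></tr></thead>
  <tbody><tr><td>Springfield</td><td>30000</td></tr></tbody>
</table>
<table>
  <thead><tr><th>type</th><th>mutable</th></tr></thead>
  <tbody>
    <tr><td>list</td><td>yes</td></tr>
    <tr><td>tuple</td><td>no</td></tr>
  </tbody>
</table>
"""

# Keep only the tables whose text matches "tuple"
matching = pd.read_html(StringIO(html), match="tuple")

# Only the second table mentions "tuple", so it is the sole result
types_df = matching[0]
print(types_df)
```

If no table matches, pd.read_html raises a ValueError, so it can be worth wrapping the call in a try/except when scraping pages we don't control.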

Analyze and manipulate: From here, we're free to work with the data just like we would with any other DataFrame in pandas - filtering rows, calculating statistics, or visualizing the data!
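As a rough illustration of that last step, here's a sketch that filters rows, computes a statistic, and adds a derived column on a scraped table. The product data is invented for the example; the point is only that the result behaves like any other DataFrame:

```python
from io import StringIO

import pandas as pd

# Hypothetical sales table, invented for this example
html = """
<table>
  <thead><tr><th>product</th><th>units</th><th>price</th></tr></thead>
  <tbody>
    <tr><td>widget</td><td>10</td><td>2.50</td></tr>
    <tr><td>gadget</td><td>3</td><td>10.00</td></tr>
    <tr><td>gizmo</td><td>7</td><td>4.00</td></tr>
  </tbody>
</table>
"""

df = pd.read_html(StringIO(html))[0]

# Filtering rows: keep products that sold more than 5 units
popular = df[df["units"] > 5]

# Calculating statistics: average price across all products
mean_price = df["price"].mean()

# Deriving a new column from the scraped ones
df["revenue"] = df["units"] * df["price"]

print(popular)
print(mean_price)
print(df)
```

Numeric-looking cells are converted to numeric dtypes during parsing, which is why the comparison and arithmetic work without any manual casting here.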

Here's what a complete snippet might look like:

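import pandas as pd

# Wikipedia page to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# Returns a list with one DataFrame per table found on the page
tables = pd.read_html(url)
# Grab the first table in the list
df = tables[0]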

Screenshot of DataFrame output scraped from the HTML table

Conclusion

Using pd.read_html to scrape an HTML string offers flexibility and control over the content we're working with

Whether we're handling locally stored HTML files or scraping right from the web, this method allows us to fully leverage pd.read_html's table extraction capabilities
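For the locally stored case, pd.read_html also accepts a file path. Here's a minimal sketch that writes a throwaway HTML file to a temporary directory and reads it back; the file name and table contents are hypothetical:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical HTML standing in for a page we saved earlier
html = """
<table>
  <thead><tr><th>name</th><th>score</th></tr></thead>
  <tbody>
    <tr><td>alice</td><td>90</td></tr>
    <tr><td>bob</td><td>85</td></tr>
  </tbody>
</table>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save the HTML to disk, as if it were a downloaded page
    path = Path(tmp_dir) / "report.html"
    path.write_text(html, encoding="utf-8")

    # Pass the path as a string; pandas opens and parses the file
    tables = pd.read_html(str(path))

df = tables[0]
print(df)
```

Saving a page once and parsing the local copy can also be a polite way to iterate on an analysis without re-requesting the site each time.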

Thanks so much for reading and if you liked my content, be sure to check out some of my other work or connect with me on social media or my personal website 😄
