Effortlessly scrape HTML tables into Python using pd.read_html!
Chris Greening
Posted on August 8, 2023
In today's data-driven world, the ability to extract and synthesize information from various online sources is not just a powerful skill - it's often a necessity!
And often, that data comes in the form of HTML <table> elements scattered across the web. The challenge then becomes: how do we extract and transform this data into a form that's easily accessible in Python?
With the pandas.read_html function, we're offered a convenient solution to extract our data into the highly versatile pandas.DataFrame and get our analyses running quickly and efficiently!
from pandas import read_html
What is pd.read_html?
pd.read_html is a function within pandas, a popular data manipulation library in Python. Its purpose is to scrape an HTML page (either from a URL or as a string) and extract all the tables found on the page
Here's a quick breakdown of how it works:
- Specify the source: We tell pd.read_html where to find the HTML content. This could be a URL pointing to a webpage or a string containing raw HTML code
- Scrape the tables: pd.read_html scans the HTML content and identifies the tables within it
- Transform into pd.DataFrames: Once the tables are found, pd.read_html converts them into pd.DataFrames for easy analysis and manipulation
So with just one line of code we can scrape all the tables on a webpage and get right into our analyses without having to worry about manual entry or extraction
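To make the three steps above concrete, here is a minimal sketch that parses a small, made-up HTML string instead of a live page. The table contents and variable names are assumptions for illustration only. The string is wrapped in io.StringIO because recent pandas releases deprecate passing literal HTML directly, and the sketch assumes an HTML parser such as lxml (or BeautifulSoup with html5lib) is installed alongside pandas:

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML containing a single table (not taken from a real page)
html = """
<table>
  <tr><th>language</th><th>first_released</th></tr>
  <tr><td>Python</td><td>1991</td></tr>
  <tr><td>Ruby</td><td>1995</td></tr>
</table>
"""

# Step 1: specify the source (here, a file-like wrapper around the string)
source = StringIO(html)

# Steps 2 and 3: find every table and convert each one into a DataFrame
tables = pd.read_html(source)

print(len(tables))  # how many tables were found
print(tables[0])    # the first table as a DataFrame
```

The return value is always a list, even when the source contains just one table.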
Using pd.read_html in practice
Leveraging pd.read_html is a straightforward process that can save us significant time and effort
Here's a step-by-step guide to using this function to get tables from a webpage right into our Python environments:
Import pandas: First let's import pandas into our script:
import pandas as pd
Specify the source and call pd.read_html: Determine where pd.read_html should look for the HTML content. It could be a URL or a string containing HTML code. For this example let's pull some tables off of the Wikipedia page for Python:
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
tables = pd.read_html(url)
Access the tables: The result is a list of pd.DataFrames, each representing a table found on the page. We can access them by their index:
df = tables[0]
Analyze and manipulate: From here, we're free to work with the data just like we would with any other DataFrame in pandas - filtering rows, calculating statistics, or visualizing the data!
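As a rough illustration of that last step, the sketch below filters rows and computes a statistic on a scraped table. The HTML, city names, and population figures are invented stand-ins for a real page, and the string is wrapped in io.StringIO as recent pandas versions expect:

```python
from io import StringIO

import pandas as pd

# Hypothetical table standing in for one scraped from a webpage
html = """
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Springfield</td><td>120000</td></tr>
  <tr><td>Shelbyville</td><td>80000</td></tr>
  <tr><td>Capital City</td><td>500000</td></tr>
</table>
"""

df = pd.read_html(StringIO(html))[0]

# Filter rows: keep only cities with more than 100,000 people
large_cities = df[df["population"] > 100_000]

# Calculate a statistic: the mean population across all rows
mean_population = df["population"].mean()

print(large_cities)
print(mean_population)
```

Numeric-looking cells are converted to numbers during parsing, which is why the comparison and the mean work without any extra casting here.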
Here's what a complete snippet might look like:
import pandas as pd
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# Scrape every table on the page into a list of DataFrames
tables = pd.read_html(url)
# Pick the first table by its index
df = tables[0]
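On a page with many tables, tables[0] may not be the one we're after. One possible way to narrow things down is to print each table's position and shape, or to pass the match parameter so only tables containing a given string or regex are returned. The sketch below uses a made-up two-table HTML string (an assumption, so it runs without network access) wrapped in io.StringIO:

```python
from io import StringIO

import pandas as pd

# Hypothetical page content with two tables
html = """
<table>
  <tr><th>name</th><th>typing</th></tr>
  <tr><td>Python</td><td>dynamic</td></tr>
</table>
<table>
  <tr><th>version</th><th>year</th></tr>
  <tr><td>3.0</td><td>2008</td></tr>
  <tr><td>3.12</td><td>2023</td></tr>
</table>
"""

all_tables = pd.read_html(StringIO(html))

# Print the index, shape, and columns of each table to locate the right one
for i, table in enumerate(all_tables):
    print(i, table.shape, list(table.columns))

# Return only the tables whose text matches the given string or regex
version_tables = pd.read_html(StringIO(html), match="version")
df = version_tables[0]
print(df)
```

The same match argument can be passed when the source is a URL.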
Conclusion
Using pd.read_html to scrape an HTML string offers flexibility and control over the content we're working with
Whether we're handling locally stored HTML files or scraping right from the web, this method allows us to fully leverage pd.read_html's table extraction capabilities
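For the locally stored case, a minimal sketch might look like the following. It writes a small made-up HTML file to a temporary directory and then hands the file path to pd.read_html. The file name and table contents are hypothetical:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical HTML saved to disk, standing in for a downloaded page
html = (
    "<table>"
    "<tr><th>a</th><th>b</th></tr>"
    "<tr><td>1</td><td>2</td></tr>"
    "</table>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example.html"
    path.write_text(html, encoding="utf-8")

    # A path to a local HTML file is accepted in place of a URL
    tables = pd.read_html(str(path))

print(tables[0])
```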
Thanks so much for reading and if you liked my content, be sure to check out some of my other work or connect with me on social media or my personal website