Scraping webpage header text with Python
Eva
Posted on October 15, 2024
| task |
We have several lists of URLs and need to get the headers from these pages. We assume all the headers are wrapped in h1 tags and the pages are static HTML/CSS.
| tech |
Python* and Beautiful Soup
*You need to install Python and bs4, and create a virtual environment to run Python. [how-to tk]
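Until that how-to is ready, a typical setup from the terminal looks like this (assuming Python 3 is already installed):

python -m venv venv
source venv/bin/activate    (on Windows: venv\Scripts\activate)
pip install requests beautifulsoup4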
| solution |
First, put each URL list into its own .txt file and save them in the same folder as the project (a sample list file is shown below). After that, create another .txt file to store your output (output.txt).
Create a Python file (.py) to write the script in.
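For example, urls/urllist1.txt might look like this (the URLs here are placeholders; use your own):

https://example.com/
https://example.com/blog/
https://example.com/about/

One URL per line, nothing else.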
Here is the logic:
1) go to the URL and get the HTML content of the page
2) check that we got the HTML content and that an h1 tag exists on the page
3) get the text inside the h1 tag and put it into our output.txt file
4) repeat the above steps for all the URLs
5) repeat the above for all the URL list .txt files in the folder, if you have separate URL lists
Full script:
import requests
from bs4 import BeautifulSoup

def print_h1(url: str):
    response = requests.get(url)  # query the webpage and get a response object
    soup = BeautifulSoup(response.text, 'html.parser')
    if response.status_code != 200 or soup.h1 is None:
        print("FAILED: ", url)
        return
    print("\t", soup.h1.text)

files = ["urls/urllist1.txt", "urls/urllist2.txt"]

for file in files:
    with open(file) as f:
        urls = f.readlines()
        print(file)
        for url in urls:
            print_h1(url.strip())
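If everything works, the terminal output looks roughly like this (the headers and the FAILED line here are illustrative):

urls/urllist1.txt
	 Example Domain
	 Another Page Header
FAILED:  https://example.com/broken-link
urls/urllist2.txt
	 Yet Another Header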
Here is the breakdown:
1) First, we need to import dependencies.
import requests
from bs4 import BeautifulSoup
2) Write a loop that goes through the URL list files, and then through every URL in each list.
We use a custom function, print_h1, to extract the h1 from each page; url.strip() removes the trailing newline and any extra whitespace around the URLs in the .txt file. We also added print(file) so we know which URL list file each group of headers belongs to.
for file in files:
    with open(file) as f:
        urls = f.readlines()
        print(file)
        for url in urls:
            print_h1(url.strip())
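One caveat: readlines() keeps blank lines, so an empty line in a .txt file becomes an empty URL and will make the request fail. If your lists might contain blank lines, a small variant (not in the original script) filters them while reading:

for file in files:
    with open(file) as f:
        # keep only non-empty lines, already stripped of whitespace
        urls = [line.strip() for line in f if line.strip()]
        print(file)
        for url in urls:
            print_h1(url)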
3) Create a list of all the URL list files.
files = ["urls/urllist1.txt", "urls/urllist2.txt"]
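If you'd rather not hard-code the file names, Python's standard-library glob module can collect every .txt file in the urls folder for you (sorted() just makes the order predictable):

import glob

files = sorted(glob.glob("urls/*.txt"))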
In #2 we call a function, print_h1, that still needs to be defined. Now let's build it, line by line.
response = requests.get(url)
This requests the page at the given URL and stores the server's response.
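Two things worth knowing about requests.get(): by default it waits indefinitely for a slow server, and a network failure (bad domain, connection refused) raises an exception rather than returning a status code. A more defensive first line for print_h1, sketched as an optional variant, would be:

try:
    response = requests.get(url, timeout=10)  # give up after 10 seconds
except requests.RequestException:
    print("FAILED: ", url)
    return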
soup = BeautifulSoup(response.text, 'html.parser')
This line creates a Beautiful Soup object from the response's HTML text, using Python's built-in HTML parser.
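To see what that object gives us, here is a tiny self-contained example (the HTML string is made up):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><h1>Hello</h1></body></html>", "html.parser")
print(soup.h1)       # <h1>Hello</h1>
print(soup.h1.text)  # Hello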
if response.status_code != 200 or soup.h1 is None:
    print("FAILED: ", url)
    return
This is our error-catching block. If the request doesn't come back with status 200, or there is no h1 tag on the page, we print "FAILED" followed by the URL and move on to the next one.
print("\t", soup.h1.text)
We print the text inside the h1 tag. "\t" adds a tab in front of the text so the headers sit indented under their file name.
As written, everything prints to the terminal, so one last step is needed to get the output into output.txt.
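Two ways to do that, sketched under the assumption the script is saved as scrape.py (a placeholder name): run it with shell redirection, python scrape.py > output.txt, with no code changes at all; or wrap the main loop in contextlib.redirect_stdout so every print() call, including the ones inside print_h1, lands in the file:

import contextlib

with open("output.txt", "w") as out, contextlib.redirect_stdout(out):
    for file in files:
        with open(file) as f:
            urls = f.readlines()
            print(file)
            for url in urls:
                print_h1(url.strip())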
Run the Python file and, with either tweak above, it should write all the headers and error messages into output.txt.
credit:
Christian Rang
reference:
https://oxylabs.io/blog/beautiful-soup-parsing-tutorial
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Photo by Javier Quesada on Unsplash