WEB SCRAPING USING PYTHON AND BEAUTIFUL SOUP.
Alexander Maina
Posted on January 15, 2023
I previously submitted a back-end developer job application to a certain website. But the website never sent me updates through email, and this scenario only sometimes presented itself. As a result, I had to access the internet daily and would continue to scroll and gaze. Until I discovered web scraping, I was unable to take advantage of an opportunity the next season.
As a result, I feel the need to share my knowledge about Python web scraping.
What is Web Scraping?
Web Scraping is a term used to refer to using a program to download and process content from the web.
Interestingly, copying and pasting the contents of the web is a basic example of web scraping.However Web scraping involves automation.
What is Beautiful Soup?
It is a Python module that parses (analyze and identify the parts of) Hyper-Text Mark-Up Language, a language in which web pages are written in.
Beautiful Soup does not come installed in python and hence it needs to be installed before initial use.
The BeautifulSoup moduleโs name is bs4 (for Beautiful Soup, version 4).
1.0 Installing Beautiful Soup and Requests Library.
Open the command line and type python
to open the python interpreter in interactive mode.
To install Beautiful Soup, type the following command on the command line:
pip install beautifulsoup4
While beautifulsoup4 is the name used for installation, we use bs4 to import Beautiful Soup
We also need to install requests python library by typing the following command on the command line.
pip install requests
2.0 Scraping the Page
In this section, we now need to get the contents of the candidate page. We will use https://www.xyz.com
as an example of our candidate page.
import requests
URL = "https://www.xyz.com"
response = requests.get(URL)
This returns the Content of the https://www.xyz.com
page. This includes all elements and attributes present inside the page.
3.0 Parsing the HTML page content.
This refers to parsing this lengthy code answer using Python's help to make it more accessible and allow you to select the information you need.
We first of all need to import Beautiful soup and then create a variable to store our parsed content.
import requests
from bs4 import BeautifulSoup
URL = "https://www.xyz.com"
response = requests.get(URL)
soup = BeautifulSoup (response.content, "html.parser")
"html.parser"
is a built-in Python library that parses HTML and XML documents. It creates a parse tree from the HTML content that can be used to extract information from the website.
4.0 Find element by attribute.
Take for example we need to find jobs based on a div
with id = "available"
. We now need to scan through the entire page and find all elements with id = "available"
.
import requests
from bs4 import BeautifulSoup
URL = "https://www.xyz.com"
response = requests.get(URL)
soup = BeautifulSoup (page.content, "html.parser")
job = soup.find(id = "available")
This returns the list of all jobs available. Below is an example:
<div id = "available">
<!--Job listings-->
</div>
You can also chain multiple find_all()
method to make the search more specific.For example:
job = soup.find_all("p", string="Posts")
The above code will scan through the paragraphs, trying to find a string Posts
. If it happens that this has been mispelt or typed with a different case, it will return no object.To rectify this, we can use lambda()
function as follows;
python_job = results.find_all(
"h2", string=lambda text: "python" in text.lower()
)
The above code will scan through all h2
in the page and convert them to lower case. It'll then find the substring 'search, and return the results.
5.0 Display the available jobs.
Now print the available jobs.
print(job.text)
You can as well use len
if you need to see the number of jobs available.
print(len(job.text))
I hope that this step-by-step guide has instilled new skills in you. Happy Coding!!!
Posted on January 15, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
November 12, 2024