Scraping 1000s of News Articles in 10 Simple Steps
Kajal
Posted on July 28, 2020
Web Scraping Series: Using Python and Software
1. Scraping web pages without using software: Python
2. Scraping web pages using software: Octoparse
Table of Contents
1.1 Introduction
1.1.1 Why this article?
1.1.2 Whom is this article useful for?
1.2 Overview
1.2.1 A brief introduction to webpage design and HTML
1.2.2 Web-scraping using BeautifulSoup in Python
Step-1: Installing Packages
Step-2: Importing Libraries
Step-3: Making Simple Requests
Step-4: Inspecting the Response Object
Step-5: Delaying Request Time
Step-6: Extracting Content from HTML
Step-7: Finding Elements and Attributes
Step-8: Making the Dataset
Step-9: Visualising the Dataset
Step-10: Making a CSV File & Saving It to Your Machine
1.3 Suggestion & Conclusion
1.3.1 Full Code
INTRODUCTION
WHY THIS ARTICLE?
The aim of this article is to scrape news articles from different websites using Python. Generally, web scraping involves accessing numerous websites and collecting data from them. However, we can also limit ourselves to collecting a large amount of information from a single source and use it as a dataset.
Web scraping is a technique for extracting large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer or to a database in table (spreadsheet) format.
I got motivated to do web scraping while working on my machine-learning project, a Fake News Detection System. Whenever we begin a machine learning project, the first thing we need is a dataset. While there are many datasets you can find online with varied information, sometimes you wish to extract data on your own and begin your own investigation. I needed a dataset that matched my requirements and couldn't find one anywhere.
That motivated me to build my own dataset for the project, and that's how I did the project from scratch. My project was basically about classifying news articles into two main categories: FAKE & REAL.
FAKE-NEWS DATASET
For this project, the first task was to get a dataset already labeled as "FAKE". This can be achieved by scraping data from verified & certified fact-checking websites, which we can rely on for the factual status of news articles; getting genuine "FAKE NEWS" any other way is a very difficult task.
I went through several news websites to build my FAKE-NEWS dataset.
But honestly speaking, I ended up scraping data from only one website, i.e., Politifact.
And there is a strong reason for that. As you go through the links listed above, you will conclude that we need a dataset that is already labeled "FAKE", but we also don't want our news articles to be in a modified form. We want to extract the raw news article, without any keywords indicating whether a given article in the dataset is "FAKE" or not.
For example, if you go through the link "BoomLive.in", you will find that the news articles labeled "FAKE" are not in their original form; they have been altered based on the fact-checking team's analysis. Training an ML model on this altered text will give a biased result every time, and the resulting model will be a dumb one that can only predict news articles containing keywords like "FAKE", "DID?", or "IS?" and will not perform well on a new test set.
That's why we use Politifact to scrape our FAKE-NEWS dataset.
There are challenges in labelling news articles too, but we will cover those in a later section.
REAL-NEWS DATASET
The second task was to create a REAL-NEWS dataset. That is easy if you scrape news articles from trusted or verified news websites like "TOI", "IndiaToday", "TheHindu" and so many more. We can trust that these websites list factual data, and even if they don't, we assume that they do and train our model accordingly.
But for my project, I scraped both the real and the fake data from one website only (i.e., Politifact.com), since I was getting what I needed from it, and it is also advisable, when scraping data using Python, to work on one website at a time. You can still scrape multiple pages of that website in one module by simply running an outer for loop.
WHOM IS THIS ARTICLE USEFUL FOR?
If you are working on a project where you need to scrape data in the thousands, this article is definitely for you 😃. It doesn't matter whether you come from a programming background or not: people from all kinds of backgrounds often need data for a project, a survey, or some other purpose, but non-programmers find it difficult to pick up a programming language. So I will make scraping easy for them too by introducing some software with which they can scrape any kind of data in huge amounts easily.

That said, scraping using Python is not that difficult if you follow along while reading this blog 😎. The only thing you really need to focus on is the HTML source code of a webpage. Once you understand how webpages are written in HTML and can identify the elements and attributes of interest, you can scrape any website 😋.

For non-programmers who want to do web scraping using Python: focus mainly on the HTML code. Python syntax is not that difficult to understand; it's just libraries, some functions, and keywords that you need to remember and understand. I have tried to explain every step transparently, and I hope that by the end of this series you will be able to scrape many different webpage layouts.
OVERVIEW
This post covers the first part: scraping news articles using Python. We'll create a script that scrapes the latest news articles from different newspapers and stores the text, which will afterward be fed into the model to get a prediction of its category.
A brief introduction to webpage design and HTML:
If we want to be able to extract news articles (or, in fact, any other kind of text) from a website, the first step is to know how a website works.
We will follow an example to understand this:
When we enter a URL into the web browser (e.g., Google Chrome, Firefox, etc.) and access it, what we see is the combination of three technologies:
HTML (HyperText Markup Language): the standard language for adding content to a website. It allows us to insert text, images, and other things into our site. In one word, HTML defines the content of every webpage on the internet.
CSS (Cascading Style Sheets): this language allows us to set the visual design of a website. It determines the style/presentation of a webpage, including colors, layouts, and fonts.
JavaScript: a dynamic programming language. It allows us to make the content and the style interactive and provides a dynamic interface between client-side scripts and the user.
Note that, strictly speaking, only JavaScript is a programming language; HTML is a markup language and CSS is a style-sheet language. Together, they let us create and manipulate every aspect of a webpage's content, presentation, and behaviour.
Let’s illustrate these concepts with an example. When we visit the Politifact page, we see the following:
If we disabled JavaScript, we would no longer be able to use this pop-up; as you can see, the video pop-up window is gone:
And if we deleted the CSS content from the webpage, we would see something like this:
So, at this point, let me ask you a question:
"If you want to extract the content of a webpage via web scraping, where do you need to look?"
I hope you are now clear about which kind of source code we need to scrape. 😎 Yes, you are absolutely right if you are thinking of HTML 😉
So, the last step before performing web scraping is to understand a bit of the HTML language.
HTML
HTML ("HyperText Markup Language") defines the content of a webpage and consists of elements and attributes. To scrape data, you should be familiar with inspecting those elements.
- An element could be a heading, paragraph, division, anchor tag and so on.
- An attribute gives extra information about an element, e.g., a class or style that makes a heading appear in bold letters.
These tags are written with an opening symbol <tag> and a closing symbol </tag>,
e.g.,
<p>This is paragraph.</p>
<h1><b>This is heading one in bold letters</b></h1>
Web-scraping using BeautifulSoup in PYTHON
Enough talk, show me the code.
Step-1 : Installing Packages
We will first begin with installing necessary packages:
beautifulsoup4
To install it, please type the following command into your Python distribution.
! pip install beautifulsoup4
BeautifulSoup, which lives in the bs4 package, is a library used to parse HTML & XML documents in Python in a very easy & convenient way and to access their elements by identifying them with their tags and attributes.
It is a very easy to use yet very powerful package that can extract any kind of data from the internet in just 5-6 lines of code.
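For a quick taste of how little code that takes, here is a minimal sketch that fetches a Politifact listing page and prints the quoted statements. It relies on the page layout discussed later in this article, so treat it as an illustration rather than a finished scraper:

import requests
from bs4 import BeautifulSoup

# Fetch one listing page and parse it
page = requests.get("https://www.politifact.com/factchecks/list/")
soup = BeautifulSoup(page.text, "html.parser")

# Print the quoted statement of every fact-check on the page
for quote in soup.find_all("div", attrs={"class": "m-statement__quote"}):
    print(quote.text.strip())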
requests
To install it, use the following command in your IDE or command shell.
! pip install requests
To provide BeautifulSoup with the HTML code of any page, we will need the requests module.
urllib
urllib is the URL-handling module for Python and is used to fetch URLs (Uniform Resource Locators). It ships with the Python standard library, so no separate installation is needed.
Alongside it, we will import two other standard-library modules on the same line:
- time, whose sleep() function delays or suspends execution for a given number of seconds.
- sys, which is used here to get exception info such as the type of error, the error object, and information about the error.
Step-2 : Importing Libraries
Now we will import all the required libraries:
BeautifulSoup
To import it, use the following command in your IDE:
from bs4 import BeautifulSoup
This library helps us get the HTML structure of any page that we want to work with and provides functions to access specific elements and extract the relevant info.
urllib
To import it, type the following command:
import urllib.request,sys,time
- urllib.request: defines the functions & classes that help in opening URLs.
- sys (imported on the same line): its functions & classes help us retrieve exception info.
- time (also imported on the same line): Python's time module provides several useful functions for handling time-related tasks; one of the most popular is sleep().
requests
To import it, just type import followed by the library name.
import requests
This module allows us to send HTTP requests to a web server using Python. (HTTP messages consist of requests from client to server and responses from server to client.)
pandas
import pandas as pd
It is a high-level data-manipulation tool that we need to visualise our structured scraped data.
We will use this library to make a DataFrame (the key data structure of this library). DataFrames allow us to store and manipulate tabular data in rows of observations and columns of variables.
import urllib.request,sys,time
from bs4 import BeautifulSoup
import requests
import pandas as pd
Step-3 : Making Simple requests
With the requests module, we can get the HTML content and store it into the page variable.
Make a simple GET request (just fetching a page):
# url of the page that we want to scrape
# str() converts the integer page number (pageno, the counter of the outer pagination loop)
# and concatenates it to the URL for pagination purposes.
url = 'https://www.politifact.com/factchecks/list/?page=' + str(pageno)

# Use requests to fetch the URL. This is a suspicious command that might blow up.
page = requests.get(url)
Since requests.get(url) is a suspicious command that might throw an exception, we will call it inside a try-except block:
try:
    # this might throw an exception if something goes wrong
    page = requests.get(url)
# this describes what to do if an exception is thrown
except Exception as e:
    # get the exception information
    error_type, error_obj, error_info = sys.exc_info()
    # print the link that caused the problem
    print('ERROR FOR LINK:', url)
    # print error info and the line that threw the exception
    print(error_type, 'Line:', error_info.tb_lineno)
    continue  # valid because this block sits inside the outer pagination loop
We will also use an outer for loop for pagination purposes.
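For reference, a minimal sketch of that outer loop might look like the following; the range of page numbers is an assumption, so adjust it to however many pages you want to scrape:

import sys
import time
import requests

for pageno in range(1, 4):  # assumed range: pages 1 to 3
    url = 'https://www.politifact.com/factchecks/list/?page=' + str(pageno)
    try:
        page = requests.get(url)
    except Exception as e:
        error_type, error_obj, error_info = sys.exc_info()
        print('ERROR FOR LINK:', url)
        print(error_type, 'Line:', error_info.tb_lineno)
        continue  # skip this page and move on to the next one
    time.sleep(2)  # be polite: wait 2 seconds between requests
    # ... parsing of `page` goes here (Steps 4-8)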
Step-4 : Inspecting the Response Object
I. See what response code the server sent back (useful for detecting 4XX or 5XX errors).
page.status_code
Output:
The HTTP 200 OK success status response code indicates that the request has succeeded.
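If you want the script to react to non-200 responses rather than just printing the code, a small guard like this can help; it is a sketch, not part of the original script, and assumes the page and url variables from Step-3:

# Bail out early on non-200 responses (sketch)
if page.status_code != 200:
    print('Request for', url, 'failed with status code', page.status_code)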
II. Access the full response as text (get the HTML of the page as one big string):
page.text
It will return the HTML content of the response object in Unicode.
Alternative:
page.content
This will return the content of the response in bytes.
III. Look for a specific substring of text within the response.
if "Politifact" in page.text:
    print("Yes, scrape it")
IV. Check the response's Content-Type (see if you got back HTML, JSON, XML, etc.):
print (page.headers.get("content-type", "unknown"))
Step-5 : Delaying request time
Next, with the time module, we can call the sleep() function with a value of 2 seconds. Here it delays sending requests to the web server by 2 seconds.
time.sleep(2)
The sleep() function suspends execution of the current thread for a given number of seconds.
Step-6 : Extracting Content from HTML
Now that you’ve made your HTTP request and gotten some HTML content, it’s time to parse it so that you can extract the values you’re looking for.
A) Using Regular Expressions
Using regular expressions to parse HTML content is strongly discouraged.
However, regular expressions are still useful for finding specific string patterns like prices, email addresses, or phone numbers.
Run a regular expression on the response text to look for specific string patterns:
import re # put this at the top of the file
...
print(re.findall(r'\$[0-9,.]+', page.text))
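The same re.findall() approach works for the other patterns mentioned above; for example, a rough (deliberately simplified) email pattern:

print(re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', page.text))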
B) Using BeautifulSoup's soup object
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
soup = BeautifulSoup(page.text, "html.parser")
The command below will look for all the <li> tags with the specific class attribute 'o-listicle__item':
links=soup.find_all('li',attrs={'class':'o-listicle__item'})
INSPECTING WEBPAGE
To understand the above code, you need to inspect the webpage; please follow along:
1) Go to the URL listed above.
2) Press Ctrl+Shift+I to inspect the page.
3) This is how your 'Inspect' window will look. Press Ctrl+Shift+C to select an element on the page and inspect it, or use the leftmost arrow in the header of the Inspect window.
4) To locate the specific element & attribute above in the Inspect window:
- First, hover over every section of the webpage and watch the changes in your Inspect window; you will quickly grasp how the webpage works, which element is what, and what each attribute contributes to the page.
- Once you are done with that, I am assuming you can understand how the <li> element and its attribute work.
- Since I needed the news section of a particular article, I went to that article section by selecting the inspect-element option in the Inspect window. It highlights that article section on the webpage and its HTML source in the Inspect window. Voila! ✨
Were you able to locate the same tag on your machine?
If yes, you are all set to understand every HTML tag I have used in my code.
Continuing with my code: 😅
print(len(links))
This command tells you how many news articles there are on a given page, which helps you decide how far you need to paginate your loop to extract a large dataset.
Step-7 : Finding elements and attributes
- Look for all anchor tags on the page (useful if you’re building a crawler and need to find the next pages to visit)
links = soup.find_all("a")
- The next command finds a <div> tag under each <li> tag, where the <div> must contain the listed attribute value. Here 'j' is the loop variable iterating over the 'links' object, i.e., over all news articles listed on a given page.
Statement = j.find("div",attrs={'class':'m-statement__quote'})
- The text.strip() call returns the text contained within this tag and strips any extra spaces, '\n', or '\t' from the text string.
Statement = j.find("div",attrs={'class':'m-statement__quote'}).text.strip()
Voila! 🌟 We have scraped the first attribute of our dataset, i.e., the Statement 😋
- In the same division section, the next command looks for the anchor tag and returns the value of its hyperlink. Again, strip() is used to keep the values tidy so that our CSV file looks good.
Link=j.find("div",attrs={'class':'m-statement__quote'}).find('a')['href'].strip()
- To get the Date attribute, you need to inspect the webpage first, as there is extra text stored along with it. Calling the text function without indexing gives you the whole string, but we don't need any text other than the date, so I use indexing (you could also clean the attribute later with some regex). 'footer' is the element that contains the required text.
Date = j.find('div',attrs={'class':'m-statement__body'}).find('footer').text[-14:-1].strip()
- Here I have done everything the same as before, except for get(), which extracts the content of the attribute passed to it (i.e., title).
Source = j.find('div', attrs={'class':'m-statement__author'}).find('a').get('title').strip()
- For my project, I needed a dataset that had not already been altered, and I also needed to know in advance which category each of thousands of articles belongs to for my training data; no one can do that manually. On this website the articles do come with labels, but the label text is not directly retrievable because it is contained in an image. For this kind of task you can use get() to retrieve the text: here I pass 'alt' as the attribute to get(), which contains our Label text.
Label = j.find('div', attrs ={'class':'m-statement__content'}).find('img',attrs={'class':'c-image__original'}).get('alt').strip()
In the lines of code below, I have put all these concepts together and fetched the details for the five attributes of my dataset.
for j in links:
    Statement = j.find("div", attrs={'class': 'm-statement__quote'}).text.strip()
    Link = j.find("div", attrs={'class': 'm-statement__quote'}).find('a')['href'].strip()
    Date = j.find('div', attrs={'class': 'm-statement__body'}).find('footer').text[-14:-1].strip()
    Source = j.find('div', attrs={'class': 'm-statement__author'}).find('a').get('title').strip()
    Label = j.find('div', attrs={'class': 'm-statement__content'}).find('img', attrs={'class': 'c-image__original'}).get('alt').strip()
    frame.append([Statement, Link, Date, Source, Label])
upperframe.extend(frame)
Step-8 : Making the Dataset
Append the attribute values for each article to an empty list 'frame':
frame.append([Statement,Link,Date,Source,Label])
Then, for each page, extend the 'upperframe' list with this list:
upperframe.extend(frame)
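To make the structure explicit, here is a minimal sketch of how those two lists wrap around the loops from the previous steps; the page range is an assumption, so widen it for a bigger dataset:

import time
import requests
from bs4 import BeautifulSoup

upperframe = []                       # rows collected from every page

for pageno in range(1, 3):            # assumed range: pages 1 and 2
    url = 'https://www.politifact.com/factchecks/list/?page=' + str(pageno)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.find_all('li', attrs={'class': 'o-listicle__item'})

    frame = []                        # rows from the current page only
    for j in links:
        Statement = j.find('div', attrs={'class': 'm-statement__quote'}).text.strip()
        Link = j.find('div', attrs={'class': 'm-statement__quote'}).find('a')['href'].strip()
        Date = j.find('div', attrs={'class': 'm-statement__body'}).find('footer').text[-14:-1].strip()
        Source = j.find('div', attrs={'class': 'm-statement__author'}).find('a').get('title').strip()
        Label = j.find('div', attrs={'class': 'm-statement__content'}).find('img', attrs={'class': 'c-image__original'}).get('alt').strip()
        frame.append([Statement, Link, Date, Source, Label])

    upperframe.extend(frame)
    time.sleep(2)                     # polite delay between pages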
Step-9 : Visualising Dataset
If you want to visualise your data in Jupyter, you can use a pandas DataFrame to do so:
data=pd.DataFrame(upperframe, columns=['Statement','Link','Date','Source','Label'])
data.head()
Step-10 : Making a CSV file & saving it to your machine
A) Opening & writing to file
The commands below will help you write a CSV file and save it to your machine, in the same directory where your Python file is saved.
filename="NEWS.csv"
f=open(filename,"w")
headers="Statement,Link,Date,Source,Label\n"
f.write(headers)
....
This line writes each attribute to the file, replacing any ',' inside a value with '^' so that stray commas don't break the CSV columns:
f.write(Statement.replace(",","^")+","+Link+","+Date.replace(",","^")+","+Source.replace(",","^")+","+Label.replace(",","^")+"\n")
So, when you run this file from a command shell, it will create a CSV file in the directory of your .py file.
On opening it, you might see weird data if you didn't use strip() while scraping (do check the output once without strip()), and if you don't replace '^' back with ',', it will also look odd.
So replace it using these simple steps:
- Open your Excel file (the .csv file).
- Press Ctrl+H (a pop-up window will appear asking what to find & what to replace it with).
- Enter '^' in the 'Find what' field and ',' in the 'Replace with' field.
- Press Replace All.
- Click Close & woohoo! 😍 Your dataset is now in perfect shape.
Don't forget to close your file, after both for loops are done, with the following command:
f.close()
Note that running the same code again and again might throw an error once it has already created the dataset using this file-writing method.
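Putting method A together, here is a compact sketch of the whole write-out. Unlike the snippets above, it loops over the finished upperframe list built in Step-8 instead of writing inside the scraping loops, which produces the same file:

filename = "NEWS.csv"
f = open(filename, "w")
f.write("Statement,Link,Date,Source,Label\n")   # header row

for Statement, Link, Date, Source, Label in upperframe:
    # replace stray commas with '^' so they don't break the CSV columns
    f.write(Statement.replace(",", "^") + "," + Link + "," +
            Date.replace(",", "^") + "," + Source.replace(",", "^") + "," +
            Label.replace(",", "^") + "\n")

f.close()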
B) Converting the DataFrame into a CSV file using to_csv()
Instead of that lengthy method, you can opt for another one: to_csv() also converts a DataFrame into a CSV file and provides a parameter to specify the path.
path = 'C:\\Users\\Kajal\\Desktop\\KAJAL\\Project\\Datasets\\'
data.to_csv(path+'NEWS.csv')
To avoid ambiguity and keep your code portable, you can use this instead:
import os
data.to_csv(os.path.join(path,r'NEWS.csv'))
This will append the CSV name to your destination path correctly.
SUGGESTION & CONCLUSION
Although it is a bit lengthy and tacky to implement, I suggest using the first method: opening a file, writing to it, and then closing it. At least it will not give you the ambiguous data that the to_csv() method often does.
See in the image above how it extracts ambiguous data for the Statement attribute.
So, instead of spending hours cleaning your data manually, I would suggest writing the few extra lines of code from the first method.
Now, you are done with it.✌️
IMPORTANT NOTE: If you copy-paste my source code to scrape a different website and run it, it might throw an error. In fact, it will almost certainly throw an error, because every webpage's layout is different, and you need to adjust the code accordingly.
I hope you will find it useful and liked my article.😇 Please feel free to share your thoughts and hit me up with any queries you might have.😉
Full Code
This article is the first part of the web-scraping series. For those who come from non-technical backgrounds, read the second part of this series here.
You can reach me via the following:
Subscribe to my YouTube channel for video content coming soon here
Connect and reach me on LinkedIn