Wow! Scraping Wikipedia content With 10 line of code!

powerexploit

powerexploit

Posted on October 5, 2019

Wow! Scraping Wikipedia content With 10 line of code!

"Hackers loves to use scraping to harvest data.~Ankit Dobhal"
original blog is here - >blog

Welcome to My Blog

Hello my Computer Geek Friend!!This is a blog about scraping wikipedia content using python & bs4(python module),So what is exactly web scraping & from where this term comes from?Let's Try To Understand!!
Web Scraping - :
Web scraping is data scraping process used for extracting data from websites.Web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.It comes when world wide web born.Most of time search engine like google uses crawling process in their search result.

Scraping With Python - :
Web scraping & crawling can be done with the help of some softwares but in Nowadays Python is gaining its popularty in the field of web scraping & crawling ,& as we all know python is one of the most famous & powerful scripting languages generally for hackers & shell coders.Python have some amazing & powerful modules & libraries which makes this scraping process so easy & useful,Their are two important modules in python one is requests & another is BeautifulSoup.

Let's Write Python Script to scrape wikipedia content or wikpedia searcher:

I have a basic understanding how to do get request to websites using python , so first of all I open up my vscode editor and create file name as wikipy.py.Then import sys library(command line argument), requests library(for downloading & get method to wikipedia), & my favorite library BeautifulSoup as bs4 (To extact content from wikipedia page).
Alt Text
Now its time to use get method to requests data from wikipedia server , but wait I want to create a wikipedia searcher which will scrape the data according to my command line argument.So let's create a variable name as res to store get method to wikipedia search url & add it with my command line argument.
Alt Text
note: I uses raise_for_status() method if their is any error code and status code comes so this method will raise that & whole script will terminate.
Alt Text
res download the whole page but it is complicating to extract data from the page bacuase it shows the html format data , so now this is time to use BeautifulSoup to extract data. So I am creating a variable name as wiki to extract data.
Alt Text
note: As you can in wiki variable I uses Beautiful Soup function with two parameters ,So what they are exactly? let's understand. res.text is a text format of the page which is downloaded with the help of res variable & html.parser is a parser which will help me to structure the data into html format.

I want to scrape the p tag content according to command line argument because the whole text content of Wikipedia page is inside the p tag you can check this with the help of developer tools of chrome & Firefox.
Alt Text
Now I am using .select() function to select p tag & for loop to looping throgh it ,then finally printing the text elements imside p tag with.getText() function.
Alt Text

Yeah we did it in just 10 line of code bravo!!!
Its time to run the script with command line argument >>
Alt Text

Thankyou all for visting my blog you can also check my gist for wikipy script the link is below!!
wiki.py
follow me on github & linkedlin for more exciting blogs and scripts!

This blog is basically quoted from my blog website visit original blog->
https://ankitdobhal.github.io/posts/2019/10/Scraping%20Wikipedia%20With%20Python/

💖 💪 🙅 🚩
powerexploit
powerexploit

Posted on October 5, 2019

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related