Web Scraping With Python

Suppose you want data about a company's products, say the prices of all its commodities in a comma-separated values (CSV) file, or photos from a social media site. What will you do?
You could copy the information from the site and paste it into your own file. But what if you need a huge amount of information from the site as quickly as possible, such as large amounts of data to train a machine learning algorithm?
In that case, copy and paste will not work, and you will need web scraping. Web scraping uses automated methods to collect thousands or even millions of data points in a short amount of time.

What is Web Scraping?

Web scraping is a means of extracting vast volumes of data from websites in an automated manner. The majority of this data is unstructured HTML data that is converted to structured data in a spreadsheet or database before being used in various applications.
There are several ways to gather data from websites. The options include using online services, using site-specific APIs, and writing your own web scraping code from scratch. Many large websites, such as Google, Twitter, Facebook, and StackOverflow, provide APIs that let you access their data in a structured fashion.

Applications of web scraping

Market research
Price monitoring
News monitoring
Email marketing
Sentiment Analysis

Prerequisites

Python

Why Python 🤔? It is one of the most popular languages for web scraping because it handles most of the process easily. It also has a variety of libraries created specifically for web scraping, such as Scrapy and Beautiful Soup.

So let's start 😀😀😀💪💪

1. Installing Python.

Install Python 3 and the venv module, then create a virtual environment.

Install Python 3 first by running the following command in the terminal:

$ sudo apt install python3

Then install the virtual environment module. In the terminal, type:

$ sudo apt install python3-venv

After installing Python and the venv module, create a project folder, create a virtual environment inside it, then activate that environment.

Create project folder:

mkdir web_scrap

Then move into the web_scrap directory:

cd web_scrap

Create the virtual environment (python3-venv provides the venv module, so we call it through python3):

python3 -m venv env

Activate the virtual environment:

. env/bin/activate

These are the basic steps to set up our coding environment.

2. Create a Python file.

Create a Python file named scrap.py and open it in Visual Studio or your favorite text editor.

3. Import packages.

Install the packages in the virtual environment:

pip install requests

pip install bs4

pip install termcolor

The Python modules that we will be using:

re - regular expressions.
requests - to fetch data directly from Instagram.
BeautifulSoup - to pick specific, filtered parts out of the page data.
urllib - to download a file from a URL with urllib.request.
os - to store the downloaded file in our download folder.
termcolor - to color the output printed in the terminal.
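
Before moving on, it can help to confirm that the packages installed correctly. The sketch below is one way to do that using only the standard library. The helper name missing_packages is made up for this example and is not part of the original script.

```python
import importlib.util


def missing_packages(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Top-level module names provided by the three pip installs above.
print(missing_packages(["requests", "bs4", "termcolor"]))
```

An empty list means everything is in place. Any name that is printed still needs a pip install inside the activated virtual environment.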

4. Get website link.

Let's add a simple prompt that accepts any URL as input:

url = input("Enter your Instagram URL: ")

Paste any URL from Instagram, then fetch the page at that URL using requests:

data = requests.get(url)

You can print the data and check the results.

print(data)
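
Note that printing the response object only shows its status, for example <Response [200]>. The HTML itself lives in data.text. Here is a minimal sketch of a helper that hands back the HTML only when the request succeeded. The name page_text is hypothetical and not part of the original script.

```python
def page_text(response):
    """Return the HTML text of a response, or "" if the request failed.

    `response` is expected to look like the object returned by
    requests.get, with `status_code` and `text` attributes.
    """
    if response.status_code != 200:
        return ""
    return response.text


# Usage with the variables from this article:
# data = requests.get(url)
# html = page_text(data)
```

Checking the status code first means the later steps do not run a regular expression over an error page.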

Now let's take the case of a video.

https://www.instagram.com/p/B_wH2aCnyEh/?utm_medium=copy_link

Open this page, which contains the video, and view its source code. If you search the source (Ctrl + F) for 'mp4', you will find the link that contains the mp4. That link is the main thing we need:

"https://instagram.fnbo9-1.fna.fbcdn.net/v/t50.2886-16/95332972_323221645317471_817729865566514230_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InZ0c192b2RfdXJsZ2VuLjQ4MC5mZWVkLmRlZmF1bHQiLCJxZV9ncm91cHMiOiJbXCJpZ193ZWJfZGVsaXZlcnlfdnRzX290ZlwiXSJ9\u0026_nc_ht=instagram.fnbo9-1.fna.fbcdn.net\u0026_nc_cat=103\u0026_nc_ohc=Q1fkDGBA2oEAX9xsGin\u0026edm=AABBvjUBAAAA\u0026vs=18035297806253182_2714272676\u0026_nc_vs=HBksFQAYJEdHeXFyZ1ZmVlZybjl5VUJBRGJzVWUtNktGa0xia1lMQUFBRhUAAsgBABUAGCRHSFlhdkFWNG9oRUFsSEFHQVAwaFlDdDdtOVl0YmtZTEFBQUYVAgLIAQAoABgAGwGIB3VzZV9vaWwBMBUAACb8yIvTv8CJQBUCKAJDMywXQCbul41P3zsYEmRhc2hfYmFzZWxpbmVfMV92MREAdeoHAA%3D%3D\u0026ccb=7-4\u0026oe=621DCC10\u0026oh=00_AT_7jbU74b8Fm9-U5y6GQhURJihmzKNI_AEvVNjI4e-Blw\u0026_nc_sid=83d603"

Because of Instagram's terms of use, use the link below for the video instead:

https://www.w3schools.com/html/movie.mp4

match = re.findall(r'url\W\W\W([-\W\w]+)\W\W\Wvideo_view_count', data.text)

The code above finds the video URL in the page source each time we run it.

To download the video, we declare a variable named extraction and store the video's file extension in it, as shown below.

extraction = ".mp4"

Do the same for an image, but search for profile_pic_url:

"https://instagram.fnbo9-1.fna.fbcdn.net/v/t51.2885-19/274607143_1204294113308064_418123174948225933_n.jpg?stp=dst-jpg_s150x150\u0026_nc_ht=instagram.fnbo9-1.fna.fbcdn.net\u0026_nc_cat=100\u0026_nc_ohc=L3oR46dvCW0AX-fS68k\u0026edm=AABBvjUBAAAA\u0026ccb=7-4\u0026oh=00_AT_7whkb_tXXNikAlnrI8yBifCb9zDwZK0Zt5q462q93Vw\u0026oe=6222855B\u0026_nc_sid=83d603"

For the image link, use:

https://www.w3schools.com/html/pic_trulli.jpg

match = re.findall(r'profile_pic_url\W\W\W([\W\w]+)\W\W\Wdisplay_resources', data.text)

Now the value of our extraction variable is:

extraction = ".jpg"

The last line of this step stores the actual video or image URL in a variable. re.findall returns a list of matches, so we take the first one as a string (or an empty string if nothing matched):

res = match[0] if match else ""
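
To see the matching step in isolation, here is a small sketch that runs the video regular expression on a made-up fragment shaped like the page source above. The sample string and the function name extract_video_url are assumptions for illustration. The sketch also swaps the \u0026 escapes seen in the links above back to &, which the original code does not do.

```python
import re

# Hypothetical, shortened fragment in the shape of the page source.
sample_source = (
    '"video_url":"https://www.w3schools.com/html/movie.mp4?a=1\\u0026b=2",'
    '"video_view_count":10'
)


def extract_video_url(source):
    """Return the first video URL found in source, or "" if none is found."""
    match = re.findall(r'url\W\W\W([-\W\w]+)\W\W\Wvideo_view_count', source)
    if not match:
        return ""
    # The page source escapes "&" as the six characters \u0026.
    return match[0].replace("\\u0026", "&")


print(extract_video_url(sample_source))
```

Returning an empty string when nothing matches keeps the later check on res simple.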

5. Data extraction.

Here we get the caption of the post and download the file.

We will use BeautifulSoup to get the caption or title of the post. We pass the page's HTML to BeautifulSoup and filter it.

page = BeautifulSoup(data.text, "html.parser")
title = page.find("title")
title = title.get_text()

This code finds the title of the page and stores it in the title variable.
After this, we use a regular expression to turn the title into a safe file name that points into our download folder.

title = re.sub(r"\W+", "_", title)
title = "download/web_scrap" + title + "web_scrap"
print("\n" + title)

We use download/ because we want to store our downloaded file in a new folder called download/.
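
The naming step can be tried on its own. The sketch below wraps the same substitution in a function. The name build_file_name and the sample title are made up for this example.

```python
import re


def build_file_name(title, extraction, folder="download"):
    """Turn a page title into a file path inside folder."""
    # Replace every run of non-word characters with one underscore.
    safe_title = re.sub(r"\W+", "_", title)
    return folder + "/web_scrap" + safe_title + "web_scrap" + extraction


print(build_file_name("My Post! #1", ".mp4"))
# prints: download/web_scrapMy_Post_1web_scrap.mp4
```

Replacing spaces and punctuation this way avoids characters that are awkward or invalid in file names.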

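# Assumed import: DFU is an alias for urllib.request (import urllib.request as DFU)
if res != "":
    # Show the link that was found, in bold green
    print("found \n \n" + "\033[1m" + colored(res, "green") + "\033[0m" + "\n")
    download = input("Do you want to download (y/N): ")
    if download == "y" or download == "Y":
        try:
            fileName = title
            print("Downloading.....")
            os.makedirs("download", exist_ok=True)  # the folder must exist first
            DFU.urlretrieve(res, fileName + extraction)
            print("Download successful!")
            os.system("tree download")  # list the contents of the download folder
        except Exception:
            print("Sorry! Download unsuccessful")
else:
    print("did not find or post is from private account")
    exit()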

If the res variable is not empty, the script prints the actual link of the post. It then asks whether you want to download the file; answer with y or n. If the answer is Y or y, it continues with the download:

if download == "y" or download == "Y":
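
One possible way to make that check a little more forgiving is to normalize the answer before comparing it. This is only a sketch, and wants_download is a hypothetical helper rather than part of the original script.

```python
def wants_download(answer):
    """Return True only for y or Y, ignoring surrounding spaces."""
    return answer.strip().lower() == "y"


# Usage with the prompt from this article:
# download = input("Do you want to download (y/N): ")
# if wants_download(download):
#     ...
```

Anything other than y or Y, including just pressing Enter, counts as a no, which matches the (y/N) hint in the prompt.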

That's all it takes to download an image and a video from Instagram.

Get the source code here

Thank you for taking the time to go through this article.

FRANCIS ODERO

KEEP MOVING ON 💪💪💪💪💪💪

HAPPY CODING