Web Scraping Vs Web Crawling
Pranav MM
Posted on June 12, 2024
Web Scraping or Web Crawling
Search and gather, a.k.a. crawling and scraping, refers to acquiring important website data with automated bots. Web scraping is commonly used to track and analyze data and compare it against earlier versions of itself. Examples include market data, finance, e-commerce, and retail. Now you may ask: what exactly does it mean to crawl a website, and what does it mean to scrape one?
| How are they related to each other?
Suppose you have a Gmail account with no storage left (which I hope you don't) and you wish to retrieve one important file. What would you do? Rather than give up, you would start going through each file and Stalin-sort them until you reach the right one. This combined action of sifting through everything and pulling out the important data carries over directly to web pages, where the two halves are called crawling and scraping.
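To make the distinction concrete, here is a minimal sketch using only Python's standard library. The HTML snippet, the class name `PageParser`, and the example.com URLs are made up for illustration. Collecting the links is the crawling half (where to go next), and pulling out the `<h1>` text is the scraping half (the data you actually want).

```python
from html.parser import HTMLParser

# Hypothetical page content, standing in for a downloaded web page
SAMPLE_HTML = """
<html><body>
  <h1>Quarterly prices</h1>
  <a href="https://example.com/january">January</a>
  <a href="https://example.com/february">February</a>
</body></html>
"""


class PageParser(HTMLParser):
    """Collects links (crawling) and the h1 heading text (scraping)."""

    def __init__(self):
        super().__init__()
        self.links = []      # where a crawler could go next
        self.heading = ""    # the data a scraper is after
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Only keep text that appears inside the <h1> element
        if self._in_h1:
            self.heading += data.strip()


parser = PageParser()
parser.feed(SAMPLE_HTML)
print("Links to crawl next:", parser.links)
print("Scraped heading:", parser.heading)
```

A real crawler would feed each discovered link back into the same process, while a real scraper would save the extracted values somewhere.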
The Good, the Bad and the Wayback machine
Established in 1996 by Brewster Kahle and Bruce Gilliat, the Wayback Machine, run by the Internet Archive, is a warehouse of digital content that has stood the test of time. It lets users access archived versions of a website, even letting you browse a site as it looked at different points in its history. It works by sending automated web crawlers to publicly available websites and taking snapshots. It can be easily accessed and used by all at https://wayback-api.archive.org/
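If you would rather look up a snapshot from code than from the browser, the sketch below shows one possible approach. It assumes the Internet Archive's Wayback Availability endpoint at `https://archive.org/wayback/available` (this endpoint is not mentioned in the article, so treat it as an assumption and check the current documentation). The function names and the sample payload are illustrative only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Availability API (not taken from the article)
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url, timestamp=None):
    """Build the lookup URL for a page, optionally near a YYYYMMDD timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return the URL of the closest archived snapshot, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(url, timestamp=None):
    """Perform the live request (needs network access)."""
    with urlopen(build_query(url, timestamp), timeout=10) as response:
        return closest_snapshot(json.load(response))


# Illustrative payload in the shape the API is expected to return
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20060101064348/http://www.example.com:80/",
            "timestamp": "20060101064348",
            "status": "200",
        }
    }
}
print(build_query("example.com", "20060101"))
print(closest_snapshot(sample))
```

Calling `lookup("example.com", "20060101")` would run the same steps against the live service instead of the sample payload.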
What it can't store
"With large data comes big storage bills." With a seemingly infinite pile of information arriving on its doorstep, its storage capacity has grown tenfold. As of January 2024, it stores around 99 petabytes and is expected to grow by about 100 terabytes per month. Even at that scale, the Internet Archive cannot store the following:
- Dynamic Pages
- Emails
- Chats
- Databases
- Classified Military Content (Obviously)
"Talk is Cheap. Show me the Code"
-Linus Torvalds
Creating your own time capsule is easy: set up a web crawler that pries into the website and collects data at regular intervals. Building your own scraping bot is easily achievable using libraries like BeautifulSoup (for Python) and Cheerio (for JavaScript).
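As a rough idea of what "collecting data at regular intervals" can look like, here is a small standard-library sketch. The function names, the file naming scheme, and the folder layout are all assumptions made for this example, not a prescribed design. Each run saves one timestamped copy of a page, and a loop repeats that after a fixed pause.

```python
import time
from datetime import datetime, timezone
from pathlib import Path
from urllib.request import urlopen


def fetch(url):
    """Download the raw bytes of a page (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def snapshot_name(when):
    """Build a sortable file name such as snapshot-20240612T093000Z.html."""
    return "snapshot-" + when.strftime("%Y%m%dT%H%M%SZ") + ".html"


def take_snapshot(url, folder, fetcher=fetch, now=None):
    """Save one timestamped copy of the page and return the file path."""
    when = now or datetime.now(timezone.utc)
    target = Path(folder) / snapshot_name(when)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetcher(url))
    return target


def run_capsule(url, folder, interval_seconds, rounds):
    """Take `rounds` snapshots, waiting `interval_seconds` between them."""
    for i in range(rounds):
        print("Saved", take_snapshot(url, folder))
        if i < rounds - 1:
            time.sleep(interval_seconds)
```

For example, `run_capsule("https://example.com", "capsule", 3600, 24)` would store one copy per hour for a day. For anything long-running, a scheduler such as cron is usually a better fit than a sleeping loop.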
For Python Enthusiasts
| You can install the required libraries using the following pip command
pip install requests beautifulsoup4
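The crawler below utilises the requests library to download each page and BeautifulSoup to parse the returned HTML.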
| Code:
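```python
import requests
from bs4 import BeautifulSoup


def crawl_page(url):
    """Fetch a page and return the absolute http(s) links found on it."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        # Skip pages that fail to load instead of aborting the whole crawl
        return []
    soup = BeautifulSoup(response.content, "html.parser")
    links = []
    for a_tag in soup.find_all("a", href=True):
        link = a_tag["href"]
        # Keep only absolute links; relative ones are ignored here
        if link.startswith("http"):
            links.append(link)
    return links


seed_url = "https://en.wikipedia.org/wiki/Ludic_fallacy"
visited_urls = []
crawl_depth = 2


def crawl(url, depth):
    """Visit url, then follow its links until the depth budget runs out."""
    if depth == 0 or url in visited_urls:
        return
    visited_urls.append(url)
    for link in crawl_page(url):
        crawl(link, depth - 1)


crawl(seed_url, crawl_depth)
print("Crawled URLs:", visited_urls)
```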
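The crawler only follows links that already start with "http", so relative links such as `/wiki/...` are skipped. One possible refinement, sketched here with the standard library only, is to resolve every `href` against the page URL before filtering. The helper name `normalize_links`, the `same_site_only` flag, and the sample hrefs are hypothetical, chosen just for this example.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(page_url, hrefs, same_site_only=True):
    """Turn raw href values into absolute, de-duplicated http(s) URLs."""
    site = urlparse(page_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative links and drop the #fragment part
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if same_site_only and parts.netloc != site:
            continue  # stay on the site we started from
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Made-up href values of the kinds a real page tends to contain
sample_hrefs = [
    "/wiki/Black_swan_theory",
    "#History",
    "https://example.org/",
    "mailto:someone@example.org",
    "/wiki/Black_swan_theory#Background",
]
print(normalize_links("https://en.wikipedia.org/wiki/Ludic_fallacy", sample_hrefs))
```

Staying on one site keeps a depth-limited crawl from wandering across the whole web, at the cost of missing external references.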
For JavaScript Enthusiasts
| Prerequisites include libraries such as Axios and Cheerio
npm install axios cheerio
Axios makes the HTTP requests to the website, while Cheerio parses the returned HTML and lets you extract valuable information using CSS-style selectors. The extracted data can then be kept as JavaScript objects with properties or saved as JSON.
| Code:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const targetUrl = 'https://en.wikipedia.org/wiki/Ludic_fallacy';

async function scrapeData() {
  try {
    // Download the page's HTML
    const response = await axios.get(targetUrl);
    const html = response.data;

    // Load the HTML so it can be queried with CSS-style selectors
    const $ = cheerio.load(html);

    // Text of every <h1> and every <p>, each joined into one string
    const titles = $('h1').text().trim();
    const descriptions = $('p').text().trim();

    console.log('Titles:', titles);
    console.log('Descriptions:', descriptions);
  } catch (error) {
    console.error('Error scraping data:', error);
  }
}

scrapeData();
```
Be mindful of each website's terms and conditions and abide by its robots.txt to practice ethical scraping and keep yourself out of legal trouble, and have fun coding along the way.
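Checking robots.txt does not have to be manual. Python's standard library ships `urllib.robotparser`, and the sketch below shows how a crawler might consult it before fetching a URL. The robots.txt content, the user agent name `my-crawler`, and the example.com URLs are invented for this example. A real crawler would load the file from the target site instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at https://<site>/robots.txt
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_text):
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_rules(SAMPLE_ROBOTS)

# Ask before fetching: is this user agent allowed to request this URL?
print(rules.can_fetch("my-crawler", "https://example.com/blog/post"))
print(rules.can_fetch("my-crawler", "https://example.com/private/data"))

# Seconds the site asks crawlers to wait between requests, if it says so
print(rules.crawl_delay("my-crawler"))
```

To read a live file instead, `RobotFileParser` also offers `set_url()` followed by `read()`. Note that robots.txt covers crawling etiquette only; the site's terms and conditions still apply separately.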