What is a Web Crawler
ZeeshanAli-0704
Posted on December 14, 2023
Step-1: What is a Web Crawler?
A web crawler is a software program which browses the World Wide Web in a methodical and automated manner.
It collects documents by recursively fetching links from a set of starting pages.
Many sites, particularly search engines, use web crawling as a means of providing up-to-date data.
Search engines download pages and build an index over them so that searches can be performed quickly.
Some other uses of web crawlers are:
To test web pages and links for valid syntax and structure.
To monitor sites to see when their structure or contents change.
To maintain mirror sites for popular Web sites.
To search for copyright infringements.
To build a special-purpose index, e.g., one that has some understanding of the content stored in multimedia files on the Web.
Step-2: Requirements and Goals of the System
Let’s assume we need to crawl all the web.
Scalability: Our service needs to be scalable such that it can crawl the entire Web, and can be used to fetch hundreds of millions of Web documents.
Extensibility: Our service should be designed in a modular way, with the expectation that new functionality will be added to it. There could be newer document types that need to be downloaded and processed in the future.
Step-3: Some Design Considerations
Is it a crawler for HTML pages only, or should we also fetch and store other types of media, i.e., sound files, images, videos, etc.?
This is important because the answer can change the design.
If we have to write a general-purpose crawler to download different media types, we might have to break down the parsing module into different sets of modules:
one for HTML
another for images
another for videos
Here each module extracts what is considered interesting for that media type.
Let’s assume for now that our crawler is going to deal with HTML only, but it should be extensible and make it easy to add support for new media types.
What protocols are we looking at? HTTP? What about FTP links? What different protocols should our crawler handle?
For simplicity, we will assume HTTP.
Again, we need to keep in mind that it shouldn’t be hard to extend the design to use FTP and other protocols later.
What is the expected number of pages we will crawl? How big will the URL database become?
Assuming we need to crawl one billion websites.
Since a website can contain many, many URLs, let’s assume we have 15 billion different web pages that will be reached by our crawler.
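To get a rough sense of how big the URL database might become, here is a back-of-the-envelope sketch in Python; the 512-byte average record size (URL string plus metadata) is an assumption for illustration, not a figure given in the requirements.

```python
# Back-of-the-envelope estimate of the URL database size.
# Assumption (not from the requirements): each stored URL plus its
# metadata (status, checksums, timestamps) averages ~512 bytes.
TOTAL_PAGES = 15_000_000_000          # 15 billion pages reached by the crawler
AVG_BYTES_PER_URL_RECORD = 512        # assumed average record size

total_bytes = TOTAL_PAGES * AVG_BYTES_PER_URL_RECORD
print(f"URL database size ~= {total_bytes / 1e12:.1f} TB")  # ~= 7.7 TB
```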
Step-5: High Level Design:
Basic Algorithm
The basic algorithm executed by any Web crawler takes a list of seed URLs as its input and then repeatedly executes the following steps (a minimal sketch in code follows the list):
Pick a URL from the unvisited URL list.
Determine the IP address of its hostname.
Establish a connection to the host and download the corresponding document.
Parse the document contents to look for new URLs.
Add the new URLs to the list of unvisited URLs.
Process the downloaded document, e.g., store it or index its contents, etc.
Go back to step-1.
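A minimal single-threaded sketch of this loop in Python, using only the standard library; the extract_links helper and its regex-based link extraction are illustrative placeholders, and a production crawler would add the politeness, dedupe, and robots.txt handling described in the components below.

```python
import re
import urllib.request
from urllib.parse import urljoin
from collections import deque

def extract_links(html: str, base_url: str) -> list[str]:
    # Placeholder: a real crawler would use a proper HTML parser here.
    return [urljoin(base_url, href)
            for href in re.findall(r'href="([^"]+)"', html)]

def process(url: str, html: str) -> None:
    print(url, len(html))                        # placeholder for store/index step

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)                  # unvisited URL list (FIFO)
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()                 # 1. pick a URL
        if url in visited:
            continue
        try:
            # 2-3. DNS resolution and the connection happen inside urlopen
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue
        visited.add(url)
        for link in extract_links(html, url):    # 4. parse for new URLs
            if link not in visited:
                frontier.append(link)            # 5. add to unvisited list
        process(url, html)                       # 6. store or index the document
```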
A bare minimum crawler needs at least these components:
URL frontier: To store the list of URLs to download and also prioritize which URLs should be crawled first.
HTTP Fetcher: To retrieve a web page from the server.
Extractor: To extract links from HTML documents.
Duplicate Eliminator: To make sure the same content is not extracted twice unintentionally.
Datastore: To store retrieved pages, URLs, and other metadata.
1. URL Frontier:
The URL frontier is the data structure that contains all the URLs that remain to be downloaded.
We can crawl by performing a breadth-first traversal of the Web, starting from the pages in the seed set.
Such traversals are easily implemented by using a FIFO queue.
Since we’ll have a huge list of URLs to crawl, we can distribute our URL frontier across multiple servers.
Let’s assume on each server we have multiple worker threads performing the crawling tasks.
Let’s also assume that our hash function maps each URL to a server which will be responsible for crawling it.
The following politeness requirements must be kept in mind while designing a distributed URL frontier:
Our crawler should not overload a server by downloading a lot of pages from it.
We should not have multiple machines connecting to the same web server at the same time.
To implement this politeness constraint, our crawler can have a collection of distinct FIFO sub-queues on each server.
Each worker thread will have its separate sub-queue, from which it removes URLs for crawling.
When a new URL needs to be added, the FIFO sub-queue in which it is placed will be determined by the URL’s canonical hostname.
Our hash function can map each hostname to a thread number.
Together, these two points imply that at most one worker thread will download documents from a given Web server at a time, and since each sub-queue is drained in FIFO order, it will also not overload that Web server.
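A minimal sketch of such a frontier in Python, assuming a fixed number of worker threads per frontier server; the class and variable names (Frontier, NUM_WORKERS) are illustrative, not part of the original design.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit

NUM_WORKERS = 8  # assumed number of worker threads on this frontier server

class Frontier:
    """URL frontier with one FIFO sub-queue per worker thread.

    All URLs of a given hostname hash to the same sub-queue, so at most
    one worker thread ever downloads from that host.
    """

    def __init__(self, num_workers: int = NUM_WORKERS):
        self.sub_queues = [deque() for _ in range(num_workers)]

    def _queue_for(self, url: str) -> deque:
        host = urlsplit(url).hostname or ""
        digest = hashlib.md5(host.encode()).hexdigest()
        return self.sub_queues[int(digest, 16) % len(self.sub_queues)]

    def add(self, url: str) -> None:
        self._queue_for(url).append(url)

    def next_url(self, worker_id: int):
        queue = self.sub_queues[worker_id]
        return queue.popleft() if queue else None
```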
2. Fetcher Module:
The fetcher module downloads the document corresponding to a given URL using the appropriate network protocol, like HTTP.
As discussed above, webmasters create robots.txt files to make certain parts of their websites off-limits to the crawler.
To avoid downloading this file on every request, our crawler’s HTTP protocol module can maintain a fixed-sized cache mapping host-names to their robots exclusion rules.
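A small sketch of such a cache using Python's standard urllib.robotparser; the cache size and the simple least-recently-used eviction are illustrative assumptions.

```python
from collections import OrderedDict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Fixed-size cache mapping hosts to their robots exclusion rules."""

    def __init__(self, max_entries: int = 10_000):  # assumed cache size
        self.max_entries = max_entries
        self._cache: OrderedDict[str, RobotFileParser] = OrderedDict()

    def allowed(self, url: str, user_agent: str = "*") -> bool:
        parts = urlsplit(url)
        host = f"{parts.scheme}://{parts.netloc}"
        parser = self._cache.get(host)
        if parser is None:
            parser = RobotFileParser(host + "/robots.txt")
            try:
                parser.read()                     # fetch robots.txt once per host
            except OSError:
                return True                       # assume allowed if unreachable
            if len(self._cache) >= self.max_entries:
                self._cache.popitem(last=False)   # evict the oldest entry
            self._cache[host] = parser
        else:
            self._cache.move_to_end(host)         # keep recently used hosts
        return parser.can_fetch(user_agent, url)
```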
3. Document Input Stream (DIS):
Our crawler’s design enables the same document to be processed by multiple processing modules.
To avoid downloading a document multiple times, we cache the document locally using an abstraction called a Document Input Stream (DIS).
A DIS is an input stream that caches the entire contents of the document read from the internet.
It also provides methods to re-read the document.
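A minimal sketch of this abstraction in Python, backed by an in-memory buffer; the class and method names are illustrative.

```python
import io

class DocumentInputStream:
    """Caches the full contents of a downloaded document and allows re-reads."""

    def __init__(self, content: bytes):
        self._content = content
        self._buffer = io.BytesIO(content)

    def read(self, size: int = -1) -> bytes:
        return self._buffer.read(size)

    def rewind(self) -> None:
        # Lets another processing module re-read the same document
        # without fetching it from the network again.
        self._buffer.seek(0)

    @property
    def content(self) -> bytes:
        return self._content
```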
4. Document Dedupe Test:
Many documents on the Web are available under multiple, different URLs.
There are also many cases in which documents are mirrored on various servers.
Both of these effects will cause any Web crawler to download the same document contents multiple times.
To detect such duplicates, the crawler can compute a checksum of every downloaded document and look it up in a store of checksums seen so far. We can use MD5 or SHA to calculate these checksums.
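A minimal sketch of the dedupe test, assuming an in-memory set of checksums; a real system would keep the checksum store on disk or in a distributed key-value store.

```python
import hashlib

class DocumentDedupe:
    """Tracks checksums of document contents seen so far."""

    def __init__(self):
        self._seen: set[str] = set()

    def is_duplicate(self, content: bytes) -> bool:
        checksum = hashlib.sha256(content).hexdigest()
        if checksum in self._seen:
            return True               # identical content already processed
        self._seen.add(checksum)
        return False
```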
5. URL Filters:
The URL filtering mechanism provides a customizable way to control the set of URLs that are downloaded.
This is used to blacklist websites so that our crawler can ignore them.
Before adding each URL to the frontier, the worker thread consults the user-supplied URL filter.
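A minimal sketch of such a user-supplied filter, assuming the policy is just a domain blocklist plus an HTTP-only check; the blocked domain in the usage example is hypothetical.

```python
from urllib.parse import urlsplit

class UrlFilter:
    """User-supplied filter consulted before a URL enters the frontier."""

    def __init__(self, blocked_domains: set[str]):
        self.blocked_domains = blocked_domains

    def should_crawl(self, url: str) -> bool:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):   # HTML-over-HTTP only for now
            return False
        host = parts.hostname or ""
        return not any(host == d or host.endswith("." + d)
                       for d in self.blocked_domains)

# Usage: ignore everything under example.com (hypothetical blocklist entry)
url_filter = UrlFilter({"example.com"})
assert url_filter.should_crawl("https://example.org/") is True
assert url_filter.should_crawl("http://sub.example.com/page") is False
```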
6. Domain Name Resolution (DNS):
Before contacting a Web server, a Web crawler must use DNS to map the Web server’s hostname into an IP address.
DNS name resolution will be a big bottleneck for our crawler, given the number of URLs we will be working with.
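One common mitigation, not spelled out above, is to cache resolved addresses locally so each hostname is resolved only once; here is a minimal sketch using Python's socket module.

```python
import socket

class DnsCache:
    """Caches hostname-to-IP lookups so each host is resolved only once."""

    def __init__(self):
        self._cache: dict[str, str] = {}

    def resolve(self, hostname: str) -> str:
        ip = self._cache.get(hostname)
        if ip is None:
            ip = socket.gethostbyname(hostname)   # blocking DNS lookup
            self._cache[hostname] = ip            # a real cache would honor TTLs
        return ip
```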
7. URL Dedupe Test:
While extracting links, any Web crawler will encounter multiple links to the same document. To avoid downloading and processing a document multiple times, a URL dedupe test must be performed on each extracted link before adding it to the URL frontier.
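A minimal sketch of the URL dedupe test, assuming an in-memory set of fixed-size URL fingerprints; as with the document checksums, a large crawl would keep this store in a distributed key-value store.

```python
import hashlib

class UrlDedupe:
    """Remembers fingerprints of URLs already added to the frontier."""

    def __init__(self):
        self._seen: set[bytes] = set()

    def seen_before(self, url: str) -> bool:
        key = hashlib.md5(url.encode()).digest()  # 16-byte fingerprint per URL
        if key in self._seen:
            return True
        self._seen.add(key)
        return False
```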