Building a crawler
moose
Posted on November 20, 2021
So I haven't written in a while. It's a slow day so I thought, how about we write a web crawler tutorial?
So, this code can easily be ported to plain JS, and cheerio makes a good deno_dom replacement if you're on Node.
The main thing a crawler needs is valid markup.
tl;dr
[https://github.com/salugi/smooth_crawl]
I am currently building out a Deno/Rust search engine with Meilisearch as part of a larger project.
Since this is a Deno backend, I'm taking advantage of TypeScript out of the box. It's handy, especially the ability to create interfaces that extend others.
What should a web crawler do?
- Crawl to a depth blindly
- Be rate limited in the calling
- Crawl unique links
- Parse URLs
- Handle edge cases
- probably some other stuff
We are going to make one that does all of those.
First, we will look at building a self-contained crawler that is painfully slow (but doesn't DoS a site).
Since this tutorial uses Deno (https://deno.land), you'll need it installed:
https://deno.land/#installation
Super easy.
1 . Let's start with creating the folder structure
mine:
smooth_crawl
|_ entry.ts
|_ src
   |_ models
2 . Open up the smooth_crawl directory in your IDE of choice
3 . Create entry.ts file
./smooth_crawl/entry.ts
4 . Create base objects.
In our case, the base objects will be two TS interfaces named RecordKey and HttpRecord.
Create RecordKey.ts file
./smooth_crawl/src/models/RecordKey.ts
- copy paste the following code
export interface RecordKey {
id:string,
creation_date:any,
archive_object:any
}
- Create HttpRecord.ts file
./smooth_crawl/src/models/HttpRecord.ts
- Copy paste the following into it
import { RecordKey } from "./RecordKey.ts";
export interface HttpRecord extends RecordKey {
url:URL,
response:any,
response_text:string
}
These serve as data models; they hold data and do nothing outside of that.
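To make the extension concrete, here is what a filled-in HttpRecord could look like. The values are made up for illustration and this snippet isn't part of the project; you could paste it into a scratch file next to entry.ts to try it.
import {HttpRecord} from "./src/models/HttpRecord.ts";

// hypothetical values, just to show the shape of the interface
let example : HttpRecord = {
    id: "not-a-real-uuid",                // from RecordKey
    creation_date: Date.now(),            // from RecordKey
    archive_object: {},                   // from RecordKey
    url: new URL("https://example.com/"), // added by HttpRecord
    response: {},                         // added by HttpRecord
    response_text: ""                     // added by HttpRecord
}

console.log(example.id, example.url.href)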
5 . Build http client
Deno uses the fetch API. It is worth noting that some sites render their content with client-side JS, and a puppeteer implementation would be needed to crawl those; we won't cover puppeteer in this tutorial and will rely on the fetch API to handle the HTTP requests. Below I have annotated some of the code to explain what is going on.
- Create http_client.ts file
./smooth_crawl/src/http_client.ts
- Copy and paste the following code:
// @ts-ignore
import {v4} from "https://deno.land/std/uuid/mod.ts";
import {HttpRecord} from "./models/HttpRecord.ts";
/*
returns http text (normally html)
*/
export async function get_html_text(unparsed_url:string) : Promise<string> {
    return new Promise(async function (resolve, reject) {
        // create the abort controller per call so one timeout
        // doesn't cancel every request that comes after it
        const controller = new AbortController()
        const timeoutId = setTimeout(() => controller.abort(), 5000)
        //parse url cuz symbols
        let parsed_url = new URL(unparsed_url)
        //send get
        await fetch(parsed_url.href, {signal: controller.signal}).then(function (result) {
            clearTimeout(timeoutId)
            if (result !== undefined) {
                //turn result to text.
                result.text().then(function (text) {
                    resolve(text)
                }).catch(error => {
                    console.error("get_html_text result.text errored out")
                    reject(error)
                })
            }
        }).catch(error => {
            clearTimeout(timeoutId)
            console.error("get_html_text fetch errored out")
            reject(error)
        })
    })
}
/*
returns http record
*/
export async function get_http_record(unparsed_url:string) : Promise<HttpRecord> {
    return new Promise(async function (resolve, reject) {
        // per-call abort controller, same reasoning as above
        const controller = new AbortController()
        const timeoutId = setTimeout(() => controller.abort(), 5000)
        let parsed_url = new URL(unparsed_url)
        let record : HttpRecord = {
            id: v4.generate(),
            creation_date: Date.now(),
            url: parsed_url,
            response: {},
            response_text: "",
            archive_object: {}
        }
        await fetch(record.url.href, {signal: controller.signal}).then(function (result) {
            clearTimeout(timeoutId)
            if (result !== undefined && result !== null) {
                record.response = result
                //turn result to text.
                result.text().then(function (text) {
                    if (text.length > 1) {
                        record.response_text = text
                    }
                    resolve(record)
                }).catch(error => {
                    console.error("get_http_record result.text errored out")
                    reject(error)
                })
            }
        }).catch(error => {
            clearTimeout(timeoutId)
            console.error("get_http_record fetch errored out")
            reject(error)
        })
    })
}
These two functions do very similar things, except one just returns the text of an HTTP response and the other returns an HTTP record. The text version, get_html_text(), is lighter weight and doesn't create objects unnecessarily, because it acts as the depth check.
Some pages can have, say, 50,000 links on a single page. Sounds ludicrously high, but such pages are out there, if only to make people work harder. The lightweight HTML function lets us stay as non-committal as possible while depth checking a site, so we don't blow out.
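To make the idea concrete, here is a rough sketch (not one of the project files) of how the cheap call can gate the expensive one; the 500000-character cutoff is an arbitrary number picked purely for illustration.
import {get_html_text, get_http_record} from "./src/http_client.ts";

// peek at the page cheaply first
let text = await get_html_text("https://example.com")

// only commit to building the full record if the page looks reasonable
if (text.length < 500000) {
    let record = await get_http_record("https://example.com")
    console.log("archived", record.url.href)
} else {
    console.log("page too big, staying non-committal")
}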
Now that we have the file created, we need to test it. Go back to the entry.ts file in the root of the smooth_crawl directory and copy paste this code:
import {get_html_text} from "./src/http_client.ts";
// @ts-ignore
let smoke = Deno.args
// @ts-ignore
let html_text = await get_html_text(smoke[0])
console.log(html_text)
then from the smooth_crawl directory run the command:
deno run --allow-net ./entry.ts https://example.com
If it doesn't error out, we are ready to move on to the conductor concept and parsing the HTML.
6 . Create the conductor.ts file
- Create conductor.ts file
./smooth_crawl/src/conductor.ts
- Copy paste the following in
import {get_html_text, get_http_record} from "./http_client.ts"
import {catalogue_basic_data, catalogue_links} from "./cataloguer.ts";
import {HttpRecord} from "./models/HttpRecord.ts";
const non_crawl_file = ["jpg", "pdf", "gif", "webm", "jpeg","css","js","png"]
/*
returns http record
archival objects:
link data (all links on a page parsed)
metadata (all metadata tags)
*/
export function conduct_basic_archive(unparsed_url:string) : Promise<HttpRecord> {
return new Promise<HttpRecord>(async(resolve,reject)=> {
try {
let parsed_url = new URL(unparsed_url)
let record = await get_http_record(parsed_url.href)
let archival_data : any = await catalogue_basic_data(parsed_url.origin, record.response_text)
record.archive_object.links = archival_data.link_data
record.archive_object.meta = archival_data.meta_data
resolve(record)
} catch (error) {
reject(error)
}
})
}
/*
harvests links; link_limit === max number of links to gather, page_limit === max number of pages to visit
*/
export async function conduct_link_harvest(link:string, link_limit:number, page_limit:number) : Promise<Array<string>> {
return new Promise<Array<string>>(async (resolve, reject)=>{
try {
let links = Array();
links.push(link)
for (let i = 0; i < links.length; i++) {
let url : URL = new URL(links[i])
// @ts-ignore
let text : string = await get_html_text(url.href)
let unharvested_links : Array<URL> = await catalogue_links(url.origin, text)
let harvested_links : Array<string> = await harvest_links(links, unharvested_links)
let stop : number = 0;
if (links.length + harvested_links.length > link_limit){
stop = link_limit - links.length
}else{
stop = harvested_links.length
}
for (let j = 0; j < stop; j++) {
links.push(harvested_links[j])
}
if(i >= page_limit){
break;
}
}
resolve(links)
} catch (error) {
reject(error)
}
})
}
function harvest_links(gathered_links: Array<string>, links:Array<any>) : Promise<Array<string>> {
return new Promise( (resolve, reject) => {
try {
let return_array = Array()
for (const link of links) {
let should_add = !gathered_links.includes(link.href)
let file_extension = get_url_extension(link.href)
let not_in_list = !non_crawl_file.includes(file_extension)
if (
should_add
&&
not_in_list
) {
return_array.push(link.href)
}
}
resolve(return_array)
} catch (error) {
console.error(error)
// reject so the caller's await doesn't hang forever
reject(error)
}
})
}
function get_url_extension( url: string ) {
//@ts-ignore
return url.split(/[#?]/)[0].split('.').pop().trim();
}
export async function conduct_worker_harvest(link:string, link_limit:number, page_limit:number) : Promise<Array<string>> {
return new Promise<Array<string>>(async (resolve, reject)=>{
try {
let links = Array();
let should_break = false
links.push(link)
for (let page_index = 0; page_index < links.length; page_index++) {
let url : URL = new URL(links[page_index])
// @ts-ignore
let text : string = await get_html_text(url.href)
let unharvested_links : Array<URL> = await catalogue_links(url.origin, text)
let harvested_links : Array<string> = await harvest_links(links, unharvested_links)
let stop : number = 0;
if (links.length + harvested_links.length > link_limit){
stop = link_limit - links.length
should_break = true
}else{
stop = harvested_links.length
}
if(page_index >= page_limit ){
should_break = true
}
for (let j = 0; j < stop; j++) {
links.push(harvested_links[j])
//publisher.publish_message( { url : harvested_links[j] } )
}
if(should_break){
break;
}
}
resolve(links)
} catch (error) {
reject(error)
}
})
}
This will error out for now; it imports cataloguer.ts, which we still have to add. The concept, though, is that the conductor conducts actions: a mishmash of smaller actions that build bigger things.
Example
http call -> crawl -> return object
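Once cataloguer.ts exists (we create it in the next step), that pipeline can be driven roughly like this; a sketch, not something you need to paste anywhere.
import {conduct_basic_archive} from "./src/conductor.ts";

// http call -> crawl -> return object, behind a single function
let record = await conduct_basic_archive("https://example.com")

console.log(record.archive_object.links) // every link catalogued on the page
console.log(record.archive_object.meta)  // every meta tag catalogued on the page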
on to the next step.
7 . create the cataloguer.ts file
- Create file cataloguer.ts
./smooth_crawl/src/cataloguer.ts
- Copy paste the following code into it
import {DOMParser} from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts';
import {v4} from "https://deno.land/std/uuid/mod.ts";
export async function catalogue_links(origin:string, text:string):Promise<any>{
return new Promise(function(resolve) {
try {
let link_set = Array();
if(text.length > 1) {
const document: any = new DOMParser().parseFromString(text, 'text/html');
if (document === undefined) {
//parse failed, hand back the empty link set
resolve(link_set)
} else {
let link_jagged_array = Array<Array<any>>(
document.querySelectorAll('a'),
document.querySelectorAll('link'),
document.querySelectorAll('base'),
document.querySelectorAll('area')
)
for (let i = 0; i < link_jagged_array.length; i++) {
for (let j = 0; j < link_jagged_array[i].length; j++) {
if (link_jagged_array[i][j].attributes.href !== undefined
&&
link_jagged_array[i][j].attributes.href.length > 0) {
link_set.push(link_jagged_array[i][j].attributes.href)
}
}
}
link_set = [...new Set(link_set)]
// @ts-ignore
let fully_parsed_links = link_parse(origin, link_set)
resolve(fully_parsed_links)
}
}else{
resolve(link_set)
}
}catch(error){
console.error(error)
//resolve with an empty set so the caller's await still settles
resolve([])
}
})
}
export async function catalogue_basic_data(origin:string, text:string):Promise<any>{
return new Promise(function(resolve) {
try {
let link_set = Array();
if(text.length > 1) {
const document: any = new DOMParser().parseFromString(text, 'text/html');
if (document === undefined) {
console.error("document not defined")
//resolve with empty archives so the caller's await still settles
resolve({link_data: Array(), meta_data: Array()})
} else {
let link_jagged_array = Array<Array<any>>(
document.querySelectorAll('a'),
document.querySelectorAll('link'),
document.querySelectorAll('base'),
document.querySelectorAll('area')
)
let meta_information = document.querySelectorAll('meta')
for (let i = 0; i < link_jagged_array.length; i++) {
for (let j = 0; j < link_jagged_array[i].length; j++) {
if (link_jagged_array[i][j].attributes.href !== undefined
&&
link_jagged_array[i][j].attributes.href.length > 0) {
link_set.push(link_jagged_array[i][j].attributes.href)
}
}
}
link_set = [...new Set(link_set)]
// @ts-ignore
let fully_parsed_links = link_parse(origin, link_set)
let parsed_meta_information = meta_parse(meta_information)
let archives = {
link_data:fully_parsed_links,
meta_data:parsed_meta_information
}
resolve(archives)
}
}else{
resolve(link_set)
}
}catch(error){
console.error(error)
//resolve with empty archives so the promise settles even when parsing throws
resolve({link_data: Array(), meta_data: Array()})
}
})
}
function meta_parse(a:Array<any>):Array<string>{
try {
let out = Array<any>();
for (let i = 0; i < a.length; i++) {
if (a[i].attributes.content !== undefined
&&
a[i].attributes.content !== null) {
let meta_tag = {
name: "",
content: Array()
}
if (a[i].attributes.charset !== undefined) {
meta_tag.name = "charset"
meta_tag.content.push(a[i].attributes.charset)
out.push(meta_tag)
continue
} else if (a[i].attributes.property !== undefined) {
meta_tag.name = a[i].attributes.property
meta_tag.content = a[i].attributes.content.split(",")
out.push(meta_tag)
continue
} else if (a[i].attributes["http-equiv"] !== undefined) {
meta_tag.name = a[i].attributes["http-equiv"]
meta_tag.content = a[i].attributes.content.split(",")
out.push(meta_tag)
continue
} else if (a[i].attributes.name !== undefined) {
meta_tag.name = a[i].attributes.name
meta_tag.content = a[i].attributes.content.split(",")
out.push(meta_tag)
continue
}else {
out.push({
"meta-related":a[i].attributes.content
})
}
}
}
return out
}catch(error){
console.error("meta_parse in cataloguer.ts errored out", error)
Deno.exit(2)
}
}
function meta_check(a:string):Boolean{
if (
/^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/.test(a) ||
/(tel:.*)/.test(a) ||
/(javascript:.*)/.test(a) ||
/(mailto:.*)/.test(a)
) {
return true
}else{
return false
}
}
function link_parse(domain:string, lineage_links: Array<string>):any{
try {
let c: Array<any> = new Array()
if (lineage_links.length > 1) {
for (let i = 0; i < lineage_links.length; i++) {
if (
!/\s/g.test(lineage_links[i])
&&
lineage_links[i].length > 0
) {
let test = lineage_links[i].substring(0, 4)
if (meta_check(lineage_links[i])) {
continue
} else if (/[\/]/.test(test.substring(0, 1))) {
if (/[\/]/.test(test.substring(1, 2))) {
let reparse_backslash = lineage_links[i].slice(1, lineage_links[i].length)
lineage_links[i] = reparse_backslash
}
c.push(new URL(domain + lineage_links[i]))
continue
} else if (
(/\.|#|\?|[A-Za-z0-9]/.test(test.substring(0, 1))
&&
!/(http)/.test(test))
) {
try {
//weed out potential non http protos
let url = new URL(lineage_links[i])
} catch {
let url = new URL("/" + lineage_links[i], domain)
c.push(url)
}
continue
} else if (/\\\"/.test(test)) {
let edge_case_split_tester = lineage_links[i].split(/\\\"/)
lineage_links[i] = edge_case_split_tester[0]
if (!/http/.test(lineage_links[i].substring(0, 4))) {
let url = new URL("/" + lineage_links[i], domain)
c.push(url)
continue
}
} else {
try {
let link_to_test = new URL(lineage_links[i])
let temp_url = new URL(domain)
let host_domain = temp_url.host.split(".")
let host_tester = host_domain[host_domain.length - 2] + host_domain[host_domain.length - 1]
let compare_domain = link_to_test.host.split(".")
let compare_tester = compare_domain[compare_domain.length - 2] + compare_domain[compare_domain.length - 1]
if (host_tester !== compare_tester) {
continue
}
c.push(link_to_test)
} catch (error) {
console.error(error)
}
continue
}
}
}
}
return c
}catch(err){
console.error(err)
}
}
This file uses deno_dom (https://github.com/b-fuze/deno-dom) to catalogue and find links.
It also handles parsing the different link formats into usable absolute URLs.
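If you haven't used deno_dom before, the core move the cataloguer makes boils down to something like this standalone sketch (it follows the same attributes.href access the cataloguer uses):
import {DOMParser} from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts';

// parse a small html fragment and pull the href off every anchor
const html = '<a href="/about">About</a><a href="https://example.com/blog">Blog</a>'
const doc : any = new DOMParser().parseFromString(html, 'text/html')

if (doc !== undefined && doc !== null) {
    let anchors = doc.querySelectorAll('a')
    for (let i = 0; i < anchors.length; i++) {
        console.log(anchors[i].attributes.href)
    }
}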
Next, make your entry.ts file look like this:
import {get_html_text} from "./src/http_client.ts";
import {conduct_link_harvest} from "./src/conductor.ts";
// @ts-ignore
let smoke = Deno.args
//let html_text = await get_html_text(smoke[0])
//console.log(html_text)
//smoke[0] is the link
//the other two are set checks,
//grab 100 links or crawl 20 pages, whichever comes first
let links = await conduct_link_harvest(smoke[0] ,100, 20)
console.log(links)
Again, run the following command (it may take a little time).
I am using funnyjunk as a test; feel free to use any site you want, provided it doesn't need JS to load its content.
deno run --allow-net ./entry.ts https://funnyjunk.com
if this runs, we are ready to move on to the last part of this build.
8 . create operator.ts file
- create operator.ts file
./smooth_crawl/src/operator.ts
- copy paste the following code
import {
conduct_link_harvest,
} from "./conductor.ts"
import {HttpRecord} from "./models/HttpRecord.ts";
import {conduct_basic_archive} from "./conductor.ts";
export async function operate_crawl(url:string,link_limit:number){
try{
let crawl_links = await conduct_link_harvest(url,link_limit,50)
let http_records = new Array<HttpRecord>()
for(let i = 0; i < crawl_links.length;i++){
let record = await conduct_basic_archive(crawl_links[i])
http_records.push(record)
}
return http_records
}catch(error){
console.error(error)
}
}
This code uses the conductor to do several things in a more consolidated way.
Now go back to the entry.ts file, delete its contents, and paste in the following:
import {operate_crawl} from "./src/operator.ts";
let smoke = Deno.args
try{
let limit_int = parseInt(smoke[1])
let url = new URL(smoke[0])
// @ts-ignore
let crawled_pages : Array<HttpRecord> = await operate_crawl(url.href, limit_int)
crawled_pages.forEach(element =>{
console.log(element.url.href, element.response.status)
})
} catch (err) {
console.error(err)
}
now run the command:
(again I use funnyjunk, use what you please)
NOTE: 5 is the number of pages to crawl (the link limit passed to operate_crawl)
deno run --allow-net ./entry.ts https://funnyjunk.com 5
That is the crawler. operate_crawl is a blind crawler that will take any site and go to a custom depth, returning the crawled pages as HTTP records (the interface we created first).
This is a slow way to do it: it blindly crawls unique links, returns a list of objects, and is naturally rate limited because every request is awaited one at a time.
But we can go faster in the same process and get a better archival tool.
Below we implement the bookie. The bookie serves as a way to keep track of archival records, like our HttpRecord. It has a function, book_http_record, which conducts a basic archive and could then, say, save the record to a database.
9 . Let's create our new bookie
- create file
./smooth_crawl/src/bookie.ts
- copy paste the following in
import EventEmitter from "https://deno.land/x/events/mod.ts"
import {conduct_basic_archive} from "./conductor.ts";
export let bookie_emitter = new EventEmitter()
export function book_http_record(unparsed_url : string){
(async () =>{
try {
let parsed_url = new URL(unparsed_url)
let record = await conduct_basic_archive(parsed_url.href)
console.log(record.url.href,"recorded with status",record.response.status)
}catch (error) {
let funk = "book_http_record"
console.error(funk)
console.error(error)
}
})()
}
//register the listener; we pass the function itself, no await needed
bookie_emitter.on("book_http_archive", book_http_record)
bookie_emitter is an EventEmitter from the deno.land/x/events module; emitting a "book_http_archive" event hands a url off to book_http_record.
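If the emitter pattern is new to you, here's its shape in isolation; a throwaway sketch, not one of the project files.
import EventEmitter from "https://deno.land/x/events/mod.ts"

const emitter = new EventEmitter()

// whoever is listening gets called with whatever was emitted
emitter.on("greet", (name:string) => {
    console.log("hello,", name)
})

emitter.emit("greet", "world") // prints: hello, world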
Now go back to the operator.ts file, add
import {bookie_emitter} from "./bookie.ts"
to the imports at the top, and then add the method:
export async function operate_harvest(url:string,link_limit:number,page_limit:number){
try{
// let publisher = new PublisherFactory("havest_basic_archive")
let crawl_links = await conduct_link_harvest(url,link_limit,page_limit)
for(let i = 0; i < crawl_links.length;i++){
await new Promise(resolve => setTimeout(resolve, 80))
bookie_emitter.emit("book_http_archive", crawl_links[i])
}
// publisher.close_publisher()
return crawl_links
}catch(error){
console.error(error)
}
}
That promise acts as a rate limiter in JS, a blocking pause in non-blocking IO, I know. But you'd make some people mad at you if you went zooming over their site.
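If you'd rather give that trick a name, you could pull it into a tiny helper; a sketch of the same thing, nothing in the tutorial depends on it.
// sleep: resolve after the given number of milliseconds,
// used purely to space requests out
function sleep(ms:number) : Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms))
}

// the loop body in operate_harvest would then read:
//   await sleep(80)
//   bookie_emitter.emit("book_http_archive", crawl_links[i])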
now head back to the entry file
and change its contents to this
import {operate_crawl, operate_harvest} from "./src/operator.ts";
let smoke = Deno.args
try{
let limit_int = parseInt(smoke[2])
let url = new URL(smoke[1])
let option = smoke[0].split(/-/)[1]
switch(option){
case "sc":
// @ts-ignore
let crawled_pages : Array<HttpRecord> = await operate_crawl(url.href, limit_int)
crawled_pages.forEach(element =>{
console.log("crawled", element.url.href, "with status", element.response.status)
})
break;
case "eh":
let harvested_links = await operate_harvest(url.href,limit_int,10)
// @ts-ignore
console.log("harvested", harvested_links.length, "links")
console.log(harvested_links)
break;
default:
console.error("not a valid option")
break;
}
} catch (err) {
console.error(err)
}
Notice we kept the other method: "sc" stands for slow crawl and "eh" stands for event harvest.
now run the command:
(again, I use funnyjunk as an example, pick your own)
$ deno run -A --unstable ./entry.ts -sc https://funnyjunk.com 5
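To exercise the event harvest path instead, swap the flag and pass a link limit (100 here is just an example number):
$ deno run -A --unstable ./entry.ts -eh https://funnyjunk.com 100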
If neither command errors out, we are finished.
Now, I wouldn't recommend doing thousands of links on a single site with either approach: one, because you may blow your buffer, and two, because you may make some server admin super mad.