moose
Posted on January 3, 2022
Waxy is the crawler portion, and it's still very much in its infancy. Here's where it's at:
https://github.com/salugi/waxy
https://crates.io/crates/waxy
What does it do?
Well, it crawls. It can blindly crawl any site (well, every site I've tested) to a blind depth and provide statistics on the site. It can also crawl only specific URLs that match a regex. You can crawl 100 pages, 1,000, or 27.
With the pages it can parse specific tags and grab their text to better index a site, and it provides hashmaps of metadata. Currently I am working on a wrapper around scraper (https://crates.io/crates/scraper) for parsing selectors.
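For context, a raw call into scraper looks roughly like this. This is a minimal sketch of the underlying crate itself, not the waxy wrapper (which isn't finished yet), and the HTML snippet is just a stand-in:

use scraper::{Html, Selector};

fn main() {
    // stand-in for a page body fetched by the crawler
    let body = "<html><body><h2>First heading</h2><h2>Second heading</h2></body></html>";

    // parse the document and pull the text out of every <h2>
    let document = Html::parse_document(body);
    let selector = Selector::parse("h2").expect("valid CSS selector");

    for element in document.select(&selector) {
        let text: String = element.text().collect();
        println!("{}", text);
    }
}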
Pages equal records, and records get pressed by crawling pages. I made a design decision to store the pages alone and just change the way they get parsed. I also deal mainly in strings: string hashsets and hashmaps, string inputs and outputs. I convert data to strings because I've found it easier to translate when dealing with higher-level abstractions. Something to consider is that this is not a fast process; producing records sequentially takes time.
I have provided functionality to produce multiple records at once so it can run as a dedicated worker thread or server for crawling. It's not meant to be a speedy process: frequent reindexing and a fast crawl rate will likely trip alarms on the sites you want to crawl. Hook it up as a producer over a socket and you will likely be able to go real quick, but that's not a fruitful process, and it isn't consolidated into a single function.
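As a rough sketch of that producer idea, here is one way it could be wired up with a tokio channel. This assumes press_records returns a Vec of records, as the examples further down suggest; the channel and the consumer loop are my own illustration, not part of waxy:

use tokio::sync::mpsc;
use waxy::pressers::HtmlPresser;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel(32);

    // producer task dedicated to crawling; presses records sequentially
    tokio::spawn(async move {
        if let Ok(records) = HtmlPresser::press_records(vec!["https://example.com"]).await {
            for record in records {
                if tx.send(record).await.is_err() {
                    break; // consumer went away
                }
            }
        }
    });

    // consumer side: index, archive, or forward the records over a socket
    while let Some(record) = rx.recv().await {
        println!("{:?}", record.headers);
    }
}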
This is my first crate, and it has been an interesting experience. I've learned a lot about planning, and about versioning: when to bump and when not to.
Crates.io is a really slick system, and I fully intend to make use of all the free functionality that comes with the platform. This is just the crawler, and I'm realizing what a large endeavor the whole thing is.
As for it being part of a community-driven search engine: I'm going to attempt to make this part of a FOSS suite that people and communities can use to build search engines that index, retrieve, and archive based on what a community finds important, not what Google finds important.
I'm toying with ideas for the oddjob server, as well as the indexing engine and the algorithms. There's a lot of reading and a lot of failure. Tiring.
But it's nice to have something published.
I intend to incorporate headless Chrome at some point, because, unfortunately, JS.
Here's the rundown.

Cargo.toml

[dependencies]
waxy = "0.1.1"
tokio = { version = "1", features = ["full"] }
main.rs
use waxy::pressers::HtmlPresser;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Wax worker

    /*
    create a single record from a url
    */
    match HtmlPresser::press_record("https://example.com").await {
        Ok(res) => {
            println!("{:?}", res);
        },
        Err(..) => {
            println!("went bad")
        }
    }

    println!();
    println!("----------------------");
    println!();

    /*
    crawl a vector of urls for a vector of documents
    */
    match HtmlPresser::press_records(vec!["https://example.com"]).await {
        Ok(res) => {
            println!("{:?}", res.len());
        },
        Err(..) => {
            println!("went bad")
        }
    }

    println!();
    println!("----------------------");
    println!();

    /*
    crawl a domain; the "1" is the limit of pages you are willing to crawl.
    This is a slow process and returns a vector of HtmlRecords.
    */
    match HtmlPresser::press_records_blind("https://funnyjunk.com", 1).await {
        Ok(res) => {
            println!("{:?}", res.len());
        },
        Err(..) => {
            println!("went bad")
        }
    }

    /*
    blindly crawl a domain for links.
    inputs:
        url of the site
        link limit: the number of links you are willing to grab
        page limit: the number of pages to crawl for links
    This only collects urls; the use case is getting a blind depth or a blind crawl of the site.
    */
    match HtmlPresser::press_urls("https://example.com", 1, 1).await {
        Ok(res) => {
            println!("{:?}", res.len());
        },
        Err(..) => {
            println!("went bad")
        }
    }

    println!();
    println!("----------------------");
    println!();

    /*
    blindly crawl a domain for links that match a pattern.
    inputs:
        url of the site
        pattern the url should match
        link limit: the number of links you are willing to grab
        page limit: the number of pages to crawl for links
    This is useful when you want to crawl a thread. It is NOT domain-name specific.
    */
    match HtmlPresser::press_curated_urls("https://example.com", ".", 1, 1).await {
        Ok(res) => {
            println!("{:?}", res);
        },
        Err(..) => {
            println!("went bad")
        }
    }

    println!();
    println!("----------------------");
    println!();

    /*
    blindly crawl a domain for documents whose urls match a pattern.
    inputs:
        url of the site
        pattern the url should match
        page limit: the number of pages to crawl for links
    Same as the url version above, but returns records.
    */
    match HtmlPresser::press_curated_records("https://example.com", ".", 1).await {
        Ok(res) => {
            println!("{:?}", res);
        },
        Err(..) => {
            println!("went bad")
        }
    }

    println!();
    println!("----------------------");
    println!();

    /*
    section illustrating the way records parse themselves.
    */

    // get doc
    let record = HtmlPresser::press_record("https://funnyjunk.com").await.unwrap();

    // get anchors
    println!("{:?}", record.anchors());
    println!();
    println!("{:?}", record.anchors_curate("."));
    println!();
    println!("{:?}", record.domain_anchors());
    println!();

    // call headers
    println!("{:?}", record.headers);
    println!();

    // call meta data
    println!("{:?}", record.meta_data());
    println!();

    // tag html and tag text
    println!("{:?}", record.tag_html("title"));
    println!();
    println!("{:?}", record.tag_text("div"));
    println!();
    println!();

    Ok(())
}