Al
Posted on December 19, 2023
Note
This probably won't work on Windows. It hasn't been tested there, and because of how it builds paths with forward slashes, it will likely break on Windows.
tldr: code
Prereqs: having Rust installed
This is a tutorial for building a basic archiver with Reqwest. If you want to handle JS-heavy sites, I suggest you use Fantoccini (jonhoo is a beast). Granted, a lot of sites don't like bots, so they take plenty of anti-bot measures. That said, I've heard of people doing good things with Selenium. If you go that route you'd want to change things like user agents, browser options, and so on. In this code we'll only be changing the user agent within reqwest.
Note
The archiver uses the URL's path and builds directories from it. So
rustlang.org/rust/language
will translate to
rustlang.org/rust/language/date/index.html. The archive directory will also contain an images folder, a css folder, and a js folder.
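For example, with a hypothetical timestamp, an archive of rustlang.org/rust/language would be laid out roughly like this:
<base path>/rustlang.org/rust/language/19-12-2023-14:30:05/
    index.html
    images/
    css/
    js/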
So let's start setting up the project.
Run
cargo new <Project Name>
replacing <Project Name> with whatever you want to call it.
Now in the Cargo.toml file, add the following dependencies:
[dependencies]
url = "2.5.0"
reqwest = "0.11.23"
bytes = "1.5"
dirs = "5.0.1"
regex = "1.10.2"
scraper = "0.18.1"
lazy_static = "1.4.0"
chrono = "0.4.31"
rand = "0.8.5"
image = "0.24.6"
tokio = { version = "1.35", features = ["full"] }
Let's add the code that parses out the links we need: images, CSS, and JS. The things pertinent to a webpage.
Create a file under ./src called html.rs
Now let's add the required imports:
use chrono::Utc;
use lazy_static::lazy_static;
use regex::Regex;
use scraper::{Html, Selector};
use std::collections::HashSet;
use url::{ParseError, Url};
Now that those are out of the way, it's time to add the HtmlRecord struct.
#[derive(Debug)]
pub struct HtmlRecord {
    pub origin: String,
    pub date_time: String,
    pub body: String,
    pub html: Html,
}
Now let's implement new:
impl HtmlRecord {
    pub fn new(origin: String, body: String) -> HtmlRecord {
        HtmlRecord {
            origin,
            date_time: Utc::now().format("%d-%m-%Y-%H:%M:%S").to_string(),
            html: Html::parse_document(&body),
            body,
        }
    }
}
This takes the time of creation and attaches it to the date_time field. The html field is the parsed version of the body, which we'll use to extract links. The body is the raw HTML we'll save.
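For example, just to illustrate the fields (not part of the project, you can try it in a scratch test or main):
let record = HtmlRecord::new(
    "https://example.com/page".to_string(),
    "<html><body><p>hello</p></body></html>".to_string(),
);
println!("fetched {} at {}", record.origin, record.date_time);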
Next, inside the impl HtmlRecord block, under new, add the following method:
//the tuple holds the original (unparsed) link string in position 0
//and the resolved (parsed) link in position 1
pub fn get_image_links(&self) -> Result<Option<HashSet<(String, String)>>, String> {
    //checks for base64 images
    lazy_static! {
        static ref RE3: Regex = Regex::new(r";base64,").unwrap();
    }
    let mut link_hashset: HashSet<(String, String)> = HashSet::new();
    //select image tags
    let selector = Selector::parse("img").unwrap();
    //loop through img tags
    for element in self.html.select(&selector) {
        //grab the src attribute of the tag
        match element.value().attr("src") {
            //if we have a link
            Some(link) => {
                //see if it's a relative link
                if Url::parse(link) == Err(ParseError::RelativeUrlWithoutBase) {
                    //resolve it against the origin
                    let plink = Url::parse(&self.origin)
                        .expect("get image links, origin could not be parsed")
                        .join(link)
                        .expect("image links, could not join")
                        .to_string();
                    //add to the hashset
                    link_hashset.insert((link.to_string(), plink.to_string()));
                    //check if base64 and skip if so
                } else if RE3.is_match(link) {
                    continue;
                    //if fully formed link, add to the hashset
                } else if let Ok(parsed_link) = Url::parse(link) {
                    link_hashset.insert((link.to_string(), parsed_link.to_string()));
                }
            }
            //no src, continue
            None => continue,
        };
    }
    //if the hashset is empty return an Ok of None
    if link_hashset.is_empty() {
        Ok(None)
        //otherwise return the image links
    } else {
        Ok(Some(link_hashset))
    }
}
I commented the code; if you have any questions feel free to ask.
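The relative-link branch leans on url's join to resolve a relative src against the page's origin. Here's a tiny standalone example of what join does (example values, not part of the project):
use url::Url;

fn main() {
    let origin = Url::parse("https://example.com/blog/post").unwrap();
    //an absolute path replaces the whole path of the origin
    assert_eq!(
        origin.join("/images/logo.png").unwrap().as_str(),
        "https://example.com/images/logo.png"
    );
    //a bare file name resolves relative to the origin's last path segment
    assert_eq!(
        origin.join("logo.png").unwrap().as_str(),
        "https://example.com/blog/logo.png"
    );
}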
Next we want to add get_css_links under get_image_links:
pub fn get_css_links(&self) -> Result<Option<HashSet<(String, String)>>, String> {
    let mut link_hashset: HashSet<(String, String)> = HashSet::new();
    //get link tags
    let selector = Selector::parse("link").unwrap();
    //loop through elements
    for element in self.html.select(&selector) {
        //check if the link tag is a stylesheet (not every link tag has a rel attribute)
        if element.value().attr("rel") == Some("stylesheet") {
            //get the href
            match element.value().attr("href") {
                Some(link) => {
                    //take care of relative links here
                    if Url::parse(link) == Err(ParseError::RelativeUrlWithoutBase) {
                        //resolve against the origin
                        let plink = Url::parse(&self.origin)
                            .expect("get css links, origin could not be parsed")
                            .join(link)
                            .expect("css links, could not join")
                            .to_string();
                        //add to hashset
                        link_hashset.insert((link.to_string(), plink.to_string()));
                    } else if let Ok(parsed_link) = Url::parse(link) {
                        link_hashset.insert((link.to_string(), parsed_link.to_string()));
                    }
                }
                None => continue,
            };
        }
    }
    if link_hashset.is_empty() {
        Ok(None)
    } else {
        Ok(Some(link_hashset))
    }
}
Now let's add get_js_links under get_css_links:
//get js links
pub fn get_js_links(&self) -> Result<Option<HashSet<(String, String)>>, String> {
    //create hashset
    let mut link_hashset: HashSet<(String, String)> = HashSet::new();
    //the selector is used to grab the script tags
    let selector = Selector::parse("script").unwrap();
    for element in self.html.select(&selector) {
        //get the src attribute of the script tag
        match element.value().attr("src") {
            Some(link) => {
                if Url::parse(link) == Err(ParseError::RelativeUrlWithoutBase) {
                    //resolve the relative url against the origin
                    let plink = Url::parse(&self.origin)
                        .expect("get js links, origin could not be parsed")
                        .join(link)
                        .expect("js links, could not join")
                        .to_string();
                    link_hashset.insert((link.to_string(), plink.to_string()));
                } else if let Ok(parsed_link) = Url::parse(link) {
                    //url is already absolute, add it to the hashset
                    link_hashset.insert((link.to_string(), parsed_link.to_string()));
                }
            }
            None => continue,
        };
    }
    //if the hashset is empty return a result of None
    if link_hashset.is_empty() {
        Ok(None)
    } else {
        //return a result of Some
        Ok(Some(link_hashset))
    }
}
And that is it for the html.rs file.
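Before moving on, if you want a quick sanity check of the link resolution, you can drop a small test at the bottom of html.rs (my own sketch, adjust to taste):
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn css_links_resolve_against_the_origin() {
        let body = r#"<html><head><link rel="stylesheet" href="/static/site.css"></head></html>"#;
        let record = HtmlRecord::new("https://example.com/page".to_string(), body.to_string());
        let links = record.get_css_links().unwrap().unwrap();
        assert!(links.contains(&(
            "/static/site.css".to_string(),
            "https://example.com/static/site.css".to_string()
        )));
    }
}
Run it with cargo test.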
Now on to the client.rs file. Create it under ./src and first set up the imports:
use crate::html::HtmlRecord;
use bytes::Bytes;
use reqwest::header::USER_AGENT;
use url::Url;
Now set up the AGENT const under the imports; we'll be using a Mozilla user agent string.
const AGENT: &str =
    "Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion";
Now set up the Client struct
pub(crate) struct Client {
    pub client: reqwest::Client,
}
Now let's set up the new function:
impl Client {
    pub fn new() -> Self {
        Self {
            client: reqwest::Client::new(),
        }
    }
}
Below the closing brace of the impl block, add the following free function:
pub fn replace_encoded_chars(body: String) -> String {
    body.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&amp;", "&")
        .replace("&apos;", "'")
}
This does what the function name says: it replaces encoded characters so we get good results.
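For example (illustrative input and output):
let cleaned = replace_encoded_chars("&lt;p&gt;Hello &amp; welcome&lt;/p&gt;".to_string());
assert_eq!(cleaned, "<p>Hello & welcome</p>");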
Now back inside the impl Client block, under the new method, add the following method:
pub async fn fetch_html_record(&mut self, url_str: &str) -> Result<HtmlRecord, reqwest::Error> {
    let url_parsed = Url::parse(url_str).expect("cannot parse");
    let res = self
        .client
        .get(url_parsed.as_str())
        .header(USER_AGENT, AGENT)
        .send()
        .await?;
    let body = res.text().await.expect("unable to parse html text");
    let body = replace_encoded_chars(body);
    let record: HtmlRecord = HtmlRecord::new(url_parsed.to_string(), body);
    Ok(record)
}
This gets the html and creates a record with it.
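For example, from an async context (a sketch; the real call happens later in main):
let mut client = Client::new();
let record = client
    .fetch_html_record("https://example.com")
    .await
    .expect("fetch failed");
println!("fetched {} bytes of html", record.body.len());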
Next add the fetch_image_bytes method:
pub async fn fetch_image_bytes(&mut self, url_str: &str) -> Result<Bytes, String> {
    let url_parsed = Url::parse(url_str).expect("cannot parse");
    let res = self
        .client
        .get(url_parsed.as_str())
        .header(USER_AGENT, AGENT)
        .send()
        .await
        .map_err(|e| format!("fetch image bytes failed for url {}:\n {}", url_parsed, e))?;
    let status_value = res.status().as_u16();
    if status_value == 200 {
        let image_bytes = res.bytes().await.expect("unable to get image bytes");
        Ok(image_bytes)
    } else {
        Err("status on image call not a 200 OK".to_string())
    }
}
Lastly, add the fetch_string_resource method. This grabs the CSS and JS for the webpage we are archiving:
pub async fn fetch_string_resource(&mut self, url_str: &str) -> Result<String, String> {
    let url_parsed = Url::parse(url_str).expect("cannot parse");
    let res = self
        .client
        .get(url_parsed.as_str())
        .header(USER_AGENT, AGENT)
        .send()
        .await
        .map_err(|e| format!("fetch string resource failed for url {}: {}", url_parsed, e))?;
    let status_value = res.status().as_u16();
    if status_value == 200 {
        let string_resource = res.text().await.expect("unable to get resource text");
        Ok(string_resource)
    } else {
        Err("status on resource call not a 200 OK".to_string())
    }
}
The client was pretty easy. Now for the bread and butter, the archiver.
Create a file under ./src called archiver.rs and add the imports the archiver needs:
use crate::client::Client;
use crate::html::HtmlRecord;
use std::fs;
use std::fs::File;
use std::io::Write;
use url::Url;
Then create the archiver struct:
pub struct Archiver;
Now create the save_page method. This uses everything we've built so far to save the page under the base directory we provide:
impl Archiver {
    async fn save_page(
        html_record: &mut HtmlRecord,
        client: &mut Client,
        base_path: &str,
    ) -> Result<String, String> {
        //set up the directory to save the page in
        let url = Url::parse(&html_record.origin).expect("can't parse origin url");
        let host_name = url.host().expect("can't get host").to_string();
        let mut url_path = url.path().to_string();
        let mut base_path = base_path.to_string();
        if !base_path.ends_with('/') {
            base_path.push('/');
        }
        if !url_path.ends_with('/') {
            url_path.push('/');
        }
        let directory = format!(
            "{}{}{}{}",
            base_path, host_name, url_path, html_record.date_time
        );
        //create the directory
        fs::create_dir_all(&directory).map_err(|e| format!("Failed to create directory: {}", e))?;
        //get images
        match html_record.get_image_links() {
            Ok(Some(t_image_links)) => {
                assert!(fs::create_dir_all(format!("{}/images", directory)).is_ok());
                for link in t_image_links {
                    if let Ok(image_bytes) = client.fetch_image_bytes(&link.1).await {
                        if let Ok(tmp_image) = image::load_from_memory(&image_bytes) {
                            let file_name = get_file_name(&link.1)
                                .unwrap_or_else(|| random_name_generator() + ".png");
                            let fqn = format!("{}/images/{}", directory, file_name);
                            let body_replacement_text = format!("./images/{}", file_name);
                            if (file_name.ends_with(".png")
                                && tmp_image
                                    .save_with_format(&fqn, image::ImageFormat::Png)
                                    .is_ok())
                                || (!file_name.ends_with(".png") && tmp_image.save(&fqn).is_ok())
                            {
                                html_record.body =
                                    html_record.body.replace(&link.0, &body_replacement_text);
                            }
                        }
                    }
                }
            }
            Ok(None) => {
                println!("no images for url: {}", url);
            }
            Err(e) => {
                println!("error {}", e)
            }
        }
        //get css links
        match html_record.get_css_links() {
            Ok(Some(t_css_links)) => {
                assert!(fs::create_dir_all(format!("{}/css", directory)).is_ok());
                for link in t_css_links {
                    let file_name =
                        get_file_name(&link.1).unwrap_or_else(|| random_name_generator() + ".css");
                    if let Ok(css) = client.fetch_string_resource(&link.1).await {
                        let fqn = format!("{}/css/{}", directory, file_name);
                        let mut file = File::create(&fqn).unwrap();
                        if file.write(css.as_bytes()).is_ok() {
                            let body_replacement_text = format!("./css/{}", file_name);
                            html_record.body =
                                html_record.body.replace(&link.0, &body_replacement_text);
                        } else {
                            println!("couldn't write css for url {}", &fqn);
                        }
                    }
                }
            }
            Ok(None) => {
                println!("no css for url: {}", url);
            }
            Err(e) => {
                println!("error for url {}\n error: {}", url, e)
            }
        }
        //get js links
        match html_record.get_js_links() {
            Ok(Some(t_js_links)) => {
                assert!(fs::create_dir(format!("{}/js", directory)).is_ok());
                for link in t_js_links {
                    let file_name =
                        get_file_name(&link.1).unwrap_or_else(|| random_name_generator() + ".js");
                    if let Ok(js) = client.fetch_string_resource(&link.1).await {
                        let fqn = format!("{}/js/{}", directory, file_name);
                        if let Ok(mut output) = File::create(fqn) {
                            if output.write(js.as_bytes()).is_ok() {
                                let body_replacement_text = format!("./js/{}", file_name);
                                html_record.body =
                                    html_record.body.replace(&link.0, &body_replacement_text);
                            }
                        }
                    }
                }
            }
            Ok(None) => {
                println!("no js for url: {}", url);
            }
            Err(e) => {
                println!("error for url: {}\n error: {}", url, e);
            }
        }
        //write html to file
        let fqn_html = format!("{}/index.html", directory);
        let mut file_html = File::create(fqn_html.clone()).unwrap();
        if file_html.write(html_record.body.as_bytes()).is_ok() {
            Ok(fqn_html)
        } else {
            Err("error archiving site".to_string())
        }
    }
}
Run through this code. Basically it follows the same pattern for each resource type: fetch the resource, save it to disk, then rewrite its link in the html body so it points at the local copy.
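save_page leans on two small helpers, get_file_name and random_name_generator, that aren't shown above. Here is a minimal sketch of how they could look (my own version; the exact implementation is up to you). Put them in archiver.rs outside the impl block, with the two rand imports at the top of the file:
use rand::distributions::Alphanumeric;
use rand::Rng;

//grab the last path segment of a url, e.g. ".../style.css" -> "style.css"
fn get_file_name(url_str: &str) -> Option<String> {
    let url = Url::parse(url_str).ok()?;
    let segment = url.path_segments()?.last()?.to_string();
    if segment.is_empty() {
        None
    } else {
        Some(segment)
    }
}

//fallback name for resources whose url has no usable file name
fn random_name_generator() -> String {
    rand::thread_rng()
        .sample_iter(&Alphanumeric)
        .take(12)
        .map(char::from)
        .collect()
}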
Now right above save_page, create the following method:
pub async fn create_archive(
    &mut self,
    client: &mut Client,
    url: &str,
    path: &str,
) -> Result<String, String> {
    //create record
    let mut record = client
        .fetch_html_record(url)
        .await
        .unwrap_or_else(|_| panic!("fetch_html_record failed \n url {}", url));
    //save the page
    match Archiver::save_page(&mut record, client, path).await {
        Ok(archive_path) => Ok(archive_path),
        Err(e) => Err(e),
    }
}
This gets the HtmlRecord and then extracts all its resources.
Now, lastly, in main.rs add the module declarations:
mod archiver;
mod client;
mod html;
Now add the imports:
use crate::archiver::Archiver;
use crate::client::Client;
Now change the main to this:
#[tokio::main]
async fn main() {
    let url = "https://en.wikipedia.org/wiki/Rust_(programming_language)";
    /*
    change these two lines if you want to use a different absolute path,
    or create the directory "/Projects/archive_test" under your home directory
    */
    //this will grab your home directory
    let home_dir = dirs::home_dir().expect("Failed to get home directory");
    //make sure this directory exists:
    let custom_path = "/Projects/archive_test";
    //this is the absolute path to your home directory plus the directories
    //you want your archives saved under
    let new_dir = format!("{}{}", home_dir.to_str().unwrap(), custom_path);
    //create the client and pass it to the archiver
    let mut client = Client::new();
    let mut archiver = Archiver;
    let path = archiver.create_archive(&mut client, url, &new_dir).await;
    //path of the archived site
    println!("{:?}", path);
}
Change the values I used (the site URL and the path, which needs to be absolute).
Now run the following command in the base directory of the project:
cargo run
The output path is where your archive now resides.