Web Scraping With JavaScript and Node JS - The Ultimate Guide
Serpdog
Posted on February 7, 2023
Introduction
JavaScript has become one of the most preferred languages for web scraping. Its ability to extract data from SPAs (Single Page Applications) is boosting its popularity, and developers can easily automate their scraping tasks with the help of JavaScript libraries like Puppeteer and Cheerio.
In this blog, we are going to discuss the various web scraping libraries available in JavaScript, weigh their advantages and disadvantages, determine the best among them, and, at the end, answer whether Node JS is a good choice for web scraping.
Web Scraping With Node JS
Before starting with the tutorial, let us learn some basics of web scraping.
What is Web Scraping?
Web Scraping is the process of extracting data from a single website or a bunch of websites. It involves making HTTP requests to a website's server to access the raw HTML of a particular webpage and then converting that HTML into the format you want.
There are various uses of Web Scraping:
- SEO — Web Scraping can be used to scrape Google Search Results, which can be used for various objectives like SERP Monitoring, keyword tracking, etc.
- News Monitoring — Web Scraping can enable access to a large number of articles from various media agencies, which can be used to keep track of current news and events.
- Lead Generation — Web Scraping helps to extract the contact details of a person who can be your potential customer.
- Price Comparison — Web Scraping can be used to gather product pricing from multiple online sellers for price comparison.
Best Web Scraping Libraries in Node JS
The best web scraping libraries present in Node JS are:
- Unirest
- Axios
- SuperAgent
- Cheerio
- Puppeteer
- Playwright
- Nightmare
Let us start discussing these various web scraping libraries.
HTTP Clients
HTTP client libraries are used to interact with website servers by sending requests and retrieving the response. In the following sections, we will discuss several libraries that can be utilized for making HTTP requests.
Unirest
Unirest is a lightweight HTTP request library available in multiple languages, built and maintained by Kong. It supports various HTTP methods like GET, POST, DELETE, HEAD, etc., which can be easily added to your applications, making it a good choice for simple use cases.
Unirest is one of the most popular JavaScript libraries that can be utilized to extract the valuable data available on the internet.
Let us take an example of how we can do it. Before starting, I am assuming that you have already set up your Node JS project with a working directory.
First, install Unirest JS by running the below command in your project terminal.
npm i unirest
Now, we will request the target URL and print the raw response data (this Reddit endpoint returns JSON).
const unirest = require("unirest");

const getData = async () => {
  try {
    const response = await unirest.get("https://www.reddit.com/r/programming.json");
    console.log(response.body); // JSON data for the subreddit
  } catch (e) {
    console.log(e);
  }
};

getData();
This is how you can create a basic scraper with Unirest.
Advantages:
- All HTTP methods are supported, including GET, POST, DELETE, etc.
- It is very fast for web scraping tasks and can handle a large amount of load without any problem.
- It makes transferring files to a server much simpler (see the sketch below).
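For instance, uploading a file alongside custom headers is a short chain of calls. Here is a minimal sketch, assuming a hypothetical upload endpoint and header value; Unirest's .headers(), .field(), and .attach() methods handle the multipart details:

const unirest = require("unirest");

const uploadFile = async () => {
  // Hypothetical endpoint, used purely for illustration.
  const response = await unirest
    .post("https://api.example.com/upload")
    .headers({ "User-Agent": "my-scraper/1.0" }) // custom request headers
    .field("name", "report") // a regular form field
    .attach("file", "./report.pdf"); // attaches a local file as multipart data
  console.log(response.status);
};

uploadFile();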
Axios
Axios is a promise-based HTTP client for both Node JS and browsers. Axios is widely used among the developer community because of its wide range of methods, simplicity, and active maintenance. It also supports various features like cancel requests, automatic transforms for JSON data, etc.
You can install the Axios library by running the below command in your terminal.
npm i axios
Making an HTTP request with Axios is quite simple.
const axios = require("axios");

const getData = async () => {
  try {
    const response = await axios.get("https://books.toscrape.com/");
    console.log(response.data); // HTML
  } catch (e) {
    console.log(e);
  }
};

getData();
Advantages:
- It can intercept HTTP requests and modify them (see the sketch below).
- It has large community support and is actively maintained, making it a reliable option for making HTTP requests.
- It can transform the request and response data.
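To illustrate the first advantage, here is a minimal sketch of Axios interceptors; the User-Agent value is an assumption for the example. The request interceptor modifies every outgoing request before it is sent:

const axios = require("axios");

// Runs before every request leaves the client.
axios.interceptors.request.use((config) => {
  config.headers["User-Agent"] = "my-scraper/1.0"; // assumed header value
  return config;
});

// Runs on every response before your .then()/await sees it.
axios.interceptors.response.use((response) => {
  console.log(`${response.status} ${response.config.url}`);
  return response;
});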
SuperAgent
SuperAgent is another lightweight HTTP client library for both Node JS and the browser. It supports many high-level HTTP client features, offers an API similar to Axios, and supports both promise and async/await syntax for handling responses.
You can install SuperAgent by running the following command.
npm i superagent
You can make an HTTP request using async/await with SuperAgent like this:
const superagent = require("superagent");

const getData = async () => {
  try {
    const response = await superagent.get("https://books.toscrape.com/");
    console.log(response.text); // HTML
  } catch (e) {
    console.log(e);
  }
};

getData();
Advantages:
- SuperAgent can be easily extended via various plugins (see the sketch below).
- It works in both the browser and node.
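A SuperAgent plugin is just a function that receives the request object, so writing your own is straightforward. A minimal sketch, with an assumed header value:

const superagent = require("superagent");

// A tiny custom plugin: sets a User-Agent header on any request it is applied to.
const withUserAgent = (req) => {
  req.set("User-Agent", "my-scraper/1.0");
  return req;
};

const getData = async () => {
  const response = await superagent
    .get("https://books.toscrape.com/")
    .use(withUserAgent); // apply the plugin to this request
  console.log(response.status);
};

getData();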
Disadvantages:
- Fewer features as compared to other HTTP client libraries like Axios.
- Its documentation is not very detailed.
Web Parsing Libraries
Web Parsing Libraries are used to extract the required data from the raw HTML or XML document. There are various web parsing libraries present in JavaScript including Cheerio, JSONPath, html-parse-stringify2, etc. In the following section, we will discuss Cheerio, the most popular web parsing library in JavaScript.
Cheerio
Cheerio is a lightweight web parsing library based on the powerful API of jQuery that can be used to parse and extract data from HTML and XML documents.
Cheerio is blazingly fast at HTML parsing, manipulating, and rendering, as it works with a simple, consistent DOM model. It is not a web browser: it cannot produce visual rendering, apply CSS, or execute JavaScript. For scraping SPAs (Single Page Applications), we need full browser automation tools like Puppeteer and Playwright, which we will discuss in a bit.
Let us scrape the title of the book Sharp Objects from the Books to Scrape website.
First, we will install the Cheerio library.
npm i cheerio
Then, we can extract the title by running the below code.
const unirest = require("unirest");
const cheerio = require("cheerio");

const getData = async () => {
  try {
    const response = await unirest.get("https://books.toscrape.com/catalogue/sharp-objects_997/index.html");
    const $ = cheerio.load(response.body);
    console.log("Book Title: " + $("h1").text()); // "Book Title: Sharp Objects"
  } catch (e) {
    console.log(e);
  }
};

getData();
The process is quite similar to what we did in the Unirest section, with one small difference: in the above code, we loaded the extracted HTML into a Cheerio instance and then used the CSS selector of the title to extract the required data.
Advantages:
- One of the fastest web parsing libraries available.
- Cheerio has a very simple syntax and is similar to jQuery which allows developers to scrape web pages easily.
- Cheerio can be used or integrated with various web scraping libraries like Unirest and Axios, which can be a great combo for scraping a website (see the sketch at the end of this section).
Disadvantages:
- It cannot execute Javascript.
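As noted in the advantages, Cheerio pairs well with an HTTP client like Axios. Here is a sketch of that combo which collects every book title on the listing page; the selectors reflect the Books to Scrape markup used throughout this post:

const axios = require("axios");
const cheerio = require("cheerio");

const getTitles = async () => {
  const response = await axios.get("https://books.toscrape.com/");
  const $ = cheerio.load(response.data);
  const titles = [];
  // Each book on the listing page is an <article> with its title on an anchor tag.
  $("article.product_pod h3 a").each((i, el) => {
    titles.push($(el).attr("title"));
  });
  console.log(titles); // the 20 titles on the first page
};

getTitles();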
Headless Browsers
Nowadays, website development has become more advanced, and developers prefer more dynamic content on their websites, which is made possible by JavaScript. But content rendered by JavaScript is not accessible when scraping with a simple HTTP GET request.
The only way you can scrape the dynamic content is by using headless browsers. Let us discuss the libraries which can help in scraping that content.
Puppeteer
Puppeteer is a Node JS library designed by Google that provides a high-level API that allows you to control Chrome or Chromium browsers.
Features associated with Puppeteer JS:
- Puppeteer gives you better control over Chrome.
- It can generate screenshots and PDFs of web pages.
- It can be used to scrape web pages that use JavaScript to load the content dynamically.
Let us scrape all the book titles and their links from the Books to Scrape homepage.
But first, we will install the puppeteer library.
npm i puppeteer
Now, we will prepare a script to scrape the required information. Write the below code in your js file.
const puppeteer = require("puppeteer");

// The steps below run inside an async function.
const browser = await puppeteer.launch({
  headless: false,
});
const page = await browser.newPage();
await page.goto("https://books.toscrape.com/index.html", {
  waitUntil: "domcontentloaded",
});
Step-by-step explanation:
- First, we launched the browser with the headless mode set to false, which allows us to see exactly what is happening.
- Then, we opened a new page in the browser.
- After that, we navigated to our target URL and waited until the HTML completely loaded.
Now, we will parse the HTML.
let data = await page.evaluate(() => {
  return Array.from(document.querySelectorAll("article h3")).map((el) => {
    return {
      title: el.querySelector("a").getAttribute("title"),
      link: el.querySelector("a").getAttribute("href"),
    };
  });
});
The page.evaluate() method executes JavaScript within the current page context. document.querySelectorAll() selects all the elements matching the article h3 selector, while document.querySelector() works the same way but selects only a single HTML element.
Great! Now, we will print the data and close the browser.
console.log(data);
await browser.close();
This will give you 20 titles and links to the books present on the web page.
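As mentioned in the features list, Puppeteer can also capture screenshots and generate PDFs of web pages. A minimal sketch (the output file names are arbitrary):

const puppeteer = require("puppeteer");

const capture = async () => {
  const browser = await puppeteer.launch(); // headless by default
  const page = await browser.newPage();
  await page.goto("https://books.toscrape.com/index.html");
  await page.screenshot({ path: "books.png", fullPage: true }); // full-page screenshot
  await page.pdf({ path: "books.pdf" }); // PDF generation requires headless mode
  await browser.close();
};

capture();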
Advantages:
- We can perform various activities on the web page, like clicking on the buttons and links, navigating between the pages, scrolling the web page, etc.
- It can be used to take screenshots of web pages.
- The evaluate() function in Puppeteer lets you execute JavaScript in the page context.
- You don’t need an external driver to run the tests.
Disadvantages:
- It requires very high CPU usage to run.
- It currently supports only the Chrome web browser.
Playwright
Playwright is a test automation framework to automate web browsers like Chrome, Firefox, and Safari with an API similar to Puppeteer. It was developed by the same team that worked on Puppeteer. Like Puppeteer, Playwright can also run in the headless and non-headless modes making it suitable for a wide range of uses from automating tasks to web scraping or web crawling.
Major Differences between Playwright and Puppeteer:
- Playwright is compatible with Chrome, Firefox, and Safari, while Puppeteer only supports Chrome web browsers.
- Playwright provides a wide range of options to control the browser in headless mode.
- Puppeteer is limited to JavaScript only, while Playwright supports various languages like C#, Java, Python, and JavaScript.
Let us install Playwright now.
npm i playwright
We will now prepare a basic script to scrape the prices and stock availability from the same website which we used in the Puppeteer section.
The syntax is quite similar to Puppeteer.
const playwright = require("playwright");

// The steps below run inside an async function.
const browser = await playwright.chromium.launch({ headless: false });
const context = await browser.newContext();
const page = await context.newPage();
await page.goto("https://books.toscrape.com/index.html");
The newContext() method creates a new browser context, i.e., an isolated browser session.
Now, we will prepare our parser.
let articles = await page.$$("article");

let data = [];
for (let article of articles) {
  data.push({
    price: await article.$eval("p.price_color", (el) => el.textContent),
    availability: await article.$eval("p.availability", (el) => el.textContent),
  });
}
Here, page.$$("article") returns a handle for every article element on the page, and $eval() runs the given function on the first element matching the selector inside each article. Then, we will close our browser.
await browser.close();
Advantages:
- It supports multiple languages like Python, Java, .Net, and Javascript.
- It is one of the fastest web browser automation libraries available.
- It supports multiple web browsers like Chrome, Firefox, and Safari on a single API.
- Its documentation is well-written, which makes it easy for developers to learn and use.
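Playwright's single cross-browser API is easy to see in practice. Here is a minimal sketch that runs the same steps in all three engines, assuming the browsers have been installed (for example with npx playwright install):

const playwright = require("playwright");

const run = async () => {
  // The exact same code drives Chromium, Firefox, and WebKit (Safari's engine).
  for (const engine of ["chromium", "firefox", "webkit"]) {
    const browser = await playwright[engine].launch({ headless: true });
    const page = await browser.newPage();
    await page.goto("https://books.toscrape.com/index.html");
    console.log(engine, "->", await page.title());
    await browser.close();
  }
};

run();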
Nightmare JS
Nightmare is a high-level web automation library designed to automate browsing, web scraping, and various other tasks. It uses Electron (similar to PhantomJS, but roughly twice as fast), which provides it with a headless browser, making it efficient and easy to use. It is predominantly used for UI testing and crawling.
It can be used to mimic user actions such as navigating to a website, clicking a button or a link, typing, etc., with an API that provides a smooth experience for each script block.
Install Nightmare JS by running the following command.
npm i nightmare
Now, we will search for the results of “Serpdog” on duckduckgo.com.
const Nightmare = require("nightmare");
const nightmare = Nightmare();

nightmare
  .goto("https://duckduckgo.com")
  .type("#search_form_input_homepage", "Serpdog")
  .click("#search_button_homepage")
  .wait(".nrn-react-div")
  .evaluate(() => {
    return Array.from(document.querySelectorAll(".nrn-react-div")).map((el) => {
      return {
        title: el.querySelector("h2").innerText.replace("\n", ""),
        link: el.querySelector("h2 a").href,
      };
    });
  })
  .end()
  .then((data) => {
    console.log(data);
  })
  .catch((error) => {
    console.error("Search failed:", error);
  });
In the above code, we first declared an instance of Nightmare and navigated to the DuckDuckGo search page. Then, we used the type() method to type Serpdog into the search field and submitted the form by clicking the search button on the homepage using the click() method. We made our scraper wait until the search results were loaded, and after that, we extracted the search results present on the web page with the help of their CSS selectors.
Advantages:
- It is faster than Puppeteer.
- Fewer resources are needed to run the program.
Disadvantages:
- It doesn’t have good community support like Puppeteer. Also, some undiscovered issues exist on Electron, which can allow a malicious website to execute code on your computer.
Other libraries
In this section, we will discuss some alternatives to the previously discussed libraries.
Node Fetch
Node Fetch is a lightweight library that brings the Fetch API to Node JS, allowing you to make HTTP requests efficiently in the Node JS environment.
Features:
- It allows the use of promises and async functions.
- It implements the Fetch API functionality in Node JS.
- Simple API that is maintained regularly, and is easy to use and understand.
You can install Node Fetch by running the following command.
npm i node-fetch
Here is how you can use Node Fetch for web scraping.
const fetch = require("node-fetch"); // require() works with node-fetch v2; v3 is ESM-only

const getData = async () => {
  const response = await fetch("https://en.wikipedia.org/wiki/JavaScript");
  const body = await response.text();
  console.log(body);
};

getData();
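The same options object you would use with fetch in the browser works here too. A sketch of a POST request with a JSON body, using a hypothetical endpoint purely for illustration:

const fetch = require("node-fetch");

const postData = async () => {
  // Hypothetical endpoint, used only to show the request options.
  const response = await fetch("https://api.example.com/items", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ name: "Sharp Objects", price: 47.82 }),
  });
  console.log(response.status);
};

postData();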
Osmosis
Osmosis is a web scraping library used for scraping web pages and parsing HTML and XML documents.
Features:
- It has no large dependencies like jQuery and Cheerio.
- It has a clean promise-like interface.
- Fast parsing and small memory footprint.
Advantages:
- It supports retries and redirects limits.
- Supports single and multiple proxies.
- Supports form submission, session cookies, etc.
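Since Osmosis was not shown in action above, here is a rough sketch of its chainable, promise-like interface, reusing the Books to Scrape selectors from earlier sections (the @title syntax reads an element attribute):

const osmosis = require("osmosis");

osmosis
  .get("https://books.toscrape.com/")
  .find("article.product_pod") // one node per book on the listing page
  .set({
    title: "h3 a@title", // @title reads the anchor's title attribute
    price: "p.price_color",
  })
  .data((item) => console.log(item)) // called once per scraped object
  .error((err) => console.error(err));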
Is Node JS good for web scraping?
Yes, Node JS is good for web scraping. It has various powerful libraries like Axios and Puppeteer, which makes it a preferred choice for data extraction. Also, the ease of extracting data from websites that use JavaScript to load dynamic content makes it a great option for web scraping tasks.
In the end, the great community support available for Node JS will never let you down!
Conclusion
In this tutorial, we learned about the various Node JS libraries that can be used for web scraping, along with their advantages and disadvantages.
If you think we can complete your web scraping tasks and help you collect data, feel free to contact us.
I hope this tutorial gave you a complete overview of web scraping with Node JS. Please do not hesitate to message me if I missed something. Follow me on Twitter. Thanks for reading!
Additional Resources
I have prepared a complete list of blogs on scraping Google with Node JS, which can give you an idea of how to gather data from advanced websites like Google.