Posted on March 21, 2023
Written by Juan Cruz Martinez
A web scraper is a tool or script that allows you to obtain information (usually in large amounts) from websites and web APIs, whether to extract insights or to compile databases. Search engines like Google scrape the web to index sites and serve them as results for users' queries.
Search engines are, of course, much more complicated systems, but the underlying idea is the same. In this article, you'll learn about some Node.js web scraping libraries and techniques, their differences, and when each can be an excellent fit for your needs. Before we jump in, let's review some considerations.
Jump ahead:
- What you need to know before scraping the web
- The best Node.js web scraping libraries
- Which is the best Node.js scraper?
What you need to know before scraping the web
Even though scraping publicly available information is generally legal, you should be aware that many sites restrict it as part of their terms of service. Some even enforce technical limitations, like rate limits, to prevent you from slowing down their services. But why is that?
When you scrape information from a site, you consume its resources. If you are aggressive enough, accessing too many pages too quickly, you may degrade the site's performance for its other users. So, when scraping the web, get consent or permission from the owner and be mindful of the strain you place on their site.
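To make that concrete, here is a minimal politeness sketch (assuming Node 18+ for the built-in fetch); the one-second delay and the URL list are illustrative assumptions, not a rule:

// Fetch a list of pages one at a time, pausing between requests
// so we don't strain the target site (the delay value is an assumption)
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeFetch(urls) {
  const pages = [];
  for (const url of urls) {
    const response = await fetch(url); // one request at a time
    pages.push(await response.text());
    await sleep(1000); // wait 1s before the next request
  }
  return pages;
}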
Lastly, web scraping requires considerable development and, in many cases, maintenance effort, as changes to the target site's structure may break your scraping code and force you to update your script to handle the new format. For this reason, I prefer consuming an API when one is available and scraping the web only as a last resort.
Now, let's jump directly into the best Node.js web scraping libraries.
The best Node.js web scraping libraries
So, whether you want to build your own search engine, monitor a website to alert you when tickets for your favorite concert become available, or need essential information for your company, Node.js web scraping libraries have you covered.
Axios
If you are familiar with Axios, you know that this option may not sound too sexy for scraping the web. However, it is a simple solution that can get the job done in many situations, using a library you already know and love while keeping your codebase simple.
Axios is a promise-based HTTP client for Node.js and the browser that became super popular among JavaScript projects for its simplicity and adaptability. Although Axios is typically used in the context of calling REST APIs, it can fetch the HTML of websites.
Because Axios only retrieves the raw response from the server, it is up to you to parse and work with the result. Therefore, I recommend this library when working with JSON responses or for simple scraping needs.
You can install Axios using your favorite package manager, like so:
npm install axios
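If the data you need is exposed as a JSON endpoint, Axios hands you an already-parsed object. A minimal sketch, where the endpoint URL and the title field are hypothetical:

const axios = require('axios');

// Hypothetical JSON endpoint; Axios parses the JSON body into response.data
axios.get('https://example.com/api/posts').then((response) => {
  response.data.forEach((post) => console.log(`- ${post.title}`));
});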
Here is an example of its usage to list all the articles' headlines from the LogRocket blog's homepage:
const axios = require('axios');

axios
  .get("https://logrocket.com/blog")
  .then(function (response) {
    // Capture the text inside each headline link with a regular expression
    const reTitles = /(?<=\<h2 class="card-title"><a\shref=.*?\>).*?(?=\<\/a\>)/g;
    [...response.data.matchAll(reTitles)].forEach(title => console.log(`- ${title}`));
  });
In the example above, you can see how Axios is great for HTTP requests. However, parsing HTML with complex structures means writing elaborate rules and regular expressions, even for simple tasks.
So, if regular expressions are not your thing and you prefer a more DOM-based approach, you could transform the HTML into a DOM-like object with libraries like JSDom or Cheerio.
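For instance, a quick sketch of the same task with Cheerio might look like this (assuming the same .card-title markup as the regex example):

const axios = require('axios');
const cheerio = require('cheerio');

axios.get('https://logrocket.com/blog').then((response) => {
  // Load the raw HTML into Cheerio's jQuery-like API
  const $ = cheerio.load(response.data);
  // Select the headline links and print their text
  $('.card-title a').each((_, el) => console.log(`- ${$(el).text()}`));
});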
Let’s explore the same example using JSDom instead:
const axios = require('axios');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;

axios
  .get("https://logrocket.com/blog")
  .then(function (response) {
    // Build a DOM from the raw HTML, then query it like in the browser
    const dom = new JSDOM(response.data);
    [...dom.window.document.querySelectorAll('.card-title a')].forEach(el => console.log(`- ${el.textContent}`));
  });
Such a solution will soon run into its limitations. For example, you only get the raw response from the server. What if the elements you want to access are loaded asynchronously? What about single-page applications (SPAs), where the HTML simply loads JavaScript bundles that do all the rendering work on the client? Or, what if you hit one of the limitations of such libraries? After all, they are not full HTML/DOM implementations, but a subset of one.
In scenarios like these, or for complex websites, a completely different approach with other libraries may be the best choice.
Puppeteer
Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium programmatically. So, what does that mean for us in terms of web scraping?
With Puppeteer, you get the full power of a full-fledged browser like Chromium (running in the background in headless mode) to navigate websites and fully render styles, scripts, and asynchronous content.
To use Puppeteer in your project, you can install it as any other JavaScript package:
npm install puppeteer
Now, let’s see an example of Puppeteer in action:
const puppeteer = require("puppeteer");

async function parseLogRocketBlogHome() {
  // Launch the browser
  const browser = await puppeteer.launch();

  // Open a new tab
  const page = await browser.newPage();

  // Visit the page and wait until network connections are completed
  await page.goto('https://logrocket.com/blog', { waitUntil: 'networkidle2' });

  // Interact with the DOM to retrieve the titles
  const titles = await page.evaluate(() => {
    // Select all headline links with the card-title class
    return [...document.querySelectorAll('.card-title a')].map(el => el.textContent);
  });

  // Don't forget to close the browser instance to clean up the memory
  await browser.close();

  // Print the results
  titles.forEach(title => console.log(`- ${title}`));
}

parseLogRocketBlogHome();
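If some of the content you need is rendered asynchronously by client-side scripts, you can also tell Puppeteer to wait for it explicitly. A small sketch of what that could look like, placed just before the page.evaluate call above:

// Wait until the headline links actually exist in the DOM
// before trying to read them (useful on script-rendered pages)
await page.waitForSelector('.card-title a');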
While Puppeteer is a fantastic solution, it is more complex to work with, especially for simple projects, and it is much more demanding in terms of resources. You are, after all, running a full Chromium browser, and we know how memory-hungry those can be.
X-Ray
X-Ray is a Node.js library created specifically for scraping the web, so it is no surprise that its API is heavily focused on that task. It abstracts away from developers most of the complexity we saw with Puppeteer and Axios.
To install X-Ray, you can run the following:
npm install x-ray
Now, let’s build our example using X-Ray:
const Xray = require('x-ray');
const x = Xray();

// Pass the URL and a mapping of names to CSS selectors,
// then collect the result in the callback
x('https://logrocket.com/blog', {
  titles: ['.card-title a']
})((err, result) => {
  result.titles.forEach(title => console.log(`- ${title}`));
});
X-Ray is a great option if your use case involves scraping large numbers of webpages. It supports concurrency and pagination out of the box, so you don’t need to worry about those details.
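As a hedged sketch of what that can look like, the following collects headlines across up to three pages; the '.next-posts-link@href' selector for the pagination link is an assumption about the blog's markup:

const Xray = require('x-ray');
const x = Xray();

x('https://logrocket.com/blog', '.card-title', [{ title: 'a' }])
  // Follow the "older posts" link on each page (selector is an assumption)
  .paginate('.next-posts-link@href')
  // Stop after three pages
  .limit(3)((err, results) => {
    if (err) throw err;
    results.forEach(({ title }) => console.log(`- ${title}`));
  });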
Osmosis
Osmosis is very similar to X-Ray because it is a library designed explicitly for scraping webpages and extracting data from HTML, XML, and JSON documents.
To install the package, run the following:
npm install osmosis
And, here is the sample code:
const osmosis = require('osmosis');

osmosis
  .get('https://logrocket.com/blog')
  .set({
    titles: ['.card-title a']
  })
  .data(function (result) {
    result.titles.forEach(title => console.log(`- ${title}`));
  });
As you can see, Osmosis is similar to X-Ray in the syntax and style used to retrieve and work with data.
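One nicety worth mentioning: Osmosis can also follow links as part of the same chain. A brief sketch, where the h1 selector for each article's heading is an assumption:

const osmosis = require('osmosis');

osmosis
  .get('https://logrocket.com/blog')
  // Visit each article linked from a headline
  .follow('.card-title a@href')
  // Grab the heading of every article we land on
  .set({ heading: 'h1' })
  .data((article) => console.log(`- ${article.heading}`));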
Which is the best Node.js scraper?
So, this brings us to the question: which is the best Node.js scraper? Well, the best Node.js scraper is the one that best fits your project's needs. With that said, you've now learned some factors to help influence your decision.
Ultimately, for most tasks, any of these options will do, so choose the one you feel most comfortable with. In my professional life, I’ve had the opportunity to build multiple projects that required gathering information from publicly available sources and internal systems. Because the requirements were diverse, each of these projects used a different approach and library, ranging from something as simple as Axios, to X-Ray, and ultimately to Puppeteer for the most complex situations.
Finally, always respect the website's terms and conditions regardless of the scraper you choose. Scraping data can be a powerful tool, but with that comes great responsibility.
Thanks for reading!