Web Scraping React Application using Node.js
collegewap
Posted on April 8, 2023
You might have searched for web scraping and found solutions that use Cheerio with axios or fetch.
The problem with this approach is that Cheerio only parses the HTML returned by the server and does not execute JavaScript, so it cannot scrape dynamically rendered or client-side rendered web pages.
To scrape such pages, we need to load them in a browser and wait for the page to finish rendering.
In this article, we will see how to wait for a particular section to appear on the page and then access that element.
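To see why static parsing falls short, here is a minimal sketch. The HTML string is a hypothetical example of what a client-side rendered app might serve before its JavaScript runs (it is not the actual response of any site), and hasHeading is an illustrative helper, not part of Cheerio or any other library.

```javascript
// Hypothetical HTML, similar to what a client-side rendered app may serve
// before its JavaScript runs (an assumption, not a real response).
const initialHtml = `<!DOCTYPE html>
<html>
  <head><title>React App</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>`

// Illustrative helper: naive check for an <h1> element in an HTML string.
function hasHeading(html) {
  return /<h1[\s>]/i.test(html)
}

// The served HTML has an empty root element, so there is nothing to scrape yet.
console.log(hasHeading(initialHtml)) // false

// Simulate what the DOM could look like after the app has rendered.
const renderedHtml = initialHtml.replace(
  '<div id="root"></div>',
  '<div id="root"><h1>Title</h1></div>'
)
console.log(hasHeading(renderedHtml)) // true
```

A static parser only ever sees the first string, which is why the rest of this article drives a real browser instead.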
Initial setup
Consider the page https://cra-crawl.vercel.app.
Here, we have a title and a list of fruits.
If you inspect the page, you will see that the heading is inside an h1 tag and the list has a class named 'fruits-list'.
We will use these two selectors to access the heading and the list of fruits.
Creating Node project
Create a directory called node-react-scraper and run the command npm init -y inside it. This will initialize an npm project.
Now install the puppeteer package using the following command:
npm i puppeteer
Puppeteer is a Node.js library that controls a Chrome or Chromium browser, which by default runs headless (without a UI), so that we can browse web pages automatically.
Create a file called index.js inside the root directory.
Reading the heading
We can use Puppeteer in index.js as follows:
const puppeteer = require("puppeteer")

// Start Puppeteer (launches a headless browser)
puppeteer
  .launch()
  .then(async browser => {
    try {
      const page = await browser.newPage()
      await page.goto("https://cra-crawl.vercel.app/")
      // Wait for the heading to be rendered
      await page.waitForSelector("h1")
      // Read the heading text inside the page context
      const heading = await page.evaluate(() => {
        const h1 = document.body.querySelector("h1")
        return h1.innerText
      })
      console.log({ heading })
    } finally {
      // Close the browser even if one of the steps above fails
      await browser.close()
    }
  })
  .catch(function (err) {
    console.error(err)
  })
In the above code, we wait for the h1 tag to appear on the page and only then access it.
You can run the code using the command node index.js.
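If the element never appears, waitForSelector eventually rejects with a timeout error. The following is a minimal sketch of one way to handle that. The helper name readText and its timeoutMs parameter are hypothetical, and the check on err.name assumes Puppeteer's timeout error is named "TimeoutError".

```javascript
// Hypothetical helper: waits for `selector` on a Puppeteer page and
// returns its text, or null if it does not appear within `timeoutMs`.
async function readText(page, selector, timeoutMs = 10000) {
  try {
    // The `timeout` option is in milliseconds
    await page.waitForSelector(selector, { timeout: timeoutMs })
  } catch (err) {
    // Assumption: Puppeteer's timeout error has the name "TimeoutError"
    if (err.name === "TimeoutError") {
      return null
    }
    // Anything else is unexpected, so let the caller see it
    throw err
  }
  // $eval runs the callback in the page against the first matching element
  return page.$eval(selector, el => el.innerText)
}
```

Inside the then callback above, this could be called as const heading = await readText(page, "h1"), with a null result meaning that the heading was not rendered in time.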
Accessing the list of fruits
If you want to access the list of fruits, you can do so by using the following code:
const puppeteer = require("puppeteer")

// Start Puppeteer (launches a headless browser)
puppeteer
  .launch()
  .then(async browser => {
    try {
      const page = await browser.newPage()
      await page.goto("https://cra-crawl.vercel.app/")
      // Wait for the list to be rendered
      await page.waitForSelector(".fruits-list")
      const heading = await page.evaluate(() => {
        const h1 = document.body.querySelector("h1")
        return h1.innerText
      })
      console.log({ heading })
      // Collect the text of every list item inside the page context
      const allFruits = await page.evaluate(() => {
        const fruitsList = document.body.querySelectorAll(".fruits-list li")
        const fruits = []
        fruitsList.forEach(value => {
          fruits.push(value.innerText)
        })
        return fruits
      })
      console.log({ allFruits })
    } finally {
      // Close the browser even if one of the steps above fails
      await browser.close()
    }
  })
  .catch(function (err) {
    console.error(err)
  })
Here we use the querySelectorAll API to get the list of nodes containing the fruits. We then loop through the nodes and read the text inside each one.
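As an alternative, Puppeteer's page.$$eval runs querySelectorAll for a selector and passes the matching elements to a callback as an array, which can shorten the loop. The sketch below is one possible way to use it. The helper name readListItems is hypothetical, and trimming the text is an added choice rather than something the page requires.

```javascript
// Hypothetical helper: waits for at least one element matching `selector`,
// then returns the trimmed text of every matching element.
async function readListItems(page, selector) {
  await page.waitForSelector(selector)
  // The callback runs in the page and receives the matches as an array
  return page.$$eval(selector, items =>
    items.map(item => item.innerText.trim())
  )
}
```

With the page used in this article, it could be called as const allFruits = await readListItems(page, ".fruits-list li").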
Source code
You can view the complete source code here.