Web-scraping with NodeJS

nitinreddy3

Nitin Reddy

Posted on April 11, 2020

Today we will learn how to do web scraping with NodeJS and a few other tools.
We will fetch data from a web URL with a GET request and store it in a CSV file.

The codebase is available at Node-WEbScrap

Tools and things required:

  • NodeJS
  • NPM packages
    1. request-promise - Makes HTTP requests to the source URI and returns the response data
    2. cheerio - Loads and parses the markup data
    3. json2csv - Converts JSON data to the CSV format
  • Basic knowledge of JavaScript

Let's get started with the project

  • Create a NodeJS project
   $ mkdir node-webscrap
   $ cd node-webscrap
   $ npm init
   $ yarn add request-promise request cheerio json2csv
  • Create an index.js file in the root directory of your project
   $ touch index.js
  • Get all the required modules inside the index.js
    const request = require("request-promise")
    const cheerio = require("cheerio")
    const fs = require("fs")
    const json2csv = require("json2csv").Parser;
  • Next, create an array of movie page URLs as strings. I have used Rotten Tomatoes to get the movie review URLs
   const movies = [
     "https://www.rottentomatoes.com/m/the_last_full_measure",
     "https://www.rottentomatoes.com/m/stray_dolls"
   ];
  • Now create a function with the code below
   const dataRepresent = async () => {
     const rottenTomatoData = [];

     for (const movie of movies) {
       // Fetch the raw HTML of the movie page, waiting for it before moving on
       const response = await request({
         uri: movie,
         headers: {
           "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
           "accept-encoding": "gzip, deflate, br",
           "accept-language": "en-US,en;q=0.9,es;q=0.8"
         },
         gzip: true,
       });

       // Parse the markup and pick out each field with a jQuery-style selector
       const $ = cheerio.load(response);
       const title = $("h1[class='mop-ratings-wrap__title mop-ratings-wrap__title--top']").text().trim();
       const tomatoMeterObj = $('#tomato_meter_link > .mop-ratings-wrap__percentage');
       const tomatoMeter = tomatoMeterObj && tomatoMeterObj.text().trim();
       const audMeterObj = $('.audience-score > .mop-ratings-wrap__score > .articleLink > .mop-ratings-wrap__percentage');
       const audMeter = audMeterObj && audMeterObj.text().trim();
       const summary = $('.mop-ratings-wrap__text').text().trim();

       rottenTomatoData.push({
         title,
         tomatoMeter,
         audMeter,
         summary,
       });
     }

     // Convert the collected rows to CSV and write them to disk
     const j2cp = new json2csv();
     const csv = j2cp.parse(rottenTomatoData);
     fs.writeFileSync('./rottenTomatoes.csv', csv, "utf-8");
   };
  • Call the function at the end of the index.js file
    dataRepresent();
  • After running index.js from the command line, you should see the file "rottenTomatoes.csv" generated in the project's root directory
   $ node index.js

So here we iterate over the movies array and, for each URL, use the request-promise npm module to pass the headers, the uri, and options such as gzip, awaiting the raw HTML before moving on to the next movie. With cheerio we then parse that HTML and use jQuery-style selectors to extract the title, the two scores, and the summary.
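
The loop in dataRepresent awaits each request before starting the next one, and a single failed request rejects the whole function before any CSV is written. Here is a minimal sketch of one alternative, not the code used in this post: Promise.allSettled starts all fetches at once and reports successes and failures separately. The names fetchAll and fakeFetchPage and the example.com URLs are made up for illustration; fakeFetchPage stands in for the request({...}) call so the sketch runs without network access.

```javascript
// Hypothetical stand-in for the request-promise call: resolves with fake HTML,
// or rejects for URLs containing "missing" to simulate a failed request.
const fakeFetchPage = async (url) => {
  if (url.includes("missing")) {
    throw new Error(`404 for ${url}`);
  }
  return `<h1>Page for ${url}</h1>`;
};

// Fetch every URL concurrently and split the outcomes into pages and failures.
const fetchAll = async (urls, fetchPage) => {
  const settled = await Promise.allSettled(urls.map((url) => fetchPage(url)));
  const pages = [];
  const failures = [];
  settled.forEach((result, i) => {
    if (result.status === "fulfilled") {
      pages.push({ url: urls[i], html: result.value });
    } else {
      failures.push({ url: urls[i], reason: result.reason.message });
    }
  });
  return { pages, failures };
};

// Usage with made-up URLs
fetchAll(
  ["https://example.com/m/one", "https://example.com/m/missing"],
  fakeFetchPage
).then(({ pages, failures }) => {
  console.log(`${pages.length} fetched, ${failures.length} failed`);
});
```

With only two movies the sequential loop is perfectly fine; for a long list, firing every request at once can overload the target site, so some limit on concurrency would be a sensible addition.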

We then push each result into the "rottenTomatoData" array and, once the loop finishes, convert the array to CSV with json2csv and write it to a file named "rottenTomatoes.csv" using the fs module that ships with NodeJS.
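
To make the CSV step concrete, here is a minimal sketch of the kind of conversion involved, written in plain JavaScript. It is an illustration only, not json2csv's actual implementation or its exact output format; toCsv, escapeField, and the sample row are hypothetical. The point to notice is that a field such as summary can contain commas, quotes, or line breaks, so it has to be quoted.

```javascript
// Quote a field if it contains a comma, double quote, or line break,
// doubling any embedded double quotes (the RFC 4180 convention).
const escapeField = (value) => {
  const text = value == null ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
};

// Build CSV text from an array of flat objects,
// using the keys of the first row as the header line.
const toCsv = (rows) => {
  if (rows.length === 0) return "";
  const fields = Object.keys(rows[0]);
  const lines = [fields.map(escapeField).join(",")];
  for (const row of rows) {
    lines.push(fields.map((field) => escapeField(row[field])).join(","));
  }
  return lines.join("\n");
};

// Made-up sample row whose summary contains both quotes and a comma
const sample = [
  { title: "Some Movie", tomatoMeter: "60%", summary: 'A "quiet", slow film' },
];
console.log(toCsv(sample));
```

In the project itself json2csv handles this for you, which is a good reason to use a library rather than joining strings with commas by hand.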

So that's it for the day. I will come back with more learnings to share with you.

Thanks for reading. Please share this with other folks, and keep learning!!
