How to create a scraper with Cheerio

rtagliavia

Robert

Posted on June 13, 2022

In this post we will learn how to scrape a website using Cheerio, and then build an API on top of the scraped data with Node.js that you can later use from a frontend.

The website that we will be using for this example is pricecharting.

You can contact me on Telegram if you need to hire a Full Stack developer.

You can also find me on Discord as Appu#9136.

You can clone the repo if you want.

This example is for learning purposes only.

Creating our Project.

  1. Open your terminal and type the following commands:
  2. mkdir node-cheerio-tut
  3. cd node-cheerio-tut
  4. npm init -y
  5. code .

Dependencies.

To install the dependencies, open a terminal in your project folder and type the following:

npm i axios cheerio express mongoose

And for the dev dependencies, type:

npm i -D nodemon

Project file structure:

node-cheerio-tut/
├── node_modules/
├── public/
├── src/
│ ├── routes/
│ ├── database.js
│ └── index.js
└── package.json

Table Of Contents.

  1. Setup the project
  2. Using Cheerio to scrape data
  3. Sending the response
  4. Organizing our code
  5. Conclusion

First, go to your package.json and add these lines.

  "scripts": {
    "start": "node ./src/index.js",
    "dev": "nodemon ./src/index.js"
  },

Let's code.

1. Setup the project

Let's go to index.js inside the src folder and set up our basic server with express.

const express = require('express')

const app = express()

//server
app.listen(3000, () => {
  console.log('listening on port 3000')
})

Now let's run npm run dev, and we should get this message:

listening on port 3000

Now, in our index.js, let's import axios and cheerio. The steps below explain the code.

  1. We are going to add a const url with the URL value, in this case https://www.pricecharting.com/search-products?q=. (When you run a search on this website, you are redirected to a new page whose route carries a parameter with the name you searched for.)

[Image: main site]

[Image: searchbar]

So we are going to use that URL. The website also has two types of search, one by price and another by market; if we don't specify the type in the URL, it uses the market type by default. I leave it like this because the market search returns the cover of the game and the system (we will use them later).

  2. We will add the middleware app.use(express.json()); without it, req.body would be undefined when we make the POST request.

  3. We will create a route with the post method so we can send a body to our server. (I am going to use the REST Client VS Code extension to test the API, but you can use Postman or whatever you prefer.)

test.http

POST http://localhost:3000
Content-Type: application/json

{
  "game": "final fantasy"
}

With console.log(req.body.game) in the route, the terminal prints:

final fantasy

As you can see, we are getting the value we sent; in this case I named the property game.

const axios = require("axios");
const cheerio = require("cheerio");
const express = require('express')

//initializations
const app = express()

const url = "https://www.pricecharting.com/search-products?q="

//middlewares
app.use(express.json())

app.post('/', async (req, res) => {
  // console.log(req.body.game)
  const game = req.body.game.trim().replace(/\s+/g, '+')
})

//server
app.listen(3000, () => {
  console.log('listening on port 3000')
})

  4. Now we are going to create a constant named game that stores the value from req.body.game; then we use a couple of string methods to turn it into something like final+fantasy.
  • First, we use trim() to remove the whitespace characters from the start and end of the string.

  • Then we replace the whitespace between the words with a + symbol using replace(/\s+/g, '+').
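The two string methods can be tried in isolation. Here is a minimal sketch, using a hypothetical helper named toSearchTerm; the check for a missing or non-string value is an addition for illustration, not something the route in this post does.

```javascript
// Hypothetical helper (not part of the post's code): builds the search term
// with the same trim() + replace() chain used in the route.
function toSearchTerm(title) {
  // ASSUMPTION: anything that is not a string is treated as "no search term".
  if (typeof title !== 'string') return null;

  const term = title.trim().replace(/\s+/g, '+');
  return term === '' ? null : term;
}

console.log(toSearchTerm('  final   fantasy ')); // final+fantasy
console.log(toSearchTerm('   '));                // null
console.log(toSearchTerm(undefined));            // null
```

In the route, a null result could be answered with a 400 status instead of letting req.body.game.trim() throw when the property is missing.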

2. Using Cheerio to scrape data

Finally, we're going to use cheerio.

  1. Now that we have our game constant, we're going to use axios to make a request to our URL plus the game title.

  2. We are going to use a try/catch block. If we get a response, we store its data in a constant named html and then use cheerio to load it.

  3. We are going to create a constant named games that will store this value: $(".offer", html).

  • If you open your developer tools and go to the Elements tab, you will see that the .offer class belongs to a table like the image below.

[Image: developer tools]

  • If you take a look at this image, you will easily understand what is going on in the code.
  4. Now we are going to loop through that table to get each title. We can do that using .find(".product_name"), then .find("a"), and then taking the text() of the a tag.

.
.
.

app.post('/', async (req, res) => {
  const game = req.body.game.trim().replace(/\s+/g, '+')
  try {
    //a single request, inside the try block so a failure is caught
    const response = await axios.get(url + game)
    const html = response.data
    const $ = cheerio.load(html)

    const games = $(".offer", html)

    games.each((i, el) => {
      const gameTitle = $(el)
        .find(".product_name")
        .find("a")
        .text()
        .replace(/\s+/g, ' ')
        .trim()

      console.log(gameTitle)
    })
  } catch (error) {
    console.log(error)
  }
})

.
.
.

  • If you try this with console.log(gameTitle), you will get output like this:

Final Fantasy VII
Final Fantasy III
Final Fantasy
Final Fantasy VIII
Final Fantasy II
.
.
.

  • Now let's add more fields; for this example I want an id, a cover image, and a system.

.
.
.

app.post('/', async (req, res) => {
  const game = req.body.game.trim().replace(/\s+/g, '+')
  try {
    //a single request, inside the try block so a failure is caught
    const response = await axios.get(url + game)
    const html = response.data
    const $ = cheerio.load(html)

    const games = $(".offer", html)

    games.each((i, el) => {
      const gameTitle = $(el)
        .find(".product_name")
        .find("a")
        .text()
        .replace(/\s+/g, ' ')
        .trim()

      //the row id without its first 8 characters
      const id = $(el).attr('id').slice(8)

      //cover image
      const coverImage = $(el).find(".photo").find("img").attr("src")

      //the system name is the text node right after the first <br>
      const gameSystem = $(el)
        .find("br")
        .get(0)
        .nextSibling.nodeValue.replace(/\n/g, "")
        .trim()
    })
  } catch (error) {
    console.log(error)
  }
})

.
.
.
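The string clean-up in the loop can be tried on its own, without making a request. The sketch below uses two hypothetical helpers, cleanText and extractId, and made-up input values. In particular, the product- prefix is only an assumption based on slice(8) dropping eight characters, so check the real id attribute in your developer tools.

```javascript
// Hypothetical helper: collapses runs of whitespace (including newlines)
// into single spaces and trims the ends, like the gameTitle chain does.
function cleanText(raw) {
  return raw.replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: drops the first 8 characters, like slice(8) in the route.
// ASSUMPTION: the id attribute looks like "product-12345".
function extractId(rawId) {
  return rawId.slice(8);
}

console.log(cleanText('\n   Final   Fantasy\n VII  ')); // Final Fantasy VII
console.log(extractId('product-12345'));                // 12345
```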

3. Sending the response

Let's store this data so we can send it back. To do this, let's create an array named videoGames and push one object per game into it.

.
.
.

const url = "https://www.pricecharting.com/search-products?q=";

app.post('/', async (req, res) => {
  const game = req.body.game.trim().replace(/\s+/g, '+')
  //created per request, so results from earlier searches don't pile up
  const videoGames = []

  try {
    //a single request, inside the try block so a failure is caught
    const response = await axios.get(url + game)
    const html = response.data
    const $ = cheerio.load(html)

    const games = $(".offer", html)

    games.each((i, el) => {
      const gameTitle = $(el)
        .find(".product_name")
        .find("a")
        .text()
        .replace(/\s+/g, ' ')
        .trim()

      const id = $(el).attr('id').slice(8)

      //cover image
      const coverImage = $(el).find(".photo").find("img").attr("src")

      const gameSystem = $(el)
        .find("br")
        .get(0)
        .nextSibling.nodeValue.replace(/\n/g, "")
        .trim()

      //push inside the loop, where this row's values are in scope
      videoGames.push({
        id,
        gameTitle,
        coverImage,
        gameSystem
      })
    })

    res.json(videoGames)
  } catch (error) {
    console.log(error)
    //respond on failure too, so the client isn't left waiting
    res.status(500).json({ error: 'Could not fetch the search results' })
  }
})
.
.
.

If you try the route again you will get a result similar to the image below.

[Image: response]
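Where videoGames is declared matters: an array that lives at module level keeps its contents between requests, while one created inside the handler starts empty every time. Here is a minimal sketch of the difference, using hypothetical function names and made-up rows in place of a real request.

```javascript
// Made-up rows standing in for one scrape result (not real site data).
const scrapedRows = [
  { id: '1', gameTitle: 'Final Fantasy VII' },
  { id: '2', gameTitle: 'Final Fantasy VIII' },
];

// Shared array: lives for as long as the server process does.
const sharedResults = [];
function handleWithSharedArray(rows) {
  rows.forEach((row) => sharedResults.push(row));
  return sharedResults.length;
}

// Fresh array: created on every call, like a const inside the route handler.
function handleWithFreshArray(rows) {
  const results = [];
  rows.forEach((row) => results.push(row));
  return results.length;
}

console.log(handleWithSharedArray(scrapedRows)); // 2
console.log(handleWithSharedArray(scrapedRows)); // 4, the first search is still in there
console.log(handleWithFreshArray(scrapedRows));  // 2
console.log(handleWithFreshArray(scrapedRows));  // 2
```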

Optionally, I made an array of the systems I want to keep, because I didn't want to receive the same title for both the PAL and NTSC systems, so I kept only the default system (NTSC).

.
.
.

const system = [
  "Nintendo DS",
  "Nintendo 64",
  "Nintendo NES",
  "Nintendo Switch",
  "Super Nintendo",
  "Gamecube",
  "Wii",
  "Wii U",
  "Switch",
  "GameBoy",
  "GameBoy Color",
  "GameBoy Advance",
  "Nintendo 3DS",
  "Playstation",
  "Playstation 2",
  "Playstation 3",
  "Playstation 4",
  "Playstation 5",
  "PSP",
  "Playstation Vita",
  "PC Games",
]

.
.
.

app.post('/', async (req, res) => {
  .
  .
  .

  if (!system.includes(gameSystem)) return;
  videoGames.push({
    id,
    gameTitle,
    coverImage,
    gameSystem,
  });
  .
  .
  .
})
.
.
.
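The filter boils down to a membership check. Here is a small sketch of the same idea outside the route, with a shortened list and made-up entries; the names allowedSystems and keepAllowed are hypothetical, and a Set is used instead of an array only for illustration.

```javascript
// Shortened allow-list; the full list in the post has more entries.
const allowedSystems = new Set(['Playstation', 'Super Nintendo', 'Nintendo 64']);

// Hypothetical helper: keeps only the games whose system is in the allow-list.
function keepAllowed(games, allowed) {
  return games.filter((game) => allowed.has(game.gameSystem));
}

// Made-up entries for illustration (not scraped data).
const sample = [
  { gameTitle: 'Final Fantasy VII', gameSystem: 'Playstation' },
  { gameTitle: 'Final Fantasy VII', gameSystem: 'PAL Playstation' },
  { gameTitle: 'Final Fantasy III', gameSystem: 'Super Nintendo' },
];

console.log(keepAllowed(sample, allowedSystems).length); // 2
```

Inside .each(), the version in this post does the same check and uses return to skip the current row and move on to the next one.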

4. Organizing our code

Let's organize it a little bit: create a folder named routes inside src, then create a file named index.js in it.

Copy and paste the code below.

const { Router } = require('express')
const cheerio = require("cheerio")
const axios = require("axios")
const router = Router()

const url = "https://www.pricecharting.com/search-products?q="

//systems we want to keep in the results
const system = [
  "Nintendo DS",
  "Nintendo 64",
  "Nintendo NES",
  "Nintendo Switch",
  "Super Nintendo",
  "Gamecube",
  "Wii",
  "Wii U",
  "Switch",
  "GameBoy",
  "GameBoy Color",
  "GameBoy Advance",
  "Nintendo 3DS",
  "Playstation",
  "Playstation 2",
  "Playstation 3",
  "Playstation 4",
  "Playstation 5",
  "PSP",
  "Playstation Vita",
  "PC Games",
]

router.post('/', async (req, res) => {
  const game = req.body.game.trim().replace(/\s+/g, '+')
  //created per request, so results from earlier searches don't pile up
  const videoGames = []

  try {
    //a single request, inside the try block so a failure is caught
    const response = await axios.get(url + game)
    const html = response.data
    const $ = cheerio.load(html)
    const games = $(".offer", html)

    games.each((i, el) => {
      const gameTitle = $(el)
        .find(".product_name")
        .find("a")
        .text()
        .replace(/\s+/g, ' ')
        .trim()

      const id = $(el).attr('id').slice(8)
      const coverImage = $(el).find(".photo").find("img").attr("src")

      const gameSystem = $(el)
        .find("br")
        .get(0)
        .nextSibling.nodeValue.replace(/\n/g, "")
        .trim()

      //skip rows whose system is not in our list
      if (!system.includes(gameSystem)) return
      videoGames.push({
        id,
        gameTitle,
        coverImage,
        gameSystem,
        backlog: false
      })
    })

    res.json(videoGames)
  } catch (error) {
    console.log(error)
    //respond on failure too, so the client isn't left waiting
    res.status(500).json({ error: 'Could not fetch the search results' })
  }
})

module.exports = router

Now let's go back to our main file, src/index.js, and leave the code like this.

const express = require('express')

//routes
const main = require('./routes/index')


const app = express()


//middlewares
app.use(express.json())

//routes
app.use(main)


app.listen(3000, () => {
  console.log('Server running on port 3000')
})

If you try it, you will see that it still works without any trouble.
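One case worth trying is a failed request to the site: the client still needs an answer when that happens. Below is a sketch of how an axios error could be turned into a status code and a message. The helper name toErrorResponse and the chosen status codes are assumptions for illustration; the only things taken from axios are the response and request properties it sets on its errors.

```javascript
// Hypothetical helper: maps an axios-style error to a status code and message.
function toErrorResponse(error) {
  if (error && error.response) {
    // The site answered, but with a status outside the 2xx range.
    return {
      status: 502,
      message: `The site responded with status ${error.response.status}`,
    };
  }
  if (error && error.request) {
    // The request was sent, but no response came back.
    return { status: 504, message: 'No response from the site' };
  }
  // Anything else, for example a bug in our own parsing code.
  return { status: 500, message: 'Unexpected error' };
}

// Possible use inside the catch block of the route:
//   const { status, message } = toErrorResponse(error)
//   res.status(status).json({ error: message })

console.log(toErrorResponse({ response: { status: 404 } }).status); // 502
console.log(toErrorResponse({ request: {} }).status);               // 504
console.log(toErrorResponse(new TypeError('boom')).status);         // 500
```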

5. Conclusion

We learned how to make a simple scraper with cheerio.

I really hope you have been able to follow the post without any trouble; if not, I apologize. Please leave me your questions or comments.

I plan to write a follow-up post extending this code, adding more routes, MongoDB, and a front end.

You can contact me by telegram if you need to hire a Full Stack developer.

You can also find me on discord as Appu#9136

You can clone the repo if you want.

Thanks for your time.
