How to create a scraper with Cheerio
Robert
Posted on June 13, 2022
In this post we will learn how to scrape a website using cheerio, and then create an API from the scraped data with node.js that you can later use with a frontend.
The website that we will be using for this example is pricecharting.
You can contact me by telegram if you need to hire a Full Stack developer.
You can also find me on discord as Appu#9136.
You can clone the repo if you want.
This example is only for learning purposes.
Creating our Project.
- Open your terminal and type the following:
- mkdir node-cheerio-tut
- cd node-cheerio-tut
- npm init --y
- code .
Dependencies.
To install the dependencies, go to your project folder, open a terminal and type the following:
npm i axios cheerio express mongoose
And for dev dependencies type
npm i -D nodemon
Project file structure:
node-cheerio-tut/
├── node_modules/
├── public/
├── src/
│ ├── routes/
│ ├── database.js
│ └── index.js
└── package.json
First go to your package.json and add these lines.
"scripts": {
  "start": "node ./src/index.js",
  "dev": "nodemon ./src/index.js"
},
Let's code.
1. Setup the project
Let's go to index.js inside the src folder and set up our basic server with express.
const express = require('express')
const app = express()
//server
app.listen(3000, () => {
  console.log('listening on port 3000')
})
Now let's run the command npm run dev and we should get this message:
listening on port 3000
Now in our index.js let's import axios and cheerio, then I will explain the code below.
- We are going to add a const url with the URL value, in this case https://www.pricecharting.com/search-products?q= (when you do a search on this website, you are redirected to a new page, with a new route and a parameter whose value is the name you searched for).
So we are going to use that URL. The website also has two types of search, one by price and another by market; if we don't specify the type in the URL, it uses the market type by default. I leave it like this because the market search returns the cover of the game and the system (we will use them later).
We will add the middleware app.use(express.json()) because we don't want to get undefined when we read the body of the POST request.
We will create a route with the post method to send a body to our server (I am going to use the REST Client VS Code extension to test the API, but you can use Postman or whatever you want).
test.http
POST http://localhost:3000
Content-Type: application/json

{
  "game": "final fantasy"
}
final fantasy
As you can see, the server receives the value we sent in the body; in this case I named the property game.
const axios = require("axios");
const cheerio = require("cheerio");
const express = require('express')
//initializations
const app = express()
const url = "https://www.pricecharting.com/search-products?q="
//middlewares
app.use(express.json())
app.post('/', async (req, res) => {
  // console.log(req.body.game)
  const game = req.body.game.trim().replace(/\s+/g, '+')
})
//server
app.listen(3000, () => {
  console.log('listening on port 3000')
})
- Now we are going to create a constant named game that will store the value from req.body.game, and then we will use some methods to turn it into a result like this: final+fantasy.
First we're going to use trim() to remove the whitespace characters from the start and end of the string. Then we will replace the whitespace between the words with a + symbol using replace(/\s+/g, '+').
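As a small illustration of this step, here is a sketch of the same transformation as a standalone function. The names toSearchQuery and buildSearchUrl are made up for this example and are not part of the original code. The second function is an optional alternative that lets Node's built-in URL class do the encoding (it also writes spaces as +), which may be handy if a title contains characters such as &.

```javascript
// Base search URL used in the post.
const url = "https://www.pricecharting.com/search-products?q=";

// Same steps as in the route: trim the ends, then turn each run of
// whitespace between words into a single "+".
function toSearchQuery(title) {
  return title.trim().replace(/\s+/g, "+");
}

// Alternative sketch (not from the original post): let the built-in URL
// class encode the query value. Spaces are serialized as "+".
function buildSearchUrl(title) {
  const searchUrl = new URL(url);
  searchUrl.searchParams.set("q", title.trim().replace(/\s+/g, " "));
  return searchUrl.toString();
}

console.log(toSearchQuery("  final   fantasy ")); // final+fantasy
console.log(url + toSearchQuery("final fantasy"));
console.log(buildSearchUrl("final fantasy"));
```

For a plain title such as "final fantasy", both approaches should produce the same URL.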
2. Using Cheerio to scrape data
Finally we're going to use cheerio.
Now that we have our game constant we're going to use axios to make a request to our URL + the game title.
We are going to use a try catch block: if we get a response, we will store its data in a constant named html, and then we will use cheerio to load that data. We are going to create a constant named games that will store this value: $(".offer", html).
- If you open your developer tools and go to the Elements tab you will see that the .offer class belongs to a table like the image below.
- If you take a look at this image you will easily understand what is going on in the code.
- Now we are going to loop through that table to get each title, and we can do that using .find(".product_name"), then .find("a"), and then we want the text() from the a tag.
.
.
.
app.post('/', async (req, res) => {
  const game = req.body.game.trim().replace(/\s+/g, '+')
  try {
    const response = await axios.get(url + game)
    const html = response.data;
    const $ = cheerio.load(html)
    const games = $(".offer", html)
    games.each((i, el) => {
      const gameTitle = $(el)
        .find(".product_name")
        .find("a")
        .text()
        .replace(/\s+/g, ' ')
        .trim()
      console.log(gameTitle)
    })
  } catch (error) {
    console.log(error)
  }
})
.
.
.
- If you try this with console.log(gameTitle) you will get a message like this.
Final Fantasy VII
Final Fantasy III
Final Fantasy
Final Fantasy VIII
Final Fantasy II
.
.
.
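The .replace(/\s+/g, ' ').trim() calls at the end of the chain are what turn the raw text of the a tag into a clean title. Here is a minimal sketch of that cleanup on its own, using an invented raw string (the real text returned by the page may be laid out differently); cleanText is a hypothetical helper name.

```javascript
// Collapse every run of whitespace (spaces, tabs, newlines) into one
// space, then strip the ends. This mirrors the end of the chain above.
function cleanText(raw) {
  return raw.replace(/\s+/g, " ").trim();
}

// Hypothetical raw text, as it might come out of an <a> tag whose
// markup is spread over several indented lines.
const raw = "\n      Final Fantasy\n      VII\n    ";
console.log(JSON.stringify(cleanText(raw))); // "Final Fantasy VII"
```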
- Now let's add more fields. For this example I want an id, a cover image and a system.
.
.
.
app.post('/', async (req, res) => {
  const game = req.body.game.trim().replace(/\s+/g, '+')
  try {
    const response = await axios.get(url + game)
    const html = response.data;
    const $ = cheerio.load(html)
    const games = $(".offer", html)
    games.each((i, el) => {
      const gameTitle = $(el)
        .find(".product_name")
        .find("a")
        .text()
        .replace(/\s+/g, ' ')
        .trim()
      //id: the row's id attribute without its first 8 characters
      const id = $(el).attr('id').slice(8);
      //cover image
      const coverImage = $(el).find(".photo").find("img").attr("src");
      //system: the text node that follows the first <br> in the row
      const gameSystem = $(el)
        .find("br")
        .get(0)
        .nextSibling.nodeValue.replace(/\n/g, "")
        .trim();
    })
  } catch (error) {
    console.log(error)
  }
})
.
.
.
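The slice(8) call drops the first eight characters of the row's id attribute. Here is a minimal sketch of what that does, assuming the id has an eight-character prefix such as product- followed by a number. The sample value is invented for illustration, so check the real markup in your developer tools. The extractId helper is also hypothetical: it only shows one way to avoid an exception when the attribute is missing.

```javascript
// Hypothetical id attribute value; the real format may differ.
const rowId = "product-12345";

// slice(8) keeps everything from index 8 onward, i.e. it removes the
// first eight characters ("product-" here).
const id = rowId.slice(8);
console.log(id); // 12345

// A slightly more defensive variant (an assumption, not the post's code):
// strip a known prefix only when it is present, and tolerate a missing id.
function extractId(attr, prefix = "product-") {
  if (typeof attr !== "string") return null;
  return attr.startsWith(prefix) ? attr.slice(prefix.length) : attr;
}

console.log(extractId("product-678")); // 678
console.log(extractId(undefined)); // null
```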
3. Sending the response
Let's store this data in an array. To do this, let's create an array named videoGames.
.
.
.
const url = "https://www.pricecharting.com/search-products?q=";
app.post('/', async (req, res) => {
  const game = req.body.game.trim().replace(/\s+/g, '+')
  //created per request so results from earlier searches don't pile up
  const videoGames = []
  try {
    const response = await axios.get(url + game)
    const html = response.data;
    const $ = cheerio.load(html)
    const games = $(".offer", html)
    games.each((i, el) => {
      const gameTitle = $(el)
        .find(".product_name")
        .find("a")
        .text()
        .replace(/\s+/g, ' ')
        .trim()
      const id = $(el).attr('id').slice(8);
      //cover image
      const coverImage = $(el).find(".photo").find("img").attr("src");
      const gameSystem = $(el)
        .find("br")
        .get(0)
        .nextSibling.nodeValue.replace(/\n/g, "")
        .trim();
      //push inside the loop, where this row's values are in scope
      videoGames.push({
        id,
        gameTitle,
        coverImage,
        gameSystem
      })
    })
    res.json(videoGames)
  } catch (error) {
    console.log(error)
    //reply so the client is not left waiting
    res.status(500).json({ error: 'Could not fetch results' })
  }
})
.
.
.
If you try the route again you will get a result similar to the image below.
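One detail worth keeping in mind with an array like videoGames is where it is declared. An array created outside the route handler lives as long as the server process, so every request keeps appending to the results of the previous ones, while an array created inside the handler starts empty each time. The sketch below shows the difference with two plain functions standing in for the handler (all names here are hypothetical):

```javascript
// Shared array: declared once, outside the function.
const sharedResults = [];
function searchShared(titles) {
  titles.forEach((t) => sharedResults.push({ gameTitle: t }));
  return sharedResults;
}

// Local array: a fresh one per call.
function searchLocal(titles) {
  const results = [];
  titles.forEach((t) => results.push({ gameTitle: t }));
  return results;
}

searchShared(["Final Fantasy"]);
console.log(searchShared(["Chrono Trigger"]).length); // 2: the first search is still there

searchLocal(["Final Fantasy"]);
console.log(searchLocal(["Chrono Trigger"]).length); // 1
```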
Optionally, I made an array of systems so that only certain ones are returned, because I didn't want to receive the same title for both the PAL and NTSC systems, so I kept the default system (NTSC).
.
.
.
const system = [
"Nintendo DS",
"Nintendo 64",
"Nintendo NES",
"Nintendo Switch",
"Super Nintendo",
"Gamecube",
"Wii",
"Wii U",
"Switch",
"GameBoy",
"GameBoy Color",
"GameBoy Advance",
"Nintendo 3DS",
"Playstation",
"Playstation 2",
"Playstation 3",
"Playstation 4",
"Playstation 5",
"PSP",
"Playstation Vita",
"PC Games",
]
.
.
.
app.post('/', async (req, res) => {
.
.
.
if (!system.includes(gameSystem)) return;
videoGames.push({
id,
gameTitle,
coverImage,
gameSystem,
});
.
.
.
})
.
.
.
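To make the filtering step concrete, here is a small standalone sketch. It uses a shortened version of the list and a few invented sample rows (the PAL entry is there only to show a row being skipped). The Set is an optional variation on the array includes() check used in the post.

```javascript
// Shortened allow-list of systems (the post uses a longer one).
const system = ["Nintendo 64", "Super Nintendo", "Playstation", "Playstation 2"];
const allowed = new Set(system);

// Invented sample rows standing in for scraped results.
const scraped = [
  { gameTitle: "Final Fantasy VII", gameSystem: "Playstation" },
  { gameTitle: "Final Fantasy VII", gameSystem: "PAL Playstation" },
  { gameTitle: "Final Fantasy III", gameSystem: "Super Nintendo" },
];

const videoGames = [];
scraped.forEach((row) => {
  // Same idea as `if (!system.includes(gameSystem)) return;` in the loop:
  // returning early from the callback just skips this row.
  if (!allowed.has(row.gameSystem)) return;
  videoGames.push(row);
});

console.log(videoGames.map((g) => g.gameSystem));
```

Only the rows whose system is in the list end up in videoGames; the PAL row is dropped.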
4. Organizing our code
Let's organize it a little bit: create a folder named routes in src, then create a file named index.js inside it.
Copy and paste the code below.
const {Router} = require('express')
const cheerio = require("cheerio");
const axios = require("axios");
const router = Router()
const url = "https://www.pricecharting.com/search-products?q="
const system = [
  "Nintendo DS",
  "Nintendo 64",
  "Nintendo NES",
  "Nintendo Switch",
  "Super Nintendo",
  "Gamecube",
  "Wii",
  "Wii U",
  "Switch",
  "GameBoy",
  "GameBoy Color",
  "GameBoy Advance",
  "Nintendo 3DS",
  "Playstation",
  "Playstation 2",
  "Playstation 3",
  "Playstation 4",
  "Playstation 5",
  "PSP",
  "Playstation Vita",
  "PC Games",
]
router.post('/', async (req, res) => {
  const game = req.body.game.trim().replace(/\s+/g, '+')
  //created per request so results from earlier searches don't pile up
  const videoGames = []
  try {
    const response = await axios.get(url + game)
    const html = response.data;
    const $ = cheerio.load(html)
    const games = $(".offer", html)
    games.each((i, el) => {
      const gameTitle = $(el)
        .find(".product_name")
        .find("a")
        .text()
        .replace(/\s+/g, ' ')
        .trim()
      const id = $(el).attr('id').slice(8);
      const coverImage = $(el).find(".photo").find("img").attr("src");
      const gameSystem = $(el)
        .find("br")
        .get(0)
        .nextSibling.nodeValue.replace(/\n/g, "")
        .trim();
      //skip rows whose system is not in the list
      if (!system.includes(gameSystem)) return;
      videoGames.push({
        id,
        gameTitle,
        coverImage,
        gameSystem,
        backlog: false
      });
    })
    res.json(videoGames)
  } catch (error) {
    console.log(error)
    //reply so the client is not left waiting
    res.status(500).json({ error: 'Could not fetch results' })
  }
})
module.exports = router
Now let's go back to our main file, index.js in src, and leave the code like this.
const express = require('express')
//routes
const main = require('./routes/index')
const app = express()
//middlewares
app.use(express.json())
//routes
app.use(main)
app.listen(3000, () => {
  console.log('Server running on port 3000')
})
If you try it you will see that it still works without any troubles.
5. Conclusion
We learned how to make a simple scraper with cheerio.
I really hope you have been able to follow the post without any trouble; otherwise I apologize. Please leave me your questions or comments.
I plan to write a follow-up post extending this code, adding more routes, MongoDB, and a front end.
Thanks for your time.