Strapi, another use case: Build your own API from any website with Puppeteer
ELABBASSI Hicham
Posted on April 17, 2020
The objective of this tutorial is to build a simple job search API with Strapi and Puppeteer. Strapi is an open-source headless CMS written in Node.js, and Puppeteer is an open-source Node.js API for driving a headless browser (Chrome).
It seems that headless tools are having their moment... (That said, there is no direct link between Strapi and Puppeteer except the word "headless".)
Strapi
Strapi lets you build powerful APIs with very little effort. It offers several features, including CRON task configuration, which is handy because we will use it to schedule the Puppeteer script.
1. Strapi installation
Well, let's start this tutorial. The first thing we need to do is to install Strapi.
yarn create strapi-app job-api --quickstart
If you don't want to use yarn, the documentation describes other ways to install Strapi.
2. Strapi admin user
This command should install Strapi and open your browser. Then, you will be able to create your admin user.
3. Job Collection type
On the Strapi admin home page, click the blue CREATE YOUR FIRST CONTENT-TYPE button.
You will be redirected to the collection type creation form.
After that, you will be able to add fields to the Job collection type.
For our basic example, we will need to create five text fields (title, linkedinUrl, companyName, descriptionSnippet, and timeFromNow).
Don't forget to click the Save button, which restarts the Strapi server.
After that, we can put the Strapi admin aside for the moment and open the Strapi project in an editor.
Strapi CRON task
First, we need to enable CRON in the Strapi server configuration.
Open the config/environments/development/server.json file:
{
  "host": "localhost",
  "port": 1337,
  "proxy": {
    "enabled": false
  },
  "cron": {
    "enabled": true
  },
  "admin": {
    "autoOpen": false
  }
}
Then let's create the CRON task. Open the ~/job-api/config/functions/cron.js file and replace its content with the following:
"use strict";

module.exports = {
  // The cron should display "{date} : My super cron task!" every minute.
  "*/1 * * * *": (date) => {
    console.log(`${date} : My super cron task!\n`);
  },
};
Now, restart the Strapi server and let's see if our cron task is running properly.
yarn develop
yarn run v1.21.1
$ strapi develop
Project information
┌────────────────────┬──────────────────────────────────────────────────┐
│ Time               │ Thu Apr 16 2020 01:40:49 GMT+0200 (GMT+02:00)    │
│ Launched in        │ 1647 ms                                          │
│ Environment        │ development                                      │
│ Process PID        │ 20988                                            │
│ Version            │ 3.0.0-beta.18.7 (node v10.16.0)                  │
└────────────────────┴──────────────────────────────────────────────────┘
Actions available
Welcome back!
To manage your project 🚀, go to the administration panel at:
http://localhost:1337/admin
To access the server ⚡️, go to:
http://localhost:1337
Thu Apr 16 2020 01:41:00 GMT+0200 (GMT+02:00) : My super cron task!
Thu Apr 16 2020 01:42:00 GMT+0200 (GMT+02:00) : My super cron task!
Thu Apr 16 2020 01:43:00 GMT+0200 (GMT+02:00) : My super cron task!
...
We can see that "{date} : My super cron task!" is displayed every minute in the terminal.
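To make the key used in cron.js easier to read, here is a minimal sketch that splits a standard five-field cron expression into named fields. The describeCron helper and the CRON_FIELDS constant are hypothetical names introduced for illustration; they are not part of Strapi.

```javascript
"use strict";

// Names of the five fields of a standard cron expression, in order.
const CRON_FIELDS = ["minute", "hour", "dayOfMonth", "month", "dayOfWeek"];

// Hypothetical helper (not a Strapi API): split a five-field cron
// expression into an object keyed by field name.
function describeCron(expression) {
  const parts = expression.trim().split(/\s+/);
  if (parts.length !== CRON_FIELDS.length) {
    throw new Error(`Expected 5 fields, got ${parts.length}`);
  }
  const fields = {};
  CRON_FIELDS.forEach((name, i) => {
    fields[name] = parts[i];
  });
  return fields;
}

// "*/1" in the minute field means "every minute".
console.log(describeCron("*/1 * * * *"));
```

Reading the expression this way shows that only the minute field is constrained in our task; every other field is a wildcard.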
Puppeteer
Puppeteer can automate almost any action you can perform in the browser. You can use it to automate flows, take screenshots, and generate PDFs. In this tutorial, we will use Puppeteer to get the list of ReactJS jobs from LinkedIn. We will also use Cheerio to select the data in the received markup.
Now that the CRON task is working well, we will install Puppeteer and Cheerio in the Strapi project.
cd job-api
yarn add puppeteer cheerio
Let's adapt the CRON task to get the list of ReactJS jobs published on LinkedIn in the last 24 hours in San Francisco.
In the ~/job-api/config/functions/cron.js file:
"use strict";

// Require the puppeteer module.
const puppeteer = require("puppeteer");

module.exports = {
  // Execute this script once a day. (If you need to change the cron
  // expression, you can use an online cron expression editor like
  // https://crontab.guru)
  "0 */24 * * *": async (date) => {
    // 1 - Create a new browser.
    const browser = await puppeteer.launch({
      args: ["--no-sandbox", "--disable-setuid-sandbox", "--lang=fr-FR"],
    });
    // 2 - Open a new page on that browser.
    const page = await browser.newPage();
    // 3 - Navigate to the LinkedIn URL with the right filters.
    await page.goto(
      "https://fr.linkedin.com/jobs/search?keywords=React.js&location=R%C3%A9gion%20de%20la%20baie%20de%20San%20Francisco&trk=guest_job_search_jobs-search-bar_search-submit&redirect=false&position=1&pageNum=0&f_TP=1"
    );
    // 4 - Get the content of the page (it is parsed in the next step).
    const content = await page.content();
    // 5 - Close the browser so that no Chrome process is left running.
    await browser.close();
  },
};
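The long search URL is easier to adapt when it is assembled from its parts. The sketch below is one way to do that with Node's built-in URL class. buildSearchUrl is a hypothetical helper; the parameter names are copied from the URL above, the trk tracking parameter is left out, and reading f_TP=1 as "posted in the last 24 hours" is an assumption based on the description above, not on any LinkedIn documentation.

```javascript
"use strict";

// Hypothetical helper: rebuild the LinkedIn job search URL used in the task.
// Parameter names come from the tutorial's URL. Treating f_TP=1 as the
// "last 24 hours" filter is an assumption.
function buildSearchUrl({ keywords, location, pageNum = 0 }) {
  const url = new URL("https://fr.linkedin.com/jobs/search");
  url.searchParams.set("keywords", keywords);
  url.searchParams.set("location", location);
  url.searchParams.set("redirect", "false");
  url.searchParams.set("position", "1");
  url.searchParams.set("pageNum", String(pageNum));
  url.searchParams.set("f_TP", "1");
  return url.toString();
}

const searchUrl = buildSearchUrl({
  keywords: "React.js",
  location: "Région de la baie de San Francisco",
});
console.log(searchUrl);
```

The result can be passed to page.goto(searchUrl). Changing the keywords or the location then no longer requires hand-encoding the query string.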
Next, parse the HTML content with Cheerio and store each job with the Strapi global.
"use strict";

const puppeteer = require("puppeteer");
const cheerio = require("cheerio");

module.exports = {
  "0 */24 * * *": async (date) => {
    const browser = await puppeteer.launch({
      args: ["--no-sandbox", "--disable-setuid-sandbox", "--lang=fr-FR"],
    });
    const page = await browser.newPage();
    await page.goto(
      "https://fr.linkedin.com/jobs/search?keywords=React.js&location=R%C3%A9gion%20de%20la%20baie%20de%20San%20Francisco&trk=guest_job_search_jobs-search-bar_search-submit&redirect=false&position=1&pageNum=0&f_TP=1"
    );
    const content = await page.content();
    // 1 - Load the HTML.
    const $ = cheerio.load(content);
    // 2 - Select the HTML elements you need.
    // For this tutorial, we select the list of jobs and, for each element,
    // build a job object to store in the database with Strapi.
    const jobs = [];
    $("li.result-card.job-result-card").each((i, el) => {
      if (Array.isArray(el.children)) {
        // 3 - Build the job object from the element's child nodes.
        jobs.push({
          title: el.children[0].children[0].children[0].data,
          linkedinUrl: el.children[0].attribs.href,
          companyName:
            el.children[2].children[1].children[0].data ||
            el.children[2].children[1].children[0].children[0].data,
          descriptionSnippet:
            el.children[2].children[2].children[1].children[0].data,
          timeFromNow: el.children[2].children[2].children[2].children[0].data,
        });
      }
    });
    // 4 - Store the jobs with the Strapi global and wait for the writes.
    await Promise.all(jobs.map((job) => strapi.services.job.create(job)));
    // 5 - Close the browser.
    await browser.close();
  },
};
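Scraped markup is rarely uniform, so some job objects may be rejected when they are stored. As a minimal sketch, the helper below stores jobs one at a time and reports which ones failed instead of stopping at the first error. saveJobs and fakeCreate are hypothetical names; the create function is injected so the example runs without a database, and inside the CRON task it could be (job) => strapi.services.job.create(job).

```javascript
"use strict";

// Hypothetical helper: store jobs sequentially and collect failures.
// `create` is injected; inside Strapi it could be
// (job) => strapi.services.job.create(job), as in the task above.
async function saveJobs(jobs, create) {
  const saved = [];
  const failed = [];
  for (const job of jobs) {
    try {
      saved.push(await create(job));
    } catch (error) {
      failed.push({ job, error });
    }
  }
  return { saved, failed };
}

// Stand-in for the Strapi service: rejects jobs without a title.
const fakeCreate = async (job) => {
  if (!job.title) throw new Error("title is required");
  return { id: 1, ...job };
};

saveJobs([{ title: "React developer" }, {}], fakeCreate).then((result) => {
  console.log(`${result.saved.length} saved, ${result.failed.length} failed`);
});
```

Storing sequentially is slower than firing all writes at once, but it keeps the load on the database predictable and makes each failure easy to attribute to a job.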
Restart the Strapi server and go back to the admin at http://localhost:1337/admin.
In the Job content manager, you should see the data from LinkedIn.
Good job! You've just built an API from another website in a few minutes.
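Once the jobs are stored, they can also be read over HTTP. The sketch below only builds the request URL. It assumes that this Strapi version exposes the Job collection type at /jobs, that it accepts Strapi v3-style query parameters (_limit and the _contains suffix), and that the find permission has been enabled for the relevant role in the admin. jobsUrl is a hypothetical helper, and "Acme" is a made-up company name.

```javascript
"use strict";

// Hypothetical helper: build a URL for the generated Job endpoint.
// Assumes the default /jobs route and Strapi v3-style query parameters.
function jobsUrl(baseUrl, { companyContains, limit = 10 } = {}) {
  const url = new URL("/jobs", baseUrl);
  url.searchParams.set("_limit", String(limit));
  if (companyContains) {
    url.searchParams.set("companyName_contains", companyContains);
  }
  return url.toString();
}

console.log(jobsUrl("http://localhost:1337", { companyContains: "Acme" }));
```

The resulting URL can be opened in a browser or passed to any HTTP client while the Strapi server is running.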