Getting Started With Puppeteer
Rohit Dalal
Posted on August 13, 2020
In this post, I will walk you through the basics of Puppeteer, a browser automation library for Node.js. Puppeteer is created and maintained by the Google Chrome team, and it's the de-facto standard when it comes to browser automation in JavaScript.
Let's get started with this post 🚀.
What is Puppeteer?
Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.
This is the definition from Puppeteer's official website. Simply stated, it is a headless browser API that lets you drive a Chrome or Chromium browser automatically with the code you write. Now you might ask, "What is a headless browser?" A headless browser is simply a browser without a GUI. You can also run Puppeteer in non-headless (GUI) mode, as the definition above mentions. More on that later.
It can do various things for you, some of which are listed below:
- Web scraping
- Taking screenshots of pages
- Generating PDFs of pages (see the sketch after this list)
- Automating repetitive tasks
- ... and many more.
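As a quick taste of the API, here is a minimal sketch that saves a page as a PDF. It assumes the full puppeteer package (which we install in the next section), and https://example.com is just a placeholder URL:
//pdfSketch.js — a minimal sketch, assuming the full puppeteer package
const puppeteer = require('puppeteer')

const toPdf = async () => {
  const browser = await puppeteer.launch() // headless by default
  const page = await browser.newPage()
  await page.goto('https://example.com')
  await page.pdf({ path: 'example.pdf' }) // writes example.pdf next to the script
  await browser.close()
}
toPdf()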
Let's see how to install this awesome package now!
Installation
There are two ways to install this library on your machine.
- The standard way (Library + Browser):
If you install this way, a recent build of the Chromium browser (~180MB) is downloaded into your project directory, which takes a while depending on your internet speed. After installation, you don't need any custom settings to run your code: Puppeteer registers this locally downloaded browser as the default for any code involving Puppeteer.
npm install --save puppeteer
Well, what if you don't want to download that ~180MB browser? There's an alternative for that.
- The short way (Only Library):
This is the shorter, lighter option that avoids the browser download: it installs only the core package (~3MB), not the browser. If you go this way, you must have a working installation of Chrome or Chrome Canary on your machine, and you point Puppeteer at it by passing the path of that installation in your code. (We will see this later in the post. Don't worry!)
npm install --save puppeteer-core
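With puppeteer-core, the only difference in code is the require and the executablePath option pointing at your local browser. Here is a minimal sketch (the Windows path below is just an example; yours will differ):
//coreSketch.js — a minimal sketch, assuming puppeteer-core and a local Chrome
const puppeteer = require('puppeteer-core')

const demo = async () => {
  const browser = await puppeteer.launch({
    // path of your locally installed Chrome (example Windows path)
    executablePath: 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
  })
  console.log(await browser.version()) // prints the browser version string
  await browser.close()
}
demo()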
Note
Please note that puppeteer-core is suitable only for development purposes. If you want to deploy such an application to the web, you must use the complete package, because the local executable path you specify while developing will be invalid in production.
If you want to read more on puppeteer vs puppeteer-core, here is the link
Now that we have completed the installation, let's write some code using this library.
Setup
In this post, we will see two working examples using Puppeteer.
Scraping Google Search Results
Taking a Screenshot of Any Webpage
To get up and running for this demo, create a new Node.js project by typing
npm init
After initialization, you can install the package in either of the ways mentioned above. If you are using the short way, there is only one place in the code where you have to make a change. That will become clear once we see the code in action.
Grab some coffee and let us see the examples in action.
Scraping Google Search Results
Here we will scrape the search results for any query of your choice from Google and store the scraped results in an array of objects. A real application might then write the scraped data to a database; I leave that part up to you.
First, we import puppeteer from puppeteer-core and create a browser object with puppeteer.launch(), passing it launchOptions, an object containing optional parameters. I have used async/await while writing this code. If you prefer .then(), you can use that as well; it is simply another way of handling the returned Promise.
Description of the launchOptions properties used here:
- headless: whether to run Puppeteer in headless mode. The default value is true.
- defaultViewport: an object with width and height properties, which does what its name suggests.
- executablePath: the path of the Chrome/Chrome Canary/Chromium installation on your machine. Here is an easy guide on how to find that path. You should use this property only if you are using puppeteer-core. The double backslashes ("\\") in the path are character escapes.
You can find a detailed list of arguments here.
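Put together, the options object used in this post looks roughly like this (drop executablePath entirely if you installed the full puppeteer package):
// a sketch of the launch options described above
let launchOptions = {
  headless: false, // show the browser window while the code runs
  defaultViewport: { width: 1536, height: 763 },
  // only needed with puppeteer-core; example Windows path
  executablePath: 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
}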
After this, we create a new page using browser.newPage(), which opens a new tab in the launched browser, and navigate to https://www.google.com/search?q=coffee to scrape search results from. Upon successful page load, we grab the page source using page.content(). If you print the scraped content at this point, you will see the entire page source in the console, but we are interested only in the title and link of each search result. For that, we use a separate package named cheerio, which can parse and manipulate the page source on the back-end/server the way jQuery does on the front-end.
Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server.
We parse the content using cheerio and store it in a variable named $ (chosen to mirror jQuery). A div with class 'r' is the container for both the title and the link of one search result. We loop over all div elements with class 'r' to get the title, which is an h3 heading with class "LC20lb DKV0Md". We then grab the link from the child anchor tag of the parent div by reading its href attribute with .attr('href'), push the {title, link} pair into the links array, and finish the process by closing the tab and the browser.
Here is the full working code for the same:
//scrapeGoogle.js
const puppeteer = require('puppeteer-core')
const cheerio = require('cheerio')

const run = async () => {
  let launchOptions = {
    headless: false, // to watch the execution as it happens
    executablePath:
      'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
  }
  let browser = await puppeteer.launch(launchOptions)
  let page = await browser.newPage()
  try {
    await page.goto('https://www.google.com/search?q=coffee', {
      waitUntil: 'domcontentloaded',
    })
  } catch (err) {
    // close the browser before rethrowing, otherwise it stays open
    await browser.close()
    if (err instanceof puppeteer.errors.TimeoutError) {
      throw new Error('Navigation timed out: ' + err.message)
    }
    throw err
  }
  let content = await page.content()
  // parse the page source with cheerio
  let $ = cheerio.load(content)
  var links = []
  $('.r').each(function (i, el) {
    var title = $(this).find('.LC20lb').text()
    var link = $(this).children('a').attr('href')
    if (title && link) {
      // skip results missing either field
      links.push({ title, link })
    }
  })
  console.log(links)
  await page.close()
  await browser.close()
}

run()
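When you run this, each element in the links array is a plain {title, link} object, so the printed output looks something like this (the values here are only illustrative):
// illustrative output of console.log(links); actual titles and links will differ
[
  { title: 'Coffee - Wikipedia', link: 'https://en.wikipedia.org/wiki/Coffee' },
  ...
]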
In this way, we have successfully scraped Google search results using Puppeteer. You can extend this further by adding more features and scraping more data. That completes the first example.
Taking a Screenshot of Any Webpage
This section is very similar to the previous one, except for the content scraping. We take the screenshot with page.screenshot(), which returns a Promise; on its successful resolution, the image is saved at the path you specify.
//screenshot.js
const puppeteer = require('puppeteer-core')

const ss = async () => {
  let launchOptions = {
    headless: false,
    executablePath:
      'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
    defaultViewport: {
      width: 1536,
      height: 763,
    },
  }
  let browser = await puppeteer.launch(launchOptions)
  let page = await browser.newPage()
  try {
    await page.goto('https://www.google.com/search?q=chelsea', {
      waitUntil: 'domcontentloaded',
    })
  } catch (err) {
    // close the browser before rethrowing, otherwise it stays open
    await browser.close()
    if (err instanceof puppeteer.errors.TimeoutError) {
      throw new Error('Navigation timed out: ' + err.message)
    }
    throw err
  }
  // main line: capture and save the screenshot
  await page.screenshot({ path: 'screenshot.png' })
  await page.close()
  await browser.close()
}

ss()
As said, everything here is the same except one line, where we take the screenshot and save it as 'screenshot.png'. The {path: 'your_path'} option is necessary; without it, Puppeteer will not save the screenshot to disk.
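By default, the screenshot covers only the current viewport. If you want the entire scrollable page, page.screenshot() also accepts a fullPage flag; swap the main line above for this:
// capture the full scrollable page instead of just the viewport
await page.screenshot({ path: 'screenshot.png', fullPage: true })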
Conclusion
Hooray, that's it for this post, guys. If you have any queries regarding it, feel free to contact me personally. If you liked this post, share it with your developer friends and on social media.
Thank you. See you next time ;)