Chrome® Powered Web Scraping with Puppeteer: Boosting Speed and Efficiency
Mersad
Posted on January 8, 2023
Chrome® Automation with Puppeteer: Scrape the Web with Style
Puppeteer is a Node.js library that provides a high-level API for controlling headless Chrome or Chromium over the DevTools Protocol. It is well suited to web scraping because it drives a real browser, so it can handle sites that depend on JavaScript rendering, cookies, and other features that a scraper working only on raw HTTP responses cannot process.
To use Puppeteer for web scraping, install it with npm (the Node Package Manager) by running npm install puppeteer. Once it is installed, you can require it in a Node.js script to control a headless Chrome browser programmatically and perform scraping tasks.
Here is a simple example of how to use Puppeteer to scrape a webpage:
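const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless browser
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://radiojavan.com');
    // Extract data from the page; this callback runs inside the browser
    const data = await page.evaluate(() => {
      // The selectors are illustrative; a missing element yields null instead of throwing
      const name = document.querySelector('h1')?.textContent.trim() ?? null;
      const price = document.querySelector('.price')?.textContent.trim() ?? null;
      return { name, price };
    });
    console.log(data);
  } finally {
    // Always close the browser, even if a step above throws
    await browser.close();
  }
})();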
In this example, Puppeteer opens a new page in a headless Chrome browser, navigates to the specified URL, and extracts data from the page with DOM query methods such as querySelector. The function passed to page.evaluate runs inside the browser, and its return value is sent back to the Node.js script, where it is logged to the console.
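When several elements match a selector, the same extraction can be written with Puppeteer's page.$$eval, which runs a callback in the browser against every matching element. The following is a minimal sketch rather than a definitive implementation; the helper name extractTexts is hypothetical, and page is assumed to be a Puppeteer Page obtained from browser.newPage().

```javascript
// Hypothetical helper: collect the trimmed text of every element matching a selector.
// `page` is assumed to be a Puppeteer Page, e.g. from `await browser.newPage()`.
async function extractTexts(page, selector) {
  // page.$$eval runs the callback in the browser with an array of all
  // matching elements; the array is empty when nothing matches.
  return page.$$eval(selector, (elements) =>
    elements.map((el) => el.textContent.trim())
  );
}
```

With a real page, this could be called as const titles = await extractTexts(page, 'h1'); an empty array then signals that the selector matched nothing.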
Puppeteer also provides many other useful features for web scraping, such as the ability to handle cookies, manipulate the DOM, and simulate user events like clicks and form submissions. With these capabilities, Puppeteer can scrape most modern websites, including those that build their content in the browser with JavaScript.
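As a rough illustration of simulating user events, the sketch below types a query into a form field, clicks the submit button, and waits for the resulting page. The helper name submitSearch and both default selectors are hypothetical placeholders, not taken from any particular site, and the code assumes that clicking the button triggers a full page navigation.

```javascript
// Hypothetical helper: fill in a search box, submit it, and wait for the next page.
// The default selectors are placeholders; inspect the target site for the real ones.
async function submitSearch(
  page,
  query,
  selectors = { input: '#search', submit: 'button[type="submit"]' }
) {
  // Make sure the input exists before typing into it
  await page.waitForSelector(selectors.input);
  await page.type(selectors.input, query);
  // Start waiting for the navigation before clicking so that it is not missed
  await Promise.all([
    page.waitForNavigation(),
    page.click(selectors.submit),
  ]);
  // Return the URL of the page that loaded after the submission
  return page.url();
}
```

If the site updates its results in place instead of navigating, page.waitForNavigation would time out; waiting for a results element with page.waitForSelector is the usual alternative in that case.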
Here is a real-world script that extracts artist names and song titles from a featured playlist and writes them to a text file, one entry per line:
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  // Browser config
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--start-maximized'],
    defaultViewport: null,
  });
  const page = await browser.newPage();
  await page.goto('https://www.radiojavan.com/');
  // Wait for the playlist grid, then open the third featured playlist
  await page.waitForSelector('.grid');
  await page.click('#featuredPlaylists > div.grid > a:nth-child(3) > img');
  // Wait for the song entries to render before reading them
  await page.waitForSelector('.song');
  // Runs in the browser: collect the trimmed text of every song and artist node
  const info = await page.evaluate(() => {
    const songs = [...document.querySelectorAll('.song')].map((el) => el.textContent.trim());
    const artists = [...document.querySelectorAll('.artist')].map((el) => el.textContent.trim());
    // All song titles first, followed by all artist names
    return songs.concat(artists);
  });
  // Write one entry per line (writeFileSync is synchronous, so no await is needed)
  fs.writeFileSync('info.txt', info.join('\r\n'));
  // console.log(info)
  await browser.close();
  console.log('Success');
})();
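The script above writes all song titles followed by all artist names, so the link between a song and its artist is lost in the output file. One possible refinement, sketched here under the assumption that the page lists exactly one .artist element per .song element in the same order, is to pair the two arrays by index and save the result as JSON. The helper names pairTracks and saveTracks and the file name tracks.json are hypothetical.

```javascript
const fs = require('fs');

// Hypothetical helper: pair each song title with the artist at the same index.
// Assumes one artist entry per song entry, in the same order; any unmatched
// trailing entries in the longer array are ignored.
function pairTracks(songs, artists) {
  const count = Math.min(songs.length, artists.length);
  const tracks = [];
  for (let i = 0; i < count; i++) {
    tracks.push({ artist: artists[i], title: songs[i] });
  }
  return tracks;
}

// Hypothetical helper: write the tracks as pretty-printed JSON so that they
// can be read back later with JSON.parse.
function saveTracks(tracks, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(tracks, null, 2));
}
```

To use this in the script, page.evaluate would need to return the song and artist arrays separately, for example as { songs, artists }, followed by saveTracks(pairTracks(songs, artists), 'tracks.json').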