Scrape images from a search engine with JavaScript and Puppeteer

antoine_m

Antoine Mesnil

Posted on November 4, 2022

Introduction

In the previous post of this series, we discovered how to use Node.js and Puppeteer to scrape and search content on web pages. I recommend reading it first if you have never used Puppeteer or need to set up the project.

In this article, we will fetch full-resolution images from a search engine. Our goal this time is to get a picture of every dog breed.

Script to get the image links

You should have Node.js and Puppeteer installed with npm or yarn.
We will use the same methods as in the first part.
Our list of dog breeds is a simple JSON file that can be found here: dog breeds dataset

As for the search engine, we will scrape DuckDuckGo because it makes it easy to get the images at full resolution, which can be trickier on Google Images.
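The search URL used by the script in this section can also be built with the standard URLSearchParams class, which takes care of spaces and special characters in the breed names. This is only a minimal sketch: the helper name buildSearchUrl is hypothetical, and the query parameters are copied as-is from the URL in the script.

```javascript
// Hypothetical helper: builds the DuckDuckGo image-search URL for one breed.
// The parameter names and values are copied from the URL used in the script.
function buildSearchUrl(dogBreed) {
  const params = new URLSearchParams({
    q: dogBreed, // URLSearchParams encodes spaces as "+"
    va: "b",
    t: "hc",
    iar: "images",
    iax: "images",
    ia: "images",
  })
  return `https://duckduckgo.com/?${params.toString()}`
}

console.log(buildSearchUrl("golden retriever"))
```

Compared with replaceAll(" ", "+"), this also escapes characters such as "&" that would otherwise break the query string.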

const puppeteer = require("puppeteer")
const data = require("./dog-breeds.json")

const script = async () => {
  //open a visible Chromium window: useful to see what is going on and to test things before finalizing the script
  const browser = await puppeteer.launch({ headless: false, slowMo: 100 })
  const page = await browser.newPage()

  //loop on every breed
  for (let dogBreed of data) {
    console.log("Start for breed:", dogBreed)
    const url = `https://duckduckgo.com/?q=${dogBreed.replaceAll(
      " ",
      "+"
    )}&va=b&t=hc&iar=images&iax=images&ia=images`

    //in case we encounter a page without images or an error
    try {
      await page.goto(url)

      //make sure the page contains our targeted element
      //(page.goto already waits for the page to load)
      await page.waitForSelector(".tile--img__media")

      await page.evaluate(() => {
        const firstImage = document.querySelector(".tile--img__media")
        //we open the panel that contains the image info
        firstImage.click()
      })

      //get the link of the image from the panel
      await page.waitForSelector(".detail__pane a")
      const link = await page.evaluate(() => {
        const links = document.querySelectorAll(".detail__pane a")
        //"fichier" is French for "file": adapt it if your interface uses another language
        const linkImage = Array.from(links).find((item) =>
          item.innerText.includes("fichier")
        )
        return linkImage?.getAttribute("href")
      })
      console.log("link successfully retrieved:", link)
      console.log("=====")
    } catch (e) {
      console.log(e)
    }
  }

  //close the browser once every breed has been processed
  await browser.close()
}

script()
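The script finds the image link by looking for the word "fichier" in the link text, which only works when the page is displayed in French. One way to make this less fragile is to pull the matching logic into a small function that accepts several labels. This is a sketch under assumptions: the helper name pickFileLink is hypothetical, "file" is a guess for the English label, and the sample data below is made up.

```javascript
// Labels to look for in the link text.
// "fichier" comes from the script; "file" is an assumed English equivalent.
const FILE_LABELS = ["fichier", "file"]

// Hypothetical helper: picks the image link among the panel's anchors.
// `anchors` is a plain array of { text, href } objects.
function pickFileLink(anchors, labels = FILE_LABELS) {
  const match = anchors.find(({ text }) =>
    labels.some((label) => text.toLowerCase().includes(label))
  )
  return match ? match.href : null
}

// Made-up sample data, shaped like what page.evaluate could return
const sample = [
  { text: "Voir la page", href: "https://example.com/page" },
  { text: "Voir le fichier", href: "https://example.com/dog.jpg" },
]
console.log(pickFileLink(sample))
```

To use it, page.evaluate would return the anchors as plain objects (for example with innerText and getAttribute("href")), and the selection would then run on the Node.js side.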

After running the script with node scrapeImages.js you should get something like this:

Gif scraping puppeteer

Download and optimize the images

We now have the link of every image, but some of them are quite heavy (> 1 MB).
Fortunately, we can use another Node.js library to reduce their size with minimal loss of quality: sharp

It is a massively used library (2M+ weekly downloads) to convert, resize and optimize images.

You can add this inside the loop, right after the link is retrieved, to fill a folder with the optimized images:

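//these requires go at the top of the script
const fs = require("fs")
const https = require("https")
const sharp = require("sharp")

//download the image and wait until the file is fully written
await new Promise((resolve, reject) => {
  const stream = fs.createWriteStream(dogBreed + ".jpg")
  stream.on("error", reject)
  stream.on("finish", () => {
    console.log("Download Completed")
    resolve()
  })
  https.get(link, (response) => response.pipe(stream)).on("error", reject)
})

//resize to a maximum width or height of 1000px, keeping the aspect ratio
await sharp(`./${dogBreed}.jpg`)
  .resize(1000, 1000, { fit: "inside", withoutEnlargement: true })
  .toFile(`./${dogBreed}-small.jpg`)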
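The goal of the resize step is an image whose width and height are both at most 1000px, without distorting it. In sharp, this corresponds to the fit: "inside" resize option (the default, "cover", crops the image to fill the box instead). The arithmetic behind it can be sketched as follows; fitInside is a hypothetical helper written for illustration, not sharp's implementation.

```javascript
// Sketch of the arithmetic behind "fit inside a 1000x1000 box".
// Both sides are scaled by the same factor, and small images are left untouched.
function fitInside(width, height, maxSize = 1000) {
  const scale = Math.min(maxSize / width, maxSize / height, 1) // 1 = never enlarge
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  }
}

console.log(fitInside(4000, 3000)) // { width: 1000, height: 750 }
console.log(fitInside(800, 600)) // { width: 800, height: 600 }
```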

Conclusion

You can adapt this script to scrape pretty much anything. You are also not limited to the first image of each query: you could collect every image. As for me, I used this script to get the initial images for a tool I'm working on: https://dreamclimate.city

screenshot dream climate city personal project


😄 Thanks for reading! If you found this article useful, consider following me on Twitter, where I share tips on development and design, as well as my journey to create my own startup studio.
