Web scraping Google Reverse Images results with Nodejs
Mikhail Zub
Posted on September 22, 2022
How reverse search happens
First, we paste the image link into Google Images search:
Next, we click the "Find image source" button:
What will be scraped
Full code
If you don't need an explanation, have a look at the full code example in the online IDE
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const imageUrl = "https://www.bugatti.com/fileadmin/_processed_/sei/p63/se-image-ce40627babaa7b180bc3dedd4354d61c.jpg"; // what we want to search
const URL = `https://images.google.com`;

async function setImage(page) {
  // check whether a visible iframe (the sign-in popup) is present
  const isPopup = await page.evaluate(() => {
    return Array.from(document.querySelectorAll("iframe")).find((el) => el.style.visibility !== "hidden");
  });
  if (isPopup) {
    // move focus to the popup's close button and press it
    for (let i = 0; i < 14; i++) {
      await page.keyboard.press("Tab");
      await page.waitForTimeout(500);
    }
    await page.keyboard.press("Enter");
  }
  await page.waitForTimeout(1500);
  await page.click(".nDcEnd");
  await page.waitForTimeout(1500);
  await page.click(".PXT6cd input");
  await page.keyboard.type(imageUrl);
  await page.waitForTimeout(1500);
  await page.click(".PXT6cd div");
  await page.waitForTimeout(5000);
  await page.click(".QeWRZ .WpHeLc");
}

async function fillInfoFromPage(page) {
  return await page.evaluate(async () => {
    return Array.from(document.querySelectorAll("#search .Ww4FFb")).map((el) => ({
      title: el.querySelector(".yuRUbf > a > h3").textContent.trim(),
      link: el.querySelector(".yuRUbf > a").getAttribute("href"),
      snippet: el.querySelector(".VwiC3b").textContent.trim(),
    }));
  });
}

async function getReverseImageInfo() {
  const browser = await puppeteer.launch({
    headless: false,
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);
  await page.waitForSelector(".nDcEnd");

  await setImage(page);

  await page.waitForTimeout(5000);

  // the results open in a new tab, so take the last opened page
  const pages = await browser.pages();
  const page2 = pages[pages.length - 1];

  let imageOrganicResults = [];

  while (true) {
    await page2.waitForSelector(".Ww4FFb");
    imageOrganicResults.push(...(await fillInfoFromPage(page2)));
    const nextButton = await page2.$$(".d6cvqb");
    let isButtonActive;
    // page.$$() always resolves to an array, so check its length
    if (nextButton.length) {
      isButtonActive = await nextButton[1]?.$("a");
    } else {
      isButtonActive = await page2.$(".acRNod");
    }
    if (!isButtonActive) break;
    await isButtonActive.click();
  }

  await browser.close();

  return imageOrganicResults;
}

getReverseImageInfo().then((result) => console.dir(result, { depth: null }));
Preparation
First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth to control Chromium (or Chrome, or Firefox, but here we work only with Chromium, which is used by default) over the DevTools Protocol in headless or non-headless mode.
To do this, in the directory with our project, open the command line and enter npm init -y, and then npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth.
*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.
📌Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth to prevent websites from detecting that you are using headless Chromium or a web driver. You can check it on the Chrome headless tests website. The screenshot below shows the difference.
Process
The first step is to extract data from HTML elements, then move to the next page and repeat. Getting the right CSS selectors is fairly easy with the SelectorGadget Chrome extension, which lets us grab CSS selectors by clicking on the desired element in the browser. However, it does not always work perfectly, especially when the website relies heavily on JavaScript.
We have a dedicated Web Scraping with CSS Selectors blog post at SerpApi if you want to know a little bit more about them.
The Gif below illustrates the approach of selecting different parts of the results.
Code explanation
Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to prevent websites from detecting that you are using a web driver:
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
Next, we tell puppeteer to use StealthPlugin, and define the image link that we want to search and the Google Images URL:
puppeteer.use(StealthPlugin());
const imageUrl = "https://www.bugatti.com/fileadmin/_processed_/sei/p63/se-image-ce40627babaa7b180bc3dedd4354d61c.jpg"; // what we want to search
const URL = `https://images.google.com`;
Next, we write a function to paste the image URL into Google Images search:
async function setImage(page) {
...
}
In this function, we first check whether Google shows a popup asking you to sign in or register. We use the evaluate() and querySelectorAll() methods to access the right HTML elements, make a new array from the resulting NodeList with Array.from(), and finally use the find() method to look for an iframe that is not hidden:
const isPopup = await page.evaluate(() => {
  return Array.from(document.querySelectorAll("iframe")).find((el) => el.style.visibility !== "hidden");
});
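A note on the check above: page.evaluate() passes its return value from the browser back to Node.js, and that generally works reliably only for serializable values, so returning a DOM element is fragile. Below is a minimal sketch of the same test that returns a plain boolean instead. hasVisibleIframe is a hypothetical helper name, and the plain objects only stand in for real iframe elements so the snippet runs without a browser.

```javascript
// Hypothetical helper: true if at least one iframe is not hidden.
// Accepts any array-like of objects that expose `style.visibility`.
function hasVisibleIframe(iframes) {
  return Array.from(iframes).some((el) => el.style.visibility !== "hidden");
}

// Plain objects standing in for <iframe> elements.
const fakeIframes = [{ style: { visibility: "hidden" } }, { style: { visibility: "" } }];

console.log(hasVisibleIframe(fakeIframes)); // true
```

Code passed to page.evaluate() runs in the browser, so in a real script the some() expression would be written inline inside the evaluate() callback with document.querySelectorAll("iframe") as its input.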
If it's true, we need to close this popup. Because the popup is placed in an iframe element, it is difficult to control it from Puppeteer. So we take the simple route: press the "Tab" key fourteen times (using the keyboard.press() method) with a 0.5 sec pause after each press (using the waitForTimeout method) until the needed button is in focus, and then press the "Enter" key:
if (isPopup) {
  for (let i = 0; i < 14; i++) { // 14 is the number of "Tab" key presses
    await page.keyboard.press("Tab");
    await page.waitForTimeout(500);
  }
  await page.keyboard.press("Enter");
}
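The number of "Tab" presses depends on the popup layout, so it can help to keep it in one adjustable place. The sketch below is one way to do that, not the article's code: pressKeyTimes and sleep are hypothetical helpers, and the fake page only records key presses so the snippet runs without a browser. The sleep helper is used instead of page.waitForTimeout(), which was deprecated and later dropped in newer Puppeteer releases.

```javascript
// Small promise-based pause that does not depend on the Puppeteer version.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: press `key` `times` times, pausing `delayMs` after each press.
async function pressKeyTimes(page, key, times, delayMs = 500) {
  for (let i = 0; i < times; i++) {
    await page.keyboard.press(key);
    await sleep(delayMs);
  }
}

// Stand-in for a Puppeteer page that only records key presses.
const pressed = [];
const fakePage = { keyboard: { press: async (key) => pressed.push(key) } };

const demo = pressKeyTimes(fakePage, "Tab", 3, 10).then(() => pressed);
demo.then((keys) => console.log(keys)); // [ 'Tab', 'Tab', 'Tab' ]
```

With a real page, the popup step would become await pressKeyTimes(page, "Tab", 14) followed by the "Enter" press.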
Then we click the necessary buttons (using the click() method) and type the imageUrl (using the keyboard.type() method). We pause with the waitForTimeout method before each click, with a longer 5 sec pause before the last one:
await page.waitForTimeout(1500);
await page.click(".nDcEnd");
await page.waitForTimeout(1500);
await page.click(".PXT6cd input");
await page.keyboard.type(imageUrl);
await page.waitForTimeout(1500);
await page.click(".PXT6cd div");
await page.waitForTimeout(5000);
await page.click(".QeWRZ .WpHeLc");
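Fixed pauses can be too short on a slow connection and needlessly long on a fast one. As a hedged alternative, the sketch below waits for an element to become visible before clicking it. clickWhenReady and the fake page object are hypothetical and exist only for illustration. The Puppeteer calls it relies on are page.waitForSelector() with the visible option and page.click().

```javascript
// Hypothetical helper: wait until `selector` is visible, then click it.
async function clickWhenReady(page, selector, timeout = 10000) {
  await page.waitForSelector(selector, { visible: true, timeout });
  await page.click(selector);
}

// Stand-in page that records the calls it receives, in order.
const calls = [];
const fakePage = {
  waitForSelector: async (selector, options) => calls.push(["wait", selector, options.visible]),
  click: async (selector) => calls.push(["click", selector]),
};

const done = clickWhenReady(fakePage, ".nDcEnd").then(() => calls);
done.then((log) => console.log(log));
```

Whether this fully replaces the pauses depends on how the page behaves. Some elements are visible before they react to clicks, in which case a short pause may still be needed.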
Next, we write a function to get the needed information from HTML elements. We can do this with the textContent property and the trim() method, which get the raw text and remove white space from both ends of the string. To get links, we use the getAttribute() method to read the "href" attribute of the HTML element:
async function fillInfoFromPage(page) {
  return await page.evaluate(async () => {
    return Array.from(document.querySelectorAll("#search .Ww4FFb")).map((el) => ({
      title: el.querySelector(".yuRUbf > a > h3").textContent.trim(),
      link: el.querySelector(".yuRUbf > a").getAttribute("href"),
      snippet: el.querySelector(".VwiC3b").textContent.trim(),
    }));
  });
}
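If a result lacks one of these child elements (for example, no snippet), querySelector() returns null and the property access throws. A minimal sketch of a more tolerant version, using optional chaining, is shown below. extractResult is a hypothetical helper, and the fake element stands in for a real DOM node so the snippet runs outside the browser.

```javascript
// Hypothetical helper: read one result, tolerating missing child nodes.
function extractResult(el) {
  return {
    title: el.querySelector(".yuRUbf > a > h3")?.textContent.trim() ?? null,
    link: el.querySelector(".yuRUbf > a")?.getAttribute("href") ?? null,
    snippet: el.querySelector(".VwiC3b")?.textContent.trim() ?? null,
  };
}

// Fake element: it has a title and a link but no snippet node.
const nodes = {
  ".yuRUbf > a > h3": { textContent: "  Bugatti Veyron - Wikipedia " },
  ".yuRUbf > a": { getAttribute: () => "https://en.wikipedia.org/wiki/Bugatti_Veyron" },
};
const fakeEl = { querySelector: (selector) => nodes[selector] ?? null };

const extracted = extractResult(fakeEl);
console.log(extracted);
```

Because the callback of page.evaluate() runs in the browser, the body of extractResult would have to be written inside that callback rather than called from Node.js.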
Next, write a function to control the browser, and get information:
async function getReverseImageInfo() {
...
}
In this function, we first define browser using the puppeteer.launch({options}) method with options such as headless: false and args: ["--no-sandbox", "--disable-setuid-sandbox"].
headless: false means the browser runs in non-headless mode (with a visible window), and the args array holds the arguments that allow the browser process to launch in the online IDE. Then we open a new page:
const browser = await puppeteer.launch({
  headless: false,
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});
const page = await browser.newPage();
Next, we change the default navigation timeout (30 sec) to 60000 ms (1 min) for slow internet connections with the .setDefaultNavigationTimeout() method. This applies to navigation such as .goto(), while selector waits keep their own default timeout. Then we go to URL with the .goto() method and use the .waitForSelector() method to wait until the selector is loaded:
await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);
await page.waitForSelector(".nDcEnd");
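If the selector waits should also get the longer limit, Puppeteer has a separate page.setDefaultTimeout() method that covers waitForSelector() and similar waits. The sketch below sets both. applyTimeouts is a hypothetical helper, and the fake page just stores the values so the snippet runs without a browser.

```javascript
// Hypothetical helper: set both default timeouts on a Puppeteer page.
function applyTimeouts(page, navigationMs, generalMs = navigationMs) {
  page.setDefaultTimeout(generalMs); // waitForSelector() and other waits
  page.setDefaultNavigationTimeout(navigationMs); // goto(), reload(), waitForNavigation()
}

// Stand-in page that remembers what it was given.
const settings = {};
const fakePage = {
  setDefaultTimeout: (ms) => { settings.general = ms; },
  setDefaultNavigationTimeout: (ms) => { settings.navigation = ms; },
};

applyTimeouts(fakePage, 60000);
console.log(settings); // { general: 60000, navigation: 60000 }
```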
Then, we wait until the setImage function has finished and switch the page context to the new tab (we get an array with all open pages using the browser.pages() method and pick the last one):
await setImage(page);
await page.waitForTimeout(5000);
const pages = await browser.pages();
const page2 = pages[pages.length - 1];
Then we create the empty imageOrganicResults array and use a while loop in which we wait for the results to load and add them to the end of the imageOrganicResults array (using the push() method and the spread syntax (...)).
After that, we need to go to the next page. We check whether the next page button is present on the page. If it is, we click it and repeat the loop. Otherwise, we end the loop:
let imageOrganicResults = [];
while (true) {
  await page2.waitForSelector(".Ww4FFb");
  imageOrganicResults.push(...(await fillInfoFromPage(page2)));
  const nextButton = await page2.$$(".d6cvqb");
  let isButtonActive;
  // page.$$() always resolves to an array, so check its length
  if (nextButton.length) {
    isButtonActive = await nextButton[1]?.$("a");
  } else {
    isButtonActive = await page2.$(".acRNod");
  }
  if (!isButtonActive) break;
  await isButtonActive.click();
}
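The loop above runs until no next page button is found, which can mean many page loads. The sketch below shows the same "read, move on, stop" pattern with an upper limit on the number of pages. The limit is an assumption added for illustration, and collectPages, readPage and goToNext are hypothetical names. In-memory arrays stand in for result pages so the snippet runs without a browser.

```javascript
// Hypothetical helper: collect results page by page, stopping when there is
// no next page or after `maxPages` pages have been read.
async function collectPages(readPage, goToNext, maxPages = 10) {
  const results = [];
  for (let pageNumber = 1; pageNumber <= maxPages; pageNumber++) {
    results.push(...(await readPage()));
    const moved = await goToNext(); // resolves to false when there is no next page
    if (!moved) break;
  }
  return results;
}

// Stand-in data source with three "pages" of results.
const fakePages = [["a", "b"], ["c"], ["d"]];
let index = 0;
const readPage = async () => fakePages[index];
const goToNext = async () => {
  if (index + 1 >= fakePages.length) return false;
  index += 1;
  return true;
};

const collected = collectPages(readPage, goToNext, 2);
collected.then((items) => console.log(items)); // [ 'a', 'b', 'c' ]
```

With Puppeteer, readPage would call fillInfoFromPage(page2), and goToNext would look for the next page button, click it if present, and report whether it did.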
And finally, we close the browser, and return the received data:
await browser.close();
return imageOrganicResults;
Now we can launch our parser:
$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file
Output
[
{
"title":"Super Sport - Bugatti Veyron 16.4",
"link":"https://www.bugatti.com/models/veyron-models/veyron-164-super-sport/",
"snippet":"In the year of its market launch the Veyron 16.4 already set up a speed record for street cars. Adhering to the the Guiness World Record restrictions an ..."
},
{
"title":"Bugatti Veyron - Wikipedia",
"link":"https://en.wikipedia.org/wiki/Bugatti_Veyron",
"snippet":"The Super Sport version of the Veyron is one of the fastest street-legal production cars in the world, with a top speed of 431.072 km/h (267.856 mph)."
},
... and other results
]
Using Google Reverse Image API from SerpApi
This section compares the DIY solution with our solution.
The biggest difference is that you don't need to use browser automation to scrape results, create the parser from scratch, or maintain it.
There's also a chance that Google might block the request at some point. We handle that on our backend, so there's no need to figure out how to do it yourself or which CAPTCHA-solving service or proxy provider to use.
First, we need to install google-search-results-nodejs:
npm i google-search-results-nodejs
Here's the full code example, if you don't need an explanation:
const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY);

const imageUrl = "https://www.bugatti.com/fileadmin/_processed_/sei/p63/se-image-ce40627babaa7b180bc3dedd4354d61c.jpg"; // what we want to search

const params = {
  engine: "google_reverse_image", // search engine
  image_url: imageUrl, // search image
};

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};

const getResults = async () => {
  const organicResults = [];
  while (true) {
    const json = await getJson();
    if (json.search_information?.organic_results_state === "Fully empty") break;
    organicResults.push(...json.image_results);
    params.start ? (params.start += 10) : (params.start = 10);
  }
  return organicResults;
};

getResults().then((result) => console.dir(result, { depth: null }));
Code explanation
First, we need to declare SerpApi from the google-search-results-nodejs library and define a new search instance with your API key from SerpApi (read here from the API_KEY environment variable, as in the full code):
const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY);
Next, we write an image URL and the necessary parameters for making a request:
const imageUrl = "https://www.bugatti.com/fileadmin/_processed_/sei/p63/se-image-ce40627babaa7b180bc3dedd4354d61c.jpg"; // what we want to search
const params = {
  engine: "google_reverse_image", // search engine
  image_url: imageUrl, // search image
};
Next, we wrap the search method from the SerpApi library in a promise to further work with the search results:
const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};
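In the code above, search.json() hands the JSON to the callback as its only argument. Node's util.promisify() expects callbacks of the form (error, value), so a manual wrapper is the simpler fit here. The sketch below is a variant that takes the client and the parameters as arguments instead of reading them from outer variables. getJsonFor and fakeSearch are hypothetical names, and the stub only imitates the callback style so the snippet runs without an API key.

```javascript
// Hypothetical variant: the client and the params are passed in explicitly.
const getJsonFor = (search, params) =>
  new Promise((resolve) => {
    search.json(params, resolve);
  });

// Stub client that imitates the "callback receives the JSON" style.
const fakeSearch = {
  json: (params, callback) => setTimeout(() => callback({ engine: params.engine, image_results: [] }), 0),
};

const response = getJsonFor(fakeSearch, { engine: "google_reverse_image" });
response.then((json) => console.log(json));
```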
And finally, we declare the function getResults that gets data from the page and returns it:
const getResults = async () => {
...
};
In this function, we first declare an array organicResults for the results data:
const organicResults = [];
Next, we use a while loop. In this loop we get json with results, check whether results are present on the page (organic_results_state isn't "Fully empty"), push the results to the organicResults array, set the start number for the next results page, and repeat the loop until no results are present on the page:
while (true) {
  const json = await getJson();
  if (json.search_information?.organic_results_state === "Fully empty") break;
  organicResults.push(...json.image_results);
  params.start ? (params.start += 10) : (params.start = 10);
}
return organicResults;
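The same offset-based pagination can be sketched without changing the shared params object and with a cap on the number of requests. The cap and the fallback for a missing image_results field are assumptions added for illustration. collectImageResults and fakeFetch are hypothetical names, and the stub returns canned responses so the snippet runs without an API key.

```javascript
// Hypothetical helper: page through results using a `start` offset.
async function collectImageResults(fetchJson, baseParams, pageSize = 10, maxPages = 5) {
  const results = [];
  for (let page = 0; page < maxPages; page++) {
    // The first request has no `start`; later ones use a copy with the offset.
    const params = page === 0 ? baseParams : { ...baseParams, start: page * pageSize };
    const json = await fetchJson(params);
    if (json.search_information?.organic_results_state === "Fully empty") break;
    results.push(...(json.image_results ?? []));
  }
  return results;
}

// Stub: two pages of results, then an empty page.
const fakeResponses = {
  0: { image_results: [{ position: 1 }, { position: 2 }] },
  10: { image_results: [{ position: 3 }] },
};
const fakeFetch = async (params) =>
  fakeResponses[params.start ?? 0] ?? { search_information: { organic_results_state: "Fully empty" } };

const all = collectImageResults(fakeFetch, { engine: "google_reverse_image" });
all.then((items) => console.log(items.length)); // 3
```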
After that, we run the getResults function and print all the received information in the console with the console.dir method, which accepts an options object for changing the default output options:
getResults().then((result) => console.dir(result, { depth: null }));
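To see what { depth: null } changes, the sketch below formats a small nested object with util.inspect(), which console.dir() uses under the hood. The nested object is made up for illustration.

```javascript
const util = require("util");

const nested = { a: { b: { c: { d: 1 } } } };

// The default depth is 2, so deeper levels are collapsed.
const shallow = util.inspect(nested);
// depth: null removes the limit and prints every level.
const deep = util.inspect(nested, { depth: null });

console.log(shallow); // { a: { b: { c: [Object] } } }
console.log(deep);
```

Results such as snippet_highlighted_words are nested arrays inside objects inside an array, so removing the depth limit makes sure they are printed in full instead of being collapsed.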
Output
[
{
"position":1,
"title":"Best Bugatti Cars in India - CARS24",
"link":"https://www.cars24.com/blog/best-bugatti-cars-in-india/",
"displayed_link":"https://www.cars24.com › blog › best-bugatti-cars-in-in...",
"thumbnail":"https://serpapi.com/searches/6319fad65c560673de2b144a/images/7c3a215cbf2776de47a9c447d0b97c5290a72394aec05f099004cb62a9250eee.jpeg",
"image_resolution":"1920 × 1080",
"snippet":"The Bugatti Veyron was originally launched in 2005 and was then, the fastest car ... Although the Divo is a super luxurious vehicle with a hefty price tag, ...",
"snippet_highlighted_words":[
"Bugatti Veyron",
"super"
],
"cached_page_link":"https://webcache.googleusercontent.com/search?q=cache:0BQ-hPQGl9IJ:https://www.cars24.com/blog/best-bugatti-cars-in-india/&cd=91&hl=en&ct=clnk&gl=us"
},
{
"position":2,
"title":"1056304 car, vehicle, road, Super Car, sports car, motion blur ...",
"link":"https://rare-gallery.com/1056304-car-vehicle-road-super-car-sports-car-motion-blur-bugatti-bugatti-chiron-bugatti-veyron-performance-.html",
"displayed_link":"https://rare-gallery.com › Another wallpapers",
"thumbnail":"https://serpapi.com/searches/6319fad65c560673de2b144a/images/7c3a215cbf2776de9f26993a213d2a5cf9506bec4885508e13e62c8036abbf11.jpeg",
"image_resolution":"1920 × 1080",
"snippet":"Wallpaper name: car, vehicle, road, Super Car, sports car, motion blur, Bugatti, Bugatti Chiron, Bugatti Veyron, performance car, wheel, supercar, ...",
"snippet_highlighted_words":[
"Super",
"Bugatti",
"Bugatti",
"Bugatti Veyron"
],
"cached_page_link":"https://webcache.googleusercontent.com/search?q=cache:lcVswIAM3eMJ:https://rare-gallery.com/1056304-car-vehicle-road-super-car-sports-car-motion-blur-bugatti-bugatti-chiron-bugatti-veyron-performance-.html&cd=92&hl=en&ct=clnk&gl=us"
},
... and other results
]
Links
If you want to see some projects made with SerpApi, write me a message.