Evan (evanhalley)

Posted on January 8, 2020

Puppyteer Crawler

The ability to use Puppeteer to automate Google Chrome makes it awesome for doing web crawling. Chrome executes the JavaScript, and many times this yields more URLs to crawl. In this month's Puppeteer experiment, I combine a Puppeteer-powered web crawler with some #MachineLearning to crawl a pet shelter's website for all of the adorable dog pictures. I call it the Puppyteer Crawler (alternate title: The Puppeteer Collar). 🐶

Overview

This is going to be less of a guide and more of a journey through the "what" and the high-level "how" of putting my Puppyteer Crawler together to find all of the adorable dog pictures.

You can jump straight to the source code on GitHub.

Components

I used about seven libraries, but here are the important ones.

Headless Chrome Crawler

Headless Chrome Crawler is a Node.js/JavaScript dependency that you can configure to crawl websites. It differs from some other web crawlers in that it uses Google Chrome as the conduit through which webpages (and JavaScript) are loaded and executed.

Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when the websites are built on modern frontend frameworks such as AngularJS, React, and Vue.js.

Powered by Headless Chrome, this crawler provides simple APIs to crawl these dynamic websites.

It's not difficult to get up and running. The following snippet crawls Kevin Bacon's Wikipedia page, printing page titles and information along the way.

const HCCrawler = require('headless-chrome-crawler');

(async () => {
    let crawler = await HCCrawler.launch({
        maxDepth: 2,
        // Runs in the page context; jQuery ($) is injected by the crawler
        evaluatePage: (() => ({
            title: $('title').text(),
        })),
        onSuccess: (result => console.log(result)),
    });
    await crawler.queue('https://en.wikipedia.org/wiki/Kevin_Bacon');
    await crawler.onIdle(); // wait for the queue to drain
    await crawler.close();
})();

My use case for the crawler was finding all of the images loaded by Chrome while crawling a pet shelter's website. To do that, I implemented a customCrawl, which allows you to, among other things, interact with the Puppeteer Page object.

// Note: imageUrls is a Set declared in the enclosing scope
customCrawl: async (page, crawl) => {
    await page.setRequestInterception(true);

    // Record (but don't download) every image the page requests
    page.on('request', request => {
        let requestUrl = request.url();

        if (request.resourceType() === 'image' && !imageUrls.has(requestUrl)) {
            imageUrls.add(requestUrl);
            request.abort();
        } else {
            request.continue();
        }
    });
    let result = await crawl();
    result.content = await page.content();
    return result;
}

With access to the Page, I can use request interception to record the URLs that lead to images. Each of those URLs is saved for classification by TensorFlow.js in a later step.
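For context, here is a minimal sketch of how that customCrawl could be wired into the crawler. The imageUrls Set, the start URL, and the maxDepth value are illustrative, not lifted from the repo:

const HCCrawler = require('headless-chrome-crawler');

// Illustrative: collects every image URL seen during the crawl
const imageUrls = new Set();

(async () => {
    let crawler = await HCCrawler.launch({
        maxDepth: 1,
        customCrawl: async (page, crawl) => {
            // Same interception logic as the snippet above
            await page.setRequestInterception(true);

            page.on('request', request => {
                let requestUrl = request.url();

                if (request.resourceType() === 'image' && !imageUrls.has(requestUrl)) {
                    imageUrls.add(requestUrl);
                    request.abort();
                } else {
                    request.continue();
                }
            });
            let result = await crawl();
            result.content = await page.content();
            return result;
        },
    });
    await crawler.queue('https://spcawake.org');
    await crawler.onIdle();
    await crawler.close();

    console.log(`Found ${imageUrls.size} image URLs to classify`);
})();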

TensorFlow.js

TensorFlow.js is a JavaScript wrapper around the popular machine learning framework, TensorFlow. TensorFlow is a framework for building, training, and using machine learning models to do advanced computation, like text-to-speech or image recognition. Normally, you'd write all of your TensorFlow logic in Python. TensorFlow.js allows you to accomplish your machine learning tasks with JavaScript. This means you can easily load models in the browser or server-side via Node.js.

TensorFlow.js also comes with a handful of pre-built machine learning models, so you don't need a PhD to get up and running quickly.

My implementation takes a URL to an image we recorded in a previous step, gets the binary data from the web server, then provides it to a pre-built object recognition model, coco-ssd.

More about coco-ssd:

Object detection model that aims to localize and identify multiple objects in a single image.

This model is a TensorFlow.js port of the COCO-SSD model. For more information about the TensorFlow object detection API, check out the readme in tensorflow/object_detection.

This model detects objects defined in the COCO dataset, which is a large-scale object detection, segmentation, and captioning dataset. You can find more information here. The model is capable of detecting 90 classes of objects. (SSD stands for Single Shot MultiBox Detection).

This TensorFlow.js model does not require you to know about machine learning. It can take as input any browser-based image element (an img, video, or canvas element, for example) and returns an array of bounding boxes with class names and confidence levels.

The cool thing about coco-ssd is that it will detect as many things in an image as it can and generate a bounding box that identifies where in the image an object is located. The detect method will return an array of predictions, one for each object detected in the image.

const tf = require('@tensorflow/tfjs');
const tfnode = require('@tensorflow/tfjs-node'); // native TensorFlow backend for Node.js
const cocoSsd = require('@tensorflow-models/coco-ssd');
const request = require('request');

// Fetch the raw image bytes from a URL as a Buffer
function getImagePixelData(imageUrl) {
    return new Promise((resolve, reject) => {
        let options = { url: imageUrl, method: "get", encoding: null };

        request(options, (err, response, buffer) => {
            if (err) { reject(err); } 
            else { resolve(buffer);}
        });
    });
}

(async () => {
    let model = await cocoSsd.load({ base: 'mobilenet_v2' });
    let predictions = [];

    try {
        let url = 'https://www.guidedogs.org/wp-content/uploads/2019/11/website-donate-mobile.jpg';
        let imageBuffer = await getImagePixelData(url);

        if (imageBuffer) {
            // Decode the image bytes into a tensor and run object detection
            let input = tfnode.node.decodeImage(imageBuffer);
            predictions = await model.detect(input);
            console.log(predictions);
        }
    } catch (err) {
        console.error(err);
    }
})();

Here is a picture of a dog.

[Image: an adorable dog]

Passing it into the coco-ssd model yields:

[
  {
    bbox: [
      62.60044872760773,
      37.884591430425644,
      405.2848666906357,
      612.7625299990177
    ],
    class: 'dog',
    score: 0.984025239944458
  }
]
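From an array like that, the tool has to decide whether an image actually contains whatever was searched for. Here is a minimal sketch of that check, using a hypothetical matchesQuery helper and a 0.5 confidence threshold (both illustrative, not taken from the repo):

// Hypothetical helper: does any prediction match the query (e.g. 'dog')
// with at least the minimum confidence score?
function matchesQuery(predictions, query, minScore = 0.5) {
    return predictions.some(prediction =>
        prediction.class === query && prediction.score >= minScore);
}

// With the predictions shown above:
// matchesQuery(predictions, 'dog'); // => true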

Get Up & Running

Step 1 - Clone the repository

git clone git@github.com:evanhalley/puppyteer-crawler.git

Step 2 - Download the dependencies

cd puppyteer-crawler
npm install

Step 3 - Find the photos of dogs

node . --url=spcawake.org --depth=1 --query=dog

Output

Searching https://spcawake.org for images containing a dog...
The domain for the URL is spcawake.org...
Starting crawl of https://spcawake.org...
Crawled 1 urls and found 25 images...
Classifying 25 images...
 ████████████████████████████████████████ 100% | ETA: 0s | 25/25
Images that contain a dog
https://spcawake.org/wp-content/uploads/2019/11/Clinic-Banner-2-820x461.jpg
https://spcawake.org/wp-content/uploads/2019/03/Dog-for-website.jpg
https://spcawake.org/wp-content/uploads/2019/03/volunteer-website-pic.jpg
https://spcawake.org/wp-content/uploads/2019/12/Social-Dog-250x250.jpg
https://spcawake.org/wp-content/uploads/2019/12/Alhanna-for-blog-v2-250x250.jpg

Recap

This experiment allowed me to use two libraries to accomplish a task that would otherwise be fairly labor-intensive, depending on the size of the website. Using TensorFlow.js allows you to leverage models already created and trained to identify different types of objects. You could even train a model yourself to detect, for example, all of the pictures of 1992 Volkswagen GTIs on a classic car website.

Using a web crawler that leverages Puppeteer ensures the JavaScript on each page is rendered, so you also crawl the URLs that JavaScript produces. This makes collecting the data to feed to your model easy and painless.

✌🏿

(Originally published at evanhalley.dev)
