How to Scrape Websites with Tampermonkey and Express JS: A Guide

pradnyanandana

Pradnyanandana Suwitra

Posted on March 2, 2023

There are several reasons why a browser script can be a better choice than fully automated scraping. One of the most common is that some websites implement measures against automated scraping, such as CAPTCHAs, IP blocking, or other forms of bot detection. Because a browser script runs inside your own browser session, its activity looks like ordinary browsing, so it is far less likely to trigger these measures. In addition, a browser script gives you more granular control over the scraping process, which you can customize to meet the specific requirements of your project.

One popular tool for browser scripting is Tampermonkey. This userscript manager offers a powerful and flexible option for collecting data from websites, as it lets you write custom JavaScript code that interacts with the page and extracts the information you need. Tampermonkey provides a simple and intuitive interface for managing user scripts, making it easy to install, edit, and run your scripts across different websites.

However, you will probably need a place to store the scraped data. This is where a database comes into play. You can create an API that receives the scraped data and writes it to a database, and Express.js is one option for building that API.

Express.js is a popular solution for creating APIs quickly and efficiently. This framework is built on top of Node.js, making it a versatile tool for building web applications and APIs. By using Express.js, you can easily create a bridge between the scraped data and the database, making it simple to store, retrieve, and manipulate the data for your project.

Installing and Setting Up Tampermonkey

Tampermonkey is a versatile browser extension compatible with many popular browsers, including Chrome, Firefox, Edge, and Safari. To download Tampermonkey, you can simply perform a quick online search for "Tampermonkey extension" and select the appropriate link for your browser.

After installing Tampermonkey, a new icon should appear in your browser toolbar, allowing you to access the Tampermonkey dashboard easily. Note that the appearance and features of the dashboard may vary depending on which version of Tampermonkey you are using; at the time of writing, the latest version available is 4.18.1.

Tampermonkey dashboard

You will see an option to create a new script in the Tampermonkey dashboard. While you can write the script code directly in the editor, in this tutorial we will use the @require feature to load the script from a file. We will also use Express.js to make this file accessible via a URL.

One essential configuration option to note when creating a new script in Tampermonkey is the "match" field. This field specifies the URLs on which the script should be executed.

For example, you might configure the script to run only on a specific website, on a set of pages, or on any website that matches a particular pattern. By carefully configuring the match field, you can ensure that your script runs exactly where you intend and avoids causing unintended side effects on other web pages.

As an exercise, we will use these tools to extract book data from the website https://books.toscrape.com/. We will first scrape the list of books, and afterwards the details for each book, such as the title, price, and description, can be obtained. This may involve navigating multiple web pages and using various data parsing techniques to extract the relevant information.

Setting Up an Express JS Server

Before using Express.js to create an API for storing your scraped data, you must ensure that Node.js is installed on your machine and configured correctly. Node.js is a JavaScript runtime that allows you to execute JavaScript code outside of a web browser environment, and it is a prerequisite for using Express.js.

Once you have verified that Node.js is installed and working correctly, you can install Express.js. The usual way is through npm (the Node.js package manager), for example by running `npm install express cors cookie-parser` in the project folder, which also installs the two middleware packages used later in this tutorial. The exact steps may vary slightly depending on your operating system and which version of Node.js you have installed.

To start installing Express.js, you can refer to their website's official tutorials and documentation. These resources provide step-by-step instructions for installing and configuring Express.js, examples, and best practices for building APIs and web applications with the framework. With Express.js installed and configured, you will be ready to create your custom API for storing scraped data and accessing it from other applications or services.

Configure Express.js to Serve Static Scripts

A few things must be done before an Express.js application can serve a script file from a URL. First, it's important to set up the appropriate middleware. In this case, the code snippet registers three middleware functions.

const express = require("express");
const cors = require("cors");
const cookieParser = require("cookie-parser");
const app = express();
const port = 3000;

app.use(cors());
app.use(express.static("public"));
app.use(cookieParser());


The first middleware function is "cors". It enables Cross-Origin Resource Sharing (CORS) by adding the response headers that tell a browser that pages from other origins may read the server's responses. By default, browsers apply the same-origin policy and stop scripts on one origin from reading responses from another, so this middleware matters whenever the server is called from a page on a different domain.

The second middleware function is "express.static". This function tells Express.js where to find the static files that will be served to clients. In this case, the "public" folder is specified, and this is the directory where the script file will be stored. When a client requests the script file, Express.js will look for it in the "public" folder and return it to the client.

The third middleware function is "cookieParser", which parses the cookies sent with each request. Cookies are small pieces of data stored on the client side that hold information about the user or their session. With the "cookieParser" middleware, Express.js can read these cookies and use the data to personalize the user's experience or perform other actions.
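As a rough picture of what cookie parsing involves, the sketch below splits a raw Cookie header into a key-value object. It is a simplified stand-in, not the cookie-parser implementation: the function name `parseCookieHeader` and the sample header are assumptions made for illustration, and the real middleware does more (for example, it supports signed cookies) and exposes the result as `req.cookies`.

```javascript
// Simplified illustration of cookie parsing.
// NOTE: parseCookieHeader is a hypothetical helper; use cookie-parser in real code.
function parseCookieHeader(header) {
    var cookies = {};
    if (!header) {
        return cookies;
    }
    header.split(";").forEach(function (pair) {
        var index = pair.indexOf("=");
        if (index === -1) {
            return; // skip malformed pairs without "="
        }
        var name = pair.slice(0, index).trim();
        var value = pair.slice(index + 1).trim();
        try {
            cookies[name] = decodeURIComponent(value);
        } catch (e) {
            cookies[name] = value; // keep the raw value if it is not valid URI encoding
        }
    });
    return cookies;
}

console.log(parseCookieHeader("session=abc123; theme=dark"));
```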

Writing the Scraping Script

Now, we are ready to write a script that scrapes the list of books from the books.toscrape.com website. Once the book list on the current page has been collected, the script will navigate to the next page by clicking the "next" button and repeat the collection process there.

function runJob() {
    var data = [];

    document.querySelectorAll(".product_pod").forEach(function(product) {
        var title = product.querySelector("h3 a");
        var image = product.querySelector("img");
        var color = product.querySelector(".price_color");

        data.push({
            image: image && image.getAttribute("src"),
            price: color && color.innerText,
            title: title && title.getAttribute("title"),
            url: title && title.getAttribute("href"),
        });
    });

    GM_xmlhttpRequest({
        method: "POST",
        url: "http://localhost:3000/api/list",
        data: JSON.stringify({
            list: data,
        }),
        headers: {
            "Content-Type": "application/json",
        },
        onload: function(response) {
            var resp = JSON.parse(response.response);
            var respData = resp.data;

            console.log("Response:");
            console.log(respData);

            // Wait 10 seconds, then move to the next page if there is one.
            setTimeout(() => {
                var next = document.querySelector(".pager .next a");
                console.log(next);
                if (next) {
                    next.click();
                }
            }, 10000);
        },
    });
}

runJob();

We create a JavaScript function called "runJob" that does the following:

  1. Creates an empty array called "data".
  2. Uses the "document.querySelectorAll" method to select all elements with the class "product_pod".
  3. Loops through each selected element using the "forEach" method and extracts the title, image, price, and URL of the book by using the "querySelector" method on each element.
  4. Pushes an object containing each book's extracted data into the "data" array.
  5. Sends a POST request to the local server at http://localhost:3000/api/list using the "GM_xmlhttpRequest" method provided by Tampermonkey. The request includes the data array as a JSON string in the request's body.
  6. When the response is received, the function parses the response JSON, logs the response data to the console, and uses the "setTimeout" method to wait 10 seconds before clicking the "next" button to move to the next page of books.

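The purpose of this code is to scrape book data from a website and send it to a local server for further processing. Because clicking the "next" link loads a new page that also matches the script's URL pattern, Tampermonkey runs the script again there, so "runJob" repeats page by page until there is no next page.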
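The "url" and "image" values collected above come straight from the "href" and "src" attributes, so they may be relative paths, and the price is plain text. Below is a minimal sketch of cleaning one record before sending it. The helper name `normalizeBook` and the sample values are assumptions made for illustration, and the actual attribute values on the site may differ.

```javascript
// Sketch: normalize one scraped record.
// NOTE: normalizeBook and the sample values below are assumptions,
// not part of the original script.
function normalizeBook(book, pageUrl) {
    // Strip everything except digits and the decimal point, e.g. "£51.77" -> 51.77.
    var amount = book.price ? parseFloat(book.price.replace(/[^0-9.]/g, "")) : NaN;
    return {
        title: book.title || null,
        price: Number.isNaN(amount) ? null : amount,
        // new URL(relative, base) resolves a relative path against the current page.
        url: book.url ? new URL(book.url, pageUrl).href : null,
        image: book.image ? new URL(book.image, pageUrl).href : null,
    };
}

var sample = {
    title: "A Light in the Attic",
    price: "£51.77",
    url: "../../a-light-in-the-attic_1000/index.html",
    image: "../../../media/cache/example.jpg",
};

console.log(normalizeBook(sample, "https://books.toscrape.com/catalogue/category/books_1/index.html"));
```

In the browser, `window.location.href` could be passed as `pageUrl` so that each record is resolved against the page it was scraped from.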

Tampermonkey Metadata

Tampermonkey metadata is a set of configuration parameters that are included at the top of a user script file to provide information about the script to Tampermonkey. We need to write some configuration before we run the script on the browser.

// ==UserScript==
// @name         Book Scraper
// @namespace    http://tampermonkey.net/
// @version      0.1
// @description  try to take over the world!
// @author       You
// @match        https://books.toscrape.com/catalogue/category/books_1/*
// @require      http://localhost:3000/list.js?ver=1.0.2
// @icon         data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
// @grant        GM_xmlhttpRequest
// ==/UserScript==

There are a few things to note from the metadata above. First, the "match" line specifies the web pages that this script should run on. In this case, the script will only run on URLs that match the pattern https://books.toscrape.com/catalogue/category/books_1/*. The asterisk at the end is a wildcard that matches any sequence of characters, so the script will run on any page whose URL starts with that prefix, whatever follows it.
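To get a feel for how a trailing wildcard narrows where a script runs, the sketch below converts a pattern containing `*` into a regular expression and tests a few URLs against it. This is only a simplified illustration: the helper name `matchesPattern` is hypothetical, and Tampermonkey's real "match" handling follows its own rules for scheme, host, and path that this sketch does not reproduce.

```javascript
// Simplified illustration of wildcard URL matching.
// NOTE: matchesPattern is a hypothetical helper, not part of Tampermonkey.
function matchesPattern(pattern, url) {
    // Escape regex metacharacters (except "*"), then turn each "*" into ".*".
    var escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    var regex = new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
    return regex.test(url);
}

var pattern = "https://books.toscrape.com/catalogue/category/books_1/*";

// Pages under the category path match the pattern.
console.log(matchesPattern(pattern, "https://books.toscrape.com/catalogue/category/books_1/index.html"));
console.log(matchesPattern(pattern, "https://books.toscrape.com/catalogue/category/books_1/page-2.html"));
// The site's home page is outside the category path and does not match.
console.log(matchesPattern(pattern, "https://books.toscrape.com/index.html"));
```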

Next, the "require" line specifies a script file that the current script depends on. In this case, the script needs a file called "list.js" served by the local web server at http://localhost:3000. It's worth noting that the file must be placed in the "public" folder of the Express.js server so that it is available at that URL. The "ver" query string is presumably there so that the URL changes whenever the file is updated, which can help if a previously loaded copy has been cached.

Finally, the "grant" line specifies that the script requires permission to make HTTP requests using the GM_xmlhttpRequest function. When the script is run for the first time, a new tab will appear, asking for permission to make these requests. This security feature is built into Tampermonkey to prevent scripts from making unauthorized requests without the user's knowledge or consent.
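If the permission prompt becomes tedious, Tampermonkey's @connect directive can declare in advance which hosts the script may contact through GM_xmlhttpRequest. Below is a minimal sketch of the two relevant header lines, assuming the API stays on localhost as in this tutorial. How the prompt behaves may vary between Tampermonkey versions, so treat this as something to try rather than a guaranteed fix.

```javascript
// These lines belong inside the ==UserScript== metadata block.
// ASSUMPTION: the Express.js server is reachable on localhost, as in this tutorial.
// @grant        GM_xmlhttpRequest
// @connect      localhost
```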

Receive Data on the Server with Express.js

app.use(express.json({ limit: "50mb" }));
app.use(express.urlencoded({ limit: "50mb", extended: true, parameterLimit: 50000 }));

app.post("/api/list", async (req, res) => {
    const list = req.body.list;
    console.log(list);
    // Other script
    res.status(200).json({
        success: true,
        message: "Success post list",
        data: list,
    });
});

This code adds a route to the Express.js server that receives POST requests at the "/api/list" endpoint. The request should contain a JSON object with a "list" property in its body. The server then logs the received list to the console and sends a JSON response to the client.

The first two lines of code set up middleware for parsing incoming requests. The express.json middleware is used to parse JSON-encoded request bodies. The limit option is set to "50mb", which caps the request body at 50 megabytes (the default is much smaller, at 100kb). Similarly, the express.urlencoded middleware is used to parse URL-encoded request bodies. The limit option is set to "50mb", the extended option is set to true to enable the use of nested objects in the request body, and the parameterLimit option is set to 50000 to limit the number of URL-encoded parameters.

The app.post method handles the POST request to "/api/list". It takes two arguments: the first argument is the route path, and the second is a callback function that handles the request and sends a response. In this case, the callback function receives the request object req and the response object res. The req.body.list property is used to access the list sent in the request body, which is then logged to the console. At the "Other script" comment you can add other code, such as saving the data to a database or processing it further.

Finally, the server sends a JSON response to the client with a success flag, a message, and the received list as data. The res.status(200) method sets the HTTP response status to 200 OK, and the res.json method sends the response in JSON format. The response contains a success flag set to true, a message with the text "Success post list", and a data property with the received list. This response will be sent back to the client that made the request.
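Where the "Other script" comment sits, one option is to validate the payload and keep the records somewhere before responding. The sketch below uses an in-memory Map keyed by URL as a stand-in for a real database. The names `store` and `saveList` are hypothetical, and the validation rules are assumptions about what a well-formed record looks like.

```javascript
// Sketch: validate and store scraped records in memory.
// NOTE: store and saveList are hypothetical; swap the Map for a real database in practice.
const store = new Map();

function saveList(list) {
    if (!Array.isArray(list)) {
        throw new TypeError("list must be an array");
    }
    let accepted = 0;
    for (const item of list) {
        // Skip records without the fields we rely on.
        if (!item || typeof item.url !== "string" || typeof item.title !== "string") {
            continue;
        }
        // Keying by URL means re-scraping a page overwrites instead of duplicating.
        store.set(item.url, item);
        accepted += 1;
    }
    return accepted;
}

const acceptedCount = saveList([
    { title: "Book A", url: "a/index.html", price: "£10.00" },
    { title: "Book A", url: "a/index.html", price: "£10.00" }, // duplicate URL
    { title: null, url: null }, // incomplete record, skipped
]);

console.log(acceptedCount, store.size);
```

Inside the route handler this could be called as `saveList(req.body.list)` within a try/catch, responding with status 400 when it throws.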

Run the Script

Now you need to start the server and enable the script in Tampermonkey. Add the following code at the end of the Express.js server file:

app.listen(port, () => {
    console.log(`Example app listening on port ${port}`);
});

This code starts a server that listens on a specific port. The app variable is an instance of an Express.js application. The listen method binds and listens for incoming requests on the specified network port. The method takes two arguments: the port number to listen on and an optional callback function that executes when the server starts listening.

In this code, the callback function logs a message to the console indicating that the server has started listening on the specified port. This message is displayed in the server's console window when the server is started, allowing the developer to confirm that the server has started successfully.

Once everything is running, open https://books.toscrape.com/catalogue/category/books_1/index.html in your browser. If everything works, the data will be logged in both the browser console and the server console.

Data log on browser console

Data log on server console

The web scraper has been successfully created, and the data has been sent to the server. Additional processes can now be added such as storing and analyzing the data.

Final Thoughts

Running a browser script through a tool like Tampermonkey can provide real benefits when it comes to web scraping. Tampermonkey is a powerful tool that allows users to customize and modify web pages by injecting custom scripts. With this tool, it is possible to scrape websites whose security measures would normally prevent standard scraping methods from working.

However, scraping websites without explicit permission can lead to legal and ethical complications, especially when the information being scraped is considered private or sensitive. Therefore, it is important to consider the ethical implications of scraping a website before doing it, as it may breach the terms of use or violate the website's privacy policy.
