Block resources with Playwright

kami4ka

Oleg Kulyk

Posted on June 21, 2021

This article shows how to block specific resources (HTTP requests, CSS, video, images) from loading in Playwright. Playwright is Puppeteer's successor, with the ability to control Chromium, Firefox, and WebKit. I'd call it the second most widely used web scraping and automation tool with headless browser support.

Why block resources

Blocking resources from loading while web scraping is a widespread technique that saves both time and money.

For example, when you crawl a resource for product information (price, product name, image URL, etc.), you don't need to load external fonts, CSS, videos, or the images themselves. In most cases, you only need the text and the direct URLs of the media content.

Also, such improvements will:

  • speed up your web scraper
  • increase the number of pages scraped per minute (you'll pay less for your servers and get more information for the same infrastructure price)
  • decrease proxy bills (you won't use a proxy to download irrelevant content)

Intercept requests with Playwright

Request interception

Since Playwright is Puppeteer's successor with a similar API, it's natural to try the same request interception mechanism first. The documentation for both libraries shows that a page's requests can be accessed through the request event.

const playwright = require('playwright');

(async () => {
    const browser = await playwright['chromium'].launch();
    const context = await browser.newContext();
    const page = await context.newPage();
    page.on('request', (request) => {
       console.log(`Request: ${request.url()} to resource type: ${request.resourceType()}`);
    });
    await page.goto('https://amazon.com');

    await browser.close();
})();

So, the output will provide information about the requested resource and its type.

Request: https://amazon.com/ to resource type: document
Request: https://www.amazon.com/ to resource type: document
......
Request: https://m.media-amazon.com/images/I/41Kf0mndKyL._AC_SY200_.jpg to resource type: image
Request: https://m.media-amazon.com/images/I/41ffko0T3kL._AC_SY200_.jpg to resource type: image
Request: https://m.media-amazon.com/images/I/51G8LfsNZzL._AC_SY200_.jpg to resource type: image
Request: https://m.media-amazon.com/images/I/41yavwjp-8L._AC_SY200_.jpg to resource type: image
......
Request: https://m.media-amazon.com/images/S/sash/2SazJx$EeTHfhMN.woff2 to resource type: font
Request: https://m.media-amazon.com/images/S/sash/ozb5-CLHQWI6Soc.woff2 to resource type: font
Request: https://m.media-amazon.com/images/S/sash/KwhNPG8Jz-Vz2X7.woff2 to resource type: font
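
Before deciding what to block, it can help to count how many requests of each type a page makes. Below is a minimal sketch of that idea. The helper names createRequestTally and sortTally are my own and not part of Playwright; the only Playwright call the listener relies on is request.resourceType().

```javascript
// Hypothetical helper (not a Playwright API): returns a listener suitable
// for page.on('request', ...) together with the counts object it fills in.
function createRequestTally() {
    const counts = {};
    const listener = (request) => {
        const type = request.resourceType();
        counts[type] = (counts[type] || 0) + 1;
    };
    return { listener, counts };
}

// Sorts the tally into [type, count] pairs, most frequent type first.
function sortTally(counts) {
    return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}
```

To use it, register the listener with page.on('request', listener) before calling page.goto, then print sortTally(counts) once the page has loaded. The types at the top of the list are the first candidates for blocking.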

Still, according to Playwright's documentation, the Request object passed to the callback is read-only, so you can't manipulate the request from this callback.

Let's check Playwright's suggestion for this situation:


  /**
   * Emitted when a page issues a request. The [request] object is read-only. In order to intercept and mutate requests, see
   * [page.route(url, handler)](https://playwright.dev/docs/api/class-page#pagerouteurl-handler) or
   * [browserContext.route(url, handler)](https://playwright.dev/docs/api/class-browsercontext#browsercontextrouteurl-handler).
   */
  on(event: 'request', listener: (request: Request) => void): this;

Cool. Let's use page.route for the request manipulations.

Route interception

The concept behind page.route interception is very similar to Puppeteer's page.on('request'), but you access the Request object indirectly, through route.request().

const playwright = require('playwright');

(async () => {
    const browser = await playwright['chromium'].launch();
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.route('**/*', (route) => {
        return route.request().resourceType() === 'image'
            ? route.abort()
            : route.continue()
    });
    await page.goto('https://amazon.com');
    await page.screenshot({ path: 'amazon_no_images.png' });

    await browser.close();
})();

So, we're intercepting routes and then indirectly accessing the requests behind them.

As a result, the screenshot shows the website without its images.

Amazon scraping without images
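
The documentation comment quoted earlier also mentions browserContext.route(url, handler). Here is a minimal sketch of applying the same image rule at the context level, so that every page opened from that context is covered. The names blockImagesHandler and blockImages are my own; the sketch only assumes that the target exposes route(url, handler), which both Page and BrowserContext do.

```javascript
// Route handler: abort image requests, let everything else through.
function blockImagesHandler(route) {
    return route.request().resourceType() === 'image'
        ? route.abort()
        : route.continue();
}

// Hypothetical helper (not a Playwright API): `target` may be a Page or a
// BrowserContext, since both expose route(url, handler).
async function blockImages(target) {
    await target.route('**/*', blockImagesHandler);
}
```

Calling await blockImages(context) right after browser.newContext() registers the rule once for the whole context, instead of repeating page.route for every new page.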

All the supported resource types can be found below:

- document
- stylesheet
- image 
- media 
- font 
- script 
- texttrack 
- xhr 
- fetch 
- eventsource 
- websocket 
- manifest 
- other
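
A typo in a resource type (for example 'images' instead of 'image') fails silently: nothing matches, so nothing gets blocked. The sketch below checks a list of types against the known values before it is used. KNOWN_RESOURCE_TYPES and findUnknownTypes are hypothetical names of my own. The list mirrors the one above plus 'document', which appears in the request log earlier in this article; other Playwright versions may report additional types.

```javascript
// Resource types as listed in this article, plus 'document' from the log
// output above. Assumption: your Playwright version may report others.
const KNOWN_RESOURCE_TYPES = [
    'document', 'stylesheet', 'image', 'media', 'font', 'script', 'texttrack',
    'xhr', 'fetch', 'eventsource', 'websocket', 'manifest', 'other',
];

// Hypothetical helper: returns the entries that are not known resource types.
function findUnknownTypes(types) {
    return types.filter((type) => !KNOWN_RESOURCE_TYPES.includes(type));
}
```

Running findUnknownTypes on your exclusion list at startup and throwing when the result is not empty is a cheap way to catch such mistakes early.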

Also, you can apply any other condition for request prevention, like the resource URL:

const playwright = require('playwright');

(async () => {
    const browser = await playwright['chromium'].launch();
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.route('**/*', (route) => {
        return route.request().url().endsWith('.jpg')
            ? route.abort()
            : route.continue()
    });
    await page.goto('https://amazon.com');
    await page.screenshot({ path: 'amazon_no_jpg_images.png' });

    await browser.close();
})();
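
One caveat with endsWith('.jpg'): it misses URLs that carry a query string, such as photo.jpg?size=200, and it is case-sensitive. A slightly more robust sketch compares only the path part of the URL. The helper name hasBlockedExtension is my own; it relies on the standard URL class, which is available globally in Node.js.

```javascript
// Hypothetical helper (not a Playwright API): true when the URL path ends
// with one of the given lowercase extensions. Parsing the URL first means
// the query string and fragment are ignored.
function hasBlockedExtension(url, extensions) {
    let pathname;
    try {
        pathname = new URL(url).pathname.toLowerCase();
    } catch (err) {
        return false; // not an absolute URL, so don't block it
    }
    return extensions.some((ext) => pathname.endsWith(ext));
}
```

Inside the route handler, the condition then becomes hasBlockedExtension(route.request().url(), ['.jpg', '.jpeg', '.png']).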

Improve SPA page scraping speed (Vue.js, React.js, etc.)

Since the start of my web scraping journey, I've found the following exclusion list pretty neat. It speeds up Single-Page Application scrapers and can cut scraping time by up to 10x:

const playwright = require('playwright');

// Resource types that are not needed to extract data from a rendered page
const RESOURCE_EXCLUSIONS = ['image', 'stylesheet', 'media', 'font', 'other'];

(async () => {
    const browser = await playwright['chromium'].launch();
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.route('**/*', (route) => {
        // Abort excluded types; scripts, XHR and fetch requests still go through
        return RESOURCE_EXCLUSIONS.includes(route.request().resourceType())
            ? route.abort()
            : route.continue();
    });
    await page.goto('https://amazon.com');

    await browser.close();
})();

This snippet prevents binary and media content from loading while still letting the dynamic web page load everything it needs.
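
How much you gain depends on the site, so it's worth measuring. Here is a minimal sketch of a route handler that applies an exclusion list and also counts how many requests were aborted or continued. The name createCountingBlocker is my own and not part of Playwright; the handler uses only route.request(), route.abort() and route.continue().

```javascript
// Hypothetical helper (not a Playwright API): builds a route handler for
// the given exclusion list and keeps simple counters next to it.
function createCountingBlocker(excludedTypes) {
    const stats = { aborted: 0, continued: 0 };
    const handler = (route) => {
        if (excludedTypes.includes(route.request().resourceType())) {
            stats.aborted += 1;
            return route.abort();
        }
        stats.continued += 1;
        return route.continue();
    };
    return { handler, stats };
}
```

Pass handler to page.route('**/*', handler), and log stats after page.goto to see what share of the page's requests the exclusion list removed.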

Summary

Request interception is a basic web scraping technique that improves crawler performance and saves money when extracting data at scale.

To save more money, you can check out the web scraping API concept. It already handles headless browsers and proxies for you, so you can forget about giant bills for servers and proxies.

Happy Web Scraping, and don't forget to enable caching in your headless browser 💾
