Web Scraping With Playwright: Tutorial for 2022

Oxylabs

Posted on November 3, 2022

It’s no surprise that the internet and its impact have grown tremendously in recent years, driven by technologies that make applications more user-friendly. Automation now plays a role at every step, from the development of web applications to their testing.

Having good tools to test web applications is crucial. Libraries such as Playwright speed up this work by opening the web application in a browser, automating user interactions such as clicking elements and typing text, and, of course, extracting public data from the web.

In this post, we’ll explain everything you need to know about Playwright and how it can be used for automation and even web scraping.

What is Playwright?

Playwright is a testing and automation framework that can automate web browser interactions. Simply put, you can write code that can open a browser. This means that all the web browser capabilities are available for use. The automation scripts can navigate to URLs, enter text, click buttons, extract text, etc. The most exciting feature of Playwright is that it can work with multiple pages at the same time, without getting blocked or having to wait for operations to complete in any of them.

It supports most modern browsers: Google Chrome and Microsoft Edge through Chromium, Firefox, and Safari through WebKit. In fact, cross-browser web automation is Playwright’s strength: the same code runs efficiently across all of these browsers. Moreover, Playwright supports various programming languages such as Node.js, Python, Java, and .NET. You can write code that opens websites and interacts with them in any of these languages.

Playwright’s documentation is extensive. It covers everything from getting started to a detailed explanation about all the classes and methods.

Support for proxies in Playwright

Playwright supports the use of proxies. Before we explore this subject further, here is a quick code snippet showing how to start using a proxy with Chromium:

Node.js:

const { chromium } = require('playwright');

const browser = await chromium.launch();

Python:

from playwright.async_api import async_playwright

import asyncio

async with async_playwright() as pw:

    browser = await pw.chromium.launch()

This code needs only slight modifications to fully utilize proxies.

In the case of Node.js, the launch function accepts an optional parameter of the LaunchOptions type. This LaunchOptions object can, in turn, carry several other parameters, e.g., headless. The other parameter needed here is proxy, which is itself an object with properties such as server, username, and password. The first step is to create an object where these parameters can be specified.

// Node.js

const launchOptions = {

    proxy: {

        server: '123.123.123.123:80'

    },

    headless: false

}

The next step is to pass this object to the launch function:

const browser = await chromium.launch(launchOptions);

In the case of Python, it’s slightly different: there’s no need to create a LaunchOptions object. Instead, all the values can be passed as separate keyword arguments. Here’s how the proxy dictionary is sent:

# Python

proxy_to_use = {

    'server': '123.123.123.123:80'

}

browser = await pw.chromium.launch(proxy=proxy_to_use, headless=False)

When deciding which proxy to use, it’s best to use residential proxies as they don’t leave a footprint and won’t trigger any security alarms. For example, our own Oxylabs’ Residential Proxies can provide you with an extensive and stable proxy network. You can access proxies in a specific country, state, or even city. Importantly, you can integrate them easily with Playwright as well.
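Authenticated proxies follow the same pattern. Here is a minimal sketch in Python that assembles the proxy settings before launch; the server address and credentials below are placeholders, not real endpoints, and build_proxy_config is a hypothetical helper, not a Playwright API:

```python
# Build the proxy settings as a plain dictionary; Playwright's launch()
# accepts these same keys (server, username, password) in both Node.js
# and Python.
def build_proxy_config(server, username=None, password=None):
    proxy = {'server': server}
    # Username and password are only needed for authenticated proxies.
    if username and password:
        proxy['username'] = username
        proxy['password'] = password
    return proxy

proxy_to_use = build_proxy_config('123.123.123.123:80', 'user', 'pass')
# In a real script, this dictionary would then be passed to the launch call:
# browser = await pw.chromium.launch(proxy=proxy_to_use, headless=False)
```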

Basic scraping with Playwright

Let’s now cover how to get started with Playwright using Node.js and Python.

If you’re using Node.js, create a new project and install the Playwright library. This can be done using these two simple commands:

npm init -y

npm install playwright

A basic script that opens a dynamic page is as follows:

const playwright = require('playwright');

(async () => {

    const browser = await playwright.chromium.launch({

        headless: false // Show the browser.

    });

   

    const page = await browser.newPage();

    await page.goto('https://books.toscrape.com/');

    await page.waitForTimeout(1000); // Wait for 1 second

    await browser.close();

})();

Let’s take a look at the provided code. The first line imports Playwright. Then, an instance of Chromium is launched, which allows the script to automate the browser. Note that this script runs with a visible UI because we passed headless: false. Then, a new browser page is opened, and the page.goto function navigates to the Books to Scrape website. After a 1-second wait to show the page to the end user, the browser is closed.

The same code can be written in Python easily. First, install Playwright using the pip command:

pip install playwright

Note that Playwright supports two variations – synchronous and asynchronous. The following example uses the asynchronous API:

from playwright.async_api import async_playwright

import asyncio

 

async def main():

    async with async_playwright() as pw: 

        browser = await pw.chromium.launch(

            headless=False  # Show the browser

        )

        page = await browser.new_page()

        await page.goto('https://books.toscrape.com/')

        # Data Extraction Code Here

        await page.wait_for_timeout(1000)  # Wait for 1 second

        await browser.close()

       

if __name__ == '__main__':

    asyncio.run(main())

This code is similar to the Node.js code. The biggest difference is the use of the asyncio library. Another difference is that the function names change from camelCase to snake_case.

If you want to create more than one browser context or want to have finer control, you can create a context object and create multiple pages in that context. This would open pages in new tabs:

const context = await browser.newContext();

const page1 = await context.newPage();

const page2 = await context.newPage();

You may also want to handle the page context in your code. It’s possible to get the browser context that a page belongs to using the page.context() function.
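The concurrency benefit of working with multiple pages can be seen outside the browser too. Below is a browser-free sketch of the pattern using Python’s asyncio, with a stub coroutine standing in for the per-page work; the URLs and the scrape_page helper are illustrative, not Playwright APIs:

```python
import asyncio

# Hypothetical stand-in for a per-page scraping task; in a real script each
# task would create a page with context.new_page() and call page.goto().
async def scrape_page(url):
    await asyncio.sleep(0.01)  # simulate waiting on the network
    return f"scraped {url}"

async def main():
    urls = ['https://example.com/page1', 'https://example.com/page2']
    # asyncio.gather runs the per-page tasks concurrently, mirroring how a
    # Playwright context lets multiple tabs make progress at the same time.
    return await asyncio.gather(*(scrape_page(u) for u in urls))

results = asyncio.run(main())
print(results)
```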

Locating elements

To extract information from any element or to click any element, the first step is to locate the element. Playwright supports both CSS and XPath selectors.

This can be understood better with a practical example. Open https://books.toscrape.com/ in Chrome. Right-click the first book and select Inspect.

You can see that all the books are under the article element, which has a class product_pod.

To select all the books, you need to run a loop over all these article elements. These article elements can be selected using the CSS selector:

.product_pod

Similarly, the XPath selector would be as follows:

//*[@class="product_pod"]

To use these selectors, the most common functions are as follows:

  • $eval(selector, function) – selects the first element, sends it to the function, and returns the function’s result;
  • $$eval(selector, function) – same as above, except that it selects all matching elements;
  • querySelector(selector) – returns the first element;
  • querySelectorAll(selector) – returns all the elements.

These methods work correctly with both CSS and XPath selectors.
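These selectors can also be tried without launching a browser. Below is a sketch that runs the XPath variant against a simplified, hypothetical stand-in for the Books to Scrape markup, using Python’s standard-library ElementTree (which supports a limited XPath subset):

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical stand-in for the Books to Scrape listing markup.
html = """
<ol>
    <li><article class="product_pod"><h3>Book One</h3></article></li>
    <li><article class="product_pod"><h3>Book Two</h3></article></li>
</ol>
"""

root = ET.fromstring(html)
# This mirrors the //*[@class="product_pod"] XPath selector from above:
# match any element whose class attribute equals product_pod.
books = root.findall('.//*[@class="product_pod"]')
print(len(books))  # 2
```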

Scraping text

Continuing with the example of Books to Scrape, after the page has been loaded, you can use a selector to extract all book containers using the $$eval function.

const books = await page.$$eval('.product_pod', all_items => {

// run a loop here

})

Now all the elements that contain book data can be extracted in a loop:

all_items.forEach(book => {

    const name = book.querySelector('h3').innerText;

})

Finally, the innerText attribute can be used to extract the data from each data point. Here’s the complete code in Node.js:

const playwright = require('playwright');

 

(async () => {

    const browser = await playwright.chromium.launch();

    const page = await browser.newPage();

    await page.goto('https://books.toscrape.com/');

    const books = await page.$$eval('.product_pod', all_items => {

        const data = [];

        all_items.forEach(book => {

            const name = book.querySelector('h3').innerText;

            const price = book.querySelector('.price_color').innerText;

            const stock = book.querySelector('.availability').innerText;

            data.push({ name, price, stock});

        });

        return data;

    });

    console.log(books);

    await browser.close();

})();

The code in Python will be a bit different. Python has the function eval_on_selector, which is similar to $eval in Node.js, but it isn’t suitable for this scenario because its second parameter still needs to be JavaScript. That can be useful in certain scenarios, but in this case it’s better to write the entire extraction in Python.

It would be better to use query_selector and query_selector_all, which return an element and a list of elements respectively.

from playwright.async_api import async_playwright

import asyncio

 

 

async def main():

    async with async_playwright() as pw:

        browser = await pw.chromium.launch()

        page = await browser.new_page()

        await page.goto('https://books.toscrape.com')

 

        all_items = await page.query_selector_all('.product_pod')

        books = []

        for item in all_items:

            book = {}

            name_el = await item.query_selector('h3')

            book['name'] = await name_el.inner_text()

            price_el = await item.query_selector('.price_color')

            book['price'] = await price_el.inner_text()

            stock_el = await item.query_selector('.availability')

            book['stock'] = await stock_el.inner_text()

            books.append(book)

        print(books)

        await browser.close()

 

if __name__ == '__main__':

    asyncio.run(main())

The output of both the Node.js and the Python code will be the same.
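Once the list of dictionaries is extracted, a typical next step is to normalize the raw strings. Here is a sketch of such post-processing; the sample records below are illustrative, not live scrape output, and normalize is a hypothetical helper:

```python
# Sample records shaped like the scraper's output; the values are illustrative.
books = [
    {'name': 'A Light in the Attic', 'price': '£51.77', 'stock': 'In stock'},
    {'name': 'Tipping the Velvet', 'price': '£53.74', 'stock': 'In stock'},
]

def normalize(book):
    # Strip the currency sign, convert the price to a float, and reduce
    # the stock text to a boolean flag.
    return {
        'name': book['name'],
        'price': float(book['price'].lstrip('£')),
        'in_stock': 'In stock' in book['stock'],
    }

cleaned = [normalize(b) for b in books]
print(cleaned[0])
```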

Playwright vs Puppeteer and Selenium

There are other tools like Selenium and Puppeteer that can also do the same thing as Playwright.

However, Puppeteer is limited when it comes to browsers and programming languages. The only language that can be used is JavaScript, and the only browser that works with it is Chromium.

Selenium, on the other hand, supports all major browsers and a lot of programming languages. It is, however, slow and less developer-friendly.

Also note that Playwright can intercept network requests. For more details, see the Playwright documentation on network requests.

The following table is a quick summary of the differences and similarities:


|                       | PLAYWRIGHT                    | PUPPETEER        | SELENIUM                                           |
|-----------------------|-------------------------------|------------------|----------------------------------------------------|
| SPEED                 | Fast                          | Fast             | Slower                                             |
| DOCUMENTATION         | Excellent                     | Excellent        | Fair                                               |
| DEVELOPER EXPERIENCE  | Best                          | Good             | Fair                                               |
| PROGRAMMING LANGUAGES | JavaScript, Python, C#, Java  | JavaScript       | Java, Python, C#, Ruby, JavaScript, Kotlin         |
| BACKED BY             | Microsoft                     | Google           | Community and Sponsors                             |
| COMMUNITY             | Small but active              | Large and active | Large and active                                   |
| BROWSER SUPPORT       | Chromium, Firefox, and WebKit | Chromium         | Chrome, Firefox, IE, Edge, Opera, Safari, and more |

Comparison of performance

As we mentioned in the previous section, because of the vast differences in supported programming languages and browsers, it isn’t easy to compare every scenario.

The only combination that can be compared is when scripts are written in JavaScript to automate Chromium. This is the only combination that all three tools support.

A detailed comparison would be out of the scope of this post. The key takeaway is that Puppeteer is generally the fastest, followed by Playwright, although in some scenarios Playwright was faster. Selenium is the slowest of the three.

Again, remember that Playwright has other advantages, such as multi-browser support and support for multiple programming languages.

If you’re looking for fast cross-browser web automation, or if you don’t know JavaScript, Playwright is the strongest choice of the three.

Conclusion

In today’s post, we explored the capabilities of Playwright as a web testing tool that can be used for web scraping dynamic sites. Due to its asynchronous nature and cross-browser support, it’s a popular alternative to other tools. We also covered code examples in both Node.js and Python.

Playwright can navigate to URLs, enter text, click buttons, extract text, etc. Most importantly, it can extract text that is rendered dynamically. These things can also be done by other tools such as Puppeteer and Selenium, but if you need to work with multiple browsers, or with a language other than JavaScript/Node.js, then Playwright would be a great choice.

If you’re interested in reading more about similar topics, check out our blog posts on web scraping with Selenium or our Puppeteer tutorial.

And, of course, in case you have any questions or impressions about today’s tutorial, don’t hesitate to leave a comment below!
