Webscraping using Playwright

rickynyairo

Ricky

Posted on October 10, 2023


According to its Twitter profile:

Playwright is an automation library for cross-browser end-to-end testing by Microsoft.

Think Selenium, Cypress, Puppeteer.

Playwright is primarily used to automate the testing of web applications, but its uses extend beyond that. These are among the top features that Playwright offers:

Cross-browser. Playwright supports all modern rendering engines including Chromium (Chrome, Edge, Opera), WebKit (Safari), and Gecko (Firefox).

Cross-platform. Test on Windows, Linux, and macOS, locally or on CI, headless or headed.

Cross-language. Use the Playwright API in TypeScript, JavaScript, Python, .NET, Java.

Test Mobile Web. Native mobile emulation of Google Chrome for Android and Mobile Safari. The same rendering engine works on your Desktop and in the Cloud.

In this article, we'll explore some of these features by using Playwright as a web scraper to get data from GitHub. Web scraping is a way of extracting data from websites programmatically. It has many use cases, including data mining and weather data monitoring. Please note, however, that many websites frown upon it, so check a site's terms of service before you scrape it.

What we'll build
We're going to open a user's profile on GitHub and scrape it to get the names, descriptions, and languages of the pinned repositories.

Find the code for this tutorial here:
https://github.com/rickynyairo/playwright-webscraping
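
To make the target concrete, here is a minimal sketch of the record we want for each pinned repository, along with a small helper that builds the profile URL. The names `PinnedRepo`, `USERNAME_PATTERN`, and `profileUrl` are illustrative and are not part of Playwright or the tutorial repository. The username pattern reflects my understanding of GitHub's username rules (letters, digits, and single hyphens, up to 39 characters), so treat it as an assumption.

```typescript
// Shape of one scraped record (hypothetical name, for illustration only).
interface PinnedRepo {
  title: string;
  description: string;
  language: string;
}

// Assumed GitHub username rule: 1-39 characters, letters and digits,
// with single hyphens allowed only between them.
const USERNAME_PATTERN = /^[a-z\d](?:[a-z\d]|-(?=[a-z\d])){0,38}$/i;

// Builds the profile URL the scraper will visit, rejecting odd input early.
function profileUrl(username: string): string {
  if (!USERNAME_PATTERN.test(username)) {
    throw new Error(`Not a valid GitHub username: ${username}`);
  }
  return `https://github.com/${username}`;
}

// Placeholder record; real values come from the scraper later on.
const example: PinnedRepo = {
  title: "playwright-webscraping",
  description: "(placeholder description)",
  language: "TypeScript",
};

console.log(profileUrl("rickynyairo"), example.title);
```

Validating the username before navigating is a design choice of this sketch: it fails fast on a typo instead of waiting for the browser to load a page that does not exist.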

Prerequisites
Playwright supports a variety of programming languages, but we'll use TypeScript for this tutorial. One benefit is that Playwright supports TypeScript out of the box, with no extra setup or configuration required.
We'll use the Node.js LTS release (v18.18.0 at the time of writing).
To install Node.js, use the Node Version Manager (nvm) for your platform:



$ nvm install --lts



Once Node.js is installed, install Playwright using npm or yarn, depending on your preferred package manager. Navigate to the folder in which you want to create the tests and run:



$ npm init playwright@latest



or using yarn:



$ yarn create playwright



For this article, we'll assume you're using npm.
Run the install command and answer the following prompts to get started:

  • Choose between TypeScript or JavaScript (default is TypeScript)
  • Name of your Tests folder (default is tests or e2e if you already have a tests folder in your project)
  • Add a GitHub Actions workflow to easily run tests on CI
  • Install Playwright browsers (default is true)

You should see this after a successful run:

[Image: results from npm init command]

Running this command downloads the browsers and creates a few files in your folder:



playwright.config.ts
package.json
package-lock.json
tests/
  example.spec.ts
tests-examples/
  demo-todo-app.spec.ts



The playwright.config.ts file defines the configuration for running Playwright, such as which browsers to use. We only need one browser, so edit the projects property of the config object in this file and remove the other browsers, leaving:

playwright.config.ts



  /* Configure projects for major browsers */
  projects: [
    {
      name: 'chromium',
      use: { ...devices['Desktop Chrome'] },
    }
  ]



We're only interested in the tests/ folder, which contains an example test. Have a look at the tests-examples/ folder for more detailed test examples.

Run this to verify correct initialisation:



$ npx playwright test



The tests should run and pass.

Next, we'll navigate to a GitHub profile and find the pinned repositories and their languages. To do this, we'll use a page object and multiple locators.

Locator: a locator represents a way to find elements on the page at any moment.
Page object model: a page object represents a part of your web application. In this case, we'll have a profile page with a locator for the pinned repositories.
Page objects are a great way to package parts of your web app so they can be imported and reused in different tests without repeating the locator logic.

Let's create a page object for the profile page that receives a username and opens that user's profile.
Create a new file called ProfilePage.ts in the tests/ folder and add the following code:

ProfilePage.ts



import { type Locator, type Page } from "@playwright/test";

export class GithubProfilePage {
  readonly page: Page;
  readonly pinnedRepositories: Locator;
  readonly username: string;

  constructor(page: Page, username: string) {
    this.username = username;
    this.page = page;
    this.pinnedRepositories = page.locator(".pinned-item-list-item-content");
  }

  async goto() {
    await this.page.goto(`https://github.com/${this.username}`);
  }

  async getPinnedRepositories() {
    return await this.pinnedRepositories.all();
  }

  async getRepoName(repo: Locator) {
    return await repo.locator(".text-bold").first().innerText();
  }

  async getRepoDescription(repo: Locator) {
    return await repo.locator(".pinned-item-desc").first().innerText();
  }

  async getRepoLanguage(repo: Locator) {
    return await repo
      .locator("[itemprop='programmingLanguage']")
      .first()
      .innerText();
  }
}



Here, we initialise a profile page object with a username and define a few methods for finding what we want on the page, including the name, description, and language of each pinned repository.
This page object is fully reusable across tests.
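
One caveat with `getRepoLanguage` and `getRepoDescription` as written: if a pinned repository has no language badge or no description, the locator matches nothing, and `innerText()` waits for an element until the test times out. The sketch below shows one way to guard against that by checking `count()` first. `TextSource` and `textOrNull` are hypothetical names introduced here; the interface lists only the locator methods the helper calls, which Playwright's `Locator` provides, so a real locator can be passed in directly.

```typescript
// Minimal structural view of the locator methods used by the helper.
// Playwright's Locator has count(), first(), and innerText(), so it fits.
interface TextSource {
  count(): Promise<number>;
  first(): TextSource;
  innerText(): Promise<string>;
}

// Returns the trimmed text of the first match, or null when nothing matches.
// (Hypothetical helper name; not part of Playwright.)
async function textOrNull(source: TextSource): Promise<string | null> {
  // count() does not wait for elements to appear, so this assumes the
  // page has already finished rendering the pinned repositories.
  if ((await source.count()) === 0) {
    return null;
  }
  const text = await source.first().innerText();
  return text.trim();
}
```

Inside the page object, `getRepoLanguage` could then return `textOrNull(repo.locator("[itemprop='programmingLanguage']"))`, giving `null` for repositories without a detected language instead of a failed test.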

Rename the example.spec.ts file to github.spec.ts and replace the contents with the following:

github.spec.ts



import { test, expect } from '@playwright/test';
import { GithubProfilePage } from './ProfilePage';

test('can get pinned repositories and their languages', async ({ page }) => {
  const profilePage = new GithubProfilePage(page, 'rickynyairo');
  await profilePage.goto();
  const pinnedRepositories = await profilePage.getPinnedRepositories();

  // get repo names and languages
  const repoAndLanguage = await Promise.all(pinnedRepositories.map(async (repo) => {
    const title = await profilePage.getRepoName(repo);
    const description = await profilePage.getRepoDescription(repo);
    const language = await profilePage.getRepoLanguage(repo);
    return { title, description, language };
  }));

  // log the title, description and language as a table
  console.table(repoAndLanguage, ['title', 'description', 'language']);

  expect(pinnedRepositories.length).toBeGreaterThan(0);
});



The test uses the page object to navigate to the profile page. From there, it finds the pinned repositories, reads the name, description, and language of each one, and logs them to the console as a table.
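
Once the records are in an array, ordinary TypeScript is enough to summarise them. As a minimal sketch, the helper below counts pinned repositories per language. `RepoSummary` and `countByLanguage` are hypothetical names, the "Unknown" bucket is an assumption of this sketch, and the sample records are placeholders, not real scraped data.

```typescript
// One scraped record, matching the objects built in the test above.
// The language may be null because some repositories may not report one.
interface RepoSummary {
  title: string;
  description: string;
  language: string | null;
}

// Counts repositories per language, most frequent first.
// Repositories without a language are grouped under "Unknown" (assumption).
function countByLanguage(repos: RepoSummary[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const repo of repos) {
    const key = repo.language ?? "Unknown";
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

// Placeholder records for illustration; real values come from the scraper.
const sample: RepoSummary[] = [
  { title: "repo-a", description: "", language: "TypeScript" },
  { title: "repo-b", description: "", language: "Python" },
  { title: "repo-c", description: "", language: "TypeScript" },
];

console.table(countByLanguage(sample));
```

In the test, the `repoAndLanguage` array could be passed straight to `countByLanguage`, since its objects have the same three fields.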

Once done, run the following command:


$ npx playwright test


This should result in the following output:

[Image: final test run results]

And there it is! You now have a simple web scraper. Explore using Playwright to do more automation on your own. You could, for example, use it to create an issue in a repository, which requires going through a login process first. This should make for a fun learning project.

Thanks for making it this far. Please add a comment with any corrections you think I should make, or just say hi :)

Special credit to:
https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72
