Verify PDF contents using Playwright and pdf2json
ryanrosello-og
Posted on July 8, 2022
In this tutorial we will use Playwright in conjunction with pdf2json to validate the contents of a PDF file. This is a very common task that you will encounter when creating end-to-end automated tests.
The PDF file used in this example is a plain, text-based PDF containing 6 pages. For simplicity, I have stored this file, pdf_sample.pdf, in the root folder of the project.
Our goals are:
- validate the meta information (keywords: "Standard Fees and Charges, 003-750, 3-750") contained within the file
- ensure the pdf file indeed has 6 pages
- assert whether the PDF file contains the correct text "When we may charge fees"
First up, you will need to add pdf2json to your project using yarn (or npm):
yarn add pdf2json -D
Import pdf2json into your spec file and create the initial scaffolding for our tests:
import PDFParser from 'pdf2json';
import { test, expect } from '@playwright/test';
test.describe('assert PDF contents using Playwright', () => {
  test.beforeAll(async () => {
  });

  test('pdf file should have 6 pages', async () => {
  });

  test('contains the correct subheading text', async () => {
  });

  test('shows the correct meta information (keywords)', async () => {
  });
});
Create a simple helper function that does the heavy lifting of parsing and loading the pdf contents into a variable:
async function getPDFContents(pdfFilePath: string): Promise<any> {
  const pdfParser = new PDFParser();
  return new Promise((resolve, reject) => {
    // Reject the promise if the parser reports an error
    pdfParser.on('pdfParser_dataError', (errData: {parserError: any}) =>
      reject(errData.parserError)
    );
    // Resolve with the parsed contents once parsing has finished
    pdfParser.on('pdfParser_dataReady', (pdfData) => {
      resolve(pdfData);
    });
    pdfParser.loadPDF(pdfFilePath);
  });
}
Create a variable called pdfContents, scoped within the describe block:
let pdfContents: any
Update the beforeAll hook to read the contents of the PDF into the variable:
test.beforeAll(async () => {
  pdfContents = await getPDFContents('./pdf_sample.pdf');
});
If you debug and inspect the shape of pdfContents, you will notice that the page count and keywords tests are quite easy to assert.
test('pdf file should have 6 pages', async () => {
  expect(pdfContents.Pages.length, 'The pdf should have 6 pages').toEqual(6);
});

test('shows the correct meta information (keywords)', async () => {
  expect(pdfContents.Meta.Keywords, 'PDF keyword was incorrect').toEqual('Standard Fees and Charges, 003-750, 3-750');
});
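If you would rather not rely on any, the fields these tests read can be described with a few hand-written types. This is only a sketch: the field names are taken from the assertions in this article, the interface names (PdfContents, PdfPage, PdfText, PdfTextRun) are made up for illustration, and the real parser output contains many more properties than are listed here.

```typescript
// Minimal hand-written types covering only the fields this article reads.
// Assumption: these names are illustrative and are not pdf2json's own typings.
interface PdfTextRun {
  T: string; // the text of the run, as stored by the parser
}

interface PdfText {
  R: PdfTextRun[];
}

interface PdfPage {
  Texts: PdfText[];
}

interface PdfContents {
  Meta: { Keywords?: string };
  Pages: PdfPage[];
}

// Hypothetical stand-in for the parsed output, so the types can be tried
// out without a real PDF file.
const sampleContents: PdfContents = {
  Meta: { Keywords: 'Standard Fees and Charges, 003-750, 3-750' },
  Pages: [
    { Texts: [{ R: [{ T: 'When%20we%20may%20charge%20fees' }] }] },
  ],
};

const pageCount: number = sampleContents.Pages.length;
const keywords: string | undefined = sampleContents.Meta.Keywords;
```

With types like these, declaring pdfContents as PdfContents lets the editor autocomplete Pages and Meta.Keywords and flags misspelled property names at compile time.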
However, the last test (asserting that "When we may charge fees" is contained in the file) is a little more convoluted. You will need to expand the Pages array and find the page where you expect the text to exist. You will then need to inspect the Texts array to find the text that you are looking for. In our example it was found on the first page, on the fourth line. This equates to pdfContents.Pages[0].Texts[3].R[0].T
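Hunting for the right index by hand in a debugger can be tedious, so a small helper that prints every entry of a page next to its index may help. The sketch below assumes only the Pages[n].Texts[n].R[n].T nesting described above. The listRawTexts name and the sample page are hypothetical, and the first three entries are placeholders rather than the real contents of pdf_sample.pdf.

```typescript
// Assumed shape of a single page: each entry in Texts holds one or more runs (R),
// and each run carries its raw text in T.
type PageLike = { Texts: { R: { T: string }[] }[] };

// Returns one string per Texts entry in the form "index: raw text",
// so the index of the entry you are after is easy to spot.
function listRawTexts(page: PageLike): string[] {
  return page.Texts.map(
    (text, index) => `${index}: ${text.R.map((run) => run.T).join('')}`
  );
}

// Hypothetical stand-in page; the first three entries are made-up placeholders.
const samplePage: PageLike = {
  Texts: [
    { R: [{ T: 'Placeholder%20one' }] },
    { R: [{ T: 'Placeholder%20two' }] },
    { R: [{ T: 'Placeholder%20three' }] },
    { R: [{ T: 'When%20we%20may%20charge%20fees' }] },
  ],
};

const listing: string[] = listRawTexts(samplePage);
listing.forEach((line) => console.log(line));
```

In a real spec you could call listRawTexts(pdfContents.Pages[0]) once while writing the test, note the index you need, and then remove the call.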
One last complication remains: the raw text that we require, "When%20we%20may%20charge%20fees", is URI-encoded. We can easily decode it using the decodeURI function.
test('contains the correct subheading text', async () => {
  const rawText = pdfContents.Pages[0].Texts[3].R[0].T;
  expect(decodeURI(rawText), 'The subheading text was incorrect').toEqual('When we may charge fees');
});
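A hardcoded index such as Texts[3] will break as soon as the layout of the PDF shifts. One possible alternative is to search every page for the expected text. The sketch below is an assumption-laden illustration rather than part of pdf2json: findText, PdfLike and TextLocation are hypothetical names, the sample data is made up, and it uses decodeURIComponent instead of decodeURI because decodeURI leaves escapes for reserved characters such as %2C (a comma) untouched.

```typescript
// Assumed shape of the parsed output, limited to the fields used in this article.
type PdfLike = { Pages: { Texts: { R: { T: string }[] }[] }[] };

interface TextLocation {
  pageIndex: number;
  textIndex: number;
}

// Decodes every Texts entry and returns the location of each one containing the needle.
// Note: decodeURIComponent throws a URIError on malformed escape sequences.
function findText(pdf: PdfLike, needle: string): TextLocation[] {
  const matches: TextLocation[] = [];
  pdf.Pages.forEach((page, pageIndex) => {
    page.Texts.forEach((text, textIndex) => {
      const decoded = text.R.map((run) => decodeURIComponent(run.T)).join('');
      if (decoded.includes(needle)) {
        matches.push({ pageIndex, textIndex });
      }
    });
  });
  return matches;
}

// Hypothetical stand-in data; only the last entry on the first page mirrors the article.
const samplePdf: PdfLike = {
  Pages: [
    {
      Texts: [
        { R: [{ T: 'Placeholder%20one' }] },
        { R: [{ T: 'Placeholder%20two' }] },
        { R: [{ T: 'Placeholder%20three' }] },
        { R: [{ T: 'When%20we%20may%20charge%20fees' }] },
      ],
    },
    { Texts: [{ R: [{ T: 'Fees%2C%20charges' }] }] },
  ],
};

const hits: TextLocation[] = findText(samplePdf, 'When we may charge fees');

// The difference between the two decoders on a reserved character:
const viaDecodeURI: string = decodeURI('Fees%2C%20charges');
const viaDecodeURIComponent: string = decodeURIComponent('Fees%2C%20charges');
```

In a spec, an assertion along the lines of expect(findText(pdfContents, 'When we may charge fees').length).toBeGreaterThan(0) would then pass regardless of where on the page the subheading ends up.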
Our final test
Conclusion
I have demonstrated how you can easily verify the contents of a PDF using Playwright and pdf2json. We have worked with a very basic PDF containing textual information. Unfortunately, pdf2json may not be able to handle more complex PDF files. YMMV 🥳🚀