Techniques for Compressing PDF Files

chukwumaijem

Chukwuma Zikora

Posted on March 17, 2023

PDF is a file format used to present a document (including text and images) in a manner independent of the software application used to view it. The fact that images can be embedded in a PDF document is the main reason its size can become very large.

Most people scan receipts and other documents to PDF, and without OCR processing, the pages are stored as images rather than text, thereby increasing the overall size of the document.

To help optimize the document, we will use Ghostscript and qpdf to produce a smaller version of it. With the Ghostscript settings used here, images are downsampled to a resolution of 300 dpi.

NOTE: This tutorial assumes you have some knowledge of Docker and Node.js.

Environment Setup

To keep our application contained, we will be using Docker to package it.

First, we create a project folder and create two files within it: Dockerfile and index.js. You should have something similar to this structure.

${project-dir}
├── Dockerfile
├── index.js

We will be using a basic Node.js Alpine image for this.

FROM node:16.15-alpine

RUN apk add --update alpine-sdk

RUN mkdir -m 755 /home/node/application
COPY . /home/node/application
WORKDIR /home/node/application

CMD ["node", "index.js"]

Write the script

Since we will be using command-line utilities, we will use Node.js' built-in child_process module to execute our commands.

const { exec } = require('child_process');

function compressFile() {
  return new Promise((resolve, reject) => {
    exec('command', (error, stdout, stderr) => {
      if (error) {
        return reject(error);
      }
      if (stderr) {
        return reject(stderr);
      }
      return resolve('Done');
    });
  });
}

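// Log the result, and report failures instead of leaving the rejection unhandled.
compressFile().then(console.log).catch(console.error);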

Solution 1 - Ghostscript

First we will need to modify our Dockerfile to add the ghostscript binary by adding a new line.

RUN apk add --no-cache ghostscript

Your Dockerfile should now look similar to this.

FROM node:16.15-alpine

RUN apk add --no-cache python3 py3-pip
RUN apk add --no-cache ghostscript
RUN apk add --update alpine-sdk

RUN mkdir -m 755 /home/node/application
COPY . /home/node/application
WORKDIR /home/node/application

CMD ["node", "index.js"]

To compress our files, here is the command we will run against the Ghostscript binary.

gs \
  -sDEVICE=pdfwrite \
  -dCompatibilityLevel=1.5 \
  -dPDFSETTINGS=/printer \
  -dNOPAUSE \
  -dBATCH \
  -dQUIET \
  -sOutputFile=output.pdf \
  input.pdf

Command Breakdown

  • -sDEVICE=pdfwrite selects which output device Ghostscript should use. We are compressing a PDF file, so we use pdfwrite. See the Ghostscript documentation for other output devices.
  • -dCompatibilityLevel=1.5 generates a PDF version 1.5 file. Here's a list of all PDF versions.
  • -dPDFSETTINGS=/printer sets the image quality to a level suitable for printers. For additional compression, choose /screen. The printer preset targets 300 dpi, while screen targets 72 dpi.
  • -dBATCH and -dNOPAUSE make Ghostscript process the input file(s) without interaction and exit when completed.
  • -dQUIET mutes routine information comments on standard output.
  • -sOutputFile=output.pdf sets the path where the compressed file is stored.
  • input.pdf is the path of the file to process.
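
As a rough sketch, the options above could also be assembled as an argument array in Node.js, which makes it easy to switch presets. The helper name buildGhostscriptArgs and its preset and grayscale parameters are assumptions made for illustration. The two grayscale flags are not part of the command above; they are a commonly used Ghostscript recipe for converting colours to grey and are included only as an optional extra.

```javascript
// Hypothetical helper (not part of the original script): builds the
// Ghostscript argument list from the options described above.
function buildGhostscriptArgs(input, output, { preset = 'printer', grayscale = false } = {}) {
  const args = [
    '-sDEVICE=pdfwrite',
    '-dCompatibilityLevel=1.5',
    `-dPDFSETTINGS=/${preset}`, // e.g. 'printer' (300 dpi) or 'screen' (72 dpi)
    '-dNOPAUSE',
    '-dBATCH',
    '-dQUIET',
  ];
  if (grayscale) {
    // Assumption: optional flags, not in the command above.
    args.push('-sColorConversionStrategy=Gray', '-dProcessColorModel=/DeviceGray');
  }
  // Output option first, input file last, matching the command above.
  args.push(`-sOutputFile=${output}`, input);
  return args;
}

// Print the resulting command line for the /screen preset.
console.log(['gs', ...buildGhostscriptArgs('input.pdf', 'output.pdf', { preset: 'screen' })].join(' '));
```

Joining the array with spaces gives a command string equivalent to the one shown earlier, with only the preset changed.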

You can read the docs to see other available options. For our use case, we will be using the options listed above.

...
  const fileName = 'sample.pdf';
  const fileIn = `./${fileName}`;
  const fileOut = `-sOutputFile=./compressed__${new Date().toISOString()}__${fileName}`;
  const command = `gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 -dPDFSETTINGS=/printer -dNOPAUSE -dBATCH -dQUIET ${fileOut} ${fileIn}`;
...

After execution, the output file name will include the word "compressed" and the date string, to differentiate the compressed file from the original.
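
One caveat with this naming scheme: the ISO date string contains colons, which are fine on Linux but are not allowed in Windows file names. A minimal sketch of a name builder that avoids them is shown here. The function name compressedName and the choice to replace ":" and "." with "-" are assumptions for illustration, not part of the original script.

```javascript
const path = require('path');

// Hypothetical helper: builds "compressed__<timestamp>__<name>" next to the
// input file, replacing ':' and '.' in the timestamp with '-'.
function compressedName(filePath, now = new Date()) {
  const { dir, base } = path.parse(filePath);
  const stamp = now.toISOString().replace(/[:.]/g, '-');
  return path.join(dir, `compressed__${stamp}__${base}`);
}

console.log(compressedName('docs/sample.pdf'));
```

Passing the date as a parameter keeps the function easy to test with a fixed timestamp.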

Your complete code should look like this.

const { exec } = require('child_process');

function compressFile() {
  return new Promise((resolve, reject) => {
    const fileName = 'sample.pdf';
    const fileIn = `./${fileName}`;
    const fileOut = `-sOutputFile=./compressed__${new Date().toISOString()}__${fileName}`;
    const command = `gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 -dPDFSETTINGS=/printer -dNOPAUSE -dBATCH -dQUIET ${fileOut} ${fileIn}`;

    exec(command, (error, stdout, stderr) => {
      if (error) {
        return reject(error);
      }
      if (stderr) {
        return reject(stderr);
      }
      console.log(`stdout: ${stdout}`);
      return resolve('Done');
    });
  });
}

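// Log the result, and report failures instead of leaving the rejection unhandled.
compressFile().then(console.log).catch(console.error);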
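
A possible variation on the script above: exec passes the command string through a shell, so file names containing spaces or shell metacharacters would need quoting. execFile takes the binary and an argument array and skips the shell. The sketch below is an alternative under that assumption, not the original author's code, and the wrapper name runTool is hypothetical.

```javascript
const { execFile } = require('child_process');
const { promisify } = require('util');

// The promisified execFile resolves with { stdout, stderr } and rejects
// when the process cannot be started or exits with a non-zero status.
const execFileAsync = promisify(execFile);

// Hypothetical wrapper: runs a binary with an argument array, no shell involved.
async function runTool(binary, args) {
  const { stdout, stderr } = await execFileAsync(binary, args);
  return { stdout, stderr };
}

// Example call (needs the gs binary, so it is left commented out here):
// runTool('gs', ['-sDEVICE=pdfwrite', '-dNOPAUSE', '-dBATCH', '-dQUIET',
//   '-sOutputFile=my output.pdf', 'my input.pdf']).then(console.log).catch(console.error);
```

Because each argument is passed as its own array element, a path such as "my input.pdf" reaches the tool intact.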

Solution 2 - QPDF

Similar to our Ghostscript setup, we will need to add qpdf to our Dockerfile.

RUN apk add --no-cache qpdf

Your Dockerfile should now look similar to this.

FROM node:16.15-alpine

RUN apk add --no-cache python3 py3-pip
RUN apk add --no-cache qpdf
RUN apk add --update alpine-sdk

RUN mkdir -m 755 /home/node/application
COPY . /home/node/application
WORKDIR /home/node/application

CMD ["node", "index.js"]

To compress our files, here is the command we will run against the qpdf binary.

qpdf --optimize-images input.pdf output.pdf

As you can see from the qpdf options, we are explicitly asking the tool to optimize the images in our PDF file. Note that qpdf expects the input file first and the output file second. Next, we update our code to include the qpdf command.

...
  const fileName = 'sample.pdf';
  const fileIn = `./${fileName}`;
  const fileOut = `./compressed__${new Date().toISOString()}__${fileName}`;
  const command = `qpdf --optimize-images ${fileIn} ${fileOut}`;
...

Your complete code should look like this.

const { exec } = require('child_process');

function compressFile() {
  return new Promise((resolve, reject) => {
    const fileName = 'sample.pdf';
    const fileIn = `./${fileName}`;
    const fileOut = `./compressed__${new Date().toISOString()}__${fileName}`;
    const command = `qpdf --optimize-images ${fileIn} ${fileOut}`;

    exec(command, (error, stdout, stderr) => {
      if (error) {
        return reject(error);
      }
      if (stderr) {
        return reject(stderr);
      }
      console.log(`stdout: ${stdout}`);
      return resolve('Done');
    });
  });
}

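// Log the result, and report failures instead of leaving the rejection unhandled.
compressFile().then(console.log).catch(console.error);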
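
One thing to watch for with the script above: according to the qpdf manual, qpdf exits with status 0 on success, 2 on errors, and 3 when it finished but emitted warnings. A non-zero status makes exec report an error, and warnings are written to stderr, so the script treats a run that actually produced a file as a failure. The sketch below shows one way to tell these cases apart. The names classifyQpdfExit and optimizeWithQpdf are hypothetical, and the handling policy is an assumption rather than the original author's approach.

```javascript
const { execFile } = require('child_process');

// Maps a qpdf exit status to a coarse outcome (0 = ok, 3 = ok with warnings).
function classifyQpdfExit(code) {
  if (code === 0) return 'ok';
  if (code === 3) return 'ok-with-warnings';
  return 'failed'; // 2 (errors) or anything else, e.g. 'ENOENT' if qpdf is missing
}

// Hypothetical wrapper: resolves on success or warnings, rejects otherwise.
function optimizeWithQpdf(input, output) {
  return new Promise((resolve, reject) => {
    execFile('qpdf', ['--optimize-images', input, output], (error, stdout, stderr) => {
      const status = classifyQpdfExit(error ? error.code : 0);
      if (status === 'failed') {
        return reject(error);
      }
      return resolve({ status, warnings: stderr });
    });
  });
}
```

Calling optimizeWithQpdf('sample.pdf', 'compressed.pdf') would then resolve even when qpdf only warned about the input file.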

Test the code

First, let us build the image to bundle our code together with our chosen binary.

docker build . -t pdf-compressor

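To run the container, I will mount a directory on my local machine onto /home/node/application. That directory holds the files I would like to compress, so the code can reach them and write the compressed files back to the same directory. Because the mount hides whatever was copied into that path when the image was built, run the command from the project folder so that index.js and the PDF are both present.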

docker run -it -v ${PWD}:/home/node/application pdf-compressor

Conclusion

The gains made from compression depend mostly on how many uncompressed or unoptimized images are present in the document. You can test both solutions and tweak their options until you find a combination that gives you the best result.
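
To compare results while tweaking options, a small size report can help. The sketch below is an illustration only: the function name compressionSummary and the one-decimal rounding are assumptions, and the file names in the commented example are placeholders.

```javascript
const fs = require('fs');

// Hypothetical helper: compares the sizes of the original and compressed files.
function compressionSummary(originalPath, compressedPath) {
  const originalBytes = fs.statSync(originalPath).size;
  const compressedBytes = fs.statSync(compressedPath).size;
  // Guard against division by zero for an empty original file.
  const saved = originalBytes === 0
    ? 0
    : ((originalBytes - compressedBytes) / originalBytes) * 100;
  return { originalBytes, compressedBytes, savedPercent: Number(saved.toFixed(1)) };
}

// Example (placeholder file names):
// console.log(compressionSummary('sample.pdf', 'compressed__sample.pdf'));
```

A negative savedPercent means the output is larger than the input, which can happen when the images were already well compressed.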

Originally Published on BlockQueue's Blog
