Techniques for Compressing PDF Files
Chukwuma Zikora
Posted on March 17, 2023
PDF is a file format used to present a document (including text and images) in a manner independent of the software application used to view it. The fact that images can be embedded in a PDF document is the main reason its size can become very large.
Most people scan receipts and other documents to PDF, and without OCR processing, the pages are stored as images rather than text, thereby increasing the overall size of the document.
To help optimize the document, we will be using Ghostscript and qpdf to come up with a grey-scaled version of the document with a resolution of 300 dpi.
NOTE: This tutorial assumes you have some knowledge of Docker and Nodejs.
Environment Setup
To keep our application contained, we will be using Docker to package it.
First, we create a project folder and create these two files within it: Dockerfile and index.js. You should have something similar to this structure.
${project-dir}
├── Dockerfile
├── index.js
We will be using a basic Nodejs alpine image for this.
FROM node:16.15-alpine
RUN apk add --update alpine-sdk
RUN mkdir -m 755 /home/node/application
COPY . /home/node/application
WORKDIR /home/node/application
CMD ["node", "index.js"]
Write the script
Since we will be using command line utilities, we will be using Nodejs' built-in child_process to execute our commands.
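const { exec } = require('child_process');

function compressFile() {
  return new Promise((resolve, reject) => {
    // 'command' is a placeholder; the sections that follow fill in the real command.
    exec('command', (error, stdout, stderr) => {
      // error is set when the command cannot start or exits with a non-zero code.
      if (error) {
        return reject(error);
      }
      // Treat anything written to stderr as a failure.
      if (stderr) {
        return reject(stderr);
      }
      return resolve('Done');
    });
  });
}

// Handle both outcomes so a failure does not become an unhandled rejection.
compressFile().then(console.log).catch(console.error);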
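The skeleton above hands a single command string to a shell and rejects on any stderr output. As a possible alternative, the sketch below uses child_process.execFile, which takes the program and its arguments separately, so file names containing spaces need no quoting. The helper name runCommand is hypothetical, and resolving with both streams instead of rejecting on stderr is an assumption about how you may want to treat warnings.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: run a binary with an argument array (no shell involved).
// Rejects only when the process fails to start or exits with a non-zero code.
function runCommand(file, args) {
  return new Promise((resolve, reject) => {
    execFile(file, args, (error, stdout, stderr) => {
      if (error) {
        // Attach stderr so the caller can see why the tool failed.
        error.stderr = stderr;
        return reject(error);
      }
      // stderr may still hold warnings; hand it back instead of failing.
      return resolve({ stdout, stderr });
    });
  });
}

// Example usage (assumes gs is installed in the image):
// runCommand('gs', ['--version']).then(({ stdout }) => console.log(stdout));
```

Whether warnings on stderr should count as failures depends on the tool, so this is a starting point rather than a drop-in replacement.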
Solution 1 - Ghostscript
First we will need to modify our Dockerfile to add the ghostscript binary by adding a new line.
RUN apk add --no-cache ghostscript
Your Dockerfile should now look similar to this.
FROM node:16.15-alpine
RUN apk add --no-cache python3 py3-pip
RUN apk add --no-cache ghostscript
RUN apk add --update alpine-sdk
RUN mkdir -m 755 /home/node/application
COPY . /home/node/application
WORKDIR /home/node/application
CMD ["node", "index.js"]
To convert our files, here is the command we will run against the Ghostscript binary.
gs \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.5 \
-dPDFSETTINGS=/printer \
-dNOPAUSE \
-dBATCH \
-dQUIET \
-sOutputFile=output.pdf \
input.pdf
Command Breakdown
- -sDEVICE=pdfwrite selects which output device Ghostscript should use. We are compressing a PDF file, so we will be using pdfwrite. See this page for other options.
- -dCompatibilityLevel=1.5 generates a PDF version 1.5 file. Here's a list of all PDF versions.
- -dPDFSETTINGS=/printer sets the image quality for printers. For additional compression, choose /screen. The printer preset has a resolution of 300 dpi, while screen has 72 dpi.
- -dBATCH and -dNOPAUSE make Ghostscript process the input file(s) without interaction and exit when completed.
- -dQUIET mutes routine information comments on standard output.
- -sOutputFile=output.pdf sets the path to store the compressed file.
- input.pdf is the path of the file to process.
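The options in the breakdown above can be collected into an argument list, which makes it easy to switch presets while experimenting. This is a minimal sketch: the helper name buildGsArgs and its parameters are hypothetical, while the preset names are the values Ghostscript accepts for -dPDFSETTINGS.

```javascript
// Presets accepted by -dPDFSETTINGS.
const PRESETS = ['screen', 'ebook', 'printer', 'prepress', 'default'];

// Hypothetical helper: turn the options described above into an argument array.
function buildGsArgs(input, output, preset = 'printer') {
  if (!PRESETS.includes(preset)) {
    throw new Error(`Unknown preset: ${preset}`);
  }
  return [
    '-sDEVICE=pdfwrite',
    '-dCompatibilityLevel=1.5',
    `-dPDFSETTINGS=/${preset}`,
    '-dNOPAUSE',
    '-dBATCH',
    '-dQUIET',
    `-sOutputFile=${output}`,
    input,
  ];
}

// Print the command line for the more aggressive /screen preset.
console.log(['gs', ...buildGsArgs('input.pdf', 'output.pdf', 'screen')].join(' '));
```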
You can read the docs to see other available options. For our use case, we will be using the options listed above.
...
const fileName = 'sample.pdf';
const fileIn = `./${fileName}`;
const fileOut = `-sOutputFile=./compressed__${new Date().toISOString()}__${fileName}`;
const command = `gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 -dPDFSETTINGS=/printer -dNOPAUSE -dBATCH -dQUIET ${fileOut} ${fileIn}`;
...
After execution, the output file name will include the compressed prefix and the date string, to differentiate between the compressed file and the original file.
Your complete code should look like this.
const { exec } = require('child_process');

function compressFile() {
  return new Promise((resolve, reject) => {
    const fileName = 'sample.pdf';
    const fileIn = `./${fileName}`;
    // Prefix the output with "compressed" and a timestamp so the original is kept.
    const fileOut = `-sOutputFile=./compressed__${new Date().toISOString()}__${fileName}`;
    const command = `gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 -dPDFSETTINGS=/printer -dNOPAUSE -dBATCH -dQUIET ${fileOut} ${fileIn}`;
    exec(command, (error, stdout, stderr) => {
      if (error) {
        return reject(error);
      }
      if (stderr) {
        return reject(stderr);
      }
      console.log(`stdout: ${stdout}`);
      return resolve('Done');
    });
  });
}

// Handle both outcomes so a failure does not become an unhandled rejection.
compressFile().then(console.log).catch(console.error);
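The command above keeps the document's colors. Since the stated goal is a grey-scaled, 300 dpi result, one possible way to ask for that explicitly is sketched below. The switches are pdfwrite parameters described in the Ghostscript documentation, but treat the exact combination as an assumption to verify against the version installed in your image (older releases may need additional color-model switches). The helper name grayscaleArgs is hypothetical.

```javascript
// Hypothetical helper: extra Ghostscript switches for greyscale output
// with color and grey images downsampled to the given resolution.
function grayscaleArgs(dpi = 300) {
  return [
    '-sColorConversionStrategy=Gray', // convert colors to shades of grey
    '-dDownsampleColorImages=true',
    '-dDownsampleGrayImages=true',
    `-dColorImageResolution=${dpi}`,
    `-dGrayImageResolution=${dpi}`,
  ];
}

// Place the extra switches before the output and input paths.
const grayCommand = [
  'gs', '-sDEVICE=pdfwrite', '-dNOPAUSE', '-dBATCH', '-dQUIET',
  ...grayscaleArgs(300),
  '-sOutputFile=output.pdf', 'input.pdf',
].join(' ');

console.log(grayCommand);
```

Explicit resolution switches given after a -dPDFSETTINGS preset are generally expected to override that preset's defaults, so the two can be combined while you tune the result.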
Solution 2 - QPDF
Similar to our Ghostscript setup, we will need to add qpdf to our Dockerfile.
RUN apk add --no-cache qpdf
Your Dockerfile should now look similar to this.
FROM node:16.15-alpine
RUN apk add --no-cache python3 py3-pip
RUN apk add --no-cache qpdf
RUN apk add --update alpine-sdk
RUN mkdir -m 755 /home/node/application
COPY . /home/node/application
WORKDIR /home/node/application
CMD ["node", "index.js"]
To convert our files, here is the command we will run against the qpdf binary.
qpdf --optimize-images input.pdf output.pdf
As you can see from the qpdf options, we are explicitly asking the library to optimize the images in our PDF file. Next, we update our code to include the qpdf command. Note that qpdf expects the input file first, followed by the output file.
...
const fileName = 'sample.pdf';
const fileIn = `./${fileName}`;
const fileOut = `./compressed__${new Date().toISOString()}__${fileName}`;
const command = `qpdf --optimize-images ${fileIn} ${fileOut}`;
...
Your complete code should look like this.
const { exec } = require('child_process');

function compressFile() {
  return new Promise((resolve, reject) => {
    const fileName = 'sample.pdf';
    const fileIn = `./${fileName}`;
    // Prefix the output with "compressed" and a timestamp so the original is kept.
    const fileOut = `./compressed__${new Date().toISOString()}__${fileName}`;
    // qpdf takes the input path first, then the output path.
    const command = `qpdf --optimize-images ${fileIn} ${fileOut}`;
    exec(command, (error, stdout, stderr) => {
      if (error) {
        return reject(error);
      }
      if (stderr) {
        return reject(stderr);
      }
      console.log(`stdout: ${stdout}`);
      return resolve('Done');
    });
  });
}

// Handle both outcomes so a failure does not become an unhandled rejection.
compressFile().then(console.log).catch(console.error);
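qpdf reports warnings on stderr and, according to its documentation, exits with status 3 when there were warnings but no errors, so rejecting on every error or stderr message can discard a usable output file. The sketch below shows one way to build the arguments and interpret the exit status; the helper names buildQpdfArgs and describeQpdfExit are hypothetical. With exec, the exit status is available as error.code in the callback.

```javascript
// Hypothetical helper: qpdf expects the input path first, then the output path.
function buildQpdfArgs(input, output) {
  return ['--optimize-images', input, output];
}

// qpdf's documented exit codes: 0 = success, 2 = errors, 3 = warnings only.
function describeQpdfExit(code) {
  switch (code) {
    case 0:
      return 'ok';
    case 3:
      return 'ok-with-warnings';
    default:
      return 'failed';
  }
}

console.log(['qpdf', ...buildQpdfArgs('input.pdf', 'output.pdf')].join(' '));
console.log(describeQpdfExit(3));
```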
Test the code
First, let us build the image to bundle our code together with our chosen binary.
docker build . -t pdf-compressor
To run the command, I will be mounting the /home/node/application directory to a directory on my local machine that has the files I would like to compress, so the code can reach them and also write the compressed files to the same directory. Because the mount replaces the copy of the code inside the image, that directory must also contain index.js.
docker run -it -v ${PWD}:/home/node/application pdf-compressor
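Since the whole directory is mounted, the script could process every PDF it finds there instead of a hard-coded sample.pdf. The sketch below is one way to collect candidates; the helper name listPdfs is hypothetical, and skipping files that start with compressed__ is an assumption based on the output naming used in this post, so earlier results are not compressed again.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: list PDFs in a directory, skipping files that
// already carry the compressed__ prefix used for output names.
function listPdfs(dir) {
  return fs.readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith('.pdf'))
    .filter((name) => !name.startsWith('compressed__'))
    .sort()
    .map((name) => path.join(dir, name));
}

// Inside the container, the working directory is the mounted folder.
console.log(listPdfs(process.cwd()));
```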
Conclusion
The gains made from the compression depend mostly on how many uncompressed or unoptimized images are present in the document. You can test both solutions and tweak their options until you find a combination that gives you the best result.
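To compare options while tweaking, it helps to put a number on the gain. A minimal sketch, assuming you have the original and compressed files on disk; the helper names percentSaved and compareFiles are hypothetical.

```javascript
const fs = require('fs');

// Percentage of the original size that was saved (negative if the file grew).
function percentSaved(originalBytes, compressedBytes) {
  if (originalBytes <= 0) {
    throw new RangeError('originalBytes must be positive');
  }
  return ((originalBytes - compressedBytes) / originalBytes) * 100;
}

// Hypothetical helper: compare two files on disk by their size in bytes.
function compareFiles(originalPath, compressedPath) {
  const before = fs.statSync(originalPath).size;
  const after = fs.statSync(compressedPath).size;
  return { before, after, saved: percentSaved(before, after) };
}

// Example with made-up sizes: a 1,000,000-byte file reduced to 250,000 bytes.
console.log(`${percentSaved(1000000, 250000).toFixed(1)}% saved`);
```

A negative result means the "compressed" file is larger than the original, which can happen when the input was already well optimized.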