How to crop, split, remove pages from PDFs with Java and PDFBox

alexadam

Alex Adam

Posted on May 30, 2023

Create a new Java project with Maven

Let's start by creating a new Java project, named pdf_utils, with Maven:

mvn archetype:generate \
    -DgroupId=com.pdf.pdf_utils \
    -DartifactId=pdf_utils \
    -DarchetypeArtifactId=maven-archetype-quickstart \
    -DarchetypeVersion=1.4 \
    -DinteractiveMode=false

Then, open the pdf_utils/pom.xml file and add a dependency to PDFBox, in the dependencies section:

<dependencies>
   ...
    <dependency>
      <groupId>org.apache.pdfbox</groupId>
      <artifactId>pdfbox</artifactId>
      <version>2.0.27</version>
    </dependency>
    ...
</dependencies>

Also change the target & source compiler versions:

<plugins>
  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
      <source>17</source>
      <target>17</target>
    </configuration>
  </plugin>
</plugins>

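Then rename the generated class App (src/main/java/com/pdf/pdf_utils/App.java) to PDFUtils. The file has to be renamed to PDFUtils.java as well, because a public Java class must live in a file with the same name.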

How to crop a PDF

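We will now write a Java method that crops every page of a PDF document and saves the result as a new PDF file. The crop rectangle is passed in millimeters: x and y locate its lower-left corner in PDF coordinates (the origin is normally at the bottom-left of the page and y grows upward), while width and height give its size.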

  1. Add helper functions to convert between mm <-> units
  2. Convert mm to units
  3. Extract the doc's file name
  4. Crop each page
  5. Save the result & append "-cropped" to the output file's name

public String cropPDFMM(float x, float y, float width, float height, String srcFilePath) throws IOException {
    // helper functions to convert between mm <-> PDF units
    // (1 unit = 1/72 inch = 25.4 / 72 mm, about 0.352778 mm)
    Function<Float, Float> mmToUnits = (Float a) -> a / 0.352778f;
    Function<Float, Float> unitsToMm = (Float a) -> a * 0.352778f; // reverse conversion, not used below

    // convert mm to units
    float xUnits = mmToUnits.apply(x);
    float yUnits = mmToUnits.apply(y);
    float widthUnits = mmToUnits.apply(width);
    float heightUnits = mmToUnits.apply(height);

    // extract the doc's file name
    File srcFile = new File(srcFilePath);
    String fileName = srcFile.getName();
    int dotIndex = fileName.lastIndexOf('.');
    String fileNameWithoutExtension = (dotIndex == -1) ? fileName : fileName.substring(0, dotIndex);

    // the output path is relative, so the file lands in the current working directory
    File outFile = new File(fileNameWithoutExtension + "-cropped.pdf");

    // try-with-resources closes the document even if an exception is thrown
    try (PDDocument doc = PDDocument.load(srcFile)) {
        // crop each page; x and y are the lower-left corner of the crop box
        int nrOfPages = doc.getNumberOfPages();
        PDRectangle newBox = new PDRectangle(
                xUnits,
                yUnits,
                widthUnits,
                heightUnits);
        for (int i = 0; i < nrOfPages; i++) {
            doc.getPage(i).setCropBox(newBox);
        }

        // save the result & append -cropped to the file name
        doc.save(outFile);
    }
    return outFile.getCanonicalPath();
}

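The 0.352778 factor in cropPDFMM comes from the default PDF unit being 1/72 of an inch, and one inch being 25.4 mm. As a minimal sketch, the same conversion can be written as two plain static methods; the class name PageUnits and its method names are made up for this illustration and are not part of PDFBox.

```java
/** Converts between millimeters and PDF units (1 unit = 1/72 inch by default). */
final class PageUnits {
    // One inch is 25.4 mm and spans 72 default PDF units.
    static final float MM_PER_UNIT = 25.4f / 72f; // about 0.352778

    static float mmToUnits(float mm) {
        return mm / MM_PER_UNIT;
    }

    static float unitsToMm(float units) {
        return units * MM_PER_UNIT;
    }

    public static void main(String[] args) {
        // An A4 page is 210 mm wide, which is a bit more than 595 units.
        System.out.println("210 mm in units: " + mmToUnits(210f));
        System.out.println("72 units in mm: " + unitsToMm(72f));
    }
}
```

Deriving the factor from 25.4 / 72 instead of hard-coding 0.352778 keeps the origin of the number visible and avoids a small rounding error.
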
Let's test cropPDFMM by calling it from the main function:

public static void main( String[] args )
    {
    String srcFilePath = "/Users/user/.../file.pdf";
    PDFUtils app = new PDFUtils();

    try {
        ///// crop pdf
        float x = 18f;
        float y = 20f;
        float width = 140f;
        float height = 210f;
        String resultFilePath = app.cropPDFMM(x, y, width, height, srcFilePath);

        System.out.println( "Done!" );
    } catch (Exception e) {
        System.out.println(e);
    }
}

You should see a file named file-cropped.pdf in the current directory.

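The file ends up in the current directory because cropPDFMM builds the output path from the bare file name, and a relative File path is resolved against the working directory. If you would rather write the result next to the source file, one possible approach is sketched below. It uses only java.nio.file.Path; the class OutputNames and the method withSuffix are hypothetical names for this example, and the sketch assumes the source path has a file name component.

```java
import java.nio.file.Path;

final class OutputNames {
    /** Returns a path in the source file's directory: source name without extension + suffix + ".pdf". */
    static Path withSuffix(String srcFilePath, String suffix) {
        Path src = Path.of(srcFilePath);
        String fileName = src.getFileName().toString();
        int dotIndex = fileName.lastIndexOf('.');
        String stem = (dotIndex == -1) ? fileName : fileName.substring(0, dotIndex);
        // resolveSibling keeps the parent directory of the source path
        return src.resolveSibling(stem + suffix + ".pdf");
    }

    public static void main(String[] args) {
        System.out.println(withSuffix("/Users/user/pdfs/file.pdf", "-cropped"));
    }
}
```

The returned Path can be turned into a File with toFile() and passed to the save call in place of the relative output file.
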
How to remove pages from a PDF

To remove specific pages from a document, we pass an array of integer page ranges, stored as consecutive start/end pairs: [startPage1, endPage1, startPage2, endPage2, ...]. Page numbers are 1-based and both ends of a range are inclusive. The function iterates over the pages of the document and checks whether each page falls within any of the ranges. A page that is not within any range is appended to a new document.

  1. Add a helper function to test if a page is within a range
  2. Test each page number -> append it to a temp. doc
  3. Save the temp. doc. -> overwrite the input file

public void removePages(String srcFilePath, Integer[] pageRanges) throws IOException {
    // the array must hold complete start/end pairs
    if (pageRanges.length % 2 != 0) {
        throw new IllegalArgumentException("pageRanges must contain start/end pairs");
    }

    // a helper function to test if a page is within a range
    // page is a 0-based index; startPage and endPage are 1-based and inclusive
    BiPredicate<Integer, Integer[]> pageInInterval = (Integer page, Integer[] allPages) -> {
        for (int j = 0; j < allPages.length; j += 2) {
            int startPage = allPages[j];
            int endPage = allPages[j + 1];
            if (page >= startPage - 1 && page < endPage) {
                return true;
            }
        }
        return false;
    };

    File srcFile = new File(srcFilePath);
    // try-with-resources closes both documents, even if an exception is thrown
    try (PDDocument pdfDocument = PDDocument.load(srcFile);
         PDDocument tmpDoc = new PDDocument()) {

        // test if a page is within a range
        // if not, append the page to a temp. doc.
        for (int i = 0; i < pdfDocument.getNumberOfPages(); i++) {
            if (pageInInterval.test(i, pageRanges)) {
                continue;
            }
            tmpDoc.addPage(pdfDocument.getPage(i));
        }

        // save the temporary doc. over the input file
        tmpDoc.save(srcFile);
    }
}

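Because the range check mixes 0-based page indexes with 1-based range bounds, it can help to try the logic on its own, without a PDF. The sketch below works purely with 1-based page numbers and lists which pages of a document would survive; PageRanges, contains and pagesToKeep are illustrative names, not part of the code above or of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

final class PageRanges {
    /** True if the 1-based pageNumber lies in any inclusive [start, end] pair of ranges. */
    static boolean contains(int[] ranges, int pageNumber) {
        if (ranges.length % 2 != 0) {
            throw new IllegalArgumentException("ranges must hold start/end pairs");
        }
        for (int j = 0; j < ranges.length; j += 2) {
            if (pageNumber >= ranges[j] && pageNumber <= ranges[j + 1]) {
                return true;
            }
        }
        return false;
    }

    /** Lists the 1-based page numbers that are outside every range. */
    static List<Integer> pagesToKeep(int pageCount, int[] ranges) {
        List<Integer> kept = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            if (!contains(ranges, page)) {
                kept.add(page);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Remove pages 1-2 and 5-6 from a 10-page document.
        System.out.println(pagesToKeep(10, new int[] {1, 2, 5, 6})); // prints [3, 4, 7, 8, 9, 10]
    }
}
```
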
Let's test it by calling removePages in the main function:

 ///// remove pages
app.removePages(resultFilePath, new Integer[] {1, 21, 376, 428});

It will overwrite the input (cropped) file.

How to split a PDF

Next, we add a function that splits a PDF into several smaller PDFs, each holding a fixed number of pages (the last one holds whatever remains). It takes two inputs: the path to the source PDF document and the number of pages per output file.

  1. Extract the source file's name
  2. Walk through the document in steps of nrOfPages pages
  3. Copy each group of pages into a temporary document
  4. Save the temp doc with the source file's name + the group's first page number

public void splitPDF(String srcFilePath, int nrOfPages) throws IOException {
    // extract file's name
    File srcFile = new File(srcFilePath);
    String fileName = srcFile.getName();
    int dotIndex = fileName.lastIndexOf('.');
    String fileNameWithoutExtension = (dotIndex == -1) ? fileName : fileName.substring(0, dotIndex);

    try (PDDocument pdfDocument = PDDocument.load(srcFile)) {
        int totalPages = pdfDocument.getNumberOfPages();

        // extract every nrOfPages to a temporary document
        // append the first page number to its name and save it
        // (<= so that a single trailing page is not skipped)
        for (int i = 1; i <= totalPages; i += nrOfPages) {
            Splitter splitter = new Splitter();

            // Splitter page numbers are 1-based and the end page is inclusive
            int fromPage = i;
            int toPage = Math.min(i + nrOfPages - 1, totalPages);
            splitter.setStartPage(fromPage);
            splitter.setEndPage(toPage);
            splitter.setSplitAtPage(nrOfPages);

            // the range holds at most nrOfPages pages, so the list has a single document
            List<PDDocument> lst = splitter.split(pdfDocument);

            try (PDDocument pdfDocPartial = lst.get(0)) {
                File f = new File(fileNameWithoutExtension + "-" + i + ".pdf");
                pdfDocPartial.save(f);
            }
        }
    }
}

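The page arithmetic is the easiest part of splitting to get wrong, especially around the last, shorter file. As a rough sketch, the start and end page of every output file can be computed and checked without touching a PDF; SplitPlan and chunks are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPlan {
    /** Returns inclusive, 1-based {fromPage, toPage} pairs covering pages 1..pageCount. */
    static List<int[]> chunks(int pageCount, int pagesPerFile) {
        if (pagesPerFile <= 0) {
            throw new IllegalArgumentException("pagesPerFile must be positive");
        }
        List<int[]> result = new ArrayList<>();
        for (int from = 1; from <= pageCount; from += pagesPerFile) {
            // the last chunk may be shorter than pagesPerFile
            int to = Math.min(from + pagesPerFile - 1, pageCount);
            result.add(new int[] {from, to});
        }
        return result;
    }

    public static void main(String[] args) {
        // A 45-page document split into files of 20 pages.
        for (int[] chunk : chunks(45, 20)) {
            System.out.println(chunk[0] + "-" + chunk[1]);
        }
        // prints 1-20, 21-40 and 41-45, one per line
    }
}
```

A document with 21 pages and 20 pages per file is a useful edge case: the second pair is 21-21, a file containing only the last page.
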
Here is the full main() function:

public static void main( String[] args ){
    String srcFilePath = "/Users/user/pdfs/file.pdf";
    PDFUtils app = new PDFUtils();

    try {
         ///// crop pdf
        float x = 18f;
        float y = 20f;
        float width = 140f;
        float height = 210f;
        String resultFilePath = app.cropPDFMM(x, y, width, height, srcFilePath);

        ///// remove pages
        app.removePages(resultFilePath, new Integer[] {1, 21, 376, 428});

        ///// split pages
        app.splitPDF(resultFilePath, 20);

        System.out.println( "Done!" );
    } catch (Exception e) {
        System.out.println(e);
    }
}

Merge 2 pages per sheet (2up)

This function is inspired by https://stackoverflow.com/questions/12093408/pdfbox-merge-2-portrait-pages-onto-a-single-side-by-side-landscape-page.

  1. Create a temporary document
  2. Iterate over the pages of the original doc. -> get the left & right pages
  3. Create a new "output" page with the right dimensions
  4. Append the left page at (0,0)
  5. Append the right page, translated to (left page's width, 0)
  6. Save the temp. doc -> overwrite the source file

public void mergePages(String srcFilePath) throws IOException {

    // SOURCE: https://stackoverflow.com/questions/12093408/pdfbox-merge-2-portrait-pages-onto-a-single-side-by-side-landscape-page
    File srcFile = new File(srcFilePath);
    try (PDDocument pdfDocument = PDDocument.load(srcFile);
         PDDocument outPdf = new PDDocument()) {
        int nrOfPages = pdfDocument.getNumberOfPages();

        for (int i = 0; i < nrOfPages; i += 2) {
            PDPage page1 = pdfDocument.getPage(i);
            // with an odd page count, the last sheet has no right page
            PDPage page2 = (i + 1 < nrOfPages) ? pdfDocument.getPage(i + 1) : null;
            PDRectangle pdf1Frame = page1.getCropBox();
            PDRectangle pdf2Frame = (page2 != null) ? page2.getCropBox() : pdf1Frame;
            PDRectangle outPdfFrame = new PDRectangle(pdf1Frame.getWidth() + pdf2Frame.getWidth(), Math.max(pdf1Frame.getHeight(), pdf2Frame.getHeight()));

            // Create output page with calculated frame and add it to the document
            COSDictionary dict = new COSDictionary();
            dict.setItem(COSName.TYPE, COSName.PAGE);
            dict.setItem(COSName.MEDIA_BOX, outPdfFrame);
            dict.setItem(COSName.CROP_BOX, outPdfFrame);
            dict.setItem(COSName.ART_BOX, outPdfFrame);
            PDPage newP = new PDPage(dict);
            outPdf.addPage(newP);

            // Source PDF pages have to be imported as form XObjects to be able to insert them at a specific point in the output page
            LayerUtility layerUtility = new LayerUtility(outPdf);
            PDFormXObject formPdf1 = layerUtility.importPageAsForm(pdfDocument, page1);
            AffineTransform afLeft = new AffineTransform();
            layerUtility.appendFormAsLayer(newP, formPdf1, afLeft, "left" + i);

            if (page2 != null) {
                PDFormXObject formPdf2 = layerUtility.importPageAsForm(pdfDocument, page2);
                AffineTransform afRight = AffineTransform.getTranslateInstance(pdf1Frame.getWidth(), 0.0);
                layerUtility.appendFormAsLayer(newP, formPdf2, afRight, "right" + i);
            }
        }

        outPdf.save(srcFile);
    }
}

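The geometry behind the 2-up sheet is simple: the sheet is as wide as both pages together and as tall as the taller page, and the right page is shifted by the width of the left one. A small sketch of that calculation, independent of PDFBox, is shown below; TwoUpLayout, Sheet and layout are names invented for this illustration, and the record syntax assumes Java 16 or newer (the project targets 17).

```java
final class TwoUpLayout {
    /** Sheet size and the x offset of the right page, all in PDF units. */
    record Sheet(float width, float height, float rightOffsetX) {}

    static Sheet layout(float leftWidth, float leftHeight, float rightWidth, float rightHeight) {
        float width = leftWidth + rightWidth;               // pages sit side by side
        float height = Math.max(leftHeight, rightHeight);   // the taller page sets the height
        return new Sheet(width, height, leftWidth);         // right page starts where the left one ends
    }

    public static void main(String[] args) {
        // Two A4 portrait pages, roughly 595 x 842 units each.
        Sheet sheet = layout(595f, 842f, 595f, 842f);
        System.out.println(sheet.width() + " x " + sheet.height() + ", right page at x = " + sheet.rightOffsetX());
    }
}
```

Both pages are placed at y = 0, so when the heights differ the shorter page is aligned with the bottom edge of the sheet.
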
Update main() to test it:

...
 ///// 2 pages per sheet
app.mergePages(resultFilePath);
...

Here is the full list of imports:

package com.pdf.pdf_utils;

import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.multipdf.LayerUtility;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;

import java.awt.geom.AffineTransform;
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Function;

The source code is available here.
