Eliot Jones
Posted on July 25, 2020
The question anyone who has tried to extract text from a PDF using C# will have asked themselves at one point or another is: why is this so complicated?
It's a good question and the answer lies in trade-offs made when the PDF format was designed.
To those unfamiliar with it I'd describe a PDF file as a picture. At a very high level it's a set of images defining how the pages in the document should appear. This means whatever platform you view it on, it should look (more-or-less) identical, whether you're on Windows, Linux, Chrome, Android, etc. The fact it contains text and font information is almost, but not quite, incidental.
The presence of fonts in the file helps applications that display PDFs draw text in (almost) the same way across platforms. The text content included in a document mostly just defines where letters from a font should be drawn. There are even some documents containing fonts where the text information has no actual relationship to the displayed glyphs. You might have encountered them before: if you highlight and copy some text that appears 'normal' in one of these documents, then paste it into another application, you get nonsense.
With that in mind, there's no such thing as 'perfect' (or, a lot of the time, even passable) text extraction from PDFs. They're not primarily designed to transmit text in a useful way; it's pretty much a side effect of the requirement to render the document that the file contains text at all.
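To make the copy-paste problem a bit more concrete, here's a toy model of a font with a private encoding. Everything in it (the class name, the codes, the mapping) is made up for illustration and isn't how any real PDF library represents fonts. It only shows why the stored codes and the drawn glyphs can disagree when the font carries no mapping back to Unicode.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of a font with a private encoding; not a real PDF API.
public static class GlyphMappingDemo
{
    // What the font draws for each character code stored in the page content.
    private static readonly Dictionary<char, char> DrawnGlyph = new Dictionary<char, char>
    {
        ['A'] = 'H', ['B'] = 'e', ['C'] = 'l', ['D'] = 'o'
    };

    // The codes as they sit in the file: this is all a naive copy/paste can see.
    public const string StoredCodes = "ABCCD";

    // What a viewer renders: each code is looked up in the font and its glyph is painted.
    public static string Rendered() =>
        new string(StoredCodes.Select(code => DrawnGlyph[code]).ToArray());

    // What extraction yields when the font ships no code-to-Unicode mapping.
    public static string Extracted() => StoredCodes;
}
```

In this sketch the page displays "Hello" while the extracted text is "ABCCD", which is the kind of mismatch described above.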
For this reason some people just run OCR against all PDF documents and rely on the OCR to extract text from what is, and I'm repeating myself here, basically an image.
If you don't want to run OCR and you don't want to fork out a considerable amount of money for commercially licensed PDF software, what are your options for getting text out of a PDF in C#?
Options
For the following examples I'm targeting .NET Core 2.1 on Windows 10 using Visual Studio 2017. I'll be using the sample PDF found here but you can use any PDF file.
For the licensing discussion below, the traditional disclaimer applies: I am not a lawyer and I don't particularly understand software licenses. Consult someone who understands this stuff if licensing is a real issue for you.
iTextSharp
The original, and one of the more well-established PDF libraries in C#. Most versions of iTextSharp (now iText as of version 7) are covered by the AGPL. This is quite an 'aggressive' license: you can't use the library in commercial software unless you either release your entire source code under the AGPL (controversial take: I consider this source available rather than really open source) or buy a commercial license.
There's an unofficial fork of iTextSharp, taken from before the change to the AGPL when it was still LGPL licensed, with some recent changes to port it to .NET Core. The LGPL is still a copyleft license (note that this link is to LGPL v2.1 rather than v2).
dotnet add package iTextSharp.LGPLv2.Core
Once you have the package installed you can refer to the examples on GitHub to accomplish most tasks. The following code opens a file from disk and writes the text content to the console:
using System;
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;

// Create a reader from the file bytes.
var reader = new PdfReader(File.ReadAllBytes(@"..\..\..\sample.pdf"));
// Page numbers are 1-based.
for (var pageNum = 1; pageNum <= reader.NumberOfPages; pageNum++)
{
    // Get the page content and tokenize it.
    var contentBytes = reader.GetPageContent(pageNum);
    var tokenizer = new PrTokeniser(new RandomAccessFileOrArray(contentBytes));
    var stringsList = new List<string>();
    while (tokenizer.NextToken())
    {
        if (tokenizer.TokenType == PrTokeniser.TK_STRING)
        {
            // Extract string tokens.
            stringsList.Add(tokenizer.StringValue);
        }
    }
    // Print the set of string tokens, one on each line.
    Console.WriteLine(string.Join("\r\n", stringsList));
}
reader.Close();
The iTextSharp API has always struck me as a bit tricky to understand and the licensing would be a deal-breaker for me, even under the LGPL rather than AGPL. However, you get access to the power of one of the largest and most feature-complete C# PDF libraries.
PdfPig
Disclaimer: I'm the maintainer of this package.
PdfPig is an Apache 2.0 licensed library that started as an attempt to port the Java PDFBox project to C#. I built PdfPig with a particular focus on extracting text from PDFs. Other use-cases like creating PDFs are less well supported, and some, like PDF to image or HTML to PDF, are not supported at all.
First get the package from NuGet:
dotnet add package PdfPig
Then to open and extract the text, like we did for the previous library:
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor;

using (var pdf = PdfDocument.Open(@"..\..\..\sample.pdf"))
{
    foreach (var page in pdf.GetPages())
    {
        // Either extract based on order in the underlying document with newlines and spaces.
        var text = ContentOrderTextExtractor.GetText(page);
        // Or based on grouping letters into words.
        var otherText = string.Join(" ", page.GetWords());
        // Or the raw text of the page's content stream.
        var rawText = page.Text;
        Console.WriteLine(text);
    }
}
PdfPig provides multiple text extraction strategies. Porting the excellent PDFBox PDFTextStripper is an outstanding issue, but PdfPig exposes a rich API based around letters to support any custom text extraction logic.
Each page gives you access to the letters and their exact position on the page, plus almost all the information you could possibly want. Given the difficulty of extracting text content in a reliable order, PdfPig is designed so that you can extract PDF text in any way you might need to, and enables you to build your own post-processing pipelines to give you the best possible results for your use-case.
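As a rough idea of what a post-processing step might look like, here's a minimal sketch that groups positioned letters into lines. The LineGrouper class, its ToLines method and the tolerance value are all hypothetical names I've made up for illustration; none of them are part of PdfPig. It works on plain tuples so it has no dependency on any PDF library, and it doesn't attempt to infer spaces between words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of PdfPig: groups positioned letters into lines.
public static class LineGrouper
{
    // Letters whose Y position is within `tolerance` points of the first letter
    // in a line are treated as belonging to that line. Lines are returned top
    // to bottom, with letters ordered left to right. No spaces are inserted.
    public static List<string> ToLines(
        IEnumerable<(string Value, double X, double Y)> letters,
        double tolerance = 2.0)
    {
        var lines = new List<List<(string Value, double X, double Y)>>();

        // PDF coordinates start at the bottom-left, so a larger Y is higher up the page.
        foreach (var letter in letters.OrderByDescending(l => l.Y))
        {
            var current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (current != null && Math.Abs(current[0].Y - letter.Y) <= tolerance)
            {
                current.Add(letter);
            }
            else
            {
                lines.Add(new List<(string Value, double X, double Y)> { letter });
            }
        }

        return lines
            .Select(line => string.Concat(line.OrderBy(l => l.X).Select(l => l.Value)))
            .ToList();
    }
}
```

With PdfPig the input could probably be built with something like page.Letters.Select(l => (l.Value, l.Location.X, l.Location.Y)), assuming a version of the library where coordinates are doubles. A fixed tolerance like this will fall over on rotated text or multi-column layouts, which is exactly the sort of case where you'd want your own logic.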
docnet
docnet wraps the PDFium C++ library used by Chromium. It provides a C# API for the functionality available in the C++ library. This MIT licensed wrapper wraps the Apache 2.0 licensed PDFium code so is properly open source.
dotnet add package Docnet.Core
Then you can extract the content from each page, or access the letters directly:
using System;
using Docnet.Core;
using Docnet.Core.Models;

using (var docReader = DocLib.Instance.GetDocReader(@"..\..\..\sample.pdf", new PageDimensions()))
{
    // Page indexes are 0-based.
    for (var i = 0; i < docReader.GetPageCount(); i++)
    {
        using (var pageReader = docReader.GetPageReader(i))
        {
            var text = pageReader.GetText();
            Console.WriteLine(text);
        }
    }
}
docnet gives you the speed benefit of native libraries as well as the reassurance of running the PDF code which powers Chromium and by extension, Chrome. Currently it restricts you to targeting x64 but this may change in future.
PdfSharp
This is a port of the MIT licensed PdfSharp library to .NET Core. It seems to be primarily focused on creating, rather than reading, PDFs but also supports other operations. It also replaces the System.Drawing dependency of the original PdfSharp with the more cross-platform friendly ImageSharp library, which means, as usual, you should check the licenses of the dependencies (there was some talk of changing the ImageSharp license recently).
I couldn't find an immediately obvious API for text extraction and there seems to be an open issue for text extraction, but I thought I'd mention it as an option if you're looking to convert PDF to image, or work with the internal PDF structure.
Conclusion
We reviewed a few of the options available to a developer looking to read text from a PDF in C# on .NET Core. It's difficult to find properly open-source software, rather than commercial or copyleft-licensed software, for this task.
Even when we find a library it's still never going to extract text in reading order perfectly 100% of the time, since PDF was never designed to support this.
I've included the options I'm aware of, but if you feel I've missed any let me know in the comments.
I hope this article helps you write great software to bring the power of PDF to the people!