Index PDFs on their pages in Umbraco
Jesper Mayntzhusen
Posted on October 29, 2023
Intro
The Umbraco documentation has an example of how to set up the PDF indexing package and add a multi-searcher. With that setup, all of the PDFs in the media library are indexed in a separate PDF index, and the searcher is extended to search both Umbraco nodes and PDF documents.
This means that when the user searches they can get PDF results which you can link directly to from the search results.
This blogpost takes a different approach. Rather than relying on a search in a separate document index, it goes through how to index PDFs in Umbraco and add their text content to the index entry of the content nodes they are included on.
Setting up
To begin with I've started a new site on Umbraco 12.2.0 with the Clean starter kit to have some start content.
I've added a block for the blocklist where you can upload PDF files, and I've downloaded an Umbraco whitepaper as a PDF from Umbraco.com to test with.
You can see the block with files being added to one of the blogposts in the starter kit, and the whitepaper is selected:
To help with typed models we will also enable Umbraco ModelsBuilder in the appsettings.json file and generate models:
"Umbraco": {
  "CMS": {
    "ModelsBuilder": {
      "ModelsMode": "SourceCodeManual"
    }
  }
},
Installing the PDF indexing package
To index the PDFs we can use the official Umbraco package. It can be installed from NuGet, where it is named Umbraco.ExaminePDF.
After installing and restarting the site you can find the new index in the settings section for Examine indexes. If you rebuild it you will see the PDFs from your media section. In my example I only have the Umbraco whitepaper:
However this data is not indexed on any content nodes - let's fix that!
Extending the External Index with PDF data
First of all we will hook into an event on the ExternalIndex so we can add a new PDF field to put the data into. We add a new file with a notification handler that subscribes to the TransformingIndexValues event on the external index:
using Examine;
using Umbraco.Cms.Core.Events;
using Umbraco.Cms.Core.Notifications;

namespace PdfIndexing.Searching;

public class ExternalIndexTransformer : INotificationHandler<UmbracoApplicationStartedNotification>
{
    private readonly IExamineManager _examineManager;

    public ExternalIndexTransformer(IExamineManager examineManager)
    {
        _examineManager = examineManager;
    }

    public void Handle(UmbracoApplicationStartedNotification notification)
    {
        if (!_examineManager.TryGetIndex(Umbraco.Cms.Core.Constants.UmbracoIndexes.ExternalIndexName, out var index))
        {
            throw new InvalidOperationException(
                $"No index found by name {Umbraco.Cms.Core.Constants.UmbracoIndexes.ExternalIndexName}");
        }

        // Subscribe once at startup; the handler runs for every item being indexed.
        index.TransformingIndexValues += IndexOnTransformingIndexValues;
    }

    private void IndexOnTransformingIndexValues(object? sender, IndexingItemEventArgs e)
    {
        // Placeholder, filled in step by step below.
        throw new NotImplementedException();
    }
}
And a composer to ensure it is registered in dependency injection:
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;
using Umbraco.Cms.Core.Notifications;

namespace PdfIndexing.Searching;

public class Composer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
    {
        // Register the handler so it runs once the application has started.
        builder.AddNotificationHandler<UmbracoApplicationStartedNotification, ExternalIndexTransformer>();
    }
}
In the indexing event we want to add code that does:
- Only extend the Article doctype
- Get the node content from the cache and cast it to the Article type
- Get its blocklist property (called ContentRows in this site)
- Loop through the blocks and only do something if it's the files row type which is the new PDF block.
private void IndexOnTransformingIndexValues(object? sender, IndexingItemEventArgs e)
{
    // Only extend items of the Article doctype.
    if (e.ValueSet.ItemType is not Article.ModelTypeAlias) return;
    if (!int.TryParse(e.ValueSet.Id, out int id)) return;

    // _umbracoContextFactory is an injected IUmbracoContextFactory (see the full listing further down).
    using var context = _umbracoContextFactory.EnsureUmbracoContext();
    var content = context.UmbracoContext.Content?.GetById(id);
    if (content is not Article typedContent) return;

    var blockList = typedContent.ContentRows;
    if (blockList is null) return;

    foreach (var block in blockList)
    {
        // Skip every block that is not the new PDF block.
        if (block.Content is not FilesRow filesRow) continue;
    }
}
Next we want to do the following:
- Loop through the selected files and get the umbracoFile property of the files, which in Umbraco corresponds to the file's path.
- If it finds it we can use the PdfTextService from the UmbracoExamine.PDF package to extract the file content.
- We add all of the files' content to a string which we can later index on the node.
// Collects the text of every PDF found on the node.
var pdfContent = string.Empty;

foreach (var block in blockList)
{
    if (block.Content is not FilesRow filesRow) continue;
    if (filesRow.Files is null) continue;

    foreach (var file in filesRow.Files)
    {
        var filePath = file.Value<string>("umbracoFile");
        if (string.IsNullOrWhiteSpace(filePath)) continue;

        try
        {
            var pdfString = _pdfTextService.ExtractText(filePath);
            if (pdfString is not null)
            {
                pdfContent += pdfString + " ";
            }
        }
        catch (Exception ex)
        {
            // Log and carry on so one unreadable file does not stop the rest.
            _logger.LogError(ex, "Could not index the file content from path: {FilePath}", filePath);
        }
    }
}
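The loop above builds the string with +=, which is fine for a handful of small PDFs. If a node can hold many or large files, a StringBuilder avoids the repeated copying. The following is only a sketch of the same aggregation logic in isolation: PdfTextAggregator and Combine are hypothetical names, the extract delegate stands in for PdfTextService.ExtractText, and the onError delegate stands in for the logger call.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of Umbraco or the PDF indexing package.
public static class PdfTextAggregator
{
    // Joins the extracted text of all given file paths into one string.
    // 'extract' stands in for PdfTextService.ExtractText,
    // 'onError' stands in for the logger call in the handler.
    public static string Combine(
        IEnumerable<string?> filePaths,
        Func<string, string?> extract,
        Action<Exception, string> onError)
    {
        var builder = new StringBuilder();

        foreach (var path in filePaths)
        {
            if (string.IsNullOrWhiteSpace(path)) continue;

            try
            {
                var text = extract(path);
                if (!string.IsNullOrWhiteSpace(text))
                {
                    builder.Append(text).Append(' ');
                }
            }
            catch (Exception ex)
            {
                // One unreadable file should not stop the rest from being indexed.
                onError(ex, path);
            }
        }

        return builder.ToString().TrimEnd();
    }
}
```

In the handler this could be called with the umbracoFile values of the selected files, passing _pdfTextService.ExtractText as the extract delegate.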
Finally, after the blocklist foreach loop we can take the new PDF string content and add it as a new field in the index:
if(string.IsNullOrWhiteSpace(pdfContent)) return;
var indexFields = e.ValueSet.Values.ToDictionary(x => x.Key, x => x.Value.ToList());
indexFields.Add("pdfTextContent", new List<object>{pdfContent});
e.SetValues(indexFields.ToDictionary(x => x.Key, x => (IEnumerable<object>)x.Value));
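The copy, add, write-back dance is needed because the values on the value set are read-only, so the fields have to be copied into a mutable dictionary before anything can be added. As a rough sketch, the three lines can be wrapped in a small helper. IndexFieldHelper and WithField are hypothetical names, and the parameter type assumes that ValueSet.Values is an IReadOnlyDictionary<string, IReadOnlyList<object>>, as it is in the Examine version I believe ships with Umbraco 12.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Examine or Umbraco.
public static class IndexFieldHelper
{
    // Returns a mutable copy of the existing fields with one single-valued field set.
    // Using the indexer instead of Add means an existing field is overwritten
    // rather than causing an exception.
    public static IDictionary<string, IEnumerable<object>> WithField(
        IReadOnlyDictionary<string, IReadOnlyList<object>> existing,
        string fieldName,
        object value)
    {
        var copy = existing.ToDictionary(x => x.Key, x => x.Value.ToList());
        copy[fieldName] = new List<object> { value };

        return copy.ToDictionary(x => x.Key, x => (IEnumerable<object>)x.Value);
    }
}
```

With that in place the end of the handler could shrink to e.SetValues(IndexFieldHelper.WithField(e.ValueSet.Values, "pdfTextContent", pdfContent));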
See full event code here:
using Examine;
using Umbraco.Cms.Core.Events;
using Umbraco.Cms.Core.Notifications;
using Umbraco.Cms.Core.Web;
using Umbraco.Cms.Web.Common.PublishedModels;
using UmbracoExamine.PDF;

namespace PdfIndexing.Searching;

public class ExternalIndexTransformer : INotificationHandler<UmbracoApplicationStartedNotification>
{
    private readonly IExamineManager _examineManager;
    private readonly IUmbracoContextFactory _umbracoContextFactory;
    private readonly PdfTextService _pdfTextService;
    private readonly ILogger<ExternalIndexTransformer> _logger;

    public ExternalIndexTransformer(IExamineManager examineManager,
        IUmbracoContextFactory umbracoContextFactory,
        PdfTextService pdfTextService,
        ILogger<ExternalIndexTransformer> logger)
    {
        _examineManager = examineManager;
        _umbracoContextFactory = umbracoContextFactory;
        _pdfTextService = pdfTextService;
        _logger = logger;
    }

    public void Handle(UmbracoApplicationStartedNotification notification)
    {
        if (!_examineManager.TryGetIndex(Umbraco.Cms.Core.Constants.UmbracoIndexes.ExternalIndexName, out var index))
        {
            throw new InvalidOperationException(
                $"No index found by name {Umbraco.Cms.Core.Constants.UmbracoIndexes.ExternalIndexName}");
        }

        index.TransformingIndexValues += IndexOnTransformingIndexValues;
    }

    private void IndexOnTransformingIndexValues(object? sender, IndexingItemEventArgs e)
    {
        // Only extend items of the Article doctype.
        if (e.ValueSet.ItemType is not Article.ModelTypeAlias) return;
        if (!int.TryParse(e.ValueSet.Id, out int id)) return;

        using var context = _umbracoContextFactory.EnsureUmbracoContext();
        var content = context.UmbracoContext.Content?.GetById(id);
        if (content is not Article typedContent) return;

        var blockList = typedContent.ContentRows;
        if (blockList is null) return;

        // Collects the text of every PDF found on the node.
        var pdfContent = string.Empty;

        foreach (var block in blockList)
        {
            if (block.Content is not FilesRow filesRow) continue;
            if (filesRow.Files is null) continue;

            foreach (var file in filesRow.Files)
            {
                var filePath = file.Value<string>("umbracoFile");
                if (string.IsNullOrWhiteSpace(filePath)) continue;

                try
                {
                    var pdfString = _pdfTextService.ExtractText(filePath);
                    if (pdfString is not null)
                    {
                        pdfContent += pdfString + " ";
                    }
                }
                catch (Exception ex)
                {
                    // Log and carry on so one unreadable file does not stop the rest.
                    _logger.LogError(ex, "Could not index the file content from path: {FilePath}", filePath);
                }
            }
        }

        if (string.IsNullOrWhiteSpace(pdfContent)) return;

        // The value set is read-only, so copy the fields, add the new one and write them back.
        var indexFields = e.ValueSet.Values.ToDictionary(x => x.Key, x => x.Value.ToList());
        indexFields.Add("pdfTextContent", new List<object> { pdfContent });
        e.SetValues(indexFields.ToDictionary(x => x.Key, x => (IEnumerable<object>)x.Value));
    }
}
Finally we can rebuild and run the site with our new code, then save and publish an article with a PDF on it. Once it has been published, the new event handler will have added the PDF content to the index entry for the content node in the new field "pdfTextContent":
This means that whatever search implementation is used can be extended to also search this new field, so the node is included in search results when its PDF content contains matches.
NOTE: If an editor goes to the media node of the PDF and replaces the file, the index will end up out of date. You can read more about why this happens, and what you can possibly do to get around the problem of referenced content being indexed, in my other blogpost.
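As a rough illustration of what that could look like, here is a minimal sketch of a search that matches the term against both the node name and the new PDF field. ArticleSearchService is a hypothetical name and not something from the starter kit, and which other fields to include (only nodeName here) depends entirely on your own search implementation.

```csharp
using System.Collections.Generic;
using System.Linq;
using Examine;
using Umbraco.Cms.Infrastructure.Examine;

namespace PdfIndexing.Searching;

// Hypothetical service, shown only to illustrate querying the new field.
public class ArticleSearchService
{
    private readonly IExamineManager _examineManager;

    public ArticleSearchService(IExamineManager examineManager)
    {
        _examineManager = examineManager;
    }

    public IEnumerable<ISearchResult> Search(string term)
    {
        if (string.IsNullOrWhiteSpace(term))
        {
            return Enumerable.Empty<ISearchResult>();
        }

        if (!_examineManager.TryGetIndex(
                Umbraco.Cms.Core.Constants.UmbracoIndexes.ExternalIndexName, out var index))
        {
            return Enumerable.Empty<ISearchResult>();
        }

        // A node matches if the term is found in its name OR in the text
        // of any PDF that was indexed onto it in "pdfTextContent".
        return index.Searcher
            .CreateQuery(IndexTypes.Content)
            .GroupedOr(new[] { "nodeName", "pdfTextContent" }, term)
            .Execute();
    }
}
```

The service would still need to be registered in dependency injection, for example in the same composer as the notification handler.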