Extract Content From ODF files using C#
Jay Malli
Posted on September 10, 2023
LibreOffice is a popular open-source office suite for creating and editing documents, spreadsheets, presentations, and more. By default it saves files in the OpenDocument Format (ODF), with extensions such as .odt for text documents and .ods for spreadsheets.
Our requirement is to open such ODF files and extract their content. We can't read that content directly with the File class from the System.IO namespace, so what should we do?
The answer: if we can get the file's data as XML, we can extract the content from that XML. An ODF file is a ZIP archive of XML files, so this is possible. Let's take a LibreOffice Writer text document, which has the extension ".odt", as an example.
Step - 1 : Open the ODF file as a ZIP archive to get its data in XML format.
Step - 2 : Filter the .xml entries and read their XML data. The "content.xml" entry contains the content/text of the file.
Step - 3 : Extract the text content from the XML data.
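Steps 1 and 2 can be checked by listing what is inside the package. The following is a minimal sketch, assuming a .NET project where System.IO.Compression is available; `OdfPackage` and `ListXmlEntries` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper, not part of the original post.
public static class OdfPackage
{
    // Returns the full names of all .xml entries inside an ODF (ZIP) package.
    public static List<string> ListXmlEntries(Stream odfStream)
    {
        using var archive = new ZipArchive(odfStream, ZipArchiveMode.Read, leaveOpen: true);
        return archive.Entries
            .Where(e => e.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
            .Select(e => e.FullName)
            .ToList();
    }

    // Convenience overload for a file on disk.
    public static List<string> ListXmlEntries(string filePath)
    {
        using var stream = File.OpenRead(filePath);
        return ListXmlEntries(stream);
    }
}
```

For an .odt saved by LibreOffice, the list typically includes content.xml, styles.xml, meta.xml, settings.xml, and META-INF/manifest.xml.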
=> Source Code :
using System;
using System.IO;
using System.IO.Compression;
using System.Xml;

namespace Content
{
    public class ODF
    {
        // Opens the ODF file as a ZIP archive and collects the text found in its .xml entries.
        public string ReadText(string filePath)
        {
            string textContent = "";
            using (ZipArchive zipArchive = ZipFile.OpenRead(filePath))
            {
                foreach (var entry in zipArchive.Entries)
                {
                    // content.xml holds the document body; the other .xml entries are scanned as well
                    if (entry.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                    {
                        using StreamReader reader = new StreamReader(entry.Open());
                        string xmlContent = reader.ReadToEnd();
                        textContent += ExtractTextFromXml(xmlContent);
                    }
                }
            }
            return textContent; // output : text content
        }

        // Returns the text of all paragraphs (text:p) and headings (text:h), one per line.
        public string ExtractTextFromXml(string xmlContent)
        {
            string textContent = "";
            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.LoadXml(xmlContent);

            // add required namespace for different types of documents
            XmlNamespaceManager nsManager = new XmlNamespaceManager(xmlDoc.NameTable);
            nsManager.AddNamespace("text", "urn:oasis:names:tc:opendocument:xmlns:text:1.0"); // for doc files with extension .odt
            nsManager.AddNamespace("office", "urn:oasis:names:tc:opendocument:xmlns:office:1.0"); // common for all ODF files

            foreach (XmlNode node in xmlDoc.SelectNodes("//text:p | //text:h", nsManager))
            {
                textContent += node.InnerText + Environment.NewLine;
            }
            return textContent;
        }
    }
}
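To try the approach without a LibreOffice file at hand, the sketch below builds a tiny ODF-like package in memory and extracts its headings and paragraphs with the same XPath query. It reads only content.xml rather than every .xml entry and uses a StringBuilder, both of which are deliberate variations on the code above. `OdtSketch`, `BuildSample`, and `ReadContentXml` are hypothetical names, and the sample XML is far smaller than what LibreOffice actually writes.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

// Hypothetical helper for experimenting; not the author's implementation.
public static class OdtSketch
{
    private const string TextNs = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    private const string OfficeNs = "urn:oasis:names:tc:opendocument:xmlns:office:1.0";

    // Builds a minimal ZIP package holding only a content.xml entry.
    // This is illustrative and not a complete, valid .odt file.
    public static MemoryStream BuildSample()
    {
        string xml =
            $"<office:document-content xmlns:office=\"{OfficeNs}\" xmlns:text=\"{TextNs}\">" +
            "<office:body><office:text>" +
            "<text:h>Title</text:h><text:p>Hello ODF</text:p>" +
            "</office:text></office:body></office:document-content>";

        var stream = new MemoryStream();
        using (var zip = new ZipArchive(stream, ZipArchiveMode.Create, leaveOpen: true))
        using (var writer = new StreamWriter(zip.CreateEntry("content.xml").Open(), new UTF8Encoding(false)))
        {
            writer.Write(xml);
        }
        stream.Position = 0;
        return stream;
    }

    // Extracts heading and paragraph text from the content.xml entry only.
    public static string ReadContentXml(Stream odfStream)
    {
        using var zip = new ZipArchive(odfStream, ZipArchiveMode.Read, leaveOpen: true);
        var entry = zip.GetEntry("content.xml");
        if (entry == null) return string.Empty;

        var doc = new XmlDocument();
        using (Stream entryStream = entry.Open())
        {
            doc.Load(entryStream);
        }

        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("text", TextNs);

        var nodes = doc.SelectNodes("//text:p | //text:h", ns);
        if (nodes == null) return string.Empty;

        var sb = new StringBuilder();
        foreach (XmlNode node in nodes)
        {
            sb.AppendLine(node.InnerText);
        }
        return sb.ToString();
    }
}
```

Calling `OdtSketch.ReadContentXml(OdtSketch.BuildSample())` returns the heading text followed by the paragraph text, each on its own line.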
=> The example above demonstrates how to extract text content from an .odt document file. If you would like to extract content from other ODF files, such as spreadsheets (.ods) or presentations (.odp), you can find the necessary code in the repository mentioned below.
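As a rough illustration of how the same idea might carry over to spreadsheets, the sketch below reads cell text from the content.xml of an .ods file. It assumes that cells are `table:table-cell` elements in the ODF table namespace and that their displayed text sits in `text:p` children. `OdsSketch` and `ReadCells` are hypothetical names; this is not the repository's code.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; a sketch under the assumptions stated above.
public static class OdsSketch
{
    private static readonly XNamespace Table = "urn:oasis:names:tc:opendocument:xmlns:table:1.0";
    private static readonly XNamespace Text = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Returns one list of cell strings per table:table-row found in content.xml.
    public static List<List<string>> ReadCells(Stream odsStream)
    {
        var rows = new List<List<string>>();
        using var zip = new ZipArchive(odsStream, ZipArchiveMode.Read, leaveOpen: true);
        var entry = zip.GetEntry("content.xml");
        if (entry == null) return rows;

        XDocument doc;
        using (Stream s = entry.Open())
        {
            doc = XDocument.Load(s);
        }

        foreach (XElement row in doc.Descendants(Table + "table-row"))
        {
            // A cell may hold several paragraphs; join them with line breaks.
            var cells = row.Elements(Table + "table-cell")
                .Select(c => string.Join("\n", c.Elements(Text + "p").Select(p => p.Value)))
                .ToList();
            rows.Add(cells);
        }
        return rows;
    }
}
```

The sketch does not expand cells or rows that ODF compresses with the `table:number-columns-repeated` and `table:number-rows-repeated` attributes, so runs of identical or empty cells appear only once.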
Thank you for joining me on this journey of discovery and learning. If you found this blog post valuable and would like to connect further, I'd love to connect with you on LinkedIn. You can find me at LinkedIn
If you have thoughts, questions, or experiences related to this topic, please drop a comment below.