Faster XML Parsing with Elixir
Adz
Posted on October 11, 2022
The XML data format has been around since 1996. It was first envisioned as a lingua franca (bridging language) for data to be serialized and read into completely disparate systems (with different programming languages, operating systems, and even hardware). It has been wildly successful in that goal.
In software, though, 26 years is like a lifetime — and in hardware, it's an eternity. So much has changed in that short time that it's easy to forget just how different the world was when the standard was proposed.
That brings us to the topic of today's post. How can we parse XML quickly and efficiently? What does state-of-the-art XML parsing look like?
Let's get going!
XML Parsing with Elixir — An Introduction
As much as we may want to criticize XML (and there is plenty to criticize), we have to put it in its historical context. Anything that survives that long has clearly found some utility!
Thinking critically about XML as a format is great if you are in a position to change or improve it, or when deciding whether you should use it.
But for most, the choice is already made for us. The entire airline industry and much of finance run on XML — so if you work in those industries, you will likely need to parse it.
We will define parsing in this post as doing two things:
- Verifying a document is valid (i.e., that it is actually XML).
- Extracting values from that valid document.
A valid document is not much use if we can't access the values in our program's XML. We will also specifically talk about XML documents where we know the expected shape of the XML ahead of time.
We'll provide an overview of some possibilities when parsing XML, including actual numbers from real-world applications.
Three Approaches to XML Parsing with Elixir
There are three broad approaches we will talk about:
- Building a DOM
- Virtual Token Descriptor (VTD)
- Simple API for XML (SAX) parsing
Building a DOM
Building a DOM from XML is probably the most common way to think about XML parsing. It's what browsers do with HTML, and it preserves the tree structure that can make thinking about and validating an XML document much easier.
DOM stands for Document Object Model — but it really means a structured representation of XML in your given language. Elixir doesn't have Objects, but we can still have a DOM.
There are lots of mature tools in this area. Erlang even comes with a built-in data structure for representing an XML document called xmerl.
Xmerl represents XML so that XPath queries are possible, and an XPath query engine is also built into Erlang. What is XPath? It is to XML what SQL is to a database — a query language for accessing the data inside it.
For example, here is an XPath query to get the text from the XML snippet below: /Author/text()
<Author>James Eagan</Author>
You can use built-in functions to access the text in the tags:
xml = "<Author>James Eagan</Author>"
# First build the xmerl:
{xmerl, _} = xml |> :erlang.binary_to_list |> :xmerl_scan.string()
# Now we can apply the xpath.
:xmerl_xpath.string('/Author/text()', xmerl)
This tooling maturity can be a fast way to get up and running. There is a chance your team knows of XPath from other languages, which makes extracting data a breeze, and using the built-in xmerl format means 0 extra dependencies.
But there is a problem. Resources! With large XML files creating an xmerl, DOM can really balloon memory requirements. It can become prohibitively expensive in terms of latency and memory. Here is a Benchee output excerpt of a ~9mb XML file that uses xmerl and XPath:
Benchmarking xmerl with querying ...
Name ips average deviation median 99th %
...
xmerl with querying 0.0809 12.36 s ±1.98% 12.42 s 12.59 s
Comparison:
...
xmerl with querying 0.0809 - 12.68x slower +11.39 s
Memory usage statistics:
Name Memory usage
...
xmerl with querying 2340.31 MB - 10.22x memory usage +2111.41 MB
That's a lot of megabytes! And as the output indicates, we can do much better.
Virtual Token Descriptor (VTD) in Elixir
VTD stands for Virtual Token Descriptor. A VTD is a completely different way to think about parsing XML. Instead of building a DOM, the idea is to tokenize the XML document and store the locations of key bits of information inside it.
In Elixir, this could mean storing the start index and length of all the nodes and attrs in the XML. You then use binary_part
to access any given location (in constant time!) and extract values when needed.
We leave the original XML binary untouched in a VTD approach (unlike a DOM, where we manipulate the original XML binary). Instead, in a VTD we build up all the information we need to query the original binary.
VTD to XML Parsing: An Example
Let's imagine a simplified example. This is not how VTD works exactly, but it will give you a sense of what it does without getting bogged down in edge cases.
Given this XML:
<Author age="22">Seth Milchick</Author>
Let's imagine the information we want to be able to retrieve is:
- Node name —
"Author"
- age attribute —
"22"
-
<Author>
's text content
To do that, we could build a table that looks like this:
xml = "<Author age=\"22\">Seth Milchick</Author>"
table = [
{:node, 1, 5},
{:attribute, 8, 3},
{:attribute_value, 13, 2},
{:text, 17, 13},
]
The tuple denotes what we are pointing at, e.g., :node
, then provides the start index for that element. The last number in the tuple is how long the element is.
If we look to get the XPath /Author/text()
, we can access that table and see if the root node matches "Author"
, something like:
node_in_table? = Enum.any?(table, fn
{:node, start, length} -> binary_part(xml, start, length) == "Author"
_ -> false
end)
if node_in_table? do
case Enum.find(table, &match?({:attribute_value, _, _}, &1)) do
nil -> :not_text
{:attribute_value, start, length} -> binary_part(xml, start, length)
end
else
:not_found
end
As you can see, how we represent that table is very important. This is the secret sauce of VTD. If we were inefficient, we could end up with a table the same size (or larger!) than a DOM. Or we could take a long time to find the element we care about in the table, eliminating any potential performance gains from binary_part
.
An actual VTD table stores references to child and sibling nodes in a way that makes randomly accessing a given element as good as in a DOM. However, because we don't chop up the original binary into its own objects, the memory requirements can be vastly lower. The creators claim that the table's size can only be 1.5 times the size of the XML document. Lower memory requirements also mean it can be fast.
In fact, the parsing method was designed for on-chip implementations. This 'XML on a Chip?' document describes how you can use custom hardware to blaze through XML!
What's more, existing VTD implementations all support the full XPath feature set. That means it is (in theory) possible to have custom hardware churn through XML parsing without incurring a massive cost in developer complexity. I say 'in theory' because you first have to create the custom hardware, and then execute it with a suitably low-level language.
In fact, there are currently no Elixir libraries for VTD. I know of two libraries being worked on (one in Erlang and one in Elixir) as part-time open source projects, as of yet unfinished and unreleased. Perhaps you can take inspiration from other languages and have a go at porting it.
As cool and interesting as this approach sounds, though, there are more trade-offs.
Trade-offs of VTD
While XPath is probably familiar to some, it...isn't great. When a node in a path doesn't exist, you just get nil
or an empty string back. This is a pain because you are never sure if you mistyped the path or if the value actually isn't there. Verifying that can be manual and annoying. XPath functions also don't tell you which node doesn't exist, making debugging harder. Having XPath as the primary way you extract data from an XML document can prove quite painful.
Additionally, VTD requires an entire document to exist in memory first, unlike sax parsing (as we will see). This could make parsing a stream of XML with VTD difficult.
Finally, there is a minimum amount of data that we need to extract from an XML document. We need to turn that data into useful things (like an Elixir Date
struct) and those Elixir data structures are of a certain size. If they amount to more in memory than the parsing takes, we don't gain anything from further reducing the memory footprint of the parsing. In some situations, the impressive memory savings don't translate to an actual saving in our programs (because as soon as you transform the parsed data into structured data, the memory jumps up again anyway).
So, let's look at our final approach — sax parsing.
SAX Parsing in Elixir
A SAX (Simple API for XML) parser works by iterating through a document and emitting events when certain things happen. For example, the Saxy parser emits the following events.
:start_document
:start_element
:characters
:cdata
:end_element
:end_document
When Saxy sees the start of a tag, for example, it gets its name and attributes and calls a callback you can implement. You can do whatever you like with that information.
The specific events available vary by implementation. Erlang has a built-in SAX xmerl parser that offers extra events like :comment
and :ignorableWhitespace
.
The emission of events is usually very fast. In Elixir, we can use binary pattern matching to great effect; Saxy will iterate through a document and use pattern matching to extract node names and attributes. But this only achieves half of what you actually need. You then have to identify when a value is one you care about, which is not as simple as it sounds.
SAX Parsing: An Example
Let's imagine you want to access the text inside the <Name>
tag below:
<Author>
<Name>Nick Riviera</Name>
</Author>
For this simple case, you first need to wait for the :start_element
event where the node_name
is Name
:
def handle_event(:start_element, {"Name", attrs}, state) do
# In here we know we have just opened the Name tag.
end
# This case is for any other tag that we open.
def handle_event(:start_element, {_, _attrs}, state) do
{:ok, state}
end
But the :characters
event is separate, meaning we have to put something into the state to let us know which tag's characters we're seeing:
def handle_event(:start_element, {"Name", attrs}, state) do
# In here we know we have just opened the Name tag, so let's just put that in state.
{:ok, [{"Name", attrs} | state]}
end
def handle_event(:start_element, {_, _attrs}, state) do
{:ok, state}
end
Now, when we get the :characters
event, we can check the state and see if we are "inside" the tag we care about:
# If "state" is the name tag we know it's the text we care about.
def handle_event(:characters, text, [{"Name", _} | rest]) do
# Now we can do something with the text since we know it will be the text of <Name>.
# ...
end
def handle_event(:characters, _, state) do
{:ok, state}
end
That works great, but now consider this XML:
<Blog>
<Name>XML Parsing</Name>
<Author>
<Name>Nick Riviera</Name>
</Author>
</Blog>
Our approach will break here because we need to know which <Name>
node we have encountered.
The solution is to use a stack. Each time we open a tag, we add that element to the stack, and when we close an element, we can discard it.
Trade-offs of a Stack
Changing our mindset to think in stacks is not simple, especially when we normally think in recursion.
It also makes validating the XML more complicated. If you don't get the value that you expect, it can be difficult to figure out what went wrong and when. You can print the stack, but figuring out how it got to be like that basically requires implementing your own stack tracing algorithm. So debugging is not trivial.
Implementing the callbacks for each XML document you want to parse can be an awful lot of work. It is difficult to re-use shared code across handlers because each time you accumulate different state. You might want to create an %Author{}
struct for this:
<Author>
<Name>Nick Riviera</Name>
</Author>
But for this document:
<Flight departureTime="1pm" />
You might want a %Flight{}
struct.
The only generic functionality is to manage the stack as you progress through a document. However, the specific elements and what you want to pull out of a document and put into the state you accumulate will be different. You really have only two options:
- Write a unique handler per document that accumulates all the data you want to pull from the XML document.
- Have a generic handler used by every document, which puts everything into a generic XML data structure.
Number 1 allows you to collect only the data you wish to pull from the XML and ignore the rest but it requires writing a custom handler for each XML document.
And number 2 — well, number 2 is a DOM 😅.
By default, Saxy actually does number 2 — it creates a data structure called SimpleForm, a tuple DOM that looks like this:
{node_name, attributes, children}
Where children
is a list of more tuple nodes or text — for example:
<Blog>
<Name>XML Parsing</Name>
<Author>
<Name>Nick Riviera</Name>
</Author>
</Blog>
Becomes:
{"Blog", [], [
{"Name", [], ["XML Parsing"]},
{"Author", [], [
{"Name", [], ["Nick Riviera"]}
]}
]}
Great...but haven't we just run slap-bang into all of the DOM problems we mentioned above? Well, almost. We are in full control of the created DOM, which means we can be clever.
The xmerl DOM we mentioned before is designed to enable XPath, but if we can live without that (and I maintain that we can, nay, we can improve on it), we can significantly reduce the size of the DOM we create.
We need to be willing to do two things:
- Replace XPath with something.
- Design a more minimal DOM.
I have seen amazing improvements over xmerl doing just this; an order of magnitude less memory and a latency improvement of the same.
For 2, we can go with SimpleForm for now, but let's look at what we can do about 1.
Querying the DOM with DataSchema
While a custom query engine may sound scary, it's an opportunity to improve.
Opting to navigate a DOM in Elixir means we don't have a black box XPath implementation that just returns nil when it's not happy (leaving you to figure out if the data wasn't there or if you just typed the path wrong).
Usually, everything we ever need from the XML boils down to one of three things:
- A node
- A node's text
- A node's attr
We either want one, or many, of each of these things. We end up with an API that's something like:
get_text
get_attr
get_all_attrs
get_all_text
get_node
get_all_nodes
How DataSchema Works
To help improve the developer experience even further, I wrote DataSchema. DataSchema is like Ecto's embedded schemas but generalized to any input type, including XML.
It allows us to write declarative schemas that describe structs we can create from input data. For example, this is a schema for the XML snippet above:
<Blog>
<Name>XML Parsing</Name>
<Author>
<Name>Nick Riviera</Name>
</Author>
</Blog>
defmodule Blog do
import DataSchema, only: [data_schema: 1]
@data_accessor SimpleFormAccessor
data_schema(
field: {:title, ["Blog", "Name", "text()"], StringType},
field: {:author_name, ["Blog", "Author", "Name", "text()"], StringType},
)
end
There are some really good livebooks
in the repo to learn the library properly, but let's break down the basics of the schema.
Each field in a schema defines a key in a struct. You can see the key below:
field: {:title, ["Blog","Name","text()"], StringType},
# ^^^^^^^
# The struct key
The next element in that tuple is a path to a piece of data in the XML:
field: {:title, ["Blog","Name","text()"], StringType},
# ^^^^^^^
# Path to a value
We are in full control of what that path can look like, but for now, I've opted to make it a list of XML nodes that mirror what an XPath query would look like split on "/"
.
The last element in that tuple is a casting function. When we call:
DataSchema.to_struct(xml, Blog)
DataSchema will get the value at the end of the path (for each field) and pass it to a casting function. This gives us a chance to alter it in some way.
field: {:title, ["Blog","Name","text()"], StringType},
# ^^^^^^^
# casting function
The casting function can be a module that implements a cast/1
function, an anonymous fn, or a Mod Fun Args tuple.
Finally, DataSchema needs to apply the path to the input data
it transforms.
We provide a @data_accessor
in the schema:
...
@data_accessor SimpleFormAccessor
# ^^^^^^^^^
# Accessor module
...
This module will implement a callback that gets passed the path and the input data (in our case, XML). The return value of that function is given to the casting function.
This accessor is where we can implement the traversal of the SimpleForm data structure. This is relatively easy. Here is an example snippet getting the text of an element:
def get_text(["text()"], {_, _, children}) do
Enum.filter(children, &is_binary/1)
end
def get_text([node_name | rest], {node_name, _, children}) do
get_text(rest, children)
end
def get_text([node_name | _], _) do
{:error, "node not found #{node_name}"}
end
This small snippet demonstrates the improvement we can get over XPath because we can now log out the exact node that we could not find. This is a huge developer experience productivity boost.
All this comes together so that when we call DataSchema.to_struct(xml, Blog)
, it returns:
{:ok, %Blog{title: "XML Parsing", author_name: "Nick Riviera" }}
This is the flow of data:
XML binary =>
Saxy (to create a SimpleForm DOM) =>
Query that DOM with DataSchema =>
Struct
XML Parsing with Elixir: Taking it One Step Further
In the wild, I've found that the Saxy/DataSchema approach improves over the xmerl DOM by order of magnitude. But we can go one step further.
Our schemas are declarative specifications of all the data we want from the XML. We can feed this information into a saxy handler and only include elements in the DOM that our schema needs.
This nifty trick works out especially well for schemas where you only need a small amount of the data in the XML document. If the data you end up with is larger in memory than the DOM you build, you don't need to worry about making the DOM smaller. But if it isn't, then building the DOM can cause waste and garbage, as you end up ignoring most of it.
We need to transform our schema into a tree with the schema's paths — e.g., for this Blog
schema:
defmodule Author do
import DataSchema, only: [data_schema: 1]
@data_accessor SimpleFormAccessor
data_schema(
field: {:name, ["Author", "text()"], StringType},
)
end
defmodule Blog do
import DataSchema, only: [data_schema: 1]
@data_accessor SimpleFormAccessor
data_schema(
field: {:title, ["Blog", "Name", "text()"], StringType},
has_one: {:author, ["Blog", "Author"], StringType},
)
end
We would create a tree of paths that looks something like:
%{
"Blog" => %{
"text()" => true,
"Author" => %{
"text()" => true
}
}
}
Handing that tree to the Saxy handler allows us to skip entire sub-trees of the XML.
To demonstrate this, let's parse the XML with the above schema.
<Blog>
<Comments>
<Comment>Superb!</Comment>
<Comment>Amazing!</Comment>
</Comments>
<Name>XML Parsing</Name>
<Author>
<Name>Nick Riviera</Name>
</Author>
</Blog>
First, recall that the key to parsing in this manner is to create a stack of elements. As you can see, when we build a SimpleForm DOM, we maintain a stack of nodes. When we get a :start_element
event, we put that element onto the top of the stack.
When we receive an :end_element
event, we put the element that is on the top of the stack into the children of the element below it (its parent).
But this time our handler has a schema and therefore a record of the values we actually care about. We can now query it every time we start a new element and ask "is this element in the schema?". If it is, we do as explained above and add it to the stack. If it isn't, we add a special node to the stack:
{:skip, 1, tag_name}
When we close a tag that we are skipping (i.e. when the top of the stack is that tuple) then we just pop that {:skip, 1, tag_name}
element off of the stack and discard it. But the real win is if we open another tag before that one closes — i.e. when we meet the child of a skipped element.
In the example above, it would go like this:
- We open
<Comments>
and see it is not in the schema, so we add:skip
. - We get an event for the start of the
<Comment>
tag. We see that the top of the stack is a:skip
element, so we immediately return:
def handle_event(:start_element, _element, [{:skip, _, _} | _] = stack) do
{:ok, state}
end
This continues until we close <Comments>
and the :skip
element is removed. Then we continue as normal.
Okay, but what is that 1
doing in the middle? This is effectively a reference count.
Consider this XML:
<Blog>
<Comments>
<Comments>
</Comments>
</Comments>
</Blog>
While nasty, it's perfectly valid XML that risks the following:
- We add a
:skip
tag for the opening of the first<Comments>
. - The opening of the second
<Comments>
is skipped. - On the closing of the second
<Comments>
, the name of the "skipped" tag matches this one, so we remove the skipped element from the stack, even though we haven't actually finished skipping yet.
One solution to this is to add a :skip
tag for every element you want to skip. But then you don't save any memory, as you still end up with a stack of every element — you just remove it more quickly. The memory will seem lower but will spike high during parsing.
Instead, we'll keep a count of elements with the same name as the skipped element once we start skipping. Then, each time we close one of those elements, we decrement the count. If it's ever 0, we are done skipping and can remove the tag.
This approach speeds up the latency of parsing a little, but importantly (along with some other tricks) massively trims down the size of the SimpleForm DOM that gets created.
See a version of the full Saxy handler.
There is even more we could do in this vein. For example, this approach stores all nodes along the path to the value we want. But if we just want the value at the end of the path, could we keep that value and none of the intermediary nodes? This comes with its own problems. We'll leave discovering these problems as an exercise for you 🙂.
Wrap Up
This concludes our tour of some interesting ways to parse XML faster, including building a DOM, using a VTD, and SAX parsing. We also went one step further, feeding data into a
Saxy handler (including elements in the DOM needed by our schema).
My hope is to one day open source a complete implementation of the slimmed down Sax parser, but for now, I hope this post has stimulated some of your own ideas.
Happy coding!
P.S. If you'd like to read Elixir Alchemy posts as soon as they get off the press, subscribe to our Elixir Alchemy newsletter and never miss a single post!
Posted on October 11, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.