Binary pattern matching <<1::1, 0::1, 1::1>>
Danie Palm
Posted on April 23, 2024
I recently had to write a parser for a proprietary binary file format that stores photoplethysmography (PPG) and accelerometry data and various other health-related metrics derived from a wearable device. The format is similar in spirit to protocol buffers, but exploits domain knowledge to achieve an even more compact representation.
Elixir is not necessarily the most performant language for this kind of raw computation, being generally better suited to IO-bound tasks, but it does sport impressive binary handling capabilities that it inherits from Erlang.
In this post I want to share some of the basics and some of the more advanced techniques I used to parse this binary format. The format itself is not relevant, so we will use contrived examples.
Basics
The =
operator looks and behaves much like an assignment operator in Elixir, but is actually a match operator that allows you to match and destructure values. Also, strings are just binaries in Elixir, so we will often use strings in illustrations. A binary is simply a sequence of bytes. More specifically, they are bitstrings (sequences of bits) that are multiples of 8 bits long. Binaries and bitstrings are represented in Elixir as comma-separated values enclosed in double-angle brackets (<<
and >>
).
So let's have a look at the simplest string pattern matches:
# SUCCESS: match a string to another string of the same value
"string" = "string"
# FAILURE: match a string to another string of a different value
"string" = "not the same string"
# SUCCESS: match a string to a variable
# assigns "string" to `string`
string = "string"
# SUCCESS: match a string to a concatenation that shares the same prefix
# assigns "ing" to `ing`
"str" <> ing = "string"
# FAILURE: putting the variable part first (logically fine, but not allowed)
str <> "ing" = "string"
# FAILURE: not sharing the same prefix
"not the same str" <> ing = "string"
But strings are just binaries, so much the same can be achieved using the binary syntax:
# SUCCESS: an empty string matches an empty binary
<<>> = ""
# SUCCESS: a string matches a binary of the same value
<<"string">> = "string"
# SUCCESS: a string matches a variable of type binary
# assigns `"string"` to `string`
<<string::binary>> = "string"
# SUCCESS: a string matches a concatenation that shares the same prefix
# assigns `"ing"` to `ing`
<<"str", ing::binary>> = "string"
# FAILURE: putting the variable part first (logically fine, but not allowed)
<<str::binary, "ing">> = "string"
# SUCCESS: the parts make up the whole
<<"st", "ri", "ng">> = "string"
# SUCCESS: and you can even do nesting
# assigns `"ng"` to `ng`
<<"st", <<"ri", ng::binary>> >> = "string"
Enter the bytes
As mentioned earlier, a binary is a sequence of bits with a total length that is a multiple of 8. In other words, a binary is a sequence of bytes. While the underlying representation is a sequence of bits, it can be constructed from and destructured into almost any Elixir data type. To understand how, for instance, "st"
is a sequence of bytes, let's add another (0
) byte to it in iex:
iex> <<"st", 0>>
<<115, 116, 0>>
So Elixir takes the bytes that encode the "s"
, the "t"
and the 0
and packs them in sequence in the binary. You are not limited to integers and strings and can construct binaries from floats, individual bits, and many more:
iex> <<"s", 116, 1.2, 0x00>>
<<115, 116, 63, 243, 51, 51, 51, 51, 51, 51, 0>>
One byte for the string "s"
, one byte for the integer 116, 8 bytes for the float 1.2
, and a final null byte given in hex for fun. You can even construct the binary, in part or in full, from individual bits or from groups of arbitrary bite size:
# construct a binary from bits, note that 1::5 is effectively 00001
iex> <<1::1, 0::1, 1::1, 1::5>>
<<161>>
# to verify, let's construct the same binary from an integer in base 2
iex(7)> <<0b10100001>>
<<161>>
When possible, Elixir will display a binary as a string (strings are like syntactic sugar for binaries that meet the right criteria), but if the sequence of bytes does not make up a valid utf8-encoded string, it will display 8-bit unsigned integers instead.
Constructing and destructuring binaries
Several modifiers can be provided to signify to Elixir exactly how you intend a sequence of values to be converted to an underlying sequence of bits, or how to destructure bits into a sequence of values.
By default, integers without any modifiers (also called options) are treated as unsigned, big-endian integers of size 8 and unit 1 (i.e. 8 bits long). As such these are all equivalent:
<<101::unsigned-big-integer-size(8)-unit(1)>>
<<101::unsigned-big-integer-size(8)>>
<<101::unsigned-big-integer-8>>
<<101::unsigned-integer-8>>
<<101::integer-8>>
<<101::8>>
<<101>>
Similar options exist for floats, bits, and binaries where applicable.
We can now use these options, which can be provided in any order as long as they are separated with -
and follow a ::
, to match any particular parts of an incoming binary:
iex> <<elves::bytes-3, men::bytes-9, dwarves::bytes-7, dark_lord::bytes-1>> = "elfmortalmendwarves1"
iex> elves
"elf"
iex> men
"mortalmen"
iex> dwarves
"dwarves"
iex> dark_lord
"1"
Note that bytes
is an alias for binary
and by default the unit for binaries is 8 and the size is unspecified (if not provided, will eagerly match as much as possible). The following are therefore equivalent:
<<"st"::binary-size(2)-unit(8)>>
<<"st"::bytes-size(2)-unit(8)>>
<<"st"::bytes-size(2)>>
<<"st"::bytes-2>>
If no size is specified along with the binary modifier, it must be the final segment to be matched.
In addition to matching or destructuring, we can also construct a binary using the above options:
binary = <<
1::1, # the integer 1 encoded as one bit,
-100::unsigned-big-integer, # -100 encoded as unsigned int,
"followed by some text",
<<0,1,1>>, # and even nested raw binaries,
1::7 # 1 encoded as 7 bits
>>
The above assigns the following value to binary
:
<<
206, 51, 55, 182, 54, 55, 187, 178, 178, 16, 49, 60, 144,
57, 183, 182, 178, 144, 58, 50, 185, 186, 0, 0, 128, 129
>>
We could also have mixed in floats. They must consist of 16, 32, or 64 bits and are always signed.
Advanced pattern matching
Now let's get to some actual parsing. Suppose you have a binary "message" that encodes data in the following format (intentionally looking somewhat like protobuff):
message SomeData {
uint8 height_in_mm;
float16 weight_in_kg;
}
Using binary pattern matching, you can extract the values like this:
<<
height_in_mm::unsigned-integer,
weight_in_kg::float-16
>> = binary
That's it, you've extracted the data from the message in a single match operation. But nothing fancy so far. A slightly more interesting example is one where a binary stores array data. The message spec could look like this:
message SomeData {
uint8 non_array;
uint8[4] data_array;
}
And it could be parsed like this:
<<
non_array::unsigned-integer,
data_array_binary::binary-size(4)
>> = binary
data_array =
for <<value::unsigned-integer <- data_array_binary>> do
value
end
A hardcoded array length like that is hardly useful. What if we could encode the array length in the message itself?
message SomeData {
uint8 length;
uint8[length] data_array;
}
A binary encoding the information in the above message would start with a byte that indicates the length of the array, followed by length
bytes and could be parsed like this:
<<
length::unsigned-integer,
data_array_binary::binary-size(length)
>> = binary
data_array =
for <<value::unsigned-integer <- data_array_binary>> do
value
end
This would also work when the array length is 0. The binary could then consist of only a single null-byte. The length would be extracted as 0, and data_array_binary::binary-size(0)
will then effectively match on the remaining empty binary and be happy.
This opens the door for conditional pattern matching. Suppose you have a message in which the first byte provides all kinds of information about the rest of the message:
message ConditionalMatching {
1bit contains_height;
1bit contains_weight;
6bit six_more_unused_bits;
optional(uint8, contains_height) height;
optional(uint8, contains_weight) weight;
}
Then in principle you would be able to parse height and weight conditionally:
<<
contains_height::1,
contains_weight::1,
_::6,
height::unsigned-integer-size(contains_height*8),
weight::unsigned-integer-size(contains_weight*8)
>> = binary
You will have noticed that you can do arithmetic within the size
option. You are however limited to only the operators that are also allowed in guards: comparison, basic arithmentic, bitwise operators, in
/not in
etc. Note however that &&
and ||
are not allowed, and neither are function calls.
A similar interesting scenario is when a message contains dynamically typed data. For instance:
message SomeData {
dynamic_value value;
}
Here the binary could contain one or more bytes that can be used to first identify the type of the data that is to follow, followed by the actual data. One could parse it as follows:
<<
dynamic_type_id::unsigned-integer,
rest::binary
>> = binary
type = Types.lookup(dynamic_type_id)
# type-specific parsing of `rest`
However, this strategy prevents you from parsing the entire message in one go (supposing you want to support the case of messages that contain multiple dynamically typed values), because you have no information about the length of the rest of the binary. A better way is to encode information about the type inside the type id. As a simple example, one could take the first 4 bits of the first byte to indicate size, and the last 4 to encode information about the type (integer vs float, signed vs unsigned). Obviously, there are countless ways of efficiently encoding information in this way, but lets assume the first 4 bits encode the bit size:
<<
first_nibble_type_id::unsigned-integer-4,
last_nibble_type_id::unsigned-integer-4,
data_binary::binary-size(first_nibble_type_id)-unit(1),
potentially_more_fields_here::binary
>> = binary
data = parse_according_to_type(last_nibble_type_id, data_binary)
Assuming that some further info was encoded in the last nibble. We might not have been able to fully parse the data all the way to its final type, but we were at least able to extract it from the parent binary without first stopping and doing a lookup. Many more such tricks exist, and remember that you can always try multiple patterns to deal with special cases:
case binary do
<<first_pattern::unsigned-integer-8> -> ...
<<second_pattern::float-16>> -> ...
end
Bonus
In one scenario I had to parse a message that contains multiple values with dynamic types. Unfortunately the type id was specced in such a way that you could determine the bit size from it only in certain cases.
I basically needed logic like:
bit_size =
if last_nibble == 10 do
if first_nibble == 0 do
4
else
first_nibble
end
else
first_nibble
end
Unfortunately, if
statements and the like are not allowed in the size
option.
I could have matched on two patterns (one for the exception and one for the rule), but the possibilities double on every additional dynamically typed value and even though I generate the parsing code from the message specs, it didn't feel like the best way to go. Alternatively I could have parsed fields one by one, doing lookups on every step as needed, but this would have sacrificed leveraging binary parsing optimizations that are built into the BEAM.
I ended up with this kind of logic to determine the bit_size (taking inspiration from this SO) post):
import Bitwise
bit_size = first_nibble +
(1 + ((last_nibble - 10 ||| 10 - last_nibble) >>> 32)) *
(1 +
((first_nibble - 0 ||| 0 - first_nibble) >>>
32)) * 4
This accomplishes the same logic achieved earlier with if
blocks. Of course, the RHS goes straight into the size
option and is not first assigned to any value like I did here.
Another bonus
This one isn't strictly about pattern matching, but it is surprizing enough to warrant honourary mention. You may be aware that it is efficient to construct lists with the "cons" operator (|
). That is, prepending an element to a list doesn't involve copying the old list into the new list, the new list simply points to the old one:
old_list = [1, 2, 3]
new_list = [0 | old_list]
Well, it turns out that there is a similar way to efficiently extend binaries under certain conditions. The Erlang runtime system optimizes appending elements to an existing binary, avoiding copying when possible. Suppose you want to append some bytes to a binary:
for byte <- [0, 1, 2, 3], reduce: <<>> do
acc ->
<<acc::binary, byte>>
end
In the above code, the binary is updated, not copied!!??!! It sounds crazy, you can read all about it here.
There is a built-in function for this specific case (turning a list into a binary), but it's good to know about the underlying principle.
Conclusion
Elixir's binary pattern matching and construction is insanely powerful. It differs from many other languages in that is has superb support for matching even individual bits. Most other languages require you to first extract the byte, and then use shifts and masks to get to the bit of interest.
Elixir is not the most performant, but if the binary data you are parsing is a sequence of independent messages, you can likely find ways to parse it concurrently.
Posted on April 23, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.