Efficient file hashing in Elixir

diogoko

Diogo Kollross

Posted on March 16, 2021

The code from this article is available on GitHub.

Elixir itself does not ship functions for hashing files or data, but, as usual, you can use Erlang modules for that. The :crypto module offers several cryptographic services, including many hashing algorithms.
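As a quick check of what is available on your machine, the sketch below asks :crypto for its list of supported hash algorithms. It assumes OTP 22 or newer, where :crypto.supports/1 exists; the exact list depends on your OTP release and the OpenSSL build it is linked against.

```elixir
# Ask the Erlang :crypto module which hash algorithms this runtime supports.
# Assumption: OTP 22+ (older releases only have :crypto.supports/0).
supported_hashes = :crypto.supports(:hashs)

IO.inspect(supported_hashes, label: "supported hashes")
IO.puts("sha256 available? #{:sha256 in supported_hashes}")
```

Any atom in that list can be passed as the algorithm in the examples that follow.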

Naive approach

The easiest option is the :crypto.hash function. It takes an atom naming one of the supported algorithms and the data to be hashed. To hash a file, you first read its contents with File.read! or similar and then pass them to hash.

data = File.read!("sample.pdf")
sha256 = :crypto.hash(:sha256, data)
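To see what hash returns without needing a sample file, here is a small sketch that hashes the string "abc", a standard SHA-256 test vector. The variable names are only for illustration.

```elixir
# "abc" is a standard SHA-256 test vector, so the expected digest is well known.
digest = :crypto.hash(:sha256, "abc")

# The result is a raw 32-byte binary, not a printable string.
IO.inspect(byte_size(digest), label: "digest size in bytes")

# Encode it as lowercase hexadecimal to compare it with the published value.
hex = Base.encode16(digest, case: :lower)
IO.puts(hex)
# ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```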

Streaming approach

The problem with the hash function is that it needs the whole file loaded into memory. With large files this can quickly degrade the performance of the application or even crash it.

An alternative is the "streaming mode" of the hashing functions. Instead of feeding all the data to the hashing function at once, you read it in pieces and feed each piece to the algorithm in sequence. Each piece updates the algorithm's internal state, and once all the data has been processed you ask the algorithm for its final result. This is how these hashing algorithms actually work, and this mode is available in other programming languages too.

Initialize hashing algorithm context
While there is more data:
    Feed the algorithm a piece of data
Get the final result from the hashing algorithm

In Elixir, this can be implemented using the File.stream! function and Enum.reduce.

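# Start with an empty hash state for the chosen algorithm.
initial_hash_state = :crypto.hash_init(:sha256)

sha256 =
  # Read the file lazily in 2048-byte chunks.
  # (Elixir 1.16+ prefers File.stream!("sample.pdf", 2048) and warns about this argument order.)
  File.stream!("sample.pdf", [], 2048)
  # Enum.reduce passes (chunk, state), so the arguments are swapped for hash_update.
  |> Enum.reduce(initial_hash_state, &:crypto.hash_update(&2, &1))
  # Turn the final state into the digest binary.
  |> :crypto.hash_final()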

The function hash_init creates a "hash state" object that is updated by the hashing algorithm as new data is processed. At this point, its state is equivalent to having hashed an empty file.
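The claim about the initial state is easy to verify. This small sketch finalizes a fresh state without feeding it any data and compares the result with hashing an empty binary in one call.

```elixir
# Finalize a freshly initialized state without feeding it any data.
empty_state_digest =
  :sha256
  |> :crypto.hash_init()
  |> :crypto.hash_final()

# Hash an empty binary in a single call.
empty_input_digest = :crypto.hash(:sha256, "")

# Both routes produce the same digest.
IO.puts("same digest? #{empty_state_digest == empty_input_digest}")
```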

File.stream! produces an enumerable in which each item is a binary of up to 2048 bytes (in this example); only the last item may be shorter. This parameter can be tuned according to memory usage and performance requirements: larger chunks mean fewer reads and are usually faster, but use more memory.
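The chunk size only affects speed and memory, never the result. The sketch below illustrates this with a throwaway file of random bytes. ChunkSizeDemo, sha256_file and the temporary file name are hypothetical names made up for this example, and the chunk sizes are arbitrary.

```elixir
defmodule ChunkSizeDemo do
  # Hypothetical helper for this sketch; not part of the original article.
  # It uses the article's File.stream!(path, modes, bytes) argument order,
  # which Elixir 1.16+ accepts with a deprecation warning.
  def sha256_file(path, chunk_size) do
    path
    |> File.stream!([], chunk_size)
    |> Enum.reduce(:crypto.hash_init(:sha256), &:crypto.hash_update(&2, &1))
    |> :crypto.hash_final()
  end
end

# Write 10,000 random bytes to a throwaway file in the system temp directory.
demo_path =
  Path.join(System.tmp_dir!(), "chunk_demo_#{System.unique_integer([:positive])}.bin")

demo_data = :crypto.strong_rand_bytes(10_000)
File.write!(demo_path, demo_data)

# One-shot digest of the same bytes, used as the reference value.
reference_digest = :crypto.hash(:sha256, demo_data)

# Hash the file with several chunk sizes and compare each result to the reference.
chunk_results =
  for chunk_size <- [64, 2048, 65_536] do
    {chunk_size, ChunkSizeDemo.sha256_file(demo_path, chunk_size) == reference_digest}
  end

IO.inspect(chunk_results)
File.rm!(demo_path)
```

To pick a value for a real workload, you would time the same helper on a representative file rather than rely on a fixed rule.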

The enumerable returned by File.stream! is lazy: no data is read until something enumerates it. When a pipeline is made only of Stream functions, you need to run it explicitly by calling Stream.run. Alternatively, functions from the Enum module, such as Enum.reduce here, will trigger the execution of the file stream.
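A minimal sketch of that laziness, using a small temporary file (the file name and the 4-byte chunk size are arbitrary choices for the example):

```elixir
lazy_path =
  Path.join(System.tmp_dir!(), "lazy_demo_#{System.unique_integer([:positive])}.txt")

File.write!(lazy_path, "hello world")

# Building the pipeline reads nothing yet: both values only describe the work.
# (Article's argument order; Elixir 1.16+ prints a deprecation warning for it.)
file_stream = File.stream!(lazy_path, [], 4)
chunk_sizes = Stream.map(file_stream, &byte_size/1)

# Enum functions force the stream, so the file is read here, 4 bytes at a time.
sizes = Enum.to_list(chunk_sizes)
IO.inspect(sizes)
# [4, 4, 3]

File.rm!(lazy_path)
```

The 11-byte file yields two full chunks and one shorter final chunk.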

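Inside Enum.reduce we call hash_update, passing the current hash state and the data to be processed. Because Enum.reduce hands the item first and the accumulator second, the capture swaps them (&2, &1) to match the argument order of hash_update. The call returns the new hash state, which is either updated again with the next item from the file stream or returned as the final result of Enum.reduce.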

Having the hash state after processing the last data piece, we call hash_final to get the calculated digest as a binary.
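To make the sequence of calls concrete without a file, this sketch feeds two pieces by hand and checks that the result matches hashing the concatenated data in one call. The strings are arbitrary sample data.

```elixir
# Feed the data in two pieces, threading the state through each call.
state0 = :crypto.hash_init(:sha256)
state1 = :crypto.hash_update(state0, "hello ")
state2 = :crypto.hash_update(state1, "world")
incremental_digest = :crypto.hash_final(state2)

# Hash the same bytes in a single call.
one_shot_digest = :crypto.hash(:sha256, "hello world")

IO.puts("same digest? #{incremental_digest == one_shot_digest}")
```

Enum.reduce in the file example does the same threading of the state, one chunk per step.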

Bonus: formatting the hash

The result of hashing algorithms is a sequence of bytes (16 for MD5, 20 for SHA1, 32 for SHA256, etc.), but usually we present them in hexadecimal format. To do that, use the Base.encode16 function.

formatted_sha256 = Base.encode16(sha256, case: :lower)
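The digest sizes mentioned above can be checked directly. This sketch hashes an empty binary with each algorithm and reports the raw size and the length of the hexadecimal form, which is always twice as long. It assumes MD5 and SHA-1 are enabled in your :crypto build (they may be missing in FIPS mode); :sha is the atom :crypto uses for SHA-1.

```elixir
# For each algorithm, record the raw digest size and the hex string length.
digest_sizes =
  for algorithm <- [:md5, :sha, :sha256] do
    digest = :crypto.hash(algorithm, "")
    hex = Base.encode16(digest, case: :lower)
    {algorithm, byte_size(digest), String.length(hex)}
  end

for {algorithm, bytes, hex_chars} <- digest_sizes do
  IO.puts("#{algorithm}: #{bytes} bytes, #{hex_chars} hex characters")
end
```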