Hashing a File in Elixir

A hash function converts a variable-size sequence of bytes (a string, the content of a file, etc.) into a fixed-size sequence of bytes, called a digest. Hashing the same file always returns the same digest, whatever the file’s length, and with a good hash function it is practically infeasible to find two different files with the same digest. It’s a sort of digital fingerprint, usually represented by a hexadecimal string between 32 and 128 characters long.
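To make the "fixed size" part concrete, here is a small sketch that hashes inputs of very different lengths and collects the length of the hexadecimal digest per algorithm. The three algorithms are only my picks for illustration, and it assumes your Erlang/OpenSSL build has all of them enabled.

```elixir
# Inputs of very different sizes: empty, short, and one million bytes.
inputs = ["", "I love Elixir!", String.duplicate("a", 1_000_000)]

# For each algorithm, collect the distinct hex-digest lengths across all inputs.
hex_lengths =
  for algorithm <- [:md5, :sha256, :sha512], into: %{} do
    lengths =
      inputs
      |> Enum.map(fn input ->
        algorithm
        |> :crypto.hash(input)
        |> Base.encode16()
        |> String.length()
      end)
      |> Enum.uniq()

    {algorithm, lengths}
  end

IO.inspect(hex_lengths)
```

Each algorithm should report a single length: the digest size depends on the algorithm, never on the input.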

The hash of a file is useful, for example, to check if the content of two files is identical, or if the content was corrupted during the download.

There are different hash functions (MD5, SHA-1, SHA-2, SHA-3, etc.), many of them available in Elixir.
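If you want to check which algorithms your installation actually provides, the :crypto module can list them. This is a quick sketch assuming OTP 22 or later, where :crypto.supports/1 is available; the atoms I test for are just examples (:sha is the name :crypto uses for SHA-1).

```elixir
# List of hash algorithm atoms supported by this Erlang/OpenSSL build.
supported = :crypto.supports(:hashs)

# Check a few example algorithms against that list.
availability =
  for algorithm <- [:md5, :sha, :sha256, :sha3_256], into: %{} do
    {algorithm, algorithm in supported}
  end

IO.inspect(availability)
```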

Update 👨‍💻: I had initially written the examples below using the MD5 algorithm (the weakest in the list), just because I thought it was the fastest one. @Hauleth pointed out that SHA-1 and SHA-256 should be faster on new CPUs thanks to Intel SHA Extensions, so I rewrote the examples using SHA-256.

Hashing a string

Let’s start by hashing a string using the SHA-256 algorithm.

iex> :crypto.hash(:sha256, "I love Elixir")
<<164, 35, 167, 235, 69, 224, 253, 77, 180, 92, 77, 172, 37, ...>>
iex> :crypto.hash(:sha256, "I love Elixir!")
<<209, 119, 188, 230, 168, 124, 98, 212, 119, ...>>

We’ve used the hash/2 function in the :crypto Erlang module.

The first argument is the name of the hash algorithm we want to use, in this case :sha256. The second argument is the sequence of bytes we want to hash, in this case a string. The function returns the digest as a binary.
Notice how the output changes completely just by appending a “!” character.
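The second argument does not have to be a single binary: :crypto.hash/2 accepts iodata, so a (possibly nested) list of binaries is hashed as if it were concatenated. A minimal sketch, with variable names of my own choosing:

```elixir
# Hash of the whole string in one binary.
from_binary = :crypto.hash(:sha256, "I love Elixir!")

# Hash of the same bytes passed as iodata (a nested list of binaries).
from_iodata = :crypto.hash(:sha256, ["I love ", ["Elixir", "!"]])

IO.inspect(from_binary == from_iodata)
IO.inspect(byte_size(from_binary))
```

This can save a concatenation when the data is already in pieces.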

We can use Base.encode16/1 to get the hexadecimal string representation:

iex> :crypto.hash(:sha256, "I love Elixir!") \
...> |> Base.encode16() \
...> |> String.downcase()
"d177bce6a87c62d4772f404fcad2f8c2d9606c04f99942b71d7c521eb79c4c3b"

If you are on a Linux or Mac machine, you can use a command-line tool like sha256sum to verify that the digest matches (on macOS, where sha256sum may not be installed, shasum -a 256 does the same job):

$ echo -n 'I love Elixir!' | sha256sum
d177bce6a87c62d4772f404fcad2f8c2d9606c04f99942b71d7c521eb79c4c3b -
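As a side note, Base.encode16/2 takes a case: :lower option, so the separate String.downcase/1 step is not strictly needed. A short sketch comparing the two forms:

```elixir
digest = :crypto.hash(:sha256, "I love Elixir!")

# Two-step form used in this article.
two_steps = digest |> Base.encode16() |> String.downcase()

# One-step form using the :case option of Base.encode16/2.
one_step = Base.encode16(digest, case: :lower)

IO.inspect(one_step == two_steps)
```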

Hashing a file

Calculating the hash of a file is conceptually the same as calculating the hash of a string. A file is a sequence of bytes and we could use the same :crypto.hash(:sha256, file_content_binary) function. But, as we saw, most of the time it is not a good idea to load the whole file into memory!

We can use File.stream! and a different set of functions available in :crypto to read and process a file in chunks.
Let’s first see a simple example using the same string we used before, divided into chunks:

iex> [chunk_1, chunk_2] = ["I love ", "Elixir!"]
iex> hash_ref = :crypto.hash_init(:sha256)
#Reference<...36636>
iex> hash_ref = :crypto.hash_update(hash_ref, chunk_1)
#Reference<...36647>
iex> hash_ref = :crypto.hash_update(hash_ref, chunk_2)
#Reference<...36655>
iex> digest = :crypto.hash_final(hash_ref)
<<209, 119, 188, 230, 168, 124, 98, 212, 119, ...>>

iex> digest |> Base.encode16() |> String.downcase()
"d177bce6a87c62d4772f404fcad2f8c2d9606c04f99942b71d7c521eb79c4c3b"

Processing the sequence in chunks gives the same result we got previously, and we can do the same with files:

hash_ref = :crypto.hash_init(:sha256)

File.stream!(file_path)
|> Enum.reduce(hash_ref, fn chunk, prev_ref ->
  new_ref = :crypto.hash_update(prev_ref, chunk)
  new_ref
end)
|> :crypto.hash_final()
|> Base.encode16()
|> String.downcase()

  • We get a hash reference from :crypto.hash_init(:sha256), which is passed to Enum.reduce as the initial accumulator.
  • We use Enum.reduce to read each chunk from the file (a line, by default) and add it to the calculation. :crypto.hash_update/2 returns a new reference, which becomes the new accumulator.
  • Once all the chunks have been processed, the final reference is piped into the :crypto.hash_final/1 function, which returns the SHA-256 digest of the file.
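The digest does not depend on where the chunk boundaries fall, which is what makes this streaming approach safe. Here is a small sketch that splits our string at every possible position and checks that the incremental digest always equals the one-shot digest. ChunkedHash is a hypothetical helper name I made up for the example.

```elixir
defmodule ChunkedHash do
  # Hypothetical helper: hashes any enumerable of binaries incrementally
  # and returns the SHA-256 digest as a lowercase hex string.
  def sha256(chunks) do
    chunks
    |> Enum.reduce(:crypto.hash_init(:sha256), fn chunk, state ->
      :crypto.hash_update(state, chunk)
    end)
    |> :crypto.hash_final()
    |> Base.encode16(case: :lower)
  end
end

text = "I love Elixir!"
one_shot = :crypto.hash(:sha256, text) |> Base.encode16(case: :lower)

# Split the binary after 0, 1, 2, ... bytes and hash the two pieces in order.
all_match? =
  Enum.all?(0..byte_size(text), fn split_at ->
    <<head::binary-size(split_at), tail::binary>> = text
    ChunkedHash.sha256([head, tail]) == one_shot
  end)

IO.inspect(all_match?)
```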

We can write the reduce step in a nicer and more compact way:

File.stream!(file_path)
|> Enum.reduce(:crypto.hash_init(:sha256), &(:crypto.hash_update(&2, &1)))
|> :crypto.hash_final()
|> Base.encode16()
|> String.downcase()
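Wrapped in a function, the pipeline above can be used for the use case mentioned at the beginning: checking whether two files have the same content. This is only a sketch; the FileHash module, its function names and the temporary file names are my own assumptions for the example.

```elixir
defmodule FileHash do
  # Hypothetical helper module, not part of Elixir or :crypto.

  # Streams the file (line by line, the File.stream! default) and returns
  # the SHA-256 digest as a lowercase hex string.
  def sha256(path) do
    path
    |> File.stream!()
    |> Enum.reduce(:crypto.hash_init(:sha256), &:crypto.hash_update(&2, &1))
    |> :crypto.hash_final()
    |> Base.encode16(case: :lower)
  end

  # Two files are considered identical when their digests are equal.
  def same_content?(path_a, path_b), do: sha256(path_a) == sha256(path_b)
end

# Throwaway files so the example is self-contained.
suffix = System.unique_integer([:positive])

[path_a, path_b, path_c] =
  for name <- ["a", "b", "c"] do
    Path.join(System.tmp_dir!(), "file_hash_#{name}_#{suffix}.txt")
  end

File.write!(path_a, "I love Elixir!\n")
File.write!(path_b, "I love Elixir!\n")
File.write!(path_c, "I love Elixir?\n")

identical? = FileHash.same_content?(path_a, path_b)
different? = not FileHash.same_content?(path_a, path_c)
digest_a = FileHash.sha256(path_a)

IO.inspect({identical?, different?})

# Clean up the temporary files.
Enum.each([path_a, path_b, path_c], &File.rm!/1)
```

The same sha256/1 function can be compared against a digest published next to a download to detect corruption.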

File.stream! chunks vs lines

By default, File.stream! emits lines instead of fixed-size chunks. Emitting lines is slower than emitting chunks, I think because the stream needs to look for newlines while splitting the data into strings.

To force the stream to emit chunks, we use File.stream!/3:

iex> File.stream!(file_path, [], 2_048)
%File.Stream{
  line_or_bytes: 2048,
  modes: [:raw, :read_ahead, :binary],
  path: file_path,
  raw: true
}

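This sets a chunk size of 2048 bytes. Note that since Elixir 1.16 the preferred argument order is File.stream!(file_path, 2_048, []); the order shown above still works but is deprecated.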
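To see the two modes side by side, this sketch writes a temporary file of 1,000 short lines and streams it both ways. The file name and content are made up for the example, and the File.stream! call uses the same argument order as above.

```elixir
path = Path.join(System.tmp_dir!(), "stream_modes_#{System.unique_integer([:positive])}.txt")

# 1_000 lines of 15 bytes each (14 characters plus the newline): 15_000 bytes.
File.write!(path, String.duplicate("I love Elixir!\n", 1_000))

sha256 = fn stream ->
  stream
  |> Enum.reduce(:crypto.hash_init(:sha256), &:crypto.hash_update(&2, &1))
  |> :crypto.hash_final()
  |> Base.encode16(case: :lower)
end

# Default mode: one element per line.
line_stream = File.stream!(path)

# Chunk mode: elements of at most 2_048 bytes.
chunk_stream = File.stream!(path, [], 2_048)

line_count = Enum.count(line_stream)
chunk_sizes = Enum.map(chunk_stream, &byte_size/1)
same_digest? = sha256.(line_stream) == sha256.(chunk_stream)

IO.inspect(line_count)
IO.inspect(chunk_sizes)
IO.inspect(same_digest?)

File.rm!(path)
```

The chunk stream produces far fewer elements for the same bytes, and both streams lead to the same digest.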

I made a quick benchmark (you can find it in this gist), which shows that streaming chunks is faster and also better memory-wise.

Name             ips        average  deviation         median         99th %
chunks       23.29 K       42.93 μs    ±63.44%       41.98 μs       83.98 μs
lines         9.21 K      108.54 μs    ±42.52%       93.98 μs      275.98 μs

Comparison:
chunks       23.29 K
lines         9.21 K - 2.53x slower +65.61 μs

Memory usage statistics:

Name      Memory usage
chunks         2.11 KB
lines         20.84 KB - 9.88x memory usage +18.73 KB
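If you just want a rough feeling for the difference on your own machine, a single timed run with :timer.tc/1 is enough. This is not the benchmark from the gist, only a crude sketch with a made-up test file; a single run is noisy, so the numbers will vary a lot.

```elixir
path = Path.join(System.tmp_dir!(), "stream_timing_#{System.unique_integer([:positive])}.txt")
File.write!(path, String.duplicate("I love Elixir!\n", 100_000))

hash_stream = fn stream ->
  stream
  |> Enum.reduce(:crypto.hash_init(:sha256), &:crypto.hash_update(&2, &1))
  |> :crypto.hash_final()
end

# :timer.tc/1 returns {elapsed_microseconds, result}.
{lines_us, lines_digest} = :timer.tc(fn -> hash_stream.(File.stream!(path)) end)
{chunks_us, chunks_digest} = :timer.tc(fn -> hash_stream.(File.stream!(path, [], 2_048)) end)

IO.puts("lines:  #{lines_us} μs")
IO.puts("chunks: #{chunks_us} μs")

File.rm!(path)
```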

Wrap up

We’ve seen what a hash function is and how to easily calculate the hash of a file using Elixir.

In the past (and unfortunately, I think, still in the present 😅), plain hash functions like these were used to store passwords in the database. They are designed to be fast, which makes brute-forcing leaked digests cheap. If you need to securely handle and store passwords, please use the bcrypt_elixir library!


Also published on Medium.