Download Large Files with HTTPoison Async Requests

With HTTPoison it's really easy to make HTTP requests in Elixir. If we need to get the latest BTC-USD exchange rate from the Coinbase API, we just use HTTPoison.get!, passing the URL and query parameters.

iex> url = "https://api.coinbase.com/v2/prices/spot"
iex> resp = HTTPoison.get!(url, %{}, params: [currency: "USD"])

%HTTPoison.Response{
  body: "{\"data\":{\"amount\":\"3585.005\" ... }}",
  ...
  status_code: 200
}  

iex> %{"data" => %{"amount" => amount}} = Jason.decode!(resp.body)
iex> String.to_float(amount)
3585.005

We easily get a response with a JSON body containing the information we need. We then use a library like Jason to deserialise the JSON string and extract the amount using pattern matching.

HTTPoison holds the response in memory. In most cases, like the one above, this isn't an issue, since the data we keep in memory is minimal.

But sometimes the files we need to download, or in general the responses we receive from an HTTP server, are far bigger than what we should keep in memory.

Memory issues downloading a large file

Let's consider, for example, one of the high-resolution TIF images on the www.spacetelescope.org website.

Hubble mosaic of the majestic Sombrero Galaxy

At this link we find the beautiful image above (which I just realised is also the image on the cover of Uncle Bob's Clean Code book). We see that there are different versions, and we are obviously interested in the heaviest one: Full size original, 171 MB.

Now, what happens if we download it using the HTTPoison.get! function as we did before?

We can easily monitor the memory allocation with the super-useful Erlang Observer, which we start with :observer.start. As I've shown in other articles, with this tool it's really easy to monitor the total allocated memory.

Initial allocated memory

In the image above, we see the initial total memory allocation (around 30 MB) just after starting an iex session and the Observer.

iex> url = "https://www.spacetelescope.org/static/archives/images/original/opo0328a.tif"
iex> resp = HTTPoison.get!(url)

%HTTPoison.Response{
  body: <<73, 73, 42, ...>>
...
}

iex> File.write! "image.tif", resp.body
:ok
iex> byte_size(resp.body)
180104580

It worked, but we see that the image binary is fully kept in memory, precisely in resp.body.
It's easy to see how this can lead to dangerous scenarios, especially if we need to handle multiple downloads in a production environment.
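To get a feel for the numbers without downloading anything, we can reproduce the effect directly: holding a large binary makes the total allocated memory (the figure Observer shows) grow accordingly. Here's a minimal sketch using :erlang.memory/1 and an illustrative 100 MB binary standing in for resp.body:

```elixir
# Total allocated memory before holding the binary.
before = :erlang.memory(:total)

# Simulate keeping a ~100 MB response body in memory
# (a stand-in for resp.body; the size is illustrative).
body = :binary.copy(<<0>>, 100_000_000)

# Total allocated memory while the binary is still referenced.
during = :erlang.memory(:total)

IO.puts("body: #{byte_size(body)} bytes, memory grew by #{during - before} bytes")
```

As long as `body` is referenced, those 100 MB can't be garbage collected, exactly like resp.body above.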

Async Requests

With HTTPoison we can make asynchronous requests and receive a large HTTP response in chunks.
To make an async request, we just need to pass the stream_to option to the HTTPoison.get! function.

iex> resp = HTTPoison.get! url, %{}, stream_to: self()
%HTTPoison.AsyncResponse{id: #Reference<...>}

We see that the HTTPoison.get! function immediately returns an HTTPoison.AsyncResponse struct, identified by an id.


We pass a PID (process id) to the stream_to option, and HTTPoison streams the response chunks to this process as messages.
For simplicity, we've used self(), which returns the PID of the current process, in our case the iex console.

Once HTTPoison.get! returns, the current process starts receiving messages containing the parts of the HTTP response.

We use a receive/1 block to inspect the messages received in the process's mailbox.

iex> receive do msg -> msg end
%HTTPoison.AsyncStatus{code: 200, id: #Reference<...>}

iex> receive do msg -> msg end
%HTTPoison.AsyncHeaders{
  headers: [
    {"Server", "nginx/1.13.7"},
    ...
  ]
...
}

iex> receive do msg -> msg end
%HTTPoison.AsyncChunk{
  chunk: <<73, 73, 42, 0, 146, ...>>,
  id: #Reference<...>
}

The first message is an %HTTPoison.AsyncStatus{} struct. We can use this struct to verify that we got a successful status code.
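For instance, a small helper (hypothetical, not part of HTTPoison) could classify the status before we keep consuming chunks; plain maps stand in for the %HTTPoison.AsyncStatus{} struct here:

```elixir
# Classify an AsyncStatus-like message: continue on a 2xx code,
# return an error tuple otherwise.
check_status = fn
  %{code: code} when code in 200..299 -> :ok
  %{code: code} -> {:error, code}
end

check_status.(%{code: 200})
# => :ok
check_status.(%{code: 404})
# => {:error, 404}
```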

We then receive the headers inside a %HTTPoison.AsyncHeaders{} struct.

But the messages we are mainly interested in are the %HTTPoison.AsyncChunk{} structs, where we find chunks of the image's binary. We can sequentially save these chunks to a file, letting the garbage collector get rid of the chunks we've already processed.

We also see that these chunks are quite small, so the download's memory footprint stays low.

iex> receive do 
  %HTTPoison.AsyncChunk{chunk: c} -> 
    byte_size(c) 
end
15998

One single chunk at a time

HTTPoison keeps sending messages to our receiver process, with the risk of flooding it.

Along with the stream_to option we can use another option: async: :once.

In this way, each time we are ready to receive a new chunk, we ask HTTPoison to send a new message by calling HTTPoison.stream_next(resp).

iex> resp = HTTPoison.get! url, %{}, 
               stream_to: self(), 
               async: :once

%HTTPoison.AsyncResponse{}
iex> receive do msg -> msg  end
%HTTPoison.AsyncStatus{}

iex> HTTPoison.stream_next(resp)
iex> receive do msg -> msg  end
%HTTPoison.AsyncHeaders{}

iex> HTTPoison.stream_next(resp)
iex> receive do msg -> msg  end
%HTTPoison.AsyncChunk{}

Recursively save the chunks

We are going to write a simple function, to use in iex, that recursively downloads and saves the chunks.

Let's then start the Observer with :observer.start to monitor the allocated memory and, as before, make an async request.

iex> resp = HTTPoison.get! url, %{}, 
            stream_to: self(), async: :once
%HTTPoison.AsyncResponse{}

iex> {:ok, fd} = File.open("image.tif", [:write, :binary])
{:ok, #PID<...>}

This time we’ve also opened a file where we are going to write the chunks we receive.

async_download = fn(resp, fd, download_fn) ->
  resp_id = resp.id

  receive do
    %HTTPoison.AsyncStatus{code: status_code, id: ^resp_id} ->
      IO.inspect(status_code)
      HTTPoison.stream_next(resp)
      download_fn.(resp, fd, download_fn)

    %HTTPoison.AsyncHeaders{headers: headers, id: ^resp_id} ->
      IO.inspect(headers)
      HTTPoison.stream_next(resp)
      download_fn.(resp, fd, download_fn)

    %HTTPoison.AsyncChunk{chunk: chunk, id: ^resp_id} ->
      IO.binwrite(fd, chunk)
      HTTPoison.stream_next(resp)
      download_fn.(resp, fd, download_fn)

    %HTTPoison.AsyncEnd{id: ^resp_id} ->
      File.close(fd)
  end
end

We define an anonymous function that accepts the arguments resp, the async response returned by HTTPoison.get!; fd, the opened file; and download_fn, a reference to the function itself, so we can call it recursively.

Inside this function we receive the messages, pattern matching the different cases:

  • In the case of an %HTTPoison.AsyncStatus struct, which should be the first message, we just print the status_code, ask for the next message with HTTPoison.stream_next(resp), and then recursively call download_fn, passing resp, fd and itself.
  • With %HTTPoison.AsyncHeaders we just print the headers, then again ask for the next message and call download_fn.
  • When we receive an %HTTPoison.AsyncChunk, we save the chunk by calling IO.binwrite(fd, chunk). We loop through all the chunks until we receive the %HTTPoison.AsyncEnd struct.
  • Receiving %HTTPoison.AsyncEnd, we know that the download has finished. We close the file and return.

We didn't write any particular error handling, since our goal here is just to see a full file download using async requests.
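One easy hardening step, for example, would be an after clause on the receive, so a stalled connection can't block the process forever. The snippet below shows the mechanism in isolation (the 100 ms timeout is just for illustration; a real download would use a much larger value and probably HTTPoison.stream_next to cancel or retry):

```elixir
# With an empty mailbox, receive/1 falls through to the `after` clause
# once the timeout expires, instead of blocking forever.
result =
  receive do
    msg -> {:ok, msg}
  after
    100 -> {:error, :timeout}
  end

IO.inspect(result)
# => {:error, :timeout}
```

In the download loop above, each receive could carry a similar after clause returning {:error, :timeout}, after closing the file.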

iex> async_download.(resp, fd, async_download)
200
[
  {"Server", "nginx/1.13.7"},
  {"Content-Type", "image/tiff"},
  {"Content-Length", "180104580"},
  ...
]
:ok 
Low memory allocation with async request

It works, and as a result we find a beautiful image.tif file on our disk. Most importantly, we see that the allocated memory stays far lower than before.


Also published on Medium.