Exponential Backoff with Elixir
Resende
Posted on April 11, 2023
When we rely on external services, they may not always be available, may take a long time to respond, or may not be able to fulfil our requests. This can be frustrating, but if it's a one-time request and we only need to query one host, we can simply keep trying until we get a response.
However, if we have multiple providers for the same service and need to query them repeatedly, such as in a scenario where we're tracking prices for different franchises of a store and maintaining a leaderboard of the best prices, we can't rely on the same strategy of continually polling a single service. In these cases, we need a more sophisticated approach, such as exponential backoff, to manage the requests and handle any failures more efficiently.
In some cases, the stores that we're polling may not have the specific item we're looking for, or they may have an unstable server or poor internet connection. This can make it challenging to avoid wasting time on unsuccessful requests.
So how can we prevent permanent polling, active wait, blocking code, and scenarios like this:
you - are you ready yet ?
service - NO.
you - What about now?
service - ..... for the last time NO.
you - And now?
service - still NO.
you - Hey service could you give me ...?
service - Iup here you go!
Getting started
Today I'm gonna talk about exponential back-off, and how to deal with this kind of problem in a very simple way.
Better yet I'm gonna show you some examples in Elixir that make this whole process a lot simpler with OTP.
Exponential backoff is like taking a deep breath and counting to 10 before trying again when you don't succeed at something on the first try.
But instead of counting to 10, you wait a little longer each time you try again, giving yourself and others around you a chance to catch their breath and avoid chaos.
We will use GenServers so the final result ends up in a very simple and elegant implementation.
The premises are:
- You make the request:
- The request went through. Great, do the next request in 5 seconds.
- The request didn't go through, we'll try again in 25 seconds (5^2).
Now we could do this to infinity but let's do a smarter implementation with an allowlist and blocklist for the stores that return n
unsuccessful requests so we don't have an infinite waiting time for the next request
- You make the request :
- The request went through. Great, do the next request in 5 seconds.
- The request didn't go through, we'll try again in 25 seconds (5^2) and you get a ticket (1).
- The request didn't go through and you already have 3 tickets, you're going on the blocklist.
Cool so how do I implement this in Elixir?
Well, it's pretty easy!
We can have this structure :
- GenServer for the Stores
- GenServer for the Allowlist
- GenServer for polling each store
GenServer for the stores
In this, we have the main GenServer that will create the allowlist, and each of the polling store processes.
If you don't know, a GenServer is implemented in two parts: the client API and the server callbacks. If you want to learn more check this blog post on concurrency mechanisms in Elixir that I did a while back.
At the public API level, we only need a way to get the prices. This is the objective of this module, get prices.
defmodule Project.Stores do
def prices do
GenServer.call(__MODULE__, :prices)
end
# ....
end
On the private callback side, we need to support the public price call (of course) and the refresh callback to update the allow list.
The price call is gonna query the allowlist for the prices of the stores that are currently available and answering to our queries.
defmodule Project.Stores do
# ...
# private API
def handle_call(:prices, _from, %{allowlist: allowlist} = state),
do: {:reply, Stores.Allowlist.list(allowlist), state}
def handle_info(:refresh, %{period: period, allowlist: allowlist} = state) do
case API.list_stores() do
{:ok, stores} ->
start_store_polling(stores, allowlist)
error ->
error
end
schedule_work(period)
{:noreply, state}
end
# ...
end
Finally, we can have private functions like starting the polling GenServer for each store and a function to periodically wake up the process by sending a message to self.
This message will be caught by the handle_info on the private callback allowing for a refresh of the stores.
defmodule Project.Stores do
# ...
defp start_store_polling(stores, allowlist) do
for store <- stores do
Stores.Polling.start_link(store, allowlist)
end
end
defp schedule_work(period), do: Process.send_after(self(), :refresh, period)
end
GenServer for the allowlist
This allowlist will both keep all the current stores that are responding as well as their prices and a quarantine list for stores that may be removed if they keep failing to respond. With this we need the public API to support adding a store, adding its price, removing a store, quarantining a store, and finally listing the stores and prices.
defmodule Project.Stores.Allowlist do
use GenServer
def start_link,
do: GenServer.start_link(__MODULE__, :ok)
def init(:ok),
do: {:ok, %{allowlist: %{}, quarantine: MapSet.new()}}
def add(pid, store, price),
do: GenServer.cast(pid, {:add, store, price})
def remove(pid, store),
do: GenServer.cast(pid, {:remove, store})
def quarantine(pid, store),
do: GenServer.cast(pid, {:quarantine, store})
def list(pid),
do: GenServer.call(pid, :list)
# ...
end
The private API is self-describing, the cast adds, removes, lists, and changes a store from the allow list to the quarantine list. There's nothing outside of the ordinary here.
defmodule Project.Stores.Allowlist do
# ...
# private API
def handle_cast({:add, store, price}, state) do
quarantine = MapSet.delete(state.quarantine, store)
allowlist = Map.put(state.allowlist, store, price)
{:noreply, %{state | quarantine: quarantine, allowlist: allowlist}}
end
def handle_cast({:remove, store}, state) do
quarantine = MapSet.delete(state.quarantine, store)
allowlist = Map.delete(state.allowlist, store)
{:noreply, %{state | quarantine: quarantine, allowlist: allowlist}}
end
def handle_cast({:quarantine, store}, state) do
allowlist = Map.delete(state.allowlist, store)
quarantine = MapSet.put(state.quarantine, store)
{:noreply, %{state | quarantine: quarantine, allowlist: allowlist}}
end
def handle_call(:list, _from, state) do
{:reply, Map.values(state.allowlist), state}
end
end
GenServer for polling a store
The polling store GenServer itself is where all the fun stuff happens. It keeps track of all the requests and the number of failed attempts to communicate to the store and loads constant values like the number of max_attempts
allowed and the timeout
/base_timeout
.
defmodule Project.Stores.Polling do
use GenServer
import Project.Config, only: [config!: 2]
alias Project.Stores.Allowlist
def start_link(store, allowlist) do
name = process_name(store)
state = %{
store: store,
allowlist: allowlist,
attempts: 0,
max_attempts: config!(__MODULE__, :max_attempts),
timeout: 2_000,
base_timeout: 2_000
}
GenServer.start_link(__MODULE__, state, name: name)
end
# ...
end
On the init
function we pass the state that we received from start_link
and call the private function schedule_work
that will send a message to this process, and make it do some work (the name is accurate).
defmodule Project.Stores.Polling do
# ...
def init(state) do
schedule_work(state.base_timeout)
{:ok, state}
end
# ...
end
When the process receives a message to do some work it queries the store for the current price. Depending on the result, it's handled differently. This is especially easy with tuples in Elixir using the :ok
and :error
symbols on the first element.
defmodule Project.Stores.Polling do
# ...
def handle_info(:work, state) do
case query_price(state.store) do
{:ok, price} ->
handle_successful_polling(state, price)
{:error, _reason} ->
handle_missing_polling(state)
end
end
# ...
end
If the polling is successful we add the store and the price to the allowlist GenServer.
We will reschedule the polling with the base timeout and update the process state to set the timeout to the base timeout with 0 retry attempts.
defmodule Project.Stores.Polling do
# ...
defp handle_successful_polling(state, price) do
Allowlist.add(state.allowlist, state.store, price)
schedule_work(state.base_timeout)
new_state = %{state | timeout: state.base_timeout, attempts: 0}
{:noreply, new_state}
end
# ...
end
If the process misses the poll and it's the last attempt we are going to remove it from the Allowlist and update the state to mark the attempts as the max attempts.
defmodule Project.Stores.Polling do
# ...
defp handle_missing_polling(
%{attempts: attempts, max_attempts: max_attempts} = state
)
when attempts == max_attempts - 1 do
Allowlist.remove(state.allowlist, state.store)
{:stop, :normal, %{state | attempts: max_attempts}}
end
# ...
end
If it's not the last attempt and the process misses the polling we quarantine that store, exponentiate the timeout and increase the attempt counter.
After that, we set the new state with the timeout and attempts while scheduling the next poll with the new timeout.
defmodule Project.Stores.Polling do
# ...
defp handle_missing_polling(state) do
Allowlist.quarantine(state.allowlist, state.store)
timeout = exponential_backoff(state.timeout)
attempts = state.attempts + 1
new_state = %{state | timeout: timeout, attempts: attempts}
schedule_work(timeout)
{:noreply, new_state}
end
# ...
end
At the end of the file, we define the helper functions for this module,
exponential backoff, schedule_work, and process name.
The exponential backoff function is a wrapper around the power function from Erlang with a round of the value, the schedule work function is a send-after to the same process, and the process_name uses the module and store to give a unique name to each process.
defmodule Project.Stores.Polling do
# ...
defp exponential_backoff(timeout), do: :math.pow(timeout, 2) |> round()
defp schedule_work(timeout), do: Process.send_after(self(), :work, timeout)
defp process_name(store), do: :"#{__MODULE__}-#{store}"
end
Conclusion
This was a fairly technical blog post but I truly hope you learned something useful.
Although the example presented in this post is somewhat specific, with small changes it's possible to adapt it to your specific use case.
Hope this will help you in your next project if an exponential backoff is needed, maybe when you are dependent on some service that rate limits requests with PlugAttack as described in this blog post.
Thank you for reading and if you have any questions or doubts feel free to reach out on Twitter at @Resende_666.
Posted on April 11, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.