8 Common Causes of Flaky Tests in Elixir

itizadz

Adz

Posted on January 4, 2022


Flaky tests are like meme stocks — many people have them, but no one knows what to do with them. Today, we will change that by diving into some common causes and, more importantly, solutions for flickering tests in Elixir.

Elixir has many great primitives that let us run tests asynchronously, including immutable data, lightweight processes, and the Ecto SQL sandbox. Running tests asynchronously can greatly speed up your test suite, but can also increase the chance of flaky tests.

What Are Flaky Tests?

Flaky tests are tests that sometimes fail. They erode confidence in your test suite and are hard to fix because they are hard to reproduce. A flaky failure is often taken to mean that the test is broken (rather than the code), so it gets ignored or retried until it passes.

Locally, this slows you down, but it especially hurts your CI — every failure means at least one rebuild. Anything that doubles the time it takes to deploy code is very annoying. The culture of "oh, just retry" is a broken window that risks further decline in your codebase.

Find and Replicate Flaky Tests in Elixir

Flaky tests are usually easy to spot (they'll be the ones that fail on CI when you update the README), but they are harder to replicate locally.

ExUnit can help here, because it lets us run the tests in the same order as a previous run. By default, ExUnit runs tests in a random order to encourage test isolation, but we can seed that randomness with a command-line option. Re-using the seed from a previous run triggers the tests in the same order every time. ExUnit prints the seed it used at the end of the output:

Finished in 0.5 seconds (0.00s async, 0.5s sync)
91 tests, 0 failures

Randomized with seed 119489
#          The seed  ^^^^

We can re-use that seed like this:

mix test --seed 119489

However, this won't always reproduce the flake, especially if a database is the cause of the flakiness or if some resource constraint makes the flakiness more likely (CI is likely to have much less RAM, for example, than your dev machine). On top of that, the seed does not influence how quickly a test runs. If the tests run asynchronously, there is no guarantee that two tests will run at the same time again (even if they are triggered in the same order).

Imagine three tests are running concurrently: A, B, and C. The seed determines that the tests trigger in the order A, B, C. The first time these tests run, test A takes longer than the other two combined. A starts, then B triggers and finishes, C triggers and finishes, and finally A finishes.

If we rerun these tests with the same seed, even though they trigger in the same order, A might finish before C starts this time, for whatever reason. That might mean you won't reproduce the conditions needed for the test to flicker. Using a seed is a good first stab, but it might not work.

Running the tests repeatedly can help. Here is a bash function that will run the tests until there is a failure:

function test_repeat {
  while mix test
  do
    echo "testing"
  done
}

When you understand some of the common causes of flaky tests, you can often identify the problem just by looking. That's the level of intuition we want to build up here.

All flaky tests boil down to one thing: non-determinism. Non-determinism is when the same input can produce different results. We need to look out for non-determinism sneaking into our tests and think about how it can happen when tests run asynchronously.

Look out especially for global state. Global state, I hear you say? But Elixir is functional! There is no global state! Well... that's not quite true.

Let's take some of the most common causes of flaky tests in turn below.

8 Common Causes of Flaky Tests

1. Using Application.put_env in Asynchronous Tests

When configuring Elixir apps, we can read values from the config using functions like Application.fetch_env!/2. You might be tempted to set application state in the test setup to test behavior in different environments:

defmodule MyTest do
  use ExUnit.Case, async: true

  describe "on CI" do
    setup do
      old_value = Application.fetch_env!(:my_app, :is_CI)
      Application.put_env(:my_app, :is_CI, true)
      on_exit(fn -> Application.put_env(:my_app, :is_CI, old_value) end)
    end

    test "Does a thing!" do
      assert MyModule.fun() == 1
    end
  end

  describe "Not on CI" do
    setup do
      old_value = Application.fetch_env!(:my_app, :is_CI)
      Application.put_env(:my_app, :is_CI, false)
      on_exit(fn -> Application.put_env(:my_app, :is_CI, old_value) end)
    end

    test "Does not do the thing" do
      assert MyModule.fun() == 2
    end
  end
end

Don't do this if your tests run asynchronously. Application.fetch_env/2 can be called from anywhere, which makes the application environment effectively global state. Worse, because Application.put_env/3 can also be called from anywhere, it is effectively global mutable state.

That means that even if you reset the value afterwards, another test running asynchronously might read the application environment after you have changed it (but before your test completes and changes it back). That test gets the wrong value and, sometimes, fails.

The Fix

Don't use Application.put_env in tests — or, if you have to, put it in a test with async: false and reset it using on_exit().
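
One way to avoid touching the application environment at all is to let the caller pass the value in, and only fall back to the config when nothing is given. The following is a minimal sketch under assumptions, not code from the original example: the `:is_ci` option and the body of `MyModule.fun/1` are made up for illustration, since we never see what `MyModule.fun` actually does.

```elixir
defmodule MyModule do
  # Hypothetical implementation: returns 1 on CI and 2 elsewhere,
  # mirroring the assertions in the tests above.
  def fun(opts \\ []) do
    # Only read the global application environment when the caller
    # did not pass an explicit value.
    is_ci =
      Keyword.get_lazy(opts, :is_ci, fn ->
        Application.fetch_env!(:my_app, :is_CI)
      end)

    if is_ci, do: 1, else: 2
  end
end
```

An async test can then call `MyModule.fun(is_ci: true)` or `MyModule.fun(is_ci: false)` without ever writing to shared state.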

2. Incorrectly Configuring Ecto.Adapters.SQL.Sandbox

Usually, we configure Ecto so that each test runs in its own transaction. Each test runs concurrently in its own process, and each process opens its own transaction to the database.

This is great because it means that Ecto can simply roll back that transaction when the test finishes. This allows us to run our tests asynchronously (in Postgres at least) without worrying that the state of the database is anything other than what our test specifies it to be.

However, sometimes we need to see what other processes do to a database. Imagine we test some code that starts a task that writes to the database. The test process needs to see what that task's process does to the database.

We can do this by putting Ecto.Adapters.SQL.Sandbox in :shared mode, allowing "a process to share its connection with any other process automatically".

But remember that each asynchronous test runs in its own process. That means that in :shared mode, any test running simultaneously with another test will share the transaction to the database and see all of the changes the other test makes.

This is the first way we could introduce non-determinism in tests.

The Fix

If a test sets the Ecto.Adapters.SQL.Sandbox to :shared mode, never run it asynchronously, like so:

defmodule MyTest do
  @moduledoc """
  This test should be synchronous because the sandbox runs in shared mode for it!
  """
  use ExUnit.Case, async: false

  describe "my_fun/2" do
    ...
  end
end
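
A common way to make this hard to get wrong is to tie the sandbox mode to the test's `async` tag in a shared case template, so a module only gets a shared connection when it is synchronous. This is a sketch that only works inside a project with `ecto_sql` installed; it assumes a repo named `MyApp.Repo` (a placeholder) and a version of `ecto_sql` that provides `Ecto.Adapters.SQL.Sandbox.start_owner!/2`.

```elixir
defmodule MyApp.DataCase do
  use ExUnit.CaseTemplate

  setup tags do
    # Shared mode is switched on only when the test module is NOT async.
    pid = Ecto.Adapters.SQL.Sandbox.start_owner!(MyApp.Repo, shared: not tags[:async])

    # Check the connection back in when the test is done.
    on_exit(fn -> Ecto.Adapters.SQL.Sandbox.stop_owner(pid) end)
    :ok
  end
end
```

Test modules would then `use MyApp.DataCase, async: true` or `async: false` and get the matching sandbox mode.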

3. Having Non-unique Unique Data

Different databases implement transactions slightly differently. For now, I will talk about the most commonly used database in Elixir: Postgres.

Postgres

Postgres never lets a transaction see another's uncommitted changes (even though the SQL standard technically allows it!), so the default transaction isolation level doesn't cause problems in concurrent tests.

Unfortunately, two concurrently running tests can still interfere via a different concurrency control that Postgres employs: locks.

A lock is a concurrency control mechanism that ensures different commands can be executed safely while other commands are happening. For example, if we wish to truncate a table, it isn't a good idea to try to insert something into that table at the same time. Truncate "locks" the table, preventing anything else from happening to that table while it is being truncated.

You can use explicit locking — where you tell Postgres to take a specific kind of lock — but each command has its own appropriate level of automatic locking. If Postgres thinks two concurrently running commands will conflict, the second command will wait for the first to complete.

By far, the most common problem here is with unique data. Let's say we have a user table with a unique email address. Now, let's imagine we have test 1 and 2, and both insert a user with the same email address — jeff@example.com — as part of the test setup.

When checking a unique index on insertion, Postgres also has to take rows inserted by other, still uncommitted transactions into account. If it only looked at committed rows, two transactions could both pass the check, both commit, and leave two rows with the same "unique" value.

If test 1 intends to insert a user with the email jeff@example.com, test 2 cannot do that. But if, later on, test 1 never actually inserts them, then test 2 can insert their user. That means that Postgres can't know the answer to "can test 2 insert the user?" until after test 1 has finished. So it takes a lock on the row, which basically says: "hey test 2, wait until test 1 has finished its transaction before you continue".

Even though the inserts happen in different isolated transactions that never actually commit, when Postgres sees that one transaction has already "inserted" a row with a jeff@example.com email, it makes the next one wait for the first transaction to commit or roll back.

All this means that if you have data that should be unique but isn't across concurrently running tests, you will at best incur some performance penalty. This can be quite severe as it adds up across the codebase. You can end up with async tests effectively running synchronously!

But, if we add more tables with more unique columns that are not unique across tests, it gets even worse! Depending on the order in which data is set up, you can end up with deadlocks. Let's say our app has blogs with a unique title. Picture the following:

Test 1                                      Test 2

inserts user with email jeff@example.com
                                            inserts a blog with title "Deadlocks!"
                                            inserts user with email jeff@example.com

inserts a blog with title "Deadlocks!"

Test 1 inserts the user. Before test 2 can do the same, it must get a lock and wait for test 1 to finish.
Now test 2 inserts a blog, and for test 1 to do that, it must wait for test 2 to finish.
Then test 2 attempts to insert the user, so it gets the lock and says "I'll wait for test 1". At this point, test 2 is waiting for test 1 to finish.

Meanwhile, test 1 continues and attempts to insert the blog but can't because test 2 just did. So it waits to see if test 2 will commit or roll back. But test 2 is waiting for test 1 to finish, and now test 1 is waiting for test 2 to finish!

This is a deadlock. Usually, Postgres detects them automatically and one transaction gets rolled back, causing your test to flake.

The Fix

Sometimes you will hear advice like "ensure that the locks are acquired in a consistent order". This helps prevent a deadlock, but is tricky to do and would still incur the performance penalty mentioned.

The simplest golden rule is — if your data should be unique, make it unique across all tests:

%User{email: "#{Ecto.UUID.generate()}@example.com"}

Also consider whether your test data is really unique. Using a UUID will make it unique. Picking a random number between 1 and 1,000 will not.
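
If you would rather not depend on Ecto just to build test data, `System.unique_integer/1` from the standard library also yields a value that is never repeated within the running VM, which is enough for a single test run. A small sketch; the `TestData` module and `unique_email/0` are hypothetical names:

```elixir
defmodule TestData do
  # Hypothetical helper: builds an email address that no other test
  # running in the same VM will receive.
  def unique_email do
    "user-#{System.unique_integer([:positive])}@example.com"
  end
end
```

Each call returns a different address, so two concurrent tests never compete for the same unique index entry.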

4. Writing to ETS or Persistent Term in Asynchronous Tests

ETS and :persistent_term are two data stores that come with Erlang. They are accessible from anywhere and, like all data stores, are stateful. That means two tests running at the same time that both write into ETS or :persistent_term during setup can, and will, interfere with each other. For example, one test adds a record to a table and then deletes it, while another test asserts on the number of records in that same table.

The Fix

Your options are to mock ETS/:persistent_term (after all, we don't need to test that these things do what they say, since Erlang does that for us!) or to have the tests run synchronously. Prefer the former! Mox is a great choice for this sort of work.
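
To give a rough idea of what "mocking the store" can look like, here is a hand-rolled fake rather than a Mox mock (Mox builds on the same idea of a behaviour with swappable implementations). Every name below (`Store`, `Store.Fake`, `Visits`) is hypothetical, and the sketch assumes the code under test runs in the test process itself.

```elixir
defmodule Store do
  # Hypothetical behaviour describing the only operations our code needs.
  @callback put(key :: term(), value :: term()) :: :ok
  @callback get(key :: term()) :: term() | nil
end

defmodule Store.Fake do
  @behaviour Store

  # Backed by the process dictionary, so every test process gets its
  # own private, empty store and async tests cannot see each other.
  @impl true
  def put(key, value) do
    Process.put({__MODULE__, key}, value)
    :ok
  end

  @impl true
  def get(key), do: Process.get({__MODULE__, key})
end

defmodule Visits do
  # Hypothetical code under test; the store module is injected.
  def record(page, store) do
    count = (store.get(page) || 0) + 1
    store.put(page, count)
    count
  end
end
```

In production the same `Visits.record/2` would be given a module that implements `Store` on top of ETS or `:persistent_term`; the tests pass `Store.Fake` instead.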

5. Relying On The Order Of Logs

Sometimes you may want to test that a particular log line has been emitted. The best way to do this is with capture_log from ExUnit.CaptureLog:

logs = ExUnit.CaptureLog.capture_log(fn -> MyFun.call() end)
assert logs == "This is my Log. There are many like it but this one is mine."

This takes a function and captures all of the logs emitted during the execution of that function. It concatenates all the logs together into one binary and returns it.

But logs can be emitted from anywhere, and capture_log will capture them dutifully, making them, in a sense, global. And because capture_log concatenates all logs together, each new log line changes the result of capture_log. So, in effect, the returned binary is global and changing.

If this has set off warning bells: congratulations, you are right to be worried! If the tests are running asynchronously the result of capture_log will be different depending on what other tests are running at the time. If you assert what the captured logs should exactly look like, you will have a bad time.

The same principle applies if you use something like a Ring Logger or any custom logging backend that buffers the logs.

The Fix

Never rely on log order when making assertions. Assume that the log can include any number of other log lines. When asserting on captured logs, you can use a regex or the =~ operator to match on the subset of the log that you care about:

logs = ExUnit.CaptureLog.capture_log(fn -> MyFun.call() end)
assert logs =~ "This is my Log. There are many like it but this one is mine."

That way, the rest of the logs do not interfere with your assertions.

6. Failing to Specify Order in the Database

The most common cause of flaky tests in the wild is the assumption that a database will return results in a certain order when there is no such guarantee. In Postgres, for example, if an explicit order is not supplied then the results can be ordered in any way — even if it seems otherwise, most of the time! So all of these are bad ideas:

# No order, no guarantee!
assert Repo.all(Stuffs) == [first, second]

# Pattern matching won't help you here.
[first, second] = Repo.all(my_query)
assert first.thing == 1

# A preload also doesn't specify an order.
%MyModel{comments: [first_comment | _]} = Repo.get(MyModel, 1) |> Repo.preload(:comments)
assert first_comment.text == "Oh No!"

The Fix

Some possible solutions are:

  1. Specify the order in the database query:
assert Repo.all(from(s in Stuffs, order_by: [s.inserted_at])) == [first, second]

But wait, did you notice the problem above? What if two records have the same inserted_at? Rows that tie on the sort key can come back in either order. You need to handle that possibility too; e.g., if the id is an auto-incrementing integer, you could add it as a tie-breaker:

result = Repo.all(from(s in Stuffs, order_by: [s.inserted_at, s.id]))
assert result == [first, second]

  2. Don't assert on order if it's not important:
result = Repo.all(Stuffs)
assert_unordered_list_equality(result, [first, second])

Or even:

all = Repo.all(Stuffs)

assert length(all) == 2

jeff = Enum.find(all, & &1.name == "Jeff")
assert jeff == %{...}

joe = Enum.find(all, & &1.name == "Joe")
assert joe == %{...}
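
Note that `assert_unordered_list_equality` used above is not something ExUnit ships with; you would have to define it yourself. One possible definition, as a sketch (the module name `MyApp.TestHelpers` is a placeholder), sorts both lists before comparing them:

```elixir
defmodule MyApp.TestHelpers do
  import ExUnit.Assertions

  # Hypothetical helper: passes when both lists hold the same elements,
  # regardless of order. Sorting uses Erlang term ordering, which is
  # all we need for an equality check.
  def assert_unordered_list_equality(actual, expected) do
    assert Enum.sort(actual) == Enum.sort(expected)
  end
end
```

Sorting also keeps duplicates honest: `[1, 1]` and `[1, 2]` are still reported as different.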

7. Not Mocking Date/Time/Random ⏰

We've all been there. It's 2 PM on Wednesday, and for some reason, half the test suite fails when it all passed a minute ago. Yes, someone forgot to mock date time.

Imagine we have this function:

def is_in_the_past?(date) do
  Date.compare(date, Date.utc_today()) == :lt
end

Now we add a test that looks like this:

test "returns false when the date is not in the past" do
  date = ~D[2022-11-05]
  assert is_in_the_past?(date) == false
end

Well, this will pass until 5th November 2022 is behind us, and fail every day after that. The same idea applies to Times, Dates, DateTimes, NaiveDateTimes, and any function that uses randomness, such as Enum.take_random/2.

The Fix

Mock those modules! Mox is a great choice for this sort of work.
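
If a full mock feels heavy for a function this small, a lighter alternative (a different technique from Mox, named plainly here) is to pass "today" in as an argument that defaults to the real date. A sketch, where the `Schedule` module name is an assumption:

```elixir
defmodule Schedule do
  # `today` defaults to the real current date but can be pinned by a
  # test, which makes the result independent of when the test runs.
  def is_in_the_past?(date, today \\ Date.utc_today()) do
    Date.compare(date, today) == :lt
  end
end
```

A test can then call `Schedule.is_in_the_past?(~D[2022-11-05], ~D[2022-01-04])` and get the same answer on any day of any year.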

8. Using assert_received/2 Instead of assert_receive/3

In ExUnit you can assert that a test process receives a message using either assert_receive/3 or assert_received/2. This lets you write tests for functions that spin up processes and send messages to other ones, e.g.:

defmodule Echo do
  def echo(pid, message) do
    send(pid, message)
  end
end

test "Echo.echo/2 sends the right message to the given pid" do
  ref = make_ref()

  Echo.echo(self(), {ref, :hello})

  assert_received({^ref, :hello})
end

But there is a subtle difference between the two assertions: assert_receive/3 accepts a timeout. This is the amount of time to wait for the message to appear in the current process's mailbox. Usually, this is all very quick, so a timeout seems unnecessary, but send/2 in Elixir is non-blocking.

The send inside Echo.echo/2 happens, and then we immediately continue with our test. The next thing is to check the mailbox. In the right conditions (i.e., some performance blip), there is a small chance that assert_received/2 could look in the mailbox after the send happens but before the message reaches the mailbox of the current process. If this does happen, we have a flaky test.

The Fix

Prefer assert_receive/3. Remember, the timeout is the maximum time it will wait — if the message gets there sooner, the test will finish sooner.
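
The difference is easiest to see when the message comes from another process. The sketch below is illustrative only: the test module name, the 20 ms delay, and the 500 ms timeout are arbitrary choices made for this example.

```elixir
# Only needed when running this file directly with `elixir`;
# in a mix project, test/test_helper.exs already calls ExUnit.start().
ExUnit.start()

defmodule SlowEchoTest do
  use ExUnit.Case, async: true

  test "a message sent from another process arrives within the timeout" do
    parent = self()
    ref = make_ref()

    # The sender is a separate process that delays before sending.
    spawn(fn ->
      Process.sleep(20)
      send(parent, {ref, :hello})
    end)

    # assert_received would check the mailbox right now and fail here;
    # assert_receive waits up to 500 ms and returns as soon as the
    # message arrives.
    assert_receive {^ref, :hello}, 500
  end
end
```

Because the assertion returns as soon as the message is in the mailbox, a generous timeout costs nothing on the happy path.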

A Side-note: Should Tests Run Sync or Async?

You may have noticed that the chance of flickering tests decreases greatly if we make all tests synchronous. I do not recommend doing that. Running asynchronously greatly speeds up most test suites.

Similarly, mocking everything and writing only unit tests will also likely reduce the chance of flickering — but it's on us to decide whether such a testing strategy would give us enough confidence in our code.

One thing to note is that tests within a single test module always run serially. If the module is marked async, it might run at the same time as another module of tests. So, if you really need some synchronous tests, you can put them in their own module (in the same file). That allows most of the tests to run async:

defmodule MyTest do
  use ExUnit.Case, async: true

  test "my_fun/3" do
    ...
  end
end

defmodule MySynchronousTest do
  @moduledoc "These tests are synchronous because blah blah..."
  use ExUnit.Case, async: false

  describe "my_fun/2" do
    ...
  end
end

Wrap up

In this post, we've defined flaky tests and seen how to replicate them, before running through 8 common causes of flaky tests and their fixes.

Here is a summary of what you should avoid doing, for easy reference:

  1. Don't use Application.put_env in async tests.
  2. Don't use :shared mode on the Ecto SQL sandbox in async tests.
  3. Ensure unique data is unique across all tests.
  4. Consider mocking ETS/:persistent_term for testing.
  5. Never rely on the order of logs in a test.
  6. If you rely on database order, specify it.
  7. Mock dates, times, datetimes, and random.
  8. Prefer assert_receive/3 over assert_received/2.

I hope you've found this post useful, and happy coding!

P.S. If you'd like to read Elixir Alchemy posts as soon as they get off the press, subscribe to our Elixir Alchemy newsletter and never miss a single post!
