Scott White
Posted on April 27, 2020
End-to-end tests are an essential tool for verifying that the real customer experience still works every time you make changes to your application.
They're also not without their problems.
End-to-end tests are notoriously difficult to maintain, for a couple of reasons:
- When making changes to your app, you need to refactor your old tests to handle those changes.
- Even when nothing changes about your app, sometimes your tests fail anyway, forcing you to investigate the failure. These are called test flakes.
What is a flaky test?
A test that is flaky will pass sometimes and fail other times, even when nothing changes about the test case itself. In other words, flaky tests are non-deterministic.
Why do flaky tests matter?
Flaky tests hurt reliability, and carry large explicit and implicit costs:
❌ Flaky tests hinder productivity
Every time a test flakes, a developer has to investigate whether the failure is genuine or a false alarm. If your tests flake at a roughly constant rate, every new test you add represents more time the engineering team will have to spend maintaining it in the future.
Furthermore, test failures have a cascading effect on overall productivity. In a CI/CD environment, test failures lead to failed pipelines, which lead to delayed deploys.
When the whole point of CI/CD is to speed up development cycles, flaky tests can defeat the whole purpose.
💸 Organizational costs
The hidden, and largest, cost of test flakiness is the damage it does to overall trust in the testing process.
If engineers cannot trust the results of their tests (because they can't tell whether a failure is genuine or a false alarm), they won't want to write tests to begin with. Test flakiness is the catalyst for a negative feedback loop:
- If the tests are flaky, engineers won't trust the tests
- If engineers don't trust the tests, they won't write them in the first place
- If the codebase doesn't have tests, no one trusts the codebase
- If no one trusts the codebase, then they need to write tests (which starts the cycle over again)
😡 Emotional costs
The least obvious cost to the organization, but the most obvious to the engineer who writes the tests, is the emotional toll they take. No one likes writing flaky tests knowing they'll be stuck maintaining them in perpetuity.
Common causes of test flakiness, and how to deal with them
While end-to-end tests are notoriously flaky, they're at least predictably flaky. Being cognizant of the root causes below goes a long way toward authoring tests that are resilient to flakes.
⏰ Asynchronous waiting
Waiting for a fixed amount of time between steps in a test drastically increases the chance that your test will falsely fail at any given step: the app may simply not be ready when the timer expires.
Instead of waiting for fixed periods of time between steps, write your tests to poll your application continuously until it reaches the appropriate state, and only then assert on the step's criteria.
💡 With walrus.ai, no wait steps are necessary between instructions. Simply relay the user story in plain English, and all transitions are handled automatically.
🎬 Inconsistent starting state
If the initial conditions for your testing environment differ between runs, your tests will likely fail to even kick off because the starting assertions won't pass.
To avoid this, make sure your application is in a consistent state before each test starts, and tear down the environment after every test run, pass or fail. That way, if a flake does occur, it can't contaminate future runs.
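A minimal sketch of that pattern with Jest-style hooks; `resetDatabase` and `seedDatabase` are hypothetical helpers standing in for however your app resets its own state:

```typescript
import { beforeEach, afterEach, test, expect } from "@jest/globals";

// Hypothetical helpers standing in for your own reset logic.
async function resetDatabase(): Promise<void> { /* drop all test data */ }
async function seedDatabase(): Promise<void> { /* load known fixtures */ }

beforeEach(async () => {
  await resetDatabase(); // wipe anything a previous run left behind
  await seedDatabase();  // start every test from the same known state
});

afterEach(async () => {
  // Runs whether the test passed or failed, so one flaky run
  // can't poison the starting state of the next.
  await resetDatabase();
});

test("first-time visitor sees an empty cart", async () => {
  // ...navigate and assert against the known, seeded state...
  expect(1 + 1).toBe(2); // placeholder assertion
});
```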
💡 With walrus.ai, you can specify setup or teardown instructions directly in the body of the test, so you never have to worry about an inconsistent testing environment.
↩️ Dependencies
If your tests depend on each other, or even worse, if they depend on themselves, they won't be able to run concurrently (which you will want to do if you are running tests in a CI/CD environment).
These dependencies don't have to be explicit, either. If you use the same account or environment to run two separate tests that require different initial conditions (a user permission flipped on vs. off, for example), those tests are implicitly dependent.
To avoid this, give each test its own distinct account, as sketched below. That way, you mitigate implicit dependencies and can more easily execute setup and teardown within the test.
If two tests are truly dependent, they shouldn't run concurrently at all. Instead, trigger the dependent test only after the upstream test passes.
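One way to sketch that separation: read a distinct account per test from environment variables, so no two tests ever share mutable state. The variable names below are illustrative:

```typescript
// Build a test account from environment variables. Each test suite
// gets its own prefix, so credentials never overlap between tests.
interface TestAccount {
  email: string;
  password: string;
}

function accountFromEnv(prefix: string): TestAccount {
  const email = process.env[`${prefix}_EMAIL`];
  const password = process.env[`${prefix}_PASSWORD`];
  if (!email || !password) {
    throw new Error(`Missing test credentials for ${prefix}`);
  }
  return { email, password };
}

// An account with the permission flipped on...
const adminAccount = accountFromEnv("ADMIN_TEST");
// ...and a separate account with it flipped off. Tests using these
// can now run concurrently without implicitly depending on each other.
const readOnlyAccount = accountFromEnv("READONLY_TEST");
```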
💡 With walrus.ai, you can specify separate account credentials for different tests by using environment variables, thereby eliminating implicit dependency.
Is there a one-size-fits-all solution to test flakiness?
Walrus.ai lets engineers ship flake-free end-to-end tests in minutes:
😎 Easy – tests can be written in plain English
🛠 No maintenance – flakes and refactors are handled entirely by walrus.ai
💯 Coverage for your whole app – test your hardest user experiences with ease, including APIs, third-party integrations, emails, etc.