David Sulc
Posted on October 17, 2023
In this first part of a two-part series, we'll explore how to avoid bad data and validate data at the boundary of a Phoenix application.
We'll use a few techniques to ensure that bad data doesn't degrade our application.
In part two, we'll specifically focus on leveraging Ecto under the hood to cast data.
Let's dive in!
Say No to Bad Data in Elixir
Bad data must be dealt with immediately, or it will spread throughout your system and degrade other data. Given how difficult and time-consuming it is to fix data issues, preventing bad data from entering your system is well worth the effort.
Here is what the official Elixir documentation has to say on the matter:
[...] when you don't validate the values at the boundary, the internals of your library are never quite sure which kind of values they are working with.
This advice does not only apply to libraries, but to any Elixir code. Every time you receive multiple options or work with external data, you should validate the data at the boundary and convert it to structured data. For example, if you provide a GenServer that can be started with multiple options, you want to validate those options when the server starts and rely only on structured data throughout the process life cycle. Similarly, if a database or a socket gives you a map of strings, after you receive the data, you should validate it and potentially convert it to a struct or a map of atoms.
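To make the GenServer case from the quote concrete, here is a minimal sketch. The RateLimiter module, its Config struct, and the limit and interval_ms options are all hypothetical names invented for illustration; only Keyword.validate/2 (available since Elixir 1.13) and the GenServer callbacks come from the standard library.

```elixir
defmodule RateLimiter do
  use GenServer

  # Hypothetical structured form of the options: once built, it is trusted.
  defmodule Config do
    @enforce_keys [:limit, :interval_ms]
    defstruct [:limit, :interval_ms]
  end

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # The boundary: options are validated once, when the server starts.
    case build_config(opts) do
      {:ok, %Config{} = config} -> {:ok, config}
      {:error, reason} -> {:stop, reason}
    end
  end

  # Rejects unknown keys, fills in defaults, then checks each value.
  def build_config(opts) do
    with {:ok, opts} <- Keyword.validate(opts, limit: 10, interval_ms: 1_000),
         {:ok, limit} <- positive_integer(:limit, opts[:limit]),
         {:ok, interval_ms} <- positive_integer(:interval_ms, opts[:interval_ms]) do
      {:ok, %Config{limit: limit, interval_ms: interval_ms}}
    else
      {:error, keys} when is_list(keys) -> {:error, {:unknown_options, keys}}
      {:error, _reason} = error -> error
    end
  end

  defp positive_integer(_name, value) when is_integer(value) and value > 0, do: {:ok, value}
  defp positive_integer(name, value), do: {:error, {:invalid_option, name, value}}
end
```

With this shape, every callback after init/1 can assume the state is a Config struct holding two positive integers, and a typo such as limt: 5 stops the server at startup instead of surfacing later as a confusing nil.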
But what is a boundary, and where does it live? The answer, dear reader, is mostly up to you. In a sense, boundaries exist wherever your functions accept unknown or unsafe data, to later be processed (once some assumptions about the data are made).
An Example of a Boundary in a Phoenix App
A typical example of a boundary in a Phoenix app is the boundary between the web layer and the business logic: parameters come in as JSON values and are parsed into maps with string keys. But a client is free to send any sort of data within the JSON payload: there could be incorrect keys, invalid values, extra key-value pairs, and so on.
Since you don't want to deal with this unwieldy data in every location within your app, the recommendation is to process this untrusted data in a single location and convert it into a well-known shape with validated content.
This is typically done in a Phoenix Context, where Ecto is leveraged under the hood to cast data (we'll explore this in our next post). Once this conversion has taken place, your domain logic can trust the content of the request payload without performing the same verification again.
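Before reaching for Ecto, the idea can be sketched in plain Elixir. The Signup struct and its email and age fields below are hypothetical, chosen only to show a string-keyed map being turned into a well-known shape in a single place.

```elixir
defmodule Signup do
  # Hypothetical "well-known shape" for a signup request.
  @enforce_keys [:email, :age]
  defstruct [:email, :age]

  # Accepts the string-keyed map produced by JSON parsing. Only the keys we
  # know about are read, so extra keys sent by the client are dropped.
  def from_params(%{"email" => email, "age" => age})
      when is_binary(email) and is_integer(age) and age >= 0 do
    {:ok, %__MODULE__{email: email, age: age}}
  end

  # Missing keys or wrongly typed values are rejected at the boundary.
  def from_params(_other), do: {:error, :invalid_params}
end
```

A payload such as %{"email" => "jane@example.com", "age" => 30, "admin" => true} yields a Signup struct without any admin field, while %{"email" => "jane@example.com", "age" => "30"} is rejected because the age is a string rather than an integer.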
More generally speaking, boundaries will typically crop up anywhere in your software where you accept some data of unknown quality and process it internally (typically over several iterations). Examples include:
- GenServers - where data is sent and processed into state changes.
- Web requests - which make their way to domain logic and database tables.
- Console input - which needs to be processed into arguments provided to a CLI application.
Additionally, boundaries can also show up when data has different meaning in different contexts (often called bounded contexts): a Customer might, for example, have a billing address, and a User might have a username, but both would refer to the same person in the world. In effect, boundaries are everywhere, which makes these techniques very handy when they're skillfully deployed.
Let's now check out a few techniques to prevent bad data from degrading our application.
Pattern Matching and Guards in Elixir
At the very local level, we can leverage pattern matching and guards to ensure we're always working with the data we expect.
This type of defensive code is beneficial anywhere you add it, be it within domain code (such as a Phoenix Context), the interface layer (in a Phoenix Controller, for example), or even deep within a free-standing function in a script.
Case Clauses
Case clauses can be very helpful in ensuring we explicitly list the data we agree to handle, particularly if there's no catch-all clause:
case Account.setup(...) do
  {:ok, %Account{suspended: true}} -> ...
  {:ok, %Account{initialized: true}} -> ...
  {:error, ...} -> ...
end
The above code clearly communicates its intent: the only expected outcomes of the setup function are an account (that is either initialized or suspended) or an error. Further, the case where an account is suspended "supersedes" an initialized account since the pattern match comes first.
Any other return value isn't expected and is therefore considered a bug: we explicitly don't handle those other cases, as we wouldn't know how to (if we did, they'd have their own case clause as shown above). In the face of unexpected data, it's safest to crash so that the OTP system can restart the process from a clean "known good" state rather than propagate dirty data throughout a system.
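Here is a small, self-contained sketch of that behavior. The Shipping module and its return values are made up for the demonstration; the point is only what happens when no clause matches.

```elixir
defmodule Shipping do
  # Hypothetical function: lists only the outcomes we know how to handle.
  def label(result) do
    case result do
      {:ok, :express} -> "ships today"
      {:ok, :standard} -> "ships this week"
      {:error, reason} -> "cannot ship: #{reason}"
      # Deliberately no catch-all clause: anything else is a bug.
    end
  end
end
```

Calling Shipping.label({:ok, :drone}) raises a CaseClauseError instead of quietly producing a label for a value nobody planned for. Inside a supervised process, that crash is what triggers a restart from a known good state.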
Pattern Matching in Function Heads
Pattern matching in function heads is a great way to not only ensure that data conforms to your expectations, but also to communicate your intent.
Contrast:
defp suspend(%Account{} = account)
With:
defp suspend(%{suspended: _} = account)
These are very different functions: in the first, it's clear that an Account is expected, so the account variable can be safely used in functions expecting an Account instance, whereas those assurances can't be made for the second version. In the second example, we're relying on the presence of the suspended attribute and a variable name to infer the context. That's playing with fire: there's nothing preventing someone from calling the function with a %User{} struct (assuming it also has a suspended attribute).
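The difference is easy to see in a runnable sketch. The Account and User definitions below are stand-ins with a single suspended field (the real structs would have more), and the Demo module exists only for this comparison.

```elixir
defmodule Account do
  defstruct suspended: false
end

defmodule User do
  defstruct suspended: false
end

defmodule Demo do
  # Strict: only an %Account{} gets past the function head.
  def suspend_strict(%Account{} = account), do: %{account | suspended: true}

  # Loose: any map or struct with a :suspended key is accepted.
  def suspend_loose(%{suspended: _} = account), do: %{account | suspended: true}
end
```

Demo.suspend_loose(%User{}) happily returns a suspended User, whereas Demo.suspend_strict(%User{}) raises a FunctionClauseError, which surfaces the mistaken call right where it happens.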
That said, while matching in function heads is helpful and convenient, make sure you're not muddying the waters for the sake of convenience. Let's use the following example to explore the subject, even though it's not directly related to validation:
def handle_call(:checkout, %{workers: [h | t], monitors: monitors}) do
  # ...
end

def handle_call(:checkout, %{workers: [], idle_overflow: [h | t]}) do
  # ...
end

def handle_call(:checkout, %{workers: [], idle_overflow: [],
      overflow: overflow, overflow_max: max, worker_sup: sup,
      spec: spec, monitors: monitors})
    when overflow < max do
  # ...
end

def handle_call(:checkout, %{workers: [], idle_overflow: [],
      overflow: overflow, overflow_max: max, waiting: waiting}) do
  # ...
end
A lot of matching is going on here, but how much is to differentiate the function heads, and how much is for convenience (i.e., binding for later use)?
By leaving only the matches that differentiate the heads and moving other bindings to the function bodies, we can improve readability:
def handle_call(:checkout, %{workers: [_ | _]} = state) do
  %{monitors: monitors} = state
  # ...
end

# [_ | _] matches any non-empty list, like the original [h | t]
def handle_call(:checkout, %{workers: [], idle_overflow: [_ | _]}) do
  # ...
end

def handle_call(:checkout, %{workers: [], idle_overflow: [],
      overflow: overflow, overflow_max: max} = state) when overflow < max do
  %{worker_sup: sup, spec: spec, monitors: monitors} = state
  # ...
end

def handle_call(:checkout, state) do
  %{workers: [], idle_overflow: [], overflow: overflow,
    overflow_max: max, waiting: waiting} = state
  # ...
end
After refactoring the above (non-validation related) example code, it's now more obvious that:
- The first function matches when there are workers in the list.
- The second function matches when there are no workers, but there are values in the idle_overflow list.
- The third function matches when there are no workers or idle_overflow, but the overflow hasn't reached its max level yet.
- The fourth function is for the remaining case.
You'll also note that we've gone somewhat overboard on matching here, such as matching on empty worker lists in the later clauses, even though the first clause already claims every non-empty list. This is a bit of defensive programming, as the code will stay functional if:
- The function clauses are reordered (e.g., during a refactoring), and their ordering importance is overlooked.
- Additional matching is added to the first clause, changing its match semantics.
Keeping each function clause responsible for declaring the context it expects makes the code more resilient to change.
Avoiding excessive matching in function heads is also the approach recommended by José Valim:
FWIW, I tend to use this rule: if the key is necessary when matching the pattern, keep it in the pattern, otherwise, move it to the body. So I end up with code like this:
def some_fun(%{field1: :value} = struct) do
  %{field2: value2, field3: value3} = struct
  ...
end
Guard Clauses
Guard clauses should be used whenever the definition of "what data is valid" can be tightened to make your code safer by design and to prevent processing unexpected/invalid data:
def annotate(%Account{} = account, annotation) when is_binary(annotation) do
  # ...
end

case account do
  %Account{payment_method: payment_method} when not is_nil(payment_method) -> ...
end
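The same approach also works in the clauses of a with expression. A cond has no patterns to attach guards to, but the same checks (is_binary/1, is_nil/1, and so on) can be used there as ordinary boolean conditions.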
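As a sketch of guards inside a with expression, consider parsing an amount from request parameters. The Transfer module and the "amount" key are hypothetical; Integer.parse/1 is the standard library function, returning either an {integer, rest} tuple or :error.

```elixir
defmodule Transfer do
  # Returns {:ok, cents} only for a string holding a positive integer.
  def parse_amount(params) do
    with %{"amount" => raw} when is_binary(raw) <- params,
         # {cents, ""} means the whole string was consumed by the parser.
         {cents, ""} when cents > 0 <- Integer.parse(raw) do
      {:ok, cents}
    else
      _ -> {:error, :invalid_amount}
    end
  end
end
```

Each clause narrows what counts as valid: a non-string amount fails the first guard, while trailing garbage ("12abc"), non-numeric input, or a non-positive number fails the second clause, and all of them end up as the same error tuple.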
Custom Guards
Custom guards give a validity rule a name, so a single check both communicates intent and keeps bad data out.
Guards are typically declared in a dedicated module, which is then imported wherever the guard is needed (a guard can also be used within the module that defines it, provided it's defined before its first use):
defmodule Account.Guards do
  defguard is_suspended(account) when is_struct(account, Account) and account.suspended
end
Now our code is even more expressive:
import Account.Guards

case fetch_account(...) do
  %Account{} = account when is_suspended(account) -> ...
  ...
end
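For a version that can be pasted into a script and run end to end, here is the same pattern on a different, made-up domain. The Subscription struct, the is_active guard, and the Billing module (with its flat price of 10 per seat) are all assumptions for the sake of the example.

```elixir
defmodule Subscription do
  defstruct status: :active, seats: 1
end

defmodule Subscription.Guards do
  # Usable in guard position anywhere this module is imported.
  defguard is_active(sub) when is_struct(sub, Subscription) and sub.status == :active
end

defmodule Billing do
  import Subscription.Guards

  # Only active subscriptions are charged (hypothetical price: 10 per seat).
  def charge(sub) when is_active(sub), do: {:ok, sub.seats * 10}

  # Any other subscription is a known, handled case.
  def charge(%Subscription{}), do: {:error, :inactive}
end
```

Billing.charge(%Subscription{seats: 3}) returns {:ok, 30}, a cancelled subscription returns {:error, :inactive}, and a plain map such as %{status: :active} raises a FunctionClauseError because it isn't a Subscription at all.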
And that's it!
Next Up
In this post, we looked at how to validate data at the boundary of an Elixir application. We also used a few pattern matching and guard clause techniques to reject bad data.
We've already covered a lot of ground: let's give it a rest for now. In the next article, we'll explore a few more options to ensure the data we work with remains squeaky clean, using Ecto.
See you then!