Samuel Pordeus
Posted on June 19, 2024
disclaimer: this post was entirely written by a human 🧬
Introduction
I've been working on a Gen AI product for the past year and quickly realized that LLMs are wild beasts that require constant vigilance. As someone used to leveraging test coverage to prevent regressions, the LLM/AI world can be pretty frustrating, since model outputs are nondeterministic.
When you deploy an LLM solution to production, you get an amorphous mass of statistical data that produces ever-changing outputs. And it can get even more chaotic, for various reasons:
- prompts might need tweaking as you start getting more customers.
- a new and shiny model was released! ✨ but after mindlessly upgrading your model once, you aren't so confident it won't break things this time
- models are changed by vendors without sufficient notice, even if their documentation says they're stable
One way to mitigate these issues is to start evaluating the output from your LLM product before you have your first paying customer 💰
The Approach
One of the most common test approaches for Python is to use OpenAI Evals, but ideally, we should integrate our tests closer to our Elixir code, where our Business Logic currently lives, so we can seamlessly test prompt changes.
So let's do it inspired by Evals but with Elixir 🧪
LLM outputs can either be structured, with a well-defined schema and predictable fields:
{
  "name": "Muse T-Shirt",
  "category": "clothes"
}
Or unstructured, with high variance, undefined format, and usually free text:
Of course. 'Muse T-Shirt' belongs to the 'clothes' Category.
The first type of test is quite easy to handle. You send a request, get a response, validate that the schema is correct, and it's done ✅. OpenAI Evals handles these with something called Basic Evals.
The unstructured type is where it gets tricky. Although you should try to use JSON schemas for everything (that might deserve a separate post), responses with significant variance can be inevitable:
{
  "product_availability": false,
  "answer": "Unfortunately I don't have a Muse T-Shirt available in stock now. Would you be interested in a BTS one?"
}
That's where we use Model-graded Evals: a two-step process where one model produces an output and we use another model to validate it.
Quite chaotic, right? But sometimes, when dealing with this crazy AI world, you need to fight chaos with chaos ⚔️
Implementation
I've been working with Elixir for the past 5 years, so I'm quite fond of the idea of the LLM tests looking like a regular mix test suite run:
mix test --only llm_test
So let's see what a test will look like before we dive into its internals:
# test/llm_tests/llm_eval_demo_test.exs
defmodule LLMEvalDemoTest do
  use ExUnit.Case

  alias Test.LLMEvaluation

  @tag :llm_test
  test "robot returns a muse t-shirt" do
    conversation = [
      %{
        role: :system,
        content: "The best band t-shirt in the world is a Muse one!"
      },
      %{
        role: :user,
        content: "tell me a cool band t-shirt to buy, my man"
      }
    ]

    assert LLMEvaluation.basic_eval(conversation)
  end
end
Pretty Elixir-ish, right? So let's start configuring it.
Configuration
Add exclude: :llm_test to your ExUnit.start/1 call in test/test_helper.exs:
ExUnit.start(exclude: :llm_test)
This way, we leverage tags to ensure the LLM tests, where real API calls are made, don't conflict with mocks & stubs.
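If you run these often, you can also add a Mix alias to mix.exs. This is purely optional and the test.llm name is just a suggestion of mine:

# in mix.exs, add `aliases: aliases()` to project/0, then:
defp aliases do
  ["test.llm": ["test --only llm_test"]]
end

Now mix test.llm runs only the LLM evals, while a plain mix test keeps excluding them.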
Implementing the Test.LLMEvaluation helper module: Basic Eval
basic_eval/1 receives a conversation/chat as input, however you've implemented it. It then sends a request to your LLM Chat Completion API provider and parses the response content:
defmodule Test.LLMEvaluation do
  @moduledoc false

  @default_model "gpt-4"

  def basic_eval(conversation) do
    # Enforce JSON output so we can pattern-match the decoded response content.
    params = %{"model" => fetch_model(), "response_format" => %{"type" => "json_object"}}

    Client.request(conversation, params)
  end

  # Not shown in the original snippet; a minimal version just returns the default model.
  defp fetch_model, do: @default_model
end
It's worth enforcing JSON output so we can beautifully pattern-match the decoded response content:
assert %{"category" => "clothes"} = LLMEvaluation.basic_eval(conversation)
Using OpenAI's Chat Completions API, you can achieve that with the response_format param.
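I haven't shown the Client module here because it's just whatever thin wrapper you already have around your provider. If you don't have one yet, here's a minimal sketch of what it could look like against OpenAI's Chat Completions API. It assumes Req and Jason as dependencies, an OPENAI_API_KEY environment variable, and that it's aliased as Client inside Test.LLMEvaluation:

defmodule Test.LLMEvaluation.Client do
  @moduledoc false

  # Minimal, hypothetical client for OpenAI's Chat Completions API using Req.
  @endpoint "https://api.openai.com/v1/chat/completions"

  def request(messages, params) do
    body = Map.merge(params, %{"messages" => messages})

    @endpoint
    |> Req.post!(json: body, auth: {:bearer, System.fetch_env!("OPENAI_API_KEY")})
    |> Map.fetch!(:body)
    |> get_in(["choices", Access.at(0), "message", "content"])
    |> Jason.decode!()
  end
end

Since response_format forces a JSON object, decoding the message content with Jason.decode!/1 gives us exactly the map we pattern-match in the tests.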
Implementing the Test.LLMEvaluation helper module: Model-graded Eval
For the model-graded eval, we add an intermediary step: an extra call to OpenAI that verifies whether the assertion holds.
For that, we need to craft an Assertion Prompt:
You are an assertion agent that returns 'true' or 'false'
depending on whether the Generated Message complies with the assertion.
Generated Message: #{llm_output}
Assert that: '#{assertion}'
Return the following JSON format as a response:
{
  "assertion": true,
  "reason": "Explanation on why the assertion failed or not"
}
I bet you can write something better than that 😄
Next, we extend Test.LLMEvaluation with a model_graded_eval/2 function that chains the Chat Completion call with this new Assertion Prompt:
defmodule Test.LLMEvaluation do
  @moduledoc false

  @default_model "gpt-4"
  @assertion_model "gpt-4o"

  # basic_eval/1 and fetch_model/0 from the previous snippet are omitted for brevity.
  def model_graded_eval(conversation, assertion) do
    params = %{"model" => fetch_model(), "response_format" => %{"type" => "json_object"}}

    conversation
    |> Client.request(params)
    |> assertion(assertion)
  end

  defp assertion(llm_output, assertion) do
    # The Assertion Prompt from above; llm_output is already a decoded map, so re-encode it.
    prompt = """
    You are an assertion agent that returns 'true' or 'false'
    depending on whether the Generated Message complies with the assertion.

    Generated Message: #{Jason.encode!(llm_output)}

    Assert that: '#{assertion}'

    Return the following JSON format as a response:
    {"assertion": true, "reason": "Explanation on why the assertion failed or not"}
    """

    messages = [%{content: prompt, role: "system"}]
    params = %{"model" => @assertion_model, "response_format" => %{"type" => "json_object"}}

    messages
    |> Client.request(params)
    |> Map.put("llm_output", llm_output)
  end
end
It's important to return the llm_output so that, if the assertion fails, you can check what the first model produced.
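Just for illustration, a failing run might come back looking something like this (the values here are made up):

%{
  "assertion" => false,
  "reason" => "The Generated Message recommends a BTS t-shirt, not a Muse one.",
  "llm_output" => %{"answer" => "You should grab a BTS t-shirt!"}
}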
And the test looks like this:
@tag :llm_test
test "robot returns a muse t-shirt" do
  conversation = [
    %{
      role: :system,
      content: "The best band t-shirt in the world is a Muse one!"
    },
    %{
      role: :user,
      content: "tell me a cool band t-shirt to buy, my man"
    }
  ]

  assertion = "Assistant response is a Muse T-Shirt"

  assert %{"assertion" => true} = LLMEvaluation.model_graded_eval(conversation, assertion)
end
Pretty Elixir-ish too, right!? 🥹
That's it, folks! Ideally, you can now run mix test --only llm_test for every prompt or model change you make, to ensure your beloved customers don't experience hallucinations while talking to your robots 🤖
I'm planning to write more about using Elixir & LLMs in production. Hopefully with less code so you all don't get bored.
Don't hesitate to send me a message on LinkedIn or to my email: samuelspordeus@gmail.com