Samuel Pordeus
Posted on June 19, 2024
disclaimer: this post was entirely written by a human 🧬
Introduction
I've been working on a Gen AI product for the past year and quickly realized that LLMs are wild beasts that require constant vigilance. As someone used to leveraging test coverage to prevent regressions, the LLM/AI world can be pretty frustrating, since model outputs are nondeterministic.
When you deploy an LLM solution to production, you get an amorphous mass of statistical data that produces ever-changing outputs. And it can get even more chaotic, for various reasons:
- prompts might need tweaking as you start getting more customers.
- a new and shiny model was released! ✨ but after mindlessly upgrading your model once, you aren't so confident it won't break things this time
- models are changed by vendors without sufficient notice, even if their documentation says they're stable
One way to mitigate these issues is to start evaluating the output from your LLM product before you have your first paying customer 💰
The Approach
One of the most common test approaches for Python is to use OpenAI Evals, but ideally, we should integrate our tests closer to our Elixir code, where our Business Logic currently lives, so we can seamlessly test prompt changes.
So let's do it inspired by Evals but with Elixir 🧪
LLM outputs can either be structured, with a well-defined schema and predictable fields:
{
  "name": "Muse T-Shirt",
  "category": "clothes"
}
Or unstructured, with high variance, undefined format, and usually free text:
Of course. 'Muse T-Shirt' belongs to the 'clothes' Category.
The first type of test is quite easy to handle. You send a request, get a response, validate that the schema is correct, and it's done ✅. OpenAI Evals handles these with something called Basic Evals.
The unstructured type is where it gets tricky. Although you should try to use JSON schemas for everything (that might deserve a separate post), responses with significant variance can be inevitable:
{
  "product_availability": false,
  "answer": "Unfortunately I don't have a Muse T-Shirt available in stock now. Would you be interested in a BTS one?"
}
That's where we use Model-graded Evals: a two-step process where one model produces an output and we use another model to validate it.
Quite chaotic, right? But sometimes, when dealing with this crazy AI world, you need to fight chaos with chaos ⚔️
Implementation
I've been working with Elixir for the past 5 years, so I'm quite fond of the idea of the LLM tests looking like a regular mix test suite run:
mix test --only llm_test
So let's see what a test will look like before we dive into its internals:
# test/llm_tests/llm_eval_demo_test.exs
defmodule LLMEvalDemoTest do
  use ExUnit.Case

  alias Test.LLMEvaluation

  @tag :llm_test
  test "robot returns a muse t-shirt" do
    conversation = [
      %{
        role: :system,
        content: "The best band t-shirt in the world is a Muse one!"
      },
      %{
        role: :user,
        content: "tell me a cool band t-shirt to buy, my man"
      }
    ]

    assert LLMEvaluation.basic_eval(conversation)
  end
end
Pretty Elixir-ish, right? So let's start configuring it.
Configuration
Add exclude: :llm_test to your ExUnit.start/1 call in test/test_helper.exs:
ExUnit.start(exclude: :llm_test)
This way, we leverage tags to ensure the LLM tests, where real API calls are made, don't conflict with mocks & stubs.
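If you run these often, you can also add a Mix alias to mix.exs. This is purely optional and the test.llm name is just a suggestion of mine:

# in mix.exs, add `aliases: aliases()` to project/0, then:
defp aliases do
  ["test.llm": ["test --only llm_test"]]
end

Now mix test.llm runs only the LLM evals, while a plain mix test keeps excluding them.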
Implementing the Test.LLMEvaluation helper module: Basic Eval
basic_eval/1 receives a conversation/chat as input, however you've implemented it. It then sends a request to your LLM Chat Completion API provider and parses the response content:
defmodule Test.LLMEvaluation do
  @moduledoc false

  @default_model "gpt-4"

  def basic_eval(conversation) do
    # Enforce JSON output so we can pattern-match the decoded response content.
    params = %{"model" => fetch_model(), "response_format" => %{"type" => "json_object"}}

    Client.request(conversation, params)
  end

  # Not shown in the original snippet; a minimal version just returns the default model.
  defp fetch_model, do: @default_model
end
It's worth enforcing JSON output so we can beautifully pattern-match the decoded response content:
assert %{"category" => "clothes"} = LLMEvaluation.basic_eval(conversation)
Using OpenAI's Chat Completions API, you can achieve that with the response_format param.
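I haven't shown the Client module here because it's just whatever thin wrapper you already have around your provider. If you don't have one yet, here's a minimal sketch of what it could look like against OpenAI's Chat Completions API. It assumes Req and Jason as dependencies, an OPENAI_API_KEY environment variable, and that it's aliased as Client inside Test.LLMEvaluation:

defmodule Test.LLMEvaluation.Client do
  @moduledoc false

  # Minimal, hypothetical client for OpenAI's Chat Completions API using Req.
  @endpoint "https://api.openai.com/v1/chat/completions"

  def request(messages, params) do
    body = Map.merge(params, %{"messages" => messages})

    @endpoint
    |> Req.post!(json: body, auth: {:bearer, System.fetch_env!("OPENAI_API_KEY")})
    |> Map.fetch!(:body)
    |> get_in(["choices", Access.at(0), "message", "content"])
    |> Jason.decode!()
  end
end

Since response_format forces a JSON object, decoding the message content with Jason.decode!/1 gives us exactly the map we pattern-match in the tests.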
Implementing the Test.LLMEvaluation helper module: Model-graded Eval
For the model-graded eval, we add an intermediary step: an extra call to OpenAI that verifies whether the assertion holds.
For that, we need to craft an Assertion Prompt:
You are an assertion agent that returns 'true' or 'false'
depending on whether the Generated Message complies with the assertion.
Generated Message: #{llm_output}
Assert that: '#{assertion}'
Return the following JSON format as a response:
{
  "assertion": true,
  "reason": "Explanation on why the assertion failed or not"
}
I bet you can write something better than that 😄
Next, we extend Test.LLMEvaluation with a model_graded_eval/2 function that chains the Chat Completion call with this new Assertion Prompt:
defmodule Test.LLMEvaluation do
  @moduledoc false

  @default_model "gpt-4"
  @assertion_model "gpt-4o"

  # basic_eval/1 and fetch_model/0 from the previous snippet are omitted for brevity.
  def model_graded_eval(conversation, assertion) do
    params = %{"model" => fetch_model(), "response_format" => %{"type" => "json_object"}}

    conversation
    |> Client.request(params)
    |> assertion(assertion)
  end

  defp assertion(llm_output, assertion) do
    # The Assertion Prompt from above; llm_output is already a decoded map, so re-encode it.
    prompt = """
    You are an assertion agent that returns 'true' or 'false'
    depending on whether the Generated Message complies with the assertion.

    Generated Message: #{Jason.encode!(llm_output)}

    Assert that: '#{assertion}'

    Return the following JSON format as a response:
    {"assertion": true, "reason": "Explanation on why the assertion failed or not"}
    """

    messages = [%{content: prompt, role: "system"}]
    params = %{"model" => @assertion_model, "response_format" => %{"type" => "json_object"}}

    messages
    |> Client.request(params)
    |> Map.put("llm_output", llm_output)
  end
end
It's important to return the llm_output so that, if the assertion fails, you can check what the first model produced.
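Just for illustration, a failing run might come back looking something like this (the values here are made up):

%{
  "assertion" => false,
  "reason" => "The Generated Message recommends a BTS t-shirt, not a Muse one.",
  "llm_output" => %{"answer" => "You should grab a BTS t-shirt!"}
}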
And the test looks like this:
@tag :llm_test
test "robot returns a muse t-shirt" do
  conversation = [
    %{
      role: :system,
      content: "The best band t-shirt in the world is a Muse one!"
    },
    %{
      role: :user,
      content: "tell me a cool band t-shirt to buy, my man"
    }
  ]

  assertion = "Assistant response is a Muse T-Shirt"

  assert %{"assertion" => true} = LLMEvaluation.model_graded_eval(conversation, assertion)
end
Pretty Elixir-ish too, right!? 🥹
That's it, folks! Ideally, you can now run mix test --only llm_test for every prompt or model change you make, to ensure your beloved customers don't experience hallucinations while talking to your robots 🤖
I'm planning to write more about using Elixir & LLMs in production. Hopefully with less code so you all don't get bored.
Don't hesitate to send me a message on LinkedIn or to my email: samuelspordeus@gmail.com