rahulbhave
Posted on November 30, 2024
Overview:
Testing generative AI applications presents unique challenges due to the multitude of ways a valid response can be phrased for a given input. Despite these challenges, implementing thorough tests is crucial for maintaining the stability and reliability of any application.
Objective
The goal is to develop robust testing strategies for generative AI applications. This involves writing tests that consider the variability in valid responses to given inputs, thereby ensuring the stability and consistency of the application.
In this blog, we will explore various techniques and best practices for writing unit tests that can effectively handle the dynamic nature of generative AI outputs. By the end, you'll have a good understanding of how to create tests that not only validate the correctness of responses but also enhance the overall robustness of your AI application.
Stay tuned for detailed insights and practical examples!
Environment:
The following examples use a Colab Enterprise notebook environment in the Google Cloud console. You will need to enable the Vertex AI API in the Google Cloud console. After enabling the API, create a new notebook in the Colab environment. In the new notebook, install the required packages and follow the steps shown below:
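If the Vertex AI API is not yet enabled for your project, you can enable it from the console or with a single gcloud command run from a notebook cell (an optional shortcut, shown here for convenience):
!gcloud services enable aiplatform.googleapis.com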
Notebook setup:
- Running the following cell will start the process of allocating a runtime for you. This may take a few minutes to fully initialize.
from IPython.display import clear_output
- Install the ipytest package, which lets you run pytest tests from notebook cells
!pip install --quiet ipytest
- Get the project id for your environment
project_id = !gcloud config get project
project_id = project_id[0]
- Import basic packages
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from vertexai.language_models import TextGenerationModel
import pytest
import ipytest
ipytest.autoconfig()
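Optionally, you can also confirm that the Vertex AI SDK is available and check its version before proceeding. Colab Enterprise typically preinstalls google-cloud-aiplatform, so treat this as an optional sanity check rather than a required step:
import google.cloud.aiplatform as aiplatform

# Print the installed SDK version; if this import fails, install the SDK with
# `!pip install --quiet google-cloud-aiplatform` and restart the runtime.
print("Vertex AI SDK version:", aiplatform.__version__)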
Unit Testing:
Let's see how to write test cases for the following points:
- Write a test to generate and evaluate content
- Write a test to ensure the model avoids off-topic content
- Write a test to ensure the model adheres to the provided context
Write a test to generate and evaluate content
Let's first create a prompt template and test it.
%%writefile prompt_template.txt
Respond to the user's query.
If the user asks about something other
than olympics 2024, reply with,
"Sorry, I don't know about that. Ask me something about sports instead."
Context: {context}
User Query: {query}
Response:
You can load this template in your tests using the following fixture:
@pytest.fixture
def prompt_template():
    with open("prompt_template.txt", "r") as f:
        return f.read()
Now we will write the test case. This will be an LLM-specific test. In the test function below, we provide specific context, representing the kind of context you would typically pull from a RAG retrieval system or another external lookup to enhance your model’s response.
We will use a known context and a query that you know can be answered from that context. Next, we will provide an evaluation prompt, clearly giving the evaluation model the expected answer.
Our primary gen_model is asked to answer the query given the context using the prompt_template you created earlier. Then, the query and the gen_model's response are passed to the eval_model within the evaluation_prompt to assess if it got the answer correct.
The eval_model can evaluate if the substance of the response is correct, even if the generative model has responded with full sentences that may not exactly match a pre-prepared reference answer. You’ll ask the eval_model to respond with a clear ‘yes’ or ‘no’ to assert that the test should pass.
Note that we will be using gemini-1.5-flash-001 as the gen_model and gemini-1.5-pro-001 as the eval_model; you can use other models depending on your use case or requirements.
Initialize the Vertex AI models and generation configuration (this is a one-time step; the same objects are reused by all the tests discussed in this blog):
vertexai.init(project=project_id, location="us-central1")
gen_config = GenerationConfig(
    temperature=0,
    top_p=0.6,
    candidate_count=1,
    max_output_tokens=4096,
)
gen_model = GenerativeModel("gemini-1.5-flash-001", generation_config=gen_config)

eval_config = {
    "temperature": 0,
    "max_output_tokens": 1024,
    "top_p": 0.6,
    "top_k": 40,
}
eval_model = GenerativeModel("gemini-1.5-pro-001", generation_config=eval_config)
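As a variation, you could also wrap these models in a session-scoped pytest fixture so they are constructed once and injected into every test, mirroring the prompt_template fixture above. This is only a sketch (the fixture name "models" is illustrative); the tests in this blog simply use the module-level objects:
@pytest.fixture(scope="session")
def models():
    # Build the generation and evaluation models once per test session and
    # hand them to any test that declares a "models" argument.
    gen = GenerativeModel("gemini-1.5-flash-001", generation_config=gen_config)
    eval_ = GenerativeModel("gemini-1.5-pro-001", generation_config=eval_config)
    return gen, eval_

# A test would then unpack them, e.g.:
#   def test_something(prompt_template, models):
#       gen_model, eval_model = models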
Define the test
def test_basic_response(prompt_template):
    context = ("The 2024 Summer Olympics will be held in Paris, "
               + "featuring a wide range of sports including athletics, "
               + "swimming, and gymnastics. The event is expected to "
               + "attract athletes from all over the world.")
    query = "Where will the 2024 Summer Olympics be held?"
    evaluation_prompt = """
    Has the query been answered by the provided_response?
    The 2024 Summer Olympics will be held in Paris.
    Respond with only one word: yes or no
    query: {query}
    provided_response: {provided_response}
    evaluation: """
    prompt = prompt_template.format(context=context, query=query)
    response = gen_model.generate_content(prompt)
    print(response.text)
    ep = evaluation_prompt.format(query=query, provided_response=response.text)
    evaluation = eval_model.generate_content(ep)
    assert evaluation.text.strip() == "yes"
After running the test as follows, you can see that it has passed.
ipytest.run('-rP')
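ipytest passes its arguments straight to pytest, so the usual flags work. For example, if the notebook already defines several tests, you can select a single one with the standard -k flag (an optional convenience, not required for this walkthrough):
# -k selects tests by name; -rP still shows the captured print output of passing tests.
ipytest.run('-rP', '-k', 'test_basic_response')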
Write a test to ensure the model avoids off-topic content
Let's see how we can check whether the model avoids off-topic content.
Paste this code into a cell and run it to confirm that the model could indeed answer a question like this if not restricted by your prompt template:
expo_prompt = "Which is capital of France?"
response = gen_model.generate_content(expo_prompt)
print(response.text)
This gives a direct answer to the question, confirming that only the prompt template restricts the model.
Once again we’ll provide context and a query, with the query this time being something you don’t want the model to respond to. Your evaluation prompt then checks whether the gen_model declines to answer and directs the user back to the topic mentioned in context:
def test_staying_on_topic(prompt_template):
    evaluation_prompt = """
    Does the response decline to discuss a non-sports related topic
    and encourage the user to ask about sports instead?
    Respond with only one word: yes or no
    query: {query}
    provided_response: {provided_response}
    evaluation: """
    context = ("The 2024 Summer Olympics will be held in Paris, "
               + "featuring a wide range of sports including athletics, "
               + "swimming, and gymnastics. The event is expected to "
               + "attract athletes from all over the world.")
    query = "Which is the capital of France?"
    prompt = prompt_template.format(context=context, query=query)
    response = gen_model.generate_content(prompt)
    print(response.text)
    ep = evaluation_prompt.format(query=query, provided_response=response.text)
    evaluation = eval_model.generate_content(ep)
    # The evaluation model should confirm ("yes") that the response declined
    # the off-topic query and pointed the user back to sports.
    assert evaluation.text.strip() == "yes"
Run the test again and you can see that it passes, confirming that the model declined to answer the question.
Write a test to ensure the model adheres to the provided context
Now, let's check whether the model adheres to the provided context. Paste this code into a cell and run it to confirm that the model could indeed answer a question like this if not restricted by your prompt template:
expo_prompt = "what is honey bee?"
response = gen_model.generate_content(expo_prompt)
print(response.text)
Now run the following test
def test_staying_off_topic_1(prompt_template):
    evaluation_prompt = """
    Does the response decline to discuss a non-Olympics 2024 related topic
    and encourage the user to ask about the Olympics instead?
    Respond with only one word: yes or no
    query: {query}
    provided_response: {provided_response}
    evaluation: """
    context = ("The 2024 Summer Olympics will be held in Paris, "
               + "featuring a wide range of sports including athletics, "
               + "swimming, and gymnastics. The event is expected to "
               + "attract athletes from all over the world.")
    query = "What is honey bee?"
    prompt = prompt_template.format(context=context, query=query)
    response = gen_model.generate_content(prompt)
    print(response.text)
    ep = evaluation_prompt.format(query=query, provided_response=response.text)
    evaluation = eval_model.generate_content(ep)
    assert evaluation.text.strip() == "yes"
This test fails with an assertion error.
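To see the model's actual responses for both passing and failing tests while debugging, you can ask pytest for a fuller summary report when re-running (an optional debugging aid):
# -rA adds every outcome (passed, failed, skipped, ...) to pytest's short summary,
# which makes it easier to inspect the printed model responses.
ipytest.run('-rA')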
Now modify the prompt template as shown below:
%%writefile prompt_template.txt
Respond to the user's query. You should only talk about the following things:
sports
sports techniques
sports-related events
sports-related news
athletic events
sports industry

If the user asks about something that is not related to sports, ask yourself again if it might be related to sports or the athletic industry. If you still believe the query is not related to sports or athletics, respond with: "Sorry, I don't know about that. Ask me something about sports instead." When answering, use only information included in the context.

Context: {context}
User Query: {query}
Response:
Update the test as follows (the prompt_template fixture re-reads prompt_template.txt each time it is used, so the new template is picked up automatically):
def test_staying_off_topic_2(prompt_template):
    evaluation_prompt = """
    Does the response decline to discuss a non-sports related topic
    and encourage the user to ask about sports instead?
    Respond with only one word: yes or no
    query: {query}
    provided_response: {provided_response}
    evaluation: """
    context = ("The 2024 Summer Olympics will be held in Paris, "
               + "featuring a wide range of sports including athletics, "
               + "swimming, and gymnastics. The event is expected to "
               + "attract athletes from all over the world.")
    query = "What is honey bee?"
    prompt = prompt_template.format(context=context, query=query)
    response = gen_model.generate_content(prompt)
    print(response.text)
    ep = evaluation_prompt.format(query=query, provided_response=response.text)
    evaluation = eval_model.generate_content(ep)
    assert evaluation.text.strip() == "yes"
After running the test, you can see that it now passes.
You can see the following two differences between the two tests:
- Evaluation Focus: The primary difference lies in the evaluation prompt. Test 1 focuses on declining non-Olympics related topics, while Test 2 focuses on declining non-sports related topics.
- Expected Behavior: Both tests expect the model to decline answering the query about honey bees, but the context of what the model should encourage the user to ask about differs (Olympics vs. sports).
These differences highlight how the evaluation criteria can be tailored to specific contexts, ensuring that the model stays on topic based on the given context.
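Once a pattern like this works for one query, you can reuse it across many off-topic queries with pytest's parametrize marker. The sketch below assumes the updated prompt template and the gen_model/eval_model objects defined earlier; the test name and query list are purely illustrative:
import pytest

@pytest.mark.parametrize("query", [
    "What is honey bee?",
    "How do I bake sourdough bread?",
    "Who wrote Hamlet?",
])
def test_declines_off_topic_queries(prompt_template, query):
    # Reuses gen_model and eval_model from the setup cells above.
    context = ("The 2024 Summer Olympics will be held in Paris, "
               + "featuring a wide range of sports including athletics, "
               + "swimming, and gymnastics.")
    evaluation_prompt = """
    Does the response decline to discuss a non-sports related topic
    and encourage the user to ask about sports instead?
    Respond with only one word: yes or no
    query: {query}
    provided_response: {provided_response}
    evaluation: """
    prompt = prompt_template.format(context=context, query=query)
    response = gen_model.generate_content(prompt)
    ep = evaluation_prompt.format(query=query, provided_response=response.text)
    evaluation = eval_model.generate_content(ep)
    # Accept "yes", "Yes" or "yes." from the evaluation model.
    assert evaluation.text.strip().lower().startswith("yes")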
Conclusion:
In this blog, I discussed a few approaches for including unit tests for generative AI applications in your SDLC. You can also try out different contexts, evaluation prompts, and queries according to your use cases, and experiment with the generation configuration, for example by adjusting the temperature:
gen_config = GenerationConfig(
    temperature=0,
    top_p=0.6,
    candidate_count=1,
    max_output_tokens=4096,
)
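One more robustness tweak you might consider: evaluation models occasionally reply with "Yes" or "Yes." rather than a bare lowercase "yes", so normalizing the verdict before asserting keeps tests from failing on formatting alone. This is an illustrative sketch, not part of the tests above:
def normalize_verdict(text: str) -> str:
    # Lower-case the judge's reply and strip surrounding whitespace and a
    # trailing period so that "Yes." and "yes" compare equal.
    return text.strip().rstrip(".").lower()

# Usage inside a test:
#   assert normalize_verdict(evaluation.text) == "yes"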
I hope you find this effort useful. I would be more than happy to connect with you and learn more about your experience with LLM testing and generative AI. Feel free to connect with me.