Testing Large Language Models (LLMs) with OpenAI and the SQuAD Dataset

#### Introduction

In this tutorial, we'll test the performance of an OpenAI large language model (LLM) using the Stanford Question Answering Dataset (SQuAD). We'll use Python, pytest for running the tests, the openai package for accessing the OpenAI API, textblob for sentiment analysis, and fuzzywuzzy for text similarity.

Step 1: Environment Setup

Install Required Packages

Ensure you have pipenv installed for managing your virtual environment. If not, you can install it using:

   pip install pipenv

Create a new project directory and navigate to it:

   mkdir openai_squad_tests
   cd openai_squad_tests

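Install the required packages. The code in this tutorial uses the legacy openai.ChatCompletion interface, which was removed in version 1.0 of the openai package, so pin an earlier release:

   pipenv install textblob pytest "openai<1" fuzzywuzzy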

Download the SQuAD Dataset

Download the SQuAD v1.1 dataset (a single JSON file) and save it in your project directory as squad_dataset.json.
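
Before writing any tests, it can help to confirm that the file has the shape the test code expects. The sketch below is a minimal check, not part of the test suite: summarize_squad is a hypothetical helper, and the sample dictionary is hand-made to mirror the nesting of a SQuAD v1.1 file (data, paragraphs, qas, answers).

```python
import json

def summarize_squad(squad):
    """Count articles, paragraphs, and questions in a SQuAD-format dict."""
    articles = squad["data"]
    paragraphs = [p for a in articles for p in a["paragraphs"]]
    questions = [q for p in paragraphs for q in p["qas"]]
    return {
        "version": squad.get("version"),
        "articles": len(articles),
        "paragraphs": len(paragraphs),
        "questions": len(questions),
    }

# A tiny hand-made sample in the same shape as the real file (illustrative only).
sample = {
    "version": "1.1",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Paris is the capital of France.",
            "qas": [{
                "id": "q1",
                "question": "What is the capital of France?",
                "answers": [{"text": "Paris", "answer_start": 0}],
            }],
        }],
    }],
}

summary = summarize_squad(sample)
print(summary)

# To inspect the real file instead:
# with open("squad_dataset.json", encoding="utf-8") as f:
#     print(summarize_squad(json.load(f)))
```

If the counts for the real file look wrong, or a KeyError is raised, the download is probably not a SQuAD v1.1 JSON file.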

Step 2: Project Structure

Your project directory should look like this:

openai_squad_tests/
├── Pipfile
├── Pipfile.lock
├── squad_dataset.json
└── test_openai.py

Step 3: Writing the Code

Create a file named test_openai.py and add the following code:

import pytest
import openai
import json
import random
import os
from textblob import TextBlob
from fuzzywuzzy import fuzz

# Load the SQuAD dataset
with open('squad_dataset.json', 'r', encoding='utf-8') as file:
    squad_data = json.load(file)

# Read the OpenAI API key from the environment variable
api_key = os.getenv("OPENAI_API_KEY")

if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable not set")

# Initialize the OpenAI client
openai.api_key = api_key

# Function to call the OpenAI API
def get_openai_response(prompt):
    try:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100
        )
        return response.choices[0]['message']['content'].strip()
    except Exception as e:
        # Chain the original exception so the root cause stays in the traceback
        raise ValueError(f"API request failed: {e}") from e

# Function to get sentiment polarity
def get_sentiment_polarity(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity

# Function to calculate similarity between two strings using fuzzy matching
def calculate_similarity(a, b):
    return fuzz.ratio(a, b)

# Extract questions from the SQuAD dataset
def extract_questions(squad_data):
    questions = []
    for item in squad_data['data']:
        for paragraph in item['paragraphs']:
            context = paragraph['context']
            for qas in paragraph['qas']:
                if qas['answers']:  # Check if there are any answers
                    question = qas['question']
                    expected_answer = qas['answers'][0]['text']
                    questions.append((context, question, expected_answer))
    return questions

# Extracted questions
questions = extract_questions(squad_data)

# Parameterize the test with a subset of random questions
num_tests = 100  # Change this to control the number of tests
random.seed(42)  # For reproducibility
# Cap the sample size so random.sample does not fail on a small dataset
sampled_questions = random.sample(questions, min(num_tests, len(questions)))

@pytest.mark.parametrize("context, question, expected_answer", sampled_questions)
def test_openai_answer(context, question, expected_answer):
    # Prepare the prompt for the OpenAI API
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"

    # Get the response from the OpenAI API
    try:
        generated_answer = get_openai_response(prompt)
    except ValueError as e:
        pytest.fail(f"API request failed: {e}")

    # Get sentiment polarity of the expected and generated answers
    expected_sentiment = get_sentiment_polarity(expected_answer)
    generated_sentiment = get_sentiment_polarity(generated_answer)

    # Calculate similarity between the expected and generated answers
    similarity = calculate_similarity(expected_answer, generated_answer)

    # Compare sentiments and similarity
    sentiment_match = abs(expected_sentiment - generated_sentiment) < 0.1  # Adjust threshold as needed
    similarity_threshold = 70  # Adjust threshold as needed

    assert sentiment_match or similarity > similarity_threshold, (
        f"Expected: {expected_answer} (Sentiment: {expected_sentiment}), "
        f"Got: {generated_answer} (Sentiment: {generated_sentiment}), "
        f"Similarity: {similarity}"
    )

Step 4: Running the Tests

Set the OPENAI_API_KEY environment variable:

   export OPENAI_API_KEY="your_openai_api_key"

Activate the Pipenv Shell:

   pipenv shell

Run the Tests:

   pytest test_openai.py

Why This is a Valuable Approach

1. Standardized Benchmarking

Benefit: Using a standardized dataset like SQuAD allows for consistent and repeatable benchmarking of model performance.
Explanation: SQuAD provides a large set of questions and answers derived from Wikipedia articles, making it an excellent resource for testing the question-answering capabilities of LLMs. By using this dataset, we can objectively evaluate how well the model understands and processes natural language in a controlled environment.

2. Diverse Test Cases

Benefit: The SQuAD dataset includes a wide variety of topics and question types.
Explanation: This diversity ensures that the model is tested across different contexts, subjects, and linguistic structures. It helps identify strengths and weaknesses in the model’s ability to handle different types of questions, ranging from factual queries to more complex inferential ones.
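
A plain random sample can over-represent articles that have many questions. One way to keep the sample spread across topics is to draw a fixed number of questions from each article. The sketch below is an assumption-laden illustration: sample_per_article and make_article are hypothetical helpers, and the toy dictionary only imitates the SQuAD layout.

```python
import random

def sample_per_article(squad, per_article, seed=42):
    """Pick up to `per_article` answered questions from every article."""
    rng = random.Random(seed)  # local generator, so results are repeatable
    picked = []
    for article in squad["data"]:
        pool = [
            (article["title"], qa["question"], qa["answers"][0]["text"])
            for paragraph in article["paragraphs"]
            for qa in paragraph["qas"]
            if qa["answers"]
        ]
        # Never ask for more questions than the article actually has.
        picked.extend(rng.sample(pool, min(per_article, len(pool))))
    return picked

def make_article(title, n):
    """Build a toy article with n questions (illustrative data only)."""
    qas = [
        {"question": f"{title} question {i}?", "answers": [{"text": f"answer {i}"}]}
        for i in range(n)
    ]
    return {"title": title, "paragraphs": [{"context": f"Text about {title}.", "qas": qas}]}

toy = {"data": [make_article("Physics", 5), make_article("History", 1)]}
picked = sample_per_article(toy, per_article=2)
titles = [title for title, _, _ in picked]
print(titles)
```

Here the larger article contributes two questions and the smaller one contributes its single question, so neither topic is dropped from the sample.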

3. Quantitative Metrics

Benefit: Provides quantitative measures of model performance.
Explanation: By using metrics such as sentiment polarity and fuzzy text similarity (a score from 0 to 100), we can put a number on how close a generated answer is to the expected one. These are proxy measures rather than a direct accuracy score, but they still make it possible to compare different versions of the model or to evaluate the impact of fine-tuning and other modifications.
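
Short factual answers often carry little sentiment, so a polarity comparison may pass almost regardless of correctness. As a complementary measure, SQuAD's own evaluation reports exact match and token-level F1 on normalized answers. The sketch below is modeled on that idea using only the standard library; it is not the official script, and the function names are chosen here for illustration.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """True when both answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(truth)

def token_f1(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(token_f1("Paris, France", "Paris"), 3))
print(token_f1("London", "Paris"))
```

Token F1 gives partial credit when the model wraps the right span in extra words, which is common for chat models that answer in full sentences.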

4. Error Analysis

Benefit: Facilitates detailed error analysis.
Explanation: By examining cases where the model’s responses do not match the expected answers, we can gain insights into specific areas where the model may be underperforming. This can guide further training and improvements in the model.
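
A simple way to start that analysis is to sort mismatches into rough buckets and count them. The categories below are hypothetical and deliberately coarse, and the records are made-up examples rather than real model output.

```python
from collections import Counter

def classify_failure(expected, generated):
    """Assign a mismatched answer to a rough, illustrative bucket."""
    exp = expected.lower().strip()
    gen = generated.lower().strip()
    if not gen:
        return "empty"      # the model returned nothing
    if exp in gen:
        return "verbose"    # correct span buried in a longer reply
    if gen in exp:
        return "partial"    # only part of the expected span was returned
    return "wrong"          # no containment either way

# Made-up (expected, generated) pairs for illustration only.
records = [
    ("Paris", "The capital of France is Paris."),
    ("Denver Broncos", "Broncos"),
    ("1066", "1067"),
    ("blue", ""),
]

buckets = Counter(classify_failure(exp, gen) for exp, gen in records)
print(buckets)
```

A large "verbose" bucket would suggest tightening the prompt or the comparison, while a large "wrong" bucket points at the model itself.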

5. Reproducibility

Benefit: Ensures that testing results are reproducible.
Explanation: By documenting the testing process, fixing the random seed used to sample questions, and using a well-defined dataset, other researchers and developers can reproduce the tests and validate the results. This transparency is critical for scientific research and development.
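
The sampling step is the part of the test file that most directly affects reproducibility. The sketch below shows one way to make it repeatable without touching the global random state; sample_questions is a hypothetical helper and the question tuples are placeholders.

```python
import random

def sample_questions(questions, num_tests, seed=42):
    """Return the same subset on every call for a given seed and input list."""
    rng = random.Random(seed)            # local generator; global state is untouched
    k = min(num_tests, len(questions))   # avoid ValueError on small datasets
    return rng.sample(questions, k)

# Placeholder (context, question, expected_answer) tuples.
questions = [(f"context {i}", f"question {i}", f"answer {i}") for i in range(20)]

first = sample_questions(questions, 5)
second = sample_questions(questions, 5)
print(first == second)
print(len(sample_questions(questions, 100)))
```

This only fixes which questions are asked. The model's replies can still vary between runs; passing temperature=0 in the API call may reduce that variation, though it is not guaranteed to remove it.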

6. Incremental Improvements

Benefit: Helps track incremental improvements over time.
Explanation: Regularly testing the model with a standardized dataset allows for tracking its performance over time. This is useful for measuring the impact of updates, new training data, or changes in the model architecture.
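
Tracking over time needs some record of each run. One lightweight option, sketched below, is to append a row per run to a CSV log. The column layout and the append_run helper are assumptions made for this example, and the numbers are invented placeholders, not measured results.

```python
import csv
import io
from datetime import date

def append_run(handle, run_date, model, passed, total):
    """Write one summary row: date, model name, passed, total, pass rate."""
    rate = passed / total if total else 0.0
    csv.writer(handle).writerow(
        [run_date.isoformat(), model, passed, total, f"{rate:.3f}"]
    )

# An in-memory buffer stands in for a real log file opened in append mode.
log = io.StringIO()
append_run(log, date(2024, 1, 1), "gpt-3.5-turbo", 3, 4)  # placeholder numbers
append_run(log, date(2024, 2, 1), "gpt-3.5-turbo", 4, 4)  # placeholder numbers

rows = list(csv.reader(io.StringIO(log.getvalue())))
for row in rows:
    print(row)
```

With a real file, open it with mode "a" and newline="" so that rows accumulate across runs.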

7. Model Validation

Benefit: Validates model readiness for deployment.
Explanation: Before deploying a model into a production environment, it’s essential to validate its performance rigorously. Testing with the SQuAD dataset helps ensure that the model meets the required standards and is reliable for real-world applications.

Final Thoughts

Testing LLMs using a standardized dataset like SQuAD is a valuable approach for ensuring the model's robustness, accuracy, and reliability. By incorporating quantitative metrics, error analysis, and reproducibility, this approach not only validates the model’s performance but also provides insights for continuous improvement. It is an essential step for any serious development and deployment of AI models, ensuring they meet the high standards required for real-world applications.
