Stress Testing VLMs: Multi QnA and Description Tasks
Aryan Kargwal
Posted on October 14, 2024
Video Link: https://youtu.be/pwW9zwVQ4L8
Repository Link: https://github.com/aryankargwal/genai-tutorials/tree/main
In the fast-evolving world of AI, Vision-Language Models (VLMs) have garnered attention for their ability to understand and generate responses based on visual and textual inputs. However, testing these models in a structured environment and comparing their performance across scenarios remains a challenging task. This blog walks you through an experiment in which we used a custom-built Streamlit web application to stress test multiple VLMs, namely Llama 3.2, Qwen 2 VL, and GPT 4o, on a range of tasks. We analyzed their response tokens, latency, and accuracy in generating answers to complex, multimodal questions.
However, please note that most of the findings are still withheld, as this application is part of my process of building a VLM benchmark, the first of which you can check out on Hugging Face as SynCap-Flickr8K!
Why Compare Vision-Language Models?
The ability to compare the performance of different VLMs across domains is critical for:
- Understanding model efficiency (tokens used, latency).
- Measuring how well models can generate coherent responses based on image inputs and textual prompts.
- Creating benchmark datasets to further improve and fine-tune VLMs.
To achieve this, we built a VLM Stress Testing Web App in Python, utilizing Streamlit for a user-friendly interface. This allowed us to upload images, input textual prompts, and obtain model-generated responses in real time. The app also calculated and logged critical metrics such as the number of tokens used in responses and latency.
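To give a feel for how such an interface can be wired up, here is a minimal Streamlit sketch. The widget labels and model IDs are illustrative placeholders rather than the app's exact values, and query_model stands in for the API helper covered in the next section:

```python
import base64
import time

import streamlit as st

def query_model(b64_image, question, model_id):
    """Placeholder for the API helper shown in the Project Setup section below."""
    raise NotImplementedError

st.title("VLM Stress Test")

# Model IDs here are illustrative, not the exact identifiers used by the app
model_id = st.selectbox("Model", ["llama-3.2-vision", "qwen-2-vl", "gpt-4o"])
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
question = st.text_input("Ask a question about the image")

if uploaded and question:
    # The API expects the image as a Base64-encoded string
    b64_image = base64.b64encode(uploaded.read()).decode("utf-8")

    start = time.time()
    result = query_model(b64_image, question, model_id)
    latency = time.time() - start

    st.json(result)  # the real app extracts the answer, token count, and latency
    st.caption(f"Latency: {latency:.2f} s")
```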
Project Setup
Our main application file, app.py, uses Streamlit as the frontend and is integrated with API requests to call different VLM models. Each query to a model includes:
- Image: Encoded in Base64 format.
- Question: A text input by the user.
- Model ID: We allow users to choose between multiple VLMs.
The API response includes:
- Answer: The model-generated text.
- Latency: Time taken for the model to generate the answer.
- Token Count: Number of tokens used by the model in generating the response.
Below is the code structure for querying the models:
import requests

# `url` and `headers` (the inference endpoint and authentication headers)
# are defined elsewhere in app.py.

def query_model(base64_image, question, model_id, max_tokens=300,
                temperature=0.9, stream=False, frequency_penalty=0.2):
    # Attach the Base64-encoded image as a data URL in the message content
    image_content = {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
        }
    }
    prompt = question

    # Build the chat-style payload: one user message carrying both the
    # text prompt and the image
    data = {
        "model": model_id,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    image_content
                ]
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
        "frequency_penalty": frequency_penalty
    }

    # Send the request and return the parsed JSON response
    response = requests.post(url, headers=headers, json=data)
    return response.json()
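To close the loop, here is a hedged sketch of how the answer and token count might be pulled out of the returned JSON. The exact schema depends on the API serving the models; an OpenAI-style chat-completion response (choices[0].message.content and usage.completion_tokens) is assumed here, so the keys may differ in practice:

```python
def extract_metrics(result):
    # Assumes an OpenAI-style chat-completion response schema
    answer = result["choices"][0]["message"]["content"]  # model-generated text
    tokens = result["usage"]["completion_tokens"]         # tokens in the answer
    return answer, tokens

# Dummy response used purely for illustration
answer, tokens = extract_metrics({
    "choices": [{"message": {"content": "A red sneaker on a white shelf."}}],
    "usage": {"completion_tokens": 9},
})
print(answer, tokens)
```

In this sketch latency is measured client-side by timing the query_model call, as in the Streamlit example earlier; if the API reports its own generation time, that field can be logged instead.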
Task Definitions and Experiments
We tested four different tasks across multiple domains using the following models:
- Llama 3.2
- Qwen 2 VL
- GPT 4o
Domains:
- Medical: Questions related to complex medical scenarios.
- Retail: Product-related queries.
- CCTV: Surveillance footage analysis.
- Art: Generating artistic interpretations and descriptions.
The experiment involved five queries per task for each model, and we recorded the following metrics (a short aggregation sketch follows the list):
- Tokens: The number of tokens used by the model to generate a response.
- Latency: Time taken to return the response.
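The five per-query measurements for each task were then summarized as a mean and standard deviation, as in the tables that follow. Here is a minimal aggregation sketch using Python's statistics module; the values are illustrative and not the recorded data:

```python
from statistics import mean, stdev

# Illustrative per-query logs for one task/model pair (not the recorded data)
token_counts = [14, 27, 55, 31, 90]
latencies = [0.9, 1.4, 2.1, 1.2, 1.8]  # seconds

print(f"tokens:  mean {mean(token_counts):.1f}, std {stdev(token_counts):.2f}")
print(f"latency: mean {mean(latencies):.2f} s, std {stdev(latencies):.2f} s")
```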
Results
Token Usage Comparison
The tables below highlight the token usage across the four domains for both Llama and GPT models.
Task | Q1 Tokens | Q2 Tokens | Q3 Tokens | Q4 Tokens | Q5 Tokens | Mean Tokens | Standard Deviation (Tokens) |
---|---|---|---|---|---|---|---|
Medical (Llama) | 1 | 12 | 1 | 1 | 1 | 3.2 | 4.81 |
Retail (Llama) | 18 | 39 | 83 | 40 | 124 | 60.8 | 32.77 |
CCTV (Llama) | 18 | 81 | 83 | 40 | 124 | 69.2 | 37.29 |
Art (Llama) | 11 | 71 | 88 | 154 | 40 | 72.2 | 51.21 |
Task | Q1 Tokens | Q2 Tokens | Q3 Tokens | Q4 Tokens | Q5 Tokens | Mean Tokens | Standard Deviation (Tokens) |
---|---|---|---|---|---|---|---|
Medical (GPT) | 1 | 10 | 1 | 1 | 1 | 2.4 | 4.04 |
Retail (GPT) | 7 | 13 | 26 | 14 | 29 | 17.8 | 8.53 |
CCTV (GPT) | 7 | 8 | 26 | 14 | 29 | 16.8 | 7.69 |
Art (GPT) | 10 | 13 | 102 | 43 | 35 | 40.6 | 35.73 |
Latency Comparison
Latency, measured in seconds, is another critical factor in evaluating model performance, especially for real-time applications. The following tables display latency results for the same set of tasks.
Task | Q1 Latency (s) | Q2 Latency (s) | Q3 Latency (s) | Q4 Latency (s) | Q5 Latency (s) | Mean Latency (s) | Standard Deviation (s) |
---|---|---|---|---|---|---|---|
Medical (Llama) | 0.74 | 0.97 | 0.78 | 0.98 | 1.19 | 0.73 | 0.19 |
Retail (Llama) | 1.63 | 3.00 | 3.02 | 1.67 | 3.14 | 2.09 | 0.74 |
CCTV (Llama) | 1.63 | 3.00 | 3.02 | 1.67 | 3.14 | 2.09 | 0.74 |
Art (Llama) | 1.35 | 2.46 | 2.91 | 4.45 | 2.09 | 2.46 | 1.06 |
Task | Q1 Latency (s) | Q2 Latency (s) | Q3 Latency (s) | Q4 Latency (s) | Q5 Latency (s) | Mean Latency (s) | Standard Deviation (s) |
---|---|---|---|---|---|---|---|
Medical (GPT) | 1.35 | 1.50 | 1.21 | 1.50 | 1.23 | 1.38 | 0.10 |
Retail (GPT) | 1.24 | 1.77 | 2.12 | 1.35 | 1.83 | 1.63 | 0.29 |
CCTV (GPT) | 1.20 | 2.12 | 1.80 | 1.35 | 1.83 | 1.68 | 0.32 |
Art (GPT) | 1.24 | 1.77 | 7.69 | 3.94 | 2.41 | 3.61 | 2.29 |
Observations
- Token Efficiency: Llama models generally use fewer tokens in response generation for simpler tasks like Medical compared to more complex domains like Art.
- Latency: Latency is higher for more complex images, especially for tasks like Retail and Art, indicating that these models take more time when generating in-depth descriptions or analyzing images.
- GPT vs. Llama: GPT models generally had lower token counts across the tasks, but the latency was comparable, with GPT showing slightly more variability in complex tasks like Art.
Conclusion and Future Work
This experiment highlights the importance of evaluating both token efficiency and latency when stress testing VLMs. The VLM Stress Test App allows us to quickly compare multiple models and analyze their performance across a variety of real-world tasks.
Future Plans:
- Additional Models: We plan to add more models like Mistral and Claude to the comparison.
- Expanded Dataset: New tasks in domains like Legal and Education will be added to challenge the models further.
- Accuracy Metrics: We'll also integrate accuracy metrics like BLEU and ROUGE scores in the next iteration.
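As a rough preview of what that accuracy scoring could look like, here is one way to compute BLEU and ROUGE-L against a reference answer using the nltk and rouge_score packages. The reference and candidate strings are illustrative, and the final metric setup for the benchmark may well differ:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "a golden retriever runs across a grassy field"  # illustrative ground truth
candidate = "a dog is running through the grass"             # illustrative model answer

# BLEU over token lists, with smoothing since the strings are short
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L F-measure over the raw strings
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}, ROUGE-L: {rouge_l:.3f}")
```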
Check out our GitHub repository for the code and further instructions on how to set up and run your own VLM experiments.