CVPR 2024 Datasets and Benchmarks - Part 2: Benchmarks


Jimmy Guerrero

Posted on April 30, 2024


Author: Harpreet Sahota (Hacker in Residence at Voxel51)

In part one of this series, I explored some interesting datasets presented at CVPR 2024, highlighting how they’ll help advance computer vision and deep learning. 

Now, it's time to turn our attention to the other side of the coin: benchmarks.

Just as musicians need stages to showcase their talent, deep learning models need benchmarks to demonstrate their capabilities and push the boundaries of what's possible. These standardized tasks and challenges provide a crucial yardstick for evaluating and comparing different models, driving healthy competition and accelerating progress.

CVPR 2024 has once again delivered a collection of innovative benchmarks that address existing limitations and explore new frontiers in computer vision. 

In this second part of the series, I’ll highlight three benchmarks I found interesting:

  • ImageNet-D: Testing the robustness of image classifiers against challenging synthetic images generated by diffusion models.
  • Polaris: A large-scale benchmark of human judgments for evaluating image captioning models and metrics.
  • VBench: A comprehensive, 16-dimension evaluation suite for video generative models.

Each of these benchmarks presents unique challenges and opportunities for researchers, pushing the field towards more robust models. For each benchmark, I'll focus on the following aspects:

Task and Objective: Clearly define the specific task or problem the benchmark evaluates.

Dataset and Evaluation Metric: Provide details about the benchmark, including its size, composition, and the evaluation metrics employed to measure model performance.

Benchmark Design and Protocol: Explain the benchmark's design and protocol, including how the dataset is split into training, validation, and test sets. 

Comparison to Existing Benchmarks: Compare the new benchmark to existing ones in the same domain, highlighting its unique challenges, evaluation criteria, and/or how the benchmark complements or improves upon existing benchmarks.

State-of-the-Art Results: Showcase the leaderboard on the benchmark, if it exists, and what the top-performing models are. If available, discuss the model’s key architectural features or training strategies.

Impact and Future Directions: Discuss the benchmark's potential impact, how it can drive research in new directions, and address important challenges in existing benchmarks.

ImageNet-D


tl;dr

Task and Domain

The ImageNet-D benchmark evaluates the robustness of neural networks in object recognition tasks using synthetic images generated by diffusion models. 

  • It assesses the performance of various vision models, ranging from standard visual classifiers to foundation models like CLIP and MiniGPT-4.
  • The primary objective is to rigorously test the robustness of these models in correctly identifying objects under challenging conditions.
  • The benchmark focuses explicitly on "hard" images designed to test the models' perception abilities.
  • Using synthetic images generated by diffusion models, ImageNet-D provides a rigorous evaluation of how well neural networks can handle variations in object representation.

Dataset Curation, Size, and Composition

The synthetic images were generated using Stable Diffusion models steered by language prompts. The authors test the robustness of visual recognition systems by using diverse backgrounds, textures, and materials to challenge the models' perception capabilities.

  • It comprises 4,835 challenging images across 113 overlapping categories between ImageNet and ObjectNet.
  • The images feature a diverse array of backgrounds (3,764), textures (498), and materials (573) to push the limits of object recognition models.
  • The dataset is generated by pairing each object with 547 nuisance candidates from the Broden dataset, resulting in various realistic and challenging synthetic images.
  • The primary evaluation metric is top-1 accuracy in object recognition, which measures the proportion of correctly classified images (a minimal evaluation sketch follows this list).
  • Compared to standard datasets, ImageNet-D proves to be significantly more challenging, as evidenced by the notable drop in accuracy percentages for various state-of-the-art models.
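
To make the evaluation protocol concrete, here is a minimal sketch of zero-shot top-1 accuracy with CLIP via Hugging Face transformers. The folder layout, category names, and prompt template are illustrative assumptions, not the ImageNet-D release format.

```python
# Minimal sketch of zero-shot top-1 accuracy with CLIP. The directory layout
# ("imagenet_d/<category>/<image>.png") and the prompt template are assumptions
# for illustration, not the ImageNet-D release format.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

root = Path("imagenet_d")  # hypothetical local copy of the benchmark images
categories = sorted(p.name for p in root.iterdir() if p.is_dir())
text_prompts = [f"a photo of a {c}" for c in categories]

correct, total = 0, 0
with torch.no_grad():
    for label_idx, category in enumerate(categories):
        for image_path in (root / category).glob("*.png"):
            image = Image.open(image_path).convert("RGB")
            inputs = processor(
                text=text_prompts, images=image, return_tensors="pt", padding=True
            ).to(device)
            logits = model(**inputs).logits_per_image  # shape: (1, num_categories)
            correct += int(logits.argmax(dim=-1).item() == label_idx)
            total += 1

print(f"Top-1 accuracy: {correct / total:.2%}")
```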

Benchmark Design and Protocol


The benchmark's construction follows a rigorous process that involves image generation, labeling, hard image mining, human verification, and quality control:

  • The image generation process is formulated as Image(C, N) = Stable Diffusion(Prompt(C, N)), where C and N refer to the object category and nuisance, respectively (a code sketch of this recipe, together with the hard-image mining step, follows this list).

    • The nuisance N includes background, material, and texture. 
    • For example, images of backpacks with various backgrounds, materials, and textures are generated, offering a broader range of combinations than existing test sets.
  • Each generated image is labeled with its prompt category C as the ground truth for classification.

    • An image is misclassified if the model's predicted label does not match the ground truth C.
  • After creating a large image pool with all object categories and nuisance pairs, the CLIP (ViT-L/14) model is evaluated on these images.

    • Hard images are selected based on shared perception failure, defined as an image that leads multiple models to predict the object's label incorrectly. The test set is constructed using shared failures of known surrogate models. 
    • The test set is challenging if these failures lead to low accuracy in unknown models. This property is called transferable failure.
  • Human Labeling: Since ImageNet-D includes images with diverse object and nuisance pairs that may be rare in the real world, human labeling is performed using Amazon Mechanical Turk (MTurk). Workers are asked to answer two questions for each image:

    • Can you recognize the desired object ([ground truth category]) in the image? 
    • Can the object in the image be used as the desired object ([ground truth category])?
  • To ensure workers understand the labeling criteria, they are first asked to label two example images for practice, for which the correct answers are provided. After the practice session, workers label up to 20 images in one task, answering both questions for each image by selecting 'yes' or 'no'.

  • Sentinels ensure high-quality annotations. Workers' annotations are removed if they fail to select the correct answers for positive sentinels, select 'yes' for negative sentinels, or provide inconsistent answers for consistent sentinels (a small filtering sketch also appears after this list).

    • Positive sentinels are images that belong to the desired category and are correctly classified by multiple models. 
    • Negative sentinels are images that do not belong to the desired category.
    • Consistent sentinels are images that appear twice in a random order.
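
The generation recipe and the shared-failure criterion fit in a few lines. This is a hedged sketch: the prompt template, the Stable Diffusion checkpoint, and the surrogate `predict` callables are assumptions for illustration, not the authors' exact pipeline.

```python
# Sketch of Image(C, N) = StableDiffusion(Prompt(C, N)) plus shared-failure
# mining. The prompt template, checkpoint, and surrogate `predict` callables
# are illustrative assumptions, not the authors' exact pipeline.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate(category: str, nuisance: str):
    """Image(C, N): steer generation with an object category C and a nuisance N."""
    prompt = f"a photo of a {category} with {nuisance}"  # assumed prompt template
    return pipe(prompt).images[0]

def is_shared_failure(image, label: str, surrogates) -> bool:
    """An image is 'hard' if every known surrogate model misclassifies it."""
    return all(predict(image) != label for predict in surrogates)

# Usage, assuming `categories`, `nuisances`, and surrogate classifiers exist:
# hard_images = []
# for c in categories:
#     for n in nuisances:
#         img = generate(c, n)
#         if is_shared_failure(img, c, surrogates):
#             hard_images.append((c, n, img))
```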

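And here is a small, hypothetical filter for the sentinel-based quality control described above. The data structures are mine, but the three rules mirror the positive, negative, and consistent sentinel checks.

```python
# Hypothetical sentinel-based quality control: drop a worker's annotations if
# any of the three sentinel rules described above is violated.
def keep_worker(submission: dict) -> bool:
    answers = submission["answers"]                   # image_id -> "yes" / "no"
    positives = submission["positive_sentinels"]      # must be answered "yes"
    negatives = submission["negative_sentinels"]      # must be answered "no"
    consistents = submission["consistent_sentinels"]  # (id_a, id_b) duplicated pairs

    if any(answers[i] != "yes" for i in positives):
        return False
    if any(answers[i] != "no" for i in negatives):
        return False
    if any(answers[a] != answers[b] for a, b in consistents):
        return False
    return True

# clean = {worker: s for worker, s in all_submissions.items() if keep_worker(s)}
```
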
Comparison to Existing Benchmarks

The big difference with ImageNet-D is that it creates entirely new, synthetic images. Here’s how it differs from existing benchmarks that try to do the same thing:

  • Unlike ObjectNet, which collects real-world object images with controlled factors like background, or ImageNet-C, which introduces low-level visual corruptions, ImageNet-D generates entirely new images with diverse backgrounds, textures, and materials.
  • While ImageNet-9 combines foreground and background from different images, it is limited by poor image fidelity. Similarly, Stylized-ImageNet alters the textures of ImageNet images but cannot control global factors like backgrounds. In contrast, ImageNet-D allows for specific control over the image space, which is crucial for robustness benchmarks.
  • Compared to DREAM-OOD, which finds outliers by decoding sampled latent embeddings to images but lacks control over the image space, ImageNet-D focuses on hard images with a single attribute.
  • By generating new images and mining the most challenging ones as the test set, ImageNet-D achieves a greater accuracy drop compared to methods that modify existing datasets. The results show that ImageNet-D causes a significant accuracy drop, up to 60%, in a range of vision models, from standard visual classifiers to the latest foundation models like CLIP and MiniGPT-4.
  • The approach utilized in ImageNet-D demonstrates the potential for using generative models to evaluate model robustness, and its effectiveness is expected to grow further with advancements in generative models.

State-of-the-Art Results

ImageNet-D is a challenging benchmark for various state-of-the-art models, causing significant drops in their object recognition accuracy. Here are some key findings:

  • CLIP experiences a substantial accuracy reduction of 46.05% on ImageNet-D compared to its performance on ImageNet.
  • LLaVa's accuracy drops by 29.67% when evaluated on the ImageNet-D benchmark.
  • Despite being a more recent model, MiniGPT-4 still shows a 16.81% decrease in accuracy on ImageNet-D.
  • All tested models show an accuracy drop of more than 16% on ImageNet-D compared to their performance on the standard ImageNet dataset.
  • Even the latest models, such as LLaVa-1.5 and LLaVa-NeXT, are not immune to the challenges posed by ImageNet-D, experiencing significant accuracy drops.

Impact and Future Directions

ImageNet-D demonstrates the effectiveness of using generative models to evaluate the robustness of neural networks. The authors suggest that their approach is general and has the potential for greater effectiveness as generative models improve. They aim to create more diverse and challenging test images in the future by capitalizing on advancements in generative models.

Polaris


tl;dr

This paper was interesting because it introduces Polaris, a new large-scale benchmark dataset for evaluating image captioning models, and Polos, a state-of-the-art (SOTA) metric trained on this dataset. 

Let’s quickly discuss the concepts of benchmarks and metrics.

A benchmark is a standardized dataset or suite of datasets used to evaluate and compare the performance of different models or algorithms on a specific task. In image captioning, a benchmark typically consists of images, associated human-written captions, and human judgments of caption quality for a subset of the data. The benchmark provides a common ground for comparing different captioning models or evaluation metrics.

A metric, on the other hand, is a method or function used to measure a model's performance on a specific task. In image captioning, a metric takes an image, a candidate caption, and possibly one or more reference captions as input. It outputs a score indicating the quality of the candidate caption. The metric's performance is evaluated by measuring how well its scores correlate with human judgments on a benchmark dataset.

Task and Objective

Automatic evaluation of image captioning models is essential for accelerating progress in image captioning, as it enables researchers to quickly and objectively compare different models and architectures without the need for time-consuming and expensive human evaluations. 

This research aims to create Polos, an evaluation metric designed explicitly for image captioning that closely mirrors human judgment of caption quality, fluency, relevance, and descriptiveness.

This paper closely intertwines the development of the Polos metric and the Polaris benchmark dataset. The authors introduce the Multimodal Metric Learning from Human Feedback (M²LHF) framework, which is used to develop the Polos metric.

Dataset and Evaluation Metric


Evaluating image captioning models accurately requires metrics that align with human judgment. 

However, existing datasets often lack the scale and diversity needed to train such metrics effectively. This paper addresses this challenge by introducing the Polaris dataset and the Polos evaluation metric.

Dataset

  • The Polaris dataset includes 13,691 images and 131,020 generated captions from 10 diverse image captioning models, providing a wide range of caption quality and style. The images used in Polaris are drawn from the MS-COCO and nocaps datasets, chosen for their widespread use in image captioning tasks and their diverse range of image content. Additionally, it contains 262,040 human-written reference captions, which serve as a gold standard for comparison.
  • The Polaris dataset is split into training (78,631 samples), validation (26,269 samples), and test sets (26,123 samples).
  • The generated captions in the Polaris dataset encompass 3,154 unique words, totalling 1,177,512 words. On average, each generated caption is composed of 8.99 words.
  • The reference captions have a vocabulary of 22,275 unique words and a word count of 8,309,300. On average, each reference caption consists of 10.7 words.
  • The authors collected 131,020 human judgments from 550 evaluators on the image-caption pairs in the Polaris dataset to obtain a comprehensive assessment of caption quality.
  • Human evaluators rated each caption on a 5-point scale, considering factors such as fluency, image relevance, and detail level. These ratings were then normalized to a range of [0, 1] to facilitate comparison and evaluation.

Metric

  • Polos uses a parallel feature extraction mechanism that combines features from the CLIP model, which captures image-text similarity, and a RoBERTa model pretrained with SimCSE, which provides high-quality textual representations.
  • The extracted features are then passed through a multilayer perceptron (MLP) to predict the human evaluation score (a simplified architectural sketch follows this list).
  • The primary objective of the Polos metric is to achieve a high correlation with human judgments, demonstrating its ability to assess caption quality accurately.
  • The effectiveness of Polos is quantified using Kendall's Tau correlation coefficients, specifically Tau-b for the Flickr8K-CF dataset and Tau-c for other datasets. These coefficients measure the alignment between the rankings produced by the Polos metric and those derived from human judgments, with a higher correlation indicating better performance.
  • The authors also introduce the Multimodal Metric Learning from Human Feedback (M²LHF) framework, a general approach for developing metrics that learn from human judgments on multimodal inputs.
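
To visualize the parallel feature extraction idea, here is a simplified PyTorch sketch: CLIP image and caption features alongside a SimCSE-pretrained RoBERTa sentence embedding, concatenated and regressed to a score in [0, 1] by an MLP. The fusion scheme and head sizes are assumptions, reference captions are omitted for brevity, and this is not the authors' exact architecture.

```python
# Simplified sketch of a Polos-style scorer: CLIP features in parallel with a
# SimCSE-pretrained RoBERTa embedding, fused by an MLP that predicts a human
# score in [0, 1]. Fusion and head sizes are assumptions, not the paper's
# exact architecture; reference captions are omitted for brevity.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, CLIPModel, CLIPProcessor

class CaptionScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        self.text_enc = AutoModel.from_pretrained("princeton-nlp/sup-simcse-roberta-base")
        self.tok = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-roberta-base")
        fused_dim = 2 * self.clip.config.projection_dim + self.text_enc.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(fused_dim, 512), nn.ReLU(), nn.Linear(512, 1), nn.Sigmoid()
        )

    def forward(self, image, candidate_caption: str) -> torch.Tensor:
        clip_in = self.clip_proc(text=[candidate_caption], images=image,
                                 return_tensors="pt", padding=True, truncation=True)
        img_feat = self.clip.get_image_features(pixel_values=clip_in["pixel_values"])
        cap_feat = self.clip.get_text_features(input_ids=clip_in["input_ids"],
                                               attention_mask=clip_in["attention_mask"])
        tok_in = self.tok(candidate_caption, return_tensors="pt", truncation=True)
        sent_feat = self.text_enc(**tok_in).last_hidden_state[:, 0]  # [CLS] pooling
        fused = torch.cat([img_feat, cap_feat, sent_feat], dim=-1)
        return self.head(fused)  # predicted human score in [0, 1]
```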

Benchmark Design and Protocol

  • Each caption in Polaris received an average of eight judgments from different evaluators. This aims to closely mimic human judgment by considering various aspects of caption quality, and it provides a more comprehensive and reliable assessment than datasets with fewer judgments per caption.
  • Evaluators rated captions on a 5-point scale based on fluency, image relevance, and detail level. These ratings were then normalized to a [0, 1] range for training the Polos metric (see the short example after this list).
  • The proposed Polos metric integrates similarity-based and learning-based methods to evaluate the quality of captions. By leveraging the large-scale Polaris dataset for training and evaluation, the authors ensure the Polos metric is robust and generalizable across different image captioning models and datasets.
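
As a quick illustration of the normalization and the correlation check the paper relies on, here is a short example with made-up numbers.

```python
# Illustrative only: normalize 5-point human ratings to [0, 1] and measure
# metric/human agreement with Kendall's tau-c. All numbers are made up.
import numpy as np
from scipy.stats import kendalltau

raw_ratings = np.array([1, 3, 5, 4, 2])            # 5-point human ratings
human_scores = (raw_ratings - 1) / 4.0             # normalized to [0, 1]

metric_scores = np.array([0.10, 0.55, 0.90, 0.70, 0.35])  # e.g., Polos outputs
tau_c, p_value = kendalltau(metric_scores, human_scores, variant="c")
print(f"Kendall's tau-c = {tau_c:.3f} (p = {p_value:.3f})")
```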

Comparison to Existing Benchmarks

Unlike traditional datasets, Polaris includes a much larger volume of human judgments, collected from a diverse set of evaluators. The benchmark addresses a key gap in existing metrics: poor correlation with human judgment.

  • Standard datasets commonly used to evaluate image captioning include Flickr8K-Expert, Flickr8K-CF, Composite, and PASCAL-50S.
    • Flickr8K-Expert and Flickr8K-CF datasets:
      • Comprise a significant amount of human judgments on captions provided by humans.
      • Do not contain any captions generated by models, which presents an issue from the perspective of the domain gap when using them for training metrics.
    • Composite dataset:
      • Contains 12K human judgments across images collected from MSCOCO, Flickr8k, and Flickr30k.
      • Although each image initially contains five references, only one was selected for human judgments within the dataset.
    • CapEval1k dataset:
      • Introduced for training automatic evaluation metrics.
      • Has several limitations: it is a closed dataset, uses outdated models, and includes only 1K human judgments.

State-of-the-Art Results

The Polos metric achieves state-of-the-art performance on the Polaris benchmark as well as the Composite, Flickr8K (Expert and CF), PASCAL-50S, and FOIL benchmarks.

It outperforms the previous best metric, RefPAC-S, by margins ranging from 0.2 to 1.8 Kendall's Tau points across these test sets.

These results demonstrate the benefit of large-scale supervised training and improved text representations over unsupervised methods that rely on CLIP alone. However, there is still substantial room for improvement before automatic metrics reach human-level consistency.

Impact and Future Directions

The Polaris dataset and Polos metric aim to spur the development of more accurate automatic evaluation methods for image captioning. A reliable automatic metric that correlates well with human judgment is crucial for accelerating progress in this area, as it enables faster iteration and comparison of captioning models.

The authors note some limitations and future directions. Polos tends to overemphasize identifying the most noticeable objects while missing fine-grained details and contextual information. Techniques like RegionCLIP could potentially help improve its fine-grained alignment capabilities.

Another direction is to extend the dataset with a more stringent or multi-step scoring system. Overall, this work takes important steps toward a more discriminative and human-like evaluation of image captioning models.

VBench


tl;dr

  • Task: Text-to-Video (T2V) generation
  • Metric: Instead of reducing the quality of video generation to one number, this benchmark looks at 16 dimensions of video generation, with fine-grained levels that reveal a model's strengths and weaknesses
  • Project Page
  • Paper
  • GitHub
  • Leaderboard on Hugging Face

Task and Objective

This benchmark evaluates the performance of text-to-video (T2V) models. It assesses the quality and consistency of generated videos across multiple dimensions and content categories.

  • Provides a standardized and reliable framework for comparing different T2V models
  • Evaluates video quality, consistency, and fulfillment of conditions

Dataset and Evaluation Metric

VBench introduces a diverse and carefully curated prompt suite to evaluate T2V models. The benchmark employs a multi-dimensional evaluation framework with human preference annotations.

  • Prompt suite per dimension:
    • Contains around 100 prompts for each of the 16 evaluation dimensions
    • Evaluation dimensions: Subject Consistency, Background Consistency, Temporal Flickering, Motion Smoothness, Dynamic Degree, Aesthetic Quality, Imaging Quality, Object Class, Multiple Objects, Human Action, Color, Spatial Relationship, Scene, Appearance Style, Temporal Style, and Overall Consistency
  • Prompt suite per category:
    • Covers various content domains, such as animals, objects, humans, and scenes
  • Evaluation perspectives: Video Quality and Video-Condition Consistency (i.e., how well the generated video fulfills the text condition); a small per-dimension aggregation sketch follows this list
  • Human preference annotations: collected through a data preparation procedure involving video generation, video pair sampling, annotation interface design, and data quality control measures
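
To illustrate the "one score per dimension" idea, here is a toy aggregation loop. The prompt suite entries and `score_fns` are placeholders of my own, not the official vbench package API (see the project's GitHub repo for the real evaluation code).

```python
# Toy multi-dimensional evaluation loop: one score per dimension rather than a
# single collapsed number. The prompt suite entries and `score_fns` are
# placeholders, not the official vbench package API.
from statistics import mean

prompt_suite = {
    "subject_consistency": ["a corgi running on a beach", "..."],
    "temporal_flickering": ["a static shot of a lighthouse at dusk", "..."],
    # ... roughly 100 prompts for each of the 16 dimensions
}

def evaluate_model(generate_video, score_fns):
    """generate_video(prompt) -> video; score_fns[dim](video, prompt) -> float in [0, 1]."""
    report = {}
    for dimension, prompts in prompt_suite.items():
        scores = [score_fns[dimension](generate_video(p), p) for p in prompts]
        report[dimension] = mean(scores)
    return report  # e.g., {"subject_consistency": 0.93, "temporal_flickering": 0.87, ...}
```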

Benchmark Design and Protocol

The authors laid out an extensive data preparation procedure for collecting human preference annotations, as they hope to capture a model's alignment with human perception.

Data preparation procedure:

  • Video Generation: T2V models generate videos based on the prompts in the VBench prompt suite.
  • Video Pair Sampling: Pairs of generated videos sharing the same input prompt are sampled for each evaluation dimension (a small sampling sketch follows this list).
  • Annotation Interface Design: A user-friendly interface allows human annotators to compare and express preferences between video pairs.
  • Annotation Collection: Human annotators review video pairs and select their preferred video based on the specified evaluation dimension.
  • Data Quality Control: Quality control measures, such as attention checks and consistency checks, are implemented to ensure the reliability of the collected preference data.
  • Dimension-Specific Focus: Annotators focus solely on the specific evaluation dimension being assessed when expressing their preference.
  • Comparative Judgment: Annotators judge between video pairs rather than providing absolute ratings.
  • Attention to Detail: Annotators pay close attention to relevant details and nuances in the generated videos.
  • Neutral and Unbiased Assessment: Annotators maintain a neutral and unbiased perspective when comparing and selecting preferred videos.
  • Consistency and Reproducibility: Annotators consistently apply the same criteria and standards across different video pairs and evaluation dimensions.
  • Handling Ambiguous or Challenging Cases: Annotators make their best judgment based on available information and specific criteria, with the option to indicate a "tie" or "unsure" response if necessary.
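
Here is a small, hypothetical sketch of the pair-sampling and consistency-check steps: sample one pair of videos per shared prompt, then re-insert a few pairs with the sides swapped so inconsistent annotators can be caught. The data structures are mine, not the authors' annotation tooling.

```python
# Hypothetical sketch of pairwise annotation prep: sample video pairs that
# share a prompt, then duplicate a few pairs with sides swapped as
# consistency checks. Data structures are illustrative.
import random

def build_annotation_tasks(videos_by_prompt, dimension, n_checks=5, seed=0):
    """videos_by_prompt: {prompt: [video_a, video_b, ...]} for one evaluation dimension."""
    rng = random.Random(seed)
    tasks = []
    for prompt, videos in videos_by_prompt.items():
        if len(videos) < 2:
            continue
        left, right = rng.sample(videos, 2)
        tasks.append({"dimension": dimension, "prompt": prompt,
                      "left": left, "right": right, "is_check": False})
    # duplicate a few pairs with the sides swapped to test annotator consistency
    for task in rng.sample(tasks, min(n_checks, len(tasks))):
        tasks.append({**task, "left": task["right"], "right": task["left"],
                      "is_check": True})
    rng.shuffle(tasks)
    return tasks
```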

Comparison to Existing Benchmarks


While the paper does not explicitly compare VBench to existing benchmarks, it does discuss the metrics commonly used to evaluate video generation models and their limitations:

  • Inception Score (IS)
  • Fréchet Inception Distance (FID)
  • Fréchet Video Distance (FVD)
  • CLIPSIM

The main issues with these metrics are:

  • Inconsistency with human judgment: The paper states that these existing metrics "are inconsistent with human judgement" when evaluating the quality of generated videos.
  • Lack of diversity and specificity in prompts: The prompts used for these metrics, such as class labels from UCF-101 dataset for IS, FID, and FVD, and human-labeled video captions from MSR-VTT for CLIPSIM, "lack diversity and specificity, limiting accurate and fine-grained evaluation of video generation."
  • Not tailored to the unique challenges of video generation: The paper also mentions that generic Video Quality Assessment (VQA) methods "are primarily designed for real videos, thereby neglecting the unique challenges posed by generative models, such as artifacts in synthesized videos."
  • Oversimplification of evaluation: Existing metrics often reduce video generation model performance to a single number, which "oversimplifies the evaluation" and fails to provide insights into individual models' specific strengths and weaknesses.

VBench aims to address these limitations by providing a comprehensive and standardized evaluation framework that aligns with human perception and captures the nuanced aspects of video quality and consistency. For context on CLIPSIM, a minimal sketch follows.
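
CLIPSIM is essentially the average CLIP image-text similarity over the frames of a generated video. This sketch assumes frames are already extracted as PIL images; the exact frame sampling and scaling behind published CLIPSIM numbers are not reproduced here.

```python
# Minimal CLIPSIM-style score: mean CLIP image-text cosine similarity over the
# frames of one generated video. Frame sampling and the scaling used in
# published CLIPSIM numbers are not reproduced here.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipsim(frames, prompt: str) -> float:
    """frames: list of PIL.Image frames sampled from a single generated video."""
    with torch.no_grad():
        inputs = processor(text=[prompt], images=frames,
                           return_tensors="pt", padding=True)
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img @ txt.T).mean().item()  # mean frame-text cosine similarity
```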

State-of-the-Art Results

The leaderboard is on Hugging Face.


Impact and Future Directions

In my opinion, VBench has the potential to significantly impact the field of text-to-video generation and shape its future directions.

By providing a comprehensive and standardized evaluation framework, VBench addresses the limitations of existing metrics and offers a more reliable and insightful way to assess the performance of T2V models.

One of the key strengths of VBench is its alignment with human perception. 

By incorporating human preference annotations and considering multiple evaluation dimensions, VBench captures the nuanced aspects of video quality and consistency important to human viewers. This human-centric approach ensures the benchmark reflects real-world expectations and requirements for generated videos.

By assessing performance across various dimensions and content categories, researchers can identify areas where models excel and where improvements are needed. This granular analysis facilitates targeted research efforts and drives the development of more advanced and specialized T2V models.

The insights provided by VBench, such as the trade-offs between temporal consistency and dynamic degree, highlight the challenges and opportunities in the field. These findings can guide researchers in developing techniques to mitigate trade-offs and optimize model performance across different dimensions. The identification of hidden potential in specific content categories encourages the exploration of specialized models tailored to particular domains, such as humans or animals.

Conclusion

CVPR 2024 has once again demonstrated the incredible pace of innovation in computer vision and deep learning. 

From robust image classifiers tested against challenging synthetic images, to human-aligned evaluation of image captions, to multi-dimensional assessment of text-to-video generation, the benchmarks presented this year are pushing the boundaries of what's possible.

ImageNet-D, Polaris, and VBench each offer unique challenges and opportunities for researchers, driving the development of more robust, versatile, and human-aligned models. As these benchmarks continue to evolve and inspire new research directions, we can expect even more groundbreaking advancements in the field of computer vision. 

The future is bright, and I, for one, am excited to see what incredible innovations emerge next!

Visit Voxel51 at CVPR 2024!

And if you're attending CVPR 2024, don't forget to stop by the Voxel51 booth #1519 to connect with the team, discuss the latest in computer vision and NLP, and grab some of the most sought-after swag at the event. See you there!

