Zero-Shot Prediction Plugin for FiftyOne
Jimmy Guerrero
Posted on March 13, 2024
Author: Jacob Marks (Machine Learning Engineer at Voxel51)
Pre-label your computer vision data with CLIP, SAM, and other zero-shot models!
Welcome to week six of Ten Weeks of Plugins. During these ten weeks, we will be building a FiftyOne Plugin (or multiple!) each week and sharing the lessons learned!
If you’re new to them, FiftyOne Plugins provide a flexible mechanism for anyone to extend the functionality of their FiftyOne App. You may find the following resources helpful:
What we’ve built so far:
- Week 0: 🌩️ Image Quality Issues & 📈 Concept Interpolation
- Week 1: 🎨 AI Art Gallery & Twilio Automation
- Week 2: ❓Visual Question Answering
- Week 3: 🎥 YouTube Player Panel
- Week 4: 🪞Image Deduplication
- Week 5: 👓Optical Character Recognition (OCR) & 🔑Keyword Search
Ok, let’s dive into this week’s FiftyOne Plugin — Zero-Shot Prediction!
Zero-Shot Prediction 0️⃣🎯🔮
Most computer vision models are trained to predict on a preset list of label classes. In object detection, for instance, many of the most popular models like YOLOv8 and YOLO-NAS are pretrained with the classes from the MS COCO dataset. If you download the weights checkpoints for these models and run prediction on your dataset, you will generate object detection bounding boxes for the 80 COCO classes.
When you’re building your own machine learning application, typically you will be interested in a different set of classes. Sometimes the classes are similar to the pretrained classes, whereas other times they can be completely different. To use these model architectures on your data, you either need to fine-tune the pretrained model or train a model from scratch.
Sometimes, however, it may be beneficial to generate initial predictions with your set of classes before undertaking fine-tuning or full-scale model training. For one, this can be a great way to accelerate the generation of ground truth labels. If the model is good enough, manually correcting its mistakes can be much quicker than labeling everything by hand. Additionally, having a set of predictions could prove useful when benchmarking the performance of any models you train.
In computer vision, this is known as zero-shot learning, or zero-shot prediction, because the goal is to generate predictions without explicitly being given any example predictions to learn from. With the advent of high-quality multimodal models like CLIP and foundation models like Segment Anything, it is now possible to generate remarkably good zero-shot predictions for a variety of computer vision tasks, including:
- Image classification
- Object detection
- Instance segmentation
- Semantic segmentation
This FiftyOne plugin streamlines the process of zero-shot prediction, unifying the interface across all four of these tasks, so that you can go from labels to predictions within the FiftyOne App, without writing a single line of code!
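To give a sense of what the plugin does under the hood, here is a minimal sketch of zero-shot classification with CLIP from the FiftyOne Model Zoo (the dataset and class list are illustrative, and the plugin’s actual implementation may differ):

import fiftyone as fo
import fiftyone.zoo as foz

# Load a small sample dataset to experiment with
dataset = foz.load_zoo_dataset("quickstart", max_samples=25)

# Load CLIP with a custom set of label classes
model = foz.load_zoo_model(
    "clip-vit-base32-torch",
    classes=["car", "truck", "bicycle", "pedestrian"],  # illustrative classes
)

# Generate zero-shot classification predictions
dataset.apply_model(model, label_field="clip_predictions")
session = fo.launch_app(dataset)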
Plugin Overview & Functionality
For the sixth week of Ten Weeks of Plugins, I built a Zero-Shot Prediction Plugin. This plugin allows you to specify a set of label classes and a task, and generate preliminary labels for your entire dataset.
You can specify label classes either as a comma-separated list, or by selecting a text file which contains a new class on each line:
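For example, a labels file defining three (illustrative) classes would contain:

license plate
traffic light
stop sign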
The plugin has five (!) operators:
- zero_shot_predict: an umbrella operator for all zero-shot tasks
- zero_shot_classify: an interface for zero-shot classification
- zero_shot_detect: an interface for zero-shot object detection
- zero_shot_instance_segment: an interface for zero-shot instance segmentation
- zero_shot_semantic_segment: an interface for zero-shot semantic segmentation
You can specify the computer vision task either from the modal for the main zero_shot_predict operator:
Or by selecting the appropriate task’s operator from the operator list:
After selecting a task, you will be prompted to select a model. For some tasks, such as classification, only one model (CLIP) is implemented out of the box. For others, there are multiple choices. Instance segmentation, for example, comes with three choices — one for each Segment Anything model size (B, L, H).
At the bottom of the operator’s modal, you can specify the name of the field in which to store the resulting predictions. By default, the field name is a formatted version of the model name.
Delegating Execution
All five of the operators in this plugin can optionally have their execution delegated to be completed at a later time. If you choose to run the operators as delegated operators, you can schedule them in the App and then launch them from the command line with:
fiftyone delegated launch
Setting Inference Target
You can also choose to run any of these zero-shot prediction models on just a subset of your data. If you have a DatasetView loaded in the FiftyOne App that is distinct from the entire dataset — for example, you are just looking at samples that match some filter — then you will have the option to run inference on either the entire dataset or that particular subset. The same is true if you have samples “selected”:
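As a quick illustration, here is one way such a view might arise (the dataset name and filter below are hypothetical):

import fiftyone as fo
from fiftyone import ViewField as F

dataset = fo.load_dataset("vehicles")  # hypothetical dataset name

# A view containing only samples with at least one ground truth detection
view = dataset.match(F("ground_truth.detections").length() > 0)

# With this view loaded in the App, the operators offer it as an inference target
session = fo.launch_app(view)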
Zero-Shot Prediction in Action
There are so many use cases for zero-shot prediction. Here’s just one example. Suppose you have some images with vehicles in them, and you want to detect their license plates. This isn’t one of the COCO label classes, but that is no longer a problem:
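For a sense of what powers this, here is a minimal sketch of zero-shot license plate detection with Owl-ViT via the Hugging Face transformers pipeline (the image path is illustrative, and the plugin’s internals may differ):

from transformers import pipeline

# Owl-ViT performs open-vocabulary detection: "license plate" need not
# be a class the model was explicitly trained on
detector = pipeline(
    task="zero-shot-object-detection",
    model="google/owlvit-base-patch32",
)
predictions = detector(
    "path/to/vehicle.jpg",  # illustrative image path
    candidate_labels=["license plate"],
)
# Each prediction includes a label, a confidence score, and a bounding box
print(predictions)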
Installing the Plugin
If you haven’t already done so, install FiftyOne:
pip install fiftyone
Then you can download this plugin from the command line with:
fiftyone plugins download https://github.com/jacobmarks/zero-shot-prediction-plugin
Refresh the FiftyOne App, and you should see the five operators in your operators list when you press the “`” key.
Because CLIP and SAM come with the FiftyOne Model Zoo, no additional steps are needed to use these models. However, models like CLIPSeg (used for semantic segmentation) and Owl-ViT (used for object detection and instance segmentation) require that you have the Hugging Face transformers library installed.
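If you don’t already have it, you can install it with:

pip install transformers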
If you find other zero-shot models, you can add them as you see fit!
Lessons Learned
The Zero-Shot Prediction plugin is a Python Plugin with the usual structure (__init__.py, fiftyone.yml, and README.md files). Additionally, it has an assets folder for storing icons, and a separate Python file for each task: object detection models, for example, are implemented in detection.py. I split the code up in this way to separate each model’s implementation and postprocessing details from the unified interface for prediction.
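The resulting layout looks roughly like this (the assets folder and detection.py come from the description above; treat the remaining task file names as a sketch):

zero-shot-prediction-plugin/
├── __init__.py
├── fiftyone.yml
├── README.md
├── assets/
│   └── icon.svg
└── detection.py  # plus one module per remaining task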
Keep up with FiftyOne’s Plugin Features
The FiftyOne Plugin system is already incredibly powerful, and it’s getting more powerful with each and every release. This plugin utilizes a few of the features that have been recently added: the file explorer, and delegated operators.
The file explorer, which you can use in your own Python plugins via the FileExplorerView view type, is a flexible widget that allows you to select a specific file or an entire folder from your file system. In FiftyOne Teams, it even allows you to navigate the files in your cloud buckets, all from within the FiftyOne App! In this plugin, I used the file explorer to let users select their labels, either from a local text file or from a URL.
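For reference, declaring a file explorer input inside resolve_input() looks something like this (a sketch following the patterns in the FiftyOne plugin docs; exact parameter names may differ):

file_explorer = types.FileExplorerView(
    choose_dir=False,
    button_label="Choose a labels file...",
)
inputs.file(
    "labels_file",  # illustrative property name
    required=False,
    label="Labels file",
    view=file_explorer,
)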
Delegated operators allow you to schedule certain tasks to be completed at a later time. From within the FiftyOne App, you schedule the job, and then you can launch the job from the command line with:
fiftyone delegated launch
Delegating execution can be incredibly useful for long-running operations, like running inference with deep models on your dataset.
You can set your operators to run in delegated mode with the resolve_delegation() method. This, for instance, would result in the operator always running in delegated mode:

def resolve_delegation(self, ctx):
    return True
For this plugin, I took a slightly different approach, instead letting the user decide whether they want to run in delegated mode via an input parameter:
def _execution_mode(ctx, inputs):
    delegate = ctx.params.get("delegate", False)

    if delegate:
        description = "Uncheck this box to execute the operation immediately"
    else:
        description = "Check this box to delegate execution of this task"

    inputs.bool(
        "delegate",
        default=False,
        required=True,
        label="Delegate execution?",
        description=description,
        view=types.CheckboxView(),
    )

    if delegate:
        inputs.view(
            "notice",
            types.Notice(
                label=(
                    "You've chosen delegated execution. Note that you must "
                    "have a delegated operation service running in order for "
                    "this task to be processed. See "
                    "https://docs.voxel51.com/plugins/index.html#operators "
                    "for more information"
                )
            ),
        )

def resolve_delegation(self, ctx):
    return ctx.params.get("delegate", False)
Reducing Boilerplate Code
Because the four task-specific operators naturally gave way to very similar information flows within the code, I found myself writing nearly identical code for the resolve_input() and execute() methods for each of these operators. This wasn’t very satisfying.
A more satisfying and cleaner approach is to pass the context object ctx into other functions which are defined outside of the operator object. To make this happen, I created _input_control_flow(ctx, task) and _execute_control_flow(ctx, task) functions which take in the ctx and a string specifying the task. This massively simplified the operator definitions. Take the zero-shot instance segmentation operator, for example:
class ZeroShotInstanceSegment(foo.Operator):
    @property
    def config(self):
        _config = foo.OperatorConfig(
            name="zero_shot_instance_segment",
            label="Perform Zero Shot Instance Segmentation",
            dynamic=True,
        )
        _config.icon = "/assets/icon.svg"
        return _config

    def resolve_delegation(self, ctx):
        return ctx.params.get("delegate", False)

    def resolve_input(self, ctx):
        inputs = _input_control_flow(ctx, "instance_segmentation")
        return types.Property(inputs)

    def execute(self, ctx):
        _execute_control_flow(ctx, "instance_segmentation")
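The real implementations live in the plugin repo; purely to illustrate the pattern, a shared control-flow function might look something like this (every name and parameter below is hypothetical):

def _execute_control_flow(ctx, task):
    # Hypothetical sketch: read the user's choices from the context...
    model_name = ctx.params.get("model_name", None)  # hypothetical param
    label_field = ctx.params.get("label_field", None)  # hypothetical param
    target_view = _get_target_view(ctx)  # hypothetical helper

    # ...then dispatch to the task-specific model runner
    run_model = MODEL_RUNNERS[task][model_name]  # hypothetical registry
    run_model(target_view, label_field=label_field)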
Conclusion
Zero-shot prediction is becoming increasingly important for both pre-labeling and benchmarking workflows. This plugin saves you the trouble of dealing with myriad standards, unifying and simplifying the process of zero-shot prediction for four essential computer vision tasks. It will save you a lot of time, and make your life that much easier. The best part is that it can be used in conjunction with your existing annotation workflows, or with next week’s Active Learning plugin!
Stay tuned over the remaining weeks in the Ten Weeks of FiftyOne Plugins while we continue to pump out a killer lineup of plugins! You can track our journey in our ten-weeks-of-plugins repo — and I encourage you to fork the repo and join me on this journey!