Double Trouble: Eliminate Image Duplicates with FiftyOne
Jimmy Guerrero
Posted on March 13, 2024
Author: Jacob Marks (Machine Learning Engineer at Voxel51)
Find Exact and Approximate Duplicate Images with This Plugin
Welcome to week four of Ten Weeks of Plugins. During these ten weeks, we will be building a FiftyOne Plugin (or multiple!) each week and sharing the lessons learned!
If you’re new to them, FiftyOne Plugins provide a flexible mechanism for anyone to extend the functionality of their FiftyOne App.
What we’ve built so far:
- Week 0: Image Quality Issues & Concept Interpolation
- Week 1: AI Art Gallery & Twilio Automation
- Week 2: Visual Question Answering
- Week 3: YouTube Player Panel
Ok, let’s dive into this week’s FiftyOne Plugin — Image Deduplication!
Image Deduplication 🖼️🪞🧹
One of the biggest challenges in training machine learning models is curating a high quality dataset, and duplicate (or very similar) data is a major roadblock to building one. Multiple copies of the same (or approximately the same) samples can lead to longer training times, higher training costs, and lower overall performance. What you likely want instead is a diverse dataset with good coverage over the data domain.
Duplicates come in two flavors:
- Exact duplicates: pixel-perfect matches, where one image is literally a down-to-the-bit copy of another
- Approximate duplicates: images (or other data) that are highly similar — typically evaluated by computing the closeness between samples with some similarity metric — and setting a threshold for similarity using this metric.
Deduplication is the task of removing these exact and approximate duplicates from a dataset.
Typically, deduplication involves writing a lot of code to find, visualize, and remove all of the duplicates in your dataset. With this FiftyOne plugin, that all changes. Now you can deduplicate your entire dataset from within the FiftyOne App, without writing a single line of code!
Plugin Overview & Functionality
For the fourth week of 10 Weeks of Plugins, I built an Image Deduplication Plugin. This plugin allows you to:
- Find both exact and approximate duplicate images in your dataset
- Visualize these groups of duplicates
- Delete all duplicates OR Keep a representative from each set of duplicates
The plugin has eight (!) operators (a powerful feature in FiftyOne that allows plugin developers to define custom operations that can be executed by users of the FiftyOne App), but don’t get overwhelmed: really it’s just two sets of analogous operators, for exact and approximate deduplication workflows.
After you install the plugin, when you open the operators list (by pressing "`" in the FiftyOne App), you should see these operators. Search for “dedup” to narrow down the list!
🔍 Finding Duplicates
The first pair of operators helps you to find duplicate images in your dataset.
- find_approximate_duplicate_images: uses a similarity index to find approximate duplicates.
You can specify either a distance threshold (how close the images need to be according to the similarity metric to be considered near duplicates) or a fraction of the dataset to mark as near duplicates.
If you haven’t computed a similarity index on your dataset, you can do so by running:
import fiftyone.brain as fob

fob.compute_similarity(dataset, brain_key="sim", metric="cosine")
For a large dataset, you may want to use a vector database. In this case, check out our native integrations with Pinecone, Qdrant, Milvus, and LanceDB!
When the operation finishes, it will have created two saved views: approx_dup_view and approx_dup_groups_view. You can access these by clicking on the saved views selector in the FiftyOne App, or programmatically via Python:
approx_dup_view = dataset.load_saved_view("approx_dup_view")
approx_dup_groups_view = dataset.load_saved_view("approx_dup_groups_view")
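To make the two selection modes described above (a distance threshold versus a fraction of the dataset) more concrete, here is a minimal pure-Python sketch. It is not the plugin's implementation: the helper names flag_by_threshold and flag_by_fraction, the toy embeddings, and the choice to rank samples by nearest-neighbor cosine distance are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbor_distances(embeddings):
    """Map each sample ID to the distance to its closest other sample."""
    nearest = {sample_id: math.inf for sample_id in embeddings}
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        dist = cosine_distance(vec_a, vec_b)
        nearest[id_a] = min(nearest[id_a], dist)
        nearest[id_b] = min(nearest[id_b], dist)
    return nearest


def flag_by_threshold(embeddings, thresh):
    """Flag every sample whose nearest neighbor is within `thresh`."""
    nearest = nearest_neighbor_distances(embeddings)
    return {sid for sid, dist in nearest.items() if dist <= thresh}


def flag_by_fraction(embeddings, fraction):
    """Flag the `fraction` of samples with the closest nearest neighbors."""
    nearest = nearest_neighbor_distances(embeddings)
    num_to_flag = int(len(nearest) * fraction)
    ranked = sorted(nearest, key=nearest.get)
    return set(ranked[:num_to_flag])


# Hypothetical 2D embeddings; "a" and "b" point in almost the same direction
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.99, 0.05],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.2],
}

print(flag_by_threshold(embeddings, 0.01))
print(flag_by_fraction(embeddings, 0.5))
```

On this toy data both calls flag "a" and "b". A threshold gives you direct control over how similar two images must be, while a fraction is handy when you do not yet know what a sensible distance looks like for your embeddings.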
- find_exact_duplicate_images: uses file hashes to find exact duplicates.
Essentially, the file hash computes a short signature for each sample based on the binary data stored in the image. The operator then checks if there are duplicate values of these signatures and marks these samples as duplicates.
The operator adds a filehash field to each sample, and creates a saved view exact_dup_view, which contains just the images with duplicate filehashes.
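The hashing idea behind exact duplicate detection can be sketched with the standard library alone. This is an illustration under stated assumptions, not the plugin's code: the function names compute_filehash and group_exact_duplicates are hypothetical, and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib
from collections import defaultdict


def compute_filehash(path, chunk_size=1 << 20):
    """Return a hex digest of the file's raw bytes, read in chunks."""
    hasher = hashlib.sha256()  # assumption: any stable hash would do here
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def group_exact_duplicates(paths):
    """Group file paths by hash, keeping only hashes seen more than once."""
    groups = defaultdict(list)
    for path in paths:
        groups[compute_filehash(path)].append(str(path))
    return {h: ps for h, ps in groups.items() if len(ps) > 1}
```

Calling group_exact_duplicates on a list of image paths returns a dictionary keyed by hash, where each value lists the files that are byte-for-byte identical. Note that two visually identical images saved with different encodings or metadata will hash differently, which is why approximate deduplication is a separate workflow.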
🪟Viewing Duplicates
Once you have found exact and/or approximate duplicates in your dataset, you may want to view these duplicates. For approximate duplicates, for instance, you may want to verify that the distance threshold you set was rigorous enough.
The Image Deduplication plugin makes it easy to do this with the display_approximate_duplicate_groups and display_exact_duplicate_groups operators. The names are pretty self-explanatory, but the former loads the approx_dup_groups_view view we saved earlier, and the latter displays the samples in exact_dup_view, grouped by filehash.
🗑️Removing Duplicates
Once you have viewed your identified duplicates, it is time to clean your dataset. At this point, you have two options:
- Remove ALL duplicates: delete all samples marked as an exact or approximate duplicate
- Keep a representative: remove all but one duplicate from each set of exact or approximate duplicates
As always, there are sister operators for working with approximate and exact duplicates:
- remove_all_approximate_duplicates: removes all near-duplicate images from a dataset
- remove_all_exact_duplicates: removes all exact duplicate images from a dataset
- deduplicate_approximate_duplicates: removes near-duplicate images from a dataset, keeping a representative image from each duplicate set
- deduplicate_exact_duplicates: removes exact duplicate images from a dataset, keeping a representative image from each duplicate set
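The difference between the two cleanup strategies (remove everything flagged versus keep one representative per group) can be sketched on plain lists. The function names and sample IDs below are hypothetical, and treating the first member of each group as the representative is an assumption made for this sketch, not necessarily how the plugin picks one.

```python
def remove_all_duplicates(sample_ids, duplicate_groups):
    """Drop every sample that appears in any duplicate group."""
    flagged = {sid for group in duplicate_groups for sid in group}
    return [sid for sid in sample_ids if sid not in flagged]


def keep_one_representative(sample_ids, duplicate_groups):
    """Drop all but the first sample of each duplicate group."""
    to_drop = {sid for group in duplicate_groups for sid in group[1:]}
    return [sid for sid in sample_ids if sid not in to_drop]


# Hypothetical dataset with two groups of duplicates
sample_ids = ["img1", "img2", "img3", "img4", "img5"]
duplicate_groups = [["img1", "img3"], ["img2", "img4"]]

print(remove_all_duplicates(sample_ids, duplicate_groups))    # ['img5']
print(keep_one_representative(sample_ids, duplicate_groups))  # ['img1', 'img2', 'img5']
```

Removing everything is the safer choice when duplicates signal a data collection problem, while keeping a representative preserves coverage of the content those duplicates depict.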
Installing the Plugin
If you haven’t already done so, install FiftyOne:
pip install fiftyone
Then you can download this plugin from the command line with:
fiftyone plugins download https://github.com/jacobmarks/image-dedup-plugin
Refresh the FiftyOne App, and you should see the eight operators in your operators list when you press the "`" key.
Lessons Learned
The Image Deduplication plugin is a Python Plugin with the usual structure (__init__.py, fiftyone.yml, and README.md files). Additionally, it has the following:
- An assets folder for storing icons
- A Python file exact_dups.py for handling the logic and computations involved for exact duplicates
- A Python file approx_dups.py for handling the logic and computations involved for approximate duplicates
Splitting Code into Submodules
It’s typically good practice in software development to make code modular, splitting self-contained pieces of logic into separate functions or files. This is known as separation of concerns.
The Image Deduplication plugin was a good exercise in applying this principle to FiftyOne’s plugin system. To utilize functions or variables you define in another file in the FiftyOne plugin’s directory, you need to add the path to that file to your system path.
Here’s an example where we import find_exact_duplicates from the exact_dups file:
import os

from fiftyone.core.utils import add_sys_path

with add_sys_path(os.path.dirname(os.path.abspath(__file__))):
    # pylint: disable=no-name-in-module,import-error
    from exact_dups import find_exact_duplicates
Starting from the innermost part of this expression:
- __file__ is a variable containing the path to the current module, in this case the __init__.py file
- os.path.abspath gets the absolute path for this file
- os.path.dirname extracts the directory name of this absolute path
- add_sys_path is a FiftyOne utility function that adds this to our system path
The second-to-last line, # pylint: disable=no-name-in-module,import-error, tells our linter not to throw an error when linting the file.
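If you are curious what a utility like add_sys_path does under the hood, the general pattern is a context manager that puts a directory on sys.path for the duration of a with block and cleans up afterwards. The sketch below uses a hypothetical name, temporary_sys_path, and is an assumption about the general technique rather than FiftyOne's actual source.

```python
import sys
from contextlib import contextmanager


@contextmanager
def temporary_sys_path(path):
    """Temporarily prepend `path` to sys.path while the block runs."""
    sys.path.insert(0, path)
    try:
        yield
    finally:
        # Clean up even if the import inside the block raised
        try:
            sys.path.remove(path)
        except ValueError:
            pass
```

Scoping the path change to the with block means the plugin's sibling modules are importable exactly when needed, without permanently altering the import path for the rest of the process.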
Loading a View
When executed, the display_approximate_duplicate_groups and display_exact_duplicate_groups operators each trigger the loading of specific views. Doing this is pretty straightforward, but it is worth noting that the data passed into params in the ctx.trigger() call needs to be serialized. In fact, all data passed into parameter dictionaries for FiftyOne operators needs to be serialized.
Fortunately, FiftyOne DatasetView objects are easy to serialize!
import json
from bson import json_util

def serialize_view(view):
    # Round-trip through BSON-aware JSON so only plain JSON types remain
    return json.loads(json_util.dumps(view._serialize()))
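To see why a dumps-then-loads round trip matters, consider what happens when a parameter dictionary holds a value that JSON cannot encode natively. The sketch below uses only the standard library: to_json_safe and the params contents are hypothetical, and default=str is a simple stand-in for the BSON-aware encoding that json_util performs.

```python
import json
from datetime import datetime, timezone


def to_json_safe(params):
    """Round-trip through JSON, stringifying values JSON can't encode."""
    return json.loads(json.dumps(params, default=str))


# Hypothetical parameters containing a non-JSON-native value
params = {
    "view_name": "exact_dup_view",
    "created_at": datetime(2024, 3, 13, tzinfo=timezone.utc),
}

try:
    json.dumps(params)
except TypeError as err:
    print(f"Not serializable as-is: {err}")

safe_params = to_json_safe(params)
print(type(safe_params["created_at"]).__name__)  # str
```

After the round trip, the dictionary contains only strings, numbers, booleans, lists, and nested dictionaries, which is the shape an operator's parameter dictionary needs.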
Icons for Each Operator
The last tip is a simple but fun one: By utilizing the icon argument in the operator config, you can specify a unique icon for each operator. This is the icon that will then show up in the operators list when you hit "`".
For example, here’s the start of the operator definition for FindExactDuplicates:
import fiftyone.operators as foo

class FindExactDuplicates(foo.Operator):
    @property
    def config(self):
        return foo.OperatorConfig(
            name="find_exact_duplicate_images",
            label="Dedup: Find exact duplicates",
            description="Find exact duplicates in the dataset",
            icon="/assets/exact_duplicates.svg",
            dynamic=True,
        )
I like to put all of the SVGs I use as icons in an assets folder to stay organized 📁.
Conclusion
Building a high quality dataset doesn’t have to be a hassle. With our Image Quality Issues Plugin from week 0, you can find a variety of common issues potentially plaguing images in your dataset, from peculiar aspect ratios to oversaturation. Now with the Image Deduplication Plugin (this post) you can also find and eliminate duplicates from your dataset in mere minutes!
Stay tuned over the remaining weeks in the Ten Weeks of FiftyOne Plugins while we continue to pump out a killer lineup of plugins! You can track our journey in our ten-weeks-of-plugins repo — and I encourage you to fork the repo and join me on this journey!