Data Quality: The Hidden Driver of AI Success

Author: Markus Woodson (Machine Learning Engineer at Voxel51)

Data quality is the unsung hero of successful machine learning models. While researchers push the boundaries with ever-evolving architectures, in practice, performance gains come from better data, not just bigger models. Poor-quality inputs—like blurry, underexposed, or overly compressed images; noisy lidar or depth signals; misaligned point clouds; incomplete or inconsistent 3D object reconstructions; and inaccurate sensor fusion between modalities—set models up for failure, much as they hinder human perception. In addition, balancing your data across all potential issues and edge cases enables the model to generalize across conditions. Even the best algorithms can’t compensate for weak data foundations.

The secret to building robust AI systems lies in exercising the data fundamentals: ensuring completeness, consistency, and relevance, supported by high-quality labeling and thoughtful data curation. In the end, it’s not just about models—it’s how we handle the data that drives meaningful innovation.

In this post, we’ll explore why data quality is essential to building reliable, high-performing machine learning models, especially in computer vision. We’ll see model failure cases due to data and examine just how pervasive these issues are, even in “cleaned” academic datasets. We’ll also discuss what FiftyOne has done and will continue to do to help users improve their data quality.

If you’re not yet familiar, hundreds of thousands of AI builders use FiftyOne to build visual AI applications by refining data and models in one place.

Data Quality In the Wild

The quality of your dataset can be critically important, especially in safety-focused applications like autonomous driving. Studies[1,2,3] of real-world incidents have shown that crashes often occur in scenarios where models encounter underrepresented or complex visual conditions, such as extreme glare, low visibility, fog, low-contrast environments, or scenes where actors behave in novel ways. These conditions are challenging not only for AI systems but also for human drivers. Finding these issues is the first step toward remedying them: once you have found them, you can increase your data diversity by simulating similar scenarios. The camera is not the only sensor that can have issues, either; there may be hidden problems in your lidar, depth, or other sensor data. Without thorough data curation or generation that covers a variety of lighting, weather, and environmental settings, models are less prepared for these unusual yet impactful edge cases. Incorporating diverse, high-quality data that includes the conditions mentioned above can help create models that perform reliably across a wider range of scenarios, potentially reducing risks and improving overall safety in real-world applications.

Here are just two examples of failure modes of self-driving systems that led to fatal or serious accidents. In both, you’ll notice the images seen by the vehicle are very atypical when compared to normal driving data. There is fog, harsh lighting, darkness, and a lack of contrast.

You might think these edge cases are hard to come by or cover, but longtime standard datasets like KITTI tell a different story. In the pictures below, you can see samples from the KITTI dataset loaded into FiftyOne that I filtered using a simple brightness parameter. Images with brightness-related issues have been commonplace for a while now.

Example images from the KITTI dataset. The samples were filtered by their brightness level to show only the extremely bright cases, using a new panel coming soon to FiftyOne Teams. We can quickly find these types of problematic images and use this information to guide model improvements for these cases.
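To make the idea of a "simple brightness parameter" concrete, here is a minimal sketch in plain Python. It is not how FiftyOne or the upcoming panel computes brightness; the Rec. 601 luma formula, the function names, and the thresholds are all assumptions chosen for illustration.

```python
from statistics import mean

# Rec. 601 luma weights. Using this formula is an assumption for the sketch,
# not a description of what FiftyOne does internally.
LUMA_WEIGHTS = (0.299, 0.587, 0.114)


def mean_brightness(pixels):
    """Return the mean luma of an image, scaled to [0, 1].

    `pixels` is a flat list of (R, G, B) tuples with 8-bit channel values.
    """
    r_w, g_w, b_w = LUMA_WEIGHTS
    luma = mean(r_w * r + g_w * g + b_w * b for r, g, b in pixels)
    # Clamp to guard against floating-point results marginally above 1.0.
    return min(1.0, luma / 255.0)


def filter_by_brightness(images, low=0.0, high=1.0):
    """Return the names of images whose mean brightness lies in [low, high]."""
    return [
        name
        for name, pixels in images.items()
        if low <= mean_brightness(pixels) <= high
    ]


# Tiny synthetic "images" (hypothetical stand-ins for real driving frames).
images = {
    "overexposed": [(250, 250, 245)] * 16,
    "typical": [(120, 130, 110)] * 16,
    "night": [(10, 12, 20)] * 16,
}

too_bright = filter_by_brightness(images, low=0.9)
too_dark = filter_by_brightness(images, high=0.1)
print(too_bright, too_dark)  # ['overexposed'] ['night']
```

A single scalar per image is crude, but it is cheap to compute and is often enough to surface the glare and darkness cases described above for manual review.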

Data quality is equally crucial in the world of generative AI, where massive datasets like LAION play a foundational role in training models such as Stable Diffusion. Because the LAION dataset is open, we can see firsthand the types of images that go into shaping these models through websites like haveibeentrained. While it includes a wide variety of visual content, it also brings to light common quality issues: near duplicates, exact duplicates, images lacking meaningful content, and all types of issues in between. These types of issues can lead to memorization, regurgitation, or generation of content which does not match the prompt. Additionally, datasets of this scale sourced from the entire internet can inadvertently include problematic material, like graphic or offensive content, which can influence the outputs of generative models. Such issues highlight the importance of rigorous data curation to ensure models not only generate diverse and creative outputs but also maintain quality and relevance.

Searching LAION for basic queries such as “man running” shows just how prevalent issues are in these datasets. Near and exact duplicates galore! Not to mention the unrelated or potentially non-useful images that are present as well.
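As a rough illustration of the difference between exact and near duplicates, the sketch below groups byte-identical files by their SHA-256 digest and compares small grayscale thumbnails with an average hash. This is one common approach, not the method used by any particular tool; the helper names, the thumbnail format, and the `max_distance` threshold are assumptions.

```python
import hashlib
from collections import defaultdict


def exact_duplicate_groups(files):
    """Group file names whose raw bytes are identical (SHA-256 digest match)."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return [names for names in groups.values() if len(names) > 1]


def average_hash(gray):
    """Return a bit tuple: 1 where a pixel is above the image's mean intensity.

    `gray` is a flat list of grayscale values from an already-downsized image.
    """
    threshold = sum(gray) / len(gray)
    return tuple(int(value > threshold) for value in gray)


def hamming(a, b):
    """Count the positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def near_duplicate_pairs(thumbnails, max_distance=2):
    """Return name pairs whose hashes differ in at most `max_distance` bits."""
    hashes = {name: average_hash(gray) for name, gray in thumbnails.items()}
    names = sorted(hashes)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if hamming(hashes[a], hashes[b]) <= max_distance
    ]


# Hypothetical data: raw file bytes and 4x4 grayscale thumbnails.
files = {"a.jpg": b"\x01\x02\x03", "a_copy.jpg": b"\x01\x02\x03", "b.jpg": b"\x09\x08"}
thumbnails = {
    "runner": [10, 10, 200, 200] * 4,
    "runner_recompressed": [12, 9, 198, 205, 11, 14, 190, 201,
                            8, 10, 203, 199, 13, 10, 200, 196],
    "landscape": [200] * 8 + [10] * 8,
}

print(exact_duplicate_groups(files))     # [['a.jpg', 'a_copy.jpg']]
print(near_duplicate_pairs(thumbnails))  # [('runner', 'runner_recompressed')]
```

The pairwise comparison here is quadratic in the number of images, which is fine for a demonstration but would need an index or an embedding-based search at the scale of a dataset like LAION.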

What Prioritizing Data Looks Like

Some scientists and engineers recognize the importance of data quality, dedicating entire pages of their technical reports to how they curate and annotate their large datasets.

From our own experience building high-performing visual AI systems, we know well that AI/ML specialists struggle with the challenges of curating high-quality datasets. That's why we’ve invested in tools and plugins such as the data quality plugin for FiftyOne, which helps you find problematic images in your dataset, such as images that are blurry, too bright, too dark, or potentially noisy. The deduplication plugin for FiftyOne, in turn, helps you find near and exact duplicates in your dataset.
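For a sense of how blurry images can be flagged automatically, one widely used heuristic is the variance of the Laplacian: sharp images have strong edge responses, so the variance is high, while blurry or smooth images score low. The sketch below is a plain-Python version of that heuristic under stated assumptions; it is not the plugin's implementation, and the function name and test grids are made up for illustration.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior of a 2D grid.

    `gray` is a list of rows of grayscale values. Low variance suggests few
    sharp edges, which is a common (but imperfect) proxy for blur.
    """
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, len(gray) - 1)
        for x in range(1, len(gray[0]) - 1)
    ]
    return pvariance(responses)


# Hypothetical 6x6 test grids: a high-contrast checkerboard and a smooth ramp.
sharp = [[255 if (x + y) % 2 == 0 else 0 for x in range(6)] for y in range(6)]
smooth = [[10 * x for x in range(6)] for y in range(6)]

print(laplacian_variance(sharp) > laplacian_variance(smooth))  # True
```

Any cutoff below which an image counts as "blurry" depends on resolution and content, so in practice it is usually tuned per dataset rather than fixed in advance.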

But what else can we do? Going forward, these are some of the ways we at Voxel51 ML are thinking about data quality:

Labels have quality issues as well! Just because you have labeled your data does not mean those labels are perfect. Human-labeled data can be prone to errors. We want to make finding and preventing these errors easier, improving annotation workflows.
Issues are problem-specific. One practitioner may consider a data sample an issue while another might not; it all depends on the use case. As an extreme case, if you are training a model that should work in both blurry and clear conditions, then blurry samples should be allowed. But if you expect all your test samples to be clear, then training on blurry samples might be a waste of valuable compute and model capacity. Customizing the detection of issues to your use case is critical.
Data curation is key. Recently, we’ve seen a surge of research in data-centric AI being applied to problems on a sometimes very large scale. Research has shown that curating a better dataset can give a better model and reduce training time, saving you both time and money. Going forward, we think better data curation for all types of problems in computer vision should be a core part of the model development process.
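On the first point above, one simple way to surface possible label errors is to look for samples where a trained model confidently disagrees with the annotation. The sketch below shows that heuristic only; the record layout, field names, and the 0.9 threshold are assumptions, and a flagged sample is a candidate for human review, not a confirmed mistake.

```python
def likely_label_errors(samples, min_confidence=0.9):
    """Return ids of samples where a confident prediction disagrees with the label."""
    return [
        sample["id"]
        for sample in samples
        if sample["predicted"] != sample["label"]
        and sample["confidence"] >= min_confidence
    ]


# Hypothetical records pairing a human label with a model prediction.
samples = [
    {"id": "img_001", "label": "car", "predicted": "car", "confidence": 0.98},
    {"id": "img_002", "label": "car", "predicted": "truck", "confidence": 0.95},
    {"id": "img_003", "label": "cyclist", "predicted": "pedestrian", "confidence": 0.55},
]

review_queue = likely_label_errors(samples)
print(review_queue)  # ['img_002']
```

Low-confidence disagreements such as `img_003` are left out on purpose: they are more likely to reflect model uncertainty than an annotation error.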

In the near future, you can expect new experiences in FiftyOne that will allow for automated discovery and handling of common quality issues. Building off of our robust plugin ecosystem, you can expect features that not only automatically detect issues but also give you, the ML practitioners, full control over defining what issues mean in your use case and quick ways to solve them. Concretely, we will deliver along the following fronts:

Automatically identifying potential issues based on smart presets
The ability to modify said presets to adapt to your dataset and use-case needs
A novel dataset exploration experience along the axes of these issues: think of looking for bright or dark images, all at the touch of a slider.

All of this and more will be delivered through our newest feature, Python Panels!
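The idea of smart presets that you can adapt to your use case could look something like the following sketch. It does not describe the actual FiftyOne feature or the Python Panels API; the `IssuePreset` class, its fields, and the default thresholds are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class IssuePreset:
    """Thresholds that define what counts as an issue (all names hypothetical)."""

    max_brightness: float = 0.9
    min_brightness: float = 0.1
    min_sharpness: float = 100.0


def find_issues(metrics, preset):
    """Return the issue names that a sample's metrics trigger under `preset`."""
    issues = []
    if metrics["brightness"] > preset.max_brightness:
        issues.append("too_bright")
    if metrics["brightness"] < preset.min_brightness:
        issues.append("too_dark")
    if metrics["sharpness"] < preset.min_sharpness:
        issues.append("blurry")
    return issues


default_preset = IssuePreset()
# A team that expects blurry inputs at test time can relax the blur check
# without touching the other thresholds.
blur_tolerant = replace(default_preset, min_sharpness=0.0)

sample_metrics = {"brightness": 0.95, "sharpness": 40.0}
print(find_issues(sample_metrics, default_preset))  # ['too_bright', 'blurry']
print(find_issues(sample_metrics, blur_tolerant))   # ['too_bright']
```

Keeping the thresholds in one immutable object makes it easy to start from sensible defaults and override only what your use case requires, which is the same thing a slider does interactively.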

P.S. There is a sneak peek of this new panel in this post!

Conclusion & Next Steps

We explored how essential data quality is to building high-performing visual AI systems and the practical challenges that arise in the process. By equipping AI/ML specialists with tools like FiftyOne, we’re aiming to make data curation, analysis, and model building not only more efficient but also fundamentally better.

For those ready to take the next step, FiftyOne offers a variety of resources and community support to get you started:

FiftyOne Community Slack: Join thousands of fellow AI builders in our Slack community, where you can exchange ideas, ask questions, and get insights directly from experienced developers and scientists working on real-world AI challenges.
Getting Started Workshops: Attend one of our workshops, which cover everything you need to get up and running with FiftyOne for streamlined dataset and model workflows.
GitHub Repository: Access the FiftyOne GitHub repo to dive into our open-source code, tutorials, and sample projects designed to help you incorporate FiftyOne into your own AI/ML workflows.

Jimmy Guerrero
