Vector Databases for Data-Centric AI

Other homes for this article:

https://medium.com/@george.pearse

Vector Databases are one of the newest tools in the MLOps / Data Engineering space. They're designed to be efficient at nearest neighbour queries over embeddings while providing a simple CRUD interface for maintainability.

An embedding is the output of a layer of a Deep Learning model for a single input (one datapoint). Embeddings are learned representations in which, ideally, objects of the same class are projected near to each other.

The best vector databases let you combine metadata queries (e.g. the dataset split or class you want the results to belong to) with a nearest neighbour request, e.g. return the nearest neighbours to this input example that do not have the same label. This is harder to achieve with nearest neighbour libraries such as Faiss and Annoy because the index is built up-front and, in its basic form, does not accept arbitrary metadata filters at query time. To achieve an equivalent result you would typically need to return an excess of nearest neighbours and then apply the filter afterwards.
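To make the "return an excess and filter after" workaround concrete, here is a minimal sketch in plain Python. The brute-force `nearest` function only stands in for a real index, and the `overfetch` parameter, the toy vectors and the metadata fields are all assumptions made up for illustration.

```python
import math

def nearest(query, vectors, k):
    """Brute-force stand-in for an ANN index: ids of the k closest vectors."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return order[:k]

def post_filtered_search(query, vectors, metadata, predicate, k, overfetch=4):
    """Ask the index for k * overfetch neighbours, then drop those failing the filter."""
    candidates = nearest(query, vectors, k * overfetch)
    kept = [i for i in candidates if predicate(metadata[i])]
    return kept[:k]  # may hold fewer than k ids if too few candidates pass

# Toy data: four "cat" vectors sit closest to the query, the "dog" vectors are further out.
vectors = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0], [5.0, 5.0], [0.4, 0.0]]
metadata = [{"label": "cat"}] * 4 + [{"label": "dog"}] * 2
query = [0.0, 0.0]

def not_cat(item):
    return item["label"] != "cat"

few = post_filtered_search(query, vectors, metadata, not_cat, k=2, overfetch=1)
more = post_filtered_search(query, vectors, metadata, not_cat, k=2, overfetch=3)
print(few, more)
```

With `overfetch=1` every candidate is a "cat" and the filter leaves nothing, which is exactly the failure mode that makes post-filtering awkward: you have to guess how much to over-fetch. A database that applies the filter as part of the search does not have this problem.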

If you're not familiar with embeddings, I'd recommend the TensorFlow Embedding Projector for developing an intuition (the MNIST example with images is best).

It's important to note, though, that the embeddings undergo dimensionality reduction via PCA, t-SNE or UMAP in order to be projected into 3 dimensions.

The article "Not All Vector Databases Are Made Equal" (a detailed comparison of Milvus, Pinecone, Vespa, Weaviate, Vald, GSI and Qdrant) does an excellent job of comparing the best offerings currently available, but what might you actually want to use a Vector Database for? The use cases below are most relevant to image classification problems.

Active Learning
Got an error in a validation set and want to fix it? Retrieve the nearest neighbours of the misclassified example from your unlabelled dataset, get them labelled, and retrain. Repeat until the problem is reduced or resolved. If the nearest neighbour query does not return many similar examples, consider using a package that lets you increase the number of augmentations, or the weighting, of these instances.
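A rough sketch of that selection step, assuming embeddings are plain lists of floats with non-zero norm. The function name, the `max_distance` cut-off and the image ids are hypothetical; a real setup would run the query against the vector database rather than a Python dict.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def select_for_labelling(error_embedding, unlabelled, k=3, max_distance=0.2):
    """Pick up to k unlabelled items close to the error.

    Returns (ids, too_few). too_few is True when fewer than k items fall
    within max_distance, a hint to lean on augmentation or re-weighting.
    """
    scored = sorted(
        (cosine_distance(error_embedding, emb), item_id)
        for item_id, emb in unlabelled.items()
    )
    close = [item_id for dist, item_id in scored[:k] if dist <= max_distance]
    return close, len(close) < k

# Hypothetical embeddings of unlabelled images.
unlabelled = {
    "img_a": [0.9, 0.1],
    "img_b": [1.0, 0.05],
    "img_c": [0.0, 1.0],
    "img_d": [-1.0, 0.0],
}
to_label, needs_more = select_for_labelling([1.0, 0.0], unlabelled)
print(to_label, needs_more)
```

Here only two of the requested three neighbours are close enough, so `needs_more` is set and the two ids would go to the labelling queue.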
Unit Test Construction
Identified a specific type of problem that's particularly costly (e.g. the model can't distinguish between spoons and forks) and want to monitor your progress against it? Retrieve the nearest neighbours to an instance of the error case within the labelled set, provide a description of the problem, and track how performance on that set changes over time.
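One way to picture this is as a small, frozen "slice test": the seed error plus its nearest labelled neighbours, a description, and an accuracy score per model version. The `SliceTest` class and everything in the toy data below are made up for illustration, not taken from any particular library.

```python
import math
from dataclasses import dataclass, field

@dataclass
class SliceTest:
    """A named, frozen set of labelled examples that share one failure mode."""
    name: str
    description: str
    example_ids: list
    history: dict = field(default_factory=dict)  # model version -> accuracy

    def record(self, model_version, predictions, labels):
        """Store and return the accuracy of one model version on this slice."""
        correct = sum(predictions[i] == labels[i] for i in self.example_ids)
        self.history[model_version] = correct / len(self.example_ids)
        return self.history[model_version]

def build_slice_test(name, description, seed_id, embeddings, k):
    """Seed example plus its k nearest labelled neighbours (brute force)."""
    others = sorted(
        (i for i in embeddings if i != seed_id),
        key=lambda i: math.dist(embeddings[seed_id], embeddings[i]),
    )
    return SliceTest(name, description, [seed_id] + others[:k])

# Hypothetical labelled set: ids -> embeddings and ids -> labels.
embeddings = {"s1": [0.0, 0.0], "s2": [0.1, 0.0], "f1": [0.0, 0.2], "x": [9.0, 9.0]}
labels = {"s1": "spoon", "s2": "spoon", "f1": "fork", "x": "plate"}

test = build_slice_test("spoon-vs-fork", "Spoons predicted as forks", "s1", embeddings, k=2)
test.record("v1", {"s1": "fork", "s2": "fork", "f1": "fork", "x": "plate"}, labels)
test.record("v2", {"s1": "spoon", "s2": "spoon", "f1": "fork", "x": "plate"}, labels)
print(test.example_ids, test.history)
```

Freezing `example_ids` when the test is created matters: if the slice were re-queried for every model version, the numbers would not be comparable over time.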
Closest Counterfactual
Think the labels of your training dataset may be inconsistent? Look at the nearest instances with a different label. Consider getting your experts to review the examples and come to a consensus, adding further descriptions to your labelling rules, or keeping them as examples to use when training labellers. NB: here you may be better off using something like KNN conformity, or simply looking at the cases where there's the largest disagreement between your model and the label. Closest counterfactual is great, but it is quite manual and doesn't scale well compared to more systematic approaches.
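Both ideas fit in a few lines. In this sketch "KNN conformity" is taken to mean the fraction of a point's k nearest neighbours that share its label; that definition is my assumption, and the brute-force search and toy data are only for illustration.

```python
import math

def neighbours(i, embeddings):
    """Indices of all other points, nearest first (brute force)."""
    return sorted(
        (j for j in range(len(embeddings)) if j != i),
        key=lambda j: math.dist(embeddings[i], embeddings[j]),
    )

def closest_counterfactual(i, embeddings, labels):
    """Nearest point whose label differs from that of point i, or None."""
    for j in neighbours(i, embeddings):
        if labels[j] != labels[i]:
            return j
    return None

def knn_conformity(i, embeddings, labels, k=3):
    """Fraction of the k nearest neighbours of point i that share its label."""
    nearest = neighbours(i, embeddings)[:k]
    return sum(labels[j] == labels[i] for j in nearest) / len(nearest)

# Point 3 is labelled "fork" but sits inside the "spoon" cluster.
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.02, 0.01],
              [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]
labels = ["spoon", "spoon", "spoon", "fork", "fork", "fork", "fork"]

counterfactual_of_3 = closest_counterfactual(3, embeddings, labels)
conformity_of_3 = knn_conformity(3, embeddings, labels, k=3)
most_suspicious = min(range(len(labels)),
                      key=lambda i: knn_conformity(i, embeddings, labels, k=3))
print(counterfactual_of_3, conformity_of_3, most_suspicious)
```

The closest counterfactual gives you one pair to look at by hand, whereas sorting the whole dataset by conformity gives a review queue, which is why the latter tends to scale better.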
Finding Mislabelled Instances
Is your model getting a seemingly easy case wrong? Check its nearest neighbours within the training set for mislabelled instances, and check that the training set actually contains similar instances in the first place.
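The two checks could look roughly like this. The `coverage_radius` threshold is an invented parameter that you would need to tune for your own embedding space, and the flagged neighbours are only candidates for review, not confirmed mislabels.

```python
import math

def inspect_error(error_embedding, true_label, train_embeddings, train_labels,
                  k=3, coverage_radius=0.5):
    """Look at the k nearest training examples to a misclassified input.

    label_suspects: neighbours whose label differs from the error's true label.
    coverage_gap:   True if even the nearest training example is further away
                    than coverage_radius, i.e. nothing similar was trained on.
    """
    order = sorted(
        range(len(train_embeddings)),
        key=lambda i: math.dist(error_embedding, train_embeddings[i]),
    )[:k]
    suspects = [i for i in order if train_labels[i] != true_label]
    nearest_distance = math.dist(error_embedding, train_embeddings[order[0]])
    return {
        "neighbours": order,
        "label_suspects": suspects,
        "coverage_gap": nearest_distance > coverage_radius,
    }

# Hypothetical training set.
train_embeddings = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [4.0, 4.0]]
train_labels = ["fork", "spoon", "fork", "knife"]

near_case = inspect_error([0.05, 0.0], "fork", train_embeddings, train_labels)
far_case = inspect_error([10.0, 10.0], "fork", train_embeddings, train_labels)
print(near_case, far_case["coverage_gap"])
```

In the first case the "spoon" at index 1 sits between two forks and is worth a second look; in the second the training set simply has nothing nearby, so relabelling will not help and more data will.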

Ideally you want your Vector Database to be updated directly from your live ML service so that you always have access to the latest embeddings and don't have to maintain a separate batch pipeline just for the task. Let me know if you have any other uses of Vector Databases (particularly if valuable in image classification) and I'll add them to the list.

GeorgePearse
