How to implement image search?
IronSpiderMan
Posted on December 19, 2023
Preface
Many search engines offer a built-in reverse image search feature, which can greatly simplify searching. But have you ever wondered how searching by image actually works?
In this post we will analyze the underlying principles and implement an image search function ourselves.
Let's first look at what makes searching by image difficult. The first problem is how to compare the similarity of two images. What does it even mean for images to be similar? A person can judge at a glance, but a computer cannot: to it, an image is just a matrix of numbers, so comparing images means comparing matrices, and that raises several issues.
The second problem is size. Images usually have different dimensions, and matrices of different shapes cannot be compared directly. This one is easy to solve: just resize the images to a common size.
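For example, with Pillow (which we also use later) two images can be brought to the same, arbitrarily chosen size in one line each; the file names here are just placeholders:

from PIL import Image

# Resize both images to a common 224x224 size so their pixel matrices have the same shape.
img_a = Image.open("a.jpg").resize((224, 224))
img_b = Image.open("b.jpg").resize((224, 224))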
The third problem is that raw pixels carry very limited information and cannot express abstract properties such as painting style, the objects depicted, or color tone.
So we are left with two questions: what representation should replace raw pixels, and how should we measure similarity?
Image features
Representing an image by its raw pixels has many drawbacks, and computer vision offers plenty of alternatives, such as SIFT descriptors or the feature maps of a neural network. In these approaches the image is represented as a vector, which we call an embedding.
Embeddings are a powerful idea that has been widely used in NLP. Once content is encoded as vectors, we can perform algebraic operations on it: computing similarity, semantic addition and subtraction, and so on. The figure below shows an example of word embeddings: "tiger" sits close to "cat", while "car" sits close to "auto".
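As a toy illustration (these 3-dimensional vectors are made up for the example, not real embeddings), cosine similarity captures that kind of closeness:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up vectors purely for illustration.
tiger = np.array([0.9, 0.8, 0.1])
cat = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(tiger, cat))  # high: "tiger" and "cat" are close
print(cosine_similarity(tiger, car))  # low: "tiger" and "car" are far apart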
The question now is how to extract an image embedding. We could use convolutional networks such as ResNet or AlexNet, or we could use a ViT (Vision Transformer). Here we choose ViT.
ViT encodes an image into a matrix of shape (patches + 1) × embedding_dim, where the extra row corresponds to the [CLS] token. We take that first row as the embedding of the whole image.
Below is the PyTorch code:
import glob
import hashlib
import random

import numpy as np
import torch
import chromadb
from tqdm import tqdm
from PIL import Image
from transformers import ViTImageProcessor, ViTModel
from chromadb import Documents, EmbeddingFunction, Embeddings

device = "cuda" if torch.cuda.is_available() else "cpu"


class VitEmbeddingFunction(EmbeddingFunction):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
        self.model = ViTModel.from_pretrained("google/vit-base-patch16-224").to(device)

    def __call__(self, images: Documents) -> Embeddings:
        # chromadb passes in the stored "documents", which here are image file paths.
        images = [Image.open(image) for image in images]
        inputs = self.processor(images, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Take the first row ([CLS] token) of the last hidden state as the image embedding.
        last_hidden_state = outputs.last_hidden_state
        return last_hidden_state[:, 0, :].cpu().numpy().tolist()
Note that in recent versions of chromadb, the parameter of EmbeddingFunction's __call__ method must be named input.
To get an image's embedding we only need to create a VitEmbeddingFunction instance and call it with the image path. In practice, though, this step can be left to chromadb.
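If you are on one of those newer chromadb versions, the method would need to be renamed along these lines (same body as above, only the parameter name changes):

    # Inside VitEmbeddingFunction:
    def __call__(self, input: Documents) -> Embeddings:
        images = [Image.open(path) for path in input]
        inputs = self.processor(images, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = self.model(**inputs)
        return outputs.last_hidden_state[:, 0, :].cpu().numpy().tolist()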
Vector database
A vector database is not strictly necessary, but it can speed up searches considerably. Before reaching for one, let's talk about how the search itself works.
We can use Euclidean distance as our similarity measure. For two embeddings e1 and e2 it is computed as:
distance = np.sqrt(np.sum((e1 - e2) ** 2))
The smaller the distance, the more similar the two images.
A linear scan over every stored embedding will find the image closest to our query, but this is slow; a tree-like structure makes the search far more efficient.
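Here is a sketch of that brute-force baseline, assuming query is a single embedding and embeddings is a NumPy array with one row per stored image:

import numpy as np

def linear_search(query, embeddings):
    # Euclidean distance from the query to every stored embedding, then pick the closest.
    distances = np.sqrt(np.sum((embeddings - query) ** 2, axis=1))
    return int(np.argmin(distances))  # index of the most similar image, O(n)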
For scalar numbers we can build a binary search tree, but how do we build a tree over vectors?
This is where clustering comes in. We divide all images into m clusters. At query time we first find the nearest cluster center, then look for the nearest image within that cluster only. This cuts the cost from O(n) to roughly O(m + n/m). We can also cluster recursively inside each cluster to form a tree, bringing the cost down to roughly O(log n).
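Below is a minimal sketch of that cluster-then-search idea using scikit-learn's KMeans; it only illustrates the principle and is not how chromadb implements its index:

import numpy as np
from sklearn.cluster import KMeans

def build_index(embeddings, m=16):
    # Partition all stored embeddings into m clusters.
    return KMeans(n_clusters=m).fit(embeddings)

def clustered_search(query, embeddings, kmeans):
    # Step 1: find the nearest cluster center, O(m).
    nearest_cluster = int(np.argmin(np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)))
    # Step 2: linear search only inside that cluster, roughly O(n/m).
    member_ids = np.where(kmeans.labels_ == nearest_cluster)[0]
    member_dists = np.linalg.norm(embeddings[member_ids] - query, axis=1)
    return int(member_ids[np.argmin(member_dists)])

Note that probing only one cluster can miss the true nearest neighbor when it lies just across a cluster boundary; that accuracy/speed trade-off is typical of this kind of index.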
A vector database builds this kind of index for us automatically, so we can simply use it.
Below is the Python code:
client = chromadb.PersistentClient(path="image_search")
collection = client.get_or_create_collection(name="images", embedding_function=VitEmbeddingFunction())

files = glob.glob(r"G:\datasets\lbxx\lbxx\*")


# md5encode is not shown in the original post; a plausible helper simply hashes the file path.
def md5encode(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# You should store the embeddings the first time you run this.
# bar = tqdm(total=len(files))
# for file in files:
#     collection.add(
#         documents=[file],
#         metadatas=[{"source": file}],
#         ids=[md5encode(file)]
#     )
#     bar.update(1)

# file = random.choice(files)
file = r"G:\datasets\lbxx\lbxx\10067.jpeg"
similar_images = collection.query(
    query_texts=[file],
    n_results=10,
)
similar_images = similar_images['documents']

# Show the query image side by side with its ten nearest neighbors.
images = [file, *similar_images[0]]
image = np.hstack([np.array(Image.open(image).resize((512, 512))) for image in images])
Image.fromarray(image).show()
Below are the search results. The image on the far left is the query image, and the ones to its right are the retrieved results.
The code has been uploaded to GitHub: ImageSearch.