Cosine Similarity Search on Vectors in Postgres with pgvector
AJAY SHRESTHA
Posted on November 25, 2024
In the realm of modern data processing, vector embeddings have become increasingly popular for applications in recommendation systems, search engines, NLP, and more. Vectors allow us to represent complex data (like documents, images, or user profiles) in a way that is both mathematically consistent and computationally efficient. Among various similarity measures, Cosine Similarity is widely used because it effectively measures the angle between two vectors, offering a reliable way to determine similarity. This article explores how to perform Cosine Similarity searches in PostgreSQL using the pgvector extension.
What is Cosine Similarity?
Cosine Similarity calculates the cosine of the angle between two vectors in a multi-dimensional space. The formula for cosine similarity between vectors A and B is:
cos(θ) = (A⋅B) / (∣∣A∣∣ × ∣∣B∣∣)
Where:
- A⋅B is the dot product of vectors A and B.
- ∣∣A∣∣ and ∣∣B∣∣ are the magnitudes (or lengths) of vectors A and B.
Cosine similarity is particularly effective for text data, where we want to ignore differences in magnitude and focus on the direction of vectors. This is why it's commonly used for tasks like text classification, image retrieval, and recommendation systems.
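To make the formula concrete, here is a minimal pure-Python sketch. The function name `cosine_similarity` is chosen for illustration and is not part of pgvector. It shows that scaling a vector does not change the result, which is why magnitude is ignored.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: the result is (approximately) 1
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# Orthogonal vectors: the result is 0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

The result ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction).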
Introducing pgvector
Postgres's pgvector extension allows us to store and perform similarity searches on vectors. pgvector supports several distance metrics, including:
- Euclidean distance (L2): Measures the straight-line distance between two points in Euclidean space.
- Inner product: Calculates the dot product of two vectors.
- Cosine distance: One minus the cosine of the angle between two vectors, often preferred when direction matters more than magnitude.
pgvector exposes these distance functions as operators:
<+> - L1 distance: Also known as Manhattan distance (added in 0.7.0).
<-> - L2 distance: The standard Euclidean distance.
<=> - Cosine distance: Equal to 1 minus the cosine similarity, so 0 means the vectors point in the same direction.
<#> - Negative inner product: The inner product multiplied by -1, so that sorting in ascending order returns the largest inner products first.
<~> - Hamming distance: For binary vectors, counts the number of positions with differing bits (added in 0.7.0).
<%> - Jaccard distance: For binary vectors, measures dissimilarity between sample sets (added in 0.7.0).
For applications where direction is more important than magnitude, cosine similarity is often preferred.
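As a rough reference for what each operator measures, the sketch below re-implements the metrics in plain Python. These helper functions are illustrative only and are not pgvector's API. Binary vectors are represented here as strings of "0" and "1", and the Jaccard helper assumes at least one bit is set.

```python
import math


def l1_distance(a, b):  # corresponds to <+>
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):  # corresponds to <->
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):  # corresponds to <=>
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def negative_inner_product(a, b):  # corresponds to <#>
    return -sum(x * y for x, y in zip(a, b))


def hamming_distance(a, b):  # corresponds to <~>, on bit strings like "1011"
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(a, b):  # corresponds to <%>, on bit strings
    both = sum(x == "1" and y == "1" for x, y in zip(a, b))
    either = sum(x == "1" or y == "1" for x, y in zip(a, b))
    return 1.0 - both / either  # assumes at least one bit is set


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # same direction
print(negative_inner_product([1, 2], [3, 4]))
```

Note that for every operator a smaller value means "closer", which is why the cosine and inner product operators return a distance and a negated product rather than the raw similarity.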
Setting up pgvector in PostgreSQL
To use pgvector, the extension must first be installed on your PostgreSQL server. Once it is available, enable it in your database by running:
CREATE EXTENSION IF NOT EXISTS vector;
To perform a cosine similarity search on vectors stored in a Django model, you can use the cosine distance operator provided by pgvector. Here's how you can define the model in Django:
from django.db import models
from pgvector.django import HnswIndex, VectorField
from sentence_transformers import SentenceTransformer

# Load SentenceTransformer model once
model = SentenceTransformer('all-MiniLM-L6-v2')


class Blog(models.Model):
    title = models.CharField(max_length=255, db_index=True)
    slug = models.SlugField(unique=True, max_length=255)
    detail = models.TextField()
    embeddings = VectorField(dimensions=384)

    class Meta:
        indexes = [
            # pgvector's index types are HNSW and IVFFlat; a GIN index
            # cannot be used with vector_cosine_ops.
            HnswIndex(
                name='blog_embeddings_hnsw_idx',  # illustrative name
                fields=['embeddings'],
                m=16,
                ef_construction=64,
                opclasses=['vector_cosine_ops'],
            )
        ]

    def save(self, *args, **kwargs):
        # Calculate embeddings directly using SentenceTransformer
        blog_text = f'{self.title} - {self.detail}'
        self.embeddings = model.encode(blog_text).tolist()
        super().save(*args, **kwargs)
In this model, each blog post’s text is converted into a 384-dimensional vector using a pre-trained model from SentenceTransformer. These embeddings are then stored in a VectorField provided by pgvector.
Testing Embeddings with Blog Entry
# Create a Blog Entry (slug is unique, so give each entry its own value)
blog = Blog(title="Test Blog", slug="test-blog", detail="This is a sample blog post.")
blog.save()
# Verify Embeddings
print(blog.embeddings)  # Outputs a 384-dimensional embedding
Searching for Similar Blogs
We use the cosine similarity measure to find blogs similar to a given query. Here is a Django function that searches for similar blogs:
import numpy as np
from django.db.models import FloatField
from django.db.models.expressions import RawSQL

from .models import Blog

EMBEDDING_DIMENSIONS = 384  # Must match VectorField(dimensions=384) on Blog


def find_similar_blog(query_vector, top_n=10):
    """
    Finds blogs with the highest cosine similarity to the query vector.

    Args:
    - query_vector: Numpy array or list representing the query vector.
    - top_n: Number of top results to return.

    Returns:
    - QuerySet of Blog objects with an additional 'similarity' field.
    """
    if isinstance(query_vector, np.ndarray):
        query_vector = query_vector.tolist()  # Convert numpy array to list
    # Reject vectors of the wrong size instead of silently truncating them
    if len(query_vector) != EMBEDDING_DIMENSIONS:
        raise ValueError(f"query_vector must have {EMBEDDING_DIMENSIONS} dimensions")

    # Using RawSQL to calculate cosine similarity as 1 - cosine distance
    similarity_annotation = RawSQL(
        "1 - (embeddings <=> %s::vector)",
        (query_vector,),
        output_field=FloatField()
    )

    # Query the database and annotate each Blog with the similarity score
    similar_blogs = Blog.objects.annotate(similarity=similarity_annotation).order_by('-similarity')[:top_n]
    return similar_blogs
This function takes a query vector and finds the top n similar blog posts by annotating each Blog object with a similarity score. The score is calculated with pgvector's <=> operator, which computes the cosine distance; subtracting that distance from 1 gives the cosine similarity.
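One possible alternative, sketched here under assumptions, is to order by the cosine distance itself in ascending order using the `CosineDistance` expression from `pgvector.django`. Sorting directly on the distance operator is the query shape pgvector's approximate indexes are designed for, so it may be a better fit when an index exists. The function names below are hypothetical, and the model class is passed in as a parameter to keep the sketch independent of the project layout.

```python
def similarity_from_distance(cosine_distance):
    """Convert a cosine distance (as returned by <=>) into a cosine similarity."""
    return 1.0 - cosine_distance


def find_similar_blog_by_distance(blog_model, query_vector, top_n=10):
    """Hypothetical variant of find_similar_blog that sorts by distance ascending."""
    # Imported lazily so the sketch can be loaded without pgvector installed
    from pgvector.django import CosineDistance

    return (
        blog_model.objects
        .annotate(distance=CosineDistance("embeddings", query_vector))
        .order_by("distance")[:top_n]
    )
```

Each returned object would then carry a `distance` attribute, and `similarity_from_distance(blog.distance)` recovers the similarity score when it is needed for display.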
Example of Using find_similar_blog function
# Retrieve the blog entry from the database
blog = Blog.objects.first()
# Extract the embeddings of the blog
blog_vector = blog.embeddings
# Pass the blog vector and the number of results you want to the function
similar_blogs = find_similar_blog(blog_vector, top_n=5)
# Iterate through the returned QuerySet to display the similar blogs
# (the source blog itself is included, with a similarity of about 1)
for similar_blog in similar_blogs:
    print(f"{similar_blog.title}")
This method is particularly useful for content-based recommendation systems, where you might want to show users content similar to what they are reading. Using the embeddings of an existing blog post as the query has several benefits:
- Relevance: The search results are based on semantic similarity in the context of the content, which can enhance user engagement by providing more relevant recommendations.
- Ease of Use: Directly using database-stored vectors simplifies the process as there’s no need to compute the embeddings in real time for the query.
- Efficiency: Since the embeddings are precomputed and stored, this method is efficient and leverages the fast vector search capabilities of pgvector.
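The recommendation idea can be illustrated without a database. The sketch below is a brute-force, in-memory stand-in for what the query does, not how pgvector works internally. It ranks every other post by cosine similarity to the one being read and skips the source post itself. The slugs and 3-dimensional vectors are made-up values for illustration only.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def recommend(source_slug, embeddings_by_slug, top_n=5):
    """Rank every other post by cosine similarity to the source post."""
    source_vector = embeddings_by_slug[source_slug]
    scored = [
        (slug, cosine_similarity(source_vector, vector))
        for slug, vector in embeddings_by_slug.items()
        if slug != source_slug  # do not recommend the post being read
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy "embeddings" (hypothetical values, far smaller than real 384-dimensional ones)
posts = {
    "intro-to-pgvector": [0.9, 0.1, 0.0],
    "vector-indexes": [0.8, 0.2, 0.1],
    "sourdough-basics": [0.0, 0.1, 0.9],
}
for slug, score in recommend("intro-to-pgvector", posts, top_n=2):
    print(slug, round(score, 3))
```

In the Django version, the same effect could likely be achieved by chaining `.exclude(pk=blog.pk)` onto the queryset before slicing it.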
Implementing cosine similarity with pgvector in PostgreSQL offers a robust way to enhance the semantic search capabilities of your applications. By understanding and leveraging vector search, you can significantly improve the relevancy and precision of search results, providing a better user experience in content-heavy applications like blogs or news sites.