adaboese
Posted on January 17, 2024
After spending a significant amount of time working with vector embeddings, I've started to see more and more use cases for them in everyday problems. One of the most interesting ones I've seen is using vector embeddings to find the page that the user was looking for when they hit a 404 page.
What is a vector embedding?
A vector embedding is a way to represent a word or phrase as a vector, i.e. a list of numbers. This is useful because it allows us to do math on words and phrases. For example, we can find the word that is closest to another word by finding the word whose vector has the smallest distance to it.
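To make the "closest word" idea concrete, here is a minimal sketch. The three-dimensional vectors and their values are made up for illustration (real models produce hundreds of dimensions), and `toyVectors`, `euclideanDistance` and `nearestWord` are hypothetical names, not part of any library.

```typescript
// Toy 3-dimensional "embeddings" with made-up values, for illustration only.
const toyVectors: Record<string, number[]> = {
  cat: [0.9, 0.1, 0.0],
  kitten: [0.8, 0.2, 0.1],
  car: [0.1, 0.9, 0.3],
};

// Euclidean distance between two vectors of equal length.
const euclideanDistance = (a: number[], b: number[]): number => {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += (a[i] - b[i]) ** 2;
  }
  return Math.sqrt(sum);
};

// Find the word (other than `word` itself) whose vector is closest.
const nearestWord = (word: string): string | null => {
  const target = toyVectors[word];
  if (!target) return null;
  let best: string | null = null;
  let bestDistance = Infinity;
  for (const [candidate, vector] of Object.entries(toyVectors)) {
    if (candidate === word) continue;
    const distance = euclideanDistance(target, vector);
    if (distance < bestDistance) {
      best = candidate;
      bestDistance = distance;
    }
  }
  return best;
};

console.log(nearestWord('cat')); // 'kitten' with these made-up values
```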
Financial Times has a great interactive article that explains vector embeddings in more detail.
How can we use vector embeddings to find the page that the user was looking for?
We can use vector embeddings to find the page that the user was looking for by finding the page whose vector has the smallest distance to the vector of the user's query. In the context of a 404 page, the user's query is the URL that they were trying to access.
It is surprisingly simple:
- we need to create a database of all the pages on our site
- we need to create a vector embedding for each page URL
- we need to create a vector embedding for the user's query
- we need to find the page with the smallest distance between the vector of the page and the vector of the user's query
In the case of AIMD, I am doing all of this in memory, but you could also use a vector database (e.g. Pinecone). It all depends on how much data you have and how much compute you have available.
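The four steps above can be sketched end to end without any model at all. Note the big assumption: `toyEmbed` is a hypothetical stand-in that hashes character trigrams into a vector, so it only captures spelling overlap, not meaning. The example.com URLs and the names `buildIndex` and `findNearest` are also made up for this sketch.

```typescript
type Entry = { url: string; vector: number[] };

// HYPOTHETICAL stand-in for a real embedding model: hashes character
// trigrams into a fixed-size, L2-normalized vector.
const toyEmbed = (text: string, dimensions = 256): number[] => {
  const vector = new Array<number>(dimensions).fill(0);
  const input = text.toLowerCase();
  for (let i = 0; i + 3 <= input.length; i++) {
    let hash = 0;
    for (let j = i; j < i + 3; j++) {
      hash = (hash * 31 + input.charCodeAt(j)) % dimensions;
    }
    vector[hash] += 1;
  }
  const norm = Math.sqrt(vector.reduce((sum, v) => sum + v * v, 0)) || 1;
  return vector.map((value) => value / norm);
};

// For normalized vectors, the dot product equals cosine similarity.
const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, value, i) => sum + value * b[i], 0);

// Steps 1 and 2: index every known URL.
const buildIndex = (urls: string[]): Entry[] =>
  urls.map((url) => ({ url, vector: toyEmbed(url) }));

// Steps 3 and 4: embed the query and pick the most similar entry.
const findNearest = (index: Entry[], query: string): Entry | null => {
  const queryVector = toyEmbed(query);
  let best: Entry | null = null;
  let bestScore = -Infinity;
  for (const entry of index) {
    const score = dot(queryVector, entry.vector);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best;
};

const index = buildIndex([
  'https://example.com/blog/top-seo-trends-for-2024',
  'https://example.com/blog/maximizing-article-visibility',
]);
console.log(findNearest(index, 'https://example.com/blog/top-seo-trend-2024')?.url);
```

A linear scan like this is fine for a few hundred or a few thousand pages, which is why an in-memory array can be enough for a small site.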
Deciding on a vector embedding model
The first step is to decide on a vector embedding model. I am using Supabase/gte-small because it is a small model and it outperforms OpenAI's text-embedding-ada-002 model.
I wrote this abstraction that creates a vector embedding for a given text:
import { pipeline } from '@xenova/transformers';
export const generateEmbedding = async (subject: string): Promise<number[]> => {
  // Named `extractor` so that it does not shadow the exported function.
  const extractor = await pipeline(
    'feature-extraction',
    'Supabase/gte-small',
  );
  // Mean-pool the token embeddings and normalize the result to unit length.
  const result = await extractor(subject, {
    normalize: true,
    pooling: 'mean',
  });
  if (result.type === 'float32') {
    return Array.from(result.data) as number[];
  }
  throw new Error('Expected embedding type to be float32');
};
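One thing worth considering is that the function above creates the pipeline on every call. A minimal sketch of loading it only once follows. To keep it runnable without downloading a model, `loadModel` is a hypothetical stand-in for the `pipeline(...)` call, and `once` and `embed` are names I made up for the example.

```typescript
// Wraps an async factory so it runs at most once; later calls reuse the
// same promise. If loading fails, the cache is cleared so a retry is possible.
function once<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) {
      cached = factory().catch((error) => {
        cached = undefined;
        throw error;
      });
    }
    return cached;
  };
}

// HYPOTHETICAL stand-in for `pipeline('feature-extraction', ...)`.
// It returns a fake "model" that maps text to a one-element vector.
let loadCount = 0;
const loadModel = async (): Promise<(text: string) => number[]> => {
  loadCount += 1;
  return (text) => [text.length];
};

const getModel = once(loadModel);

// Every call shares the same loaded model.
const embed = async (text: string): Promise<number[]> => {
  const model = await getModel();
  return model(text);
};
```

Caching the promise rather than the resolved value means that concurrent first calls also share a single load.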
Creating a database of all the pages on our site
The next step is to create a database of all the pages on our site.
Let's assume that we have an array of all the pages on our site:
type SitemapEntry = {
  loc: string;
};
const staticPages: SitemapEntry[] = [
  {
    loc: 'https://aimd.app/',
  },
  {
    loc: 'https://aimd.app/blog',
  },
  {
    loc: 'https://aimd.app/blog/2024-01-15-top-seo-trends-for-2024-what-should-you-focus-on',
  },
  {
    loc: 'https://aimd.app/blog/2024-01-07-maximizing-article-visibility-understanding-and-applying-e-e-a-t-in-seo',
  },
  // ...
];
We can then create a database of all the pages on our site by creating a vector embedding for each page URL:
type Metadata = Record<string, string | number | boolean>;
type DatabaseEntry = {
  metadata: Metadata;
  vector: number[];
};
const entries: DatabaseEntry[] = [];
for (const page of staticPages) {
  const vector = await generateEmbedding(page.loc);
  entries.push({
    metadata: {
      url: page.loc,
    },
    vector,
  });
}
entries is now a database of all the pages on our site.
Finding the page with the smallest distance between the vector of the page and the vector of the user's query
The last step is to find the page with the smallest distance between the vector of the page and the vector of the user's query.
Let's assume that we have a user's query:
const query = 'https://aimd.app/blog/2024-01-17-using-vector-embeddings-to-overengineer-404-pages';
First, we need to create a vector embedding for the user's query:
const queryVector = await generateEmbedding(query);
Then, we need a way to measure how close two vectors are. For this, we can use cosine similarity, where a higher value means the vectors are more alike:
import similarity from 'compute-cosine-similarity';
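If you would rather not add a dependency, cosine similarity is short enough to write by hand. This is a minimal sketch, not the implementation used by the package above; `cosineSimilarity` and `cosineDistance` are my own names. Because generateEmbedding passes normalize: true, the vectors already have unit length, so in that case the dot product alone would give the same number.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1;
// higher means the vectors point in more similar directions.
const cosineSimilarity = (a: number[], b: number[]): number => {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Vectors must be non-empty and of equal length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine similarity is undefined for zero vectors');
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

// Cosine distance is 1 - similarity, so "highest similarity" and
// "smallest distance" select the same entry.
const cosineDistance = (a: number[], b: number[]): number =>
  1 - cosineSimilarity(a, b);
```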
Finally, we can find the page whose vector has the highest cosine similarity to the vector of the user's query (which is the same as the smallest distance):
type Match = {
  entry: DatabaseEntry | null;
  score: number;
};
const closestEntry = entries.reduce<Match>((closest, entry) => {
  // Higher cosine similarity means a closer match, so keep the maximum.
  const score = similarity(queryVector, entry.vector);
  if (score > closest.score) {
    return { entry, score };
  }
  return closest;
}, {
  entry: null,
  score: -Infinity,
});
closestEntry.entry is now the page whose URL is most similar to the URL the user was looking for.
The best part is that this does not even need to be the exact page that the user was looking for, e.g. in case the page was removed. It will be whichever page has the most similar URL to the page the user was looking for.
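Since the best match is only a guess, one variation is to offer a few candidates instead of a single one. A minimal sketch, assuming normalized vectors (so the dot product equals cosine similarity); `topSuggestions`, `Suggestion` and the default limit of 3 are hypothetical choices for this example.

```typescript
type IndexedPage = { metadata: { url: string }; vector: number[] };
type Suggestion = { url: string; score: number };

// Dot product; equals cosine similarity when both vectors are normalized.
const dotProduct = (a: number[], b: number[]): number =>
  a.reduce((sum, value, i) => sum + value * b[i], 0);

// Scores every page against the query and returns the `limit` best,
// highest score first. The input array is not modified.
const topSuggestions = (
  queryVector: number[],
  pages: IndexedPage[],
  limit = 3,
): Suggestion[] =>
  pages
    .map((page) => ({
      url: page.metadata.url,
      score: dotProduct(queryVector, page.vector),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
```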
Using with Remix
Just to complete the example, here is how you would use this with Remix:
// app/routes/$.tsx
import { type MetaFunction } from '@remix-run/node';
import { Link, useLoaderData } from '@remix-run/react';
import { json, type LoaderFunctionArgs } from '@remix-run/server-runtime';
import { findNearestUrl } from '#app/services/sitemap.server';
export const meta: MetaFunction = () => {
  return [
    {
      title: '404',
    },
  ];
};
export const loader = async ({ request }: LoaderFunctionArgs) => {
  const nearestUrl = await findNearestUrl(request.url);
  // Respond with a 404 status so that the missing URL is not served as a regular page.
  return json(
    {
      nearestUrl,
    },
    { status: 404 },
  );
};
const Route = () => {
  const data = useLoaderData<typeof loader>();
  return (
    <div>
      <h1>404</h1>
      <p>Were you looking for <Link to={data.nearestUrl.url}>{data.nearestUrl.url}</Link>?</p>
    </div>
  );
};
export default Route;
Now your 404 page will suggest the page that the user was most likely looking for.
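One detail to watch: request.url is the full URL as the server saw it, so it can carry a different origin (e.g. localhost in development), a query string, or a trailing slash, none of which appear in the sitemap URLs. A minimal sketch of cleaning it up before embedding follows. `normalizeRequestUrl` and its `canonicalOrigin` parameter are hypothetical and not part of Remix.

```typescript
// Rebuilds the requested URL on the canonical origin and drops the query
// string, fragment, and trailing slashes before it is embedded.
// `canonicalOrigin` is a hypothetical parameter, e.g. 'https://aimd.app'.
const normalizeRequestUrl = (
  requestUrl: string,
  canonicalOrigin: string,
): string => {
  const { pathname } = new URL(requestUrl);
  const { origin } = new URL(canonicalOrigin);
  // Keep a bare "/" for the root, otherwise strip trailing slashes.
  const trimmedPath = pathname.replace(/\/+$/, '') || '/';
  return `${origin}${trimmedPath}`;
};

console.log(
  normalizeRequestUrl(
    'http://localhost:3000/blog/some-post/?utm_source=newsletter',
    'https://aimd.app',
  ),
); // 'https://aimd.app/blog/some-post'
```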
Examples of this in the wild
Here is how that looks once deployed:
- 2024-01-07-maximizing-article-visibility-e-e-a-t-in-seo
- 2024-01-07-maximizing-article-visibility-understanding-applying-e-e-a-t-in-seo
- 2024-01-07-maximizing-article-visibility-understanding-and-applying-eeat-in-seo
Even though none of these pages exist, they all produce a 404 page that links to the correct page.
In practice, this type of hint will be most useful for pages that were removed or renamed, e.g. I have accidentally introduced numerous 404s on this site by changing the dates of the posts.
Do we even need 404 pages?
This is a bit of a tangent, but I think it is worth mentioning that we might not even need 404 pages. Instead, we could just redirect the user to the page that they were looking for.
Realistically, the only reason we have 404 pages is because we don't know what the user was looking for. But if we can use vector embeddings to find the page that the user was looking for, then we can just redirect them to that page.
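If I were to try this, I would probably only redirect when the match is very strong and fall back to the 404 page otherwise. A minimal sketch of that decision follows. The threshold of 0.95 is a made-up placeholder that would need tuning against real data, and `decide`, `Decision` and `REDIRECT_THRESHOLD` are hypothetical names.

```typescript
type Decision =
  | { kind: 'redirect'; status: 302; location: string }
  | { kind: 'not-found'; status: 404; suggestion: string | null };

// HYPOTHETICAL threshold; a sensible value depends on the model and on
// the site's URLs.
const REDIRECT_THRESHOLD = 0.95;

// Redirect only on a very strong match; otherwise keep the 404 page and
// pass along the best guess (if any) as a suggestion.
const decide = (
  match: { url: string; score: number } | null,
  threshold = REDIRECT_THRESHOLD,
): Decision => {
  if (match && match.score >= threshold) {
    return { kind: 'redirect', status: 302, location: match.url };
  }
  return { kind: 'not-found', status: 404, suggestion: match?.url ?? null };
};
```

The sketch uses a temporary redirect (302) rather than a permanent one (301) so that a wrong guess is less likely to stick in browser caches.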
I will be experimenting with this on AIMD in the future.