Embeddings index components
David Mezzetti
Posted on August 24, 2022
The main components of txtai are embeddings, pipeline, workflow and an api. The following shows the top level view of the txtai src tree.
Abbreviated listing of src/txtai
ann
api
database
embeddings
pipeline
scoring
vectors
workflow
One might ask, why are ann, database, scoring and vectors top level packages and not under the embeddings package? The reason is that each of these packages is modular and can be used on its own! The embeddings package provides the glue between these components, making everything easy to use.
This article will go through a series of examples demonstrating how these components can be used standalone as well as combined together to build custom search indexes.
Note: This is intended as a deep dive into txtai embeddings components. There are much simpler high-level APIs for standard use cases.
Install dependencies
Install txtai and all dependencies.
# Install txtai
pip install txtai datasets
Load dataset
This example will use the ag_news dataset, which is a collection of news article headlines.
from datasets import load_dataset
dataset = load_dataset("ag_news", split="train")
Approximate nearest neighbor (ANN) and Vectors
In this section, we'll use the ann and vectors packages to build a similarity index over the ag_news dataset.
The first step is vectorizing the text. We'll use a sentence-transformers model.
import numpy as np
from txtai.vectors import VectorsFactory
model = VectorsFactory.create({"path": "sentence-transformers/all-MiniLM-L6-v2"}, None)
# List of all text elements
texts = dataset["text"]
# Create embeddings buffer, vector model has 384 features
embeddings = np.zeros(dtype=np.float32, shape=(len(texts), 384))
# Vectorize text in batches
batch, index, batchsize = [], 0, 128
for text in texts:
    batch.append(text)
    if len(batch) == batchsize:
        vectors = model.encode(batch)
        embeddings[index : index + vectors.shape[0]] = vectors
        index += vectors.shape[0]
        batch = []
# Last batch
if batch:
    vectors = model.encode(batch)
    embeddings[index : index + vectors.shape[0]] = vectors
# Normalize embeddings so that inner product equals cosine similarity
embeddings /= np.linalg.norm(embeddings, axis=1)[:, np.newaxis]
# Print shape
embeddings.shape
(120000, 384)
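The batching pattern above can also be factored into a small generator. The following is a minimal standard-library sketch; the `batches` helper is a hypothetical name used for illustration and is not part of txtai.

```python
from typing import Iterable, Iterator, List


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch
    if batch:
        yield batch


# Five items in batches of two: the last batch holds the remainder
chunks = list(batches(["a", "b", "c", "d", "e"], 2))
print(chunks)
```

With a helper like this, the vectorization loop reduces to encoding each yielded batch and copying the result into the next slice of the buffer.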
Next we'll build a vector index using these embeddings!
from txtai.ann import ANNFactory
# Create Faiss index using normalized embeddings
ann = ANNFactory.create({"backend": "faiss"})
ann.index(embeddings)
# Show total
ann.count()
120000
Now let's run a search.
query = model.encode(["best planets to explore for life"])
query /= np.linalg.norm(query)
for uid, score in ann.search(query, 3)[0]:
    print(uid, texts[uid], score)
17752 Rocky Road: Planet hunting gets closer to Earth Astronomers have discovered the three lightest planets known outside the solar system, moving researchers closer to the goal of finding extrasolar planets that resemble Earth. 0.599043607711792
16158 Earth #39;s #39;big brothers #39; floating around stars Washington - A new class of planets has been found orbiting stars besides our sun, in a possible giant leap forward in the search for Earth-like planets that might harbour life. 0.5688529014587402
45029 Coming Soon: "Good" Jupiters Most of the extrasolar planets discovered to date are gas giants like Jupiter, but their orbits are either much closer to their parent stars or are highly eccentric. Planet hunters are on the verge of confirming the discovery of Jupiter-size planets with Jupiter-like orbits. Solar systems that contain these "good" Jupiters may harbor habitable Earth-like planets as well. 0.5606889724731445
And there it is, a full vector search system without using the embeddings package.
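To make the role of normalization concrete, here is a minimal sketch of exact (brute-force) nearest neighbor search in plain Python. It is not how Faiss works internally; it only illustrates the computation that a vector index speeds up: with unit-length vectors, the inner product is the cosine similarity. The `normalize` and `search` helpers and the toy vectors are assumptions made for illustration.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def search(index: List[List[float]], query: Sequence[float], limit: int) -> List[Tuple[int, float]]:
    """Return the top `limit` (id, score) pairs by inner product."""
    query = normalize(query)
    scores = (
        (uid, sum(a * b for a, b in zip(vector, query)))
        for uid, vector in enumerate(index)
    )
    return heapq.nlargest(limit, scores, key=lambda pair: pair[1])


# Toy 2-dimensional "embeddings", normalized before indexing
index = [normalize(v) for v in [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]]
results = search(index, [1.0, 0.2], 2)

for uid, score in results:
    print(uid, round(score, 4))
```

This scan compares the query against every vector, so its cost grows linearly with the size of the index. ANN libraries exist to avoid that cost on large collections.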
Just as a reminder, the following much simpler code does the same thing with an Embeddings instance.
from txtai.embeddings import Embeddings
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2"})
embeddings.index((x, text, None) for x, text in enumerate(texts))
for uid, score in embeddings.search("best planets to explore for life"):
    print(uid, texts[uid], score)
17752 Rocky Road: Planet hunting gets closer to Earth Astronomers have discovered the three lightest planets known outside the solar system, moving researchers closer to the goal of finding extrasolar planets that resemble Earth. 0.599043607711792
16158 Earth #39;s #39;big brothers #39; floating around stars Washington - A new class of planets has been found orbiting stars besides our sun, in a possible giant leap forward in the search for Earth-like planets that might harbour life. 0.568852961063385
45029 Coming Soon: "Good" Jupiters Most of the extrasolar planets discovered to date are gas giants like Jupiter, but their orbits are either much closer to their parent stars or are highly eccentric. Planet hunters are on the verge of confirming the discovery of Jupiter-size planets with Jupiter-like orbits. Solar systems that contain these "good" Jupiters may harbor habitable Earth-like planets as well. 0.560688853263855
Database
When the content parameter is enabled, an Embeddings instance stores both vector content and raw content in a database. But the database package can be used standalone too.
from txtai.database import DatabaseFactory
# Load content into database
database = DatabaseFactory.create({"content": True})
database.insert((x, row, None) for x, row in enumerate(dataset))
# Show total
database.search("select count(*) from txtai")
[{'count(*)': 120000}]
The full txtai SQL query syntax is available, including working with dynamically created fields.
database.search("select count(*), label from txtai group by label")
[{'count(*)': 30000, 'label': 0},
{'count(*)': 30000, 'label': 1},
{'count(*)': 30000, 'label': 2},
{'count(*)': 30000, 'label': 3}]
Let's run a query to find text containing the word planets.
for row in database.search("select id, text from txtai where text like '%planets%' limit 3"):
    print(row["id"], row["text"])
100 Comets, Asteroids and Planets around a Nearby Star (SPACE.com) SPACE.com - A nearby star thought to harbor comets and asteroids now appears to be home to planets, too. The presumed worlds are smaller than Jupiter and could be as tiny as Pluto, new observations suggest.
102 Redesigning Rockets: NASA Space Propulsion Finds a New Home (SPACE.com) SPACE.com - While the exploration of the Moon and other planets in our solar system is nbsp;exciting, the first task for astronauts and robots alike is to actually nbsp;get to those destinations.
272 Sharpest Image Ever Obtained of a Circumstellar Disk Reveals Signs of Young Planets MAUNA KEA, Hawaii -- The sharpest image ever taken of a dust disk around another star has revealed structures in the disk which are signs of unseen planets. Dr...
Since this is just a SQL database, text search is quite limited. The query above just retrieved results with the word planets in it.
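The limitation is easy to reproduce with Python's built-in sqlite3 module. This is a standalone sketch, not txtai's schema: the `articles` table and its rows are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical articles table
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, text TEXT)")
connection.executemany(
    "INSERT INTO articles (id, text) VALUES (?, ?)",
    [
        (0, "Astronomers discover new planets"),
        (1, "A distant world may harbor life"),
        (2, "Stock markets rally on earnings"),
    ],
)

# LIKE only matches rows that literally contain the substring
rows = connection.execute(
    "SELECT id FROM articles WHERE text LIKE ? ORDER BY id", ("%planets%",)
).fetchall()
matches = [row[0] for row in rows]
print(matches)
```

Row 1 covers the same topic but never uses the word, so a substring match cannot find it. There is also no relevance score to rank the matches by.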
Scoring
Since the original txtai release, there has been a scoring package. The main use case for this package is building a weighted sentence embeddings vector when using word vector models. But this package can also be used standalone to build BM25, TF-IDF and/or SIF text indexes.
from txtai.scoring import ScoringFactory
# Build index
scoring = ScoringFactory.create({"method": "bm25", "terms": True, "content": True})
scoring.index((x, text, None) for x, text in enumerate(texts))
# Show total
scoring.count()
120000
for row in scoring.search("planets explore life earth", 3):
    print(row["id"], row["text"], row["score"])
16327 3 Planets Are Found Close in Size to Earth, Making Scientists Think 'Life' A trio of newly discovered worlds are much smaller than any other planets previously discovered outside of the solar system. 17.768332448130707
16158 Earth #39;s #39;big brothers #39; floating around stars Washington - A new class of planets has been found orbiting stars besides our sun, in a possible giant leap forward in the search for Earth-like planets that might harbour life. 17.65941968170793
16620 New Planets could advance search for Life Astronomers in Europe and the United States have found two new planets about 20 times the size of Earth beyond the solar system. The discovery might be a giant leap forward in 17.65941968170793
The search above ran a BM25 search across the dataset. Compared to vector search, it returns more keyword/literal matches. With proper query construction, the results can be decent.
Comparing the earlier vector search results with these results is a good lesson in the differences between keyword and vector search.
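For intuition on why the results are so literal, here is a minimal sketch of textbook BM25 scoring in plain Python. It is not txtai's implementation: the whitespace tokenizer, the `bm25_scores` helper and the `k1` and `b` values (commonly used defaults) are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, documents: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document against the query with a textbook BM25 formula."""
    # Naive lowercase/whitespace tokenization (assumption)
    tokenized = [doc.lower().split() for doc in documents]
    total = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / total

    # Number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher weight
            idf = math.log(1 + (total - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


documents = [
    "new planets found close in size to earth",
    "stock markets rally as earnings beat forecasts",
    "astronomers search for life on distant planets",
]
scores = bm25_scores("planets life", documents)
ranking = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Only documents that share exact terms with the query receive a score, which is why query wording matters so much more here than with vector search.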
Database and Scoring
Earlier we showed how the ann and vectors components can be combined to build a vector search engine. Can we combine the database and scoring components to add keyword search to a database? Yes!
def search(query, limit=3):
    # Get similar clauses, if any
    similar = database.parse(query).get("similar")
    return database.search(query, [scoring.search(args[0], limit * 10) for args in similar] if similar else None, limit)
# Rebuild scoring - only need terms index
scoring = ScoringFactory.create({"method": "bm25", "terms": True})
scoring.index((x, text, None) for x, text in enumerate(texts))
for row in search("select id, text, score from txtai where similar('planets explore life earth') and label = 0"):
    print(row["id"], row["text"], row["score"])
15363 NASA to Announce New Class of Planets Astronomers have discovered four new planets in a week's time, an exciting end-of-summer flurry that signals a sharper era in the hunt for new worlds. While none of these new bodies would be mistaken as Earth's twin, some appear to be noticeably smaller and more solid - more like Earth and Mars - than the gargantuan, gaseous giants identified before... 12.582923259697132
15900 Astronomers Spot Smallest Planets Yet American astronomers say they have discovered the two smallest planets yet orbiting nearby stars, trumping a small planet discovery by European scientists five days ago and capping the latest round in a frenzied hunt for other worlds like Earth. All three of these smaller planets belong to a new class of "exoplanets" - those that orbit stars other than our sun, the scientists said in a briefing Tuesday... 12.563928231067155
15879 Astronomers see two new planets US astronomers find the smallest worlds detected circling other stars and say it is a breakthrough in the search for life in space. 12.078383982352994
And there it is, scoring-based similarity search with the same syntax as standard txtai vector queries, including additional filters!
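One way to picture this combination is as a join between the keyword search candidates and the content table, with the SQL filter applied on top. The sketch below uses Python's built-in sqlite3 module and is an illustration only, not txtai's internal implementation: the `articles` and `scores` tables and the candidate scores are hypothetical.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, text TEXT, label INTEGER)")
connection.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [
        (0, "New planets found near Earth", 0),
        (1, "Planets tour tickets on sale", 1),
        (2, "Search for life on other planets", 0),
    ],
)

# Hypothetical keyword search output: (id, score) pairs
candidates = [(2, 3.1), (1, 2.4), (0, 1.7)]

# Load candidates into a temporary table so SQL can join and filter them
connection.execute("CREATE TEMP TABLE scores (id INTEGER PRIMARY KEY, score REAL)")
connection.executemany("INSERT INTO scores VALUES (?, ?)", candidates)

rows = connection.execute(
    "SELECT a.id, a.text, s.score FROM articles a "
    "JOIN scores s ON a.id = s.id "
    "WHERE a.label = ? ORDER BY s.score DESC LIMIT ?",
    (0, 3),
).fetchall()

for row in rows:
    print(row)
```

The filter removes candidate 1 even though it scored well. That is a plausible reason for requesting more candidates than the final limit, as the search function above does with limit * 10.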
txtai is built on vector search, machine learning and finding results based on semantic meaning. The functional advantages of vector search over keyword search have been well discussed. The one advantage keyword search has is speed.
Wrapping up
This notebook walked through each of the packages used by an Embeddings index. The Embeddings index makes this all transparent and easy to use. But each of the components does stand on its own and can be individually integrated into a project!