Build an Embeddings index from a data source

davidmezzetti

David Mezzetti

Posted on January 28, 2021

In Part 1, we gave a general overview of txtai, the backing technology and examples of how to use it for similarity searches. Part 2 covered an embedding index with a larger dataset.

For real-world, large-scale use cases, data is often stored in an external data store (Elasticsearch, SQL, MongoDB, files, etc). Here we'll show how to read from SQLite, build an Embeddings index backed by word embeddings and run queries against the generated index.

This example covers functionality found in the paperai library. See that library for a full solution that can be used with the dataset discussed below.

Install dependencies

Install txtai and all dependencies. Since this article is building word vectors, we need to install the similarity extras package.

pip install txtai[similarity]

Download data

This example works with a subset of the CORD-19 dataset. The COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses.

The following download is a SQLite database generated from a Kaggle notebook. More information on this data format can be found in the CORD-19 Analysis notebook.

wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz
gunzip tests.gz
mv tests articles.sqlite

Build Word Vectors

This example will build a search system backed by word embeddings. While not quite as powerful as transformer embeddings, word embeddings often provide a good tradeoff between performance and functionality for an embedding-based search system.

For this article, we'll build our own custom embeddings for demo purposes. A number of pre-trained word embedding models are also available and could be used instead.

import os
import sqlite3
import tempfile

from txtai.pipeline import Tokenizer
from txtai.vectors import WordVectors

print("Streaming tokens to temporary file")

# Stream tokens to temp working file
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as output:
  # Save file path
  tokens = output.name

  db = sqlite3.connect("articles.sqlite")
  cur = db.cursor()
  cur.execute("SELECT Text from sections")

  for row in cur:
    output.write(" ".join(Tokenizer.tokenize(row[0])) + "\n")

  # Free database resources
  db.close()

# Build word vectors model - 300 dimensions, 3 min occurrences
WordVectors.build(tokens, 300, 3, "cord19-300d")

# Remove temporary tokens file
os.remove(tokens)

# Show files
ls -l

Streaming tokens to temporary file
Building 300 dimension model
Converting vectors to magnitude format
total 78948
-rw-r--r-- 1 root root  8065024 Aug 25 01:44 articles.sqlite
-rw-r--r-- 1 root root 24145920 Jan  9 20:45 cord19-300d.magnitude
-rw-r--r-- 1 root root 48625387 Jan  9 20:45 cord19-300d.txt
drwxr-xr-x 1 root root     4096 Jan  6 18:10 sample_data
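The token file passed to the word vector build above holds one tokenized row per line. As a minimal, self-contained sketch of that streaming step, the following uses an in-memory SQLite table and a simplified tokenizer. Both `simple_tokenize` and `stream_tokens` are hypothetical helpers written for this illustration; the real txtai Tokenizer applies its own rules, so the exact tokens will differ.

```python
import os
import re
import sqlite3
import tempfile

def simple_tokenize(text):
    """Hypothetical stand-in for txtai's Tokenizer: lowercase alphabetic tokens of 2+ characters."""
    return re.findall(r"[a-z]{2,}", text.lower())

def stream_tokens(db, path):
    """Write one space-separated, tokenized row per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as output:
        for (text,) in db.execute("SELECT Text FROM sections"):
            tokens = simple_tokenize(text or "")
            # Skip rows that produce no tokens (empty or NULL text)
            if tokens:
                output.write(" ".join(tokens) + "\n")
                count += 1
    return count

# Toy in-memory database standing in for articles.sqlite (assumed single Text column)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sections (Id INTEGER PRIMARY KEY, Text TEXT)")
db.executemany("INSERT INTO sections (Text) VALUES (?)",
               [("Smoking is a risk factor.",), (None,), ("Age and sex matter!",)])

# Create a temporary working file, stream tokens into it, then read it back
fd, tokens_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
written = stream_tokens(db, tokens_path)
db.close()

with open(tokens_path, encoding="utf-8") as f:
    lines = f.read().splitlines()

# Remove temporary tokens file
os.remove(tokens_path)
print(written, lines)
```

Writing to a file rather than building a list keeps memory use flat, which matters once the sections table grows beyond this small test subset.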

Build an embeddings index

The following steps build an embeddings index using the word vector model just created. This model builds a BM25 + fastText index. BM25 is used to build a weighted average of the word embeddings for a section. More information on this method can be found in this Medium article.

import sqlite3

import regex as re

from txtai.embeddings import Embeddings
from txtai.pipeline import Tokenizer

def stream():
  # Connection to database file
  db = sqlite3.connect("articles.sqlite")
  cur = db.cursor()

  # Select tagged sentences without an NLP label. NLP labels are set for non-informative sentences.
  cur.execute("SELECT Id, Name, Text FROM sections WHERE (labels is null or labels NOT IN ('FRAGMENT', 'QUESTION')) AND tags is not null")

  count = 0
  for row in cur:
    # Unpack row
    uid, name, text = row

    # Only process certain document sections
    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
      # Tokenize text
      tokens = Tokenizer.tokenize(text)

      document = (uid, tokens, None)

      count += 1
      if count % 1000 == 0:
        print("Streamed %d documents" % (count), end="\r")

      # Skip documents with no tokens parsed
      if tokens:
        yield document

  print("Iterated over %d total rows" % (count))

  # Free database resources
  db.close()

# BM25 + fastText vectors
embeddings = Embeddings({"path": "cord19-300d.magnitude",
                         "scoring": "bm25",
                         "pca": 3})

# Build scoring index if scoring method provided
if embeddings.config.get("scoring"):
  embeddings.score(stream())

# Build embeddings index
embeddings.index(stream())

Iterated over 21499 total rows
Iterated over 21499 total rows
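To make the idea of a BM25-weighted average of word embeddings concrete, here is a small pure-Python sketch. It uses a textbook Okapi BM25 formula and made-up two-dimensional vectors; the function names, the `k1` and `b` defaults and the toy data are all assumptions for illustration, and txtai's own scoring implementation may differ in its details.

```python
import math
from collections import Counter

def bm25_idf(documents):
    """Okapi BM25 IDF per token (the log(1 + ...) variant, which keeps weights positive)."""
    n = len(documents)
    df = Counter(token for doc in documents for token in set(doc))
    return {token: math.log(1 + (n - freq + 0.5) / (freq + 0.5)) for token, freq in df.items()}

def bm25_weights(doc, idf, avgdl, k1=1.2, b=0.75):
    """One BM25 weight per token occurrence in the document."""
    tf = Counter(doc)
    norm = k1 * (1 - b + b * len(doc) / avgdl)
    return [idf[token] * tf[token] * (k1 + 1) / (tf[token] + norm) for token in doc]

def weighted_average(doc, weights, vectors):
    """Average the token vectors, scaling each by its weight."""
    dims = len(vectors[doc[0]])
    total = sum(weights)
    return [sum(w * vectors[t][i] for t, w in zip(doc, weights)) / total for i in range(dims)]

# Toy corpus: "covid" appears in every document, so it carries little information
documents = [["risk", "factors", "covid"], ["covid", "outcomes"], ["covid", "treatment"]]

# Made-up 2-dimensional word vectors (a real model would have 300 dimensions)
vectors = {"covid": [1.0, 0.0], "risk": [0.0, 1.0], "factors": [0.0, 1.0],
           "outcomes": [0.5, 0.5], "treatment": [0.5, 0.5]}

idf = bm25_idf(documents)
avgdl = sum(len(doc) for doc in documents) / len(documents)

doc = documents[0]
weights = bm25_weights(doc, idf, avgdl)
embedding = weighted_average(doc, weights, vectors)
plain_mean = [sum(vectors[t][i] for t in doc) / len(doc) for i in range(2)]

print("weighted:", embedding)
print("unweighted:", plain_mean)
```

Because "covid" occurs in every toy document, its weight is low and the weighted embedding moves away from the "covid" direction compared with the plain mean. That is the motivation for weighting: common terms contribute less to the section vector than rare, informative ones.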

Query data

The following runs a query against the embeddings index for the terms "risk factors". It finds the top 5 matches and returns the corresponding documents associated with each match.

import pandas as pd

from IPython.display import display, HTML

pd.set_option("display.max_colwidth", None)

db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

results = []
for uid, score in embeddings.search("risk factors", 5):
  cur.execute("SELECT article, text FROM sections WHERE id = ?", [uid])
  uid, text = cur.fetchone()

  cur.execute("SELECT Title, Published, Reference from articles where id = ?", [uid])
  results.append(cur.fetchone() + (text,))

# Free database resources
db.close()

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match"])

display(HTML(df.to_html(index=False)))

| Title | Published | Reference | Match |
| --- | --- | --- | --- |
| Management of osteoarthritis during COVID‐19 pandemic | 2020-05-21 00:00:00 | https://doi.org/10.1002/cpt.1910 | Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) . |
| Work-related and Personal Factors Associated with Mental Well-being during COVID-19 Response: A Survey of Health Care and Other Workers | 2020-06-11 00:00:00 | http://medrxiv.org/cgi/content/short/2020.06.09.20126722v1?rss=1 | Poor family supportive behaviors by supervisors were also associated with these outcomes [1.40 (1.21 - 1.62), 1.69 (1.48 - 1.92), 1.54 (1.44 - 1.64)]. |
| No evidence that androgen regulation of pulmonary TMPRSS2 explains sex-discordant COVID-19 outcomes | 2020-04-21 00:00:00 | https://doi.org/10.1101/2020.04.21.051201 | In addition to male sex, smoking is a risk factor for COVID-19 susceptibility and poor clinical outcomes . |
| Current status of potential therapeutic candidates for the COVID-19 crisis | 2020-04-22 00:00:00 | https://doi.org/10.1016/j.bbi.2020.04.046 | There was no difference on 28-day mortality between heparin users and nonusers. |
| COVID-19: what has been learned and to be learned about the novel coronavirus disease | 2020-03-15 00:00:00 | https://doi.org/10.7150/ijbs.45134 | • Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. |
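The query code above resolves each match with two lookups: section id to article id, then article id to metadata. As a sketch of an alternative, the same row can be fetched with a single JOIN. The table and column names below are inferred from the queries in this article, the data is made up, and `lookup` and `hits` are hypothetical names used only for this illustration.

```python
import sqlite3

# Toy in-memory database mirroring the columns used by the queries above (assumed schema)
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE articles (Id TEXT PRIMARY KEY, Title TEXT, Published TEXT, Reference TEXT);
CREATE TABLE sections (Id INTEGER PRIMARY KEY, Article TEXT, Text TEXT);
INSERT INTO articles VALUES ('a1', 'Toy article', '2020-01-01', 'https://example.org/a1');
INSERT INTO sections VALUES (10, 'a1', 'Smoking is a risk factor.');
INSERT INTO sections VALUES (11, 'a1', 'Age is a risk factor.');
""")

def lookup(db, uid):
    """Resolve a section id to (Title, Published, Reference, Match) with a single JOIN."""
    return db.execute(
        "SELECT a.Title, a.Published, a.Reference, s.Text "
        "FROM sections s JOIN articles a ON a.Id = s.Article "
        "WHERE s.Id = ?", [uid]).fetchone()

# Hypothetical search results in the (id, score) shape returned by embeddings.search
hits = [(11, 0.9), (10, 0.8)]
results = [lookup(db, uid) for uid, _ in hits]

# Free database resources
db.close()

for row in results:
    print(row)
```

Either approach gives the same rows; the JOIN simply halves the number of round trips per match, which only starts to matter for larger result sets.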

Extracting additional columns from query results

The example above uses the Embeddings index to find the top 5 matches. The next example adds an Extractor instance (explained further in Part 5) that asks additional questions over the search results, creating a richer query response.

from txtai.pipeline import Extractor

# Create extractor instance using qa model designed for the CORD-19 dataset
extractor = Extractor(embeddings, "NeuML/bert-small-cord19qa")

db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

results = []
for uid, score in embeddings.search("risk factors", 5):
  cur.execute("SELECT article, text FROM sections WHERE id = ?", [uid])
  uid, text = cur.fetchone()

  # Get list of document text sections to use for the context
  cur.execute("SELECT Name, Text FROM sections WHERE (labels is null or labels NOT IN ('FRAGMENT', 'QUESTION')) AND article = ? ORDER BY Id", [uid])
  texts = []
  for name, txt in cur.fetchall():
    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
      texts.append(txt)

  cur.execute("SELECT Title, Published, Reference from articles where id = ?", [uid])
  article = cur.fetchone()

  # Use QA extractor to derive additional columns
  answers = extractor([("Risk Factors", "risk factors", "What risk factors?", False),
                       ("Locations", "hospital country", "What locations?", False)], texts)

  results.append(article + (text,) + tuple([answer[1] for answer in answers]))

# Free database resources
db.close()

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match", "Risk Factors", "Locations"])
display(HTML(df.to_html(index=False)))

| Title | Published | Reference | Match | Risk Factors | Locations |
| --- | --- | --- | --- | --- | --- |
| Management of osteoarthritis during COVID‐19 pandemic | 2020-05-21 00:00:00 | https://doi.org/10.1002/cpt.1910 | Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) . | sex, obesity, genetic factors and mechanical factors | None |
| Work-related and Personal Factors Associated with Mental Well-being during COVID-19 Response: A Survey of Health Care and Other Workers | 2020-06-11 00:00:00 | http://medrxiv.org/cgi/content/short/2020.06.09.20126722v1?rss=1 | Poor family supportive behaviors by supervisors were also associated with these outcomes [1.40 (1.21 - 1.62), 1.69 (1.48 - 1.92), 1.54 (1.44 - 1.64)]. | Poor family supportive behaviors | None |
| No evidence that androgen regulation of pulmonary TMPRSS2 explains sex-discordant COVID-19 outcomes | 2020-04-21 00:00:00 | https://doi.org/10.1101/2020.04.21.051201 | In addition to male sex, smoking is a risk factor for COVID-19 susceptibility and poor clinical outcomes . | Higher morbidity and mortality | None |
| Current status of potential therapeutic candidates for the COVID-19 crisis | 2020-04-22 00:00:00 | https://doi.org/10.1016/j.bbi.2020.04.046 | There was no difference on 28-day mortality between heparin users and nonusers. | elicited strong inflammatory responses are favorable or detrimental | None |
| COVID-19: what has been learned and to be learned about the novel coronavirus disease | 2020-03-15 00:00:00 | https://doi.org/10.7150/ijbs.45134 | • Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. | sex (male), age (≥60), and severe pneumonia | None |
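Both listings filter section names with a pattern that contains a variable-length lookbehind, which is why they import the third-party regex module rather than the standard library's re (the standard module rejects variable-width lookbehinds). As I read the pattern, it skips background, introduction and reference sections, and skips discussion sections unless "results" appears earlier in the name. The sketch below expresses that reading with the standard library only; `is_excluded` is a hypothetical helper name and it assumes single-line section names.

```python
import re

# Alternatives from the original pattern that need no lookbehind
SIMPLE = re.compile(r"background|introduction|reference")

def is_excluded(name):
    """True when a section name would be skipped by the section filter."""
    # Rows without a section name are always processed
    if not name:
        return False

    name = name.lower()
    if SIMPLE.search(name):
        return True

    # "discussion" only counts when "results" does not appear before it.
    # Checking the first occurrence is enough: if "results" precedes the first
    # "discussion", it also precedes every later one.
    idx = name.find("discussion")
    return idx != -1 and "results" not in name[:idx]

for section in ["Discussion", "Results and Discussion", "Introduction", "Methods", None]:
    print(repr(section), is_excluded(section))
```

The effect is that a combined "Results and Discussion" section is still indexed, while a standalone "Discussion" section is not.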

In the example above, the Embeddings index is used to find the top N results for a given query. On top of that, a question-answer extractor is used to derive additional columns based on a list of questions. In this case, the "Risk Factors" and "Locations" columns were pulled from the document text.
