Building Blocks for Hybrid Search: Combining Keyword and Semantic Search

Shannon Lal

Posted on February 14, 2024

Welcome back to my series on Hybrid Search, where we explore the fusion of keyword-based and semantic search to create a powerful search engine. In this second instalment, we'll delve into the key components necessary for hybrid search, using MongoDB Atlas and OpenAI's embedding models. Our final goal, which will be showcased in the next post, is to demonstrate a live hybrid search.

Understanding Keyword and Semantic Searches

Keyword search is the traditional form of search we're all familiar with. It works by matching the words or phrases in a query against the words in a document's content. Semantic search, on the other hand, interprets the meaning behind the words, providing results based on context and intent. Semantic search utilizes vector embeddings, which are numeric representations of text, to find documents with similar meanings even if they don't share the same keywords.
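To make the limitation of keyword matching concrete, here is a toy sketch. It is not how Lucene or Atlas Search works internally; the tokenize and keywordMatch helpers and the sample text are made up for illustration.

```javascript
// Toy illustration only: real engines such as Lucene also apply
// stemming, stop-word handling, and relevance scoring.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Returns true when at least one query term appears verbatim in the document.
function keywordMatch(query, document) {
  const docTerms = new Set(tokenize(document));
  return tokenize(query).some((term) => docTerms.has(term));
}

const sampleDescription = 'Freight hauling across Ontario and Quebec'; // made-up sample text

console.log(keywordMatch('freight', sampleDescription));        // true: shared term
console.log(keywordMatch('cargo trucking', sampleDescription)); // false: same idea, no shared term
```

The second query describes the same kind of business but shares no words with the description, which is exactly the gap semantic search is meant to close.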

Implementing Keyword Search with Node.js and MongoDB

MongoDB Atlas offers a robust full-text search feature built on Apache Lucene. To utilize this, we define a search index schema, specifying which fields to index and how to process them. Here's an example schema for our transportation business descriptions:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "description": {
        "analyzer": "lucene.standard",
        "type": "string"
      }
    }
  }
}

The above schema directs MongoDB to use the standard Lucene analyzer for the description field, preparing it for keyword searches. Because dynamic is set to false, only the fields listed under fields are indexed. Here's how we query this index with Node.js:

// Simplified code example for querying the search index
async function querySearch(searchTerm) {
  // ... (setup and connection code)

  const pipeline = [
    {
      $search: {
        text: {
          query: searchTerm,
          path: 'description',
        },
      },
    },
    {
      $project: {
        name: 1,
        description: 1,
        score: { $meta: 'searchScore' },
      },
    },
  ];
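  // aggregate() returns a cursor synchronously; awaiting toArray() fetches the documents
  const results = await collection.aggregate(pipeline).toArray();
  // ... (handling results and errors)
  return results;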
}

This script connects to our MongoDB collection, runs a text search for the given term, and projects the search score—a relevance score assigned by MongoDB based on the text index.
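The query above does not name an index, so Atlas Search falls back to the index called default. As a minimal sketch, the pipeline can be pulled into a small builder that names the index explicitly and caps the number of results. The buildKeywordPipeline name and the indexName and limit options are my own illustrative choices, not part of the original code.

```javascript
// Builds the aggregation pipeline for an Atlas Search keyword query.
// `indexName` and `limit` are illustrative options (assumptions).
function buildKeywordPipeline(searchTerm, { indexName = 'default', limit = 10 } = {}) {
  return [
    {
      // $search must be the first stage of the pipeline
      $search: {
        index: indexName,
        text: { query: searchTerm, path: 'description' },
      },
    },
    // Results arrive sorted by relevance, so $limit keeps the top matches
    { $limit: limit },
    {
      $project: {
        _id: 0,
        name: 1,
        description: 1,
        score: { $meta: 'searchScore' },
      },
    },
  ];
}

console.log(JSON.stringify(buildKeywordPipeline('refrigerated trucking', { limit: 5 }), null, 2));
```

Keeping the pipeline in a pure function makes it easy to inspect or unit test without a database connection; the result is passed to collection.aggregate() exactly as before.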

Creating Vector Embeddings with OpenAI

To perform semantic searches, we first need to convert our text data into vector embeddings. Here's how we can leverage OpenAI's API to create these embeddings:

// Function to generate vector embeddings from text
async function generateEmbedding(description) {
  // ... (setup code for OpenAI client)

  const response = await openai.embeddings.create({
    input: description,
    model: 'text-embedding-3-small',
  });
  return response.data[0].embedding;
}

With the text-embedding-3-small model, OpenAI's embeddings endpoint transforms a piece of text into a 1536-dimensional vector by default, capturing its semantic content. These vectors can then be compared with other text vectors to compute similarity scores.
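Before these vectors can be searched, each one has to be stored on its document. The sketch below validates an embedding and wraps it in an update document. The buildEmbeddingUpdate helper is hypothetical, and it assumes the vector lives in a field named embedding, the same path the vector index reads from.

```javascript
const EXPECTED_DIMENSIONS = 1536; // default output size of text-embedding-3-small

// Validates an embedding and wraps it in a MongoDB update document.
// Hypothetical helper: the `embedding` field name is an assumption.
function buildEmbeddingUpdate(embedding, expectedDimensions = EXPECTED_DIMENSIONS) {
  if (!Array.isArray(embedding)) {
    throw new Error('Embedding must be an array of numbers');
  }
  if (embedding.length !== expectedDimensions) {
    throw new Error(`Expected ${expectedDimensions} dimensions, got ${embedding.length}`);
  }
  if (!embedding.every((value) => Number.isFinite(value))) {
    throw new Error('Embedding contains a non-numeric value');
  }
  return { $set: { embedding } };
}

// Usage, assuming `collection` and `doc` come from the elided setup code:
//   const embedding = await generateEmbedding(doc.description);
//   await collection.updateOne({ _id: doc._id }, buildEmbeddingUpdate(embedding));
```

Checking the length up front catches a mismatch between the embedding model and the index definition before a bad vector is written.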

Vector Search with MongoDB

With our descriptions turned into vector embeddings, we're ready to perform semantic searches. MongoDB's vector indexes facilitate this:

{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}

The numDimensions value must match the length of the vectors produced by the OpenAI embedding model (1536 here). The similarity field is set to cosine, directing MongoDB to compare vectors by cosine similarity, which measures the angle between two vectors rather than their magnitude and is commonly used to compare documents in a high-dimensional space.
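To show what cosine similarity computes, here is a small standalone sketch. MongoDB does this work inside the index, so the cosineSimilarity function below is purely illustrative and is never called in the search code.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Illustrative only; Atlas computes this inside the vector index.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine similarity is undefined for a zero vector');
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0]));  // 1: same direction
console.log(cosineSimilarity([1, 0], [0, 1]));  // 0: orthogonal
console.log(cosineSimilarity([1, 0], [-1, 0])); // -1: opposite direction
```

The raw value ranges from -1 to 1. Atlas's documentation describes the score it reports for cosine as the normalized value (1 + cosine) / 2, so reported scores land between 0 and 1.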

Here's the Node.js code that performs a vector search:

// Simplified code example for semantic search
async function semanticSearch(searchTerm) {
  // ... (setup and connection code)

  const embedding = await generateEmbedding(searchTerm);
  const pipeline = [
    // ... (vector search pipeline)
  ];
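  // aggregate() returns a cursor synchronously; awaiting toArray() fetches the documents
  const results = await collection.aggregate(pipeline).toArray();
  // ... (handling results and errors)
  return results;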
}

When the pipeline projects score: { $meta: 'vectorSearchScore' }, each returned document includes a score that quantifies the similarity between the search term's embedding and the document's embedding.
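The vector search pipeline is elided in the example above. As a rough sketch of what such a pipeline can look like, the builder below uses the $vectorSearch aggregation stage. The buildVectorSearchPipeline name, the vector_index index name, and the default values for limit and numCandidates are all assumptions of mine, not taken from the original code.

```javascript
// Builds an Atlas Vector Search pipeline for a query embedding.
// Assumptions: index name 'vector_index', vectors stored under `embedding`.
function buildVectorSearchPipeline(
  queryVector,
  { indexName = 'vector_index', limit = 10, numCandidates = limit * 10 } = {}
) {
  return [
    {
      // $vectorSearch must be the first stage of the pipeline
      $vectorSearch: {
        index: indexName,
        path: 'embedding',
        queryVector,
        numCandidates, // nearest neighbours considered before the top `limit` are kept
        limit,
      },
    },
    {
      $project: {
        _id: 0,
        name: 1,
        description: 1,
        // The similarity score is only returned when projected explicitly
        score: { $meta: 'vectorSearchScore' },
      },
    },
  ];
}

// Usage inside semanticSearch, assuming `embedding` came from generateEmbedding:
//   const pipeline = buildVectorSearchPipeline(embedding, { limit: 5 });
console.log(Object.keys(buildVectorSearchPipeline([0.1, 0.2, 0.3])[0])); // [ '$vectorSearch' ]
```

A larger numCandidates generally improves recall at the cost of latency, so the ten-to-one ratio used here is only a starting point to tune.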

Conclusion and Look Ahead

We've covered the individual components necessary for hybrid search: keyword search for precise matching and semantic search for understanding context. In our upcoming blog, we will bring these elements together to showcase a fully functioning hybrid search that leverages the best of both worlds to deliver accurate and contextually relevant search results.

Stay tuned for a walkthrough demonstration where we'll see these technologies in action, completing our journey through the intricacies of modern search capabilities.
