Building AI-powered search using LangChain and Milvus
Ayush
Posted on December 11, 2023
This blog is co-authored by Ayush Pandey, Senior Software Engineer, and Amit Kaushal, Software Manager, at Intuit.
Artificial Intelligence (AI) has revolutionized the way we interact with technology, and one of the most significant applications of AI is in search. With the help of AI, search tools can surface more accurate and relevant results to users. In this blog, we will discuss how to build an AI-powered search engine using LangChain and Milvus.
Before we dive into the demo, let’s talk through some of the concepts and tools involved.
What is LangChain?
LangChain is a framework for developing applications powered by language models. Use cases include applications for document question answering, building conversational interfaces for database interactions, and much more. We believe that the most powerful and differentiated applications will not only leverage a language model, but will also be:
Data-aware: connect a language model to other sources of data
Agentic: allow a language model to interact with its environment
LangChain provides the modular components used to build such applications; they can be used on their own or combined for more complex use cases.
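To make that concrete, here is a minimal sketch of wiring a prompt template to a language model using LangChain's legacy API (the same API used in the demo below); the prompt wording is purely illustrative:
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# A template with one input variable; the wording is just an example
prompt = PromptTemplate(
    input_variables=["product"],
    template="Write one sentence describing what {product} is used for.",
)

# Chain the prompt and the LLM together, then run it on an input
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
print(chain.run("QuickBooks"))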
What is Milvus?
Milvus was created in 2019 with a singular goal: to store, index, and manage massive embedding vectors generated by deep neural networks and other machine learning (ML) models.
As a database specifically designed to handle queries over input vectors, it is capable of indexing vectors at trillion scale. Unlike traditional relational databases, which mainly deal with structured data that follows a predefined schema, Milvus is designed from the ground up to handle embedding vectors converted from unstructured data.
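As a rough illustration of how Milvus is used directly (outside of LangChain), here is a minimal pymilvus sketch that assumes a Milvus instance running locally on the default port; the collection name, field names, and vector dimension are placeholders:
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

# Connect to a locally running Milvus instance (assumed to be on the default port)
connections.connect(host="localhost", port="19530")

# Define a collection with an auto-generated ID and a vector field
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
]
collection = Collection(name="demo_vectors", schema=CollectionSchema(fields))

# Build an approximate nearest-neighbour index so similarity queries stay fast
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)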
Vector embeddings: why the hype?
Vector embeddings are a powerful tool for developers working with natural language processing (NLP) and ML applications. They represent words or phrases as vectors in a high-dimensional space, where each dimension captures a different feature of the word or phrase. This lets developers perform complex operations on text data, such as sentiment analysis, text classification, and machine translation.
Let’s go over a simple explainer of semantic features. Scientists sort animals into categories based on certain characteristics; for example, birds are warm-blooded vertebrates adapted to fly. Based on such characteristics, we can assign each animal a coordinate made up of a Type score and a Domestication score. These scores are called “semantic features,” and they capture part of the meaning of each word. Once the words have corresponding numerical values, we can plot them as points on a graph, where the x-axis represents Type and the y-axis represents Domestication score.
We can add new words to the plot based on their meanings. For example, where should the words "Lions" and "Parrots" go? How about "Whales"? Or "Snakes"?
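Here is a toy version of that idea in code; the Type and Domestication scores below are invented purely for illustration:
# Toy two-dimensional "embeddings": each animal is a (type, domestication) coordinate.
# The scores are made up for illustration only.
animal_vectors = {
    "dog":    (0.2, 0.9),
    "cat":    (0.2, 0.8),
    "parrot": (0.8, 0.6),
    "lion":   (0.2, 0.1),
    "whale":  (0.1, 0.0),
}

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Words with similar meanings end up close together in this space
query = animal_vectors["cat"]
closest = min((word for word in animal_vectors if word != "cat"),
              key=lambda word: euclidean(animal_vectors[word], query))
print(closest)  # "dog" is the nearest neighbour of "cat" in this toy space
Real embedding models do the same thing with hundreds or thousands of dimensions learned from data rather than hand-picked scores.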
There are also several libraries and tools available for developers who want to work with vector embeddings. Some popular libraries include Gensim, TensorFlow, and PyTorch. These libraries provide pre-trained models for word2vec and GloVe, as well as tools for training custom models on specific datasets.
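For example, with Gensim you can download one of its pre-trained models and inspect the learned vectors; the model name below is one of Gensim's downloadable GloVe options, and loading it requires a network connection:
import gensim.downloader as api

# Load 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
model = api.load("glove-wiki-gigaword-50")

print(model["bird"][:5])                   # first few dimensions of the "bird" vector
print(model.most_similar("bird", topn=3))  # nearest neighbours in the embedding space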
Demo: Using similarity search to ask questions about a Wikipedia article
First, let’s go through some prerequisites.
Install LangChain, the Milvus client (pymilvus), the OpenAI SDK, and tiktoken on your local system (you will also need a running Milvus instance to connect to):
! python -m pip install --upgrade pymilvus langchain openai tiktoken
Then, import required modules:
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Milvus
from langchain.document_loaders import WebBaseLoader
Next, set your OpenAI API key as an environment variable:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
Then, load a Wikipedia article (here we’re grabbing the page for Intuit QuickBooks) using the WebBaseLoader class, and split it into chunks:
loader = WebBaseLoader([
    "https://en.wikipedia.org/wiki/QuickBooks",
])
docs = loader.load()
# Split the documents into smaller chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(docs)
Afterwards, use OpenAIEmbeddings to embed the chunks and store everything in a Milvus vector database. Replace "HostName" with the address of your Milvus instance (for example, localhost for a local installation):
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
vector_db = Milvus.from_documents(
    docs,
    embeddings,
    connection_args={"host": "HostName", "port": "19530"},
)
It’s time to try semantic searching! Let’s ask a question using LangChain and Milvus:
query = "What is quickbooks?"
docs = vector_db.similarity_search(query)
docs[0].page_content
Output:
'Retrieved from "https://en.wikipedia.org/w/index.php?title=QuickBooks&oldid=1155606425"\nCategories: Accounting softwareIntuit softwareHidden categories: CS1 maint: url-statusArticles with short descriptionShort description is different from WikidataUse mdy dates from March 2019Articles containing potentially dated statements from May 2014All articles containing potentially dated statements\n\n\n\n\n\n\n This page was last edited on 18 May 2023, at 23:04\xa0(UTC).\nText is available under the Creative Commons Attribution-ShareAlike License 3.0;\nadditional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.\n\n\nPrivacy policy\nAbout Wikipedia\nDisclaimers\nContact Wikipedia\nMobile view\nDevelopers\nStatistics\nCookie statement\n\n\n\n\n\n\n\n\n\n\n\nToggle limited content width'
The result above is just the raw page content of the closest matching chunk, so it needs quite a lot of cleanup. Let’s use load_qa_with_sources_chain to ask the question instead for a cleaner output:
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.llms import OpenAI
chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="map_reduce", return_intermediate_steps=True)
query = "What is quickbooks?"
chain({"input_documents": docs, "question": query}, return_only_outputs=True)
Output:
{'intermediate_steps': [' No relevant text.',
' QuickBooks is an accounting software package developed and marketed by Intuit. First introduced in 1983, QuickBooks products are geared mainly toward small and medium-sized businesses and offer on-premises accounting applications as well as cloud-based versions that accept business payments, manage and pay bills, and payroll functions.',
" Intuit also offers a cloud service called QuickBooks Online (QBO). The user pays a monthly subscription fee rather than an upfront fee and accesses the software exclusively through a secure logon via a Web browser. QuickBooks Online is supported on Chrome, Firefox, Internet Explorer 10, Safari 6.1, and also accessible via Chrome on Android and Safari on iOS 7. Quickbooks Online offers integration with other third-party software and financial services, such as banks, payroll companies, and expense management software. QuickBooks desktop also supports a migration feature where customers can migrate their desktop data from a pro or prem SKU's to Quickbooks Online.",
' QuickBooks - Wikipedia \nInitial release, Subsequent releases, QuickBooks Online, QuickBooks Point of Sale, Add-on programs.'],
'output_text': ' QuickBooks is an accounting software package developed and marketed by Intuit. It offers on-premises accounting applications as well as cloud-based versions that accept business payments, manage and pay bills, and payroll functions. QuickBooks Online is a cloud service that offers integration with other third-party software and financial services.\nSOURCES: https://en.wikipedia.org/wiki/QuickBooks'}
Other search-related use cases using LangChain and Milvus:
- E-commerce search engine: the language model can be trained on product descriptions and reviews, and the data can be converted into vectors using an embedding model. The vectors can then be indexed in Milvus, and a search interface can be built to retrieve relevant products based on user queries (a sketch of this pattern follows the list).
- Image search engine: the language model can be trained on image captions and tags, and the images can be converted into vectors using an embedding model. The vectors can then be indexed in Milvus, and a search interface can be built to retrieve relevant images based on user queries.
- Video search engine: the language model can be trained on video titles and descriptions, and the videos can be converted into vectors using an embedding model. The vectors can then be indexed in Milvus, and a search interface can be built to retrieve relevant videos based on user queries.
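As a sketch of the first use case, product descriptions can be embedded and indexed with the same pattern used in the demo above; the product texts and Milvus host below are placeholders:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Milvus

# Placeholder product descriptions standing in for a real catalog
products = [
    "Wireless noise-cancelling headphones with 30-hour battery life",
    "Stainless steel water bottle, 750 ml, keeps drinks cold for 24 hours",
    "Ergonomic office chair with adjustable lumbar support",
]

# Embed the descriptions and index them in Milvus
product_db = Milvus.from_texts(
    texts=products,
    embedding=OpenAIEmbeddings(),
    connection_args={"host": "localhost", "port": "19530"},
)

# Retrieve the closest product for a shopper's natural-language query
results = product_db.similarity_search("comfortable chair for long work days", k=1)
print(results[0].page_content)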
By following the simple steps we’ve outlined here, developers can use LangChain and Milvus to build search engines for use cases ranging from simple document search to e-commerce, image, and video search. We hope this was a helpful starter guide; please leave a comment if you have any further questions!