Leveraging Elasticsearch and LangChain: A Guide to Using Aliases and Filters with LLMs

janakmandavgade

Janak Mandavgade

Posted on August 3, 2024


Hello, dear readers! In this comprehensive article, we will explore how to build a chatbot using LangChain and Elasticsearch. We will focus on improving the quality of the retrieved data throughout the process. Let’s dive in!

1. Install Required Dependencies

To get started, you need to install the necessary dependencies. Use the following command:

pip install langchain langchain_openai langchain_elasticsearch langchain_community ipython pypdf

2. Load and Extract Data Using PyPDFLoader

Next, we will load and extract data from a PDF file. Replace the placeholder with the actual path to your PDF file:

from langchain_community.document_loaders import PyPDFLoader

file_path = ""  # path to your PDF file
loader = PyPDFLoader(file_path=file_path)

3. Load and Split Data

After loading the data, we will split it into manageable chunks using the RecursiveCharacterTextSplitter. This allows for easier handling and retrieval of information.

from langchain.text_splitter import RecursiveCharacterTextSplitter

elements = loader.load_and_split(text_splitter=RecursiveCharacterTextSplitter(
    chunk_size=1500, chunk_overlap=50,
))

4. Adding Metadata to Splits

If you want to include metadata within the splits, you can use the following code snippet. Replace the placeholder fields with your desired values:

for e in elements:
    e.metadata["key1"] = "value1"
    e.metadata["key2"] = "value2"

You can print the elements to verify them (or return them, if this code lives inside a function):

print(elements)  # or: return elements
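If you later filter on these metadata fields, it helps to tag every split in one place. Below is a rough, self-contained sketch; `tag_splits` is a hypothetical helper, and the demo uses stand-in objects since real splits are LangChain `Document` instances:

```python
from types import SimpleNamespace

def tag_splits(splits, extra):
    """Merge extra key/value pairs into each split's metadata dict (in place)."""
    for s in splits:
        s.metadata.update(extra)
    return splits

# Demo with stand-in objects; real code would pass the list from load_and_split().
demo = [SimpleNamespace(metadata={"page": 1}), SimpleNamespace(metadata={"page": 2})]
tag_splits(demo, {"key1": "value1", "key2": "value2"})
print(demo[0].metadata)  # {'page': 1, 'key1': 'value1', 'key2': 'value2'}
```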

5. Initialize the Elasticsearch Client

Now, we will set up the Elasticsearch client to connect to your hosted cluster. Make sure the environment variables below are set to your actual values:

import os
from elasticsearch import Elasticsearch

ELASTICSEARCH_URL = os.environ.get("ELASTICSEARCH_URL")
ELASTICSEARCH_API_KEY = os.environ.get("ELASTICSEARCH_API_KEY")
es_client = Elasticsearch(api_key=ELASTICSEARCH_API_KEY, hosts=[ELASTICSEARCH_URL])
print(es_client)

6. Check for Index Existence and Create Embeddings

Next, we will check whether the Elasticsearch index exists and create embeddings if it does not. Replace index_name with your desired index name; embeddings is your embedding model instance (for example, OpenAIEmbeddings):

from langchain_elasticsearch import ElasticsearchStore

if not es_client.indices.exists(index=index_name):
    db = ElasticsearchStore.from_documents(
        elements, embeddings, es_url=ELASTICSEARCH_URL, index_name=index_name,
        es_api_key=ELASTICSEARCH_API_KEY, bulk_kwargs={"request_timeout": 600},
    )

If you want to check for aliases and create embeddings, you can use:

if not es_client.indices.exists_alias(name="alias_name"):
    db = ElasticsearchStore.from_documents(
        elements, embeddings, es_url=ELASTICSEARCH_URL, index_name="", es_api_key=ELASTICSEARCH_API_KEY,
        bulk_kwargs={"request_timeout": 600},
    )

7. Add Documents to Existing Indices or Aliases

If you need to add documents to an existing index or alias, use the following code. This will not create a new index:

if es_client.indices.exists_alias(name="alias_name") or es_client.indices.exists(index=index_name):
    db = ElasticsearchStore.from_documents(
        elements, embeddings, es_url=ELASTICSEARCH_URL, index_name="", es_api_key=ELASTICSEARCH_API_KEY,
        bulk_kwargs={"request_timeout": 600},
    )

Note: If you encounter a "connection timed out" error while connecting to Elasticsearch, check your network connection and increase the bulk request timeout:

bulk_kwargs={"request_timeout": 600}

8. Adding Aliases to the Elasticsearch Index

To add an alias to an Elasticsearch index, use the following command. Aliases are useful for grouping multiple indices so they can be searched together.

es_client.indices.put_alias(index=index_name, name="alias_name")

For example, if you have indices for multiple vendors, you can create aliases for categories like fruits, groceries, clothing, etc., at the time of creating embeddings.
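Grouping several vendor indices under category aliases can be done in a single update_aliases call. Here is a hedged sketch; the index and alias names are invented for illustration:

```python
# Map each category alias to the indices it should cover (hypothetical names).
alias_groups = {
    "fruits": ["vendor_a_fruits", "vendor_b_fruits"],
    "groceries": ["vendor_a_groceries"],
}

# Build the actions body accepted by es_client.indices.update_aliases().
actions = [
    {"add": {"index": index, "alias": alias}}
    for alias, indices in alias_groups.items()
    for index in indices
]
print(actions)

# Against a live cluster you would then run:
# es_client.indices.update_aliases(actions=actions)
```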

9. Searching with the Elasticsearch Interface

If an index exists, you can create an ElasticsearchStore interface over it to perform searches:

vectordb = ElasticsearchStore(es_connection=es_client, index_name=index_name, embedding=embeddings)

To search using a specific alias, replace the index name with the alias name:

vectordb = ElasticsearchStore(es_connection=es_client, index_name="alias_name", embedding=embeddings)

If you want to search across all indices in Elasticsearch, use the wildcard *:

vectordb = ElasticsearchStore(es_connection=es_client, index_name="*", embedding=embeddings)

10. Creating a Retriever

The next step is to create a retriever from the Elasticsearch vector store. You can specify how many documents to retrieve:

retriever = vectordb.as_retriever(search_kwargs={"k": 10})

11. Filtering Using Metadata

To include filtering based on metadata, use the following code snippet. This example searches for documents that match either value1 or value2 in specific metadata fields:

retriever = vectordb.as_retriever(search_kwargs={
    "k": 5,
    "filter": [{
        "bool": {
            "should": [
                {"terms": {"metadata.x.keyword": ["value1"]}},
                {"terms": {"metadata.y.keyword": ["value2"]}}
            ]
        }
    }]
})

If you wish to filter using a single key in the metadata, you can use:

retriever = vectordb.as_retriever(search_kwargs={
    "k": 5,
    "filter": [
        {
            "terms": {
                "metadata.x.keyword": ["value1"],
            }
        },
    ]
})
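If you find yourself writing these clauses repeatedly, a small helper can build the filter list from a plain dict. This is just a convenience sketch, not a LangChain API; build_metadata_filter is a made-up name, and the metadata.<field>.keyword paths follow the mapping used throughout this article:

```python
def build_metadata_filter(field_values):
    """Turn {"x": ["value1"], "y": ["value2"]} into a list of ES terms clauses."""
    return [
        {"terms": {f"metadata.{field}.keyword": values}}
        for field, values in field_values.items()
    ]

filters = build_metadata_filter({"x": ["value1"], "y": ["value2"]})
print(filters)
# Then: retriever = vectordb.as_retriever(search_kwargs={"k": 5, "filter": filters})
```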

12. Understanding the K Value

The k value signifies the number of retrieved document chunks, sorted in descending order based on their similarity scores.
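Conceptually, retrieval scores every candidate chunk and keeps the k best. The selection itself is just a sort, which this toy sketch illustrates (the chunk names and scores are invented):

```python
# (chunk, similarity score) pairs -- made-up illustration values.
scored = [("chunk A", 0.42), ("chunk B", 0.91), ("chunk C", 0.77)]

def top_k(scored_chunks, k):
    """Return the k chunks with the highest scores, best first."""
    return sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)[:k]

print(top_k(scored, 2))  # [('chunk B', 0.91), ('chunk C', 0.77)]
```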

13. Interacting with the LLM

Now, you can ask questions to the language model (LLM). First, instantiate the LLM using ChatOpenAI:

from langchain_openai import ChatOpenAI

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
llm = ChatOpenAI(temperature=0.5, openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo-16k")

Next, create a prompt that defines how you want the model to respond:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Keep the answer concise. Context: {context} """),
    ("human", """Question/s: {question} Answer/s: """)
])

14. Creating a Chain and Invoking It

Then, create a chain that combines the retriever, prompt, and LLM:

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()
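As written, the chain hands the prompt a list of Document objects for {context}. A common refinement (optional, but often cleaner) is to join their page_content into one string first; format_docs below is a hypothetical helper, demonstrated with stand-in documents:

```python
from types import SimpleNamespace

def format_docs(docs):
    """Join retrieved documents' text into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)

# In the chain it would slot in as:
# chain = {"context": retriever | format_docs, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()

# Quick check with stand-in documents:
docs = [SimpleNamespace(page_content="first chunk"), SimpleNamespace(page_content="second chunk")]
print(format_docs(docs))  # two chunks separated by a blank line
```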

Finally, invoke the chain and pass in your question:

question = ""  # your question for the LLM
res = chain.invoke(question)
print(res)

15. Retrieving Associated Costs and Token Usage

To retrieve the associated costs and total tokens, use the following snippet:

from langchain_community.callbacks import get_openai_callback

with get_openai_callback() as cb:
    res = chain.invoke(question)
    total_tokens = cb.total_tokens
    cost = cb.total_cost
    print(res, total_tokens, cost)

Conclusion

And there you have it! You’ve harnessed the power of Elasticsearch and LangChain to build a chatbot capable of retrieving and processing data intelligently. Thank you for reading, and happy coding!
