Semantic Query Over PDF Documents Using Qdrant and LangChain

vivekk999

Vivek Kale

Posted on December 30, 2023


Introduction

Large Language Models (LLMs) have emerged as a cornerstone of deep learning, showcasing their ability to understand and generate text that follows human language patterns. A crucial element within LLM applications is vectorization, specifically vector embeddings. While "vector" is a familiar concept from school-level linear algebra, in the context of language models, vector embeddings act as numerical representations of words or sentences. LLMs have many practical applications, especially in querying private data such as PDFs, text files, and CSVs, and many of these applications hinge on vectorization for content processing.

LangChain stands out by helping build pipelines around LLMs, introducing new approaches to content analysis and processing. Because it integrates well with vectorization workflows, it has become a central component in building LLM-driven applications.

Qdrant: Overview

In the world of vector search databases, there is a wide range of choices with different features and performance levels, from established closed-source options to newer entrants. Qdrant stands out as a top-tier open-source vector search database, known for its user-friendly onboarding and a robust free tier with generous storage and processing bandwidth for experimentation. Qdrant supports building semantic search by providing APIs for inserting, searching, and managing points with payloads. In particular, it supports filtering, a valuable feature for semantic matching and many other applications.

Creating a Semantic Query Using Qdrant and LangChain

This blog explores the construction of a semantic query pipeline using Qdrant and LangChain, and shows how to integrate Qdrant into a Q&A system built on semantic relations. With new vector database systems and technologies arriving all the time, it is an exciting area to explore.

The significance of semantic query, powered by Large Language Models (LLMs), cannot be overstated. It plays a crucial role in understanding and extracting meaningful information from vast datasets. An LLM's ability to pick up on the nuances of language allows for more accurate and context-aware queries, changing the way we interact with data.

When coupled with Qdrant and LangChain, semantic query becomes even more powerful. Qdrant's efficient handling of high-dimensional vector data aligns seamlessly with LangChain's language-processing pipelines. Together they improve the precision and relevance of semantic queries, making the combination a valuable tool for extracting insights in fields such as natural language processing, information retrieval, and knowledge discovery.

Qdrant X Langchain X OpenAI

The Architecture of the Project

We start by uploading a file, typically a PDF, to our application. We then extract the text from the file, which is a crucial part of the process.

The next step involves LangChain, a tool that helps create pipelines that break the text down into manageable chunks and create embeddings through an embeddings model. For this we use OpenAI's text-embedding-ada-002 model to convert the text into embeddings, which are then stored in a vector store database.

LangChain also plays a key role in recognizing the user's intent and extracting entities from the provided PDF file. This information is then sent back to the application: the intent helps in calling the appropriate services, while the entity details help locate the relevant data.

With all the necessary data now stored in the vector store, the next task is to ask questions. To answer them, we need to bring in context from the stored chunks. Qdrant's similarity search identifies the most relevant chunks, which are then sent to the OpenAI GPT-3.5 Turbo model to read, analyze, and generate an accurate answer.

(Architecture diagram: PDF upload → text extraction → chunking → embeddings via text-embedding-ada-002 → Qdrant vector store → similarity search → GPT-3.5 Turbo answer)

The Step-by-Step Guide

Before you start coding, it's essential to obtain access to Qdrant DB.

Step 1. Sign up for a free trial account on the Qdrant website.

Step 2. Go to Data Access Control and create a new cluster, specifying your cluster name.


Step 3. Copy the API key and URL displayed on the screen and store them somewhere secure; they will be needed later during coding.


Let’s start coding:

To start off, let's install the packages we need. It's the first step in getting everything set up and ready to go:

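The original requirements block isn't reproduced here, so the following install line is a minimal sketch. The snippets below assume the pre-1.0 openai SDK (the one exposing openai.Embedding and openai.ChatCompletion), so the version pin is an assumption rather than the author's exact setup:

```
pip install qdrant-client PyPDF2 "openai<1.0" langchain
```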

Let's now bring in all the key packages that will come into play in building the application. We use the qdrant-client library for accessing the Qdrant cloud interface, PyPDF2 for handling PDF files, and the openai package for calling the OpenAI API (you need to generate a secret API key from your account settings). Additionally, we use LangChain for its CharacterTextSplitter and embedding utilities.

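The original imports aren't shown, so here is a minimal sketch of the setup the rest of the walkthrough assumes; the placeholder credential strings and the QDRANT_URL / QDRANT_API_KEY names are illustrative, not from the original post:

```python
import uuid

import openai
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.http import models

# OpenAI secret key, generated from your account's API settings
openai.api_key = "YOUR_OPENAI_API_KEY"

# Qdrant cluster credentials copied from the cloud console in Step 3
QDRANT_URL = "YOUR_QDRANT_CLUSTER_URL"
QDRANT_API_KEY = "YOUR_QDRANT_API_KEY"
```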

In this section, I've written a function to read a PDF file, using one of my own research papers as the input document. The first step is to extract the text from the PDF, which is accomplished with PyPDF2. The function retrieves the text from the file and returns it for further processing.

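A minimal sketch of such a function, assuming PyPDF2's PdfReader and a helper name (read_pdf) of my own choosing:

```python
def read_pdf(file_path):
    """Extract raw text from every page of the PDF."""
    reader = PdfReader(file_path)
    raw_text = ""
    for page in reader.pages:
        page_text = page.extract_text()
        if page_text:  # extract_text() can return None for image-only pages
            raw_text += page_text
    return raw_text
```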

Next, the extracted text is split into smaller chunks. This is essential for improving the accuracy of the embedding process. To achieve this, the snippet below uses the CharacterTextSplitter from langchain.text_splitter. Each chunk is configured to span roughly 1,000 characters, with a designated overlap to ensure that critical information is retained across chunk boundaries.

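A sketch of that splitting step; the separator and overlap values are reasonable defaults rather than the author's exact settings:

```python
def get_text_chunks(raw_text):
    """Split the extracted text into overlapping chunks of ~1,000 characters."""
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,    # target chunk size, measured in characters
        chunk_overlap=200,  # overlap so context is not lost at chunk boundaries
        length_function=len,
    )
    return text_splitter.split_text(raw_text)
```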

Afterward, we move on to converting the text chunks into vector embeddings. This conversion is key, as it allows us to capture the semantics of the text, facilitating the question-and-answer (QA) process.

The function takes two parameters: ‘text_chunks’, the chunks of text to be embedded, and an optional ‘model_id’ that defaults to "text-embedding-ada-002". It initializes an empty list called ‘points’ to store the embedding information for each text chunk, then iterates through the chunks, making an OpenAI API call to create an embedding for each one.

The embedding is extracted from the API response, and a unique ID is generated for each point using the “uuid” library. The structured data for each point, including the unique ID, the embedding, and a payload containing the original text chunk, is appended to the points list. Finally, the function returns the list of points, each encapsulating the embedding information for a specific text chunk. The code snippet below illustrates the transformation of text chunks into embeddings.

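A sketch of that embedding function, assuming the pre-1.0 openai SDK and Qdrant's PointStruct for the point format; the payload key name ("text") is my own choice:

```python
def get_embedding(text_chunks, model_id="text-embedding-ada-002"):
    """Embed each chunk with OpenAI and wrap it as a Qdrant point."""
    points = []
    for chunk in text_chunks:
        response = openai.Embedding.create(input=chunk, model=model_id)
        embedding = response["data"][0]["embedding"]  # 1536-dimensional vector
        points.append(
            models.PointStruct(
                id=str(uuid.uuid4()),     # unique ID generated with the uuid library
                vector=embedding,
                payload={"text": chunk},  # keep the original chunk alongside its vector
            )
        )
    return points
```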

Now, we establish a connection to access the Qdrant interface, as we're utilizing the Qdrant vector store database for data storage and subsequent question-answering based on the stored data. While there are various vector store databases available, I opted for Qdrant in this instance.

The following code snippet demonstrates the process of establishing a connection for Qdrant:

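A minimal sketch of that connection and collection setup, using the parameters described in the next paragraph (1536 dimensions, cosine distance); printing the collection info is my own illustration:

```python
# Connect to the hosted Qdrant cluster with the URL and API key from Step 3
client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

def create_collection(collection_name="my-collection"):
    """Create the collection that will hold the chunk embeddings."""
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=1536,                        # text-embedding-ada-002 vectors have 1536 dimensions
            distance=models.Distance.COSINE,  # cosine distance for similarity search
        ),
    )
    # Retrieve and show information about the freshly created collection
    print(client.get_collection(collection_name=collection_name))
```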

In this code, the QdrantClient from "qdrant_client" is used, with the cluster "url" and "api_key" as connection parameters. The client is initialized with these details, providing a means to interact with the Qdrant vector store database. The "create_collection" call then creates a Qdrant collection named "my-collection", with vector parameters set to a size of 1536 dimensions and the cosine distance metric. Finally, information about the collection is retrieved through "get_collection".

Here, we set up a function to save the generated embeddings into the Qdrant database for extractive question-answering (QA).

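A sketch of "insert_data", matching the upsert call described in the next paragraph; the parameter name follows the article's wording:

```python
def insert_data(get_points, collection_name="my-collection"):
    """Upsert the embedded points into the Qdrant collection."""
    client.upsert(
        collection_name=collection_name,
        wait=True,          # block until the points are fully persisted
        points=get_points,  # list of PointStruct objects from get_embedding()
    )
```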

The code snippet defines the "insert_data" function, responsible for storing the generated embeddings and associated data in the Qdrant database. The function uses Qdrant's "upsert" operation, specifying "my-collection" as the "collection_name" and setting "wait" to True. The data to be inserted comes from the "get_points" parameter, the list of points produced by the "get_embedding" function.

The next function, "create_answer_with_context", operates in two parts. First, it converts the user's question into an embedding using the OpenAI "text-embedding-ada-002" model. It then retrieves an answer based on the chunks most similar to that embedding.

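A sketch of "create_answer_with_context" along those lines, again assuming the pre-1.0 openai SDK; the exact prompt wording is illustrative:

```python
def create_answer_with_context(query):
    """Embed the question, retrieve similar chunks from Qdrant, and ask GPT-3.5 Turbo."""
    # Part 1: embed the question and fetch the three most similar chunks
    response = openai.Embedding.create(input=query, model="text-embedding-ada-002")
    query_embedding = response["data"][0]["embedding"]

    search_results = client.search(
        collection_name="my-collection",
        query_vector=query_embedding,
        limit=3,  # keep only the top three matches
    )

    # Part 2: print the question and build a prompt from the retrieved chunks
    print("Question:", query)
    context = "\n---\n".join(result.payload["text"] for result in search_results)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```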

In the first part, the user's question is run through the OpenAI API to generate an embedding, which is then used to query the "my-collection" collection in the Qdrant database. The search is limited to the top three results.

The second part prints the user's question and builds a prompt for the OpenAI ChatCompletion model. This model, "gpt-3.5-turbo", generates a response from the concatenated retrieved chunks and the user's question. The final answer is extracted from the completion choices and returned by the function.

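A sketch of the "main" function tying everything together; the file path and example question are placeholders:

```python
def main():
    # 1. Read the PDF and split it into chunks
    raw_text = read_pdf("research_paper.pdf")  # placeholder path
    text_chunks = get_text_chunks(raw_text)

    # 2. Create the collection, embed the chunks, and store them in Qdrant
    create_collection("my-collection")
    points = get_embedding(text_chunks)
    insert_data(points)

    # 3. Ask a question over the stored document and print the answer
    question = "What problem does this paper address?"  # placeholder question
    answer = create_answer_with_context(question)
    print(answer)

if __name__ == "__main__":
    main()
```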

The "main" function arranges these key steps: reading raw text from a PDF, segmenting it into chunks, generating vector embeddings, and inserting them into the Qdrant database. It then asks a specific question, retrieves a contextual answer, and prints the result. This smooth workflow showcases the system's ability to provide relevant answers based on stored information. The script runs upon direct execution through the "main" check.

Conclusion

In conclusion, this post walked through the practical construction of a semantic query pipeline using Qdrant and LangChain, with code detailing the integration process. Incorporating Qdrant into a Q&A system for semantic search was straightforward thanks to the snippets and practical examples above.
