How to Build a Langchain PDF Chatbot

Have you ever wished you could ask questions of the document files on your computer instead of searching through them manually? You are not alone; many developers run into this problem. Thankfully, the rise of generative AI and conversational AI technology makes this possible.

In 2022, there was a massive explosion of generative AI algorithms. OpenAI released the now-famous chatbot ChatGPT; the "GPT" in its name stands for Generative Pre-trained Transformer. This AI system, powered by deep learning, took the world by surprise with its ability to generate human-like text and hold conversations with people.

In this article, you will get a brief introduction to Large Language Models (LLMs), learn what the Langchain framework is, and see how to build your first Langchain PDF chatbot.

What are Large Language Models?

LLMs are machine learning models trained on large amounts of textual data (articles, PDFs, web content) so that they learn the underlying statistical and grammatical relationships between words. This enables them to predict the next words given an input text (also called a prompt).

Simply put, an LLM is a machine learning model trained to predict the next word (more precisely, the next token) in a text. LLMs are trained using self-supervised machine learning techniques. The performance of a language model depends heavily on the quality of its training data: the higher the quality of the data, the better the model performs.
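
To make next-word prediction concrete, here is a toy sketch that counts which word follows which in a tiny corpus and then predicts the most frequent follower. The corpus and the function names are made up for illustration; real LLMs use neural networks over subword tokens, not raw counts.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    followers = defaultdict(Counter)
    words = corpus.lower().split()
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus, for illustration only.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # most frequent word after "the"
```

The same principle, scaled up to billions of parameters and far more training text, is what lets an LLM continue a prompt.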

The best-known example is ChatGPT, which is built on OpenAI's GPT models. Other popular LLMs include:

Llama by Meta
PaLM 2 by Google
Command by Cohere

Limitations of ChatGPT

Many developers write code with ChatGPT, even though it has no context about their existing repositories. Here are some of its other limitations:

Limited knowledge - Its knowledge is based only on the data that was available at training time. At the time of writing, it only has access to information up to 2021, although the OpenAI team is working to give it access to current data
Hallucination - This occurs when an LLM presents false information as if it were true
No access to private data - It cannot see your private data (e.g. data on your system or on your Google Drive)

Let's get an understanding of what Langchain is all about before we dive into creating a Langchain PDF chatbot.

Why Langchain?

The Langchain framework was developed to address some of the limitations associated with using LLMs. Langchain is a framework that makes working with language models easier, and it gives you the tools and building blocks needed to build “LLM-powered” applications.

It enables you to build applications that are:

Context-aware - They have information relevant to a given user query
Reasoning - They make use of LLMs to decide how to answer or which action to take

The Langchain framework has some modules that handle different offerings of the framework. These include:

Model I/O - handles interfacing with LLMs
Retrieval - enables retrieval of external data
Chains - connect different Langchain components
Agents - allow you to use LLMs as reasoning engines
Memory - persists application state between calls

The Langchain framework has a lot of applications; some of which include:

Chatbot
Summarization
Extraction
Personal assistant
Database querying

Although the OpenAI team has done a great job in building ChatGPT, it has the limitations mentioned above. The Langchain framework helps you work around those limitations, for ChatGPT and for other LLMs.

At this point, you know what LLMs are, have seen examples of some popular LLMs, and understand how the Langchain framework fits into the picture. Let's proceed to build our PDF chatbot with the Langchain framework.

Coding your Langchain PDF Chatbot

Below is an architectural diagram that shows how the whole system interacts:

Before proceeding to the next section, make sure the following Python packages are installed. Run the code cell below to install them.

!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken

The function of the “PyPDF2” library is to help us load our PDF document so that we can work with it.

The FAISS (Facebook AI Similarity Search) library helps us index and store text embeddings. It was developed by the Facebook AI team for efficient similarity search between vectors. It also supports GPUs (Graphics Processing Units), which can speed up search considerably, although the faiss-cpu package installed above runs on the CPU only.

You also need to get an OpenAI API (Application Programming Interface) key from their website.

Import Libraries

The next step we are going to take is to import the libraries we will be using in building the Langchain PDF chatbot.

# These import paths match the Langchain release this article was written for; newer releases have moved some of them.
import os
from PyPDF2 import PdfReader  # PDF loading
from langchain.vectorstores import FAISS  # vector store
from langchain.embeddings.openai import OpenAIEmbeddings  # text embeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter  # chunking
from langchain.chains.question_answering import load_qa_chain  # question answering
from langchain.llms import OpenAI  # the LLM itself

The Langchain library offers integrations with different vector stores. A vector store is similar to a normal database, except that it stores vectors. The "text_splitter" module is used to split the text of the PDF file into chunks.

# Get your API key from OpenAI (you will need to create an account).
# Setting it inline like this is for quick experiments only; never commit a real key.
import os
os.environ["OPENAI_API_KEY"] = "YOURAPIKEY"

Save your OpenAI API key in your system's environment. It is good practice to store API keys as environment variables and read them wherever you need them in your code, rather than hardcoding them.
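
As a small sketch of that practice, the helper below reads the key from the environment and fails with a clear message when it is missing. The name get_api_key is hypothetical; OPENAI_API_KEY is the variable name used above.

```python
import os

def get_api_key(name="OPENAI_API_KEY"):
    """Read an API key from the environment; raise if it is not set."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before running, "
            f"e.g. export {name}=..."
        )
    return key
```

Calling get_api_key() at the top of your notebook surfaces a missing key immediately instead of halfway through the pipeline.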

# read in your pdf file
pdf_reader = PdfReader("path to the pdf file")

This loads your PDF document from the location where it is stored on your system. For this article, we are going to use the “GPT-4 Technical Report”, but you can use any PDF file of your choice.

# read data from the file and put it into a variable called text
text = ''
for page in pdf_reader.pages:
    page_text = page.extract_text()
    # skip pages where no text could be extracted
    if page_text:
        text += page_text

This loop converts the PDF into plain text, page by page, so that we can break it into chunks.

Chunk your Documents

The next step is to split the text of the PDF document into chunks, so that the retrieved text fits within the model's token size limit when we query it.

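text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,  # maximum chunk length, as measured by length_function
    chunk_overlap=32,  # length of text shared between neighbouring chunks
    length_function=len,  # len counts characters, not tokens
)
texts = text_splitter.split_text(text)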

What are tokens?

The tokenization step is usually one of the preprocessing steps before feeding text data into a language model.

Tokenization is the process of splitting chunks of text into smaller units that can easily be processed by the LLM.

Tokens are the building blocks of LLMs. They can be as short as a single character (e.g. “a”, “d”), a piece of a word, or as long as a whole word (e.g. “play”). The token size limit is the maximum number of tokens the LLM can process at once, counting both the prompt and the generated answer.
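
The sketch below gives a rough feel for tokens and a token limit. It is not the tokenizer OpenAI models use (those rely on byte-pair encoding, available through the tiktoken package); it simply splits text into words and punctuation marks, and the limit of 8 is an arbitrary number chosen for the demo.

```python
import re

def rough_tokenize(text):
    """Split text into word and punctuation tokens (a crude approximation)."""
    return re.findall(r"\w+|[^\w\s]", text)

def fits_limit(text, max_tokens):
    """Check whether the rough token count stays within max_tokens."""
    return len(rough_tokenize(text)) <= max_tokens

sample = "Tokens are the building blocks of LLMs."
tokens = rough_tokenize(sample)
print(tokens)
print(len(tokens), fits_limit(sample, max_tokens=8))
```

A real tokenizer usually produces a different count for the same sentence, so treat this only as an intuition aid.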

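The RecursiveCharacterTextSplitter splits the text on natural boundaries (paragraphs first, then lines, then words), so that related text stays together in the same chunk as far as possible. Note that with length_function = len, the chunk_size of 512 is measured in characters, not tokens.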
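
To illustrate what chunk_size and chunk_overlap control, here is a deliberately simplified sliding-window splitter. It is not Langchain's implementation: the real splitter prefers to break on separators such as paragraphs and spaces, whereas this sketch cuts at fixed character offsets. The function name is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into chunks of up to chunk_size characters, where each
    chunk repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears in full in at least one chunk, which helps retrieval later on.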

Next, we set up the embedding model. Nothing is downloaded here; OpenAIEmbeddings calls the OpenAI API whenever text needs to be embedded:

# Create the OpenAI embeddings client
embeddings = OpenAIEmbeddings()

We make use of the embeddings to convert the chunks of text into vectors:

docsearch = FAISS.from_texts(texts, embeddings)

This creates a vector database (our knowledge base) using the FAISS library and the OpenAI embeddings.
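
Conceptually, a similarity search compares the query vector with every stored vector and returns the closest chunks. The sketch below shows that idea with brute-force cosine similarity in plain Python. The three-dimensional vectors and chunk texts are invented for the example; real embeddings have many more dimensions, FAISS uses optimized index structures, and the distance measure in your index may differ from the cosine similarity assumed here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Made-up (text, vector) pairs standing in for embedded chunks.
store = [
    ("chunk about authors", [0.9, 0.1, 0.0]),
    ("chunk about training data", [0.1, 0.8, 0.2]),
    ("chunk about evaluation", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of a question about the authors
print(top_k(query_vec, store, k=2))
```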

Once the database has been created, we can then query it.

chain = load_qa_chain(OpenAI(), chain_type="stuff")

We are making use of load_qa_chain from Langchain to connect the results of our similarity search to the prompt (the user input). Remember that you can use chains for more complex tasks; this is a basic one.

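The OpenAI() wrapper also accepts a “temperature” parameter, which determines the degree of randomness of the text generated by the LLM. Lower values make the answers more deterministic. We leave it at its default here.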
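
One common way to picture temperature: the model's raw scores for candidate next tokens are divided by the temperature before being turned into probabilities. The sketch below applies that to three made-up scores. It illustrates the general idea only and is not OpenAI's internal code.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Convert raw scores to probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
low = softmax_with_temperature(scores, 0.2)   # sharper: top token dominates
high = softmax_with_temperature(scores, 2.0)  # flatter: more randomness
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

For a question-answering bot over a document, a low temperature is usually the safer choice, since you want consistent answers rather than creative ones.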

The chain_type="stuff" argument tells the chain to put ("stuff") all the retrieved documents into a single prompt. This suits applications like this one, where documents are small and only a few are passed in for most calls.
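
The idea behind "stuff" can be sketched in a few lines: concatenate the retrieved chunks and place them, together with the question, into one prompt. The template wording below is hypothetical and is not the prompt Langchain actually sends.

```python
def build_stuff_prompt(documents, question):
    """Put every retrieved chunk into a single prompt, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Made-up chunks standing in for similarity-search results.
docs = ["GPT-4 is a large multimodal model.", "It accepts image and text inputs."]
prompt = build_stuff_prompt(docs, "What inputs does GPT-4 accept?")
print(prompt)
```

Because everything goes into one prompt, this approach only works while the combined chunks stay under the token size limit, which is why the chunks were kept small earlier.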

Query the PDF

Finally, we get to ask our PDF questions:

query = "who are the authors of the article?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)
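
If you plan to ask several questions, the two calls above can be wrapped in a small helper. This is a sketch: ask is a hypothetical name, and it assumes docsearch and chain are the objects created earlier. The stand-in classes exist only so the snippet runs without an API key; in your notebook you would pass the real objects instead.

```python
def ask(docsearch, chain, question, k=4):
    """Retrieve the k most similar chunks and run the QA chain on them."""
    docs = docsearch.similarity_search(question, k=k)
    return chain.run(input_documents=docs, question=question)

# Stand-ins that mimic the two methods used above, for a dry run only.
class FakeStore:
    def similarity_search(self, query, k=4):
        return ["chunk one", "chunk two"][:k]

class FakeChain:
    def run(self, input_documents, question):
        return f"{len(input_documents)} chunks used for: {question}"

print(ask(FakeStore(), FakeChain(), "who are the authors of the article?", k=2))
```

Raising k gives the model more context to work with, at the cost of a longer prompt.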

This is what the output looks like:

Another question:

Conclusion

In this article, we have discussed LLMs, their limitations, and how the Langchain framework solves some of their problems. You have also learned how to build your own custom Langchain PDF chatbot.

The new wave of generative AI technology is not going away anytime soon. Building custom Langchain PDF chatbots helps you overcome some of the limitations of traditional LLMs, thanks to the framework's flexibility. Have fun implementing your PDF chatbot!
