Run ChatGPT-Style Questions Over Your Own Files Using the OpenAI API and LangChain!

Introduction

I'm sure you've all heard about ChatGPT by now. It's an amazing Large Language Model (LLM) system that is opening up new and exciting capabilities. It's been trained over a huge corpus of text from across the internet, but what if you want to query your own file or files? Thanks to the simple (but powerful!) OpenAI API and the amazing work done by the team at LangChain, we can knock up a basic Question and Answering application that answers questions from your files. This is all very new technology so I'm also learning as I go along and am always open to hearing feedback and improvements I can make - feel free to comment!

The goal of the article is to get you started with Question and Answering your own document(s). However, as described in the Improvements section below, various aspects can be optimised. If there's enough interest, I can go into more detail about those topics in future articles.

Sound good? Let's get to it! (Full code is on my GitHub)

High-Level Steps

Set up our development environment, API Key, and dependencies
Load in our file or directory containing multiple files
Create and persist (optional) our database of embeddings (will briefly explain what they are later)
Set up our chain and ask questions about the document(s) we loaded in

Prerequisites

You'll need an OpenAI API Key. I recommend putting a hard limit on spending so you don't accidentally go over, especially when experimenting with code (new users may automatically get free credit, but I've had my account for more than 3 months so those credits expired for me). You can also use the OpenAI calculator to estimate costs - we'll be using the gpt-3.5-turbo model in this article.
Developer Environment (of course). I'm using the OpenAI Python SDK, LangChain, and VS Code for the IDE. The requirements.txt file is available in the GitHub repo.
A file or files to test with. I recommend starting with a single file. As someone with a quant fund background who uses this for trading information, I'll be using the Microsoft Q2 FY23 Earnings Call Transcript (from this page).

Set up the OpenAI Key

If you haven't done so already, create an account over at OpenAI
(Optional but recommended) - Go to Billing...Usage Limits... and set your Soft and Hard limits. I used £10 but feel free to use whatever you're comfortable with. This prevents you from spending more than you expected, which is especially useful when prototyping and experimenting with the API
If you haven't got free credits, you may need to enter your payment details to gain access
Head over to the API Keys section and generate a new secret - Copy this secret before closing the window otherwise you won't get a chance to see it in full again

When dealing with API keys and secrets, I like to use environment variables for security. So in your directory, create a file called ".env" (note the full-stop/period at the beginning)

In the .env file, type OPENAI_API_KEY = '<your secret key from above>'

# [.env file]
OPENAI_API_KEY = 'sk-....' # enter your entire key here

If you're using Git, create a .gitignore file and add ".env" to it as we don't want to commit this to our repo by accident and leak our secret key! I've also added "db/" which will be our database folder. The database could contain personal document data, so I'm ensuring that doesn't get committed either.

# [.gitignore file]
# Comments must sit on their own line; Git treats a trailing "# ..." as part of the pattern
# This will prevent the .env file from being committed to your repo
.env
# This will be our database folder. I don't want to commit it so adding here
db/

Install all the required dependencies. Download the requirements.txt file from here and run

pip3 install -r requirements.txt

Alternatively, you can manually use pip to install the dependencies below:

chromadb==0.3.21
langchain==0.0.146
python-dotenv==1.0.0

Let's open our main Python file and load our dependencies. I'm calling the app "ChatGPMe" (sorry, couldn't resist the pun...😁) but feel free to name it what you like. In this article, I have removed the type annotations for clarity but the GitHub version contains the strongly typed version (I think it's good practice to add strong typing to Python code, I miss it from C#!)

# dotenv is a library that allows us to securely load env variables
from dotenv import load_dotenv 

# used to load an individual file (TextLoader) or multiple files (DirectoryLoader)
from langchain.document_loaders import TextLoader, DirectoryLoader

# used to split the text within documents and chunk the data
from langchain.text_splitter import CharacterTextSplitter

# use embedding from OpenAI (but others available)
from langchain.embeddings import OpenAIEmbeddings

# using Chroma database to store our vector embeddings
from langchain.vectorstores import Chroma

# use this to configure the Chroma database  
from chromadb.config import Settings

# we'll use the chain that allows Question and Answering and provides source of where it got the data from. This is useful if you have multiple files. If you don't need the source, you can use RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain

# we'll use the OpenAI Chat model to interact with the embeddings. This is the model that allows us to query in a similar way to ChatGPT
from langchain.chat_models import ChatOpenAI

# we'll need this for reading/storing from directories
import os

You may notice that many of the LangChain libraries above end in the plural. This is because LangChain is a framework for apps powered by language models so it allows numerous different chains, database stores, chat models and such, not just OpenAI/ChatGPT ones! This opens up huge possibilities for running offline models, open-source models and other great features.

We'll load the .env file using dotenv. This library makes it easier and more secure to work with environment files to help secure secret keys and such. You could hardcode the API key directly in your file but this way is more secure and generally considered good practice.

# looks for the .env file and loads the variable(s) 
load_dotenv()
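
A quick sanity check can help here, since a typo in the .env file otherwise only shows up later as an authentication error. Below is a minimal sketch using only the standard library. The helper name require_env is made up for this example and is not part of dotenv or LangChain.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail early with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set - check your .env file")
    return value


# Example usage (run this after load_dotenv() has been called):
# api_key = require_env("OPENAI_API_KEY")
```

Failing at startup like this is a design choice: it keeps the error close to its cause rather than surfacing halfway through building the database.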

Excellent, we now have our dependencies and API key set up, let's get to the fun bit!

Load the Files and Embeddings

Persisting the database is optional but I found it worthwhile. By default, if you don't persist the database, it will be transient, which means that it is deleted when your program ends and your documents have to be analysed every time you run the program. For a small number of files that's fine, but it can quickly add to the loading time if you need to analyse multiple files every time you run the app. So let's create a couple of variables we'll use to store the database in a folder.

# get the absolute path of this Python file
FULL_PATH = os.path.dirname(os.path.abspath(__file__))

# get the full path with a folder called "db" appended
# this is where the database and index will be persisted
DB_DIR = os.path.join(FULL_PATH, "db")

Let's load in the file we want to query. I'm going to query Microsoft's Q2 FY23 Earnings Call transcript but feel free to load whatever document(s) you like.

# use TextLoader for an individual file
# explicitly stating the encoding is also recommended
doc_loader = TextLoader('MSFT_Call_Transcript.txt', encoding="utf8")

# if you want to load multiple files, place them in a directory 
# and use DirectoryLoader; comment above and uncomment below
#doc_loader = DirectoryLoader('my_directory')

# load the document
document = doc_loader.load()

I'll only be using TextLoader but the syntax is the same for DirectoryLoader so you can do a drop-in replacement with the load() method.
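
To make it concrete what a loader hands back: load() returns a list of documents, each pairing the text with metadata such as the source path. The sketch below mimics that shape with plain dictionaries and the standard library. It is an illustration only, not how LangChain implements its loaders, and load_text_files is a made-up helper name.

```python
import os


def load_text_files(directory, encoding="utf8"):
    """Read every .txt file in a directory into a list of simple records."""
    records = []
    for file_name in sorted(os.listdir(directory)):
        if not file_name.endswith(".txt"):
            continue  # skip anything that isn't a plain text file
        path = os.path.join(directory, file_name)
        with open(path, encoding=encoding) as handle:
            records.append(
                {"page_content": handle.read(), "metadata": {"source": path}}
            )
    return records
```

Keeping the source path next to the text is what later lets the chain report where an answer came from.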

We've loaded the files but now we need to split the text into what are called chunks. Essentially, chunking breaks a long text into smaller pieces ("chunks") that each carry a self-contained bit of meaning, so that only the relevant pieces need to be looked up later. For example, take the sentence below in the context of a football (soccer) game:

"The striker scored a goal in the final minute of the game."

One possible way to chunk this sentence is:

Chunk 1: "The striker"
Chunk 2: "scored"
Chunk 3: "a goal in the final minute"
Chunk 4: "of the game"

However, notice that Chunk 3 bundles two ideas together: what happened ("a goal") and when it happened ("in the final minute"). While this chunking still conveys the essential information of the sentence, it is not as precise as it could be. A better way to chunk the sentence would be:

Chunk 1: "The striker"
Chunk 2: "scored"
Chunk 3: "a goal"
Chunk 4: "in the final minute"
Chunk 5: "of the game"

In this revised version, each chunk conveys a more distinct and specific idea. Chunk overlap is a separate setting: it's the amount of text repeated at the end of one chunk and the start of the next so that context isn't lost at the boundary. Neither version above has any overlap, as no words appear in two chunks. Chunking is a whole topic of its own so I will leave it there. If you want to find out more, you can search for chunking in Natural Language Processing (NLP) where good chunking is critical to the optimum usage of NLP models.

So, with the quick chunking detour above, let's split our document with 512 as a chunk size and 0 as the overlap - feel free to play with these depending on your document.

# obtain an instance of the splitter with the relevant parameters 
text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=0)

# split the document data
split_docs = text_splitter.split_documents(document)
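
To get a feel for what chunk size and chunk overlap mean in terms of characters, here is a simplified sketch that cuts a string into fixed-size windows. This is not LangChain's algorithm: as far as I can tell, CharacterTextSplitter splits on a separator first and then merges the pieces, so its output will differ. The function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap=0):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window moves forward
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sentence = "The striker scored a goal in the final minute of the game."
for chunk in split_with_overlap(sentence, chunk_size=20, chunk_overlap=5):
    print(repr(chunk))
```

With an overlap of 5, the last five characters of each chunk reappear at the start of the next one. With an overlap of 0, the chunks simply sit end to end.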

We now want to load the OpenAI embeddings. An embedding essentially converts language as we use it into numerical values (vectors) so that a computer can work with the words and their relationship to other words. Words with similar meanings will have a similar representation. Like chunking, embedding is a huge topic but here's a nice article on Word2Vec which is one way to create word embeddings. Let's get back on track with using embeddings created by OpenAI.

# load the embeddings from OpenAI
openai_embeddings = OpenAIEmbeddings()
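
To illustrate "similar meanings, similar representation", the sketch below compares a few vectors using cosine similarity, a common way to measure how closely two embeddings point in the same direction. The three-dimensional vectors are made up purely for illustration. Real embeddings come from the model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors (assumption: not real model output)
toy_embeddings = {
    "revenue": [0.9, 0.1, 0.0],
    "sales": [0.8, 0.2, 0.1],
    "penalty": [0.0, 0.2, 0.9],
}

print(cosine_similarity(toy_embeddings["revenue"], toy_embeddings["sales"]))
print(cosine_similarity(toy_embeddings["revenue"], toy_embeddings["penalty"]))
```

The first pair scores much higher than the second, which is the property a question-answering app relies on when it looks for text related to your question.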

Simple! Let's now create our Chroma database to store these embeddings. Chroma was written from the ground up to be an AI-native database and works well with LangChain to quickly develop and iterate AI applications.

We'll start by configuring the parameters of the database

# configure our database
client_settings = Settings(
    chroma_db_impl="duckdb+parquet", #we'll store as parquet files/DuckDB
    persist_directory=DB_DIR, #location to store 
    anonymized_telemetry=False # optional but showing how to toggle telemetry
)

Now let's create the actual vector store (i.e. the database storing our embeddings).

# create a module-level variable for the vector store
vector_store = None

# check if the database exists already
# if not, create it, otherwise read from the database
if not os.path.exists(DB_DIR):
    # Create the database from the split document(s) above and use the OpenAI embeddings for the word to vector conversions. We also pass the "persist_directory" parameter which means this won't be a transient database, it will be stored on the hard drive at the DB_DIR location. We also pass the settings we created earlier and give the collection a name
    vector_store = Chroma.from_documents(split_docs, openai_embeddings, persist_directory=DB_DIR, client_settings=client_settings,
                    collection_name="Transcripts_Store")

    # It's key to call the persist() method otherwise it won't be saved
    vector_store.persist()
else:
    # As the database already exists, load the collection from there
    vector_store = Chroma(collection_name="Transcripts_Store", persist_directory=DB_DIR, embedding_function=openai_embeddings, client_settings=client_settings)
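
If you're wondering what a vector store does conceptually, the toy class below keeps vectors alongside their text and returns the closest matches to a query vector. It is a brute-force illustration under my own assumptions (cosine similarity, an in-memory list). Chroma uses its own indexing and distance settings, and TinyVectorStore is a made-up name.

```python
import math


class TinyVectorStore:
    """Toy in-memory store: keeps (vector, text) pairs and ranks them by cosine similarity."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((vector, text))

    def query(self, vector, k=1):
        """Return the k stored texts whose vectors are most similar to the query vector."""
        def similarity(item):
            stored = item[0]
            dot = sum(x * y for x, y in zip(stored, vector))
            norms = math.sqrt(sum(x * x for x in stored)) * math.sqrt(
                sum(y * y for y in vector)
            )
            return dot / norms if norms else 0.0

        ranked = sorted(self._items, key=similarity, reverse=True)
        return [text for _, text in ranked[:k]]


store = TinyVectorStore()
store.add([0.9, 0.1], "Revenue grew this quarter")
store.add([0.1, 0.9], "The weather was sunny")
print(store.query([0.8, 0.2], k=1))
```

A real store does the same job at scale and, when persisted, saves you from recomputing the embeddings on every run.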

We now have our embeddings stored! The final step is to load our chain and start querying.

Create the Chain and Query

LangChain, as the name implies, has many chains to use and experiment with. Chains essentially allow you to "chain" together multiple components, such as taking input data, formatting it to a prompt template, and then passing it to an LLM. You can create your own chains or, as I'm doing here, use pre-existing chains which cover common use cases. For our case, I'm going to use RetrievalQAWithSourcesChain. As the name implies, it also returns the source(s) used to obtain the answer. I'm doing this to show that the demo is only using my document and not reaching out to the web for answers (shown by the Google question at the end).

# create and configure our chain
# we're using ChatOpenAI LLM with the 'gpt-3.5-turbo' model
# we're setting the temperature to 0. The higher the temperature, the more 'creative' the answers. In my case, I want as factual and direct from source info as possible
# 'stuff' is the default chain_type which means all the retrieved chunks are placed ("stuffed") into a single prompt
# set the retriever to be our embeddings database
qa_with_source = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo'),
    chain_type="stuff",
    retriever=vector_store.as_retriever()
)

There are currently four chain types but we're using the default one, 'stuff', which places all the retrieved chunks into a single prompt in one go. Other methods like map_reduce can help with batching documents so you don't exceed token limits, but that's a whole other topic.
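
As a rough picture of the 'stuff' approach, the sketch below joins every retrieved chunk and the question into one prompt string. The prompt wording and the build_stuff_prompt name are my own assumptions for illustration. LangChain ships its own prompt templates, which will look different.

```python
def build_stuff_prompt(question, chunks):
    """Combine every retrieved chunk and the question into a single prompt string."""
    context = "\n\n".join(
        f"Content: {chunk['text']}\nSource: {chunk['source']}" for chunk in chunks
    )
    return (
        "Answer the question using only the extracts below and list the sources used.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_chunks = [
    {"text": "Cloud revenue increased year over year.", "source": "transcript.txt"},
    {"text": "Operating expenses were flat.", "source": "transcript.txt"},
]
print(build_stuff_prompt("How did cloud revenue change?", example_chunks))
```

Because everything goes into one prompt, this approach is simple but can hit the model's token limit when many or large chunks are retrieved, which is where the other chain types come in.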

We're almost there! Let's create a quick function which handles the answering of the question and then create a loop for the user to ask questions to the document.

# function to use our RetrievalQAWithSourcesChain
def query_document(question):
    response = qa_with_source({"question": question})
    # return the response so the caller can read the answer and sources
    return response

# loop through to allow the user to ask questions until they type in 'quit'
while True:
    # make the user input yellow using ANSI codes
    print("What is your query? ", end="")
    user_query = input("\033[33m")
    print("\033[0m")
    if user_query == "quit":
        break
    response = query_document(user_query)
    # make the answer green and source blue using ANSI codes
    print(f'Answer: \033[32m{response["answer"]}\033[0m')
    print(f'\033[34mSources: {response["sources"]}\033[0m')

And that's it! I hope that gets you started in what is an exciting field of development. Please feel free to comment and provide feedback.

Improvements

This is just the tip of the iceberg! For me personally, automating and running this with preset prompts across transcripts from various companies can provide good insights to help with trading decisions. For those interested in the financial/trading aspects of AI, you might like to read my short post on BloombergGPT. There is so much potential for alternative data and fundamentals analysis, it's a very exciting field. However, outside of that, it's also useful for your own personal files and organisation/searching and almost limitless other possibilities!

Specifically, there are several improvements to be made, here are a few:

Offline - This is a big one and maybe a topic for another blog if there's interest. Your data is still sent to OpenAI unless you opt out or use the Azure version of the API, which has a stricter usage policy for your data. Hugging Face hosts numerous open-source models and datasets to get your AI projects up and running. LangChain also supports Hugging Face so you could start experimenting with using offline Hugging Face models with LangChain to run everything without the internet or API costs
Automate - Individually querying is useful but some situations may require a large number of actions or sequential actions. This is where AutoGPT can come in.
Chunking - I've hardcoded 512 and you may have seen messages saying that some of the chunks exceeded that size. An improvement would be to use more dynamic chunk sizes tailored to the input documents
Token management and Prompt Templates - tokens are key to the API and you can optimise them such that you don't waste unnecessary tokens in your API call and still get the same results. This saves you money as you're using less of the limit and also allows more tailored prompts to provide better answers.
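
On the token management point, a rough estimate before calling the API can already help. The sketch below uses the common rule of thumb of about four characters per token for English text, which is only an approximation. A real tokenizer gives exact counts. The function names are hypothetical and the price is a parameter you supply yourself, since I'm not assuming any particular rate here.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate from character count; use a real tokenizer for exact numbers."""
    return -(-len(text) // chars_per_token)  # ceiling division


def estimate_cost(text, price_per_1k_tokens):
    """Approximate cost of sending text, given a price per 1,000 tokens that you supply."""
    return estimate_tokens(text) / 1000 * price_per_1k_tokens


# Example usage with a placeholder price (assumption: replace with the current rate)
prompt = "What was the revenue growth this quarter?"
print(estimate_tokens(prompt))
print(estimate_cost(prompt, price_per_1k_tokens=1.0))
```

Running an estimate like this over your prompts makes it easier to spot which ones are worth trimming.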

As I say, many more features can be explored but this was my first foray into trying to utilise OpenAI models for my personal documents and trading data. A lot of documentation, bug tickets, and workaround reading was involved so I hope I've saved you some time!

The full code can be found on my GitHub

Enjoy :)
