Debugging Your RAG Application: A LangChain, Python, and OpenAI Tutorial

Let’s explore a real-world example of debugging a RAG-type application. I recently undertook this process while updating our company knowledge base – a resource for potential clients and employees to learn about us.

Tech Stack:

I work with Python and the LangChain framework, specifically using LangChain Expression Language (LCEL) to build chains. You can find the LangChain LCEL documentation here.

This approach serves as a good alternative to LangSmith, LangChain’s debugging tool.

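# Imports (paths as of recent LangChain releases; adjust to your installed version)
from operator import itemgetter

from langchain.memory import ConversationBufferMemory
from langchain_core.messages import get_buffer_string
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

# llm, the prompts (CONDENSE_QUESTION_PROMPT, ANSWER_PROMPT, DEFAULT_DOCUMENT_PROMPT),
# and combine_documents are defined elsewhere in the app.

# Maps session_id -> conversation memory
store = {}

# Load memory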
def get_session_history(session_id: str) -> ConversationBufferMemory:
    if session_id not in store:
        store[session_id] = ConversationBufferMemory(
            return_messages=True, output_key="answer", input_key="question"
        )
    return store[session_id]

def _get_loaded_memory(x):
    return get_session_history(x["session_id"]).load_memory_variables({"question": x["question"]})

def load_memory_chain():
    return RunnablePassthrough.assign(
        chat_history=RunnableLambda(_get_loaded_memory) | itemgetter("history"),
    )

# Create Question
def create_question_chain():
    return {
        "standalone_question": (
            {
                "question": itemgetter("question"),
                "chat_history": lambda x: get_buffer_string(x["chat_history"]),
            }
            | CONDENSE_QUESTION_PROMPT
            | llm
            | StrOutputParser()
        ),
        "role": itemgetter("role"),
    }

# Retrieve Documents
def retrieve_documents_chain(vector_store):
    retriever = vector_store.as_retriever()
    return {
        "role": itemgetter("role"),
        "docs": itemgetter("standalone_question") | retriever,
        "question": lambda x: x["standalone_question"],
    }

# Answer
def create_answer_chain():
    final_inputs = {
        "role": itemgetter("role"),
        "context": lambda x: combine_documents(x["docs"], DEFAULT_DOCUMENT_PROMPT),
        "question": itemgetter("question"),
    }
    return {
        "answer": final_inputs | ANSWER_PROMPT | llm,
        "docs": itemgetter("docs"),
    }

# Final Chain looks like this
chain = load_memory_chain() | create_question_chain() | retrieve_documents_chain(vector_store) | create_answer_chain()

While debugging, I prefer using a cheaper model like gpt-3.5-turbo for its cost-effectiveness. The less advanced models are more than adequate for basic testing. For final testing and deployment to production, you might consider upgrading to gpt-4-turbo or a similar advanced model.

I also favor Jupyter notebooks for much of my debugging. This way, I can include the notebook in a .gitignore file, reducing cleanup from debugging shenanigans in my main code. I can also run very specific pieces of my code without plumbing overhead.

Initial Observations

I noticed that basic queries received correct answers, but any follow-up question would lack the appropriate context, indicating that conversational memory was no longer functioning effectively.

Here's what I observed:

Question: What are the Focused Labs core values?
> AI: The core values of Focused Labs are Love Your Craft, Listen First, and Learn Why ✅
> Sources: ...

Question: Tell me more about the first one.
> AI: Based on the given context, the first one is about the importance of the "Red" step in Test Driven Development (TDD). ❌
> Sources: ...

However, I expected a response more along the lines of "Love Your Craft is when you are passionate about what you do."

For more context, this issue with conversational memory arose while I was implementing a new feature: allowing end users to customize responses based on their role. So, for example, a developer could receive a highly technical answer while a marketing manager would see more high-level details.

Debugging Steps

1. Ensure Role Feature Integrity

To avoid impacting the newly implemented role feature, I made it overly obvious and active in every response during this debugging session by temporarily updating my system prompt.

SYSTEM_PROMPT = """Answer the question from the perspective of a {role}."""

DEBUGGING_SYSTEM_PROMPT = """Answer the question in a {role} accent."""
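One way to make this swap easy to undo is to choose the prompt with a flag instead of editing the string in place. The sketch below is a minimal illustration, not the code from this project: the `DEBUG_ROLE_PROMPT` environment variable and the `pick_system_prompt` helper are hypothetical names.

```python
import os

SYSTEM_PROMPT = """Answer the question from the perspective of a {role}."""
DEBUGGING_SYSTEM_PROMPT = """Answer the question in a {role} accent."""


def pick_system_prompt(debug: bool) -> str:
    """Return the loud debugging prompt when debug is on, else the real one."""
    return DEBUGGING_SYSTEM_PROMPT if debug else SYSTEM_PROMPT


# Hypothetical switch: set DEBUG_ROLE_PROMPT=1 in the notebook while debugging.
debug_enabled = os.environ.get("DEBUG_ROLE_PROMPT") == "1"
system_prompt = pick_system_prompt(debug_enabled)

# str.format fills the {role} placeholder the same way for either template.
print(system_prompt.format(role="pirate"))
```

With the flag unset, the regular prompt is used, so there is nothing to remember to revert once debugging is done.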

Here's how the AI responded, clearly adhering to my updated prompt:

Question: What are the Focused Labs core values?
Role: pirate
> AI: Arr, the core values of Focused Labs be Love Your Craft, Listen First, and Learn Why, matey! ✅
> Sources: ...

Question: Tell me more about the first one.
> AI: Arr, the first one be talkin' about the importance of reachin' the "Red" stage in Test Driven Development... ✅
> Sources: ...

2. Creating a Visual Representation

I created a diagram of the app to visualize the process flow.

I began at the end of my flow and worked backward to identify issues. I first checked whether my LLM was answering questions based on the provided context. Upon inspecting the sources, I realized that the given context was a blog on TDD.

> Sources: [{'URL': 'https://focusedlabs.io/blog/tdd-first-step-think'}, ...]

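The answer faithfully reflected the context it was given, so I ruled out the answer component as the source of the bug. The problem had to be upstream, where the wrong documents were being retrieved.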

3. Tracing the Bug's Origin

Next, I examined the logic for retrieving documents. I added a standalone_question key to the output of each downstream chain step so I could log its runtime value, which revealed that follow-up questions were being rephrased incorrectly.

💡 Adding these keys to the chains allows us to log the values seen by the components at runtime. A breakpoint in the chain-building code only shows the chain as it is constructed, not the values that flow through it at run time.
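The idea can be shown without LangChain at all. Below is a minimal sketch in plain Python, assuming each chain step is just a function from dict to dict; `tap`, `run_pipeline`, and the two toy steps are hypothetical names used only for illustration. In an LCEL chain, a function like the one `tap` returns could be wrapped in `RunnableLambda` and piped between steps.

```python
from typing import Any, Callable, Dict, List

Step = Callable[[Dict[str, Any]], Dict[str, Any]]


def tap(label: str, sink: List[str]) -> Step:
    """Build a pass-through step that records what flows through it at run time."""
    def _tap(values: Dict[str, Any]) -> Dict[str, Any]:
        sink.append(f"{label}: {values}")
        return values  # unchanged, so downstream steps are unaffected
    return _tap


def run_pipeline(steps: List[Step], values: Dict[str, Any]) -> Dict[str, Any]:
    """Apply each step in order, feeding one step's output to the next."""
    for step in steps:
        values = step(values)
    return values


# Toy stand-ins for the real question and retrieval steps.
def condense(values: Dict[str, Any]) -> Dict[str, Any]:
    return {"standalone_question": values["question"].strip()}


def retrieve(values: Dict[str, Any]) -> Dict[str, Any]:
    return {**values, "docs": []}


log: List[str] = []
result = run_pipeline(
    [condense, tap("after condense", log), retrieve, tap("after retrieve", log)],
    {"question": " Tell me more about the first one. "},
)
for line in log:
    print(line)
```

Because the tap returns its input untouched, it can be added or removed between any two steps without changing what the pipeline computes.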

# Code Snippet with added keys
def retrieve_documents_chain(vector_store):
    retriever = vector_store.as_retriever()
    return {
        # ... existing keys unchanged ...
        "standalone_question": itemgetter("standalone_question"),  # Added
    }

def create_answer_chain():
    final_inputs = {
        # ... existing keys unchanged ...
        "standalone_question": itemgetter("standalone_question"),  # Added
    }
    return {
        # ... existing keys unchanged ...
        "standalone_question": itemgetter("standalone_question"),  # Added
    }

I expected the standalone_question for the follow-up to be more specific, like “What can you tell me about the core value of Love Your Craft?”

Question: What are the Focused Labs core values?
> standalone_question: What are the core values of Focused Labs? ✅

Question: Tell me more about the first one.
> standalone_question: What can you tell me about the first one? ❌

4. Identifying the Exact Source

I focused on the chat_history variable, suspecting an issue with how the chat history was being recognized.

def retrieve_documents_chain(vector_store):
    retriever = vector_store.as_retriever()
    return {
        # ... existing keys unchanged ...
        "standalone_question": itemgetter("standalone_question"),  # Added
        "chat_history": itemgetter("chat_history"),  # Added
    }

def create_answer_chain():
    final_inputs = {
        # ... existing keys unchanged ...
        "standalone_question": itemgetter("standalone_question"),  # Added
        "chat_history": itemgetter("chat_history"),  # Added
    }
    return {
        # ... existing keys unchanged ...
        "standalone_question": itemgetter("standalone_question"),  # Added
        "chat_history": itemgetter("chat_history"),  # Added
    }

Question: What are the Focused Labs core values?

Question: Tell me more about the first one.
> chat_history: [] ❌

🔔 Found the issue! The chat_history was empty, so the memory wasn’t being loaded as I had assumed.
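To see why an empty history produces a vague standalone question, it helps to look at the text the condensing step would receive. The sketch below is plain Python under stated assumptions: `CONDENSE_TEMPLATE` is a made-up template (the real `CONDENSE_QUESTION_PROMPT` is not shown in this post), and `buffer_string` is a simplified stand-in for LangChain's `get_buffer_string`.

```python
from typing import List, Tuple

# Hypothetical template; not the project's actual CONDENSE_QUESTION_PROMPT.
CONDENSE_TEMPLATE = (
    "Given the conversation below, rephrase the follow-up as a standalone question.\n"
    "Chat history:\n{chat_history}\n"
    "Follow-up: {question}"
)


def buffer_string(messages: List[Tuple[str, str]]) -> str:
    """Flatten (speaker, text) pairs into one 'Speaker: text' line per message."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in messages)


def render(messages: List[Tuple[str, str]], question: str) -> str:
    """Fill the template with the flattened history and the follow-up question."""
    return CONDENSE_TEMPLATE.format(
        chat_history=buffer_string(messages), question=question
    )


follow_up = "Tell me more about the first one."

# What the buggy chain effectively sent: nothing for "the first one" to refer to.
print(render([], follow_up))

# What it should have sent once memory loads correctly.
history = [
    ("Human", "What are the Focused Labs core values?"),
    ("AI", "Love Your Craft, Listen First, and Learn Why"),
]
print(render(history, follow_up))
```

With an empty history, the model can only restate the follow-up, which matches the unhelpful standalone_question seen earlier.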

5. Implementing the Solution

I resolved the issue by checking my conversation memory store. Because the store is a dict, lookups are sensitive to the type of the key. I saved messages under a str version of session_id but invoked the chain with an Optional[UUID] version, so each lookup missed the saved entry and created a fresh, empty memory. The store itself was set up correctly; I needed to update how I invoked my chain.

result = chain.invoke({"question": question, "session_id": session_id, "role": role})

Therefore, I updated the session_id type to str.

result = chain.invoke({"question": question, "session_id": str(session_id), "role": role})
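The mismatch is easy to reproduce with a plain dict. This is a minimal sketch, not the project's code: `store` here holds lists instead of `ConversationBufferMemory` objects, and `get_history` is a hypothetical helper that normalizes the key at the lookup boundary so callers can pass either type.

```python
from typing import Dict, List, Union
from uuid import UUID, uuid4

store: Dict[str, List[str]] = {}

session_id: UUID = uuid4()

# Messages were saved under the string form of the id.
store[str(session_id)] = ["What are the Focused Labs core values?"]

# A UUID and its string form are different dict keys.
print(session_id in store)       # False
print(str(session_id) in store)  # True


def get_history(session_id: Union[UUID, str]) -> List[str]:
    """Look up history by string key, creating an empty entry if none exists."""
    key = str(session_id)
    if key not in store:
        store[key] = []
    return store[key]


# Either form now reaches the same entry.
print(get_history(session_id) == get_history(str(session_id)))  # True
```

Normalizing inside the lookup is an alternative to converting at every call site; which one fits better depends on how many places invoke the chain.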

6. Confirming the Fix

I confirmed that the conversation memory now functioned correctly.

Question: What are the Focused Labs core values?

Question: Tell me more about the first one.
> chat_history: ['What are the Focused Labs core values?'] ✅
> standalone_question: Can you provide more information about the first core value: Love Your Craft? ✅
> AI: This value means that we are passionate about being the best at what we do, paying attention to every detail... ✅
> Sources: [{'URL': 'https://www.notion.so/Who-are-we-c42efb179fa64f6bb7866deb363fb7ef'}, ...] ✅

7. Final Cleanup and Future-Proofing

I reverted the temporary pirate-accent debugging prompt that I had used to make the role feature easy to spot.

I decided to maintain detailed logging within the system for future debugging efforts.

Key Takeaways

Debugging AI Systems: A mix of traditional and AI-specific debugging techniques is essential.
Opting for Cost-Effective Models: Use more affordable models to reduce costs during repeated queries.
Importance of Transparency: Clear visibility into each step and component of your RAG accelerates debugging.
Type Consistency: Paying attention to small details, like variable types, can significantly impact functionality.

Thanks for reading!

Stay tuned for more insights into the world of software engineering and AI. Have questions or insights? Feel free to share them in the comments below!
