My Journey Building DebugGPT: 5 Lessons Learned with ChatGPT & Prompt Management
Eric Dattore
Posted on December 14, 2023
During September 2023 I worked on a new open source tool called DebugGPT. DebugGPT is a Chrome extension that leverages ChatGPT to translate error messages in the Chrome console into developer-friendly, actionable explanations. On top of this, it can translate them into the user's language of choice.
This was the first time I had built a feature around an LLM, and in this blog post I will share five key things I learned about ChatGPT by building DebugGPT.
Background: For context, I am a software engineer with seven years of experience doing cloud cost optimization work.
Up until this project I had played around a little with the ChatGPT app here and there, but I had never used the API. My knowledge of LLMs and machine learning is fairly minimal, but I was eager to try building a project around an LLM API.
1. Discovering prompt manipulation
You'll need to use a system prompt (or system message) to prime the LLM for what you want from it. The API documentation was really useful in this regard, and my team and I got started very quickly. Essentially, you design an instruction and refine it through trial and error until you are confident the output meets your criteria across the range of scenarios you have imagined.
For DebugGPT we send two elements to OpenAI in the body of the request. The first is the prompt, which I will detail below, and the second is the error message we are asking ChatGPT to translate into a human-readable explanation.
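To make this concrete, here is a minimal sketch of what such a request body can look like, assuming the Chat Completions format: a system message carrying the instruction and a user message carrying the raw console error. The model name and exact field values are assumptions; the prompt itself is detailed in the subsections below.

```javascript
// Sketch of the request body: the system message carries the instruction
// (detailed below), the user message carries the error from the Chrome console.
const requestBody = {
  model: "gpt-3.5-turbo", // assumption: the model DebugGPT targets may differ
  messages: [
    {
      role: "system",
      content:
        "Your job is to accept messages asking for assistance understanding " +
        "error messages and return an assessment of the error.",
    },
    {
      role: "user",
      content: "Uncaught ReferenceError: foo is not defined", // example console error
    },
  ],
};
```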
We didn't have to iterate too many times to reach the outcome we wanted. The significant milestones were: prompting the core instruction, staging the LLM role, adding the requirement for actionable steps, and, last but not least, asking the LLM for translations when needed.
a. Prompting the core instruction
The original prompt was fairly simple:
Your job is to accept messages asking for assistance understanding error messages and return an assessment of the error.
b. Staging the LLM role
One thing I have learned about designing instructions for an LLM is the importance of staging. Staging consists of introducing the context in which you want the LLM to act and explaining the role and responsibilities it is assigned. This has a great influence on the quality and consistency of the responses you will get from it.
Here is how we decided to stage the LLM by giving it context for what was expected of it.
You are an expert programmer and debugger. Your job is to accept messages asking for assistance understanding error messages and return an assessment of the error.
c. Adding the requirement for actionable steps
We noticed that the LLM would sometimes return the best next steps to guide users toward solving their issue, and sometimes not. For example, for an undefined variable I could click Enhance and get, for the same message, either "this variable is not defined. You should look into your codebase" or "the variable is not defined". Since the first option is more helpful, I looked into ways to make it systematic. I wanted to correct this indeterminism in the output and force the LLM's hand a little.
This is what I came up with:
You are an expert programmer and debugger. Your job is to accept messages asking for assistance understanding error messages and return a detailed assessment of the error with possible next steps.
d. Asking for translations
As we were progressing with the app development, I thought of adding translations to make the console more accessible to people who might not speak English. To bring languages into the mix, I first researched which languages OpenAI supports best and found that it does not excel at all languages because of the limited datasets available to train the LLM on.
I settled on the following languages: English, Chinese, Arabic, German, French, Hindi, Italian, Japanese, Korean, Portuguese, Spanish, and Thai. Regardless of the information found online, we added a warning that these translations are AI generated, and we tested the quality in some of the languages my team members and I speak (namely French and Spanish).
In terms of prompt management, this meant appending something along the lines of "please answer in this language" at the end of the prompt, but I realized that once in a while the AI would simply forget that last part and not translate its output. We consequently turned the prompt around and put the language instruction in the core of the message:
You are an expert programmer and debugger. Your job is to accept messages asking for assistance understanding error messages and return a detailed ${lang} assessment of the error with possible next steps.
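In code, this boils down to interpolating the chosen language into the system prompt with a template literal. A small sketch, where the function and variable names are illustrative rather than DebugGPT's actual code:

```javascript
// Build the system prompt with the user's chosen language baked into the core
// of the instruction instead of appended at the end.
function buildSystemPrompt(lang) {
  return (
    "You are an expert programmer and debugger. Your job is to accept " +
    "messages asking for assistance understanding error messages and " +
    `return a detailed ${lang} assessment of the error with possible next steps.`
  );
}

console.log(buildSystemPrompt("French"));
```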
Moving forward, I think I could tune that prompt a little more, but I would need user feedback to identify clear areas of improvement. Empirically, what we have been getting back from ChatGPT has met our acceptance criteria.
2. Testing and monitoring quality is not the easiest part
The non-deterministic nature of LLMs really stands out compared to standard software development. I had not fully grasped until now how quality control for LLM-enabled solutions is going to be the next big challenge. My approach was fairly manual: testing things and hoping I would not run into weird edge cases. One improvement I will look into further down the road is a more formal way of testing, with a registry of the tests that have been performed. I definitely think there is room for some automated testing, and what I would research next is a tool where you could take a few different error messages, slightly vary the system prompt, and monitor whether the output appreciably changes (see the sketch below). It would require a lot of maintenance and iterative work though, which is not ideal for an open source project like this.
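As a rough illustration of what such a harness could look like, here is a sketch that runs a handful of sample errors against several prompt variants so the outputs can be compared side by side. It is not part of DebugGPT; the `enhance` callback is assumed to wrap the OpenAI call (for example the thin client sketched in point 4).

```javascript
// Hypothetical harness: run each sample console error against each prompt
// variant and print the results for manual comparison.
const sampleErrors = [
  "Uncaught ReferenceError: foo is not defined",
  "Uncaught TypeError: Cannot read properties of undefined (reading 'map')",
];

const promptVariants = [
  "You are an expert programmer and debugger. Your job is to accept messages asking for assistance understanding error messages and return an assessment of the error.",
  "You are an expert programmer and debugger. Your job is to accept messages asking for assistance understanding error messages and return a detailed assessment of the error with possible next steps.",
];

// `enhance(systemPrompt, errorMessage)` should return a Promise resolving to
// the model's explanation; it is passed in so the sketch stays API-agnostic.
async function comparePrompts(enhance) {
  for (const error of sampleErrors) {
    for (const prompt of promptVariants) {
      const output = await enhance(prompt, error);
      console.log(`prompt: ${prompt.slice(0, 60)}...`);
      console.log(`error:  ${error}`);
      console.log(`output: ${output}\n`);
    }
  }
}
```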
We are not doing any formal monitoring of the quality of the LLM outputs for DebugGPT at the moment. For the time being, the only way to monitor whether people are satisfied with the tool is through issues in the project repo or comments in the Chrome extension marketplace. The marketplace also provides some basic active-usage figures to get a feel for adoption.
Of course tracking inputs and outputs was out of the question for obvious privacy/ethical reasons.
One way to approach this would be a feedback call-to-action button in the console view where people can copy and paste the original error message and the ChatGPT output they received (if they feel comfortable doing so) or write a message to us by hand (if they don't).
Another potential improvement would be to give users a way to look back at their DebugGPT history. This would make it easier for them to retrieve a message explanation (and potentially share it with us if they were not satisfied with it).
For now I rely completely on the developer community to open an issue on GitHub if they run into any problem.
3. The magic of an LLM is very dependent on the UX you wrap around it
In retrospect, I realize that features and solutions leveraging LLMs depend as much on the UX designed to deliver them to users as on the magic of the LLM itself. The biggest question for DebugGPT was how to put these "enhanced" error messages in the hands of developers. I looked for a way that stays within the developer workflow.
While trying to validate my idea for DebugGPT I looked for similar tools but did not find any solution that lives in the Chrome console like ours does. That is why I decided to create DebugGPT.
4. Playing with the API was easier than expected
The OpenAI API documentation is actually pretty nice and includes some useful code snippets, so I quickly understood what kind of requests I needed to make (see point 1 for prompt management).
On top of this, there are already a few libraries available for communicating with the API, covering an interesting range of use cases (though I did not use them). At the time we built the OpenAI API client, the Chrome extension was written in ES6 without a build step, so the challenge was writing a thin client that could take the API key and send a simple request to ChatGPT. We have since transitioned to React, which requires a build step, so we could switch to an existing OpenAI client in the future, but our simple ES6 client still works well for the tool.
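For reference, here is a minimal sketch of what such a thin, dependency-free client can look like, using only the browser's fetch. The function name, model, and error handling are illustrative assumptions, not DebugGPT's actual implementation.

```javascript
// Minimal ES6 client: take the API key, send one chat completion request to
// OpenAI with the system prompt and the console error, return the explanation.
async function enhanceError(apiKey, systemPrompt, errorMessage) {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo", // assumption: the model may differ
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: errorMessage },
      ],
    }),
  });

  if (!response.ok) {
    throw new Error(`OpenAI API request failed with status ${response.status}`);
  }

  const data = await response.json();
  return data.choices[0].message.content;
}
```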
Conclusion
For now I am seeking as much feedback as I can get so I can keep improving DebugGPT to fit developers' day-to-day debugging in the browser, and maybe add more capabilities.
If you use DebugGPT and notice any inconsistency in the output, please share as many details as you can through a GitHub issue so that we can keep improving the prompt.
If the enhanced message is not quite right or seems weird, you can paste the output and the error message if you are comfortable with that, or write up whatever is of particular interest to you.
If some core functionality of the extension is not working the way you would expect, please share that as well.
If you want to take a crack at improving our prompt we are happy to review pull requests for that too.
Stack, tools and libraries I used for this project
The ChatGPT API client we built into the Chrome extension uses ES6 and has no external dependencies outside of the standard browser JavaScript APIs (i.e. fetch).
We transitioned from a no-build JavaScript Chrome extension to a small React front end with a build step. We pulled in React for the different views and some of the rendering.
For securing the OpenAI API key we used:
Chrome's API for extension local storage
The Crypto.js library to encrypt the key (more about this in this blog post about the challenges of token management we ran into); a minimal sketch of this pattern is shown below
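For illustration only, here is a sketch of that pattern: encrypt the key with Crypto.js before persisting it in extension local storage, and decrypt it on the way out. It assumes the Manifest V3 promise-based storage API and that CryptoJS is already loaded; the storage key name and passphrase handling are assumptions, not DebugGPT's actual scheme.

```javascript
// Illustrative only: encrypt the OpenAI key with CryptoJS and keep the
// ciphertext in chrome.storage.local; decrypt it when the extension needs it.
function saveApiKey(apiKey, passphrase) {
  const encrypted = CryptoJS.AES.encrypt(apiKey, passphrase).toString();
  return chrome.storage.local.set({ openaiApiKey: encrypted });
}

async function loadApiKey(passphrase) {
  const { openaiApiKey } = await chrome.storage.local.get("openaiApiKey");
  if (!openaiApiKey) return null;
  const bytes = CryptoJS.AES.decrypt(openaiApiKey, passphrase);
  return bytes.toString(CryptoJS.enc.Utf8);
}
```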
DebugGPT is a relatively portable piece of code that somebody could rip out and use somewhere else without pulling in another dependency.
The repo of the extension is available here: https://github.com/CircleCI-Public/debug-gpt