GPT Pilot - a dev tool that writes 95% of coding tasks

An MVP for a scalable dev tool that writes production-ready apps from scratch as the developer oversees the implementation

In this blog post, I will explain the tech behind GPT Pilot - a dev tool that uses GPT-4 to code an entire, production-ready app.

The main premise is that AI can now write most of the code for an app, even 95%.

That sounds great, right?

Well, an app won’t work unless all the code works completely. So, how do you make that happen? This post is part of my research project to see if AI can really do 95% of developers' coding tasks. I decided to use GPT-4 to make a tool that writes scalable apps with the developer's oversight.

I will show you the main idea behind GPT Pilot, the crucial concepts it's built upon, and its workflow up until the coding part.

Currently, it’s in an early stage and can create only simple web apps. Still, I will cover the entire concept of how it can work at scale and demonstrate how much of the coding tasks AI can do while the developer acts as a tech lead overseeing the entire development process.

Here are some example apps that I created with it:

Ok, let's dive in.

How does GPT Pilot work?

First, you enter a description of an app you want to build. Then, GPT Pilot works with an LLM (currently GPT-4) to clarify the app requirements, and finally, it writes the code. It uses many AI Agents that mimic the workflow of a development agency.

After you describe the app, the Product Owner Agent breaks down the business specifications and asks you questions to clear up any unclear areas.
Then, the Software Architect Agent breaks down the technical requirements and lists the technologies that will be used to build the app.
Then, the DevOps Agent sets up the environment on the machine based on the architecture.
Then, the Tech Team Lead Agent breaks down the app development process into development tasks where each task needs to have:
- Description of the task (this is the main description upon which the Developer agent will later create code)
- Description of automated tests that will need to be written so that GPT Pilot can follow TDD principles
- Description for human verification, which is basically how you, the human developer, can check if the task was successfully implemented
Finally, the Developer and the Code Monkey Agents take tasks one by one and start coding the app. The Developer breaks down each task into smaller steps, which are lower-level technical requirements that might not need to be reviewed by a human or tested with an automated test (eg. install some package).
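
To make the shape of a development task more concrete, here is a minimal sketch in Python. It is only an illustration of the three parts listed above, not GPT Pilot's actual data model: the names `DevelopmentTask`, `AGENT_PIPELINE`, and `missing_parts` are all hypothetical.

```python
from dataclasses import dataclass, field

# Order of the agents as described in this post (illustrative only).
AGENT_PIPELINE = [
    "Product Owner",
    "Software Architect",
    "DevOps",
    "Tech Team Lead",
    "Developer",
    "Code Monkey",
]


@dataclass
class DevelopmentTask:
    """Hypothetical container for what the Tech Team Lead hands over."""

    description: str          # what the Developer agent will code from
    automated_tests: str      # description of the tests to write (TDD)
    human_verification: str   # how the human developer can check the result
    steps: list[str] = field(default_factory=list)  # smaller steps, filled in later


REQUIRED_PARTS = ("description", "automated_tests", "human_verification")


def missing_parts(task: DevelopmentTask) -> list[str]:
    """Return the names of required parts that are empty."""
    return [name for name in REQUIRED_PARTS if not getattr(task, name).strip()]
```

A check like `missing_parts` is one way a tool could refuse to start coding a task until all three descriptions are present.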

In the next blog post, I will write in more detail about how Developer and Code Monkey work (here's a sneak peek diagram that shows the coding workflow), but now, let's see the main pillars upon which the GPT Pilot is built.

3 Main Pillars of GPT Pilot

I call these the pillars because, since this is a research project, I want to be reminded of them as I work on GPT Pilot. I want to explore how much AI can do to boost developers' productivity, so every improvement I make needs to lead to that goal rather than toward something that writes simple, fully working apps but doesn't work at scale.

Pillar #1. Developer needs to be involved in the process of app creation

As I mentioned above, I think that we are still far away from an LLM that can just be hooked up to a CLI and create any app by itself. Nevertheless, GPT-4 works amazingly well when writing code. I use ChatGPT all the time to speed up my development process - especially when I need to work on some new technology or an API or if I need to create a standalone script. The first time I realized how powerful it can be was a couple of months ago when it took me 2 hours with ChatGPT to create a Redis proxy that would usually take 20 hours to develop from scratch. I wrote a whole post about that here.

So, to enable AI to generate a fully working app, we need to allow it to work closely with the developer who oversees the development process and acts as a tech team lead while AI writes most of the code. So, the developer needs to be able to change the code at any moment, and GPT Pilot needs to continue working with those changes (eg. add an API key or fix an issue if an AI gets stuck).

Here are the areas in which the developer can intervene in the development process:

After each development task is finished, the developer should review it and make sure it works as expected (this is the point where you would usually commit the latest changes)
After each failed test or command run - it might be easier for the developer to debug something (eg. if a port on your machine is reserved but the generated app is trying to use it - then you need to hardcode some other port)
If the AI doesn't have access to an external service - eg. in case you need to get and add an API key to the environment

Pillar #2. The app needs to be coded step by step

Let's say you want to create a simple app, and you know everything you need to code and have the entire architecture in your head. Even then, you won't code it out entirely, then run it for the first time and debug all the issues at once. Instead, you will split the app development into smaller tasks, implement one (like add routes), run it, debug, and then move on to the next task. This way, you can fix issues as they arise.

The same should apply when AI writes the code.

Like a human, it will make mistakes for sure, so for it to have an easier time debugging and for the developer to understand what is happening in the generated code, the AI shouldn't just spit out the entire codebase at once. Instead, the app should be generated and debugged step by step just like a developer would do - e.g. set up routes, add a database connection, etc.

Other code generators like Smol Developer and GPT Engineer work differently: you write a prompt about the app you want to build, and they try to code the entire app and give you the whole codebase at once. While AI is great, it's still far from coding a fully working app on the first try, so these tools give you a codebase that is hard to get into and, more importantly, much harder to debug.

I think that if GPT Pilot creates the app step by step, both AI and the developer overseeing it will be able to fix issues more easily, and the entire development process will flow much more smoothly.
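
The step-by-step idea can be sketched as a simple loop. This is a rough illustration under my own assumptions, not the real implementation: `build_step_by_step`, `implement`, and `check` are hypothetical names, where `implement` stands in for the AI writing code for one task and `check` stands in for running the app or its tests.

```python
from typing import Callable


def build_step_by_step(
    tasks: list[str],
    implement: Callable[[str, int], None],
    check: Callable[[str], bool],
    max_attempts: int = 3,
) -> list[str]:
    """Implement tasks one at a time and verify each before moving on."""
    completed: list[str] = []
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            implement(task, attempt)   # write (or rewrite) code for this task only
            if check(task):            # run it right away, while the change is small
                completed.append(task)
                break
        else:
            # The AI is stuck: stop here so the developer can step in.
            raise RuntimeError(f"Task {task!r} needs developer intervention")
    return completed
```

Because each task is checked before the next one starts, a failure always points at the most recent, small change instead of the whole codebase.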

Pillar #3. GPT Pilot needs to be scalable

GPT Pilot has to be able to create large production-ready apps, not only small apps where the entire codebase can fit into the LLM context. The problem is that all learning that an LLM has is done in-context. Maybe one day, the LLM could be fine-tuned for each specific project, but right now, it seems like that would be a very slow and redundant process.

The way that GPT Pilot addresses this issue is with context rewinding, recursive conversations, and TDD.

Context rewinding

The idea behind context rewinding is relatively simple - for solving each development task, the context size of the first message to the LLM has to stay roughly the same. For example, the context size of the first LLM message while implementing development task #5 has to be more or less the same as the first message while developing task #50. Because of this, the conversation needs to be rewound to the first message upon each task.

For GPT Pilot to solve tasks #5 and #50 in the same way, it has to understand what has been coded so far along with the business context behind all code that's currently written so that it can create the new code only for the task that it's currently solving and not rewrite the entire app.

I'll go deeper into this concept in the next blog post, but essentially, when GPT Pilot creates code, it makes the pseudocode for each code block that it writes, as well as descriptions for each file and folder it needs to create. So, when we need to implement each task, in a separate conversation, we show the LLM the current folder/file structure; it selects only the code that is relevant to the current task, and then, we add only that code to the original conversation that will write the actual implementation of the task.
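
Here is a minimal sketch of that idea, assuming we keep a short description for every file. Everything in it is hypothetical: `select_relevant` uses plain keyword overlap as a stand-in for the separate LLM conversation that picks the relevant code, and `first_message` shows how the opening prompt for a task could be rebuilt from scratch each time.

```python
def select_relevant(file_descriptions: dict[str, str], task: str) -> list[str]:
    """Pick files whose description shares a word with the task.

    Stand-in for the separate LLM conversation; a real tool would ask
    the model instead of matching keywords.
    """
    task_words = set(task.lower().split())
    return sorted(
        path
        for path, description in file_descriptions.items()
        if task_words & set(description.lower().split())
    )


def first_message(
    app_summary: str,
    task: str,
    file_descriptions: dict[str, str],
    file_contents: dict[str, str],
) -> str:
    """Build the first message for a task from a rewound (fresh) context."""
    parts = [f"App: {app_summary}", f"Task: {task}"]
    for path in select_relevant(file_descriptions, task):
        # Only the code relevant to this task is added to the conversation.
        parts.append(f"--- {path} ---\n{file_contents[path]}")
    return "\n".join(parts)
```

Since the message is rebuilt from the app summary, the task, and only the selected files, its size depends on the task at hand rather than on how many tasks came before it.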

Recursive conversations

Recursive conversations are conversations with the LLM that are set up in a way that they can be used "recursively". For example, if GPT Pilot detects an error, it needs to debug it, but let's say that another error happens during the debugging process. Then, GPT Pilot needs to stop debugging the first issue, fix the second one, and then get back to fixing the first issue. This is a crucial concept that, I believe, needs to work to make AI build large and scalable apps. It works by rewinding the context and explaining each error in the recursion separately. Once the deepest level error is fixed, we move up in the recursion and continue fixing errors until the entire recursion is completed.
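
The recursion described above can be sketched like this. It is an illustration under my own assumptions, not GPT Pilot's code: `resolve` and `attempt_fix` are hypothetical names, and `attempt_fix` stands in for one debugging conversation that returns `None` when the error is fixed or a new error that came up while fixing it.

```python
from typing import Callable, Optional


def resolve(
    error: str,
    attempt_fix: Callable[[str], Optional[str]],
    trail: list[str],
    depth: int = 0,
    max_depth: int = 5,
    max_retries: int = 3,
) -> None:
    """Fix `error`; if a new error appears while fixing, fix that one first."""
    if depth > max_depth:
        raise RuntimeError("Recursion too deep; hand over to the developer")
    trail.append(f"start:{error}")
    for _ in range(max_retries):
        # Each call would start from a rewound context that explains only this error.
        nested = attempt_fix(error)
        if nested is None:
            trail.append(f"fixed:{error}")
            return
        # Pause this error, go one level deeper, then come back and retry.
        resolve(nested, attempt_fix, trail, depth + 1, max_depth, max_retries)
    raise RuntimeError(f"Could not fix {error!r}; hand over to the developer")
```

The depth and retry limits are my additions: without them a stubborn error could loop forever, and stopping is the natural point for the developer to take over.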

TDD (Test Driven Development)

For GPT Pilot to scale the codebase, improve it, change requirements, and add new features, it needs to be able to create new code without breaking previously written code. There is no better way to do this than working with TDD methodology. For all code that GPT Pilot writes, it needs to write tests that check if the code works as intended so that whenever new changes are made, all regression tests can be run to check if anything breaks.
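
As a small illustration of why regression tests matter here, the sketch below runs every previously written test after a change and reports which ones broke. All names (`run_regression`, `make_tests`, `slugify_v1`, `slugify_v2`) are hypothetical examples of mine, not part of GPT Pilot.

```python
from typing import Callable


def run_regression(tests: dict[str, Callable[[], None]]) -> list[str]:
    """Run every test and return the names of the ones that fail."""
    failures = []
    for name, test in tests.items():
        try:
            test()
        except AssertionError:
            failures.append(name)
    return failures


def slugify_v1(title: str) -> str:
    """Code written for an earlier task."""
    return title.lower().replace(" ", "-")


def slugify_v2(title: str) -> str:
    """A later change that accidentally drops the lowercasing."""
    return title.replace(" ", "-")


def make_tests(slugify: Callable[[str], str]) -> dict[str, Callable[[], None]]:
    """Tests written alongside the original code, reused for every later change."""

    def test_lowercases() -> None:
        assert slugify("Hello World") == "hello-world"

    def test_replaces_spaces() -> None:
        assert slugify("a b") == "a-b"

    return {
        "test_lowercases": test_lowercases,
        "test_replaces_spaces": test_replaces_spaces,
    }
```

Running `run_regression(make_tests(slugify_v2))` flags `test_lowercases`, which tells both the AI and the developer exactly which earlier behavior the new change broke.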

I'll go deeper into these three concepts in the next blog post, in which I'll break down the entire development process of GPT Pilot.

Next up

In this first blog post, I discussed the high-level overview of how GPT Pilot works. In posts 2 and 3, I show you:

How Developer and Code Monkey agents work together to implement code (write new files or update existing ones).
How recursive conversations and context rewinding work in practice.
How we can rewind the app development process and restore it from any development step.
How all agents are structured - as you might've noticed, there are many GPT Pilot agents, and some might be redundant right now, but I think these agents will evolve over time, so I wanted to build agents in a modular way, just like the regular code.

But while you are waiting, head over to GitHub, clone the GPT Pilot repository, and experiment with it. Let me know if you are successful, and while you’re there, don’t forget to star the repo - it would mean a lot to me.

Thank you for reading 🙏, and I’ll see you in the next post!

If you have any feedback or ideas, please let me know in the comments or email me at zvonimir@pythagora.ai, and if you want to get notified when the next blog post is out, you can add your email here.
