The Future of Terraform: ClickOps

Quarterly questions

Every now and then it's important to step back from what we're doing and think about the future. At Terrateam, we like to ask a question each quarter to get our gears turning. This quarter we asked:

What will Infrastructure as Code (IaC) look like in five years?

Infrastructure is slow to change

Software infrastructure does not move quickly. While greenfield projects may be able to take advantage of the latest trends, once software is made and in production, the lower bits don't change that much.

It is difficult to migrate an existing project from DynamoDB to PostgreSQL or from JavaScript to Rust. The number of banks still running on mainframes with COBOL might make you shudder.

If the Oxide podcasts are any indication, things get more ossified further down the stack. In many cases, we prefer to add a layer to existing infrastructure to give us the new functionality we want rather than rewriting something from scratch.

More people creating and modifying infrastructure is good

More and more people are encoding their infrastructure in some way that can be tracked, audited, destroyed, and recreated. Terraform is the biggest contender along with Pulumi and the AWS CDK. These tools provide a way to provision and manage infrastructure. And while they are very powerful, they can be daunting to learn. There is a lot of desire to break the traditional silos of development and operations, empowering software engineers to manage their own infrastructure, but in many cases it's too much to ask them to become proficient in a whole new ecosystem.

It's not uncommon for organizations to use templates to simplify the process. Existing TACOS (Terraform automation and collaboration software) tools support this, and tools like Backstage are used for it as well. Want to create a database and some compute? Fill out a form and the code will be generated. The learning curve is minimal. The downside of templates is that they only work for creation; they are not much use for modifying existing code. As the project grows, development slows down because the team still has to learn Terraform, AWS CDK, or Pulumi. ClickOps is the idea of providing GUIs for manipulating infrastructure where the output is code, such as Terraform HCL.

There are a lot of ways that ClickOps can be achieved, but we think that any ClickOps solution should use HCL as its source and destination language. ClickOps tools can integrate into the existing rich ecosystem of Terraform tooling, and advanced users can read and write Terraform code by hand if they wish. With ClickOps, we can maximize the number of people who can comfortably manage infrastructure.

CDKTF, AWS CDK, and Pulumi are steps in the wrong direction

According to HashiCorp:

HCL is a toolkit for creating structured configuration languages that are both human- and machine-friendly, for use with command-line tools. Although intended to be generally useful, it is primarily targeted towards devops tools, servers, etc.

Whether or not one thinks HashiCorp's HCL is good at expressing infrastructure, it is, at the very least, a limited language, which makes it hard to write convoluted code. HCL is closer to configuration languages like YAML or INI than to a programming language.

On the other hand, the AWS CDK and Pulumi allow one to write infrastructure in general-purpose languages, such as TypeScript, Go, or Python. This gives a lot more power to the user but also means the code representing their infrastructure can be quite complex. HashiCorp is working on CDKTF to compete with AWS CDK and Pulumi.

One perceived benefit of CDKs is that they allow multiple languages to be used in managing infrastructure, such as TypeScript, Go, Python, or Java. This has downsides as well. Documentation explodes as each language needs to be covered. APIs for each language can be auto-generated, but auto-generated APIs can feel awkward, so native APIs need to be maintained. There is a lot of magic going on to make these general-purpose languages generate declarative infrastructure, which makes writing code for a CDK quite a different experience. While it might be TypeScript that one is writing, the way one can use it is going to be different from what one is familiar with.

The number of places that a bug can be encountered goes from just the Terraform CLI to now all the machinery between the CDK and Terraform. Finding help becomes a challenge: is one trying to resolve a Python issue or a CDK issue? What if you have a bug in your TypeScript CDK and the answer found online is in response to a person asking in Go? Can you even understand the answer if it requires details of that language?

HashiCorp has tried to keep up with AWS CDK and Pulumi with CDKTF. The way CDKTF works is that the CDK code is compiled to a program which, when run, generates Terraform configuration in JSON format. CDKTF then runs the Terraform CLI on the generated configuration. For those who worked with JavaScript before source maps were a thing: compiling one language to another is a great way to solve some problems, but it makes debugging difficult without a lot of tooling that can translate an issue in the compiled code back to the source code.
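
To make the "program that generates configuration" idea concrete, here is a minimal sketch in plain Python. It does not use the CDKTF API; the `synthesize` function and the bucket names are hypothetical stand-ins, and the only real detail it relies on is Terraform's JSON configuration syntax, which Terraform reads from files ending in .tf.json.

```python
import json


def synthesize(bucket_names):
    """Build a Terraform JSON configuration with one S3 bucket per name.

    Hypothetical stand-in for what a CDK's synthesis step does; this is
    not the CDKTF API.
    """
    resources = {}
    for name in bucket_names:
        # Underscores in the label are just a naming convention chosen
        # for this sketch.
        label = name.replace("-", "_")
        resources[label] = {"bucket": name}
    return {"resource": {"aws_s3_bucket": resources}}


if __name__ == "__main__":
    # A loop in a general-purpose language expands into static configuration.
    config = synthesize(["logs-testing", "logs-production"])
    # Writing this output to a .tf.json file lets the Terraform CLI read it.
    print(json.dumps(config, indent=2))
```

Any error Terraform reports will refer to the generated JSON, not to the loop that produced it, which is the debugging gap described above.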

The multi-language approach makes creating tooling more challenging. A lot of Terraform tooling has been built that consumes .tf files: linters, security scanners, cost estimators, and so on. The CDKs make this a lot more difficult. Not all of these tools can read HCL in JSON format, and even if they can, the developer experience of looking at an issue in generated code is not ideal.

Platform engineering is trying to deliver the self-service tools teams want to consume to rapidly deploy all components of software. While it may sound like a TypeScript developer would feel more empowered by writing their infrastructure in TypeScript, the reality is that it's a significant undertaking to learn to use these tools properly when all one wants to do is create or modify a few resources for their project. This is also a common source of technical debt and fragility. Most users will probably learn the minimal amount they need to in order to make progress in their project, and oftentimes this may not be the best solution for the longevity of a codebase.

These tools are straddling an awkward line that is optimized for no one. Traditional DevOps engineers are not software engineers, and software engineers are not DevOps engineers. Making infrastructure a software engineering problem puts all parties in an unfamiliar position.

I am not saying no one is capable of using these tools well. The DevOps and software engineers I've worked with are more than capable. This is a matter of attention. If you look at what a DevOps engineer has to deal with day in and day out, the nuances of TypeScript or Go will take a backseat. Conversely, for a software engineer delivering a new feature, the nuances of, for example, a VPC will take a backseat. The gap that the AWS CDK and Pulumi try to bridge is not optimized for anyone, and this is how we get bugs and, more dangerously, security holes.

On the surface, the AWS CDK, Pulumi, and CDKTF seem like a step forward, but they bring so much complexity without really solving the issue of making infrastructure easier to create, modify, and maintain, that they are anything but progress.

The limits of HCL are also its strengths, and this is where HashiCorp's HCL shines. Those limitations mean it is easy for humans and computers to read, write, and understand. With HCL, we allow humans to write Terraform code when it suits them but also allow computers to provide different views of that code as needed.

Keeping IaC as code is vital. We do not want to rewrite source control, losing the rich ecosystem. But we do want to allow the Terraform code to be manipulated in the way that suits that user best.

ClickOps enables more users

ClickOps is the idea of providing GUIs for manipulating infrastructure which produce code. There are a lot of ways that ClickOps can be achieved, but we think that any ClickOps solution should use HCL as its source and destination language.

ClickOps tools can integrate into the existing rich ecosystem of Terraform tooling, and advanced users can read and write Terraform code by hand if they wish. The goal here is to increase the number of users who can comfortably modify infrastructure while not alienating those who are already comfortable. We think HCL is a language that can cross that chasm, not because it's easy for everyone to use but because it strikes a good balance between being consumable by humans and by computers. HCL is a great language for ClickOps.

We want ClickOps that allows for the creation and modification of infrastructure without losing any of the value we get from IaC. We still want to have pull requests before the change is applied. We want to utilize the powerful ecosystem of Terraform tooling. We want advanced users to be able to write Terraform code.

Imagine a world where you have a service that knows how to read and write Terraform code. It presents that code to the user as if it were an AWS console. They click through the GUI in order to make their changes and it writes the change out as Terraform code. The user can then make a pull request from the change. A reviewer then reads the code, as they would any other change. Or maybe they prefer to use the service to visualize the pull request.
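
As a rough sketch of that round trip, the Python below takes a resource as a GUI might hold it in memory, applies a form edit, and produces both the new HCL and a diff a reviewer could read in a pull request. Every name here (`render_hcl`, `propose_change`, the `main.tf` path) is hypothetical, and the sketch assumes the resource has already been parsed into a dictionary; a real service would need a proper HCL parser and writer.

```python
import difflib


def render_hcl(resource_type, name, attributes):
    """Render one resource block as HCL text.

    Only string and integer attributes are handled in this sketch.
    """
    lines = [f'resource "{resource_type}" "{name}" {{']
    for key, value in sorted(attributes.items()):
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"  {key} = {rendered}")
    lines.append("}")
    return "\n".join(lines) + "\n"


def propose_change(resource_type, name, attributes, edits):
    """Apply GUI form edits and return (new HCL, unified diff)."""
    before = render_hcl(resource_type, name, attributes)
    after = render_hcl(resource_type, name, {**attributes, **edits})
    diff = "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile="a/main.tf",
            tofile="b/main.tf",
        )
    )
    return after, diff


if __name__ == "__main__":
    # The user changes one field in a form; the output is reviewable code.
    current = {"engine": "postgres", "allocated_storage": 20}
    new_hcl, diff = propose_change(
        "aws_db_instance", "app", current, {"allocated_storage": 50}
    )
    print(diff)
```

The point of the design is that the GUI never becomes the source of truth: the only artifact it produces is ordinary Terraform code, so the change still goes through review and existing tooling.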

Many users are familiar with the AWS console and would be able to make and review changes quickly without ever having seen the Terraform code. Users who are more comfortable in Terraform could look at the generated code. We could create more than just write-once templates. An organization could have a template for creating its infrastructure in a testing environment. Once the changes are ready, the organization wants to transition them to a production environment. A service could understand the difference between these two environments, take the testing infrastructure code, and produce new code that has the changes for production. This doesn't have to be a one-time change: every time an infrastructure change is validated in testing, the service can carry it over to production, making the appropriate adjustments.
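
One simple way to picture the testing-to-production step is as a merge: keep everything that was validated in testing and replace only the values that must differ. The Python sketch below assumes resources are held as dictionaries keyed by resource address; the `promote` function, the addresses, and the override values are all hypothetical.

```python
def promote(testing_resources, production_overrides):
    """Derive production resources from validated testing resources.

    Both arguments map a resource address to a dict of attributes.
    The inputs are left unmodified.
    """
    production = {}
    for address, attributes in testing_resources.items():
        overrides = production_overrides.get(address, {})
        # Keep what was validated in testing, replace only what must differ.
        production[address] = {**attributes, **overrides}
    return production


if __name__ == "__main__":
    testing = {
        "aws_db_instance.app": {
            "engine": "postgres",
            "instance_class": "db.t3.micro",
            "allocated_storage": 20,
        },
    }
    overrides = {
        "aws_db_instance.app": {
            "instance_class": "db.m5.large",
            "allocated_storage": 200,
        },
    }
    print(promote(testing, overrides))
```

Because the function is repeatable, it can run every time a change is validated in testing, which is what separates this from a write-once template.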

With ClickOps, we can have GUIs with rich, preferably self-documenting interfaces, enabling more users to work with infrastructure. By providing powerful templates, wizards, and intuitive interfaces, users are more likely to make the right changes. The end result is speed and stability. Software engineers can make infrastructure changes faster because they do not have to wait for support from operations, and they can make them more safely because the ClickOps platform provides guardrails on their changes. By using HCL as the underlying language for all of this, we get the benefits of a GUI without losing the benefits of IaC.

The future of ClickOps

We are curious what the community thinks.

Do you think we are way off?
Are the AWS CDK, Pulumi, and CDKTF actually the future?
What is your experience with colleagues when they help manage their own infrastructure?

We are hoping this blog post stirs up discussion. Even if you hate the idea, please let us know.
