Terraforming in 2021 – new features, testing and compliance

diogoaurelio

Posted on May 2, 2021

Both João and I have been using terraform regularly for some time now, and we are big fans of it. Big enough, at least, to have worked for quite a while on a side project of our own to simplify managing terraform environments (which unfortunately did not succeed). And now that we are preparing the next one with mklabs, which naturally also relies heavily on terraform, it only made sense to start by sharing some of our favorite tools in the terraform ecosystem.

We kick off with the latest updates in terraform itself, from 0.12 up to the latest 0.15, and then move on to the surrounding tooling, where we mainly focus on testing and compliance checking for infrastructure.

You can find all the code supporting this post here.

Overview of latest Terraform versions

Version 0.15 was released only recently. Yet a lot of companies out there are still running environments on 0.11.x, and for good reason: although version 0.12 was launched a while ago (2019), it brought a lot of breaking changes. If this is your case, chances are that you gave up on keeping up with terraform's progress. So we thought it would make sense to first review some of the terraform features you might be missing out on, before trying to convince you to also consider the later improvements.

0.12

  • First-class expression syntax: probably the most noticeable change; no more wrapping every expression in "${ }" interpolation syntax (confetti blowing in the background)!
  • Generalized type system: (yay!) you can now specify the types of variables, including complex ones such as lists, maps and objects;
  • Iteration constructs: the introduction of the for expression brings the DSL closer to programming languages and makes it more expressive;
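To make these three changes concrete, here is a minimal sketch of 0.12-style code. The variable names and default values are our own illustrative assumptions, not taken from the announcement.

```hcl
# Hypothetical input; the type constraint is new in 0.12.
variable "environment" {
  type    = string
  default = "dev"
}

# Complex types are supported too: here a list of strings.
variable "bucket_names" {
  type    = list(string)
  default = ["logs", "artifacts"]
}

locals {
  # 0.11 required "${var.environment}"; 0.12 accepts the bare expression.
  env = var.environment

  # for expression: derive a new list from the input list.
  prefixed_names = [for name in var.bucket_names : "${local.env}-${name}"]
}

output "prefixed_names" {
  value = local.prefixed_names
}
```

With the defaults above, the prefixed_names output evaluates to the list "dev-logs", "dev-artifacts".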

For more details, see the general availability announcement page.

0.13

  • Module improvements - brings the ability to use in modules several meta-arguments that until now were only available for resources. We can now use depends_on to declare dependencies on modules (finally!), and instantiate multiple module instances with count or for_each; more details here;

Example specifying a module dependency (a personal favorite), source hashicorp blog
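A minimal sketch of both meta-arguments on module calls; the module names, the ./modules/... source paths and the name input are hypothetical placeholders.

```hcl
# Hypothetical local modules; adjust the source paths to your layout.
module "network" {
  source = "./modules/network"
}

module "app" {
  source = "./modules/app"

  # 0.13+: one module instance per element of the set.
  for_each = toset(["blue", "green"])
  # Assumes ./modules/app declares a "name" input variable.
  name = each.key

  # 0.13+: wait for everything in module.network before creating this module.
  depends_on = [module.network]
}
```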

  • Custom variable validation - from this version onwards, when declaring variables one can specify custom rules for the accepted input, along with a corresponding error message, in order to fail fast; more details here;

Example variable validation, source hashicorp blog
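A minimal sketch of a validation rule, assuming a hypothetical instance_type variable that should only accept t3 family types:

```hcl
variable "instance_type" {
  type    = string
  default = "t3.micro"

  validation {
    # Illustrative rule: only accept t3 family instance types.
    condition     = can(regex("^t3\\.", var.instance_type))
    error_message = "The instance_type must be a t3 family type, e.g. \"t3.micro\"."
  }
}
```

Passing, for example, -var instance_type=m5.large would make terraform plan fail early with the error message above, before any provider call is made.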

  • Support for 3rd party providers - terraform now allows you to reference your own providers in the required_providers block inside the terraform block. The required_providers block already existed in terraform 0.12, but it only carried version constraints, and automatic installation was restricted to providers distributed by HashiCorp; now each provider gets a source address, which can point to the hostname of your own registry, and terraform init installs it just like any other provider; more details here. Note that if you are already using required_providers in the terraform block, you should adapt it when migrating from 0.12 to 0.13, as shown in the following example:

Example required_providers, source terraform upgrade guides
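A sketch of the 0.13 format, with the older 0.12 shorthand kept as a comment for comparison. The mycloud provider and the registry.example.com hostname are hypothetical placeholders for a third-party provider hosted in your own registry.

```hcl
terraform {
  # 0.12 shorthand, which only carried a version constraint:
  #   required_providers {
  #     aws = "~> 3.0"
  #   }

  required_providers {
    # 0.13+: each provider gets a source address, [hostname/]namespace/type.
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }

    # Hypothetical third-party provider served from a private registry.
    mycloud = {
      source  = "registry.example.com/mycorp/mycloud"
      version = ">= 1.0"
    }
  }
}
```

Terraform 0.13 also ships a terraform 0.13upgrade command that can rewrite existing blocks into this format.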

For more details, see the general availability announcement page.

0.14

  • Sensitive variables & outputs - flagging a variable as "sensitive" redacts its value in the CLI output;
  • Improved diff - one of the most common complaints I hear is how hard it is to read the output of terraform plan (and, likewise, of apply and show); with this change, terraform hides unchanged fields (irrelevant for the diff) and displays a count of the hidden elements instead, for better clarity; more details here;

Example excerpt diff, source hashicorp blog

  • Dependency lock file - as soon as you run terraform init with version 0.14, you will notice that a .terraform.lock.hcl file is created in the working directory, outside the .terraform directory. This file is meant to be committed to version control, to guarantee that the exact same provider versions are used everywhere;
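As a sketch of the sensitive flag, assuming a hypothetical db_password variable (the lock file, by contrast, is generated by terraform init and should be committed as-is rather than written by hand):

```hcl
variable "db_password" {
  type = string
  # 0.14+: the value is redacted as (sensitive) in plan/apply output.
  sensitive = true
}

output "db_password" {
  value = var.db_password
  # An output derived from a sensitive value must be marked sensitive too,
  # otherwise terraform reports an error.
  sensitive = true
}
```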

For more details, see the general availability announcement page.

0.15

Finally, here are some of the main highlights of the latest version, announced very recently:

  • Remote state data source compatibility: to make upgrading easier, the terraform_remote_state data source in older releases can read state snapshots written by newer versions, so interdependent environments do not have to be upgraded all at once; this compatibility has been backported to previous releases, namely 0.14.0, 0.13.6, and 0.12.30;
  • Improved concealment of sensitive values (passwords, for example): building on the 0.14 work, attributes that provider developers mark as sensitive are now hidden in the output by default; moreover, terraform added a sensitive function that lets users explicitly mark values as sensitive;
  • Improvements in logging behavior: you can now set the log level for terraform core and for providers separately, with TF_LOG_CORE=level and TF_LOG_PROVIDER=level.
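The sensitive function can be sketched as follows, assuming a hypothetical api_token.txt file next to the configuration. The logging split needs no configuration change; for example, TF_LOG_CORE=ERROR TF_LOG_PROVIDER=TRACE terraform plan keeps core logging quiet while tracing provider calls.

```hcl
locals {
  # Hypothetical file holding a secret; file() does not mark its result
  # as sensitive, so sensitive() does it explicitly (0.15+).
  api_token = sensitive(file("${path.module}/api_token.txt"))
}

output "api_token" {
  value = local.api_token
  # Required because the value is sensitive.
  sensitive = true
}
```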

For more details, see the general availability announcement page.

Handling multiple versions

Assuming these new features convinced you to upgrade your existing terraform environments, realistically this will not happen overnight. You will have a transition period (if not a permanent situation) with environments on different terraform versions. That is OK: you can keep your sanity while hopping between them thanks to tools like the following:

  • TFEnv - a terraform version manager written in shell scripts, inspired by rbenv from the ruby world;
  • Terraform Switcher - another project doing essentially the same, written in go;

These two projects overlap almost entirely, so we will illustrate with just one of them, namely tfenv:

Here we show how to switch between installed versions with tfenv use <local-version>, how to check which versions are installed locally with tfenv list, and how to list all available versions with tfenv list-remote (minor detail: the current version of the tool I'm using to record my terminal, terminalizer, does not capture me scrolling up and selecting terraform version 0.14.5).
Last but not least, we also show a cool tfenv feature: it can automatically detect the minimum version required by a given environment. The same goes for the latest allowed version, in case you are wondering. And yes, this is also available in the terraform switcher project.
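tfenv reads that constraint from the terraform block in the working directory. A minimal sketch of what it looks for (the version number is just an example):

```hcl
terraform {
  # With this constraint in place, `tfenv install min-required` installs
  # the lowest matching version (0.14.5 here), while
  # `tfenv install latest-allowed` installs the highest version allowed.
  required_version = ">= 0.14.5"
}
```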

Testing

Testing is probably the most confusing topic in Infrastructure-as-Code (IaC) land, and terraform is no exception: a lot of different tools and procedures get thrown into the same bag when, well, they probably should not be. When talking about testing, people usually mean one of three different things:

  • static checks - validation mainly of the code structure, without actually running the code; a fast way to perform sanity checks, locally and/or in your CI/CD pipelines;
  • integration tests - provided you properly use inversion of control through variables, they let you deploy and test your modules or environments for different generic cases;
  • compliance checks - tests done after all resources have been deployed; these can serve distinct purposes that we explore later, but essentially the goal is to confirm that what is deployed is what was intended and expressed in terraform;

Yes, you got that right: if you were looking forward to some classic unit testing, you can forget that for now. Let us dive straight into more details.

Static checks

Again, strictly speaking these are not tests, but rather simple validation checks one can run to catch common errors without deploying anything. There is an array of tools out there, but let us start with the one that ships with the terraform binary.

It is not uncommon for tool creators to provide their own validators: kubectl validates with kubectl apply -f <file> --dry-run --validate=true, helm provides helm lint <path>, and terraform provides terraform validate.

In the following example, running terraform validate would catch two of the three issues:

terraform validate requires terraform init to have been run first, so that it can use the providers. In our example, it detects the typo in the reference to the aws bucket resource, along with the invalid CIDR provided to create the VPC. What it does not detect, however, is the non-existent EC2 instance type.
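A hypothetical configuration with the three kinds of issue just described might look like the sketch below; the resource names, the placeholder AMI ID and the invalid values are invented for illustration.

```hcl
resource "aws_s3_bucket" "artifacts" {
  bucket = "example-artifacts-bucket"
}

resource "aws_vpc" "main" {
  # Caught by terraform validate: /33 is not a valid CIDR prefix.
  cidr_block = "10.0.0.0/33"
}

resource "aws_instance" "web" {
  ami = "ami-0123456789abcdef0"
  # NOT caught by terraform validate: no such EC2 instance type.
  instance_type = "t3.superlarge"
}

output "bucket_arn" {
  # Caught by terraform validate: typo, the resource is "artifacts".
  value = aws_s3_bucket.artifact.arn
}
```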

TFLint comes to the rescue. Yet another open source tool written in go, it ships as a single binary much like terraform, and does not even require terraform to be installed.

Running it with the deep option requires providing cloud credentials, in exchange for a more thorough inspection. In our case, it detected the incorrect instance type, as well as the wrong AMI.

You can also customize tflint to inject variables, define modules to ignore, etc. You can check the user guide for more details.
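A sketch of a .tflint.hcl file covering those customizations. The attribute names are as we recall them from the TFLint user guide of that period and may differ in your version; the file name, variable and module source are hypothetical.

```hcl
# .tflint.hcl in the working directory
config {
  # Also inspect module calls.
  module = true

  # Inject variable values, from a file and inline.
  varfile   = ["example.tfvars"]
  variables = ["environment=dev"]

  # Skip a module you do not control.
  ignore_module = {
    "terraform-aws-modules/vpc/aws" = true
  }
}

# Individual rules can be toggled as well.
rule "terraform_unused_declarations" {
  enabled = true
}
```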

Besides linting your code for configuration issues, we also recommend checking it for security issues, such as overly permissive security group rules, unencrypted resources, etc.

Here again, more than one tool can help. We highlight two of the most popular ones: tfsec and checkov. Both inspect your code against a predefined set of checks, let you explicitly grant exceptions (if you really want to) by annotating your code with comments, and let you adjust the configuration, for example to ignore some modules.
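For instance, both tools accept inline suppression comments, sketched here on a hypothetical access-log bucket. The rule IDs are our assumption of the relevant checks (in tfsec's numbering at the time, AWS002 flagged missing S3 access logging; in checkov, CKV_AWS_18 does), so verify them against each tool's documentation.

```hcl
#tfsec:ignore:AWS002
resource "aws_s3_bucket" "access_logs" {
  #checkov:skip=CKV_AWS_18:This bucket stores access logs itself
  bucket = "example-access-logs-bucket"
  acl    = "log-delivery-write"
}
```

tfsec picks up the comment on the line above the resource, while checkov expects its skip comment inside the resource block, followed by an optional reason.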

TFSec is written in Go, is probably the fastest to get started with, and currently provides up to 10 checks for the current main cloud providers (AWS, GCP, and Azure). The potential downside is that it works exclusively for Terraform, so you will need additional tools to inspect kubernetes/helm/cloudformation etc.

Checkov, on the other hand, despite being a more recent tool, has seen stellar development speed (it is developed by a startup with good funding rounds, and the Palo Alto Networks acquisition can't hurt). Not only does it have a really comprehensive set of checks across all the main cloud providers, it also spans multiple technologies, such as Kubernetes, Cloudformation, serverless and ARM templates. And the list keeps growing. Checkov lets you run either a pure static check, by pointing it to a terraform directory or file, or a check against a terraform plan file exported to JSON. The nice thing about running it directly on the code is, naturally, the simplicity: the target environment does not need to be accessible to test the code.

These tools have been attracting a lot of attention lately, as double-checking for security issues used to be locked in the hands of devops/devsecops teams, which in practice created a development bottleneck. By injecting these checks early in CI/CD pipelines, teams gain a great deal of development speed without compromising security.

Integration tests

Terratest is probably the closest one can get nowadays to testing a specific piece of terraform code. It is a Go library, and requires writing tests in Go. This is obviously a potential limitation, as not all teams know Go. On the upside, I would argue that the learning curve for the basics of Go (read: enough to write terraform tests) is not steep if you already know at least one programming language.

Having already worked on several Go projects at mklabs, we are naturally biased in favor of Terratest. So, even if you have never tried Go, we would still recommend having a look at our sample repository, where we provide a go mod setup to make getting started easy. And if that has not convinced you yet, here is what might: you can also use Terratest to test Dockerfiles, Kubernetes and Packer setups.

This might seem intimidating, but we would argue that the benefits are worth it. Let us look at an example skeleton of a test setup for AWS:

Essentially the skeleton is always the same:

  1. Start by defining the variables you want to inject as input for your terratest run; you might even want to inject random values, to test in greater depth;
  2. Make sure you tell the setup to destroy all created resources once the test finishes, using a defer statement; this is similar to trap in shell, and it runs even if something fails in the meantime; the only prerequisite is that you declare it before calling terraform.InitAndApply();
  3. Deploy the infrastructure under test by passing the terraform options you declared in terraformOptions to terraform.InitAndApply();
  4. Finally, declare the things you want to assert; that is, declare what you expect to have been deployed (the expected) and check whether it matches what was actually deployed (the result). The simplest way is to check the terraform apply outputs, which are accessible via the `terraform.Output()` method. Alternatively, terratest also provides helper methods out of the box, on a per provider basis, for frequently checked things.
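On the terraform side, all this skeleton needs is a module with input variables that terraformOptions can set and outputs that terraform.Output() can read back. A minimal hypothetical module under test (all names are ours):

```hcl
variable "aws_region" {
  type    = string
  default = "eu-west-1"
}

variable "bucket_name" {
  description = "Injected by the test, e.g. with a random suffix"
  type        = string
}

variable "environment" {
  type    = string
  default = "test"
}

provider "aws" {
  region = var.aws_region
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name

  tags = {
    Environment = var.environment
  }
}

# Read back in the test via terraform.Output(t, terraformOptions, "bucket_id")
output "bucket_id" {
  value = aws_s3_bucket.this.id
}
```

Keeping every input a variable is what makes the inversion of control mentioned earlier possible: the test decides the bucket name, so parallel runs do not collide.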

Sounds like fun, right? Here is an example of some of the tests we could write for our basic terraform example:

Here is the output of running these tests for our demo setup:

Final thoughts regarding terratest: we find terratest a great tool for implementing changes in your infrastructure with confidence, whether they are simple day to day changes or bigger, more complex upgrades or migrations.

We have only scratched the surface of the tests one can develop with terratest; for example, you can SSH into instances and confirm that they can access other resources. However, there is no free lunch: tests can take a long time (depending, obviously, on what you are testing) and require a non-trivial time investment in learning the tool, writing the tests, and setting up the environment, as you will probably want to run them in an isolated environment. In the AWS case, this would ideally be a dedicated account. We recommend reading the testing best practices from Gruntwork, the company behind terratest (and terragrunt).

Compliance checks

The last mile is asserting that what you wanted deployed was indeed deployed exactly as you wanted. Abusing terraform's null_resource is a classic source of unintended surprises here. One way to achieve this is to run these checks right after the terraform apply stage of your CI/CD pipeline.
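To illustrate the null_resource pitfall, here is a hypothetical sketch: the bucket is declared with versioning enabled, but a local-exec provisioner then suspends it through the AWS CLI. Terraform only records that the command ran, so right after a successful apply the deployed bucket no longer matches what the configuration expresses.

```hcl
resource "aws_s3_bucket" "data" {
  bucket = "example-data-bucket"

  # AWS provider 3.x syntax: versioning as a nested block.
  versioning {
    enabled = true
  }
}

# Anti-pattern: an imperative side effect terraform does not model.
resource "null_resource" "suspend_versioning" {
  triggers = {
    bucket = aws_s3_bucket.data.id
  }

  provisioner "local-exec" {
    command = "aws s3api put-bucket-versioning --bucket ${aws_s3_bucket.data.id} --versioning-configuration Status=Suspended"
  }
}
```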

But the next question you might have is: how do I know that these configurations stay that way, and that no one changes things inadvertently? We have seen this situation arise in different forms: manual changes by users via GUI or CLI; different terraform environments mutating overlapping resource properties; or as a by-product of using different IaC tools, for example configuring some bits with Cloudformation, others with K8s or Helm, etc.

Arguably more interesting is scheduling these checks to run continuously, to make sure things stay as you expect. Let's see some options out there to achieve this.

The rogue approach: in practice, you can do this type of compliance check in any programming language you favor. As we have shown before, one can export terraform's state to JSON (for example with terraform show -json), and then use, for example, python with the pytest and boto3 libraries to compare what is deployed with the desired output. You could even go further and use boto3 to scan your accounts for aspects you consider to go against best practices, such as lack of encryption (in-flight and at rest).

While this might be tempting at first sight, you might end up writing way more test code than actual terraform code. Not only does that seem kind of silly, it can also get hairy.

The second reason not to follow this approach is that there are already several well-tested and intuitive solutions out there that can help you with this task, which we cover now:

  • Driftctl - open source go tool from cloudskiff, for terraform;
  • Sentinel - Hashicorp's own solution;
  • terraform-compliance - open source BDD based solution dedicated to terraform;
  • Conftest - test suite for multiple frameworks besides terraform, such as kubernetes and dockerfiles;
  • InSpec - Chef's compliance testing tool, written in ruby;
  • Built-in cloud provider tools - each cloud provider has its own inspection mechanisms in place;

You may be wondering why the first place in our list goes to a tool still in beta. That is a fair point; we would argue that its progress looks promising, and we like to support startups. Moreover, we think its ease of use deserves a highlight, as it does one thing and one thing only: check whether what is in your terraform state file is what is actually deployed. Simply point the driftctl scan command to your terraform state file and you will get a detailed report of any drift. Given the low effort required to adopt it and the value it provides, we think it deserves taking for a spin.

Next on our list is Hashicorp's (the company behind terraform) own enterprise solution for this, Sentinel. This could make sense if you are already using other Hashicorp enterprise functionality, such as Terraform Enterprise.

A directly comparable open source alternative is terraform-compliance. It follows BDD conventions, so that you can specify your expectations in an easily human-readable way, using:

  • given: a given resource type;
  • when: an optional condition you might want to add;
  • then: what your expectation is;

Here is an example of a test file for AWS S3 buckets:

Although we do get the appeal of easily understandable tests that do not require knowing how to code, we find that terraform-compliance lacks the flexibility to test various aspects, mainly due to its BDD nature. Moreover, we are not the only ones disenchanted with it, and others have raised some valid points.

If you like terraform-compliance, Conftest might also be worth a look. Policies are written in Rego, the policy language of Open Policy Agent, and it allows you to test multiple frameworks. We found this blog post from Lennard Eijsackers very informative, and would rather recommend that you check it out.

Before we dive into the cloud providers' own compliance checking services, we want to highlight yet another open source tool, namely InSpec. It allows you to write tests in ruby, and was built on top of RSpec. If you already know awspec, this should feel very similar, with the advantage that InSpec also supports GCP and Azure.

Even though none of us is actively a ruby developer, we find InSpec very easy to get started with and, most importantly, very powerful. It allows you to combine IaC checks, which target the resources deployed and mapped in terraform state files, with general policies for cloud account configuration. Moreover, it also allows you to add security checks, such as on OS-level configurations and running services. Let us illustrate how you could write these checks. The following check illustrates how to define global policies that an account should obey:

The next one illustrates checks for generic networking:

The previous examples only illustrate generic control checks that validate overall best practices in your account. However, you might also want to assert that what is in your terraform state matches what is deployed; the cloudskiff engineers rightly call any mismatch drift. Christoph Hartmann, one of InSpec's creators, has a nice blog post explaining how to use InSpec and integrate it with terraform. The approach is essentially the one described previously in the rogue approach: import the terraform state JSON file and use it as the expected value in the assertions.

Built-in cloud provider policy tools

Each cloud provider has a native tool to address company-wide governance policies. Some examples of services are:

These are just some of the services we know can be used for such enforcement. Most of them go beyond checking cloud configuration and also provide, for example, security inspections at instance level. The important point to keep in mind is that these are post-deployment compliance assertions, not a means of preventing configuration issues.
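As one example of such native enforcement, which can itself be managed with terraform, here is a sketch of an AWS Config managed rule. It assumes AWS Config is already enabled in the account (a configuration recorder must exist), and the rule name is our own.

```hcl
# Continuously evaluates whether S3 buckets have default encryption enabled.
resource "aws_config_config_rule" "s3_sse" {
  name = "s3-bucket-server-side-encryption-enabled"

  source {
    owner             = "AWS"
    source_identifier = "S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED"
  }
}
```

Note that the rule only reports non-compliant buckets after the fact; it does not stop a non-compliant bucket from being created.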

Final thoughts

We find that there is not a single silver bullet that solves all problems, and the best strategy is actually a combination of multiple tactics employed at different stages. The two main phases that require different approaches are pre and post deployment.

For pre-deployment, a combination of static checks and actual tests can be used. Static checks are a great starting point, as they are easy to set up and allow you to enforce generally good practices across the whole organization. In other words, static checks provide you a powerful, easy win. With terratest, on the other hand, you mainly gain confidence that the IaC code will actually work, and you will catch faux pas. However, terratest does not come for free: it requires a learning curve with go, along with a time investment to develop the actual tests.

After your code gets deployed comes the next challenge: making sure things stay well architected. Regular scheduled checks that assess whether your infrastructure is running as intended should also be part of your devOps strategy, and for this, two main routes exist: open source tools or dedicated cloud services.

And that is it. Thank you for reading, we hope this has been useful. Feel free to reach out to us if you have questions or suggestions.

Once again, you can find all the code supporting this post here.

