TeamOps: Process as Code
Jens Claes
Posted on February 3, 2021
Teams often dislike the tools they use daily because those tools don't provide the right value. Usually, the cause is an old configuration that no longer works well for the team (or never did) and is too hard or too cumbersome to change. Let me show how we use the "Infrastructure as Code" principle at DataCamp to iterate on our tools faster, making them work for us instead of against us.
The invisible process
When I was still an engineering manager, I learned that having a well-defined and efficient process is paramount to a high-performing team. Such a process ensures that things move forward in the right direction while being as invisible as possible. A process should only hinder individuals when they are trying to do something that is either bad practice or violates the working agreement that the team made. The process should help the team create good habits and avoid bad habits. For the rest, the process should be invisible, allowing the team to focus on their actual job.
Finding a good process is one thing, but upholding it is another piece of the puzzle, and that's what I want to focus on in this article. In the past, processes were enforced by filling in forms, by having humans do manual checks and verifications, etc. A bureaucratic mess to ensure we work efficiently and effectively; the irony is not lost on me.
Luckily for us, we have a lot of tools these days which can do this job for us.
Configuring tools is painful...
At DataCamp we use GitHub branch protection to ensure that all changes are peer-reviewed and pass the quality checks we have implemented using Continuous Integration. However, with the number of repositories we have, we kept running into repositories that didn't have any branch protection configured. Every time this happened, the team had to message their engineering manager to set up branch protection for yet another repository. This manual process was error-prone. We knew it was bound to create a problem, and at some point it did: changes were merged to master without following the proper process.
This is just one example of how configuring tools properly and consistently can be cumbersome and error-prone. Escalating to an engineering manager and dropping all the work in their lap is not ideal. On top of that, when we inevitably notice the problem too late, the damage has been done and there might be more work to fix what went wrong. Even though we knew what the process should look like (code review, …) and how to configure the tool to enforce it (use branch protection), we still allowed an incident to happen.
Adopting a workflow from another team is even worse, as now we first have to investigate how the other team configured their tools and then go on and click the same buttons.
In the end, tools are such a pain to configure consistently that they often end up ill-configured, and the tools themselves become the problem rather than a solution to one.
...but it doesn't have to be
A process only works if it's properly and consistently followed. If we rely on humans and forms to ensure processes are followed, we end up in a bureaucratic mess, and if we rely on tools, we face the painful job of configuring them all correctly.
Luckily, there is a third option. What if we automate the configuration of our tools? This is exactly what Infrastructure as Code is all about.
At DataCamp we used Terraform to solve the GitHub branch protection problem described above. To add a repository to our automation, it's just a matter of adding something like the following code snippet:
module "campus-app" {
  source      = "./repository"
  name        = "campus-app"
  description = "The campus application for DataCamp"
  team        = "lx"
  required_status_checks = [
    "bundlewatch",
    "push_commit"
  ]
  # Keep the wiki for this repository (the module default is false).
  has_wiki     = true
  homepage_url = "https://campus.datacamp.com"
}
Because this is (configuration) code, we can open a pull request for it and run CI checks on that pull request. Terraform provides a command, terraform plan, to preview the changes it will make. E.g. for the pull request that introduced the above code, the plan showed the following.
The issues and the downloads are disabled (because we never use them), the wiki is kept because we explicitly asked for it (by default the wiki would be removed), and the topics are changed to match the new team abbreviation, which we had forgotten to update when we renamed the team.
Branch protection is changed to enforce it for admins. The bundlewatch and push_commit status checks are required. If we had passed an empty array in the configuration code, Terraform would have reported an error. As a policy, we require at least one status check to ensure all projects have some sort of Continuous Integration set up.
The rest of the branch protection settings are centrally managed; we don't need to set them per repository because we want them to be the same everywhere within our organization. This is some of the centrally managed code:
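The article does not show how that check is implemented. One way to get this behavior, sketched here under the assumption of Terraform 0.13 or later and a module variable named after the required_status_checks argument used above, is a validation block on the variable:

```hcl
# Sketch only: the variable name mirrors the module argument in the article;
# the description and error message wording are assumptions.
variable "required_status_checks" {
  type        = list(string)
  description = "Names of the CI status checks that must pass before merging."

  validation {
    # Reject an empty list so every repository has at least one CI check.
    condition     = length(var.required_status_checks) >= 1
    error_message = "At least one required status check must be configured."
  }
}
```

With a rule like this, a module call that passes an empty list fails during terraform plan, so the problem shows up in the pull request rather than after merging.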
variable "name" {
  type = string
}

variable "description" {
  type = string
}

variable "team" {
  type = string
}

variable "has_wiki" {
  type    = bool
  default = false
}

variable "topics" {
  type    = list(string)
  default = []
}

resource "github_repository" "repository" {
  name                   = var.name
  description            = var.description
  delete_branch_on_merge = true
  private                = true
  has_wiki               = var.has_wiki
  has_projects           = false
  has_issues             = false
  topics                 = setunion(toset(local.SETTINGS_BY_TEAM[var.team].topics), var.topics, ["managed-by-github-infra"])
}
Some things, like enabling GitHub issues, are managed on the organization level (always set to false); others are managed per repository but with a global default (like the wiki); others are team-specific; and some are a mix of all of the above (e.g. topics come partially from the team, partially from the repository, and partially from a global value).
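To make the topics mix concrete, here is a minimal sketch of the lookup behind that expression. The contents of SETTINGS_BY_TEAM are not shown in the article, so the map below (and the "lx" topic in it) is hypothetical; only the setunion expression itself is taken from the code above.

```hcl
variable "team" {
  type = string
}

variable "topics" {
  type    = list(string)
  default = []
}

locals {
  # Hypothetical values: the real SETTINGS_BY_TEAM map is not shown in the article.
  SETTINGS_BY_TEAM = {
    lx = {
      topics = ["lx"]
    }
  }

  # Team-level topics, repository-specific topics, and one global topic,
  # merged into a single set (duplicates collapse automatically).
  effective_topics = setunion(
    toset(local.SETTINGS_BY_TEAM[var.team].topics),
    var.topics,
    ["managed-by-github-infra"]
  )
}

output "effective_topics" {
  value = local.effective_topics
}
```

Under these assumed values, a repository of team "lx" that passes topics = ["campus"] would end up with the topics campus, lx, and managed-by-github-infra.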
We used to live in a world with a lot of repositories where some were correctly configured, most were somewhat correctly configured, and a lot were configured completely wrong. Now, all our repositories are correctly configured. Instead of having to configure each repository individually, we can now choose which settings we want on the repository level, which on the team level, and which on the organization level. On top of this, everybody can now contribute by proposing a change to this configuration.
A whole new world
By representing the configuration of our tools (and therefore our process) as code, we changed the game.
Everybody can now create a PR to configure our GitHub repositories; it's no longer only engineering managers who have permission to make changes. Using pull requests still involves the engineering manager as a reviewer, but it empowers individuals in a team. This way, we can spread the knowledge between engineering managers (who can copy the code for their own teams) and within the teams themselves (who can now propose changes on their own).
Having a PR to configure tools might be a big win from a Change Management point of view as well. You now have a track record of all the changes and if something doesn't work, it's simple to revert the change.
If you wonder how a team configured something, you can look up the code, including any comments, the git blame, and the corresponding PR with more information. Even if the person who configured it has left the company, there is still a lot of information to be found.
And if it's useful and complex, you could even build an abstraction layer on top. E.g. there are quite a few options regarding branch protection. We can figure out once how we want them to be configured, and then nobody needs to think about that anymore. If this configuration doesn't cut it anymore, we can decide whether to make it configurable or to change it for all teams. Either way, we don't end up with a lot of configuration that's just outdated and not worth the effort to fix.
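As an illustration of such an abstraction layer, a shared repository module could define branch protection once for every repository. This is a sketch, not DataCamp's actual configuration: it assumes the GitHub Terraform provider's github_branch_protection resource as found in provider version 4.x, that it lives in the same module as the github_repository resource shown earlier, and that a single approving review is the desired rule. Argument names differ between provider versions, so check the documentation for the version you use.

```hcl
# Sketch only: assumes this sits next to github_repository.repository and that
# the module declares a required_status_checks variable (list of strings).
resource "github_branch_protection" "master" {
  repository_id  = github_repository.repository.node_id
  pattern        = "master"
  enforce_admins = true # the rules apply to admins as well

  required_status_checks {
    # Per-repository CI checks, e.g. "bundlewatch" and "push_commit".
    contexts = var.required_status_checks
  }

  required_pull_request_reviews {
    # Assumed value: one approving peer review before merging.
    required_approving_review_count = 1
  }
}
```

Because the resource is defined once in the module, changing a rule here changes it for every repository that uses the module, and the change itself goes through a reviewed pull request.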
These abstraction layers also allow us to define policies.
E.g. at DataCamp, we don't use GitHub issues, so let's turn them off for all repositories. And we prefer not to use the wiki but it's okay for older projects. We could have an open-source policy for repositories that does enable GitHub issues and maybe configures the branch protection differently.
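One possible shape for such policies, sketched with hypothetical names (the policy variable, the SETTINGS_BY_POLICY map, and its values are all assumptions, modeled on the SETTINGS_BY_TEAM lookup used earlier):

```hcl
variable "policy" {
  type    = string
  default = "internal"

  validation {
    condition     = contains(["internal", "open-source"], var.policy)
    error_message = "The policy must be either internal or open-source."
  }
}

locals {
  # Hypothetical per-policy settings; extend with branch protection options as needed.
  SETTINGS_BY_POLICY = {
    "internal" = {
      has_issues = false
      private    = true
    }
    "open-source" = {
      has_issues = true
      private    = false
    }
  }

  policy_settings = local.SETTINGS_BY_POLICY[var.policy]
}

# The repository resource would then read, for example,
# has_issues = local.policy_settings.has_issues instead of a hard-coded false.
output "policy_settings" {
  value = local.policy_settings
}
```

The design choice here is that a repository only names its policy; what the policy means is decided in one place and can be changed for all repositories at once.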
TeamOps: "Process as Code"
I believe that moving our engineering tool configuration into Infrastructure as Code enables all of the advantages outlined above. It empowers teams to configure their tools to match their process, which will unlock their true potential.
We are still experimenting with this approach ourselves but so far, the benefits are already showing and we've yet to stumble on big downsides to this approach. The biggest struggle is probably that it's still quite new. Not all tools support this workflow yet. As more companies move into this space, I'm sure this will change!
For now, I refer to this idea as "TeamOps" or "Process as Code". I believe it has the potential to change the way we work. Quite literally, it enables and encourages us to keep improving our process.
If you want to know more, or you have set up a similar solution already, feel free to reach out to me on Twitter or LinkedIn.
By the way, we are hiring! There are a lot of open positions across many departments including Data Science and Engineering! Come and see, we're eager to meet you!