DevOps Roadmap 2022

In the last few weeks, I met some folks in my mentoring sessions who are new to DevOps or in the middle of their careers and wanted to know what to learn in 2022. DevOps skills are in high demand, and constant learning is required to keep up with the market.

This post shares notes that may help you. The guidance below is based on my experience and understanding.

Roadmap

Be fundamentally strong in networking technologies

Understand concepts such as HTTP/2, QUIC and HTTP/3, Layer 4 and Layer 7 protocols, mTLS, proxies, DNS, BGP, how load balancing works, iptables, how the Internet works, IP addressing schemes, and finally network design. I find Julia Evans's blog very useful; it is my go-to place when I need to understand something in a simple way. She has covered a wide variety of topics in her blog posts and zines.
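To make "how load balancing works" a little more concrete, here is a minimal sketch of round-robin selection, the simplest strategy a load balancer can use. The RoundRobin type and the backend addresses are made up for illustration; real load balancers also handle health checks, weights, and connection draining.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// RoundRobin hands out backends in a fixed rotation.
// It is a hypothetical type for illustration, not taken from any library.
type RoundRobin struct {
	counter  uint64 // kept first so 64-bit atomic access stays aligned
	backends []string
}

// NewRoundRobin assumes at least one backend is supplied.
func NewRoundRobin(backends []string) *RoundRobin {
	return &RoundRobin{backends: backends}
}

// Next returns the next backend in the rotation.
// The atomic counter makes it safe to call from many goroutines.
func (r *RoundRobin) Next() string {
	n := atomic.AddUint64(&r.counter, 1)
	return r.backends[(n-1)%uint64(len(r.backends))]
}

func main() {
	lb := NewRoundRobin([]string{"10.0.0.1", "10.0.0.2", "10.0.0.3"})
	for i := 0; i < 4; i++ {
		fmt.Println(lb.Next()) // the fourth call wraps around to the first backend
	}
}
```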

Master the operating system fundamentals, particularly Linux

As most systems (VMs, containers, etc.) run Linux, it is important to know it from top to bottom. Learn scheduling, the init system and systemd, cgroups and namespaces, and performance tuning, and master command line utilities such as awk, sed, jq, yq, curl, ssh, and openssl. Learn performance troubleshooting from Brendan Gregg's blog.

CI/CD

If you are still into Jenkins, that is fine, but the world has moved to cloud-native pipelines. Conceptually, not much has changed in this space, but you can look into GitHub Actions, Tekton, etc. Think about how to do releases better, and understand deployment strategies such as blue-green and canary.
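As a rough illustration of the canary idea, the sketch below sends a fixed percentage of users to the new version and keeps the rest on the stable one. The PickVersion function and the user keys are hypothetical; in practice this decision usually lives in a service mesh, ingress controller, or load balancer rather than in application code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// PickVersion routes a request key (for example a user ID) to "canary"
// or "stable". Hashing the key keeps the decision sticky: the same user
// always lands on the same version for a given percentage.
// canaryPercent is expected to be between 0 and 100.
func PickVersion(key string, canaryPercent uint32) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	if h.Sum32()%100 < canaryPercent {
		return "canary"
	}
	return "stable"
}

func main() {
	// Count how many of 1000 made-up users would hit a 10% canary.
	canary := 0
	for i := 0; i < 1000; i++ {
		if PickVersion(fmt.Sprintf("user-%d", i), 10) == "canary" {
			canary++
		}
	}
	fmt.Printf("canary: %d of 1000 users\n", canary)
}
```

Raising the percentage step by step until it reaches 100 completes the rollout. Blue-green is the all-or-nothing variant of the same switch: traffic goes entirely to one environment or the other.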

Containerization and Virtualization

Apart from the popular Docker runtime, try containerd, Podman, etc. Learn how to containerise applications, how to implement container security, and how to run and orchestrate VMs in Kubernetes (see the KubeVirt project).

Container Orchestration

Kubernetes is now the de facto standard for running containers, and there is a lot of content on the Internet for learning it. Focus on configuration best practices, application design, security, and scheduling. Setting up a cluster is becoming trivial, but day 2 operations such as setting up monitoring, logging, and CI/CD, scaling the cluster, cost optimization, and security are the challenges you are expected to solve.

Observability at Scale

Most engineers are familiar with the Prometheus and Grafana stack or something similar. Trends suggest that many organizations are consolidating their Kubernetes clusters and observability stacks, which helps from both a performance and a cost perspective. Learn about the advanced configurations and architectures of Prometheus, and how to scale them. Look into technologies like Thanos, Cortex, VictoriaMetrics, Datadog, and Loki, continuous profiling tools such as Parca, periscope, and hypertrace, and distributed tracing with OpenTelemetry. Service meshes such as Istio are a popular ingredient in cloud-native recipes.

Platform team as a Product team

The platform team is becoming more like a centralized product team that focuses on its internal customers, such as developers and testers. The goal is to improve ways of working and bring some order to the teams. Try to improve on the problems the developer and QA teams face. You are the enabler for other teams: instead of taking all the work into a central team, coach the dev teams to take up typical DevOps responsibilities. That way you can scale without burning yourself out.

Security

In many small organisations, security was a second-class citizen and product features were given more priority. But due to increasingly sophisticated attacks and strict compliance requirements, companies are adopting a shift-left security strategy. End-to-end encryption, strong RBAC, IAM policies, governance and auditing, and the implementation of benchmarks and standards such as NIST, CIS, and ISO 27001 are common. Container security, policy as code, cloud governance, and supply chain security are hot topics.

Programming

The DevOps or SRE role now takes on the cross-cutting concerns of developers and creates tooling that improves their productivity while enforcing standards. Good software engineering practices and skills are required to craft high-quality platform components.

I can't stress this enough. Good organizations look for solid programming experience in platform engineers. It is important in site reliability engineering as well, where you need to be fluent in programming and able to read, understand, and debug code written by others and, if necessary, fix it.

Python and Golang are the most popular languages. My suggestion is Golang: it has strong concurrency support, strict type checking, mature toolchains, and wide adoption across organizations. As many major projects are built using Golang, it makes sense to learn it over Python.

A few simple things you can try:

Write a CLI in your programming language.
Write a REST API that interacts with a database.
Experiment with parallelism and concurrency.
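For the concurrency item, a small worker pool is a good first exercise. The sketch below runs a function over a list of inputs with a bounded number of goroutines. RunConcurrently and the host names are made up for illustration; a real tool would do network calls or other I/O inside the function and handle errors and timeouts.

```go
package main

import (
	"fmt"
	"sync"
)

// RunConcurrently applies fn to every input using at most `workers`
// goroutines and returns the results in input order.
func RunConcurrently(inputs []string, workers int, fn func(string) int) []int {
	if workers < 1 {
		workers = 1 // avoid blocking forever when no worker reads the channel
	}
	results := make([]int, len(inputs))
	jobs := make(chan int) // carries indexes into inputs

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is written by exactly one goroutine,
				// so no lock is needed around results.
				results[i] = fn(inputs[i])
			}
		}()
	}

	for i := range inputs {
		jobs <- i
	}
	close(jobs) // lets the workers exit their range loops
	wg.Wait()
	return results
}

func main() {
	hosts := []string{"api", "db", "cache"}
	lengths := RunConcurrently(hosts, 2, func(h string) int { return len(h) })
	fmt.Println(lengths) // prints [3 2 5]
}
```

Swapping the function for an HTTP health check and reading the host list from command line arguments turns this into the CLI exercise as well.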

Infrastructure as Code

Terraform is the standard in most projects. Once you understand the concept, it is easy to adapt to other tooling, as most of these tools are based on a DSL.

Cloud

Most clouds work in the same way, so if you know one cloud well, you can easily work with other providers. Focus on how to design applications using cloud-native components in a highly available, resilient, secure, and cost-effective way.

Technical Writing

You might be wondering why I am talking about technical writing when discussing DevOps. A lot of folks don't give enough attention to this, but how you communicate and work with other teams is super important. The future of work is remote, and email and chat tools such as Slack and Teams are the primary channels for talking and conveying ideas to others.

On a regular basis, you might be creating documents such as runbooks, postmortems, RFCs, architectural decision records, and software design docs, to name a few. A clear, easy-to-understand document does wonders. It saves both your time and the reader's, and improves overall productivity. I suggest reading this article.

Site Reliability Engineering

The boundary between DevOps and SRE is getting thin. In some organisations, the same person might be performing both roles. Understand the concepts behind SLIs, SLOs, and error budgets, along with SRE practices. Each organization does it differently, so I wouldn't suggest copy-pasting someone else's culture into your team. Refer to the Google SRE culture.
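The relationship between SLI, SLO, and error budget is simple arithmetic, and working through it once helps. The sketch below assumes an availability SLO expressed as a fraction (0.999 for "three nines") and an SLI measured as good requests over total requests. The function names and numbers are made up for illustration.

```go
package main

import "fmt"

// ErrorBudgetMinutes returns how many minutes of downtime an availability
// SLO allows over a window of the given number of days.
func ErrorBudgetMinutes(slo float64, windowDays int) float64 {
	return (1 - slo) * float64(windowDays) * 24 * 60
}

// BudgetRemaining returns the fraction of the error budget still unspent,
// given the good and total request counts behind the SLI.
// A negative result means the budget is overspent.
func BudgetRemaining(slo float64, good, total int64) float64 {
	if total == 0 {
		return 1 // no traffic, nothing spent
	}
	allowedBad := (1 - slo) * float64(total)
	if allowedBad == 0 {
		return 0 // a 100% SLO leaves no budget at all
	}
	actualBad := float64(total - good)
	return 1 - actualBad/allowedBad
}

func main() {
	// A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
	fmt.Printf("budget: %.1f minutes\n", ErrorBudgetMinutes(0.999, 30))

	// 500 failed requests out of 1,000,000 spends about half of the budget.
	fmt.Printf("remaining: %.0f%%\n", 100*BudgetRemaining(0.999, 999_500, 1_000_000))
}
```

When the remaining budget approaches zero, the usual practice is to slow down releases and spend the time on reliability work instead.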

Conclusion

Personally, I am excited about the following this year. This is not a definitive list, as it keeps changing with time.

Service Mesh - Istio, Cilium Sidecarless mesh, Tetrate and Solo's Gloo mesh offering.
How to improve Developer Productivity? It is a mix of culture, automation and tools.
SRE Platforms - Honeycomb, Last9.
DevPortals - again linked with the motive of improving productivity and bridging knowledge gaps.
Observability - technologies such as OpenTelemetry, hypertrace, Thanos, VictoriaMetrics, Vector.
Security - supply chain security, code signing, tightening cloud security.
Golang - improving the current skills.
Serverless computing and event-driven architectures
Web3 - understanding the landscape related to DevOps and Infrastructure

Be curious and keep learning. Continuous bite-size learning is easy to do alongside your full-time job. If you still have any questions, feel free to ping me on Twitter.