How to Measure DevOps Metrics

DevOps is supposed to help streamline the process of taking code changes and getting them to production for users to enjoy. But what exactly does it mean for the process to be "streamlined"? One way to answer this is to start measuring metrics.

Why metrics are important to track

Metrics give us a way to confirm that quality holds steady over time, because we have numbers and key indicators to compare against. Without metrics, you have no way to measure improvements or regressions. You can only react to problems as they come up.

When you know the indicators that show what condition your system is in, you can catch issues faster than you could without a steady state to compare against. This also helps when you get ready for system upgrades, because you'll be able to give more accurate estimates of the resources your systems use.
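
As a rough illustration of comparing against a steady state, here is a minimal Python sketch. It assumes you already record one number per pipeline run (the deploy durations below are made-up sample values), and it uses a simple "more than three standard deviations from the mean" rule, which is only one of many ways to define normal. The function name is hypothetical.

```python
from statistics import mean, stdev


def is_outside_steady_state(history, latest, max_sigma=3.0):
    """Return True when `latest` is more than `max_sigma` standard
    deviations away from the mean of the recorded `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Every past sample was identical, so any change is a deviation.
        return latest != baseline
    return abs(latest - baseline) > max_sigma * spread


# Hypothetical deploy durations in seconds from earlier pipeline runs.
deploy_seconds = [312, 298, 305, 320, 301, 309]

print(is_outside_steady_state(deploy_seconds, 307))  # close to the baseline
print(is_outside_steady_state(deploy_seconds, 540))  # far from the baseline
```

A rule like this only becomes useful once you have enough history for the mean and spread to be meaningful, which is one reason to start recording early.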

After you've recorded some key metrics for a while, you'll start noticing places where you could improve your application or reallocate resources to where they are needed more. Knowing the normal operating state of your pipeline is crucial, though it takes time to set up a monitoring tool and build up that picture.

The main thing is to pick some metrics to watch so you get an idea of what's going on when you start the deploy process. In the beginning, it might seem hard to figure out which metrics are best for your pipeline.

Figuring out which metrics are important to you

You can conduct chaos engineering experiments to test different conditions and learn more about which metrics are the most important to your system. You can look at things like the time from build to deploy, the number of bugs caught in different phases of the pipeline, and build size.
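
To make those three metrics concrete, here is a small Python sketch. The record layout, field names, phase names, and values are all assumptions for illustration; your CI/CD tool will expose this data in its own format.

```python
from collections import Counter
from datetime import datetime


def build_to_deploy_seconds(build_finished, deploy_finished):
    """Seconds elapsed between two ISO 8601 timestamps."""
    start = datetime.fromisoformat(build_finished)
    end = datetime.fromisoformat(deploy_finished)
    return (end - start).total_seconds()


def bugs_per_phase(bug_reports):
    """Count how many bugs were caught in each pipeline phase."""
    return Counter(report["phase"] for report in bug_reports)


# Hypothetical record for a single pipeline run.
run = {
    "build_finished": "2021-03-01T10:00:00",
    "deploy_finished": "2021-03-01T10:12:30",
    "build_size_bytes": 48_500_000,
}

# Hypothetical bug reports tagged with the phase that caught them.
bugs = [
    {"id": 1, "phase": "unit-test"},
    {"id": 2, "phase": "integration-test"},
    {"id": 3, "phase": "unit-test"},
]

print(build_to_deploy_seconds(run["build_finished"], run["deploy_finished"]))
print(dict(bugs_per_phase(bugs)))
print(run["build_size_bytes"] / 1_000_000, "MB")
```

Recording these values for every run is what gives you a history to compare later runs against.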

Deciding what to measure can be one of the harder parts of making your metrics effective. When you're considering metrics, look at what the most important results of your pipeline are.

Do you need your app to get through the process as quickly as possible, regardless of errors? Can you figure out why that sporadic issue keeps stopping the deploy process? What's blocking you from getting your changes to production with confidence?

That's how you're going to find those key metrics quickly. Running experiments and looking at common deploy problems will show you what's important early on. This is one of the ways you can make sure that your metrics are relevant.

Monitoring tools

Here are some monitoring tools that you can use to get insight into your pipeline setup.

Splunk
ELK
Prometheus

There are a number of chaos engineering tools that let you run experiments based on the insight you get from monitoring or just hypotheses you have.

Chaos Toolkit
Chaos Monkey
Gremlin Free
kube-monkey

The point of all of these tools is to test out hypotheses you have about the way your system works. You can check for different points of failure that might be of concern. These tools give you a way to run and learn from your experiments to start understanding what your key DevOps metrics are.

What this looks like in practice

As an example, we'll be using Chaos Toolkit to set up an experiment. To start, make a new file and name it experiment.json. This will hold the values we need to run the experiment.

We'll define the title and description of the experiment we want to run. Then we define what steady-state looks like for the service. This gives the experiment something to compare the results against.

{
    "title": "Does our service tolerate the loss of its data file?",
    "description": "Our service reads data from a specific file, can it handle what happens if that file disappears?",
    "tags": [
        "tutorial",
        "filesystem"
    ],
    "steady-state-hypothesis": {
        "title": "The data file must exist",
        "probes": [
            {
                "type": "probe",
                "name": "service-is-unavailable",
                "tolerance": [200, 503],
                "provider": {
                    "type": "http",
                    "url": "http://localhost:3000/"
                }
            }
        ]
    },
    "method": [
        {
            "name": "move-data-file",
            "type": "action",
            "provider": {
                "type": "python",
                "module": "os",
                "func": "rename",
                "arguments": {
                    "src": "./data.dat",
                    "dst": "./data.dat.old"
                }
            }
        }
    ]
}
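
To show what the tolerance in the steady-state hypothesis expresses, here is a simplified Python sketch. It is not how Chaos Toolkit works internally; it only reads a trimmed copy of the experiment above and treats a list-valued tolerance as "the HTTP status must be one of these values". The helper names are hypothetical.

```python
import json


def probe_tolerances(experiment):
    """Map each steady-state probe name to its list of accepted values."""
    probes = experiment["steady-state-hypothesis"]["probes"]
    return {probe["name"]: probe["tolerance"] for probe in probes}


def within_tolerance(status_code, tolerance):
    """Simplified check: the status code must appear in the tolerance list."""
    return status_code in tolerance


# Trimmed copy of the steady-state section from experiment.json.
experiment = json.loads("""
{
    "steady-state-hypothesis": {
        "title": "The data file must exist",
        "probes": [
            {
                "type": "probe",
                "name": "service-is-unavailable",
                "tolerance": [200, 503],
                "provider": {"type": "http", "url": "http://localhost:3000/"}
            }
        ]
    }
}
""")

tolerances = probe_tolerances(experiment)
accepted = tolerances["service-is-unavailable"]

print(within_tolerance(200, accepted))  # healthy response
print(within_tolerance(503, accepted))  # unavailable, but still tolerated
print(within_tolerance(500, accepted))  # unexpected failure
```

With a tolerance of 200 or 503, the experiment accepts the service either responding normally or reporting that it is unavailable, but not failing in some other way. The real experiment is run from the command line with chaos run experiment.json, as the tutorial describes.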

To learn more about the details of using Chaos Toolkit, I'd suggest you start with their tutorial. It's a great way to find out, and prove, that the key metrics you're thinking of truly have an impact on the system.

Other considerations

When you're thinking about where in the pipeline would be best to record your metrics, don’t be afraid to experiment. For example, you might see how different tools affect your overall build-to-deploy time.

Conclusion

When you watch a few key points in your pipeline, it'll help you make your deploys even better. You'll know something is wrong when the state of your pipeline doesn't match its normal steady state.

Make sure you follow me on Twitter because I post about stuff like this and other tech topics all the time!

If you’re wondering which tool you should check out first, try Conducto for your CI/CD pipeline. It’s pretty easy to get up and running and it’s even easier to debug while it’s running live.
