Hannah Culver
Posted on August 13, 2020
Originally published on Failure is Inevitable.
Implementing SRE practices and culture can be challenging. Fortunately, there are tools for each aspect of SRE: monitoring, SLOs and error budgeting, incident management, incident retrospectives, alerting, chaos engineering, and more. In this blog, we’ll talk about what to look for in SRE tools and how they’ll help you on your journey to reliability excellence.
Monitoring Tools
At the heart of all SRE decision-making is data. Without logging latency, availability, and other reliability metrics throughout your system, you’ll have no way of knowing where to invest your development efforts. A number of monitoring tools such as AppDynamics, Datadog, Grafana, and Prometheus are available to help collect this data and display it in efficient ways.
Monitoring can be broken down into four main categories:
- Resource monitoring: reports on how servers are running with metrics such as RAM usage, CPU load, and remaining disk space.
- Network monitoring: reports on incoming and outgoing traffic which can be broken down into the frequency and size of specific requests.
- Application performance monitoring: reports on the performance of services by sending internal requests to them and monitoring metrics such as response time, completeness of response, and data freshness.
- Third-party component monitoring: reports on the health and availability of third-party services integrated into your system.
To get a full picture of your service, you’ll want to incorporate elements of all four of these categories. Most monitoring tools will provide options for multiple categories. Look for ones that integrate well with your existing tool stack, as you’ll need the monitoring tool to be able to gather and interpret data directly from your existing sources. Try to find tools that can generate visualizations and reports that your team will find useful. For example, if you’re trying to see which services generate the most network traffic, look for a tool that can create pie charts of overall network usage.
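To make this concrete, here is a minimal sketch of application-level instrumentation using the open-source Python prometheus_client library. The service name, port, and failure rate are illustrative assumptions; your monitoring stack will have its own instrumentation API.

```python
# Minimal sketch: expose latency and error metrics for one service so a
# monitoring tool (Prometheus, in this example) can scrape and graph them.
# Metric and service names are illustrative, not a prescribed convention.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "checkout_request_latency_seconds",   # hypothetical service name
    "Latency of checkout requests in seconds",
)
REQUEST_ERRORS = Counter(
    "checkout_request_errors_total",
    "Total number of failed checkout requests",
)

def handle_request():
    """Stand-in request handler; replace with real application logic."""
    with REQUEST_LATENCY.time():            # records elapsed time on exit
        time.sleep(random.uniform(0.01, 0.2))
        if random.random() < 0.02:          # simulate an occasional failure
            REQUEST_ERRORS.inc()

if __name__ == "__main__":
    start_http_server(8000)   # metrics served at :8000/metrics for scraping
    while True:
        handle_request()
```

Once metrics like these are flowing, the visualization and reporting features of the tools above are what turn raw numbers into decisions.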
SLOs and Error Budgeting
Once monitoring is in place, there’s no better way to put that data to work than building SLOs and error budgets around it. By building SLOs on the service level indicators with the highest customer impact, you can safely empower development to accelerate.
An SLO tool should help with:
- Consolidating monitoring data into the service level indicators, combining several sources into a single measurement.
- Empowering you to set thresholds for this metric over time, such as a total amount of downtime per month.
- Dictating policies to be enacted when the metric exceeds these thresholds, integrating into alerting and collaboration tools.
The inverse of the SLO is the error budget: the amount of room left on the SLO before exceeding the threshold. Development teams can use this error budget to safely move forward on projects that could impact SLOs, confident that they won’t step over the line. As your SLO and error budget will be key decision-making tools in development decisions, find tools that can clearly display changes over time.
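The arithmetic behind an error budget is simple enough to sketch directly. Assuming a hypothetical 99.9% availability SLO over a 30-day window (all numbers below are illustrative), the budget works out to roughly 43 minutes of downtime:

```python
# Illustrative error-budget arithmetic for an availability SLO.
# The target, window, and downtime figures are example values.

SLO_TARGET = 0.999              # 99.9% availability
WINDOW_MINUTES = 30 * 24 * 60   # 30-day rolling window = 43,200 minutes

error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES   # ~43.2 minutes

# Suppose monitoring reports 12 minutes of downtime so far this window.
downtime_minutes = 12
remaining = error_budget_minutes - downtime_minutes
burned_pct = downtime_minutes / error_budget_minutes * 100

print(f"Error budget: {error_budget_minutes:.1f} min")
print(f"Remaining:    {remaining:.1f} min ({burned_pct:.0f}% burned)")
```

A tool that plots the remaining budget over the window makes it obvious when to slow feature work and invest in reliability instead.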
Alerting
When responding to incidents, the most valuable resource is your team. However, teams are also a depletable resource. Alerting engineers too frequently results in burnout and alert fatigue. Setting fair on-call schedules and properly assigning ownership of services can be complex, but alerting tools will help you stay organized and consistent. Top alerting tools include PagerDuty, Opsgenie, and VictorOps.
The most important quality of an alerting tool is its own reliability. This seems obvious but should not be overlooked. Ensure your alerting tool can reach your team on whatever platforms and devices they’re most accustomed to using. Likewise, your alerting tool should integrate with your monitoring services so that observations can automatically trigger alerts.
Scheduling on-call is another task that becomes complex as you account for service ownership and load balancing. Alerting tools can help by building calendars around user-defined roles and teams and logging responses to help qualitatively assess load. This ensures that you adopt a people-first on-call system.
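One common pattern for wiring monitoring data to alerts without burning out the on-call is multi-window burn-rate alerting: page only when the error budget is burning fast over both a short and a long window. The sketch below assumes a hypothetical `get_error_rate` hook into your monitoring backend and uses commonly cited threshold values; it is not any particular vendor’s API.

```python
# Sketch of a multi-window burn-rate check, a common SRE alerting pattern.
# get_error_rate() is a hypothetical hook into your monitoring backend;
# the 14.4x and 6x thresholds are commonly cited defaults, not requirements.

SLO_ERROR_BUDGET = 0.001   # a 99.9% SLO allows a 0.1% error rate

def get_error_rate(window_minutes: int) -> float:
    """Hypothetical: query monitoring for the error rate over a window."""
    raise NotImplementedError("wire this to your monitoring tool")

def should_page() -> bool:
    """Page a human only when both a fast and a slow window burn hot,
    which filters out short blips and reduces alert fatigue."""
    fast_burn = get_error_rate(5) / SLO_ERROR_BUDGET
    slow_burn = get_error_rate(60) / SLO_ERROR_BUDGET
    return fast_burn > 14.4 and slow_burn > 14.4

def should_ticket() -> bool:
    """Slower burns warrant a ticket for the next business day, not a page."""
    return get_error_rate(6 * 60) / SLO_ERROR_BUDGET > 6
```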
Incident Management
Failure is inevitable. There will always be unpredictable incidents that require novel responses. In an SRE mindset, incidents aren’t failures or setbacks, but unplanned investments in reliability.
Good incident response involves several components, each of which can be assisted by tools such as Blameless, PagerDuty, Opsgenie, and ServiceNow:
- Assessing and prioritizing through incident classification
- Prepared responses based on classification, including runbooks
- Alerting and escalation to get the correct people involved
- Communication and role-based coordination
- Logging and documenting the response in an incident retrospective
- Learning from the retrospective and integrating it into further development
To get the most out of your incidents, find tools that reduce cognitive load in each of these areas. The faster and easier it is for responding engineers to use your incident protocols, the more likely they are to use them. By automating procedures based on incident classification, you’ll be able to codify incident response procedures, getting services repaired faster.
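As a rough sketch of what classification-driven automation can look like, the idea is to let the declared severity determine who is paged, which runbook is surfaced, and where coordination happens. The severity levels, runbook URLs, and channel names below are hypothetical placeholders; in practice, an incident management tool performs these steps for you.

```python
# Sketch: map incident classification to a prepared response.
# Severity levels, runbook URLs, and channel names are hypothetical.
from dataclasses import dataclass

@dataclass
class ResponsePlan:
    page: list              # roles to page via the alerting tool
    runbook: str            # runbook surfaced to responders
    comms_channel: str      # where role-based coordination happens

PLAYBOOK = {
    "SEV1": ResponsePlan(["primary-oncall", "incident-commander"],
                         "https://runbooks.example.com/sev1", "#inc-major"),
    "SEV2": ResponsePlan(["primary-oncall"],
                         "https://runbooks.example.com/sev2", "#inc-minor"),
    "SEV3": ResponsePlan([],
                         "https://runbooks.example.com/sev3", "#inc-triage"),
}

def start_incident(severity: str, summary: str) -> ResponsePlan:
    plan = PLAYBOOK[severity]
    # A real tool would page via your alerting integration, open the channel,
    # and start logging a timeline for the retrospective at this point.
    print(f"[{severity}] {summary} -> page {plan.page}, runbook {plan.runbook}")
    return plan
```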
Incident Retrospectives
After an incident ends, the opportunity for learning has only begun. SRE tools that help you construct thorough and meaningful incident retrospectives will give you an excellent foundation for review and growth. Look for tools that automatically collect useful data, including relevant metrics, the resources utilized, and communication between team members. The end result should be a comprehensive, accessible, narrative document that reflects best practices. Each incident retrospective tells the story of the incident, making it a valuable resource for onboarding new SREs, creating game days to stress test the system and build resilience, and more.
Many teams collaborate on post-incident reviews through editors such as Google Docs and Confluence; a solution like Blameless can also centralize metadata from post-incident reviews for easy reporting on objects like tags, follow-up action items, and more. Once you have your document, SRE tools can help you integrate follow-up items into your normal development cycles. This helps teams ensure that incidents don’t repeat themselves and that priority issues are handled with as much attention as feature work. This also informs SLOs, as follow-up actions can include increasing monitoring in certain trouble areas in order to get early warnings of future issues before they become customer-facing.
The cultural lessons of incident retrospectives are just as important as the practices themselves. Tools alone cannot change culture, but the experience of using them and reviewing the data they provide can. Take care that your tools reflect the steps and thought processes that foster an empathetic culture in response to incidents.
Chaos Engineering
Chaos engineering is a discipline for testing resilience. Chaos engineering tools such as Gremlin and Chaos Monkey simulate outages, intense server loads, or other crises that could jeopardize reliability. These experiments take place in small replica environments with no consequence to the live build of the service. Response teams, however, react as if the incident were real, testing whether their procedures are effective. Monitoring of the simulated systems shows how the real systems would fare under similar conditions.
To be effective, a chaos engineering tool will need to affect systems as if it were a real external threat. This requires extensive integration of the tool into your entire system, as it will need to simulate loads and requests at the level of individual servers, service requests across the cloud, or any other point where an incident could occur. Make sure that the tool is compatible with your entire architecture. Another important tip is to have an incident coordination and management system in place before you begin injecting controlled chaos into your systems, to ensure a smooth process and maximize the value of your experiments.
Likewise, you’ll need to monitor the results of your experiments and learn from them. Your chaos engineering tool should give you meaningful results across experiments. Chaos engineering provides an opportunity for incident responders to build experience using and refining procedures. Be sure you can track how this expertise grows as well.
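To illustrate the underlying idea (not how Gremlin or Chaos Monkey are actually wired in), here is a minimal sketch that injects latency or failures into a wrapped dependency call with small probability. Real chaos tools inject faults at the infrastructure or network layer and come with safeguards for aborting experiments; the probabilities and delays here are assumptions for illustration only.

```python
# Minimal sketch of fault injection: with a small probability, add latency
# or fail a dependency call outright, then observe how the service copes.
# Probabilities and delays are illustrative; never run this against
# production without an experiment plan and a way to abort.
import random
import time

FAULT_PROBABILITY = 0.05    # 5% of calls are disturbed
INJECTED_LATENCY_S = 2.0    # simulated slow dependency

def call_with_chaos(dependency_call, *args, **kwargs):
    """Wrap an outbound call and occasionally make it slow or fail."""
    if random.random() < FAULT_PROBABILITY:
        if random.random() < 0.5:
            time.sleep(INJECTED_LATENCY_S)                 # latency injection
        else:
            raise ConnectionError("chaos: injected dependency failure")
    return dependency_call(*args, **kwargs)
```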
Choosing an SRE toolstack is an investment. Tools will have learning curves and challenges in implementation, but ultimately pay for themselves in saved time and toil. For more guidance on building your ultimate SRE solution, check out our Buyers’ Guide for Reliability here. And if you’d like to see how Blameless helps level up your SRE practices with SLOs, collaboration, incident retrospectives and more, join us for a demo!