Journey of Streamlining Oncall and Incident Management


Falit Jain

Posted on July 12, 2024


For many engineering and operations teams, being available when needed is essential to maintaining the dependability and availability of their services, and a key part of that responsibility is helping to meet various SLAs. This article discusses the key principles of on-call work, along with real-world examples from industry on how to plan and carry out these duties for a worldwide team of site reliability engineers (SREs).

An overview of the main ideas

Having someone on call means they can respond to production incidents promptly and prevent breaches of service level agreements (SLAs) that could have a major negative impact on the organisation. An SRE team usually dedicates at least 25% of its time to being on call; for instance, they might be on call for one week out of every month.

Historically, SRE teams have approached on-call work as more than merely an engineering challenge. Some of the most difficult aspects are managing workloads, scheduling, and keeping up with technological change. Any organisation must also instil a culture of site reliability engineering.

The following are the essential components that SRE teams must take into account in order to successfully handle on-call shifts.

  • On-call scheduling: creating on-call plans with the right work-life balance
  • Shift composition: the typical tasks that on-call engineers should complete
  • Handoff: issues should be summarised and handed to the person taking the following shift
  • Post-mortem meetings: a weekly conversation about incidents affecting platform stability
  • Escalation plans: an effective escalation flow with turnaround deadlines
  • Improving paging: creating effective pager policies
  • Runbook upkeep: a synopsis that serves as a "Swiss army knife" for on-call SREs
  • Change management: coordinating the rollout of platform modifications
  • Training and documentation: establishing documentation for training both new and existing SREs and onboarding additional team members

On-call scheduling and Slack

Traditionally, the purpose of SRE teams has been to maintain complex distributed software systems, which may be deployed across a number of data centres worldwide. Teams can use tools like Pagerly to create on-call schedules directly in Slack.



Handover for on-call

A "handover" procedure is required at the conclusion of every shift, during which the team taking over on-call responsibilities is informed about on-call matters as well as other pertinent matters. For a total of five consecutive working days, this cycle is repeated. If traffic is less than it is during the week, SRE on-call staff could work one shift per weekend. An extra day off the following week or, if they would choose, cash payment should be given to this person as compensation.

While most SRE teams have the structure described above, some are set up differently, with very small satellite teams supplementing a disproportionately large centralised office. If on-call duties are distributed among several regions in that case, the small satellite teams can feel overburdened and excluded, which eventually leads to demoralisation and high turnover. The cost of handling on-call concerns outside regular business hours is then seen as justified by having complete ownership of responsibilities in a single place.

Arranging a rotation with a single-region team
If the company does not have a multi-region team, the on-call schedule can be created by dividing the year into quarters, such as January to March, April to June, July to September, and October to December. Each quarter is allocated to one group within the team, and the night-shift effort is rotated every three months.

A timetable like this supports the human sleep cycle: a well-structured team that rotates every three months is far less strenuous on people's schedules than one that rotates every few days, since it is not good to be on call a few days per week.
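For illustration, such a quarterly night-shift rotation can be written down as a simple schedule configuration. The sketch below is hypothetical; the group names are made up and the format is not the syntax of any particular scheduling tool:

```yaml
# Hypothetical quarterly rotation: each group owns the night shift for one quarter
rotation:
  timezone: UTC
  shifts:
    - quarter: Q1            # January to March
      night_shift: group-a
    - quarter: Q2            # April to June
      night_shift: group-b
    - quarter: Q3            # July to September
      night_shift: group-c
    - quarter: Q4            # October to December
      night_shift: group-d
```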



Management of PTO and vacation time

Managing the personal time off (PTO) plan is essential, since the availability and dependability of the platform as a whole depends on the SRE role. On-call support needs to take precedence over development work, and the team should be large enough for those who are not on call to cover for absentees.


There are local holidays specific to each geographic region, such as Thanksgiving in the USA and Diwali in India, and globally distributed SRE teams should be allowed to swap shifts at these times. Holidays common around the world, such as New Year's Day, should be handled like weekends, with minimal staffing and a relaxed pager response time.

Here is a blog about on-call compensation and ways to set up rotations.

Shift composition

Every shift begins with a briefing for the on-call engineer about the major events, observations, and any outstanding problems that need to be fixed, as part of the handover from the previous shift. The SRE then opens the command-line terminal, monitoring consoles, dashboards, and ticket queue in preparation for the on-call session.

Alerts with Slack and Teams


Based on metrics, the SRE and development teams identify SLIs and create alerts. Event-based monitoring systems can also be configured to send alerts on events in addition to metrics. Consider the following scenario: the engineering team and the SREs (during their off-call development time) decide to use the metric cassandra_threadpools_activetasks as an SLI to track the performance of the production Cassandra cluster. In this case, the SRE can configure the alert in the Prometheus alerting rules YAML file. An annotation on the alert, as in the example below, can be used to publish the alert and interface with modern incident response management systems.
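A minimal sketch of such a rule is shown below. The threshold, labels, and annotation keys are assumptions for illustration; the metric name comes from the scenario above, and the annotation is simply whatever key your incident response platform is configured to read:

```yaml
groups:
  - name: cassandra-slis
    rules:
      - alert: CassandraActiveTasksHigh
        # Threshold of 100 active tasks is an illustrative assumption
        expr: cassandra_threadpools_activetasks > 100
        for: 10m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Cassandra active task count is {{ $value }} on {{ $labels.instance }}"
          # Hypothetical annotation consumed by the incident response platform
          incident_service: "cassandra-production"
          runbook_url: "https://wiki.example.com/runbooks/cassandra-active-tasks"
```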

When the alert condition is met, the Prometheus Alertmanager forwards the alert to the incident response management platform. The on-call SRE then needs to examine the active task count on the dashboard closely and determine the cause of the excessive task count by digging into the metrics. Once the cause has been identified, corrective action must be taken.
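The forwarding step is usually a small piece of Alertmanager routing configuration. The sketch below assumes a generic webhook receiver; the URL is a placeholder for whichever incident response platform the team uses:

```yaml
route:
  receiver: incident-platform
  group_by: ['alertname', 'instance']

receivers:
  - name: incident-platform
    webhook_configs:
      # Placeholder endpoint; point this at your incident response platform's alert intake API
      - url: "https://incidents.example.com/api/v1/alerts"
        send_resolved: true
```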

Globally, it is recommended to integrate these alerting systems with a ticketing system, such as Atlassian's Jira. Every alert ought to generate a ticket automatically, and the SRE and any other stakeholders who responded to the alert must update the ticket once it has been handled.

Troubleshooting

At the start of the on-call shift, the SRE should have a terminal console available to use SSH or any other CLI tools provided by the organisation. The engineering or customer service staff may contact the on-call SRE for assistance with a technical problem. Assume, for instance, that a user's action on the platform, such as hitting the cart's checkout button, produces a distinct request ID. This request interacts with the distributed system's various components, such as the database service, load-balancing service, compute service, and others. The SRE may be given a request ID and be expected to report on its life cycle, for example which machines and components logged the request.

If the problem isn't immediately evident, for example when the request ID above wasn't logged by any service, the SRE might need to look into network problems. The SRE may use tcpdump or Wireshark, two open-source packet analysis tools, to capture packets and rule out network-related problems. Because this is laborious, the SRE may enlist the network team to help examine the captures.

When troubleshooting such challenges, the cumulative knowledge gathered from the varied backgrounds of the SREs on a team is invaluable. All of these actions should be recorded and later used to train new SREs as part of the onboarding process.

Deployments

During business hours, the on-call SRE should be in charge of the deployment procedure. If something goes wrong, the SRE needs to be able to resolve the problems and roll the changes back or forward. The SRE should only perform emergency deployments that affect production, and should have enough knowledge to help the development team avoid any negative effects on production. Deployment procedures should be thoroughly documented and closely integrated with the change management process.

Ticket administration (Jira, Slack, JSM, PagerDuty)


SREs keep a close eye on the tickets routed through their queues that are awaiting action. These tickets, generated by the alerting software or escalated by other teams, have a lower priority than those created by ongoing production issues.

It is standard procedure for every ticket to carry a statement of the SRE actions that have been completed. Tickets that cannot be acted on have to be escalated to the appropriate teams. When one SRE team hands over to another, the queue should ideally be empty.

Pagerly can help follow up on tickets in tools like Jira, PagerDuty, and Opsgenie.

Handling escalations

On-call SREs are the most affected by these escalations. If the monitoring and alerting software fails to notify them, or if an issue goes undiscovered until the customer reports it, the on-call SRE must use all of their skills to resolve the issue swiftly. The SRE should also enlist the relevant development teams to help.

After the problem is fixed, a ticket with comprehensive incident documentation, a summary of all the actions performed to fix the problem, and an explanation of what could be automated and alerted on should be shared with all relevant stakeholders. Work on this ticket should take priority over all other development work.

Handover protocols

During every transition of SRE responsibilities, certain conventions must be observed to enable a seamless handover of on-call duties to the next person. One such convention is giving the next on-call engineer a summary packet. Every ticket handled during the shift has to be updated and documented with the steps taken to resolve it, any additional comments or questions from the SRE staff, and any clarifications from the development team. These tickets should be categorised according to how they affect the reliability and availability of the platform, and tickets that affect production should be flagged, particularly for the post-mortem.

Once compiled and classified by the on-call SRE, this list of tickets should be posted to a shared communication channel for handover. The team may choose a dedicated Slack channel, Microsoft Teams, or the collaborative features found in incident response platforms like Pagerly for this. The entire SRE organisation should be able to access this summary.
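As an illustration, the summary packet can be as simple as a small structured note posted to the handover channel. The fields, team names, and ticket IDs below are an assumed template, not a required format:

```yaml
# Hypothetical handover summary for one shift
handover:
  from: sre-oncall-apac
  to: sre-oncall-emea
  open_tickets:
    - id: OPS-1234                 # placeholder ticket ID
      impact: production           # production / non-production
      status: "mitigated, root cause still unknown"
      next_steps: "Confirm fix with the database team"
  resolved_tickets:
    - id: OPS-1230
      impact: non-production
      notes: "False positive; alert threshold to be reviewed in the post-mortem"
```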

Post-mortem meeting

Every week, all of the engineering leads and SREs attend this meeting to review the incidents that affected platform stability.

The most crucial outcome of the post-mortem is making sure the problems highlighted do not happen again. This is not easy: the corrective actions could be as simple as writing a script or adding more code checks, or as involved as redesigning the entire application. SRE and development teams should collaborate closely to get as near as possible to the goal of preventing recurring issues.

Create an escalation strategy

Every time a problem arises, a ticket with the necessary action items, documentation, and feature requests needs to be filed. It is frequently unclear at first which team should handle a specific ticket; Pagerly's routing and tagging features make it possible to automate this step. A ticket may begin as a customer service ticket and, depending on the type and severity of the event, be escalated to the engineering or SRE teams. After the relevant team has addressed it, the ticket returns to customer service; the resolution may involve communicating with the customer or simply documenting and closing the issue. This ticket will also be reviewed in the post-mortem for additional analysis and will be part of the handoff procedure. To lessen alert fatigue, it is good practice to classify the issue accurately and assign it to the most appropriate level of expertise.
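A concrete escalation path often ends up encoded as a policy along the following lines. This is a hedged sketch with made-up team names and timings, not the configuration format of any specific product:

```yaml
# Illustrative escalation policy; names and timeouts are assumptions
escalation_policy:
  name: checkout-incidents
  steps:
    - level: 1
      notify: customer-service-oncall
      escalate_after: 15m          # move on if not acknowledged in time
    - level: 2
      notify: sre-oncall
      escalate_after: 30m
    - level: 3
      notify: engineering-lead
```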

Improving paging

Members of on-call engineering teams get paged often, so pages must be carefully targeted at the right teams to keep their frequency down. When a customer-facing problem warrants a page, the customer service team is typically the first point of contact. From there, the investigation and escalation procedures covered in the previous section take over.

Every alert needs a written plan and resolution procedure. When a client reports a problem, for instance, the SRE should check whether any alerts are currently firing and determine whether they are connected in any way to the customer's problem.

The aim should be to restrict pages to the SRE to genuinely actionable matters. When it is impossible to decide which team to page, a central channel of communication must be established so that SRE, engineering, customer service, and other team members can join, discuss, and address the problem.

Runbook upkeep

A runbook is a summary of the precise actions, including commands, that an SRE has to take to address a specific incident, as opposed to a cheat sheet that lists every command for a given platform (such as Kubernetes or Elasticsearch). When solving problems, time is of the essence: if the engineer on call is well versed in the subject and has a list of instructions and action items at hand, they can implement solutions much more rapidly.

As a team, you may administer a single runbook centrally, or each person may prefer to maintain their own. Here is an example runbook entry, continuing the Cassandra scenario from earlier.
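The structure, thresholds, and escalation target below are assumptions for illustration; the nodetool commands are standard Cassandra tooling:

```yaml
# Illustrative runbook entry; adapt commands and targets to your environment
runbook:
  alert: CassandraActiveTasksHigh
  summary: Active task count on the production Cassandra cluster is abnormally high
  steps:
    - check: "Identify the affected nodes on the Cassandra dashboard"
    - command: "nodetool tpstats"            # inspect thread pool statistics on an affected node
    - command: "nodetool compactionstats"    # see whether long-running compactions are queueing work
    - check: "If a single node is overloaded, drain traffic from it and escalate"
  escalate_to: database-team-oncall          # hypothetical escalation target
```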

One could argue that many of the tasks in a runbook could be automated, carried out with Jenkins, or built as separate continuous integration jobs. Although that is the ideal, not all SRE teams are experienced enough to carry it out flawlessly. A good runbook can also serve as a helpful guide for automating tasks. An engineer's need for the command line is constant, so anything that must be regularly typed into a terminal is a perfect candidate for a runbook.

Pagerly has workflows that help annotate each alert and keep the runbook updated.



Concluding remarks

Creating an on-call rotation plan that works and is sustainable requires careful thought. This includes planning shift changes and scheduling, creating an escalation mechanism, and holding post-mortem analysis meetings after every incident. Pagerly is designed to plan, carry out, monitor, and automate the duties tied to standard site reliability engineering responsibilities, such as on-call rotations.
