By adding new complexity to reliability engineering and making physical war rooms a thing of the past, COVID-19 has imposed permanent changes on incident management. Here’s how SREs can respond.


The COVID-19 pandemic has forever changed many things -- from the way we shop to the way we work.

If you’re an SRE, you can add incident management to that list. The pandemic has created challenges for incident management teams that are very likely to persist even after COVID recedes into memory. To adapt, SREs must adjust their incident management strategies in several key ways.

Let’s take a look at how COVID-19 has changed incident management and what it means for SREs going forward.

COVID-19’s impact on incident management

The pandemic has affected incident management in a number of ways.

Even more cloud adoption

Businesses were running a majority of their workloads in the cloud before COVID appeared on the horizon. Yet the pandemic pushed organizations to make even greater use of the cloud. As Gartner VP Sid Nag puts it, “The pandemic validated cloud’s value proposition,” with the result that businesses perceive cloud adoption “to be the ‘new normal,’ now more than ever.”

What this means for SREs is that even more services have moved from on-prem environments to distributed, scale-out, service-based architectures hosted in the cloud. In turn, the complexity of incidents has increased, making incident management and remediation that much more difficult. SREs no longer have physical access to the servers that host their companies’ workloads, and they must contend with a variety of additional layers of infrastructure and software when researching the root cause of each incident.

Higher customer expectations

At the same time that environments have grown more challenging to manage, customers have raised the bar even higher for the level of service they expect. “Change is everywhere and reliability is a must,” writes Michael Ringman, CIO at TELUS International, a customer experience firm.

This change holds true across all sectors, not just IT. But for SREs in particular, it means that the ability to detect, analyze and remediate disruptions has grown more important than ever. If in the pre-pandemic world we measured success in terms of the “three second rule” (which states that most customers will abandon a site that fails to load in three seconds), the post-COVID world is shaping up to be one where customers won’t tolerate service disruptions of any kind.

Increased load

Not only do users expect more, but they are also using IT services more heavily. According to Atlassian, 73 percent of organizations have reported an increase in usage of their services since the start of the pandemic.

It’s easy to understand why: As workforces have become more distributed, IT infrastructure and services have become more critical for binding them together.

For SREs, increased demand complicates even further the challenge of maintaining high levels of performance within increasingly complex, cloud-based environments. There are more users to track, more servers to provision, more compute and memory exhaustion alarms to handle and so on.

No more physical war room

Before COVID, SREs could count on being able to sit in the same physical room while they worked through major incidents. When all stakeholders sat at the same table, real-time communication was easy.

Today, the physical war room has become a thing of the past for many SREs. The pandemic not only temporarily disrupted the ability of engineers to meet on site, but it also pushed many companies to embrace the concept of remote workforces on a permanent basis. When the physical office disappears entirely, the physical war room goes away with it.

Optimizing incident management in a post-pandemic world

Managing incidents effectively in the face of these changes requires SREs and other stakeholders to rethink key components of their incident management strategy.

First and foremost, the ability to conduct incident response using only virtual tools has become essential. When SREs can no longer sit together in the same room and work through a disruption, they need incident management tools that let them construct a virtual war room where all engineers have access to all of the data and communication tools they need.

We’re talking here, by the way, about more than just creating a Slack channel and calling it a virtual war room. SREs need communication tools that integrate seamlessly with their full alerting and incident management stack. They need to be able to track alerts, assess their severity, collaborate, delegate responsibility and (last but not least) perform retrospective analysis to prevent issues from recurring, all without being physically present in the same building.

At the same time, SREs today require more insight than ever into incidents. In a world where the complexity of software environments has reached unprecedented levels and where SREs can no longer see and touch the infrastructure hosting their workloads, data-based analysis of incidents is critical. SREs need to understand how many customers each incident affects, how long it has been occurring and how difficult it will be to resolve.
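To make that kind of analysis concrete, here is a minimal Python sketch of how a team might compute two of those figures, customer impact and duration, from basic incident data. The Incident class, its field names and the summarize helper are all hypothetical, invented for this illustration rather than taken from any particular tool, and the sketch does not attempt to estimate how difficult an incident will be to resolve.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record: every field name here is an assumption
    # made for illustration, not part of any real tool's schema.
    title: str
    started_at: datetime
    affected_customers: int
    total_customers: int
    resolved_at: Optional[datetime] = None


def duration(incident: Incident, now: datetime) -> timedelta:
    """How long the incident lasted, or has lasted so far if still open."""
    end = incident.resolved_at or now
    return end - incident.started_at


def impact_ratio(incident: Incident) -> float:
    """Fraction of the customer base affected, from 0.0 to 1.0."""
    if incident.total_customers == 0:
        return 0.0
    return incident.affected_customers / incident.total_customers


def summarize(incident: Incident, now: datetime) -> str:
    """One-line status that could be posted to a virtual war room."""
    minutes = int(duration(incident, now).total_seconds() // 60)
    state = "resolved after" if incident.resolved_at else "ongoing for"
    return (
        f"{incident.title}: {impact_ratio(incident):.1%} of customers "
        f"affected, {state} {minutes} min"
    )
```

In practice the inputs would come from monitoring and alerting data rather than hand-entered numbers, but even a simple summary like this gives every responder the same view of scope and duration.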

Finally, retrospective analysis has grown in importance. Whereas SRE teams in the past typically treated retrospectives as a nice-to-have, or reserved them only for the most serious incidents, performing systematic retrospective evaluation of every incident is now vital for meeting the steep expectations that customers hold. SREs can’t afford to let even relatively minor incidents recur, which means they need full context on why each disruption happened and what they can do to make sure it never reappears.
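One simple way to act on this is to check whether the same cause keeps showing up across retrospectives. The Python sketch below assumes each retrospective records a root-cause label as free text. The recurring_causes function and the (incident ID, cause) tuple format are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def recurring_causes(
    incidents: Iterable[Tuple[str, str]], min_count: int = 2
) -> Dict[str, int]:
    """Return root causes seen in at least `min_count` incidents.

    `incidents` is an iterable of (incident_id, root_cause) pairs.
    This shape is an assumption for the sketch, not a real tool's format.
    """
    counts = Counter(cause for _, cause in incidents)
    return {cause: n for cause, n in counts.items() if n >= min_count}
```

A cause that appears more than once suggests that an earlier fix did not hold, which makes it a natural candidate for deeper follow-up.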

Conclusion: Being an SRE is harder than ever - but new tools can help

In short, the pandemic has ushered in a series of changes that make the job of the typical SRE more challenging than ever.

The good news is that tools like Rootly can help. By enabling SRE teams to construct a full-featured virtual war room for assessing alerts, collaborating on response and performing retrospective analysis, Rootly fills the gap between alerting and response, delivering the automation that SREs need to keep customers happy in an increasingly complex and increasingly virtualized world.