"The process we've created using Rootly has made way for a shift in our culture. We now harvest data for metrics like MTTR. We actively drive them down, which our customers appreciate, but the culture is now actively making room for prevention—and that's the real benefit of this tooling."
Colin FitzGerald
Problem & Incident Manager
Incident management started as a side hustle for a couple of the engineering managers here, and for a long time that was okay. We were a much smaller company. Our operations weren't terribly complex. We had a couple of in-house bots and really basic tooling that operated our incident system. But at some point we scaled out of that being the right solution for us and we needed something really dedicated to this. So I transitioned into the Incident Manager role and I started looking at tooling options. We needed something to replace what we were doing. We needed something to handle scale with a lot more configurability.
We had a lot of challenges: we lacked core data, things like mean time to anything, and we lacked consistency in the way we approached incidents. Because it's not the same people fighting the same fires over and over (at least, you hope), it's different people doing different things, it's very difficult to compare incidents or gather sweeping data points or make systemic change when the way you approach an incident isn't consistent. We didn't have any high-level visibility into incident trends. The tooling we had, like I said, was very minimal and we were quickly outpacing it. And so we started to ask questions like, how long does it take us to resolve an incident or mitigate for customers? How severe is it? And we couldn't get anywhere close to answering those questions. That's how we knew we needed a tool to collect this data in one place.
We evaluated a handful of tools, but the decision came down to a few things. One, did it have the configurability to make the transition from the old world to the new world relatively seamless? The idea is that I could configure a tool behind the scenes so that nobody even knew that it changed, and then we could handle the training later. We had this burning fire of "we need something in place right now". And so that was a huge component. I also needed a product team that was a true partner, who really knew their product front to back, and Rootly absolutely had that. It was fantastic. Then, customizability was key for us. Obviously we needed to customize into the way we were working at that point, but also we needed to grow far beyond that. We needed to evolve our systems and stuff. And so customizability and future growth were also major components for us.
Throughout the whole process, your support engineering team was fantastic. The fact that we were moving from disparate home-built tools definitely created extra overhead for your team when it came to implementation and bringing our existing processes into Rootly. It took about two weeks or so to really get all of the components put together and I think the full cutover happened about three weeks after the contract was signed. And I worked with them every single day. They were just a remarkable group of people who knew the product, knew what its limitations were, and then knew ways around a lot of the things that I ran into. The actual implementation process, while very laborious and tedious because of the previous setup we had, was pretty smooth. As smooth as it could be, that's for sure. Way better than I expected.
Every way. It's improved everything. Start with the basics: we have data, we have consistent workflows, we have visibility and documentation. A lot of the operational overhead has been reduced. There is of course some operational overhead. There's no way around that, but the automations and systems that work behind the scenes allow us to expand the complexity of our system pretty drastically without relying on people to make change. I can make all those workflows happen myself. I can automate their triggers, I can get everything to happen in the order I need, that kind of stuff. And nobody is really the wiser. So I can then take the information we harvest, I can take the refined documentation, I can fold the feedback from people in, just implement that change, and people can continue going about their business without having to relearn an entire application or an entirely new workflow or anything like that.
When we compare the state of things now to when we first implemented Rootly, things are drastically different. Not only do we have our mean time to everything, but we can do incident counts, we can do severity counts, and we can compare different teams, departments, and pieces of the organization by how often and how badly things go down. This allows me to be strategic in where we invest our time from a reliability perspective. But we've also increased ownership and that's huge. Teams now have the visibility to get this data themselves and they can then take ownership of reducing incidents or reducing a mean time to resolve or mitigate for their own section of the customer base or their own section of our product offering.
And now, as we grow with Rootly, I come up with more challenging questions for the team, and every time they answer them. And so anytime I'm like "You know what? I've put down the tooling for a while. I'm back. I have more ideas. How do I do this crazy thing?" Your team always takes the time to really understand what I'm talking about. And then once the team gets it, they take it and run. Sometimes it's the same day, sometimes it's a day later or two days later, they'll come back and say "Look, we figured it out! Here is the thing you need." or "Here's how to work around this system or to get the specific type of data out of the system that you need." That alone has been marvellous because I can come up with a crazy off-the-cuff idea and then the Rootly team helps me implement it. I do that to them more often than I'd like to admit.
Ultimately, Rootly's impact on our incident response is only part of the picture. Because I span the incident and problem spaces, the process we've created using Rootly has ultimately made way for a shift in our culture. We now harvest data for metrics like MTTR and MTTM. We actively drive them down, which our customers appreciate, but the culture is now actively making room for prevention. That's the real benefit of this tooling. It's taking the pressure off of the other places. Without a way to discuss measurable figures, the company can't really hope to look beyond them at the bigger picture. The benefit of a hammer is the structure that you build with it, not the nails you hit. Rootly is our hammer and the structure is the whole incident and problem management culture.