If you ask an SRE to make a list of their biggest reliability worries, birds and beavers probably won't be on it.
But these and other animals can and do cause critical disruptions. Whether it's Internet infrastructure, the power grid or passenger aircraft, even the best-engineered systems can sometimes fall victim to mundane intrusions by animals.
To prove the point, here's a look at four types of incidents where animals are the culprits, and what SREs can learn from them.
Beavers break the Internet
You know beavers work hard and build great dams. But what happens if, in the course of working hard on their dams, beavers chew through fiber cables?
The Internet for an entire community goes down, along with TV and cell service. At least, that's what happened in a small Canadian town in April 2021, when beavers damaged fiber infrastructure owned by Telus, a telecommunications provider. Service was out for about 36 hours.
Telus called the incident a "uniquely Canadian disruption." It was that, but it was also perhaps a lesson in the importance of redundancy. When you have a town that is wholly dependent on a single fiber for all of its Internet and phone service, you don't have a very resilient system.
To be fair, the population of the town, Tumbler Ridge, amounts to barely 2,000 souls. It's no doubt hard from a cost perspective to justify building redundant infrastructure for what is -- from an ISP's perspective, at least -- a pretty small set of users.
Still, in a perfect world, this system would have been engineered with a little more reliability built in.
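To put rough numbers on the redundancy argument, here is a minimal back-of-the-envelope sketch. It rests on two assumptions that are ours, not Telus's: that the 36-hour outage was the only downtime in a one-year window, and that a hypothetical second fiber would fail independently of the first. The function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; the one-year window is an assumption


def availability_from_downtime(downtime_hours: float,
                               period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period during which the service was up."""
    return 1.0 - downtime_hours / period_hours


def parallel_availability(single: float, paths: int) -> float:
    """Availability of `paths` redundant links, assuming independent failures.

    Service is down only when every link is down at the same time.
    """
    return 1.0 - (1.0 - single) ** paths


# 36 hours of downtime comes from the incident; everything else is hypothetical.
single = availability_from_downtime(36)
dual = parallel_availability(single, 2)

print(f"one link:  {single:.4%}")
print(f"two links: {dual:.4%}")
```

Under those assumptions, a single link works out to about 99.59% availability, and a second independent link would lift that to roughly 99.998%. The independence assumption is doing a lot of work, though: two fibers buried in the same trench give a determined beaver a single point of failure all over again.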
Google cowed by cows
You may think it would take more than grazing cattle to bring one of the world's largest and most powerful tech companies to its knees. But that's kind of what happened, according to Google VP Urs Hölzle. (OK, we're exaggerating a bit -- the company wasn't exactly about to keel over due to cows -- but it did suffer a real disruption.)
Hölzle explained recently on Twitter how Google discovered that an aerial fiber link (meaning one that is run alongside power mains) had fallen to the ground in Oregon. The fall did not create a major incident on its own. However, further investigation revealed that service degraded whenever cows grazing in the area happened to step on the cable.
If you're an SRE, you might appreciate this anecdote not just for its randomness, but also for what it reveals about the importance of careful root-cause analysis. We're guessing that when Google engineers detected a service problem, checking for cows wasn't part of the response playbook.
Hölzle didn't explain exactly how Google traced the problem to the bovines, but it seems likely that it took some out-of-the-box thinking and a dynamic remediation plan.
Squirrels and the power grid
What's the secret to building a super-reliable power grid?
Part of the answer is to make electrical infrastructure more resilient against intrusions by squirrels. The animals cause thousands of power disruptions each year, with squirrel-related incidents peaking in spring and fall.
To be fair to the squirrels, they're not the biggest source of power failure. Weather events hold that title. But the bushy-tailed rodents do their fair share of damage (as do a variety of other animals, although none wreak as much havoc on the power grid as squirrels).
If you were responsible for reliability engineering for the electrical grid, then, you would be wise to think about how you can make equipment more resilient against squirrel intrusions. That may not be obvious -- it's easier to focus on severe events like major storms as the number-one threat to electrical reliability -- but this is a case where it's just as important to plan for reliability in the face of more mundane problems, like rodents that climb into transformers.
Birds and planes: A dangerous combo
It's a bird. It's a plane. It's...a flock of birds causing a plane to crash.
That's what happened in Russia in 2019, when a plane collided with a flock of gulls and was forced to make a crash landing in a field.
It turns out that bird-related aircraft incidents are a fairly common occurrence -- which makes sense, we suppose. There are always going to be birds in the sky, and it's hard to engineer a plane that is totally impervious to collisions with them.
For SREs, the lesson here is that some problems just can't be fully prevented no matter how hard you try. The best you can do is plan ahead to ensure a smooth recovery -- by training pilots to perform a safe crash landing when birds disrupt their craft, for example.
Conclusion: What animals can teach SREs
Sometimes, complex systems fail due to bugs buried deep within code, gross human negligence or malicious acts. Other times, it's because a cow steps on a wire or a bird collides with a plane.
If you're an SRE, you'd do well to think about animal-caused incidents, even if they are not a major threat to the systems you maintain. Although most of the disruptions described above were short-lived and impacted relatively small groups of people, they're examples of incidents that are easy to overlook as engineers anticipate problems and design playbooks. As such, they're a reminder that incident response must always be flexible -- because the next time your uptime turns into downtime, you just may have beavers to blame.