5 Lessons from Facebook, Instagram, WhatsApp Outage
Facebook learned the hard way that a single configuration error can take down the mightiest of networks. Here are a few things that can help enterprises avoid making the same mistake.
![social media social media](https://eu-images.contentstack.com/v3/assets/blt69509c9116440be8/blte6e0476c93f64cbd/64cabcbdd1230358625469c5/Facebooksocialappsoutage-LarsHagberg-alamy.jpg?width=700&auto=webp&quality=80&disable=upscale)
Lars Hagberg via Alamy
IT workers around the world were likely barraged with AOL emails from their elderly relatives on October 4, asking why their computers were broken and their phones had stopped working. (At least that’s what happened to this humble reporter.) But it wasn’t a massive internet outage that caused this mass panic. In fact, the social networking site favored by those over a certain age -- Facebook -- was down. Everybody on Twitter was talking about it, including Twitter, which took the opportunity to troll its rival.
The outage came the day after a scathing 60 Minutes report featured an interview with a former Facebook insider turned whistleblower, who said the company’s algorithms purposely fed political polarization because that kind of content tends to be more profitable.
So far, the timing of Facebook’s outage seems to be just a coincidence. In a blog post after the network had begun its recovery, Facebook VP of Infrastructure Santosh Janardhan issued an apology and explained what had gone wrong.
“The underlying cause of this outage also impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem,” he wrote. “Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”
For those in IT, maybe this scenario sounds familiar. Maybe you’ve also been in a situation where, say, an expired server certificate was the single spark that set off an apocalypse across your entire network.
It’s also important to point out that Janardhan said there was no foul play involved in the outage.
“We want to make clear that there was no malicious activity behind this outage -- its root cause was a faulty configuration change on our end,” he wrote.
Although Facebook is mostly considered a consumer network, its outage offers a series of lessons to IT organizations. First and foremost, there’s the question of how to avoid similar catastrophes in your own enterprise. Facebook’s influence also extends beyond consumer markets: its other brands, Instagram and WhatsApp, went down too, and WhatsApp is certainly used in business scenarios as well as by consumers.
The following slides cover five lessons every IT organization should learn from the Facebook outage.
A day after his first post, Janardhan posted again, going into more detail about what brought the mighty Facebook down. Some regular maintenance went wrong, a bug in an audit tool meant the error was not caught, and as a result all of the company’s DNS (Domain Name System) servers became unreachable even though they were still operational.
It was as if the map of the world no longer had any state or street names written on it. These places still exist but there’s no way to write directions on how to get anywhere.
BGP advertisements played a part in this failure, according to Janardhan.
According to technology provider Cloudflare, Border Gateway Protocol (BGP) “is the postal service of the internet. When someone drops a letter into a mailbox, the postal service processes that piece of mail and chooses a fast, efficient route to deliver the letter to its recipient. Similarly, when someone submits data across the internet, BGP is responsible for looking at all of the available paths that data could travel and picking the best route, which means hopping between autonomous systems.”
Cloudflare offered a more complete explanation of Facebook’s outage.
But if streets have no names, BGP can’t really route the traffic.
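To make the “operational but unreachable” point concrete, here is a toy model in Python. It is a minimal sketch under stated assumptions, not a description of Facebook’s setup: the `RoutingTable` class, the prefix, and the server address are all hypothetical, and real BGP involves path selection between autonomous systems that this ignores. It shows only that a server can stay up yet drop off the map once the route to it is withdrawn.

```python
import ipaddress


class RoutingTable:
    """Toy table of advertised prefixes (hypothetical, not real BGP)."""

    def __init__(self):
        self.prefixes = set()

    def advertise(self, prefix):
        self.prefixes.add(ipaddress.ip_network(prefix))

    def withdraw(self, prefix):
        self.prefixes.discard(ipaddress.ip_network(prefix))

    def is_reachable(self, address):
        # An address is reachable only if some advertised prefix covers it.
        addr = ipaddress.ip_address(address)
        return any(addr in net for net in self.prefixes)


# Documentation-only address range (RFC 5737), used here as a stand-in.
DNS_PREFIX = "192.0.2.0/24"
DNS_SERVER = "192.0.2.53"

table = RoutingTable()
table.advertise(DNS_PREFIX)
server_is_running = True  # the server itself never goes down in this model

before = table.is_reachable(DNS_SERVER)
table.withdraw(DNS_PREFIX)  # the faulty change removes the advertisement
after = table.is_reachable(DNS_SERVER)

print(before, after, server_is_running)
```

In this model the server’s own state never changes. Only the advertisement does, and that alone is enough to make it unreachable.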
In his blog post explaining the failure, Janardhan mentions an audit tool that had a bug and failed to find an error. It’s not really clear from the post what audit tool Facebook was using in this case and what other tool sets the company had in place for configuration management. However, as we all witnessed on October 4, and probably other times inside our own companies when a configuration error occurred, a single mistake can take down a mighty empire.
Configuration management tools are designed to help prevent outages like the one Facebook experienced. For instance, SolarWinds offers a tool called Network Configuration Manager that is meant to help organizations keep failures like this from happening and that also offers policy and compliance assistance, according to product manager Brandon Shopp.
Such tools can also help organizations pinpoint the source of the failure so that fixes and recoveries can happen faster.
Shopp points to another important best practice for configuration changes -- they should not be fully automated and there should be a system of checks and balances. For instance, if one engineer puts in for a configuration change, that change should be peer reviewed by another engineer before it goes through to minimize any negative impacts on the operation of the network.
“There are certain parts of the network that are instrumentally critical,” Shopp says. “The change that Facebook engineer made took down services for the better part of a business day … Automation can be a double-edged sword.”
Automation, and automation fed by AI and machine learning, is only as smart as how it has been taught and trained, Shopp says.
“Automation, while it can provide huge productivity gains, is not a silver bullet,” he says. He recommends a human-in-the-loop process that provides for peer review on changes that could have such a massive impact on the entire infrastructure.
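The peer-review idea Shopp describes can be sketched as a simple approval gate. This is an illustration only, not how SolarWinds or Facebook implements it: the `ConfigChange` class, the engineer names, and the policy of requiring two reviewers for critical changes are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigChange:
    """A proposed network change (hypothetical structure)."""
    author: str
    description: str
    critical: bool = False
    approvals: set = field(default_factory=set)

    def approve(self, reviewer):
        # The author of a change can never act as its reviewer.
        if reviewer == self.author:
            raise ValueError("authors cannot approve their own change")
        self.approvals.add(reviewer)

    def can_apply(self):
        # Assumed policy: critical changes need two reviewers, others one.
        required = 2 if self.critical else 1
        return len(self.approvals) >= required


change = ConfigChange("alice", "update backbone router policy", critical=True)
change.approve("bob")
ready_after_one = change.can_apply()   # one reviewer is not enough here
change.approve("carol")
ready_after_two = change.can_apply()   # second reviewer unblocks the change

print(ready_after_one, ready_after_two)
```

An automated pipeline can still do the actual rollout. The gate only ensures that a second person has looked at the change before it reaches the critical parts of the network.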
Facebook’s external, audience-facing sites and services were not the only things that went down. The internal tools and systems the company uses in its day-to-day operations were also down because they were all part of the same giant network. That made it more difficult for Facebook employees to fix the problems.
Indeed, there were unconfirmed reports of workers being unable to enter buildings where fixes were needed because the physical access control systems were inaccessible.
If you, too, are locked out of the house, you can’t really let anyone else into it.
If system recovery is important to you, consider putting the tools you need for recovery on a separate network.
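One way to start is to list the tools you would need during an outage and check which ones share a network with production. The sketch below assumes a hand-maintained inventory. The tool names and network labels are hypothetical placeholders, not a reference to any real product or to Facebook’s environment.

```python
PRODUCTION_NETWORK = "prod-backbone"  # hypothetical network label

# Hypothetical inventory: tool name -> network it depends on.
recovery_tools = {
    "remote-console": "out-of-band",
    "config-rollback": "prod-backbone",
    "badge-access": "prod-backbone",
}


def shared_fate(tools, production_network):
    """Return the tools that would go down together with production."""
    return sorted(
        name for name, network in tools.items() if network == production_network
    )


at_risk = shared_fate(recovery_tools, PRODUCTION_NETWORK)
print(at_risk)  # ['badge-access', 'config-rollback']
```

Anything on that list is a candidate for moving to a separate network, or for a documented manual procedure that does not depend on the network at all.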
Your enterprise’s employees are a savvy bunch after all the security training they’ve been through. But a social network outage could possibly cause them to lower their guard.
Chances are it wasn’t just your elderly relatives who were upset about Facebook being down. It’s also possible that some of the users at your own company are avid Facebook users. Maybe they are part of a neighborhood group, or a parent group, or some other organization that communicates via Facebook. Once they lose their line of communication, they are even more ripe for a phishing attack. What if a bad actor uses the Facebook outage as leverage for a phishing attack on your company? For instance, what if one of your company’s users gets an email saying their Facebook account has been locked and asking them to please click on a link to restore it? It’s in times of panic, like when you cannot access normal communication methods, that users make stupid mistakes.
Now is a good time to remind them that bad actors can leverage their panic in the face of a social network outage. When you get an email like that, stop, take a deep breath, and wait a few minutes to consider what might really be happening.
When you enabled website visitors to “Login with Facebook” to your organization’s ecommerce site, news site, internet-connected device, smart TV, or other service, did you have a plan for what would happen in the case of a Facebook outage?
Users who relied on Facebook as their sign-in tool for multiple internet sites were also out of luck during the Facebook outage -- and so were the businesses who enabled this type of sign-in for users. We are looking at you, Niantic.
If your customers used Facebook to log in to your service, they couldn’t reach you for those six hours that Facebook was down. Consider whether it’s worth it to your organization to set up an alternative way for users to log in during outages such as these or whether you are willing to risk the lost business.
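An alternative sign-in path could look something like the sketch below. It is a minimal illustration under stated assumptions: `social_login` is a stand-in that always simulates an outage rather than a call to any real provider’s API, and `email_link_login` is a hypothetical fallback that presumes you already hold a verified email address for the account.

```python
class ProviderUnavailable(Exception):
    """Raised when the external sign-in provider cannot be reached."""


def social_login(user):
    # Stand-in for a real provider call; here it always simulates an outage.
    raise ProviderUnavailable("social sign-in is down")


def email_link_login(user):
    # Hypothetical fallback, e.g. a one-time link sent to a verified address.
    return f"one-time link sent to {user}"


def sign_in(user, primary=social_login, fallback=email_link_login):
    """Try the primary sign-in method, then fall back if it is unavailable."""
    try:
        return primary(user)
    except ProviderUnavailable:
        return fallback(user)


result = sign_in("pat@example.com")
print(result)  # one-time link sent to pat@example.com
```

The fallback only works if it was set up before the outage. Users who have only ever signed in through the social provider, with no other credential on file, would still be locked out.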