nVisium Application Security Consultant Cody Michaels outlines five lessons learned from the Facebook outage that rocked the Internet in October 2021.
27 Oct, 2021

The Top 5 Lessons Learned From the Great Facebook Outage of 2021

by nVisium

The internet was shaken by the Facebook outage earlier this month. Dozens of big-name companies, along with countless smaller ones, were affected. Because of something as simple as a botched configuration change that knocked Facebook's Domain Name System (DNS) servers off the internet, every device with Facebook app integration started hammering recursive DNS resolvers with retries, effectively a Distributed Denial of Service (DDoS) attack. This, in turn, caused overloads across the board.

Internet 101

You might be thinking to yourself, "So what? A few sites were offline for a couple of hours." But the outage brought other issues to light. Communications for the very Facebook employees who could fix the issue were crippled. Some of these hindrances went so far that people were unable to enter buildings, because the physical badge system was offline too.

I'm sure a lot of people reading this already know how the internet works. But just to make sure the gravity of what happened is understood, let's pause for a moment and take a 50,000-foot view of what the internet and servers actually are.

Servers host the SaaS (Software as a Service) applications that everyone uses, and each server has an address for communications. The SaaS applications that live on these servers can change location very rapidly, thanks in part to services like AWS Elastic Load Balancing paired with auto scaling, which spread traffic across servers and let companies spin up their applications on additional servers when traffic becomes too much for one to handle. Then, as traffic dies down, the extra servers are removed. In this way, a company only pays for servers while it needs them and not a moment longer.
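To make that pay-as-you-go scaling a little more concrete, here is a minimal sketch in Python using boto3. It assumes a hypothetical Auto Scaling group named "web-asg" sitting behind the load balancer and AWS credentials already configured; the group name and the 50% CPU target are illustrative, not anything from a real deployment.

```python
# A minimal sketch: attach a target-tracking scaling policy so extra
# servers are added when average CPU climbs and removed when it falls.
# Assumes an existing Auto Scaling group named "web-asg" (hypothetical)
# and AWS credentials configured in the environment.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",            # hypothetical group name
    PolicyName="keep-cpu-around-50-percent",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 50.0,                   # scale out above ~50% CPU, scale in below it
    },
)
```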

While this is great from a financial standpoint, it can present operational challenges. The internet, in a nutshell, is what connects these servers to each other and to the client systems trying to reach them. So, keeping up to date with all of those shifting addresses is one of the challenges the internet deals with just to keep the lights on.

Again, this is a super high-level look at the internet and servers, but it's necessary to understand why we need Border Gateway Protocol, or BGP, whose misconfiguration was the direct cause of the outage.

BGP and DNS Explained

BGP is how the internet figures out which route traffic should take from client to server. Routers use BGP to announce the available paths that data can travel and to pick the best one. Now, I'm not going to give an hour-long spiel about the differences between path-finding algorithms, because the pros and cons of Single Source Shortest Path (SSSP) and Minimum Spanning Tree (MST) algorithms are beyond the scope of this article. All one needs to know is that BGP tells the rest of the internet which networks are reachable and how to get to them. It works hand in hand with the Domain Name System (DNS), which takes a human-readable domain name like "Google.com" and translates it to the IP address that should be used to reach it. Keep in mind that the same service can have multiple IP addresses, depending on how far the SaaS has scaled out.
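To see that last point for yourself, here is a tiny sketch using Python's standard library to list every IPv4 address a resolver currently returns for a name. google.com is used only as a familiar example; the addresses you get back will vary by network and over time, which is exactly the moving target described above.

```python
# A tiny sketch: resolve a domain name and print every IPv4 address
# the resolver currently returns for it. Run it twice from different
# networks (or hours apart) and the list may well differ.
import socket

def resolve_all(hostname: str) -> set[str]:
    """Return the set of IPv4 addresses currently published for hostname."""
    results = socket.getaddrinfo(hostname, 443, family=socket.AF_INET)
    return {sockaddr[0] for *_, sockaddr in results}

if __name__ == "__main__":
    print(resolve_all("google.com"))
```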

How to Break the Internet

Now that we have an understanding of how fluid SaaS locations can be, what can be done in order to keep track of them all in an (impressively) fast manner?

Doug Madory is the Director of Internet Analysis at Kentik, a San Francisco-based network monitoring company. Madory said that at approximately 11:39 a.m. ET on October 4th (15:39 UTC), someone at Facebook caused an update to be made to the company's BGP records.

“Not only are Facebook’s services and apps down for the public, its internal tools and communications platforms — including Workplace — are out as well,” New York Times tech reporter Ryan Mac tweeted. “No one can do any work. Several people I’ve talked to said this is the equivalent of a ‘snow day’ at the company.”

The irony is not lost on me here: Facebook employees had to resort to tweeting to get word out, partly because of the building-access issues I mentioned at the beginning of this article. An employee working on the recovery effort explained that the update-gone-wrong also blocked Facebook employees from reverting the changes. As a result, even those with physical access to the main office building couldn't use the internal tooling, because it was all tied to the company's domains.

A fascinating thread on Hacker News delves into some of the not-so-obvious side effects of the outage: many organizations saw network disruptions and slowness thanks to billions of devices constantly asking for the current coordinates of Facebook, Instagram, and WhatsApp. Shortly thereafter, Facebook published a blog post saying the outage was the result of a faulty configuration change, confirming what many had originally suspected.

Hindsight Is Always 20/20

Historically speaking, this event was nowhere near the worst Facebook outage the social media conglomerate has experienced. In fact, in 2019, the sometimes controversial platform went offline for more than 24 hours.

So, now that the digital dust has started to clear, it's time for the post-mortem. What can we learn from this event and how can you prevent it from happening to your organization? Because, let's face it: If a massive platform like Facebook can experience a sprawling outage, companies large and small should be taking notes.

Configuration Management Matters

Configuration Management (CM) is a systems engineering process for establishing and maintaining consistency in things like a system's performance, functionality, and physical attributes. Essentially, CM allows for a programmatic approach to ensure things don't run off the rails. A server isn't responding after an update? Halt the rollout to the rest of the fleet (the other servers), alert the proper individuals about the event, and then roll back the update on the unresponsive server to restore service. Basically, by using CM for automation, you can easily build in checks against human error.
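As a rough illustration of that programmatic approach, here is a minimal canary-style rollout sketch in Python. The server names and the apply_update, rollback, and alert helpers are hypothetical placeholders for your own deployment tooling, not any particular CM product.

```python
# A minimal sketch of a canary-style rollout: update one server at a time,
# verify it still responds, and halt (plus roll back and alert) the moment
# a server fails its health check. All helpers are hypothetical stand-ins.
import urllib.request

FLEET = ["web-01.example.internal", "web-02.example.internal", "web-03.example.internal"]

def apply_update(server: str) -> None:
    print(f"applying update to {server} ...")    # placeholder for the real deploy step

def rollback(server: str) -> None:
    print(f"rolling back {server} to the previous release ...")

def alert(message: str) -> None:
    print(f"ALERT: {message}")                   # e.g., page the on-call engineer

def healthy(server: str) -> bool:
    """Consider the server healthy if its /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"http://{server}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def rolling_update(fleet: list[str]) -> None:
    for server in fleet:
        apply_update(server)
        if not healthy(server):
            alert(f"{server} failed its health check; halting rollout")
            rollback(server)
            return                               # stop before touching the rest of the fleet
    print("update applied to the whole fleet")

if __name__ == "__main__":
    rolling_update(FLEET)
```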

The Importance of Testing in the Pipeline

The advent of DevOps has given us amazing power to automate as many manual processes as possible. It has allowed teams to go from pushing code to production roughly once a sprint to getting changes live within minutes of submission to a code repository. That said, a common problem we see is the lack of a real testing stage within this pipeline, beyond a basic linter (static code analysis tool) run by a CI tool like Jenkins against the GitHub repository.

The type of testing that needs to be added to the pipeline is pushing the code to staging servers. Essentially, these staging or dev servers are a sandbox to examine what would happen before the changes hit production. While this issue with Facebook came from a configuration — not purely developer code — the sentiment is still the same.

It's imperative to have that testing area so you can be confident that whatever you are moving or changing in the production environment won't end up as the lead story on the front page of WIRED tomorrow. Lastly, always make sure the testing servers match production as closely as possible for the most reliable results.
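As a sketch of what that extra pipeline stage might look like, here is a small Python smoke test that could run between the linter and the production deploy. The staging URL and endpoint paths are hypothetical; the point is simply that a non-zero exit code stops the pipeline before anything reaches production.

```python
# A small smoke test for a staging environment: hit a few critical endpoints
# and exit non-zero if any of them misbehave, so the pipeline stops before
# anything reaches production. The URL and paths are hypothetical.
import sys
import urllib.request

STAGING_BASE = "https://staging.example.com"     # hypothetical staging host
CRITICAL_PATHS = ["/health", "/login", "/api/v1/status"]

def check(path: str) -> bool:
    try:
        with urllib.request.urlopen(STAGING_BASE + path, timeout=10) as resp:
            return resp.status == 200
    except OSError as exc:
        print(f"{path}: request failed ({exc})")
        return False

if __name__ == "__main__":
    failures = [p for p in CRITICAL_PATHS if not check(p)]
    if failures:
        print(f"smoke test failed for: {', '.join(failures)}")
        sys.exit(1)                              # non-zero exit halts the pipeline stage
    print("staging smoke test passed")
```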

Rollback Planning and Drills Are Critical

So, you've done your due diligence and taken every step to make sure this simple update will go smoothly, but after you push it, you find out something unrelated broke from the change. While that is usually a sign that orthogonality wasn't followed in development and design, it happens to the best of us, so don't fret.

Moments like these are when you refer to the rollback plan I discussed earlier: the steps required to get back to the state things were in before the change or update was pushed. If you've taken my CM advice to heart, you should already have this plan in place.

If not, develop a plan for rolling back, preferably before your next push to production. Once that plan is in place, run a mock scenario to make sure it works beyond just being documented. This is one of the many good reasons to run periodic cyber wargaming scenarios, an admittedly fun, interactive way to test your cybersecurity preparedness in an attack context.
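One way to make such a drill repeatable is to script it. The sketch below uses a hypothetical local config file to show the shape of the exercise: snapshot the current state, apply a change, then restore the snapshot and time how long the rollback takes.

```python
# A minimal sketch of a rollback drill: snapshot the current config,
# apply a change, then restore the snapshot and time the round trip.
# The config file path and its contents are hypothetical.
import shutil
import time
from pathlib import Path

CONFIG = Path("app.conf")                         # hypothetical config file
SNAPSHOT = Path("app.conf.snapshot")

def take_snapshot() -> None:
    shutil.copy2(CONFIG, SNAPSHOT)

def apply_change() -> None:
    CONFIG.write_text(CONFIG.read_text() + "feature_flag = on\n")

def roll_back() -> float:
    """Restore the snapshot and return how long the restore took, in seconds."""
    start = time.monotonic()
    shutil.copy2(SNAPSHOT, CONFIG)
    return time.monotonic() - start

if __name__ == "__main__":
    CONFIG.write_text("feature_flag = off\n")     # seed a starting config for the drill
    take_snapshot()
    apply_change()
    elapsed = roll_back()
    print(f"rollback completed in {elapsed:.3f} seconds")
```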

Communication Alternatives Are a Necessity

With an increasing percentage of the workforce switching to remote environments, reliable communications are more important than ever. So, for a company like Facebook, it makes sense to use a homegrown SaaS, like its proprietary Messenger platform. However, scenarios exactly like this one are also why you should have a predetermined fallback.

While I'm sure this sounds like an obvious step, don't just bury the plan in your team's onboarding material. And if you're a smaller company that isn't hiring hundreds of people a year, you may not even have a formal, established onboarding process.

Secondly, the alternative communication method can change at any time. One way to get around this is to keep updated documentation on what to do if the main form of contact is cut for whatever reason. Another tip is to have your IT department send an automated email or SMS to employees detailing the backup method, should this happen.
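As a sketch of that automated fallback message, here is a small Python script built on the standard library's smtplib. The chat health URL, SMTP relay, and addresses are hypothetical placeholders for whatever your IT department actually runs.

```python
# A minimal sketch: if the primary chat platform stops answering, email
# everyone the pre-agreed backup channel. The SMTP host, addresses, and
# chat URL are hypothetical placeholders.
import smtplib
import urllib.request
from email.message import EmailMessage

PRIMARY_CHAT_URL = "https://chat.example.com/health"    # hypothetical primary platform
SMTP_HOST = "smtp.example.com"                           # hypothetical mail relay
EMPLOYEES = ["alice@example.com", "bob@example.com"]     # hypothetical recipients

def primary_chat_is_up() -> bool:
    try:
        with urllib.request.urlopen(PRIMARY_CHAT_URL, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

def send_fallback_notice() -> None:
    msg = EmailMessage()
    msg["Subject"] = "Primary chat is down: switch to the backup channel"
    msg["From"] = "it-alerts@example.com"
    msg["To"] = ", ".join(EMPLOYEES)
    msg.set_content("Our main chat platform is unreachable. "
                    "Please move to the documented backup channel until further notice.")
    with smtplib.SMTP(SMTP_HOST) as server:
        server.send_message(msg)

if __name__ == "__main__":
    if not primary_chat_is_up():
        send_fallback_notice()
```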

Depth and Redundancy Are Key

Now, this is very closely tied to the last section on communication alternatives, but it applies to all of the lessons learned to some degree. The right level will vary from company to company, but set redundancy to a level that would make any doomsday prepper jealous. And if you think it's overkill, go one step further.

A good rule of thumb, and a question you should constantly be asking yourself about every possible scenario, is: does this have a backup? This is where the rollback plan serves as redundancy, should you push a world-breaking change to production. Configuration Management is redundancy for manual human monitoring. If you have at least one level of backup in place, the entire livelihood of your company isn't at stake. At the end of the day, you can save yourself a plethora of headaches (and heartaches) by going back to basics and prioritizing redundancy throughout your environments.

- By Cody Michaels, Application Security Consultant, nVisium

 
