A badly written command, a faulty audit tool, a DNS system that hampered efforts to restore the network, and strict security measures in the data centers all contributed to the seven-hour outage at Facebook.
Facebook says the cause of Monday's outage was routine maintenance that went wrong, taking its DNS servers offline.
Complicating matters, the DNS outage meant Facebook's technicians could no longer remotely access the devices they needed to bring the network back up.
That slowed things down, and the data centers' physical safeguards, which are designed to make tampering difficult for everyone, slowed them down further. "They're hard to crack, and once you're inside, the hardware and routers are designed to be hard to change, even if you have physical access to them," wrote Santosh Janardhan, Facebook's vice president of engineering and infrastructure, in a company blog post.
It took some time, but once the systems were restored, the network worked again.
Restoring the customer-facing services running over the network was another lengthy process, as booting up those services at the same time could cause another round of crashes. “Individual data centers reported power drops in the tens of megawatts range, and a sudden reversal of such a power drop could put everything from electrical systems to caches at risk,” Janardhan wrote.
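The careful bring-up Janardhan describes can be illustrated with a sketch: rather than starting every service at once, operators restart them in small batches with a pause between batches so load and power ramp up gradually. The service names and batch logic here are hypothetical, not Facebook's actual tooling.

```python
import time

def staggered_restart(services, batch_size=2, delay_s=0.0):
    """Bring services back up in small batches so load and power
    ramp gradually instead of spiking all at once."""
    started = []
    for i in range(0, len(services), batch_size):
        batch = services[i:i + batch_size]
        for svc in batch:
            started.append(svc)  # placeholder for the real start call
        time.sleep(delay_s)      # pause between batches to let load settle
    return started

# Example: six services restarted two at a time
order = staggered_restart(["cache", "db", "web", "api", "queue", "search"])
print(order)
```

In a real deployment the delay between batches would be minutes, and each batch would be health-checked before the next one starts.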
In total, Facebook was down for seven hours and five minutes.
Errors during routine maintenance
At the beginning of the outage, Facebook had taken only part of the backbone network offline for maintenance. “During one of these routine maintenance operations, a command was issued with the intent to evaluate the availability of global backbone capacity, inadvertently disconnecting all connections in our backbone network, thus shutting down Facebook’s data centers worldwide,” Janardhan wrote.
That wasn't planned, and Facebook even had a tool in place to weed out commands that could cause such a catastrophic outage, but it failed to flag this one. "Our systems are designed to audit such commands to avoid errors like this, but an error in this audit tool prevented the command from being properly inhibited," Janardhan said.
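The idea of such an audit tool can be sketched as a validator that checks commands against known-dangerous patterns before they run. This is a hypothetical illustration, not Facebook's actual system; as the outage showed, a bug in the validator itself (for example, an incomplete pattern list) can let a destructive command slip through.

```python
# Hypothetical command-audit check: block commands that could take
# down backbone connectivity. Pattern list and names are illustrative.
RISKY_PATTERNS = ("shutdown all", "disconnect backbone")

def audit_command(command: str) -> bool:
    """Return True if the command is judged safe to execute."""
    lowered = command.lower()
    return not any(pattern in lowered for pattern in RISKY_PATTERNS)

print(audit_command("show backbone capacity"))      # safe query → True
print(audit_command("disconnect backbone links"))   # destructive → False
```

A tool like this is only as good as its pattern matching: a command the patterns fail to cover is waved through, which is analogous to the audit failure Janardhan describes.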
When that happened, DNS was doomed.
DNS was a single point of failure
According to Angelique Medina, head of product marketing at Cisco ThousandEyes, which monitors internet traffic and outages, an automated response to the backbone failure appears to have taken Facebook's DNS down.
DNS (the Domain Name System) answers requests to translate web names into IP addresses, and Facebook hosts its own DNS name servers. "They have an architecture where their DNS service is scaled up or down depending on server availability," Medina says. "And when server availability dropped to zero because the network went down, all the DNS servers were taken out of service."
This shutdown happened because Facebook's DNS name servers sent messages to internet Border Gateway Protocol (BGP) routers, which store knowledge of the routes used to reach particular IP addresses. Routes are routinely advertised to these routers so they can direct traffic accordingly.
Facebook's DNS servers sent BGP messages withdrawing the advertised routes to themselves, making it impossible to route traffic to anything on Facebook's backbone network. "The end result was that our DNS servers became unreachable even though they were still up and running. This made it impossible for the rest of the Internet to find our servers," Janardhan wrote.
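The failure mode Medina and Janardhan describe can be sketched as health-driven route advertisement: the name servers withdraw their BGP routes when they see no healthy backbone connectivity, so the rest of the internet can no longer reach them even though the servers themselves are still running. The function, threshold, and prefixes below are illustrative assumptions, not Facebook's actual implementation.

```python
# Hypothetical sketch: DNS name servers withdraw their BGP route
# advertisements when backbone health drops to zero.
def advertise_routes(healthy_backbone_links: int, routes):
    """Return the routes that should remain advertised via BGP."""
    if healthy_backbone_links == 0:
        return []           # withdraw everything: servers become unreachable
    return list(routes)     # normal operation: keep advertising

dns_routes = ["198.51.100.0/24", "203.0.113.0/24"]  # illustrative prefixes
print(advertise_routes(3, dns_routes))  # healthy: routes stay advertised
print(advertise_routes(0, dns_routes))  # backbone down: all routes withdrawn
```

The logic is sensible in isolation (do not advertise a service you cannot back), but when the health signal itself fails globally, it converts a backbone problem into a total DNS outage.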
Even if the DNS servers had still been accessible via the Internet, Facebook customers would no longer have been able to use the service because the network they were trying to reach had collapsed. Unfortunately, Facebook engineers also no longer had access to the DNS servers required by their remote management platforms to reach the failed backbone systems.
"They don't just use their DNS service for their customer-facing web offerings," Medina says. "They also use it for their own internal tools and systems. The complete shutdown prevented their network operators and technicians from accessing the systems they needed to fix the problem."
A more robust architecture would have two DNS services, so one could back up the other, she said. Amazon, for example, whose AWS offers its own DNS service, uses two external services for its DNS, Dyn and UltraDNS, according to Medina.
Lessons to learn from it
Judged against networking best practices, the incident reveals what may be a flaw in Facebook's architecture. "Why was DNS actually a single point of failure in this case?" Medina says. A DNS outage without backup DNS could mean a longer outage, "so I think redundant DNS is an important takeaway."
Medina offered another general observation, drawn from outages at other service providers. "Often with these outages, there are so many dependencies within the network that a small problem in one part of the overall service architecture results in a problem that then has some sort of cascade effect," she says.
"Many companies use a variety of internal services, and this can have unforeseen consequences. This may be more for technicians, but I think it's worth pointing out."
*Tim Greene is executive editor of Network World.