Facebook outage caused by a cascade of errors, company says
A cascade of errors made during maintenance on Facebook’s network caused the outage that took its services offline on Monday, the company said in a blog post published Tuesday.
Facebook’s family of apps, which includes Instagram, WhatsApp and Messenger, was offline for more than five hours as employees rushed to repair the damage. More than 3.5 billion people around the world use Facebook’s services to communicate with friends and family, spread political messages, and grow their businesses through advertising and outreach.
The initial problem occurred in a network that Facebook calls its “backbone,” which connects its data centers across the world, wrote Santosh Janardhan, vice president of infrastructure at Facebook, in the blog post.
During network maintenance, a command was issued to assess the available capacity. But the command backfired, disconnecting the network and blocking communication from Facebook’s data centers, Janardhan said. An auditing tool designed to catch erroneous commands failed to flag it, he added.
But that was only the beginning of the problems. “This change caused a complete disconnection of our server connections between our data centers and the Internet,” wrote Janardhan. “And this total loss of connection caused a second problem which made matters worse.”
With Facebook’s data centers offline, the company’s servers that manage its Internet addresses also went down. “This made it impossible for the rest of the Internet to find our servers,” Janardhan said.
As the extent of the outage became clear, Facebook engineers struggled to restore service because the company’s data centers are heavily protected and employees could not get into them immediately, the company said.
“We have done a tremendous job of hardening our systems to prevent unauthorized access, and it was interesting to see how this hardening slowed us down as we tried to recover from an outage caused not by malicious activity but by a mistake of our own making,” Janardhan wrote.
Once the engineers were inside Facebook’s data centers and started working, they were able to restore the network. But they had to bring the servers back online gradually so as not to overwhelm the system, Janardhan said.
The company planned to study how the outage happened and create exercises that would allow employees to practice fixing Facebook systems faster, he added.