Dropbox is pinning the blame for its Friday outage on a glitch in its server upgrade process.
The file storage and sharing site went offline on Friday and continued to suffer problems even after coming back online over the weekend. One issue, for example, prevented Dropbox users from sharing folders, but that feature is now working again. The remaining issues were corrected and the core service was restored as of 4:40 PM PT on Sunday, according to Dropbox.
How and why did the outage occur in the first place?
Rebutting earlier reports of a hack or distributed denial-of-service (DDoS) attack, Dropbox said that the outage was caused by a "subtle bug" in a script involved in upgrading the operating system on its database servers. Each database uses one master and two slave machines for redundancy, and that setup was caught up in the glitch.
"On Friday at 5:30 PM PT, we had a planned maintenance scheduled to upgrade the OS on some of our machines," Dropbox's head of infrastructure, Akhil Gupta, said in a blog post published on Sunday. "During this process, the upgrade script checks to make sure there is no active data on the machine before installing the new OS. A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-slave pairs were impacted, which resulted in the site going down."
Gupta insisted that the files of Dropbox users were never at risk during the outage since the databases don't contain any actual file data.
To help prevent similar outages, Gupta said that the Dropbox team has now added checks that require servers to confirm their current state before they can run an incoming command. Dropbox has also developed and implemented a tool that it believes will help speed up the recovery of large databases.
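Dropbox has not published the code behind those checks, but the general idea of a server verifying its own state before acting on a command can be sketched in a few lines of Python. Everything in the sketch is hypothetical and for illustration only: the `Server` and `Role` types, the `reinstall_os` command name, and the `run_command` and `upgrade_fleet` functions are assumptions, not Dropbox's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    # Hypothetical roles; the article mentions only master and slave machines.
    MASTER = "master"
    SLAVE = "slave"
    IDLE = "idle"


@dataclass
class Server:
    name: str
    role: Role
    has_active_data: bool


class CommandRefused(Exception):
    """Raised when a server's own state check rejects a command."""


def run_command(server: Server, command: str) -> str:
    # The server checks its own current state instead of trusting the
    # caller's view of which machines are safe to wipe.
    if command == "reinstall_os":
        if server.has_active_data or server.role is not Role.IDLE:
            raise CommandRefused(f"{server.name}: refusing {command}, server is in use")
    return f"{server.name}: ran {command}"


def upgrade_fleet(servers, command="reinstall_os"):
    # Send the command to every server and record which ones accepted it.
    done, skipped = [], []
    for server in servers:
        try:
            run_command(server, command)
            done.append(server.name)
        except CommandRefused:
            skipped.append(server.name)
    return done, skipped
```

In this sketch, a buggy upgrade script that targets an active master would be turned away by the machine itself, because the final safety decision sits with the server rather than with the script that issues the command.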