Today, April 29, 2011, Amazon Web Services released a "summary" of the EC2 (Elastic Compute Cloud) and RDS (Relational Database Service) disruption in its U.S. East Region. The summary comes roughly one week after what appears to be a classic rolling disaster, set off when someone incorrectly executed a communications network traffic shift as part of "normal AWS scaling activities." I read human error here, long known as the leading cause of large system failures.
The rolling disaster is a well-understood phenomenon in IT, yet it can be hard to foresee in a complex system. The way to discover and fix potential failure points is to test on a regular basis and then build around what you find. But periodic testing becomes difficult for a system of this magnitude.
What I find positive about the Amazon summary is a set of disaster recovery recommendations for users and an admission that AWS customer support during the outage was less than stellar. The disaster recovery recommendations should now be required reading for every AWS customer. In fact, I think that all cloud service users should read this statement with an eye to discovering potential holes in their own disaster recovery strategies.