There's no doubt that the recent "partial failure" of the Amazon Web Services cloud computing platform is giving enterprises, service providers, and developers pause--and will continue to do so for months to come. Amazon called the outage "partial" and a "degradation," but it was a very big deal. A significant part of Amazon's flagship EC2 (Elastic Compute Cloud) was offline for a day, as were the related EBS (Elastic Block Store) and RDS (Relational Database Service) offerings. The failure affected only the northern Virginia data center ("US-East"), and the majority of AWS services continued to run just fine. But for the customers whose hosted IT was down, there was nothing partial about it; their sites and applications were substantially or completely offline. These included marquee Web properties like Foursquare, Formspring, HootSuite, and Reddit, among hundreds of others.
Upping the ante, the failure propagated across multiple "availability zones," which are supposed to use physically distinct, independent infrastructure with no shared components--precisely to make such failure propagation impossible. Ooops! and OOOPS!! Even worse, it turns out that Amazon permanently lost some customer data. There probably is no greater sin in information technology than losing a customer's data.
And so, the backlash begins. Headlines blare! Tech blogs proclaim "the sky is falling." "Fortune" calls it "Amazon's cloud nightmare." CNN (via Mashable) calls it "Amazon's cloud collapse." The term "Cloudpocalypse" is bandied about. Reporters and bloggers are almost immediately pitched "why the cloud isn't ready for prime time" stories. Anyone with on-premises gear, alternative services, or in-sourcing strategies to sell is on the warpath. Critics spout, "We told you so!!" Business managers begin to ask, "Uh...we rely on Amazon, right? How does that affect us?" Or "Don't we use the cloud? Should we reconsider that?"
This "The cloud failed! It could do so again!" backlash has only begun. It will reverberate for months to come--all the louder because of how much cloud has become the desired "future state" for so much of IT, and because the headlines make it sound like a broad-based catastrophe. So let's be clear what this outage means.
The cloud did not fail. Amazon Web Services failed. Amazon's failure was partial, but very substantial. Large numbers of customers were badly affected, and things that should not have failed--that were supposedly specifically designed and arranged to ride through partial failures--suffered as well. Amazon lost only a very small portion of data (less than 1/10th of 1 percent of the EBS volumes managed in northern Virginia), but losing any of it is a very bad mark. Plain and simple, Amazon dropped the ball, big time. But plenty of other cloud services have no connection to AWS, and never blipped. Indeed, much of AWS was rock-solid before and after the US-East regional failure. Amazon is the premier name in cloud, and it's a blow that parts of AWS went down so hard. But the problem's much more constrained than the headlines allow.
This type of failure could happen anywhere. Cascading failures that take down multiple services or lose data can and do happen in many data centers--both those of individual enterprises and those of service providers. Amazon's track record is generally considered excellent, though parts of AWS have certainly been down before, and this isn't the first time it's lost data. Jumping all over Amazon is all the rage right now, and it will serve a good purpose: Amazon will be highly motivated to make sure this can't happen again. Other service providers will be highly motivated to ask, "So, how can we avoid being in the headlines like this?" They'll invest toward that. All cloud service providers will have to become more transparent about how they architect and operate their infrastructure, availability designs, and other features. In short, cloud infrastructures will improve and evolve as a result. But make no mistake: if all the IT now housed by Amazon were hosted elsewhere, say on smaller-scale, less consolidated infrastructure, it's a fair bet its availability would be no higher. It would probably be worse, and it would certainly come at higher cost.
Aggregation worsens the impact. That so many sites, applications, and businesses rely on individual cloud service infrastructures like Amazon's puts many eggs in many fewer baskets. Everyone goes down at the same time, making it a much bigger deal. It is a Bad Thing that multiple services go offline at once. But it seems even worse than it is. It's like the safety of traveling in a commercial aircraft vs. a private car. Airplanes are the safer way to travel--by far. But when a plane does crash, hundreds can be killed at the same time. This can occasionally destroy a team or group, and it emotionally magnifies the apparent risk.
Some customers made out just fine. Not all AWS customers--even those concentrated on the US-East region--suffered through the outage, because they didn't depend on the particular services that went down, and/or because they had additional availability strategies in place that built on, but didn't solely depend upon, Amazon's reliability measures. Photography site SmugMug provides a great example; Netflix also rode through nicely. As lessons are learned from this outage, those who assume "the cloud--it's like magic!" will hopefully grow up and realize it's a tool, not a silver bullet. Hopefully, before someone dies. There are plenty of alternative cloud service providers, plenty of ways to upgrade the availability and manageability of off-the-shelf cloud platforms, and plenty of ways to blend the use of public cloud resources with more controlled private IT resources and approaches (aka "private" and "hybrid" clouds).
Cloud is still the way forward. SmugMug CEO Don MacAskill--a cloud and AWS consumer--puts it well: "There's a lot of noise on the Net about how cloud computing is dead, stupid, flawed, makes no sense, is coming crashing down, etc. Anyone selling that stuff...doesn't know what on earth they're talking about." The backlash has begun, and it will reverberate for a while. But cloud computing remains the only practical path toward the level of scale, efficiency, and flexibility businesses want from their IT. Not all clouds will look like the more Web-centric ones that colonized AWS first, but all IT will look increasingly cloudy.
Despite the very public, painful outage and the lurid "Cloudpocalypse" labels, this is but a bump in the road. Those who balk at cloud as a result are the IT equivalent of those who hear about an airplane crash and vow to drive, not fly, to their next meeting. It sounds reasonable in the heat of the moment, but when it comes time to travel cross-country, most everyone will look at the options and say, "You know, I think I'll fly after all."