Microsoft became a true cloud provider this past weekend as it experienced nearly 22 hours of downtime on its fledgling Azure Services Platform. The cause of the outage has not yet been disclosed to the general public or the Azure user community.
In contrast to on-premises systems, where the user is responsible for dealing with infrastructure problems, a big part of the cloud's appeal is that you don't have to manage your own systems or deal with the inevitable failures that occur.
It's easy to go off on a tangent about the necessity of monitoring the cloud, but the real issue is one of communication. If Microsoft wants to be taken seriously as a hosting provider--especially one defining a very nascent wave of technology--it needs to provide more information than a single admin's updates on an MSDN forum.
Of course, we should expect the same of other cloud providers like Amazon Web Services, Google App Engine, and Salesforce.com, all of which provide only the most basic uptime details (green=good, red=bad) with little to no explanation of what exactly is being monitored. The obvious argument is that users don't need to know...until something goes wrong and information is scarce.
Third-party services such as Hyperic's Cloudstatus.com provide additional insight, but cloud vendors themselves have to become much more forthcoming about system status and its implications. How can vendors ease the concerns that outages raise?
Visibility: Give customers immediate (real-time) visibility into the availability and performance of the services that you are delivering to them.
Transparency: Performance and availability data needs to be freely available. Don't hide these metrics behind a login or some complex credentials-only mechanism. Companies that follow this rule will succeed, and they will set the standard and force the rest of the industry to follow.
Trust: Above all else, report accurately. The most important asset a cloud services provider has is its reputation. Customers will forgive a service disruption--we all know computer systems have their periodic hiccups. Customers will not forgive anything that is less than honest and forthcoming.
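To make the gap between "green=good, red=bad" and a useful status report concrete, here is a minimal sketch of what a more forthcoming public status feed might contain. Everything in it is an assumption for illustration: the `ComponentStatus` fields, the `build_status_report` function, and the sample component names are hypothetical and do not describe any vendor's actual status API.

```python
import json
from dataclasses import dataclass, asdict

# Severity ranking used to roll component states up into one overall state.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


@dataclass
class ComponentStatus:
    """One monitored component (hypothetical schema, for illustration only)."""
    component: str   # what is being monitored
    state: str       # "green", "yellow", or "red"
    checks: list     # which checks produced this state
    detail: str      # plain-language explanation for customers
    updated_at: str  # ISO 8601 timestamp of the last update


def build_status_report(components):
    """Return a JSON status report that needs no login to interpret."""
    for c in components:
        if c.state not in SEVERITY:
            raise ValueError(f"unknown state: {c.state!r}")
    # The overall state is the worst state of any single component.
    overall = max((c.state for c in components), key=SEVERITY.get, default="green")
    report = {"overall": overall, "components": [asdict(c) for c in components]}
    return json.dumps(report, indent=2)


# Sample data is invented; it does not reflect a real incident report.
sample = [
    ComponentStatus("storage", "green", ["read latency", "write latency"],
                    "Operating normally.", "2009-03-14T08:00:00Z"),
    ComponentStatus("compute", "red", ["deployment start", "instance health"],
                    "Deployments are failing to start; investigating.",
                    "2009-03-14T08:05:00Z"),
]
report_json = build_status_report(sample)
```

The point of the sketch is the extra fields: a customer reading it learns what is monitored, what failed, and when the information was last refreshed, not just a color.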
This leads to one of the larger questions about cloud adoption: what happens when things go wrong? And are you prepared when things go bump in the night?
- As a user, what is your backup plan if your cloud provider fails?
- As a provider, what are you doing to communicate effectively with your users?
- As a provider, do you have a runbook in place for a large-scale outage?
Availability outweighs any other perceived risk of using the cloud. Issues like security and latency have always been concerns, but nothing else matters if the cloud platform or application isn't available.
One interesting technical aside: Azure appears to require a five-hour full reboot of the system, which is probably fine for now as the user base is fairly small. But just think about how long it would take to reboot all of Amazon Web Services. (A total AWS reboot is unlikely to happen, as Amazon's service is built in zones. But hey, you never know.) Or how about the impact of 17 hours of intermittent availability plus 5 hours of reboot time in the context of AWS? Literally hundreds (thousands?) of businesses would wind up offline in some manner.
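For a sense of scale, the 17 + 5 = 22 hours can be turned into an availability percentage. This is a rough sketch only: the `availability_percent` helper is hypothetical, and the 30-day and 365-day measurement windows are assumptions chosen for illustration, not figures from any provider's service agreement.

```python
def availability_percent(downtime_hours, period_hours):
    """Percentage of the period during which the service was up."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (1 - downtime_hours / period_hours)


# 17 hours of intermittent availability plus 5 hours of reboot time.
OUTAGE_HOURS = 17 + 5

# Assumed measurement windows: a 30-day month and a 365-day year.
monthly = availability_percent(OUTAGE_HOURS, 30 * 24)
yearly = availability_percent(OUTAGE_HOURS, 365 * 24)

print(f"30-day window:  {monthly:.2f}%")  # 96.94%
print(f"365-day window: {yearly:.2f}%")   # 99.75%
```

Under those assumptions, a single 22-hour outage caps the month at under 97 percent availability and the year at roughly 99.75 percent, even if nothing else goes wrong.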
As Gavin Clarke wrote on The Register, "Microsoft wanted to offer people the full cloud experience. Well, now it has."
Follow me on Twitter @daveofdoom