My post on using the Cloud for storage went live just minutes before my intrepid IT guy Kevin received this email from utility computing provider Flexiscale about the potential large-scale loss of data stored on their Cloud storage service.
The short version: human error in their backup process deleted one of the main storage volumes. Roughly 12 hours later, users have read-only access to the storage platform but no read-write access. Now Flexiscale has to rebuild the volume, but doesn't have the space to do it.
"After consulting with our storage vendor it was agreed the most sensible option would be to copy the entire volume to a new disk structure (still maintaining its integrity and structure), from where we could re-mount it correctly. Unfortunately due to its size we didn't have spare capacity on the platform to create a complete duplicate of it."
Without disparaging Flexiscale, this is what I mean about the BigCos like IBM figuring these "enterprise-class" features out before enterprises move into Cloud consumption.
Full email pasted below:
As some of you are aware, we have been having issues with I/O (disk speed) in recent weeks. We identified short-term and long-term measures to eliminate these problems. The short-term measures involved reorganising how data was stored across our storage network in a more efficient manner, and the long-term measure was to increase the overall I/O capacity of the platform.
As a preparatory step to adding additional capacity, one of our engineers was reorganising the data structure on the storage network and, whilst cleaning up the snapshots we use as our backup process, accidentally deleted one of the main storage volumes. This caused an immediate outage to a large number of our customers.
We immediately took action to take the entire disk structure offline (which caused the remaining customers to be taken offline) as it was the only way to preserve the integrity of the data on the system. Work then commenced with our storage vendor to restore this data.
Although we have now successfully gained read-only access to everyone's data, a bug in the storage platform's operating system has prevented us from providing read-write access to it. This was discovered at 11pm last night, just when we thought we were about to bring the entire disk structure back online.
After consulting with our storage vendor it was agreed the most sensible option would be to copy the entire volume to a new disk structure (still maintaining its integrity and structure), from where we could re-mount it correctly. Unfortunately due to its size we didn't have spare capacity on the platform to create a complete duplicate of it.
An investigation of other ways of restoring the data was then undertaken, but all options were considered too risky, and although downtime is a major problem for everyone, we felt the integrity of the data was the most important factor.
The decision was then taken to get additional capacity in from the storage vendor as soon as possible so that we could then increase the capacity to a sufficient level to allow us to copy the volume and successfully restore it. We originally thought we would be able to get this today, but unfortunately it will not arrive until mid-morning tomorrow, although we have done (and will continue to do) everything we can to speed this up.
At this time we are assisting customers who need access to specific files to get this, and we will continue this as long as we can into the night as resources allow.
Tomorrow morning, once the storage arrives and is online, we will copy the data across and then begin to restart the entire platform as quickly as possible, but as the system wasn't designed to restart everything at once, this will take time.
We will be offering credits against our SLA, which will be determined once everyone is back up and running; as I'm sure you can appreciate, all resources are being focused on that at this moment.
I and all my staff are well aware of the potential impact this will be causing to you, our customers, and we are doing everything we can to help in that respect. We will also be undertaking an investigation to ensure additional safeguards are put in place to prevent this happening again.
Chief Executive Officer