August 27, 2008 9:34 AM PDT

Another large-scale Cloud data loss in process?

by Dave Rosenberg

My post on using the Cloud for storage went live just minutes before my intrepid IT guy Kevin received this email from utility computing provider FlexiScale about the potential large-scale loss of data stored on its Cloud storage service.

The short version: human error in their backup process deleted one of the main storage volumes. Roughly 12 hours later, users have read-only access to the storage platform but no read-write access. Now they have to rebuild the volume, but they don't have the space.

"After consulting with our storage vendor it was agreed the most sensible option would be to copy the entire volume to a new disk structure (still maintaining its integrity and structure), from where we could re-mount it correctly. Unfortunately due to its size we didn't have spare capacity on the platform to create a complete duplicate of it."

Without disparaging FlexiScale, this is what I mean about the BigCos like IBM needing to figure out these "enterprise-class" features before enterprises move into Cloud consumption.

Full email pasted below:

As some of you are aware, we have been having issues with I/O (disk speed) in recent weeks. We identified short-term and long-term measures to eliminate these problems. The short-term measures involved reorganising how data was stored across our storage network in a more efficient manner, and the long-term measure was to increase the overall I/O capacity of the platform.

As a preparatory step to adding additional capacity, one of our engineers was reorganising the data structure on the storage network and, whilst cleaning up the snapshots we use as our backup process, accidentally deleted one of the main storage volumes. This caused an immediate outage to a large number of our customers.

We immediately took action to take the entire disk structure offline (which caused the remaining customers to be taken offline) as it was the only way to preserve the integrity of the data on the system. Work then commenced with our storage vendor to restore this data.

Although we have now successfully gained read-only access to everyone's data, a bug in the storage platform's operating system has prevented us from providing read-write access to it. This was discovered at 11pm last night, just when we thought we were about to bring the entire disk structure back online.

After consulting with our storage vendor it was agreed the most sensible option would be to copy the entire volume to a new disk structure (still maintaining its integrity and structure), from where we could re-mount it correctly. Unfortunately due to its size we didn't have spare capacity on the platform to create a complete duplicate of it.

An investigation of other ways of restoring the data was then undertaken, but all options were considered too risky, and although downtime is a major problem for everyone, we felt the integrity of the data was the most important factor.

The decision was then taken to get additional capacity in from the storage vendor as soon as possible so that we could increase the capacity to a sufficient level to allow us to copy the volume and successfully restore it. We originally thought we would be able to get this today, but unfortunately it will not arrive until mid-morning tomorrow, although we have done (and will continue to do) everything we can to speed this up.

At this time we are assisting customers who need access to specific files to get this, and we will continue this as long as we can into the night as resources allow.

Tomorrow morning once the storage arrives and is online, we will copy the data across and then begin to restart the entire platform as quickly as possible, but as the system wasn't designed to restart everything at once, this will take time.

We will be offering credits against our SLA, which will be determined once everyone is back up and running. As I'm sure you can appreciate, all resources are being focused on that at this moment.

I and all my staff are well aware of the potential impact this will be causing to you, our customers, and we are doing everything we can to help in that respect. We will also be undertaking an investigation to ensure additional safeguards are put in place to prevent this happening again.

Sincerely,
Tony Lucas
Chief Executive Officer
XCalibre/FlexiScale

Dave Rosenberg dishes up "Software, Interrupted" with nearly 15 years of technology and marketing experience that spans from Bell Labs to multiple start-up IPOs to open-source enterprise software companies. He is co-founder of MuleSource and currently serves as the general manager of Hardy Way. He is a member of the CNET Blog Network and is not an employee of CNET. You can contact Dave via e-mail at softwareinterrupted@gmail.com or follow him on Twitter @daveofdoom.
4 Comments
by tonylucasxcal August 27, 2008 11:08 AM PDT
As the author of that e-mail, to clear up what is a bit of a misleading headline, we are confident about restoring all the data, and have read-only access to it at present.

Not a good situation for us or our customers though, I appreciate that.
by 2easytec August 27, 2008 3:28 PM PDT
Both as the director of a company that uses XCalibre and as an IT writer for the New York Times Company, I find this post and its headline to be a bit yellow. XCalibre has been taking steps to ensure against data loss throughout this whole ordeal. To insinuate that large amounts of data will be lost when the situation is not yet resolved boils down to scaremongering. I, for one, thought CNet was better than that. Or does the 'C' stand for 'Chicken Little' now?
by daverosenberg August 27, 2008 4:15 PM PDT
The headline got a little tweaked so I apologize for alarming anyone. From what I hear the company has done a good job fixing the problem.
by xcalibrecustomer August 29, 2008 3:30 AM PDT
As a customer on one of the servers that was affected, we are still without a fully functioning website - 3 days now. Their support staff (there appears to be only one of them) are most unhelpful in giving any information regarding this issue. Their network support page repeats the phrase "All customers will be notified" at various stages. We did not receive the mentioned email from XCalibre and have received no updates from them at all. It appears that we are not a FlexiScale customer, but our hosting package with XCalibre uses FlexiScale.
I agree with the headline as Xcalibre, from talking with them, are incorrectly telling us that our website is back up.
I personally am appalled with the service and support that Xcalibre has provided and would not recommend their services to anyone.