Sidekick outage casts cloud over Microsoft
The massive data failure at Microsoft's Danger subsidiary threatens to put a dark cloud over the company's broader "software plus services" strategy.
A key tenet of that approach is that businesses and consumers can trust Microsoft to reliably store valuable data on their servers.
T-Mobile Sidekick Slide (Credit: Corinne Schulze/CNET)
A week ago, though, Microsoft's Danger unit experienced a huge outage that left many T-Mobile Sidekick users without access to their calendar, address book, and other key data. That's because the Sidekick keeps nearly all its data in the cloud as opposed to keeping the primary copy on the devices themselves.
Things got even worse on Saturday, as Microsoft said in a statement that data not recovered thus far may be permanently lost. It's not immediately clear how many people lost their data. The outage earlier in the week affected a broad swath of Sidekick users, though many had data return during the week.
While outages in the cloud computing world are common (one need only look at recent issues with Twitter or Gmail), data losses are another story. And this one stands as one of the more stunning ones in recent memory.
The Danger outage comes just a month before Microsoft is expected to launch its operating system in the cloud--Windows Azure. That announcement is expected at November's Professional Developers Conference. One of the characteristics of Azure is that programs written for it can be run only via Microsoft's data centers and not on a company's own servers.
It should be pointed out that the Azure setup is entirely different from what Danger uses: the Sidekick uses an architecture Microsoft inherited rather than built (Microsoft bought Danger last year). Still, the failure would seem to be enough to give any CIO pause.
Update, 2 p.m. PT, 10/11/2009: I asked Microsoft for comment Saturday when I was writing this, in particular as to how the rest of its cloud might differ from the Danger setup.
Microsoft said Sunday that the fabric controller that manages the Azure service is built with redundancy in mind.
"We write multiple replicas of user data to multiple devices so that the data is available in a situation where a single or multiple physical nodes may fail," Windows Azure general manager Doug Hauger said in a statement to CNET News.
That doesn't mean Azure is immune from data loss, though I'm told an entire data center would have to be wiped out, as opposed to just a server or collection of servers. I'd be interested to know whether Microsoft will also offer multiple location options so that users that want to can have their data in more than one physical spot as well.
But that's just one of many questions raised by this spectacular failure. Among the other questions still looming large in my head are:
1. What backup procedures did Danger have?
2. Just how many of T-Mobile's Sidekick customers lost their data? (Feel free to let me know, Sidekick users.)
3. What impact will this have on the Pink project, which was largely seen as the evolution of the Sidekick, and some say was already in trouble?
4. Will this hurt Microsoft's efforts to build a brand around the notion of Windows Phone even though that uses a different architecture (with its own challenges, to be sure)?
During her years at CNET News, Ina Fried has changed beats several times, changed genders once, and covered both of the Pirates of Silicon Valley. These days, most of her attention is focused on Microsoft.





They were only showcasing their version of remote wipe, it's a feature right ?
Ina is wise to look at what damage this does to Azure, WiMo, and phone partnerships. Apparently Pink is going to see some blowback as well.
The SideKick is arguably the first real smartphone. In the process of driving it into the ground Microsoft has devalued a half billion dollar acquisition to zero, alienated every partner in the phone world, lost the most loyal customers in phones, and completely disproven the reliability of their cloud storage. It's difficult to imagine a more total failure.
Today it's pretty easy to say: if you like your data, if you want to buy an app that's persistent, if you want to control your media, get an iPhone. They're darned sexy and they don't have this appalling history.
The sad thing is, this was predictable. If you knew the details of Microsoft's partnership with Sendo you would have ditched your SideKick the day they acquired Danger and today you'd still have your contacts, your emails, your bookmarks, those amazing snaps of the kids and the captured video of the concert you went to last month. But if you didn't they're gone forever.
As for phone partners, they were already torqued about Pink. Think of how server vendors would react if Microsoft introduced own-brand servers, or how they would react when Oracle bought Sun and sought to sell own-branded database servers. It ain't pretty. In the world of manufacturing you never, NEVER, undercut your resellers.
So... Microsoft loses Pink, Danger, WiMo and everything phones. Somebody always wins and today it's Android.
It looks to be an exciting week in Redmond (Bellevue actually, but near enough...). It's not every week that Microsoft spends half a billion dollars to improve Google's bottom line.
1) Backup!
2) Backup!
3) Backup!
Looks like Microsoft never got past rule #1. Or maybe it was Danger... still, once they have control they have responsibility. They certainly have the resources to improve Danger's infrastructure. Nothing happens instantly but it looks like they've had a year and a half to procrastinate. I see they've used the time well.
You know, I liked my old SKII and I was thinking of getting an LX but... Microsoft having my data? I shut down my Hotmail account years ago because I didn't trust them. Now I'm having second thoughts. Well, if they're losing data... third and fourth and fifth thoughts.
P.S. There's a typo in the second line, too many Os in "to".
How about "loss of data you couldn't back up casts cloud over CLOUD"? This "shove your info in and don't worry about it, just rely on us forever" approach is a way to make sure nothing more important than Myspace gets on the cloud.
If you want a successful server-based business, make or use a standardized file type for the information you're using, and let your customers download it. If Sidekick users could download a folder full of all their files from the cloud (.jpgs for the pictures they've taken, .mp3s for ringtones, .ctcs for their contacts, .emls for their emails, etc) and put it on a hard drive somewhere, disaster averted.
A few megs of space on a hard drive you have control over, vs. a proprietary format on a server you can't access and whose info you can't take elsewhere should you end your service.
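For what it's worth, the export idea in this comment could be sketched roughly as below. This is only an illustration, not anything Danger or T-Mobile offered: `export_user_data`, `verify_export`, and the `MANIFEST.json` name are hypothetical, and the sketch assumes the service can already hand over each item as plain bytes under an ordinary file name.

```python
import hashlib
import json
import zipfile
from pathlib import Path


def export_user_data(files: dict[str, bytes], archive_path: Path) -> dict[str, str]:
    """Write every file into one zip archive, plus a SHA-256 manifest.

    `files` maps an ordinary relative file name (for example
    "photos/concert.jpg") to its raw bytes. Hypothetical helper.
    """
    manifest = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
        # The manifest lets the user check the download later without the service.
        archive.writestr("MANIFEST.json", json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


def verify_export(archive_path: Path) -> bool:
    """Re-read the archive and confirm every file still matches its checksum."""
    with zipfile.ZipFile(archive_path) as archive:
        manifest = json.loads(archive.read("MANIFEST.json"))
        return all(
            hashlib.sha256(archive.read(name)).hexdigest() == digest
            for name, digest in manifest.items()
        )
```

The point of the manifest is that the copy on your own hard drive can be checked without any help from the server it came from.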
This failure could and will happen to any other company. I hope this is a lesson to all of them, especially to Microsoft (which has several cloud services, such as Mesh, Skydrive, etc.).
This is one of the problems with the cloud, and not just security.
In a different building miles away we kept a set of backup tapes and documentation on all hardware parts and all switch and jumper settings. Call the hardware vendor and tell them to deliver these parts, set up this way, to your alternate street address. My estimate was we could be fully restored in four days.
What Microsoft did was something less than that. And, they are not claiming any fire damage.
The unreliability of MSFT stretches from Win Mobile to their Data Centers.
Have your own data and backup first, then put things on cloud, rain, wind, dust etc.
Dunno... even a half-competent admin knows what backing up to tape means.
Seriously? For customer data, any enterprise could set up your SAN with on-disk backup/shadow/snapshot copies (most SAN gear has this feature), have disk backup atop that, and tape backup atop all of that.
ANYONE acting as a cloud service operator should have at least two of the three in place, FFS...
Apparently they didn't have usable offline backups, but it's not clear to me if their offline backups were unusable or if they didn't have one at all.
As part of my security practice I teach companies the difference between fault-tolerance and DR, and between DR and BCP, and it's always bizarre to me that some IT staff don't already understand these basic concepts. How can IT people not know this already? A few years ago it was "my data is on RAID so it's safe" and today it is "my data is on SAN so it's safe".
What we have here is a massive failure in process, not technology. So many tasks by so many parties had to go wrong for a blunder like this to happen.
And how, then, did Hitachi take out the offline backups? Oh, there weren't any? And whose fault is that?
Maybe you should read my post again.
No kidding... you'd almost have to do it on purpose.
To be fair, mbenedict was agreeing with most of us in this thread, and adding to it.
It's still Microsoft's fault; well, some individuals and teams within Microsoft's fault, anyway. I wouldn't be surprised if there weren't a couple of freshly unemployed folks packing their stuff in a MSFT-owned datacenter today (if they hadn't been perp-walked out the door already).
This is pretty basic stuff - you always have more than one backup when your income depends on the data, period. Even Microsoft knows this.
Whenever you go in to do any kind of work on a SAN (especially a frickin' head or head OS upgrade), you always make sure to have backups - send an extra copy to tape and test the restore if you have to (actually, you should do that anyway - gives you a fresh copy of the data you can keep handy), but always make sure you have the data safe.
For their size, I'm surprised they didn't just send the data to another SAN entirely and have that one take over for a bit, leaving the techs to beat up the original one. But then, I can understand how complacency can creep in... the (likely now unemployed) admins probably thought that it was just a head upgrade, no big deal, right?
Good object lesson for the rest of us if that's what went down.
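The "make a copy and test the restore before touching the SAN" routine described above might look something like this in miniature. It's a sketch on ordinary directories rather than tape or SAN volumes, and `tree_digest` and `backup_and_test_restore` are made-up names for illustration, not any vendor's tooling.

```python
import hashlib
import shutil
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 of its contents."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digests[path.relative_to(root).as_posix()] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    return digests


def backup_and_test_restore(source: Path, backup: Path, scratch: Path) -> bool:
    """Copy `source` to `backup`, then do a trial restore into `scratch`.

    Returns True only if the restored tree matches the live data file for
    file. `backup` and `scratch` must not exist yet (hypothetical helper).
    """
    shutil.copytree(source, backup)   # take the backup
    shutil.copytree(backup, scratch)  # trial restore from the backup only
    return tree_digest(scratch) == tree_digest(source)
```

Maintenance would only go ahead when the function returns True; a backup that has never been restored is a guess, not a backup.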
It wasn't any cloud, it was people's bank transactions. Payments weren't processed for 10 days.
Do we want anyone to take care of our data, or maybe we should take care of our own bank transactions as well?
Mind-boggling, the most important data about everyone is already in a data-center out there, and you are not in control!
Cloud or no cloud, Microsoft or no Microsoft.
There's a big difference between a process gone bad but having redundant data present, and having a process gone bad without even checking for redundant data.
you make baby jesus cry... the only thing that MSFT is in error of is buying a company with a bunch of Linux hacks that can't execute a simple backup process. This is what happens when everyone tries to cut costs on the most important technologies and the best of breed processes that accompany them...
Danger to their will. This is just another symptom of Bill Gates's biggest error: promoting that idiot Ballmer to the top rung of the ladder. Imagine some real software company being run by an arrogant soap salesman. Uggh!
This is a failure of basic principles, nothing more.
This all sounds familiar in some way.
...because they either weren't implemented, or not implemented correctly. Note that Steve Ballmer has frig-all to do with the day-to-day design or implementation of backups, which was my point. That Ballmer is a loud and arrogant jackass who seems bound to run Microsoft into the ground through bad strategy and even worse implementations? Sure.
But blaming the guy for something his admins should have taken care of as standard protocol? That's just dumb.
Seriously - this is datacenter 101 stuff: Always make sure you have good tape backups to restore from before dorking with your SAN.
Basically when an administrator does a crap job, don't just fire the administrator, fire the boss who hired him in the first place, because ultimately, the boss of the idiot is the person who put him there in the first place.
http://www.appleinsider.com/articles/09/10/09/exclusive_pink_danger_leaks_from_microsofts_windows_phone.html
"The final operator who is going to be pissed is T-Mobile, who has been just as loyal of a partner to Danger as Sharp has been. I don't know exactly what Microsoft has been telling them, but they have no doubt realized that they've been cut out of this deal in favor of their largest competitor. What's worse is that apparently Microsoft has been lying to them this whole time about the amount of resources that they've been putting behind Sidekick development and support (in reality, it was cut down to a handful of people in Palo Alto managing some contractors in Romania, Ukraine, etc.).
"The reason for the deceit wasn't purely to cover up the development of Pink but also because Microsoft could get more money from T-Mobile for their support contract if T-Mobile thought that there were still hundreds of engineers working on the Sidekick platform. As we saw from their recent embarrassment with Sidekick data outages, that has clearly not been the case for some time."
I was just thinking the same thing. Just the quote alone looks like it's made up of conjecture and speculation. Proof?
However, it does show that you can never have too many backups (including local ones--I have an external drive with my most important documents backed up even though they are also on my company's file servers which are tape-backed up). This also means you can't have proprietary data systems or the lack of ability to copy the data off the cloud at will. This is what bothers me most about many cloud computing systems currently. They try to wed you to their specific systems instead of just providing the computing power and server space.
Now MICROSOFT TAKES OVER and 3 MONTHS LATER ALL MY CONTACTS, PICTURES, CALENDARS, GAMES, RINGTONES, ALL GONE!!!!
Is Microsoft really that incompetent? (tongue in cheek).
One free month of data service is PATHETIC. They should be offering us a free switch to a new phone that doesn't rely upon DANGER!
I know my shares don't count for much, but boy it is an annual gratification!
You sir, are a real card! Why bother with being original in comedy when you can keep saying the same thing time after time after time? Who knows, perhaps after the 259th time you say it, it might actually be funny!
Yes, it is a joke, a real funny one at that.
Except, the Danger architecture doesn't appear to be "software plus services". In the "software plus services" approach, you have software running locally that keeps your data and syncs it with the cloud, so it can be accessed from other places. What Danger does appears to be 100% services, where the data only lives in the cloud.
Similar havoc had happened back when Microsoft bought Hotmail, and proceeded to convert all those FreeBSD servers Hotmail ran over to Windows NT...
Everyone knows that /dev/null has plenty of room! 'course, they should've been safe and encrypted it first by running it through /dev/zero... just to make sure no one could steal the data during the transfer. Can't be too careful now... :)
I agree that you can argue that MSFT BPOS is just another word for hosting but that's pretty much true of all of today's "clouds".
until now... and fundamentally the only thing that changed at Danger was the credit union where they cash their undeserved pay checks.
Let's hope that the media will do a better job this time.
To draw the long bow even further you could suggest it is an attempt to keep all users shackled to the desktop so they have to work/party like it's 1999... or 1989. Should this happen, MS is no longer seen as an irrelevant dinosaur!
I doubt that - Google and Amazon have had outages, but no data loss.
Neither Amazon nor Google has a Terms of Service with their end user regarding data either. Neither is responsible for the data they have.
Have they had data loss? Yes. Are they responsible for it? Not legally. Do they have to report it? No.
I can't say much more than that without NDA violations.
you sure about that? here's just the very first "google" response to missing gmail...
http://www.deathbyemail.com/2007/11/the-case-of-the.html
Of course, lessons will have been learned, but their assurances are total nonsense, they have nothing to do with what actually went wrong.
In what language...?
wait, is there an architecture microsoft built? haven't they either bought or stolen everything they have?
...wait, there was MSBob. they made that, didn't they?
Oh sorry, one other innovation, white on white. The most boring, least personal devices in the world. Is sterility really such a desirable fashion statement? Oh sorry I forgot, you can get black on black now as well. That's so much more expressive...
Big companies buy little companies and exploit them, this is the American way. The whole system is set up that way. No one big company is much better or worse than any other. Well, actually I would say Google is worse because of their "Do no evil" hypocrisy. Whatever.
I hope this disaster opens the eyes of companies like Mot, who also plan to put a lot of user data for their upcoming new Android devices in the cloud. These smart phones have Gbytes of memory, they can easily store all of the information they have lost, but they choose not to do this because they want control over the customer. I think Palm might even be doing something similar, I think if you use the Pre with a POP3 account, Palm or Sprint's servers fetch your mail from your POP3 account, and then send it to the phone. That's the only explanation I can have for a new POP3 configuration on the Pre having to take a few hours to start working.
I wish the phone companies would be happy just being phone companies and stop trying to be lords of the internet.
daily? Heh - any decent enterprise-sized SAN has snapshot backups that you can run once an hour if you wanted to. Combine that with a daily snapshot, and you can set up a rotating schedule that's fairly merciful to disk space yet offers immediate restoration.
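A rotating schedule like that usually comes down to a pruning rule. Here is one possible sketch, with made-up numbers (every snapshot from the last 24 hours, then the newest snapshot per day for a week) and a hypothetical `snapshots_to_keep` function; real SAN software has its own retention settings, and snapshots kept this way still live on the same array as the data.

```python
from datetime import datetime, timedelta


def snapshots_to_keep(snapshots, now, hourly_window=timedelta(hours=24), daily_days=7):
    """Return the snapshot timestamps a simple rotation would retain.

    Keeps every snapshot taken within `hourly_window` of `now`, plus the
    newest snapshot of each calendar day for anything older, up to
    `daily_days` days back. Everything else can be deleted.
    """
    keep = set()
    newest_per_day = {}
    for ts in snapshots:
        age = now - ts
        if age < timedelta(0):
            continue  # ignore timestamps from the future
        if age <= hourly_window:
            keep.add(ts)
        elif age <= timedelta(days=daily_days):
            day = ts.date()
            if day not in newest_per_day or ts > newest_per_day[day]:
                newest_per_day[day] = ts
    keep.update(newest_per_day.values())
    return sorted(keep)
```

With hourly snapshots this holds a few dozen restore points instead of hundreds, which is the "merciful to disk space" part.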
You're absolutely right- and Danger should have had that sort of backup built into their system. Unfortunately as it has become apparent, they did not. I suspect this is a good reason why services like Azure that are designed for data security and backup from the ground up will be more successful.
I wish Microsoft had come up with some sort of interim backup setup for Danger's systems, but you are limited in what you can do with an inherited system.
Snapshots are not backups. If your SAN is toast, your snapshots are toast. It's possible this is what Danger had -- a bunch of snapshots which are now worthless.
You can take a backup from your snapshots, but there's no way to back up any "enterprise-sized" SAN in an hour, unless we're talking tiny SANs here. A backup from a snapshot isn't any faster than a regular backup.
What enterprises tend to do is to run incrementals from snapshots, usually once a day. Or in the case of a database, continuous replication is also a popular solution, but can't be considered a true backup (maybe if you keep all the transaction logs somewhere).
...which is why I mentioned tapes waaaay up earlier in the comments section. ;)
I was just addressing his mention of wanting daily data backup, in that you can provide something a bit more recent than that (well, in most cases, not this one obviously).
--
"I suspect this is a good reason why services like Azure that are designed for data security and backup from the ground up will be more successful."
Dunno ab't that one, Dan - if what mbenedict was saying about the root cause is correct, nothing can surmount human screwups. No tape or offsite backups made before the upgrade, the SAN upgrade got borked (along with its disks)... Don't know of any programming language that can anticipate something like that. :/
--
Oh, back to mbenedict:
"Or in the case of a database, continuous replication is also a popular solution, but can't be considered a true backup (maybe if you keep all the transaction logs somewhere)."
Usually both. You can peel off a replication stream, though that usually is best done without a WAN link in between (else it gets slow or expensive, you pick). OTOH, it's great for local copies (say, onto a sister SAN set), with the transaction logs getting peeled off to a separate and independent server.
Wouldn't call it a true backup by any means, though you can certainly use it as one in a pinch. True backup means getting a good and complete copy of the data off the premises, and onto a medium that cannot be directly changed or touched by the machinery you're backing up.
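That definition is simple enough to write down as a check. A rough sketch, with invented names (`DataCopy`, `is_true_backup`) and an invented inventory, just to show how snapshots and replicas fall short of it:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """One copy of the data (hypothetical model for illustration)."""
    name: str
    site: str                 # where the copy physically lives
    writable_by_source: bool  # can the production gear change or delete it?


def is_true_backup(copy: DataCopy, primary_site: str) -> bool:
    """Off the premises AND untouchable by the machinery being backed up."""
    return copy.site != primary_site and not copy.writable_by_source


# Invented example inventory, not a description of Danger's actual setup.
copies = [
    DataCopy("SAN snapshot", site="primary-dc", writable_by_source=True),
    DataCopy("replica on sister SAN", site="primary-dc", writable_by_source=True),
    DataCopy("remote replica", site="dr-site", writable_by_source=True),
    DataCopy("offsite tape", site="vault", writable_by_source=False),
]

true_backups = [c.name for c in copies if is_true_backup(c, "primary-dc")]
print(true_backups)  # ['offsite tape']
```

The remote replica fails the test because a replication stream faithfully carries deletions and corruption along with everything else.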
- by ewelch October 10, 2009 9:21 PM PDT
- Microsoft gambled, and they lost. No one should ever trust them as long as Ballmer is in charge. He needs to fall on his sword for a lack of leadership - integrity too.