April 19, 2007 9:20 PM PDT

RIM offers explanation for massive outage

Research In Motion finally offered some details late Thursday about what caused a severe outage of its BlackBerry e-mail service from Tuesday evening until Wednesday morning.

The company said in a statement that it had ruled out security and capacity issues as a cause of the outage that left millions of so-called "CrackBerry" addicts without access to their e-mail for several hours. The company also said the incident was not caused by any hardware failure or core software issue.

Ruling out those causes, the company has "determined that the incident was triggered by the introduction of a new, noncritical system routine that was designed to provide better optimization of the system's cache." In computing terms, a cache is a temporary storage area that allows data to be served up quickly.

RIM said the system routine had not been expected to affect the regular operations of the BlackBerry servers and infrastructure. Despite previous testing, the new system routine produced an unexpected effect that set off a chain reaction, triggering a series of interaction errors between the system's operational database and the cache.

After RIM isolated the database problem and tried unsuccessfully to fix the issue, it began its "failover" process to a backup system. But that also failed.

"Although the backup system and failover process had been repeatedly and successfully tested previously, the failover process did not fully perform to RIM's expectations in this situation and therefore caused further delay in restoring service and processing the resulting message queue," the company said in the statement.

RIM also said it has already identified several aspects of its testing, monitoring and recovery processes that it plans to improve as a result of the incident.

Since the outage's start--around 5 p.m. PDT Tuesday--the company had been quiet about its cause. But experts said they were convinced the issue had to do with RIM's network since subscribers were still able to make phone calls and send and receive text messages.

RIM's service is centralized and works by routing all BlackBerry e-mails through one of two main network operations centers, which are essentially large data centers. One center is located in Canada and primarily serves the Western Hemisphere as well as parts of Asia. The other data center, located in the U.K., handles e-mail traffic in Europe, Africa and the Middle East. Analysts had speculated that, since most of the people affected by the outage were based in North America, the problem likely occurred in the center located in Waterloo, Ontario.

By Wednesday morning, RIM said, the e-mail had begun trickling into in-boxes across North America. The service was operating normally on Thursday, the company said.

RIM has built a strong reputation as a reliable service provider that has attracted bankers, lawyers and lawmakers as subscribers. The company has recently been trying to broaden its appeal to consumers with new products, such as the BlackBerry Pearl handheld and the BlackBerry 8800.

The new strategy has helped the company rapidly expand its subscribers. In its latest quarter, RIM reported it had added 1.02 million new subscribers, taking its total to 8 million. This is a huge increase from the 2 million subscribers the company reported a year ago, when it settled a patent infringement case with NTP. The company expects to add between 1.12 million and 1.15 million subscribers during the current quarter.


3 comments

RIM is having a capacity problem
Even though they are denying it. The fact that they began to send email in batch mode indicates efforts to stagger the traffic and reduce the number of transactions.

My guess is that their database is reaching its capacity and RIM is suffering from its success.

There will not be an easy and quick fix for this problem. They can hide it, but the problem is bound to show up again.
Posted by dewriver (2 comments )
Doesn't explain why only one location....
As a crackberry addict myself, with an additional push/pull on my personal phone there is no question which is the best solution. Company Pearl is a superb tool that I can switch off in personal time. Half the prats that complain about email are busy texting half the day... what's that all about? Exchange to Outlook to Pearl; Calendar, email, all my contacts, tasks, notes. Absolute no brainer - No filofax; run a Blackberry. One device, one solution, life's a dream!

Anyway... a) 8 million users can't be wrong. b) What a nice business problem and cash flow problem to have. I wish I had that worry too :-)
c) It didn't affect the UK so my guess is it was a local ****-up in the USA/Canada infrastructure. I imagine the next plan will be MORE de-central servers 'just in case', and a few mirrors ready to kick in.

Nothing's perfect, but this is close!
Posted by pj-mckay (161 comments )
I am in Cambodia on Dec. 25 and my Blackberry, after coming back on from the outage earlier in the week, is not working again. However, I have found no information on a new outage. Do you know if this is occurring? I can't get a response from RIM. (rablow@post.harvard.edu)
Posted by rablow (1 comment )
 
