Gmail outage blamed on capacity miscalculation
Google's nearly two-hour Gmail outage Tuesday was the result of a miscalculation regarding the capacity of its system, the company said late Tuesday.
Gmail may be out of beta, but it wasn't ready for prime time Tuesday.
Gmail was down from about 12:30 p.m. PDT Tuesday to about 2:30 p.m. PDT, affecting millions of Gmail customers who depend on the service for everything from fantasy football roster updates to business-critical information. The problem was caused by a classic cascade in which servers became overwhelmed with traffic in rapid succession.
According to Google, the problem began when it took several Gmail servers offline for maintenance, a routine procedure that normally is transparent to users. However, the twist this time around was that Google had made some changes to the routers that direct Gmail traffic to servers in hopes of improving reliability, and those changes backfired.
"As we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers--servers which direct web queries to the appropriate Gmail server for response," Google said in a post to its Gmail blog late Tuesday.
"At about 12:30 p.m. Pacific a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we're too slow!' This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded," wrote Ben Treynor, vice president of engineering and site reliability czar.
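The cascade Treynor describes can be illustrated with a toy model. The sketch below is not Google's routing logic; it simply assumes that traffic is split evenly across healthy routers and that any router whose share exceeds its capacity drops out, pushing its load onto the rest. The function name, the capacities, and the load figures are all invented for illustration.

```python
def simulate_cascade(total_load, capacities):
    """Return the count of healthy routers after each round of load shedding.

    Toy assumptions (not from Google's post): load is split evenly across
    healthy routers, and a router drops out as soon as its share exceeds
    its capacity.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        # Routers that can still handle their share stay in rotation.
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # stable: every remaining router has enough headroom
        healthy = survivors
        history.append(len(healthy))
    return history


# Ten hypothetical routers with slightly different capacities.
capacities = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]

print(simulate_cascade(1000, capacities))  # load fits: no router drops out
print(simulate_cascade(1010, capacities))  # 1% more load: the pool collapses
```

In this made-up example, a load of 1,000 units is handled by all ten routers, while a load of 1,010 knocks out the weakest router first and then, as the per-router share climbs, the rest. That is the "slightly underestimated" load turning into a near-total failure within a few rounds.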
Google fixed the problem by allocating traffic across the rest of its prodigious network, a luxury that it enjoys given the resources it has put in place to operate the world's leading search engine. But what's next?
Google said it would focus on making sure that the request routers have sufficient headroom to handle future spikes in demand, as well as figuring out a way to make sure that problems in one sector can be isolated without bringing down the entire service. "We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements--Gmail remains more than 99.9% available to all users, and we're committed to keeping events like today's notable for their rarity," Treynor wrote.
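To put the 99.9% figure in perspective, the arithmetic can be sketched as follows. This assumes availability is measured over a 365-day year, which Google's post does not specify; the function name is hypothetical, and the 99.999% ("five nines") tier is included only as a commonly cited comparison point.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent):
    """Minutes of downtime per year allowed at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


outage_minutes = 120  # Tuesday's outage lasted about two hours

for tier in (99.9, 99.999):
    budget = downtime_budget_minutes(tier)
    print(f"{tier}% available: {budget:.1f} min/year allowed, "
          f"outage used {outage_minutes / budget:.0%} of that")
```

Under these assumptions, 99.9% availability allows roughly 526 minutes (about 8.8 hours) of downtime a year, so a two-hour outage consumes a little under a quarter of that allowance. At 99.999%, the allowance is only about 5.3 minutes a year.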
Several Google Apps customers who use Gmail for internal e-mail at their businesses and organizations did not return calls Tuesday seeking information on the degree to which they were affected, making it difficult to know the magnitude of the failure. However, Google has put an awful lot of time and money this year behind promoting Gmail as a back-end e-mail software alternative to products from Microsoft and IBM, and embarrassments like this will not help it sell the service to other organizations.
"We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there's a problem with the service," Treynor wrote. "Thus, right up front, I'd like to apologize to all of you--today's outage was a Big Deal, and we're treating it as such."
Tom Krazit writes about the ever-expanding world of Internet search, including Google, Yahoo, online advertising, and portals, as well as the evolution of mobile computing. He has written about traditional PC companies, chip manufacturers, and mobile computers, spending the last three years covering Apple.





Google Apps (including Gmail) really need to be a "five-nines" reliability (five minutes of downtime annually) if they want to be considered as serious "cloud computing" leaders.
I have dozens of accounts between the three providers and Gmail is more unreliable than Yahoo or Hotmail. That says a lot, especially when Hotmail sucked eggs for years after they were acquired by Microsoft (they were originally a FreeBSD house).
Yahoo Mail has better uptime than Gmail. Far better. They are probably close to "five-nines" reliability.
As to consumer-grade services, the phone company has "five-nines" reliability, at least for my cheapo residential landline. Same with OTA broadcast TV service. Same with the power company.
If Google wants their "cloud computing" services to be considered like utilities, they have a LONG way to go before they can boast about that sort of reliability.
Your math and/or reading comprehension is bad. "Five-nines" reliability means five minutes of downtime annually.
It never should have happened. They effed up an entire production system in mid-day (these were machines hosted in the U.S.), most likely by not using the two-person rule on production machines.
It's just operational sloppiness.
As a GOOG shareholder, I find their performance to be quite lacking on this particular episode.
Human error can never be avoided so cut the guys some slack. At least they admitted they made a mistake which is a million times better than other companies out there. Honesty really is the best policy.
I'm currently okay with handing my personal e-mail service to Google, but they are in no position to brag about their reliability. And yes, I'm a GOOG shareholder.
Curious.
The request router excuse that Google is giving would pertain to HTTP requests (i.e., Gmail webmail interface) and not to POP3 or IMAP communications (which apparently were minimally affected by the outage).
It's still a p!ss-poor performance compared to other larger webmail services like Yahoo Mail or Hotmail.
We pay for Google Mail. $50 per year per user, with about 900 users.
Don't assume that every GMail user who is complaining, is complaining about a free service. It's free at the individual/consumer level but many of us are running institutional email on it.
That said, even with a couple of outages I'm still not willing to consider going back to the bad old days of running my own mail servers.
CNET: As a follow-up article can you research into the reliability comparison between the top 3 email providers Yahoo, Hotmail, and GMail? Or does anybody in this thread know the actual numbers?
Because of Google's dominance in the web, Google news like this may attract more attention than when Yahoo mail goes down, perhaps?
http://tinyurl.com/ntpjut
Google is screwed
Gmail going down is big news because Google wants applications to move to the cloud so every time their service is down it highlights the risk having a single vendor serve up apps that are integral to today's businesses. Imagine if 80% of companies used Google Apps. Business across the world would come to a halt.
These outages will keep Google Apps a novelty. Only non-mission critical apps will move to the cloud.
My TIMEWARNER CABLE Road Runner service goes out more often every month than GMAIL does.
You don't see news articles every day the moment Time Warner Cable's horrific service goes out! Did I mention they control my TV, Internet, AND phone?
That said, maybe Hotmail and Yahoo have experienced such outages but they just have not garnered as much media attention as Gmail did.
Also, though gmail was not reachable via website, I and everyone I know had no problem accessing our mails via POP or IMAP account which I use on my phone to check gmail.
For some reason Gmail outages do garner more attention & news. For what we get for free (almost unlimited storage, email forwarding, HUGE file attachment sizes, ultimate SPAM detection, IMAP & POP, no dormancy if you don't log in every month, and so much more), these rare outages are nothing really. Now if I was an organization that paid for Gmail I might have a different tone. What Google needs to do is separate the paid version from the free version. They need to make sure that NO long-term outage affects the people who pay for the service. But like you said, POP & IMAP access for Gmail seemed to work fine during all of this.
I say GREAT JOB to the Google/Gmail team for recovering so quickly, and all you people who need to whine, because YOU think, for some odd reason, you're better than everyone else...PLEASE get over yourselves!
99.9% uptime is better than 99.9% of all other email providers, free or paid.
Once the wave of traffic/requests grows to critical mass... all one can do is deny traffic and scramble to recover services and/or re-route traffic. Since Gmail is widely used around the globe this could be a rather large feat to pull off from an operations point of view.
Given the volume of traffic Google/Gmail has acquired over the years, I'm impressed they recovered as fast as they did. My hat's off to the many Google/Gmail Operations staff that brought Gmail back online as quickly as they did.
Without naming other web sites... this is nothing new. It's happened before and will most likely happen again.
Heck - I have several 25-user environments (vs. Google's 125M users) where I have things scheduled for this weekend, because it gives us a little extra time to figure things out in the event that things go wrong...
Coming to Gmail --> This just shows that no matter how much money you put into infrastructure ... it's never enough. Sometimes being too popular can be a curse ;-)
I'm not entirely sure if Gmail's servers are located globally or not, but if they are, then they could do maintenance after hours.
And by after hours, roughly around the time after the RUSH of people checking e-mails.
The above could actually be gathered from their server logs pretty easily.
Stuff happens and life goes on. And "NO", just because G-mail is your enterprise-level e-mail, does not somehow exempt it from the "absolute law of imperfection." People, software, machines, processes, weather, governments, and your spouse...occasionally fail to perform as we wish or require. As great as they are, Google is no exception.
If you are still upset ... I dare you to consider switching back to Yahoo. Didn't think so...
Downtime should never occur in normal business hours. NEVER. Smart companies take risks early in the morning.
If this downtime occurred...from midnight to 2 am...it would be a non-story basically. It happened instead during peak business hours. That's ludicrous!
Ten minutes can mean the difference in getting a business proposal and responding to a last-minute deadline. It's like if 911 in your local town went down for 10 minutes. I'm sure you wouldn't have such an arrogant "GET OVER IT" attitude.
Think before you type.
Do you really think you get reliability for $50 a year? That is funny. Reliability costs tons of money and when you use a system that multiple companies and people use you will never be guaranteed reliability. The cloud is the worst guarantee for reliability you will ever have. Reliability also equals redundancy, if you don't have a backup plan you only have yourself to blame.
If Google wants to promote itself and remain in the EDU sector, then they'll need to start abiding by the same outage windows and consider the performance spikes that schools typically see. The back-to-school week was just awful timing.
EDU sector does maintenance during school year too.
- by giant_david September 2, 2009 7:18 AM PDT
- It was interesting to get latest information from twitter yesterday, since I am a Gmail user. Twitter is an amazing tool, it is arguably becoming the INTERNET neural system.
- by RobVaughn September 2, 2009 11:51 AM PDT
- I'll take the contrarian bet on this one: Twitter is the new RSS, and like RSS, will be used by geeks and people who want to socialize online, but will never reach critical mass, at least as a business tool.