September 1, 2009 8:55 PM PDT

Gmail outage blamed on capacity miscalculation

by Tom Krazit
  • 73 comments

Google's nearly two-hour Gmail outage Tuesday was the result of a miscalculation regarding the capacity of its system, the company said late Tuesday.

Gmail may be out of beta, but it wasn't ready for prime time Tuesday.

(Credit: Google)

Gmail was down from about 12:30 p.m. PDT Tuesday to about 2:30 p.m. PDT, affecting millions of Gmail customers who depend on the service for everything from fantasy football roster updates to business-critical information. The problem was caused by a classic cascade in which servers became overwhelmed with traffic in rapid succession.

According to Google, the problem began when it took several Gmail servers offline for maintenance, a routine procedure that normally is transparent to users. However, the twist this time around was that Google had made some changes to the routers that direct Gmail traffic to servers in hopes of improving reliability, and those changes backfired.

"As we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers--servers which direct web queries to the appropriate Gmail server for response," Google said in a post to its Gmail blog late Tuesday.

"At about 12:30 p.m. Pacific a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we're too slow!' This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded," wrote Ben Treynor, vice president of engineering and site reliability czar.
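The cascade Treynor describes can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Google's actual routing logic: the function name `simulate_cascade`, the per-router capacities, and the load figures are all hypothetical. It assumes total traffic is spread evenly across healthy routers and that any router pushed past its capacity drops out and hands its share to the rest.

```python
def simulate_cascade(capacities, total_load):
    """Return the number of healthy routers after each round.

    Hypothetical model: load is split evenly across healthy routers,
    and a router whose share exceeds its capacity stops taking traffic.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        per_router = total_load / len(healthy)
        # Routers that can still absorb their even share of the load.
        survivors = [c for c in healthy if c >= per_router]
        if len(survivors) == len(healthy):
            break  # stable: nobody is overloaded
        healthy = survivors
        history.append(len(healthy))
    return history


if __name__ == "__main__":
    # Two slightly weaker routers tip over first and take the rest with them.
    print(simulate_cascade([100] * 8 + [90, 90], total_load=920))
    # Same load, but the remaining routers have spare headroom.
    print(simulate_cascade([130] * 8 + [90, 90], total_load=920))
```

In the first run the loss of two routers raises everyone else's share above capacity, so the whole pool collapses. In the second the survivors have enough headroom to absorb the extra traffic and the failure stays contained.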

Google fixed the problem by allocating traffic across the rest of its prodigious network, a luxury that it enjoys given the resources it has put in place to operate the world's leading search engine. But what's next?

Google said it would focus on making sure that the request routers have sufficient headroom to handle future spikes in demand, as well as figuring out a way to make sure that problems in one sector can be isolated without bringing down the entire service. "We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements--Gmail remains more than 99.9% available to all users, and we're committed to keeping events like today's notable for their rarity," Treynor wrote.

Several Google Apps customers who use Gmail for internal e-mail at their businesses and organizations did not return calls Tuesday seeking information on the degree to which they were affected, making it difficult to know the magnitude of the failure. However, Google has put an awful lot of time and money this year behind promoting Gmail as a back-end e-mail software alternative to products from Microsoft and IBM, and embarrassments like this will not help it sell the service to other organizations.

"We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there's a problem with the service," Treynor wrote. "Thus, right up front, I'd like to apologize to all of you--today's outage was a Big Deal, and we're treating it as such."

Tom Krazit writes about the ever-expanding world of Internet search, including Google, Yahoo, online advertising, and portals, as well as the evolution of mobile computing. He has written about traditional PC companies, chip manufacturers, and mobile computers, spending the last three years covering Apple.
Showing 1 of 2 pages (73 Comments)
by cvaldes1831 September 1, 2009 9:10 PM PDT
99.9% ("three-nines") reliability is insufficient. That equals over 525 minutes (8.75 hours) of downtime.

Google Apps (including Gmail) really need to be a "five-nines" reliability (five minutes of downtime annually) if they want to be considered as serious "cloud computing" leaders.
by halfbeer September 1, 2009 9:20 PM PDT
That's easy enough to say, but do you manage a service with as much load as Google's? I'm not saying that they didn't screw up, but I'd guess that they have more mail activity by now than Microsoft or Yahoo ever did through their free email services (but I could be wrong). I personally rely on it for my home email because I still trust Google more than my ISP to not screw up email (my ISP, a major cable provider in the southeast, lost my account once by accident).
by sflocal September 1, 2009 9:26 PM PDT
99.999% uptime would mean a little over 5 hours of downtime. To my knowledge, they haven't yet reached that much downtime. So what's your point?
by cvaldes1831 September 1, 2009 9:26 PM PDT
Gmail is ranked third amongst webmail providers. Yahoo and Hotmail service more users.

I have dozens of accounts between the three providers and Gmail is more unreliable than Yahoo or Hotmail. That says a lot, especially when Hotmail sucked eggs for years after they were acquired by Microsoft (they were originally a FreeBSD house).
by Lerianis5 September 1, 2009 9:31 PM PDT
Sorry, but no. Even PRIVATE BUSINESSES don't have that much 'uptime' when it comes down to it, so why would you expect Google to? 99.9% reliability is MORE than enough for anything except our Defense Department and emergency help providers, i.e. hospitals, police, etc.
by cvaldes1831 September 1, 2009 9:47 PM PDT
You are wrong.

Yahoo Mail has better uptime than Gmail. Far better. They are probably close to "five-nines" reliability.

As to consumer-grade services, the phone company has "five-nines" reliability, at least for my cheapo residential landline. Same with OTA broadcast TV service. Same with the power company.

If Google wants their "cloud computing" services to be considered like utilities, they have a LONG way to go before they can boast about that sort of reliability.
by cvaldes1831 September 1, 2009 9:51 PM PDT
@sflocal:

Your math and/or reading comprehension is bad. "Five-nines" reliability means five minutes of downtime annually.
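The "nines" arithmetic being argued over in this thread is easy to check. The snippet below is a small sketch rather than anything from Google or CNET; the helper name `downtime_minutes_per_year` is made up for illustration, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(nines):
    """Annual downtime allowed by an availability of `nines` nines.

    Three nines means 99.9% available, so 1/1000 of the year may be down;
    five nines means 99.999%, so 1/100000 of the year may be down.
    """
    return MINUTES_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for n in (3, 4, 5):
        minutes = downtime_minutes_per_year(n)
        print(f"{n} nines: {minutes:.3f} minutes/year ({minutes / 60:.2f} hours)")
```

Under that assumption, three nines allows about 525.6 minutes (roughly 8.76 hours) a year and five nines allows about 5.3 minutes. A single two-hour outage therefore uses up a little under a quarter of a three-nines annual budget, but nearly 23 years' worth of a five-nines budget.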
by SaneMind September 2, 2009 1:01 AM PDT
@cvaldes1831 Yeah right ... why don't you manage that infrastructure? C'mon cut some slack for those poor souls at Gmail who worked their a$$ off to bring it back in 2 hours.
by cvaldes1831 September 2, 2009 7:20 AM PDT
@SaneMind:

It never should have happened. They effed up an entire production system in mid-day (these were machines hosted in the U.S.), most likely by not using the two-person rule on production machines.

It's just operational sloppiness.

As a GOOG shareholder, I find their performance to be quite lacking on this particular episode.
by kennyhkw September 4, 2009 2:34 AM PDT
Don't know why people are complaining about 2 hours of downtime. C'mon, Google has made the net a nicer place and we get FREE email, and as far as I know it's the only free email provider to give POP3 and SMTP, which allows me to send my emails through Outlook. I've had Gmail since the day it went Beta and it simply is the best email service, period.
Human error can never be avoided so cut the guys some slack. At least they admitted they made a mistake which is a million times better than other companies out there. Honesty really is the best policy.
by cvaldes1831 September 1, 2009 9:13 PM PDT
Oh yeah, and this still sounds like a "lack of two-person rule" error.

I'm currently okay with handing my personal e-mail service to Google, but they are in no position to brag about their reliability. And yes, I'm a GOOG shareholder.
by Vegaman_Dan September 1, 2009 9:21 PM PDT
That's all well and good, but doesn't explain why my mailbox was rolled back to a week ago and now all the mail from the last week that I deleted is back again. That's a database issue, not router.

Curious.
by The_happy_switcher September 1, 2009 9:35 PM PDT
You sure do complain a lot about something you pay nothing for. Think about how many days of the year it works correctly. I still think Gmail is great and can accept the occasional glitch and inconvenience.
by cvaldes1831 September 1, 2009 9:54 PM PDT
You're probably suffering from a unique and/or isolated issue since this is the first instance I've read about data rollback.

The request router excuse that Google is giving would pertain to HTTP requests (i.e., Gmail webmail interface) and not to POP3 or IMAP communications (which apparently were minimally affected by the outage).

It's still a p!ss-poor performance compared to other larger webmail services like Yahoo Mail or Hotmail.
by El_Gringo_Guapo September 1, 2009 10:36 PM PDT
Not to knock Gmail (I use it too), but between my Gmail account and my Yahoo Mail account, I find Yahoo to be a tad more reliable than Gmail.
by September 2, 2009 12:07 AM PDT
"You sure do complain a lot about something you pay nothing for."

We pay for Google Mail. $50 per year per user, with about 900 users.
Don't assume that every GMail user who is complaining, is complaining about a free service. It's free at the individual/consumer level but many of us are running institutional email on it.

That said, even with a couple of outages I'm still not willing to consider going back to the bad old days of running my own mail servers.
by eadeguzman September 2, 2009 12:28 AM PDT
If what Vegaman_Dan says is true, a week of lost email is not acceptable... yes, even if it's "free". However, Vegaman_Dan, it looks like it's an isolated issue as cvaldes1831 pointed out -- so good luck complaining about it.

CNET: As a follow-up article can you research into the reliability comparison between the top 3 email providers Yahoo, Hotmail, and GMail? Or does anybody in this thread know the actual numbers?

Because of Google's dominance in the web, Google news like this may attract more attention than when Yahoo mail goes down, perhaps?
by tenenbaum24 September 1, 2009 9:44 PM PDT
Looks like Harrison Ford is super pissed about the gmail outage

http://tinyurl.com/ntpjut

Google is screwed
by cougar888 September 2, 2009 8:22 AM PDT
funny
by McGyver777 September 1, 2009 10:02 PM PDT
There is nothing like Gmail's conversational style email out there. It makes tracking emails ("conversations") very easy. Their search within the email client is incredible at finding stuff I have lost in the piles of data. Their TASK (from labs) and their UNDO feature are lifesavers for my BUSINESS life as a small business owner. Their downtime (while slowing things down) is something I can work around for the small time in which they are trying to update, fix, or improve the network. If you read it correctly you will note they were trying to make things even better and ran into a problem. Humans do that. I trust Google with my business and private email because they have earned that trust, and while the downtime today wasn't pleasant, I can't complain when I have an entire suite of programs (gmail, docs, tasks, calendar, contacts, news, photos, mapping, streetview, Earth, etc.) available to me which cost me and my company NOTHING. I keep thinking of the thousands of dollars I would have to spend to replace this software and then I realize the downtime was NOTHING compared to what they have saved me since I opened the business and chose to go "open source" and "cloud." I have always feared (in the back of my mind) that Google is Skynet...you gotta admit, for big brother they sure have a great company rep. I LIKE Google. I have a hard time saying that about any other "large" corporation. And I don't think I can even put into words why I like them so much...but I do know it is because I can find what I need through their search, get my work done for free, and have reliability that I couldn't match myself with my own self-run servers. O.K., so I can put it into words....
by Mweaver2k9 September 1, 2009 10:37 PM PDT
If the conversational style makes tracking emails so easy, then why do you need to search to find "stuff you have lost in the piles of data"? This is way off topic, but their conversational style has a long way to go. There are way too many quirks in it right now, which is why the option to disable it should be available.
by mbertwave September 2, 2009 8:44 AM PDT
When multiple people reply to an email thread, gmail's conversational mode is a disaster. The fact that after all these years Google hasn't fixed this makes me wonder what the heck they are thinking.

Gmail going down is big news because Google wants applications to move to the cloud so every time their service is down it highlights the risk having a single vendor serve up apps that are integral to today's businesses. Imagine if 80% of companies used Google Apps. Business across the world would come to a halt.

These outages will keep Google Apps a novelty. Only non-mission critical apps will move to the cloud.
by stockyjoe September 2, 2009 8:32 PM PDT
I think its conversation style IS annoying. And the sort features in Gmail compared to Yahoo suck.
by OctoChops September 1, 2009 10:03 PM PDT
To put things in perspective for everyone,

My TIMEWARNER CABLE Road Runner service goes out more often every month than GMAIL does.

You don't see news articles every day the moment Time Warner Cable's horrific service goes out! Did I mention they control my TV, Internet, AND Phone?
by lumbee2 September 1, 2009 10:58 PM PDT
Ditto on that!!
by adrollz September 2, 2009 2:38 AM PDT
I have never heard of Hotmail or Yahoo Mail being down, and that would be the only true comparison due to business accounts that rely on doing business through such services... Do you realize there is a financial cost for businesses around the globe attached to such outages? I would not want this, especially when I am paying for it.

That said, maybe Hotmail and Yahoo have experienced such outages but they just have not garnered as much media attention as Gmail did.

Also, though gmail was not reachable via website, I and everyone I know had no problem accessing our mails via POP or IMAP account which I use on my phone to check gmail.
by winstein September 2, 2009 6:08 AM PDT
Yes, Hotmail was down and had problem for several days a few years ago. Yahoo had similar years back.
by nemrel September 2, 2009 6:34 AM PDT
@adrollz - I know when I used Hotmail and Yahoo Mail I experienced outages from them. It was rare, but it did happen. But there's a huge difference in Hotmail and Yahoo Mail when compared to Gmail. If I don't log into Hotmail or Yahoo Mail every 30 or 60 days my account is deactivated and ALL email sent to them is bounced - reactivation is easy - but doesn't bring back any emails sent while my account was deactivated. Gmail promises to never retire old accounts or accounts that haven't been logged into in some time. In fact I have many Gmail accounts that I haven't logged into in over a year.

For some reason Gmail outages do garner more attention & news. For what we get for free (almost unlimited storage, email forwarding, HUGE file attachment sizes, ultimate SPAM detection, IMAP & POP, no dormancy if you don't log in every month, and so much more), these rare outages are nothing really. Now if I was an organization that paid for Gmail I might have a different tone. What Google needs to do is separate the paid version from the free version. They need to make sure that NO long term outage affects the people who pay for the service. But like you said POP & IMAP access for Gmail seemed to work fine during all of this.
by jsbono September 1, 2009 10:42 PM PDT
One would think that a company with a 144B market cap could do better - today was not a "technical" issue but a management issue. That shouldn't happen with a company with virtually unlimited resources. Business is email, and I know we were shut down for hours today. NOT GOOD! And google hosted services are NOT free once you get to a certain number of accounts. Just because some people have free accounts, doesn't mean we should expect lesser reliability - that's an argument without merit.
by smpimacG5 September 2, 2009 6:33 AM PDT
Oh, come on! Do you really think that IBM, Cisco Systems, Boeing, or ANY company has never had technical issues with email? I have worked at some of the largest Fortune 100 companies in my 20 years and EVERY company has had issues. At times for days! So 2 hours is NOTHING, and go cry in the corner if you couldn't send a %^$# email. Pick up the phone if it's THAT important!

I say GREAT JOB to the Google/GMail team for recovering so quickly, and all you people who need to whine, because YOU think, for some odd reason, you're better than everyone else...PLEASE get over yourselves!

99.9% uptime is better than 99.9% of every other email provider, free or paid.
by 10012-1a September 1, 2009 11:41 PM PDT
Sounds like a major "Oops" at the change management or capacity planning level.

Once the wave of traffic/requests grows to critical mass... all one can do is deny traffic and scramble to recover services and/or re-route traffic. Since Gmail is widely used around the globe this could be a rather large feat to pull off from an operations point of view.

Given the volume of traffic Google/Gmail has acquired over the years, I'm impressed they recovered as fast as they did. My hat's off to the many Google/Gmail Operations staff that brought Gmail back online as quickly as they did.

Without naming other web sites... this is nothing new. It's happened before and will most likely happen again.
by stubbyns September 1, 2009 11:41 PM PDT
I didn't even notice it was down
by 10012-1a September 1, 2009 11:48 PM PDT
My hat's off to the Gmail Operations staff for recovering as quickly as they did. Capacity issues on any site are always a challenge.
by reighman September 2, 2009 4:42 AM PDT
Amen. Very well said.
by RyanMPLS September 2, 2009 12:47 AM PDT
My question is why did this happen at 1230PDT on a Tuesday afternoon? These are the kind of changes that should happen after hours and not mid-day. If you can't do that - how about a weekend? Or maybe a long weekend that happens to be coming up (due to Labor Day in the United States).

Heck - I have several 25 user environments vs. Google's 125M users, that I have things scheduled for this weekend because it gives us a little extra time to figure things out in the event that things go wrong...
by SaneMind September 2, 2009 12:56 AM PDT
@RyanMPLS You do know that there is a world outside the US? PDT doesn't apply everywhere and neither does the long weekend due to Labor Day.

Coming to Gmail --> This just shows that no matter how much money you put into infrastructure ... it's never enough. Sometimes being too popular can be a curse ;-)
by Hunnter2k3 September 2, 2009 4:22 AM PDT
Both of you have a good point actually.

I'm not entirely sure if Gmail's servers are located globally or not, but if they are, then they could do maintenance after hours.
And by after hours, roughly around the time after the RUSH of people checking e-mails.
The above could actually be gathered from their server logs pretty easily.
by Squashman2 September 2, 2009 6:52 AM PDT
As you can read from the article, Google does have the infrastructure to take things down whenever they want. They just had some things configured incorrectly which caused their infrastructure to fail. They made the adjustment without bringing those servers back online and just kept going. No big deal.
by shootfirst September 2, 2009 8:41 AM PDT
Learn to read dude. The routers got overwhelmed and guess what lunch time is when everyone checks email. They probably rolled out the change to the routers at low use time and when traffic picked up that is when it hit the fan. Use a little logic and less caffeine, works better that way. Monday/Tuesday is probably a low usage time, weekends everyone is probably checking their email since they aren't at work. Just because your environment works one way doesn't mean they all do, not like Google can turn off their systems on the weekends.
by September 3, 2009 5:20 AM PDT
I have to agree. Although Google has a worldwide footprint I doubt their smallest user base is in North or South America. Anyone who has ever managed an IT environment knows you do maintenance during off-peak hours because something can always go wrong. And I am pretty sure that the majority of their users do not work on Sat or perhaps Sunday. If I relied on Google for any business-critical needs I would certainly rethink my position and move to a company with a process that has the least possible chance of an adverse impact on my business. I think I would really avoid their "cloud" before it rained on me.
by Kev-LG September 2, 2009 2:32 AM PDT
Funny - I didn't notice it had gone down until I read it on Twitter, and I wanted to check it right away. If I wouldn't have seen that, I probably wouldn't have noticed!
by alexpoho September 2, 2009 6:41 AM PDT
same here
by PatrickGlines September 2, 2009 3:56 AM PDT
Maybe it fit with the outage, the day before, of one of the biggest ISPs in Canada (link: http://text.dslreports.com/forum/remark,20094988), so everyone switched to Gmail!
by Zenplace4 September 2, 2009 4:46 AM PDT
Mail was down for a couple of hours?! ... Boo-Hoo-Hoo. If this "tragic" event put a kink in your day ... or if you spent more than ten minutes complaining about Google's "horrible" 99.9% uptime service ... GET OVER IT!

Stuff happens and life goes on. And "NO", just because G-mail is your enterprise-level e-mail, that does not somehow exempt it from the "absolute law of imperfection." People, software, machines, processes, weather, governments, and your spouse...occasionally fail to perform as we wish or require. As great as they are, Google is no exception.

If you are still upset ... I dare you to consider switching back to Yahoo. Didn't think so...
by strongpimphand September 2, 2009 8:40 AM PDT
...What in the world are you rambling about?

Downtime should never occur in normal business hours. NEVER. Smart companies take risks early in the morning.

If this downtime occurred...from midnight to 2 am...it would be a non-story basically. It happened instead during peak business hours. That's ludicrous!

Ten minutes can mean the difference in getting a business proposal and responding to a last-minute deadline. It's like if 911 in your local town went down for 10 minutes. I'm sure you wouldn't have such an arrogant "GET OVER IT" attitude.

Think before you type.
by williambertram September 2, 2009 5:03 AM PDT
I agree with all the people who say "So What". There is no SLA with free E-Mail. Quality control for GMail is that people will use something else if they don't like it. Since there is nothing better, people continue to use it.
by Squashman2 September 2, 2009 6:49 AM PDT
But there are people who pay for Google Hosted Apps for their domain which includes their email. They need reliability.
by shootfirst September 2, 2009 8:38 AM PDT
Squashman2

Do you really think you get reliability for $50 a year? That is funny. Reliability costs tons of money and when you use a system that multiple companies and people use you will never be guaranteed reliability. The cloud is the worst guarantee for reliability you will ever have. Reliability also equals redundancy, if you don't have a backup plan you only have yourself to blame.
by Spats30 September 2, 2009 6:53 AM PDT
They should have performed this "maintenance" before everyone went back to school and logged into their GoogleApp (in their EDU space) at the same time. This might have just been poor timing in addition to a poorly thought out "improvement" release.

If Google wants to promote itself and remain in the EDU sector, then they'll need to start abiding by the same outage windows and consider the performance spikes that schools typically see. The back-to-school week was just awful timing.
by shootfirst September 2, 2009 8:35 AM PDT
Maybe you should think before you log a comment. Maintenance comes when maintenance comes. You can't dictate that all maintenance should happen during the summer since not all machines will have issues at the same time and Google has a lot of machines. If your school wants reliability and to control maintenance schedules they can host the mail servers themselves, but that will never happen since most school IT don't have resources or staff capable of handling the load. Once you entrust a 3rd party to provide service you are at their beck and call, this is one of the major failings of the cloud, schools should have redundant plans so that they are using multiple clouds so when one goes down they can still be active, but they won't do that since it makes too much sense. BTW Google is probably doing more maintenance right now and there are no outage windows that are planned. Good thing you are in school and still learning.
by QA_Tester September 2, 2009 4:15 PM PDT
Go back to school and learn how to read. The outage was the result of a bandwidth constraint, as you can read from the GOOG statement: "Google's nearly two-hour Gmail outage Tuesday was the result of a miscalculation regarding the capacity of its system, the company said late Tuesday." This is more of a capacity planning issue.

EDU sector does maintenance during school year too.
by austinofohio September 2, 2009 6:58 AM PDT
Good job on the recovery, Goog. I use them as service demands it and recommend them to all my pals. These things happen and we humans aren't made perfect, so why should we expect our creations to be perfect? It's common sense.
by giant_david September 2, 2009 7:18 AM PDT
It was interesting to get latest information from twitter yesterday, since I am a Gmail user. Twitter is an amazing tool, it is arguably becoming the INTERNET neural system.
by RobVaughn September 2, 2009 11:51 AM PDT
I'll take the contrarian bet on this one: Twitter is the new RSS, and like RSS, will be used by geeks and people who want to socialize online, but will never reach critical mass, at least as a business tool.

About Relevant Results

Relevant Results focuses on the big Internet companies of our time, tracking the evolution of search, communication, and business on the Web. Tom Krazit examines how a shift to mobile computing and the growing demand for online content affect our understanding of how to deliver information in the 21st century, in between bemoaning the state of the New York Mets and searching for the perfect IPA.
