October 7, 2009 1:54 PM PDT

Google: Computer memory flakier than expected

by Stephen Shankland

Wondering why your computer just crashed again? Its memory might be to blame, according to real-world Google research that finds error rates higher than what earlier work showed.

With hundreds of thousands of computers in its data centers, Google can collect an abundance of real-world data about how those machines actually work. That's exactly what the company did for a research paper that found error rates are surprisingly high.

"We found the incidence of memory errors and the range of error rates across different DIMMs (dual in-line memory modules) to be much higher than previously reported," according to the paper, jointly written by Bianca Schroeder, a professor at the University of Toronto, and Google's Eduardo Pinheiro and Wolf-Dietrich Weber. "Memory errors are not rare events."

The probability of an uncorrected memory error goes way up if a memory module has experienced a correctible error within the most recent month--431 times more likely in some cases.


(Credit: Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber)

How many errors? On average, about one in three Google servers experienced a correctable memory error each year and one in a hundred an uncorrectable error, an event that typically causes a crash.

4,000 errors per year
That may not sound like a high fraction, but bear these factors in mind, too: each memory module experienced an average of nearly 4,000 correctible errors per year, and unlike your PC, Google servers use error correction code (ECC) that can nip most of those problems in the bud. That means a correctable error on a Google machine likely is an uncorrectable error on your computer, said Peter Glaskowsky, an analyst at the Envisioneering Group (and member of CNET's blog network).
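
The figures of one server in three and nearly 4,000 errors per module can both hold only if errors are concentrated in a minority of modules. As a rough sketch in Python, assuming purely hypothetical shares of affected modules (the article gives none), this is how large the per-module count becomes on the modules that do see errors:

```python
# Illustrative arithmetic only. MEAN_ERRORS_PER_MODULE is the article's figure;
# the affected fractions tried below are hypothetical, not from the study.
MEAN_ERRORS_PER_MODULE = 4000  # "nearly 4,000" correctable errors per module per year


def mean_errors_among_affected(affected_fraction: float) -> float:
    """Average yearly errors on modules that see any error at all,
    assuming every other module sees none."""
    if not 0 < affected_fraction <= 1:
        raise ValueError("affected_fraction must be in (0, 1]")
    return MEAN_ERRORS_PER_MODULE / affected_fraction


for fraction in (0.5, 0.1, 0.05):
    per_module = mean_errors_among_affected(fraction)
    print(f"{fraction:.0%} of modules affected: {per_module:,.0f} errors/year each")
```

The smaller the affected share, the more an overall average is dominated by a few very noisy modules.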

ECC detects where a memory cell that should have stored a one ended up with a zero or vice versa, and Google also uses some higher-end error correction technology called chipkill, too, the paper said. The study measured the majority of Google's servers, gathering data for nearly two and a half years, making it the first study at such a scale, the authors said.
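
To make the idea of correcting a flipped bit concrete, here is a minimal Python sketch of a toy Hamming(7,4) code. It is an illustration of the principle only, not the scheme Google's servers use: ECC memory generally protects much wider words, and chipkill is stronger still. The function names are made up for this example.

```python
# Toy Hamming(7,4) code: 4 data bits protected by 3 parity bits.
# Shows the principle of single-bit correction; real ECC DIMMs use wider codes.

def encode(data):
    """Encode data bits [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def correct(word):
    """Return (repaired codeword, 1-based position of the flipped bit or 0)."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome spells out the bad position
    if position:
        w[position - 1] ^= 1  # flip the bad bit back
    return w, position


def decode(word):
    """Pull the 4 data bits back out of a codeword."""
    return [word[2], word[4], word[5], word[6]]


data = [1, 0, 1, 1]
stored = encode(data)
stored[4] ^= 1  # simulate a memory cell flipping (position 5)
repaired, position = correct(stored)
print("flipped position:", position, "data intact:", decode(repaired) == data)
```

Two flipped bits in one codeword are more than this toy code can repair, which is the analogue of an uncorrectable error.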

Previous research, such as some data from a 300-computer cluster, showed that memory modules had correctable error rates of 200 to 5,000 failures per billion hours of operation. Google, though, found the rate much higher: 25,000 to 75,000 failures per billion hours.
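
Rates quoted per billion hours are hard to picture, so the short Python sketch below converts them to failures per year of continuous operation. It is plain unit conversion on the article's numbers; the article does not say what amount of memory each rate applies to, so the results are per whatever unit the original rate was measured over.

```python
# Convert "failures per billion hours" into failures per year of
# round-the-clock operation. Rates are the figures quoted in the article.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def failures_per_year(failures_per_billion_hours: float) -> float:
    """Expected failures in one year of continuous operation."""
    return failures_per_billion_hours * HOURS_PER_YEAR / 1e9


quoted_rates = [
    ("earlier research, low end", 200),
    ("earlier research, high end", 5_000),
    ("Google study, low end", 25_000),
    ("Google study, high end", 75_000),
]

for label, rate in quoted_rates:
    print(f"{label}: {failures_per_year(rate):.4f} failures per year")
```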

While memory errors can cause serious problems, they're a lot less serious for PCs than for servers, Glaskowsky said. That's because servers keep a lot of data in memory, writing it periodically to the relative safe haven of a hard drive, whereas most of a PC's memory holds just application or operating system files or perhaps some content that's being seen but not edited.

"Mostly consumer PCs aren't manipulating huge amounts of data in memory," Glaskowsky said. "In many cases it's just for viewing purposes."

But the study's results are causing some to rethink their software approach. One Google Chrome programmer, John Abd-El-Malek, suggested that the browser's database code be split off into a separate process from the rest of the browser code to cut down on corruption problems.

"Even if only a small fraction of these are harmful, spread over millions of users that's a lot of corruption," he wrote. He failed to convince at least some of his peers of his particular approach, but one skeptic, Scott Hess, responded, "I can see how it would make it useful to minimize how much in-memory data SQLite keeps, regardless of where SQLite lives."

Other myths debunked
The paper also challenged some other beliefs about memory.

• Temperature isn't such a big deal.

Higher temperatures generally cause higher error rates, but differences in temperature at Google's data center "had a marginal impact on the incidence of memory errors." However, system utilization, which tends to go hand in hand with high temperature, did cause more errors.

• "Hard errors" are more common than "soft errors."

Hard errors, which are irreparable problems with hardware, are more likely at fault than soft errors, which are transient issues caused by events such as random cosmic rays. This finding is interesting "since much previous work has assumed that soft errors are the dominating error mode in DRAM," the authors said, referring to the common dynamic random access memory used for computers' main memory.

• Newer generations of memory modules, such as DDR2, aren't any worse than older ones.

There has been concern that newer memory modules, which pack electronics more tightly, suffer higher error rates. "In fact, DIMMs used in the three most recent platforms exhibit lower correctible error rates than the two older platforms, despite generally higher DIMM capacities," the authors wrote. "This indicates that improvements in technology are able to keep up with adversarial trends in DIMM scaling."

The researchers based this conclusion in part on the evidence that one error in a memory module is a good predictor of another to come--either correctible or uncorrectable. Worse, error rates go up with time:

"We see a surprisingly strong and early effect of age on error rates," the paper said. "Aging in the form of increased correctible error rates sets in after only 10 to 18 months in the field."

Google replaces error-prone memory modules, but it's harder for regular computer users without ECC memory to spot problems. In the olden days of personal computing and into the 1990s, memory was unreliable enough that people ran reliability tests.

But those tests could come back, perhaps built into operating system software, Glaskowsky said: "If error rates are high enough, there may be an argument for running memory tests again."

Stephen Shankland writes about a wide range of technology and products, but has a particular focus on browsers and digital photography. He joined CNET News in 1998 and since then also has covered Google, Yahoo, servers, supercomputing, Linux and open-source software, and science. E-mail Stephen, or follow him on Twitter at http://www.twitter.com/stshank.
Comments (showing 1 of 2 pages, 56 comments)
by dudesmiles October 7, 2009 2:23 PM PDT
they would be fine if apple made their computers. cheap morons.
by Gold_Storm_Mac October 7, 2009 2:31 PM PDT
why does the topic of apple always find its way into any article
by Shankland October 7, 2009 2:43 PM PDT
Google has a fascinating philosophy. It looks at reliability at a higher level than most of us, designing some reliability in at higher levels so individual computer failures aren't as much a problem. That may not be cost-effective for most of us, but it does mean that Google can buy cheaper hardware than full-on servers most folks use. Google spends a *lot* on hardware and software, and they have a pretty good track record with it overall despite some Gmail outages, so I think both "cheap" and "morons" is a bit off base.
by DENOBIN October 7, 2009 2:48 PM PDT
@Gold_Storm_Mac: Because Mac users tend to be an insecure lot and to have poor grammar.
by MEPace October 7, 2009 3:03 PM PDT
@DENOBIN: dont fogret we also aint able to puncuate
by T_Hoff October 7, 2009 3:14 PM PDT
This is about the error rates of the memory modules used in Google's servers. Apple doesn't make its own memory; it uses and sells the same memory that is found in non-Apple products -- but charges a substantial premium for it.
by assman October 7, 2009 5:03 PM PDT
That has to be the most retarded comment I have seen on the internet in at least a week. I mean wow. I am including Youtube comments. I elect you to the moron hall of fame.
by pooyan69 October 8, 2009 6:16 AM PDT
Trolling are we not?
by AkinsBalla October 8, 2009 6:53 AM PDT
Kingston does make a substantial amount of Mac memory modules... Of course, Kingston does do a great job, I've had its HyperX brand in my PC for 5 years now and only a few crashes (I can remember two) have ever occurred. It's true Apple doesn't make much of its own, but I'm sure that doesn't depreciate it as it has notable companies make its hardware... And yes at a premium price.
by Jeremy Chappell October 10, 2009 7:41 AM PDT
@dudesmiles: Google custom design their servers, they are the "ultimate Apple" of server design, the OS, the hardware and the "application" are designed together with the needs of the "application" informing the choices elsewhere. Google's Servers are reliable, cheap, and unavailable to the rest of us (we'd probably not like them much - they are case-less and designed purely for Google). No server maker could make a better machine for Google.
by luke_marsh October 7, 2009 3:08 PM PDT
Maybe that PRAM idea Samsung is making use of isn't that bad a way to go after all, hey.
by ecdfan October 7, 2009 7:20 PM PDT
luke_marsh: Sorry to disappoint you but PRAM (if it ever gets shipped commercially in volume) will be worse than DRAM. Much, much worse! Read this NASA report (page 8):

http://opfm.jpl.nasa.gov/files/MEMORY%20INVESTIGATION%20for%20JEO%20MISSION%20D48262_revA_CL.pdf

"JPL has experienced one error in six BAE CRAM [the only "commercially available" PRAM at that time] devices tested. The cause of the error has not been determined; recall that the devices checked were engineering evaluation samples provided as a courtesy to JPL and were unscreened at the factory"

More details here:

http://ecdfan.blogspot.com/2009/05/how-to-spot-fake-samsung-and-pram.html
by john55440 October 7, 2009 3:15 PM PDT
"may be an argument for running memory tests again"

I don't know how good the tests are, but HP consumer personal computers run preinstalled diagnostic tests once a month, and memory is included in the testing.
by ikramerica--2008 October 7, 2009 6:00 PM PDT
Apple runs a memory check on every reboot in OSX. This can mean a long pre-boot time before the Grey Apple appears on laptops with 4GB of memory like mine, and then the actual OS boots rather quickly, especially on SSD. It's obviously not as intensive as those very long duration tests, but there are times where the memory is found to be faulty and disabled before boot, or the machine refuses to boot indicating a memory error, so the check does do something.
by MD_Willington October 7, 2009 3:23 PM PDT
"That means an correctable error on a Google machine likely is an uncorrectable error on your computer, said Peter Glaskowsky, an analyst at the Envisioneering Group (and member of CNET's blog network). "

Is Pete sure - My Asus M3A motherboard supports ECC modules no problem, just like plenty of other desktop boards do...
by ghaff October 7, 2009 3:39 PM PDT
But the reality is that the vast majority of desktop/notebook computers don't use ECC memory. The option is obviously there (more for homebuilt systems than for those from major vendors, except perhaps engineering workstations or something intended as a server). But as a pretty accurate generalization, the typical personal computer doesn't have ECC memory.
by Shankland October 7, 2009 4:15 PM PDT
There are plenty of motherboards that support ECC, but it's not common in the mainstream PC world to find it in use. Try going to Dell or HP and seeing how many of their desktops and laptops advertise ECC.
by squished October 7, 2009 8:22 PM PDT
Dude you're a sample size of 1. Probably the top three things people consider when buying a computer are CPU, memory, and price. ECC is way way down the list. Most wouldn't be able to describe what it is even if you told them what the acronym was. So why would PC makers include it when it drives up the manufacturing cost while not adding any perceived value to the end consumer.
by richard993 October 8, 2009 2:44 AM PDT
I wouldn't buy a desktop or a server unless it supports ECC. I have ECC in every machine except my laptop... and I've had several of my machines fail ECC checks and blue-screen on me (non-correctable soft errors). It's good that Google has come out with a public report. This proves that enabling checksums in database systems and other important information stores is absolutely essential to any business.

Keep in mind though that I am not an everyday computer user... so the high price for ECC memory may not be worth the price for most people, but it was definitely worth it for me.
by Randys2cents October 7, 2009 3:37 PM PDT
I would be interested in knowing how a rack of servers with Rad-Hard memory would fare against standard type memory. Also, has Google charted the dates of these errors and witnessed any kind of pattern, like sunspots, etc.?
by Goodbye Helicopter October 7, 2009 3:46 PM PDT
Are they going to opensource the tools to monitor this stuff?
by Random_Walk October 7, 2009 4:01 PM PDT
FYI, regarding "ECC detects where a memory cell that should have stored a one ended up with a zero or vice versa, and Google also uses some higher-end error correction technology called chipkill, too, the paper said."

Err, most mainline (and better) OEM servers have/use ECC RAM, and also use chipkill. It's not like Google has some super-secret brewmaster server hardware there...

...now home machines? Yeah - there's a good reason why the Dell or HP most folks have is just that damned cheap. ;)
by mudphud October 7, 2009 5:50 PM PDT
I don't think they were suggesting Google is using something special, just pointing out what types of error correction they are using.
by codynews October 7, 2009 4:40 PM PDT
What's bogus about this is that it's GOOGLE'S findings, and not findings about servers or PCs in general.

Google isn't just a larger buyer of the same stuff everyone else buys. For good or bad, Google has come to the conclusion that CHEAP CHEAP CHEAP is the way to go. Buy the cheapest stuff you can and your savings will pay for the increased failure rate. Are they right? Meh, maybe. They've put their money where their mouth is so I assume so.

But that's beside the point.. Because Google tries to use the cheapest garbage they can under the assumption that that'll pay for extra failures, we shouldn't tie THEIR findings to ANYTHING else.

Don't quote me on this, but I think one of the guys in charge of buying (now building) the google machines said that "if it doesn't break in a year we paid too much". So yeah, let's not use their numbers.

Cody
by ikramerica--2008 October 7, 2009 6:02 PM PDT
That's a good point. Without knowing the supplier and brand of the RAM boards and the chips on the boards, it's not that valuable.
by grossj144 October 8, 2009 6:36 AM PDT
What you fail to understand is that most of the people I know who buy PCs usually purchase the cheapest computers they can. They assume that a $300 computer is as good as a $1000 one, just so long as it doesn't "look" weird or ugly. So, these findings do have some bearing for "real world, average" PC buyers because Google does purchase relatively cheap components...just like a lot of people. Plus, with today's economic situation, if the "average" person wants to buy or throw together a new computer, he/she is most likely going to use similar components to what Google would buy.
by afterhours October 8, 2009 7:08 AM PDT
If we can't quote you on this, then you shouldn't bother to post it because you won't stand by your own claims -- ergo, they are worthless. How do you know Google buys cheap? The Google server I've taken apart was well-engineered in a way even Apple could copy and move up the design food chain. Every Dell server I've worked on is junk in a box -- from the power supplies to the NIC chips, and certainly the CPU. There's a reason they are scrambling to stay afloat (killing off their remaining domestic manufacturing for lack of demand -- the NC assembly plant). Real servers along the lines of Sun and IBM, and smaller server hardware such as Apple and HP -- all of these are fairly well put-together, and not necessarily the bottom-feeding component users you suggest. Google has a vested interest in not touching their equipment more than necessary, so your quote is questionable at best. Cite your sources.
by make_or_break October 8, 2009 9:01 AM PDT
@afterhours,
Spurious comment about Dell and their NC plant; it's shutting down because people aren't buying desktops like they once did, which is what the North Carolina plant is configured to manufacture. This isn't just a Dell phenomenon, but something that's happening industry-wide. People want laptops and netbooks. Besides, considering how much gear is made in the Land of the Great Wall and the financial pressures that would entail for those who don't likewise build there, it would be surprising that ANYONE would maintain manufacturing plants anywhere but China or a locale with similar labor costs.
by Jeremy Chappell October 10, 2009 7:46 AM PDT
No you're not right. Google's hardware is cheap however there are factors that trump that:

1) Lower power consumption. Google care about power usage above purchase cost - if you had that many machines to run so would you!

2) Reliability, Google know system failures are expensive.
by contentcreator--2008 October 7, 2009 5:56 PM PDT
These failure rates should be specified "per GB" to have any meaning --- your old 1 GB desktop vs 64 GB on a server makes a big difference. I.e., 10 failures/year/GB.
by jlopezcnet October 7, 2009 6:15 PM PDT
Chances are google is trying to blame hardware because they are trying to deny any fault in their own code. Let's face it, Android is a flop. It's easier to blame it on bad hardware rather than take ownership of their "web apps" lack of substance.

First they blamed MS, now they blame hardware. Next thing you know they will blame the users.
by mabry77 October 8, 2009 2:24 PM PDT
Get off of Google's nuts. Google is like one of the best things that happened to the internet. All the money they make, they use it to better their company AND our internet experience. Android is not a flop. Microsoft is a flop. What does Microsoft do with all their money? Who knows... They don't use it to help us out. Google does. Before you start putting down Google, do some research.
by heygeo October 9, 2009 9:32 PM PDT
@mabry
Microsoft spends 6 billion a year on R&D... 6 billion.. Google spends just shy of 4 billion... Microsoft's products are profitable.. yeah, that means they're making money.
Last count, aside from the search business, Google hasn't made crap off anything else... YouTube? nope, Gmail? nope, Google Maps? uh-uh .. check their financials. So before you go telling someone else to do some research, you should take your own advice and stop spreading ignorance.
by ferricoxide October 10, 2009 7:45 AM PDT
Err... It's not OS or application code that generates errors within RAM. So, I'm kind of curious as to how, in your mind, "own code" plays into DIMM error-rate phenomena.
by flickrz October 7, 2009 6:16 PM PDT
Of course, they use cheap utility servers to save money.
by sundance808 October 7, 2009 6:27 PM PDT
any possibility of disclosing the stats on a per manufacturer and DIMM type basis?
by Shankland October 7, 2009 10:31 PM PDT
The paper didn't disclose any manufacturers, only server "platforms" and memory classes such as DDR1, DDR2, and FB-DIMM. There were some interesting differences between DDR1 and DDR2 (and not enough data to say anything much about FB-DIMM, which doesn't matter much since it's a dead end). However, Google found that there weren't any differences to speak of among different memory makers. I encourage you to check out the report if you're interested--I linked to it in the second paragraph.
by winstein October 7, 2009 8:58 PM PDT
Most people have no idea what this study was about and make stupid comments.

It has nothing to do with the OS, software applications, or any components in the computer other than the DRAM.

This study is about DRAM reliability. More specifically, the reliability of the ECC DDR1 and ECC DDR2 memory used in Google servers. Almost all desktop computers (yes, that includes Apple's) do not use ECC memory because it is more expensive, although it is more reliable.

What really caught my eye was that DRAMs do "age" and become unreliable over time.
by newnewsreader October 8, 2009 3:15 AM PDT
"On average, about one in three Google servers experienced a correctable memory error each year"

I assume a server has more than one memory module, so from the quote above we can infer that a memory module has at most 1 correctable memory error a year.
But a few lines later you say: "...each memory module experienced an average of nearly 4,000 correctible errors per year..."
This is inconsistent.
How many errors is it?

J.
by mbenedict October 8, 2009 9:42 AM PDT
That's because the vast majority of DIMMs (~92%) are not affected by correctable errors, but the ones which are (~8%) suffer from a massive amount of errors on an annualized basis. So the average is quite large but the distribution is skewed.

In one platform, 20% of the DIMMs were responsible for 99.6% of the platform's errors. Interestingly, on an absolute basis only a relatively small percentage of the DIMMs affected by a correctable error will later get an uncorrectable error (about 2%).

I guess it makes sense. Most DIMMs don't have errors. But if just one bit in a DIMM starts to go bad, then that DIMM will generate massive numbers of errors, but all correctable. It's not until the number of bad bits exceeds the platform capability that an uncorrectable error happens.
by joaompq October 8, 2009 3:57 AM PDT
Most desktop PCs/Macs get a memory error every 18 h (average time, assuming you are using a good memory module), and we can live with that! Most users don't use a desktop for that long in a regular day, and most of the errors won't produce catastrophic results; most of the time they just produce a software glitch. No need for ECC modules in your personal computer unless you are using it in a mission-critical environment (servers, medical equipment). The same applies to hard drives (you can look up the transfer error rates that some manufacturers publish). Most of the errors are corrected by the hardware (chipset). Just think of it like ants at a picnic: they are coming, but we can live with them :)
by Squashman2 October 8, 2009 7:44 AM PDT
Google does a lot of system analysis to keep all their systems running. Did you all see the paper they did on HD failure rates in their server farms? They keep a close eye on everything. I don't know why people complain when Gmail goes down for a couple of hours for a few people now and then. It is a free service and Google does a pretty good job keeping it running for the millions of people that use it.
by heygeo October 9, 2009 9:35 PM PDT
Because it's not a free service anymore... they are trying to sell this to the enterprise crowd.
That's why all these things matter. They lie about their uptime... they claim 99.99%, that's less than 2hrs a year, and they had in excess of 14hrs last year.. so if you're looking up to Google's practices you need to realize they are Bull****
by klor5 October 8, 2009 9:20 AM PDT
I often use Snow Leopard Cache Cleaner but have always been too paranoid to try its Optimise RAM or Clear Free Memory etc. tools. Does anyone have any experience with these or other RAM optimise/repair software for OS X?
by mbenedict October 8, 2009 10:29 PM PDT
There's no need to use them.
by John Sawyer October 9, 2009 11:37 PM PDT
Those options have no effect on RAM hardware reliability--they have to do with OS X's RAM handling methods.
by gjl229 October 9, 2009 9:54 AM PDT
This article, and the comments, show the value in understanding basic statistics and actually reading the material.

It appears that one server in three experienced a correctable memory error in any given year. But "each memory module experienced an average of nearly 4,000 correctible (sic) errors per year".

That means that each server uses less than one ten-thousandth (1/10,000) of a memory module. I don't think memory modules come in thousandths but I think that's what the writing implies.

We don't know the standard deviation so we have to look further. The article reports that "each", i.e., every, memory module experienced a pile of errors in a year. Over several years, each (every) module averaged 4,000 errors per annum. That's not the same thing as saying that total errors averaged over all modules were about 4,000 per year per module. In the latter case, most modules could be error-free while some experienced 20,000 errors per year.

But then, my degree was in sociology and I probably missed something that the geeks among you just know instinctively. Like why Windows or Mac OS or Linux is clearly superior. Full stop. I don't get that, either.
by libertyforall1776 October 9, 2009 7:29 PM PDT
Makes you wonder why a company like Apple, who prides itself in "excellence", does not use ECC memory in their "Pro" line of laptops!!! :-(
by dennisl59 October 10, 2009 7:51 AM PDT
So who is the manufacturer of the DIMMs Google uses in their servers? And if the one they're using is 'so bad', then why don't they change vendors? $$$, that's why. And they calculated it's a risk that Google management is willing to take. So, in my opinion, this is a 'why should I give a damn?' article.