Google: Computer memory flakier than expected
Wondering why your computer just crashed again? Its memory might be to blame, according to real-world Google research that finds error rates higher than what earlier work showed.
With hundreds of thousands of computers in its data centers, Google can collect an abundance of real-world data about how those machines actually work. That's exactly what the company did for a research paper that found error rates are surprisingly high.
"We found the incidence of memory errors and the range of error rates across different DIMMs (dual in-line memory modules) to be much higher than previously reported," according to the paper jointly written by Bianca Schroeder, a professor at the University of Toronto, and Google's Eduardo Pinheiro and Wolf-Dietrich Weber. "Memory errors are not rare events."
The probability of an uncorrected memory error goes way up if a memory module has experienced a correctible error within the most recent month--431 times more likely in some cases. (Credit: Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber)
How many errors? On average, about one in three Google servers experienced a correctable memory error each year and one in a hundred an uncorrectable error, an event that typically causes a crash.
4,000 errors per year
That may not sound like a high fraction, but bear these factors in mind, too: each memory module experienced an average of nearly 4,000 correctible errors per year, and unlike your PC, Google servers use error correction code (ECC) that can nip most of those problems in the bud. That means a correctable error on a Google machine likely is an uncorrectable error on your computer, said Peter Glaskowsky, an analyst at the Envisioneering Group (and member of CNET's blog network).
ECC detects where a memory cell that should have stored a one ended up with a zero or vice versa, and Google also uses some higher-end error correction technology called chipkill, the paper said. The study measured the majority of Google's servers, gathering data for nearly two and a half years, the first study at such scale, the authors said.
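To make the idea of a "correctable" error concrete, here is a minimal sketch of a Hamming(7,4) code, a textbook single-error-correcting scheme. It is an illustration only: the article does not describe the specific ECC or chipkill designs in Google's servers, and the function names below are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, syndrome); a nonzero syndrome is the flipped position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a one that ended up as a zero (or vice versa)
recovered, position = hamming74_correct(stored)
print(recovered, "error was at position", position)
```

A single flipped bit is located and repaired, which is the correctable case. If two bits in the same codeword flip, this simple code can no longer repair the data, which is roughly the situation the article calls an uncorrectable error.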
Previous research, such as some data from a 300-computer cluster, showed that memory modules had correctable error rates of 200 to 5,000 failures per billion hours of operation. Google, though, found the rate much higher: 25,000 to 75,000 failures per billion hours.
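The gap between the two ranges is easier to see with a little arithmetic. The sketch below compares the endpoints and converts a per-billion-hours rate into a per-year rate. One caveat: the article does not say what unit the rates are normalized to (a whole module, a chip, or a unit of capacity), so the per-year results apply to that unstated unit and should not be compared directly with the per-module average of nearly 4,000 errors.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def per_billion_hours_to_per_year(rate):
    """Convert failures per billion hours of operation to failures per year."""
    return rate * HOURS_PER_YEAR / 1e9


earlier_low, earlier_high = 200, 5_000      # previous research, per the article
google_low, google_high = 25_000, 75_000    # Google's findings, per the article

print("low end ratio: ", google_low / earlier_low)    # 125.0
print("high end ratio:", google_high / earlier_high)  # 15.0

for rate in (earlier_low, earlier_high, google_low, google_high):
    print(f"{rate:>6} per billion hours = "
          f"{per_billion_hours_to_per_year(rate):.4f} per year")
```

By these figures, Google's low end is 125 times the earlier low end, and its high end is 15 times the earlier high end.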
While memory errors can cause serious problems, they're a lot less serious for PCs than for servers, Glaskowsky said. That's because servers keep a lot of data in memory, writing it periodically to the relative safe haven of a hard drive, whereas most of a PC's memory holds just application or operating system files or perhaps some content that's being seen but not edited.
"Mostly consumer PCs aren't manipulating huge amounts of data in memory," Glaskowsky said. "In many cases it's just for viewing purposes."
But the study's results are causing some to rethink their software approach. One Google Chrome programmer, John Abd-El-Malek, suggested that the browser's database code be split off into a separate process from the rest of the browser code to cut down on corruption problems.
"Even if only a small fraction of these are harmful, spread over millions of users that's a lot of corruption," he wrote. He failed to convince at least some of his peers of his particular approach, but one skeptic, Scott Hess, responded, "I can see how it would make it useful to minimize how much in-memory data SQLite keeps, regardless of where SQLite lives."
Other myths debunked
The paper also challenged some other beliefs about memory.
Temperature isn't such a big deal.
Higher temperatures generally cause higher error rates, but differences in temperature at Google's data center "had a marginal impact on the incidence of memory errors." However, system utilization, which tends to go hand in hand with high temperature, did cause more errors.
"Hard errors" are more common than "soft errors."
Hard errors, which are irreparable problems with hardware, are more likely at fault than soft errors, which are transient issues caused by events such as random cosmic rays. This finding is interesting "since much previous work has assumed that soft errors are the dominating error mode in DRAM," the authors said, referring to the common dynamic random access memory used for computers' main memory.
Newer generations of memory modules, such as DDR2, aren't any worse than older ones.
There has been concern that newer memory modules, which pack electronics more tightly, suffer higher error rates. "In fact, DIMMs used in the three most recent platforms exhibit lower correctible error rates than the two older platforms, despite generally higher DIMM capacities," the authors wrote. "This indicates that improvements in technology are able to keep up with adversarial trends in DIMM scaling."
The researchers based this conclusion in part on the evidence that one error in a memory module is a good predictor of another to come--either correctible or uncorrectable. Worse, error rates go up with time:
"We see a surprisingly strong and early effect of age on error rates," the paper said. "Aging in the form of increased correctible error rates sets in after only 10 to 18 months in the field."
Google replaces error-prone memory modules, but it's harder for regular computer users without ECC memory to spot problems. In the olden days of personal computing and into the 1990s, memory was unreliable enough that people ran reliability tests.
But those tests could come back, perhaps built into operating system software, Glaskowsky said: "If error rates are high enough, there may be an argument for running memory tests again."
Stephen Shankland writes about a wide range of technology and products, but has a particular focus on browsers and digital photography. He joined CNET News in 1998 and since then also has covered Google, Yahoo, servers, supercomputing, Linux and open-source software, and science. E-mail Stephen, or follow him on Twitter at http://www.twitter.com/stshank. 





http://opfm.jpl.nasa.gov/files/MEMORY%20INVESTIGATION%20for%20JEO%20MISSION%20D48262_revA_CL.pdf
"JPL has experienced one error in six BAE CRAM [the only "commercially available" PRAM at that time] devices tested. The cause of the error has not been determined; recall that the devices checked were engineering evaluation samples provided as a courtesy to JPL and were unscreened at the factory"
More details here:
http://ecdfan.blogspot.com/2009/05/how-to-spot-fake-samsung-and-pram.html
I don't know how good the tests are, but HP consumer personal computers run preinstalled diagnostic tests once a month, and memory is included in the testing.
Is Pete sure - My Asus M3A motherboard supports ECC modules no problem, just like plenty of other desktop boards do...
Keep in mind though that I am not an everyday computer user... so the high price for ECC memory may not be worth the price for most people, but it was definitely worth it for me.
type memory. Also, has Google charted the dates of these errors and witnessed any kind of pattern like
Sun spots etc??
Err, most mainline (and better) OEM servers have/use ECC RAM, and also use chipkill. It's not like Google has some super-secret brewmaster server hardware there...
...now home machines? Yeah - there's a good reason why the Dell or HP most folks have is just that damned cheap. ;)
Google isn't just a larger buyer of the same stuff everyone else buys. For good or bad, Google has come to the conclusion that CHEAP CHEAP CHEAP is the way to go. Buy the cheapest stuff you can and your savings will pay for the increased failure rate. Are they right? Meh, maybe. They've put their money where their mouth is so I assume so.
But that's beside the point. Because Google tries to use the cheapest garbage they can, under the assumption that that'll pay for extra failures, we shouldn't tie THEIR findings to ANYTHING else.
Don't quote me on this, but I think one of the guys in charge of buying (now building) the google machines said that "if it doesn't break in a year we paid too much". So yeah, let's not use their numbers.
Cody
Spurious comment about Dell and their NC plant; it's shutting down because people aren't buying desktops like they once did, which is what the North Carolina plant is configured to manufacture. This isn't just a Dell phenomenon, but something that's happening industry-wide. People want laptops and netbooks. Besides, considering how much gear is made in the Land of the Great Wall and the financial pressures that would entail for those who don't likewise build there, it would be surprising that ANYONE would maintain manufacturing plants anywhere but China or a locale with similar labor costs.
1) Lower power consumption. Google care about power usage above purchase cost - if you had that many machines to run so would you!
2) Reliability, Google know system failures are expensive.
First they blamed MS, now they blame hardware. Next thing you know they will blame the users.
Microsoft spends 6 billion a year on R&D... 6 billion... Google spends just shy of 4 billion... Microsoft's products are profitable... yeah, that means they're making money.
Last count, aside from the search business, Google hasn't made crap off anything else... YouTube? Nope. Gmail? Nope. Google Maps? Uh-uh... check their financials. So before you go telling someone else to do some research, you should take your own advice and stop spreading ignorance.
It has nothing to do with the OS, software applications, or any components in the computer other than the DRAM.
This study is about DRAM's reliability. More specifically, ECC DDR1 and ECC DDR2 reliability that are used in Google servers. Almost all desktop computers (yes that includes Apples) do not use the ECC memory because they are more expensive although they are more reliable.
What really caught my eye was that DRAMs do "age" and become unreliable over time.
I assume a server has more than one memory module, so from the quote above we can infer that a memory module has at most 1 correctable memory error a year.
But a few lines later you say: "...each memory module experienced an average of nearly 4,000 correctible errors per year..."
This is inconsistent.
How many errors is it?
J.
In one platform, 20% of the DIMMs were responsible for 99.6% of the platform's errors. Interestingly, on an absolute basis only a relatively small percentage of the DIMMs affected by a correctable error will later get an uncorrectable error (about 2%).
I guess it makes sense. Most DIMMs don't have errors. But if just one bit in a DIMM starts to go bad, then that DIMM will generate massive numbers of errors, but all correctable. It's not until the number of bad bits exceed the platform capability that an uncorrectable error happens.
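The concentration described above also explains how an average of nearly 4,000 errors per module can coexist with most modules being error-free. Here is a small sketch with a made-up fleet of 1,000 DIMMs. It borrows the 20% and 99.6% figures quoted for one platform and the article's 4,000 average, and combining them in one fleet is an assumption made only for illustration. The count of lightly affected DIMMs is also invented.

```python
from statistics import mean, median

N_DIMMS = 1_000                  # hypothetical fleet size
MEAN_ERRORS = 4_000              # average errors per DIMM per year (from the article)
HEAVY_SHARE_OF_DIMMS = 0.20      # 20% of DIMMs ...
HEAVY_SHARE_OF_ERRORS = 0.996    # ... produce 99.6% of the errors
LIGHT_DIMMS_WITH_ERRORS = 80     # hypothetical: DIMMs sharing the remaining 0.4%

total_errors = N_DIMMS * MEAN_ERRORS
n_heavy = round(N_DIMMS * HEAVY_SHARE_OF_DIMMS)
heavy_total = round(total_errors * HEAVY_SHARE_OF_ERRORS)
light_total = total_errors - heavy_total

per_heavy = heavy_total // n_heavy                   # errors on each heavy DIMM
per_light = light_total // LIGHT_DIMMS_WITH_ERRORS   # errors on each light DIMM
n_clean = N_DIMMS - n_heavy - LIGHT_DIMMS_WITH_ERRORS

errors = [per_heavy] * n_heavy + [per_light] * LIGHT_DIMMS_WITH_ERRORS + [0] * n_clean

print("mean errors per DIMM:  ", mean(errors))
print("median errors per DIMM:", median(errors))
print("error-free DIMMs:      ", n_clean, "of", N_DIMMS)
print("errors on a heavy DIMM:", per_heavy)
```

In this made-up fleet the mean is 4,000 while the median is zero, so "average per module" and "what a typical module sees" are very different numbers.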
That's why all these things matter: they lie about their uptime. They claim 99.99%, which is less than 2 hours a year, but they had in excess of 14 hours last year. So if you're looking up to Google's practices, you need to realize they are bull****.
It appears that one server in three experienced a correctable memory error in any given year. But "each memory module experienced an average of nearly 4,000 correctible (sic) errors per year".
That means that each server uses less than one ten-thousandth (1/10,000) of a memory module. I don't think memory modules come in thousandths but I think that's what the writing implies.
We don't know the standard deviation so we have to look further. The article reports that "each", i.e., every, memory module experienced a pile of errors in a year. Over several years, each (every) module averaged 4,000 errors per annum. That's not the same thing as saying that total errors averaged over all modules were about 4,000 per year per module. In the latter case, most modules could be error-free while some experienced 20,000 errors per year.
But then, my degree was in sociology and I probably missed something that the geeks among you just know instinctively. Like why Windows or Mac OS or Linux is clearly superior. Full stop. I don't get that, either.
- by dennisl59 October 10, 2009 7:51 AM PDT
- So who is the manufacturer of the DIMMs Google uses in their servers? And if the one they're using is 'so bad', then why don't they change vendors? $$$, that's why. And they calculated it's a risk that Google management is willing to take. So, in my opinion, this is a 'why should I give a damn?' article.