November 10, 2009 7:00 AM PST

Wrapping up Speeds and Feeds, part 2: Reliability

by Peter Glaskowsky

Personal computers have become much more reliable over the last 10 years or so, mostly due to the introduction of advanced operating systems with memory protection and hardware abstraction. The hardware itself has gotten better too; uncorrectable random errors are rare in PCs and extraordinarily rare in server-class systems.

These and other improvements have largely eliminated machine crashes. Blue-screen errors on Windows and kernel panics in Linux and Mac OS X still occur, but much more rarely.

Error-reporting services have become common, helping software developers figure out what went wrong. Most large developers now issue regular patches to fix newly discovered bugs, making systems more reliable between major releases.

All this progress is wonderful, of course, but our PCs still aren't reliable in the way that other consumer products are reliable. Machine crashes are still possible, and any bug can bring down an individual application.

Automobiles, for example, can fail in many ways, but they are still much more reliable than PCs. The risks associated with vehicle failures have been greatly reduced by decades of design refinements. Would you feel safe if PC technology controlled the steering and brakes in your car? Conversely, wouldn't you be more confident in your PC if you knew it was as reliable as your vehicle?

Lagoon Nebula

Can you rely on your system to display this 370-megapixel image?

(Credit: European Southern Observatory (ESO))

PCs are also fragile in response to change. I know I'm always a little nervous the first time I install a new device driver or run a new application. Even without software changes, opening an unusually large image can induce some trepidation. Consider this 370-megapixel image of the Lagoon Nebula available from the European Southern Observatory Web site; how confident are you that all of your image-viewing programs would survive the attempt to open it?
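
A back-of-the-envelope estimate shows why an image this size can be stressful. The sketch below assumes the viewer decodes the whole image into an uncompressed buffer at 3 bytes per pixel (24-bit RGB); that is an assumption for illustration, not a property of the ESO file or of any particular viewer, and `decoded_size_bytes` is a hypothetical helper name.

```python
def decoded_size_bytes(megapixels: float, bytes_per_pixel: int = 3) -> int:
    """Estimate the in-memory size of a fully decoded image.

    Assumes one uncompressed buffer with a fixed number of bytes per pixel.
    Real viewers may tile, downsample, or keep extra working copies.
    """
    pixels = int(megapixels * 1_000_000)
    return pixels * bytes_per_pixel


if __name__ == "__main__":
    for bpp in (3, 4):  # 24-bit RGB and 32-bit RGBA (assumed formats)
        size = decoded_size_bytes(370, bpp)
        print(f"{bpp} bytes/pixel: {size:,} bytes")
```

At 3 bytes per pixel the estimate is 1,110,000,000 bytes for a single copy of the image, before the program allocates anything else.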

And worst of all, PCs are fragile in response to attack. The kinds of problems that are sometimes created accidentally by software bugs are relatively easy to create on purpose.

Minimizing the frequency and consequences of these problems would require tremendous effort from everyone in the industry. Almost every bit of PC hardware and software would have to change. One part of the solution is an extension of the same techniques that make today's PCs more reliable than older models: more hardware-based isolation of one function from another.

The minimal isolation of today's systems is very convenient for software developers, making it easier to write code and achieve high levels of performance. More isolation means more complexity and more overhead, but it improves reliability.

Developers are already taking the first steps in this direction, for example with the process isolation features of the Microsoft Internet Explorer 8 and Google Chrome browsers. But there's much more that can be done.
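
The general idea of process isolation can be sketched in a few lines: run the risky work in a separate operating-system process so that a crash there does not take down the caller. This is only an illustration of the principle using Python's standard `subprocess` module, not a description of how either browser implements it; `run_isolated` is a hypothetical helper name.

```python
import subprocess
import sys
from typing import Tuple


def run_isolated(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run Python source in a child process and report (succeeded, output).

    A crash, nonzero exit, or hang in the child is reported to the caller
    instead of bringing the caller down with it.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if result.returncode != 0:
        return False, result.stderr.strip()
    return True, result.stdout.strip()


if __name__ == "__main__":
    print(run_isolated("print(2 + 2)"))            # well-behaved task
    print(run_isolated("import os; os.abort()"))   # task that crashes hard
    print("parent process is still running")
```

The cost is the one the article describes: starting a process and passing data across the boundary is slower and more awkward than a direct function call.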

Another way to improve reliability is to verify that data and addresses are consistent in range and format with the original intent of the software developer before they are used by the program. Making these checks in software can help; the incidence of failures related to accidental and deliberate buffer-overflow conditions has been dramatically reduced in this way. There's plenty of room for new hardware to help in this process too.
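
A software version of such a check might look like the sketch below, which validates an offset and a declared length against the real buffer size before using them. The record layout (a 4-byte little-endian length followed by the payload) and the name `read_record` are assumptions made up for this example.

```python
import struct

HEADER_SIZE = 4  # assumed layout: 4-byte little-endian length, then payload


def read_record(buf: bytes, offset: int) -> bytes:
    """Return the payload of a length-prefixed record starting at offset.

    Both the offset and the declared length are checked against the actual
    buffer size before they are used, so a corrupt or hostile length field
    is rejected instead of being trusted.
    """
    if offset < 0 or offset + HEADER_SIZE > len(buf):
        raise ValueError("offset outside buffer")
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + HEADER_SIZE
    if length > len(buf) - start:
        raise ValueError("declared length exceeds buffer")
    return buf[start:start + length]


if __name__ == "__main__":
    good = struct.pack("<I", 3) + b"abc"
    print(read_record(good, 0))
    bad = struct.pack("<I", 100) + b"abc"  # claims 100 bytes, has only 3
    try:
        read_record(bad, 0)
    except ValueError as err:
        print("rejected:", err)
```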

There's also work to be done in making it easier to recover from failures, since true hardware failures are inevitable. This is another area where some high-end systems are way ahead of the PC. Fault-tolerant machine architectures have been around for a long time in the aerospace industry, for example.

Historically, fault tolerance has never been practical on the PC because PCs always had only one of each critical subsystem: one processor, one bank of memory, one display channel. Today, PC processors and graphics chips have multiple cores and multiple memory interfaces, creating the potential for redundant operation where it's most needed.
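
One common form of redundant operation is majority voting: run the same computation on several independent units and accept the answer most of them agree on. The sketch below shows only the voting step and stands in plain Python callables for the redundant units; it is an illustration of the technique in general, not of any specific PC or aerospace design, and `vote` is a hypothetical name.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def vote(replicas: Sequence[Callable[[], T]]) -> T:
    """Run every replica and return the value a strict majority agrees on.

    Results must be hashable. If no value is produced by more than half of
    the replicas, the disagreement is reported rather than guessed at.
    """
    results = [replica() for replica in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among replicas")
    return value


if __name__ == "__main__":
    # Two healthy units and one faulty unit (simulated) computing 6 * 7.
    units = [lambda: 6 * 7, lambda: 6 * 7, lambda: 6 * 7 + 1]
    print(vote(units))
```

With three units, a single wrong answer is outvoted; two simultaneous faults that disagree with each other are detected but cannot be corrected.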

Recoverability also implies backups--not just of the contents of disk drives, but even of the live data in memory through checkpointing. And disk backups can be improved too, by making the backup process an integral part of all disk I/O. Modern file systems use journaling to increase reliability; this technique can be extended to allow recovering from errors long after they occur.
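
To make the journaling idea concrete, here is a minimal sketch of an append-only journal for a small key-value store. Every change is written and flushed to the journal before it is applied in memory, so the state can be rebuilt after a crash, and because old records are kept, an earlier state can also be rebuilt later. This is a toy model, not how any real file system implements journaling; the class `JournaledStore`, its methods, and the one-JSON-record-per-line format are all assumptions for the example.

```python
import json
import os
import tempfile
from typing import Any, Dict, Optional


class JournaledStore:
    """Key-value store whose every update is appended to a journal file."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.data: Dict[str, Any] = self._replay()

    def _replay(self, limit: Optional[int] = None) -> Dict[str, Any]:
        """Rebuild state from the journal, optionally stopping after `limit` records."""
        state: Dict[str, Any] = {}
        if not os.path.exists(self.path):
            return state
        with open(self.path, encoding="utf-8") as journal:
            for index, line in enumerate(journal):
                if limit is not None and index >= limit:
                    break
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    break  # incomplete record left by an interrupted write
                state[entry["key"]] = entry["value"]
        return state

    def put(self, key: str, value: Any) -> None:
        """Record the change durably first, then apply it in memory."""
        record = json.dumps({"key": key, "value": value})
        with open(self.path, "a", encoding="utf-8") as journal:
            journal.write(record + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        self.data[key] = value

    def state_after(self, count: int) -> Dict[str, Any]:
        """Return the state as it was after the first `count` records."""
        return self._replay(limit=count)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "journal.log")
        store = JournaledStore(path)
        store.put("title", "draft 1")
        store.put("title", "draft 2 (oops)")
        print(JournaledStore(path).data)   # state rebuilt after a "restart"
        print(store.state_after(1))        # state before the mistaken update
```

The trade-off is visible even in this toy: every update costs a synchronous disk write, and the journal grows without bound unless old records are eventually compacted.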

There will be a heavy price to be paid in complexity and performance for all of these techniques, but the currency for this payment is transistors, and Moore's Law gives us more of those in every new process generation. We need to consider how we want to allocate these transistors. Over time, I believe reliability should account for an increasing portion of them.

Peter N. Glaskowsky is a computer architect in Silicon Valley and a technology analyst for the Envisioneering Group. He has designed chip- and board-level products in the defense and computer industries, managed design teams, and served as editor in chief of the industry newsletter "Microprocessor Report." He is a member of the CNET Blog Network and is not an employee of CNET.
9 Comments
by Argyll November 10, 2009 7:50 AM PST
That photo might be easier to test if it wasn't a 3+ hour download.
by zepol22 November 10, 2009 9:22 AM PST
Took me 20 minutes to download.
Both Windows Photo Gallery and Apple picture viewer failed to open the file. However, Picasa Photo Viewer did open the file on my PC.
by Ebraheem November 10, 2009 8:32 AM PST
"uncorrectable random errors are rare in PCs and extraordinarily rare in server-class systems."
Another CNET article believes otherwise:
http://news.cnet.com/8301-30685_3-10370026-264.html
by Peter N. Glaskowsky November 10, 2009 10:47 AM PST
Yes, and you may have noticed that I was consulted during the production of that article and quoted heavily. :-) I didn't want to re-hash that article in my post, but I knew this would come up in the comments, and I'm prepared!

Just remember that PCs aren't servers. What's considered "rare" in one context may not be in another.

From the research paper, we know that "About a third of all machines in the fleet experience at least one [correctable] memory error per year" and "1.3% of machines are affected by uncorrectable errors per year, with some platforms seeing as many as 2-4% affected."

That's rare in the context of personal computers, where much of the contents of RAM aren't critical to continuing operation anyway. These errors will (and do!) go essentially unnoticed against the background of other reliability problems on PCs. In other words, there are other problems that should be solved first before RAM reliability becomes a problem.

by SteveChicago November 10, 2009 8:47 AM PST
Yeah, but I don't want to pay like $6,000 for a laptop. That is why I have hourly backups. I put up with some annoyances of the OS for the low cost of the overall system. I believe that was one of the founding reasons for the personal computer long ago. Before that, almost everyone worked on terminals off a mainframe.
by Peter N. Glaskowsky November 10, 2009 10:50 AM PST
Well, with Moore's Law working for us and the work of OS developers amortized across hundreds of millions of PC users, it should be possible even today to greatly improve system-level reliability for an added cost of less than $100. Admittedly, that's still a high fraction of the cost of a PC, but it will decline over time.

Personally I'd pay an extra $500 or more without hesitation for a substantially more reliable PC.

by Argyll November 10, 2009 10:34 AM PST
I'm on a 5MB line and it's been more than 4 hours now. Either their server is slowing down with everyone and their brother trying to get the file or something else. I do plan on commenting again, once I get it downloaded. 40k is a joke!
by Peter N. Glaskowsky November 10, 2009 10:55 AM PST
Maybe I should apologize to the good people at ESO for publicizing that link...

by Argyll November 10, 2009 12:50 PM PST
FINALLY! It took six hours to download that photo. My iMac (2008) 2.8 GHz Core 2 Duo, running Adobe Photoshop CS2, opened the document without problems in 30 seconds. Apple's Preview took 45 seconds to complete the task. No crashes, even when applying some filters in Photoshop, which did take time given the size of the file.

I have to think that Linux and Windows PCs would turn in the same time frames and stability. Perhaps system add-ons may contribute to the instability the author mentions. But then again, my iMac is tweaked about as much as it can be.

BTW: It's an awesome photo.