August 20, 2007 2:45 PM PDT

Live from Hot Chips 19: Session 2, Nvidia

by Peter Glaskowsky

Welcome back to the ongoing Speeds and Feeds coverage of Hot Chips 19 at Stanford. They give us comfy chairs and free Wi-Fi, so blogging about it is the least I can do. By the way, Dean Takahashi of the San Jose Mercury News is also blogging from Hot Chips, so his blog offers another perspective on the event.

Session 2 is the first of two sessions of "Multi-Core and Parallelism" presentations. This one happens to be all about Nvidia. Session 3, up next, will include presentations about AMD's ATI Radeon HD 2900, Intel's 80-core "Tera-Scale" processor, the TRIPS project at the University of Texas at Austin, and the Tile Processor from Tilera.

The first presentation in this session, "The Nvidia GeForce 8800 GPU," is an overview of that chip. As I mentioned in my Siggraph coverage, the 8800 includes 128 processor cores, but there's more to say about it than that.

Unlike the cores of a conventional multicore processor, the cores on a GPU are often all doing the same thing. So the 8800 is designed so that each group of eight cores runs a single program. The cores in a group can be out of step with each other, making the 8800 more flexible than old lock-step SIMD (single instruction, multiple data) designs, but if fewer than eight copies of a given program are needed at a given moment, some of the 8800's 128 cores will sit idle.
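
To make the idle-core point concrete, here is a minimal Python sketch. It assumes only what the paragraph above describes: cores are handed out in fixed groups of eight, and every core in a group runs copies of the same program. The function name `idle_cores` and the simple round-up model are illustrative assumptions, not a description of Nvidia's actual scheduler.

```python
GROUP_SIZE = 8  # cores per group that run the same program (per the description above)

def idle_cores(copies_needed: int, group_size: int = GROUP_SIZE) -> int:
    """Return how many cores sit idle when `copies_needed` copies of one
    program are packed into whole groups of `group_size` cores."""
    groups = -(-copies_needed // group_size)  # ceiling division: partial groups still occupy a full group
    return groups * group_size - copies_needed

for copies in (8, 5, 13):
    print(copies, "copies ->", idle_cores(copies), "idle cores")
```

Under this model, 8 copies fill one group exactly, while 5 copies leave 3 cores of their group unused, and 13 copies need two groups (16 cores) and also leave 3 idle.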

For a single chip, all this adds up to impressive totals: 576 billion floating-point operations per second from these cores, 104 GB/s of memory bandwidth, and 150W typical power consumption when running advanced 3D games and other graphics-hungry applications.
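
Peak figures like these are normally simple products of unit counts and clock rates. The sketch below shows that arithmetic. Only the core count (128) comes from the presentation; the 1.5 GHz shader clock, three floating-point operations per core per clock (a multiply-add counted as two, plus a separate multiply), the 384-bit memory bus, and the 2,160 million transfers per second are assumptions chosen for illustration.

```python
# Peak-rate arithmetic for a GPU. Only CORES comes from the text;
# every clock and per-clock figure below is an assumption for illustration.
CORES = 128
SHADER_CLOCK_MHZ = 1500          # assumed shader clock
FLOPS_PER_CORE_PER_CLOCK = 3     # assumed: multiply-add (2 ops) + multiply (1 op)

BUS_WIDTH_BITS = 384             # assumed memory interface width
TRANSFERS_PER_SEC_M = 2160       # assumed effective memory data rate, millions per second

# cores x clock x ops per clock, converted from MFLOPS to GFLOPS
peak_gflops = CORES * SHADER_CLOCK_MHZ * FLOPS_PER_CORE_PER_CLOCK / 1000
# bytes per transfer x transfers per second, converted from MB/s to GB/s
peak_gb_per_s = (BUS_WIDTH_BITS // 8) * TRANSFERS_PER_SEC_M / 1000

print(f"peak compute:   {peak_gflops:.0f} GFLOPS")   # 576 GFLOPS
print(f"peak bandwidth: {peak_gb_per_s:.2f} GB/s")   # 103.68 GB/s
```

With these inputs, the bandwidth works out to about 103.7 GB/s, which rounds to the 104 GB/s quoted above. Real applications rarely sustain either peak.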

The second presentation is also self-explanatory: "The Nvidia GPU Parallel Computing Architecture & CUDA Programming Model". CUDA (Compute Unified Device Architecture) supports high-level programming of these complicated chips using the C language so that software developers don't have to manage all the low-level hardware details.

CUDA implements a straightforward multithreaded programming model. Developers write software as if it will run on just one processor at a time. There are some restrictions on data access and data sharing, but most of the GPU complexity is hidden. Complete applications are built by combining many of these single-thread programs--potentially thousands of them--and defining when and how these threads are used, and what data they consume and produce.
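
The style of this model can be sketched in Python as a sequential emulation. In real CUDA C, the per-thread function would be marked `__global__`, would read its position from the built-in variables `blockIdx.x`, `blockDim.x`, and `threadIdx.x`, and would be launched with `<<<grid, block>>>` syntax. The names `saxpy_kernel` and `launch` below are hypothetical stand-ins, and the loop simply runs every thread one after another. The sketch shows only how the program is written, not how the GPU executes it.

```python
from typing import Callable, List

def saxpy_kernel(block_idx: int, thread_idx: int, block_dim: int,
                 a: float, x: List[float], y: List[float]) -> None:
    """Body of ONE thread: it computes a single output element.
    Mirrors the CUDA idiom i = blockIdx.x * blockDim.x + threadIdx.x."""
    i = block_idx * block_dim + thread_idx
    if i < len(x):            # guard: the last block may be only partly used
        y[i] = a * x[i] + y[i]

def launch(kernel: Callable[..., None], grid_dim: int, block_dim: int, *args) -> None:
    """Hypothetical sequential stand-in for a kernel launch: runs the
    per-thread body once for every (block, thread) pair in the grid."""
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, t, block_dim, *args)

n = 1000
x = [1.0] * n
y = [2.0] * n
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block  # enough blocks to cover n elements
launch(saxpy_kernel, blocks, threads_per_block, 3.0, x, y)
print(y[0], y[-1])  # 5.0 5.0
```

The kernel body never loops over the data. Each thread handles one element, and the launch configuration decides how many threads exist.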

The critical achievement of CUDA is that programmers write one program for all GPU sizes, which matters because Nvidia makes versions of the GeForce 8000 family with different numbers of cores. Programs don't even know how many cores they're using; CUDA programs work with the hardware to distribute the running threads across the available cores.
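
In CUDA, threads are grouped into blocks, and the hardware assigns those blocks to whatever processing resources a given chip has. The Python sketch below imitates that separation, assuming a hypothetical `run_on` "runtime" in which a `ThreadPoolExecutor` worker stands in for a core. The per-block code never sees the core count, and the result is identical whether 2 or 16 workers are available. Python threads do not run CPU-bound code in parallel, so this illustrates the scheduling structure only, not any speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

BLOCK_DIM = 64  # threads per block (assumed value for illustration)

def run_block(block_idx: int, x: List[float], out: List[float]) -> None:
    """One thread block: each of its threads squares one element.
    Nothing here refers to how many cores exist."""
    for thread_idx in range(BLOCK_DIM):
        i = block_idx * BLOCK_DIM + thread_idx
        if i < len(x):
            out[i] = x[i] * x[i]

def run_on(num_cores: int, x: List[float]) -> List[float]:
    """Hypothetical 'runtime': schedules blocks onto however many workers the
    'hardware' provides. The same block code runs unchanged for any num_cores."""
    out = [0.0] * len(x)
    num_blocks = (len(x) + BLOCK_DIM - 1) // BLOCK_DIM
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        # list() waits for every block and re-raises any exception a block raised
        list(pool.map(lambda b: run_block(b, x, out), range(num_blocks)))
    return out

data = [float(v) for v in range(1000)]
small, large = run_on(2, data), run_on(16, data)
print(small == large)  # True: the answer does not depend on the core count
```

Because each block writes only its own slice of `out`, blocks can run in any order on any number of workers, which is the property that lets one program fit chips of different sizes.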

The final presentation in the session covers issues that arise when running non-graphics applications on Nvidia GPUs. The title is a mouthful: "Performance Insights on Executing Non-Graphics Applications on CUDA on the Nvidia GeForce 8800 GTX." The lead presenter was Professor Wen-mei Hwu of the University of Illinois at Urbana-Champaign, who has been working with Nvidia in this area.

Nvidia's GPUs are designed to support such apps, and Nvidia even makes boards and systems exclusively for non-graphics use, sold as the Tesla family.

Depending on the software, however, the GPU is not necessarily a good platform for non-graphics uses. Apps that are inherently parallel with streaming data flows are good; apps with many serialized operations, especially where conditional tests control the flow of execution, aren't so good.
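
One reason conditional tests hurt is that cores sharing one program cannot all follow different paths at full speed. When cores in a group disagree at a branch, the group runs each path in turn while the cores that took the other path wait. The Python sketch below counts those passes for a single if/else. It uses the group size of eight described earlier; the "one pass per distinct path" cost model is a simplifying assumption, and `group_passes` is a hypothetical name.

```python
from typing import List

GROUP_SIZE = 8  # cores that share one program (simplified model)

def group_passes(take_branch: List[bool], group_size: int = GROUP_SIZE) -> int:
    """Count execution passes for an if/else, assuming a group must run
    every path that at least one of its cores takes, one after another."""
    passes = 0
    for start in range(0, len(take_branch), group_size):
        group = take_branch[start:start + group_size]
        passes += len(set(group))  # 1 if all cores agree, 2 if they diverge
    return passes

uniform = [i < 8 for i in range(16)]         # each group of 8 agrees
divergent = [i % 2 == 0 for i in range(16)]  # every group splits
print(group_passes(uniform), group_passes(divergent))  # 2 4
```

The same 16 decisions cost twice as many passes when they alternate within groups, even though half of them take each branch in both cases.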

The presentation analyzed three sample applications:

  • MRI (magnetic resonance imaging) image reconstruction
  • Fluid dynamics
  • H.264 video encoding

Although all three of these applications are parallelizable to a certain extent, they have different levels of suitability for the GeForce 8800 architecture.

The MRI processing Hwu described runs 416 times faster on an 8800 than on an Athlon 64 2800+ (which I must point out is not a very modern microprocessor; it shipped in 2004).

The fluid-dynamics code is the LBM benchmark from the SPEC CPU2006 suite. This code runs only about 12 times faster on a GPU than on a CPU because of non-ideal memory usage and thread synchronization.

Finally, the H.264 code runs about 20 times faster, but this algorithm, too, is not well optimized for GPUs.

This wide range of performance, even on inherently parallel applications, shows how sensitive GPUs are to algorithms and implementation details. This situation is likely to improve over time (Hwu himself made specific recommendations about how to improve GPU suitability for these algorithms), but there will probably always be applications that run more efficiently on general-purpose processors than on GPUs. Horses for courses, as they say.

Peter N. Glaskowsky is a computer architect in Silicon Valley and a technology analyst for the Envisioneering Group. He has designed chip- and board-level products in the defense and computer industries, managed design teams, and served as editor in chief of the industry newsletter "Microprocessor Report." He is a member of the CNET Blog Network and is not an employee of CNET.