Nvidia chips are now in three of the five fastest supercomputers in the world. How did Nvidia get there so fast?
I spoke with Steve Scott, chief technology officer at Nvidia's Tesla products group, to find out.
First, a quick primer on Tesla and graphics chip-based supercomputing. Tesla processors are basically graphics processing units (GPUs) that have been redesigned for supercomputers. The results are impressive enough that some of the most important supercomputing sites have signed on. The U.S. Department of Energy's Oak Ridge National Laboratory, probably the premier U.S. supercomputing site, will use Tesla processors in its next supercomputer, Titan.
"About 28 percent of all the supercomputer sites use GPGPUs," said Steve Conway, an analyst at IDC, referring to general-purpose GPUs. "Those sites use at least some GPUs. That's about three times what it was three years ago. But this trend is much wider than it is deep. In the sense that the average supercomputer that does have GPUs in it, five percent of those processors are GPUs. The rest are standard (central processing unit) processors. So, it's still pretty early. GPGPUs right now are mainly experimental," Conway said.
Nvidia's Scott is trying to change that. He spent 20 years at supercomputing giant Cray, where he served as CTO. He's been at Nvidia for only three months.
Q: Why the move to Nvidia?
Scott: I worked closely with Nvidia, AMD, and Intel on their road maps. I became absolutely convinced that heterogeneous, hybrid computing [GPUs and CPUs] was the only way to move forward given the power constraints affecting system designs and that Nvidia was the only company that had a viable business strategy.
Q: And the Oak Ridge connection?
Scott: I worked very closely with Oak Ridge over the last few years. When it came time to replace Jaguar [Oak Ridge's current supercomputer], we realized we couldn't do it with normal CPU technology. Basically voltage scaling has ended. We're no longer able to drop the voltage [to make the processors more power efficient].
Q: Can you explain the power efficiency argument?
Scott: Moore's Law is alive and well. We keep getting to add exponentially more transistors to a chip, but we can't run them all. Because the chip will literally burn up. So we're in an environment where we're constrained entirely by power, from a performance design perspective. It's become entirely about performance per watt. About power efficiency. And that's where the [GPU] accelerator technologies really shine.
The amount of energy per FLOP is over seven times lower in a current accelerator than it is in a CPU. The top-line GPU today has over 500 processors on it--compared to 6 or 8 on a CPU die [chip]. The processors are much smaller and have much lower overhead.
Q: Do you see this conversion to GPU-based supercomputing accelerating?
Scott: Take a look at the Top 500 list. Three years ago the very first GPU-enabled computer showed up in the Top 500. Now they're more than doubling each year. At the high end of the list, three of the top five machines are GPU-enabled. And Oak Ridge is going with GPU supercomputing. So, it's a pretty fast ramp.
Q: How does a GPU-centric supercomputer work?
Scott: You need to do the vast majority of your work on processing cores that are designed for energy efficiency [GPUs]. They're designed to execute hundreds of parallel threads efficiently. But you also need a small number of cores to run a single thread very fast. The leftover work that can't be parallelized, you want to run that on a CPU-style core.
High-performance computing code is highly parallel in a distributed memory fashion. In other words, you take your job and split it up over hundreds or thousands of separate nodes. And within each node, there will be parallel [GPU] work and serial [CPU] work. And typically the parallel work is going to be well over 90 percent of the code. Sometimes it's less, sometimes it's 99.9 percent of the code.
In the Titan machine, approximately 90 percent of the total FLOPs of that machine will come from the GPUs. And approximately 10 percent will be on the CPUs. It's 1:1. One CPU coupled with a Tesla GPU.
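Scott's split between parallel work on the GPU and leftover serial work on the CPU lines up with Amdahl's law, the standard formula for how much a program speeds up when only part of it is accelerated. Scott did not cite the formula by name, so the following is only an illustrative sketch in Python. The function name `amdahl_speedup` and the assumed 100x acceleration of the parallel portion are hypothetical choices for the example, not figures from Nvidia or Oak Ridge.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only the parallel fraction of a job is accelerated.

    parallel_fraction: share of the original run time that can be parallelized (0 to 1).
    parallel_speedup:  how many times faster that share runs on the accelerator.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_speedup <= 0:
        raise ValueError("parallel_speedup must be positive")
    # The serial share is not sped up at all; it still takes its full time.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


if __name__ == "__main__":
    # Assumption for illustration only: the parallel part runs 100x faster.
    for fraction in (0.90, 0.99, 0.999):
        speedup = amdahl_speedup(fraction, 100.0)
        print(f"{fraction:.1%} parallel -> {speedup:.1f}x overall")
```

Under that assumed 100x figure, the overall gain is roughly 9x when 90 percent of the work is parallel, about 50x at 99 percent, and about 91x at 99.9 percent. The serial remainder quickly becomes the limit, which fits Scott's argument for keeping a small number of fast CPU-style cores alongside the GPUs.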
Q: What about programming for the GPU?
Scott: CUDA [an Nvidia programming model] allowed people to program GPUs using C and later Fortran--as if they were standard computers. And more recently was...something called directives. Tell the compiler, please take this...and execute it on the accelerator [GPU], and then the compiler and the runtime take care of generating the instructions to run on the accelerator. From a programming model perspective, all the programmer really has to do is identify and expose the parallelism to the compiler just by using a set of directives.
Q: What about Intel's foray into accelerated supercomputing with its Knights Corner many-core chip?
Scott: Nvidia GPUs are general purpose processors. They have an instruction set and they can do everything any other processor can do. They're not an x86 instruction set. They're a RISC instruction set. But it's absolutely a general purpose chip where you can do anything that you can do on any other computer. This is really Intel recognizing that they can't get where they need to go with standard Xeon multicore. They have to go to an accelerated model for the exact same reasons I was talking about. For power efficiency. I think it's an endorsement. They're several years behind Nvidia, but it's the right path.