August 5, 2008 1:30 AM PDT

Intel's Larrabee--more and less than meets the eye

by Peter Glaskowsky

Intel announced on Monday that it will be presenting a paper at Siggraph 2008 about its "many-core" Larrabee architecture, which will be the basis of future Intel graphics processors.

The paper itself, however, has already been published, and I was able to get a copy of it. (Unfortunately, as you'll see at that link, the paper is normally available only to members of the Association for Computing Machinery.)

Larrabee block diagram

Intel's Larrabee includes "many" cores, on-chip memory controllers, a wide ring bus for on-chip communications, and a small amount of graphics-specific logic.

(Credit: Intel)

The paper is a pretty thorough summary of Intel's motives for developing Larrabee and the major features of the new architecture. Basically, Larrabee is about using many simple x86 cores--more than you'd see in the central processor (CPU) of the system--to implement a graphics processor (GPU). This concept has received a lot of attention since Intel first started talking about it last year.

The paper also answers perhaps the biggest unanswered question about Larrabee--what are the cores, and how can Intel put "many" of them on a chip when desktop CPUs are still moving from two to four cores?

Intel describes the Larrabee cores as "derived from the Pentium processor," but I think perhaps this is an oversimplification. The design shown in the paper is only vaguely Pentium-like, with one execution unit for scalar (single-operation) instructions and one primarily for vector (multiple-operation) instructions.

Larrabee core diagram

The Larrabee core contains only two execution units: one for scalar operations, one for vector operations.

(Credit: Intel)

That's the basic answer: Larrabee cores just have less going on. A quad-core desktop processor might have six or more execution units, and a lot of special logic to let it reorder instructions and execute code past conditional branches just in case it can guess the direction of the branch correctly. This complexity is necessary to maximize performance in a lot of desktop software, but it's not needed for linear, predictable code--which is what we usually find in 3D-rendering software.

But the vector unit in Larrabee is much more powerful than anything in older Intel processors--or even in the current Core 2 chips--because 3D rendering needs to do a lot of vector processing. The vector unit can perform 16 single-precision floating-point operations in parallel from a single instruction, which works out to 512 bits wide. That's great for graphics, though it would be overkill for a general-purpose processor; the SSE units in today's mainstream x86 CPUs are 128 bits wide.

The new vector unit also supports three-operand instructions, probably including the classic "A * B + C" operation that is so common in many applications, including graphics. With three operands and two calculations per instruction, the peak throughput of a single Larrabee core should be 32 operations per cycle, and that's just what the paper claims.
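
As a rough check of that arithmetic, here is a minimal Python sketch. The function and parameter names are hypothetical, and counting a multiply-add as two operations is the same assumption made above; the paper does not list the actual instructions.

```python
# Illustrative arithmetic only; names are hypothetical, figures come from the text above.

def peak_ops_per_cycle(vector_bits=512, element_bits=32, ops_per_lane=2):
    """Peak floating-point operations one core can issue per clock cycle.

    ops_per_lane=2 assumes a multiply-add (A * B + C) counts as two
    operations, which the article treats as probable, not confirmed.
    """
    lanes = vector_bits // element_bits   # 512 / 32 = 16 single-precision lanes
    return lanes * ops_per_lane


print(peak_ops_per_cycle())                  # Larrabee as described: 32
print(peak_ops_per_cycle(vector_bits=128))   # a 128-bit unit under the same assumption: 8
```

These are peak figures; real code rarely fills every lane on every cycle.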

I say "probably" because the Siggraph paper doesn't describe exactly what operations will be implemented in the vector unit, but I suspect this part of the Larrabee design is related to Intel's Advanced Vector Extensions, announced last April. The first implementations of AVX for desktop CPUs will apparently begin with a 256-bit design, another indication of how unusual it is for Larrabee to have a 512-bit vector unit.

The multithreading factor
Intel also built four-way multithreading into the Larrabee cores. Each Larrabee core can save all the register data from four separate threads in hardware, so that most thread-switch operations can be performed almost instantly rather than having to save one set of registers to main memory and load another. This approach is a reasonable compromise for reducing thread-switching overhead, although it probably consumes a significant amount of silicon.

Note that this kind of multithreading in Larrabee is very different from the Hyper-Threading technology Intel uses on Pentium 4, Atom, and future Nehalem processors. Hyper-Threading (aka simultaneous multi-threading) allows multiple threads to execute simultaneously on a single core, but this only makes sense when there are many execution units in the core. Larrabee's two execution units are not enough to share this way.

All of these differences prove rather conclusively that Larrabee's cores are not the same as the cores in Intel's Atom processors (also known as Silverthorne). That surprised me; the Atom core seemed fairly appropriate for the Larrabee project. All that really should have been necessary was to graft a wider vector unit onto the Atom design. But now I suppose the Atom and Larrabee projects have been completely independent from one another all along.

Intel won't say how many cores are in the first chip. The paper describes an on-chip ring network that connects the cores. The network is 512 bits wide. Interestingly, the paper mentions that there are two different ring designs--one for Larrabee chips with up to 16 cores, and one for larger chips. That suggests Intel has chips planned with relatively small numbers of cores, possibly as few as four or eight. Such small implementations might be appropriate for Intel's future integrated-graphics chip sets, but as such they will be very slow by comparison with contemporary discrete GPUs, just as Intel's current products are.

Larrabee provides some graphics-specific logic in addition to the CPU cores, but not much. The paper says that many tasks traditionally performed by fixed-function circuits, such as rasterization and blending, are performed in software on Larrabee. This is likely to be a disadvantage for Larrabee, since a software solution will inevitably consume more power than optimized logic--and consume computing resources that could have been used for other purposes. I suspect this was a time-to-market decision: tape out first, write software later.

The paper says Larrabee does provide fixed-function logic for texture filtering because filtering requires steps that don't fit as well into a CPU core. I presume there's other fixed-function logic in Larrabee, but the paper doesn't say.

Larrabee's rendering code uses binning, a technique that has been used in many software and hardware 3D solutions over the years, sometimes under names such as "tiling" and "chunking." Binning divides the screen into regions and identifies which polygons will appear in each region, then renders each region separately. It's a sensible choice for Larrabee, since each region can be assigned to a separate core.

Binning also reduces memory bandwidth, since it's easier for each core to keep track of the lower number of polygons assigned to it. The cores are less likely to need to go out to main memory for additional information.
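
The general idea of binning can be sketched in a few lines of Python. This is a generic bounding-box version written for illustration, not Intel's algorithm; the function name, tile size, and sample triangles are all made up.

```python
from collections import defaultdict


def bin_triangles(triangles, screen_w, screen_h, tile_size):
    """Map each tile (tx, ty) to the indices of triangles that may touch it.

    Uses each triangle's bounding box, so the assignment is conservative:
    a triangle can land in a tile that it does not actually cover.
    """
    bins = defaultdict(list)
    max_tx = (screen_w - 1) // tile_size
    max_ty = (screen_h - 1) // tile_size
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Skip triangles whose bounding box lies entirely off screen.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= screen_w or min(ys) >= screen_h:
            continue
        # Clamp the bounding box to the range of valid tile coordinates.
        tx0 = max(0, int(min(xs)) // tile_size)
        tx1 = min(max_tx, int(max(xs)) // tile_size)
        ty0 = max(0, int(min(ys)) // tile_size)
        ty1 = min(max_ty, int(max(ys)) // tile_size)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


# A 128x128 screen split into four 64x64 tiles, with three sample triangles.
triangles = [
    [(0, 0), (10, 0), (0, 10)],          # sits inside the top-left tile
    [(60, 60), (70, 60), (60, 70)],      # straddles the corner shared by all four tiles
    [(100, 100), (120, 100), (100, 120)] # sits inside the bottom-right tile
]
bins = bin_triangles(triangles, 128, 128, 64)
for tile in sorted(bins):
    print(tile, bins[tile])
```

Each tile's list could then be handed to a separate core, which is the property that makes the technique a natural fit for a many-core design.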

The numbers crunch
The paper gives some performance numbers, but they're hard to interpret. For example, game benchmarks were constructed by running a scene through a game, then taking only widely separated frames for testing on the Intel design. In the F.E.A.R. game, for example, only every 100th frame was used in the tests. This creates an unusually difficult situation for Larrabee; there's likely to be much less reuse of information from one frame to the next.

But given that limitation of the test procedure, the results don't look very good. To render F.E.A.R. at 60 frames per second--a common definition of good-enough gaming performance--required from 7 to 25 cores, assuming each was running at 1GHz. Although there's a range here depending on the complexity of each frame, good gameplay requires maintaining a high frame rate--so it's possible that F.E.A.R. would, in practice, require at least a 16-core Larrabee processor.
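
To see how sensitive that estimate is to clock speed, here is a small Python sketch. It assumes performance scales linearly with frequency, which is a simplification and not a claim from the paper; the function name is hypothetical.

```python
import math


def cores_needed(cores_at_1ghz, clock_ghz):
    """Cores required at a given clock, assuming perfectly linear scaling
    from the paper's 1GHz reference figures (an optimistic simplification)."""
    return math.ceil(cores_at_1ghz / clock_ghz)


# The 7-to-25-core range for F.E.A.R. at 60 frames per second, at two clock speeds.
for clock in (1.0, 2.0):
    low = cores_needed(7, clock)
    high = cores_needed(25, clock)
    print(f"{clock:.1f} GHz: {low} to {high} cores")
```

Under that assumption, doubling the clock roughly halves the core count, so the conclusion depends heavily on frequencies Intel has not disclosed.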

And that's about the performance of a 2006-vintage Nvidia or Advanced Micro Devices/ATI graphics chip. This year's chips are three to four times as fast.

In other words, unless Intel is prepared to make big, hot Larrabee chips, I don't think it's going to be competitive with today's best graphics chips on games.

Intel can certainly do that--no other semiconductor company on Earth can afford to make big chips the way Intel can--but that would ruin Intel's gross margins, which are how Wall Street judges the company. Also, Intel's newest processor fabs are optimized for high-performance logic, like that used in Core 2 processors. Larrabee runs more slowly, suggesting it could be economically manufactured on ASIC product lines... but Intel's ASIC lines are all relatively old, refitted CPU lines.

Nvidia, by comparison, gets around this problem by designing its chips from the beginning to be made in modern ASIC factories, chiefly those run by TSMC. Although these factories are a generation behind Intel's in process technology, they're much less expensive to operate. So this may be a situation where Intel's process edge doesn't mean as much as it does in the CPU business.

The Larrabee programming model also supports nongraphics applications. Since it's fundamentally just a multicore x86 processor, it can do anything a regular CPU can do. Intel's paper even uses Sun Microsystems' term, Throughput Computing, for multicore processing.

The Larrabee cores aren't nearly as powerful as ordinary notebook or desktop processors for most applications. Real Larrabee chips will likely be faster than the 1GHz reference frequency used in the paper, but they still don't have as many execution units for the scalar operations that make up the bulk of operating-system and office software. That means a single Larrabee core could feel slow even when compared with a Pentium III processor at the same frequency, never mind a Core 2 Duo.

But with such a strong vector unit, a Larrabee core could be very good at video encoding and other tasks, especially those that use floating-point math. At 1GHz, a single Larrabee core hits a theoretical 32 GFLOPS (32 billion floating-point operations per second). A 32-core Larrabee chip could exceed a teraflop--roughly the performance of Nvidia's latest GPU, the GTX 280, which has 240 (very simple) cores.
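
The teraflop figure follows from simple multiplication, sketched below in Python. The function name is hypothetical; the 32 operations per cycle and the 1GHz clock are the figures quoted above, and the result is a theoretical peak, not a measured rate.

```python
def peak_gflops(cores, ops_per_cycle=32, clock_ghz=1.0):
    """Theoretical peak in GFLOPS: cores x operations per cycle x clock in GHz."""
    return cores * ops_per_cycle * clock_ghz


print(peak_gflops(1))    # one core at 1GHz: 32.0
print(peak_gflops(32))   # 32 cores at 1GHz: 1024.0, just over a teraflop
```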

But I don't expect to see that kind of performance from the first Larrabee chips. The power consumption of a 32-core design with all the extra overhead required by x86 processing would be very high. Even with Intel's advantages in process technology, such a large Larrabee chip would probably be commercially impractical. Smaller Larrabee designs may find some niche applications, however, acting as number-crunching coprocessors much as IBM's Cell chips do in some systems.

And although a Larrabee chip could, in principle, be exposed to Windows or Mac OS X to act as a collection of additional CPU cores, that wouldn't work very well in the real world and Intel has no intention of using it that way. Instead, Larrabee will be used like a coprocessor. In that application, Larrabee's x86 compatibility isn't worth very much.

The bottom line
So...what's Larrabee good for, and why did Intel bother with it?

I think maybe this was a science project that got out of hand. It came along just as AMD was buying ATI and so positioning itself as a leader in CPU-GPU integration. Intel had (and still has) no competitive GPU technology, but perhaps it saw Larrabee as a way to blur the line distinguishing CPUs from GPUs, allowing Intel to leverage its expertise in CPU design into the GPU space as well.

Intel may have paid too much attention to some of its own researchers, who have been touting ray tracing as a potential alternative to traditional polygon-order rendering. I wrote about this in some depth back in June ("Ray tracing for PCs--a bad idea whose time has come"). But ray tracing merits just one paragraph and one figure in this paper, which establish merely that Larrabee is more efficient at ray tracing than an ordinary Xeon server processor. It falls well short of establishing that ray tracing is a viable option on Larrabee, however.

Future members of the Larrabee family may be good GPUs, but from what I can see in this paper, the first Larrabee products will be too slow, too expensive, and too hot to be commercially competitive. It may be several more years beyond the expected 2009/2010 debut of the first Larrabee parts before we find out just how much of Intel's CPU know-how is transferable to the GPU market.

I'll be at Siggraph again this year, and I'll have more to say after I've read this paper through a few more times and had a chance to speak with some of the folks I know at AMD, Nvidia, and other companies in the graphics market.

Peter N. Glaskowsky is a computer architect in Silicon Valley and a technology analyst for the Envisioneering Group. He has designed chip- and board-level products in the defense and computer industries, managed design teams, and served as editor in chief of the industry newsletter "Microprocessor Report." He is a member of the CNET Blog Network and is not an employee of CNET.
by eightwings August 5, 2008 5:51 AM PDT
It's Larrabee's 48 cores versus Nvidia's 240 cores and AMD's 500 cores. Way to go, Intel. That'll show them. Intel needs to go to computer science rehab.

Larrabee: Intel's Hideous Heterogeneous Beast:
http://rebelscience.blogspot.com/2008/08/larrabee-intels-hideous-heterogeneous.html
by jltate August 5, 2008 12:24 PM PDT
Two things:

It's not heterogeneous.

If we're counting "cores" the way Nvidia does then Larrabee has 768.
by srikanth_janga August 5, 2008 7:49 AM PDT
Good article
by CompEng August 5, 2008 10:38 AM PDT
It's nice to see companies trying things a little differently, even if it doesn't always work out. I hope they figure out some tweaks to keep this from dying on the vine.
by whumpadoodle August 5, 2008 11:34 AM PDT
The "oh so secret" ACM article is available freely from Intel directly. Try:

http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf
by jltate August 5, 2008 12:21 PM PDT
Well, it's been estimated that Larrabee will run somewhere between 1.7 and 2.5GHz, and that high end chips will have 48 cores.

Also, the test they ran had F.E.A.R. at 1600x1200 with 4x MSAA. Even an 8800 GTX only gets ~80fps at those settings. Considering that the GTX280 doesn't give double the performance of the 8800GTX, Nvidia will certainly have something to think about between now and when this is released.
by rauxbaught August 5, 2008 6:09 PM PDT
As a long-time professional graphics programmer (who doesn't work for Intel), I can assure you that you are completely missing the point.

But before I get into that, let me correct a few complete fallacies in your article:
- There is no "reuse of information" across frames in video games. They spaced out their samples because neighboring frames tend to be very similar. They wanted varying data points to determine the effect of varying loads on the various subsystems, and measure the scaling of the system as they add extra cores -- not to do sustained throughput measurements.
- Running multiple threads on a single core in in-order processors is generally done to cover memory latency. (This is different from out-of-order processors, which can have many more sequential instructions in flight at once, and use complicated logic to keep as many units as possible busy.) Hyperthreading in this case therefore increases the practical throughput of the system, since it's designed for parallelism instead of single-stream throughput.
- The 1GHz clock was used to keep the math simple. (As was mentioned in the paper.) As far as we know, they could be using cores that run 2-3 times that frequency. (Or even half, for that matter...) Considering we don't know the frequency or the core count of their final hardware, comparing it to 2-year-old hardware from nVidia is far from meaningful. (Thus making your claim that they won't be competitive very premature.)
- Binning (similar to previous techniques such as tiling) reduces the memory bandwidth to the framebuffer, not the polygons. If you'd read the paper, you'd see that their analysis indicates a significant memory bandwidth advantage over forward rendering, and memory bandwidth often predicts performance in graphics. As a side note, their algorithm is actually slightly different from traditional tiling techniques, which is presumably why they used different terminology.

Now, as for completely missing the point: They implemented everything in software. The very fact that they could contrast an immediate mode renderer against "binning" is a testament to how important Larrabee is as a paradigm shift. The performance balance of graphics development is currently determined by the hardware manufacturers. This has side effects like flat-shaded polygons being completely bound by numbers of rasterization and blending units. This means it's quite easy to put together a workload where more than half of the GPU is completely idle. Making the entire pipeline completely software-driven puts control of these decisions in the hands of developers.

Performance aside, having a software rasterization pipeline means the flexibility to set up whatever is desired. I've wanted fully programmable blending for 5 years. Now I can have it! Sending data back and forth across the bus taking too long? Process it all on the GPU! All of these things may be possible on current hardware using GPGPU programming techniques, but this is the first hardware that's designed for generality FIRST. That makes it a pretty big deal on its own, even if it's not the fastest chip on the block.
by Peter N. Glaskowsky August 5, 2008 11:49 PM PDT
Well, of course there's reuse between adjacent frames. The polygons are less likely to be changed. The textures are probably the same. Whether these things ultimately matter to Larrabee, I don't know. I was just raising the point to show that Intel's test strategy produces frame rates that might not be realized in real systems-- and if anything, the real frame rates are likely to be faster.

This chip doesn't use hyperthreading, so whatever you were thinking that means, it doesn't mean.

We'll just have to see what the frequencies are like in the real world. If they're high-- like CPU frequencies-- the design flow is more CPU-like and future iterations of the design, and process migrations, will be slower to develop. If they're slower than Intel's CPUs, Larrabee is leaving some performance on the table and the chips aren't as good a fit for Intel's fabs.

But I don't believe Larrabee is going to run at CPU-like frequencies. CPUs run so fast because they're made in bleeding-edge processes using the fastest possible transistors, and they have long pipelines. Larrabee will always be a generation or two behind, it can't afford the power costs of those fast transistors, and it's got a short pipeline.

If you think Larrabee is going to lead to meaningfully greater flexibility in 3D pipeline design, you must not have a lot of experience with PC graphics. The code running on Larrabee isn't going to be written by the application developers; it's going to come from Intel, and be written to suit Direct3D and OpenGL.

Besides, flexibility isn't nearly as significant a benefit as performance and energy efficiency. If Intel is sacrificing the latter to get the former, it's going to be sorry. And wow, x86 compatibility is so far down on the list of important GPU features that it's never been on the LIST before.

by meh130 August 5, 2008 9:12 PM PDT
Simple, in-order cores. 4-way vertical multithreading. Roughly 1 GHz clock speed.
Sounds like Intel is applying the same design principles Sun used to create the Niagara family to the x86 architecture.
It would be interesting to see what a server could do with this chip. It would probably do very well at web, Java, and OLTP benchmarks.
by Peter N. Glaskowsky August 5, 2008 11:51 PM PDT
Web, Java, and OLTP servers rely heavily on integer processing. This thing is like a high-frequency 486 on most of that code. Even at 2+ GHz, it isn't going to offer very good performance.

by eightwings August 6, 2008 12:35 AM PDT
My apologies to Intel for earlier comparing Larrabee's 48 cores with Nvidia's 240 cores and AMD's 500 cores. Nvidia and AMD have a different definition for core. By core, they mean a single double-precision vector unit. Each of their cores has 16 vector units, which means that Nvidia's 240 cores is really 15 cores. So Larrabee is a much more powerful processor. Using Nvidia's definition, Larrabee would have 48x16 or 768 cores. Very impressive.

That being said, I think that Larrabee offers nothing that is revolutionary or visionary. It's going to be a ***** to program. The parallel programming problem is still with us. What is needed is a universal programming model that has all the advantages of CPUs and GPUs without their disadvantages.

How to Solve the Parallel Programming Crisis:
http://rebelscience.blogspot.com/2008/07/how-to-solve-parallel-programming.html
by nanikore August 6, 2008 11:34 AM PDT
No, no, no. If you want something that's "a ***** to program", try the notorious CELL PROCESSOR. If you look at the memory model of cell, and then look at Larrabee which is pretty much diametrically opposite, you would see what a huge deal Larrabee is. There's a reason that Larrabee is getting a lot of love from developers that it was presented to.
by ratburger October 3, 2008 6:53 PM PDT
It is hard to understand who the Larrabee architecture is aimed at, not computer graphics programmers anyway. It looks like the 16 element vectors are loaded and stored sequentially from one given address. That makes scatter and gather operations used for image processing very difficult and slow (as per SSE). In that regard Nvidia has a much better hardware architecture where essentially random memory access is no problem. On the other hand Larrabee does have a uniform and easy to use cache system and the x86 instruction set is very familiar.
If a motherboard is available with Larrabee as the CPU then sure I'll buy one. I can use it for high performance scientific computing no problem. I am sure a single thread would also be fast enough for office applications (how fast can you type?). If it has less than 32 cores or is only available as a graphics card then forget about it.
I guess Intel have gotten themselves into a bit of a marketing muddle, they have a solution but don't know what the problem is.