February 24, 2006 6:21 AM PST

A 1,000-processor computer for $100K?

Related Stories:
Linux Networx adopts new cluster design (November 13, 2005)
Xilinx nears 90-nanometer finish line (March 31, 2003)

BERKELEY, Calif.--It's not easy to get hardware designers and software developers in sync when it comes to developing a new computer architecture, according to Dave Patterson, one of the pioneers of the original RISC architecture.

The hardware takes years to develop, and work on the accompanying software doesn't start in earnest until the hardware is done. Simulators exist, but software developers don't take advantage of them the way they should, Patterson said. That drags the development cycle out even longer.

Enter RAMP, or Research Accelerator for Multiple Processors. The idea behind the program is to build a laboratory computer out of field-programmable gate arrays (FPGAs), chips that can be reconfigured to behave like other chips. (As one Intel researcher has described it, the FPGA is the utility infielder of the semiconductor world.) Ideally, an FPGA-based RAMP computer could be assembled cheaply and easily.

"If you can put 25 CPUs in one FPGA, you can put 1,000 CPUs in 40 FPGAs," Patterson said during a symposium here this week at UC Berkeley, where he is a professor of electrical engineering. Such a computer would cost about $100,000, he estimated. It would also take up relatively little space--about one-third of a rack--and consume only about 1.5 kilowatts of power.

An equivalent computing cluster would cost about $2 million, take up 12 racks and consume 120 kilowatts.
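Those figures imply roughly a 20-fold saving in cost and an 80-fold saving in power. A quick back-of-the-envelope check of the numbers quoted above (the script is purely illustrative):

    # Figures quoted by Patterson for the 1,000-CPU RAMP box vs. a conventional cluster.
    ramp    = {"cost_usd": 100_000,   "power_kw": 1.5, "racks": 1 / 3}
    cluster = {"cost_usd": 2_000_000, "power_kw": 120, "racks": 12}

    for metric in ramp:
        print(f"{metric}: cluster needs ~{cluster[metric] / ramp[metric]:.0f}x more")
    # cost_usd: ~20x, power_kw: ~80x, racks: ~36x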

The idea for a RAMP computer, which came up in a hallway conversation at a conference in 2004, is moving toward reality. A version with eight compute modules will be completed in the first half of this year, and the full 40-module version could come out in the second half of next year, Patterson said. Joining UC Berkeley in the project are researchers at Stanford University, the Massachusetts Institute of Technology and other schools.

"What we are not trying to do is build an FPGA supercomputer," Patterson said. "What we are trying to do is build a simulator system."

14 comments

wow
Wow, 12 racks and only 120 watts! I wish I could get that with just one unit!!!
Posted by GBLAZE41 (4 comments )
Fixed the typo
Yep. That would have been some kind of energy efficiency!
Posted by Jon Skillings (249 comments )
power consumption
> It would ... consume only about 1.5 kilowatts of power. ...
> An equivalent computing cluster would cost about $2 million,
> take up 12 racks and consume 120 watts.

Typo? A household lightbulb uses 60-100 watts.
Posted by satayboy (73 comments )
1k * Apple Mac Mini + FE Switch fabric
$subject. Fits the price ;-)
Posted by Philips (400 comments )
close but not quite
I think the idea here is a customizable computer that is small and consumes little power. 1,000 Mac Minis would take up more than the proposed 1/3 of a rack and would most likely use more power.

But what I really want is a way to cluster all my home PCs together so they appear as one machine, run Windows, and make multitasking and games a breeze. Suggestions, anyone?
Posted by Dibbs (158 comments )
Already Designed
Strap together 1,000 of those green wind-up UN computers (http://laptop.media.mit.edu/) and voila... low power consumption and 1,000 processors.
Posted by KsprayDad (375 comments )
Programmability
How long before this architecture can be put into use in the scientific computing community? I.e., can an MPI library be created for these systems straightforwardly and quickly, or does a new parallel software library have to be developed?
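For readers who haven't met MPI, this is roughly what a message-passing program looks like, sketched with the mpi4py Python bindings (an illustration only; the article says nothing about what RAMP's software stack would actually support):

    from mpi4py import MPI   # assumes an MPI runtime plus the mpi4py package are installed

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()    # this process's id
    size = comm.Get_size()    # how many processes were launched

    # Each process sums a stripe of the numbers 0..999; rank 0 gathers the total.
    partial = sum(range(rank, 1000, size))
    total = comm.reduce(partial, op=MPI.SUM, root=0)
    if rank == 0:
        print("total =", total)   # 499500, regardless of how many processes ran

Launched with something like "mpirun -n 4 python sum.py", the same program scales across however many processors the system exposes, which is why the question of MPI support matters for scientific users.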
Posted by paulrhodes (1 comment )
Software is the problem
There is no need for faster processors. The problem is software. The operating system, and every other piece of software, just keeps getting slower and demanding faster and faster hardware. The real task is to solve that, for example with clever compilers that make the software much better. Faster processors are not needed, because the software is so poor.

One part of the problem is analyzing code and making it fast. Some programming languages are not good for programming at all; they seem made for making mistakes, and they are not simple to analyze.

Normal machine language, however, can be analyzed and translated into dataflow graphs, or the other way around, from dataflow graphs into smart machine language. That means the software automatically finds the parts that are best executed in order (some code is, and dataflow computers themselves have been dead since the '70s), while other code is optimal when it runs concurrently. The dataflow graphs can make normal machine code concurrent: two independent pieces of machine language can automatically be split across different processors. This is done by analysis software that stores its results from time to time (on the hard drive). The tasks are controlled by the dataflow graphs, so the whole thing can be understood as a normal sequential processor, steered by dataflow graphs, that also uses ordinary processors concurrently. Those processors are smaller than an Intel chip, and the ideal is to have many of them on one chip. Execution follows the data stream, with priorities: if you want an answer quickly, the dependencies of that instruction are executed first. The code is effectively run from the bottom up, and if it needs something (flags and so on) the computer automatically finds it, perhaps also by running code in normal order. It always appears as if everything were issued in normal order, so you can keep using normal programming languages or Intel machine code. There is no reason to write new application software; it is only a question of using methods that understand software.
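To make the dataflow idea concrete, here is a toy sketch in Python: a hand-written dependency graph, with every task whose inputs are ready dispatched concurrently. (In the scheme described above, the graph would be extracted automatically from machine code rather than written by hand; the names here are invented.)

    from concurrent.futures import ThreadPoolExecutor

    # Hand-written dependency graph: each task names the tasks it must wait for.
    deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
    work = {"a": lambda: 1,
            "b": lambda: 2,
            "c": lambda: results["a"] + results["b"],
            "d": lambda: results["c"] * 10}

    results, done = {}, set()
    with ThreadPoolExecutor() as pool:
        while len(done) < len(deps):
            # Everything whose dependencies are satisfied can run at the same time.
            ready = [t for t in deps if t not in done and all(d in done for d in deps[t])]
            for task, value in zip(ready, pool.map(lambda t: work[t](), ready)):
                results[task] = value
                done.add(task)

    print(results)   # {'a': 1, 'b': 2, 'c': 3, 'd': 30}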

As example, a "loop", is only issued, if someone need the result from loop. Even an infinite loop, may continue, because infinite loops not have meaning. Case of invarianse on variables, may let it exit. Or, it may be down prioritized, because of the loop. Smaller, and faster loops, is higer prioritized, and software typical not hangs. If a loop is runned a lot of times, other loops, that could be executed concurrent, still works, and is up prioritized, compared to the "bad" loop. First when computer have nothing to do, the "slow" loop contininues, else smaller is over prioritzed. This gives faster execution. Also, the "output", could have prioritizing. As example, if you do not use your disc drive, the software controling it, never run. It is down-prioritized, because there is not any path, that requires. All output paths to screen is higher prioritized. However, the disc driver software, may be higher prioritized, if the screen request. That means the user. When the user ask for something on the driver, it automatic runs. It could be a problem with status screens that tells a driver have loaded. They may need to be closed, or lower prioritized. Any thing has its priority. Multitasking, with "time" gives no meaning. Only task switching have any means. The software could not be runned fast, if you have one processor, and "divide" the tasks into two. The average, is task1+task2 time. If they are executed individual, it is task1 time before first output, and task1+task2 time, before output 2. That is always faster. In case, you have only one CPU, it is to prefer to find a good prioritizing, and the events is controled by that. If you have more CPU's, the methode could also split the software out on more CPU's. It does not require new assemblers. However, it is more easy to do, if you use better langugaes, that does not have any machineindependent parts (as example pointers, and datastructurs, that have lengths depending on computerachitectures, integers, float etc. too.) The "memory" architecture, need to be infinite to a logic machine language, or a programming language. Sizeof have no meaning. Machine language could not be implemented, since it have no meaning on a computer.

A good computer has an operating system that includes a driver for the processor. All code that runs on the operating system is machine-independent, and you can move everything, both the operating system and any other code, to any other computer architecture. Whenever something needs to be converted for the architecture, it is done through the operating system's CPU driver. That driver is the only thing that runs machine-language instructions; they exist nowhere else. Code is handed to the operating system, which generates the processor code using tables of the instructions and other information returned by the processor driver. Since the processor driver is the only thing allowed to run processor code, it is the first thing to run when the computer starts, and it interprets the operating system. Its language is very simple, with very few operations; there is no need for speed or anything advanced, because its only job is to run the first part of the operating system, which then generates code based on the instruction lists and so on. The processor driver can also be used as part of the code-generation or manipulation process, but normally code is simply sent to it. The operating system also does the job of making software parallel, and can send it to different processors with different architectures, through different processor drivers running on the operating system, or perhaps on several computers. Using graph algorithms, the parallel parts can be analyzed to find low-bandwidth cut points where the software can be split across processors or machines; this analysis is part of the operating system, and it is what discovers the concurrency and maps it onto instructions. At first it is very simple: it takes some of its own code and generates machine language for it using tables from the processor driver, and the speed is already much higher because the processor is used better. It can then do better with a more advanced method in which all registers are optimized and mapped onto the processor, so the operating system itself runs fast. The last step is the full analysis that takes code written for, or simulated on, the operating system, analyzes its dependencies, makes it concurrent, and splits it across CPUs or even computers in different places in the world. A simple program that does I/O on one computer in one place and, later in the same program, I/O on another computer somewhere else would automatically be sorted out and split across the right machines, using methods that minimize bandwidth, perhaps even running some code twice to do so. On a single computer it "just" divides the program into hundreds of processes or more. Everything runs independently where possible: a loop simply continues when its dependencies allow it (that is, when its result is not needed by the following iterations); otherwise it executes according to its priority. The way to make computers faster is not faster hardware, but understanding software, and making the computer do that.
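A very rough sketch of the "processor driver" arrangement described above, where applications only ever see portable operations and a single per-architecture component turns them into native instructions (every name and instruction string here is invented for illustration):

    # Invented example: the driver is the only place native instructions ever appear.
    class ProcessorDriver:
        """One of these exists per architecture; nothing else emits native code."""
        table = {}                                   # portable op -> native instruction
        def emit(self, portable_ops):
            return [self.table[op] for op in portable_ops]

    class X86Driver(ProcessorDriver):
        table = {"load": "mov eax, [addr]", "add": "add eax, ebx"}

    class ArmDriver(ProcessorDriver):
        table = {"load": "ldr r0, [addr]", "add": "add r0, r0, r1"}

    portable_program = ["load", "add"]               # what the OS stores and ships around
    print(X86Driver().emit(portable_program))        # ['mov eax, [addr]', 'add eax, ebx']
    print(ArmDriver().emit(portable_program))        # ['ldr r0, [addr]', 'add r0, r0, r1']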

Take the swap algorithm from Bill Gates: it is very bad. It would be much better to map everything in the world, hard drive or Internet, into RAM, so that RAM becomes the cache. Then any use of the hard drive or the Internet is as fast as RAM. For example, you never need to spend time transferring software from the hard drive into RAM; it simply runs from the hard drive and is loaded "smartly". Typically, if you have a gigabyte, there is no need to copy the whole gigabyte into RAM first only to find out it has to be copied back out to the hard drive. That takes time, for example when you run code straight from the net. It is just cached.
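That is more or less memory-mapped I/O: the operating system pages data in on demand and RAM acts as a cache in front of the file. A minimal sketch using Python's standard mmap module (the file name is invented):

    import mmap

    # Map a file into the address space; pages come off the disk only when touched.
    with open("big_dataset.bin", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            header = m[:16]      # touching the start pulls in only the first page(s)
            tail = m[-1]         # reading the end does not stream the whole file through RAM
            print(len(m), header, tail)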

Interrupts still need to be handled, though, and they may need to suspend swapping on the drive. Every piece of hardware, every port and every driver is always explicitly opened and closed, and whoever opened it should close it when it is no longer used. If you are not surfing, your Internet connection is not in use; there is no need for a firewall or for the network driver, because everything is gone once it is closed and only comes back when someone opens it again. For some drivers this means they must not be swapped out of memory. The system can measure how long an interrupt takes and tell the CPU up front that this is the worst case from then on; if the interrupt ever takes longer, it is closed, and everything connected to it (everything needing it) is closed down as well. Everything is opened and closed: if you use a printer, a driver for it only exists while it is open. Connect another printer and the driver can be swapped for the new one, but only once all access to the old one has been closed; then the old driver can be swapped out or replaced with the new printer driver. Because this only ever happens after everything has been closed, timing can be guaranteed, interrupts are guaranteed to work, nothing occupies fast memory that is not needed, and since everything is opened and closed without being forgotten, the analysis can be done much better. Interrupts whose drivers are not open are always skipped, and the computer uses less power (none at all while waiting for a key or the mouse, instead of spinning in a tight loop). Things that are not used must always be closed. If you want software that watches printer status even when nothing else has the printer open, that status watcher is itself opened and the driver is then not swapped out; this is normally not part of the operating system, which closes everything it used when it is done, but a program in the "startup" group could keep it open and keep showing status until the user closes it down.

Even though files and the Internet are part of RAM, they are still used as "files": you have to "open" a part of RAM as read-only and so on, as if it were a file. One reason is security. And most things live in files anyway, since that is how they get swapped. Only active interrupt drivers sit in RAM, and even they are only cached, permanently cached for as long as they are open. This is controlled by a few bits on the "files", or on the memory regions themselves. One part of memory cannot write to another if it is protected, and parts of memory and I/O can even be password-protected. They may be part of another computer and can be redirected even if the software was not written for that. Your software could then send sound to the USA and pictures to Japan, and the operating system would automatically duplicate the code, execute the needed parts in both Japan and the USA, and let them communicate using minimum bandwidth, while your software is still just an ordinary program written in C++ or in simulated machine language. Even your RAM could be swapped to Hong Kong, if that is where your hard drive is.

I/O and memory have the same structure, and both can be redirected, to another directory or to the other side of the world, because it is all just a memory-mapped hard disk and Ethernet that get opened and closed. If nothing uses your Ethernet region (no open handles to that memory), it swaps out and takes up no room even in swap memory, since it maps directly to its own code and runs straight from the drive, only cached; and while it is open, parts of it can be marked as never swapped out.

All of this is only to make the point that the trouble with computers is not the hardware. The task is software analysis: writing software that analyzes software and makes it faster. An interpreter can automatically be converted into a compiler that way, which means any code can stay abstract and be converted into native code later, since an interpreter can be written for a language without knowing anything about the underlying machine language. Instead of compiling, you start with emulation, which is then turned into a compiler with the registers mapped efficiently onto the CPU. The transformation runs dynamically as the program runs (as far as I know, some hard engineering problems still have to be solved to get all the way there). But methods already exist that make software run faster just by analyzing it, the interpreter-to-compiler conversion being one example.
Even loops that are large, or infinite, can be optimized to deliver their answer in a short time.

One of the most investigated methods is to take normal software, make it parallel, make it event-driven, and add priorities. That also seems to give much better performance, though in some cases it needs new computer architectures with the hard parts integrated into the hardware, for example walking the trees. In the '70s, computers were built that used dataflow analysis for exactly this purpose. Typically you should never need an explicit instruction to make software parallel; you just have to keep the different parts from "talking" to each other. In any language it is then easy to see that nothing communicates except through semaphores, and the program is fully parallel. Adding commands like "add process" means nothing; it only slows things down, because it tries to control by hand what could be done automatically.

Some features could be added to the hardware, though, for example a "copy" feature. When you copy one region of memory to another, the hardware should simply note in its cache that the copy happened, which means an explicit copy instruction must exist in the assembler; otherwise the CPU cannot see that it should keep using the same address. If you later try to change something in the copied region, the cache automatically adjusts and steps around it. The buses are much slower, so the copy instruction is also passed down to them, to the RAM and to the hard drive. The RAM chips do the copying themselves; they can do it very efficiently, reading a lot of data and moving, copying or filling at the same time. When data is swapped to the hard drive, the copy instruction is sent to the drive as well, and it does the copying itself; it need not store the data twice, only once, so it takes far less time. Seen from the CPU, copying one part of memory to another, a file for instance, is just one copy instruction and is finished in the same nanosecond.
You could copy a whole hard drive into a directory in one nanosecond, and it would take up no space.
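That is essentially copy-on-write: record that a copy happened and only duplicate the bytes when one side is modified. A toy sketch of the idea in Python (real versions live in the MMU, the file system or, in the scheme above, the RAM and disk controllers themselves):

    class CowBuffer:
        """A buffer whose copies share storage until one of them is written to."""
        def __init__(self, data):
            self._data = bytearray(data)
            self._shared = False

        def copy(self):
            clone = CowBuffer(b"")
            clone._data = self._data              # "copying" is instant: share the storage
            clone._shared = self._shared = True
            return clone

        def write(self, index, value):
            if self._shared:                      # first write after a copy: duplicate for real
                self._data = bytearray(self._data)
                self._shared = False
            self._data[index] = value

        def snapshot(self):
            return bytes(self._data)

    a = CowBuffer(b"hello world")
    b = a.copy()                   # instant, no bytes moved
    b.write(0, ord("H"))           # b now gets its own storage; a is untouched
    print(a.snapshot(), b.snapshot())   # b'hello world' b'Hello world'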

If you copy something from one computer's hard drive to another computer's hard drive, the data flows directly between the two machines, even if a third computer sent the instruction to copy those two parts of "its" memory between them. Security is very important here, to ensure that computers cannot get access to, or copy, RAM from telephones, human brains, or printers they are not allowed to touch, nor other computers' memories or file handles. Pushing copying down into the CPU cache, the memory and so on also makes it possible to "insert" data at positions in memory without it taking much time, which matters to some software: the processor, the RAM and so on just remember that the copy happened and only physically copy small blocks (the ones that would otherwise take too much space to track as copied). Sometimes they can also copy bigger regions when they have nothing else to do, to clean up and free copy descriptors in the hardware. Whenever something is copied, it costs only one instruction as long as copy descriptors are left (and normally they will be, since the copy instructions are sent on to the RAM and so on, which handle them themselves). Otherwise the data has to be read, and by then it has either already been copied or it has its own copy descriptor. If a large memory runs out of descriptors, the copying may have to actually complete; the chip guarantees it, and if necessary it inserts waits, but that is not the normal case, because the copy is either done or there are more descriptors in the memory or on the drive.

The task is to make software.
Posted by JensM (1 comment )
And he huffed, and he puffed...
Now who could argue with all that? Wow! I couldn't understand half of it through all the grammatical errors.

But there is one underlying message that is true... the software written today is in no way optimized. Software programmers have become lax as fast hardware allowed them to use higher-level languages to write programs that eventually end up as machine code full of waste. Assembler code is the most efficient of all, since its instructions assemble one-to-one into machine code, but programmers don't like it because they would have to write hundreds, if not thousands, of instructions to do even the simplest tasks... though those tasks would run efficiently and the programs would be much smaller. How far have we gotten from efficiency? In the '80s, I saw a computer run a GUI operating system from a 256-kilobyte ROM chip (that's a quarter of a megabyte), boot in 2 seconds, and offer roughly the same functionality as Windows 3.1. The processor ran at 8 MHz. Today we have processors that run 400 times faster with billions more bytes of memory, yet the operating system takes up gigabytes and boots more slowly. The OS does a lot more, but even so it is nowhere near efficient. The difference... the old OS was written in assembler.
Posted by Seaspray0 (9714 comments )