In surveying current processor architectures, we can easily find an example of almost any style or theme you might be interested in. The line between RISC (Reduced Instruction Set Computer) and CISC (Complex Instruction Set Computer) has blurred over the last ten years. While Instruction Set Architectures (ISAs) were stable until recent developments, the battle now is between very fast single processors and slower, wide parallel processors.
This battle has now been joined by the technology battle that intensifies as circuit dimensions shrink, and the power/heat battle as clock speeds race for the 10 GHz crown. That particular winner may provide both home entertainment and heating in the near future.
The power/heat conundrum is proving to be a very hard barrier to bypass, as Intel has found with its Itanium (130W) and P(100W) processors. This is the limit of practical air cooling with current systems. IBM is no stranger to these challenges, but its solution is based on Multi Chip Modules (MCM) with liquid cooling attachments, which are not typically suitable for desktop use. Any current or future processor architecture (PA) will have to deal with all of these barriers to win wide acceptance.
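To see why the clock-speed race runs straight into the power wall, it helps to look at the textbook estimate for CMOS switching power, P = a * C * V^2 * f. The sketch below is only a rough illustration: the capacitance, voltage and activity values are hypothetical numbers picked to land near 100 W, not measurements of any real chip, and leakage power is ignored entirely.

```python
def dynamic_power_watts(switched_capacitance_farads, supply_volts, clock_hz, activity=1.0):
    """Textbook CMOS switching-power estimate: P = activity * C * V^2 * f."""
    return activity * switched_capacitance_farads * supply_volts ** 2 * clock_hz


# Hypothetical chip parameters, chosen only for illustration.
CAPACITANCE = 20e-9   # effective switched capacitance, in farads (assumed)
VOLTAGE = 1.3         # supply voltage, in volts (assumed)

base = dynamic_power_watts(CAPACITANCE, VOLTAGE, 3.0e9)    # a 3 GHz part
fast = dynamic_power_watts(CAPACITANCE, VOLTAGE, 10.0e9)   # same design at 10 GHz
low_v = dynamic_power_watts(CAPACITANCE, VOLTAGE / 2, 3.0e9)  # same clock, half the voltage

print(f"3 GHz:  {base:.1f} W")
print(f"10 GHz: {fast:.1f} W")
print(f"3 GHz at half voltage: {low_v:.1f} W")
```

Under these assumptions power grows linearly with clock frequency but with the square of the supply voltage, which is why the low power designs discussed later lean so heavily on voltage and on switching off idle sections.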
How are these barriers being breached? What techniques will give the next generation of processors that edge that jumps current limitations?
For the purpose of this article, I will separate processors into four broad classes:
This class is characterized by high clock speeds and very complex implementations that optimize the rate at which instructions are executed. These processors have many pipeline stages, and several instructions are 'in flight' at any time. Large caches and high bandwidth links are standard. The IBM Power5 is an extreme example of this class.
There are big hurdles ahead of this class. The first one is power, which is a critical limit for both heat dissipation and electromigration. We can get more heat off a chip by using liquid cooling, channels on the back of the silicon, and silver as a heat conductor, which is somewhat better than copper. The limiting issue will be electromigration as dimensions get smaller, aggravated by elevated temperatures. Extreme cooling, with Freon or alcohol as the fluid or even liquid nitrogen, is possible but limited in application.
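The interaction between current density and temperature in electromigration is commonly modeled with Black's equation, MTTF = A * J^(-n) * exp(Ea / (k * T)). The sketch below computes lifetime relative to a reference operating point so that the unknown constant A cancels out. The activation energy (0.7 eV) and exponent (n = 2) are illustrative assumptions, not figures for any particular process.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_mttf(current_density_ratio, temp_k, ref_temp_k,
                  activation_ev=0.7, exponent_n=2.0):
    """Interconnect lifetime relative to a reference point, per Black's equation.

    current_density_ratio is J / J_ref. The defaults for activation_ev and
    exponent_n are assumed values for illustration only.
    """
    thermal = math.exp(
        activation_ev / BOLTZMANN_EV_PER_K * (1.0 / temp_k - 1.0 / ref_temp_k)
    )
    return current_density_ratio ** (-exponent_n) * thermal


# Doubling current density at constant temperature (e.g. thinner wires).
print(relative_mttf(2.0, 350.0, 350.0))
# Same current density, but running 20 K hotter.
print(relative_mttf(1.0, 370.0, 350.0))
# Same current density, but cooled by 20 K.
print(relative_mttf(1.0, 330.0, 350.0))
```

With these assumed parameters, both shrinking the wires and raising the temperature cut expected lifetime sharply, while better cooling buys some of it back.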
A second major limitation is the diminishing returns of complexity. Fast processors already use all the obvious techniques for speed, and those left are not likely to yield big gains. The Itanium's attempt to move that complexity into software will not yield further significant gains because of the NP nature of software complexity. NP is used here as shorthand for problems whose best known exact solutions take time that grows exponentially with problem size, so no matter how much resource you use, the gains diminish rapidly.
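A quick way to get a feel for this kind of diminishing return is to count how far brute force gets you. The sketch below assumes, purely for illustration, that finding the best schedule means trying every ordering of n independent instructions (n! candidates). Real compilers use heuristics rather than exhaustive search, so treat this as a picture of the growth rate only.

```python
import math


def largest_searchable(budget):
    """Largest n such that all n! orderings fit within `budget` evaluations."""
    n = 0
    while math.factorial(n + 1) <= budget:
        n += 1
    return n


small_budget = 10 ** 9    # one billion candidate schedules
large_budget = 10 ** 12   # one thousand times more resource

print(largest_searchable(small_budget))
print(largest_searchable(large_budget))
```

Under this assumption, a thousandfold increase in search effort extends the window that can be scheduled optimally from 12 instructions to only 14.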
Overall, this class of processors is near several limits that will close the door to future extensions. This is reflected in the roadmaps of all the major vendors, which are moving to multiple cores (more than two). Faster systems must come from this and other approaches.
This class is characterized by moderate clock speeds, relatively simple implementation and resources for execution, modest cache and bandwidth requirements and automatic control of which parts of the chip are powered up to minimize power use. The Via C5P is a good example of this class.
The primary objective of this class is best performance while maintaining low power. In addition, small die size means cheap production and minimum board space. Each of the current vendors uses somewhat different techniques to keep power low, but Via's and Transmeta's techniques represent the extremes.
Via builds the smallest, simplest die consistent with rapid execution and minimal complexity, and takes great care to use low power in each section as well as power control. Via's current C5P, with a 47 sq. mm die and an installable chip smaller than a penny, is the smallest processor capable of running as a dual processor system. I expect to see dual core C5 chips in the next generation or two from Via.
Transmeta steps away from direct execution and builds a powerful general execution engine capable of dynamically transforming the instruction stream into a VLIW stream that keeps multiple execution units busy at low power. Transmeta's approach is a powerful general technique that has potential well beyond its current implementation in the TM8000.
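The general idea of turning a sequential instruction stream into wide VLIW bundles can be shown with a toy packer. This is emphatically not Transmeta's actual translation software: the `Op` type, the `pack_bundles` function and the register names are all hypothetical, and the sketch assumes each register is written only once and that every operation takes one cycle, so only read-after-write dependencies need tracking.

```python
from typing import Dict, List, NamedTuple, Tuple


class Op(NamedTuple):
    name: str               # label for the operation (hypothetical)
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers or memory names read


def pack_bundles(ops: List[Op], width: int = 4) -> List[List[str]]:
    """Greedily place each op in the earliest bundle that follows its producers."""
    bundles: List[List[str]] = []
    ready_at: Dict[str, int] = {}  # register -> first bundle index allowed to read it
    for op in ops:
        # Inputs never written by an earlier op are treated as ready from bundle 0.
        slot = max((ready_at.get(src, 0) for src in op.srcs), default=0)
        # Skip bundles that already hold `width` operations.
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        if slot == len(bundles):
            bundles.append([])
        bundles[slot].append(op.name)
        ready_at[op.dest] = slot + 1  # consumers must issue in a later bundle
    return bundles


program = [
    Op("load_a", "r1", ("mem_a",)),
    Op("load_b", "r2", ("mem_b",)),
    Op("add", "r3", ("r1", "r2")),
    Op("load_c", "r4", ("mem_c",)),
    Op("mul", "r5", ("r3", "r4")),
]

for bundle in pack_bundles(program, width=2):
    print(bundle)
```

With a bundle width of two, the five sequential operations fit in three bundles, because the independent load can be moved up alongside the add. A real translator also has to handle register reuse, memory ordering, latencies and exceptions, which is where most of the difficulty lies.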
There is little data available about this class of processors at the current time. Only the Power4 and Power5 from IBM fall in this class, and they seem to be two separate cores on a ceramic substrate, not yet the multiple processors in one core. When multiple processors on one core get announced, I'll be taking a close look at them.