One of the fun parts of spelunking around supercomputer sites is the discovery of new information or technology. This time around I had started at The Cornell Theory Center and examined the various projects, eventually linking to Los Alamos National Labs (LANL), near the recent fire in New Mexico.
Recently, there has been a lot of news about power shortages, with predictions of worse to come. So I followed the link about Zero Emissions power generation to find a method that uses a coal and water slurry to create hydrogen without combustion, generates electricity in a fuel cell, and captures the CO2 and stores it in common minerals. It is more efficient than current thermal generation and potentially much less expensive to operate. While this is currently a research project, it has great promise to solve two serious problems at the same time.
Further exploration of LANL led me to discover one of their publications named BITS. The January 2000 issue had an interesting article on one of my goals - better programming methods. The first article, Achieving Revolution Through Evolution, is the first I have seen that takes a long term look at productive use and reuse of C++ object capabilities, and then tells you how they are doing it. Even better is the follow-up article in the February/March issue, Blanca/Tecolote Base Class Redesign.
This is what happens when you overlook some needs and they come back and force a redesign. The approach they used wound up with very little rewriting of code, and more flexibility. I recommend these two articles, and their links which include some C++ code details, for a very good approach to getting the most from object oriented software. Some of the software referenced in their articles is available for download at the Advanced Computing Laboratory (ACL).
To my mind, the key to doing it right is taking a long term look at needs. In the case of LANL, they are working with a ten year horizon. Results are much sooner than that, but they are willing to invest some time and money to save more later. The promise of object oriented software needs this different approach for best results.
Supercomputer development began in the 1960s. The first big commercial super was the Control Data Corporation (CDC) 6600. Around the same time, IBM developed the 360/95 and the 370/195. By 1975, the University of Illinois had developed the Illiac IV, CDC the 7600, Burroughs the BSP, and Texas Instruments the ASC. These systems grew from 10 million to 100 million Floating Point Operations per Second (FLOPS).
By 1975, Seymour Cray had left CDC and formed Cray Research, developing a separate line of supercomputers which dominated the super field for the next two decades. Starting with the Cray 1 at speeds of 100 megaflops, reliability and performance grew to the Cray X-MP, a multiprocessor super which reached over 1 billion FLOPS (1 gigaflop). As the Cray 2 arrived, so did other competitors: CDC introduced the Cyber-205, and Hitachi, Fujitsu and NEC each brought out their own design.
Supercomputing had now reached substantial commercial use as well as research. No longer was the only customer for supers the US government for weather forecasting, aircraft and space research. In the 1980s, a new class of supers, called minisupers, was introduced. These were not minicomputers but mini-scale systems with very fast processors that were much less expensive than the full scale supers. Minisupers, though they have mostly vanished today, made two very important contributions.
The first was that minisupers could be found in a lot of different companies, which gave a very broad range of people exposure to the benefits that could be had, both commercial and research. The second benefit was the realization that useful supers did not have to cost four million dollars and up, but real results could be had cheaply (by 1980 standards) for less than half a million. This opened the search for better ways to build supers, and in the 1990s, one of those ways turned out to be a big array of small computers or computer chips.
Today we pretty much take for granted that the biggest supers will be thousands, and soon, tens of thousands of computer chips all working together through fast links and shared memories. IBM's most ambitious super yet, the Blue Gene, is planned to be one million gigaflop computers in an array to yield the first petaflop computer, 1000 times as fast as their installed 1 teraflop computer, the Blue Horizon, which I wrote about in my last column.
There is another class of system, Distributed Computing, which has supercomputer capability without being a classical supercomputer design. Supercomputers are characterized by running a single OS image with all the compute elements being tightly coupled. One example of this new Distributed class is the Beowulf system - multiple Linux computers tied together with fast communications links. Clusters of up to 16 of these systems work well with basic Linux code, but as Los Alamos National Labs (LANL) has found out, scaling up from there runs into software and hardware limits. LANL calls this Cluster Computing.
The second example of this class is the Internet distributed computing (IDC) technique, as used by Seti@home, with over two million participants. Seti@home participants use spare cycles in any machine that can connect to the internet to analyze data captured from the Arecibo Observatory. Also using the IDC approach is the RC5 project.
Both the Linux cluster and the IDC approach differ from supercomputers because they cannot exchange data fast enough for most supercomputer applications. Supercomputer data exchanges are completed in tens of microseconds, while the Beowulf cluster interactions are typically 10 milliseconds, a thousand times slower. Internet interactions are 50 to 100 times slower than Beowulf. This limits internet distributed applications to those where both the code and data can be quickly downloaded, and the resulting computation only needs to communicate with the main host.
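To see why these delays matter, here is a rough sketch in Python. It uses the latency figures quoted above, taking "tens of microseconds" as 10 microseconds and the internet as 75 times slower than Beowulf (the midpoint of the 50 to 100 range). The workload, one second of computation with 1000 data exchanges, is purely an assumption for illustration, as are the function names.

```python
# Rough one-way exchange delays in seconds, based on the figures in the text.
# "Tens of microseconds" is taken as 10 us; the internet figure uses the
# midpoint of the quoted 50-100x range (an assumption).
LATENCY_S = {
    "supercomputer": 10e-6,
    "beowulf": 10e-3,
    "internet": 75 * 10e-3,
}


def communication_fraction(compute_s, exchanges, latency_s):
    """Fraction of total run time spent waiting on data exchanges."""
    comm_s = exchanges * latency_s
    return comm_s / (compute_s + comm_s)


def report(compute_s=1.0, exchanges=1000):
    """Return {system name: fraction of run time lost to communication}."""
    return {
        name: communication_fraction(compute_s, exchanges, latency)
        for name, latency in LATENCY_S.items()
    }


if __name__ == "__main__":
    for name, fraction in report().items():
        print(f"{name:14s} {fraction:6.1%} of run time spent communicating")
```

With this hypothetical workload the supercomputer loses about one percent of its time to communication, while the cluster and the internet spend most of their time waiting. Applications that rarely need to exchange data avoid the penalty, which is why they suit the IDC approach.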
This very brief outline skips many interesting systems. The full history of supers would fill several books, and applications several more. To find out more about supers, start with a web search site like goto.com, type in supercomputers as the search word, and check out the many links to various sites. Or check any of the portals like Yahoo under the topic computers for supercomputer links.
What do I mean when I refer to supercomputer 'architecture'? Everyone understands the word in reference to buildings as meaning the design of a building. In reference to computers, it has a very specific design meaning - the conceptual design of how the parts are arranged and connected. Computer architecture is separate from the actual implementation of the hardware. Often the architecture of a new design is well beyond what can be built at the current state of the art. This is deliberate, to allow for future faster versions of the same architecture.
Supercomputer architecture is particularly demanding because processor speed is maximized, which makes the memory throughput a critical bottleneck. A processor that produces a result every nanosecond (billionth of a second) can waste a lot of time waiting for memory which is only capable of read or write events every 20 nanoseconds. With personal computers this is effectively solved by fast cache memory which holds the most frequently executed instructions and data.
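The effect of a cache on that 20 nanosecond memory can be estimated with a simple weighted average. The sketch below assumes a cache that answers in 1 nanosecond, matching the processor; that figure and the hit rates are illustrative assumptions, not measurements of any real system.

```python
def effective_access_ns(hit_rate, cache_ns=1.0, memory_ns=20.0):
    """Average time per memory reference for a given cache hit rate.

    cache_ns is an assumed value; memory_ns is the 20 ns figure from the text.
    """
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


if __name__ == "__main__":
    for hit_rate in (0.0, 0.80, 0.95, 0.99):
        avg = effective_access_ns(hit_rate)
        print(f"hit rate {hit_rate:4.0%}: {avg:5.2f} ns per reference")
```

Even at a 95 percent hit rate, the average reference under these assumptions takes about twice as long as the processor needs, so the remaining misses still dominate.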
With supers, the simple cache solution does not scale up well to larger systems. Not only do supercomputers have faster computation units and bigger memory demands, but as the number of symmetrical multiprocessors (SMP) gets much larger than 8 or 16, the sharing of main memory and keeping the independent cache memories in sync becomes a bigger bottleneck than the actual access to memory. Making the design even harder is the need for many threads of computation to exchange data before processing the next step. The two main approaches to solving the memory problem are Crossbar and Non Uniform Memory Access (NUMA).
The supercomputer challenge has been met by a variety of designs. Over time, most of them have evolved into variants on three basic concepts. They are Arrays of Identical Processors (AIP), Vector Processors, and Single Instruction Multiple Data (SIMD). There is a more detailed description of architectures and example systems at the University of the West of England.
The most common supercomputer design concept today is AIP. All of the major computer manufacturers sell supercomputer class systems based on this concept, with a lot of variation in the details. One design with a lot of installations is IBM's parallel SP systems. The SP design merges two concepts - SMP modules with up to 16 processors, and very fast switches which can connect two or more modules. Up to 8,192 processors can now be linked into a single system yielding 12 teraflops of peak performance. This is four times faster than the existing ASCI Blue system at LANL.
Along with IBM, Fujitsu, Cray, SGI, Sun, HP and Compaq sell systems designed from the AIP concept with different arrangements of SMP and interconnections. Most now scale to 128 processors, and some to 512 or 1024 processors. Above 2048 processors, all of the systems are currently custom design except for IBM's SP series.
The second concept, vector processors, was the primary design early in the supercomputer era. Vector processors are so named because they are designed to handle vectors of data - a stream of numbers where each number will be processed through the exact same sequence of operations and stored in sequence. This was very popular because there are a lot of scientific problems that are best represented by vectors of numbers.
All the early supers, from the CDC 6600 to the Cray 2, were vector processors. This changed for two main reasons. First, programming a vector processor for maximum performance required detailed knowledge of both the hardware architecture and timing, and of the nature of the problem. With a great deal of programming effort, vector processors were untouchable in terms of performance, but it was a tough job with few qualified programmers in that era.
The second reason this changed is the development of highly integrated chips. In the 1980s, it became possible to build a computer floating point unit that was as fast as a vector processor without the vector programming headaches. It is true that faster vector processors could also be built, but that advantage was much smaller than before. As more circuits could be put on a chip, the reduction of wiring delays between chips overtook the advantage of vector processing and the shift to other designs was almost universal. As a recent example, vector processors from Fujitsu have been superseded by an AIP design.
Today, the only commercial vector processor I am aware of is the SX-5, made by NEC. They have an ambitious plan for a scaled up version of this to be used as an earth simulator with 40 teraflops of power. The SX-5 design uses multiple Processor Engines (PE), each with multiple vector processors. The design uses distributed memory and a very fast interconnect, similar to classic vector processor designs.
For applications which can take advantage of the vector processors, this design will be quite effective. Fortunately for the programmers, modern software is capable of doing a lot of the optimizations automatically, so this is no longer a big obstacle to using vector systems.
The third main architecture is SIMD. Each instruction drives multiple compute elements which simultaneously process different data words with that single instruction. While a vector system processes multiple words in a serial sequence, SIMD processes multiple words in parallel, one step at a time. The best known SIMD machine was the Illiac IV at the University of Illinois. Here is a short history of Illiac IV and why it was junk ten years later.
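The difference between the serial stream of a vector unit and the lockstep lanes of a SIMD array can be sketched as a count of steps per operation. This is a toy model only: the pipeline depth of 4 is a made-up figure, and the 64 lanes are an assumption chosen to resemble an array the size of the Illiac IV.

```python
import math


def vector_steps(n_words, pipeline_depth=4):
    """Pipelined vector unit: one word enters per step, and the last result
    emerges once it has passed through every pipeline stage."""
    if n_words == 0:
        return 0
    return pipeline_depth + n_words - 1


def simd_steps(n_words, lanes=64):
    """SIMD array: a single instruction processes up to `lanes` words at once,
    so the data is handled in lockstep batches."""
    return math.ceil(n_words / lanes)


if __name__ == "__main__":
    for n in (1, 64, 65, 1024):
        print(f"{n:5d} words: vector {vector_steps(n):5d} steps, "
              f"SIMD {simd_steps(n):3d} steps")
```

The model ignores memory delays and the fact that a vector step may be much shorter than a SIMD step, so it shows only the shape of the two approaches, not their relative speed.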
Memory access is a severe bottleneck for supercomputers. As memories get larger, access becomes slower even with today's fast chip memories. Until recently, memory access was always uniform - the same speed regardless of what address was accessed, or where the memory chip was located. With memory size well into the terabyte range, uniform access means slow access. Cache memory won't solve this problem.
A solution has been developed that gives up the concept of uniform access. It is called Non Uniform Memory Access (NUMA). This concept recognizes that in very large memories, most of the access is to a range of addresses, which with proper design, can be made local and fast to the processor referencing them. For those accesses outside that local area, access will be slower as that data is brought into the local range.
This is conceptually unlike a cache. Access can be a single remote read or write without moving the data, or it can be to move the data into a local memory. The time to access remote data depends on total access activity and physical location of the memory. Further, the data may be sent as a packet over a fast communication link rather than a data bus. In effect, each processor has the ability to reference remote data without having to put all of the memory on one access path.
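A toy model may make the NUMA idea more concrete. In the sketch below, each node owns a contiguous block of addresses, and a reference costs more when the address lives on another node. The class name, the node sizes and both timing figures are hypothetical; in a real machine the remote cost also varies with traffic and distance, as described above.

```python
class NumaMemory:
    """Toy NUMA model: each node owns a contiguous range of word addresses."""

    def __init__(self, nodes, words_per_node, local_ns=100, remote_ns=1000):
        self.nodes = nodes
        self.words_per_node = words_per_node
        self.local_ns = local_ns      # assumed cost of a local reference
        self.remote_ns = remote_ns    # assumed cost of a remote reference

    def home_node(self, address):
        """Node whose local memory holds this address."""
        if not 0 <= address < self.nodes * self.words_per_node:
            raise ValueError("address out of range")
        return address // self.words_per_node

    def access_ns(self, cpu_node, address):
        """Cost of one reference made by a processor on cpu_node."""
        if self.home_node(address) == cpu_node:
            return self.local_ns
        return self.remote_ns


def total_ns(memory, cpu_node, addresses):
    """Total cost of a sequence of references from one processor."""
    return sum(memory.access_ns(cpu_node, a) for a in addresses)


if __name__ == "__main__":
    mem = NumaMemory(nodes=4, words_per_node=1024)
    mostly_local = [1024, 1025, 1026, 0]   # three local, one remote for node 1
    print(total_ns(mem, cpu_node=1, addresses=mostly_local), "ns")
```

The design goal stated in the text shows up directly: the more of a processor's references fall in its own range, the closer the total comes to the fast local figure.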
The other common solution for very large memories is to split them into many smaller banks of memory and connect them to the processors through a crossbar switch. This technique has been used since the beginning of supercomputers. A crossbar switch for 8 memory banks and 8 processors would be called an 8 by 8 switch, where any of the processors could access any memory bank. Simultaneous access could be done if no two processors accessed the same bank at the same time. Smaller banks could be faster, and multiple access multiplied the bandwidth of the memory. All of the vector supers used this technique, as does the current NEC SX-5.
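The crossbar rule, that simultaneous access works only when no two processors want the same bank, can be sketched in a few lines. The function below handles one memory cycle. Giving priority to the lowest-numbered processor is an assumption made for the example; real switches use their own arbitration schemes.

```python
def crossbar_cycle(requests):
    """Resolve one cycle of crossbar requests.

    requests maps processor number -> memory bank number.
    Returns (granted, stalled): lists of processor numbers. Each bank serves
    at most one processor per cycle; ties go to the lowest-numbered
    processor (an assumed arbitration rule).
    """
    busy_banks = set()
    granted, stalled = [], []
    for cpu in sorted(requests):
        bank = requests[cpu]
        if bank in busy_banks:
            stalled.append(cpu)      # bank conflict: wait for the next cycle
        else:
            busy_banks.add(bank)
            granted.append(cpu)
    return granted, stalled


if __name__ == "__main__":
    # 8 by 8 switch, every processor on its own bank: all proceed at once.
    print(crossbar_cycle({cpu: cpu for cpu in range(8)}))
    # Two processors collide on bank 3: one of them must wait.
    print(crossbar_cycle({0: 3, 1: 3, 2: 5}))
```

When every processor hits a different bank, all eight references complete in the same cycle, which is where the multiplied bandwidth comes from.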
Shortly after I finally learned how to spell Monterey correctly, IBM cancelled it. They have declared victory and are integrating much of the technology into later versions of AIX (IBM's version of Unix). A planned release this fall for AIX 5L will include both Monterey technology and Linux compatibility. This release will support both IBM's Power architecture and Intel's IA-64 architecture.
The Santa Cruz Operation, known as SCO, has recently been bought by Caldera, the number two Linux vendor. SCO's Open Server and Unixware as well as SCO's resellers will join Caldera. SCO will retain its services division and the Tarantella product line. Any correlation between this item and the previous one is strictly coincidental.
In OS/2 news, there is the announcement of a successor to Warp 4 called eComStation. Created by Serenity Systems and IBM, it is based on the kernel of OS/2 Warp Server for eBusiness, updated to the version that IBM will release as the Convenience Pak in November. In addition to the OS, eComStation will include:
eComStation will be sold by Mensys internationally. For US sales check out Indelible Blue (alas, now closed), my personal favorite site for OS/2 software. Upgrades from Warp 4 must be ordered before Jan 31, 2001.