This column will introduce you to a classic OS - OpenVMS. Currently at release 7.3, VMS originated with the DEC Vax, a classic machine. The latest OpenVMS, Version 7.3, announced in October 2000, features clustering, data backup and improved disaster tolerance.
OpenVMS runs on Compaq's mid-range and high-end systems, including the 32 processor Wildfire system announced earlier this year. Version 7.3, along with Compaq's efforts to get COE (Common Operating Environment) approval from the Defense Department, has convinced most OpenVMS users that Compaq will support their OS well into the future.
Why is this announcement important? Consider that Compaq supports four OSs itself and four more industry supported OSs:
Compaq's own four:
What's so unusual about OpenVMS? The answer is twofold.
First, OpenVMS is a 24-year veteran of demanding real-life operations in thousands of sites. OpenVMS offers a wide range of functions, and having evolved through two different processor architectures, the design is clean and extensible, with a reputation for reliability and flexibility.
Second, clustering originated with VMS at Digital long before it became important in the 1990s. Clustering under VMS and OpenVMS was designed to support up to 96 systems in a cluster, with a wide size range of systems that could participate in the cluster. VMS offered advanced capabilities first in the minicomputer field, and ahead of many mainframe competitors.
For a good look at the evolution of the Vax and VMS, a 20th Anniversary Vax/VMS book in PDF format is available from Compaq. Check this link for the full scoop on the current OpenVMS.
I also take a look at the challenges of programming supercomputers. While programming a two or three level client/server system is a complex task, supers raise that a whole order of magnitude. Business applications typically have small chunks of code called on demand, with run times from milliseconds to seconds. Supers have large chunks of code, and huge data sets, that run for hours or days. This difference has begun to blur as commercial use of data mining and analysis of terabyte customer files put supercomputer demands on business systems.
In either case, big problems, like weather forecasting or data mining, need big, fast computers. Getting access to a supercomputer is only half of the job - the other half is programming it. This may not sound difficult, but the difference between a good programming job and an adequate one can cut run times in half or better. If time on the super costs $1000 per hour, the difference between five and ten hour run times for a job run about 50 times a year can be $250,000. That's worth some extra effort.
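The $250,000 figure can be checked with a few lines of Python. This is only a sketch of the arithmetic: the hourly rate and run times come from the text, while the count of 50 runs a year (roughly one a week) is an assumption, chosen because it is what the quoted total implies. The function name is made up for illustration.

```python
def annual_savings(rate_per_hour, slow_hours, fast_hours, runs_per_year):
    """Yearly cost difference between a slow and a fast version of one job."""
    saved_hours_per_run = slow_hours - fast_hours
    return rate_per_hour * saved_hours_per_run * runs_per_year


# From the text: $1000 per hour, ten-hour versus five-hour runs.
# Assumption: the job runs 50 times a year (about once a week).
savings = annual_savings(rate_per_hour=1000, slow_hours=10,
                         fast_hours=5, runs_per_year=50)
print(f"${savings:,} saved per year")
```

Each run saves five hours, or $5,000, so the yearly total scales directly with how often the job runs.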
What's involved with that extra effort? And why is it so difficult? The answers to these questions revolve around the complex nature of supercomputer systems. For some details on supercomputer architecture, refer to my earlier column.
The VMS (Virtual Memory System) operating system was created for the new (1976) Vax architecture. The Vax and VMS were designed together, with multitasking, multiprocessing (MP), and distributed networking as goals. MP was uncommon in minicomputers in those days, and that, along with PDP-11 compatibility, made the Vax very popular. VMS version X0.5 was the first released to customers in support of the beta test of the VAX-11/780 in 1977. VAX/VMS Version V1.0 was shipped in 1978, along with the first commercial 11/780s.
The Vax (Virtual Address eXtension) was DEC's first 32 bit machine with the ability to address gigabytes of virtual memory, even though the early Vaxes were limited to 16 megabytes of real memory. In those days memory chips were 4K bits, so a megabyte was over 2,000 chips. Since those chips were not as reliable as core memory, DEC built the memory with ECC, Error Checking and Correction circuits.
As prices came down and performance went up, the Vax dominated minicomputer sales in the 1980s. The Vax evolved through a large number of models, from multiple circuit boards to one chip per processor. It finally ceased production under Compaq in 1999. Its long production run can be credited to the security of the systems, the power of VMS and DEC's vision of a broad range of compatible systems with lower prices and greater power.
Digital's experience with the Vax foreshadowed the explosive growth of the PC revolution. A big FAQ about OpenVMS is available at: FAQ. This FAQ is an overview of all the information available for Vax and Alpha running VMS. The large number of links make this a good central point for information on VMS and OpenVMS.
VMS represented an innovative approach to operating systems. Early 1960s hardware had an OS for each type. IBM consolidated its systems to a single architecture called the 360 Series in 1964. But IBM's OS story only changed a little. By the late 1960s, IBM had five OS versions, from DOS to OS/360 in PCP, MFT and MVT versions, and an early VM for the 360/67. Digital made the big step in 1977 to a single OS for a whole line of systems, small to large, one system or a network, single or multiprocessor.
Here is a quote about Vax/VMS goals from the PDF book referenced above: "Machines would range from desktop to enterprise-wide systems. The goal was to establish a single VAX distributed computing architecture that would run the same operating system. A related goal was that VAX products would one day provide a price range span of 1000:1."
The goal from the start was to design an integrated hardware and operating system. In the 1970s, the Vax hardware drove the sales, but by the mid 1980s, the VMS operating system was more important, as it supported networking, multiple processors, old and new DEC hardware, and even older PDP-11 code run in compatibility mode. By the late 1980s, DEC reached that 1000:1 price goal by moving the Vax from 20 circuit boards in the Vax 11/780 to a single chip in the Microvax II and later, to the 8000 and subsequent series.
Digital was one of the first companies with a 32 bit minicomputer, driven there by its customers' needs for bigger programs and more processing power. As the customers began to reach the limits of this platform, Digital developed the next generation, the Alpha chip. This was a true 64 bit design, pushing the state of the semiconductor art.
This is where Digital's vision really paid off. The VMS V1.0 for the Vax11/780 had matured, still compatible, to be an enterprise capable VMS V5.0. Now, in version 6, it was converted to run on the Alpha chip and both versions were renamed to OpenVMS, one OpenVMS Vax, the other OpenVMS Alpha. After a recompilation, almost all software was converted to the Alpha version.
That long evolution has led to the current OpenVMS. Compaq continues that evolution with a wide range of new products:
This extensive list of products is convincing evidence that OpenVMS is alive and well supported. For the details on OpenVMS V7.3, see Compaq's web site.
Today, on the verge of the real third millennium, OpenVMS has reached version 7.3, running on systems as large as a 32 processor SMP design, and as small as a single older Alpha processor, similar to a desktop. From desktop to enterprise, through six hardware revisions on each of two incompatible hardware foundations, the original VMS concept continues to run and play well in the modern era. Its grandson may still be running at the turn of the next century.
The programming challenge arises because every computer system has bottlenecks - parts of the system that are slower than the processor. Different programs run into different bottlenecks because they have more computation and less I/O, or vice versa. It is less critical in commercial programs because they typically do not run for hours every week. Business systems are more likely to have lots of smaller programs that run frequently, but only for a few seconds or minutes.
Programs for supers are different. They typically involve calculations on tens of thousands to millions of data points, then exchange data between processors, and recalculate all the data points again. This goes on until some terminating criterion is reached, usually after thousands of cycles. The programming goal is simple - make sure no time is lost in any of the processing steps. Actually doing this is neither simple nor easy. A good example of this is: Shared memory programming on the IBM POWER3 nodes Parallelization of simple loops.
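The calculate, exchange, recalculate cycle can be sketched in a few lines of Python. This is a toy illustration, not code from any of the articles mentioned: each inner list stands in for the data points owned by one processor, the "exchange" is just a copy of edge values between neighbouring lists, and the calculation (averaging each point with its two neighbours) is an assumed stand-in for real physics. The function name, tolerance and cycle limit are hypothetical.

```python
def relax(chunks, tolerance=1e-6, max_cycles=100_000):
    """Repeat exchange -> recalculate until the values stop changing.

    `chunks` is a list of non-empty lists; each inner list plays the role
    of one processor's data. The first and last points are held fixed.
    Returns the final chunks and the number of cycles used.
    """
    for cycle in range(1, max_cycles + 1):
        # Exchange step: each chunk receives its neighbours' edge values.
        left_edges = [None] + [c[-1] for c in chunks[:-1]]
        right_edges = [c[0] for c in chunks[1:]] + [None]

        biggest_change = 0.0
        new_chunks = []
        for chunk, left, right in zip(chunks, left_edges, right_edges):
            new = list(chunk)
            for i, value in enumerate(chunk):
                lo = chunk[i - 1] if i > 0 else left
                hi = chunk[i + 1] if i < len(chunk) - 1 else right
                if lo is None or hi is None:
                    continue  # fixed end point, never recalculated
                new[i] = 0.5 * (lo + hi)
                biggest_change = max(biggest_change, abs(new[i] - value))
            new_chunks.append(new)
        chunks = new_chunks

        # Terminating criterion: no point moved by more than the tolerance.
        if biggest_change < tolerance:
            break
    return chunks, cycle


# Six data points split across two pretend processors.
result, cycles = relax([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(cycles, result)
```

Even in this toy, every cycle has a point where all the chunks must wait for the exchange to finish. On a real machine that wait is where time is lost if one processor has more work than the others.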
Supers aren't just a bunch of fast computers tied together. Everything is optimized starting with the processors. They are designed for maximum speed, which puts a large demand on the memory system. Processors can only calculate what they have in registers, and getting the data from and to memory is a critical bottleneck.
Like many chips for PCs, super chips have two or three levels of cache plus ways to bypass the cache when needed. To speed the calculations, the programmer must figure how long it takes to get an array of numbers into the processor, how long to calculate, and how long to get the results out. In the ideal situation, the calculations should run continuously with the data flowing in and out without pause.
It isn't just balancing between processor and cache. It's balancing the whole chain of components from main memory to cache 2 to cache 1 to processor, and back out again. Each level of processor and cache and memory has a different access time and bandwidth, getting faster as it gets closer to the processor.
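One rough way to picture the balancing act is to time each link in the chain separately and see which one is slowest. The Python sketch below does that under strong simplifying assumptions: every stage overlaps perfectly with the others, every byte crosses every level exactly once (no reuse in cache), and the bandwidth and speed figures are invented for illustration rather than taken from any real machine.

```python
def slowest_stage(bytes_moved, flops, stage_bandwidth, flop_rate):
    """Return (name, seconds) of the stage that limits a fully overlapped run.

    stage_bandwidth maps a stage name to bytes per second;
    flop_rate is the processor speed in operations per second.
    """
    times = {name: bytes_moved / bw for name, bw in stage_bandwidth.items()}
    times["processor"] = flops / flop_rate
    name = max(times, key=times.get)
    return name, times[name]


# Hypothetical figures: bandwidth rises closer to the processor.
stages = {"main memory": 1e9, "cache 2": 4e9, "cache 1": 16e9}
name, seconds = slowest_stage(bytes_moved=8e8, flops=1e8,
                              stage_bandwidth=stages, flop_rate=1e9)
print(f"limited by {name}: {seconds:.2f} s")
```

With these made-up numbers the processor needs only 0.1 seconds of work but main memory takes 0.8 seconds to deliver the data, so the processor would sit idle most of the time. Tuning means rearranging the calculation so data is reused while it is still in the faster levels.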
There is no general solution to this problem. Each set of calculations takes a different data set, different calculation time, different interchange time and different I/O requirements. A good example of the balancing act required is: Single Processor Tuning of the PARTREE code for the IBM SP
Each supercomputer has different timing even if the architecture is the same. When the architecture is different, a whole new set of problems can arise. An article on the new Tera MTA architecture is: Parallel Performance of Monte Carlo Photon Transport Code on the TERA MTA
One capability in supers that is not available elsewhere is vector processing. Briefly, this is the ability of the system to bypass caches and send a continuous stream of numbers to the processor as fast as the processor can handle them, eliminating cache overhead and delays. This only works when the stream of numbers is long enough to make up for the overhead of setting up and starting the vector. Vector processing can be very powerful, but it is not easy to use effectively. See: Maximizing CRAY T90/J90 Applications Performance - vectorization of C code.
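The "long enough" condition can be made concrete with a simple cost model. In the sketch below, a vector operation pays a fixed startup cost and then a small cost per element, while a scalar loop pays a larger cost per element with no startup. The model and all three cycle counts are assumptions for illustration, not measurements from a CRAY or any other system.

```python
import math


def break_even_length(startup, per_element_vector, per_element_scalar):
    """Smallest element count n where the vector version is faster.

    Vector time: startup + n * per_element_vector
    Scalar time: n * per_element_scalar
    Returns None if the vector path is never faster.
    """
    if per_element_vector >= per_element_scalar:
        return None
    n = startup / (per_element_scalar - per_element_vector)
    return math.floor(n) + 1  # first whole n strictly past break-even


# Hypothetical costs in clock cycles.
n = break_even_length(startup=50, per_element_vector=1, per_element_scalar=4)
print(f"vectors pay off from {n} elements upward")
```

Loops shorter than the break-even length run faster as plain scalar code, which is one reason vectorizing a program takes judgment rather than a compiler switch.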
Most supercomputer runs use huge amounts of data. In my earlier column, I noted that SDSC, the San Diego Supercomputer Center, had the world's largest High Performance Storage System (HPSS). At that time it was over 168 trillion bytes (168 x 10**12). Shortly after my column, they doubled the tape storage, giving them almost 400 terabytes (TB) of storage.
If you have a new computer, the disk can transfer about 10 megabytes per second. The channel may be faster, but the limiting factor is the disk transfer rate, not the channel. This is a bottleneck in every PC. If you had to process one TB of data, it would take your disk more than 100,000 seconds (almost 28 hours) just to read the data. Realistically, few disks can sustain more than 10 MB/second for more than a few rotations because they must seek to the next track, which interrupts the transfer.
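The read-time estimate is easy to reproduce. The Python sketch below uses the figures from the text (one TB, 10 MB per second) with decimal units, which gives about 100,000 seconds. The `disks` parameter is an added assumption: it models several disks reading in parallel with perfect scaling, which real systems only approach.

```python
def read_time_seconds(data_bytes, bytes_per_second, disks=1):
    """Time to read data_bytes when `disks` drives share the work evenly."""
    return data_bytes / (bytes_per_second * disks)


ONE_TB = 1e12        # decimal terabyte
DISK_RATE = 10e6     # 10 MB per second, the figure from the text

seconds = read_time_seconds(ONE_TB, DISK_RATE)
print(f"{seconds:,.0f} seconds, or {seconds / 3600:.1f} hours")

# Assumed ideal case: eight disks reading in parallel.
print(f"{read_time_seconds(ONE_TB, DISK_RATE, disks=8) / 3600:.1f} hours")
```

Seek delays and contention are ignored here, so real sustained times would be longer than either figure.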
With supers, the process is more complex. Prior to the run starting, the data is staged to a set of fast disks. Once the run starts, several disks in parallel transfer data over a fast channel to one of the systems, which must then distribute each piece of data to the appropriate processor. A block of data may be moved multiple times to get to where it is needed. When the system has more than a thousand processing units, routing data is a real challenge.
After a long trip from the HPSS, the data is finally in main memory. Now it starts down that path to the processor, and results come back which ultimately have to get back to the HPSS for storage. In an ideal system, data would stream from the HPSS into memory and back to the HPSS at just the right speed to fully utilize the carefully optimized calculation path. This could be done, in theory, for a single super connected to a single HPSS running a single program.
Unfortunately, the real super world is multiple supers each running multiple programs connected to multiple devices managed by a single HPSS. The program load is unpredictable, the HPSS load is unpredictable, and the distribution of the load changes from minute to minute. Even a perfectly optimized program will be slowed by external events caused by other programs. Regardless of the non ideal situation, optimization is still very important. For a look at how a simple I/O approach can cause performance problems, see: Hierarchical Data Structuring: An MPP I/O How-To.
For more about the complex details of programming a supercomputer, check out programming a supercomputer. Our friends at NPACI and SDSC have a list of articles about programming supers, and almost half of them are about optimization. Any programmer who has had to shoehorn a big job into a busy system will appreciate the challenge of super programming.