2001 Column Index

SuperComputer 2000 -- Part Two

by Bill Nicholls 10Jan2000

In part one, I talked about how supercomputers handle data, networking, I/O and very large databases, and introduced the Grid concept. In Part two, I will look at special applications of supercomputers and how they will affect our future.

After looking at a traditional supercomputer application done in an untraditional way, I'll be exploring factors limiting performance in commodity cluster computers. These clusters, because of their ease of assembly and low cost, promise to be a major part of future computer environments.

In the applications area, I'll look into the demanding problems involved in working with biological databases. Matching specific biological features in a database when more than 100 features may be involved is challenging. But the really difficult task is finding smallest trees in a database because of how many trees are possible.

Programmers will find the Software Tools section of considerable interest.Not only are software tools coming of age, but software environments for programmers can now be as good as what we provide for everybody else. The environments are as interesting as tools for what they presage - information utilities. Could the millennium have finally arrived?

Gigabit Gigamodels

Special applications are those that demand supercomputer level of resources to finish their job in a reasonable time. Weather forecasting is a great example of the time sensitive nature of these applications. A three day forecast needs to be done much faster than three days, or you could just look out the window. The only time a three day forecast has value on the fourth day is to see if the predictions matched the actual weather.

Along with weather forecasting, climate modeling is a supercomputer application that demands great processing power, though it is not time sensitive. These special applications are so demanding that more than one supercomputer, or more than one center may be connected to attack the very largest scale problems. One of the challenges in this process is making effective use of gigabit networks that tie the supercomputer centers together.

At SC2000, a contest was held between various applications to determine which team could make the best use of the network. Just to make it harder, they ran all of the bandwidth intensive applications at the same time, along with all the other show traffic. The results are outlined in: Bandwidth .

Clustering Performance

The performance of a commodity cluster for compute limited calculations depends on several factors, such as processor speed, how much data is loaded and exchanged, and how well the algorithm scales outside of an SMP environment. While 100 megabit Ethernet is the choice for low cost Message Passing Interconnect (MPI) on commodity cluster systems, a lot of problems require data exchange with higher speeds and shorter delays. These data exchanges, required for synchronizing the individual computations, are a critical limiting factor in scaling up a program from a 2 CPU SMP to a 64 CPU Beowulf cluster. The effects of MPI delays on overall performance increases more than linearly depending on how many neighbors in the computational mesh need to be notified.

Two high speed interconnect technologies for commodity clusters are GigaNet and Myrinet. GigaNet is based on a hardware implementation of Virtual Interface (VI) Architecture and Asynchronous Transfer Mode (ATM) technologies. Myrinet leverages packet switching technologies from experimental Massively Parallel Processors (MPP) networks.

The paper by Jenwei Hsieh, Tau Leng, Victor Mashayekhi, Reza Rooholamini, and Dell Computer Corporation investigates the two technologies for performance on a variety of problems. The extensive testing results are displayed in several graphs, some showing clear differences, others small differences.

The overall conclusion was that the specific implementation of the Message Passing Interface (MPI) was critical to performance. Polling using the MPICH-GM protocol on Myrinet was the lowest delay, but not the highest performance on coarse grained problems. Those coarse grained applications showed better performance on MPI/PRO over GigaNet. Careful inspection of this paper will provide the information to chose the better of the technologies for a specific algorithm and system architecture.

Speeding Biological Discoveries

Two papers at SC2000 investigated ways to speed access to huge pharmacological and biological databases. One paper investigates how many parallel searches to run depending on the computer and database size. The other paper investigated using current tree search algorithms in parallel to reduce search time from months to hours.

The first paper, investigates both data representation and search techniques on two different size SMP machines. It is Data Access Performance in a Large and Dynamic Pharmaceutical Drug Candidate Database by Zina Ben-Miled, Yang Liu, Indiana University, Purdue University, Dave Powers, Eli Lilly, Omran Bukhres, Indiana University, Purdue University, Michael Bem, Robert Jones, Robert Oppelt, Samuel Milosevich, Eli Lilly.

In testing, they used two different data representations that depended on whether the data array was sparse or dense. The DB sizes were 350K, 700K and 1.4M records in size, and both sequential and random access methods were evaluated for performance characteristics. The two systems were a Sun E3500 and a Sun E10000. The same tests were run on both machines with multiple DB size and number of parallel threads.

Some of the clear results were that fewer sequential threads were needed for maximum performance because the disk array could join requests for higher thruput. The smaller system saturated at fewer sequential threads because the disk array was very efficient, yielding a high data load. For both systems, threads had to be limited to avoid increasing the I/O wait because of disk contention. As database sizes or data representation changed, or for different machines, retuning the number of threads was necessary for best performance.

The second paper, Parallel Phylogenetic Inference by Quinn Snell, Michael Whiting, Mark Clement and David McLaughlin of Brigham Young University. This paper discusses the size of the problem and innovates the search process in two ways. One is to use randomly selected trees to find a starting point, the other is the distributed approach pioneered by the RC5 and Setiathome efforts.

The random perturbation approach is necessary because of the incredibly huge tree space that even a small set of taxa (data sets) can form. Here are some examples:

100 taxa 2 x 10**182 trees
1,000 taxa 2 x 10**2860 trees
10,000 taxa 8 x 10**38658 trees

Those last numbers are exponents. They represent the number of zeros before the decimal point. It is clear that this problem qualifies as NP-complete, which is a problem where the computational requirements rise at an exponential rate. NP-complete problems cannot be solved by brute force.

The system that runs the search is named DOGMA, for The Distributed Object Group Metacomputing Architecture. It can be used on supercomputers or clusters, and it also provides a way for machines to join through a screen saver or an Internet web page. One of the best features of DOGMA is that it is implemented as a browser web page and can handle other distributed tasks besides tree searches. The web site has a demo has of an early version of DOGMA.

The search technique is called Parallel Ratchet, which obtains a starting tree, randomly perturbs it, then performs a standard tree search on the modified starting tree. It then returns to the unmodified tree, and starts a search from the trees found in the first search. Ratchet then returns to the start with a new optimal tree and begins the process over. The goal is to find the shortest tree meeting the search criteria. DOGMA registers search tasks and assigns them to idle systems which have volunteered to help, much like setiathome.

For small data sets, linear speedup is restricted to only 4 to 8 processors, and surprisingly, more processors sometimes produce poorer results. Large data sets are expected to have better results in terms of speedup.

Software Tools

Finally, the programmer's children have some new tools. For decades programmers developed tools for other users while their toolset looked a lot like stone hammers. While any large system has a complex environment that needs tool support, supercomputer hardware and software is at least an order of magnitude more complex. More than just static analysis, supercomputer users need dynamic analysis of parallel code to make the best design choices.

Two papers from SC2000 contribute to this area. The first explores tools for object oriented software as the base for a custom toolkit, sharing common data so more effort can be spent on specific analysis. The second paper creates a parallel programming tool environment through web access, with a common methodology. Both papers look to the concept of tool sets rather than isolated tools.

The first paper, A Tool Framework for Static and Dynamic Analysis of Object-Oriented Software with Templates by Kathleen A. Lindlan, Janice Cuny, Allen D. Malony, Sameer Shende, CIS, University of Oregon, Bernd Mohr, ZAM, Forschungszentrum Juelich, Reid Rivenburgh, CRAG, Los Alamos National Laboratory, Craig Rasmussen, ACL, Los Alamos National Laboratory is database oriented.

They start with a Program Database Toolkit (PDT) that uses compile time information to build a DB of program structure information. From an Intermediate Language (IL) internal to a C++ compiler, a API named DUCTAPE makes the PDT information available in a standard way. The PDT provides a number of static tools including call graph trees. Further analysis can be done with tools that use the PDT, such as TAU (Tuning and Analysis Utilities) framework and the SILOON (Scripting Interface Languages for Object-Oriented Numerics) toolkit.

TAU uses the PDT to rewrite the source code with TAU macros. When compiled, the program collects detailed information on execution resources which can be easily displayed. SILOON enables scripting of numerical libraries into custom numeric programs by collecting and assembling the components like building blocks. Perl and Python scripts can easily assemble efficient numeric routines into a powerful tool with SILOON doing the interfaces and entry points requiring detailed knowledge.

Version 1.3 of the Program Database Toolkit for C++ has been released Toolkit. The distribution includes the C++ Front End, the IL Analyzer, and DUCTAPE. The capabilities of PDT are being extended to include processing Fortran 90 and Java code and exhancing PDT to enable all three languages to be handled by one set of tools.

The second paper, Towards an Integrated, Web-executable Parallel Programming Tool Environment by Insung Park, Nirav H. Kapadia, Renato J. Figueiredo, Rudolf Eigenmann*, Jos  A. B. Fortes, of Purdue University, looks at common access with a common methodology. Also specialized towards parallel programming, their approach, called Parallel Programming Hub (ParHub), makes diverse tools operating in different modes and systems available through a common web interface.

ParHub, a toolkit, is built on PUNCH (Purdue University Network Computing Hubs). PUNCH hides the interface and system differences by using a web interface built on cusomizable HTML templates. Using the same control information that tells the loader tool how to install a program, this information is compiled into PUNCH's system. PUNCH can install batch mode, text and graphical interface programs, including those running on remote third-party secure sites. PUNCH supports several other toolkits as well - it is not exclusively for parallel programming. See PUNCH.

PUNCH is supported by a two level infrastructure, a network based desktop environment supported by PUNCH and connected through Internet and intranet to a distributed resource back end managed by SCION middleware. PUNCH submits jobs to SCION, which calculates the most cost effective system to run the job. SCION supports physically distributed compute engines and tools, and handles the resource and priority assignment as well as failure recovery.

ParHub addresses two major impediments to the use of tools. First, with many tools available worldwide, only a few may be available to a person or team due to time, cost or system constraints. Second, most tools are not designed to share information, though that is beginning to change. The combination of limited resources, learning curves and incompatibilities has limited the availability of tool sets.

ParHub addresses the learning curve by building a common style web interface for all tools. While the information is different for each tool, the web format is familiar, reducing learning time. The web also enables a single entry point, called a portal, to hide the complexity of multiple systems and locations. In this portal, ParHub creates an integrated OpenMP environment with several tools designed to assist tuning parallel programming.

A special source compiler (Polaris) automatically discovers parallel opportunities and modifies the source to implement them in OpenMP form. Max/P is a Polaris based tool that evaluates the maximum potential parallelism at runtime. Once this work is done, hand tuning begins with the Ursa Minor tool, which collects the Polaris and runtime information for display and analysis. Using these tools, a programmer can find the compute core and modify it for better performance or more parallelism, and begin the optimization cycle again.

As interesting as ParHub is the PUNCH & SCION infrastructure. In this paper we see a prototype of an information utility, this one specialized for programmers and electronics engineers. But at its core, PUNCH & SCION addresses and solves the problems of access, ease of use, flexible infrastructure and multiple administration domains that will be essential to any true information utility. I'll be writing further about the information utility in coming months.

[30]