
Building a Data Grid

by Bill Nicholls 14Jun2002

Grid Construction and a Sea Data Voyage

Overview

Grid computing, which I have called Meta Computing in previous columns, is a rapidly maturing set of concepts and software. There is now enough software and information on the web to enable almost any group to install and build a custom Grid system.

The advantage of a Grid is that it converts a collection of distributed and incompatible systems into a single system environment. Once a Grid is in operation, additional capacity may be added anywhere and becomes available everywhere with no extra effort.

Globus software, a new class of middleware that is the core of the grid, has now been released in V2. In this column I will outline the concepts and building process and provide links for those who want to build one of their own.

After building a grid, we will take a virtual sea voyage on the Research Vessel Melville. While in the open ocean, Melville will acquire multiple data types, perform on-board processing and exchange data with systems on shore. When Melville returns from its voyage, the scientific data collected will become part of a Digital Library.

What is a Grid?

The word 'Grid' is used with two meanings:

  1. It is the concept used to define the Information Technology infrastructure used within the scientific community.
  2. It can refer to a specific implementation: the computers, networks, storage and other services that together create a particular Grid environment.

From a user interface point of view, a grid environment looks like a web portal with the ability to select data and applications, run programs and batch multiple jobs to run unattended. Conceptually, it is like being at the console of a Unix system with the power of the whole grid complex at your fingertips. Like the web, distance only affects how fast things can happen, not what can be done.

Behind the UI and invisible to the users are the major components of the grid. Hardware includes high speed network links, special storage systems, computational engines (multiple processors, clusters, mainframes, etc.), and special I/O equipment.

The software behind the UI includes the portal code and operating system interfaces, comprehensive security coverage using X.509 certificates and VPN-encrypted links, job management and accounting processes, and, most important, the Storage Resource Broker (SRB).

At the end of the day, the most important part of any computer system is the data. Since the grid hides both location and system type from the user, the SRB must manage the data to provide that transparency. The key is a Metadata Catalog that collects standard metadata covering all file characteristics, and allows custom metadata to be added by a program or user.

The SRB is a multiple client-server system with a standard API for access to the metadata in a hierarchical collection. The SRB supports this access across multiple hardware and OS platforms, and a broad range of file systems and databases. It integrates data, metadata and transport functions to create the functional transparency within the grid.

Metadata at the Core

The SRB treats data as virtual files. A collection of data that is a file is identified by a unique name in the global name space. The type and location of the real file are kept in the metacatalog, identified by standard metadata captured when the file was processed. Collections of files may be virtual as well: a unique name may refer to an arbitrary set of files, any of which may be collections as well.

This creates a very flexible hierarchy of files, selectable on the basis of metadata attributes, not location. Path names are no longer used, eliminating the hard link between name and location. Instead, files are retrieved by looking up attributes in the metacatalog and having the SRB transfer the files from wherever stored to where needed.
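To make the idea concrete, here is a minimal Python sketch of a catalog that maps unique logical names to real locations, finds files by attribute rather than by path, and flattens nested virtual collections. It is a toy illustration only, not the SRB API: the class names, host names, paths and attribute values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One virtual file: a logical name plus where the real file lives."""
    logical_name: str        # unique name in the global name space
    host: str                # hypothetical storage host
    physical_path: str       # real location, known only to the catalog
    attributes: dict = field(default_factory=dict)


class MetadataCatalog:
    """Toy stand-in for a metadata catalog (an assumption, not the SRB API)."""

    def __init__(self):
        self._entries = {}       # logical name -> CatalogEntry
        self._collections = {}   # collection name -> list of member names

    def register(self, entry):
        if entry.logical_name in self._entries:
            raise ValueError(f"duplicate logical name: {entry.logical_name}")
        self._entries[entry.logical_name] = entry

    def define_collection(self, name, members):
        """Members may be file names or the names of other collections."""
        self._collections[name] = list(members)

    def find(self, **wanted):
        """Return logical names whose attributes match every key/value given."""
        return sorted(
            name
            for name, entry in self._entries.items()
            if all(entry.attributes.get(k) == v for k, v in wanted.items())
        )

    def locate(self, logical_name):
        """Resolve a logical name to (host, path), as a broker would before a transfer."""
        entry = self._entries[logical_name]
        return entry.host, entry.physical_path

    def expand(self, name, _seen=None):
        """Flatten a possibly nested collection into file names, visiting each name once."""
        seen = set() if _seen is None else _seen
        if name in seen:
            return []
        seen.add(name)
        if name in self._entries:
            return [name]
        files = []
        for member in self._collections[name]:
            files.extend(self.expand(member, seen))
        return files


catalog = MetadataCatalog()
catalog.register(CatalogEntry("sonar-0001", "archive.example.org", "/vol3/a91.dat",
                              {"type": "sonar", "format": "raw"}))
catalog.register(CatalogEntry("sonar-0002", "ship.example.org", "/data/b17.dat",
                              {"type": "sonar", "format": "gridded"}))
catalog.register(CatalogEntry("ctd-0001", "archive.example.org", "/vol1/c02.dat",
                              {"type": "ctd", "format": "raw"}))
catalog.define_collection("raw-data", ["sonar-0001", "ctd-0001"])
catalog.define_collection("cruise", ["raw-data", "sonar-0002"])

sonar_files = catalog.find(type="sonar")              # selected by attribute, not path
raw_sonar = catalog.find(type="sonar", format="raw")
cruise_files = catalog.expand("cruise")               # nested collection flattened
```

The caller never handles a path name. Only `locate` knows where a file really lives, which is the point at which a broker could decide how to move it.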

Once that hard link has been broken, the SRB can then manage the data based on the data needs, not program or human needs. Data can be replicated to be available at multiple locations, segmented for parallel high speed transfer, collected at a site prior to being processed to avoid delays, and generally treated as an independent component in the grid system.
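Two of these ideas, replication and segmented transfer, can be sketched in a few lines of Python. This is an illustration under my own assumptions, not how the SRB actually schedules transfers: the function names, the host names and the notion of a per-host "cost" are hypothetical.

```python
def pick_replica(replica_hosts, cost_to_requester):
    """Choose the replica host with the lowest transfer cost to the requester.

    cost_to_requester is a hypothetical mapping of host -> cost
    (for example, measured latency); lower is better.
    """
    return min(replica_hosts, key=lambda host: cost_to_requester[host])


def segment_ranges(size_bytes, n_segments):
    """Split [0, size_bytes) into contiguous byte ranges for parallel transfer."""
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    base, extra = divmod(size_bytes, n_segments)
    ranges, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


# The same virtual file is held at two sites; fetch from the cheaper one.
best = pick_replica(["archive.example.org", "ship.example.org"],
                    {"archive.example.org": 5, "ship.example.org": 2})

# Ten bytes split three ways; each (start, end) range could be sent on its own stream.
parts = segment_ranges(10, 3)
```

Because the program asked for a logical name rather than a path, it does not need to know which replica was used or how many streams carried the data.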

Data discovery is enhanced by multiple types of metadata. In addition to the base file metadata of date, type, size, format, and internal SRB identifier, other classes of metadata may be attached to any file. User-defined data such as experiment and purpose, and program data such as run number and time, precision and program version, can be entered. Multiple runs can be identified by their individual parameters, enabling higher level analysis to be applied to projects without special programming.
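As a rough sketch of how custom metadata layered on the base metadata lets runs be picked out by their parameters, consider the Python below. The record layout, field names and all values are invented for illustration and are not taken from the SRB.

```python
def attach(base, **custom):
    """Return a new record with custom attributes added to the base metadata."""
    overlap = set(base) & set(custom)
    if overlap:
        # Custom metadata extends the base record; it must not overwrite it.
        raise KeyError(f"custom keys would overwrite base metadata: {sorted(overlap)}")
    return {**base, **custom}


def select(records, **wanted):
    """Return the ids of records matching every requested attribute."""
    return [
        record["id"]
        for record in records
        if all(record.get(k) == v for k, v in wanted.items())
    ]


# Base file metadata (hypothetical values).
base_records = [
    {"id": "out-001", "date": "2002-03-05", "type": "model", "size": 1200, "format": "grid"},
    {"id": "out-002", "date": "2002-03-06", "type": "model", "size": 1250, "format": "grid"},
    {"id": "out-003", "date": "2002-03-07", "type": "model", "size": 1300, "format": "grid"},
]

# User-defined and program metadata attached to each file.
records = [
    attach(base_records[0], experiment="shelf-survey", run_number=1,
           precision="single", program_version="1.0"),
    attach(base_records[1], experiment="shelf-survey", run_number=2,
           precision="double", program_version="1.0"),
    attach(base_records[2], experiment="shelf-survey", run_number=3,
           precision="double", program_version="1.1"),
]

double_runs = select(records, precision="double")
newest_double = select(records, precision="double", program_version="1.1")
```

Comparing all double precision runs, or only those from one program version, becomes a query over attributes rather than a special-purpose program.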

Conceptually, the center of the SRB is the metadata, surrounded by virtual files, which are moved by SRB transport functions and delivered to programs operating in the grid environment.

Building a Grid Environment

The brief introduction to grids above is only a broad outline of what a grid is. In order to build one, you will need much more background detail, detailed installation instructions and a specific application in mind.

What may surprise you is that an experimental system with the essential grid software installed can be as small as a 500 MHz Pentium with 256 MB of memory and a 1 GB or larger disk. Clearly that configuration won't be enough for a production system, but it will be enough to start testing the real environment.

Note: The presentations which I will reference next were not designed as a unit to cover all of the information that is needed to build a grid. Based on my study of the published grid information and these presentations, I believe it is possible to build an experimental grid with these presentations as a start.

To start your research, visit the AHM2002 (NPACI All Hands Meeting) site for some presentations. Select the "Philosophy of the TeraGrid" by Charlie Catlett for a broad overview of a big grid environment. This presentation is loaded with details and is worth spending some time understanding.

Next, select "Grid Portals-A 3-Part Tutorial" by Mary Thomas, Steve Mock and Kurt Mueller. This presents an overview of the components and of building the User Interface for the Web and Grid portal, followed by a GridPort Toolkit introduction and concluding with Web Server and Portal Installation details.

To see a brief outline of what was done for NASA/IPG to create a grid portal, select "GridPort for Telescience" by Stephen Mock. Next, look through "GridPort Tutorial: Example Portal" by Stephen Mock.

Other presentations you may find useful are the ones on SRB examples and programming, the NPACI Rocks Tutorial on clusters, and the detailed "Storage Resource Broker" by Sheau-Yen Chen et al.

Before installing any software, check out the detailed information at the Globus Toolkit site. Also study the installation and administrator documents.

Finally, to begin the software installation, review "Installing GridPort and Globus" by Kurt Mueller. This part requires a Unix-class systems person, as the software is not standardized for commercial installation.

Exploring the Oceans Electronically

The University of California San Diego (UCSD) Geological Center, Scripps Institution of Oceanography (SIO), and the San Diego Supercomputer Center (SDSC) have collaborated to integrate data acquisition, processing and modeling on an ocean-going vessel, the R/V Melville.

The voyage began in Lyttelton Harbor, Christchurch, New Zealand in March 2002. It was the start of a 14-day expedition to Samoa, in the wake of Captain James Cook, who first explored these oceans. This was Leg 20 of the Cook Expedition, designed to survey New Zealand's outer continental shelf and to prototype the SIOExplorer Digital Library.

The shipboard prototype was designed to perform real-time data acquisition, tag data automatically with identifiers, and control the process from raw data to finished digital images and information. A good overview of the ship and systems with several pictures is here.

During the cruise, 5 GB of wide-beam (about 20 km) sonar data was collected and used to produce a 3D model of the sea floor. This is a small part of the SIOExplorer collection of data from 795 SIO cruises since 1950. Over time, the older data sets will be added to the digital library with appropriate information to identify the older data's source and accuracy.

The Digital Library project is designed to bring the archive of SIO to the web for researchers, students and anyone with an interest in the oceans. This project started when the safety of hundreds of sonar data sets was evaluated. In the process of solving the problem of saving older files and making the data more widely available, SIO developed a modern metadata catalog.

This catalog, plus powerful search capabilities and a "Virtual Ocean" interface, makes the information available to anyone, not just research scientists. The project expanded to include multi-disciplinary work in earth sciences, prior work at SDSC on the San Diego Bay Project, and many other contributors. It is an excellent example of the benefits of cross disciplinary teamwork.

A detailed background on the Oceanography Digital Library has been published by NPACI, the National Partnership for Advanced Computational Infrastructure. At the end of the background article are several links to additional information about other digital libraries and supporting systems.

Note: I plan to install an experimental grid on a spare system under FreeBSD. Even though Red Hat 7.2 is the most common OS for this, it should be possible. I'll add what I learn to my website to help others who want to try this.

Update 22Nov2002: The local grid project has been on hold for lack of a separate hardware base to install it on. This has been solved, but an imminent house move prevents me from pursuing this now. It will be a high priority project once the move is complete. I estimate January 2003 is when I will resume work on this project. Apologies to all who have been waiting for further information.


All content on this site is Copyright 2001 and 2002 by Bill Nicholls