NVO, the National Virtual Observatory, has just received a $10 million NSF grant. While this is not enough for the whole job, it is enough to establish a prototype and prove the concept in real use.
Just how challenging is the NVO? Very. Extremely. Mind-bogglingly complex. NVO will advance both science and technology by significant strides. I say this both as a physicist with a long interest in space and astronomy and as a senior IT professional. So what, I hear you say. So what indeed.
In order to understand what NVO is, and what it implies, I'll need to talk about the individual challenges, the complex environment it will operate in, and how those advances will spread into business and our daily lives. Since NVO will touch several resources in the process of development and integration, I'll write about each one and how they'll be collected into a working NVO system. The products of that technology will be many.
Here's a taste of what it's all about. NVO will bring advances in science that will directly increase our understanding of the universe we live in. It will improve our chances of avoiding a catastrophic asteroid hit and spawn advances in understanding high-energy events like the fusion that powers the sun and may someday power our homes. It will create tools to organize powerful meta clusters that will be used for research into medicine, the environment, genetics and other technologies on which our lives depend. If you find that all a bit hard to believe, well, read on.
A Virtual Observatory does not depend on any direct connection to light-gathering equipment like telescopes, radio antennas or the Hubble spacecraft. Like virtual memory, which lets your computer act as though it had more memory than its real RAM, the NVO will give us access to an immense digital collection of sky images stored at sites around the world.
When you view an image of the sky on your computer, you are seeing a presentation of a digital representation of that sky. Older sky surveys were photographic, and those photographs must be scanned into digital form before NVO can access them. More recent telescopes, both ground-based and spaceborne, are digital right at the detectors, so the image data goes directly to disk and tape.
Once an image is in digital form, it can be sent over the Internet, processed to remove noise, compared automatically to earlier images of the same sky for changes, and combined with other images to form a bigger image. The NVO plans to organize all of these images, covering wavelengths from radio to visible to X-ray, into a single virtual collection.
This virtual collection will not be at a single location, but will be organized by federation software. This technology, Federation, will create a top layer of software that deals invisibly with the different locations, directories and formats where the digital images are stored. Just as your web browser makes the physical structure of the Internet invisible, so will federation software make the different sources of sky images appear as though they were all in one place.
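To make the federation idea a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `Archive` and `Federation` classes, the site names, the field names and the sample records are invented for illustration and are not part of any NVO software. The point is only that each site keeps its own layout, while one top layer translates and merges the answers.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    """One hypothetical data site that keeps its own local record layout."""
    name: str
    records: list
    # Maps the federation's common field names to this site's local names.
    field_map: dict

    def search(self, ra_min, ra_max):
        """Return records in the right-ascension range, in the common layout."""
        ra_key = self.field_map["ra"]
        hits = []
        for rec in self.records:
            if ra_min <= rec[ra_key] <= ra_max:
                common = {k: rec[local] for k, local in self.field_map.items()}
                common["source"] = self.name
                hits.append(common)
        return hits


@dataclass
class Federation:
    """Top layer: the user asks once and never sees the individual sites."""
    archives: list = field(default_factory=list)

    def search(self, ra_min, ra_max):
        results = []
        for archive in self.archives:
            results.extend(archive.search(ra_min, ra_max))
        return sorted(results, key=lambda r: r["ra"])


# Two invented sites that store the same kind of data under different names.
optical = Archive(
    name="optical_site",
    records=[
        {"RA": 10.5, "DEC": -3.0, "file": "a.img"},
        {"RA": 200.0, "DEC": 45.0, "file": "b.img"},
    ],
    field_map={"ra": "RA", "dec": "DEC", "image": "file"},
)
radio = Archive(
    name="radio_site",
    records=[{"right_asc": 11.2, "decl": -2.5, "path": "r1.dat"}],
    field_map={"ra": "right_asc", "dec": "decl", "image": "path"},
)

federation = Federation([optical, radio])
results = federation.search(10.0, 12.0)
for hit in results:
    print(hit)
```

A real federation layer would also have to cope with network access, file formats and access rights, none of which this sketch attempts.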
Just how much data will be encompassed by NVO? Enough even to boggle the minds of experienced IT professionals. While the San Diego Supercomputer Center (SDSC) routinely handles terabytes (1 TB = 10**12 bytes) of data, the NVO will ultimately deal with petabytes (1 PB = 10**15 bytes) of data. By comparison with a typical new desktop computer with a 40 gigabyte drive, one petabyte would fill 25,000 such drives.
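For readers who like to check the arithmetic, the comparison above can be reproduced in a few lines of Python. The variable names are mine; the sizes are the decimal ones used in the text.

```python
# Decimal units, as used in the text.
GIGABYTE = 10**9
TERABYTE = 10**12
PETABYTE = 10**15

# A typical new desktop drive, per the text.
DRIVE_BYTES = 40 * GIGABYTE

drives_per_terabyte = TERABYTE // DRIVE_BYTES
drives_per_petabyte = PETABYTE // DRIVE_BYTES

print(f"One terabyte fills {drives_per_terabyte} drives")
print(f"One petabyte fills {drives_per_petabyte:,} drives")
```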
Where will all of this data be stored? The simple answer is "All over the place." It will be stored at observatories, supercomputer centers, NASA, universities and research labs. Each originator of sky image data will retain local control over that data and store it in whatever organization works best for them. If NVO were to ask for a common storage structure, even apart from the problem of financing the change, the conversion would occupy at least a decade.
If the NVO were to try to centralize all of the data, such a site would cost hundreds of millions of dollars to build and would take a long time to become operational. But because of the ARPA (Advanced Research Projects Agency) network project that was started in 1972 and some ambitious physicists who invented the web browser in 1990, we now have a tool which makes those massive projects of common organization or a single location unnecessary.
Even using the approach of a federation software layer, the NVO project has big challenges. Those challenges come in the form of data set size, source and organization diversity, and rate of growth.
Data sets, the groups of digital data that define a specific area of the sky, can be very large. For comparison, let's use a large scientific data set such as SDSC produces regularly, one terabyte, or a thousand gigabytes, enough to fill 25 of those 40 gigabyte drives.
A single sky survey can generate from 10 to 1000 GB (one terabyte) of data. A large one, such as the 2MASS infrared survey, has already generated ten terabytes of data. This one survey has created 10 times as much data as a large scientific data set. That's one survey, at one set of wavelengths.
Multiply that by a dozen to cover the current range of wavelengths of interest and you will begin to see just how fast this data set will grow. That does not include the huge collection of data, some digital and some analog (photographic negatives), that already exists. One estimate shows data growth accelerating in the next decade to one terabyte per day!
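A rough back-of-the-envelope version of that growth, in Python. The figures come from the text (a ten-terabyte survey, a dozen wavelength ranges, one terabyte per day). The constant daily rate is my simplifying assumption, since the estimate describes growth that is still accelerating.

```python
TB = 10**12
PB = 10**15

# One large survey, repeated across a dozen wavelength ranges.
large_survey_bytes = 10 * TB
wavelength_ranges = 12
multi_range_bytes = large_survey_bytes * wavelength_ranges

# Assumed constant rate of one terabyte per day (a simplification).
daily_growth_bytes = 1 * TB
days_to_petabyte = PB // daily_growth_bytes
years_to_petabyte = days_to_petabyte / 365

print(f"A dozen large surveys: {multi_range_bytes // TB} TB")
print(f"Days to add one petabyte: {days_to_petabyte}")
print(f"Years to add one petabyte: {years_to_petabyte:.1f}")
```

At that rate a new petabyte arrives in under three years, before counting any of the data that already exists.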
This brings us to the next issue, the format of the data sets. Once astronomers began using direct digital collection techniques, the era in which all data lived on photographic negatives ended. While many observatories continued to use photographic techniques, newer instruments, and those which worked in wavelengths outside the range of photography, went to direct digital capture and storage.
Each instrument records data in a specific wavelength range, on a specific size of detector, at a specific level of resolution. Since there was a wide range of wavelengths, detectors and resolutions, these created a wide range of digital storage formats. Worse, the specifics of the data format could be different even on the same instrument, depending on the purpose of the image collection.
This diversity of astronomical data formats makes the Tower of Babel look like a small town. Complicating the problem even further is the lack of standards on how to record the purpose and format of the data (metadata), or on where to store this metadata.
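To show what a metadata record might look like, here is a small Python sketch. It is not an astronomical standard: the `ImageMetadata` class, its field names and the sample values are all hypothetical, chosen only to mirror the items mentioned above (wavelength range, detector size, resolution, purpose and format).

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ImageMetadata:
    """Hypothetical description of one image, stored alongside the pixels."""
    instrument: str
    wavelength_min_m: float   # shortest wavelength recorded, in meters
    wavelength_max_m: float   # longest wavelength recorded, in meters
    detector_width_px: int
    detector_height_px: int
    resolution_arcsec: float  # sky angle covered by one pixel
    purpose: str
    data_format: str


# Invented values for an imaginary instrument.
example = ImageMetadata(
    instrument="hypothetical_infrared_camera",
    wavelength_min_m=1.0e-6,
    wavelength_max_m=2.5e-6,
    detector_width_px=256,
    detector_height_px=256,
    resolution_arcsec=2.0,
    purpose="all-sky survey",
    data_format="site_local_v1",
)

# Writing the record as plain text lets any other site read it back.
metadata_text = json.dumps(asdict(example), indent=2, sort_keys=True)
restored = ImageMetadata(**json.loads(metadata_text))
print(metadata_text)
```

Agreeing on even a short record like this, and on where it lives, is exactly the kind of standard that was missing.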
Once astronomers started to exchange digital data, they worked to create standards, but technology often outgrew existing standards, requiring revisions. Funding for astronomy was simply insufficient to go back and bring all of the collected data up to the new standards.
We are not finished with the challenges astronomers face. As individual astronomy centers grew, each developed local software to store, process and display its data, using local formats. Initially, software sharing was very difficult, as software languages, compilers and hardware differed from place to place.
Even if the code was in a common language such as Fortran, when transferred to another site it would have to be modified for local formats and hardware. On new hardware, even a detail like the size of the computer word could cause many hours of rewriting. Then the program would need to be compiled and debugged, and possibly have sections added to handle new displays. Sometimes the effort was nearly as big as starting from scratch.
As a result of the diversity of environments, software sharing grew slowly until the 1990s. Today, finally, locally written software is outnumbered by shared software. That has changed because of the Internet and shared development.
NVO would be impossible without four things:
The first two items have already changed the way we live and work with instant communications and a vast web of information. The third, the web browser, was not seen as vital until people realized that you could make anything shareable worldwide if it was browser compatible.
It is the standards created by the Internet and the browser that make NVO (and many other capabilities) financially feasible. Standards mean that however complex the task, you only have to do it once for everyone to share and profit. This is why so many people worry about fragmentation or proprietary standards on the web, which could force people to use several costly tools instead of just one low cost web browser.
How important is this? If we have just one independent standard for web access, someday web browsing will be built into every TV and user display at no extra cost to the buyer. Instead of needing expensive and quirky personal computers to browse the web, you will be able to hook up a flat panel display to DSL, broadband or satellite, and surf away, all for under $200. The technology is available, but the business success depends on the existence of one standard, which is in question right now.
The existence of one standard means that NVO, and all its spinoffs, will be available at a cost the public can afford. NVO is such a big challenge that adding multiple standards could drive its price too high, or delay NVO and its benefits for years.
The NVO project was preceded by a prototype of a segment of the NVO, named the Digital Sky Project. This project created a small federation of astronomical data, from several sources:
A proposal from this successful project led to the NVO, which will build on the ideas and lessons learned in the Digital Sky.
Mastering this set of challenges will take a team of players. From the NVO announcement:
"Astronomers from 17 research institutions have announced that they're starting an ambitious new project to put the universe on line. The National Virtual Observatory (NVO), headed by astronomer Alex Szalay of Johns Hopkins University and computer scientist Paul Messina of Caltech, will unite the astronomical databases of many earthbound and orbital observatories, taking advantage of the latest computer technology and data storage and analysis techniques. Organizers characterize their goal as 'building the framework' for the National Virtual Observatory."
This framework will build on existing tools such as the Globus Toolkit.
The NVO has two early web sites where project information and other links can be accessed.
NVO will build on a base of existing hardware and software, much of it developed around SDSC. Essential to the project will be tools like SDSC's Storage Resource Broker (SRB), which provides access to large data files stored in local and remote data servers and archives.
Organizing the computer services will be the Globus Toolkit. The computers for NVO calculations will be spread across the US first, then worldwide. Globus will make the NVO project compatible with existing infrastructure, including the PACI TeraGrid, the Grid Physics Network (GriPhyN) project, and NASA's Information Power Grid, as well as other sites using Globus.
An NVO testbed will include resources at various NSF and NASA sites. An important feature of the NVO testbed will be its use of major NSF PACI computing resources, including the TeraGrid computing and archive resources.
This quote, from Bob Hanisch, NVO project manager at the Space Telescope Science Institute, makes it clear that even the early benefits of the NVO will reach beyond scientists.
"A major goal for the NVO is to provide a window on the universe for students, teachers, backyard astronomers, and the interested public. The NVO will enable the public to explore directly the wealth of information from society's investment in our national research facilities."
Education isn't the only possibility. Many people are fascinated by space and astronomy, but few have any access to large telescopes or their data. People like me who live in the Pacific Northwet have another problem - clouds cover the sky more than 60% of the time.
NVO will change access to this information, making it possible for many who can't pursue their avocation to do so directly online. One likely side effect of NVO is a wider interest in, and appreciation of, both astronomy and science.
After all, the public pays the tab - what does the public get out of this? NVO gives them education and avocations, and let's not forget the most important thing. For any curious person, looking at the stars and finding out about the universe is fun.
While it is early in the project, some things are very likely:
These are only the obvious benefits. But like the ARPA network project in 1972 and a web browser for physicists in 1990, the real benefits will be something we can't forecast, something that will become possible simply because we put all of these good ideas to work.
Now that the public understands and uses the Internet, an unplanned benefit from those two projects mentioned above, I can safely predict that the spinoffs from NVO will change our lives and our children's lives for the better.