Digital Libraries and OS/2 Updates

by Bill Nicholls 10Jun2001

Digital Libraries add organization and meta-content to the Internet. Digital Libraries are indexed and structured to make searches for relevant information much more effective than with current search engines. Doing this takes a plethora of standards and cooperation, from XML to DCMI to OAI.

Also, some updates about OS/2 software from Innotek and Mozilla.

What is a Digital Library?

Digital Libraries (DLs) are structured storage environments of digital data with a consistent format for index and content abstraction. In paper libraries, the organization is physical shelving of books based on the Dewey Decimal System or the Library of Congress Classification system. An index, containing the author and a brief abstract of the content, links identifying data to the actual book. These indexes are contained on real or virtual index cards, organized by subject, author and title of the book or paper.

Digital Libraries on a computer have a similar structure, with the potential for a much richer set of links. DLs are much more flexible than paper libraries about representing data, but that flexibility comes at a price. The cost is a new difficulty in enabling access to all of the content in a consistent manner. One part of the cost is that the people who usually build the indexes, librarians trained in classification, are not available in most places. As a result, classification cannot be done with the refinement that centuries of library development have yielded, as evidenced by the Library of Congress Classification system.

The second part of the cost is how to tell the computer what is data and what is index category. Once the author and editor create the index terms and an abstract, each part must be clearly labeled. This is the next hurdle: how do you identify the parts of the index? Which part is author, title, subject, or abstract? On a card, the parts are identified by the words "Author: Title: Subject:" followed by the correct item, and they are usually shown in different type.
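To make the labeling problem concrete, here is a minimal Python sketch of how software might read a card that uses the "Author: Title: Subject:" convention. The card text and the parse_card helper are hypothetical, invented only for illustration.

```python
# Hypothetical catalog card; the labels follow the "Author: Title: Subject:" convention.
CARD = """Author: Doe, Jane
Title: An Example Book
Subject: Library science
Abstract: A short summary of the book's content."""


def parse_card(text):
    """Split each 'Label: value' line into a dictionary entry."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep:  # skip any line that carries no label
            fields[label.strip().lower()] = value.strip()
    return fields


card = parse_card(CARD)
print(card["author"])
```

This only works because every line carries an agreed label. Without that agreement, the program has no way to tell an author from a title.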

This column will introduce you to the basic standards that support a common digital library format. I'll explain how the standards build on one another and why each one is important. Along the way, I'll point to resources where you can dig deeper and acquire tools for experimenting with or building your own personal digital library.

Foundations for a Digital Library

Without a common method of defining content, individual DLs will remain islands of structure in an unstructured Internet sea. Even worse, each DL would have to create a new set of tools for using the DL, from setup to storage, from search to display. The earliest DLs were all separate projects, but that early experience, and discussions about interoperability, led to the creation of working groups who pioneered the standards we see today.

By a fortunate coincidence, just as these discussions began, another standard was being born. XML, the eXtensible Markup Language, was in the early days of its 1.0 standard. XML clearly offered a way to define tags as well as a lot of other capabilities. See [http://www.xml.org] for a starting point. XML tags look a lot like HTML tags, with an important difference - the tags refer to content, not format.

This is a crucial difference. Format will change, depending on media and need. Content is the real meat of the document, regardless of format. Without content, format is meaningless. Just think about TV sitcoms. The implications of this difference are major.

Being able to tag content means we now have a handle on the meaning of a specific data stream. The tags attached to content are referred to as metadata, which is data about data. What these XML tags create is data about the type of the data being tagged. When we know the type of data, we or the computer can determine the possible uses or transformations of the data that are meaningful.
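As a rough illustration of content tagging, the Python sketch below parses a small XML record and picks out values by what they mean rather than how they look. The tag names and the record itself are hypothetical and do not come from any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the tag names are invented for illustration only.
RECORD = """<record>
  <author>Doe, Jane</author>
  <title>An Example Book</title>
  <subject>Library science</subject>
  <abstract>A short summary of the content.</abstract>
</record>"""

root = ET.fromstring(RECORD)

# Each tag names the kind of data it holds, so software can select by meaning.
metadata = {child.tag: child.text for child in root}
print(metadata["title"])
```

Nothing in the record says how the title should be displayed. That choice is left to whatever program uses the data.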

Instead of only being able to reformat a given set of data, we could, given the right software, use it for purposes that were not thought of when the data was created. The ability to repurpose a data set because the computer can know what the data type means opens up a wide range of possibilities, from automatic comparisons to doing complete research scans on very fine subsets of data.

First Floor for a Digital Library

XML by itself enables us to build a single DL with a common index, but it can't make information in different DLs easily accessible unless there is a standard for the index that all DLs support. With XML as a base, the DCMI (Dublin Core Metadata Initiative) was developed to define and provide standard DTDs for a minimum subset of tags to enable access to any conforming DL. Many extended tags, specific to a domain of knowledge, enable the DCMI to be enhanced in a standard way. Links to the basic references are here.
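A minimal sketch of what a shared tag set buys us, assuming the Dublin Core element names title, creator, subject and description and the element set's XML namespace. The wrapper element name and the make_dc_record helper are hypothetical, and a real DL would follow whatever record format its own tools require.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def make_dc_record(fields):
    """Build a wrapper element (hypothetical name) holding dc:* elements."""
    root = ET.Element("metadata")
    for name, value in fields.items():
        elem = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        elem.text = value
    return root


# Hypothetical values; only the element names come from Dublin Core.
record = make_dc_record({
    "title": "An Example Book",
    "creator": "Doe, Jane",
    "subject": "Library science",
    "description": "A short summary of the content.",
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because every conforming DL uses the same element names, a search tool written for one library can read the index of another without changes.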

In the tools directory, programs for building, checking and extracting metadata are available. Specific programs are available for extracting metadata from existing files in HTML and Word document formats, as well as conversion tools and metadata building templates.

Many DCMI projects are already working. Indeed, the list is worldwide, with Europe, Asia and Australia well represented. A few of these are Gateways, systems that collect resources and provide centralized access and an available repository for other resources.

Accessing the Digital Library Index

XML and DCMI enable building of DLs that are compatible at the index card level, but the common definitions of content do not define the method of access or API. This is a critical level of interoperability because even with common definitions, if each DL created a different interface, we would still have some of the Tower of Babel problems to solve.

Another standard, OAI, the Open Archives Initiative, was designed to create a standard for accessing any compatible DL index, thus making the entire world of compatible DLs available for searching. The protocol is called "The Open Archives Initiative Protocol for Metadata Harvesting." From the Introduction to this document:

"The OAI protocol described in this document permits metadata harvesting. The result is an interoperability framework with two classes of participants:
* Data Providers administer systems that support the OAI protocol as a means of exposing metadata about the content in their systems;
* Service Providers issue OAI protocol requests to the systems of data providers and use the returned metadata as a basis for building value-added services."
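To give a feel for what a Service Provider sends, here is a minimal Python sketch that builds two harvesting requests as HTTP GET URLs. The base URL and the build_request helper are hypothetical. The verb names Identify and ListRecords and the metadataPrefix value oai_dc come from the published protocol document rather than from this column, so check them against the version your data provider supports.

```python
from urllib.parse import urlencode

# Hypothetical data provider address; substitute a real repository's base URL.
BASE_URL = "http://archive.example.org/oai"


def build_request(verb, **params):
    """Return the URL for an OAI harvesting request sent as an HTTP GET query."""
    query = {"verb": verb}
    query.update(params)
    return BASE_URL + "?" + urlencode(query)


# Ask the repository to describe itself.
identify_url = build_request("Identify")
# Ask for records in the Dublin Core based format.
records_url = build_request("ListRecords", metadataPrefix="oai_dc")
print(identify_url)
print(records_url)
```

The sketch only constructs the URLs and does not contact any server. A real harvester would fetch each URL and parse the XML response that comes back.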

Exploring Digital Libraries

A working OAI-compatible DL is available for exploration at Cogprints. It covers Psychology, Neuroscience, and Linguistics, as well as areas of Computer Science, Philosophy, Biology, Medicine, and Anthropology. A non-OAI DL is available at arXiv, covering physics, math, nonlinear sciences and computer science.

Software for large digital libraries is available at [http://ePrint.org]. Work is ongoing to develop software for smaller DLs and individual use. I will be testing a restricted beta of Kepler, an OAI-compliant system for individuals and small groups, and will report on my experiences.

Next time, I'll explore the software and other requirements for setting up an OAI DL along with other updates of work in progress. Until then, here are some updates on OS/2.

New OS/2 Updates

A recent update to the 0.9 version of Mozilla is 0.9.1. Although the numbering change is small, my early use shows it is faster and more stable. One significant change which caught me is the new install method. Check the README, but be sure to install in a clean directory. Execute warpzfol.cmd to build the folder, open the folder and execute mozilla.exe. No setup or install, no reboot required as all of the profile and conversion codes are now separated in the Mozilla folder. Once I was over the surprise, I liked it. Mozilla 0.9.1 is here, about a third of the way down the page.

The Macromedia Flash plugin, which I reported as only available with Software Choice or Convenience Pack, is actually freely available from Innotek. There is a lot of useful information in the README, so don't skip this one. Registration is required and a URL will be sent for download. Note that this plugin will not coexist with the win16 version of Macromedia Flash. I'll be trying this out shortly. Thanks to Andreas Boesche for the heads up.

Innotek has also acquired the Legato Co-Standby Server for OS/2 from Legato Systems, Inc. Information is available. I wrote about Co-Standby for OS/2 in my Warptech 2000 report in June 2000. You can find it here.