
Thakar, A., Kunszt, Z., & Szalay, A. 2001, in ASP Conf. Ser., Vol. 238, Astronomical Data Analysis Software and Systems X, eds. F. R. Harnden, Jr., F. A. Primini, & H. E. Payne (San Francisco: ASP), 40

A Parallel, Distributed Archive Template for the VO

Aniruddha R. Thakar, Peter Z. Kunszt, Alexander S. Szalay
The Johns Hopkins University


In the proposed Virtual Observatory (VO), there is an urgent need for a prototype distributed archive that a) uses standard interfaces to the outside world, b) contains a parallel and scalable query agent, and c) can serve as a virtual data grid node in the VO. We propose to use the current SDSS Science Archive as a basis for developing such an archive template for the VO. This effort will involve extending the current capabilities of the Science Archive query agent as well as redesigning certain aspects of it. We describe the steps that this effort will entail.

1. Introduction

The multi-Terabyte astronomical archives that are in the process of being built today will give rise to unprecedented sky and wavelength coverage along with an incredibly rich dataset. But the enormous potential for scientific discovery that these archives promise will not be fully realized until they are interconnected in such a way that data from individual archives can be combined, compared, and mined in a seamless fashion. The creation of such a multi-wavelength digital universe-at-your-fingertips is the ambitious and far-sighted goal of the effort to build a Virtual Observatory.

The enormous size of the upcoming archives, along with the multiplicity of standards, software tools, and hardware platforms at the disposal of the scientists who are building them, creates a daunting challenge for integrating them into a virtual observatory framework. The very task of defining what a virtual observatory is and must provide has consumed several months of discussions and meetings among astronomers. In short, there has been much talk but little action on defining a VO.

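What is sorely needed is a prototype for a "VO-ready" archive that can serve as a template for what future archives should (or should not) be, and help to crystallize the concepts and priorities for the VO. We believe that a VO archive must have at least the following features: standard interfaces to the outside world, a parallel and scalable query agent, and the ability to serve as a virtual data grid node in the VO.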

In order to build such an archive template, we propose to use the Sloan Digital Sky Survey's current Science Archive as a basis and convert it into a VO archive template by making the modifications described below.

2. The SDSS Science Archive

The Sloan Digital Sky Survey (SDSS) is a multi-institution project to build a map of a large part of the northern sky in five wavelength bands (Szalay 1999). The SDSS Science Archive (abbreviated SX) is the science database that will result from the survey when it is completed (2005/2006). It is expected to be several Terabytes in size and will contain a catalog of more than 200 million objects and 1 million spectra.

The SX has a client/server architecture that features a lightweight, portable GUI client, a parallel (multi-threaded) distributed server (query agent), and a commercial object-oriented Database Management System (DBMS) - Objectivity (Thakar et al. 2000). It also includes a fast spatial indexing scheme--the Hierarchical Triangular Mesh or HTM (Kunszt et al. 2000), as well as a multi-dimensional flux index. Although the Science Archive is already considerably optimized for distributed data mining (Szalay et al. 2000), it is still not sufficiently equipped to be a VO data grid node. We aim to take the following specific steps to rectify this and create a prototype VO archive.
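As an illustration of the spatial index mentioned above, the following is a minimal sketch of how a hierarchical triangular mesh can assign a name to a sky position: the sphere is seeded with the eight faces of an octahedron, and each spherical triangle is recursively split into four at its edge midpoints. The function names, the root labels, and the child numbering are assumptions made for this sketch and are not taken from the SX code.

```python
import math

# Octahedron vertices that seed the eight root triangles.
V = [(0, 0, 1), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0), (0, 0, -1)]

# Root triangles as vertex-index triples, ordered counterclockwise when
# viewed from outside the sphere. Labels are illustrative assumptions.
ROOTS = [
    ("S0", (1, 5, 2)), ("S1", (2, 5, 3)), ("S2", (3, 5, 4)), ("S3", (4, 5, 1)),
    ("N0", (1, 0, 4)), ("N1", (4, 0, 3)), ("N2", (3, 0, 2)), ("N3", (2, 0, 1)),
]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def midpoint(a, b):
    """Midpoint of the great-circle arc from a to b, renormalized."""
    s = (a[0] + b[0], a[1] + b[1], a[2] + b[2])
    n = math.sqrt(dot(s, s))
    return (s[0] / n, s[1] / n, s[2] / n)


def contains(tri, p, eps=1e-12):
    """True if unit vector p lies inside the spherical triangle tri."""
    a, b, c = tri
    return (dot(cross(a, b), p) >= -eps
            and dot(cross(b, c), p) >= -eps
            and dot(cross(c, a), p) >= -eps)


def trixel_name(ra_deg, dec_deg, depth):
    """Name of the triangle containing (ra, dec) after `depth` subdivisions."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    p = (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))

    # Locate the root triangle.
    for name, idx in ROOTS:
        tri = tuple(V[i] for i in idx)
        if contains(tri, p):
            break
    else:
        raise ValueError("point not found in any root triangle")

    # Descend: split the current triangle into four and pick the child.
    for _ in range(depth):
        a, b, c = tri
        w0, w1, w2 = midpoint(b, c), midpoint(a, c), midpoint(a, b)
        children = [(a, w2, w1), (b, w0, w2), (c, w1, w0), (w0, w1, w2)]
        for k, child in enumerate(children):
            if contains(child, p):
                name, tri = name + str(k), child
                break
        else:
            raise ValueError("point not found in any child triangle")
    return name
```

Because each additional digit refines the previous triangle, the name at a shallow depth is a prefix of the name at any greater depth, which is what allows a region query to be expressed as ranges of names.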

3. XML Compliance

A fundamental property of a VO-compatible archive must be the standardization of its interfaces with the outside world. Toward this goal, we recognize the eXtensible Markup Language (XML) as an emerging standard for data interchange on the Internet, and seek to make our archive XML-compliant in a way that will allow data and metadata to be exchanged with any other entity that can decode XML. The attractiveness of XML is that it is flexible and self-contained, so that the information needed to read even the most complex data can be encoded using standard XML Document Type Definitions (DTDs).
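To make the idea of a self-describing exchange concrete, the sketch below packages a few catalog rows together with their column metadata and reads them back using only the Python standard library. The element names (archiveResult, metadata, column, data, object) are hypothetical and do not correspond to any DTD adopted by the archive; the standard-library parser used here also does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def encode_objects(columns, rows):
    """Serialize rows plus per-column metadata into one XML string.

    columns: list of (name, unit) pairs; rows: list of float tuples.
    """
    root = ET.Element("archiveResult")
    meta = ET.SubElement(root, "metadata")
    for name, unit in columns:
        ET.SubElement(meta, "column", name=name, unit=unit)
    data = ET.SubElement(root, "data")
    for row in rows:
        rec = ET.SubElement(data, "object")
        for (name, _unit), value in zip(columns, row):
            # repr() keeps full floating-point precision in the text form.
            ET.SubElement(rec, name).text = repr(value)
    return ET.tostring(root, encoding="unicode")


def decode_objects(xml_text):
    """Recover the rows using only the metadata carried in the document."""
    root = ET.fromstring(xml_text)
    names = [col.get("name") for col in root.find("metadata")]
    return [tuple(float(obj.find(n).text) for n in names)
            for obj in root.find("data")]
```

A receiver needs no prior knowledge of the column layout: `decode_objects(encode_objects(columns, rows))` reconstructs the rows from the metadata block alone.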

One of the biggest advantages of XML is the wealth of public-domain software and tools that is already available (and rapidly increasing) on the Internet. We list below some of those that are of particular interest to our application:

4. Virtual Data Grid Node

The creation of a data grid will be one of the primary challenges and benefits of a Virtual Observatory. The ability to efficiently generate virtual data--the complex and often voluminous data that are created on the fly from complex analyses of archival data--will be crucial for future astronomical research in cosmology and other fields. Virtual Data Grids (VDGs) will be an indispensable component of the VO. The GriPhyN (Grid Physics Network) project envisages that Petascale Virtual Data Grids (PVDGs) will be necessary in the near future to meet the virtual data needs of the age of multi-Terabyte and Petabyte digital archives. Indeed, these archives will need to be designed as VDG nodes. This essentially means that each archive must provide a scalable, parallel, and distributed framework for executing complex, compute- and I/O-intensive queries and analysis tasks as close to the archive data as possible so as to minimize network traffic. The GriPhyN proposal contains examples of queries that will become possible (and frequent) with VDGs.

Our current model, although it is parallel, distributed, and moderately scalable, is not well suited to the large-scale grid computation, involving very complex query and analysis tools, that is anticipated in a VO context. In order to make it a fully functional grid node, we need to build a massively parallel distributed framework with dynamic load-balancing, resource-scheduling, and message-passing communication between intelligent agents. We propose to reconfigure our current parallel model in order to achieve this objective, as shown in Figure 1. Our master/slave configuration will be replaced by a computational grid of intelligent query agents loosely coupled to a distributed data grid via an MPI (Message Passing Interface)-based communication toolkit like Globus. It will also be necessary to have a grid agent at the top level that will interface to the outside world and serve as a listener/scheduler for the query agents.

Figure 1: (a) Current and (b) proposed distributed query computation models. The current master/slave model will be replaced by a loosely-coupled, MPI-based massively parallel (MPP) model.
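The scheduling idea in the proposed model can be sketched in a few lines. In the toy example below, a top-level grid agent places one task per data partition on a shared queue, and each query agent pulls the next task as soon as it is free, which gives a simple form of dynamic load balancing. This is only an illustration under stated assumptions: Python threads stand in for distributed processes, a shared queue stands in for message passing, and the class and method names are hypothetical. It does not use MPI or Globus.

```python
import queue
import threading


class GridAgent:
    """Hypothetical top-level listener/scheduler for a set of query agents."""

    def __init__(self, partitions, n_agents):
        self.partitions = partitions  # partition id -> list of records
        self.n_agents = n_agents

    def run(self, predicate):
        """Evaluate predicate over all partitions; return (matches, per-agent task counts)."""
        tasks = queue.Queue()
        for part_id in self.partitions:
            tasks.put(part_id)

        results = []
        handled = [0] * self.n_agents
        lock = threading.Lock()

        def query_agent(agent_id):
            # Each agent keeps pulling work until the queue is empty, so a
            # fast agent naturally takes on more partitions than a slow one.
            while True:
                try:
                    part_id = tasks.get_nowait()
                except queue.Empty:
                    return
                matches = [rec for rec in self.partitions[part_id] if predicate(rec)]
                with lock:
                    results.extend(matches)
                    handled[agent_id] += 1

        threads = [threading.Thread(target=query_agent, args=(i,))
                   for i in range(self.n_agents)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results, handled
```

In contrast to a static master/slave split, no agent is assigned a fixed share of the data in advance; the order of the returned matches therefore depends on scheduling and should not be relied upon.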

5. Query Language Extensions

Our current query language, SXQL, is a subset of SQL (Structured Query Language) that also includes several object-oriented extensions and astronomical and mathematical macros. We plan to augment the language further so as ultimately to produce a versatile SQL-based scientific query language that includes at least the following features, several of which have already been incorporated into SXQL: the SQL SELECT-FROM-WHERE syntax, aliasing and nesting, the ability to follow links through associations, including language extensions for specifying to-many links, the ability to query on object methods, generic mathematical macro support, specific astronomical macro support, and support for spatial querying using the HTM.
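Several of these features can be illustrated with ordinary SQL. The sketch below uses SQLite from the Python standard library, not SXQL, whose syntax is not reproduced here. It shows SELECT-FROM-WHERE with an alias and a nested subquery, and registers a user-defined function as a stand-in for an astronomical macro. The table name, column names, sample values, and the ANGSEP function are all assumptions made for the example.

```python
import math
import sqlite3


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions."""
    a1, d1, a2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    cos_sep = (math.sin(d1) * math.sin(d2)
               + math.cos(d1) * math.cos(d2) * math.cos(a1 - a2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_sep))))


def run_cone_query():
    """Objects within 1 degree of (180, 0) that are brighter than average in r."""
    conn = sqlite3.connect(":memory:")
    # Expose the Python function to SQL as a four-argument "macro".
    conn.create_function("ANGSEP", 4, angular_separation)
    conn.execute(
        "CREATE TABLE photoobj "
        "(objid INTEGER, ra_deg REAL, dec_deg REAL, g REAL, r REAL)"
    )
    conn.executemany(
        "INSERT INTO photoobj VALUES (?, ?, ?, ?, ?)",
        [
            (1, 180.2, 0.10, 18.5, 17.9),
            (2, 180.5, -0.30, 20.1, 19.8),
            (3, 185.0, 2.00, 17.0, 16.5),
            (4, 179.9, 0.05, 21.0, 20.6),
        ],
    )
    rows = conn.execute(
        """
        SELECT p.objid, p.g - p.r AS color
        FROM photoobj AS p
        WHERE ANGSEP(p.ra_deg, p.dec_deg, 180.0, 0.0) < 1.0
          AND p.r < (SELECT AVG(r) FROM photoobj)
        ORDER BY p.objid
        """
    ).fetchall()
    conn.close()
    return rows
```

A function-based cone search like this one scans every row; the point of an index such as the HTM is to restrict the candidates before any per-object function is evaluated.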

6. Scientific Output

The output from VO queries will often be very complex and voluminous, and will need to be packaged in a self-describing, lightweight format. Again, XML provides the answer. The eXtensible Scientific Interchange Language (XSIL) is an XML DTD for scientific output that is extensible to any discipline (Williams 2000). It contains an extensible object model with a Java API and comes bundled with the Xlook browser. We hope to use XSIL to obtain a flexible, general object transport protocol, a portable ASCII and binary output format, and an ultra-light data format that includes support for binary streams.
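The notion of a binary stream carried inside an XML document can be sketched as follows: numeric values are packed as fixed-width binary, encoded as base64 text, and tagged with enough attributes for a reader to unpack them. The element and attribute names are hypothetical and are not taken from the XSIL DTD; the sketch shows only the general technique.

```python
import base64
import struct
import xml.etree.ElementTree as ET


def pack_stream(name, values):
    """Encode a list of floats as a base64 binary stream inside one XML element."""
    raw = struct.pack(f">{len(values)}d", *values)  # big-endian float64
    elem = ET.Element(
        "stream",
        name=name,
        type="float64",
        byteorder="big",
        encoding="base64",
        count=str(len(values)),
    )
    elem.text = base64.b64encode(raw).decode("ascii")
    return ET.tostring(elem, encoding="unicode")


def unpack_stream(xml_text):
    """Decode the element produced by pack_stream back into (name, values)."""
    elem = ET.fromstring(xml_text)
    count = int(elem.get("count"))
    raw = base64.b64decode(elem.text or "")  # empty stream has no text node
    return elem.get("name"), list(struct.unpack(f">{count}d", raw))
```

Unlike a decimal ASCII rendering, the binary form round-trips floating-point values exactly, at the cost of the roughly one-third size overhead of base64.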

7. Concluding Remarks

Although there are significant challenges facing the creation of a Virtual Observatory, the lack of the necessary technology is not one of them. The time is ripe for the implementation of the VO archive template described above. Such a template is sorely needed, and the existing state of software and hardware technology--in terms of storage capacity, network bandwidth, CPU speed, and software standards and technology--makes it achievable.

Within the next few years, the demand for virtual data will see a sharp rise, and the computational power afforded by virtual data grids will be indispensable for scientific research in large-scale structure and other fields within astronomy. Archives like this one will be poised to meet those challenges.


Kunszt, P. Z., Szalay, A. S., & Thakar, A. R. 2000, in ASP Conf. Ser., Vol. 216, Astronomical Data Analysis Software and Systems IX, eds. N. Manset, C. Veillet, & D. Crabtree (San Francisco: ASP), 40

Szalay, A. S. 1999, Comp. in Sci. & Eng., Mar/Apr 1999, 54

Szalay, A. S., Kunszt, P. Z., Thakar, A., Gray, J., Slutz, D., & Brunner, R. J. 2000, Proc. 2000 ACM SIGMOD on Management of Data, 451

Thakar, A. R., Kunszt, P. Z., & Szalay, A. S. 2000, in ASP Conf. Ser., Vol. 216, Astronomical Data Analysis Software and Systems IX, eds. N. Manset, C. Veillet, & D. Crabtree (San Francisco: ASP), 231

Williams, R. 2000,

© Copyright 2001 Astronomical Society of the Pacific, 390 Ashton Avenue, San Francisco, California 94112, USA