Astronomy faces a data avalanche. Breakthroughs in telescope, detector, and computer technology allow astronomical surveys to produce terabytes of images and catalogs. These datasets will cover the sky in different wavebands, from gamma rays and X-rays through optical and infrared to radio. In a few years it will be easier to ``dial up'' a part of the sky than to wait many months for access to a telescope. With the advent of inexpensive storage technologies and the availability of high-speed networks, the concept of multi-terabyte on-line databases interoperating seamlessly is no longer outlandish. More and more catalogs will be interlinked, query engines will become more and more sophisticated, and the research results from on-line data will be just as rich as those from ``real'' telescopes. The planned Large Synoptic Survey Telescope will produce over 10 petabytes/year by 2008! These technological developments will fundamentally change the way astronomy is done, and the changes will have dramatic effects on the sociology of astronomy.
On-line astronomy demands new IT approaches that will yield tools and methodologies for data access, analysis, and discovery that are scalable to this regime. New needs lead to opportunities in IT research for data mining, for sophisticated pattern recognition, for large-scale statistical cross-correlations, and for the discovery of rare objects and sudden temporal variations. With a billion objects, statistical algorithms requiring steps would take billions of processor-years; even algorithms will take a long time, creating challenges in their own right! Moreover, there is a growing awareness, both in the US and abroad, that the acquisition, organization, analysis, and dissemination of scientific data are essential to the continued robust growth of science and technology. These factors demand efficient and effective synthesis of these capabilities both for astronomy and for the broader scientific community.
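The scaling argument can be made concrete with a rough back-of-the-envelope sketch. The processor speed (10^9 steps per second on a single hypothetical processor) and the three complexity classes below are illustrative assumptions chosen for this example, not figures taken from the text.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600
STEPS_PER_SECOND = 1e9  # assumption: one hypothetical processor, 10^9 steps/s


def processor_years(steps: float) -> float:
    """Convert a step count into years of work for one assumed processor."""
    return steps / STEPS_PER_SECOND / SECONDS_PER_YEAR


n = 1e9  # a billion catalog objects

# Illustrative complexity classes (assumed for this sketch).
costs = {
    "N log2 N": n * math.log2(n),
    "N^2": n ** 2,
    "N^3": n ** 3,
}

for name, steps in costs.items():
    print(f"{name:>9}: {processor_years(steps):.3g} processor-years")
```

Under these assumptions a quadratic algorithm already needs roughly three decades on one processor, and a cubic one tens of billions of processor-years, which is why both parallelism and better algorithms are needed at this scale.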
Recognizing these trends and opportunities, the National Academy of Sciences Astronomy and Astrophysics Survey Committee, in its decadal survey, recommends as a first priority the establishment of a National Virtual Observatory. The NVO would be a ``Rosetta Stone,'' linking the archival data sets of space- and ground-based observatories, the catalogs of multi-wavelength surveys, and the computational resources necessary to support comparison and cross-correlation among these resources. The NVO will benefit the entire astronomical community. It will democratize astronomical research: the same data and tools will be available to students and researchers, irrespective of geographical location or institutional affiliation. The NVO will also have far-reaching education potential. Astronomy occupies a very special place in the public eye: new discoveries fascinate both the large number of amateur astronomers and the general public. The NVO will be an enormous asset for teaching astronomy, information technology, and the method of scientific discovery. Outreach and education will be key elements: the NVO will deliver rich content via the Internet to a wide range of educational projects from K-12 through college and to the public.
This paper gives a brief and necessarily incomplete overview of the NVO's scientific opportunities and Information Technology challenges. It presents an implementation strategy and management plan to create an initial federation, and a foundation for further tools and applications. The NVO will challenge the astronomical community with new opportunities for scientific discovery, and it will challenge the information technology community with a visionary but achievable goal of distributed access and analysis of voluminous data collections. The scope of the effort is international.
By its very nature the NVO brings together groups with different talents. We recognize that the ability to fully exploit the federation of all major astronomy archives as an integrated resource requires cooperation between data providers from space missions, ground-based telescopes, and special surveys, from leading astronomical institutions throughout the country. The currently participating groups have committed to federate more than 100 Terabytes of astronomical data, consisting of over 50 collections in the NVO.
In order to be an engine of discovery for astronomy, and enable qualitatively new advances, the NVO must be driven by science goals. We think of the NVO as a genuine observatory that astronomers will use from their desks. It must supply digital archives, metadata management tools, data discovery, access services, programming interfaces, and computational services. Astronomers may develop their own custom programs to answer specific questions, sifting through the vast digital sky to identify rare objects, compare data with numerical models, and make discoveries through advanced visualizations and special statistical analyses. In addition, students, teachers, under-represented groups, and the general public have equal access to this cutting-edge resource from tailored portals. Before discussing the IT challenges, we first select a few cutting edge astronomy topics and discuss the NVO's scientific impact on them:
Comparing the Local and the Distant Universe: Combining IR and optical observations has opened a new window on the distant universe. Having a broad range of colors for distant galaxies enables us to estimate not only photometric redshifts, but also spectral type, and study detailed star formation history. The NVO will let us create rest-frame selected samples from combined UV, optical, and IR datasets. Comparing local and distant samples, we can study the evolution of physical properties, such as the infrared-radio correlation in star-forming galaxies. Statistical analyses of samples defined via multi-wavelength queries in the NVO will measure spatial clustering patterns as a function of redshift, revealing density fluctuations in the early universe and constraining values of the fundamental cosmological parameters. Massive cluster surveys will enable us to trace the complex evolution of large-scale structure, from its origins in the cosmic microwave background to the amazing diversity we see today; MAP and PLANCK promise views of the infant universe at high resolution. X-ray missions (Chandra, XMM, ROSAT), and deep ground based observations show structure to , and we will later probe to with NGST. SIRTF and other IR surveys will fill in the gap at intermediate redshifts. Distortions in optical galaxies around these clusters provide direct measures of dark matter.
Digital Milky Way: We know surprisingly little about the origin and evolution of our own galaxy. Federating all existing information on the Milky Way will enable systematic mining of multi-wavelength catalogs and surveys, both existing and yet to be created. The Milky Way galaxy is a complex entity consisting of multiple stellar components (bulge, disk, halo), each with its own mass function, metallicity, age, and kinematics. The interstellar medium is equally complex. Current surveys covering 100 square degrees already reveal halo kinematic substructures in the form of star streams and accretion remnants. Current surveys capture all halo giants out to 100 kpc and all but the faintest dwarfs to 15 kpc; deeper surveys will allow for the first time a definitive study on the origins of the Galactic halo and thick disk. An important challenge, for example, is to understand the role of fast encounters with other galaxies (typically smaller than the Milky Way) in the evolution of our Galaxy, which can create new stellar components, impact the evolution of existing components, and can directly induce star formation. The planned Large Synoptic Survey Telescope will provide high precision proper motions from co-added images that will help address these crucial problems.
Rare and Exotic Objects: Large surveys detect significant numbers of outliers in the statistical distributions of derived parameters. These anomalies are often not immediately obvious from a single observation, and they are often not followed up scientifically, since the discoverers may not be able to devise a compelling model for the phenomena using the limited data at their disposal. The NVO will enable astronomers to find objects that can only be identified by being statistically unusual when multiple-wavelength catalogs are compared. Early multicolor surveys (SDSS, 2MASS, DPOSS2) have led to the discovery of distant quasars and brown dwarfs. Far larger volumes of parameter space, in which new classes of objects may be found, will remain unexplored until the NVO is created. The search for rare objects in the temporal domain could yield some of the most exciting new results from the NVO: rapid identification of transient objects by comparing new observations with prior epochs in real time may reveal distant supernovae and gravitational microlensing.
Census of Active Galactic Nuclei (AGN): There has been a major shift in our understanding of the role of supermassive black holes in galactic formation and evolution. These black holes have masses in the range of - M and we now believe that most galaxies harbor such black holes at their centers. AGN are ``beacons'' which signal the presence of a black hole in a galaxy, shining by converting the energy from accreting matter into radiation. Much of the radiative output of the early universe may have been emitted by these accreting supermassive black holes. However, despite decades of effort, a census of AGN and their place in the galactic evolutionary scheme still eludes us. The chief problem is dust obscuration: hidden AGN may outnumber their unobscured counterparts by an order of magnitude. Glimpses of this population are now being seen in X-rays. The NVO will include the deepest X-ray data ever obtained (Chandra); the most extensive redshift catalogs (SDSS); the highest resolution optical observations (Hubble); the largest infrared archives (2MASS, then SIRTF); the high-resolution spectra from the VLT and Gemini; radio data from the Very Large Array; and a vast panoply of planned future surveys at various wavelengths. It will enable a panchromatic census of AGNs, allow us to probe the connection between galaxies and supermassive black holes, and reveal the cosmic history of energy production from both nucleosynthesis (star formation activity) and accretion (AGNs).
Search for Extra-Solar Planets: The search for extra-Solar planets is a major goal of twenty-first century astronomy; it carries with it the scientist's hunger for new understanding, the philosopher's inquiry into the meaning of life, and the public's desire to know the answer to a simple question: ``are we alone?'' The spectacular recent progress with the radial velocity technique has already overturned the standard models for planet formation, but also revealed its frustrating limitations. Other techniques are being pursued: the planet-transit technique readily scales to surveys around very large numbers of stars, and would benefit enormously from the federation of data from multiple future surveys. In particular, the data taken by the Large Synoptic Survey Telescope, coupled with infrared surveys (2MASS and its successors) and astrometric surveys (FAME), will enable a survey for planetary transits around billions of stars in the Milky Way and in several nearby galaxies.
Theoretical Astrophysics: The NVO will make possible, for the first time, truly significant interactions between large datasets and the equally large-scale theoretical simulations of astrophysical systems that are just now becoming available. In a few years, use of massively parallel terascale computing systems will allow: a) the calculation of the orbits and evolution of every star in a globular cluster, including stellar collisions, the formation of tight binary systems, core evolution, and the effect of all these on the stellar evolutionary tracks; b) the details of galaxy encounters and mergers, including both the fate of the stars and the interstellar medium; and c) the evolution of the large scale structure of the Universe, including the formation of galaxies, clusters of galaxies, and clusters of clusters. All of these and similar theoretical calculations will produce datasets comparable in size with those of the large scale observational surveys, and it will be possible to mine these datasets just like observational data. Definitive comparisons will be made between complex theoretical calculations and observational datasets large enough to be statistically significant in all parameters. These studies, which will be carried out within the framework of the NVO, will lead to solutions of some of the most outstanding and significant astrophysical problems of our time.
The primary focus of the NVO is data federation, fusion, and exploration. In order to achieve these goals, the NVO will break new ground as a large-scale prototype of a semantic web (Berners-Lee, Hendler, & Lassila 2001). The NVO will federate a large number of heterogeneous data sets distributed around the world. Some of these data sets are small, some are tens of terabytes in size, some are under database management, and others are not. The data include catalogs of objects with attributes, image data of varying resolution and wavelength, spectral data, temporal data, and ancillary reference literature. An ambitious goal of the NVO is to federate the data and information of an entire scientific discipline. Computational grid technologies will provide access to distributed computing resources that enable the creation of the terascale analysis pipelines that will be required for some investigations.
The NVO will require a close collaboration between computer scientists and application scientists, and the experience gained from development of the NVO can be expected to significantly impact Information Technology research in the future. Astronomy provides an ideal environment for developing a large-scale data grid prototype: the community of astronomers is moderate in size (a few thousand); the data sets are heterogeneous, but not exceedingly so, providing interesting, tractable challenges for metadata standards and protocols; the data repositories are widely distributed, but typically already electronically accessible; security is not usually emphasized in astronomy; and finally, astronomy is of widespread interest to the public, providing an interesting proving ground for a knowledge network that engages many levels of society.
A long-standing problem in astronomy and other sciences has been how to publish data in addition to a scientific paper. The advent of the Web has shown us how easy it is to ``publish'' a web page, and we intend to make data publishing just as easy for astronomers, yet in a semantically-rich fashion that allows readers to find it, read it, assess its provenance, and compute with it--in other words, to make full use of it.
The NVO must recognize the three different roles of author, publisher/curator, and reader. Data authors will work with publishers to ``publish'' their data, i.e., to provide full digital access to information which has traditionally only been available in graphs or tables in printed papers. When this process is complete, the data moves to the archive along with the metadata and documentation. In addition, a ``standard'' form of the data will be generated: we need to translate measurements into standard units wherever possible, translate data into the standard representations supported by the archive (data models), and perhaps define a few new measurements and representations. Publishing astronomical data must become far less difficult than it is today. It is the task of the NVO to develop the tools, templates, and standards that make it easy to document and publish data, and to make it cost-effective for the archivists to manage and curate published data. In the NVO era an astronomer wishing to publish data will be able to characterize it through an NVO ``publication portal'' which captures metadata describing the content in NVO compliant terms, identifies the access mechanisms, and registers the resource with the NVO. In this process the NVO publishing standards do not obscure the raw data; users will always be able to ``drill down'' to data in the original format. The standardization is used only for locating and federating disparate data sources.
Meeting the NVO's unique IT challenges will both enable new science and advance our IT technologies into a petascale data grid, soon to be a frontier for US business as well as science. Knowledge extraction from billion-object catalogs requires new indexing and summarization techniques. Petascale pixel image analysis from multiple distributed archives will require integration of digital library and grid middleware. As new classifications are discovered, understood, and archived, the NVO catalog will have to evolve. A data management system will provide a uniform data access layer for data pipelining, archiving, and retrieval of terabytes of distributed astronomical images. An information management system will support inserting, querying, and evolving billions of objects, each with thousands of attributes. A knowledge support system will provide software tools for correlation, visualization, and statistical comparisons of both cataloged data and original image pixel data.
The NVO architecture will be based on middleware that integrates federated, distributed, autonomous archives. It will connect users to analysis services and data services. The analysis-oriented service will support massive data analysis of catalog and pixel image data. The key functional requirements for the middleware are:
Data federation and fusion is a prime focus of NVO. Combining existing datasets can create new knowledge; knowledge that does not require a telescope or a rocket launch. A prerequisite for data federation is interoperable metadata standards, but for large data, there are additional requirements. Caching and replication services can save the results of complex joins of multiple databases for later reuse. An efficient proximity join of a billion sources requires that the data are clustered so that nearby objects in the sky are nearby in the stream: we plan to use the HTM indexing developed at Johns Hopkins to achieve this (Kunszt, Szalay, & Thakar 2001).
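The idea behind a clustered proximity join can be illustrated with a minimal sketch. It does not implement HTM; it substitutes a much simpler declination-zone bucketing, so that each source is compared only with sources in its own and the two adjacent zones. The function names, the `(ra, dec)` tuple layout, and the choice of zone height equal to the match radius are all assumptions made for this example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Source = Tuple[float, float]  # (ra_deg, dec_deg), an assumed layout


def angular_separation_deg(a: Source, b: Source) -> float:
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def zone_of(dec_deg: float, zone_height_deg: float) -> int:
    """Index of the declination strip containing dec_deg."""
    return int(math.floor((dec_deg + 90.0) / zone_height_deg))


def proximity_join(cat_a: Sequence[Source], cat_b: Sequence[Source],
                   radius_deg: float) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with cat_a[i] within radius_deg of cat_b[j]."""
    # Bucket catalog B by zone; zone height equals the match radius, so any
    # true match lies in the same zone or one of the two neighbouring zones.
    zones: Dict[int, List[int]] = defaultdict(list)
    for j, src in enumerate(cat_b):
        zones[zone_of(src[1], radius_deg)].append(j)

    matches: List[Tuple[int, int]] = []
    for i, src in enumerate(cat_a):
        z = zone_of(src[1], radius_deg)
        for neighbour in (z - 1, z, z + 1):
            for j in zones.get(neighbour, ()):
                if angular_separation_deg(src, cat_b[j]) <= radius_deg:
                    matches.append((i, j))
    return matches


# Usage: match two tiny catalogs within 3 arcseconds.
optical = [(10.0, 20.0), (359.9999, 0.0)]
infrared = [(10.0005, 20.0005), (100.0, 20.0), (0.0001, 0.0)]
print(sorted(proximity_join(optical, infrared, 3.0 / 3600.0)))
```

A production index would also partition in right ascension, or use a hierarchical sky tessellation as the text proposes, so that candidates within a zone are narrowed further; this sketch only shows why co-locating nearby sources avoids comparing every pair.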
Users have the most control and interactivity with their desktop workstations, so small datasets will probably be brought to the desktop for visualization and for experimentation. However, the initial stages of the pipeline may involve huge data volumes distributed over a wide area. Algorithms must be moved to the data. Thus agent-code portability is as important as the portability of the data format. While many operations can be controlled by menus and numerical parameters, sophisticated users will want to write compiled code that can execute near the data.
The NVO framework will work toward widely accepted astronomical metadata standards and protocols. These must be extensible into the far future. XML will be our fabric for structured information, including interoperation between Astronomical XML, Astronomical Markup Language, Astrores, Extensible Scientific Interchange Language, and other astronomical data representations. FITS is a standard for structured data that predates XML, and a first milestone of this proposal is a software toolbox for FITS/XML interoperation.
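As a rough illustration of what FITS/XML interoperation involves, the sketch below converts FITS-style header cards (`KEYWORD = value / comment`) into an XML tree using the Python standard library. The element and attribute names (`header`, `card`, `keyword`, `comment`) are hypothetical and do not correspond to any of the XML dialects named above, and the parser is deliberately simplified: it ignores escaped quotes inside strings and continuation cards.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def parse_card(card: str) -> Tuple[str, str, str]:
    """Split one header card into (keyword, value, comment)."""
    keyword = card[:8].strip()
    if card[8:10] != "= ":
        # Commentary card (e.g. COMMENT, HISTORY): no value field.
        return keyword, "", card[8:].strip()
    rest = card[10:].strip()
    if rest.startswith("'"):
        # Quoted string value; a '/' inside the quotes is part of the value.
        end = rest.index("'", 1)
        value = rest[1:end].rstrip()
        comment = rest[end + 1:].partition("/")[2].strip()
    else:
        value, _, comment = rest.partition("/")
        value, comment = value.strip(), comment.strip()
    return keyword, value, comment


def header_to_xml(cards: Iterable[str]) -> ET.Element:
    """Build a <header> element with one <card> child per header card."""
    root = ET.Element("header")  # hypothetical element name
    for card in cards:
        keyword, value, comment = parse_card(card)
        element = ET.SubElement(root, "card", keyword=keyword)
        element.text = value
        if comment:
            element.set("comment", comment)
    return root


# Usage: three example cards, one with a '/' inside a quoted string.
cards = [
    "SIMPLE  =                    T / conforms to FITS standard",
    "NAXIS   =                    2",
    "OBJECT  = 'M31 / core'         / target name",
]
print(ET.tostring(header_to_xml(cards), encoding="unicode"))
```

Keeping the keyword, value, and comment as separate fields means the original card can be regenerated from the XML, which matters if users are to ``drill down'' to data in its original format.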
The NVO will help extend the ``profiles'' defined by NASA's Space Science Data System, and its prototype implementation in the Astrobrowse system. In addition to the interfaces among web services, we must extend the metadata semantics so that programs can be written against a ``metadata API.'' Other, non-astronomical objects must also be described for a successful NVO, and we hope to borrow from other projects and from the commercial world for adequate semantic and syntactic descriptions.
Scalability is a major challenge: The NVO must handle billions of objects in high-dimensional petascale astronomical catalogs. It must also scale to an international federation of hundreds of institutions: some huge and some tiny. The computational grid will help by providing massive online storage, parallel computing, resource management, and high-speed network access, but the NVO must solve the problems of data organization, data access, data analysis, and data visualization. We will attack these problems by a combination of cunning and brute force: where possible the NVO will build sophisticated indices, pre-compute popular aggregates, and cluster the data for efficient access. But, in the end, the curse of dimensionality, or the ad hoc nature of some queries will force bulk data scans. In these cases, the NVO will use brute force: providing very high speed sequential access to the data, and compact replicas of the data (bitmaps or tag objects) that minimize the amount of data that must be scanned to answer a query. Of course, all these operations will be done in parallel using both parallel processing and parallel IO. This parallelism should come from the data management tools, but if required, the NVO will implement these mechanisms.
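The ``tag object'' idea can be sketched in a few lines: a compact replica holding only the most frequently queried attributes is scanned first, and the full records are fetched only for the objects that pass. The attribute names (`ra`, `dec`, `mag`), the dictionary-based catalog, and the function names are assumptions made for this illustration, not a description of an actual NVO data model.

```python
from typing import Dict, List, NamedTuple


class Tag(NamedTuple):
    """Compact replica of one catalog object (assumed attribute subset)."""
    obj_id: int
    ra: float
    dec: float
    mag: float


def build_tags(catalog: Dict[int, dict]) -> List[Tag]:
    """Extract the small, frequently queried subset of each full record."""
    return [Tag(obj_id, rec["ra"], rec["dec"], rec["mag"])
            for obj_id, rec in catalog.items()]


def scan_tags(tags: List[Tag], max_mag: float,
              dec_min: float, dec_max: float) -> List[int]:
    """Sequentially scan the tags and return ids of objects passing the cuts."""
    return [t.obj_id for t in tags
            if t.mag <= max_mag and dec_min <= t.dec <= dec_max]


# Usage: full records carry many attributes; the scan touches only the tags.
catalog = {
    1: {"ra": 10.1, "dec": -5.0, "mag": 17.2, "spectrum": "...", "shape": "..."},
    2: {"ra": 11.4, "dec": 12.3, "mag": 21.9, "spectrum": "...", "shape": "..."},
    3: {"ra": 12.0, "dec": 14.8, "mag": 18.5, "spectrum": "...", "shape": "..."},
}
tags = build_tags(catalog)
bright_northern = scan_tags(tags, max_mag=20.0, dec_min=0.0, dec_max=90.0)
full_records = [catalog[obj_id] for obj_id in bright_northern]
print(bright_northern)
```

Because each tag is far smaller than the full record, a brute-force sequential scan reads much less data, and the scan itself is trivially split across processors and disks.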
In the end, the system will be judged by how quickly users can pose questions, and understand the answers. This means that the system will be responsible for translating high-level, non-procedural queries into efficient execution plans, executing the plans so as to minimize data movement, and then delivering the results to the visualization tool as quickly as possible, perhaps allowing the user to steer the computation as it progresses.
The diverse user communities will access the NVO through networks of varying speeds; the tools and human interfaces must be usable for both low and high-speed connections. Some analyses will require very-large-memory machines and computing speeds, while others will fit well on Beowulf class systems. The computational grid and the NVO federation will include and support both computation styles.
NVO will serve a large user community. The number of astronomers worldwide is only a few thousand; however, the education and public access functions will have user communities numbered in the millions. While each such user will impose a modest load, the aggregate will be substantial. The NVO framework must be designed to handle large numbers of small users as well as a few very large users.
Things are changing in the way that data analysis takes place in astronomy: intensive use of algorithms on data is replacing the personal astronomer/data relationship of the previous generation. The NVO will accelerate this process, so that thousands of astronomers can benefit from the data without drowning in it. There will be many astronomers wishing to analyze the data in many different ways. We will provide tools and algorithms that support statistical and data mining queries needed by astrophysicists, and interface these to the framework. These include spatial access methods such as kd-trees, R-trees, metric trees and newer generations thereof, and also condensed representations such as sparse datacubes and binned grids. These structures support queries regarding n-point correlations and non-parametric density estimates (Connolly et al. 2001). We will provide example implementations that will serve as well-documented tutorials on interfacing one's own analysis algorithm with the NVO and will also provide directly useful tools. Examples of such analyses involve the identification of rare objects, or automated shape finders searching for atypical objects, like the gravitationally lensed arcs.
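To show how a spatial access method supports correlation statistics, here is a minimal sketch of a kd-tree used to count pairs of points closer than a given distance, the basic ingredient of a two-point correlation estimate. It is a plain textbook kd-tree in pure Python working in flat Euclidean coordinates, not the implementation of Connolly et al.; all names are chosen for this example.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


class KDNode:
    """One node of a kd-tree: a point, its split axis, and two subtrees."""

    def __init__(self, point: Point, axis: int,
                 left: Optional["KDNode"], right: Optional["KDNode"]) -> None:
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build_kdtree(points: Sequence[Point], depth: int = 0) -> Optional[KDNode]:
    """Recursively split the points at the median along alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(ordered[mid], axis,
                  build_kdtree(ordered[:mid], depth + 1),
                  build_kdtree(ordered[mid + 1:], depth + 1))


def count_within(node: Optional[KDNode], target: Point, radius: float) -> int:
    """Count tree points within radius of target, pruning distant subtrees."""
    if node is None:
        return 0
    dist2 = sum((a - b) ** 2 for a, b in zip(node.point, target))
    count = 1 if dist2 <= radius * radius else 0
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    count += count_within(near, target, radius)
    if abs(diff) <= radius:  # the far side can only matter if the ball crosses the split
        count += count_within(far, target, radius)
    return count


def pair_count(points: List[Point], radius: float) -> int:
    """Number of distinct pairs of points separated by at most radius."""
    tree = build_kdtree(points)
    total = sum(count_within(tree, p, radius) for p in points)
    return (total - len(points)) // 2  # drop self-matches, halve double counting


# Usage: pair counts for 500 random points in the unit square.
rng = random.Random(0)
sample = [(rng.random(), rng.random()) for _ in range(500)]
print(pair_count(sample, 0.05))
```

Comparing such pair counts for the data against those for a random catalog with the same geometry gives the two-point correlation function; the same pruning idea extends to higher-order n-point statistics.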
We plan to build the NVO framework both through coordinating diverse efforts already in existence and providing a focus for the development of capabilities that do not yet exist. The NVO we envisage will act as an enabling and coordinating entity to foster the development of further tools, protocols, and collaborations necessary to realize the full scientific potential of large astronomical datasets in the coming decade. The components of the NVO include not only the archives, but also metadata standards, a standardized data access layer, query and computing services, and data mining applications. This involves significant technical and managerial challenges. No single group can build the NVO: the effort must include the existing data archiving efforts in astronomy even as it develops new capabilities and structures. The NVO must be able to change and respond to the rapidly evolving world of information technology. In spite of its underlying complex software, the NVO should be no harder to use for the average astronomer than today's brick-and-mortar observatories and telescopes. Development of these capabilities will require close interaction and collaboration with the information technology community and other disciplines facing similar challenges. We need to ensure that the tools we need exist or are built, and that we do not duplicate efforts but rather rely on the relevant experience of others.
The new capabilities of the NVO will be essential to realize the full value of the tera/petabyte datasets that are in hand or soon to be created. Rapid querying of multiple large-scale catalogs, establishment of statistical correlations, discovery of new data patterns and temporal variations, and confrontation with sophisticated numerical simulations are all avenues for new science that will be made possible through the NVO. Future surveys will have well defined templates available that will enable them to publish their data much more easily. The NVO, through its rich content and special portals designed for students, teachers and the public, will have major impact on a wide range of science education and public outreach projects. Computer science will be able to capitalize on the public appeal of astronomy.
I would like to acknowledge the enormous contributions from my friends and colleagues of the NVO Collaboration. It has been an extremely stimulating and enjoyable process working with such a knowledgeable and highly motivated group, and getting a step closer to our ambitious common goals.
Berners-Lee, T., Hendler, J., & Lassila, O. 2001, Scientific American, 284, 5 (May 2001), 34
Connolly, A. J., Genovese, C., Moore, A. W., Nichol, R. C., Schneider, J., & Wasserman, L. 2000, AJ, in press
Kunszt, P. Z., Szalay, A. S., & Thakar, A. R. 2001, Proc. Mining the Sky, A. Banday, ed., Kluwer, 2001, in press