The main goal has been to build a ``friendly'' user interface so that users located remotely can either submit their own data ``tables'' (as ASCII files) for comparison or extract information from either ESO or CDS to perform cross correlations in all the parameter space provided by the data catalogs -not restricting the correlations to positional ones. We have decided to use HTTP as the exchange protocol between the Data Mining Facilities (DMF) located at CDS and the ESO servers, as well as between the user and the DMF.
The objectives of this joint project between ESO and CDS are to build tools which will allow astronomers to access a large volume of information in the form of electronic data with the purpose of cross correlation. Remotely located users can either submit their own data ``tables'' (as ASCII files) for comparison or extract information from either ESO or CDS to cross correlate by position in the sky or by any of the parameters provided by the data catalogs. An important point has been the development of knowledge structures with the purpose of facilitating the description of the data to provide highly flexible data-mining options.
This paper outlines the main features users will find at the DMF and comments on features which have allowed us to optimize the consulting time. In addition we discuss the mechanisms used to organize the information in order to accept complex queries regarding the selection of the catalogs to use for comparison based on the nature of the objects contained, the ``quantities'' stored, wavelength coverage, sky region, etc.
CDS hosts, under the VizieR system, around 2000 catalogs with more than 5000 tables, all of them accessible, described and stored using a very uniform system (Ochsenbein 1998; Ochsenbein et al. 1999). VizieR already allows queries by position and other criteria to individual catalogs.
The catalog content consists of: compilation catalogs, surveys, observing logs (from space missions), and tables from journal articles, with both observational and modeled data.
There are other facilities at CDS which could be used as a complement to data-mining: bibliographic references, name resolver (via Simbad), web resource locator (GLU), etc.
Knowledge-detection structures were developed to complement the meta information in the system. These structures were based on column content and also on astronomical object type.
Two knowledge detection structures were developed: one for astronomical object types, and the other for column content. The structure for object type resembles the structure used in SIMBAD, with a four level hierarchy; the source to assign object types to the catalogs and tables is the standardized description file (ReadMe file) developed at CDS and shared now by other data centers and journals.
The structure related to column content was fully developed for this project. It contains 35 main categories and has a four level hierarchy. Categories such as Photometry, Positions, Spectroscopy, Time and Physical Quantities are amongst the most populated.
A Unified Content Descriptor (UCD) is then assigned to each of the columns in each of the tables accessed with Vizier. We developed an automatic UCD assignation procedure based on column name, column units, and column description.
Because of the importance of the use of UCD's for datamining purposes, we developed tools to assign UCD's to user provided files.
The existence of knowledge structures implies that it is possible, on one hand, to retrieve the names of all catalogs/tables containing any given set of UCD's, and on the other hand, to perform cross correlations with catalogs/tables containing the same UCD's as the ones describing the content of a user file.
As an example, the following table shows the assignment of UCD's to a few ``typical'' quantities (or columns) one could use for data-mining:
Column: | R.A. | DEC | Name | Vmag | z |
Units: | h:m:s | d:m:s | mag | ||
UCD's: | POS_EQ_RA | POS_EQ_DEC | ID_MAIN | PHOT_JHN_V | REDSHIFT |
Figure 1 shows the options provided by the cross-correlation interface. One way to envision datamining is a process in which users want to perform cross correlation operations using the data stored in a Data Mine as the comparison side, against their own data or data extracted from the same or other Data Mine.
On the Users's side (reference side) there are several scenarios, such as:
On the Data Mining Facility side (comparison side) CDS offers:
Some of the important features developed for this project include the following:
During the development stage a number of practical applications were used as tests to measure flexibility and performance of the prototypes, some of the most relevant ones are listed here.
Other applications have been explored and will be part of the manual for the facilities or be part of the on-line examples available to users.
We are grateful to Pascal Dubois and Françoise Genova for valuable comments about the interface and the features one expects to find tools like the one presented in this paper.
Ochsenbein, F. 1998, in ASP Conf. Ser., Vol. 145, Astronomical Data Analysis Software and Systems VII, ed. R. Albrecht, R. N. Hook, & H. A. Bushouse (San Francisco: ASP), 387
Ochsenbein, F., Fernique, P., Ortiz, P., Egret, D., & Genova, F. 1999, in Future Generation Computer Systems (Amsterdam: North-Holland)
Veron-Cetty, M. P. & Veron, P. 1998, Quasars and Active Galactic Nuclei (8th ed.) (ESO Scientific Rep. No. 18)