Is there more to exploit in a large data archive/data flow? Is it possible to improve the Data Flow Scheme so as to foster the mine-ability of an archive while, at the same time, making the everyday life of the quality control scientist easier? What are the common missing steps for an archive/observatory to be miner-ready? We will answer these questions and suggest a new approach to the data flow scheme, in which Data Quality and the Archive can be seen as two different clients of the same sub-system: the Observatory Data Warehouse.
HST and VLT insiders are very familiar with the concept of a Data Flow System (DFS): a closed-loop software system incorporating the various subsystems that track the flow of data all the way from the submission of proposals to the storage of the acquired data in the Science Archive Facilities. Typical DFS components are: Program Handling, Observation Handling, the Telescope Control System, the Science Archive, the Pipeline and Quality Control. All these components produce various sorts of data with different formats and ``handling'' rules. As a result, the information flow among the subsystems suffers from ``hiccups'': some data may ultimately be lost in the process and the referential integrity within the DFS may be compromised.
ST-ECF and CADC have long been busy patching the current HST Data Flow, though, as we will see, only a posteriori. On-the-fly calibration of HST science data, the jitter extraction pipeline, the WFPC2 associations and the FOS associations are all examples of such patching of the engineer-oriented HST Data Flow. These systems have introduced a previously missing, basic, scientifically oriented description of the HST datasets.
It is thanks to this a posteriori effort of reconstructing what actually happened during the observations that higher-level, ready-for-science data products are now immediately available to scientists. Indeed, cosmic-ray-free co-added images, mosaics of dithered WFPC2 observations, and combinations (at least for a first visual inspection) of FOS spectra are generated on the fly upon demand.
In the case of the VLT, soon after the completion of the observing phase of a service-mode programme, the PI receives a complete data package, which includes, among other things, zeropoints and science frames free of instrument signatures. The archive, and hence the future user of the science data, does not receive the same information, so the work of the quality control scientist, at the back end of the data flow, is partially lost.
After Phase 1 and Phase 2 of proposal preparation, the observations are scheduled and then executed. Both telemetry and science data are acquired and stored, a reduction pipeline produces quick-look products, and a Quality Control team inspects the data; usually, however, only a few measurements are taken and passed on to the PIs. For example, the PIs of VLT programmes receive a Quality Control report comparing the user requirements in terms of seeing, fraction of lunar illumination, moon distance and airmass with the actual values measured on site and stored in the ambient database.
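A minimal sketch of such a comparison, in Python and with purely hypothetical field names (the actual report format and ambient-database keys are not those shown here):

def check_constraints(requested, measured):
    """Compare Phase 2 observing constraints with the values measured on site."""
    return {
        "seeing": measured["seeing_arcsec"] <= requested["max_seeing_arcsec"],
        "lunar_illumination": measured["fli"] <= requested["max_fli"],
        "moon_distance": measured["moon_dist_deg"] >= requested["min_moon_dist_deg"],
        "airmass": measured["airmass"] <= requested["max_airmass"],
    }

# Example: the PI asked for 0.8" seeing and airmass below 1.5;
# the night delivered 0.95" seeing, so that constraint is flagged as violated.
print(check_constraints(
    {"max_seeing_arcsec": 0.8, "max_fli": 0.4,
     "min_moon_dist_deg": 30.0, "max_airmass": 1.5},
    {"seeing_arcsec": 0.95, "fli": 0.2,
     "moon_dist_deg": 75.0, "airmass": 1.3},
))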
These are certainly necessary steps. But are they sufficient? What an investigator describes in his/her proposal (pointing information, S/N ratios, image quality, etc.) does not necessarily agree with what the observatory has been able to achieve at run time1 2. Discovering what actually happened during the observations is usually left to the PIs, and their effort is then lost since no feedback is given to the Observatory. Moreover, an archive scientist will later have to go through the same reduction for his/her own study, and again no feedback will be provided to the archive facility.
Any data miner will have to go through those steps again and again. Such an observatory does not qualify as miner-ready.
As highlighted above, there are two main problems in today's DFS implementations: (1) lack of interoperability among the various DF subsystems, and (2) an insufficiently detailed description of the observations.
Two steps are necessary to overcome these limitations:
(a) The adoption of a Data Warehouse3 to control the various DFS activities. Information flows among the DF subsystems via the data warehouse. The advantages of placing it at the centre of the DFS are multiple: among them, it guarantees homogeneous access to the information created by any DF subsystem, and it is the natural place to develop and integrate tools that check for referential integrity (see the sketch below).
(b) The introduction of a new DFS component, call it the "Characterisation" step, responsible for any data manipulation/reduction needed to extract all the parameters useful for a thorough description of what actually happened during the observations. It should consist of a set of reduction pipelines measuring (and comparing) the parameters requested in Phase 2 (e.g., offsets of dithered frames, S/N of spectra, image quality, etc.).
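As an illustration of point (a), once every subsystem publishes into the same repository, a referential-integrity check reduces to a single query. The sketch below is hypothetical: the table and column names (science_frames, obs_blocks, ob_id) do not correspond to any actual HST or VLT schema.

import sqlite3

def orphaned_science_frames(warehouse="warehouse.db"):
    """List archived science frames whose observation block is unknown to Phase 2."""
    con = sqlite3.connect(warehouse)
    try:
        rows = con.execute(
            """
            SELECT f.frame_id
            FROM science_frames AS f
            LEFT JOIN obs_blocks AS b ON f.ob_id = b.ob_id
            WHERE b.ob_id IS NULL
            """
        ).fetchall()
    finally:
        con.close()
    return [frame_id for (frame_id,) in rows]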
These Characterisation tasks should be executed later than the Data Quality ones, since they require a better calibration, using improved reference files and software typically unavailable at the time of the observations. This special activity should be carried out some time (1 year?) later4.
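To make the Characterisation step concrete, the sketch below shows the kind of record such a pipeline could produce and push into the warehouse: the Phase 2 request, the value re-measured on the recalibrated frame, and the reference files and software version actually used. All names are illustrative, not part of any existing DFS.

from dataclasses import dataclass
from datetime import date

@dataclass
class Characterisation:
    dataset_id: str
    parameter: str          # e.g. "image_quality_fwhm_arcsec"
    requested: float        # value asked for in Phase 2
    measured: float         # value measured on the recalibrated frame
    reference_files: str    # calibration files actually used
    software_version: str   # typically newer than at the time of Quality Control
    measured_on: date

def characterise_image_quality(dataset_id, requested_fwhm, measured_fwhm):
    """Build the record the Characterisation pipeline would store in the warehouse."""
    return Characterisation(
        dataset_id=dataset_id,
        parameter="image_quality_fwhm_arcsec",
        requested=requested_fwhm,
        measured=measured_fwhm,
        reference_files="best reference files available ~1 year after the observation",
        software_version="pipeline vN+1",
        measured_on=date.today(),
    )

# Example: Phase 2 asked for 0.8" image quality; the recalibrated frame shows 0.95".
print(characterise_image_quality("some-dataset-id", 0.8, 0.95))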
In the end, the data warehouse should contain not only Phase 2 and scheduling information, but also:
While certain parameters are already measured (mainly 1 and 2 above), others (some of 3, and mainly 4, 5 and 6) are not part of the current Data Flow Systems. Though 1 and 2 are stored in the so-called calibration database, and 3, 4, 5 and 6 could end up in a "characterisation database", more is to be gained by integrating both within the observatory data warehouse. Having all this information on-line will greatly improve the way an instrument scientist or an archive scientist works.
The mine-ability of the system is greatly enhanced, since engineers and scientists, both inside and outside the observatory, gain homogeneous access to information such as:
(a) ready-to-use measurements;
(b) ready-to-view preview products;
(c) a scientific view of the archive, as opposed to the standard, sterile catalogue (the observation log);
(d) a quality control view of the archive (trend analysis and instrument/telescope health checks can benefit from monitoring parameters such as detector noise levels, measured resolution versus time and slit width, image quality, etc.);
(e) a higher level of abstraction, since at this level the underlying complexity of the various sub-systems that collected the information has been removed.
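To illustrate items (c) and (d): once measured quantities live in the warehouse, selecting data by what actually happened, rather than by what was requested, reduces to a query. The schema below (a characterisation table keyed by dataset) is again hypothetical.

import sqlite3

def datasets_with_good_image_quality(warehouse="warehouse.db", max_fwhm=0.6):
    """Return dataset ids whose measured image quality is better than max_fwhm arcsec."""
    con = sqlite3.connect(warehouse)
    try:
        rows = con.execute(
            """
            SELECT dataset_id, measured
            FROM characterisation
            WHERE parameter = 'image_quality_fwhm_arcsec'
              AND measured <= ?
            ORDER BY measured
            """,
            (max_fwhm,),
        ).fetchall()
    finally:
        con.close()
    return rows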
Without this level of abstraction, it will be difficult to achieve effective interoperability among archives.
We have highlighted the typical problems that the HST and VLT Data Flow Systems face today. Dispersing the information over several subsystems that do not interoperate is the immediate cause of glitches and inconsistencies which, given the intrinsically heterogeneous nature of the DFS, are then difficult to identify and repair. We claim that a central repository of the information produced by all the DF subsystems will greatly help achieve smoother operations.
Industry faces the same kind of problems; indeed, data warehousing is one of the hottest industry trends. The astronomical community should try to benefit from that effort.
Up to now, an archive user, whether an external user or an instrument scientist, has been able to browse an observation log representing essentially Phase 2 information. The aim of introducing a Characterisation step is to provide not only better information on what actually happened during the observations, but also a higher-level interface to the archive: a miner-ready interface which neither needs nor wants to know the details of the particular DFS, but which helps the scientist identify the data s/he needs.
A good DFS must be able to remove its own signature. A good Observatory (not only the archive) must be miner-ready.