Mehringer, D. M. & Plante, R. L. 2003, in ASP Conf. Ser., Vol. 314, Astronomical Data Analysis Software and Systems XIII, eds. F. Ochsenbein, M. Allen, & D. Egret (San Francisco: ASP), 42
Adapting the BIMA Image Pipeline for Miriad Using Python
David M. Mehringer and Raymond L. Plante
National Center for Supercomputing Applications,
University of Illinois Urbana-Champaign, Urbana, IL
61801
Abstract:
Through our experience using AIPS++ in the BIMA Image Pipeline, we
found that a sophisticated scripting environment is crucial for
supporting an automated pipeline. Miriad V4, now in development,
introduces support for calling Miriad programs from a Python
environment (referred to as Pyramid). We are creating processing
recipes using Miriad through Python that can be used with the BIMA
Image Pipeline. As part of this work, we are prototyping tools that
could be integrated into Pyramid. These include two Python classes,
UVDataset and
Image for examining the contents of Miriad
datasets. These simple tools have allowed us to recast our Pipeline
using Miriad in only a couple of months. Python recipes are used for
such things as determining line-free channels for continuum
subtraction and determining if data will benefit from
self-calibration. We are currently using the Pipeline for large-scale
processing of hundreds of tracks of archival data on NCSA's
1-Teraflop IA-32 Linux cluster.
The BIMA
Image Pipeline is part of the BIMA Data
Archive
and is
a system for automated processing of BIMA data after they have been
transferred from the Array located near Hat Creek, CA to NCSA,
ingested, and archived. This processing includes calibration,
self-calibration, continuum subtraction, and imaging of target
datasets and calibrators. The products of the processing are also
ingested and archived in the BIMA Data Archive where they can be
retrieved by astronomers. At this point, the processed data products
are meant to give astronomers a ``first look'' at their data. However,
as new processing recipes are developed, we foresee that the data
products will approach publication quality and will therefore reduce
the amount of processing that the end user will have to do on his or
her desktop.
We have recently re-implemented our Pipeline in
Python, using
Miriad
as
the underlying astronomical data processing engine. This paper discusses
this development and its results.
Our initial implementation of the BIMA Image Pipeline used
AIPS++
as the astronomical data processing engine. We found the powerful
scripting language used by this package, called
Glish,
to be immensely useful for quickly and efficiently constructing data
processing scripts. Thus, it was clear that we would need a powerful
scripting language for our new implementation.
Use of a scripting language provides several benefits. It provides a
means of rapid development. Because there are no compilation steps,
the write-test-debug cycle can proceed
quickly. In our case, we
were able to write a complex, fully functional pipeline in only a
couple of months by writing a Python layer which calls Miriad tasks
for the processing of astronomical data. In addition, a scripting
language provides a relatively easy way for end users to develop their own
recipes. Because the learning curves for scripting languages tend to
be significantly shallower than for compiled languages, the cost of
code development for end users is relatively small. Our experience
with AIPS++ shows that many users have been able to quickly implement
complex algorithms using Glish. Because our goal is to have users
write and submit processing recipes for the BIMA Image Pipeline so
they may be used by the larger community, it is important that we
provide a scripting language interface to allow this.
We decided to use Python as the scripting language for the BIMA Image
Pipeline for several reasons. Python supports both the procedural and
object-oriented programming paradigms, so it is easy for users
familiar with either (or both) to implement algorithms. Python
has a rich, yet simple to use, set of data types such as various types
of sequences (lists, tuples, etc.) and dictionaries (often called
hashes in other languages). Furthermore, Python allows these types to
be nested ad infinitum, so, for example, one could have a dictionary
which contains lists of dictionaries, integers, strings, or any
combination of these. We have found this unlimited flexibility to be
quite important in our development of recipes. Because Python is open
source and has a large user community, it benefits from having a
mature collection of standard libraries. These include libraries for
regular expression manipulation, system command execution,
mathematical function evaluation, XML parsing, etc. Thus, we do not
have to re-invent the wheel and can concentrate on developing
astronomical processing recipes. Python provides simple but
powerful mechanisms for manipulating lists (i.e., arrays) via
slicing and function mapping with its built-in map() function.
This is especially important for producing code to handle astronomical
images. In addition, Python provides a command line interface which
allows for interactive (and thus rapid) development, testing, and
execution of code. Finally, the language provides a means of
interfacing to compiled-code (e.g., C) libraries. We plan to
develop such a Python interface to the Miriad libraries for improved
performance.
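The nesting, slicing, and mapping idioms described above can be illustrated with a short sketch; the source names and Miriad-style selection strings here are purely hypothetical examples:

```python
# Nested data structures: a dictionary holding lists of dictionaries,
# integers, and strings, as described in the text.
results = {
    "targets": [
        {"name": "NGC1333", "windows": [1, 2, 3]},
        {"name": "W3OH", "windows": [1, 4]},
    ],
    "ntracks": 2,
}

# Slicing: extract the first two spectral windows of the first target.
first_windows = results["targets"][0]["windows"][:2]

# Function mapping: turn window numbers into selection strings
# using the built-in map() function.
selections = list(map(lambda w: "window(%d)" % w,
                      results["targets"][0]["windows"]))
```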
One of the most important aspects of developing any data processing
pipeline is that it must be easy to access the metadata for the
various datasets. In radio interferometry, metadata are used to
determine processing parameters such as image extents and image pixel
sizes, the number of spectral windows for which images must be
created, etc. To make accessing metadata simple, we wrote two Python
classes, called UVDataset and Image, which provide APIs
for accessing
metadata from these types of datasets. Methods allow retrieval of
such information as the number of spectral windows, system
temperatures, and antenna positions for uv-datasets and image
dimensions, pixel dimensions, and statistics for images.
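To suggest the flavor of this API, here is a minimal, hypothetical sketch of an Image-style metadata class (the actual method names and constructor in Pyramid may differ; UVDataset follows the same pattern):

```python
# Hypothetical sketch of a metadata-access class in the spirit of the
# Image class described in the text; not the actual Pyramid API.
class Image:
    def __init__(self, path, shape, pixel_size):
        self.path = path
        self._shape = shape            # e.g., (nx, ny, nchan)
        self._pixel_size = pixel_size  # arcsec per pixel

    def shape(self):
        """Return the image dimensions."""
        return self._shape

    def cell(self):
        """Return the pixel size in arcsec."""
        return self._pixel_size

# A recipe can then size its processing from the metadata.
img = Image("n1333.map", (256, 256, 64), 1.0)
```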
The architecture of the BIMA Image Pipeline is depicted in
Figure 1.
Figure 1: Architecture of the BIMA Image Pipeline
The bip (BIMA Image Pipeline) object holds relevant fields which are
used throughout the run by the top-level script and the processing
recipes it calls.
The processing parameters are contained in a text file as name=value
pairs. These parameters control how various recipes are executed.
Role information about the various input datasets is contained in
another name=value pair text file. The roles describe how the datasets
are to be used during processing (e.g., target source, phase
calibrator, flux calibrator, etc.).
The top-level script calls various processing recipes in order. Most
recipes take input datasets (usually the output from a previous
recipe) and create output datasets. Each recipe is essentially a
Python function which is passed a dictionary
describing what processing parameters it should use and returns a
dictionary describing its results
(such as on which input datasets it was successful, the names of
the output datasets which were generated, etc.). The top-level script
makes decisions based on this information such as whether or not to run the
next recipe, what intermediate datasets should be used when the next
recipe is run, etc.
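An illustrative skeleton of such a recipe follows; the function name, dictionary keys, and output-naming convention are hypothetical, and the actual Miriad task invocation is elided:

```python
# Skeleton of a processing recipe: a function that takes a dictionary
# of processing parameters and returns a dictionary describing its
# results, as described in the text. Keys shown are illustrative.
def continuum_subtract(params):
    results = {"succeeded": [], "failed": [], "outputs": []}
    for dataset in params.get("inputs", []):
        try:
            output = dataset + ".contsub"
            # ... call the appropriate Miriad task here ...
            results["succeeded"].append(dataset)
            results["outputs"].append(output)
        except Exception:
            results["failed"].append(dataset)
    return results
```

The top-level script can then inspect the returned dictionary to decide, for example, whether enough inputs succeeded to warrant running the next recipe.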
Using NCSA's IA-32
cluster,
which has a peak performance of 1 Teraflop,
we have processed several hundred tracks of BIMA data. This is
typically done by
processing about 100 tracks at a time using 32 processors. This
processing is controlled via a master csh script. A CPU is
dedicated to processing a single track at a time. Unprocessed tracks
are held in a queue. When a track has finished being processed, the
CPU which has been freed is sent the next track in the unprocessed
track queue. In the future we plan to implement, at the Miriad level,
applications written to take full advantage of cluster
technology. One of the first such applications will
be a parallelized version of CLEAN, an algorithm for
deconvolving images. Each channel of a multi-channel dataset can be
deconvolved independently of the other channels. This problem is
embarrassingly parallel, and so is a good first
step toward taking full advantage of modern clusters.
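The channel-independence that makes CLEAN embarrassingly parallel can be sketched in a few lines using Python's standard multiprocessing pool; this is only an illustration of the parallel structure, not the planned cluster-level Miriad implementation, and deconvolve_channel is a stand-in for a real CLEAN call:

```python
from multiprocessing import Pool

def deconvolve_channel(channel):
    # Placeholder for running CLEAN on one channel of the cube.
    return "channel %d cleaned" % channel

def deconvolve_cube(nchan, nworkers=4):
    # Each channel is independent of the others, so the channels can
    # be farmed out to a pool of workers and collected in order.
    with Pool(nworkers) as pool:
        return pool.map(deconvolve_channel, range(nchan))
```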
Users (astronomers) may access the products of processing runs in the same way they
access raw data in the BIMA Data Archive. The user simply searches our
database by keying her project id, investigator name, and/or numerous
other search parameters into a web form. The page which is returned contains
all datasets matching the query parameter(s), and from this page, the
user may download as many datasets as she wishes using our
DaRT
download client (Mehringer & Plante 2000), or she may proceed to
pages with more detailed
information and download datasets one at a time.
Many types of processing products are archived, including deconvolved
wide-band images and spectral-line cubes of all target
sources in FITS format; calibrated, continuum-subtracted uv data for
the target sources; the calibration solutions used to
calibrate the target datasets; and various plots (in PostScript format)
of images and calibration solutions.
References
Mehringer, D. M. & Plante, R. L. 2000, in ASP Conf. Ser., Vol. 216,
Astronomical Data Analysis Software and Systems
IX, ed. N. Manset,
C. Veillet, & D. Crabtree (San Francisco: ASP), 703
© Copyright 2004 Astronomical Society of the Pacific, 390 Ashton Avenue, San Francisco, California 94112, USA