As the available data from Sky Surveys and new digital archives rapidly multiplies every year, more than 80 percent of the data will exist outside of large data centers at any given moment, making it very important to have dynamic cross-identification tools like SkyQuery.
Loading an entire survey like 2MASS or SDSS into a database involves making decisions about issues like data formats and indices for tables. We describe the process of loading such a large amount of data into a relational DBMS (SQL Server) and generating a sky index using the Hierarchical Triangular Mesh (HTM), which provides a really fast way to find objects. This can be easily done even for a large survey like the 2MASS All-Sky Data Release (150GB uncompressed, 471M objects) in as little as 2 days including the required computation time for HTM.
Sky Surveys usually provide data to the public in a form of ASCII or FITS files and their brief descriptions. Using these descriptions we are able to understand the structure of the files and the data types included in them. If we would like to load this data in a database, we'll have to create a representation of the survey data types suitable for our database system. This means a database schema and a parser to convert the files to our format.
Most of the time we don't need to deal with lot's of different tables, it is enough to make only one or two tables for all of our data (see an example of a more complex schema in the next section, Fig. 2). This table (PhotoObj or SpecObj) will hold all the data we can find in the public files (but with SQL data types) completed with an identity primary key (ObjID) column. The best way to make this ``create table" script is to write a description after every column as a comment. By doing this, we'll be able to generate important metadata to our tables and functions later on.
Our next step is the file conversion. In order to load our files into a database, the clearest way is to convert them into a character delimited (CSV) format using the new data types and our symbol of a NULL value. Additionally, we must include our coordinates in a float type J2000 format as well (ra, dec), because the HTM coordinates will be calculated from these values. If we couldn't find any sample code bundled with the public data, we can still start thinking about using regular expressions with our parser.
At this point we have a fully functional database without a row of data. To fill it up, we can use ``bulk load" with the converted CSV files, or advanced (graphical) tools (e.g. Data Transformation Services in SQL Server). After this, the HTM coordinates will be calculated and loaded into a separate table by using a function of the installed extended stored procedure.
To speed up the searches in our node, we should create additional indices and define relationships between our tables (e.g. between PhotoObj and HTM tables using ObjID)
In order to make a SkyNode, our last two steps are creating a user to access the data and set up the web service using a modified configuration file. Our user must have permissions to SELECT from the tables, and EXEC the stored procedures. The SkyNode configuration file must include the user name, password and a few lines of description (name, location, ..).
With it's more than 471 million objects the 2MASS All-Sky Data Release is a bit more challenging to handle. The Data Release includes 3 different type of ``catalog" (Point Source Catalog, Extended Source Catalog, ScanInfo), which makes the database schema more complex (Fig. 2). Even if we have a more complex database, the biggest problem is still the time required to load the data and calculate the coordinate based indices. Using the new HTM2 code to create the htmID indices came up to be a surprisingly fast solution. After about 36 hours of parallel loading and computing time on 2 dual Xeon processors, we had a 2MASS database in the SQL Server.
New, final data releases from great surveys made a good choice to extend the list of SkyNodes this year. After about one month of work, the following new nodes were set up at SkyQuery.net:
To find more information about SkyNodes and SkyQuery, or to download
prepared SQL scripts visit the following pages:
Budavári, T., et al. 2003, in ASP Conf. Ser., Vol. 295, Astronomical Data Analysis Software and Systems XII, ed. H. E. Payne, R. I. Jedrzejewski, & R. N. Hook (San Francisco: ASP), xii:O10-131
Fekete, Gy., et al. 2004, this volume, 289
Malik, T., et al. 2002, CIDR'03, p.17, 'SkyQuery: A WebService Approach to Federate Databases'
Thakar, A., et al. 2004, this volume, 38