New science often arises from investigations based on the cross-identification of sources in different wavebands. Since source catalogs are often stored as tables in a relational database management system (DBMS), in theory this just requires a suitable JOIN between tables, based on source position. Even if positions alone do not produce unique associations, so that additional criteria are needed, positional coincidence is still the primary key for such joins. In many cases the absence of a counterpart in another band will be scientifically interesting: any unmatched sources can be found using an OUTER JOIN. So much for theory: the practice is more complicated.
Multi-dimensional indexing has been a hot topic of computer science research for many years, and many different algorithms have been proposed. The names include: BANG file, BV-tree, Buddy tree, Cell tree, G-tree, GBD-tree, Gridfile, hB-tree, kd-tree, LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid tree, Q-tree, Quadtree, R-tree, SKD-tree, SR-tree, SS-tree, TV-tree, UB-tree, and Z-order index.
Whereas the B-tree has become the standard one-dimensional indexing method used in every DBMS, none of the multi-dimensional indices has comparable properties such as fast insertion and retrieval, and a worst-case fill-factor of 50%. The problem is, of course, fundamentally intractable, since one cannot serialise the points on a two-dimensional surface. One of the earliest algorithms, the R-tree (Guttman 1984), has been implemented in a few DBMS, including one of the Open Source products: PostgreSQL. But the R-tree has some limitations for source catalogs:
An alternative algorithm, proposed here, is based on covering the sky with pixels of approximately equal area, and assigning a unique integer to each pixel. A number of suitable pixelisation schemes have been proposed, the best known being: HTM, the Hierarchical Triangular Mesh, (Kunszt et al. 2000), and HEALPix, the Hierarchical Equal Area iso-Latitude Pixelisation, (Górski et al. 1998). Both pixelisations are suitable for generating PCODE values, and both have selectable resolution.
For each table, the pixel code value corresponding to the central position (RA, Dec) of each source is computed and inserted into a new column called PCODE. A B-tree index is then created on the PCODE column. This cannot be the primary index of a table since PCODE values are not necessarily unique: one pixel may contain more than one source. The two tables can then be joined on their PCODE columns: this is simply an equi-join between integers, which all DBMS handle efficiently. Note that HTM, and one of the two numbering schemes for HEALPix, are designed to assign nearby numbers to adjacent pixels as far as possible, but this property is not used here; indeed indexing schemes based on this, such as the Z-order index, have serious failings.
There are a number of obvious problems with the simple method outlined above, but they can be solved:
In theory a B-tree index built on PCODEs should be more efficient than an R-tree, and this was confirmed in practice. We took subsets of the HST Guide Star Catalog and the USNO-A1 Catalog corresponding to an area of around 400 square degrees. This produced tables of 275,154 rows and 3,476,948 rows respectively. The HEALPix pixelisation scheme was used, with giving 805,306,368 pixels, each 26 arc-seconds across, and required only 32-bit PCODE values. The tests used PostgreSQL v7.2.1, since it supports R-trees, and were run on a 450 MHz PC running Linux.
|Algorithm||Table size||Index size||Index Creation||Fuzzy Join|
|PCODE and B-tree||293||64||233||143|
Not only is the PCODE method faster and smaller, it can also be used with any relational database management system, not just the few which support a spatial index.
In order to ensure that only a small proportion of additional rows are needed, one must choose a pixel size larger than the typical error-circle radius. Actually it is a rule of thumb in sky surveys that detections should be limited to the level at which there is no more than one source per 40 beam-areas, otherwise confusion effects dominate. It should, therefore, always be possible to find a pixel resolution which is fine enough to leave fewer than one object per pixel on average, while coarse enough to have relatively few error-circles crossing a pixel boundary. Of course there will always be a few source postions which lie close to the point at which 4 HEALPix pixels (or 6 HTM pixels) touch. Efficiency may be lower when dealing with surveys where the density of sources is very uneven, with many pixels empty, but some holding large numbers of sources (e.g., in the Galactic Plane). The PCODE method is also less suitable if a wide-area of the sky is of interest, e.g., when making a finding-chart.
A difficulty may still arise when joining two tables with very different resolutions, such as an X-ray catalog from the era preceding XMM-Newton and Chandra, and a modern optical catalog. In practice the problem is less serious, as a catalog with large error regions is, intrinsically, one with fewer sources in it, so the absolute number of extra rows needed is tolerable. In fact the 26 arc-second pixels we used in our tests turn out to be a good match to most current catalogs.
There is also a potential scalability limitation from the need to use SELECT DISTINCT on the output of the join, since this implies that the results of the join are sorted. This takes a time proportional to , where is the number of rows in the output table.
The HTM pixelisation has a useful property which promises additional scalability: the algorithm is recursive so that four pixels at each resolution stage fit exactly in one pixel from the next coarser resolution, and the pixel codes just add extra bits. Hence if a fine mesh is established using a PCODE-equipped table based on the HTM pixelisaton, it can be converted to any required coarser resolution, e.g., for joining with a low-resolution source catalog, just by ignoring pairs of bits at the least-significant end of the integer.
This suggests that all source catalogs could have PCODE values attached, based on HTM at some a suitably high resolution, in order to make a wide range of joins (and cone-searches) efficient. At lower resolution many of the added rows become superfluous but a SELECT DISTINCT will remove them too.
Górski, K. M., Hivon, E., & Wandelt, B. D. 1998, in Proc MPA/ESO Cosmology Conference on Evolution of Large-scale Structure
Guttman, A. 1984, in Proc. ACM SIGMOD Int. Conf. on Management of Data, 47
Kalpakis, K., Riggs, M., Pasad, M., Puttagunta, V., & Behnke, J. 2001, in ASP Conf. Ser., Vol. 238, Astronomical Data Analysis Software and Systems X, ed. F. R. Harnden, Jr., Francis A. Primini, & Harry E. Payne (San Francisco: ASP), 133
Kunszt, P. Z., Szalay, A. S., Csabai, I., & Thakar, A. R. 2000, in ASP Conf. Ser., Vol. 216, Astronomical Data Analysis Software and Systems IX, ed. N. Manset, C. Veillet, & D. Crabtree (San Francisco: ASP), 141