Parallel Operations and DVO

DVO now has a parallel mode. This page describes at a high level how the parallel mode works in software and gives user guidance for working with a parallelized DVO database.

Parallel Tables and the HostTable

A parallelized DVO database stores the Image table and other top-level tables on a single machine (the top-level host) in the primary CATDIR, while the average, measure, secfilt, and (optional) missing tables are stored in a parallel set of directories on multiple machines. Parallel operations are initiated on the top-level host, with the database of interest specified as usual with the catdir command (in dvo shell) or the CATDIR config variable. Queries which need to interact with the parallel tables are sent to a remote client program. The remote client may do some work or simply return the results of a query (presumably with the total amount of data substantially reduced). The DVO program on the top-level host collects the results and uses them as needed for further local operations.
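
The query flow described above follows a scatter/gather pattern. The sketch below illustrates only that pattern and is not DVO code: the in-memory PARTITIONS dictionary, the remote_query function, and the magnitude filter are all hypothetical stand-ins for the remote client and its tables, and threads stand in for the remote hosts.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical stand-in for the table partitions held by each remote host,
# keyed by host ID. Real DVO tables live on disk on the remote machines.
PARTITIONS: Dict[int, List[float]] = {
    4: [14.2, 19.8, 21.5],
    6: [13.1, 22.0],
}


def remote_query(host_id: int, max_mag: float) -> List[float]:
    """Stand-in for the remote client: filter locally, return a reduced result."""
    return [m for m in PARTITIONS[host_id] if m < max_mag]


def parallel_query(host_ids: List[int], max_mag: float) -> List[float]:
    """Send the query to every host, then gather and merge the results."""
    with ThreadPoolExecutor(max_workers=len(host_ids)) as pool:
        parts = pool.map(lambda h: remote_query(h, max_mag), host_ids)
        merged = [m for part in parts for m in part]
    return sorted(merged)


result = parallel_query([4, 6], 20.0)
print(result)
```

The point of the pattern is that each host returns only the rows that pass the filter, so the top-level host handles far less data than the full tables.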

The remote machines and directories are managed with the file CATDIR/HostTable.dat. This file contains one line for each remote host. Lines which start with '#' are comments and are ignored, as are empty lines. Each host listed in the table is associated with a host ID (an integer > 0) and a directory where the remote data is stored. A single physical computer may be represented more than once, with a different directory and a different host ID for each entry. Here is an example of a section of a Host Table:

# ID  Hostname  Catdir
   4  ipp004    /data/ipp004.0/eugene/catdirs/3pi.20111229
#  5  ipp005    /data/ipp005.0/eugene/catdirs/3pi.20111229 -- over used by dvo
   6  ipp006    /data/ipp006.0/eugene/catdirs/3pi.20111229
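
As an illustration of the file format, here is a minimal Python sketch of a reader for such a table. It is not the DVO parser. It assumes whitespace-separated fields, tolerates leading whitespace before '#', and reads only the first three fields of each line; the names HostEntry and parse_host_table are hypothetical.

```python
from typing import List, NamedTuple


class HostEntry(NamedTuple):
    host_id: int
    hostname: str
    catdir: str


def parse_host_table(text: str) -> List[HostEntry]:
    """Parse HostTable.dat-style text into a list of host entries."""
    entries = []
    for line in text.splitlines():
        stripped = line.strip()
        # Empty lines and comment lines (starting with '#') are ignored.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        if len(fields) < 3:
            raise ValueError(f"malformed host table line: {line!r}")
        host_id = int(fields[0])
        if host_id <= 0:
            raise ValueError(f"host ID must be > 0: {line!r}")
        entries.append(HostEntry(host_id, fields[1], fields[2]))
    return entries


EXAMPLE = """\
# ID  Hostname  Catdir
   4  ipp004    /data/ipp004.0/eugene/catdirs/3pi.20111229
#  5  ipp005    /data/ipp005.0/eugene/catdirs/3pi.20111229 -- over used by dvo
   6  ipp006    /data/ipp006.0/eugene/catdirs/3pi.20111229
"""

hosts = parse_host_table(EXAMPLE)
for h in hosts:
    print(h.host_id, h.hostname, h.catdir)
```

With the example above, only hosts 4 and 6 are returned, because the entry for host 5 is commented out.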

The distribution of the remote files is managed with the program dvodist and tracked in the SkyTable table. This table includes fields to identify the primary host ID, an optional backup host ID, and a set of flags to track whether a given table is available from the primary or backup location (or from the top-level host).

By default, the program dvodist will assign random hosts to each of the table sets. (XXX: add user guide to dvodist)
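
The idea of random assignment can be sketched as follows. This is only an illustration of the concept, not the dvodist implementation: the function assign_hosts, the table-set names, and the seed parameter are hypothetical, and dvodist may choose hosts differently (for example, with balancing or backup hosts).

```python
import random
from typing import Dict, List, Optional


def assign_hosts(table_sets: List[str], host_ids: List[int],
                 seed: Optional[int] = None) -> Dict[str, int]:
    """Pick one host ID at random for each table set."""
    rng = random.Random(seed)  # a fixed seed makes the assignment repeatable
    return {name: rng.choice(host_ids) for name in table_sets}


# Hypothetical table-set names; the host IDs are those active in the example table.
assignment = assign_hosts(["region_a", "region_b", "region_c"], [4, 6], seed=1)
for name, host_id in assignment.items():
    print(name, host_id)
```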

When tables are distributed to the remote hosts, a soft link is created on the original top-level host pointing at the new location. Because of these links, any DVO function can work with a parallel DVO database without modification (although this might be quite slow).
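
The soft-link mechanism can be demonstrated with a short, self-contained sketch. It assumes a POSIX filesystem and uses temporary local directories in place of the real top-level and remote CATDIRs; the function distribute_with_link and the file name example_table.dat are hypothetical, and this is not how dvodist itself is implemented.

```python
import os
import shutil
import tempfile


def distribute_with_link(src_path: str, remote_dir: str) -> str:
    """Move a file into remote_dir and leave a soft link at its old path."""
    os.makedirs(remote_dir, exist_ok=True)
    dest_path = os.path.join(remote_dir, os.path.basename(src_path))
    shutil.move(src_path, dest_path)
    os.symlink(dest_path, src_path)  # the old path now points at the new location
    return dest_path


with tempfile.TemporaryDirectory() as tmp:
    catdir = os.path.join(tmp, "catdir")          # stands in for the top-level CATDIR
    remote = os.path.join(tmp, "remote_catdir")   # stands in for a remote directory
    os.makedirs(catdir)
    table = os.path.join(catdir, "example_table.dat")
    with open(table, "w") as f:
        f.write("table contents\n")

    distribute_with_link(table, remote)

    # Code that opens the original path still works, reading through the link.
    with open(table) as f:
        content_via_link = f.read()
    is_link = os.path.islink(table)
    print(is_link, content_via_link.strip())
```

A program that knows nothing about the distribution reads the table through the link, which is why unmodified DVO functions continue to work, at the cost of remote I/O.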

Interacting with a parallelized DVO database is not completely user friendly. First, it requires the user to make informed decisions about which databases and which commands should be run in parallel mode. Second, I/O timeouts on the remote hosts can occasionally cause trouble, and the user must be cautious of these errors.

Parallel Queries in dvo shell

The following dvo shell commands are parallelized. To perform them in parallel on a parallelized DVO database, add the option -parallel:

  • avextract
  • avmatch
  • mextract
  • gstar

In addition, the following utility functions are helpful with parallel DVO operations:

  • catlist : return a list of region files included in the given region
    • with -this-host, limit to files associated with this remote host
    • with -parallel-local, give the list of local file names instead of the remote ones
  • catstats : give information about the catalog files
    • with -this-host, limit to files associated with this remote host
  • remote : perform the given command on the remote hosts (this can be used to perform more complex remote operations)

Other Parallelized DVO Tools

The following programs and program options have been parallelized. To perform them in parallel on a parallelized DVO database, add the option -parallel:

  • relphot
  • relastro
  • photdbc
  • addstar -resort
  • dvomerge : perform merge of serial db into parallel db
  • setphot