~ Eugene Magnier, 2008.12.10

One of the major failure modes in the IPP analysis is due to timeouts from Nebulous. These show up as failures of the client to find a requested existing instance or failures to construct a new instance. This type of failure is particularly common for the 'camera' stage of the analysis, in which 180 Nebulous instances are requested; if any single instance request fails, the whole process fails.

As an aside, let's consider briefly the real requirement for Nebulous throughput. For PS1, in order to keep up with the nightly processing, we need to process roughly 600 exposures in 24 hours (a typical long night's worth of data). In order to meet the 30 min transient requirement, we will need to process those 600 exposures in ~10 hours. The actual load is a bit tricky to calculate, since we may perform multiple stacks of the same exposures. However, a minimum estimate is obtained by assuming that each exposure is processed by each stage of the analysis exactly once. For each stage, there are a number of different files that are created or read, and Nebulous needs to service all of them. The number of files per exposure for each of the stages is roughly (assuming a typical ratio of 2 skycells for every chip and 1 stack for every 4 exposures):

  • chip : read: 60; create: 60 x 10
  • camera : read: 60 x 3; create: 5
  • fake : (not certain at this time, but small: 5-10)
  • warp : read: 60 x 5; create: 120 x 10
  • stack : read: 120 x 5; create: 120 x 10 / 4
  • diff : read: 120 x 10; create: 120 x 10

Neglecting the small 'fake' contribution, this adds up to 2340 reads and 3305 creates for each exposure. To keep up with the 24h processing rate, these imply 16 reads/sec and ~23 creates/sec; for the 10h processing rate, these numbers rise to 39 reads/sec and 55 creates/sec. These are the minimum average sustained rates; in real processing the load will be more variable, and we should be able to handle those variations without causing failures, so we should allow a factor of ~2 margin on top of the averages. Thus, the real Nebulous requirement is something like 80 reads/sec and 110 creates/sec.
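
As a sanity check on the arithmetic, the short script below reproduces these totals and rates from the per-stage counts listed above (the small 'fake' contribution is neglected, and 600 exposures per night is assumed, as in the text):

    #!/usr/bin/env perl
    # Check of the per-exposure operation counts and the implied sustained
    # rates.  Per-stage numbers are taken from the list above (2 skycells
    # per chip, 1 stack per 4 exposures); 'fake' is ignored as negligible.
    use strict;
    use warnings;

    my %stage = (
        chip   => { read => 60,       create => 60 * 10 },
        camera => { read => 60 * 3,   create => 5 },
        warp   => { read => 60 * 5,   create => 120 * 10 },
        stack  => { read => 120 * 5,  create => 120 * 10 / 4 },
        diff   => { read => 120 * 10, create => 120 * 10 },
    );

    my ($reads, $creates) = (0, 0);
    for my $s (values %stage) {
        $reads   += $s->{read};
        $creates += $s->{create};
    }
    printf "per exposure: %d reads, %d creates\n", $reads, $creates;   # 2340, 3305

    for my $hours (24, 10) {
        printf "%2dh night: %3.0f reads/sec, %3.0f creates/sec\n",
            $hours, 600 * $reads / ($hours * 3600), 600 * $creates / ($hours * 3600);
    }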

Distributed Nebulous

One possibility to reduce our Nebulous load is to distribute the data across multiple servers. Since any single Nebulous storage object is completely independent of all other storage objects, we can distribute the data across more than one database without requiring any inter-database communication. We would need a mechanism to decide which server holds the data; I've suggested a simple hash key based on a fraction of the filename (or, to be more exact, the Nebulous storage object ext_id value). Any file would then be found by first generating the hash and selecting the corresponding server from a lookup table, then making the Nebulous request as usual to that server. Note that in this implementation no synchronization between the databases is necessary.
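
As a concrete sketch of that lookup (the host names, the bucket count, the MD5-based hash, and the example key are all placeholders for illustration; none of this is existing Nebulous client code):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my @servers = qw(nebdb00 nebdb01 nebdb02 nebdb03);   # hypothetical database hosts
    my $NBUCKET = 16;                                    # fixed number of hash buckets
    my %bucket_server;                                   # bucket -> server lookup table
    $bucket_server{$_} = $servers[$_ % @servers] for 0 .. $NBUCKET - 1;

    sub server_for {
        my ($ext_id) = @_;
        # Hash the key (first 32 bits of its MD5) into a bucket, then look up
        # which database server currently owns that bucket.
        my $bucket = hex(substr(md5_hex($ext_id), 0, 8)) % $NBUCKET;
        return $bucket_server{$bucket};
    }

    # Every client computes the same server for a given key, so the request
    # can go straight to that database with no cross-database coordination.
    my $host = server_for('some/exposure/chip.XY01.fits');   # invented example key

Splitting the hash space into a fixed set of buckets, rather than taking the hash modulo the number of servers, keeps the lookup-table idea explicit: adding a server later only means reassigning some buckets (and migrating the corresponding objects, as discussed below).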

Some issues to consider:

  • adding a new database. To add a new database server, a query would be run to select the existing storage objects that match the hash key corresponding to the new server. These objects and their corresponding instances and other associated table entries would be exported, and then imported to the new database. It would not be necessary to drop these entries from the existing database, though it would be necessary to set a boolean field ('ignore' or something equivalent) so that these would not be returned by the existing database. This operation would require a halt in processing.
  • deleting (merging back) a database. It is difficult to imagine why we would perform this operation, but it is possible to implement. The entries in the database would be exported and then loaded sequentially into the appropriate remaining Nebulous server. The instances would need to be loaded along with their corresponding storage objects so that new so_id values could be generated to link the instances to their objects.
  • Impact on the APIs. There are two APIs which do not currently take the ext_id value, and thus do not provide enough information to determine the hosting server: find_objects(pattern) and delete_instance(uri). Both of these would need to be modified or re-defined to work in the new scheme.
  • find_objects(pattern) : this API is used by neb-ls exclusively (true?). Since this function cannot know the specific ext_id (after all, the goal is to find ext_ids), we have to rethink it slightly. The most straightforward modification would be to have the client (neb-ls) loop over the known databases and perform the current query, unchanged, against each (see the loop sketched after this list). An alternative, depending on how the hash is defined, would be to convert the pattern itself to a hash, if enough of it is known, and send the request only to the corresponding server. This latter implementation would gain in some cases, but unless the hash key is very restricted, those cases are likely to be rare. (See below for discussion of the hashing strategy.)
  • delete_instance(uri) : this API is used in Client.pm by 'replicate', 'cull', and 'delete'. All three of these functions take the ext_id (key) as an argument, so the simple modification is to add the key as an argument to the API, which would then allow the hash to be generated.
  • choice of hash key. The hash key could be defined in a number of ways. One extreme would be to use the entire filename (neb key) as the hash key. At the other extreme, we could use just the base directory-like component (eg, all chars up to the first '/' char). Since all entries with the same key land on the same server, the latter strategy is probably not advantageous. On the other hand, it probably is advantageous for a set of similar files to land on the same machine. If we want to break the bottleneck for the camera stage, we definitely want the files from different chips to be well distributed. My recommendation would be to strip the extension from the key (all chars after the last '.') and use the rest of the string as the hash key (see the sketch after this list). That would keep all of the output results from a single stage run (eg, one chip) on one database, but distribute different chips and runs, even within the same exposure, across multiple machines.
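
A sketch of the recommended key derivation from the last item, assuming ext_id values are path-like names ending in an extension (the sample names are invented for illustration):

    # Strip the extension (everything after the last '.') and hash what is
    # left, so all products of one chip/stage run share a server while
    # different chips and runs are spread across the databases.
    sub hash_key_for {
        my ($ext_id) = @_;
        (my $key = $ext_id) =~ s/\.[^.]*$//;   # drop the trailing extension
        return $key;
    }

    # e.g. (invented names):
    #   exp0123.XY01.fits, exp0123.XY01.cmf  ->  key 'exp0123.XY01'  (same server)
    #   exp0123.XY02.fits                    ->  key 'exp0123.XY02'  (may land elsewhere)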
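
And for the neb-ls case above, a sketch of the loop-over-servers approach; the per-server query is passed in as a code reference standing in for the existing single-database find_objects(pattern) call, since only the aggregation step is new:

    # Run the unchanged pattern query against every known database and merge
    # the results; duplicates should not occur since each storage object
    # lives on exactly one server.
    sub find_objects_all {
        my ($pattern, $query_one, @servers) = @_;
        my @found;
        push @found, $query_one->($_, $pattern) for @servers;
        return sort @found;
    }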

More on Performance