All about state, data_state, and fault

Bill Sweeney 2011-02-07

This page documents the use of the state, data_state, and fault columns in the various IPP database tables. It also includes some proposed changes to solve problems with the update system and various other issues.

Run State

Each "stage" of the IPP has a table called a "Run". These objects are added to the database when processing is initiated. The column "state" is used by the ippTools programs to figure out what work there is to do. When a Run object is inserted into the database, its state is set to "new".

The output data products of a Run are represented by one or more rows in a component table. These objects are inserted when (initial) processing is completed for each component. The inputs for a stage are typically the results from a previous stage. Some stages have specific tables that define the set of inputs. These are created when the run is queued (diff and stack) or after an initial bit of processing (warp and magic).

Table 1. The names of the tables for each stage.

Run table       Component table         Input component table   
---------------------------------------------------------------
chipRun         chipProcessedImfile                              
cameraRun       cameraExp                                       
fakeRun         fakeProcessedImfile                            
warpRun         warpSkyfile             warpSkyCellMap        
diffRun         diffSkyfile             diffInputSkyfile     
stackRun        stackSumSkyfile         stackInputSkyfile   
magicRun        magicNodeResult         magicTree          
magicDSRun      magicDSFile             
distRun         distComponent

When a component row is inserted into the database, the value for the column data_state is set to 'full'. (The stages camera and stack do not have data_state since there is a single component per run. Magic and distribution do not have data_state at this time because no need for them has arisen. distComponent has a 'state' column but it is not currently used for anything. We should probably drop the column or change its name to data_state for consistency.)

Once all of the components for a run have been successfully processed, the state of the run is changed from 'new' to 'full'.
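
The rule above can be sketched in a few lines of Python. This is only an illustration, not ippTools code: the function name and its arguments are hypothetical, and treating a faulted component as "not yet successfully processed" is my reading of the text.

```python
def run_state_after_processing(run_state, expected_components, components):
    """Return the Run.state implied by the component rows present so far.

    components maps a component id to a (data_state, fault) tuple.
    All names are hypothetical; only the 'new' -> 'full' rule is from the text.
    """
    if run_state != "new":
        return run_state              # only runs in 'new' are promoted here
    for comp_id in expected_components:
        row = components.get(comp_id)
        if row is None:
            return "new"              # component row not inserted yet
        data_state, fault = row
        if data_state != "full" or fault != 0:
            return "new"              # inserted, but not successfully
    return "full"
```

For example, a run with two expected components stays in 'new' until both rows exist with data_state 'full' and fault 0.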

Faults

Each component table has a one-byte column called fault. This column is used to flag processing errors. Faults are cleared by issuing ippTool "revert" commands. When the run state is "new", the faults are cleared by deleting the faulted row from the table. When the state is "update", the rows are preserved and the fault column is reset.

Note: We intend to distinguish faults due to transient errors (NFS) from those due to programming errors. Some of the revert tasks only revert faults with fault == PS_EXIT_SYS_ERROR. However, some stages (diff for example) do not yet use this code for all transient errors, so for those stages we revert all fault values.
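
The two revert behaviours can be sketched against an in-memory SQLite database. The table names come from Table 1, but the column layout is a simplified assumption, the function names are hypothetical, and the real revert modes live in the ippTools programs. The sketch reverts every nonzero fault value.

```python
import sqlite3

def revert_faults(db, chip_id):
    """Clear faults for one chip run, following the two rules in the text."""
    (state,) = db.execute(
        "SELECT state FROM chipRun WHERE chip_id = ?", (chip_id,)
    ).fetchone()
    if state == "new":
        # Initial processing: delete the faulted row so it is processed again.
        db.execute(
            "DELETE FROM chipProcessedImfile WHERE chip_id = ? AND fault != 0",
            (chip_id,),
        )
    elif state == "update":
        # Update processing: keep the row and only reset the fault column.
        db.execute(
            "UPDATE chipProcessedImfile SET fault = 0 "
            "WHERE chip_id = ? AND fault != 0",
            (chip_id,),
        )

def make_demo_db():
    """Build a tiny demo database (assumed, simplified schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE chipRun (chip_id INTEGER, state TEXT)")
    db.execute(
        "CREATE TABLE chipProcessedImfile "
        "(chip_id INTEGER, class_id TEXT, data_state TEXT, fault INTEGER)"
    )
    db.executemany("INSERT INTO chipRun VALUES (?, ?)",
                   [(1, "new"), (2, "update")])
    db.executemany(
        "INSERT INTO chipProcessedImfile VALUES (?, ?, ?, ?)",
        [(1, "XY01", "full", 0), (1, "XY02", "full", 2),
         (2, "XY01", "update", 2)],
    )
    return db
```

After reverting both runs in the demo database, the faulted row of run 1 is gone and the faulted row of run 2 is still present with fault 0.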

Cleanup

The space used by the data products is considerable. For PS1 we do not have enough disk space to keep more than several days of fully processed data around. Storage is recovered by the cleanup process. When a run is cleaned, the large images are deleted. If the images are subsequently needed, they can be recovered by the update process described in the next section.

Not all stages are usually cleaned up. In particular the outputs from the camera and stack stages are preserved. (There may be code to manage cleanup and update for these stages but Bill's law applies.)

To initiate cleanup, the state of the Run is set to "goto_cleaned". Runs in this state are cleaned by a single job. As the script proceeds and each component is cleaned, an ippTool command is issued to set the component's data_state to "cleaned". (Note: destreak cleanup takes some shortcuts.)

    chiptool -tocleanedimfile -chip_id <chip_id> -class_id <chip>
    warptool -tocleanedskyfile -warp_id <warp_id> -skycell_id <skycell>
    difftool -tocleanedskyfile -diff_id <diff_id> -skycell_id <skycell>
    magicdstool -tocleanedfile -magic_ds_id <magic_ds_id> -component <component>

Once all components have been cleaned the Run's state column is set to "cleaned".
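
The happy path of this sequence can be sketched as follows. The run is modelled as a plain dictionary and delete_images is a hypothetical stand-in for whatever the cleanup script does to remove the large images; none of these names are from the IPP code.

```python
def clean_run(run, delete_images):
    """Clean every component of a run whose state is 'goto_cleaned'.

    run is {"state": str, "components": {component_id: data_state}}.
    delete_images is a callable that removes the images of one component.
    Both are hypothetical stand-ins for the real cleanup script.
    """
    if run["state"] != "goto_cleaned":
        return run["state"]                    # nothing to do for this run
    for comp_id, data_state in run["components"].items():
        if data_state == "cleaned":
            continue                           # already cleaned earlier
        delete_images(comp_id)
        run["components"][comp_id] = "cleaned" # the -tocleaned* step
    run["state"] = "cleaned"                   # all components are cleaned
    return run["state"]
```

If delete_images raises, the run is left in goto_cleaned with some components already marked cleaned; error handling is deliberately left out of the sketch.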

If an error occurs during the cleanup process, the state of the run is set to "error_cleaned" with an ippTool mode:

    chiptool -updateprocessedimfile -chip_id <chip_id> -class_id <chip> -set_state error_cleaned
    warptool -updateskyfile -warp_id <warp_id> -skycell_id <skycell> -set_state error_cleaned
    difftool -updateskyfile -diff_id <diff_id> -skycell_id <skycell> -set_state error_cleaned
    magicdstool -updatedestreakedfile -magic_ds_id <magic_ds_id> -component <component> -data_state error_cleaned

Currently we do not have tasks that automatically "revert" runs in state error_cleaned.

BUG:

  1. Some of the arguments are named incorrectly. The -set_state argument actually specifies the new data_state, and for magicdstool -data_state should be -set_data_state.

Stages that are routinely cleaned are chip, warp, diff, destreak, and distribution. During cleanup the images are deleted but certain data products (photometry for example) are preserved.

Update

Images from a run can be regenerated through the update process. Since typically not all components of a run are required, components may be queued for update individually. A specific ippTool mode is used to queue updates for each stage:

    chip        chiptool    -setimfiletoupdate  -chip_id <chip_id> -class_id <chip>
    warp        warptool    -setskyfiletoupdate -warp_id <warp_id> -skycell_id <skycell>
    diff        difftool    -setskyfiletoupdate -diff_id <diff_id> -skycell_id <skycell>
    destreak    magicdstool -setfiletoupdate    -magic_ds_id <magic_ds_id> -component <component>

These modes change the state of the run to update (unless it is already equal to update) and change the data_state of the component to update, provided both of the following conditions hold:

    Run.state == 'cleaned' or Run.state == 'update'
    Component.data_state == 'cleaned'

At the chip stage an additional requirement is imposed before an update is queued. If the component has been previously destreaked (magicked != 0), the state change is made only if the magicDSRun with stage_id = chip_id and stage == 'chip' has state 'cleaned' or 'update'.
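
The checks made before an update is queued can be sketched as one function. The dictionary layout, the function name, and the magic_ds_state argument are hypothetical; the conditions themselves are the ones described in this section.

```python
def queue_update(run, comp_id, stage="warp", magic_ds_state=None):
    """Try to queue one component for update; return True if it was queued.

    run is {"state": str,
            "components": {id: {"data_state": str, "magicked": int}}}.
    magic_ds_state is the state of the matching magicDSRun, or None.
    All names are hypothetical.
    """
    comp = run["components"][comp_id]
    if run["state"] not in ("cleaned", "update"):
        return False
    if comp["data_state"] != "cleaned":
        return False
    # Extra chip-stage rule for components that were previously destreaked.
    if stage == "chip" and comp.get("magicked", 0) != 0:
        if magic_ds_state not in ("cleaned", "update"):
            return False
    run["state"] = "update"
    comp["data_state"] = "update"
    return True
```

A second call for the same component returns False because its data_state is no longer 'cleaned', while other cleaned components of the same run can still be queued since the run state 'update' is accepted.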

During update processing some data products are not regenerated. For example, during chip updates photometry is not performed. (This should be true at warp and diff as well, but I'm not sure that it is implemented this way.)

INCONSISTENCY: When going to update, chip and destreak also set Component.fault = 0. diff and warp should do this as well. At those stages the fault is currently cleared only by revert, which introduces latency.

Update faults

Faults can occur during update processing. When this happens, the -update*file modes listed above are used to set the fault column. Faulted components are not returned by the -pending queries.

Faults are reverted with -revert*file modes. These modes set fault to zero.

Far too often the system cannot update an individual component even after trying repeatedly. Currently we repeat and retry until someone stops the madness by setting the run to goto_cleaned.

'''ISSUE:''' We need a way to tell the system to give up trying to update a faulted component.
    Possible ways to do this:
        a) a special fault code that revert modes don't revert unless -fault <that_value> is supplied
        b) a special data_state
        c) add a fault count and give up after a certain number of faults. (This is working well for
        the postage stamp server but requires schema changes.)
I like a) since it avoids making data_state handling more complicated and says exactly what it means:
"This component is bad; don't mess with it anymore". Also, it does not preclude implementing c) in the future
to allow update processing to give up on its own. For the time being, the postage stamp
dependency checking will manage setting the fault code.
    Keep in mind:
    * To recover the space, cleanup needs to be ready for whatever we do.
      Again, using a special fault code will make this easier, since we can keep the same
      data_state transitions.
    * -set*toupdate will need to know not to set a component to update if the fault has the magic value.
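
A rough sketch of how option a) might behave is given here. Everything in it is hypothetical: the value 255 for the special fault code is just a placeholder that fits in the one-byte fault column, and the function names and dictionary layout are invented for illustration.

```python
FAULT_GIVE_UP = 255  # hypothetical placeholder; no value has been chosen

def revert_update_faults(components, fault=None):
    """Reset faults to 0 and return the ids of the reverted components.

    Without fault, every nonzero fault except FAULT_GIVE_UP is reverted.
    With fault, only components with exactly that value are reverted,
    which is the only way to revert FAULT_GIVE_UP.
    """
    reverted = []
    for comp_id, row in components.items():
        if row["fault"] == 0:
            continue
        if fault is None and row["fault"] == FAULT_GIVE_UP:
            continue                  # "bad, don't mess with it anymore"
        if fault is not None and row["fault"] != fault:
            continue
        row["fault"] = 0
        reverted.append(comp_id)
    return reverted

def may_set_to_update(row):
    """The -set*toupdate check: skip components marked as given up."""
    return row["data_state"] == "cleaned" and row["fault"] != FAULT_GIVE_UP
```

The data_state transitions are untouched in this sketch, which is the point of option a): cleanup can treat a given-up component like any other.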

Destreak Revert

When components fault during destreak update, the data_state is set to error_cleaned. We don't know what state to revert to: was the run 'new' or was it 'update'? In practice this is handled based on the label:

    magicdstool -revertstatefaults -state failed_revert -set_state new -label \%.nightlyscience
    magicdstool -revertstatefaults -state failed_revert -set_state update -label ps_ud\%

but this should be handled more gracefully, perhaps by using distinct states such as 'failed_revert' and 'failed_revert_up'.
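
The label-based choice made by the two commands amounts to the mapping sketched here. The function name and the example labels in the comments are made up. Plain suffix and prefix tests only approximate the SQL LIKE patterns, since "_" in a LIKE pattern matches any single character.

```python
def revert_target_state(label):
    """Pick the state to revert to from the run label, or None if unknown.

    Approximates the patterns '%.nightlyscience' and 'ps_ud%' used in the
    commands with simple suffix and prefix tests.
    """
    if label.endswith(".nightlyscience"):
        return "new"        # e.g. a made-up label "survey.nightlyscience"
    if label.startswith("ps_ud"):
        return "update"     # e.g. a made-up label "ps_ud_example"
    return None             # no rule: leave the state alone
```

Labels that match neither pattern get no target state, which is the gap that separate failure states would close.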