Shepherding the postage stamp server and the update process

Overview

Postage stamp request files are FITS tables that contain one or more request specifications. Request files are submitted to the system in one of five ways:

  • By posting to a fileset on a data store that the server is monitoring
  • By uploading a request file to the upload page
  • By entering a request on the PSI form
  • By entering a "valid" set of parameters on the prototype web interface
  • By adding the request to the database with pstamptool -addreq -uri /request_file_path -label some label

Each file is represented in the database by a row in the table pstampRequest. The request starts life in state 'new'. Once parsing is done, the state is set to 'run'. Request files are parsed into jobs by the task pstamp.job.run. The working directory for a request is /data/ippc30.1/pstamp/work/$yyyy/$mm/$dd/$req_id. (Dates are UTC.)
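The working directory layout can be sketched in Python. This is only an illustration: the function name request_workdir and the WORK_BASE constant are hypothetical and not part of the pstamp code. The layout follows the $yyyy/$mm/$dd/$req_id pattern quoted above, and a real installation may use a different base path or date format.

```python
from datetime import datetime, timedelta, timezone

# Assumed base path, copied from the pattern quoted in the text.
WORK_BASE = "/data/ippc30.1/pstamp/work"


def request_workdir(req_id, when=None, base=WORK_BASE):
    """Return base/yyyy/mm/dd/req_id, using the UTC date of `when`.

    `when` should be a timezone-aware datetime; a naive datetime is
    interpreted as local time by astimezone(). Defaults to now.
    """
    if when is None:
        when = datetime.now(timezone.utc)
    else:
        when = when.astimezone(timezone.utc)
    return f"{base}/{when:%Y/%m/%d}/{req_id}"


# A request submitted late in the evening in Hawaii (UTC-10) lands in
# the next day's directory, because the date is taken in UTC.
hst = timezone(timedelta(hours=-10))
print(request_workdir(5977, datetime(2010, 5, 26, 23, 0, tzinfo=hst)))
# prints /data/ippc30.1/pstamp/work/2010/05/27/5977
```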

For example, here is the entry for a request while it is being parsed:


mysql> select * from pstampRequest where req_id = 5977;
+--------+-------+-------+------------------+---------+--------+---------------+---------------------------------------------------------+-------------------------------------------+-------+
| req_id | ds_id | state | name             | reqType | label  | outProduct    | uri                                                     | outdir                                    | fault |
+--------+-------+-------+------------------+---------+--------+---------------+---------------------------------------------------------+-------------------------------------------+-------+
|   5977 |     0 | new   | Niall.1274580688 | pstamp  | WEB.UP | pstampresults | /data/ippdb02.0/pstamp/work/20100527/5977/web_1695.fits | /data/ippdb02.0/pstamp/work/20100527/5977 |     0 | 
+--------+-------+-------+------------------+---------+--------+---------------+---------------------------------------------------------+-------------------------------------------+-------+
1 row in set (0.00 sec)


Each row in the request table contains a request specification. Each row causes one or more pstampJobs to be entered into the database. Jobs are often entered with state = 'stop' and fault >= 10. These come from requests for images that cannot be satisfied. The set of error codes is listed at the bottom of this page.

Once the request table has been parsed successfully, the request's state is set to 'run'. Once the request is in state 'run', its jobs are eligible to run and will be output by pstamptool -pendingjob.

If a pstampJob's input images have been cleaned, a pstampDependent object is entered into the database with state 'new'. The pstampJob will not be run until its dependent is set to 'full'.

The task pstamp.dependent.run runs the script pstamp_checkdependent.pl. This script queries the gpc1 database for the state of the dependent component. This script's job is to queue and monitor update processing for the dependent component. CAREFUL. THERE BE SPIDERS!

The log file for the dependent is named "$pstampDependent->{outdir}/checkdep.$dep_id.log". The script runs periodically, and the log file is appended to each time. The outdir for the dependent is the outdir of the first request that needs the component.

Once all jobs have completed (changed from 'run' to 'stop' state), the request is queued for "finishing". The task pstamp.finish.run builds the results FITS table and packages up all of the data into a fileset on the request's outgoing data store.


Monitoring

The current workload can be seen at http://pstamp.ipp.ifa.hawaii.edu/status.php

Since the postage stamp / data store database is on ippc17 and not ippdb01, some care must be taken when running the associated ippTools commands.

I define the environment variable PSDBSERVER=ippc17 and the alias

   alias pst 'pstamptool -dbname ippRequestServer -dbserver $PSDBSERVER'

I always type pst instead of pstamptool. Since the postage stamp tables in the gpc1 database have been dropped, you get errors if you try to run pstamptool without -dbserver:

    (ipp032:~) bills% pstamptool -pendingreq
     p_psDBRunQuery (psDB.c:812) : Failed to execute SQL query.  Error: Table 'gpc1.pstampRequest' doesn't exist
         pendingreqMode (pstamptool.c:320) : database error


     -> p_psDBRunQuery (psDB.c:812): Database error generated by the server
         Failed to execute SQL query.  Error: Table 'gpc1.pstampRequest' doesn't exist
     -> pendingreqMode (pstamptool.c:320): unknown psLib error
         database error

This is a feature. (I plan on removing the postage stamp tables from the set created by pxadmin -create and adding a new mode to create them but I haven't gotten around to it yet).
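For scripts that cannot rely on the shell alias, the same defaults can be applied when building the command line. The following is a minimal sketch, not part of ippTools: the helper name pst_argv is hypothetical, and it only assembles the argument list using the -dbname and -dbserver options shown above, falling back to ippc17 when PSDBSERVER is unset.

```python
import os


def pst_argv(*args, dbserver=None):
    """Build a pstamptool argument list that always names the
    postage stamp database and server, like the pst alias does."""
    # Fall back to the server named in the text if PSDBSERVER is unset.
    server = dbserver or os.environ.get("PSDBSERVER", "ippc17")
    return ["pstamptool", "-dbname", "ippRequestServer",
            "-dbserver", server, *args]


print(" ".join(pst_argv("-pendingreq", dbserver="ippc17")))
# prints pstamptool -dbname ippRequestServer -dbserver ippc17 -pendingreq
```

The returned list can be handed to subprocess.run() on a host where pstamptool is installed.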


Common Failures

Requests can fault for three reasons, either at the parse stage or when finishing the request.

The most common fault gets set in the database as fault = 200. This happens when one of the source data stores responds with an HTTP 500 (Internal Server Error). MOPS' data store does this occasionally. These usually require no intervention, as there is a revert task that will clear the fault, allowing the system to try again.

Other request faults rarely happen and are usually due to software bugs.

Jobs can fault due to typical NFS errors. There is a revert task for jobs. It only resets jobs with fault > 0 and < 10. Faults >= 10 are reported to the users. Jobs faulted in this way should have had their state set to 'stop', so they are finished.
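The split between revertible and user-visible job faults can be written down as a small check. This is a sketch under the rules stated above, not the server's actual code: the function name and the returned labels are hypothetical, and the threshold 10 corresponds to PSTAMP_FIRST_ERROR_CODE in the list at the bottom of this page.

```python
# Threshold taken from PSTAMP_FIRST_ERROR_CODE in pstamp.h.
FIRST_USER_ERROR = 10


def classify_job_fault(fault):
    """Classify a pstampJob fault value (labels are illustrative)."""
    if fault < 0:
        raise ValueError("fault values are not expected to be negative")
    if fault == 0:
        return "ok"          # no fault
    if fault < FIRST_USER_ERROR:
        return "revertible"  # transient, e.g. NFS; the revert task resets it
    return "reported"        # final; the code is passed back to the user


for value in (0, 2, 25):
    print(value, classify_job_fault(value))
```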

The vast majority of the problems these days are related to the update processing. There are a large number of error conditions that can gum up the works. Here is an example:

(ipp032:~) bills% pst -pendingdependent -limit 1
pstampDependent MULTI

pstampDependent  METADATA  
   dep_id           S64       55261          
   state            STR       new             
   stage            STR       chip            
   stage_id         S64       77563          
   component        STR       XY31            
   imagedb          STR       gpc1            
   outdir           STR       /data/ippdb02.0/pstamp/work/20100527/5977 
   rlabel           STR       ps_ud_WEB.UP    
   need_magic       BOOL      T              
   fault            S16       0              
   priority         S64       8              
END

pstamp_checkdependent.pl sets chipRun 77563, class_id XY31 to 'update' state and watches for it to go to 'full' state. Once that happens, the state of the pstampDependent is set to 'full'.

The update processing is managed by the 'stdscience' pantasks, except for the ps_ud_QUB label, which is being run somewhere else.

When a dependent run faults (that is, when an update of a chipProcessedImfile, warpSkyfile, etc. faults), the dependent is faulted as well. Thus, to revert the dependent, one needs to revert the underlying failure first and then revert the dependent. This is not yet done automatically.

For example, say the dependent listed above fails due to an NFS error. The chipProcessedImfile.fault should be set to PS_EXIT_SYS_ERROR (2). pstamp_checkdependent.pl will notice this and fault the dependent. It does this to keep faulted dependents from filling up the pantasks queue and preventing other dependents from running.

Later the dependent fault will be reverted and the dependency checking will run again. Hopefully by then the chip fault will have been reverted and the update will have finished. After some number of dependency faults (3 currently) the server gives up and faults the job.
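The give-up rule can be sketched as follows. This illustrates the policy described above and is not taken from pstamp_checkdependent.pl: the names are hypothetical, and it assumes the job is faulted on the third dependency fault rather than after it, which the text does not specify.

```python
# "3 currently" per the text; assumed here to be an inclusive limit.
MAX_DEPENDENT_FAULTS = 3


def after_dependent_fault(fault_count, max_faults=MAX_DEPENDENT_FAULTS):
    """Return the next step once a dependent has faulted fault_count times."""
    if fault_count < 1:
        raise ValueError("fault_count must be at least 1")
    if fault_count >= max_faults:
        return "fault_job"       # server gives up; the user sees a failed job
    return "revert_dependent"    # clear the fault so the check runs again


for count in (1, 2, 3):
    print(count, after_dependent_fault(count))
```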

So if we have long periods of cluster trouble, postage stamp users will get lots of failed jobs.

How to clear these jobs?

  • Could just move the offending request to hold state?
    pstamptool -dbname ippRequestServer -dbserver $PSDBSERVER -updatereq -req_id 182918 -set_state hold
    
  • Change the label priority until fixed/cleared? Remember to run labeltool pointing at the PSS DB; you likely also want to set the update label in gpc1 as well
    labeltool -dbname ippRequestServer -dbserver  $PSDBSERVER -updatelabel -label XXXX -set_priority nnn
    labeltool -dbname gpc1 -updatelabel -label ps_ud_XXXX -set_priority nnn
    

If just a few jobs are affected by the outage, then the rest of the request can

To make a request and its jobs just go away, set the request's state to goto_cleaned with the command

pstamptool -dbname ippRequestServer -dbserver $PSDBSERVER -updatereq  -set_state goto_cleaned -req_id 182918
  • Could the fault also be set to an error code if the cause is known, e.g. 25 for (old) data that isn't updating properly?

Postage Stamp Error Codes

NOTE: wiki pages are obsolete as soon as they are written.

See the file pstamp/src/pstamp.h (or the perl version) in a "current" IPP source tree for an absolutely up-to-date version of this list of the postage stamp error codes.
(A string version is also included in the results files (FITS and mdc) of a request.)

typedef enum {
        PSTAMP_SUCCESS          = 0,
        PSTAMP_FIRST_ERROR_CODE = 10,
        PSTAMP_SYSTEM_ERROR     = 10,
        PSTAMP_NOT_IMPLEMENTED  = 11,
        PSTAMP_UNKNOWN_ERROR    = 12,
        PSTAMP_DUP_REQUEST      = 20,
        PSTAMP_INVALID_REQUEST  = 21,
        PSTAMP_UNKNOWN_PROJECT  = 22,
        PSTAMP_NO_IMAGE_MATCH   = 23,
        PSTAMP_NOT_DESTREAKED   = 24,
        PSTAMP_NOT_AVAILABLE    = 25,
        PSTAMP_GONE             = 26,
        PSTAMP_NO_JOBS_QUEUED   = 27,
        PSTAMP_NO_OVERLAP       = 28,
        PSTAMP_NOT_AUTHORIZED   = 29,
        PSTAMP_NO_VALID_PIXELS  = 30,
        PSTAMP_BG_RESTORE_NOT_AVAILABLE = 31,
} pstampJobErrors;
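When inspecting fault values pulled from the database, a small lookup saves trips to the header. The sketch below mirrors the enum above in Python; the class and function names are hypothetical, and the values are copied from this page, so they may lag behind pstamp.h.

```python
from enum import IntEnum


class PstampJobError(IntEnum):
    """Values copied from the pstampJobErrors listing on this page."""
    SUCCESS = 0
    SYSTEM_ERROR = 10
    NOT_IMPLEMENTED = 11
    UNKNOWN_ERROR = 12
    DUP_REQUEST = 20
    INVALID_REQUEST = 21
    UNKNOWN_PROJECT = 22
    NO_IMAGE_MATCH = 23
    NOT_DESTREAKED = 24
    NOT_AVAILABLE = 25
    GONE = 26
    NO_JOBS_QUEUED = 27
    NO_OVERLAP = 28
    NOT_AUTHORIZED = 29
    NO_VALID_PIXELS = 30
    BG_RESTORE_NOT_AVAILABLE = 31


def fault_name(code):
    """Return the PSTAMP_* name for a fault code, or a placeholder
    for values that are not in the list (e.g. transient faults < 10)."""
    try:
        return "PSTAMP_" + PstampJobError(code).name
    except ValueError:
        return f"UNLISTED_FAULT_{code}"


print(fault_name(25))  # prints PSTAMP_NOT_AVAILABLE
print(fault_name(2))   # prints UNLISTED_FAULT_2
```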

Monitoring the Apache Server

It is a good idea to check the proxy's Apache error log (/var/log/apache2/error_log) from time to time to look for unusual activity.

We get hourly requests from Google and Baidu for files long gone that their crawlers found prior to the time that we locked down the data store.