wiki:Processing

Version 151 (modified by Serge CHASTEL, 15 years ago)

Introduction
About Czar Pages: What to do if the czar pages don't update
Getting started and checking processing status
Using pantasks
Morning duties: checking summitcopy and burntool
Stopping and starting the pantasks servers
1. Stopping
  1. To stop a single pantasks server (scheduler) instance
  2. To shut down all pantasks_server instances
2. Starting
Queuing/Dequeuing data
1. Adding data to the queue
2. Removing data from the queue
Running the microtest scripts
Common problems and their solutions
Changing the code
Who to contact
Czar Logs

Introduction

This page outlines the procedures and responsibilities for the person currently acting as 'IPP Processing Czar'. In a nutshell, these include:

monitoring the various pantasks servers running on the production cluster using Czartool and pantasks_client
keeping a close eye on the 'stdscience' pantasks server in particular, using Czartool
keeping an eye on production cluster load using Ganglia
keeping an eye on available disk space using Czartool (or the neb-df command on any production machine)
alerting the IPP group to any notable errors or failures (see here for details)

NB You will need ipp user access on the production cluster. For convenience, ask someone who already has access (anyone on the IPP team) to add your ssh public key to ~ipp/.ssh/authorized_keys.

About Czar Pages: What to do if the czar pages don't update

IPP Czar pages are updated every five minutes or so by the czarpoll.pl script. This script runs on ipp004 as the ipp user within a screen session. If czarpoll has crashed (which most likely means that the gpc1 MySQL server has been restarted), then:

1) ssh as ipp on ipp004
2) identify the screen session which is running and reattach it
3) restart czarpoll and then
4) detach from the screen session

The sequence of commands should therefore be something like:

1) ssh as ipp on ipp004

yourhost:~$ ssh ipp@ipp004

2) Identify the screen session which is running and reattach it

ipp@ipp004:/home/panstarrs/ipp>screen -list
There is a screen on:
	18965.CzarPoll	(Attached)
1 Socket in /var/run/screen/S-ipp.

ipp@ipp004:/home/panstarrs/ipp>screen -r 18965.CzarPoll

You will then see the last lines of the display, e.g.:
        Total time                              : 0:55.57s
        CPU utilisation (percentage)            : 57.7%

3) Restart czarpoll from /home/panstarrs/rhenders/trunk/tools directory

ipp@ipp004:/home/panstarrs/rhenders/trunk/tools>./czarpoll.pl

You will see something like:

printf (...) interpreted as function at czartool/DayMetrics.pm line 80.
* Checking nightly science status
* Checking Nebulous
* Checking all pantasks servers
* Updating dates
[...]

4) Detach from the screen session by typing CTRL-A d

CTRL-A d

The terminal window is cleared and you should see something like:

[detached]
ipp@ipp004:/home/panstarrs/ipp>

You can now safely log out or do other work.
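
Step 2 can also be scripted. The following is a minimal Python sketch, not part of the IPP tools: it extracts the full session identifier from pasted `screen -list` output, assuming each session line starts with `<pid>.<name>` as in the listing above. The function name `find_screen_session` is hypothetical.

```python
import re
from typing import Optional


def find_screen_session(listing: str, name: str) -> Optional[str]:
    """Return '<pid>.<name>' for the first session called `name`, else None."""
    for line in listing.splitlines():
        # Session lines are assumed to look like "\t18965.CzarPoll\t(Attached)".
        match = re.match(r"\s*(\d+)\.(\S+)", line)
        if match and match.group(2) == name:
            return f"{match.group(1)}.{match.group(2)}"
    return None


# Sample copied from the listing shown earlier on this page.
listing = """There is a screen on:
\t18965.CzarPoll\t(Attached)
1 Socket in /var/run/screen/S-ipp.
"""
print(find_screen_session(listing, "CzarPoll"))
```

The returned identifier is what would be passed to `screen -r`.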

Getting started and checking processing status

Czartool makes it relatively easy to check the overall status of the processing pipeline. You can check the status of the various pantasks_servers, how much data was taken at the summit and has been copied to the cluster, and the status of various processes within stdscience, chip, camera, warp, diff etc.

Using `pantasks`

There are numerous pantasks servers. Their status can be checked with Czartool, but it is often necessary to use a client directly. To do this, first move to the directory corresponding to the server of interest; these directories are all under ~ipp on any cluster machine. For example, go to ~ipp/stdscience/, then run

pantasks_client

To check the current labels being processed:

pantasks: show.labels

Within pantasks, to check processing status, do

pantasks: status

This will return something like

 Task Status
  AV Name                     Nrun   Njobs   Ngood Nfail Ntime Command               
  +- extra.labels.on             0       3       3     0     0 echo                  
  +- extra.labels.off            0       3       3     0     0 echo                  
  +- ns.initday.load             0       3       3     0     0 echo                  
  ++ ns.registration.load        0    1331    1331     0     0 automate_stacks.pl    
  ++ ns.chips.load               0      66      66     0     0 automate_stacks.pl    
  ++ ns.chips.run                0       4       4     0     0 automate_stacks.pl    
  ++ ns.stacks.load              0    5825    5825     0     0 automate_stacks.pl    
  ++ ns.stacks.run               0       6       6     0     0 automate_stacks.pl    
  ++ ns.burntool.load            0       8       8     0     0 automate_stacks.pl    
  ++ ns.burntool.run             0     360     360     0     0 ipp_apply_burntool.pl 
  ++ chip.imfile.load            1   48039   48038     0     0 chiptool              
  ++ chip.imfile.run             0   23524   17755  5769     0 chip_imfile.pl        
  ++ chip.advanceexp             0    7514    7514     0     0 chiptool    
  etc...

The first column, 'AV', translates to Active and Valid, i.e. whether a process is running and whether it is valid at this point in time. For example, above, +- ns.initday.load is active, but is not valid at present since it is scheduled to run only once per day (to initialize the nightlyscience automation).

The key thing to monitor here is the Nfail column. Depending on the process, different numbers of Nfail as a proportion of Njobs are deemed acceptable.
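
To make the "Nfail as a proportion of Njobs" check concrete, here is a small Python sketch that parses a pasted `status` listing and computes the failure fraction per task. It assumes the eight-column layout shown above; the names `TaskStatus`, `parse_status` and `failure_fraction` are hypothetical, and no acceptable threshold is implied, since that depends on the process.

```python
from typing import List, NamedTuple


class TaskStatus(NamedTuple):
    flags: str    # the two-character 'AV' column, e.g. "++" or "+-"
    name: str
    nrun: int
    njobs: int
    ngood: int
    nfail: int
    ntime: int
    command: str


def parse_status(text: str) -> List[TaskStatus]:
    """Parse the data rows of a pantasks 'status' listing, skipping other lines."""
    tasks = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 8 fields and begin with two characters drawn from '+' and '-'.
        if len(fields) != 8 or len(fields[0]) != 2 or set(fields[0]) - set("+-"):
            continue
        flags, name, *counts, command = fields
        try:
            nrun, njobs, ngood, nfail, ntime = (int(c) for c in counts)
        except ValueError:
            continue
        tasks.append(TaskStatus(flags, name, nrun, njobs, ngood, nfail, ntime, command))
    return tasks


def failure_fraction(task: TaskStatus) -> float:
    """Nfail / Njobs, or 0.0 when no jobs have been run."""
    return task.nfail / task.njobs if task.njobs else 0.0


sample = """
  AV Name                     Nrun   Njobs   Ngood Nfail Ntime Command
  ++ chip.imfile.load            1   48039   48038     0     0 chiptool
  ++ chip.imfile.run             0   23524   17755  5769     0 chip_imfile.pl
"""
for task in parse_status(sample):
    print(f"{task.name}: {failure_fraction(task):.1%} failed")
```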

Morning duties: checking `summitcopy` and `burntool`

There is nothing to be processed if data has not been copied from the telescope. This is the job of summitcopy, which runs slowly through the night, then speeds up each day once observations are complete. You can check that it has successfully copied files using Czartool.

After summitcopy comes burntool. If burntool is running then czartool will report it in the nightly science status ('BURNING'). To check this manually, run the following in the stdscience pantasks_client

ns.show.dates

You should see a 'book' entry with today's date, like

<today's date e.g. 2010-08-05> BURNING

If not, something is wrong.

The different step values are listed on this Wiki page.

Stopping and starting the `pantasks` servers

It is occasionally necessary to stop and restart the pantasks_server instances, for example when the code must be updated and rebuilt, or if pantasks itself becomes unresponsive or shows negative values in some columns of the status display (above).

Stopping

To stop a single pantasks server (scheduler) instance

Get the name of the host which is managing the pantasks server by looking at the ptolemy.rc file in the server directory (e.g. for the 'distribution' task):
```
whoami@ipp004:/home/panstarrs/ipp> cat distribution/ptolemy.rc | grep PANTASKS_SERVER
PANTASKS_SERVER         ippc15
```
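
The same lookup can be done programmatically. The Python sketch below assumes that ptolemy.rc holds whitespace-separated `KEY value` lines, as the grep output above suggests; the function name `pantasks_server_host` is hypothetical.

```python
from typing import Optional


def pantasks_server_host(rc_text: str) -> Optional[str]:
    """Return the value of the PANTASKS_SERVER entry in ptolemy.rc text, if any."""
    for line in rc_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "PANTASKS_SERVER":
            return fields[1]
    return None


# Sample line copied from the grep output shown above.
print(pantasks_server_host("PANTASKS_SERVER         ippc15\n"))
```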

Log into the host as the ipp user
```
ssh ipp@ippc15
```

Start the pantasks_client in the pantasks server directory
```
pantasks_client
```

Shut down the server:
```
pantasks: shutdown now
```

To shut down all `pantasks_server` instances

check_system.sh stop

Wait 'n' minutes for all Nrun values to be zero, then

check_system.sh shutdown

Starting

Each pantasks_server uses a local input and ptolemy.rc file (this file details the machine where the server is to run).

NB for the special case of the addstar server, see this page.

To start all pantasks servers, follow the steps under 'Starting all servers' below.

Starting a single server

To start a single server you need to ssh to the relevant machine (found in the ptolemy.rc file for that server) then do the following:

ssh ipp@ippXXX
cd <serverName>
pantasks_server &
pantasks_client
pantasks: server input input
pantasks: setup
pantasks: run

So, for example for stdscience

ssh ipp@ippc16
cd stdscience
pantasks_server &
pantasks_client
pantasks: server input input
pantasks: setup
pantasks: run

Starting an already running server

For already-running servers, pantasks should be started with the following commands only:

pantasks: server input input
pantasks: setup
pantasks: run

This loads the hosts and labels needed and starts the processing running. See ~ipp/stdscience/input if this is not clear.

Starting all servers

If everything has been shut down, you can start all pantasks with the following in ~ipp:

check_system start.server
check_system start
check_system run

The first command launches the pantasks_servers on the correct host; the second calls the three commands listed above (server input input; setup; run).

Queuing/Dequeuing data

Adding data to the queue

Before pantasks can be used to manage processing of a particular label, chiptool must first be run to queue data and create that label. The custom here is to write a small script that runs chiptool with the necessary arguments. This script is then left in the stdscience sub-directory with the same name as the survey in question (M31, MD04 etc), so that there is a record of what has been queued. An example script would be

#!/bin/csh -f

set label = "M31.Run5.20100408"

set options = ""
set options = "$options -dbname gpc1"
set options = "$options -definebyquery"
set options = "$options -set_end_stage warp"
set options = "$options -set_tess_id M31.V0"
set options = "$options -set_data_group M31.Run5b"
set options = "$options -set_dist_group M31"
set options = "$options -comment M31%"
set options = "$options -dateobs_begin 2009-12-09T00:00:00"
set options = "$options -dateobs_end 2010-03-01T00:00:00"
# set options = "$options -simple -pretend"

chiptool $options -set_label $label -set_workdir neb://@HOST@.0/gpc1/$label
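
For comparison, the same argument list can be assembled in Python. This is only an illustrative sketch of how the csh script above builds its command line: it uses exactly the chiptool options that appear in that script, the function name `chiptool_queue_args` is hypothetical, and it prints the command rather than running it.

```python
import shlex
from typing import List


def chiptool_queue_args(label: str, tess_id: str, data_group: str,
                        dist_group: str, comment: str,
                        dateobs_begin: str, dateobs_end: str) -> List[str]:
    """Build the chiptool command line used in the example script above."""
    return [
        "chiptool",
        "-dbname", "gpc1",
        "-definebyquery",
        "-set_end_stage", "warp",
        "-set_tess_id", tess_id,
        "-set_data_group", data_group,
        "-set_dist_group", dist_group,
        "-comment", comment,
        "-dateobs_begin", dateobs_begin,
        "-dateobs_end", dateobs_end,
        "-set_label", label,
        # @HOST@ is left literal, exactly as in the csh script.
        "-set_workdir", f"neb://@HOST@.0/gpc1/{label}",
    ]


args = chiptool_queue_args(
    label="M31.Run5.20100408",
    tess_id="M31.V0",
    data_group="M31.Run5b",
    dist_group="M31",
    comment="M31%",
    dateobs_begin="2009-12-09T00:00:00",
    dateobs_end="2010-03-01T00:00:00",
)
print(shlex.join(args))  # inspect the command before running it by hand
```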

Now the label must be added within pantasks

pantasks: add.label M31.Run5.20100408

Note: the add.label command does not propagate along the IPP chain. After adding a label to stdscience, it may also be necessary to add it to the distribution server.

According to how things were set up, the system may be told to look for today's date. The command to add all data of a particular day (e.g. 2010-08-06) to the queue is:

ns.add.date 2010-08-06

Removing data from the queue

If a mistake has been made and a label needs to be removed from processing, then

pantasks: del.label M31.nightlyscience

chiptool must then be used to drop the label for data with a state of 'new'.

chiptool -updaterun -set_state drop -label bad_data_label -state new -dbname gpc1

If some of the data has already been processed (i.e. state!=new), then cleanup must be employed. TODO more here

Running the microtest scripts

The microtest data should be correctly automated, but still requires a script to be manually run. The basic pantasks tasks to reduce the microtest data are included in the stdscience/input file, in the add.microtest macro:

macro add.microtest
  add.label microtestMD07.nightlyscience
  add.label microtestMD07.noPattern.nightlyscience

  survey.add.WSdiff microtestMD07.nightlyscience MD07.refstack.20100330 microtestMD07 neb://@HOST@.0/gpc1
  survey.add.WSdiff microtestMD07.noPattern.nightlyscience MD07.refstack.20100330 microtestMD07.noPattern neb://@HOST@.0/gpc1
  
  survey.add.magic microtestMD07.nightlyscience /data/ipp050.0/gpc1_destreak
  survey.add.magic microtestMD07.noPattern.nightlyscience /data/ipp050.0/gpc1_destreak
end

Once the two labels have made it through magic, the microtest.pl script can be run. You'll need to have ppCoord built and in your path. It isn't built by psbuild; go into the ppViz directory and run psautogen --enable-optimize && make && make install. The script relies on VerifyStreaks having been run on the data as part of Magic (and being in the proper place). Note that if the VerifyStreaks binary could not be found in the course of the Magic processing, this step will have been skipped. The script is then run as:

microtest.pl --dbhost ippdb01 --dbuser ipp --dbpass XXX --dbname gpc1 --label microtestMD07.nightlyscience --data_group microtestMD07.20XXYYZZ --verbose

Common problems and their solutions

Diff failures

A detailed guide to failures at the diff stage can be found here:

http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/diff_fixits

`burntool` doesn't start…

burntool requires that all images from the summit for a given night are registered at MHPCC before it can begin processing. Occasionally an image gets 'stuck', preventing processing from beginning. This is sometimes due to a corrupt file, or just a failure to copy it to MHPCC. So, first check with the camera group that the image is ok.

If image is ok

Assuming that the image is actually good then

stop summit copy pantasks
revert the fault
run summit_copy.pl leaving off the --md5 argument.
set summit copy pantasks to run

See the relevant procedure.

This will call dsget without the md5 sum check and update the database.

If image is not ok

We need to tell the system to forget about this image. TODO (This is the summary of what was tried on Wed. 2010-09-08 for o5446g0443o)

update summitExp set exp_type = 'broken', imfiles=0, fault =0 where exp_name = 'o5447g0519o';

-> if it has no effect

ns.del.date, ns.add.date

-> if it has no effect, check (and possibly change) obs_mode from 3PI to ENGINEERING:

UPDATE rawExp SET obs_mode = 'ENGINEERING' WHERE exp_id = 221762;

Czartool reports negative burntool stats

The burntool stats printed are equivalent to N_exposures_queued - N_exposures_burntooled. A negative value means that the target has been doubly queued.
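
The arithmetic can be written out as a short Python sketch. The function names are hypothetical, and the interpretation of a negative value is taken directly from the statement above rather than from the czartool source.

```python
def burntool_stat(n_exposures_queued: int, n_exposures_burntooled: int) -> int:
    """The value czartool reports: queued minus burntooled."""
    return n_exposures_queued - n_exposures_burntooled


def interpret_burntool_stat(stat: int) -> str:
    """Translate the reported value into a short message for the czar."""
    if stat < 0:
        # Per this page, a negative value indicates a doubly queued target.
        return "negative: target has probably been doubly queued"
    if stat == 0:
        return "all queued exposures have been burntooled"
    return f"{stat} queued exposure(s) not yet burntooled"


print(interpret_burntool_stat(burntool_stat(360, 352)))
```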

Finding log files

On the main Czartool display, if there are any faults, they are shown in parentheses. These in fact form links that will take you to the relevant ippMonitor page for the processing stage and label in question. Here a table will list details of the offending exposures, one column of which is 'state' (which should be 'new'). Linking from these will display the log for that particular exposure (or chip) from which the error may be diagnosed.

Reverting

When exposures fail at a certain stage (chip, cam, warp etc) they are given a 'fault' code:

Code	Description
1	Error of unknown nature
2	Error with a system call (often an NFS error)
3	Error with configuration
4	Error in programming (look also for aborts)
5	Error with data
6	Error due to timeout
>6	Reserved for magic

It is sometimes possible to 'revert' certain failed exposures. Reverting simply means attempting to process an exposure a second time in case the cause of the fault was temporary, for example an NFS error. Faults like these are usually given fault code '2'. Turning reverts on via the czartool page will attempt to revert all those exposures that failed with code '2'. Behind the scenes, czartool is using pantasks_client to perform the reverts, as described in the next section.
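
The fault-code table and the revert rule above can be captured in a small lookup. This Python sketch is only an illustration: the names `FAULT_CODES`, `describe_fault` and `reverted_by_czartool` are hypothetical, and codes not listed in the table are reported as such rather than guessed.

```python
FAULT_CODES = {
    1: "Error of unknown nature",
    2: "Error with a system call (often an NFS error)",
    3: "Error with configuration",
    4: "Error in programming (look also for aborts)",
    5: "Error with data",
    6: "Error due to timeout",
}


def describe_fault(code: int) -> str:
    """Return the description from the table above for a fault code."""
    if code > 6:
        return "Reserved for magic"
    return FAULT_CODES.get(code, "Not listed in the fault table")


def reverted_by_czartool(code: int) -> bool:
    """Only code 2 is retried when reverts are turned on via czartool."""
    return code == 2


for code in (2, 4, 7):
    print(code, describe_fault(code), reverted_by_czartool(code))
```

Faults with other codes need the stage tool run directly, as described further down.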

Reverting from `pantasks_client`

To manually revert failures with fault code 2, do something like the following in pantasks_client

pantasks: warp.revert.on

And off again with

pantasks: warp.revert.off

The process is similar for chip, camera etc. A special case, however, is destreaks which need to be reverted as follows.

From the distribution pantasks_client

destreak.off
destreak.revert.on

Then, once there is nothing left to do

destreak.revert.off
destreak.on

Reverting faults with codes other than 2

By running the stage tool program directly it may be possible to revert failures with codes other than 2. For example, for the chip stage:

chiptool -revertprocessedimfile -label M31.nightlyscience -fault 4 -dbname gpc1

Similar arguments can be used with warptool, camtool etc.

TODO: The page linked by processing failures in the czartool page should show the command as the ipp user that should be run to fix the problem.

If component fails again after reverting

If a component fails repeatedly then something is likely wrong with one of its inputs or perhaps there is a bug in the code. NEITHER of these situations should be ignored.

The log file can provide clues as to the cause of the problem. This page gives an example of how to fix certain failures.

Re-adding a nightly-science date to pantasks

Sometimes, if the stdscience pantasks server has been restarted before all nightlyscience processing has been completed, it may be necessary to re-add the date once the server is back up-and-running. For example, for the date below, stacks were not created or distributed because stdscience had been stopped before all the warps were completed. So, to re-add the date from the warp stage:

pantasks: ns.add.date 2010-09-11
pantasks: ns.set.date 2010-09-11 TOWARP

Removing a host

Troublesome Hosts

Sometimes a particular machine will act unpredictably and should be taken out of processing. To do this, go to each pantasks server in turn and remove the host, ipp016 in the example below

pantasks: controller host off ipp016

We also need to set the same host to a state of 'repair' in nebulous:

neb-host ipp016 repair

This leaves the machine accessible, but no new data can be allocated to it. See table below for a guide to the other nebulous states

state	allocate?	available?
up	yes	yes
down	no	no
repair	no	yes

Running neb-host with no arguments gives you a summary of the above for all hosts.
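
As a quick reference, the state table can be expressed as a lookup. This is a Python sketch with hypothetical names (`NEB_STATES`, `can_allocate`, `is_available`); it encodes only the three states listed above.

```python
from typing import Dict, NamedTuple


class NebState(NamedTuple):
    allocate: bool   # can new data be allocated to the host?
    available: bool  # is existing data on the host accessible?


NEB_STATES: Dict[str, NebState] = {
    "up": NebState(allocate=True, available=True),
    "down": NebState(allocate=False, available=False),
    "repair": NebState(allocate=False, available=True),
}


def can_allocate(state: str) -> bool:
    return NEB_STATES[state].allocate


def is_available(state: str) -> bool:
    return NEB_STATES[state].available


# 'repair' keeps the machine readable but stops new allocations.
print(can_allocate("repair"), is_available("repair"))
```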

Non-Troublesome Hosts

The same commands can be used for non-troublesome hosts.

The controller machines command shows the list of hosts and 3 values: the first value is the number of connections from the pantasks server to the host. The controller host off <hostname> command has to be repeated as many times as the number of connections.
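
Because the command has to be repeated once per connection, it can help to generate the lines up front. The Python sketch below only builds the command text to type into pantasks_client; the function name `host_off_commands` is hypothetical, and the connection count must be read by hand from the `controller machines` output.

```python
from typing import List


def host_off_commands(hostname: str, n_connections: int) -> List[str]:
    """One 'controller host off' line per connection to the host."""
    if n_connections < 0:
        raise ValueError("n_connections must be non-negative")
    return [f"controller host off {hostname}"] * n_connections


# Example: ipp016 with three connections from this pantasks server.
for command in host_off_commands("ipp016", 3):
    print(command)
```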

It may also happen that a working host has to be removed (if it was temporarily added to better share the load because of some machine failure for instance). The controller status command details the activity for each connection. Hosts should be removed only if they have the RESP(onding?) or IDLE status (so wait for the running tasks to complete).

Note: The controller host check <hostname> command only shows ONE connection status (TODO SC: I can't tell which one).

(SC TODO) controller status shows something that looks like an address (e.g., 0.0.0.7d), which is different for each connection. It seems that it is not possible to remove a particular connection. Am I right?

Corrupted data before the current stage

The ~ipp/src/ipp-<version>/tools directory contains a collection of tools that can be used to fix unusual problems. For instance, I had a repeating entry reporting that warp failed because of a corrupted cam-generated file. In my case (other tools can also do the job):

perl runcameraexp.pl --help

and finally:

perl runcameraexp.pl --cam_id 140104

Pausing a task(?)

I'm not sure that "task" is the right word.

chip.off / warp.off (and to restart chip.on / warp.on)

Changing the code

This might mean rebuilding the current 'tag' (reflected in the directory name) or actually installing a new tag.

Rebuilding the current tag

We will use the example of tag 20100701, which is stored under

~ipp/src/ipp-20100701

To update the code and rebuild, shutdown all pantasks (as shown above) then do the following.

cd ~ipp/src/ipp-20100701
svn update
psbuild -dev -optimize

Now restart all pantasks (as above).

Installing a new tag

shutdown all pantasks (as shown above)
change ~ipp/.tcshrc to point at the new tag (it is good to confirm by logging out and in again)
fix the files which are still installation specific:
- edit ~ipp/.ptolemyrc and change CONFDIR to point at the new location
- copy nebulous.site.pro to the working location (for now, just use the last installation version) eg

cp psconfig/ipp-20100623.lin64/share/pantasks/modules/nebulous.site.pro psconfig/ipp-20100701.lin64/share/pantasks/modules/nebulous.site.pro

restart all pantasks (as above)

Changes to gpc1 database schema

(From trunk/dbconfig/notes.txt)

When changing the database schema:

increment the pkg_version number on dbconfig/config.md
increment the ippdb version number in ippTools/configure.ac (to match)
increment the ippTools version number in ippTools/configure.ac
build ippdb ('make src' in dbconfig)
check in dbconfig, ippdb, and ippTools

Who to contact

Any problems or concerns should be reported to the ipp development mailing list:

ps-ipp-dev@ifa.hawaii.edu

Different members of the IPP team are responsible for different parts of the code, and the relevant person will hopefully address the issue.

Czar Logs

The following links show pages of czar activities.

2010-09-21: ipp020 failed

Replication Log: Replication Issues wiki page

Czar Logs wiki pages
