What is pantasks?

pantasks is the IPP parallel process manager for distributed computing across multiple nodes. See also the ippTools FAQ for information on some of the commands that pantasks launches.

How do I start pantasks?

Start up pantasks from a terminal window:

 > pantasks
 
 Welcome to pantasks - parallel task scheduler
 
Load some pantasks commands:

 pantasks: module pantasks.pro

Or, if you have a modified pantasks.pro file:

 pantasks: input /home/username/pantasks/pantasks.pro

After loading the pantasks.pro file, you can add a database easily:

 pantasks: add.database mydatabase

If you don't add a database, pantasks will use the one declared in your .ipprc file.

How do I configure pantasks?

.pantasksrc

.ptolemyrc

Make sure you are a member of the nebulous users group (at MHPCC).

What are the primary pantasks commands?

Connect to a controller host

 pantasks: controller host add myhost

Check the controller host status (NOTE: This will sometimes return no info, even when there are active hosts):

 pantasks: controller status

Check the processing status:

 pantasks: status

For additional timing details:

 pantasks: status -taskstats
 
Start processing:

 pantasks: run

Stop adding new processes, but finish out the queue:

 pantasks: stop

Stop processes right now:

 pantasks: halt

Exit pantasks:

 pantasks: exit

How do I get more verbose output from pantasks?

 pantasks: $VERBOSE = 1

Raise the number above 1 for more and more verbosity.

Why does my process fail in pantasks but succeed on the command line?

"I copied the command directly from the pantasks error stream (or from the verbose command output) and pasted it into a separate terminal. It succeeds on the terminal, but it failed in pantasks."

You may have a config error in your home directory. When pantasks executes a command, it does so from the user's home directory on whichever remote host has been assigned to that process. If that home directory contains out-of-date config directories, they may be loaded before the system-level config directories.

Here is an example. In your .ipprc file, you may have defined the path to your configuration directory with something like this:

 PATH            STR     /path/to/my/system/level/ippconfig

In your system.config file, you can then define the location of your GPC1 camera.config file with these lines:

 CAMERAS		METADATA
 	GPC1			STR	gpc1/camera.config
 END

When you run any script that needs to reference the GPC1 camera, it will look:

  • first in the current directory for ./gpc1/camera.config
  • next in the directory defined by your .ipprc PATH variable (in this case /path/to/my/system/level/ippconfig/gpc1/camera.config)

Thus, if you have an old gpc1/camera.config lying around in your home directory, pantasks will find it first and hit a config error that does not appear when you run the same command from the command line in some other directory.

Solution: move any old config files out of your home directory, or update them to remove the config issue.
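
To see which file would win, the lookup order can be reproduced by hand. The following is a minimal sketch, assuming only the two-step search described in this answer (run directory first, then the directory from the .ipprc PATH entry). The function name find_config and its arguments are hypothetical and not part of the IPP.

```shell
#!/bin/sh
# find_config REL_PATH RUN_DIR CONFIG_DIR
# Hypothetical helper: prints the first existing candidate file, checking
# RUN_DIR before CONFIG_DIR (the search order described above).
# Returns non-zero if neither directory contains REL_PATH.
find_config() {
    rel=$1
    run_dir=$2
    config_dir=$3
    for dir in "$run_dir" "$config_dir"; do
        if [ -f "$dir/$rel" ]; then
            printf '%s\n' "$dir/$rel"
            return 0
        fi
    done
    return 1
}

# Example call (paths are the placeholders used in this answer):
# find_config gpc1/camera.config "$HOME" /path/to/my/system/level/ippconfig
```

If the printed path lies under your home directory, that copy is the one a pantasks job would pick up.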

Why are my nodes "down" or "resp"?

  • First, check that you can ssh from the machine on which you are running pantasks to the node without being prompted for a password and without errors reading your shell startup file (.bashrc, .cshrc, .profile, .login and the like):
    • Try "ssh myhost". If you're prompted for a password, then you need to set up ssh keys and/or check your ssh configuration. Setting up the keys for passwordless login works like this:
      # Generate an ssh key pair (private key, identifies you; public key, to be shared with others) 
      # with an empty passphrase (to enable passwordless login)
      ssh-keygen
      # Now press <enter> twice (empty passphrase & to confirm)
      
      # Add public key to "authorized keys" listing public keys of people
      # who can connect if they possess the corresponding private key
      cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      # You should now be able to 
      ssh localhost
      # without having to enter a password.
      
      # Copy the ~/.ssh/authorized_keys to any other remote machine that you
      # want to use in pantasks (or log in to without having to type your password
      # in general).
      
      # WARNING: anyone who has access to your ~/.ssh/id_rsa file now has access to the 
      # machines that have your id_rsa.pub in authorized_keys. I.e. treat ~/.ssh/id_rsa 
      # carefully and don't let anybody copy it.
      
      # To test that this works for pantasks, say
      ssh localhost pclient
      # If you don't get a pclient shell prompt, something is wrong.
      # If you do, just type 'exit' to get back to where you were
      
  • Second, check that you can start up 'pclient' over an ssh connection. For that to work, you need to run 'psconfig ipp-2.6.1' (or whatever the IPP version is called on your system) in your startup file.
    • Try "ssh myhost pclient". If it works, you can exit pclient with "exit" or "quit".
    • If it doesn't work, you need to check your shell configuration (.cshrc or .bashrc) to ensure the IPP environment is being loaded via psconfig. In .cshrc, just make sure that 'psconfig ipp-2.6.1' is executed. In .bashrc, there's a catch: bash does not expand aliases in non-interactive shells (such as the one started by "ssh myhost pclient"), so calling the psconfig alias from .bashrc does not work there. You need to write out the 'source' command by hand in .bashrc, thus (this fix will be included in the INSTALL instructions in releases after 2.6.1):
       if [ -f /IPP/psconfig.bash ]; then
          alias psconfig='source /IPP/psconfig.bash'
          source /IPP/psconfig.bash ipp-2.6.1
       else
          alias psconfig='echo psconfig not available'
       fi
      
  • If there is a long delay between executing the "ssh" command and the shell appearing, there may be a timeout problem.
    • Make sure your shell configuration is not too complex.
    • One bash user had the IPP environment set up in both .bashrc and .cshrc, so psconfig was called once when bash started and then again when psconfig itself (which is a csh script) started. This produced a long delay, causing the command to time out in pantasks.
  • Finally, this can also be triggered by the readline bug. The fix is described below under "detrend_resid_imfile.pl triggers the error message 'Unknown option: --erbose'".
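
The first two checks in the list above can be run against several hosts in one go. This is a rough sketch, not an IPP tool: the function name check_host is hypothetical, BatchMode and ConnectTimeout are standard OpenSSH client options, and the pclient check assumes that 'which' is available on the remote host. It only tests whether pclient is on the remote PATH; it does not start pclient.

```shell
#!/bin/sh
# The ssh client to use; can be overridden, e.g. SSH=/usr/bin/ssh.
SSH=${SSH:-ssh}

# check_host HOST
# Hypothetical helper: reports whether HOST accepts a passwordless login
# and whether pclient is found in the non-interactive remote environment.
check_host() {
    host=$1
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ! $SSH -o BatchMode=yes -o ConnectTimeout=10 "$host" true \
            </dev/null >/dev/null 2>&1; then
        echo "$host: passwordless ssh failed"
        return 1
    fi
    # Assumption: 'which' exists remotely and fails if pclient is not on PATH.
    if ! $SSH -o BatchMode=yes -o ConnectTimeout=10 "$host" 'which pclient' \
            </dev/null >/dev/null 2>&1; then
        echo "$host: pclient not found in remote environment"
        return 1
    fi
    echo "$host: ok"
}
```

A loop such as 'for h in myhost otherhost; do check_host "$h"; done' (host names are placeholders) then prints one line per host. A host that reports ok here can still show up as "down" or "resp" if the login is slow enough to time out, so the timeout advice above still applies.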

On startup I get "can't find config file. some functions will be unavailable." Which config file is missing?

You're missing the .ptolemyrc file. Copy dvo.site from your site-level config directory into your home dir and rename it as .ptolemyrc.
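
As a sketch of that copy step, the helper below installs dvo.site as .ptolemyrc without overwriting an existing file. The function name install_ptolemyrc is hypothetical, and the site-level config directory is an argument because its location differs between installations.

```shell
#!/bin/sh
# install_ptolemyrc SITE_CONFIG_DIR [TARGET_DIR]
# Hypothetical helper: copies SITE_CONFIG_DIR/dvo.site to TARGET_DIR/.ptolemyrc
# (TARGET_DIR defaults to $HOME). Refuses to overwrite an existing file.
install_ptolemyrc() {
    src=$1/dvo.site
    dest=${2:-$HOME}/.ptolemyrc
    if [ ! -f "$src" ]; then
        echo "no dvo.site found in $1" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "$dest already exists; leaving it alone" >&2
        return 1
    fi
    cp "$src" "$dest"
}
```

Call it with the path to your own site-level config directory as the first argument.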

How do I re-run files that have failed at some stage after fixing the bug that caused the failure?

Most of the ippTools binaries (which provide the database interaction) have some version of a -revert command. For example, if a camera stage failed, try camtool -revertprocessedexp -cam_id 12345. You can also revert based on an error code using the -code argument.

Process X succeeded without fault, but it is not moving on to process Y. How do I force it to proceed?

First verify that none of the sub-stages failed. For example, if a chipRun has state 'new' and you think it should be 'full' and moving on to camRun, then check that all of the contributing chipProcessedImfile rows in your database are completed with fault=0. Then check that the next process in line was not already initiated. For example, do you have a camRun with the chip_id that you expect?

If the process really just stopped without raising a fault and without initiating the next stage, then you can try manually setting its state to 'full' and using the appropriate ippTool to initiate the next process in the sequence. For a stalled chipRun with chip_id 2591, this would be done like this:

 chiptool -dbname myDatabase -updaterun -label 'myLabel' -chip_id 2591 -state full

 camtool -dbname myDatabase -definebyquery -chip_id 2591 -set_label 'myLabel'
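
Because these commands change database state by hand, it can help to print them for review before running anything. The sketch below is a hypothetical dry-run helper (print_advance_cmds is not an ippTools command). It only fills in the database name, label, and chip_id in the two command lines from this answer and prints them.

```shell
#!/bin/sh
# print_advance_cmds DBNAME LABEL CHIP_ID
# Hypothetical dry-run helper: prints the chiptool and camtool command lines
# from the answer above instead of executing them.
print_advance_cmds() {
    db=$1
    label=$2
    chip_id=$3
    printf "chiptool -dbname %s -updaterun -label '%s' -chip_id %s -state full\n" \
        "$db" "$label" "$chip_id"
    printf "camtool -dbname %s -definebyquery -chip_id %s -set_label '%s'\n" \
        "$db" "$chip_id" "$label"
}
```

For example, 'print_advance_cmds myDatabase myLabel 2591' prints the two lines for the stalled chipRun discussed here, which you can then paste into a terminal once they look right.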

detrend_resid_imfile.pl triggers the error message 'Unknown option: --erbose'

The dropped character in a long line is a classic symptom of a bad 'readline' library. To fix it, do:

% pschecklibs -build -force libreadline

There should be no need to rebuild the IPP, since it uses dynamic libraries.

I've got a job sitting in the queue, and a bunch of idle hosts... how do I get the job onto a host?