PS1 IPP Czar Logs for the week 2011.01.10 - 2011.01.16

(Up to PS1 IPP Czar Logs)

Monday : 2011.01.10

Bill is the acting czar for the next two days

  • 08:30 All of last night's science data is through warp. Ran --queue_stacks --date 2011-01-10 --dbname gpc1
  • 14:40 added label DS.RECOVER.TEST to stdscience. 1 exposure to be processed to test destreaking with saving recovery pixels.

Tuesday : 2011.01.11

  • 09:00 sts dist run 377773 was stuck because of a corrupt file: the variance image from warp_id 144635, skycell.379. Due to the complexity of the destreaking process it isn't easy to regenerate this file. To allow the distRun to proceed while skipping the broken skycell, I performed the following steps.
1. In distribution pantasks: disttool -revertcomponent -dist_id 37773
2. In mysql:
       update warpSkyfile set quality = 42 where warp_id = 144635 and skycell_id = 'skycell.379';
3. In distribution pantasks: dist.on
4. Wait a few minutes for the distRun to complete.
5. In mysql:
       update warpSkyfile set quality = 0 where warp_id = 144635 and skycell_id = 'skycell.379';

Since the quality of the skycell is non-zero, distribution doesn't require the image to be readable. This allows the bundles for the other 87 skycells to get out the door.

Since we put the quality back to zero afterwards, the data will become available again through the update process if the warpRun is cleaned up.
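The workaround above leans on a simple gate: distribution only insists on reading a skycell's image when its quality flag is zero. A minimal sketch of that idea (hypothetical names and schema, not the actual IPP code):

```python
# Hypothetical sketch of the quality gate described above: distribution
# must be able to read an image only when its quality flag is zero.
GOOD_QUALITY = 0

def files_required_for_bundle(skyfiles):
    """Return the image paths distribution must be able to read.

    skyfiles: list of (path, quality) pairs; a non-zero quality marks the
    skycell as bad, so its possibly-corrupt image is skipped.
    """
    return [path for path, quality in skyfiles if quality == GOOD_QUALITY]
```

Setting quality = 42 for skycell.379 drops it from the required list so the bundle can go out with the remaining skycells; resetting quality = 0 afterwards makes the file eligible again.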
  • We have two nightly stacks that are stuck because two skycells fail with programming errors.
stack_id 214086 fails with the known problem

    Data error code: 32cf
-> pmPSFEnvelope (pmPSFEnvelope.c:327): unknown psLib error
     No fake sources are suitable for PSF fitting.
 -> ppStackPSF (ppStackPSF.c:54): Problem determining PSF
     Unable to determine output PSF.
 -> ppStackPrepare (ppStackPrepare.c:274): Problem determining PSF
     Unable to determine output PSF.
 -> ppStackLoop (ppStackLoop.c:72): Problem determining PSF
     Unable to prepare for stacking.

stack_id 215251 fails with
    Filtered out 81 of 964 sources
    fix this code: z1 should not be nan for 0.016674
    Backtrace depth: 7
    Backtrace 0: p_psAssert
    Backtrace 1: pmModelRadius_PS1_V1
    Backtrace 2: (unknown)
    Backtrace 3: (unknown)
    Backtrace 4: psThreadLauncher
    Backtrace 5: (unknown)
    Backtrace 6: clone
Unable to perform ppStack: 4 at /home/panstarrs/ipp/psconfig//ipp-20101215.lin64/bin/ line 354

  • Except for these two stacks, the czartool dashboard is clear. But we are out of space.
  • 10:55 stopped distribution pantasks while I debug destreak "goto_restored" mode (undoing destreaking by swapping the backup files back in).
  • With a change to magicdstool I was able to get "goto_restored" processing working, but it doesn't reset the "magicked" bits back to zero. I took care of this by hand and queued all of the STS.201008 warpRuns that have been destreaked to be restored. It's a rather slow process.
  • 12:30 queued new STS.201008 diffRuns with the correct reduction and label to STS.20101202.wait. Then changed the label to STS.20101202 where the inputs are from warpRuns that don't need to be restored. I'll change the label for the others once that process is complete. 179 of 573 are restored so far.
  • 17:15 all STS.201008 warps have been restored to their unmagicked state. Changed diffRuns with label STS.20101202.wait to STS.20101202. As of 17:30, 792 of 860 remain to be processed.
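The .wait suffix acts as a hold label: runs queued under STS.20101202.wait are ignored by processing until the label is flipped to STS.20101202 once their input warps have been restored. A hypothetical sketch of that release step (illustrative schema, not the real gpc1 tables):

```python
# Hypothetical sketch of the hold-label pattern used above: diffRuns are
# queued under a ".wait" label, then released by renaming the label once
# all of their input warpRuns have been restored.
ACTIVE_LABEL = "STS.20101202"
HOLD_LABEL = ACTIVE_LABEL + ".wait"

def release_ready_runs(runs, restored_warp_ids):
    """Flip held diffRuns to the active label once all inputs are restored.

    runs: list of dicts with 'label' and 'warp_ids' keys (illustrative).
    restored_warp_ids: set of warp_ids already restored to unmagicked state.
    Returns the number of runs released.
    """
    released = 0
    for run in runs:
        if run["label"] == HOLD_LABEL and all(
            w in restored_warp_ids for w in run["warp_ids"]
        ):
            run["label"] = ACTIVE_LABEL
            released += 1
    return released
```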
  • 19:45 noticed that ipp021 died; power cycled it. The console showed:
<Jan/11 04:44 pm>ipp021 login: [1038205.587029] BUG: spinlock lockup on CPU#3, emacs/7426, ffff88002806b480
<Jan/11 04:44 pm>[1038205.590351] BUG: spinlock lockup on CPU#2, sed/7427, ffff88002806b480
<Jan/11 04:45 pm>[1038205.589518] BUG: spinlock lockup on CPU#0, pclient/30661, ffff88002806b480

Wednesday : 2011.01.12

  • czar = roy
  • 8:12 All data (around 200 3PI exposures) from last night appear to have been processed by 7am-ish. There are still 618 exposures under the STS.20101202 label at the diff stage.
  • 10:32 one stuck magicDS. looked in ~ipp/distribution/pantasks.stdout.log and found
failure for: --magic_ds_id 366209 --camera GPC1 --exp_id 279624 --streaks_path_base neb://any/gpc1/20110112/o5573g0191o.279624/o5573g0191o.279624.mgc.111101 --inv_streaks_path_base NULL --streaks NULL --inv_streaks NULL --stage warp --stage_id 146019 --component skycell.0734.119 --uri neb://ipp038.0/gpc1/ThreePi.nt/2011/01/12//o5573g0191o.279624/o5573g0191o.279624.wrp.146019.skycell.0734.119.fits --path_base neb://ipp038.0/gpc1/ThreePi.nt/2011/01/12//o5573g0191o.279624/o5573g0191o.279624.wrp.146019.skycell.0734.119 --cam_path_base NULL --cam_reduction NULL --outroot /data/ipp053.0/gpc1_destreak/ThreePi.nightlyscience/279624/warp --logfile /data/ipp053.0/gpc1_destreak/ThreePi.nightlyscience/279624/warp/279624.mds.366209.146019.skycell.0734.119.log --recoveryroot NULL --replace T --magicked 0 --run-state new --dbname gpc1 --verbose
job exit status: 5
job host: ipp038
job dtime: 4.975625
job exit date: Wed Jan 12 09:37:53 2011

Ran Bill's magic fix-it script as follows to recreate the broken warp:

ipp@ipp004:/home/panstarrs/ipp/src/ipp-20101215/tools>perl --skycell_id skycell.0734.119 --warp_id 146019
  • (bills 10:43) Noticed a bunch of destreak runs in state failed_revert. (This stuff is far too complicated.) "Fixed" with:
magicdstool -clearstatefaults -state failed_revert -label STS.20101202 -set_state new

magicdstool -clearstatefaults -state failed_revert -set_state new -label ThreePi.nightlyscience

  • (bills) 10:48 cleaned up cleanup faults with "chiptool -revertcleanup -label goto_cleaned" I suppose this should be in a revert task.
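The one-off chiptool command above could live in a small periodic task; a hypothetical sketch of such a revert loop (illustrative field names, not the real chipRun schema):

```python
# Hypothetical sketch of the revert task suggested above: periodically
# clear faulted cleanup entries so they are retried, instead of running
# "chiptool -revertcleanup" by hand.
def revert_cleanup_faults(entries, label="goto_cleaned"):
    """Clear the fault code on matching entries so cleanup is retried.

    entries: list of dicts with 'label' and 'fault' keys (illustrative).
    Returns the number of entries reverted.
    """
    reverted = 0
    for entry in entries:
        if entry["label"] == label and entry["fault"] != 0:
            entry["fault"] = 0  # zero fault => eligible for retry
            reverted += 1
    return reverted
```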

Thursday : 2011.01.13

  • (roy) 7:50 No data from last night. STS.20101202 label continues to pass (slowly) through the diff stage.
  • (bills) 9:20 A couple of camera stage distRuns were stuck because the corresponding chipRuns had been cleaned. This is a bug. Changed the SQL and built the fix into the production build.
  • (bills) 11:05 Queued magicRuns with label STS.20101202.b for the diffRuns that had previously been processed with reduction WARPSTACK instead of WARPSTACK_FORCED. They will need to have their labels changed to STS.20101202 before destreaking will occur.
  • Emptied the survey.magic and survey.destreak books so that runs will get queued faster. We'll need to remember to put the entries back when we get new data.

Friday : 2011.01.14

  • (bills) 19:08 put back the entries in the survey magic, destreak, and dist books that I removed yesterday.

Saturday : 2011.01.15

Sunday : 2011.01.16

  • (bills) 06:45 magic, destreak, and distribution have fallen far behind. It looks like pantasks isn't running jobs. Restarted distribution pantasks.