PS1 IPP Czar Logs for the week 2013.01.28 - 2013.02.03


Monday : 2013.01.28

  • 09:30 MEH: skycal off while the psastro/getstar problem is being dealt with
  • 14:00 MEH: all pantasks (deepstack, pstamp, update) on ippc17 have crashed. restarting

Tuesday : 2013.01.29

  • 13:35 Serge: Stopped cleanup
  • 14:10 Bill: queued 20 M31 frames for chip - warp processing

Wednesday : 2013.01.30

Bill is czar today

  • 10:15 temporarily changed neb-host ipp027 from down to repair. Should it be down? rsyncs are going on
  • Fixed 5 broken burntool instances and reverted the faulted M31 chip runs that depended on them. (A scripted form of the copies below is sketched after this list.)
    cp /data/ipp027.0/nebulous/b7/c5/370480873.gpc1:20100725:o5402g0452o:o5402g0452o.ota26.burn.tbl /data/ippb02.1/nebulous/b7/c5/899429635.gpc1:20100725:o5402g0452o:o5402g0452o.ota26.burn.tbl
    cp /data/ipp027.0/nebulous/c1/ef/448745447.gpc1:20100915:o5454g0264o:o5454g0264o.ota26.burn.tbl /data/ippb02.0/nebulous/c1/ef/931041890.gpc1:20100915:o5454g0264o:o5454g0264o.ota26.burn.tbl
    cp /data/ipp027.0/nebulous/71/bd/492194663.gpc1:20101015:o5484g0447o:o5484g0447o.ota26.burn.tbl /data/ippb02.0/nebulous/71/bd/901069396.gpc1:20101015:o5484g0447o:o5484g0447o.ota26.burn.tbl
    cp /data/ipp027.0/nebulous/36/2d/535014565.gpc1:20101111:o5511g0093o:o5511g0093o.ota26.burn.tbl /data/ippb02.0/nebulous/36/2d/915499833.gpc1:20101111:o5511g0093o:o5511g0093o.ota26.burn.tbl
    cp /data/ipp027.0/nebulous/c2/2a/573274958.gpc1:20101214:o5544g0015o:o5544g0015o.ota26.burn.tbl /data/ippb02.2/nebulous/c2/2a/904066045.gpc1:20101214:o5544g0015o:o5544g0015o.ota26.burn.tbl
    
  • 13:30 or so enabled skycal
  • 13:54 Serge: Restarted cleanup
  • 15:02 removed label goto_cleaned.rerun to allow the goto_cleaned runs to get cleaned (they have more bytes to recover)
  • 15:39 registration pantasks died. Decided to take this opportunity to restart the pantasks. check_system.sh stop
  • 15:52 all pantasks restarted (except for cleanup, deepstack, and the addstars, of course). cleanup set to stop to prepare for a restart (the previous attempt failed due to czar error)
  • 21:16 many failures due to permission problems on the backup nebulous directories for ipp027 and ipp028. Setting them to down for now.
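
A scripted form of the burntool copies above (hypothetical wrapper; only the cp source/destination pairs come from the entry itself):

    # hypothetical: copy each burn.tbl to its backup location, then verify it
    # burn_tbl_pairs.txt holds two columns: source path, destination path
    while read src dst; do
        cp "$src" "$dst" && cmp -s "$src" "$dst" || echo "FAILED: $dst"
    done < burn_tbl_pairs.txt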

Thursday : 2013.01.31

Bill is czar again today

  • 07:30ish set ipp027 and ipp028 back to up
  • 10:30 Attempted to fix the permissions problem on the backup ipp027 directories
    hopefully this wasn't rash
    (stsci07:stsci07.2/ipp027.0/nebulous) # find . -maxdepth 2 -group users -exec chgrp nebulous {} \;
    (stsci07:stsci07.2/ipp027.0/nebulous) # find . -maxdepth 2 -perm 700 -exec chmod 775 {} \;
    
    Did the same thing in
    /data/stsci02.2/ipp028.0/nebulous
    /data/stsci06.2/ipp027.0/nebulous
    and many more places. Still haven't found them all. (A sweep over all of the backup trees is sketched after this list.)
    
  • 10:46 we've been getting strange lock timeouts from stacktool -addsumskyfile. This has caused 23% of stacks to fail since we restarted the stack pantasks. As an experiment, turned off skycal and staticsky to see whether their load queries might be triggering the problem. (Generic lock diagnostics are sketched after this list.)
  • 11:15 stopping stack to attempt to sort out the database problem.
  • 11:35 staticskyRun 391682 has a problem: its i-band input stack has no images. The cause is that the stacking ran twice on December 7. The first instance finished at 07:56; at 08:03 a second job failed to insert its results into the database and was killed, but apparently not before destroying the outputs from the first instance. This stack is lost. I'm going to drop the stack and update the staticskyInput table for the run to remove the stack_id for the broken entry. (A sketch of that update is after this list.)
  • 12:30 removed ThreePi.nightlyscience from the warp-stack diff survey task. I suspect that the queries there are causing the stack locking problem
  • 14:20 queued warp stack diffs for last night's warps. the difftool command took 48 minutes
  • 15:20 cleared out a number of jobs hung on ipp020. Power cycled it
  • 15:20 restarted stdscience setting ipp020 to off as a compute host since I'm annoyed with it.
  • 15:36 finally got all chips finished. Set chip processing to off in order to give the warp backlog a chance to catch up.
  • 18:50 reran 2 faulted pub runs that were in an invalid state (fileset already on the data store; that case should be handled automatically someday). Dropped stuck stack: stacktool -updaterun -set_state drop -stack_id 1974571 -set_note 'Data error code: 32ce'
  • 20:00 turning chip on
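
For the 10:30 permissions entry: a hypothetical sweep applying the same chgrp/chmod fix to every backup nebulous tree at once, assuming they are all visible under /data/stsci* as the examples above are:

    # hypothetical: repeat the 10:30 fix over every ipp027/ipp028 backup tree
    for d in /data/stsci*/ipp027.0/nebulous /data/stsci*/ipp028.0/nebulous; do
        [ -d "$d" ] || continue
        find "$d" -maxdepth 2 -group users -exec chgrp nebulous {} \;
        find "$d" -maxdepth 2 -perm 700 -exec chmod 775 {} \;
    done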
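
For the 10:46 lock timeouts: generic diagnostics for seeing which queries hold or wait on locks, assuming (not confirmed here) that the gpc1 database runs on MySQL/InnoDB and the czar has client access:

    mysql gpc1 -e 'SHOW FULL PROCESSLIST'
    mysql gpc1 -e 'SHOW ENGINE INNODB STATUS\G'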
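
For the 11:35 staticskyRun entry: a hypothetical form of the database update described there. The table name staticskyInput and run id 391682 come from the entry; the column names are guesses, and the broken stack_id is left as a placeholder:

    # hypothetical: detach the destroyed stack from the staticsky run
    mysql gpc1 -e "UPDATE staticskyInput SET stack_id = NULL
                   WHERE ss_id = 391682 AND stack_id = <broken_stack_id>"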

Friday : 2013.02.01

  • 21:08 Bill: restarted deepstack pantasks with 1 x compute3 nodes enabled. We are done with the LAP staticsky runs for stacks made prior to this week, except for 1176 runs near the galactic center and the 36 runs that have faulted, mostly with the can't-find-psf problem. I will look into those tomorrow. The galactic center runs have had their labels changed back, and they will run now.
  • 22:40 Bill: added background.pro to stdscience along with label M31.test.20130201. This will process 20 warp background runs using inputs from prototype chip background runs made with the MPG M31 auxiliary masks. It'll be done in no time. Oh, never mind. Something's wrong. Nothing to see here; move along.

Saturday : 2013.02.02

Sunday : 2013.02.03

  • 00:40 MEH: looks like stdscience is struggling to stay loaded; needs its normal restart. Restarted with the labels added back as they were and chip.revert.off set as it was before
  • 08:30 Bill: chip revert was off. Reverted a few nightly science faults
  • 09:10 Bill: regenerated two missing burntool tables. Why are they still disappearing? (A batch form of these commands is sketched below.)
    perl ~bills/ipp/tools/fixburntool -e 232666 -c XY35
    perl ~bills/ipp/tools/fixburntool -e 289167 -c XY27
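
A batch form of the fixburntool commands above (hypothetical loop; the exposure/chip pairs are the two just shown):

    # hypothetical: rerun fixburntool over a list of (exp_id, chip) pairs
    printf '%s\n' '232666 XY35' '289167 XY27' |
    while read exp chip; do
        perl ~bills/ipp/tools/fixburntool -e "$exp" -c "$chip"
    done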
    
