(Up to PS1 IPP Czar Logs)

Monday : 2012.07.16

Bill is czar today.

  • 10:10 stdscience was stuck. There are several thousand chip runs outstanding but no jobs are getting queued. Czartool indicates that chip processing stopped making progress around 20:15 last night while outstanding warps got finished just after midnight. The book chipPendingImfile had lots of entries in init state but nothing was getting run. I restarted stdscience.
  • 13:40 Chris discovered that ippc03 was out of space in /tmp which was causing nebulous requests to fail. This undoubtedly contributed to the processing failures. Bill deleted the log files and restarted the apache servers.
  • 13:55 chip reverts are off because we have a number of exposures that are faulted repeatedly due to incorrect burntool state. Chris is working on fixing the burntool states.
  • 15:18 I see that chip reverts have been turned back on.

Tuesday : 2012.07.17

Bill is czar again today

  • 10:18 LAP is done through warp except for 4 exposures that have chips with missing burntool tables and one warp process whose perl script got stuck trying to check files on one of the stsci nodes. Killed that process and the warp completed. Also had to rerun chip 493934 XY55 because the output variance image was corrupt.
  • 10:45 Used Chris's do_burntool_single.pl script to recreate the missing burntool tables.
  • 11:15 ipp008 has some chip processes that have been stuck. Gene and I tried to fix the problem to no avail. Rebooted the system.
  • 11:23 restarted stdscience
  • 13:23 restored a couple of "lost" raw imfiles with previously unknown copies that Serge was able to find with his new databases. More info later.

Heather is the afternoon czar

Wednesday : 2012.07.18

Mark is czar

  • 02:30 fixed ppStack bug, turning LAP stacks back on for another test sample. about 900/3500 likely had this problem in the earlier run of stack.
  • 03:10 the 140 extra stack processing is hitting the stsci machines fairly hard, turning off 1x stare+2x compute+1x compute2 (~35) to start with. still causing some large spike, taking out 1x stare+1x compute2+1x wave4 more (~27).
  • 09:40 czar rate plot indicates ~300 stacks/hr with ~82 nodes. stsci nodes still getting spiked (nfs) but have handled for past few hrs.
  • 11:00 adding back 1x compute2+1x stare+1x wave4 for 103 nodes, stsci node spikes are broader now (time) -- monitoring -- seems okay and reaching >400/hr
  • 11:30 ganglia reports ipp008 all memory swapped but isnt. gmond restart worked for a little bit but back to saying all memory swapped. keep eye on.
  • 12:00 ecliptic.rp chip faults often NON-EXISTANT, but are being found with the new deneb-locate.py command --
    	o5362g0549o 	XY36 	181808 	488192 
          0                     NON-EXISTANT file:///data/ippb02.1/nebulous/d1/13/917781576.gpc1:20100615:o5362g0549o:o5362g0549o.ota36.fits
          0                     NON-EXISTANT file:///data/ippb02.0/nebulous/d1/13/917799349.gpc1:20100615:o5362g0549o:o5362g0549o.ota36.fits
    -- rogue files?
    /home/panstarrs/ipp/local/bin/deneb-locate.py gpc1/20100615/o5362g0549o/o5362g0549o.ota36.fits
    /data/ippb02.2/nebulous/d1/13/917799349.gpc1:20100615:o5362g0549o:o5362g0549o.ota36.fits -- seems ok
    /data/ipp024.0/nebulous/d1/13/319361864.gpc1:20100615:o5362g0549o:o5362g0549o.ota36.fits -- 0 size
    -- copy over to non-existant file location
    cp /data/ippb02.2/nebulous/d1/13/917799349.gpc1:20100615:o5362g0549o:o5362g0549o.ota36.fits /data/ippb02.1/nebulous/d1/13/917781576.gpc1:20100615:o5362g0549o:o5362g0549o.ota36.fits
    cp /data/ippb02.2/nebulous/d1/13/917799349.gpc1:20100615:o5362g0549o:o5362g0549o.ota36.fits /data/ippb02.0/nebulous/d1/13/917799349.gpc1:20100615:o5362g0549o:o5362g0549o.ota36.fits
    -- want to replicate to main machines so first instance is not looked for on ippb machines
    --> could also replicate all off of ippb machines since overly full -- if not use volume then likely puts to stsci randomly and don't want
    neb-replicate neb://ipp030.0/gpc1/20100615/o5362g0549o/o5362g0549o.ota36.fits --volume ipp030.0
    neb-replicate neb://ipp030.0/gpc1/20100615/o5362g0549o/o5362g0549o.ota36.fits --volume ipp031.0
    neb-cull neb://ipp030.0/gpc1/20100615/o5362g0549o/o5362g0549o.ota36.fits --volume ippb02.0
    neb-cull neb://ipp030.0/gpc1/20100615/o5362g0549o/o5362g0549o.ota36.fits --volume ippb02.1
    -- once reverted and okay, can rm rogue files
    rm /data/ippb02.2/nebulous/d1/13/917799349.gpc1:20100615:o5362g0549o:o5362g0549o.ota36.fits
    rm /data/ipp024.0/nebulous/d1/13/319361864.gpc1:20100615:o5362g0549o:o5362g0549o.ota36.fits
  • 12:10 ecliptic.rp chip faults also from missing/lost on ipp064.0X
    --> no instances..
    whichnode neb://ipp028.0/gpc1/20100616/o5363g0496o/o5363g0496o.ota26.fits
    ipp064.0X not available
    ipp064.0X not available
    --> deneb-locate any rogue files -- yes
    /home/panstarrs/ipp/local/bin/deneb-locate.py gpc1/20100616/o5363g0496o/o5363g0496o.ota26.fits
    neb-mv neb://ipp028.0/gpc1/20100616/o5363g0496o/o5363g0496o.ota26.fits neb://ipp028.0/gpc1/20100616/o5363g0496o/o5363g0496o.ota26.fits.bad
    neb-touch neb://ipp028.0/gpc1/20100616/o5363g0496o/o5363g0496o.ota26.fits --volume ipp028.0
    neb-stat --validate neb://ipp028.0/gpc1/20100616/o5363g0496o/o5363g0496o.ota26.fits
    cp /data/ippb02.0/nebulous/c8/91/917844868.gpc1:20100616:o5363g0496o:o5363g0496o.ota26.fits /data/ipp028.0/nebulous/c8/91/2208929027.gpc1:20100616:o5363g0496o:o5363g0496o.ota26.fits
    neb-replicate neb://ipp028.0/gpc1/20100616/o5363g0496o/o5363g0496o.ota26.fits --volume ipp030.0
    -- when revert finished successfully then 
    rm /data/ippb02.0/nebulous/c8/91/917844868.gpc1:20100616:o5363g0496o:o5363g0496o.ota26.fits
  • 12:30 adding back 1x compute+1x compute2 for 124 nodes in stack pantasks -- gets the stsci nodes to load spike to >100 with nfs.. but handling so far..
  • 12:50 pile up of ecliptic.rp in camera stage, ~15 runs >20ks. killing and reverting to see if clear ok -- when reverted still stall -- need to be looked into.
  • 13:58 CZW: Added SAS_v8 to stdscience. Defined LAP queue for Kepler field (based on http://kepler.nasa.gov/Science/about/targetFieldOfView/, defined as (227.5,36)->(302.5,52.5)) and set it as the "current.queue". This means that as the 09:00hr slice finishes, it will launch Kepler fields. I've also defined a 10:00hr slice, so we can switch to that when Kepler is fully queued.
  • 16:00 short testing single stage processing for chip/camera/warp/stack, warp is the primary culprit for harassing the stsci machines.
  • 18:00 running SAS chips/camera only for full rate (running set.poll 550)
  • 21:00 SAS chips finished, turning on warps to run slowly overnight and monitor rate. LAP chips/camera also processing since should cause no extra slow down.
  • 23:50 LAP stacks turned back on once SAS warp rate seen to turn over, adding stacks not appear to decrease rate significantly.

Thursday : 2012.07.19

Mark is czar

  • 10:00 processing stopped for Gavin to raise the number of nfsd instances on the stsci machines from 128 to 256. restarted processing with similar chip+cam+warp+stack as before (diff.off) to monitor rate. seems has the ability to push load up to 200 now but the system is responsive and just waiting on the many IO to the disk, same wait/load problem exists with ecliptic warps.
  • 12:30 removing ecliptic.rp from stdscience and placing back in its own pantasks at a lower rate
    • long camera stage runs still happening with new exposures, killing off when found >3000s and keeping camera.revert off for now
  • 12:40 turning up processing on LAP diffs (all to run in stdscience, set.poll 500 -2x compute3 -2x stare) and SAS+LAP stacking (adding 1x compute3, -1x stare overloaded, +1x wave3, -1x compute2 to accomodate ecliptic pantasks)
    • LAP diffs moving through and triggering more chips. turning chip.revert off since processing balance manually setup to move stacks+diffs through
  • 21:10 ippc17 high load, restarting pstamp and update
  • 19:30 stdscience and stack restarted once SAS stacks and LAP diffs finished. Tried boosting stack processing but chip/warp processing irregular so restarting stdscience+stack.
  • 22:00 Chris suspects the warp stage/stsci load trouble is due to the ecliptic not having @HOST@ in the workdir. LAP runs do have this defined, with 300 warps running only see a few load spikes, ~50, but clear quickly.
  • 22:30 watch out for when the ecliptic chips finish and all nodes go to warp in ecliptic pantasks (~23:00) -- stopped ecliptic pantasks for now, ~2k warps left, to not conflict with main pantasks (be careful of the node balance when in use, commented out multiple compute3).
  • 23:30 LAP stacks back on after chip+warps seen can reach 100-150. +1x compute3 to stack since deepstack not using tonight.

Friday : 2012.07.20

Serge is czar

  • 09:05 Serge. Fixing problems for LAP
    • No acceptable file for key neb://ipp045.0/gpc1/20100518/o5334g0188o/o5334g0188o.ota34.burn.tbl but...
      > neb-locate -all gpc1/20100518/o5334g0188o/o5334g0188o.ota34.burn.tbl
      > ls -l `neb-locate -all gpc1/20100518/o5334g0188o/o5334g0188o.ota34.burn.tbl| sed 's/file://g'`
      ls: cannot access /data/ippb02.0/nebulous/87/23/951094587.gpc1:20100518:o5334g0188o:o5334g0188o.ota34.burn.tbl: No such file or directory
      -rw-rw-r-- 1 apache nebulous 0 May 21  2011 /data/ippb00.0/nebulous/87/23/951084822.gpc1:20100518:o5334g0188o:o5334g0188o.ota34.burn.tbl
      > ls -l `deneb-locate.py gpc1/20100518/o5334g0188o/o5334g0188o.ota34.burn.tbl`
      ls: cannot access /data/ippb00.2/nebulous/87/23/951084822.gpc1:20100518:o5334g0188o:o5334g0188o.ota34.burn.tbl: No such file or directory
      -rw-rw-r-- 1 apache nebulous 29995 May 21  2011 /data/ippb02.2/nebulous/87/23/951094587.gpc1:20100518:o5334g0188o:o5334g0188o.ota34.burn.tbl

It looks like /data/ippb02.2/nebulous/87/23/951094587.gpc1:20100518:o5334g0188o:o5334g0188o.ota34.burn.tbl is a good candidate. In the nebulous db I fixed the uri for instance 951094587:

UPDATE instance SET uri='file:///data/ippb02.2/nebulous/87/23/951094587.gpc1:20100518:o5334g0188o:o5334g0188o.ota34.burn.tbl' WHERE ins_id = 951094587;
UPDATE instance SET vol_id=61 WHERE ins_id = 951094587;

I cleaned the cranky instance on ippb00.0:

neb-cull gpc1/20100518/o5334g0188o/o5334g0188o.ota34.burn.tbl --volume ippb00.0

I forced replication of the instance on ipp023 at mhpcc. ipp023 is the node where the image is kept

neb-replicate gpc1/20100518/o5334g0188o/o5334g0188o.ota34.burn.tbl --volume ipp023.0
  • Same for gpc1/20100605/o5352g0271o/o5352g0271o.ota25.burn.tbl.
  • Same for gpc1/20100618/o5365g0122o/o5365g0122o.ota22.fits
  • 09:45 Serge: killed backup on ippdb02 which is 13000 seconds behind its master
  • 09:50 Serge: no valid instance for gpc1/20100617/o5364g0065o/o5364g0065o.ota01.fits
    chiptool -dbname gpc1 -updateprocessedimfile -set_state full -chip_id 499271 -class_id XY01
    chiptool -dbname gpc1 -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 499271 -class_id XY01
    regtool -dbname gpc1 -updateprocessedimfile -set_state corrupt -class_id XY01 -exp_id 182583 
  • 11:00 Serge: Restarted the mysql server on ippdb02
  • 12:00 Mark: two odd ppImages found running for >10ks and Gene found they have massive RAM footprint (pushing systems into swap hell) -- ippc23 is not down, just unhappy
      0    ipp039    BUSY  15176.09  0 chip_imfile.pl --threads @MAX_THREADS@ --exp_id 77887 --chip_id 499135 --chip_imfile_id 29773311 --class_id XY60 --uri neb://ipp034.0/gpc1/20090615/o4997g0204o/o4997g0204o.ota60.fits --camera GPC1 --run-state new --deburned 1 --outroot neb://ipp039.0/gpc1/LAP.ThreePi.20120706/2012/07/20/o4997g0204o.77887/o4997g0204o.77887.ch.499135 --redirect-output --reduction LAP_SCIENCE --dbname gpc1 --verbose 
      1    ippc23    BUSY  12562.86  0 chip_imfile.pl --threads @MAX_THREADS@ --exp_id 77894 --chip_id 499196 --chip_imfile_id 29776971 --class_id XY60 --uri neb://ipp034.0/gpc1/20090615/o4997g0211o/o4997g0211o.ota60.fits --camera GPC1 --run-state new --deburned 1 --outroot neb://ipp039.0/gpc1/LAP.ThreePi.20120706/2012/07/20/o4997g0211o.77894/o4997g0211o.77894.ch.499196 --redirect-output --reduction LAP_SCIENCE --dbname gpc1 --verbose 
  • 13:35 Serge: rebooted ippc23
  • 15:20 Serge: restarting stdscience
  • 15:30 Mark: resetting ~3500 first run early LAP%20120706 to rerun and use the change to the stack variance and the normalization fix.
  • 16:50 Mark: reallocating a compute3 from stack back to deepstack to finish MD10 refstacks (and the SAS staticsky that has been triggered). removed extra compute3 from stack because staticsky/refstacks can be more RAM demanding and ecliptic is currently using compute3 as well.
  • 22:00 SAS_v8 staticsky needing multiple reverts to get down to <10 from 93 faults. reactivating compute3 in stack now staticsky mostly done, LAP stacks halfway finished with re-processing ~3500 early run stacks and back to new LAP stacks ~2am.

Saturday : 2012.07.21

  • 09:20 Mark: several long running/high RAM ppImage jobs for LAP chips. killed and restarted stdscience.
  • 11:10 and more and more to kill off. turning chip.revert off
  • 23:40 stdscience having trouble keeping fully loaded, doing daily restart.

Sunday : 2012.07.22

  • 11:10 Mark: ippc20 is in a bad way with swap and unresponsive. going to reboot. nothing on console, /dev/md2 has gone 206 days so check forced ((10.0% non-contiguous),
  • 16:00 default skycal settings seem okay for SAS, running a full sample (label SAS_v8 in deepstack found in skycalRun) to find any problems now that the staticsky Heather triggered on Friday is mostly finished.
  • 19:50 LAP hour stripe stacks mostly caught up, LAP will continue on to the Kepler stacks.
  • 21:50 stdscience having trouble keeping fully loaded, doing daily restart.