PS1 IPP Czar Logs for the week 2011.12.19 - 2011.12.23


Monday : 2011.12.19

Mark is czar

  • 09:00 need to remove ipp064 (raid rebuild), 065, 066 (memory swap) from processing for work to be done by Haydn. Also set neb-host down to avoid any possible hang-ups with the hardmounts now.
  • 12:00 Haydn reported ipp063 back online with 48GB RAM; ipp065 & ipp066 were rebooted twice and also report 48GB RAM. ipp064 RAID is rebuilding and Cindy will install Gentoo when it finishes. However, ipp064 sometimes reports 40GB and sometimes 48GB RAM, so that will need to be fixed after the RAID rebuild finishes.
  • 12:20 oops, ipp064 was up with ipp065 for a few minutes so a few files updated or were added to nebulous from LAP. need to set the ipp064 data temporarily on ipp065 to repair with
    neb-host --volume ipp064.0 repair (option needs to be added to help/man?)
    
  • 12:30 MD06 chips+warps set to update, to see how many were lost with ipp064 and to attempt refstack building again this week for the next OC starting in January; otherwise will need to queue up a full new run of condor.
  • 12:40 working through stalled LAP, restarting stdscience and distribution to rotate logs and refresh things.
  • 16:00 half of LAP running again after re-updating many error_cleaned chips
  • 20:30 fixing stalling LAP chips due to the FITS with incomplete extensions from ipp064. replicated ones to ATRC seem okay, so cp likely faulty one to .bad and cp good copy
    cp /data/ipp064.0/nebulous/46/b5/365746896.gpc1:20100722:o5399g0214o:o5399g0214o.ota26.fits /data/ipp064.0/nebulous/46/b5/365746896.gpc1:20100722:o5399g0214o:o5399g0214o.ota26.fits.bad
    ...
    cp /data/ippb00.2/nebulous/46/b5/1158188565.gpc1:20100722:o5399g0214o:o5399g0214o.ota26.fits /data/ipp064.0/nebulous/46/b5/365746896.gpc1:20100722:o5399g0214o:o5399g0214o.ota26.fits
    cp /data/ippb02.1/nebulous/cb/80/1160615386.gpc1:20100821:o5429g0180o:o5429g0180o.ota26.fits /data/ipp064.0/nebulous/cb/80/407468336.gpc1:20100821:o5429g0180o:o5429g0180o.ota26.fits
    cp /data/ippb00.0/nebulous/78/7e/936702266.gpc1:20100821:o5429g0194o:o5429g0194o.ota26.fits /data/ipp064.0/nebulous/78/7e/407501214.gpc1:20100821:o5429g0194o:o5429g0194o.ota26.fits
    cp /data/ippb02.0/nebulous/b6/9b/1158190731.gpc1:20100722:o5399g0241o:o5399g0241o.ota26.fits /data/ipp064.0/nebulous/b6/9b/365752348.gpc1:20100722:o5399g0241o:o5399g0241o.ota26.fits
    cp /data/ippb02.1/nebulous/7b/da/896024650.gpc1:20100917:o5456g0093o:o5456g0093o.ota26.fits /data/ipp064.0/nebulous/7b/da/451746379.gpc1:20100917:o5456g0093o:o5456g0093o.ota26.fits
    
  • missing file
    neb://ipp024.0/gpc1/20100822/o5430g0244o/o5430g0244o.ota13.fits
          0                     NON-EXISTANT file:///data/ippb02.2/nebulous/4c/cd/920273147.gpc1:20100822:o5430g0244o:o5430g0244o.ota13.fits
          0                     NON-EXISTANT file:///data/ippb02.2/nebulous/4c/cd/920291539.gpc1:20100822:o5430g0244o:o5430g0244o.ota13.fits
    chiptool -dropprocessedimfile -set_quality 42 -chip_id 365090 -class_id XY13 -dbname gpc1
    regtool -updateprocessedimfile -set_ignored -exp_id 210989 -class_id XY13 -dbname gpc1
    
  • 23:30 runchipimfile.pl on missing chips (typically XY26 from ipp064) seems to be clearing some of the faulted DS
    perl ~ipp/src/ipp-20110622/tools/runchipimfile.pl --chip_id 355330 --class_id XY26 --redirect-output
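
The 20:30 repair above (keep the suspect instance as .bad, then copy a replicated instance over it) could be wrapped roughly as follows. This is only a sketch: replace_instance is a hypothetical helper, not part of the IPP or nebulous tools, and it assumes both paths are plain files reachable from the host where it runs.

```shell
# Hypothetical helper (not an IPP tool): keep the suspect instance as
# <path>.bad, then overwrite it with a replica believed to be intact.
replace_instance() (
    bad="$1"    # suspect instance, e.g. a file under /data/ipp064.0/nebulous/
    good="$2"   # replicated copy believed to be good
    if [ ! -f "$bad" ] || [ ! -f "$good" ]; then
        echo "replace_instance: missing file" >&2
        exit 1
    fi
    # Refuse to clobber a .bad backup left by an earlier attempt.
    if [ -e "$bad.bad" ]; then
        echo "replace_instance: $bad.bad already exists" >&2
        exit 1
    fi
    cp -p "$bad" "$bad.bad" || exit 1
    cp "$good" "$bad" || exit 1
    # Succeed only if the replaced instance now matches the good copy.
    cmp -s "$good" "$bad"
)
```

Called as replace_instance SUSPECT_PATH GOOD_PATH, it returns non-zero when either file is missing, when a .bad backup already exists, or when the final comparison fails.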
    

Tuesday : 2011.12.20

Mark is Czar

  • 07:00 the little nightly science finished, LAP fully stalled again by a few faulted chips, MD06 chip/warp updates continue.
  • 07:30 again renaming likely bad chips to .bad and cp over likely good versions
    cp /data/ipp064.0/nebulous/cd/42/451748571.gpc1:20100917:o5456g0099o:o5456g0099o.ota26.fits /data/ipp064.0/nebulous/cd/42/451748571.gpc1:20100917:o5456g0099o:o5456g0099o.ota26.fits.bad
    ...
    cp /data/ippb01.0/nebulous/cd/42/914175068.gpc1:20100917:o5456g0099o:o5456g0099o.ota26.fits /data/ipp064.0/nebulous/cd/42/451748571.gpc1:20100917:o5456g0099o:o5456g0099o.ota26.fits
    cp /data/ippb00.1/nebulous/2b/d9/1198353397.gpc1:20100626:o5373g0386o:o5373g0386o.ota26.fits /data/ipp064.0/nebulous/2b/d9/338195117.gpc1:20100626:o5373g0386o:o5373g0386o.ota26.fits
    cp /data/ippb01.1/nebulous/99/bb/902728051.gpc1:20101114:o5514g0059o:o5514g0059o.ota26.fits /data/ipp064.0/nebulous/99/bb/539189992.gpc1:20101114:o5514g0059o:o5514g0059o.ota26.fits
    cp /data/ipp063.0/nebulous/16/8a/1615154887.gpc1:20101117:o5517g0042o:o5517g0042o.ota26.fits /data/ipp064.0/nebulous/16/8a/544061969.gpc1:20101117:o5517g0042o:o5517g0042o.ota26.fits
    
  • 09:00 continue to untangle the remaining 5 stalled LAP runs from the weekend. Three stuck in distribution due to stack mdc file missing:
    perl ~ipp/src/ipp-20110622/tools/runstackskycell.pl --stack_id 561899  --redirect-output
    disttool -dbname gpc1 -revertrun -dist_id 1545583
    perl ~ipp/src/ipp-20110622/tools/runstackskycell.pl --stack_id 561913 --redirect-output
    perl ~ipp/src/ipp-20110622/tools/runstackskycell.pl --stack_id 561927 --redirect-output
    
  • 11:00 LAP DS has 3 faults, tried to fix but giving up. magic_ds_id=860615, 860713, 861429
  • 11:30 more LAP chip issues
    neb://ipp028.0/gpc1/20100522/o5338g0484o/o5338g0484o.ota26.fits -- no instances
    chiptool -dropprocessedimfile -set_quality 42 -chip_id 365411 -class_id XY26 -dbname gpc1
    regtool -updateprocessedimfile -set_ignored -exp_id 171060 -class_id XY26 -dbname gpc1
    
    neb://ipp034.0/gpc1/20100626/o5373g0390o/o5373g0390o.ota56.burn.tbl -- empty burntool tables
    chiptool -dropprocessedimfile -set_quality 42 -chip_id 365454 -class_id XY56 -dbname gpc1
    regtool -updateprocessedimfile -set_ignored -exp_id 187135 -class_id XY56 -dbname gpc1
    
    neb://ipp027.0/gpc1/20101117/o5517g0046o/o5517g0046o.ota26.fits -- missing extensions but funpacks ok..
    cp /data/ipp064.0/nebulous/01/c7/544067331.gpc1:20101117:o5517g0046o:o5517g0046o.ota26.fits /data/ipp064.0/nebulous/01/c7/544067331.gpc1:20101117:o5517g0046o:o5517g0046o.ota26.fits.bad
    cp /data/ippb02.1/nebulous/01/c7/915640648.gpc1:20101117:o5517g0046o:o5517g0046o.ota26.fits /data/ipp064.0/nebulous/01/c7/544067331.gpc1:20101117:o5517g0046o:o5517g0046o.ota26.fits
    
  • 13:25 remaining 65 stalled warps from weekend due to input chips mysteriously being cleaned. manually updated normally (chiptool -setimfiletoupdate -chip_id)
  • 15:50 not quite, 22 remaining warp updates stalled with error_cleaned. followed Bill's suggestion and set goto_cleaned then the error_cleaned to cleaned (warptool -dbname gpc1 -tocleanedskyfile -warp_id 285084 -skycell_id skycell.2150.014 etc) and remaining stack sets triggered.
  • 18:30 restarted stack after overloading it to move through the final LAP stack backlog. all 10 LAP runs processing now.
  • 20:30 MD06 had 6 chips with missing extensions and one lost chip, found during update
    neb://ipp028.0/gpc1/20100518/o5334g0140o/o5334g0140o.ota26.fits
    neb://ipp028.0/gpc1/20100602/o5349g0028o/o5349g0028o.ota26.fits
    neb://ipp028.0/gpc1/20100605/o5352g0133o/o5352g0133o.ota26.fits
    neb://ipp028.0/gpc1/20100614/o5361g0033o/o5361g0033o.ota26.fits
    neb://ipp028.0/gpc1/20100618/o5365g0065o/o5365g0065o.ota26.fits
    neb://ipp027.0/gpc1/20100624/o5371g0176o/o5371g0176o.ota26.fits
    
    neb://ipp028.0/gpc1/20100607/o5354g0038o/o5354g0038o.ota26.fits
    
  • 21:30 more LAP chips with extension problems, replaced with second copy
    neb://ipp027.0/gpc1/20100909/o5448g0109o/o5448g0109o.ota26.fits
    neb://ipp027.0/gpc1/20100820/o5428g0287o/o5428g0287o.ota26.fits 
    neb://ipp027.0/gpc1/20100909/o5448g0130o/o5448g0130o.ota26.fits
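
For the chips at 11:30 that had no usable instance, the same chiptool/regtool pair was typed by hand each time. A dry-run sketch that only prints the pair for review is given below. print_drop_commands is a hypothetical wrapper; the flags are copied from the commands logged above and nothing else about the tools is assumed.

```shell
# Hypothetical wrapper: print (do not run) the drop-and-ignore pair for one imfile.
# Arguments: chip_id exp_id class_id [dbname]; dbname defaults to gpc1.
print_drop_commands() (
    if [ "$#" -lt 3 ]; then
        echo "usage: print_drop_commands chip_id exp_id class_id [dbname]" >&2
        exit 1
    fi
    chip_id="$1"
    exp_id="$2"
    class_id="$3"
    dbname="${4:-gpc1}"
    # Mark the chip imfile as dropped with quality 42, as in the log entries.
    echo "chiptool -dropprocessedimfile -set_quality 42 -chip_id $chip_id -class_id $class_id -dbname $dbname"
    # Tell registration to ignore the corresponding raw imfile.
    echo "regtool -updateprocessedimfile -set_ignored -exp_id $exp_id -class_id $class_id -dbname $dbname"
)

# Values taken from the 11:30 entry.
print_drop_commands 365411 171060 XY26
print_drop_commands 365454 187135 XY56
```

Printing rather than executing leaves the decision to drop a chip with the czar; the output can be pasted into a shell once it has been checked.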
    

Wednesday : 2011.12.21

  • 16:00 Mark: ~100 MD06 updates were stuck in warp after the chip fixes yesterday. handful were error_cleaned cases, rest were corrupted files from camera stage and ipp064.
    -- majority were corrupted SMFs. could have cp from good replicated version but was tedious so used runcameraexp.pl to remake (have listing if needed)
    runcameraexp.pl --redirect-output --cam_id 324590
    
    -- 6 were missing .trace files, neb-mv to .bad and re-ran runcameraexp.pl
    neb-mv neb://@HOST@.0/gpc1/condor_MD06.V3_01.haf/o5214g0210o.123894/o5214g0210o.123894.cm.323605.trace neb://@HOST@.0/gpc1/condor_MD06.V3_01.haf/o5214g0210o.123894/o5214g0210o.123894.cm.323605.trace.bad
    
  • 16:30 Mark: sending updated chips to goto_clean to free up space.
  • 22:00 Mark: setup SAS2.123.rerun20111129, SAS.footprint.123.rerun20111129 stacks made so far for distribution.
  • 22:30 Mark: deepstack pantasks turned back on for MD06 refstacks now that warps appear to be fully updated again.
  • 23:00 Mark: giving LAP another kick, most are due to first instance of ota26 having corrupted extensions. copying over second copy
    neb://ipp027.0/gpc1/20100904/o5443g0495o/o5443g0495o.ota26.fits
          1 5da30fc8fdf90bbc6a9b92c65836d769 file:///data/ipp064.0/nebulous/3c/69/431318460.gpc1:20100904:o5443g0495o:o5443g0495o.ota26.fits
          1 fd477cfabc5f5a4774885ff497f98843 file:///data/ippb02.2/nebulous/3c/69/881863603.gpc1:20100904:o5443g0495o:o5443g0495o.ota26.fits
    cp /data/ipp064.0/nebulous/3c/69/431318460.gpc1:20100904:o5443g0495o:o5443g0495o.ota26.fits /data/ipp064.0/nebulous/3c/69/431318460.gpc1:20100904:o5443g0495o:o5443g0495o.ota26.fits.bad
    cp /data/ippb02.2/nebulous/3c/69/881863603.gpc1:20100904:o5443g0495o:o5443g0495o.ota26.fits /data/ipp064.0/nebulous/3c/69/431318460.gpc1:20100904:o5443g0495o:o5443g0495o.ota26.fits
    
    neb://ipp027.0/gpc1/20100909/o5448g0115o/o5448g0115o.ota26.fits
    neb://ipp027.0/gpc1/20100909/o5448g0112o/o5448g0112o.ota26.fits
    neb://ipp027.0/gpc1/20100815/o5423g0179o/o5423g0179o.ota26.fits
    neb://ipp027.0/gpc1/20100825/o5433g0154o/o5433g0154o.ota26.fits
    neb://ipp027.0/gpc1/20100914/o5453g0098o/o5453g0098o.ota26.fits
    neb://ipp027.0/gpc1/20100817/o5425g0480o/o5425g0480o.ota26.fits
    
    -- couple burntool tables empty and just set to dropprocessedimfile
    neb://ipp050.0/gpc1/20100530/o5346g0440o/o5346g0440o.ota61.burn.tbl
    neb://ipp036.0/gpc1/20100917/o5456g0083o/o5456g0083o.ota67.burn.tbl
    
    chiptool -dbname gpc1 -dropprocessedimfile -set_quality 42 -chip_id 334742 -class_id XY61
    chiptool -dbname gpc1 -updaterun -set_state full -chip_id 334742
    
    -- and non-existent file
    neb://ipp040.0/gpc1/20100820/o5428g0281o/o5428g0281o.ota15.fits
    
    chiptool -dbname gpc1 -dropprocessedimfile -set_quality 42 -chip_id 323179 -class_id XY15
    chiptool -dbname gpc1 -updaterun -set_state full -chip_id 323179
    regtool -updateprocessedimfile -set_ignored -exp_id 209855 -class_id XY15 -dbname gpc1
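
The 23:00 listing shows two instances of the same raw file with different md5 sums. A quick way to flag such pairs is sketched below. instances_differ is a hypothetical helper, and it assumes GNU md5sum is available on the host. A mismatch only says the copies differ; which one is good still has to be judged separately, as was done above.

```shell
# Hypothetical helper: exit 0 when two instance files have different md5
# checksums, 1 when they match, 2 when either file cannot be read.
instances_differ() (
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        echo "instances_differ: unreadable file" >&2
        exit 2
    fi
    # Read via stdin so md5sum prints only the checksum and a dash.
    sum_a=$(md5sum < "$1" | cut -d' ' -f1)
    sum_b=$(md5sum < "$2" | cut -d' ' -f1)
    # Print both checksums next to their paths for the log.
    echo "$sum_a $1"
    echo "$sum_b $2"
    [ "$sum_a" != "$sum_b" ]
)
```

Called with the two instance paths, it prints both sums and succeeds only when they differ, so it can gate a follow-up check of the pair.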
    

Thursday : 2011.12.22

Gene is czar today

  • 10:40 Restarted processing with new tag ipp-20111222
  • Bill fixed 3 stuck LAP runs due to unavailable instances
    neb-mv neb://ipp064.0/gpc1/LAP.ThreePi.20110809/2011/11/26/o5442g0047o.218397/SR_o5442g0047o.218397.wrp.314730.skycell.1399.057.wt.fits.trash
    neb-mv neb://ipp064.0/gpc1/LAP.ThreePi.20110809/2011/11/26/o5442g0047o.218397/SR_o5442g0047o.218397.wrp.314730.skycell.1399.057.wt.fits neb://ipp064.0/gpc1/LAP.ThreePi.20110809/2011/11/26/o5442g0047o.218397/SR_o5442g0047o.218397.wrp.314730.skycell.1399.057.wt.fits.trash
    neb-mv neb://ipp064.0/gpc1/LAP.ThreePi.20110809/2011/11/26/o5807g0062o.386352/SR_o5807g0062o.386352.cm.330210.XY26.mk.fits neb://ipp064.0/gpc1/LAP.ThreePi.20110809/2011/11/26/o5807g0062o.386352/SR_o5807g0062o.386352.cm.330210.XY26.mk.fits.trash
    
  • and fixed 2 bad instances of a couple of raw files and dropped one raw file that has been lost
    cp /data/ippb00.1/nebulous/3d/45/1179344012.gpc1:20100820:o5428g0209o:o5428g0209o.ota26.fits /data/ipp064.0/nebulous/3d/45/406513075.gpc1:20100820:o5428g0209o:o5428g0209o.ota26.fits
    cp /data/ippb02.1/nebulous/9b/78/896023980.gpc1:20100917:o5456g0049o:o5456g0049o.ota26.fits /data/ipp064.0/nebulous/9b/78/451703443.gpc1:20100917:o5456g0049o:o5456g0049o.ota26.fits
    chiptool -revertprocessedimfile -label LAP.ThreePi.20110809
    regtool -updateprocessedimfile -set_ignored -exp_id 210983 -class_id XY17
    
  • 12:51 Bill dropped magicDSRun 861429 (cam stage chips have been cleaned)
  • 13:30 Bill: warp skyfile 273690 skycell.2221.010 was causing its warp run to not complete. Its only input lost its rawImfile and warptool doesn't know how to deal with that situation yet. Set it to full state with quality 42
  • 14:00 Bill queued M31 data from 2011 July to be processed
  • 15:00 Mark setup SAS multi-filter staticsky runs to go to ps1-sas-cat on the datastore.
  • 19:44 Bill set M31.v4.2011 to inactive

Friday : 2011.12.23

Bill is czar today

  • 09:40 set M31.v4.2011 label to active
  • 09:45 replaced 3 corrupt rawImfile instances with good copies
    cp /data/ipp057.0/nebulous/43/69/1643955860.gpc1:20100620:o5367g0581o:o5367g0581o.ota26.fits /data/ipp064.0/nebulous/43/69/330856362.gpc1:20100620:o5367g0581o:o5367g0581o.ota26.fits
    cp /data/ippb01.0/nebulous/a4/f6/1161307957.gpc1:20100829:o5437g0554o:o5437g0554o.ota26.fits /data/ipp064.0/nebulous/a4/f6/421877742.gpc1:20100829:o5437g0554o:o5437g0554o.ota26.fits
    cp /data/ipp063.0/nebulous/0d/d7/1664967163.gpc1:20101112:o5512g0080o:o5512g0080o.ota26.fits /data/ipp064.0/nebulous/0d/d7/535798294.gpc1:20101112:o5512g0080o:o5512g0080o.ota26.fits
    
  • 12:00 Reran vpRuns which were damaged due to ipp064
  • 16:20 lowered M31 priority below LAP to give LAP a chance to have priority
  • 16:30 There are four lonely LAP stacks that are stuck because their inputs have been cleaned. Not knowing how LAP works, I wrote a script to set the inputs to be updated. (51 exposures)

Saturday : 2011.12.24

  • 10:22 There are a number of warps and destreak runs that are failing because the associated smf file is corrupt (ipp064). Bill "fixed" by copying the uncensored file on top of the bad instance.

Sunday : 2011.12.25

  • quiet day. Fixed a few bad raw instances on ipp064. There were good copies on other hosts.