PS1 IPP Czar Logs for the week 2011.09.26 - 2011.10.02

Monday : 2011.09.26

  • 01:50 Mark: stdscience pantasks down. restarted.

Tuesday : 2011.09.27

  • 10:20 Mark: ipp033 unresponsive, setting neb-host ipp033 down.
  • 11:20 ipp033 seemed to come back on its own without reboot, setting neb-host back to up.
  • 15:30 Mark: added a sample iostat gmetric for ipp053 to track IO for both normal processing and Magic. The first metric names were poorly chosen and broke updates. Fixing the names caused oddities in ganglia: CPU was plotted on the cluster load average summary plot, memory used on the cluster memory summary plot, and network in on the network summary plot. Reset the gmetric timeout for all the bad names and things look back to normal. Removed the badly created RRDs.
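
    Note on the gmetric naming problem above: a minimal sketch of one way to keep custom metric names out of trouble, by prefixing them and refusing names that ganglia already uses for built-in metrics. The helper names, the device name sda, the example value, and the reserved list are illustrative assumptions, not what was actually run on ipp053. The sketch only builds the gmetric command line and does not execute it.

```python
import re
import shlex

# Assumption: a few of ganglia's built-in metric names; illustrative, not exhaustive.
RESERVED = {
    "load_one", "load_five", "load_fifteen",
    "cpu_user", "cpu_system", "cpu_idle",
    "mem_free", "mem_total", "bytes_in", "bytes_out",
}


def metric_name(prefix, device, field):
    """Build a namespaced metric name such as iostat_sda_util (hypothetical scheme)."""
    name = f"{prefix}_{device}_{field}"
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError(f"unsafe metric name: {name!r}")
    if name in RESERVED:
        raise ValueError(f"name collides with a built-in metric: {name!r}")
    return name


def gmetric_argv(name, value, units, dmax=0):
    """Return the argument list for one gmetric call; nothing is executed here."""
    return [
        "gmetric",
        "--name", name,
        "--value", str(value),
        "--type", "float",
        "--units", units,
        "--dmax", str(dmax),  # metric lifetime in seconds; 0 means keep forever
    ]


if __name__ == "__main__":
    argv = gmetric_argv(metric_name("iostat", "sda", "util"), 12.5, "%", dmax=600)
    print(" ".join(shlex.quote(a) for a in argv))
```

    A non-zero dmax lets a mistakenly named metric expire on its own instead of lingering in the plots.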

Wednesday : 2011.09.28

  • 09:00 Mark: registration stuck? similar to earlier in the night, XY05, but earlier case fixed itself?
    regtool -updateprocessedimfile -exp_id 400406 -class_id XY05 -set_state pending_burntool -dbname gpc1
    
  • 16:50 Mark: with ippc18 overload, some pantasks_servers seem to be offline. working through restarting them.

Thursday : 2011.09.29

Friday : 2011.09.30

  • 02:00 Mark: Ken Smith reported being unable to access some files from the datastore. Was a failed mount of ipp042 on ippc17 the problem? The mount was working in the morning and the files were accessible. Did someone fix it? Cindy and Gavin have set up sudo permissions for me.
  • 08:00 Registration got stuck around 02:00 again. Ran regtool and restarted registration, and things are catching up again.
    regtool -updateprocessedimfile -exp_id 401101 -class_id XY45 -set_state pending_burntool -dbname gpc1
    
  • 09:00 and a few more
    regtool -updateprocessedimfile -exp_id 401108 -class_id XY45 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 401172 -class_id XY52 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 401192 -class_id XY15 -set_state pending_burntool -dbname gpc1
    
  • 09:30 more and done.
    regtool -updateprocessedimfile -exp_id 401227 -class_id XY45 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 401249 -class_id XY45 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 401260 -class_id XY21 -set_state pending_burntool -dbname gpc1
    
  • 12:30 ipp026 crash, set to down in nebulous. Serge rebooted. Kernel panic with ppImage. Back up and running.
  • 13:30 removing processing from stare00 (stdscience+distribution) for Cindy to install disks. Adding stare00 to hosts_ignore_stare in pantasks_hosts.input in case of pantasks restarts, until the tests with 3TB disks finish.
  • 18:00 MD.V3 set up for MD03,05,06,07. V3 for MD04,08,09 is ready but will not be set up until the pre-V3 for MD04 is retired/purged and MD08,09 finish active observations.
  • 23:00 registration stuck. Restarted pantasks and it ran again for ~18 exposures, then stalled again with the usual check_burntool issue. Ran
    regtool -updateprocessedimfile -exp_id 401454 -class_id XY56 -set_state pending_burntool -dbname gpc1
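
    Since the same regtool reset is being typed by hand for each stuck imfile, here is a minimal sketch that generates the command lines from a list of (exp_id, class_id) pairs. It only prints the commands in the form used above and does not run regtool; the function name and the XYnn check are assumptions for illustration.

```python
import re
import shlex


def regtool_reset_command(exp_id, class_id, dbname="gpc1"):
    """Format one reset command in the same form as the entries in this log."""
    # Assumption: class ids look like XY followed by two digits, as in this log.
    if not re.fullmatch(r"XY\d\d", class_id):
        raise ValueError(f"unexpected class_id: {class_id!r}")
    argv = [
        "regtool", "-updateprocessedimfile",
        "-exp_id", str(int(exp_id)),
        "-class_id", class_id,
        "-set_state", "pending_burntool",
        "-dbname", dbname,
    ]
    return " ".join(shlex.quote(a) for a in argv)


if __name__ == "__main__":
    # Pairs taken from Friday's entries above.
    stuck = [(401101, "XY45"), (401108, "XY45"), (401172, "XY52"), (401192, "XY15")]
    for exp_id, class_id in stuck:
        print(regtool_reset_command(exp_id, class_id))
```

    Printing rather than executing leaves a chance to check the list before pasting it into a shell.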
    

Saturday : 2011.10.01

  • 01:00 Mark: ipp053 getting hammered a bit. Test run of MD03.V3 had a missing OTA file, copied from ippb02
    cp /data/ippb02.1/nebulous/71/5d/1197044579.gpc1:20091229:o5194g0536o:o5194g0536o.ota76.fits /data/ipp053.0/nebulous/71/5d/118786341.gpc1:20091229:o5194g0536o:o5194g0536o.ota76.fits
    
  • 02:00 MD03.GR0-y thrown into queue as low priority to help fill in gaps when LAP is in stack mode.
  • 02:30-04:00 something seems to be squelching pantasks processing even though there is plenty to do. Restarted stdscience and it then took nearly 30 min to fully queue up new jobs. Network, condor?
  • 08:00 registration really being fussy.
    regtool -updateprocessedimfile -exp_id 401649 -class_id XY64 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 401649 -class_id XY26 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 401684 -class_id XY45 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 401739 -class_id XY33 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 401753 -class_id XY32 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 401753 -class_id XY10 -set_state pending_burntool -dbname gpc1
    
  • 08:30 MD03.GR0 55/130-ish are failing in chip, missing OTA76 on ipp053
  • 13:00 chip.revert.off while fixing non-existent OTA76 entries in MD03.GR0 from the single copy on ippbXX
    file:///data/ipp053.0/nebulous/df/cd/118786339.gpc1:20091229:o5194g0535o:o5194g0535o.ota76.fits
    file:///data/ipp053.0/nebulous/36/db/118786568.gpc1:20091229:o5194g0537o:o5194g0537o.ota76.fits
    file:///data/ipp053.0/nebulous/6b/73/118786723.gpc1:20091229:o5194g0538o:o5194g0538o.ota76.fits
    file:///data/ipp053.0/nebulous/79/da/118786989.gpc1:20091229:o5194g0539o:o5194g0539o.ota76.fits
    file:///data/ipp053.0/nebulous/16/75/118787230.gpc1:20091229:o5194g0540o:o5194g0540o.ota76.fits
    file:///data/ipp053.0/nebulous/fa/32/118787160.gpc1:20091229:o5194g0541o:o5194g0541o.ota76.fits
    file:///data/ipp053.0/nebulous/18/8a/118787813.gpc1:20091229:o5194g0544o:o5194g0544o.ota76.fits
    file:///data/ipp053.0/nebulous/40/db/118787833.gpc1:20091229:o5194g0545o:o5194g0545o.ota76.fits
    file:///data/ipp053.0/nebulous/23/f0/118787921.gpc1:20091229:o5194g0546o:o5194g0546o.ota76.fits
    file:///data/ipp053.0/nebulous/76/ad/118788309.gpc1:20091229:o5194g0547o:o5194g0547o.ota76.fits
    file:///data/ipp053.0/nebulous/ba/bd/118788394.gpc1:20091229:o5194g0548o:o5194g0548o.ota76.fits
    file:///data/ipp053.0/nebulous/db/60/118788673.gpc1:20091229:o5194g0549o:o5194g0549o.ota76.fits
    file:///data/ipp053.0/nebulous/5c/d0/118788786.gpc1:20091229:o5194g0550o:o5194g0550o.ota76.fits
    file:///data/ipp053.0/nebulous/37/06/118787573.gpc1:20091229:o5194g0542o:o5194g0542o.ota76.fits
    file:///data/ipp053.0/nebulous/c8/39/118787631.gpc1:20091229:o5194g0543o:o5194g0543o.ota76.fits
    file:///data/ipp053.0/nebulous/63/5b/119330767.gpc1:20091230:o5195g0354o:o5195g0354o.ota76.fits
    file:///data/ipp053.0/nebulous/27/20/119331052.gpc1:20091230:o5195g0355o:o5195g0355o.ota76.fits
    file:///data/ipp053.0/nebulous/e3/58/119332409.gpc1:20091230:o5195g0356o:o5195g0356o.ota76.fits
    file:///data/ipp053.0/nebulous/bd/28/119332776.gpc1:20091230:o5195g0357o:o5195g0357o.ota76.fits
    file:///data/ipp053.0/nebulous/b5/bf/119333111.gpc1:20091230:o5195g0358o:o5195g0358o.ota76.fits
    file:///data/ipp053.0/nebulous/46/08/119334122.gpc1:20091230:o5195g0359o:o5195g0359o.ota76.fits
    file:///data/ipp053.0/nebulous/e4/62/119335070.gpc1:20091230:o5195g0360o:o5195g0360o.ota76.fits
    file:///data/ipp053.0/nebulous/5a/71/119336071.gpc1:20091230:o5195g0361o:o5195g0361o.ota76.fits
    file:///data/ipp053.0/nebulous/e7/a4/119337493.gpc1:20091230:o5195g0362o:o5195g0362o.ota76.fits
    file:///data/ipp053.0/nebulous/20/b4/119338667.gpc1:20091230:o5195g0363o:o5195g0363o.ota76.fits
    file:///data/ipp053.0/nebulous/3f/c7/119340708.gpc1:20091230:o5195g0364o:o5195g0364o.ota76.fits
    file:///data/ipp053.0/nebulous/79/83/119342618.gpc1:20091230:o5195g0365o:o5195g0365o.ota76.fits
    file:///data/ipp053.0/nebulous/e0/49/119344276.gpc1:20091230:o5195g0366o:o5195g0366o.ota76.fits
    file:///data/ipp053.0/nebulous/00/97/119346137.gpc1:20091230:o5195g0367o:o5195g0367o.ota76.fits
    file:///data/ipp053.0/nebulous/22/ed/119347916.gpc1:20091230:o5195g0368o:o5195g0368o.ota76.fits
    file:///data/ipp053.0/nebulous/3b/65/119349599.gpc1:20091230:o5195g0369o:o5195g0369o.ota76.fits
    file:///data/ipp053.0/nebulous/08/ce/120763721.gpc1:20100105:o5201g0126o:o5201g0126o.ota76.fits
    file:///data/ipp053.0/nebulous/3d/c1/120764673.gpc1:20100105:o5201g0128o:o5201g0128o.ota76.fits
    file:///data/ipp053.0/nebulous/a0/3f/120765333.gpc1:20100105:o5201g0131o:o5201g0131o.ota76.fits
    file:///data/ipp053.0/nebulous/5e/12/120765466.gpc1:20100105:o5201g0132o:o5201g0132o.ota76.fits
    file:///data/ipp053.0/nebulous/dd/48/120764963.gpc1:20100105:o5201g0130o:o5201g0130o.ota76.fits
    file:///data/ipp053.0/nebulous/51/d8/120765577.gpc1:20100105:o5201g0133o:o5201g0133o.ota76.fits
    file:///data/ipp053.0/nebulous/e1/1f/120764795.gpc1:20100105:o5201g0129o:o5201g0129o.ota76.fits
    file:///data/ipp053.0/nebulous/6a/f7/120763838.gpc1:20100105:o5201g0127o:o5201g0127o.ota76.fits
    file:///data/ipp053.0/nebulous/9f/52/178536084.gpc1:20100301:o5256g0346o:o5256g0346o.ota76.fits
    file:///data/ipp053.0/nebulous/8f/da/178540557.gpc1:20100301:o5256g0347o:o5256g0347o.ota76.fits
    file:///data/ipp053.0/nebulous/33/7d/178541035.gpc1:20100301:o5256g0348o:o5256g0348o.ota76.fits
    file:///data/ipp053.0/nebulous/a3/ce/178542507.gpc1:20100301:o5256g0349o:o5256g0349o.ota76.fits
    file:///data/ipp053.0/nebulous/1a/9c/180013232.gpc1:20100302:o5257g0030o:o5257g0030o.ota76.fits
    file:///data/ipp053.0/nebulous/91/3f/180036206.gpc1:20100302:o5257g0032o:o5257g0032o.ota76.fits
    file:///data/ipp053.0/nebulous/d2/61/781339129.gpc1:20110318:o5638g0126o:o5638g0126o.ota76.fits
    file:///data/ipp053.0/nebulous/60/f1/781340553.gpc1:20110318:o5638g0127o:o5638g0127o.ota76.fits
    file:///data/ipp053.0/nebulous/4c/b5/781371801.gpc1:20110318:o5638g0130o:o5638g0130o.ota76.fits
    file:///data/ipp053.0/nebulous/93/db/788219280.gpc1:20110321:o5641g0335o:o5641g0335o.ota76.fits
    file:///data/ipp053.0/nebulous/a3/a2/788221467.gpc1:20110321:o5641g0338o:o5641g0338o.ota76.fits
    file:///data/ipp053.0/nebulous/ca/7e/788221066.gpc1:20110321:o5641g0337o:o5641g0337o.ota76.fits
    file:///data/ipp053.0/nebulous/4f/7d/788222521.gpc1:20110321:o5641g0340o:o5641g0340o.ota76.fits
    file:///data/ipp053.0/nebulous/44/8b/788222414.gpc1:20110321:o5641g0339o:o5641g0339o.ota76.fits
    file:///data/ipp053.0/nebulous/42/df/788223309.gpc1:20110321:o5641g0342o:o5641g0342o.ota76.fits
    file:///data/ipp053.0/nebulous/df/f5/788222899.gpc1:20110321:o5641g0341o:o5641g0341o.ota76.fits
    
  • 14:30 chip.revert.on, MD03.GR0 chip processing fill in while LAP stacks running.
  • 14:40 since fixing MD03 anyway, also fixed 2 faulting files in LAP
    neb://ipp025.0/gpc1/20100910/o5449g0446o/o5449g0446o.ota76.fits 
    neb://ipp025.0/gpc1/20100910/o5449g0449o/o5449g0449o.ota76.fits
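
    For tallying the missing OTA76 files listed above, a minimal sketch that splits the on-disk instance paths into their visible parts and counts files per observation date. The field names (volume, numeric_id, and so on) are labels chosen here from the path layout seen in this log, not official nebulous terminology, and the sketch does not touch nebulous or the database.

```python
import re
from collections import Counter

# Pattern inferred from the file:// paths pasted in this log (an assumption).
INSTANCE_RE = re.compile(
    r"^file:///data/(?P<volume>[^/]+)/nebulous/"
    r"(?P<dir1>[0-9a-f]{2})/(?P<dir2>[0-9a-f]{2})/"
    r"(?P<numeric_id>\d+)\.(?P<camera>[^:]+):(?P<date>\d{8}):"
    r"(?P<exposure>[^:]+):(?P<basename>.+)\.(?P<ota>ota\d+)\.fits$"
)


def parse_instance(uri):
    """Return a dict of path components, or raise ValueError if the form differs."""
    match = INSTANCE_RE.match(uri.strip())
    if match is None:
        raise ValueError(f"unrecognized instance path: {uri!r}")
    return match.groupdict()


def count_by_date(uris):
    """Count instance paths per observation date (YYYYMMDD string)."""
    return Counter(parse_instance(u)["date"] for u in uris)


if __name__ == "__main__":
    sample = [
        "file:///data/ipp053.0/nebulous/df/cd/118786339.gpc1:20091229:o5194g0535o:o5194g0535o.ota76.fits",
        "file:///data/ipp053.0/nebulous/36/db/118786568.gpc1:20091229:o5194g0537o:o5194g0537o.ota76.fits",
        "file:///data/ipp053.0/nebulous/63/5b/119330767.gpc1:20091230:o5195g0354o:o5195g0354o.ota76.fits",
    ]
    for date, n in sorted(count_by_date(sample).items()):
        print(date, n)
```

    Raising on an unrecognized path makes a stray line in a pasted list show up immediately instead of being silently dropped.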
    

Sunday : 2011.10.02

  • 11:00 MD03.GR0: ~50 more OTA76 files non-existent on ipp053. chip.revert.off while fixing.
  • 15:30 old test MD04.V3 warps set to goto_purged, MD04.GR0 test run.
  • 20:00 MD03.GR0/refstack.20111010 stack sample, MD04.GR0 chip-warp loaded at bottom priority as backup processing if lighter observing or registration stalls.
  • 22:30 plenty to do in nightly science but the cluster seems under-utilized. Was stdscience restarted today? Restarted it now and it looks better (all nodes busy instead of just 100-150).