PS1 IPP Czar Logs for the week 2013.09.02 - 2013.09.08


Monday : 2013.09.02

  • 00:10 MEH: looks like registration is stuck on something, 36 behind
  • 11:30 MEH: Serge reported a warp continues faulting, looks like no good sources (1 psf source) -- edge skycell so set quality 42 for now.. exposure publishing for MOPS now
    warptool -dbname gpc1 -updateskyfile -warp_id 836928 -skycell_id skycell.049 -fault 0 -set_quality 42
  • 11:40 MEH: LAP stalled, good time for regular restart stdsci and clearing all LAP chip faults so processing can resume..
  • 12:10 MEH: MD01 warp qual held up stacks and SSdiff as well -- stacks finished and tweak_ssdiff to run
  • 19:10 MEH: LAP nearly out of chips again, ~50 to fix. more single chips only on ipp047 -- 5 more added to the list, for 17 total
    | neb://ipp047.0/gpc1/20111120/o5885g0116o/o5885g0116o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111125/o5890g0407o/o5890g0407o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0203o/o5891g0203o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0117o/o5891g0117o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0127o/o5891g0127o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0222o/o5891g0222o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111125/o5890g0419o/o5890g0419o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111120/o5885g0127o/o5885g0127o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111120/o5885g0190o/o5885g0190o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111125/o5890g0412o/o5890g0412o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111120/o5885g0175o/o5885g0175o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0226o/o5891g0226o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0228o/o5891g0228o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111125/o5890g0508o/o5890g0508o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111125/o5890g0496o/o5890g0496o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0260o/o5891g0260o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0257o/o5891g0257o.ota67.fits |
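
A minimal sketch, not part of the original log, of how a pasted list like the one above could be tallied by host, OTA, and date. The URI layout (host.volume / gpc1 / date / exposure / exposure.otaNN.fits) is inferred only from the entries shown here, and `parse_neb_rows` is a hypothetical helper, not an IPP tool. The sample uses the first three rows of the list.

```python
import re
from collections import Counter

# Layout assumed from the entries in this log:
# neb://<host>.<volume>/gpc1/<yyyymmdd>/<exposure>/<exposure>.<ota>.fits
NEB_RE = re.compile(
    r"neb://(?P<host>\w+)\.(?P<vol>\d+)/gpc1/(?P<date>\d{8})/"
    r"(?P<exp>\w+)/(?P=exp)\.(?P<ota>ota\d{2})\.fits"
)

def parse_neb_rows(text):
    """Return one dict per neb:// URI found in the pasted table text."""
    return [m.groupdict() for m in NEB_RE.finditer(text)]

sample = """
| neb://ipp047.0/gpc1/20111120/o5885g0116o/o5885g0116o.ota67.fits |
| neb://ipp047.0/gpc1/20111125/o5890g0407o/o5890g0407o.ota67.fits |
| neb://ipp047.0/gpc1/20111126/o5891g0203o/o5891g0203o.ota67.fits |
"""

rows = parse_neb_rows(sample)
# Count entries per (host, OTA) pair and per observation date.
by_host_ota = Counter((r["host"], r["ota"]) for r in rows)
by_date = Counter(r["date"] for r in rows)
print(by_host_ota)
print(by_date)
```

Running the same tally over the full list would show whether the faults stay confined to one host and one OTA, as they appear to here.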

Tuesday : 2013.09.03

Bill is czar today

  • 07:15 registration is stuck. Can't see why everybody has burntool_state = -14. Restarting pantasks.
  • 07:25 Ah there it is. burntool was stuck because one chip had burntool_state = -1 and data_state = check_burntool. too early in the morning.
  • forgot to mention, ipp028 claimed to be down in ganglia but was up. It looks like it restarted around 6am today. It is now showing up in ganglia
  • 09:30 We got reports of slow access to the data store last night. This may have been related to the ipp028 outage, but I haven't investigated. For good measure though I have restarted apache on ippc17.
  • 09:35 nightly science chips are done. Setting to avoid because there are 90 or so LAP chips that are faulting due to ipp047 problems.
    • MEH -- there is a tweak_chiprevert file to turn off chip.revert except for a couple times during nightly processing for when stdsci is restarted
  • 11:49 restarted update pantasks earlier but forgot to setup and run. fixed that
  • 12:10 restarted cleanup pantasks.
    • Testing cleanup of old distRuns with outroot like '/%' (non-nebulous). diff.cleanup is off for now while I make sure cleanup of old directories doesn't wreak havoc. Looks good so far.
  • 12:01 restarted stdscience but forgot to mention it. diff cleanup started up again
  • 15:45 MEH: as discussed last meeting, all MD04 being updated this week (label MD04.pv2.20130903). will be watching, so no need to revert etc
  • 18:18 removed the goto_cleaned.rerundiff label from cleanup. Since the diffRuns with that label are older and being rerun, there is less space to recover than from more recent (higher diff_id) runs which have images to clean up.
  • 19:00 MEH: ippdb02 and ippdb04 are still increasing sec behind master, pausing cleanup and stdscience before nightly to see if decreases a bit
  • 19:30 MEH: ippdb04 slowly catching up ~2ks, ippdb02 still 110900s behind (suspect doing backup still) -- leaving cleanup off for a while longer, LAP label out, nightly starting
    • with nightly running, ippdb04 barely steady ~22k behind, turning cleanup and LAP back on
    • iperf running on ippdb04 @400% CPU.. kill -STOP and see if helps (running since 7/30, since last boot)
  • 19:31 turned chip.revert.on
    • MEH -- and set the tweak_chiprevert so will only activate ~midnight and 6am, maybe an additional hour should be added

Wednesday : 2013.09.04

Bill is czar today

  • 04:30 reverted about 6 faulted nightlyscience chips by hand
  • 04:34 chip stage is done for 221 new exposures
  • 07:50 MEH: ippdb04 has at least caught up, starting MD04.pv2 again
  • 10:30 setting stdscience to stop to prepare for daily restart
  • 10:36 stdscience restarted. ippc63 removed from host list
  • 12:45 ipp028 rebooted itself. Message on the console mce: [Hardware Error]: CPU 6: Machine Check Exception: 5 Bank 5: b200000080200e0f
  • 13:57 restarted all pantasks. waiting for gavin to restart ipp028 before setting to run
  • 14:26 Gavin rebooted ipp028 and set pantasks to 'run'
  • 17:45 MEH: tweak_chiprevert to only run during nightly processing.. interrupted attempt to fix MD04 chips..

Thursday : 2013.09.05

  • 10:04 Bill: queued 37 STS exposures for chip processing. Added label STS.test.20130905 to stdscience
  • 10:30 Bill: recovered 4 sts raw files whose only available instance was missing on ippb02. Also 4 burntool tables.
  • 10:45 MEH: stdsci in need of regular restart, ssdiff done so doing now
  • 11:30 MEH: restarting deepstack with 2x compute3 from stack to do MD09.refstack.20130613_1dg requested by Ken and finish MD01_1dg
  • 16:30 MEH: lots of red on disk, sending remaining GR0 distribution of reprocessed stacks to cleanup, many misc MD chip+warp to cleanup

Friday : 2013.09.06

  • 11:50 MEH: MD04.pv2 finished updating, back to full LAP updates
  • 12:00 CZW: to attempt to get the nebulous database synchronization time down.
  • 12:20 CZW: stdscience restart
  • 13:30 cleanup is off to track ippdb02 nebdb catchup -- must be mostly caught up by end of weekend for backups, at current rate will take +3 days..
  • 15:20 MEH: ippdb02 catchup rate not really improved, actually degraded with cleanup off, so it likely depends on which cmds mysql is running.. Gavin finds write throughput rate similar between ippdb00 and ippdb02, so mysql probably could use a restart, but not on a friday afternoon unless absolutely necessary..

Saturday : 2013.09.07

  • 00:20 MEH: ippdb02 caught up, turning cleanup back on
  • 12:50 MEH: all stalled LAP into stacks now, good time for regular restart of stdsci
  • 17:10 MEH: LAP stalled, fixing chips..
  • 19:50 MEH: registration stuck, revert cleared.. 28 exposures behind (all)
  • 23:40 MEH: looks like ipp033 is down.. nothing on console.. guess will power cycle it.. -- nightly processing continues again

Sunday : 2013.09.08

  • 08:30 MEH: looks like something is stuck in registration.. again.. shows
    o6543g0445o  XY11 0 check_burntool neb://ipp056.0/gpc1/20130908/o6543g0445o/o6543g0445o.ota11.fits
    -- resetting back to pending_ clears.. thought this was setup for auto-running in the task.. 
    regtool -updateprocessedimfile -exp_id 653563 -class_id XY11 -set_state pending_burntool -dbname gpc1
  • 09:10 nightly through.. regular unsticking LAP for stacks and restarting stdsci
  • 20:50 MEH: again hanging in registration -- 45 exposures behind..
    o6544g0078o  XY27 0 check_burntool neb://ipp065.0/gpc1/20130909/o6544g0078o/o6544g0078o.ota27.fits
    regtool -updateprocessedimfile -exp_id 653666 -class_id XY27 -set_state pending_burntool -dbname gpc1
    -- and two imfiles reverted
    regtool -revertprocessedimfile -dbname gpc1
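
The manual unsticking above follows the same pattern both times, so it can be sketched as a small helper that builds (but does not run) the regtool command. This is an illustration only: the column meanings of the status line are guessed from the two examples in this log (the third column is not interpreted), the exp_id is not in the status line and has to be supplied separately, and `parse_stuck_line` and `regtool_reset_command` are hypothetical names. The regtool flags are copied from the commands logged above.

```python
import shlex

def parse_stuck_line(line):
    """Split a status line such as
    'o6543g0445o  XY11 0 check_burntool neb://...' into named fields.
    The third column is kept out because its meaning is not stated in the log."""
    exp_name, class_id, _unknown, data_state, uri = line.split()
    return {"exp_name": exp_name, "class_id": class_id,
            "data_state": data_state, "uri": uri}

def regtool_reset_command(exp_id, class_id, dbname="gpc1"):
    """Build the command string used in the log to set one imfile
    back to pending_burntool. Nothing is executed here."""
    args = ["regtool", "-updateprocessedimfile",
            "-exp_id", str(exp_id),
            "-class_id", class_id,
            "-set_state", "pending_burntool",
            "-dbname", dbname]
    return " ".join(shlex.quote(a) for a in args)

line = ("o6543g0445o  XY11 0 check_burntool "
        "neb://ipp056.0/gpc1/20130908/o6543g0445o/o6543g0445o.ota11.fits")
fields = parse_stuck_line(line)

# Only suggest the reset when the imfile is sitting in check_burntool.
if fields["data_state"] == "check_burntool":
    cmd = regtool_reset_command(653563, fields["class_id"])
    print(cmd)
```

Printing the command instead of executing it keeps a person in the loop, which matches how the resets were done by hand here.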