PS1 IPP Czar Logs for the week 2015.06.15 - 2015.06.21


Monday : 2015.06.15

  • 06:19 Bill: stdscience is running sluggishly. I'm going to restart it.
  • 06:37 4 diffs for skycell.0697.027 are failing repeatedly; the input stack is apparently bad. Set quality faults for that skycell for diff_ids 1161536, 1161554, 1161573, and 1161061.
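    • a minimal sketch of setting those quality faults -- the difftool flags below are an assumption modeled on the camtool -set_quality usage later in this log, and the quality code 42 is only illustrative:
      # hypothetical: mark the failing diff skycell with a quality fault for each diff_id (flags unverified)
      for d in 1161536 1161554 1161573 1161061; do
          difftool -dbname gpc1 -updatediffskyfile -diff_id $d -skycell_id skycell.0697.027 -set_quality 42 -fault 0
      done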
  • 20:00 EAM: restarting ipp pantasks for smoother nightly processing.

Tuesday : 2015.06.16

  • 07:40 MEH: some WS diffs are stalled because the primary neb-instance of the input stack is on ippb06, e.g.,
    1162409 	skycell.1480.080 	ThreePi.WS.nightlyscience 	ThreePi.20150616
    neb://any/gpc1/LAP.ThreePi.20130717/2014/02/15/RINGS.V3/skycell.1480.080/RINGS.V3.skycell.1480.080.stk.3112920.unconv.fits
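    • a minimal sketch of checking where the instances of that stack file live -- the neb-locate usage here is an assumption (exact flags unverified):
      # hypothetical: list the instances of the stack and which hosts hold them
      neb-locate --all neb://any/gpc1/LAP.ThreePi.20130717/2014/02/15/RINGS.V3/skycell.1480.080/RINGS.V3.skycell.1480.080.stk.3112920.unconv.fits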
    
  • 07:50 MEH: ganglia on ipp077 and ipp080 apparently hasn't been reporting since early May...
    /etc/init.d/gmond restart
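    • a small sketch of restarting the daemon on both nodes in one go (assumes root ssh access to the nodes):
      # restart the ganglia monitoring daemon on each non-reporting host
      for h in ipp077 ipp080; do
          ssh $h /etc/init.d/gmond restart
      done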
    
  • 11:50 MEH: two very old gpc2 exposures are repeatedly failing in registration, possibly due to a bad OTA -- setting them to state drop in newExp; they can be dealt with later if wanted
    update newExp set state="drop" where exp_id=1806;
    update newExp set state="drop" where exp_id=1021;
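    • and a quick follow-up check on the same table to confirm both exposures are now dropped:
      select exp_id, state from newExp where exp_id in (1806, 1021);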
    
  • 17:45 CZW: Investigation of a sudden string of failures in the phase 2 shuffle indicates that ipp066 isn't automounting anything that isn't already mounted.
    • Gene fixed this: /etc/init.d/autofs zap && /etc/init.d/autofs start
  • 18:00 MEH: restarting registration to pick up the long-standing failing exposures in revert and to test reducing the exec time: at 1800 s it takes ~2 hrs to get back to gpc1, and MOPS processing can get quite a ways behind by then
  • 21:50 MEH: ipp067 has very high load and a very low processing rate with many errors... putting it into repair while looking at the errors -- rate is better, leaving it in repair
  • 23:20 MEH: ipp068, 069 are having periods of very high load causing faults... may need to put them into repair as well
  • 00:00 MEH: regularly high load with ever-higher spikes... into repair -- something is running on / allocated to these nodes too much
    • then ipp070 -- so repair isn't helping other than shifting the load to a new host -- something needs to be turned down...
    • all back up for the night, since putting a node into repair just moves the load to another node -- according to ganglia, this wasn't happening last night
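    • for reference, a sketch of the repair/up cycling described above -- the neb-host argument order is an assumption, so check its usage before running:
      # hypothetical: take an overloaded node out of nebulous allocation, then bring it back once load settles
      neb-host ipp067 repair
      neb-host ipp067 up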

Wednesday : 2015.06.17

  • 07:40 MEH: nightly still not finished...
  • 07:55 MEH: clearing some stalled warps+diffs
  • 08:20 MEH: nightly still not finished due to faults from overloaded nodes
  • 08:45 MEH: finally moved the faulting camera-stage exposures on to warp after playing whack-a-mole putting overloaded nodes into repair...
  • 09:00 MEH: nightly finally finished; set neb-host up for the overloaded nodes
  • 11:45 EAM: ippdb06 mysql is falling behind wrt ippdb08. I'm going to stop and restart mysql on the slave and in the process update the innodb memory usage to match.
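    • for reference, a sketch of the usual checks/settings involved -- the buffer pool value is illustrative, not the actual ippdb08 configuration:
      -- in the mysql client on the slave (ippdb06): see how far replication has fallen behind
      SHOW SLAVE STATUS\G    -- look at Seconds_Behind_Master
      # in the slave's my.cnf before restarting mysqld, match the master's InnoDB memory, e.g.
      # innodb_buffer_pool_size = 32G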
  • 16:30 MEH: ipp061 partially degraded, set repair for a while

Thursday : 2015.06.18

  • CZW: added one warp/chip update with label czw.PV1.stsci to stdscience.
  • 15:20 CZW: restarting the PV3 phase-2 shuffle. It has been hammering the ipp067-70 nodes, and this looks to be because a lot of the early PV3 warp data lives there. I've randomized the task list and will restart with that randomized list, which should spread the load over the ~30 nodes and reduce the impact on any individual node. Hopefully this will also run faster and free disk space for nightly use.
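    • a minimal sketch of the randomization step, assuming the shuffle worklist is a flat file of tasks (filenames here are hypothetical):
      # shuffle the pending task list so consecutive tasks don't all land on the same few nodes
      shuf phase2_shuffle.tasks > phase2_shuffle.tasks.random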

Friday : 2015.06.19

  • 17:55 CZW: Stopping and restarting the ipp user pantasks so they'll be fresh for the weekend.

Saturday : 2015.06.20

  • 12:35 EAM: repeated failures for one exposure (cannot find a zero point) -- I've invalidated it:
    camtool -dbname gpc2 -updateprocessedexp -cam_id 8394 -fault 0 -set_quality 42
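    • a quick sanity check that the quality flag took -- the table/column names below are assumptions based on the camtool flags above (gpc2 database):
      select cam_id, fault, quality from camProcessedExp where cam_id = 8394;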
    

Sunday : 2015.06.21