PS1 IPP Czar Logs for the week 2014.10.20 - 2014.10.26

(Up to PS1 IPP Czar Logs)

Monday : 2014.10.20

  • 06:00 Bill set pstamp pantasks to stop to prepare for periodic restart
    • 06:19 restarted
  • 18:50 MEH: ippsXX off in lanl stdlocal -- using for MOPS test chunk reprocess (ippmops@ipps01 client)
  • 20:30 MEH: someone restarted lanl stdlocal and left ippsXX on.. noted here and in email; can't be more informative than that..
  • 23:55 MEH: returning ippsXX

Tuesday : 2014.10.21

  • 16:25 MEH: ippsXX used for MOPS test reprocessing again -- ippsXX off in lanl stdlocal. if you restart lanl stdlocal, please be sure to turn them off before setting run...
  • 21:00 MEH: turning ippsXX back on in lanl stdlocal
  • 21:06 EAM: restarting stdlocal

Wednesday : 2014.10.22

  • 12:40 CZW: restarting stdlanl to see if that will sort out the poll task not polling as much as it could.
  • 20:30 EAM: restarting stdlocal

Thursday : 2014.10.23

  • 20:30 MEH: noticed processing slowing and stack turned off in lanl stdlocal; guessing someone is going to do a regular restart, so won't touch it
  • 20:55 EAM: yes, restarting in just a moment...
  • 21:20 EAM: restart complete.

Friday : 2014.10.24

  • 10:00 MEH: starting the restart of all nightly processing pantasks in prep for possible observations tonight
  • 11:00 MEH: ippsXX out of lanl stdlocal for MOPS diffim reprocessing --
  • 12:20 MEH: ippsXX back in lanl stdlocal
  • 14:20 MEH: lanl stdlocal not running stacks, looks like could use regular restart
  • 14:27 EAM: stopping and restarting stdlocal. I will set up stdlocal to block if too many nightlyscience jobs are outstanding.
  • 14:35 EAM: I've turned on the tasks which remove storage host for the night (20:00 HST - 05:30 HST) to avoid overloading. Since I also have the code on to block if too many nightly science jobs are outstanding, this is probably a bit conservative (thus the restrictive hours).
  • 14:42 EAM: I've also paused 2000 chipRuns for lap.pv3.20140730.ipp -- the ippdb01 load may partly be from the 3000 chips ready to run.
  • 16:30 MEH: rolling new nightly ops tag ipp-20141024
  • 20:00 MEH: just before sci data through registration all-clear for new tag -- changing psconfig ipp-20141024 and restarting distribution, registration, stack, stdscience, summitcopy, cleanup
  • 20:20 MEH: odd fault 3 for chip I/O on DETREND for o6955g0060o -- revert okay. also seeing timeouts on regtool..
  • 20:25 MEH: removing the update labels for tonight del.update.labels
  • 20:35 MEH: seeing other faults I/O based -- lanl stdlocal to stop for a bit to see
  • 21:20 MEH: many stage faults when lanl stdlocal was turned back on, even still w/o the ippsXX machines -- back to stop
  • 21:55 MEH: another long-running OTA (51) -- looks like a nice star cluster, so ok..
  • 22:00 MEH: suspect mysql on ippdb01 needs a reboot -- will try to wait until the morning. MOPS has at least one chunk out now to look at so will turn on lanl stdlocal for a bit again w/o ippsXX in order to lighten the load
  • 23:00 MEH: ippdb01 seems to be doing better than before w/ lanl stdlocal running, some shadow processing going on?

Saturday : 2014.10.25

  • 11:00 MEH: mysql restart on ippdb01 cancelled -- if timeouts again tonight will just turn down lanl stdlocal again
  • 16:10 MEH: ippsXX out of lanl stdlocal for mops test processing --
  • 17:50 MEH: returning ippsXX back to on in lanl stdlocal
  • 22:00 EAM: summit is fogged in. I'm turning on storage hosts for stdlocal.

Sunday : 2014.10.26

  • 17:05 MEH: czarpoll on ippc11 needs to be started again whenever mysql on ippdb01 is restarted; same for roboczar if email warnings are wanted
  • 21:05 MEH: ipp036 down and not responsive, so power cycled -- out of processing, neb-host down until back up -- back up, now for the clean-up...
  • 21:45 MEH: summitcopy and stdsci needed restart to clear unfaulted jobs on ipp036..
    • diverting ippsXX to stdsci to catch up -- ippsXX off in lanl stdlocal
  • 23:00 MEH: returning ippsXX back on in lanl stdlocal