Monday : 2012.09.24

  • CZW: 12:00 Stopped system for stdscience reboot.
  • CZW: 12:04 Rebooted ipp020 due to mounting issues that were resistant to force.umount. Rebooted using the call:
    reboot -f
  • CZW: 12:10 autofs was not correct on ipp020. /etc/init.d/autofs restart resolved the problem.
  • CZW: 12:12 Repeated this procedure on ipp063, once ipp020 returned.
  • CZW: 12:16 Restarted stdscience.
  • CZW: 13:50 Restarted ipp020/gmond to fix the "ganglia purple memory bug" that Mark pointed out.

Tuesday : 2012.09.25

  • 08:00 Bill: We got some data last night. Registration has a stuck job on ipp010, which is showing the "ganglia purple memory bug". Killed the register_imfile. Rebooting ipp010.
  • 08:05 Reset the registration pending book. Set a fault for the ipp010 victim 521411:ota26, then reverted it. burntool is now feverishly working to catch up.

Wednesday : 2012.09.26

Serge is czar

  • 09:00 (Serge): rsync mysql@ippdb02 to ippc63 complete. Starting rsync of mysql@ippc63 to ippc61
  • 15:10 MEH: turning off the deepstack compute3 group in stack pantasks for a local/trunk deepstack pantasks setup for the MD09.GR0 staticsky run
  • 16:25 CZW: restarting stdscience because it looks slow.
  • 23:40 MEH: looks like ipp018 has lost some mounts. Removing it from processing to get nightly science back online and to avoid rebooting until tomorrow. Also putting ipp018 into repair and restarting summitcopy and registration.
    • ipp018 is bypassed, registration going again but will see how long it lasts. ipp018 needs a reboot in the morning.

Thursday : 2012.09.27

  • 04:30 Bill: We have our nightly registration hangs. Tonight's troublemakers are ipp011 and ipp012.
  • 05:00 rebooted ipp011 and ipp012. Restarted registration and stdscience. Burntool immediately fell behind on ipp012 because stdscience loaded it up with several ppImage and pswarp processes. Took the node out of stdscience for a bit.
  • 05:10 restarted update and pstamp pantasks
  • 05:15 ipp018 has NFS hangs; rebooted.
  • 08:50 (Serge) restarted cleanup
  • 10:15 (Serge) chip.revert.off. Fixed a bunch of missing burntool tables in LAP. chip.revert.on
  • 14:50 (Serge) neb-host ipp018 up
  • 20:30 MEH: adding MD02.refstack.20120927 chip->warp processing in
  • 22:20 MEH: ipp018 has lost mount connections again and is stalling registration. isolating it again -- removing from processing, neb-host repair, restarting registration. looks like the mounts were lost around 15:30 today, based on a chip_imfile.pl run
    • will also restart stdscience for daily reset

Friday : 2012.09.28

  • 11:45 MEH: rebooting ipp018, ipp011 to recover lost mounts.
    • ipp011 had a little trouble rebooting; power cycled it and it is back up now.
    • Gene made a wave1_weak list with ipp010,ipp011,ipp018 for stdscience, only 2 get loaded.
    • ipp018 was neb-host off, with no reason given in recent czar logs, so putting it back to up. ipp011 disk is back up.
  • 12:40 MEH: might as well restart stdscience for its 10-12 hr cycling and clear out some old ipp011 stuck tasks.
    • helped to clear remaining 3PI nightly science warp
  • 16:30 EAM: shut down everything, rebooted ipp007, checked for hung NFS mounts (none), restarting everything
  • 17:00 MEH: deepstack restarted and running MD02.reftest/stack.20120729

Saturday : 2012.09.29

  • 13:45 MEH: doing regular restart of stdscience. chip.off for a few hours to push some LAP warps into stack.

Sunday : 2012.09.30

  • 11:30 MEH: stdscience rate is down, doing daily restart of stdscience