(Up to PS1 IPP Czar Logs)

Monday : 2012.10.08

Mark is czar

  • 00:10 MEH: stdscience struggling to stay >50% loaded, needs regular restart
  • 07:20 MEH: some 580 nightly science exposures finished downloading, processing ongoing.
  • 07:50 nightly science done; chip.off to force warps through and load LAP stacks, and set.poll 400-600 slowly to keep all nodes loaded without hitting the stsci nodes all at once (remember to do this again in late afternoon so LAP stacks are loaded for the night).
  • 09:00 Serge fixed replication to ippdb03, ipp001 Replication_Issues -- czar page slowly catching up
  • 09:50 MEH: fixing LAP chips holding up warps->stacks
    -- gone
    -- found orphan
    -- remade
    -- dropped exposure from 8/7/2012 in LAP 8745, was in odd state maybe from cleanup on/off
    laptool -updateexp -lap_id 8745 -exp_id 506605 -set_data_state drop -dbname gpc1
  • 14:00 replication of ippdb03 caught up now. ipp001 caught up around noon. turning chip.on again as well for LAP
  • 14:50 nearing time for stdscience restart again.. done
  • 15:10 should check on devel cluster machines occasionally to make sure temps are not getting out of hand -- ipp003 temp1/2 idle @44,50F
  • 16:30 burn through more warps, chip.off, until nightly observations start
  • 18:30 chip.on

Tuesday : 2012.10.09

Mark is czar

  • 06:45 Serge fixed replication for ippdb03, ipp001 again (same problem as yesterday/weekend).
  • 07:10 MEH: nightly science exposures finished downloading. stdscience ready for regular restart. backlog of 2200 LAP stacks at start of night was easily processed during nightly science.
  • 09:20 (Serge) rsync from ippdb02 to ippc63 crashed during the night. Unfortunately the time when it happened was not displayed...
  • 11:00 MEH: chip.revert.off to look for faulting exposures, recovered neb://ipp020.0/gpc1/20100714/o5391g0197o/o5391g0197o.ota40.fits.
  • 11:30 manually forced LAP to check older lap_id to push fixed chip->warp into stack with lap_science.pl --monitor_mode --lap_id 8822, have 2800 stacks now so systems should be in full use for day.
  • 11:50 ippc18.0 @38/917 GB again.. rsyncing the past months' log dirs to their respective Archive without any zipping (want to always do that on the ippc18.1 disk anyway, to keep the load down on ippc18.0), but that will only clear up some 20GB. The month of October has some large logs for cleanup and stdscience that can't go to Archive yet.
    • 12:30 rsync from ippc18.0 to ippc18.1 with large rm jobs can load the system and stall processing
    • 13:00 bzip2 on the Archive files on ippc18.1 (not NFS mounted) can also load the system and slow down processing. starting with the cleanup logs, which are rarely looked at and many of which haven't been compressed yet.
  • 14:20 forgot about reboot-testing ipp046 after the BIOS battery replacement -- testing whether it reboots on power cycle as it should. kind of late today, should do tomorrow or in the near future.
  • 15:10 czar is finding a cooler place to remotely monitor processing...
  • 18:45 stdscience needs its regular restart

Wednesday : 2012.10.10

Serge is czar

  • XX:XX reboot-test ipp046 after the BIOS battery replacement -- check whether it now boots properly on power cycle.
  • 07:30 (Serge): Same replication problem as yesterday
  • 09:50 (Serge): ippdb02 rsync complete. Restarting replication on ippdb02
  • 10:35 (Serge): Moving mysql from ippc63.0 to ippc63.1
  • 16:10: See entry on 2012-10-09 at 15:10
  • 17:15 MEH: stdscience <50% loaded, restarting and adding MD05.20121005 for V3 finish processing.

Thursday : 2012.10.11

  • 07:30 (Serge): Same replication problem as the previous nights
  • 08:00 (bill): restarted pstamp and update pantasks. I think their pcontrols were limiting throughput.
  • 10:25 MEH: stdscience <50% loaded, doing regular restart. looks like ipp020 has been a problem node, with jobs taking a long time to run on it; going to drop it from stdscience processing. probably needs to be checked/rebooted.
  • 16:40 MEH: ippc01 should have ~10-15GB left on /; not sure where the disk space disappeared to, and / being 100% full is causing problems... removing it from the nebserver list in .tcshrc to see if that helps with the Cannot write to '/tmp/nebulous_server.log' error in stdscience. restarting stdscience, and it seems to help. probably need to restart all pantasks for ippc01 to be fully removed...
  • 17:15 MEH: ipp020's mounts look messed up; don't want to reboot it remotely before nightly science unless we have to, so isolating it -- removed from all processing (it was already set to neb-repair for some reason).
    • too late.. looks like at least a registration job got stalled on it... will see if can fix. and restarted ok.
    • looks like mounts gave up early afternoon, also hit cleanup and pstamp with stalling jobs.. will see if can fix
  • 17:45 ippc01 has a /var/lib/mysql.original at 13G, and that is what is using the excess disk space
  • 17:55 restarted stack, pstamp, update, distribution, summitcopy to avoid the problem with ippc01, took ipp020 out again
  • 18:10 MEH: adding ippc01 back into the nebserver list in .tcshrc after Serge removed/reset the 12G nebulous.log. ippc01 will be used again when pantasks are restarted.
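
The 16:40/17:45 entries are a classic full-root-partition hunt (the culprit turned out to be /var/lib/mysql.original at 13G). A generic sketch of that hunt with du, run here against a throwaway sandbox directory mimicking the layout (the sandbox, sizes, and filenames are illustrative, not the real ippc01 contents); -x keeps du on one filesystem so NFS mounts don't skew the numbers:

```shell
# sandbox standing in for / on ippc01, with one oversized directory
ROOT=$(mktemp -d)
mkdir -p "$ROOT/var/lib/mysql.original" "$ROOT/etc"
head -c 1048576 /dev/zero > "$ROOT/var/lib/mysql.original/ibdata1"  # 1 MiB stand-in for the 13G file
echo conf > "$ROOT/etc/hosts"

# largest directories first; -x stays on this filesystem, sizes in KiB
OUT=$(du -xk --max-depth=3 "$ROOT" | sort -rn)
echo "$OUT"
```

On the real node this would be run as `du -xk --max-depth=3 / | sort -rn | head`, walking the size ranking down until the offending directory stands out.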

Friday : 2012.10.12

  • 09:45 MEH: stdscience struggling, doing regular restart.
  • 10:20 MEH: Serge remembers that ippc01's mysql.original was an early attempt/test at moving the nebulous db; it can be and has been rm'd.
  • 11:05 MEH: rebooting ipp020 to fix the lost mounts, will add back into processing
  • 12:45 MEH: large numbers of nightly stacks running under MD05.GR0 to complete MD05 reprocessing to V3
  • 18:20 MEH: ipp011 looks to have become unresponsive, cannot ssh into. going to reboot before nightly science. back up. Ipp011-crash-20121012
  • 22:35 MEH: stdscience <50% loaded, needs regular restart

Saturday : 2012.10.13

  • 09:25 MEH: looks like ipp023 is down. rebooted okay, from console Ipp023-crash-20121013
  • 10:55 LAP out of chips, chip.off after stdscience regular restart

Sunday : 2012.10.14

  • 11:15 MEH: looks like both ipp013 and ipp010 are stalled; cannot ssh in, and jobs are stalled there.
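
Several entries this week (ipp011, ipp023, now ipp010/ipp013) started with "cannot ssh in". A quick reachability sweep like the sketch below catches these before jobs pile up on a dead node; the node list and the probe command are illustrative (the probe is parameterized so ssh could be swapped for ping etc.):

```shell
# probe each node with a short ssh timeout and report the unreachable ones
NODES="ipp010 ipp013 ipp020 ipp023"
CHECK_CMD=${CHECK_CMD:-"ssh -o BatchMode=yes -o ConnectTimeout=5"}
DOWN=""
for n in $NODES; do
    if ! $CHECK_CMD "$n" true >/dev/null 2>&1; then
        DOWN="$DOWN $n"
    fi
done
echo "unreachable:$DOWN"
```

BatchMode avoids hanging on a password prompt, and the short ConnectTimeout keeps a sweep over the whole cluster fast even when several nodes are wedged.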