PS1 IPP Czar Logs for the week 2014-03-03 - 2014-03-09


Monday : 2014.03.03

  • 16:19 Bill ran add.sts.rp macro in stdscience to start up STS processing

Tuesday : 2014.03.04

Mark is czar

  • 06xx Bill, Gene rebooted ippdb01 and got nightly going again
  • 06:50 MEH: ippdb01 down again; looks like root rebooted it ~0700 -- after both reboots ganglia shows it possibly coming back with a little less RAM than earlier in the week
    • STS label out of stdsci until night finished.
  • 07:20 MEH: czarpoll and roboczar stopped on ippc11 -- restarted
  • 07:30 MEH: staticsky was stopped w/o a note, set back to run.. STS label back into stdsci
  • 10:10 MEH: older STS throwing many fault 2s (burntool version issue) -- chip.revert.off..
    burntool state vs burntoolStateGood : -13 vs 14
    Image burntool version does not match current accepted version. at /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/chip_imfile.pl line 830
    	main::my_die('Image burntool version does not match current accepted version.', 89358, 953683, 'XY01', 2) called at /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/chip_imfile.pl line 328
    
    • Bill has been changing label for these to STS.rp.2013.hold
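    • A hedged sketch of that relabel, written as raw SQL against gpc1 (the chipRun/chipProcessedImfile table and column names here are assumptions, and in practice this would normally go through chiptool rather than a direct UPDATE):
      # park STS chip runs stuck on the burntool-version fault
      # -- hypothetical schema; verify the WHERE clause before running anything like this
      mysql -e "UPDATE chipRun SET label = 'STS.rp.2013.hold' \
        WHERE label LIKE 'STS%' \
          AND chip_id IN (SELECT chip_id FROM chipProcessedImfile WHERE fault != 0)" gpc1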
  • 10:20 MEH: ippc18.0 down to <70G free.. doing the log archive, though that will only free up ~25G. apache node /tmp space >20G and okay.
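    • The log archive is just compressing old pantasks logs on that partition; a minimal sketch, with the log path assumed rather than taken from this page:
      # bzip2 pantasks logs older than a week to claw some space back on ippc18.0
      find /data/ippc18.0/ipp/pantasks/logs -name '*.log' -mtime +7 -exec bzip2 {} +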
  • 11:50 MEH: noticed ipp004 has been constantly faulting stacks, was missing /local/ipp/tmp...
  • 13:20 MEH: ipp040,043,050,052,053 being continual slugs on chip processing, taking 2-3 ks per chip (as has been noted many times in the past) -- neb-host repair for a bit
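    • For reference, the repair toggle amounts to something like the loop below (the `neb-host <host> <state>` usage is assumed from how it appears elsewhere in these logs):
      # pull the persistently slow chip hosts out of nebulous for a while...
      for h in ipp040 ipp043 ipp050 ipp052 ipp053 ; do neb-host $h repair ; done
      # ...and put them back once the chip backlog has cleared
      for h in ipp040 ipp043 ipp050 ipp052 ipp053 ; do neb-host $h up ; done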
  • 13:30 MEH: stdsci >>100k Njobs and becoming somewhat unresponsive, will do necessary regular restart shortly before nightly
  • 14:25 MEH: ippc08 is down -- nothing on console -- high room T -- ippc08_log
    • power cycled and back up, though with PAM/home-directory problems and a rebuild running on /dev/md3 -- out of processing for a bit, then it seemed to eventually sort itself out..
  • 15:10 MEH: stdsci regular restart now that things have cleared after the ippc08 crash
  • 17:30 MEH: cannot mount /data/ipp053.0 -- may need a reboot -- yes, and it's back up okay

Wednesday : 2014.03.05

Mark is czar

  • 09:33 MEH: nightly finished; a few 3PI.WS diffim fault 5s to clear (a mix of the known NaN PSF problem and non-robust stats when there are too few stamps)
    	528288 	skycell.2433.067 	ThreePi.WS.nightlyscience 	ThreePi.20140305
    	528325 	skycell.2433.067 	ThreePi.WS.nightlyscience 	ThreePi.20140305
    	528333 	skycell.2482.011 	ThreePi.WS.nightlyscience 	ThreePi.20140305
    	528371 	skycell.2482.011 	ThreePi.WS.nightlyscience 	ThreePi.20140305
    	528372 	skycell.2524.026 	ThreePi.WS.nightlyscience 	ThreePi.20140305
    
  • 10:00 MEH: ippc18 pantasks log archive running bzip for most of day
  • 12:30 MEH: ipp001 mysql logs again filling up the disk (~360G free); cleaning up >4TB of logs -- good for another 2-3 months
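    • Assuming the space is in replication binlogs (the log type is not stated here), the cleanup on ipp001 is essentially a purge, e.g.:
      # drop binary logs older than ~30 days; keep the window longer than any replica lag
      mysql -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 30 DAY)"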
  • 13:40 MEH: ipp020,025 set neb-host up from the cautionary repair (load and an odd crash) -- doing this during day processing, while STS is left running, so they can be watched
    • ipp007,ipp065 neb-host repair while rebuilding after Haydn replaced disk

  • 20:00 MEH: looks like we may have just lost ipp035.. unresponsive -- cannot log into its console to power cycle it, nor into ipp034's (hangs after password).
    • neb-host down, out of processing, looking for mounts to force umount
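    • The mount cleanup is standard Linux; a rough sketch (the host list file is hypothetical -- use whatever enumerates the processing nodes):
      # lazily force-umount the dead server's export wherever it is still mounted
      for h in $(cat ~/processing_hosts) ; do
        ssh $h 'grep -q /data/ipp035.0 /proc/mounts && sudo umount -f -l /data/ipp035.0'
      done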

Thursday : 2014.03.06

  • sometime -- Rita to do IPP port swap from 48 port switch to ippcore slot 9
  • CZW: 10:20: Add power to staticsky:
    # These have 40 GB and 12 cores
    hosts add s3; hosts off ignore_s3
    hosts add s3; hosts off ignore_s3
    hosts add s3; hosts off ignore_s3
    
    # Only have 8 cores, so only 2 instances
    hosts add s2; hosts off ignore_s2
    hosts add s2; hosts off ignore_s2
    
    # This could support 2 instances, most likely.  Being cautious.
    hosts add s1; hosts off ignore_s1
    
    # Insufficient memory for 2 instances.
    hosts add s0; hosts off ignore_s0
    
  • CZW: 5:55: I'm going to deactivate the extra power in staticsky, so I don't forget about it.
    hosts off s3
    hosts off s2
    hosts off s1
    hosts off s0
    hosts off s3
    hosts off s2
    hosts off s3
    

Friday : 2014.03.07

  • 18:00 CZW: I'm going to disable the s* nodes in staticsky that I re-enabled this morning, so they're free for nightly processing.

Saturday : 2014.03.08

  • 07:00 EAM : I re-enabled the s* nodes in staticsky
  • 18:00 EAM : I disabled the s* nodes in staticsky

Sunday : 2014.03.09

  • 07:00 EAM : I re-enabled the s* nodes in staticsky; I've created a pair of macros, storage.hosts.on and storage.hosts.off, to do this (rough sketch below). I've tried to use 'control run reap' to manage the hosts and avoid leaving the ignore_s* nodes on, but I'm not sure it is quite working (may need some sleep statements).
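    • A rough sketch of what the two macros wrap -- the hosts commands are the same ones CZW used on Thursday; the `macro ... end` definition syntax and the placement of 'control run reap' are assumptions, not copied from the actual macros:
      macro storage.hosts.on
          # three instances on the 12-core/40 GB s3 nodes, two on s2, one each on s1/s0
          hosts add s3; hosts off ignore_s3
          hosts add s3; hosts off ignore_s3
          hosts add s3; hosts off ignore_s3
          hosts add s2; hosts off ignore_s2
          hosts add s2; hosts off ignore_s2
          hosts add s1; hosts off ignore_s1
          hosts add s0; hosts off ignore_s0
          # per the note above, 'control run reap' is meant to keep the ignore_s*
          # hosts from being left on; exact placement (and any needed sleeps) is a guess
          control run reap
      end

      macro storage.hosts.off
          hosts off s3
          hosts off s3
          hosts off s3
          hosts off s2
          hosts off s2
          hosts off s1
          hosts off s0
          control run reap
      end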
  • 20:30 EAM : I have disabled the s* nodes again. I also set the ra poll max to 240 as we have run out of stacks.