PS1 IPP Czar Logs for the week 2017-04-24 - 2017-04-30

Monday : 2017-04-24

  • 16:00 CZW: Restarting ippitc pantasks.

Tuesday : 2017-04-25

  • 16:15 CZW: Restarting ippitc pantasks.

Wednesday : 2017.04.26

Thursday : 2017.04.27

Friday : 2017.04.28

  • MEH: HVAC work through 4/30 @IfA Manoa -- ipp001,ipp003,ipp022,ippb04,ippb05 shutdown to reduce thermal load in mez so ipp002,ippops1,ippops2 can remain up for processing
    • ipp001,ipp003 need mysql shutdown cleanly before halting the machines
    • power off all machines once halted -- Gavin fixed remote power management on ippb04,b05
    • nominal Temp to monitor --
      -- ipp002 --
      Core0 Temp:
                   +44 C
      Core1 Temp:
                   +47 C
      temp1:       +41 C  (high =    +4 C, hyst =   +16 C)   sensor = thermistor   ALARM   
      temp2:     +43.5 C  (high =   +80 C, hyst =   +75 C)   sensor = thermistor           
      temp3:     +43.0 C  (high =   +80 C, hyst =   +75 C)   sensor = thermistor          
      
      -- ippops1 --
      temp1:       +36.8 C  (high = +100.0 C, hyst = +95.0 C)  sensor = Intel PECI
      temp2:      -128.0 C  (high = +100.0 C, hyst = +95.0 C)  sensor = thermal diode
      temp5:       +40.0 C  (high = +100.0 C, hyst = +95.0 C)  sensor = thermistor
      temp6:       +41.0 C  (high = +100.0 C, hyst = +95.0 C)  sensor = thermistor
      
      -- ippops2 --
      temp1:       +35.8 C  (high = +100.0 C, hyst = +95.0 C)  sensor = Intel PECI
      temp2:      -128.0 C  (high = +100.0 C, hyst = +95.0 C)  sensor = thermal diode
      temp5:       +39.0 C  (high = +100.0 C, hyst = +95.0 C)  sensor = thermistor
      temp6:       +40.0 C  (high = +100.0 C, hyst = +95.0 C)  sensor = thermistor
      
  • MEH: ippc121 has been stalling processing @camera stage -- out of normal nightly processing for now
  • 15:45 CZW: Restarted retired host reinsertion as ipptest in screen session on stare02. The check to confirm hosts can be retired is running slowly, as it is scanning for the existence of all files that have database entries on the retired hosts. Reinserting will reduce the number of these that need to be checked, and the check can be resumed post-reinsertion at the point it was stopped. This should have minimal impact, but feel free to control-C it if the load on ipp118-ipp122 gets too high or it slows down processing.
    • whoever is running this needs to monitor the level of nebulous replication slippage (looks like ~1700,3900/hr) and adjust it accordingly if it gets behind during nightly processing
  • MEH: Haydn work at ITC
    • ipp081 back up again -- leave neb-host down for a day or so and then put into repair only until we have remote power access
    • ipp090,091 rebooted to report number of CPUs properly (12 not 24) -- ipp091 raid card is dead, new ones ordered and will arrive in a week or two
    • ippdb05 needs new BBU -- wait until Monday
    • ipp085,086,089 also need BBU but wait until db05 replaced next week
  • MEH: ipp116.0 has gpc1 DB replication and <1TB free space -- should have been put into repair a while ago...
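
Sketch for checking temps against the nominal values pasted above during the HVAC work. This is a minimal example, not an existing IPP tool: it assumes the readings are saved as text in the same "label: +NN.N C" layout as the ippops1/ippops2 output, and the function names, the -100 C cutoff for dead sensors (the -128.0 C thermal diode lines) and the 10 C drift limit are all made up for illustration. Readings split across two lines (the ipp002 "Core0 Temp:" style) are not matched and get skipped.

```python
import re

# Matches single-line readings such as
# "temp1:       +36.8 C  (high = +100.0 C, hyst = +95.0 C)  sensor = Intel PECI"
READING_RE = re.compile(r"^\s*(\w+):\s+([+-]?\d+(?:\.\d+)?)\s*C\b")

# Assumption: anything at or below this is a disconnected sensor, not a real temp.
DEAD_SENSOR_C = -100.0


def parse_temps(text):
    """Return {label: temperature in C} for each single-line reading in text."""
    temps = {}
    for line in text.splitlines():
        match = READING_RE.match(line)
        if match is None:
            continue  # headers, blank lines, two-line "Core0 Temp:" readings
        value = float(match.group(2))
        if value <= DEAD_SENSOR_C:
            continue  # e.g. the -128.0 C thermal diode entries
        temps[match.group(1)] = value
    return temps


def drifted(nominal, current, delta_c=10.0):
    """List (label, nominal, current) where current exceeds nominal by more than delta_c."""
    return sorted(
        (label, nominal[label], temp)
        for label, temp in current.items()
        if label in nominal and temp - nominal[label] > delta_c
    )
```

Usage would be to parse the nominal block once per host, parse a fresh reading later, and look at whatever drifted() returns; an empty list means nothing moved more than delta_c above nominal.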

Saturday : 2017.04.29

Sunday : 2017.04.30

  • MEH: AC back on @IfA -- ipp002 temps show that it is, so power on ipp003,001. ipp022,b04,b05 not booting and probably need manual button press