PS1 IPP Czar Logs for the week 2014.08.18 - 2014.08.24

Monday : 2014.08.18

  • 22:30 MEH: ipp036 crashed ~15 min ago, nothing on console. Power cycled and back up before it wedged processing.

Tuesday : 2014.08.19

  • HAF is czar today

Wednesday : 2014.08.20

  • 13:32 Bill: stopping pstamp server in order to restart and to install scripts to support stamps of exptime and expnum images.
  • 18:20 MEH: using ippsXX and compute3 nodes for ~1hr for MD stacks before nightly processing

Thursday : 2014.08.21

  • 00:05 MEH: stdsci really needed regular restart
    • unwant is normally 20; someone set it to 10, which doesn't appear to be logged, but suspect it was to ease overloading the few nodes with disk space. Suspect some nodes may also have been turned off, but again not logged. Also reset down to 10 because I don't want to watch the nodes all night, but will likely try going back to 20 if there is data tonight
      controller parameters unwant = 10
    • rate from ~30/hr back up to 50/hr? Yes, and was able to maintain this even with most datanodes red -- if stdsci is not regularly restarted, processing is known to slow down
  • 00:20 MEH: looks like observations stopped for humidity -- restart long-running summitcopy and registration as well
  • 01:35 MEH: ipp036 very unhappy -- neb-host repair and out of stdsci until cleared -- leave in repair since it seems to be sensitive when low on disk space
  • 12:00 MEH: doing long-neglected pantasks_server log archiving
  • 12:15 MEH: little processing, so flipping the nebulous log on ippc02 to bring its free space in line with the other ippc0x machines -- WARNING: something else is using disk space on ippc02, so it has only 18G free while the others have >20-23G, and it will regularly run out first
  • 21:00 MEH: pstamp and update labels off for a while
  • 21:50 MEH: ipp071 load~100 but cpu_wait<60 in spikes and still responsive -- rate of ~40-50/hr holding so far, full stsci nodes loaded and unwant=10 still

Friday : 2014.08.22

  • 07:10 MEH: chips mostly done, unwant 20 to keep things loaded
  • 09:30 MEH: OSS mostly done
  • 10:05 MEH: 3PI diff finished. PSS and update label back online
  • 11:50 MEH: using the compute3 nodes again for MD pv2 stacks
  • 18:23 Bill: Eddie points out that 3PI.PV2 skycal distribution is only 66 percent complete. Changed labels for pending runs. If there are problems, remove the lap.threepi.20130717 label from distribution pantasks
    • 18:50 MEH: will probably take out for nightly processing since only 5 machines for data...
  • 23:10 MEH: looks like ipp034 down for ~20 min, nothing on console, regular load -- power cycled and back up, reduced to half load overnight

Saturday : 2014.08.23

  • 10:30 MEH: 3PI nightly finished -- adding update label back to stdsci, distribution+pstamp on, regular restart of stdsci
  • 11:35 MEH: ipp047 crashed ~10 min ago, nothing on console and no real load and in repair already.. -- power cycle and back up
  • 17:40 MEH: ipp047 down again --
  • 21:30 MEH: ipp047 down again for ~1 hour, no load -- take out of all auto-loading pantasks_hosts.input, set neb-host down and see if it still crashes

Sunday : 2014.08.24

  • 10:30 MEH: ipp047 down again while neb-host down.. power cycle
  • 11:10 MEH: ipp047 down again, not going to power cycle now
  • 14:30 MEH: MD stacks running on ippsXX and compute3
  • 23:00 MEH: ipp036 down for ~10 min, nothing on console, in repair and normal load -- power cycled, taken out of processing
    • crash left jobs stuck in summitcopy and stdsci, which greatly backed things up -- restarting summitcopy and stdsci and slowly catching up