PS1 IPP Czar Logs for the week 2015.01.26 - 2015.02.01

Monday : 2015.01.26

  • 04:30 EAM: ipp044 crashed, nothing on console. rebooting.
  • 04:45 EAM: ipp044 having trouble on reboot. I set it down in nebulous.
  • 05:00 EAM: ipp044 still not returning. I've shut down stdscience and I'm clearing hung NFS mounts (see the umount sketch at the end of today's entries).
  • 05:40 EAM: ipp044 mounts on the storage nodes are clear, stdscience restarted. Clearing the compute nodes as well.
  • 05:50 EAM: looks like everything is clear; ipp044 is still down. stdscience, stdlocal, stdlanl are all back up and running.
  • 07:30 MEH: had to manually clear an exposure fault in registration for o7048g0547o:
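    # revert the faulted exposure (exp_id 863126, o7048g0547o) so registration can retry it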
    regtool -dbname gpc1 -revertprocessedexp -exp_id 863126
    
  • 08:00 EAM: stopping and restarting stdlocal (110k warps)
    • there were various hung jobs due to hung ipp044 mount points on the stdlocal x-nodes. I've cleared them out.
  • 10:05 MEH: nightly finished, adding storage nodes to stdlocal storage.hosts.on
  • 11:35 MEH: Haydn is releasing ipp094 for IPP use now; its BBU is not ready so it won't be targeted for data, but it will go into group s5 for processing.
    • also adding it to neb-host; it may be OK off and on for non-targeted data
  • 14:15 MEH: ipp090,093,095,082,080 typically have heavy loads and also seem to be doing some sort of RAID verification, so putting them in repair for now
    • the verification is a read process, so it's unclear whether that will help -- ipp080 is almost finished, the others are still <60% done since Saturday midnight?
  • 15:45 MEH: modified nightly targeting in ipphosts.mhpcc.config to remove ipp083,084,086 since the BBU replacement will take a few more days at least -- restarted summitcopy, registration, stdsci
  • 15:55 MEH: set ipp094 to neb-host up but non-targeted, same for ipp083,084,086,092 -- monitor the loads, but they should be fine for non-targeted data without the BBU write-back cache
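
  A minimal sketch (not the exact commands used) of the hung-mount cleanup from the 05:00 and 08:00 entries, assuming the stale ipp044 NFS mounts are still listed in /proc/mounts on each node:

    # run on each storage/compute node that still shows a stale ipp044 mount
    grep ipp044 /proc/mounts | awk '{print $2}' | while read mnt ; do
        umount -f -l "$mnt"    # force + lazy detach so hung NFS I/O stops blocking jobs
    done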

Tuesday : 2015.01.27

  • 05:40 EAM: restarted stdlocal
  • 08:00 EAM: bumped the poll limit up to 1200 for now -- there are lots of failed warps due to ipp044 being offline. At the default of 600, there are not enough chip jobs to keep pcontrol busy.
  • 16:00 CZW: killed leftover ippsky/staticsky jobs, and moved the x-node power back to stdlocal.
  • 17:15 CZW: I've noticed that stsci volumes (not full nodes) are dropping out of nebulous today. They reappear after a bit, and can be forced back by logging into ippdb00 and manually checking the mount (ls /data/stsci13.1/ and the like resolve the issue). I don't see anything definitive in the dmesg output that would explain this.
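
  A minimal sketch of the manual remount check described above, run from ippdb00 (volume names other than stsci13.1 are illustrative):

    # probe each stsci volume that dropped out so the automounter re-establishes it
    for vol in stsci13.1 stsci13.2 stsci14.1 ; do
        ls /data/$vol/ > /dev/null || echo "/data/$vol still not responding"
    done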

Wednesday : 2015.01.28

  • 06:50 EAM: ippc20 crashed with kernel panics, rebooting.
  • 14:00 CZW: Pulled back an assortment of LANL exposures/stacks for local processing to attempt to clear them and break their endless fail-revert-fail loops.

Thursday : 2015.01.29

  • 05:30 EAM: stdscience needs a restart (too sluggish). restarting now.
  • 07:10 EAM: stdscience is fairly far behind -- I've moved the c2 hosts from stdlocal to stdscience.
  • 11:00 EAM: restarting stdlocal
  • 16:24 Bill: started ~ippsky/staticsky with a new SAS label enabled.

Friday : 2015.01.30

  • 12:50 CZW: Set ipp027 to down in nebulous to prevent attempts to access files there while the host is offline.
  • 14:05 EAM: stopped stdlocal for a regular restart. Lots of jobs are hung waiting for the ipp027 NFS mounts to clear. Rather than do the whole force-umount dance, I'm going to wait for Haydn's report on ipp027.
  • 14:10 EAM: changed the label on the prior staticsky attempt (to LAP.PV3.v0.20140730.sky01); generated replacement staticsky runs with the same old label (LAP.PV3.20140730.sky01)
  • 16:10 EAM: killing off hung jobs, forcing umounts for ipp027. Haydn is replacing batteries for ipp092, ipp094, ipp086, so putting them down
  • 16:40 CZW: ipp027 to repair, as it's running again.
  • 17:25 CZW: ipp084 and ipp083 also went down, which seems to have jammed processing. There was also a high nebulous connection load, so I've stopped the ipp051 scan (again) to see if that helps (it does). The connection load is still higher than the regular average (aiming for 500 or less, was stuck in the 800+ range), so I've turned off stdlanl and stdlocal (which were universally failing anyway). All hosts seem to be back online, and none of the mounts seem to be hanging. I'll bring stdlanl/stdlocal back online when the connection count drops back down.
    • 18:00 CZW: the nebulous processlist cleared sufficiently for me to discover that even after the Perl process is terminated, the SQL query for the ipp051 scan can still be running on the database. I've killed the query and restarted stdlanl and stdlocal (see the sketch after this list).
    • 18:05 CZW: Haydn sent an all clear email for the repair work, so I've put the three hosts from Gene's 16:10 note to repair.
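
  A hedged sketch of the query cleanup in the 18:00 note, assuming the nebulous database is MySQL (the host name, user, and query id below are placeholders):

    # find the orphaned ipp051 scan query and kill it by its connection id
    mysql -h <nebulous-db-host> -u <user> -p -e "SHOW FULL PROCESSLIST"
    mysql -h <nebulous-db-host> -u <user> -p -e "KILL <id-of-scan-query>"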

Saturday : 2015.01.31

  • 09:30 MEH: nightly stdsci needed to be restarted by the czar yesterday; the poll numbers are >100k, so the nodes are not being fully utilized and nightly still has not finished.. restarting now so nightly can get finished
  • 09:45 MEH: also, ipp086,092,094 have been neb-host up with bad BBUs (non-targeted for data) for the past week? they should still be able to pick up random, non-targeted data
  • 09:50 MEH: cleanup has to be on during the day at the very least, or the new data nodes will fill up quickly and make nightly processing more troublesome
  • 10:15 MEH: ippm and ippsky are overusing the nebulous apache servers.. might as well fix that too since I'm here..
  • 11:25 MEH: nightly finally almost finished.. ipp004 looks unresponsive.. trying a power cycle -- back up; the homedir wouldn't mount initially, but worked itself out after ~10 mins..
    • also, since I'm online fixing things, might as well restart summitcopy+registration as they have been running a while too..
  • 19:30 MEH: ipp077 seems to be faulting jobs; it is unable to lock files during replication onto other machines -- taking it out of processing; NFS may need to be restarted? (see the lock-test sketch below)
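
  A minimal sketch (the mount path is an assumption) for checking whether file locking against the ipp077 export works again before putting it back into processing; flock behaviour over NFS varies, so treat this as a rough check only:

    # from another node: try to take a lock on the ipp077 export with a 10 s timeout;
    # a hang or failure suggests the lock daemon / NFS service on ipp077 needs a restart
    touch /data/ipp077.0/locktest.tmp
    flock -w 10 /data/ipp077.0/locktest.tmp -c 'echo lock acquired' || echo "locking still broken"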

Sunday : 2015.02.01

  • 11:15 EAM: restarting stdlocal