PS1 IPP Czar Logs for the week 2015.09.21 - 2015.09.27

Extra/Non-standard Processing

Continuing HAF's suggestion to improve communication: keep a list of additional (non-standard) processing in the czar pages, so that we all know what's going on.

Daily Czaring:

  • currently there is a modified ops tag running diffs (WS labels only) as ippmops under ~ippmops/src/stdscience on ippc29 -- if problems arise (Njobs > 100k, power loss on ippc29, etc.), it will need to be restarted like a normal nightly-processing pantasks
    ./stdscience
    • even with the modified ops tag, various files still go MIA in nebulous, so the daily czar needs to check and clear them, as has been discussed before
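The "Njobs > 100k" rule of thumb above can be written as a simple guard. This is only a sketch: the log does not show how the current job count is obtained (a real helper would read it from the pantasks status output), so here it is passed in by hand.

```shell
#!/bin/sh
# Minimal sketch of the "Njobs > 100k means restart" rule of thumb.
# The job count is passed in as an argument; a hypothetical helper
# would read it from the pantasks status output.

NJOBS_LIMIT=100000

needs_restart() {
    # true (exit 0) when the job count exceeds the restart threshold
    [ "$1" -gt "$NJOBS_LIMIT" ]
}

if needs_restart 123456; then
    echo "Njobs over limit: restart the ippmops stdscience pantasks"
fi
```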

(Up to PS1 IPP Czar Logs)

Monday : 2015.09.21

  • 08:40 EAM: looks like we had a power event on cab1: all machines in that cabinet rebooted spontaneously in the past 20 min. ipp013 is still coming up slowly.
  • 18:15 EAM: we lost all power around 9:30 this morning (UPS failure). it took until around noon for power to come back. I am now restarting all ipp pantasks.
  • 22:45 MEH: restarted czarpoll and manually added WS while in another pantasks, but found the labels not showing up. rebuilt ippMonitor on db03 to fix it and cleared all the old labels in ippMonitor that are no longer needed. new labels added in stdsci now seem to actually show up on ippMonitor too.
    • nebdiskd started on ippdb08
    • the WS diff pantasks running under ippmops stdscience also needed to be started

Tuesday : 2015.09.22

  • 11:30 MEH: c2 nodes being used for extra processing for a couple of hours -- off in stdsci
    • c2 nodes back on in ipp:stdscience and ippmops:stdscience

Wednesday : 2015.09.23

Thursday : 2015.09.24

Friday : 2015.09.25

  • 13:20 EAM: another power outage for the IPP. the cluster is mostly back up, but I need to shut down mysql on ippdb05 and make a dump. the ippdb03 replication backup is not starting -- it looks like the binlogs got corrupted during the shutdown. If I can recover from the existing binlogs I will do that and stop the dump, but for now please leave all services off.
  • 14:20 EAM: ~ipp pantasks, ~ippmops stdscience pantasks started. note pstamp has not been restarted: I am dumping mysql @ ippc17 to restart the replica.
    • restarted czarpoll & roboczar on ippc11 (as per the Processing page) instead of ipp009 as noted a few weeks ago.
    • re-running the ippc17 mysql rsync (forgot to shut down mysql first!)
  • 15:15 EAM: mysql @ ippc19 restarted, local rsync dump is done. rsync to ippc17 is running.
  • 18:45 MEH: restarting ippmops:stdscience for WS diffs with compression turned back on; for now this is a manual change in the ipp-20141024.lin64 build, setting COMP_SUB in filerules-split.mdc
    • 20:25 MEH: testing finished and ippmops:stdscience running w/ compressed diffims now
  • 19:35 EAM: started nebdiskd on ippdb08
  • 20:30 MEH: scraping whatever we can into cleanup.
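The replica re-seeding in the 14:20-15:15 entries can be sketched generically. This is not the exact IPP procedure: the destination hostname comes from the log, the mysqldump flags are standard options for a consistent dump plus binlog coordinates, and the commands are only echoed (dry-run) rather than executed.

```shell
#!/bin/sh
# Hedged, dry-run sketch of re-seeding a MySQL replica, loosely following
# the 2015.09.25 entries (dump, then rsync to ippc17). The flags are
# standard mysqldump options; the actual IPP procedure may differ.

# --single-transaction: consistent InnoDB snapshot without a global lock;
# --master-data=2: record the binlog file/position as a comment so the
# replica can be repointed with CHANGE MASTER TO.
dump_cmd() {
    echo "mysqldump --all-databases --single-transaction --master-data=2"
}

# rsyncing a live datadir is only safe after mysqld is stopped -- note
# the log entry about forgetting to shut down mysql before the rsync.
sync_cmd() {
    echo "rsync -a --delete $1 $2"
}

dump_cmd
sync_cmd /var/lib/mysql/ ippc17:/var/lib/mysql/
```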

Saturday : 2015.09.26

  • 06:10 EAM: ipp078 was listed in nebulous with an empty / missing disk. I checked it and it was reporting the journal commit errors we saw on Sept 22. I rebooted like Gavin did last time and the disk came back. ipp070, on the other hand, is also reporting an empty volume, but it seems to be fine in reality. Maybe nebdiskd just needs to recycle. See below for the Sept 22 conversation.
    I checked the ipp wiki for Haydn's previously recorded note -- SSD #1 went offline, which crashed it.
    Power cycled the system.  (SSD #1 is missing, SSD #2 is primary)
    All disks report OK. RAID 6 volume is OK
     Number of online units: 1,
                      available drives: 0,
                      hot spares: 0,
                      offline units: 0
    On Tue, 22 Sep 2015, Eugene Magnier wrote:
    we are having a problem with ipp078.  I can ping it and see the disk from
    other machines, but i cannot log in.  the console has a bunch of messages
    saying "journal commit I/O error"
    Gavin: advice?  reboot and examine raid?
  • 09:20 EAM: problems with ipp096 and ippx089 - ippx100, all on cab17. I rebooted ipp096, which seemed to be down. It booted fine, but cannot get to NIS. I can get a login prompt, but cannot log in. I suspect there is a network problem for cab17. the cacti graphs also show dropouts on cab17-4948.
  • 21:48 MEH: Serge reports nightly data not being processed
    • ipp096 is down; it must be set neb-host down or nightly processing will get stuck looking for detrends on that machine -- moving past chip now
    • ipp096 mounts are also not cleared, holding up some jobs in later processing, and it seems cleanup jobs for the past 8 hrs as well
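Stuck mounts from a down host (like the ipp096 mounts above) can be detected with a timed-out stat, since a stat on a stale NFS mount simply hangs. A minimal sketch with standard tools only: the paths below are illustrative placeholders (the real ipp096 mount points are cluster-specific), and the suggested lazy umount would need root.

```shell
#!/bin/sh
# Sketch: flag unresponsive (possibly stale-NFS) mount points. stat on a
# stale mount from a down host hangs, so bound it with timeout(1).
# The paths below are illustrative placeholders, not real IPP mounts.

mount_ok() {
    # exit 0 if the path answers a stat within 5 seconds
    timeout 5 stat -t "$1" >/dev/null 2>&1
}

for mnt in /tmp /no/such/mount; do
    if mount_ok "$mnt"; then
        echo "$mnt ok"
    else
        echo "$mnt unresponsive or missing (candidate for: umount -l $mnt)"
    fi
done
```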

Sunday : 2015.09.27