PS1 IPP Czar Logs for the week 2016.02.08 - 2016.02.14

Monday : 2016.02.08

  • 13:58 MEH: Haydn needs to rebuild a drive in ippc10, so it will be offline for a bit.
  • 16:11 CZW: Starting raw data shuffle from the rest of the cluster to ippb03, which has some free disk space. This will be running in the replication pantasks as the ipp user. This should be a minimal load on the cluster, and will only run 10k jobs at a time (I'm using this to tune how quickly we fill hosts up). ippb03 will be set to up, but the xattr bit will prevent random file creation there.
    • 16:31 CZW: I've stopped this pantasks, as ippb03 seems to have issues with glockfile, and I'm not sure how to sort that out.
    • 17:20 CZW: Restarted again, as Gene helped debug the odd nfs/lockd/glockfile issue that was preventing ippb03 from responding correctly.
  • 18:00: Haydn reports that recovering ippdb03 is problematic. We will likely use ippc15 as a replacement unless someone is using it, so ippc15 is being taken out of default nightly processing.

Tuesday : 2016.02.09

  • 11:00 CZW: Restarting rsync to the ipp1XX (10x) nodes from the ipp screen session on stare03. These processes don't always die cleanly, so if they need to be killed suddenly, email me.
  • 22:20 MEH: Large number of faults from jobs unable to find detrends on ipp040 (and on ipp041, ipp042, ipp038, and others in the group running rsync transfers), which is causing nightly processing to lag. Wasn't it discussed to set these hosts down if the rsyncs are running during nightly processing? Jobs are also lagging on ipp040, so I'm turning the nightly jobs down there at least. Others may need to be turned down as well (or the rsyncs stopped).
  • 04:20 EAM: ipp042 just crashed, with no message and nothing on the console. I'm rebooting it.

Wednesday : 2016.02.10

  • 11:45 CZW: Starting the ipplanl/pv3update pantasks to process (hopefully) the final updates for STSCI. This uses only the x nodes for processing. I've set ipp03x and ipp04x back to update, as the rsync process is stopped on them.
  • 15:00 CZW: New detrend for XY62 installed. det_id = 1072.
  • 16:45 CZW: Restarting ipplanl/pv3update with half as many processing hosts to see if that helps resolve network traffic issues.
  • 16:52 CZW: Setting ipp1XX nodes to up in nebulous, as we're not doing the rsyncs at the moment. Leaving ipp101 out of this, as it may be needed to repair ippb03.
  • 18:07 CZW: Preemptive restart of ipp pantasks servers.

Thursday : 2016.02.11

  • 18:07 CZW: Preemptive restart of ipp pantasks servers.

Friday : 2016.02.12

Saturday : 2016.02.13

Sunday : 2016.02.14