PS1 IPP Czar Logs for the week 2017.01.23 - 2017.01.29

(Up to PS1 IPP Czar Logs)

Monday : 2017.01.23

  • MEH: QUB targeted followup needs ippx, ipps nodes as well as nightly nodes in AM -- finished
  • 11:45 CZW: Increasing replication job count to 300, with the goal of pushing this back to 450 by the end of the day.
  • MEH: MOPS moving from ipp028 to ipp027 as replacement while ipp032 is down -- neb-host repair and out of processing
  • 14:57 CZW: Set ipp118 and ipp120 to down in nebulous to allow Haydn to scan the file system after the weekend outage.
  • 15:00 CZW: Mounting issues are hanging up postage stamp generation. It looks like a number of hosts still have bad mounts, and that is preventing the jobs from completing. I've stopped stdscience while I try to sort out the problem.
  • 15:09 CZW: The last two items appear to be somewhat related. ipp075 had jobs not running, and had a 'mount ipp118' job hanging. Killing this (which had been running since ~12:00) allowed all other mount related issues to clear (df being the obvious example), and jobs immediately started running. I'm going to search the cluster for similar jobs, kill them, and hope that ipp118 responds better once the file systems have been checked.
  • 15:20 CZW: The hung pstamp jobs have cleared, without me completing my scan. I'm going to let stdscience continue, and troubleshoot anything that looks stuck.
  • 17:13 CZW: ipp118 has been repaired, and I am setting it back to "repair" status in nebulous.
  • 17:53 CZW: ipp120 has been repaired as well, and I have set both it and ipp118 to "up" in nebulous.
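The hung-mount sweep described above can be partly scripted: long-running `mount` processes (like the `mount ipp118` job stuck on ipp075 since ~12:00) stand out in `ps` output by their elapsed time. A generic sketch of the filter -- not the actual IPP sweep tooling -- assuming `ps -eo pid,etime,args` style input per host:

```shell
# Sketch: print PIDs of `mount` commands that have run for over an
# hour, given `ps -eo pid,etime,args` output on stdin. etime gains a
# second colon (HH:MM:SS) or a dash (DD-HH:MM:SS) once past an hour.
find_hung_mounts() {
    awk '$3 == "mount" && $2 ~ /-|:.*:/ { print $1 }'
}

# Example use on one host:
#   ps -eo pid,etime,args | find_hung_mounts | xargs -r kill
```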

Tuesday : 2017.01.24

  • MEH: regular faults in normal processing (including summitcopy...) seem to tie back to jobs running on ipp075 not being able to lock files -- manually turning ipp075 off in processing helped until it can be fixed after processing (an NFS restart may be needed)
  • MEH: QUB targeted followup needs ippx, ipps nodes as well as nightly nodes in AM -- finished
  • 16:57 CZW: Restarting regular pantasks. I've removed ipp075 from active processing as the lock problems do not seem to be fixed after an NFS restart (added to hosts_ignore_storage, commented out of hosts_s5).

Wednesday : 2017.01.25

  • MEH: QUB targeted followup needs ippx, ipps nodes as well as nightly nodes in AM -- finished

Thursday : 2017.01.26

  • 15:00 CZW: ipp121 to up, ipp032 to repair.
  • 15:31 HAF: restarted pantasks
  • 20:00 MEH: while checking in on K2, noticed regular faults -- the faulting files tie back to ipp121.1 (all, or at least mostly) -- looks like an XFS problem again; needs to be checked/fixed (probably both, to be sure), so back to neb-host repair
    Jan 26 17:26:44 ipp121 [255484.504432] nfsd: non-standard errno: -117
    Jan 26 17:26:46 ipp121 [255486.911774] XFS: Internal error XFS_WANT_CORRUPTED_RETURN at line 325 of file fs/xfs/xfs_alloc.c. Caller 0xffffffff8121dcb0
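The usual response to this class of XFS error is to take the volume offline and check the filesystem before returning the host to service. A minimal sketch of the steps, printed for review rather than executed directly; the device and mount-point names are placeholders, not ipp121's actual layout:

```shell
# Sketch: emit the XFS check steps for a suspect volume.
# Arguments are placeholders; the real device/mount depend on the host.
xfs_check_plan() {
    dev=$1; mnt=$2
    printf 'umount %s\n' "$mnt"
    printf 'xfs_repair -n %s\n' "$dev"   # -n: no-modify mode, report problems only
    printf 'mount %s\n' "$mnt"
}

xfs_check_plan /dev/sdb1 /export/ipp121.1
```

If the no-modify pass reports problems, the same `xfs_repair` run without `-n` performs the actual repair before remounting.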
    

Friday : 2017.01.27

  • MEH: ganglia has not been reporting for c18, x051, x066 for ~1.5 weeks. Czars need to check on this regularly -- we can't know if a node is having an issue if we don't get status for it (and if czars aren't regularly scanning the ganglia page, that is also a problem). Seem to recall restarting ippc18 before; it may be having an issue again
    /etc/init.d/gmond restart
    
  • MEH: rebooted ipp075 to clear the NFS file-lock problem -- up and seems ok; back into processing and neb-host up to verify -- ok
  • MEH: Haydn rebooted and checked XFS on the disks to fix the error -- back online fairly quickly; neb-host up to check if okay during darks and pstamp requests -- ok
  • MEH: Haydn replaced failing drive in ipp067 -- will be rebuilding overnight at least, neb-host repair
  • MEH: ipp067-097 hitting nebulous shutoff for new data during night now -- seems okay, ipp100-104,118-122 can handle the extra data volume fine
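The gmond restart noted above can be applied across the unreporting hosts in a loop. A sketch that emits the commands for review (or piping to sh); the full hostnames ippc18/ippx051/ippx066 are assumptions expanded from the c18/x051/x066 shorthand, and passwordless ssh is assumed:

```shell
# Sketch: print a gmond restart command per unreporting host,
# so the list can be reviewed before running. Hostnames are guesses.
gmond_restart_cmds() {
    for h in "$@"; do
        printf 'ssh %s /etc/init.d/gmond restart\n' "$h"
    done
}

gmond_restart_cmds ippc18 ippx051 ippx066
```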

Saturday : 2017.01.28

  • MEH: ipp008, ipp066 to neb-host repair -- both having disk warning messages, and Haydn is in Manoa next week (need to keep nightly data off them in case of problems)

Sunday : 2017.01.29

  • MEH: ipp067-097 getting down to minimally useful space; moving raw data targeting to ipp100-104,118-122 only now, along with more skycell products -- restarting pantasks with this change for tonight