PS1 IPP Czar Logs for the week 2016.05.09 - 2016.05.15

Monday : 2016.05.09

  • 12:29 MEH: adding CFA/MD07 WS publish for MOPS and restarting stdsci -- commented out until MOPS is set up again to process

Tuesday : 2016.05.10

  • 09:30 MEH: mysql gpc1 dump cleanup running on ipp001 so the disk doesn't fill up (pruning sketch at the end of today's entries)
  • 10:11 MEH: sending to cleanup -- WSdiff catalogs >1 month old, ps_ud% (to clear the cycling faults) except the QUB targeted fields, ws_nightly_update, and some QUB.% chips
    • leaving all the .multi so that cleanup can be done by others
  • 11:35 MEH: 29/917G on homedisk -- starting bzip2 of pantasks logs (sketch below); only freed ~5GB, other users need to manage their usage more
  • 11:41 MEH: ipp078 seems ok over the past day -- setting neb-host up to see if the error returns
    • @1345 noticed couldn't log in again -- console spammed with
        <May/11 07:22 am>[86659.748219] journal commit I/O error
      and stalls on
        INIT: Id "s0" respawning too fast: disabled for 5 minutes
  • 17:34 MEH: restarted all nightly pantasks with ipp078 out of processing and neb-host down until someone can look into why it is behaving oddly
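
For reference, a minimal sketch of the kind of dump pruning meant in the 09:30 note -- the dump directory and retention windows are assumptions, not the actual ipp001 layout:

    # compress gpc1 dumps older than a week; drop compressed dumps older than a month
    find /export/ipp001.0/mysql-dumps -name 'gpc1*.sql' -mtime +7 -exec bzip2 -9 {} \;
    find /export/ipp001.0/mysql-dumps -name 'gpc1*.sql.bz2' -mtime +30 -delete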
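
Likewise for the 11:35 log compression pass; the pantasks log path here is an assumption:

    # check free space, bzip2 pantasks logs older than two weeks, check again
    df -h /home
    find /home/panstarrs/pantasks/logs -name '*.log' -mtime +14 -exec bzip2 -9 {} \;
    df -h /home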

Wednesday : 2016.05.11

  • 09:30 EAM: shutting down ippc01 - ippc16 excluding ippc11
  • 10:20 EAM: czarpoll and roboczar services moved to ippc33 (see Processing)
  • 13:20 CZW: Set ipp033, ipp035-ipp045 to down in nebulous, as they have had their nebulous data reinserted elsewhere.
  • 15:26 MEH: Heather is finding i/o errors with /data/ipp080.0 -- setting neb-host down (from up) and taking it out of processing (triage sketch at the end of today's entries)
    May 11 15:24:07 ipp080 [19790986.747529] XFS (sda1): xfs_log_force: error 5 returned.
    • Haydn is rebooting it -- the sdb drive needs to be fsck'd in another machine. He will work on ipp078 and ipp080 tomorrow at MRTC-B and move the power plugs for ipp008, ipp012, ipp013, and ipp037 so they don't suddenly reboot.
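
Error 5 in that XFS message is EIO, i.e. the filesystem hit a real read/write failure on the underlying device. The triage in the 15:26 note looks roughly like the following; neb-host is the IPP-internal nebulous state tool, and its exact syntax here is an assumption:

    # confirm the errors in the kernel log on ipp080
    dmesg | grep -iE 'xfs|i/o error'
    # take the volume out of nebulous before touching it (assumed neb-host syntax)
    neb-host ipp080 down
    # unmount and dry-run xfs_repair; -n reports problems without modifying the disk
    umount /data/ipp080.0
    xfs_repair -n /dev/sda1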

Thursday : 2016.05.12

  • 13:00 Gene has finished switching nebulous from ippdb08 to ippdb01, which has larger disks.
  • 14:10 CZW: Setting ipp078 and ipp080 to repair in nebulous after Haydn checked their disks.
  • 14:15 CZW: Resuming 4x nebulous restore commands.
  • 15:50 CZW: ipp080 appears to be down, and is reporting a different ssh key than before. The NFS share is offline as well (recovery sketch at the end of today's entries).
  • 16:08 CZW: Stopped processing, Haydn is going to repair drives in ippc39, ippc43, ippc57.
  • 16:12 CZW: ipp080 is accessible by ssh again, but there are configuration issues that prevent it from exporting /data/ipp080.0 and from automounting any host beyond ipp071.0. It is already out of processing, so this shouldn't impact nightly science.
  • 17:20 CZW: ipp080 is working with the correct configuration again, and is back in repair in nebulous.
  • 18:23 CZW: ipp pantasks running again.
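
The 15:50/16:12 ipp080 recovery above is standard OpenSSH/NFS triage; a sketch, run from an admin host (only the hostnames are taken from the log):

    # drop the stale host key recorded for ipp080, then confirm ssh works again
    ssh-keygen -R ipp080
    ssh ipp080 uptime
    # check what ipp080 is actually exporting; after fixing /etc/exports on
    # ipp080 itself, 'exportfs -ra' re-reads the export table
    showmount -e ipp080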

Friday : 2016.05.13

  • 09:00 EAM: ipp078 is behaving badly again -- connection reset by peer when trying to ssh
  • 16:18 CZW: Stopping/restarting ipp pantasks.

Saturday : 2016.05.14

  • 00:30 MEH: stack processing on the ipps nodes is restarting; these are long-running stacks, so i/o load is low (ipps04:ippmd/reftstack)
  • 16:05 MEH: md03 WSdiff running on ipps nodes (ipps01:ippmd/stdscience)

Sunday : 2016.05.15