PS1 IPP Czar Logs for the week 2016.04.04 - 2016.04.10


Monday : 2016.04.04

  • 09:34 MEH: more SNIaF updates in ippqub:stdscience_ws
  • 20:20 EAM : stopping pantasks for restart

Tuesday : 2016.04.05

Wednesday : 2016.04.06

  • MEH: gpc1 hasn't had a proper dump since ~2/4, when ippdb03 had OS disk RAID issues and had to be swapped with ippc15; this was not checked afterwards.
    • /var/spool/cron/crontabs had group stb-admin instead of crontab. Fixed, so ipp has crontab access again on ippdb03.
    • re-creating the crontab, following the info on DatabaseBackups as closely as possible.
  • MEH: ippMonitor has been reporting problems (red) with ippdb03 (gpc1 replication) and ippdb06 (nebulous replication) for some time with no one looking into it. This dates back to the swap of ippMonitor to another machine while ippdb03 was being remade. There are all kinds of inconsistencies in host names and passwords in the ippMonitor AND mysql setups. Tweaked things so that ippMonitor reports on those and we can be alerted to problems now, but the inconsistencies cannot easily be cleaned up, since it is not documented what relies on them.

  • 15:45 MEH: Haydn replacing BBU in ipp085 -- all pantasks stop for now
  • 17:40 MEH: restarting nightly pantasks with data targeting adjusted away from the nodes with less free space
    • adding ipp100,102,103,104.0+1 as targets, and also adding them manually to summitcopy+registration as a test for tonight given the disk space imbalance

Thursday : 2016.04.07

  • MEH: home disk <35GB again; ~ipp registration and stdscience logs are already >3GB, and the previous month's logs should have been bzip'd
    • replication/logs is wasting 17GB of space and needs to be bzip'd
  • MEH: adding ipp100-104 into hosts_s6 group -- similar to the s5 group machines, but those are used for pv3 right now -- could use s6 for boosting summitcopy+registration while s5 is removed for pv3 work
  • 15:20 MEH: sending ps_ud% labels to cleanup now that STScI reports the download is complete
  • MEH: restarting nightly pantasks and manually swapping in ipphosts.mhpcc.config.20160407avoidfull for data targets to avoid full nodes w/ warps+2x diffs (replaced with ipp100-104 since pixel products cleaned up) -- if there is a problem, the .base version can be restored or a recompile done
    • manually adding 2x s6 to summitcopy+registration
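
The log compression noted above (the previous month's logs under ~ipp and replication/logs) could be scripted roughly as follows. This is only a sketch and not an existing IPP tool: the function name, the ".log" suffix and the 30-day age cutoff are assumptions chosen for illustration, and the real log file naming may differ.

```python
import bz2
import os
import shutil
import time
from pathlib import Path


def compress_old_logs(log_dir, min_age_days=30, suffix=".log", now=None):
    """bzip2 files ending in `suffix` under log_dir that are older than min_age_days.

    Hypothetical helper: the suffix and age cutoff are assumptions.
    Returns the list of .bz2 paths written.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    written = []
    # sorted() materialises the file list before anything is modified
    for path in sorted(Path(log_dir).rglob("*" + suffix)):
        if not path.is_file():
            continue
        mtime = path.stat().st_mtime
        if mtime > cutoff:
            continue  # too recent, may still be written to
        target = path.with_name(path.name + ".bz2")
        if target.exists():
            continue  # never overwrite an existing archive
        with open(path, "rb") as src, bz2.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # keep the original modification time on the archive
        os.utime(target, (mtime, mtime))
        # remove the original only after the archive is fully written
        path.unlink()
        written.append(target)
    return written
```

Running something like this monthly from cron on the logs directories would keep the home disk from filling with uncompressed logs; the directory to pass in is site-specific and is not given here.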

Friday : 2016.04.08

  • 15:40 CZW: Restarting ipp/pantasks.
  • 17:35 CZW: ippc05 is having problems.
    This host runs the ipp/cleanup pantasks.  I noticed that after the
    restart of pantasks, it wasn't completing any of those jobs
    (lossycomp).  Logging in resulted in a lot of odd hanging.  A dmesg
    suggests a disk failure:
    [...lots of these...]
    [16951939.150146] ata2.00: status: { DRDY ERR }
    [16951939.150148] ata2.00: error: { UNC }
    [16951939.151880] ata2.00: configured for UDMA/133
    [16951939.151896] ata2: EH complete
    [16951946.152075] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    [16951946.152079] ata2.00: irq_stat 0x40000001
    [16951946.152084] ata2.00: cmd c8/00:08:f9:43:73/00:00:00:00:00/e3 tag 0 dma 4096 in
    [16951946.152085]          res 51/40:08:f9:43:73/00:00:03:00:00/e3 Emask 0x9 (media error)
    [16951946.152087] ata2.00: status: { DRDY ERR }
    [16951946.152089] ata2.00: error: { UNC }
    [16951946.153872] ata2.00: configured for UDMA/133
    [16951946.153892] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
    [16951946.153895] sd 1:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
    [16951946.153900] Descriptor sense data with sense descriptors (in hex):
    [16951946.153902]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
    [16951946.153911]         03 73 43 f9
    [16951946.153914] sd 1:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
    [16951946.153919] end_request: I/O error, dev sdb, sector 57885689
    [16951946.153945] ata2: EH complete
    [16951946.153991] sd 1:0:0:0: [sdb] Write Protect is off
    [16951946.153995] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
    [16951946.154170] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [16951946.154207] sd 1:0:0:0: [sdb] 976773168 512-byte hardware sectors: (500 GB/465 GiB)
    [16951946.154222] sd 1:0:0:0: [sdb] Write Protect is off
    [16951946.154224] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
    [16951946.154425] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [16951946.168432] raid1:md2: read error corrected (8 sectors at 25690376 on sdb3)
    [16951946.168447] raid1: sda3: redirecting sector 25690376 to another mirror
    
    I've stopped and shut down the cleanup pantasks, and will restart it on
    a different host (ippc11) for the weekend.  I've also removed ippc05
    from the nebulous server list for the ipp and ippqub users.
    
  • 17:45 CZW: I've restarted all the ipp/pantasks servers to ensure that everything is using the correct .tcshrc file with ippc05 commented out.
  • 18:20 CZW: After waiting ~10-15 minutes for my su to succeed, I've stopped apache on ippc05.
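
For spotting this kind of failure earlier, the dmesg output from ippc05 above could be summarised per device with a small script. The following is a minimal sketch, not part of the IPP tools: the function name is hypothetical, and the two patterns are taken only from the message formats in the excerpt above, so other kernels may word these messages differently.

```python
import re
from collections import defaultdict

# Patterns based on the message formats in the ippc05 dmesg excerpt.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
MEDIUM_ERROR = re.compile(r"\[(sd[a-z]+)\] Sense Key : Medium Error")


def summarize_disk_errors(lines):
    """Count I/O and medium errors per block device in dmesg-style lines."""
    summary = defaultdict(
        lambda: {"io_errors": 0, "medium_errors": 0, "sectors": []}
    )
    for line in lines:
        m = IO_ERROR.search(line)
        if m:
            dev, sector = m.group(1), int(m.group(2))
            summary[dev]["io_errors"] += 1
            summary[dev]["sectors"].append(sector)
            continue
        m = MEDIUM_ERROR.search(line)
        if m:
            summary[m.group(1)]["medium_errors"] += 1
    return dict(summary)
```

Feeding it the output of dmesg line by line gives one entry per affected device (sdb in the excerpt), with the failing sector numbers collected for reference.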

Saturday : 2016.04.09

  • 14:00 EAM : ipp048 died. I rebooted it successfully, but left it off in nebulous

Sunday : 2016.04.10

  • 05:00 EAM : ipp048 is down again. I'm going to leave it powered off. I also restarted the czarlog scripts, which must have hung.
  • 06:20 EAM : stdscience is running a bit slow so I am stopping it for a restart.