PS1 IPP Czar Logs for the week 2015.02.16 - 2015.02.22

Monday : 2015.02.16

  • 06:40 EAM: ipp023 down in the night, rebooting now
  • 08:50 MEH: OSS diff fault 5 needs quality set to clear
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 660733 -skycell_id skycell.1650.047 -dbname gpc1
  • 10:50 EAM: ippdb06 is up and running as replicant, 2434901 sec behind master. it will probably take several days to catch up. i have purged the ippdb00 binlogs to the entry a few behind the current active binlog for ippdb06. this has freed up 189G, so we can run for a while again.
    mysql> PURGE BINARY LOGS TO 'mysqld-bin.004520';
  • 23:30 MEH: rolling faults happening on nightly processing from summitcopy through diffs.. ippdb00 connections?
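
For scale, the replication lag reported at 10:50 can be converted to days. A minimal Python sketch; the helper name lag_days is made up for illustration and is not part of any IPP tool:

```python
SECONDS_PER_DAY = 86400

def lag_days(seconds_behind):
    """Convert a replication lag in seconds to days."""
    return seconds_behind / SECONDS_PER_DAY

# Lag reported for ippdb06 at 10:50 on 2015.02.16
print(f"{lag_days(2434901):.1f} days behind")  # prints "28.2 days behind"
```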

Tuesday : 2015.02.17

  • 08:10 EAM: mysql on ippdb06 crashed @ 06:43, restarting it now so it can catch up with ippdb00
  • 08:30 MEH: nightly finished, doing storage.hosts.on in stdlocal
  • 08:50 EAM: mysql @ ippdb06 up and running, 2271789 sec behind ippdb00. it is going to take > a week to catch up...
  • 09:30 EAM: stdlocal is sluggish, stopping for a restart.
  • 09:55 EAM: stdlocal up and running.
  • 09:55 MEH: QUB stamps slow, restarting pstamp -- with PSPS being used more as well as QUB, this will probably need a near daily restart (really should just do for all pantasks after nightly is finished) -- so doing all other nightly pantasks now
  • 10:40 MEH: ipp097 neb-host up, manually adding for nightly processing sum+reg+stdsci
  • 10:50 MEH: ipp094 BBU seems bad
  • 10:55 MEH: ipp067-ipp082 are at ~3.1TB remaining. this was ~4.3TB on 2/12 under the normal load of nightly science chip/warp/diffs that get cleaned. Extra data products appear to be going there at ~0.8TB/week.
  • 11:25 MEH: ipp077 seems to be having issues mounting some of the data disks and dmesg log reporting some odd issues with local disk?
    [2015-02-17 07:01:26] journal commit I/O error
    [2015-02-17 07:01:26] __journal_remove_journal_head: freeing b_committed_data
    [2015-02-17 07:01:26] __journal_remove_journal_head: freeing b_committed_data
    [2015-02-17 07:01:26] sd 2:0:0:0: [sdc] Synchronizing SCSI cache
    [2015-02-17 07:01:26] sd 2:0:0:0: [sdc]  
    [2015-02-17 07:01:26] Result: hostbyte=0x04 driverbyte=0x00
    [2015-02-17 07:01:26] sd 2:0:0:0: [sdc] Stopping disk
    [2015-02-17 07:01:26] sd 2:0:0:0: [sdc] START_STOP FAILED
    [2015-02-17 07:01:26] sd 2:0:0:0: [sdc]  
    [2015-02-17 07:01:26] Result: hostbyte=0x04 driverbyte=0x00
    [2015-02-17 07:01:26] Read-error on swap-device (8:32:94345265)
  • 12:10 MEH: Haydn doing reboot of ipp094 to try and recover BBU -- still bad reporting, he is contacting LSI
  • 14:27 CZW: Stopping stdlocal to do a batch lap monitor scan. I'll restart it when I'm done.
  • 17:15 MEH: Haydn rebooted ipp077, still doesn't see /dev/sdc but the mounts seem to be fixed now
  • 19:50 MEH: large number of registration faults, red all over ippmonitor..
    Nebulous::Client::move - unhandled fault - database error: error: DBD::mysq
    • email to whoever is running extra to stop -- pstamp and cleanup stop to help ease the issue -- may have to stop stdlocal if doesn't clear up
    • stdlocal stop
  • 21:00 MEH: been ~20 min w/o fault mess, stdlocal run and see
  • 21:15 MEH: massive faults again, stdlocal stop
  • 22:15 MEH: been 1hr and no massive fault events -- stdlocal run again
  • 22:30 MEH: massive faults -- stdlocal stop
  • 22:50 MEH: trying reduced poll in stdlocal -- while observing with the 88in..
  • 00:55 MEH: poll up to 200 is too high, large faults.. back down to 150..

Wednesday : 2015.02.18

  • 01:35 MEH: manually quality fault bad warp -- cannot build growth curve (psf model is invalid everywhere)
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1477687 -skycell_id skycell.1868.076
  • 02:15 MEH: pstamp on for past hour or so, issue seems to be driven by the large load of cleanup and stdlocal
  • 08:40 MEH: cleanup must be turned back on as soon as nightly finishes in order to catchup from yesterday's faulting issues
  • 08:50 MEH: stdlocal poll back to normal
  • 09:45 EAM: mysql@ippdb06 crashed, restarted and is now continuing to catch up with ippdb00 -- 2248479 sec behind.
  • 09:45 MEH: Gene notes stdlocal needs a restart, stdsci does as well and both will be restarted at the same time, might as well restart pstamp too
  • 10:00 MEH: ippc01-c03,c06-c09 apache stopped to rollover the nebulous_server.log, start apache when finished
    • ippc02 extra disk use not from nebulous_server.log
  • 10:30 MEH: ipp086,094 still bad BBU, but should be fine with random data (many of the ipp054-066 also have BBU issues being so old and are ok with random data)
  • 11:40 MEH: ippc02 stop/start mysql to rollover the slow.log (>5G), which was using excessive space on the OS disk
  • 12:30 MEH: older 40T disks in repair (ipp009,011,014,015,017,018,023-031) should be able to be moved to neb-host up -- on the wrong side of the 2x10G link so will need to monitor, should leave in repair any poorer/problematic systems
    • all should be not targeted except for maybe distribution, ipp010 being used as another alt-data node
    • if notice a problem, high cpu_wait, then back to repair they go
  • 12:45 CZW: stopping stdlocal/stdlanl for the daily lap kick process.
  • 17:30 MEH: update the ipp/nightly processing data host targeting to include ipp097
  • 19:20 MEH: summitcopy+registration massive faults again
    • last was around 19:25:39 -- process list around ~700 (same as in /data/ippc18.0/home/watersc1/monitor_connections.20141210/mon_con.dat5), then drops to 200-400. was this higher processlist while stdlocal slowly cleared the storage node jobs basically?
    • cleanup ~900s, was spike to ~1800s during all the faults
    • things seem ok now ~20:00
    • restarting all apache servers today may have cleared some problems
  • 22:45 MEH: fixing faulted ps_ud_QUB update -- looks like update/cleanup got crossed again, chip (chip_id 1528697) cleaned (no cmf) but state full and XY41 pixels not regenerated but data_state full
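
The three ippdb06 lag readings logged so far (Monday 10:50, Tuesday 08:50, Wednesday 09:45) allow a rough catch-up estimate. The sketch below is a back-of-envelope Python calculation, not an IPP tool: the function name catchup_days is hypothetical, and it assumes the lag keeps shrinking at the average rate between two readings, which the mysql crashes in between make optimistic at best.

```python
from datetime import datetime

# (timestamp, seconds behind master) for ippdb06, copied from the log entries
READINGS = [
    (datetime(2015, 2, 16, 10, 50), 2434901),
    (datetime(2015, 2, 17, 8, 50), 2271789),
    (datetime(2015, 2, 18, 9, 45), 2248479),
]

def catchup_days(first, last):
    """Estimate days until the lag reaches zero, assuming it keeps
    shrinking at the average rate seen between the two readings."""
    (t0, lag0), (t1, lag1) = first, last
    elapsed = (t1 - t0).total_seconds()
    # seconds of lag recovered per wall-clock second
    shrink_rate = (lag0 - lag1) / elapsed
    if shrink_rate <= 0:
        return float("inf")  # replica is not gaining on the master
    return lag1 / shrink_rate / 86400

print(f"{catchup_days(READINGS[0], READINGS[1]):.0f} days")  # Mon -> Tue rate: 13 days
print(f"{catchup_days(READINGS[0], READINGS[2]):.0f} days")  # Mon -> Wed rate: 24 days
```

The Monday-to-Tuesday rate agrees with the "> a week" estimate in the Tuesday 08:50 entry; including the Wednesday reading roughly doubles it.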

Thursday : 2015.02.19

  • 08:00 EAM: stopped and restarted stdlocal, added storage hosts
  • 08:15 EAM: added gpc2 database to summitcopy manually, and also to the input script.
  • 10:10 MEH: restarted pstamp, other QUB stamps stuck most of night --
    • another pstamp/update/cleanup conflict -- data_states full, label ps_ud_QUB, state cleaned and pstamp waiting on it -- clearing now
  • 11:30 CZW: One exposure seems to be stuck, and reverts aren't clearing it. The log reports: "Aborting in function p_psImageAlloc at psImage.c:90.", which appears to be the case in this outstanding ticket. A visual inspection shows nothing that looks like a star on the image. Setting quality to clear the bad chip:
    chiptool -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 1534859 -class_id XY17
  • 12:30 MEH: pstamp space down to 400G/5.5T -- MPE label out for a bit so cleanup can catchup (and so space for QUB time critical stamps to get out)
  • 13:30 EAM: I am about to stop replication on ippdb02 to make an rsync copy of the database to ippdb06.
  • 16:35 MEH: pstamp has run out of disk space -- MPE label out
    • as ~ipp, running for 12 days, if not enough then will do 10 days --preserve-days 12 --verbose
  • 17:05 MEH: ipp086,094 (random data targets w/o BBU) are unhappy for some reason (high cpu_wait) so back to repair they go
  • 18:20 MEH: pstamp 12 day only cleaned up ~35G, do 5 days.. to get ~200G back.. 2/18+19 have ~4.6T split between them..
  • 20:30 CZW: summitcopy seems to be running slowly. Doing a restart of the server to see if that helps.
    • CZW: and registration because it's not doing anything while summitcopy restarts.
    • MEH: is it because it is also downloading large numbers of gpc2 data? -- see Gene's note above

Friday : 2015.02.20

  • 00:45 MEH: JK has triggered large cleanup of their large stamps, >2T free now so adding MPE label back into pstamp
  • 09:40 EAM: ippdb06 rsync from ippdb02 finished, new copy of neb mysql is up and running there (now only 78413 sec behind). NOTE: there are now 2 copies of the neb database on ippdb06, both in /export/ippdb06.0. mysql.20141108 is the original version (currently pretty far behind), mysql.20150220 is the new version. it is possible to switch between these by stopping the mysql server and changing the link at /var/lib/mysql before restarting. the link at /var/lib/mysql just needs to point at the version desired.
  • 09:50 EAM: with Craig's ok, I've bumped up the number of hosts running summitcopy. We need to catch up on gpc2. this needs to revert to normal when we go on sky tonight.
  • 11:30 CZW: restarting stdlocal pantasks.
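
The link switch described in the 09:40 entry can be sketched in Python. This is an illustration only: the helper repoint_link is hypothetical, the demo runs in a scratch directory rather than on the real /var/lib/mysql, and stopping and restarting the mysql server around the switch (as the entry requires) is left out.

```python
import os
import tempfile

def repoint_link(link_path, new_target):
    """Point the symlink at link_path to new_target, replacing any existing link."""
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_target, tmp_link)
    # Rename the new link over the old one in a single step
    os.replace(tmp_link, link_path)

# Demo in a scratch directory standing in for /export/ippdb06.0 and /var/lib
with tempfile.TemporaryDirectory() as scratch:
    old_copy = os.path.join(scratch, "mysql.20141108")
    new_copy = os.path.join(scratch, "mysql.20150220")
    os.mkdir(old_copy)
    os.mkdir(new_copy)
    link = os.path.join(scratch, "mysql")
    os.symlink(old_copy, link)      # link starts at the original copy
    repoint_link(link, new_copy)    # switch to the new rsync copy
    print(os.readlink(link))
```

Building the new link under a temporary name and renaming it means the link path never goes missing, even if the script is interrupted partway.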

Saturday : 2015.02.21

  • 08:10 EAM: we finally caught up with the gpc2 downloads last night, so before tonight someone should restart summitcopy with the normal complement of hosts. it would be good to see if gpc2 and gpc1 can live together without problems.
  • 08:11 EAM: we are down to about 9k queued staticsky runs, so I'm queuing the next batch (12-14h).
  • 12:55 MEH: ipp086 being harassed (high cpu wait), into repair
  • 15:10 EAM: stdlocal sluggish, stopping for restart
  • 15:30 EAM: restarted stdlocal, summitcopy, registration. gpc2 database is in both summitcopy and registration.
  • 15:31 EAM: mysql@ippdb06 crashed again, restarted it.

Sunday : 2015.02.22

  • 09:40 EAM: mysql@ippdb06 still running, 0 sec behind.
  • 09:41 EAM: stopping stdlocal for a restart.
  • 10:40 EAM: stdlocal restarted -- a handful of stacks were taking a very long time to complete, so I pushed them to state 'wait'. this will keep them from re-queuing, but the end of job script should put them in a reasonable state.
  • 10:45 EAM: We've been running staticsky with 4 x x0,x1,x3, but I am expecting to get cycles from the UHM Cray soon, and staticsky is the best use there (high Amdahl number). So I'm turning off x3 to move those nodes into stdlocal as the staticsky clears.