PS1 IPP Czar Logs for the week 2015.08.03 - 2015.08.09

Non-standard Processing

In an effort to improve communication, HAF thinks we should list additional (non-standard) processing in the czar pages, so that we all know what's going on (there is no requirement for others to help run it). HAF will start:

HAF:

  • ipp054 - ipp081: addstar processing for full force / diff. If you notice problems and need to reboot / stop this, please do the following:
    • contact HAF via email or phone (all should have my #)
    • if you feel addstar needs to be stopped and can't reach her:
             ssh ippdvo2@ippc19
             cd addstar.ipp054 (replace with machine that is problematic)
             pantasks_client
             stop
      
      • and then check status: wait for the current addstar run to finish and for minidvodb.premerge to finish (otherwise we risk corrupting the mini); see the sketch after this list
  • as of 8/4, these are running, except for ipp060/ipp064/ipp078
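
  A minimal sketch of the full stop-and-check sequence, consolidating the steps above (assumption: pantasks_client accepts a status command that reports per-task job counts; this is not confirmed here):

         ssh ippdvo2@ippc19
         cd addstar.ipp054        # replace with the problematic machine
         pantasks_client
         stop
         status

  Repeat status until no addstar or minidvodb.premerge jobs show as running before doing anything else to the host.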


Monday : 2015.08.03

  • 00:50 MEH: ganglia is reporting many down systems and many faults have occurred -- did someone start/turn something on recently?
  • 01:55 MEH: pstamps for QUB don't seem to be loading/running; trying a restart of pstamp, otherwise will have to look at it in the morning.
  • 02:10 MEH: still seeing reg and other faults with long-running jobs in stdsci; probably going to be a backed-up mess in the morning
  • 05:55 Bill: restarted pstamp pantasks after moving some broken requests out of the way
    # run for each broken req_id (ids not listed here):
    for req_id in <broken req_ids>; do pstamptool -dbserver ippc17 -dbname ippRequestServer -updatereq -set_state goto_cleaned -req_id $req_id; done
    
  • 06:10 EAM: ganglia continued to report many machines down, but this was not true. Suspecting that this was the result of using the xnodes for nightly processing, I have returned stdscience to its original loading (compute and storage nodes). I stopped and restarted all ipp pantasks.
  • 11:20 MEH: clearing a couple of repeat WS faults with quality 42
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.2322.017 -diff_id 1186759  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.2261.014 -diff_id 1186775  -fault 0
    
  • 12:30 MEH: Haydn is taking ipp082 down to fix the BBU -- set to neb-host down in advance as usual.
    • 13:10 back up; will leave it in repair until there is data to process and we can test the response with neb-host up
  • 17:00 CZW: restarting cleanup pantasks.
  • 23:45 MEH: gpc2 OSS.20150725 was cleaned before MOPS had a chance to process it... with the poor weather so far, setting it to update and testing ipp082 with neb-host up
    • ipp082 responded as expected with the BBU and write cache on; it should be able to stay neb-host up now

Tuesday : 2015.08.04

  • 07:52 Bill: ippc30 is acting overloaded. I suspect that Johannes is being a bit too aggressive downloading his results. I'm removing the MPE label from the postage stamp processor for a bit in order to allow it to settle down.
    • added the label back in some time later. Reduced the poll limit for jobs from 75 to 50, which yielded a load of 50 (versus the usual 100). The earlier problem may actually have been that the home directory server wasn't responding when addstar processing was restarted.
  • ~08:00 HAF: restarted addstars; see email.
  • 19:50 MEH: restarting pstamp; need to keep the rate up so QUB can get their stamps.
    • The PS1 outlook is poor due to humidity, so going to bump the pstamp update rate for stamps for the upcoming PESSTO and SNIFS runs

Wednesday : 2015.08.05

  • 12:00 CZW: Restarting all pantasks.
  • 12:55 MEH: sending the gpc2 OSS.20150725 updated data for MOPS back to cleanup now

Thursday : 2015.08.06

  • 09:35 Bill: set quality fault for diff 1187664 skycell.1013.057 because I mistakenly deleted one of the files from the template stack 2873716.
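    For reference, this was presumably done with a difftool call of the same form as the quality-42 commands above (a sketch only; the actual quality value set is not recorded here):

        difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1013.057 -diff_id 1187664 -fault 0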

Friday : 2015.08.07

Saturday : 2015.08.08

  • 05:20 EAM : ipp063's disk is offline. I am stopping everything (including the ipp063 addstar) and rebooting. We are also getting too many connection errors to nebulous; I do not yet know if the two are connected.
  • 05:24 EAM : ipp080 is also overloaded with md5sums; I put it in repair.
  • 05:45 EAM : ipp063 is back up with a working disk; I've restarted the standard ipp pantasks.

Sunday : 2015.08.09

  • 05:43 Bill : restarted gmond on ipp071
  • 07:47 Bill : restarted gmond on ipp076
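    For reference, these gmond restarts are normally just a restart of the ganglia monitoring daemon on the affected node (a sketch; the exact init-script path and required permissions are an assumption):

        ssh ipp071 /etc/init.d/gmond restart    # likewise for ipp076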
  • 09:10 EAM : I've put ipp080 'up' and ipp063 in 'repair'. We are short of up nodes in nebulous. It looks like we have enough space on ipp023 - ipp030 for some processing, so I'm putting those up, too.
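    A sketch of the nebulous state changes described above, assuming the neb-host tool takes the host name and the target state (up / repair / down) as its arguments:

        neb-host ipp080 up
        neb-host ipp063 repair
        neb-host ipp023 up       # and likewise for ipp024 - ipp030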
  • 10:00 EAM : restarted gmond on ipp074 -- lots of 'VFS: file-max limit 19809059 reached' messages in dmesg, but they do not seem to be happening now. If these keep up / re-appear, this machine may need to be rebooted.
  • 19:17 HAF is restarting pantasks for tonight -- I don't want to wake up to a mess.
  • 19:56 HAF is sorting out faults:
    update  warpSkyfile set fault = 0, quality = 42 where warp_id = 1612666 and skycell_id = 'skycell.0703.028';