PS1 IPP Czar Logs for the week 2017.07.10 - 2017.07.16

Monday : 2017.07.10

  • MEH: as discussed at the meeting last week and this week, priority w-band refstacks need to be running on as many nodes as possible, but are removed from data nodes during nightly processing (~ippqub/stack/ptolemy.rc, PSNSC.wref.20170707)
    • ipp058-066 repair->up during day for refstack processing (until h02-con power management setup)

Tuesday : 2017.07.11

  • 13:00 EAM : I have updated czarpoll for ippMonitor in a couple of useful ways:
    • I have fixed a bug preventing it from reading the labels from the pantasks
    • I have modified the code to merge the stdscience and distribution pantasks labels together for display
      • the latter change should remove the need to manually hack the CzarDb.pm code to add labels.
    • I have added the gnuplot binary code path to the .xml config file, and removed the hard-coded GNUPLOT value from
    • I have moved the config file up one level (to ippMonitor/czartool/czarconfig.xml from ippMonitor/czartool/czartool/czarconfig.xml) so it is not so hidden
  • MEH: adding limited w refstack processing on ippx nodes alongside dvopsps processing -- x051-x064
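
The label merge described in the 13:00 EAM entry above can be illustrated with a minimal sketch. This is not the czarpoll code (which is not shown in this log); the function name `merge_labels` and the label values are hypothetical, and the sketch assumes each pantasks simply reports a list of label strings that should be combined without duplicates.

```python
def merge_labels(*label_lists):
    """Merge several label lists into one, keeping first-seen order
    and dropping duplicates."""
    seen = set()
    merged = []
    for labels in label_lists:
        for label in labels:
            if label not in seen:
                seen.add(label)
                merged.append(label)
    return merged


# Hypothetical label lists, for illustration only.
stdscience_labels = ["label.a", "label.b"]
distribution_labels = ["label.b", "label.c"]

merged = merge_labels(stdscience_labels, distribution_labels)
print(merged)
```

Keeping first-seen order means labels from the first pantasks stay at the top of the display, with any labels unique to the second appended after them.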

Wednesday : 2017.07.12

  • MEH: cleaned up misc K2+MD07 nightly products
  • MEH: continuing to add limited refstack processing on ippx nodes alongside dvopsps processing -- x030-x050
    • this created additional faults, likely due to the heavy NFS load on ipp118-122, so turning it back off there
  • MEH: ipp121 ganglia down, restarted gmond and reporting again so any issues can be logged...
  • 16:00 CZW: power cycled stare01, which I seem to have killed accidentally.

Thursday : 2017.07.13

  • 06:50 MEH: a very long list of summitcopy fault 200 -- using revertcopy to clear
    request failed: 500 EOF when chunk header expected at /data/ippc64.1/ippitc/psconfig/ipp-20170121.lin64/bin/dsget line 155.
  • 07:20 MEH: lost connection to summit for a few mins at least
    request failed: 503 Service Temporarily Unavailable at /data/ippc64.1/ippitc/psconfig/ipp-20170121.lin64/bin/dsget line 155.
  • MEH: the network issue faulted summit files before revertcopy was manually started -- manually set fault 0 in gpc1 so they re-download
    7 summit faults
    exp_name      registered            fault
    o7947g0105o   2017-07-13 14:44:34   203
    o7947g0107o   2017-07-13 14:46:03   203
    o7947g0108o   2017-07-13 14:46:44   203
    o7947g0109o   2017-07-13 14:47:29   203
    o7947g0110o   2017-07-13 14:48:10   203
    o7947g0111o   2017-07-13 14:48:51   203
    o7947g0116o   2017-07-13 14:52:28   203
  • MEH: Gavin set up email services for the cluster again; RoboCzar is back on and appears to be sending mail again

Friday : 2017.07.14

  • 09:00 EAM : one exposure (o7948g0563o) failed on pzgetexp (query of datastore), similar to the exposures reported above by MEH. I manually cleared the exposure in mysql and then re-ran pzgetexp:
    pzgetexp -uri -inst gpc1 -telescope ps1 -dbname gpc1 -last-fileset o7948g0560o
    Note that I have added an option to pzgetexp to specify a last-fileset -- this could be the first exposure for a night -- so the request gets all exposures for the night, but does not download the entire datastore's list. (TODO: add a -revert equivalent to pzgetexp so we do not have to manually hack the database).
  • MEH: nightly stack pantasks running w-stacks now -- removed from normal and adding si0,ci1 -- finished for tonight
  • MEH: /export/ippc97.1 was also still broken from past c18 mount cleanup? remounted
  • MEH: Haydn replaced .1 disk on compute nodes ippc36, c39, c43, c47, c58, and c59 -- recreated ipp/tmp and fixed permissions on those partitions again to be used
  • MEH: w warp cleanup this afternoon freed up space on ipp118.0,119.1 so they can go back up from repair
  • MEH: Haydn thinks he finished all the serial console setups now -- ipp058-066 have remote power management so can leave up during nightly for stack targeting (very little nightly will ever go there anyways)
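
The -last-fileset option described in the 09:00 EAM entry above can be illustrated with a small sketch. It assumes the datastore listing is ordered by registration time and that the option means "return only filesets after the named one"; both are assumptions, not confirmed by this log. The function `filesets_after` and the sample listing are hypothetical and are not part of pzgetexp.

```python
def filesets_after(listing, last_fileset):
    """Return the entries that follow last_fileset in an ordered listing.

    If last_fileset is not in the listing, the full listing is returned
    (assumed fallback: behave as if no last-fileset was given).
    """
    if last_fileset not in listing:
        return list(listing)
    idx = listing.index(last_fileset)
    return list(listing[idx + 1:])


# Hypothetical, abbreviated listing built from exposure names in this log.
listing = ["o7947g0116o", "o7948g0560o", "o7948g0563o"]

print(filesets_after(listing, "o7948g0560o"))
```

Passing an early exposure of the night as the last-fileset keeps the request limited to that night's exposures instead of the whole listing.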

Saturday : 2017.07.15

Sunday : 2017.07.16