PS1 IPP Czar Logs for the week 2016.10.31 - 2016.11.06


Monday : 2016.10.31

  • MEH: NCU/MOPS processing today; use of the upper ippx and ipps nodes continues
  • MEH: ippMonitor has had ippdb03 red for a while -- the ippMonitor password in the config needed to be fixed -- slave status is showing properly now (a replication check is sketched after this list)
  • MEH: relastro was reported done at the meeting -- the nightly nodes should be reallocated back to the default (x0,1 -> s3,4) since Heather is using them for loading the next processed data
    • oddly, the s6 nodes also seem to be commented out -- "# EAM 2016.05.04 : remove s6 nodes to test load for ITS move" -- is this still necessary? -- no, it was a short-term test and the 3x nodes should have been put back into processing
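
A minimal sketch of the kind of replication check behind the ippMonitor slave-status display; the credentials file path here is a hypothetical placeholder, not the actual ippMonitor config:

    # Query slave health on ippdb03 (credentials path is hypothetical)
    mysql -h ippdb03 --defaults-extra-file=/path/to/ippmonitor.cnf \
          -e "SHOW SLAVE STATUS\G" \
      | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
    # Healthy: both threads report "Yes" and Seconds_Behind_Master is near 0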

Tuesday : 2016.11.01

  • MEH: NCU/MOPS processing is mostly finished except for the ipps node use (and normal nightly c2 node use) -- now waiting on requests for products as needed
  • 13:20 MEH: Haydn is taking one of the datanodes on 10G offline to test something with the cable(?); setting processing to stop -- finished, and restarted all pantasks for nightly processing
    • many of the nightly datanodes are red... cleaning some of the NCU diffs to make space, since there does not appear to be any other space that can be freed (a quick space survey is sketched below)
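
A quick space survey of the sort used for this cleanup, as a sketch -- the mount point and directory pattern are assumptions about the datanode layout, not actual paths:

    # How full is the datanode volume? (mount point is hypothetical)
    df -h /export/ipp0xx.0
    # Largest diff trees first -- candidates for cleanup
    du -sk /export/ipp0xx.0/nebulous/*diff* 2>/dev/null | sort -rn | head -20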

Wednesday : 2016.11.02

  • 9:00 CZW: Processing stopped to allow for switchover to N5K switch for ippdb01.
  • 12:45 CZW: Processing resumed.
  • 16:25 CZW: Pantasks restart, even though we didn't have data last night.
  • MEH: QUB possibly updating a modest number of NCU diffs for stamps off and on (a regular part of ~ippqub/stdscience_ws and independent of the nightly processing for MOPS)
  • MEH: SNIaF updates running on ipps nodes in ~ippmops/stdscience

Thursday : 2016.11.03

  • MEH: reconfig for nightly QUB targeted processing -- possibly won't be finished until after midnight tonight
  • 16:30 CZW: Restart of pantasks after a rebuild of ippTools. This picks up the change to prioritize the download of image files whose exp_name does not match LIKE 'c%' above those that do (the ordering is sketched after this list).
  • MEH: cleaning up the remaining NCU WSdiffs to hopefully free up more space for normal nightly data...
  • MEH: pstamp was stalled on a MOPS request (866480) for more than an hour; a restart cleared the issue -- QUB targeted observations run again tonight and need pstamp working
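
The new download priority could be expressed as an ORDER BY of this form; only the logic is real here -- the database and table names are hypothetical stand-ins, with exp_name as in the log:

    # In MySQL, (exp_name LIKE 'c%') evaluates to 0 or 1, so ascending order
    # puts the non-'c%' exposures at the front of the download queue.
    mysql summitcopy_db -e "
      SELECT exp_name
        FROM pending_downloads           -- hypothetical table
       ORDER BY (exp_name LIKE 'c%') ASC, exp_name ASC;"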

Friday : 2016.11.04

  • 14:00 MEH: reboot of ippcore to clear any possible underlying network issues after the many changes made in preparation for the move to Manoa ITS
    • rates improved and errors are reduced, but traffic is on the backup gigE connection out from ippcore; the 10G link appears to be broken (watch summitcopy download times -- has it been the cause of the randomly long downloads? a link check is sketched after this list)
  • MEH: it appears ippdb06 hasn't been replicating nebulous since ~August... czars really need to watch the primary czarpage on ippmonitor/ippdb03 (ippc18 is the backup; it is not up to date and has incorrect credentials for connecting to the DBs from ippc18) -- Gene will need to do a full mysql dump in the morning after nightly processing, ~12 hrs (a dump sketch also follows this list)
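
A hedged way to confirm which link ippcore traffic is actually on; the interface name is a placeholder for whatever the 10G port is called:

    # Negotiated speed and link state (interface name is hypothetical)
    ethtool eth2 | grep -E 'Speed|Link detected'
    # Interface error counters that would point at a bad cable or optic
    ip -s link show eth2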
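
A minimal sketch of the full dump Gene would run to reseed ippdb06, assuming standard MySQL tools and InnoDB tables; the output path is hypothetical, while the database name nebulous is from the log:

    # On the master (ippdb01): consistent dump with the binlog coordinates
    # recorded in the header so the slave can restart at the right position
    mysqldump --single-transaction --master-data=2 nebulous \
        > /export/backup/nebulous-full.sql     # output path is hypothetical
    # On ippdb06: load the dump, then START SLAVE; and verify with
    # SHOW SLAVE STATUS\G (Seconds_Behind_Master should count down to 0)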

Saturday : 2016.11.05

  • MEH: ganglia on ippdb01 apparently has not been recording status since about the week before last -- restarted gmond and it is back
    • same for ipp017...
  • 15:00 EAM: this morning I stopped all operations, shut down the neb apache servers, and shut down the nebulous mysql on ippdb01 (binlog position: mysqld-bin.000409 | 930523745). I then ran an rsync of the full database from ippdb01 to ipp102; as of 13:15, this rsync had completed. After that, I did the following (the whole procedure is sketched below the list):
    • restarted mysql on ippdb01
    • restarted the apache servers
    • restarted the pantasks
    • restarted czarlog and roboczar
    • restarted nebdiskd

I believe everything is back up and running at this point.
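
In outline, the cold-backup procedure above reduces to something like this; the mysql datadir path is an assumption, while the binlog position is from the log:

    # quiesce: pantasks stopped, neb apache servers stopped, then
    mysqladmin shutdown        # stop the nebulous mysqld on ippdb01
    # last binlog position recorded: mysqld-bin.000409 | 930523745
    rsync -a /var/lib/mysql/ ipp102:/var/lib/mysql/    # datadir path hypothetical
    # then restart in order: mysqld, apache servers, pantasks,
    # czarlog/roboczar, nebdiskd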

Sunday : 2016.11.06

  • MEH: the majority of the nightly science datanodes are red after normal cleanup... will try to find things to clean up -- ~20% of a night is in .multi diffs again... some QUB data will just need to be re-updated when needed... -- will need to re-update some WSdiffs for MOPS in the morning from the QUB field