PS1 IPP Czar Logs for the week 2017.03.13 - 2017.03.19

Monday : 2017.03.13

  • 21:00 EAM : we shut down 1/3 of the cluster today to move to ITC. I set up the pantasks host to avoid machines in the POD. I also added in ipp121 and ipp078, which had been behaving themselves. Unfortunately, around 20:00 ipp121 had a kernel panic and started to have nfs problems. I rebooted ipp121 and took it and ipp078 out of processing.
  • 23:59 EAM : load on s3 & s5 was a bit heavy, while m1 & c2 were light. adjusted loading so all of those groups have 5 connections

Tuesday : 2017.03.14

  • 17:00 CZW: I've set up a pantasks to run rsync jobs between the low-numbered storage volumes and the high-numbered storage volumes (ipp118/119/120 for now). These jobs should automatically stop at night (6:30PM), and restart in the morning (8AM), to ensure that nothing impacts nightly processing. It will likely need additional tuning, as the server takes a long time to load the input file, but it is at least running. Server directory /data/stare00.1/watersc1/itc_rsync_lists/rsync_run, running as ipptest.
  • 19:30 EAM : stopping and restarting pantasks
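
    Note on the 17:00 entry: the day-only window for the rsync jobs (start 8AM, stop 6:30PM) can be sketched as below. This is only an illustration in Python, not the actual pantasks configuration; the names DAY_START, DAY_END and rsync_allowed are hypothetical, and the window is assumed to be in local time.

```python
from datetime import time

# Assumed window from the log entry: jobs may run 08:00-18:30 local time.
# These names are hypothetical and not part of pantasks.
DAY_START = time(8, 0)
DAY_END = time(18, 30)


def rsync_allowed(now: time, start: time = DAY_START, end: time = DAY_END) -> bool:
    """Return True if a new rsync job may be launched at local time `now`.

    The start is inclusive and the end is exclusive, so nothing new is
    launched at exactly 18:30.
    """
    return start <= now < end
```

    A scheduler loop would call rsync_allowed(datetime.now().time()) before launching each job; jobs already running at 6:30PM are not covered by this check.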

Wednesday : 2017.03.15

  • 14:20 CZW: I've changed the Nebulous-Server max_used_space parameter to 0.989 on ippc70-ippc75, and restarted their apache servers. This change will allow the ITC nodes to be used at a higher fill factor than before, allowing processing to shift there.
  • 16:20 CZW: With the shift of processing to ITC, I've disabled the trange limits on the rsync pantasks (/data/stare00.1/watersc1/itc_rsync_lists/rsync_run). This will allow jobs to run internal to MRTC-B overnight, as the target hosts aren't used for nightly processing. I've also increased the loading slightly (3x instead of 1x, still running 10MB/s rsync jobs), and may move it up a bit more (5x max).
  • 18:20 MEH: shutting down pantasks @MRTC-B and starting summitcopy+registration+stdscience @ITC with agreement from MOPS --

Thursday : 2017.03.16

  • MEH: nightly processing @ITC
    • summitcopy,registration,stdscience,pstamp,distribution,cleanup,stack all running -- ~ippitc/ setup for all
      • changed default 6x si0, 3x ci0/1 to 5x si0, 4x ci0/1 for fewer faults, but then seemed to get a pileup in warp -- may be due to extra targeting on ipp105,117 (many pending for ipp117) so bumping unwant 20->30->40
      • many fault 2 seem to be coming from c70-c75 (apache nodes), -1x loading -- didn't really help, remove all from stdscience with hosts_apache group (other pantasks seem ok)
    • ippqub/stdscience_ws running normally @MRTC-B, should not be touched other than set to stop -- WS/SSdiffs/ps_ud_QUB running (5x si0, 5x ci0/1) in AM after nightly and directly targeting ipp105,117 datanodes only -- ps_ud_QUB 1x ITC loading during night
    • WWdiffs hacked to only produce diff+inv uncompressed images now -- %SS,3PI starting night %.20170317
    • data targeted more to ipp105,117 now -- more may be needed and may help to reduce ipp105,117 to 1x loading only in processing
    • gpc2 removed from ippMonitor/ to help speed up refresh (not waste time on those queries and plots) ~20 min now
    • many WSdiff faults by skycell due to missing stacks -- have to just manually set quality 42 to clear for all so QUB can get diffs available for other skycells
    • many pstamp chip update faults due to missing raw and mdc/psf products -- pstamp will skip after a bit, but probably helps to cleanup before nightly so cycles not wasted on reverts to fail again and again
      • pstamp update cleanup back on (defaults 22-00 UTC daily, maybe better if made later, 02-04 UTC? maybe conflict w/ normal cleanup) -- will help keep space cleaned up, but may need to block MPIA/MPE/WEB.BIG/PSI.BIG if large jobs show up or possibly just remove label during night
    • summitcopy keeping up well, only ~5-10 exposure lag
    • startup had quite a bit of lag, chip jobs >300-400s, camera>1ks -- mounts to all the datanodes being re-established? logs seem to indicate stalling on gpc1 addprocessed exposure? MRTC-B/ITC link seems very active at start of nightly processing as well --

Friday : 2017.03.17

  • 01:50 MEH: ~50% datanodes stopped taking data, rest will quickly fill up to only ipp105,117 -- sending diffs from 20170316 to cleanup a few hours early since MOPS has made stamps already and czar'd (and will have the additional pixel products)
    • MEH: another issue may be the nodes w/ mysql running -- while probably won't use too much space, once the new pods arrive then the mysql nodes should probably just be put into repair until more space freed up
  • 18:45 CZW: Starting /data/stare00.1/watersc1/itc_rsync_lists/rsync_run2 pantasks as ipptest to start shuffle of ipp015-ipp031 data to ipp118-ipp122.

Saturday : 2017.03.18

Sunday : 2017.03.19

  • 21:45 MEH: nightly is falling behind, appears ~ippitc/stdscience is <50% loaded and >30 exposures behind in chip and warp, rate decay trend last night suggests >100k jobs still has problem (140k currently) -- doing restart of stdscience so not behind for system work in morning -- fully loaded and catching up now