PS1 IPP Czar Logs for the week 2017.01.30 - 2017.02.05

Monday : 2017.01.30

  • 11:15 CZW: Restarted ITC raw data shuffle to work on failures from previous iterations.
  • 16:15 CZW: Restarting raw shuffle with randomized list to attempt to prevent overloading (~ipptest/replication on stare04).
  • MEH: ipp004-031 set to neb-host repair so new data does not go to them, in preparation for the move to ITC/decommissioning
    • ipp056, ipp058 also set to neb-host repair, since the RAID appears to be using its emergency spare and we don't want nightly science on them in case of a problem while Haydn is in Manoa for the week
  • MEH: data targeting to remove ipp004-031 locally, nightly pantasks normally restarted to include change (~ipp/psconfig/ipp-20141024.lin64/share/pantasks/modules/ipphosts.mhpcc.config)
  • MEH: Haydn setting up new ippc64-ippc127 nodes at ITC using decommissioned machine IPs, so some warnings to ps-ipp-ops may pop up
  • 21:30 CZW: Restarted shuffle pantasks pointing at permanent shuffle products (starting with camera). I've cut the number of parallel jobs significantly (to 30), to ensure this doesn't interact badly with nightly processing. This will also allow statistics to be gathered on this process.
  • 22:05 MEH: ipp121 crashed with a kernel panic. After a power cycle it rebooted okay and ran for a bit, then showed what seem to be ext3 errors. Taken out of processing and left in neb-host repair (so MOPS can get stamps that may already be on the disks); see ipp121_log
    • o7784g0144o is missing, which keeps MOPS from having a complete chunk that is clear of data possibly on ipp121. It is stuck in chip; restarting stdscience
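
A minimal sketch of the two throttling ideas in the shuffle entries above (randomizing the work list so consecutive jobs are less likely to land on the same host, and capping the number of parallel jobs at 30). This is not how pantasks schedules jobs; `randomized`, `run_shuffle`, and `copy_one` are hypothetical names used only for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Cap taken from the 21:30 entry; everything else here is an assumption.
MAX_PARALLEL_JOBS = 30


def randomized(items, seed=None):
    """Return a shuffled copy so neighbouring jobs are spread across hosts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def run_shuffle(items, copy_one, max_parallel=MAX_PARALLEL_JOBS):
    """Call copy_one(item) for every item, with at most max_parallel in flight.

    Results are returned in the randomized processing order.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(copy_one, randomized(items)))
```

Keeping the cap as a parameter makes it easy to dial the job count up or down when hosts show load, which is the adjustment described in the entries for this week.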

Tuesday : 2017.01.31

  • 18:00 CZW: Updated version of permcheck.pl that should reduce the load on storage nodes by only spawning md5sum processes when the files are large. This seems to have increased the throughput to the ITC significantly, without causing excessive load spikes.
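
A rough sketch of the size-gated checksum idea in the entry above, not the actual permcheck.pl logic (which is Perl and is not shown in this log). It assumes small files are hashed in-process and only large files spawn an external `md5sum`; the 64 MiB threshold and all function names are hypothetical.

```python
import hashlib
import os
import shutil
import subprocess

# Hypothetical cutoff; the real threshold used by permcheck.pl is not recorded here.
LARGE_FILE_BYTES = 64 * 1024 * 1024


def md5_in_process(path, chunk_size=1024 * 1024):
    """Hash the file in this process, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_external(path):
    """Hash the file by spawning md5sum and parsing its first output field."""
    result = subprocess.run(
        ["md5sum", path], check=True, capture_output=True, text=True
    )
    # md5sum prefixes the line with a backslash when it escapes the file name.
    return result.stdout.split()[0].lstrip("\\")


def checksum(path, large_file_bytes=LARGE_FILE_BYTES):
    """Spawn md5sum only for large files (and only if it is installed)."""
    is_large = os.path.getsize(path) >= large_file_bytes
    if is_large and shutil.which("md5sum"):
        return md5_external(path)
    return md5_in_process(path)
```

Both paths return the same hex digest, so the threshold only changes how many processes a storage node has to start.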

Wednesday : 2017.02.01

  • 16:00 EAM: I have finished the rsync of the nebulous and gpc1 MySQL databases from the MTRC-B machines to ipp116 and ipp117. Replication is up and running for both, but ISP, UIP, and SSP still need to be copied and included.
  • 16:05 EAM: Restarting pantasks.

Thursday : 2017.02.02

  • 16:00 CZW: There is an issue with home directory mounting at ITC. These nodes will probably drop out of nebulous until this is resolved. ipptest/replication pantasks is stopped in the meantime, as the jobs will not complete.
  • 16:10 CZW: Restarting IPP pantasks.
  • 17:30 CZW: No clear word on why ITC dropped out, but it seems to be back, so I'm restarting shuffle to attempt to get that finished.

Friday : 2017.02.03

  • 14:20 CZW: ITC shuffle seems to be hitting some hosts hard, so I'm dialing back the number of jobs it has running simultaneously. I suspect this is caused by rsync traffic from the STScI data move.
  • 14:30 CZW: ippx001 is down, but I'm hesitant to power cycle it as that will take the other nodes on the same power supply down as well. It may be best to leave it until it can be manually fixed on site.
  • 16:00 CZW: Restarting IPP pantasks servers.

Saturday : 2017.02.04

Sunday : 2017.02.05

  • MEH: QUB targeted followup needs ippx, ipps nodes as well as nightly nodes in AM -- finished