PS1 IPP Czar Logs for the week 2017.06.26 - 2017.07.02

Monday : 2017.06.26

  • 16:40 EAM : I notice that ssh to machines outside the ITC cluster from machines without external IP addresses is extremely slow (extra 30-45 seconds to connect). I suspect this is a config problem on the NAT and have pinged Curt.
  • 16:45 EAM : swapping nebdiskd from running as ippitc@ippdb01 to running as ipp@ipp117 : the ipp user can get to the ippbXX machines, and ipp117 has an external IP address.
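
One way to narrow down the slow ssh connections noted at 16:40 is to time the bare TCP connect to port 22 separately from the full ssh login. The following is a minimal sketch, not something that was run as part of this log; the function name time_tcp_connect is hypothetical. If the TCP connect itself is fast, the extra 30-45 seconds is more likely spent after the connection is established (for example in name lookups) than in the NAT path.

```python
import socket
import time


def time_tcp_connect(host, port=22, timeout=60.0):
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return None
```

Calling this from a machine without an external IP address and again from one with an external address (such as ipp117), against the same outside host, would give a direct comparison.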

Tuesday : 2017.06.27

  • 12:00 EAM : started M31 reprocessing using BG_PRESERVE_SHALLOW reduction class (S/N limit of 15, preserves the background).
  • 16:00 CZW: Rebooting stare00, stare02-04 to see if that clears up the ongoing processing/NFS issues. stare01 is ingesting a database, and has been skipped for this pass.

Wednesday : 2017.06.28

Thursday : 2017.06.29

  • MEH: ipp091 missing from ippmonitor -- doesn't appear disk usage is actually updating -- Gene had kill -STOP earlier
  • MEH: notice lack of system emails -- roboczar, nagios, crontabs -- sendmail from ipp113 not working -- /etc/ssmtp.conf on ipp113 is set up for ippc18 -- Gavin will need to make changes for the full system
  • MEH: ipp106 not in ganglia -- try restarting
    /etc/init.d/gmond restart
     * ERROR:  gmond is already stopping.
    • ippc18 still mounted and also hanging df -- force umount and gmond restarts okay now and ipp106 finally reporting again after >1week..
  • MEH: check data nodes again with space for remote power management --
    • had wrong default password for i10-con -- can now connect as well for all consoles there
    • ipp076 doesn't prompt for password to log into console, but can (w/o password) access the remote power management -- Haydn recommends caution as may not be wired right
  • MEH: cleaning up older K2 data products
  • 16:00 CZW: Starting HSC processing using x3 nodes (two instances). These appear completely idle, so I do not expect conflicts. I may expand this use.
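
The ipp106 entry above (a stale ippc18 mount hanging df, which in turn blocked the gmond restart) suggests probing mounts with a timeout rather than running df directly. The following is a minimal sketch under that assumption; check_mount is a hypothetical helper, not an existing IPP tool. It runs the probe in a child process so that a stuck mount cannot block the caller.

```python
import subprocess
import sys


def check_mount(path, timeout=5.0):
    """Return 'ok', 'error', or 'hung' for a statvfs() probe of path."""
    # The probe runs in a separate interpreter so a stuck NFS mount
    # blocks only the child, never the calling process.
    probe = "import os, sys; os.statvfs(sys.argv[1])"
    child = subprocess.Popen(
        [sys.executable, "-c", probe, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        status = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: a child in uninterruptible sleep may not die,
        # so it is not waited on again.
        child.kill()
        return "hung"
    return "ok" if status == 0 else "error"
```

Popen is used instead of subprocess.run because run waits for the child after killing it on timeout, which could itself block if the child is stuck in the kernel on the dead mount.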

Friday : 2017.06.30

  • 14:45 CZW: Starting some initial transfers from retired machines to ippb14/ippb15. I've limited the bandwidth to try and not saturate the 1Gb link. Initial test is transferring ipp013, ipp018, ipp019, ipp020.
  • 18:10 MEH: restarting nightly pantasks for tonight back on sky (except stdscience, will restart that ~1930 just before observing and will remove m31.ps1.20170627 label)
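
For the bandwidth-limited transfers started at 14:45, the arithmetic behind picking a cap can be sketched as below. The log does not say which transfer tool or limit was used; this sketch assumes a tool whose limit is given in KiB/s (as rsync's --bwlimit is), and the 50% share of the link and the helper names are illustrative assumptions only.

```python
def bwlimit_kib_per_s(link_gbit=1.0, fraction=0.5, streams=1):
    """KiB/s cap per stream so that all streams together use `fraction` of the link."""
    bytes_per_s = link_gbit * 1e9 / 8 * fraction / streams
    return int(bytes_per_s // 1024)


def transfer_hours(size_tb, kib_per_s):
    """Hours needed to move size_tb (decimal terabytes) at a steady kib_per_s."""
    return size_tb * 1e12 / (kib_per_s * 1024) / 3600


# Half of a 1 Gb/s link as a single stream, then split over four source hosts.
print(bwlimit_kib_per_s())           # 61035
print(bwlimit_kib_per_s(streams=4))  # 15258
```

With four source machines transferring at once (as in the initial test of ipp013, ipp018, ipp019, ipp020), the per-stream figure is the relevant one if each transfer runs as its own process.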

Saturday : 2017.07.01

  • MEH: back on sky last night but data later due to weather -- use to test WSdiff and repair->up data nodes with space -- ipp068,071-076,078-081,083,054-057
  • MEH: ipp091 still not on ippmonitor -- /var/log/messages show odd automount messages -- restart autofs, first restart resulted in problem ([!!] and not automounting /data/ipp091.0 on ipp091), did a second time and after some delay then ok -- back on ippmonitor plot now
  • MEH: ippmonitor disk use still not updating other than state -- run nebdiskd --restart and started a second process -- killed both and restarted nebdiskd from scratch ipp@ipp117
    • ipp119.1 appears red now (as it should...) -- Heather/PSPS using these nodes for batches.. must have space for that so set to repair...
  • 19:35 MEH: restarting nightly pantasks before observations start to clear changes made for m31 processing

Sunday : 2017.07.02