Version 11 (modified by mhuber@…, 2 months ago)

--

PS1 IPP Czar Logs for the week 2017-09-04 - 2017-09-10

Monday : YYYY.MM.DD

Tuesday : 2017.09.05

  • 18:25 CZW: Started rsync of poorly replicated data from retired nodes to the B nodes. Starting slowly, with ipp006-ipp012 transferring to ippb07-ippb10; I'll likely ramp this up to cover more hosts tomorrow.
  • 21:25 EAM: ipp121 crashed at some point earlier in the night. I was unable to get in via console, so I power cycled it and it came back up.

Wednesday : YYYY.MM.DD

Thursday : 2017.09.07

  • MEH: reset all the tmp/nebulous_server.log files on the critical nebulous apache nodes ippc70-c75, rather than leaving them at various levels of free space
    • turned on log rotation for these nodes as well to prevent those logs from eating up space -- root crontab, mid-week (Wednesday) at 10:01 --
      1 10 * * 3 /etc/cron.daily/logrotate.cron 2>&1
      
    • turned on a disk space check like the one done for homedir -- also a crontab entry on ippc19, checking every 4 hours
      0  */4  *  *  *  /bin/bash /home/panstarrs/ipp/local/bin/apachedisk_chk.sh > /dev/null
      
  • MEH: doing more misc manually triggered cleanup
    • including retry of another buildup of error_cleaned chip/warp/diff...
    • including apparently missed nights 20170903,20170904 as well
    • set regularly red data nodes to neb-host repair -- ipp076, ipp102.0
    • set non-red data nodes to neb-host up again -- ipp115.1, ipp104.0
  • MEH: ippdb09 is now the primary nebulous DB but had no status plots in ganglia -- changed 10.10.20.16 (ippc18..) --> 10.10.20.17 in /etc/ganglia/gmond.conf and restarted
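
For reference, the kind of disk space check scheduled from crontab above can be sketched as follows. This is a minimal illustration only and not the actual apachedisk_chk.sh: the function names, the THRESHOLD variable and its default of 90% are assumptions, and the real script may check different paths or report differently.

```shell
#!/bin/bash
# Hypothetical sketch of a disk space check; NOT the actual apachedisk_chk.sh.
# Warns for each given path whose filesystem usage is at or above a threshold.

THRESHOLD="${THRESHOLD:-90}"   # percent used; assumed default

# usage_pct PATH -> prints the integer percent used of the filesystem holding PATH
usage_pct() {
    # df -P gives one POSIX-format line per filesystem; field 5 is e.g. "45%"
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# check_path PATH -> returns 0 if below threshold, 1 (with a warning) if at or
# above it, 2 if the usage could not be determined
check_path() {
    local pct
    pct=$(usage_pct "$1")
    [ -n "$pct" ] || return 2
    if [ "$pct" -ge "$THRESHOLD" ]; then
        echo "WARNING: $1 is at ${pct}% (threshold ${THRESHOLD}%)"
        return 1
    fi
    return 0
}

# Check every path given on the command line (does nothing without arguments).
for p in "$@"; do
    check_path "$p"
done
```

Run from cron with stdout sent to /dev/null as in the entry above, a script like this would need to send its warning by some other channel (mail, a status file); that part is left out here because the log does not say how the real check reports.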

Friday : 2017.09.08

  • MEH: Haydn repaired ippb02 -- setting it to neb-host repair again so its data can be accessed
  • MEH: Haydn/Gavin working with ippb00 to start the changeover to the 10G switch -- may be offline for a bit
    • success so doing rest -- ippb01-03,06-15 neb-host down
    • DNS may take up to 48 hours to propagate
    • @1600 can ssh to all w/ ippbXX from ipp117
    • fixed the ~ipp/.ssh/known_hosts conflict
  • MEH: ipp086,088,089 have a stuck gmond and appear to be down but are not -- some defunct gmond jobs appear not to be clearing, unclear why
  • MEH: nebdiskd under ipp@ipp117 is reporting that it cannot connect to ippdb09 -- manually running nebdiskd -p XXXX with the proper password seemed to fix the problem...
  • MEH: UH has required all machines open to the world to have openssh upgraded to v7.5 -- this impacts ippb04,ippb05 access from ITC and forces password entry -- Gavin is looking into it, but it should be ok to set them to neb-host repair so data can be read from them
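
For the stuck gmond case on ipp086,088,089 noted above, one way to spot defunct processes that are not clearing is to filter ps output on the zombie state. The sketch below is an assumption about how one might check, not what was actually run; filter_defunct is a hypothetical helper name, and the column order assumes Linux procps ps.

```shell
#!/bin/bash
# Hypothetical helper: reads "PID PPID STAT COMMAND" lines on stdin and prints
# "PID PPID COMMAND" for processes in zombie (defunct) state.
# An optional first argument restricts the output to one command name.
filter_defunct() {
    awk -v name="$1" '$3 ~ /^Z/ && (name == "" || $4 == name) { print $1, $2, $4 }'
}

# Usage on a node (assumes procps ps; "=" suppresses the header line):
#   ps -eo pid=,ppid=,stat=,comm= | filter_defunct gmond
```

The PPID column is printed because a defunct process only goes away once its parent reaps it, so the parent is usually the process to look at or restart.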

Saturday : YYYY.MM.DD

Sunday : YYYY.MM.DD