PS1 IPP Czar Logs for the week 2015.04.06 - 2015.04.13


Monday : 2015.04.06

  • 15:30 MEH: starting the regular restart of nightly pantasks
  • 15:40 MEH: ippc0x apache servers low on disk space, restarting apache and purging old logs
  • 15:50 Haydn reports ipp017 back up, put into neb-host repair
    • clearing old, then stalled, WS diff distribution -- files got cleaned again, so updating again
  • 17:30 MEH: ipp088, 090, 093, 095, 097 all in repair -- putting them to neb-host up to see if they are past getting slammed, since the earlier storage nodes are red
  • 19:50 MEH: slow processing, nightly nodes still on in PV3 diff
  • 20:50 MEH: reg fault backing up processing?
  • 22:55 MEH: killed long-running thread on db00

Tuesday : 2015.04.07

  • 09:25 EAM : today, we are changing the nebulous database mysql master from ippdb00 (nearly full disk) to ippdb08 (new hardware). Here are my notes on the process:
    • 09:25 EAM : stopping the various pantasks (~ipp/*, ~ipptest/pv3diff*, ~ipplanl)
    • 09:50 EAM : everything is shut down, stopping apache on ippc17 and ippc30 (pstamp and datastore)
    • 10:00 EAM : shutdown all apache servers on ippc01 - ippc10
      • everything is quiet on ippdb00 : no queries except the slaves. after executing 'flush tables with read lock' on ippdb00, the master status is:
        mysql> SHOW MASTER STATUS;
        | File              | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
        | mysqld-bin.005160 | 80910016 |              |                  |                   |
        1 row in set (0.00 sec)
    • 10:04 EAM : shutting down the mysql engine on ippdb00, ippdb02, ippdb06, ippdb08
    • 10:55 EAM : set the new master configuration for both ippdb02 and ippdb06. started the slaves and tested (inserted a row into the unused table 'log' and then deleted it). the slaves see the changes on the master. I had some confusion setting up the replication users for ippdb06 and ippdb02 on ippdb08 (I needed to use the following command):
      GRANT REPLICATION SLAVE ON *.* to 'USER'@'' identified by 'PASSWORD';
    • 11:05 EAM : nebulous.ipp now points at ippdb08. I started up the ippc01 - ippc10 apache servers and all seems well.
    • 11:12 EAM : started apache on ippc17 and ippc30. started nebdiskd.
    • 11:36 EAM : pv3diff and pv3diffleft started
    • 11:40 EAM : ~ipp standard pantasks started
      • MEH: ipp nightly processing targeting modified, please do not rebuild the ops tag -- merged now
    • 12:55 EAM : i've launched the gpc2 MOPS and photometry test images and added gpc2 to the stdscience pantasks (also added it to stdscience/input so it will be available)
  • 14:10 MEH: something is trying to heavily write to ipp082.. wonder if this is partial cause -- into repair until behaving better..
    [2015-04-07 13:28:08] 3w-sas: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=0.
  • 14:45 MEH: ipp090 having CMCI storm spamming logs -- putting into neb-host repair until we decide what to do about it
  • 15:15 MEH: ipp082 was almost ok for ~1hr and something is again trying to kill it.... before I could put it into repair it looks to have corrected itself -- aligns with start of gpc2 diffs?
  • 16:30 Haydn rebooting ipp090 to try and fix the CMCI messages -- took a second reboot to get back online. leave in repair for tonight
  • 19:50 MEH: stsci10 appears unresponsive and I cannot log in -- neb-host down
  • 20:20 EAM: I've power cycled it. it seems to be booting ok.
  • 00:25 MEH: looks like stsci10 is back up but still neb-host down.. raising to repair for the night
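
A minimal sketch relating to the 09:25 - 11:05 master switch above: before the old master is shut down, one might confirm that each slave has executed everything up to the recorded master coordinates. The log does not say this check was run. The function name slave_caught_up and the slave readings below are hypothetical and for illustration only; the master file and position are the ones recorded at 10:00, and the field names are those reported by SHOW MASTER STATUS and SHOW SLAVE STATUS.

```python
def slave_caught_up(master_status, slave_status):
    """Return True if the slave has executed up to the master's binlog position.

    master_status: dict with 'File' and 'Position' (from SHOW MASTER STATUS).
    slave_status:  dict with 'Relay_Master_Log_File' and 'Exec_Master_Log_Pos'
                   (from SHOW SLAVE STATUS).
    """
    same_file = slave_status["Relay_Master_Log_File"] == master_status["File"]
    same_pos = int(slave_status["Exec_Master_Log_Pos"]) == int(master_status["Position"])
    return same_file and same_pos


# Master coordinates as recorded in the log after 'flush tables with read lock'.
master = {"File": "mysqld-bin.005160", "Position": 80910016}

# HYPOTHETICAL slave readings, not taken from the log.
slaves = {
    "ippdb02": {"Relay_Master_Log_File": "mysqld-bin.005160",
                "Exec_Master_Log_Pos": 80910016},
    "ippdb06": {"Relay_Master_Log_File": "mysqld-bin.005160",
                "Exec_Master_Log_Pos": 80900000},
}

# Hosts that have not yet reached the master's position.
lagging = sorted(host for host, status in slaves.items()
                 if not slave_caught_up(master, status))
print("slaves still behind:", lagging)
```

With the read lock held on the master, its position no longer moves, so a slave that reports the same file and position has nothing left to replay.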

Wednesday : 2015.04.08

  • 11:00 MEH: ipp090 seems okay overnight, so back to neb-host up
  • 11:30 EAM: stopping and restarting pv3diff and pv3diffleft
  • 14:30 EAM: i was running a parallel dvo merge on stsci19 and it seems to have crashed. i'm not sure if there was a causal connection, but in any event i've power cycled it.

Thursday : 2015.04.09

  • 05:50 EAM : stsci09 crashed last night around 1am. i am unable to get into the console. perhaps someone is rebooting it? in any event, this has wedged nightly processing. i'm trying to kill off things so we can recover. stsci09 set to neb-down.
  • 06:10 EAM : I stopped all ipp user processing, put stsci09 to neb-down, killed the hung stdscience jobs (all for some reason), umounted stsci09.? from those machines. this cleared the hung jobs and the nfs mounts in general. I've restarted ipp-user processing. pv3 diffs seem to have restarted ok on their own, which puzzles me.
  • 06:55 EAM : restarted czarplot stuff after force-umount stsci09 disks (took some doing...)
  • 17:03 CZW: set all stsciXX nodes to repair to see if the recent crashes are a function of having them accept data.
  • 17:56 CZW: restarted pv3diff pantasks, ippx016 was having serious difficulties, but I couldn't see a reason. stsci04 was not mounted, but the ppSub jobs were not completing. I killed those jobs, and it regained sanity.

Friday : 2015.04.10

  • 12:20 EAM : last night we lost ippx001 - ippx044 on ganglia and could not connect via console. haydn investigated this morning and found that one PDU died, the one which supplied power to the switch and the console. the group is back up now.
  • 15:30 EAM : stopping and restarting pv3diff & pv3diffleft.

Saturday : 2015.04.11

  • 05:45 EAM : stopped and restarted pv3diff & pv3diffleft. i took ipp016 out of the processing list for now -- jobs were running extra slowly there for unknown reasons. note that CZW also reported trouble with ippx016 above
  • 14:25 MEH: ipps13 was not down, ganglia just not responding on it
  • 22:10 EAM : stopped all services, stopped and restarted gpc1 mysql on ippdb05 (because diffs were no longer queuing). restarted everything, but pv3diff is off and pv3diffleft is running with all x nodes.

Sunday : 2015.04.12

  • 00:15 CZW: summitcopy had bad book pages for pzPendingAdvance. I couldn't figure out how to clear those, so I've restarted that pantasks. This should fix the missing exposures that are blocking registration.
  • 00:25 CZW: That didn't work. summitcopy refuses to execute the advance commands for some reason. Manually done, and registration seems to be registering.
    pztool -advance -summit_id 896707 -exp_name o7124g0150o -inst gpc1 -telescope ps1 -workdir neb://@HOST@.0/gpc1/20150412 -dbname gpc1 -end_stage reg
    pztool -advance -summit_id 896706 -exp_name o7124g0149o -inst gpc1 -telescope ps1 -workdir neb://@HOST@.0/gpc1/20150412 -dbname gpc1 -end_stage reg
    pztool -advance -summit_id 896704 -exp_name o7124g0147o -inst gpc1 -telescope ps1 -workdir neb://@HOST@.0/gpc1/20150412 -dbname gpc1 -end_stage reg
    pztool -advance -summit_id 896705 -exp_name o7124g0148o -inst gpc1 -telescope ps1 -workdir neb://@HOST@.0/gpc1/20150412 -dbname gpc1 -end_stage reg
  • 19:50 EAM : stopping and restarting stdscience, pv3diff, pv3diffleft
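
A small sketch of how the four manual pztool -advance commands from the 00:25 entry above could be generated rather than typed by hand. The flags and values are copied from those commands; the helper name pztool_advance_cmd and its date parameter are hypothetical. It only builds and prints the command strings and does not run pztool.

```python
import shlex


def pztool_advance_cmd(summit_id, exp_name, date="20150412"):
    """Build one 'pztool -advance' command line, mirroring the log entry."""
    args = [
        "pztool", "-advance",
        "-summit_id", str(summit_id),
        "-exp_name", exp_name,
        "-inst", "gpc1",
        "-telescope", "ps1",
        "-workdir", f"neb://@HOST@.0/gpc1/{date}",
        "-dbname", "gpc1",
        "-end_stage", "reg",
    ]
    # Quote each argument so the result is safe to paste into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


# (summit_id, exp_name) pairs taken from the commands in the log.
stuck = [
    (896707, "o7124g0150o"),
    (896706, "o7124g0149o"),
    (896704, "o7124g0147o"),
    (896705, "o7124g0148o"),
]

commands = [pztool_advance_cmd(sid, name) for sid, name in stuck]
for cmd in commands:
    print(cmd)
```

Printing first and pasting afterwards keeps a human in the loop, which suits a one-off recovery like this one.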