PS1 IPP Czar Logs for the week 2015.05.24 - 2015.05.31

Monday : 2015.05.25

  • 15:55 MEH: stdsci still having initday problem after first cycle. Restarting stdsci and manually sending things to cleanup after QUB stamps.
  • 16:30 MEH: stsci18 appears down again, power cycle.

Tuesday : 2015.05.26

  • 06:00 Bill: stsci12 is down.
    • There are a large number of stdscience and pstamp jobs that are hung, probably due to NFS problems. The postage stamp server is completely stopped as a result. I restarted that pantasks. stdscience appears to be limping along.
  • 06:30 EAM: just rebooted stsci12. It took two tries: the first time it complained during the BIOS check of the RAID about an invalid SAS address. Perhaps I did not leave it off long enough. It is back up now.
  • 10:26 EAM: several repeatedly failing diffs (suspiciously all on the same skycell). I set them to quality 42:
      difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0685.025 -diff_id 1152360 -fault 0
      difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0685.025 -diff_id 1152374 -fault 0
      difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0685.025 -diff_id 1152402 -fault 0
      difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0685.025 -diff_id 1152419 -fault 0
  • 10:30 EAM: ... and a problem warp:
      warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.1036.028 -warp_id 1577382 -fault 0
  • 15:10 EAM: restarting pv3ffrt & pv3fflt
  • 17:05 EAM: queuing RA = 19 for pv3ffrt, pv3fflt is still working on RA = 21, remote is working on RA = 15.
  • 19:50 EAM: stopping and restarting ~ipp pantasks
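
The four difftool calls in the 10:26 entry differ only in the diff_id, so they can be written as a loop. The following is a minimal sketch, not the procedure actually used: the flags and IDs are copied from the log entry, while the set_diff_quality helper and the RUN switch are hypothetical names introduced here. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: loop over the diff_ids from the 10:26 entry.
# set_diff_quality and RUN are hypothetical names, not part of the IPP tools.
skycell=skycell.0685.025
diff_ids="1152360 1152374 1152402 1152419"

set_diff_quality() {
    # $1 = diff_id; all flags are copied verbatim from the log entry
    cmd="difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id $skycell -diff_id $1 -fault 0"
    if [ "${RUN:-0}" = 1 ]; then
        $cmd            # really run it (needs difftool on PATH)
    else
        echo "$cmd"     # dry run: just show the command
    fi
}

for id in $diff_ids; do
    set_diff_quality "$id"
done
```

Running it with RUN=1 in the environment would execute the commands instead of printing them; the dry run makes it easy to check the IDs first.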

Wednesday : 2015.05.27

  • 07:25 EAM: restarting pv3ffrt & pv3fflt
  • 17:10 CZW: restarting pv3ffrt & pv3fflt

Thursday : 2015.05.28

  • 06:50 EAM: stsci01 was down, rebooted.
  • 08:00 EAM: stopped pv3fflt and pv3ffrt; put the bulge label in pv3ffrt and assigned only himem xnodes (ippx001 - ippx048)
  • 14:30 EAM: rebooted stsci16
  • 15:00 HAF: CZW installed new fix to initday bug, recompiled ipp. HAF restarted stdsci + others
  • 15:45 EAM: rebooted stsci18
  • 18:09 CZW: rebooting stsci10
  • 20:13 HAF:
    regtool -updateprocessedimfile -exp_id 920345 -class_id XY06 -set_state pending_burntool -dbname gpc1
  • 20:20 EAM: rebooting stsci17 (seems like an extra bad 24 hours -- 5 machines in 1 day! Did they sense that I talked to Prem??)

Friday : 2015.05.29

  • 04:40 Bill: processing is pretty stuck. stsci09 went down about 5 hours ago. Set to down with neb-host. I don't know how to reboot the stsci machines.
    • 04:50 restarted the postage stamp server pantasks
  • 06:06 EAM: rebooting stsci09. I got the following error message:
    Single-bit ECC errors were detected during the previous boot
    of the RAID controller. The DIMM on the controller needs replacement.
    Please contact technical support to resolve this issue.
    Press 'X' to continue or else power off the system and replace the DIMM
    module and reboot. If you have replaced the DIMM press 'X' to continue.
  • 10:40 EAM: restarted pv3ff[lr]t. ippx016 was hanging on df for /data/ipp014.0, rebooted it

Saturday : 2015.05.30

  • 17:45 HAF: restarting stdsci + friends (to make sure it's all good for tonight)
  • 19:40 EAM: rebooting stsci09 (hopefully) and restarting pv3ff[lr]t

Sunday : 2015.05.31

  • 05:00 EAM: restarting pv3ff[rl]t. pv3fflt has gotten very low on jobs and pv3ffrt has lots of work still to do. I have moved all but the highest 12 xnodes into pv3ffrt and am restarting (using only 4 jobs per host on pv3ffrt for the low-mem xnodes)
  • 05:10 EAM: Since we got no data last night, I'm hoping to stop and rsync the gpc1 mysql database today so we can repair replication. I would like to start this at 10am, barring any objections. This means stopping all processing.
  • 10:45 EAM: all processing is stopped; I am going to stop mysql and rsync it.
    • important info (binlog file | position): mysqld-bin.000571 | 494690925
    • rsync started @ 11:00:28 with 6 threads (master + 1 for each of the 5 biggest tables)
  • 21:40 EAM: catching up from earlier work: the local rsync of mysql@ippdb05 finished around 17:30. I did a double-check second pass (no problems) and restarted mysql on ippdb05. I also restarted the ipp user pantasks at that time. I later restarted czarpoll and roboplot (~19:30) and just now restarted ippsky pv3ffrt & pv3fflt. I have also set up an rsync of the backup from ippdb05 -> ippdb03:/export/ippdb03.0/mysql.20150531. I removed 2 older copies of mysql and some ancient backups from that partition to make space.
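
The 6-thread rsync layout from the 11:00 entry (a master copy plus one rsync for each of the 5 biggest tables) can be sketched roughly as below. This is an illustration under stated assumptions, not the commands that were run: plan_parallel_rsync is a hypothetical helper, it treats the largest files in the source directory as a stand-in for the biggest tables, and it assumes file names without whitespace. It only prints a plan; nothing is copied.

```shell
#!/bin/sh
# Sketch only: print one rsync per large file plus a master rsync
# that excludes those files. plan_parallel_rsync is a hypothetical helper.
plan_parallel_rsync() {
    src=$1 dst=$2 n=${3:-5}
    excludes=""
    # ls -S lists the largest entries first; assumes no whitespace in names
    for f in $(ls -S "$src" | head -n "$n"); do
        echo "rsync -a $src/$f $dst/ &"
        excludes="$excludes --exclude=$f"
    done
    # master thread: everything except the files handled above
    echo "rsync -a$excludes $src/ $dst/ &"
    # block until all background copies have finished
    echo "wait"
}
```

Calling plan_parallel_rsync with a source directory and a destination prints the plan, which can be reviewed and then piped to sh. As in the log entry, mysql should be stopped first so the files do not change during the copy, and a second pass afterwards confirms nothing was missed.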