PS1 IPP Czar Logs for the week 2015.08.24 - 2015.08.30

Monday : 2015.08.24

  • 00:05 MEH: pstamp could use a restart
  • 08:50 EAM: ipp034 crashed, no messages on console, rebooting

Tuesday : 2015.08.25

  • 21:45 EAM: I have removed all of the non-nebulous, non-MANIFEST directories from all stsci machines; I have set all stsci machines to DOWN in nebulous. They are ready to be shipped.

Wednesday : 2015.08.26

  • 14:10 Bill: restarted pstamp server pantasks. It had zillions of errors because ippc30 got its power pulled inadvertently.
  • 14:19 bill: removed MPE label from pstamp because Johannes' requests are not impossible to serve.
  • 17:17 CZW: restarting all ipp pantasks servers.

Thursday : 2015.08.27

Friday : 2015.08.28

  • 14:00 CZW: Remembered Mark's email from this morning. The diff faults due to missing files are likely due to things not being fully restored after the shuffle. I've set a quality argument on the affected skycells so the rest can complete. Due to a typo, these have quality 542.
  • 15:50 CZW: Rebooting ipp025, which has become unhappy with the same XFS issues as we've seen before.

Saturday : 2015.08.29

  • 07:45 EAM: ippb06 crashed, rebooted.

Sunday : 2015.08.30

  • 05:20 Bill: warptool -updateskyfile -set_quality 42 -fault 0 -warp_id 1617727 -skycell_id skycell.0699.027
  • 06:19 Bill: turned diff reverts off so that we can get a handle on the permanent WS diff failures due to incomplete stack data.
    • Once they are done faulting I will change the label for the faulting WS diffs, re-enable reverts, and figure out what to do about the broken ones.
    • Shall we drop the runs completely, set the skycells to quality errors, or something else?
  • 08:33 Bill: Moved the incomplete diffs from 20150829 out of the way
    • Set quality faults for all fault 2s in last night's data, but more are still running and faulting.
      difftool -updaterun -set_label OSS.WS.20150829 -label oss.ws.nightlyscience -data_group oss.20150829 -state new
      
      In mysql
      update diffRun join diffSkyfile using(diff_id) set fault = 0, quality=66 where data_group = 'oss.20150830' and fault = 2;
      
    • Turning reverts back on; will check in in an hour or so.
  • 13:15 Bill: turned diff revert off again
    • 13:20 Cleared the remaining faults. If somebody wants to go rescue (or abandon by cleaning) the 20150829 data, the label can be changed. Also, the skycells with incomplete stack data can be identified by the quality = 66 value.
      difftool -updaterun -set_label ThreePi.WS.20150829 -label oss.ws.nightlyscience -data_group ThreePi.20150829 -state new
      
      In mysql
      update diffRun join diffSkyfile using(diff_id) set fault = 0, quality=66 where data_group = 'oss.20150830' and fault = 2;
      update diffRun join diffSkyfile using(diff_id) set fault = 0, quality=66 where data_group = 'ThreePi.20150830' and fault = 2;
      
    • diff.revert.on: Some of the ps_ud_QUB updates are running into a similar problem but, looking at the pstamp fault counts, not too many.
  • 13:30 Restarted pstamp pantasks. Stdscience could use a restart as well.
  • 14:45 Bill: restarted stdscience
  • 19:02 HAF: restarted all the pantasks
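
Follow-up to the 13:20 entry: a minimal sketch of how the quality = 66 marker could be used to list the skycells with incomplete stack data later on. It runs against a toy in-memory SQLite database, not the real gpc1 MySQL database. The table and column names diffRun, diffSkyfile, diff_id, fault, quality and data_group are taken from the updates above; which table holds which column, the skycell_id column on diffSkyfile, the sample rows, and the helper names flag_incomplete, flagged_skycells and demo_db are all assumptions made for illustration. SQLite has no multi-table UPDATE ... JOIN, so the update is written with a subquery instead.

```python
import sqlite3


def demo_db():
    """Build a toy database. Schema and rows are hypothetical, not the real IPP schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE diffRun (diff_id INTEGER PRIMARY KEY, data_group TEXT);
        CREATE TABLE diffSkyfile (diff_id INTEGER, skycell_id TEXT,
                                  fault INTEGER, quality INTEGER);
        INSERT INTO diffRun VALUES (1, 'oss.20150830'), (2, 'ThreePi.20150830');
        INSERT INTO diffSkyfile VALUES
            (1, 'skycell.a', 2, 0),
            (1, 'skycell.b', 0, 0),
            (2, 'skycell.c', 2, 0);
    """)
    return conn


def flag_incomplete(conn, data_group, quality=66):
    """Clear fault 2 and set the marker quality for one data_group.

    Subquery form of the logged 'update diffRun join diffSkyfile' statement.
    Returns the number of rows changed.
    """
    cur = conn.execute(
        "UPDATE diffSkyfile SET fault = 0, quality = ? "
        "WHERE fault = 2 AND diff_id IN "
        "(SELECT diff_id FROM diffRun WHERE data_group = ?)",
        (quality, data_group),
    )
    conn.commit()
    return cur.rowcount


def flagged_skycells(conn, data_group, quality=66):
    """List (diff_id, skycell_id) pairs carrying the marker quality."""
    rows = conn.execute(
        "SELECT diff_id, skycell_id "
        "FROM diffRun JOIN diffSkyfile USING (diff_id) "
        "WHERE data_group = ? AND quality = ? "
        "ORDER BY diff_id, skycell_id",
        (data_group, quality),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = demo_db()
    print(flag_incomplete(conn, "oss.20150830"))
    print(flagged_skycells(conn, "oss.20150830"))
```

Against the real database the same SELECT could presumably be run directly in mysql, with the data_group and quality values substituted, once the column placement has been checked.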