PS1 IPP Czar Logs for the week YYYY.MM.DD - YYYY.MM.DD

Continuing the HAF suggestion to improve communication: the czar pages carry a list of additional (non-standard) processing, so that we all know what's going on.

Daily Czaring:

  • currently a modified ops tag is running diffs (WS labels only) as ippqub (was ippmops) under ~ippqub/src/stdscience_ws on ippc06 -- if there are problems (Njobs>100k, power loss on ippc06, etc.) or any other issues like those with the ~ipp pantasks, it will need to be restarted like a normal nightly processing pantasks:
    ./start_server.sh stdscience_ws
    
    • even with the modified ops tag, various files are still MIA in nebulous, so the daily czar needs to check and clear them, as has been discussed before
  • ps_ud_QUB has also been moved to the ippqub:stdscience_ws pantasks to support updates possibly broken by missing cmf files; chip and warp updates will also be done in that pantasks


Monday : 2015.11.16

  • 01:03 MEH: ipp017 crashed again, nagios email confirmed sent after Gavin turned back on -- power cycle and back up
  • 03:20 EAM: ipp017 back down. i've powered it off for now. i'll bring it back up when I need to get to the dvo data there tomorrow
  • 07:05 MEH: ipp017 down again, we should leave neb-host down so nightly processing can proceed -- also need to clear all kinds of stalled jobs..
    • likely stalled mounts as well..
    • nightly running again -- still clearing mounts, but since neb-host down and stalled jobs cleared this can continue
  • 07:21 MEH: ippc30 also seems to be having a problem with homedir? -- may need a power cycle -- ippc19 timing out? -- it was ippc19 timing out for some reason, so this will likely happen again..
  • 10:50 MEH: nightly fully finished, fault 5 WSdiff from PV3 stack having NAN fwhm again
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1609.048 -diff_id 1277021  -fault 0
    
  • 11:18 MEH: nightly finished through distribution, doing regular restart of nightly pantasks

Tuesday : 2015.11.17

  • 17:17 CZW: Possibly not essential, but I'm doing a daily restart of pantasks. Cleanup has been busy doing lossycomp, and seems to be bogged down a bit.

Wednesday : 2015.11.18

  • 17:06 CZW: Earlier today, in response to Serge's missing diff, I ran the following commands. The 14006 quality is the decimal translation of the hex code (36b6) listed in the log file. The diff skyfile had been set to fault 5 instead of having the quality set correctly.
     difftool -dbname gpc1 -updatediffskyfile -diff_id 1277599 -skycell_id skycell.0880.020 -fault 0 -set_quality 14006
     difftool -dbname gpc1 -updaterun -set_state full -diff_id 1277599
     pubtool -dbname gpc1 -definerun -label ThreePi.nightlyscience -client_id 5
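
For future czars, the hex-to-decimal translation mentioned in the 17:06 entry can be done in the shell. This is only a minimal sketch: hex_to_dec is a hypothetical helper name, not part of the IPP tools, and it assumes the log file reports the quality code as bare hex digits without a 0x prefix (as with 36b6 above).

```shell
# hex_to_dec is a hypothetical helper, not an IPP command.
# It converts a bare hex quality code from a log file into the
# decimal value that is passed to -set_quality.
hex_to_dec() {
    # printf reads a 0x-prefixed argument as a hexadecimal integer
    printf '%d\n' "0x$1"
}

hex_to_dec 36b6   # prints 14006
```

The result should be checked against the log file before it is used in a difftool -set_quality call.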
    

Thursday : YYYY.MM.DD

Friday : YYYY.MM.DD

Saturday : YYYY.MM.DD

Sunday : YYYY.MM.DD