PS1 IPP Czar Logs for the week YYYY.MM.DD - YYYY.MM.DD

Extra/Non-standard Processing

Continuing HAF's suggestion to improve communication: the czar pages carry a list of additional (non-standard) processing so that we all know what's going on.

Daily Czaring:

  • currently a modified ops tag is running diffs (WS labels only) as ippqub (was ippmops) under ~ippqub/src/stdscience_ws on ippc06 (was ippc29) -- if there are problems (Njobs>100k, power loss on ippc06, etc.), it will need to be restarted like a normal nightly processing pantasks:
    ./start_server.sh stdscience_ws
    
    • even with the modified ops tag, various files are still missing (MIA) in nebulous, so the daily czar needs to check and clear them, as has been discussed before

MD processing: -- stalled until remaining missing files are restored --

  • ippmd/stdscience running WS diffs w/o writing images -- using ippx065-x096 (hosts_xmd) -- stop as necessary, but always communicate when doing so

Monday : 2015.10.19

  • 09:30 MEH: switching the WSdiff for grizy to use the LAP.3PI label now that PV2 stacks are cleaned up -- effectively, 10/9 was the last night of grizy WSdiff using PV2 stacks
    • all WS.nightlyscience other than QUB.WS should be on hold (label out of distribution) until QUB is ready
  • 10:15 EAM: stopping and restarting pantasks.

Tuesday : 2015.10.20

  • 07:57 MEH: adding WS label back to distribution for QUB now, preparing to update past week of warps for missed WS diffs
  • 11:15 MEH: sending ~297 chip+warp to update for missing WSdiff -- label WS.hold -- and setting all necessary warp+diff to same label to hold from cleanup until stamps downloaded
  • 15:50 MEH: missing WS diffs from past week now running to finish before tonight

Wednesday : 2015.10.21

  • 02:53 MEH: setting several fault 5 WS diffim to qual 42 due to stack FWHM being NaN (on PV3...); will probably need to set qual 42 on the PV3 stacks to limit this in future, after looking at the stacks
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1650.018 -diff_id 1261830  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1561.041 -diff_id 1261831  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1736.010 -diff_id 1261833  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1562.016 -diff_id 1261837  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1563.069 -diff_id 1261839  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1905.064 -diff_id 1261957  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1581.015 -diff_id 1261984  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1569.062 -diff_id 1262009  -fault 0
    
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1410.001 -diff_id 1262049  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1410.001 -diff_id 1262050  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1498.076 -diff_id 1262052  -fault 0
    
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1410.078 -diff_id 1262257  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1319.076 -diff_id 1262374  -fault 0
    
  • 03:08 MEH: reported down, looks like rebooting on its own -- looks like same group again (008,012,013,014,016,018,037) according to ganglia boottime
  • 03:27 MEH: it has been over 20 min and reg is still stuck -- setting state manually:
    regtool -updateprocessedimfile -exp_id 978605 -class_id XY55 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 978605 -class_id XY37 -set_state pending_burntool -dbname gpc1
    
  • 06:22 MEH: a 10/19 diff was cleaned before distribution; it has to be updated to clear this, with label WS.hold until QUB gets the stamps
    difftool -dbname gpc1  -setskyfiletoupdate -set_label WS.hold -diff_id 1261216
    
    • there appears to be an incompatibility between the ~ipp and ~ippqub ops tags for doing diff updates -- possibly related to -updatemode
  • 10:12 MEH: distribution finished, doing regular restart of nightly pantasks
  • 10:55 MEH: cleaning up old nightly logs to free up space in ~ipp and home disk, ~30GB
  • 18:40 MEH: pausing bzip of old logs on ippc19 to not impact nightly processing
  • 19:10 MEH: noticed a massive wave of faults from summitcopy, registration, chip/cam/warp...
    • ippc0x machine overloaded; ipp097 overloaded, so put in repair for now -- many md5sum jobs are running there on data that is not tonight's
  • 20:00 MEH: sporadic faults are still plaguing nightly processing -- backlog of 24 exposures cleared
    • setting replication to stop to see if it helps
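
Sketch for the 02:53 entry above: the quality updates all share one command shape, so they could be generated from a list of "skycell_id diff_id" pairs instead of being typed one by one. The helper name emit_quality_updates is hypothetical. It only prints the commands (a dry run) using the same difftool options that appear in the log; nothing else about difftool is assumed.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one difftool command per "skycell_id diff_id" pair
# read from stdin. Nothing is executed here; review the printed commands
# before running them. emit_quality_updates is a hypothetical helper name.
emit_quality_updates() {
    local skycell diff_id
    while read -r skycell diff_id; do
        # Skip blank lines in the input list.
        if [ -z "$skycell" ]; then
            continue
        fi
        # Skip comment lines starting with '#'.
        case "$skycell" in
            \#*) continue ;;
        esac
        printf 'difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id %s -diff_id %s -fault 0\n' \
            "$skycell" "$diff_id"
    done
}

# Example input: the first two pairs from the 02:53 entry.
emit_quality_updates <<'EOF'
skycell.1650.018 1261830
skycell.1561.041 1261831
EOF
```

Keeping this as a dry run means the generated list can be pasted into the log exactly as run, which matches how the commands are recorded above.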

Thursday : 2015.10.22

  • 08:30 MEH: restarting the bzip2 of old nightly pantasks logs and replication now that nightly processing has mostly finished
  • 15:30 MEH: disabling WSdiffs and running them manually tonight -- doing regular restart of pantasks
  • 22:00 MEH: will be working with QUB nightly data manually tonight --

Friday : 2015.10.23

  • 03:23 MEH: using the upper ippxNNN nodes (hosts_xmd) in the stack pantasks for QUB nightly stack since MD is currently stalled
  • 09:27 MEH: returning stdsci back to normal state with WSdiff/distribution and restarting

Saturday : 2015.10.24

  • 07:00 MEH: ps_ud_QUB appears to be having faults due to missing cmf as well -- moving the label to ippqub:stdscience_ws (this will also turn on chip+warp, which will process that label only)
    • adding the ippx065-x096 (host group xmd) to ippqub:stdscience_ws to work on the large update/pss request from QUB, since MD processing is stalled
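
Sketch for the host range above: when the xmd nodes need to be listed individually (for example, to check each one by hand), the ippx065-x096 range can be expanded into explicit names. This assumes the range means ippx065 through ippx096 with three-digit zero padding; expand_xmd_hosts is a hypothetical helper name, not part of the ops tag.

```shell
#!/usr/bin/env bash
# Print one hostname per line for ippx065 through ippx096.
# Assumption: the range is inclusive and names use three-digit padding.
# expand_xmd_hosts is a hypothetical helper name.
expand_xmd_hosts() {
    local n
    for n in $(seq 65 96); do
        printf 'ippx%03d\n' "$n"
    done
}

expand_xmd_hosts
```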

Sunday : 2015.10.25

  • 08:23 MEH: ipp030 disk not responding; set neb-host down so QUB processing can get done
  • 11:21 MEH: looks like it rebooted; putting neb-host in repair (it was up before becoming unresponsive)