PS1 IPP Czar Logs for the week 2015.10.12 - 2015.10.19

Extra/Non-standard Processing

Continuing the HAF suggestion to improve communication: the czar pages carry a list of additional (non-standard) processing, so that we all know what's going on.

Daily Czaring:

  • currently a modified ops tag is running diffs (WS labels only) as ippqub (was ippmops) under ~ippqub/src/stdscience_ws on ippc06 (was ippc29) -- if there are problems (Njobs>100k, power loss on ippc06, etc.), it will need to be restarted like a normal nightly processing pantasks
    ./ stdscience_ws
    • even with the modified ops tag, various files are still MIA in nebulous, so the daily czar needs to check and clear them, as has been discussed before

MD processing:

  • ippmd/stdscience is running WS diffs w/o writing images -- using ippx065-x096 (hosts_xmd) -- stop as necessary, but always communicate when doing so


Monday : 2015.10.12

  • 05:00 EAM: ipp023 crashed, rebooting it.
  • 11:00 MEH: normal restart of nightly pantasks
  • 18:00 MEH: plan is not to switch back to ippc18 as home disk unless ippc19 has problems -- so add new ipp user Archive_c19.1 symlink for pantasks logs and start the rsync+bzip to clear up space
  • 20:20 MEH: clearing stalled warp -- cannot build growth curve (psf model is invalid everywhere)
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.0947.044 -warp_id 1628049 -fault 0
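
The 18:00 rsync+bzip step above could look roughly like the sketch below. This is not the script actually used: the function name archive_logs and both directory arguments are hypothetical, and it assumes the logs are moved into the archive area first and compressed there afterwards.

```shell
# Hypothetical helper (not the actual IPP script): move log files into an
# archive directory, then compress them there to free space on the home disk.
archive_logs() {
    src="$1"    # assumed: directory holding the pantasks logs
    dest="$2"   # assumed: archive directory (e.g. where an Archive symlink points)
    mkdir -p "$dest" || return 1
    # Copy with attributes preserved; each source file is removed only after
    # it has been transferred successfully.
    rsync -a --remove-source-files "$src"/ "$dest"/ || return 1
    # Compress everything in the archive that is not already a .bz2 file.
    find "$dest" -type f ! -name '*.bz2' -exec bzip2 -f {} +
}
```

Called as archive_logs <log dir> <archive dir>; rsync leaves the emptied source directories in place, so a running pantasks can keep writing new logs there.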

Tuesday : 2015.10.13

  • 16:00 CZW: I'm restarting ipp/replication to run commands to shuffle raw data to ippb06.X.

Wednesday : 2015.10.14

  • 21:30 CZW: I just saw on the calendar that I'm czar today. I'm setting ipp091 to repair, as it seems to have a very high load, and that seems to be causing registration/summit copy issues. I've also turned down the replication pantasks to do 10 jobs at a time instead of 40 to ensure that the nebulous database doesn't get unhappy.

Thursday : 2015.10.15

  • 18:05 CZW: Doing a restart on the ipp pantasks servers.

Friday : 2015.10.16

  • 04:45 MEH: ipp017,039 have been up okay for a day, setting them back to neb-host repair for access to data on the disks

Saturday : 2015.10.17

Sunday : 2015.10.18