PS1 IPP Czar Logs for the week 2013.09.16 - 2013.09.22

Monday : 2013.09.16

  • 10:30 Bill restarted the pstamp and update pantasks.
  • 16:44 Bill is going to restart stdscience and stack pantasks.
    • 16:57 Since I could, I also restarted summitcopy, registration, distribution, and publishing.

Tuesday : 2013.09.17

Bill is czar today

  • No data last night due to humidity. LAP stacks are moving right along.
  • 11:23 Stopping processing to rebuild ippTools and the postage stamp server in order to incorporate new features.
  • 11:57 Rebuild is complete. Things are now up again.
  • 16:00 or so: label SAS.20130917 added to stdscience. This is to process sky calibration runs for the new SAS 22. Will add SAS.20130917.update for the new SAS 24 once the others finish.
  • 16:15 Since I fixed the corrupted warp problem (we now actually detect and rerun jobs with corrupted files), all of the stacks that fail with fault == 2 are due to NFS write errors. Turning stack.revert back on.
  • 16:20 MEH: with staticsky running in special pantasks, manually removing/reallocating 1x compute3 from stack to avoid overloading -- will then be used for refstack testing until further notice

Wednesday : 2013.09.18

Bill is czar today

  • 09:24 restarted stdscience and pstamp pantasks
  • Note: the increase in free space since yesterday happened because I queued the M31 chips and warp for cleanup.

Thursday : 2013.09.19

  • 11:20 Bill: Heather noticed that ganglia showed a sine wave pattern with a period of around 3 minutes. After restarting the stdscience pantasks it went away.
  • 12:35 Bill: restarted pstamp.

Friday : 2013.09.20

  • 10:00 Bill: set up SAS.20130917 staticsky bundles to be distributed.
  • 10:30 The sawtooth load behaviour is back. It started at about the same time as the distribution did.
  • 11:00 Distribution is done. Perhaps it's time to restart stdscience.

Saturday : 2013.09.21

  • 08:05 Bill: postage stamp server is backed up and needs a restart. Set to stop and restarted.
  • 08:21 Bill: For the third day in a row the cluster load is oscillating. The plot of network bandwidth shows this best. It stops when stdscience is stopped.
    • MEH -- How is this different from the same behavior seen whenever stdsci gets near 100k Njobs? If you watch the poll loads, are they full and regularly reloading?
  • 08:28 restarting stdscience
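
To compare the oscillation seen on different days (the roughly 3 minute sine wave on Thursday versus the sawtooth on Friday and Saturday), one option is to estimate the period directly from exported load or bandwidth samples. The following is only a minimal sketch, not something that was run for this log: it assumes evenly spaced samples, and the function name `dominant_period`, the 15 second sample step, and the synthetic data are all hypothetical. It uses the first autocorrelation peak after the first zero crossing as the period estimate.

```python
import math


def dominant_period(samples, step_seconds):
    """Estimate the dominant period (in seconds) of an evenly sampled series.

    Hypothetical helper: uses the autocorrelation of the mean-subtracted
    series and returns the lag of its highest peak after the first
    negative value. Returns None if no oscillation is found.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    var = sum(c * c for c in centered)
    if var == 0:
        return None  # flat series, nothing to measure

    def acf(lag):
        # Biased autocorrelation estimate, normalized so acf(0) == 1.
        return sum(centered[i] * centered[i + lag] for i in range(n - lag)) / var

    max_lag = n // 2
    values = [acf(lag) for lag in range(max_lag)]

    # Skip the initial positive lobe around lag 0.
    start = next((lag for lag, v in enumerate(values) if v < 0), None)
    if start is None:
        return None  # autocorrelation never goes negative: no clear cycle

    best_lag = max(range(start, max_lag), key=lambda lag: values[lag])
    if values[best_lag] <= 0:
        return None
    return best_lag * step_seconds


# Synthetic example (assumed values): a 180 s oscillation sampled every 15 s
# for one hour, standing in for an exported cluster load series.
example = [10 + 5 * math.sin(2 * math.pi * (i * 15) / 180) for i in range(240)]
print(dominant_period(example, 15))
```

Running the same estimate on samples taken while stdscience is stopped would be expected to return None or an unrelated value if the oscillation really does go away, which would make the comparison with the 100k Njobs behaviour easier to state in numbers.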

Sunday : 2013.09.22

  • 08:05 MEH: nightly processing finished, so time for the regular stdsci restart
  • 17:05 Bill: and pstamp....