PS1 IPP Czar Logs for the week 2011.06.13 - 2011.06.19

Monday : 2011-06-13

  • 10:20 Restarted replication
  • 10:55 : Chris queued up a single MD04 exposure for re-processing, as requested by Jim.
  • 11:10 : CZW: Began the shuffle speed-up test by doubling the number of hosts running jobs.
  • 12:00 : Bill queued chip runs for a new STS reference stack: label STS.refstack.20110613
  • 12:09 : Bill queued the finished STS.2009.b data to be cleaned, and queued all outstanding ps_ud data to be cleaned as well.
  • 13:24 : Bill noticed that warp was stuck (all entries in warpPendingSkyCell had pantaskState DONE); ran warp.reset.
  • 13:30 : Bill rebuilt a broken diff skycell that was causing a magic error: rundiffskycell.pl --redirect-output --diff_id 138137 --skycell_id skycell.2080.074 (see the sketch after this list).
  • 13:31 : Bill fixed a broken streakMap file by setting the offending magicNodeResult to fault = 2; it will be regenerated on revert.
  • 13:35 : Bill decided 25 tries is enough and gave up on the skycell: difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 138027 -skycell_id skycell.1635.016
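
The rundiffskycell.pl rebuilds above can be batched when several diff skycells break at once. The sketch below is an illustration only: the diff_id:skycell_id pairs are placeholders, it assumes rundiffskycell.pl is on the path, and it uses only the flags shown in the log entries above.

    #!/bin/sh
    # Placeholder list of broken diff skycells as diff_id:skycell_id pairs.
    BROKEN="138137:skycell.2080.074"

    for pair in $BROKEN; do
        diff_id=${pair%%:*}
        skycell=${pair#*:}
        # Same invocation as the manual rebuilds recorded above.
        rundiffskycell.pl --redirect-output --diff_id "$diff_id" --skycell_id "$skycell"
    done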

Tuesday : 2011-06-14

  • 06:10 Restarted replication
  • 12:34 : Bill: warp 208424:skycell.2.20:STS.V3 was in the warpPendingSkyCell book, yet there was no corresponding job in the controller list. Deleted the page.

Wednesday : 2011-06-15

  • 04:48: STS refstack exposures are through warp. Queued the stacks; 438 of 448 skycells were queued. The missing ones are on the edges of the projection centers.
  • 08:25: The STS.2009.b data queued so far is on the distribution server. Set the data to be cleaned.
  • 15:45: The value of NEB_SERVER has been changed in ~ipp/.tcshrc and ~ipp/ippconfig/site.config. If necessary, revert to the previous value (ippc03); see the sketch after this list.
  • 17:12: heather: Bill found some faults for me to investigate. They all clustered around the same time (near 4 pm) and looked nebulous-related, so I reverted them and they worked.
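
If the NEB_SERVER change needs to be rolled back, it has to be reverted in both places. The lines below are only indicative: ippc03 is the previous value noted above, but the exact value format (bare hostname vs. full URL) and the site.config key/value layout are assumptions.

    # In ~ipp/.tcshrc (tcsh syntax):
    setenv NEB_SERVER ippc03

    # In ~ipp/ippconfig/site.config (key/value layout assumed):
    NEB_SERVER      ippc03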

Thursday : 2011-06-16

  • 10:45: MPG has all of the STS data on the data store. Set the distRuns to be cleaned.
  • 12:54: a magic node was faulting repeatedly due to a corrupted diffskyfile; fixed with perl ~bills/ipp/tools/rundiffskycell.pl --diff_id 138666 --skycell_id skycell.2259.100 --redirect-output
  • 13:00: Set label STS.nightlyscience to allow the STS.refstack diffs to complete. We want to ensure that the masking fraction doesn't increase going from warp-warp diffs to warp-stack diffs.

Friday : 2011-06-17

Bill is processing czar today

  • 9:07 a magic node was failing due to a corrupt diff file. Fixed with ~bills/ipp/tools/rundiffskycell.pl --redirect-output --diff_id 139432 --skycell_id skycell.2437.113
  • STS diffs versus reference stack caused lots of false detections. Queued diffRuns for sts.nightlyscience using my script. Data from June 11, 13, and 14 is pending.
  • changed publish.revert to run every 20 minutes and turned it on. We are getting some transient failures. Since it is visible, the "infinite size log file problem" that caused us to leave it turned off shouldn't be a big issue.
  • Lots of little faults. I think these are happening because we are targeting wave 1 nodes for data, since so many of the wave 3 nodes are full.
  • Finally, some good news: I found 8416 distRuns from various STS processing labels that Johannes already has but that hadn't yet been cleaned up.
  • 13:58 restarted distribution pantasks
  • 14:04 ippc07 is the pantasks host for cleanup. It was pretty busy running processing jobs. Set host to off in stdscience and distribution.
  • 16:03 doubled the number of hosts working on cleanup.
  • 16:21 as an experiment, set distribution to stop
  • 16:51 set distribution to run (no visible increase in cleanup rate)

Saturday : 2011-06-18

  • 07:51 ipp021 was down this morning. Nothing on the console. Cycled power. Stopped distribution to give the nodes a chance to figure out that it's back up. Needed to do a force.umount on ippdb00 to get it available in nebulous.

Sunday : 2011-06-19

  • 00:40 CZW: There seem to be a large number of faults in processing and registration. These look mostly like I/O issues, but I haven't been able to determine what's wrong. I've been trying to fix the registration faults, but there are likely to still be some tomorrow. The two commands I've had to run (with the appropriate parameters) are:
     regtool -revertprocessedexp -exp_id 350790
     regtool -updateprocessedimfile -exp_id 350756 -class_id XY02 -set_state pending_burntool

    The first one fixes exposures that failed to complete registration; the second is for imfiles stuck in the "check_burntool" state for longer than a minute. A scripted version is sketched at the end of this day's entries.

  • 06:00 ipp021 crashed several hours ago. Bill power cycled it. It crashed yesterday as well.
  • 17:00 Bill moved several log files with a storage object but no instance out of the way to clean up the extant faults.
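
A scripted version of the registration fixes above might look like the sketch below. It is an illustration only: the exp_id and class_id values are placeholders, the lists of faulted exposures and stuck imfiles are assumed to have been identified beforehand, and only the two regtool invocations shown above are taken from the log.

    #!/bin/sh
    # Exposures that failed to complete registration (placeholder IDs).
    for exp_id in 350790; do
        regtool -revertprocessedexp -exp_id "$exp_id"
    done

    # Imfiles stuck in the check_burntool state for more than a minute,
    # given as exp_id:class_id pairs (placeholders).
    for pair in 350756:XY02; do
        exp_id=${pair%%:*}
        class_id=${pair#*:}
        regtool -updateprocessedimfile -exp_id "$exp_id" -class_id "$class_id" -set_state pending_burntool
    done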