PS1 IPP Czar Logs for the weekend 2010.12.03 - 2010.12.05


Friday : 2010.12.03

Bill is acting czar today. We are running STS.20101202 (chip through warp) and MD03.refstack.20101202 stacks.

Apparently something has changed in ppStack and it is now a memory hog. Certain skycells fail with a memory allocation error and in some cases crash the node running the process.

Three hosts died with this problem a bit after 3 pm: ipp015, c27, and c29. Cindy and Gavin rebooted them; Gavin then needed to clean up some NFS problems.

As a workaround we're going to stop processing these stacks in stdscience and use a separate pantasks restricted to hosts that have 32GB of physical memory. This will run out of ~ipp/stackonly with the pantasks_server on ipp042.
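For picking candidate nodes, something like the following works as a minimal sketch (assuming passwordless ssh to the nodes and a standard Linux /proc/meminfo; the host names in the loop are just illustrative, not the actual stackonly host list):

    # rough check of physical memory on candidate stack nodes (MemTotal is in kB)
    for h in ipp040 ipp041 ipp042 ipp043; do
        echo -n "$h: "
        ssh "$h" "awk '/MemTotal:/ {printf \"%.1f GB\n\", \$2/1048576}' /proc/meminfo"
    done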

Around 16:00 I disabled the two labels (using labeltool) and am allowing the stacks that are already running to finish. We were getting lots of faults from chip and warp because the nodes running stacks are intermittently unresponsive as NFS servers.

16:53 HST started up "stackonly" and restarted stdscience. I have not yet added the STS label to stdscience. Will do that after we see how the stacking progresses.

18:14 added STS.20101202 to stdscience and set it to run. Lowered poll.limit to 64. Later turned warp off.

Turned off the hosts running stacks in stdscience: ipp040 - ipp052 (less ipp042 and ipp044).

22:56 Since we're getting new data at the summit tonight, I turned off stacking to keep things from potentially becoming unstable. Turned warp back on.

23:15 Lots of faults from ippc27 (wasn't that machine down for the past week?). Set it off in pantasks. Also turned hosts ipp040 - ipp052 back on.

23:20 Lots of faults from ipp048. The problem was that NFS access to ipp006 was broken; sudo /usr/local/sbin/force.umount ipp006 seems to have fixed it.

23:13 Past bedtime. No faults on the czartool dashboard. set.poll 180 (the usual value). Good night.

Saturday : 2010.12.04

06:00 Many faults. Reverted the chip faults but they came right back. Examining pantasks.stdout.log shows that all of the faults are for jobs dispatched to ippc07. That node was getting I/O errors trying to talk to ipp027; force.umount fixed the problem and all faults have been reverted. According to ippc07:/var/log/messages, ipp027 stopped responding at 03:50. Ganglia data for ipp027 doesn't show any unusual activity.
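A quick way to spot this kind of wedged NFS mount before the faults pile up is to stat each data mount with a timeout. A minimal sketch (assuming the cross-mounts show up under /data/ipp* on each node and that GNU coreutils timeout is available):

    # probe each NFS data mount; a wedged mount hangs the stat and gets flagged
    for m in /data/ipp0*; do
        timeout 5 stat "$m" > /dev/null 2>&1 || echo "possibly wedged: $m"
    done

Any mount flagged this way can then be cleared with force.umount as above.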

Summit copy and registration are proceeding. Last exposure so far is o5534g0566o. We're about 33 exposures behind with downloads; this is likely due to the 132 stare exposures taken.

06:18 Noticed that publishing wasn't running. Started it up.

06:35 All object exposures from last night have been copied and registered. 20 STS chipRuns still processing.

07:11 Noticed that cleanup wasn't running. Fired it up.

09:26 OSS chip processing complete. Turning chip.off to allow warps to finish faster.

09:45 Turned off warp so the OSS diffs run faster. Interestingly, after each of the past two steps the cluster load level jumped higher.

10:10 26 of 36 diffRuns are done.

10:20 only 5 left. 29 of 36 have already been published. Turning chip and warp back on.

19:51 started up stack processing in ~ipp/stackonly

Sunday : 2010.12.05

05:46 107 stack skycells have completed in the stackonly pantasks. No apparent errors. Turned stack off so as to not interfere with this morning's stdscience processing.

summit copy has finished downloading today's 420 science exposures.

09:06 9 stacks that were in the queue are still running. Standard science processing is proceeding: 167 OSS exposures, apparently in 4 batches with comments containing J1, J2, I1, and I2. The exposure counts are 43 x J1, 40 x J2, 40 x I1, 42 x I2. J1 and J2 look like sets with the same boresight, so I guess J2 is really visits 3 and 4.

10:25 OSS chips finish; Bill turns off chip.

10:30 Serge turns off warp and stack

11:30 50 OSS diffs finish. Bill turns processing back on, then has second thoughts: maybe more diffs will get queued.

12:15 Bill turns things back on. No more OSS diffs have appeared.

14:17 Bill forgot to turn camera on. Gene turned it on.

16:18 stackonly pantasks turned on.