Throughput Demonstration 2009.11.04

This page contains notes related to the throughput demonstration performed 2009.11.04. The IPP code used for this demo is tagged as tags/ipp-magic-v1-0@26069; the corresponding magic code is in ~ipp/magic_tarballs/magic.20091023.tgz and ssa-core-cpp.20091023.tgz.

throughput summary

Last night I proposed that we might be I/O bound since the processing rate did not seem to increase as much as I would have expected. Here is evidence that I was wrong. Attached is a plot of the number of completed 'camera' stage analyses per half hour during the test yesterday. You can see that we start at about the 13/bin rate for the first period of the test, during which I had jobs running on 137 nodes*. You can also see the period around 6pm last night when I crashed pantasks and waited a bit before restarting the system. Then the rate goes up to about 20/bin for the last part of the test, during which I had jobs running on 169 nodes. So, adding 23% more CPU increased the rate by 54%. Hmm. I think we are seeing two effects: the added CPUs and the reduced number of nebulous collisions from the change in the skycell assignments.
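
For reference, here is the arithmetic behind those percentages (a quick check using only the numbers quoted above):

    # Quick check of the scaling numbers quoted above (all values from this note).
    rate_before, rate_after = 13.0, 20.0    # completed camera analyses per half-hour bin
    nodes_before, nodes_after = 137, 169    # processing nodes in each period

    cpu_increase = (nodes_after - nodes_before) / nodes_before    # ~0.23 -> "23% more CPU"
    rate_increase = (rate_after - rate_before) / rate_before      # ~0.54 -> "54% faster"
    print(f"CPU increase:  {cpu_increase:.0%}")
    print(f"rate increase: {rate_increase:.0%}")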

All in all, we are not in bad shape. The green line indicates a good comfort level, where we would be doing 600 exposures in 12 hours; the yellow line is 600 exposures in 18 hours, which is slower than we would like; the red line is bad: 600 exposures in 24 hours. So it looks like we are comfortably in the yellow-to-green range if we use the same collection of machines as last night. We also have some room for increased processing: there are several machines not currently in the list, and we could probably add another set of jobs to the wave 1 machines, which have not been crashing. There were also some I/O contention issues during this experiment that we have partly addressed, and more that can be improved (i.e., the nebulous randomization).

* roughly: ipp005 and ipp039 came in and out of the list at different times.
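
For context, the comfort levels above translate into per-bin rates as follows, assuming half-hour bins as in the attached plot:

    # The comfort levels as per-bin rates, assuming half-hour bins as in the plot.
    exposures = 600
    for label, hours in [("green", 12), ("yellow", 18), ("red", 24)]:
        bins = hours * 2
        print(f"{label:6s}: {exposures / bins:5.1f} exposures per half-hour bin")

That puts the final ~20/bin rate between the yellow (16.7/bin) and green (25/bin) lines, consistent with the conclusion above.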

streak count

The plot below is a histogram of the number of streaks per exposure from this experiment. The median number of streaks is 8, which is a significant improvement over the rate we have seen in the past.

start @ 11:30am

Bill, Heather & I have been running small-scale tests of the full processing -> distribution to ensure all of the steps are working. We've ironed out a few bugs and have a solid demonstration of the full sequence including automatic diffs, magic, destreak, and distribution.

I've just (at 11:35) launched a full-scale throughput demonstration using 611 exposures. These are 3pi exposures from July covering the RA,DEC range 270.0 - 300.0, 0.0 - 20.0. They are g, r, and i exposures, supposedly all taken as TTI pairs. The fact that 611 is odd (612 is the closest multiple of 6) means that at least one exposure is missing.

I've also modified pantasks.pro (in ~ipp, not yet checked in) to fix the problem where we get paths of the form /data/any.0/foo/bar (something of a hack: I attempt to replace @HOST@.0 and @HOST@.1 before doing just @HOST@). This should help with NFS load during camera.
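
A minimal sketch of the substitution-ordering idea (purely illustrative: the function name and the volume map are invented here, and the real logic lives in pantasks.pro):

    # Substitute the more specific tokens (@HOST@.0, @HOST@.1) before the bare
    # @HOST@ token, so a volume-qualified template is never left half-expanded.
    def expand_path(template, host, volume_map):
        # volume_map: e.g. {"@HOST@.0": "ipp012.0", "@HOST@.1": "ipp012.1"} (assumed)
        for token in ("@HOST@.0", "@HOST@.1"):      # specific tokens first
            if token in volume_map:
                template = template.replace(token, volume_map[token])
        return template.replace("@HOST@", host)     # bare token last

    print(expand_path("/data/@HOST@.0/gpc1/foo/bar", "ipp012",
                      {"@HOST@.0": "ipp012.0", "@HOST@.1": "ipp012.1"}))
    # -> /data/ipp012.0/gpc1/foo/bar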

update 1

My fix to the camera paths had no effect; the paths are now of the form:

neb://any/gpc1/throughput.20091104/o4999g0217o.78668/o4999g0217o.78668.cm.29757.smf

but they are still all going to ipp039, which is getting quite a load. I'm going to remove it as a processing node so it can deal with the camera I/O.

update 2

we are having some trouble in this test because ipp039 is getting really slammed.

The situation is that the 'any' neb host problem affects more than just camera: some of the outputs from every diff and warp run are also going to 'any', which means ipp039. I'm checking into the cause, but it shows we need to fix this nebulous bug.

update 3

I found a bug in the skycell-by-host pantasks function that allowed the range to be larger than the count, which in turn set a number of the target hosts and nebulous volumes to 'any'. I've fixed this, so hopefully the load on ipp039 will fall again...
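
A hypothetical sketch of the failure mode and the fix (the host list and function names here are invented; the real code is a pantasks function):

    # Sketch of the skycell-by-host bug: the index range was allowed to exceed the
    # number of hosts, so some skycells mapped past the end of the host list and
    # their target host / nebulous volume degraded to 'any' (i.e., ipp039).
    hosts = ["ipp012", "ipp013", "ipp016", "ipp022"]   # invented host list
    count = len(hosts)

    def assign_host(skycell_index, range_size):
        slot = skycell_index % range_size
        # buggy: when range_size > count, slots >= count have no real host to map to
        return hosts[slot] if slot < count else "any"

    def assign_host_fixed(skycell_index, range_size):
        # fix: never let the effective range exceed the host count
        return hosts[skycell_index % min(range_size, count)]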

report @ 3pm

Looks like the fix I put in for this is working: we have had no skycells with a path_base including 'any' since the fix, and the load on ipp039 has been mostly sensible since then as well. There are probably still a few warps sitting on ipp039 that are being used for diffs, so its load will stay somewhat elevated until those are done.

the processing speed is looking mostly reasonable, except for the diff_skycells, which were heavily hit by the ipp039 collisions. I'll be looking for those times to drop somewhat as well.

So far, we are 3 hours into the test, and we have only processed 1/6 of the chips (and somewhat fewer of the other stages). Since the goal is 12 hours, we are running a bit behind. I'm hopeful that the ipp039 collisions account for most of the gap...

report @ 6pm

It has been 6 hours now, and the chip and diff processing times are not much better than before. We've only done about 30% of the data set, so it is still going slowly. At this rate, it will be 20 hours before we finish this batch, which is kind of slow. I'm going to change my plan and try to boost the number of hosts.
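
The 20-hour figure is just a linear projection from the fraction completed so far:

    # Linear projection: ~30% of the data set done in the first 6 hours.
    fraction_done, hours_elapsed = 0.30, 6
    print(f"projected total: {hours_elapsed / fraction_done:.0f} hours")   # ~20 hours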

At the moment, our limiting steps have used the following number of seconds:

chip: 1,021,709.35 sec
warp: 610,303.59 sec
diff: 540,536.4 sec
destreak: 259,726 sec
dist process: 177,430.2 sec
camera: 69,067.08 sec
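
For a rough sense of scale, here is a back-of-the-envelope conversion of those totals, assuming they are cumulative job seconds; the 137-host figure is approximate and the stages overlap, so the last column is only indicative:

    # Convert the per-stage totals above into hours, and into approximate wall-clock
    # time if each total were spread evenly over ~137 hosts (a rough assumption).
    stage_seconds = {
        "chip":         1_021_709.35,
        "warp":           610_303.59,
        "diff":           540_536.4,
        "destreak":       259_726,
        "dist process":   177_430.2,
        "camera":          69_067.08,
    }
    hosts = 137
    for stage, sec in stage_seconds.items():
        print(f"{stage:12s} {sec / 3600:7.1f} hours total, "
              f"~{sec / 3600 / hosts:5.2f} h spread over {hosts} hosts")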

update 1

oops.

I just attempted to load another set of wave 2 nodes and got a segfault from pantasks...

I'm going to wait 5 min for jobs to complete, and then I'll restart pantasks...

update 2

OK, I've restarted the stdscience pantasks, and I've bumped up the number of hosts to:

1x1 3x2 4x3 4xC

continuing to exclude ipp005 and ipp039 from processing.

report @ 9pm

We've managed to get through 110 chips in the last 3 hours, compared with about 100 in the first 3 hours and 72 in the second 3 hours, so the rate is a bit faster now due to the larger number of machines. But comparing the previous setup (1x1, 2x2, 3x3, 3xC) with the new one (1x1, 3x2, 4x3, 4xC), we have 15 + 15 + 17 = 32 more CPUs out of a previous total of 137 CPUs, or about 23% more CPUs. I don't think the processing is going 23% faster than in the first 3 hours (before the ipp039 bottleneck became a problem), but it is a little hard to judge. If not, then it is evidence that we still have some I/O bottlenecks. I'll look more closely at the times reported by the warp, diff, magic, and rc stages.
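
A quick expectation check using the numbers above (if we were purely CPU-bound, the third 3-hour window should scale with the CPU count):

    # With 32 more CPUs on top of the original 137 (~23% more), a purely CPU-bound
    # system should get through ~23% more chips per 3-hour window than the first,
    # pre-bottleneck window did.
    chips_per_window = [100, 72, 110]     # first / second / third 3-hour windows
    cpus_before, cpus_added = 137, 32

    expected = chips_per_window[0] * (1 + cpus_added / cpus_before)
    print(f"expected if CPU-bound: ~{expected:.0f} chips; observed: {chips_per_window[2]}")

The observed 110 falls short of the ~123 that CPU-bound scaling would predict, which is consistent with some residual I/O bottleneck.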

report @ 9 am

processing effectively stopped around 6am this morning, though the full analysis was not complete:

  • there are a number of chipRuns that have not finished: 37 exposures are still pending. This is basically because of the next point:
  • ipp014 is unhappy: it is not so unhappy that it has been kicked out of nebulous, but files on it are giving I/O errors. The failing chips are caused by this, coupled with a lingering nebulous error: the rules are forcing nebulous to return the ipp014 instance rather than randomizing among the instances (a sketch of the randomization idea follows this list). I'm going to reboot ipp014 and kick it out of nebulous for now.
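
Here is an entirely hypothetical sketch of that randomization idea; nebulous's actual instance-selection rules and data formats are not shown here:

    # When a file has replicas (instances) on several hosts, pick among them at
    # random, skipping known-sick hosts, instead of always returning the same
    # instance (e.g. the one on ipp014).
    import random

    def pick_instance(instances, bad_hosts=()):
        # instances: e.g. ["ipp014:/data/ipp014.0/...", "ipp021:/data/ipp021.0/..."]
        usable = [i for i in instances if i.split(":", 1)[0] not in bad_hosts]
        return random.choice(usable or list(instances))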

I'm attaching a snapshot of the ganglia load for the past 24 hours. You can see where the test started a bit before noon yesterday with the initial set of machines, and then the point where I crashed pantasks and restarted with an additional set of wave 2, 3, and C machines. You can also see the problem machines caused by the nebulous distribution (ipp039 and a couple of the ipp04X machines), and the ones that have been giving occasional NFS issues (e.g., ipp016). I'm not certain whether the latter issues are caused by glitches in those raids, by something equivalent to the causes of the ipp005 crashes, or by poor data distribution. This is something we need to investigate.

update

I've gone through most errors and found the following:

  • all of the chip failures were caused by the issue with ipp014 -> these would be addressed by fixing the nebulous randomization.
  • all but one of the warp failures were caused by the chmod problem -> these can be addressed with one of the options I listed
  • the remaining warp failure was a more traditional NFS I/O problem, and would be naturally cleaned up with the revert pass
  • the diff failures I checked seemed to be the traditional NFS I/O problems, but we do not seem to have a diff revert task -- that is surprising!

There are also some destreak failures which I did not investigate. Also, destreak advance was getting timeouts, which says that either the query needs to be adjusted or the timeout period in pantasks does. Finally, dist_component.pl had some failures, which are probably NFS I/O issues as well.

update at 11:35

This experiment is winding down slowly. There are a few stragglers that have required a little extra intervention to succeed (things like removing files with bad permissions -- see the earlier NFS error), and there are a few entries that are stuck on a bug (3 in warp, 3 in diff so far). I'm turning off the warp revert task since the only outstanding jobs are ones that require a bug fix.

Of the 611 initial exposures, we have gotten all 611 through chip, camera, and fake. 608 made it through warp. Those 608 warps yielded only 294 diffs (so roughly 20 warps apparently lack a corresponding TTI match), of which 291 succeeded. I expect to see 582 magicked exposures, but so far we have only identified 578 exposures for magic -- there may be some errors that need to be cleared in magic (no magic revert?).
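
The bookkeeping behind those tallies, assuming each diff consumes two warps (a TTI pair) and magic sees both exposures of every successful diff:

    # Final tallies from this run.
    exposures = 611
    warps = 608
    diffs_made, diffs_ok = 294, 291
    magic_found = 578

    unpaired_warps = warps - 2 * diffs_made    # warps with no TTI partner
    expected_magic = 2 * diffs_ok              # exposures expected in magic
    print(f"warps without a TTI match: {unpaired_warps}")          # 20
    print(f"expected magicked exposures: {expected_magic}, identified so far: {magic_found}")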
