2008.06.25

These are notes from my test runs of the single-image science analysis steps for GPC1 images, starting 2008.06.24. Yesterday, I finished generating the master masks for GPC1. I then set up a collection of science images for a test analysis run. I chose 177 images from the 'altstrip' data sets. These are altitude sequences running from ~80 deg alt to ~40 deg alt. The data are taken in griy, but for now I have selected only the i and y data.

I started the processing initially around 5pm. I had a false start: ppImage was ignoring the manually selected masked pixels with a bitmask of 0x02 because I had not updated the MASK.VALUE recipe element. I also initially forgot to accept the y-band flat-field image. After these false starts, I restarted the processing around 10pm. Overnight, the processing continued without significant interference from the night-time summit copy & registration stages.

The average processing time for a GPC1 chip image was ~150 seconds. These jobs are using only a single CPU on each of 15 machines. The code is not compiled with optimization. If I assume a factor of 2 speed-up from -O2 (pretty typical), then, in order to keep up with a rate of 1 image per 60 seconds, we will need to keep 75 CPUs running on this stage of the analysis. (With wave 1, we currently have 72 CPUs available; with waves 1 & 2, we will have 200 CPUs available.) Even to make use of the available processing power, we need to code the multi-threaded version of ppImage / psphot. Given a roughly equivalent load for each of the warp and stack stages, it sounds like we will also need to squeeze 10 - 20% more speed out of ppImage / psphot.
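
The 75-CPU figure can be reproduced with a short calculation. This is only a sketch: the value of 60 chips per GPC1 exposure is an assumption not stated in these notes (it is what makes the quoted number come out, reading "1 image per 60 seconds" as one full exposure), and the function and parameter names are hypothetical.

```python
def cpus_needed(chip_seconds, speedup, chips_per_exposure, cadence_seconds):
    """CPUs required to process exposures as fast as they arrive."""
    # CPU-seconds to process one chip after the assumed compiler speed-up
    per_chip = chip_seconds / speedup
    # CPU-seconds to process every chip of one exposure
    per_exposure = per_chip * chips_per_exposure
    # Number of CPUs that must run continuously to match the cadence
    return per_exposure / cadence_seconds


# 150 s per chip (measured), factor 2 from -O2 (assumed),
# 60 chips per exposure (assumption), 1 exposure per 60 s
needed = cpus_needed(150, 2, 60, 60)
print(needed)  # 75.0
```

Under these assumptions the 72 CPUs of wave 1 fall slightly short of the 75 required for this stage alone.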

There were a number of chips with errors. Here are notes on the types of failure modes:

  • mysql command to dump the uri for the failed chips: mysql -h ipp004 -u ipp -p gpc1 -e "select chip_id, exp_id, class_id, uri from chipProcessedImfile where fault > 0" > chip.errors.txt
  • error modes and number:
      • failure on psf model : 94 / 96
      • failure to open input image : 1 / 96
      • failure to read detrend image : 1 / 96
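
To see whether the failures cluster on particular exposures, the dump file can be tallied by exp_id. This is a minimal sketch, assuming the file is the tab-separated output with a header row that the mysql client writes when its output is redirected; the function name and the sample rows (including the uri values) are made up for illustration.

```python
import csv
from collections import Counter


def failed_chips_per_exposure(lines):
    """Count failed chips per exp_id from tab-separated mysql output."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["exp_id"] for row in reader)


# Made-up sample rows in the same column layout as the query above
sample = [
    "chip_id\texp_id\tclass_id\turi",
    "1\t100\tXY01\tfile:///hypothetical/a.fits",
    "2\t100\tXY02\tfile:///hypothetical/b.fits",
    "3\t101\tXY01\tfile:///hypothetical/c.fits",
]
counts = failed_chips_per_exposure(sample)
print(counts.most_common())
```

For the real dump, an open file handle on chip.errors.txt can be passed in place of the sample list.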

The two I/O failures are (probably) symptoms of the NFS lags in the system. These are a pain, but should be immediately fixable with a retry.
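
The retry idea can be sketched as follows. This is an illustration only, not the actual IPP code: the helper name, the number of attempts, and the delay are all assumptions, and the flaky read is a stand-in for a file that NFS has not yet made visible.

```python
import time


def with_retry(action, attempts=3, delay_seconds=1.0):
    """Call action(); on OSError wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            # Out of attempts: let the caller see the last error
            if attempt == attempts:
                raise
            # Wait longer after each failure to give NFS time to catch up
            time.sleep(delay_seconds * attempt)


# Stand-in for a read that fails twice before the file becomes visible
calls = {"n": 0}


def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("file not visible yet")
    return "image data"


result = with_retry(flaky_read, attempts=3, delay_seconds=0.01)
print(result, calls["n"])
```

Only I/O errors are retried here, so a genuine analysis failure such as the PSF model fault would still be reported on the first attempt.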

The PSF model failures are a well-understood problem in psphot: if there are not enough stars, psphot quits. It turns out that the psphot.config value for the SN limit for objects to be tried as a PSF star was set fairly high (100), with the result that there could easily be few or no sufficiently bright stars. I reduced this to an SN of 25 and reset the fault on those images.
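
The effect of the SN limit on the pool of PSF candidates can be illustrated with a toy selection. This is a sketch under stated assumptions: the S/N values are made up, the function name is hypothetical, and the simple "at or above the limit" cut is an assumption about how the limit is applied, not a description of psphot internals.

```python
def psf_candidates(sn_values, sn_limit):
    """Return the S/N values of objects bright enough to be tried as PSF stars."""
    return [sn for sn in sn_values if sn >= sn_limit]


# Made-up S/N values for the detections on one sparse chip
detections = [12.0, 18.5, 27.0, 33.2, 41.0, 64.8, 96.1, 130.4]

print(len(psf_candidates(detections, 100)))  # 1
print(len(psf_candidates(detections, 25)))   # 6
```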

There was a pantasks server crash, but I don't know the cause.