IPP Progress Report for the week 2011.07.11 - 2011.07.15

Eugene Magnier

This has been a week of fighting fires and trying to understand problems.

The MHPCC->ATRC copy is finally moving along smoothly. After further iteration with Chris on the neb-admin search SQL, the system now reliably picks up all of the chips across the cluster and correctly identifies the ones that need to be replicated. We also tuned the parameters for this pantasks so that the pipe stays full. Since about Wednesday of last week, we have sustained ~120 MB/sec nearly continuously. At this rate, we will fill the ATRC machines in 3-4 weeks (just in time for another box to arrive).
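As a sanity check, the quoted rate and fill time are consistent. A minimal sketch of the arithmetic (the ~290 TB capacity figure is an assumption for illustration; only the 120 MB/sec rate and the 3-4 week window come from the report):

```python
# Sanity check on the shuffle numbers above: a sustained 120 MB/sec against
# the ATRC capacity. The rate is from the report; the capacity is assumed.

MB_PER_SEC = 120
SECONDS_PER_DAY = 86400

tb_per_day = MB_PER_SEC * SECONDS_PER_DAY / 1e6  # ~10.4 TB per day

def weeks_to_fill(capacity_tb, rate_tb_per_day=tb_per_day):
    """Weeks needed to fill capacity_tb at the sustained shuffle rate."""
    return capacity_tb / (rate_tb_per_day * 7)

assumed_capacity_tb = 290  # hypothetical; not stated in the report
# weeks_to_fill(assumed_capacity_tb) comes out near 4, consistent with "3-4 weeks"
```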

A related issue: we had been seeing an unusual number of machine crashes over the past 2-3 weeks. This turned out to be caused by an error in the pantasks config for the shuffle to ATRC: the code was not correctly telling pantasks which machines would be used for the process. Because the source chips are unevenly distributed, individual source machines at MHPCC could be heavily overloaded (up to 50 simultaneous copies). Fixing this configuration error has stopped the crashes that had been plaguing us for the past few weeks.
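The fix amounts to making the scheduler aware of which source hosts are in play, so that no single machine is saturated with copies. A minimal sketch of the idea (all names and the cap value are hypothetical; the real fix lives in the pantasks configuration, not in code like this):

```python
from collections import defaultdict

# Assumed per-host cap for illustration; the report notes that ~50
# simultaneous copies from one host was enough to crash it.
MAX_COPIES_PER_HOST = 8

def assign_copies(jobs, max_per_host=MAX_COPIES_PER_HOST):
    """Greedily pick copy jobs to run now, never exceeding the per-host cap.

    jobs: list of (source_host, chip_path) tuples.
    Returns (runnable, deferred); deferred jobs wait for a later pass.
    """
    active = defaultdict(int)
    runnable, deferred = [], []
    for host, path in jobs:
        if active[host] < max_per_host:
            active[host] += 1
            runnable.append((host, path))
        else:
            deferred.append((host, path))
    return runnable, deferred
```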

We have also been having trouble keeping up with the data each night. Until May we had been finishing nightly processing soon after the end of the night; since May, on nights when we actually got data, we were taking most of the day to finish. The problem had been masked by the weather (lack of data) and the extra LAP processing load (a convenient scapegoat). Last week we had lots of data and the problem was clearly in evidence. We limited or stopped LAP processing, but were still clearly running slower than before. Heather looked into the timings and found that chip processing had gotten slower since May, mostly in the supporting perl script. I discovered that the perl script, in trying to extract a recipe value from the camera configuration, was taking an unreasonably long time to parse the metadata files. For one night I hardcoded the specific value, and the processing rate went up significantly. I will adapt the system to do the parsing in the ppConfigDump C code.
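The interim fix (hardcoding the recipe value rather than re-parsing the metadata for every exposure) is essentially a per-night cache. A minimal sketch of that idea (function and key names are hypothetical; the real fix is to move the parsing into the ppConfigDump C code):

```python
# Cache of parsed camera-config recipe values, keyed by night.
_recipe_cache = {}

def get_recipe_value(night, key, parse_metadata):
    """Return a recipe value, parsing the metadata at most once per night
    instead of once per exposure.

    parse_metadata(night) stands in for the expensive metadata parse and
    should return a dict of recipe values for that night.
    """
    if night not in _recipe_cache:
        _recipe_cache[night] = parse_metadata(night)  # expensive, done once
    return _recipe_cache[night][key]
```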

The final big concern on the regular processing is storage. Looking at the historical logs, it seems that we are filling the disks more quickly than expected. With no LAP processing running, storage should grow mostly from incoming raw data, at a rate of about 35-40 TB per month (assuming the processing buffer is roughly full). If LAP processing is running, we also add the amount of static sky generated. Going back over 6 months, however, storage has grown at close to 2x the raw-data rate even when LAP processing was not running. This is becoming critical, as we can only shuffle data to ATRC at a finite rate. We are still investigating this problem.
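One way to frame the investigation is simple bookkeeping: compare observed disk growth against what raw data alone should contribute. A minimal sketch (only the 35-40 TB/month raw rate is from the report; the observed-growth number in the example is a placeholder):

```python
RAW_TB_PER_MONTH = 37.5  # midpoint of the 35-40 TB/month raw-data rate

def excess_growth_factor(observed_tb, months, raw_rate=RAW_TB_PER_MONTH):
    """Ratio of observed storage growth to expected raw-data-only growth.

    A value near 1.0 means raw data accounts for the growth; the report
    observes a factor close to 2 even with LAP processing off.
    """
    return observed_tb / (raw_rate * months)
```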

I also have been working on the slow parts of the extended source analysis, but have only been able to make modest progress given these other issues.

Serge Chastel

  • Nebulous MySQL dump now works fine again
  • Changed some MySQL dump configuration (isp and ippadmin dumps were still being done from ippdb01)
  • Set up an SVN repository for the operational configuration, plus associated controls
  • IPP Czar on Thursday and Friday
  • IPP Test Framework meeting with Jim, Gene, Mark, and Roy
  • Early IPP Test Framework developments

Heather Flewelling

  • czar
    • found many stuck nfs mounts
    • neb-rm -m'd many log files (and others) that did not exist
    • sorted out detrends (one had both copies on ipp012, some had copies on the ippb0 machines, but with 0 bytes)
    • sorted out many stuck faults
  • dvodb
    • survey task for staticsky written and tested
    • SAS dvodb set up and built (staticsky and cam)
    • added more staticskys to the LAP dvodb (stalled on this because we are stalled on LAP processing)
  • handed sidik the current static masks
  • started investigating a better way to find stuck nfs mounts

Roy Henderson

  • ippToPsps loading:
    • mangled skycell IDs (see last week's report) broke the load. Fudged numbers, then reloaded stacks
    • loaded a few more detections and stacks from DVO
    • merge completed and I checked that I could query the new data. Everything looked OK
    • some help for Kent Wood prior to old 3PI deletion. I think he and his team succeeded in extracting what they need.
  • ippToPsps development:
    • more work on addRun SQL query. Was getting un-merged DVO items. Now, with Heather's help, it gets only merged/full items
    • main program now polls indefinitely, looking for new P2 or ST items to queue up. This has been running all week with no problems
  • Czar Tuesday and Wednesday:
    • lots of errors, some failed attempts to revert things manually
    • ipp026 went crazy and managed to halt the whole pipeline. Twice.
  • Czartool:
    • cleaned-up webpage with lots of changes to tables and colors for clarity
    • page now has two 'modes', standard and update, so we don't need two big tables cluttering up the page
    • improved query to get magic mask fraction with help from Chris. Now much faster.
    • changes to code for time-series plots, including...
    • a new rate plot suggested by Gene that uses a running-mean to show exposures-per-hour more clearly than before
  • started work on populating a gpc1 database instance for the test loop stuff
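The running-mean rate plot mentioned in the Czartool items can be sketched roughly as follows (timestamps and the window size are illustrative; the real czartool code is not shown here):

```python
def exposures_per_hour(times_hr, window_hr=1.0):
    """Running-mean exposure rate: for each exposure, count the exposures
    in the trailing window and convert the count to an hourly rate.

    times_hr: sorted exposure timestamps, in hours since start of night.
    """
    rates = []
    lo = 0  # index of the oldest exposure still inside the window
    for i, t in enumerate(times_hr):
        while times_hr[lo] < t - window_hr:
            lo += 1
        rates.append((i - lo + 1) / window_hr)
    return rates
```

Smoothing over a trailing window like this shows the exposures-per-hour trend without the point-to-point scatter of instantaneous rates.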

Mark Huber

  • Implemented the MD10.V3 tessellation. Stumbled across a bug in pswarp that Gene was able to fix. Set up and now in use for MD10 observations; y-band exposures run through the warp stage for the MD10.V3 reference stack.
  • simtest->PSPS development: generated a basic sample for loading into PSPS.
  • Continued desktop setup/configurations.

Bill Sweeney

Chris Waters

  • Dark: Worked on ways to select which order of model is supported by the data, in an attempt to fit quadratic models in temperature only when that would not introduce more noise into the dark. This still seems to introduce more noise than a simple linear-in-temperature model (which has introduced-noise characteristics similar to the old linear dark model). Plotting pixel flux values across a wide range of exposure times and temperatures shows no obvious quadratic term, even in pixels where the code selected a quadratic term. This is also true in the corner-glow areas, which appear linear in temperature as well. Currently constructing a full exposure model with the quadratic term excluded, to see whether it is required at all for any OTA.
  • Nebulous: Confirmed that in some cases the shuffle code fails to move an instance of the raw data to the long-term storage nodes at the ATRC. This does not result in data loss: the failed instance (which has zero size) is detected and the valid second copy is not deleted. Updated neb-stat to identify these files by calculating an md5sum for each available instance. Corrected the shuffle SQL to properly select all files that should be moved.
  • LAP: Preliminary work to allow quickstacks to be skipped when only a small fraction of the input exposures are unpaired. This should speed up LAP processing at the cost of a small number of input exposures (usually ~3 of 50-60 total inputs).
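The instance check in the Nebulous item above, flagging a bogus zero-size replica while keeping the valid copy, can be sketched as follows (file names and the function are hypothetical; neb-stat's actual implementation is not shown here):

```python
import hashlib
import os
import tempfile

def check_instances(paths):
    """Classify the replicated instances of one file.

    Returns (valid, suspect): instances that are readable and non-empty,
    each with its md5sum, versus zero-size or unreadable instances that
    should be re-copied rather than trusted.
    """
    valid, suspect = [], []
    for p in paths:
        try:
            size = os.path.getsize(p)
        except OSError:
            suspect.append(p)
            continue
        if size == 0:
            suspect.append(p)  # a failed shuffle leaves a zero-size copy
            continue
        with open(p, "rb") as fh:
            digest = hashlib.md5(fh.read()).hexdigest()
        valid.append((p, digest))
    return valid, suspect

# Tiny demonstration: one valid copy, one zero-size copy from a failed shuffle.
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "o1234567.fits")        # hypothetical chip file name
empty = os.path.join(tmp, "o1234567.fits.copy")  # the failed instance
with open(good, "wb") as fh:
    fh.write(b"fake chip data")
open(empty, "wb").close()
valid, suspect = check_instances([good, empty])
```

Comparing md5sums of non-empty instances would also catch a corrupted (rather than merely empty) replica, at the cost of reading every copy.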