IPP Progress Report for the week 2011.03.14 - 2011.03.18

Eugene Magnier

I worked on 3 issues related to the footprint / peak culling process. The peak culling process attempts to remove insignificant peaks within a footprint by requiring a significant downward deviation in going from any candidate peak to the brighter significant peaks in the same footprint. First, I finalized the modification of the culling process to use the smoothed image rather than the unsmoothed image. This reduces the rate of false positives coming from background deviations. Next, I modified the culling process for saturated stars so that it requires a more substantial downward deviation than for fainter sources. This helps to group detections in the saturated cores, which otherwise remain as many distinct detections. Finally, I made the culling process faster for large footprints with many peaks. The original algorithm drew a unique threshold footprint for each peak of interest, leading to very long processing times in images with excess background. Chris discovered that these images, especially those coming from a few specific chips where the background variations are worst, take up to half of our processing time (for less than 10% of our data). My modification uses slightly coarser threshold bins and draws only a single footprint for each threshold bin. This speeds up processing by factors of 30 - 100. The sample of slow images, each of which took 10k - 30k seconds in the original analysis, now takes an average of about 200 seconds per chip.
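
The sketch below is only a schematic Python illustration of the coarse-binning idea; the production code is part of the IPP itself, and the function name, the dip criterion, and the use of scipy.ndimage.label to draw threshold footprints are my own simplifications rather than the actual implementation.

```python
# Schematic sketch of coarse-binned peak culling; not the IPP implementation.
# Assumes `img` is the smoothed image (2-D numpy array) and `peaks` is a list
# of (row, col, height) tuples with integer pixel coordinates.
import numpy as np
from scipy import ndimage

def cull_peaks(img, peaks, dip_frac=0.5, nbins=16):
    """Keep a peak only if it is separated from every brighter kept peak by a
    sufficient downward deviation.  Thresholds are quantized into `nbins`
    coarse levels, and the footprints for each level are drawn exactly once."""
    peaks = sorted(peaks, key=lambda p: -p[2])            # brightest first
    heights = np.array([p[2] for p in peaks])
    levels = np.linspace(heights.min(), heights.max(), nbins)
    label_cache = {}                                      # one labeling per threshold bin

    keep = [peaks[0]]                                     # brightest peak is always kept
    for row, col, h in peaks[1:]:
        # threshold the candidate must dip below before it counts as distinct
        bin_idx = min(np.searchsorted(levels, h * (1.0 - dip_frac)), nbins - 1)
        if bin_idx not in label_cache:
            label_cache[bin_idx] = ndimage.label(img > levels[bin_idx])[0]
        labels = label_cache[bin_idx]
        # cull if the candidate shares a threshold footprint with a brighter kept peak
        blended = any(labels[row, col] != 0 and labels[row, col] == labels[r, c]
                      for r, c, hk in keep if hk > h)
        if not blended:
            keep.append((row, col, h))
    return keep
```

In this picture, requiring a more substantial downward deviation for saturated stars corresponds to using a larger dip fraction for those sources, and the speedup comes from the label cache: each coarse threshold bin is labeled at most once, no matter how many peaks fall into it.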

We found 2 other throughput issues this week, and fixing them will help us a lot. The first was in the area of nebulous, where we identified 2 different issues with easy fixes. First, we have long had a problem where the apache server can get into a state in which it generates segmentation faults. We discovered that these are caused by an internal apache bug that appears when there are too many apache threads. The segfaults can be avoided if we limit the number of threads to < 128, and they were probably introducing a small amount of additional slowness. We also realized that the apache server used for the nebulous interface has a tendency to take a large amount of memory on its machine, especially if the nebulous mysql starts to get a little slow. Both of these issues acted as a runaway problem: as the nebulous load got high, apache would spawn more threads, slowing things down either by segfaulting or by taking more memory. The slowness on apache meant more queries would be pending, driving up the apache load further, and so on. As a result, we see periods where the apache error rate goes to very high values and the mysql rate plummets, with everything getting extra slow.

The configuration change fixed the segfaults. To address the memory load, we moved the apache server from ippdb00 (where mysql lives) to another machine. This was a big success: the load on ippdb00 has gone from a typical value of 20-25 during heavy processing to 4-5, and the rate of nebulous operations has increased from a maximum of roughly 150 per second to roughly 600 per second. We suspect that we could push nebulous even harder if we set up web redirects and use 2 or 3 machines for the apache server. Serge has also identified some optimizations to the SOAP interface code run on the server.

The other throughput improvement is in nighttime processing / registration. We had occasional problems in the past couple of weeks in which the registration process, including the burntool analysis, was getting sluggish and causing high load on the gpc1 database. This problem had only a modest impact in the past, but last week it caused processing to come to a halt one day until we stopped everything and restarted the mysql server. We discovered that the slowness comes from the query used to identify images for burntool processing. This particular query needs to join 2 tables (summitExp and summitImfile), but the only joining key is a string. Mysql is not smart enough to limit the query to only the interesting rows; it ends up checking millions of database rows. We have added an additional restriction to the query as a short-term hack while we work out the code to add a proper auto-increment index.
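
As a rough illustration of the query shape (not the actual gpc1 schema: the table names come from the paragraph above, but every column name here is a hypothetical stand-in), the short-term restriction and the longer-term fix look something like this:

```python
# Hedged sketch of the registration query problem.  The tables summitExp and
# summitImfile are named in the report; the columns exp_name, epoch, class_id,
# and burntool_state are illustrative placeholders, not the real gpc1 schema.

# Original form: the only join key is a string, so mysql scans far too many rows.
slow_query = """
    SELECT i.exp_name, i.class_id
      FROM summitExp e
      JOIN summitImfile i ON i.exp_name = e.exp_name   -- string join key
     WHERE i.burntool_state = 'pending'
"""

# Short-term hack: an extra restriction (here, only recent exposures) bounds the
# set of rows the optimizer has to touch.
fast_query = slow_query + "   AND e.epoch > NOW() - INTERVAL 3 DAY\n"

# Longer-term fix sketched in the report: add a proper auto-increment index so
# the join could use an integer key instead of the exp_name string.
```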

Finally, I launched some MD04 test data sets for PSPS to demonstrate the current state of the code.

Serge Chastel

Heather Flewelling

  • ippconfig test
    • ran magictest.3Pi.20110315
    • it is picking up the correct configs
    • comparing to magictest.3Pi.20110309.a - they have different recipes which use different masks.
  • addstar
    • found a deadbeef in addtool -addminidvodbrun (I still need to fix this)
    • requeued all the old minidvodbRuns for the new ThreePi database
    • started merging the old minidvodbs into the new ThreePi.V2 database
    • added dvoverify checks and a few new columns for minidvodbProcessed. Committed and checked into tag and trunk.
  • modified/copied diffgrep script in ipptrunk/tools to grep stack logs
    • sent Gene a list of problematic(?) skycells in refstacks

  • copied smfs to niall
  • keep trying to process MD09.jtrp and MD10.jtrp - delayed by various problems
  • sick 1.5 days

Roy Henderson

  • PSPS
    • extracted more files for Jim's MD04 stack analysis (weights)
    • some discussions and planning with Jim regarding a new plan to publish only detections seen twice or more.
    • PSVO
      • more design discussions with Daniel and Jim regarding integration of graphical query building into PSVO. We have a plan.
      • started on Classes to encapsulate database schema for the above
      • bug with JDBC drivers not on CLASSPATH for jar version: fixed
  • IPP
    • czartool
      • a face-lift for czartool: nicer tables, clearer highlighting of errors, better general formatting
      • a fix for unsightly jumps in czartool plots when a new label is added (done by checking the first derivative of the series for large jumps and adjusting the later values accordingly; see the sketch after this list)
      • finally added publish to czartool
      • created a new rate line plot (while retaining stacked histogram). Now using this on czartool webpage
      • added a 'pending postage stamp requests' table to webpage
      • added nebulous time series plot to webpage
    • ippToPsps
      • DetectionBatch class now utilizing new Fits class. Used new unit-test to ensure I didn't break anything along the way
      • added some documentation on detection batch unit-testing
  • Other
    • some time granting data access to new Hungarian group
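
Regarding the czartool plot fix noted above, here is a minimal Python sketch of the first-derivative approach; the jump threshold and the plain array-of-counts representation are assumptions for illustration, not the real czartool internals.

```python
# Sketch: remove the discontinuity introduced when a new label starts
# contributing to a cumulative czartool count series.
import numpy as np

def remove_label_jumps(counts, jump_factor=10.0):
    """Return a copy of `counts` with discontinuities from newly added labels removed."""
    counts = np.asarray(counts, dtype=float).copy()
    if counts.size < 3:
        return counts
    deriv = np.diff(counts)                      # first derivative of the series
    typical = np.median(np.abs(deriv)) + 1e-9    # typical step size
    for i, step in enumerate(deriv):
        if step > jump_factor * typical:         # suspiciously large upward jump
            counts[i + 1:] -= step               # shift later samples down by the offset
    return counts
```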

Bill Sweeney

  • Spent a couple of days using psastro and DVO to perform and analyze tests with STS exposures that get poor astrometric solutions on some chips when using the standard settings. The conclusion was that Gene needs to do some work on psastro to solve the problem.
  • Prototyped some code to track "Events" that occur during the lifetime of IPP output components. For example, a given chip will be processed, cleaned up, updated to make postage stamps, cleaned up, updated to make postage stamps, and so on. Often postage stamp jobs get stuck waiting for a long queue of pending cleanups to finish just so that we can regenerate the image (or an image from another part of the exposure). We have found the need to track this in order to better manage our disk usage. Also, if given images are in great demand, we can stop the madness and keep those images around. (A sketch of the idea follows this list.)
  • Built a reference stack for the CNP survey (circum north pole?).
  • We had performance problems with both of our databases this week. This caused lots of lost time investigating what was happening. In the end we restarted the database and things were fine for a while.
  • Spent two very busy days as processing czar.
  • Wrote scripts to queue chip and diff processing for several hundred ThreePi exposures for which processing was incomplete. The chosen exposures are in the Hyades and nearby regions and these are of interest to Bertrand Goldman of MPIA and KP5.
  • Created DVO database for all of the STS.2010 exposures to help with the astrometry investigations.
  • Spent much of Saturday babysitting the processing listed in the last 2 items.
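
As a rough sketch of the "Events" idea described above (the class names, event names, and fields here are hypothetical illustrations, not the actual prototype), the record-keeping might look like:

```python
# Hypothetical sketch of per-component event tracking; not the real prototype.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class ComponentEvent:
    component: str          # e.g. a particular chip of a particular exposure
    event: str              # "processed", "cleaned", "updated_for_pstamp", ...
    when: datetime = field(default_factory=datetime.utcnow)

@dataclass
class ComponentHistory:
    component: str
    events: List[ComponentEvent] = field(default_factory=list)

    def record(self, event: str) -> None:
        self.events.append(ComponentEvent(self.component, event))

    def pstamp_update_count(self) -> int:
        """How often this component has been regenerated for postage stamps;
        a high count suggests keeping the image on disk instead of cleaning it."""
        return sum(1 for e in self.events if e.event == "updated_for_pstamp")
```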

Chris Waters

  • Investigated the dtime_photom values from the chip stage: ChipStage_Timing. Conclusion was that chips with bad dark models were identifying large numbers of faint sources, and the footprint culling algorithm ran slowly for these sources.
  • Diskspace: Identified that hosts with larger-than-expected disk usage were hosts that had been offline at some point in the past. This causes them to have gaps in the range of so_ids available on that host, which decreases the effectiveness of the disk balance code. To get around this, disk balance now starts at a random so_id value, allowing the full space to be explored more regularly; previously, only a small fraction of the range, starting at so_id = 1, was being examined (see the sketch at the end of this section).
  • Reprocessing: organized scripts to calculate tessellation cells that are well populated in all filters, and plot DVO image footprints. This will help in choosing which data to use when we start reprocessing.
  • Detectability server: bugfixes and rewrites to get this working correctly. Still not finished.
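
For the disk-balance change mentioned above, here is a minimal sketch of the random-start idea; the function name, batch size, and range bookkeeping are illustrative assumptions rather than the actual disk balance code.

```python
# Sketch: instead of always scanning storage objects from so_id = 1 (which keeps
# revisiting the same low range), each pass starts at a random so_id and wraps
# around, so repeated passes explore the full so_id space more uniformly.
import random

def so_id_scan_order(max_so_id, batch=10000):
    """Return (start, end) so_id ranges covering 1..max_so_id, beginning at a
    random offset and wrapping back to the start of the range."""
    start = random.randint(1, max_so_id)
    ranges = []
    lo = start
    while lo <= max_so_id:
        ranges.append((lo, min(lo + batch - 1, max_so_id)))
        lo += batch
    lo = 1
    while lo < start:
        ranges.append((lo, min(lo + batch - 1, start - 1)))
        lo += batch
    return ranges
```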