IPP Progress Report for the week 2011.11.28 - 2011.12.02

(Up to IPP Progress Reports)

Eugene Magnier

I had productive discussions with the KP12 folks (mostly Peter & Nigel) about remaining issues in the psphot Kron measurements when compared to SExtractor values. I was able to address the two most outstanding features. First, there was a quantization of the Kron fluxes, especially noticeable for bright stars. This was caused by quantized steps in the windows used to measure the first radial moments, which are in turn used to define the apertures for the Kron fluxes. The effect was fairly small (~2% steps in the flux), but it gave an ugly appearance that worried folks. I addressed it by interpolating the radial window selection sizes to avoid sharp transitions. The other issue was bright sources with psphot Kron magnitudes that were too faint. These were somewhat rare, but not rare enough. They were caused by detections in the wings of brighter objects which, although flagged as bad detections, were nonetheless being masked in the Kron analysis. This was easy to fix by excluding those objects when the Kron masking was performed.
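
To illustrate the first fix, here is a minimal sketch (in Python, not the actual psphot code) of how blending measurements from the two window radii that bracket a source's size, rather than snapping to the nearest entry in a discrete grid, removes the quantized jumps in the first radial moment and hence in the Kron aperture. The grid values, the blend weighting, and the factor k are assumptions for illustration only.

    import numpy as np

    # Assumed discrete grid of window radii (pixels); the real psphot values differ.
    WINDOW_GRID = np.array([4.0, 6.0, 9.0, 13.5, 20.0])

    def first_moment(image, cx, cy, rwin):
        """First radial moment sum(I*r)/sum(I) inside a circular window of radius rwin."""
        y, x = np.indices(image.shape)
        r = np.hypot(x - cx, y - cy)
        inside = r <= rwin
        flux = image[inside].sum()
        return (image[inside] * r[inside]).sum() / flux if flux > 0 else 0.0

    def kron_radius_smooth(image, cx, cy, size_estimate, k=2.5):
        """Blend the moments from the two windows bracketing the size estimate,
        instead of snapping to one grid entry, so the aperture varies continuously."""
        i = np.clip(np.searchsorted(WINDOW_GRID, size_estimate), 1, len(WINDOW_GRID) - 1)
        lo, hi = WINDOW_GRID[i - 1], WINDOW_GRID[i]
        w = np.clip((size_estimate - lo) / (hi - lo), 0.0, 1.0)   # interpolation weight
        r1 = (1 - w) * first_moment(image, cx, cy, lo) + w * first_moment(image, cx, cy, hi)
        return k * r1   # Kron-style aperture radius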

I also spent some time discussing the IPP/MOPS testing with Larry, Chris, and Peter. They pointed out that the ON_GHOST flag bit was never set in the ppSub output. Looking at the code, I realized that this test was not applied in the analysis used by ppSub, only for the detections from the positive images. I added a test to check for this condition (on ghost) and set the flag accordingly.
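
A minimal sketch of the kind of test that was missing; the bit values and mask layout below are assumptions for illustration, not the actual ppSub/psphot definitions. Each difference-image detection is checked against the ghost bit in the pixel mask, and its ON_GHOST flag is set when it lands on a ghost-masked pixel.

    # Assumed bit values for illustration only; the real IPP definitions differ.
    MASK_GHOST = 0x0400      # assumed ghost bit in the pixel mask plane
    FLAG_ON_GHOST = 0x0008   # assumed ON_GHOST bit in the detection flags

    def apply_ghost_flag(detections, mask_image):
        """Set ON_GHOST on any detection whose central pixel carries the ghost mask bit."""
        for det in detections:
            x, y = int(round(det["x"])), int(round(det["y"]))
            if mask_image[y, x] & MASK_GHOST:
                det["flags"] |= FLAG_ON_GHOST
        return detections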

Finally, we had some really bad hardware issues this week, which Cindy has been chasing down with lots of help from Gavin. This forced us to limit some of the operations over the weekend, and we are still not quite back to full ops. Here is the summary of the problem that I sent to the IPP Users email list:

We recently added 13 new storage nodes to our cluster, each providing
40TB of storage for a total of 520TB added to the system.  These
machines were ordered to be functionally identical to the other
storage machines in our cluster, but they were delivered with a
different model of RAID card from our previous purchases.

Over last weekend, we had a problem with one of the new machines,
ipp064.  One of the disks failed and needed to be replaced.  This is a
standard operation on our cluster -- we typically replace 5-15 disks
per week on the rest of the machines.  However, our standard procedure
resulted in an error on the controller that largely damaged the RAID
array.  Although this is not desired, it is not catastrophic since our
system design includes replication for all raw data, all long-term
output products, and much of the ephemeral data as well.  (We don't
yet replicate every output product, but this latest incident may
encourage us to do so in the future).  We initiated recovery
procedures for the data -- since some data was recoverable, we set
about making a backup copy and regenerating a second copy from the
replicas elsewhere on the cluster.

The very worrisome event occurred on Wednesday night, when a second
RAID array (on ipp058) had a disk error.  Since we still did not have
a good answer from the vendor about the cause of the ipp064 problem,
we operated in a very conservative manner.  Instead of following our
standard recovery procedure, we worked with the vendor to recover the
RAID.  In the process, we have learned that the RAID controller has
some kind of error which makes it susceptible to the full loss of
data that we suffered on ipp064.  At this point, ipp058 appears to be
in a valid state -- there is no lost data -- but until we (a) have a
recovery process from the vendor that we trust and (b) understand how
to avoid these errors, we cannot trust the validity of these
machines.  Unfortunately, without those machines, we have a space
crunch.  Although there is space available on the cluster, it is
poorly distributed: about half of our machines are nearly full.  So,
until we can clear up this situation, we are limiting the operations
which would either stress the new storage machines or load the
nearly full portion of the cluster.

We continue to work with the vendors, but we are looking into replacing all of the RAID cards on these machines if the interface continues to be unreliable.

Serge Chastel

Heather Flewelling

  • cooked lots of dvodbs for ORR (multiple iterations of MD04, SAS, SAS.footprint)
  • restarted ThreePi + merging
  • answered questions from Roy and Jim
  • development work on addstar to properly handle staticskyRuns with multiple filters
  • sent email to Gene about modifications to staticsky to reduce insanity in addstar/PSPS ingestion

Roy Henderson

  • Code changes prior to SAS/MD4 loading:
    • populating un-convolved fluxes in StackApFlx table
    • incorporating filterID into stackDetectID to avoid key violations for multi-filter stacks (see the sketch after this list)
  • SAS loading:
    • started testing with a test DVO provided by Heather
    • much confusion over stack_id/sky_id/path_base for new stacks in DVO
    • worked on a very convoluted way of getting the correct cmf file for a given stack_id for the new multi-filter stacks
    • had to stop loading SAS and pull all batches from datastore after issue discovered with z-frames
    • restarted loading and completed with new DVO created by Heather
  • MD4 loading:
    • dusted off old code to convert DVO databases to MySQL
    • converted new MD4 DVO to MySQL in 24 hours
    • processing was much faster, but still too slow to load in time for ORR. So...
    • made a database dump of MD4 and restored it to numerous hosts in order to run multiple clients
    • loading completed in 24 hours
  • Other:
    • completed loading of RA 6-18 hrs for the OLD survey, i.e. half the sky
    • very carefully republished all the failed/OnHold batches from a list provided by Conrad
    • investigated issue of large numbers of NULL fluxes in new stacks
    • updated the PSPS news page with plots after merge completion
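
As a rough illustration of the stackDetectID change mentioned above, here is a sketch of the general idea; the real PSPS ID layout, bit widths, and field split are not shown here and are assumptions. Reserving a few bits for the filter index keeps detections of the same stack in different filters from colliding on the primary key.

    FILTER_BITS = 3   # assumed width, enough for grizy plus spares

    def make_stack_detect_id(stack_id: int, det_index: int, filter_id: int) -> int:
        """Pack a filter index into the detection ID so multi-filter stacks stay unique.
        The 32-bit split between stack_id and det_index is an assumption for illustration."""
        assert 0 <= filter_id < (1 << FILTER_BITS)
        base = (stack_id << 32) | det_index
        return (base << FILTER_BITS) | filter_id

    # The same stack and detection index in two filters now yield distinct IDs:
    # make_stack_detect_id(1234, 42, 1) != make_stack_detect_id(1234, 42, 2)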

Mark Huber

  • MD: evaluating the extent of MD06 and MD08 data lost on ipp064.
  • staticsky
    • preparation of the MD04, SAS.footprint.123, and SAS2.123 data sets for loading into DVO+PSPS. A new reprocessing of the MD04 camera exposures was destreaked. A new re-run of SAS fixed the missing footprint stacks but is now missing the SAS2.123 stacks.
    • continued testing faults in psphotstack

Bill Sweeney

  • Unfortunately Bill was ill this week and did not get much accomplished.
  • Wrote some scripts to set up reprocessing of the 3000 or so exposures that were destreaked incorrectly between 11 November and 24 November. This project was hampered by the loss of data on ipp064.
  • Made a couple of minor changes to the postage stamp server to support easier selection of stack skycells for get image requests.

Chris Waters

  • ipp064: rsynced data from the partially recovered RAID to ipp065 to allow for some salvage operations.
  • SSTF: Ran test sets with improved flagging code.
  • Sky/background issue: finished the proposed set of changes to the PATTERN.ROW/CELL choices. Modified the PATTERN.ROW behavior to fit trends in slope and offset as a function of row (see the sketch after this list); this is needed to ensure that PATTERN.ROW does not subtract off real sky variations while correcting row-to-row offsets. This seems to improve the background subtraction around these cells in most cases. Checking the magic fraction shows a minor increase: the original processing averages 0.027 masked and the new processing 0.028.
  • Diskspace: Due to wave 4 RAID concerns, set all wave 4 nodes to nebulous repair, blocked the targeting shuffle from moving data to these nodes, added the stare nodes to nebulous, and updated summit copy to properly target those hosts for stare data.
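
A minimal sketch of the idea behind the PATTERN.ROW change (Python, not the actual ppImage code; the per-row median and the polynomial trend order are assumptions): the per-row levels are split into a smooth trend with row number, which is treated as real sky and left in place, and a high-frequency residual, which is treated as the electronic row-to-row pattern and subtracted.

    import numpy as np

    def pattern_row_correction(image, trend_order=2):
        """Remove row-to-row offsets without removing smooth sky gradients."""
        rows = np.arange(image.shape[0])
        row_levels = np.median(image, axis=1)        # robust per-row level
        trend = np.polyval(np.polyfit(rows, row_levels, trend_order), rows)  # smooth part ~ sky
        residual = row_levels - trend                # row-to-row pattern only
        return image - residual[:, np.newaxis]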