PS1 IPP Czar Logs for the week 2011.03.07 - 2011.03.13

(Up to PS1 IPP Czar Logs)

Monday : 2011.03.07

Roy is czar.

  • No observations last night. Churning through magic and magicDS for labels: ThreePi.136 and STS.2010.raw.
  • 11:21 bill rebooted ipp005
  • 12:10 bill noticed that the update pantasks died, probably triggered by the ipp005 errors. He restarted it. Also changed pstamp_job_run to distinguish between temporarily unavailable files and permanently gone files.
  • sometime later bill noticed that the distribution pantasks server had died as well and restarted it. sigh.
  • 13:49 bill doubled the number of hosts working on updates and increased the poll limit from 32 to 100.
  • 17:54 mr fussy was unhappy with the progress for distribution. Not enough jobs in queue. Restarted and used 3 x default.hosts
  • 21:50 it didn't help much
  • 21:50 replication pantasks died. Last timestamp on logfile 21:09. Restarted by bill.

Tuesday : 2011.03.08

  • 09:30 (bill) cleanup is very backed up. ippdb00 is red on ganglia. distribution is making slow progress. As an experiment I've stopped replication to see if it changes anything.
  • 09:51 It didn't. Set replication to run. Doubled the number of nodes working on cleanup though.
  • 10:40 The slope of the magic and destreak lines went down about the time that I increased the horsepower working on cleanup.
  • 18:00 Increasing the nodes working on cleanup didn't help. We restarted cleanup a couple of hours ago. I've now lowered the poll limit to 16, and with that cleanup seems to be going faster.
  • 21:07 Stopped replication and cleanup to see how it affects destreak and magic. (The cleanup runs that Eric's pstamp jobs were waiting on are mostly done.)
  • 21:50 started cleanup with poll.limit == 8
  • 22:05 this seems to be stable. Woo hoo, we have new images from the summit.
  • 23:39 turned replication back on. Still getting apache segvs, but only in bursts.

Wednesday : 2011.03.09

  • 05:59 ThreePi data from last night got queued with label ThreePi.136.nightlyscience. Fixed database and nightly_science.config
  • A magicDSRun failed due to a corrupted camera mask file. Since the other chips had already been destreaked, fixing this involved multiple steps:
    • Set the magicDSRun to "restored". This puts the original files back in place: "magicdstool -updaterun -set_state goto_restored -magic_ds_id 432760"
    • wait until the files are restored (by
    • rerun the camera processing "perl ~bills/ipp/tools/ --redirect-output --cam_id 180048 --dbname gpc1"
    • set magicDSRun back to new "magicdstool -updaterun -set_state new -magic_ds_id 432760"
  • 12:40 turned on revert tasks for chip, camera, and warp
  • 15:25 warp 169687 had 4 failing skycells due to a corrupt camera mask file (XY36 host ipp035) fixed with "perl ~bills/ipp/tools/ --redirect-output --cam_id 181521 --dbname gpc1"
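
The restore / rerun / requeue sequence used for these corrupt camera mask files can be sketched as a small shell wrapper. This is a hypothetical dry-run sketch, not an IPP tool: the ids and the magicdstool/perl invocations are copied from this log, `run` only records and echoes each command, and the script name under ~bills/ipp/tools/ is elided in the log itself.

```shell
#!/bin/sh
# Dry-run sketch of the magicDSRun recovery steps above.
# "run" records and echoes each command instead of executing it,
# so nothing touches gpc1; drop the echo body to run for real.
MAGIC_DS_ID=432760
CAM_ID=180048
CMDS=""

run() {
    CMDS="$CMDS $*;"   # keep a record of what would be run
    echo "$@"
}

# 1. Put the original (pre-destreak) files back in place.
run magicdstool -updaterun -set_state goto_restored -magic_ds_id $MAGIC_DS_ID

# 2. Once the restore finishes, rerun the camera stage
#    (script name under ~bills/ipp/tools/ is elided in the log).
run perl ~bills/ipp/tools/ --redirect-output --cam_id $CAM_ID --dbname gpc1

# 3. Hand the run back to magic destreak.
run magicdstool -updaterun -set_state new -magic_ds_id $MAGIC_DS_ID
```

Remember to wait for the restore (step 1) to complete before rerunning the camera stage; the wrapper above does not poll for that.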

Friday : 2011.03.11

  • heather is czar - this is the day after the tsunami - we didn't stop processing last night. ipp045 had problems which chris/gene/gavin sorted out.
  • 15:21 heather added MD08.jtrp to stdscience
  • 15:21 heather has been running magictest.3Pi.200110309.a on her stdsci.
  • 15:40 serge tries new Apache parameters on ippc01.
  • 15:30 bill is running a pantasks on ippc22 which is building some distribution bundles of raw detrend images.
  • 16:24 after investigations by serge/heather/gene/bill chip_id 202275 class_id 47 is manually set to quality of 42 and fault of 0. ppImage went insane here.
  • 17:11 bill's pantasks has finished its work and has been shut off.
  • heather restored the following raw chips (one copy was 0 bytes, the other was fine). They were found by processing MD08.jtrp, so no idea how many more are like this.
    • gpc1/20090628/o5010g0081o/o5010g0081o.ota45.fits
    • gpc1/20090628/o5010g0082o/o5010g0082o.ota24.fits
    • gpc1/20090628/o5010g0083o/o5010g0083o.ota45.fits
    • gpc1/20090628/o5010g0083o/o5010g0083o.ota52.fits
    • gpc1/20090628/o5010g0085o/o5010g0085o.ota25.fits

Saturday : 2011.03.12

Sunday : 2011.03.13

  • 04:30 eam : ipp032 crashed (much earlier, probably near midnight). I rebooted it. I also restarted all pantaskses to clear out the errors. A number of log files from ipp032 were missing from nebulous, so I moved them to dead file names
  • 10:15 eam : apache got into the major segfault state; restarting apache cleared this out