PS1 IPP Czar Logs for the week 2011.01.24 - 2011.01.30

(Up to PS1 IPP Czar Logs)

Monday : 2011.01.24

  • eam : warp seemed to be slow: no progress for about 1 hour. The book was full of DONE runs. I stopped processing, ran 'process_cleanup warpPendingSkyCell' to clear the book, and restarted processing. It ran fine after that.
  • eam : burntool seemed to have gotten stuck. I looked at the new burntool state ippMonitor page and found that one of the imfiles did not seem to be making progress. Looking at the pantasks (control status), I noticed that the job for that cell had been running for a very long time (>500 sec). I went to the machine where it was running and saw that it was hanging on access to /data/ipp033.0 (which crashed over the weekend). I used force.umount to clear the mount point, and things moved along from there (a sketch of both of this morning's recoveries follows this list).
  • bills 15:53 Removed ipp053 from distribution host lists and restarted the distribution pantasks. It had seemed sluggish anyway.
  • bills 16:00 cleared some magicDSRun revert faults. I need to automate this!
  • bills 16:09 updated magic_destreak_cleanup.pl to *not* delete the original uncensored diff stage cmf files.
  • bills 16:10 set label STS.201009 back to active.
  • bills 16:25 Executed stacktool -updaterun -set_state drop -stack_id 216256 -set_note 'fails due to problem in ticket 1427'
  • bills 19:37 several faults have appeared; setting STS.201009 back to inactive.
  • bills 21:19 still lots of faults. I suspect that the rsync processes running on ipp008 are related. I did 'neb-host --host ipp008 --state repair'.
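
For reference, a rough sketch of this morning's two recoveries. The unmount line assumes force.umount wraps the standard Linux forced/lazy unmount, and the pantasks lines simply spell out the stop / cleanup / run sequence described above; the exact local syntax may differ.

    # stale NFS mount left behind by the ipp033.0 crash; force.umount presumably
    # does something equivalent to a forced + lazy unmount
    sudo umount -f -l /data/ipp033.0

    # full warpPendingSkyCell book in the stdscience pantasks: pause processing,
    # flush the DONE entries, then resume
    stop
    process_cleanup warpPendingSkyCell
    run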

Tuesday : 2011.01.25

Bill is czar today

  • 04:21 Many many faults. ipp012 filesystem is read-only from many nodes. ssh is rejected. Stopping all processing for a few minutes to investigate.
  • 04:25 ipp012 console unresponsive. Only output on the console: 'INIT: Id "s0" r[220286.000299] Kernel panic - not syncing: Attempted to kill init!'. Cycling power.
  • 04:37 processing restarted. Space is running out.
  • 04:39 There is a very large postage stamp request (> 15000 jobs from MPIA). Setting MPIA labels to inactive for now.
  • 04:55 distribution pantasks jobs for ipp012 are all failing. setting ipp012 to off.
  • 07:11 many nodes are over 98% full. Stopped processing except for summit copy and registration. All of last night's data has been copied, but 308 exposures are still working through registration & burntool.
  • 07:23 ippdb00:/tmp is full, which is causing nebulous errors. Ran 'sudo mv /tmp/nebulous_server.log /export/ippdb00.0/nebulous_server.log-20110125'. /tmp still shows zero free (see the note after this list).
  • 08:00 fixed a register exposure problem caused by an invalid value for $default_host in the registration pantasks. Changed it from ipp023 to any.
  • 08:30 burntool is proceeding.
  • 08:30 since magic is way backlogged and since it doesn't use much space, I've set distribution back to run with destreak.off.
  • 09:26 fixed broken magic_cleanup script. Data is being recovered on ipp053. Turning processing back on.
  • 09:41 turned on the STS label and set its priority to 1000. Once those exposures are done and distributed we can clean up the data.
  • 11:13 turned off chip to allow the STS warps to progress faster.
  • 11:00 ipp053 is < 98% now. Queued magicRuns on ipp050 for cleanup. It's down to 97% as well.
  • 13:15 chip.on
  • 13:31 lowered priority of STS.201009 below MD% but above 3pi
  • 15:24 removed unneeded labels from the list of labels and survey book entries in stdscience. This should lower the latency of getting things queued. To do this, execute the pantasks command "server input del.labels.january".
  • 15:58 Taking ThreePi.nightlyscience out of survey.publish while I work on enabling publication of muggled data.
  • 16:54 in stdscience did survey.off to prevent new runs from getting queued for now.
  • 17:23 I'm testing publishing out of my build. ~ipp/publishing is off but rest assured some data is getting processed.
  • 20:58 restarted distribution to see if it will run faster.
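
A note on the 07:23 item above: if the nebulous server still held the old log file open, moving it off /tmp would not release the blocks, which would explain /tmp still showing zero free. A generic recovery sketch using standard Linux commands (nothing IPP-specific; <pid> and <fd> are placeholders):

    lsof +L1 /tmp                            # list deleted-but-still-open files on /tmp
    sudo truncate -s 0 /proc/<pid>/fd/<fd>   # release the space without restarting the daemon
    df -h /tmp                               # confirm the space came back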

Wednesday : 2011.01.26

  • 02:15 Earlier I added SweetSpot.nightlyscience to survey.destreak. Yipes, we're backlogged. We need a survey czar, it seems! There is one STS.201009 distRun left to process and it's blocked behind SweetSpot. I'm removing the sweetspot label from distribution for the next few hours.
  • 02:20 Let's give a warm IPP welcome to the new PI survey! Added label PI.nightlyscience to stdscience.
  • 09:30 I've convinced myself that the new publication system is working correctly so I've added in the other labels.
  • 10:34 Set STS.20100906 to be cleaned.
  • 11:14 a magic node was repeatedly failing due to a corrupted diff input file. Fixed with rundiffskyfile.
  • 16:40 Today I changed the order of the queries in magictool -toprocess. It now does the skycell nodes last. This fixes the problem that Gene discovered last night. Ran it throughout the day with various starts and stops. The rate is about 0.8 runs completed per minute. Since stdscience was idle we increased the number of hosts working, and the rate didn't increase significantly. I've just restarted pantasks with the normal setup to see how it compares.
  • 17:04 stdscience needs to be on because the survey tasks run there.
  • 22:32 The ThreePi backlog for magic is slowly recovering. I've added the SweetSpot.stdscience label to distribution. Need to start working on the backlog there too.

Thursday : 2011.01.27

  • 06:38 looks like we had a good night. 3pi quads are getting published in a timely manner. The 3pi magic sweetspot DS backlog has improved greatly, but of course created a 3pi distribution backlog. Turning that label off in distribution to give the MD bundles a chance to run.
  • 09:00 updated disttool to order pending components by priority, so I can turn SweetSpot.nightlyscience back on. Eh, not yet.
  • 09:51 turning chip off so that last night's 3pi Quads can finish warp and diff (they are all through chip; see the sketch after this list).
  • 12:27 Gene restarted stdscience. This restarted chip processing. Added SweetSpot.nightlyscience to distribution and enabled STS.201009.
  • 13:13:31 The record holder for most postage stamp jobs per request file goes to morganson.20110126020856 at 27949. After a couple of days' work it finally finished. Eric has told me that he will split things up better in the future.
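
The chip-throttling trick in the 09:51 entry (also used Tuesday at 11:13 and 13:15) is just the per-stage on/off commands in the stdscience pantasks. A minimal sketch; chip.on appears verbatim in Tuesday's log, and chip.off is assumed to be its counterpart:

    chip.off    # stop queueing new chip jobs; pending warp/diff keep draining
    # ... later, once the downstream backlog clears ...
    chip.on     # resume chip processing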

Gene is czar today.

  • 15:46 (Bill) Fixed a ppStack memory problem. Adding label MD05.refstack.20110121 with a stack.poll value of 10 (versus the usual 40).
  • 16:43 (Bill) Since all of the SweetSpot destreak runs are done and we might get sweetspot data tonight, set the priority of that label back to its usual value (410).
  • 17:52 (Bill) increased stack.poll to 20
  • 18:30 (Bill) gradually increased stack.poll to usual value 40
  • 19:40 (bill) dropped stack 217864 due to problem reported in ticket 1427: no sources suitable for psf fitting.
  • 22:00 (Bill) 2 cam runs failed due to corrupt chip files; fixed with:
ipp@ipp033:/home/panstarrs/ipp>perl ~bills/ipp/tools/runchipimfile.pl --redirect-output --chip_id 185653 --class_id XY35 --dbname gpc1
command to process this skycell
chip_imfile.pl --exp_id 288461 --chip_id 185653 --chip_imfile_id 10964592 --class_id XY35 --uri neb://ipp045.0/gpc1/20110128/o5589g0075o/o5589g0075o.ota35.fits --camera GPC1 --run-state new --deburned 0 --outroot neb://ipp045.0/gpc1/SweetSpot.nt/2011/01/28//o5589g0075o.288461/o5589g0075o.288461.ch.185653 --redirect-output --no-update --verbose --dbname gpc1


Starting script /home/panstarrs/ipp/psconfig//ipp-20101215.lin64/bin/chip_imfile.pl on ipp033 at Thu Jan 27 22:02:59 HST 2011



        Time spent in user mode   (CPU seconds) : 288.891s
        Time spent in kernel mode (CPU seconds) : 26.818s
        Total time                              : 7:12.63s
        CPU utilisation (percentage)            : 72.9%
ipp@ipp033:/home/panstarrs/ipp>echo $?
0


and

ipp@ipp026:/home/panstarrs/ipp>perl ~bills/ipp/tools/runchipimfile.pl --redirect-output --class_id XY35 --chip_id 185679
command to process this skycell
chip_imfile.pl --exp_id 288490 --chip_id 185679 --chip_imfile_id 10966152 --class_id XY35 --uri neb://ipp045.0/gpc1/20110128/o5589g0103o/o5589g0103o.ota35.fits --camera GPC1 --run-state new --deburned 0 --outroot neb://ipp045.0/gpc1/SweetSpot.nt/2011/01/28//o5589g0103o.288490/o5589g0103o.288490.ch.185679 --redirect-output --no-update --verbose --dbname gpc1


Starting script /home/panstarrs/ipp/psconfig//ipp-20101215.lin64/bin/chip_imfile.pl on ipp026 at Thu Jan 27 22:20:15 HST 2011



        Time spent in user mode   (CPU seconds) : 89.080s
        Time spent in kernel mode (CPU seconds) : 9.142s
        Total time                              : 2:19.15s
        CPU utilisation (percentage)            : 70.5%
ipp@ipp026:/home/panstarrs/ipp>echo $?
0

Friday : 2011.01.28

Bill checking in this morning. Looks like we had a good night

  • 07:44 Bill spent a very confusing quarter of an hour debugging a distRun that wasn't finishing ("no fault, so why not finished?"). It turned out that the revert task fires every 60 seconds, so the state in the database kept changing.
  • Set state of distRun 398335 to hold so that I can debug and fix a bug in streaksrelease (unexpected image type)
  • One STS magicRun got stuck due to a corrupted diffSkyfile; fixed with:
ipp@ipp038:/home/panstarrs/ipp>perl ~bills/ipp/tools/rundiffskycell.pl  --dbname gpc1 --redirect-output --diff_id 105616 --skycell_id skycell.190
command to process this skycell

  • Just got a burst of destreak failures for data where the path_base is set to volume ipp019
  • fixed a bug in magictool -toprocess that I introduced a couple of days ago. Apparently this is the first time we've had no pending magicRuns.
  • warp was stuck because the book warpPendingSkyCell was full of entries in state DONE, so none could be added (same problem and fix as Monday morning).
  • ipp019 had another period of high load leading to nfs write failures.
  • 09:34 reverted failed stacks
  • 16:30 Lots of unlogged activity today. The system was re-built and all pantasks were restarted.
  • 16:33 Executed book setword nsStacks 2011-01-28 nsStackState FINISHED_STACKS in an attempt to stop ns.stacks.run from dying
  • 19:10 Fixed summit copy faults. Set exposures whose files gave '104 not found' and '110 gone' errors to drop. Downloaded 4 files from c5580g0112o that had incorrect checksums by omitting the checksum check. Got done a few minutes late, so summit copy is slightly behind.
  • 20:15 bumped priority of STS. Let's get it out of here!

Saturday : 2011.01.29

  • 06:28 Bill did 'hosts add compute', 'hosts add wave2', and 'hosts add wave3' in the distribution pantasks, increasing the number of hosts from 78 to 119 (session sketch after this list).
  • 06:47 fixed a corrupted file that caused a repeating diff failure with: ipp@ipp043:/home/panstarrs/ipp>perl ~bills/ipp/tools/runwarpskycell.pl --dbname gpc1 --redirect-output --warp_id 155417 --skycell_id skycell.1711.016
  • 09:32 It feels like destreak is starving distribution. I realized that we have several stages enabled which are not being used. Ran 'server input removedunusedstages' to remove raw, fake, warp_bg, chip_bg, and sky from the DIST_STAGE list.
  • 19:39 PI night begins. Reloaded survey.pro to remove an echo from the survey.publish task that was filling up pantasks.stdout.log.
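
For the 06:28 entry, the host additions are ordinary pantasks client commands. A sketch of the session; the final line assumes the controller listing referenced in Monday's '(control status)' note is the way to confirm the new count:

    hosts add compute
    hosts add wave2
    hosts add wave3
    controller status    # verify the host count went from 78 to 119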

Sunday : 2011.01.30

  • 06:00 things going smoothly. At the end of observing they added some 3pi exposures. Added 3pi entries to survey.publish.
  • 06:15 1 diff run faulted with fault 4 due to one of the known threading assert failures. Ran difftool -revert, but it faulted again. Reverted again and this time it completed (see the sketch after this list).
  • 06:20 PI distribution appears to be backed up but it's not. The chip and warp stages weren't in DIST_STAGES so they weren't getting processed.
  • 08:59 got a diff failure with fault 4 and a traceback that I don't recognize (see below). Will try to revert.
    replaced models for 189 objects: 0.000050 sec
Unable to determine detection efficiencies from fake sources
 -> psphotGuessModelsReadout (psphotGuessModels.c:115): unknown psLib error
     Unable to guess model.
 -> psphotGuessModels (psphotGuessModels.c:24): Problem in configure files
     failed on to guess models for PSPHOT.INPUT entry 0
 -> psphotMagnitudesReadout (psphotMagnitudes.c:129): unknown psLib error
     Unable to guess model.
 -> psphotMagnitudes (psphotMagnitudes.c:38): Problem in configure files
     failed to measure magnitudes for PSPHOT.INPUT entry 0
 -> effLimit (psphotEfficiency.c:54): unexpected NULL found
     Unable to generate PSF model.
 -> psphotEfficiencyReadout (psphotEfficiency.c:271): unknown psLib error
     Unable to determine limits for image
 -> psphotEfficiency (psphotEfficiency.c:176): Problem in configure files
     failed to measure detection efficiency for PSPHOT.INPUT entry 0
Unable to perform ppSub: 4 at /home/panstarrs/ipp/psconfig//ipp-20101215.lin64/bin/diff_skycell.pl line 400.
Running [/home/panstarrs/ipp/psconfig/ipp-20101215.lin64/bin/difftool -diff_id 106631 -skycell_id skycell.0937.033 -fault 4 -adddiffskyfile -dtime_script 104.999993741512 -hostname ipp028 -path_base neb://ipp028.0/gpc1/PI.nt/2011/01/30/RINGS.V0/skycell.0937.033/RINGS.V0.skycell.0937.033.dif.106631 -dbname gpc1]...

  • The skycell above worked after reverting.
  • 13:00 or so turned off distribution and destreak for a little while in order to get an idea how much faster magic would proceed.
  • 13:53 pstamp has finished MOPS's requests. Restarted postage stamp server with the usual setup: one set of hosts.
  • 21:42 restarted distribution. It seemed slow. Since the pending DS runs are warp-only and most of the distRuns are diff, increased the poll limit to 300.
  • Dist is still starved; increased the poll limit to 1000.
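
The fault-4 threading asserts at 06:15 and 08:59 are transient; the recovery is simply to revert the faulted diff and let the task re-run, occasionally twice. A schematic sketch: difftool -revert and -dbname are taken from entries in this log, but whatever flags select a specific faulted run are not recorded here, so this is only the shape of the fix.

    difftool -revert -dbname gpc1    # clear the fault so the skycell re-runs
    # if the same skycell trips the assert again, one more revert usually
    # clears it (per the 06:15 entry above)
    difftool -revert -dbname gpc1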