PS1 IPP Czar Logs for the week 2011.01.03 - 2011.01.09

(Up to PS1 IPP Czar Logs)

Monday : 2011.01.03

heather reverted (using regtool -revert...) burntool/registration.

bill and eugene have turned off processing because we are out of disk space.

Tuesday : 2011.01.04

bill is czar today

  • It appears that all data from last night has had burntool applied.
  • 12:30 Set stdscience to 'run'; added ThreePi?.nightlyscience back in
  • 12:52 we seem to be getting a fairly high rate of faults due to NFS errors

Wednesday : 2011.01.05

Bill is czar today

  • (serge/07:40) cam revert on
  • (serge/08:39) publishing restarted
  • 10:00 warp stuck: lots of entries in the warpPendingSkyCell book in done state. ipp049 not responding to ssh; 4 warps stuck running there. Stopped everything for a while to let jobs finish, then reset the books (warp.reset, chip.reset, etc.)
  • 10:51 Gavin rebooted ipp049. publish was getting lots of faults. Stopped it and asked Serge to investigate.
  • 11:35 Turned off some reverts in order to debug the fault rate. Also set the poll limit to 32 to reduce the load, to get an idea of whether that is the problem.
  • 12:20 two chips failed repeatedly. Turned out that the log files had a storage object but no instances. Fixed with neb-mv
    • Serge found the origin of the publishFailures (some runs got queued for a client with a non-existent data store)
    • turned warp off to allow the diffs to make better progress. poll limit is 64
  • 12:41 turned warp back on. Still getting failures even with only a few jobs running. There is at least one corrupt camRun 152676. See
  • 13:29 found another corrupt file warp_id 142806 skycell.1162.062 . Increased poll limit to 128
  • 13:45 found 2 publishRuns that were failing and reverting repeatedly; 9 GB of log files was the result
  • 14:06 ran --queue_stacks --date 2011-01-05
  • 15:00 gave up trying to debug the cause of the high fault. All reverts back on.

Thursday : 2011.01.06

serge is czar

  • (bills 06:57) figured out why ssdiffs aren't getting queued for MD03. warps and stacks were done with MD03.V2 but the survey task still had the MD03 template.
  • registration is stuck. I am investigating.
  • (bills 07:45) I reverted faults but issued the command wrong and reverted over 20000 old faults. Set newExp.state to 'wait' where state ='run' and exp_id < 273800
  • (bills) 08:41 burntool is proceeding slowly. All but 5 or so chips are finished, and the query for pending files is slow compared to the time it takes to run burntool, so there are no jobs to run most of the time.
  • (roy/heather/serge) 08:52 burntool/registration very slow. Saw no failed registration chips, so restarted registration server. Saw worrying message in registration log:
Can't find regtool at /home/panstarrs/ipp/psconfig//ipp-20101215.lin64/bin/ line 47.
Can't find required tools. at /home/panstarrs/ipp/psconfig//ipp-20101215.lin64/bin/ line 55.

config error for: --exp_id 274277 --class_id XY30 --this_uri 
      neb://ipp016.0/gpc1/20110106/o5567g0206o/o5567g0206o.ota30.fits --previous_uri neb://ipp016.0/gpc1/20110106/o5567g0205o/o5567g0205o.ota30.fits 
      --dbname gpc1 --verbose
job exit status: 3
job host: ipp012
job dtime: 0.432504
job exit date: Thu Jan  6 08:09:55 2011
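The "Can't find regtool" failure above boils down to a missing-executable check against the configured psconfig bin directory. A minimal sketch of that check, assuming the directory path from the log message (the check itself is illustrative, not the actual IPP wrapper code):

```python
# Hedged sketch: verify the tools the registration wrapper needs are present
# and executable before dispatching jobs. The directory comes from the log
# message above; the check is an illustrative assumption.
import os

tooldir = "/home/panstarrs/ipp/psconfig/ipp-20101215.lin64/bin"
required = ["regtool"]
missing = [t for t in required if not os.access(os.path.join(tooldir, t), os.X_OK)]
if missing:
    print("Can't find required tools:", ", ".join(missing))
```

A check like this at pantasks startup would surface a broken install path once, instead of as a per-job config error.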
  • (bills) 10:01 There are 816 magicRuns to process. I turned off magic reverts to look for repeating failures.
  • (serge) 10:54 Stopped summitcopy to help registration finish last night's data
  • (serge) 11:08 removed ippc00 host from stdscience
  • (from bill) To free up the cluster a bit I've turned off processing of ThreePi? data for now. The command is labeltool -updatelabel -set_inactive -label ThreePi?.nightlyscience
  • (serge) 12:31 cleanup temporarily stopped. ipp009 looks mad (umount.nfs uses 100% of a cpu?!). Let's wait a bit before rebooting it.
  • (serge) 13:28 Gene stopped stdscience and rebooted ipp009. Once it was back I restarted stdscience
  • (serge) 15:05 MSS data published to MOPS ds. I reactivated 3pi processing: 'labeltool -updatelabel -label ThreePi?.nightlyscience -dbname gpc1 -set_active'. I restarted cleanup and summitcopy. All reverts are set to on in stdscience. All services (but addstar, detrendm and replication) are running.
  • (serge) 16:16 Restarted distribution
  • (from bill) 16:52 Since the stacks for last night's data hadn't been queued I ran the following by hand: --queue_stacks --date 2011-01-06
  • (from Gene) 21:09 removed ThreePi?.nightlyscience from stdscience label list

Friday : 2011.01.07

serge is czar

  • (bills) 05:00 1 file repeatedly failing summit copy and 1 repeatedly failing registration. The problem is that nebulous instances for the image file (copy) and registration log file have been created on ipp025, but that node has been taken out of nebulous. ganglia says that ipp025 has a high load. I can log into it. On ippdb00 I tried force.umount and it successfully unmounted ipp025, but the remount step never finished. To unstick registration and summit copy I used neb-mv to move the inaccessible instances out of the way.
  • (bills) 05:15 summit copy is finished but registration/burntool has 134 unfinished exposures. It seems to be slowly making progress burntooling files, but no files are finishing registration. The burntool jobs are for chips from exposures after o5568g0183o (the one that faulted), so perhaps something is wrong. One chip (XY35) from o5568g0183o has burntool_state == -1. That is probably blocking things from proceeding. I don't see a regtool mode to fix this, so I edited the database and set burntool_state back to zero. That didn't seem to help, though.
  • (bills) 05:55 We needed data_state changed from check_burntool to pending_burntool. Now we're moving along. Sounds like we may need a revertburntool mode.
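The manual database fix described in the two entries above (clear the faulted chip's burntool_state and send its data_state back to pending) can be sketched with an in-memory sqlite3 table. The real database is the MySQL gpc1 schema; the table and column names (rawImfile, burntool_state, data_state) follow the log, and everything else is an assumption for illustration:

```python
# Hedged sketch of the manual "revert burntool" fix, not the real gpc1 schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE rawImfile (exp_name TEXT, class_id TEXT, "
           "burntool_state INTEGER, data_state TEXT)")
# The stuck chip from the log: burntool_state -1, wedged in check_burntool.
db.execute("INSERT INTO rawImfile VALUES "
           "('o5568g0183o', 'XY35', -1, 'check_burntool')")

# Clear the error state AND reset data_state so burntool picks it up again;
# resetting burntool_state alone was not enough.
db.execute("UPDATE rawImfile SET burntool_state = 0, "
           "data_state = 'pending_burntool' "
           "WHERE exp_name = 'o5568g0183o' AND class_id = 'XY35'")

row = db.execute("SELECT burntool_state, data_state FROM rawImfile").fetchone()
print(row)
```

A regtool "revertburntool" mode would presumably wrap exactly this pair of column updates.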
  • (bills) 06:05 Stopped registration to make rawImfile.burntool_state a key.
  • (bills) 06:30 power cycled ipp025. There was a panic message on the console. See
  • (bills) 06:39 reverted warp, diff, and dist faults (probably caused by ipp025)
  • (bills) 06:40 added ThreePi?.nightlyscience back into stdscience label list. (Gene removed it last night)
  • (bills) 07:00 registration/burntool has finished
  • (bills) 07:11 Someone submitted a postage stamp by coordinate request through the web interface for a point in M31. CFA has some requests that are blocked by that so I've lowered the priority for the postage stamp label WEB and gpc1 label ps_ud_WEB to let the cfa requests through.
  • (serge) 08:34,, to speed up MD
  • (serge) 08:39 chip.on, warp.on, stack.on
  • (serge) 09:58 I had to queue the last 3 MD02 exposures for publishing by hand (pubtool -definerun -client_id 1 -label MD02.nightlyscience -dbname gpc1). The weird part is that I had to enter the command three times: the first call only queued the first missing exposure (o5568g0132o MD02 z N5568 MD02 center), and successive calls to (pubtool -definerun -client_id 1 -label MD02.nightlyscience -dbname gpc1) didn't add the remaining exposures. I had to wait for the first missing exposure to be published before I could add the second one (same for the third exposure)
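The workaround above amounts to looping the queuing step until it stops adding runs. A sketch, where queue_next_missing() is a hypothetical stand-in for `pubtool -definerun -client_id 1 -label MD02.nightlyscience -dbname gpc1`, which that day only queued one missing exposure per call:

```python
# Hedged sketch: drain the queue by re-invoking the queuing step until it
# adds nothing new. queue_next_missing() stands in for pubtool -definerun.
def drain(queue_next_missing):
    queued = []
    while True:
        added = queue_next_missing()
        if not added:          # nothing new queued -> all exposures are in
            break
        queued.extend(added)
    return queued

# Fake pubtool that queues one of the three missing exposures per call,
# mimicking the behavior observed in the log.
missing = ["o5568g0132o", "o5568g0133o", "o5568g0134o"]
fake = lambda: [missing.pop(0)] if missing else []
print(drain(fake))  # all three queued after three calls
```

Note this sketch ignores the publish-before-next-queue dependency seen that day; it only captures the "repeat until no change" shape of the workaround.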
  • (from bill) 12:36 Rather than try to figure out what's wrong with nightly science (I think it's confused about the states), I ran --queue_stacks --date 2011-01-07 --dbname gpc1
  • (bills) 12:41 set all pantasks to stop in preparation for doing Build install in ippScripts of the production build. Then did the build, then restarted. These changes should prevent registration from getting stuck due to bad data_state/burntool_state.
  • (bills) 13:00 Changed Label priorities back for requests that come in from the WEB form. The M31 request still has 562 jobs to go but other requests have started to come in from the form so I want to let them have a chance. I think that I need to get trickier with the labels. Maybe use WEB.HOG if a request generates too many jobs.
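The labeling idea in the entry above can be sketched as a simple threshold rule: requests whose job count exceeds some limit get demoted to a lower-priority "hog" label so small WEB requests aren't starved. The WEB/WEB.HOG names come from the log; the threshold value is an assumption:

```python
# Hedged sketch of the WEB.HOG idea: demote oversized postage stamp requests.
# Threshold is an illustrative assumption, not a production value.
def assign_label(n_jobs, threshold=500):
    return "WEB.HOG" if n_jobs > threshold else "WEB"

print(assign_label(562))   # the M31 request (562 jobs) would be demoted
print(assign_label(12))    # small form requests keep normal priority
```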
  • (bills) 18:47 in distribution pantasks ran default.hosts macro to double the number of hosts working on those jobs. The idea being that magic has been falling behind...
  • (bills) 20:30 No science data will be taken tonight. However, they are taking Laser exposures, many hundreds of them at a high rate. I sent email inquiring about it.

Saturday : 2011.01.08

  • (bills) 08:00 Science processing is almost caught up. There are a few distRuns still pending and 1 magicRun and 1 magicDSRun still outstanding. This is likely because ipp016 has been down for over 7 hours. I'll reboot it.
  • (bills) 08:40 One diffRun was stuck because the inputs were cleaned up before they finished. Set it to goto_cleaned.
  • Another finished after reverting.
  • The magic run was stuck due to a corrupt file. I reran --diff_id 100011 --skycell_id skycell.2223.129.
  • Finally, the destreak run fails with the error below. Since it has failed 867 times, it's time to investigate.
 -> setExciseValue (streaksio.c:1253): Known programming error
     unexpected image type found: 1032

This is caused by the weight image being uncompressed to an F64 image. streaksremove is not ready for this image type. I've changed the state of the run so I can debug it next week.
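The failure mode above is a dispatch-on-pixel-type with no branch for 64-bit floats. An illustrative sketch (not the real streaksremove/psLib code; type names and handlers are assumptions):

```python
# Hedged sketch of the setExciseValue failure mode: the excise step picks a
# handler by pixel type and errors out on types it has no branch for, such
# as an F64 weight image produced by decompression.
def set_excise_value(pixels, ptype):
    handlers = {
        "F32": lambda p: [0.0 if v is None else v for v in p],
        "I32": lambda p: [0 if v is None else v for v in p],
    }
    if ptype not in handlers:
        # Mirrors the "unexpected image type found" programming error.
        raise TypeError("unexpected image type found: %s" % ptype)
    return handlers[ptype](pixels)
```

The fix is either to add an F64 branch or to keep the weight image in a supported type after decompression.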
  • queued STS data from August for processing
  • (bills) 16:52 ipp016 crashed. that's twice today. power cycled it.

Sunday : 2011.01.09

Bill is acting czar this evening.

  • 18:00 ipp016 has been down for several hours. The console said something about a 'spinlock bug'; power cycled it.
  • 18:43 2 MD magic runs were not running. There were entries in the book magicToTree but no jobs running. magic.reset unstuck them.
  • 18:46 queued nightly stacks for 2011-01-09
  • 20:08 just noticed that stdscience was not running. Restarted it. Registration has almost vanquished the Laser onslaught.
  • 20:44 A couple of the STS exposures have filter = 'Not available' Set them to drop. chip_ids: 177137 177242
  • 20:45 executed
pantasks: 2011-01-09
pantasks: 2011-01-10

Now the state is

  • 21:17 Decided to lower the priority of STS.20101202 below ThreePi?.nightlyscience
  • 23:23 fault counts too high in distribution. Stopped it. Changed all instances of ipp016 in ipphosts.mhpcc to ipp006. Restarted distribution
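The host swap above is a straight substitution over the distribution host list. A sketch, where the ipphosts.mhpcc filename comes from the log but the file contents here are invented for illustration:

```python
# Hedged sketch of the ipp016 -> ipp006 swap in ipphosts.mhpcc.
# The entries below are made-up examples, not the real host list.
config = """\
host ipp016 distribution
host ipp016 magic
host ipp023 distribution
"""
patched = config.replace("ipp016", "ipp006")
print(patched)
```

In practice the edit would be done on the file itself (and distribution restarted afterward, as the log notes).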