PS1 IPP Czar Logs for the week 2011.09.19 - 2011.09.25

Monday : 2011.09.19

  • 08:00: Roy: everything is down from the summit.
  • 08:30: Roy: ipp033 is back, so ran:
neb-host ipp033 up
  • 09:15: Bill: Postage stamp server is caught up. Set all data with labels ps_ud% to be cleaned.
  • 09:20: Bill: cleared all failed_revert faults.
  • 10:10: Bill: set lap priority to 410 to allow it to finish its destreak.
  • 11:00: Bill: set lap priority back to 200
  • 12:10 Chris/Mark: stdscience going slow; pcontrol showing large CPU use. Stopping pantasks for a restart.
  • 12:29: Roy: shut down stdscience and restarted it
  • 15:28: Bill fixed bug in ppMops that was causing some faults.
  • 16:06: Bill fixed bug in his STS queueing script. Ran it to queue diffs for Sept 5 which were blocked by the bug
  • 18:00: Mark removed the labels not being used tonight (MD03,04,05,06,07,08, PI, STD.nightlyscience) from stdscience, to test for processing hiccups.
  • 20:10 A few frames of MD08 were observed; adding it back in.

Tuesday : 2011.09.20

  • 08:35: Roy: registration stuck. regpeek revealed:
o5824g0388o  XY11 0 check_burntool neb://ipp008.0/gpc1/20110920/o5824g0388o/o5824g0388o.ota11.fits

Got exp_id using

SELECT DISTINCT exp_id FROM rawImfile WHERE exp_name = 'o5824g0388o';
| exp_id |
| 395112 | 

Then ran

regtool -updateprocessedimfile -exp_id 395112 -class_id XY11 -set_state pending_burntool -dbname gpc1
  • 08:40: Roy: Same as above, this time for exposure o5824g0398o
  • 08:42: Roy: And again, exposure o5824g0399o
  • 08:48 heather redid one of the failed burntool commands in the pantasks log.
  • 09:00 CZW: restarted the distribution server, as its pantasks was at 100% CPU and it was processing slowly.
  • 09:45 Bill: update server was down. Restarted it
  • 10:19 Started the magic_cleanup server (~bills/magic_cleanup) in order to recover disk space on ipp053.
  • 14:47 CZW: Stopped stdscience to restart it.
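The stuck-in-check_burntool revert from this morning (08:35, 08:40, 08:42) is a recurring two-step fix: look up the exp_id, then reset the imfile state. A dry-run sketch of scripting it; `revert_burntool` is a hypothetical helper, and the commented-out mysql batch invocation is an assumption about the usual environment (regtool, rawImfile, and gpc1 are as in the log):

```shell
# Dry-run sketch of the recurring check_burntool revert.
# revert_burntool is a hypothetical helper; here it only prints the command.
revert_burntool() {
    exp_name=$1; class_id=$2
    # In practice the exp_id would come from the database, e.g.:
    #   exp_id=$(mysql -N -B gpc1 -e \
    #     "SELECT DISTINCT exp_id FROM rawImfile WHERE exp_name = '$exp_name'")
    exp_id=$3  # supplied by hand for this dry run
    echo "regtool -updateprocessedimfile -exp_id $exp_id" \
         "-class_id $class_id -set_state pending_burntool -dbname gpc1"
}

cmd=$(revert_burntool o5824g0388o XY11 395112)
echo "$cmd"
```

Printing instead of executing lets the czar eyeball the generated regtool line before running it against gpc1.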

Wednesday : 2011-09-21

  • 01:20 CZW: Stopped distribution to restart it.
  • 18:20 CZW: Preparation for stare night:
    • Stopped stdscience, distribution, cleanup, stack pantasks to prevent any processing
    • Disabled stare node processing using script (~ipp/ off)
    • Disabled future stare node processing by editing ~ipp/ippconfig/pantasks_hosts.input
    • Disabled nebulous allocation to all but the most full hosts using script: ~ipp/stare/pre_stare_night

Thursday : 2011-09-22

  • 10:25 Stopped pantasks. Set nebulous state back to normal. Began restarting the pantasks.
  • 10:36 Fixed a warp with a missing psf file (--warp_id 260390 --skycell_id skycell.0783.030). Reverted the stack that wanted it.
  • 10:50 Oops: summit copy is only about half done with last night's data, so set nebulous back to pre_stare_night. Also, rerunning the warp didn't take: strangely, the process/cleanup/update cycle yields a bad-quality problem.
  • 18:15 CZW: Updated currently pending_burntool entries to avoid a large amount of useless burntool processing:
       select 'regtool -updateprocessedimfile -exp_id ',exp_id,'-class_id',class_id,'-set_state full -burntool_state -14' from rawImfile where dateobs > '2011-09-22T07:34:03.000000' AND dateobs < '2011-09-22T16:00:22.000000' AND data_state = 'pending_burntool';
  • 19:00 heather restarted summitcopy (so it had 30 hosts instead of 60). Earlier in the day someone had set it to 160 (!) connections to the camera's datastore, which caused a denial of service. More than 30 connections cause performance issues for the camera while it's taking data. Don't do that.
  • 22:30 CZW: ran post_stare_night script to re-enable all the hosts in nebulous. Restarted stdscience and distribution pantasks. Manually ran cleanup for 2011-09-22 and 2011-09-23 to ensure that we've cleaned everything that we should.
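The 18:15 SELECT above emits the pieces of a regtool command as query-output columns. A sketch of turning that output into something executable; the two input rows here are fake stand-ins (made-up exp_id/class_id values) for `mysql -N -B` batch output, which is an assumption about how the query was run:

```shell
# Sketch: join the tab-separated columns that the SELECT emits into runnable
# regtool commands, collecting them in a file for review before execution.
printf '395112\tXY11\n395113\tXY12\n' |
while IFS="$(printf '\t')" read -r exp_id class_id; do
    echo "regtool -updateprocessedimfile -exp_id $exp_id -class_id $class_id -set_state full -burntool_state -14"
done > /tmp/burntool_cmds

count=$(wc -l < /tmp/burntool_cmds)
# Inspect /tmp/burntool_cmds, then run it with: sh /tmp/burntool_cmds
```

Writing the commands to a file first keeps a record of exactly which imfiles were forced to `full`.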

Friday : 2011-09-23

  • 09:15 Serge: restarted publishing
  • 09:20: Serge: no exposures have been registered (they seem to have been copied, though). Shut down and restarted registration
  • 10:00 heather: registration stuck. Eventually figured it out:
    checkexp -s 2011-09-21
    reports that 1003 exposures were not registered (as opposed to the ~500 when running checkexp with no date); this is the stare data from two nights ago. The default report wasn't informative because it was looking at today. I hacked the script to report on yesterday (instead of today), and it revealed something stuck in check_burntool, which I reverted with the usual
    regtool -updateprocessedimfile -exp_id 397161 -class_id XY56 -set_state pending_burntool -dbname gpc1

Things are registering now.

  • 18:12 CZW: Restarting distribution and stdscience as they both have pcontrols that are using up large fractions of a processor.
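Rather than hacking checkexp's internal date (the 10:00 entry above), the date could be computed outside and passed in with -s. A dry-run sketch, assuming GNU date for the relative-date arithmetic (checkexp itself is the site tool from the log):

```shell
# Sketch: compute "yesterday" explicitly and hand it to checkexp -s,
# instead of patching the script. GNU date syntax assumed; dry run via echo.
day=2011-09-24                                   # normally: day=$(date -u +%Y-%m-%d)
yesterday=$(date -u -d "$day -1 day" +%Y-%m-%d)
echo "checkexp -s $yesterday"
```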

Saturday : 2011-09-24

  • 02:00 CZW: kicked registration using the commands supplied by the new script. Unfortunately, this strongly suggests that my edits didn't actually fix the periodic burntool/registration faults.
  • 11:00 Mark: another zero-sized OTA stalling LAP? Copied over the good version:
    -rw-rw-r-- 1 apache 0 Jun 16 22:31 /data/ipp021.0/nebulous/68/5d/1014122430.gpc1:20100818:o5426g0194o:o5426g0194o.ota42.fits
    -rw-rw-r-- 1 apache 23014080 Aug 17  2010 /data/ipp027.0/nebulous/68/5d/402728908.gpc1:20100818:o5426g0194o:o5426g0194o.ota42.fits
  • 12:40 Looks like a PSF missing from a warp run is faulting a stack?
    neb://ipp038.0/gpc1/LAP.ThreePi.20110809/2011/09/23/o5775g0359o.371609/o5775g0359o.371609.wrp.261019.skycell.0869.014.psf does not exist
    perl ~ipp/src/ipp-20110622/tools/ --warp_id 261019 --skycell_id skycell.0869.014 --redirect-output

Sunday : 2011-09-25

  • 01:40 Mark: more missing PSF for LAP stacks? Ran to try and fix
    PSF neb://ipp040.0/gpc1/LAP.ThreePi.20110809/2011/09/24/o5761g0298o.364937/o5761g0298o.364937.wrp.261675.skycell.0866.020.psf does not exist 
    PSF neb://ipp053.0/gpc1/LAP.ThreePi.20110809/2011/09/24/o5093g0047o.98554/o5093g0047o.98554.wrp.261659.skycell.0866.045.psf does not exist
    PSF neb://ipp030.0/gpc1/LAP.ThreePi.20110809/2011/09/24/o5396g0187o.193313/o5396g0187o.193313.wrp.261663.skycell.0866.077.psf does not exist
    perl ~ipp/src/ipp-20110622/tools/ --warp_id 261675 --skycell_id skycell.0866.020 --redirect-output
    perl ~ipp/src/ipp-20110622/tools/ --warp_id 261659 --skycell_id skycell.0866.045 --redirect-output
    perl ~ipp/src/ipp-20110622/tools/ --warp_id 261663 --skycell_id skycell.0866.077 --redirect-output
    Still fails; should have looked in the warp log first: "unable to determine PSF" means bad data quality, and not sure why it passes the fault/quality checks.
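The three near-identical re-runs above are a natural candidate for a small loop. A dry-run sketch; the tool's filename after tools/ is truncated in the log, so $WARPTOOL below is a placeholder, and the loop only echoes the commands it would run:

```shell
# Dry-run sketch: batch the per-skycell warp re-runs. WARPTOOL is a
# placeholder for the (truncated) tool name under ~ipp/src/ipp-20110622/tools/.
WARPTOOL='~ipp/src/ipp-20110622/tools/WARPTOOL'
cmds=$(while read -r warp_id skycell; do
    echo "perl $WARPTOOL --warp_id $warp_id --skycell_id $skycell --redirect-output"
done <<EOF
261675 skycell.0866.020
261659 skycell.0866.045
261663 skycell.0866.077
EOF
)
echo "$cmds"
```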