Monday : 2012-07-30

  • 01:24 CZW: Registration was stalled as per Mark's email. My resolution was to dump the previous exposure input table, and move on without it. This sometimes causes segfaults for unknown reasons. As this is an early exposure, I don't think the loss of the input table is that significant. It looks like it's unstuck things, so we should catch back up soon.
    /home/panstarrs/ipp/psconfig/ipp-20120626.lin64/bin/funpack -S /data/ipp058.0/nebulous/ef/54/2283421506.gpc1:20120730:o6138g0063o:o6138g0063o.ota14.fits > /tmp/burntool.502313.XY14.26Z7.fits
    /home/panstarrs/ipp/psconfig/ipp-20120626.lin64/bin/burntool /tmp/burntool.502313.XY14.26Z7.fits  out=/data/ipp058.0/nebulous/7c/b1/2283600282.gpc1:20120730:o6138g0063o:o6138g0063o.ota14.burn.tbl tableonly=t persist=t
    /home/panstarrs/ipp/psconfig/ipp-20120626.lin64/bin/regtool -dbname gpc1 -updateprocessedimfile -exp_id 502313 -class_id XY14 -burntool_state -14 -set_state full
    
  • 08:59 Serge: fixed 0-size/lost instances gpc1/20100518/o5334g0071o/o5334g0071o.ota14.burn.tbl (LAP)
  • 9:00 onwards (HAF): more unsticking of exposures:
    1. find the stuck exposure
    2. funpack it
    3. redo burntool without history
    4. copy old table out of the way and insert new
    5. redo regtool
    6. repeat as necessary
    
    example:
    
    funpack -S `neb-locate --path gpc1/20120730/o6138g0175o/o6138g0175o.ota14.fits` > o6138g0175o.ota14.fits
    burntool o6138g0175o.ota14.fits out=o6138g0175o.ota14.burn.tbl tableonly=t persist=t
    neb-mv gpc1/20120730/o6138g0175o/o6138g0175o.ota14.burn.tbl gpc1/20120730/o6138g0175o/o6138g0175o.ota14.burn.tbl.bak
    neb-insert gpc1/20120730/o6138g0175o/o6138g0175o.ota14.burn.tbl o6138g0175o.ota14.burn.tbl --copies=2
    regtool -updateprocessedimfile -class_id XY14 -burntool_state -14 -set_state full -exp_id 502425 -dbname gpc1
    regtool -updateprocessedimfile -class_id XY14 -fault 0 -exp_id 502425 -dbname gpc1
    
    This was done for 3 stuck exposures.
    
    I hate XY14... 
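
    The unsticking steps above can be sketched as a small dry-run helper. This is only a sketch: the function name unstick_cmds and its argument order are made up here and are not part of IPP. It prints the same commands as the worked example, using only the tools and flags shown there, so they can be reviewed and pasted by hand. It assumes the gpc1/NIGHT/EXPOSURE/EXPOSURE.otaNN naming seen in the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of IPP): print the unsticking commands for one
# stuck imfile so they can be reviewed before being run by hand.
# Arguments: night (YYYYMMDD), exposure name, OTA number, class_id, exp_id.
unstick_cmds() {
    night=$1; exp_name=$2; ota=$3; class_id=$4; exp_id=$5
    base="${exp_name}.ota${ota}"
    neb="gpc1/${night}/${exp_name}/${base}"

    # Steps 2-3: unpack the raw image, rebuild the burn table without history.
    printf '%s\n' "funpack -S \`neb-locate --path ${neb}.fits\` > ${base}.fits"
    printf '%s\n' "burntool ${base}.fits out=${base}.burn.tbl tableonly=t persist=t"
    # Step 4: move the old table out of the way and insert the new one.
    printf '%s\n' "neb-mv ${neb}.burn.tbl ${neb}.burn.tbl.bak"
    printf '%s\n' "neb-insert ${neb}.burn.tbl ${base}.burn.tbl --copies=2"
    # Step 5: mark burntool as done and clear the fault.
    printf '%s\n' "regtool -updateprocessedimfile -class_id ${class_id} -burntool_state -14 -set_state full -exp_id ${exp_id} -dbname gpc1"
    printf '%s\n' "regtool -updateprocessedimfile -class_id ${class_id} -fault 0 -exp_id ${exp_id} -dbname gpc1"
}

# Reproduces the example above for exposure 502425, XY14.
unstick_cmds 20120730 o6138g0175o 14 XY14 502425
```

    Printing rather than executing is deliberate: each step touches nebulous or the gpc1 database, so a human should check the output before running anything.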
    
  • 13:26 (haf): restarted stdscience
  • 15:55 Serge: Changed chip.pro so that all labels but LAP.ThreePi.20120706 are reverted now.
  • 16:25 Serge: Restarted stdscience after I miserably crashed it
  • 16:33 MEH: tweaking stdscience to run stackstack diffs from last night (server input tweak_ssdiff)

Tuesday : 2012-07-31

  • 9:14 (haf): bad weather last night...
  • 10:17 (Serge): Recovered two missing otas: gpc1/20101110/o5510g0331o/o5510g0331o.ota26.fits and gpc1/20101108/o5508g0412o/o5508g0412o.ota26.fits. Note: To allow (manual) replication to atrc, I had to set ippb00 as up for 10 seconds or so.
  • 11:40 (Serge): Restarted (pantasks) replication. Set ippb0[01].[12] disks as up e.g. neb-host --volume ippb01.2 up
  • 15:00 (Serge): Stopped (mysql) replication on ippdb03
  • 15:23 CZW: Restarted stdscience to kick the rate back up.
  • 23:10 MEH: would really prefer not to have a tag change in the middle of a deepstack set... taking compute3 out of processing to just do deepstacks (2x left in stdscience should be ok).

Wednesday : 2012-08-01

  • 10:28 (Serge): Started rsync of ippdb02:/export/ippdb02.0/mysql to ippc63 (screen session "mysql_rsync" as root)
  • 13:55 CZW: Restarted stdscience to kick the rate back up.

Thursday : 2012-08-02

Bill is czar today

  • 00:14 MEH: deep stacks finished, returning compute3 to stdscience (6x) and stack (1x)
  • 09:40 Bill: registration got stuck with failed burntool on 502898 XY14. Followed Heather's most excellent instructions on how to fix it, and burntool is now proceeding... with yesterday's failures.
  • 10:10 Today's problem was actually a summit copy job that crashed leaving the entry in the book. Restarted summit copy and registration pantasks and now we're off and running
  • 10:30 stdscience is sluggish, needs daily restart.
  • 10:35 ipp066 down, ipp066-20120802-crash.log
  • 15:34 Removed ecliptic.rp label from stdscience and started to shut down the pantasks in preparation for migration to new tag.
  • 15:30 Stopped everything in order to switch to the new tag
  • 15:45 restarted pantasks. Slowly setting them to run.
  • 16:00 all systems go
  • 16:15 MEH: stealing the 1x compute3 from stack for deepstack

Friday : 2012-08-03

Mark is czar

  • 00:10 MEH: ah surprise, nightly science stuck again... will try to unstick while waiting for MD staticsky to finish. looks like a dsget (and then another after restart) got stalled on ipp032. killing off and running manually on ipp032 was fine. restarted summitcopy and registration, all moving again.
      0    ipp032    BUSY  16374.75 0.0.0.56  0 summit_copy.pl --uri http://conductor.ifa.hawaii.edu/ds/gpc1/o6142g0002o/o6142g0002o51.fits --filename neb://ipp032.0/gpc1/20120803/o6142g0002o/o6142g0002o.ota51.fits --summit_id 499208 --exp_name o6142g0002o --inst gpc1 --telescope ps1 --class chip --class_id ota51 --bytes 49432320 --md5 62fde464ca63680d8086048c1142ef5e --dbname gpc1 --timeout 600 --verbose --copies 2 --compress --nebulous 
    
  • 01:00 also some isp stuck/stalled on ipp032 and had to kill after hanging from restart
      0    ipp032    BUSY  15360.01 0.0.0.18e  0 register_imfile.pl --exp_id 722463 --tmp_class_id chip01 --tmp_exp_name o6142i0011o02 --uri neb://any/isp/20120803/o6142i0011o02/o6142i0011o02.chip01.fits --logfile neb://any/isp/20120803/o6142i0011o02.722463/o6142i0011o02.722463.reg.chip01.log --bytes 8602560 --md5sum f671ea09ff563b69c3dd6a98b3cf07dc --sunset 03:30:00 --sunrise 17:30:00 --summit_dateobs 2012-08-03T05:54:56.000000 --dbname isp --verbose 
    
      0    ipp032    BUSY   1993.63 0.0.0.1b4  0 register_exp.pl --exp_id 722901 --exp_tag o6142i0183o02.722901 --logfile neb://any/isp/20120803/neb://any/isp/201208/o6142i0183o02.722901/o6142i0183o02.722901.reg.log --label ISP-Apogee.201208 --dvodb /data/ipp004.0/isp/catdir.isp --end_stage camera --tess_id RINGS.V0 --dbname isp --verbose 
    
  • 01:29 ipp032 seems to be unhappy in registration, removing it for now.
  • 07:00 (Serge) Restarted replication on ippdb02. Backup not reactivated while it's late.
  • 07:20 MEH: another download stalled, on ipp032 no surprise. taking ipp032 out of summitcopy..
      0    ipp032    BUSY  22833.20 0.0.0.19d  0 summit_copy.pl --uri http://otis3.ifa.hawaii.edu/ds/skyprobe/o6142i0221o01/o6142i0221o01.fits --filename neb://any/isp/20120803/o6142i0221o01/o6142i0221o01.chip01.fits --summit_id 722999 --exp_name o6142i0221o01 --inst isp --telescope ps1 --class chip --class_id chip01 --bytes 8602560 --md5 b56febe12ecc44df14e9dee4bf9dd2d3 --dbname isp --timeout 600 --verbose --copies 2 --nebulous
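
    The stalled jobs above all show up as BUSY status lines with a large elapsed time. A minimal sketch of how such lines could be filtered, assuming they have been saved or piped in as text: long_running is a hypothetical helper, not an IPP or pantasks tool, and the column meanings are inferred from the pasted lines. The 600 s limit is illustrative only; it happens to match the --timeout 600 passed to summit_copy.pl.

```shell
#!/bin/sh
# Hypothetical helper (not an IPP or pantasks tool): read job status lines on
# stdin and print host, elapsed time, and script for BUSY jobs that have been
# running longer than LIMIT seconds. Column meanings are assumed from the
# lines pasted in this log: $2 = host, $3 = state, $4 = elapsed seconds,
# $7 = script name.
long_running() {
    awk -v limit="$1" '$3 == "BUSY" && ($4 + 0) > (limit + 0) { print $2, $4, $7 }'
}

# Three status lines from this log, truncated after the script name.
long_running 600 <<'EOF'
      0    ipp032    BUSY  16374.75 0.0.0.56  0 summit_copy.pl
      0    ipp032    BUSY   1993.63 0.0.0.1b4  0 register_exp.pl
     11    ipp055    BUSY    245.32 0.0.0.c8ea  0 chip_imfile.pl
EOF
```

    With these inputs the two ipp032 jobs are listed and the 245 s chip_imfile.pl job is not.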
    
  • 07:30 stdscience is sluggish, restarting while download still catching up. taking ipp032 out of processing there as well.
  • 08:50 nightly science downloaded and mostly finished.
  • 09:00 STS warps having trouble with corrupted chip weight file
    -> p_psFitsError (psFits.c:74): I/O error
         Reading FITS file /data/ipp036.0/nebulous/e4/9d/2306253777.gpc1:STS.nt:2012:08:03:o6142g0222o.503762:o6142g0222o.503762.ch.526083.XY55.ch.wt.fits failed.
     -> p_psFitsError (psFits.c:78): I/O error
         [CFITSIO error: decompression error: hit end of compressed byte stream]
     -> hduRead (pmHDU.c:154): I/O error
    
    perl ~ipp/src/ipp-20120802/tools/runchipimfile.pl --chip_id 526083 --class_id XY55 --redirect-output
    
  • 09:05 (Serge) Fixed gpc1/20100805/o5413g0107o/o5413g0107o.ota35.fits (involved in ecliptic.rp)
  • 09:20 (Serge) ippc63 now replicates ippdb00
  • 10:50 MEH: multiple mounts are stalled on ipp032, rebooting
  • 11:40 (Serge): Running test for new mops trails fitting. Source in ~ipp/sch/ipp-20120802 based on current tag with update from ppTranslate trunk. Binaries in ~ipp/psconfig/ipp-sch.lin64. Label in publishing: ThreePi.TrailFitting.MopsTest.01. Ran pubtool -dbname gpc1 -definerun -data_group ThreePi.20120803 -set_label ThreePi.TrailFitting.MopsTest.01 -client_id 12
  • 14:00 Cindy needs to do checks on ipp016, will need to take out of processing and nebulous (put to repair a few hours earlier to keep stuff from going there)
  • 14:30 (Serge): End of tests for the new mops format. Published data with labels ThreePi.TrailFitting.MopsTest.0[1-3] have been generated.
  • 14:50 ipp016 back into processing+nebulous
  • 15:30 Bill stopped processing briefly to build a bug fix into psModules. The bug caused psphotStack to segv when performing the Extended Source Fits
  • 19:35 MEH: doing the daily restarting of stdscience
  • 23:07 (Serge) ippc63 is still 179962 seconds behind ippdb00.
  • 23:30 MEH: MD01.refstack.20120803 and rerun of MD10.refstack.20120804 will be using deepstack all weekend.

Saturday : 2012-08-04

  • 08:30 Bill: Registration stuck again. Stuck process on ipp042, bad burntool state. Killed job running since 2 am. Restarted pantasks. Ran:
    regtool -updateprocessedimfile -set_state pending_burntool -exp_id 504370 -class_id XY63
  • 09:30 more carnage. ipp042 stuck on summit copy for file on ipp018. force.umount didn't work. Rebooted ipp042 and restarted summit copy, but I couldn't get pantasks_client to communicate with it. Moved pantasks_server from ipp052 to ipp051.
  • 12:30 MEH: nightly science still processing but would be helped by the daily restart of stdscience, so doing
  • 16:30 noticed new LAP label LAP.ThreePi.20120706 not in stack pantasks, adding.
  • 16:40 nightly science mostly finished now. been watching remaining 3PI imfile XY06 overuse RAM, killed multiple times when RAM overuse got significantly large (ipp055,ippc41, others), originally noticed running >3ks
     11    ipp055    BUSY    245.32 0.0.0.c8ea  0 chip_imfile.pl --threads @MAX_THREADS@ --exp_id 504730 --chip_id 530582 --chip_imfile_id 31660084 --class_id XY06 --uri neb://ipp055.0/gpc1/20120804/o6143g0690o/o6143g0690o.ota06.fits --camera GPC1 --run-state new --deburned 0 --outroot neb://ipp055.0/gpc1/ThreePi.nt/2012/08/04//o6143g0690o.504730/o6143g0690o.504730.ch.530582 --redirect-output --dbname gpc1 --verbose 
    
    -- stalling at this point in log gpc1/ThreePi.nt/2012/08/04/o6143g0690o.504730/o6143g0690o.504730.ch.530582.XY06.log
    ...
        50343 sources, 632 moments, 0 faint, 0 failed: 21.498576 sec
          --- psphot Rough Class ---
            With 0 stars, using 1 x 1 grid for PSF clump
          psf clump  X,  Y: 162.341919, 158.947601 : DX, DY: 32.702652, 33.536469 : loaded from metadata
          Rough classifications: 47943 72 1762 0 0 6
          SN range (moments): 0.181953 - 50.430634
          SN range (peaks)  : 5.000071 - 24787.253906 (47943)
        rough classification: 0.032786 sec
          replaced models for 425 objects: 0.536426 sec
        built models for 50768 objects: 115.123376 sec
          --- psphot Fit Source (Linear) ---
              covariance factor: 1.000000
              built fitSources: 36.658863 sec (50768 objects)
    
    -- setting quality to move exposure forward
    chiptool -dbname gpc1 -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 530582 -class_id XY06
    
  • 19:40 like before, transferring compute3 from stack+stdscience (leaving 2x) to +1x deepstack for MD01+MD10redo

Sunday : 2012-08-05

  • 02:00 MEH: registration/burntool hung up again? restarted registration+summitcopy, picked back up ~02:20 but unclear if restart helped. stdscience slow on LAP waiting for nightly science, restarted for daily 12-18hr reset.
  • 07:30 (Serge): ippc63 is still 139699 seconds behind master.
  • 14:00 MEH: returning extra compute3 from deepstack back to stack+stdscience
  • 18:20 restarting stdscience again, killed another ppImage overusing RAM
      0    ipp017    BUSY   1613.50 0.0.4.ff93  0 chip_imfile.pl --threads @MAX_THREADS@ --exp_id 458160 --chip_id 529295 --chip_imfile_id 31582905 --class_id XY60 --uri neb://ipp039.0/gpc1/20120224/o5981g0959o/o5981g0959o.ota60.fits --camera GPC1 --run-state new --deburned 0 --outroot neb://ipp039.0/gpc1/LAP.ThreePi.20120706/2012/08/04/o5981g0959o.458160/o5981g0959o.458160.ch.529295 --redirect-output --reduction LAP_SCIENCE --dbname gpc1 --verbose 
    
  • 21:21 MEH: ipp055 gone nuts. rebooted ipp055-20120805-crash.log
  • 22:00 registration got out of sorts again, restarting seemed to fix..
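
    The two ippc63 lag readings logged this week (179962 at 23:07 Friday, 139699 at 07:30 Sunday) allow a rough catch-up estimate. The sketch below assumes both figures are in seconds and that the lag keeps shrinking at the same average rate, which may not hold if the load on ippdb00 changes. catchup_eta is a hypothetical helper written for this note.

```shell
#!/bin/sh
# Hypothetical helper: estimate the seconds left until a replica catches up,
# given two lag readings (seconds behind master) and the wall-clock seconds
# between them. Assumes the lag keeps shrinking at the same average rate.
catchup_eta() {
    lag_then=$1; lag_now=$2; elapsed=$3
    gained=$((lag_then - lag_now))
    if [ "$gained" -le 0 ]; then
        echo "not catching up" >&2
        return 1
    fi
    # remaining lag divided by (lag gained per wall-clock second)
    echo $((lag_now * elapsed / gained))
}

# 23:07 Friday to 07:30 Sunday is 32 h 23 min = 116580 s.
catchup_eta 179962 139699 116580
```

    With the logged figures this prints 404493, roughly 112 more hours at the observed rate of about 0.35 s of lag recovered per second.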