PS1 IPP Czar Logs for the week 2011.11.07 - 2011.11.13


Monday : 2011.11.07

Mark is czar

  • 09:00 ipp036 removed from nebulous so Cindy can powerdown and replace CPU+heatsink
  • 09:30 remaining night data hitting 503 errors in summitcopy (c5872g0011o--c5872g0014o); finished at 10:00
  • 11:30 MD07-y nightly stack sample running in normal stack pantasks to verify skycell coverage with the new center shift for exposures since 4/2009.
  • 13:15 ipp036 back up, putting back into nebulous fully and slowly adding back into processing
  • 13:30 ipp053 slowly being put back into processing; it was taken out Friday for the weekend due to mount issues and many tasks wanting to run on it.
  • 13:45 ipp036 export not up, ran exportfs -f
  • 14:00 ippc13 slowly being added back to processing, since ipp036 and ipp053 are being watched too. It was taken out of processing because of a second crash within an hour over the weekend.
  • 14:45 ippc10 also back online, adding back in for processing
  • 15:30 running MD07.reftest.20111107 sample of different number of input warps in skycell.055 for all filters with the deepstack pantasks (compute2 group).
  • 16:30 cleaning LAP -- this needs to be done more often, now have 1500 stacks to run.
    -- troubled update fault=2, find update log neb://ipp020.0/gpc1/LAP.ThreePi.20110809/2011/10/09/o5463g0339o.229905/o5463g0339o.229905.ch.317278.XY31.log.update and problem with neb://ipp044.0/gpc1/20100924/o5463g0339o/o5463g0339o.ota31.fits
          0                     NON-EXISTANT file:///data/ippb02.1/nebulous/10/c3/967714255.gpc1:20100924:o5463g0339o:o5463g0339o.ota31.fits
          0                     NON-EXISTANT file:///data/ippb02.1/nebulous/10/c3/967715079.gpc1:20100924:o5463g0339o:o5463g0339o.ota31.fits
    
     chiptool -updateprocessedimfile -set_state full -chip_id 317278 -class_id XY31 -dbname gpc1
     chiptool -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 317278 -class_id XY31 -dbname gpc1
    
    -- more update trouble chip_id=339071, drop from multiple runs - 
    laptool -updateexp -lap_id 1705 -exp_id 229905 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1706 -exp_id 229905 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1709 -exp_id 229905 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1710 -exp_id 229905 -set_data_state drop -dbname gpc1
    
    -- warp (update?) trouble for exp_id=260145, drop from runs
    laptool -updateexp -lap_id 1699 -exp_id 260145 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1700 -exp_id 260145 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1703 -exp_id 260145 -set_data_state drop -dbname gpc1
    
    -- warp fault due to chip file problem
    Reading FITS file /data/ipp033.0/nebulous/a6/76/1544401387.gpc1:LAP.ThreePi.20110809:2011:11:07:o5523g0098o.260145:o5523g0098o.260145.ch.339377.XY52.ch.fits failed.
    
    perl ~ipp/src/ipp-20110622/tools/runchipimfile.pl --chip_id 339377 --class_id XY52 --redirect-output
    
  • 16:40 info note: running lap_science.pl --monitor_mode --lap_id 1730 reported
    QUALITY: 223457 has bad warp quality: 16 / 77
    STATUS: LAP_ID           1730
    STATUS: DEFINED_QSTACK:  0
    STATUS: HAVE_QSTACK:     0
    STATUS: DEFINED_FSTACK:  0
    STATUS: HAVE_FSTACK:     0
    STATUS: NEEDS_REMADE:    0
    STATUS: NEEDS PRIVATIZE: 0
    STATUS: ARE WARPED:      102
    STATUS: ARE MAGICKED:    50
    STATUS: CAN_DIFF:        101
    STATUS: HAVE_DIFF:       54
    STATUS: TOTAL_EXPOSURES: 113
    
    This means LAP is kicking out that exposure (currently done if >20% of warps have bad quality).
    
  • 17:30 ippc13 okay so far with the 4x stdscience, 2x stack so adding back distribution,pstamp,update,publish
  • 18:00 ippc13 went down, so it seems it can handle the stdscience+stack load but not distribution. Will try some combos to see what is stable. Chris also recommended maybe adding a compute_broken group in ~ipp/ippconfig/pantasks_hosts.input that could also include ippc11 (seems to handle 3 processes okay).
  • 19:00 ippc13 looks like it crashed when the 1x update started while running 2x stdscience, 2x stack. Drop down to 2x stdscience, 1x stack and group with ippc11 in hosts_poor_compute
  • 20:30 another LAP chip lost
    Nebulous key neb://ipp043.0/gpc1/20100619/o5366g0539o/o5366g0539o.ota25.fits is consistent.
          0                     NON-EXISTANT file:///data/ippb02.0/nebulous/4c/7e/893019408.gpc1:20100619:o5366g0539o:o5366g0539o.ota25.fits
          0                     NON-EXISTANT file:///data/ippb02.2/nebulous/4c/7e/893026715.gpc1:20100619:o5366g0539o:o5366g0539o.ota25.fits
    
  • 21:00 ippc11 down. It cannot run 1x stack + 2x stdscience, so removed from processing completely again. Rebooting, but not coming back up; leaving off.
  • 23:30 ippc11 powered up.
  • 23:40 LAP clearing all of Sunday's (11/6) runs now and moving forward.
  • MD07.refstack.20111106 z-band etc running under the deepstack pantasks. MD07.GR0 sample nightly stacks added to the main stack pantasks as they run quicker and at a lower priority that can be interrupted by LAP.
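
The 16:40 note says LAP drops an exposure when more than 20% of its warps have bad quality. As a rough check of that arithmetic, the sketch below parses the QUALITY line quoted above and applies the threshold. It assumes the leading number on that line identifies the exposure and that the comparison is a strict "greater than"; the function names are hypothetical and this is not the logic from lap_science.pl itself.

```python
import re

# Threshold taken from the log note (">20% warps have bad quality"); the exact
# comparison used inside lap_science.pl is an assumption here.
BAD_WARP_FRACTION_LIMIT = 0.20

# Pattern for the QUALITY line format quoted in the 16:40 entry.
QUALITY_RE = re.compile(
    r"QUALITY:\s+(\d+) has bad warp quality:\s+(\d+)\s*/\s*(\d+)"
)


def parse_quality_line(line):
    """Return (exposure, bad_warps, total_warps), or None if the line does not match."""
    match = QUALITY_RE.search(line)
    if match is None:
        return None
    exposure, bad, total = (int(group) for group in match.groups())
    return exposure, bad, total


def would_be_dropped(bad, total, limit=BAD_WARP_FRACTION_LIMIT):
    """True when the bad-warp fraction strictly exceeds the limit."""
    if total <= 0:
        return False
    return bad / total > limit


line = "QUALITY: 223457 has bad warp quality: 16 / 77"
exposure, bad, total = parse_quality_line(line)
print(exposure, round(bad / total, 3), would_be_dropped(bad, total))
# prints: 223457 0.208 True
```

With 16 bad warps out of 77 the fraction is about 20.8%, just over the limit, which matches the exposure being kicked out.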

Tuesday : 2011.11.08

  • 11:50 Mark: stare nodes have been quiet for a while, adding back into processing. distribution uses 4 nodes on each of the stare nodes, maybe 1-2 of these each could be added to stack since distribution seems to be keeping up.
  • 15:00 ipp038 raid back to allocate in nebulous. Bill pointed out ipp038 was in nebulous state repair; Mark had put it into repair 11/5 when the raid reported degraded with a disk issue. It finished rebuilding 11/6, just forgot to re-enable.

Wednesday : 2011.11.09

  • down most of the afternoon to test trunk for potential new tag
  • 16:42 Bill restarted pantasks
  • 16:47 fixed magic fault due to corrupt diff skyfile --diff_id 188156 --skycell_id skycell.0879.049
  • 16:48 fixed destreak fault due to corrupt chip file --chip_id 300733 --class_id XY65 Wait a second. That file must have been used for a warp. Why wasn't it corrupt then???

Thursday : 2011.11.10

Bill is czar today

  • 09:20 fixed a corrupt warp file warp_id 303159 --skycell_id skycell.2243.044
  • 09:25 dropped 3 LAP chip files that are repeatedly getting a SEGV: -chip_id 340549 -class_id XY67 -chip_id 340551 -class_id XY67 -chip_id 340552 -class_id XY67
  • 10:20 Hayden replaced ippdb02's RAID battery backup unit.
  • He also took ippc11 and ipp029 down for examination to see if there is some obvious reason that they have been crashing frequently.
  • 10:30 started using the new tag ipp-20111110
  • 11:00 ipp029 back online. Started up the other pantasks
  • 11:07 Fixed the first bug in the tag (mine in magic_destreak.pl)
  • 13:37 ippdb02 is back: BBU has been replaced and is fully charged (according to Gavin). Write cache is enabled
  • 13:37 Postage stamp server is broken. There is a memory management problem in Ohana in the new tag. Worked around it by changing dvoImagesAtCoords to leak a tiny bit of memory.
  • 16:30 Mark: runs in deepstack pantasks with old op tag finished. Shut down and restarted with new tag.
  • 16:35 new MD03 from last night with the new MD03.V3 refstacks looks good except for the WS diffims, the SS diffims look ok.

Friday : 2011.11.11

  • 15:20 kicking LAP
    -- oddly, this file is not getting repaired (time-out?); doing it manually to help trigger some stacks
          0                     NON-EXISTANT file:///data/ipp053.0/nebulous/9f/34/446156815.gpc1:20100914:o5453g0351o:o5453g0351o.ota76.fits
          1 f1f364cecdba94c22b5db609a79688f8 file:///data/ippb02.2/nebulous/9f/34/1128954824.gpc1:20100914:o5453g0351o:o5453g0351o.ota76.fits
    cp /data/ippb02.2/nebulous/9f/34/1128954824.gpc1:20100914:o5453g0351o:o5453g0351o.ota76.fits /data/ipp053.0/nebulous/9f/34/446156815.gpc1:20100914:o5453g0351o:o5453g0351o.ota76.fits
    
    -- strange stalled update behavior. update just not happening on misc chips (i.e. no quality issues)
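
The manual repair in the 15:20 entry (copying the surviving instance over the missing one) could be scripted roughly as below. This is a dry-run sketch only: it assumes the listing always has three whitespace-separated columns ending in a file:// URI, as in the two lines quoted above, and the helper names are hypothetical. It prints the cp command rather than running it.

```python
# Listing copied from the 15:20 entry. Columns are used as they appear in the
# log; their exact meaning is an assumption.
LISTING = """\
      0                     NON-EXISTANT file:///data/ipp053.0/nebulous/9f/34/446156815.gpc1:20100914:o5453g0351o:o5453g0351o.ota76.fits
      1 f1f364cecdba94c22b5db609a79688f8 file:///data/ippb02.2/nebulous/9f/34/1128954824.gpc1:20100914:o5453g0351o:o5453g0351o.ota76.fits
"""

FILE_PREFIX = "file://"


def parse_instances(listing):
    """Split a listing into (good_paths, missing_paths) of local file paths."""
    good, missing = [], []
    for raw in listing.splitlines():
        fields = raw.split()
        if len(fields) != 3 or not fields[2].startswith(FILE_PREFIX):
            continue  # skip anything that is not an instance line
        path = fields[2][len(FILE_PREFIX):]
        if fields[1] == "NON-EXISTANT":
            missing.append(path)
        else:
            good.append(path)
    return good, missing


def repair_commands(good, missing):
    """Build one cp argument list per missing instance, using the first good copy."""
    if not good:
        return []  # nothing to copy from, so no repair is possible this way
    return [["cp", good[0], path] for path in missing]


good, missing = parse_instances(LISTING)
for argv in repair_commands(good, missing):
    print(" ".join(argv))
```

For the listing above this prints the same cp command that was run by hand. When no good instance exists (as in the ippb02 cases on Monday) the function returns nothing, since there is no source to copy from.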
    

Saturday : 2011.11.12

  • 01:00 Mark: pantasks slowed. restarting. MD07 refstacks appear to be finished, returning the compute2 group to distribution for now, will be useful to switch to stack pantasks once staticsky is running.
  • 01:45 registration stuck, ran
    regtool -revertprocessedimfile -dbname gpc1 -exp_id 419175 -class_id ota74 -fault 3
    -- must have been a mount error? 27 more faulted with 3 
    

Sunday : 2011.11.13

  • 21:15 Mark: looks like ipp029 became unresponsive ~2.5 hrs ago, cycling power.
  • 23:45 ipp029 not happy, down again. removing from processing and restarting.