PS1 IPP Czar Logs for the week 2011.10.31 - 2011.11.06

Monday : 2011.10.31

  • 00:30 another LAP chip
    neb://ipp025.0/gpc1/20100522/o5338g0475o/o5338g0475o.ota76.fits
          0                     NON-EXISTANT file:///data/ipp053.0/nebulous/14/50/286984446.gpc1:20100522:o5338g0475o:o5338g0475o.ota76.fits
          1 c5d2ec4d851524456fb48a1bd770d6bf file:///data/ippb01.1/nebulous/14/50/1154222719.gpc1:20100522:o5338g0475o:o5338g0475o.ota76.fits
    
    cp /data/ippb01.1/nebulous/14/50/1154222719.gpc1:20100522:o5338g0475o:o5338g0475o.ota76.fits /data/ipp053.0/nebulous/14/50/286984446.gpc1:20100522:o5338g0475o:o5338g0475o.ota76.fits
    
    --> neb-repair should have been able to fix this case in the chip stage but didn't. If it can't, we need to know why.
    
  • Ingestion of nebulous on ippdb02 completed over the weekend (after about 9.3 days: Wed Oct 19 16:12:33 HST 2011 - Sat Oct 29 00:09:27 HST 2011). Replication started:
    CHANGE MASTER TO MASTER_HOST='ippdb00', MASTER_USER='repl_neb', MASTER_PASSWORD='xxx', MASTER_LOG_FILE='mysqld-bin.003352', MASTER_LOG_POS=277806591;
    

The slave is 1011144 seconds behind the master.

  • 11:15 Mark: returning the compute2 group from deepstack back to distribution for now.
  • 14:17 Bill: dropped two chips whose files have been lost
    regtool -updateprocessedimfile -set_ignored -exp_id 232263 -class_id XY15
    regtool -updateprocessedimfile -set_ignored -exp_id 231044 -class_id XY16
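
For scale, the nebulous ingestion window and the replication lag reported above can be converted to days and hours. A minimal Python sketch, using only the two timestamps and the seconds value quoted in Monday's entries; the helper name `fmt` is made up for illustration and is not part of any IPP tool:

```python
from datetime import datetime, timedelta

# Timestamps copied from the ingestion entry (both HST, so no timezone math is needed).
start = datetime(2011, 10, 19, 16, 12, 33)
end = datetime(2011, 10, 29, 0, 9, 27)
ingest = end - start

# Lag value copied from the replication status line.
lag = timedelta(seconds=1011144)


def fmt(delta):
    """Format a timedelta as 'Dd HHh MMm SSs' (hypothetical helper)."""
    hours, rem = divmod(delta.seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{delta.days}d {hours:02d}h {minutes:02d}m {seconds:02d}s"


print("ingestion took:", fmt(ingest))  # 9d 07h 56m 54s
print("slave lag:     ", fmt(lag))     # 11d 16h 52m 24s
```

The lag is close to the time elapsed since the ingestion started, which would fit a binlog position dating from the start of the dump, though the log does not state that.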

Tuesday : 2011.11.01

Heather is czar today

  • 10:31 Bill had Gavin change the automount parameters to use hard instead of soft mounts. Pantasks restarted.
  • ~12:45 found two more lost files
    regtool -updateprocessedimfile -set_ignored -exp_id 187628 -class_id XY50
    regtool -updateprocessedimfile -set_ignored -exp_id 187601 -class_id XY50
  • chip 335562 XY67 has been faulting repeatedly. To debug, Bill ran it by hand and it succeeded. Strange.

Wednesday : 2011.11.02

  • Bill queued 20 i-band M31 exposures for chip-through-warp processing, to demonstrate how magic makes M31.2011 data unusable.

Thursday : 2011.11.03

  • 10:00 Mark adding the condor_MD07.V3_01 label to pantasks to finish the chip-warp processing for refstacks.
  • 10:10 stdscience seems to be lagging, restarting. LAP gets held up on ipp053 jobs. Moving condor_MD07.V3_01 priority above LAP. MD07 is also held up by issues preventing some jobs from completing, but is still processing other MD07 jobs.

Friday : 2011.11.04

Mark czar

  • 01:05 ippc13 down, cycling power. no info on console. restarted update pantasks.
  • 01:30 new data coming in but stalled. Restarted registration and summitcopy. Cause was a hung mount on ipp053 to ippb01, cleared with
    /usr/local/sbin/force.umount ippb01
    
  • 07:50 ippc09 mount of ippb02 stuck.
  • 08:00 shutdown of stdscience pantasks dumped backtrace to ippc16 stdout
    ipp@ippc16:/home/panstarrs/ipp>*** glibc detected *** pcontrol: double free or corruption (!prev): 0x00000000006d7730 ***
    ======= Backtrace: =========
    /lib/libc.so.6[0x7fc41c95bb88]
    /lib/libc.so.6(cfree+0x76)[0x7fc41c95d746]
    pcontrol[0x406571]
    pcontrol[0x406b68]
    pcontrol[0x406bac]
    pcontrol[0x403b6e]
    /home/panstarrs/ipp/psconfig//ipp-20110622.lin64/lib/libbasiccmd.so(quit+0x18)[0x7fc41e69fba8]
    /home/panstarrs/ipp/psconfig//ipp-20110622.lin64/lib/libshell.so(command+0x22c)[0x7fc41e47cc5c]
    /home/panstarrs/ipp/psconfig//ipp-20110622.lin64/lib/libshell.so(multicommand+0x90)[0x7fc41e480cd0]
    /home/panstarrs/ipp/psconfig//ipp-20110622.lin64/lib/libshell.so(opihi+0x90)[0x7fc41e493b30]
    pcontrol[0x403bd9]
    /lib/libc.so.6(__libc_start_main+0xe6)[0x7fc41c905486]
    pcontrol[0x401f69]
    ======= Memory map: ========
    00400000-0040d000 r-xp 00000000 00:14 17252593                           /data/ippc18.0/home/ipp/psconfig/ipp-20110622.lin64/bin/pcontrol
    ...(full map in notes)...
    
    
  • 09:10 Bill set states for 2 lost rawImfiles
    chiptool -dropprocessedimfile -set_quality 42 -chip_id 337393 -class_id XY53
    regtool -updateprocessedimfile -set_ignored -exp_id 195736 -class_id XY53

    chiptool -dropprocessedimfile -chip_id 337396 -class_id XY37 -set_quality 42
    regtool -updateprocessedimfile -exp_id 195761 -class_id XY37 -set_ignored
  • 11:30 Mark stopping stdscience and distribution while manually fixing all hanging mounts. ipp053 seems to have particular trouble with ATRC mounts, so removing it from processing by adding it to the hosts_ignore_wave3 group.
  • 12:50 all stalled/hanging processes and ippbXX mounts cleared. restarting stdscience and distribution.
  • 13:00 many jobs want to use ipp053 (ota76), so will take some time to clear
  • 13:10 chip_imfile taking >1000s on ipp014, neb-repair stalled, needed to reset mounts to ippb01.
  • 14:15 Bill asked Gavin to re-set the ippbXX mounts to soft mounts as before (-nosuid,rw,tcp,soft,rsize=32768,wsize=32768,timeo=20,retrans=6,retry=5)
  • 14:42 Bill set two diffRuns with no diffInputSkyfiles to drop: 186382 & 186805
  • 14:46 Bill set 11 LAP destreak runs in state 'failed_revert' to 'new'
  • 15:00 last night's 65 exposures finally through. MD07 has many warps that also want to run on ipp053; LAP set mostly cleared of remaining ipp053 need.
  • 14:30 some LAP exposures have been waiting for update since 11/2 (~48 hrs); could not get them to update, so dropping them so the stack can finish
    laptool -updateexp -lap_id 1593 -exp_id 384384 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1593 -exp_id 384366 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1593 -exp_id 235422 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1593 -exp_id 102133 -set_data_state drop -dbname gpc1
    
  • 20:30 ippc13 down, rebooting. no info on console, load was above normal, may have overloaded but ganglia shows nothing extreme. update pantasks restarted.
  • 21:30 ippc13 down again, rebooting and removing some of the processing on it. Restarting update pantasks. Must be Friday night.

Saturday : 2011.11.05

  • 00:30 stdscience was mostly processing LAP and only ~150, restarted and running ~240 and still some LAP. maybe mount issue somewhere.
  • 01:00 ipp038 drive timeout, raid degraded. set neb-host to repair.
  • 13:00 stdscience lagging, restarting.
  • 13:30 more LAP runs 1605,1608,1609 held up by stalled exposures in diff/magic/destreak, exp_id=92672, 92682
    laptool -updateexp -lap_id 1605 -exp_id 92672 -set_data_state drop
    laptool -updateexp -lap_id 1605 -exp_id 92682 -set_data_state drop
    laptool -updateexp -lap_id 1608 -exp_id 92672 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1608 -exp_id 92682 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1609 -exp_id 92672 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1609 -exp_id 92682 -set_data_state drop -dbname gpc1
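
The drops above are the same laptool call repeated for every (lap_id, exp_id) pair. A small Python sketch that only builds the command strings for review and does not run anything; the flags are copied from the entries above, and the function name `drop_commands` is hypothetical:

```python
from itertools import product


def drop_commands(lap_ids, exp_ids, dbname="gpc1"):
    """Build one laptool drop command string per (lap_id, exp_id) pair.

    Hypothetical helper: it only formats text and never executes laptool.
    """
    return [
        f"laptool -updateexp -lap_id {lap_id} -exp_id {exp_id} "
        f"-set_data_state drop -dbname {dbname}"
        for lap_id, exp_id in product(lap_ids, exp_ids)
    ]


# The three stalled LAP runs and two stalled exposures from the 13:30 entry.
for cmd in drop_commands([1605, 1608, 1609], [92672, 92682]):
    print(cmd)
```

Printing rather than executing keeps a human in the loop, since a drop is not something to script blindly.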
    
    

Sunday : 2011.11.06

  • 09:30 nightly science mostly done, needed to rerun camera stage on 3PI
    failed to read /data/ipp011.0/nebulous/ff/ef/1540770737.gpc1:ThreePi.nt:2011:11:06:o5871g0079o.417564:o5871g0079o.417564.ch.338747.XY02.ch.mk.fits
    
    perl ~ipp/src/ipp-20110622/tools/runchipimfile.pl --chip_id 338747 --class_id XY02 --redirect-output
    
  • 10:00 LAP needs some kicking,
    laptool -listrun -seq_id 8 -state run -dbname gpc1 -simple
    -- 1629, 1632, 1633 are a few days+ old
    lap_science.pl --monitor_mode --lap_id 1629
    -- holding up 1629, 1632, 1633
    failed to read /data/ipp018.0/nebulous/e9/b1/1535098532.gpc1:LAP.ThreePi.20110809:2011:11:05:o5756g0406o.361819:o5756g0406o.361819.cm.315161.XY02.mk.fits
    
    perl ~ipp/src/ipp-20110622/tools/runcameraexp.pl --redirect-output --cam_id 315161
    
    -- holding up 1641, 1643, 1644
    failed to read /data/ipp053.0/nebulous/f3/bf/1539668817.gpc1:LAP.ThreePi.20110809:2011:11:06:o5510g0118o.253312:o5510g0118o.253312.ch.338493.XY76.ch.fits
    
    perl ~ipp/src/ipp-20110622/tools/runchipimfile.pl --chip_id 338493 --class_id XY76 --redirect-output 
    
    
  • 10:30 MD07 warp fault
    failed to read /data/ipp013.0/nebulous/26/a4/1529339884.gpc1:condor_MD07.V3_01.haf:o5353g0140o.178130:o5353g0140o.178130.cm.314328.XY64.mk.fits 
    
    perl ~ipp/src/ipp-20110622/tools/runcameraexp.pl --redirect-output --cam_id 314328
    
  • 11:00 MD07 chip-warp finished, taking a compute2 group from distribution again for stacking with deepstack pantasks on ippc20
  • 14:00 LAP kicking - stare night tonight so trying to have LAP running fully
    --  1643 waiting for magic 
    failure for: magic_process.pl --magic_id 241638 --camera GPC1 --node skycell.1764.056 --baseroot /data/ipp053.0/gpc1_destreak/LAP.ThreePi.20110809/386180/386180.mgc.241638 --logfile /data/ipp053.0/gpc1_destreak/LAP.ThreePi.20110809/386180/386180.mgc.241638.skycell.1764.056.log  --dbname gpc1 --verbose
    
    CFITSIO: Error reading tile from the file, /data/ipp052.0/nebulous/99/5f/1538679085.gpc1:LAP.ThreePi.20110809:2011:11:06:RINGS.V3:skycell.1764.056:RINGS.V3.skycell.1764.056.dif.187302.mask.fits with size 6279 by 6261 pixels
    
    --> need to rerun diff
    perl ~ipp/src/ipp-20110622/tools/rundiffskycell.pl --redirect-output --diff_id 187302 --skycell_id skycell.1764.056
    
    -- 1641, 1644 - not all warps finished still, error_cleaned on chip_id 336008, try fixing 
    chiptool -updateprocessedimfile -set_state cleaned -chip_id 336008 -class_id XY15 -dbname gpc1
    chiptool -updaterun -set_state cleaned -chip_id 336008 -dbname gpc1
    chiptool -setimfiletoupdate -chip_id 336008 -set_label LAP.ThreePi.20110809 -dbname gpc1
    
    -- 1647 - waiting on magic? not clear why not triggering, drop
    laptool -updateexp -lap_id 1647 -exp_id 92672 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1647 -exp_id 92682 -set_data_state drop -dbname gpc1
    
    
  • 18:00 Bill removed stare nodes from processing (~ipp/stare_nodes.sh off)
  • 19:30 missing file
          0                     NON-EXISTANT file:///data/ippb02.1/nebulous/b2/de/937325554.gpc1:20100926:o5465g0318o:o5465g0318o.ota16.fits
          0                     NON-EXISTANT file:///data/ippb02.2/nebulous/b2/de/937326189.gpc1:20100926:o5465g0318o:o5465g0318o.ota16.fits
    
    chiptool -dropprocessedimfile -chip_id 339071 -class_id XY16 -set_quality 42 -dbname gpc1
    regtool -updateprocessedimfile -set_state corrupt -class_id XY16 -exp_id 231183 -dbname gpc1
    
  • 20:00 running slow, restarted stdscience and distribution
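
Several of Sunday's reruns follow the same pattern: the chip or camera id is read off the path in the "failed to read" message and passed to the matching rerun script. A rough Python sketch of that lookup; the regular expressions are guesses based only on the four paths in this log and may not hold for other products, and `rerun_command` is a hypothetical helper, not an IPP tool:

```python
import re

# Assumed patterns, inferred only from the paths and rerun commands logged above.
CHIP_RE = re.compile(r"\.ch\.(\d+)\.(XY\d{2})\.")  # ...ch.<chip_id>.<class_id>...
CAM_RE = re.compile(r"\.cm\.(\d+)\.")              # ...cm.<cam_id>...
TOOLS = "~ipp/src/ipp-20110622/tools"              # tools path as used in this log


def rerun_command(path):
    """Return the rerun command string for a 'failed to read' path, or None."""
    match = CHIP_RE.search(path)
    if match:
        chip_id, class_id = match.groups()
        return (f"perl {TOOLS}/runchipimfile.pl --chip_id {chip_id} "
                f"--class_id {class_id} --redirect-output")
    match = CAM_RE.search(path)
    if match:
        return (f"perl {TOOLS}/runcameraexp.pl --redirect-output "
                f"--cam_id {match.group(1)}")
    return None  # unrecognised product: leave it for a human to inspect


bad = ("/data/ipp013.0/nebulous/26/a4/1529339884.gpc1:condor_MD07.V3_01.haf:"
       "o5353g0140o.178130:o5353g0140o.178130.cm.314328.XY64.mk.fits")
print(rerun_command(bad))
```

The helper only prints a suggested command; whether a rerun is the right fix still depends on why the file could not be read.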