PS1 IPP Czar Logs for the week 2011.10.31 - 2011.11.06

Monday : 2011.10.31

  • 00:30 another LAP chip
          0                     NON-EXISTANT file:///data/ipp053.0/nebulous/14/50/286984446.gpc1:20100522:o5338g0475o:o5338g0475o.ota76.fits
          1 c5d2ec4d851524456fb48a1bd770d6bf file:///data/ippb01.1/nebulous/14/50/1154222719.gpc1:20100522:o5338g0475o:o5338g0475o.ota76.fits
    cp /data/ippb01.1/nebulous/14/50/1154222719.gpc1:20100522:o5338g0475o:o5338g0475o.ota76.fits /data/ipp053.0/nebulous/14/50/286984446.gpc1:20100522:o5338g0475o:o5338g0475o.ota76.fits
    --> neb-repair should have been able to fix this case but did not do so during the chip stage. If it cannot, we need to know why.
  • Ingestion of nebulous on ippdb02 completed over the weekend (after 10 days: Wed Oct 19 16:12:33 HST 2011 - Sat Oct 29 00:09:27 HST 2011). Replication started with:
    CHANGE MASTER TO MASTER_HOST='ippdb00', MASTER_USER='repl_neb', MASTER_PASSWORD='xxx', MASTER_LOG_FILE='mysqld-bin.003352', MASTER_LOG_POS=277806591;

The slave is 1011144 seconds behind the master.
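
For scale, the lag reported above can be converted into days and hours. A minimal Python sketch; the helper name format_lag is illustrative and not part of any IPP or MySQL tool:

```python
from datetime import timedelta


def format_lag(seconds: int) -> str:
    """Render a replication lag given in seconds as days, hours, minutes, seconds."""
    lag = timedelta(seconds=seconds)
    # timedelta normalizes to whole days plus the leftover seconds within a day
    hours, rem = divmod(lag.seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{lag.days}d {hours:02d}h {minutes:02d}m {secs:02d}s"


print(format_lag(1011144))  # 11d 16h 52m 24s
```

That is a little under 12 days, which appears to be about the time elapsed since the ingest began on Oct 19.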

  • 11:15 Mark: returning the compute2 group from deepstack back to distribution for now.
  • 14:17 Bill: dropped two chips whose files have been lost
regtool -updateprocessedimfile -set_ignored -exp_id 232263 -class_id XY15
regtool -updateprocessedimfile -set_ignored -exp_id 231044 -class_id XY16
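
After a manual cp like the one in the 00:30 entry, the restored instance could be checked against the checksum shown in the listing. A minimal Python sketch, assuming the 32-hex-digit value is an MD5 digest of the file contents (an assumption, the log does not say); the function names are illustrative and not part of nebulous:

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_matches(path: str, expected_md5: str) -> bool:
    """True if the file at `path` has the expected checksum (case-insensitive)."""
    return md5_of_file(path) == expected_md5.lower()
```

For the case above this would be called as copy_matches(<path of the restored copy on ipp053.0>, "c5d2ec4d851524456fb48a1bd770d6bf").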

Tuesday : 2011.11.01

Heather is czar today

  • 10:31 Bill had Gavin change the automount parameters to use hard instead of soft mounts. Pantasks restarted.
  • ~12:45 found two more lost files
regtool -updateprocessedimfile -set_ignored -exp_id 187628 -class_id XY50
regtool -updateprocessedimfile -set_ignored -exp_id 187601 -class_id XY50
  • chip 335562 XY67 has been faulting repeatedly. To debug, Bill ran it by hand and it succeeded. Strange.

Wednesday : 2011.11.02

  • Bill queued 20 i-band M31 exposures for chip-through-warp processing, as a demonstration of how magic makes M31.2011 data unusable.

Thursday : 2011.11.03

  • 10:00 Mark adding the condor_MD07.V3_01 label to pantasks to finish the chip-warp processing for refstacks.
  • 10:10 stdscience seems lagging, restarting. LAP gets held up on ipp053 jobs. Moving condor_MD07.V3_01 priority above LAP. MD07 also being held up on issues preventing jobs from completing, but is still processing other MD07 jobs.

Friday : 2011.11.04

Mark czar

  • 01:05 ippc13 down, cycling power. no info on console. restarted update pantasks.
  • 01:30 new data coming in but stalled. restarted registration and summitcopy. cause was a hung mount on ipp053 to ippb01, cleared with
    /usr/local/sbin/force.umount ippb01
  • 07:50 ippc09 mount of ippb02 stuck.
  • 08:00 shutdown of stdscience pantasks dumped backtrace to ippc16 stdout
    ipp@ippc16:/home/panstarrs/ipp>*** glibc detected *** pcontrol: double free or corruption (!prev): 0x00000000006d7730 ***
    ======= Backtrace: =========
    ======= Memory map: ========
    00400000-0040d000 r-xp 00000000 00:14 17252593                           /data/ippc18.0/home/ipp/psconfig/ipp-20110622.lin64/bin/pcontrol
    ...(full map in notes)...
  • 09:10 Bill set states for 2 lost rawImfiles
chiptool -dropprocessedimfile -set_quality 42 -chip_id 337393 -class_id XY53
regtool -updateprocessedimfile -set_ignored -exp_id 195736 -class_id XY53

chiptool -dropprocessedimfile -chip_id 337396 -class_id XY37 -set_quality 42
regtool -updateprocessedimfile -exp_id 195761 -class_id XY37 -set_ignored
  • 11:30 Mark stopping stdscience and distribution while manually fixing all hanging mounts. ipp053 seems to have particular trouble with ATRC mounts; removing it from processing by adding it to the hosts_ignore_wave3 group.
  • 12:50 all stalled/hanging processes and ippbXX mounts cleared. restarting stdscience and distribution.
  • 13:00 many jobs want to use ipp053 (ota76), so will take some time to clear
  • 13:10 chip_imfile taking >1000s on ipp014, neb-repair stalled, needed to reset mounts to ippb01.
  • 14:15 Bill asked Gavin to re-set the ippbXX mounts to soft mounts as before (-nosuid,rw,tcp,soft,rsize=32768,wsize=32768,timeo=20,retrans=6,retry=5)
  • 14:42 Bill set two diffRuns with no diffInputSkyfiles to drop: 186382 & 186805
  • 14:46 Bill set 11 LAP destreak runs in state 'failed_revert' to 'new'
  • 15:00 last night's 65 exposures finally through. MD07 has many warps that also want to run on ipp053; LAP mostly cleared of its remaining need for ipp053.
  • 14:30 some LAP exposures have been waiting for update since 11/2 (~48 hrs). unsuccessful in getting them to update, so dropping the exposures so the stack can finish
    laptool -updateexp -lap_id 1593 -exp_id 384384 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1593 -exp_id 384366 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1593 -exp_id 235422 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1593 -exp_id 102133 -set_data_state drop -dbname gpc1
  • 20:30 ippc13 down, rebooting. no info on console, load was above normal, may have overloaded but ganglia shows nothing extreme. update pantasks restarted.
  • 21:30 ippc13 down again, rebooting and removing some of the processing on it. restarting update pantasks. must be friday night..

Saturday : 2011.11.05

  • 00:30 stdscience was mostly processing LAP and only running ~150; restarted, now running ~240 and still some LAP. maybe a mount issue somewhere.
  • 01:00 ipp038 drive timeout, raid degraded. set neb-host to repair.
  • 13:00 stdscience lagging, restarting.
  • 13:30 more LAP runs 1605,1608,1609 held up by stalled exposures in diff/magic/destreak, exp_id=92672, 92682
    laptool -updateexp -lap_id 1605 -exp_id 92672 -set_data_state drop
    laptool -updateexp -lap_id 1605 -exp_id 92682 -set_data_state drop
    laptool -updateexp -lap_id 1608 -exp_id 92672 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1608 -exp_id 92682 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1609 -exp_id 92672 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1609 -exp_id 92682 -set_data_state drop -dbname gpc1
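
The drop commands above follow one pattern per (lap_id, exp_id) pair, so they can be generated rather than typed. A minimal Python sketch that only prints the command lines for review and does not run laptool; the function name drop_commands is illustrative, and it assumes -dbname gpc1 should be passed on every line, as in most of the entries above:

```python
from itertools import product


def drop_commands(lap_ids, exp_ids, dbname="gpc1"):
    """Build one laptool drop command per (lap_id, exp_id) pair, without running it."""
    return [
        f"laptool -updateexp -lap_id {lap_id} -exp_id {exp_id} "
        f"-set_data_state drop -dbname {dbname}"
        for lap_id, exp_id in product(lap_ids, exp_ids)
    ]


# Dry run: print the six commands for the three held-up LAP runs.
for command in drop_commands([1605, 1608, 1609], [92672, 92682]):
    print(command)
```

Printing first keeps a record for the log and lets the czar check the list before pasting it into a shell.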

Sunday : 2011.11.06

  • 09:30 nightly science mostly done, needed to rerun camera stage on 3PI
    failed to read /data/ipp011.0/nebulous/ff/ef/
    perl ~ipp/src/ipp-20110622/tools/ --chip_id 338747 --class_id XY02 --redirect-output
  • 10:00 LAP needs some kicking,
    laptool -listrun -seq_id 8 -state run -dbname gpc1 -simple
    -- 1629, 32, 33 few day+ old --monitor_mode --lap_id 1629
    -- holding up 1629,32,33
     failed to read /data/ipp018.0/nebulous/e9/b1/
    perl ~ipp/src/ipp-20110622/tools/ --redirect-output --cam_id 315161
    -- holding up 1641,43,44
     failed to read /data/ipp053.0/nebulous/f3/bf/ 
    perl ~ipp/src/ipp-20110622/tools/ --chip_id 338493 --class_id XY76 --redirect-output 
  • 10:30 MD07 warp fault
    failed to read /data/ipp013.0/nebulous/26/a4/ 
    perl ~ipp/src/ipp-20110622/tools/ --redirect-output --cam_id 314328
  • 11:00 MD07 chip-warp finished, taking a compute2 group from distribution again for stacking with deepstack pantasks on ippc20
  • 14:00 LAP kicking - stare night tonight so trying to have LAP running fully
    --  1643 waiting for magic 
    failure for: --magic_id 241638 --camera GPC1 --node skycell.1764.056 --baseroot /data/ipp053.0/gpc1_destreak/LAP.ThreePi.20110809/386180/386180.mgc.241638 --logfile /data/ipp053.0/gpc1_destreak/LAP.ThreePi.20110809/386180/386180.mgc.241638.skycell.1764.056.log  --dbname gpc1 --verbose
    CFITSIO: Error reading tile from the file, /data/ipp052.0/nebulous/99/5f/1538679085.gpc1:LAP.ThreePi.20110809:2011:11:06:RINGS.V3:skycell.1764.056:RINGS.V3.skycell.1764.056.dif.187302.mask.fits with size 6279 by 6261 pixels
    --> need to rerun diff
    perl ~ipp/src/ipp-20110622/tools/ --redirect-output --diff_id 187302 --skycell_id skycell.1764.056
    -- 1641, 1644 - not all warps finished still, error_cleaned on chip_id 336008, try fixing 
    chiptool -updateprocessedimfile -set_state cleaned -chip_id 336008 -class_id XY15 -dbname gpc1
    chiptool -updaterun -set_state cleaned -chip_id 336008 -dbname gpc1
    chiptool -setimfiletoupdate -chip_id 336008 -set_label LAP.ThreePi.20110809 -dbname gpc1
    -- 1647 - waiting on magic? not clear why not triggering, drop
    laptool -updateexp -lap_id 1647 -exp_id 92672 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 1647 -exp_id 92682 -set_data_state drop -dbname gpc1
  • 18:00 Bill removed stare nodes from processing (~ipp/ off)
  • 19:30 missing file
          0                     NON-EXISTANT file:///data/ippb02.1/nebulous/b2/de/937325554.gpc1:20100926:o5465g0318o:o5465g0318o.ota16.fits
          0                     NON-EXISTANT file:///data/ippb02.2/nebulous/b2/de/937326189.gpc1:20100926:o5465g0318o:o5465g0318o.ota16.fits
    chiptool -dropprocessedimfile -chip_id 339071 -class_id XY16 -set_quality 42 -dbname gpc1
    regtool -updateprocessedimfile -set_state corrupt -class_id XY16 -exp_id 231183 -dbname gpc1
  • 20:00 running slow, restarted stdscience and distribution