Monday : 2012.07.23

Serge is czar

  • 00:10 keeping chip.revert off to avoid reverting chip runs that use excessive memory and can bring down machines by overloading swap
  • 09:00 Serge. LAP failed chips.
    • gpc1/20100605/o5352g0430o/o5352g0430o.ota25.burn.tbl fixed (same procedure as on Friday)
    • gpc1/20100618/o5365g0104o/o5365g0104o.ota52.fits fixed
    • gpc1/20100618/o5365g0103o/o5365g0103o.ota52.fits fixed. Manually replicated to ippb00.2. See the 12:10 entry on this page
    • gpc1/20100618/o5365g0116o/o5365g0116o.ota26.fits recovered
    • gpc1/20100524/o5340g0106o/o5340g0106o.ota35.burn.tbl fixed
    • gpc1/20100524/o5340g0120o/o5340g0120o.ota55.burn.tbl fixed
    • gpc1/20100602/o5349g0037o/o5349g0037o.ota31.fits fixed
    • gpc1/20100524/o5340g0096o/o5340g0096o.ota17.burn.tbl fixed
    • gpc1/20100524/o5340g0096o/o5340g0096o.ota72.burn.tbl fixed
    • gpc1/20100622/o5369g0059o/o5369g0059o.ota26.fits recovered
  • 09:50 MEH: set up SAS_v8 for distribution (will probably also send chips since I didn't un-target those in the SAS dist_group..)
  • 09:51 Serge: manually reverted failed chips for LAP
  • 10:10 Serge. LAP again
    • gpc1/20100228/o5255g0497o/o5255g0497o.ota47.burn.tbl fixed. I ran the following (see Mark's entry on 2012.05.07):
      ipp_apply_burntool_single.pl --exp_id 143264 --class_id XY47 --this_uri neb://ipp030.0/gpc1/20100228/o5255g0497o/o5255g0497o.ota47.fits 
        --continue 10 --previous_uri neb://ipp005.0/gpc1/20100228/o5255g0496o/o5255g0496o.ota47.fits --dbname gpc1 --verbose
      
    • gpc1/20100714/o5391g0011o/o5391g0011o.ota60.burn.tbl fixed
    • gpc1/20100714/o5391g0012o/o5391g0012o.ota60.burn.tbl fixed
    • gpc1/20100228/o5255g0533o/o5255g0533o.ota35.burn.tbl fixed (using ipp_apply_burntool_single.pl)
    • gpc1/20100228/o5255g0487o/o5255g0487o.ota12.burn.tbl fixed (using ipp_apply_burntool_single.pl)
    • gpc1/20100228/o5255g0491o/o5255g0491o.ota10.burn.tbl fixed (using ipp_apply_burntool_single.pl)
    • gpc1/20100228/o5255g0496o/o5255g0496o.ota33.burn.tbl fixed (using ipp_apply_burntool_single.pl) -- Manually replicated to ippb00.2.
    • gpc1/20100228/o5255g0510o/o5255g0510o.ota40.burn.tbl fixed (using ipp_apply_burntool_single.pl) -- Manually replicated to ippb00.2.
    • gpc1/20100228/o5255g0466o/o5255g0466o.ota22.burn.tbl fixed
    • gpc1/20100228/o5255g0532o/o5255g0532o.ota35.burn.tbl fixed
    • gpc1/20100228/o5255g0534o/o5255g0534o.ota45.burn.tbl fixed
    • gpc1/20100228/o5255g0538o/o5255g0538o.ota53.burn.tbl fixed
    • gpc1/20100228/o5255g0527o/o5255g0527o.ota40.burn.tbl fixed
    • gpc1/20100714/o5391g0027o/o5391g0027o.ota60.burn.tbl fixed
    • gpc1/20100714/o5391g0028o/o5391g0028o.ota60.burn.tbl fixed
    • gpc1/20100714/o5391g0030o/o5391g0030o.ota60.burn.tbl fixed
    • gpc1/20100728/o5405g0348o/o5405g0348o.ota15.fits fixed
  • 12:00 MEH: killing some ppImages taking up >50% RAM on machines before they incapacitate them
  • 12:10 Serge: Manually reverted LAP chips
  • 13:50 Serge: Killed a bunch of runaway ppImage processes
  • 14:26 Serge: Restarted stdscience
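
Note: the log does not record how the memory-hungry ppImage processes were found. A minimal sketch of one way to list them, assuming the 50% threshold from the 12:00 entry and a hypothetical helper name `mem_hogs` that filters standard `ps` output:

```shell
# Print the PIDs of processes named $1 whose memory share exceeds $2 percent.
# Expects "pid pmem command" rows on stdin, as printed by:
#   ps -eo pid=,pmem=,comm=
# mem_hogs is a hypothetical helper, not part of the IPP tools.
mem_hogs() {
    awk -v name="$1" -v limit="$2" \
        '$3 == name && $2 + 0 > limit + 0 { print $1 }'
}

# Example (not run here):
#   ps -eo pid=,pmem=,comm= | mem_hogs ppImage 50
```

The function only prints PIDs; whether to kill them stays a manual decision.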

Tuesday : 2012.07.24

  • 08:55 Serge: fixing some LAP (Note: for recovered ota to replicate at atrc, I temporarily set neb-host ippb00 to up).
    • gpc1/20100626/o5373g0248o/o5373g0248o.ota65.fits fixed
    • gpc1/20100626/o5373g0246o/o5373g0246o.ota76.fits fixed
    • gpc1/20100626/o5373g0245o/o5373g0245o.ota26.fits recovered
    • gpc1/20100626/o5373g0249o/o5373g0249o.ota26.fits recovered
    • gpc1/20100518/o5334g0175o/o5334g0175o.ota14.burn.tbl fixed
  • 09:20 Serge: Reverted LAP chips with fault 2: chiptool -dbname gpc1 -label LAP.ThreePi.20120706 -revertprocessedimfile -fault 2
  • 12:15 Serge: Started ingestion of nebulous dump (mysql-neb-ippdb02-2012-07-23T00:30:02.dump.bz) on ippc63... Screen session "nebulous_ingestion".
  • 12:40 Serge: Started ingestion of nebulous dump (mysql-neb-ippdb02-2012-07-23T00:30:02.dump.bz) on ippc62... Screen session "nebulous_ingestion".
  • 14:05 Bill: set chips and warps for all ps_ud labels to be cleaned. Restarted the pstamp and update pantasks
  • 14:18 Bill: restarted cleanup with some changes to chip.cleanup.run and warp.cleanup.run that should queue jobs faster

Wednesday : 2012.07.25

Thursday : 2012.07.26

  • 09:00 Serge: Changed vm.swappiness on ippc11 (set to 10 instead of the default 60). Modified /etc/sysctl.conf (backup: /etc/sysctl.conf.before_swappiness_change_20120726). Note: use sysctl -w vm.swappiness=10 to change it without editing the sysctl.conf file. More on swappiness here: http://unixfoo.blogspot.com/2007/11/linux-performance-tuning.html
  • 09:20 Serge: Restarted the mysql server on ippc11. It looks like changing the swappiness does not bring the mysqld process back out of swap (there was absolutely no activity during the 20 minutes after the swappiness change on ippc11).
  • 10:45 Serge: LAP
    • gpc1/20110311/o5631g0115o/o5631g0115o.ota55.burn.tbl fixed
    • gpc1/20100223/o5250g0351o/o5250g0351o.ota53.burn.tbl fixed. Reverted LAP fault 2
  • 13:45 Serge: chip.revert.off
    • gpc1/20100517/o5333g0124o/o5333g0124o.ota36.fits recovered
    • gpc1/20100517/o5333g0127o/o5333g0127o.ota36.fits recovered
    • gpc1/20100518/o5334g0095o/o5334g0095o.ota02.burn.tbl fixed
    • gpc1/20100518/o5334g0118o/o5334g0118o.ota14.burn.tbl fixed
  • 16:10 Serge: ganglia is crazy on ipp009. I tried everything I could and decided to reboot it to clear it...
  • 16:50 Serge: I don't know if the swappiness change is really useful, but ganglia doesn't show any ugly purple bar on top of the ipp009, ippc11 and ippc63 memory usage
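
Note: the 09:20 observation is consistent with how swappiness generally works. It affects how eagerly the kernel swaps from then on, while pages already swapped out normally come back only when the process touches them. A minimal sketch for checking how much of a process is still in swap, assuming a kernel that reports the `VmSwap` field in `/proc/<pid>/status`; `swap_kb` is a hypothetical helper name:

```shell
# Print the swapped-out size in kB from a /proc/<pid>/status-style file.
# swap_kb is a hypothetical helper, not part of the IPP tools.
swap_kb() {
    awk '/^VmSwap:/ { print $2 }' "$1"
}

# Example (not run here):
#   cat /proc/sys/vm/swappiness                 # current swappiness value
#   swap_kb "/proc/$(pgrep -o mysqld)/status"   # kB of mysqld still in swap
```

Comparing the value before and after a swappiness change or a service restart gives a more direct reading than the ganglia memory plot.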

Friday : 2012.07.27

  • 08:35 Serge: chiptool -dbname gpc1 -label LAP.ThreePi.20120706 -revertprocessedimfile -fault 2 (all type-2 errors were caused by the ipp009 reboot yesterday)
  • 08:36 Serge: chiptool -dbname gpc1 -label LAP.ThreePi.20120706 -revertprocessedimfile -fault 4 (only one error: o5203g0357o, XY14, exp_id = 120169, chip_id = 519949)
  • 08:50 Serge: Modified the swappiness on ipp005 and ipp006. Flushed the caches with sync ; echo 3 > /proc/sys/vm/drop_caches. See http://www.kernel.org/doc/Documentation/sysctl/vm.txt for details
  • 09:35 Serge: Nebulous ingestion crashed again on ippc63 8(
  • 09:45 Serge: Nebulous ingestion seems frozen on ippc62 8(((
  • 20:00 MEH: restarting stack pantasks to release the extra nodes that were added to push the backlog of stacks through last weekend. compute3 is back to full use in deepstack for.. deepstacks..
  • 20:30 3PI nightly science data processing fine (yay)
  • 21:00 Serge: chip.revert.on in stdscience

Saturday : 2012.07.28

  • 09:10 stdscience needs its 12hr restart
  • 12:00 MEH: looks like ipp057 has been unresponsive for >1hr. power cycled and back up. hanging task in stdscience that was on ipp057, restarting stdscience again too.

Sunday : 2012.07.29

  • 01:50 MD09.GR0 chip trouble (XY67,71) seemed to be related to psphot in the ipp-20120531 tag (test runs with the new ipp-20120626 tag are fine), set those to quality 42 and finished processing MD09.GR0.
  • 02:25 apparently most camera runs in the ecliptic pantasks have been running for 303ks. killing them off to see if they revert and complete.
  • 09:15 Bill restarted distribution with stages chip_bg and warp_bg back in the list of distribution stages to query for work. These were commented out for efficiency since they are only necessary when observing M31. M31 distribution is now proceeding.
  • 12:40 LAP mostly finished except for stacks? pushing more nodes into the ecliptic pantasks (so if stdscience is restarted, they will need to be turned off in either stdscience or ecliptic): +6x stare +1x compute3
  • 23:00 nightly science registration trouble? extra/incorrect regtool -checkstatus?
    Running [/home/panstarrs/ipp/psconfig/ipp-20120626.lin64/bin/regtool -checkstatus -dateobs_begin  -dateobs_end T17:30:00 -class_id XY14 -dbname gpc1]...
    extra arguments: T17:30:00 
    
    -- many entries in the burn log
    neb://ipp058.0/gpc1/20120730/o6138g0063o/o6138g0063o.ota14.burn.log
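
Note: the command line above shows an empty value after -dateobs_begin and a bare T17:30:00 after -dateobs_end, which suggests the date string was empty when the command was built. A minimal sketch of a guard that a wrapper script could apply before building such arguments; `require_date` is a hypothetical helper, and the YYYY-MM-DD format is an assumption, not taken from regtool itself:

```shell
# Echo the date back if it looks like YYYY-MM-DD, otherwise fail loudly.
# require_date is a hypothetical helper; the date format is an assumption.
require_date() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])
            printf '%s\n' "$1"
            ;;
        *)
            echo "require_date: expected YYYY-MM-DD, got '$1'" >&2
            return 1
            ;;
    esac
}

# Example (not run here):
#   night="$(require_date "$night")" || exit 1
```

Failing early this way would turn a silent "extra arguments" message into an explicit error at the point where the date went missing.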