
Monday : 2012-08-06

Mark is czar

  • 07:40 MEH: nightly science slowly continues to download/process, ~100 exposures behind. restarted summitcopy to see if that helps.
    • free space on ipp027.0 continues to decrease: disk 100% used, ~140GB remaining.
    • reports that 'IPP datastore is down, can't connect to backend MySQL server'; I'm getting a proxy error. Can't find any notes on the proxy setup other than it now being on ippops1. /data/ippc17.0 was full; cleaned out /local/ipp/tmp and mysql is happy again. pstamp pantasks having trouble now.
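A minimal sketch of that tmp cleanup, for reference. A throwaway temp directory stands in for /local/ipp/tmp so the sketch runs anywhere; the du-then-sort pass is just one way to find the biggest offenders before deleting anything.

```shell
# Stand-in for /local/ipp/tmp -- do NOT rm against the real tree
# without reviewing the du output first.
tmpdir=$(mktemp -d)
dd if=/dev/zero of="$tmpdir/big.tmp" bs=1024 count=16 2>/dev/null
touch "$tmpdir/small.lock"
du -sk "$tmpdir"/* | sort -n        # biggest entries last
rm -f "$tmpdir"/*.tmp               # remove only what is known safe to drop
```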
  • 08:55 (Serge): pubtool -revert -dbname gpc1 -fault 2 -label ThreePi.nightlyscience after ippc17 full disk incident
  • 09:10 (Serge): Killed rogue jobs and restarted publishing
  • 09:30 (Serge): Fixed LAP fault 2: gpc1/20100530/o5346g0039o/o5346g0039o.ota47.burn.tbl, gpc1/20100624/o5371g0063o/o5371g0063o.ota64.fits, gpc1/20100529/o5345g0048o/o5345g0048o.ota26.fits
  • 10:20 MEH: 12-18 hr restart of stdscience; seems to have picked up the remaining 3 3PI chips.
  • 10:30 rebooting ippc17 to try and possibly clear some weird behavior. pstamp and datastore back online. oddly, looks like publishing pushed out some results to datastore with 8/5 date.
  • 11:00 nightly science finished. Chris fixed LAP issue and full runs continuing.
  • 11:10 (Serge): ippc63 is 68965 seconds behind its master
  • 14:20 MEH: odd behavior, stdscience mostly stopped; suspect the book filling up with CRASH entries from ippc29,ippc61 with a very odd error. Restarted and slowly put ippc29,c61 back in and it seems to be fine again. stack was also having CRASH with ippc29. (Meeting discussion suggests a stale NFS mount of psLib on ippc18 may have been the problem, from today's rebuild of the tag.)
    --> pantasks.stdout.log.20120806.142251 
    crash for: warp_skycell.pl --threads @MAX_THREADS@ --warp_id 489012 --warp_skyfile_id 40697774 --skycell_id skycell.1456.049 --tess_dir RINGS.V3 --camera GPC1 --outroot neb://stsci01.2/gpc1/LAP.ThreePi.20120706/2012/08/04/o5566g0752o.273732/o5566g0752o.273732.wrp.489012.skycell.1456.049 --redirect-output --run-state new  --reduction LAP_SCIENCE --dbname gpc1 --verbose
    job exit status: CRASH
    job host: ippc61
    job dtime: 0.104959
    job exit date: Mon Aug  6 14:19:14 2012
    hostname: ippc61
    --> pantasks.stderr.log.20120806.142251
     -> addprocessedimfileMode (chiptool.c:539): unknown psLib error
         -exp_id is required
    --> ippc29 also in stack -- nothing in stderr log though
    crash for: stack_skycell.pl --threads @MAX_THREADS@ --stack_id 1120678 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/06/RINGS.V3/skycell.1276.094/RINGS.V3.skycell.1276.094.stk.1120678 --redirect-output --run-state new --reduction THREEPI_STACK --dbname gpc1 --verbose
    job exit status: CRASH
    job host: ippc29
    job dtime: 0.114992
    job exit date: Mon Aug  6 14:13:14 2012
    hostname: ippc29
  • 16:50 MEH: shutdown and restarted all pantasks.
  • 20:20 and the fun continues. ipp055 raid is degraded, setting to repair.. failed.. neb-host stalls
  • 20:25 all summitcopy, processing has stalled, nebulous problem?
    500 read timeout at /home/panstarrs/ipp/psconfig//ipp-20120802.lin64/lib/Nebulous/Client.pm line 1351
    Unable to perform dsget: 9 at /home/panstarrs/ipp/psconfig//ipp-20120802.lin64/bin/summit_copy.pl line 236.
    failure for: summit_copy.pl --uri http://otis3.ifa.hawaii.edu/ds/skyprobe/o6146i0041o01/o6146i0041o01.fits --filename neb://any/isp/20120807/o6146i0041o01/o6146i0041o01.chip01.fits --summit_id 728008 --exp_name o6146i0041o01 --inst isp --telescope ps1 --class chip --class_id chip01 --bytes 8602560 --md5 6d88e816d69f885f0966ecd5af3bd814 --dbname isp --timeout 600 --verbose --copies 2 --nebulous
    • doh! /dev/sda4 1.9T 1.9T 20K 100% /export/ippdb00.0
    • /export/ippdb00.0/ipp/gpc1/MD08.20091110 is from 2009; assuming it doesn't need to be on ippdb00.0, rsync'd it to ipp060.0, freeing 8.8GB
    • neb-host ipp055 repair -- worked
    • summitcopy on and running, registration on, stdscience on w/o LAP
    • pstamp+update also on for MOPS, not sure how long 8GB will last
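The MD08 offload above amounts to a du-then-rsync move. A runnable sketch of the same procedure, with temp directories standing in for /export/ippdb00.0 and ipp060.0, cp -a standing in for the real rsync over the network, and a placeholder file name:

```shell
src=$(mktemp -d); dst=$(mktemp -d)          # stand-ins for ippdb00.0 / ipp060.0
mkdir -p "$src/gpc1/MD08.20091110"
echo "fits" > "$src/gpc1/MD08.20091110/o5000g0001o.fits"   # placeholder file
du -sk "$src/gpc1/MD08.20091110"            # confirm the size before moving
cp -a "$src/gpc1/MD08.20091110" "$dst/"     # real move: rsync -a ... ipp060:...
ls "$dst/MD08.20091110"
```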
  • 21:10 ipp012+ipp021 raid rebuilding, putting to neb-repair? That was an old email; put them back up.
  • 21:55 (Serge) Purged 590 GB of binlogs on ippdb00 (full disk)
    PURGE BINARY LOGS BEFORE '2012-07-20';
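For reference, that cutoff (2012-07-20, purged on 2012-08-06) kept roughly 17 days of binlogs. A sketch for computing a rolling cutoff of that sort; the 17-day window is illustrative, not policy:

```shell
# Compute a 'keep ~17 days' cutoff date; GNU date first, BSD date fallback.
cutoff=$(date -d '17 days ago' +%F 2>/dev/null || date -v-17d +%F)
echo "PURGE BINARY LOGS BEFORE '$cutoff';"   # feed to mysql on the master
```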
  • 21:56 (Serge) ippc63 is about 50000 seconds behind its master

Tuesday : 2012-08-07

Serge is czar

  • 06:40 (Serge): Still 600-409 = 191 exposures to be registered but registration activity seems fine (as far as I can tell).
  • 09:10 (Serge): Fixed gpc1/20110315/o5635g0567o/o5635g0567o.ota75.burn.tbl (LAP)
  • 09:15 (Serge): 68 exposures to register/download
  • 09:40 (Serge): ipp016 set to repair: neb-host ipp016 repair
  • 09:50 (Serge): ~ipp/check_system.sh hostoff ipp016 run 8 times. ./check_system.sh hostcheck ipp016 reports that it's OFF in all pantasks
  • 10:00 (Serge): No ipp process running on ipp016. neb-host ipp016 down
  • 10:03 (Serge): Shutting down ipp016
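The take-ipp016-out sequence above (hostoff once per pantasks, verify, then mark the node down in nebulous) can be sketched as a loop. This version only prints the commands so they can be reviewed before running:

```shell
host=ipp016
for i in $(seq 1 8); do                      # 8 pantasks, one hostoff each
    echo "~ipp/check_system.sh hostoff $host"
done
echo "~ipp/check_system.sh hostcheck $host"  # should report OFF everywhere
echo "neb-host $host down"
```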
  • 10:35 (Serge): ippc63 is 40000 seconds behind its master
  • 11:15 (Serge): All observations downloaded and registered
  • 11:30 MEH: MOPS would like WS diffims from STS last night, adding to stdscience label STS.nightlyscience.mops
  • 11:45 MEH: stealing ippc57,c58,c59,c60,ipp060 from stdscience for testruns
  • 13:15 (Serge): ipp016 is back
  • 14:10 (Serge): Fixed gpc1/20100530/o5346g0142o/o5346g0142o.ota14.burn.tbl (LAP)
  • 21:30 MEH: restarting stdscience to turn on MD01, tweaking ssdiff to then run.

Wednesday : 2012-08-08

Serge is czar

  • 04:20 (Serge): About 80 exposures behind. Let's see if labeltool -updatelabel -set_inactive -label LAP.ThreePi.20120706 -dbname gpc1 helps...
  • 07:30 (Serge): All but 20 darks missing: labeltool -updatelabel -set_inactive -label LAP.ThreePi.20120706 -dbname gpc1
  • 09:05 (Serge): Nightly science over
  • 09:10 (Serge): Setting LAP label to inactive was of no help. I ran source ~bills/db01/tz.sql and source ~bills/db01/latency_query.sql which showed that the time difference between the end of observation and the end of registration in the IPP is slightly increasing.
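The latency those queries track is just (end of registration) minus (end of observation). A toy sketch of the arithmetic; both epochs below are made-up placeholders, not real data:

```shell
# Toy latency check; the two unix timestamps are placeholders.
obs_end=1344243600                 # exposure end (made up)
reg_end=1344244500                 # registration end (made up)
echo "latency: $(( (reg_end - obs_end) / 60 )) minutes"
```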
  • 09:20 (Serge): Fixed gpc1/20100530/o5346g0144o/o5346g0144o.ota14.burn.tbl and gpc1/20100530/o5346g0164o/o5346g0164o.ota61.burn.tbl in LAP
  • 09:22 (Serge): ippc63 is 600 seconds behind its master
  • 10:15 (Serge): Killed 3 ppImage instances on ipp039 and one on ipp047. Fixed gpc1/20100530/o5346g0182o/o5346g0182o.ota15.burn.tbl.
  • 11:00 (Serge): ippc63 is synchronized with ippdb00.
  • 11:00 MEH: ran wrong command in stdscience, need to restart..
  • 11:00 (Serge): Recovering space (should be around 500GB) on ipp027 by deleting the various /export/ipp027.0/backup/snapshots/daily.?/alala/data/alala.0/elixir. On ipp022, I'm archiving /export/ipp022.0/alala.0/elixir. It will be copied by rsnapshot.
  • 14:15 (Serge): Fixed gpc1/20100530/o5346g0104o/o5346g0104o.ota13.burn.tbl and gpc1/20100118/o5214g0253o/o5214g0253o.ota22.burn.tbl
  • 14:30 (Serge): Killed ppImage on ipp039 involving (o4991g0100o, xy60) and (o4985g0063o, xy60)
    chiptool -dbname gpc1 -updateprocessedimfile -fault 0 -set_quality 42 -set_state full -chip_id 534179 -class_id XY60
    regtool -dbname gpc1 -updateprocessedimfile -set_state corrupt -class_id XY60 -exp_id 75621
    chiptool -dbname gpc1 -updateprocessedimfile -fault 0 -set_quality 42 -set_state full -chip_id 534138 -class_id XY60
    regtool -dbname gpc1 -updateprocessedimfile -set_state corrupt -class_id XY60 -exp_id 73255
  • 15:15 (Serge): Killed ppImage on ipp039 involving (o4985g0082o, XY60) and (o4991g0100o, XY60)
    chiptool -dbname gpc1 -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 534139  -class_id XY60
    chiptool -dbname gpc1 -updateprocessedimfile -set_state full -chip_id 534139  -class_id XY60
    regtool -dbname gpc1 -updateprocessedimfile -set_state corrupt -class_id XY60 -exp_id 73272
    chiptool -dbname gpc1 -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 534179 -class_id XY60
    chiptool -dbname gpc1 -updateprocessedimfile -set_state full -chip_id 534179 -class_id XY60
    regtool -dbname gpc1 -updateprocessedimfile -set_state corrupt -class_id XY60 -exp_id 75621
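The same two-command "mark chip bad, mark exposure corrupt" pair recurs for every stuck ppImage. A hypothetical helper (drop_chip is not an existing IPP tool) that prints the pair for a given chip_id/exp_id, so the output can be reviewed before piping it to sh:

```shell
# Hypothetical wrapper: prints the commands, does not execute them.
drop_chip() {
    local chip_id=$1 exp_id=$2 class_id=${3:-XY60}
    echo "chiptool -dbname gpc1 -updateprocessedimfile -fault 0 -set_quality 42 -set_state full -chip_id $chip_id -class_id $class_id"
    echo "regtool -dbname gpc1 -updateprocessedimfile -set_state corrupt -class_id $class_id -exp_id $exp_id"
}
drop_chip 534179 75621       # review first; then: drop_chip 534179 75621 | sh
```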
  • 15:55 (Serge): Restarted registration after bias images painted the czartool in red
  • 16:40 (Bill): Fixed and installed ppBackground to apply the correct sign with the pattern continuity correction. Queued 120 new M31 chipBackgroundRuns
  • 17:15 MEH: ipp016 raid issues today, disk swapped and raid rebuilding so setting neb-host repair to harass it less

Thursday : 2012-08-09

Serge is czar

  • 07:30 (Serge): Summit copy and registration finished
  • 09:00 (Serge): Nightly science finished (but 4 STS at warp stage and one M31 at stack)
  • 09:05 (Serge): Killed the usual bunch of mad ppImage jobs. Dropped (class_id, exp_id, chip_id):
    • on ipp039: XY60 73268 535777; XY60 73279 535850 ; XY60 73239 536023.
    • on ipp043: XY60 73266 536066
    • on ipp047: XY67 73268 535777 ; XY67 73279 535850
    • on ippc49: XY60 73264 535799
  • 09:10 (Serge): Fixed gpc1/20100118/o5214g0268o/o5214g0268o.ota16.burn.tbl; gpc1/20100530/o5346g0092o/o5346g0092o.ota47.burn.tbl
  • 09:15 (Serge): Recovered not-so-lost ota gpc1/20100530/o5346g0113o/o5346g0113o.ota26.fits
  • 10:30 (Serge): Fixed gpc1/20100530/o5346g0227o/o5346g0227o.ota15.burn.tbl
  • 10:37 (Bill) stacktool -updaterun -set_state drop -stack_id 1113450 -set_note 'fails due to problem in ticket 1427'
  • 11:03 (Serge): Dropped:
    • ipp039: XY60 73291 536588 ; XY60 73294 536589 ; XY60 73307 536590 ; XY60 73308 536591
  • 15:10 (Serge): LAP surgery
    • Dropped ipp030: XY60 73306 536761 ; ipp039: XY60 73299 536825
    • Recovered gpc1/20100530/o5346g0220o/o5346g0220o.ota26.fits.
    • Fixed gpc1/20100530/o5346g0233o/o5346g0233o.ota14.burn.tbl ; gpc1/20100530/o5346g0257o/o5346g0257o.ota47.burn.tbl ; gpc1/20100128/o5224g0254o/o5224g0254o.ota66.burn.tbl ; gpc1/20100530/o5346g0258o/o5346g0258o.ota15.burn.tbl
  • 20:20 Bill: summit copy getting stuck again. Couldn't find the root cause of a process being stuck for 30 minutes. Killed and restarted registration and summit copy pantasks
  • 20:48 ipp024 having trouble communicating with stsci nodes. Just took it out of registration and summit copy pantasks. burntool finally is making progress with tonight's data.
  • 21:00 ipp024 is a black hole for jobs. Cycling power on it.
  • 21:45 Setting lap label to inactive to see if that keeps copy/registration from clogging.
  • 21:59 ipp039 and ipp047 were clogged up with many LAP jobs eating 20 GB of memory. Congrats to these machines for not crashing. Killed the jobs. Leaving LAP label off until morning. It looks like we may be in dense regions.
  • 22:15 more hosts bogged down on LAP ppImage. stopped stdscience. killed all ppImage jobs left running. restarted stdscience. To allow stacks to continue LAP label set to active but removed from stdscience.

Friday : 2012-08-10

  • 05:30 Bill: Even with little to do, summit copy is not close to keeping up with the data acquisition rate; we are 163 exposures behind. Adding the LAP label back into stdscience. chip.revert.off
  • 06:12 Well, turning LAP back on was a mistake. Apparently there are some exposures from 2009 in the queue that are causing memory to explode. (There are several ppImage processes over 20GB). Turning label back off.
  • 08:44 killed off several exploded LAP jobs. One had a virtual size of 65G. Will start looking into what is going on with these exposures shortly.
  • 09:34 Serge: ganglia was showing ipp024 as swapping when nothing was running there but the mysql server. I stopped mysql, flushed the caches (sync ; echo 3 > /proc/sys/vm/drop_caches), restarted ganglia, then restarted mysql. It looks fine at 09:36.
  • 10:04 Bill added a bit of code to psphotPSFStats which will cause chips with unreasonably large PSF values to generate a quality error rather than destroy the cluster.
    +        if (fwhmMaj > 35) {
    +            // XXX: get this parameter from the recipe
    +            // FWHM is too large. Using this often leads to detection of huge numbers of sources
    +            psLogMsg ("psphot", PS_LOG_WARN, "fwhm too large giving up\n");
    +            goto escape;
    +        }
  • 10:14 (Serge): Set ippc63 to off in stdscience so that the nebulous replication there can catch up (49000 seconds behind now).
  • 10:45 (Serge): LAP surgery
    • Recovered: gpc1/20100611/o5358g0044o/o5358g0044o.ota52.fits ; gpc1/20100611/o5358g0052o/o5358g0052o.ota56.fits ; gpc1/20100518/o5334g0041o/o5334g0041o.ota37.fits
    • Fixed: gpc1/20100228/o5255g0363o/o5255g0363o.ota05.burn.tbl ; gpc1/20100228/o5255g0347o/o5255g0347o.ota55.burn.tbl ; gpc1/20100611/o5358g0051o/o5358g0051o.ota05.burn.tbl ; gpc1/20100228/o5255g0371o/o5255g0371o.ota36.burn.tbl
  • 12:24 Bill fixed corrupted weight file for chip_id 535215 XY55 with perl ~bills/ipp/tools/runchipimfile.pl --chip_id 535215 --class_id XY55 --redirect-output
  • 14:37 (Serge): Fixed gpc1/20100228/o5255g0359o/o5255g0359o.ota02.burn.tbl. Not so much surgery after the 10:04 bug fix
  • 15:43 (Serge): Might be worth re-running the following "dropped" chips once the 10:04 bug is really fixed:
    SELECT chipProcessedImfile.exp_id, chipProcessedImfile.chip_id, chipProcessedImfile.class_id
    FROM chipProcessedImfile
    JOIN chipRun USING(chip_id)
    WHERE quality=42 AND
      data_group LIKE 'LAP.ThreePi.20120706%'
    ORDER BY chipProcessedImfile.exp_id;
  • 16:11 (Serge): Fixed gpc1/20100228/o5255g0393o/o5255g0393o.ota33.burn.tbl
  • 20:18 (Bill): restarted stdscience

Saturday : 2012-08-11

  • 07:30 (Bill) summit copy is sluggish and has been for weeks. I think its pantasks host, ipp051, is overloaded with processing jobs. I'm going to move it to ippc15
  • 07:35 summit copy restarted with pantasks running on the dedicated pantasks host ippc15. distribution, publishing, and summit copy are all running there now.

Sunday : 2012-08-12