PS1 IPP Czar Logs for the week 2013.09.09 - 2013.09.14

Monday : 2013.09.09

mark is czar

  • 08:40 MEH: finishing clearing broken LAP chips, then regular restart of stdsci
  • 09:05 MEH: 4 pantasks_servers running on ippc15 by ipp, should be only 3, and 5 rather than 3 running on ippc04 -- restarting most pantasks.. culprits were publishing, stack, and registration each loaded 2x.. kind of excessive over-starting..
    • wonder if the 2x registration has been the source of the odd registration hangups lately..
  • 14:15 MEH: apparently ippc06 rebooted around 13:45.. -- nothing on console.. did someone bump it?
    • took down cleanup and detrend.. restarting..
  • 14:30 MEH: ipp047 back online! apparently was never taken out of ippconfig/pantasks_hosts.input so taking out for now and putting into neb-host repair like siblings
  • 15:00 MEH: ippc06 rebooting itself again.. -- taking out of processing, restarting cleanup and detrend pantasks
  • 23:50 MEH: czarpoll fairly sluggish most of the evening, not clear why. going to restart
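
A minimal sketch of the 09:05 check above: given the list of pantasks names found running, report any that are loaded more than once. The function name and the sample list are hypothetical and for illustration only; the list is not the actual per-host layout from the log, and how the names are collected from the hosts is left out.

```python
from collections import Counter


def duplicated_servers(running):
    """Return {name: count} for every pantasks name loaded more than once."""
    counts = Counter(running)
    return {name: n for name, n in counts.items() if n > 1}


# Illustrative snapshot only (assumption); not the real layout on ippc04/ippc15.
running = ["stdscience", "publishing", "publishing", "stack", "stack"]
print(duplicated_servers(running))
```

An empty result means every pantasks is loaded exactly once.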

Tuesday : 2013.09.10

mark is czar

  • 07:10 MEH: ipp033 down, 238 exposures behind.. -- nothing on console, back up and registration+processing continues
    • ipp033 is in repair so unclear why it would stall registration -- was a check_burntool problem for neb://ipp033.0, so used regtool to clear it rather than wait, but..
      o6545g0361o  XY52 0 check_burntool neb://ipp033.0/gpc1/20130910/o6545g0361o/o6545g0361o.ota52.fits	#??? regtool -updateprocessedimfile -exp_id 654541 -class_id XY52 -set_state pending_burntool -dbname gpc1
      
  • 07:55 MEH: czar stuff is very slow still, looks like ippdb01 is very busy still. dvo related? -- czarpage and status pretty much useless, updates maybe every 30-60 mins..
  • 08:10 MEH: 3PI diffim fault 5, ongoing problem
    VectorFitPolynomial1DOrd (psMinimizePolyFit.c:610): unknown psLib error
         Could not solve linear equations.
    
    difftool -updatediffskyfile -set_quality 14006 -diff_id 476426 -skycell_id skycell.1049.062 -fault 0
    
  • 08:15 MEH: registration finished
  • 09:50 MEH: stdscience in dire need of regular restart..
  • 10:50 MEH: distribution odd counts, restarting as well
  • 11:50 MEH: apparently MD01,02 night stacks not made yet..
  • 12:20 MEH: stack pantask odd counts, just shutdown for restart as well..
  • 12:30 MEH: ippdb01 now <50% cpu usage with minimal things in processlist after cleaning out, start just distribution, can it advance runs?
    • seems to time out? after ~700s but does advance what is found.. doesn't seem right
  • 13:00 MEH: something wrong, even stacktool for MD night stacks takes >5 min and seems to just time out -- restart mysql on ippdb01
    • queries running normally again -- MD night stacks in progress, distribution running, LAP back on
    • state on restart -- running since 7/30, RAM 93%, 16G swap, often CPU@700% -- 30d regular restart useful when full 24/7 processing?
  • 14:10 MEH: ippc63 runs hotter than other compute3 and ippc30 not used in any processing as it holds/serves the PSS data -- make a hosts_compute3_weak group -- currently loading 3x into stdsci, faster jobs and lower RAM than stacks typically
    • appears to run okay, could probably add a couple more except it appears the jobs fill up cores on one CPU and then fill the other -- adding more will fully load a CPU on ippc63 and likely make it too hot
  • 14:20 MEH: ipp033 (wave2) and ipp047 (wave3) manually adding back into stdscience 2x (nominally 4x, and 1x in summit+reg+dist+clean) -- seem okay, but ipp033 has been largely unstable so remove for nightly processing
  • 14:30 MEH: nightly science now finally finished except for MD02 diffims, setting up now with the new 1D convolution version refstack to look at --
  • 15:30 LAP had hiccup in DB situation earlier -- current LAP runs now up to ~150
  • 16:20 MEH: doing restart of stdscience to start MD02 diffims and so won't have slowdown of nightly science in morning
  • 18:55 MEH: doing another restart of stdsci -- 2.5k chip->warp to update so tried to overload system -- got 30k imfiles through but reset for night
  • 22:30 MEH: LAP processing is at a very stable rate -- ~200/hr, ~160/hr, 500/hr for chip, cam and warp, stack respectively
  • 23:00 MEH: summit fault appears to be for an exposure that is gone after the earlier problems -- c6546g0021o
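
The 13:00 entry above notes that mysql on ippdb01 had been running since 7/30 and asks whether a regular 30d restart would help. The arithmetic can be sketched as below. Assumptions: the start date is 2013.07.30 (the log gives only "7/30"), the check date is this log day, and the function names are hypothetical.

```python
from datetime import date


def days_running(started, checked):
    """Whole days between the service start and the day it was checked."""
    return (checked - started).days


def restart_overdue(started, checked, interval_days=30):
    """True if the service has run longer than the restart interval."""
    return days_running(started, checked) > interval_days


started = date(2013, 7, 30)   # assumed year; the log says only "7/30"
checked = date(2013, 9, 10)   # date of the log entry
print(days_running(started, checked), restart_overdue(started, checked))
```

Under these assumptions the server had been up 42 days, so it was already past a 30d interval when it was restarted.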

Wednesday : 2013.09.11

  • 00:40 MEH: MD10 nightly appears to be downloading and processing ok
  • 10:50 Bill: More STS testing
    • ran 23 sts exposures that got processed in sts.test.20130905 with the wrong reduction. They're done already.
    • Queued 34 exposures with the STS_DATASET reduction, label STS.test.20130911.
    • ERROR: forgot to change the chip recipe in STS_DATASET reduction from CHIP_AUXMASK to CHIP. Requeued as STS.test.20130912
    • The same set will be processed again later with FITS compression turned off, label STS.test.20130912.nocomp. Need to set up the recipe and reduction.
  • 13:17 Bill added the STS.test.20130912.nocomp label
  • 13:31 Bill restarted stdscience
  • 14:00 Bill there was 1 M31 exposure that never got finished due to the ipp047 outage. Set up to run it.
  • 15:30 M31 and STS data have finished; labels and survey entries removed

Thursday : 2013.09.12

  • 18:45 MEH: stdscience really needs regular restart, best before nightly data -- LAP rate has doubled from what it was most of the day (100 --> 200), so regular restarts help
  • 19:40 MEH: looks like registration is sticky... 22 behind.. pztool -clearcommonfaults looks to have unstuck it

Friday : 2013.09.13

Bill is czar today

  • 10:43 restarted pstamp and update pantasks. Setting stdscience to stop in preparation for daily restart.
  • 11:00 stdscience restarted. We're getting some faults in stdscience and stack with problems communicating with the gpc1 database.
  • 13:27 fixing corrupt warp files. It turns out that for updates the code to check outputs is commented out. I'm thinking about turning it back on...
  • 13:37 updated warp_skycell.pl to check for corruption in output fits images

Saturday : 2013.09.14

  • 10:40 MEH: time for regular restart of stdscience
  • 18:50 MEH: stdsci reaching 100k count, should be restarted before nightly

Sunday : 2013.09.15