PS1 IPP Czar Logs for the week 2013.10.14 - 2013.10.20

Monday : 2013.10.14

Happy Discoverers Day!

  • 06:50 Bill: ipp055 is down, which has ground the entire cluster to a halt. Set it to down in nebulous.
    • Users were waiting for the postage stamp server, so I restarted it and the update pantasks.
    • stdscience was a backlogged mess, so I shut it down. The space-by-nodes chart is a disaster. Set the other pantasks to stop to let the jobs clear.
  • 07:05 Bill:
    • Gene got ipp055 up again. Set it to repair in nebulous.
    • "who uses cluster" list looks clean. Probably safe to restart the pantasks. Restarted stdscience.

Tuesday : 2013.10.15

  • 14:10 MEH: in case it was missed in last week's czar log, deepstack is running and will keep running for quite some time unless priorities shift. It was set up with 2x compute3 taken from stack and uses 40-70% of the RAM on these machines (48GB). It may be possible to add 1x compute3 back to stack (for nightly and LAP stacks) -- will manually add and monitor.
    • usage is no worse than normal and it raises the stack rate back to 300 when loaders are running, so adding it to the general startup -- ippconfig/pantasks_hosts.input
  • heather turned off pantasks at 11:37 so rita could restart ipp037, she then turned pantasks back to run
  • heather notes something happened at 12:30ish and loaders stalled... ?
  • mid-afternoon (heather+MEH) ipp062 was in a funky state (it flatlined about noonish?). MEH got nfs going again, heather restarted loaders.
  • 19:00 MEH restarting stdscience
  • 22:00 MEH cleared backlog of 30 exposures in registration (fault+check_burntool)

Wednesday : 2013.10.16

  • 7:30am HAF unsticks registration and restarts it
  • 7:45 Gavin reboots ipp063 - this is why registration was stuck. Heather needs to check/restart loaders
    • MEH: ipp063 logs show it didn't stop responding until ~06:38, mostly after observing and doesn't seem to be the direct cause of stalling registration since >200 exposures were backed up.
  • 8:25 Bill (the czar) just woke up.
    • fixed (the latest?) cause for registration stalling with: regtool -updateprocessedimfile -set_state pending_burntool -burntool_state 0 -exp_id 665227 -class_id XY27
  • 08:46 Been poking around. Every machine I log into seems to be sluggish with many processes running. Setting stack pantasks to stop to give more cycles to the backlogged stdscience.
  • 09:27 fixed a few more bad burntool_state values. The problem may be that certain nodes are having trouble communicating with the gpc1 database. Set ipp041 and ipp031 to off in stdscience pantasks because they seem to be the source of many faults.
  • 09:40 MEH: diff_id=485036,485053 skycell.skycell.2288.093 appears to have input file oddity triggering fault 5
    • IPP_IDET goes from normal index+1 to the same value of 2147483648 at 3523 -- how?
    • the value corresponds to -- OFF_CHIP 2147483648 peak lands off edge of chip
       pmReadoutReadObjects (pmSourceIO.c:1077) : seq < 0 for source 3527: Suspect neb://stsci13.2/gpc1/ThreePi.nt/2013/10/16//o6581g0167o.665159/o6581g0167o.665159.wrp.849870.skycell.2288.093.cmf is corrupt
    • manual rerun of warp produces a normal CMF and WS diff w/ manual quality not set completed okay (485053)
  • 11:30 someone manually set quality on diff_id 485036, 484888 fault 5 skycells and should probably add details here for the record
  • 13:42 Restarted stdscience pantasks
  • 15:22 queued STS data for cleanup; queued 2012-09 data to be processed
  • 19:00 MEH: nightly starting and system getting abused.. turning 1x wave1 off in stdsci --
    • ipp040 unresponsive for ~5 min
    • ipp015 unresponsive for ~20 min
    • @19:20 taking another wave1 out of stdsci -- total 2x now
  • 20:10 MEH: registration >50 exposures behind.. 4 imfile faults, looks like one might be stalled on ipp040.. will try to clear..
    • ipp040 overload, and possibly other machines may follow if extra processing running and not balanced..
  • 20:34 Bill set sts label to inactive
  • 21:09 restarted summitcopy pantasks
  • 21:15 Bill: just realized that I left camera off earlier today. It's now very backed up.
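
For the record on the 09:40 entry: the IPP_IDET oddity (a running index that stops incrementing and sticks at 2147483648) is the kind of thing a quick scan over the column can flag. The following is only a minimal sketch on synthetic data, not the check actually used; the function name find_index_break and the sample list are hypothetical, and reading the real CMF column is left out. Note that 2147483648 is 2**31, one more than the largest signed 32-bit integer; whether that is related to the corruption is not established here.

```python
# Flag value quoted in the log for OFF_CHIP; it equals 2**31 (0x80000000).
OFF_CHIP = 2147483648


def find_index_break(values):
    """Return the position of the first entry that is not previous + 1.

    Returns None if every entry is exactly one more than the entry before it.
    """
    for pos in range(1, len(values)):
        if values[pos] != values[pos - 1] + 1:
            return pos
    return None


# Synthetic stand-in for an IPP_IDET column (NOT data from the real CMF):
# a normal running index that switches to the OFF_CHIP value at position 3523.
values = list(range(3523)) + [OFF_CHIP] * 4

pos = find_index_break(values)
print(pos)                      # position where the running index breaks
print(values[pos] == OFF_CHIP)  # whether the stuck value matches the flag
```

If the first irregular value matches the OFF_CHIP number, as in the 09:40 report, that at least points at the flag value being written into the index column rather than at a random corruption.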

Thursday : 2013.10.17

Bill is czar today

  • 06:50 Looks like no problems with processing last night. There are about 40 science exposures left to download and burntool is keeping up.
  • 11:00 dist cleanup has been turned on
  • 11:24 Set 112 sts chipRuns whose distRuns were prematurely cleaned up to be updated
  • 17:05 MEH: ipp061 is down, and not just overloaded/stalled. console shows loads of stuff.. ipp061-crash-20131017T153400, power cycling -- back up
  • 18:27 Setting stdscience to stop for belated daily restart
    • 18:39 restarted

Friday : 2013.10.18

Saturday : 2013.10.19

  • 10:20 MEH: odd camera fault 3 -- neb://any/gpc1/ThreePi.nt/2013/10/19//o6584g0204o.666403/
    Unable to parse camera.
     -> pmConfigConvertFilename (pmConfig.c:1802): System error
         Unable to access file neb://ipp041.0/gpc1/ThreePi.nt/2013/10/19//o6584g0204o.666403/ nebclient.c:535 nebFind() - no instances found
     -> readPHUfromFilename (pmFPAfileDefine.c:454): System error
         Failed to convert file name neb://ipp041.0/gpc1/ThreePi.nt/2013/10/19//o6584g0204o.666403/
     -> fpaFileDefineFromArray (pmFPAfileDefine.c:641): System error
         Failed to read PHU for neb://ipp041.0/gpc1/ThreePi.nt/2013/10/19//o6584g0204o.666403/
     -> ppImageDefineFile (ppImageDefineFile.c:17): unknown psLib error
         failed to load file definition ARG LIST
     -> ppImageParseCamera (ppImageParseCamera.c:12): I/O error
         Can't find an input image source
    -- will need to rerun chip
    perl ~ipp/src/ipp-20130712/tools/ --chip_id  902161  --class_id XY62 --redirect-output
  • 17:50 MEH: starting MD.pv2 update processing during nightly science, STS reprocessing during day (label out at night), LAP if nothing else to do
    • restarting stdsci
    • 1x compute3 out from stack again and back to deepstack for starting MD03,04 refstack early November
  • 18:20 MEH: ipp053 unhappy, huge wait% according to ganglia and unable to login.. dvo_client again?? sts jobs? -- seems to be back after 20min.. taking out of processing for a bit

Sunday : 2013.10.20

  • 09:04 Bill: restarted pstamp and update pantasks
  • 10:30 MEH: regular restart stdsci, STS label back in
  • 20:40 MEH: ipp062 mounts appear borked.. nfs restart to try and reset/clear -- nightly registration/processing slowly catching up, may have borked other things but nothing can be done about that