PS1 IPP Czar Logs for the week 2015.05.05 - 2015.05.11

Monday : 2015.05.05

Bill is the (absent minded) czar today

  • 09:30 Bill: LAP is stalled by missing files. Took 1 set of c2 nodes from stack and added it to staticsky
  • 11:10 MEH: running a batch fix of these broken burn.tbl to unstick LAP -- things moving again, huge backlog of stacks in process
  • 12:30 CZW: One last burntool table seemed to be gumming things up. The following sorted it out (the previous table was bad as well):
        ipp_apply_burntool_single.pl --continue 2 --class_id XY64 --exp_id 265113 --this_uri neb://ipp051.0/gpc1/20101216/o5546g0024o/o5546g0024o.ota64.fits --previous_uri neb://ipp051.0/gpc1/20101216/o5546g0023o/o5546g0023o.ota64.fits --dbname gpc1
  • 13:40 Bill: ipp051 is back up.
    • Haydn says ippc16 has an alarm and warning light. Took it down so that he could have a look at it.
  • 14:20 MEH: turning ippsXX on in staticsky while preparing next deepstack run
  • 14:48 Bill: after Haydn opened ippc16 up and reseated connectors, etc., it stopped beeping. It is back online.
  • 15:31 Executed periodic shut down and restart of pantasks (except stack and staticsky).

Tuesday : 2015.05.06

Bill is czar today

  • 06:52 removed lap label from stdscience while nightly science warps finish up
  • 07:12 restarted staticsky in galactic plane with only 1 x c2, 2 x m0, and 2 x m1 nodes enabled. Once I see the memory usage I may add more c2s.
  • 07:50 MEH: time to cycle out the ipps machines
  • 13:58 staticsky memory usage is mostly staying below 25G. Trying 2 at a time.
  • 14:57 stdscience restarted
  • 15:37 started a script on ippc03 to delete the staticsky residual images which we don't need to save.
  • 18:39 staticsky memory use is getting out of control. Set one set of the c2s to off

Wednesday : 2015.05.07

  • 06:26 Bill: removed LAP label from stdscience until nightly warps finish
    • 07:36 added lap label back in
  • 10:05 Bill enabled second set of c2 nodes in staticsky (nothing crashed last night and we should be getting further from the really high density regions).
  • 11:00 ippc63 can be added back to the c2 group
  • 16:00 Bill: 2 x c2 is growing vm use to > 80G in some cases. Back to 1x
  • 18:20 Bill: cleared two diff faults with
        difftool -updatediffskyfile -set_quality 14006 -diff_id 545777 -fault 0 -skycell_id skycell.2121.090
        difftool -updatediffskyfile -set_quality 14006 -diff_id 545777 -fault 0 -skycell_id skycell.2192.037
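
A minimal sketch of how the two command lines above could be generated for a list of skycells rather than typed by hand. The helper name `difftool_clear_fault_cmd` is hypothetical, the flags and values are copied verbatim from the log entry above, and the script only builds and prints the strings; it does not run difftool.

```python
import shlex


def difftool_clear_fault_cmd(diff_id, skycell_id, quality=14006, fault=0):
    """Build (but do not execute) one fault-clearing command line.

    Hypothetical helper: the flags are exactly those in the 18:20 log
    entry, and the default quality/fault values are the ones used there.
    """
    args = [
        "difftool", "-updatediffskyfile",
        "-set_quality", str(quality),
        "-diff_id", str(diff_id),
        "-fault", str(fault),
        "-skycell_id", skycell_id,
    ]
    # shlex.quote leaves these plain tokens unchanged but guards odd input.
    return " ".join(shlex.quote(a) for a in args)


# The two skycells cleared in the log entry above.
skycells = ["skycell.2121.090", "skycell.2192.037"]
cmds = [difftool_clear_fault_cmd(545777, s) for s in skycells]
for cmd in cmds:
    print(cmd)
```

Printing first and pasting afterwards keeps a record of exactly what was run, which suits a czar log entry.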
    

Thursday : 2015.05.08

  • 04:27 Bill: removed lap label from stdscience
  • 05:10 set second set of c2 nodes on in staticsky
  • 07:56 Bill: restarted stdscience pantasks
  • 11:25 MEH: ipp051 has been out since 5/5 after new mobo installed... probably should be added back into processing..

Friday : 2015.05.09

Mark is czar

  • 08:30 MEH: large number of PSS chips fault 3, M31.rp.2013.20091223
     -> pmFPAfromFilename (pmFPAfileDefine.c:533): Problem in configure files
         Failed to determine camera format for neb://ipp021.0/gpc1/auxmask/mpg.20130506/det_969/mask_XY33_5174-5196.fits
    
    • Bill notes these cannot be updated w/ the auxmask and no alternative exists for the time being; to clear, just send back to goto_cleaned
  • 08:40 MEH: 2x c2 on in staticsky for single run (looks like c44,53,55 may have been fully off) + 1x c2 on in stack, once diffims finish then regular restart stdsci and pushing more power into stack
  • 10:00 MEH: manual transfer power stdsci->stk since no chip->warp to do
    s3: -5 stdsci, +3 stk 
    s2: -4 stdsci, +2 stk
    s1: -3 stdsci, +2 stk
    
  • 10:20 MEH: will keep stack out of c2 for a while, based on Bill's email and single run set of 2x looks ok for now -- 2x c2 for staticsky only
  • 13:20 MEH: 1x c2 in stack on/off for a set while watching mem use+rate
  • 13:45 MEH: lost contact with ipp cluster -- Gavin notes power issues at MHPCC area, MRTC-B on generator, network stuff in MRTC-A with no power. Haydn on way down.
  • 14:45 MEH: power and network back up -- ippc02,05,06,08 and unfortunately ippdb01 were down -- seem to have booted okay, now to fix all the errors and neb problems..
  • 15:40 MEH: still fixing stack and stdsci. summitcopy 500 timeouts seem to be clearing now?
  • 16:20 MEH: summit back and downloading proceeding. Load of neb instance conflicts for all stages to clear (neb-mv to .bad). A few summitcopy FITS needed the same, so will have dupe ones to clean up later..
  • 16:30 MEH: since late in day, leaving system configured for nightly -- staticsky 2x c2 and stack 1x c2
  • 16:50 MEH: stk not running until sure cleared all faults properly, revert for stdsci also off for same reason -- seems okay now, running -- nope, looks like a few camera and a warp run have a problem.. leaving on for nightly
  • 17:15 CZW: tested a pole queue lapRun to ensure that both 3PI and CNP will be selected. This checked out (lap_id = 24160), so I have appended the pv2 pole queue to the truncated end of the current queue. This will allow the currently unfinished runs over the bulk of the 3PI to complete, and continue to launch the pole.
  • 22:00 MEH: lots of nightly data, LAP label out for a while

Saturday : 2015.05.10

  • 01:40 MEH: seriously.. ipp033 down for about 10min.. nothing on console.. attempting power cycle -- back up
  • 01:50 MEH: adding 1x c2 to staticsky -- mem load looks like it can handle 3x for a bit, turn c2 off in stack since LAP stacks caught up
  • 11:00 MEH: odd LAP faults probably from ippdb01 going down yesterday, reverts off to collect listing
    • o5758g0202o, 566487 full but fault 5 and smf, log.update ok -- set_fault 0
    • 52 or so exposures full, log and update smf there, nowhere near current processing area but fault 2 -- leaving, something odd but probably can just set_fault 0
    • 9 or so exposures in pole area fault after power glitch -- leaving
    • some state drop -- leaving
    • will leave camera.revert.off until nightly starts, raise camera.poll 35
  • 11:50 MEH: QUB PSS updates behind a large number of WS diffs from last night, chip.off for a bit

Sunday : 2015.05.11

  • 13:00 MEH: nightly WS diffims done, then regular restart of stdsci