PS1 IPP Czar Logs for the week 2014.10.13 - 2014.10.20

Monday : 2014.10.13

  • if the pixelservers are shut down as described in the PS1 shutdown list, we will want to be sure all nightly data is downloaded by early morning
  • 06:00 MEH: nightly data downloaded
  • 09:00 MEH: no nightly data until ~Wednesday -- disable nightly shutoff of nodes in lanl stdlocal
    • looks like lanl stdlocal needs a regular restart anyways
    • after the restart, will see if this works tonight or if something else also needs to be flipped -- server input tweak_offstorage
  • 12:03 Bill: restarted pstamp pantasks and queued ps_ud label for cleanup.
    • also dropped warp_id 453790 whose corresponding camRun 472138 was in state drop, which prevents updates for that warpRun from succeeding.
  • 13:41 Bill: turned MPE label off in pstamp. There is a large backlog of requests to be cleaned for that label and the new ones are running so fast that we are falling behind on space.
    • 14:15 cleanup is done, now 75% full. Adding MPE label back in.

Tuesday : 2014.10.14

  • 09:20 MEH: Ken has noted lanl stdlocal has been pretty much idle since ~4am -- looks like a handful of mem fault warps and many fault 2 stacks that we can try to clear. Actually only one was a mem fault; three are the following, and I don't know what has been done for this case:
    cannot build growth curve (psf model is invalid everywhere)
    Backtrace depth: 11
    Backtrace 0: p_psAssert
    Backtrace 1: pmGrowthCurveGenerate
    Backtrace 2: psphotMakeGrowthCurve
    Backtrace 3: psphotChoosePSFReadout
    Backtrace 4: psphotChoosePSF
    Backtrace 5: psphotReadoutFindPSF
    Backtrace 6: (unknown)
    Backtrace 7: (unknown)
    Backtrace 8: (unknown)
    Backtrace 9: __libc_start_main
    Backtrace 10: (unknown)
    Assertion failed in function pmGrowthCurveGenerate at pmGrowthCurveGenerate.c:89. Error stack:
    
  • 10:45 CZW: restarting stdlocal, stopping stdlanl to do LANL side cleanup.

Wednesday : 2014.10.15

  • 08:45 CZW: All ipp/ipplanl pantasks servers shut down in preparation for MRTC-B power outage.
  • 14:15 CZW: Most of the nodes are back online, so I've restarted the ipp/ipplanl pantasks. I'm going to allow LANL processing to happen, although I'm still keeping an eye on the disk usage there.

Thursday : 2014.10.16

  • 13:00 CZW: I've put ipp082 into repair because it has a very high load. It's responsive, but I want to see if letting it cool off a bit will clear some of the fault 2 issues I'm seeing in processing.
  • 16:15 CZW: I attempted to put ipp082 back to up after reassigning cabinets for the other ipp07X nodes to ensure replication wasn't hitting ipp082 constantly. This resulted in ipp082 shooting back up to a load of 200+, so it's back to repair. It was fine before the shutdown, but I can't see anything different with its configuration.

Friday : 2014.10.17

  • 09:25 EAM : No PS1 data today (still offline @ summit). Hurricane Ana is approaching but should pass well south of Maui. We currently do NOT plan on shutting down the IPP over the weekend. Processing on stdlocal is a bit sluggish, so I am stopping stack and will restart pantasks when the stacks have all cleared.
  • 10:45 EAM : restarted stdlocal, with storage hosts on.
  • 14:15 CZW: set ipp080 to up in nebulous.
  • 20:55 MEH: taking 2x ippsXX nodes from lanl stdlocal for md09 test processing -- returned

Saturday : 2014.10.18

  • 20:40 EAM : stdlocal is running slow (170k chips, etc). I'm stopping stacks and will restart when the stacks clear

Sunday : 2014.10.19

  • 14:45 EAM : restarting stdlocal again (113k warps, 89k chips).
  • 20:55 EAM : ~12 warps were failing due to the psf curve of growth problem. I've set them to a bad quality (42); the script for this is ~ipplanl/stdlocal/warp.faults.201401019, which gives the warp ids and skycell ids in case we want to retry them.
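
The growth-curve faults noted on Tuesday and again on Sunday end in the same "Assertion failed in function ... at file:line" message, so faults of this kind could in principle be grouped by that line when deciding which ones to set aside. The following is only a minimal sketch, assuming log text in the shape pasted in Tuesday's entry; count_assertions and ASSERT_RE are hypothetical names for illustration and are not part of the IPP tools.

```python
import re
from collections import Counter

# Matches lines of the form seen in Tuesday's entry, e.g.
#   Assertion failed in function pmGrowthCurveGenerate at pmGrowthCurveGenerate.c:89.
# (Assumption: other assertion failures follow the same wording.)
ASSERT_RE = re.compile(
    r"Assertion failed in function (?P<func>\w+) at (?P<file>[\w./-]+):(?P<line>\d+)"
)


def count_assertions(log_text):
    """Count assertion failures in log_text, keyed by (function, file, line)."""
    counts = Counter()
    for match in ASSERT_RE.finditer(log_text):
        key = (match["func"], match["file"], int(match["line"]))
        counts[key] += 1
    return counts
```

Running this over the concatenated fault logs would give one count per distinct assertion site, which makes it easier to see whether a batch of failing warps shares a single cause.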