PS1 IPP Czar Logs for the week 2014.11.17 - 2014.11.23

Monday : 2014.11.17

  • 01:20 MEH: ippsXX in use for MD PV3 -- ~ippmd/stdscience/ptolemy.rc if need to turn off
  • 12:50 MEH: going to start using ippx001-020 for MD processing rate tests, turning off slowly in stdlocal
    • ippx013 many faults -- Gavin fixed missing entries in autofs

Tuesday : 2014.11.18

  • 07:30 MEH: stdsci underloading, will need a restart after nightly finishes along w/ summit and reg. removing update labels for now -- was well underloaded so restarted
    • lanl stdlocal also seems to be using storage nodes while nightly is not finished; 2x stacks on ipp039 (only 8 cores as well) were causing chips to run ~5x longer than normal. manually turning storage nodes off
  • 09:45 MEH: nightly finally finished
  • 09:30 MEH: ippsXX and ippxNNN off for ippmd, turned on in lanl stdlocal for SAS processing
  • 10:55 EAM: restarted stdlocal with the following modified logic for loading:
    • storage hosts are either all on (daytime, 6AM HST - 7PM HST) or all off (nighttime)
    • chip-warp processing is blocked only if > 10k nightly science items are outstanding
    • ippxNNN nodes are loaded in 2 blocks: x0 and x1, corresponding to machines on the same PDU
    • ippxNNN and ippsNN nodes are not ON by default, but are loaded for ease of pcontrol operations
    • added the SAS.20141118 label to process the newest SAS run.
  • 11:10 MEH: turning ippsXX and ippxNNN on in stdlocal for SAS priority
  • 21:35 MEH: registration backed up >100 because of a fault? had to manually run
    regtool -updateprocessedimfile -exp_id 819547 -class_id XY75 -set_state pending_burntool -dbname gpc1
  • 22:00 MEH: 100+ exposures behind, ippsXX off in stdlocal and into stdsci to catch up, then ippsXX into MD since SAS stacks ~done
  • 23:00 MEH: stdlocal also stopped until caught up -- nightly rate seems to increase, too many chip-warp jobs?
  • 00:45 MEH: mostly caught up, stdlocal back on w/o ippsXX nodes
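
The stdlocal loading rules from the 10:55 EAM entry above can be sketched as two small predicates. This is an illustrative Python sketch only, not the pantasks configuration actually used: the function and constant names are hypothetical, and treating 6AM as inclusive and 7PM as exclusive is an assumption the log does not state.

```python
from datetime import time

# Daytime window quoted in the log: 6AM HST - 7PM HST.
# Assumption: start is inclusive, end is exclusive.
DAY_START_HST = time(6, 0)
DAY_END_HST = time(19, 0)

# Chip-warp processing is blocked only above this nightly science backlog.
NIGHTLY_BACKLOG_LIMIT = 10_000


def storage_hosts_on(now_hst: time) -> bool:
    """Storage hosts are all on during the day and all off at night."""
    return DAY_START_HST <= now_hst < DAY_END_HST


def chip_warp_blocked(outstanding_nightly_items: int) -> bool:
    """Block chip-warp only if more than 10k nightly items are outstanding."""
    return outstanding_nightly_items > NIGHTLY_BACKLOG_LIMIT
```

For example, storage_hosts_on(time(12, 0)) is true and storage_hosts_on(time(23, 0)) is false; a backlog of exactly 10000 items does not block chip-warp under this reading of "> 10k".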

Wednesday : 2014.11.19

  • 00:55 MEH: ippsXX in use by ~ippmd/stdscience/ptolemy.rc
  • 05:05 EAM: ipp036 crashed, nothing on console. rebooting
  • 05:15 EAM: stopping & restarting stdlocal. xnodes on, snodes off
  • 07:30 MEH: registration stuck again -- 174 exposures behind. oddly chip processing seems to have run out of chips ~0400-0500 on the ippmonitor plots..
    • needed to manually run
      regtool -updateprocessedimfile -exp_id 820015 -class_id XY55 -set_state pending_burntool -dbname gpc1
    • again flipping on ippsXX nodes in stdsci and stdlocal off
  • 09:20 MEH: nightly mostly finished, ippsXX back to ippmd, ippxNNN back to stdlocal
  • 09:30 MEH: failing skycell
    cannot build growth curve (psf model is invalid everywhere)
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1201568 -skycell_id skycell.1787.029
  • 12:45 MEH: ippmd will be taking ippxNNN x0 group from stdlocal (doesn't seem to be fully loaded)
  • 19:30 MEH: another build up of reg faults
  • 21:00 MEH: seeing chip jobs taking ~2x longer than normal, network is reaching 3GB/s so turning 2x x1 off in stdlocal and 2x off all in ippmd
  • 22:10 MEH: stdsci poll underloaded.. doing quick restart
  • 23:00 MEH: looks like more stacks being run by stdlocal now, stack poll increased? chip times better probably from fewer chips overall being processed, could bump ippx nodes up possibly but cannot monitor right now

Thursday : 2014.11.20

  • 04:10 EAM: ipp035 crashed, nothing on console, rebooting.
  • 10:30 EAM: ipp035 crashed again, nothing on console, rebooting.
  • 13:30 MEH: ippx group x0 off in ippmd --
  • 14:45 CZW: roboczar complained about pstamp. It has multiple 100k Ngood jobs, so I'm going to stop and restart it.
  • 15:20 CZW: stdlanl has lots of fails/timeouts/etc, so I'm restarting it to make it easier to work with.

Friday : 2014.11.21

  • 01:35 MEH: looks like SAS has finished staticsky runs and the ippxNNN nodes are idle -- putting the x0 group back into ippmd until needed again
  • 14:20 MEH: ippxNNN x1 group off in stdlocal, will use for a large test run of CNP stacks for ~5 hrs. when finished, will turn back on in stdlocal
    • ippx024 missing /local/ipp symlink (pointing to ipp053.0..)
    • ippx013 also having fault problem but has /local/ipp symlink so maybe something else -- ippx010,x004 same issue -- looks like missing /dev/md3 entry in the /etc/mtab (and /proc/mounts) even though disk is accessible (can r/w to it) -- see the stdlocal ~ipplanl/stdlocal/pantasks.stdout.log littered with these faults for stacks as well..
  • 16:00 MEH: Haydn reports ippx016 seems to have lost its boot/OS partition when rebooted to try and recover ~16G of memory
  • 20:10 EAM: the problem with /dev/md3 reported by Mark above is caused by the entry being commented out in /etc/fstab. This is true for ippx004, ippx010, ippx013, ippx039. I am fixing this by correcting the /etc/fstab entries, manually mounting the partitions, and re-running the rsyncs
  • 20:15 MEH: ippmd stack run mostly finished, will be adding ippxNNN x1 group machines back into stdlocal. ippxNNN x0 group machines will be added back to stdlocal in the morning (ippmd still using)
  • 21:30 MEH: ipp036 looking unresponsive and nothing on console.. neb-host down (from repair) and out of processing -- leaving out of processing, looks like summitcopy won't drop its job, registration did drop, 1/6 in stdscience is also not dropping job -- restart summitcopy and stdscience (needing a restart anyways)
    • registration also got wedged again -- ipp036 crash stuck o6983g0219o but registration was stuck on o6983g0211o, manually cleared
      regtool -updateprocessedimfile -exp_id 821847 -class_id XY55 -set_state pending_burntool -dbname gpc1
    • stdsci chip.imfile.load hitting many timeouts, few chips being loaded.. not sure the cause.. but taking out update labels for the night
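
The /dev/md3 condition described in the 14:20 and 20:10 entries above (entry commented out in /etc/fstab, device absent from /proc/mounts) could be checked per host with something like the following. This is a hedged Python sketch, not the procedure actually used: the function names are hypothetical, and it operates on file contents passed in as text so it can be tried on any sample.

```python
def fstab_status(fstab_text: str, device: str) -> str:
    """Return 'active', 'commented', or 'missing' for a device in fstab text."""
    status = "missing"
    for line in fstab_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        commented = stripped.startswith("#")
        # Drop any leading '#' so a commented-out entry can still be recognized.
        fields = stripped.lstrip("#").split()
        if fields and fields[0] == device:
            if not commented:
                return "active"
            status = "commented"
    return status


def is_mounted(mounts_text: str, device: str) -> bool:
    """True if the device appears as the first field of a /proc/mounts line."""
    return any(line.split()[:1] == [device] for line in mounts_text.splitlines())
```

On a node, the inputs would be the contents of /etc/fstab and /proc/mounts; a host reporting 'commented' for /dev/md3 and not mounted would match the state described for ippx004, ippx010, ippx013 and ippx039.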

Saturday : 2014.11.22

  • 08:20 MEH: ippxNNN group x0 off in ippmd and on in stdlocal for use later in relastro run
    • x0 no longer includes ippx001 -- ippmd will use that off and on for deep stack CNP tests
  • 11:55 EAM: stopping pantaskses to restart mysql.

Sunday : 2014.11.23