PS1 IPP Czar Logs for the week 2015.01.12 - 2015.01.18


Monday : 2015.01.12

  • 00:40 MEH: clearing a fault 5 diff
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 635299 -skycell_id skycell.1607.047  -dbname gpc1
    
  • 00:50 MEH: stdsci poll still low, restarted
  • 01:00 MEH: stdlocal set.poll 100 in case it does catch up; nightly has priority and needs to be finished in time for the fiber switchover
  • 02:00 MEH: ipp095 overloaded a bit again, to repair
  • 06:57 Bill: two warps were repeatedly getting fault 4 from the "cannot compute curve of growth psf is invalid everywhere" assertion. Set quality for them
    warptool -set_quality 42 -fault 0 -updateskyfile -warp_id 1345413 -skycell_id skycell.0509.081
    warptool -set_quality 42 -fault 0 -updateskyfile -warp_id 1345472 -skycell_id skycell.1084.070
    
  • 07:00 Bill: ipp090 and ipp093 have rather high loads as well.
    • but processing rate was generally okay
  • 10:20 MEH: clearing some fault 5 WSdiffs
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 635300 -skycell_id skycell.1607.038  -dbname gpc1
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 635300 -skycell_id skycell.1607.047  -dbname gpc1
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 635450 -skycell_id skycell.1247.076  -dbname gpc1
    
  • 10:30 MEH: clearing some old WSdiffs that won't complete because they are old and their images have been cleaned up.. -- when there is time it may be possible to update the warps so they complete, but...
    difftool -updaterun -diff_id 634284 -dbname gpc1 -set_label ThreePi.WS.nightlyscience.missed
    difftool -updaterun -diff_id 621228 -dbname gpc1 -set_label OSS.WS.nightlyscience.missed
    
    • did manual chip+warp updates where necessary**; these were actually fault 5 and needed quality set, so the remaining diffs should now be released ok
      difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 634284 -skycell_id skycell.0679.021  -dbname gpc1
      difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 621228 -skycell_id skycell.1529.088  -dbname gpc1
      
    • **chips for 1/10 were cleaned up already, but not warps
  • 11:50 MEH: starting a stdlocal chip-only test of 10G link usage, stopping all other processing since it will need to be cleared for the fiber changes in a couple of hours anyway
    • poll~100 chip only uses ~4Gbits/s both in and out on link
    • poll~150 uses ~6-7Gbits/s both in and out on link
    • nightly overall uses ~5-7Gbits/s mostly 6-7G out and 5-6G in -- so even 150 w/ stdlocal was a bit high during nightly
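    • for a rough per-host cross-check of numbers like these, something along the lines of the sketch below can be run on a busy node (the 6509e totals above come from the switch-side monitoring; the interface name eth0 and the 10s sample window are assumptions)
      IF=eth0                                                  # interface name is an assumption
      R1=$(cat /sys/class/net/$IF/statistics/rx_bytes); T1=$(cat /sys/class/net/$IF/statistics/tx_bytes)
      sleep 10
      R2=$(cat /sys/class/net/$IF/statistics/rx_bytes); T2=$(cat /sys/class/net/$IF/statistics/tx_bytes)
      echo "in  Gbit/s: $(echo "($R2-$R1)*8/10/1000000000" | bc -l)"
      echo "out Gbit/s: $(echo "($T2-$T1)*8/10/1000000000" | bc -l)"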
  • 11:55 MEH: ippc11 screen issue:
    There are screens on:
    	13722.robo	(Detached)
    	12542.RoboCzar	(Dead ???)
    	19893.CzarPoll	(Dead ???)
    	13565.plot	(Detached)
    Remove dead screens with 'screen -wipe'.
    
    • wiped, cleared other names and restarted screens
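    • a minimal sketch of that recovery, assuming the session names from the listing above:
      screen -ls                # list sessions; the broken ones show as (Dead ???)
      screen -wipe              # remove the dead RoboCzar/CzarPoll entries
      screen -dmS RoboCzar      # recreate the named sessions, detached
      screen -dmS CzarPoll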
  • 14:40 MEH: possible changes to network plans -- setting stdlocal back to full poll until nightly starts
  • 19:40 MEH: nightly starting, stdlocal set.poll 100 -- seems to be just at the limit; there may be a slow slippage in nightly chip processing which will eventually turn chip-warp off in stdlocal. ~1300 camera entries, so it would be good to avoid triggering the stdlocal auto-off.. so set.poll 80

Tuesday : 2015.01.13

  • 07:45 MEH: seems stdlocal set.poll~80 is still at the limit: it caused enough of a backup that stdlocal triggered the chip-warp auto-stop at about 2am. stdsci processing mostly kept up, however, including registration. stdlocal set.poll back to 400 ~0830
  • 09:10 MEH: restarting pstamp
  • 10:40 MEH: processing to stop -- going to start modifying systems
    • decommission jaws nodes -- move pantasks to ippc01-c09 --
    • ippx037-x044 + 4849 switch to n5k
    • stsci IP to .30. network
    • redistribute other ippx across multiple 4849 on 6509e
  • 15:45 MEH: stsci IP moved to .30. net, x037-x044 moved to n5k and .20. net, all pantasks moved off jaws nodes -- Gene gave the all-clear to start the nightly pantasks so they are ready for data tonight
    • while processing is down, going to rotate the apache ippc0x logs --
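      • a hedged sketch of a manual rotation (the log paths and restart command are assumptions, not a record of what was actually run):
        cd /var/log/apache2
        mv access_log access_log.$(date +%Y%m%d)
        mv error_log  error_log.$(date +%Y%m%d)
        apachectl graceful      # a graceful restart makes apache reopen its log files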
    • finding stagnant stsci mounts on ippc18,c19,c11 so far.. starting to force umount.. -- cleared ippc18,c19; Gene working on c11
      • this is why ipp032 had a high load while doing nothing -- ipp032 is clear and Serge/MOPS report all good; all other datanodes seem fine
    • Gene: ippc20-c32-ish also have stsci mounts -- Gene working on them -- looks like stdsci and pstamp are using some of these mixed LANL compute nodes in the c2 group -- stsci and pstamp stopped until cleared.. summitcopy+registration can continue
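    • for reference, the sort of sequence used to clear a stale mount (a sketch; the mount path is illustrative only):
      mount | grep stsci                 # find the stale stsci NFS mounts
      fuser -vm /data/stsci03.1          # see which processes still hold the mount (may hang on a fully wedged mount)
      umount -f /data/stsci03.1          # force the unmount
      umount -l /data/stsci03.1          # fall back to a lazy unmount if the force fails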
  • 17:05 MEH: Gene cleared the compute mounts, stsci+pstamp back on. reverting the update faults this caused for ps_ud_QUB
  • 17:25 CZW: stdlanl and stdlocal restarted and running. At Gene's recommendation, the poll on stdlocal has been decreased to 100 as the default setting until we see how things behave.
  • 19:20 MEH: normal processing
    • looks like summitcopy ~5-8 behind, registration ~2-3 -- normal
    • WS diffs running seem to put more load on ipp093,095,090, which may need to go to repair
    • waiting for the network 10G link to have some regularity -- @1930 -- ~6/5G into/out of 6509e w/ poll 100 (roughly that much running, but warps+camera+stacks also, so not a clean measure)
  • 19:35 MEH: larger fault spike, scrambled summitcopy order some, but catching up
  • 19:45 MEH: fallout from another massive fault spike -- took about 10 min to recover registration
    150113 19:32:34InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
    
    150113 19:42:30InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
    
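    • this warning usually means concurrent transactions on the db host have exhausted the InnoDB undo slots; a quick check along these lines (a sketch -- ippdb00 as the target and the login details are assumptions):
      mysql -h ippdb00 -e 'SHOW PROCESSLIST' | wc -l
      mysql -h ippdb00 -e 'SHOW ENGINE INNODB STATUS\G' | sed -n '/^TRANSACTIONS/,/^FILE I\/O/p'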
  • 20:00 MEH: once things pick back up, load on 10G link is ~7/7-8G.. both ipp/jaws networks are ~2G
    • set non-targeted ipp092 to repair -- too many want to write to the node in WT mode
    • generally running much better @100 chip in stdlocal
  • 20:30 MEH: stdlocal camera,warp off so chip poll 200 will have a full 200 chips running at a time
  • 20:45 MEH: another massive fault wave ~2045 --
    150113 20:47:52InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
    
  • 20:50 MEH: ippmd running single deep stack on c29 -- 0 impact
  • 21:00 MEH: chip poll 200 -- reaching 8-9G on 10G link but...
    • more rounds of massive faults -- backlog getting close to shutoff.. may not get good measure..
  • 21:15 MEH: is there something else running overloading ippdb00? stdlanl prep?
    • try stopping cleanup to free up connections?
  • 21:20 MEH: another huge fault wave -- cleanup still flushing out --
  • 22:00 MEH: cleanup cleared ~2145 and no faults since ~2130, but it was enough overloading to drive stdlocal to auto-off for everything except stacks.. that's it, no more testing except maybe 10G link load w/ 500 stdlocal stacks... leaving set.poll 200 since it seemed okay, in case it is able to catch up in a few hours (but it can likely go higher)..

Wednesday : 2015.01.14

  • 13:50 CZW Stopping ipp/ipplanl processing for Haydn to reboot ipp083/84 for RAID battery issues.
  • 16:05 CZW ipp/ipplanl processing back online. stdlocal poll is set to 400, where I'll try to keep it overnight. As cleanup may have been the problem last night, I'll stop that before we begin observing (we are not in a diskspace crunch at the moment).
  • 21:30 EAM stdscience is getting behind, and stdlocal is getting blocked. however, it is not clear (at least to me) that stdlocal is responsible (10g link 6509e <-> ippcore is not saturated, neb db queries are not faulting), nor is the lag so terrible. I'm setting the stdlocal cutoff higher (9000 from 3000) to allow it to keep running.
    • MEH: another option would be to turn the number of stdlocal nodes running down by maybe half and see if that helps ease the problem? when stdlocal switched off and mostly cleared, except for stacks and ~100 warps or so for a bit, the plot of the OSS rate seemed to pick up and there were fewer red faults showing up
  • 22:05 MEH: to help ease the nearly full 10G link, going to start turning off c2 in stdsci and replacing it with ipps and ippx(MD), since they are on the same side as the primary data nodes
  • 23:40 CZW: stsci03 is causing a large number of outstanding glockfile instances. I've set it to repair in nebulous, and I'll give it a few minutes to see if it clears itself before I start killing jobs. All issues are with stsci03.1, but I've moved the whole node to repair.
  • 01:20 MEH: ipp090 has also been giving trouble like ipp095,093, due to over-allocation in the older targeting -- should update that soon; into repair as well

Thursday : 2015.01.15

  • 10:40 EAM: stdlocal is a bit sluggish, restarting
  • 10:50 MEH: manually deploying revised and temporary (until the BBU situation is resolved) host targeting for nightly, ipp090,093,095 back to normal neb-host up
    ~ipp/psconfig/ipp-20141024.lin64/share/pantasks/modules/ipphosts.mhpcc.config
    
    • the reallocation will require cleanup to run during the day or nightly will fill up nodes quickly..
  • 12:30 MEH: restarting summitcopy+registration+stdsci to use the changes. stdsci needed its regular restart anyways...
    • ipp083,084,086,091,092 are not targeted, so it would probably be okay to neb-host them up for random data; however, they have to be watched, and it helps to keep them in repair in case they need to be taken down for BBU reset/replacement
  • 13:50 EAM: I've been tweaking stdlocal stack polling. at 100, we get a build up of stacks. at too high a level, we get saturated doing stacks and oscillate between all chip and all stack. set to 150 now.
  • 13:11 Bill: started up ~ippsky/staticsky with a whole bunch of x nodes. Then discovered that they are in use by other processing streams. After the first batch completes, I will restart with 2 sets of x0 and x1 hosts
    • 14:45 As advised by Gene, set the existing pantasks to 1 x (x0, x1, x2, x3)
  • 14:35 EAM : dropped 2x for each xnode (7 each) from stdlocal for staticsky to run. also bumped the stack poll up to 200.
  • 15:10 MEH: redoing the 4x:c2 off in stdsci and 6x:m0+1+x0b+x1b swap to ease the load on the 10G link again
  • 21:25 MEH: revised data node targeting and the 6x:m0+1+x0b+x1b swap in nightly seem to be doing ok, processing rate fairly regular. cleanup appears to still be on, but I have only noticed one major fault event (a little after ~1939)

Friday : 2015.01.16

  • 05:50 EAM : ipp014 was in an odd state: it had been rebooted into the livecd prompt. I'm trying to reboot it again.
  • 06:10 EAM : no luck -- it keeps coming up in the livecd; looks like it does not see the boot disk. sending a note to Haydn.
  • 06:12 EAM : meanwhile, lots of hung jobs across the cluster on machines with hung nfs mounts (ipp014 of course). I'm going to shut stdscience down, kill off everything, then try to clear the mounts
  • 06:45 EAM : i've cleared all of the hung mounts and am restarting all services.
  • 9:12 HAF: registration jammed, ran
     regtool -updateprocessedimfile -exp_id 855559 -class_id XY26 -set_state pending_burntool -dbname gpc1
    
    
  • 11:05 MEH: clearing a stalled warp (cannot build growth curve (psf model is invalid everywhere)) that MOPS asked about an hour ago, so processing finishes
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1355552 -skycell_id skycell.1950.023
    
  • 11:50 CZW: stdlanl didn't appear to get restarted. I've fixed that.
  • 11:55 CZW: stdlocal poll to 400.
  • 12:35 EAM: cluster still not busy enough; bumped poll to 600.
  • 12:50 MEH: will continue to try to use the MD nodes for processing during the day, with the continuing condition of not impacting LAP+nightly processing (until told the nodes need to be reassigned elsewhere as well)
  • 15:00 MEH: will try offloading the ipps+ippxMD nightly jobs to the new data nodes -- involves more possible risk/uncertainty than the c2->ipps+ippxMD swap, which was just compute nodes -- cleared w/ czar
    • ipp067-070=s4 (132G ram), ipp071-095=s5 (198G ram) -- already running 1x in summitcopy+registration+cleanup. the large ram is the reason for the group split, but that's unimportant here
    • ipp067-070 weren't in summitcopy or registration yet, being on the bad side of the 6509e 10G link -- adding them
    • ipp092 had been commented out but should be fine, need to update groups -- all should be usable except ipp094 which is still offline -- 4+24=28 nodes
    • start w/ 4x:s4+s5 loading in stdsci, all ipps+ippxMD off for tonight -- the s3 group is able to handle 8x:stdsci and 1x:sum+reg+cleanup so should be similar and will add more s4+5 as capable/needed
  • 15:30 EAM: Haydn rebooted ipp083 & ipp086 to fix RAID batteries.
  • 16:40 MEH: ipp084, ipp091 have BBUs and are in WB state -- neb-host up now, will wait to add them to the data host targeting until the rest come online as well (083,086,092), so same targeting as last night
  • 21:20 MEH: are the QUB stamps stalled? going to restart pstamp -- still seems so -- looks like some updates with the ps_ud_QUB label were then cleaned
  • 23:10 MEH: bumped stdsci up to 5x:s4+s5 -- seems to keep up. ipp067-095 have a fairly high load (cpu~60%), summitcopy+registration+stdsci fine, but like previous nights there can be load spikes ~50 and some wait%.
  • 23:20 MEH: clearing an OSS warp fault to quality -- cannot build growth curve (psf model is invalid everywhere)
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1358177 -skycell_id skycell.1609.051
    
  • 23:40 MEH: another wave of faults..
    150116 23:38:14InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
    

Saturday : 2015.01.17

  • 05:25 EAM: looks like things ran pretty well last night - we are in pretty good shape on the nightly processing and stdlocal ran solidly at 75-100 chips/hour. I notice that the downstream stages are lagging in stdlocal, and have attributed this to a backup in camera. basically, the galactic plane is slower in camera because it scales with the number of stars (even more so than chip). since we have a fairly limited number of camera jobs at a time, this becomes a bottleneck. I've bumped camera polling from 25 to 60 to address this, but it perhaps could go higher.
  • 08:30 MEH: manually sending some missed nights to cleanup from earlier in the week, OSS.20150113 and ThreePi.20150108
  • 09:00 MEH: going to try and fix the cleanup/update conflict blocking the QUB stamps.. somehow warps from 1/14 were set to cleaned while they were still being updated/used by pstamp... was this an unlogged manual cleanup/change or an odd timing conflict in cleanup?
  • 10:15 EAM: stopping stdlocal for a restart
  • 13:15 MEH: starting CNP+MD chip cleanup after verifying the warps fully finished -- ipp067-081 must get some space cleared up; while no permanent files are being targeted there, space is needed for the nightly xy+skyfile processing balance.. some shuffling may be necessary very soon...
    • with just chips, need to bump the poll to ~90 so all nodes are in use
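    • quick space check across the nodes being cleaned (a sketch; the /export/<node>.0 path for the local data partition is an assumption):
      for h in ipp0{67..81}; do
          echo -n "$h: "; ssh $h 'df -h /export/$(hostname).0 | tail -1'
      done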
  • 16:25 MEH: ipp086 BBU status is optimal now and in WB mode, so neb-host repair to up -- will temporarily add ipp084,086,091 for data tonight. stdsci is getting close to needing its regular restart anyway
  • 20:30 MEH: long-running stacks again running via ~ippmd/deepstack/ptolemy.rc

Sunday : 2015.01.18

  • 07:10 EAM: stopping and restarting stdlocal
  • 07:15 EAM: bad warps:
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1363320 -skycell_id skycell.1609.080
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1366210 -skycell_id skycell.1716.054
    
  • 09:45 EAM: nightly processing is done, i've turned on storage hosts in stdlocal as a test.
  • 10:30 EAM: looks like the link 6509e-ippcore is now running at ~6Gb/s with storage hosts in stdlocal. i've bumped the storage host loading just a bit more (1 more per s1-s3).