PS1 IPP Czar Logs for the week 2015.01.05 - 2015.01.11


Monday : 2015.01.05

  • 09:15 EAM : stdlocal is sluggish, restarting it now.
  • 10:20 EAM : ipp071 seems to be responding to logins again, so i've put it in neb repair (not down)
  • 19:20 MEH: with ipp071 available again, the half-faulted deepstacks can hopefully finish -- ~ippmd/deepstack/ptolemy.rc -- found a possible problem, deepstack stop now

Tuesday : 2015.01.06

  • 20:20 MEH: try md stacks again -- ~ippmd/deepstack/ptolemy.rc
  • 20:50 EAM: summit is getting heavy winds so I'm leaving stdlocal at high load levels. I'll check again after the KP3 telecon (11pm).
  • 21:15 EAM: stdlocal was at 120k warps so I'm restarting it.

Wednesday : 2015.01.07

  • 10:44 HAF: unjammed registration
     regtool -updateprocessedimfile -exp_id 848923 -class_id XY55 -set_state pending_burntool -dbname gpc1
     regtool -updateprocessedimfile -exp_id 848925 -class_id XY55 -set_state pending_burntool -dbname gpc1
    

  • 11:12 HAF: restarted registration

Thursday : 2015.01.08

  • 03:45 Bill: stdscience is not making progress. Jobs are in the pcontrol queue without associated processes on the target host, and no ppImage processes show up on the 'who uses cluster' page. Shutting down stdscience for a restart.
    • unfortunately the 'who uses cluster' page is not working for all of the hosts. There are some ppImage processes still running -- see for example ippc26.
    • ganglia shows ipp059 is down (should have looked there first). Unfortunately the console app won't accept my password when I try to connect with it.
    • I am hesitant to start up a new stdscience pantasks with things in this state
    • ran neb-host ipp059 down
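    • for reference, a rough sketch of the checks and the remediation used here (host names as above; exact commands are illustrative):
      # look for leftover ppImage processes on a suspect compute node (ippc26 as an example)
      ssh ippc26 pgrep -fl ppImage
      # ganglia showed ipp059 down, so take it out of nebulous targeting
      neb-host ipp059 down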

  • 05:30 MEH: power cycling ipp059, but then I need to go to the AAS booth and won't be able to check on it until later
    • the only info on the console was
      <Jan/07 10:48 pm>ipp059 login: [381619.298227] Disabling IRQ #78
      
  • 06:20 EAM: ipp059 is back up and happy, i've restarted stdscience.
  • 07:45 EAM: it looks like cab1 had a hiccup last night: all machines in that cab, except ipp009 and ipp019, rebooted about 9pm. I stopped all processing to clear out some mount point troubles.
    • it then turned out that ipp018 was having xfs problems on /export/ipp018.0. I tried to repair, but needed to reboot to umount; on boot, /dev/sda4 refused to mount.
    • I ran xfs_check /dev/sda4 and it said the metadata needed to be replayed by mounting. mounting gave an error that "Structure needs cleaning". I ran xfs_repair /dev/sda4 as advised, and it said that the log was unrecoverable. I am now running xfs_repair -L /dev/sda4 as advised, and it is trying to clean up the disk partition. The output is going to /root/sda4.repair.log.
    • I have since restarted processing with ipp018 in neb 'down'. there will probably be some errors until ipp018 is back up if anything wrote output there (was it in neb up earlier? not sure).
  • 09:10 EAM: xfs_repair finally finished and /export/ipp018.0 mounted successfully. nfs exports it just fine now.
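    • for the record, a rough recap of the recovery sequence from the last two entries (device and mount point as above; note xfs_repair -L discards the log and can lose the most recent metadata, so it is a last resort):
      xfs_check /dev/sda4                         # reported that the metadata needed to be replayed by mounting
      mount /export/ipp018.0                      # failed with "Structure needs cleaning"
      xfs_repair /dev/sda4                        # reported that the log was unrecoverable
      xfs_repair -L /dev/sda4 > /root/sda4.repair.log 2>&1
      mount /export/ipp018.0                      # succeeded once the repair finished (assuming an fstab entry)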
  • 10:43 CZW: processing is working, but there are a lot of faults caused by failures to read detrends on ipp018. The files exist, have the correct md5sum, and funpack correctly. A closer look suggests that we have mounts on some of the x-nodes that can't correctly read from ipp018. ippx090 is an example of this: attempting to read a file results in 'no such file' errors, and the mount point doesn't mount correctly. I'm going to set ipp018 to down to try and get nightly science to clear out, and when ipp018 is quiet, see if a reboot makes it behave again.
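    • a sketch of the kind of client-side check/fix involved (ippx090 as the example; the client mount path and file name here are placeholders, not the real ones):
      md5sum /data/ipp018.0/<detrend file>        # fails with 'no such file' even though the file is fine on ipp018 itself
      # once the client is otherwise idle, lazily unmount and remount to clear the stale handle
      umount -l /data/ipp018.0 && mount /data/ipp018.0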
  • 16:45 CZW: ipp051 to neb-host down so Haydn can work on a disk.
  • 16:50 CZW: restarting stdlocal pantasks.
  • 17:45 CZW: ipp051 to neb-host repair; ipp053, ipp048, ipp047, ipp045, ipp039 down for quick work.
  • 18:20 CZW: all storage hosts back online and into repair (as they were before the work), as is ipp018 (which now responds to NFS requests from the hosts it was ignoring before).

Friday : 2015.01.09

  • 04:50 MEH: ipp083 has an odd battery state and went into WT state yesterday afternoon; it doesn't behave well in that state, so to neb-host repair it goes
    Jan  8 15:36:40 ipp083 MR_MONITOR[16553]: <MRMON195> Controller ID:  0   BBU disabled; changing WB logical drives to WT, Forced WB VDs are not affected
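    • one way to confirm the controller's battery/cache state (assuming the usual LSI MegaCli tool is installed; the binary name/path may differ per host):
      MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL    # battery (BBU) health for all adapters
      MegaCli64 -LDGetProp -Cache -LAll -aAll     # current cache policy (WriteBack vs WriteThrough) per logical drive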
    
  • 04:55 MEH: registration is >200 exposures behind; 4 imfiles needed to be reverted, but registration was still ~50 exposures away from those -- is stdlocal overloading things?
    • ippx nodes are active at 2x in stdsci and 6-8x in stdlocal (normal allocation) -- was this intended? going to assume not, since it wasn't logged, and turn them off in stdsci -- ippc nodes are also actively mixed, 4x normal in stdsci and 6x in stdlocal -- with so many extra nodes in stdsci (~600), on the order of 200 are sitting IDLE, and we are far from the ~100k job issue state..
    • with the large number of exposures backed up in summitcopy and registration, stdlocal isn't triggering itself off, as it isn't sensitive to that case
  • 05:15 MEH: also looks like there might be a summitcopy problem off and on w/ 500 read timeouts, but no time to work through right now
  • 06:40 EAM: things are basically moving along, but ipp093 is heavily loaded (probably running slow), and holding up registration. I've set it to neb repair for now.
  • 08:35 MEH: two exposures had exposure-level faults that weren't clearing over the past few hours until these were run manually
    regtool -dbname gpc1 -revertprocessedexp -exp_id 849780
    regtool -dbname gpc1 -revertprocessedexp -exp_id 849790
    
  • 08:40 MEH: ipp067-071 may have their network shuffled today; setting them off in ~ippmd/deepstack/ptolemy.rc, but they may not finish in time

Saturday : 2015.01.10

  • 05:20 MEH: nightly is 300 exposures behind... stdlocal stop
    • manually revert exposure reg fault
      regtool -dbname gpc1 -revertprocessedexp -exp_id 850441
      
    • summit+reg+stsci probably could have used a restart going into the weekend -- doing now
    • with all the faulting/stalling caused by stdlocal overloading, and probably nodes in repair, registration is not keeping up -- double the load hosts and bump unwant 5->10 to catch up this morning (this will drive the >ipp067 nodes fairly hard); return to the default state when registration has finished
    • with registration more active again, ipp093 is showing some overload -- back to repair again like yesterday. once that was done, ipp095 became a favorite target and had issues. both seem to have WB/cache on, so it is unclear why this is a problem. it didn't appear to be a problem overnight; just very little was getting done overall, so they weren't bothered as much
    • stdsci still has some ippx nodes active, and stdlocal had 6x:c2; if those load stacks during nightly, then the 4x:c2 in stdsci will have problems.. (and it is unclear how adding those nodes on top of all the ippx nodes helps anything when the 10G link is already the limiting factor) -- so keeping stdlocal at stop even though nightly+MD only uses ~6G of the 10G link
    • WS label also taken out of stdsci
  • 09:00 MEH: to try to take some load off targeted nodes, ipp091, ipp092, ipp056, ipp058 to neb-host up -- they have been in repair due to failed BBU and WT state, but should be able to handle untargeted use?
    • ipp085 is in WB state (BBU ok?) but was put into repair yesterday because of load (same issue as ipp093?)
  • 09:30 MEH: out-of-order diffims for MOPS with the o7032g0093o set (from o7032g0110o being stalled in registration?); manually requeueing
    difftool -dbname gpc1 -definewarpwarp -exp_id 850421 -template_exp_id 850441 -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2015/01/10 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20150110 -set_reduction SWEETSPOT -simple -rerun
    
    difftool -dbname gpc1 -definewarpwarp -exp_id 850455 -template_exp_id 850472 -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2015/01/10 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20150110 -set_reduction SWEETSPOT -simple -rerun
    
  • 10:00 MEH: removing 4 of the 6x:c2 in stdlocal and starting at set.poll 50 (chips and warps will be 2x that; stacks have their own poll). as logged in the czarlog before the holidays, after the stsci nodes were targeted ~150 seems to just max out the 10G link with nightly (50 before stsci targeting); will see how it goes with nightly again
    • still too many exposures behind.. the chip-warp auto-off kicks in -- stacks running ok
  • 10:25 MEH: ippc28 has been down for 1d 15hr and I cannot find a note about it -- has anyone tried power cycling? waiting for a response before doing so myself.. power cycled and it seems to be okay
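    • if the console app keeps rejecting logins, an out-of-band power cycle via IPMI is one option (the BMC hostname and user below are placeholders, not the site's actual setup):
      ipmitool -I lanplus -H ippc28-ipmi -U ADMIN -a chassis power status
      ipmitool -I lanplus -H ippc28-ipmi -U ADMIN -a chassis power cycle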
  • 12:50 MEH: using the remaining WS diffs, tested set.poll in stdlocal -- 150 still just barely reaches 10G throughput, without summitcopy and registration data going over that 10G link. it would probably be good if a czar regularly checked in on the start of nightly processing, setting the stdlocal poll to 100-150 and verifying things are processing properly, particularly when the system is in such an unstable state.
    • like nightly, MD processing has been mostly hosed by stdlocal overloading -- stdlocal may look like it gets a lot of processing done, but when a large part of the following day is then lost to catching up with nightly processing instead, it is unclear how much is actually gained..
    • secondary issues like ipp093 overloading could then also be recognized more quickly and the node set to repair as needed
  • 21:20 EAM : I just cleared a stuck summitcopy -- o7033g0115o ota45 had a pztool command trying to complete on ipp012 for over an hour. I killed that pztool command and the copy completed correctly. I don't know why it hung, but it blocked registration / burntool for quite some time.
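    • roughly what that amounted to (the PID is whatever ps reports; shown here only as a pattern):
      ssh ipp012 'ps -eo pid,etime,args | grep [p]ztool'    # find the pztool command with an hour-plus elapsed time
      ssh ipp012 kill <pid>                                 # kill it; the copy then completed correctly on the retry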
  • 22:20 MEH: setting stdlocal now to set.poll 100; manually reverted o7033g0170o, 77 exposures behind --
    • however, registration catching up was enough to trigger stdlocal to auto-off chip-warp... now only 35 exposures behind (22:40)
    • bumped up unwant for registration to 10 since it was somewhat idle, which may help -- now down to 25 outstanding (22:50)
    • later, ipp093 was possibly being targeted too much again, so into repair once more -- still lagging ~200 chips in the morning; ipp095 had a similar issue, so it went into repair too, which seemed to help the chip rate a little more

Sunday : 2015.01.11

  • 07:00 MEH: ipp093, 095 back neb-host up once summitcopy+registration finished, see if can handle just normal processing -- may need to go back to repair
  • 13:40 EAM: restarting stdlocal : getting sluggish @ 100k jobs.
  • 00:33 MEH: stdsci probably could use a restart, polls underloaded