PS1 IPP Czar Logs for the week 2014.09.22 - 2014.09.28

(Up to PS1 IPP Czar Logs)

Monday : 2014.09.22

  • 09:35 EAM : processing went OK last night, but there are a number of outstanding glockfile failures. I am stopping processing to clear out these lock issues.
  • 10:40 EAM : I needed to reboot ipp058 so it could run glockfile on remote machines. Nightly science processing then finished and I restarted stdlocal.
  • 13:15 EAM : ipp013 crashed. I power cycled it and it came back up with help from Haydn (BIOS setting?).
  • 22:33 EAM : ipp013 was overloaded and crashed again. I rebooted it, but needed to take it out of all pantasks. I am stopping and restarting the standard pantasks, and also restarting the ippdb01 db, which has been extra sluggish.

Tuesday : 2014.09.23

  • HAF: fun times after the gpc1 reboot -- restarted czartool
  • HAF: manually requeueing diffs for Serge:
    o6923g0113o 	FAIL (Diff1 stage) 	OSSR.R19S3.4.Q.w ps1_20_5599 visit 3
    o6923g0115o 	FAIL (Diff1 stage) 	OSSR.R19S3.4.Q.w ps1_20_2310 visit 3
    o6923g0130o 	FAIL (Diff1 stage) 	OSSR.R19S3.4.Q.w ps1_20_5599 visit 4
    o6923g0132o 	FAIL (Diff1 stage) 	OSSR.R19S3.4.Q.w ps1_20_2310 visit 4

Used these commands:

[heather@ippc18 ~]$ difftool -dbname gpc1 -definewarpwarp -exp_id 798994 -template_exp_id 799008 -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2014/09/23 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20140923 -set_reduction SWEETSPOT -simple -rerun 
597963 new neb://@HOST@.0/gpc1/OSS.nt/2014/09/23 OSS.nightlyscience OSS.20140923 SweetSpot SWEETSPOT  2014-09-23T21:51:33.763077 RINGS.V3 T T 0  0.000000 nan nan nan nan 1 
597964 new neb://@HOST@.0/gpc1/OSS.nt/2014/09/23 OSS.nightlyscience OSS.20140923 SweetSpot SWEETSPOT  2014-09-23T21:51:33.763077 RINGS.V3 T T 0  0.000000 nan nan nan nan 1 
[heather@ippc18 ~]$ difftool -dbname gpc1 -definewarpwarp -exp_id 798992  -template_exp_id 799009 -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2014/09/23 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20140923 -set_reduction SWEETSPOT -simple -rerun -pretend
1041931 RINGS.V3 798992 OSS.20140923 1041973
[heather@ippc18 ~]$ difftool -dbname gpc1 -definewarpwarp -exp_id 798992 -template_exp_id 799009 -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2014/09/23 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20140923 -set_reduction SWEETSPOT -simple -rerun 
597965 new neb://@HOST@.0/gpc1/OSS.nt/2014/09/23 OSS.nightlyscience OSS.20140923 SweetSpot SWEETSPOT  2014-09-23T21:51:59.973567 RINGS.V3 T T 0  0.000000 nan nan nan nan 1 

Wednesday : 2014.09.24

  • HAF: registration got stuck around 2am - I kicked it. It's 7:20am, and we still have 60 images to download. Why is it so sloooooooow?
  • 21:00 MEH: fallout from switching machines, stalling, and rebooting during nightly downloads: some files want to be on ipp077... stopping everything to let those get placed in peace rather than tediously correcting them later..
    • when brought back up, those machines should have been put into repair
    • cleanup was also stopped; until we get nightly data, might as well make space on 067,068, so set cleanup to run
    • might as well do the regular restart of stdscience while waiting for nightly

Thursday : 2014.09.25

  • 06:00 MEH: no network connection into UH Manoa or Maui... -- though some email is going through? PS1 was not open last night, so no nightly data
  • 08:30 MEH: network access restored.. -- Gavin noted, in reply to Sidik's question about the network, that UH ITS had scheduled intermittent network outages for 0530-0630; we should probably keep an eye on the ITS info page (though if the network is down we won't be able to see the page anyway..)
  • 09:40 MEH: regular restart of summitcopy and registration, also to flush the logs of all the neb issues and ipp077 down yesterday
  • 12:10 MEH: rebuilt ops tag ipp-20130712 so that SHUTOUTC is just passed as a string
  • 14:20 MEH: stopping processing while ipp072-ipp082 are down for PDU work, ~1hr?
  • 15:35 MEH: Haydn finished, but will look at the ipp077 raid battery, so that machine is still down -- trying to load the system w/ lanl stdlocal
    • neb-host back as was before -- ipp076,082 repair; ipp078,079,080 up
    • stack.on, revert chip/cam/warp for LAP.PV3.20140730.ipp
    • running fine, set ipp082 repair->up
    • ipp067,068,069,070 getting killed now.. ipp082 fine -- cleanup is heavy there as well; stop all and clear out
    • suspect it is stack.summary: it gets turned on with stack.on, and many, many instances run at the same time
  • 18:15 MEH: processing back to normal -- ipp077 back up, leaving ipp067,068,069,070 in repair to watch the new data nodes and put data there for a while.
    • ipp071,077,082 neb-host up seem fine; ipp071 gets a high load more than occasionally, so back to repair overnight

Friday : 2014.09.26

  • 07:40 MEH: ipp071 neb-host up to watch -- high load ~60-100, past ~2 hrs LAP rate ~70/hr
    • ipp071 to repair @0930; rate about the same even when stacks started. Stacks off for a bit to watch the base rate -- now rising to 90-100/hr, so the ipp071 10G/NFS/raid config probably needs tweaking
    • underloaded ~1100, so turning stack.on, stack.summary off
  • 12:30 MEH: Gavin made the change to the NFS r/wsize, down to 32768; manually unmounted ipp071 and set neb-host up -- not looking like any help, at first
    • neb-host up had a spike in load, net io in dominated, ~30MB/s
    • after ~30min, net io in (write) dropped to ~10MB/s and io out (read) rose to 50MB/s; load is fine, and the processing rate is going up to about what it was w/ ipp071 in repair
    • forgot to umount ipp071 on ippc20-c32 -- maybe that was the cause of the initial overload?
    • ipp071 seems to have no battery status info in /var/log/messages; others have checking, charging, -finished etc -- is there a battery issue, with write buffering not on? -- appears to be reporting no battery; Haydn will check on it next week
    • will need to watch ipp071 during nightly processing to see if it is okay, otherwise put it into repair during the night
    • ipp067,068,069,070 don't need to be in repair, other than we might as well fill up the new machines for a while
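For reference, a client-side NFS mount with the reduced block sizes mentioned above would look roughly like this (the export path and mount point are hypothetical; the real IPP fstab entries may differ):

```
# Hypothetical /etc/fstab entry with the reduced NFS block sizes (rsize/wsize=32768):
ipp071:/export/ipp071.0  /data/ipp071.0  nfs  rw,hard,intr,rsize=32768,wsize=32768  0 0

# Equivalent one-off remount from the shell:
#   umount /data/ipp071.0 && mount -o rw,hard,intr,rsize=32768,wsize=32768 ipp071:/export/ipp071.0 /data/ipp071.0
```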
  • 15:02 Bill: there were a few OSS exposures last week that got multiple cam runs queued, which causes the releasetool -definerelexp commands to fail.
    • changed the label for the second copy of each. Left them in the full state, though, since both copies were used in various warps & diffs, so didn't want updates to fail:
      camtool -updaterun -set_label OSS.nightlyscience.dupe -cam_id 1037006
      camtool -updaterun -set_label OSS.nightlyscience.dupe -cam_id 1040293
      camtool -updaterun -set_label OSS.nightlyscience.dupe -cam_id 1040495
      camtool -updaterun -set_label OSS.nightlyscience.dupe -cam_id 1040795
      camtool -updaterun -set_label OSS.nightlyscience.dupe -cam_id 1040840
      camtool -updaterun -set_label OSS.nightlyscience.dupe -cam_id 1040859
      camtool -updaterun -set_label OSS.nightlyscience.dupe -cam_id 1059634
      camtool -updaterun -set_label OSS.nightlyscience.dupe -cam_id 1059646
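Relabel commands like the eight above can be generated with a small shell loop; shown here as a dry run with echo, since camtool is an internal IPP tool (cam_ids taken from the entry above):

```shell
# Dry run: print one camtool relabel command per duplicate cam run.
# Drop the "echo" to actually invoke camtool (IPP-internal tool).
for cam_id in 1037006 1040293 1040495 1040795 1040840 1040859 1059634 1059646; do
    echo camtool -updaterun -set_label OSS.nightlyscience.dupe -cam_id "$cam_id"
done
```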
  • 15:15 MEH: adding the ippsXX machines 4x to lanl stdlocal for more load testing -- seems okay (270->331 jobs) for an hour so far
  • 16:15 MEH: no console PDU remote power cycling on new machines yet, so don't kill them..
  • 16:30 MEH: lanl stdlocal now has only a few jobs loading, so cannot test adding more nodes
  • 17:20 MEH: more lanl stdlocal activity and ipp071 overloading w/ 60MB/s in (write) -- into repair
  • 22:20 MEH: noticed many, many fault 2 errors at all stages for ipp076,078,079 (maybe all of the new data nodes):
         Failed to delete a previously-existing file (/data/ipp076.0/nebulous/8e/5f/5148980950.gpc1:OSS.nt:2014:09:27:o6927g0124o.800663:o6927g0124o.800663.wrp.1046865.skycell.0705.068.psf), error 2: No such file or directory
     -> pmFPAfileOpen (pmFPAfileIO.c:816): I/O error
         error opening file /data/ipp076.0/nebulous/8e/5f/5148980950.gpc1:OSS.nt:2014:09:27:o6927g0124o.800663:o6927g0124o.800663.wrp.1046865.skycell.0705.068.psf

Saturday : 2014.09.27

  • 21:30 MEH: again, a high number of fault 2 events at all stages. setting the extra storage nodes (ipp067,068,069,070) up is not helping; removing ippsXX from lanl stdlocal is not helping either
  • 22:40 MEH: ipp078 down? cannot log in, cannot access /data/ipp078.0, but ganglia says it is up -- set neb-host down immediately.. looks like things are wedged..
    • lanl stdlocal stop
    • @22:56 ganglia finally reporting down..
    • 23:00 MEH: two summitcopy jobs manually killed and cleared, slowly sorting itself out
    • 23:30 MEH: cleared most stalled jobs by faulting them -- a collection of exposures will be stalled until ipp078 is back up, since it holds single-instance products. 3 jobs are not clearing from pantasks even though they are clear of the nodes; forget how to clear them properly, so will just do a restart of stdsci
    • for Sunday night we may want to switch back to ipp067,068,069,070, since we have access to reboot those if a problem happens
    • until ipp078 is rebooted or the missing parts are manually rerun, the following exposures will be stalled
  • 01:20 MEH: still seeing many extra fault 2 errors and higher wait_cpu levels.. with all the broken exposures, want to limit the extra faulting, so ipp067,068,069,070 up and ipp076,077,079,081,082 to repair (leaving them up along with ipp067,068,069,070 didn't seem to help the extra faults much at all)
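The state flips described in entries like the one above amount to loops over the two host groups; a dry-run sketch with echo (neb-host is an internal IPP tool, so the exact command syntax is an assumption based on its usage elsewhere in this log):

```shell
# Dry run of the neb-host state changes: print rather than execute.
# Drop the "echo" to actually run neb-host (IPP-internal tool; syntax assumed).
for h in ipp067 ipp068 ipp069 ipp070; do
    echo neb-host up "$h"
done
for h in ipp076 ipp077 ipp079 ipp081 ipp082; do
    echo neb-host repair "$h"
done
```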

Sunday : 2014.09.28

  • 10:30 MEH: looks like lap stdlocal is running, ipp067,068,069,070 back to repair to hold free space for nightly processing
  • 21:55 MEH: nightly data is backing up with many fault 2 errors; switching back to ipp067,068,069,070 up and ipp076,077,078,079,081,082 repair
    • should be enough space for nightly; can reboot ipp067,068,069,070 if necessary. also noticed that ipp078 never sent email when it went down last night, so if no one was watching it would have gone unnoticed..
    • w/ lanl stdlocal running, it tends to overload the nodes, but the rate still seems ok and faults are not happening (which would waste cycles) -- -2x c2 in lanl stdlocal seems to ease the load some (stacks?)
    • another possible alternative is to set all nodes neb-host up and use the ippsXX machines to help balance against all the extra fault 2 errors in nightly processing, but don't want to deal with the mess if one of the no-access nodes goes down again..