PS1 IPP Czar Logs for the week 2014.12.01 - 2014.12.07

Monday : 2014.12.01

  • 09:53 Bill: restarted distribution pantasks which was sluggish with diff stage enabled. SAS distribution is all finished except for diff stage.
  • 13:00 Bill: ~ippsky is rerunning staticsky and skycal for sas.37
  • 17:00 MEH: setting up ippx049+ for ippsky staticsky/cal to use -- manually adding when ready, ~3x should be fine for 50G ram
    • ippx049--x088 have been set up and added temporarily to an x2 group (need split by pdu?), after Haydn's email x089-x092 also -- see x2 group for ones still down
  • 22:00 MEH: nightly processing not doing well w/ ippmd and ippsky running (or something else?) -- ipp083-087 overloaded/not responding and stalling registration

Tuesday : 2014.12.02

  • 07:11 Bill: started up fullforce in ~ippsky/staticsky (staticsky and skycal are finished except for one skycell that faulted with a data error; it is rerunning now)
  • 07:30 MEH: registration backed up quite a bit, too many nodes in use -- ippmd stop until nightly finishes
    • looks like something may have started ~0200, nightly rate dropped greatly then as well.. and not just for stdsci -- looking at the storage node network, appears the red storage nodes hit a data block about that time and there it is...
  • 08:30 MEH: taking some xNNN nodes out of ippsky until nightly finishes.. then probably could do 4x on those for FF
  • 08:40 MEH: an exposure (o6993g0619o) is stalled in registration.. been a while so restart summitcopy and registration (then stdsci once nightly is finished) -- had to manually run
    regtool -dbname gpc1 -revertprocessedexp -exp_id 829654
    
  • 09:30 MEH: ipp056,058,060 have space and could be moved out of repair to help ease load on new storage nodes and get nightly finished
    • same for ipp012,013,029,030
    • now seeing those hit load >50 -- so even once the 10G setup for ipp083-088 is done, will the other 1G systems still see large loads?
  • 12:00 MEH: warp growth curve fault --
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1241149  -skycell_id skycell.1616.069
    
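    • (sketch) if several skycells hit the same growth-curve fault, the same fix could be run from a two-column list of warp_id / skycell_id pairs -- a minimal bash loop assuming the same warptool flags as above; faulted_warps.txt is a hypothetical file holding those pairs:
      while read -r warp_id skycell_id; do
          warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id "$warp_id" -skycell_id "$skycell_id"
      done < faulted_warps.txt
      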
  • 12:30 MEH: nightly finally finished, while waiting for storage node reboots, going to do the regular restart of stdsci and set to stop
  • 12:50 MEH: adding host groups s4,s5 for the new storage hosts ipp067-070,ipp071-088 respectively. will start testing running jobs on them including deep stacks and should not be allocated for any other jobs at this time
  • 13:15 CZW: ran check_system.sh stop to ensure IPP processing is off while we do 10G reboots (and hopefully scidb switch).
  • 14:00 MEH: not doing the scidb/ippdb01/05 switch, might as well use this standby time to restart the mysql on ippdb01 (been ~10d or so)
    • restarted czarpoll and roboczar on ippc11
  • 15:47 Bill: found another bug in skycal. skycal will need to be rerun for SAS.37 (as does fullforce)
  • 16:00 MEH: Haydn & Gavin finished config and reboot of the storage nodes. CZW tested r/w on the storage nodes, looks ready to go. set pantasks to run and shortly after ipp077 crashed -- set to neb-host down
  • 17:50 MEH: Gavin brought ipp077 back w/ 1G network, problem w/ the 10G card? put back to neb-host repair
  • 18:30 MEH: ipp013 appears unresponsive.. ipp012 and some others did for brief period of time as well.. neb-host down for a bit just in case..
    • console showing mem check.. looks like it power cycled..
    • ipp008,012,013,014,016,018,037 actually were all power cycled according to ganglia.. oh why..
  • 19:00 MEH: nightly seems to be stalled.. lanl stdlocal and ippmd stop..
    • restarting registration seems to have cleared the problem... it was stuck on all OTAs like:
      o6994g0004o  XY76 0 pending_burntool neb://ipp053.0/gpc1/20141203/o6994g0004o/o6994g0004o.ota76.fits	
      
    • 19:10 registration finally caught up..
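    • (sketch) for an OTA stuck in pending_burntool like the one above, a quick first check is whether the raw file's instances are actually visible in nebulous; neb-ls -p is used elsewhere in this log, and passing it the neb:// URI directly as shown is an assumption about its interface:
      neb-ls -p neb://ipp053.0/gpc1/20141203/o6994g0004o/o6994g0004o.ota76.fits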
  • 20:00 MEH: starting ippsky/staticsky processing in place of stdlocal as SAS has priority -- attempting to balance after skycal finishes to try and finish SAS by morning w/o impacting nightly processing
    • skycal finished, fullforce doesn't seem to be running after 20 mins.. -- c2 off, leaving 2x x2 on in case something happens -- ippsky to stop
  • 20:10 MEH: manually adding ipp067 and ipp072 to summitcopy to test, red and data blocked so if crash will minimize cleanup..
  • 21:05 MEH: ippmd run, stdlocal run, watching load -- stdlocal turn off all x0,x1 so ~135 jobs running (ippmd ~170 jobs) -- BUT this is w/o diffims being done in stdsci apparently..
  • 23:45 MEH: no WW diffs for MOPS being made.. -- 3 chunks available and nothing being queued.. -- CZW noticed initday was not run today due to the downtime; once added, diffims are now queued
  • 00:30 MEH: warp faults -- cannot build growth curve (psf model is invalid everywhere)
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1241654 -skycell_id skycell.2295.051
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1241730 -skycell_id skycell.2295.051
    
  • 01:00 MEH: nightly processing seems to have screeched to a halt -- ipp060 is in a terrible cpu_wait state, neb-host repair finally cleared it. stdlocal and ippmd stop until cleared, then watch similar loading again (~135 stdlocal, ~170 ippmd)
    • rate just isn't great, so stdlocal and ippmd off until morning. needs more tuning, and since ippmd is stopped, put some ipps nodes into stdsci (will remove in the morning)

Wednesday : 2014.12.03

  • 8:00 HAF: I cleared up some faults in registration:
    regtool -revertprocessedimfile -dbname gpc1 -exp_id 830390
    regtool -revertprocessedimfile -dbname gpc1 -exp_id 830392
    regtool -revertprocessedimfile -dbname gpc1 -exp_id 830393
    
    • This allowed all but one exposure (0471) to finish registration -- the obvious tricks aren't working; I need to think a bit about how to fix it.
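    • (sketch) the three reverts above could equally be run in a loop; the exp_ids are the ones from this entry:
      for exp_id in 830390 830392 830393; do
          regtool -revertprocessedimfile -dbname gpc1 -exp_id "$exp_id"
      done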

  • 9:53 HAF: the last exposure is done, it fixed itself (?)
  • 10:40 MEH: nightly xSS finished w/ diffims, turning on ippsky for SAS FF w/ 2x x2 group to start with and ippmd after removing ipps from stdsci
    • seems fine, adding more x2 -- 4x x2, poll 200, rate ~250-300s per, maybe finish ~8-10hr?
  • 11:30 MEH: all nightly finished, putting update labels back in
  • 12:30 MEH: stdlocal only has 2x x0+x1 loaded, adding 1x x0+x1 into ippsky as well
  • 16:25 CZW: Good news: I fixed the warp-stack issue that was preventing these from being queued. Bad news: there are now 1150 diffs that want to run.
  • 16:54 CZW: ipp034 died again. Nothing informative on the console. Power cycled. It's already in repair, so there's not much more to do when it comes back.

Thursday : 2014.12.04

  • 08:30 MEH: reducing ippsky by 2x x2 to use in ippmd processing -- SAS finished except for faults and should be enough nodes to finish those if necessary
  • 10:30 CZW: stsci03 is back online, and is set to repair in nebulous.
  • 10:35 MEH: MSS diffim fault 5 needed to be dropped so MOPS can finish their processing.. -- similar problem
    -> VectorFitPolynomial1DOrd (psMinimizePolyFit.c:633): unknown psLib error
         Could not solve linear equations.
    
    -- so just marked quality 42
    
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 613152 -skycell_id skycell.1099.062 -dbname gpc1
    
  • 11:00 MEH: deep stack test runs on ipp068,072 to start -- will put into repair if not finished by the time observations begin
    • in doing this, discovered the /local/ipp symlink is not set up... will check the others and fix as necessary -- 073-075,077, 083-088 as well
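    • (sketch) a quick way to find which of those hosts are missing the link before fixing them by hand -- plain ssh + test, nothing IPP-specific:
      for h in ipp073 ipp074 ipp075 ipp077 ipp083 ipp084 ipp085 ipp086 ipp087 ipp088; do
          ssh "$h" 'test -L /local/ipp || echo "$(hostname): /local/ipp missing"'
      done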
  • 11:15 CZW: I've set chipRuns with label=LAP.PV3.20140730.ipp, data_group=LAP.PV3.20140730.ipp.20141129 to label LAP.PV3.20140730 to migrate this processing to LANL. These chips are behind a large backlog locally, and as the LANL queue is largely empty, I'd rather push them there to get processing time. I may repeat this with other local data groups if this works well to clear the .ipp runs in favor of the .bigmem ones.
  • 12:30 MEH: ippmd using all of x2 group -- 640 jobs running at once now
  • 13:00 MEH: Gavin changed the rwsize for mounts on 10G (071-076,078-082) -- will see how it goes
  • 13:30 CZW: full ipp/ipplanl pantasks restart.
  • 14:45 MEH: massive overloads on ipp083,085,088 -- put to repair and see how it goes for now -- ippc0x nodes also overloaded, but suspect that is due to waiting on transactions from ipp083,085,088
    • then it started picking on the other 1G nodes -- 056,058,086,084 -- so going with just 10G for a bit --
    • actually not clear it is 10G vs 1G; rather, the 1G nodes mostly all have bad batteries and are in WT not WB for the raid.. even ipp056, 058
    • ipp060 has a good battery, but dmesg shows a problem:
      [2014-12-03 00:38:16] SLAB: Unable to allocate memory on node 0 (gfp=0x20)
      [2014-12-03 00:38:16]   cache: TCP, object size: 1728, order: 1
      [2014-12-03 00:38:16]   node 0: slabs: 8/8, objs: 32/32, free: 0
      [2014-12-03 00:45:30] mrmonitord: page allocation failure: order:1, mode:0x200020
      [2014-12-03 00:45:30] Pid: 15556, comm: mrmonitord Tainted: G           O 3.7.6 #1
      ...
      
    • ipp012,030 are on 1G? they are doing fine
  • 15:05 MEH: ipp072 then ipp068 high loads, suspending deep stack test
  • 16:50 MEH: ipp083,084,085,086,088 back neb-host up -- will run high load and possibly be a drag on processing, but will adjust extra-nightly processing to account for it
  • 17:30 MEH: stdsci WS diffims finished ~17:20 and all loads are fine now w/ stdlocal and ippmd running >800 jobs.. maybe the storage nodes that stsci runs on need to have the mounts flushed to use the new rwsize for the 10G machines too -- CZW working on it
  • 18:00 MEH: ippmd seems to show larger cpu_wait with ippx049+ compared to ippsXX for chip-->warp processing, what is different, and is this an issue? the NIS/homedir connection issue mentioned by Haydn?
    • didn't see w/ x0,x1 test use w/ ippmd
  • 19:45 MEH: seeing a lot of strange faults -- missing files to delete again etc
  • 19:55 MEH: nightly seems to be moving okay, manually adding ipp067-082 to summitcopy since it has run for the past couple of nights with two of them
  • 21:10 MEH: diffims started and loads very high again now and rate dropping. turning majority of ippmd stop for a bit
  • 21:40 MEH: rate much better, but noticed stdlocal is also running just ~45 stacks now and don't know when that was changed -- starting ippmd w/ ~86 and will continue until the rate degrades again
  • 23:00 MEH: Serge reports a problem .mops file on the datastore --
    datastore.ipp.ifa.hawaii.edu/IPP-MOPS-TEST/pub.937571.diff.613469/5601098179.gpc1:IPP-MOPS-TEST:IPP-MOPS-TEST.937571.pos.mops
    
    • seems nebulous is having problems? finally was able to neb-ls -p and see the original file okay, but timeout caused the datastore file to fail?
    • Serge is manually fetching directly from disk to add for MOPS
  • 23:50 MEH: set quality 42 for problem warp and diff
    --> cannot build growth curve (psf model is invalid everywhere)
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1246258  -skycell_id skycell.0802.029
    
    --> Reference FWHM: nan 
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 613517 -skycell_id skycell.0801.040 -dbname gpc1
    
    
  • 00:20 MEH: cleanup seems to be driving nebulous into the ground? stopped for the night
  • 01:50 MEH: nebulous was a red herring, the issue was symlinks under ippc17:/datastore/dsroot/IPP-MOPS-TEST pointing to ipp071.0, but the ipp071.0 mount was broken
    • important: never try to do an ls on /datastore/dsroot/IPP-MOPS-TEST...
    • while nebulous not the cause of this problem, Serge still was seeing timeouts so cleanup etc is overloading things w/ nightly science
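    • (sketch) to check a datastore product without risking a hang on a dead NFS mount, stat the symlink target under a timeout rather than ls'ing the directory; the pub.* path is the one from the 23:00 entry and is assumed to be one of the symlinks in question, and the 10s timeout is arbitrary:
      link=/datastore/dsroot/IPP-MOPS-TEST/pub.937571.diff.613469
      timeout 10 stat -L "$link" >/dev/null 2>&1 || echo "not responding or missing: $link -> $(readlink "$link")"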
  • 02:15 MEH: trying cleanup one more time -- clearly driving a larger load on the ippc0x apache and ipp083-088 nodes -- weather looks poor so turning it back on w/ ippmd turned down for tonight

Friday : 2014.12.05

  • 07:15 Gene using ippx002-005 for relastro tests -- will need ippx002-011 today and tomorrow+
  • 07:55 MEH: neb-host repair for ipp082,083-088 in advance of possible raid battery swap (082,087 already in repair)
  • 08:00 MEH: reboot ipp060 to see if clears error and helps when neb-host up
    • still switches the raid to WT even though the battery is reported present and not bad (not good either..)
  • 08:40 MEH: ippx016 has been fixed for a while now, needed the /local/ipp link fixed; removed the comment-out in group x1
    • ippx039 still has a problem -- reports no ext3 partition on /dev/md3
  • 10:35 MEH: ippx018 out of stdlocal so Haydn can fix the disk (maybe shouldn't be using it anyways..)
  • 11:30 MEH: stdlocal is now disabled from using storage nodes (storage.hosts.off) and from the morning call that turns them back on. leaving nightly.science.block on so that if nightly gets behind, chip->warp is turned off in stdlocal to ease network loads
  • 11:45 MEH: Haydn fixed ippx039 disk, x101-103 are also ready -- added to hosts x2
  • 11:50 MEH: another fault 5 ws diff from ref fwhm nan
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 613691 -skycell_id skycell.0801.040 -dbname gpc1
    
  • 12:00 MEH: some necessary regular maintenance
    • ippc18 homedir ~26G, time to clean up and make room -- archiving the ipp logs only provides ~35G and will take some time to avoid overloading ippc18..
    • apache nebulous.log on the ippc0x machines -- one was really close (<2G) last night; thanks to Serge for rotating the logs for all of the apache servers so they are at ~the same space
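    • (sketch) for the homedir cleanup, a quick usage summary before deleting anything -- assuming the ipp account's homedir is the one in question:
      du -sh ~ipp/* 2>/dev/null | sort -h | tail -n 20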
  • 16:50 MEH: registration has a tweak_revert file for imfile.revert.fast 600s and imfile.revert.norm 1800s to see how the change goes
  • 17:00 Haydn doing battery relearn cycle on ipp082-088 since batteries are new. ipp056,058,060 are old and likely do need to be replaced
  • 17:15 Haydn restarting ippx018 to replace bad disk
  • 17:20 MEH: putting the ipp083-086,088 nodes back up, also putting some smaller full storage nodes in repair and ones with space back up for a while to fill
  • 19:30 MEH: ipp056,058,060 battery relearn finished, with all the other nodes up will try these up as well and see

Saturday : 2014.12.06

  • revised -- planned downtime MRTC-B 590B Lipoa Automatic Transfer Switch (ATS) -- Saturday December 6th. They will be starting our shutdown processes at 9am and will hopefully start turning things back on at 3pm.
    • all processing stopped before ~7am
    • Haydn on site to start power down of systems @7am

  • node allocation after restart --
    • stdsci -- default
    • stdlocal -- reduced loads 3x c2, 3x x2 -- no storage nodes now
    • relastro -- ippx002--x036 (unless x018 fails with disk issue)
    • ippmd -- ipps, ippx037-x048, little also on x2 at reduced loads
    • cleanup tasks -- priority
    • rsync the slow.log off ippdb00 since only 7.xG of space is available.. already renamed so the restart will create a new one
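    • (sketch) the slow.log copy could look something like the following -- the renamed filename, MySQL datadir, and destination are all placeholders, and the remove should only happen after the copy is verified:
      rsync -av ippdb00:/var/lib/mysql/mysqld.slow.log.old /data/ippc18.0/logdump/
      ssh ippdb00 rm /var/lib/mysql/mysqld.slow.log.old    # only once the copy is confirmed good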
  • 06:20 MEH: let the stop of processing begin --
    • ippmd down
    • stdlocal down
    • pstamp, distribution down
    • cleanup down
    • summit+registration+stdsci down
    • czarpoll/roboczar down
    • ippdb00,01,02,03,05 mysql shutdown
    • all systems powered down -- ippc18 looks to still be up
  • 12:50 MEH: systems turning back on, but nothing should be started until we deal with the nebulous db log -- will send out email when ready to go
    • ipp055,057 recovered some memory lost from before the shutdown, yay
    • ipp054 lost some memory -- powercycle to try to recover -- done and ok now
    • stsci12 booting but will need to be taken down again to replace a power supply -- done
  • 14:40 MEH: db machines up and logs seem okay (slave positions match etc; see the check sketch at the end of this entry), except that lately (since 10/14) the nebulous mysqld.err log has many entries like
    141205 20:02:07InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
    InnoDB: many active transactions running concurrently?
    
    • ippc17 mysql seems okay
    • ipp067,stsci09 still down?
    • compute nodes booting, so many nodes..
    • starting slow scan for all data node mounts
    • stsci12 power supply replaced
    • when all datanodes up, ipp will start nebdiskd on ippdb00
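    • (sketch) the slave-position check mentioned at the top of this entry amounts to comparing SHOW MASTER/SLAVE STATUS between the primary and its replicas -- host roles below are assumptions and credentials are omitted:
      mysql -h ippdb01 -e 'SHOW MASTER STATUS'
      mysql -h ippdb03 -e 'SHOW SLAVE STATUS\G' | egrep 'Master_Log_File|Read_Master_Log_Pos|Seconds_Behind_Master'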
  • 15:10 MEH:
    • ippdb06 booted, mysql crash recovery and replication catching up -- Haydn wants to keep an eye on it running again, memtest ran 65hr w/o problem
    • ipp067,stsci09 now up. stsci09 had odd error in dmesg to keep eye on
      [2014-12-06 15:12:09] mrmonitord[16293]: segfault at 0 ip 00000000080a38a8 sp 00000000f5fefd70 error 4 in mrmonitord[8048000+13d000]
      
    • ippxNNN nodes almost finished, jaws nodes next -- may want to change the documented order of this so processing (nightly on the storage nodes etc) can be started while all the ippxNNN nodes come up
    • ippc29 missing memory again -- I will power cycle later, only a test/stack node for MD
    • datanodes mounting the other datanodes seems ok; c0-2,m0,m1 mounts of the datanodes seem ok --
  • 16:05 MEH: limited nightly processing started since it uses the data nodes; some misc compute nodes still have to be looked at
    • exposure blocked in summitcopy o6997g0719o -- others getting stalled but slowly clearing, restarting summitcopy cleared
    • ippdb01 running pretty hard, kind of concerned for just limited nightly processing
    • NO other processing should start until nightly finishes, then will start cleanup and pstamp
    • ippc19 date is wrong, ippc07 having trouble mounting homedir -- fixed w/ manual time set and restart ntpd, restart nfs
  • 17:45 MEH: nightly caught up except for darks, starting other pantasks
    • adding the WS labels back in for diffims -- survey.warpstack.diff repeatedly times out, gpc1 having trouble
    • looks like slow queries are piling up and not clearing, may need to restart db01 mysql -- still happening, will have to turn off WS diff or it will drive the db into the ground
  • 18:50 MEH: the long WS and unused-stack nightlyscience queries fault after ~600s but continue to pile up in the db -- disabling WS diffs and stack tasks..
    • where is the label coming from for it to be doing a query looking for MD08.. -- target groups in nightly_science.config
      AND (rawExp.exp_type = 'object') AND (rawExp.dateobs >= '2014-12-07T00:00:00.000000' AND rawExp.dateobs <= '2014-12-07T23:59:59.000000') AND (warpRun.label LIKE 'MD08.nightlyscience') AND ((warpSkyfile.good_frac > 0.05000000 - 0.00000119)) AND (rawExp.filter = 'g.00000')
      
    • will need to try and get the WS diff caught up by monday to avoid cleanup problem?
  • 20:30 MEH: watching ippmd + ipplocal running w/ stdsci until first full chunk done -- seeing no diffims being made..
    • if starting other pantasks, please at least note it in czarlogs...
  • 20:50 MEH: restarting stdsci, to clear fail/counters -- also picked up diffims now --
  • 21:11 EAM: restarting my relastro analysis using ippx002 - ippx032.
  • 21:40 MEH: summitcopy is greatly lagging.. no jobs on stdlocal but turning off with ippmd overnight -- more disks have gone red, need to not make worse until cleanup makes progress
  • 23:15 MEH: ipp077 unresponsive and cannot access disk, already in repair but put to down -- console spammed w/
    <Dec/07 04:54 pm>[34155.922340] journal commit I/O error
    
    • can't be good -- setting neb-host down --
  • 00:20 MEH: ippdb01 went into nearly 100% cpu_wait state.. -- why, things were running lightly and smoothly..
    • uncommon number of processes >550?
    • pstamp started w/ jobs about that time
    • cleanup jobs doubled in dtime
    • ippdb01 net io~0
    • simple queries now >300s
    • pstamp stop and cpu_wait drop, run and cpu_wait jumps to 30% then seems to settle down -- why would pstamp impact ippdb01 system?
  • 01:15 MEH: an hour later, still no diffims.. restart stdsci? yes, it finds many to do; are the timeouts etc causing a glitch?

Sunday : 2014.12.07

  • 10:45 MEH: looks like registration is down, did someone stop it around 0730?? nightly science is still not finished so going to restart it -- looks like segfault
    [2014-12-07 07:27:15] pantasks_server[8766]: segfault at 10679f8 ip 0000000000408a4e sp 0000000042c8cf20 error 4 in pantasks_server[400000+16000]
    
  • 11:10 MEH: even with no jobs running, the WS query continues to time out, with 12 long queries trying to run -- mysqld.slow.log shows >600s to 6ks (pantasks timeout=600, so tweaked to 1800)
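    • (sketch) to see which statements dominate the slow log, mysqldumpslow can summarize it by time; the log path on ippdb01 is an assumption:
      mysqldumpslow -s t -t 10 /var/lib/mysql/mysqld.slow.log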
  • 12:45 MEH: long WS query runs in 30s or less on ippdb03,db05.. something wrong with db01, even w/ all stop but stdlanl..
    • raise timeout to 3600s -- 3pi query timeout..
  • 15:45 MEH: we have power cycle access now for ipp082-087 on console
  • 16:00 MEH: looks like battery relearn may have worked on the newer data nodes, but still retains WT policy?
    Dec  7 01:40:47 ipp083 MR_MONITOR[16550]: <MRMON194> Controller ID:  0   BBU enabled; changing WT logical drives to WB
    Dec  7 01:40:47 ipp083 MR_MONITOR[16550]: <MRMON054> Controller ID:  0   Policy change on VD:   0 
    Dec  7 01:40:47 ipp083 Current   =   Current Write Policy: Write Through;
    
    • ipp084,087 do not have this message -- relearn not finished?
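    • (sketch) BBU/relearn state and the current write policy can be checked directly on a node with MegaCli64 (present on these hosts per the MegaCli64 messages elsewhere in this log); option spellings vary between MegaCli versions, so treat these as approximate:
      MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL | egrep 'Battery State|Learn Cycle|Charging Status'
      MegaCli64 -LDGetProp -Cache -LALL -aALL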
  • 17:45 MEH: restarting db01 mysql -- something clearly wrong with it.. -- nothing should be started using gpc1 unless checked with me first
  • 19:20 MEH: db01 mysql WS queries working like db03 now and running, nightly processing seems okay -- still please do not start anything w/o checking first
  • 20:10 MEH: restart of mysql on db01 seems to have cleared the timeout problem -- nightly, cleanup, pstamp, stdlocal (though chip-warp off due to WSdiff), ippmd seem to be running fine
    • should be able to start other processing now
  • 20:45 MEH: of course.. ipp034 unresponsive.. neb-host up->down.. -- nothing on console, was under normal load -- take out of processing and leave up for the disk space but probably should go back to repair soon like others
    • rebooted w/ nfs problem, didn't mount homedir -- restart nfs cleared
  • 22:45 MEH: ipp084 issue, setting neb-host repair and taking it out of any processing -- it has been having these since last night
    mce: [Hardware Error]: Machine check events logged
    
    • ipp086,088 also had these initial errors after boot
      [2014-12-06 14:42:32] MegaCli64[22184]: segfault at 0 ip 000000000051bdd0 sp 00007fff51a9d1e0 error 4 in MegaCli64[400000+28b000]
      
  • 23:20 MEH: ipp060, 058 didn't like ipp084 going to repair -- both don't have BBU and set to repair
    • ipp056 couple spikes, also no BBU
  • 23:50 MEH: adding in 2x m0,m1 and 1x x2 to stdsci to help with the WS diffim (~20/hr)