PS1 IPP Czar Logs for the week 2014.12.08 - 2014.12.14

Monday : 2014.12.08

  • 08:45 MEH: ippmd using ~280 nodes now that nightly is finished (ippsXX, 2x x2)
  • 09:05 MEH: WS diffs queued fine today -- will be stdlocal ~300, ippmd~300, stdsci~300 -- except 600 diffs will shortly shut off chip-warp in stdlocal
  • 12:55 MEH: the 20T data nodes with space that were set neb-host up late last week (to fill them and increase the number of data nodes) have now filled or are behaving poorly -- setting neb-host repair on all 20T nodes now
    • for the 20T nodes w/o processing, things seemed mostly okay -- net in was high (>50MB/s) and they seemed to be writing okay, just forced to write constantly; those could probably be put back in, but the ones doing processing probably shouldn't be

  • 20:35 HAF: various emails floating around; there is a problem with nebulous (Mark / Gene noticed), and summit copy / registration are faulting. The errors we see look like this:
-> pmConfigConvertFilename (pmConfig.c:1833): System error
     failed to create a new nebulous key: nebclient.c:1012 nebSetServerErr() - SOAP-ENV:Server - error: DBD::mysql::st execute failed: The table 'storage_object' is full at /usr/lib64/perl5/site_perl/5.8.8/Nebulous/Server.pm line 275,  line 12.
 -> pmConfigRead (pmConfig.c:618): System error
     Unable to resolve trace destination: neb://ipp015.0/gpc1/ThreePi.nt/2014/12/09//o7000g0063o.833489/o7000g0063o.833489.ch.1316586.XY23.trace
Unable to perform ppImage: 1 at /home/panstarrs/ipp/psconfig/ipp-20141024.lin64/bin/chip_imfile.pl line 830
	main::my_die('Unable to perform ppImage: 1', 833489, 1316586, 'XY23', 1) called at /home/panstarrs/ipp/psconfig/ipp-20141024.lin64/bin/chip_imfile.pl line 509
  • 20:35 HAF: Serge is investigating, notes that ippdb00 is full and is finding the magical incantations to fix that.
  • 20:48 SC: Magic incantation is this: PURGE BINARY LOGS TO 'mysqld-bin.003942';
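    • for reference, a minimal sketch of how such a purge is typically scoped (standard MySQL statements run via the local client; whether any slaves replicate from ippdb00 is an assumption to check first, and the log name is just the one quoted above):
      # list binary logs and sizes on the master (ippdb00)
      mysql -e "SHOW BINARY LOGS"
      # on any slave, confirm it has already read past the intended cutoff
      mysql -e "SHOW SLAVE STATUS\G" | grep Relay_Master_Log_File
      # then drop everything older than the chosen file
      mysql -e "PURGE BINARY LOGS TO 'mysqld-bin.003942'"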
  • 22:17 HAF: seeing the same errors again, this time for the 'instance' table. Now what? It's jamming up registration and stuff.

Tuesday : 2014.12.09

  • 07:45 EAM : processing has been limping along, with a number of the 'table full' errors. there is enough room on the disk after Serge purged the binary logs, so that is not the cause. the tables are big, but not approaching the InnoDB 64 TB max values. The load on the machine (ippdb00) is modest (3-4). I am guessing that a restart of mysql might clear out something which is cached?
  • 08:10 EAM : more research has revealed the likely cause: the number of concurrent transactions was too large (the full disk probably also triggered this). I turned down the number of nodes doing cleanup and this seems to be making things better (lower error rate). the following mysql bug report is relevant (and points out that 5.5.xx has bumped the transaction limit): http://bugs.mysql.com/bug.php?id=26590
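    • a minimal sketch of the sort of check used to watch the active-transaction count on ippdb00 (standard InnoDB status output; assumes local client credentials). Per the bug report above, the undo-slot limit before 5.5 is on the order of 1k concurrent writing transactions:
      # count transactions currently listed by InnoDB
      mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -c "^---TRANSACTION"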
  • 08:15 EAM : addendum: this is probably not as critical, but we may want to bump the ibdata file. this is from the mysql manual (http://dev.mysql.com/doc/refman/5.0/en/innodb-data-log-reconfiguration.html)
For example, this tablespace has just one auto-extending data file ibdata1:

innodb_data_home_dir =
innodb_data_file_path = /ibdata/ibdata1:10M:autoextend
Suppose that this data file, over time, has grown to 988MB. Here is the configuration line after modifying the original data file to not be auto-extending and adding another auto-extending data file:

innodb_data_home_dir =
innodb_data_file_path = /ibdata/ibdata1:988M;/disk2/ibdata2:50M:autoextend
When you add a new file to the tablespace configuration, make sure that it does not exist. InnoDB will create and initialize the file when you restart the server.
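For reference, a quick way to check the current tablespace setting and its on-disk size before touching my.cnf (the datadir path below is an assumption; the actual location on ippdb00 may differ):

    mysql -e "SHOW VARIABLES LIKE 'innodb_data_file_path'"
    ls -lh /var/lib/mysql/ibdata1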
  • 09:00 EAM : put ipp077 to 'repair' now that it has a new 10g card
  • 09:50 MEH: ippmd lite processing back on now that faults seem to be happening less again and stdlocal running -- set stop as needed, just leave note -- nope, md still many faults and so off
  • 11:40 MEH: removing the WS label from stdsci so power is focused on the more timely WW diffs for MOPS
  • 14:50 EAM : ipp034 crashed, power cycling
  • 14:55 CZW: while working on a revert task for registration exposures, I reverted two exposures from XXX (I don't see a chipRun for it, so it might have been engineering) and 2014-08-05. These will likely process as usual and confuse everyone. I was going to reboot ipp034, but I see that Gene has beaten me to it.
  • 19:00 MEH: stdlocal seems to be running w/ 4x x2, not the 3x so turning 1x off to be sure ippmd can run
    • stdsci has a massive loading of 5x c2 and 3x x2, when jobs available hits >550, stdlocal >300 and things are faulting.. -- turning 3x c2 off since stdlocal is also running stacks and that could really hurt nightly processing..
  • 19:45 MEH: was all kinds of a mess, more than just adding a few ipps for MD -- was the stdsci allocation rebalanced after nightly finished this afternoon? -- taking the 5x:c2 and 2x:x2 out of nightly for now, may want ~100 more like on Sunday night if doing WS
    • somehow stdlocal has 4x:x2 again.. massive faults.. turning 1x:x2 off -- getting better
  • 21:00 MEH: looks like balance finally restored, few faults -- stdsci=363, stdlocal~160 (still dropping so may need to turn some back on), ippmd~55 and summit/registration etc for nightly
    • looks like WS are still set to trigger in morning so not going to worry about that
    • massive camera backlog -- raise poll to 30, could probably go higher but don't want to overload
  • 22:30 MEH: stdlocal stabilized ~114, should be able to raise to ~160 so can add x2(+x3) and see
    • seems fine, since stdlocal has stacks and try adding another 60 stdmd to see if another x2(+x3) could go into stdlocal -- suspect not, and stdlocal has switched to only stacks ~170 now so hard to test -- stacks cycling out and seeing some faults so reducing ippmd back down to ~50 (cannot swap to stdlocal, seems like ~170+50 is the limit w/ stdsci)
    • also seems to not be doing as many warps, so those are piling up
  • 22:50 MEH: ippc63 unresponsive -- crashed, power cycled and back up but take out of stdsci processing
  • 02:30 MEH: ippmd will make no progress w/ 50 jobs, putting ipps into stdsci to help clear by morning

Wednesday : 2014.12.10

  • 12:40 CZW: all pantasks stopped, czar/monitor scripts stopped as well.
  • 13:15 CZW: ipp034 has died again. I'm power cycling it again, so when processing resumes, it can try to limp along.
  • 13:00 SC:
    • Replication stopped on ippc17, ipp001, ippdb03, ippdb05
    • ippdb01: Users not root all deleted
    • ippdb05: Added ipp user to mysql
    • ippdb01: mysql shutdown
    • ippdb05: slave information deleted (RESET SLAVE + /etc/mysql/my.cnf)
    • ippdb05: server restart (no slave info)
    • ippdb05:

mysql> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+
| File              | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| mysqld-bin.000243 |       98 |              |                  |
+-------------------+----------+--------------+------------------+
1 row in set (0.00 sec)

    • ippdb03 is now a slave of ippdb05
    • issue with isp on ippc17
    • can't check from ipp001 since ippdb05 not visible
    • ippdb01 is now a slave of ippdb05
    • root password changed on ippdb01
    • mysql servers stopped on ippdb01 and ippdb05
    • Gavin can change dns/ip
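    • for reference, a minimal sketch of the kind of statement used to point a slave (ippdb03/ippdb01) at ippdb05 with the coordinates shown above; the replication user/password are placeholders, not the real credentials:
      mysql -e "CHANGE MASTER TO MASTER_HOST='ippdb05', MASTER_USER='<repl_user>', MASTER_PASSWORD='<repl_pass>', MASTER_LOG_FILE='mysqld-bin.000243', MASTER_LOG_POS=98; START SLAVE"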

Thursday : 2014.12.11

  • 05:15 EAM : ipp034 crashed with nothing on console, rebooted
  • 06:10 MEH: stdsci barely keeping loaded, only because have so many data products -- regular restarts are required, just adding more nodes doesn't really help
    • seeing 3x:x2+3, 2x:s, 2x:c2 but well underused -- will add back c2 first and 1x:x2+3 should be sufficient -- ~400 vs ~300 w/ normal stdsci loading (only storage hosts)
    • ipp034 has crashed far too often.. it is now out of normal processing along w/ ipp035,036..
    • was odd backlog of >50 fakes, those all cleared..
    • suspect x2+x3 not as effective as nodes, seeing larger levels of cpu_wait vs c2 (or even m0+1 when used before)
  • 07:20 MEH: because behind from stdsci needing a restart + ipp034, adding in 2x:x2+3 now
  • 07:50 MEH: found a stuck exposure in registration -- o7002g0242o
  • 07:55 MEH: restart long running summitcopy+registration, manually add ipp067-089 new storage nodes to summitcopy
    • exposure still stuck in registration, have to manually revert the exposure
      regtool -dbname gpc1 -revertprocessedexp -exp_id 835227
      
  • 08:45 MEH: pstamp could also use a restart -- probably would have been a good idea to have restarted all pantasks after the db01/05 switchover just to clear things..
  • 09:00 MEH: adding imfile.revert.fast/norm macro to registration, fast changes imfile.revert to 600s from 1800s, have been running w/ for past week or so set to 600s (but does not change by default)
  • 09:45 MEH: another warp cannot build growth curve
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1261475 -skycell_id skycell.0991.037
    
  • 10:40 MEH: MOPS reports WWdiff out of order and need o7002g0276o-o7002g0293o --
    difftool -dbname gpc1 -definewarpwarp -exp_id 835254 -template_exp_id 835273 -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2014/12/11 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20141211 -set_reduction SWEETSPOT -simple -rerun
    
  • 11:10 MEH: ipp077 neb-host down, cannot log into and console showing disk errors again
    <Dec/12 04:49 am>[255970.357705] journal commit I/O error
    
  • 11:20 MEH: ipp084, 087 watch in neb-host up state, if okay then can add to list for larger rwsize for 10G machines
  • 11:25 MEH: restart ippmd -- ipps as normal, adding x0+1b and x0+1 (excluding those for relastro)
    • last week tested deep stacks and summitcopy on new storage nodes s4,5 -- now test some load of chip-warp -- no time before nebulous shutdown..
  • 11:40 CZW: At Gene's request, I've sent stop commands to all the ipp/ipplanl pantasks to attempt to catch up the nebulous database replication.
  • 15:xx Gene finished restarting db00
  • 15:15 MEH: stdsci running w/o c or x nodes -- pstamp updates
  • 15:20 MEH: stdlocal running w/ 3x:c2 only ~127 jobs, adding 3x:x2+3 ~291 jobs --
  • 15:25 MEH: ippmd running 10x:m0,m1,x0+1b etc ~291 jobs
  • 15:35 MEH: ipp082 neb-host repair->up, many systems w/ space so see if okay
  • 15:45 MEH: if okay, then stack.off in stdlocal so can load up more 6x:x2+3 for >300 more jobs, 6x:c2 for >200 more jobs -- stdlocal~640, ippmd~562 -- chip processing time ~3-4x longer than normal, so clearly hitting a bottleneck somewhere
    • ippsXX net io in ~40-60MB/s while ippxNNN and c nodes are ~10-20MB/s
  • 16:40 Gene will not need the subset of x0+1 nodes for another week, can use for processing again
  • 17:00 MEH: after reconfig on residual cleaning, turning things back up to see where limit is (if any) before nightly science
    • at stdlocal~363, ippmd~398 chip processing is 2-3x longer than average -- this will be a problem if nightly is the same
    • for nightly science config -- want to avoid any stack processing overlap w/ stdlocal (or ippmd later) --
    • ippsXX appear to be more effective at processing than most c and x0+1+2+3 groups -- so the +100 estimate for WSdiffs during nightly is probably more like +150 for c/x
    • x0+1 will also be needed for future use in relastro for a little time but not until later next week --
  • 18:30 MEH: initial setup -- will adjust during nightly and move more power into stdlocal then
    • too many running? registration slow/stalled -- half of o7003g0053o was stuck in burntool -- restart of registration cleared it but will take a bit to catch up and cycle through
  • 19:25 MEH: stdsci running ~443 jobs, giving ~30-60 minutes to establish rates, then start adding additional processing
  • 20:00 MEH: ramped up ippmd to 263 jobs (chip-warp only, ipps and x0+1b only)
  • 20:15 MEH: ipp087,084 showing high loads again but still seem responsive -- ipp082 set repair->up (somehow failed earlier) and bad news..
    • also hitting transactions warning -- turning ippmd down to 100
    • ipp082 hasn't had a battery report in /var/log/messages since 10/14 so probably a problem and w/ the large r/wsize for the mounts it gets clobbered -- this will likely be a problem for any of the 10G systems w/ larger r/wsize and should be considered for failures overnight
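    • for reference, the sort of check behind the missing-battery-report observation (the message strings vary by RAID controller, so the pattern here is an assumption):
      grep -iE 'battery|bbu' /var/log/messages | tail -5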
  • 23:55 MEH: looks like ippc20 crash@2350 -- power cycle
    <Dec/11 11:46 pm>ippc20 login: [464132.909240] general protection fault: 0000 [#1] SMP 
    
  • 00:00 MEH: ippmd almost finished with set and will want to rebuild tag w/ warp resid removed -- adding more power to stdlocal when finished. more notes to follow in morning, will need to restart stdlocal to also use full x0+1 host group until relastro needs again next week --
  • 00:25 MEH: seeing db00 warnings at odd times, lasting only a couple minutes or so --
    141211 20:21:57InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
    InnoDB: many active transactions running concurrently?
    
    141211 23:23:13InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
    
    141212  0:15:56InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
    
    
  • 01:00 MEH: registration in an odd state again -- stalled o7003g0375o is stalling o7003g0376o, stuck in pending_burntool. a couple OTAs stuck >3ks and didn't time out..
    • getting odd error when regtool -revertprocessedimfile -dbname gpc1
       -> p_psDBRunQuery (psDB.c:812): Database error generated by the server
           Failed to execute SQL query.  Error: Cannot delete or update a parent row: a foreign key constraint fails (`gpc1/chipProcessedImfile`, CONSTRAINT `chipProcessedImfile_ibfk_2` FOREIGN KEY (`exp_id`, `class_id`) REFERENCES `rawImfile` (`exp_id`, `class_id`))
       -> revertprocessedimfileMode (regtool.c:872): unknown psLib error
           database error
      
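    • a sketch of the kind of query used to see which chipProcessedImfile rows block the revert (the foreign key above means child rows exist that reference rawImfile); the exp_id is a placeholder for the stuck exposure:
      mysql gpc1 -e "SELECT exp_id, class_id, chip_id, fault FROM chipProcessedImfile WHERE exp_id = <exp_id>"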
  • 01:30 MEH: all pantasks seemed to be hit with 10-20 stalled jobs not clearing..
  • 04:50 MEH: registration looks to have seg faulted around 0330..
    • Chris also found reason for reg revert problem -- rawImfile had fault but processed chip -- manually set fault to 0 for both
      | exp_id | class_id | chip_id | fault | fault | data_state | state | state |
      +--------+----------+---------+-------+-------+------------+-------+-------+
      | 836124 | XY60     | 1321675 |     2 |     0 | full       | full  | full  | 
      | 836124 | XY73     | 1321675 |     2 |     0 | full       | full  | full  | 
      
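    • a sketch of what the manual fault reset for both imfiles amounts to, given the rows above (which of the two tables carried the fault=2 is inferred from the note, not recorded explicitly):
      mysql gpc1 -e "UPDATE rawImfile SET fault = 0 WHERE exp_id = 836124 AND class_id IN ('XY60','XY73')"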
  • 05:35 MEH: and another cluster of excess db00 transaction causing all sorts of faults --
    141212  5:35:51InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
    InnoDB: many active transactions running concurrently?
    
  • 05:55 MEH: registration moving but very slowly -- stdsci well behind so adding in 1x:x0+1 nodes since it knows about them
    • registration seems to constantly be hitting this case for ota74
      o7003g0630o  XY74 0 check_burntool neb://ipp051.0/gpc1/20141212/o7003g0630o/o7003g0630o.ota74.fits	#??? regtool -updateprocessedimfile -exp_id 836374 -class_id XY74 -set_state pending_burntool -dbname gpc1
      
  • 06:30 MEH: while stdlocal only ~300 jobs, something is clearly harassing nightly processing -- stdlocal stop

Friday : 2014.12.12

  • 06:50 MEH: on the plus side, nightly downloads finish before 7am now.. registration catching up a little faster -- +2x:c2 and +2x:m0+1 to stdsci, stdlocal needs to stay off, ippmd has been w/o jobs since ~0100
    • stdlocal needs restart for host group updates anyways --
  • 08:50 MEH: two exposures confused -- trying restart of stdsci but doubt will work, but stdsci does need a clean restart after all the nodes added last night..
    • restart picked up the stalled chip -- was missing three ota in chipProcessedImfile and found them automatically
    • manually set the warp skycell_id fault to 2 and reverted -- had all warpSkyfile in full, just wouldn't advance the run -- and still won't, so something else is missing: only 67 skycells but 68 entries in warpImfile.. -- revertoverlap results in a similar issue to the registration one -- something got out of order last night..
       -> p_psDBRunQuery (psDB.c:812): Database error generated by the server
           Failed to execute SQL query.  Error: Cannot delete or update a parent row: a foreign key constraint fails (`gpc1/warpSkyfile`, CONSTRAINT `warpSkyfile_ibfk_1` FOREIGN KEY (`warp_id`, `skycell_id`, `tess_id`) REFERENCES `warpSkyCellMap` (`warp_id`, `skycell_id`, `tess_id`))
       -> revertoverlapMode (warptool.c:744): unknown psLib error
           database error
      
      select * from warpSkyCellMap where warp_id= 1262583 and fault>0;
      
      | warp_id | skycell_id       | tess_id  | class_id | fault |
      +---------+------------------+----------+----------+-------+
      | 1262583 | skycell.1074.052 | RINGS.V3 | XY60     |     2 | 
      | 1262583 | skycell.1074.053 | RINGS.V3 | XY60     |     2 | 
      | 1262583 | skycell.1074.062 | RINGS.V3 | XY60     |     2 | 
      | 1262583 | skycell.1074.063 | RINGS.V3 | XY60     |     2 | 
      | 1262583 | skycell.1074.065 | RINGS.V3 | XY73     |     2 | 
      | 1262583 | skycell.1074.066 | RINGS.V3 | XY73     |     2 | 
      | 1262583 | skycell.1074.075 | RINGS.V3 | XY73     |     2 | 
      | 1262583 | skycell.1074.076 | RINGS.V3 | XY73     |     2 | 
      
      
    • set faults to 0 and moving forward --
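    • a sketch of what "set faults to 0" roughly amounts to for the warpSkyCellMap rows above:
      mysql gpc1 -e "UPDATE warpSkyCellMap SET fault = 0 WHERE warp_id = 1262583 AND fault > 0"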
  • 10:15 MEH: nightly ~fix/finished
    • stdlocal restarted to use 4x:x0+1+2+3 and 2x:c2 and c0/1 subset still -- ~435 jobs, see how it goes with all xnodes used (except x0/1b) for stdlocal mix processing, 4x may be too much?
    • stdsci to use storage + 3x:c2
    • ippmd: 10x:m0+1, ?x:s4+5, 3x:c2 (daytime) -- will be testing s4+5 today, stdlocal should provide sufficient data loads..
    • ipp077,082,084,087 are problems --
  • 11:30 MEH: stdlocal needs more stacks for loading the x nodes -- ippmd will take a bit to setup, manually adding 5x:m, 5x:xb, +1x:x, +1x:c2 -- so now 5x:all but 3x:c2 ~711, similar to what has been in past w/ stdlocal and ippmd
    • and now stacks load -- should be interesting, will likely need to scale back at least c2 to avoid too many stacks there w/ PSS updates -- turn off stacks for a bit and can go to ~6-10x
    • ippc18 looks like it is getting clobbered when someone tried to compile *
  • 11:50 MEH: stdlocal w/o stack 6x:c2+x+m ~930 -- chip jobs ~2-3x longer, pantasks was showing many EXIT cases (seems ok now), minimal load s4+5, no db00 warnings
    • clearly seeing more cpu_wait on the ippx and even c nodes -- near overload case
  • 12:10 MEH: adding to 7x:c2+x+m, will it trigger db00 warnings (nope), ~1067 jobs now -- processing loads seem irregularly similar however chip times have jumped 3-4x longer, so seems like clear overload case now
  • 13:00 MEH: data nodes aren't really loaded for a good MD test, may need to add processing to s0-3 storage nodes as well
  • 13:30 MEH: turn stdlocal back down to 3x:c2, 4x:m+x to start adding stacks -- cpu_wait down,
  • 13:50 CZW: restarted ipplanl/pv3shuffle to do a clean-up pass on the warp residuals. npending=20 should keep the load low, and as the majority of the files should already be deleted, few of these should get past the "find neb-key" stage of the removal code.
  • 14:00 MEH: stdlocal stack on now, ~593 active jobs, no storage nodes yet
    • no stacks really.. 14:30 adding ippmd 4x:m+x -- yes, just back to cpu_wait increase, net io still ~2GB/s and loads are similar except on ipps machines
  • 15:00 MEH: adding 50% normal stdlocal storage
  • 15:50 MEH: ipps back to ippmd fully -- all ippx to stdlocal -- more stacks are running, clearly doing better on ippx nodes now
    • stdlocal stack.off so can clear quickly when nightly starts
    • 10x:x to see how it impacts summitcopy/registration before nightly starts -- ~1250 total now w/ c2 and s0-3
  • 17:30 MEH: stdlocal 10x:x interesting results
    • effectively makes jobs single threaded but doesn't seem to run that much longer -- absolute rate exp/hr still running, but is making use of the larger memory too
    • because so many jobs to do, load looks like doing something but cpu activity is still low, net appears more bursty
    • stdlocal seems to have a hard time keeping the poll loaded, with some 100-200 idle jobs of the 1k.
    • summitcopy showing a good number of timeouts
  • 17:40 MEH: ipp008,012,013,014,016,018,037 down on ganglia -- possible all power cycled.. ipp008 memcheck and bootup in progress.. ipp013 will take forever with its memory
  • 18:40 MEH: stdsci normal storage nodes + 4x:c2; stdlocal 2x:c2 + 6x:x + normal 1x:c0,2x:c1b; ippmd 10x:m0+1 -- total ~430; 648; 172 = 1250
    • also notice ~ipplanl/pv3shuffle/ptolemy.rc poll 10
  • 19:20 MEH: multiple rolling faults --
    • ippmd stop, still happening
    • stdlocal stop, still happening
    • resid cleanup stop -- seemed to fix for a bit, stdlocal back on and jobs trying to kill ipp070..
  • 21:30 MEH: 20:58 BBU disabled on ipp070 after relearn finished.. neb-host repair and finally recovered, nightly-only processing continues -- this will be a problem w/ 10G data nodes and larger r/wsize mounts?
  • 22:20 MEH: the question once again, how much processing time is spent producing products that fault and revert to succeed..
  • again, massive registration faults -- 141212 22:44:35InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
    • cleanup poll 30, resid cleanup stop, ippmd 50% down, stdlocal less

Saturday : 2014.12.13

  • 05:00 : ippdb00 connection rates as reported by Chris' tool (~watersc1/monitor_connections.20141210/mon_con.dat) have been low (100-300) for the second half of the night. the free slot failures came in only a handful of bursts during the night:
grep "cannot find a free slot" /var/log/mysql/mysqld.err | sort | uniq -c 
[snip]
      2 141212 16:54:57InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
      8 141212 16:54:58InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     26 141212 16:54:59InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     31 141212 16:55:00InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     10 141212 16:55:01InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
      2 141212 16:55:02InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     23 141212 16:55:03InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     28 141212 16:55:04InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     57 141212 16:55:05InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     51 141212 16:55:06InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     41 141212 16:55:07InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     55 141212 16:55:08InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     77 141212 20:57:03InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     45 141212 21:13:48InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     17 141212 21:13:50InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     23 141212 21:13:51InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     88 141212 21:13:52InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     12 141212 21:13:54InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     12 141212 21:13:55InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     80 141212 21:21:15InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     42 141212 21:21:16InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     73 141212 22:44:32InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
    100 141212 22:44:33InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     17 141212 22:44:34InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
     78 141212 22:44:35InnoDB: Warning: cannot find a free slot for an undo log. Do you have too

We still have no good clue about this, but I suspect cleanup. I'm going to recommend the following: let's keep cleanup off at night and on weekends (I've shut it down now), and let's plan for campaigns to run it hard during the work day when folks can monitor the behavior in real time. Meanwhile, we need to push stdlocal processing hard to get the rate back up.
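For reference, a variant of the grep above that buckets the warnings per minute rather than per second (it only relies on the mysqld.err line format shown, with the time fused to "InnoDB:"):

    grep "cannot find a free slot" /var/log/mysql/mysqld.err \
      | awk '{ print $1, substr($2,1,5) }' | sort | uniq -c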

  • MEH: sounds good, once things were turned down and resid cleanup off, didn't have any more heavy faults.. except turning up the jobs still caused more excessive timeouts and stage faults
  • 06:10 MEH: registration holding on to old o7004g0241o -- restarting registration -- reg exposure fault --
    regtool -dbname gpc1 -revertprocessedexp -exp_id 836738
    
  • 08:40 MEH: registration backed up and seeing lots of faults in stdsci -- too many things running?
    • also o7004g0723o74 trying to download on ipp051 for ~1 hr.. killed and restarting
    • will take a bit for burntool and registration to finish for nightly to finish
  • 09:30 MEH: with stdlocal reduced can see what is more of a repeating fault when things aren't regularly faulting -- cannot build growth curve
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1267946  -skycell_id skycell.2240.013
    
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1268074  -skycell_id skycell.2103.045
    
  • 12:03 EAM : added storage hosts to stdlocal, bumping up the c2 usage (We are not seeing ippdb00 transaction errors now)
    • MEH: isn't that because ~ipplanl/pv3shuffle/ptolemy.rc isn't actively running? -- two kinds of faults illustrated w/ nightly science running -- the massive ones from db00 transactions and the general 10-20% constantly from just many jobs running
    • MEH: general cleanup is also stopped still
  • 12:35 EAM : restarting stdlocal: 8*x0, 8*x1, 6*x2, 6*x3, 2*c0, 3*c1b, 6*c2.
    • MEH: one of the notes above from Friday: too many nodes loaded resulted in a growing number of idle jobs; somewhere around ~800 things seemed to have trouble
  • 23:05 MEH: not sure if stdlocal was rebalanced for nightly science after Gene added more nodes, but the nightly rate was <40, not the 50-60 needed for processing including WS.. taking out storage nodes and reducing 6->2x:c2

Sunday : 2014.12.14

  • 22:30 MEH: stdsci is struggling, needed a regular restart earlier..
    • also looks like low on disk space so things are struggling there as well