ITC shuffle

First segment results

After finishing the first 100k shuffle, the end statistics were:

                           alljobs                                  success                                  failure
AV Name      Njobs     Tmin      Tave        Tmax    Njobs     Tmin      Tave        Tmax    Njobs     Tmin      Tave        Tmax
-+ runjob   100010     1.01   1622.12   850661.23    91463     8.13   1453.07   203219.15     8547     1.01   3456.48   850661.23

This is slower than expected from the initial tests, but that comparison doesn't take two factors into account:

  • The first ~50k were run using only the stare nodes, not with the additional ~200 pantasks clients.
  • The majority of those stare nodes hung during the week of Thanksgiving, clogging the task list with jobs that had very long execution times.

I had to stop the second shuffle today because the network to the ITC was being worked on, so the only results from the second batch are based on a small sample:

                           alljobs                                  success                                  failure
AV Name      Njobs     Tmin      Tave        Tmax    Njobs     Tmin      Tave        Tmax    Njobs     Tmin      Tave        Tmax
++ runjob      605     5.20   2060.32      4659.94      473  1098.32   3145.94      4659.94      132     5.20   1632.36      3697.19

These still have very long execution times. I'm not sure where the bottleneck is.

Looking at the ganglia plots for the network:

stare00   Processing nodes in use since the beginning (5 nodes, 8x loading)
ipp110    Target storage nodes (13 nodes)
ippx055   Additional processing nodes (34 nodes, 6x loading)

The target nodes all appear to cap at 8MB/s, so the transfer is moving ~104MB/s to the ITC. When this was using only the stare nodes, ~22MB/s was flowing through each (roughly the same total). Adding the additional processing nodes dropped the stare node throughput to ~4MB/s, similar to the ~3MB/s seen on the lighter-loaded x-nodes.

I suspect this means we're currently being limited by the 1G ITC-MRTCB connection.
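A quick back-of-the-envelope check supports this (a minimal sketch, assuming the 13 target nodes really are each capped at ~8MB/s and that the link is a nominal 1Gbit/s):

  #!/usr/bin/env perl
  # Back-of-the-envelope: does the observed per-node cap saturate a 1G link?
  use strict;
  use warnings;

  my $nodes    = 13;     # ipp110 target storage nodes
  my $per_node = 8;      # observed cap per target node, MB/s
  my $link     = 1000;   # nominal ITC-MRTCB link capacity, Mbit/s

  my $total_MBps = $nodes * $per_node;   # ~104 MB/s aggregate
  my $total_Mbps = $total_MBps * 8;      # ~832 Mbit/s
  printf "%d MB/s = %d Mbit/s (%.0f%% of the link)\n",
         $total_MBps, $total_Mbps, 100 * $total_Mbps / $link;

That works out to ~832Mbit/s, most of a 1Gbit/s link even before protocol overhead, which is consistent with the link being the ceiling.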

Shuffle operations

rawcheck.pl

In order to safely transfer data to the new IPP nodes at the ITC, I've updated the ippScripts/scripts/rawcheck.pl code. The new code does the following operations (a rough sketch of the per-exposure logic follows the list):

  • General setup.
  • Set up a hard-coded requirement map containing the number of copies that must be available at given locations. Currently set to ITC = 1, OFFSITE = 1, MRTCB = 0.
  • Iterate over all IPP storage volumes, assigning them to the appropriate location.
  • Determine which volumes are acceptable for new instances. This uses the standard nebulous rules, and manually excludes ippb05 due to some unresolved glockfile issues.
  • Construct lists of volumes at each location that can be randomly drawn from to select replication targets.
  • Run regtool to get the database information for the exp_id to be considered. Then, for all class_ids in that exp_id:
    • Find all the nebulous instances for the imfile.
    • Check each instance to ensure it matches the database md5sum value, and assign it to a list of "good" or "bad" copies.
    • Attempt to fix any bad copies from a known good copy.
    • Check each location to see if it has a sufficient number of copies, and if not, replicate to a randomly drawn volume at that location, checking the md5sum afterwards.
    • If culling, cull excess copies, but only if no replications have happened during this pass (a safety measure that forces at least two passes, so the system can fully settle between replication and culling).

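For reference, here is a rough sketch of that per-exposure logic. This is illustrative only: the helper subroutines and lookup hashes (instances_for, md5sum_of, replicate_and_verify, %db_md5, and so on) are stand-ins for the real ippScripts/nebulous calls, not actual function names.

  # Simplified sketch of rawcheck.pl's per-exposure loop; helper subs are
  # placeholders, not the real ippScripts/nebulous API.
  my %required = ( ITC => 1, OFFSITE => 1, MRTCB => 0 );   # copies needed per location

  for my $class_id (@class_ids) {                 # all class_ids for this exp_id (via regtool)
      my @instances = instances_for($class_id);   # every nebulous instance of the imfile
      my (@good, @bad);
      for my $inst (@instances) {
          if (md5sum_of($inst) eq $db_md5{$class_id}) { push @good, $inst }
          else                                        { push @bad,  $inst }
      }
      next unless @good;                          # nothing trustworthy to copy from
      repair_from_good($_, $good[0]) for @bad;    # try to fix bad copies from a good one

      my $replicated = 0;
      for my $loc (keys %required) {
          my $have = grep { location_of($_) eq $loc } @good;
          while ($have < $required{$loc}) {
              my $vol = random_volume_at($loc);       # drawn from the acceptable-volume lists
              replicate_and_verify($good[0], $vol);   # copy, then re-check the md5sum
              $have++;
              $replicated++;
          }
      }

      # Cull excess copies only if nothing was replicated this pass, so that
      # culling always happens on a later, fully settled pass.
      cull_excess($class_id, \%required) if $do_cull && !$replicated;
  }
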
This has been committed to the trunk at r39812 and to the ipptest user's ipp-20150312 build at r39813.

pantasks

I've run a pantasks using only the stare nodes with 8x loading, plus additional x-node power (x2 and x3 at 6x loading, x1 at 2x), for ~300 active jobs in total (based in /data/ippc19.0/home/watersc1/itc_sync_script.20161026/test_pant). This seemed to have no impact on the cluster, although no nightly science observations were taken. I've also updated the ipptest/replication directory to match this, and supplied it with the split file lists of commands. Based on the ipplanl/stdlocal pantasks input, I've copied the turn.on/off.storage tasks into this input but left them commented out, since I'm not sure what the correct loading for them should be. This runs as the ipptest user, simply so that a group user manages the process.

To run the pantasks, symlink the appropriate command list into place:

  • ln -sf ./split_cmds.00 ./scan.cmds

Start the server as usual (running on stare04), read the input, and issue the command to read the command list:

  • init_from_file

Then run as usual. It would be good to save the failures to a file before restarting a subsequent pantasks, so these can be retried:

  • ./fails > fails.00

Timing

The initial testing (currently at 8119 of 10000 jobs) has an average execution time of 426.89s, but with a very long tail. I manually killed the ~10 hung glockfile processes that prompted the ippb05 exclusion, so those jobs show execution times of ~60000s. The current 'controller status' output shows jobs still running at 600-700s, with no clear reason for the slowdown. Some of the initial jobs had been run previously during testing, but that should account for ~1500 jobs at most.

Problem cases

The main source of problems I've noticed is mounting issues. Usually only one node has trouble, and even then, usually only with a single mount point. Running umount -f /data/ipp0XY.Z sometimes clears this up, after killing all rawcheck.pl instances running on that problem node (sometimes killing glockfile instances is also necessary). When this doesn't work, restarting NFS on the problem node (the one with the hung mount, not the NFS server) seems to sort things out: /etc/init.d/nfs restart && /etc/init.d/rpc.statd restart.

Possible mitigation in case of speed concerns

There is currently no way to populate an empty instance for a pre-existing nebulous key. However, obtaining the necessary ins_id should not be hard, which suggests it may be possible to rsync large batches of imfiles to the ITC cluster and then run a "re-insert" operation that constructs hard links from the rsynced location to the expected new instance, rather than copying each file to that instance again. None of this code has been written.
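
If it were written, the shape of it might look something like the sketch below. This is purely hypothetical: create_empty_instance, path_for_instance, filename_for, verify_md5, and the lookup hashes are placeholders for whatever nebulous operations would allocate an ins_id for an existing key and report its target path.

  # Hypothetical "re-insert" sketch.  Assumes the imfiles have already been
  # rsynced into $staging on an ITC node, and that $staging lives on the same
  # filesystem as the target volume (hard links cannot cross filesystems).
  use File::Basename qw(dirname);
  use File::Path     qw(make_path);

  for my $key (@nebulous_keys) {
      my $staged = "$staging/" . filename_for($key);    # copied here by rsync
      next unless -e $staged;

      my $ins_id = create_empty_instance($key, $target_volume);
      my $dest   = path_for_instance($ins_id);

      make_path(dirname($dest));
      link($staged, $dest)                              # hard link instead of a second copy
          or die "link failed for $key: $!";
      verify_md5($dest, $expected_md5{$key});           # same md5 check rawcheck.pl does
  }

The obvious constraint is that the rsync staging area would have to live on the same filesystem as the target volume for the hard links to work.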
