2014-07-02

The preliminary maintenance procedure is (a command-level sketch follows the list):

  1. Log in as ipplanl user on main node (currently ippc11).
  2. Check the status of the ssh control master connection: ./lanl_control_tool.pl. If the connection is down, skip to step 5.
  3. Stop the network-active pantasks tasks (this needs a control script): remote.poll.off; remote.exec.off. Wait for outstanding *.run jobs to terminate; this may take a long time.
  4. Terminate the running master connection: ./lanl_control_tool.pl stop.
  5. Confirm that the correct .ssh/config file will be read: ./lanl_switch_users.pl <lanl_username>
  6. Reinitialize master connection: ./lanl_control_tool.pl start. Enter password at prompt. This will start the connection, and then push it into the background, returning control of the terminal.
  7. Restart network active pantasks tasks: remote.poll.on; remote.exec.on
  8. Revert any paused jobs: remotetool -dbname gpc1 -revertauth -label <PV3_processing_label>
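A condensed, untested command-level sketch of the above (the pantasks toggles still have to be issued inside the controlling pantasks session; LANL_USERNAME and PV3_LABEL are placeholders):

  ssh ipplanl@ippc11
  ./lanl_control_tool.pl                  # check the master connection status
  # pause remote.poll / remote.exec in pantasks; wait for *.run jobs to drain
  ./lanl_control_tool.pl stop             # only if the old connection is still up
  ./lanl_switch_users.pl LANL_USERNAME    # select the correct .ssh/config
  ./lanl_control_tool.pl start            # prompts for password, then backgrounds
  # re-enable remote.poll / remote.exec in pantasks
  remotetool -dbname gpc1 -revertauth -label PV3_LABEL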

2014-06-09

Threading tests:

#stage  nthread  walltime  avgload  avgmem    avguser  dtimeTot
chip    2        560       12.595   0.634793  47.9375  94.1957
chip    3        496       14.31    0.903884  65.35    81.9142
chip    4        492       14.6729  0.904637  61.8286  81.7962
warp    2        259       11.4914  0.751037  53.2143  28.9995
warp    3        246       12.446   0.780706  52.8     27.2578
warp    4        241       12.4033  0.780651  58.6833  26.1379
stack   2        967       13.7547  0.938137  53.3824  427.864
stack   3        760       16.171   0.937694  60.14    339.402
stack   4        631       20.8767  0.91656   81.7917  272.584

2014-05-16

After sending Ken a summary of how the remote processing works, I thought it would be useful to copy that information here.

  • Password issue (box 1 on the diagram).

The head node runs the pantasks controlling all processing. The user doing the processing needs to have a key-card validated connection to LANL. Due to the LANL security system, this connection is disconnected every 12 hours, and must be re-established to do any executing or polling of remote jobs.

  • Slow preparation issue (box 2 on the diagram).

Each remote job needs to have all of its file dependencies and commands structured, and this can happen on any node in the pantasks. It does not require an active connection to LANL, as it is purely IPP-side. As each remote job consists of ~1000 exposures, and each exposure has on the order of 60-80 sub-components, these jobs take a long time (each subcomponent seems to take about a second or less). However, since we can run many of these prep jobs simultaneously, their average cost can be brought down. The trade-off is that if we prepare many jobs in advance, we commit to running them all.
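To put rough numbers on that: ~1000 exposures x ~70 subcomponents x ~1 s is on the order of 70,000 s, i.e. roughly 19-20 hours of serial preparation per job, which is why running several prep jobs concurrently matters.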

  • Data persistence issue (not directly visible on diagram).

Having many jobs in the queue requires a lot of data to be transferred, and subsequent stages (cam/warp/stack) need to be executed before the LANL disk quota removes needed results from previous stages. We are only transferring back (box 7) the permanent data products that we want to keep, not all intermediate products.

  • Remainder of explanation (boxes 2+ on diagram)
  • Box 2: Prepare

Each stage has a different prepare script, but they all output the same products: a list of filename requirements (in multiple formats to match the nebulous/disk representations), a list of commands to execute (the ppImage/psastro/pswarp/ppStack calls, but also various mkdir commands to ensure output directories can be constructed, and symlink commands to alias files from their nebulous key to their nebulous disk representation), and a resource requirement.
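For orientation, a chip-stage command list conceptually contains lines of the following shape (purely illustrative; the @...@ names are placeholders, not the actual prepare output format):

  mkdir -p /scratch/USERNAME/@DATA_GROUP@/@EXP_NAME@/
  ln -s /scratch/USERNAME/store/@NEB_KEY@ /scratch/USERNAME/@DATA_GROUP@/@EXP_NAME@/@EXP_NAME@.@OTA@.fits
  ppImage -file @EXP_NAME@.@OTA@.fits @OUTPUT_ROOT@ -recipe PPIMAGE CHIP -threads 2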

  • Box 3-5: Exec

Once the prepare is finished, pantasks can launch an exec script (box 3) via the ssh link that feeds that data through to "mu-fe", the Mustang front-end computer. This launches the transfer tool on mu-fe (box 4), which in turn connects via ssh and tar to the IPP storage accessible from ippc2X. Once all required data has been downloaded to the scratch disk at LANL, the job resource request is submitted to the MOAB scheduler (box 5).
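The transfer itself is conceptually an ssh-piped tar; a minimal sketch of the kind of command involved (hostnames, paths, and the file list are placeholders, not the actual transfer tool):

  ssh ippc2X "tar -cf - -T @REQUIRED_FILES_LIST@" | tar -xf - -C /scratch/USERNAME/@REMOTE_RUN_DIR@/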

  • Box 6-7: Poll

After some amount of time (currently one hour after the GPC1 database entry for the remoteRun was last updated), pantasks launches a poll command over the ssh link to mu-fe (box 6). If the job has completed, the filelists calculated in the prepare step are used to pass back any permanent data to the IPP storage nodes (box 7). Included in this are the dbinfo files, which contain the update commands necessary to update the GPC1 database, and these are executed by the poll command before it sets the remoteRun state to 'full'.

Unpictured are the IPP advance commands that can also run in the pantasks, which launch the next stage processing for the data returned.

2014-04-08

The final test to do was the parallelized test, using differing numbers of front end nodes. I manually launched these jobs, as I haven't organized a proper automatic parallelization over the front end nodes. I also increased the bundle size to 100MB, which does squeeze a bit more speed relative to the 10MB bundles (54.5 MB/s vs 30.5 MB/s at 32 jobs per node). The linear fits are simple eye-ball lines, illustrating that we do not scale perfectly between 1 and 4 nodes. Based on this, we can get gigabit/s speeds using 3 nodes without seriously overtaxing the system with connections.

The only block now is to write something that can take input file lists, construct the necessary bundles, and then farm those jobs out across the front end nodes to do the work.

2014-04-04

After finding terrible transfer rates on the post-chip return transfer, I set up a test set to probe how quickly I can transfer data back. I created files with a range of sizes and used a range of parallel job counts, as shown in the table of figures below. The main result I hadn't expected was that the majority of the transfer problem is likely due to the inefficiency of transferring small files. Even though I attempted to use the ControlMaster path (which should minimize the connection setup time), this did not significantly improve the transfer of small files. (ControlMaster has a connection limit, which the high-parallel tests violated; this explains the break in the 10-megabyte file rate.)

In any case, for the 10MB file (which is of order a single raw OTA image), the single source speed was around 30-40 MB/s.

2014-03-28

The main final block on smoothly doing processing is the data transfer operations. I've spent some time this week thinking over how to structure this, and I think I've finally decided on the way that will work best. I attempted to see if the processing nodes could initiate connections outside the cluster, but that does not seem to be possible ("ssh: Could not resolve hostname ippc20.ipp.ifa.hawaii.edu: Temporary failure in name resolution").

This means relying on the front end nodes to do all the work. Mark sent me some forwards from the UMD team, as well as the perl script they're using for their data transfers. This script appears to stream a series of forks to run wget to fetch the data from the webservers on the ippcXX nodes. This is also being run in a virtual machine, although the emails suggest that that isn't a requirement. However, it does seem that the script is somewhat specific to the UMD downloading setup. For the IPP/SC interface, we need to transfer things over scp, and there are filename translations that need to be taken care of (disk file locations are different between the local and remote disks).

Because of this, I think the best solution is to have the prepare_* steps generate the expected output files and their local instances. Pushing this to the exec phase would mean recalculating information that is no longer available. Therefore, as part of the prepare step, while generating the lists of files to transfer initially and defining the remote commands to execute, a list of expected output files that need to be transferred back also needs to be created. In addition, due to the constraints of the nebulous filesystem, neb-touch commands will be run to instantiate a disk file that can then be added to a list of transfer destinations. This also simplifies the requirements for the transfer engine, which only needs to know how to run "scp --options" on an element from file A and a destination from file B (so the same program can do both the fetching and the pushing).

Since the required transfer code is not complicated, I think pushing the work to GNU parallel is probably the easiest option. This allows all front end nodes to do scp jobs, and the job/host parameter can be tuned as well.
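Since GNU parallel can pair lines from two input files, a minimal sketch of such a transfer engine might look like the following (untested; the list filenames and front-end hostnames are placeholders, and -j/-S are the knobs for jobs per host and which hosts to use):

  parallel --link -j 8 -S FRONTEND1,FRONTEND2,FRONTEND3 \
      scp -o BatchMode=yes {1} {2} :::: transfer_src.list transfer_dst.list

Here --link pairs line N of transfer_src.list with line N of transfer_dst.list, so the same invocation serves for both fetching and pushing depending on which list holds the remote paths.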

2014-03-21

To fix the time problems from the previous update, I reworked the code to serialize the mkdir/ppImage/dbinfo steps into a single job that gets farmed out. This completed successfully (~49 minutes), and I reran the test today with the previous run moved out of the way, so that the mkdir steps could not succeed trivially on directories that already existed. This run also completed in ~50 minutes, suggesting that the timeout issue is resolved. It also works out to ~83% of the requested time, close to the expected target.

I'm still trying to come up with a useful data-return system, but at least the stage-level scripts can largely be finished now that a working chip stage is done.

Returning data

I think this has to be done such that the data is requested by the local side and then pushed from the remote side (due to the connectivity issues with fetching data under the password system). My current model is (a rough shell sketch follows the list):

  • Local script polls for job completion.
  • Upon receiving the completion signal, launch the data return script remotely that will send the data back.
  • Parse the results of that script, install files into the correct local location, execute database commands.
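A minimal polling wrapper might look like the following (a sketch only; it assumes the poll script signals completion through its exit status, which is an assumption about its eventual interface):

  # keep polling until the remote job bundle reports completion
  until sc_remote_exec.pl --poll --remote_id "$REMOTE_ID"; do
      sleep 3600
  done
  # on completion, the remote side pushes the permanent products back, and the
  # database commands returned with them are executed locally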

2014-03-19

The newest concern is the speed of individual commands in the job bundle. Things that should not account for a significant portion of the requested time are running slower than expected, causing the job to fail by exceeding its time allocation. From the various time stamps and change stats:

try 1   first directory timestamp   2014-03-18 20:02:39
        directory 119 timestamp     2014-03-18 20:14:18
        last directory (620)        2014-03-18 20:40:50
try 2   submit time                 2014-03-18 18:45:07
        complete time               2014-03-18 20:59:53
        beginning date command      2014-03-18 19:59:30
        first ppImage               2014-03-19 02:00:51
        last ppImage                2014-03-19 02:39:06
        first dbinfo                2014-03-18 20:48:00
        last (119) dbinfo           2014-03-18 20:59:00

Based on these times (and shifting the ppImage times to a matching timezone), the time we would need to request for the job to complete successfully from scratch is far larger than expected:

step     time (mm:ss)   notes
mkdir    26:32          This took effectively zero time in the second try, as the directories already existed.
ppImage  38:15          This is about 64% of the requested time, smaller than the 80% target value.
dbinfo   57:19          This is a projection from the 119 that completed.
total    122:06

My suspicion is that writing to disk in a parallel fashion is far slower than expected, and that the jobs pile up in a disk-wait state, so the parallel mkdir and echo commands become overwhelming. As the writes for the ppImage stage are more randomized and less intense, this disk-wait isn't a problem there.

2014-03-17

Due to the fact that ppImage doesn't make directories for output files, and the realization that the stats-to-database-information step needs to be parallelized as well, I've reworked the command creation to operate in a pre/command/post split. For the chip stage, the pre commands are simply mkdir -p calls to ensure that all output directories exist. The command phase for chip is the detrend-calculated ppImage commands from before, and the post phase constructs dbinfo files. These files are created by an echo -n (to record information that is known when the command list is calculated but is hard or impossible to obtain on the remote cluster: database id values, etc.), followed by a ppStatsFromMetadata call that converts the stats file into the remaining database flags (execution time, object counts, background levels, etc.).

2014-02-24

The detrend calculation was impractically slow in the previous version of prepare_chip. Since each imfile needs ~6-7 detrends identified, doing this with detselect commands was very slow, leading to an execution time for SAS of ~9 hours. The fix chosen was to have the script look up the information for all possible detrends and compile this into an in-memory catalog. This makes the script start up much more slowly, but reduces the overall time for large datasets. A quick check comparing the old and new command lists shows that the two methods produce identical results. The new code is still slow (~2 hours), but is definitely an improvement. For 620 exposures with 60 imfiles each, this works out to an average of ~0.2 s/imfile.

The next constraint I'm running into is that attempting to process the SAS illustrates the need to sort out a better way of transferring data. The SAS detrend set and the test detrend set I have previously transferred are effectively disjoint. This means transferring ~9x60 files, bumping "how to transfer data" from something to think about once the SAS test has started to a more active issue.

Allocation calculation

For chip, I'm assuming the median ppImage command takes time_cost ~150 seconds with 2x threading (based on some manipulation of the database values). Then, for a bundle containing N ppImage commands, the allocation is calculated as:

 proc_per_node = 24
 fill_factor   = 0.8
 max_nodes     = 1000

 Core_need = N * Nthreads
 Node_need = Core_need / proc_per_node
 Time_need = N * time_cost

 time_requested = int((Node_need * time_cost) / (fill_factor * max_nodes)) + 1
 node_requested = int((Node_need * time_cost) / (fill_factor * time_requested)) + 1

 N         = 37200
 Nthreads  = 2
 time_cost = 150 / 60 / 60   (hours)

 Core_need = 74400
 Node_need = 3100
 Time_need = 1550

 time_requested = int(3100 * 150 / 60 / 60 / 0.8 / 1000) + 1 = 1
 node_requested = int(3100 * 150 / 60 / 60 / 0.8 / 1) + 1 = 162

 unused side calculation:
 Time_need / (node_requested * proc_per_node / Nthreads) = 0.79733 hours

Login issue

One main obstacle is related to login issues. The current procedure I've been using is:

  • Start master connection using key card.
  • Start secondary connections on that host using the ControlMaster setup. I originally set ControlPath /home/panstarrs/watersc1/.ssh/connection/%h_%p_%r, which is the suggested format, as it makes the connection dependent on the remote hostname (%h), the remote port (%p), and the remote username (%r). I have now changed this to also include %l, the local host name; without it, multiple local hosts cannot log in at the same time (the connection described does not exist on the new host). The updated line is sketched after this list.
  • Connecting to multiple hosts can be done with only a single key entry, as you can generate a series of tokens once the card has been unlocked. These tokens do need to be manually entered when the host's master connection is initiated.
  • Become disconnected after 12 hours due to remote end security.
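For reference, the updated line would look something like the following (the exact token order is my choice; the point is only the added %l):

  ControlPath /home/panstarrs/watersc1/.ssh/connection/%l_%h_%p_%r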

This procedure becomes slightly trickier in an expanded/parallelized format. The current set of problems that I see at the moment are:

  • One user needs to initiate all connections. This can be alleviated by working as the IPP user, dropping the %r in the ControlPath, and setting remote umask values in such a way that all local-side users with key cards can resume a connection for the (local) IPP user.
  • Initializing large numbers of hosts is clunky. I don't see any way around sshing to all local connection hosts and starting the master connection. Being able to refresh tokens minimizes the trouble.
  • Every 12 hours, someone will need to refresh the connections due to the auto-disconnect. This could become a major concern for data-transfer tasks: transfers will likely need to run fairly regularly, and dropped connections reduce their duty cycle.

2014-02-19

The database changes and tool work has been done, and so far I've tested the IPP side of the code to handle multi-exposure job bundles:

# remote_id 2
chiptool -definebyquery -dateobs_begin 2011-07-03T00:00:00 -dateobs_end 2011-07-06T23:59:59 -obs_mode SAS2  -set_label lanl.SAS.201402114 -set_data_group czw.SAS.201402114 -set_end_stage warp -set_tess_id RINGS.V3 -set_workdir neb://@HOST@.0/czw/SAS.201402114/ -filter g.00000 -pretend -simple -set_dist_group SAS
chiptool -definebyquery -dateobs_begin 2011-07-03T00:00:00 -dateobs_end 2011-07-06T23:59:59 -obs_mode SAS2  -set_label lanl.SAS.201402114 -set_data_group czw.SAS.201402114 -set_end_stage warp -set_tess_id RINGS.V3 -set_workdir neb://@HOST@.0/czw/SAS.201402114/ -filter i.00000 -pretend -simple -set_dist_group SAS
chiptool -definebyquery -dateobs_begin 2011-07-03T00:00:00 -dateobs_end 2011-07-06T23:59:59 -obs_mode SAS2  -set_label lanl.SAS.201402114 -set_data_group czw.SAS.201402114 -set_end_stage warp -set_tess_id RINGS.V3 -set_workdir neb://@HOST@.0/czw/SAS.201402114/ -filter r.00000 -pretend -simple -set_dist_group SAS
chiptool -definebyquery -dateobs_begin 2011-07-03T00:00:00 -dateobs_end 2011-07-06T23:59:59 -obs_mode SAS2  -set_label lanl.SAS.201402114 -set_data_group czw.SAS.201402114 -set_end_stage warp -set_tess_id RINGS.V3 -set_workdir neb://@HOST@.0/czw/SAS.201402114/ -filter y.00000 -pretend -simple -set_dist_group SAS
chiptool -definebyquery -dateobs_begin 2011-07-03T00:00:00 -dateobs_end 2011-07-06T23:59:59 -obs_mode SAS2  -set_label lanl.SAS.201402114 -set_data_group czw.SAS.201402114 -set_end_stage warp -set_tess_id RINGS.V3 -set_workdir neb://@HOST@.0/czw/SAS.201402114/ -filter z.00000 -pretend -simple -set_dist_group SAS
remotetool -definebyquery -label lanl.SAS.201402114 -stage chip -path_base neb://@HOST@.0/czw/lanl.SAS.201402114
sc_prepare_chip.pl --remote_id 2 --camera GPC1 --dbname gpc1 --path_base neb://@HOST@.0/czw/lanl.rtdb.t1/remote_chip.2 --no_update --verbose

# remote_id 3
chiptool -definebyquery -set_label czw.lanl.t2 -set_workdir neb://any/czw/lanl.t2/ -exp_id 555555
chiptool -definebyquery -set_label czw.lanl.t2 -set_workdir neb://any/czw/lanl.t2/ -exp_id 555556
remotetool -definebyquery -label czw.lanl.t2 -stage chip -path_base neb://@HOST@.0/czw/lanl.rtdb.t1/ -trace db 45
sc_prepare_chip.pl --remote_id 3 --camera GPC1 --dbname gpc1 --path_base neb://@HOST@.0/czw/lanl.rtdb.t1/remote_chip.3 --no_update --verbose

# remote_id 4
chiptool -definebyquery -set_label czw.lanl.t2 -set_workdir neb://any/czw/lanl.t2/ -exp_id 555557
chiptool -definebyquery -set_label czw.lanl.t2 -set_workdir neb://any/czw/lanl.t2/ -exp_id 555558
remotetool -definebyquery -label czw.lanl.t2 -stage chip -path_base neb://@HOST@.0/czw/lanl.rtdb.t1/ -trace db 45
sc_prepare_chip.pl --remote_id 4 --camera GPC1 --dbname gpc1 --path_base neb://@HOST@.0/czw/lanl.rtdb.t1/remote_chip.4 --no_update --verbose
neb-ls neb://@HOST@.0/czw/lanl.rtdb.t1/remote_chip.4%

Unfortunately, the prepare_chip call for SAS is taking much longer than I'd expected, largely due to the cost in looking up the detrends. I think this means that chiptool needs to be able to do the detselect operations itself, reducing the number of extra calls that need to be done.

2014-02-07

I think that the stask interface is working now, although I still would like to try and sort out the retry code (I have some untested changes for my next test). However, extending the current scripts from the single exposure to the large N format is not really possible without the database changes, so I'm making the final changes to the database schema and tool. I'd like to fold this change into the working database on Monday, which will allow me to finish the scripts early next week.

  • remoteRun
    • remote_id
    • state
    • stage
    • label
    • path_base
    • job_id
    • last_poll
    • fault
  • remoteComponent
    • remote_id
    • stage_id

With this schema, a remote job contains only one stage, a simplification largely made so that the different stages can be kept in different scripts. There's no absolute limit on the number of components included in a given job, so this can be tuned to whatever size is most efficient to work with. I still expect to do a SAS run once the smaller tests are complete, and given the current discussions, we may simply put the 620 exposures in this set into a single remoteRun entry (for each stage) and allocate enough processing to do it in one pass. The processing structure is now:

  • Define chip jobs manually as we do now.
  • Define remoteRun/Component entries: remotetool -definebyquery -label asdf -stage chip -limit N
  • Prepare the run: sc_prepare_chip.pl --remote_id X
  • Execute the run: sc_remote_exec.pl --remote_id X
  • Poll for completion: sc_remote_exec.pl --poll --remote_id X
    • Retrieve the completion status for the chipRun level updates.
    • Do any file transfers needed. This may need to simply be scheduled.
    • Mark remoteRun as full
  • Completed chipRuns auto-queue the corresponding camRun entries, which are subsequently polled by remotetool -definebyquery -stage cam

The policy value from the previous iteration no longer exists. With retries built into stask, we'll simply accept that any item that can't finish after a fixed number of attempts (2-3) will never complete and will be skipped.

2014-02-04

Due to the limitations with srun, it's likely going to be easier in the future to use stask instead. This isn't a major change in the scripting, as it's simply an issue of switching the syntax around.

The more major shift is going to be in making sure we use the system efficiently. The scheduling system is designed to run jobs that execute for 8-12 hours, which the current single stage-single exposure system doesn't do. In addition, processors are allocated as full nodes (24 cores), so it's most efficient to use all of them. A quick calculation suggests that an 8 hour, 5 node block could process ~525 exposures through chip, ~2750 through camera, or ~1000 through warp (based on some quick averages from the database dtime values). Launching jobs of this scale requires some rethinking of how to stage things.

I still think holding the remoteRun fixed to a single mrun command is best. However, instead of the remoteRun defining everything, it needs to have child rows in a different table (remoteExp?) that points to the stage, stage_id, file lists, etc that are contained in the remoteRun. This will need some work to redevelop.

Further thinking leads me to the conclusion that with stages of this size, it's likely to be completely impractical to do much manual reverting. I'm beginning to think that if a command fails twice (is run, fails, is attempted again when that state is found, and fails again), it should just be abandoned and ignored. stask doesn't currently retry failed jobs (it notes the return code, but applies no logic to it).

2014-02-03

I wanted to wait to see the result of my srun command. It turns out to be disappointing:

srun: error: Multi_prog config file test24.config is too large

A search reveals that this file exceeds the 60000-byte limit (it is 69854 bytes). I suspect this means rewriting things to do multiple parallel runs in series.

I think the solution to this is to simply break up the srun config into smaller chunks. Since all chip stage things are likely to be roughly the same length as this one, breaking it in half should work. The camera stage is one command, and splitting the warp commands (~70ish) in half should be under the limit as well.
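As an untested sketch, GNU split can break the config without splitting lines; the only wrinkle is that the task indices inside each piece would need to be renumbered to start at 0, and the srun -n count adjusted to match:

  split -n l/2 -d test24.config test24.config.part
  srun -n "$(grep -c '^[0-9]' test24.config.part00)" -l --multi-prog test24.config.part00
  srun -n "$(grep -c '^[0-9]' test24.config.part01)" -l --multi-prog test24.config.part01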

Another issue I've thought about is the reference catalog. For the chip stage detrends, I'm assuming that constant use will keep them active on the scratch partition. In addition, we're already precalculating which are needed for the ppImage calls, so keeping these as entries in the set of files to check the existence of is fine. However, the reference catalog files aren't necessarily used regularly, and I don't know if we know ahead of time which entries will be used (I suspect not). Therefore, instead of treating this as a detrend, I think it will need to be put into the permanent storage area. The same is true of the tessellation directories. A quick check indicates this is ~137G (refcat) and 530M (tess dir) of space.

2014-01-31

The first set of programs seems to be working for controlling remote processing. I've revised the 2014-01-23 section to reflect the changes I made while developing the code. I've been able to successfully run things up to the job polling state. It seems that my test job is stuck waiting for resources, which leads to the next main issue.

I'm not completely sure that I have the job resources properly defined. This breaks down into the following points:

  • I allocate 60 nodes to execute the 60 chip stage jobs in the msub command. This in turn calls srun which has 60 ppImage commands defined. I don't fully understand if this is doing what I expect (grabbing 60 in msub, and allocating one to each ppImage via srun), or if I'm double allocating.
  • I'm allocating 60 nodes with ppn (proc/node) of 2 (as a test to reduce the resource request while debugging). I don't actually care to have separate nodes (one 8 proc node could run 4 jobs), so I don't know if I'm over-restricting my request.
  • I don't have a good idea of the actual requirements for IPP jobs. Are some jobs sufficiently well threaded that I should be trying to run a higher number of threads?

2014-01-23

After completing some of the most obvious scripts that are necessary to make this work, I've come up with the following processing outline that I think satisfies all of the needed steps.

Queuing/database management

All processing needs to begin with chipRun entries defined in the standard way. chiptool -pendingimfile provides the majority of the information needed, so it's best to reuse this framework. However, after the chipRun entries are defined, they need to be registered into a "remoteRun" database entry. My current assumption is that a chip_id or chipRun.label field will be used in the definebyquery statement. The remoteRun entry will manage the communication with the remote resources. This table needs to have the following fields:

  • remote_id: main table index
  • stage: current stage being processed for this entry
  • stage_id: reference to the stageRun table
  • state: in addition to the run/full/cleaned values, this will need values pending, auth, and error (described below).
  • last_poll: date string indicating last time this entry was remotely polled.
  • job_id: remote processing job_id.
  • fault: standard fault code
  • path_base: local storage location for supercompute file generation.
  • job_policy: policy state for dealing with partially incomplete runs.

Processing order

Once the remoteRun is inserted (with state = 'new'), the initial prep task will run, doing the following:

  • Construct list of required files necessary to complete the task on the remote cluster, and their local analogs (if they exist).
  • Construct the command script that defines the resources needed and will launch the srun command.
  • Construct the config script needed for srun to parallelize the processing.
  • Update remoteRun to state = 'pending'

With the preparation done, the launch task runs, with the following effects:

  • Check that the connection to the remote cluster is available. If not, set the database entry to state = 'auth', update the last_poll entry, and quit.
  • Send the prepared files from the previous step. This includes the command and config scripts, as well as a list of required files needed to complete the job. The file validation script is run on the remote end and outputs any missing entries through the ssh stream. A transfer list is constructed of the files that need to be pushed to the remote site.
  • Any file pushes that are required are performed.
  • The command script is inserted into the remote processing queue and the job_id value is obtained.
  • The remoteRun is set to state = 'run', and the job_id value updated.
  • A job poll is performed (setting last_poll), and the script either terminates (allowing a future poll to check), or sleeps until the poll determines the job has completed. A future poll will skip the previous steps and return to this point.
  • The remote job status completes, and the resulting files are pulled back from the remote site, including any permanent files (log, stats, SMF, etc), as well as a job_summary compiled at the end of the command script.
  • If the remote job has completed successfully, or if the job_policy is set to "ignore", the local database is updated to reflect that the stageRun has completed successfully, and the state is updated to 'full'. If the job_policy is not set to ignore (job errors are fatal), the local database is updated with information about the jobs that have completed successfully, and the jobs that have failed are marked as such. The remoteRun is set to have a fault != 0 and state = 'error'. This will then require manual intervention to update the database and resolve the failed jobs.

At this point, a master control process will have to step in and launch new remoteRun entries for the next step. Upon adding this, the chip->remoteRun operation is likely an operation of this master control program operating on a label basis.

Tasks/scripts/tools

  • Tasks
    • remote.pro
      • remote.prep.chip
      • remote.prep.cam
      • remote.prep.warp
      • remote.prep.stack
      • remote.exec
      • remote.poll
      • loading tasks
  • Scripts
    • sc_prepare_chip.pl
    • sc_prepare_cam.pl
    • sc_prepare_warp.pl
    • sc_prepare_stack.pl
    • sc_validate_files.pl
    • sc_remote_exec.pl
    • sc_validate_processing.pl
  • Tools
    • remotetool

Details

I've censored certain details in the description below both to prevent issues with security and to make things somewhat easier to understand. A description of each detail parameter is listed in the table below. For single permanent values, I've used all caps, and for example parameters, I've used the @TEMP_FILE@ convention.

DMZ_HOST Public internet connected computer that shields the computing nodes.
SEC_HOST Secure internal front end to the computing nodes. Located beyond DMZ_HOST.
DMZ_DN Full domain name for the DMZ_HOST
USERNAME Cluster username
IPP_PATH Path to IPP build

Logging in

To reduce the number of login/password authorizations, it's recommended to use the ControlMaster ssh parameters. This requires only the first connection to be validated, as subsequent connections are passed through the already validated connection. The .ssh/config options to implement this are:

Host DMZ_HOST
  User          USERNAME
  Hostname      DMZ_DN
  ForwardX11    yes
  ForwardAgent  yes
  RequestTTY    force
  ControlMaster auto
  ControlPath   ~/.ssh/connections/%h_%p_%r

The ~/.ssh/connections/ directory needs to be created if it does not already exist. With this configuration in place, you can log in to the computing front end with ssh -t DMZ_HOST ssh SEC_HOST. The DMZ_HOST does not support commands other than ssh and scp, and is used only to bounce the connection to SEC_HOST.

Data transfer

As the DMZ_HOST does not have available disk space, data transfers need to write directly to the SEC_HOST disks. The following commands illustrate how to use scp to transfer files to and from SEC_HOST:

scp @LOCAL_FILE@ DMZ_HOST:SEC_HOST:/@PATH_TO_DESTINATION@/
scp DMZ_HOST:SEC_HOST:/@PATH_TO_FILE@/@REMOTE_FILE@ @LOCAL_DESTINATION@

Data transfer connections need to be established from outside, as SEC_HOST cannot see the public internet.

SEC_HOST storage locations

There are four main storage locations.

homedir Small, and should not be used for anything that can be placed elsewhere.
/usr/projects/ps1/ This should be the default storage location for non-transient data products. It should eventually have more disk space than it currently reports (20GB).
/scratch/USERNAME/ A very large single volume that is automatically cleaned on timescales of weeks.
/scratch3/USERNAME/ Same.

IPP build

Due to the data transfer limitations, svn co will not work on SEC_HOST. A copy of the code needs to be pushed into place. However, a slightly old (r35169 2013-02-14) version of IPP is available, which I was able to get to work simply by adding

alias psconfig "source IPP_PATH/psconfig.csh"
psconfig default

in my .cshrc.

IPP Running

There is no database running on SEC_HOST, which impacts how the IPP can be run. The ippScripts assume a database exists that can be probed to determine what should be executed. In addition, the detrend information is stored in the database, which means processing raw images will need to have detrends specified manually.

Despite this issue, I was able to test that the IPP build does work. Running ppImage

> ppImage -file o5303g0240o.ota67.fits test.out -recipe PPIMAGE CHIP -Db PHOTOM F -Db DARK F -Db FLAT F -Db NOISEMAP F -threads 1 -trace config 89 -Db MASK F -Db NONLIN F -log test.log -tracedest test.trace
Number of leaks to display: 500

produced all expected outputs from the recipe (taking into account the command line option changes). For this test, I disabled the detrending manually to avoid having to specify a series of -mask/-dark/-flat options.

Parallelization

The supercomputing resources are managed by Moab, for which I've been using this page as documentation. Briefly, a job script is constructed that outlines the resources required for completion, and this is submitted to the moab scheduler. The resources are allocated, the job is run, and the job terminates. This is significantly different than the standard IPP scheduling system, which assumes full access to the computing hardware.

stask

The example stask scripts seem to provide a path around this limitation. From my reading of the scripts:

  1. A call to the stask.py python script is defined in the stask shell script. This script also defines a task list. stask is then submitted to moab as a job.
  2. The moab job constructs a list of nodes allocated to the job.
  3. The moab job passes this node list and the task list to the python script.
  4. The python script (running under moab) connects to the individual node via ssh, changes to the appropriate workdir, and runs the shell script mult.sh with the task parameters.
  5. The shell script (under python as a moab job, on the individual node) runs the requested job under the GNU parallel framework.
  6. Python completes when the shell scripts are finished on all nodes, clearing the moab job to complete and release the resources requested.

As of 2013-12-18, I have not run an IPP command through these stask scripts.

comments by EAM

Issues which we need to consider for large-scale processing on this cluster:

  • lack of a database
    • how do we provide the right detrends?
    • how do we synchronize results with IfA / gpc1?
    • we cannot use the database for job sequencing (gpc1)
  • interacting with Moab
    • can we use stask.py to run under pantasks or equivalent?
    • can we use pantasks to talk directly to Moab? (note that Serge wrote a pantasks backend to talk to condor as a drop-in replacement for pcontrol; the same could potentially be done for Moab)

The primary problem is that, without a database, we cannot coordinate our operations as we currently do. Even if pantasks could talk to Moab, without a database, we cannot sequence and schedule jobs. This precludes the use of a straight build of IPP locally to run the full system.

In any arrangement, we will certainly want to ship all of our detrend data and the reference photometry / astrometry database up front. We will have to define a way to automatically generate the commands and to retrieve results for some chunk of the processing, including the info that needs to be pushed into the gpc1 database.

Some possible ways of handling the interaction:

  1. treat the remote cluster as a set of nodes and use our pantasks (talking directly to gpc1) to send the jobs. This would be a very fine-grained integration with the current IPP processing. It would require any job (say chip) to generate a command and a bundle with the data and the pointers to the appropriate local references (detrends, etc). Then the bundle would be shipped to the remote cluster. When the job is done, that fact needs to be carried back to the local pantasks, and the results retrieved.
  2. identify a complete sequence (say, chip, cam, warp) and send that as a bundle. This sounds easier than #1 above but is in fact equivalent in terms of the need for automatic generation of the bundle/command, shipping the data, and discovering the completion.
  3. choose a large sequence (eg, all chip processing and all downstream processing to stack a complete projection cell). this could be done with less automation at least at first, though there are still ~2k projection cells per filter, so some automation would be needed eventually.

It seems to me that any solution is going to require us to automatically ship data on some scale to LANL, automatically send a job to Moab, and automatically capture the result. We should start on achieving those goals in a generic way rather than worrying about running IPP jobs specifically up front.

2013-12-19 Telecon

  • Storage.
    • Other local storage exists there, although it is significantly slower than transferring even from Hawaii.
    • The temporary scratch disk space is cleared automatically, although there is some leeway. Files that are accessed (or updated?) have their retention time extended. There are limits to this extension, so manually gaming the system is undesirable.
  • Network.
    • SVN access may be possible with exceptions in the firewall. These exceptions will be required to point to only a small number of IP addresses.
    • Data transfer can be done through an intermediate node, although this is limited to only 20TB of temp space. This suggests pushing (as listed above) or having a firewall hole to pull data through is a more workable solution.
    • HPN-SSH exists, and can be used to speed up scp transfers (I'm not fully clear on the details, but the claim is up to 10x speed improvement).
  • We do have an account for moab processing, although by default our jobs should be "charged" to the correct one. The allocation is sufficiently large that this isn't much of a concern.
  • Bundling tasks.
    • Due to the lack of database at the computing center, it's likely that we'll need to bundle a command list here, transfer that list along with the associated data, and then retrieve results.
    • Doing this with the SAS is probably the most appropriate first test.

Bundling levels

  • C-level. This would require script options such that, instead of executing the final command (ppImage/psastro/pswarp/etc), the commands are written to an output file and the script terminates. As we do database updates after that command, we would also need to write out a set of completion commands to update the database with the result.
    • This requires significant changes in the current scripts.
    • It may be difficult to keep the database synchronized at all. Once a task has been sent to the supercomputing cluster, we need a way to flag the job as "executing" so we don't launch a second run. The alternative is that the scripts will simply wait until the job is done.
    • Keeping all the database interactions in Hawaii, and just letting the computational part execute in the supercomputer gives a solution to the detrend issue. We can construct the command to manually specify which detrend to use. Gene had the idea to fold this determination into chiptool, such that the returned metadata includes the detrends that should be used.
  • Script level. This would send entire perl script level commands to the computing cluster (chip_imfile.pl/etc), with the database interaction disabled (--no-update).
    • Although this shouldn't require major changes to the scripts, we still need something to check the results and issue the proper database commands in Hawaii.
    • This also prevents the detrend fix from above working. In fact, it would require some sort of script change to get the construction of the final command done correctly. Currently that's handled by passing only the stage_id to the script, which then asks the database for information about the object described by the stage_id.
  • Stage level. An entire stage worth of final commands is constructed for a given label.
    • This solves the synchronization issue of the C-level by knowing that we'll only queue the commands once.
    • Allows for a scan of the results before generating the commands for the next stage.
    • Requires a large amount of data to be synchronized at each step, although this could be rolled over once we understand the data transfer vs. computation times.
  • Project level. All stages are queued at the same time, with some logic used in moab to determine which jobs are dependent on others.
    • Even larger synchronization issues.
    • Requires some
    • Many stages take metadata objects as inputs. If some components fail permanently (i.e., single bad chip failures, poorly populated skycells), we need to not include them in the input metadata. This prevents all metadata from being pre-calculated.

2014-01-15

I've started playing around with running jobs with the MOAB scheduler, and after being completely confused by the msub/srun split, I think I've sorted out more details on how to submit jobs.

My first test was to submit a ppImage job to the Moab scheduler. Using the command msub -v ppI_test.cmd where that command script contains:

#!/bin/tcsh
##### Moab control lines
#MSUB -l nodes=1:ppn=4,walltime=1:00:00
#MSUB -j oe
#MSUB -V
#MSUB -o /scratch3/watersc1/job.out

##### Commands
date
ppImage -file /scratch3/watersc1/o5303g0240o.ota67.fits /scratch3/watersc1/ppI_test -recipe PPIMAGE CHIP -Db PHOTOM F -Db DARK F -Db FLAT F -Db NOISEMAP F -Db MASK F -Db NONLIN F -threads 4
date

This submitted the job and executed it, saving the stderr/stdout information in the job.out file specified. This runs all the commands contained sequentially, meaning that there is no parallelization internal to the task, and all allocated resources can be used by each command in the task.

The second test used srun as the main command, allowing multiple jobs to be run in parallel, sharing the allocated resources: msub -V ppI_srun_test.cmd.

#!/bin/tcsh
##### Moab control lines
#MSUB -l nodes=2:ppn=4,walltime=0:10:00
#MSUB -j oe
#MSUB -V
#MSUB -o /scratch3/watersc1/job_srun.out

##### Commands
date
srun -n2 -l --multi-prog ppI_srun_test.config
date

The config file for srun:

# ppI double run test config file
# srun -n2 -l --multi-prog ppI_srun_test.config

0     ppImage -file /scratch3/watersc1/o5303g0240o.ota67.fits /scratch3/watersc1/ppI_srun_test_0 -recipe PPIMAGE CHIP -Db PHOTOM F -Db DARK F -Db FLAT F -Db NOISEMAP F -Db MASK F -Db NONLIN F -threads 1
1     ppImage -file /scratch3/watersc1/o5303g0240o.ota67.fits /scratch3/watersc1/ppI_srun_test_1 -recipe PPIMAGE CHIP -Db PHOTOM F -Db DARK F -Db FLAT F -Db NOISEMAP F -Db MASK F -Db NONLIN F -threads 1

I'm still looking into tracking command exit-code status, but srun provides a -K option that terminates the job step if any command exits with a non-zero exit code.

Forcing an error by having command 1 look for a non-existent input fits file produces the following line in the output file:

srun: error: @HOST@: task 1: Exited with exit code 3
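To make such a failure fatal to the whole job step, the same test could presumably be rerun with the -K option (same config file as before):

  srun -n2 -K -l --multi-prog ppI_srun_test.config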
