Torque/OpenMPI dynamically allocate nodes based on number of processors - mpi

I was wondering if Torque is smart enough to assign the correct number of nodes based on how many mpi cores you request. For our cluster, we have heterogenous nodes and it can be quite wasteful to just put the number of nodes you want and processors per node. So I was wondering if you could just do something like this
qsub -I -l procs:1000
mpiexec -n 1000 mympijob
However, torque only allocates one node with this command (as I didn't specify a # of nodes). Is there a way the correct number of nodes based on my number of procs so it can be maximally efficient?
Sidebar - We are probably switching to SLURM soon, is this well within the capabilities?

Typically, what we do after the resources are allocated is not something that the scheduler can control.
In this case,
mpirun/mpiexec -n 1000
gets executed after the resources are allocated by the schduler.
The best way to go forward is to use the environment variables set by the scheduler
$MPI_HOSTS
as the value passed through the switch -n.
example:
mpirun $MPI_HOSTS <your program of choice>
You can request the number of cores that you want by adding the ppn argument to nodes.
qsub -l nodes=2:ppn=16
This allocates 32 cores, in two nodes.

Related

Julia parallel processing on PBS multiple nodes

I am looking for a way to run simple parallel processes (one function run multiple times with different arguments, no communication between process) across multiple nodes in a PBS cluster.
Currently I am able to run it on a single node setting the number of threads with an environment variable in the PBS script, and using a for loop with #thread.threads
I have found references to clustermanager.jl, but no clear working example on how to use it on PBS.
For example: does addprocs_pbs in the file take care also of the script part, or do I still need to run a pbs script as usual, and this function is called inside the julia file?
This is the code structure I am using now. Ideally, it would stay more or less the same but parallel process could run across multiple nodes.
using JLD
include("path/to/library/with/function.jl")
seed = 342;
n = 18; # number of simulations
changing_parameter = [1,2,3,4];
input_file = "some file"
CSV.read(string(input_files_folder,input_file));
# I should also parallelise this external for loop
# it currently runs 18 simulations per run, and saves the results each time
for P in changing_parameter
Random.seed!(seed);
seeds = rand(1:100000,n)
results = []
Threads.#threads for i = 1:n
push!(results,function(some_fixed_parameters, P=P, seed=seeds[i]);)
end
# get the results
# save the results
JLD.save(filename,to_save,compress=true)
end
For distributed computing you normally need to use multiprocessing rather than multi-threading (although it is OK to have multi-threaded parallel processes if you need).
Hence, what you need to do is to use the ClustersManagers library to use the cluster manager to allocate processes for your Julia cluster.
I have been using Julia with Cray clusters using SLURM so not exactly PBS, however I since your question remain unanswered here is my working code. You will use addprocs_pbs that looks to have a very similar structure.
using ClusterManagers
addprocs_slurm(36,job_name="jobname", account="some_acc_name", time="01:00:00", exename="/lustre/tetyda/home/pszufe/julia/usr/bin/julia")
Once you add the worker processes all what remains is to use the Distributed package to orchestrate your workload.

Is hierachical parallelism possible with MPI libraries?

I'm writing a computational code with MPI. I have a few parts of the software each compute different part of the problem. Each part is written with MPI thus could be run as an independent module. Now I want to combine these parts to be run together within one program, and all parts of the code run in parallel while each part itself is also running in parallel.
e.g. Total number of nodes = 10, part1 run with 6 nodes and part 2 run with 4 nodes and both running together.
Is there ways that I can mpirun with 10 nodes and mpi_init each part with desired number of node without rewritten the overall program to allocate process for each part of code?
This is not straightforward.
One option is to use an external program that with MPI_Comm_spawn() (twice) your sub-programs. The drawback is this requires one slot.
An other option needs some rewriting, since all the tasks will end up in the same MPI_COMM_WORLD, it is up to them to MPI_Comm_split() based on who they are, and use the resulting communicator instead of MPI_COMM_WORLD.

Foreach in R: optimise RAM & CPU use by sorting tasks (objects)?

I have ~200 .Rds datasets that I perform various operations on (different scripts) in a pipeline (of multiple scripts). In most of these scripts I've begun with a for loop and upgraded to a foreach. My problem is that the dataset objects are different sizes (x axis is size in mb):
so if I optimise core number usage (I have a 12core 16gbRAM machine at the office and a 16core 32gbRAM machine at home), it'll whip through the first 90 without incident, but then larger files bunch up and max out the total RAM allocation (remember Rds files are compressed so these are larger in RAM than on disk, but the variability in file size at least gives an indication of the problem). This causes workers to crash and typically leaves me with 1 to 3 cores running through the remainder of the big files (using .errorhandling = "pass"). I'm thinking it would be great to optimise the core number based on number and RAM size of workers, and total available RAM, and figured others might have been in a similar dilemma and developed strategies to address this. Some approaches I've thought of but not tried:
Approach 1: first loop or list through the files on disk, potentially by opening & closing them, use object.size() to get their sizes in RAM, sort largest to smallest, cut halfway, reverse the order of the second half, and intersperse them: smallest, biggest, 2nd smallest, 2nd biggest, etc. 2 workers (or any even numbered multiple) should therefore be working on the 'mean' RAM usage. However: worker 1 will finish its job faster than any other job in the stack and then go onto job 3, the 2nd smallest, likely finish that really quickly also then do job 4, the second largest, while worker 2 is still on the largest, meaning that by job 4, this approach has the machine processing the 2 largest RAM objects concurrently, the opposite of what we want.
Approach 2: sort objects by size-in-RAM for each object, small to large. Starting from object 1, iteratively add subsequent objects' RAM usage until total RAM core number is exceeded. Foreach on that batch. Repeat. This would work but requires some convoluted coding (probably a for loop wrapper around the foreach which passes the foreach its task list each time?). Also if there are a lot of tasks which won't exceed the RAM (per my example), the cores limit batching process will mean all 12 or 16 have to complete before the next 12 or 16 are started, introducing inefficiency.
Approach 3: sort small-large per 2. Run foreach with all cores. This will churn through the small ones maximally efficiently until the tasks get bigger, at which point workers will start to crash, reducing the number of workers sharing the RAM and thus increasing the chance the remaining workers can continue. Conceptually this will mean cores-1 tasks fail and need to be re-run, but the code is easy and should work fast. I already have code that checks the output directory and removes tasks from the jobs list if they've already been completed, which means I could just re-run this approach, however I should anticipate further losses and therefore reruns required unless I lower the cores number.
Approach 4: as 3 but somehow close the worker (reduce core number) BEFORE the task is assigned, meaning the task doesn't have to trigger a RAM overrun and fail in order to reduce worker count. This would also mean no having to restart RStudio.
Approach 5: ideally there would be some intelligent queueing system in foreach that would do this all for me but beggars can't be choosers! Conceptually this would be similar to 4, above: for each worker, don't start the next task until there's sufficient RAM available.
Any thoughts appreciated from folks who've run into similar issues. Cheers!
I've thought a bit about this too.
My problem is a bit different, I don't have any crash but more some slowdowns due to swapping when not enough RAM.
Things that may work:
randomize the iterations so that it is approximately evenly distributed (without needing to know the timings in advance)
similar to approach 5, have some barriers (waiting of some workers with a while loop and Sys.sleep()) while not enough memory (e.g. determined via package {memuse}).
Things I do in practice:
always store the results of iterations in foreach loops and test if already computed (RDS file already exists)
skip some iterations if needed
rerun the "intensive" iterations using less cores

Assigning N cores per MPI rank

I'm trying to assign N cores per MPI rank. I'm running an application with 256 MPI ranks, and I would like to assign 16 cores per MPI rank.
The solutions I found are useful but want me to use a rankfile and this gets tedious after a certain number of rank-core relationship.
Is there a better way to do it?
With Open MPI, you can
mpirun --map-by node:PE=16 ...

Reusing FFTW wisdom on clusters

I'm running distributed MPI programs on clusters using multiple nodes, where I make use of the MPI FFT's of FFTW. To save time I reuse wisdom from one run to the next. To generate this wisdom, FFTW experiments with a lot of different algorithms and what not for the given problem. I am worried that because I am working on a cluster, the best solution stored as wisdom for one set of CPUs/nodes may not be the best solution for some other set of CPUs/nodes performing the same task, and so I should not reuse wisdom unless I am running on exactly the same CPUs/nodes as the run where the wisdom was gathered.
Is this correct, or is the wisdom somehow completely indifferent to the physical hardware on which it is generated?
If your cluster is homogeneous, the saved fftw plans likely make sense, though the the way the processes are connected may affect optimal plans for mpi-related operations. But if your cluster is not homogeneous, saving the fftw plan can be suboptimal and problem related to load balance could proove hard to solve.
Taking a look at wisdom files produced by fftw and fftw_mpi for a 2D c2c transform, I can see additionnal lines likely related to phases like transposition where mpi communications are required, such as:
(fftw_mpi_transpose_pairwise_register 0 #x1040 #x1040 #x0 #x394c59f5 #xf7d5729e #xe8cf4383 #xce624769)
Indeed, there are different algorithms for transposing the 2D (or 3D) array: in the folder mpi of the source of fftw, files transpose-pairwise.c, transpose-alltoall.c and transpose-recurse.c implement these algorithms. As flags FFTW_MEASURE or FFTW_EXHAUSTIVE are set, these algorithms are run to select the fastest, as stated here. The result might depend on the topology of the network of processes (how many processes on each node? How these nodes are connected?). If the optimal plan depends on where the processes are running and on the network topology, using the wisdom utility will not be decisive. Otherwise, using the wisdom feature can save some time as the plan is built.
To test whether the optimal plan changed, you can perform a couple of runs and save the resulting plan in files: a reproductibility test!
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
if (rank == 0) fftw_export_wisdom_to_filename("wisdommpi.txt");
/* save the plan on each process ! Depending on the file system of the cluster, performing communications can be required */
char filename[42];
sprintf(filename, "wisdom%d.txt",rank);
fftw_export_wisdom_to_filename(filename);
Finally, to compare the produced wisdom files, try in a bash script:
for filename in wis*.txt; do
for filename2 in wis*.txt; do
echo "."
if grep -Fqvf "$filename" "$filename2"; then
echo "$filename"
echo "$filename2"
echo $"There are lines in file1 that don’t occur in file2."
fi
done
done
This script check that all lines in files are also present in the other files, following Check if all lines from one file are present somewhere in another file
On my personal computer, using mpirun -np 4 main, all wisdom files are identical except for a permutation of lines.
If the files are different from one run to another, it could be attributed to the communication pattern between processes... or sequential performance of dft for each process. The piece of code above save the optimal plan for each process. If lines related to sequential operations, without fftw_mpi in it, such as:
(fftw_codelet_n1fv_10_sse2 0 #x1440 #x1440 #x0 #xa9be7eee #x53354c26 #xc32b0044 #xb92f3bfd)
become different, it is a clue that the optimal sequential algorithm changes from one process to the other. In this case, the wall clock time of the sequential operations may also differ from one process to another. Hence, checking the load balance between processes could be instructive. As noticed in the documentation of FFTW about load balance:
Load balancing is especially difficult when you are parallelizing over heterogeneous machines; ... FFTW does not deal with this problem, however—it assumes that your processes run on hardware of comparable speed, and that the goal is therefore to divide the problem as equally as possible.
This assumption is consistent with the operation performed by fftw_mpi_gather_wisdom();
(If the plans created for the same problem by different processes are not the same, fftw_mpi_gather_wisdom will arbitrarily choose one of the plans.) Both of these functions may result in suboptimal plans for different processes if the processes are running on non-identical hardware...
The transpose operation in 2D and 3D fft requires a lot a communications: one of the implementation is a call to MPI_Alltoall involving almost the whole array. Hence, a good connectivity between nodes (infiniband...) can proove useful.
Let us know if you found different optimal plans from one run to another and how these plans differ!

Resources