I have a couple of Julia programs running on a cluster with 64 processors.
I want to know why, when I launch a Julia program like:
$ julia main.jl
when I look at the output of htop, I see 15 processes like this:
PID USER PRI NI VIRT RES SHR S CPU MEM TIME COMMAND
21389 me 20 0 845M 413 12692 S 0.0 0.1 TIME-ELAPSED julia main.jl
Is this something intrinsic to Julia that it does to optimize the main.jl script?
It sounds like your question is why the julia processes all end up on one node instead of across your cluster?
I think you are looking for the ClusterManager features: https://docs.julialang.org/en/v1/stdlib/Distributed/index.html
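For example, a minimal sketch (assuming main.jl uses the Distributed standard library): you can start Julia with extra worker processes locally via -p, or across nodes via --machine-file (the machines file name here is just a placeholder listing your hosts):
$ julia -p 4 main.jl                      # main process plus 4 local workers
$ julia --machine-file machines main.jl   # workers on the hosts listed in "machines"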
Related
How would I configure an MPI program so that the total number of processors is split into M groups?
E.g., using a command such as:
mpirun -np 4 ./a.out
In a file tree that looks like the following, where all *.out files are configured to run using MPI:
a.out
b.out
c.out
And within the program:
Processors 0 and 1 -> ./b.out
Processors 2 and 3 -> ./c.out
Check this out.
https://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/
Thread affinity can help you. Basically assign certain ranks to certain cores.
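As a hedged illustration of that idea with Open MPI (not taken from the linked article), you can have mpirun bind each rank to a core and print the resulting bindings:
$ mpirun -np 4 --bind-to core --report-bindings ./a.out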
I am using Intel MPI and have encountered some confusing behavior when using mpirun in conjunction with slurm.
If I run (in a login node)
mpirun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"
then I get as output the expected 0 and 1 printed out.
If however I salloc --time=30 --nodes=1 and run the same mpirun from the interactive compute node, I get two 0s printed out instead of the expected 0 and 1.
Then, if I change -n 2 to -n 3 (still in compute node), I get a large error from slurm saying srun: error: PMK_KVS_Barrier task count inconsistent (2 != 1) (plus a load of other stuff), but I am not sure how to explain this either...
Now, based on this OpenMPI page, it seems these kinds of operations should be supported, at least for Open MPI:
Specifically, you can launch Open MPI's mpirun in an interactive SLURM allocation (via the salloc command) or you can submit a script to SLURM (via the sbatch command), or you can "directly" launch MPI executables via srun.
Maybe the Intel MPI implementation I was using just doesn't have the same support and is not designed to be used directly in a slurm environment(?). But I am still wondering: what is it about the way mpirun and slurm (salloc) interact that produces this behavior? Why does it print two 0s in the first case, and what are the inconsistent task counts it complains about in the second?
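For reference, the three launch modes that Open MPI page describes look roughly like this (job.sh is a placeholder batch script that itself calls mpirun):
$ salloc --time=30 --nodes=1   # interactive allocation; run mpirun inside it
$ sbatch job.sh                # batch script that calls mpirun
$ srun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"   # "direct" launch, no mpirun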
I'm testing a simple MPI program on my desktop (Ubuntu LTS 16.04 / Intel® Core™ i3-6100U CPU @ 2.30GHz × 4 / gcc 4.8.5 / OpenMPI 3.0.0) and mpirun won't let me use all of the cores on my machine (4). When I run:
$ mpirun -n 4 ./test2
I get the following error:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
./test2
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
But if I run with:
$ mpirun -n 2 ./test2
everything works fine.
I've seen from other answers that I can check the number of processors with
cat /proc/cpuinfo | grep processor | wc -l
and this tells me that I have 4 processors. I'm not interested in oversubscribing, I'd just like to be able to use all my processors. Can anyone help?
Your processor has 4 hyperthreads but only 2 cores (see the specs here).
By default, Open MPI does not run more than one MPI task per core.
You can have Open MPI run up to one MPI task per hyperthread with the following option
mpirun --use-hwthread-cpus ...
FWIW
The command you mentioned reports the number of hyperthreads.
A better way to figure out the topology of a machine is via the lstopo command from the hwloc package.
MPI tasks are not bound to cores or hyperthreads on OS X, so if you are running on a Mac, --oversubscribe -np 4 would lead to the same result.
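A quick way to see both views on the machine from the question (a rough sketch, assuming hwloc and Open MPI are installed):
$ lstopo                                    # should show 2 cores, each with 2 hardware threads
$ mpirun --use-hwthread-cpus -np 4 ./test2  # one rank per hardware thread
$ mpirun --oversubscribe -np 4 ./test2      # or explicitly allow more ranks than detected slots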
To resolve your problem, you can use the --use-hwthread-cpus command line argument for mpirun, as already pointed out by Gilles Gouaillardet. In this case, Open MPI treats a thread provided by hyperthreading as an Open MPI processor; otherwise, it treats a CPU core as an Open MPI processor, which is the default behavior.
When using --use-hwthread-cpus, Open MPI correctly determines the total number of processors available to you, that is, all processors available on all hosts specified in the Open MPI host file, so you do not need to specify the -n parameter. With this command line parameter, Open MPI refers to the threads provided by hyperthreading as "hardware threads".
With this technique you will not oversubscribe, and if an Open MPI processor runs on a virtual machine, it will use the correct number of threads assigned to that virtual machine. And if your processor has more than two threads per core, as on a Xeon Phi (Knights Mill, Knights Landing, etc.), it will take all of those threads per core as Open MPI processors.
Use $ lscpu. The number of cores per socket multiplied by the number of sockets gives you the number of physical cores (the ones you can use for MPI), whereas cores per socket * sockets * threads per core gives you the number of logical cores (the count you get from $ cat /proc/cpuinfo | grep processor | wc -l).
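On the i3-6100U from the question, that would look roughly like this (illustrative output, not captured from a real machine):
$ lscpu | grep -E 'Thread|Core|Socket'
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
# physical cores = 2 * 1 = 2, logical cores = 2 * 1 * 2 = 4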
Using RHEL7.3
Using R 3.3.2
Installed Rmpi_0.6-6.tar.gz and doMPI_0.2.1.tar.gz
Installed mpich-3.0-3.0.4-10.el7 RPM for x86_64
I created a cluster of three machines (aml1,2,3). I can run the /examples/cpi example from the mpich installation and the processes run without issue on all three machines.
I can also run an R script that needs to be run multiple times, as discussed in the doMPI documentation -- so that script runs across the whole cluster.
My problem is when my R script has code prior to the %dopar% that needs to run once on the master (aml1), with the %dopar% part running on the cluster (aml2, aml3). In that case everything runs only on the master, doMPI reports Size of MPI universe: 0, and it doesn't recognize aml2 or aml3.
For example:
Run: mpirun -np 1 --hostfile ~/projects/hosts R --no-save -q < example6.R
(and my ~/projects/hosts file is defined to use 8 cores)
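For illustration, a hosts file along these lines (Open MPI style, 8 slots total across the three machines; the exact split here is just an example):
aml1 slots=2
aml2 slots=3
aml3 slots=3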
example6.R:
library(doMPI) #load doMPI library
cl <- startMPIcluster(verbose=TRUE)
#load data
#clean data
#perform some functions
#let's say I want to have this done in the script and only parallelize this
x <- foreach(seed=c(7, 11, 13), .combine="cbind") %dopar% {
set.seed(seed)
rnorm(3)
}
x
closeCluster(cl)
Output of example6.R:
Master processor name: aml1; nodename: aml1
Size of MPI universe: 0
Spawning 2 workers using the command:
/usr/lib64/R/bin/Rscript /usr/lib64/R/library/doMPI/RMPIworker.R WORKDIR=/home/spark LOGDIR=/home/spark MAXCORES=1 COMM=3 INTERCOMM=4 MTAG=10 WTAG=11 INCLUDEMASTER=TRUE BCAST=TRUE VERBOSE=TRUE
2 slaves are spawned successfully. 0 failed.
If I define cl <- startMPIcluster(count=34, verbose=TRUE) I still get the following but at least I can run 34 slaves:
Master processor name: aml1; nodename: aml1
Size of MPI universe: 0
34 slaves are spawned successfully. 0 failed.
How can I troubleshoot this? I would like to run the R script so it runs the first portion once on the master, and then do %dopar% on the cluster.
Thanks!!
Update 1
Since the last update, I tried running an older version of OpenMPI:
[spark@aml1 ~]$ which mpirun
/opt/openmpi-1.8.8/bin/mpirun
Per @SteveWeston, I created the following script and ran it:
[spark@aml1 ~]$ cat sanity_check.R
library(Rmpi)
print(mpi.comm.rank(0))
mpi.quit()
With the following output:
[spark@aml1 ~]$ mpirun -np 3 --hostfile ~/projects/hosts R --slave -f sanity_check.R
FIPS mode initialized
master (rank 0, comm 1) of size 3 is running on: aml1
slave1 (rank 1, comm 1) of size 3 is running on: aml1
slave2 (rank 2, comm 1) of size 3 is running on: aml1
[1] 0
Here it just hangs -- and nothing happens.
I've already accepted @SteveWeston's answer, as it helped me better understand my original question.
I commented on his answer that I was still having issues with my R script hanging; the script would run, but it would never finish on its own or close its own cluster, and I would have to kill it with Ctrl-C.
I ultimately set up an NFS environment, built and installed openmpi-1.10.5 there, and installed my R libraries there as well. R is installed separately on both machines, but they share the same library in my NFS directory. Previously I had installed and managed everything under root, including the R libraries (I know). I'm not sure if this is what caused the complications, but my issues seem to be resolved.
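A rough sketch of that setup, with placeholder paths for the NFS share (the Rmpi configure flags point the build at the Open MPI living in that share):
$ ./configure --prefix=/nfsshare/openmpi-1.10.5 && make && make install   # from the openmpi-1.10.5 source tree
$ export PATH=/nfsshare/openmpi-1.10.5/bin:$PATH
$ R CMD INSTALL Rmpi_0.6-6.tar.gz --configure-args="--with-Rmpi-type=OPENMPI \
    --with-Rmpi-libpath=/nfsshare/openmpi-1.10.5/lib --with-Rmpi-include=/nfsshare/openmpi-1.10.5/include"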
[master@aml1 nfsshare]$ cat sanity_check.R
library(Rmpi)
print(mpi.comm.rank(0))
mpi.quit(save="no")
[master@aml1 nfsshare]$ mpirun -np 3 --hostfile hosts R --slave -f sanity_check.R
FIPS mode initialized
[1] 1
[1] 0
[1] 2
# no need to ctrl-C here. It no longer hangs
I am trying to run a compiled program that is supposed to run on multiple processors. But with the same data, sometimes this program runs in parallel and sometimes it won't (with an identical PBS script file!). I suspect that something is wrong with some of the compute nodes that won't let it run in parallel (I don't get to choose which compute node I get). How can I troubleshoot whether this is a bug in the program or a problem with the compute node?
As per the sysadmin's advice, I am using ulimit -s 100000, but this doesn't change anything. Also, this program is not an MPI program (it runs only on a single node, with multiple processors).
The code that I run is as follows:
quorum_error_correct_reads -q 68 \
--contaminant=/data004/software/GIF/packages/masurca/2.3.0rc1/bin/../share/adapter.jf \
-m 1 -s 1 -g 1 -a 3 --thread=32 -w 10 -e 3 \
quorum_mer_db.jf aa.renamed.fastq ab.renamed.fastq ac.renamed.fastq ad.renamed.fastq ae.renamed.fastq af.renamed.fastq ag.renamed.fastq \
--no-discard -o pe.cor --verbose
Thanks for any advice you can offer. I will greatly appreciate your help!
PS: I don't have sudo access.
EDIT: I know it is supposed to be using multiple processors because, when I SSH into the node and do top -c, I can see the above command sometimes running at around 3200% CPU the whole time and sometimes at only 100% CPU the whole time. This is the only step involved and there are no other sub-processes within this program. Also, I am using an HPC system, where I submit the job to a compute node, each with 32 procs and 512GB RAM.
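One way to extend that check, sketched under the assumption that the node itself is suspect: while the job is running, compare what the node offers with what the process is actually allowed to use (PID is the process id shown by top):
$ nproc                                        # CPUs visible on the node
$ taskset -cp <PID>                            # CPU affinity of the running process
$ grep Cpus_allowed_list /proc/<PID>/status    # same information from /proc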