I am trying to build a cluster to parallelize a R function. I use Open MPI in Ubuntu and R packages (Rmpi and Snow). The test code I am running is:
cl <- makeMPIcluster(8)
fun <- function(){
Sys.info()[c("nodename")]
}
clusterCall(cl,fun)
stopCluster(cl)
mpi.quit()
The command is:
mpirun -H localhost,node2 -n 1 R --slave -f testSnowMPI.R
The problem is that all the returns are the local hostname. I observed the process in localhost and node2 as well and the process seems to start (4 in localhost and 4 in node2) but quickly the slaves processes at node2 stop and everything is executed in localhost.
I did another test having different scripts (testSnowMPI.R) for each node and when I changed the parameter -n 1 to -n 2 I had different returns as expected but both scripts were executed by the localhost.
Another interesting test is when I make the mpirun command in localhost but I just set node2 host for the execution (-H node2). The answer I have is All nodes which are allocated for this job are already filled.
I can ping localhost from node2 and node2 from localhost. And I already have set the ssh connection without passphrase.
It seems like the processes in node2 start normally but it can not write the return to the master and then localhost do all the work.
I did the same tests having node2 as the localhost and the behaviour was exactly the same.
Do you have any idea about the weird behaviour of these tests?
EDIT
I did some tests using only Rmpi functions (without Snow functions). I wrote this script:
mpi.spawn.Rslaves()
mpi.close.Rslaves()
The command was:
mpirun -H localhost,node2,node2 -n 1 R --slave -f testSnowMPI.R
And I got this output:
master (rank 0, comm 1) of size 3 is running on: node1
slave1 (rank 1, comm 1) of size 3 is running on: node1
slave2 (rank 2, comm 1) of size 3 is running on: node1
(node1 is the localhost)
Related
I am trying to test a script I have developed locally on an interactive HPC node, and I keep running in this strange issue that mclapply works only on a single core. I see several R processes spawned in htop (as many as the number of the cores), but they all occupy only one core.
Here is how I obtain the interactive node:
srun -n 16 -N 1 -t 5 --pty bash -il
Is there a setting I am missing? How can I make this work? What can I check?
P.S. I just tested and the other programs that rely on forking to do parallel processing (say pigz) are afflicted by the same issue as well. Those that rely on MPI and messaging work properly, it seems.
Yes, you are missing a setting. Try:
srun -N 1 -n 1 -c 16 -t 5 --pty bash -il
The problem is that you are running the parallel commands within a bash shell that is allocated on a single core, so the bash process is spawned on only one of the cores requested by srun.
Otherwise, you can first allocate your resources using salloc and once you obtain them run your actual command. For instance:
salloc -N 1 -n 1 -c 16 -t 5
srun pigz file.ext
There are 2 pi in this setup:
- PI-domo: running domoticz
- PI-pump: controlling a pump with one GPIO
Those pi are far away, but can communicate through network. PI-domo has some passwordless ssh login setup to pi-pump, and contains three scripts:
- pump_on.sh: sends value to gpio with ssh to turn pump on and returns 1
`ssh pi#pi-pump -n "echo 0 > /sys/class/gpio/gpio18/value" && echo 1`
pump_off.sh: sends value to gpio with ssh to turn pump off and returns 0
ssh pi#pi-pump -n "echo 1 > /sys/class/gpio/gpio18/value" && echo 0
pump_status.sh: returns 1 if pump is on, 0 if pump is off.
All three scripts work as expected when launched in bash, but I can not find how to call them with domoticz. I created a virtual switch and set those as script:///.....[on off].sh but domoticz doesn't seem to be running any of them. nor could I find a place to read the status...
Any idea or link to a RECENT (working) tutorial would be welcome!
Found the issue: stupid me.
It turns out domoticz process was running as root and root didn't have the key setup for passwordless ssh.
I know that this is a old thread and it is answered already, but I have stumbled on the same issue and found that online answers lacked detail. So, here it goes:
On PI-domo run
sudo su to become root
Generate a new key using ssh-keygen -t rsa -b 4096 -C "nameofyourkey"
Copy your key to PI-pump by using ssh-copy-id -i /root/.ssh/yourkey.pub pi#pi-pump
ssh to pi-pump to test that ssh agent for root is working, and if all is well exit and go back to become a pi user.
Note 1: Although logging in as root of PI-domo, it is critical that pump_off and pump_status.sh contain pi#pi-pump and not root#pi-pump or this approach will fail.
Note 2: Domoticz log indicates that the above process has some error by outputting Error: Error executing script command (/home/pi/domoticz/scripts/pump_off.sh). returned: 65280. Note the 65280 error in particular
Using RHEL7.3
Using R 3.3.2
Installed Rmpi_0.6-6.tar.gz and doMPI_0.2.1.tar.gz
Installed mpich-3.0-3.0.4-10.el7 RPM for x86_64
I created a cluster of three machines (aml1,2,3). I can run the /examples/cpi example from the mpich installation and the processes run without issue on all three machines.
I can also run an R script that needs to be run multiple times, which is discussed on the doMPI documentation -- so the script runs on all clusters.
My problem is when my R script has code prior to the %dopar% that needs to be run once on the master(aml1), and have the %dopar% run on the cluster (aml2,aml3). It only runs on the master. And doMPI says Size of MPI universe: 0 and doesn't recognize aml2 or aml3.
For example:
Run: mpirun -np 1 --hostfile ~/projects/hosts R --no-save -q < example6.R
(and my ~/projects/hosts file is defined to use 8 cores)
example6.R:
library(doMPI) #load doMPI library
cl <- startMPIcluster(verbose=TRUE)
#load data
#clean data
#perform some functions
#let's say I want to have this done in the script and only parallelize this
x <- foreach(seed=c(7, 11, 13), .combine="cbind") %dopar% {
set.seed(seed)
rnorm(3)
}
x
closeCluster(cl)
Output of example6.R:
Master processor name: aml1; nodename: aml1
Size of MPI universe: 0
Spawning 2 workers using the command:
/usr/lib64/R/bin/Rscript /usr/lib64/R/library/doMPI/RMPIworker.R WORKDIR=/home/spark LOGDIR=/home/spark MAXCORES=1 COMM=3 INTERCOMM=4 MTAG=10 WTAG=11 INCLUDEMASTER=TRUE BCAST=TRUE VERBOSE=TRUE
2 slaves are spawned successfully. 0 failed.
If I define cl <- startMPIcluster(count=34, verbose=TRUE) I still get the following but at least I can run 34 slaves:
Master processor name: aml1; nodename: aml1
Size of MPI universe: 0
34 slaves are spawned successfully. 0 failed.
How can I troubleshoot this? I would like to run the R script so it runs the first portion once on the master, and then do %dopar% on the cluster.
Thanks!!
Update 1
Since the last update, I tried running an older version of OpenMPI:
[spark#aml1 ~]$ which mpirun
/opt/openmpi-1.8.8/bin/mpirun
Per #SteveWeston, I created the following script and ran it:
[spark#aml1 ~]$ cat sanity_check.R
library(Rmpi)
print(mpi.comm.rank(0))
mpi.quit()
With the following output:
[spark#aml1 ~]$ mpirun -np 3 --hostfile ~/projects/hosts R --slave -f sanity_check.R
FIPS mode initialized
master (rank 0, comm 1) of size 3 is running on: aml1
slave1 (rank 1, comm 1) of size 3 is running on: aml1
slave2 (rank 2, comm 1) of size 3 is running on: aml1
[1] 0
Here it just hangs -- and nothing happens.
I've already accepted #SteveWeston's answer as it helped me in better understanding my original question.
I commented to his answer that I was still having issues with my R script hanging; the scripts would run, but it would never finish on its own or close its own clusters and I would have to kill it with ctrl-C.
I ultimately set up an nfs environment, build and installed openmpi-1.10.5 there, and installed my R libraries there as well. R is installed separately on both machines, but they share the same library in my nfs directory. Previously I had installed and managed everything under root, including the R libraries (I know). I'm not sure if this what caused complications, but my issues seem to be resolved.
[master#aml1 nfsshare]$ cat sanity_check.R
library(Rmpi)
print(mpi.comm.rank(0))
mpi.quit(save= "no")
[master#aml1 nfsshare]$ mpirun -np 3 --hostfile hosts R --slave -f sanity_check.R
FIPS mode initialized
[1] 1
[1] 0
[1] 2
# no need to ctrl-C here. It no longer hangs
I set up MPICH3 (mpich-3.1.3) on my notebook(Intel Core i5) and a slave processor running on ARM Cortex15 processor and both running Ubuntu 14.04 OS with ssh keygen setup for free communication.
I have installed mpich3 in the folder which is shared between the cluster through nfs.
I have exported the path from my master server only.
The installation went well and i tried out the following command on my master node alone which runs fine:
mpiexec -n 2 ./cpi
Process 0 of 2 is on MingF
Process 1 of 2 is on MingF
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000182
But when I try running on my slave and master, then i get this error and it hangs:
mpiexec -f hosts -n 2 ./cpi
bash: /mirror/mpich3/bin/hydra_pmi_proxy: cannot execute binary file: Exec format error
It hangs there until i press 'Ctrl + C' to break out of it.
I am guessing its because of the change in processor type but I may be wrong. could someone help me out?
You cannot run the same executable on such different architectures as x86 and ARM. Compile it separately on both machines and pay attention to the endianness of the ARM machine.
I'm trying to run a MPI job on a cluster with torque and openmpi 1.3.2 installed and I'm always getting the following error:
"mpirun was unable to launch the specified application as it could not find an executable:
Executable: -p
Node: compute-101-10.local
while attempting to start process rank 0."
I'm using the following script to do the qsub:
#PBS -N mphello
#PBS -l walltime=0:00:30
#PBS -l nodes=compute-101-10+compute-101-15
cd $PBS_O_WORKDIR
mpirun -npersocket 1 -H compute-101-10,compute-101-15 /home/username/mpi_teste/mphello
Any idea why this happens?
What I want is to run 1 process in each node (compute-101-10 and compute-101-15). What am I getting wrong here?
I've already tried several combinations of the mpirun command, but either the program runs on only one node or it gives me the above error...
Thanks in advance!
The -npersocket option did not exist in OpenMPI 1.2.
The diagnostics that OpenMPI reported
mpirun was unable to launch the specified application as it could not
find an executable: Executable: -p
is exactly what mpirun in OpenMPI 1.2 would say if called with this option.
Running mpirun --version will determine which version of OpenMPI is default on the compute nodes.
The problem is that the -npersocket flag is only supported by Open MPI 1.3.2 and the cluster where I'm running my code only has Open MPI 1.2 which doesn't support that flag.
A possible way around is to use the flag -loadbalance and specify the nodes where i want the code to run with the flag -H node1,node2,node3,... like this:
mpirun -loadbalance -H node1,node2,...,nodep -np number_of_processes program_name
that way each node will run number_of_processes/p processes, where p the number of nodes where the processes will be run.