Open MPI trouble with mpirun and ssh

Here's the thing. I've installed Open MPI on two different computers; I've already compiled and run the hello_world example separately on both machines and it works fine. The problem appears when I launch this command:
mpirun -hostfile hosts -n 3 hello_c
where the hosts file contains localhost and the IP of my other machine. The program then asks for my SSH password, and after I enter it nothing happens, as if mpirun had just crashed. My real problem is that I can't run an MPI process on two different computers through SSH.
I want to point out that all the Open MPI binaries and libraries are correctly set in the PATH, including hello_world.
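For reference, a hostfile of the kind described might look like this (the IP is a placeholder, and slots= is an optional Open MPI keyword giving the number of processes to run on that node):
# hosts
localhost
192.168.1.20 slots=2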
Update
I've already set up passwordless SSH with an RSA key, but that doesn't work either. I launched mpirun in debug mode (-d) and got this:
[baptiste@baptiste RE51]$ mpirun -d -hostfile hosts hello_c
[baptiste.thinkFed:02666] procdir: /tmp/openmpi-sessions-baptiste@baptiste.thinkFed_0/53471/0/0
[baptiste.thinkFed:02666] jobdir: /tmp/openmpi-sessions-baptiste@baptiste.thinkFed_0/53471/0
[baptiste.thinkFed:02666] top: openmpi-sessions-baptiste@baptiste.thinkFed_0
[baptiste.thinkFed:02666] tmp: /tmp
[roommateServer:01102] procdir: /tmp/openmpi-sessions-baptiste@roommateServer_0/53471/0/1
[roommateServer:01102] jobdir: /tmp/openmpi-sessions-baptiste@roommateServer_0/53471/0
[roommateServer:01102] top: openmpi-sessions-baptiste@roommateServer_0
[roommateServer:01102] tmp: /tmp
And nothing else; it stays there and I have to kill mpirun.
For information, I tried launching mpirun hello_c through SSH on the remote node with this command:
ssh roomServer mpirun hello_c
This works fine... I definitely can't understand why it doesn't work across all nodes.

Assuming your compiler and your hosts file are set up properly, your problem is that you need to set up passwordless SSH between the two computers; otherwise you will get the behavior you described. MPI needs to launch processes and communicate quickly and efficiently, without being prompted for a password; a password prompt stalls the launcher and the job hangs.
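A minimal sketch of that setup, assuming OpenSSH on both machines and remoteHost standing in for the other node's name or IP:
# Generate a key pair (accept the default path; leave the passphrase empty)
ssh-keygen -t rsa
# Install the public key on the other machine
ssh-copy-id user@remoteHost
# Verify that no password prompt appears
ssh user@remoteHost hostname
If login is already passwordless and mpirun still hangs after the remote daemon starts (as in the update above), the usual suspects are the remote non-interactive environment (check with ssh user@remoteHost env that PATH and LD_LIBRARY_PATH include the Open MPI install, or point mpirun at it with --prefix) and a firewall blocking the TCP connections the remote daemons open back to mpirun.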

Related

Running mpi4py script without mpi

Normally I'd use mpiexec to run a process on multiple hosts like:
mpiexec -n 8 --hostfile hosts.txt python my_mpi_script.py
where my_mpi_script.py depends on mpi4py.
Supposing I couldn't run mpiexec or mpirun, how would I be able to run my_mpi_script.py on multiple hosts -- would this be possible by changing my script or execution environment?
Edit: I'm working with a system that runs the same command on many hosts. Normally, the processes would discover each other on the local network rather than all being spawned by MPI. My current solution involves checking which host I'm on and running mpiexec on exactly one of the hosts (sketched below). This doesn't work well due to some networking limitations.
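That check is roughly the following (a sketch; HEAD_NODE is a hypothetical placeholder for the one host that should run mpiexec):
#!/bin/bash
# Every host runs this script, but only the designated head
# host actually launches mpiexec; the others exit quietly.
HEAD_NODE=node01   # placeholder name
if [ "$(hostname)" = "$HEAD_NODE" ]; then
    mpiexec -n 8 --hostfile hosts.txt python my_mpi_script.py
fi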
Thanks.

Having problems running MPI programs with rsh (with Open MPI)

I have been running MPI programs on my testbed over ssh without problems. But when I switched to rsh to avoid encryption and ran a program with mpirun, there was no output. I inspected the traffic with Wireshark and found a TCP packet with the PUSH flag whose payload says:
bash: orted: command not found
Open MPI is installed in the same directory on both machines, and they both run Ubuntu 16.04. I set things up so that no password is needed for rsh within the testbed. I can run programs on the remote machine with rsh, but not with mpirun. Any idea what the problem could be?
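The orted: command not found message usually means the non-interactive rsh shell on the remote machine does not have Open MPI's bin directory in its PATH. Two hedged diagnostics (the /opt/openmpi prefix and ./my_program are assumptions; substitute your actual install location and binary):
# See which PATH a non-interactive rsh session actually gets
rsh othernode 'echo $PATH'
# Tell Open MPI where it is installed on the remote nodes
mpirun --prefix /opt/openmpi -n 2 -hostfile hosts ./my_program
Depending on your Open MPI version, you may also need --mca plm_rsh_agent rsh so that mpirun launches the remote daemons over rsh rather than ssh.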

R Parallel - connecting to remote cores

Working in R 2.14.1, on Windows 7
Using the package parallel in R, I'm trying to take advantage of cores outside of my local machine available on my network, where all remote hosts I am connecting to are identical Windows machines.
The basic form of the commands are as such to make the connection.
library(parallel)
#assume 8 cores per machine
cl<-makePSOCKcluster(c(rep("localhost", 8), rep("otherhost", 8)))
Of course, trying to debug these things can be pretty tricky, but here is where I'm at with it.
If I specify the manual = TRUE flag as below
cl<-makePSOCKcluster(c(rep("localhost", 8), rep("otherhost", 8)), manual=TRUE)
there are no problems connecting to the remote host, and running a parallel process. The computers have identical setups to the one that I am working on. Yet, when this manual flag is not set, the connection command hangs.
This seems to indicate to me that since the manual flag bypasses ssh to make the connection to the host, that ssh is the problem when manual=FALSE.
It is not guaranteed at the moment that the remote computers have ssh on them. The question is: given that I have all the pertinent Windows login information for my remote hosts, and that I cannot change the settings on the remote computers, how would I connect to cores on remote machines with the parallel package in R without specifying manual = TRUE?
Alternatively, if ssh must be installed for this to happen, let's assume all computers have ssh on them. How would I connect to cores on the remote machines without circumventing ssh?
If you need any more information please let me know, I appreciate the time.
UPDATE 1
8-26-14
Thanks to Steve Weston for his insights. I will provide an update with the exact tools and setup I use to get my system working when it's up and running.
Feel free to comment or post if you have anything else to add about the best route for connecting remotely to a Windows machine from a Windows machine via makePSOCKcluster with the manual flag set to FALSE.
When creating a PSOCK cluster with manual=FALSE, the only way to start a worker on a remote machine is with "ssh", "rsh", or something command-line compatible, such as "plink" from PuTTY. The reason is that makePSOCKcluster starts the remote workers using the "system" function to execute commands of the form:
ssh -l user otherhost '/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=myhost PORT=10187 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
You can confirm this by looking at the source code for the newPSOCKnode function in the file snowSOCK.R from the parallel package.
For this to work, the ssh-compatible command must be available on the local machine and a corresponding ssh daemon must be running on each of the remote machines, otherwise makePSOCKcluster will simply hang. I've found that installing a good, working ssh daemon is the difficult part on Windows.
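A quick way to test that prerequisite outside of R (a sketch; substitute your own user, host, and remote Rscript path):
# If this prints a version banner, the ssh route that
# makePSOCKcluster relies on is in working order
ssh -l user otherhost Rscript --version
With PuTTY, the equivalent test is plink -l user otherhost Rscript --version.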
Unfortunately, manual=TRUE is generally the easiest way to create a PSOCK cluster on multiple Windows machines.
Hello everyone, I had the same problem and I managed to solve it. It is June 2018 as I write this answer; my OS is Windows 10 and my R version is 3.2.2. It is surprising to see this problem still exists after 4 years. I hope it can be fixed in an upcoming release.
Before you move on, please make sure you can access the server from cmd using ssh. I didn't put any password in my code at this point because I have the private key; you don't need to do that either, and you will see the reason later.
Fixing The problem
File directory
Since makePSOCKcluster works when the workers are started manually, my first try was to set manual=TRUE and see the output. Here is my result:
machineAddresses <- list(list(host='192.168.1.220', user='jeff'))
cl <- makePSOCKcluster(machineAddresses, manual = TRUE)
> Manually start worker on 192.168.1.220 with
"C:/PROGRA~1/R/R-32~1.2/bin/x64/Rscript" -e
"parallel:::.slaveRSOCK()" MASTER=DESKTOP-U5JA32O PORT=11756
OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
OK, here is the first problem: the Rscript location is incorrect (it should be the location of Rscript on the server). Generally it is somewhere under C:\Program Files; on my server it is C:\Program Files\R\R-3.2.2\bin. We need to correct it by passing an extra option that tells the code where Rscript actually is:
machineAddresses <-list(list(host='192.168.1.220',
user='jeff',rscript="C:/Program Files/R/R-3.3.2/bin/Rscript"))
CMD problem
Once you fix the directory problem, you will find that the code still hangs forever. Next we need to check whether we can manually access the server from R; my code is:
system("ssh jeff#192.168.1.220")
> GetConsoleMode on STD_INPUT_HANDLE failed with 6
I honestly don't know what this error means, but we need to get around it. Inspired by @Steve Weston, I decided to use PuTTY, so I installed it and changed my code to:
machineAddresses <- list(list(host='192.168.1.220', user='jeff',
    rscript="C:/Program Files/R/R-3.3.2/bin/Rscript",
    rshcmd="plink -pw qwer"))
The -pw option supplies the password. Being a newbie to PuTTY, I don't know how to make the private key work automatically with it, so I used the easiest workaround: supply the password directly. The above code is equivalent to the following in cmd:
plink -pw qwer jeff@192.168.1.220 Rscript -e parallel:::.slaveRSOCK() MASTER=DESKTOP-U5JA32O PORT=11063 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
And this is exactly what we would type if we created the workers manually. For those who are new like me: you need to add the PuTTY directory to the PATH environment variable to run plink. Here is my final code:
machineAddresses <- list(list(host='192.168.1.220', user='jeff',
    rscript="C:/Program Files/R/R-3.3.2/bin/Rscript",
    rshcmd="plink -pw qwer"))
cl <- makePSOCKcluster(machineAddresses, manual = FALSE)
It runs with no problem at all. In summary, the function makePSOCKcluster makes two mistakes:
It assumes a wrong Rscript directory on the server (at the very least it could assume the same directory as on my local computer, but it doesn't; I don't know where that strange default comes from).
It uses the ssh command to start the connection, which does not work from inside R. It works well in cmd, but not in R; I don't know the reason.
If you are still not able to use makePSOCKcluster, here is one trick: try to connect to the server from R using the system function first. It can give you an error message that may point to where the problem is. Here is my debugging code:
system("plink -pw qwer jeff#192.168.1.220 Rscript -e parallel:::.slaveRSOCK() MASTER=DESKTOP-U5JA32O PORT=11063 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE")

ssh to execute all commands in guest machine

I created a bash script, my_vp.sh, that uses two commands:
setterm -cursor off
setterm -powersave off
[...]
#execute video commands
[...]
The script lives on computerA.
But when I execute it over ssh from another machine's terminal (computerB):
ssh pi@192.168.1.1
the video commands work correctly on computerA (the machine the script is on),
but the setterm commands take effect on computerB (the terminal where I ran the ssh command).
Can somebody help me solve this?
Thank you very much!
I am not sure I understood the question:
to execute a local script, but on another machine:
scp /path/to/local/script.bash pi@192.168.1.1:/tmp/copy_of_script.bash
and then, if it's copied correctly, execute it:
ssh pi#192.168.1.1 "chmod +x /tmp/copy_of_script.bash"
ssh pi#192.168.1.1 "bash /tmp/copy_of_script.bash"
to have the remote video (Xwindows, etc) commands appear on the originating machine:
replace: ssh with: ssh -X (to allow X forwarding, which will allocate a DISPLAY automatically on the remote machine and tunnel it back to the originating machine)
for the X forwarding to work, there are some requirements (usually OK by default, but YMMV): read more about those requirements in this Unix.SE answer
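If the goal is specifically to make the setterm calls act on computerA's own console rather than on the ssh client's terminal, a hedged workaround is to aim their output at the remote console device inside the script (assuming the video shows on /dev/tty1, that the console type is linux, and that the user may write to that device):
#!/bin/bash
# Send the escape codes to computerA's console, not to the
# terminal the ssh session came from
TERM=linux setterm -cursor off > /dev/tty1
TERM=linux setterm -powersave off > /dev/tty1
#execute video commands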

rsync over ssh hangs in a file sync daemon

I am writing a file syncing application: I collect events from the filesystem whenever a file is modified, and later copy it over to a remote share via rsync over ssh. In my setup I have a slot connected to a QTimer. Every 5 seconds I pick a file from a SQLite DB for synchronization and call QProcess::start with the following parameters:
/usr/bin/rsync -a /aufs/another-test-folder/testfile286.txt --rsh="ssh -p 8023" user#myserver.de:/home/neox/another-test-folder/testfile286.txt --rsync-path="mkdir -p /home/neox/another-test-folder && rsync"
I have at most 2 rsync processes running in parallel. This results in a process tree:
MyApp
\_rsync
| \_ssh
|_rsync
\_ssh
The problem is that sometimes the application hangs, and ps shows the ssh processes have gone zombie. First I tried to kill MyApp with SIGKILL, but no luck. Then I moved on to killing rsync and ssh, but still no luck; the whole tree hangs. And if I try to start the daemon from another console, or even ssh to another box, I can't. My guess is that ssh is blocking some I/O resource somewhere. Any idea how to solve this?
P.S. This happens randomly and not often.
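One mitigation sketch for the hang itself: make both rsync and the underlying ssh give up instead of blocking forever. All options below are standard rsync/OpenSSH flags; the paths and port are taken from the question:
# --timeout aborts rsync if no I/O happens for 30s; BatchMode stops ssh
# from ever prompting; the ServerAlive options drop a dead connection
/usr/bin/rsync -a --timeout=30 /aufs/another-test-folder/testfile286.txt --rsh="ssh -p 8023 -o BatchMode=yes -o ServerAliveInterval=15 -o ServerAliveCountMax=3" user@myserver.de:/home/neox/another-test-folder/testfile286.txt --rsync-path="mkdir -p /home/neox/another-test-folder && rsync"
On the Qt side, calling QProcess::waitForFinished with a finite timeout and then kill() lets the application reap a stuck child instead of hanging along with it.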
