I have a Raspberry Pi cluster on which MPI runs perfectly. The one issue I'm having is that MPI uses the master node as a compute node. How do I configure MPI so that it only runs on the compute nodes? I tried removing the head node's IP address from the host file that I use with mpirun, but I get back:
HYDU_sock_connect (./utils/sock.c:171): unable to get host address for mastern (2)
main (./pm/pmiserv/pmip.c:209): unable to connect to server mastern at port 42525
Thanks in advance
Even if you don't start up an MPI rank on the node that you launch from, I believe most basic launchers will still start up some sort of daemon process on that node. If that's a problem, you might need to launch from a different node.
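As a sketch (hostnames here are assumptions): list only the compute nodes in the machine file you pass to mpirun, and make sure every node can still resolve the head node's hostname. The error above suggests the workers cannot look up mastern, so an /etc/hosts entry on each node usually fixes that:
# machinefile: compute nodes only
node1
node2
node3
# /etc/hosts on every node should resolve the head node, e.g.:
# 192.168.1.10  mastern
mpiexec -f machinefile -n 3 ./my_mpi_program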
I need to debug an MPI code for which I only have access to a single node/machine. The problem is that the bug I am looking for only arises when running on more than one node; when running, for example, two MPI tasks on the same node, everything goes fine. I assume that my MPI implementation (MVAPICH2) cleverly treats tasks running on the same node by, for example, replacing network communication with IPC strategies or even memcpy.
So my question is: how can I run two MPI tasks on a single node while making MPI treat them as tasks on different nodes? Is that possible?
You can disable the MVAPICH2 shared memory device by setting the MV2_USE_SHARED_MEM environment variable to 0:
mpiexec ... -env MV2_USE_SHARED_MEM 0 ... ./executable
Make sure that your MVAPICH2 was built with the TCP/IP device, otherwise your ranks won't be able to communicate with shared memory support turned off.
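If you are unsure how your MVAPICH2 was built, mpiname (a utility shipped with MVAPICH2) prints the version and the configure options it was built with:
mpiname -a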
I am using the makeCluster function from the R package snow on a Linux machine to start a SOCK cluster on a remote Linux machine. Everything seems set up for the two machines to communicate successfully (I am able to establish ssh connections between the two). But:
makeCluster("192.168.128.24",type="SOCK")
does not return any result; it just hangs indefinitely.
What am I doing wrong?
Thanks a lot
Unfortunately, there are a lot of things that can go wrong when creating a snow (or parallel) cluster object, and the most common failure mode is to hang indefinitely. The problem is that makeSOCKcluster launches the cluster workers one by one, and each worker (if successfully started) must make a socket connection back to the master before the master proceeds to launch the next worker. If any of the workers fail to connect back to the master, makeSOCKcluster will hang without any error message. The worker may issue an error message, but by default any error message is redirected to /dev/null.
In addition to ssh problems, makeSOCKcluster could hang because:
R not installed on a worker machine
snow not installed on the worker machine
R or snow not installed in the same location as on the local machine
current user doesn't exist on a worker machine
networking problem
firewall problem
and there are many more possibilities.
In other words, no one can diagnose this problem without further information, so you have to do some troubleshooting in order to get that information.
In my experience, the single most useful troubleshooting technique is manual mode which you enable by specifying manual=TRUE when creating the cluster object. It's also a good idea to set outfile="" so that error messages from the workers aren't redirected to /dev/null:
cl <- makeSOCKcluster("192.168.128.24", manual=TRUE, outfile="")
makeSOCKcluster will display an Rscript command to execute in a terminal on the specified machine, and then it will wait for you to execute that command. In other words, makeSOCKcluster will hang until you manually start the worker on host 192.168.128.24, in your case. Remember that this is a troubleshooting technique, not a solution to the problem, and the hope is to get more information about why the workers aren't starting by trying to start them manually.
Obviously, the use of manual mode bypasses any ssh issues (since you're not using ssh), so if you can create a SOCK cluster successfully in manual mode, then probably ssh is your problem. If the Rscript command isn't found, then either R isn't installed, or it's installed in a different location. But hopefully you'll get some error message that will lead you to the solution.
If makeSOCKcluster still just hangs after you've executed the specified Rscript command on the specified machine, then you probably have a networking or firewall issue.
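If you suspect a networking or firewall issue, a quick check (a sketch; 10187 is snow's default master port unless you passed port= to makeSOCKcluster) is to test the connection from the worker back to the master with netcat:
# run on the worker (192.168.128.24); replace master-host with the master's hostname
nc -vz master-host 10187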
For more troubleshooting advice, see my answer for making cluster in doParallel / snowfall hangs.
I have a bunch of Erlang nodes running on a single machine, and they are all connected in a network. Sometimes the machine our application is on will be under extremely heavy load for several minutes. Often, after things return to normal, my Erlang nodes think that they were disconnected, and I have to manually call net_adm:ping on each one of them to get them to re-connect to the network.
Any ideas on how I can avoid this situation?
You can increase the value of the net_ticktime kernel configuration option so that nodes are pinged less frequently. See also net_kernel:set_net_ticktime. Note, however, that all communicating nodes must have the same net_ticktime value.
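For example, to raise the tick time from the default 60 seconds to 120 (the value here is an assumption; tune it to outlast your load spikes), start every node with:
erl -sname mynode -kernel net_ticktime 120
or change it at runtime on each node with net_kernel:set_net_ticktime(120).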
I am aware that nodes can be started from the shell. What I am looking for is a way to start a remote node from within a module. I have searched, but have been able to find nothing.
Any help is appreciated.
There's a pool(3) facility:
pool can be used to run a set of Erlang nodes as a pool of computational processors. It is organized as a master and a set of slave nodes.
pool:start/1,2 starts a new pool. The file .hosts.erlang is read to find host names where the pool nodes can be started. The slave nodes are started with slave:start/2,3, passing along Name and, if provided, Args. Name is used as the first part of the node names, and Args is used to specify command line arguments.
With pool you get load distribution facility for free.
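A minimal sketch (the pool name is an assumption, and .hosts.erlang must list the hosts where slaves may run):
Nodes = pool:start(mypool),    % starts the slaves and returns their node names
Node = pool:get_node(),        % picks the node with the lowest reported load
Pid = pool:pspawn(io, format, ["hello from the pool~n"]).
pspawn/3 is like spawn/3, but it places the process on whichever node currently has the lowest load.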
The master node can be started this way:
erl -sname poolmaster -rsh ssh
The -rsh flag here specifies an alternative to rsh for starting a slave node on a remote host; we use SSH. Make sure your machines have working SSH keys and that you can authenticate to the remote hosts using those keys.
If there are no hosts in the file .hosts.erlang, then no slave nodes are started, and you can use slave:start/2,3 to start slave nodes manually, passing arguments if needed.
You could, for example, start a remote node like this:
Arg = "-mnesia_dir " ++ M,
slave:start(H, Name, Arg).
Ensure epmd(1) is up and running on the remote boxes in order to start Erlang nodes.
Hope that helps.
A bit more low level than pool is the slave(3) module; pool builds upon the functionality in slave.
Use slave:start to start a new slave.
You should probably also specify -rsh ssh on the command-line.
So use pool if you need the kind of functionality it offers; if you need something different, you can build it yourself out of slave.
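A minimal sketch (host and node name are assumptions; with -rsh ssh you need passwordless SSH to the remote host, and the same Erlang version on both ends):
{ok, Node} = slave:start(remotehost, worker1),
rpc:call(Node, erlang, node, []).
slave:start/2 returns {ok, Node} on success, after which the new node behaves like any other connected node.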
How can I do inter-process communication between two remote processes on Unix in C/C++? Currently, popen works for two processes on the same host. The product needs to be able to invoke a remote process and send/receive data.
Since you mentioned popen, you may not realize that it already allows you to use ssh to remotely execute a process and treat it exactly the same as a locally spawned one:
popen ("ssh user#remotehost /usr/bin/cal", "r")
And a pre-emptive link for further questions on ssh:
https://serverfault.com/questions/117007/ssh-key-questions
Why not just open the wildcard % in the IP so that they could access the host remotely?
192.168.1.%, something like that :D