OpenMPI mpirun hangs after giving `failed-daemon-launch` error

OpenMPI mpirun hangs after giving `failed-daemon-launch` error - mpi

When several nodes are down and can’t be connected via ssh, mpirun gives the following error and hangs forever. The expected behavior in this case should be terminating with an error code, however it’s hanging.
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
———————————————————————————————————————————————
Is mpirun doing anything in the background? Is there any configuration I can add to make mpirun exit in this case?

Related

Disable internet access when calling java -jar

I'm testing six distinct .jar-files that all need to handle the possibility of no online access.
Unfortunately, I am on a network disc, so disabling the network connection or pulling the ethernet cable does not work unless I move all the files to /tmp or /scratch and change my $HOME environment variable, all of which I'd rather not have to do as it ends up being a lot of work.
Is there a way to invoke java -jar and disable the process from accessing the internet? I have not found any such flag in the man-pages. Is there perhaps a UNIX-way of doing this, as in:
disallowinternetaccess java -jar Foo.jar

Tell your Java program to access the network through a proxy. For all internet access this would be a SOCKS5 proxy.
java -DsocksProxyHost=socks.example.com MyMain
I believe that if no proxy is running you should get an appropriate exception in your program. If you need full control of what is happening, you can look into - and possibly modify - http://jsocks.sourceforge.net/
See http://docs.oracle.com/javase/7/docs/technotes/guides/net/proxies.html for details.
Note: You can do this without any native Unix stuff, so this question fits perfectly fine on SO.

You need just turn on SecurityManager: -Djava.security.manager=default
see details - https://stackoverflow.com/a/4645781/814304
With this solution you can even handle which resource you want to show and which to hide.

Disallow MPI to run on the headnode

I have a Rpi cluster MPI runs perfect on, one issue I am having is that MPI is using the master node as a compute node, how do I configure MPI so it only runs on the compute node. I tried removing the head node's IP address from the file that I use to run with mpirun, but I get back:
HYDU_sock_connect (./utils/sock.c:171): unable to get host address for mastern (2)
main (./pm/pmiserv/pmip.c:209): unable to connect to server mastern at port 42525
thanks in advance

Even if you don't start up an MPI rank on the node that you launch from, I believe most basic launchers will still start up some sort of daemon process on that node. If that's a problem, you might need to launch from a different node.

Process stop getting network data

We have a process (written in c++ /managed), which receives network data via tcpip.
After running the process for a while while tracking network load, it seems that network get into freeze state and the process does not getting data, there are other processes in the system that using networking (same nic) which operates normally.
the process gets out of this frozen situation by itself after several minutes.
Any idea what is happening?
Any counter i can track to see if my process reach some limitations ?

It is going to be very difficult to answer specifically,
-- without knowing what exactly is your process/application about,
-- whether it is a network chat application, or a file server/client, or ......
-- without other details about your process how it is implemented, what libraries it uses, if relevant to problem.
Also you haven't mentioned what OS and environment you are running this process under,
there is very little anyone can help . It could be anything, a busy wait loopl in your code, locking problems if its a multi-threaded code,....
Nonetheless , here are some options to check:
If its linux try below commands to debug and monitor the behaviour of the process and see what could be problem-
top
Check top to see ow much resources(CPU, memory) your process is using and if there is anything abnormally high values in CPU usage for it.
pstack
This should stack frames of the process executing at time of the problem.
netstat
Run this with necessary options (tcp/udp) to check what is the stae of the network sockets opened by your process
gcore -s -c
This forces your process to core when the mentioned problem happens, and then analyze that core file using gdb
gdb
and then use command where at gdb prompt to get full back trace of the process (which functions it was executing last and previous function calls.

makeCluster function in R snow hangs indefinitely

I am using makeCluster function from R package snow from Linux machine to start a SOCK cluster on a remote Linux machine. All seems settled for the two machines to communicate succesfully (I am able to estabilish ssh connections between the two). But:
makeCluster("192.168.128.24",type="SOCK")
does not throw any result, just hangs indefinitely.
What am I doing wrong?
Thanks a lot

Unfortunately, there are a lot of things that can go wrong when creating a snow (or parallel) cluster object, and the most common failure mode is to hang indefinitely. The problem is that makeSOCKcluster launches the cluster workers one by one, and each worker (if successfully started) must make a socket connection back to the master before the master proceeds to launch the next worker. If any of the workers fail to connect back to the master, makeSOCKcluster will hang without any error message. The worker may issue an error message, but by default any error message is redirected to /dev/null.
In addition to ssh problems, makeSOCKcluster could hang because:
R not installed on a worker machine
snow not installed on a the worker machine
R or snow not installed in the same location as the local machine
current user doesn't exist on a worker machine
networking problem
firewall problem
and there are many more possibilities.
In other words, no one can diagnose this problem without further information, so you have to do some troubleshooting in order to get that information.
In my experience, the single most useful troubleshooting technique is manual mode which you enable by specifying manual=TRUE when creating the cluster object. It's also a good idea to set outfile="" so that error messages from the workers aren't redirected to /dev/null:
cl <- makeSOCKcluster("192.168.128.24", manual=TRUE, outfile="")
makeSOCKcluster will display an Rscript command to execute in a terminal on the specified machine, and then it will wait for you to execute that command. In other words, makeSOCKcluster will hang until you manually start the worker on host 192.168.128.24, in your case. Remember that this is a troubleshooting technique, not a solution to the problem, and the hope is to get more information about why the workers aren't starting by trying to start them manually.
Obviously, the use of manual mode bypasses any ssh issues (since you're not using ssh), so if you can create a SOCK cluster successfully in manual mode, then probably ssh is your problem. If the Rscript command isn't found, then either R isn't installed, or it's installed in a different location. But hopefully you'll get some error message that will lead you to the solution.
If makeSOCKcluster still just hangs after you've executed the specified Rscript command on the specified machine, then you probably have a networking or firewall issue.
For more troubleshooting advice, see my answer for making cluster in doParallel / snowfall hangs.

how to post scripts to networkmanager's dispatcher.d directory

Ubuntu 10.10 64bit athalon, gnome
My basic scenario is I'm connecting to a VPN service (via newtworkmanager pptp protocol) and I'm transferring private data (hence VPN). The service goes down intermittantly and that's alright, probably due to my ISP/OS/VPN. What is not good is that my applications will then continue to transmit data via the eth0 default route and thats not cool. After some looking around I'm suspecting the best way to deal with this is to post scripts into /etc/NetworkManager/dispatcher.d. In short, the networkmanager service will execute scripts in this directory (and pass arguments to the scripts) when anything about the network changes.
My problem is that I can't get any of my scripts to execute. They all have, per the manpage, 0755 permissions and owned by root, but when I change the network state by unplugging ethernet cable, my scripts don't execute. I can execute them from the command line, but not automatically via the dispatcher....
an example script:
#!/bin/sh -e
exec /usr/bin/wmctrl -c qBittorrent
exit 0
This script is intentionally simple for testing purposes..
I can post whatever else would be helpful.

i'm using the syntax killall -9 any_application_name_here and that's working just fine. I imagine the script didn't have access to the binary wmctrl. I think that bash interpreter in this case will only execute bash binaries.
So, in a nutshell, if you want to control your VPN traffic based on network events, one way is to post scripts to /etc/NetworkManager/dispatcher.d and use binaries that are in bash's default path.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex