MS-MPI application failed on more than one node

I have two virtual boxes running Windows 7. Their IPs are 10.0.0.20 and 10.0.0.22, and each box can ping the other.
On both boxes I start smpd listening on port 8677:
smpd -p 8677
On both boxes I can see that port 8677 is listening. From either box I can connect to the other on port 8677 using telnet. I have also disabled the firewall, and no antivirus is installed.
Then I try to launch my application on 10.0.0.20:
mpiexec /port 8677 /gmachinefile host.txt myapp.exe
It works when host.txt is filled with
10.0.0.20
or
10.0.0.22
But when host.txt contains both lines:
10.0.0.20
10.0.0.22
I get this error:
[01:1952] ERROR: unable to read the cmd header on the pmi server context, Other MPI error, error stack:
[01:1952] ERROR: connection to the pmi client broken, closing.
[01:1952] ERROR: unable to read the cmd header on the pmi server context, Other MPI error, error stack:
[01:1952] ERROR: connection to the pmi client broken, closing.
All the commands were launched with administrator privileges.

Related

Rabbitmq: Node down

I intermittently get a node down error from RabbitMQ.
I see the error below when I run sudo rabbitmqctl status or sudo rabbitmqctl list_queues:
Error: unable to connect to node : nodedown
connected to epmd (port 4369) on host-name
epmd reports node 'rabbit' running on port 25672
can't establish TCP connection, reason: timeout
suggestion: blocked by firewall?
version: {rabbit,"RabbitMQ","3.6.9"}
os: Ubuntu 16.04
I have checked the hostname, which looks fine and has not changed since the installation.
I am also able to telnet localhost 25672.
What could be the reason behind this error, and what is a possible solution?
One more question: I am checking the node status using the API below.
curl -s GET http://edx:edx@127.0.0.1:15672/api/healthchecks/node/
Is this API suitable for checking the health of the node? Please suggest anything better. I have set up a shell script that calls this API and restarts the rabbitmq-server service if the status is not ok. The script is run from cron every minute.

Looks like your rabbitmq node is... down. rabbitmqctl needs a running node to perform these commands.
If you're using systemd, you can check the service status:
service rabbitmq-server status
Or just try to restart the node:
rabbitmqctl start_app
A successful telnet to port 25672 only shows that the port used for inter-node and CLI tool communication is open; RabbitMQ does not serve AMQP clients on that port (by default, it listens on 5672).
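
The watchdog script described in the question could look roughly like the sketch below. It is only an illustration under some assumptions: the URL and the edx:edx credentials are taken from the question, the health check is assumed to answer with a JSON body containing "status":"ok" when the node is healthy, and the function names (status_is_ok, fetch_health, check_and_restart) are made up for this example.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; URL and credentials are the ones from the question.
HEALTH_URL="http://127.0.0.1:15672/api/healthchecks/node"
CREDENTIALS="edx:edx"

# Succeeds only when the JSON body passed as $1 reports "status":"ok"
# (assumed response shape of the management plugin health check).
status_is_ok() {
    printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Fetch the health check; --max-time keeps a hung node from stalling the cron job.
fetch_health() {
    curl -s --max-time 10 -u "$CREDENTIALS" "$HEALTH_URL"
}

# Restart the service only when the check fails or returns nothing (needs root).
check_and_restart() {
    body=$(fetch_health) || body=""
    if status_is_ok "$body"; then
        echo "rabbitmq: healthy"
    else
        echo "rabbitmq: unhealthy, restarting" >&2
        service rabbitmq-server restart
    fi
}
```

To use it from cron, call check_and_restart on the last line of the script. Passing the credentials with -u instead of embedding them in the URL avoids any trouble with the user:password@host syntax.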

MPI A process or daemon was unable to complete a TCP connection

Open MPI: 4.0.1a
HostFile:
34bb0519eAAA
a2935f150BBB
I am on machine 34bb0519eAAA, and ssh a2935f150BBB connects to a2935f150BBB successfully. Likewise, ssh 34bb0519eAAA from machine a2935f150BBB connects to 34bb0519eAAA successfully.
But when I run the mpiexec command, I get this error message:
Warning: Permanently added '[XX.XX.XX.XX]:XX' (a2935f150BBB's IP address) to the list of known hosts.
--------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: a2935f150BBB
Remote host: 34bb0519eAAA
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
I am very confused, because ssh works in both directions. How can this fail?
Here is the ssh connection:
ssh a2935f150BBB
Warning: Permanently added '[XX.XX.XX.XX]:XX' to the list of known hosts.
Welcome to Ubuntu 18.04.1 LTS (XXXXXXXXXXXXXXXXXX)
Documentation: https://help.ubuntu.com
Management: https://landscape.canonical.com
Support: https://ubuntu.com/advantage
This system has been minimized by removing packages and content that are
not required on a system that users do not log into.
To restore this content, you can run the 'unminimize' command.
Last login:XXXXXXXXXXXXX from XXXXXXXXXX

FileZilla: able to connect via SFTP, but failed to list directories

I used FileZilla to connect to one of my Linux servers via the SFTP protocol, but got the error output below.
Status: Connecting to <server_ip>...
Response: fzSftp started, protocol_version=5
Command: keyfile "C:\ruifeng_ibm.ppk"
Command: open "root@<server_ip>" 22
Status: Connected to <server_ip>
Error: Connection timed out after 20 seconds of inactivity
Error: Could not connect to server
On the server when I ran lsof -i, I was able to see the established sshd connection.
sshd 12333 root 3u IPv4 109406 0t0 TCP <server_hostname>:ssh-><workstation_ip>:54315 (ESTABLISHED)
How could the directories not be listed when the connection is successful? I have no idea how to debug this either.

Turned out to be a silly problem.
I had put the welcome message below in my .bashrc file.
echo -e "\n\nHello Ruifeng...Welcome to the Arena! \n#>>------>---->>"
Either it contained some illegal characters that FileZilla does not accept, or such output is not supported by FileZilla at all. I was too lazy to dig in further. After removing this message, the connection worked and the directories were listed.
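
A likely explanation, offered as an assumption rather than something verified in this thread: bash reads ~/.bashrc even for non-interactive SSH sessions, so anything it prints ends up in the SFTP data stream and confuses the client. If that is the cause, the message does not have to be deleted; it can be printed only for interactive shells. A minimal sketch, where is_interactive is a hypothetical helper name and printf stands in for echo -e:

```shell
# ~/.bashrc fragment (sketch)

# $- holds the current shell's option flags; it contains "i" only in
# interactive shells. $1 is expected to be that flag string.
is_interactive() {
    case "$1" in
        *i*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Print the greeting only when a human is at the terminal, so SFTP and
# other non-interactive sessions see no unexpected output.
if is_interactive "$-"; then
    printf '\n\nHello Ruifeng...Welcome to the Arena! \n#>>------>---->>\n'
fi
```

With this guard, a normal ssh login still shows the greeting, while sessions started by FileZilla stay silent.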

rsync port 22: Connection timed out

I want to back up folders from my remote server (Ubuntu server) to another remote server (Linux server), but when I run this command from the first server, it displays an error message:
rsync -raz --progress firstdirectoy root@serverIP:/home
The displayed message is:
ssh: connect to host <serverIP> port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(601) [sender=3.0.7]
But the same command from server 2 to server 1 works fine, and the folder is copied to server 1 correctly.
How can I get around the connection error in order to copy my folder from server 1 to server 2 through rsync?

It seems that server2 has no active SSH daemon while server1 does.
Try starting the SSH daemon, or use the raw rsync protocol with an rsync daemon.

If it's a connection timeout because your SSH server is slow to respond, you can tweak the timeout in rsync:
rsync -e 'ssh -o ConnectTimeout=120'
Otherwise it may be a missing SSH daemon (sshd) on server 2, as stated by @geov, or a closed port on your firewall. You can start by testing an SSH login:
ssh user@serverIP
and see whether it works. Running nmap serverIP will probably help too, since it shows whether the SSH port is open.
And please do NOT use root user for your rsync copy!
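
The two suggestions above (test the SSH login first, then run rsync with a longer connect timeout) can be combined in a small wrapper. This is a sketch, not a drop-in tool: ssh_reachable and backup are hypothetical function names, SSH_BIN is a made-up override used only to try the check without a network, and BatchMode=yes assumes key-based login, so a server that only accepts passwords would be reported as unreachable.

```shell
# Sketch: verify SSH first, then run rsync with a longer connect timeout.

# $1 = user@host, $2 = connect timeout in seconds.
# BatchMode=yes makes ssh fail instead of prompting for a password.
ssh_reachable() {
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout="$2" "$1" true
}

# $1 = source directory, $2 = user@host, $3 = destination path.
backup() {
    if ssh_reachable "$2" 10; then
        rsync -az --progress -e 'ssh -o ConnectTimeout=120' "$1" "$2:$3"
    else
        echo "cannot open an SSH session to $2; check sshd, firewall rules, and the address" >&2
        return 1
    fi
}
```

For example, backup firstdirectoy user@serverIP /home would copy the directory only after the SSH check succeeds.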

If you wait for a long time, the prompt appears.
I think that your server2's IP is wrong.

For me, this error appeared when attempting to rsync between two AWS EC2 instances where the two instances were not a part of the same security group.
Overview of how to create security groups
How to change the security groups of the instances
Allow instances within the same security group to communicate

Strange Vagrant error message: 'Unable to create a host network interface'

I have a Vagrant machine based on VirtualBox that has some problems (see Vagrant crashes depending on physical network). Now I tried running it on another piece of hardware (with OS X Mavericks), and got the following error message:
There was an error while executing `VBoxManage`, a CLI used by Vagrant
for controlling VirtualBox. The command and stderr is shown below.
Command: ["hostonlyif", "create"]
Stderr: VBoxManage: error: Unable to create a host network interface
VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component Host,
interface IHost, callee nsISupports
Context: "CreateHostOnlyNetworkInterface (hif.asOutParam(),
progress.asOutParam())" at line 64 of file VBoxManageHostonly.cpp
What does this mean?
To reproduce the error, I run:
$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
[default] Clearing any previously set forwarded ports...
[default] Creating shared folders metadata...
[default] Clearing any previously set network interfaces...
… and then it crashes. Any ideas?
Oh, by the way: It's Vagrant 1.3.5 and VirtualBox 4.1.18.

sudo /Library/StartupItems/VirtualBox/VirtualBox restart
worked for me, see https://coderwall.com/p/ydma0q

The popular answer seems to be modprobe vboxnetadp (for Linux) or /Library/StartupItems/VirtualBox/VirtualBox restart (for Mac).
However, the fix for me was to add myself to the vboxusers group and log in again.
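
On a Linux host, the group fix can be scripted along the following lines. This is a sketch under assumptions: has_group and ensure_vboxusers are hypothetical helper names, and the host is assumed to provide id, sudo, and usermod.

```shell
# Sketch for a Linux host: add the current user to vboxusers if needed.

# Succeeds when group $2 appears in the space-separated group list $1.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

ensure_vboxusers() {
    if has_group "$(id -nG)" vboxusers; then
        echo "already in vboxusers"
    else
        # New group membership only takes effect at the next login.
        sudo usermod -aG vboxusers "$(id -un)"
        echo "added to vboxusers; log out and back in"
    fi
}
```

Running ensure_vboxusers once and then logging in again should be enough before retrying vagrant up.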
