MPICH2 gethostbyname failed - mpi

I don't understand the error message. I am trying to run an MPICH2 application after installing mpich2 version 1.4 or 1.5 to /opt/mpich2 (both versions failed with the same error). My MPI application was compiled with 1.3, but I am able to run it with 1.4 on another workstation. I am testing it on Ubuntu 12.04.
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(467)..............:
MPID_Init(177).....................: channel initialization failed
MPIDI_CH3_Init(70).................:
MPID_nem_init(319).................:
MPID_nem_tcp_init(171).............:
MPID_nem_tcp_get_business_card(418):
MPID_nem_tcp_init(377).............: gethostbyname failed, localhost (errno 3)

Solution for macOS
I stumbled upon this issue on macOS 10.12.1.
The solution is to add 127.0.0.1 computername.local to /etc/hosts. Your file will look more or less like this:
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
127.0.0.1 computername.local
255.255.255.255 broadcasthost
::1 localhost
You can change/check your computer's name if you go to System Preferences > Sharing > Computer Name.

What worked for me was the following:
Make sure your hostname is the same in 1 and 2 below:
1. terminal hostname
2. "/etc/hosts" hostname
So if you type cat /etc/hosts in terminal it should look like:
// 127.0.0.1 my_hostname
My hostname was not the same in 1 and 2. Once I changed them to match, my MPI program would execute.
To change your terminal hostname, type the following:
sudo scutil --set HostName my_new_hostname
To change your /etc/hosts hostname type the following:
sudo nano /etc/hosts
and then add the line
127.0.0.1 my_new_hostname
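The comparison described above can also be scripted. The following is a minimal sketch rather than a definitive check: hosts_lookup is a hypothetical helper name, and it assumes the usual hosts-file layout of an IP address followed by one or more names, with # starting a comment.

```shell
# hosts_lookup FILE NAME
# Print the first IP address that FILE (in /etc/hosts format) maps NAME to.
# Returns non-zero when NAME has no entry. (Hypothetical helper, for illustration.)
hosts_lookup() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                    # drop comments
        {
            for (i = 2; i <= NF; i++)         # fields after the IP are names
                if ($i == name) { print $1; found = 1; exit }
        }
        END { exit !found }
    ' "$1"
}
```

For example, hosts_lookup /etc/hosts "$(hostname)" || echo "no entry for $(hostname)" reports when the name printed by hostname is missing from /etc/hosts.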

This error indicates a problem resolving localhost. Check your /etc/hosts file and make certain that localhost is correctly defined there; it should point to 127.0.0.1. Also try using ssh to connect to localhost to make sure that works.

Although the question is different, the answer is probably the same as the one I gave some time ago for OpenMPI:
gethostname() function missing in openMPI
The portable MPI solution is to use MPI_Get_processor_name().

Adding -host localhost to the command line solved this for me, as suggested in https://github.com/pmodels/mpich/issues/4710#issuecomment-661933489
e.g.
mpiexec -host localhost -np 4 ./testExecutable

Maybe your /dev/shm is full; try cleaning it up.
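A quick way to see whether /dev/shm is actually full is to look at what df reports for it. This is only a sketch: usage_percent is a hypothetical helper, and it assumes the POSIX df -P output format, where the fifth column of the second line is the capacity.

```shell
# usage_percent PATH
# Print how full the filesystem holding PATH is, as an integer percentage.
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

Running usage_percent /dev/shm prints a number close to 100 when the shared-memory filesystem is nearly full.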


mount.nfs: requested NFS version or transport protocol is not supported

NFS Mount is not working in my RHEL 7 AWS instance.
When I do a
mount -o nfsvers=3 10.10.11.10:/ndvp2 /root/mountme2/
I get the error:
mount.nfs: requested NFS version or transport protocol is not supported
Can anyone point me where I am wrong?
Thanks.
Check that the NFS service is started, or restart it.
sudo systemctl status nfs-kernel-server
In my case this service was not running, and the issue was in the /etc/exports file, where I had the same IP address for two machines.
So I commented out one IP address and restarted nfs-kernel-server using
sudo systemctl restart nfs-kernel-server and then rebooted the machine.
It worked.
A clarification that might be useful for the dumb (like me): systemctl status nfs-server.service and systemctl start nfs-server.service must be executed on the server!
Some additional data
If, like me, you've deleted a VM without shutting it down properly, you might also need to manually edit the file /etc/exports, because NFS tries to connect to that VM, fails, and instead of continuing with the next entry it just dies.
After that you can manually restart as mentioned in other answers.
In my case, a simple reload didn't suffice. I had to perform a full restart:
sudo systemctl restart nfs-kernel-server
In my case, it didn't work correctly with NFS version 4.1.
So in the Vagrantfile, in each place where there is type: 'nfs', I added a comma and nfs_version: 4, nfs_udp: false
Here is a more detailed explanation: NFS
If you're giving a specific protocol to connect with, also check to make sure your NFS server has that protocol enabled.
I got this error when trying to start up a Vagrant box, and my nfs server was running. It turns out that the command Vagrant uses is:
mount -o vers=3,udp,rw,actimeo=1 192.168.56.1:/dir/on/host /vagrant
This specifically asks for UDP. My server was running, but it was not configured to accept connections over UDP. After consulting /etc/nfs.conf, I created /etc/nfs.conf.d/10-enable-udp.conf with the following contents to enable UDP:
[nfsd]
udp=y
The name of the file doesn't matter, as long as it's in the conf.d directory and ends in .conf. Depending on your distribution it may be configured differently. You can directly edit nfs.conf, but using a conf.d file is more likely to preserve the changes after upgrading your system.
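One way to see which protocols the server actually offers is to inspect the output of rpcinfo -p run against the server, which lists registered RPC programs with their version and transport. The sketch below is only illustrative: nfs_offers is a hypothetical helper, and it assumes the usual five-column rpcinfo -p layout (program, vers, proto, port, service) with NFS registered as RPC program 100003.

```shell
# nfs_offers VERSION PROTO
# Read `rpcinfo -p` output on stdin; succeed if NFS (RPC program 100003)
# is registered for the given version and transport.
nfs_offers() {
    awk -v vers="$1" -v proto="$2" '
        $1 == 100003 && $2 == vers && $3 == proto { found = 1 }
        END { exit !found }
    '
}
```

For the Vagrant mount above, something like rpcinfo -p 192.168.56.1 | nfs_offers 3 udp || echo "NFSv3 over UDP is not registered" would show whether the requested combination is available before attempting the mount.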
Try to ping the IP address of the server from the client. If you get a reply, then install the NFS server on the host. Then edit the /etc/exports file, and don't forget to add the port along with the IP address.
I found the solution: make an entry in /etc/nfsmount.conf on the NFS server with Defaultvers=3.
There will be a line # Defaultvers=3; just uncomment it and then mount on the NFS client.
The issue will be resolved!

Keepalived health check can't connect to 127.0.0.1

I've currently got a cluster of servers running CentOS 7 and Docker, and I want to use Keepalived to allocate a floating IP between them. I've configured Keepalived to run a check command on each node which just does curl --silent --fail localhost:80 to ensure an HTTP server is listening.
The web app is running using a Docker container bound to port 80 and --net=host on Docker 1.10.3. Firewalld is also completely disabled.
The problem I'm having is that the curl never succeeds. If I change the check command to echo '' or anything else which exits 0 (without any network interaction) it works fine, but for some reason the curl doesn't work. When I run it from a normal bash terminal it is fine, and echo $? prints a 0.
I'm not even sure how to debug this as Keepalived doesn't provide any documentation on the matter and doesn't seem to log anything in relation to errors coming from the vrrp script.
Any help or suggestions would be greatly appreciated.
It turns out I was using an ancient version of Keepalived. Compiling the latest version from source, rather than using the binary from the CentOS repos, fixed the issue.
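Since the check command's failures were not being logged anywhere, one generic way to see what a health check is doing is to wrap it so that its output and exit status go to a file. This is a debugging sketch only, not something the original poster used: run_logged and the log path in the usage note are hypothetical, and the fix reported above was upgrading Keepalived.

```shell
# run_logged LOGFILE COMMAND [ARGS...]
# Run COMMAND, append its output and exit status to LOGFILE,
# and return the same exit status so the caller still sees pass/fail.
run_logged() {
    log=$1
    shift
    "$@" >>"$log" 2>&1
    rc=$?
    printf '%s exit=%s cmd=%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$rc" "$*" >>"$log"
    return "$rc"
}
```

Saved in a small script, the check could then be invoked as run_logged /tmp/keepalived-check.log curl --silent --fail localhost:80, which keeps the original exit code while recording why curl failed.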

Unable to execute MPICH2 on multiple machines on ubuntu 12.04 (HYDU_sock_connect issue)

I am having difficulty running an MPI program on two machines. The OS is Ubuntu 12.04 and the MPI implementation is MPICH2.
ssh is working fine:
root@ubuntu:/home# ssh 192.168.1.9
root@gpuguy's password:
Welcome to Ubuntu 12.04.3 LTS (GNU/Linux 3.8.0-29-generic i686)
* Documentation: https://help.ubuntu.com/
131 packages can be updated.
67 updates are security updates.
Last login: Thu Oct 24 17:36:25 2013 from ubuntu.local
root@gpuguy:~#
But when I run my MPI programs it fails:
root@ubuntu:/home# mpiexec -f hosts.cfg -n 4 hello
root@192.168.1.9's password:
[proxy:0:0@gpuguy] HYDU_sock_connect (./utils/sock/sock.c:171): unable to get host address for ubuntu (1)
[proxy:0:0@gpuguy] main (./pm/pmiserv/pmip.c:209): unable to connect to server ubuntu at port 42104 (check for firewalls!)
I have already disabled the firewall on both machines, which is why I can ssh successfully. But how do I solve this issue?
My MPI code runs successfully on a single machine.
For MPICH (or any MPI implementation) to work, you need to have passwordless SSH set up. I should also mention that you really shouldn't have to be logged in as root to make this work. It's generally a very bad idea to be logged in as root all of the time.
In the /etc/hosts file, add the IP address of each server and its hostname.
You should do this on all the servers.
For example:
10.10.0.5 server1
10.10.0.6 server2
10.10.0.7 server3
Also check that the /etc/hosts file does not use a tab (\t) instead of a space to separate the IP address and the hostname.
This is wrong:
10.10.0.5 \t server1
This is correct:
10.10.0.5 server1
Be careful not to delete or modify existing lines in the /etc/hosts file; only add new lines at the end of the file.
Also, you do not need to disable the firewall to fix this issue.
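The "unable to get host address for ubuntu" message in the question comes from the proxy on gpuguy, so that node could not resolve the name ubuntu; each machine needs an entry for every hostname used in the job. A small sketch of that check follows. missing_hosts is a hypothetical helper, and it assumes the plain /etc/hosts layout of an IP address followed by names.

```shell
# missing_hosts HOSTS_FILE NAME...
# Print each NAME that has no entry in HOSTS_FILE (/etc/hosts format).
# Succeeds only when every name is present.
missing_hosts() {
    file=$1
    shift
    rc=0
    for name in "$@"; do
        if ! awk -v n="$name" '
                { sub(/#.*/, "") }            # ignore comments
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit !found }
            ' "$file"; then
            echo "$name"
            rc=1
        fi
    done
    return "$rc"
}
```

Running missing_hosts /etc/hosts ubuntu gpuguy on each machine lists the names that still need a line in that machine's /etc/hosts.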

Issue with ping and opkg update on Beaglebone black

I'm new to the BeagleBone Black. I'm using the default Angstrom distro, and I often have trouble with ping, opkg update and ssh.
The BeagleBone Black has 2 network interfaces, the virtual one (on USB) and the physical eth0.
I can connect with SSH only via the USB IP; on the other I get:
Write failed: Broken pipe
And I've seen a lot of problems during ping and during opkg update.
opkg sometimes stays indefinitely on this screen:
Downloading http://feeds.angstrom-distribution.org/feeds/v2012.12/ipk/eglibc/armv7a-vfp-neon/base/Packages.gz.
With no results.
And ping often can't resolve google.it
Has anyone had similar issues?
Thanks
I ran into a similar issue; this thread might prove helpful in fixing the opkg update problem. Most of the people in that thread do some variation of the following:
Boot your BBB and log in via SSH.
Edit /etc/resolv.conf to add Google's public DNS server:
# echo "nameserver 8.8.8.8" >> /etc/resolv.conf
Run
# route add default gw 192.168.7.1
Run opkg update and upgrade:
# opkg update
# opkg upgrade
Keep in mind that your changes to the /etc/resolv.conf file will be lost at reboot. I have yet to investigate why.
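Because the entry has to be re-added after each reboot, appending it by hand more than once in a session leaves duplicate lines. A minimal sketch of an idempotent version is below; add_nameserver is a hypothetical helper, not part of opkg or Angstrom.

```shell
# add_nameserver FILE IP
# Append "nameserver IP" to FILE unless that exact line is already present.
add_nameserver() {
    line="nameserver $2"
    grep -qxF "$line" "$1" 2>/dev/null || printf '%s\n' "$line" >>"$1"
}
```

It would be called as add_nameserver /etc/resolv.conf 8.8.8.8 and can safely be run repeatedly.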

Cannot connect to beaglebone.local

I need to know how to connect to a beaglebone (or beagleboard) with SSH when I plug it into a new network with an ethernet cable like this:
$ ssh root@beaglebone.local
So far I've only been able to access it like this, if I know the IP address:
$ ssh root@<ip_address>
But I don't always know the IP address of the board on new networks, so I'm hoping to access it with a name like beaglebone.local.
Right now when I try to do this I get this error:
"ssh: Could not resolve hostname beaglebone.local: nodename nor servname provided, or not known"
I checked the hostname and hosts files, and added "127.0.0.1 beaglebone" to the hosts file on the beaglebone, but I'm not sure what else I can do.
# cat /etc/hostname
beaglebone
# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
127.0.0.1 beaglebone
I had a similar issue running my beaglebone on Angstrom-Cloud9-IDE-GNOME-eglibc-ipk-v2012.05-beaglebone-2012.04.22.img.xz. In this distribution, "beaglebone.local" should appear on the network after the system boots.
About 50% of the time after reboot, "beaglebone.local" would not appear on the network (although the bone would be available by IP address). When this happened, "systemctl status avahi-daemon.service" showed that the avahi-daemon failed with "exit code 255". Interestingly, a subsequent "systemctl start avahi-daemon.service" would always be successful and "beaglebone.local" would appear on the network.
Also "journalctl | grep avahi" returned a single message stating something like "Daemon already running on PID NNN".
So, I "fixed" the problem by adding the line "ExecStartPre=/bin/rm -f /var/run/avahi-daemon/pid" to the [Service] section of /lib/systemd/system/avahi-daemon.service. With this addition, "beaglebone.local" now appears on the network 100% of reboots.
I say "fixed" (i.e., in quotes) because I have not been able to track down the root cause that is leaving around the stray avahi pid file(s) and thus don't have a true fix.
-- Frank
For 'beaglebone.local' to work, your host machine must recognize Zeroconf. The BeagleBone uses Avahi to tell other systems on the LAN that it is there and serving up applications and that it should be called 'beaglebone'. If there is more than one, the second one is generally called 'beaglebone-2.local'.
I hate answering my own questions. The following hack will work until a better way emerges:
This shell script (where xxx.xxx.xxx is the first three numbers in your computer's IP) will find your beaglebone or beagleboard (plugged into ethernet on a new network with DHCP) by looping through all the IP addresses on the subnet and attempting to log in to each as root. If it finds one, try your password. If that doesn't work, just hit enter until the loop starts again. If it doesn't find the board, then something else is probably wrong.
for ip in $(seq 1 254); do ssh root@xxx.xxx.xxx.$ip -o ConnectTimeout=5; [ $? -eq 0 ] && echo "xxx.xxx.xxx.$ip UP" || : ; done
UPDATE 1
Today I plugged in the beaglebone and saw Bonjour recognize that it joined the network. So I tried it and it worked. No idea why it suddenly decided to, but it did. Strange, but true.
I had this issue quite often with Mac OS X 10.7. But unlike for Frank Halasz, "systemctl status avahi-daemon.service" showed no failure. In fact the problem was on the Mac side. Restarting Bonjour with the following commands fixed the issue.
$ sudo launchctl unload /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist
$ sudo launchctl load -F /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist
