"Host key verification failed" error when running mpiexec within a cluster - mpi

I'm trying to build an MPI cluster by connecting two laptops and running MPI programs on them. I followed the steps described here (https://medium.com/mpi-cluster-setup/mpi-clusters-within-a-lan-77168e0191b1). I can ssh to the other node without a password. However, when I run mpiexec -n 2 -hosts manager,worker ./main I get the following error:
[proxy:0:1@gunavaran-HP-Pavilion-Notebook] HYDU_sock_connect (utils/sock/sock.c:113): unable to get host address for gunavaran-HP-ENVY-15-Notebook-PC
[proxy:0:1@gunavaran-HP-Pavilion-Notebook] main (pm/pmiserv/pmip.c:181): unable to connect to server gunavaran-HP-ENVY-15-Notebook-PC at port 43211 (check for firewalls!)
Host key verification failed.
This is my hosts file:
127.0.0.1 localhost
#127.0.1.1 gunavaran-HP-ENVY-15-Notebook-PC
#MPI SETUP
192.168.8.102 manager
192.168.8.108 worker
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

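Judging by the error, the worker could not resolve the manager's system hostname (gunavaran-HP-ENVY-15-Notebook-PC), which the hosts file does not map to a LAN address. I changed the hostnames to manager and worker using sudo hostnamectl set-hostname, so that they match the hosts file entries. It works fine now.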
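The check behind that fix can be sketched in a few lines of Python. This is only an illustration, not part of MPICH or Hydra: the helper names parse_hosts and missing_names are made up here, and the sketch assumes names are resolved from the hosts file alone (no DNS or mDNS).

```python
def parse_hosts(text):
    """Map each hostname or alias in hosts-file text to its IP address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name, ip)  # keep the first entry for a name
    return table


def missing_names(hosts_text, names):
    """Return the names that the hosts file cannot resolve."""
    table = parse_hosts(hosts_text)
    return [n for n in names if n not in table]


# Hosts file from the question (IPv6 lines omitted for brevity).
HOSTS = """\
127.0.0.1 localhost
#127.0.1.1 gunavaran-HP-ENVY-15-Notebook-PC
192.168.8.102 manager
192.168.8.108 worker
"""

print(missing_names(HOSTS, ["manager", "worker", "gunavaran-HP-ENVY-15-Notebook-PC"]))
# prints ['gunavaran-HP-ENVY-15-Notebook-PC']
```

The names manager and worker resolve, but the system hostname that appears in the error message does not, which is consistent with the "unable to get host address" line.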

Related

data is not seen from the ros robot when published from the workstation (laptop)?

I followed the tutorial at https://www.youtube.com/watch?v=YMG6D... to set up and troubleshoot the ROS network, and configured everything exactly as shown in the video. But when I publish rostopic data from the laptop to the robot, the robot does not receive the data, although rostopic list shows the topic name. I have also tried disabling the firewall, but it had no effect. What could be the solution?
OS in robot: ubiquityrobot images: ubuntu 16.04.
OS in laptop: ubuntu 16.04.
ROS distro : kinetic
PS:
I ran roswtf and learned that the two nodes are unable to establish a connection, but I don't know what is blocking the node from publishing data when it runs on the workstation.
However, data published on a topic from the robot is received on the workstation.
The hostname and IP are set as described in the YouTube link mentioned above.
Edit 1:
workstation configuration
.bashrc -> last lines
export ROS_MASTER_URI=http://ubiquityrobot.local:11311
ROS_HOSTNAME=$(hostname).local
#ROS_IP=0.0.0.0
/etc/hosts -> file
127.0.0.1 localhost
127.0.1.1 maisa-K53E
10.42.0.1 ubiquityrobot.local
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
robot side
.bashrc -> last line
ROS_IP=0.0.0.0
/etc/hosts -> lines
127.0.0.1 localhost
10.42.0.201 maisa-K53E
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
127.0.1.1 ubiquityrobot ubiquityrobot.local
roswtf output:
https://drive.google.com/file/d/1aSdPzWWtV0FwyZBTCkgjQu1knXXOJFIm/view?usp=sharing
Edit 2:
When publishing data on a topic from the robot, we can see both the topic and the data on the workstation, but not vice versa.
We have solved the problem. It was caused by the firewall. Even though we had disabled the firewall using sudo ufw disable, that did not work in our case. It seems we have to change the rules using iptables. Interestingly, this is observed on some Linux machines only. The following link helped:
ROS communication from PC to RaspberryPi
Edit:
I had not rebooted the computer after disabling the firewall. After a reboot it now works fine.
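A quick way to tell a firewall problem from a name-resolution problem is to try a plain TCP connection to the port in question. The following is a minimal sketch using only the Python standard library; the function name can_connect is hypothetical. Note that reaching the master port (11311 in the configuration above) only shows that the master is reachable: ROS 1 nodes exchange topic data directly with each other on ports they pick themselves, so a firewall can let rostopic list work while still blocking the data.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


# Example (hypothetical), run on the workstation:
#   can_connect("ubiquityrobot.local", 11311)
```

Running the same check in both directions (workstation to robot and robot to workstation) would show whether the block is one-sided, as it was in this case.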

Udhcpd read /etc/hosts like dns

I'm building a hotspot with udhcpd and nginx (Linux raspbian, 4.9.41-v7+, armv7l). It works very well, but I want users to be able to enter "home" instead of "192.168.2.1" in the browser to access my portal.
I set up the following configuration:
/etc/hosts
127.0.0.1 localhost
127.0.1.1 rpi
192.168.2.1 home
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
and /etc/udhcpd.conf
start 192.168.2.10
end 192.168.2.254
interface wlan0
opt dns 192.168.2.1 8.8.8.8 8.8.4.4
opt subnet 255.255.255.0
opt router 192.168.2.1
opt hostname rpi
but when I try to access "home/" or "rpi/" the following error appears:
Isn't possible to find "home" on DNS server.
ERR_NAME_NOT_RESOLVED
Client config after dhcp ack:
Connected to WiFi SSID: rpi
IP: 192.168.2.76
Any suggestions?
Grateful for any help.
udhcpd doesn't have a built-in DNS server. I just switched to dnsmasq and it worked!
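dnsmasq combines a DHCP server with a DNS forwarder that, by default, also answers from the local /etc/hosts, which is why the "192.168.2.1 home" entry starts working once it replaces udhcpd. As a rough sketch only, the udhcpd settings above might translate to dnsmasq options like this. The helper dnsmasq_conf is made up for illustration, the 12h lease time is an assumption, and the answer does not show the configuration that was actually used.

```python
def dnsmasq_conf(interface, start, end, netmask, router, names):
    """Render a small dnsmasq configuration as text.

    `names` maps a host name to the IP address dnsmasq should answer with.
    """
    lines = [
        f"interface={interface}",
        # Lease time of 12h is an assumed value, not taken from the question.
        f"dhcp-range={start},{end},{netmask},12h",
        f"dhcp-option=option:router,{router}",
        f"dhcp-option=option:dns-server,{router}",
    ]
    lines += [f"address=/{name}/{ip}" for name, ip in names.items()]
    return "\n".join(lines) + "\n"


print(dnsmasq_conf("wlan0", "192.168.2.10", "192.168.2.254",
                   "255.255.255.0", "192.168.2.1", {"home": "192.168.2.1"}))
```

The address= line is optional when the name is already listed in /etc/hosts on the Raspberry Pi; it is shown here only to keep the example self-contained.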

What should be the host file configuration for Cloudera Installation for 5 nodes?

I am trying to install a Cloudera cluster on 5 machines: 4 running Ubuntu 12.04 and 1 running Oracle Enterprise Linux 5.8.
I ran the Cloudera Manager Installer on the Oracle Enterprise Linux host, which should act as the name node (IP address 192.168.1.185); the other 4 Ubuntu hosts should act as data nodes.
I have completed all the prerequisites and configured the hosts files as follows:
For Ubuntu:
127.0.0.1 localhost
192.168.1.181 hduser1.example.co.in hduser1
192.168.1.182 hduser2.example.co.in hduser2
192.168.1.183 hduser3.example.co.in hduser3
192.168.1.184 hduser4.example.co.in hduser4
192.168.1.185 hduser5.example.co.in hduser5
#The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
For Oracle Enterprise Linux:
192.168.1.181 hduser1.example.co.in hduser1
192.168.1.182 hduser2.example.co.in hduser2
192.168.1.183 hduser3.example.co.in hduser3
192.168.1.184 hduser4.example.co.in hduser4
192.168.1.185 hduser5.example.co.in hduser5
127.0.0.1 hduser5.example.co.in hduser5 localhost.localdomain loca$
::1 localhost6.localdomain6 localhost6
I am not sure whether this configuration is correct, as I got the following errors related to reverse DNS:
The following failures were observed in checking hostnames. Showing first 1000 failures only...
DNS reverse lookup of IP 192.168.1.184 on host hduser1.example.co.in failed. Expected hduser4.example.co.in but got hduser4.local.
DNS reverse lookup of IP 192.168.1.182 on host hduser1.example.co.in failed. Expected hduser2.example.co.in but got hduser-desktop-3.local.
DNS reverse lookup of IP 192.168.1.183 on host hduser1.example.co.in failed. Expected hduser3.example.co.in but got hduser-desktop.local.
After long research I found that the hosts file configuration is correct. The problem was due to compatibility issues between Ubuntu and Oracle Enterprise Linux. After switching all the nodes to Ubuntu the issue was resolved.
I also edited resolv.conf on all hosts. The configuration was as follows:
domain example.co.in
search example.co.in localdomain
nameserver 192.x.x.x
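The reverse-lookup failures reported by the installer can be reproduced by hand on each host. Below is a minimal sketch with a hypothetical helper, reverse_mismatches, which compares the name returned for each IP with the expected FQDN. By default it uses socket.gethostbyaddr from the standard library; the lookup function can be swapped out, which is also how the example avoids depending on a real network.

```python
import socket


def reverse_mismatches(expected, lookup=socket.gethostbyaddr):
    """Return (ip, expected_fqdn, actual) for each IP whose reverse lookup disagrees.

    `expected` maps IP -> FQDN. `lookup` takes an IP and returns a tuple whose
    first element is the primary host name, like socket.gethostbyaddr does.
    """
    problems = []
    for ip, fqdn in expected.items():
        try:
            actual = lookup(ip)[0]
        except OSError:
            actual = None  # no reverse mapping at all
        if actual != fqdn:
            problems.append((ip, fqdn, actual))
    return problems


# Stand-in resolver that mimics one of the answers from the error report.
def fake_lookup(ip):
    names = {"192.168.1.184": "hduser4.local",
             "192.168.1.185": "hduser5.example.co.in"}
    return (names[ip], [], [ip])


expected = {"192.168.1.184": "hduser4.example.co.in",
            "192.168.1.185": "hduser5.example.co.in"}
print(reverse_mismatches(expected, fake_lookup))
# prints [('192.168.1.184', 'hduser4.example.co.in', 'hduser4.local')]
```

Calling reverse_mismatches(expected) without the second argument performs real lookups on the host where it runs, so it should be run on every node.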

Glassfish v3 clustering

I've tried to configure a cluster by following the GlassFish clustering tutorials (1, 2), but I'm still having trouble creating an instance in the cluster on a remote host.
I think it is best to show the commands I entered together with their output; that will probably explain more:
adam@adam-desktop:~/Pulpit/glassfish-3.1.1/bin$ ./asadmin
Use "exit" to exit and "help" for online help.
asadmin> setup-ssh adam-laptop
Successfully connected to adam@adam-laptop using keyfile /home/adam/.ssh/id_rsa
SSH public key authentication is already configured for adam@adam-laptop
Command setup-ssh executed successfully.
asadmin> install-node --installdir /home/adam/Pulpit/glassfish3 adam-laptop
Created installation zip /home/adam/Pulpit/glassfish-3.1.1/bin/glassfish8196347853130742869.zip
Successfully connected to adam@adam-laptop using keyfile /home/adam/.ssh/id_rsa
Copying /home/adam/Pulpit/glassfish-3.1.1/bin/glassfish8196347853130742869.zip (82498155 bytes) to adam-laptop:/home/adam/Pulpit/glassfish3
Installing glassfish8196347853130742869.zip into adam-laptop:/home/adam/Pulpit/glassfish3
Removing adam-laptop:/home/adam/Pulpit/glassfish3/glassfish8196347853130742869.zip
Fixing file permissions of all files under adam-laptop:/home/adam/Pulpit/glassfish3/bin
Command install-node executed successfully.
asadmin> start-domain domain1
Waiting for domain1 to start ........................
Successfully started the domain : domain1
domain Location: /home/adam/Pulpit/glassfish-3.1.1/glassfish/domains/domain1
Log File: /home/adam/Pulpit/glassfish-3.1.1/glassfish/domains/domain1/logs/server.log
Admin Port: 4848
Command start-domain executed successfully.
asadmin> enable-secure-admin
Command enable-secure-admin executed successfully.
asadmin> restart-domain domain1
Successfully restarted the domain
Command restart-domain executed successfully.
asadmin> create-cluster c1
Command create-cluster executed successfully.
asadmin> create-node-ssh --nodehost adam-laptop --installdir /home/adam/Pulpit/glassfish3 adam-laptop
Command create-node-ssh executed successfully.
asadmin> create-instance --node adam-laptop --cluster c1 i1
Successfully created instance i1 in the DAS configuration, but failed to create the instance files on node adam-laptop (adam-laptop).
Command failed on node adam-laptop (adam-laptop): Could not contact the DAS running at adam-desktop:4848. This could be because a firewall is blocking the connection back to the DAS or because the DAS host is known by a different name on the instance host adam-laptop. To change the hostname that the DAS uses to identify itself please update the DAS admin HTTP listener address.
Command _create-instance-filesystem failed.
To complete this operation run the following command locally on host adam-laptop from the GlassFish install location /home/adam/Pulpit/glassfish3:
asadmin --host adam-desktop --port 4848 create-local-instance --node adam-laptop i1
asadmin>
UPDATE
To confirm that a connection exists between adam-desktop and adam-laptop, here are the hosts file contents and the ping output:
adam@adam-desktop:~$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 adam-desktop
192.168.1.101 adam-laptop
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
adam@adam-desktop:~$ cat /etc/hostname
adam-desktop
adam@adam-desktop:~$ ping adam-laptop
PING adam-laptop (192.168.1.101) 56(84) bytes of data.
64 bytes from adam-laptop (192.168.1.101): icmp_req=1 ttl=64 time=0.786 ms
64 bytes from adam-laptop (192.168.1.101): icmp_req=2 ttl=64 time=0.694 ms
64 bytes from adam-laptop (192.168.1.101): icmp_req=3 ttl=64 time=0.687 ms
Any help?
After the domain is started, can you reach http://localhost:4848 or http://adam-desktop:4848 in your browser?
If not: on Linux, GlassFish requires the /etc/hosts file to be set up correctly, and this is where most of my problems like this come from. Also set up the appropriate network config. On Red Hat it is /etc/sysconfig/network and on Ubuntu it is /etc/hostname.
It seems that the error was caused by an entry in the /etc/hosts file:
127.0.0.1 localhost
127.0.1.1 adam-desktop
192.168.1.101 adam-laptop
After changing it to:
127.0.0.1 localhost
127.0.0.1 adam-desktop
192.168.1.101 adam-laptop
it works. I had to make the change on both machines, that is, on adam-desktop and adam-laptop.

Can't form mpi ring

I am having problems configuring and running MPI on my systems.
Here is what I tried:
1) I ran 'mpd &' on one machine and then I ran 'mpdtrace -l' on the same machine. I got this as output: "my-lappy_53430 (127.0.1.1)"
2) On another machine I ran 'mpd -h -p 53430 &' and got this error:
akshey-desktop_39993: conn error in connect_lhs: Connection timed out
akshey-desktop_39993 (connect_lhs 924): failed to connect to lhs at 10.2.28.137 52430
akshey-desktop_39993 (enter_ring 879): lhs connect failed
akshey-desktop_39993 (run 267): failed to enter ring
Can you please help with this issue? I tried to ping and ssh to the first machine (on which mpd is running) from the second machine, and both worked.
After this I executed 'mpdcheck' on the first machine and got this output:
* * * first ipaddr for this host (via my-lappy) is: 127.0.1.1
These are the contents of /etc/hosts of the first machine:
127.0.0.1 localhost
127.0.1.1 my-lappy
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
Then I ran 'mpdcheck -l' and got this as output:
**********
Your unqualified hostname resolves to 127.0.0.1, which is
the IP address reserved for localhost. This likely means that
you have a line similar to this one in your /etc/hosts file:
127.0.0.1 $uqhn
This should perhaps be changed to the following:
127.0.0.1 localhost.localdomain localhost
**********
Even after changing the first line of /etc/hosts to "127.0.0.1 localhost.localdomain localhost", I still got the same output from 'mpdcheck -l'.
Please note that I do not have access to the network's DNS server, and these machines have no entry in it. (I think this should not be a problem, because we can always use IP addresses instead of hostnames. Is that right?)
Two points:
You probably don't want to wire up an MPD ring by hand. Unless you are just doing some troubleshooting with a raw mpd command, you probably want to use mpdboot. Its usage is described in the User's Guide.
Since you are using MPD, you are using MPICH2 or an MPICH2 derivative. Starting with MPICH2 1.1 there is a new process manager available, called "hydra". I encourage you to update to the latest version of MPICH2 that you can and give hydra a try. It is much more robust than MPD and has many more features, including better performance.
From my personal and recent experience, I would say that
127.0.1.1 my-lappy
must be changed to your LAN address and must match your hostname. You can change the hostname with hostname <new hostname> and/or permanently by editing /etc/hostname.
Then on host1 you need to start mpd --echo and note the port on which mpd will listen:
mpd_port=N
then on host2 start:
mpd --host=host1 --port=N
It's very important that the /etc/hosts files on all the machines resolve the names to the correct IPs.
mpdtrace -l will confirm that the ring is set up correctly.
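The loopback mapping described in this answer is easy to spot mechanically. Here is a small sketch that scans hosts-file text for names, other than the usual localhost aliases, that point at a loopback address. The function name loopback_hostnames and the list of ignored aliases are choices made for this example only.

```python
import ipaddress


def loopback_hostnames(hosts_text, ignore=("localhost", "localhost.localdomain")):
    """Return (name, ip) pairs for host names mapped to a loopback address."""
    flagged = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # strip comments, split columns
        if len(fields) < 2:
            continue
        ip, names = fields[0], fields[1:]
        try:
            is_loopback = ipaddress.ip_address(ip).is_loopback
        except ValueError:
            continue  # first column is not a valid IP address
        if is_loopback:
            flagged += [(n, ip) for n in names
                        if n not in ignore and not n.startswith("ip6-")]
    return flagged


# Hosts file of the first machine, as posted in the question (shortened).
HOSTS = """\
127.0.0.1 localhost
127.0.1.1 my-lappy
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
"""

print(loopback_hostnames(HOSTS))
# prints [('my-lappy', '127.0.1.1')]
```

A name flagged this way is one that other machines cannot use to reach the host, which matches the 127.0.1.1 address that mpdtrace -l and mpdcheck reported.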
Check for a firewall on your systems that might be blocking the default ports. To test whether that is the problem, turn off the firewall by disabling ipchains and iptables.
In addition, make sure the hostnames/IP addresses are correct and can be successfully resolved.
