Can't establish connection over second NIC (two hops) - networking

We are having trouble with network routing configuration on Ubuntu Xenial.
We have many servers running both Debian 8.4 (Jessie) and Ubuntu 16.04.2 (Xenial)
with the exact same networking setup (or at least as far as we can see).
They all have two NICs attached to two VLANs (say "A" and "B"), both reachable
from other VLANs, for example from VLAN "C".
Both /etc/network/interfaces files are of the form:
NOTE: I faked names and IPs for the sake of readability.
# VLAN A
auto eth0
iface eth0 inet static
address 192.168.111.xxx
netmask 255.255.255.0
broadcast 192.168.111.255
network 192.168.111.0
gateway 192.168.111.254
dns-nameservers 192.168.111.25 192.168.111.26
# VLAN B
auto eth1
iface eth1 inet static
address 192.168.222.xxx
netmask 255.255.255.0
broadcast 192.168.222.255
network 192.168.222.0
gateway 192.168.222.254 # <-- (Commented out in Ubuntu machine)
dns-nameservers 192.168.111.25 192.168.111.26
...say xxx is 100 for the Debian machine and 200 for the Ubuntu machine, and I'm
trying to ping from 192.168.1.10 in VLAN "C" to the following addresses:
192.168.111.100: Works fine.
192.168.222.100: Works fine.
192.168.111.200: Works fine.
192.168.222.200: NO Answer!!
The "B" vlan is used mostly for backup and other "background" traffic to
avoid saturation problems in vlan "A".
I know that having two network paths to reach the same machine is not a usual
setup, and I must say that only being able to connect through one of them from
other networks is not a big problem nowadays. But what puzzles me is: why can I
reach the Debian machines but not the Ubuntu ones?
On the other hand, if it were working on both platforms, we could even consider
closing some services (such as SSH and backend interfaces) on NIC "A" to improve
security (our firewall only allows access to VLAN "B" from our IT staff VLAN).
Of course, as noted in the previous interfaces snippet, the second gateway line
is commented out on the Ubuntu machines, but that is because networking
initialization fails on those machines otherwise. That is, in fact, what we are
trying to solve.
But the routing tables of both machines are almost identical. The only
difference I could see was the onlink flag on the Ubuntu machine:
myUser@debianMachine:~$ sudo ip route
default via 192.168.111.254 dev eth0
192.168.111.0/24 dev eth0 proto kernel scope link src 192.168.111.100
192.168.222.0/24 dev eth1 proto kernel scope link src 192.168.222.100
myUser@ubuntuMachine:~$ sudo ip route
default via 192.168.111.254 dev eth0 onlink
192.168.111.0/24 dev eth0 proto kernel scope link src 192.168.111.200
192.168.222.0/24 dev eth1 proto kernel scope link src 192.168.222.200
...but I was able to remove it with the following command:
myUser@ubuntuMachine:~$ sudo ip route replace default via 192.168.111.254 dev eth0
myUser@ubuntuMachine:~$ sudo ip route
default via 192.168.111.254 dev eth0
192.168.111.0/24 dev eth0 proto kernel scope link src 192.168.111.200
192.168.222.0/24 dev eth1 proto kernel scope link src 192.168.222.200
And it didn't fix the problem.
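A quick way to see which path the kernel picks for replies to VLAN C is ip route get (a standard iproute2 diagnostic; shown here as a sketch, the output was not captured at the time):
myUser@ubuntuMachine:~$ ip route get 192.168.1.10 from 192.168.222.200
With the routing table above this should report via 192.168.111.254 dev eth0, i.e. replies sourced from the VLAN B address leave through the VLAN A gateway, which is exactly the asymmetry discussed below.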
After that, I also tried to uncomment the gateway line of "VLAN B" which, as I
said, was commented out in the /etc/network/interfaces file, and tried to
restart networking, but this is what happened:
myUser@ubuntuMachine:~$ sudo /etc/init.d/networking restart
[....] Restarting networking (via systemctl): networking.serviceJob for networking.service failed because the control process exited with error code. See "systemctl status networking.service" and "journalctl -xe" for details.
failed!
...and the onlink flag came back again.
As a note, if I comment that line out again and issue a new
/etc/init.d/networking restart, the output stays the same until the machine is
rebooted (and networking, despite the VLAN B default gateway issue, keeps
working as usual in the meantime).
Here is the output of the suggested commands:
myUser@ubuntuMachine:~$ sudo systemctl status networking.service
● networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; vendor preset: enabled)
Drop-In: /run/systemd/generator/networking.service.d
└─50-insserv.conf-$network.conf
Active: failed (Result: exit-code) since jue 2017-12-21 14:55:29 CET; 42s ago
Docs: man:interfaces(5)
Process: 8552 ExecStop=/sbin/ifdown -a --read-environment --exclude=lo (code=exited, status=0/SUCCESS)
Process: 8940 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
Process: 8934 ExecStartPre=/bin/sh -c [ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-envi
Main PID: 8940 (code=exited, status=1/FAILURE)
dic 21 14:55:29 ubuntuMachine systemd[1]: Stopped Raise network interfaces.
dic 21 14:55:29 ubuntuMachine systemd[1]: Starting Raise network interfaces...
dic 21 14:55:29 ubuntuMachine ifup[8940]: RTNETLINK answers: File exists
dic 21 14:55:29 ubuntuMachine ifup[8940]: Failed to bring up eth1.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILUR
dic 21 14:55:29 ubuntuMachine systemd[1]: Failed to start Raise network interfaces.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Unit entered failed state.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Failed with result 'exit-code'.
...and the meaningful part of sudo journalctl -xe:
dic 21 14:55:29 ubuntuMachine sudo[8922]: myUser : TTY=pts/0 ; PWD=/home/myUser ; USER=root ; COMMAND=/etc/init.d/networking restart
dic 21 14:55:29 ubuntuMachine sudo[8922]: pam_unix(sudo:session): session opened for user root by myUser(uid=0)
dic 21 14:55:29 ubuntuMachine systemd[1]: Stopped Raise network interfaces.
-- Subject: Unit networking.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit networking.service has finished shutting down.
dic 21 14:55:29 ubuntuMachine systemd[1]: Starting Raise network interfaces...
-- Subject: Unit networking.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit networking.service has begun starting up.
dic 21 14:55:29 ubuntuMachine ifup[8940]: RTNETLINK answers: File exists
dic 21 14:55:29 ubuntuMachine ifup[8940]: Failed to bring up eth1.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
dic 21 14:55:29 ubuntuMachine systemd[1]: Failed to start Raise network interfaces.
-- Subject: Unit networking.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit networking.service has failed.
--
-- The result is failed.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Unit entered failed state.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Failed with result 'exit-code'.
dic 21 14:55:29 ubuntuMachine sudo[8922]: pam_unix(sudo:session): session closed for user root
I googled a lot and was able to find some related information, but none of it
fully answered my question:
An explanation of the "onlink" flag that seemed to point to the possibility
that the "onlink" flag was responsible for a "wrong return routing", in the
sense that it «tells the kernel that it does not have to check if the gateway
is reachable directly by the current machine», so (I figured) the kernel might
think it could (or should) route the replies to incoming connections from
VLAN C through the default gateway instead of through the same NIC on which
the connection arrived.
But, as I said, removing the "onlink" flag didn't seem to change
anything.
This Unix StackExchange answer seems to solve the problem (I haven't tested it
yet) by using multiple routing tables and rules (to tell the kernel which
table to use). But it doesn't explain why the Debian machines work fine (I
checked the /etc/iproute2/rt_tables files of both machines and they are
identical too):
myUser@bothMachines:~$ sudo cat /etc/iproute2/rt_tables
#
# reserved values
#
255 local
254 main
253 default
0 unspec
#
# local
#
#1 inr.ruhep
So my final hypothesis is that it could just be an implementation difference
between kernel versions and, given that the Ubuntu one is much more recent,
this could be the correct behaviour, so on modern kernels I need to use two
different routing tables (but I'm not sure, and I don't know why...).
myUser@debianMachine:~$ sudo uname -a
Linux debianMachine 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux
myUser@ubuntuMachine:~$ sudo uname -a
Linux ubuntuMachine 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
And hence the question is:
Are we doing something wrong on the Ubuntu machines (or is there some bug in them)? Or, conversely, is this the correct behaviour, meaning we are forced to set up a more complex routing scheme (either with per-VLAN routes or by using two routing tables) to make two default gateways work again?
EDIT:
Now I tried to add a static route to fix the problem:
myUser@ubuntuMachine:~$ sudo ip route add 192.168.1.0/24 via 192.168.222.254 dev eth1
...but that froze my SSH connection (through NIC A), even though I could then connect through NIC B (at 192.168.222.200).
Having both routes at the same time doesn't seem to be possible:
myUser@ubuntuMachine:~$ sudo ip route add 192.168.1.0/24 via 192.168.111.254 dev eth0
myUser@ubuntuMachine:~$ sudo ip route add 192.168.1.0/24 via 192.168.222.254 dev eth1
RTNETLINK answers: File exists
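(Side note: when testing routes like this over SSH, a time-boxed change helps avoid lockouts. A minimal sketch of the idea, using the same route as above:
sudo sh -c 'ip route add 192.168.1.0/24 via 192.168.222.254 dev eth1; sleep 120; ip route del 192.168.1.0/24 via 192.168.222.254 dev eth1' &
If the session survives, make the change permanent and kill the background job; if it freezes, the route deletes itself after two minutes.)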
EDIT 2:
I finally found the Linux Advanced Routing & Traffic Control HOWTO, which seems to be more accurate than all the other documentation I found; specifically, in its Chapter 4. Rules - routing policy database I see the following text:
If you want to use this feature, make sure that your kernel is
compiled with the "IP: advanced router" and "IP: policy routing"
features
...so I think everything points to my previous hypothesis of a kernel implementation difference being right, and that the difference is, concretely, whether those two features are compiled in.
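For what it's worth, whether those options were compiled in can be checked against the packaged kernel config; a quick sketch (CONFIG_IP_ADVANCED_ROUTER and CONFIG_IP_MULTIPLE_TABLES are the standard names behind "IP: advanced router" and "IP: policy routing"):
grep -E 'CONFIG_IP_ADVANCED_ROUTER|CONFIG_IP_MULTIPLE_TABLES' /boot/config-$(uname -r)
Both should show =y for policy routing to be usable.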

Not an authoritative answer, but here is my first working attempt (applying what I managed to understand):
sudo ip route add 192.168.1.0/24 via 192.168.222.254 from 192.168.222.200 dev eth1 table 253
sudo ip rule add from 192.168.222.200 table 253
Update: the from and dev arguments in the ip route command aren't required (it works perfectly well without them).
...after issuing the first command I still couldn't connect, but after issuing the second one I could.
The logic behind this comes from this text I found in this document:
Linux-2.x can pack routes into several routing tables identified by a number in the range from 1 to 255 or by name from the file /etc/iproute2/rt_tables By default all normal routes are inserted into the main table (ID 254) and the kernel only uses this table when calculating routes.
Actually, one other table always exists, which is invisible but even more important. It is the local table (ID 255). This table consists of routes for local and broadcast addresses. The kernel maintains this table automatically and the administrator usually need not modify it or even look at it.
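To inspect what ends up in which table, the standard read-only commands are handy (a sketch; table 253 is the one used by the commands above):
sudo ip rule show                 # which rules select which table, in priority order
sudo ip route show table main     # the normal table
sudo ip route show table local    # kernel-maintained local/broadcast routes
sudo ip route show table 253      # the table populated above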
In fact, I finally ended up using another routing table, identified by its id (253), instead of what I now understand is just an alias (defined in the /etc/iproute2/rt_tables file).
...and checking that file again, I now see that there was already an alias ("default") defined for that routing table (next to the "main" one, which is indeed 254, as the text fragment I pasted above says).
What I don't know yet is the logic behind this naming (the "default" name for routing table 253, I mean) and whether, for any reason, it is better to use lower routing table numbers (1, 2, 3...) like this solution (already mentioned in the question) does.
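For readability, a custom alias can also be declared in /etc/iproute2/rt_tables and then used by name; a small sketch (the name "vlanb" is just an example I made up):
echo "1 vlanb" | sudo tee -a /etc/iproute2/rt_tables
sudo ip route add default via 192.168.222.254 table vlanb
sudo ip rule add from 192.168.222.200 table vlanb
The alias is purely cosmetic: ip resolves it to the number, and the kernel only ever sees the number.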
But, for the sake of simplicity, if we aren't going to build complex routing policies and just want to fix this connectivity issue, I guess something like this could be a good solution (not yet tested):
gateway 192.168.222.254 table 253
post-up ip rule add from 192.168.222.200 table 253
I still need to test whether I need an additional via 192.168.222.254 in the gateway line, or whether that won't work at all and I need to add it with another post-up command instead.
I will update this answer with the results.
Edit 1: The same works with default routes:
sudo ip route add default from 192.168.222.200 via 192.168.222.254 table 253
sudo ip rule add from 192.168.222.200 table 253
Edit 2: First (now fully¹) working approach
After playing for a while with a testing machine, I think the best solution is to add the following lines to the second NIC's configuration in the /etc/network/interfaces file:
gateway 192.168.222.254 table 1
post-up ip rule add from 192.168.222.200 table 1
pre-down ip rule del from 192.168.222.200 table 1
post-up ip route add 192.168.222.0/24 dev eth1 src 192.168.222.200 table 1
Comments:
Adding table 1 to the gateway keyword worked well, so an additional (less readable) post-up command to add that default route was not necessary.
...in fact, using a specific table (other than main) for the first NIC, together with a rule similar to the one used for the second NIC, would be a bad idea, because that rule would only apply when 192.168.111.200 is used as the source address, so there would be no "default default gateway". Leaving the first NIC's configuration in the main routing table makes all ("locally generated") outgoing connections to remote LANs go through our first default gateway by default.
The first post-up command adds a rule saying that packets with that NIC's source address should be routed using table 1 (otherwise our new default gateway wouldn't be used).
The pre-down command removes that rule. It is not mandatory, but without it every networking service restart duplicates the rule.
I also tried using dev eth1 instead of from 192.168.222.200 (to avoid duplicating the address), but it didn't work. I guess the NIC to use for the "response" packets is "not yet decided" at that point.
I used table 1 for eth1 (our second NIC); I could use table 2 for an eventual third one, and so on. It wasn't necessary to specify any table/rule for the first NIC because it goes into the main table (not "default": see the note below).
Finally(¹), the second post-up command makes everything work because (as I now realize) only one routing table (the first matching one) is used, so the directly connected network route (automatically created when the interface is brought up) doesn't apply, since it was created in table main.
I still don't know if there is a way to force it to be created directly in table 1. (A consolidated sketch of the whole eth1 stanza follows right after these comments.)
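Putting the pieces together, the whole eth1 stanza would look roughly like this (a sketch assembled from the lines above plus the original snippet, using the faked addresses from the question):
# VLAN B
auto eth1
iface eth1 inet static
address 192.168.222.200
netmask 255.255.255.0
broadcast 192.168.222.255
network 192.168.222.0
dns-nameservers 192.168.111.25 192.168.111.26
gateway 192.168.222.254 table 1
post-up ip rule add from 192.168.222.200 table 1
pre-down ip rule del from 192.168.222.200 table 1
post-up ip route add 192.168.222.0/24 dev eth1 src 192.168.222.200 table 1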
NOTE: With the command sudo ip rule list we can see the current routing rules:
0: from all lookup local
32765: from 192.168.222.200 lookup 1
32766: from all lookup main
32767: from all lookup default
As I understand it, they are added decreasingly from 32767 down to 0 and tried
in increasing order until one matches. The last two and the "0" one were already
defined by default. The 32766 and 32767 ones exist because of the logic I
previously cited from this document, but that document says that rules start
from "1", so I guess "0" must also be some predefined "default starting point".
Edit 3:
As I said in Edit 2 (of the question), I found this Linux Advanced Routing & Traffic Control HOWTO, which helped me a lot in clarifying things.
Specifically, the Routing for multiple uplinks/providers chapter was very useful for understanding setups with "network loops" (even though in our case we aren't acting as a router to the Internet).

Related

Ubuntu (Oracle VM) - Mounted Samba shares hang indefinitely

I have a VM instance on Oracle Cloud (Ubuntu 22.04) set up with ZeroTier to act as a web server for some services that should work with my local Synology NAS.
For some of those services I also need to mount three SMB shares from my NAS over the ZeroTier tunnel, but I can't make it work.
I have used mount and mount.cifs plenty of times, with automounting too, but this time it acts very strangely:
running the mount command seems to succeed from the console, but /var/log/syslog reads
CIFS: VFS: \\XXX.XXX.XXX.XXX has not responded in 180 seconds.
Reconnecting...
if I try to access one of the shares (with ls, lsof, cd or any other command), it succeeds for only one of the shares (always the same one), and only the first time a command is given:
$ ls /temp
folder1 folder2 folder3
any other command after that just "hangs" as if the system is working on something, but it stays like that indefinitely most of the time:
$ ls /temp
█
Just a few times it spits out this error
lsof: WARNING: can't stat() cifs file system /temp
Output information may be incomplete.
ls 1475 ubuntu 3r DIR 0,44 0 123207681 /temp
findmnt reads:
└─/temp //XXX.XXX.XXX.XXX/Downloads cifs rw,relatime,vers=2.0,cache=strict, username=[redacted],uid=1005,noforceuid,gid=0,noforcegid,addr=XXX.XXX.XXX.XXX,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,rsize=65536,wsize=65536,bsize=1048576,echo_interval=60,actimeo=1
for the remaining two "mounted" shares, neither responds to any command, not even the very first one; they just hang like the share that at least lets me browse it once;
umount and umount -l take at least 2-3 minutes to successfully unmount the shares.
Same behavior when using smbclient and also with NFS shares from the same NAS.
What I have already tried:
update kernel and all packages;
remove, purge and reinstall cifs-utils, smbclient and so on...
tried mounting the same shares in another client / node within the ZeroTier network and it works just fine; also browsing from Windows and Android file manager apps with and without ZeroTier works flawlessly;
tried all SMB versions including SMBv3 and SMBv1 (CIFS);
tried different browsing or mounting methods / commands including mount, mount.cifs, autofs, smbclient;
tried to debug what happens behind the console, but didn't find anything in the logs, htop or anywhere else that seems related to this. During the "hanging" sessions there is no spike in CPU, RAM or network usage on either the Oracle VM or the Synology NAS;
checked, reset and reconfigured all permissions on my NAS for shares, folders and files recursively and reconfigured users groups permissions.
What I haven't tried yet (I'll try as soon as possible):
reproduce this on another Oracle VM configured the same as the faulty one and another with a different base image (maybe Oracle Linux?);
It seems to me that the mount.cifs process doesn't really succeed in mounting the share correctly, as it doesn't show up as such anywhere. It also seems to be an issue not related to folder/file permissions, but rather something related to networking.
A note on something that may or may not be related: ZeroTier on my Synology NAS does not seem to work with IPv4 only - it remains OFFLINE. The node goes ONLINE only when IPv6 is enabled, but I must say that this is the only node in my ZT network that shows an IPv6 as its public IP in the ZT web GUI - the other nodes show IPv4 public addresses.
If anyone has any clue on this, I'll be happy to support and reproduce any advice. Thank you!
I'm using Tailscale, but I presume it will work the same.
You need to add the port 445 to /etc/iptables/rules.v4 just under the SSH setup like below:
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 445 -j ACCEPT (like this)
Then you need to edit the interfaces in /etc/samba/smb.conf to:
interfaces = lo tailscale0 100.0.0.0/24
Obviously, my interface is tailscale0, but yours will be different. Use ip link show to find yours. You may also need to change the IP range to suit ZeroTier's, such as 100.0.0.0/24, which is what Tailscale uses.
Then reboot!
I couldn't get it working without doing this.
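For what it's worth, a full reboot shouldn't be strictly necessary; reloading the firewall rules and restarting Samba ought to be enough. A sketch, assuming an iptables-persistent style /etc/iptables/rules.v4 and the usual smbd service name:
sudo iptables-restore < /etc/iptables/rules.v4
sudo systemctl restart smbd
sudo ss -tlnp | grep ':445'    # confirm Samba is listening on port 445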

Understanding Docker container resource usage

I have a server running Ubuntu 16.04 with Docker 17.03.0-ce, running an Nginx container. That server also has ConfigServer Security & Firewall installed. Shortly after starting the Nginx container I start receiving emails about "Excessive resource usage" with the following details:
Time: Fri Mar 24 00:06:02 2017 -0400
Account: systemd-timesync
Resource: Process Time
Exceeded: 1820 > 1800 (seconds)
Executable: /usr/sbin/nginx
Command Line: nginx: worker process
PID: 2302 (Parent PID:2077)
Killed: No
I fully understand that I can add exe:/usr/sbin/nginx to csf.pignore to stop these email alerts but I would like to understand a few things first.
Why is the "systemd-timesync" account being reported? That does not seem to have anything to do with Docker.
Why does the host machine seem to be reporting the excessive resource usage (the extended process time) when that is something running in the container?
Why do the other Docker containers, which are not running Nginx, not result in excessive resource usage emails?
I'm sure there are other questions but basically, why is this being reported the way it is being reported?
I can at least answer the first two questions:
Unlike real VMs, Docker containers are simply a collection of processes running under the host system's kernel. They just have a different view of certain system resources, including their own file hierarchy, their own PID namespace and their own /etc/passwd file. As a result, they will still show up if you run ps aux on the host machine.
The nginx container's /etc/passwd includes a user 'nginx' with UID 104 that runs the nginx worker process. However, in the host's /etc/passwd, UID 104 might belong to a completely different user, such as systemd-timesync.
As a result, if you run ps aux | grep nginx in the container, you might see
nginx 7 0.0 0.0 32152 2816 ? S 11:20 0:00 nginx: worker process
while on the host, you see
systemd-timesync 22004 0.0 0.0 32152 2816 ? S 13:20 0:00 nginx: worker process
even though both are the same process (also note the different PID namespaces; in containers, PIDs are counted from 1 again).
As a result, container processes will still be subject to ConfigServer's resource monitoring, but they might show up with random, or even non-existent user accounts.
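The mapping is easy to see directly; a quick sketch (the container name web-nginx is made up, and it assumes getent is available in the image):
docker top web-nginx                        # host-side view: user names as the host resolves them
docker exec web-nginx getent passwd 104     # who UID 104 is inside the container
getent passwd 104                           # who UID 104 is on the host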
As to why nginx triggers the emails and other containers don't, I can only assume that nginx is the only one of your containers that crosses ConfigServer's resource thresholds.

How to access docker container from another machine on local network

I'm using Docker for Windows (not Docker Toolbox, which uses a VM), but I cannot see my container from another machine on the local network. On my host everything works perfectly; however, I want other people to be able to use my container.
I posted the same question in the Docker Forum, but no answer showed up. I have also been searching here, but the solutions I found are about setting up the bridge option in the virtual machine and, as I said before, I am using Docker for Windows, which doesn't use a virtual machine.
Docker version Command
Client:
Version: 1.12.0
API version: 1.24
Go version: go1.6.3
Git commit: 8eab29e
Built: Thu Jul 28 21:15:28 2016
OS/Arch: windows/amd64
Server:
Version: 1.12.0
API version: 1.24
Go version: go1.6.3
Git commit: 8eab29e
Built: Thu Jul 28 21:15:28 2016
OS/Arch: linux/amd64
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
789d7bf48025 gogs/gogs "docker/start.sh /bin" 5 days ago Up 42 minutes 0.0.0.0:10022->22/tcp, 0.0.0.0:5656->3000/tcp gogs
7fa7978996b8 mysql:5.7.14 "docker-entrypoint.sh" 5 days ago Up 56 minutes 0.0.0.0:8989->3306/tcp mysql
The container I want to use is gogs, which is listening on port 5656.
When I use localhost:5656 and 127.0.0.1:5656 it works properly, but when I use my local network IP (192.168.0.127) from another machine, the container is unreachable.
Thanks in advance.
Solution:
When I installed Docker for Windows, it created a network called vEthernet (DockerNAT) (usually with the IP 10.0.75.1).
My local machine had a network called Local Area Connection with the IP 192.168.0.172 (the IP I was trying to reach from other PCs).
So my local machine had two network connections. I went to Control Panel > Network and Sharing Center > Change Adapter Settings, selected the two networks, right-clicked and selected Add to bridge. That created a third network called Ethernet.
At this point I didn't know the IP of the Ethernet network, so I ran ipconfig, which showed me the IP 192.168.0.17 (the settings of Local Area Connection and vEthernet (DockerNAT) disappeared and the IPs 10.0.75.1 and 192.168.0.172 stopped working).
With this new IP (192.168.0.17) I tried from another machine on the network and I could finally access the container (192.168.0.17:5656).
In Hyper-V settings, putting the "Docker NAT" network in "external" mode worked for me (I can access my container on my local network with my host's IP).

Cannot connect to beaglebone.local

I need to know how to connect to a beaglebone (or beagleboard) with SSH when I plug it into a new network with an ethernet cable like this:
$ ssh root@beaglebone.local
So far I've only been able to access it like this, if I know the IP address:
$ ssh root@<ip_address>
But I don't always know the IP address of the board on new networks, so I'm hoping to access it with a name like beaglebone.local.
Right now when I try to do this I get this error:
"ssh: Could not resolve hostname beaglebone.local: nodename nor servname provided, or not known"
I checked the hostname and hosts files, and added "127.0.0.1 beaglebone" to the hosts file on the beaglebone, but I'm not sure what else I can do:
# cat /etc/hostname
beaglebone
# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
127.0.0.1 beaglebone
I had a similar issue running my beaglebone on Angstrom-Cloud9-IDE-GNOME-eglibc-ipk-v2012.05-beaglebone-2012.04.22.img.xz. In this distribution, "beaglebone.local" should appear on the network after the system boots.
About 50% of the time after a reboot, "beaglebone.local" would not appear on the network (although the bone would be reachable by IP address). When this happened, "systemctl status avahi-daemon.service" showed that the avahi-daemon had failed with "exit code 255". Interestingly, a subsequent "systemctl start avahi-daemon.service" would always succeed and "beaglebone.local" would appear on the network.
Also, "journalctl | grep avahi" returned a single message stating something like "Daemon already running on PID NNN".
So, I "fixed" the problem by adding the line "ExecStartPre=/bin/rm -f /var/run/avahi-daemon/pid" to the [Service] section of /lib/systemd/system/avahi-daemon.service. With this addition, "beaglebone.local" now appears on the network 100% of reboots.
I say "fixed" (i.e., in quotes) because I have not been able to track down the root cause that is leaving around the stray avahi pid file(s) and thus don't have a true fix.
-- Frank
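For reference, the change described above boils down to one extra line in the [Service] section of the unit (a sketch; on newer systemd versions a drop-in created with systemctl edit avahi-daemon.service is the cleaner place for it):
[Service]
ExecStartPre=/bin/rm -f /var/run/avahi-daemon/pid
...followed by systemctl daemon-reload (or a reboot) so systemd picks up the edited unit.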
For 'beaglebone.local' to work, your host machine must recognize Zeroconf. The BeagleBone uses Avahi to tell other systems on the LAN that it is there and serving up applications and that it should be called a 'beaglebone'. If there are more than one, the second one is generally called 'beaglebone-2.local'.
I hate answering my own questions. The following hack will work until a better way emerges:
This shell one-liner (where xxx.xxx.xxx is the first three octets of your computer's IP) will find your beaglebone or beagleboard (plugged into Ethernet on a new network with DHCP) by looping through all the IP addresses on the subnet and attempting to log in to each as root. If it finds one, try your password. If it doesn't work, just hit Enter until the loop continues. If it doesn't find the board, something else is probably wrong.
for ip in $(seq 1 254); do ssh root@xxx.xxx.xxx.$ip -o ConnectTimeout=5; [ $? -eq 0 ] && echo "xxx.xxx.xxx.$ip UP" || : ; done
UPDATE 1
Today I plugged in the beaglebone and saw Bonjour recognize that it had joined the network. So I tried it and it worked. No idea why it suddenly decided to work, but it did. Strange, but true.
I had this issue quite often with Mac OS X 10.7. But unlike Frank Halasz's case, "systemctl status avahi-daemon.service" showed no failure. In fact the problem was on the Mac side. Restarting Bonjour with the following commands fixed the issue.
$ sudo launchctl unload /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist
$ sudo launchctl load -F /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist

Binding external IP address to Rabbit MQ server

I have box A, which has a consumer on it that listens on a RabbitMQ server.
I have box B, which will publish a message to the listener.
As long as all of this is on box A and I start the RabbitMQ server with defaults, it works fine.
The defaults are host=127.0.0.1 on port 5672, but
when I telnet box.a.ip.addy 5672 from box B I get:
Trying box.a.ip.addy...
telnet: connect to address box.a.ip.addy: No route to host
telnet: Unable to connect to remote host: No route to host
telnet on port 22 is fine, I can ssh into Box A from Box B
So I assume I need to change the ip that the RabbitMQ server uses
I found this: http://www.rabbitmq.com/configure.html and I now have a config file in the location the documentation said to use, with the name rabbitmq.config and it contains:
[
{rabbit, [{tcp_listeners, {"box.a.ip.addy", 5672}}]}
].
So I stopped the server and started the RabbitMQ server again. It failed. Here are the errors from the error logs. It's a little over my head (in fact, most of this is).
=ERROR REPORT==== 23-Aug-2011::14:49:36 ===
FAILED
Reason: {{case_clause,{{"box.a.ip.addy",5672}}},
[{rabbit_networking,'-boot_tcp/0-lc$^0/1-0-',1},
{rabbit_networking,boot_tcp,0},
{rabbit_networking,boot,0},
{rabbit,'-run_boot_step/1-lc$^1/1-1-',1},
{rabbit,run_boot_step,1},
{rabbit,'-start/2-lc$^0/1-0-',1},
{rabbit,start,2},
{application_master,start_it_old,4}]}
=INFO REPORT==== 23-Aug-2011::14:49:37 ===
application: rabbit
exited: {bad_return,{{rabbit,start,[normal,[]]},
{'EXIT',{rabbit,failure_during_boot}}}}
type: permanent
and here is some more from the start up log:
Erlang has closed
Error: {node_start_failed,normal}
^M
Crash dump was written to: erl_crash.dump^M
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot}}}}})^M
Please help
Did you try adding:
RABBITMQ_NODE_IP_ADDRESS=box.a.ip.addy
to the /etc/rabbitmq/rabbitmq.conf file?
Per http://www.rabbitmq.com/configure.html#customise-general-unix-environment
Also, per that documentation, the default is to bind to all interfaces. Perhaps there is a configuration setting or an environment variable already set on your system that restricts the server to localhost, overriding anything else you do.
UPDATE: After reading again, I realize that the telnet should have returned "Connection refused", not "No route to host". I would also check whether you are having a firewall-related issue.
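As a side note, the case_clause error in the startup log suggests that tcp_listeners must be a list of listeners rather than a bare tuple; a sketch of the classic rabbitmq.config form, binding to all interfaces (adjust the address as needed):
[
{rabbit, [{tcp_listeners, [{"0.0.0.0", 5672}]}]}
].
If you go the RABBITMQ_NODE_IP_ADDRESS environment variable route instead, the config file can be left out entirely.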
You need to open up the TCP port on your firewall.
Using Linux, find the iptables config file:
eric@dev ~$ find / -name "iptables" 2>/dev/null
/etc/sysconfig/iptables
Edit the file:
sudo vi /etc/sysconfig/iptables
Fix the file by adding a port:
# Generated by iptables-save v1.4.7 on Thu Jan 16 16:43:13 2014
*filter
-A INPUT -p tcp -m tcp --dport 15672 -j ACCEPT
COMMIT
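After saving the file, the rules need to be reloaded to take effect; a sketch for a /etc/sysconfig/iptables style setup (note the AMQP listener from the question is on port 5672, so it may need the same ACCEPT rule as 15672):
sudo iptables-restore < /etc/sysconfig/iptables
sudo iptables -L INPUT -n | grep -E '5672|15672'    # verify the rules are loaded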
