[ Disclaimer: this question was originally posted on ServerFault. However, since the official K8s documentation states "ask your questions on StackOverflow", I am also adding it here ]
I am trying to deploy a test Kubernetes cluster on Oracle Cloud, using OCI VM instances - however, I'm having issues with pod networking.
The networking plugin is Calico - it seems to be installed properly, but no traffic gets across the tunnels from one host to another. For example, here I am trying to access nginx running on another node:
root#kube-01-01:~# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
nginx-dbddb74b8-th9ns 1/1 Running 0 38s 192.168.181.1 kube-01-06 <none>
root#kube-01-01:~# curl 192.168.181.1
[ ... timeout... ]
Using tcpdump, I see the IP-in-IP (protocol 4) packets leaving the first host, but they never seem to make it to the second one (although all other packets, including BGP traffic, make it through just fine).
root#kube-01-01:~# tcpdump -i ens3 proto 4 &
[1] 16642
root#kube-01-01:~# tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
root#kube-01-01:~# curl 192.168.181.1
09:31:56.262451 IP kube-01-01 > kube-01-06: IP 192.168.21.64.52268 > 192.168.181.1.http: Flags [S], seq 3982790418, win 28000, options [mss 1400,sackOK,TS val 9661065 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
09:31:57.259756 IP kube-01-01 > kube-01-06: IP 192.168.21.64.52268 > 192.168.181.1.http: Flags [S], seq 3982790418, win 28000, options [mss 1400,sackOK,TS val 9661315 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
09:31:59.263752 IP kube-01-01 > kube-01-06: IP 192.168.21.64.52268 > 192.168.181.1.http: Flags [S], seq 3982790418, win 28000, options [mss 1400,sackOK,TS val 9661816 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
root#kube-01-06:~# tcpdump -i ens3 proto 4 &
[1] 12773
root#kube-01-06:~# tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
What I have checked so far:
The Calico routing mesh comes up just fine. I can see the BGP traffic on the packet capture, and I can see all nodes as "up" using calicoctl
root#kube-01-01:~# ./calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+-------------------+-------+----------+-------------+
| 10.13.23.123 | node-to-node mesh | up | 09:12:50 | Established |
| 10.13.23.124 | node-to-node mesh | up | 09:12:49 | Established |
| 10.13.23.126 | node-to-node mesh | up | 09:12:50 | Established |
| 10.13.23.129 | node-to-node mesh | up | 09:12:50 | Established |
| 10.13.23.127 | node-to-node mesh | up | 09:12:50 | Established |
| 10.13.23.128 | node-to-node mesh | up | 09:12:50 | Established |
| 10.13.23.130 | node-to-node mesh | up | 09:12:52 | Established |
+--------------+-------------------+-------+----------+-------------+
The security rules for the subnet allow all traffic. All the nodes are in the same subnet, and I have a stateless rule permitting all traffic from other nodes within the subnet (I have also tried adding a rule permitting IP-in-IP traffic explicitly - same result).
The source/destination check is disabled on all the vNICs on the K8s nodes.
Other things I have noticed:
I can get calico to work if I disable IP in IP encapsulation for same-subnet traffic, and use regular routing inside the subnet (as described here for AWS)
Other networking plugins (such as weave) seem to work correctly.
So my question here is - what is happening to the IP-in-IP encapsulated traffic? Is there anything else I can check to figure out what is going on?
And yes, I know that I could have used managed Kubernetes engine directly, but where is the fun (and the learning opportunity) in that? :D
Edited to address Rico's answer below:
1) I'm also not getting any pod-to-pod traffic to flow through (no communication between pods on different hosts). But I was unable to capture that traffic, so I used node-to-pod as an example.
2) I'm also getting a similar result if I hit a NodePort svc on another node than the one the pod is running on - I see the outgoing IP-in-IP packets from the first node, but they never show up on the second node (the one actually running the pod):
root#kube-01-01:~# tcpdump -i ens3 proto 4 &
[1] 6499
root#kube-01-01:~# tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
root#kube-01-01:~# curl 127.0.0.1:32137
20:24:08.460069 IP kube-01-01 > kube-01-06: IP 192.168.21.64.40866 > 192.168.181.1.http: Flags [S], seq 3175451438, win 43690, options [mss 65495,sackOK,TS val 19444115 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
20:24:09.459768 IP kube-01-01 > kube-01-06: IP 192.168.21.64.40866 > 192.168.181.1.http: Flags [S], seq 3175451438, win 43690, options [mss 65495,sackOK,TS val 19444365 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
20:24:11.463750 IP kube-01-01 > kube-01-06: IP 192.168.21.64.40866 > 192.168.181.1.http: Flags [S], seq 3175451438, win 43690, options [mss 65495,sackOK,TS val 19444866 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
20:24:15.471769 IP kube-01-01 > kube-01-06: IP 192.168.21.64.40866 > 192.168.181.1.http: Flags [S], seq 3175451438, win 43690, options [mss 65495,sackOK,TS val 19445868 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
Nothing on the second node ( kube-01-06, the one that is actually running the nginx pod ):
root#kubespray-01-06:~# tcpdump -i ens3 proto 4
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
I used 127.0.0.1 for ease of demonstration - of course, the exact same thing happens when I hit that NodePort from an outside host:
20:25:17.653417 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56630 > 192.168.181.1.http: Flags [S], seq 980178400, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:17.654371 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56631 > 192.168.181.1.http: Flags [S], seq 3932412963, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:17.667227 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56632 > 192.168.181.1.http: Flags [S], seq 2017119223, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:20.653656 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56630 > 192.168.181.1.http: Flags [S], seq 980178400, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:20.654577 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56631 > 192.168.181.1.http: Flags [S], seq 3932412963, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:20.668595 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56632 > 192.168.181.1.http: Flags [S], seq 2017119223, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
3) As far as I can tell (please correct me if I'm wrong here), the nodes are aware of routes to pod networks, and pod-to-node traffic is also encapsulated IP-in-IP (notice the protocol 4 packets in the first capture above)
root#kube-01-01:~# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
alpine-9d85bf65c-2wx74 1/1 Running 1 23m 192.168.82.194 kube-01-08 <none>
nginx-dbddb74b8-th9ns 1/1 Running 0 10h 192.168.181.1 kube-01-06 <none>
root#kube-01-01:~# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
<snip>
192.168.181.0 10.13.23.127 255.255.255.192 UG 0 0 0 tunl0
Maybe it is a MTU issue:
Typically the MTU for your workload interfaces should match the network MTU. If you need IP-in-IP then the MTU size for both the workload and tunnel interfaces should be 20 bytes less than the network MTU for your network. This is due to the extra 20 byte header that the tunnel will add to each packet.
Read more here.
After a long time and a lot of testing, my belief is that this was caused by IP-in-IP (ipip, or IP protocol 4) traffic being blocked by the Oracle cloud networking layer.
Even though I was unable to find this documented anywhere, it is something that is common for cloud providers (Azure, for example, does the same thing - disallows IP-in-IP and unknown IP traffic).
So the possible workarounds here should be the same ones as the ones listed in the Calico documentation for Azure:
Disabling IP-in-IP for same-subnet traffic (as I mentioned in the question)
Switching Calico to VXLAN encapsulation
Using Calico for policy only, and flannel for encapsulation (VXLAN)
Are you having issues connecting from Pod to Pod?
The short answer here would seem that he PodCidr packets are getting encapsulated when they are communicating to another pod either on the same node or another node.
Note:
By default, Calico’s IPIP encapsulation applies to all container-to-container traffic.
So you will be able to connect to a pod on another node if you are inside the pod. For example, if you connect with kubectl exec -it <pod-name>.
This is the reason you can't connect to a pod/container from root#kube-01-01:~# since your node/host doesn't know anything about the PodCidr. It sends 192.168.x.x packets through the default node/host route, however, your physical network is not 192.168.x.x so they get lost since there's no other node/host that physically understands that.
The way you would connect to a nginx service would be through a Kubernetes Service, this is different from the network overlay and allows you connect to pods outside of the PodCidr. Note that these service rules are managed by the kube-proxy and are generally iptables rules. Also, with iptables you can explicitly do things like if you want to talk to IP A.A.A.A you need to go through a physical interface (i.e. tun0) or you have to through IP B.B.B.B.
Hope it helps!
I installed Openstack Ansible, Pike version. There is a separate network controller and on it one physical network interface. We created VLAN 139 that leads the traffic to gateway. Config file for that part looks like:
/etc/network/interfaces
...
auto eno1.139
iface eno1.139 inet manual
vlan-raw-device eno1
# OpenStack Networking VLAN bridge
auto br-vlan
iface br-vlan inet manual
bridge_stp off
bridge_waitport 0
bridge_fd 0
bridge_ports eno1.139
We created an external Openstack network using:
openstack network create --external --share --provider-physical-network vlan --provider-network-type vlan --provider-segment 139 provider1
and all the other steps (subnet, router, etc)
As per documentation, first test should be pinging default gateway from router namespace. When I try that it is not working:
root#infra1-neutron-agents-container-e800e983:/# ip netns exec qrouter-eb842b12-9a35-4a93-baa9-38cc73531d9f ping 139.25.25.193
When I do TCP dump on physical network interface of controller node I can see packets going out without any problem:
openstackadmin#clcontroller:~$ sudo tcpdump -i eno1 --immediate-mode -e -n | grep 139.25.25.193
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eno1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:30:09.182894 fa:16:3e:d4:b6:a1 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 50: vlan 139, p 0, ethertype 802.1Q, vlan 139, p 0, ethertype ARP, Request who-has 139.25.25.193 tell 139.25.25.200, length 28
I see ARP request getting to gateway that has 139.25.25.193 and I am trying to ping:
hpadmin#hos-gw01:~$ sudo tcpdump -i any --immediate-mode -e -n | grep 139.25.25.193
[sudo] password for hpadmin:
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
15:53:29.857281 B fa:16:3e:d4:b6:a1 ethertype 802.1Q (0x8100), length 62: vlan 139, p 0, ethertype 802.1Q, vlan 139, p 0, ethertype ARP, Request who-has 139.25.25.193 tell 139.25.25.200, length 38
15:53:29.857281 B fa:16:3e:d4:b6:a1 ethertype 802.1Q (0x8100), length 58: vlan 139, p 0, ethertype ARP, Request who-has 139.25.25.193 tell 139.25.25.200, length 38
but what is confusing is my gateway is not responding to those ARP requests.
If I try to do same thing from stand alone Linux machine connected to same network segment and same VLAN everything works perfect.
Any idea what the problem might be? Thanks in advance.
It seems that problem was that external OpenStack network was set up to be on VLAN 139. Once we changed it to be flat everything started working without any problems. I am still confused, though, why gateway did not sent ARP responses.
On my development machine, I can see the following DHCP traffic on eth1 port:
sudo tcpdump -i eth1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
09:51:58.785056 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP,
Request from 00:06:31:c7:1e:23 (oui Unknown), length 315
09:51:58.785384 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP,
Request from 00:06:31:c7:1e:23 (oui Unknown), length 315
09:51:59.786677 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP,
Request from 00:06:31:c7:1e:23 (oui Unknown), length 315
I want to see this same traffic in my Vagrant VM. So I want to bridge the eth1 port on my development machine to some port on my Vagrant VM. I also want to be able to send network traffic from my Vagrant VM back onto the eth1.
Using the answer in the following question:
https://superuser.com/questions/752954/need-to-do-bridged-adapter-only-in-vagrant-no-nat
I added this to my Vagrantfile:
config.vm.provision "shell",
run: "always",
inline: "route -A inet6 add deault gw fc00::1 eth1"
But I don't really know what this does, except that it seems to only route ipv6 traffic? Could someone help explain how to bridge eth1 to a port on my Vagrant VM?
I also tried the following in my Vagrantfile:
Vagrant::Config.run do |config|
config.vm.network :bridged
end
yet after I do vagrant up again, I still don't see any port (using ifconfig) in my VM that has the DHCP traffic.
After reading up on Vagrant, I found a solution that worked for me.
Firstly, I ran the following command on my host machine:
sudo ip route add 192.168.1.0/24 dev eth1
My understanding is that this routes traffic from eth1 physical port to 192.168.1.0 IP address.
Then, in my Vagrantfile I added the following under Vagrant.configure(2) do |config|:
config.vm.network “public_network”, bridge: “eth1”
config.vm.network “public_network”, ip: “192.168.1.0”
After doing vagrant destroy and vagrant up, I now see a new port in my vagrant VM (enp0s8), that if I do tcpdump on I see the DHCP traffic (and other traffic) that is arriving on my eth1 port on my host machine.
Not sure if this is the best solution, as I am still very new to vagrant and VirtualBox, but this worked for me.
Rsyslog Server IP: 192.168.122.94
Rsyslog Client IP: 192.168.122.93
1) Done rsyslog server force reboot
root#rsyslogserver:~# reboot -f
Write failed: Broken pipe
2) After reboot I have sent an event from rsyslog client.
3) Server is running on port 1014 and client is configured to forward logs to server on 1014
3) Ran tcpdump on rsyslog server to listen the communication on port 1014. For the first time when we send event after force reboot, rsyslog client is not able to forward event to rsyslog server. Then after, rsyslog client is able to forward logs to rsyslog server.
root#rsyslogserver:~# tcpdump -i eth1 "src port 1014"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
11:03:05.687971 IP 192.168.122.94.1014 > 192.168.122.93.40036: Flags [R], seq 3944299399, win 0, length 0
11:05:28.096264 IP 192.168.122.94.1014 > 192.168.122.93.52079: Flags [S.], seq 3014852900, ack 1286331701, win 14480, options [mss 1460,sackOK,TS val 4294939924 ecr 149156552,nop,wscale 6], length 0
11:05:28.096605 IP 192.168.122.94.1014 > 192.168.122.93.52079: Flags [.], ack 394, win 243, options [nop,nop,TS val 4294939924 ecr 149156552], length 0
Reason:
This seems general behavior of any TCP connection. If any System crashes or terminates abnormally and after that if we send any TCP request then it resets old pre-cash connection and establishes new connection.
This will not happen for normal reboot.
RefLink:
https://en.wikipedia.org/wiki/TCP_reset_attack (Section TCP resets)
But here my question is how to prevent loss of that event for the first time.
Will there be any configurations in rsyslog server/client side to prevent event loss.
I'm trying to capture tcp packets from a GPS device(client) configured to my server's 11050 port of eth1 interface. I wanna capture these packets to a file. The result is not in a human readable format. Below are list of the commands i tried with, but no results. please help...
tcpdump -w test.pcap -i eth1 tcp port 11050
tcpdump -i eth1 -X -s 11050 -w test1
test1, test.pcap both read the below!!!
��ق���ق���ʣ�N�%��S�U$��E�MO#5�&߶y?vf��6++�qe��>ۀ
tcpdump will write the captured data in a format suitable for re-parsing later with tcpdump, wireshark, Tshark, etc.
Re-read the file with tcpdump -r test.pcap and you'll get human-readable output:
$ tcpdump -r ./test.pcap
reading from file ./test.pcap, link-type EN10MB (Ethernet)
23:25:32.646075 ARP, Request who-has moxi-00067F274580 tell haig, length 28
23:25:32.646322 ARP, Reply moxi-00067F274580 is-at 00:06:7f:27:45:80 (oui Unknown), length 46
23:25:34.567932 IP haig.36941 > 192.168.0.1.domain: 36648+ A? www.google.com. (32)
...