Kubernetes NodePort concurrent connection limit - networking

I'm running Kubernetes on AWS EKS. I'm performing some load tests against a NodePort service and seeing a concurrent connection limit of ~16k-20k when hitting a node the pod is not running on. I'm wondering if there's some way to increase the number of concurrent connections.
I'm running a NodePort service with only 1 pod, which is scheduled on node A. The load test tries to open as many concurrent WebSocket connections as possible. The WebSockets just sleep and send heartbeats every 30s to keep the connection alive.
When I point the load tester (tsung) at node A, I can get upwards of 65k concurrent WebSockets before the pod gets OOMKilled, so memory is the limiting factor and that's fine. The real problem is when I point the load tester at node B and kube-proxy's iptables rules forward the connection to node A: all of a sudden, I can only get about 16k-20k concurrent WebSocket connections before the connections start stalling. According to netstat, they are getting stuck in the SYN_SENT state.
netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
...
20087 ESTABLISHED
30969 SYN_SENT
The only thing I could think to check is my conntrack limit, and it looks fine. Here is what I get on node B.
net.netfilter.nf_conntrack_buckets = 16384
net.netfilter.nf_conntrack_max = 131072
net.nf_conntrack_max = 131072
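For reference, a quick way to watch live conntrack usage and failures on node B while the test runs (a rough diagnostic sketch; the conntrack tool comes from the conntrack-tools package and may need to be installed):
$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
$ conntrack -S | grep -E 'insert_failed|drop'
If nf_conntrack_count stays well below nf_conntrack_max and the insert_failed/drop counters stay at zero, conntrack itself probably isn't the limit.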
Here is the local port range. I'm not sure if it matters (whether DNAT and SNAT use up local ports), but the range is well above 16k.
net.ipv4.ip_local_port_range = 32768 60999
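A rough way to gauge how much of that range the forwarded connections actually consume is to count the conntrack entries toward the pod (a sketch; <pod-ip> is a placeholder for the pod's IP on node A):
$ conntrack -L 2>/dev/null | grep <pod-ip> | wc -l
Comparing that count against the ~28k ports in the range above should show whether port allocation for the DNAT/SNAT step is anywhere near exhaustion.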
The file descriptor limit and kernel TCP settings are the same for node A and node B so I think that rules them out.
Is there anything else that could be limiting the number of concurrent connections forwarded through iptables/netfilter?

You are always going to get worse performance when hitting the NodePort on a node where your pod is not running. Essentially, your packets go through extra hops (via iptables) to reach their final destination.
I'd recommend preserving the client source IP for your NodePort service, so traffic is only served by the node running the pod. Basically, patch your service with this:
$ kubectl patch svc <your-service> -p '{"spec":{"externalTrafficPolicy":"Local"}}'
Then let your load balancer forward traffic only to NodePorts that are serving traffic.
Alternatively, if you'd like something better performing, you could consider kube-proxy's IPVS mode or something like eBPF/Cilium for your overlay.
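To see which mode kube-proxy is currently using on a node, a quick check (a sketch; assumes kube-proxy's metrics endpoint is bound at its default of 127.0.0.1:10249 on that node):
$ curl -s http://localhost:10249/proxyMode
iptables
Switching modes then means updating the kube-proxy configuration (on EKS, the kube-proxy ConfigMap in kube-system) and restarting the kube-proxy pods.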

Related

Failing to perform TCP hole punching using STUN

I have two hosts A and B. They're in different networks, behind different NATs and ISPs. I'm trying to set up a p2p connection between them by using hole punching. I use a STUN server to obtain mapped IP addresses and ports for both A and B. It goes on like this:
For A:
.\stunclient.exe --mode behavior stunserver.stunprotocol.org 3478
Binding test: success
Local address: 192.168.0.110:54709
Mapped address: 186.233.160.141:28769
Behavior test: success
Nat behavior: Endpoint Independent Mapping
For B:
.\stunclient.exe --mode behavior stunserver.stunprotocol.org 3478
Binding test: success
Local address: 192.168.3.1:57015
Mapped address: 45.70.35.52:12870
Behavior test: success
Nat behavior: Endpoint Independent Mapping
Then I try to perform the TCP hole punching technique (using netcat) by executing these two lines simultaneously and multiple times on A and B:
On A:
ncat -p 54709 45.70.35.52 12870
Ncat: TIMEOUT.
On B:
ncat -p 57015 186.233.160.141 28769
Ncat: TIMEOUT.
I always get "Ncat: TIMEOUT" as output (not immediately, it takes some time); however, I could make a direct connection between A and B via UDP hole punching by running the following commands three times:
On A:
ncat -u -p 54709 45.70.35.52 12870
On B:
ncat -u -p 57015 186.233.160.141 28769
So the problem is TCP hole punching isn't working. Any ideas why?
There are many issues that might be making this a challenge.
First, stunclient defaults to UDP whereas ncat defaults to TCP. So your first issue is that you aren't passing the flag (-u on most systems) to tell ncat to run as UDP. Alternatively, you can try running stunclient in TCP mode (e.g. stunclient --protocol tcp stunserver.stunprotocol.org), but TCP NAT traversal is much less reliable than UDP, especially with rudimentary command-line tools.
I don't understand how your output above can have Host A and Host B behind the same NAT, yet print different local subnets and different mapped public addresses for the two machines. How is this a thing? Is this just a typo? Or is one machine a VM host of the other and they are sharing an IP?
The behavior you are trying to achieve, having two hosts behind the same NAT connect via the public IP address, is called hairpinning. This relies on the NAT being smart enough to see that an outbound packet is really meant for a host behind the router itself and to loop it back through its own routing table instead of sending it out on the WAN interface. Not all NATs support hairpinning. So what you have to do is try connecting to both the local and the remote IP addresses.
Also, try to avoid picking hardcoded ports like 20000. Let stunclient.exe pick a randomly available port for you (i.e. don't specify the --localport parameter). Then, when you issue the ncat command, use the local port it picked, and connect to the remote mapped port of the other IP address.
Hypothetical usage:
Host A
stunclient.exe stunserver.stunprotocol.org
Binding test: success
Local address: 192.168.1.2:1111
Mapped address: 45.70.35.52:2222
Host B
stunclient.exe stunserver.stunprotocol.org
Binding test: success
Local address: 192.168.1.3:3333
Mapped address: 45.70.35.52:4444
Address candidates passed from A to B: {45.70.35.52:2222, 192.168.1.2:1111}
Address candidates passed from B to A: {45.70.35.52:4444, 192.168.1.3:3333}
Host A then runs these commands in parallel. But oops, ncat may not allow sharing the socket port between two running programs. Look at the documentation to see if the SO_REUSEADDR flag is exposed as a command-line parameter, or whether it is set implicitly.
ncat -u -p 1111 45.70.35.52 4444
ncat -u -p 1111 192.168.1.3 3333
Host B then does this in two separate consoles:
ncat -u -p 3333 45.70.35.52 2222
ncat -u -p 3333 192.168.1.2 1111
In other words, try all 4 combinations of A to B and B to A.
I was about to mention making sure you don't have address-dependent mapping (i.e. a "symmetric NAT") by running the behavior test. Symmetric NATs make it very difficult for a p2p connection to "go direct". But you've got endpoint-independent mapping, which is good.

Is there a way to preserve the source port for outgoing traffic in Kubernetes?

In most TCP client/server communications, the client uses a random ephemeral port number for outgoing traffic. However, my client application, which is running inside a Kubernetes cluster, must use a specific source port for outgoing traffic; this is a requirement imposed by the server.
This normally works fine when the application runs outside the cluster, but inside Kubernetes the source port is modified somewhere along the way from the pod to the worker node (verified with tcpdump on the worker node).
For context, I am using a LoadBalancer Service object, and the cluster is running kube-proxy in iptables mode.
So I found that I can achieve this by setting the hostNetwork field to true in the pod's spec.
Not an ideal solution, but it gets the job done.
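For reference, a minimal sketch of what that looks like (the pod name and image are placeholders):
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: fixed-source-port-client
spec:
  hostNetwork: true   # pod uses the node's network namespace, so no SNAT between pod and node
  containers:
  - name: client
    image: my-client:latest   # placeholder image
EOF
With hostNetwork: true the application binds directly on the node's interfaces, so the source port it chooses is the one seen on the wire; the trade-offs are losing pod network isolation and having to manage port conflicts on the node.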

Connecting to a remote MySQL database from Docker container

I have two servers in AWS, both in a security group that allows all traffic on all ports between members of the security group. On one server I have a MySQL server (without docker, let's call this server the "MySQL server") and on the other server I have docker (let's call it the "Docker server"). I want to access MySQL from within a container on the docker server without having to route the traffic over the internet (I'd like to use the internal IP address of the MySQL server instead).
Is this possible? What are my options?
What I've tried so far
I've configured the MySQL server to listen on all interfaces, just for testing. This lets me connect to the MySQL server successfully from the Docker server (using the mysql client against the private IP address of the MySQL server). However, when I start a container, a new network namespace is created, so I can't reach the private IP address of the MySQL server anymore.
I've tried using an ambassador container as described here, but I run into the same problem: the private IP address of the MySQL server is not reachable from inside the ambassador container.
Example
Here's an example to illustrate the problem and what I'm trying to do.
From the Docker server (not in any container yet):
$ ping -c 1 10.0.0.155
PING 10.0.0.155 (10.0.0.155) 56(84) bytes of data.
64 bytes from 10.0.0.155: icmp_seq=1 ttl=64 time=0.777 ms
--- 10.0.0.155 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.777/0.777/0.777/0.000 ms
However trying from within a container:
$ sudo docker run --rm -it apcera/nats-ping-client ping -c 1 10.0.0.115
PING 10.0.0.115 (10.0.0.115) 56(84) bytes of data.
From 10.0.0.200 icmp_seq=1 Destination Host Unreachable
--- 10.0.0.115 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
I expected this, because I know Docker creates a new private network just for the containers, but I don't know enough to work around it for what I'm trying to do.
How can I wire things up so I can access the MySQL server from within a container?
Yes that's possible.
Whether a container can talk to the world is governed by two factors. The first factor is whether the host machine is forwarding its IP packets. The second is whether the host’s iptables allow this particular connection.
To check the setting on your kernel, or to turn it on manually (be sure it is set to 1):
$ sysctl net.ipv4.conf.all.forwarding
net.ipv4.conf.all.forwarding = 0
$ sysctl net.ipv4.conf.all.forwarding=1
$ sysctl net.ipv4.conf.all.forwarding
net.ipv4.conf.all.forwarding = 1
Docker will never make changes to your system iptables rules if you set --iptables=false when the daemon starts. Otherwise, the Docker server will append forwarding rules to the DOCKER filter chain. So be sure not to use --iptables=false.
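If forwarding is enabled and connections still fail, a quick way to see what the daemon actually installed (a sketch; chain names are the Docker defaults):
$ sudo iptables -L FORWARD -n -v
$ sudo iptables -L DOCKER -n -v
Traffic leaving the containers should be matched by an ACCEPT rule for the docker0 bridge (or by the FORWARD chain's policy), with no REJECT/DROP rule hitting it before then.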

UDP hole punching for Server/Client communication under NAT with STUN

Problem
I'm trying to develop a communication system where:
A and B are machines behind NAT; A is the server, B is the client
S is the STUN server
S is running on a machine reachable on the Internet
The flow is as follows:
A hits S with an opcode saying he's the server
S registers A as server
B hits S with an opcode saying he's the client
S sends A B's external info (IP, port)
S sends B A's external info (IP, port)
A starts sending B an opcode saying he's the server every 500ms
and meanwhile listens for packets saying he's got a client
B starts sending A an opcode saying he's the client every 500ms
and meanwhile listens for packets saying he's got the server
Trouble
Here's where the trouble starts: the STUN server does its job, since both ends receive correct info about the other.
But then I never receive the other end's message, so both ends keep listening without receiving the handshake opcode or anything else.
NAT's Behaviour
I examined this NAT's behaviour and it seems to work like this:
A is at 192.168.X.X, on port 4444
connects to the outside exposing N.N.N.N:4444
so the port number is kept as long as it's free, gets a new (random ?) one if not available.
Tests
In the tests I ran, both ends (A, B) were hosted on the same machine and bound to the machine's internal IP; I also tried binding to 127.0.0.1 and 0.0.0.0, and nothing changed.
If, while they're listening for handshakes, I echo something with nc to localhost, it is received and displayed (as an unrecognised message) without any problem. The connection routed via the NAT doesn't work though; every packet is discarded.
I also tried with A hosted on the machine and B on an Android phone on mobile data, with a simple app written ad hoc. It still blocks waiting for something, just like the Node.js tests.
Update:
Another thing I tried was to open a hole with nc.
On two different machines under the same NAT I ran:
echo "GREET UNKOWN PEER" | nc -u <NAT IP> 4567 -p 4568
echo "GREET UNKOWN PEER" | nc -u <NAT IP> 4568 -p 4567
I ran each of these several times per machine. From my understanding this should punch a hole in the NAT, with the first packets discarded and the subsequent ones forwarded. But nothing happened; neither end got the message.
I've also tried:
from local machine
echo "GREET UNKOWN PEER" | nc -u <PUBLIC IP> 4567 -p 4568
from public machine
echo "GREET UNKOWN PEER" | nc -u <NAT IP> 4568 -p 4567
This one works: the local machine behind the NAT contacts the public one, and after the first discarded packet it is able to receive and send on the assigned port. I wonder why this doesn't work with two machines behind the same NAT (???)
Code
I didn't show any code because I think there is some kind of logic flaw in my approach; however, here's the GitHub project for it.
index.js contains the STUN server, the tests folder contains the test cases: test.js starts the stun server, PeerClientTest.js and PeerServerTest.js are mockups of the client and server.
Run node tests/test.js to start the server on a public machine (change IPs in config.js and tests/config.js)
then node tests/PeerServerTest.js to start the server ("A") and node tests/PeerClientTest.js to start the client ("B"). Both will recognize each other via STUN, then listen for the other end's handshake opcode while sending their own handshake opcode. This never happens so they just keep sending/listening forever.
Node is not required, so if there are better solutions in other languages, just say so; it will be appreciated.
B's NAT is filtering A's packets and not letting them through. A NAT filters unknown packets sent to it: your server A is sending packets to client B, but client B has never previously sent a packet out through its NAT to A, so to B's NAT, A's packets are unknown and get discarded.
You need to punch a hole in B's NAT so it allows the incoming packets. Send a packet from B to A's NAT IP:port; after that, when you send a packet from A to B, B's NAT won't discard it.
This won't work if A's and B's NATs are a combination like Symmetric and Symmetric, or Symmetric and Port-Restricted Cone. In that case you will have to use a TURN relay server.
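For completeness, a rough sketch of that ordering using plain nc, in the spirit of the tests above. <A_PUBLIC>/<B_PUBLIC> and ports 40000/40001 are placeholders for the mapped addresses and ports learned via STUN, and it assumes both NATs preserve the local source port, as observed for this NAT:
On B (run first; this first outbound datagram punches the hole and will likely be dropped by A's NAT, which is fine):
nc -u -p 40001 <A_PUBLIC> 40000
On A:
nc -u -p 40000 <B_PUBLIC> 40001
Type a line on each side; once both ends have sent at least one datagram, each NAT has an outbound mapping for the other's address and traffic should flow in both directions. Note that, depending on the netcat variant, the echo "..." | nc form used above may exit as soon as stdin closes, tearing down the socket (and eventually the mapping) before the reply can arrive.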

EC2 instance drops ICMP packets. How to measure the latency?

I am trying to measure the latency between one of my machines and an EC2 instance. EC2 instances cannot be pinged, so I tried using application-level timestamps (via gettimeofday()): I send a TCP packet with a timestamp in the payload.
Upon receiving this packet, I take the timestamp on my machine and compute the difference. It always comes out negative. My guess was that the clocks on the two machines could be skewed, so I used NTP to synchronize both machines, but the problem persists.
Can someone please help?
EC2 instances can be pinged, if configured to allow it. I set one up for this today while trying to track down packet drops in us-west-2. In the security group protecting the instance, you add a rule to permit "ICMP Echo Request" from the source address of the machine where you're originating the ping.
See the AWS FAQ for this quote.
Why can't I ping my instance? Ping uses ICMP ECHO, which by default is blocked by your firewall. You'll need to grant ICMP access to your instances by updating the firewall restrictions that are tied to your security group.
ec2-authorize default -P icmp -t -1:-1 -s 0.0.0.0/0
Check out the latest developer guide for details.
Section: Instance Addressing and Network Security -> Network Security -> Examples
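The ec2-authorize tool quoted above is from the legacy EC2 API CLI. A rough modern equivalent with the AWS CLI looks like this (the security group ID and source CIDR are placeholders):
$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --ip-permissions 'IpProtocol=icmp,FromPort=-1,ToPort=-1,IpRanges=[{CidrIp=203.0.113.10/32}]'
This allows all ICMP types (including Echo Request) from that single source address, after which ping to the instance should work.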
