Does a docker container have its own TCP/IP stack? - networking

I'm trying to understand what's happening under the hood to a network packet coming from the wire connected to the host machine and directed to an application inside a Docker container.
If it were a classic VM, I know that a packet arriving on the host would be transmitted by the hypervisor (say VMware, VBox etc.) to the virtual NIC of the VM and from there through the TCP/IP stack of the guest OS, finally reaching the application.
In the case of Docker, I know that a packet arriving at the host is forwarded from the host's network interface to the docker0 bridge, which is connected to a veth pair whose other end is the virtual interface eth0 inside the container. But what happens after that? Since all Docker containers use the host kernel, is it correct to presume that the packet is processed by the TCP/IP stack of the host kernel? If so, how?
I would really like to read a detailed explanation (or if you know a resource feel free to link it) about what's really happening under the hood. I already carefully read this page, but it doesn't say everything.
Thanks in advance for your reply.

The network stack, as in "the code", is definitely not in the container; it's in the kernel, of which there is only one, shared by the host and all containers (as you already noted). What each container has is its own separate network namespace, which means it has its own network interfaces and routing tables.
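If you want to verify the "one kernel, many namespaces" part yourself, you can compare namespace identities through /proc. A small sketch in Python (the container PID is a placeholder, e.g. taken from `docker inspect -f '{{.State.Pid}}' <container>`; reading another process's namespace links generally requires root):

```python
import os

# Compare the network namespace of this (host) process with that of a process
# running inside a container. Different inode numbers mean different network
# namespaces -- separate interfaces, routes, iptables -- yet one shared kernel.
CONTAINER_PID = 12345  # placeholder: PID of a containerized process on the host

host_ns = os.readlink("/proc/self/ns/net")                   # e.g. 'net:[4026531992]'
container_ns = os.readlink(f"/proc/{CONTAINER_PID}/ns/net")  # a different inode number

print("host net namespace:     ", host_ns)
print("container net namespace:", container_ns)
print("same namespace?", host_ns == container_ns)  # False for a bridged container
```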
Here's a brief article introducing the notion with some examples: http://blog.scottlowe.org/2013/09/04/introducing-linux-network-namespaces/
and I found this article helpful too:
http://containerops.org/2013/11/19/lxc-networking/
I hope this gives you enough pointers to dig deeper.

Related

Multiple LXD containers on single macvlan interface

I'm a little confused as to how the following scenario works. It's a very simple setup, so I hope the explanation is simple.
I have a host with a single physical NIC. I create a single macvlan sub-interface in bridge mode off this physical NIC. Then I start up two LXD/LXC containers. Each with their own unique MAC and IP, but in the profile, I specify the same single macvlan sub-interface as each container's parent interface.
Both containers have access to the network without issue. I'm also able to SSH into each container using each container's unique IP address. This is the bit that confuses me:
How is all of this working underneath the hood? Both containers are using the single macvlan MAC/IP when accessing the external world. Isn't there going to be some sort of collision? Shouldn't this not work? Shouldn't I need one macvlan subinterface per container? Is there some sort of NAT going on here?
macvlan isn't documented much, hoping someone out there can help out.
There isn't NAT per se, since NAT happens at the IP layer and MACs are at the link layer, but the result is similar.
Each container ends up with its own MAC on its own macvlan interface (as you already observed), so nothing is being shared. All of the MACs (the physical NIC's and each macvlan sub-interface's) travel over the same link to the NIC. The macvlan driver then delivers incoming traffic to the correct interface (virtual or not) based on the destination MAC, which hands it to one of the guests or to the host. You can think of macvlans as little virtual switches hanging off the physical NIC.
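If you want to see the mechanism outside of LXD, you can create macvlan sub-interfaces by hand and check that each one gets its own MAC. A rough sketch (the interface names and the parent eth0 are placeholders; needs root and iproute2):

```python
import subprocess

# Create two macvlan sub-interfaces in bridge mode off the physical NIC,
# roughly one per guest, and print their MAC addresses.
for name in ("macvlan-c1", "macvlan-c2"):
    subprocess.run(
        ["ip", "link", "add", "link", "eth0", "name", name,
         "type", "macvlan", "mode", "bridge"],
        check=True)
    # Each sub-interface gets its own, distinct MAC address:
    mac = open(f"/sys/class/net/{name}/address").read().strip()
    print(name, mac)
```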

Split uplink and downlink between interfaces with openvswitch

I have one or more virtual machines on a Debian host and two physical Ethernet interfaces. I want to split the bandwidth between the interfaces (both for downlink and one for uplink). Is this possible with Open vSwitch and OpenFlow?
The short answer is that it should be possible with OVS and OpenFlow. With OVS you can attach both your VMs' virtual ports and the server's physical interfaces to the same bridge.
Off the top of my head, you could split the traffic by:
Installing a flow that directs any packet coming from a VM to your uplink port. This flow should rewrite the source IP and MAC to those of the downlink interface, so that to the outside it looks as if it had been sent through the downlink, and replies come back on that interface (see the sketch below).
Keep in mind that you may need to take your virtual ports' configuration into account, and that you will need some kind of mapping (something like NAT) to get return packets delivered back to their respective VMs. You can take a look at a NAT implementation for the Ryu controller to get some inspiration.
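To make the idea concrete, here is a rough sketch (not a drop-in config) of installing such a flow with ovs-ofctl; the bridge name, port numbers and addresses are placeholders for your own setup:

```python
import subprocess

# Send everything arriving on the VM's OVS port out of the uplink NIC,
# rewriting the source MAC/IP to the downlink interface's so that replies
# come back on the downlink. Needs ovs-ofctl and root.
VM_PORT, UPLINK_PORT = 1, 2
DOWNLINK_MAC, DOWNLINK_IP = "00:11:22:33:44:55", "192.0.2.10"

flow = (f"in_port={VM_PORT},ip,"
        f"actions=mod_dl_src:{DOWNLINK_MAC},mod_nw_src:{DOWNLINK_IP},"
        f"output:{UPLINK_PORT}")
subprocess.run(["ovs-ofctl", "add-flow", "br0", flow], check=True)
```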

Docker Networking

I am trying to understand the relationship between:
eth0 on the host machine; and
docker0 bridge; and
eth0 interface on each container
It is my understanding that Docker:
Creates a docker0 bridge and then assigns it an available subnet that is not in conflict with anything running on the host; then
Docker binds docker0 to eth0 running on the host; then
Docker binds each new container it spins up to docker0, such that the container's eth0 interface connects to docker0 on the host, which in turn is connected to eth0 on the host
This way, when something external to the host tries to communicate with a container, it must send the message to a port on the host's IP, which then gets forwarded to the docker0 bridge, which then gets broadcasted to all the containers running on the host, yes?
Also, this way, when a container needs to communicate with something outside the host, it has its own IP (leased from the docker0 subnet), and so the remote caller will see the message as having come from the container's IP.
So if anything I have stated above is incorrect, please begin by clarifying for me!
Assuming I'm more or less correct, my main concerns are:
When remote services "call in" to the container, all containers get broadcasted the same message, which creates a lot of traffic/noise, but could also be a security risk (where only container 1 should be the recipient of some message, but all the other containers running on it get the message as well); and
What happens when Docker chooses identical subnets on different hosts? In this case, container 1 living on host 1 might have the same IP address as container 2 living on host 2. If container 1 needs to "call out" to some external/remote system (not living on the host), then how does that remote system differentiate between container 1 vs container 2 (both will show the same egress IP)?
I wouldn't say you're entirely clear on the concept of networking in Docker, so let me clarify that part first. Here's how it goes:
Docker uses a feature of the Linux kernel called namespaces to partition resources.
When a container starts, Docker creates a set of namespaces for that container; this provides a layer of isolation.
One of these is the net namespace, used for managing network interfaces.
Now, a bit more about network namespaces.
A net namespace lets each container have its own view of the network stack:
its own network interfaces;
its own routing tables;
its own iptables rules;
its own sockets (what ss and netstat report).
A network interface can be moved from one net namespace to another, so an interface created in the host's namespace can be handed over to a container's namespace.
Typically, a pair of virtual interfaces (a veth pair) is used, which acts like a crossover cable: eth0 in the container's net namespace is paired with a virtual interface vethXXX in the host's namespace.
➔ All of those vethXXX interfaces are bridged together, using the bridge docker0 (see the sketch just below).
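If you want to look at that wiring on a real host, here is a small read-only sketch; the container name is a placeholder and the second command assumes the image ships iproute2:

```python
import subprocess

# Host side: every running container on the default bridge network contributes
# one vethXXXX interface whose master is the docker0 bridge.
subprocess.run(["ip", "-brief", "link", "show", "master", "docker0"], check=True)

# Container side: eth0 inside the container is the other end of one of those
# veth pairs. "mycontainer" is a placeholder; the image must include iproute2.
subprocess.run(
    ["docker", "exec", "mycontainer", "ip", "addr", "show", "eth0"],
    check=True)
```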
Now, apart from namespaces, there's a second feature in the Linux kernel that makes containers possible: cgroups (control groups).
Control groups let us implement metering and limiting of the following (see the sketch after this list):
Memory
CPU
Block I/O
Network
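For example, a hedged illustration of the "limiting" part (the image name, limit and cgroup path are assumptions; the path shown is for cgroup v2 hosts):

```python
import subprocess

# Start a throwaway container with a 128 MiB memory cap and read the limit
# back from inside it. On cgroup v1 hosts the file would instead be
# /sys/fs/cgroup/memory/memory.limit_in_bytes.
subprocess.run(
    ["docker", "run", "--rm", "--memory", "128m", "alpine",
     "cat", "/sys/fs/cgroup/memory.max"],
    check=True)  # prints 134217728 on a cgroup v2 host
```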
TL;DR
In short, containers are made possible by two main kernel features: namespaces and cgroups.
Cgroups ---> limit how much you can use.
Namespaces ---> limit what you can see.
And you can't affect what you can't see.
Coming back to your question: when the host receives a packet intended for a container, it is not broadcast to every container. For a published port, the host's NAT rules rewrite the destination (DNAT) so the packet goes to the one container that owns that port mapping, and the docker0 bridge forwards the frame only to that container's veth interface; the other containers, sitting in their own namespaces, never see it. On the way out, the reverse happens: the container's private source address is rewritten (masqueraded) to the host's address.
So I think this answers both of your questions as well.
It's not a broadcast. Other containers can't see a packet that isn't addressed to them (namespaces plus ordinary bridge forwarding).
And because outgoing traffic is source-NATed to each host's own address, a remote system sees the host's IP rather than the container's private IP; two containers with the same private address on different hosts show up as two different host addresses, so there is no ambiguity.
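For the curious, a read-only peek at those rules (assumes the default bridge network, the iptables backend, and root; the chain names are the ones Docker normally creates):

```python
import subprocess

# The DOCKER chain in the nat table holds the DNAT rules for published ports;
# POSTROUTING holds the MASQUERADE rule that rewrites container source
# addresses to the host's address on the way out.
for chain in ("DOCKER", "POSTROUTING"):
    subprocess.run(["iptables", "-t", "nat", "-L", chain, "-n", "-v"],
                   check=True)
```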
P.S.:
If anyone finds any of this information erroneous, please let me know in the comments. I wrote this in a hurry and will update it with better-reviewed text soon.
Thank you.

TCP connection between two openshift containers

I have two applications (diy container type) which have to be connected via TCP. Let's take as example application clusternode1 and clusternode2.
Each one has TCP listener set up for $OPENSHIFT_DIY_IP:$OPENSHIFT_DIY_PORT.
For some reason clusternode1 fails to connect to any of the following options for clusternode2:
$OPENSHIFT_DIY_IP:$OPENSHIFT_DIY_PORT
$OPENSHIFT_APP_DNS
Can you please help me understand what the URL should be for an external TCP connection?
You might check the logs to see if the OPENSHIFT_DIY_IP for both apps is within the same subnet. If one, say, is...
1.2.3.4
...and the other is...
1.5.6.7
...for example, then you might not expect Amazon's firewalls to just arbitrarily allow TCP traffic from one subnet to another. If this were allowed by default then one person's app might try to hack another's.
I know that when you're dealing directly with Amazon AWS and you spin up multiple virtual servers you have to create virtual zones to allow traffic between them. This might be something that's necessary.
Proxy ports: I don't know if this is useful, but it's possible that a private IP address is being bound to your application(s) and a NAT layer is then translating that into a public IP address.
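For reference, a minimal sketch of the two ends, assuming the standard DIY cartridge environment variables are set; the remote hostname and port are placeholders for whatever publicly routed endpoint actually reaches clusternode2:

```python
import os
import socket

# Listening side (clusternode2): bind to the gear's own internal address.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind((os.environ["OPENSHIFT_DIY_IP"],
          int(os.environ["OPENSHIFT_DIY_PORT"])))
srv.listen(5)

# Connecting side (clusternode1): the other gear's OPENSHIFT_DIY_IP is an
# internal address and is generally not reachable from a different gear, so
# connect to a public route for that app instead. Hostname and port below are
# placeholders.
client = socket.create_connection(
    ("clusternode2-yournamespace.rhcloud.com", 8000), timeout=10)
```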

Tunneling a network connection into a VMWare guest without network

I'm trying to establish a TCP connection between a client machine and a guest VM running inside an ESXi server. The trick is that the guest VM has no network configured (intentionally). However the ESX server is on the network, so in theory it might be possible to bridge the gap with software.
Concretely, I'd like to eventually create a direct TCP connection from python code running on the client machine (I want to create an RPyC connection). However anything that results in ssh-like port tunneling would be breakthrough enough.
I'm theorizing that some combination of VMWare Tools, pysphere and obscure network adapters could make this possible. But so far my searches haven't yielded any results, and my only ideas are either ugly (something like tunneling over file operations) or very error-prone (basically, if I have to build a TCP stack, I know I'll be writing lots of bugs).
It's for a testing environment setup, not production; but I prefer stability over speed. I currently don't see much need for high throughput.
To summarize the setup:
Client machine (Windows/Linux, whatever works) with vmware tools installed
ESXi server (network accessible from client machine)
VMWare guest which has no NICs at all, but is accessible using vmware tools (must be Windows in my case, but a Linux solution is welcome for the sake of completeness)
Any ideas and further reading suggestions would be awesome.
Thank you Internet, you are the best!
It's not clear what 'no NICs at all on the guest' means. If what's meant is that no physical NIC is assigned to the guest, the solution is easy: a VMware virtual (soft) NIC can be provisioned for the guest VM, and that will serve as the entry point to the guest's network stack.
But if a soft NIC is also not an option, I really wonder what could serve as the entry point to the guest's network stack, be it Linux or Windows. To my understanding, if that's what you meant, you would need guest OS modifications that open a different door into the guest network stack for posting and draining packets. But once you implement that backdoor properly, it becomes just another implementation of a soft NIC, which VMware already supports by default. So why not use that?
It's a bit late, but a virtual serial port may be your friend. You can terminate the serial port on the outer end over the network or locally, depending on your options. Then you can run some PPP setup or your own custom script on both ends to communicate. You could also run a tool that turns the serial link into a single socket on the guest end, if you want to avoid having a PPP interface but still need to tunnel a TCP connection for some application.
This should keep you safe when analyzing malicious code, as long as it's not Skynet :-) You should still do it with the sysadmin's permission, as you may be violating your company's rules by working around some security measures.
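As a rough sketch of that "custom script" idea on the guest side (using the third-party pyserial package; the device name, listen address and port are placeholders), the following turns the virtual serial port into a single local TCP socket that an application such as RPyC could connect to, with a matching bridge running on the other end of the serial link:

```python
import socket
import threading

import serial  # third-party pyserial package (pip install pyserial)

SERIAL_DEV = "/dev/ttyS1"          # the guest's virtual serial port (COMx on Windows)
LISTEN_ADDR = ("127.0.0.1", 9000)  # local applications connect here

ser = serial.Serial(SERIAL_DEV, 115200)  # blocking reads, no timeout

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(LISTEN_ADDR)
srv.listen(1)
conn, _ = srv.accept()

def socket_to_serial():
    # Copy bytes from the TCP client to the serial port until the client closes.
    while True:
        data = conn.recv(4096)
        if not data:
            break
        ser.write(data)

def serial_to_socket():
    # Copy bytes from the serial port to the TCP client.
    while True:
        data = ser.read(ser.in_waiting or 1)  # block until at least one byte
        if data:
            conn.sendall(data)

threading.Thread(target=serial_to_socket, daemon=True).start()
socket_to_serial()
```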
If the VM 'intentionally' has no network configured, you can't connect to it over a network.
Your question embodies a contradiction in terms.
