Docker overlay network doesn't clean up removed containers - networking

We're running Docker across two hosts, with overlay networking enabled and configured. It's version 1.12.1, with Consul as the KV store - but we aren't using Swarm, largely because we didn't feel it gave us the relevant control over ensuring availability and minimising resources, but anyway.
Our setup is micro service based, and we run quite a lot of containers which get restarted fairly frequently. Our model uses nginx as a "reverse proxy" for service discovery, for various reasons, and so we start multiple containers which share a --host of "nginx-lb". This works fine, and other containers on the network can connect to nginx-lb, which gives them a random one of the containers' IP addresses.
The problem we have is that in killing containers and creating new ones, sometimes (I don't know what specific circumstance this occurs in), the overlay network does not remove the old container from the system, and so other containers then try to connect to the dead ones, causing problems.
The only way to resolve this is to manually run docker network disconnect -f overlay_net [container], having first run docker network inspect overlay_net to find the errant containers.
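The manual check above could be scripted roughly as follows. This is only a sketch: it assumes the output of docker network inspect overlay_net is a JSON array whose first element has a "Containers" map keyed by container ID, and that running_ids is the union of full IDs from docker ps -q --no-trunc on both hosts. The function names are made up for illustration, and the exact keys reported for containers on the other host may differ by Docker version, so check the result by hand before disconnecting anything.

```python
import json


def find_stale_endpoints(inspect_output, running_ids):
    """Return sorted (key, name) pairs for endpoints with no running container.

    inspect_output: raw JSON text from `docker network inspect <network>`.
    running_ids: set of full container IDs known to be running (assumption:
    collected from every host that participates in the overlay network).
    """
    networks = json.loads(inspect_output)
    containers = networks[0].get("Containers") or {}
    stale = []
    for key, info in containers.items():
        if key not in running_ids:
            stale.append((key, info.get("Name", "")))
    return sorted(stale)


def disconnect_commands(network, stale):
    """Build the forced disconnect command for each stale endpoint name."""
    return [
        "docker network disconnect -f {} {}".format(network, name)
        for _key, name in stale
    ]
```

The commands are returned as strings rather than executed, so they can be reviewed first.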
Is there a known issue with the overlay networking not removing dead containers from the KV data, or any ideas of a fix?

Yes, it's a known issue. Follow it here: https://github.com/docker/docker/issues/26244

Related

Migrate from legacy network in GCE

Long story short - I need to use networking between projects to have separate billing for them.
I'd like to reach all the VMs in different projects from a single point that I will use for provisioning systems (let's call it coordinator node).
It looks like VPC network peering is a perfect solution to this. But unfortunately one of the existing networks is "legacy". Here's what the Google docs say about legacy networks:
About legacy networks
Note: Legacy networks are not recommended. Many newer GCP features are not supported in legacy networks.
OK, naturally the question arises: how do you migrate out of a legacy network? The documentation does not address this topic. Is it not possible?
I have a bunch of VMs, and I'd be able to shut them down one by one:
1. shutdown
2. change something
3. restart
Unfortunately, it does not seem possible to change the network even when the VM is down?
EDIT:
It has been suggested to recreate the VMs keeping the same disks. I would still need a way to bridge the legacy network with the new VPC network to make the migration smooth. Any thoughts on how to do that using the GCE toolset?
One possible solution - for each VM in the legacy network:
1. Get VM parameters (API get method)
2. Delete VM without deleting PD (persistent disk)
3. Create VM in the new VPC network using parameters from step 1 (and existing persistent disk)
This way stop-change-start is not so different from delete-recreate-with-changes. It's possible to write a script to fully automate this (migration of a whole network). I wouldn't be surprised if someone already did that.
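As a rough illustration of step 3, the body for the new VM could be derived from the instance resource returned in step 1. This is a sketch under assumptions, not a tested migration script: build_migrated_instance is a hypothetical helper, only a handful of fields are carried over, and the actual get, delete, and insert API calls are left out. External IPs and the old internal IP are deliberately not copied.

```python
import copy


def build_migrated_instance(old, network_url, subnetwork_url=None):
    """Build an insert-request body for a replacement VM on a new network.

    old: the instance resource (a dict) as returned by the API get method.
    network_url / subnetwork_url: targets in the new VPC network (assumed
    to exist already).
    """
    new = {
        "name": old["name"],
        "machineType": old["machineType"],
        "disks": [],
        "networkInterfaces": [],
    }
    # Copy only the items of metadata and tags, dropping server-side fingerprints.
    for key in ("metadata", "tags"):
        if key in old and "items" in old[key]:
            new[key] = {"items": copy.deepcopy(old[key]["items"])}
    # Reattach the existing persistent disks instead of creating new ones.
    for disk in old.get("disks", []):
        new["disks"].append({
            "source": disk["source"],
            "boot": disk.get("boot", False),
            "autoDelete": disk.get("autoDelete", False),
        })
    nic = {"network": network_url}
    if subnetwork_url:
        nic["subnetwork"] = subnetwork_url
    new["networkInterfaces"].append(nic)
    return new
```

Keeping this as a pure function over the instance dict makes it easy to inspect the resulting body before any VM is actually deleted.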
UPDATE
https://github.com/googleinterns/vm-network-migration tool automates the above process, plus it supports migration of a whole Instance Group or Load Balancer, etc. Check it out.

Why do Docker overlay networks require consensus?

Just been reading up on Docker overlay networks, very cool stuff. I just can't seem to find an answer to one thing.
According to the docs:
If you install and use Docker Swarm, you get overlay networks across your manager/worker hosts automagically, and don't need to configure anything more; but...
If you simply want a (non-Swarm) overlay network across multiple hosts, you need to configure that network with an external "KV Store" (consensus server) like Consul or ZooKeeper
I'm wondering why this is. Clearly, overlay networks require consensus amongst peers, but I'm not sure why or who those "peers" even are.
And I'm just guessing that, with Swarm, there's some internal/under-the-hood consensus server running out of the box.
Swarm Mode uses Raft for its manager consensus with a built-in KV store. Before swarm mode, overlay networking was possible with third party KV stores. Overlay networking itself doesn't require consensus, it just relies on whatever the KV store says regardless of the other nodes or even its own local state (I've found this out the hard way). The KV stores out there are typically set up with consensus for HA.
The KV store tracks IP allocations to containers running on each host (IPAM). This allows docker to only allocate a given address once, and to know which docker host it needs to communicate with when you connect to a container running on another host. This needs to be external from any one docker host, and preferably in an HA configuration (like swarm mode's consensus) so that it can continue to work even when some docker nodes are down.
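To make the role of the KV store concrete, here is a toy model of that bookkeeping. It is not how Docker implements IPAM; ToyIpam and its methods are invented for illustration, and a plain dict stands in for the external store.

```python
import ipaddress


class ToyIpam:
    """Toy address bookkeeping: one shared table of ip -> (host, container)."""

    def __init__(self, subnet):
        self.network = ipaddress.ip_network(subnet)
        self.store = {}  # stands in for the external KV store

    def allocate(self, host, container):
        """Hand out the first address in the subnet that is not yet recorded."""
        for ip in self.network.hosts():
            key = str(ip)
            if key not in self.store:
                self.store[key] = (host, container)
                return key
        raise RuntimeError("subnet exhausted")

    def release(self, ip):
        """Forget an address so it can be allocated again."""
        self.store.pop(ip, None)

    def host_for(self, ip):
        """Look up which Docker host a given container address lives on."""
        return self.store[ip][0]
```

Because every host consults the same table, an address is handed out only once, and an entry that is never released keeps pointing at a container that may no longer exist.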
Overlay networking between docker nodes only involves the nodes that have containers on that overlay network. So once the IP is allocated and discovered, all the communication only happens between the nodes with the relevant containers. This is easy to see with swarm mode if you create a network and then list networks on a worker, it won't be there. Once a container on that network gets scheduled, the network will appear. From docker, this reduces overhead of multi-host networking while also adding to the security of the architecture. The result looks like this graphic:
The Raft consensus itself is only needed for leader election. Once a node is selected to be the leader and enough nodes remain to have consensus, only one node is writing to the KV store and maintaining the current state. Everyone else is a follower. This animation describes it better than I ever could.
Lastly, you don't need to set up an external KV store to use overlay networking outside of swarm mode services. You can implement swarm mode, configure overlay networks with the --attachable option, and run containers outside of swarm mode on that network as you would have with an external KV store. I've used this in the past as a transition state to get containers into swarm mode, where some were running with docker-compose and others had been deployed as a swarm stack.

What is the replacement for `--net=container` in new docker networking?

In the pre docker 1.9 days I used to have a vpn provider container which I could use as the network gateway for my other containers by passing the option --net=container:[container-name].
This was very simple but had a major limitation in that the provider container had to exist prior to starting the consumers and it could not be restarted.
The new docker networking stack seems to have dropped this provision in favour of creating networks which does sound better, but I'm struggling to get equivalent behaviour.
Right now I have created an internal network (docker network create isolated --internal --subnet=172.32.0.0/16) and brought up 2 containers, one of which is attached only to the internal network and one of which is attached to both the default bridge and the internal network.
Now I need to route all network traffic from the isolated container through the connected one. I've messed around with some iptables rules but tbh this is not my strongest area.
So my questions are simply: Is my approach along the right lines? What rules need to be in place in the two containers to get this working as --net=container?

Kubernetes and MPI

I want to run an MPI job on my Kubernetes cluster. The context is that I'm actually running a modern, nicely containerised app but part of the workload is a legacy MPI job which isn't going to be re-written anytime soon, and I'd like to fit it into a kubernetes "worldview" as much as possible.
One initial question: has anyone had any success in running MPI jobs on a kube cluster? I've seen Christian Kniep's work in getting MPI jobs to run in docker containers, but he's going down the docker swarm path (with peer discovery using consul running in each container) and I want to stick to kubernetes (which already knows the info of all the peers) and inject this information into the container from the outside. I do have full control over all the parts of the application, e.g. I can choose which MPI implementation to use.
I have a couple of ideas about how to proceed:
1. fat containers containing slurm and the application code -> populate the slurm.conf with appropriate info about the peers at container startup -> use srun as the container entrypoint to start the jobs
2. slimmer containers with only OpenMPI (no slurm) -> populate a rankfile in the container with info from outside (provided by kubernetes) -> use mpirun as the container entrypoint
3. an even slimmer approach, where I basically "fake" the MPI runtime by setting a few environment variables (e.g. the OpenMPI ORTE ones) -> run the mpicc'd binary directly (where it'll find out about its peers through the env vars)
4. some other option
5. give up in despair
I know trying to mix "established" workflows like MPI with the "new hotness" of kubernetes and containers is a bit of an impedance mismatch, but I'm just looking for pointers/gotchas before I go too far down the wrong path. If nothing exists I'm happy to hack on some stuff and push it back upstream.
I tried MPI Jobs on Kubernetes for a few days and solved it by using dnsPolicy:None and dnsConfig (CustomDNS=true feature gate will be needed).
I pushed my manifests (as Helm chart) here.
https://github.com/everpeace/kube-openmpi
I hope it helps.
Assuming you don't want to use a hardware-specific MPI library (for example anything that uses direct access to the communication fabric), I would go with option 2.
1. First, implement a wrapper for mpirun which populates the necessary data using the Kubernetes API, specifically using endpoints if using a service (might be a good idea); it could also scrape the pods' exposed ports directly.
2. Add some form of checkpoint program that can be used for "rendezvous" synchronization before starting the actual running code (I don't know how well MPI deals with ephemeral nodes). This is to ensure that when mpirun starts it has a stable set of pods to use.
3. And finally, actually build a container with the necessary code and, I guess, an SSH service for mpirun to use for starting processes in other pods.
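A minimal sketch of the wrapper and rendezvous steps, assuming the wrapper has already fetched the service's Endpoints object from the Kubernetes API as a dict (the fetch itself is left out). The helper names, the slots value, and the expected peer count are all assumptions for illustration; the hostfile lines use OpenMPI's "host slots=N" form.

```python
def endpoint_ips(endpoints):
    """Collect the ready pod IPs from an Endpoints object (dict form)."""
    ips = []
    for subset in endpoints.get("subsets") or []:
        # Only "addresses" is read; "notReadyAddresses" is ignored on purpose.
        for address in subset.get("addresses") or []:
            ips.append(address["ip"])
    return sorted(set(ips))


def all_ready(endpoints, expected):
    """Rendezvous check: true once at least `expected` peers are ready."""
    return len(endpoint_ips(endpoints)) >= expected


def render_hostfile(ips, slots=1):
    """Render an OpenMPI hostfile with one line per peer."""
    return "".join("{} slots={}\n".format(ip, slots) for ip in ips)


def mpirun_command(hostfile_path, ips, slots, program):
    """Build the mpirun argument list for the discovered peers."""
    total = len(ips) * slots
    return ["mpirun", "--hostfile", hostfile_path, "-np", str(total), program]
```

A real wrapper would poll until all_ready returns true, write the hostfile, and then exec the command; how mpirun reaches the other pods (for example over SSH) is a separate concern.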
Another interesting option would be to use Stateful Sets, possibly even running with SLURM inside, which implement a "virtual" cluster of MPI machines running on kubernetes.
This provides stable hostnames for each node, which would reduce the problem of discovery and keeping track of state. You could also use statefully-assigned storage for the container's local work filesystem (which, with some work, could be made to always refer to the same local SSD, for example).
Another benefit is that it would probably be the least invasive to the actual application.
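To show why the stable hostnames help, the peer list for a Stateful Set can be computed up front rather than discovered. The sketch below assumes the set is governed by a headless service and that the cluster uses the default cluster.local domain; the function name and arguments are illustrative.

```python
def statefulset_hostnames(name, service, namespace, replicas,
                          cluster_domain="cluster.local"):
    """Return the DNS names of the pods <name>-0 .. <name>-(replicas-1)."""
    return [
        "{}-{}.{}.{}.svc.{}".format(name, i, service, namespace, cluster_domain)
        for i in range(replicas)
    ]
```

For example, statefulset_hostnames("mpi", "mpi", "default", 2) yields mpi-0.mpi.default.svc.cluster.local and mpi-1.mpi.default.svc.cluster.local, which could go straight into a hostfile or slurm.conf.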

How can I set up a Docker network with restricted communication?

I'm trying to create something like this:
The server containers each have port 8080 exposed, and accept requests from the client, but crucially, they are not allowed to communicate with each other.
The problem here is that the server containers are launched after the client container, so I can't pass container link flags to the client like I used to, since the containers it's supposed to link to don't exist yet.
I've been looking at the newer Docker networking stuff, but I can't use a bridge because I don't want server cross-communication to be possible. It also seems to me like one bridge per server doesn't scale well, and would be difficult to manage within the client container.
Is there some kind of switch-like docker construct that can do this?
It seems like you will need to create multiple bridge networks, one per container. To simplify that, you may want to use docker-compose to specify how the networks and containers should be provisioned, and have the docker-compose tool wire it all up correctly.
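As a sketch of what "one bridge network per server" might look like, the snippet below generates a version 2 compose structure in which the client joins every network and each server joins only its own. It assumes the server list is known up front; build_compose, the image names, and the "_net" suffix are made up for illustration.

```python
import json


def build_compose(server_names, client_image, server_image):
    """Build a compose (version 2) dict with one bridge network per server."""
    services = {"client": {"image": client_image, "networks": []}}
    networks = {}
    for name in server_names:
        net = name + "_net"
        networks[net] = {"driver": "bridge"}
        # Each server sees only its own network, so servers cannot reach each other.
        services[name] = {"image": server_image, "networks": [net]}
        # The client is attached to every server network.
        services["client"]["networks"].append(net)
    return {"version": "2", "services": services, "networks": networks}


def to_compose_text(compose):
    """Serialize as JSON, which is also valid YAML."""
    return json.dumps(compose, indent=2, sort_keys=True)
```

For servers that are started after the client, docker network connect can attach the already running client to a newly created network, which would take the place of regenerating the file.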
Resources:
https://docs.docker.com/engine/userguide/networking/dockernetworks/
https://docs.docker.com/compose/
https://docs.docker.com/compose/compose-file/#version-2
One more side note: I think that exposed ports are accessible to all networks. If that's right, you may be able to set all of the server networking to none and rely on the exposed ports to reach the servers.
Hope this is relevant to your use-case - I'm inferring the context of your actual application from the diagram and comments. I'd recommend you go the Service Discovery route. It may involve a small, simple API over a central store (say Redis, or SkyDNS), but would make things simpler in the long run.
Kubernetes, for instance, uses SkyDNS to do so with DNS. At the end of the day, any orchestration tool of your choice would most likely do something like this out of the box: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/dns
The idea is simple:
1. Use a DNS container that keeps entries of newly spawned servers.
2. Allow the client container to query it for a list of servers, e.g. picture a DNS response with a bunch of server-<<ISO Timestamp of Server Creation>>s.
3. Disallow client containers read-access to this DNS (how to manage this permission configuration without indirection, i.e. without proxying through an endpoint that allows writing into the DNS container but not reading, is going to be exotic).
Bonus Edit: I just realised you can use a simpler Redis-like setup to do this, and that DNS might just be overengineering :)

Resources