How to: Podman rootless expose containers ports to the outside and see real client ip - networking

This is my first time asking something on stackoverflow. For years I've been lurking but now I decided to finally register myself. Hence, I apologize if my question/information is not formatted nicely.
Current situation:
I'm slowly getting more and more familiar with Podman and I'm in the process of moving some of my containers over from docker (rootful) to podman (rootless). I'm using Podman 4.3.1 on Debian 11. I've managed to get some containers working and was able to externally connect to them. However, the container shows client/source ip '127.0.0.1' instead of my real client's IPv4. I was wondering whether something like the following is possible?
Ideal situation:
Assigning a specific IPv4 to the container (rootless). Using nftables/iptables to forward packets from the host's network to the containers ipv4 (e.g. 192.168.1.12). Being able to see the real client's IPv4 in the container to still be able use fail2ban etc.
As you may notice, I'm still very much in the process of learning how containerization works and specifically for networking. I don't want to use the hosts network for my container for security reasons. If something is unclear tell me and I'll try to better explain myself.
Thanks for taking your time to read this :)

When you're running Podman as a non-root user, the virtual tap device that represents the container's eth0 interface can't be attached directly to a bridge device. This means it's not possible to use netfilter rules to direct traffic into the container; instead, Podman relies on a proxy process.
There are some notes on this configuration here.
By default, Podman uses the rootlessport proxy, which replaces the source ip of the connection with an internal ip from the container namespace. You can, however, explicitly request Podman to use slirp4netns as the port handler, which will preserve the source address at the expense of some performance.
For example, if I start a container like this:
podman run --name darkhttpd --rm -p 8080:8080 docker.io/alpinelinux/darkhttpd
And then connect to this from somewhere:
curl 192.168.1.200:8080
I will see in the access log:
10.0.2.100 - - [12/Feb/2023:15:30:54 +0000] "GET / HTTP/1.1" 200 354 "" "curl/7.85.0"
Where 10.0.2.100 is in fact the address of the container:
$ podman exec darkhttpd ip a show tap0
2: tap0: <BROADCAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN qlen 1000
link/ether 26:77:5b:e8:f4:6e brd ff:ff:ff:ff:ff:ff
inet 10.0.2.100/24 brd 10.0.2.255 scope global tap0
valid_lft forever preferred_lft forever
inet6 fd00::2477:5bff:fee8:f46e/64 scope global dynamic flags 100
valid_lft 86391sec preferred_lft 14391sec
inet6 fe80::2477:5bff:fee8:f46e/64 scope link
valid_lft forever preferred_lft forever
But if I explicitly request slirp4nets as the port handler:
podman run --name darkhttpd --rm -p 8080:8080 --network slirp4nets:port_handler=slirp4netns docker.io/alpinelinux/darkhttpd
Then in the access log I will see the actual source ip of the request:
192.168.1.97 - - [12/Feb/2023:15:32:17 +0000] "GET / HTTP/1.1" 200 354 "" "curl/7.74.0"
In most cases, you don't want to rely on the source ip address for authentication/authorization purposes, so the default behavior makes sense.
If you need the remote ip for logging purposes, the option presented here will work, or you can also look into running a front-end proxy in the global namespace that places the client ip into the X-Forwarded-For header and use that for your logs.

Here is an alternative solution not mentioned in the nice
answer by #larsks.
Socket activation
When using socket activation of containers, the source IP is available to the container.
Support for socket activation is not yet wide-spread but for instance the container image docker.io/library/mariadb supports socket activation. The container image docker.io/library/nginx also supports socket activation (although in a non-standard way, as nginx uses its own environment variable instead of using the standard systemd environment variable LISTEN_FDS)
I wrote a minimal demo of how to use run docker.io/library/nginx with Podman and socket activation:
https://github.com/eriksjolund/podman-nginx-socket-activation

Related

Can an OpenVPN Route over TEST-NET-1 (RFC 5735)

Background
I have a strange use-case where my VPN cannot be on any of the private subnets, but, also cannot use a TAP interface. The machine will be moving through different subnets, and requires access to the entire private address space by design. A single blocked IP would be considered a failure of design.
So, these are all off limits:
10.0.0.0/8
172.16.0.0/12
192.168.0.0/16
169.254.0.0/16
In searching for a solution, I came across RFC 5735, which defines:
192.0.2.0/24 TEST-NET-1
198.51.100.0/24 TEST-NET-2
203.0.113.0/24 TEST-NET-3
As:
For use in documentation and example code. It is often used in conjunction with domain names
example.com or example.net in vendor and protocol documentation. As described in [RFC5737], addresses within this block do not legitimately appear on the public Internet and can be used without any coordination with IANA or an Internet registry.
Which, was a "Jackpot" moment for me and my use case.
Config
I configured an OpenVPN server as such:
local 0.0.0.0
port 443
proto tcp
dev tun
topology subnet
server 203.0.113.0 255.255.255.0 # TEST-NET-3 RFC 5735
push "route 203.0.113.0 255.255.255.0"
...[Snip]...
With Client:
client
nobind
dev tun
proto tcp
...[Snip]...
And ufw rules:
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -s 203.0.113.0/24 -o ens160 -j MASQUERADE
COMMIT
However, upon running I get /sbin/ip route add 203.0.113.0/24 via 203.0.113.1 RTNETLINK answers: File exists in the error logs. While the VPN completes the rest of its connection successfully.
No connection
Running the following commands:
Server: sudo python3 -m http.server 80
Client: curl -X GET / 203.0.113.1
Results in:
curl: (28) Failed to connect to 203.0.113.1 port 80: Connection timed out
I have tried:
/sbin/ip route replace 203.0.113.0/24 dev tun 0 on client and server.
/sbin/ip route change 203.0.113.0/24 dev tun 0 on client and server.
Adding route 203.0.113.0 255.255.255.0 to the server.
Adding push "route 203.0.113.0 255.255.255.0 127.0.0.1" to server
And none of it seems to work.
Does anyone have any idea how I can force the client to push this traffic over the VPN to my server, instead of to the public IP?
This does actually work!
Just dont forget to allow connections within your firewall. I fixed my config with:
sudo ufw allow in on tun0
However, 198.18.0.0/15 and 100.64.0.0/10 defined as Benchmarking and Shared address space respectively, may be more appropriate choices, since being able to forward TEST-NET addresses may be considered a bug.

Can't expose port with podman

I am trying to walk through a tutorial that brings up an application in a docker/podman container instance.
I have attempted to use -p port:port and --expose port but neither seems to work.
I've ensured that I see the port in a listen state with ss -an.
I've made sure there isn't anything else trying to bind to that port.
No matter what I do, I can never hit localhost:port or ip_address:port.
I feel like I am fundamentally missing something but don't know where to look next.
Any suggestions for things to try or documentation to review would be appreciated.
Thanks,
Shawn
Expose: (Reference)
Expose tells Podman that the container requests that port be open, but
does not forward it. All exposed ports will be forwarded to random
ports on the host if and only if --publish-all is also specified
As per Redhat documentation for Containerfile,
EXPOSE indicates that the container listens on the specified network
port at runtime. The EXPOSE instruction defines metadata only; it does
not make ports accessible from the host. The -p option in the podman
run command exposes container ports from the host.
To specify Port Number,
The -p option in the podman run command exposes container ports from
the host.
Example:
podman run -d -p 8080:80 --name httpd-basic quay.io/httpd-parent:2.4
In above example, Port # 80 is the port number which Container listens/exposes and we can access this from outside the container via Port # 8080

What is the relation between docker0 and eth0?

I know by default docker creates a virtual bridge docker0, and all container network are linked to docker0.
As illustrated above:
container eth0 is paired with vethXXX
vethXXX is linked to docker0 same as a machine linked to switch
But what is the relation between docker0 and host eth0?
More specifically:
When a packet flows from container to docker0, how does it know it will be forwarded to eth0, and then to the outside world?
When an external packet arrives to eth0, why it is forwarded to docker0 then container? instead of processing it or drop it?
Question 2 can be a little confusing, I will keep it there and explained a little more:
It is a return packet that initialed by container(in question 1): since the outside does not know container network, the packet is sent to host eth0. How it is forwarded to container? I mean, there must be some place to store the information, how can I check it?
Thanks in advance!
After reading the answer and official network articles, I find the following diagram more accurate that docker0 and eth0 has no direct link,instead they can forward packets:
http://dockerone.com/uploads/article/20150527/e84946a8e9df0ac6d109c35786ac4833.png
There is no direct link between the default docker0 bridge and the hosts ethernet devices. If you use the --net=host option for a container then the hosts network stack will be available in the container.
When a packet flows from container to docker0, how does it know it will be forwarded to eth0, and then to the outside world?
The docker0 bridge has the .1 address of the Docker network assigned to it, this is usually something around a 172.17 or 172.18.
$ ip address show dev docker0
8: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:03:47:33:c1 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 scope global docker0
valid_lft forever preferred_lft forever
Containers are assigned a veth interface which is attached to the docker0 bridge.
$ bridge link
10: vethcece7e5 state UP #(null): <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master docker0 state forwarding priority 32 cost 2
Containers created on the default Docker network receive the .1 address as their default route.
$ docker run busybox ip route show
default via 172.17.0.1 dev eth0
172.17.0.0/16 dev eth0 src 172.17.0.3
Docker uses NAT MASQUERADE for outbound traffic from there and it will follow the standard outbound routing on the host, which may or may not involve eth0.
$ iptables -t nat -vnL POSTROUTING
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
0 0 MASQUERADE all -- * !docker0 172.17.0.0/16 0.0.0.0/0
iptables handles the connection tracking and return traffic.
When an external packet arrives to eth0, why it is forwarded to docker0 then container? instead of processing it or drop it?
If you are asking about the return path for outbound traffic from the container, see iptables above as the MASQUERADE will map the connection back through.
If you mean new inbound traffic, Packets are not forwarded into a container by default. The standard way to achieve this is to setup a port mapping. Docker launches a daemon that listens on the host on port X and forwards to the container on port Y.
I'm not sure why NAT wasn't used for inbound traffic as well. I've run into some issues trying to map large numbers of ports into containers which led to mapping real world interfaces completely into containers.
You can detect the relation by network interface iflink from a container and ifindex on the host machine.
Get iflink from a container:
$ docker exec ID cat /sys/class/net/eth0/iflink
17253
Then find this ifindex among interfaces on the host machine:
$ grep -l 17253 /sys/class/net/veth*/ifindex
/sys/class/net/veth02455a1/ifindex

Docker 1.10 container's IP in LAN

Since Docker 1.10 (and libnetwork update) we can manually give an IP to a container inside a user-defined network, and that's cool!
I want to give a container an IP address in my LAN (like we can do with Virtual Machines in "bridge" mode). My LAN is 192.168.1.0/24, all my computers have IP addresses inside it. And I want my containers having IPs in this range, in order to reach them from anywhere in my LAN (without NAT/PAT/etc...).
I obviously read Jessie Frazelle's blog post and a lot of others post here and everywhere like :
How to set a docker container's iP?
How to assign specific IP to container and make that accessible outside of VM host?
and so much more, but nothing came out; my containers still have IP addresses "inside" my docker host, and are not reachable for others computers on my LAN.
Reading Jessie Frazelle's blog post, I thought (since she uses public IP) we can do what I want to do?
Edit: Indeed, if I do something like :
network create --subnet 192.168.1.0/24 --gateway 192.168.1.1 homenet
docker run --rm -it --net homenet --ip 192.168.1.100 nginx
The new interface on the docker host (br-[a-z0-9]+) take the '--gateway' IP, which is my router IP. And the same IP on two computers on the network... BOOM
Thanks in advance.
EDIT : This solution is now useless. Since version 1.12, Docker provides two network drivers : macvlan and ipvlan. They allow assigning static IP from the LAN network. See the answer below.
After looking for people who have the same problem, we went to a workaround :
Sum up :
(V)LAN is 192.168.1.0/24
Default Gateway (= router) is 192.168.1.1
Multiple Docker Hosts
Note : We have two NIC : eth0 and eth1 (which is dedicated to Docker)
What do we want :
We want to have containers with ip in the 192.168.1.0/24 network (like computers) without any NAT/PAT/translation/port-forwarding/etc...
Problem
When doing this :
network create --subnet 192.168.1.0/24 --gateway 192.168.1.1 homenet
we are able to give containers the IP we want to, but the bridge created by docker (br-[a-z0-9]+) will have the IP 192.168.1.1, which is our router.
Solution
1. Setup the Docker Network
Use the DefaultGatewayIPv4 parameter :
docker network create --subnet 192.168.1.0/24 --aux-address "DefaultGatewayIPv4=192.168.1.1" homenet
By default, Docker will give to the bridge interface (br-[a-z0-9]+) the first IP, which might be already taken by another machine. The solution is to use the --gateway parameter to tell docker to assign a arbitrary IP (which is available) :
docker network create --subnet 192.168.1.0/24 --aux-address "DefaultGatewayIPv4=192.168.1.1" --gateway=192.168.1.200 homenet
We can specify the bridge name by adding -o com.docker.network.bridge.name=br-home-net to the previous command.
2. Bridge the bridge !
Now we have a bridge (br-[a-z0-9]+) created by Docker. We need to bridge it to a physical interface (in my case I have to NIC, so I'm using eth1 for that):
brctl addif br-home-net eth1
3. Delete the bridge IP
We can now delete the IP address from the bridge, since we don't need one :
ip a del 192.168.1.200/24 dev br-home-net
The IP 192.168.1.200 can be used as bridge on multiple docker host, since we don't use it, and we remove it.
Docker now supports Macvlan and IPvlan network drivers. The Docker documentation for both network drivers can be found here.
With both drivers you can implement your desired scenario (configure a container to behave like a virtual machine in bridge mode):
Macvlan: Allows a single physical network interface (master device) to have an arbitrary number of slave devices, each with it's own MAC adresses.
Requires Linux kernel v3.9–3.19 or 4.0+.
IPvlan: Allows you to create an arbitrary number of slave devices for your master device which all share the same MAC address.
Requires Linux kernel v4.2+ (support for earlier kernels exists but is buggy).
See the kernel.org IPVLAN Driver HOWTO for further information.
Container connectivity is achieved by putting one of the slave devices into the network namespace of the container to be configured. The master devices remains on the host operating system (default namespace).
As a rule of thumb you should use the IPvlan driver if the Linux host that is connected to the external switch / router has a policy configured that allows only one MAC per port. That's often the case in VMWare ESXi environments!
Another important thing to remember (Macvlan and IPvlan): Traffic to and from the master device cannot be sent to and from slave devices. If you need to enable master to slave communication see section "Communication with the host (default-ns)" in the "IPVLAN – The beginning" paper published by one of the IPvlan authors (Mahesh Bandewar).
Use the official Docker driver:
As of Docker v1.12.0-rc2, the new MACVLAN driver is now available in an official Docker release:
MacVlan driver is out of experimental #23524
These new drivers have been well documented by the author(s), with usage examples.
End of the day it should provide similar functionality, be easier to setup, and with fewer bugs / other quirks.
Seeing Containers on the Docker host:
Only caveat with the new official macvlan driver is that the docker host machine cannot see / communicate with its own containers. Which might be desirable or not, depending on your specific situation.
This issue can be worked-around if you have more than 1 NIC on your docker host machine. And both NICs are connected to your LAN. Then can either A) dedicate 1 of your docker hosts's 2 nics to be for docker exclusively. And be using the remaining nic for the host to access the LAN.
Or B) by adding specific routes to only those containers you need to access via the 2nd NIC. For example:
sudo route add -host $container_ip gw $lan_router_ip $if_device_nic2
Method A) is useful if you want to access all your containers from the docker host and you have multiple hardwired links.
Wheras method B) is useful if you only require access to a few specific containers from the docker host. Or if your 2nd NIC is a wifi card and would be much slower for handling all of your LAN traffic. For example on a laptop computer.
Installation:
If cannot see the pre-release -rc2 candidate on ubuntu 16.04, temporarily add or modify this line to your /etc/apt/sources.list to say:
deb https://apt.dockerproject.org/repo ubuntu-xenial testing
instead of main (which is stable releases).
I no longer recommended this solution. So it's been removed. It was using bridge driver and brctrl
.
There is a better and official driver now. See other answer on this page: https://stackoverflow.com/a/36470828/287510
Here is an example of using macvlan. It starts a web server at http://10.0.2.1/.
These commands and Docker Compose file work on QNAP and QNAP's Container Station. Notice that QNAP's network interface is qvs0.
Commands:
The blog post "Using Docker macvlan networks"[1][2] by Lars Kellogg-Stedman explains what the commands mean.
docker network create -d macvlan -o parent=qvs0 --subnet 10.0.0.0/8 --gateway 10.0.0.1 --ip-range 10.0.2.0/24 --aux-address "host=10.0.2.254" macvlan0
ip link del macvlan0-shim link qvs0 type macvlan mode bridge
ip link add macvlan0-shim link qvs0 type macvlan mode bridge
ip addr add 10.0.2.254/32 dev macvlan0-shim
ip link set macvlan0-shim up
ip route add 10.0.2.0/24 dev macvlan0-shim
docker run --network="macvlan0" --ip=10.0.2.1 -p 80:80 nginx
Docker Compose
Use version 2 because version 3 does not support the other network configs, such as gateway, ip_range, and aux_address.
version: "2.3"
services:
HTTPd:
image: nginx:latest
ports:
- "80:80/tcp"
- "80:80/udp"
networks:
macvlan0:
ipv4_address: "10.0.2.1"
networks:
macvlan0:
driver: macvlan
driver_opts:
parent: qvs0
ipam:
config:
- subnet: "10.0.0.0/8"
gateway: "10.0.0.1"
ip_range: "10.0.2.0/24"
aux_address: "host=10.0.2.254"
It's possible map a physical interface into a container via pipework.
Connect a container to a local physical interface
pipework eth2 $(docker run -d hipache /usr/sbin/hipache) 50.19.169.157/24
pipework eth3 $(docker run -d hipache /usr/sbin/hipache) 107.22.140.5/24
There may be a native way now but I haven't looked into that for the 1.10 release.

Having two NIC's on a Docker image before it boots

I have a Docker image with ubuntu:latest and a few other dependencies. The script starts automatically with expected 2 NIC's eth0 (which is there by default) and eth1. Because this second NIC isn't there the script crashes and the container is stopped. So using Pipeworks doesn't work as the container isn't there anymore.
I tried to add this to the Dockerfile:
RUN echo "auto eth1" >> /etc/network/interfaces
RUN echo "iface eth1 inet dhcp" >> /etc/network/interfaces
But that didn't work either.
Is there a proper way to achieve this, or else a hack to begin with :-)
Maybe later having something like "NIC eth1 dhcp" would be cool.
-Mark
Try to add the following to your docker file: RUN ifup eth1
I added the required features to Pipework!
You can now wait for the network to be ready by copying the pipework script to your container, and running pipework --wait.
There is also DHCP support if you specify dhcp instead of an ip.ad.dr.ess/mask when running the pipework script on the host.

Resources