An observation about Docker iptables

⌛ 5 minute read

📋 Tags: Docker Networking Linux


“How do Docker containers talk?” is a surprisingly interesting question. The containerization rabbit hole runs deep; here I only skim the surface, so this should not be taken as technically complete.

Container networks in a nutshell

Initialization:

When you install Docker, the Docker daemon creates a network interface (by default, docker0). This network interface is a bridge.

This bridge typically takes the ‘gateway’ IP address “X.X.X.1”, and by default has a /16 subnet, meaning it can support roughly 2^16 hosts (65,534 usable addresses).

This bridge can be thought of as a ‘switch’ or ‘router’ - all containers (unless configured otherwise) are given an IP address within the bridge’s subnet.
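As a quick sanity check on those numbers, Python’s `ipaddress` module can model the default bridge subnet (`172.17.0.0/16` is assumed here; yours may differ):

```python
import ipaddress

# Docker's default bridge subnet (assumed; check `docker network inspect bridge`)
subnet = ipaddress.ip_network("172.17.0.0/16")

# The bridge conventionally takes the first host address (X.X.X.1)
gateway = next(subnet.hosts())
print(gateway)                    # 172.17.0.1

# A /16 has 2^16 addresses; minus network and broadcast = 65534 usable hosts
print(subnet.num_addresses - 2)   # 65534

# Any container IP the daemon hands out falls inside this range
print(ipaddress.ip_address("172.17.0.2") in subnet)  # True
```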

Container is launched:

When a container is spun up, the daemon gets to work.

It creates a virtual network interface for the container (something like vethabc123). This interface is one end of a veth pair (a virtual ethernet device), which acts as a tunnel between network namespaces: the veth interface on the host machine is paired with the ‘real’ eth interface inside the container.

The host machine’s veth handle is connected to the docker bridge.
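A veth pair behaves like a bidirectional pipe: bytes written into one end come out of the other. As a loose analogy only (real veth devices would require `ip link add ... type veth` and root privileges, and involve actual network namespaces), `socket.socketpair()` has the same “two connected endpoints” shape:

```python
import socket

# Two connected endpoints, analogous to the host-side vethX and the
# container-side eth0 of a veth pair (analogy only -- no namespaces here)
host_end, container_end = socket.socketpair()

host_end.sendall(b"ping from host")
print(container_end.recv(64))   # b'ping from host'

container_end.sendall(b"pong from container")
print(host_end.recv(64))        # b'pong from container'

host_end.close()
container_end.close()
```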

Then, IP addresses are assigned. Inside the container, eth0 is given an address within the bridge’s subnet, and its gateway is set to the bridge’s IP address. Note that the container also gets a unique MAC address.

The daemon also sets up some extra routing rules on the host machine: iptables (firewall) and NAT config to make sure everything can talk to each other.

The topology is as follows:

[Figure: sample topology, 2-container setup]

Setup Demo

To verify the topology, you can spin up a Docker container in interactive mode: $ docker run -it busybox sh.

Then, within the shell, run ifconfig to see the container’s IP address (inet addr):

/ # ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:AC:11:00:02  
          inet addr:172.17.0.2  Bcast:172.17.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:44 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:5875 (5.7 KiB)  TX bytes:0 (0.0 B)

On the host machine, you can verify that the docker0 bridge’s inet address is indeed the gateway (i.e. X.X.X.1):

host@machine:~$ ifconfig
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500 
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:beff:fe67:b35b  prefixlen 64  scopeid 0x20<link>
        ether 02:42:be:67:b3:5b  txqueuelen 0  (Ethernet)
        RX packets 194  bytes 9287 (9.2 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2033  bytes 283427 (283.4 KB)
        TX errors 0  dropped 59 overruns 0  carrier 0  collisions 0

When you spin up the Docker container, we can also see via dmesg that the daemon attaches the veth interface to the docker0 bridge!

[38671.621454] veth95d99cc: entered allmulticast mode
[38671.621557] veth95d99cc: entered promiscuous mode
[38671.708957] eth0: renamed from veth4ec9b6b
[38671.709288] docker0: port 1(veth95d99cc) entered blocking state
[38671.709294] docker0: port 1(veth95d99cc) entered forwarding state

Routing Rules

The daemon did some voodoo on the routing. Let’s demystify it. We can see the firewall and routing config the daemon set up with $ iptables -t nat -L -n -v!

Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
 1336  673K DOCKER     all  --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 DOCKER     all  --  *      *       0.0.0.0/0           !127.0.0.0/8          ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
   68  3889 MASQUERADE  all  --  *      !docker0  172.17.0.0/16        0.0.0.0/0           

Chain DOCKER 
 pkts bytes target     prot opt in     out     source               destination         
   90  4140 RETURN     all  --  docker0 *       0.0.0.0/0            0.0.0.0/0

This is a pain to read, but it sets up communication for the following cases.

Case 1: Traffic from container to the internet

Packets from the container travel through its eth0 to vethX which is mapped to the docker0 bridge.

Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
   68  3889 MASQUERADE  all  --  *      !docker0  172.17.0.0/16        0.0.0.0/0  

This rule is in play. It essentially says: any traffic from 172.17.0.0/16 (the Docker subnet) that is not leaving via the docker0 interface should be masqueraded to the host’s external IP (e.g. 192.168.x.x).
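The rule’s match logic boils down to two checks: source subnet and output interface. A minimal sketch of that decision in Python (the `172.17.0.0/16` subnet is the assumed default; `should_masquerade` is a hypothetical helper, not a Docker API):

```python
import ipaddress

# Assumed default bridge subnet
DOCKER_SUBNET = ipaddress.ip_network("172.17.0.0/16")

def should_masquerade(src_ip: str, out_iface: str) -> bool:
    """Mirror of: MASQUERADE all -- * !docker0 172.17.0.0/16 0.0.0.0/0"""
    return (ipaddress.ip_address(src_ip) in DOCKER_SUBNET
            and out_iface != "docker0")

# Container -> internet: the packet leaves via e.g. eth0, so it is NATed
print(should_masquerade("172.17.0.2", "eth0"))     # True

# Container -> container: the packet stays on docker0, so no NAT
print(should_masquerade("172.17.0.2", "docker0"))  # False
```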

Case 2: Traffic from container to another container on the same subnet

None of the NAT rules apply! The docker0 bridge acts as a layer 2 switch, forwarding traffic to the other container automatically. This is possible because the containers’ veth handles have been registered with the bridge (assuming the containers were run without special networking config).
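A toy model of what the bridge does here, sketched as a learning switch: it remembers which port each source MAC was last seen on, and forwards (or floods) based on the destination MAC. This is an illustration of generic layer 2 bridge behavior, not Docker’s actual implementation:

```python
class LearningBridge:
    """Toy layer-2 switch: learns source MACs, forwards by destination MAC."""

    def __init__(self, ports):
        self.ports = set(ports)   # e.g. the veth handles attached to the bridge
        self.mac_table = {}       # MAC address -> port it was last seen on

    def handle_frame(self, src_mac, dst_mac, in_port):
        self.mac_table[src_mac] = in_port        # learn the sender's port
        if dst_mac in self.mac_table:            # known destination:
            return [self.mac_table[dst_mac]]     #   forward out that one port
        return sorted(self.ports - {in_port})    # unknown: flood all others

bridge = LearningBridge(["veth_a", "veth_b"])

# First frame: destination unknown, so it is flooded to the other port
print(bridge.handle_frame("02:42:ac:11:00:02", "02:42:ac:11:00:03", "veth_a"))
# ['veth_b']

# Reply: the bridge has learned where 02:42:ac:11:00:02 lives
print(bridge.handle_frame("02:42:ac:11:00:03", "02:42:ac:11:00:02", "veth_b"))
# ['veth_a']
```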

For the sake of brevity, I will stop here. There are many other interesting data flow cases I could explore (e.g. what happens when a container runs a webapp and exposes a port to the internet), which may be worth a proper in-depth essay. Networking, as always, is a hell of a deep field.