Docker containers with IPv6 behind NAT

WARNING

In production IPv6 should always be used without NAT. Only use IPv6 and NAT for testing purposes. There is no valid reason to use IPv6 with NAT in any production environment.

IPv6 and NAT

IPv6 is designed to remove the need for NAT and that is a very, very good thing. NAT breaks peer-to-peer connections, and getting those back is one of the great things about IPv6. Every device on the internet gets its own public IP address again.

Docker and IPv6

Support for IPv6 in Docker has been there for a while now. It is disabled by default, however. The documentation describes how to enable it.

I wanted to enable IPv6 on my Docker setup on my laptop running Ubuntu, but as my laptop is a mobile device the IPv6 prefix I have changes when I move to a different location. IPv6 Prefix Delegation isn’t available at every IPv6-enabled location either, so I wanted to figure out if I could enable IPv6 in my Docker setup locally and use NAT to have my containers reach the internet over IPv6.

At home I have IPv6 via ZeelandNet and at the office we have a VDSL connection from XS4All. When I’m at a remote location I enable our OpenVPN tunnel which has IPv6 enabled. This way I always have IPv6 available.

The Docker documentation shows that enabling IPv6 is very easy. I modified the systemd service file of docker and added a fixed IPv6 CIDR:

ExecStart=/usr/bin/dockerd --ipv6 --fixed-cidr-v6="fd00::/64" -H fd://

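fd00::/64 lies in the Unique Local Address (ULA) range fc00::/7 (RFC 4193), the successor of the deprecated Site-Local range fec0::/10. It is not routed on the internet, so it can be safely used locally.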
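
As a quick sanity check, Python's standard ipaddress module can classify the prefix. This is only a sketch: the two reference ranges (fc00::/7 for Unique Local Addresses, fec0::/10 for the deprecated Site-Local range) are taken from the RFCs, not from the Docker documentation.

```python
from ipaddress import IPv6Network

docker_prefix = IPv6Network("fd00::/64")   # the prefix passed to --fixed-cidr-v6
unique_local = IPv6Network("fc00::/7")     # Unique Local Addresses (RFC 4193)
site_local = IPv6Network("fec0::/10")      # Site-Local, deprecated by RFC 3879

# subnet_of() requires Python 3.7 or newer
in_ula = docker_prefix.subnet_of(unique_local)
in_site_local = docker_prefix.subnet_of(site_local)

print(f"{docker_prefix} inside {unique_local}: {in_ula}")
print(f"{docker_prefix} inside {site_local}: {in_site_local}")
```

The prefix falls inside the ULA range and outside the old Site-Local range.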

I then added a NAT rule into ip6tables so that it would NAT for me:

sudo ip6tables -t nat -A POSTROUTING -s fd00::/64 -j MASQUERADE

Result

My Docker containers now get an IPv6 address, as can be seen below:

root@da80cf3d8532:~# ip -6 a
1: lo:  mtu 65536 state UNKNOWN qlen 1
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
15: eth0@if16:  mtu 1500 state UP 
    inet6 fd00::242:ac11:2/64 scope global nodad 
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:2/64 scope link 
       valid_lft forever preferred_lft forever
root@da80cf3d8532:~#

In this case the address is fd00::242:ac11:2, which was assigned by Docker.
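
Both addresses on eth0 appear to be derived from the MAC address of the container. The sketch below reproduces them in Python. The MAC address 02:42:ac:11:00:02 is not shown in the output above; it is inferred from the addresses and should be treated as an assumption. That Docker appends the MAC address to the prefix is likewise an observation from this output, not a guarantee. The link-local address follows the standard modified EUI-64 rules.

```python
from ipaddress import IPv6Address, IPv6Network

ASSUMED_MAC = "02:42:ac:11:00:02"  # inferred from the output, not shown in it


def prefix_plus_mac(prefix: str, mac: str) -> IPv6Address:
    """Use the 48-bit MAC address as the low bits of an address in the prefix."""
    mac_value = int(mac.replace(":", ""), 16)
    return IPv6Address(int(IPv6Network(prefix).network_address) + mac_value)


def eui64_link_local(mac: str) -> IPv6Address:
    """Build the fe80::/64 address from a MAC using modified EUI-64."""
    octets = bytearray.fromhex(mac.replace(":", ""))
    octets[0] ^= 0x02  # flip the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    base = int(IPv6Network("fe80::/64").network_address)
    return IPv6Address(base + int.from_bytes(interface_id, "big"))


print(prefix_plus_mac("fd00::/64", ASSUMED_MAC))
print(eui64_link_local(ASSUMED_MAC))
```

With the assumed MAC address this yields the same two addresses as the ip -6 a output.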

Since my laptop has IPv6 I can now ping pcextreme.nl from my Docker container.

root@da80cf3d8532:~# ping6 -c 3 pcextreme.nl -n
PING pcextreme.nl (2a00:f10:101:0:46e:c2ff:fe00:93): 56 data bytes
64 bytes from 2a00:f10:101:0:46e:c2ff:fe00:93: icmp_seq=0 ttl=61 time=14.368 ms
64 bytes from 2a00:f10:101:0:46e:c2ff:fe00:93: icmp_seq=1 ttl=61 time=16.132 ms
64 bytes from 2a00:f10:101:0:46e:c2ff:fe00:93: icmp_seq=2 ttl=61 time=15.790 ms
--- pcextreme.nl ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 14.368/15.430/16.132/0.764 ms
root@da80cf3d8532:~#

Again, this should ONLY be used for testing purposes. For production IPv6 Prefix Delegation is the route to go down.

Testing Ceph BlueStore with the Kraken release

Ceph version Kraken (11.2.0) has been released and the Release Notes tell us that the new BlueStore backend for the OSDs is now available.

BlueStore

The current backend for the OSDs is the FileStore, which mainly uses the XFS filesystem to store its data. To overcome several limitations of XFS and POSIX in general, the BlueStore backend was developed.

It should provide better performance (mainly for writes), data safety through checksumming, and compression.

Users are encouraged to test BlueStore starting with the Kraken release for non-production and non-critical data sets and report back to the community.

Deploying with BlueStore

To deploy OSDs with BlueStore you can use ceph-deploy with the --bluestore flag.

I created a simple test cluster with three machines: alpha, bravo and charlie.

Each machine will be running a ceph-mon and a ceph-osd process.

This is the sequence of ceph-deploy commands I used to deploy the cluster:

ceph-deploy new alpha bravo charlie
ceph-deploy mon create alpha bravo charlie

Now, edit the ceph.conf file in the current directory and add:

[osd]
enable_experimental_unrecoverable_data_corrupting_features = bluestore

With this setting we allow the use of BlueStore and we can now deploy our OSDs:

ceph-deploy --overwrite-conf osd create --bluestore alpha:sdb bravo:sdb charlie:sdb

Running BlueStore

This tiny cluster now runs three OSDs with BlueStore:

root@alpha:~# ceph -s
    cluster c824e460-2f09-4994-8b2f-108aedc52d19
     health HEALTH_OK
     monmap e2: 3 mons at {alpha=[2001:db8::100]:6789/0,bravo=[2001:db8::101]:6789/0,charlie=[2001:db8::102]:6789/0}
            election epoch 14, quorum 0,1,2 alpha,bravo,charlie
        mgr active: charlie standbys: alpha, bravo
     osdmap e14: 3 osds: 3 up, 3 in
            flags sortbitwise,require_jewel_osds,require_kraken_osds
      pgmap v24: 64 pgs, 1 pools, 0 bytes data, 0 objects
            43356 kB used, 30374 MB / 30416 MB avail
                  64 active+clean
root@alpha:~#
root@alpha:~# ceph osd tree
ID WEIGHT  TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.02907 root default                                       
-2 0.00969     host alpha                                     
 0 0.00969         osd.0         up  1.00000          1.00000 
-3 0.00969     host bravo                                     
 1 0.00969         osd.1         up  1.00000          1.00000 
-4 0.00969     host charlie                                   
 2 0.00969         osd.2         up  1.00000          1.00000 
root@alpha:~#

On alpha I see that osd.0 only has a small partition for a bit of configuration and the rest is used by BlueStore.

root@alpha:~# df -h /var/lib/ceph/osd/ceph-0
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        97M  5.4M   92M   6% /var/lib/ceph/osd/ceph-0
root@alpha:~# lsblk 
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0    8G  0 disk 
├─sda1   8:1    0  7.5G  0 part /
├─sda2   8:2    0    1K  0 part 
└─sda5   8:5    0  510M  0 part [SWAP]
sdb      8:16   0   10G  0 disk 
├─sdb1   8:17   0  100M  0 part /var/lib/ceph/osd/ceph-0
└─sdb2   8:18   0  9.9G  0 part 
sdc      8:32   0   10G  0 disk 
root@alpha:~# cat /var/lib/ceph/osd/ceph-0/type
bluestore
root@alpha:~#

The OSDs should work just like OSDs running FileStore, but they should perform better.

Running headless VirtualBox inside Nested KVM

For the Ceph training at 42on I use VirtualBox to build Virtual Machines. This is because they work under macOS, Windows and Linux.

For the internal Git at 42on we use Gitlab and I wanted to use Gitlab’s CI to build my Virtual Machines automatically.

As we don’t have any physical hardware at 42on (everything runs in the cloud) I wanted to see if I could run VirtualBox Headless inside a VM with Nested KVM enabled.

Nested KVM

The first thing I checked was if my KVM Virtual Machine actually supported Nested KVM. This can be verified with the kvm-ok command under Ubuntu:

root@glrun01:~# kvm-ok 
INFO: /dev/kvm exists
KVM acceleration can be used
root@glrun01:~#

With that verified, I tried to install VirtualBox.

VirtualBox

Installing VirtualBox is straightforward. Just add the repository and install the packages. Don’t forget to reboot afterwards to make sure all kernel modules are properly installed and loaded.

apt-get install virtualbox

VirtualBox Extension Pack

The trick to get everything working properly is to install Oracle’s VirtualBox Extension Pack. It took me a while to figure out that I needed to install it manually. It isn’t installed by default.

You need to download the pack and install it using the VBoxManage command.

wget http://download.virtualbox.org/virtualbox/5.0.24/Oracle_VM_VirtualBox_Extension_Pack-5.0.24.vbox-extpack
vboxmanage extpack install Oracle_VM_VirtualBox_Extension_Pack-5.0.24.vbox-extpack
vboxmanage list extpacks
vboxmanage setproperty vrdeextpack "Oracle VM VirtualBox Extension Pack"

With that installed and configured I rebooted the machine again just to be sure.

It works!

With that it actually worked. The VirtualBox VMs can now be built inside a Nested KVM machine controlled by Gitlab’s CI 🙂

VirtualBox images to experiment with IPv6

Around me I noticed that a lot of people don’t have hands-on experience with IPv6. The networks they work in do not support IPv6 nor does their ISP provide them with native IPv6 connectivity at home.

On my local systems I often use VirtualBox to set up (IPv6) testing environments. I thought I’d create some Virtual Machine images so that others can get some hands-on experience with IPv6.

The images and README can be found on Github and are aimed to be easy to install and work with.

Requirements

To run the images you need to have VirtualBox installed. You should also be able to use the Linux command line, as the Virtual Machines are based on Ubuntu 16.04.

More information can be found in the repository on Github in the README file.

Download

You can download the images here.

How to use

Please take a look at the README on Github. It tells you how to use them.

Happy testing!

Hitch TLS Proxy performance with 15k certificates

While testing the Hitch TLS proxy in front of Varnish I stumbled upon a slow startup with a large number of certificates.

In this case we (at PCextreme) want to run Hitch with around 50.000 certificates configured.

The webpage of Hitch says:

Safe for large installations: performant up to 15 000 listening sockets and 500 000 certificates.

10 minutes

I started testing on my local desktop with 15.000 certificates. My desktop is an Intel NUC with Ubuntu 14.04.

wido@wido-desktop:~/repos/hitch/src$ time sudo ./hitch -n 4 -u nobody -g nogroup --config=/opt/hitch/hitch.conf

real    9m40.088s
user    9m38.482s
sys     0m0.829s
wido@wido-desktop:~/repos/hitch/src$

A 10 minute startup time for Hitch is rather long. We started searching for the root cause.

OpenSSL

After some searching we discovered the OpenSSL version in Ubuntu 14.04 was the problem. Testing with Ubuntu 15.10 showed us different results.

root@VM-9d8e8cfd-e30f-4c40-8c4e-2e098b0f11a5:~# time hitch --daemon --pidfile=/run/hitch.pid --user hitch --group hitch --config=/etc/hitch/hitch.conf

real    0m18.673s
user    0m6.780s
sys    0m2.000s

18 seconds is a lot better than 10 minutes!

Ubuntu 14.04 comes with OpenSSL 1.0.1f and Ubuntu 15.10 with 1.0.2d and that is where the difference seems to be.

100.000 certificates

After this we started testing with 100k certificates. It took 48 seconds to start with that amount of certificates configured.

For production we will use Ubuntu 16.04, which shows similar results to Ubuntu 15.10.

So if you find Hitch slow when starting, check your OpenSSL version.
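
Comparing OpenSSL version strings by hand is error-prone because of the trailing letter. The sketch below shows one way to do it in Python. The helper parse_openssl_version is hypothetical and written for this post. Note that ssl.OPENSSL_VERSION reports the library Python itself is linked against, which is not necessarily the one Hitch uses.

```python
import re
import ssl


def parse_openssl_version(text: str):
    """Turn '1.0.2d' or 'OpenSSL 1.0.2d 9 Jul 2015' into a comparable tuple."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)([a-z]*)", text)
    if match is None:
        raise ValueError(f"no OpenSSL version found in {text!r}")
    major, minor, fix, letter = match.groups()
    return (int(major), int(minor), int(fix), letter)


slow_version = parse_openssl_version("1.0.1f")  # Ubuntu 14.04
fast_version = parse_openssl_version("1.0.2d")  # Ubuntu 15.10

print("1.0.1f is older than 1.0.2d:", slow_version < fast_version)
print("Python is linked against:", ssl.OPENSSL_VERSION)
```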

AnyIP: Bind a whole subnet to your Linux machine

IPv6 Prefix Delegation

In my previous post I wrote about how you can use Docker with IPv6 and Prefix Delegation.

An IPv6 subnet routed to a Linux machine can be used for other things than Docker. That’s where the AnyIP feature of the kernel comes in.

Linux Kernel AnyIP

The AnyIP feature of the Linux kernel allows you to bind a complete IPv4 or IPv6 subnet to your system.

Instead of adding all addresses manually to the kernel you can tell it to bind a complete subnet.

Configuring

IPv4

ip -4 route add local 192.168.0.0/24 dev lo

In this case the Linux kernel will now respond to ARP requests for any IPv4 address in the 192.168.0.0/24 subnet.

IPv6

ip -6 route add local 2001:db8:100::/64 dev lo

In this case the kernel will respond to Neighbor Solicitations for any IPv6 address in the 2001:db8:100::/64 subnet.

Example usage

Let’s assume that you have the IPv6 prefix 2001:db8:100::/60 routed to your Linux machine through IPv6 prefix delegation.

From that /60 subnet we take the first /64 subnet and attach it to lo.

ip -6 route add local 2001:db8:100::/64 dev lo

You can now ping any of the addresses in that subnet:

  • 2001:db8:100::1
  • 2001:db8:100::100
  • 2001:db8:100::200
  • 2001:db8:100::dead:b33f
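
Taking the first /64 out of the delegated /60 can be sketched with Python's ipaddress module. The prefix and the four addresses are the ones from this example. The printed route command is only built as a string here and is not executed.

```python
from ipaddress import IPv6Address, IPv6Network

delegated = IPv6Network("2001:db8:100::/60")

# A /60 contains 2^(64-60) = 16 subnets of size /64
subnets = list(delegated.subnets(new_prefix=64))
first_subnet = subnets[0]

examples = [
    "2001:db8:100::1",
    "2001:db8:100::100",
    "2001:db8:100::200",
    "2001:db8:100::dead:b33f",
]
all_inside = all(IPv6Address(address) in first_subnet for address in examples)

print(f"{len(subnets)} subnets, first one is {first_subnet}")
print(f"ip -6 route add local {first_subnet} dev lo")
print("all example addresses covered:", all_inside)
```

The remaining fifteen /64 subnets stay free for other uses, such as Docker.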

If you start a web server which listens on port 80, you can connect to any of the IPv6 addresses in that subnet and the web server will respond.

Use cases

It could be that you want to do mass shared hosting on a system where you want to assign each hostname or domain name its own IPv6 address. Instead of attaching single IPs to an interface you can simply attach a complete subnet and point traffic to any of the IPs in that subnet.

Demo

On PCextreme’s Aurora Compute I deployed an Instance with Prefix Delegation enabled.

After running ‘dhclient’ I got the subnet 2a00:f10:500:40::/60 assigned to my Instance.

It was then just one line to attach a /64 subnet:

ip -6 route add local 2a00:f10:500:40::/64 dev lo

Random address generator

I wrote a small piece of Python code to generate a random IPv6 address:

#!/usr/bin/env python3
"""
Generate a random IPv6 address for a specified subnet
"""

from random import seed, getrandbits
from ipaddress import IPv6Network, IPv6Address

subnet = '2a00:f10:500:40::/64'

seed()
network = IPv6Network(subnet)
address = IPv6Address(network.network_address + getrandbits(network.max_prefixlen - network.prefixlen))

print(address)

Using a small loop in Bash I could now ping random addresses in that subnet:

while true; do ping6 -c 2 "$(./random-ipv6.py)"; done

Some example output:

--- 2a00:f10:500:40:d142:1092:ea84:74b4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 10.252/11.680/13.108/1.428 ms
PING 2a00:f10:500:40:4e50:f264:6ea9:d184(2a00:f10:500:40:4e50:f264:6ea9:d184) 56 data bytes
64 bytes from 2a00:f10:500:40:4e50:f264:6ea9:d184: icmp_seq=1 ttl=56 time=10.0 ms
64 bytes from 2a00:f10:500:40:4e50:f264:6ea9:d184: icmp_seq=2 ttl=56 time=10.0 ms

--- 2a00:f10:500:40:4e50:f264:6ea9:d184 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 10.085/10.087/10.089/0.002 ms
PING 2a00:f10:500:40:d831:1f89:b06d:fe12(2a00:f10:500:40:d831:1f89:b06d:fe12) 56 data bytes
64 bytes from 2a00:f10:500:40:d831:1f89:b06d:fe12: icmp_seq=1 ttl=56 time=9.77 ms
64 bytes from 2a00:f10:500:40:d831:1f89:b06d:fe12: icmp_seq=2 ttl=56 time=10.1 ms

--- 2a00:f10:500:40:d831:1f89:b06d:fe12 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1005ms
rtt min/avg/max/mdev = 9.777/9.958/10.140/0.207 ms
PING 2a00:f10:500:40:2c45:26ee:5b93:fa2(2a00:f10:500:40:2c45:26ee:5b93:fa2) 56 data bytes
64 bytes from 2a00:f10:500:40:2c45:26ee:5b93:fa2: icmp_seq=1 ttl=56 time=10.2 ms
64 bytes from 2a00:f10:500:40:2c45:26ee:5b93:fa2: icmp_seq=2 ttl=56 time=10.0 ms

Installing and testing NixOS

NixOS

NixOS is a minimal and flexible Linux distribution which doesn’t use any of the existing package managers.

NixOS is a Linux distribution with a unique approach to package and configuration management. Built on top of the Nix package manager, it is completely declarative, makes upgrading systems reliable, and has many other advantages.

I wanted to test NixOS and see if it could be a candidate for a very minimal KVM hypervisor running just Qemu, libvirt and Apache CloudStack.

With this post I just wanted to share how you can quickly install NixOS inside a VirtualBox VM.

VirtualBox

On my desktop and laptop I usually use VirtualBox to quickly test something inside Virtual Machines. In this case I downloaded the NixOS minimal 64-bit ISO and created a VM:

  • 1024MB of memory
  • 8GB SATA disk
  • NixOS ISO attached

Installation

After you start the VM it will boot from the ISO. You will then find yourself in a root prompt saying just nixos.

The first step is to format your disk and mount it under /mnt.

parted /dev/sda mklabel msdos
parted /dev/sda mkpart primary 0% 100%
mkfs.xfs /dev/sda1
mount /dev/sda1 /mnt

Once that is done you can run:

nixos-generate-config --root /mnt

This will generate /mnt/etc/nixos/configuration.nix from where you can configure your OS.

This is what I used as my configuration:

{ config, pkgs, ... }:

{
  imports = [
      ./hardware-configuration.nix
    ];

  boot.loader.grub.enable = true;
  boot.loader.grub.version = 2;
  boot.loader.grub.device = "/dev/sda";

  boot.kernelPackages = pkgs.linuxPackages_4_1;

  time.timeZone = "Europe/Amsterdam";

  networking.firewall.enable = false;

  environment.systemPackages = with pkgs; [
    wget git screen ceph
  ];

  services.openssh.enable = true;
  services.openssh.permitRootLogin = "yes";

  virtualisation.libvirtd.enable = true;
  virtualisation.libvirtd.extraOptions = ["-l"];
  virtualisation.libvirtd.extraConfig = "listen_tls = 0\nlisten_tcp = 1";

  system.stateVersion = "15.09";
}

This gives a minimal installation with just OpenSSH and libvirt enabled.

Now you can actually install NixOS:

nixos-install

After a few minutes you will be prompted for a root password and that’s it!

Reboot and you have a running NixOS installation 🙂

Maximum amount of Docker containers on a single host

While playing with Docker I wanted to know how many containers I could spawn on a single system.

A quick for-loop told me that the maximum is 1023 containers on a single host:

Error response from daemon: Cannot start container 09c8f46b59ccc311e8d0352789db6debd0fa1df98186c5cda98583d762d48601: adding interface vetha5d205e to bridge docker0 failed: exchange full

The limitation here is the Linux bridge, which can’t have more than 1023 interfaces attached. Specifically, BR_PORT_BITS in net/bridge/br_private.h cannot be extended because of spanning tree requirements.

wido@wido-desktop:~$ docker ps|wc -l
1024
wido@wido-desktop:~$

Although that says 1024 there is a header line, so we have to subtract one. That brings it to 1023.
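
The arithmetic behind both numbers, as a small Python sketch. The value 10 for BR_PORT_BITS and the reservation of port number 0 are my reading of the kernel source and should be treated as assumptions. They are not stated in the error message.

```python
BR_PORT_BITS = 10                 # assumed value from net/bridge/br_private.h
BR_MAX_PORTS = 1 << BR_PORT_BITS  # 1024 possible port numbers
RESERVED_PORTS = 1                # assumption: port number 0 is never handed out

max_interfaces = BR_MAX_PORTS - RESERVED_PORTS

docker_ps_lines = 1024            # output of: docker ps | wc -l
header_lines = 1
running_containers = docker_ps_lines - header_lines

print("bridge ports available:", max_interfaces)
print("containers running:", running_containers)
```

Both calculations end up at 1023, which matches the point where docker0 refused another veth interface.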

wido@wido-desktop:~$ docker version
Client:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   f4bf5c7
 Built:        Mon Oct 12 05:37:18 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   f4bf5c7
 Built:        Mon Oct 12 05:37:18 UTC 2015
 OS/Arch:      linux/amd64
wido@wido-desktop:~$

Ubuntu and the changing MAC address with bonding

With the ‘new’ style for configuring bonding under Ubuntu your bond device will not always have the same MAC address across reboots.

For example, you configure your bond in the /etc/network/interfaces file:

auto p9p1
iface p9p1 inet manual
        bond-master bond0

auto p10p1
iface p10p1 inet manual
        bond-master bond0

auto bond0
iface bond0 inet manual
        bond-slaves none
        bond-mode 4
        bond-miimon 100
        bond-updelay 5
        bond-downdelay 5

During boot, both interfaces p9p1 and p10p1 will be hot-plugged under bond0. The first device to be plugged into the bonding device determines which MAC address the bonded device gets.

Due to hardware timing it might be p9p1 OR p10p1 which is the first. This behavior makes the MAC address selection inconsistent between reboots and that might cause problems with:

  • DHCP for IPv4
  • IPv6 with SLAAC (Stateless Auto Configuration)
  • DHCPv6

This has been filed as bug #1288196 with Ubuntu, but no fix from that side so far.

One workaround for now is to delay the second interface:

auto p9p1
iface p9p1 inet manual
        bond-master bond0

auto p10p1
iface p10p1 inet manual
        pre-up sleep 5
        bond-master bond0

This makes sure p10p1 always comes online 5 seconds after p9p1.

But you can also set a static MAC address for the bonding device:

auto bond0
iface bond0 inet manual
        hwaddress fe:80:12:04:6d:6f
        bond-slaves none
        bond-mode 4
        bond-miimon 100
        bond-updelay 5
        bond-downdelay 5
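
When picking a static MAC address yourself, it should be a locally administered unicast address: in the first octet the 0x02 bit is set and the 0x01 (multicast) bit is cleared. The sketch below generates such an address in Python and checks the one used above. Both function names are hypothetical helpers written for this post.

```python
import secrets


def random_local_mac() -> str:
    """Generate a random locally administered, unicast MAC address."""
    octets = bytearray(secrets.token_bytes(6))
    # Set the locally administered bit, clear the multicast bit
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{octet:02x}" for octet in octets)


def is_local_unicast(mac: str) -> bool:
    """Check the two low bits of the first octet."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02) and not first_octet & 0x01


print("fe:80:12:04:6d:6f is usable:", is_local_unicast("fe:80:12:04:6d:6f"))
print("hwaddress", random_local_mac())
```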

Choose what you prefer or works best in your situation.

Playing with CephFS recursive statistics

One of the cool features of CephFS is the recursive accounting the filesystem can do.

On a regular filesystem you have to use ‘du -sh’ to figure out how big a directory is. It will traverse into the directory and sum everything up for you. This can take a very long time and be very I/O intensive.

With CephFS this is done within a second:

root@admin:~# ls -alh /mnt/cephfs/
total 4.0K
drwxr-xr-x 1 root root  81T Jan 23 13:09 .
drwxr-xr-x 6 root root 4.0K Jan 13 15:41 ..
drwxrwxr-x 1 root root    0 Jan 23 12:57 DIR1
drwxrwxr-x 1 root root  80T Apr  3 11:16 DIR2
root@admin:~#

Or fetch these statistics using the virtual xattrs of CephFS:

root@admin:~# getfattr -d -m 'ceph.dir.*' /mnt/cephfs
getfattr: Removing leading '/' from absolute path names
# file: mnt/cephfs
ceph.dir.entries="2"
ceph.dir.files="0"
ceph.dir.rbytes="88833202521902"
ceph.dir.rctime="1430297412.09159402000"
ceph.dir.rentries="10334874"
ceph.dir.rfiles="9853051"
ceph.dir.rsubdirs="481823"
ceph.dir.subdirs="2"

root@admin:~#

It is as simple as that. Using these virtual xattrs of CephFS you instantly know how much data and how many files and (recursive) entries there are in any directory.
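
The same virtual xattrs can be read from a script. A minimal sketch in Python, assuming a Linux host with CephFS mounted: os.getxattr only exists on Linux, and the helper names recursive_bytes and human_readable are hypothetical. The conversion is checked against the ceph.dir.rbytes value from the output above.

```python
import os


def human_readable(num_bytes: int) -> str:
    """Format a byte count using binary (1024-based) units."""
    value = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB", "PiB"):
        if value < 1024 or unit == "PiB":
            return f"{value:.1f} {unit}"
        value /= 1024


def recursive_bytes(path: str) -> int:
    """Read ceph.dir.rbytes for a directory on a mounted CephFS (Linux only)."""
    raw = os.getxattr(path, "ceph.dir.rbytes")
    return int(raw.decode())


# Value taken from the getfattr output above
print(human_readable(88833202521902))

# On a machine with CephFS mounted:
# print(human_readable(recursive_bytes("/mnt/cephfs")))
```

The rounded result is in line with the 81T that ls reports for the root of the filesystem.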

No long waits on find or du, simply ask the Metadata Server of CephFS!