Hitch TLS Proxy performance with 15k certificates

While testing the Hitch TLS proxy in front of Varnish I stumbled upon very slow startup times with a large number of certificates.

In this case we (at PCextreme) want to run Hitch with around 50,000 certificates configured.

The webpage of Hitch says:

Safe for large installations: performant up to 15 000 listening sockets and 500 000 certificates.

10 minutes

I started testing on my local desktop with 15,000 certificates. My desktop is an Intel NUC running Ubuntu 14.04.
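The configuration itself is nothing special: a frontend, a backend and one pem-file line per certificate. A minimal sketch (the addresses and paths below are made up, not the actual config):

# /opt/hitch/hitch.conf (sketch)
frontend = "[*]:443"
backend  = "[127.0.0.1]:6086"

pem-file = "/opt/hitch/certs/site-00001.pem"
pem-file = "/opt/hitch/certs/site-00002.pem"
# ... one pem-file line for each of the 15,000 certificates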

wido@wido-desktop:~/repos/hitch/src$ time sudo ./hitch -n 4 -u nobody -g nogroup --config=/opt/hitch/hitch.conf

real    9m40.088s
user    9m38.482s
sys 0m0.829s
wido@wido-desktop:~/repos/hitch/src$

A 10-minute startup time for Hitch is rather long, so we started searching for the root cause.

OpenSSL

After some searching we discovered the OpenSSL version in Ubuntu 14.04 was the problem. Testing with Ubuntu 15.10 showed us different results.

root@VM-9d8e8cfd-e30f-4c40-8c4e-2e098b0f11a5:~# time hitch --daemon --pidfile=/run/hitch.pid --user hitch --group hitch --config=/etc/hitch/hitch.conf

real    0m18.673s
user    0m6.780s
sys    0m2.000s

18 seconds is a lot better than 10 minutes!

Ubuntu 14.04 ships OpenSSL 1.0.1f and Ubuntu 15.10 ships 1.0.2d, and that is where the difference seems to be.

100,000 certificates

After this we started testing with 100,000 certificates. Startup took 48 seconds with that many certificates configured.

For production we will use Ubuntu 16.04, which shows results similar to Ubuntu 15.10.

So if you find Hitch slow when starting, check your OpenSSL version.
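Checking it is a one-liner; on Ubuntu 14.04 this prints something like:

$ openssl version
OpenSSL 1.0.1f 6 Jan 2014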

AnyIP: Bind a whole subnet to your Linux machine

IPv6 Prefix Delegation

In my previous post I wrote how you can use Docker with IPv6 and Prefix Delegation.

An IPv6 subnet routed to a Linux machine can be used for more than just Docker. That’s where the AnyIP feature of the kernel comes in.

Linux Kernel AnyIP

The AnyIP feature of the Linux kernel allows you to bind a complete IPv4 or IPv6 subnet to your system.

Instead of adding all addresses manually to the kernel you can tell it to bind a complete subnet.

Configuring

IPv4

ip -4 route add local 192.168.0.0/24 dev lo

In this case the Linux kernel will now respond to ARP requests for any IPv4 address in the 192.168.0.0/24 subnet.
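From another machine in the same broadcast domain (assuming 192.168.0.0/24 is not already in use there; this example is not from the original post) you can then ping any address in that range:

ping -c 2 192.168.0.123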

IPv6

ip -6 route add local 2001:db8:100::/64 dev lo

In this case the kernel will respond to Neighbor Solicitations for any IPv6 address in the 2001:db8:100::/64 subnet.

Example usage

Let’s assume that you have the IPv6 prefix 2001:db8:100::/60 routed to your Linux machine through IPv6 prefix delegation.

From that /60 subnet we take the first /64 subnet and attach it to lo.

ip -6 route add local 2001:db8:100::/64 dev lo

You can now ping any of the addresses in that subnet:

  • 2001:db8:100::1
  • 2001:db8:100::100
  • 2001:db8:100::200
  • 2001:db8:100::dead:b33f

If you start a webserver listening on port 80, you can connect to any of the IPv6 addresses in that subnet and the webserver will respond.
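As a quick illustration (a sketch, not from the original setup; the nginx config and addresses are only examples), a webserver bound to the IPv6 wildcard will answer on every address covered by the AnyIP route:

server {
    # the IPv6 wildcard; every address in the AnyIP /64 is treated as local
    listen [::]:80;
    return 200 "hello";
}

Any address in the /64 then serves the same content:

curl -g http://[2001:db8:100::42]/
curl -g http://[2001:db8:100::dead:b33f]/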

Use cases

You might want to do mass shared hosting on a system where each hostname/domain name gets its own IPv6 address. Instead of attaching single IPs to an interface you can simply attach a complete subnet and point traffic to any of the IPs in that subnet.

Demo

On PCextreme’s Aurora Compute I deployed an Instance with Prefix Delegation enabled.

After running ‘dhclient’ I got the subnet 2a00:f10:500:40::/60 assigned to my Instance.

It was then just one line to attach a /64 subnet:

ip -6 route add local 2a00:f10:500:40::/64 dev lo

Random address generator

I wrote a small piece of Python code to generate a random IPv6 address:

#!/usr/bin/env python3
"""
Generate a random IPv6 address for a specified subnet
"""

from random import seed, getrandbits
from ipaddress import IPv6Network, IPv6Address

subnet = '2a00:f10:500:40::/64'

seed()
network = IPv6Network(subnet)
address = IPv6Address(network.network_address + getrandbits(network.max_prefixlen - network.prefixlen))

print(address)

Using a small loop in Bash I could now ping random addresses in that subnet:

while true; do ping6 -c 2 "$(./random-ipv6.py)"; done

Some example output:

--- 2a00:f10:500:40:d142:1092:ea84:74b4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 10.252/11.680/13.108/1.428 ms
PING 2a00:f10:500:40:4e50:f264:6ea9:d184(2a00:f10:500:40:4e50:f264:6ea9:d184) 56 data bytes
64 bytes from 2a00:f10:500:40:4e50:f264:6ea9:d184: icmp_seq=1 ttl=56 time=10.0 ms
64 bytes from 2a00:f10:500:40:4e50:f264:6ea9:d184: icmp_seq=2 ttl=56 time=10.0 ms

--- 2a00:f10:500:40:4e50:f264:6ea9:d184 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 10.085/10.087/10.089/0.002 ms
PING 2a00:f10:500:40:d831:1f89:b06d:fe12(2a00:f10:500:40:d831:1f89:b06d:fe12) 56 data bytes
64 bytes from 2a00:f10:500:40:d831:1f89:b06d:fe12: icmp_seq=1 ttl=56 time=9.77 ms
64 bytes from 2a00:f10:500:40:d831:1f89:b06d:fe12: icmp_seq=2 ttl=56 time=10.1 ms

--- 2a00:f10:500:40:d831:1f89:b06d:fe12 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1005ms
rtt min/avg/max/mdev = 9.777/9.958/10.140/0.207 ms
PING 2a00:f10:500:40:2c45:26ee:5b93:fa2(2a00:f10:500:40:2c45:26ee:5b93:fa2) 56 data bytes
64 bytes from 2a00:f10:500:40:2c45:26ee:5b93:fa2: icmp_seq=1 ttl=56 time=10.2 ms
64 bytes from 2a00:f10:500:40:2c45:26ee:5b93:fa2: icmp_seq=2 ttl=56 time=10.0 ms

Installing and testing NixOS

NixOS

NixOS is a minimal and flexible Linux distribution which doesn’t use any of the existing package managers.

NixOS is a Linux distribution with a unique approach to package and configuration management. Built on top of the Nix package manager, it is completely declarative, makes upgrading systems reliable, and has many other advantages.

I wanted to test NixOS and see if it could be a candidate for a very minimal KVM hypervisor running just Qemu, libvirt and Apache CloudStack.

With this post I just wanted to share how you can quickly install NixOS inside a VirtualBox VM.

VirtualBox

On my desktop and laptop I usually use VirtualBox to quickly test something inside Virtual Machines. In this case I downloaded the NixOS minimal 64-bit ISO and created a VM:

  • 1024MB of memory
  • 8GB SATA disk
  • NixOS ISO attached

Installation

After you start the VM it will boot from the ISO. You will then find yourself at a root prompt that says just nixos.

The first step is to format your disk and mount it under /mnt.

parted /dev/sda mklabel msdos
parted /dev/sda mkpart primary 0% 100%
mkfs.xfs /dev/sda1
mount /dev/sda1 /mnt

Once that is done you can run:

nixos-generate-config --root /mnt

This will generate /mnt/etc/nixos/configuration.nix from where you can configure your OS.

This is what I used as my configuration:

{ config, pkgs, ... }:

{
  imports = [
      ./hardware-configuration.nix
    ];

  boot.loader.grub.enable = true;
  boot.loader.grub.version = 2;
  boot.loader.grub.device = "/dev/sda";

  boot.kernelPackages = pkgs.linuxPackages_4_1;

  time.timeZone = "Europe/Amsterdam";

  networking.firewall.enable = false;

  environment.systemPackages = with pkgs; [
    wget git screen ceph
  ];

  services.openssh.enable = true;
  services.openssh.permitRootLogin = "yes";

  virtualisation.libvirtd.enable = true;
  virtualisation.libvirtd.extraOptions = ["-l"];
  virtualisation.libvirtd.extraConfig = "listen_tls = 0\nlisten_tcp = 1";

  system.stateVersion = "15.09";
}

A minimal installation with just OpenSSH and libvirt installed.

Now you can actually install NixOS:

nixos-install

After a few minutes you will be prompted for a root password and that’s it!

Reboot and you have a running NixOS installation 🙂

Maximum amount of Docker containers on a single host

While playing with Docker I wanted to know how many containers I could spawn on a single system.

A quick for-loop told me that the maximum is 1023 containers on a single host:

Error response from daemon: Cannot start container 09c8f46b59ccc311e8d0352789db6debd0fa1df98186c5cda98583d762d48601: adding interface vetha5d205e to bridge docker0 failed: exchange full
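The loop itself was something along these lines (a sketch; the image and the command inside the container are just placeholders):

for i in $(seq 1 1100); do docker run -d busybox sleep 86400; done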

The limitation here is the Linux bridge, which can’t have more than 1023 interfaces attached. Specifically, BR_PORT_BITS in net/bridge/br_private.h cannot be extended because of spanning tree requirements.

wido@wido-desktop:~$ docker ps|wc -l
1024
wido@wido-desktop:~$

Although that says 1024 there is a header line, so we have to subtract one. That brings it to 1023.

wido@wido-desktop:~$ docker version
Client:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   f4bf5c7
 Built:        Mon Oct 12 05:37:18 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   f4bf5c7
 Built:        Mon Oct 12 05:37:18 UTC 2015
 OS/Arch:      linux/amd64
wido@wido-desktop:~$

Ubuntu and the changing MAC address with bonding

With the ‘new’ style for configuring bonding under Ubuntu your bond device will not always have the same MAC address across reboots.

For example, you configure your bond in the /etc/network/interfaces file:

auto p9p1
iface p9p1 inet manual
        bond-master bond0

auto p10p1
iface p10p1 inet manual
        bond-master bond0

auto bond0
iface bond0 inet manual
        bond-slaves none
        bond-mode 4
        bond-miimon 100
        bond-updelay 5
        bond-downdelay 5

During boot, both interfaces p9p1 and p10p1 will be hot-plugged into bond0. The first device to be added to the bond determines which MAC address the bond gets.

Due to hardware timing either p9p1 or p10p1 might come up first. This makes the MAC address inconsistent between reboots, which can cause problems with:

  • DHCP for IPv4
  • IPv6 with SLAAC (Stateless Auto Configuration)
  • DHCPv6

This has been filed as bug #1288196 with Ubuntu, but there is no fix from that side so far.

There are two workarounds for now. The first is to delay one of the slaves:

auto p9p1
iface p9p1 inet manual
        bond-master bond0

auto p10p1
iface p10p1 inet manual
        pre-up sleep 5
        bond-master bond0

This makes sure p10p1 always comes online 5 seconds after p9p1.

But you can also set a static MAC address for the bonding device:

auto bond0
iface bond0 inet manual
        hwaddress fe:80:12:04:6d:6f
        bond-slaves none
        bond-mode 4
        bond-miimon 100
        bond-updelay 5
        bond-downdelay 5

Choose what you prefer or works best in your situation.
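Either way, after a reboot you can quickly check which MAC address bond0 ended up with:

ip link show bond0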

Playing with CephFS recursive statistics

One of the cool features of CephFS is the recursive accounting the filesystem can do.

On a regular filesystem you have to use ‘du -sh’ to figure out how big a directory is. It will traverse into the directory and sum everything up for you. This can take a very long time and be very I/O intensive.

With CephFS this is done within a second:

root@admin:~# ls -alh /mnt/cephfs/
total 4.0K
drwxr-xr-x 1 root root  81T Jan 23 13:09 .
drwxr-xr-x 6 root root 4.0K Jan 13 15:41 ..
drwxrwxr-x 1 root root    0 Jan 23 12:57 DIR1
drwxrwxr-x 1 root root  80T Apr  3 11:16 DIR2
root@admin:~#

Or fetch these statistics using the virtual xattrs of CephFS:

root@admin:~# getfattr -d -m ceph.dir.* /mnt/cephfs
getfattr: Removing leading '/' from absolute path names
# file: mnt/cephfs
ceph.dir.entries="2"
ceph.dir.files="0"
ceph.dir.rbytes="88833202521902"
ceph.dir.rctime="1430297412.09159402000"
ceph.dir.rentries="10334874"
ceph.dir.rfiles="9853051"
ceph.dir.rsubdirs="481823"
ceph.dir.subdirs="2"

root@admin:~#

It is as simple as that. Using these virtual xattrs of CephFS you instantly know how much data and how many files and (recursive) entries there are in any directory.
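If you only need a single value you can also query one specific xattr directly (a small sketch, not from the original post):

getfattr -n ceph.dir.rbytes /mnt/cephfs/DIR2
getfattr -n ceph.dir.rentries /mnt/cephfs/DIR2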

No long waits on find or du, simply ask the Metadata Server of CephFS!

Limit battery state of charge on a Lenovo X1 Carbon under Ubuntu

Since the end of 2012 I have had a Lenovo X1 Carbon laptop running Ubuntu 12.04.

By default a laptop charges its battery all the way up to 100% State of Charge, something which is bad for the battery’s lifespan. There is a great video on YouTube about this if you want to know all the ins and outs.

The bottom line is that I wanted to limit the charge level to 90% for my laptop. Up until now I did this manually by pulling the plug at certain points, but that didn’t always work. I sometimes forgot and the battery would charge up to 100%.

On GitHub I found the tpacpi-bat project which allows you to limit the charge level of your battery.

How to install?

  • Clone the project
  • Run install.pl
  • Modify your /etc/rc.local file
  • Reboot

This is what you need to put in your rc.local:

tpacpi-bat -s SP 0 90
tpacpi-bat -s SP 1 90
tpacpi-bat -s SP 2 90

exit 0

As far as I know the X1 Carbon has three batteries, so we set the stop-charge threshold to 90% for all three. This is not persistent across reboots, so we have to set it on every boot.

You’ll now see that your battery charges to 90% at most.
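You can read the configured threshold back at any time with the get variant of the same command:

tpacpi-bat -g SP 1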

Quassel IRC, never miss anything on IRC!

I was one of those guys who had irssi running inside a screen on a remote Linux box somewhere. It works just fine, but I always forgot to open the SSH session, so I missed a lot of IRC conversations. Private messages were a problem as well; most of the time it was a couple of days later before I noticed somebody had actually sent me a PM…

It was time to change my IRC client, with the preference to always be online.

A short search led me to the website of Quassel IRC, a distributed IRC client/server. Exactly what I was looking for! You just install the “core” on a remote Linux box and use the Linux, Windows, Mac OS X or Android client to participate on IRC.

The core has been running on an Ubuntu 10.04 machine for about a week now and it works like a charm. My IRC conversations are secured with SSL and I never miss a PM or a mention!

The client integrates seamlessly with Unity on Ubuntu 12.04 and notifies me whenever I’m mentioned or receive a PM.

Looking for me on IRC? Find me on OFTC @ wido, where I hang out in #ceph. Or find me on Freenode @ widodh in #cloudstack.

Failover with Nexenta, NFS and the RSF-1 plugin

The title might seem a bit cryptic, but this post is about a highly available Nexenta cluster with the RSF-1 plugin that we are deploying.

While we are waiting for the moment we can start using Ceph, we are implementing new storage for our hosting clusters. Our current Linux machines with LVM and XFS are not up to the task anymore.

After some testing and discussion we chose Nexenta. What Nexenta is and how awesome ZFS is can be found elsewhere on the net; I’m not going to discuss that here.

I wanted to publish our findings about the HA plugin and NFS.

In short, we have two headends connected to two SAS JBODs. The RSF-1 plugin makes sure the zpool is imported on one headend at a time. If one headend fails, the plugin automatically fails the pool over to the other headend.

The plugin provides one HA IP which is shared between the headends; you probably get the point.

We’ve been doing some testing and noticed that when we mount NFS (v3) over TCP the failover takes a staggering 6 minutes! Well, the failover doesn’t take 6 minutes, but that’s the time it takes for the TCP connections to recover.

When mounting over UDP the service resumes within 50 seconds, a big difference!
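Switching to UDP is just a matter of mount options; a sketch (the server name and export path are made up):

mount -t nfs -o vers=3,proto=udp,hard nexenta-ha:/volumes/vol0/nfs /mnt/nfs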

Some testing showed that this is due to the following kernel settings:

net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15

This page explains what those two values actually control.

We’ve been experimenting with those values, and lowering retries1 to 1 gave us the same recovery times as with UDP, but sometimes the recovery would still take 6 minutes.

For now I advise using NFS over UDP (which gives better performance anyway), but if you need TCP for some reason, try fiddling with these values.
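If you want to experiment with these values they can be changed on the fly (the numbers below are only an example, not a recommendation):

sysctl -w net.ipv4.tcp_retries1=1
sysctl -w net.ipv4.tcp_retries2=8

Or persistently via /etc/sysctl.conf:

net.ipv4.tcp_retries1 = 1
net.ipv4.tcp_retries2 = 8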

Distributed storage under Linux, is it there yet?

When it comes to storage under Linux you have a lot of great options if you are looking for local storage, but what if you have so much data that local storage is not really an option? And what if you need multiple servers accessing the data? You’ll probably pick NFS, or iSCSI with a clustered filesystem like GFS or OCFS2.

When using NFS or iSCSI it comes down to one, two or maybe three servers storing your data, where one has the primary role 99.99% of the time. That is still a Single Point of Failure (SPoF).

Although this worked (and still works) fine, we are running into limitations. We want to store more and more data, we want to expand without downtime and we want expansion to go smoothly. Doing all that under Linux today is, let’s say, a challenge.

Energy costs are also rising; whether you like it or not, they influence the work of a system administrator. We were used to having an Active/Passive setup, but that doubles your energy consumption! In large environments that can mean a lot of money. Do we still want that? I don’t think so.

Distributed storage is what we need: no central brain, no passive nodes, but a fully distributed and fault-tolerant filesystem where every node is active. It has to scale easily without any disruption in service.

I think it’s nearly there and they call it Ceph!

Ceph is a distributed filesystem built on top of RADOS, a scalable and distributed object store. This object store simply stores objects in pools (which some people might refer to as “buckets”). It’s this distributed object store which forms the basis of the Ceph filesystem.

RADOS works with Object Store Daemons (OSDs). Each OSD is a daemon with a data directory (btrfs) where it stores its objects and some basic information about the cluster. Typically the data directory of an OSD is a single hard disk formatted with btrfs.

Every pool has a replication size property; this tells RADOS how many copies of an object you want to store. If you choose 3, every object you store in that pool will be stored on three different OSDs. This provides data safety and availability: losing one (or more) OSDs will not lead to data loss or unavailability.
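For reference (this is not from the original post, and the syntax may differ for the 0.3x releases discussed here), on current Ceph releases the replication size of a pool is set with a single command, for example for a pool named rbd:

ceph osd pool set rbd size 3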

Data placement in RADOS is done by CRUSH. With CRUSH you can strategically place your objects (and their replicas) in different rooms, racks, rows and servers. One might want to place the second replica on a separate power feed from the primary replica.

A small RADOS cluster could look like this:

This is a small RADOS cluster: three machines with 4 disks each and one OSD per disk. The monitor is there to inform the clients about the cluster state. Although this setup has one monitor, monitors can be made redundant by simply adding more (an odd number is preferable).

With this post I don’t want to tell you everything about RADOS and its internal workings; all that information is available on the Ceph website.

What I do want to tell you is what my experience with Ceph has been so far and where it’s heading.

I started testing Ceph about 1.5 years ago; I stumbled upon it when reading the changelog of 2.6.34, the first kernel in which the Ceph kernel client was included.

I’m always on a quest to find a better solution for our storage. Right now we are using Linux boxes with NFS, but that is really starting to hurt in many ways.

How far did Ceph get in the past 18 months? Far! I started testing when version 0.18 had just come out; right now we are at 0.31!

I started testing the various components of Ceph on a small number of virtual machines, but currently I have two clusters running: a “semi-production” cluster where I’m running various virtual machines with RBD and Qemu-KVM, and a second, 74TB cluster with 10 machines, each having four 2TB disks.

Filesystem            Size  Used Avail Use% Mounted on
[2a00:f10:113:1:230:48ff:fed3:b086]:/   74T  13T   61T  17% /mnt/ceph

As you can see, I’m running my cluster over IPv6. Ceph does not support dual-stack; you have to choose between IPv4 and IPv6, and I prefer the latter.

But you are probably wondering how stable or production-ready it is. That question is hard to answer. My small cluster where I run the KVM virtual machines (through Qemu-KVM with RBD) has only 6 OSDs and a capacity of 600GB. It has been running for about 4 months now without any issues, but I have to be honest: I didn’t stress it either. I didn’t kill any machines, nor did hardware fail. It should be able to handle such crashes, but I haven’t put that cluster to the test.

The story is different with my big cluster. In total it’s 15 machines: 10 machines hosting a total of 40 OSDs, the rest being monitors, metadata servers and clients. It started running about 3 months ago and since then I’ve seen numerous crashes. I also chose to use WD Green 2TB disks in this cluster, which was not the best decision; right now I have a 12% failure rate with these disks. While the failure of those disks is not a good thing, it is a good test for Ceph!

Some disk failures caused serious problems, with the cluster bouncing around and never recovering from it. But about 2 days ago I noticed two other disks failing and the cluster fully recovered while an rsync was writing data to it. So it seems to be improving!

During further testing I have stumbled upon a lot of things. My cluster is built with Atom CPUs, but those seem to be a bit underpowered for the work. Recovery is heavy for OSDs, so whenever something goes wrong in the cluster I see the CPUs spiking towards 100%. This is something that is being addressed.

Data placement is done in Placement Groups, aka PGs. The more data or OSDs you add to the cluster, the more PGs you’ll get. The more PGs you have, the more memory your OSDs start to consume. My OSD machines have 4GB each (an Atom limitation). Recovery is not only CPU hungry, it will also eat your memory. Although the use of tcmalloc reduced the memory usage, OSDs sometimes use a lot of memory.

To come to some sort of conclusion: are we there yet? Short answer: no. Long answer: no again, but we will get there. Although Ceph still has a long way to go, it’s on the right path. I think Ceph will become the distributed storage solution under Linux, but it will take some time. Patience is key here!

The last thing I want to address is that testing is needed! Bugs don’t reveal themselves; you have to hunt them down. If you have spare hardware and time, do test and report!