AnyIP: Bind a whole subnet to your Linux machine

IPv6 Prefix Delegation

In my previous post I wrote how you can use Docker with IPv6 and Prefix Delegation.

A IPv6 subnet routed to a Linux machine can be used with other things than Docker. That’s where the AnyIP feature of the kernel comes in.

Linux Kernel AnyIP

The AnyIP feature of the Linux kernel allows you to bind a complete IPv4 or IPv6 subnet to your system.

Instead of adding all addresses manually to the kernel you can tell it to bind a complete subnet.

Configuring

IPv4

ip -4 route add local 192.168.0.0/24 dev lo

In this case the Linux kernel will now respond to ARP requests for any IPv4 address in the 192.168.0.0/24 subnet.

IPv6

ip -6 route add local 2001:db8:100::/64 dev lo

In this case the kernel will respond for Neigh Sollicitations on any IPv6 address in the 2001:db8:100::/64 subnet.

Example usage

Let’s assume that you have the IPv6 prefix 2001:db8:100::/60 routed to your Linux machine through IPv6 prefix delegation.

From that /60 subnet we take the first /64 subnet and attach it to lo.

ip -6 route add local 2001:db8:100::/64 dev lo

You can now ping any of the addresses in that subnet:

  • 2001:db8:100::1
  • 2001:db8:100::100
  • 2001:db8:100::200
  • 2001:db8:100::dead:b33f

If you would start a webserver which listens on port 80 you can use any of the IPv6 addresses in that subnet and the webserver will respond to it.

Use cases

It could be that you want to to mass-shared hosting on a system where you want to assign each hostname/domainname it’s own IPv6 address. Instead of attaching single IPs to a interface you can simply attach a complete subnet and point traffic to any of the IPs in that subnet.

Demo

On a virtual machine on PCextreme’s Aurora Compute I deployed a Instance with Prefix Delegation enabled.

After running ‘dhclient’ I got the subnet 2a00:f10:500:40::/60 assigned to my Instance.

It was then just one line to attach a /64 subnet:

ip -6 route add local 2a00:f10:500:40::/64 dev lo

Random address generator

I wrote a small piece of Python code to generate a random IPv6 address:

#!/usr/bin/env python3
"""
Generate a random IPv6 address for a specified subnet
"""

from random import seed, getrandbits
from ipaddress import IPv6Network, IPv6Address

subnet = '2a00:f10:500:40::/64'

seed()
network = IPv6Network(subnet)
address = IPv6Address(network.network_address + getrandbits(network.max_prefixlen - network.prefixlen))

print(address)

Using a small loop in Bash I could now ping random addresses in that subnet:

while [ true ]; do ping6 -c 2 `./random-ipv6.py`; done

Some example output:

--- 2a00:f10:500:40:d142:1092:ea84:74b4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 10.252/11.680/13.108/1.428 ms
PING 2a00:f10:500:40:4e50:f264:6ea9:d184(2a00:f10:500:40:4e50:f264:6ea9:d184) 56 data bytes
64 bytes from 2a00:f10:500:40:4e50:f264:6ea9:d184: icmp_seq=1 ttl=56 time=10.0 ms
64 bytes from 2a00:f10:500:40:4e50:f264:6ea9:d184: icmp_seq=2 ttl=56 time=10.0 ms

--- 2a00:f10:500:40:4e50:f264:6ea9:d184 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 10.085/10.087/10.089/0.002 ms
PING 2a00:f10:500:40:d831:1f89:b06d:fe12(2a00:f10:500:40:d831:1f89:b06d:fe12) 56 data bytes
64 bytes from 2a00:f10:500:40:d831:1f89:b06d:fe12: icmp_seq=1 ttl=56 time=9.77 ms
64 bytes from 2a00:f10:500:40:d831:1f89:b06d:fe12: icmp_seq=2 ttl=56 time=10.1 ms

--- 2a00:f10:500:40:d831:1f89:b06d:fe12 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1005ms
rtt min/avg/max/mdev = 9.777/9.958/10.140/0.207 ms
PING 2a00:f10:500:40:2c45:26ee:5b93:fa2(2a00:f10:500:40:2c45:26ee:5b93:fa2) 56 data bytes
64 bytes from 2a00:f10:500:40:2c45:26ee:5b93:fa2: icmp_seq=1 ttl=56 time=10.2 ms
64 bytes from 2a00:f10:500:40:2c45:26ee:5b93:fa2: icmp_seq=2 ttl=56 time=10.0 ms

Installing and testing NixOS

NixOS

NixOS is a minimal and flexible Linux distribution which doesn’t use any of the existing package manager.

NixOS is a Linux distribution with a unique approach to package and configuration management. Built on top of the Nix package manager, it is completely declarative, makes upgrading systems reliable, and has many other advantages.

I wanted to test NixOS and see if it could be a candidate for a very minimal KVM hypervisor running just Qemu, libvirt and Apache CloudStack.

With this post I just wanted to share how you can quickly install NixOS inside a VirtualBox VM.

VirtualBox

On my desktop and laptop I usually use VirtualBox to quickly test something inside Virtual Machines. In this case I downloaded the NixOS minimal 64-bit ISO and created a VM:

  • 1024MB of memory
  • 8GB SATA disk
  • NixOS ISO attached

Installation

After you start the VM it will boot from the ISO. You will then find yourself in a root prompt saying just nixos.

The first step is to format your disk and mount it under /mnt.

parted /dev/sda mklabel msdos
parted /dev/sda mkpart primary 0% 100%
mkfs.xfs /dev/sda1
mount /dev/sda1 /mnt

If you have that done you can run:

nixos-generate-config

This will generate /mnt/etc/nixos/configuration.nix from where you can configure your OS.

This is what I used as my configuration:

{ config, pkgs, ... }:

{
  imports = [
      ./hardware-configuration.nix
    ];

  boot.loader.grub.enable = true;
  boot.loader.grub.version = 2;
  boot.loader.grub.device = "/dev/sda";

  boot.kernelPackages = pkgs.linuxPackages_4_1;

  time.timeZone = "Europe/Amsterdam";

  networking.firewall.enable = false;

  environment.systemPackages = with pkgs; [
    wget git screen ceph
  ];

  services.openssh.enable = true;
  services.openssh.permitRootLogin = "yes";

  virtualisation.libvirtd.enable = true;
  virtualisation.libvirtd.extraOptions = ["-l"];
  virtualisation.libvirtd.extraConfig = "listen_tls = 0\nlisten_tcp = 1";

  system.stateVersion = "15.09";
}

A minimal installation with just OpenSSH and libvirt installed.

Now you can actually install NixOS:

nixos-install

After a few minutes you will be prompted for a root-password and that’s it!

Reboot and you have a running NixOS installation 🙂

Maximum amount of Docker containers on a single host

While playing with Docker I wanted to know how many containers I could spawn on a single system.

A quick for-loop told me that the maximum is 1023 containers on a single host:

Error response from daemon: Cannot start container 09c8f46b59ccc311e8d0352789db6debd0fa1df98186c5cda98583d762d48601: adding interface vetha5d205e to bridge docker0 failed: exchange full

The limitation here is the Linux bridging which can’t have more then 1023 interfaces attached. Specifically net/bridge/br_private.h BR_PORT_BITS cannot be extended because of spanning tree requirements.

wido@wido-desktop:~$ docker ps|wc -l
1024
wido@wido-desktop:~$

Although that says 1024 there is a header line, so we have to subtract one. That brings it to 1023.

wido@wido-desktop:~$ docker version
Client:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   f4bf5c7
 Built:        Mon Oct 12 05:37:18 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   f4bf5c7
 Built:        Mon Oct 12 05:37:18 UTC 2015
 OS/Arch:      linux/amd64
wido@wido-desktop:~$

Ubuntu and the changing MAC address with bonding

With the ‘new’ style for configuring bonding under Ubuntu your bond device will not always have the same MAC address across reboots.

For example, you configure your bond in the /etc/network/interfaces file:

auto p9p1
iface p9p1 inet manual
        bond-master bond0

auto p10p1
iface p10p1 inet manual
        bond-master bond0

auto bond0
iface bond0 inet manual
        bond-slaves none
        bond-mode 4
        bond-miimon 100
        bond-updelay 5
        bond-downdelay 5

During boot, both interface p9p1 and p10p1 will be hot-plugged under bond0. The first device to be plugged into the bonding device determines which MAC address the bonded device gets.

Due to hardware timing it might be p9p1 OR p10p1 which is the first. This behavior makes the MAC address selection inconsistent between reboots and that might cause problems with:

  • DHCP for IPv4
  • IPv6 with SLAAC (Stateless Auto Configuration)
  • DHCPv6

This has been filed as bug #1288196 with Ubuntu, but no fix from that side so far.

The solutions for now:

auto p9p1
iface p9p1 inet manual
        bond-master bond0

auto p10p1
iface p10p1 inet manual
        pre-up sleep 5
        bond-master bond0

This makes sure p10p1 always comes online 5 seconds after p9p1.

But you can also set a static MAC address for the bonding device:

auto bond0
iface bond0 inet manual
        hwaddress fe:80:12:04:6d:6f
        bond-slaves none
        bond-mode 4
        bond-miimon 100
        bond-updelay 5
        bond-downdelay 5

Choose what you prefer or works best in your situation.

Playing with CephFS recursive statistics

One of the cool features of CephFS is the recursive accounting the filesystem can do.

On a regular filesystem you have to use ‘du -sh’ to figure out how big a directory is. It will traverse into the directory and sum everything up for you. This can take a very long time and be very I/O intensive.

With CephFS this is done within a second:

root@admin:~# ls -alh /mnt/cephfs/
total 4.0K
drwxr-xr-x 1 root root  81T Jan 23 13:09 .
drwxr-xr-x 6 root root 4.0K Jan 13 15:41 ..
drwxrwxr-x 1 root root    0 Jan 23 12:57 DIR1
drwxrwxr-x 1 root root  80T Apr  3 11:16 DIR2
root@admin:~#

Or fetch these statistics using the virtual xattrs of CephFS:

root@admin:~# getfattr -d -m ceph.dir.* /mnt/cephfs
getfattr: Removing leading '/' from absolute path names
# file: mnt/cephfs
ceph.dir.entries="2"
ceph.dir.files="0"
ceph.dir.rbytes="88833202521902"
ceph.dir.rctime="1430297412.09159402000"
ceph.dir.rentries="10334874"
ceph.dir.rfiles="9853051"
ceph.dir.rsubdirs="481823"
ceph.dir.subdirs="2"

root@admin:~#

It is as simple as that. Using this virtual xattrs of CephFS you instantly know how much data, files and (recursive) entries there are in any directory.

No long waits on find or du, simply ask the Metadata Server of CephFS!

Limit battery state of charge on a Lenovo X1 Carbon under Ubuntu

Since the end of 2012 I have a Lenovo X1 Carbon laptop running with Ubuntu 12.04

By default a laptop charges all the way up to 100% State of Charge, something which is very bad for a battery. There is a great video on Youtube about this if you want to know all the ins and outs.

The bottom line is that I wanted to limit the charge level to 90% for my laptop. Up until now I did this manually by pulling the plug at certain points, but that didn’t always work. I sometimes forgot and the battery would charge up to 100%.

On Github I found the tpacpi-bat project which allows you to limit the charge level of your battery.

How to install?

  • Clone the project
  • Run install.pl
  • Modify your /etc/rc.local file
  • Reboot

This is what you need to put in your rc.local:

tpacpi-bat -g SP 0
tpacpi-bat -g SP 1
tpacpi-bat -g SP 2

exit 0

As far as I know the X1 Carbon has 3 batteries, so for all three we set the charge limit to 90%. This is not persistent after reboots, so we have to set it every time we boot.

You’ll now see that your battery charges to 90% at max.

Quassel IRC, never miss anything on IRC!

I was one of those guys who had irssi running inside a screen on a remote Linux box somewhere. It works just fine, but I always forgot to open the SSH session so I missed a lot of IRC conversations. Private messages were a problem as well, most of the times it was a couple of days later before I noticed somebody had actually sent me a PM…

It was time to change my IRC client, with the preference to always be online.

A short search lead me to the website of Quassel IRC, a distributed IRC server/client. Exactly what I was looking for! You just install the “core” on a remote Linux box and use the Linux, Windows, Mac OSX or Android client to participate on IRC.

The core has been running on a Ubuntu 10.04 machine for about one week now and it works like a charm. My IRC conversations are secured by SSL and I never miss a PM or when somebody tags me!

Integration of the client goes well on Ubuntu 12.04 with Unity, it integrates seamlessly with Unity and notifies me whenever I’m tagged or I receive a PM.

Looking for me on IRC? Find me on OFTC @ wido where I hang out in #ceph. Or find me on Freenode @ widodh in #cloudstack

Failover with Nexenta, NFS and the RSF-1 plugin

The title might seem a bit cryptic, but this post is about a High Available Nexenta cluster with the RSF-1 we are deploying.

While we are waiting for the moment where we can start using Ceph we are implementing new storage for our hosting clusters. Our current Linux machines with LVM and XFS are not up to the task anymore.

After some testing and discussing we chose to use Nexenta. What Nexenta is and how awesome ZFS is can be found on other places on the net, I’m not going to discuss that here.

I wanted to publish our findings about the HA plugin and NFS.

In short, we have two headends connected with two SAS JBOD’s. The RSF-1 plugin makes sure the ZPOOL is imported on one headend at the time. If one headend fails, the plugin automatically fails the pool over to the other headend.

The plugin provides one HA IP which is shared between the headends, you probably get the point.

We’ve been doing some testing and noticed that when we mount NFS (v3) over TCP the failover takes a staggering 6 minutes! Well, the failover doesn’t take 6 minutes, but that’s the time it takes for the TCP connections to recover.

When mounting over UDP the service is continued in 50 seconds, so that’s a big difference!

Some testing showed that this is due to the following kernel settings:

net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15

This page explains what those two values actually control.

We’ve been experimenting with those values and lowering retries1 to 1 gave us the same recovery times as with UDP, but sometimes the recovery would still take 6 minutes..

For now I advise to use NFS with UDP (which gives better performance anyway), but if you need to use TCP for some reason try fiddling with these values.

Distributed storage under Linux, is it there yet?

When it comes down to storage under Linux you have a lot of great options if you are looking for local storage, but what if you have so much data that local storage is not really an option? And what if you need multiple servers accessing the data? You’ll probably take NFS or iSCSI with a clustered filesystem like GFS or OCFS2.

When using NFS or iSCSI it will come down to one, two or maybe three servers storing your data, where one will have a primary role for 99.99% of the time. That is still a Single Point-of-Failure (SPoF).

Although this worked (and still is) fine, we are running into limitations. We want to store more and more data, we want to expand without downtime and we want expansion to go smoothly. Doing all that under Linux now is a ……. Let’s say: Challenge.

Energy costs are also rising, if you like it or not, it does influence the work of a system administrator. We were used to having a Active/Passive setup, but that doubles your energy consumption! In large environments that could mean a lot of money. Do we still want that? I don’t think so.

Distributed storage is what we need, no central brain, no passive nodes, but a fully distributed and fault tolerant filesystem where every node is active and it has to scale easily without any disruption in service.

I think it’s nearly there and they call it Ceph!

Ceph is a distributed file system build on top of RADOS, a scalable and distributed object store. This object store simply stores objects in pools (which some people might refer to as “buckets”). It’s this distributed object store which is the basis of the Ceph filesystem.

RADOS works with Object Store Daemons (OSD). These OSDs are a daemon which have a data directory (btrfs) where they store their objects and some basic information about the cluster. Typically a data directory of a OSD is a one hard disk formatted with btrfs.

Every pool has a replication size property, this tells RADOS how many copies of an object you want to store. If you choose 3 every object you store on that pool will be stored on three different OSDs. This provides data safety and availability, loosing one (or more) OSDs will not lead to data loss nor unavailability.

Data placement in RADOS is done by CRUSH. With CRUSH you can strategically place your objects (and it’s replica’s) in different rooms, racks, rows and servers. One might want to place the second replica on a separate power feed then the primary replica.

A small RADOS cluster could look like this:

This is a small RADOS cluster, three machines with 4 disks each and one OSD per disk. The monitor is there to inform the clients about the cluster state. Although this setup has one monitor, these can be made redundant by simple adding more (odd number is preferable).

With this post I don’t want to tell you everything about RADOS and the internal working, all this information is available on the Ceph website.

What I do want to tell you is how my experiences are with Ceph at this point and where it’s heading.

I started testing Ceph about 1.5 years ago, I stumbled on it when reading the changelog of 2.6.34, that was the first kernel where the Ceph kernel client was included.

I’m always on a quest to find a better solution for our storage, right now we are using Linux boxes with NFS, but that is really starting to hurt in many ways.

Where did Ceph get in the past 18 months? Far! I started testing when version 0.18 just got out, right now we are at 0.31!

I started testing the various components of Ceph, started on a small number of virtual machines, but currently I have two clusters running, a “semi-production” where I’m running various virtual machines with RBD and Qemu-KVM. My second cluster is a 74TB cluster with 10 machines, each having 4 2TB disks.

Filesystem            Size  Used Avail Use% Mounted on
[2a00:f10:113:1:230:48ff:fed3:b086]:/   74T  13T   61T  17% /mnt/ceph

As you can see, I’m running my cluster over IPv6. Ceph does not support dual-stack, you will have to choose between IPv4 or IPv6, where I prefer the last one.

But you are probably wondering how stable or production ready it is? That question is hard to answer. My small cluster where I run the KVM Virtual Machines (through Qem-KVM with RBD) has only 6 OSDs and a capacity of 600GB. It has been running for about 4 months now without any issues, but I have to be honest, I didn’t stress it either. I didn’t kill any machines nor did hardware fail. It should be able to handle those crashes, but I haven’t stressed that cluster.

The story is different with my big cluster. In total it’s 15 machines, 10 machines hosting a total of 40 OSDs, the rest are monitors, meta data servers and clients. It started running about 3 months ago and since I’ve seen numerous crashes. I also chose to use the WD Green 2TB disks in my cluster, that was not the best decision. Right now I have a 12% failure rate of these disks. While the failure of those disks is not a good thing, it is a good test for Ceph!

Some disk failures caused some serious problems causing the cluster to start bouncing around and never recovering from that.. But, about 2 days ago I noticed two other disks failing and the cluster fully recovered from it while a rsync was writing data to it. So, it seems to be improving!

During my further testing I have stumbled upon a lot of things. My cluster is build with Atom CPU’s, but those seem to be a bit underpowered for the work. Recovery is heavy for OSDs, so whenever something goes wrong in the cluster I see the CPU’s starting to spike towards the 100%. This is something that is being addressed.

Data placement goes in Placement Group’s, aka PGs. The more data or OSDs you add to the cluster, the more PGs you’ll get. The more PGs you have, the more memory your OSDs start to consume. My OSDs have 4GB (Atom limitation) each. Recovery is not only CPU hungry, but it will also eat your memory. Although the use of tcmalloc reduced the memory usage, OSDs sometimes use a lot of memory.

To come to some sort of a conclusion. Are we there yet? Short answer: No. Long answer: No again, but we will get there. Although Ceph still has a long way to go, it’s on the right path. I think that Ceph will become the distributed storage solution under Linux, but it will take some time. Patience is the key here!

The last thing I wanted to address is the fact that testing is needed! Bugs don’t reveal themselves you have to hunt them down. If you have spare hardware and time, do test and report!

Multipath iSCSI with Ubuntu 10.04 and a EqualLogic SAN

Recently we purchased a EqualLogic PS6000XVS for a KVM environment.

In most of our iSCSI systems we use Multipath I/O, we do this by giving the iSCSI Target two NIC’s and give each NIC a IP-Address in a different subnet over a physically different network. This way we have two seperate I/O path’s to the iSCSI Target.

The EqualLogic does not support this, it only supports one virtual IP in one network, so multipathing gets a bit difficult.

On the Dell Wiki there is configuration howto, so I read that carefully.

The examples are for RedHat, but we are using Ubuntu, but that should not make a big difference, but it did….

Our storage network is in the subnet 192.168.32.0/19 where the virtual IP of the EqualLogic is 192.168.32.1. You should know, this is a virtual IP, in total we have three PS6000 nodes, which do some magic by responding with a different MAC Address for 192.168.32.1 towards each client.

One of our clients has the following configuration for the storage connectivity:

eth0      Link encap:Ethernet  HWaddr 14:FE:B5:C6:62:E0  
          inet addr:192.168.37.4  Bcast:192.168.63.255  Mask:255.255.224.0
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:27263332 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25323692 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:24569609290 (22.8 GiB)  TX bytes:132201626154 (123.1 GiB)
          Interrupt:170 Memory:e6000000-e6012800 

eth1      Link encap:Ethernet  HWaddr 14:FE:B5:C6:62:E2  
          inet addr:192.168.38.4  Bcast:192.168.63.255  Mask:255.255.224.0
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:27246580 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25335109 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:24549507448 (22.8 GiB)  TX bytes:132201622012 (123.1 GiB)
          Interrupt:178 Memory:e8000000-e8012800

It took some work to get this working. Bot NIC’s are connected to the same subnet, through different switches though.

The first problem you will run into is the ARP flux problem of Linux, I’m not going to write to much about this, on the internet there is more then enough information written about this topic.

I ended up with this configuration:

auto eth0
iface eth0 inet static
        address 192.168.37.4
        netmask 255.255.224.0
        post-up sysctl -w net.ipv4.conf.eth0.rp_filter=0
        post-up sysctl -w net.ipv4.conf.eth0.arp_ignore=1
        post-up sysctl -w net.ipv4.conf.eth0.arp_announce=2

auto eth2
iface eth2 inet static
        address 192.168.38.4
        netmask 255.255.224.0
        post-up sysctl -w net.ipv4.conf.eth2.rp_filter=0
        post-up sysctl -w net.ipv4.conf.eth2.arp_ignore=1
        post-up sysctl -w net.ipv4.conf.eth2.arp_announce=2

For Open-iSCSI I created two interfaces called ieth0 and ieth1 and routed my iSCSI traffic through them. How you can do this can be found at the Dell wiki.

But it did not work! I was able to ping the EqualLogic over eth0, but not over eth1. If I brought down eth0, it would work over eth1, but not vise versa. It took me a while to find it, but it’s due to a default setting in Ubuntu, done in /etc/sysctl.d/10-network-security.conf, this enables rp_filter (Reverse Path Filtering) by default, so I modified that file

# Turn on Source Address Verification in all interfaces to
# prevent some spoofing attacks.
#net.ipv4.conf.default.rp_filter=1
#net.ipv4.conf.all.rp_filter=1

And voila! My iSCSI multipathing started to work! My multipath shows:

[size=1.0T][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=2][active]
 \_ 13:0:0:0 sdk 8:160 [active][ready]
 \_ 14:0:0:0 sdj 8:144 [active][ready]
eql-0-8a0906-4f2b9e409-2b800184d024d9db_c () dm-4 EQLOGIC,100E-00
[size=2.0T][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=2][active]
 \_ 6:0:0:0 sdg 8:96  [active][ready]
 \_ 11:0:0:0 sdf 8:80  [active][ready]

This should work under Ubuntu 10.04. Took me some time to figure it all out, but now it’s working like a charm. But still, I prefer multipathing over two different VLAN’s and subnets, really odd that the EqualLogic does not support this!