Building Ceph .deb packages

Once in a while I have to build Ubuntu packages (.deb) for Ceph to test some of the Pull Requests I write.

I always forget on how to build these packages and probably other people do so as well.

I build Ceph on my laptop using a Docker container with a few commands.

docker run -v /home/wido/repos/ceph:/usr/src/ceph -ti ubuntu:18.04 bash

I have the Ceph repository checked out locally in /home/wido/repos/ceph which I then map into the Docker container.

Once in the container it’s fairly simple:

apt update
apt -y install dpkg-dev
cd /usr/src/ceph
dpkg-buildpackage -us -uc -b -j 4

This will now yield a list of packages you need to install. Once you’ve installed them:

./do_cmake.sh
dpkg-buildpackage -us -uc -b -j 4

Now be ready to wait for a long time as this can take 2 hours as it does on my laptop.

After the build succeeds you’ll find the .deb files in /usr/src

Creating a Management Routing Instance (VRF) on Juniper QFX5100

For a Ceph cluster I have two Juniper QFX5100 switches running as a Virtual Chassis.

This Virtual Chassis is currently only performing L2 forwarding, but I want to move this to a L3 setup where the QFX switches use Dynamic Routing (BGP) and thus become the gateway(s) for the Ceph servers.

This should work, but one of the things I was missing is a dedicated Management Port which uses a different routing table/instance.

Starting with JunOS 17.3R1 you can create a Management Routing Instance as described on the website of Juniper.

set system management-instance

This now creates the Routing Instance called mgmt_junos.

I try to run as much as possible IPv6-only or at least prefer IPv6 over IPv4.

I ran into the problem that configuring an IPv6 address on my em0 interface just wouldn’t work. It kept saying that the IPv6 address was Duplicate.

This is probably something which happens because both QFX switches are connected to the same Out of Band switch and causes it to receive it’s DAD over a different link. I had to disable DAD on interface em0 to make it work.

In addition I configured all DNS lookups to be performed using this routing instance.

The end result for my configuration (snippets):

system {
management-instance;
name-server {
2a00:f10:ff04:153::53 routing-instance mgmt_junos;
2a00:f10:ff04:253::53 routing-instance mgmt_junos;
93.180.70.22 routing-instance mgmt_junos;
93.180.70.30 routing-instance mgmt_junos;
}
}
interfaces {
unit 0 {
family inet {
address 172.17.5.10/24;
}
family inet6 {
address 2a00:f10:XXX:XXX::100/64
dad-disable;
}
}
}
routing-instances {
mgmt_junos {
routing-options {
rib mgmt_junos.inet6.0 {
static {
route ::/0 next-hop 2a00:f10:XXX:XXX::1;
}
}
static {
route 0.0.0.0/0 next-hop 172.17.5.1;
}
}
}
}

This now allows me to SSH to my Juniper QFX Virtual Chassis over interface em0 which uses a different routing instance/table.

Should I make a mistake in the default routing instance, for example a BGP misconfiguration, I can still SSH to my switch(es).

Or if there is a routing error (BGP issue) I can also still reach the switches.

Setting noout flag per Ceph OSD

Prior to Ceph Luminous you could only set the noout flag cluster-wide which means that none of your OSDs will be marked as out.

On large(r) cluster this isn’t always what you want as you might be performing maintenance on a part of the cluster, but you sill want other OSDs which go down to be marked as out.

You can however also set the noout flag per OSD basis using this command:

ceph osd add-noout 0

This means that osd.0 will not be marked as out. You can verify this by looking at the OSDMap:

root@alpha:~# ceph osd dump|grep osd.0
osd.0 up in weight 1 up_from 5 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:[2001:db8::100]:6800/1618,v1:[2001:db8::100]:6801/1618,v2:0.0.0.0:6802/1618,v1:0.0.0.0:6803/1618] [v2:[2001:db8::100]:6804/1618,v1:[2001:db8::100]:6805/1618,v2:0.0.0.0:6806/1618,v1:0.0.0.0:6807/1618] exists,noout,up c0c3e5c3-918b-4f3e-a48c-ea8e7c014a3b
root@alpha:~#

Here you can see the noout flag has been set for osd.0.

Removing the flag for this OSD is the reverse process:

ceph osd rm-noout 0

Other flags you can set per osd:

  • nodown
  • noup
  • noin
  • noout

Use this to your advance!

Comparing two Ceph CRUSH maps

Sometimes you want to test if changes you are about to make to a CRUSH map will cause data to move or not.

In this case I wanted to change a rule in CRUSH where it would use device classes, but I didn’t want any of the ~1PB of data in that cluster to move.

By swapping IDs I could prevent data to move:

root default {
id -50 # do not change unnecessarily
id -53 class hdd # do not change unnecessarily
id -122 class ssd # do not change unnecessarily
root default {
id -53 # do not change unnecessarily
id -50 class hdd # do not change unnecessarily
id -122 class ssd # do not change unnecessarily

Notice how I swapped the IDs. After this I updated the rule:

rule rgw {
id 6
type replicated
min_size 1
max_size 10
step take ams02-objects class hdd
step chooseleaf firstn 0 type host
step emit
}

I then compiled the CRUSHMap and ran crushtool to see if there were any differences:

root@mon01:~# crushtool -i crushmap --compare crushmap.new 
rule 0 had 0/10240 mismatched mappings (0)
rule 1 had 0/10240 mismatched mappings (0)
rule 2 had 0/10240 mismatched mappings (0)
rule 3 had 0/10240 mismatched mappings (0)
rule 4 had 0/10240 mismatched mappings (0)
rule 5 had 0/3072 mismatched mappings (0)
rule 6 had 0/10240 mismatched mappings (0)
maps appear equivalent
root@mon01:~#

No changes! So it was safe to inject this map:

root@mon01:~# ceph osd setcrushmap -i crushmap.new

HAProxy in front of Ceph Manager dashboard

The Ceph Mgr dashboard plugin allows for an easy dashboard which can show you how your Ceph cluster is performing.

In certain situations you can’t contact the Mgr daemons directly and you have to place a Proxy server between your computer and the Mgr daemons.

This can be done easily with HAProxy and the following configuration which assumes that:

  • SSL has been disabled in the Dashboard plugin
  • Dashboard plugin listens in port 8080
  • Mgr is running on the hosts mon01, mon02 and mon03
global
  log         127.0.0.1 local1
  log         127.0.0.1 local2 notice

  chroot      /var/lib/haproxy
  pidfile     /var/run/haproxy.pid
  maxconn     4000
  user        haproxy
  group       haproxy
  daemon

  stats socket /var/lib/haproxy/stats

defaults
  log                     global
  mode                    http
  retries                 3
  timeout http-request    10s
  timeout queue           1m
  timeout connect         10s
  timeout client          1m
  timeout server          1m
  timeout http-keep-alive 10s
  timeout check           10s
  maxconn                 3000
  option                  httplog
  no option               httpclose
  no option               http-server-close
  no option               forceclose

  stats enable
  stats hide-version
  stats refresh 30s
  stats show-node
  stats uri /haproxy?stats
  stats auth admin:haproxy

frontend https
  bind *:80
  default_backend ceph-dashboard

backend ceph-dashboard
  balance roundrobin
  option httpchk GET /
  http-check expect status 200
  server mon01 mon01:8080 check
  server mon02 mon02:8080 check
  server mon03 mon03:8080 check

You can now point your browser to the URL/IP of your HAProxy and use your Ceph dashboard.

In case a Mgr machine fails the health checks of HAProxy will make sure it fails over to on of the other Mgr daemons.

Renaming a network interface with systemd-networkd on Ubuntu 18.04

On a Ubuntu system where I’m creating a VXLAN Proof of Concept with CloudStack I wanted to rename the interface enp5s0 to cloudbr0.

I found many documentation on the internet on how to do this with *.link files, but I was missing the golden tip, which was you need to re-generate your initramfs.

/etc/systemd/network/50-cloudbr0.link

[Match]
MACAddress=00:25:90:4b:81:54

[Link]
Name=cloudbr0

After you create this file, re-generate your initramfs:

update-initramfs -c -k all

You can now use cloudbr0 in *.network files to use it like any other network interface.

In my case this is how my interfaces look like:

1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
6: cloudbr0:  mtu 9000 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:25:90:4b:81:54 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.11/24 brd 192.168.0.255 scope global cloudbr0
       valid_lft forever preferred_lft forever
    inet6 2a00:f10:114:0:225:90ff:fe4b:8154/64 scope global dynamic mngtmpaddr noprefixroute 
       valid_lft 2591993sec preferred_lft 604793sec
    inet6 fe80::225:90ff:fe4b:8154/64 scope link 
       valid_lft forever preferred_lft forever
8: cloudbr1:  mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 86:fa:b6:31:6e:c1 brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.11/24 brd 172.16.0.255 scope global cloudbr1
       valid_lft forever preferred_lft forever
    inet6 fe80::84fa:b6ff:fe31:6ec1/64 scope link 
       valid_lft forever preferred_lft forever
9: vxlan100:  mtu 1450 qdisc noqueue master cloudbr1 state UNKNOWN group default qlen 1000
    link/ether 56:df:29:8d:db:83 brd ff:ff:ff:ff:ff:ff

VXLAN with VyOS and Ubuntu 18.04

VXLAN

Virtual Extensible LAN uses encapsulation technique to encapsulate OSI layer 2 Ethernet frames within layer 4 UDP datagrams. More on this can be found on the link provided.

For a Ceph and CloudStack environment I needed to set up a Proof-of-Concept using VXLAN and some refurbished hardware. The main purpose of this PoC is to verify that VXLAN works with CloudStack, Ceph and Ubuntu 18.04

VyOS

VyOS is an open source network operating system based on Debian Linux. It supports VXLAN, so using this we were able to test VXLAN in this setup.

In production a other VXLAN capable router would be used, but for a PoC VyOS works just fine running on a regular server.

Configuration

The VyOS router is connected to ‘the internet’ with one NIC and the other NIC is connected to a switch.

Using static routes a IPv4 subnet (/24) and a IPv6 subnet (/48) are routed towards the VyOS router. These are then splitted and send to multiple VLANs.

As it took me a while to configure VXLAN under VyOS

I’m only posting that configuration.

interfaces {
    ethernet eth0 {
        address 31.25.96.130/30
        address 2a00:f10:100:1d::2/64
        duplex auto
        hw-id 00:25:90:80:ed:fe
        smp-affinity auto
        speed auto
    }
    ethernet eth5 {
        duplex auto
        hw-id a0:36:9f:0d:ab:be
        mtu 9000
        smp-affinity auto
        speed auto
        vif 300 {
            address 192.168.0.1/24
            description VXLAN
            mtu 9000
        }
    vxlan vxlan1000 {
        address 10.0.0.1/23
        address 2a00:f10:114:1000::1/64
        group 239.0.3.232
        ip {
            enable-arp-accept
            enable-arp-announce
        }
        ipv6 {
            dup-addr-detect-transmits 1
            router-advert {
                cur-hop-limit 64
                link-mtu 1500
                managed-flag false
                max-interval 600
                name-server 2a00:f10:ff04:153::53
                name-server 2a00:f10:ff04:253::53
                other-config-flag false
                prefix 2a00:f10:114:1000::/64 {
                    autonomous-flag true
                    on-link-flag true
                    valid-lifetime 2592000
                }
                reachable-time 0
                retrans-timer 0
                send-advert true
            }
        }
        link eth5.300
        mtu 1500
        vni 1000
    }
    vxlan vxlan2000 {
        address 109.72.91.1/26
        address 2a00:f10:114:2000::1/64
        group 239.0.7.208
        ipv6 {
            dup-addr-detect-transmits 1
            router-advert {
                cur-hop-limit 64
                link-mtu 1500
                managed-flag false
                max-interval 600
                name-server 2a00:f10:ff04:153::53
                name-server 2a00:f10:ff04:253::53
                other-config-flag false
                prefix 2a00:f10:114:2000::/64 {
                    autonomous-flag true
                    on-link-flag true
                    valid-lifetime 2592000
                }
                reachable-time 0
                retrans-timer 0
                send-advert true
            }
        }
        link eth5.300
        mtu 1500
        vni 2000
    }
}

VLAN 300 on eth5 is used to route VNI 1000 and 2000 in their own multicast groups.

The MTU of eth5 is set to 9000 so that the encapsulated traffic of VXLAN can still be 1500 bytes.

Ubuntu 18.04

To test if VXLAN was actually working on the Ubuntu 18.04 host I made a very simple script:

ip link add vxlan1000 type vxlan id 1000 dstport 4789 group 239.0.3.232 dev vlan300 ttl 5
ip link set up dev vxlan1000
ip addr add 10.0.0.11/23 dev vxlan1000
ip addr add 2a00:f10:114:1000::101/64 dev vxlan1000

That works! I can ping 10.0.0.11 and 2a00:f10:114:1000::1 from my Ubuntu 18.04 machine!

Testing with ConfigDrive and cloud-init

cloud-init

cloud-init is a very easy way to bootstrap/configure Virtual Machines running in a cloud environment. It can read it’s metadata from various data sources and configure for public SSH keys or create users for example.

Most large clouds support cloud-init to quickly deploy new Instances.

Config Drive

Config Drive is a data source which reads a local ‘CD-Rom’ device which contains the metadata for the Virtual Machine. This allows for auto configuration of Virtual Machines without them requiring network. My main use case for Config Drive is CloudStack which has support for Config Drive since version 4.11.

I wanted to test with Config Drive outside CloudStack to test some functionality.

meta_data.json

On my laptop I spun up a Ubuntu 18.04 Virtual Machine with cloud-init installed and I attached a ISO which I created.

/home/wido/Desktop/
             cloud-init/
               openstack/
                 latest/
                   meta_data.json

In meta_data.json I put:

{
  "hostname": "ubuntu-test",
  "name": "ubuntu-test",
  "uuid": "0109a241-6fd9-46b6-955a-cd52ad168ee7",
  "public_keys": [
      "ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAwPKBJDJdlvOKIfilr0VSkF9i3viwLtO8GyCpxL/8TxrGKnEg19LPLN3lwKWbbTqBZgRmbrR3bgQfM4ffPoTCxSPv44eZCF8jMPv8PxpC0yVaTcqW4Q7woD7pjdIuGVImrmEls0U8rS3uGQDx7LhFphkAh+blfUtobqzyHvqcbtVEh+drESn8AXrKd1MZfGg6OB8Xrfdr6d959uHBHFJ8pOxxppYbInxKREPb3XmZzmoNQUmqFRN/VNVTreRHAxDcPM8pEPuNmr3Vp+vDVvfpA58yr2rZ21ASB4LlNOOSEx7vnLd6uH9rsqAJOtr0ZEE29fU609i4rd6Zda2HTGQO+Q== wido@wido-laptop",
      "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC0HJ4gm7QCUZjeXh/GmCzUamABgtvOZ4GcDOA1mSGoXqGHYkG8qPa5OaURx5yqROmclTtgwkaHxjY4in6Bi7DGxlSyFQuEOKBwNCcZOrGFiRbazeh3c5jEXgoCtDynrqBjuivWTaxzl46WVglxGSy2LAaps9sAhDU+8Xfm9wBYYtjv3CDAHEe9rN3UxgGEO5K+yhoFzcI9vZl4Q9hvMZRRBBFq3vOFoLr9pyyxVUNiH9oQLubeiqDwnQqaGn48WMX9eO9KiH32eSnzlgJg8viFyonVRJq2dxJMEcrfolQhD9GR1BXNeVGDNJYuo9FX1/NeZZYJTUAwZ4jP008CfaBAnuKBfdZTkjNZlXxMQOY29VAetHI/urvh4NK17o4g4gcBpmLV+13saBIHTLXmA8ADkGZgWE6yIYEah04nBHr28vmZJBX4vpQYz9VWiZcJeEsGah/oaPkFMNQB7mXE+plz46YJ1JtozBpntir7A4cHJX3qZNz/JwOU7yAc4K3EP6fGirSvIDtQ/SJs9Gp51Wbv32E3/5ouemsGnh/ziOc8eIUJMTFWqpEG9qVi/tVJiryuvRcha6+0cZmnWnknBztzRO3Oh3JgPwECdNA/X0acXlzFlRbkQpAJx9+ADESauNaIvfQ3l+kZx3m/1eAAZpAI2ZrnvQMUa40XYksiwncM4Q== wido@wido-desktop"
  ]
}

Using mkisofs I was able to create a ISO and attached it to the VM:

mkisofs -iso-level 3 -V 'CONFIG-2' -o /var/lib/libvirt/images/cloud-init.iso /home/wido/Desktop/cloud-init

The ISO was attached to the Virtual Machine and after reboot I could log in with the user ubuntu over SSH using my pubkey!

Placement Groups with Ceph Luminous stay in activating state

Placement Groups stuck in activating

When migrating from FileStore with BlueStore with Ceph Luminuous you might run into the problem that certain Placement Groups stay stuck in the activating state.

44    activating+undersized+degraded+remapped

PG Overdose

This is a side-effect of the new PG overdose protection in Ceph Luminous.

Too many PGs on your OSDs can cause serious performance or availability problems.

You can see the amount of Placement Groups per OSD using this command:

$ ceph osd df

Increase Max PG per OSD

The default value is a maximum of 200 PGs per OSD and you should stay below that! However, if you are hit by PGs in the activating state you can set this configuration value:

[global]
mon_max_pg_per_osd = 500

Then restart the OSDs and MONs which are serving the affected by this.

Usually you shouldn’t run into this, but if this hits you in the middle of a migration or upgrade this might save you.

Quick overview of Ceph version running on OSDs

When checking a Ceph cluster it’s useful to know which versions you OSDs in the cluster are running.

There is a very simple on-line command to do this:

ceph osd metadata|jq '.[].ceph_version'|sort|uniq -c

Running this on a cluster which is currently being upgraded to Jewel to Luminous it shows:

     10 "ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)"
   1670 "ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)"
    426 "ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)"
     66 "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)"

So 66 OSDs are running Luminous and 2106 OSDs are running Jewel.

Starting with Luminous there is also this command:

ceph features

This shows us all daemon and client versions in the cluster:

{
    "mon": {
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 5
        }
    },
    "osd": {
        "group": {
            "features": "0x7fddff8ee84bffb",
            "release": "jewel",
            "num": 426
        },
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 66
        }
    },
    "client": {
        "group": {
            "features": "0x7fddff8ee84bffb",
            "release": "jewel",
            "num": 357
        },
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 7
        }
    }
}