Do not use SMR disks with Ceph

Many new disks, like the Seagate He8 disks, use a technique called Shingled Magnetic Recording (SMR) to increase capacity.

As these disks offer a very low price per gigabyte, they seem interesting to use in a Ceph cluster.


Due to the nature of SMR these disks are very, very, very bad when it comes to random write performance, and random I/O is something that Ceph does a lot on the backing disks.

This results in disks spiking to 100% utilization very quickly, causing all kinds of trouble with OSDs going down and committing suicide.

Do NOT use them

The solution is very simple. Do not use SMR disks in Ceph but stick to the traditional PMR disks in your Ceph cluster.

In the future we might see SMR support in the new BlueStore of Ceph, but at this moment no work has been done, so don’t expect anything soon.

Testing Ceph BlueStore with the Kraken release

Ceph version Kraken (11.2.0) has been released and the Release Notes tell us that the new BlueStore backend for the OSDs is now available.


The current backend for the OSDs is the FileStore, which mainly uses the XFS filesystem to store its data. To overcome several limitations of XFS and POSIX in general, the BlueStore backend was developed.

It should provide better performance (mainly for writes), data safety through checksumming, and compression.

Users are encouraged to test BlueStore starting with the Kraken release for non-production and non-critical data sets and report back to the community.

Deploying with BlueStore

To deploy OSDs with BlueStore you can use ceph-deploy with the --bluestore flag.

I created a simple test cluster with three machines: alpha, bravo and charlie.

Each machine will be running a ceph-mon and a ceph-osd process.

This is the sequence of ceph-deploy commands I used to deploy the cluster:

ceph-deploy new alpha bravo charlie
ceph-deploy mon create alpha bravo charlie

Now, edit the ceph.conf file in the current directory and add:

enable_experimental_unrecoverable_data_corrupting_features = bluestore

With this setting we allow the use of BlueStore and we can now deploy our OSDs:

ceph-deploy --overwrite-conf osd create --bluestore alpha:sdb bravo:sdb charlie:sdb

Running BlueStore

This tiny cluster now runs three OSDs with BlueStore:

root@alpha:~# ceph -s
    cluster c824e460-2f09-4994-8b2f-108aedc52d19
     health HEALTH_OK
     monmap e2: 3 mons at {alpha=[2001:db8::100]:6789/0,bravo=[2001:db8::101]:6789/0,charlie=[2001:db8::102]:6789/0}
            election epoch 14, quorum 0,1,2 alpha,bravo,charlie
        mgr active: charlie standbys: alpha, bravo
     osdmap e14: 3 osds: 3 up, 3 in
            flags sortbitwise,require_jewel_osds,require_kraken_osds
      pgmap v24: 64 pgs, 1 pools, 0 bytes data, 0 objects
            43356 kB used, 30374 MB / 30416 MB avail
                  64 active+clean
root@alpha:~# ceph osd tree
-1 0.02907 root default                                       
-2 0.00969     host alpha                                     
 0 0.00969         osd.0         up  1.00000          1.00000 
-3 0.00969     host bravo                                     
 1 0.00969         osd.1         up  1.00000          1.00000 
-4 0.00969     host charlie                                   
 2 0.00969         osd.2         up  1.00000          1.00000 

On alpha I see that osd.0 only has a small partition for a bit of configuration and the rest is used by BlueStore.

root@alpha:~# df -h /var/lib/ceph/osd/ceph-0
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        97M  5.4M   92M   6% /var/lib/ceph/osd/ceph-0
root@alpha:~# lsblk 
sda      8:0    0    8G  0 disk 
├─sda1   8:1    0  7.5G  0 part /
├─sda2   8:2    0    1K  0 part 
└─sda5   8:5    0  510M  0 part [SWAP]
sdb      8:16   0   10G  0 disk 
├─sdb1   8:17   0  100M  0 part /var/lib/ceph/osd/ceph-0
└─sdb2   8:18   0  9.9G  0 part 
sdc      8:32   0   10G  0 disk 
root@alpha:~# cat /var/lib/ceph/osd/ceph-0/type
bluestore

The OSDs should work just like OSDs running FileStore, but they should perform better.
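To check every OSD on a host instead of running cat once per directory, the same lookup can be sketched in Python. This is only an illustration: the function name osd_backends is made up for this example, and it assumes every OSD data directory below /var/lib/ceph/osd carries a type file like the one shown above.

```python
from pathlib import Path


def osd_backends(osd_root="/var/lib/ceph/osd"):
    """Map each OSD data directory name to the content of its 'type' file.

    Assumption: every OSD directory has a 'type' file, as osd.0 does above.
    Directories without such a file are simply skipped.
    """
    backends = {}
    for type_file in sorted(Path(osd_root).glob("*/type")):
        backends[type_file.parent.name] = type_file.read_text().strip()
    return backends
```

Calling osd_backends() on alpha should then report the backend of ceph-0 without touching the OSD itself.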

Running headless VirtualBox inside Nested KVM

For the Ceph training at 42on I use VirtualBox to build Virtual Machines. This is because they work under macOS, Windows and Linux.

For the internal Git at 42on we use GitLab and I wanted to use GitLab’s CI to build my Virtual Machines automatically.

As we don’t have any physical hardware at 42on (everything runs in the cloud) I wanted to see if I could run VirtualBox Headless inside a VM with Nested KVM enabled.

Nested KVM

The first thing I checked was if my KVM Virtual Machine actually supported Nested KVM. This can be verified with the kvm-ok command under Ubuntu:

root@glrun01:~# kvm-ok 
INFO: /dev/kvm exists
KVM acceleration can be used
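On systems without the kvm-ok helper, a rough equivalent can be sketched in Python. This is not how kvm-ok is implemented; it only assumes an x86 guest, where hardware virtualization shows up as the vmx (Intel) or svm (AMD) CPU flag in /proc/cpuinfo, and it checks that /dev/kvm exists. The function names are made up for this example.

```python
import os


def cpu_virtualization_flags(cpuinfo_text):
    """Return the subset of {'vmx', 'svm'} present in /proc/cpuinfo content."""
    found = set()
    for line in cpuinfo_text.splitlines():
        # x86 kernels list CPU features on lines like "flags : fpu vme ..."
        if line.startswith("flags") and ":" in line:
            found.update(set(line.split(":", 1)[1].split()) & {"vmx", "svm"})
    return found


def nested_kvm_usable(cpuinfo_path="/proc/cpuinfo", kvm_device="/dev/kvm"):
    """True when the CPU exposes vmx/svm and the KVM device node exists."""
    with open(cpuinfo_path) as handle:
        flags = cpu_virtualization_flags(handle.read())
    return bool(flags) and os.path.exists(kvm_device)
```

Inside a VM these flags only appear when the hypervisor passes them through, which is exactly what Nested KVM does.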

With that verified, I went on to install VirtualBox.

Installing VirtualBox is straightforward. Just add the repository and install the packages. Don’t forget to reboot afterwards to make sure all kernel modules are properly installed and loaded.

apt-get install virtualbox

VirtualBox Extension Pack

The trick to get everything working properly is to install Oracle’s VirtualBox Extension Pack. It took me a while to figure out that I needed to install it manually; it isn’t installed by default.

You need to download the pack and install it using the VBoxManage command.

vboxmanage extpack install Oracle_VM_VirtualBox_Extension_Pack-5.0.24.vbox-extpack
vboxmanage list extpacks
vboxmanage setproperty vrdeextpack "Oracle VM VirtualBox Extension Pack"

With that installed and configured I rebooted the machine again just to be sure.

It works!

With that it actually worked. The VirtualBox VMs can now be built inside a Nested KVM machine controlled by GitLab’s CI 🙂

Chown Ceph OSD data directory using GNU Parallel

Starting with Ceph version Jewel (10.2.X) all daemons (MON and OSD) run under the unprivileged user ceph. Prior to Jewel, daemons ran as root, which is a potential security issue.

This means data has to change ownership before a daemon running the Jewel code can run.

Chown data

As the Release Notes state you will have to chown all your data to ceph:ceph in /var/lib/ceph.

chown -R ceph:ceph /var/lib/ceph

On a system with multiple OSDs this might take a lot of time. Using GNU Parallel you can speed this up considerably.

Static UID

The ceph user and group have been assigned a static UID and GID in the major distributions:

  • Fedora/CentOS/RHEL: 167:167
  • Debian/Ubuntu: 64045:64045

Chown in parallel

Using these commands you can chown the data in /var/lib/ceph much faster.

WARNING: Make sure the OSDs are stopped on the system before you continue!

Now you can run these commands (Ubuntu in this case):

find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -type d|parallel chown -R 64045:64045
chown 64045:64045 /var/lib/ceph
chown 64045:64045 /var/lib/ceph/*
chown 64045:64045 /var/lib/ceph/bootstrap-*/*

The first command will take the longest. I tested it on a system with 24 OSDs all containing about 800GB of data. That took roughly 20 minutes.
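If GNU Parallel is not installed, the first command can be approximated in Python. Note that this is a different technique from the one-liner above: it uses a thread pool rather than separate chown processes, and the function names chown_tree and chown_osds are made up for this sketch. As with the shell version, the OSDs must be stopped first and it has to run as root.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chown_tree(path, uid, gid):
    """Recursively chown one directory tree; return the number of entries changed.

    Symlinks are changed themselves and never followed, like 'chown -R'.
    """
    os.chown(path, uid, gid, follow_symlinks=False)
    count = 1
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            os.chown(os.path.join(root, name), uid, gid, follow_symlinks=False)
            count += 1
    return count


def chown_osds(osd_root, uid, gid, workers=8):
    """Chown every directory directly below osd_root, several trees at a time."""
    osd_dirs = [entry.path for entry in os.scandir(osd_root)
                if entry.is_dir(follow_symlinks=False)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda d: chown_tree(d, uid, gid), osd_dirs))
```

On Ubuntu this would be called as chown_osds("/var/lib/ceph/osd", 64045, 64045). The remaining three chown commands still have to be run separately.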

Slow requests with Ceph: ‘waiting for rw locks’

Slow requests in Ceph

When an I/O operation inside Ceph takes more than X seconds (30 by default), it is logged as a slow request.

This shows you as an admin that something is wrong inside the cluster and that you have to take action.

Origin of slow requests

Slow requests can happen for multiple reasons. It can be slow disks, network connections or high load on machines.

If an OSD has slow requests you can log on to the machine and see which ops are blocking:

ceph daemon osd.X dump_ops_in_flight
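The command prints JSON, so the output can be filtered with a few lines of Python. The sketch below is an assumption about the output layout, not a documented contract: it expects a top-level "ops" list whose entries carry an "age" in seconds and a "description" string, and the exact fields may differ between Ceph releases. The function names are made up for this example.

```python
import json


def slow_ops(dump, min_age=30.0):
    """Return (age, description) pairs for ops at least min_age seconds old.

    Assumption: 'dump' is the parsed JSON of dump_ops_in_flight with an
    'ops' list whose entries have 'age' and 'description' keys.
    The result is sorted with the oldest op first.
    """
    slow = [(float(op["age"]), op["description"])
            for op in dump.get("ops", [])
            if float(op["age"]) >= min_age]
    return sorted(slow, reverse=True)


def slow_ops_from_json(text, min_age=30.0):
    """Same as slow_ops(), but starting from the raw JSON text."""
    return slow_ops(json.loads(text), min_age)
```

Feeding it the saved output of the command above gives a quick list of the ops that have been stuck the longest.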

waiting for rw locks

Yesterday I got my hands on a Ceph cluster which had a very high number, over 2k, of slow requests.

On all OSDs the blocked ops showed ‘waiting for rw locks’.

This was hard to diagnose. Usually it means OSDs are busy connecting to other OSDs or performing other network actions.

Usually when you see ‘waiting for rw locks’ there is something wrong with the network.

The network

In this case the Ceph cluster is connecting over Layer 2 and that network didn’t change. A few hours earlier there was a change to the Layer 3 network, but since Ceph was running over Layer 2 we didn’t connect the two dots.

After some more searching we noticed that the hosts couldn’t perform DNS lookups properly.


Ceph doesn’t use DNS internally, but it could still be that it was a problem.

After some searching we found that DNS wasn’t the problem, but there were two default routes on the system where one was down.

Layer 3

The clients of this Ceph cluster connect over Layer 3, and the problem was caused by the fact that the cluster had a hard time talking back to various clients.

This caused various network buffers to fill up and that caused communication problems between OSDs.

So always make sure you double-check the network, since that is usually the root cause.
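A situation with two default routes is easy to overlook, so a small check can help. The sketch below only parses text in the format printed by ip route show (or ip -6 route show). It does not test whether a gateway is actually reachable, and multipath routes spread over several nexthop lines are counted as one. The function name is made up for this example.

```python
def default_routes(ip_route_output):
    """Return the lines describing default routes in 'ip route show' output."""
    return [line.strip()
            for line in ip_route_output.splitlines()
            if line.split()[:1] == ["default"]]
```

More than one entry in the result is a reason to look at which gateway is really in use.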

Using TRIM/DISCARD with Ceph RBD and libvirt


Using TRIM/DISCARD you can give back free space to a Ceph cluster. Normally, any thin-provisioned block device keeps growing until it reaches its maximum size while being used. Using the DISCARD command, an underlying block device can be instructed to discard blocks which do not contain data.

In the case of Ceph’s RBD we can shrink our RBD images again which gives us back free space in our Ceph cluster.


Using this feature is only supported if you use VirtIO-SCSI and not if you use plain VirtIO.

Some searching brought me to this XML for my Ubuntu 15.10 guest:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none' discard='unmap'/>
  <auth username='admin'>
    <secret type='ceph' uuid='f94812dd-f06f-48f6-9839-1edf7ee8f8d6'/>
  </auth>
  <source protocol='rbd' name='libvirt/image1'>
    <host name=''/>
  </source>
  <target dev='sda' bus='scsi'/>
</disk>
<controller type='scsi' index='0' model='virtio-scsi'/>
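The two attributes that matter here are discard='unmap' on the driver element and bus='scsi' on the target element. A small Python sketch can check a disk definition for both. It only looks at the disk element itself, so the virtio-scsi controller model still has to be verified separately, and the function name is made up for this example.

```python
import xml.etree.ElementTree as ET


def discard_problems(disk_xml):
    """Return reasons why a libvirt <disk> element would not pass DISCARD through.

    An empty list means both attributes discussed above are present.
    """
    disk = ET.fromstring(disk_xml)
    problems = []
    driver = disk.find("driver")
    if driver is None or driver.get("discard") != "unmap":
        problems.append("driver element lacks discard='unmap'")
    target = disk.find("target")
    if target is None or target.get("bus") != "scsi":
        problems.append("target bus is not 'scsi'")
    return problems
```

The input can be a single disk element cut from the domain XML of the guest.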

Inside the guest

I tried an Ubuntu 15.10 guest, but this should be supported in any other modern Linux guest.

lspci shows me:

root@ubuntu1510:~# lspci 
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
00:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c895a

And I have a sda block device which my guest uses:

root@ubuntu1510:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            230M     0  230M   0% /dev
tmpfs            49M  4.6M   45M  10% /run
/dev/sda1       9.3G  1.3G  7.6G  15% /
tmpfs           245M     0  245M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           245M     0  245M   0% /sys/fs/cgroup
tmpfs            49M     0   49M   0% /run/user/0

Now I can run fstrim which will trim the block device:

root@ubuntu1510:~# fstrim -v /
/: 128 MiB (134217728 bytes) trimmed
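Before running fstrim it can be useful to know whether the guest kernel sees discard support on the device at all. The Linux block layer exposes this in sysfs: a discard_max_bytes value of 0 means the device does not support discard. A minimal sketch, with a made-up function name and the device name sda taken from the output above:

```python
from pathlib import Path


def supports_discard(device, sysfs_root="/sys/block"):
    """True when the kernel reports a non-zero discard_max_bytes for the device."""
    attribute = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    try:
        return int(attribute.read_text().strip()) > 0
    except (FileNotFoundError, ValueError):
        # Unknown device or unreadable value: treat as "no discard support".
        return False
```

In the guest used here, supports_discard("sda") would be the call to make.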

The Ceph Trafficlight

At PCextreme we have a 700TB Ceph cluster which is used behind our public cloud Aurora Compute which runs Apache CloudStack.

Ceph health

One of the things we monitor on the Ceph cluster is its health. This can be OK, WARN or ERR. It speaks for itself that you always want to see OK, but things do go wrong. Disks fail, machines die, kernel panics happen. Stuff goes wrong.

I thought it was a cool idea to buy a used real traffic light which I could install at the office. OK would be green, WARN would be orange/amber and ERR would be red.

2nd hand Trafficlight

Some searching on the internet brought me to a shop that sells used (Dutch) traffic lights. I bought a Vialis 2230 (the largest one in the picture below).

Vialis trafficlight overview

For EUR 75,00 I got my hands on an original trafficlight!

Controlling the lights

When I got the trafficlight it was already equipped with LED lights which work on 230V. A 30cm cable (cut off) was sticking out with 4 wires in it:

  • Blue: Neutral
  • Green: Phase/Positive for Green
  • Yellow: Phase/Positive for Orange/Amber
  • Red: Phase/Positive for Red

It was easy. All I had to do was buy an add-on board for a Raspberry Pi so I could control the lights.

Solid State Relay

My search for an add-on board brought me to a vendor that makes all kinds of add-on boards for the Raspberry Pi.

One of them is an SSR (Solid State Relay) board which has 4 outputs. Their wiki explained that it was very simple to control the relays using Python.

Solid State Relay board

A quick test at my desk at home brought me to a working setup.

Additional components

After writing the code which controls the light it was time to buy a housing to install it in.

At Conrad I found the things I needed: a housing, some connectors and some cabling. An overview of my order:

Conrad order

This was needed since I would install it at the office and it needed to be safe. You don’t want somebody to get shocked by 230V. That’s kind of dangerous.

Bringing it together

It was time to start drilling and soldering! In my shed it looked like this:

My shed

And a few more pictures of building it. Took me about 3 hours to complete.







At the office

The next day it was time to install it at the office! Some drilling and the result:

Health OK: Green


Health WARN: Amber/Orange


Health ERR: Red

No picture! We can trigger a WARN state in Ceph without service interruptions, but not an ERR state.

The code

The Python code I wrote is all on GitHub. It’s just some Python code which polls our Ceph dashboard every second. If the status changes, it changes the traffic light as well.
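The core idea, switching the light only when the status changes, can be sketched as follows. This is not the code from the repository: the class and function names are made up, the switch callback stands in for whatever drives the relay board, and treating an unknown status as red is a choice made for this example.

```python
LIGHT_FOR_HEALTH = {
    "HEALTH_OK": "green",
    "HEALTH_WARN": "amber",
    "HEALTH_ERR": "red",
}


def light_for(health):
    """Map a Ceph health string to a colour; unknown values count as red."""
    return LIGHT_FOR_HEALTH.get(health, "red")


class TrafficLight:
    """Remember the current colour and call 'switch' only when it changes."""

    def __init__(self, switch):
        self.switch = switch    # hypothetical callback that drives the relays
        self.current = None

    def update(self, health):
        colour = light_for(health)
        if colour == self.current:
            return False        # same state as last poll, leave the relays alone
        self.switch(colour)
        self.current = colour
        return True
```

A polling loop would call update() once per second with the latest health value, so the relays are only touched on a real state change.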

Ceph Monitors are laggy or clock might be skewed

This weekend I got to investigate a Ceph cluster which had issues where the Monitors were constantly performing new elections.

After some investigation I found that one of the three monitors was eating 100% CPU on a single core and kept printing this in the logs:

mon.charlie@2(peon).paxos(paxos updating c 106399655..106400232) lease_expire from mon.0 [2a00:XXX:121:XXX::6789:1]:6789/0 is 2.380296 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)

Digging further I found that the LevelDB store in /var/lib/ceph/mon/X/store.db was 2.5GB in size.

Compact on Start

You can tell the monitor to compact the LevelDB database on start. Add the following to your ceph.conf:

mon compact on start = true

Now restart the monitor and it will compact the LevelDB database.

The CPU usage now dropped and the monitors were happy again.
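To see whether the compaction made a difference, the size of the store can be measured before and after the restart. A minimal sketch with a made-up function name. The path to pass in is the store.db directory mentioned above, where X depends on the name of the monitor:

```python
import os


def directory_size(path):
    """Total size in bytes of all regular files below path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total
```

Dividing the result by 1024 ** 3 gives the size in GB, which makes it easy to compare with the 2.5GB seen before compaction.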

Protecting your Ceph pools against removal or property changes

One of the dangers of Ceph was that you could accidentally remove a multi-terabyte pool and lose all the data. Although the CLI tools asked you for confirmation, librados and all its bindings did not.

Imagine explaining that you just removed a 200TB pool from your storage system due to a typo in your Python code…

So I suggested that we come up with a mechanism to prevent pools from being deleted from a Ceph cluster. And Sage quickly came up with something!

Hammer v0.94

Ceph version 0.94 aka ‘Hammer’ came out a couple of weeks ago and it has some fancy features which prevent you from removing a pool by accident or on purpose.

Monitors denying pool removal

A new configuration setting for the monitors has been introduced:

mon_allow_pool_delete = false

If you add that to the ceph.conf ([mon] section) and restart your MONs you will not be able to remove any pool from your Ceph cluster. Not via the CLI or directly via librados. The Monitors will simply refuse it:

root@admin:~# ceph osd pool delete rbd rbd --yes-i-really-really-mean-it
Error EPERM: pool deletion is disabled; you must first set the mon_allow_pool_delete config option to true before you can destroy a pool
root@admin:~# rados rmpool rbd rbd --yes-i-really-really-mean-it
pool rbd does not exist
error 1: (1) Operation not permitted

This is a cluster-wide configuration setting and can only be changed by restarting your Monitors. A good way to prevent anybody from removing a pool by accident or on purpose.

Pool flags

A different way to achieve this is by setting the new nodelete flag on a pool. Setting this flag prevents the pool from being removed.

Next to this flag a couple of other flags were introduced:

  • nodelete
  • nosizechange
  • nopgchange

The flags speak for themselves. If you set these flags those operations are no longer allowed:

root@admin:~# ceph osd pool set rbd nosizechange true
set pool 0 nosizechange to true
root@admin:~# ceph osd pool set rbd size 5
Error EPERM: pool size change is disabled; you must unset nosizechange flag for the pool first

I’m not allowed to change the size (aka replication level/setting) for the pool ‘rbd’ while that flag is set.

Applying all flags

To apply these flags quickly to all your pools, simply execute these three one-liners:

$ for pool in $(rados lspools); do ceph osd pool set $pool nosizechange true; done
$ for pool in $(rados lspools); do ceph osd pool set $pool nopgchange true; done
$ for pool in $(rados lspools); do ceph osd pool set $pool nodelete true; done
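The same three loops can be written as one nested loop in Python, which is handy if you want to review the commands before running them. The sketch only builds the argument lists for the ceph osd pool set command shown above. The function name is made up, and actually running the commands is left to the caller.

```python
from itertools import product

FLAGS = ("nosizechange", "nopgchange", "nodelete")


def protect_commands(pools, flags=FLAGS):
    """Build one 'ceph osd pool set <pool> <flag> true' command per pool and flag."""
    return [["ceph", "osd", "pool", "set", pool, flag, "true"]
            for pool, flag in product(pools, flags)]
```

The pool names can be taken from the output of rados lspools, one per line, and each returned list can be passed to subprocess.run(cmd, check=True).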

Your Ceph cluster just became a lot safer! No data loss or downtime due to fat fingers anymore 🙂

Rebuilding libvirt under CentOS 7.1 with RBD storage pool support

If you want to use CentOS 7.1 for your hypervisors with Apache CloudStack and Ceph’s RBD as Primary Storage you need to rebuild libvirt.

CloudStack requires libvirt to be built with RBD storage pool support. It uses libvirt to manage RBD volumes. By default libvirt under CentOS is not built with this support. (On Ubuntu it is btw).

Rebuilding from source

First we need to install a couple of packages:

$ yum install -y rpm-build gcc make ceph-devel

Now we need to download the sRPM:

$ wget

Create a rpmbuild directory:

$ mkdir /root/rpmbuild

Now edit /root/.rpmmacros so that it contains:

%_topdir    /root/rpmbuild

Install the sRPM:

$ rpm -i libvirt-1.2.8-16.el7.src.rpm

Open the /root/rpmbuild/SPECS/libvirt.spec file and look for:

    %define with_storage_rbd      0

Change this to:

    %define with_storage_rbd      1
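When the rebuild has to be repeated, for example in a build script, the edit can be automated. The sketch below flips every line of the form shown above from 0 to 1 and reports how many lines it changed. It is a plain text substitution with a made-up function name, so it is worth checking the count and the resulting spec file before building.

```python
import re

# Matches "%define with_storage_rbd 0" with any indentation and spacing.
RBD_DEFINE = re.compile(
    r"^([ \t]*%define[ \t]+with_storage_rbd[ \t]+)0[ \t]*$",
    re.MULTILINE,
)


def enable_rbd(spec_text):
    """Return (new_spec_text, number_of_lines_changed)."""
    return RBD_DEFINE.subn(r"\g<1>1", spec_text)
```

Reading /root/rpmbuild/SPECS/libvirt.spec, passing its content through enable_rbd() and writing the result back gives the same outcome as the manual edit.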

Now build the RPM:

$ cd /root/rpmbuild
$ rpmbuild -ba SPECS/libvirt.spec

After a couple of minutes you should have RPMs with RBD storage pool support enabled!