Using TRIM/DISCARD with Ceph RBD and libvirt

TRIM/DISCARD

Using TRIM/DISCARD you can give free space back to a Ceph cluster. Normally, a thin provisioned block device keeps growing towards its maximum size while it is being used. With the DISCARD command the underlying block device can be instructed to discard blocks which no longer contain data.

In the case of Ceph’s RBD we can shrink our RBD images again which gives us back free space in our Ceph cluster.

Libvirt

This feature is only supported if you use VirtIO-SCSI, not if you use plain VirtIO (virtio-blk).

Some searching brought me to this XML for my Ubuntu 15.10 guest:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none' discard='unmap'/>
  <auth username='admin'>
    <secret type='ceph' uuid='f94812dd-f06f-48f6-9839-1edf7ee8f8d6'/>
  </auth>
  <source protocol='rbd' name='libvirt/image1'>
    <host name='hostname.of.my.ceph.monitor'/>
  </source>
  <target dev='sda' bus='scsi'/>
</disk>
<controller type='scsi' index='0' model='virtio-scsi'/>
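
The uuid in the auth section points to a libvirt secret which holds the cephx key of the ‘admin’ user. In case you don’t have such a secret defined yet, here is a rough sketch of how it could be created (the uuid matches the XML above, the key comes from ‘ceph auth get-key’):

cat > ceph-secret.xml <<EOF
<secret ephemeral='no' private='no'>
  <uuid>f94812dd-f06f-48f6-9839-1edf7ee8f8d6</uuid>
  <usage type='ceph'>
    <name>client.admin secret</name>
  </usage>
</secret>
EOF
virsh secret-define ceph-secret.xml
virsh secret-set-value f94812dd-f06f-48f6-9839-1edf7ee8f8d6 $(ceph auth get-key client.admin)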

Inside the guest

I tried an Ubuntu 15.10 guest, but this should work in any other modern Linux guest as well.

lspci shows me:

root@ubuntu1510:~# lspci 
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc Virtio SCSI
root@ubuntu1510:~#

And I have an sda block device which my guest uses:

root@ubuntu1510:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            230M     0  230M   0% /dev
tmpfs            49M  4.6M   45M  10% /run
/dev/sda1       9.3G  1.3G  7.6G  15% /
tmpfs           245M     0  245M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           245M     0  245M   0% /sys/fs/cgroup
tmpfs            49M     0   49M   0% /run/user/0
root@ubuntu1510:~#

Now I can run fstrim which will trim the block device:

root@ubuntu1510:~# fstrim -v /
/: 128 MiB (134217728 bytes) trimmed
root@ubuntu1510:~#
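
If you want to check whether the discards actually arrive at the Ceph cluster, one way (a rough sketch, assuming the ‘libvirt/image1’ image from the XML above) is to sum up the allocated extents of the image before and after running fstrim:

rbd diff libvirt/image1 | awk '{ used += $2 } END { print used/1024/1024 " MB used" }'

The number should go down after the trim, since the discarded space is released back to the cluster.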

Rebuilding libvirt under CentOS 7.1 with RBD storage pool support

If you want to use CentOS 7.1 for your hypervisors with Apache CloudStack and Ceph’s RBD as Primary Storage, you need to rebuild libvirt.

CloudStack requires libvirt to be built with RBD storage pool support, since it uses libvirt to manage RBD volumes. By default libvirt under CentOS is not built with this support (on Ubuntu it is, by the way).

Rebuilding from source

First we need to install a couple of packages:

$ yum install -y rpm-build gcc make ceph-devel

Now we need to download the source RPM (SRPM):

$ wget http://vault.centos.org/centos/7.1.1503/os/Source/SPackages/libvirt-1.2.8-16.el7.src.rpm

Create an rpmbuild directory:

$ mkdir /root/rpmbuild

Now edit /root/.rpmmacros so that it contains:

%_topdir    /root/rpmbuild

Install the SRPM:

$ rpm -i libvirt-1.2.8-16.el7.src.rpm

Open the /root/rpmbuild/SPECS/libvirt.spec file and look for:

%else
    %define with_storage_rbd      0
%endif

Change this to:

%else
    %define with_storage_rbd      1
%endif
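
Before you kick off the build, make sure all of libvirt’s BuildRequires are installed; the few packages we installed earlier are not enough by themselves. One way to pull them in (yum-builddep is part of the yum-utils package):

$ yum install -y yum-utils
$ yum-builddep -y /root/rpmbuild/SPECS/libvirt.spec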

Now build the RPM:

$ cd /root/rpmbuild
$ rpmbuild -ba SPECS/libvirt.spec

After a couple of minutes you should have RPMs with RBD storage pool support enabled!
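
From there it is just a matter of installing the freshly built packages and restarting libvirt. A rough sketch (pick only the subpackages you actually need, and use --replacepkgs if the stock libvirt of the same version-release is already installed):

$ cd /root/rpmbuild/RPMS/x86_64
$ rpm -Uvh --replacepkgs libvirt-*.rpm
$ systemctl restart libvirtd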

Calculating RADOS objects for RBD images

Ceph’s RBD (RADOS Block Device) is just a thin wrapper on top of RADOS, the object store of Ceph.

It stripes (by default) over 4MB objects in RADOS. It’s very simple to calculate which RADOS object corresponds with which sector on your RBD image/block device.

First you have to find out the block device’s object prefix name and the stripe size:

ceph@daisy:~$ sudo rbd info test
rbd image 'test':
	size 128 MB in 32 objects
	order 22 (4096 KB objects)
	block_name_prefix: rb.0.1066.2ae8944a
	format: 1
ceph@daisy:~$

In this case the stripe size is 4MB (order 22, i.e. 2^22 bytes) and the object name prefix is rb.0.1066.2ae8944a.
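
Only the objects which have actually been written to exist in RADOS. You can list them with something like this (a sketch; ‘rbd’ is the pool the image lives in):

rados -p rbd ls | grep rb.0.1066.2ae8944a | sort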

With one line of Perl we can calculate the object name in RADOS:

perl -e 'printf "BLOCK_NAME_PREFIX.%012x\n", ((SECTOR_OFFSET * 512) / (4 * 1024 * 1024))'

Let’s say that we want the object for the first sector (sector 0) of our block device:

perl -e 'printf "rb.0.1066.2ae8944a.%012x\n", ((0 * 512) / (4 * 1024 * 1024))'

This tells us that we need to fetch object rb.0.1066.2ae8944a.000000000000 from RADOS. This can be done using the ‘rados’ command:

sudo rados -p rbd get rb.0.1066.2ae8944a.000000000000 rb.0.1066.2ae8944a.000000000000

Voila, you just fetched 4MB of your drive. Might be useful if you want to do some data recovery or such.
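
As a slightly bigger (hypothetical) example: sector 1000000 starts at byte offset 512000000, and 512000000 / 4194304 rounds down to object number 122, which is 0x7a in hex:

perl -e 'printf "rb.0.1066.2ae8944a.%012x\n", int((1000000 * 512) / (4 * 1024 * 1024))'

That prints rb.0.1066.2ae8944a.00000000007a, which is the RADOS object you would fetch for that sector.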

Enhanced RBD support for CloudStack 4.2

About 1 hour ago the new storage subsystem got merged into the master branch of CloudStack. That is wonderful news for all you out there who want to use features like snapshotting with RBD in CloudStack.

In pre-4.2 CloudStack a snapshot was the same as a backup. As soon as you created a snapshot it would also copy that snapshot to the secondary storage. Not only could this lead to high network utilization when talking about 1TB RBD volumes, it also caused problems with the underlying ‘qemu-img’ tool. To make a long story short: snapshots with RBD just wouldn’t work in CloudStack 4.0 or 4.1 without resorting to dirty hacking. Which we didn’t.

The new storage subsystem separates the backup and snapshot processes. Snapshots are handled by the primary storage and they can be copied to the ‘backup storage’ on request. This allows us to use the full snapshot potential of RBD.

I was waiting for the storage subsystem to be merged into the master branch before I could start working on this. About two weeks ago I already wrote a small functional spec in CloudStack’s wiki to describe what has to be done.

A couple of choices still have to be made. Traditionally we could do everything through libvirt and ‘qemu-img’, but from what I can see now we’ll run into some trouble. We might have to go through the process of wrapping librbd into a Java library to get it all done, but I’m not completely positive about that. Some patches for libvirt(-java) could probably also do the job, but it would take a lot of time and work to get those upstream and into the repositories. The goal is to have this new RBD code work natively on a Ubuntu 13.04 system.

The expectation is that CloudStack 4.2 will be released mid-July this year, but if you are a daredevil you can always track the master branch and play around with that.

I’ll post updates on the cloudstack-dev list on a regular basis about the progress, but you can also watch the master branch and search for commits with ‘RBD’ in the message.

Ceph distributed storage with CloudStack

As we are nearing the CloudStack 4.0 release I figured it was time I’d write something about the Ceph integration in CloudStack 4.0.

At the beginning of this year we (my company) decided we wanted to use CloudStack for our cloud product, but we also wanted to use Ceph for the storage. CloudStack lacked support for Ceph, so I decided I’d implement that.

Fast forward 4 months, a long flight to California, becoming a committer and PPMC member of CloudStack, various patches for libvirt(-java) and here we are, 25 September 2012!

RBD, the RADOS Block Device from Ceph enables you to stripe disks for (virtual) machines across your Ceph cluster. This not only gives high performance, it gives you virtually unlimited scalability (without downtime!) and redundancy. Something your NetApp, EMC or EqualLogic SAN can’t give you.

Although I’m a very big fan of Nexenta (I use it a lot) it also has its limitations. A SAS environment won’t keep scaling forever and SAS is expensive! Yes, ZFS is truly awesome, but you can’t compare it to the distributed powers Ceph has.

The current implementation of RBD in CloudStack is for Primary Storage only, but that’s mainly what you want. It does have a couple of limitations though:

  • You still need either NFS or Local Storage for your System VMs
  • Snapshotting isn’t enabled (see below!)
  • It only works with KVM (Using RBD in Qemu)

If you are happy with that you’ll be able to allocate hundreds of TBs to your CloudStack cluster like it was nothing.

What do you need to use RBD for Primary Storage?

  • CloudStack 4.0 (RC2 is out now)
  • Hypervisors with Ubuntu 12.04.1
  • librbd and librados on your hypervisors
  • Libvirt 0.10.0 (Needs manual installation)
  • Qemu compiled with RBD enabled (see the quick check below)
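
A quick way to verify those last two requirements on a hypervisor (a rough sketch; the exact output differs per version):

qemu-img -h | grep rbd    # 'rbd' should show up in the list of supported formats
virsh version             # should report libvirt 0.10.0 or newer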

There is no need for special configuration on your hypervisor; that’s all controlled by the Management Server. I would, however, recommend that you test the Ceph connectivity first:

rbd -m <monitor address> --user <cephx id> --key <cephx key> ls

If that works you can go ahead and add the RBD Primary Storage pool to your CloudStack cluster. It should be there when adding a new storage pool.

It behaves like any other storage pool in CloudStack, except for the fact that it is running on the next generation of storage 🙂

About the snapshots: these will be implemented in a later version, probably 4.2. It mainly has to do with the way CloudStack currently handles snapshots. A major overhaul of the storage code is planned and as part of that I’ll implement snapshotting.

Testing is needed! So if you have the time, please test and report back!

You can find me on the Ceph and CloudStack IRC channels and mailing lists, so feel free to contact me. Remember that I’m in GMT+2 (Netherlands).