Safely backing up your Ceph monitors

So you might wonder: Why do I need to make a backup of my Ceph monitors? I have multiple monitors.

That’s true, but should you run into the very unfortunate situation where you lose all your monitors, you lose all your data. The monitors contain very important metadata (pgmap, osdmap, crushmap) that is needed to run your cluster. If you lose that metadata, you practically lose all your data.

Ceph’s monitors use Google’s LevelDB to store all their information. When looking at a monitor’s data directory you’ll see something like this:

[root@mon1:/var/lib/ceph/mon/ceph-alpha]$ ls -alR
.:
total 16
drwxr-xr-x 3 root root 4096 Sep 23  2013 .
drwxr-xr-x 3 root root 4096 Mar 24 11:04 ..
-rw-r--r-- 1 root root   55 Sep 23  2013 keyring
drwxr-xr-x 2 root root 4096 Mar 25 14:09 store.db

./store.db:
total 236172
drwxr-xr-x 2 root root    4096 Mar 25 14:09 .
drwxr-xr-x 3 root root    4096 Sep 23  2013 ..
-rw-r--r-- 1 root root 2116576 Mar  1 01:35 1400870.sst
-rw-r--r-- 1 root root 2111248 Mar  1 01:40 1400992.sst
...
...
-rw-r--r-- 1 root root 1149227 Mar 25 14:09 2026520.sst
-rw-r--r-- 1 root root      17 Mar 25 04:34 CURRENT
-rw-r--r-- 1 root root       0 Sep 23  2013 LOCK
-rw-r--r-- 1 root root 2196679 Mar 25 14:09 LOG
-rw-r--r-- 1 root root 3829307 Mar 25 04:33 LOG.old
-rw-r--r-- 1 root root  983040 Mar 25 14:09 MANIFEST-2016290
[root@mon1:/var/lib/ceph/mon/ceph-alpha]$

So it’s very tempting to simply run your favorite backup tool and back up this directory. Usually it’s less than 500MB, so it’s very simple to do so.

It is, however, not wise to do so: you have to be sure the LevelDB database is in a consistent state before backing it up, and you can’t be sure of that while the monitor is running and writing to it.

In a production cluster you will probably have at least three monitors, so stopping one monitor for a short time is not a big problem.

A simple backup solution would be:

service ceph stop mon
tar czf /var/backups/ceph-mon-backup_$(date +'%a').tar.gz /var/lib/ceph/mon
service ceph start mon

Put that in a shell script and have cron run it every 24 hours. This works just fine, as long as you make sure that not all three monitors create their backup at the same time.
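One weakness of the three commands above is that a failing tar step leaves the monitor stopped. A minimal Python sketch of the same steps, assuming the same ‘service ceph’ commands and paths as above, starts the monitor again even when creating the archive fails. The function name ‘backup_monitor’ and its parameters are made up for this illustration:

```python
import subprocess
import tarfile
import time


def backup_monitor(mon_dir="/var/lib/ceph/mon",
                   backup_dir="/var/backups",
                   run=subprocess.check_call):
    """Stop the monitor, archive mon_dir, and always start the monitor again."""
    # Same naming as the shell version: one archive per weekday (Mon, Tue, ...)
    target = "%s/ceph-mon-backup_%s.tar.gz" % (backup_dir, time.strftime("%a"))
    run(["service", "ceph", "stop", "mon"])
    try:
        # Archive the data directory while the monitor is stopped
        with tarfile.open(target, "w:gz") as tar:
            tar.add(mon_dir)
    finally:
        # Runs even if the archive step raised an exception
        run(["service", "ceph", "start", "mon"])
    return target
```

Calling backup_monitor() without arguments would mirror the shell version. The ‘run’ parameter only exists so the function can be tried out without touching a real monitor.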

You now have a tarball which you can upload to any offsite location to make sure your monitors are safe.

Another solution would be to run the monitors on a ZFS on Linux filesystem and use ZFS’s snapshot functionality. But you can’t be 100% sure that your LevelDB database is in a consistent state at the moment the snapshot is taken.

The safest solution at this moment is to fully stop the monitor, create the backup and start the monitor again. Just make sure you don’t stop all monitors at the same time.

Changing the region of a RGW bucket

As of Ceph version 0.67 (Dumpling) the Ceph Object Gateway aka RADOS Gateway supports regions. This allows you to create a geo-replicated Amazon S3 compatible service.

While working on a setup we decided later in the process that we wanted regions, but we already created about 50 buckets with data in them. We didn’t feel like re-creating all the buckets, so we wanted to change the region of the buckets.

A fresh Object Gateway has a region ‘default’ with one zone ‘default’. We created the region ‘ams02’ (Amsterdam) with one zone called ‘zone01’.

All buckets had the region ‘default’, which we wanted to change to ‘ams02’. No data migration is required since all the data is on the same Ceph cluster.

This can be done with a couple of ‘radosgw-admin’ commands.

The bucket in these examples is ‘widodh’.

$ radosgw-admin metadata get bucket:widodh

This outputs JSON data:

{ "key": "bucket:widodh",
  "ver": { "tag": "_2qGuaDCBixHpx2lddTe0g-x",
      "ver": 1},
  "mtime": 1380653343,
  "data": { "bucket": { "name": "widodh",
          "pool": ".rgw.buckets",
          "index_pool": ".rgw.buckets.index",
          "marker": "default.20111.1",
          "bucket_id": "default.20111.1"},
      "owner": "widodh",
      "creation_time": 1380653343,
      "linked": "true",
      "has_bucket_info": "false"}}

With this information we can get the rest of the information:

$ radosgw-admin metadata get bucket.instance:widodh:default.20111.1

The id at the end is ‘bucket_id’ from the previous command.

This returns us:

{ "key": "bucket.instance:widodh:default.20111.1",
  "ver": { "tag": "_-HNwyMLAnRALV9tyPqdX5_V",
      "ver": 1},
  "mtime": 1380653343,
  "data": { "bucket_info": { "bucket": { "name": "widodh",
              "pool": ".rgw.buckets",
              "index_pool": ".rgw.buckets.index",
              "marker": "default.20111.1",
              "bucket_id": "default.20111.1"},
          "creation_time": 1380653343,
          "owner": "widodh",
          "flags": 0,
          "region": "default",
          "placement_rule": "default-placement",
          "has_instance_obj": "true"},
      "attrs": [
            { "key": "user.rgw.acl",
              "val": "AgKXAAAAAgIgAAAABgAAAHdpZG9kaBIAAABXaWRvIGRlbiBIb2xsYW5kZXIDA2sAAAABAQAAAAYAAAB3aWRvZGgPAAAAAQAAAAYAAAB3aWRvZGgDA0AAAAACAgQAAAAAAAAABgAAAHdpZG9kaAAAAAAAAAAAAgIEAAAADwAAABIAAABXaWRvIGRlbiBIb2xsYW5kZXIAAAAAAAAAAA=="},
            { "key": "user.rgw.idtag",
              "val": ""},
            { "key": "user.rgw.manifest",
              "val": ""}]}}

Save this output to a file (‘bucket.json’ in this example) and change the ‘region’ value to what you want. In this case I changed ‘default’ to ‘ams02’.

Afterwards you run:

$ radosgw-admin metadata put bucket.instance:widodh:default.20111.1 < bucket.json

Now I could change these configuration variables in the ceph.conf:

[client.radosgw.rgw1]
    host = rgw1
    ...
    ...
    rgw zone = zone01
    rgw region = ams02
    ...
    ...

We had to change the information of 50 buckets and we didn't feel like doing this manually, so I wrote this script:

#!/usr/bin/env python

import json
import subprocess

import rados

ceph_id = "admin"
ceph_secret = "ADMIN SECRET"
ceph_monitor = "MONITOR ADDRESS"
ceph_rgw_pool = ".rgw"
ceph_rgw_region = "NEW RGW REGION"

def metadata_get(key):
    # Fetch a metadata entry through radosgw-admin and parse the JSON output
    out = subprocess.check_output(["radosgw-admin", "metadata", "get", key])
    return json.loads(out.decode("utf-8"))

def change_bucket_region(bucket, region):
    meta = metadata_get("bucket:" + bucket)
    bucket_id = meta['data']['bucket']['bucket_id']
    instance_key = "bucket.instance:" + bucket + ":" + bucket_id
    imeta = metadata_get(instance_key)
    # Compare by value (!=); 'is not' would compare object identity
    if imeta['data']['bucket_info']['region'] != region:
        imeta['data']['bucket_info']['region'] = region
        process = subprocess.Popen(['radosgw-admin', 'metadata', 'put', instance_key], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        # communicate() writes the JSON to stdin, closes it and waits for exit
        process.communicate(json.dumps(imeta).encode("utf-8"))


try:
    r = rados.Rados(rados_id=ceph_id)
    r.conf_set("mon_host", ceph_monitor)
    r.conf_set("key", ceph_secret)
    r.connect()

    io = r.open_ioctx(ceph_rgw_pool)

    # The script treats every object in the pool whose name
    # does not start with a dot as a bucket name
    for o in io.list_objects():
        b = str(o.key)
        if not b.startswith("."):
            change_bucket_region(b, ceph_rgw_region)

    io.close()
    r.shutdown()
except Exception as e:
    print("Error: " + str(e))

Also available as a download.

Use this script with caution since it will change the region of ALL buckets on your cluster to what you specify.
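Because the script touches every bucket, it can help to first see which buckets would actually change. The sketch below is a hypothetical helper and not part of the script above: it takes the already parsed output of ‘radosgw-admin metadata get bucket.instance:...’ for a number of buckets and reports those whose region differs from the target. The second bucket in the sample data is made up for the example.

```python
def buckets_to_change(instance_metas, new_region):
    """Return (bucket name, current region) pairs that differ from new_region.

    instance_metas is a list of dicts as produced by json.loads() on the
    output of 'radosgw-admin metadata get bucket.instance:<bucket>:<id>'.
    """
    pending = []
    for meta in instance_metas:
        info = meta["data"]["bucket_info"]
        if info["region"] != new_region:
            pending.append((info["bucket"]["name"], info["region"]))
    return pending


# Trimmed-down sample in the shape of the JSON shown earlier;
# the bucket 'other' is hypothetical
sample = [
    {"data": {"bucket_info": {"bucket": {"name": "widodh"}, "region": "default"}}},
    {"data": {"bucket_info": {"bucket": {"name": "other"}, "region": "ams02"}}},
]
print(buckets_to_change(sample, "ams02"))  # [('widodh', 'default')]
```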

A quick note on running CloudStack with RBD on Ubuntu 12.04

When you want to use Ceph as Primary Storage in Apache CloudStack you need a recent version of libvirt with RBD storage pool support enabled.

If you wanted to use Ubuntu 12.04 LTS (Precise) you had to compile libvirt manually, since the default libvirt version doesn’t include RBD storage pool support.

But not any more! Ubuntu has its Cloud Archive, which is aimed at OpenStack, but that doesn’t matter: we just want a newer version of libvirt with RBD storage pool support.

So, add the Cloud Archive and an Apt source for Ceph and you can use RBD with CloudStack without compiling anything!

$ sudo apt-get install ubuntu-cloud-keyring
$ echo deb http://ubuntu-cloud.archive.canonical.com/ubuntu precise-updates/grizzly main | sudo tee /etc/apt/sources.list.d/cloud-archive.list
$ wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | sudo apt-key add -
$ echo deb http://eu.ceph.com/debian-cuttlefish/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
$ sudo apt-get install cloudstack-agent

Voila, you now have all the packages you need to run a CloudStack agent with RBD support.

Redundant Ceph monitors with Round Robin DNS

One of the unique features of Ceph is that it can be built without any Single Point of Failure. No single machine will take your cluster down when designed properly.

Ceph’s monitors play a crucial part in this. To make them redundant you want an odd number of monitors, where 3 is more than sufficient for most clusters.

When librados (The RADOS client) reads the ceph.conf it can read something like:

[mon.a]
  mon addr = 192.168.0.1:6789

[mon.b]
  mon addr = 192.168.0.2:6789

[mon.c]
  mon addr = 192.168.0.3:6789

The problem is that when working with, for example, Apache CloudStack, you can’t have it read a ceph.conf, nor does CloudStack support multiple Ceph monitors.

The reason behind this is that CloudStack passes storage pools in the form of URIs internally, for example: rbd://1.2.3.4:6789/mypool

So you’d be stuck with a single monitor in CloudStack. It’s not a disaster, since when a client successfully connects to the Ceph cluster it will receive a monitor map which tells it which other monitors are available should the one it is connected to fail. But when you want to connect while that specific monitor is down, you have a problem.
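To see why only one monitor fits, it helps to take such a URI apart. The sketch below uses Python’s standard ‘urllib.parse’ module purely as an illustration, it is not how CloudStack itself parses the URI. The host part has room for exactly one address or hostname:

```python
from urllib.parse import urlsplit

uri = urlsplit("rbd://1.2.3.4:6789/mypool")

print(uri.scheme)            # rbd
print(uri.hostname)          # 1.2.3.4
print(uri.port)              # 6789
print(uri.path.lstrip("/"))  # mypool
```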

A solution to this is to create a Round Robin DNS record with all your monitors in it:

monitor.ceph.lan. A 192.168.0.1
monitor.ceph.lan. A 192.168.0.2
monitor.ceph.lan. A 192.168.0.3

You can have your librados client connect to “monitor.ceph.lan” and it will connect to one of the monitors listed for that name. Is one of the monitors down? It will connect to another one.
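What a client gets back for such a name can be checked with a few lines of Python. This is only a sketch of the DNS side, under the assumption that the resolver returns all records for the name. It says nothing about how librados picks a monitor internally, and the function name ‘monitor_addresses’ is made up for this example:

```python
import socket


def monitor_addresses(name, port=6789, resolve=socket.getaddrinfo):
    """Return every unique IP address that a DNS name resolves to."""
    results = resolve(name, port, proto=socket.IPPROTO_TCP)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({sockaddr[0] for _, _, _, _, sockaddr in results})
```

With the three A records above in place, monitor_addresses("monitor.ceph.lan") should list all three monitor addresses. The ‘resolve’ parameter is only there so the function can be tried with canned answers instead of a real DNS server.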

This doesn’t only work with CloudStack, but it works with any RADOS client like Qemu, libvirt, phprados, rados-java, python-rados, etc, etc. Anything that connects via librados.

P.S.: Ceph fully (!) supports IPv6, so you can also create a Round Robin AAAA-record 🙂

Enhanced RBD support for CloudStack 4.2

About 1 hour ago the new storage subsystem got merged into the master branch of CloudStack. That is wonderful news for all you out there who want to use features like snapshotting with RBD in CloudStack.

In pre-4.2 CloudStack a snapshot was the same as a backup. As soon as you created a snapshot it would also copy that snapshot to the secondary storage. This could not only lead to high network utilization when talking about 1TB RBD volumes, but it also caused problems with the underlying ‘qemu-img’ tool. To make a long story short: Snapshots with RBD just wouldn’t work in CloudStack 4.0 or 4.1 without resorting to dirty hacking. Which we didn’t.

The new storage subsystem separates the backup and snapshot process. Snapshots are handled by the primary storage and they can be copied to the ‘backup storage’ on request. This allows us to use the full snapshot potential of RBD.

I was waiting for the storage subsystem to be merged into the master branch before I could start working on this. About two weeks ago I already wrote a small function spec in CloudStack’s wiki to describe what has to be done.

A couple of choices still have to be made. Traditionally we could do everything through libvirt and ‘qemu-img’, but from what I can see now we’ll run into some trouble. We might have to go through the process of wrapping librbd into a Java library to get it all done, but I’m not completely positive about that. Some patches for libvirt(-java) could probably also do the job, but it would take a lot of time and work to get those upstream and into the repositories. The goal is to have this new RBD code work natively on an Ubuntu 13.04 system.

The expectation is that CloudStack 4.2 will be released mid-July this year, but if you are a daredevil you can always track the master branch and play around with that.

I’ll post updates about the progress on the cloudstack-dev list on a regular basis, but you can also watch the master branch and search for commits with ‘RBD’ in the message.

Ceph distributed storage with CloudStack

As we are nearing the CloudStack 4.0 release I figured it was time I wrote something about the Ceph integration in CloudStack 4.0.

In the beginning of this year we (my company) decided we wanted to use CloudStack for our cloud product, but we also wanted to use Ceph for the storage. CloudStack lacked the support for Ceph, so I decided I’d implement that.

Fast forward 4 months, a long flight to California, becoming a committer and PPMC member of CloudStack, various patches for libvirt(-java) and here we are, 25 September 2012!

RBD, the RADOS Block Device from Ceph, enables you to stripe disks for (virtual) machines across your Ceph cluster. This not only gives high performance, it gives you virtually unlimited scalability (without downtime!) and redundancy. Something your NetApp, EMC or EqualLogic SAN can’t give you.

Although I’m a very big fan of Nexenta (I use it a lot), it also has its limitations. A SAS environment won’t keep scaling forever and SAS is expensive! Yes, ZFS is truly awesome, but you can’t compare it to the distributed powers Ceph has.

The current implementation of RBD in CloudStack is for Primary Storage only, but that’s mainly what you want. It has a couple of limitations though:

  • You still need either NFS or Local Storage for your System VMs
  • Snapshotting isn’t enabled (see below!)
  • It only works with KVM (Using RBD in Qemu)

If you are happy with that you’ll be able to allocate hundreds of TBs to your CloudStack cluster like it was nothing.

What do you need to use RBD for Primary Storage?

  • CloudStack 4.0 (RC2 is out now)
  • Hypervisors with Ubuntu 12.04.1
  • librbd and librados on your hypervisors
  • Libvirt 0.10.0 (Needs manual installation)
  • Qemu compiled with RBD enabled

There is no need for special configuration on your hypervisor; that’s all controlled by the Management Server. I’d however recommend that you test the Ceph connectivity first:

rbd -m <monitor address> --user <cephx id> --key <cephx key> ls

If that works you can go ahead and add the RBD Primary Storage pool to your CloudStack cluster. It should be there when adding a new storage pool.
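If you have more than a few hypervisors, the connectivity test above can be wrapped in a small script. This is only a sketch: it assumes the ‘rbd’ command is installed on the hypervisor and accepts the same options as in the command above, and the function name ‘list_rbd_images’ is made up for this example.

```python
import subprocess


def list_rbd_images(monitor, user, key, run=subprocess.check_output):
    """Run 'rbd ls' against one monitor and return the image names."""
    cmd = ["rbd", "-m", monitor, "--user", user, "--key", key, "ls"]
    # subprocess.check_output raises CalledProcessError
    # when the command exits with a non-zero status
    output = run(cmd)
    return [line for line in output.decode("utf-8").splitlines() if line]
```

An exception from this function means the hypervisor is not ready to be used with the RBD pool yet. A list, even an empty one, means the ‘rbd’ command ran successfully.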

It behaves like any storage pool in CloudStack, except for the fact that it is running on the next generation of storage 🙂

About the snapshots: this will be implemented in a later version, probably 4.2. It mainly has to do with the way CloudStack currently handles snapshots. A major overhaul of the storage code is planned and as part of that I’ll implement snapshotting.

Testing is needed! So if you have the time, please test and report back!

You can find me on the Ceph and CloudStack IRC channels and mailing lists. Feel free to contact me, and remember that I’m in GMT+2 (Netherlands).