ceph – Page 8 – Wido den Hollander

Calculating RADOS objects for RBD images

Ceph’s RBD (RADOS Block Device) is just a thin wrapper on top of RADOS, the object store of Ceph.

It stripes (by default) over 4MB objects in RADOS. It’s very simple to calculate which RADOS object corresponds with which sector on your RBD image/block device.

First you have to find out the block device’s object prefix name and the stripe size:

ceph@daisy:~$ sudo rbd info test
rbd image 'test':
	size 128 MB in 32 objects
	order 22 (4096 KB objects)
	block_name_prefix: rb.0.1066.2ae8944a
	format: 1
ceph@daisy:~$

In this case the stripe size is 4MB (order 2^22) and the object name prefix is rb.0.1066.2ae8944a

With one line of Perl we can calculate the object name in RADOS:

perl -e 'printf "BLOCK_NAME_PREFIX.%012x\n", ((SECTOR_OFFSET * 512) / (4 * 1024 * 1024))'

Let’s say that we want the object for sector 1 of our block device:

perl -e 'printf "rb.0.1066.2ae8944a.%012x\n", ((0 * 512) / (4 * 1024 * 1024))'

This tells us that we need to fetch object rb.0.1066.2ae8944a.000000000000 from RADOS. This can be done using the ‘rados’ command:

sudo rados -p rbd get rb.0.1066.2ae8944a.000000000000 rb.0.1066.2ae8944a.000000000000

Voila, you just fetched 4MB of your drive. Might be useful if you want to do some data recovery or such.

Safely backing up your Ceph monitors

So you might wonder: Why do I need to make a backup of my Ceph monitors? I have multiple monitors.

That’s true, but would you run into the very unfortunate situation where you loose all you monitors, you loose all your data. The monitors contain very important metadata (pgmap, osdmap, crushmap) to run your cluster. If you loose that metadata, you practially loose all your data.

Ceph’s monitors use Google’s LevelDB to store all their information. When looking at a monitors data directory you’ll see something like this:

[root@mon1:/var/lib/ceph/mon/ceph-alpha]$ ls -alR
.:
total 16
drwxr-xr-x 3 root root 4096 Sep 23  2013 .
drwxr-xr-x 3 root root 4096 Mar 24 11:04 ..
-rw-r--r-- 1 root root   55 Sep 23  2013 keyring
drwxr-xr-x 2 root root 4096 Mar 25 14:09 store.db

./store.db:
total 236172
drwxr-xr-x 2 root root    4096 Mar 25 14:09 .
drwxr-xr-x 3 root root    4096 Sep 23  2013 ..
-rw-r--r-- 1 root root 2116576 Mar  1 01:35 1400870.sst
-rw-r--r-- 1 root root 2111248 Mar  1 01:40 1400992.sst
...
...
-rw-r--r-- 1 root root 1149227 Mar 25 14:09 2026520.sst
-rw-r--r-- 1 root root      17 Mar 25 04:34 CURRENT
-rw-r--r-- 1 root root       0 Sep 23  2013 LOCK
-rw-r--r-- 1 root root 2196679 Mar 25 14:09 LOG
-rw-r--r-- 1 root root 3829307 Mar 25 04:33 LOG.old
-rw-r--r-- 1 root root  983040 Mar 25 14:09 MANIFEST-2016290
[root@mon1:/var/lib/ceph/mon/ceph-alpha]$

So it’s very tempting to simply run your favorite backup tool and back up this directory. Usually it’s less then 500MB, so it’s very simple to do so.

It’s however not a wise idea to do so, since you have to be sure the LevelDB database is in a consistent state before backing it up.

In a production cluster you will probably have a least three monitors, so stopping a monitor is not a big problem.

A simple backup solution would be:

service ceph stop mon
tar czf /var/backups/ceph-mon-backup_$(date +'%a').tar.gz /var/lib/ceph/mon
service ceph start mon

Put that in a Shell script and have CRON run it every 24 hours. Make sure not all three monitors create their backup at the same time, but this works just fine.

You now have a tarball which you can upload to any offsite location to make sure your monitors are safe.

Another solution would be to run the monitors on a ZFS on Linux filesystem and use ZFS’s snapshot functionalities. But you can’t be 100% sure that your LevelDB database is in a consistent state at that point.

The safest solution at this moment is to fully stop the monitor, create the backup and start the monitor again. Just make sure you don’t stop all monitors at the same time.

Changing the region of a RGW bucket

As of Ceph version 0.67 (Dumpling) the Ceph Object Gateway aka RADOS Gateway supports regions. This allows you to create a geo-replicated Amazon S3 compatible service.

While working on a setup we decided later in the process that we wanted regions, but we already created about 50 buckets with data in them. We didn’t feel like re-creating all the buckets, so we wanted to change the region of the buckets.

A fresh Object Gateway has a region ‘default’ with one zone ‘default’. We created the region ‘ams02’ (Amsterdam) with one zone called ‘zone01’.

All buckets had the region ‘default’ which we wanted to change to ‘ams02’. No data migrated is required since all the data is on the same Ceph cluster.

This can be done with a couple of ‘radosgw-admin’ commands.

The bucket in these examples is ‘widodh’.

$ radosgw-admin metadata get bucket:widodh

This outputs JSON data:

{ "key": "bucket:widodh",
  "ver": { "tag": "_2qGuaDCBixHpx2lddTe0g-x",
      "ver": 1},
  "mtime": 1380653343,
  "data": { "bucket": { "name": "widodh",
          "pool": ".rgw.buckets",
          "index_pool": ".rgw.buckets.index",
          "marker": "default.20111.1",
          "bucket_id": "default.20111.1"},
      "owner": "widodh",
      "creation_time": 1380653343,
      "linked": "true",
      "has_bucket_info": "false"}}

With this information we can get the rest of the information:

$ radosgw-admin metadata get bucket.instance:widodh:default.20111.1

The id at the end is ‘bucket_id’ from the previous command.

This returns us:

{ "key": "bucket.instance:widodh:default.20111.1",
  "ver": { "tag": "_-HNwyMLAnRALV9tyPqdX5_V",
      "ver": 1},
  "mtime": 1380653343,
  "data": { "bucket_info": { "bucket": { "name": "widodh",
              "pool": ".rgw.buckets",
              "index_pool": ".rgw.buckets.index",
              "marker": "default.20111.1",
              "bucket_id": "default.20111.1"},
          "creation_time": 1380653343,
          "owner": "widodh",
          "flags": 0,
          "region": "default",
          "placement_rule": "default-placement",
          "has_instance_obj": "true"},
      "attrs": [
            { "key": "user.rgw.acl",
              "val": "AgKXAAAAAgIgAAAABgAAAHdpZG9kaBIAAABXaWRvIGRlbiBIb2xsYW5kZXIDA2sAAAABAQAAAAYAAAB3aWRvZGgPAAAAAQAAAAYAAAB3aWRvZGgDA0AAAAACAgQAAAAAAAAABgAAAHdpZG9kaAAAAAAAAAAAAgIEAAAADwAAABIAAABXaWRvIGRlbiBIb2xsYW5kZXIAAAAAAAAAAA=="},
            { "key": "user.rgw.idtag",
              "val": ""},
            { "key": "user.rgw.manifest",
              "val": ""}]}}

Save this output to a file and change the ‘region’ value to what you want, in this case I changed ‘default’ to ‘ams02’.

Afterwards you run:

$ radosgw-admin metadata put bucket.instance:widodh:default.20111.1 < bucket.json

Now I could change these configuration variables in the ceph.conf:

[client.radosgw.rgw1]
    host = rgw1
    ...
    ...
    rgw zone = zone01
    rgw region = ams02
    ...
    ...

We had to change the information of 50 buckets and we didn't feel like doing this manually, so I wrote this script:

#!/usr/bin/env python

import rados
import os
import json
import copy
import subprocess

ceph_id = "admin"
ceph_secret = "ADMIN SECRET"
ceph_monitor = "MONITOR ADDRESS"
ceph_rgw_pool = ".rgw"
ceph_rgw_region = "NEW RGW REGION"

def change_bucket_region(bucket, region):
	me = os.popen("radosgw-admin metadata get bucket:" + bucket)
	meta = json.loads(me.read())
	id = meta['data']['bucket']['bucket_id']
	mei = os.popen("radosgw-admin metadata get bucket.instance:" + bucket + ":" + id)
	imeta = json.loads(mei.read())
	region = imeta['data']['bucket_info']['region']
	if region is not ceph_rgw_region:
		newmeta = copy.copy(imeta)
		newmeta['data']['bucket_info']['region'] = ceph_rgw_region
		stdin = json.dumps(newmeta)
		process = subprocess.Popen(['radosgw-admin', 'metadata', 'put', "bucket.instance:" + bucket + ":" + id], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
		process.stdin.write(stdin)
		process.stdin.close()
		process.wait()


try:
	r = rados.Rados(rados_id=ceph_id)
	r.conf_set("mon_host", ceph_monitor)
	r.conf_set("key", ceph_secret)
	r.connect()

	io = r.open_ioctx(ceph_rgw_pool)

	i = io.list_objects()
	while True:
		try:
			o = i.next()
			b = str(o.key)
			if b[0] is not ".":
				change_bucket_region(b, ceph_rgw_region)
		except StopIteration:
			break

	io.close()
	r.shutdown()
except Exception as e:
	print "Error" + str(e)

Also available as a download.

Use this script with caution since it will change the region of ALL buckets on your cluster to what you specify.