Redundant Ceph monitors with Round Robin DNS

One of the unique features of Ceph is that it can be build without any Single Point of Failure. No single machine will take your cluster down when designed properly.

Ceph’s monitors play a crucial part in this. To make them redundant you want a odd number of monitors, where 3 is more then sufficient for most clusters.

When librados (The RADOS client) reads the ceph.conf it can read something like:

[mon.a]
  mon addr = 192.168.0.1:6789

[mon.b]
  mon addr = 192.168.0.2:6789

[mon.c]
  mon addr = 192.168.0.3:6789

The problem is that when working with for example Apache CloudStack you can’t have it read a ceph.conf nor does CloudStack support multiple Ceph monitors.

The reason behind this is that CloudStack passes storage pools in the form or URIs internally, for example: rbd://1.2.3.4:6789/mypool

So you’d be stuck with a single monitor in CloudStack. It’s not a disaster, since when a client successfully connects to the Ceph cluster it will receive a monitor map which tells it which other monitors are available should the one he’s connected to fail. But when you want to connect when that specific monitor is down you have a problem.

A solution to this is to create a Round Robin DNS record with all your monitors in it:

monitor.ceph.lan. A 192.168.0.1
monitor.ceph.lan. A 192.168.0.2
monitor.ceph.lan. A 192.168.0.3

You can have your librados client connect to “monitor.ceph.lan” and it will connect to one of the monitors listed in that A record. Is one of the monitors down? It will connect to another one.

This doesn’t only work with CloudStack, but it works with any RADOS client like Qemu, libvirt, phprados, rados-java, python-rados, etc, etc. Anything that connects via librados.

P.S.: Ceph fully (!) supports IPv6, so you can also create a Round Robin AAAA-record 🙂

Published by

Wido den Hollander

I am the CTO at PCextreme B.V. and i will keep this blog just to post some interesting information about my daily work on Linux (and other) systems.