Using the new dashboard in ceph-mgr

The upcoming Ceph Luminous (12.2.0) release features the new ceph-mgr daemon which has a few default plugins. One of these plugins is a dashboard to give you a graphical overview of your cluster.

Enabling the module

To enable the dashboard you have to enable the module in your /etc/ceph/ceph.conf on all machines running the ceph-mgr daemon. These are usually your Monitors.

Add this to the configuration:

[mgr]
mgr_modules = dashboard

Don’t restart your ceph-mgr daemon yet. More configuration changes have to be made first.

Setting server address and port

A server address and optionally a port have to be configured as a config-key.

By setting the value to :: the dashboard will be available on all IPv4 and IPv6 addresses on port 7000 (default):

ceph config-key put mgr/dashboard/server_addr ::
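Should you want a different port than the default 7000, the dashboard also reads a port from a config-key. A small sketch, assuming the mgr/dashboard/server_port key described in the Luminous documentation and an example port of 8080:

ceph config-key put mgr/dashboard/server_port 8080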

Restart daemons

Now restart all ceph-mgr daemons on your hosts:

systemctl restart ceph-mgr@
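The systemd instance name is usually the short hostname of the machine (the part after the @). A sketch of restarting the local daemon and checking afterwards that the module was picked up, assuming the mgr module commands available in Luminous:

systemctl restart ceph-mgr@$(hostname -s)
ceph mgr module ls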

Accessing the dashboard

The default port is 7000, so now go to the IP address of the active ceph-mgr and open it in your browser to see the dashboard.

You can find the active ceph-mgr in the ceph status:

root@alpha:~# ceph -s
  cluster:
    id:     30d838cd-955f-42e5-bddb-5609e1c880f8
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum alpha,bravo,charlie
    mgr: charlie(active), standbys: alpha, bravo
    osd: 3 osds: 3 up, 3 in
 
  data:
    pools:   1 pools, 64 pgs
    objects: 0 objects, 0 bytes
    usage:   3173 MB used, 27243 MB / 30416 MB avail
    pgs:     64 active+clean
 
root@alpha:~#

In this case charlie is the active mgr, which in my case has the IPv6 address 2001:db8::102.

Point your browser to: http://[2001:db8::102]:7000 and you will see the dashboard.
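If you don't want to read it from the ceph -s output, the active manager and the address it listens on can also be queried directly. A small sketch using ceph mgr dump (the exact field names may differ slightly per release):

ceph mgr dump | grep active_name
ceph mgr dump | grep active_addr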

Do not use SMR disks with Ceph

Many new disks, like the Seagate He8 disks, use a technique called Shingled Magnetic Recording (SMR) to increase capacity.

As these disks offer a very low price per Gigabyte they seem interesting to use in a Ceph cluster.

Performance

Due to the nature of SMR these disks are very, very, very bad when it comes to Random Write performance. Random I/O is something that Ceph does a lot on the backing disks.

This results in disks spiking to 100% utilization very quickly, causing all kinds of trouble with OSDs going down and committing suicide.
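If you want to see this for yourself before putting such a disk into production, a quick random-write test with fio against the raw device makes the problem visible very quickly. This is just a sketch; /dev/sdX is a placeholder and the test destroys data on that disk:

fio --name=randwrite --filename=/dev/sdX --direct=1 \
    --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 \
    --runtime=60 --time_based --group_reporting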

Do NOT use them

The solution is very simple. Do not use SMR disks in Ceph but stick to the traditional PMR disks in your Ceph cluster.

In the future we might see SMR support in the new BlueStore of Ceph, but at this moment no work has been done, so don’t expect anything soon.

Testing Ceph BlueStore with the Kraken release

Ceph version Kraken (11.2.0) has been released and the Release Notes tell us that the new BlueStore backend for the OSDs is now available.

BlueStore

The current backend for the OSDs is FileStore, which mainly uses the XFS filesystem to store its data. To overcome several limitations of XFS and POSIX in general, the BlueStore backend was developed.

It will provide better performance (mainly for writes), data safety through checksumming, and compression.

Users are encouraged to test BlueStore starting with the Kraken release for non-production and non-critical data sets and report back to the community.

Deploying with BlueStore

To deploy OSDs with BlueStore you can use ceph-deploy with the --bluestore flag.

I created a simple test cluster with three machines: alpha, bravo and charlie.

Each machine will be running a ceph-mon and a ceph-osd process.

This is the sequence of ceph-deploy commands I used to deploy the cluster:

ceph-deploy new alpha bravo charlie
ceph-deploy mon create alpha bravo charlie

Now, edit the ceph.conf file in the current directory and add:

[osd]
enable_experimental_unrecoverable_data_corrupting_features = bluestore

With this setting we allow the use of BlueStore and we can now deploy our OSDs:

ceph-deploy --overwrite-conf osd create --bluestore alpha:sdb bravo:sdb charlie:sdb

Running BlueStore

This tiny cluster now runs three OSDs with BlueStore:

root@alpha:~# ceph -s
    cluster c824e460-2f09-4994-8b2f-108aedc52d19
     health HEALTH_OK
     monmap e2: 3 mons at {alpha=[2001:db8::100]:6789/0,bravo=[2001:db8::101]:6789/0,charlie=[2001:db8::102]:6789/0}
            election epoch 14, quorum 0,1,2 alpha,bravo,charlie
        mgr active: charlie standbys: alpha, bravo
     osdmap e14: 3 osds: 3 up, 3 in
            flags sortbitwise,require_jewel_osds,require_kraken_osds
      pgmap v24: 64 pgs, 1 pools, 0 bytes data, 0 objects
            43356 kB used, 30374 MB / 30416 MB avail
                  64 active+clean
root@alpha:~#
root@alpha:~# ceph osd tree
ID WEIGHT  TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.02907 root default                                       
-2 0.00969     host alpha                                     
 0 0.00969         osd.0         up  1.00000          1.00000 
-3 0.00969     host bravo                                     
 1 0.00969         osd.1         up  1.00000          1.00000 
-4 0.00969     host charlie                                   
 2 0.00969         osd.2         up  1.00000          1.00000 
root@alpha:~#

On alpha I see that osd.0 only has a small partition for a bit of configuration and the rest is used by BlueStore.

root@alpha:~# df -h /var/lib/ceph/osd/ceph-0
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        97M  5.4M   92M   6% /var/lib/ceph/osd/ceph-0
root@alpha:~# lsblk 
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0    8G  0 disk 
├─sda1   8:1    0  7.5G  0 part /
├─sda2   8:2    0    1K  0 part 
└─sda5   8:5    0  510M  0 part [SWAP]
sdb      8:16   0   10G  0 disk 
├─sdb1   8:17   0  100M  0 part /var/lib/ceph/osd/ceph-0
└─sdb2   8:18   0  9.9G  0 part 
sdc      8:32   0   10G  0 disk 
root@alpha:~# cat /var/lib/ceph/osd/ceph-0/type
bluestore
root@alpha:~#

The OSDs should work just like OSDs running FileStore, but they should perform better.

Playing with CephFS recursive statistics

One of the cool features of CephFS is the recursive accounting the filesystem can do.

On a regular filesystem you have to use ‘du -sh’ to figure out how big a directory is. It will traverse into the directory and sum everything up for you. This can take a very long time and be very I/O intensive.

With CephFS this is done within a second:

root@admin:~# ls -alh /mnt/cephfs/
total 4.0K
drwxr-xr-x 1 root root  81T Jan 23 13:09 .
drwxr-xr-x 6 root root 4.0K Jan 13 15:41 ..
drwxrwxr-x 1 root root    0 Jan 23 12:57 DIR1
drwxrwxr-x 1 root root  80T Apr  3 11:16 DIR2
root@admin:~#

Or fetch these statistics using the virtual xattrs of CephFS:

root@admin:~# getfattr -d -m ceph.dir.* /mnt/cephfs
getfattr: Removing leading '/' from absolute path names
# file: mnt/cephfs
ceph.dir.entries="2"
ceph.dir.files="0"
ceph.dir.rbytes="88833202521902"
ceph.dir.rctime="1430297412.09159402000"
ceph.dir.rentries="10334874"
ceph.dir.rfiles="9853051"
ceph.dir.rsubdirs="481823"
ceph.dir.subdirs="2"

root@admin:~#

It is as simple as that. Using these virtual xattrs of CephFS you instantly know how much data and how many files and (recursive) entries there are in any directory.
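You can also ask for a single attribute instead of dumping them all, for example the recursive byte count of one directory. A small sketch; the --only-values flag simply strips the attribute name from the output:

getfattr -n ceph.dir.rbytes /mnt/cephfs/DIR2
getfattr --only-values -n ceph.dir.rbytes /mnt/cephfs/DIR2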

No long waits on find or du, simply ask the Metadata Server of CephFS!

Flash LSI 2308 to IT-mode on a SuperMicro X10SL7-F mainboard

For a new SSD-only Ceph deployment I got my hands on a couple of SuperMicro SYS-1018D-73MTF 1U servers with 8 Intel SSDs.

Great machines, but by default SuperMicro ships them with the IR (Integrated RAID) firmware and I wanted to use the IT (JBOD) firmware.

The X10SL7 mainboard has a UEFI BIOS, so flashing the controller was rather easy.

First, download the latest firmware for the LSI 2308 controller from SuperMicro. At this moment it is the PH19-IT firmware.

Now we have to create a small disk image with a FAT filesystem to use as a virtual drive for the IPMI.

I run Ubuntu on my desktops and laptops, so this is how I created that small disk image:

NOTE: You can generate the ISO with the steps below, or download it directly.

mkfs.msdos -C ipmi.iso 2880
sudo mkdir /media/ipmi/
sudo mount -o loop ipmi.iso /media/ipmi/
wget ftp://ftp.supermicro.com/Driver/SAS/LSI/2308/Firmware/IT/PH19-IT.zip
sudo unzip PH19-IT.zip UEFI/* -d /media/ipmi/
sudo umount /media/ipmi/

Now open your IPMI and attach ipmi.iso to your server.

Reboot the machine and hit the F11 key to open a boot menu and start the built-in UEFI shell.

When in the UEFI shell enter these commands:

fs0:
cd UEFI
SMC2308T.NSH

At some point it will ask for the 9 remaining digits of the SAS address. Simply enter 000000000 (9 zeros) and you’re fine.

Reboot the system and the controller should now be in IT mode.

Deploying Ceph over IPv6

I like to deploy Ceph clusters over IPv6. I actually think that’s the way forward. IPv4 is legacy just like iSCSI and NFS are.

Last week I was at a customer deploying a new Ceph cluster and they wanted to deploy with IPv6! Most deployments I did with IPv6 were done manually and not with ceph-deploy, and when trying to deploy with ceph-deploy over IPv6 I ran into some issues.

Before going into that I want to make something clear. With Ceph you choose either IPv4 OR IPv6. There is NO dual-stack support. So the whole cluster (including clients) communicates over IPv6 or over IPv4. Switching afterwards is not possible. So that’s why I urge people to deploy with IPv6 since you probably want to have your cluster running for a long time.

All package repos (including the Ceph ones) have IPv6 enabled, so in my opinion there is no good reason to prefer IPv4 for a Ceph deployment when IPv6 is available. I even think it's easier in large deployments due to the Router Advertisements in IPv6.

Having said that, it's time to go back to the ceph-deploy issue.

In ceph.conf you have to enclose IPv6 addresses for monitors with a [ and ]. This is what ceph-deploy did wrong:

[global]
mon_host = 2a00:f10:X:X::X,2a00:f10:X:X::Y,2a00:f10:X:X::Z

While it should have been:

[global]
mon_host = [2a00:f10:X:X::X],[2a00:f10:X:X::Y],[2a00:f10:X:X::Z]
ms_bind_ipv6 = true

The ms_bind_ipv6 setting tells the Messenger inside Ceph to bind on IPv6. It’s important that you set that setting on all hosts in the Ceph cluster, otherwise things will go wrong badly. Heartbeats and such will not work.
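To double-check that a running daemon actually picked up the setting, you can query it through the admin socket. A sketch; adjust the daemon name to one that runs on that host:

ceph daemon osd.0 config show | grep ms_bind_ipv6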

I wrote a patch for ceph-deploy which fixes it. It writes the ‘mon_host’ setting correctly and also adds the ‘ms_bind_ipv6’ setting when IPv6 is used for the monitors.

Failover with Nexenta, NFS and the RSF-1 plugin

The title might seem a bit cryptic, but this post is about a Highly Available Nexenta cluster with the RSF-1 plugin we are deploying.

While we are waiting for the moment where we can start using Ceph we are implementing new storage for our hosting clusters. Our current Linux machines with LVM and XFS are not up to the task anymore.

After some testing and discussing we chose to use Nexenta. What Nexenta is and how awesome ZFS is can be found elsewhere on the net; I'm not going to discuss that here.

I wanted to publish our findings about the HA plugin and NFS.

In short, we have two headends connected to two SAS JBODs. The RSF-1 plugin makes sure the ZPOOL is imported on one headend at a time. If one headend fails, the plugin automatically fails the pool over to the other headend.

The plugin provides one HA IP which is shared between the headends; you probably get the point.

We’ve been doing some testing and noticed that when we mount NFS (v3) over TCP the failover takes a staggering 6 minutes! Well, the failover doesn’t take 6 minutes, but that’s the time it takes for the TCP connections to recover.

When mounting over UDP the service is continued in 50 seconds, so that’s a big difference!
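For reference, this is roughly how the two mounts differ on the client side. A sketch with a placeholder server IP and export path, mounting NFSv3 over UDP versus TCP:

mount -t nfs -o vers=3,proto=udp 192.0.2.10:/volumes/vol1 /mnt/nfs
mount -t nfs -o vers=3,proto=tcp 192.0.2.10:/volumes/vol1 /mnt/nfs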

Some testing showed that this is due to the following kernel settings:

net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15

This page explains what those two values actually control.

We’ve been experimenting with those values and lowering retries1 to 1 gave us the same recovery times as with UDP, but sometimes the recovery would still take 6 minutes.

For now I advise to use NFS with UDP (which gives better performance anyway), but if you need to use TCP for some reason try fiddling with these values.
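If you do want to experiment with TCP, this is the kind of tweaking we did. A sketch; the retries1 value of 1 is what we tested with, not a general recommendation:

sysctl -w net.ipv4.tcp_retries1=1
# persist across reboots
echo "net.ipv4.tcp_retries1 = 1" >> /etc/sysctl.conf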

Distributed storage under Linux, is it there yet?

When it comes down to storage under Linux you have a lot of great options if you are looking for local storage, but what if you have so much data that local storage is not really an option? And what if you need multiple servers accessing the data? You’ll probably take NFS or iSCSI with a clustered filesystem like GFS or OCFS2.

When using NFS or iSCSI it will come down to one, two or maybe three servers storing your data, where one will have a primary role for 99.99% of the time. That is still a Single Point-of-Failure (SPoF).

Although this worked (and still works) fine, we are running into limitations. We want to store more and more data, we want to expand without downtime and we want expansion to go smoothly. Doing all that under Linux right now is a... let's say: challenge.

Energy costs are also rising, and whether you like it or not, that influences the work of a system administrator. We were used to having an Active/Passive setup, but that doubles your energy consumption! In large environments that could mean a lot of money. Do we still want that? I don't think so.

Distributed storage is what we need: no central brain, no passive nodes, but a fully distributed and fault-tolerant filesystem where every node is active, and it has to scale easily without any disruption in service.

I think it’s nearly there and they call it Ceph!

Ceph is a distributed filesystem built on top of RADOS, a scalable and distributed object store. This object store simply stores objects in pools (which some people might refer to as "buckets"). It's this distributed object store which is the basis of the Ceph filesystem.

RADOS works with Object Store Daemons (OSDs). Each OSD is a daemon with a data directory (btrfs) where it stores its objects and some basic information about the cluster. Typically the data directory of an OSD is one hard disk formatted with btrfs.

Every pool has a replication size property; this tells RADOS how many copies of an object you want to store. If you choose 3, every object you store in that pool will be stored on three different OSDs. This provides data safety and availability: losing one (or more) OSDs will not lead to data loss nor unavailability.
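On a current Ceph release, setting the replication size of a pool looks roughly like this. A sketch: the pool name and PG count are made up, and the exact commands differ on the old 0.x releases this post talks about:

ceph osd pool create mypool 64
ceph osd pool set mypool size 3
ceph osd pool get mypool size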

Data placement in RADOS is done by CRUSH. With CRUSH you can strategically place your objects (and their replicas) in different rooms, racks, rows and servers. One might want to place the second replica on a separate power feed from the primary replica.
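To give an idea of what that looks like, a rule in a decompiled crushmap that spreads replicas over different racks might look roughly like this. A sketch in the classic crushmap syntax, not taken from my cluster:

rule replicated_racks {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}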

A small RADOS cluster could look like this:

This is a small RADOS cluster: three machines with 4 disks each and one OSD per disk. The monitor is there to inform the clients about the cluster state. Although this setup has one monitor, monitors can be made redundant by simply adding more (an odd number is preferable).

With this post I don't want to tell you everything about RADOS and its internal workings; all this information is available on the Ceph website.

What I do want to tell you is how my experiences are with Ceph at this point and where it’s heading.

I started testing Ceph about 1.5 years ago; I stumbled upon it when reading the changelog of kernel 2.6.34, the first kernel in which the Ceph kernel client was included.

I’m always on a quest to find a better solution for our storage, right now we are using Linux boxes with NFS, but that is really starting to hurt in many ways.

Where did Ceph get in the past 18 months? Far! I started testing when version 0.18 just got out, right now we are at 0.31!

I started testing the various components of Ceph on a small number of virtual machines, but currently I have two clusters running: a “semi-production” one where I'm running various virtual machines with RBD and Qemu-KVM, and a second, 74TB cluster with 10 machines, each having 4 2TB disks.

Filesystem            Size  Used Avail Use% Mounted on
[2a00:f10:113:1:230:48ff:fed3:b086]:/   74T  13T   61T  17% /mnt/ceph

As you can see, I'm running my cluster over IPv6. Ceph does not support dual-stack; you will have to choose between IPv4 or IPv6, where I prefer the latter.

But you are probably wondering how stable or production-ready it is. That question is hard to answer. My small cluster where I run the KVM virtual machines (through Qemu-KVM with RBD) has only 6 OSDs and a capacity of 600GB. It has been running for about 4 months now without any issues, but I have to be honest, I didn't stress it either. I didn't kill any machines nor did hardware fail. It should be able to handle those crashes, but I haven't stressed that cluster.

The story is different with my big cluster. In total it's 15 machines: 10 machines hosting a total of 40 OSDs, and the rest are monitors, metadata servers and clients. It started running about 3 months ago and since then I've seen numerous crashes. I also chose to use the WD Green 2TB disks in my cluster; that was not the best decision. Right now I have a 12% failure rate on these disks. While the failure of those disks is not a good thing, it is a good test for Ceph!

Some disk failures caused serious problems, with the cluster bouncing around and never recovering from it. But about 2 days ago I noticed two other disks failing and the cluster fully recovered from it while an rsync was writing data to it. So, it seems to be improving!

During my further testing I have stumbled upon a lot of things. My cluster is built with Atom CPUs, but those seem to be a bit underpowered for the work. Recovery is heavy for OSDs, so whenever something goes wrong in the cluster I see the CPUs starting to spike towards 100%. This is something that is being addressed.

Data placement is done in Placement Groups, aka PGs. The more data or OSDs you add to the cluster, the more PGs you'll get. The more PGs you have, the more memory your OSDs start to consume. My OSDs have 4GB (an Atom limitation) each. Recovery is not only CPU hungry, it will also eat your memory. Although the use of tcmalloc reduced the memory usage, OSDs sometimes use a lot of memory.

To come to some sort of a conclusion. Are we there yet? Short answer: No. Long answer: No again, but we will get there. Although Ceph still has a long way to go, it’s on the right path. I think that Ceph will become the distributed storage solution under Linux, but it will take some time. Patience is the key here!

The last thing I wanted to address is the fact that testing is needed! Bugs don't reveal themselves; you have to hunt them down. If you have spare hardware and time, do test and report!

Multipath iSCSI with Ubuntu 10.04 and an EqualLogic SAN

Recently we purchased an EqualLogic PS6000XVS for a KVM environment.

In most of our iSCSI systems we use Multipath I/O; we do this by giving the iSCSI target two NICs and giving each NIC an IP address in a different subnet over a physically different network. This way we have two separate I/O paths to the iSCSI target.

The EqualLogic does not support this; it only supports one virtual IP in one network, so multipathing gets a bit difficult.

On the Dell Wiki there is a configuration howto, so I read that carefully.

The examples are for RedHat while we are using Ubuntu; that should not make a big difference, but it did…

Our storage network is in the subnet 192.168.32.0/19, where the virtual IP of the EqualLogic is 192.168.32.1. You should know this is a virtual IP; in total we have three PS6000 nodes, which do some magic by responding with a different MAC address for 192.168.32.1 towards each client.

One of our clients has the following configuration for the storage connectivity:

eth0      Link encap:Ethernet  HWaddr 14:FE:B5:C6:62:E0  
          inet addr:192.168.37.4  Bcast:192.168.63.255  Mask:255.255.224.0
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:27263332 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25323692 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:24569609290 (22.8 GiB)  TX bytes:132201626154 (123.1 GiB)
          Interrupt:170 Memory:e6000000-e6012800 

eth1      Link encap:Ethernet  HWaddr 14:FE:B5:C6:62:E2  
          inet addr:192.168.38.4  Bcast:192.168.63.255  Mask:255.255.224.0
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:27246580 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25335109 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:24549507448 (22.8 GiB)  TX bytes:132201622012 (123.1 GiB)
          Interrupt:178 Memory:e8000000-e8012800

It took some work to get this working. Both NICs are connected to the same subnet, through different switches though.

The first problem you will run into is the ARP flux problem of Linux. I'm not going to write too much about this; on the internet there is more than enough information about this topic.

I ended up with this configuration:

auto eth0
iface eth0 inet static
        address 192.168.37.4
        netmask 255.255.224.0
        post-up sysctl -w net.ipv4.conf.eth0.rp_filter=0
        post-up sysctl -w net.ipv4.conf.eth0.arp_ignore=1
        post-up sysctl -w net.ipv4.conf.eth0.arp_announce=2

auto eth2
iface eth2 inet static
        address 192.168.38.4
        netmask 255.255.224.0
        post-up sysctl -w net.ipv4.conf.eth2.rp_filter=0
        post-up sysctl -w net.ipv4.conf.eth2.arp_ignore=1
        post-up sysctl -w net.ipv4.conf.eth2.arp_announce=2

For Open-iSCSI I created two interfaces called ieth0 and ieth1 and routed my iSCSI traffic through them. How you can do this can be found at the Dell wiki.
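For completeness, the commands look roughly like this. A sketch based on that approach; the interface names and the discovery portal follow my setup above, but yours may differ:

iscsiadm -m iface -I ieth0 --op=new
iscsiadm -m iface -I ieth0 --op=update -n iface.net_ifacename -v eth0
iscsiadm -m iface -I ieth1 --op=new
iscsiadm -m iface -I ieth1 --op=update -n iface.net_ifacename -v eth1
iscsiadm -m discovery -t st -p 192.168.32.1 -I ieth0 -I ieth1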

But it did not work! I was able to ping the EqualLogic over eth0, but not over eth1. If I brought down eth0, it would work over eth1, but not vice versa. It took me a while to find it, but it's due to a default setting in Ubuntu, done in /etc/sysctl.d/10-network-security.conf. This enables rp_filter (Reverse Path Filtering) by default, so I modified that file:

# Turn on Source Address Verification in all interfaces to
# prevent some spoofing attacks.
#net.ipv4.conf.default.rp_filter=1
#net.ipv4.conf.all.rp_filter=1
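Commenting out those lines only prevents the setting from being re-applied at boot; to clear rp_filter on the running system you can also set it directly (a sketch):

sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.default.rp_filter=0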

And voila! My iSCSI multipathing started to work! My multipath shows:

[size=1.0T][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=2][active]
 \_ 13:0:0:0 sdk 8:160 [active][ready]
 \_ 14:0:0:0 sdj 8:144 [active][ready]
eql-0-8a0906-4f2b9e409-2b800184d024d9db_c () dm-4 EQLOGIC,100E-00
[size=2.0T][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=2][active]
 \_ 6:0:0:0 sdg 8:96  [active][ready]
 \_ 11:0:0:0 sdf 8:80  [active][ready]

This should work under Ubuntu 10.04. It took me some time to figure it all out, but now it's working like a charm. Still, I prefer multipathing over two different VLANs and subnets; really odd that the EqualLogic does not support this!

How to turn off the Journal with EXT4

For a specific system I wanted maximum performance, so I tried turning off the journal on my ext4 device.

It took me some time to find out how, so here is a small howto:

1. Unmount your EXT4 filesystem
2. tune2fs -O ^has_journal /dev/sdX
3. mount your filesystem again.

And voila! You have an ext4 filesystem without a journal.
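To verify the journal is really gone you can check the filesystem features afterwards. A small sketch; /dev/sdX is a placeholder for your device:

tune2fs -l /dev/sdX | grep 'Filesystem features'
# 'has_journal' should no longer appear in the feature list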

Note: This works if you have a kernel newer than 2.6.28!