Slow requests with Ceph: ‘waiting for rw locks’

Slow requests in Ceph

When a I/O operating inside Ceph is taking more than X seconds, which is 30 by default, it will be logged as a slow request.

This is to show you as a admin that something is wrong inside the cluster and you have to take action.

Origin of slow requests

Slow requests can happen for multiple reasons. It can be slow disks, network connections or high load on machines.

If a OSD has slow requests you can log on to the machine and see what Ops are blocking:

ceph daemon osd.X dump_ops_in_flight

waiting for rw locks

Yesterday I got my hands on a Ceph cluster which had a very high number, over 2k, of slow requests.

On all OSDs they showed ‘waiting for rw locks’

This is hard to diagnose and it was. Usually this is where OSDs are busy connecting to other OSDs or performing any other network actions.

Usually when you see ‘waiting for rw locks’ there is something wrong with the network.

The network

In this case the Ceph cluster is connecting over Layer 2 and that network didn’t change. A few hours earlier there was a change to the Layer 3 network, but since Ceph was running over Layer 2 we didn’t connect the two dots.

After some more searching we noticed that the hosts couldn’t perform DNS lookups properly.

DNS

Ceph doesn’t use DNS internally, but it could still be that it was a problem.

After some searching we found that DNS wasn’t the problem, but there were two default routes on the system where one was down.

Layer 3

This Ceph cluster is communicating over Layer 3 and the problem was caused by the fact that the cluster had a hard time talking back to various clients.

This caused various network buffers to fill up and that caused communication problems between OSDs.

So always make sure you double-check the network since that is usually the root-cause.

Installing and testing NixOS

NixOS

NixOS is a minimal and flexible Linux distribution which doesn’t use any of the existing package manager.

NixOS is a Linux distribution with a unique approach to package and configuration management. Built on top of the Nix package manager, it is completely declarative, makes upgrading systems reliable, and has many other advantages.

I wanted to test NixOS and see if it could be a candidate for a very minimal KVM hypervisor running just Qemu, libvirt and Apache CloudStack.

With this post I just wanted to share how you can quickly install NixOS inside a VirtualBox VM.

VirtualBox

On my desktop and laptop I usually use VirtualBox to quickly test something inside Virtual Machines. In this case I downloaded the NixOS minimal 64-bit ISO and created a VM:

  • 1024MB of memory
  • 8GB SATA disk
  • NixOS ISO attached

Installation

After you start the VM it will boot from the ISO. You will then find yourself in a root prompt saying just nixos.

The first step is to format your disk and mount it under /mnt.

parted /dev/sda mklabel msdos
parted /dev/sda mkpart primary 0% 100%
mkfs.xfs /dev/sda1
mount /dev/sda1 /mnt

If you have that done you can run:

nixos-generate-config

This will generate /mnt/etc/nixos/configuration.nix from where you can configure your OS.

This is what I used as my configuration:

{ config, pkgs, ... }:

{
  imports = [
      ./hardware-configuration.nix
    ];

  boot.loader.grub.enable = true;
  boot.loader.grub.version = 2;
  boot.loader.grub.device = "/dev/sda";

  boot.kernelPackages = pkgs.linuxPackages_4_1;

  time.timeZone = "Europe/Amsterdam";

  networking.firewall.enable = false;

  environment.systemPackages = with pkgs; [
    wget git screen ceph
  ];

  services.openssh.enable = true;
  services.openssh.permitRootLogin = "yes";

  virtualisation.libvirtd.enable = true;
  virtualisation.libvirtd.extraOptions = ["-l"];
  virtualisation.libvirtd.extraConfig = "listen_tls = 0\nlisten_tcp = 1";

  system.stateVersion = "15.09";
}

A minimal installation with just OpenSSH and libvirt installed.

Now you can actually install NixOS:

nixos-install

After a few minutes you will be prompted for a root-password and that’s it!

Reboot and you have a running NixOS installation 🙂

Using TRIM/DISCARD with Ceph RBD and libvirt

TRIM/DISCARD

Using TRIM/DISCARD you can give back free space to a Ceph cluster. Normally, any thin provisioned block device will keep on growing until its maximum size while being used. Using the DISCARD command a underlying block device can be instructed to discard blocks which do not contain data.

In the case of Ceph’s RBD we can shrink our RBD images again which gives us back free space in our Ceph cluster.

Libvirt

Using this feature is only supported if you use VirtIO-SCSI and not if you use plain VirtIO.

Some searching brought me to this XML for my Ubuntu 15.10 guest:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none' discard='unmap'/>
  <auth username='admin'>
    <secret type='ceph' uuid='f94812dd-f06f-48f6-9839-1edf7ee8f8d6'/>
  </auth>
  <source protocol='rbd' name='libvirt/image1'>
    <host name='hostname.of.my.ceph.monitor'/>
  </source>
  <target dev='sda' bus='scsi'/>
  <controller type='scsi' index='0' model='virtio-scsi'/>
</disk>

Inside the guest

I tried a Ubuntu 15.10 guest but this should be supported in any other modern Linux guest.

lspci shows me:

root@ubuntu1510:~# lspci 
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
00:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c895a
root@ubuntu1510:~#

And I have a sda block device which my guest uses:

root@ubuntu1510:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            230M     0  230M   0% /dev
tmpfs            49M  4.6M   45M  10% /run
/dev/sda1       9.3G  1.3G  7.6G  15% /
tmpfs           245M     0  245M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           245M     0  245M   0% /sys/fs/cgroup
tmpfs            49M     0   49M   0% /run/user/0
root@ubuntu1510:~#

Now I can run fstrim which will trim the block device:

root@ubuntu1510:~# fstrim -v /
/: 128 MiB (134217728 bytes) trimmed
root@ubuntu1510:~#