Slow requests in Ceph
When an I/O operation inside Ceph takes more than X seconds, which is 30 by default, it will be logged as a slow request.
This is to show you, the admin, that something is wrong inside the cluster and that you have to take action.
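For reference, the threshold is controlled by the osd_op_complaint_time option (assuming a reasonably recent Ceph release), and the health output lists which OSDs are affected. Something along these lines should show it:

# Show which OSDs currently report slow/blocked requests
ceph health detail

# Check the complaint threshold on a specific OSD (default: 30 seconds)
ceph daemon osd.X config get osd_op_complaint_time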
Origin of slow requests
Slow requests can happen for multiple reasons: slow disks, slow network connections or high load on the machines.
If an OSD has slow requests, you can log on to the machine and see which ops are blocking:
ceph daemon osd.X dump_ops_in_flight
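If the ops have already completed by the time you look, the recently finished slow ops are still available through the admin socket as well:

# Recently completed ops on this OSD, including their duration and the steps they went through
ceph daemon osd.X dump_historic_ops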
waiting for rw locks
Yesterday I got my hands on a Ceph cluster which had a very high number of slow requests: over 2k.
On all OSDs the ops showed ‘waiting for rw locks’.
This is hard to diagnose, and it was. Usually this means the OSDs are busy connecting to other OSDs or performing other network actions.
In most cases, when you see ‘waiting for rw locks’, something is wrong with the network.
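A quick sanity check between OSD hosts can rule out basic connectivity and MTU problems. The address below is just a placeholder for another OSD's cluster-network IP, and the packet size assumes a 9000-byte MTU:

# Ping another OSD host with fragmentation disallowed;
# failures with large packets point to an MTU mismatch on the path
ping -M do -s 8972 -c 3 10.0.0.12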
The network
In this case the Ceph cluster connects over Layer 2, and that network didn’t change. A few hours earlier there had been a change to the Layer 3 network, but since Ceph was running over Layer 2 we didn’t connect the dots.
After some more searching we noticed that the hosts couldn’t perform DNS lookups properly.
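A quick way to check this from an OSD host (the hostname here is just a placeholder):

# Verify that the host can resolve names at all
host ceph-mon1.example.com

# Or query the configured resolver directly
dig +short ceph-mon1.example.com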
DNS
Ceph doesn’t use DNS internally, but it could still have been a problem.
After some searching we found that DNS wasn’t the problem, but that there were two default routes on the system, one of which was down.
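The routing table makes this easy to spot (the client address below is just a placeholder):

# List all routes; two 'default' entries would show up here
ip route show

# Check which route is actually used to reach a given client
ip route get 192.0.2.10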
Layer 3
This Ceph cluster communicates with its clients over Layer 3, and the problem was caused by the fact that the cluster had a hard time talking back to various clients.
This caused various network buffers to fill up, which in turn caused communication problems between the OSDs.
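You can see this building up on an OSD host: growing Send-Q values indicate traffic that cannot be delivered to the other side.

# Per-socket receive and send queue sizes on the OSD host
ss -tn

# Include per-socket memory details
ss -tnm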
So always make sure you double-check the network, since that is usually the root cause.