ceph – Widodh

Using L3 (BGP) routing for your Ceph storage

Many Ceph storage environments out there are deployed using a L2 underlay.

This means that the Ceph servers (MON, OSD, etc) are connected using LACP/Bonding to a pair of switches. On their ‘bond0’ device (example) they are assigned an IPv4/IPv6 address and this is used for connectivity between the Ceph nodes and the Ceph clients.

Although this works fine, I try to avoid L2 as much as possible in datacenter deployments. L2 scales up to a certain point, but it has it’s limitations. Modern Top-of-Rack (ToR) switches can easily route traffic and wire-speed. This used to be a limitation of switches in the past. When designing environments I prefer using a L3 approach.

This blogpost is there to show you the rough concept. It’s NOT a copy and paste tutorial. You will need to adapt it to your situation.

Network setup and BGP configuration

Using Juniper QFX5100 switches and Frrouting on the Ceph nodes I’ve established BGP sessions between the ToR and Ceph nodes according to the diagram below.

Each nodes has two independent BGP sessions with the Top-of-Rack in it’s rack. Via these BGP sessions they advertise their local IPv6 /128 address. Via the same sessions they receive a default ::/0 IPv6 route.

ceph01# sh bgp summary 

IPv6 Unicast Summary (VRF default):
BGP router identifier 1.2.3.4, local AS number 65101 vrf-id 0
BGP table version 10875
RIB entries 511, using 96 KiB of memory
Peers 2, using 1448 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor        V    AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
enp196s0f0np0   4 65002    487385    353917        0    0    0 3d18h17m            1        1 N/A
enp196s0f1np1   4 65002    558998    411452        0    0    0 01:38:55            1        1 N/A

Total number of neighbors 2
ceph01#

Here we see two BGP sessions active over both NICs of the Ceph node. We can also see that a default IPv6 route is received via BGP.

ceph01# sh ipv6 route ::/0
Routing entry for ::/0
  Known via "bgp", distance 20, metric 0
  Last update 01:42:00 ago
    fe80::e29:efff:fed7:4719, via enp196s0f0np0, weight 1
    fe80::7686:e2ff:fe7c:a19e, via enp196s0f1np1, weight 1

ceph01#

The Frrouting configuration ( /etc/frr/frr.conf ) is fairly simple:

frr defaults traditional
hostname ceph01
log syslog informational
no ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
interface enp196s0f0np0
 no ipv6 nd suppress-ra
exit
!
interface enp196s0f1np1
 no ipv6 nd suppress-ra
exit
!
interface lo
 ipv6 address 2001:db8:100:1::/128
exit
!
router bgp 65101
 bgp router-id 1.2.3.4
 no bgp ebgp-requires-policy
 no bgp default ipv4-unicast
 no bgp network import-check
 neighbor upstream peer-group
 neighbor upstream remote-as external
 neighbor enp196s0f0np0 interface peer-group upstream
 neighbor enp196s0f1np1 interface peer-group upstream
 !
 address-family ipv6 unicast
  redistribute connected
  neighbor upstream activate
 exit-address-family
exit
!
end

On the Juniper switches a configuration was defined for the BGP Unnumbered (RFC5549) configuration as well. This blogpost explains very well on how BGP Unnumbered works on JunOS, I am not going to repeat it. I will highlight a couple of pieces of configuration.

root@tor01# show interfaces xe-0/0/1
description ceph01;
unit 0 {
    mtu 9216;
    family inet6;
}

root@tor01# show protocols router-advertisement 
interface xe-0/0/1.0;

root@tor01# show | compare 
[edit]
+  policy-options {
+      as-list bgp_unnumbered_as_list members 65101-65199;
+  }
[edit protocols]
+   bgp {
+       group ceph {
+           family inet6 {
+               unicast;
+           }
+           multipath;
+           export default-v6;
+           import ceph-loopback;
+           dynamic-neighbor bgp_unnumbered {
+               peer-auto-discovery {
+                   family inet6 {
+                       ipv6-nd;
+                   }
+                   interface xe-0/0/1.0;
+                   interface xe-0/0/2.0;
+                   interface xe-0/0/3.0;
+               }
+           }
+           peer-as-list bgp_unnumbered_as_list;
+       }
+   }
[edit policy]
+ policy-statement default-v6 {
+    from {
+        route-filter ::/0 exact;
+   }
+   then accept;
+}
+ policy-statement ceph-loopback {
+    from {
+        route-filter 2001:db8:100::/64 upto /128;
+   }
+   then accept;
+}

This will set up the BGP sessions via the interfaces xe-0/0/1 until xe-0/0/3 using IPv6 Autodiscovery.

The Ceph nodes should now be able to ping the other nodes:

PING 2001:db8:100::2(2001:db8:100::2) 56 data bytes
64 bytes from 2001:db8:100::2: icmp_seq=1 ttl=63 time=0.058 ms
64 bytes from 2001:db8:100::2: icmp_seq=2 ttl=63 time=0.063 ms
64 bytes from 2001:db8:100::2: icmp_seq=3 ttl=63 time=0.071 ms

--- 2001:db8:100::2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2037ms
rtt min/avg/max/mdev = 0.058/0.064/0.071/0.005 ms

Ceph configuration

From Ceph’s perspective there is not much to do. We just need to specify the IPv6 subnet Ceph is allowed to use and bind to.

[global]
	 mon_host = 2001:db8:100::1, 2001:db8:100::2, 2001:db8:100::3
	 ms_bind_ipv4 = false
	 ms_bind_ipv6 = true
	 public_network = 2001:db8:100::/64

This is all the configuration needed for Ceph 🙂

wdh@ceph01:~$ sudo ceph health
HEALTH_OK
wdh@infra-04-01-17:~$ sudo ceph mon dump
election_strategy: 1
0: [v2:[2001:db8:100::1]:3300/0,v1:[2001:db8:100::1]:6789/0] mon.ceph01
1: [v2:[2001:db8:100::2]:3300/0,v1:[2001:db8:100::2]:6789/0] mon.ceph02
2: [v2:[2001:db8:100::3]:3300/0,v1:[2001:db8:100::3]:6789/0] mon.ceph02
dumped monmap epoch 6
wdh@infra-04-01-17:~$

Creating a Management Routing Instance (VRF) on Juniper QFX5100

For a Ceph cluster I have two Juniper QFX5100 switches running as a Virtual Chassis.

This Virtual Chassis is currently only performing L2 forwarding, but I want to move this to a L3 setup where the QFX switches use Dynamic Routing (BGP) and thus become the gateway(s) for the Ceph servers.

This should work, but one of the things I was missing is a dedicated Management Port which uses a different routing table/instance.

Starting with JunOS 17.3R1 you can create a Management Routing Instance as described on the website of Juniper.

set system management-instance

This now creates the Routing Instance called mgmt_junos.

I try to run as much as possible IPv6-only or at least prefer IPv6 over IPv4.

I ran into the problem that configuring an IPv6 address on my em0 interface just wouldn’t work. It kept saying that the IPv6 address was Duplicate.

This is probably something which happens because both QFX switches are connected to the same Out of Band switch and causes it to receive it’s DAD over a different link. I had to disable DAD on interface em0 to make it work.

In addition I configured all DNS lookups to be performed using this routing instance.

The end result for my configuration (snippets):

system {
     management-instance;
     name-server {
         2a00:f10:ff04:153::53 routing-instance mgmt_junos;
         2a00:f10:ff04:253::53 routing-instance mgmt_junos;
         93.180.70.22 routing-instance mgmt_junos;
         93.180.70.30 routing-instance mgmt_junos;
     }
 }
 interfaces {
     unit 0 {
         family inet {
             address 172.17.5.10/24;
         }
         family inet6 {
             address 2a00:f10:XXX:XXX::100/64
             dad-disable;
         }
     }
 }
 routing-instances {
     mgmt_junos {
         routing-options {
             rib mgmt_junos.inet6.0 {
                 static {
                     route ::/0 next-hop 2a00:f10:XXX:XXX::1;
                 }
             }
             static {
                 route 0.0.0.0/0 next-hop 172.17.5.1;
             }
         }
     }
 }

This now allows me to SSH to my Juniper QFX Virtual Chassis over interface em0 which uses a different routing instance/table.

Should I make a mistake in the default routing instance, for example a BGP misconfiguration, I can still SSH to my switch(es).

Or if there is a routing error (BGP issue) I can also still reach the switches.

Comparing two Ceph CRUSH maps

Sometimes you want to test if changes you are about to make to a CRUSH map will cause data to move or not.

In this case I wanted to change a rule in CRUSH where it would use device classes, but I didn’t want any of the ~1PB of data in that cluster to move.

By swapping IDs I could prevent data to move:

root default {
     id -50      # do not change unnecessarily
     id -53 class hdd        # do not change unnecessarily
     id -122 class ssd       # do not change unnecessarily

root default {
     id -53      # do not change unnecessarily
     id -50 class hdd        # do not change unnecessarily
     id -122 class ssd       # do not change unnecessarily

Notice how I swapped the IDs. After this I updated the rule:

rule rgw {
         id 6
         type replicated
         min_size 1
         max_size 10
         step take ams02-objects class hdd
         step chooseleaf firstn 0 type host
         step emit
 }

I then compiled the CRUSHMap and ran crushtool to see if there were any differences:

root@mon01:~# crushtool -i crushmap --compare crushmap.new 
 rule 0 had 0/10240 mismatched mappings (0)
 rule 1 had 0/10240 mismatched mappings (0)
 rule 2 had 0/10240 mismatched mappings (0)
 rule 3 had 0/10240 mismatched mappings (0)
 rule 4 had 0/10240 mismatched mappings (0)
 rule 5 had 0/3072 mismatched mappings (0)
 rule 6 had 0/10240 mismatched mappings (0)
 maps appear equivalent
 root@mon01:~#

No changes! So it was safe to inject this map:

root@mon01:~# ceph osd setcrushmap -i crushmap.new