Linux bridging with Virtual Machines and pure L3 routing and BGP

Those who have followed me over the last few years know that I am a big fan of Layer 3 routing, BGP, VXLAN and EVPN. In the networks I design I try to eliminate the use of Layer 2 as much as possible.

Although I think VXLAN is great, it still creates a virtual Layer 2 domain in which hosts live. Multicast and broadcast traffic are still required for Neighbor Discovery in IPv6 and ARP in IPv4, which is not always ideal. EVPN is not simple either: it can be complex to set up and maintain. Even so, I would choose EVPN with VXLAN over any plain Layer 2 network any day.

Layer 3 routing


My goal was to see if I could remove Layer 2 entirely and use pure Layer 3 routing for my virtual machines. This requires routing single host IPv4 and IPv6 addresses directly to the virtual machines, without any shared Layer 2 domain.

I came across Redistribute Neighbor in Cumulus Linux, which uses a Python daemon called rdnbrd. This daemon listens for IPv4 ARP traffic from hosts and injects single host (/32) IPv4 routes into the routing table, from where they are redistributed into BGP.

Could this also work for virtual machines and with IPv6? Yes!

Over several months I spoke with various people at conferences, read a number of online articles and used these pieces of information to build a working prototype on my Proxmox server, which runs BGP.

/32 and /128 towards a VM

In the end it wasn’t that difficult. I started by creating a Linux bridge on my Proxmox node and configured two addresses on it: 169.254.0.1/32 for IPv4 and fe80::1/64 for IPv6. This is how it looks in the /etc/network/interfaces file:

auto vmbr1
iface vmbr1 inet static
    address 169.254.0.1/32
    address fe80::1/64
    bridge-ports none
    bridge-stp off
    bridge-fd 0
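
After adding this stanza, the bridge can be brought up and checked. On a stock Proxmox installation (which uses ifupdown2) something like this should do it:

ifreload -a
ip addr show dev vmbr1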

The webserver running this WordPress blog was reconfigured and attached to this bridge. Inside the virtual machine runs Ubuntu Linux with netplan, and this is what I ended up configuring in /etc/netplan/network.yaml:

network:
  ethernets:
    ens18:
      accept-ra: no
      nameservers:
        addresses:
          - 2620:fe::fe
          - 2620:fe::9
      addresses:
        - 2.57.57.30/32
        - 2001:678:3a4:100::80/128
      routes:
        - to: default
          via: fe80::1
        - to: default
          via: 169.254.0.1
          on-link: true
  version: 2
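
The configuration is applied with netplan itself; netplan try is the safer option, since it rolls back automatically unless you confirm the change:

# apply with automatic rollback unless confirmed within the timeout
sudo netplan try

# or apply immediately
sudo netplan apply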

Here you can see that I configured two addresses (2.57.57.30/32 and 2001:678:3a4:100::80/128) and manually configured the IPv4 and IPv6 gateways.

root@web01:~# fping 169.254.0.1
169.254.0.1 is alive
root@web01:~# fping6 fe80::1%ens18
fe80::1%ens18 is alive
root@web01:~#

The VM can reach both gateways, great! Below you can also see that these are set as the default gateways and that the addresses have been configured on interface ens18:

root@web01:~# ip addr show dev ens18 scope global
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:02:45:76:d2:35 brd ff:ff:ff:ff:ff:ff
    altname enp0s18
    inet 2.57.57.30/32 scope global ens18
       valid_lft forever preferred_lft forever
    inet6 2001:678:3a4:100::80/128 scope global 
       valid_lft forever preferred_lft forever
root@web01:~# 
root@web01:~# ip -6 route show
::1 dev lo proto kernel metric 256 pref medium
2001:678:3a4:100::80 dev ens18 proto kernel metric 256 pref medium
fe80::/64 dev ens18 proto kernel metric 256 pref medium
default via fe80::1 dev ens18 proto static metric 1024 pref medium
root@web01:~# ip -4 route show
default via 169.254.0.1 dev ens18 proto static onlink 
root@web01:~# 

Routing on the Proxmox node

On the Proxmox node I now needed to add these routes and create neighbor entries in the ARP (IPv4) and NDP (IPv6) tables based on the VM’s MAC address, which resulted in the following commands:

ip -6 route add 2001:678:3a4:100::80/128 dev vmbr1
ip -6 neigh add 2001:678:3a4:100::80 lladdr 52:02:45:76:d2:35 dev vmbr1 nud permanent
ip -4 route add 2.57.57.30/32 dev vmbr1
ip -4 neigh add 2.57.57.30 lladdr 52:02:45:76:d2:35 dev vmbr1 nud permanent

I executed these commands manually, but in a production environment you would need some form of automation that does this for you.
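
Until such automation is in place, one simple way to make these entries survive a reboot (a rough sketch, assuming they rarely change) is to append them as post-up commands to the vmbr1 stanza shown earlier:

    post-up ip -4 route add 2.57.57.30/32 dev vmbr1
    post-up ip -4 neigh add 2.57.57.30 lladdr 52:02:45:76:d2:35 dev vmbr1 nud permanent
    post-up ip -6 route add 2001:678:3a4:100::80/128 dev vmbr1
    post-up ip -6 neigh add 2001:678:3a4:100::80 lladdr 52:02:45:76:d2:35 dev vmbr1 nud permanent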

My Proxmox node runs the FRRouting BGP daemon, which picks up these routes and advertises them to the upstream router:

hv-138-a12-26# sh bgp neighbors 2001:678:3a4:1::50 advertised-routes 2001:678:3a4:100::80/128
BGP table version is 25, local router ID is 2.57.57.4, vrf id 0
Default local pref 100, local AS 212540
BGP routing table entry for 2001:678:3a4:100::80/128, version 22
Paths: (1 available, best #1, table default)
  Advertised to non peer-group peers:
  2001:678:3a4:1::50
  Local
    :: from :: (2.57.57.4)
      Origin incomplete, metric 1024, weight 32768, valid, sourced, best (First path received)
      Last update: Fri Nov 28 22:52:35 2025

Total number of prefixes 1
hv-138-a12-26# sh ip bgp neighbors 2001:678:3a4:1::50 advertised-routes 2.57.57.30/32
BGP table version is 11, local router ID is 2.57.57.4, vrf id 0
Default local pref 100, local AS 212540
BGP routing table entry for 2.57.57.30/32, version 9
Paths: (1 available, best #1, table default)
  Advertised to non peer-group peers:
  2001:678:3a4:1::50
  Local
    0.0.0.0 from 0.0.0.0 (2.57.57.4)
      Origin incomplete, metric 0, weight 32768, valid, sourced, best (First path received)
      Last update: Fri Nov 28 22:52:47 2025

Total number of prefixes 1
hv-138-a12-26#

This makes the upstream aware of these routes and establishes connectivity.
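
For reference, the FRR side only has to redistribute the kernel and connected routes into BGP. A minimal sketch of the relevant part of /etc/frr/frr.conf could look like this (the AS number and neighbor address are taken from the output above; the peer-group name is an assumption):

router bgp 212540
 neighbor upstream-v6 peer-group
 neighbor upstream-v6 remote-as external
 neighbor 2001:678:3a4:1::50 peer-group upstream-v6
 !
 address-family ipv6 unicast
  redistribute kernel
  redistribute connected
  neighbor upstream-v6 activate
 exit-address-family

The IPv4 address family is configured the same way, so that the /32 host routes are advertised as well.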

VM mobility

This example uses just a single Proxmox node, but it could easily work in a clustered environment. With automation you would need to make sure the routes and ARP/NDP entries ‘follow’ the VM as it migrates to a different host.

This could be achieved using hookscripts in Proxmox, for example, although this is something I haven’t researched in depth; a rough sketch is shown below.
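
As a starting point, a hookscript could (re)install the entries when a VM starts on a node and clean them up when it stops. This sketch is untested and purely illustrative; the addresses, MAC address and bridge name are hard-coded assumptions that would normally be derived from the VM configuration:

#!/bin/bash
# /var/lib/vz/snippets/l3-routes.sh
# Attach with: qm set <vmid> --hookscript local:snippets/l3-routes.sh
VMID="$1"
PHASE="$2"

# Example values for web01; in practice look these up based on $VMID
ADDR4="2.57.57.30/32"
ADDR6="2001:678:3a4:100::80/128"
MAC="52:02:45:76:d2:35"
BRIDGE="vmbr1"

case "$PHASE" in
  post-start)
    ip -4 route replace "$ADDR4" dev "$BRIDGE"
    ip -4 neigh replace "${ADDR4%%/*}" lladdr "$MAC" dev "$BRIDGE" nud permanent
    ip -6 route replace "$ADDR6" dev "$BRIDGE"
    ip -6 neigh replace "${ADDR6%%/*}" lladdr "$MAC" dev "$BRIDGE" nud permanent
    ;;
  post-stop)
    ip -4 route del "$ADDR4" dev "$BRIDGE" 2>/dev/null || true
    ip -6 route del "$ADDR6" dev "$BRIDGE" 2>/dev/null || true
    ;;
esac

exit 0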

This blogpost primarily shows that this is technically possible; how you implement it in your environment, should you want to, is up to you.

Routing IPv6 through WireGuard with MikroTik and Debian

Recently, I became a RIPE member, which resulted in an IPv6 /29 subnet being allocated to me. One of my main goals was to route a /48 from this /29 to my home, allowing me to use my own IPv6 addresses on my local network.

However, my ISP (Ziggo Zakelijk) does not allow me to announce my own IPv6 address space. Instead, they have statically assigned me a /48 subnet (2001:41f0:6f67::/48) as part of my business account.

Regardless, I wanted the flexibility to switch ISPs in the future without the hassle of renumbering my home network. And, of course, it’s also a fun technical challenge to get this working! Geeky stuff!

WireGuard from AS212540

In a datacenter in Amsterdam, I have a Dell R430 server running Proxmox, where I also run the FRRouting (FRR) daemon to announce AS212540.

ipv6 route 2a14:9b80::/32 Null0
ip router-id 2.57.57.4
!
router bgp 212540

address-family ipv6 unicast
redistribute kernel
redistribute connected
redistribute static
neighbor upstream-v6 activate
neighbor upstream-v6 soft-reconfiguration inbound
neighbor upstream-v6 route-map upstream-in in
neighbor upstream-v6 route-map upstream-out out

My idea was to use WireGuard from this server to route a /48 to my MikroTik CCR1036-8G-2S+ running at home, so that I could use part of my own IPv6 space there. In short, my goals were:

  • Route my own IPv6 /48 to my house
  • Use WireGuard from my Proxmox server in Amsterdam
  • Use IPv6 as the underlay under WireGuard
  • Use as little configuration as possible

It turned out that WireGuard was super easy to get up and running. In the end, this was all the configuration I needed in /etc/wireguard/wg0.conf:

[Interface]
Address = 2a14:9b80:0:1::1/64
ListenPort = 51820
PrivateKey = THISISMYPRIVATEKEY

[Peer]
PublicKey = PUBLICKEYOFMIKROTIK
AllowedIPs = 2a14:9b80:101::/48
Endpoint = [2001:41f0:6f67::2]:51820
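
With the configuration in place, the tunnel can be brought up (and enabled at boot) with wg-quick, assuming the file is /etc/wireguard/wg0.conf as above:

systemctl enable --now wg-quick@wg0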

There are plenty of tutorials online on how to set up WireGuard and generate the necessary keys. Instead, I want to show how I built an IPv6-only environment. No NAT, just pure routing!

If you look closely you can spot a few things:

  • 2001:41f0:6f67::2 is the WAN IP of my MikroTik router at home
  • 2a14:9b80:101::/48 is the subnet I’m routing towards my house via WireGuard

The BGP announcement on my Proxmox server (the static Null0 route for 2a14:9b80::/32, redistributed into BGP) already attracts traffic for the whole /32 to the server. Setting up the WireGuard tunnel was then all that was needed to get the more specific /48 route into the local routing table:

root@proxmox:~# ip -6 route show
blackhole 2001:678:3a4:100::/56 dev lo proto static metric 20 pref medium
2a14:9b80::/64 dev vmbr2 proto kernel metric 256 pref medium
2a14:9b80:0:1::/64 dev wg0 proto kernel metric 256 pref medium
2a14:9b80:101::/48 dev wg0 metric 1024 pref medium

blackhole 2a14:9b80::/32 dev lo proto static metric 20 pref medium
default via 2001:678:3a4:1::3 dev vmbr0 proto kernel metric 1024 onlink pref medium
root@proxmox:~#

For completeness, here is the output of the wg command:

interface: wg0
public key: IhYOkpqE0cIBclaR7zGLml/7BriPIMoMdjmM5dbkkGs=
private key: (hidden)
listening port: 51820

peer: ZZkW3L0OES1bqQKdDpe3GQ88G4I3ABZVasuEVyvS5iM=
endpoint: [2001:41f0:6f67::2]:51820
allowed ips: 2a14:9b80:101::/48
latest handshake: 1 second ago
transfer: 642.48 KiB received, 1.17 MiB sent

Source-based routing

Since my home network already has native IPv6 through my ISP, I needed to make sure that only specific traffic was routed outbound via the WireGuard tunnel.

Typically, routing is based on the destination address, but in cases like this, it’s necessary to route based on the source address and sometimes the destination address as well.

In MikroTik, this is known as Policy Routing, which allows for this level of control. It took me a few hours to figure out, and I also discovered that I needed RouterOS 7.18, as earlier versions did not support this setup properly.

This is the configuration I eventually ended up with; I will only show the relevant parts:

/interface wireguard
add listen-port=51820 mtu=1420 name=wg-hrl23
/interface wireguard peers
add allowed-address=::/0 endpoint-address=2001:678:3a4:1::100 endpoint-port=51820 interface=wg-hrl23 name=peer1 persistent-keepalive=30s public-key="IhYOkpqE0cIBclaR7zGLml/7BriPIMoMdjmM5dbkkGs="

/routing table
add disabled=no fib name=wireguard

/ipv6 route
add comment="Ziggo Zakelijk" disabled=no dst-address=::/0 gateway=2001:41f0:6f67::1
add blackhole dst-address=2a14:9b80:101::/48
add disabled=no gateway=wg-hrl23 routing-table=wireguard

/ipv6 address
add address=2001:41f0:6f67::2 comment="Ziggo Zakelijk" interface=ether1
add address=2001:41f0:6f67:1::1 interface=bridgeLocal
add address=2a14:9b80:101::1 interface=wg-hrl23
add address=2a14:9b80:101:1::1 interface=ether4

/ipv6 firewall filter
add action=accept chain=input comment="Wireguard with proxmox01" dst-port=51820 in-interface-list=WAN protocol=udp src-address=2001:678:3a4:1::100/128

/routing rule
add action=lookup dst-address=2001:41f0:6f67::/48 src-address=2a14:9b80:101::/48 table=main
add action=lookup-only-in-table disabled=no src-address=2a14:9b80:101::/48 table=wireguard

The last two lines are the most important!

  • Traffic between my regular LAN (2001:41f0:6f67::/48) and the subnet routed to my home via WireGuard is looked up in the main table, so it remains ordinary locally routed traffic
  • All other traffic sourced from 2a14:9b80:101::/48 is diverted to the routing table “wireguard”, which routes it outbound via the WireGuard tunnel
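
A quick way to check the result is to compare the paths taken for traffic sourced from each prefix, for example with traceroute from Linux hosts behind the router towards the Proxmox endpoint (the source addresses below are hypothetical examples):

# sourced from the prefix routed via WireGuard: should go through the tunnel
traceroute -6 -s 2a14:9b80:101:1::10 2001:678:3a4:1::100

# sourced from the ISP-assigned prefix: should go out via Ziggo
traceroute -6 -s 2001:41f0:6f67:1::10 2001:678:3a4:1::100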

Diagram

To clarify a couple of things, I have created a diagram that hopefully shows how it all comes together.

I hope this explains things and inspires you to do the same!

Using L3 (BGP) routing for your Ceph storage

Many Ceph storage environments out there are deployed using an L2 underlay.

This means that the Ceph servers (MON, OSD, etc.) are connected to a pair of switches using LACP/bonding. Their ‘bond0’ device (for example) is assigned an IPv4/IPv6 address, which is used for connectivity between the Ceph nodes and the Ceph clients.

Although this works fine, I try to avoid L2 as much as possible in datacenter deployments. L2 scales up to a certain point, but it has its limitations. Modern Top-of-Rack (ToR) switches can easily route traffic at wire speed, something that used to be a limitation of switches in the past. When designing environments I prefer an L3 approach.

This blogpost is meant to show you the rough concept. It is NOT a copy-and-paste tutorial; you will need to adapt it to your situation.

Network setup and BGP configuration

Using Juniper QFX5100 switches and FRRouting on the Ceph nodes, I’ve established BGP sessions between the ToR switches and the Ceph nodes according to the diagram below.

Each node has two independent BGP sessions with the Top-of-Rack in its rack. Via these BGP sessions it advertises its local IPv6 /128 loopback address, and via the same sessions it receives a default ::/0 IPv6 route.

ceph01# sh bgp summary 

IPv6 Unicast Summary (VRF default):
BGP router identifier 1.2.3.4, local AS number 65101 vrf-id 0
BGP table version 10875
RIB entries 511, using 96 KiB of memory
Peers 2, using 1448 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor        V    AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
enp196s0f0np0   4 65002    487385    353917        0    0    0 3d18h17m            1        1 N/A
enp196s0f1np1   4 65002    558998    411452        0    0    0 01:38:55            1        1 N/A

Total number of neighbors 2
ceph01#

Here we see two BGP sessions active, one over each NIC of the Ceph node. Below we can also see that a default IPv6 route is received via BGP, with both neighbors installed as next-hops:

ceph01# sh ipv6 route ::/0
Routing entry for ::/0
  Known via "bgp", distance 20, metric 0
  Last update 01:42:00 ago
    fe80::e29:efff:fed7:4719, via enp196s0f0np0, weight 1
    fe80::7686:e2ff:fe7c:a19e, via enp196s0f1np1, weight 1

ceph01# 

The FRRouting configuration (/etc/frr/frr.conf) is fairly simple:

frr defaults traditional
hostname ceph01
log syslog informational
no ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
interface enp196s0f0np0
 no ipv6 nd suppress-ra
exit
!
interface enp196s0f1np1
 no ipv6 nd suppress-ra
exit
!
interface lo
 ipv6 address 2001:db8:100::1/128
exit
!
router bgp 65101
 bgp router-id 1.2.3.4
 no bgp ebgp-requires-policy
 no bgp default ipv4-unicast
 no bgp network import-check
 neighbor upstream peer-group
 neighbor upstream remote-as external
 neighbor enp196s0f0np0 interface peer-group upstream
 neighbor enp196s0f1np1 interface peer-group upstream
 !
 address-family ipv6 unicast
  redistribute connected
  neighbor upstream activate
 exit-address-family
exit
!
end
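
To confirm that a node really advertises its loopback /128 over both sessions, the advertised routes can be checked per neighbor from the shell (interface names as in the configuration above):

vtysh -c "show bgp ipv6 unicast neighbors enp196s0f0np0 advertised-routes"
vtysh -c "show bgp ipv6 unicast neighbors enp196s0f1np1 advertised-routes"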

On the Juniper switches, a BGP Unnumbered (RFC 5549) configuration was defined as well. This blogpost explains very well how BGP Unnumbered works on JunOS, so I am not going to repeat it here; I will just highlight a couple of pieces of the configuration.

root@tor01# show interfaces xe-0/0/1
description ceph01;
unit 0 {
    mtu 9216;
    family inet6;
}

root@tor01# show protocols router-advertisement 
interface xe-0/0/1.0;
root@tor01# show | compare 
[edit]
+  policy-options {
+      as-list bgp_unnumbered_as_list members 65101-65199;
+  }
[edit protocols]
+   bgp {
+       group ceph {
+           family inet6 {
+               unicast;
+           }
+           multipath;
+           export default-v6;
+           import ceph-loopback;
+           dynamic-neighbor bgp_unnumbered {
+               peer-auto-discovery {
+                   family inet6 {
+                       ipv6-nd;
+                   }
+                   interface xe-0/0/1.0;
+                   interface xe-0/0/2.0;
+                   interface xe-0/0/3.0;
+               }
+           }
+           peer-as-list bgp_unnumbered_as_list;
+       }
+   }
[edit policy-options]
+ policy-statement default-v6 {
+     from {
+         route-filter ::/0 exact;
+     }
+     then accept;
+ }
+ policy-statement ceph-loopback {
+     from {
+         route-filter 2001:db8:100::/64 upto /128;
+     }
+     then accept;
+ }

This sets up the BGP sessions on the interfaces xe-0/0/1 through xe-0/0/3 using IPv6 neighbor discovery for peer auto-discovery.

The Ceph nodes should now be able to ping each other’s loopback addresses, for example 2001:db8:100::2 from ceph01:

PING 2001:db8:100::2(2001:db8:100::2) 56 data bytes
64 bytes from 2001:db8:100::2: icmp_seq=1 ttl=63 time=0.058 ms
64 bytes from 2001:db8:100::2: icmp_seq=2 ttl=63 time=0.063 ms
64 bytes from 2001:db8:100::2: icmp_seq=3 ttl=63 time=0.071 ms

--- 2001:db8:100::2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2037ms
rtt min/avg/max/mdev = 0.058/0.064/0.071/0.005 ms

Ceph configuration

From Ceph’s perspective there is not much to do. We just need to specify the IPv6 subnet Ceph is allowed to use and bind to.

[global]
	 mon_host = 2001:db8:100::1, 2001:db8:100::2, 2001:db8:100::3
	 ms_bind_ipv4 = false
	 ms_bind_ipv6 = true
	 public_network = 2001:db8:100::/64

This is all the configuration needed for Ceph 🙂

wdh@ceph01:~$ sudo ceph health
HEALTH_OK
wdh@infra-04-01-17:~$ sudo ceph mon dump
election_strategy: 1
0: [v2:[2001:db8:100::1]:3300/0,v1:[2001:db8:100::1]:6789/0] mon.ceph01
1: [v2:[2001:db8:100::2]:3300/0,v1:[2001:db8:100::2]:6789/0] mon.ceph02
2: [v2:[2001:db8:100::3]:3300/0,v1:[2001:db8:100::3]:6789/0] mon.ceph03
dumped monmap epoch 6
wdh@infra-04-01-17:~$