EVPN Multihoming Split-Horizon Filters Not Functional

By: Anupam Murikurthy

Email: anupammurikurthy@gmail.com

READ THE FULL WRITE UP: VYOS EVPN MH BUG v2 - Google Docs

Goal:

This is a long write-up for a bug, thanks for reading! My goal is to deploy a standards-compliant, active-active EVPN ESI LAG multihoming design using two VyOS/FRR leafs and multiple downstream L2 access switches connected via LACP. The EVPN control plane should allow me to:

  1. Dual home multiple access switches to a pair of VyOS leafs for redundancy and active-active load balancing

  2. Stretch L2 VLANs/VNIs across the fabric without loops

  3. Ensure that the Designated Forwarder (DF) only forwards necessary BUM traffic toward the Ethernet Segment

  4. Ensure that the non-DF drops all overlay originated BUM to prevent loops and MAC flapping

  5. Maintain stable CAM tables on all downstream switches

  6. Provide fast failover and consistent forwarding for hosts behind those downstream switches

Problem:

Split-horizon filtering is not working in FRR’s EVPN MH implementation.

Fabric Design: OSPF Underlay + eBGP EVPN + VXLAN
Tested On:

  • VyOS 1.5 Rolling + FRR 10.2.4 (vyos-2025.11.04-0019-rolling-generic-amd64)

  • Debian 13.1.0 + FRR 10.5 (manual configuration with systemd-networkd + FRR)

I’ve been testing EVPN ESI multihoming between two VyOS leaf switches and multiple downstream L2 switches connected via LACP. The EVPN control plane behavior appears correct (Type-1 AD, Type-2 MAC/IP, Type-3 IMET, and Type-4 ES routes all install properly); however, whenever switching infrastructure is connected downstream, ARP loops back into the ES in the data plane, violating EVPN MH expectations.

BUM traffic received on the non-DF leaf is being VXLAN encapsulated and sent to the DF, and then the DF leaf is flooding that BUM traffic back into the ES.

This should be impossible: EVPN MH split-horizon rules (the local-bias procedure of RFC 8365, as implemented across multiple vendors) require that BUM received from a peer VTEP in the same Ethernet Segment never be flooded back into that ES.

The result is:

  • MAC flapping on downstream switches

  • Blackholing of traffic / Packet loss

  • Downstream STP port blocking

  • Complete DoS conditions in networks with many MACs

Behavior Observed:

1. BUM Loop Behavior

  1. A host connected to a downstream access switch generates BUM (ARP)

  2. BUM arrives on LEAF2 (non-DF)

  3. LEAF2 VXLAN encapsulates and forwards BUM to LEAF1 (DF)

  4. LEAF1 (DF) decapsulates and floods the BUM back into the ES

  5. Downstream switches now see the same host MAC arriving from the MH LAG and the host’s switch access port.

  6. Switch CAM tables flap rapidly resulting in packet loss and instability

2. STP BPDUs Loop Behavior

  1. A dual homed switch connected to VyOS leafs generates STP BPDUs

  2. BPDUs arrive on LEAF2 (non-DF)

  3. LEAF2 VXLAN encapsulates and forwards BPDUs to LEAF1 (DF)

  4. LEAF1 (DF) decapsulates and floods the BPDUs back into the ES

  5. The dual homed switch now sees STP it originated on the LAG

  6. Switch blocks the LAG port or causes STP topology flapping

3. The Issue Scales With MAC Count and More Switches

  • In my small virtual lab with only a few MACs, Arista vEOS counted ~300 MAC moves/flaps within two minutes between the LAG and host access port

  • In my home network with many MACs and switches, this behavior caused a complete network wide denial of service, bringing the entire VLAN to a crawl

Request to Developers / How to Reproduce:

Please revisit the data plane implementation of split-horizon filtering for EVPN MH, particularly given that it is being advertised as a fully working capability.

The EVPN MH split-horizon issue is fully reproducible across multiple environments, operating systems, and FRR versions. Below is the setup required to reproduce the behavior, along with the exact VyOS 1.5 configs to trigger the MAC flapping and BUM looping.

I am more than willing to join a conference call to discuss this in depth with any interested developers.

I’ve put some thought into this after looking at FRR’s implementation.

I’m not sure whether FRR’s implementation is incomplete, or whether the developers never intended FRR itself to implement those filters. It would make sense for FRR not to block that traffic itself, since there’s too much variability in the systems FRR can be installed on. Instead, they implemented a dplane API that reports when the DF state has changed, added in this commit: evpn-mh: support for DF election by AnuradhaKaruppiah · Pull Request #7158 · FRRouting/frr · GitHub.

An implementation using that API wouldn’t be terribly difficult. The solution would look something like this:

  1. A simple lua script would be made and run when a br_port event is triggered. This seems to be triggered by the update of a Type-4 route, but I’m not 100% sure of that. The script just needs to pass the interface name (like bond0) and the DF state, either 0 (df) or 1 (non-df), to a python script. Something like this:
-- /etc/frr/scripts/evpn_mh_call_py.lua
local bit_ok, bit = pcall(require, "bit")

local function is_non_df(flags)
	return (bit_ok and bit.band(flags or 0, 1) ~= 0) or (((flags or 0) % 2) == 1)
end

local function shq(s)
	s = tostring(s or "")
	return "'" .. s:gsub("'", "'\\''") .. "'"
end

function on_rib_process_dplane_results(ctx)
	if ctx and ctx.br_port then
		local ifname = ctx.zd_ifname or ""
		local non_df = is_non_df(ctx.br_port.flags or 0) and "1" or "0"

		local py  = "/home/vyos/test.py"
		local cmd = string.format("/usr/bin/env python3 %s %s %s >/dev/null 2>&1 &",
			shq(py), shq(ifname), shq(non_df))
		os.execute(cmd)
	end
	return {}
end
  2. That python script would check the DF state and get the list of VTEPs from show evpn es detail json (a rough sketch of that lookup appears after the filter examples below). If the device is the DF (and wasn’t before), it would install nftables filters like this:
Mark the packet on ingress. These are remote VTEPs in the ES:
table netdev evpn_sph {
        chain evpn_sph_ingress {
                type filter hook ingress device "eth1" priority filter; policy accept;
                ip saddr 10.1.2.2 udp dport 4789 meta mark set 0x00000064 counter packets 6 bytes 1812
                ip saddr 10.1.2.3 udp dport 4789 meta mark set 0x00000064 counter packets 4 bytes 684
        }
}
Match the mark on egress toward the ES:
table bridge evpn_sph {
        chain evpn_sph_forward {
                type filter hook forward priority filter; policy accept;
                meta mark 0x00000064 meta pkttype multicast counter packets 66 bytes 11916 drop
                meta mark 0x00000064 meta pkttype broadcast counter packets 0 bytes 0 drop
        }
}

I’ve tested these filters in my lab; they correctly implement split-horizon and prevent frames from being flooded back into the ES.
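
For illustration, the discovery half of that python script could look roughly like the following. This is a minimal sketch, not the actual test.py, and the JSON key names it assumes (esi, dfStatus, vteps) may differ between FRR versions.

#!/usr/bin/env python3
# Rough sketch of the DF/VTEP discovery step (not the real test.py).
# Assumption: "show evpn es detail json" returns the local ES entries;
# the key names "esi", "dfStatus" and "vteps" may differ per FRR version.
import json
import subprocess

def get_es_state():
    """Return {esi: {"is_df": bool, "vteps": [ip, ...]}} for the local ESes."""
    out = subprocess.run(
        ["vtysh", "-c", "show evpn es detail json"],
        capture_output=True, text=True, check=True,
    ).stdout
    data = json.loads(out or "[]")
    # The output may be a list of ES objects or a dict keyed by ESI.
    es_list = data if isinstance(data, list) else list(data.values())
    state = {}
    for es in es_list:
        esi = es.get("esi", "")
        is_df = str(es.get("dfStatus", "")).lower() == "df"
        # VTEP entries may be plain strings or objects; stay defensive.
        vteps = [v.get("vtep") if isinstance(v, dict) else v
                 for v in es.get("vteps", [])]
        state[esi] = {"is_df": is_df, "vteps": [v for v in vteps if v]}
    return state

if __name__ == "__main__":
    print(json.dumps(get_es_state(), indent=2))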

2 Likes

Update: I hammered out a python script to test this. Here’s a quick demo:

When a device is the DF, it’ll have this config:

PE1 (DF)
vyos@PE1# run show evpn es 03:aa:bb:cc:dd:ee:f0:00:00:64 | match "DF status"
 DF status: df 

vyos@PE1# sudo nft list table netdev evpn_sph
table netdev evpn_sph {
        set vteps {
                type ipv4_addr
                flags interval
                elements = { 10.1.2.2 }
        }

        chain evpn_sph_ingress {
                type filter hook ingress devices = { eth1, eth3 } priority filter; policy accept;
                ip saddr @vteps udp dport 4789 meta mark set 0x00000064 counter packets 37696 bytes 6604847
        }
}

vyos@PE1# sudo nft list table bridge evpn_sph
table bridge evpn_sph {
        chain evpn_sph_forward {
                type filter hook forward priority 0; policy accept;
                meta mark 0x00000064 meta pkttype multicast counter packets 75394 bytes 9441266 drop
                meta mark 0x00000064 meta pkttype broadcast counter packets 0 bytes 0 drop
        }
}

vyos@PE1# sudo bridge -d -j link show dev bond0 | jq '.[0] | {flood, mcast_flood, bcast_flood}'
{
  "flood": true,
  "mcast_flood": true,
  "bcast_flood": true
}

And here is the non-DF:

PE2 (Non-DF):
vyos@PE2# run show evpn es 03:aa:bb:cc:dd:ee:f0:00:00:64 | match 'DF status'
 DF status: non-df 

vyos@PE2# sudo nft list table netdev evpn_sph
Error: No such file or directory
list table netdev evpn_sph
                  ^^^^^^^^

vyos@PE2# sudo nft list table bridge evpn_sph
Error: No such file or directory
list table bridge evpn_sph
                  ^^^^^^^^

vyos@PE2# sudo bridge -d -j link show dev bond0 | jq '.[0] | {flood, mcast_flood, bcast_flood}'
{
  "flood": false,
  "mcast_flood": false,
  "bcast_flood": false
}

If I make a change to the DF preference on PE2, the following will happen:

  1. FRR will send a RIB event update, triggering the lua script.
  2. The lua script will call the python script, passing the name of the bond interface and the DF status as either 0 (df) or 1 (non-df).
  3. The python script will reverse the states of the SPH filters on each PE (a rough sketch of that toggle logic follows below).
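
To make step 3 concrete, here is a minimal sketch of what that toggle could look like (not the actual test.py): the underlay interfaces eth1/eth3, the mark value, and the helper set_sph() are placeholders, and the table/chain names simply mirror the rules shown earlier.

#!/usr/bin/env python3
# Sketch of the SPH toggle: on the DF, install the mark/drop rules and keep
# bridge flooding on; on the non-DF, remove the rules and turn flooding off.
# eth1/eth3 (underlay) and the mark value are illustrative placeholders.
import subprocess

def nft_rules(vteps):
    # Mirrors the table/chain layout shown earlier in the thread.
    saddr = ", ".join(vteps)
    return (
        "table netdev evpn_sph {\n"
        "  chain evpn_sph_ingress {\n"
        "    type filter hook ingress devices = { eth1, eth3 } priority filter; policy accept;\n"
        f"    ip saddr {{ {saddr} }} udp dport 4789 meta mark set 0x00000064\n"
        "  }\n"
        "}\n"
        "table bridge evpn_sph {\n"
        "  chain evpn_sph_forward {\n"
        "    type filter hook forward priority 0; policy accept;\n"
        "    meta mark 0x00000064 meta pkttype { broadcast, multicast } drop\n"
        "  }\n"
        "}\n"
    )

def sh(cmd):
    subprocess.run(cmd, shell=True, check=False)

def set_sph(bond, is_df, vteps):
    # Start from a clean slate; ignore errors if the tables don't exist yet.
    sh("nft delete table netdev evpn_sph 2>/dev/null")
    sh("nft delete table bridge evpn_sph 2>/dev/null")
    if is_df:
        # DF: keep flooding toward the ES, but drop overlay-sourced BUM
        # arriving from the ES peer VTEPs.
        if vteps:
            subprocess.run(["nft", "-f", "-"], input=nft_rules(vteps),
                           text=True, check=True)
        sh(f"bridge link set dev {bond} flood on mcast_flood on bcast_flood on")
    else:
        # Non-DF: simply stop flooding BUM out of the ES bond.
        sh(f"bridge link set dev {bond} flood off mcast_flood off bcast_flood off")

In the demo below, PE2 taking over as DF would roughly correspond to set_sph("bond0", True, ["10.1.2.1"]) on PE2 and set_sph("bond0", False, []) on PE1.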

Change DF on PE2:

vyos@PE2# set interfaces bonding bond0 evpn es-df-pref 1200
PE1 (Non-DF):
vyos@PE1# run show evpn es 03:aa:bb:cc:dd:ee:f0:00:00:64 | match "DF status"
 DF status: non-df 

vyos@PE1# sudo nft list table netdev evpn_sph
Error: No such file or directory
list table netdev evpn_sph
                  ^^^^^^^^

vyos@PE1# sudo nft list table bridge evpn_sph
Error: No such file or directory
list table bridge evpn_sph
                  ^^^^^^^^

vyos@PE1# sudo bridge -d -j link show dev bond0 | jq '.[0] | {flood, mcast_flood, bcast_flood}'
{
  "flood": false,
  "mcast_flood": false,
  "bcast_flood": false
}

vyos@PE1# run show log | match frr-e
Nov 25 07:16:15 frr-evpn-mh[6270]: SPH filters for bond0 have been set as non-df
PE2 (DF):
vyos@PE2# run show evpn es 03:aa:bb:cc:dd:ee:f0:00:00:64 | match "DF status"
 DF status: df 

vyos@PE2# sudo nft list table netdev evpn_sph
table netdev evpn_sph {
        set vteps {
                type ipv4_addr
                flags interval
                elements = { 10.1.2.1 }
        }

        chain evpn_sph_ingress {
                type filter hook ingress devices = { eth1, eth3 } priority filter; policy accept;
                ip saddr @vteps udp dport 4789 meta mark set 0x00000064 counter packets 2 bytes 604
        }
}

vyos@PE2# sudo nft list table bridge evpn_sph
table bridge evpn_sph {
        chain evpn_sph_forward {
                type filter hook forward priority 0; policy accept;
                meta mark 0x00000064 meta pkttype multicast counter packets 4 bytes 1008 drop
                meta mark 0x00000064 meta pkttype broadcast counter packets 0 bytes 0 drop
        }
}

vyos@PE2# sudo bridge -d -j link show dev bond0 | jq '.[0] | {flood, mcast_flood, bcast_flood}'
{
  "flood": true,
  "mcast_flood": true,
  "bcast_flood": true
}

vyos@PE2# run show log | match frr-e
Nov 25 07:16:15 frr-evpn-mh[7558]: SPH filters for bond0 have been set as df

This would need to be tested to ensure the trigger from FRR was a consistent way to configure the filters. It all works in my simple lab.

4 Likes

Thank you for your work here, the logic seems correct and looks awesome! Could you please share the lab lua & python scripts so I can test it further?

Also, how are you hooking the lua script into FRR in the VyOS config? I can insert the following into the frr.conf file, but it’s not persistent through reboots and config changes: zebra on-rib-process script evpn_mh_call_py

@amurikurthy

I placed this on GitHub; just follow the directions there. Let me know if you have any questions: GitHub - l0crian1/vyos-evpn-sph: Split-Horizon filtering solution for VyOS

Things that would be changed if this were adopted as a proper solution:

  1. The frr user calls the python script from the lua script, which means it can’t execute the necessary nft and bridge commands without being added to the sudoers file. This should be changed so that the lua script just updates a file in memory with the interface and DF state. That lets the python script run with root privileges while keeping the frr user’s privileges limited.

    The python script would be daemonized and would watch that state file. If it sees a difference between the configured state and the system state, it would execute the nft and bridge commands as it does in test.py for this proof of concept (see the daemon sketch after this list). This removes the need for a sudoers entry granting passwordless elevation.

    The daemon can additionally run a paranoid refresh every 30 seconds (or whatever interval makes sense), so that if the FRR trigger ever fails, the system can still recover.

  2. The nft config would be moved into the jinja2 render pipeline, which simplifies the script and allows for the nft config updates to be loaded atomically.

  3. The underlay interfaces would become config objects, perhaps under the evpn section for the bond, so they wouldn’t need to be declared manually inside the python script.
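
As a rough illustration of point 1 above: the lua hook would only rewrite a small state file, and a root-owned daemon would watch and reconcile it. This is a sketch only; the state-file path, the 30-second interval, and apply_sph_state() are hypothetical, not the actual repo layout.

#!/usr/bin/env python3
# Hypothetical daemon half of point 1: the lua hook only rewrites STATE_FILE
# with "<ifname> <0|1>" lines (0 = df, 1 = non-df), and this root-owned loop
# reconciles the system to that state, with a periodic "paranoid" refresh in
# case an FRR trigger is ever missed. Paths, the 30 s interval and
# apply_sph_state() are placeholders, not the actual repo layout.
import time
from pathlib import Path

STATE_FILE = Path("/run/frr-evpn-mh/df_state")   # assumed tmpfs location
REFRESH_SECONDS = 30

def read_desired_state():
    """Parse 'bond0 0' / 'bond0 1' lines into {ifname: is_df}."""
    state = {}
    if STATE_FILE.exists():
        for line in STATE_FILE.read_text().splitlines():
            parts = line.split()
            if len(parts) == 2:
                state[parts[0]] = (parts[1] == "0")
    return state

def apply_sph_state(desired):
    """Placeholder: compare with the running nft/bridge state and fix drift."""
    pass

def main():
    last_mtime = 0.0
    last_refresh = 0.0
    while True:
        mtime = STATE_FILE.stat().st_mtime if STATE_FILE.exists() else 0.0
        now = time.monotonic()
        if mtime != last_mtime or now - last_refresh >= REFRESH_SECONDS:
            apply_sph_state(read_desired_state())
            last_mtime, last_refresh = mtime, now
        time.sleep(1)

if __name__ == "__main__":
    main()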

1 Like

@amurikurthy

I updated the project on GitHub with the changes stated in my last post. The only thing I didn’t do was add new CLI for the underlay interfaces; those still need to be hardcoded in the test.py script. That would be added if the solution were adopted, but this should be a stable™ solution for testing now.

@L0crian

I was unable to get your latest commit (3403ea2) from 7 hours ago to work.

test.py is failing with the error:

ImportError: cannot import name 'wait_for' from 'vyos.utils.misc'

I reverted to the previous commit (e4aa082), and while the vyos-evpn-sph service did run, I did not see any nftables entries created or any logs being generated.

Yesterday I was able to successfully test your initial commit (cf1c8a8) and I began running chaos tests. I noticed that FRR only seems to trigger the lua script when something changes on the bond interfaces. Edge scenarios where PEs are abruptly rebooted or PEs come online at the same time do not appear to trigger the expected behavior.

I was able to reboot PEs and unplug links in certain sequences that caused nftables to either not populate on either PE or to populate on both.

I believe a mechanism is needed for the script to also run periodically, or whenever the EVPN table is updated or a VTEP joins/leaves the ESI.

You’ll need to be on latest rolling for that. You can remove the import and call to wait_for. It’s not strictly necessary.

Test this latest update; it was consistent in my testing. I also made it so you can have multiple ESes with the DF on different PEs. There were some permissions issues with frr trying to access the directory when it was created by vyos instead of the lua script.

@L0crian

I was able to get your latest commit to run. I initially found an error.

Line 229 in test.py is:
rc, _ = rc_cmd('sudo nft -c --file /run/nftables_nat.conf')

It should be:

rc, _ = rc_cmd(f'sudo nft -c --file {nftables_conf}')

I found a scenario where the switch hashes BUM traffic to the non-DF, causing it to be blackholed. Bridge ports on the non-DF need to forward BUM when the PEs are participating in multiple ESes simultaneously.

The workaround I found is to make each PE the DF for only one ES; the downside, of course, is that a PE can then never serve as the DF for more than one ES.

The script is very much a proof of concept; there’s tons of cleanup that would need to happen. Luckily that line was just a check, so it won’t break anything.

In this solution (as of right now), the downstream switches would need to handle the inter-rack traffic (using some kind of MEC) rather than the PEs, since the switches would mainly serve as port expansion.

I think if a topology like yours is desired, the solution for the non-DF would need to change. I disabled the flooding because doing that has zero performance impact, compared to enabling a firewall; though nft is very efficient and traffic only traverses a couple of rules, so it’s not that big of a deal either way.

The non-df would likely also get a firewall config, except it would match all underlay traffic instead of specific VTEPs, so BUM from other racks would be denied at the non-df, but local traffic could still be forwarded between those switches. So rather than disabling all flooding for the bond, the solution would be something like this:

DF:

table netdev evpn_sph {
        set vteps {
                type ipv4_addr
                flags interval
                elements = { 10.1.2.2 }
        }

        chain evpn_sph_ingress {
                type filter hook ingress devices = { eth1, eth3 } priority filter; policy accept;
                ip saddr @vteps udp dport 4789 meta mark set 0x04fc867d counter
        }
}
table bridge evpn_sph {
        chain evpn_sph_forward {
                type filter hook forward priority 0; policy accept;
                oifname "bond0" meta mark 0x04fc867d meta pkttype { broadcast, multicast }  drop
        }
}

Non-DF:

table netdev evpn_sph {
        chain evpn_sph_ingress {
                type filter hook ingress devices = { eth1, eth3 } priority filter; policy accept;
                udp dport 4789 meta mark set 0x04fc867d counter
        }
}
table bridge evpn_sph {
        chain evpn_sph_forward {
                type filter hook forward priority 0; policy accept;
                oifname "bond0" meta mark 0x04fc867d meta pkttype { broadcast, multicast } counter drop
        }
}

There actually is a benefit to doing it like this. If changes were made to the bridge, the bridge conf_mode script would have needed to track whether the interface was a non-DF and then reapply the disabling of flooding. By using only nft, the daemon handles it entirely, and the logic is also simplified a lot: DF and non-DF would get marked differently, and I could check that to determine the configured state.

@amurikurthy

I changed the behavior like I stated in my last post. This should behave better for your topology now. Here’s how the new nft tables will look in an environment where there’s different DFs per ES:

table netdev evpn_sph {
        set vteps {
                type ipv4_addr
                flags interval
                elements = { 10.1.2.1 }
        }

        chain evpn_sph_ingress {
                type filter hook ingress devices = { eth1, eth3 } priority filter; policy accept;
                ip saddr @vteps udp dport 4789 meta mark set 0x04fc867d counter packets 23 bytes 3711 accept
        }
}

table bridge evpn_sph {
        set df_bonds {
                type ifname
                flags interval
                auto-merge
                elements = { "bond1" }
        }

        set non_df_bonds {
                type ifname
                flags interval
                auto-merge
                elements = { "bond0" }
        }

        chain evpn_sph_forward {
                type filter hook forward priority 0; policy accept;
                oifname @df_bonds meta mark 0x04fc867d meta pkttype { broadcast, multicast } counter packets 18 bytes 2178 drop
                iifname "vxlan*" oifname @non_df_bonds meta pkttype { broadcast, multicast } counter packets 18 bytes 2178 drop
        }
}
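
For completeness, here is a sketch of how the daemon could read the configured state back from these sets instead of keeping shadow state. The nft -j JSON layout it assumes ("nftables" / "set" / "elem") may vary with the nft version.

#!/usr/bin/env python3
# Sketch of the "check the marks/sets to determine configured state" idea:
# ask nft which bonds are currently programmed as DF / non-DF instead of
# keeping a separate shadow state. The JSON layout ("nftables" -> "set" ->
# "elem") is what recent nft versions emit, but treat it as an assumption.
import json
import subprocess

def bonds_in_set(set_name):
    """Return the interface names currently in a bridge/evpn_sph named set."""
    proc = subprocess.run(
        ["nft", "-j", "list", "set", "bridge", "evpn_sph", set_name],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:        # table or set not created yet
        return set()
    elems = set()
    for obj in json.loads(proc.stdout).get("nftables", []):
        for e in obj.get("set", {}).get("elem", []) or []:
            # interval/auto-merge elements may be wrapped; keep plain strings
            if isinstance(e, str):
                elems.add(e)
    return elems

if __name__ == "__main__":
    print("configured DF bonds:    ", bonds_in_set("df_bonds"))
    print("configured non-DF bonds:", bonds_in_set("non_df_bonds"))
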
1 Like

I’ve created a bug report to add this functionality natively in VyOS. Thanks for the information and testing!

2 Likes