[Bug / Help] VXLAN (VNI 10) + EVPN lab: intermittent L2 forwarding & OSPF/BGP neighbor loss — e1000 TX Unit Hang observed

Hello VyOS team,

I built a small leaf–spine lab on VMware Workstation using VyOS routers to test VXLAN+EVPN and I’m encountering a reproducible data-plane failure: Host1 cannot ping Host2 across the VXLAN fabric. The control plane (BGP EVPN, OSPF) often looks healthy, but traffic stops and OSPF neighbors go down. The hypervisor console shows e1000 ... Detected Tx Unit Hang on router NICs (R3 and R2), and the problem disappears if I replace e1000 with VMXNET3. I suspect this is either a VyOS/e1000 interaction bug, or a kernel/driver timing/MTU/encapsulation edge case that causes decapsulation/forwarding failures.

I want your team to review the problem data and advise whether:

  • This is a VyOS bug (kernel module / offload / VXLAN / FRR interaction), or

  • An expected limitation of the e1000 virtual NIC on VMware (and therefore VMXNET3 must be used), or

  • If there is a VyOS config/workaround to keep e1000 stable for this topology.

Below I explain the topology, config summary, exact test steps that reproduce the issue, observed behavior, and an exhaustive list of attachments (commands & pcaps) I’ve prepared. Please try to give a clear actionable fix (kernel option / sysctl / disable offload / recommended NIC type / config change).

Thank you — I appreciate your help

Host1 (10.10.1.2/24) --[vmnet2 host-only]-- R1 (br10/eth1) --VXLAN (VNI 10 over underlay)-- R2 (br10/eth1) --[vmnet2 host-only]-- Host2 (10.20.1.2/24)
R1 eth2 (20.1.1.1/24) --[vmnet3 underlay]-- R3 eth1 (20.1.1.2/24)
R2 eth2 (30.1.1.1/24) --[vmnet3 underlay]-- R3 eth2 (30.1.1.2/24)
All routers eth0 → vmnet19 (management 192.168.138.0/24)

VMs:

  • R1 (VyOS): loopback 1.1.1.1/32, vxlan10 source 1.1.1.1, br10 10.10.1.1/24, eth2 20.1.1.1/24, bgp system-as 65000, EVPN config set

  • R2 (VyOS): loopback 2.2.2.2/32, vxlan10 source 2.2.2.2, br10 10.20.1.1/24, eth2 30.1.1.1/24

  • R3 (VyOS, spine): loopback 3.3.3.3/32, BGP RR toward both R1 & R2, static routes for loopbacks, OSPF underlay

  • Host1: ens34 10.10.1.2/24, default later pointed to 10.10.1.1 (R1)

  • Host2: ens34 10.20.1.2/24, default later pointed to 10.20.1.1 (R2)
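
From the summary above, R3's route-reflector side would look roughly like the sketch below. This is not my exact config; the neighbor statements, update-source, and l2vpn-evpn activation are assumptions based on VyOS 1.4 syntax, and the static routes mirror the "static routes for loopbacks" note:

```
set interfaces loopback lo address '3.3.3.3/32'
set protocols bgp system-as '65000'
set protocols bgp parameters router-id '3.3.3.3'
set protocols bgp neighbor 1.1.1.1 remote-as '65000'
set protocols bgp neighbor 1.1.1.1 update-source 'lo'
set protocols bgp neighbor 1.1.1.1 address-family l2vpn-evpn route-reflector-client
set protocols bgp neighbor 2.2.2.2 remote-as '65000'
set protocols bgp neighbor 2.2.2.2 update-source 'lo'
set protocols bgp neighbor 2.2.2.2 address-family l2vpn-evpn route-reflector-client
set protocols static route 1.1.1.1/32 next-hop 20.1.1.1
set protocols static route 2.2.2.2/32 next-hop 30.1.1.1
```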

Important VMware notes:

  • Initially the VMs were created with the default e1000 virtual NIC type. After prolonged runtime, the R2 and R3 consoles repeatedly display: e1000 0000:02:05.0 eth1: Detected Tx Unit Hang

  • Replacing the VM NICs with VMXNET3 resolved the TX hang & stabilized the fabric in my tests.

Steps to reproduce (minimal)

  1. Boot order used in lab: R3 → R1 → R2 → jumphost → Host1 → Host2 (power-on)

  2. Verify underlay OSPF forms between R1↔R3 & R2↔R3

  3. Configure BGP EVPN neighbors to R3 (route reflector) and advertise VNI 10

  4. Configure br10+vxlan10 (VNI 10) on R1 and R2; vxlan configured with source-address as loopbacks (1.1.1.1 and 2.2.2.2) and nolearning.

  5. From Host1: ping 10.20.1.2 sometimes works; after minutes to hours it fails, with R1 replying Destination Net Unreachable or no reply at all. Show commands reveal OSPF adjacency loss (stuck in Init, or no neighbors listed), or BGP EVPN still up while the data plane fails.

  6. Console on R3/R2 prints Detected Tx Unit Hang. A reboot sometimes returns to normal for a short time.
    Observed when failing:

    • show bridge br10 sometimes shows vxlan10 in BLOCKING (STP); disabling STP and forcing the forwarding state did not permanently fix it.

    • bridge fdb show br10 shows extern_learn entries and dst mapping but inner packets never reach br10 on the peer in failing cases.

    • tcpdump -i eth2 udp port 4789 shows VXLAN packets being sent by the sender, but receiver sometimes does not see the outer packets (indicating VM NIC TX hang or underlay drop).

    • Console logs show e1000 ... Detected Tx Unit Hang repeatedly.
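
For reference, the inner/outer capture pair I ran on both leaves simultaneously while pinging looks like this (interface names per the topology above):

```shell
# Outer (underlay) leg: VXLAN-encapsulated packets leaving the uplink
tcpdump -ni eth2 udp port 4789

# Inner (overlay) leg: decapsulated frames reaching the bridge
tcpdump -ni br10 icmp

# Forwarding database, including EVPN-installed extern_learn entries
bridge fdb show br br10
```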

Exact VyOS configuration snippets (what I applied)

(I will attach full show configuration commands from R1, R2, R3 as files — see list below.)

Example R1:

set interfaces bridge br10 address '10.10.1.1/24'
set interfaces bridge br10 member interface eth1
set interfaces bridge br10 member interface vxlan10
set interfaces bridge br10 mtu '1600'

Note: STP was originally enabled on br10 and later removed.

set interfaces ethernet eth2 address '20.1.1.1/24'
set interfaces loopback lo address '1.1.1.1/32'
set interfaces vxlan vxlan10 parameters nolearning
set interfaces vxlan vxlan10 port '4789'
set interfaces vxlan vxlan10 source-address '1.1.1.1'
set interfaces vxlan vxlan10 vni '10'
set protocols bgp … (neighbors to 3.3.3.3)
set protocols ospf area 0 network '20.1.1.0/24'
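
R2's overlay side is the mirror image; reconstructed from the VM summary above (a sketch, not the literal attached config):

```
set interfaces bridge br10 address '10.20.1.1/24'
set interfaces bridge br10 member interface eth1
set interfaces bridge br10 member interface vxlan10
set interfaces bridge br10 mtu '1600'
set interfaces ethernet eth2 address '30.1.1.1/24'
set interfaces loopback lo address '2.2.2.2/32'
set interfaces vxlan vxlan10 parameters nolearning
set interfaces vxlan vxlan10 port '4789'
set interfaces vxlan vxlan10 source-address '2.2.2.2'
set interfaces vxlan vxlan10 vni '10'
set protocols ospf area 0 network '30.1.1.0/24'
```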

What I’ve tried so far (troubleshooting steps)

  • Verified show bgp l2vpn evpn summary (neighbors up) and show ip ospf neighbor where possible.

  • Disabled STP on br10 to avoid STP blocking VXLAN.

  • Verified bridge fdb shows extern_learn mapping for MACs for the remote hosts.

  • Captured inner (br10) and outer (eth2) traffic with tcpdump on R1 & R2 simultaneously while pinging hosts.

  • Discovered e1000 Detected Tx Unit Hang in console logs on R3 (and replicated on R2).

  • Switched failing router NICs from e1000 → VMXNET3; problem disappeared — adjacency stabilized and pings succeed.

  • Confirmed MTU: underlay router interfaces were set to 1600 and host interfaces to 1500; the vxlan and bridge MTUs were configured accordingly.

  • Verified hosts’ default routes were set to their leaf router (Host1 → 10.10.1.1, Host2 → 10.20.1.1) prior to testing.
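
Since MTU sizing keeps coming up: the arithmetic behind the 1600-byte underlay choice can be sketched as follows (assuming an IPv4 underlay and untagged inner frames):

```python
# VXLAN encapsulation overhead per packet (IPv4 underlay, untagged inner frame)
OUTER_IP = 20   # outer IPv4 header
OUTER_UDP = 8   # outer UDP header (destination port 4789)
VXLAN_HDR = 8   # VXLAN header carrying the 24-bit VNI
INNER_ETH = 14  # inner Ethernet header, now carried as payload

OVERHEAD = OUTER_IP + OUTER_UDP + VXLAN_HDR + INNER_ETH  # 50 bytes

def required_underlay_mtu(inner_ip_mtu):
    """Minimum underlay IP MTU that carries the inner frame unfragmented."""
    return inner_ip_mtu + OVERHEAD

# Hosts use a 1500-byte MTU, so the underlay needs at least 1550;
# the lab's 1600 leaves headroom.
print(required_underlay_mtu(1500))  # 1550
```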

Please review the data below and advise:

  1. Is the e1000 Detected Tx Unit Hang console message actionable by VyOS configuration or kernel parameter? (i.e., can VyOS be tuned with sysctl, ethtool, offload settings, or kernel module options to avoid/mitigate this, or is it purely a VMware/e1000 driver issue?)

  2. If this is a VyOS/host kernel issue, recommend exact workarounds (commands to apply permanently) — e.g. disabling specific offloads, setting nolearning changes, alternate VXLAN configuration, or recommended kernel modules/versions.

  3. If this is a VMware e1000 limitation, please confirm: is VMXNET3 the only supported NIC type for VyOS in such VXLAN+EVPN experiments? If so, add that to guidance.

  4. If there is any VyOS logging or debug mode I should enable (specific dmesg keys, FRR verbosity, vxlan debug) to gather more useful information for you, tell me the exact commands and I’ll add them to the ticket.
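
In case it helps, these are the log-gathering commands I would plan to run from the VyOS shell; the grep patterns are my own guesses at useful filters, not something from VyOS documentation:

```shell
# Kernel ring buffer with human-readable timestamps: full e1000 hang reports
dmesg -T | grep -iE 'e1000|tx.*hang'

# Kernel messages for the current boot via journald
journalctl -k -b | grep -i e1000

# Current offload state on the suspect NIC
ethtool -k eth1
```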

If you run VyOS as a VM guest, I would highly recommend using virtio NICs (or VMXNET3 on VMware) rather than e1000, vmnet, etc.

Also, if the host has physical e1000-class NICs, there is an ongoing issue with Intel 1G-10G RJ45 NICs where the current workaround seems to be disabling at least TSO and GSO offloading. Some go the extra mile and disable all offloading on their Intel NICs to keep them stable without hangs.
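
For reference, disabling those offloads with ethtool looks like this (eth1 is an example interface name; these settings do not persist across reboots unless applied via your platform's config mechanism):

```shell
# Disable TCP segmentation offload and generic segmentation offload
ethtool -K eth1 tso off gso off

# The more aggressive variant: also turn off GRO and scatter-gather
ethtool -K eth1 tso off gso off gro off sg off

# Verify the resulting state
ethtool -k eth1 | grep -E 'tcp-segmentation-offload|generic-(segmentation|receive)-offload'
```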

Excellent, this helped a lot with the solution; the issue is fixed. Thank you!

What was the fix? :slight_smile:

Apologies! Changing the NIC type to VMXNET3 fixed the issue.
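
For anyone else hitting this: with the VM powered off, the NIC type can be changed in the Workstation UI or by editing the .vmx file directly. The adapter index below is an example; use whichever ethernetN maps to the affected interface:

```
ethernet1.virtualDev = "vmxnet3"
```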