Hello VyOS team,
I built a small leaf–spine lab on VMware Workstation using VyOS routers to test VXLAN+EVPN, and I’m encountering a reproducible data-plane failure: Host1 cannot ping Host2 across the VXLAN fabric. The control plane (BGP EVPN, OSPF) often looks healthy at first, but after some time traffic stops and OSPF neighbors go down. The hypervisor console shows `e1000 ... Detected Tx Unit Hang` on router NICs (R3 and R2), and the problem disappears if I replace e1000 with VMXNET3. I suspect this is either a VyOS/e1000 interaction bug or a kernel/driver timing/MTU/encapsulation edge case that causes decap/transfer failures.
I would like your team to review the problem data and advise whether:
- This is a VyOS bug (kernel module / offload / VXLAN / FRR interaction), or
- This is an expected limitation of the e1000 virtual NIC on VMware (and therefore VMXNET3 must be used), or
- There is a VyOS config/workaround to keep e1000 stable for this topology.
Below I explain the topology, config summary, exact test steps that reproduce the issue, observed behavior, and an exhaustive list of attachments (commands & pcaps) I’ve prepared. Please try to give a clear actionable fix (kernel option / sysctl / disable offload / recommended NIC type / config change).
Thank you — I appreciate your help
Topology:
Host1 (10.10.1.2/24) –[vmnet2 host-only]– R1 (br10/eth1) --VXLAN(VNI10 over underlay)-> R2 (br10/eth1) –[vmnet2 host-only]– Host2 (10.20.1.2/24)
R1 eth2 (20.1.1.1/24) –[vmnet3 underlay]– R3 eth1 (20.1.1.2/24)
R2 eth2 (30.1.1.1/24) –[vmnet3 underlay]– R3 eth2 (30.1.1.2/24)
All routers eth0 → vmnet19 (management 192.168.138.0/24)
VMs:
- R1 (VyOS): loopback 1.1.1.1/32, vxlan10 source 1.1.1.1, br10 10.10.1.1/24, eth2 20.1.1.1/24, bgp system-as 65000, EVPN config set
- R2 (VyOS): loopback 2.2.2.2/32, vxlan10 source 2.2.2.2, br10 10.20.1.1/24, eth2 30.1.1.1/24
- R3 (VyOS, spine): loopback 3.3.3.3/32, BGP RR toward both R1 & R2, static routes for loopbacks, OSPF underlay
- Host1: ens34 10.10.1.2/24, default later pointed to 10.10.1.1 (R1)
- Host2: ens34 10.20.1.2/24, default later pointed to 10.20.1.1 (R2)
Important VMware notes:
- Initially the VMs were created with the default e1000 virtual NIC type. After prolonged runtime, the R2 and R3 consoles repeatedly display: `e1000 0000:02:05.0 eth1: Detected Tx Unit Hang`
- Replacing the VM NICs with VMXNET3 resolved the TX hang and stabilized the fabric in my tests.
Steps to reproduce (minimal)
- Boot order used in lab: R3 → R1 → R2 → jumphost → Host1 → Host2 (power-on).
- Verify underlay OSPF forms between R1↔R3 and R2↔R3.
- Configure BGP EVPN neighbors to R3 (route reflector) and advertise VNI 10.
- Configure br10 + vxlan10 (VNI 10) on R1 and R2; vxlan is configured with `source-address` set to the loopbacks (1.1.1.1 and 2.2.2.2) and `nolearning`.
- From Host1: `ping 10.20.1.2`. It sometimes works; after minutes to hours, the ping fails with R1 replying `Destination Net Unreachable` or with no reply. Show commands then report OSPF adjacency loss (state Init or an empty neighbor list), or BGP EVPN is still up while the data plane fails.
- The console on R3/R2 prints `Detected Tx Unit Hang`. A reboot sometimes restores normal operation for a short time.
Observed when failing:
- `show bridge br10` sometimes has vxlan10 in BLOCKING (STP), but removing STP and correcting the forwarding state did not permanently fix it.
- `bridge fdb show br10` shows `extern_learn` entries and `dst` mapping, but inner packets never reach br10 on the peer in failing cases.
- `tcpdump -i eth2 udp port 4789` shows VXLAN packets being sent by the sender, but the receiver sometimes does not see the outer packets (indicating a VM NIC TX hang or an underlay drop).
- Console logs show `e1000 ... Detected Tx Unit Hang` repeatedly.
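To put a number on how often each interface reports the hang, the kernel log can be summarized per interface. This is only a minimal sketch: the helper name `count_tx_hangs` is my own, and it assumes the messages keep the form quoted earlier in this post (`e1000 <pci-address> <iface>: Detected Tx Unit Hang`).

```shell
# count_tx_hangs (illustrative name): read kernel log text on stdin and
# print one "<iface> <count>" line per interface that logged the hang.
count_tx_hangs() {
  grep -E 'e1000 .*Detected Tx Unit Hang' \
    | sed -E 's/.*e1000 [0-9a-f:.]+ ([^:]+): Detected Tx Unit Hang.*/\1/' \
    | sort \
    | uniq -c \
    | awk '{print $2, $1}'
}

# Typical use on a router (needs access to the kernel log):
#   sudo dmesg | count_tx_hangs
```

Running it on R2 and R3 before and after a failure should show whether the hang count grows on the underlay-facing interfaces.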
Exact VyOS configuration snippets (what I applied)
(I will attach full show configuration commands from R1, R2, R3 as files — see list below.)
Example R1:
set interfaces bridge br10 address '10.10.1.1/24'
set interfaces bridge br10 member interface eth1
set interfaces bridge br10 member interface vxlan10
set interfaces bridge br10 mtu '1600'
set interfaces ethernet eth2 address '20.1.1.1/24'
set interfaces loopback lo address '1.1.1.1/32'
set interfaces vxlan vxlan10 parameters nolearning
set interfaces vxlan vxlan10 port '4789'
set interfaces vxlan vxlan10 source-address '1.1.1.1'
set interfaces vxlan vxlan10 vni '10'
set protocols bgp … (neighbors to 3.3.3.3)
set protocols ospf area 0 network '20.1.1.0/24'
Note: originally STP was enabled on br10 and then removed.
What I’ve tried so far (troubleshooting steps)
- Verified `show bgp l2vpn evpn summary` (neighbors up) and `show ip ospf neighbor` where possible.
- Disabled STP on br10 to avoid STP blocking VXLAN.
- Verified `bridge fdb` shows `extern_learn` mappings for the remote hosts’ MACs.
- Captured inner (br10) and outer (eth2) traffic with tcpdump on R1 & R2 simultaneously while pinging hosts.
- Discovered `e1000 Detected Tx Unit Hang` in console logs on R3 (and replicated on R2).
- Switched failing router NICs from e1000 → VMXNET3; the problem disappeared: adjacency stabilized and pings succeed.
- Confirmed MTU: router interfaces were set to 1600 for underlay and hosts 1500; vxlan/bridge configured accordingly.
- Verified hosts’ default routes were set to their leaf router (Host1 → 10.10.1.1, Host2 → 10.20.1.1) prior to tests.
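As a sanity check on the MTU item, the VXLAN overhead arithmetic can be written out explicitly. This is a sketch under two assumptions that match my lab: an IPv4 underlay and untagged inner Ethernet frames. The helper name `vxlan_underlay_mtu` is illustrative only.

```shell
# vxlan_underlay_mtu (illustrative name): minimum underlay IP MTU needed to
# carry an inner IP packet of the given size without fragmentation.
vxlan_underlay_mtu() {
  inner_mtu=$1
  # inner Ethernet (14) + VXLAN (8) + UDP (8) + outer IPv4 (20) = 50 bytes
  echo $(( inner_mtu + 14 + 8 + 8 + 20 ))
}

vxlan_underlay_mtu 1500   # prints 1550
```

With hosts at 1500 the underlay needs at least 1550, so the 1600 configured on the router interfaces leaves headroom, which is why I do not think plain MTU exhaustion explains the failure.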
Please review the data below and advise:
- Is the `e1000 Detected Tx Unit Hang` console message actionable by VyOS configuration or kernel parameter? (That is, can VyOS be tuned with sysctl, ethtool, offload settings, or kernel module options to avoid or mitigate this, or is it purely a VMware/e1000 driver issue?)
- If this is a VyOS/host kernel issue, please recommend exact workarounds (commands to apply permanently), e.g. disabling specific offloads, changing the `nolearning` setting, an alternate VXLAN configuration, or recommended kernel modules/versions.
- If this is a VMware e1000 limitation, please confirm: is VMXNET3 the only supported NIC type for VyOS in such VXLAN+EVPN experiments? If so, please add that to the guidance.
- If there is any VyOS logging or debug mode I should enable (specific `dmesg` keys, FRR verbosity, vxlan debug) to gather more useful information for you, tell me the exact commands and I’ll add them to the ticket.
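Unless you advise otherwise, this is the kind of per-interface data I can collect with standard `ethtool`, plus the offload change I could try as an experiment. It is a sketch, not a known fix: the helper name `nic_diag_cmds` is my own, it only prints the commands so they can be reviewed before anything is run, and I have not established that disabling TSO/GSO/GRO has any effect on the e1000 hang.

```shell
# nic_diag_cmds (illustrative name): print, without executing, the ethtool
# commands I would run for one interface.
nic_diag_cmds() {
  iface=$1
  printf 'sudo ethtool -i %s\n' "$iface"   # driver name and version
  printf 'sudo ethtool -k %s\n' "$iface"   # current offload feature state
  printf 'sudo ethtool -S %s\n' "$iface"   # NIC statistics counters
  # Experiment only, not persistent across reboots:
  printf 'sudo ethtool -K %s tso off gso off gro off\n' "$iface"
}

nic_diag_cmds eth1
```

If a persistent VyOS configuration equivalent is preferable to the runtime `ethtool -K` change, please tell me which settings to use.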