This is with VyOS 1.5-rolling-202501110007. I have a router with 2x 40 GbE L3 links, each to a different Arista switch. I’m talking OSPF and iBGP to the switches, and it looks like ECMP is being set up correctly, but I’m seeing >99% of my outbound traffic on the same link, even after changing the multipath hash policy.
For this test, I’m most concerned with the route from my router to 10.0.0.7, which I’m using to run an HTTP load test through the router. I’m seeing 2 ECMP routes to this host, one through each interface:
$ show ip route 10.0.0.7
Routing entry for 10.0.0.7/32
Known via "ospf", distance 110, metric 12, best
Last update 00:19:01 ago
* 10.0.6.69, via eth4, weight 1
* 10.0.7.69, via eth6, weight 1
The kernel sees the same thing:
$ ip route show 10.0.0.7
10.0.0.7 nhid 84 proto ospf metric 20
nexthop via 10.0.6.69 dev eth4 weight 1
nexthop via 10.0.7.69 dev eth6 weight 1
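The nexthop group the kernel route references (nhid 84 above) can also be dumped directly; it should show a group entry pointing at one nexthop per link:
$ ip nexthop show id 84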
I’ve set the IPv4 and IPv6 multipath hash policy to 1, which should include the source and destination ports in the ECMP hash. This should cause individual flows to be balanced across links, not just hosts:
set system sysctl parameter net.ipv4.fib_multipath_hash_policy value '1'
set system sysctl parameter net.ipv6.fib_multipath_hash_policy value '1'
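The running values can be read back with sysctl to confirm the settings actually reached the kernel:
$ sysctl net.ipv4.fib_multipath_hash_policy
$ sysctl net.ipv6.fib_multipath_hash_policy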
However, when I open up 256 TCP connections from a single test server (10.0.0.7, as above) through the router, the return traffic for all of them ends up on the same interface. According to atop, I’m seeing 39 Gbps outbound on eth6 and 181 Mbps on eth4.
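As a sanity check on the hash itself, the kernel’s per-flow nexthop choice can be queried with ip route get, which accepts ipproto, sport, and dport; varying the ports should flip which nexthop is returned if the L4 hash is really in effect. (The source address, ingress interface, and ports below are placeholders, not my actual load-test flows.)
$ ip route get 10.0.0.7 from 10.0.1.1 iif eth0 ipproto tcp sport 8080 dport 50001
$ ip route get 10.0.0.7 from 10.0.1.1 iif eth0 ipproto tcp sport 8080 dport 50002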
I saw the same behavior with a nightly from July before I upgraded this morning. Does anyone have any suggestions on how to debug this? Since I’m using 256 different TCP connections, I’d expect fib_multipath_hash_policy=1 to send roughly half of them over each interface, not 100% over a single link. I’ve also tried setting net.ipv4.fib_multipath_hash_fields=0xfff together with fib_multipath_hash_policy=3 (a custom hash computed over practically every stable field in each packet) with no obvious change in behavior.
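For completeness, the VyOS equivalent of those last two settings would look like the earlier commands (values written as strings, with the field mask from the kernel docs):
set system sysctl parameter net.ipv4.fib_multipath_hash_policy value '3'
set system sysctl parameter net.ipv4.fib_multipath_hash_fields value '0xfff'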