WAN Load Balancing/Failover Conntrack Issues

Hi, I have a VyOS 1.4 system (commit hash 22345e61da3d51) and I’m trying to configure WAN load balancing with failover, but I’m stuck on several issues, listed below. The physical WAN connections are connected to a switch and then trunked to my router on eth0, so the WAN interfaces are eth0.50 and eth0.60. All the issues appear to be related to conntrack. I assume my configuration is incorrect, but I can’t figure out which part of it is the problem.

The intended setup is a combination of WAN load balancing and failover: ideally, traffic is load balanced during normal operation and shifts to one interface when the other goes down. The load balancing appears to be mostly working, since the public IP reported depends on which site I check it with. The full config is here: config-wan-load-balancing-sanitized.txt (74.3 KB).

Issues:

  • Failover doesn’t work: Unplugging one WAN connection causes major failures, and the symptoms depend on which connection I disconnect. Either existing conntrack entries are maintained while new connections succeed, or all connections (including new ones) fail.

  • WAN ingress replies eventually go out the wrong interface: Over several weeks, I noticed that some replies to inbound WAN traffic began going out the wrong interface. For example, UptimeRobot pings began failing consistently, and eventually one WireGuard endpoint also stopped working. This got worse over time until WireGuard failed to connect on both endpoints. Packet captures on the interfaces show that replies to incoming traffic (including ICMP and WireGuard) on one interface were going out the other interface, despite sticky-connections inbound being set and rules excluding WAN traffic from load balancing:

 rule 30 {
     description "Exclude WAN"
     exclude
     inbound-interface eth0.50
     protocol all
 }
 rule 31 {
     description "Exclude WAN"
     exclude
     inbound-interface eth0.60
     protocol all
 }

  • I can’t access my modem network via SNAT: One of my connections uses a cable modem that passes DHCP through to my WAN interface but also has the IP 192.168.100.1/24 for management. Because assigning an IP to the WAN interface caused my dynamic DNS to use the private address, I set up a source NAT rule (below) instead. It works some of the time, but not all devices can connect, even when no firewall rules interfere.

     rule 100 {
         description "Translate to modem network"
         destination {
             address 192.168.100.0/24
         }
         outbound-interface {
             name eth0.60
         }
         translation {
             address 192.168.100.10
         }
     }

This is the corresponding WAN load balancing rule:

 rule 10 {
     description "Modem network SNAT"
     destination {
         address 192.168.100.0/24
     }
     inbound-interface eth1+
     interface eth0.60 {
         weight 1
     }
     protocol all
 }
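
For anyone who prefers set commands, the excerpts above should correspond roughly to the following. This is a sketch rather than a copy from the attached config: the parent paths (load-balancing wan for rules 10, 30 and 31, nat source for rule 100) are assumed from context, since the excerpts only show the rule blocks themselves, and the sticky-connections line is reconstructed from the description in the second issue.

```
# WAN load balancing (parent path "load-balancing wan" is assumed)
set load-balancing wan sticky-connections inbound
set load-balancing wan rule 10 description 'Modem network SNAT'
set load-balancing wan rule 10 destination address '192.168.100.0/24'
set load-balancing wan rule 10 inbound-interface 'eth1+'
set load-balancing wan rule 10 interface eth0.60 weight '1'
set load-balancing wan rule 10 protocol 'all'
set load-balancing wan rule 30 description 'Exclude WAN'
set load-balancing wan rule 30 exclude
set load-balancing wan rule 30 inbound-interface 'eth0.50'
set load-balancing wan rule 30 protocol 'all'
set load-balancing wan rule 31 description 'Exclude WAN'
set load-balancing wan rule 31 exclude
set load-balancing wan rule 31 inbound-interface 'eth0.60'
set load-balancing wan rule 31 protocol 'all'

# Source NAT towards the modem network (parent path "nat source" is assumed)
set nat source rule 100 description 'Translate to modem network'
set nat source rule 100 destination address '192.168.100.0/24'
set nat source rule 100 outbound-interface name 'eth0.60'
set nat source rule 100 translation address '192.168.100.10'
```

The attached config file remains the authoritative version if anything here differs from it.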