Understanding problem: load balancing and default gateway

Okay. There are two WAN lines and one LAN line.

The basic load balancing is configured like this:

 wan {
     flush-connections
     interface-health eth4 {
         failure-count 1
         nexthop 8.8.8.8
         success-count 10
     }
     interface-health eth5 {
         failure-count 1
         nexthop 8.8.4.4
         success-count 10
     }
     rule 5 {
         destination {
             address 192.168.0.0/16
         }
         exclude
         inbound-interface eth+
         protocol all
     }
     rule 10 {
         inbound-interface eth0
         interface eth4 {
         }
         protocol all
     }
     rule 11 {
         inbound-interface eth0
         interface eth5 {
         }
         protocol all
     }
 }

So far so good, and that part works. I think the problem lies in another setting, the default gateway, and this is where my confusion starts.

Apparently the default route takes priority over the load balancing.

Variant 1: Both WAN lines have a static IP, without DHCP or PPPoE.
The default route entry looks like this:

 route 0.0.0.0/0 {
     next-hop 37.0.0.1 {
     }
     next-hop 109.0.0.1 {
     }
 }

This works fine, at least until one of the lines becomes unstable or goes down completely. The load balancing notices this, of course, but the default gateway does not. The effect was that traffic became extremely slow, as if 50% of it were being lost.
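To illustrate why two equal next hops can look like "50% loss" when one line dies, here is a rough Python sketch. It assumes the router spreads flows over the equal-cost gateways with some stable hash; the SHA-256 hash, the function names, and the client and server addresses are made up for illustration and are not what the kernel actually uses.

```python
import hashlib

# The two equal-cost gateways from the default route above.
NEXT_HOPS = ["37.0.0.1", "109.0.0.1"]


def pick_next_hop(src: str, dst: str, hops=NEXT_HOPS) -> str:
    """Map a flow onto one next hop with a stable hash (illustrative only)."""
    digest = hashlib.sha256(f"{src}->{dst}".encode()).digest()
    return hops[int.from_bytes(digest[:4], "big") % len(hops)]


def delivered_share(flows, dead_hop: str) -> float:
    """Fraction of flows that are NOT mapped onto the dead next hop."""
    ok = sum(1 for src, dst in flows if pick_next_hop(src, dst) != dead_hop)
    return ok / len(flows)


# Hypothetical LAN clients talking to hypothetical Internet hosts.
flows = [(f"192.168.1.{h}", f"203.0.113.{d}")
         for h in range(1, 51) for d in range(1, 41)]

# Pretend the second line is dead but its route is still installed.
share = delivered_share(flows, dead_hop="109.0.0.1")
print(f"{share:.0%} of flows still get through")
```

Since nothing removes the dead gateway from the route, roughly every second flow keeps landing on it, which would match the slowdown described here.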

Variant 2: One WAN line has a static IP and the second gets its IP via DHCP.

Here the problem is even clearer. The DHCP interface gets a kernel route with metric 0, while the WAN line with the static IP gets 1 as its lowest possible metric. We then had a case where the DHCP line was faulty on the provider's side: it appeared to be online but passed no traffic. The end of the story is that nothing worked, because the router kept trying to send traffic over the DHCP line.

So what am I misunderstanding?

I fought with this recently while working on a multi-WAN VPN setup. Try adding a distance metric to your default routes.

set protocols static route 0.0.0.0/0 next-hop 37.0.0.1 distance 5
set protocols static route 0.0.0.0/0 next-hop 109.0.0.1 distance 10

The load-balancing configuration is applied to your LAN traffic and modifies the routing table for those users, but the router doesn’t look at that routing table when making decisions for its own traffic. Instead, the main routing table, which holds the default routes, is used. By adding the distance metric, you are telling the router which interface is preferred (lower is preferred).

Okay, sure, but then the firewall itself can’t reach the Internet if the line with the lower metric fails.
There is no gateway check here. Since VyOS also handles DNS and NTP forwarding, I depend on the router itself always being able to reach the Internet. Am I missing something?

Using distance metrics will set up preferred routes, so distance 1 and distance 5 (or whatever numbers) will make one preferred over the other. If the WAN1 next hop is unreachable, the router will fail over to WAN2 by itself.

Now, if the next hop is reachable but the ISP has routing issues, that is outside your control. In that case you would have to adjust your routing table yourself to make the other interface preferred.
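The two cases can be put into a small Python model. This is only a sketch of the selection logic as described in this thread, not of how VyOS is implemented: the class, the field names, and the `next_hop_usable` / `upstream_ok` flags are assumptions for illustration, and "usable" stands for whatever the router itself can detect (for example link state).

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DefaultRoute:
    next_hop: str
    distance: int          # administrative distance, lower is preferred
    next_hop_usable: bool  # what the router itself can see
    upstream_ok: bool      # ISP health behind the gateway, invisible to the routing table


def active_route(routes: Iterable[DefaultRoute]) -> Optional[DefaultRoute]:
    """Pick the usable route with the lowest distance, or None if there is none."""
    usable = [r for r in routes if r.next_hop_usable]
    return min(usable, key=lambda r: r.distance, default=None)


def traffic_flows(routes: Iterable[DefaultRoute]) -> bool:
    """Traffic only gets out if the chosen route also has a working upstream."""
    chosen = active_route(routes)
    return chosen is not None and chosen.upstream_ok


wan2 = DefaultRoute("109.0.0.1", 10, True, True)

healthy = [DefaultRoute("37.0.0.1", 5, True, True), wan2]
gateway_down = [DefaultRoute("37.0.0.1", 5, False, False), wan2]
isp_broken = [DefaultRoute("37.0.0.1", 5, True, False), wan2]

for name, routes in [("healthy", healthy),
                     ("gateway down", gateway_down),
                     ("ISP broken", isp_broken)]:
    chosen = active_route(routes)
    print(f"{name}: via {chosen.next_hop}, traffic flows: {traffic_flows(routes)}")
```

In the third case the router still picks 37.0.0.1 because nothing in the routing table reflects the broken upstream, which is the situation where the distances have to be changed by hand.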