WAN failover not working for me, rolling 1.4 versions, any help please?

Hi all,

I keep running into this issue. Sometimes failover appears to work, but I finally figured out that it does not: traffic is still being load-shared across both links.
End goal: send internet traffic over eth0.167 only, unless that link has failed.

I will try to simplify my setup, but I have pasted the full config too, because I also have VRRP in play on the LAN side (not on the WAN side).

Here is the full config file. Apologies for the length, but it's mostly firewall rules.

Version

mario@vyos007:~$ show version

Version:          VyOS 1.4-rolling-202202200623
Release train:    sagitta

Built by:         autobuild@vyos.net
Built on:         Sun 20 Feb 2022 06:23 UTC
Build UUID:       e70d3f7d-3523-4a8f-af6c-8fe07fc09744
Build commit ID:  7c82c5c7104675

Architecture:     x86_64
Boot via:         installed image
System type:      VMware guest

Hardware vendor:  VMware, Inc.
Hardware model:   VMware Virtual Platform
Hardware S/N:     VMware-42 1f 29 92 a1 0b 9f a2-cc e7 85 32 d3 2a 7a 9a
Hardware UUID:    92291f42-0ba1-a29f-cce7-8532d32a7a9a

Copyright:        VyOS maintainers and contributors

I can get internet working by adding a static route for 0.0.0.0/0 with dhcp-interface eth0.167, but even this is not reliable: when the 4G plan has not been paid, traffic noticeably drops out. It also defeats the purpose of having failover at all.

I would really appreciate any help, please. I must be missing something right in front of my nose :frowning:
I log all my new connections and can see eth0.197 being used constantly, even though show wan-load-balance reports it as failed.

My WAN (DHCP cable modem) is on eth0.167. My expensive 4G backup Ethernet modem/router is on eth0.197, also via DHCP, and VyOS gets a 10.x address from it.

I am not sure what I am doing anymore. Whatever I try, I lose connectivity, because most traffic goes over eth0.197, where I let the credit lapse. That is why this interface currently shows as failed in show wan-load-balance:

mario@vyos007:~$ show wan-load-balance
Interface:  eth0.167
  Status:  active
  Last Status Change:  Mon May  2 17:47:04 2022
  +Test:  ping  Target: 1.0.0.1
    Last Interface Success:  0s
    Last Interface Failure:  n/a
    # Interface Failure(s):  0

Interface:  eth0.197
  Status:  failed
  Last Status Change:  Mon May  2 17:47:24 2022
  -Test:  ping  Target: 1.1.1.1
    Last Interface Success:  n/a
    Last Interface Failure:  0s
    # Interface Failure(s):  51

My load-balance configuration

mario@vyos007# show load-balancing
 wan {
     enable-local-traffic
     flush-connections
     interface-health eth0.167 {
         failure-count 3
         nexthop dhcp
         success-count 1
         test 10 {
             resp-time 5
             target 1.0.0.1
             ttl-limit 1
             type ping
         }
     }
     interface-health eth0.197 {
         failure-count 3
         nexthop dhcp
         success-count 1
         test 10 {
             resp-time 5
             target 1.1.1.1
             ttl-limit 1
             type ping
         }
     }
     rule 5 {
         destination {
             address 192.168.0.0/16
         }
         exclude
         inbound-interface eth+
         protocol all
     }
     rule 6 {
         destination {
             address 172.16.0.0/12
         }
         exclude
         inbound-interface eth+
         protocol all
     }
     rule 7 {
         destination {
             address 10.0.0.0/8
         }
         exclude
         inbound-interface eth+
         protocol all
     }
     rule 10 {
         failover
         inbound-interface eth0.7v7
         interface eth0.167 {
             weight 10
         }
         interface eth0.197 {
             weight 1
         }
         protocol all
     }
     rule 20 {
         failover
         inbound-interface eth0.11v11
         interface eth0.167 {
             weight 10
         }
         interface eth0.197 {
             weight 1
         }
         protocol all
     }
     rule 30 {
         failover
         inbound-interface eth0.13v13
         interface eth0.167 {
             weight 10
         }
         interface eth0.197 {
             weight 1
         }
         protocol all
     }
     rule 40 {
         failover
         inbound-interface eth0.17v17
         interface eth0.167 {
             weight 10
         }
         interface eth0.197 {
             weight 1
         }
         protocol all
     }
     rule 50 {
         failover
         inbound-interface eth0.67v67
         interface eth0.167 {
             weight 10
         }
         interface eth0.197 {
             weight 1
         }
         protocol all
     }
     rule 70 {
         failover
         inbound-interface eth0.131v131
         interface eth0.167 {
             weight 10
         }
         interface eth0.197 {
             weight 1
         }
         protocol all
     }
     sticky-connections {
         inbound
     }
 }

My NAT source rules

mario@vyos007# show nat source
 rule 5010 {
     description "Masquerade for WAN"
     outbound-interface eth0.167
     translation {
         address masquerade
     }
 }
 rule 5020 {
     description "Masquerade for WAN_BCK"
     outbound-interface eth0.197
     translation {
         address masquerade
     }
 }

I do not have anything in show policy or show protocols worth adding here.

My route table with above setup

mario@vyos007:~$ sh ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [210/0] via 10.73.215.1, eth0.197 onlink, weight 1, 00:01:18
  *                   via 203.7.0.1, eth0.167, weight 1, 00:01:18
S>* 10.73.215.1/32 [1/0] is directly connected, eth0.197, weight 1, 00:01:18
C>* 10.73.215.124/32 is directly connected, eth0.197, 00:00:30
S>* 10.168.17.0/24 [1/0] via 192.168.17.100, eth0.17, weight 1, 00:15:05
S>* 10.168.19.0/24 [1/0] via 192.168.17.100, eth0.17, weight 1, 00:15:05
S>* 192.168.0.0/16 [254/0] unreachable (blackhole), weight 1, 00:15:05
C * 192.168.7.0/24 is directly connected, eth0.7v7, 00:15:01
C>* 192.168.7.0/24 is directly connected, eth0.7, 00:15:14
C * 192.168.11.0/24 is directly connected, eth0.11v11, 00:15:01
C>* 192.168.11.0/24 is directly connected, eth0.11, 00:15:14
C * 192.168.13.0/24 is directly connected, eth0.13v13, 00:15:01
C>* 192.168.13.0/24 is directly connected, eth0.13, 00:15:14
C * 192.168.17.0/24 is directly connected, eth0.17v17, 00:15:01
C>* 192.168.17.0/24 is directly connected, eth0.17, 00:15:14
C * 192.168.53.0/24 is directly connected, eth0.53v53, 00:15:01
C>* 192.168.53.0/24 is directly connected, eth0.53, 00:15:14
C * 192.168.67.0/24 is directly connected, eth0.67v67, 00:15:01
C>* 192.168.67.0/24 is directly connected, eth0.67, 00:15:14
C * 192.168.79.0/24 is directly connected, eth0.79v79, 00:15:01
C>* 192.168.79.0/24 is directly connected, eth0.79, 00:15:14
S>* 192.168.100.0/24 [1/0] is directly connected, eth0.167, weight 1, 00:14:59
C * 192.168.131.0/24 is directly connected, eth0.131v131, 00:15:01
C>* 192.168.131.0/24 is directly connected, eth0.131, 00:15:14
S>* 192.168.197.0/24 [1/0] is directly connected, eth0.197, weight 1, 00:14:58
C>* 203.7.0.0/19 is directly connected, eth0.167, 00:15:12

I did notice that both default routes have a distance of 210, and that there is somehow an additional route for eth0.197 as well. That is probably related to my problems, but I cannot see a way to make it work yet…

Hi!
First of all, I should say that I have not looked at your configuration in detail, because there are some known bugs in load balancing on 1.4 (and I see your version is from February).

Please check whether the routing tables are created as they should be, and also review the NAT table.
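One way to run that check from the VyOS shell is sketched below. It assumes the load balancer uses routing tables 201 and 202 (the numbers that appear in the workaround in this thread) and that traffic is steered to them by policy rules. `show_wlb_tables` is a made-up helper name, not a VyOS command.

```shell
# Hypothetical helper, not part of VyOS: dump policy rules and the per-WAN tables.
# Table numbers 201 and 202 are assumed from the workaround in this thread.
show_wlb_tables() {
    echo "== policy rules =="
    ip rule show 2>/dev/null || echo "(could not list policy rules)"
    for table in 201 202; do
        echo "== table $table =="
        # Per the workaround in this thread, each table is expected to hold a default route.
        ip route show table "$table" 2>/dev/null || echo "(table $table could not be read)"
    done
}

show_wlb_tables
```

The NAT side can be reviewed separately with the op-mode command show nat source rules.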

Also, if both WAN interfaces (eth0.197 and eth0.167) get their IP addresses through DHCP, it is expected that you see both default routes in the main table.

Bummer. I checked T4362 from your links (thanks!) and it is definitely the exact same issue.
Is a fix planned soon? I have had this problem for quite a while, as have others who have tried failover.

Thanks for the support!

Workaround: manually add these entries to the desired routing tables:

sudo ip route add table 201 default via <gateway_WAN_X> dev <eth0.X>
sudo ip route add table 202 default via <gateway_WAN_Y> dev <eth0.Y>
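
Filled in with the gateways from the sh ip route output earlier in the thread, the workaround might look like the sketch below. Which table number belongs to which interface is an assumption here (201 for eth0.167, 202 for eth0.197), so it is worth checking ip rule show first. `wlb_default_route_cmd` is a hypothetical helper that only prints each command, so nothing changes until the printed lines are run with sudo.

```shell
# Hypothetical helper: print (not run) the workaround command for one WAN link.
# Arguments: routing table number, gateway address, interface name.
wlb_default_route_cmd() {
    printf 'ip route add table %s default via %s dev %s\n' "$1" "$2" "$3"
}

# Gateways taken from the route table in this thread; the table mapping is assumed.
wlb_default_route_cmd 201 203.7.0.1 eth0.167
wlb_default_route_cmd 202 10.73.215.1 eth0.197
```

Routes added this way live only in the running kernel, so they would need to be added again after a reboot.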