WAN Load Balancing when interface stays up

level410 · November 11, 2022, 2:10pm

I am trying to configure WAN load balancing for two connections: one wired (eth0) connection and a failover wwan connection. The idea is that failover should occur any time the primary wired connection is not available.

This works fine in a lost-link scenario: eth0 is quickly marked as “failed” and traffic changes over to the wwan0 interface. However, failover never occurs when the link itself stays up but the internet is not available. I am simulating a failure where, for example, a cable modem link stays online but the upstream connection is lost (I am doing this in VMware by simply changing the vnic network to something that goes nowhere).

In both cases, wan-load-balance detects the failure (the status for eth0 changes to failed). However, the downstream (eth1) clients only fail over to the wwan0 interface for outbound traffic when the link itself is lost. I also noticed that when eth0 stays up, the default route out eth0 remains in the routing table.

Am I doing something wrong? Config below.

Thanks!

interfaces {
    ethernet eth0 {
        address dhcp
        hw-id 00:0c:29:03:37:de
    }
    ethernet eth1 {
        address 192.168.2.1/24
        hw-id 00:0c:29:03:37:e8
    }
    loopback lo {
    }
    wwan wwan0 {
        address 167.20.XXX.XXX/32
        apn b2b.static
    }
}
load-balancing {
    wan {
        flush-connections
        interface-health eth0 {
            nexthop 1.1.1.1
        }
        interface-health wwan0 {
            nexthop 8.8.8.8
        }
        rule 1 {
            failover
            inbound-interface eth1
            interface eth0 {
                weight 100
            }
            interface wwan0 {
                weight 1
            }
        }
    }
}
protocols {
    static {
        interface-route 0.0.0.0/0 {
            next-hop-interface wwan0 {
                distance 250
            }
        }
        route 0.0.0.0/0 {
            dhcp-interface eth0
        }
    }
}
service {
    ssh {
    }
}
system {
    config-management {
        commit-revisions 100
    }
    conntrack {
        modules {
            ftp
            h323
            nfs
            pptp
            sip
            sqlnet
            tftp
        }
    }
    console {
        device ttyS0 {
            speed 115200
        }
    }
    host-name vyos
    login {
        user vyos {
            authentication {
                encrypted-password ...
                plaintext-password ""
            }
        }
    }
    ntp {
        server time1.vyos.net {
        }
        server time2.vyos.net {
        }
        server time3.vyos.net {
        }
    }
    syslog {
        global {
            facility all {
                level info
            }
            facility protocols {
                level debug
            }
        }
    }
}

zsdc · November 11, 2022, 7:34pm

You need to add routes to the test hosts via the appropriate interfaces to make it work. For example, in 1.4:

set protocols static route 1.1.1.1/32 dhcp-interface 'eth0'
set protocols static route 8.8.8.8/32 interface wwan0
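
For 1.3.x, where routes bound to an interface use the separate interface-route node, a rough equivalent would be the following. This is a sketch based on the 1.3-style syntax that appears in the configs in this thread, not something verified on every release:

```
set protocols static route 1.1.1.1/32 dhcp-interface eth0
set protocols static interface-route 8.8.8.8/32 next-hop-interface wwan0
```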

level410 · November 11, 2022, 8:03pm

Thanks, but that doesn’t seem to work either. Here’s my config in case I’m missing something stupid, along with what I am seeing:

 interfaces {
     ethernet eth0 {
         address dhcp
         hw-id 00:0c:29:03:37:de
     }
     ethernet eth1 {
         address 192.168.2.1/24
         hw-id 00:0c:29:03:37:e8
     }
     loopback lo {
     }
     wwan wwan0 {
         address 167.20.XXX.XXX/32
         apn b2b.static
     }
 }
 load-balancing {
     wan {
         flush-connections
         interface-health eth0 {
             failure-count 1
             nexthop 1.1.1.1
             success-count 1
         }
         interface-health wwan0 {
             failure-count 1
             nexthop 8.8.8.8
             success-count 1
         }
         rule 1 {
             failover
             inbound-interface eth1
             interface eth0 {
                 weight 100
             }
             interface wwan0 {
                 weight 1
             }
             protocol all
         }
     }
 }
 protocols {
     static {
         interface-route 0.0.0.0/0 {
             next-hop-interface wwan0 {
                 distance 250
             }
         }
         interface-route 8.8.8.8/32 {
             next-hop-interface wwan0 {
             }
         }
         route 0.0.0.0/0 {
             dhcp-interface eth0
         }
         route 1.1.1.1/32 {
             dhcp-interface eth0
         }
     }
 }
 service {
     ssh {
     }
 }
 system {
     config-management {
         commit-revisions 100
     }
     conntrack {
         modules {
             ftp
             h323
             nfs
             pptp
             sip
             sqlnet
             tftp
         }
     }
     console {
         device ttyS0 {
             speed 115200
         }
     }
     host-name vyos
     login {
         user vyos {
             authentication {
                 encrypted-password ...
                 plaintext-password ""
             }
         }
     }
     ntp {
         server time1.vyos.net {
         }
         server time2.vyos.net {
         }
         server time3.vyos.net {
         }
     }
     syslog {
         global {
             facility all {
                 level info
             }
             facility protocols {
                 level debug
             }
         }
     }
 }

When eth0 is up:

vyos@vyos:~$ show wan-load-balance 

Interface:  eth0
  Status:  active
  Last Status Change:  Fri Nov 11 20:19:06 2022
  +Test:  ping  Target: 1.1.1.1
    Last Interface Success:  0s
    Last Interface Failure:  21s
    # Interface Failure(s):  0

Interface:  wwan0
  Status:  active
  Last Status Change:  Fri Nov 11 20:17:08 2022
  +Test:  ping  Target: 8.8.8.8
    Last Interface Success:  0s
    Last Interface Failure:  n/a
    # Interface Failure(s):  0

vyos@vyos:~$ show ip route

Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup

S   0.0.0.0/0 [250/0] is directly connected, wwan0, weight 1, 00:05:07
S>* 0.0.0.0/0 [1/0] via 192.168.17.1, eth0, weight 1, 00:05:33
S   0.0.0.0/0 [210/0] via 192.168.17.1, eth0, weight 1, 00:05:33
S>* 1.1.1.1/32 [1/0] via 192.168.17.1, eth0, weight 1, 00:03:53
S>* 8.8.8.8/32 [1/0] is directly connected, wwan0, weight 1, 00:03:53
C>* 167.20.XXX.XXX/32 is directly connected, wwan0, 00:05:08
C>* 192.168.2.0/24 is directly connected, eth1, 00:05:34
C>* 192.168.17.0/24 is directly connected, eth0, 00:05:34

When eth0 is link-down:

vyos@vyos:~$ show wan-load-balance 

Interface:  eth0
  Status:  failed
  Last Status Change:  Fri Nov 11 20:19:47 2022
  -Test:  ping  Target: 1.1.1.1
    Last Interface Success:  11s
    Last Interface Failure:  0s
    # Interface Failure(s):  1

Interface:  wwan0
  Status:  active
  Last Status Change:  Fri Nov 11 20:17:08 2022
  +Test:  ping  Target: 8.8.8.8
    Last Interface Success:  0s
    Last Interface Failure:  n/a
    # Interface Failure(s):  0

vyos@vyos:~$ show ip route

Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup

S>* 0.0.0.0/0 [250/0] is directly connected, wwan0, weight 1, 00:05:36
S>* 8.8.8.8/32 [1/0] is directly connected, wwan0, weight 1, 00:04:22
C>* 167.20.XXX.XXX/32 is directly connected, wwan0, 00:05:37
C>* 192.168.2.0/24 is directly connected, eth1, 00:06:03

When eth0 is soft down (no upstream connection):

vyos@vyos:~$ show wan-load-balance 

Interface:  eth0
  Status:  failed
  Last Status Change:  Fri Nov 11 20:20:31 2022
  -Test:  ping  Target: 1.1.1.1
    Last Interface Success:  12s
    Last Interface Failure:  1s
    # Interface Failure(s):  1

Interface:  wwan0
  Status:  active
  Last Status Change:  Fri Nov 11 20:17:08 2022
  +Test:  ping  Target: 8.8.8.8
    Last Interface Success:  1s
    Last Interface Failure:  n/a
    # Interface Failure(s):  0

vyos@vyos:~$ show ip route

Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup

S>* 0.0.0.0/0 [1/0] via 192.168.17.1, eth0, weight 1, 00:00:30
S   0.0.0.0/0 [210/0] via 192.168.17.1, eth0, weight 1, 00:00:31
S   0.0.0.0/0 [250/0] is directly connected, wwan0, weight 1, 00:06:21
S>* 1.1.1.1/32 [1/0] via 192.168.17.1, eth0, weight 1, 00:00:30
S>* 8.8.8.8/32 [1/0] is directly connected, wwan0, weight 1, 00:05:07
C>* 167.20.XXX.XXX/32 is directly connected, wwan0, 00:06:22
C>* 192.168.2.0/24 is directly connected, eth1, 00:06:48
C>* 192.168.17.0/24 is directly connected, eth0, 00:00:31

level410 · November 12, 2022, 3:14am

Good news: I got the load balancing to work. Bad news: about 15-30 seconds after traffic fails over to wwan0, it stops passing traffic and ping reports this error: ping: sendmsg: No buffer space available. I’m baffled. Any ideas?

This was from a continuous ping (ping 1.1.1.1 interface wwan0) that was running while I failed over.

Full config:

interfaces {
    ethernet eth0 {
        address dhcp
        hw-id 00:0c:29:03:37:de
    }
    ethernet eth1 {
        address 192.168.99.1/24
        hw-id 00:0c:29:03:37:e8
    }
    loopback lo {
    }
    wwan wwan0 {
        address 167.20.XXX.XXX/27
        apn b2b.static
    }
}
load-balancing {
    wan {
        flush-connections
        interface-health eth0 {
            failure-count 2
            nexthop dhcp
            success-count 1
            test 1 {
                resp-time 1
                target 8.8.4.4
                ttl-limit 1
                type ping
            }
        }
        interface-health wwan0 {
            failure-count 2
            nexthop 167.20.XXX.XXX
            success-count 1
            test 1 {
                resp-time 5
                target 8.8.8.8
                ttl-limit 1
                type ping
            }
        }
        rule 1 {
            failover
            inbound-interface eth1
            interface eth0 {
                weight 100
            }
            interface wwan0 {
                weight 1
            }
            protocol all
        }
    }
}
protocols {
    static {
        route 8.8.4.4/32 {
            dhcp-interface eth0
        }
        route 8.8.8.8/32 {
            next-hop 167.20.XXX.XXX {
            }
        }
    }
}
service {
    dhcp-server {
        shared-network-name internal {
            subnet 192.168.99.0/24 {
                default-router 192.168.99.1
                name-server 1.1.1.1
                range 0 {
                    start 192.168.99.10
                    stop 192.168.99.20
                }
            }
        }
    }
    ssh {
    }
}
system {
    config-management {
        commit-revisions 100
    }
    conntrack {
        modules {
            ftp
            h323
            nfs
            pptp
            sip
            sqlnet
            tftp
        }
    }
    console {
        device ttyS0 {
            speed 115200
        }
    }
    host-name vyos
    login {
        user vyos {
            authentication {
                encrypted-password ...
                plaintext-password ""
            }
        }
    }
    ntp {
        server time1.vyos.net {
        }
        server time2.vyos.net {
        }
        server time3.vyos.net {
        }
    }
    syslog {
        global {
            facility all {
                level info
            }
            facility protocols {
                level debug
            }
        }
    }
}

zsdc · November 12, 2022, 3:38pm

Just in case, a reminder: load balancing does not apply to local traffic originating from the router, so you need to test it from an external device behind the router. Alternatively, you can enable local traffic balancing, but I would not recommend this for several reasons, first of all unpredictable management connection behavior:

set load-balancing wan enable-local-traffic

Also, you cannot verify that load balancing works by looking at show ip route. Traffic that is actually balanced does not use the main routing table, so you will not see the real routes there.
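
A minimal sketch of how to look at the policy routing that balanced traffic actually uses, run from the router's shell. The ip commands are standard iproute2 (they may need sudo); show wan-load-balance status is listed in the 1.3 documentation, and its availability and output on other releases is an assumption:

```
# Rules that decide which routing table a packet is looked up in
ip rule show
# Routes in every table, including any non-main tables
ip route show table all
# Packet-marking rules installed by the load balancer
show wan-load-balance status
```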

level410 · November 12, 2022, 4:23pm

Understood. I was doing all the testing from an ubuntu machine connected to eth1.

At the same time, I ran the ping on vyos out the wwan0 interface to illustrate that the wwan0 stuff seems to “crash” for lack of a better term. Shortly (30-60 seconds) after failing over, neither the connected devices on eth1 nor the vyos router itself can push traffic out wwan0.

I don’t know what the ping: sendmsg: No buffer space available message indicates, but I assume it is related?

Would LOVE to get this working, but I’m at the limit of my ability to troubleshoot. Ideas greatly appreciated.

16again · November 13, 2022, 11:26am

I don’t like that route; it might only work if the next hop does proxy ARP. There should be a hard-coded gateway on this interface, so use that as the next hop in the route.

Also, the /32 routes look fine, but what happens when the interface is down? If the WAN1 link is down and the health check for WAN1 ends up going out WAN2, the LB logic might wrongly assume WAN1 is still alive.
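
One common way to guard against that, sketched here as an assumption rather than a tested fix for this setup, is to add a high-distance blackhole route for each probe target. If the interface-specific /32 route is withdrawn, probes to that target are then dropped instead of following the default route out the other WAN. The targets below are the ones used in the config earlier in the thread:

```
set protocols static route 1.1.1.1/32 blackhole distance 254
set protocols static route 8.8.8.8/32 blackhole distance 254
```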

marc_s · November 13, 2022, 3:36pm

There are in fact quite a few problems with WAN load balancing, see T4443 Wan Load Balancing Multiple Regressions. Have you seen that?

level410 · November 13, 2022, 4:13pm

I tried it with an actual gateway/next-hop and the behavior is unfortunately the same. The status reported by show wan-load-balance for the respective links is always correct, i.e. it detects the correct status of the links in all cases; the traffic failover just doesn’t work as expected.

That’s a big bummer. It looks like these issues have been open since May, so I guess this means a quick fix is unlikely.

marc_s · November 14, 2022, 6:47am

Local traffic - that is, “traffic generated on the router” - won’t use the policy-based routing tables but will just go out the default route. The default route only changes when an interface actually goes down (as opposed to the upstream failure you simulated). Setting enable-local-traffic doesn’t seem to have any effect (see the bugs I mentioned).

I have a similar challenge. I run NextDNS forwarder locally on the router. When all the LAN-side traffic is correctly routed but local traffic is excluded, that effectively means the clients haven’t got working DNS.

level410 · November 17, 2022, 2:53am

Yes, I’m following re: local traffic. However, the lan-side failover doesn’t happen in the scenarios I showed above either. Something is definitely wrong considering the “no buffer space” messages. It’s especially strange because when this happens, the wwan0 interface gets totally hosed and requires a reboot to pass traffic again.

pepe · November 19, 2022, 4:57pm

level410:

load-balancing {
    wan {
        flush-connections
        interface-health eth0 {
            nexthop 1.1.1.1
        }
        interface-health wwan0 {
            nexthop 8.8.8.8
        }
        rule 1 {
            failover
            inbound-interface eth1
            interface eth0 {
                weight 100
            }
            interface wwan0 {
                weight 1
            }
        }
    }
}

You configured all interfaces as failover. One should be set as the primary interface. Try changing it to:

        rule 1 {
            inbound-interface eth1
            interface eth0 {
            }
        }
        rule 2 {
            failover
            inbound-interface eth1
            interface wwan0 {
            }
        }

Documentation: WAN load balancing — VyOS 1.3.x (equuleus) documentation
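
As a sketch, moving from the original single-rule config to this two-rule layout with set commands might look like the following. The delete paths assume the running config still matches the original rule 1 quoted above:

```
configure
delete load-balancing wan rule 1 failover
delete load-balancing wan rule 1 interface wwan0
delete load-balancing wan rule 1 interface eth0 weight
set load-balancing wan rule 2 failover
set load-balancing wan rule 2 inbound-interface eth1
set load-balancing wan rule 2 interface wwan0
commit
save
```

Rule 1 keeps its existing inbound-interface eth1 and interface eth0 entries, so only rule 2 has to be created.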

level410 · November 22, 2022, 2:06pm

Oh you magnificent human being. You have no idea how long I have been trying to make this work - that did it! I totally misread that section and thought both interfaces went under the same rule.

PM me your paypal/venmo/whatever, I’m buying you some beer. Seriously.

Thank you!