WAN failover with DHCP

Here is the solution I found for dual-WAN failover when the main WAN gets its address via DHCP. I hope it is useful to others with similar requirements.

Configuration:

  • eth0 is the main WAN, getting its address over DHCP.
  • eth1.11 is the secondary WAN, connected to a 4G router. It has address 192.168.11.123/24, with 192.168.11.1 as gateway.

Goal:
Implement a simple failover mechanism (no load balancing needed) using the VyOS failover protocol, which does not by itself support DHCP interfaces.

Implementation:
The first step is to define the default routes through failover. For that to work, DHCP must not install its own default route. With my ISP, just setting the no-default-route option in the interface dhcp-options does not work, so I use the DHCP client hooks instead.
A pre-hook (e.g., /config/scripts/dhcp-client/pre-hooks.d/01-no-default-route) can prevent the default route from being installed. This is usually done by emptying the new_routers variable:

RUN="yes"

# Use FD 19 to capture the debug stream caused by "set -x":
exec 19>/tmp/01-no-default-route.log

# Tell bash about it (there's nothing special about 19, it's arbitrary)
export BASH_XTRACEFD=19

set -x

# Setting new_routers to an empty string prevents the installation
# of the default route and allows the failover rules to be set up properly.
# That applies only to eth0, the main WAN getting the IP via dhcp.
#
# See /config/scripts/setup-failover-routes.sh
# See /config/scripts/dhcp-client/post-hooks.d/01-failover
# See https://vyos.dev/T5724

if [ "$RUN" = "yes" ]; then
    if [ "$interface" = "eth0" ]; then
        case "$reason" in
            BOUND|RENEW|REBIND|REBOOT)
                export new_gw="$new_routers"
                export old_gw="$old_routers"
                new_routers=""
                ;;
        esac
    fi
fi

set +x

The new_routers and old_routers variables (set by DHCP) are exported as new_gw and old_gw, which are then used in the post-hook script (e.g., /config/scripts/dhcp-client/post-hooks.d/01-failover):

RUN="yes"

# Use FD 19 to capture the debug stream caused by "set -x":
exec 19>/tmp/01-failover.log

# Tell bash about it (there's nothing special about 19, it's arbitrary)
export BASH_XTRACEFD=19

set -x

# Execute the script to configure the failover mechanism in case of a
# BOUND, RENEW, REBIND, REBOOT.
# That applies only to eth0, the main WAN getting the IP via dhcp.
#
# See /config/scripts/setup-failover-routes.sh
# See /config/scripts/dhcp-client/pre-hooks.d/01-no-default-route
# See https://vyos.dev/T5724

if [ "$RUN" = "yes" ]; then
    if [ "$interface" = "eth0" ]; then
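        case "$reason" in
            BOUND|RENEW|REBIND|REBOOT)
                # Quote both arguments so that an empty old_gw is still passed
                # as $1 instead of letting new_gw shift into its place.
                sudo /config/scripts/setup-failover-routes.sh "$old_gw" "$new_gw"
                ;;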
        esac
    fi
fi

set +x
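
One detail worth checking in the call to setup-failover-routes.sh is how empty values are passed. If old_gw is empty (I would expect that when there is no previous lease, but that is an assumption), an unquoted expansion drops the empty word and the new gateway ends up as the first argument, while quoting keeps both positions. The Python sketch below only mimics shell tokenisation with shlex; it is not the hook itself, and the gateway address is a made-up documentation address.

```python
import shlex

def shell_words(command_line):
    """Tokenise a command line roughly the way a POSIX shell would."""
    return shlex.split(command_line)

old_gw = ""              # assumption: no previous lease, so nothing to pass
new_gw = "203.0.113.1"   # hypothetical gateway (documentation range)

# Unquoted: the empty value vanishes and new_gw becomes the first argument.
unquoted = shell_words(f"setup-failover-routes.sh {old_gw} {new_gw}")

# Quoted: the empty value survives as an empty first argument.
quoted = shell_words(f'setup-failover-routes.sh "{old_gw}" "{new_gw}"')

print(unquoted)
print(quoted)
```

In the quoted form the script still sees two arguments, so its first and second positional parameters keep their intended meaning.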

The post-hook script calls /config/scripts/setup-failover-routes.sh, which actually configures the failover:

#!/bin/vbash

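if [ "$(id -g -n)" != 'vyattacfg' ] ; then
    # Re-run under the vyattacfg group. The arguments are single-quoted so
    # that an empty old gateway is preserved as an empty first argument.
    exec sg vyattacfg -c "/bin/vbash $(readlink -f "$0") '$1' '$2'"
fi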

# Use FD 19 to capture the debug stream caused by "set -x":
exec 19>/tmp/failover.log

# Tell bash about it (there's nothing special about 19, it's arbitrary)
export BASH_XTRACEFD=19

set -x

OLD_GW="$1"
NEW_GW="$2"

set +x

source /opt/vyatta/etc/functions/script-template

configure

if [ ! -z "$NEW_GW" ]; then
    delete protocols failover route 0.0.0.0/0

    set protocols failover route 0.0.0.0/0 next-hop $NEW_GW check target '1.1.1.1'
    set protocols failover route 0.0.0.0/0 next-hop $NEW_GW check target '4.2.2.1'
    set protocols failover route 0.0.0.0/0 next-hop $NEW_GW check timeout '5'
    set protocols failover route 0.0.0.0/0 next-hop $NEW_GW check type 'icmp'
    set protocols failover route 0.0.0.0/0 next-hop $NEW_GW interface 'eth0'
    set protocols failover route 0.0.0.0/0 next-hop $NEW_GW metric '1'

    set protocols failover route 0.0.0.0/0 next-hop 192.168.11.1 check target '1.0.0.1'
    set protocols failover route 0.0.0.0/0 next-hop 192.168.11.1 check target '4.2.2.2'
    set protocols failover route 0.0.0.0/0 next-hop 192.168.11.1 check timeout '5'
    set protocols failover route 0.0.0.0/0 next-hop 192.168.11.1 check type 'icmp'
    set protocols failover route 0.0.0.0/0 next-hop 192.168.11.1 interface 'eth1.11'
    set protocols failover route 0.0.0.0/0 next-hop 192.168.11.1 metric '254'

    delete protocols static route 1.1.1.1/32
    delete protocols static route 4.2.2.1/32
    delete protocols static route 1.0.0.1/32
    delete protocols static route 4.2.2.2/32
    if [ ! -z "$OLD_GW" ]; then
        delete protocols static route $OLD_GW/32
    fi

    set protocols static route $NEW_GW/32 interface eth0
    set protocols static route 1.1.1.1/32 next-hop $NEW_GW interface 'eth0'
    set protocols static route 4.2.2.1/32 next-hop $NEW_GW interface 'eth0'
    set protocols static route 1.0.0.1/32 next-hop 192.168.11.1 interface 'eth1.11'
    set protocols static route 4.2.2.2/32 next-hop 192.168.11.1 interface 'eth1.11'
fi

commit
exit

That script sets the two default routes using the gateway addresses handed over by the DHCP pre-hook (new_gw and old_gw). The main WAN has the lower metric, so its default route is used when both WANs are up. Static routes are also defined for the check targets, so that each target is always probed through the WAN it is meant to test.
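
To make the effect of the two metrics concrete, here is a small conceptual model in Python. It is not how VyOS implements failover; it only assumes that, among the next-hops whose checks pass, the one with the lowest metric ends up carrying the traffic. The NextHop class, the select_default_route function and the 203.0.113.1 gateway are names and values invented for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NextHop:
    gateway: str
    interface: str
    metric: int
    healthy: bool  # stands in for the result of the ICMP checks

def select_default_route(next_hops):
    """Return the healthy next-hop with the lowest metric, or None."""
    candidates = [hop for hop in next_hops if hop.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda hop: hop.metric)

# Hypothetical state: both WANs pass their checks.
main = NextHop("203.0.113.1", "eth0", metric=1, healthy=True)
backup = NextHop("192.168.11.1", "eth1.11", metric=254, healthy=True)
both_up = select_default_route([main, backup])

# Hypothetical state: the main WAN fails its checks.
main_down = NextHop("203.0.113.1", "eth0", metric=1, healthy=False)
only_backup = select_default_route([main_down, backup])

print(both_up.interface, only_backup.interface)
```

With metric 1 on eth0 and 254 on eth1.11, the backup is only chosen when the main WAN is considered down.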

At this point the failover itself is done. In addition, I want to receive an e-mail notification when the route for one of the two WANs is added or removed, and I want to flush conntrack when the status of the main WAN changes. For that, I use the event handler:

 event CellWanAdded {
     filter {
         pattern "ip route add.*dev eth1\\.11 .*"
         syslog-identifier vyos-failover
     }
     script {
         environment ACTION {
             value added
         }
         environment INTERFACE {
             value eth1.11
         }
         path /config/scripts/failover-handler.py
     }
 }
 event CellWanRemoved {
     filter {
         pattern "ip route del.*dev eth1\\.11 .*"
         syslog-identifier vyos-failover
     }
     script {
         environment ACTION {
             value deleted
         }
         environment INTERFACE {
             value eth1.11
         }
         path /config/scripts/failover-handler.py
     }
 }
 event MainWanAdded {
     filter {
         pattern "ip route add.*dev eth0 .*"
         syslog-identifier vyos-failover
     }
     script {
         environment ACTION {
             value added
         }
         environment FLUSH_CONNTRACK {
             value true
         }
         environment INTERFACE {
             value eth0
         }
         environment RESTORE_IPV6 {
             value true
         }
         path /config/scripts/failover-handler.py
     }
 }
 event MainWanRemoved {
     filter {
         pattern "ip route del.*dev eth0 .*"
         syslog-identifier vyos-failover
     }
     script {
         environment ACTION {
             value deleted
         }
         environment FLUSH_CONNTRACK {
             value true
         }
         environment INTERFACE {
             value eth0
         }
         environment REJECT_IPV6 {
             value true
         }
         path /config/scripts/failover-handler.py
     }
 }
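
The filter patterns in the events above can be tried out offline with Python's re module. This is only a sketch: the sample log lines are invented (the real vyos-failover messages may be worded differently), I assume the doubled backslash in the configuration output is just escaping for a single one, and I use re.search without having verified which matching call the event handler uses.

```python
import re

# Patterns copied from the event filters, with the backslash un-doubled.
PATTERNS = {
    "MainWanAdded": r"ip route add.*dev eth0 .*",
    "MainWanRemoved": r"ip route del.*dev eth0 .*",
    "CellWanAdded": r"ip route add.*dev eth1\.11 .*",
    "CellWanRemoved": r"ip route del.*dev eth1\.11 .*",
}

def matching_events(message):
    """Return the names of all events whose pattern is found in message."""
    return [name for name, pattern in PATTERNS.items()
            if re.search(pattern, message)]

# Hypothetical syslog messages, for illustration only.
added_main = "ip route add 0.0.0.0/0 via 203.0.113.1 dev eth0 metric 1 proto failover"
removed_cell = "ip route del 0.0.0.0/0 via 192.168.11.1 dev eth1.11 metric 254 proto failover"
other_vlan = "ip route add 0.0.0.0/0 via 203.0.113.1 dev eth0.5 metric 1 proto failover"

print(matching_events(added_main))
print(matching_events(removed_cell))
print(matching_events(other_vlan))
```

The space after the interface name in each pattern matters: it keeps the eth0 events from also firing for a sub-interface such as eth0.5.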

A simple regex is used to identify the events corresponding to the add/remove actions for the two wans, and environment variables are used to trigger actions in the /config/scripts/failover-handler.py handler script:

#!/usr/bin/env python3

import smtplib, ssl
from os import environ
from sys import exit
from email.mime.text import MIMEText
from systemd import journal

from vyos.util import rc_cmd

def sendMail(interface_name, interface_action):
    port = 465  # For SSL
    smtp_server = "smtp.gmail.com"
    sender_email = "xxxxx@gmail.com"  # Enter your address
    receiver_email = "xxxxx@gmail.com"  # Enter receiver address
    password = "xxxxxxxxxxxxx"

    event_message = environ.get('message')

    body = f"""\
    eth0: main WAN
    eth1.11: back-up cellular wan

    Event message: {event_message}."""

    message = MIMEText(body)
    message['From'] = sender_email
    message['To'] = receiver_email
    message['Subject'] = f"VyOS failover: interface {interface_name}, action {interface_action}"

    try:
        context = ssl.create_default_context()
        with smtplib.SMTP_SSL(smtp_server, port, context=context) as server:
            server.login(sender_email, password)
            server.sendmail(sender_email, receiver_email, message.as_string())
    except Exception as err:
        journal.send(f'Error sending notification e-mail: {err}')

def flushConntrack():
    rc, command = rc_cmd('/usr/bin/sudo /usr/sbin/conntrack -F')
    if rc != 0:
        journal.send(f'{command} -- return-code [RC: {rc}]')
    else:
        journal.send('Flushed conntrack')

def rejectIPV6():
    rc, command = rc_cmd('/config/scripts/reject_ipv6.sh')
    if rc != 0:
        journal.send(f'{command} -- return-code [RC: {rc}]')
    else:
        journal.send('Default route for IPV6 is now rejected')

def restoreIPV6():
    rc, command = rc_cmd('/config/scripts/restore_ipv6.sh')
    if rc != 0:
        journal.send(f'{command} -- return-code [RC: {rc}]')
    else:
        journal.send('Default route for IPV6 is now restored')

if __name__ == '__main__':
    interface = environ.get('INTERFACE')
    action = environ.get('ACTION')
    flush_conntrack = environ.get('FLUSH_CONNTRACK')
    reject_ipv6 = environ.get('REJECT_IPV6')
    restore_ipv6 = environ.get('RESTORE_IPV6')

    if reject_ipv6 == 'true':
        rejectIPV6()
    elif restore_ipv6 == 'true':
        restoreIPV6()

    if flush_conntrack == 'true':
        flushConntrack()

    sendMail(interface, action)

    exit(0)
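
A small remark on the mail body: because the f-string in sendMail is indented inside the function, the body is sent with leading spaces on each line. One possible way around that, shown here as a sketch rather than as part of the handler, is to dedent a template once and fill it in afterwards. The build_notification function and the example.org addresses are hypothetical.

```python
from email.mime.text import MIMEText
from textwrap import dedent

# Dedent the template once; the placeholder is filled in later.
BODY_TEMPLATE = dedent("""\
    eth0: main WAN
    eth1.11: back-up cellular WAN

    Event message: {event_message}.""")

def build_notification(sender, receiver, interface_name, interface_action, event_message):
    """Build the notification e-mail without sending it."""
    message = MIMEText(BODY_TEMPLATE.format(event_message=event_message))
    message['From'] = sender
    message['To'] = receiver
    message['Subject'] = (
        f"VyOS failover: interface {interface_name}, action {interface_action}"
    )
    return message

# Hypothetical values, for illustration only.
mail = build_notification("router@example.org", "admin@example.org",
                          "eth0", "added", "sample event text")
print(mail['Subject'])
```

Separating the message construction from the SMTP call also makes it possible to check the text without contacting a mail server.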

My secondary WAN does not support IPv6, so the IPv6 default route is rejected or restored using the /config/scripts/reject_ipv6.sh and /config/scripts/restore_ipv6.sh scripts:

#!/bin/vbash

if [ "$(id -g -n)" != 'vyattacfg' ] ; then
    exec sg vyattacfg -c "/bin/vbash $(readlink -f $0) $@"
fi

set +x

source /opt/vyatta/etc/functions/script-template

configure

set protocols static route6 ::/0 reject

commit
exit

#!/bin/vbash

if [ "$(id -g -n)" != 'vyattacfg' ] ; then
    exec sg vyattacfg -c "/bin/vbash $(readlink -f $0) $@"
fi

set +x

source /opt/vyatta/etc/functions/script-template

configure

delete protocols static route6 ::/0 reject

commit
exit

I am aware that the solution is not completely generic. Still, it took me a while to get there, and I hope it is useful to somebody else.

Please, note that the two following fixes have to be applied in order for this solution to work:

Of course, any feedback is very welcome.


Hi @giuppo77 just wanted to say thank you and let you know that your scripts work like a charm. Finally I’ve been able to remove the clunky WAN load balancing. Now I’m able to use PBR as well (as PBR and WLB collide).

With graceful handling of dynamic nexthop peers (most notably DHCP, PPPoE) in the static failover routes, WLB might eventually be superseded/phased out.

So, I’ve taken the liberty to add a link to this thread in the following tickets:

Maybe someone with better Python coding skills than me will be able to make a generic implementation.


Glad my solution can be useful to somebody else.

I have recently updated to one of the latest rolling releases, and the import for rc_cmd now needs to be:

from vyos.utils.process import rc_cmd

Without that, the Python script fails.
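
If the same handler has to run on both older and newer images, one option is to try the module paths in order and use the first one that works. The helper below is a generic sketch using importlib; import_first is a name I made up, and the only VyOS-specific facts it relies on are the two module paths mentioned in this thread.

```python
from importlib import import_module

def import_first(candidates, attribute):
    """Return `attribute` from the first importable module in `candidates`."""
    for module_name in candidates:
        try:
            module = import_module(module_name)
        except ImportError:
            continue  # module not present on this release, try the next one
        if hasattr(module, attribute):
            return getattr(module, attribute)
    raise ImportError(f"{attribute} not found in any of: {', '.join(candidates)}")

# On the router this would be used as follows (commented out so the
# sketch also runs where the vyos package is not installed):
# rc_cmd = import_first(["vyos.utils.process", "vyos.util"], "rc_cmd")
```

The newer path is listed first so that current rolling releases do not go through a failed import on every event.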


I have actually modified the DHCP client pre-hook in order to deal with additional DHCP cases:

if [ "$RUN" = "yes" ]; then
    if [ "$interface" = "eth0" ]; then
        case "$reason" in
            BOUND|RENEW|REBIND|REBOOT)
                export new_gw="$new_routers"
                export old_gw="$old_routers"
                new_routers=""
                ;;

            EXPIRE|FAIL|STOP)
                old_ip_address=""
                old_routers=""
                ;;
        esac
    fi
fi

Indeed, a couple of times the DHCP renewal failed and the failover mechanism could not remove the route, because the DHCP client had already removed it.
Handling EXPIRE, FAIL and STOP improves that scenario.
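
For anyone who wants to reason about the two branches without touching the router, the hook logic can be modelled as a plain function over the DHCP variables. This is a sketch of the decision table only, not something the DHCP client runs; apply_pre_hook is a hypothetical name and the addresses are documentation values.

```python
LEASE_REASONS = {"BOUND", "RENEW", "REBIND", "REBOOT"}
LOSS_REASONS = {"EXPIRE", "FAIL", "STOP"}

def apply_pre_hook(env, wan_interface="eth0"):
    """Return a copy of the DHCP variables as the pre-hook would leave them."""
    env = dict(env)
    if env.get("interface") != wan_interface:
        return env  # the hook only acts on the main WAN
    reason = env.get("reason")
    if reason in LEASE_REASONS:
        # Keep the gateways for the post-hook, then hide them from the client.
        env["new_gw"] = env.get("new_routers", "")
        env["old_gw"] = env.get("old_routers", "")
        env["new_routers"] = ""
    elif reason in LOSS_REASONS:
        # Clear the old values so the client has nothing left to remove.
        env["old_ip_address"] = ""
        env["old_routers"] = ""
    return env

# Hypothetical inputs, for illustration only.
renewed = apply_pre_hook({"interface": "eth0", "reason": "RENEW",
                          "new_routers": "203.0.113.1",
                          "old_routers": "203.0.113.1"})
expired = apply_pre_hook({"interface": "eth0", "reason": "EXPIRE",
                          "old_ip_address": "203.0.113.50",
                          "old_routers": "203.0.113.1"})
print(renewed["new_gw"], repr(renewed["new_routers"]), repr(expired["old_routers"]))
```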
