VRRP no-preempt: one interface not staying in BACKUP after reboot

patient0 · January 23, 2024, 11:18am

Hi,

This is my second go with VyOS after I couldn’t get my head around it a few years back. Now it’s different and I love it, no idea what changed.

I read through a few forum threads with related titles but found them to be either older or not matching. If you guys disagree, feel free to merge it.

On a Proxmox server that I use purely as a lab server, I set up two VyOS 1.3.5 VMs in a VRRP configuration. VRRP is a new topic for me; I thought I understood the idea, but I’m not sure anymore.

Since it’s a new topic for me, I based it on the High Availability Walkthrough but without OSPF and BGP.

The issue, as explained in another thread, is that the router with the higher prio doesn’t stay in BACKUP mode after a reboot even though ‘no-preempt’ is set, but it happens only on the WAN interface.
The difference is that the other thread and the keepalived docs mention that this affects bond interfaces. In my case I don’t use bonds but Linux bridges, and I don’t see why only one of them is affected.

Setting startup-delay (I tried 10, 60 and 120) didn’t help, it only delayed things :). The log on router2 shows (as in the other thread) that router1 announces itself as MASTER after reboot. Right now I work around it by disabling vrrp group public and enabling it again right away.
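
For reference, the workaround looks roughly like this in configuration mode on router1. This is only a sketch: `public` is the group name from the config further down, and the two separate commits are my assumption about how to make keepalived actually drop and re-create the instance.

```text
# Temporarily take the group out of service ...
set high-availability vrrp group public disable
commit
# ... and bring it straight back; router1 then comes up as BACKUP
delete high-availability vrrp group public disable
commit
```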

On router1 I did switch to VyOS 1.4 and 1.5 with the same config but it didn’t make any difference.

The rough data:

router1 eth0/WAN: 10.101.102.9
router1 eth1/LAN: 192.168.1.2

router2 eth0/WAN: 10.101.102.10
router2 eth1/LAN: 192.168.1.3

vrrp WAN IP: 10.101.102.8
vrrp LAN IP: 192.168.1.1

WAN gateway: 10.101.102.1

eth0 -> vmbr0 (WAN) on Proxmox | vmbr0 ip set to 10.101.102.1, VLAN aware
eth1 -> vmbr1 (LAN) on Proxmox | vmbr1 no ip set, VLAN aware

The VRRP config

router1:
 vyos@vyos-lts-2nd# show high-availability vrrp
 global-parameters {
     startup-delay 10
 }
 group int {
     hello-source-address 192.168.1.2
     interface eth1
     no-preempt
     peer-address 192.168.1.3
     priority 200
     virtual-address 192.168.1.1/24 {
     }
     vrid 1
 }
 group public {
     hello-source-address 10.101.102.9
     interface eth0
     no-preempt
     peer-address 10.101.102.10
     preempt-delay 60
     priority 200
     virtual-address 10.101.102.8/24 {
     }
     vrid 102
 }
 sync-group sync {
     member int
 }

router2:
 vyos@vyos-lts-2nd# show high-availability vrrp
 global-parameters {
     startup-delay 10
 }
 group int {
     hello-source-address 192.168.1.3
     interface eth1
     no-preempt
     peer-address 192.168.1.2
     priority 100
     virtual-address 192.168.1.1/24 {
     }
     vrid 1
 }
 group public {
     hello-source-address 10.101.102.10
     interface eth0
     no-preempt
     peer-address 10.101.102.9
     priority 100
     virtual-address 10.101.102.8/24 {
     }
     vrid 102
 }
 sync-group sync {
     member int
 }

After rebooting router1:

vyos@vyos-lts:~$ show vrrp
Name    Interface      VRID  State      Priority  Last Transition  
------  -----------  ------  -------  ----------  -----------------
int     eth1              1  BACKUP          200  1m56s
public  eth0            102  MASTER          200  1m44s

Adding: if I put the public vrrp group into the sync-group sync too then it does work, public stays in BACKUP mode after a reboot. In my understanding that should not be necessary though.
2nd Addition: I changed the IPs for router1, router2 and the vrrp IP since I wanted to see if I can add another router or two to the VRRP cluster.
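
In configuration mode, adding the public group to the sync-group is roughly the following. It is a sketch using the group names from the config above; I am assuming it gets applied on both routers so the configs stay symmetric.

```text
# Make 'public' follow the same state transitions as 'int'
set high-availability vrrp sync-group sync member public
commit
```

With both groups in the sync-group, a state change of one member is applied to the other, which would explain why public then follows int into BACKUP.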

L0crian · January 23, 2024, 2:25pm

I couldn’t recreate your issue on either 1.3.5 or 1.4.0-rc1. Can you provide the contents of this file from both of your routers?
/run/keepalived/keepalived.conf

patient0 · January 23, 2024, 2:50pm

Hi @L0crian

Thanks for looking into it! Are you running VyOS on Proxmox linux bridges too?
And here are the outputs of the two keepalived.conf’s:

router1 keepalived.conf

$ cat vyos-1_keepalived.conf
# Autogenerated by VyOS
# Do not edit this file, all your changes will be lost
# on next commit or reboot

global_defs {
    dynamic_interfaces
    script_user root
    vrrp_startup_delay 10


    notify_fifo /run/keepalived/keepalived_notify_fifo
    notify_fifo_script /usr/libexec/vyos/system/keepalived-fifo.py
}

vrrp_instance int {
    state BACKUP
    interface eth1
    virtual_router_id 1
    priority 200
    advert_int 1


    nopreempt
    unicast_peer { 192.168.1.3 }
    unicast_src_ip 192.168.1.2
    virtual_ipaddress {
        192.168.1.1/24
    }
}
vrrp_instance public {
    state BACKUP
    interface eth0
    virtual_router_id 102
    priority 200
    advert_int 1


    nopreempt
    unicast_peer { 10.101.102.10 }
    unicast_src_ip 10.101.102.9
    virtual_ipaddress {
        10.101.102.8/24
    }
}

vrrp_sync_group sync {
    group {
        int
    }

    notify_master "/usr/libexec/vyos/vyos-vrrp-conntracksync.sh master sync"
    notify_backup "/usr/libexec/vyos/vyos-vrrp-conntracksync.sh backup sync"
    notify_fault "/usr/libexec/vyos/vyos-vrrp-conntracksync.sh fault sync"
}

router2 keepalived.conf

$ cat vyos-2_keepalived.conf
# Autogenerated by VyOS
# Do not edit this file, all your changes will be lost
# on next commit or reboot

global_defs {
    dynamic_interfaces
    script_user root
    vrrp_startup_delay 10


    notify_fifo /run/keepalived/keepalived_notify_fifo
    notify_fifo_script /usr/libexec/vyos/system/keepalived-fifo.py
}

vrrp_instance int {
    state BACKUP
    interface eth1
    virtual_router_id 1
    priority 100
    advert_int 1


    nopreempt
    unicast_peer { 192.168.1.2 }
    unicast_src_ip 192.168.1.3
    virtual_ipaddress {
        192.168.1.1/24
    }
}
vrrp_instance public {
    state BACKUP
    interface eth0
    virtual_router_id 102
    priority 100
    advert_int 1


    nopreempt
    unicast_peer { 10.101.102.9 }
    unicast_src_ip 10.101.102.10
    virtual_ipaddress {
        10.101.102.8/24
    }
}

vrrp_sync_group sync {
    group {
        int
    }

    notify_master "/usr/libexec/vyos/vyos-vrrp-conntracksync.sh master sync"
    notify_backup "/usr/libexec/vyos/vyos-vrrp-conntracksync.sh backup sync"
    notify_fault "/usr/libexec/vyos/vyos-vrrp-conntracksync.sh fault sync"
}

Edit: I compiled the images I run/ran myself, since I only have access to rolling.
2nd Edit: I changed the IPs (vrrp IP now 192.168.1.1, router1 IP 192.168.1.2, router2 IP 192.168.1.3), which of course doesn’t make it easier for you guys to check, sorry.

L0crian · January 23, 2024, 3:06pm

I am just running them in my lab environment (GNS3). I was thinking maybe there was an order of operations thing going on where you didn’t have nopreempt initially, but that doesn’t appear to be the case. Can you double check you don’t accidentally have router2’s IP as a secondary WAN IP on router1? That could cause it as well.

patient0 · January 23, 2024, 3:20pm

Mmmh, I don’t think that should be the issue.

But here are the interface configs:

router1 interface config

[email protected]# show
 ethernet eth0 {
     address 10.101.102.9/24
     firewall {
         in {
             name OUTSIDE-IN
         }
         local {
             name OUTSIDE-LOCAL
         }
     }
     hw-id bc:24:11:a2:7e:dc
 }
 ethernet eth1 {
     address 192.168.1.2/24
     hw-id bc:24:11:12:81:99
 }
 loopback lo {
 }
[edit interfaces]

router2 interface config

router2:
[email protected]# show
 ethernet eth0 {
     address 10.101.102.10/24
     firewall {
         in {
             name OUTSIDE-IN
         }
         local {
             name OUTSIDE-LOCAL
         }
     }
     hw-id bc:24:11:42:97:4c
 }
 ethernet eth1 {
     address 192.168.1.3/24
     hw-id bc:24:11:c8:82:f9
 }
 loopback lo {
 }
[edit interfaces]

In the keepalived log on router2, the change of its mode to BACKUP looks like it happens the correct way. Only, router2 shouldn’t receive that message at all, I thought.

keepalived on router2 gets demoted to BACKUP

...
Jan 23 15:03:05 vyos-1 qemu-ga: info: guest-ping called
Jan 23 15:03:40 vyos-1 kernel: [   86.144377] IPv4: martian source 10.101.102.8 from 10.101.102.8, on dev eth0
Jan 23 15:03:40 vyos-1 kernel: [   86.144386] ll header: 00000000: ff ff ff ff ff ff bc 24 11 a2 7e dc 08 06
Jan 23 15:03:40 vyos-1 kernel: [   86.144388] IPv4: martian source 10.101.102.8 from 10.101.102.8, on dev eth0
Jan 23 15:03:40 vyos-1 kernel: [   86.144389] ll header: 00000000: ff ff ff ff ff ff bc 24 11 a2 7e dc 08 06
Jan 23 15:03:40 vyos-1 kernel: [   86.144389] IPv4: martian source 10.101.102.8 from 10.101.102.8, on dev eth0
Jan 23 15:03:40 vyos-1 kernel: [   86.144390] ll header: 00000000: ff ff ff ff ff ff bc 24 11 a2 7e dc 08 06
Jan 23 15:03:40 vyos-1 kernel: [   86.144391] IPv4: martian source 10.101.102.8 from 10.101.102.8, on dev eth0
Jan 23 15:03:40 vyos-1 kernel: [   86.144392] ll header: 00000000: ff ff ff ff ff ff bc 24 11 a2 7e dc 08 06
Jan 23 15:03:40 vyos-1 kernel: [   86.144393] IPv4: martian source 10.101.102.8 from 10.101.102.8, on dev eth0
Jan 23 15:03:40 vyos-1 kernel: [   86.144393] ll header: 00000000: ff ff ff ff ff ff bc 24 11 a2 7e dc 08 06
Jan 23 15:03:37 vyos-1 qemu-ga: message repeated 3 times: [ info: guest-ping called]
Jan 23 15:03:40 vyos-1 Keepalived_vrrp[2062]: (public) Master received advert from 10.101.102.9 with higher priority 200, ours 100
Jan 23 15:03:40 vyos-1 Keepalived_vrrp[2062]: (public) Entering BACKUP STATE
Jan 23 15:03:40 vyos-1 keepalived-fifo.py: INSTANCE public changed state to BACKUP
Jan 23 15:03:42 vyos-1 ntpd[2103]: Deleting interface #10 eth0, 10.101.102.8#123, interface stats: received=0, sent=0, dropped=0, active_time=42 secs
...

L0crian · January 23, 2024, 3:25pm

Yeah, that looks fine. Do you receive information from router2 on router1 when that happens?

Should also mention to check the firewall on that interface and make sure you’re not blocking the VRRP packets from router2 to router1.
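
As a rough sketch of what I mean, assuming the 1.3-style firewall syntax and the OUTSIDE-LOCAL ruleset from your interface config: something like this on router1 would let router2's adverts in. Rule number 10 is just a placeholder, and newer releases restructured the firewall CLI, so the exact path differs there.

```text
# Hypothetical rule number; it has to sit ahead of any rule that drops the traffic
set firewall name OUTSIDE-LOCAL rule 10 action accept
set firewall name OUTSIDE-LOCAL rule 10 protocol vrrp
set firewall name OUTSIDE-LOCAL rule 10 source address 10.101.102.10
commit
```

router2 would need the mirror image with source address 10.101.102.9.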

patient0 · January 23, 2024, 3:41pm

No, it doesn’t look like it. Should router1 get info from router2? The log at 15:03:40 has only one line from keepalived:

Jan 23 15:03:40 vyos-1 keepalived-fifo.py: INSTANCE public changed state to MASTER

L0crian · January 23, 2024, 3:49pm

Yes, you might have a unidirectional issue then. router2 is receiving messages from router1, allowing it to drop to backup, but router 1 never receives from router 2, forcing it to always be master.

First thing to check would be the firewall, but after that make sure the traffic is being allowed through proxmox.
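
One way to see which direction is broken, assuming tcpdump is available from the VyOS shell, is to watch for VRRP (IP protocol 112) on the WAN interface of router1 while router2 is MASTER:

```text
# Run on router1; expect adverts sourced from 10.101.102.10 about once per second
sudo tcpdump -n -i eth0 'ip proto 112'
```

As far as I know the capture sees inbound packets before the firewall does, so if the adverts show up here but router1 still goes MASTER, the local ruleset is the prime suspect. If nothing shows up at all, look at the Proxmox side.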

patient0 · January 23, 2024, 4:12pm

Mmmh, I may have to increase the log level. Nothing in the logs but manually it works:

On router1: disable vrrp group ‘public’ and enable it again, then router2 is and stays MASTER.
On router2: disable vrrp group ‘public’, then router1 changes to MASTER; enabling it again changes nothing since router2 has the lower prio.

Communication between the hosts seems to be ok(-ish). They are on the same bridge on the same host, so basically a switch (is what I assume). To be sure, I had disabled the Proxmox firewall beforehand.

L0crian · January 23, 2024, 4:26pm

Gotcha! I rebooted left and right in my lab with both 1.3.5 and 1.4.0-rc1, and couldn’t recreate it. If it is a bug, it might be an order of operations thing, but your keepalived.conf file looked fine.

I guess I should ask, did you grab that output after the reboot when your failure condition is present? Or when it was functioning as expected?

patient0 · January 24, 2024, 6:18am

The output of the log file or the keepalived?

keepalived.conf was taken after the failure condition and the logs are from right when it happened (tail -f /var/log/messages | fgrep keepalive).

Edit: And I don’t seem to know how to set the right log level; setting facility all to debug was of no help.

patient0 · January 30, 2024, 7:32pm

@L0crian I removed the firewall and then it does work. The issue was - as so often - a layer 8 one.

Thanks for your help.

L0crian · January 30, 2024, 7:37pm

Nice, glad you were able to get it working! I was afraid if it was a bug, that it was going to be a crazy order of operations thing to recreate it.