VRRP flapping impacting FRR daemons

We’ve seen VRRP flap intermittently, and the flaps occasionally cause problems for FRR’s OSPF or BGP daemons. @Dmitry and @Viacheslav have helped me with this over Slack, particularly with recovering a partially failed router without resorting to a reboot, and with being impacted by https://phabricator.vyos.net/T1698

I have managed to create a pair of VMs that exhibit the same VRRP flapping, so hopefully this provides a base for further diagnosis. I haven’t tried stripping out the rest of the config, but after a while the VRRP flap happens with this pair of configs (note that the eth2 interfaces on the two routers are connected to each other; the eth3 interfaces are not).

set high-availability vrrp group HA interface 'eth2.10'
set high-availability vrrp group HA priority '210'
set high-availability vrrp group HA virtual-address '10.10.0.1/24'
set high-availability vrrp group HA vrid '10'
set interfaces ethernet eth2 address '10.0.0.1/24'
set interfaces ethernet eth2 hw-id '08:00:27:29:f4:6f'
set interfaces ethernet eth2 vif 10 address '10.10.0.101/24'
set interfaces ethernet eth3 address '10.0.1.1/24'
set interfaces ethernet eth3 hw-id '08:00:27:4c:bc:e3'
set interfaces loopback lo address '10.1.0.1/32'
set protocols bgp 64512 address-family ipv4-unicast network 10.0.1.0/24
set protocols bgp 64512 neighbor 10.1.0.2 address-family ipv4-unicast nexthop-self
set protocols bgp 64512 neighbor 10.1.0.2 remote-as '64512'
set protocols bgp 64512 neighbor 10.1.0.2 timers holdtime '10'
set protocols bgp 64512 neighbor 10.1.0.2 timers keepalive '3'
set protocols bgp 64512 neighbor 10.1.0.2 update-source '10.1.0.1'
set protocols bgp 64512 parameters router-id '10.1.0.1'
set protocols ospf area 0 network '10.1.0.1/32'
set protocols ospf area 0 network '10.0.0.0/24'
set protocols ospf area 1 network '10.0.1.0/24'
set protocols ospf passive-interface 'eth3'
set protocols ospf passive-interface 'eth2.10'
set service ssh port '22'
set system config-management commit-revisions '100'
set system console device ttyS0 speed '9600'
set system host-name 'r1'
set system host-name 'vyos'
set system login user vyos authentication encrypted-password '$6$BmnMQJsuW0S/fQu$s2GE/sxDIYovh3XgbxGzp8mWSB8OnRE6npjN4pYpbhB3fZRbKP5WiuWzXhdrkcNJ0VMTXFna7mJro01ujz5.F.'
set system login user vyos authentication plaintext-password ''
set system login user vyos level 'admin'
set system ntp server 0.pool.ntp.org
set system ntp server 1.pool.ntp.org
set system ntp server 2.pool.ntp.org
set system syslog global facility all level 'info'
set system syslog global facility protocols level 'debug'

set high-availability vrrp group HA interface 'eth2.10'
set high-availability vrrp group HA priority '150'
set high-availability vrrp group HA virtual-address '10.10.0.1/24'
set high-availability vrrp group HA vrid '10'
set interfaces ethernet eth2 address '10.0.0.2/24'
set interfaces ethernet eth2 hw-id '08:00:27:6b:67:ec'
set interfaces ethernet eth2 vif 10 address '10.10.0.102/24'
set interfaces ethernet eth3 address '10.0.2.1/24'
set interfaces ethernet eth3 hw-id '08:00:27:b8:04:56'
set interfaces loopback lo address '10.1.0.2/32'
set protocols bgp 64512 address-family ipv4-unicast network 10.0.2.0/24
set protocols bgp 64512 neighbor 10.1.0.1 address-family ipv4-unicast nexthop-self
set protocols bgp 64512 neighbor 10.1.0.1 remote-as '64512'
set protocols bgp 64512 neighbor 10.1.0.1 timers holdtime '10'
set protocols bgp 64512 neighbor 10.1.0.1 timers keepalive '3'
set protocols bgp 64512 neighbor 10.1.0.1 update-source '10.1.0.2'
set protocols bgp 64512 parameters router-id '10.1.0.2'
set protocols ospf area 0 network '10.1.0.2/32'
set protocols ospf area 0 network '10.0.0.0/24'
set protocols ospf area 2 network '10.0.2.0/24'
set protocols ospf passive-interface 'eth3'
set protocols ospf passive-interface 'eth2.10'
set service ssh port '22'
set system config-management commit-revisions '100'
set system console device ttyS0 speed '9600'
set system host-name 'r2'
set system host-name 'vyos'
set system login user vyos authentication encrypted-password '$6$BmnMQJsuW0S/fQu$s2GE/sxDIYovh3XgbxGzp8mWSB8OnRE6npjN4pYpbhB3fZRbKP5WiuWzXhdrkcNJ0VMTXFna7mJro01ujz5.F.'
set system login user vyos authentication plaintext-password ''
set system login user vyos level 'admin'
set system ntp server 0.pool.ntp.org
set system ntp server 1.pool.ntp.org
set system ntp server 2.pool.ntp.org
set system syslog global facility all level 'info'
set system syslog global facility protocols level 'debug'

Username/password for both is vyos/vyos.

Once that’s been running for a while, this turns up in sudo journalctl | grep " 15:38" | grep -v systemd | grep -v agetty:

Mar 19 15:38:13 r1 kernel: IPv4: martian source 10.10.0.1 from 10.10.0.1, on dev eth2.10
Mar 19 15:38:13 r1 kernel: ll header: 00000000: ff ff ff ff ff ff 08 00 27 6b 67 ec 08 06        ........'kg...
Mar 19 15:38:13 r1 kernel: IPv4: martian source 10.10.0.1 from 10.10.0.1, on dev eth2.10
Mar 19 15:38:13 r1 kernel: ll header: 00000000: ff ff ff ff ff ff 08 00 27 6b 67 ec 08 06        ........'kg...
Mar 19 15:38:13 r1 kernel: IPv4: martian source 10.10.0.1 from 10.10.0.1, on dev eth2.10
Mar 19 15:38:13 r1 kernel: ll header: 00000000: ff ff ff ff ff ff 08 00 27 6b 67 ec 08 06        ........'kg...
Mar 19 15:38:13 r1 kernel: IPv4: martian source 10.10.0.1 from 10.10.0.1, on dev eth2.10
Mar 19 15:38:13 r1 kernel: ll header: 00000000: ff ff ff ff ff ff 08 00 27 6b 67 ec 08 06        ........'kg...
Mar 19 15:38:14 r1 kernel: IPv4: martian source 10.10.0.1 from 10.10.0.1, on dev eth2.10
Mar 19 15:38:14 r1 kernel: ll header: 00000000: ff ff ff ff ff ff 08 00 27 6b 67 ec 08 06        ........'kg...
Mar 19 15:38:14 r1 Keepalived_vrrp[12750]: (HA) Received advert from 10.10.0.102 with lower priority 150, ours 210, forcing new election
Mar 19 15:38:16 r1 bgpd[1027]: %NOTIFICATION: rcvd End-of-RIB for IPv4 Unicast from 10.1.0.2 in vrf default
Mar 19 15:38:13 r2 Keepalived_vrrp[12674]: (HA) Entering MASTER STATE
Mar 19 15:38:13 r2 Keepalived_vrrp[12674]: (HA) Master received advert from 10.10.0.101 with higher priority 210, ours 150
Mar 19 15:38:13 r2 Keepalived_vrrp[12674]: (HA) Entering BACKUP STATE
Mar 19 15:38:14 r2 bgpd[1112]: [EC 33554451] bgp_process_packet: BGP OPEN receipt failed for peer: 10.1.0.1
Mar 19 15:38:16 r2 bgpd[1112]: %NOTIFICATION: rcvd End-of-RIB for IPv4 Unicast from 10.1.0.1 in vrf default
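
To make these episodes easier to spot in a longer journal, something like the following could be used. It is only a rough sketch: the regular expression is based on the keepalived lines quoted above, the names find_flaps and max_seconds are made up for this example, and a year is assumed because the syslog timestamps omit it.

```python
import re
from datetime import datetime

# Matches keepalived state-change lines in the form seen in the journal excerpt.
STATE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) "
    r"Keepalived_vrrp\[\d+\]: \((?P<group>[^)]+)\) Entering (?P<state>\w+) STATE"
)


def find_flaps(lines, max_seconds=2):
    """Return (host, group, entered, left) for MASTER episodes of at most max_seconds."""
    master_since = {}  # (host, group) -> time MASTER was entered
    flaps = []
    for line in lines:
        m = STATE_RE.match(line)
        if not m:
            continue
        # Assumption: the year is not in the log, so a fixed one is prepended.
        ts = datetime.strptime("2000 " + m["ts"], "%Y %b %d %H:%M:%S")
        key = (m["host"], m["group"])
        if m["state"] == "MASTER":
            master_since[key] = ts
        elif key in master_since:
            # Any other state (BACKUP, FAULT) ends the MASTER episode.
            start = master_since.pop(key)
            if (ts - start).total_seconds() <= max_seconds:
                flaps.append((key[0], key[1], start, ts))
    return flaps


# The three r2 lines from the excerpt above.
sample = [
    "Mar 19 15:38:13 r2 Keepalived_vrrp[12674]: (HA) Entering MASTER STATE",
    "Mar 19 15:38:13 r2 Keepalived_vrrp[12674]: (HA) Master received advert from 10.10.0.101 with higher priority 210, ours 150",
    "Mar 19 15:38:13 r2 Keepalived_vrrp[12674]: (HA) Entering BACKUP STATE",
]

flaps = find_flaps(sample)
for host, group, start, end in flaps:
    seconds = (end - start).total_seconds()
    print(f"{host} ({group}): MASTER for {seconds:.0f}s from {start:%H:%M:%S}")
```

Fed with the full journal instead of the three sample lines, this should list every short-lived MASTER episode, which can then be lined up against the bgpd errors.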

This still seems to be happening. Further diagnosis and packet capture show that, as expected, the priority 210 router (r1) issues an advertisement every second. These packets can be seen in tcpdump captures from both routers.

Sometimes the priority 150 router (r2) just decides to become master for a moment until it receives the next advertisement from r1.
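
For context on why this looks wrong: under VRRPv2 (RFC 3768) a backup should only take over once it has heard nothing from the master for Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time, where Skew_Time = (256 - Priority) / 256 seconds. The sketch below works that out for r2 and checks a list of advertisement arrival times for gaps that long. It assumes keepalived is running VRRPv2 with the 1 second interval seen in the captures; the function names and the example timestamps are made up for illustration and are not taken from a real capture.

```python
def master_down_interval(priority, advert_interval=1.0):
    """VRRPv2 (RFC 3768): 3 * Advertisement_Interval + Skew_Time, in seconds."""
    skew_time = (256 - priority) / 256
    return 3 * advert_interval + skew_time


def suspicious_gaps(advert_times, priority, advert_interval=1.0):
    """Return (previous, next) arrival pairs whose gap reaches the master-down interval."""
    limit = master_down_interval(priority, advert_interval)
    return [
        (prev, nxt)
        for prev, nxt in zip(advert_times, advert_times[1:])
        if nxt - prev >= limit
    ]


# r2 has priority 150, so it should wait roughly 3.4 seconds before taking over.
r2_limit = master_down_interval(150)
print(f"r2 master-down interval: {r2_limit:.3f}s")

# Made-up arrival times (seconds): only the 2.0 -> 6.0 gap would justify a takeover.
example_times = [0.0, 1.0, 2.0, 6.0]
print(suspicious_gaps(example_times, priority=150))
```

If the arrival times taken from a capture on r2 show no gap of that size around a flap, then r2 went to master while advertisements were still arriving on the wire.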

Happy to diagnose further, but right now it looks like keepalived is doing something it just shouldn’t.