Unexpected route leak (route-map not found) after bgpd was restarted by watchfrr

The following content was translated by ChatGPT, so it might be somewhat difficult to understand. Please bear with me.

Two months ago, I reported this bug (https://forum.vyos.io/t/unexpected-route-leak-route-map-not-found). However, at that time, my disk failed, and many logs were lost, so I couldn’t pinpoint the issue. Today, this problem reappeared, and I believe I may have identified its cause.

Here’s an overview of the issue:


Today, I discovered that I was sending the entire routing table to all my BGP neighbors. Fortunately, I had configured maximum-prefix-out 200, so only 200 routes were sent, preventing more severe consequences.

When I ran show bgp route-map AUTOGEN-PEER-OUT, it reported that the route-map was not found:


My per-peer filters (AUTOGEN-ASxxxx-PEER-OUT) call this missing route-map, so nothing was filtered and the full table was sent to my neighbors. Comparing frr.conf with the output of show bgp route-map, I noticed that the route-maps defined earlier in the file were loaded, but the later ones were missing. This suggests that FRR didn’t load the latter part of the route-maps. However, I haven’t touched the router in a long time, so why would the route-maps suddenly disappear?
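For anyone who wants to run the same comparison, here is a minimal sketch in Python. It assumes you have saved two text dumps: the contents of frr.conf and the running configuration as reported by FRR. The function names and the sample data (including AS64500 and the prefix-list name) are made up for illustration, and the parsing only looks at route-map definition lines and call statements.

```python
import re

# Matches route-map definition lines such as "route-map NAME permit 10".
ROUTE_MAP_DEF = re.compile(r"^route-map\s+(\S+)\s+(?:permit|deny)\s+\d+\s*$")
# Matches indented "call NAME" statements inside a route-map entry.
ROUTE_MAP_CALL = re.compile(r"^\s+call\s+(\S+)\s*$")


def defined_route_maps(config_text):
    """Return the set of route-map names defined in an FRR-style config dump."""
    names = set()
    for line in config_text.splitlines():
        match = ROUTE_MAP_DEF.match(line)
        if match:
            names.add(match.group(1))
    return names


def called_route_maps(config_text):
    """Return the set of route-map names referenced by "call" statements."""
    names = set()
    for line in config_text.splitlines():
        match = ROUTE_MAP_CALL.match(line)
        if match:
            names.add(match.group(1))
    return names


def audit_route_maps(saved_config, running_config):
    """Compare the saved config with the running config.

    not_loaded:     defined in the saved file but absent from the running config.
    dangling_calls: called in the running config but not defined there.
    """
    saved = defined_route_maps(saved_config)
    running = defined_route_maps(running_config)
    return {
        "not_loaded": sorted(saved - running),
        "dangling_calls": sorted(called_route_maps(running_config) - running),
    }


# Hypothetical sample data, not taken from my router.
SAVED = """\
route-map AUTOGEN-AS64500-PEER-OUT permit 10
 call AUTOGEN-PEER-OUT
exit
!
route-map AUTOGEN-PEER-OUT permit 10
 match ip address prefix-list MY-PREFIXES
exit
"""
RUNNING = """\
route-map AUTOGEN-AS64500-PEER-OUT permit 10
 call AUTOGEN-PEER-OUT
exit
"""

print(audit_route_maps(SAVED, RUNNING))
```

With this sample, AUTOGEN-PEER-OUT shows up in both lists: it exists in the saved file, is missing from the running state, and is still being called by a per-peer filter.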

After checking the logs, I found that at some point, watchfrr decided to restart bgpd because it was unresponsive:

Nov 21 15:27:23 core-mci-us watchfrr[2289]: [T58XM-TP956][EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
Nov 21 15:27:23 core-mci-us watchfrr[2289]: [YFT0P-5Q5YX] Forked background command [pid 233455]: /usr/lib/frr/watchfrr.sh restart bgpd
Nov 21 15:27:25 core-mci-us bgpd[2317]: [ZW1GY-R46JE] Terminating on signal

After watchfrr restarted bgpd, bgpd took a long time to come back (perhaps because the configuration is so long) and lost some route-maps in the process.

Here’s the likely sequence of events: my upstream sent me a large number (~700k) of route updates. Because I had enabled FRR’s SNMP support and the CPU is relatively old, bgpd stopped responding to watchfrr for at least 90 seconds. This triggered watchfrr to restart bgpd, and some route-maps were lost during the restart.

Nov 21 15:25:36 core-mci-us bgpd[2317]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x7faaa05595d0 arg=0x55c352afa580 timer  r=-80.327    (bgp_generate_updgrp_packets)() (&connection->t_generate_updgrp_packets) from ../bgpd/bgp_io.c:152} was scheduled to pop greater than 4s ago
Nov 21 15:25:36 core-mci-us bgpd[2317]: [RZ3YY-GPH41][EC 100663310] snmp[warning]: AgentX master agent failed to respond to ping.  Attempting to re-register.
Nov 21 15:25:36 core-mci-us bgpd[2317]: [QN9FK-3DQX7] snmp[info]: NET-SNMP version 5.9.4.pre2 AgentX subagent connected
Nov 21 15:25:36 core-mci-us bgpd[2317]: [QN9FK-3DQX7] snmp[info]: AgentX master disconnected us, reconnecting in 15
Nov 21 15:25:36 core-mci-us bgpd[2317]: [ZBSQQ-8BYDT][EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Nov 21 15:25:52 core-mci-us bgpd[2317]: [QN9FK-3DQX7] snmp[info]: NET-SNMP version 5.9.4.pre2 AgentX subagent connected
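To line the events up in order, a small script like the one below can pull the relevant lines out of the journal. This is only a sketch: the marker strings are copied from the log lines in this post, the labels are my own, and because syslog timestamps carry no year, the year parameter (2024 here) is an assumption.

```python
import re
from datetime import datetime

# Classic syslog layout: "Nov 21 15:27:23 host daemon[pid]: message".
LOG_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<daemon>[\w-]+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)

# Substrings taken from the log lines in this post, mapped to short labels.
MARKERS = {
    "Thread Starvation": "starvation",
    "state -> unresponsive": "unresponsive",
    "watchfrr.sh restart": "restart",
    "Terminating on signal": "terminated",
}


def timeline(lines, year=2024):
    """Return (timestamp, daemon, label) tuples for the lines that match a marker."""
    events = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        for needle, label in MARKERS.items():
            if needle in match["msg"]:
                ts = datetime.strptime(f"{year} {match['ts']}", "%Y %b %d %H:%M:%S")
                events.append((ts, match["daemon"], label))
                break
    # Sort by timestamp only; the sort is stable, so ties keep their input order.
    return sorted(events, key=lambda event: event[0])


SAMPLE = [
    "Nov 21 15:27:23 core-mci-us watchfrr[2289]: [T58XM-TP956][EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago",
    "Nov 21 15:27:23 core-mci-us watchfrr[2289]: [YFT0P-5Q5YX] Forked background command [pid 233455]: /usr/lib/frr/watchfrr.sh restart bgpd",
    "Nov 21 15:27:25 core-mci-us bgpd[2317]: [ZW1GY-R46JE] Terminating on signal",
]

for ts, daemon, label in timeline(SAMPLE):
    print(ts.isoformat(), daemon, label)
```

For the three watchfrr/bgpd lines quoted earlier, this yields unresponsive, restart, and terminated, in that order. Adding the Thread Starvation line from 15:25:36 puts it first, 107 seconds before watchfrr declared bgpd unresponsive.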

Here’s my VyOS version:

Version:          VyOS 1.5-rolling-202410010007
Release train:    current
Release flavor:   generic

Built by:         [email protected]
Built on:         Tue 01 Oct 2024 00:07 UTC
Build UUID:       819a0a01-f7dc-41a6-bb0c-0f366842dca2
Build commit ID:  a0deb45ac8367e

Architecture:     x86_64
Boot via:         installed image
System type:      KVM guest
Secure Boot:      n/a (BIOS)

Hardware vendor:  QEMU
Hardware model:   Standard PC (i440FX + PIIX, 1996)
Hardware S/N:
Hardware UUID:    a64c30a1-d185-413c-ac56-6a5937ccaff1

Copyright:        VyOS maintainers and contributors

My VyOS instance runs on a virtual machine with 8 cores of an E5-2690v4 CPU, which might be a bit outdated. The setup includes around 70 BGP sessions, each with long filters, resulting in an FRR configuration file of about 7,000 lines.

So I think this incident can perhaps be broken down into two issues:

  1. watchfrr restarted bgpd. Maybe we could find a way to prevent it from doing so.
  2. bgpd lost some route-maps after restarting. This issue might need to be reported to FRR.

Perhaps we could add an option to the VyOS configuration to set watchfrr -t, so that watchfrr does not restart bgpd just because it has not responded for 90 seconds. With the default of 90s, watchfrr might repeatedly restart bgpd, making it impossible for bgpd to run or for VyOS to start properly. As a temporary workaround, I currently execute watchfrr ignore bgpd during VyOS startup.