The following content was translated by ChatGPT, so it might be somewhat difficult to understand. Please bear with me.
Two months ago, I reported this bug (https://forum.vyos.io/t/unexpected-route-leak-route-map-not-found). However, at that time, my disk failed, and many logs were lost, so I couldn’t pinpoint the issue. Today, this problem reappeared, and I believe I may have identified its cause.
Here’s an overview of the issue:
Today, I discovered that I was sending the entire routing table to all my BGP neighbors. Fortunately, I had configured maximum-prefix-out 200, so only 200 routes were sent, preventing more severe consequences.
When I ran show bgp route-map AUTOGEN-PEER-OUT, it reported that the route-map was not found.
My filters (AUTOGEN-ASxxxx-PEER-OUT) call this missing route-map, so nothing was filtered and the full table was sent to my neighbors. Comparing frr.conf with the output of show bgp route-map, I noticed that the route-maps defined earlier in the file were still present, but the later ones were missing. This suggests that FRR didn't load the latter part of the route-maps. However, I haven't touched the router in a long time, so why would the route-maps suddenly disappear?
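To spot this condition before it leaks routes, one option is to scan a configuration dump for route-maps that call a route-map which is not defined. The following is only a minimal sketch: it assumes the text uses FRR's usual layout (a "route-map NAME permit|deny SEQ" header followed by indented clauses), and the function name dangling_calls, the ASN 64500 and the name AUTOGEN-PEER-IN are made up for illustration.

```python
import re

# Header of a route-map entry, e.g. "route-map NAME permit 10".
DEFINITION = re.compile(r"^route-map\s+(\S+)\s+(?:permit|deny)\s+\d+\s*$")
# Indented "call NAME" clause inside a route-map entry.
CALL = re.compile(r"^\s+call\s+(\S+)\s*$")


def dangling_calls(config_text):
    """Return {caller: set of called route-maps that are never defined}."""
    defined = set()
    calls = []  # (caller, callee) pairs in file order
    current = None
    for line in config_text.splitlines():
        header = DEFINITION.match(line)
        if header:
            current = header.group(1)
            defined.add(current)
        elif not line.startswith(" "):
            current = None  # any other unindented line closes the block
        elif current is not None:
            call = CALL.match(line)
            if call:
                calls.append((current, call.group(1)))
    missing = {}
    for caller, callee in calls:
        if callee not in defined:
            missing.setdefault(caller, set()).add(callee)
    return missing


# Hypothetical excerpt that mimics the broken state described above.
SAMPLE = """\
route-map AUTOGEN-AS64500-PEER-OUT permit 10
 call AUTOGEN-PEER-OUT
exit
!
route-map AUTOGEN-PEER-IN permit 10
exit
"""
print(dangling_calls(SAMPLE))
```

Running this against the text of the running configuration (for example, the output of vtysh -c 'show running-config') rather than against frr.conf is what would reveal route-maps that failed to load, since frr.conf itself still contained all of them in my case.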
After checking the logs, I found that at some point, watchfrr decided to restart bgpd because it was unresponsive:
Nov 21 15:27:23 core-mci-us watchfrr[2289]: [T58XM-TP956][EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
Nov 21 15:27:23 core-mci-us watchfrr[2289]: [YFT0P-5Q5YX] Forked background command [pid 233455]: /usr/lib/frr/watchfrr.sh restart bgpd
Nov 21 15:27:25 core-mci-us bgpd[2317]: [ZW1GY-R46JE] Terminating on signal
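Since I only noticed the restart after the fact, a small log scan can help find these events. This is a rough sketch, not a tested monitoring tool: the pattern is copied from the watchfrr line quoted above and may differ in other FRR versions, and find_restarts is a name I made up.

```python
import re

# Pattern based on the watchfrr log line quoted above; the exact message
# format is an assumption and may vary between FRR versions.
RESTART = re.compile(
    r"watchfrr\[\d+\]: .*Forked background command \[pid \d+\]: "
    r"\S*watchfrr\.sh restart (\S+)"
)


def find_restarts(log_lines):
    """Return (timestamp, daemon) for each restart that watchfrr launched."""
    events = []
    for line in log_lines:
        match = RESTART.search(line)
        if match:
            # Syslog-style lines start with "Mon DD HH:MM:SS".
            timestamp = " ".join(line.split()[:3])
            events.append((timestamp, match.group(1)))
    return events


LOG = [
    "Nov 21 15:27:23 core-mci-us watchfrr[2289]: [T58XM-TP956][EC 268435457] "
    "bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago",
    "Nov 21 15:27:23 core-mci-us watchfrr[2289]: [YFT0P-5Q5YX] Forked background "
    "command [pid 233455]: /usr/lib/frr/watchfrr.sh restart bgpd",
    "Nov 21 15:27:25 core-mci-us bgpd[2317]: [ZW1GY-R46JE] Terminating on signal",
]
print(find_restarts(LOG))
```

Each hit marks a moment after which the loaded route-maps should be re-checked.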
After watchfrr restarted bgpd, bgpd took a long time to come back (maybe because the configuration is so long), and some route-maps were lost in the process.
Here’s the likely sequence of events: my upstream sent me a large number (~700k) of route updates, and because I had enabled FRR’s SNMP while using a relatively old CPU, bgpd failed to respond to watchfrr within 90 seconds. This triggered watchfrr to restart bgpd, and some route-maps were lost during the restart.
Nov 21 15:25:36 core-mci-us bgpd[2317]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x7faaa05595d0 arg=0x55c352afa580 timer r=-80.327 (bgp_generate_updgrp_packets)() (&connection->t_generate_updgrp_packets) from ../bgpd/bgp_io.c:152} was scheduled to pop greater than 4s ago
Nov 21 15:25:36 core-mci-us bgpd[2317]: [RZ3YY-GPH41][EC 100663310] snmp[warning]: AgentX master agent failed to respond to ping. Attempting to re-register.
Nov 21 15:25:36 core-mci-us bgpd[2317]: [QN9FK-3DQX7] snmp[info]: NET-SNMP version 5.9.4.pre2 AgentX subagent connected
Nov 21 15:25:36 core-mci-us bgpd[2317]: [QN9FK-3DQX7] snmp[info]: AgentX master disconnected us, reconnecting in 15
Nov 21 15:25:36 core-mci-us bgpd[2317]: [ZBSQQ-8BYDT][EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Nov 21 15:25:52 core-mci-us bgpd[2317]: [QN9FK-3DQX7] snmp[info]: NET-SNMP version 5.9.4.pre2 AgentX subagent connected
Here’s my VyOS version:
Version: VyOS 1.5-rolling-202410010007
Release train: current
Release flavor: generic
Built by: [email protected]
Built on: Tue 01 Oct 2024 00:07 UTC
Build UUID: 819a0a01-f7dc-41a6-bb0c-0f366842dca2
Build commit ID: a0deb45ac8367e
Architecture: x86_64
Boot via: installed image
System type: KVM guest
Secure Boot: n/a (BIOS)
Hardware vendor: QEMU
Hardware model: Standard PC (i440FX + PIIX, 1996)
Hardware S/N:
Hardware UUID: a64c30a1-d185-413c-ac56-6a5937ccaff1
Copyright: VyOS maintainers and contributors
My VyOS instance runs on a virtual machine with 8 cores of an E5-2690 v4 CPU, which might be a bit outdated. The setup includes around 70 BGP sessions, each with long filters, resulting in an FRR configuration file of about 7,000 lines.
So I think this incident can be broken down into two issues:
1. watchfrr restarted bgpd. Maybe we could find a way to prevent it from doing so.
2. bgpd lost some route-maps after restarting. This issue might need to be reported to FRR.
Perhaps we could add an option in the VyOS configuration file to set watchfrr -t, so that watchfrr does not restart bgpd even if there is no response after 90 seconds. With the default 90s, watchfrr might restart bgpd repeatedly, making it impossible for bgpd to run or for VyOS to start properly. For now, I execute watchfrr ignore bgpd during VyOS startup as a temporary workaround.