Unexpected route leak (route-map not found) due to bgpd restarted by watchfrr

canoziia · November 21, 2024, 1:36pm

The following content was translated by ChatGPT, so it might be somewhat difficult to understand. Please bear with me.

Two months ago, I reported this bug (https://forum.vyos.io/t/unexpected-route-leak-route-map-not-found). However, at that time, my disk failed, and many logs were lost, so I couldn’t pinpoint the issue. Today, this problem reappeared, and I believe I may have identified its cause.

Here’s an overview of the issue:

Today, I discovered that I was sending the entire routing table to all my BGP neighbors. Fortunately, I had configured maximum-prefix-out 200, so only 200 routes were sent, preventing more severe consequences.

When I ran show bgp route-map AUTOGEN-PEER-OUT, it reported that the route-map was not found.

My per-peer filters (AUTOGEN-ASxxxx-PEER-OUT) call this missing route-map, so nothing was filtered and the full table was sent to my neighbors. Comparing frr.conf with the output of show bgp route-map, I noticed that the route-maps defined earlier in the file were still loaded, while the ones defined later were missing. This suggests that FRR didn’t load the latter part of the route-map definitions. However, I haven’t touched the router in a long time, so why would the route-maps suddenly disappear?
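To make the comparison between frr.conf and the running state less manual, here is a minimal Python sketch. It collects every route-map that frr.conf defines or references (through a neighbor statement or a call line) and reports the ones absent from a set of loaded names. The function names, the sample configuration, AS 64500, and the 192.0.2.1 neighbor are all made up for illustration. How you build the set of loaded names is left open, because I did not want to guess at the exact output format of the show commands.

```python
import re

# Patterns for the frr.conf lines that define or reference a route-map.
DEFINE_RE = re.compile(r"^route-map (\S+) (?:permit|deny) \d+\s*$")
NEIGHBOR_RE = re.compile(r"^\s*neighbor \S+ route-map (\S+) (?:in|out)\s*$")
CALL_RE = re.compile(r"^\s*call (\S+)\s*$")


def route_map_names(config_text):
    """Return (defined, referenced) route-map names found in config text."""
    defined, referenced = set(), set()
    patterns = (
        (DEFINE_RE, defined),
        (NEIGHBOR_RE, referenced),
        (CALL_RE, referenced),
    )
    for line in config_text.splitlines():
        for regex, bucket in patterns:
            match = regex.match(line)
            if match:
                bucket.add(match.group(1))
    return defined, referenced


def missing_route_maps(config_text, loaded):
    """Names that the config defines or references but that are not loaded."""
    defined, referenced = route_map_names(config_text)
    return sorted((defined | referenced) - set(loaded))


# Hypothetical excerpt; the real file is about 7,000 lines long.
SAMPLE_CONFIG = """\
route-map AUTOGEN-AS64500-PEER-OUT permit 10
 call AUTOGEN-PEER-OUT
exit
route-map AUTOGEN-PEER-OUT permit 10
exit
router bgp 64500
 address-family ipv4 unicast
  neighbor 192.0.2.1 route-map AUTOGEN-AS64500-PEER-OUT out
"""

# Pretend only the first route-map survived the restart.
print(missing_route_maps(SAMPLE_CONFIG, {"AUTOGEN-AS64500-PEER-OUT"}))
```

Run periodically, or right after a bgpd restart, a check like this could raise an alert before the unfiltered sessions announce anything.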

After checking the logs, I found that at some point, watchfrr decided to restart bgpd because it was unresponsive:

Nov 21 15:27:23 core-mci-us watchfrr[2289]: [T58XM-TP956][EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
Nov 21 15:27:23 core-mci-us watchfrr[2289]: [YFT0P-5Q5YX] Forked background command [pid 233455]: /usr/lib/frr/watchfrr.sh restart bgpd
Nov 21 15:27:25 core-mci-us bgpd[2317]: [ZW1GY-R46JE] Terminating on signal

After watchfrr restarted bgpd, bgpd took a long time to come back (perhaps because the configuration is so long) and lost some route-maps in the process.

Here’s the likely sequence of events: my upstream sent me a large number (~700k) of route updates. Because I had enabled FRR’s SNMP support and the CPU is relatively old, bgpd failed to answer watchfrr’s ping for 90 seconds. This triggered watchfrr to restart bgpd, and some route-maps were lost during the restart. bgpd logged the following about two minutes before the restart:

Nov 21 15:25:36 core-mci-us bgpd[2317]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x7faaa05595d0 arg=0x55c352afa580 timer  r=-80.327    (bgp_generate_updgrp_packets)() (&connection->t_generate_updgrp_packets) from ../bgpd/bgp_io.c:152} was scheduled to pop greater than 4s ago
Nov 21 15:25:36 core-mci-us bgpd[2317]: [RZ3YY-GPH41][EC 100663310] snmp[warning]: AgentX master agent failed to respond to ping.  Attempting to re-register.
Nov 21 15:25:36 core-mci-us bgpd[2317]: [QN9FK-3DQX7] snmp[info]: NET-SNMP version 5.9.4.pre2 AgentX subagent connected
Nov 21 15:25:36 core-mci-us bgpd[2317]: [QN9FK-3DQX7] snmp[info]: AgentX master disconnected us, reconnecting in 15
Nov 21 15:25:36 core-mci-us bgpd[2317]: [ZBSQQ-8BYDT][EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Nov 21 15:25:52 core-mci-us bgpd[2317]: [QN9FK-3DQX7] snmp[info]: NET-SNMP version 5.9.4.pre2 AgentX subagent connected
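To put numbers on the timeline, here is a small Python sketch that pulls timestamps out of syslog lines like the ones above. It measures the time between the first Thread Starvation warning from bgpd and the moment watchfrr declared bgpd unresponsive. The regex assumes the plain "Mon DD HH:MM:SS host daemon[pid]: message" layout seen in these logs. The year is not in the log, so it is passed in as a parameter. The function names are my own, and the first sample line is shortened.

```python
import re
from datetime import datetime

# Matches lines such as: "Nov 21 15:27:23 core-mci-us watchfrr[2289]: message"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ "
    r"(?P<daemon>[\w-]+)\[\d+\]: (?P<msg>.*)$"
)


def parse_line(line, year):
    """Return (timestamp, daemon, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    when = datetime.strptime(f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S")
    return when, match["daemon"], match["msg"]


def first_event(lines, daemon, needle, year=2024):
    """Timestamp of the first line from `daemon` whose message contains `needle`."""
    for line in lines:
        parsed = parse_line(line, year)
        if parsed and parsed[1] == daemon and needle in parsed[2]:
            return parsed[0]
    return None


SAMPLE_LOG = [
    # Shortened: the rest of the Thread Starvation message is omitted here.
    "Nov 21 15:25:36 core-mci-us bgpd[2317]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: (rest omitted)",
    "Nov 21 15:27:23 core-mci-us watchfrr[2289]: [T58XM-TP956][EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago",
    "Nov 21 15:27:23 core-mci-us watchfrr[2289]: [YFT0P-5Q5YX] Forked background command [pid 233455]: /usr/lib/frr/watchfrr.sh restart bgpd",
]

starved = first_event(SAMPLE_LOG, "bgpd", "Thread Starvation")
unresponsive = first_event(SAMPLE_LOG, "watchfrr", "bgpd state -> unresponsive")
gap_seconds = int((unresponsive - starved).total_seconds())
print(gap_seconds)  # 107
```

The 107 seconds between the two events fits the picture of bgpd already being starved before watchfrr's 90-second limit ran out.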

Here’s my vyos version:

Version:          VyOS 1.5-rolling-202410010007
Release train:    current
Release flavor:   generic

Built by:         [email protected]
Built on:         Tue 01 Oct 2024 00:07 UTC
Build UUID:       819a0a01-f7dc-41a6-bb0c-0f366842dca2
Build commit ID:  a0deb45ac8367e

Architecture:     x86_64
Boot via:         installed image
System type:      KVM guest
Secure Boot:      n/a (BIOS)

Hardware vendor:  QEMU
Hardware model:   Standard PC (i440FX + PIIX, 1996)
Hardware S/N:
Hardware UUID:    a64c30a1-d185-413c-ac56-6a5937ccaff1

Copyright:        VyOS maintainers and contributors

My VyOS instance runs in a virtual machine with 8 cores of an E5-2690 v4 CPU, which might be a bit outdated. It has around 70 BGP sessions, each with long filters, resulting in an FRR configuration file of about 7,000 lines.

So I think this incident can perhaps be broken down into two issues:

1. watchfrr restarted bgpd. Maybe we could find a way to prevent it from doing so.
2. bgpd lost some route-maps after restarting. This issue might need to be reported to FRR.

Perhaps we could add an option to the VyOS configuration to set watchfrr -t, so that bgpd is not restarted after just 90 seconds without a response. With the default 90s, watchfrr might repeatedly restart bgpd, making it impossible for bgpd to run or for VyOS to start properly. For now, I execute watchfrr ignore bgpd during VyOS startup as a temporary workaround.
