FRR/BGP crashing when updating a prefix list

I am running VyOS 1.3.2 on a bare metal server, using it as an edge router with multiple transit providers and multiple downstreams. I can reliably reproduce an issue where FRR crashes when I update a prefix list. In this case, I added two /48s to a prefix list for downstreams, which caused FRR to crash repeatedly. I have to reboot to work around this, which is obviously far from ideal. On top of that, there is the nasty issue in 1.3.x where you need to shut down all of your peer groups before rebooting, because sessions come up before VyOS applies prefix lists, causing you to blast your route table everywhere.

I found the following output in /var/log/messages:

Mar  3 18:45:51 edge1 systemd[26237]: opt-vyatta-config-tmp-new_config_26251.mount: Succeeded.
Mar  3 18:45:51 edge1 systemd[1]: opt-vyatta-config-tmp-new_config_26251.mount: Succeeded.
Mar  3 18:45:53 edge1 watchfrr[1382]: [EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
Mar  3 18:45:53 edge1 watchfrr[1382]: Forked background command [pid 26389]: /usr/lib/frr/watchfrr.sh restart bgpd
Mar  3 18:45:56 edge1 bgpd[1426]: [EC 100663313] SLOW THREAD: task bgp_route_map_update_timer (5618f694b5e0) ran for 97389ms (cpu time 97382ms)
Mar  3 18:45:56 edge1 bgpd[1426]: Terminating on signal
Mar  3 18:45:56 edge1 bgpd[1426]: [EC 100663314] Attempting to process an I/O event but for fd: 35(8) no thread to handle this!
Mar  3 18:45:56 edge1 bgpd[1426]: [EC 100663314] Attempting to process an I/O event but for fd: 42(8) no thread to handle this!
Mar  3 18:45:56 edge1 bgpd[1426]: [EC 100663314] Attempting to process an I/O event but for fd: 56(8) no thread to handle this!
Mar  3 18:46:08 edge1 zebra[1421]: [EC 4043309122] Client 'bgp' encountered an error and is shutting down.
Mar  3 18:46:08 edge1 zebra[1421]: [EC 4043309122] Client 'vnc' encountered an error and is shutting down.

There seem to be numerous messages of this nature, e.g.

Mar  3 18:47:14 edge1 bgpd[26620]: [EC 100663313] SLOW THREAD: task subgroup_coalesce_timer (55a2d7bdeeb0) ran for 6996ms (cpu time 6996ms)
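For anyone wanting to see which bgpd tasks are stalling and for how long, here is a rough sketch that tallies the SLOW THREAD warnings per task. The regex is based only on the two lines quoted in this post, so other FRR versions may format the message differently, and the function name summarize_slow_threads is just something I made up for illustration.

```python
import re
from collections import defaultdict

# Pattern derived from the SLOW THREAD lines quoted in this post (an assumption:
# other FRR releases may word or order these fields differently).
SLOW_THREAD = re.compile(
    r"SLOW THREAD: task (?P<task>\S+) \([0-9a-f]+\) "
    r"ran for (?P<wall>\d+)ms \(cpu time (?P<cpu>\d+)ms\)"
)


def summarize_slow_threads(lines):
    """Return {task_name: (warning_count, worst_wall_time_ms)}."""
    stats = defaultdict(lambda: [0, 0])
    for line in lines:
        match = SLOW_THREAD.search(line)
        if match is None:
            continue  # not a SLOW THREAD warning
        entry = stats[match.group("task")]
        entry[0] += 1
        entry[1] = max(entry[1], int(match.group("wall")))
    return {task: (count, worst) for task, (count, worst) in stats.items()}


if __name__ == "__main__":
    # Sample lines copied from the log excerpts in this post.
    sample = [
        "Mar  3 18:45:56 edge1 bgpd[1426]: [EC 100663313] SLOW THREAD: task "
        "bgp_route_map_update_timer (5618f694b5e0) ran for 97389ms (cpu time 97382ms)",
        "Mar  3 18:45:56 edge1 bgpd[1426]: Terminating on signal",
        "Mar  3 18:47:14 edge1 bgpd[26620]: [EC 100663313] SLOW THREAD: task "
        "subgroup_coalesce_timer (55a2d7bdeeb0) ran for 6996ms (cpu time 6996ms)",
    ]
    for task, (count, worst) in summarize_slow_threads(sample).items():
        print(f"{task}: {count} warning(s), worst {worst} ms")
```

To run it against the real log, pass an open file instead of the sample list, e.g. summarize_slow_threads(open("/var/log/messages")).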

Any ideas on what can be done? Kind of in a pickle because BGP on 1.3 is so problematic, but 1.4 rolling doesn’t seem anywhere near stable enough to use in prod.

Appreciate any insight.

I’ve since migrated to 1.4 rolling and things appear to be much better. The BGP code rewrites really shine! No issues updating prefix lists and seems much faster overall from a configuration perspective.
