BGPv6 session causing consistent 100% CPU

Hello everyone and apologies for my first post being a troubleshooting issue, but I’m completely stumped:
Using self-built VyOS 1.3 (HEAD of equuleus at the time - git commit 2f691bb2f61e96d832ca116e388c85cfec1f5ff7), I have a single IPv6 BGP session that drives zebra totally nuts, using 100% of four CPU cores until I shut it down.
This is running on bare metal, a repurposed Checkpoint P-210 (Core i5-750).
VyOS is running two other IPv4 BGP sessions - one internal and one external - plus an internal IPv6 BGP session without breaking a sweat. As soon as I added this external IPv6 BGP session, things went wild.
To narrow it down, I shut down the internal BGP session and captured what is going over the wire; nothing peculiar stood out.
The capture is here.
Any ideas would be appreciated.
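
In case it is useful, a quick way to see which daemon's threads are actually spinning (assuming the usual Linux tools are present on the VyOS image) is something along these lines:

# per-thread CPU usage for the two routing daemons
top -H -p $(pidof zebra)
top -H -p $(pidof bgpd)

which is how I can tell it is zebra, rather than bgpd, doing the chewing.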

Hi

I couldn’t open that file, but you could run show <ip|ipv6> bgp neighbors <address> (for both address families), and also verify that you have this setting:

set protocols bgp <asn> parameters default no-ipv4-unicast
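
To confirm it is actually in the running config, something like this should echo the line back (I usually use the match filter, but plain grep should work too):

show configuration commands | match no-ipv4-unicast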

Another suggestion is to check your current route-maps and filters, and whether they are correct.

The file is an xz’ed tcpdump capture and I just tried downloading it; it opened in Wireshark just fine after `xz -d`. Anyhow, the capture merely demonstrates that the peer does not appear to be doing anything nasty, at least to my eyes.
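
For anyone who wants to look at it themselves, it is just (filename is an example):

xz -d bgp6-peer.pcap.xz
tcpdump -nn -r bgp6-peer.pcap 'tcp port 179' | head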

show ip bgp neighbors 2a01:d0:7fff:102::1 

and

show ipv6 bgp neighbors 2a01:d0:7fff:102::1

do not exhibit any differences, although

show ip bgp summary

says NoNeg for this peer - expected, since the IPv4 unicast address family is not negotiated with this IPv6-only peer - while

show ipv6 bgp summary

correctly says

Neighbor            V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
2a01:d0:7fff:102::1 4      xxxxx    716997       711        0    0    0 02:02:14       138198        1

All in all, everything looks in order, apart from zebra consuming all the CPU.
The complete output of show ip bgp neighbors 2a01:d0:7fff:102::1 is ~2k of text. Should I paste it here?

set protocols bgp <asn> parameters default no-ipv4-unicast

This is already there.

Another suggestion is to check your current route-maps and filters, and whether they are correct.

Fairly standard RPKI stuff and bogon filters that seem to work for every other peer.

Edit: I just altered the import filter-list to only allow paths that reside entirely inside the peer’s AS and to deny everything else

set policy as-path-list as-importpaths rule 10 action 'permit'
set policy as-path-list as-importpaths rule 10 regex '^[peer AS]$' 
set policy as-path-list as-importpaths rule 20 action 'deny'
set policy as-path-list as-importpaths rule 20 regex '.*'
set protocols bgp [my AS] neighbor 2a01:d0:7fff:102::1 address-family ipv6-unicast filter-list import 'as-importpaths'

and the situation did not improve: zebra still chews through 100% of four CPU cores.
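
If zebra were re-installing routes in a loop I would expect the kernel table to churn; something like this should make that visible (plain iproute2 and FRR's vtysh, respectively):

# watch kernel IPv6 route changes as they happen
ip -6 monitor route

# totals per route source as zebra sees them
sudo vtysh -c 'show ipv6 route summary'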

Hi

Regarding this question:

The complete output of show ip bgp neighbors 2a01:d0:7fff:102::1 is ~2k of text. Should I paste it here?

It would be great if you could share that information (here or elsewhere); maybe there is something useful in it. Also share this log, to see the process with high CPU:

/var/log/atop

It should show when the CPU usage increases (and at what time), so you can associate it with an event (or setting).
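
To replay it for a specific window later, atop can read a raw log back, roughly like this (the dated file name is an assumption - some installs keep a single /var/log/atop file instead):

atop -r /var/log/atop/atop_<YYYYMMDD> -b <HH:MM> -e <HH:MM>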

It would be great if you could share that information (here or elsewhere); maybe there is something useful in it. Also share this log, to see the process with high CPU

Well, here is the output.

And here’s the atop log you asked for. Bear in mind that I only had the BGP session running between ~20:00 and ~21:00 local time (UTC+2), because while the session is up, ntpd spams about 10 messages/sec of `routing socket reports: No buffer space available` in the system log. That is obviously due to the stress the system is being put under, and it stops as soon as I shut down the session.
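
As an aside, that ntpd message just means its routing (netlink) socket cannot keep up with the flood of route-change notifications. The relevant receive-buffer sysctls can be checked like this, though bumping them would presumably only hide the symptom rather than stop the churn:

sysctl net.core.rmem_default net.core.rmem_max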

Thanks for sharing; it seems to be working well (peer 2a01:d0:7fff:102::1). Another possibility is that you have SNMP enabled on this instance and the polls cause high CPU (could you check that?).
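
If you want to rule SNMP out quickly, temporarily removing it from the config should be enough (assuming it was enabled via the usual service node; skip the save so a reboot brings it back):

configure
delete service snmp
commit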

Also, could you verify that the CPU/RAM values are adequate for this purpose?

it seems to be working well (peer 2a01:d0:7fff:102::1)

It’s working fine, but it appears to demand way more CPU than any other BGP session. It’s worse than a `reset ip bgp [peer]`, and that only lasts for a couple of seconds.

Another possibility is that you have SNMP enabled on this instance and the polls cause high CPU (could you check that?)

snmpd is enabled and running as per the default, but nothing polls data off it and it is firewalled anyhow. Shutting it down did not make a difference.

Also, could you verify that the CPU/RAM values are adequate for this purpose?

Perhaps I misunderstand, but an i5-750 and 8 GB of RAM should be plenty for such a setup - I have numerous other boxes of comparable specs and none of them exhibit this behavior.

The high CPU usage with IPv6 BGP is something I’ve seen too. In my case, a newly configured third router was sending lots of updates to the other two iBGP peers, with the IPv6 routing table version increasing by a few thousand every second and eating more than one core of a quad-core Xeon E3-1220v2. It stopped after I added the OSPFv3 configuration (which didn’t affect the iBGP sessions directly, as they were established over IPv4 for both IPv4 and IPv6 routes - the IPv4 setup was already running in production, while IPv6 is something new I’m still in the process of setting up and testing).

FWIW, it appears that something was wrong with the stock FRR 7.5.1 in VyOS 1.3.
I replaced frr with the precompiled packages for Debian Buster from Index of /frr/ (version 8.3.1-0~deb10u1) and everything appears to be in order. I haven’t had the opportunity to pin down the issue in 7.5.1, and I understand that replacing packages in an LTS release isn’t generally acceptable or supported, but it worked for us.
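
For reference, the swap was roughly along these lines, assuming the packages come from the official FRR Debian repository (deb.frr.org); the exact repo component for 8.3.x and whether VyOS regenerates its FRR configuration cleanly across the upgrade are things to double-check on your own box:

curl -s https://deb.frr.org/frr/keys.asc | sudo apt-key add -
echo deb https://deb.frr.org/frr buster frr-8 | sudo tee /etc/apt/sources.list.d/frr.list
sudo apt-get update
sudo apt-get install frr frr-pythontools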