Hello everyone!
I’m facing a strange issue with bgpd when SNMPv2 is active: it uses 100% CPU all the time.
If snmpd is enabled, this happens: (screenshot)
When I disable it (service snmpd stop), everything goes back to normal: (screenshot)
If it is relevant, I have 9 BGP sessions (5 v4 and 4 v6) with 3 full routing tables, and I’m using self-compiled VyOS 1.2.1.
Are there any logs where I can get more information about what is causing this?
I’d be really glad if anyone could help me debug this issue.
Thank you!
Hey @aldemaro
Does this happen just by having snmp enabled, or is it tied to when something is polling the device?
Hello @garysteers !
I’ve disabled polling and the issue is still present. I’ve also tested with SNMPv3 and the same happens. SNMP only needs to be active for this to happen.
It looks like something is wrong with the FRR+SNMP integration, but I can’t figure out what. Are there any bgpd logs? I found frr.log, but nothing looks wrong there:
```
Aug 25 10:46:27 localhost zebra[1030]: snmp[info]: NET-SNMP version 5.7.2.1 AgentX subagent connected
Aug 25 10:46:27 localhost bgpd[1034]: snmp[info]: NET-SNMP version 5.7.2.1 AgentX subagent connected
Aug 25 10:46:27 localhost ospfd[1049]: snmp[info]: NET-SNMP version 5.7.2.1 AgentX subagent connected
Aug 25 10:46:27 localhost ospf6d[1053]: snmp[info]: NET-SNMP version 5.7.2.1 AgentX subagent connected
Aug 25 10:46:27 localhost ripd[1041]: snmp[info]: NET-SNMP version 5.7.2.1 AgentX subagent connected
```
There may be something in /var/log/messages.
You can also use `monitor snmp` (which basically follows that log with a filter for snmpd messages).
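For reference, a rough shell equivalent of what `monitor snmp` does (a sketch, assuming syslog writes to /var/log/messages on this image):

```
# Follow the system log and show only snmpd-related lines.
tail -f /var/log/messages | grep -i snmpd
```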
I have the same problem. Every day between 7:00 pm and 7:10 pm, the bgpd process dies after snmpd reaches 100% CPU.
The problem only occurs if bgpd and snmpd are both running, even if no SNMP polling is being performed.
It occurs on both VM and hardware.
Unfortunately there is no useful information:
snmpd started: (screenshot)
snmpd stopped: (screenshot)
Yes, we have the same problem on VyOS 1.2.2.
Maybe it’s time to report this bug on Phabricator?
Does anyone know how to disable the integration between FRR and snmpd? That should work as a workaround, and would let us monitor at least some aspects over SNMP.
I might set up a test router to try to find out, but any help would be appreciated.
Dmitry (September 4, 2019):
It seems like an FRRouting issue (GitHub issue opened 16 Jun 2018, closed 4 Feb 2019; labels: bug, performance):
Since upgrading to v5 (at this point I also enabled SNMP), I see FRR is using 100% of one CPU core. Also I see the following errors in the log:
```
root@cr1:~# grep "Broken pipe" /var/log/frr/frr.log
2018/06/15 10:21:25 BGP: buffer_flush_available: write error on fd 110: Broken pipe
Jun 15 10:21:25 cr1 bgpd[12541]: buffer_flush_available: write error on fd 110: Broken pipe
2018/06/15 10:21:25 BGP: buffer_flush_available: write error on fd 136: Broken pipe
Jun 15 10:21:25 cr1 bgpd[12541]: buffer_flush_available: write error on fd 136: Broken pipe
2018/06/15 10:21:25 BGP: buffer_flush_available: write error on fd 110: Broken pipe
Jun 15 10:21:25 cr1 bgpd[12541]: buffer_flush_available: write error on fd 110: Broken pipe
2018/06/15 10:21:25 BGP: buffer_flush_available: write error on fd 136: Broken pipe
Jun 15 10:21:25 cr1 bgpd[12541]: buffer_flush_available: write error on fd 136: Broken pipe
2018/06/15 10:21:25 BGP: buffer_flush_available: write error on fd 110: Broken pipe
Jun 15 10:21:25 cr1 bgpd[12541]: buffer_flush_available: write error on fd 110: Broken pipe
2018/06/15 17:18:04 BGP: buffer_flush_available: write error on fd 136: Broken pipe
Jun 15 17:18:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 136: Broken pipe
2018/06/15 17:18:04 BGP: buffer_flush_available: write error on fd 141: Broken pipe
Jun 15 17:18:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 141: Broken pipe
2018/06/15 17:18:04 BGP: buffer_flush_available: write error on fd 136: Broken pipe
Jun 15 17:18:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 136: Broken pipe
2018/06/15 17:18:04 BGP: buffer_flush_available: write error on fd 141: Broken pipe
Jun 15 17:18:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 141: Broken pipe
2018/06/15 17:18:04 BGP: buffer_flush_available: write error on fd 136: Broken pipe
Jun 15 17:18:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 136: Broken pipe
2018/06/15 20:48:46 BGP: buffer_flush_available: write error on fd 40: Broken pipe
Jun 15 20:48:46 cr1 bgpd[12541]: buffer_flush_available: write error on fd 40: Broken pipe
2018/06/15 20:48:46 BGP: buffer_flush_available: write error on fd 119: Broken pipe
Jun 15 20:48:46 cr1 bgpd[12541]: buffer_flush_available: write error on fd 119: Broken pipe
2018/06/15 20:48:46 BGP: buffer_flush_available: write error on fd 40: Broken pipe
Jun 15 20:48:46 cr1 bgpd[12541]: buffer_flush_available: write error on fd 40: Broken pipe
2018/06/15 20:49:03 BGP: buffer_flush_available: write error on fd 40: Broken pipe
Jun 15 20:49:03 cr1 bgpd[12541]: buffer_flush_available: write error on fd 40: Broken pipe
2018/06/15 20:49:03 BGP: buffer_flush_available: write error on fd 97: Broken pipe
Jun 15 20:49:03 cr1 bgpd[12541]: buffer_flush_available: write error on fd 97: Broken pipe
2018/06/15 20:49:03 BGP: buffer_flush_available: write error on fd 104: Broken pipe
Jun 15 20:49:03 cr1 bgpd[12541]: buffer_flush_available: write error on fd 104: Broken pipe
2018/06/15 20:49:03 BGP: buffer_flush_available: write error on fd 65: Broken pipe
Jun 15 20:49:03 cr1 bgpd[12541]: buffer_flush_available: write error on fd 65: Broken pipe
2018/06/15 20:49:04 BGP: buffer_flush_available: write error on fd 104: Broken pipe
Jun 15 20:49:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 104: Broken pipe
2018/06/15 20:49:04 BGP: buffer_flush_available: write error on fd 40: Broken pipe
Jun 15 20:49:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 40: Broken pipe
2018/06/15 20:49:04 BGP: buffer_flush_available: write error on fd 65: Broken pipe
Jun 15 20:49:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 65: Broken pipe
2018/06/15 20:49:04 BGP: buffer_flush_available: write error on fd 40: Broken pipe
Jun 15 20:49:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 40: Broken pipe
2018/06/15 20:49:20 BGP: buffer_flush_available: write error on fd 65: Broken pipe
Jun 15 20:49:20 cr1 bgpd[12541]: buffer_flush_available: write error on fd 65: Broken pipe
2018/06/15 23:12:01 BGP: buffer_flush_available: write error on fd 65: Broken pipe
Jun 15 23:12:01 cr1 bgpd[12541]: buffer_flush_available: write error on fd 65: Broken pipe
2018/06/15 23:12:01 BGP: buffer_flush_available: write error on fd 104: Broken pipe
Jun 15 23:12:01 cr1 bgpd[12541]: buffer_flush_available: write error on fd 104: Broken pipe
```
The last entry matches the time when the load increased.
Also see https://gist.github.com/patrick7/4c47c3afa6815f25451d0fc4e33c0469
I think you can try editing /etc/frr/daemons (`sudo nano /etc/frr/daemons`) and deleting `-M snmp` from the `bgpd_options=...` line. Then run `sudo killall bgpd` and wait for the process to be restarted automatically.
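A sketch of what that edit would amount to, assuming a stock FRR daemons file where `bgpd_options` carries the `-M snmp` flag (the exact path varies by image; see the correction below):

```
# Before:  bgpd_options="... -M snmp"
# After:   bgpd_options="..."
sudo sed -i 's/ -M snmp//' /etc/frr/daemons
sudo killall bgpd    # wait for the watchdog to respawn bgpd without the SNMP module
```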
There is no bgpd_options in this file. I’ll dig deeper and try to find it.
In the meantime, the only workaround I found is to stop snmpd.
Edit: the actual file to edit turned out to be /etc/frr/daemons.conf. Killing the bgpd process for some reason does not spawn a new one, though, so you will need to reboot the router to bring it back up.
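If killing bgpd leaves it down, restarting the whole FRR service may bring it back without a full reboot. A hedged sketch, assuming the image ships the stock FRR systemd unit (note this restarts all routing daemons and will drop BGP sessions):

```
# Restart all FRR daemons (zebra, bgpd, ...); only applicable if a "frr" unit exists.
sudo systemctl restart frr
```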
After removing the `-M snmp` option (as @Dmitry suggested) and rebooting, the snmpd service still reached 100% CPU and stopped responding to queries, but bgpd was unaffected and continued to function normally.
Dmitry (September 23, 2019):
Hello @fvbrasileiro, how many interfaces does your router have? Can you provide the output of the `top` command?
5 interfaces.

```
Interface  S/L
---------  ---
eth0       u/u
eth1       u/u
eth1.111   u/u
eth2       u/u
eth3       A/D
eth4       u/u
eth4.444   u/u
```
```
top - 19:06:24 up 23:50, 1 user, load average: 0.23, 0.12, 0.04
Tasks: 142 total, 2 running, 99 sleeping, 0 stopped, 0 zombie
%Cpu0 : 3.7 us, 0.3 sy, 0.0 ni, 96.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 1.4 us, 0.7 sy, 0.0 ni, 96.6 id, 0.3 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu2 : 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 98.7 us, 1.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem:  4040552 total, 2620536 used, 1420016 free, 111856 buffers
KiB Swap: 0 total, 0 used, 0 free. 244692 cached Mem

  PID USER  PR  NI   VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
 3963 snmp  20   0 173256 119956  4504 R 100.0  3.0  2:05.66 snmpd
 4008 root  20   0 104248  20036 14980 S   4.0  0.5 50:36.23 uacctd
 4003 root  20   0 119708  33596 23456 S   1.3  0.8 26:15.33 uacctd
   17 root  20   0      0      0     0 S   0.3  0.0  1:14.61 ksoftirqd/1
 1008 root  20   0  12064   4692  1556 S   0.3  0.1  0:32.71 haveged
 4006 root  20   0 127988  42876 25736 S   0.3  1.1  6:45.53 uacctd
    1 root  20   0 110636   5124  3204 S   0.0  0.1  0:01.95 systemd
    2 root  20   0      0      0     0 S   0.0  0.0  0:00.01 kthreadd
    3 root   0 -20      0      0     0 I   0.0  0.0  0:00.00 rcu_gp
    4 root   0 -20      0      0     0 I   0.0  0.0  0:00.00 rcu_par_gp
    5 root  20   0      0      0     0 I   0.0  0.0  0:00.39 kworker/0:0-cgr
```
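Since the top output shows snmpd pinning one core, a per-thread view can help narrow down which thread inside snmpd is spinning (a sketch; `-H` shows individual threads):

```
# Show per-thread CPU usage for the snmpd process.
top -H -p "$(pidof snmpd)"
```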
Dmitry (September 24, 2019):
@fvbrasileiro, which VyOS version exactly are you using? I think we need the `perf` utility to see which function the process is spending its time in.
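A hedged sketch of what that profiling could look like, assuming `perf` is available on the image and snmpd (or bgpd) is the busy process:

```
# Live view of the hottest functions in the busy process:
sudo perf top -p "$(pidof snmpd)"
# Or record a 30-second call-graph profile and inspect it afterwards:
sudo perf record -g -p "$(pidof snmpd)" -- sleep 30
sudo perf report
```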
You can choose:

```
me@BGP-2:~$ sh system image
The system currently has the following image(s) installed:
1: 1.2.3-epa1 (default boot) (running image)
2: 1.2-rolling-201908222129
3: 1.2.0-rolling+201908050337
4: 1.2.0-rolling+201907231917
5: 1.2.2
6: 1.2.0-rolling+201906251727
7: 1.2.0-rolling+201906070337
8: 1.2.0-rolling+201906030337
9: 1.2.1
```
hagbard (September 24, 2019):
`show version` should give you the running image version.
The current version is VyOS 1.2.3-epa1, but I can also use a vyos-build version to perform the tests.
Dmitry (September 26, 2019):
@fvbrasileiro, can you try some SNMP config changes?
Edit /etc/default/snmpd and replace

```
SNMPDOPTS='-Lsd -Lf /dev/null -u snmp -g snmp -p /run/snmpd.pid'
```

with

```
SNMPDOPTS='-LSed -u snmp -g snmp -I -ipCidrRouteTable,inetCidrRouteTable -p /run/snmpd.pid'
```

Then restart snmpd.
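For context: net-snmp’s `-I` option with a leading `-` before the module names excludes those modules from initialization, so this stops snmpd from walking the kernel routing tables (ipCidrRouteTable/inetCidrRouteTable), which is expensive when full BGP tables are installed. A sketch of applying the change non-interactively, assuming the stock SNMPDOPTS line and a systemd snmpd unit:

```
# Exclude the route-table MIB modules so snmpd stops walking the full RIB.
sudo sed -i "s|^SNMPDOPTS=.*|SNMPDOPTS='-LSed -u snmp -g snmp -I -ipCidrRouteTable,inetCidrRouteTable -p /run/snmpd.pid'|" /etc/default/snmpd
sudo systemctl restart snmpd
```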
After adding `-I -ipCidrRouteTable,inetCidrRouteTable` and restarting, there is no problem with high CPU anymore.
Dmitry (September 27, 2019):
Good, I think we can add this by default. Can you create a task on Phabricator with your issue?