Hello everyone!
I’m facing a strange issue with bgpd when SNMPv2 is active: it uses 100% CPU all the time.
If snmpd is enabled, this happens: (screenshot)
When I disable it (service snmpd stop), everything goes back to normal: (screenshot)
If it is relevant, I have 9 BGP sessions (5 v4 and 4 v6) with 3 full routing tables, and I’m using self-compiled VyOS 1.2.1.
Are there any logs where I can get more information about what is causing this?
I’d be really glad if anyone could help me debug this issue.
Thank you!
Hey @aldemaro
Does this happen just by having snmp enabled, or is it tied to when something is polling the device?
Hello @garysteers !
I’ve disabled polling and the issue is still present. I’ve also tested with SNMPv3 and the same happens. SNMP only needs to be active for this to happen.
It looks like something is wrong with the FRR+SNMP integration, but I can’t figure out what. Are there any bgpd logs? I found frr.log, but nothing looks wrong there:
```
Aug 25 10:46:27 localhost zebra[1030]: snmp[info]: NET-SNMP version 5.7.2.1 AgentX subagent connected
Aug 25 10:46:27 localhost bgpd[1034]: snmp[info]: NET-SNMP version 5.7.2.1 AgentX subagent connected
Aug 25 10:46:27 localhost ospfd[1049]: snmp[info]: NET-SNMP version 5.7.2.1 AgentX subagent connected
Aug 25 10:46:27 localhost ospf6d[1053]: snmp[info]: NET-SNMP version 5.7.2.1 AgentX subagent connected
Aug 25 10:46:27 localhost ripd[1041]: snmp[info]: NET-SNMP version 5.7.2.1 AgentX subagent connected
```
There may be something in /var/log/messages.
You can also use `monitor snmp` (which basically follows that log with a filter for snmpd messages).
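For reference, a rough shell equivalent of what `monitor snmp` does (a sketch, assuming syslog writes to /var/log/messages on this image):

```
# Follow the system log and show only snmpd-related lines.
tail -f /var/log/messages | grep -i snmpd
```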
I have the same problem. Every day between 7:00 pm and 7:10 pm, the bgpd process dies after snmpd reaches 100% CPU.
The problem only occurs if bgpd and snmpd are both running, even if no SNMP polling is being performed.
It occurs on both VM and hardware.
Unfortunately there is no useful information:
snmpd started: (screenshot)
snmpd stopped: (screenshot)
Yes, we have the same problem on VyOS 1.2.2.
Maybe it’s time to report this bug on Phabricator?
Does anyone know how to disable the integration between FRR and snmpd? That should work as a workaround, and would let us monitor at least some aspects over SNMP.
I might set up a test router to try to find out, but any help would be appreciated.
Dmitry (September 4, 2019):
It seems like an FRRouting issue (GitHub issue opened 16 Jun 2018, closed 4 Feb 2019; labels: bug, performance):
Since upgrading to v5 (at this point I also enabled SNMP), I see FRR is using 100% of one CPU core. Also I see the following errors in the log:
```
root@cr1:~# grep "Broken pipe" /var/log/frr/frr.log
2018/06/15 10:21:25 BGP: buffer_flush_available: write error on fd 110: Broken pipe
Jun 15 10:21:25 cr1 bgpd[12541]: buffer_flush_available: write error on fd 110: Broken pipe
2018/06/15 10:21:25 BGP: buffer_flush_available: write error on fd 136: Broken pipe
Jun 15 10:21:25 cr1 bgpd[12541]: buffer_flush_available: write error on fd 136: Broken pipe
2018/06/15 10:21:25 BGP: buffer_flush_available: write error on fd 110: Broken pipe
Jun 15 10:21:25 cr1 bgpd[12541]: buffer_flush_available: write error on fd 110: Broken pipe
2018/06/15 10:21:25 BGP: buffer_flush_available: write error on fd 136: Broken pipe
Jun 15 10:21:25 cr1 bgpd[12541]: buffer_flush_available: write error on fd 136: Broken pipe
2018/06/15 10:21:25 BGP: buffer_flush_available: write error on fd 110: Broken pipe
Jun 15 10:21:25 cr1 bgpd[12541]: buffer_flush_available: write error on fd 110: Broken pipe
2018/06/15 17:18:04 BGP: buffer_flush_available: write error on fd 136: Broken pipe
Jun 15 17:18:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 136: Broken pipe
2018/06/15 17:18:04 BGP: buffer_flush_available: write error on fd 141: Broken pipe
Jun 15 17:18:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 141: Broken pipe
2018/06/15 17:18:04 BGP: buffer_flush_available: write error on fd 136: Broken pipe
Jun 15 17:18:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 136: Broken pipe
2018/06/15 17:18:04 BGP: buffer_flush_available: write error on fd 141: Broken pipe
Jun 15 17:18:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 141: Broken pipe
2018/06/15 17:18:04 BGP: buffer_flush_available: write error on fd 136: Broken pipe
Jun 15 17:18:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 136: Broken pipe
2018/06/15 20:48:46 BGP: buffer_flush_available: write error on fd 40: Broken pipe
Jun 15 20:48:46 cr1 bgpd[12541]: buffer_flush_available: write error on fd 40: Broken pipe
2018/06/15 20:48:46 BGP: buffer_flush_available: write error on fd 119: Broken pipe
Jun 15 20:48:46 cr1 bgpd[12541]: buffer_flush_available: write error on fd 119: Broken pipe
2018/06/15 20:48:46 BGP: buffer_flush_available: write error on fd 40: Broken pipe
Jun 15 20:48:46 cr1 bgpd[12541]: buffer_flush_available: write error on fd 40: Broken pipe
2018/06/15 20:49:03 BGP: buffer_flush_available: write error on fd 40: Broken pipe
Jun 15 20:49:03 cr1 bgpd[12541]: buffer_flush_available: write error on fd 40: Broken pipe
2018/06/15 20:49:03 BGP: buffer_flush_available: write error on fd 97: Broken pipe
Jun 15 20:49:03 cr1 bgpd[12541]: buffer_flush_available: write error on fd 97: Broken pipe
2018/06/15 20:49:03 BGP: buffer_flush_available: write error on fd 104: Broken pipe
Jun 15 20:49:03 cr1 bgpd[12541]: buffer_flush_available: write error on fd 104: Broken pipe
2018/06/15 20:49:03 BGP: buffer_flush_available: write error on fd 65: Broken pipe
Jun 15 20:49:03 cr1 bgpd[12541]: buffer_flush_available: write error on fd 65: Broken pipe
2018/06/15 20:49:04 BGP: buffer_flush_available: write error on fd 104: Broken pipe
Jun 15 20:49:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 104: Broken pipe
2018/06/15 20:49:04 BGP: buffer_flush_available: write error on fd 40: Broken pipe
Jun 15 20:49:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 40: Broken pipe
2018/06/15 20:49:04 BGP: buffer_flush_available: write error on fd 65: Broken pipe
Jun 15 20:49:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 65: Broken pipe
2018/06/15 20:49:04 BGP: buffer_flush_available: write error on fd 40: Broken pipe
Jun 15 20:49:04 cr1 bgpd[12541]: buffer_flush_available: write error on fd 40: Broken pipe
2018/06/15 20:49:20 BGP: buffer_flush_available: write error on fd 65: Broken pipe
Jun 15 20:49:20 cr1 bgpd[12541]: buffer_flush_available: write error on fd 65: Broken pipe
2018/06/15 23:12:01 BGP: buffer_flush_available: write error on fd 65: Broken pipe
Jun 15 23:12:01 cr1 bgpd[12541]: buffer_flush_available: write error on fd 65: Broken pipe
2018/06/15 23:12:01 BGP: buffer_flush_available: write error on fd 104: Broken pipe
Jun 15 23:12:01 cr1 bgpd[12541]: buffer_flush_available: write error on fd 104: Broken pipe
```
The last entry matches the time when the load increased.
Also see https://gist.github.com/patrick7/4c47c3afa6815f25451d0fc4e33c0469
I think you can try editing /etc/frr/daemons (`sudo nano /etc/frr/daemons`) and deleting `-M snmp` from the `bgpd_options=...` line. Then run `sudo killall bgpd` and wait for the process to be restarted automatically.
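A sketch of what that edit would amount to, assuming a stock FRR daemons file where `bgpd_options` carries the `-M snmp` flag (the exact path varies by image; see the correction below):

```
# Before:  bgpd_options="... -M snmp"
# After:   bgpd_options="..."
sudo sed -i 's/ -M snmp//' /etc/frr/daemons
sudo killall bgpd    # wait for the watchdog to respawn bgpd without the SNMP module
```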
There is no bgpd_options in this file. I’ll dig deeper and try to find it.
In the meantime, the only workaround I found is to stop snmpd.
Edit: the actual file to edit turned out to be /etc/frr/daemons.conf. Killing the bgpd process for some reason does not spawn a new one, though, so you will need to reboot the router to bring it back up.
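If killing bgpd leaves it down, restarting the whole FRR service may bring it back without a full reboot. A hedged sketch, assuming the image ships the stock FRR systemd unit (note this restarts all routing daemons and will drop BGP sessions):

```
# Restart all FRR daemons (zebra, bgpd, ...); only applicable if a "frr" unit exists.
sudo systemctl restart frr
```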
After removing the `-M snmp` option (as @Dmitry suggested) and rebooting, the snmpd service still reached 100% CPU and stopped responding to queries, but bgpd was unaffected and continued to function normally.
Dmitry (September 23, 2019):
Hello @fvbrasileiro, how many interfaces does your router have? Can you provide the output of the `top` command?
5 interfaces.

```
Interface  S/L
---------  ---
eth0       u/u
eth1       u/u
eth1.111   u/u
eth2       u/u
eth3       A/D
eth4       u/u
eth4.444   u/u
```
```
top - 19:06:24 up 23:50, 1 user, load average: 0.23, 0.12, 0.04
Tasks: 142 total, 2 running, 99 sleeping, 0 stopped, 0 zombie
%Cpu0 : 3.7 us, 0.3 sy, 0.0 ni, 96.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 1.4 us, 0.7 sy, 0.0 ni, 96.6 id, 0.3 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu2 : 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 98.7 us, 1.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem:  4040552 total, 2620536 used, 1420016 free, 111856 buffers
KiB Swap: 0 total, 0 used, 0 free. 244692 cached Mem

  PID USER  PR  NI   VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
 3963 snmp  20   0 173256 119956  4504 R 100.0  3.0  2:05.66 snmpd
 4008 root  20   0 104248  20036 14980 S   4.0  0.5 50:36.23 uacctd
 4003 root  20   0 119708  33596 23456 S   1.3  0.8 26:15.33 uacctd
   17 root  20   0      0      0     0 S   0.3  0.0  1:14.61 ksoftirqd/1
 1008 root  20   0  12064   4692  1556 S   0.3  0.1  0:32.71 haveged
 4006 root  20   0 127988  42876 25736 S   0.3  1.1  6:45.53 uacctd
    1 root  20   0 110636   5124  3204 S   0.0  0.1  0:01.95 systemd
    2 root  20   0      0      0     0 S   0.0  0.0  0:00.01 kthreadd
    3 root   0 -20      0      0     0 I   0.0  0.0  0:00.00 rcu_gp
    4 root   0 -20      0      0     0 I   0.0  0.0  0:00.00 rcu_par_gp
    5 root  20   0      0      0     0 I   0.0  0.0  0:00.39 kworker/0:0-cgr
```
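Since the top output shows snmpd pinning one core, a per-thread view can help narrow down which thread inside snmpd is spinning (a sketch; `-H` shows individual threads):

```
# Show per-thread CPU usage for the snmpd process.
top -H -p "$(pidof snmpd)"
```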
Dmitry (September 24, 2019):
@fvbrasileiro, which VyOS version exactly are you using? I think we need the `perf` utility to see which function the process is spending its time in.
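A hedged sketch of what that profiling could look like, assuming `perf` is available on the image and snmpd (or bgpd) is the busy process:

```
# Live view of the hottest functions in the busy process:
sudo perf top -p "$(pidof snmpd)"
# Or record a 30-second call-graph profile and inspect it afterwards:
sudo perf record -g -p "$(pidof snmpd)" -- sleep 30
sudo perf report
```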
You can choose:

```
me@BGP-2:~$ sh system image
The system currently has the following image(s) installed:
1: 1.2.3-epa1 (default boot) (running image)
2: 1.2-rolling-201908222129
3: 1.2.0-rolling+201908050337
4: 1.2.0-rolling+201907231917
5: 1.2.2
6: 1.2.0-rolling+201906251727
7: 1.2.0-rolling+201906070337
8: 1.2.0-rolling+201906030337
9: 1.2.1
```
hagbard (September 24, 2019):
`show version` should give you the running image version.
The current version is VyOS 1.2.3-epa1, but I can also use a vyos-build version to perform the tests.
Dmitry (September 26, 2019):
@fvbrasileiro, can you try some SNMP config changes?
Edit /etc/default/snmpd and replace

```
SNMPDOPTS='-Lsd -Lf /dev/null -u snmp -g snmp -p /run/snmpd.pid'
```

with

```
SNMPDOPTS='-LSed -u snmp -g snmp -I -ipCidrRouteTable,inetCidrRouteTable -p /run/snmpd.pid'
```

Then restart snmpd.
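For context: net-snmp’s `-I` option with a leading `-` before the module names excludes those modules from initialization, so this stops snmpd from walking the kernel routing tables (ipCidrRouteTable/inetCidrRouteTable), which is expensive when full BGP tables are installed. A sketch of applying the change non-interactively, assuming the stock SNMPDOPTS line and a systemd snmpd unit:

```
# Exclude the route-table MIB modules so snmpd stops walking the full RIB.
sudo sed -i "s|^SNMPDOPTS=.*|SNMPDOPTS='-LSed -u snmp -g snmp -I -ipCidrRouteTable,inetCidrRouteTable -p /run/snmpd.pid'|" /etc/default/snmpd
sudo systemctl restart snmpd
```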
After adding `-I -ipCidrRouteTable,inetCidrRouteTable` and restarting, there is no problem with high CPU anymore.
Dmitry (September 27, 2019):
Good, I think we can add this by default. Can you create a task on Phabricator with your issue?