High CPU usage when snmp is active - VyOS 1.2.7 LTS

BNKT0P · July 25, 2021, 12:43pm

Hi, just a warning for everyone using the VyOS 1.2.7 LTS image with snmp, bgp and vrrp.
At the moment I have exactly the same behaviour as described in this topic.
After working for some time, bgpd, vrrpd and snmpd stop, and in the log I can see the following messages:

Jul 13 10:39:02 dp-router1 bgpd[1188]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Jul 13 10:39:09 dp-router1 ripd[1196]: [EC 100663310] snmp[warning]: AgentX master agent failed to respond to ping.  Attempting to re-register.
Jul 13 10:39:13 dp-router1 zebra[1183]: [EC 100663310] snmp[warning]: AgentX master agent failed to respond to ping.  Attempting to re-register.
Jul 13 10:39:15 dp-router1 ospf6d[1210]: [EC 100663310] snmp[warning]: AgentX master agent failed to respond to ping.  Attempting to re-register.
Jul 13 10:39:18 dp-router1 Keepalived_vrrp[3688]: AgentX master agent failed to respond to ping.  Attempting to re-register.
Jul 13 10:39:21 dp-router1 ripd[1196]: [EC 100663313] SLOW THREAD: task agentx_timeout (7efe81593b80) ran for 18018ms (cpu time 0ms)
Jul 13 10:39:22 dp-router1 ospfd[1204]: [EC 100663310] snmp[warning]: AgentX master agent failed to respond to ping.  Attempting to re-register.
Jul 13 10:39:23 dp-router1 bgpd[1188]: [EC 100663313] SLOW THREAD: task agentx_timeout (7fae45866b80) ran for 6006ms (cpu time 0ms)
Jul 13 10:39:25 dp-router1 zebra[1183]: [EC 100663313] SLOW THREAD: task agentx_timeout (7fb83d47ab80) ran for 18017ms (cpu time 0ms)
Jul 13 10:39:27 dp-router1 ospf6d[1210]: [EC 100663313] SLOW THREAD: task agentx_timeout (7f1dba1a2b80) ran for 18018ms (cpu time 0ms)
Jul 13 10:39:34 dp-router1 ospfd[1204]: [EC 100663313] SLOW THREAD: task agentx_timeout (7fd994b84b80) ran for 18017ms (cpu time 0ms)
Jul 13 10:41:03 dp-router1 watchfrr[1153]: [EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
Jul 13 10:41:03 dp-router1 watchfrr[1153]: [EC 100663303] Forked background command [pid 1438]: /usr/lib/frr/watchfrr.sh restart bgpd
Jul 13 10:41:10 dp-router1 watchfrr[1153]: [EC 268435457] ripd state -> unresponsive : no response yet to ping sent 90 seconds ago
Jul 13 10:41:10 dp-router1 watchfrr[1153]: [EC 100663303] Forked background command [pid 1515]: /usr/lib/frr/watchfrr.sh restart ripd
Jul 13 10:41:12 dp-router1 watchfrr[1153]: [EC 268435457] ospf6d state -> unresponsive : no response yet to ping sent 90 seconds ago
Jul 13 10:41:12 dp-router1 watchfrr[1153]: [EC 100663303] Forked background command [pid 1542]: /usr/lib/frr/watchfrr.sh restart ospf6d
Jul 13 10:41:15 dp-router1 watchfrr[1153]: [EC 268435457] zebra state -> unresponsive : no response yet to ping sent 90 seconds ago
Jul 13 10:41:19 dp-router1 watchfrr[1153]: [EC 268435457] ospfd state -> unresponsive : no response yet to ping sent 90 seconds ago
Jul 13 10:41:23 dp-router1 watchfrr[1153]: Warning: restart bgpd child process 1438 still running after 20 seconds, sending signal 15
Jul 13 10:41:23 dp-router1 watchfrr[1153]: restart bgpd process 1438 terminated due to signal 15
Jul 13 10:41:30 dp-router1 watchfrr[1153]: Warning: restart ripd child process 1515 still running after 20 seconds, sending signal 15
Jul 13 10:41:30 dp-router1 watchfrr[1153]: restart ripd process 1515 terminated due to signal 15
Jul 13 10:41:32 dp-router1 watchfrr[1153]: Warning: restart ospf6d child process 1542 still running after 20 seconds, sending signal 15
Jul 13 10:41:32 dp-router1 watchfrr[1153]: restart ospf6d process 1542 terminated due to signal 15
Jul 13 10:41:35 dp-router1 watchfrr[1153]: [EC 100663303] Forked background command [pid 2050]: /usr/lib/frr/watchfrr.sh restart all
Jul 13 10:41:35 dp-router1 staticd[1216]: Terminating on signal
Jul 13 10:41:35 dp-router1 bfdd[1220]: VRF disable default id 0
Jul 13 10:41:35 dp-router1 bfdd[1220]: VRF Deletion: default(0)
Jul 13 10:41:35 dp-router1 ripngd[1200]: Terminating on signal
Jul 13 10:41:35 dp-router1 watchfrr[1153]: [EC 268435457] staticd state -> down : read returned EOF
Jul 13 10:41:35 dp-router1 zebra[1183]: [EC 4043309121] Client 'static' encountered an error and is shutting down.
Jul 13 10:41:35 dp-router1 zebra[1183]: [EC 4043309121] Client 'ripng' encountered an error and is shutting down.
Jul 13 10:41:35 dp-router1 zebra[1183]: [EC 4043309121] Client 'bfd' encountered an error and is shutting down.
Jul 13 10:41:35 dp-router1 watchfrr[1153]: [EC 268435457] bfdd state -> down : read returned EOF
Jul 13 10:41:35 dp-router1 watchfrr[1153]: [EC 268435457] ripngd state -> down : read returned EOF
Jul 13 10:41:55 dp-router1 watchfrr[1153]: Warning: restart all child process 2050 still running after 20 seconds, sending signal 15
Jul 13 10:41:55 dp-router1 watchfrr[1153]: restart all process 2050 terminated due to signal 15
Jul 13 10:43:00 dp-router1 watchfrr[1153]: [EC 100663303] Forked background command [pid 3074]: /usr/lib/frr/watchfrr.sh restart all
Jul 13 10:43:00 dp-router1 watchfrr.sh: Cannot stop bfdd: pid file not found
Jul 13 10:43:00 dp-router1 watchfrr.sh: Cannot stop ripngd: pid file not found
Jul 13 10:43:00 dp-router1 watchfrr.sh: Cannot stop staticd: pid file not found
Jul 13 10:43:20 dp-router1 watchfrr[1153]: Warning: restart all child process 3074 still running after 20 seconds, sending signal 15
Jul 13 10:43:20 dp-router1 watchfrr[1153]: restart all process 3074 terminated due to signal 15
Jul 13 10:45:20 dp-router1 watchfrr[1153]: [EC 100663303] Forked background command [pid 4103]: /usr/lib/frr/watchfrr.sh restart all
Jul 13 10:45:20 dp-router1 watchfrr.sh: Cannot stop bfdd: pid file not found
Jul 13 10:45:20 dp-router1 watchfrr.sh: Cannot stop staticd: pid file not found
Jul 13 10:45:20 dp-router1 watchfrr.sh: Cannot stop ripngd: pid file not found
Jul 13 10:45:40 dp-router1 watchfrr[1153]: Warning: restart all child process 4103 still running after 20 seconds, sending signal 15
Jul 13 10:45:40 dp-router1 watchfrr[1153]: restart all process 4103 terminated due to signal 15
Jul 13 10:49:44 dp-router1 watchfrr[1153]: [EC 100663303] Forked background command [pid 5302]: /usr/lib/frr/watchfrr.sh restart all
Jul 13 10:49:44 dp-router1 watchfrr.sh: Cannot stop bfdd: pid file not found
Jul 13 10:49:44 dp-router1 watchfrr.sh: Cannot stop staticd: pid file not found
Jul 13 10:49:44 dp-router1 watchfrr.sh: Cannot stop ripngd: pid file not found
Jul 13 10:50:04 dp-router1 watchfrr[1153]: Warning: restart all child process 5302 still running after 20 seconds, sending signal 15
Jul 13 10:50:04 dp-router1 watchfrr[1153]: restart all process 5302 terminated due to signal 15
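
To see at a glance which daemons stalled on AgentX and for how long, a log like the one above can be summarized with a small filter. This is only a sketch: the function name slow_agentx_summary is made up for illustration, and it assumes the syslog line layout shown in the excerpt (daemon name in the fifth field, duration right after "ran for").

```shell
#!/bin/sh
# Print the longest "SLOW THREAD: task agentx_timeout" duration per daemon.
# Reads syslog lines on stdin; assumes the layout seen in the excerpt:
#   <Mon> <day> <time> <host> <daemon>[<pid>]: ... ran for <N>ms ...
slow_agentx_summary() {
    awk '
        /SLOW THREAD: task agentx_timeout/ {
            daemon = $5
            sub(/\[.*/, "", daemon)            # drop the "[pid]:" suffix
            for (i = 1; i < NF; i++) {
                if ($i == "for") {             # duration follows "ran for"
                    ms = $(i + 1)
                    sub(/ms$/, "", ms)
                    if (ms + 0 > max[daemon]) max[daemon] = ms + 0
                }
            }
        }
        END {
            for (d in max) printf "%s %d\n", d, max[d]
        }
    ' | LC_ALL=C sort
}
```

It could be fed from a saved copy of the log, for example `slow_agentx_summary < messages.txt`, where messages.txt is a placeholder file name.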


admin@dp-router1:~$ restart vrrp
admin@dp-router1:~$ sh vrrp
VRRP information is not available
admin@dp-router1:~$ sh ip bgp neighbors
:...skipping...

Top shows that the snmpd process consumes 100% of the CPU.

The 1.2.7 LTS image already has the fix described here applied:
SNMPDOPTS='-LSed -u snmp -g snmp -I -ipCidrRouteTable,inetCidrRouteTable -p /run/snmpd.pid'

If you can help with this issue, you are welcome. But if you are using this image in production, stop doing so, especially if your router is under heavy traffic load.

tjh · July 25, 2021, 6:27pm

Does it relieve the issue if you restart snmpd?

/etc/init.d/snmpd restart

BNKT0P · July 26, 2021, 6:45am

Usually I simply restart such a “hung” router, and it helps for 1-2 weeks…

As I wrote before, you simply will not run into this issue if your router is not handling a lot of traffic and is not using snmp. At the moment I’m copying a lot of data between datacenters, and on both ends I have VyOS 1.2.7 LTS installed. On one router I don’t have snmp (or bgp with vrrp) enabled; on the other end I’m monitoring the router with Zabbix 5.0 via snmp, and it also has an eBGP session and vrrp enabled.
The SNMP-enabled router has failed twice in 10 days while copying a 6.5 TB file via rsync; the average speed is ~100 Mbit/s.

tjh · July 26, 2021, 7:58am

Sorry I wasn’t trying to suggest it wasn’t a real issue! I was just wondering if restarting snmp might resolve it for you.

You’re right - I won’t experience it. My 2 primary VyOS instances are home routers doing no more than 1 Gbit/s. My routers do run VRRP, and I also monitor them with Zabbix 5.4.3. I have a single eBGP session too.

I wonder if it might be a problem with your network card driver? It seems like not just SNMP failing, but also FRR etc, like the kernel is hanging.

What network cards do you use?

BNKT0P · July 26, 2021, 8:30am

I have Dell servers with the VyOS 1.2.7 installed, with these NICs:

admin@dp-router1:~$ sudo lspci | egrep -i --color 'network|ethernet'    
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
02:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
02:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet PCIe

admin@dp-router1:~$ modinfo tg3
filename:       /lib/modules/4.19.178-amd64-vyos/kernel/drivers/net/ethernet/broadcom/tg3.ko
firmware:       tigon/tg3_tso5.bin
firmware:       tigon/tg3_tso.bin
firmware:       tigon/tg3.bin
version:        3.137
license:        GPL
description:    Broadcom Tigon3 ethernet driver

admin@dp-router1:~$ modinfo i40e
filename:       /lib/modules/4.19.178-amd64-vyos/updates/drivers/net/ethernet/intel/i40e/i40e.ko
version:        2.12.6
license:        GPL
description:    Intel(R) 40-10 Gigabit Ethernet Connection Network Driver

Viacheslav · July 26, 2021, 8:54am

The bgpd daemon did not respond for ~90 sec, so watchfrr restarted bgpd and the other daemons.
When that happens, the daemon is loaded without its configuration.
This is already fixed in 1.3-rc5 with 2 commits:

vyos-router and protocols

But it needs more testing. You can also make the same changes for 1.2.x.

It temporarily saves the routing configuration, and if watchfrr restarts the router, the configuration will be loaded from this file.

But we need to understand why snmp loads the CPU. Do you have a full view? Which SNMP items do you poll? For example, the full routing table?

BNKT0P · July 26, 2021, 9:49am

Hi, the VyOS routers are not using a full view, only a default route from the upstream provider.
In Zabbix we are monitoring the following items:

system load, uptime, bgp neighbors and their states, vrrp/keepalived states, network interfaces in/out packets, speed, etc.

BNKT0P · July 26, 2021, 9:53am

The problem is that after restarting vrrp, it doesn’t show anything:

admin@dp-router1:~$ restart vrrp
admin@dp-router1:~$ sh vrrp
VRRP information is not available
admin@dp-router1:~$

Viacheslav · July 26, 2021, 10:12am

Can you share a screenshot?
Run “sudo top” and press 1.

Can you disable the snmp items one by one? We need to understand which item increases the load.
So there is not enough time (0.2 sec) to get information from vrrp/keepalived.
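
As a complement to disabling items in Zabbix one by one, each subtree can also be walked by hand and timed. The sketch below assumes the net-snmp snmpwalk tool and SNMP v2c; the function name time_oid_walks and the SNMPWALK override variable are hypothetical, and the host, community and OIDs are placeholders rather than values from this thread.

```shell
#!/bin/sh
# Walk each OID subtree separately and print how long the agent took to answer.
# Usage: time_oid_walks HOST COMMUNITY OID [OID...]
# SNMPWALK is a hypothetical override so the loop can be tried without an agent.
time_oid_walks() {
    host=$1
    community=$2
    shift 2
    for oid in "$@"; do
        start=$(date +%s)
        # -v2c and -c are standard net-snmp options; the walk output is
        # discarded because only the elapsed wall-clock time matters here.
        ${SNMPWALK:-snmpwalk} -v2c -c "$community" "$host" "$oid" \
            >/dev/null 2>&1 && status=0 || status=$?
        end=$(date +%s)
        printf '%s %ss exit=%s\n' "$oid" "$((end - start))" "$status"
    done
}
```

For example, `time_oid_walks 192.0.2.1 public 1.3.6.1.2.1.2.2 1.3.6.1.2.1.15.3` would time the interface table and the BGP peer table. A subtree that takes much longer than the others, or times out, is a candidate for the item that drives snmpd to 100% CPU.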

BNKT0P · July 26, 2021, 11:24am

At the moment the router is working as expected.
During the fault no snmp data was available, of course, but I’ve taken a screenshot.
Also, before the fault no high CPU usage or sudden traffic peak was observed; it just happened ((

If you give me exact instructions on what to do during the next fault, I’ll collect all the required data.

P.S.
We also have ancient VyOS 1.1.8 routers with no snmp problems at all, and they receive a full view from their bgp neighbours.
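
For collecting data during the next fault, one possible starting point is a snapshot script along these lines. It is a sketch, not instructions from the VyOS maintainers: the function name collect_fault_snapshot, the output directory and the choice of commands are all assumptions, and it relies on the procps versions of top and ps.

```shell
#!/bin/sh
# Save a point-in-time snapshot of CPU usage while snmpd is busy.
# The output directory and file names are arbitrary choices for this sketch.
collect_fault_snapshot() {
    outdir=${1:-/tmp/snmpd-fault-$(date +%Y%m%d-%H%M%S)}
    mkdir -p "$outdir" || return 1
    # One batch-mode iteration of top, so the result can be written to a file.
    top -b -n 1 > "$outdir/top.txt" 2>&1 || true
    # All processes sorted by CPU usage, highest first (procps ps syntax).
    ps -eo pid,pcpu,pmem,etime,args --sort=-pcpu > "$outdir/ps.txt" 2>&1 || true
    # Recent kernel messages, in case a NIC driver problem is involved.
    { dmesg | tail -n 200; } > "$outdir/dmesg.txt" 2>&1 || true
    # Print where the files were written.
    echo "$outdir"
}
```

Running `sudo sh -c '. ./snapshot.sh; collect_fault_snapshot'` (snapshot.sh being a placeholder file name) before rebooting the router would leave the files in a timestamped directory under /tmp, which can then be attached to the thread.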

BNKT0P · July 26, 2021, 1:34pm

username or password are incorrect (

Viacheslav · July 27, 2021, 11:10am

There are mount-frr-conf and frr-save

maimun.najib · July 28, 2021, 3:33am

I also experienced the same problem in version 1.3.0-rc5. When more than 1000 PPPoE users were connected, CPU usage immediately increased, and on checking, the process using the most CPU was snmpd at 100% on one core. After the PPPoE users disconnected, all CPU usage went back to normal. As a temporary solution, I don’t configure SNMP when the router is used as a PPPoE server with a large number of clients.

BNKT0P · July 28, 2021, 6:39am

On the Zabbix host configuration page for the VyOS router, I’ve disabled snmp bulk requests. Let’s see if it helps.