FRR crashes with SNMP

erojas · October 8, 2019, 2:40pm

I have VyOS configured with SNMP v2 and use SolarWinds Orion for SNMP monitoring. When the server polls the VyOS device, FRR crashes (logs below). After that, none of the routing protocols work, so I am posting this as a bug, since it seems to be one.

routerman@polux:~$ sho ip bgp summary
% BGP instance not found
routerman@polux:~$ sho ip ospf su

Invalid command: show ip ospf [su]

routerman@polux:~$ sho ip ospf neighbor
% OSPF instance not found

THE LOGS
Oct 07 23:07:40 polux bgpd[1190]: [EC 100663310] snmp[warning]: AgentX master agent failed to respond to ping. Attempting to re-register.
Oct 07 23:07:40 polux ospfd[1207]: [EC 100663310] snmp[warning]: AgentX master agent failed to respond to ping. Attempting to re-register.
Oct 07 23:07:41 polux ospf6d[1211]: [EC 100663310] snmp[warning]: AgentX master agent failed to respond to ping. Attempting to re-register.
Oct 07 23:07:43 polux sshd[3857]: refused connect from 222.186.42.241 (222.186.42.241)
Oct 07 23:07:43 polux sshd[3858]: refused connect from 222.186.42.241 (222.186.42.241)
Oct 07 23:07:43 polux sshd[3859]: refused connect from 222.186.42.241 (222.186.42.241)
Oct 07 23:07:47 polux ripd[1199]: [EC 100663310] snmp[warning]: AgentX master agent failed to respond to ping. Attempting to re-register.
Oct 07 23:07:47 polux zebra[1186]: [EC 100663310] snmp[warning]: AgentX master agent failed to respond to ping. Attempting to re-register.
Oct 07 23:07:52 polux bgpd[1190]: [EC 100663313] SLOW THREAD: task agentx_timeout (7fecaddb9a90) ran for 18018ms (cpu time 1ms)
Oct 07 23:07:52 polux ospfd[1207]: [EC 100663313] SLOW THREAD: task agentx_timeout (7f5e7833ea90) ran for 18018ms (cpu time 0ms)
Oct 07 23:07:53 polux ospf6d[1211]: [EC 100663313] SLOW THREAD: task agentx_timeout (7efe2a31ba90) ran for 18018ms (cpu time 0ms)
Oct 07 23:07:59 polux ripd[1199]: [EC 100663313] SLOW THREAD: task agentx_timeout (7f06fe65ca90) ran for 18018ms (cpu time 0ms)
Oct 07 23:07:59 polux zebra[1186]: [EC 100663313] SLOW THREAD: task agentx_timeout (7fd937d8aa90) ran for 18018ms (cpu time 1ms)
Oct 07 23:08:13 polux bgpd[1190]: [EC 100663313] SLOW THREAD: task agentx_timeout (7fecaddb9a90) ran for 6005ms (cpu time 0ms)
Oct 07 23:08:24 polux sshd[3860]: refused connect from 222.186.180.223 (222.186.180.223)
Oct 07 23:09:37 polux watchfrr[1152]: [EC 268435457] ospfd state -> unresponsive : no response yet to ping sent 90 seconds ago
Oct 07 23:09:37 polux watchfrr[1152]: [EC 100663303] Forked background command [pid 3862]: /usr/lib/frr/watchfrr.sh restart ospfd
Oct 07 23:09:43 polux watchfrr[1152]: [EC 268435457] ospf6d state -> unresponsive : no response yet to ping sent 90 seconds ago
Oct 07 23:09:43 polux watchfrr[1152]: [EC 100663303] Forked background command [pid 3929]: /usr/lib/frr/watchfrr.sh restart ospf6d
Oct 07 23:09:44 polux watchfrr[1152]: [EC 268435457] zebra state -> unresponsive : no response yet to ping sent 90 seconds ago
Oct 07 23:09:49 polux watchfrr[1152]: [EC 268435457] ripd state -> unresponsive : no response yet to ping sent 90 seconds ago
Oct 07 23:09:53 polux watchfrr[1152]: [EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
Oct 07 23:09:57 polux watchfrr[1152]: Warning: restart ospfd child process 3862 still running after 20 seconds, sending signal 15
Oct 07 23:09:57 polux watchfrr[1152]: restart ospfd process 3862 terminated due to signal 15
Oct 07 23:10:03 polux watchfrr[1152]: Warning: restart ospf6d child process 3929 still running after 20 seconds, sending signal 15
Oct 07 23:10:03 polux watchfrr[1152]: restart ospf6d process 3929 terminated due to signal 15
Oct 07 23:10:04 polux watchfrr[1152]: [EC 100663303] Forked background command [pid 4273]: /usr/lib/frr/watchfrr.sh restart all
Oct 07 23:10:04 polux staticd[1217]: Terminating on signal
Oct 07 23:10:04 polux zebra[1186]: [EC 4043309117] Client 'static' encountered an error and is shutting down.
Oct 07 23:10:04 polux watchfrr[1152]: [EC 268435457] staticd state -> down : read returned EOF
Oct 07 23:10:04 polux ripngd[1203]: Terminating on signal
Oct 07 23:10:04 polux zebra[1186]: [EC 4043309117] Client 'ripng' encountered an error and is shutting down.
Oct 07 23:10:04 polux watchfrr[1152]: [EC 268435457] ripngd state -> down : read returned EOF
Oct 07 23:10:24 polux watchfrr[1152]: Warning: restart all child process 4273 still running after 20 seconds, sending signal 15
Oct 07 23:10:24 polux watchfrr[1152]: restart all process 4273 terminated due to signal 15
Oct 07 23:11:24 polux watchfrr[1152]: [EC 100663303] Forked background command [pid 5285]: /usr/lib/frr/watchfrr.sh restart all
Oct 07 23:11:24 polux watchfrr.sh[5296]: Cannot stop staticd: pid file not found
Oct 07 23:11:24 polux watchfrr.sh[5301]: Cannot stop ripngd: pid file not found
Oct 07 23:11:44 polux watchfrr[1152]: Warning: restart all child process 5285 still running after 20 seconds, sending signal 15
Oct 07 23:11:44 polux watchfrr[1152]: restart all process 5285 terminated due to signal 15
Oct 07 23:13:44 polux watchfrr[1152]: [EC 100663303] Forked background command [pid 6301]: /usr/lib/frr/watchfrr.sh restart all
Oct 07 23:13:44 polux watchfrr.sh[6312]: Cannot stop staticd: pid file not found
Oct 07 23:13:44 polux watchfrr.sh[6316]: Cannot stop ripngd: pid file not found
Oct 07 23:14:04 polux watchfrr[1152]: Warning: restart all child process 6301 still running after 20 seconds, sending signal 15
Oct 07 23:14:04 polux watchfrr[1152]: restart all process 6301 terminated due to signal 15
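
For anyone triaging a similar log, the watchfrr state transitions above can be pulled out with a short script. This is only a rough sketch: the regular expression is an assumption based on the line format in this particular log, and `summarize_states` and `SAMPLE` are hypothetical names used for illustration, not part of FRR or VyOS.

```python
import re

# Assumed line shape, taken from the log above, e.g.:
#   "Oct 07 23:09:37 polux watchfrr[1152]: [EC 268435457] ospfd state -> unresponsive : ..."
STATE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ watchfrr\[\d+\]: "
    r"(?:\[EC \d+\] )?(?P<daemon>\S+) state -> (?P<state>\w+)"
)


def summarize_states(lines):
    """Return {daemon: [(timestamp, state), ...]} for watchfrr state changes."""
    result = {}
    for line in lines:
        match = STATE_RE.match(line)
        if match:
            entry = (match.group("ts"), match.group("state"))
            result.setdefault(match.group("daemon"), []).append(entry)
    return result


# A few lines copied from the log above, including two that should be ignored.
SAMPLE = [
    "Oct 07 23:09:37 polux watchfrr[1152]: [EC 268435457] ospfd state -> unresponsive : no response yet to ping sent 90 seconds ago",
    "Oct 07 23:09:37 polux watchfrr[1152]: [EC 100663303] Forked background command [pid 3862]: /usr/lib/frr/watchfrr.sh restart ospfd",
    "Oct 07 23:10:04 polux staticd[1217]: Terminating on signal",
    "Oct 07 23:10:04 polux watchfrr[1152]: [EC 268435457] staticd state -> down : read returned EOF",
]

if __name__ == "__main__":
    for daemon, events in summarize_states(SAMPLE).items():
        print(daemon, events)
```

Run against the full log, this would list each daemon watchfrr marked as unresponsive or down together with the time of the transition, which makes it easier to see that the failures follow the AgentX timeouts.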

Dmitry · October 8, 2019, 2:49pm

This seems to be a known issue: T1705, High CPU usage by bgpd when snmp is active.
Can you try the changes described in this post?
High CPU usage by bgpd when snmp is active - #18 by Dmitry

erojas · October 9, 2019, 8:41pm

Thanks. I followed that thread and applied the workaround.