High CPU usage by process zebra on VyOS 1.3

Hello…

I have two border routers running VyOS 1.3, which I built myself.
Both routers have similar configs with 16 IPv4 and 12 IPv6 BGP neighbors; the iBGP IPv4 and IPv6 links between the routers work correctly. I run OSPF and OSPFv3 on the downlinks, and also OSPF and OSPFv3 between the routers.
I have SNMP v3 on each router. Zabbix monitors each router with a BGP template.
One VyOS router works fine, but the other has a problem with CPU load from the zebra process: up to 50%-60%.
I have read the forums and docs, but haven’t found a solution.
I tried disabling SNMP, but the problem stayed the same.
I also have a strange problem on the router with the high CPU load: when VyOS boots and loads the config, a network interface with vifs, a shaper and a redirect to ifb is missing. When I configure this interface by hand after boot, it works fine.

Can anyone suggest a possible cause of this problem?

VyOS 1.3
Version: VyOS 1.3-rolling-202205261220
Release train: equuleus

Built by: ilia@sam-isp.net
Built on: Thu 26 May 2022 12:20 UTC
Build UUID: 532c0750-afbd-4431-902d-30e86d0a2afa
Build commit ID: ec82d1fffe7213

Architecture: x86_64
Boot via: installed image
System type: bare metal

Hardware vendor: HP
Hardware model: ProLiant DL360p Gen8
Hardware S/N: 6CU532YN0C
Hardware UUID: 312e3342-0032-4336-5535-3332594e3043

Copyright: VyOS maintainers and contributors

You can enable zebra debugging for a few seconds and check the logs afterwards. I suspect there will be a lot of events that use the CPU intensively.

To enable:

sudo vtysh -c 'debug zebra dplane' -c 'debug zebra events' -c 'debug zebra fpm' -c 'debug zebra kernel' -c 'debug zebra nexthop' -c 'debug zebra nht' -c 'debug zebra packet' -c 'debug zebra rib' -c 'conf t' -c 'log syslog debugging'

To disable:

sudo vtysh -c 'no debug zebra dplane' -c 'no debug zebra events' -c 'no debug zebra fpm' -c 'no debug zebra kernel' -c 'no debug zebra nexthop' -c 'no debug zebra nht' -c 'no debug zebra packet' -c 'no debug zebra rib' -c 'conf t' -c 'log syslog'

Be careful: detailed debugging can make the CPU situation much worse, so keep the debug window short.
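To limit how long debugging stays on, the enable and disable steps can be wrapped in one small helper. This is only a sketch, assuming bash; the function name `zebra_debug_window`, the `VTYSH` variable, the 10-second default and the reduced set of debug flags (events and rib only) are my own choices, not something VyOS ships.

```shell
#!/usr/bin/env bash
# Sketch: turn on a subset of zebra debugging, wait, then turn it off again.
# VTYSH is a hypothetical override hook; set VTYSH=echo for a dry run.
VTYSH="${VTYSH:-sudo vtysh}"

zebra_debug_window() {
    # Number of seconds to leave debugging enabled (default 10, an assumption).
    local seconds="${1:-10}"

    # Enable: same vtysh commands as above, reduced to two debug flags.
    $VTYSH -c 'debug zebra events' -c 'debug zebra rib' \
           -c 'conf t' -c 'log syslog debugging'

    sleep "$seconds"

    # Disable the same flags and restore the default syslog level.
    $VTYSH -c 'no debug zebra events' -c 'no debug zebra rib' \
           -c 'conf t' -c 'log syslog'
}
```

Calling `zebra_debug_window 5` would leave debugging on for roughly five seconds. If the script is interrupted during the sleep, debugging stays on, so the disable command above should still be kept at hand.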

Thank you for your reply.
I will try the debug.

I have turned on all BGP neighbors and zebra debug.
I can see a lot of zebra events:

redistribute_update
zsend_redistribute_route
Not Notifying Owner
netlink_parse_info
zread_route_add

And so on.
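To see which of these events dominates, the names can be counted in the log. A rough sketch, assuming the debug lines land in /var/log/messages and contain the names listed above literally; `count_zebra_events` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count how often each zebra function name from the list above
# appears in the given log files (or on stdin if no file is given).
count_zebra_events() {
    # -o prints only the matched name, -h suppresses file name prefixes.
    grep -ohE 'redistribute_update|zsend_redistribute_route|netlink_parse_info|zread_route_add' "$@" \
        | sort | uniq -c | sort -rn
}
```

For example, `count_zebra_events /var/log/messages` prints one line per name with its count, most frequent first.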

But now zebra takes 1%-2% CPU, and everything has been running for about 2 hours with the same config.

The router has been working fine for more than twelve hours. Zebra is taking about 1%-2% CPU.
I did not change anything in the config before I turned on the BGP neighbors.
Now I have all 16 IPv4 and 12 IPv6 neighbors on, plus the iBGP link between the routers.

After 1 day, the zebra and bgpd processes periodically take up to 30% CPU.
It is the same problem that was described above.

Nothing has been changed in the config or topology on either router.

And I see many records like these in /var/log/messages:

Aug 16 10:56:46 vyos-1 bgpd[1392]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Aug 16 10:56:46 vyos-1 bgpd[1392]: [EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Aug 16 10:57:01 vyos-1 bgpd[1392]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
Aug 16 11:07:46 vyos-1 bgpd[1392]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Aug 16 11:07:46 vyos-1 bgpd[1392]: [EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Aug 16 11:08:01 vyos-1 bgpd[1392]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
Aug 16 11:19:17 vyos-1 bgpd[1392]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Aug 16 11:19:17 vyos-1 bgpd[1392]: [EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Aug 16 11:19:32 vyos-1 bgpd[1392]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
Aug 16 11:30:46 vyos-1 bgpd[1392]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Aug 16 11:30:46 vyos-1 bgpd[1392]: [EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Aug 16 11:31:01 vyos-1 bgpd[1392]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
Aug 16 11:31:46 vyos-1 bgpd[1392]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Aug 16 11:31:46 vyos-1 bgpd[1392]: [EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Aug 16 11:32:01 vyos-1 bgpd[1392]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
Aug 16 11:38:47 vyos-1 bgpd[1392]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Aug 16 11:38:47 vyos-1 bgpd[1392]: [EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Aug 16 11:39:02 vyos-1 bgpd[1392]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
Aug 16 11:52:16 vyos-1 bgpd[1392]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Aug 16 11:52:16 vyos-1 bgpd[1392]: [EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Aug 16 11:52:31 vyos-1 bgpd[1392]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
Aug 16 12:09:16 vyos-1 bgpd[1392]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Aug 16 12:09:16 vyos-1 bgpd[1392]: [EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Aug 16 12:09:31 vyos-1 bgpd[1392]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
Aug 16 12:21:16 vyos-1 bgpd[1392]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Aug 16 12:21:16 vyos-1 bgpd[1392]: [EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Aug 16 12:21:31 vyos-1 bgpd[1392]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
Aug 16 12:35:16 vyos-1 bgpd[1392]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Aug 16 12:35:16 vyos-1 bgpd[1392]: [EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Aug 16 12:35:31 vyos-1 bgpd[1392]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
Aug 16 12:39:46 vyos-1 bgpd[1392]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Aug 16 12:39:46 vyos-1 bgpd[1392]: [EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Aug 16 12:40:01 vyos-1 bgpd[1392]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
Aug 16 12:43:17 vyos-1 bgpd[1392]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Aug 16 12:43:17 vyos-1 bgpd[1392]: [EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Aug 16 12:43:32 vyos-1 bgpd[1392]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
Aug 16 13:05:46 vyos-1 bgpd[1392]: snmp[info]: AgentX master disconnected us, reconnecting in 15
Aug 16 13:05:46 vyos-1 bgpd[1392]: [EC 100663303] Failed to set snmp fd back to original settings: Bad file descriptor(9)
Aug 16 13:06:01 vyos-1 bgpd[1392]: snmp[info]: NET-SNMP version 5.7.3 AgentX subagent connected
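To check whether these AgentX disconnects line up with the CPU spikes, it helps to list the time between them. A small sketch, assuming the syslog timestamp format shown above and that all entries fall on the same day (a midnight wrap is not handled); `agentx_disconnect_gaps` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print the time between consecutive "AgentX master disconnected"
# messages, using the HH:MM:SS value in the third syslog field.
agentx_disconnect_gaps() {
    grep -h 'AgentX master disconnected' "$@" | awk '
        {
            # Convert HH:MM:SS to seconds since midnight.
            split($3, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (NR > 1)
                printf "%s -> %s  %.1f min\n", prev_ts, $3, (now - prev) / 60
            prev = now
            prev_ts = $3
        }'
}
```

Run as `agentx_disconnect_gaps /var/log/messages`; the output can then be compared by eye with the times at which zebra and bgpd show high CPU.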

I have changed the BIOS NUMA settings on both routers, HP DL360p G8 (Performance Options > Advanced Performance Tuning Options > NUMA Group Size Optimization > Interleave > Enabled), and everything has now been working fine for about 24 hours.
I think the high CPU load of zebra and bgpd depends on the NUMA mode. Possibly FRR does not support NUMA. That’s why, once I made a single NUMA node with both sockets in it, zebra and bgpd take only 1%-3% CPU time.
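For anyone who wants to compare their own box, the number of NUMA nodes the kernel sees can be read from sysfs before and after the BIOS change. A minimal sketch, assuming a Linux system that exposes /sys/devices/system/node; `numa_node_count` is a hypothetical helper name, and the optional path argument exists only to make it easy to try against a test directory.

```shell
#!/usr/bin/env bash
# Sketch: count the NUMA nodes exposed by the kernel under sysfs.
numa_node_count() {
    local base="${1:-/sys/devices/system/node}"
    local n=0 d
    for d in "$base"/node[0-9]*; do
        # The glob stays unexpanded when nothing matches, so test each entry.
        if [ -d "$d" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

On a two-socket server in NUMA mode I would expect `numa_node_count` to print 2, and 1 once the nodes are interleaved, but that is an expectation, not something measured in this thread.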