BGP Session disconnects - SLOW THREAD: task smux_connect

Hello,
I’m currently testing VyOS and the BGP session drops once every few hours because the BGP process freezes and therefore doesn’t send keepalive messages in 40s intervals.

Keep Alive messages:

Jul 4 11:51:06 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:51:06 vyos bgpd[3791]: $IPv6_PEER sending KEEPALIVE Jul 4 11:51:46 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:51:46 vyos bgpd[3791]: $IPv6_PEER sending KEEPALIVE Jul 4 11:52:26 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:52:26 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:53:30 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:53:30 vyos bgpd[3791]: $IPv6_PEER sending KEEPALIVE Jul 4 11:54:23 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:54:23 vyos bgpd[3791]: $IPv6_PEER sending KEEPALIVE Jul 4 11:55:03 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:55:03 vyos bgpd[3791]: $IPv6_PEER sending KEEPALIVE Jul 4 11:55:53 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:55:53 vyos bgpd[3791]: $IPv6_PEER sending KEEPALIVE

Jul 4 11:45:16 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 15029ms (cpu time 0ms) Jul 4 11:45:54 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 7010ms (cpu time 0ms) Jul 4 11:46:33 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7009ms (cpu time 0ms) Jul 4 11:47:10 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7009ms (cpu time 0ms) Jul 4 11:47:47 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7020ms (cpu time 0ms) Jul 4 11:47:53 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 15019ms (cpu time 0ms) Jul 4 11:48:31 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 7008ms (cpu time 0ms) Jul 4 11:50:16 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 7006ms (cpu time 0ms) Jul 4 11:50:20 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 7007ms (cpu time 0ms) Jul 4 11:51:06 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 15019ms (cpu time 0ms) Jul 4 11:51:07 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 7009ms (cpu time 0ms) Jul 4 11:51:14 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7010ms (cpu time 0ms) Jul 4 11:51:45 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 15030ms (cpu time 0ms) Jul 4 11:52:04 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 7007ms (cpu time 0ms) Jul 4 11:52:26 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 31050ms (cpu time 0ms) Jul 4 11:53:30 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 63099ms (cpu time 0ms) Jul 4 11:53:41 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 127260ms (cpu time 0ms) Jul 4 11:53:47 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 7001ms (cpu time 0ms) Jul 4 11:53:50 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 63099ms (cpu time 0ms) Jul 4 11:54:23 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 15023ms (cpu time 0ms) Jul 4 11:54:42 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 31090ms (cpu time 0ms) Jul 4 11:54:48 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 15038ms (cpu time 0ms) Jul 4 11:55:02 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 31080ms (cpu time 0ms) Jul 4 11:55:23 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7010ms (cpu time 0ms) Jul 4 11:55:53 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 15019ms (cpu time 0ms) Jul 4 11:56:04 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 31090ms (cpu time 0ms) Jul 4 11:56:13 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7009ms (cpu time 0ms) Jul 4 11:56:21 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 7009ms (cpu time 0ms) Jul 4 11:57:05 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 31090ms (cpu time 0ms) Jul 4 11:57:51 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 15029ms (cpu time 0ms) Jul 4 11:57:54 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 15025ms (cpu time 0ms) Jul 4 11:58:28 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 7009ms (cpu time 0ms) Jul 4 11:58:43 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 15036ms (cpu time 0ms) Jul 4 11:59:05 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7009ms (cpu time 0ms) Jul 4 11:59:40 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 15019ms (cpu time 0ms)

Hardware Info:

[code]----------------
Vyatta Version and Package Changes

Version: VyOS 1.1.7
Description: VyOS 1.1.7 (helium)
Copyright: 2016 VyOS maintainers and contributors
Built by: maintainers@vyos.net
Built on: Wed Feb 17 09:57:31 UTC 2016
Build ID: 1602170957-4459750
System type: x86 64-bit
Boot via: image
HW model: ProLiant DL360e Gen8
HW S/N: CZJ3500BFQ
HW UUID: 34373030-3635-435A-4A33-353030424651
Uptime: 12:01:36 up 8:11, 2 users, load average: 0.00, 0.02, 0.05[/code]

BGP Configuration:

bgp $ASNUM { address-family { ipv6-unicast { network $IPv6_/29 { } } } neighbor $IPv4_PEER { remote-as $REM_ASNUM update-source $IPv4_IP } neighbor $IPv6_PEER { address-family { ipv6-unicast { } } remote-as $REM_ASNUM update-source $IPv6_IP } network $NET1_/24 { } network $NET2_/24 { } network $NET3_/24 { } network $NET4_/24 { } network $NET5_/24 { } network $NET6_/24 { } network $NET7_/22 { } parameters { router-id $IPv4_IP } }

top:

pu0 : 0.1%us, 0.0%sy, 0.0%ni, 91.7%id, 0.0%wa, 0.0%hi, 8.2%si, 0.0%st Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 0.1%us, 0.1%sy, 0.0%ni, 94.6%id, 0.0%wa, 0.0%hi, 5.1%si, 0.0%st Cpu3 : 0.1%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 98.5%id, 0.0%wa, 0.0%hi, 1.4%si, 0.0%st Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 98.3%id, 0.0%wa, 0.0%hi, 1.7%si, 0.0%st Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu8 : 0.1%us, 0.1%sy, 0.0%ni, 98.1%id, 0.0%wa, 0.0%hi, 1.7%si, 0.0%st Cpu9 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu10 : 0.1%us, 0.0%sy, 0.0%ni, 98.1%id, 0.0%wa, 0.0%hi, 1.7%si, 0.0%st Cpu11 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu12 : 0.0%us, 0.0%sy, 0.0%ni, 98.4%id, 0.0%wa, 0.0%hi, 1.6%si, 0.0%st Cpu13 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu14 : 0.1%us, 0.0%sy, 0.0%ni, 98.1%id, 0.0%wa, 0.0%hi, 1.8%si, 0.0%st Cpu15 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu16 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu17 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu18 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu19 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu20 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu21 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu22 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu23 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 0 0 7546832 83080 108264 0 0 0 0 1 18 0 1 99 0 0 0 0 7547668 83080 108264 0 0 0 8 137149 372 0 1 99 0 0 0 0 7550036 83080 108264 0 0 0 0 125155 266 0 0 100 0 0 0 0 7548612 83080 108264 0 0 0 0 126945 314 0 0 100 0 0 0 0 7545940 83080 108264 0 0 0 0 119227 307 0 1 99 0 0 0 0 7551220 83080 108264 0 0 0 0 117026 304 0 0 100 0 0 0 0 7548992 83080 108264 0 0 0 0 120597 389 0 1 99 0 0 0 0 7549240 83080 108264 0 0 0 0 136031 206 0 0 100 0 0 0 0 7550232 83088 108256 0 0 0 32 148895 315 0 0 100 0 0 0 0 7547544 83088 108264 0 0 0 0 147729 339 0 1 99 0 0 0 0 7546764 83088 108264 0 0 0 20 156677 362 0 1 99 0 0 0 0 7546668 83088 108264 0 0 0 0 159350 263 0 0 100 0 0 0 0 7545548 83088 108264 0 0 0 0 168053 341 0 1 99 0 0 0 0 7547564 83088 108264 0 0 0 0 173445 280 0 1 99 0 1 0 0 7547532 83088 108264 0 0 0 0 173424 232 0 1 99 0 0 0 0 7547788 83096 108256 0 0 0 60 171006 327 0 1 99 0 0 0 0 7547276 83096 108264 0 0 0 0 173452 350 0 1 99 0 0 0 0 7547340 83096 108268 0 0 0 0 167156 335 0 1 99 0 0 0 0 7545996 83096 108268 0 0 0 0 171909 307 0 1 99 0 0 0 0 7546348 83096 108268 0 0 0 0 163887 295 0 1 99 0 0 0 0 7547116 83104 108260 0 0 0 20 164847 376 0 1 99 0 0 0 0 7546412 83104 108260 0 0 0 8 159988 302 0 1 99 0 0 0 0 7545644 83104 108268 0 0 0 0 153223 339 0 0 100 0 0 0 0 7548268 83104 108268 0 0 0 0 163509 284 0 0 100 0 0 0 0 7548332 83104 108268 0 0 0 0 159174 297 0 1 99 0 0 0 0 7547308 83104 108268 0 0 0 0 165399 286 0 1 99 0 0 0 0 7548204 83104 108268 0 0 0 0 166225 237 0 1 99 0

I have even configured the remote router to only send the default route 0.0.0.0/0 but even then this issue occurs. The whole routing table has only 10 entries.
HP Server has Hyperthreading enabled and Power Mode set to High Static Performance.
8 GB RAM installed and CPU is 2x Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz.

Traffic: 100k pps in / 100k pps out, 150 Mbit/s in, 200 Mbit/s out.

Any Ideas why this issue occurs?

Hello,
most likely it’s some obscure bug in quagga,
I found similar reports on ERL devices