BGP Session disconnects - SLOW THREAD: task smux_connect


#1

Hello,
I’m currently testing VyOS and the BGP session drops once every few hours because the BGP process freezes and therefore doesn’t send keepalive messages in 40s intervals.

Keep Alive messages:

Jul 4 11:51:06 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:51:06 vyos bgpd[3791]: $IPv6_PEER sending KEEPALIVE Jul 4 11:51:46 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:51:46 vyos bgpd[3791]: $IPv6_PEER sending KEEPALIVE Jul 4 11:52:26 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:52:26 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:53:30 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:53:30 vyos bgpd[3791]: $IPv6_PEER sending KEEPALIVE Jul 4 11:54:23 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:54:23 vyos bgpd[3791]: $IPv6_PEER sending KEEPALIVE Jul 4 11:55:03 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:55:03 vyos bgpd[3791]: $IPv6_PEER sending KEEPALIVE Jul 4 11:55:53 vyos bgpd[3791]: $IPv4_PEER sending KEEPALIVE Jul 4 11:55:53 vyos bgpd[3791]: $IPv6_PEER sending KEEPALIVE

Jul 4 11:45:16 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 15029ms (cpu time 0ms) Jul 4 11:45:54 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 7010ms (cpu time 0ms) Jul 4 11:46:33 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7009ms (cpu time 0ms) Jul 4 11:47:10 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7009ms (cpu time 0ms) Jul 4 11:47:47 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7020ms (cpu time 0ms) Jul 4 11:47:53 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 15019ms (cpu time 0ms) Jul 4 11:48:31 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 7008ms (cpu time 0ms) Jul 4 11:50:16 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 7006ms (cpu time 0ms) Jul 4 11:50:20 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 7007ms (cpu time 0ms) Jul 4 11:51:06 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 15019ms (cpu time 0ms) Jul 4 11:51:07 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 7009ms (cpu time 0ms) Jul 4 11:51:14 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7010ms (cpu time 0ms) Jul 4 11:51:45 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 15030ms (cpu time 0ms) Jul 4 11:52:04 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 7007ms (cpu time 0ms) Jul 4 11:52:26 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 31050ms (cpu time 0ms) Jul 4 11:53:30 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 63099ms (cpu time 0ms) Jul 4 11:53:41 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 127260ms (cpu time 0ms) Jul 4 11:53:47 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 7001ms (cpu time 0ms) Jul 4 11:53:50 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 63099ms (cpu time 0ms) Jul 4 11:54:23 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 15023ms (cpu time 0ms) Jul 4 11:54:42 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 31090ms (cpu time 0ms) Jul 4 11:54:48 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 15038ms (cpu time 0ms) Jul 4 11:55:02 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 31080ms (cpu time 0ms) Jul 4 11:55:23 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7010ms (cpu time 0ms) Jul 4 11:55:53 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 15019ms (cpu time 0ms) Jul 4 11:56:04 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 31090ms (cpu time 0ms) Jul 4 11:56:13 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7009ms (cpu time 0ms) Jul 4 11:56:21 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 7009ms (cpu time 0ms) Jul 4 11:57:05 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 31090ms (cpu time 0ms) Jul 4 11:57:51 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 15029ms (cpu time 0ms) Jul 4 11:57:54 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 15025ms (cpu time 0ms) Jul 4 11:58:28 vyos ripd[3780]: SLOW THREAD: task smux_connect (7f25dc700440) ran for 7009ms (cpu time 0ms) Jul 4 11:58:43 vyos bgpd[3791]: SLOW THREAD: task smux_connect (7fbceda2c440) ran for 15036ms (cpu time 0ms) Jul 4 11:59:05 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 7009ms (cpu time 0ms) Jul 4 11:59:40 vyos ospfd[3785]: SLOW THREAD: task smux_connect (7f19839e6440) ran for 15019ms (cpu time 0ms)

Hardware Info:

[code]----------------
Vyatta Version and Package Changes

Version: VyOS 1.1.7
Description: VyOS 1.1.7 (helium)
Copyright: 2016 VyOS maintainers and contributors
Built by: maintainers@vyos.net
Built on: Wed Feb 17 09:57:31 UTC 2016
Build ID: 1602170957-4459750
System type: x86 64-bit
Boot via: image
HW model: ProLiant DL360e Gen8
HW S/N: CZJ3500BFQ
HW UUID: 34373030-3635-435A-4A33-353030424651
Uptime: 12:01:36 up 8:11, 2 users, load average: 0.00, 0.02, 0.05[/code]

BGP Configuration:

bgp $ASNUM { address-family { ipv6-unicast { network $IPv6_/29 { } } } neighbor $IPv4_PEER { remote-as $REM_ASNUM update-source $IPv4_IP } neighbor $IPv6_PEER { address-family { ipv6-unicast { } } remote-as $REM_ASNUM update-source $IPv6_IP } network $NET1_/24 { } network $NET2_/24 { } network $NET3_/24 { } network $NET4_/24 { } network $NET5_/24 { } network $NET6_/24 { } network $NET7_/22 { } parameters { router-id $IPv4_IP } }

top:

pu0 : 0.1%us, 0.0%sy, 0.0%ni, 91.7%id, 0.0%wa, 0.0%hi, 8.2%si, 0.0%st Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 0.1%us, 0.1%sy, 0.0%ni, 94.6%id, 0.0%wa, 0.0%hi, 5.1%si, 0.0%st Cpu3 : 0.1%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 98.5%id, 0.0%wa, 0.0%hi, 1.4%si, 0.0%st Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 98.3%id, 0.0%wa, 0.0%hi, 1.7%si, 0.0%st Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu8 : 0.1%us, 0.1%sy, 0.0%ni, 98.1%id, 0.0%wa, 0.0%hi, 1.7%si, 0.0%st Cpu9 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu10 : 0.1%us, 0.0%sy, 0.0%ni, 98.1%id, 0.0%wa, 0.0%hi, 1.7%si, 0.0%st Cpu11 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu12 : 0.0%us, 0.0%sy, 0.0%ni, 98.4%id, 0.0%wa, 0.0%hi, 1.6%si, 0.0%st Cpu13 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu14 : 0.1%us, 0.0%sy, 0.0%ni, 98.1%id, 0.0%wa, 0.0%hi, 1.8%si, 0.0%st Cpu15 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu16 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu17 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu18 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu19 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu20 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu21 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu22 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu23 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 0 0 7546832 83080 108264 0 0 0 0 1 18 0 1 99 0 0 0 0 7547668 83080 108264 0 0 0 8 137149 372 0 1 99 0 0 0 0 7550036 83080 108264 0 0 0 0 125155 266 0 0 100 0 0 0 0 7548612 83080 108264 0 0 0 0 126945 314 0 0 100 0 0 0 0 7545940 83080 108264 0 0 0 0 119227 307 0 1 99 0 0 0 0 7551220 83080 108264 0 0 0 0 117026 304 0 0 100 0 0 0 0 7548992 83080 108264 0 0 0 0 120597 389 0 1 99 0 0 0 0 7549240 83080 108264 0 0 0 0 136031 206 0 0 100 0 0 0 0 7550232 83088 108256 0 0 0 32 148895 315 0 0 100 0 0 0 0 7547544 83088 108264 0 0 0 0 147729 339 0 1 99 0 0 0 0 7546764 83088 108264 0 0 0 20 156677 362 0 1 99 0 0 0 0 7546668 83088 108264 0 0 0 0 159350 263 0 0 100 0 0 0 0 7545548 83088 108264 0 0 0 0 168053 341 0 1 99 0 0 0 0 7547564 83088 108264 0 0 0 0 173445 280 0 1 99 0 1 0 0 7547532 83088 108264 0 0 0 0 173424 232 0 1 99 0 0 0 0 7547788 83096 108256 0 0 0 60 171006 327 0 1 99 0 0 0 0 7547276 83096 108264 0 0 0 0 173452 350 0 1 99 0 0 0 0 7547340 83096 108268 0 0 0 0 167156 335 0 1 99 0 0 0 0 7545996 83096 108268 0 0 0 0 171909 307 0 1 99 0 0 0 0 7546348 83096 108268 0 0 0 0 163887 295 0 1 99 0 0 0 0 7547116 83104 108260 0 0 0 20 164847 376 0 1 99 0 0 0 0 7546412 83104 108260 0 0 0 8 159988 302 0 1 99 0 0 0 0 7545644 83104 108268 0 0 0 0 153223 339 0 0 100 0 0 0 0 7548268 83104 108268 0 0 0 0 163509 284 0 0 100 0 0 0 0 7548332 83104 108268 0 0 0 0 159174 297 0 1 99 0 0 0 0 7547308 83104 108268 0 0 0 0 165399 286 0 1 99 0 0 0 0 7548204 83104 108268 0 0 0 0 166225 237 0 1 99 0

I have even configured the remote router to only send the default route 0.0.0.0/0 but even then this issue occurs. The whole routing table has only 10 entries.
HP Server has Hyperthreading enabled and Power Mode set to High Static Performance.
8 GB RAM installed and CPU is 2x Intel® Xeon® CPU E5-2430 v2 @ 2.50GHz.

Traffic: 100k pps in / 100k pps out, 150 Mbit/s in, 200 Mbit/s out.

Any Ideas why this issue occurs?


#2

Hello,
most likely it’s some obscure bug in quagga,
I found similar reports on ERL devices