VyOS becomes unresponsive with a large number of BGP sessions


#1

Dear colleagues,

I migrated a router from Vyatta to VyOS 1.1.0, and instability appeared right after the upgrade. The router has a lot of BGP sessions (around 100), and from time to time, for no apparent reason, it becomes unresponsive for around 15-20 seconds.

Please find below some logs with call traces:

[2652491.189858] Call Trace:
[2652491.189863] [] ? default_idle+0x1b/0x2c
[2652491.189867] [] ? cpu_startup_entry+0x132/0x1a9
[2652491.189871] [] ? start_secondary+0x25e/0x263
[2652491.189873] Code: d8 c3 0f 22 df c3 0f 20 e0 c3 0f 20 e0 c3 0f 22 e7 c3 44 0f 20 c0 c3 44 0f 22 c7 c3 0f 09 c3 9c 58 c3 57 9d c3 fa c3 fb c3 fb f4 f4 c3 8b 07 53 49 89 c9 49 89 d0 8b 0a 0f a2 89 07 89 1e 41
[2652491.189909] NMI backtrace for cpu 15
[2652491.189915] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 3.13.11-1-amd64-vyos #1
[2652491.189929] Hardware name: Supermicro H8DGU/H8DGU, BIOS 2.0a 11/08/2011
[2652491.189941] task: ffff8802168d2a00 ti: ffff8802168e0000 task.ti: ffff8802168e0000
[2652491.189949] RIP: 0010:[] [] native_safe_halt+0x2/0x3
[2652491.189965] RSP: 0018:ffff8802168e1ed0 EFLAGS: 00000246
[2652491.189976] RAX: 0000000000000000 RBX: ffff8802168e0010 RCX: ffff880617cc0000
[2652491.189985] RDX: 0000000000000000 RSI: 000000000000000f RDI: 0000000000000001
[2652491.189992] RBP: ffff8802168e0000 R08: 0000000000000000 R09: ffff8802168e0000
[2652491.189999] R10: 0140000000000000 R11: ffff8802168e0010 R12: ffff8802168e0000
[2652491.190006] R13: ffff8802168e0010 R14: ffff8802168e0010 R15: ffff8802168e0010
[2652491.190014] FS: 00007f303a10b700(0000) GS:ffff880617cc0000(0000) knlGS:0000000000000000
[2652491.190022] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[2652491.190029] CR2: ffffffffff600400 CR3: 000000080e068000 CR4: 00000000000007e0
[2652491.190040] Stack:
[2652491.190046] ffffffff8101650b ffff8802168e0000 ffffffff81083b41 0000000000013080
[2652491.190070] 979237b1cc46d5b6 0000000000000282 00000000300001df 0000000000000000
[2652491.190109] 0000000000000000 0000000000000000 0000000000000000 0000000000000000

[2652491.186377] Call Trace:
[2652491.186379]
[2652491.186380] [] ? default_send_IPI_mask_sequence_phys+0x9e/0xc7
[2652491.186387] [] ? arch_trigger_all_cpu_backtrace+0x44/0x6c
[2652491.186391] [] ? rcu_check_callbacks+0x1f8/0x5a0
[2652491.186395] [] ? tick_nohz_handler+0xce/0xce
[2652491.186398] [] ? update_process_times+0x31/0x56
[2652491.186402] [] ? tick_sched_timer+0x74/0x90
[2652491.186405] [] ? __run_hrtimer+0x92/0x11c
[2652491.186409] [] ? hrtimer_interrupt+0xde/0x1ec
[2652491.186413] [] ? read_tsc+0x5/0x16
[2652491.186416] [] ? tick_check_idle+0x43/0x93
[2652491.186420] [] ? smp_apic_timer_interrupt+0x1d/0x2d
[2652491.186424] [] ? apic_timer_interrupt+0x6d/0x80
[2652491.186426]
[2652491.186427] [] ? native_safe_halt+0x2/0x3
[2652491.186433] [] ? default_idle+0x1b/0x2c
[2652491.186437] [] ? cpu_startup_entry+0x132/0x1a9
[2652491.186441] [] ? start_secondary+0x25e/0x263
[2652491.186443] Code: f2 48 89 f0 0f 83 8a 00 00 00 48 89 d1 49 89 d0 48 c1 e9 06 49 83 e0 c0 4c 8d 0c cf 48 89 f7 4c 29 c7 83 e2 3f 74 39 48 83 c8 ff <88> d1 48 d3 e0 49 23 01 48 83 ff 3f 76 3b 48 85 c0 75 4e 49 83
[2652491.186483] NMI backtrace for cpu 1
[2652491.186495] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.13.11-1-amd64-vyos #1
[2652491.186498] Hardware name: Supermicro H8DGU/H8DGU, BIOS 2.0a 11/08/2011
[2652491.186501] task: ffff88021689c600 ti: ffff8802168bc000 task.ti: ffff8802168bc000
[2652491.186504] RIP: 0010:[] [] native_safe_halt+0x2/0x3
[2652491.186511] RSP: 0018:ffff8802168bded0 EFLAGS: 00000246
[2652491.186513] RAX: 0000000000000000 RBX: ffff8802168bc010 RCX: ffff880217c40000
[2652491.186515] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001
[2652491.186518] RBP: ffff8802168bc000 R08: 0000000000000000 R09: ffff8802168bc000
[2652491.186520] R10: 0140000000000000 R11: ffff8802168bc010 R12: ffff8802168bc000
[2652491.186523] R13: ffff8802168bc010 R14: ffff8802168bc010 R15: ffff8802168bc010
[2652491.186526] FS: 00007f303a10b700(0000) GS:ffff880217c40000(0000) knlGS:0000000000000000
[2652491.186528] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[2652491.186530] CR2: ffffffffff600400 CR3: 000000041174b000 CR4: 00000000000007e0
[2652491.186532] Stack:
[2652491.186534] ffffffff8101650b ffff8802168bc000 ffffffff81083b41 0000000000013080
[2652491.186538] 2dbacbf22979a2fc 0000000000000282 00000000300001df 0000000000000000
[2652491.186542] 0000000000000000 0000000000000000 0000000000000000 0000000000000000


#2

I recently found a similar issue with VyOS 1.1.7. If I run a full BGP feed (580k prefixes), my SSH sessions hang, but the console seems stable. Just FYI! It may be more stable in this version than in 1.1.0.


#3

Is this running on physical hardware or in a VM?
I found a fix for a similar issue: it was my VMware configuration type and NIC drivers.
Just FYI!


#4

Hello beaven67, this was on physical hardware.
Anyway, I didn’t find a fix and replaced the router.


#5

Hello,
Can you provide the specs of the hardware used?

We are also working on the 1.2 beta; you may want to test it.

Thanks for feedback!


#6

I had the same issue.


#7

Just wondering, do you have NetFlow collection enabled?


#8

I am also experiencing instability with the VyOS router of ours that handles the most BGP connections. We occasionally need to reboot the router for it to be responsive again. We are running VyOS 1.1.7 on a bare-metal build, so I’m wondering if there’s an issue with the hardware or firmware. This issue has persisted through a couple of build upgrades, but it is only affecting one of our VyOS routers, albeit the most heavily utilized one.

Generally, when the issues begin, SSH on the router becomes unresponsive.

We do have NetFlow collection enabled. I’m parsing through the logs at /var/log/messages to see if anything stands out, but any help would be much appreciated!

Thanks.


#9

Hello,
I would recommend disabling NetFlow if possible; we have received several reports of this type of issue.
This will be addressed in 1.2.x.


#10

Thank you for the tip. I will remove NetFlow for now and may try to implement it again when 1.2.x is released.


#11

Well, we made it a couple of weeks, but unfortunately the VyOS router became unresponsive again. I had previously removed the NetFlow configuration commands. We are running VyOS on a bare-metal install with Broadcom NICs. I see some people running VyOS in a VM are having issues with the VMXNET3 NICs, so I’m curious whether the VyOS router is having driver issues. However, we do have one other VyOS router installed bare-metal on the same hardware, and that second router has had no issues at all.

I will add that we are running the two VyOS routers in a VRRP cluster, in case there are any known bugs with VRRP. I have been trying to review the logs, but unfortunately when the VyOS router locks up it generally does not have a chance to write anything to disk before failing.


#12

That is strange.
We usually recommend using Intel NICs since they perform better.
Maybe we can set up remote logging and some monitoring via SNMP.
Please PM us; we will be happy to assist with these issues and VRRP.

Regarding the router lock-ups: does it still continue to work, but you can’t access it?
We have seen similar symptoms with VMs that temporarily lost their storage (timed out).
Can you add your hardware specs? Since you have another system with the same hardware, I would advise checking the storage on the affected server.

Thanks!


#13

Thanks for the suggestion. We are going to look into replacing the on-board Broadcom NICs with some Intel NICs. We currently have 2 Broadcom NICs (onboard) and 2 Intel NICs (expansion card) on the server that is locking up.

We do have remote logging configured, so reviewing those logs did provide some additional info. It looks like a VRRP election was forced immediately before one of the two VyOS routers in the cluster locked up. When the router locks up, it stops moving packets altogether and you cannot SSH to the device. We generally need to log into the iDRAC to reboot the VyOS instance and bring it back online.

These logs occurred on the problematic VyOS router immediately before it locked up:

Sep 8 03:39:47 PROBLEM-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Received lower prio advert, forcing new election
Sep 8 03:39:47 PROBLEM-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) IPSEC-AH : Syncing seq_num - Increment seq
Sep 8 03:39:47 PROBLEM-RTR Keepalived_vrrp: Error sending gratutious ARP on eth1v99 for XX.XX.XX.1
Sep 8 03:39:47 PROBLEM-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Sending gratuitous ARPs on eth1v99 for XX.XX.XX.1
Sep 8 03:39:47 PROBLEM-RTR Keepalived_vrrp: Error sending gratutious ARP on eth1v99 for XX.XX.XX.1
Sep 8 03:39:47 PROBLEM-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Received lower prio advert, forcing new election
Sep 8 03:39:47 PROBLEM-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) IPSEC-AH : Syncing seq_num - Increment seq
Sep 8 03:39:47 PROBLEM-RTR Keepalived_vrrp: Error sending gratutious ARP on eth1v99 for XX.XX.XX.1
Sep 8 03:39:47 PROBLEM-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Sending gratuitous ARPs on eth1v99 for XX.XX.XX.1
Sep 8 03:39:47 PROBLEM-RTR Keepalived_vrrp: Error sending gratutious ARP on eth1v99 for XX.XX.XX.1
Sep 8 03:39:48 PROBLEM-RTR Keepalived_vrrp: last message repeated 3 times
Sep 8 03:39:48 PROBLEM-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Received lower prio advert, forcing new election
Sep 8 03:39:48 PROBLEM-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) IPSEC-AH : Syncing seq_num - Increment seq
Sep 8 03:39:48 PROBLEM-RTR Keepalived_vrrp: Error sending gratutious ARP on eth1v99 for XX.XX.XX.1
Sep 8 03:39:48 PROBLEM-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Sending gratuitous ARPs on eth1v99 for XX.XX.XX.1
Sep 8 03:39:48 PROBLEM-RTR Keepalived_vrrp: last message repeated 3 times
Sep 8 03:39:48 PROBLEM-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Received lower prio advert, forcing new election
Sep 8 03:39:48 PROBLEM-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) IPSEC-AH : Syncing seq_num - Increment seq
Sep 8 03:39:48 PROBLEM-RTR Keepalived_vrrp: Error sending gratutious ARP on eth1v99 for XX.XX.XX.1
Sep 8 03:39:48 PROBLEM-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Sending gratuitous ARPs on eth1v99 for XX.XX.XX.1
Sep 8 03:39:48 PROBLEM-RTR Keepalived_vrrp: Error sending gratutious ARP on eth1v99 for XX.XX.XX.1
Sep 8 03:39:48 PROBLEM-RTR Keepalived_vrrp: Error sending gratutious ARP on eth1v99 for XX.XX.XX.1
Sep 8 03:39:49 PROBLEM-RTR Keepalived_vrrp: last message repeated 3 times
Sep 8 03:39:49 PROBLEM-RTR kernel: [215845.836832] ------------[ cut here ]------------
Sep 8 03:39:49 PROBLEM-RTR Keepalived_vrrp: last message repeated 3 times
Sep 8 03:39:49 PROBLEM-RTR kernel: [215845.836832] ------------[ cut here ]------------
Sep 8 03:39:49 PROBLEM-RTR kernel: [215845.836852] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0xf2/0x14f()
Sep 8 03:39:49 PROBLEM-RTR kernel: [215845.836856] NETDEV WATCHDOG: eth1 (tg3): transmit queue 0 timed out
Sep 8 03:39:49 PROBLEM-RTR kernel: [215845.836858] Modules linked in: macvlan ip_set xt_LOG xt_comment iptable_nat nf_nat_ipv4 ip6table_filter ip6table_raw ip6_tables iptable_filter nf_conntrack_ipv4 nf_defrag_ipv4 xt_CT nfnetlink_cthelper nfnetlink iptable_raw nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_h323 nf_conntrack_h323 nf_nat_sip nf_conntrack_sip nf_nat_proto_gre nf_nat_tftp nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative fuse ghash_clmulni_intel crc32c_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 x86_pkg_temp_thermal coretemp dcdbas evdev ipmi_si microcode hid_generic ipmi_msghandler shpchp pcspkr lpc_ich mfd_core acpi_power_meter button processor thermal_sys battery usb_storage ohci_hcd squashfs loop overlayfs ext4 jbd2 crc16 raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod ses enc
Sep 8 03:39:49 PROBLEM-RTR kernel: losure usbhid hid megaraid_sas ahci libahci tg3 igb dca i2c_algo_bit i2c_core ptp pps_core
Sep 8 03:39:49 PROBLEM-RTR kernel: [215845.836987] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.11-1-amd64-vyos #1
Sep 8 03:39:49 PROBLEM-RTR kernel: [215845.836994] Hardware name: Dell Inc. PowerEdge R320/08VT7V, BIOS 2.4.2 01/29/2015

These logs occurred on the other router in the VRRP pair around the same time:

Sep 8 03:39:46 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Transition to MASTER STATE
Sep 8 03:39:46 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Transition to MASTER STATE
Sep 8 03:39:47 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Entering MASTER STATE
Sep 8 03:39:47 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Entering MASTER STATE
Sep 8 03:39:47 OTHER-RTR netplugd[2742]: eth1v99: state DOWN flags 0x00001002 BROADCAST,MULTICAST -> 0x00011043 UP,BROADCAST,RUNNING,MULTICAST,10000
Sep 8 03:39:47 OTHER-RTR netplugd[2742]: eth1v99: state DOWN flags 0x00001002 BROADCAST,MULTICAST -> 0x00011043 UP,BROADCAST,RUNNING,MULTICAST,10000
Sep 8 03:39:47 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) setting protocol VIPs.
Sep 8 03:39:47 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Sending gratuitous ARPs on eth1v99 for XX.XX.XX.1
Sep 8 03:39:47 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99): Sending SNMP notification
Sep 8 03:39:47 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99): Sending SNMP notification vrrpTrapNewMaster
Sep 8 03:39:47 OTHER-RTR snmpd[34400]: Got trap from peer on fd 16
Sep 8 03:39:47 OTHER-RTR netplugd[36789]: /etc/netplug/netplug eth1v99 in -> pid 36789
Sep 8 03:39:47 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) setting protocol VIPs.
Sep 8 03:39:47 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Sending gratuitous ARPs on eth1v99 for XX.XX.XX.1
Sep 8 03:39:47 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99): Sending SNMP notification
Sep 8 03:39:47 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99): Sending SNMP notification vrrpTrapNewMaster
Sep 8 03:39:47 OTHER-RTR snmpd[34400]: Got trap from peer on fd 16
Sep 8 03:39:47 OTHER-RTR netplugd[36789]: /etc/netplug/netplug eth1v99 in -> pid 36789
Sep 8 03:39:47 OTHER-RTR kernel: [26171232.062462] device eth1 entered promiscuous mode
Sep 8 03:39:47 OTHER-RTR kernel: [26171232.062462] device eth1 entered promiscuous mode
Sep 8 03:39:47 OTHER-RTR netplugd[2742]: eth1v99: state INNING pid 36789 exited status 0
Sep 8 03:39:47 OTHER-RTR netplugd[2742]: eth1v99: state INNING pid 36789 exited status 0
Sep 8 03:39:49 OTHER-RTR Keepalived_vrrp: Netlink reflector reports IP fe80::200:5eff:fe00:163 added
Sep 8 03:39:49 OTHER-RTR Keepalived_vrrp: Netlink reflector reports IP fe80::200:5eff:fe00:163 added
Sep 8 03:39:52 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Sending gratuitous ARPs on eth1v99 for XX.XX.XX.1
Sep 8 03:39:52 OTHER-RTR Keepalived_vrrp: VRRP_Instance(vy-eth1-99) Sending gratuitous ARPs on eth1v99 for XX.XX.XX.1
Sep 8 03:40:56 OTHER-RTR ntpd[3582]: Listen normally on 48 eth1v99 XX.XX.XX.1 UDP 123
Sep 8 03:40:56 OTHER-RTR ntpd[3582]: Listen normally on 48 eth1v99 XX.XX.XX.1 UDP 123

As for storage, these VyOS routers each run on local storage on their own PowerEdge R320, so there are no potential iSCSI/SAN disconnect issues.

As for hardware, we are running on a PowerEdge R320 with 4 Intel CPUs, 16 GB RAM, 2 Broadcom NICs, and 2 Intel NICs. I can be more specific about the hardware if you’d like. All the NICs are rated at 1 Gbps.

Thank you!


#14

Hello,
these lines
Sep 8 03:39:49 PROBLEM-RTR kernel: [215845.836852] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0xf2/0x14f()
Sep 8 03:39:49 PROBLEM-RTR kernel: [215845.836856] NETDEV WATCHDOG: eth1 (tg3): transmit queue 0 timed out
point to issues with eth1, which uses the tg3 (Broadcom) driver.

Since you said you have similar servers and the problem is only observed on one,
I would recommend disabling the on-board NICs on the affected server and adding two additional Intel cards.
Ideally you would do that on the second server too.
Let me know how it goes.
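
For anyone who wants to scan their own logs for the same symptom, here is a minimal Python sketch that pulls the interface, driver, and queue number out of NETDEV WATCHDOG lines like the two quoted above. The function name `find_tx_timeouts` is made up for illustration, and the pattern is based only on the message format seen in this thread, so other kernels may word it differently.

```python
import re

# Matches the kernel transmit-watchdog message seen in this thread, e.g.
# "NETDEV WATCHDOG: eth1 (tg3): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_text):
    """Return an (interface, driver, queue) tuple for every watchdog hit."""
    return [
        (m.group("iface"), m.group("driver"), int(m.group("queue")))
        for m in WATCHDOG_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "Sep 8 03:39:49 PROBLEM-RTR kernel: [215845.836856] "
        "NETDEV WATCHDOG: eth1 (tg3): transmit queue 0 timed out"
    )
    for iface, driver, queue in find_tx_timeouts(sample):
        print(f"{iface} ({driver}) stalled on TX queue {queue}")
```

Running the same function over the contents of /var/log/messages, or over the remote syslog copy, should show whether the timeouts always land on the same interface and driver.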


#15

Thanks again for the suggestion. We are going to look into swapping the Broadcom NICs for Intel NICs and we will also double-check the back-end switch / cables. I may not have a chance to report back for a couple weeks but I’ll make sure to provide a status update once I have some new info.


#16

Dear All,

I have the exact same problem on about 7 boxes with tg3 drivers (Broadcom NIC).

In the logs I see this:

Sep 9 09:46:51 cr21 kernel: [80169.143428] tg3 0000:02:00.0 eth0: transmit timed out, resetting
Sep 9 09:46:51 cr21 kernel: [80170.400408] tg3 0000:02:00.0 eth0: 0x00000000: 0x165714e4, 0x00100546, 0x02000001, 0x00800000
Sep 9 09:46:51 cr21 kernel: [80170.400579] tg3 0000:02:00.0 eth0: 0x00000010: 0x97b9000c, 0x00000000, 0x97ba000c, 0x00000000

After this, there are 2 possible scenarios:
a) The router works again after 15-20 seconds of being unresponsive
b) The router totally freezes and must be rebooted

Is there a fix for this, such as a more recent driver?
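
To compare several boxes, a rough Python sketch like the one below could tally the "transmit timed out, resetting" events per host and interface from collected syslog lines. The helper name `count_tg3_resets` is hypothetical, and the pattern assumes the exact line layout shown above.

```python
import re
from collections import Counter

# Matches lines such as:
# "Sep 9 09:46:51 cr21 kernel: [80169.143428] tg3 0000:02:00.0 eth0: transmit timed out, resetting"
TG3_RESET_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:.*?"
    r"tg3 \S+ (?P<iface>\S+): transmit timed out, resetting"
)


def count_tg3_resets(lines):
    """Count tg3 transmit-timeout resets, keyed by (host, interface)."""
    counts = Counter()
    for line in lines:
        m = TG3_RESET_RE.search(line)
        if m:
            counts[(m.group("host"), m.group("iface"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Sep 9 09:46:51 cr21 kernel: [80169.143428] tg3 0000:02:00.0 eth0: transmit timed out, resetting",
        "Sep 9 09:46:51 cr21 kernel: [80170.400408] tg3 0000:02:00.0 eth0: 0x00000000: 0x165714e4, 0x00100546, 0x02000001, 0x00800000",
    ]
    for (host, iface), n in count_tg3_resets(sample).items():
        print(f"{host} {iface}: {n} reset(s)")
```

The register-dump lines that follow each reset are ignored, so only the reset events themselves are counted.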

Thanks in advance for your input.

Christian


#17

Hello Christian,
there is no fix, and we do not expect to deliver any major changes in the 1.1.x series.
1.2.x should arrive in 6 months or so; it will have somewhat newer drivers.
However, I would recommend switching to Intel NICs whenever possible.
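
Before swapping hardware, it can help to confirm which interfaces are actually bound to tg3. The sketch below is one way to do that in Python by reading the standard Linux sysfs layout, where /sys/class/net/<iface>/device/driver is a symlink to the bound driver. The function name `drivers_by_interface` is an illustrative choice, and the script has not been tested on VyOS itself.

```python
import os


def drivers_by_interface(sysfs_root="/sys/class/net"):
    """Map each interface that has a backing device to its kernel driver name."""
    result = {}
    if not os.path.isdir(sysfs_root):
        return result  # not a Linux system, or sysfs is not mounted
    for iface in sorted(os.listdir(sysfs_root)):
        driver_link = os.path.join(sysfs_root, iface, "device", "driver")
        # Purely virtual interfaces such as lo have no device/driver link.
        if os.path.exists(driver_link):
            result[iface] = os.path.basename(os.path.realpath(driver_link))
    return result


if __name__ == "__main__":
    for iface, driver in drivers_by_interface().items():
        note = "  <- Broadcom tg3" if driver == "tg3" else ""
        print(f"{iface}: {driver}{note}")
```

Any interface flagged as tg3 is a candidate for moving to an Intel port, as suggested above.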


#18

Checking in again a few weeks later. We did swap the Broadcom NICs for Intel NICs and we have had no issues so far. I’m going to keep an eye on this and if anything changes I’ll provide an update. Thanks for the assistance!


#19

Which Intel card did you put in the router?
I had some issues with some Intel cards on other Linux servers (not VyOS).


#20

Gandalf,

Sorry for the delayed response. We just used a spare Intel Pro 1000 card we had, and it has worked great. Before swapping the card, this VyOS router would lock up on us every other month, and it has run clean for several months now.