Memory Leak on VyOS 1.2 (20180921) consumes 6GB in less than 7 days

vyos-memory.txt (31.5 KB)

The setup is as follows:

  • 1 iBGP with full routing (receive less than 700k)
  • 1 eBGP v4 with full routing (receive ± 700k)
  • 1 eBGP v6 with full routing (receive ± 60k)
  • 2 eBGP v4 (receive ± 18k)
  • 3 OSPF neighbors (P2P) and ± 200 IGP routes

The problem is not the traffic volume, which is less than 1.5Gbps aggregated.
I’m using Mellanox ConnectX-3 10G and Intel 1G boards.

Thanks for the info you provided.
We will need the same info as the usage grows.
Could you provide it every two days or so?
I really appreciate your help.

System Uptime
uptime
23:12:28 up 2 days, 8:17, 1 user, load average: 0.20, 0.12, 0.10

Free Memory (8h after last post)
free -h
             total       used       free     shared    buffers     cached
Mem:          7.8G       5.9G       1.9G        19M       114M       164M
-/+ buffers/cache:       5.6G       2.1G
Swap:           0B         0B         0B

show memory (8h after last post)
vyos-memory-8h.txt (31.4 KB)

The router needed to be restarted.

run show system uptime
05:16:59 up 2 days, 14:22, 1 user, load average: 0.02, 0.08, 0.08

show system memory
Total: 7982
Free: 1793
Used: 6189

vyos-memory-14h.txt (31.5 KB)

Last Saturday, the router killed the BGP process after free memory dropped below 1GB.

After the reboot, the error below appeared from charon. BGP sessions without an MD5 password did not come up. I stopped this process and restarted the server again, and then the BGP sessions started normally.

Could ipsec/charon be the cause of the memory leak?
Now:
show system memory
Total: 7982
Free: 5930
Used: 2052

Oct 16 05:41:37 ROUTER-BGP vyatta-router[1915]: Starting VyOS router: migrate rl-system firewall configure.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Reloading.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Started VyOS Router.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Getty on tty1...
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Started Getty on tty1.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Login Prompts.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Reached target Login Prompts.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting LSB: AWS EC2 instance init script to fetch and load ssh public key...
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Started LSB: AWS EC2 instance init script to fetch and load ssh public key.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Multi-User System.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Reached target Multi-User System.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Graphical Interface.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Reached target Graphical Interface.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Update UTMP about System Runlevel Changes...
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Started Update UTMP about System Runlevel Changes.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Startup finished in 18.836s (kernel) + 2min 36.775s (userspace) = 2min 55.611s.
Oct 16 05:41:38 ROUTER-BGP kernel: [  176.082887] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based  firewall rule not found. Use the iptables CT target to attach helpers instead.
Oct 16 05:41:38 ROUTER-BGP charon: 04[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:46 ROUTER-BGP charon: 11[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:48 ROUTER-BGP charon: 04[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:50 ROUTER-BGP charon: 10[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:50 ROUTER-BGP rsyslogd0: action 'action 3' resumed (module 'builtin:omfwd') [try http://www.rsyslog.com/e/0 ]
Oct 16 05:41:50 ROUTER-BGP rsyslogd-2359: action 'action 3' resumed (module 'builtin:omfwd') [try http://www.rsyslog.com/e/2359 ]
Oct 16 05:41:51 ROUTER-BGP charon: 04[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:52 ROUTER-BGP charon: 06[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:54 ROUTER-BGP charon: 10[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:55 ROUTER-BGP charon: 07[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:56 ROUTER-BGP charon: 03[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:57 ROUTER-BGP charon: 09[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:58 ROUTER-BGP charon: 09[KNL] unable to receive from RT event socket No buffer space available (105)

vtysh -c "show memory" after reboot
vyos-memory-0h-reboot.txt (31.5 KB)

I found backtrace logs from when memory ran out and the BGP process was stopped.
kernel-backtrace.txt (10.0 KB)

Thanks for that. I see the oom-killer was invoked, but it’s not clear what actually consumed all the memory to trigger it.

strongSwan could be the root cause, but I’m not 100% sure at this point.

Below, memory status 26 hours ago

show system memory
Total: 7982
Free: 5930
Used: 2052

Below, memory status now (26 hours after last print)

show system memory
Total: 7982
Free: 4170
Used: 3812

About 1760MB of RAM consumed in 26 hours, which is roughly 1.6GB per day.
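
As a cross-check, the consumption rate can be worked out from the two `show system memory` readings above (5930 MB free, then 4170 MB free 26 hours later). A minimal Python sketch of that arithmetic; the function name is illustrative only:

```python
def leak_rate_mb_per_day(free_start_mb, free_end_mb, hours_elapsed):
    """Average drop in free memory, scaled to MB per 24 hours."""
    consumed_mb = free_start_mb - free_end_mb
    return consumed_mb / hours_elapsed * 24


# Readings from the two "show system memory" outputs above.
rate = leak_rate_mb_per_day(5930, 4170, 26)
print(f"{rate:.0f} MB/day")  # prints "1625 MB/day"
```

The result depends on which pair of readings is used; a longer interval between readings should give a steadier estimate.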

VyOS-mem-26h_uptime.txt (31.6 KB)

Looks like the bgpd process itself.

Oct 13 14:42:56 SERVER-BGP kernel: [975257.727898] bgpd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[…]
Oct 13 14:42:56 SERVER-BGP kernel: [975257.837299] Killed process 2028 (bgpd) total-vm:1182792kB, anon-rss:988988kB, file-rss:0kB, shmem-rss:0kB
Oct 13 14:42:56 SERVER-BGP kernel: [975257.980689] oom_reaper: reaped process 2028 (bgpd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I think what you see from charon is a subsequent issue due to the memory shortage.
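
The oom-killer line above already says how much memory bgpd itself held when it was killed. A minimal Python sketch that pulls those fields out of the kernel message; the regular expression is written against the exact line format shown here and is an assumption for other kernel versions:

```python
import re

# Pattern matched against the "Killed process" line format shown in this log.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return pid, process name and sizes in kB, or None if no match."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match["pid"]),
        "name": match["name"],
        "total_vm_kb": int(match["total_vm"]),
        "anon_rss_kb": int(match["anon_rss"]),
    }


line = (
    "Oct 13 14:42:56 SERVER-BGP kernel: [975257.837299] Killed process 2028 "
    "(bgpd) total-vm:1182792kB, anon-rss:988988kB, file-rss:0kB, shmem-rss:0kB"
)
info = parse_oom_kill(line)
print(info["name"], round(info["anon_rss_kb"] / 1024), "MB resident")
```

On this line bgpd held just under 1 GB of resident memory when it was killed, which can be compared with the total used memory reported by `show system memory`.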

Could “Increased memory usage with 2018-01-08 HEAD” (FRRouting/frr issue #1610 on GitHub) be the issue?
Or “Crazy memory usage” (FRRouting/frr issue #2527 on GitHub)?

Below, memory status now (48 hours uptime)

show system memory
Total: 7982
Free: 2668
Used: 5314

VyOS-mem-48h_uptime.txt (31.7 KB)
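
To estimate how long the router can run before free memory is exhausted, a straight line can be fitted through the three readings so far. This is a rough Python sketch that assumes the 5930 MB reading was taken at about 0 hours uptime and that the leak is linear; the helper name is illustrative:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Assumed uptimes in hours and the "Free" values (MB) reported in this thread.
uptime_h = [0, 26, 48]
free_mb = [5930, 4170, 2668]

slope, intercept = fit_line(uptime_h, free_mb)
hours_to_zero = -intercept / slope
print(f"{slope:.1f} MB/h, free memory reaches 0 after about {hours_to_zero:.0f} h")
```

The oom-killer would likely fire before free memory actually reaches zero, so the projected time is an upper bound at best.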

So it does not look like an FRR issue, but an issue with something else.
Now we need to investigate what exactly is eating up all the memory.

So the question from here is how we approach this.
Someone on the FRR channel suggested a possible leak in the NIC drivers (they observed that behaviour in the past). That may or may not be the case here.

Thanks @hagbard, I set up a simple test to try something that could help.
I installed 7 routers (VyOS 20180921) using libvirt, roughly simulating the topology of the setup I posted earlier.

  • 3 routers are external routers from an ASN, with only eBGP to the “test AS”

  • 2 routers are internal to the same ASN, each with 1 eBGP session to the routers above and iBGP between them. They are connected to the IGP via OSPF

  • 2 internal routers run only OSPF

  • The scenario has been running for 3 days and 10 hours

  • All routers have 2GB RAM, and I checked memory every 10 minutes (492 logs) with “cat /proc/meminfo | grep MemFree”

  • On 1 of the test ASN BGP routers (VYOS 2) I connected another eBGP session only to inject 710k real routes into my scenario
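
The 10-minute check described above can also be scripted. A minimal Python sketch that extracts the same `MemFree` field from `/proc/meminfo`-style text; the function names are illustrative, and scheduling (for example via cron) is left out:

```python
def parse_memfree_kb(meminfo_text):
    """Return the MemFree value in kB from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemFree:"):
            # Line format: "MemFree:        1799368 kB"
            return int(line.split()[1])
    raise ValueError("MemFree not found")


def read_memfree_kb(path="/proc/meminfo"):
    """Read the file once and return MemFree in kB (Linux only)."""
    with open(path) as f:
        return parse_memfree_kb(f.read())


# Demo on a sample string so the sketch runs anywhere.
sample_text = "MemTotal:        2048000 kB\nMemFree:         1799368 kB\n"
print(parse_memfree_kb(sample_text))  # prints 1799368
```

On a router, `read_memfree_kb()` would return the live value, which could be appended to a log file together with a timestamp.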

>> VYOS 1 =================

Receiving only 2 routes (this is an external router)
MemFree: 1799368 kB (read 1)
MemFree: 1779640 kB (read 492)

vyos@vyos1# run show ip bgp summary
BGP router identifier 10.1.1.1, local AS number 65001 vrf-id 0
BGP table version 7
RIB entries 13, using 2080 bytes of memory
Peers 1, using 21 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.10.101.2 4 65002 4953 4958 0 0 0 3d10h28m 2

>> VYOS 2 =================

Receiving 710k routes (advertised by a real BGP router), propagating 710k to VYOS 5 over eBGP
and 710k to VYOS 6 over iBGP, plus 2 OSPF sessions
MemFree: 486452 kB (read 1)
MemFree: 456364 kB (read 492)

vyos@vyos2# run show ip bgp summary
IPv4 Unicast Summary:
BGP router identifier 10.1.1.2, local AS number 65002 vrf-id 0
BGP table version 5541888
RIB entries 1334424, using 204 MiB of memory
Peers 4, using 82 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.1.1.6 4 65002 4951 1062941 0 0 0 3d10h28m 0
10.10.101.1 4 65001 4959 4954 0 0 0 3d10h29m 5
10.10.104.2 4 65003 1062932 1062940 0 0 0 3d10h29m 1
192.168.255.1 4 265048 1287699 1062936 0 0 0 3d10h26m 733680

vyos@vyos2# run show ip ospf neighbor
Neighbor ID Pri State Dead Time Address Interface RXmtL RqstL DBsmL
10.1.1.3 1 Full/DROther 36.181s 10.10.102.2 eth0.102:10.10.102.1 0 0 0
10.1.1.6 1 Full/DROther 33.742s 10.10.105.2 eth0.105:10.10.105.1 0 0 0

vyos@vyos2# run show ip route summary
Route Source Routes FIB (vrf Default-IP-Routing-Table)
connected 6 6
static 1 0
ospf 8 6
ebgp 733770 733768
ibgp 0 0

Totals 733785 733780

>> VYOS 3 =================

OSPF-only router, connected to VYOS2
MemFree: 1800856 kB (read 1)
MemFree: 1782076 kB (read 492)

vyos@vyos3# run show ip ospf neighbor
Neighbor ID Pri State Dead Time Address Interface RXmtL RqstL DBsmL
10.1.1.2 1 Full/DROther 35.868s 10.10.102.1 eth0.102:10.10.102.2 0 0 0
10.1.1.4 1 Full/DROther 33.287s 10.10.103.2 eth0.103:10.10.103.1 0 0 0

>> VYOS 4 =================

OSPF-only router, connected to VYOS3
MemFree: 1805392 kB (read 1)
MemFree: 1778508 kB (read 492)

vyos@vyos4# run show ip ospf neighbor
Neighbor ID Pri State Dead Time Address Interface RXmtL RqstL DBsmL
10.1.1.3 1 Full/DROther 39.095s 10.10.103.1 eth0.103:10.10.103.2 0 0 0

>> VYOS 5 =================

eBGP with VYOS2, receiving 700k routes
MemFree: 728244 kB (read 1)
MemFree: 702180 kB (read 492)

vyos@vyos5# run show ip bgp summary
IPv4 Unicast Summary:
BGP router identifier 10.1.1.5, local AS number 65003 vrf-id 0
BGP table version 5541930
RIB entries 1334412, using 204 MiB of memory
Peers 1, using 21 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.10.104.1 4 65002 1062962 1062954 0 0 0 3d10h29m 733680

>> VYOS 6 =================

iBGP and OSPF with VYOS2, eBGP with VYOS7 (down session)
MemFree: 340440 kB (read 1)
MemFree: 315496 kB (read 492)

vyos@vyos6# run show ip bgp summary
IPv4 Unicast Summary:
BGP router identifier 10.1.1.6, local AS number 65002 vrf-id 0
BGP table version 5541994
RIB entries 1334431, using 204 MiB of memory
Peers 2, using 41 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.1.1.2 4 65002 1062987 4954 0 0 0 3d10h30m 733692
10.10.106.2 4 65004 228573 228960 0 0 0 3d09h23m Connect

vyos@vyos6# run show ip ospf neighbor
Neighbor ID Pri State Dead Time Address Interface RXmtL RqstL DBsmL
10.1.1.2 1 Full/DROther 34.820s 10.10.105.1 eth0.105:10.10.105.2 0 0 0

>> VYOS 7 =================

eBGP with VYOS6, but the session only shows connection problems (BGP down).
MemFree: 624124 kB (read 1)
MemFree: 697864 kB (read 492)

vyos@vyos7# run show ip bgp summary
IPv4 Unicast Summary:
BGP router identifier 10.10.106.2, local AS number 65004 vrf-id 0
BGP table version 2968951
RIB entries 0, using 0 bytes of memory
Peers 1, using 21 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.10.106.1 4 65002 228561 228573 0 0 0 3d09h23m Connect

==============================

  1. On VYOS there is a down BGP session, and 30MB of memory remains in use. Is this normal? If not, could deleted routes (from router 7, the other BGP routers, and the VyOS box that prompted this bug report) remain allocated?

  2. 30MB in 3 days is not much memory compared with 6GB in 4 days.
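
For comparison, the drop in free memory on each lab router between the first and last reading can be tabulated from the MemFree values listed above. A small Python sketch of that subtraction; the dictionary layout and function name are just for illustration:

```python
# (first, last) MemFree readings in kB, copied from the per-router output above.
readings_kb = {
    "VYOS 1": (1799368, 1779640),
    "VYOS 2": (486452, 456364),
    "VYOS 3": (1800856, 1782076),
    "VYOS 4": (1805392, 1778508),
    "VYOS 5": (728244, 702180),
    "VYOS 6": (340440, 315496),
    "VYOS 7": (624124, 697864),
}


def drop_mb(first_kb, last_kb):
    """Decrease in free memory in MB; negative means free memory grew."""
    return (first_kb - last_kb) / 1024


for name, (first, last) in readings_kb.items():
    print(f"{name}: {drop_mb(first, last):+.1f} MB")
```

On these numbers every router except VYOS 7 lost between roughly 18 and 30 MB of free memory, including the OSPF-only routers, while VYOS 7 ended with more free memory than it started with.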

@syncer, if this happens, how could we see this?

root@ROUTER-BGP:/home/luis# lspci | grep Eth
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
13:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
13:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
17:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]

root@ROUTER-BGP:/home/luis# dmesg | grep eth
[ 2.154739] e1000e 0000:13:00.0 eth0: (PCI Express:2.5GT/s:Width x4) 00:15:17:24:75:18
[ 2.154742] e1000e 0000:13:00.0 eth0: Intel(R) PRO/1000 Network Connection
[ 2.154818] e1000e 0000:13:00.0 eth0: MAC: 0, PHY: 4, PBA No: C57721-005
[ 2.344709] e1000e 0000:13:00.1 eth1: (PCI Express:2.5GT/s:Width x4) 00:15:17:24:75:19
[ 2.344712] e1000e 0000:13:00.1 eth1: Intel(R) PRO/1000 Network Connection
[ 2.344789] e1000e 0000:13:00.1 eth1: MAC: 0, PHY: 4, PBA No: C57721-005
[ 2.900428] bnx2 0000:03:00.0 eth2: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem f8000000, IRQ 16, node addr 00:1e:0b:ca:af:28
[ 3.800376] bnx2 0000:05:00.0 eth3: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem fa000000, IRQ 17, node addr 00:1e:0b:ca:af:26
[ 10.283450] mlx4_en: eth4: Link Up

I have another server (the same HP ProLiant DL380), also with 1 Mellanox ConnectX-3 board, but with 4x1Gbps Intel ports instead of 2x1Gbps.
The server has been powered on for 20 days with a default VyOS configuration, and memory usage is normal.

root@SERVER2-BGP:/home/vyos# free -h
             total       used       free     shared    buffers     cached
Mem:          7.8G       624M       7.2G        16M       138M       157M
-/+ buffers/cache:       328M       7.5G
Swap:           0B         0B         0B

root@SERVER2-BGP:/home/vyos# uptime
16:41:29 up 20 days, 12:41, 1 user, load average: 0.00, 0.00, 0.00
root@SERVER2-BGP:/home/vyos#

Boards are similar

root@SERVER2-BGP:/home/vyos# lspci | grep Eth
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
15:00.0 Ethernet controller: Intel Corporation 82575GB Gigabit Network Connection (rev 02)
15:00.1 Ethernet controller: Intel Corporation 82575GB Gigabit Network Connection (rev 02)
16:00.0 Ethernet controller: Intel Corporation 82575GB Gigabit Network Connection (rev 02)
16:00.1 Ethernet controller: Intel Corporation 82575GB Gigabit Network Connection (rev 02)
18:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
root@SERVER2-BGP:/home/vyos# dmesg -T | grep eth
[Thu Jan 20 03:59:32 2011] ACPI Error: Method parse/execution failed _SB._OSC, AE_AML_BUFFER_LIMIT (20180531/psparse-516)
[Thu Jan 20 03:59:34 2011] igb 0000:15:00.0: eth0: (PCIe:2.5Gb/s:Width x4) 00:1b:21:40:db:00
[Thu Jan 20 03:59:34 2011] igb 0000:15:00.0: eth0: PBA No: Unknown
[Thu Jan 20 03:59:34 2011] bnx2 0000:03:00.0 eth1: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem f8000000, IRQ 16, node addr 00:1e:0b:cc:12:e4
[Thu Jan 20 03:59:34 2011] igb 0000:15:00.1: eth2: (PCIe:2.5Gb/s:Width x4) 00:1b:21:40:db:01
[Thu Jan 20 03:59:34 2011] igb 0000:15:00.1: eth2: PBA No: Unknown
[Thu Jan 20 03:59:35 2011] igb 0000:16:00.0: eth3: (PCIe:2.5Gb/s:Width x4) 00:1b:21:40:db:04
[Thu Jan 20 03:59:35 2011] igb 0000:16:00.0: eth3: PBA No: Unknown
[Thu Jan 20 03:59:35 2011] igb 0000:16:00.1: eth4: (PCIe:2.5Gb/s:Width x4) 00:1b:21:40:db:05
[Thu Jan 20 03:59:35 2011] igb 0000:16:00.1: eth4: PBA No: Unknown
[Thu Jan 20 03:59:35 2011] bnx2 0000:05:00.0 eth5: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem fa000000, IRQ 17, node addr 00:1e:0b:cc:12:e2
[Thu Jan 20 03:59:42 2011] mlx4_en: eth6: Link Up
[Thu Jan 20 03:59:42 2011] mlx4_en: eth7: Link Up
root@SERVER2-BGP:/home/vyos#

The affected servers use e1000e.
Can you confirm that both affected servers use the e1000e driver?

In rc4 we added the atop tool, so it will be a little easier to collect the required info.

What about running the driver with debug enabled? That would be something like 'modprobe e1000e debug=16' (the debug level ranges from 0 to 16; 16 logs everything and is very noisy, so make sure you have enough disk space). I also see issues with the e100 and e1000 in a virtual environment, where the kernel resets and reloads the driver. I’m going to test the setup above and leave the boxes running to try to reproduce it.