Memory Leak on VyOS 1.2 (20180921) consumes 6GB in less than 7 days

Hi,
I’m using VyOS version 1.2.0-rolling+201809210337 and there is a memory leak that consumes almost all memory in less than 10 days (about 6 GB out of 8 GB total). The router runs BGP (IPv4 and IPv6), OSPF, and some firewall rules.

I have been running version 1.2.0-rolling+201807230337 for 81 days and everything is OK.

More info about the VyOS instance with the memory leak:

show system memory detail
MemTotal: 8173860 kB
MemFree: 2939404 kB
MemAvailable: 2976244 kB
Buffers: 100916 kB
Cached: 162056 kB
SwapCached: 0 kB
Active: 1731760 kB
Inactive: 103004 kB
Active(anon): 1574924 kB
Inactive(anon): 16772 kB
Active(file): 156836 kB
Inactive(file): 86232 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 36 kB
Writeback: 0 kB
AnonPages: 1571612 kB
Mapped: 32364 kB
Shmem: 19908 kB
Slab: 284016 kB
SReclaimable: 28264 kB
SUnreclaim: 255752 kB
KernelStack: 2512 kB
PageTables: 6572 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 4086928 kB
Committed_AS: 1853752 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 1521664 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 74064 kB
DirectMap2M: 8312832 kB

What other info would be useful?

This should be reported to the FRR maintainers. I’ll check the recent changelogs to see if there were any related fixes, and if I find anything I’ll make an updated package for you to test.

Meanwhile, could you post the output of vtysh -c 'sh run' somewhere so that a bug report can be created?

Please also dump vtysh -c "show memory".
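If it helps, the requested output can be dumped straight to files for attaching, e.g. (plain shell from the VyOS prompt; may need sudo depending on the user, and the /tmp file names are just examples):

vtysh -c 'show running-config' > /tmp/frr-running-config.txt
vtysh -c 'show memory' > /tmp/frr-show-memory.txt
cat /proc/meminfo > /tmp/meminfo.txt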

Can you tell us a bit more about your current setup? Do you have several full feeds?
It would also be great if you could provide a sanitized config.

Hi,
I can’t attach a TXT file with all of the requested information, and putting it all into a comment would be too big. Please allow me to attach files so I can provide more information.

show system memory 
Total: 7982
Free:  2704
Used:  5278

show ip route summary 
Route Source         Routes               FIB  (vrf Default-IP-Routing-Table)
connected            10                   10                   
static               7                    0                    
ospf                 206                  200                  
ebgp                 717286               717286               
ibgp                 2925                 2925                 
------
Totals               720434               720421

vyos-config.txt (19.7 KB)

vyos-memory.txt (31.5 KB)

The setup is like this:

  • 1 iBGP session with a full table (receives slightly under 700k routes)
  • 1 eBGP IPv4 session with a full table (receives ±700k routes)
  • 1 eBGP IPv6 session with a full table (receives ±60k routes)
  • 2 eBGP IPv4 sessions (receive ±18k routes)
  • 3 OSPF neighbors (P2P) and ±200 IGP routes

The problem is not the amount of traffic, which is less than 1.5 Gbps aggregated.
I’m using Mellanox x3 10G and Intel 1G NICs.

Thanks for the provided info.
We will need the same info again as memory usage grows.
Could you maybe provide it every two days or so?
Your help is really appreciated.
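If collecting it by hand every time is a hassle, a small script run from cron could take the snapshots automatically. A minimal sketch (the script path, output directory, and 12-hour interval are just assumptions):

#!/bin/bash
# /config/scripts/collect-mem.sh - hypothetical helper: snapshot memory state for later comparison
mkdir -p /config/mem-snapshots
OUT=/config/mem-snapshots/$(date +%Y%m%d-%H%M).txt
{
  uptime
  free -m
  cat /proc/meminfo
  vtysh -c 'show memory'
} > "$OUT"

# example crontab entry, every 12 hours:
# 0 */12 * * * /config/scripts/collect-mem.sh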

System Uptime
uptime
23:12:28 up 2 days, 8:17, 1 user, load average: 0.20, 0.12, 0.10

Free Memory (8h after last post)
free -h
total used free shared buffers cached
Mem: 7.8G 5.9G 1.9G 19M 114M 164M
-/+ buffers/cache: 5.6G 2.1G
Swap: 0B 0B 0B

show memory (8h after last post)
vyos-memory-8h.txt (31.4 KB)

It was necessary to restart the router.

run show system uptime
05:16:59 up 2 days, 14:22, 1 user, load average: 0.02, 0.08, 0.08

show system memory
Total: 7982
Free: 1793
Used: 6189

vyos-memory-14h.txt (31.5 KB)

Last Saturday, the router killed the BGP process after free memory dropped below 1 GB.

After the reboot, the error below appeared from charon. The BGP sessions that don’t have an MD5 password did not come up. I stopped this process, restarted the server again, and then the BGP sessions started normally.

Could IPsec/charon be the cause of the memory leak?
Now:
show system memory
Total: 7982
Free: 5930
Used: 2052

Oct 16 05:41:37 ROUTER-BGP vyatta-router[1915]: Starting VyOS router: migrate rl-system firewall configure.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Reloading.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Started VyOS Router.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Getty on tty1...
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Started Getty on tty1.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Login Prompts.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Reached target Login Prompts.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting LSB: AWS EC2 instance init script to fetch and load ssh public key...
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Started LSB: AWS EC2 instance init script to fetch and load ssh public key.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Multi-User System.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Reached target Multi-User System.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Graphical Interface.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Reached target Graphical Interface.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Update UTMP about System Runlevel Changes...
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Started Update UTMP about System Runlevel Changes.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Startup finished in 18.836s (kernel) + 2min 36.775s (userspace) = 2min 55.611s.
Oct 16 05:41:38 ROUTER-BGP kernel: [  176.082887] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based  firewall rule not found. Use the iptables CT target to attach helpers instead.
Oct 16 05:41:38 ROUTER-BGP charon: 04[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:46 ROUTER-BGP charon: 11[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:48 ROUTER-BGP charon: 04[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:50 ROUTER-BGP charon: 10[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:50 ROUTER-BGP rsyslogd0: action 'action 3' resumed (module 'builtin:omfwd') [try http://www.rsyslog.com/e/0 ]
Oct 16 05:41:50 ROUTER-BGP rsyslogd-2359: action 'action 3' resumed (module 'builtin:omfwd') [try http://www.rsyslog.com/e/2359 ]
Oct 16 05:41:51 ROUTER-BGP charon: 04[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:52 ROUTER-BGP charon: 06[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:54 ROUTER-BGP charon: 10[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:55 ROUTER-BGP charon: 07[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:56 ROUTER-BGP charon: 03[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:57 ROUTER-BGP charon: 09[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:58 ROUTER-BGP charon: 09[KNL] unable to receive from RT event socket No buffer space available (105)

vtysh -c "show memory" after reboot
vyos-memory-0h-reboot.txt (31.5 KB)

I found backtrace logs from when memory ran out and the BGP process was killed.
kernel-backtrace.txt (10.0 KB)

Thanks for that. I see the OOM killer was invoked, but it’s not clear what is actually consuming all the memory and triggering it.

strongSwan could be the root cause, but I’m not 100% sure at this point.
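One way to narrow it down between strongSwan, FRR, and anything else is to compare per-process resident memory between snapshots and see which one keeps growing. A quick sketch using standard Linux tools (not VyOS-specific commands):

# top consumers by resident set size (RSS, in kB)
ps -eo pid,rss,vsz,comm --sort=-rss | head -n 15

# the usual suspects individually
grep VmRSS /proc/$(pidof bgpd)/status
grep VmRSS /proc/$(pidof zebra)/status
grep VmRSS /proc/$(pidof charon)/status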

Below is the memory status from 26 hours ago:

show system memory
Total: 7982
Free: 5930
Used: 2052

Below is the memory status now (26 hours after the last reading):

show system memory
Total: 7982
Free: 4170
Used: 3812

That is about 1.7 GB of RAM consumed in 26 hours, roughly 1.6 GB per day.

VyOS-mem-26h_uptime.txt (31.6 KB)

Looks like the bgpd process itself.

Oct 13 14:42:56 SERVER-BGP kernel: [975257.727898] bgpd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[…]
Oct 13 14:42:56 SERVER-BGP kernel: [975257.837299] Killed process 2028 (bgpd) total-vm:1182792kB, anon-rss:988988kB, file-rss:0kB, shmem-rss:0kB
Oct 13 14:42:56 SERVER-BGP kernel: [975257.980689] oom_reaper: reaped process 2028 (bgpd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I think what you see from charon is a subsequent issue due to the memory shortage.

Could "Increased memory usage with 2018-01-08 HEAD" (FRRouting/frr issue #1610 on GitHub) be the issue?
Or "Crazy memory usage" (FRRouting/frr issue #2527)?
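To see whether those issues apply here, bgpd’s RSS could be logged alongside FRR’s own allocation statistics over time: if the RSS keeps growing while the totals in vtysh "show memory" stay flat, the leak is probably outside FRR’s tracked allocations. A rough sketch (the hourly interval and log path are arbitrary):

# log bgpd resident memory and FRR memory stats once per hour
while true; do
  date
  grep VmRSS /proc/$(pidof bgpd)/status
  vtysh -c 'show memory'
  sleep 3600
done >> /tmp/bgpd-mem-track.log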

Below is the memory status now (48 hours uptime):

show system memory
Total: 7982
Free: 2668
Used: 5314

VyOS-mem-48h_uptime.txt (31.7 KB)

So it doesn’t look like an FRR issue, but rather an issue with something else.
Now we need to investigate what exactly is eating up all the memory.
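Besides user-space processes, kernel-side memory is worth a look too, since the meminfo posted earlier showed roughly 250 MB of unreclaimable slab. For example (standard Linux tools, run as root; the conntrack check is only relevant because the box has firewall rules):

# largest kernel slab caches, sorted by cache size
slabtop -o -s c | head -n 20

# anonymous vs. slab vs. page-cache breakdown
egrep 'AnonPages|Slab|SUnreclaim|Shmem|Cached' /proc/meminfo

# conntrack table usage (stateful firewall entries pin kernel memory)
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max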