Memory Leak on VyOS 1.2 (20180921) consumes 6GB in less than 7 days


#1

Hi,
I’m using VyOS version 1.2.0-rolling+201809210337, and it has a memory leak that consumes almost all memory in less than 10 days (about 6 GB of 8 GB total). The router runs BGP v4 and v6, OSPF and some firewall rules.

I have been running version 1.2.0-rolling+201807230337 for 81 days and everything is OK.

More info about VyOS with memory leak

show system memory detail
MemTotal: 8173860 kB
MemFree: 2939404 kB
MemAvailable: 2976244 kB
Buffers: 100916 kB
Cached: 162056 kB
SwapCached: 0 kB
Active: 1731760 kB
Inactive: 103004 kB
Active(anon): 1574924 kB
Inactive(anon): 16772 kB
Active(file): 156836 kB
Inactive(file): 86232 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 36 kB
Writeback: 0 kB
AnonPages: 1571612 kB
Mapped: 32364 kB
Shmem: 19908 kB
Slab: 284016 kB
SReclaimable: 28264 kB
SUnreclaim: 255752 kB
KernelStack: 2512 kB
PageTables: 6572 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 4086928 kB
Committed_AS: 1853752 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 1521664 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 74064 kB
DirectMap2M: 8312832 kB
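
One way to read a dump like the one above is to add up the counters that explain where memory went and compare the sum with MemTotal. The sketch below does that in Python. The choice of counters is an assumption made for a rough accounting, not an exact kernel formula, and the names `parse_meminfo` and `unaccounted_kb` are made up for this example. The sample values are copied from the dump above.

```python
# Values copied from the "show system memory detail" output above.
SAMPLE = """\
MemTotal: 8173860 kB
MemFree: 2939404 kB
Buffers: 100916 kB
Cached: 162056 kB
AnonPages: 1571612 kB
Slab: 284016 kB
KernelStack: 2512 kB
PageTables: 6572 kB
"""

# Counters treated as "explained" memory (an assumption for a rough estimate).
TRACKED = ("MemFree", "Buffers", "Cached", "AnonPages",
           "Slab", "KernelStack", "PageTables")


def parse_meminfo(text):
    """Parse 'Key: value kB' lines into a dict of integers (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def unaccounted_kb(mem):
    """Return MemTotal minus the sum of the tracked counters, in kB."""
    tracked = sum(mem.get(name, 0) for name in TRACKED)
    return mem["MemTotal"] - tracked


if __name__ == "__main__":
    mem = parse_meminfo(SAMPLE)
    gap = unaccounted_kb(mem)
    print(f"unaccounted: {gap} kB ({gap / 1024 / 1024:.2f} GiB)")
```

A large gap would suggest that the missing memory is not held by user-space processes or by the slab caches, which is worth keeping in mind when looking for the culprit. It is only a hint, since the kernel does not expose every allocation through these counters.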

What info can be useful?


#4

This should be reported to FRR maintainers. I’ll check the recent changelogs to see if there were any related fixes and make an updated package for you to test if I find anything.

Meanwhile, could you post the output of vtysh -c 'sh run' somewhere so that a bug report can be created?


#5

Please also dump the output of vtysh -c "show memory".


#6

Can you tell us a bit more about your current setup? Do you have several full feeds?
It would be great if you could also provide a sanitized config.


#7

Hi,
I can’t attach a TXT file with all the requested information, and it is too big to paste into a comment. Please allow me to attach files so I can provide more information.

show system memory 
Total: 7982
Free:  2704
Used:  5278

show ip route summary 
Route Source         Routes               FIB  (vrf Default-IP-Routing-Table)
connected            10                   10                   
static               7                    0                    
ospf                 206                  200                  
ebgp                 717286               717286               
ibgp                 2925                 2925                 
------
Totals               720434               720421

#8

vyos-config.txt (19.7 KB)


#9

vyos-memory.txt (31.5 KB)


#10

The setup is as follows:

  • 1 iBGP with full routing (receive less than 700k)
  • 1 eBGP v4 with full routing (receive ± 700k)
  • 1 eBGP v6 with full routing (receive ± 60k)
  • 2 eBGP v4 (receive ± 18k)
  • 3 OSPF neighbors (P2P) and ± 200 IGP routes

The problem is not related to traffic volume, which is less than 1.5 Gbps aggregated.
I’m using Mellanox x3 10G and Intel 1G boards


#11

Thanks for the info provided.
We will need the same info again as the memory usage grows.
Could you provide it every two days or so?
We really appreciate your help.
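
Collecting the same output at regular intervals is easy to script. Below is a minimal sketch in Python, assuming Python 3.7 or later is available on the router. `vtysh -c "show memory"` is the command requested earlier in this thread. The function names and the log path in the usage note are made up for this example.

```python
import subprocess
from datetime import datetime, timezone

# Commands to capture on every run.
COMMANDS = [
    ["vtysh", "-c", "show memory"],  # per-daemon allocation counters from FRR
    ["cat", "/proc/meminfo"],        # kernel-wide memory counters
]


def run_command(argv):
    """Run one command and return its stdout, or a note if it could not run."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
        return result.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed: {exc}>\n"


def snapshot(commands=COMMANDS, runner=run_command):
    """Build one timestamped text block with the output of every command."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    parts = [f"=== snapshot {stamp} ==="]
    for argv in commands:
        parts.append("$ " + " ".join(argv))
        parts.append(runner(argv).rstrip())
    return "\n".join(parts) + "\n"


def append_snapshot(path):
    """Append one snapshot to a log file so successive runs can be compared."""
    with open(path, "a") as handle:
        handle.write(snapshot())
```

Calling `append_snapshot("/tmp/memory-snapshots.log")` from a scheduled job every few hours would produce a single file that can be attached to the thread. The path is only an example.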


#12

System Uptime
uptime
23:12:28 up 2 days, 8:17, 1 user, load average: 0.20, 0.12, 0.10

Free Memory (8h after last post)
free -h
             total       used       free     shared    buffers     cached
Mem:          7.8G       5.9G       1.9G        19M       114M       164M
-/+ buffers/cache:       5.6G       2.1G
Swap:           0B         0B         0B

show memory (8h after last post)
vyos-memory-8h.txt (31.4 KB)


#13

The router had to be restarted.

run show system uptime
05:16:59 up 2 days, 14:22, 1 user, load average: 0.02, 0.08, 0.08

show system memory
Total: 7982
Free: 1793
Used: 6189

vyos-memory-14h.txt (31.5 KB)

Last Saturday, the router killed the BGP process after free memory dropped below 1 GB.


#14

After the reboot, the error below appeared from charon, and BGP sessions without an MD5 password did not come up. I stopped the charon process and restarted the server again, and then the BGP sessions started normally.

Could IPsec/charon be the cause of the memory leak?
Now:
show system memory
Total: 7982
Free: 5930
Used: 2052

Oct 16 05:41:37 ROUTER-BGP vyatta-router[1915]: Starting VyOS router: migrate rl-system firewall configure.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Reloading.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Started VyOS Router.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Getty on tty1...
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Started Getty on tty1.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Login Prompts.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Reached target Login Prompts.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting LSB: AWS EC2 instance init script to fetch and load ssh public key...
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Started LSB: AWS EC2 instance init script to fetch and load ssh public key.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Multi-User System.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Reached target Multi-User System.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Graphical Interface.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Reached target Graphical Interface.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Starting Update UTMP about System Runlevel Changes...
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Started Update UTMP about System Runlevel Changes.
Oct 16 05:41:38 ROUTER-BGP systemd[1]: Startup finished in 18.836s (kernel) + 2min 36.775s (userspace) = 2min 55.611s.
Oct 16 05:41:38 ROUTER-BGP kernel: [  176.082887] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based  firewall rule not found. Use the iptables CT target to attach helpers instead.
Oct 16 05:41:38 ROUTER-BGP charon: 04[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:46 ROUTER-BGP charon: 11[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:48 ROUTER-BGP charon: 04[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:50 ROUTER-BGP charon: 10[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:50 ROUTER-BGP rsyslogd0: action 'action 3' resumed (module 'builtin:omfwd') [try http://www.rsyslog.com/e/0 ]
Oct 16 05:41:50 ROUTER-BGP rsyslogd-2359: action 'action 3' resumed (module 'builtin:omfwd') [try http://www.rsyslog.com/e/2359 ]
Oct 16 05:41:51 ROUTER-BGP charon: 04[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:52 ROUTER-BGP charon: 06[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:54 ROUTER-BGP charon: 10[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:55 ROUTER-BGP charon: 07[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:56 ROUTER-BGP charon: 03[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:57 ROUTER-BGP charon: 09[KNL] unable to receive from RT event socket No buffer space available (105)
Oct 16 05:41:58 ROUTER-BGP charon: 09[KNL] unable to receive from RT event socket No buffer space available (105)

vtysh -c "show memory" after reboot
vyos-memory-0h-reboot.txt (31.5 KB)


IPSEC CHARON 100% usage in one core after some time
#15

I found backtrace logs from when memory ran out and the BGP process was stopped.
kernel-backtrace.txt (10.0 KB)


#16

Thanks for that. I see the OOM killer was invoked, but it’s not clear what actually consumed all the memory to trigger it.


#17

strongSwan could be the root cause, but I’m not 100% sure at this point.


#18

Below, memory status 26 hours ago

show system memory
Total: 7982
Free: 5930
Used: 2052

Below, memory status now (26 hours after last print)

show system memory
Total: 7982
Free: 4170
Used: 3812

That is 1760 MB of RAM consumed in 26 hours, or roughly 1.6 GB per day.

VyOS-mem-26h_uptime.txt (31.6 KB)
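
The growth rate can be worked out directly from the two "show system memory" readings in this post (2052 MB and 3812 MB used, 26 hours apart). The Python sketch below assumes the growth is linear, which two samples cannot confirm. The function names are made up for this example.

```python
def leak_rate_mb_per_day(used_start_mb, used_end_mb, hours_elapsed):
    """Average growth of used memory, scaled to MB per 24 hours."""
    return (used_end_mb - used_start_mb) / hours_elapsed * 24


def days_until_exhausted(free_mb, rate_mb_per_day):
    """Days until free memory reaches zero at a constant growth rate."""
    return free_mb / rate_mb_per_day


if __name__ == "__main__":
    # Readings from this post: Used 2052 -> 3812 over 26 hours, 4170 MB still free.
    rate = leak_rate_mb_per_day(2052, 3812, 26)
    print(f"growth: {rate:.0f} MB/day")
    print(f"time left: {days_until_exhausted(4170, rate):.1f} days")
```

The estimate is only as good as the linear assumption, so it is worth repeating with later readings.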


#19

Looks like the bgpd process itself.

Oct 13 14:42:56 SERVER-BGP kernel: [975257.727898] bgpd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[…]
Oct 13 14:42:56 SERVER-BGP kernel: [975257.837299] Killed process 2028 (bgpd) total-vm:1182792kB, anon-rss:988988kB, file-rss:0kB, shmem-rss:0kB
Oct 13 14:42:56 SERVER-BGP kernel: [975257.980689] oom_reaper: reaped process 2028 (bgpd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I think what you see from charon is a subsequent issue due to the memory shortage.

Could https://github.com/FRRouting/frr/issues/1610 be the issue?
Or https://github.com/FRRouting/frr/issues/2527.
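
To see how much memory bgpd actually held when it was killed, the "Killed process" line quoted above can be parsed. The sketch below is in Python. The regular expression is written against the line format shown in this post and may need adjusting for other kernel versions. The function name is made up for this example.

```python
import re

# Matches the "Killed process" line format quoted in this post.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB, shmem-rss:(?P<shmem_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return the victim's name, pid and sizes in kB, or None if no match."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    info = {"name": match.group("name"), "pid": int(match.group("pid"))}
    for field in ("total_vm", "anon_rss", "file_rss", "shmem_rss"):
        info[field] = int(match.group(field))
    return info


if __name__ == "__main__":
    line = ("Oct 13 14:42:56 SERVER-BGP kernel: [975257.837299] Killed process "
            "2028 (bgpd) total-vm:1182792kB, anon-rss:988988kB, file-rss:0kB, "
            "shmem-rss:0kB")
    info = parse_oom_kill(line)
    print(f"{info['name']} held {info['anon_rss'] / 1024:.0f} MiB of anonymous memory")
```

Comparing that resident size with the total used memory reported at the time shows whether the killed process was the main consumer or only the victim the kernel picked.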


#20

Below, memory status now (48 hours uptime)

show system memory
Total: 7982
Free: 2668
Used: 5314

VyOS-mem-48h_uptime.txt (31.7 KB)


#21

So it does not look like an FRR issue, but an issue with something else.
Now we need to investigate what exactly is eating up all the memory.