Slow memory leak on backup routers?

bunny · November 9, 2016, 11:56pm

See attached image from Zabbix for the past month.
Not sure what commands I would need to find out where the memory is going.

V1 and V2 are a pair, V1 is primary.
V3 and V4 are a pair, V3 is primary.
They share the same back end network, but the fronts are separate.

Starting around the 10th, V2 and V4 started slowly leaking memory; this is about when the V3/V4 network went live.
It’s interesting that V2, which is basically idle waiting for V1 to commit suicide, sees a faster loss than V4.

At this time, it is not serious and survey suggests it could go on for about a year. But I have a firewall/router with a 5 year uptime that is just fine, so if we are aiming for excellence…

cgb · November 10, 2016, 12:07am

I’ve had my fair share of commercial vendor memory leaks, CPU spikes and ‘leaks’ (constantly growing CPU usage). No doubt there are examples of highly reliable network OSes and hardware platforms, but even then, you often have to work through a list of bugs and caveats when assessing an OS release to evaluate whether you would be impacted by any known problems, and still it’s often a case of finding a reliable OS release and sticking with it forever.

Not sure about the cause of the memory leak, but it’s Linux underneath, so if you are comfortable becoming root and troubleshooting (top, ps etc…), you could report back some more information.

I’ve been running 1.1.7 in multiple places, in clustered configurations, and memory use has been very consistent.

bunny · November 10, 2016, 12:23am

You must be thinking of Juniper’s SRX100 and the periodic meltdowns even a simple config has. ^^;

Sure. Point me in the right direction, my linux-fu is weak, but I know enough to be a danger to myself!

It’s not urgent right now, because these are cloud, I’ll be impressed if they stay up a year.

cgb · November 10, 2016, 12:51am

I’ve worked with Cisco more than any other vendor, and their switches, routers & firewalls are littered with these kinds of strange symptoms, walking-on-eggshells configuration caveats, and software releases.

Check out these links on memory troubleshooting:

See if any suggestions there help you to identify processes high in memory, or growing in memory.
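
If it helps, here is a rough sketch of one way to spot a growing process without extra tools: take two snapshots of each process's resident memory from /proc and sort by the difference. It assumes a Linux /proc filesystem and a Python interpreter on the box (not confirmed for these routers); the function names are made up for illustration.

```python
import os
import time


def parse_vmrss_kib(status_text):
    """Return the VmRSS value (KiB) from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line


def parse_name(status_text):
    """Return the process name from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            parts = line.split(None, 1)
            return parts[1].strip() if len(parts) > 1 else "?"
    return "?"


def snapshot(proc_root="/proc"):
    """Map pid -> (name, rss_kib) for every process that reports VmRSS."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                text = f.read()
        except OSError:
            continue  # process exited between listdir() and open()
        rss = parse_vmrss_kib(text)
        if rss is not None:
            result[int(entry)] = (parse_name(text), rss)
    return result


def growth(before, after):
    """Return (delta_kib, pid, name) for pids in both snapshots, biggest growth first."""
    rows = []
    for pid, (name, rss) in after.items():
        if pid in before:
            rows.append((rss - before[pid][1], pid, name))
    return sorted(rows, reverse=True)


def report(interval_s=3600, top_n=10):
    """Take two snapshots interval_s seconds apart and print the largest growers."""
    first = snapshot()
    time.sleep(interval_s)
    for delta, pid, name in growth(first, snapshot())[:top_n]:
        print(f"{delta:+10d} KiB  pid {pid:<6d} {name}")
```

Calling report() as root with a long interval (an hour or more) should make a slow leak stand out from normal churn; a short interval will mostly show noise.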

bunny · November 18, 2016, 2:13am

A little update. (unfortunately, the other VM of this pair V3/V4 is currently dead “yay cloud!”)
Survey suggests the memory is being eaten by conntrack.

top - 11:08:59 up 67 days, 4:10, 1 user, load average: 0.05, 0.15, 0.15
Tasks: 79 total, 2 running, 77 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.7%us, 0.9%sy, 0.0%ni, 94.0%id, 0.0%wa, 0.0%hi, 4.5%si, 0.0%st
Mem: 8183456k total, 702036k used, 7481420k free, 160812k buffers
Swap: 0k total, 0k used, 0k free, 117932k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3678 root 0 -20 328m 309m 580 S 3 3.9 2047:08 conntrackd

Here is V1/V2. V1 is the primary: (Same uptime)
3799 root 0 -20 113m 94m 596 S 2 1.2 949:55.44 conntrackd <- V1 “active”
3792 root 0 -20 480m 460m 656 S 5 5.8 1449:50 conntrackd <- V2 “standby”
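
As a sanity check on those numbers, here is a small sketch that converts the RES column to KiB and recomputes %MEM. It assumes the "m" suffix in this top version means MiB and that V1 and V2 have the same 8183456k total as the box in the first output (only one Mem: line was posted).

```python
TOTAL_KIB = 8183456  # from the 'Mem:' line above; assumed identical on every router


def res_to_kib(field):
    """Convert a top RES/VIRT field such as '460m' or '580' to KiB."""
    units = {"k": 1, "m": 1024, "g": 1024 * 1024}
    suffix = field[-1].lower()
    if suffix in units:
        return int(float(field[:-1]) * units[suffix])
    return int(field)  # no suffix: top already reports KiB


def percent_of_total(field, total_kib=TOTAL_KIB):
    """Share of physical memory used, rounded the way top's %MEM column is."""
    return round(100.0 * res_to_kib(field) / total_kib, 1)


# RES values copied from the conntrackd lines above
samples = {"V1 active": "94m", "V2 standby": "460m", "first output": "309m"}

for label, res in samples.items():
    print(f"{label:14s} {res_to_kib(res):7d} KiB  {percent_of_total(res)}%")

ratio = res_to_kib(samples["V2 standby"]) / res_to_kib(samples["V1 active"])
print(f"standby/active RES ratio: {ratio:.1f}")
```

The recomputed percentages line up with the %MEM column in the pasted output, and the standby conntrackd holds roughly five times the resident memory of the active one at the same uptime.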