VyOS 1.1.1 issue with pmacct and overall tuning of VyOS for extreme performance

Dear colleagues,

We are experiencing DDoS attacks very often these days. We are using pmacct, a small set of passive network monitoring tools to measure, account, classify, aggregate and export IPv4 and IPv6 traffic.

We are using machines with the following characteristics:

  1. AMD Opteron™ Processor 6128
    32 GB of memory

  2. AMD Opteron™ Processor 6176
    128 GB of memory

However, the issue is that when a large amount of traffic arrives (more than 2-3 Gbps), pmacct gets stuck and stops sending information to our DDoS system. Do you have any ideas on tuning VyOS as a whole to get the maximum performance in this situation?
Please find below some of the messages that I see in the logs:

Jan 18 16:56:15 r5-zrh pmacctd[7880]: WARN: Failed during write: Resource temporarily unavailable
Jan 18 16:56:15 r5-zrh pmacctd[7880]: last message repeated 8 times
Jan 18 16:56:15 r5-zrh pmacctd[7882]: ERROR ( default/memory ): We are missing data.
Jan 18 16:56:15 r5-zrh pmacctd[7882]: If you see this message once in a while, discard it. Otherwise some solutions follow:
Jan 18 16:56:15 r5-zrh pmacctd[7882]: - increase shared memory size, ‘plugin_pipe_size’; now: ‘10485760’.
Jan 18 16:56:15 r5-zrh pmacctd[7882]: - increase buffer size, ‘plugin_buffer_size’; now: ‘10240’.
Jan 18 16:56:15 r5-zrh pmacctd[7882]: - increase system maximum socket size.#012
Jan 18 16:56:15 r5-zrh pmacctd[7882]: ERROR ( default/memory ): We are missing data.
Jan 18 16:56:15 r5-zrh pmacctd[7882]: If you see this message once in a while, discard it. Otherwise some solutions follow:
Jan 18 16:56:15 r5-zrh pmacctd[7882]: - increase shared memory size, ‘plugin_pipe_size’; now: ‘10485760’.
Jan 18 16:56:15 r5-zrh pmacctd[7882]: - increase buffer size, ‘plugin_buffer_size’; now: ‘10240’.
Jan 18 16:56:15 r5-zrh pmacctd[7882]: - increase system maximum socket size.#012
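
For anyone hitting the same messages: the log names three knobs, and the ratio plugin_pipe_size / plugin_buffer_size is the number of buffers the pipe can hold. Below is a minimal sketch of the arithmetic. The tenfold increase is an illustrative assumption, not a tested recommendation, and mapping "system maximum socket size" to the net.core.rmem_max / net.core.wmem_max sysctls is my reading of the pmacct documentation. Note also that VyOS generates the pmacct configuration itself, so manual edits to that file may be overwritten.

```shell
# Values reported in the log above.
pipe_size=10485760      # plugin_pipe_size, in bytes
buffer_size=10240       # plugin_buffer_size, in bytes

# Number of buffers the pipe can hold at once.
slots=$((pipe_size / buffer_size))
echo "current slots: $slots"

# Illustrative larger values (assumption): ten times each, same slot count.
new_pipe=$((pipe_size * 10))
new_buffer=$((buffer_size * 10))

# The two keys as they would appear in a pmacct configuration file.
printf 'plugin_pipe_size: %s\nplugin_buffer_size: %s\n' "$new_pipe" "$new_buffer"

# The kernel socket buffer ceiling should be at least as large as the pipe.
# These need root, so they are only printed here, not executed.
echo "sysctl -w net.core.rmem_max=$new_pipe"
echo "sysctl -w net.core.wmem_max=$new_pipe"
```

Keeping the slot count constant while growing both values is only one option; raising the pipe size alone gives more slots instead.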

You might be able to utilise the ‘raw’ table to avoid logging traffic identified as attacks. From the iptables manual:

raw:
This table is used mainly for configuring exemptions from connection tracking in combination with the NOTRACK target. It registers at the netfilter hooks with higher priority and is thus called before ip_conntrack, or any other IP tables. It provides the following built-in chains: PREROUTING (for packets arriving via any network interface) OUTPUT (for packets generated by local processes)
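
As a rough sketch of what that could look like: the helper below only prints raw-table rules so they can be reviewed before being run as root. The function name raw_rules and the source addresses (documentation ranges) are made up for illustration, and VyOS manages its own iptables rules, so anything added by hand may not survive a commit or reboot.

```shell
# Hypothetical attacker sources (documentation address ranges, not real data).
ATTACK_SOURCES="198.51.100.0/24 203.0.113.7"

# Print one raw-table PREROUTING rule per source address.
# NOTRACK exempts matching packets from connection tracking;
# passing DROP instead discards them before conntrack sees them.
raw_rules() {
    target="$1"
    shift
    for src in "$@"; do
        echo "iptables -t raw -A PREROUTING -s $src -j $target"
    done
}

# Intentionally unquoted so the list splits into one argument per source.
raw_rules NOTRACK $ATTACK_SOURCES
```

Piping the output to sh as root would apply the rules; printing first keeps a typo from locking you out of the box.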

In general, software firewalls are always going to struggle with that kind of ingress without some ASIC hardware handling the forwarding plane. Maybe not on raw throughput, but on any feature that has to peek into the traffic flow for various reasons.

One strategy might be to move the traffic analysis off the firewalls and onto something that spans the ingress ports of your infrastructure/edge.

HTH,

Chris

Hello Chris,

Thank you for your comment. Yes, I will actually be exporting the traffic samples from the switches instead of the router.
However, the overall performance is compromised when a great amount of traffic is going through VyOS.
I don’t know if there are more tuning settings that would let the router handle a large amount of traffic/connections.

Regards,
Kaloyan

Hi Kaloyan,

Yeah, I’m not sure. I’ve only been willing to use VyOS for links of a few hundred Mbit/s, and probably no more than 1Gbps without some significant testing, as I would expect to see the kind of symptoms you are experiencing.

It just depends on what resources are getting strained during load. Is a CPU pinned at 100%? If so, what processes are consuming all of that CPU? Is the logging framework taxed by a high number of flows?
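
Besides top, one place to look is /proc/net/softnet_stat, which has one row per CPU with hexadecimal counters: the first column is packets processed, the second is packets dropped because the backlog queue was full, and the third is how often the softirq ran out of budget. A minimal sketch that prints those three columns in decimal (the function name softnet_summary is just something I made up):

```shell
# Print processed / dropped / time_squeeze per row of softnet_stat.
# Rows normally correspond to CPUs in order.
softnet_summary() {
    file="${1:-/proc/net/softnet_stat}"
    row=0
    while read -r processed dropped squeezed _rest; do
        # The counters are hex without a prefix, so add 0x for the shell.
        echo "row$row processed=$((0x$processed)) dropped=$((0x$dropped)) time_squeeze=$((0x$squeezed))"
        row=$((row + 1))
    done < "$file"
}

# Only run against the live file where it exists (Linux).
if [ -r /proc/net/softnet_stat ]; then
    softnet_summary
fi
```

If the dropped or time_squeeze numbers climb on only one or two rows during an attack, the load is not being spread across cores.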

There is an iptables target called NFQUEUE that allows userspace to inspect packets, and one of its features is the ability to have queues spread across your CPUs (--queue-cpu-fanout). Although queueing packets for userspace adds latency and overhead, it’s possible the scale-out approach might give you better handling of high pps under DDoS.
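
A sketch of what such a rule might look like, assuming one queue per CPU and the FORWARD chain (both are illustrative choices, and nfqueue_rule is a made-up helper). It only prints the rule. As far as I know --queue-cpu-fanout needs kernel 3.10 or later and has to be combined with --queue-balance, so check that your iptables build supports it.

```shell
# Build an NFQUEUE rule that spreads packets over queues 0..(cpus-1).
# Assumes at least 2 CPUs, since --queue-balance takes a first:last range.
nfqueue_rule() {
    cpus="$1"
    last_queue=$((cpus - 1))
    # --queue-balance sets the queue range; --queue-cpu-fanout uses the
    # CPU ID as the index into that range.
    echo "iptables -A FORWARD -j NFQUEUE --queue-balance 0:${last_queue} --queue-cpu-fanout"
}

# Print the rule for this machine; run it as root only after reviewing it.
nfqueue_rule "$(nproc)"
```

Keep in mind that by default packets sent to a queue with no userspace program listening are dropped; the --queue-bypass option changes that to accept.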

But if you do some reading, with the size of the DDoS attacks that hit network equipment these days, very few vendors survive the hits without some casualties. The real trick is to statelessly drop as many undesirable packets as you can, as close to your edge as you can, before they hit any stateful software-based firewalls :slight_smile:


Oh, and I wanted to mention that one of the reasons Ntop’s PF_RING ZC (http://www.ntop.org/products/pf_ring/pf_ring-zc-zero-copy/) and Intel’s DPDK exist is that Linux’s networking stack on general-purpose hardware/VMs struggles with a few gigabits of stateful traffic (as I understand it).

One more thing :slight_smile:

I’ve just mocked up a simple rtr1 <-> rtr2 <-> rtr3 topology and ran iperf with TCP between rtr1 and rtr3, all on a modern high-spec Intel server running ESX, switched on a local isolated vSwitch so there is no over-the-wire flow. In theory the max is 10Gbps for vNICs between VMs on the same physical host.

iperf is driving around 3-3.1Gbps for a single TCP flow, 3.7Gbps for 4 parallel flows, and 4.23Gbps for 32 parallel flows. At 64 parallel flows iperf seemed to be pegging the CPU trying to generate that many flows :slight_smile:

I guess that is something missing in VyOS land: some documented hardware and throughput (pps, bps) cases to guide people’s expectations.

Hello cgb,

Thank you for your comment. I really appreciate it. During a DDoS, the pmacct process is the one using almost all of the resources. However, with the last DDoS that I suffered, the routing on the system also seemed to get messed up, and I needed to move it to another router.
I will probably be moving away from VyOS at this point, but I wanted to help others with the issues I am facing.
I am researching some options; if you have come across an alternative, I would be thankful if you shared it.

Regards,
Kaloyan

Hi kaloyan,

Did you end up keeping VyOS or moving off it? What has your experience been if you are still using it? Have you tuned your router much? We don’t have traffic levels above 1Gbps. When we did experience more than 1 Gbps, we were on Cisco gear. I’ve been using VyOS for years now and I love it, with very few issues.

Hello ocosa,

Yes, we did move off VyOS.
We are now using BSDRP, which is much more stable and is capable of reaching 4-5 Gbps without an issue. We compiled it from source ourselves, because the community version had some limitations.

We stopped using VyOS a few years ago.

Regards,

Kaloyan,

That’s awesome you are seeing 4-5 Gbps with your new setup. Is that on the edge or in your core? Are you taking full BGP feeds? Have you had a DDoS attack and mitigated the attack on the platform? What type of hardware and which NICs are you using? I like VyOS and have been using it for quite some time, but as our traffic ratios grow I’d like to map out our future. Obviously, I’d like to stick with VyOS.

Hi @ocosa,
I would suggest you start testing the 1.2 versions, and we can see how to fix the issue if it persists there.
We are just not spending time on 1.1.x anymore.
You can follow the NetFlow issue on 1.2 here: T739 flow-accounting stops
All input is welcome.

I will start testing 1.2. I am on 1.1.8 for my edge routers. No issues other than some tuning that needs to happen for gigabit ports. Nothing a little sysctl cannot fix. I was looking for some tuning guides that were out there and searched on the forum and found this thread.