UDP pkt/s performance issue

Hello,

I’ve been trying to solve an issue where my VyOS router does not perform well when forwarding a high rate of UDP packets.

My scenario looks like this:

  1. Sender uses a C script to send UDP packets with a payload size of 64 bytes. Source: 10.0.0.10, Destination: 10.0.1.10:4321.
  2. Receiver also uses a C script to listen on port 4321 for incoming UDP packets.
  3. In between there’s a VyOS router that forwards these packets.
    Script: dump/how-to-receive-a-million-packets at master · majek/dump · GitHub

Even when sending 600k pkt/s, VyOS receives only 400k pkt/s. I can scale it up to 2M, but the result is the same: VyOS is not able to forward about 20% of the packets. There’s also a huge increase in rx_no_dma_resources and fdir_miss errors (even though no flow director rules are enabled, and fdir_miss only increases with UDP traffic). If I remove the route to the destination and scale up to 2M pkt/s, there’s no loss (bonding the interfaces helped with this).
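To see how fast these two counters grow during a test, the `ethtool -S` output can be filtered down to just them. This is a minimal sketch, not something from the original post: `drop_counters` is a hypothetical helper name and `eth0` is a placeholder interface.

```shell
#!/bin/sh
# Hypothetical helper: reduce `ethtool -S` output (read from stdin) to the
# two counters discussed above, printed as name=value.
drop_counters() {
    awk -F': *' '/rx_no_dma_resources|fdir_miss/ {
        gsub(/^[ \t]+/, "", $1)   # strip the indentation ethtool adds
        print $1 "=" $2
    }'
}

# Example (eth0 is a placeholder interface name):
#   ethtool -S eth0 | drop_counters
# Run it twice a few seconds apart and compare the values to get a rate.
```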

VyOS version is 1.3
VyOS is using 10G Intel 82599ES NICs with the latest driver version.
Bond mode 802.3ad. Tried to disable the bond and use single interfaces, same results.
Increased ring buffers to 4096: ethtool -G rx 4096 tx 4096
Disabled flow control: ethtool -A autoneg off rx off tx off
Disabled LRO: ethtool -K lro off
Turned off interrupt throttling: ethtool -C rx-usecs 0
RSS enabled

CPU: E5-2680 v4
Disabled Hyperthreading
Enabled/Disabled irqbalance
Using custom affinity scripts found in here: ixgbe/set_irq_affinity at master · majek/ixgbe · GitHub

Some sysctl parameters:
net.ipv4.udp_mem = 11416320 15221760 22832640
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384
net.core.default_qdisc = fq
net.core.netdev_max_backlog = 250000
net.core.netdev_budget = 1024

Topology looks like this:

Has anyone had similar issues? Any advice on troubleshooting?

Here’s the bwm-ng output on the vyos router

And here’s the sender:

Is this just the limit of your hardware? (This probably only uses a single core.)
400k packets/s would translate into 4.8 Gb/s of throughput for normal-sized packets (1500 MTU).
Is the UDP source port different for every packet? That’s not normal traffic, and it will involve more conntrack processing.
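The 4.8 Gb/s figure is simply packet rate × packet size × 8 bits. A small sketch of that arithmetic, where `pps_to_gbps` is a hypothetical helper name and the 64-byte case counts the payload only (no headers):

```shell
#!/bin/sh
# Hypothetical helper: convert a packet rate and a per-packet size to Gb/s.
# Usage: pps_to_gbps <packets_per_second> <bytes_per_packet>
pps_to_gbps() {
    awk -v pps="$1" -v bytes="$2" \
        'BEGIN { printf "%.2f\n", pps * bytes * 8 / 1e9 }'
}

pps_to_gbps 400000 1500   # 1500-byte packets, as in the estimate above
pps_to_gbps 400000 64     # same rate with only a 64-byte payload counted
```

With small packets the same packet rate carries far less data, which is why pkt/s rather than Gb/s is the useful metric for this test.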

The source port is different. I also use separate destination IP addresses to ensure that RSS spreads the packets across several queues. I agree that this is not regular traffic, but it still shows that there’s an issue where a single core is not capable of receiving ~400k UDP pkt/s, which seems very low. Or am I wrong here? Conntrack and iptables are disabled.

Try enabling the offloads: tso/lro/gro/gso/rps

vyos@vyos2# set interfaces ethernet eth0 offload
Possible completions:
   gro          Enable Generic Receive Offload
   gso          Enable Generic Segmentation Offload
   lro          Enable Large Receive Offload
   rps          Enable Receive Packet Steering
   sg           Enable Scatter-Gather
   tso          Enable TCP Segmentation Offloading

Enabling some of these offloads definitely helped with the RX side of things! Checking the CPU core load, I can see that it went from 100% to 25%. However, it looks like the router is not able to transmit all of the received packets. This can be seen on the bond0 and bond1 interfaces in the screenshot. The core which is responsible for TX is at 100% load. Do you have any other recommendations?

Any standard Linux tuning will help :slight_smile:
The next step is to turn off power saving, disable logical cores, and use the fastpath bypass:
https://phabricator.vyos.net/T4502
Also bind interrupts to the local NUMA node/memory bank.
Play with adaptive interrupt settings.

Power saving features and hyperthreading are disabled. Interrupts are mapped to the cores which are responsible for the NIC queues. I’ve tried playing with irqbalance, but the results are the same.
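One way to double-check the interrupt mapping is to list each queue IRQ of an interface together with the CPUs it is allowed to run on. The following is only a sketch: `irqs_for_dev` and `show_affinity` are hypothetical helper names, `eth0` is a placeholder, and it assumes the driver names its vectors `<dev>-TxRx-<n>`, as ixgbe does.

```shell
#!/bin/sh
# Hypothetical helper: print the IRQ numbers whose name starts with "<dev>-".
# Reads /proc/interrupts-style text on stdin.
irqs_for_dev() {
    awk -v dev="$1" 'index($NF, dev "-") == 1 { sub(/:$/, "", $1); print $1 }'
}

# Hypothetical helper: show which CPUs each queue interrupt may run on.
show_affinity() {
    for irq in $(irqs_for_dev "$1" < /proc/interrupts); do
        printf 'irq %s -> cpus %s\n' "$irq" \
            "$(cat "/proc/irq/$irq/smp_affinity_list")"
    done
}

# Example (eth0 is a placeholder interface name):
#   show_affinity eth0
# The NUMA node of the NIC can be read from
#   /sys/class/net/eth0/device/numa_node
```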

I also tried increasing net.core.netdev_max_backlog, but it didn’t help.

I’ll be taking a look at fastpath bypass.

Do you have any other recommendations?

Try just the fast path, I believe you’ll like the result :wink:
To tune other parameters, you need to understand where the bottleneck is.

I just tried using the fast path and it’s amazing! I finally achieved the results I wanted. Here’s the pkt/s rate using 1 queue (1 core). It’s able to transmit everything it receives.

Are there any drawbacks to using the fast path? After reading some papers I noticed that the fast path does not work with fragmented packets, which makes sense, since the fast path uses the conntrack system.

Good stuff! Would you terribly mind sharing the steps you’ve taken to achieve your goals? It will help people with similar goals to find this topic (e.g. using the boards’ search function). As a performance related topic, this is valuable input for scaling in larger environments :slightly_smiling_face:

Are you still testing with “unique” packets (i.e., not belonging to an existing connection, with a new source port and/or destination address)?
If so, I doubt the fast path can really help, as the first packet of each flow is handled in software.

Below is the short summary which helped me to achieve better performance:

  1. The sender is sending >600k pkt/s on 1 flow (10.0.0.10:11404 -> 10.0.1.10:4321). The VyOS router was not able to receive all of these packets; it could do up to 500k pkt/s. Enabling the RPS offload allowed the router to receive all of them.
    set interfaces ethernet ethX offload rps
  2. Kernel could not forward all of these packets to the receiver. Thanks to @Viacheslav I got introduced to netfilter’s flowtable: Netfilter’s flowtable infrastructure — The Linux Kernel documentation
    This feature allows skipping regular netfilter’s hooks which may not be necessary for packets that need to be forwarded. This article really helped me to have a better understanding about it: Flowtables - Part 1: A Netfilter/Nftables Fastpath [Thermalcircle.de]
    Using this fastpath, the router is able to forward all the received packets.
nft add flowtable ip filter f { hook ingress priority 0\; devices = { eth5, eth7, bond0 }\; counter\; }
nft add chain ip filter forward { type filter hook forward priority 0\; }
nft add rule ip filter forward ip protocol udp flow add @f
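
To confirm that flows are actually taking the fastpath, the conntrack table can be checked for entries flagged `[OFFLOAD]`. A minimal sketch, with `count_offloaded` as a hypothetical helper name; it also assumes the `ip filter` table used in the commands above already exists (if it does not, `nft add table ip filter` creates it).

```shell
#!/bin/sh
# Hypothetical helper: count conntrack entries flagged [OFFLOAD].
# Reads `conntrack -L` output on stdin; prints 0 when there are none.
count_offloaded() {
    grep -c -F '[OFFLOAD]' || true
}

# Example usage on the router (needs root):
#   conntrack -L 2>/dev/null | count_offloaded
#   nft list flowtable ip filter f    # shows the flowtable definition
```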

I hope this helps someone.

The test is done using a single flow, 10.0.0.10:11404 → 10.0.1.10:4321

However, scaling it up to 3 flows (10.0.0.10:11404 → 10.0.1.10:4321, 10.0.1.11:4321, 10.0.1.12:4321), I get bad results: the router is still capable of receiving all the packets, but has some issues forwarding them.

My goal is to make the router forward at least 2M pkt/s.

Edit: apparently I reached the link capacity limit! I had to reduce the payload size, and now I am able to TX 2M pkt/s. However, there’s some loss; I will dig deeper.
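
For reference, the theoretical packet rate of an Ethernet link for a given UDP payload size can be estimated as below. This is a rough sketch with stated assumptions: `max_pps` is a hypothetical helper name, the framing is IPv4 without options and without a VLAN tag, and the per-packet overhead is 8 (UDP) + 20 (IPv4) + 14 (Ethernet header) + 4 (FCS) bytes plus 20 bytes of preamble and inter-frame gap on the wire.

```shell
#!/bin/sh
# Hypothetical helper: theoretical max packets/s on an Ethernet link.
# Usage: max_pps <link_bits_per_second> <udp_payload_bytes>
max_pps() {
    awk -v bps="$1" -v payload="$2" 'BEGIN {
        frame = payload + 8 + 20 + 14 + 4   # UDP + IPv4 + Ethernet hdr + FCS
        if (frame < 64) frame = 64          # Ethernet pads to a 64-byte frame
        wire = frame + 20                   # preamble (8) + inter-frame gap (12)
        printf "%d\n", bps / (wire * 8)
    }'
}

max_pps 10000000000 64     # one 10G link, 64-byte UDP payload
max_pps 10000000000 1472   # one 10G link, payload filling a 1500-byte MTU
```

Comparing the sender's rate against this ceiling shows whether loss is due to the link itself or to the router.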

The next thing you can try is XDP:

set interfaces ethernet ethX xdp

So, without XDP I am able to reach 4M pkt/s before it starts dropping packets. To reach 4M pkt/s I scaled up to 7 flows, which together send 4.2M pkt/s.

root@test02-dut:/home/vyos# conntrack -L | grep -i offload  | wc -l
conntrack v1.4.6 (conntrack-tools): 12 flow entries have been shown.
7

I’ve enabled XDP and I am even happier with the results. I can see some rx_no_dma_resources errors at 5-6M pkt/s, but they increase very slowly. Scaling it up to 20 flows, I reached the 10M pkt/s zone.

I am not able to see those flows in conntrack. I assume that’s because XDP works at the lowest level, before netfilter’s flowtable or even taps. Would that be correct? If so, is the flowtable even needed? Is there any VyOS documentation on how exactly it uses XDP?

That is correct, it works at a lower layer.
Originally it was added in commit
based on the article Building an XDP (eXpress Data Path) based BGP peering router | by Andree Toonk | The Startup | Medium
After that, there were several changes: T2666