Performance question for tcp throughput

Hi,

I have a question about performance: how many CPU cores do I need for VyOS to push approx. 25 Gbit/s of TCP traffic through the network card?

Is there a guideline value?

greetz


I don’t have any 25GbE gear, but I’d say that whoever does will need more context about the scenario.
Such as what type of traffic (will it be L3 traffic, or will you just be bridging two ports and letting traffic flow at L2), which interface offloads you can use, etc.

Agreed; in general you are limited by the number of packets per second (pps). That is influenced by which features you have enabled (how many firewall rules, NAT, etc.) and by what offloads your NIC hardware is capable of (as Ralm said). Additionally, how many TCP flows are you planning on processing, and is there sufficient entropy between them? Each TCP flow will be handled by a particular CPU core, so if you are hoping to get 25 Gbps through a single TCP flow you will probably be disappointed (and adding CPU cores won’t help, though faster cores will). Also, how big are the packets you are looking to transfer? Closer to 50 bytes or 1500 bytes?

For what it’s worth, on embedded Jasper Lake Celeron cores I can get ~400k pps/core with NAT and 16 firewall rules. So, rough math: at 1500-byte packets you’d need ~2M pps to saturate 25 Gbps, which works out to about 5 Jasper Lake Celeron cores. But again, your specific configuration and workload matter a lot.
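
To make that back-of-the-envelope arithmetic explicit, here is a minimal sketch in Python. It assumes full-size 1500-byte packets, reuses the ~400k pps/core Jasper Lake figure from above, and assumes roughly linear scaling across cores while ignoring Ethernet preamble/IFG and header overhead, so treat it as a ballpark only:

```python
# Rough estimate of pps and CPU cores needed for a target throughput.
# Assumes ~linear scaling across cores; ignores Ethernet framing overhead.

def cores_needed(target_gbps: float, packet_bytes: int, pps_per_core: float) -> tuple[float, float]:
    pps = (target_gbps * 1e9) / (packet_bytes * 8)   # packets per second required
    return pps, pps / pps_per_core

pps, cores = cores_needed(target_gbps=25, packet_bytes=1500, pps_per_core=400_000)
print(f"~{pps/1e6:.1f} Mpps -> ~{cores:.1f} cores")   # ~2.1 Mpps -> ~5.2 cores
```

With smaller packets the pps requirement (and therefore the core count) grows accordingly, which is why the packet-size question above matters.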

Testing with a Silicom XL710 dual 40G card on an E3-1230v3, with basic offloads and no firewall rules, I was able to do 40G iperf without issue from a single host with multiple streams. I have some other VMs on nodes with dual E5-2650v2s that can do 25G without issue on virtio interfaces, although the load is a bit higher on them since it’s a virtual NIC.


Unfortunately (in this case) OPNsense is based on FreeBSD while VyOS is based on Linux, so it’s like comparing wagyu beef with a Maine lobster :slight_smile:

But somewhat of a hint are the performance metrics available for the OPNsense appliances:

https://shop.opnsense.com/dec3800-series-update-2024/

Firewall Throughput: 17.4 Gbps
Firewall Packets Per Second: 1450Kpps

And the above runs on a:

https://www.deciso.com/netboard-a20/

CPU Model: EPYC 3201
AMD Embedded EPYC 3201 (octa-core, 1.5 GHz base, max turbo frequency 3.1 GHz, 30 W, no GPU)

https://www.amd.com/en/products/embedded/epyc/epyc-3000-series.html#specifications

Another data point (still not directly relevant to what you ask about, but it can give a hint) is to look at Mikrotik and their CRS518 vs CCR2216, which both use the same switch chip.

But the CRS518, when routing (fastpath), meaning traffic passes through the mgmt CPU, gets about 0.5 Gbps (while switching, where the packet only touches the switch chip, can push about 1.2 Tbps in total):

Architecture: MIPSBE
CPU: QCA9531
CPU core count: 1
CPU nominal frequency: 650 MHz
Switch chip model: 98DX8525

And in their case the CCR2216 uses the same switch chip, but the mgmt CPU is changed to:

Architecture: ARM 64bit
CPU: AL73400
CPU core count: 16
CPU nominal frequency: 2000 MHz
Switch chip model: 98DX8525

The result then becomes about 69.3 Gbps in total performance for traffic pushed through the mgmt CPU.

In short: changing from MIPS to ARM, from 1 core to 16 cores, and from 650 MHz to 2000 MHz per core took routing performance through the mgmt CPU from about 0.5 Gbps to about 69.3 Gbps in total.
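
Just as a rough sanity check of those numbers (my own sketch, not anything published by Mikrotik): the raw cores × clock ratio alone would only account for roughly a 49x improvement, so the per-core architecture change from MIPS to ARM clearly contributes as well:

```python
# Quick sanity check of the MikroTik numbers above: raw core-count x clock
# scaling alone does not explain the jump, so the MIPS->ARM change matters too.
old_gbps, new_gbps = 0.5, 69.3
clock_core_factor = (16 * 2000) / (1 * 650)      # cores x MHz ratio, ~49x
observed_factor = new_gbps / old_gbps            # ~139x
print(f"clock*cores factor: ~{clock_core_factor:.0f}x, observed: ~{observed_factor:.0f}x")
```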

In your case I would probably go for a Mellanox card instead of an Intel card.

Verify which offloading settings you can apply for each interface.
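
As a small illustration (not VyOS-specific, just assuming a Linux box with ethtool installed and a hypothetical interface name eth0), you can dump the offload features and channel/queue counts per interface like this:

```python
# Sketch: list offload features and channel (queue) counts for an interface
# using ethtool. Assumes ethtool is installed; interface name is hypothetical.
import subprocess

def ethtool(args: list[str]) -> str:
    return subprocess.run(["ethtool", *args], capture_output=True, text=True).stdout

iface = "eth0"                      # replace with your actual interface
print(ethtool(["-k", iface]))       # offload features (gro, gso, tso, lro, ...)
print(ethtool(["-l", iface]))       # RX/TX channel counts (RSS queues)
```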

Go for as fast a single-core speed as possible.

For example the F-series of CPUs from the AMD EPYC Genoa series.

https://www.amd.com/content/dam/amd/en/documents/products/epyc/epyc-9004-series-processors-data-sheet.pdf

When it comes to AMD, use 12 memory sticks to fully utilize the 12 memory channels.

There are also a few tweaks for the BIOS, like enabling “Performance mode”; on the other hand, this will mostly just consume more power and generate more heat that needs to be cooled off.

I probably need to write a little more about this, please excuse me.
VyOS is currently running as a VM under KVM. Inside it there is a VLAN network where the VMs run, and traffic then goes out to the network via NAT. On the KVM host I use a vSwitch for the networks.

The hosts are each equipped with 2x Xeon 6248R
Some have Mellanox mt27710, others have BCM57414 NetXtreme from Broadcom. The mix is due to the shortage of components in 2021/22.
I can see I’ll probably have to deal with SR-IOV and the network cards here:
https://docs.nvidia.com/networking/display/mlnxofedv581011/single+root+io+virtualization+(sr-iov)
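
For reference, on Linux SR-IOV virtual functions are typically enabled through sysfs. A minimal sketch follows; the interface name and VF count are placeholders, and it assumes the NIC, firmware, and BIOS all have SR-IOV enabled (both the ConnectX-4 Lx and BCM57414 generally support it), so treat it as an illustration rather than a drop-in script:

```python
# Sketch: enable N SR-IOV virtual functions on a physical interface via sysfs.
# Must run as root; "eth0" and the VF count are placeholders for illustration.
from pathlib import Path

iface = "eth0"
dev = Path(f"/sys/class/net/{iface}/device")

total = int((dev / "sriov_totalvfs").read_text())    # VFs supported by the NIC
print(f"{iface} supports up to {total} VFs")

# The kernel requires going back to 0 before setting a different nonzero count.
(dev / "sriov_numvfs").write_text("0")
(dev / "sriov_numvfs").write_text("4")                # create 4 VFs (example)
```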

More like lobster and crawfish. The BSD network stack and drivers are kind of trash in comparison, hence why I’m on Linux. Even if it’s a bit less simple to configure, I’ll opt for performance. For example, running 10 gigs through PF takes 100% of my standby box’s CPU, while the same box with more features enabled on my VyOS install takes 1-2% CPU. It’s a massive difference, and I’m running Intel cards, which are supposed to be the best on BSD.

I would say you probably have some other malfunction going on if the same bare metal with the latest FreeBSD uses 100% CPU to push 10 Gbps through a 10G NIC, while VyOS running a current stable Linux kernel (assuming VyOS 1.5 rolling) only uses 1-2% CPU to push 10 Gbps through the same 10G NIC (and the same motherboard, CPU, RAM, etc.).

It’s possible, yet I do know the kernel driver support is always way behind, and even when I compile the latest, it’s still the same. I really don’t know what the difference is. The only thing that could be the issue as far as CPU goes is that I was doing flow accounting on BSD and currently don’t on Linux. Perhaps if I enabled it, I’d see the same thing. Otherwise, it’s just a flat single-subnet setup; no extra services or routing are enabled on either platform. I can easily push 10 gigs without any struggle on 6.x-kernel Linux platforms, without offloads, so there’s something uniquely different in BSD’s stack, even with what is supposed to be an optimal NIC for the platform (500-series Intel). As I see it, with the one exception, there’s no real difference in enabled functionality. Netfilter, as I understand it, is considerably more efficient than PF, so that may have something to do with it as well. I never really got to the bottom of it, as I have no pain points on VyOS with regard to network performance with little to no tweaking.

I got so tired of struggling with it that I just moved to VyOS. I don’t even need offloads for full performance, and the CPU isn’t even close to breaking a sweat. I’ve seen plenty of other similar stories while trying to bend BSD to my will, yet I could never stop the CPU spikes, even if I severely limited my flows. The box is an appliance form-factor dual Xeon Cascade Lake, which should have plenty of power, yet it’s a drastically different experience between the two OSes.

After the initial learning curve of the CLI, I can configure everything the same, and with the docker/podman options in VyOS I get even more flexibility with regard to adding more services to the box. I haven’t tried HA yet on the VyOS platform as I’m waiting on another cluster node to arrive, but once it does, that’ll be my next step: a new KVM host as the primary, since it’s a newer-generation chip, and the current box as its partner in crime. I’m really hoping the HA in VyOS is fleshed out; I will find out soon!

Netflix uses FreeBSD for their CDN solution, named Open Connect, so if the difference in network performance between Linux and FreeBSD on the same hardware really were 50:1, I seriously doubt Netflix would have chosen to use FreeBSD at all:

https://papers.freebsd.org/2019/fosdem/looney-netflix_and_freebsd/

Also:

The “other” FreeBSD optimizations used by Netflix to serve video at 800Gb/s from a single server