10GbE tuning for best throughput between NAS and workstation

Hello!
I’m using VyOS as a virtual switch for the 10GbE connection between my workstation and NAS. It runs on ESXi alongside OPNsense, which handles general-purpose routing.

With a direct connection between the NAS and the workstation I get 1 GB/s when copying files over SMB. With VyOS in between, I can only reach ~580 MB/s. Before tuning it was ~450 MB/s; the tuning I applied was: ethtool -K eth1 rx on tx on sg on tso on gso on gro on lro on ntuple on rxhash on; ethtool -G eth1 tx 4096 rx 4096.
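To put those numbers next to the link speed, here is a small sketch that converts the copy rates into Gbit/s and a share of a 10GbE link. It assumes MB means 10^6 bytes and ignores SMB/TCP/Ethernet overhead, so the real line usage is somewhat higher than shown; the function names are only for illustration.

```python
# Rough conversion of file-copy rates to link utilisation.
# Assumptions: 1 MB = 10**6 bytes, protocol overhead is ignored.
LINK_GBPS = 10.0  # nominal 10GbE line rate


def mb_per_s_to_gbps(mb_per_s: float) -> float:
    """Convert megabytes per second to gigabits per second."""
    return mb_per_s * 8 / 1000


def link_share(mb_per_s: float, link_gbps: float = LINK_GBPS) -> float:
    """Fraction of the nominal link rate used by the payload."""
    return mb_per_s_to_gbps(mb_per_s) / link_gbps


# Rates quoted in the post above.
for label, rate in [("direct", 1000), ("VyOS tuned", 580), ("VyOS untuned", 450)]:
    print(f"{label}: {mb_per_s_to_gbps(rate):.2f} Gbit/s, "
          f"{link_share(rate):.0%} of 10GbE")
```

So the tuned VyOS path is moving a bit under half of the nominal line rate, while the direct link reaches about 80%.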

Both the NAS and the workstation use Mellanox ConnectX-3 cards. VyOS has an X520 via PCIe passthrough, plus a VMXNET3 adapter for the 1GbE Intel LAN link shared with OPNsense. All VyOS interfaces (eth0-2) are tied into a single bridge. MTU is set to 9000 on the 10GbE links on each side.

The ESXi host spec is modest (G4400, 8GB DDR4), but VyOS peaks at around 6% CPU usage during copying. VyOS is assigned 2 vCPUs and 2GB RAM.

Is there anything I can do to improve performance? At the moment I don’t know where the bottleneck in the 10GbE switching is. I know I can always upgrade the hardware, but even the current limited resources don’t seem to be fully used.

Any advice will be appreciated!

Hello @nefph. Could you run top while copying something through VyOS, press 1 to show per-CPU usage, and post the output?
Do you use VLANs? Also try enabling RPS:

set interfaces ethernet eth0 offload rps

and set the system performance option to throughput:

set system option performance throughput 
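
If it is easier to script than to screenshot top, the same per-core view can be approximated from /proc/stat. This is only a sketch, assuming a Linux shell with Python available; the helper names are made up for illustration. It prints busy and softirq percentages per CPU over one second, which helps show whether packet processing is sitting on a single core (the situation RPS is meant to spread out).

```python
import os
import time

# Per-CPU columns of /proc/stat, in order (values are in jiffies).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_proc_stat(text):
    """Return {cpu_name: {field: jiffies}} for the per-CPU lines (cpu0, cpu1, ...)."""
    cpus = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines such as "intr".
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        values = [int(v) for v in parts[1:1 + len(FIELDS)]]
        cpus[parts[0]] = dict(zip(FIELDS, values))
    return cpus


def usage_between(before, after):
    """Return {cpu_name: (busy_pct, softirq_pct)} between two snapshots."""
    result = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        delta = {field: new[field] - old[field] for field in new}
        total = sum(delta.values())
        if total <= 0:
            continue
        idle = delta["idle"] + delta["iowait"]
        result[cpu] = (100.0 * (total - idle) / total,
                       100.0 * delta["softirq"] / total)
    return result


def read_snapshot(path="/proc/stat"):
    with open(path) as f:
        return parse_proc_stat(f.read())


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    first = read_snapshot()
    time.sleep(1.0)  # sample while the copy is running
    second = read_snapshot()
    for cpu, (busy, softirq) in sorted(usage_between(first, second).items()):
        print(f"{cpu}: busy {busy:5.1f}%  softirq {softirq:5.1f}%")
```

Run it while the SMB copy is in progress; one core with high softirq and the other near idle would point at receive processing not being distributed.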

Most of the time you will be I/O bound, not CPU bound, so just read this file through normal Perl I/O and process it in a single thread. Unless you can prove that your I/O can deliver more data than a single CPU can process, don’t waste your time on anything more. Anyway, you should ask: why on earth is this in one huge file? Why on earth don’t they split it in a reasonable way when they generate it? That would be an order of magnitude more worthwhile. Then you could put the pieces on separate I/O channels and use more CPUs (if you don’t use some sort of RAID 0 or NAS or …).

Measure, don’t assume. Don’t forget to flush caches before each test. Remember that sequential I/O is an order of magnitude faster than random I/O.
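
A minimal way to measure this, sketched in Python rather than Perl: time a sequential read and a shuffled-offset read of the same file and compare MB/s. The function names and the 8 MiB demo file are assumptions for illustration; for a meaningful result use a file much larger than RAM or drop the page cache first (on Linux, as root: `sync; echo 3 > /proc/sys/vm/drop_caches`), otherwise both runs mostly measure memory.

```python
import os
import random
import tempfile
import time

CHUNK = 1024 * 1024  # 1 MiB per read


def read_sequential(path, chunk=CHUNK):
    """Read the file front to back; return (bytes_read, seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            total += len(data)
    return total, time.perf_counter() - start


def read_random(path, chunk=CHUNK, seed=0):
    """Read the same chunks in shuffled order; return (bytes_read, seconds)."""
    offsets = list(range(0, os.path.getsize(path), chunk))
    random.Random(seed).shuffle(offsets)
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            total += len(f.read(chunk))
    return total, time.perf_counter() - start


if __name__ == "__main__":
    # Small demo file; far too small (and cached) for a real benchmark.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * CHUNK))
        demo_path = tmp.name
    try:
        for name, fn in [("sequential", read_sequential), ("random", read_random)]:
            nbytes, seconds = fn(demo_path)
            rate = nbytes / 1e6 / max(seconds, 1e-9)
            print(f"{name}: {nbytes} bytes in {seconds:.4f}s ({rate:.0f} MB/s)")
    finally:
        os.remove(demo_path)
```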

Please don’t discount large files. Why not large files? It’s 2021; I transfer 80GB or 300GB files often enough.
It depends on how you use it. It shouldn’t matter whether it’s 5MB or 300GB or more, so that advice is not helpful.

As a technical exercise, I am going to see what I can get out of the CX-3 cards over a 40Gb switch - routed