I have been receiving rx_csum_offload_errors on my interfaces on multiple VyOS routers. I’m running VyOS 1.4.2. One of the interfaces (eth13) is the IP transit link on one of my border routers, shown below:
I’ve looked up what the error actually means, and it appears to indicate checksum validation errors on L3/L4 packets.
At the same time, my customer is complaining about packet loss on MTR tests that they run to an external destination. I ran the same MTR tests myself to the same destination, over the same routing backbone, and saw no packet loss at all. Still, I believe the errors above might be related.
How can I determine which packets, sources, destinations etc. are causing these errors?
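For reference, this is roughly what an L3 checksum validation looks like when done in software. It is only a sketch of the standard Internet checksum (RFC 1071) applied to an IPv4 header, not what the ixgbe hardware or driver actually runs. The function names are made up for illustration, and the sample header in the usage note is a textbook example rather than traffic from these routers. Something like this could be run over packets pulled from a capture to see which sources or destinations fail.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's-complement sum of 16-bit words, complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header_ok(packet: bytes) -> bool:
    """True if the IPv4 header checksum verifies.

    Summing the whole header, checksum field included, must yield 0.
    """
    ihl = (packet[0] & 0x0F) * 4  # header length in bytes
    return internet_checksum(packet[:ihl]) == 0
```

Usage, with a well-known sample header (UDP, 192.168.0.1 to 192.168.0.199): `ipv4_header_ok(bytes.fromhex("4500 0073 0000 4000 4011 b861 c0a8 0001 c0a8 00c7"))` returns `True`, and flipping any single bit in the header makes it return `False`.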
If it’s an Intel e1000/e1000e you can try disabling TSO and GSO offloading and see if that helps.
This seems to have been an ongoing issue for the past months which Intel still hasn’t fixed.
Other than that, if it’s a transceiver-based NIC, try swapping the transceiver and/or the cabling (such as patch cables), or use a different port on the patch panel.
I also think it’s more of a local problem than an external routing issue, since I see the errors on both external and internal interfaces across multiple VyOS 1.4.2 routers @roedie
@Apachez they are ixgbe interfaces. All the transceivers and cables are brand new from the factory (FS/Fiberstore), so the chance of several of them being faulty is low. I’ve only noticed the errors on routers that actually forward traffic, e.g. the master VRRP router and the main border router.
I’ve read somewhere that if I disable the offload, the checksum calculation and verification will be done on the CPU instead. Doesn’t this mean the errors are still there, just that they’re now processed by the CPU, which is also going to increase the CPU load?
If you get the checksum errors with the default offload settings but they vanish once you disable TSO and GSO offloading (the ongoing issue mainly affects Intel e1000/e1000e NICs, but it can exist with others as well), then we know for sure that the broken Intel drivers are the cause.
If the checksum errors remain (don’t forget to reboot the device as well, and since this is troubleshooting, do it with a complete power-off, i.e. disconnect it from the power grid for 10+ seconds), then the fault is either in the card itself (faulty chip, faulty PHY, faulty connector) or in bad cabling.
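To tell whether the counter really stops incrementing after a change, you could snapshot the NIC statistics before and after and compare them. Below is a rough sketch that assumes `ethtool` is installed and that `ethtool -S <iface>` prints `name: value` lines, with the counter named `rx_csum_offload_errors` as in the original post. The helper names are hypothetical.

```python
import subprocess


def parse_ethtool_stats(text: str) -> dict[str, int]:
    """Parse 'name: value' lines from `ethtool -S` output into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skips the "NIC statistics:" header and any non-numeric lines.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def read_stats(iface: str) -> dict[str, int]:
    """Run `ethtool -S <iface>` and return its counters."""
    result = subprocess.run(
        ["ethtool", "-S", iface],
        capture_output=True, text=True, check=True,
    )
    return parse_ethtool_stats(result.stdout)


def counter_delta(before: dict[str, int], after: dict[str, int],
                  name: str = "rx_csum_offload_errors") -> int:
    """How much a counter grew between two snapshots (missing counts as 0)."""
    return after.get(name, 0) - before.get(name, 0)
```

Usage: take `before = read_stats("eth13")`, disable the offloads (for a quick non-persistent test, `ethtool -K eth13 tso off gso off`), let traffic flow for a while, take `after = read_stats("eth13")`, and look at `counter_delta(before, after)`. A delta of zero under the same traffic load would point towards the offload path.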
After replacing transceivers, cables and patch panels (it doesn’t matter that they are new; they could have been damaged in transport from China to your site), you can also check whether the errors are tied to data from a specific source.
Preferably, disconnect all network cables and connect just a laptop with a direct TP cable (no patch panels in between) to see whether you can reproduce the error. That way as few devices and components as possible are involved during the troubleshooting, which makes it easier to verify what the actual fix is.