Troubleshooting rx_csum_offload_errors on interfaces

I have been seeing rx_csum_offload_errors on my interfaces on multiple VyOS routers. I’m running VyOS 1.4.2. One of the interfaces (eth13) is the IP transit link on one of my border routers, shown below:

$ sh interfaces eth eth13 statistics | grep "errors"
rx_errors: 636294
rx_csum_offload_errors: 636294

$ netstat -s
Icmp:
    InCsumErrors: 1483
Tcp:
    InCsumErrors: 6485957
Udp:
    19 packet receive errors
    0 receive buffer errors
    0 send buffer errors
    InCsumErrors: 19

I’ve looked up what the error actually means, and it looks like these are checksum validation errors for L3/L4 packets.
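For context on what is being validated: IPv4, ICMP, TCP and UDP all use the 16-bit ones’ complement Internet checksum (RFC 1071), and the receiver (NIC or kernel) counts an error when the recomputed sum does not match. Below is a minimal Python sketch of that check on an IPv4 header. The sample header bytes are an illustrative textbook example, not traffic from eth13, and the function names are made up for this sketch.

```python
def internet_checksum(data: bytes) -> int:
    """Return the RFC 1071 ones' complement checksum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # sum big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def ipv4_header_ok(header: bytes) -> bool:
    """A header with a correct checksum field sums to zero."""
    return internet_checksum(header) == 0


# Illustrative 20-byte IPv4 header (checksum field is bytes 10-11).
SAMPLE_HEADER = bytes.fromhex("4500 0073 0000 4000 4011 b861 c0a8 0001 c0a8 00c7")

if __name__ == "__main__":
    zeroed = SAMPLE_HEADER[:10] + b"\x00\x00" + SAMPLE_HEADER[12:]
    print(hex(internet_checksum(zeroed)))   # 0xb861
    print(ipv4_header_ok(SAMPLE_HEADER))    # True

    # Corrupt one byte (the TTL) the way a bad link or buggy offload might.
    corrupted = bytearray(SAMPLE_HEADER)
    corrupted[8] = 0x3F
    print(ipv4_header_ok(bytes(corrupted))) # False
```

TCP and UDP apply the same arithmetic over a pseudo-header plus the segment, so a single flipped bit anywhere in the covered data is enough to increment InCsumErrors.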

At the same time, my customer is complaining about packet loss on MTR tests that they run to an external destination. I ran the same MTR tests myself to the same destination, over the same routing backbone, and saw no packet loss at all. So I believe the errors above might be related.

How can I determine which packets (source, destination, etc.) are causing these errors?

What NIC are you using with what kind of connection?

AFAIK this seems more like a local network problem to me than a routing issue. More like NIC, optic, cabling or congestion etc.

oh, and is it a physical machine or a VM?

If it’s Intel e1000/e1000e you can try to disable offloading for TSO and GSO and see if that helps.

This seems to have been an ongoing issue for the past months which Intel still hasn’t fixed.

Other than that, if it’s a transceiver-based NIC, then try swapping the transceiver and/or the cabling such as patch cables, or use a different port on the patch panel.

I also think it’s more of a local problem than external routing, since I see the errors on both external and internal interfaces across multiple VyOS 1.4.2 routers, @roedie.

@Apachez they are ixgbe interfaces. All the transceivers and cables are brand new from the factory (FS/Fiberstore), so the chances of several being faulty are low. I’ve only noticed the errors on routers that are actually forwarding traffic, e.g. the master VRRP router and the main border router.

I agree with @Apachez - Try disabling offloads.
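As a rough sketch of what "disabling offloads" could look like for a quick test, the Python helper below builds and optionally runs an `ethtool -K` command. The interface name eth13 comes from the first post. The helper names and the dry-run default are assumptions of this sketch. Note that settings made directly with ethtool are not persistent, and VyOS normally manages offloads from its own interface configuration, so treat this as a temporary experiment only.

```python
import shlex
import subprocess


def offload_command(iface, features=("tso", "gso"), state="off"):
    """Build the ethtool argument list that sets the given offload features."""
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    cmd = ["ethtool", "-K", iface]
    for feature in features:
        cmd += [feature, state]
    return cmd


def apply_offload(iface, features=("tso", "gso"), state="off", dry_run=True):
    """Print the command (dry run) or execute it; needs root when executed."""
    cmd = offload_command(iface, features, state)
    if dry_run:
        print(shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run only: prints "ethtool -K eth13 tso off gso off"
    apply_offload("eth13")
```

Since the counter in question is on the receive side, it may also be worth trying the same helper with `features=("rx",)`, which turns off receive checksum offload so the kernel validates checksums in software. Whether that changes anything on ixgbe here is untested.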

I’ve read somewhere that if I disable the offload, checksum calculation and verification will be done on the CPU instead. Doesn’t this mean the errors are still there, just processed by the CPU now? That would also increase the CPU load.

My opinion: Don’t try and over optimise. Get it working first, and work back from there.

I expect the hardware offloads are introducing the error.

That’s a later problem to resolve.

Now, if you get the checksum errors with the default offload settings but they vanish when you disable TSO and GSO offloading (the ongoing issue with Intel NICs, mainly e1000/e1000e, though it can exist with others as well), then we know for sure that the broken Intel drivers are the issue.

If the checksum errors remain (don’t forget to reboot the device as well, and since this is troubleshooting, do it with a complete power-off, i.e. disconnected from the power grid for 10+ seconds or so), then the error is either in the card itself (faulty chip, faulty PHY, faulty connector) or in bad cabling.
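To tell whether the errors "vanish" or "remain", it helps to compare counter snapshots taken before and after a change rather than looking at the absolute totals. Here is a small Python sketch that parses `name: value` lines like the statistics output in the first post and reports which error counters moved. The "before" numbers are from the first post; the "after" numbers are hypothetical, and the function names are made up for this sketch.

```python
def parse_counters(text):
    """Parse 'name: value' lines into a dict, skipping non-numeric lines."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def counter_deltas(before, after, keyword="err"):
    """Return {counter: increase} for matching counters that changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and keyword in name and after[name] != before[name]
    }


BEFORE = """\
rx_errors: 636294
rx_csum_offload_errors: 636294
"""

# Hypothetical second snapshot taken some minutes later.
AFTER = """\
rx_errors: 636300
rx_csum_offload_errors: 636300
"""

if __name__ == "__main__":
    deltas = counter_deltas(parse_counters(BEFORE), parse_counters(AFTER))
    for name, diff in deltas.items():
        print(f"{name}: +{diff}")
```

If the deltas stay at zero over a period with normal traffic after a change, that change is a good candidate for the fix. If they keep climbing, move on to the next suspect.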

After replacing transceivers, cables and patch panels (it doesn’t matter that they are new, they could have been damaged in transport from China to your place), you can also test with traffic from one specific, known source.

Preferably, disconnect all network cables and connect just a laptop with a direct TP cable (no patch panels in between) to see whether you can reproduce the error. That way as few devices and components as possible are involved during troubleshooting, and you can verify what the actual fix is.
