VyOS appears to have re-enabled TSO, GSO, and GRO by default in the latest VyOS 1.5-rolling-202406181053; this causes regressions

doctorpangloss · June 18, 2024, 11:53pm

When diagnosing an issue where I would lose network connectivity when making uploads on the network, I observed the following configuration:

show interfaces ethernet eth0
 address dhcp
 description OUTSIDE
 hw-id 04:0e:3c:8b:a6:bf
 offload {
     gro
     gso
     sg
     tso
 }

Previously, VyOS left these off (see the thread "How come offload settings isnt enabled by default?"), which makes sense because i210 and i225 devices (the most common NICs in the world right now) are buggy with them turned on.

These reappeared and caused me significant toil.
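
To confirm what the kernel has actually applied, as opposed to what the VyOS configuration shows, the live offload state can be read with ethtool. This is only a minimal sketch: it assumes ethtool is available and that the interface is eth0, and the helper name show_offloads is made up for illustration.

```shell
# Hypothetical helper: from `ethtool -k` style output on stdin, keep only
# the four offload features that appear in the config above.
show_offloads() {
    grep -E '^[[:space:]]*(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|scatter-gather):'
}

# Usage on a live system (requires ethtool; eth0 is an assumption):
#   ethtool -k eth0 | show_offloads
```

Each matching line ends in "on" or "off", so a mismatch between this output and the saved configuration is easy to spot.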

Version:          VyOS 1.5-rolling-202406181053
Release train:    current
Release flavor:   generic

Built by:         [email protected]
Built on:         Tue 18 Jun 2024 13:50 UTC
Build UUID:       d2c3d495-0be8-455c-83e7-7960a7578ea8
Build commit ID:  2b3d1167850b85

Architecture:     x86_64
Boot via:         installed image
System type:      bare metal

Hardware vendor:  HP
Hardware model:   HP EliteDesk 800 G5 Desktop Mini
Hardware S/N:     MXL95025NY
Hardware UUID:    800b5dc3-e6c8-ba65-0bcb-dc6bfdfbccb2

Copyright:        VyOS maintainers and contributors

tjh · June 19, 2024, 12:26am

If this is a bug report, it might pay to state what version you were on and how you arrived at the version you’re on.
And also if you can reproduce this happening again in a test environment.

rpendleton · June 21, 2024, 6:49am

I noticed the same issue when updating from 1.5-rolling-202405310019 to 1.5-rolling-202406190020. I’m not sure if the offloading is what caused my connectivity issues, but recovering from this upgrade was quite difficult.

After performing this upgrade, all of my ethernet interfaces had a new offload section added with several types enabled by default. My WAN connection became completely unavailable, and I was unable to SSH into VyOS from any device on the LAN. DNS queries sent to VyOS were also failing, even for queries that had static mappings that could have been answered without using an upstream resolver. It seems like all communication to VyOS itself was being dropped, despite having firewall rules that should have allowed that traffic.

Using a console connection, I tried deleting the offload sections and then I committed and saved, but that didn’t seem to help. I tried rebooting, but that didn’t seem to help either. I’m pretty sure the offload sections were actually being re-added after reboots, but I’m unsure whether that was a side-effect of me trying to switch back and forth between the new system image and an older one. Either way, I wasn’t able to find a way to get things working at all on the newer version of VyOS.

Downgrading was actually pretty weird too. At one point while on the newer version, I power cycled my ISP’s fiber jack, and that restored Internet connectivity, but not SSH or DNS. I then downgraded to the older system image, but that broke the Internet while fixing SSH and DNS. I probably should have restarted my ISP’s fiber jack at that point as well, since that may have resolved the remaining issues.

Instead, in the end, I had to do a combination of switching back to the older system image, rebooting the VyOS VM, restarting the VyOS host machine, fully removing power from the host machine instead of just rebooting it (thinking the NICs might have gotten into a bad state that restarts wouldn’t fix), and power cycling my ISP’s fiber jack. After doing those things in a variety of different orders, I eventually got things working again on the older system image.

My suspicion is that offloading broke MAC address spoofing on my WAN interface, which could have caused a majority of my issues. It could also just be buggy offloading implementations on my NICs though.

If I can find some time, I’ll see if I can reproduce these issues in a more controlled test environment instead of in my home network…

doctorpangloss · June 24, 2024, 7:20pm

Well, if anyone is searching for catastrophic connectivity or LAN issues in the latest VyOS builds: you must delete the offload section’s configuration today to resolve your problem.

tjh wrote: "If this is a bug report, it might pay to state what version you were on and how you arrived at the version you’re on."

I am not sure why I am blocked by Phabricator from making bug reports, but I observed this issue going from 1.5-rolling-202404141045 to 1.5-rolling-202406181053

vyozzy · August 19, 2024, 6:07pm

I have two identical KVMs. Both were running fine on 1.5-rolling-202408120022 until yesterday, when I took one of them and installed 1.5-rolling-202408181910. Since then, the newly installed machine’s /var/log/messages gets spammed with ethX: bad gso: type: 1, size: 1452. The connectivity/throughput is catastrophic.
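
To see how often the message appears and on which interfaces, the log can be summarized with standard tools. This is a rough sketch: the helper name count_bad_gso is hypothetical, and it assumes each log line contains the interface name directly before "bad gso", as in the message quoted above.

```shell
# Hypothetical helper: count "bad gso" kernel messages per interface.
count_bad_gso() {
    # $1: path to the log file (for example /var/log/messages)
    grep -oE '[[:alnum:]]+: bad gso' "$1" | sort | uniq -c
}

# Usage (reading the log may require root):
#   count_bad_gso /var/log/messages
```

Running it before and after an offload change gives a simple way to tell whether the messages have stopped.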

delete interfaces ethernet eth0 offload
commit
save

After the commit, connectivity immediately gets significantly better. I haven’t measured the difference yet; all I can tell for now is that with offloading enabled, loading a small (200 kB!) login web page took over 30 seconds, and now it feels OK again.

The problem is that even though I’ve committed and saved the offload removal, the offload settings are present again after rebooting (the KVM, the host, or both). How do I get rid of offloading permanently? I already disabled GSO on the host’s (Proxmox) physical NIC and the virtual bridge, at least I think so (I don’t really know exactly what I’m doing here; I’m something of a network-virtualization noob).

tjh · August 19, 2024, 8:06pm

This is a kernel bug - I’ve hit the same issue running 6.6.44 on a non-Vyos box (that’s virtualised)
Since upgrading to 6.6.46 the error has gone away.

So yea, it’s a bug in the kernel and I’m sure when newer images pull in a later kernel you’ll see it go away.
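
A quick way to check whether a given box is already past that kernel is to compare the running version against 6.6.46. This is a sketch only: the helper name version_at_least is made up, it relies on GNU sort's -V option, and the threshold applies to the 6.6 series mentioned above, not to other kernel branches.

```shell
# Hypothetical helper: succeed if version $1 is at least version $2.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Drop any suffix after the first "-" in the release string before comparing.
kernel=$(uname -r | cut -d- -f1)
if version_at_least "$kernel" 6.6.46; then
    echo "kernel $kernel is at or past 6.6.46"
else
    echo "kernel $kernel predates 6.6.46"
fi
```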

vyozzy · August 19, 2024, 9:24pm

Thanks for your answer. Do you mean there’s no relation between "bad gso" and offloading?

Here I’ve read that the bug is fixed and already made it into the 6.10 and 6.6 kernel stable queue - do you have a clue how long it approximately takes until the updates get into the nightly-builds?

tjh · August 19, 2024, 9:35pm

I’m not sure of the relation sorry, I just know that there is that kernel bug that causes that error to appear in dmesg and performance to tank. If you can disable/fix it by changing offloads, I don’t know. I haven’t encountered it on Vyos, just on my Linux server and it didn’t seem to impact performance there.

I don’t know how long it’ll take to get a new Kernel but it’s usually pretty quick I believe, maybe a week?

vyozzy · August 19, 2024, 9:57pm

The bug concerns virtio - so if your server is physical hardware there will be no impact.

tjh · August 19, 2024, 10:09pm

Yea my server is a virtual machine - it had the log errors flying up the screen, but it didn’t seem to impact performance of any of the services that the server provides. Speedtests from the server (via Ookla’s cli version of Speedtest) were still fine etc, nothing else seemed to break.

I suspect that if I were forwarding traffic, that might have been different, but this is a single-interface server, not a router.

vyozzy · August 19, 2024, 10:24pm

Ah ok, I see. Thanks!