I’m new to VyOS and am trying to test it and see it working, but unfortunately the USB won’t fully boot.
I was able to install Debian bookworm 12.5 successfully. I tried that because I had read somewhere that if it panicked as well, it might give some pointers, but it works fine.
Unfortunately I can’t scroll this panic output and I’m not sure why it happens:
I’ve also tried building the VyOS 1.4 LTS version as the documentation describes, to see if this isn’t a regression of some sort, but the process fails with repositories not working:
E: Failed to fetch http://dev.packages.vyos.net/repositories/sagitta/dists/sagitta/InRelease 403 Forbidden [IP: 104.18.30.79 443]
I’m not sure why this happens and how to troubleshoot this problem further.
Any advice would be greatly appreciated!
It is the livecd that crashes during boot.
I don’t have IPMI access to the machine, but it’s right in front of me.
@c-po I’m unsure how to add more recent Kernel Firmware to the build process.
I followed the “Build VyOS” page of the VyOS 1.5.x (circinus) documentation inside the build container and started the build script ./build-linux-firmware.sh. It complained about missing kernel sources.
So I went ahead and cloned the Linux kernel and checked out the latest tag, but the kernel build fails:
root@55c17aa3d0dd:/vyos/packages/linux-kernel/linux# git checkout v6.9.7
Updating files: 100% (11635/11635), done.
Note: switching to 'v6.9.7'.
You are in 'detached HEAD' state. You can look around, make experimental
...
HEAD is now at 12c740d50d4e Linux 6.9.7
root@55c17aa3d0dd:/vyos/packages/linux-kernel/linux# cd ..
root@55c17aa3d0dd:/vyos/packages/linux-kernel# ./build-kernel.sh
I: Copy Kernel config (x86_64_vyos_defconfig) to Kernel Source
'arch/arm64/configs/vyos_defconfig' -> 'linux/arch/arm64/configs/vyos_defconfig'
'arch/x86/configs/vyos_defconfig' -> 'linux/arch/x86/configs/vyos_defconfig'
I: clean modified files
HEAD is now at 12c740d50d4e Linux 6.9.7
I: Apply Kernel patch: /vyos/packages/linux-kernel/patches/kernel/0001-linkstate-ip-device-attribute.patch
patching file Documentation/networking/ip-sysctl.rst
Hunk #1 succeeded at 1754 (offset 20 lines).
patching file include/linux/inetdevice.h
Hunk #1 succeeded at 139 (offset 2 lines).
patching file include/linux/ipv6.h
Hunk #1 succeeded at 91 with fuzz 1 (offset 7 lines).
patching file include/uapi/linux/ip.h
patching file include/uapi/linux/ipv6.h
patching file net/ipv4/devinet.c
Hunk #1 succeeded at 2572 (offset -23 lines).
patching file net/ipv6/addrconf.c
Hunk #1 FAILED at 5656.
Hunk #2 succeeded at 7149 (offset 64 lines).
1 out of 2 hunks FAILED -- saving rejects to file net/ipv6/addrconf.c.rej
patching file net/ipv6/route.c
Hunk #1 succeeded at 680 (offset 3 lines).
Hunk #2 succeeded at 729 (offset 3 lines).
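A likely reason for the rejected hunk is that the patches under patches/kernel are written against the kernel release that vyos-build pins, not against v6.9.7. The sketch below reads that pinned version so the matching tag can be checked out instead. It assumes the version is stored as a `kernel_version = "..."` line in `data/defaults.toml` at the top of the vyos-build tree; the helper name `pinned_kernel_version` is made up for illustration.

```shell
#!/bin/sh
# Print the kernel version pinned in a vyos-build defaults file.
# Assumption: the file contains a line of the form  kernel_version = "X.Y.Z"
pinned_kernel_version() {
    sed -n 's/^kernel_version[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1"
}

# Example use from /vyos/packages/linux-kernel (paths are assumptions):
#   ver=$(pinned_kernel_version ../../data/defaults.toml)
#   git -C linux checkout "v$ver"
#   ./build-kernel.sh
```

If a patch still fails on the pinned version, the saved reject file (here net/ipv6/addrconf.c.rej) shows exactly which hunk did not apply.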
This seems to be related to the mt7921, and you are not alone.
Theories range from badly written drivers to poor contact in the connectors (so bad data gets onto the data bus) to an overheating chip.
Try disabling that card in the BIOS and see whether the kernel panics go away. If they do, we have isolated the problem to the card itself, and you can then try other tricks such as adding cooling or reseating the card (if it’s removable).
Yeah I have the MS-01 board with Core i9-13900H.
I currently have ubuntu-24.04 installed and it runs fine, so I don’t think I have any cooling issues.
I am currently using the card to connect to the box via wifi.
My gut tells me it’s a driver issue with the kernel shipped in the VyOS LiveCD ISO.
Just ran into this right now, updating my main router from 1.4.0-epa2 to 1.4.0 GA. Same Kernel panic for the same MT7921. I had to remove the card to get the system to function.
Perhaps there is some issue with MediaTek devices and Linux kernels older than 6.6.0? GL-iNet switched to the closed-source MTK SDK for release 4.6.0 and onwards because of issues with the open-source MediaTek drivers (they also published another release based on OpenWrt 24, which uses Linux kernel 6.6.x and the open-source driver):
“Due to certain performance and compatibility issues with the open-source drivers for the model, firmware version 4.6.0 will utilize the MTK SDK to ensure a better user experience. If these issues are resolved in the future, we will revert to the Native OpenWrt version with the open-source driver. For customers preferring the open-source driver, we will provide a synchronized Native OpenWrt version labeled 4.x.x-opxx, based on the OpenWrt main branch with kernel version 6.6.x. The MTK SDK will be used for their 4.x version. We will continue to address bugs in the open-source version and will make it the main line if it eventually outperforms the closed-source driver.”
The faulty driver was added via T6293 (“add Mediatek MT7921 to defconfig”) and is about to be removed until we upgrade to the next kernel LTS version (December 2024).
Slightly off-topic, but is there something that can be done to the design/config of VyOS to avoid similar events in the future?
One option would be an additional GRUB entry with “safe” settings, but that will of course fail if your mgmt interface happens to use one of the bad drivers (and you need console access to select that entry if things go south anyway).
A partial solution would be a config option such as “set system boot module_blacklist”, but for that to work you must have a kernel that boots (perhaps together with the “safe” boot option above).
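Independently of any VyOS config option, the Linux kernel itself accepts a `module_blacklist=` boot parameter with a comma-separated list of modules, which can be appended to the `linux` line from the GRUB editor for a one-off boot. A minimal sketch for checking afterwards that the parameter took effect follows; `mt7921e` is assumed to be the module driving a PCIe MT7921 card, and `is_blacklisted` is a hypothetical helper name.

```shell
#!/bin/sh
# Return 0 if module $2 appears in a module_blacklist= list inside the
# kernel command line string $1, otherwise return 1.
is_blacklisted() {
    for arg in $1; do
        case "$arg" in
            module_blacklist=*)
                list=${arg#module_blacklist=}
                # Wrap in commas so only whole module names match.
                case ",$list," in
                    *",$2,"*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}

# Check the command line of the running kernel.
if is_blacklisted "$(cat /proc/cmdline 2>/dev/null)" mt7921e; then
    echo "mt7921e is blacklisted on this boot"
else
    echo "mt7921e is not blacklisted on this boot"
fi
```

This still needs console access to reach the GRUB editor, so it does not remove the limitation described above.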
Another thought: is it possible to have NIC drivers loaded late, with some “set system” option for a delayed start of, let’s say, 60 seconds (as the default, but configurable up to 900 seconds or whatever)?
I’m thinking that this way Wi-Fi and the like could be delayed, so if something like this occurs again in the future you at least have a box that works for up to, say, 15 minutes, during which you can log in and fix the issue before it crashes.
That is, when the box boots, only the mgmt interface and fixed interfaces using safe drivers work, and the rest are activated after that delayed start.
Also, is there a configurable watchdog so that if VyOS hits a kernel panic it reboots by itself after, let’s say, 60 seconds (handy if the box is at a remote location where you don’t have easy physical access to power-cycle it manually)?
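On the reboot-after-panic question, the stock Linux kernel already has a `kernel.panic` sysctl (also settable as `panic=N` on the kernel command line): zero means hang forever, a positive value means reboot after that many seconds, and a negative value means reboot immediately. The sketch below only interprets the current value; whether and how VyOS exposes this in its own CLI is not covered here, and `describe_panic_timeout` is a hypothetical helper name.

```shell
#!/bin/sh
# Translate a kernel.panic value into a human-readable description.
describe_panic_timeout() {
    case "$1" in
        0)  echo "hang forever after a panic" ;;
        -*) echo "reboot immediately after a panic" ;;
        *)  echo "reboot $1 seconds after a panic" ;;
    esac
}

# Show the setting of the running kernel (the file is absent on non-Linux systems).
if [ -r /proc/sys/kernel/panic ]; then
    describe_panic_timeout "$(cat /proc/sys/kernel/panic)"
fi

# To enable a 60-second reboot until the next boot, as root:
#   sysctl -w kernel.panic=60
```

An automatic reboot does not help when the panic happens on every boot, as in this thread; the box would simply loop.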
Well who will decide which drivers to block and which not?
This is not how config-load works. There is (and must be) a priority list of who comes first when booting a system, and many services depend on interfaces to render config files or configure routing. If the interface is not present we can not continue as e.g. routing will be broken.
With “set system boot module_blacklist” the admin could decide which drivers to block, and that setting would survive updates, which the current method of manually altering conf files in bash mode doesn’t.
The delayed start could either be a global setting (defaulting to 0 seconds), or alternatively interfaces could use a dummy driver during boot and be switched to the correct driver only once everything else is done (however, a global delay is less work to implement).
That will not work: you will not have a working CLI because you will never reach it. You would need a custom ISO with a custom config, or to boot with custom kernel options. All of that adds more pain than simply waiting a bit and upgrading to a more recent Linux kernel that fixes the issue.