Kernel panic booting usb vyos-1.5-rolling-202406280020-amd64.iso

Rando · June 28, 2024, 3:06pm

Hi all,

I’m new to VyOS and am trying to test it and see it working, but unfortunately the USB won’t fully boot.
I installed Debian bookworm 12.5 as a test, because I read somewhere that if it panicked as well it might give some pointers, but it works fine.

Unfortunately I can’t scroll this panic output and I’m not sure why it happens:

In Debian I can get the lspci information:

$ lspci
00:00.0 Host bridge: Intel Corporation Device a706
00:01.0 PCI bridge: Intel Corporation Device a70d
00:02.0 VGA compatible controller: Intel Corporation Raptor Lake-P [Iris Xe Graphics] (rev 04)
00:06.0 PCI bridge: Intel Corporation Raptor Lake PCIe 4.0 Graphics Port
00:06.2 PCI bridge: Intel Corporation Device a73d
00:07.0 PCI bridge: Intel Corporation Raptor Lake-P Thunderbolt 4 PCI Express Root Port
00:07.2 PCI bridge: Intel Corporation Raptor Lake-P Thunderbolt 4 PCI Express Root Port
00:0d.0 USB controller: Intel Corporation Raptor Lake-P Thunderbolt 4 USB Controller
00:0d.2 USB controller: Intel Corporation Raptor Lake-P Thunderbolt 4 NHI
00:0d.3 USB controller: Intel Corporation Raptor Lake-P Thunderbolt 4 NHI
00:14.0 USB controller: Intel Corporation Alder Lake PCH USB 3.2 xHCI Host Controller (rev 01)
00:14.2 RAM memory: Intel Corporation Alder Lake PCH Shared SRAM (rev 01)
00:16.0 Communication controller: Intel Corporation Alder Lake PCH HECI Controller (rev 01)
00:16.3 Serial controller: Intel Corporation Alder Lake AMT SOL Redirection (rev 01)
00:1c.0 PCI bridge: Intel Corporation Alder Lake-P PCH PCIe Root Port (rev 01)
00:1d.0 PCI bridge: Intel Corporation Device 51b2 (rev 01)
00:1d.3 PCI bridge: Intel Corporation Device 51b3 (rev 01)
00:1f.0 ISA bridge: Intel Corporation Raptor Lake LPC/eSPI Controller (rev 01)
00:1f.3 Audio device: Intel Corporation Raptor Lake-P/U/H cAVS (rev 01)
00:1f.4 SMBus: Intel Corporation Alder Lake PCH-P SMBus Host Controller (rev 01)
00:1f.5 Serial bus controller: Intel Corporation Alder Lake-P PCH SPI Controller (rev 01)
01:00.0 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
01:00.1 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
02:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. OM8PGP4 NVMe PCIe SSD (DRAM-less)
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
03:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
58:00.0 Ethernet controller: Intel Corporation Ethernet Controller I226-V (rev 04)
59:00.0 Ethernet controller: Intel Corporation Ethernet Controller I226-LM (rev 04)
5a:00.0 Network controller: MEDIATEK Corp. MT7922 802.11ax PCI Express Wireless Network Adapter

I’ve also tried building the VyOS 1.4 LTS version as the documentation describes, to see if this isn’t a regression of some sort, but the process fails with repositories not working:

E: Failed to fetch http://dev.packages.vyos.net/repositories/sagitta/dists/sagitta/InRelease 403 Forbidden [IP: 104.18.30.79 443]

I’m not sure why this happens and how to troubleshoot this problem further.
Any advice would be greatly appreciated!

c-po · June 28, 2024, 3:56pm

The Kernel crash is from adding an IPv6 address to an interface.

What particular hardware are you using? Is it the LiveCD that crashes? Do you have IPMI access to the machine?

marc_s · June 28, 2024, 9:26pm

mt7921 is a Mediatek wifi card I think

c-po · June 29, 2024, 11:25am

@rando can you re-test this with more recent kernel firmware?

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/

Rando · July 1, 2024, 7:50am

It is the livecd that crashes during boot.
I don’t have IPMI access to the machine, but it’s right in front of me.

@c-po I’m unsure how to add more recent Kernel Firmware to the build process.
I followed the “Build VyOS” page of the VyOS 1.5.x (circinus) documentation inside the build container and started the build script ./build-linux-firmware.sh. It complained about missing kernel sources.

So I went ahead and cloned the Linux kernel and checked out the latest release, but the kernel build fails:

root@55c17aa3d0dd:/vyos/packages/linux-kernel/linux# git checkout v6.9.7
Updating files: 100% (11635/11635), done.
Note: switching to 'v6.9.7'.

You are in 'detached HEAD' state. You can look around, make experimental
...
HEAD is now at 12c740d50d4e Linux 6.9.7
root@55c17aa3d0dd:/vyos/packages/linux-kernel/linux# cd ..
root@55c17aa3d0dd:/vyos/packages/linux-kernel# ./build-kernel.sh
I: Copy Kernel config (x86_64_vyos_defconfig) to Kernel Source
'arch/arm64/configs/vyos_defconfig' -> 'linux/arch/arm64/configs/vyos_defconfig'
'arch/x86/configs/vyos_defconfig' -> 'linux/arch/x86/configs/vyos_defconfig'
I: clean modified files
HEAD is now at 12c740d50d4e Linux 6.9.7
I: Apply Kernel patch: /vyos/packages/linux-kernel/patches/kernel/0001-linkstate-ip-device-attribute.patch
patching file Documentation/networking/ip-sysctl.rst
Hunk #1 succeeded at 1754 (offset 20 lines).
patching file include/linux/inetdevice.h
Hunk #1 succeeded at 139 (offset 2 lines).
patching file include/linux/ipv6.h
Hunk #1 succeeded at 91 with fuzz 1 (offset 7 lines).
patching file include/uapi/linux/ip.h
patching file include/uapi/linux/ipv6.h
patching file net/ipv4/devinet.c
Hunk #1 succeeded at 2572 (offset -23 lines).
patching file net/ipv6/addrconf.c
Hunk #1 FAILED at 5656.
Hunk #2 succeeded at 7149 (offset 64 lines).
1 out of 2 hunks FAILED -- saving rejects to file net/ipv6/addrconf.c.rej
patching file net/ipv6/route.c
Hunk #1 succeeded at 680 (offset 3 lines).
Hunk #2 succeeded at 729 (offset 3 lines).

c-po · July 2, 2024, 2:29pm

I see the exact same error on a Minisforum MS-01 board with the WIFI card

c-po · July 2, 2024, 6:45pm

Okay, updating the linux-firmware binaries to the latest version did not have any effect

Apachez · July 2, 2024, 7:53pm

Seems to be related to mt7921 and you are not alone.

Theories range from badly written drivers to poor contact in the connectors (so bad data gets onto the data bus) to an overheating chip.

Try disabling that card in the BIOS and see if the kernel panics go away. Then we have isolated it to the card itself, and you can try the various other tricks, like adding cooling or reseating the card (if it’s removable).

Rando · July 3, 2024, 6:35am

Yeah I have the MS-01 board with Core i9-13900H.
I currently have ubuntu-24.04 installed and it runs fine, so I don’t think I have any cooling issues.
I am currently using the card to connect to the box via wifi.
My gut tells me it’s a driver issue with the kernel shipped in the VyOS LiveCD ISO.

L0crian · July 3, 2024, 3:19pm

Just ran into this right now, updating my main router from 1.4.0-epa2 to 1.4.0 GA. Same Kernel panic for the same MT7921. I had to remove the card to get the system to function.

c-po · July 4, 2024, 3:09pm

Ubuntu 24.04 LTS ISO comes with kernel version 6.8

VyOS 1.5 and 1.4 use kernel 6.6 (LTS).
The idea is to move to the 2024 LTS kernel version once it’s released in December.

Maybe we can backport the driver from a more recent Kernel to 6.6

c-po · July 16, 2024, 8:19pm

Moved the card to a different router board and the issue persists:

[   53.878865] vyos-router[1496]: Mounting VyOS Config...done.                  
[   71.367839] BUG: kernel NULL pointer dereference, address: 0000000000000008  
[   71.374838] #PF: supervisor read access in kernel mode                       
[   71.380008] #PF: error_code(0x0000) - not-present page                       
[   71.385161] PGD 0 P4D 0                                                      
[   71.387735] Oops: 0000 [#1] PREEMPT SMP NOPTI                                
[   71.392110] CPU: 2 PID: 2840 Comm: ip Not tainted 6.6.32-amd64-vyos #1       
[   71.398659] Hardware name: Gowin Solution Co.,Ltd GW-MB-U01 /GW-MB-U01 , BIOS ARD1U001 04/14/2024
[   71.407534] RIP: 0010:mt7921_ipv6_addr_change+0x37/0x1d0 [mt7921_common]     
[   71.414283] Code: 68 02 00 00 41 54 4c 89 ef 53 48 89 d3 48 83 e4 f0 48 83 ec 70 65 48 8b 04 25 28 00 00 00 48 89 44 24 68 48 8b 86 80 07 00 00 <48> 8b 40 08 48 c7 44 24 15 00 00 00 00 48 89 44 24 08 0f b6 86 28
[   71.433015] RSP: 0018:ffffbaf1821cf7d0 EFLAGS: 00010282                      
[   71.438258] RAX: 0000000000000000 RBX: ffff923fe5a32000 RCX: ffff923fe49b03c0
[   71.445410] RDX: ffff923fe5a32000 RSI: ffff923fe0c81c30 RDI: ffff923fe5a32268
[   71.452553] RBP: ffffbaf1821cf870 R08: 0000000000000000 R09: 0000000000031fd0
[   71.459696] R10: 0000000000000002 R11: ffff923fc0d3f000 R12: ffff923fe5a32000

So as soon as VyOS detects the WIFI nic and creates it - boom
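
For anyone triaging a similar trace: the RIP line names the faulting function and, in square brackets, the module it lives in. Here is a minimal shell sketch for pulling both out of a saved log. The helper name oops_rip is made up for illustration, and the pattern assumes the x86-64 oops format shown in the trace above.

```shell
#!/bin/sh
# Hypothetical helper: print "<function> <module>" for each RIP line of a
# kernel oops read from stdin. Assumes the x86-64 format
#   RIP: 0010:<function>+0x<offset>/0x<size> [<module>]
# RIP lines without a [module] suffix are skipped.
oops_rip() {
    sed -n 's|.*RIP: [0-9a-f]*:\([A-Za-z0-9_.]*\)+0x[0-9a-f]*/0x[0-9a-f]* \[\([A-Za-z0-9_]*\)].*|\1 \2|p'
}

# Example using the RIP line from the trace above.
printf '%s\n' '[   71.407534] RIP: 0010:mt7921_ipv6_addr_change+0x37/0x1d0 [mt7921_common]' | oops_rip
```

With a full log saved to a file, something like `oops_rip < panic.log` (file name hypothetical) would list every faulting function that belongs to a module.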

Apachez · July 16, 2024, 11:13pm

A longshot but I wonder if this can be related?

GL-iNet switched to the closed-source MTK SDK for release 4.6.0 and onwards due to issues with the open-source Mediatek drivers. They also publish a separate build based on OpenWrt 24, which uses Linux kernel 6.6.x and the open-source driver, so perhaps there is some issue with Mediatek devices and Linux kernels older than 6.6.0?

https://dl.gl-inet.com/router/mt3000/

"
Downloads of Native OpenWrt 24 Firmware

Due to certain performance and compatibility issues with the open-source drivers for the model, firmware version 4.6.0 will utilize the MTK SDK to ensure a better user experience. If these issues are resolved in the future, we will revert to the Native OpenWrt version with the open-source driver. For customers preferring the open-source driver, we will provide a synchronized Native OpenWrt version labeled 4.x.x-opxx, based on the OpenWrt main branch with kernel version 6.6.x. The MTK SDK will be used for their 4.x version. We will continue to address bugs in the open-source version and will make it the main line if it eventually outperforms the closed-source driver.
"

c-po · July 17, 2024, 5:58am

As a workaround you can boot with the following kernel command-line option: module_blacklist=mt7921e

This disables the driver, so the device is never created.
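
To confirm after booting that the option actually made it onto the kernel command line, a check along these lines may help. This is only a sketch: the function name is_module_blacklisted is invented here, and it relies on module_blacklist= taking a comma-separated list of module names.

```shell
#!/bin/sh
# Hypothetical helper: succeed if module $1 is listed in a module_blacklist=
# parameter of the kernel command line passed as $2. If $2 is omitted, the
# command line of the running kernel (/proc/cmdline) is used.
is_module_blacklisted() {
    module=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    for arg in $cmdline; do
        case $arg in
            module_blacklist=*)
                list=${arg#module_blacklist=}
                old_ifs=$IFS
                IFS=,
                # The parameter value is a comma-separated list of modules.
                for m in $list; do
                    if [ "$m" = "$module" ]; then
                        IFS=$old_ifs
                        return 0
                    fi
                done
                IFS=$old_ifs
                ;;
        esac
    done
    return 1
}

# Example with an illustrative command line (not taken from a real system).
if is_module_blacklisted mt7921e 'BOOT_IMAGE=/boot/vmlinuz module_blacklist=mt7921e'; then
    echo "mt7921e is blacklisted"
fi
```

On a live system, calling `is_module_blacklisted mt7921e` without the second argument checks the running kernel instead.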

c-po · July 17, 2024, 6:45am

The faulty driver was added via T6293 (“add Mediatek MT7921 to defconfig”) and is about to be removed until we upgrade to the next kernel LTS version (December 2024).

github.com/vyos/vyos-build

T6584: Revert "T6293: add Mediatek MT7921 to defconfig"

vyos:current ← c-po:kernel-changes

opened 06:49AM - 17 Jul 24 UTC

c-po

+4 -4

## Change Summary

The Wifi card does not boot up properly on the 6.6 Kernel series and causes panics. It looks like this is fixed in Kernel 6.8, thus it should be re-enabled once we upgrade to the 2024 Kernel LTS version.

## Types of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Code style update (formatting, renaming)
- [ ] Refactoring (no functional changes)
- [ ] Migration from an old Vyatta component to vyos-1x, please link to related PR inside obsoleted component
- [x] Other (please describe): revert changes

## Related Task(s)

* https://vyos.dev/T6293
* https://vyos.dev/T6584

## Related PR(s)

* https://github.com/vyos/vyos-build/pull/584/commits

## Component(s) name

Kernel

## Proposed changes

## How to test

Boot system with MT7921 m.2 WIFI NIC

## Checklist:

- [x] I have read the [**CONTRIBUTING**](https://github.com/vyos/vyos-build/blob/current/CONTRIBUTING.md) document
- [x] I have linked this PR to one or more Phabricator Task(s)
- [x] My commit headlines contain a valid Task id
- [ ] My change requires a change to the documentation
- [ ] I have updated the documentation accordingly

alain · July 22, 2024, 8:46pm

Could it be that a certain vendor:device of mt7922 is bugged? I have two mt7922 devices that seem to run fine:

[14c3:7922] on a Debian Bookworm with Kernel 6.1.0-23-amd64
[14c3:0616] which is an mt7922 branded as AMD RZ616 on Debian Trixie with Kernel 6.9.9-amd64

What are your vendor:device IDs?
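
In case it helps, `lspci -nn` prints the numeric IDs next to the names, and `lspci -nn -d 14c3:` limits the output to the MediaTek vendor ID. A small sketch for extracting just the IDs follows. The helper name mediatek_pci_ids is made up, and the sample line only illustrates the `lspci -nn` format, with the bus address invented and the ID borrowed from the RZ616 above.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -nn` output on stdin and print the
# vendor:device ID of every line that mentions MediaTek. The class code,
# e.g. [0280], has no colon and is therefore not matched.
mediatek_pci_ids() {
    grep -i 'mediatek' | sed -n 's|.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)].*|\1|p'
}

# On a real machine: lspci -nn | mediatek_pci_ids
# Illustrative input line in `lspci -nn` format:
printf '%s\n' '01:00.0 Network controller [0280]: MEDIATEK Corp. MT7922 802.11ax PCI Express Wireless Network Adapter [14c3:0616]' | mediatek_pci_ids
```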

Apachez · July 22, 2024, 9:39pm

Slightly off-topic but is there something that can be done to the design/config of VyOS to avoid similar events in future?

One option would be an additional GRUB entry with “safe” settings, but that will of course fail if your mgmt-interface happens to use one of the bad drivers (and you need console access to select that entry when things go south anyway).

A partial solution would be a config option such as “set system boot module_blacklist”, but for that to work you must have a kernel that boots (perhaps along with the “safe” boot option above).

Another thought: would it be possible to load NIC drivers late, with some “set system” option for a delayed start of, say, 60 seconds (as default, but configurable up to 900 seconds or whatever)?

I’m thinking that wifi and the like could be delayed this way, so if something like this occurs again in the future you at least have a box that works for up to, say, 15 minutes, during which you can log in and fix whatever the issue is before it crashes.

That is, when the box boots, only the mgmt-interface and the fixed interfaces using safe drivers work, and the rest are activated through that delayed start.

Also, is there a configurable watchdog, so that if VyOS hits a kernel panic it reboots by itself after, say, 60 seconds (handy if the box is at a remote location where you don’t have easy physical access to power-cycle it manually)?

c-po · August 13, 2024, 8:29pm

For an automatic reboot after a kernel panic there is: set system option reboot-on-panic

As for a “safe” boot option: well, who will decide which drivers to block and which not?

Delayed driver loading is not how config-load works. There is (and must be) a priority list of who comes first when booting a system, and many services depend on interfaces to render config files or configure routing. If the interface is not present we cannot continue, as e.g. routing will be broken.
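
On a generic Linux system, reboot-after-panic behaviour is controlled by the kernel.panic sysctl: 0 means the kernel stays halted, a positive value is the number of seconds before it reboots, and a negative value reboots immediately. Whether reboot-on-panic maps onto exactly this sysctl is an assumption here, not something confirmed in this thread. A small sketch for reading the current setting (the function name describe_panic_timeout is invented):

```shell
#!/bin/sh
# Hypothetical helper: describe what a given kernel.panic value means.
describe_panic_timeout() {
    value=$1
    if [ "$value" -eq 0 ]; then
        echo "no automatic reboot on panic"
    elif [ "$value" -lt 0 ]; then
        echo "reboot immediately on panic"
    else
        echo "reboot ${value}s after a panic"
    fi
}

# Report the setting of the running system, if the sysctl is readable.
if [ -r /proc/sys/kernel/panic ]; then
    describe_panic_timeout "$(cat /proc/sys/kernel/panic)"
fi
```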

Apachez · August 13, 2024, 10:01pm

With “set system boot module_blacklist” the admin could decide which drivers to block and which not, and that setting would survive updates, which the current method of manually altering conf files in bash mode doesn’t.

The delayed start could either be a global setting (defaulting to 0 seconds), or alternatively interfaces could use a dummy driver during boot and only be switched to the correct driver once everything else is done (although a global delay would be simpler).

c-po · August 30, 2024, 6:04am

That will not work: you will not have a working CLI, because you will never reach it. You would need a custom ISO with a custom config, or to boot with custom kernel options. All that adds more pain than simply waiting a bit and upgrading to a more recent Linux kernel that fixes the issue.