Bad page state in swapper

Hi,

After reporting that the previous error was resolved by updating to a newer rolling release, I now see different “bad page state” messages.

The complete message reads:

Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542563] Modules linked in: ip_set xt_comment fuse nft_chain_nat_ipv4 nf_nat_ipv4 nft_chain_nat_ipv6 nf_nat_ipv6 nft_chain_route_ipv6 xt_CT xt_tcpudp nft_compat nfnetlink_cthelper nft_counter nf_tables nfnetlink nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_h323 nf_conntrack_h323 nf_nat_sip nf_conntrack_sip nf_nat_proto_gre nf_nat_tftp nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c sha512_ssse3 sha512_generic skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel qat_c62x(O) aes_x86_64 crypto_simd cryptd intel_qat(O) glue_helper intel_cstate mei_me intel_uncore dh_generic uio authenc mei intel_rapl_perf evdev efi_pstore pcspkr efivars iTCO_wdt
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542587]  iTCO_vendor_support pcc_cpufreq ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad button mpls_iptunnel mpls_router ip_tunnel mpls_gso br_netfilter bridge stp llc efivarfs ip_tables x_tables autofs4 usb_storage ohci_hcd uhci_hcd ehci_hcd squashfs zstd_decompress xxhash loop overlay ext4 crc32c_generic crc16 mbcache jbd2 nls_cp437 vfat fat hid_generic usbhid hid nls_ascii sd_mod ahci libahci crc32c_intel i40e(O) libata igb(O) xhci_pci i2c_i801 scsi_mod xhci_hcd
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542606] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G    B      O      4.19.161-amd64-vyos #1
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542608] Hardware name: Supermicro SYS-5019D-FN8TP/X11SDV-8C-TP8F, BIOS 1.0c 11/08/2018
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542609] Call Trace:
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542610]  <IRQ>
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542614]  dump_stack+0x66/0x90
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542617]  bad_page.cold.129+0x7f/0xb2
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542620]  get_page_from_freelist+0x9d8/0xe40
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542624]  ? arp_process+0x27d/0x7e0
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542628]  __alloc_pages_nodemask+0xe2/0x200
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542641]  i40e_alloc_rx_buffers+0x11f/0x270 [i40e]
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542653]  i40e_napi_poll+0x9c9/0x1460 [i40e]
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542658]  net_rx_action+0xf6/0x2c0
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542663]  __do_softirq+0xc6/0x218
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542667]  irq_exit+0xb5/0xc0
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542671]  do_IRQ+0x72/0xd0
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542674]  common_interrupt+0xf/0xf
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542675]  </IRQ>
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542679] RIP: 0010:cpuidle_enter_state+0x12f/0x1d0
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542681] Code: 89 c3 e8 94 d2 c0 ff 45 84 ff 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 86 00 00 00 31 ff e8 f8 8c c5 ff fb 66 0f 1f 44 00 00 <48> b8 ff ff ff ff f3 01 00 00 48 2b 1c 24 ba ff ff ff 7f 48 89 d9
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542682] RSP: 0018:ffffacd14011fe98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd5
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542685] RAX: ffff9aba2fc5ff00 RBX: 00014067ff1df36a RCX: 000000000000001f
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542686] RDX: 00014067ff1df36a RSI: 0000000037c86f51 RDI: 0000000000000000
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542687] RBP: ffff9aba2fc68148 R08: 0000000000000002 R09: 000000000001f7c0
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542689] R10: 000d462a514f21bc R11: ffff9aba2fc5f068 R12: 0000000000000003
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542690] R13: ffffffff87a49ab8 R14: 0000000000000003 R15: 0000000000000000
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542694]  ? cpuidle_enter_state+0x10c/0x1d0
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542698]  do_idle+0x213/0x250
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542701]  cpu_startup_entry+0x6a/0x70
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542703]  start_secondary+0x1a4/0x200
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542706]  secondary_startup_64+0xa4/0xb0
Dec  8 10:02:14 rtr-nbd-2 kernel: [352293.542709] BUG: Bad page state in process swapper/1  pfn:43af63
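If this keeps recurring, it may help to pull the key facts out of each report automatically. Below is a minimal Python sketch that assumes the syslog format pasted above. It extracts the process name and page frame number (pfn) from the "BUG: Bad page state" line and lists the loadable modules (the "[name]" suffix) that show up in call-trace frames. The function name summarize_bad_page is hypothetical. The physical address calculation assumes 4 KiB pages, the x86-64 default.

```python
import re

# Summary line, e.g. "BUG: Bad page state in process swapper/1  pfn:43af63"
BUG_RE = re.compile(r"BUG: Bad page state in process (\S+)\s+pfn:([0-9a-f]+)")
# Call-trace frame inside a loadable module, e.g. "i40e_napi_poll+0x9c9/0x1460 [i40e]"
FRAME_RE = re.compile(r"([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]\s*$")

PAGE_SHIFT = 12  # assumption: 4 KiB pages (x86-64 default)


def summarize_bad_page(log_text):
    """Return (process, pfn, phys_addr, modules) from one 'Bad page state' report."""
    process = pfn = phys_addr = None
    modules = []
    for line in log_text.splitlines():
        bug = BUG_RE.search(line)
        if bug:
            process = bug.group(1)
            pfn = int(bug.group(2), 16)
            phys_addr = pfn << PAGE_SHIFT
            continue
        frame = FRAME_RE.search(line)
        # Keep each module name once, in the order it appears in the trace
        if frame and frame.group(2) not in modules:
            modules.append(frame.group(2))
    return process, pfn, phys_addr, modules
```

For the report above, this should yield swapper/1, pfn 0x43af63 and ['i40e']. That fits the observation that the trace runs through the i40e receive path (i40e_napi_poll, i40e_alloc_rx_buffers). The "Modules linked in" lines are ignored because they carry no "+0x…/0x…" offsets.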

It references the i40e driver. Was there an update to the driver in the last few months that could explain these messages?

Other than these messages, the system seems to operate fine.

Thanks,
Ton

Have you tried increasing your ring buffer size (ethtool -(g|G) …)?

@hagbard it is already possible to define ring buffers via the CLI:

set interfaces ethernet ethX ring-buffer rx 4096
set interfaces ethernet ethX ring-buffer tx 4096

@netbase can you tell us how many of your interfaces use the i40e driver, and provide the output of the following for each of them:

show interfaces ethernet ethX physical 

As I remember, some X710/XL710-based NICs require firmware that is compatible with the driver version.
The output of the following might also be helpful:

sudo dmesg | grep i40e

I’ll try the ring buffer options.

The interface info is:

$ show interfaces ethernet eth7 physical
Settings for eth7:
Supported ports: [ FIBRE ]
Supported link modes: 10000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: 10000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
Supports Wake-on: g
Wake-on: g
Current message level: 0x0000000f (15)
                       drv probe link timer
Link detected: yes
driver: i40e
version: 2.13.10
firmware-version: 3.33 0x80001006 1.1747.0
expansion-rom-version:
bus-info: 0000:b7:00.3
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
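With several i40e interfaces, collecting the driver and firmware fields by hand gets tedious. Below is a minimal Python sketch that turns "key: value" output like the block above into a dictionary. It assumes the format exactly as pasted here. parse_ethtool_fields is a hypothetical helper, not part of VyOS. Lines without a colon, such as the "drv probe link timer" continuation, are simply skipped.

```python
def parse_ethtool_fields(text):
    """Parse 'key: value' lines (as printed by ethtool / ethtool -i) into a dict.

    Only the first colon splits key from value, so values such as
    bus-info '0000:b7:00.3' stay intact. Lines without a colon are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        fields[key.strip()] = value.strip()
    return fields
```

Running it over each interface's output and printing fields["driver"], fields["version"] and fields["firmware-version"] gives a compact per-interface summary to paste into the thread.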

I have two identical machines; the other one is running an older rolling release.
I checked the interface on that machine as well, and the driver version is different there.
That older release doesn’t generate these messages.

It has the following driver info:

driver: i40e
version: 2.11.29
firmware-version: 3.33 0x80001006 1.1747.0
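To make the comparison between the two machines explicit, here is a small Python sketch that uses the values reported in this thread. diff_fields and version_tuple are hypothetical helper names. Versions are compared as integer tuples, not as strings, because string comparison misorders versions such as "2.9.0" and "2.13.10".

```python
def version_tuple(version):
    """Turn a dotted version like '2.13.10' into (2, 13, 10) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def diff_fields(old, new, keys=("driver", "version", "firmware-version")):
    """Return {key: (old_value, new_value)} for the keys whose values differ."""
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Values as reported in this thread for the two otherwise identical machines
older = {"driver": "i40e", "version": "2.11.29", "firmware-version": "3.33 0x80001006 1.1747.0"}
newer = {"driver": "i40e", "version": "2.13.10", "firmware-version": "3.33 0x80001006 1.1747.0"}

print(diff_fields(older, newer))  # {'version': ('2.11.29', '2.13.10')}
```

Only the driver version differs, and the firmware is identical on both boxes. This fits the idea that the newer driver, paired with the same firmware, is the variable worth looking at.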

I’ll report back on the effect of the ring buffer changes. This might take a while: last time, the messages only started after a couple of days.

Regards,

Ton

What’s the output of 'ethtool -g '?
When the page error happens, is only a single connection reset, or is there a full link reset?
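The 'ethtool -g' output lists the pre-set maximum ring sizes and the current hardware settings. Comparing the two shows whether there is room to grow the rings. Below is a minimal Python sketch that parses that output. It assumes the classic section headers "Pre-set maximums:" and "Current hardware settings:". Newer ethtool releases print extra fields; non-numeric ones are ignored here. parse_ring_params and can_grow are hypothetical helpers.

```python
def parse_ring_params(text):
    """Parse `ethtool -g` output into {'max': {...}, 'current': {...}}."""
    sections = {"Pre-set maximums": "max", "Current hardware settings": "current"}
    result = {"max": {}, "current": {}}
    section = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key in sections:
            section = sections[key]  # switch to the section this header starts
        elif section and value.isdigit():
            result[section][key] = int(value)  # skips values like 'n/a' or 'off'
    return result


def can_grow(params, ring):
    """True if the current size of `ring` ('RX' or 'TX') is below its maximum."""
    return params["current"].get(ring, 0) < params["max"].get(ring, 0)
```

If can_grow(params, "RX") is False, the RX ring is already at its maximum, and raising the ring-buffer setting will not change anything.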

Yeah, the worst-case scenario is compiling and installing the driver manually.

I think the firmware version is not compatible with the driver version.
You can find a bit more information in the DPDK docs: http://doc.dpdk.org/guides/nics/i40e.html#recommended-matching-list
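To check a driver/firmware pair against a matching list like the linked one, here is a small Python sketch. It assumes the first field of the firmware-version string ("3.33" in "3.33 0x80001006 1.1747.0") is the NVM version that such tables refer to. The table itself must be filled in by hand from the documentation, and no real entries are included here. firmware_listed and the matching_list layout (driver version mapped to a set of NVM versions) are hypothetical.

```python
def nvm_field(firmware_version):
    """Return the first whitespace-separated field, e.g. '3.33' from '3.33 0x80001006 1.1747.0'.

    Assumption: this field is the NVM version used in matching tables.
    """
    return firmware_version.split()[0]


def firmware_listed(driver_version, firmware_version, matching_list):
    """Check a (driver, firmware) pair against a hand-filled matching list.

    matching_list maps a kernel driver version (e.g. '2.13.10') to the set
    of NVM versions listed for it. Returns None when the driver version is
    not in the table: that means "unknown", not "incompatible".
    """
    listed = matching_list.get(driver_version)
    if listed is None:
        return None
    return nvm_field(firmware_version) in listed
```

Returning None for unlisted driver versions keeps "not in the table" separate from "listed with a different firmware". This matters here, because an out-of-tree driver version may simply not appear in the DPDK list.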