Random reboots on Westmere platform?

vyos
hardware
vyos-1-2

#1

Hey Guys,

I've been facing a really annoying issue for quite some time.
Until around 2 months ago I used two Dell R610s, each with two Intel Xeon X5670 CPUs.
Each server was equipped with 24 GB of DDR3 ECC RAM and an Intel X520-DA dual 10G NIC.

They both did a great job and were able to route 8-10G of traffic (depending on the type/pps) just fine.
However, during the whole time I used them, I was facing one issue.

They rebooted just randomly. It started “slowly” and occurred just once a week or so.
As they both were redundant and I was really busy, I ignored that behavior.
But it got quite a bit worse… It started to occur 2-3 times a week and finally nearly every 24 hours.

I was not able to debug the issue because the error log was simply lost after each random reboot.

As a redesign of the network had already been on my todo list for quite some time, I started with a completely new install. This time with 1.2 (as I was running 1.1.8 before).
And who would have guessed it, the issue was solved. No random reboots anymore. Or at least that's what I thought.
After around 3-4 months it started again… The server randomly restarted, and exactly as before: first weekly, then in the end daily.

In the end, the whole project was moved to new hardware in a new location.
I was thinking that the issue might be caused by some broadcast storm (since one of the connected networks is a really big L2 network) which somehow resulted in a kernel panic.

The new hardware in the new location works just perfectly.

However… I also have another server in a colocation. It's an HP ProLiant G6, also running on a single Intel Xeon X5670. There I have had exactly the same issue over the same period. It was also running 1.1.8 first and later switched to 1.2. This machine just uses the onboard Intel 1G NICs and routes about 50-70 Mbit/s. The connected networks are much smaller (7-10 machines), so my theory seems to be wrong…

Could it be an incompatibility with the CPU?
I mean, that is basically the only similarity the machines have.

Or does anyone of you have an idea how I could debug this issue?

I really hope someone can help me here.


#2

What about remote syslog? Anything you can see before it reboots? Temperature an issue at the location?
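For reference, a remote syslog target on VyOS can be set up roughly like this (a sketch; the collector address 192.0.2.10 is a placeholder, and facility/level are just one sensible choice):

```shell
# Hypothetical VyOS CLI session; 192.0.2.10 stands in for your log collector
configure
set system syslog host 192.0.2.10 facility all level info
commit
save
exit
```

With that in place, kernel messages survive the reboot on the remote box even though the local log is lost.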


#3

I don't see a point in debugging hardware which is not supported by its manufacturer; it's just a waste of time.


#4

@syncer
You're right, it is an old platform. However, these machines are still widespread.
And if you think about a use case such as an office router, I would say they are still perfect for it.
Cheap to get and more than enough power for that use case.

@hagbard
Thanks for your tip. Rsyslog did help and gave me some logs.

2018-11-30 02:19:44 	Emergency (0) 	KERNEL [ 6582.882260] Kernel panic - not syncing: Fatal exception in interrupt
2018-11-30 02:19:44 	Warning (4) 	KERNEL [ 6582.796768] CR2: 00007f1695f8c000 CR3: 0000000001c09002 CR4: 00000000000606e0
2018-11-30 02:19:44 	Warning (4) 	KERNEL [ 6582.727922] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2018-11-30 02:19:44 	Warning (4) 	KERNEL [ 6582.630993] FS: 0000000000000000(0000) GS:ffff88046f9c0000(0000) knlGS:0000000000000000
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6582.545507] R13: ffff880462f1f001 R14: ffffc90001b37c1c R15: ffff880462f1f000
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6582.460016] R10: ffffc90001b37b90 R11: ffff88045a9d6600 R12: ffff8804660a5a00
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6582.374528] RBP: dead000000000100 R08: 0000000000000001 R09: 0000000000000000
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6582.283841] RDX: ffffe8ffffbc6008 RSI: 00000000fffffe01 RDI: ffffffff8143fe4b
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6582.198352] RAX: 0000000000000000 RBX: dead000000000100 RCX: 0000000000000000
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6582.135746] RSP: 0018:ffffc90001b37bc0 EFLAGS: 00010286
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6581.910896] Code: 41 57 41 56 49 89 f7 41 55 41 54 49 89 d4 55 48 8d 96 90 00 00 00 53 41 50 48 89 fb 49 89 ce 31 c0 48 89 14 24 48 85 db 74 73 <48> 8b 2b 48 c7 03 00 00 00 00 48 8b 05 9a a8 86 00 48 85 ed 41 0f
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6581.852461] RIP: 0010:dev_hard_start_xmit+0x2a/0xaf
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6581.667965] ---[ end trace 89ac819e7778eae7 ]---
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6581.531499] ahci isci i2c_i801 libahci e1000e libsas igb ixgbe scsi_transport_sas i2c_algo_bit dca mdio ptp pps_core i2c_core
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.687540] Modules linked in: binfmt_misc nfnetlink_log xt_NFLOG vxlan ip6_udp_tunnel udp_tunnel 8021q garp mrp stp llc ip_set xt_TCPMSS xt_comment iptable_mangle iptable_nat nf_nat_ipv4 ip6table_mangle ip6table_filter ip6table_raw ip6_tables iptable_filter xt_CT nfnetlink_cthelper nfnetlink iptable_raw nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_h323 nf_conntrack_h323 nf_nat_sip nf_conntrack_sip nf_nat_proto_gre nf_nat_tftp nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c fuse x86_pkg_temp_thermal coretemp ghash_clmulni_intel pcbc iTCO_wdt aesni_intel evdev aes_x86_64 crypto_simd cryptd glue_helper lpc_ich pcspkr serio_raw mfd_core pcc_cpufreq ioatdma button ipv6 autofs4 usb_storage ohci_hcd loop raid1 md_mod crc32c_intel
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.644696] ret_from_fork+0x35/0x40
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.600806] ? kthread_stop+0x49/0x49
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.563158] kthread+0xf8/0x100
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.521350] ? sort_range+0x17/0x17
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.472263] smpboot_thread_fn+0x164/0x17f
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.429414] run_ksoftirqd+0x17/0x26
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.386568] __do_softirq+0xdc/0x1e4
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.342678] net_rx_action+0xe6/0x279
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.299829] gro_cell_poll+0x4f/0x69
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.253861] napi_gro_receive+0x27/0x7a
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.206853] dev_gro_receive+0x4a2/0x56e
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.150484] netif_receive_skb_internal+0x58/0xc6
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.092035] __netif_receive_skb_one_core+0x4d/0x69
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.042950] ? ip_check_defrag+0x1b1/0x1b1
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6580.001138] ip_forward+0x3aa/0x3ba
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.952053] ? ip_finish_output+0x37/0x130
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.902961] ip_finish_output2+0x275/0x310
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.851800] ? ip_finish_output2+0x275/0x310
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.809992] ? eth_header+0x24/0xac
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.749464] ? nf_nat_ipv4_out+0xf/0x8e [nf_nat_ipv4]
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.701415] __dev_queue_xmit+0x4af/0x5e4
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.672083] Call Trace:
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.586594] CR2: 00007f1695f8c000 CR3: 0000000001c09002 CR4: 00000000000606e0
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.517746] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.420819] FS: 0000000000000000(0000) GS:ffff88046f9c0000(0000) knlGS:0000000000000000
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.335329] R13: ffff880462f1f001 R14: ffffc90001b37c1c R15: ffff880462f1f000
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.249841] R10: ffffc90001b37b90 R11: ffff88045a9d6600 R12: ffff8804660a5a00
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.164354] RBP: dead000000000100 R08: 0000000000000001 R09: 0000000000000000
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6579.078867] RDX: ffffe8ffffbc6008 RSI: 00000000fffffe01 RDI: ffffffff8143fe4b
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6578.993379] RAX: 0000000000000000 RBX: dead000000000100 RCX: 0000000000000000
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6578.930770] RSP: 0018:ffffc90001b37bc0 EFLAGS: 00010286
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6578.705923] Code: 41 57 41 56 49 89 f7 41 55 41 54 49 89 d4 55 48 8d 96 90 00 00 00 53 41 50 48 89 fb 49 89 ce 31 c0 48 89 14 24 48 85 db 74 73 <48> 8b 2b 48 c7 03 00 00 00 00 48 8b 05 9a a8 86 00 48 85 ed 41 0f
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6578.647478] RIP: 0010:dev_hard_start_xmit+0x2a/0xaf
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6578.564067] Hardware name: Supermicro X9SRL-F/X9SRL-F, BIOS 3.0a 12/05/2013
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6578.477540] CPU: 7 PID: 46 Comm: ksoftirqd/7 Not tainted 4.19.0-amd64-vyos #32
2018-11-30 02:19:43 	Warning (4) 	KERNEL [ 6578.419126] general protection fault: 0000 [#1] SMP

So, it seems like I guessed right about the kernel panic.
However, I don't really understand much of the logs.

Do you guys have any idea about the reason?


#5

Seems like an old but forgotten thing?


#6

Hey,

A small update about the issue.
After updating to RC9, the system seems to be more stable. It is now day 4 without any reboots.
Maybe the removal of the Spectre stuff is the reason for that?


#7

That could be the case, since the logs show problems with syncing threads. The problem in this entire case is the CPU vendor, who won’t hand out patches for those older CPUs and expects that the OS vendors fix their problems without telling them the specifics. There are still years to come where those problems will be present.


#8

I know some sec folks will kill me for this question,
but do the Spectre mitigation patches even make sense on VyOS systems?

I mean, all Spectre/Meltdown exploits I know of need the ability to run code on the system.
As VyOS does not have any “access levels” anymore, everyone who can run code on the system already has root access.
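As an aside, you can check which mitigations the running kernel actually applies: kernels since roughly 4.15 export the status via sysfs (output varies per CPU and kernel version):

```shell
# List the speculative-execution vulnerability status as seen by the kernel
# (the directory exists on kernels >= ~4.15; skipped silently otherwise)
if [ -d /sys/devices/system/cpu/vulnerabilities ]; then
    grep . /sys/devices/system/cpu/vulnerabilities/*
fi
```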


#9

There are things you don't want to leave your system by accident; root access has nothing to do with it. Surely, the issue for this OS is not as critical as it sounds, but it's still quite an issue.


#10

How much is your time worth? Why not get “more” and try the software on a second box to be sure it’s not the hardware?


#11

You're right. It would cost too much time to fix, whatever the bug is, just for me.

But it isn't just me. The R610/R710 models are still in active use. I know many datacenters where they are still present, we work with many companies that use them, and have you ever taken a look at the typical “home lab” forums? It feels like there isn't a homelab without one of these.


#12

I'm not talking about a software bug. What if the problem is with your hardware? Bad memory, a bad northbridge, a bad cache line on the processor, an issue with the network, etc.

Have you looked at the sources where the problem is? What does it look like is happening?


#13

You didn't read my first post fully (which I can fully understand, as it is a lot of text and off-topic stuff).

I have the same issue on 3 different servers (different hardware) in 2 different datacenters.
The chance that it is a hardware issue is really small.

What does it look like is happening?

I posted the log above. It's a kernel panic.


#14

You need to install the debug version of the kernel and store crash dumps, so you can replay them and see what exactly is causing the issue.
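A rough sketch of that workflow on a Debian-based system (package and path names are assumptions and may differ on VyOS images):

```shell
# Install kexec/kdump tooling and the crash analyzer (Debian package names assumed)
apt-get install kdump-tools crash

# Reserve memory for the capture kernel: add e.g. crashkernel=256M to the
# kernel command line, then reboot so the reservation takes effect.

# After the next panic, the capture kernel writes a dump under /var/crash.
# Open it together with the debug-symbol vmlinux (path is an assumption):
crash /usr/lib/debug/boot/vmlinux-$(uname -r) /var/crash/<timestamp>/dump.<timestamp>
```

Inside `crash`, the `bt` command prints the backtrace of the panicking task, which should line up with the `dev_hard_start_xmit` trace in the log above.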


#15

Hmm… the issue really seems to be fixed since RC9.
It is now day 7 without any random reboots. Before, I was happy when I reached the 24-hour mark.
But I don't think it was the removed Spectre stuff, as the issue also occurred on 1.1.8.

However, thanks a lot to @hagbard for the offered help and the tips on getting more details about the issue!

