Hey Guys,
I have been facing a really annoying issue for quite some time.
Until around 2 months ago I used two Dell R610s, each with two Intel Xeon X5670 CPUs.
Each server was equipped with 24 GB of DDR3 ECC RAM and an Intel X520-DA dual 10G NIC.
They both did a great job and were able to route 8-10 Gbit/s of traffic (depending on the type/pps) just fine.
However, during the whole time I used them, I was facing one issue.
They rebooted randomly. It started “slowly” and occurred just once a week or so.
As the two were redundant and I was really busy, I ignored that behavior.
But it got quite a bit worse… It started to occur 2-3 times a week and finally nearly every 24 hours.
I was not able to debug the issue because the error log was lost after each random reboot.
As a redesign of the network had already been on my to-do list for quite some time, I started with a completely new install, this time with 1.2 (I was running 1.1.8 before).
And who would have guessed, the issue was solved. No more random reboots. Or at least that is what I thought.
After around 3-4 months it started again… The servers restarted randomly, and exactly as before: first weekly, then in the end daily.
In the end, the whole project moved to new hardware in a new location.
I thought the issue might have been caused by a broadcast storm (one of the connected networks was a really big L2 network) which somehow resulted in a kernel panic.
The new hardware in the new location works perfectly.
However… I also have another server in a colocation. It's an HP ProLiant G6, also running on an Intel Xeon X5670 (a single one in this case). There I have had exactly the same issue over the same period. It also ran 1.1.8 first and was later switched to 1.2. This machine just uses the onboard Intel 1G NICs and routes about 50-70 Mbit/s. The connected networks are much smaller (7-10 machines), so my broadcast storm theory seems to be wrong…
Could it be an incompatibility with the CPU?
I mean, that is basically the only thing the machines have in common.
Or does anyone have an idea how I can debug this issue?
I really hope someone can help me here.