Ping spikes/packet loss on VyOS 1.3.3

Hi everyone,

I’ve recently deployed a few VyOS instances running on Dell PowerEdge R430 servers:
CPU: 2x Xeon E5-2637 v3
Memory: 32 GB DDR4
NICs: 2x Intel X520-DA2

VyOS is running directly on the bare metal, no virtualization layer.

When pinging just about anything to/from one of the instances, I get random ping spikes and about 1% packet loss.
When I check pcaps on the target I’m pinging, it responds normally with very low latency.
But a pcap on the box with issues shows the ICMP reply arriving “late”.
The other instance, with exactly the same hardware, has no issues.
The ping spikes occur at no specific interval.
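
For reference, this is roughly how I compared the two sides (ethx and the target address are placeholders):

# on the VyOS box: capture ICMP to/from the ping target
tcpdump -ni ethx icmp and host 1.1.1.1

# on the target (or a mirror port), run the same capture and
# compare request/reply timestamps between the two pcaps
tcpdump -ni ethx icmp and host 1.1.1.1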

The other day I rebooted the VyOS instance and the ping spikes and packet loss were gone, but they resumed today.
Any ideas as to what the issue could be?

The box with the issues is running 73 BGP sessions.

RIB entries 1674629, using 307 MiB of memory
Peers 73, using 1555 KiB of memory
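
For context, those counters come from the BGP summary in op mode:

show ip bgp summary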

Ping results towards 1.1.1.1 (direct peering with Cloudflare):

64 bytes from 1.1.1.1: icmp_seq=364 ttl=63 time=0.119 ms
64 bytes from 1.1.1.1: icmp_seq=365 ttl=63 time=0.197 ms
64 bytes from 1.1.1.1: icmp_seq=366 ttl=63 time=0.155 ms
64 bytes from 1.1.1.1: icmp_seq=367 ttl=63 time=0.138 ms
64 bytes from 1.1.1.1: icmp_seq=368 ttl=63 time=349 ms
64 bytes from 1.1.1.1: icmp_seq=369 ttl=63 time=0.179 ms
64 bytes from 1.1.1.1: icmp_seq=370 ttl=63 time=0.150 ms
64 bytes from 1.1.1.1: icmp_seq=371 ttl=63 time=0.210 ms
64 bytes from 1.1.1.1: icmp_seq=372 ttl=63 time=0.159 ms

It could be related to Intel Hyper-Threading; try disabling it in the BIOS. That should improve the ping spikes.

Hi @fernando,

Thanks for the fast response. Since I made the post I’ve tried increasing the ring buffers with:

ethtool -G ethx rx 4096 tx 4096

Seems to work, but I will test further. If there are still issues, I will try disabling Hyper-Threading.
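
For reference, you can confirm the pre-set maximums and the current ring sizes with (ethx is a placeholder):

# shows "Pre-set maximums" vs "Current hardware settings"
ethtool -g ethx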

Great, it can be configured in the VyOS CLI:

set interfaces ethernet ethx ring-buffer rx 4096 
set interfaces ethernet ethx ring-buffer tx 4096 

so our NOS keeps this change persistent.
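
For completeness, a full session in configure mode would look something like this (ethx is a placeholder):

configure
set interfaces ethernet ethx ring-buffer rx 4096
set interfaces ethernet ethx ring-buffer tx 4096
commit
save
exit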

Thank you, but the ping spikes are now back again… I fail to see how it could be Hyper-Threading, as the box works fine for 12 hours or so after a reboot. I guess the reason it worked yesterday was that the interface was restarted when I applied the ring-buffer change. So it seems to be tied to the uptime of the interface.
When running

ethtool -S ethx

the rx_missed_errors counter is at 1.5 million. That seems like a problem, but so far my research has only led me to the ring-buffer size, which is already at its maximum.
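
In case anyone wants to check the same thing, this is roughly what I’m looking at; as far as I understand, rx_missed_errors on these Intel NICs means the host isn’t draining the RX ring fast enough (ethx is a placeholder):

# watch the missed/dropped counters grow in real time
watch -d "ethtool -S ethx | grep -iE 'missed|dropp'"

# check how the NIC's RX interrupts are spread across CPU cores
grep ethx /proc/interrupts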

I would guess missed packets (RX or TX) wouldn’t result in higher ping times, but in ping timeouts.
Do you have the same problem on an internal interface?
Internally, mirror the switch port so you can see on a sniffer whether the outgoing packet is already delayed by 300 ms.

Even if you are directly peering with Cloudflare, the actual box replying for 1.1.1.1 might be some distance away.

Do you get the same spikes if you ping the physical next hop of your links, for all interfaces?

This way you could figure out if this is something involving this box or if the issue is elsewhere.
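
For example, something like this against each directly connected next hop (the address is a placeholder):

# a longer run at a faster interval makes the spikes easier to catch
ping -c 100 -i 0.2 192.0.2.1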

If you get the same behaviour no matter which interface’s next hop you ping, then I assume you have tested these settings?

https://docs.vyos.io/en/equuleus/configuration/system/option.html#performance

set system option performance throughput

or

set system option performance latency
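
If I remember correctly, these options apply a TuneD profile under the hood, so you can check what ends up active with:

tuned-adm active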

Yes, same issue on all interfaces. I’ve tried mirroring the port on the switch and there’s no delay there. The delay happens when the VyOS box receives the packet.

Same issue even if I ping the next-hop address, no matter what interface it is. 🙂
I haven’t tried the link you sent, but I will take a look at it. In my opinion, though, ping spikes upwards of 1000 ms shouldn’t be acceptable no matter what performance profile we’re running.

I will try disabling Hyper-Threading; if that doesn’t work, I will try applying the performance profile.

I think when you disable HT it should work as expected. Let us know the result.
