Packet drops happening again

Just some more testing I did… I completely disabled NAT and the firewall, and told conntrack to ignore all protocols, so that rules all of them out.

Yup, I understand your topology/setup now. I don’t have much free time at the moment, but I’ll try to repro this over the next day or two, and then I think we’ll just log a new Phabricator ticket for it (it must be kernel related).

I installed a fresh copy of Debian 10 Buster and it’s running 4.19.0-8-amd64 Debian 4.19.98-1+deb10u1 (2020-04-27)

I did the test and there are no TX drops at all. Since I had no TX drops with Debian, CentOS, or pfSense, it does seem that VyOS is the only affected distro.

What would it take to update the kernel in rolling release?

FWIW, I ran the same with a Physical Ubiquiti EdgeRouter and no TX drops.

So also interesting lab results:

I’ve been doing iperf3 tests and found that the TX drops show up during iperf3 tests as well as during the ping flood I was using before.

I was able to achieve 3.5Gb/sec forwarding through this VyOS box despite the TX dropped packets.
I dropped in a CentOS router to see the difference and I’m able to achieve full 10Gb/sec forwarded.

So as minuscule a problem as it may sound when I say that 3.75% of small packets below 222 bytes are being dropped, this affects things such as TCP ACKs and other short messages, and the net result is that I am losing 65% of my possible bandwidth due to this…
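
For reference, the 65% figure follows from the two iperf3 results quoted above (3.5 Gb/s through VyOS versus 10 Gb/s through CentOS). A quick sketch of the arithmetic, with the function and variable names being illustrative only:

```python
# Throughput figures quoted in this thread (Gb/s).
vyos_gbps = 3.5     # forwarded through the VyOS router
centos_gbps = 10.0  # forwarded through the CentOS router


def bandwidth_lost_pct(achieved, baseline):
    """Share of the baseline throughput that is not achieved, in percent."""
    return (baseline - achieved) / baseline * 100


print(f"{bandwidth_lost_pct(vyos_gbps, centos_gbps):.0f}% of possible bandwidth lost")
```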

I don’t know how to get a copy of VyOS 1.2.5 stable to test with; I seem to only have 1.1.8 and 1.3-rolling.

I am going to try the iperf and ping tests against 1.1.8 again.

I would really love to use vyos in my environment but I can’t honestly go to production running like this.

So this morning I built the following two Rolling Routers, the first with a leg into my LAN.

LAN - ([192.168.0.225/24]-ROLLING-[10.10.10.1/30])====VBOX-BRIDGE====([10.10.10.2/30]-ROLLING)

The first router is simply rolling booted up in LiveISO mode with a /24 on the first interface and a /30 on the second, no other config. The second is also LiveISO with only a /30 configured on its interface.

From my box 192.168.0.5, pinging 10.10.10.2 (so Router 1 is routing this for me)

[6:40:20] root :: micro  ➜  ~ » ping -f 10.10.10.2 -s 250 -c 10000
PING 10.10.10.2 (10.10.10.2) 250(278) bytes of data.

--- 10.10.10.2 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 925ms
rtt min/avg/max/mdev = 1.779/3.097/91.304/2.611 ms, pipe 5, ipg/ewma 3.189/3.766 ms
[6:40:55] root :: micro  ➜  ~ » ping -f 10.10.10.2 -s 56 -c 10000
PING 10.10.10.2 (10.10.10.2) 56(84) bytes of data.

--- 10.10.10.2 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 290ms
rtt min/avg/max/mdev = 1.697/2.941/26.901/1.576 ms, pipe 2, ipg/ewma 3.126/2.350 ms

I still can’t repro the packet loss I’m afraid, try as I might. I’ve tried many different small values between 56 and 300 for -s and still haven’t lost a packet yet.
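
In case it helps anyone repeat this, here is a rough sketch of how the size sweep could be scripted. It only builds the same `ping -f ... -s ... -c ...` command used above for each payload size and parses the loss figure from the summary line; the helper names, the step of 8 bytes, and the idea of running it via `subprocess` are my own assumptions, not something from the tests above.

```python
import re


def ping_flood_cmd(target, size, count=10000):
    """Build the flood ping shown above for a given payload size."""
    return ["ping", "-f", target, "-s", str(size), "-c", str(count)]


def parse_loss(summary_line):
    """Extract the packet loss percentage from ping's summary line."""
    m = re.search(r"([\d.]+)% packet loss", summary_line)
    if m is None:
        raise ValueError("no packet loss figure found")
    return float(m.group(1))


# Payload sizes to sweep; the range and step are arbitrary choices.
sizes = range(56, 301, 8)
commands = [ping_flood_cmd("10.10.10.2", s) for s in sizes]

# Summary line taken from the first run above.
line = "10000 packets transmitted, 10000 received, 0% packet loss, time 925ms"
print(parse_loss(line))
```

Each entry in `commands` can be passed to `subprocess.run(cmd, capture_output=True, text=True)` as root (flood ping needs it), and the line containing "packet loss" in the output fed to `parse_loss`.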

I wonder if it’s something to do with the offload settings that are being set - if you disable all offload settings on your NICs does that alter/fix the problem?
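
As a hedged sketch of what I mean by disabling offloads: the snippet below just prints `ethtool -K` command lines that turn the common offload features off. The feature keywords are standard `ethtool -K` ones, but which of them a given driver accepts varies, and the interface names are placeholders.

```python
# Offload features to switch off; standard `ethtool -K` keywords,
# but driver support for each one varies.
OFFLOADS = ["rx", "tx", "sg", "tso", "gso", "gro", "lro"]


def offload_off_cmd(iface, features=OFFLOADS):
    """Return the ethtool command that disables the listed offloads on iface."""
    cmd = ["ethtool", "-K", iface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd


for iface in ["eth0", "eth1"]:  # interface names are placeholders
    print(" ".join(offload_off_cmd(iface)))
```

Running `ethtool -k <iface>` (lowercase k) before and after shows whether the settings actually changed.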

I’m not in a position to fire up a Xen host, so I probably can’t help debug this problem any further I’m afraid.

OK, so if it is behaving differently under different hypervisors, that’s odd, but I guess it tells us exactly where the problem is! NIC drivers!

In Xen when you install VyOS it comes up using paravirtual interface drivers.

When you run ethtool you get no output, any attempt to change offload or ring buffer sizes is met with an “Operation not supported” error.

Someone on this thread had a problem like this involving VMware too?

Can I PM you with access to my Xen?

Just got the news about the 5/8 rolling release fixing a bunch of issues, but it did not fix this TX drop issue. Still occurring on 1.3-rolling-202005082150

3.75% packetloss of packets under 214 bytes thru vyos router on xen hypervisor

Just tested on 1.2.5 and cannot reproduce

I can reproduce with the steps mentioned by sonicbx on xcp-ng 8.x and recent rolling versions. I am happy to provide access to some VMs in order to help solving the issue.

Glad to hear it’s not just me!

I created my first phabricator task!

https://phabricator.vyos.net/T2505

Hi @sonicbx, can you check whether this behavior also occurs on Debian or a similar distribution on this hypervisor?

I did try making a Debian 10 VM with IP forwarding enabled, and it doesn’t have any problem.

Does it happen in HVM mode, in PV mode, or both?

Both HVM and PV …

Worth noting that as HVM it still uses PV NIC drivers because they’re preinstalled.

I don’t know how to force it to use HVM drivers.

I confirm TX-DROP on rolling.
R1 and R3 - LTS
R2 - rolling

R1_eth1 == eth1_R2_eth2 == eth1_R3

Ping:
R1= R2_eth1 without problems
R1 = R2_eth2 without problems
R1 = R3 - 7% loss

I’ll try to figure it out.

Thank you for trying to resolve this!

Presumably, this problem can be solved by this patch. ref. https://patchwork.kernel.org/patch/9293785/
Need to check it out.

I would be happy to do some tests with this patch. Is there any easy way to cross-compile the vyos kernel?

Hi,
Did anyone find a working fix? Am currently running 1.4-rolling-202101300218 on XCP-NG with around 1-2% packets dropped on TX.

Iface  MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR  TX-DRP TX-OVR Flg
eth0  1500 55393733      0     71      0 93398166      0 1094371      0 BMRU
eth1  1500 94974007      0      0      0 98273116      0  885979      0 BMRU
eth2  1500 22236780      0      0      0 60714591      0  212657      0 BMRU
eth3  1500        0      0      0      0        7      0       0      0 BMRU
eth4  1500 17282995      0      0      0 43214192      0  262450      0 BMRU
eth5  1500        3      0      0      0      220      0       0      0 BMRU
eth6  1500  6829409      0      0      0  4785681      0  184062      0 BMRU
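
To put numbers on the per-interface drop rate, here is a small sketch that works from the TX-OK and TX-DRP counters in the table above. It assumes TX-OK does not include the dropped packets, so the rate is taken as drops over (OK + drops); the function name is just illustrative.

```python
# Interface counters copied from the table above: (TX-OK, TX-DRP).
counters = {
    "eth0": (93398166, 1094371),
    "eth1": (98273116, 885979),
    "eth2": (60714591, 212657),
    "eth4": (43214192, 262450),
    "eth6": (4785681, 184062),
}


def tx_drop_pct(tx_ok, tx_drp):
    """Dropped share of all TX attempts, assuming TX-OK excludes drops."""
    total = tx_ok + tx_drp
    return 100 * tx_drp / total if total else 0.0


for iface, (ok, drp) in counters.items():
    print(f"{iface}: {tx_drop_pct(ok, drp):.2f}% TX dropped")
```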