Packet drops happening again

https://phabricator.vyos.net/T935

I found this old case, which describes a problem I am currently having on the latest rolling…

~7.5% packet loss for packets under a certain size…

It happens even on a basic install…

See the side-by-side comparison using ping payload sizes 214 vs 215…

I should mention this only seems to happen when routing from ethernet to ethernet. In other words,
when I connect through OpenVPN to a VyOS box and ping through it (in via OpenVPN, out via ethernet), I don’t see the problem.

This is causing issues with the ping test software I use, which is reporting packet loss. It gets worse going through two VyOS boxes…
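The side-by-side test boils down to two flood pings straddling the loss threshold. This is a minimal sketch; the 192.168.2.222 target is a placeholder for a host on the far side of the router, and the arithmetic shows the resulting on-wire IP packet size (ICMP payload + 8-byte ICMP header + 20-byte IPv4 header).

```shell
#!/bin/sh
# Flood-ping a host behind the VyOS router at the two payload sizes
# that straddle the loss threshold. The target IP is a placeholder.
run_test() {
    target=$1
    ping -f -s 214 -c 1000 "$target"   # lossy (~7.5%) in the setup described
    ping -f -s 215 -c 1000 "$target"   # clean (0% loss)
}

# On-wire IP packet size for a given ICMP payload:
# payload + 8 (ICMP header) + 20 (IPv4 header).
payload=214
echo "IP packet size: $(( payload + 8 + 20 )) bytes"   # 242

# Uncomment to run against a real host:
# run_test 192.168.2.222
```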

vyos@router3:~$ sudo netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0      1500   112582      0      0 0         16765      0     36      0 BMRU
eth1      1500  1684396      0      0 0        831019      0  10586      0 BMRU
eth2      1500   816280      0      0 0         18746      0      0      0 BMRU
eth3      1500   101582      0      0 0            16      0      0      0 BMRU
eth4      1500   395557      0      0 0        394728      0  11228      0 BMRU
lo       65536     2871      0      0 0          2871      0      0      0 LRU

Which NIC drivers are in use?

sudo ethtool -i eth0
sudo ethtool -i eth1
show hardware pci 

@Dmitry I am running under Xen on XCP-NG 8
Originally I was using a fully PV VM. After hitting this issue, I tried HVM both with and without PV drivers, and I have the same problem either way. I also tried the Intel NIC model instead of the default Realtek one, and it does the same thing, except the size threshold is lower: the problem occurs with ping sizes up to 222 on the Realtek model versus 214 on the Intel one.

FWIW, for clarity, with the Intel NIC model:
ping sizes 1 through 214: ~7.5% loss
ping sizes 215 through 1502: 0% loss
ping sizes 1503 through 1694 (fragmented): ~7.5% loss
ping sizes 1695 through 2982 (fragmented): 0% loss
ping sizes 2983 through 3174 (fragmented): ~7.5% loss
ping sizes 3175 and up (fragmented): 0% loss
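Those boundaries can be found with a simple sweep. A hedged sketch of such a sweep follows; loss_pct just scrapes the percentage out of ping's summary line, and the size list and target IP are placeholders.

```shell
#!/bin/sh
# Extract the loss percentage from a ping summary line, e.g.
# "1000 packets transmitted, 925 received, 7.5% packet loss, time 999ms"
loss_pct() {
    sed -n 's/.*, \([0-9.]*\)% packet loss.*/\1/p'
}

# Flood-ping the target at a range of payload sizes and report loss.
sweep() {
    target=$1
    for size in 200 214 215 222 1502 1503; do   # sample sizes only
        loss=$(ping -f -s "$size" -c 1000 "$target" | loss_pct)
        echo "payload $size: ${loss}% loss"
    done
}

# Demo of the parser on a canned summary line:
echo "1000 packets transmitted, 925 received, 7.5% packet loss, time 999ms" | loss_pct
# -> 7.5

# Uncomment to sweep a real host:
# sweep 192.168.2.222
```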

vyos@router3:~$ show hardware pci

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)

00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]

00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]

00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)

00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)

00:02.0 VGA compatible controller: Device 1234:1111

00:03.0 SCSI storage controller: XenSource, Inc. Xen Platform Device (rev 02)

vyos@router3:~$

vyos@router3:~$ sudo ethtool -i eth1

driver: vif

version: 

firmware-version: 

expansion-rom-version: 

bus-info: vif-1

supports-statistics: yes

supports-test: no

supports-eeprom-access: no

supports-register-dump: no

supports-priv-flags: no

The old bug I was referencing is about VyOS on AWS, which also runs on Xen from what I recall, so I'm not surprised this is coming up on my own Xen.

[06:51 xen1 ~]# netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0      9000 1006030681      0      0 0      74970818      0      0      0 BMRU
eth1      9000        0      0      0 0             0      0      0      0 BMU
eth2      9000 609145160      0      0 0      735455697      0      0      0 BMRU
eth3      9000        0      0      0 0             0      0      0      0 BMU
eth2.2    9000        0      0      0 0       2240382      0      0      0 BMRU
eth2.3    9000        0      0      0 0       1180813      0      0      0 BMRU
eth2.120  9000 50670144      0      0 0      54883148      0      0      0 BMRU
eth2.121  9000   553713      0      0 0          4262      0      0      0 BMRU
eth2.161  9000        0      0      0 0        572587      0      0      0 BMRU
eth2.162  9000        0      0      0 0        570766      0      0      0 BMRU
eth2.163  9000        0      0      0 0        536892      0      0      0 BMRU
eth2.500  9000 15715220      0      0 0      11923650      0      0      0 BMRU
eth2.161  9000        0      0      0 0        240987      0      0      0 BMRU
lo       65536  5634463      0      0 0       5634463      0      0      0 LRU
vif10.0   9000 31636610      0      0 0      48742707      0      0      0 BMORU
vif11.3   9000 17429960      0      0 0      32153482      0      0      0 BMORU
vif12.3   9000  1148213      0      0 0      19038720      0      0      0 BMORU
vif13.0   9000   611289      0      0 0       1537666      0      0      0 BMORU
vif15.0   9000   507773      0      0 0      18714269      0      0      0 BMORU
vif15.1   9000        9      0      0 0        608667      0      0      0 BMORU
vif15.2   9000     4262      0      0 0        553688      0      0      0 BMORU
vif25.0   9000 353316087      0      0 0      417722773      0      0      0 BMORU
vif25.1   9000   536181      0      0 0        101196      0      0      0 BMORU
vif25.2   9000   290614      0      0 0        507603      0      0      0 BMORU
vif25.3   9000 396930378      0      0 0      58636142      0      0      0 BMORU
vif25.4   9000    25700      0      0 0        216940      0      0      0 BMORU
vif27.0   9000   523928      0      0 0       1848803      0      0      0 BMORU
vif3.0    9000 225652326      0      0 0      237752225      0      0      0 BMORU
vif30.0   9000  2483145      0      0 0       3609470      0      0      0 BMORU
vif31.0   9000 30609262      0      0 0      220969966      0      0      0 BMORU
vif33.0   9000 27941473      0      0 0      176375412      0      0      0 BMORU
vif4.0    9000  1673304      0      0 0       1316065      0      0      0 BMORU
vif47.0   9000  2703754      0      0 0      15241547      0      0      0 BMORU
vif48.0   9000  1548621      0      0 0      14686822      0      0      0 BMORU
vif49.0   9000 165351181      0      0 0      177161995      0      0      0 BMORU
vif53.0   9000  3601101      0      0 0      15557479      0      0      0 BMORU
vif53.1   9000    52465      0      0 0        413667      0      0      0 BMORU
vif53.2   9000  1179489      0      0 0        543974      0      0      0 BMORU
vif53.3   9000    58257      0      0 0        419358      0      0      0 BMORU
vif53.4   9000   262846      0      0 0         86307      0      0      0 BMORU
vif54.0   9000     8088      0      0 0        886336      0      0      0 BMORU
vif58.0   9000     1474      0      0 0      13996680      0      0      0 BMORU
vif59.0   1500 54883140      0      0 0      62037358      0      0      0 BMORU
vif70.0   1500 11923650      0      0 0      15715220      0      0      0 BMORU
vif75.0   9000    16018      0      0 0        937911      0      0      0 BMORU
vif75.1   9000 146130608      0      0 0      49231514      0      0      0 BMORU
vif75.2   9000 41171697      0      0 0      15267499      0      0      0 BMORU
vif75.3   9000       68      0      0 0        934622      0      0      0 BMORU
vif75.4   9000    11229      0      0 0        374432      0      0      0 BMORU
vif76.0   9000    28159      0      0 0        941674      0      0      0 BMORU
vif76.1   9000   106834      0      0 0       7500072      0      0      0 BMORU
vif76.2   9000       29      0      0 0       7430551      0      0      0 BMORU
vif76.3   9000   293945      0      0 0        819681      0      0      0 BMORU
vif76.4   9000      606      0      0 0        371497      0      0      0 BMORU
vif78.0   9000   563148      0      0 0       1078773      0      0      0 BMORU
vif78.1   9000   723157      0      0 0       8481730      0      0      0 BMORU
vif78.2   9000       18      0      0 0       7428637      0      0      0 BMORU
vif78.3   9000       18      0      0 0        933978      0      0      0 BMORU
vif78.4   9000      156      0      0 0        371331      0      0      0 BMORU
vif79.0   9000    60607      0      0 0        286727      0      0      0 BMORU
vif8.0    9000    48113      0      0 0        612195      0      0      0 BMORU
vif81.0   9000    68020      0      0 0        788626      0      0      0 BMORU
vif81.1   9000  3286000      0      0 0       8100669      0      0      0 BMORU
vif82.0   9000    28797      0      0 0        758798      0      0      0 BMORU
vif82.1   9000   588157      0      0 0       5447598      0      0      0 BMORU
vif83.0   9000   271153      0      0 0        930145      0      0      0 BMORU
vif83.1   9000  1092463      0      0 0       5701406      0      0      0 BMORU
vif84.0   9000    33498      0      0 0        762033      0      0      0 BMORU
vif84.1   9000   612289      0      0 0       5469333      0      0      0 BMORU
vif94.0   9000     3714      0      0 0         23301      0      0      0 BMORU
vif94.1   9000    18341      0      0 0        150345      0      0      0 BMORU
vif94.2   9000     4143      0      0 0        128139      0      0      0 BMORU
vif94.3   9000       14      0      0 0         20960      0      0      0 BMORU
vif94.4   9000     9556      0      0 0          5083      0      0      0 BMORU
xapi0     9000  2240312      0      0 0             0      0      0      0 BMRU
xapi1     9000  1180771      0      0 0             0      0      0      0 BMRU
xapi2     9000   570754      0      0 0             0      0      0      0 BMRU
xapi3     1500   608710      0      0 0             0      0      0      0 BMRU
xapi4     9000   551920      0      0 0             0      0      0      0 BMRU
xapi5     1500        2      0      0 0             0      0      0      0 BMRU
xapi7     9000   536882      0      0 0             0      0      0      0 BMRU
xapi8     9000   240987      0      0 0             0      0      0      0 BMRU
xapi20    9000   572575      0      0 0             0      0      0      0 BMRU
xenbr0    9000 74890691      0      0 0      64010098      0      0      0 BMRU
xenbr1    9000        0      0      0 0             0      0      0      0 BMU
xenbr2    9000 52068178      0      0 0      46891629      0      0      0 BMRU
xenbr3    9000        0      0      0 0             0      0      0      0 BMU

There are no packet drops on any interface on the hypervisor.

Also, the show hardware pci command above was run while I had the VM in HVM/PV mode for testing. I have reverted it back to full PV (as I want it), and now show hardware pci produces no output at all.

Also, I changed the number of vCPUs to 5 (one per NIC); same results.

I don't have smp-affinity configured on any interface (set interfaces ethernet ethX smp-affinity), but I tried it before and the problem still happens.
Also, commands like set interfaces ethernet ethX mtu 9000 don't work, though I would like jumbo frame support.

I was running full PV with SR-IOV, and everything was working great until I ran into a complication…
When using SR-IOV, the VMs will not live-migrate. I don't need to live-migrate my VyOS boxes, but I do need to be able to live-migrate my other VMs/servers… and I found out the hard way that when you have VyOS on SR-IOV and VMs on virtual interfaces that are not SR-IOV, the only thing that gets through is DHCP; after that, the VMs don't get any ARP traffic from VyOS. So I had to scrap SR-IOV for now, until these hypervisors run only VyOS instances and all the VM servers run on other hypervisors. That will happen eventually as I scale out, but for now, the 4 physical boxes I have need SR-IOV off.

Not sure, but this might be an issue with a networking function in the Xen kernel…

I can confirm I'm seeing this issue with the latest 1.3 rolling, and even as far back as the 1.2 December 19 snapshot, specifically over WireGuard. It doesn't happen when using 1.2.5 stable. I've seen this on bare metal (Protectli), VMware ESXi 6.7U3, and XCP-NG 8.0.

@hammerstud If you look at this ticket (since closed), you can see some commands to run while hitting the packet loss, to check whether the kernel is counting the packets as dropped.

The command, specifically, is watch -tn 1 "ifconfig -a | grep -A 5 eth1 | grep 'TX packets' | sed 's/^.* dropped:\([0-9]\{1,\}\) .*$/\1/g'"

You will need to replace the “eth1” with the specific interface you’re seeing problems with.

Do you see packet drops increasing there when you see the actual ping dropped packets?

A simpler command is to just watch sudo netstat -i

And yes, TX drop does increment when a packet is lost.

Since I have a new environment and the traffic is highly controlled, I can reboot my VyOS box and see 0 TX drops. I can run my ping flood with 250-byte packets and get 0% loss and no TX-drop increments. Then I run the same test with 200-byte packets, and wham: 7.5% loss shown in the ping test, and the TX-drop counters increment. The drops show up on both the ingress and egress ethernet interfaces, but not at the same time: about 3.75% on each NIC, adding up to the total 7.5% loss shown in the ping test.
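A quick way to get per-interface TX-drop deltas around a test run is to snapshot the kernel's counters in /sys before and after. This is a sketch; the interface names and target IP in the usage comment are examples.

```shell
#!/bin/sh
# Read the kernel's TX-dropped counter for an interface.
tx_dropped() {
    cat "/sys/class/net/$1/statistics/tx_dropped"
}

# Difference between two counter snapshots.
delta() {
    echo $(( $2 - $1 ))
}

# Usage around a test run (eth0 and the target IP are examples):
#   before=$(tx_dropped eth0)
#   ping -f -s 200 -c 1000 192.168.2.222
#   after=$(tx_dropped eth0)
#   echo "eth0 TX drops during test: $(delta "$before" "$after")"
echo "delta example: $(delta 100 137)"   # 37
```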

This occurs in a local LAN environment, with no VPN involved and no physical interfaces involved. I.e., running this on a hypervisor with virtual NICs on internal virtual networks and fully PV VMs, it still happens, so I'm not even sure this traffic ever hits the real physical NIC in the box.

I have also started a thread on the XCP-NG Forum (the hypervisor)

I made a little lab environment for testing to see if it might be a hypervisor problem, and it’s not.

no TX drops when I replace my VyOS router with a CentOS router running kernel 4.18 or 5.6

I also tried with a pfSense router: also no packet loss and no TX drops… It seems like this issue only affects VyOS…

I've tested this too, and it's affecting at least:
1.3-rolling-202003300117
1.3-rolling-202005040117

Does not seem to be a problem in version 1.1.8

I’ll fire a rolling up in my lab.

I can’t reproduce it on official 1.2.5 image.

Can’t repro here at home on vyos-1.3-rolling-202005040117-amd64.iso

This is on my laptop running VirtualBox (connected via wireless, even).

My VirtualBox is bridged to the adapter (running as a virtio interface); the adapter got a DHCP address on the LAN. I'm pinging another host on the LAN:


Any hints on what else to try to repro it?

I’ve tried these sizes:

-s 170
-s 180
-s 190
-s 200
-s 210
-s 220
-s 230
-s 240
-s 250

I haven’t managed to drop a single packet.

I don’t doubt you at all, I just can’t get it to bug out here.

It doesn't happen when I ping from the VyOS box.

It happens when I ping through it; IP forwarding is required to trigger the issue.

Here’s the detailed steps…

-Create 2 networks I’ll call them Network1 and Network2.
-Create 2 linux VMs (any flavor should work), I’ll call them Server1 and Server2
-Give Server1 a virtual NIC on Network1 with IP 192.168.1.111/24 gateway 192.168.1.1
-Give Server2 a virtual NIC on Network2 with IP 192.168.2.222/24 gateway 192.168.2.1
-Create a VyOS VM (will run OK from LiveCD) and assign a NIC on both networks.
-On Vyos:
–set interfaces ethernet eth0 address 192.168.1.1/24
–set interfaces ethernet eth1 address 192.168.2.1/24
–commit
–watch netstat -i
-You should observe initially TX drops is 0 on both interfaces
-On Server1:
–ping -f 192.168.2.222 -s 250 -c 1000
You should observe 0% packet loss and no TX-drop increments.
–ping -f 192.168.2.222 -s 56 -c 1000
You should now see about 7.5% packet loss (round trip) and roughly 37 or 38 TX drops on eth0 and eth1.
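To make the watch output easier to read during the steps above, the TX-DRP column can be pulled out of the netstat -i table with awk (TX-DRP is field 9 in the layout shown earlier; the interface names are examples):

```shell
#!/bin/sh
# Pull the TX-DRP column (field 9 in "netstat -i" output) for eth0/eth1.
txdrp() {
    awk '$1 == "eth0" || $1 == "eth1" { print $1, $9 }'
}

# Normally you would run:  watch -n 1 'netstat -i | awk ...'
# Demo on a canned netstat -i line:
echo "eth0      1500   112582      0      0 0         16765      0     36      0 BMRU" | txdrp
# -> eth0 36
```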

Just some more testing I did… I completely disabled NAT and the firewall, and told conntrack to ignore all protocols, so that rules them all out.

Yup, I understand your topology/setup now. I don't have much free time at the moment, but I'll try to repro this over the next day or two, and then I think we'll just log a new Phabricator ticket for it (it must be kernel-related).

I installed a fresh copy of Debian 10 Buster and it’s running 4.9.0-8-amd64 Debian 4.19.98-1+deb10u1 (2020-04-27)

I did the test and there are no TX drops at all. Since I had no TX drops with Debian, CentOS, or pfSense, it does seem that VyOS is the only affected distro.

What would it take to update the kernel in rolling release?

FWIW, I ran the same test with a physical Ubiquiti EdgeRouter and saw no TX drops.

So also interesting lab results:

I've been doing iperf3 tests and found that the TX drops show up during iperf runs as well as with the ping flood I was using before.

I was able to achieve 3.5Gb/sec forwarding through this VyOS box despite the TX dropped packets.
I dropped in a CentOS router to see the difference, and I'm able to achieve the full 10Gb/sec forwarding.

So, as minuscule a problem as it may sound when I say that 3.75% of packets below 222 bytes are being dropped, this affects things such as TCP ACKs and other short messages, and the net result is that I am losing 65% of my possible bandwidth because of it…
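The 65% figure follows directly from the iperf comparison above: the fraction of the CentOS router's 10 Gb/s that the 3.5 Gb/s through VyOS leaves on the table.

```shell
# Bandwidth lost to the small-packet drops, from the iperf numbers above:
# CentOS router forwards 10 Gb/s; VyOS manages 3.5 Gb/s.
awk 'BEGIN { printf "%.0f%% of possible bandwidth lost\n", (10 - 3.5) / 10 * 100 }'
# -> 65% of possible bandwidth lost
```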

I don't know how to get a copy of VyOS 1.2.5 stable to test with; I only seem to have 1.1.8 and 1.3-rolling.

I am going to try the iperf and ping tests against 1.1.8 again.

I would really love to use VyOS in my environment, but I honestly can't go to production running like this.

So this morning I built the following two Rolling Routers, the first with a leg into my LAN.

LAN - ([192.168.0.225/24]-ROLLING-[10.10.10.1/30])====VBOX-BRIDGE====([10.10.10.2/30]-ROLLING)

The first router is simple rolling booted up in LiveISO mode with a /24 on the first interface and a /30 on the second, no other config. The second is also LiveISO with only a /30 configured on its Interface.

From my box 192.168.0.5, pinging 10.10.10.2 (so Router 1 is routing this for me)

[6:40:20] root :: micro  ➜  ~ » ping -f 10.10.10.2 -s 250 -c 10000
PING 10.10.10.2 (10.10.10.2) 250(278) bytes of data.

--- 10.10.10.2 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 925ms
rtt min/avg/max/mdev = 1.779/3.097/91.304/2.611 ms, pipe 5, ipg/ewma 3.189/3.766 ms
[6:40:55] root :: micro  ➜  ~ » ping -f 10.10.10.2 -s 56 -c 10000
PING 10.10.10.2 (10.10.10.2) 56(84) bytes of data.

--- 10.10.10.2 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 290ms
rtt min/avg/max/mdev = 1.697/2.941/26.901/1.576 ms, pipe 2, ipg/ewma 3.126/2.350 ms

I still can’t repro the packet loss I’m afraid, try as I might. I’ve tried many different small values between 56 and 300 for -s and still haven’t lost a packet yet.

I wonder if it's something to do with the offload settings that are being applied: if you disable all offload settings on your NICs, does that alter/fix the problem?
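For anyone wanting to try that, this is a rough sketch of disabling the common offloads with ethtool. The interface name is an example, and the exact feature names supported vary by driver, so check "ethtool -k eth1" first. (As noted below, on the Xen PV vif driver these commands may just return "Operation not supported".)

```shell
#!/bin/sh
# Turn off common offloads on the suspect interface.
# IF is a placeholder; feature support varies by driver.
IF=eth1
for feat in tso gso gro tx rx sg; do
    sudo ethtool -K "$IF" "$feat" off
done
# Verify the result:
sudo ethtool -k "$IF"
```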

I’m not in a position to fire up a Xen host, so I probably can’t help debug this problem any further I’m afraid.

OK, so if it's behaving differently under different hypervisors, that's odd, but I guess it tells us exactly where the problem is: NIC drivers!

In Xen when you install VyOS it comes up using paravirtual interface drivers.

When you run ethtool you get no output, and any attempt to change offload settings or ring buffer sizes is met with an “Operation not supported” error.

Someone on this thread had a problem like this involving VMware too?

Can I PM you with access to my Xen?