1gbit <-> 1gbit odd +60ms latency + poor throughput problem solving

Hi,

Fairly new to VyOS but pretty familiar with other linux stuff. Recently installed on inexpensive Core2 Duo 3Ghz machine with 4GB ram, 2 times 1Gbit intel ethernet. Started configuring simple Vyos with Source and Destination NAT, DHCP, Simple few FW rules and so.

Everything went well in the beginning (few days) but then I experienced issues of huge performance drop. So like from speedtest of 20ms latency and 300Mbit/100Mbit to 80ms and 25Mbit/25Mbit.

Ok, I thought that maybe there is internet issue, maybe some nat is slowing or so.

Now I have eliminated (a lot of hours) like:

  • Internet connection works, so if switched to other device it works and when put back to VyOS it is slow
  • all switches, managed, nonmanaged tested, also cables changed
  • configuration rolled back to minimum
  • power saving things removed from pc bios
  • reinstalled VyOS from beginning
  • a lot of other

Now probably what is left is that maybe ethernet card just broke down? I’m really out of ideas how to debug the issue. Have not seen other issues than slowness and latency. I have tried to google similar issues but have not found. Why it is that +65ms? Why it shapes to almost exact 25Mbit/25Mbit? Really looks like sw issue but cant figure what is causing that.

I have tried to look from logs and eth statistics + monitoring and figuring out (learning) to understand and fix things but don’t know anymore what to look next. Oc I do not want this to be so that if it is obvious hw error then it should be replaced but the behavior looks really odd to my experience.

Any tips is welcome and will post my findings here when I solve this.

VyOS 1.3.0, Dell Optiplex 960, dualcore E8400 3GHz, 4GB Ram, 64GB SSD, Internal eth0 driver e1000e, external eth1 driver e1000. (Internet - 5G bridged - Vyos - Lan)

edit. (specs)

Even when shaped at 25/25 , ping time should be way better when link is idle.
Use traceroute to 8.8.8.8 (windows pathping) to get a clue of the hop causing delay

This was really good, I kind of had tested it before but not this clearly understood it.
So when I ping some reliable like ftp.funet.fi in Finland from the VyOS itself and run speedtest from Windows client, the ping latency is ok and not affected much. It is like 24ms and goes up to 40 or 60ms. If speedtest would go fastest pace it would be hundreds of ms.

And from windows tracert it reveals it

tracert 8.8.8.8

Tracing route to dns.google [8.8.8.8]
over a maximum of 30 hops:

1 16 ms 103 ms 103 ms 192.168.100.x
2 82 ms 103 ms 103 ms ww-xx-yy-zz.aaaaaaaa.fi [ww.xx.yy.zz]
3 * * * Request timed out.
4 103 ms 118 ms 91 ms 10.64.192.25
5 140 ms 46 ms 127 ms 213.192.186.82
6 57 ms 98 ms 119 ms ae3.bbr1.hel2.fi.ip4.elisa.net [213.192.186.81]
7 107 ms 118 ms 99 ms 213.192.184.95
8 107 ms 126 ms 100 ms 213.192.185.93
9 114 ms 99 ms 98 ms 142.250.213.179
10 118 ms 107 ms 91 ms 209.85.241.29
11 64 ms 100 ms 101 ms dns.google [8.8.8.8]

192.168.100.xx is my VyOS gw.

ok now I got some clarify, don’t know though yet how to fix it.

changed external pci card to different as I thought it was faulty. Booted, configured (eth1 changed to eth2) and run speedtest and thought it is now fixed as good throughput and latency. Then once I was wrapping it up heard beep from the machine and found out from log…

Feb 15 08:34:53 vyos kernel: [  405.253971] irq 16: nobody cared (try booting with the "irqpoll" option)
Feb 15 08:34:53 vyos kernel: [  405.256434] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.4.165-amd64-vyos #1
Feb 15 08:34:53 vyos kernel: [  405.256435] Hardware name: Dell Inc. OptiPlex 960                 /0G261D, BIOS A18 08/06/2013
Feb 15 08:34:53 vyos kernel: [  405.256435] Call Trace:
Feb 15 08:34:53 vyos kernel: [  405.256438]  <IRQ>
Feb 15 08:34:53 vyos kernel: [  405.256443]  dump_stack+0x66/0x81
Feb 15 08:34:53 vyos kernel: [  405.256445]  __report_bad_irq+0x32/0xac
Feb 15 08:34:53 vyos kernel: [  405.256447]  note_interrupt.cold.10+0xa/0x63
Feb 15 08:34:53 vyos kernel: [  405.256450]  handle_irq_event_percpu+0x65/0x70
Feb 15 08:34:53 vyos kernel: [  405.256451]  handle_irq_event+0x34/0x60
Feb 15 08:34:53 vyos kernel: [  405.256453]  handle_fasteoi_irq+0x95/0x140
Feb 15 08:34:53 vyos kernel: [  405.256455]  do_IRQ+0x44/0xd0
Feb 15 08:34:53 vyos kernel: [  405.256456]  common_interrupt+0xf/0xf
Feb 15 08:34:53 vyos kernel: [  405.256457]  </IRQ>
Feb 15 08:34:53 vyos kernel: [  405.256460] RIP: 0010:mwait_idle+0x52/0x80
Feb 15 08:34:53 vyos kernel: [  405.256461] Code: 48 8b 04 25 c0 bb 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 75 24 e9 07 00 00 00 0f 00 2d e9 c5 5a 00 31
c0 48 89 c1 fb 0f 01 c9 <65> 48 8b 04 25 c0 bb 01 00 f0 80 60 02 df c3 fb 66 66 90 66 66 90
Feb 15 08:34:53 vyos kernel: [  405.256462] RSP: 0018:ffffa0174007fee0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd9
Feb 15 08:34:53 vyos kernel: [  405.256464] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
Feb 15 08:34:53 vyos kernel: [  405.256465] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000082
Feb 15 08:34:53 vyos kernel: [  405.256465] RBP: ffffffff8888b8c0 R08: 0000000000000001 R09: 0000005e621b3fec
Feb 15 08:34:53 vyos kernel: [  405.256466] R10: 0000012c198c5bcb R11: 0000000000001800 R12: 0000000000000000
Feb 15 08:34:53 vyos kernel: [  405.256467] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Feb 15 08:34:53 vyos kernel: [  405.256469]  do_idle+0x1ce/0x250
Feb 15 08:34:53 vyos kernel: [  405.256471]  cpu_startup_entry+0x14/0x20
Feb 15 08:34:53 vyos kernel: [  405.256473]  start_secondary+0x15f/0x1b0
Feb 15 08:34:53 vyos kernel: [  405.256475]  secondary_startup_64+0xa4/0xb0
Feb 15 08:34:53 vyos kernel: [  405.256476] handlers:
Feb 15 08:34:53 vyos kernel: [  405.258014] [<00000000381cd7c1>] usb_hcd_irq
Feb 15 08:34:53 vyos kernel: [  405.259969] [<000000003ba784a4>] rtl8169_interrupt [r8169]
Feb 15 08:34:53 vyos kernel: [  405.262261] **Disabling IRQ #16**

and now it is slow again.

so it seems it is old pchw ethernet irq related thing.

ok, so I think what affected it was putting usb device to one of the USB ports in front of the pc (USB 1&2), like keyboard and that sharing IRQ with eth1

cat /proc/interrupts |grep eth
16: 0 207451 IO-APIC 16-fasteoi uhci_hcd:usb2, eth1
30: 91878 0 PCI-MSI 409600-edge eth0

did not put irqpoll yet on kernel parameters but changed the keyboard to back of the pc.

That behavior was weird, maybe it does some polling after that to pci device when it is in “failsafe” mode with IRQ:s and that makes it +65ms and “shaping” to 25Mbps :wink:

I think this is pretty much solved with this.

It looks like it needs that irqpoll or something else as issue came back.

what is the correct manual way to modify grub.cfg or /etc/default/grub and then have it activated?

update-grub gives this
/usr/sbin/grub-probe: error: failed to get canonical path of `overlay’.

Was there some VyOS command to modify grub or some manual way?