Severe latency issues in VyOS 1.3.6

Hello everyone,

A few days ago we updated our redundant routers to the official VyOS version 1.3.6. The routers are connected via 10G and receive full tables for IPv4 and IPv6 via BGP from several providers.

About 3 days after the update, however, severe latency spikes appeared: individual packets to the primary router itself and to the servers connected via it sometimes took over 1000 ms. Rebooting the router brings the latency back to normal, but unfortunately only for a few days.

At least so far, the secondary router doesn’t show the problem, although it usually doesn’t have much to do. The problem only seems to appear after a router has carried real traffic over a longer period of time.

While trying to isolate the problem, the only thing we have found so far is that FRR logs the following messages on the primary router around the problematic periods:

Apr 28 06:40:45 router zebra[1509]: [EC 4043309102] NH_INSTALL operation preformed on Nexthop ID (3553510) in the kernel, that we no longer have in our table
Apr 28 06:41:01 router zebra[1509]: [EC 4043309102] NH_INSTALL operation preformed on Nexthop ID (3553657) in the kernel, that we no longer have in our table
Apr 28 06:41:30 router zebra[1509]: [EC 4043309102] NH_INSTALL operation preformed on Nexthop ID (3554005) in the kernel, that we no longer have in our table

Unfortunately, we didn’t find much about this online and aren’t sure if this is the cause or just a symptom of the problem. However, these messages do not appear on the second router. Has anyone encountered similar problems or knows what exactly these messages mean?

Alternatively, what would be the best approach to debug this further?

Thanks for the help!
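
One way to check whether these zebra messages line up with the latency spikes is to count them per minute and compare the busy minutes against the times the spikes were observed. The following is only a minimal sketch: the function name count_nh_install_per_minute is made up for illustration, and it assumes the syslog timestamp layout shown in the excerpt above (month, day, HH:MM:SS).

```shell
#!/bin/sh
# Count zebra NH_INSTALL messages per minute from syslog-style lines on stdin.
# Assumption: lines look like "Apr 28 06:40:45 router zebra[1509]: ...".
# count_nh_install_per_minute is a hypothetical helper name.
count_nh_install_per_minute() {
    awk '/NH_INSTALL operation/ {
        # Key = month, day and HH:MM (seconds are cut off).
        key = $1 " " $2 " " substr($3, 1, 5)
        if (!(key in count)) order[++n] = key   # remember first-seen order
        count[key]++
    }
    END {
        for (i = 1; i <= n; i++) print order[i], count[order[i]]
    }'
}
```

It could be fed from wherever the FRR messages end up on the system, for example `count_nh_install_per_minute < /var/log/messages` (that path is an assumption and may differ).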

I’ve experienced a similar issue when peering at an IXP, specifically with v6 peering and importing its routing table.

When running sudo perf top I was able to see that there was a single interrupt causing the issue. I don’t remember it exactly and for some reason did not document it, but I think it was fib6_table_lookup.

When the peering session with that IXP was shut down the issue went away.

Are you able to run sudo perf top and send the output? You may need to install VyOS specific linux-tools.

For me upgrading to 1.4-epa1 made the issue go away, so I did no more troubleshooting.
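
Since perf top is interactive, its output is awkward to paste into a thread. A rough sketch of a non-interactive alternative is to record a system-wide profile for a fixed time and print the report as text. The function name capture_perf_profile and the /tmp/perf.data path are made up for illustration, and the sketch assumes a perf binary matching the running kernel is already installed.

```shell
#!/bin/sh
# Record a system-wide CPU profile for a fixed time and print a plain-text
# summary that can be pasted into a forum post.
# Assumptions: perf is installed and matches the running kernel;
# capture_perf_profile and the default output path are hypothetical.
capture_perf_profile() {
    seconds="${1:-30}"                 # sampling duration, default 30 s
    data_file="${2:-/tmp/perf.data}"   # where the raw samples are stored

    if ! command -v perf >/dev/null 2>&1; then
        echo "perf not found; install the linux-tools package for this kernel" >&2
        return 1
    fi

    # -a samples all CPUs; "sleep" only bounds the recording time.
    sudo perf record -a -o "$data_file" -- sleep "$seconds" || return 1

    # --stdio prints the report as text instead of opening the interactive UI.
    sudo perf report -i "$data_file" --stdio 2>/dev/null | head -n 60
}
```

Running `capture_perf_profile 30` during a latency spike and once during a quiet period would give two summaries to compare.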

Thanks for the feedback. Unfortunately, after the last blog entry, the repositories are no longer available, so I can’t simply install the linux-perf tools. I would be glad to be able to check this out and maybe report a bug. Unfortunately, an update to 1.4 isn’t easy either, as at first glance at least the configuration of the firewall is completely broken.

I’ve just upgraded from 1.3.5 to 1.3.6 (unfortunately, shortly before reading this thread) - so will see in a few days if I get bitten by this too. I’m running two routers, peering with an IXP as well. Which version were you running before the upgrade, which didn’t have the issue yet?

Wouldn’t something like this work then?

Add this to /etc/apt/sources.list (comment unwanted sources):

deb bookworm main contrib non-free non-free-firmware
deb bookworm-updates main contrib non-free non-free-firmware
deb bookworm-proposed-updates main contrib non-free non-free-firmware
deb bookworm-backports main contrib non-free non-free-firmware
deb bookworm-security main contrib non-free non-free-firmware
deb current main

Then run:

sudo apt-get update
sudo apt-get install linux-perf-tools

or whatever the package might be named?


Could any of these cases (and workarounds) be related to your case?

If you didn’t have any problems running 1.3.5 then you’re probably fine as we were coming from a relatively old version of VyOS.

As for the sources.list, using the packages for the rolling release isn’t going to work since we are using VyOS 1.3.6 at the moment. The problem with simply installing linux-perf is related to the latest blog post and topics like "Unable to build ISO 1.4".

With regard to the workarounds, VyOS 1.3.6 is using FRRouting (version 7.5.1-20231128-03-g06f8c4ce0) and it doesn’t know the command zebra nexthop-group keep 1. We are also unsure at the moment if raising the dplane limit would help, since we would expect that kind of problem to show up right after starting the BGP daemon.

What we know right now is that it is not only a problem with an IXP but with IPv6 BGP, zebra and maybe the kernel in general. By shutting down all the IPv6 BGP sessions, we seem to be able to get rid of the (IPv4 and IPv6) lag spikes. We can even re-enable some sessions and the problems are temporarily gone.

We are currently considering upgrading to VyOS 1.4.0-epa2, even though we already encountered several problems with the migration of the configuration. The whole firewall configuration is gone after going from 1.3.6 to 1.4.0-epa2. Besides that, a policy route is missing and all our route-maps that set a community are broken, since the communities are not there anymore. We will check for more problems soon and share our findings in the development portal.
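
To see whether the spikes come back after re-enabling individual IPv6 sessions, a timestamped ping log filtered for slow replies might help. This is only a sketch: flag_slow_pings is a made-up helper name, and it assumes the usual iputils ping output where each reply line contains a "time=<value>" field in milliseconds.

```shell
#!/bin/sh
# Print only those ping reply lines whose round-trip time exceeds a limit.
# Usage: ping -D <address> | flag_slow_pings <limit-in-ms>
# Assumption: reply lines contain a field like "time=12.3" (iputils format).
# flag_slow_pings is a hypothetical helper name.
flag_slow_pings() {
    awk -v limit="${1:-100}" '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^time=/) {
                rtt = substr($i, 6) + 0      # strip "time=" and force a number
                if (rtt > limit + 0) print   # keep the whole line, timestamp included
            }
        }
    }'
}
```

For example, `ping -D 192.0.2.1 | flag_slow_pings 100` (with the placeholder address replaced by the router or a server behind it) would print only replies slower than 100 ms, each prefixed with the Unix timestamp that ping -D adds, which can then be compared against the time a session was re-enabled.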
