VyOS config: Configuration error on boot, but why?

vyos@router:~$ show version
Version:          VyOS 1.5-rolling-202407010024
Release train:    current
Release flavor:   generic

Built by:         autobuild@vyos.net
Built on:         Mon 01 Jul 2024 03:14 UTC
Build UUID:       d6b612de-81e8-4ef4-9a27-ba1cd4139622
Build commit ID:  057db80447b3dd

Architecture:     x86_64
Boot via:         installed image
System type:      KVM guest

Hardware vendor:  QEMU
Hardware model:   Standard PC (i440FX + PIIX, 1996)
Hardware S/N:
Hardware UUID:    2e589f27-592c-44ec-9496-ce8fe8256607

Copyright:        VyOS maintainers and contributors
vyos@router:~$ conf
WARNING: There was a config error on boot: saving the configuration now could overwrite data.
You may want to check and reload the boot config
[edit]
vyos@router# save /tmp/config.boot
[edit]
vyos@router# exit
exit
vyos@router:~$ diff /config/config.boot /tmp/config.boot
11,36d10
<     ethernet eth1 {
<         offload {
<             gro
<             gso
<             sg
<             tso
<         }
<         vif 3103 {
<             redirect "eth2.2805"
<         }
<         vif 3104 {
<             redirect "eth2.2805"
<         }
<         vif 3200 {
<             redirect "eth2.2805"
<         }
<         vif 3203 {
<             redirect "eth2.2805"
<         }
<         vif 3204 {
<             redirect "eth2.2805"
<         }
<         vif 3205 {
<             redirect "eth2.2805"
<         }
<     }
102a77
>
vyos@router:~$ conf
WARNING: There was a config error on boot: saving the configuration now could overwrite data.
You may want to check and reload the boot config
[edit]
vyos@router# load /config/config.boot
Loading configuration from '/config/config.boot'
Load complete. Use 'commit' to make changes effective.
[edit]
vyos@router# commit
[edit]
vyos@router# save
[edit]
vyos@router# exit
exit

After boot I compared the configs and found that the redirect had not been applied correctly. But after the reboot, a warm load of the same config commits with no errors. The redirect function is a critical component for us, so we can’t remove it; why is it failing at boot time?

I found the reason behind it:
the eth2.2805 interface starts later than eth1, so when eth1 is initialized it doesn’t see eth2.2805 yet, and the redirect fails. How can I fix the start-up sequence?

Does it look anything like these? ⚓ T260 Redirect traffict between two L3 interfaces & ⚓ T6393 Port mirroring to tunnel interface fails during boot

There are potential race conditions in how redirects & traffic mirrors are applied - since it’s part of the interface setup, if the target interface isn’t ready yet, it will cause an error.

Often this isn’t a problem during runtime but it does become a problem at boot time, depending on the order interfaces come up.

You can try re-ordering the interfaces and see if that sorts it out, but it may not be stable. There needs to be work done to properly apply the settings after the interfaces are readied.

You can try changing your interface hook-ups to have the redirect targets on eth1 and the real interfaces on eth2. This would mean swapping the physical or vSwitch connectivity as well to match.

Since they’re all ethernet, this should work. But if you’re mixing interface types (like the person trying to mirror to a tunnel in that bug ticket), it isn’t an option: different interface types are applied in a specific order, and tunnels come up after ethernet, so that case just isn’t possible at present.
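For illustration, the swapped layout would look something like this (just a sketch based on the config in your diff; the monitoring VLAN moves to eth1, which comes up first, and the source VIFs move to eth2):

set interfaces ethernet eth1 vif 2805
set interfaces ethernet eth2 vif 3103 redirect 'eth1.2805'
set interfaces ethernet eth2 vif 3104 redirect 'eth1.2805'
set interfaces ethernet eth2 vif 3200 redirect 'eth1.2805'

By the time the eth2 VIFs and their redirects are applied, eth1.2805 already exists.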

As a nasty workaround, if you don’t want to change interface ordering, you could re-apply the redirects after a reboot manually, with a script or via automation like Ansible. As long as the target interfaces exist, it’ll work fine.
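If you go the script route, VyOS has a post-config boot hook you can drop it into. Something along these lines should do it (a minimal sketch only; the path is the stock hook, and the set commands are the redirects from your diff):

#!/bin/vbash
# /config/scripts/vyos-postconfig-bootup.script
# Runs after the boot config has been loaded, so the target interface already exists.
source /opt/vyatta/etc/functions/script-template
configure
set interfaces ethernet eth1 vif 3103 redirect 'eth2.2805'
set interfaces ethernet eth1 vif 3104 redirect 'eth2.2805'
set interfaces ethernet eth1 vif 3200 redirect 'eth2.2805'
commit
exit

The same set/commit sequence pushed from Ansible after boot does the same job.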

If you’ve got commercial support you might be able to ask VyOS to work on a proper fix for the feature.


I ended up swapping eth2 and eth1 and moving the config between the two interfaces, and it works, thanks. It isn’t practical if we have multiple layers of dependency, though. I also checked ⚓ T5794 Flowtable with Bond Race; it’s a shame that we can’t configure the start-up priority of the interfaces (/usr/libexec/vyos/priority.py), but it works for now, thanks.

I agree it’s an annoying problem to have or to work around, but it’s not a simple fix, just because of the way the config system works under the hood. Both the tickets I linked are in my flagged list to contribute to at some point, but as it’s not something I need and I’m just a random user who occasionally puts in a PR, I don’t have an ETA.

The easiest fix is likely to be moving any dependent config into another node elsewhere (like set service mirror-redirect ...) and changing the node priority to suit. This is what Viacheslav suggested in the ticket.

I wouldn’t mind a multi-phase mechanism which doesn’t cause these kinds of issues in nodes sharing the same conf_script or priority value.

For these interface configs, it could be something as simple as moving the runtime setup into if-up hooks rather than insisting on doing it directly out of conf_scripts. There are problems with that as well - the config system won’t be able to catch and handle runtime errors occurring in the hooks, for example.

Something more general could be useful, but complex, and may not line up with the maintainers’ vision of how applying config should work.

Do you think it would work if I pointed the redirect at a dummy interface and redirected from the dummy back to the physical interface? A dummy is always up, so there wouldn’t be a problem, right?

Unfortunately, there’s no such thing as “always up” for boot-time config. It’s the first time VyOS has seen the config. There are no problems once the interfaces exist, though, hence the workaround of “just recreate it after boot”.

If the dummies come up before ethernet, the dummy->ethernet 2 redirect would fail.

If they come up after ethernet, the ethernet 1->dummy redirect would fail.

You’d also be taking a double performance hit and I’m not sure if redirected traffic can be redirected.

The redirect/mirror config is applied at the same time the interface is brought up and given an address, all in one step. The fix would be to separate that out, wait for all interfaces to be configured, then apply redirections and mirroring.

What’s the difference between mirror-redirect and redirect?

Mirror is a copy, redirect is a move. Redirected traffic never enters the original target interface for sockets to see. Mirrored traffic shows up on both.

Sounds like redirect would use fewer resources than mirror, am I correct?

Definitely yes. It doesn’t go through the entire network stack, but there’s still a lookup and a re-queue, whether it happens on CPU or offloaded. It will have some impact on PPS.

EDIT: Just to clarify, mirror and redirect are almost the same thing as far as tc is concerned. They are just different arguments to filter policy. Interface.set_mirror_redirect() is where they are both processed from config into tc calls.
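To make that concrete, the two variants boil down to roughly these tc calls (illustrative only, not the exact filters VyOS generates; interface names are from the config earlier in the thread):

tc qdisc add dev eth1.3103 ingress
# redirect: move the packet to the target, it never reaches eth1.3103's stack
tc filter add dev eth1.3103 ingress protocol all prio 10 matchall action mirred egress redirect dev eth2.2805
# mirror: copy the packet to the target, the original is still processed on eth1.3103
tc filter add dev eth1.3103 ingress protocol all prio 10 matchall action mirred egress mirror dev eth2.2805

The only difference is the redirect/mirror keyword on the mirred action; you’d install one or the other, not both.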

Just noticed you were talking about ‘mirror-redirect’ vs ‘redirect’ - just in case there was confusion, it’s just mirror and redirect. There aren’t 2 different ways to redirect :slight_smile:.


Since I am dealing with a huge amount of data, is there a way to fine-tune it? VLAN 2805 is now hit or miss: sometimes it runs perfectly fine, sometimes it just stops forwarding.

vyos@router# tc qdisc show
qdisc noqueue 0: dev lo root refcnt 2
qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: dev eth1 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: dev eth2 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc noqueue 0: dev pim6reg root refcnt 2
qdisc noqueue 0: dev eth1.2805 root refcnt 2
qdisc noqueue 0: dev eth2.3001 root refcnt 2
qdisc ingress ffff: dev eth2.3001 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3002 root refcnt 2
qdisc ingress ffff: dev eth2.3002 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3003 root refcnt 2
qdisc ingress ffff: dev eth2.3003 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3010 root refcnt 2
qdisc ingress ffff: dev eth2.3010 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3012 root refcnt 2
qdisc ingress ffff: dev eth2.3012 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3103 root refcnt 2
qdisc ingress ffff: dev eth2.3103 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3104 root refcnt 2
qdisc ingress ffff: dev eth2.3104 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3105 root refcnt 2
qdisc ingress ffff: dev eth2.3105 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3106 root refcnt 2
qdisc ingress ffff: dev eth2.3106 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3107 root refcnt 2
qdisc ingress ffff: dev eth2.3107 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3200 root refcnt 2
qdisc ingress ffff: dev eth2.3200 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3203 root refcnt 2
qdisc ingress ffff: dev eth2.3203 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3204 root refcnt 2
qdisc ingress ffff: dev eth2.3204 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3205 root refcnt 2
qdisc ingress ffff: dev eth2.3205 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3206 root refcnt 2
qdisc ingress ffff: dev eth2.3206 parent ffff:fff1 ----------------
qdisc noqueue 0: dev eth2.3207 root refcnt 2
qdisc ingress ffff: dev eth2.3207 parent ffff:fff1 ----------------

explanation from GPT:

  • root refcnt 2: Indicates that this is the root qdisc with a reference count of 2.
  • limit 10240p: Maximum queue length is 10240 packets.
  • flows 1024: Number of flows is limited to 1024.
  • quantum 1514: Maximum packet size for each round of scheduling.
  • target 5ms: Target queue delay is 5 milliseconds.
  • interval 100ms: Control loop interval is 100 milliseconds.
  • memory_limit 32Mb: Memory limit for the queue is 32 megabytes.
  • ecn: Explicit Congestion Notification is enabled.
  • drop_batch 64: Maximum number of packets to drop in a batch.
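If it’s going hit or miss, one quick thing to check while it’s broken is whether the redirect filters are still installed and whether their counters keep moving (eth2.3103 is just one of the source VIFs from the output above):

tc -s filter show dev eth2.3103 ingress

With -s you get packet/byte counters on the mirred action. If they stop incrementing while traffic is arriving, the filter is present but the redirect isn’t firing; if the filter is missing entirely, it was never (re)applied.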

Why do you need the redirect in the first place?

Would it be better to use a VLAN-aware bridge or just route the traffic?

The perfect one, in fact, because I am lazy :smiley:

I tried to route the traffic but it didn’t work. As a matter of fact, I disabled IGMP snooping, so my VLAN interfaces are flooded with multicast packets. How could I route traffic from multiple interfaces to another one without using redirect?

Let me check what a VLAN-aware bridge is, I’ve never heard of that.
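It’s a bridge with VLAN filtering enabled on the bridge itself, so a single bridge can carry tagged traffic for many VLANs instead of needing one bridge per VLAN. In VyOS it looks roughly like this (a sketch only; br0 and the VLAN assignments are illustrative):

set interfaces bridge br0 enable-vlan
set interfaces bridge br0 member interface eth1 allowed-vlan 2805
set interfaces bridge br0 member interface eth2 allowed-vlan 3103-3205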

I’m just not sure what you’re trying to do with so many interfaces in redirect; there might be a better way to do it.

Usually, mirror and redirect are for diagnostics and monitoring.

That’s correct, I am trying to mimic a TAP switch.

Would NetFlow exported from the router be better?

I’m guessing you don’t have a switch with actual SPAN port capability if you’re doing this.

That’s correct, we want to do everything in a virtualised environment.

VMware can do promisc and SPAN/RSPAN (with a full dvSwitch). KVM exposes promisc to guests. Open vSwitch can do SPAN/RSPAN.
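If you go the Open vSwitch route, the SPAN session lives on the host rather than in VyOS. Creating one looks roughly like this (names are illustrative: br0 is the OVS bridge, tap-mon the port facing your capture VM):

# mirror all traffic on br0 out of the tap-mon port
ovs-vsctl -- set Bridge br0 mirrors=@m \
          -- --id=@p get Port tap-mon \
          -- --id=@m create Mirror name=span0 select-all=true output-port=@p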

Is the VyOS just watching a bunch of VLANs and packing the traffic it sees into a monitoring VLAN?

For something like this, I’d try to get as close to the metal as possible. That’s my only real suggestion.