Wireguard handshakes aren't completed when tunneled over WAN in VRF

ACiD_GRiM · October 1, 2023, 6:40am

I have 3 WAN interfaces with DHCP addresses. I’m moving from an OpenWrt router to VyOS (“hopefully”); a similar configuration works there using separate routing tables, but without explicit VRFs. I’m hoping to get some help getting the final connection working…

The basic setup:
3 WAN links are each assigned to a separate VRF, so their DHCP default gateways exist in separate tables
3 WireGuard tunnels that terminate at the same IP but on different ports. Each is configured with a fwmark
3 local route policies that each match on a single fwmark and set the routing table to one specific WAN VRF/table

On OpenWRT the tunnels connect over the internet using the configured WAN interface, and then the wg interface exists in the default vrf/table so I can configure routing protocols, bgp in this case.

In VyOS, however, I see the handshakes being sent between the local and remote ends, but it seems the replies aren’t being directed back to the WireGuard interface to establish the tunnel. Setting +p in debug shows that the handshake didn’t complete within 5 seconds.
I also tried assigning the WireGuard interfaces to the same VRF as their desired WAN link, but the connections were still made using the default VRF/table unless I also enabled local-policy. No packets were transmitted over the WAN in this case.
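
For anyone reproducing this: the "+p" setting refers to the kernel's dynamic debug facility, which the WireGuard module uses for its verbose messages. A minimal sketch, assuming root access and a kernel built with dynamic debug and debugfs mounted (not verified on this particular VyOS image):

```
# Turn on verbose logging for the wireguard kernel module
echo 'module wireguard +p' | sudo tee /sys/kernel/debug/dynamic_debug/control

# Follow the kernel log and keep only WireGuard messages
sudo dmesg -wT | grep -i wireguard
```

Writing `module wireguard -p` to the same file turns the extra logging off again.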

Is there something I’m missing to get this functionality working in Vyos?

Version:          VyOS 1.5-rolling-202309280022
Release train:    current


set interfaces wireguard wg0 fwmark '3540'
set interfaces wireguard wg1 fwmark '3550'
set interfaces wireguard wg2 fwmark '3560'

set interfaces ethernet eth0 vrf 'wan_primary'
set interfaces ethernet eth1 vrf 'wan_secondary'
set interfaces wireless wlan0 vrf 'wan_tertiary'

set policy local-route rule 10 fwmark '3540'
set policy local-route rule 10 set table '100'
set policy local-route rule 11 fwmark '3550'
set policy local-route rule 11 set table '101'
set policy local-route rule 12 fwmark '3560'
set policy local-route rule 12 set table '102'

set vrf name wan_primary protocols static route 0.0.0.0/0 blackhole distance '254'
set vrf name wan_primary table '100'
set vrf name wan_secondary protocols static route 0.0.0.0/0 blackhole distance '254'
set vrf name wan_secondary table '101'
set vrf name wan_tertiary protocols static route 0.0.0.0/0 blackhole distance '254'
set vrf name wan_tertiary table '102'

The blackhole routes just prevent WireGuard traffic from escaping if the interface is down.
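
A quick way to check that this configuration produced the expected kernel state is to inspect the policy rules and per-VRF tables directly. The following is only a sketch using standard iproute2 and wg commands, with the table numbers and interface name taken from the config above; what it prints on this system has not been checked here.

```
# Policy rules: confirm each fwmark rule exists and see where it sits
# relative to the l3mdev rule that Linux adds for VRF devices
ip rule show

# Each VRF table should hold the DHCP default route plus the distance-254 blackhole
ip route show table 100
ip route show table 101
ip route show table 102

# Confirm the mark WireGuard applies to its outgoing encrypted packets
sudo wg show wg0 fwmark
```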

Power · October 1, 2023, 9:01pm

Hey ACiD_GRiM,
interesting, I’m attempting a comparable setup. My approach is different, however I encountered the same behavior: Any custom tables get ignored and the main table is used for all tunnels.

Here’s the link to my thread: Two WANs, two wireguards, one datacenter instance, many attempts, no joy

Out of curiosity: What kind of WAN links do you use? Are you in a double-NAT situation?

Power · October 1, 2023, 10:44pm

Could you elaborate more here? Specifically, do you use three different PKIs or do you share keys?

And pardon me for asking again, because it’s not crystal clear from your initial post: could you clarify this important part, please?

Who is sending the initial packet and who does reply and when does the chain break?

Power · October 2, 2023, 12:01am

And not to forget the obvious: Could you post the output of

show firewall

please?

ACiD_GRiM · October 2, 2023, 2:07am

I have 3 different WAN ISPs (Starlink, LTE, WiFi) and I tunnel across the three links to a single router. All three ISP interfaces are behind some kind of NAT, but the remote router’s public IP is directly connected to the internet. In this situation both routers are VyOS; however, the remote end is a recent 1.4 rolling release and this new router is 1.5.
I don’t think the version matters for WireGuard compatibility, since the tunnels connect in the default VRF.

Each tunnel has distinct PSK and pub/priv keys on both ends

      | Sat  wg0------------wg100 \
local | LTE  wg1------------wg101-eth0 | remote
      | WiFi wg2------------wg102 /

Sat = wan_primary
LTE = wan_secondary
WiFi = wan_tertiary

The basic premise isn’t much different from bonding multiple WAN links together in a multipath route. However, because the links have drastically different speeds, I just set BGP local preference to select only one route at a time.
The separate route tables just keep the tunnels pinned to their expected ISP uplinks, so I know wg0 always uses Starlink, even when it’s offline.

On this new VyOS router there is NO firewall yet; however, I have NAT destination rules to the /32 IP of the CPE address which masquerade so I can manage the “modems”. This is no different from OpenWrt.
I’m currently back on the OpenWrt configuration, which runs on the exact same hardware as the new VyOS install. Everything works, and the only explicit difference other than the kernel version (5.10 vs 6.x) is that OpenWrt isn’t using explicit VRFs, just separate routing tables…
I don’t know if VRF is just a fancy name for route tables or if it’s also separate network namespaces too.

In this case, the local router is sending the handshake, and I also see the reply on the correct interface. This suggests to me that the replies received on VRF wan_primary, for example, aren’t passed to the default VRF where the WireGuard kernel object is initiated.
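
One thing that may be worth checking for this hypothesis, though it is not confirmed anywhere in this thread: on Linux, a UDP socket that is not bound to a VRF device only receives packets arriving through a VRF interface when `net.ipv4.udp_l3mdev_accept` is enabled. VyOS exposes this through `set vrf bind-to-all`. A sketch under the assumption that WireGuard's listening socket lives in the default VRF:

```
# Operational mode / shell: 0 means unbound UDP sockets ignore packets arriving via a VRF
sysctl net.ipv4.udp_l3mdev_accept

# Configuration mode: let services in the default VRF work across all VRFs
set vrf bind-to-all
commit
```

Note that `bind-to-all` applies to every unbound TCP and UDP service on the router, not only WireGuard, so it widens what is reachable from the WAN VRFs.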

Apachez · October 2, 2023, 2:16am

It would be helpful for future readers if you could provide config output of “show config commands | strip-private” for the relevant parts.

Regarding VRF, that’s somewhat confusing when it comes to Linux (which VyOS is based on).

What other vendors define as “VRF” (Cisco, Juniper, Arista and so on) is called “netns” (network namespaces) in the Linux world.

VyOS does have netns commands available, but I don’t think it’s fully implemented yet.

VRF in the Linux world is just a different routing table, with no other segmentation/separation of traffic.

For example, where having a “VRF MGMT” in other vendors’ products is a security feature, this security separation is missing when it comes to Linux. You need to use “netns” to get the same level of separation.
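
To see the distinction on a running system, the two mechanisms can be listed separately with standard iproute2 commands. This is just an illustrative sketch; the output depends on the local configuration.

```
# List VRF devices; the detailed output includes the routing table each one is bound to
ip -d link show type vrf

# List network namespaces, which are a separate mechanism from VRF devices
ip netns list
```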