OpenVPN server + VyOS cluster: how to reach all cluster members?

Hi all,
I have a VyOS cluster with 2 members. On this cluster I configured a few OpenVPN tunnels for roadwarriors.

My problem is that both cluster members have routes for the roadwarrior tunnel prefixes, installed by the OpenVPN daemon. This makes it impossible for an OpenVPN client to reach the non-primary member: that member routes its reply packets according to its own routing table entries for the tunnel prefixes, but since the primary is the one actually handling the tunnels, those replies get lost.

To make it clearer:

Router A: 10.0.0.1 (primary - tunnels are currently concentrated here)
Router B: 10.0.0.2 (slave)
OpenVPN tunnel subnet: 10.10.10.0/24

Routing table for Router A:

admin@router-a:~$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued route, r - rejected route

S>* 0.0.0.0/0 [210/0] via 10.0.0.254, eth0, 09:08:15
C>* 10.10.10.0/24 is directly connected, vtun0, 06:51:36

Routing table for Router B:

admin@router-b:~$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued route, r - rejected route

S>* 0.0.0.0/0 [210/0] via 10.0.0.254, eth0, 09:10:04
C>* 10.10.10.0/24 is directly connected, vtun0, 06:53:15

This makes packets coming from a client (let’s say 10.10.10.5) and directed to 10.0.0.2 (Router B, the non-primary cluster member) flow through Router A and then get routed to Router B’s eth0. Router B will then try to reply to 10.10.10.5, but since it has its own route for 10.10.10.0/24 pointing at its local vtun0 interface, the replies never make it back to the OpenVPN client.

Is there a way to make it possible for 10.10.10.5 to reach both members?
Thanks

FYI, for now I solved it by adding SNAT rules on each cluster member, with the opposite member’s address as the rule destination and each OpenVPN subnet as the source.
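
On router-a that’s roughly the following (rule number and addresses are just illustrative; router-b gets the mirror-image rule, translating to 10.0.0.2 for traffic destined to 10.0.0.1):

set nat source rule 100 source address 10.10.10.0/24
set nat source rule 100 destination address 10.0.0.2
set nat source rule 100 outbound-interface eth0
set nat source rule 100 translation address 10.0.0.1

That way router-b sees the client traffic as coming from 10.0.0.1 and sends its replies back to router-a, which still owns the tunnel.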

Wondering if someone might have a more elegant solution to this :thinking:

Why not just put a different address space on OpenVPN on each router, such as 10.10.10.0/25 on A and 10.10.10.128/25 on B? Then add a static route on each router pointing at the other router’s subnet.
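
Roughly something like this on A, reusing your addressing (B gets the mirror image, i.e. 10.10.10.128/25 on vtun0 and a static route for 10.10.10.0/25 via 10.0.0.1):

set interfaces openvpn vtun0 server subnet 10.10.10.0/25
set protocols static route 10.10.10.128/25 next-hop 10.0.0.2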

The other approach is to make the routers active/passive so that traffic only goes to one at a time. Use VRRP with sync groups for that.
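
On releases that use the high-availability syntax, that would look something like this (group name, VRID and address are just placeholders):

set high-availability vrrp group LAN vrid 10
set high-availability vrrp group LAN interface eth0
set high-availability vrrp group LAN virtual-address 10.0.0.100/24
set high-availability vrrp sync-group MAIN member LAN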


@dmadole I also thought about that, but it still seems to me more like a workaround than a proper solution…

When you say you are running a VyOS cluster, are you using “set high-availability vrrp group” or “set cluster group”?

If you are using set cluster group, then add “service openvpn” to your cluster config. This will make the cluster stop and start the openvpn daemon. This should make the kernel route to 10.10.10.0 disappear from the slave when openvpn gets stopped… When using cluster group you should assume that the cluster slave is doing absolutely nothing, so there should be no need for VPN users to be reachable from all the cluster members, right?
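
That’s a single line, with your own group name substituted:

set cluster group CLUSTER-NAME service openvpn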

@sonicbx my cluster is set up via set cluster group and actually looks like this:

cluster {
    dead-interval 1100
    group CLUSTER-NAME {
        auto-failback false
        monitor 1.1.1.1
        monitor 1.0.0.1
        primary router-a
        secondary router-b
        service 10.0.0.100/32/eth0
        service ipsec
        service openvpn
    }
    interface eth0
    keepalive-interval 500
    mcast-group 239.1.0.254
    monitor-dead-interval 1100
    pre-shared-secret ****************
}

However, OpenVPN routes are present on the slave (router-b) even though it’s not active:

admin@router-b:~$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued route, r - rejected route

S>* 0.0.0.0/0 [210/0] via 10.0.0.254, eth0, 6d23h37m
C>* 10.10.10.0/24 is directly connected, vtun0, 6d21h21m

admin@router-b:~$ show cluster status
=== Status report on secondary node router-b ===

  Primary router-a: Active

  Secondary router-b (this node): Active (standby)

  Monitor 1.0.0.1: Reachable
  Monitor 1.1.1.1: Reachable

  Resources [10.0.0.100/32/eth0 ipsec openvpn]:
    Active on primary router-a

So the behavior you are describing when you say

This should make the kernel route to 10.10.10.0 disappear from the slave when openvpn gets stopped

is not what I see on the cluster.
Do you have an explanation for that?

Thanks!

If you can see any routes on vtun0, that means openvpn is up and running for some reason… Check with

sudo ps aux | grep openvpn

try stopping it on the slave with

sudo /etc/init.d/openvpn stop

then check the ps aux output again and confirm no openvpn processes are running. After that, check your route table: you should not see any routes via vtun0, because the interface shouldn’t exist at all.
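
So on the slave, roughly:

sudo /etc/init.d/openvpn stop
sudo ps aux | grep openvpn   # expect nothing but the grep itself
show ip route                # the 10.10.10.0/24 route via vtun0 should be gone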

Next, force a cluster failover. If the cluster daemon does its job nicely, it should start ipsec and openvpn on the new master, like you told it to. At that point the route should appear in the routing table there.

If you do a failback/failover a second time, you should see the openvpn process on the active host.

If this isn’t working, then there’s some sort of bug going on.

I think there’s some kind of bug, because when I force the failover the openvpn daemon stays active on router-a, which is now the slave.

Well, that’s strange; I guess I never noticed it on mine. I have the same setup and I can confirm the openvpn daemon stays running on the slave despite having set cluster group vyos service openvpn configured.

At least on 1.1.6.