Failover performance improvement over Spanning Tree

jmoseby · July 28, 2020, 5:01pm

I’m new to VyOS and evaluating it for an application that requires maximum network availability. If we use L2 across four (or more) routers with a mesh or box interconnect, we would need failover on the order of one second, not 15 to 30 seconds. I realize that 99% of applications don’t need this level of performance.

The VyOS feature list explicitly states that RSTP is not supported. What are the options in VyOS for improving L2 failover performance?

Feature request to include mstpd?
Roll my own with mstpd for RSTP/PVST?
Some other protocol?

As far as I know, none of the fancier protocols (SPB, TRILL, etc.) have a viable Linux code base.

Thanks,
John

Viacheslav · July 28, 2020, 5:46pm

@jmoseby can you share the network diagram with interfaces and IP addresses?

jmoseby · July 28, 2020, 7:40pm

Here is a bridging diagram of a candidate configuration. The goal is to reliably tunnel L2 between datacenters. The service VMs need uninterrupted L2 connectivity, even during virtual router upgrades/reboots and tunnel failures.

jmoseby · July 28, 2020, 8:45pm

Looking at the roadmap, MC-LAG seems to be a promising feature, assuming that you can LAG tunnels together.

Viacheslav · July 29, 2020, 6:23am

I have not found an implementation of MC-LAG on plain Linux, only a few mentions from Cumulus Networks. To implement this, we would need a working Linux example.

Viacheslav · July 31, 2020, 7:37am

Without the configuration it is difficult to say which technology would be best in this case.
Maybe bonding (active/passive or round-robin) or VRRP.

We need to understand what you mean by a tunnel in the L2 topology and to see the configurations.
What does VyOS do here? Only bridging/switching?

jmoseby · July 31, 2020, 2:21pm

The diagram only addresses the layer 2 connectivity. VyOS would also be doing routing, BFD, etc., which is not shown. I don’t expect any issues with that part.

At the moment, I am investigating this alternative L2 configuration, which ties failover time to VRRP instead of STP:

Run VRRP on the WAN interface between the two routers in each data center, then run a single tunnel (VXLAN, for instance) between data centers using the VRRP floating IP as the tunnel source/destination. If this works, there will be only one L2 connection between data centers, so an L2 loop is avoided. The downside so far is that keepalived version 2.0.10 does not support VRRP fast advertisements (< 1 s), so I still haven’t solved my failover time issue. I believe the latest version of keepalived supports sub-second adverts.

RSTP/PVST via mstpd is still an option but also requires new code.

MC-LAG would be great since the tunnels could all form a single LAG between datacenters, but like you, I haven’t found any source code.

So at this point, every possible solution for sub-second failover that I see requires a code change (adding mstpd, or upgrading keepalived).
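
To see why the VRRP advertisement interval matters for the failover time, here is a rough sketch of the VRRPv3 (RFC 5798) master-down timer: a backup waits three advertisement intervals plus a priority-dependent skew before taking over. The function name and the sample priority of 100 are my own assumptions for illustration, and the result covers only VRRP detection, not tunnel re-establishment or MAC relearning on the far side.

```python
def vrrp_master_down_interval(advert_int_s: float, priority: int) -> float:
    """Seconds a VRRPv3 backup waits without adverts before becoming master.

    RFC 5798 timers:
      skew_time            = (256 - priority) * advert_int / 256
      master_down_interval = 3 * advert_int + skew_time
    """
    if advert_int_s <= 0:
        raise ValueError("advertisement interval must be positive")
    if not 1 <= priority <= 254:
        raise ValueError("backup priority must be in 1..254")
    skew_time = (256 - priority) * advert_int_s / 256
    return 3 * advert_int_s + skew_time


if __name__ == "__main__":
    # Hypothetical backup priority of 100; compare 1 s adverts with
    # sub-second adverts.
    for advert in (1.0, 0.5, 0.1):
        down = vrrp_master_down_interval(advert, priority=100)
        print(f"advert_int={advert:.1f}s -> master down after {down:.3f}s")
```

With 1 s advertisements the detection time alone is over 3.6 s at that priority, so sub-second adverts are needed to get close to the one-second target.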