VyOS on Mellanox Spectrum (SNxxxx) switches w/ switchdev

I have a Mellanox SN2010 switch in my hands (4x QSFP28, 18x SFP28, with an Intel C2558 CPU and 8 GB of RAM) that I’d like to use with VyOS. If you’re not familiar with Mellanox’s SNxxxx switches, they’re mostly controlled by the mlxsw module in the stock kernel, which has been part of VyOS for a while. Out of the box, it’ll boot recent VyOS nightly builds, update the switch firmware if needed, discover all of the ports, and mostly work right.

Thanks to switchdev, there shouldn’t be any real support needed for L2/L3 configs in VyOS. The mlxsw module copies generic Linux L2 and L3 configs into the switch ASIC and then propagates hardware counters back into the kernel. At least in theory, it should be perfectly possible to set up multiple bridges, VLANs, routed interfaces, and even VXLAN via VyOS’s UI and get hardware offloading with no changes to VyOS at all.

The reason that I’m posting this under “development” is that there are a handful of little things that need to change to actually get a good out-of-the-box experience with these switches (or really any switchdev switch, which includes some OpenWRT-ish ARM systems as well now). I’m willing to do most of the work on these, but I figure it’s better to discuss them here before I start in and get patches rejected.

Issues:

  • Naming the switch ports as eth0 through ethN isn’t great when (a) the physical port has a label on it and (b) the kernel knows what the label says for each port. It’s trivial to rename the interfaces via udev rules, but that rapidly runs afoul of VyOS’s name validators (here and elsewhere). For now, I’m renaming my ports as en$attr{phys_port_name}; since phys_port_name is p1 through pN, that produces enp1 through enpN interfaces, which VyOS allows, but it doesn’t really feel great, even if it is more or less what enp interface names are supposed to mean.
  • Splitting ports (QSFP→4x SFP) is more difficult. First, it really breaks port name validators, because splitting a QSFP named enp1 4 ways will produce enp1s0 through enp1s3. Next, there’s no real mechanism that I can see in VyOS for managing port splits like this. The kernel has generic support via devlink port split (manpage), but I’m not sure what the right config in VyOS would even look like for that. Logically, you’d want something like interface ethernet enp1 / port split 4, but then that would make the kernel remove the enp1 interface and replace it with enp1s0 and friends. So then there wouldn’t actually be an enp1 interface, but its config would be critical to the system’s operation. That feels wrong.
  • There are also a number of generic features that the switch supports but VyOS doesn’t today, like PTP. From what I can see, adding linuxptp and configuring it just like any other PTP-supporting interface should work.
  • Finally, there are minor issues around fan control and environment support, but these are all generic Linux issues. Manually creating /etc/fancontrol and running systemctl enable fancontrol handles most of them, but it’d be nice to have a UI for it.

So, I have a few questions:

  1. Does VyOS (in principle, at least) have an interface naming scheme that works for switch ports other than ethN? Trying to map eth21 into a specific physical port on the front is much more difficult than having an interface name that includes the port name that’s silk-screened onto the switch. For people using these switches with generic Debian, I’ve seen people use swpN, while I’ve been using enpN. The mlxsw driver provides a name that starts with p today, but that’s probably not guaranteed with other switchdev devices. If I added sw[0-9a-z]+ to the validation list and a single udev rule to map mlxsw_spectrum devices into sw*, would that be acceptable in principle?
  2. Does anyone have a suggestion on how devlink port split should be configured? IIRC the same interface should work for mlxsw, Intel E8xx-family NICs, and probably Mellanox ConnectX-8+ NICs. The issue (as I see it) is that when splitting an interface then that interface name will vanish, to be replaced by 2 (or more) replacement names. It looks like Juniper mostly puts this sort of config into chassis fpc <slot>, so that might be a precedent for putting it somewhere other than interface.
  3. Does anyone have any objections in principle to me adding config support for fancontrol?
  4. Does anyone have any issues with PTP? I’d need to pull in the Debian linuxptp package and figure out how to best map it into interface and elsewhere. The biggest issue with PTP is that the underlying config varies depending on which ports are involved and if they share a PTP Hardware Clock or not. Some multi-port NICs share a single PHC across all ports, while others have a PHC per port, so getting this right may be a bit tricky.
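To make question 1 a bit more concrete, here is a minimal Python sketch of what a sw[0-9a-z]+ check could look like. It is not VyOS’s actual validator; the function name is hypothetical, and the only hard fact baked in is the kernel’s 15-character limit on interface names (IFNAMSIZ minus the trailing NUL).

```python
import re

# Hypothetical pattern: the "sw[0-9a-z]+" idea from question 1.
# This is an illustration, not the validator that VyOS ships.
SWITCH_PORT_RE = re.compile(r"sw[0-9a-z]+")

# Linux interface names are limited to IFNAMSIZ - 1 = 15 characters.
MAX_IFNAME_LEN = 15


def is_valid_switch_port(name: str) -> bool:
    """Return True if the whole name matches sw[0-9a-z]+ and fits the kernel limit."""
    if len(name) > MAX_IFNAME_LEN:
        return False
    return SWITCH_PORT_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("sw1", "swp22s0", "eth0", "sw1/1", "sw" + "0" * 14):
        print(candidate, is_valid_switch_port(candidate))
```

Using fullmatch rather than match means a name like sw1/1 is rejected instead of being accepted on its sw1 prefix.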

When it comes to renaming, I would prefer something like LAN1, LAN2 etc. And for split ports, LAN28_1, LAN28_2 etc.

Other than that, it would probably be nice to have the ability to either define the prefix globally (one person prefers swpX, another prefers LANx) or make it possible to rename them one by one.

I would for example assume that these whiteboxes might come with an interface labeled MGMT on the chassis, and for that I would prefer if that ended up as MGMT and not LAN14 or such.

The naming currently seems to be done in vyos-1x/src/udev/vyos_net_name (in the vyos/vyos-1x repository on GitHub).

We can control the prefix (sw in my example), but the per-port naming is up to the switchdev driver. I guess it’s technically possible to write something that takes the names from switchdev and applies some logic to them, changing Mellanox’s 1s0 split names to ${prefix}1_0 or similar, but udev doesn’t make that particularly easy.
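As a rough illustration of that kind of rewriting logic, here is a Python sketch that turns a phys_port_name into a prefixed name. It assumes the mlxsw-style values seen in this thread (p1 for an unsplit port and, judging by the enp22s0 netdev name, p22s0 for a split one); the function name and the underscore separator are made up for the example, and other drivers may use different patterns.

```python
import re
from typing import Optional

# Assumed phys_port_name shapes: "p<port>" or "p<port>s<subport>".
# Anything else is left for the caller to handle.
PHYS_PORT_RE = re.compile(r"p(?P<port>\d+)(?:s(?P<sub>\d+))?")


def port_name(phys_port_name: str, prefix: str = "sw") -> Optional[str]:
    """Map e.g. "p1" -> "sw1" and "p1s0" -> "sw1_0"; return None if unrecognized."""
    m = PHYS_PORT_RE.fullmatch(phys_port_name)
    if m is None:
        return None
    name = f"{prefix}{m.group('port')}"
    if m.group("sub") is not None:
        name += f"_{m.group('sub')}"
    return name


if __name__ == "__main__":
    for value in ("p1", "p22s0", "p22s1", "eth0"):
        print(value, "->", port_name(value))
```

Returning None for unknown patterns leaves room for a fallback, such as keeping whatever name the driver suggested.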

Also, it looks like kernel.org’s switchdev docs recommend that drivers use swXpYsZ names (X for multiple switches, Y for the port number, and Z for sub-ports where applicable), but it’s under the driver’s control. I have a Banana Pi R4 here, with a 4-port MediaTek switchdev switch, and it sets phys_port_name to p1 through p4. Convention for the BPi R4 seems to be LAN1/LAN2/LAN3/WAN, but those names aren’t exposed anywhere that I can find in /sys/. So users could presumably rename devices on their own, but it’d be very hard for VyOS to know out of the box that that specific device had that naming convention.

I don’t know if they use switchdev behind the scenes, but on an Arista box (as an example), no matter what the hardware model is, the interfaces are named:

Ethernet1
Ethernet2

Ethernet24
Ethernet25/1
Ethernet25/2
Ethernet25/3
Ethernet25/4
Management1

Personally I have some deep hate against the “systemd” syntax of naming interfaces.

Sure, on paper it’s a way to name an interface based on its physical location on the motherboard, but doing something like this is about the first thing I do to get sane interface names instead of en0p4s4:

Arista is a whole different thing, as is SONiC. I don’t know how Arista works under the hood, but SONiC has a whole complicated adaptation layer that maps between switch devices, the kernel, and the routing config in the UI. IIRC parts of it flow through redis, and it’s spread over multiple docker containers. That’s probably not a model that we’d want to emulate in any way.

Also, Arista and SONiC each ship with a definition for each supported switch type that lists device names and other parameters, and they have a way to probe at startup to ID the switch type and figure out which naming table to use. We don’t really have any of that data. It wouldn’t be impossible to build something, but it’d always lag behind actual hardware and there’d be a lot of heuristics involved.

I agree that Arista’s naming is superior to anything that we can do trivially. I’m particularly fond of the way that it handles splittable interfaces; if the first Ethernet interface is splittable (but is unsplit at the moment), then it’ll be named Ethernet1/1. If it’s not splittable, then it’ll be Ethernet1 without the /1. If you go ahead and split it 4 ways, then you’ll gain Ethernet1/2 through Ethernet1/4, but Ethernet1/1 won’t vanish. That makes it much easier to put split configs into the interface Ethernet1/1 part of config space because the name never changes.

I just don’t think it’s necessarily worth the effort to rewrite interface names that deeply. Also, / isn’t a valid character in Linux interface names, so we can’t copy the naming pattern as-is.

Looking at my two switchdev systems (the SN2010 and a Banana Pi R4), devlink port clearly shows that none of the interfaces on the BPi are splittable (splittable false for each), while all of the interfaces on the SN2010 are splittable, even the single-lane SFP28 ports. Here are a few examples:

$ devlink port
...
pci/0000:01:00.0/3: type eth netdev enp17 flavour physical port 17 splittable false lanes 1
pci/0000:01:00.0/4: type eth netdev enp18 flavour physical port 18 splittable false lanes 1
pci/0000:01:00.0/9: type eth netdev enp19 flavour physical port 19 splittable true lanes 4
pci/0000:01:00.0/5: type eth netdev enp20 flavour physical port 20 splittable true lanes 4
...

If we wanted to go down this road (and I’m kind of against it, given the expected number of users), then presumably we could rename splittable true lanes >1 as something like Ethernet${port}_1 and splittable false and splittable true lanes 1 as Ethernet${port}.
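A sketch of that rule in Python, applied to the devlink port text quoted above. The regular expression and function names are illustrative only, the parser covers just the fields that appear in those sample lines, and ports that have already been split are deliberately out of scope.

```python
import re
from typing import Dict

# Sample lines copied from the devlink port output quoted above.
SAMPLE = """\
pci/0000:01:00.0/3: type eth netdev enp17 flavour physical port 17 splittable false lanes 1
pci/0000:01:00.0/4: type eth netdev enp18 flavour physical port 18 splittable false lanes 1
pci/0000:01:00.0/9: type eth netdev enp19 flavour physical port 19 splittable true lanes 4
pci/0000:01:00.0/5: type eth netdev enp20 flavour physical port 20 splittable true lanes 4
"""

# Picks out only the fields the naming rule needs.
LINE_RE = re.compile(
    r"netdev (?P<netdev>\S+) .*?\bport (?P<port>\d+) .*?"
    r"splittable (?P<splittable>true|false) lanes (?P<lanes>\d+)"
)


def proposed_name(port: int, splittable: bool, lanes: int,
                  prefix: str = "Ethernet") -> str:
    """Splittable multi-lane ports get a _1 suffix; everything else does not."""
    if splittable and lanes > 1:
        return f"{prefix}{port}_1"
    return f"{prefix}{port}"


def rename_map(devlink_output: str) -> Dict[str, str]:
    """Map current netdev names to proposed names, skipping unparseable lines."""
    result = {}
    for line in devlink_output.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue
        result[m.group("netdev")] = proposed_name(
            int(m.group("port")),
            m.group("splittable") == "true",
            int(m.group("lanes")),
        )
    return result


if __name__ == "__main__":
    for old, new in rename_map(SAMPLE).items():
        print(f"{old} -> {new}")
```

With the sample lines, enp17 and enp18 map to Ethernet17 and Ethernet18, while enp19 and enp20 map to Ethernet19_1 and Ethernet20_1.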

FWIW, after running devlink port split pci/0000:01:00.0/13 count 2, then devlink port shows this:

pci/0000:01:00.0/13: type eth netdev enp22s0 flavour physical port 22 split_group 22 splittable false lanes 2
pci/0000:01:00.0/14: type eth netdev enp22s1 flavour physical port 22 split_group 22 splittable false lanes 2

So that’s what split interfaces look like. As an aside, notice that the PCI addresses and port numbers are not in the same order – enp19 and enp20 are inverted here. And (not shown here) enp1 is actually the second-highest bus address, so it’d probably be assigned eth21 out of the box without udev’s help. If we stuck with ethX naming and assigned names in PCI bus order, we’d end up making a total mess of things, so at the very minimum we really do need something in udev’s config to remap the names in order.
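To show the ordering problem in miniature, here is a small Python sketch using the four ports from the earlier devlink port output. It treats the devlink port index as a stand-in for enumeration order, which is an assumption on my part; the point is only that counting in that order and counting in front-panel order give different ethN assignments.

```python
from typing import Dict, List, Tuple

# (devlink port index, netdev, front-panel port) from the earlier output.
PORTS: List[Tuple[int, str, int]] = [
    (3, "enp17", 17),
    (4, "enp18", 18),
    (9, "enp19", 19),
    (5, "enp20", 20),
]


def eth_names(ports: List[Tuple[int, str, int]], key_field: int) -> Dict[str, str]:
    """Assign eth0..ethN after sorting on the chosen tuple field."""
    ordered = sorted(ports, key=lambda entry: entry[key_field])
    return {netdev: f"eth{i}" for i, (_, netdev, _) in enumerate(ordered)}


if __name__ == "__main__":
    by_index = eth_names(PORTS, key_field=0)  # assumed enumeration order
    by_port = eth_names(PORTS, key_field=2)   # front-panel order
    for _, netdev, port in PORTS:
        print(f"port {port}: by index {by_index[netdev]}, by port {by_port[netdev]}")
```

Ports 19 and 20 swap places between the two schemes, which is the inversion described above.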

Yeah, I have nothing against Ethernet28_1 if Ethernet28/1 isn’t possible.

And the port order is why I prefer to rename the interfaces manually (on a regular Linux box such as Proxmox), because having en0p4s4 really doesn’t tell me anything when I stare at the rear of the server and it has 12 interfaces.

I have seen all kinds of ordering: top to bottom and bottom to top, as well as both left to right and right to left, and sometimes you have something like ports 5-8 to the left and 1-4 to the right (yes - wtf!).

The only thing that really works for me is to give the interfaces custom names: either what’s already written on the chassis, or just call it a day and go from ETH0 up to however many there are. Then I do it left to right, and when hitting a PCIe cage, the upper left is the first and the bottom right is the last.

We also have the likes of Mikrotik, who have their first port in the bottom left (wtf!?).

As can be seen on these pictures:

This article by Pim seems to be very apropos: IPng Networks - Debian on Mellanox SN2700 (32x100G)

I personally like the udev rule that he proposes, and plan to emulate.


Yeah, Pim’s article is kind of what started this for me.