VyOS on Mellanox Spectrum (SNxxxx) switches w/ switchdev

I have a Mellanox SN2010 switch in my hands (4x QSFP28, 18x SFP28, with an Intel C2558 CPU and 8 GB of RAM) that I’d like to use with VyOS. If you’re not familiar with Mellanox’s SNxxxx switches, they’re mostly controlled by the mlxsw module in the stock kernel, which has been part of VyOS for a while. Out of the box, it’ll boot recent VyOS nightly builds, update the switch firmware if needed, discover all of the ports, and mostly work right.

Thanks to switchdev, there shouldn’t be any real support needed for L2/L3 configs in VyOS. The mlxsw module copies generic Linux L2 and L3 configs into the switch ASIC and then propagates hardware counters back into the kernel. At least in theory, it should be perfectly possible to set up multiple bridges, VLANs, routed interfaces, and even VxLAN via VyOS’s UI and get hardware offloading with no changes to VyOS at all.

The reason that I’m posting this under “development” is that there are a handful of little things that need to change to actually get a good out-of-the-box experience with these switches (or really any switchdev switch, which includes some OpenWRT-ish ARM systems as well now). I’m willing to do most of the work on these, but I figure it’s better to discuss them here before I start in and get patches rejected.

Issues:

  • Naming the switch ports as eth0 through ethN isn’t great when (a) the physical port has a label on it and (b) the kernel knows what the label says for each port. It’s trivial to rename the interfaces via udev rules, but that rapidly runs afoul of VyOS’s name validators (here and elsewhere). For now, I’m renaming my ports as en$attr{phys_port_name}; since phys_port_name is p1 through pN, that produces enp1 through enpN interfaces, which VyOS allows, but it doesn’t really feel great, even if it is more or less what enp interface names are supposed to mean.
  • Splitting ports (QSFP→4x SFP) is more difficult. First, it really breaks port name validators, because splitting a QSFP named enp1 4 ways will produce enp1s0 through enp1s3. Next, there’s no real mechanism that I can see in VyOS for managing port splits like this. The kernel has generic support via devlink port split (manpage), but I’m not sure what the right config in VyOS would even look like for that. Logically, you’d want something like interface ethernet enp1 / port split 4, but then that would make the kernel remove the enp1 interface and replace it with enp1s0 and friends. So then there wouldn’t actually be an enp1 interface, but its config would be critical to the system’s operation. That feels wrong.
  • There are also a number of generic features that the switch supports but VyOS doesn’t today, like PTP. From what I can see, adding linuxptp and configuring it just like any other PTP-supporting interface should work.
  • Finally, there are minor issues around fan control and environment support, but these are all generic Linux issues. Manually creating /etc/fancontrol and running systemctl enable fancontrol handles most of them, but it’d be nice to have a UI for it.

So, I have a few questions:

  1. Does VyOS (in principle, at least) have an interface naming scheme that works for switch ports other than ethN? Trying to map eth21 into a specific physical port on the front is much more difficult than having an interface name that includes the port name that’s silk-screened onto the switch. For people using these switches with generic Debian, I’ve seen people use swpN, while I’ve been using enpN. The mlxsw driver provides a name that starts with p today, but that’s probably not guaranteed with other switchdev devices. If I added sw[0-9a-z]+ to the validation list and a single udev rule to map mlxsw_spectrum devices into sw*, would that be acceptable in principle?
  2. Does anyone have a suggestion on how devlink port split should be configured? IIRC the same interface should work for mlxsw, Intel E8xx-family NICs, and probably Mellanox ConnectX-8+ NICs. The issue (as I see it) is that when splitting an interface then that interface name will vanish, to be replaced by 2 (or more) replacement names. It looks like Juniper mostly puts this sort of config into chassis fpc <slot>, so that might be a precedent for putting it somewhere other than interface.
  3. Does anyone have any objections in principle to me adding config support for fancontrol?
  4. Does anyone have any issues with PTP? I’d need to pull in the Debian linuxptp package and figure out how to best map it into interface and elsewhere. The biggest issue with PTP is that the underlying config varies depending on which ports are involved and if they share a PTP Hardware Clock or not. Some multi-port NICs share a single PHC across all ports, while others have a PHC per port, so getting this right may be a bit tricky.
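To make the PHC question concrete, here is a rough Python sketch of how a config generator might work out which ports share a PTP Hardware Clock. It assumes the "PTP Hardware Clock:" line that ethtool -T prints for an interface; the interface names and sample outputs are hypothetical, and the helper names (phc_index, group_by_phc) are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the clock line in `ethtool -T <ifname>` output, e.g.
# "PTP Hardware Clock: 0" or "PTP Hardware Clock: none" (assumed format).
PHC_LINE = re.compile(r"^PTP Hardware Clock:\s*(\S+)", re.MULTILINE)


def phc_index(ethtool_output):
    """Return the PHC index as an int, or None if no PHC is reported."""
    m = PHC_LINE.search(ethtool_output)
    if m is None or not m.group(1).isdigit():
        return None
    return int(m.group(1))


def group_by_phc(outputs):
    """Map PHC index -> sorted list of interfaces that share that clock."""
    groups = defaultdict(list)
    for ifname, text in outputs.items():
        idx = phc_index(text)
        if idx is not None:
            groups[idx].append(ifname)
    return {idx: sorted(names) for idx, names in groups.items()}


# Hypothetical sample: two ports on one PHC, one port with its own PHC,
# and one port without hardware timestamping.
SAMPLE = {
    "eth0": "Time stamping parameters for eth0:\nPTP Hardware Clock: 0\n",
    "eth1": "Time stamping parameters for eth1:\nPTP Hardware Clock: 0\n",
    "eth2": "Time stamping parameters for eth2:\nPTP Hardware Clock: 1\n",
    "eth3": "Time stamping parameters for eth3:\nPTP Hardware Clock: none\n",
}

print(group_by_phc(SAMPLE))
```

Each group would then presumably map onto one clock in the generated linuxptp config, but how that should look in VyOS's config tree is still the open question.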

When it comes to renaming, I would prefer something like LAN1, LAN2, etc., and for split ports LAN28_1, LAN28_2, etc.

Other than that, it would probably be nice to be able either to define the prefix globally (one person prefers swpX, another prefers LANx) or to rename the ports one by one.

I would, for example, assume that these whiteboxes might come with an interface labeled MGMT on the chassis, and I would prefer that to end up as MGMT and not LAN14 or such.

The naming currently seems to be done in vyos-1x/src/udev/vyos_net_name (on the current branch of vyos/vyos-1x on GitHub).

We can control the prefix (sw in my example), but the per-port naming is up to the switchdev driver. I guess it’s technically possible to write something that takes the names from switchdev and applies some logic to them, changing Mellanox’s 1s0 split names to ${prefix}1_0 or similar, but udev doesn’t make that particularly easy.

Also, it looks like kernel.org’s switchdev docs recommend that drivers use swXpYsZ names (X for multiple switches, Y for the port number, and Z for sub-ports where applicable), but it’s under the driver’s control. I have a Banana Pi R4 here, with a 4-port MediaTek switchdev switch, and it sets phys_port_name to p1 through p4. Convention for the BPi R4 seems to be LAN1/LAN2/LAN3/WAN, but those names aren’t exposed anywhere that I can find in /sys/. So users could presumably rename devices on their own, but it’d be very hard for VyOS to know out of the box that that specific device had that naming convention.
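For illustration, the "apply some logic to the switchdev names" step could look roughly like the Python below, which is easier to do in a helper script than in a bare udev rule. It assumes phys_port_name follows the pY / pYsZ / swXpYsZ shapes discussed here; the function name and the default sw prefix are hypothetical, and anything that doesn't match is left for other rules to handle.

```python
import re

# pY or pYsZ, optionally prefixed by swX, following the swXpYsZ convention.
PORT_NAME = re.compile(
    r"^(?:sw(?P<switch>\d+))?p(?P<port>\d+)(?:s(?P<sub>\d+))?$"
)


def friendly_name(phys_port_name, prefix="sw"):
    """Turn 'p1' into 'sw1' and 'p1s0' into 'sw1_0'; None if unrecognized.

    The switch index (X in swXpYsZ) is parsed but ignored in this sketch.
    """
    m = PORT_NAME.match(phys_port_name)
    if m is None:
        return None
    name = f"{prefix}{m['port']}"
    if m["sub"] is not None:
        name += f"_{m['sub']}"
    return name


for raw in ("p1", "p22s1", "sw0p3s2", "wan"):
    print(raw, "->", friendly_name(raw))
```

A port like the BPi R4's WAN would fall through here (None), which is exactly the case where VyOS has no data to go on.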

I don’t know if they use switchdev behind the scenes, but on an Arista box (as an example) the interfaces are named like this, no matter what the hardware model is:

Ethernet1
Ethernet2

Ethernet24
Ethernet25/1
Ethernet25/2
Ethernet25/3
Ethernet25/4
Management1

Personally I have some deep hate against the “systemd” syntax of naming interfaces.

Sure, it’s (on paper) a way to name an interface based on its physical location on the motherboard, but renaming is about the first thing I do to get sane interface names instead of en0p4s4.

Arista is a whole different thing, as is SONiC. I don’t know how Arista works under the hood, but SONiC has a whole complicated adaptation layer that maps between switch devices, the kernel, and the routing config in the UI. IIRC parts of it flow through redis, and it’s spread over multiple docker containers. That’s probably not a model that we’d want to emulate in any way.

Also, Arista and SONiC each ship with a definition for each supported switch type that lists device names and other parameters, and they have a way to probe at startup to ID the switch type and figure out which naming table to use. We don’t really have any of that data. It wouldn’t be impossible to build something, but it’d always lag behind actual hardware and there’d be a lot of heuristics involved.

I agree that Arista’s naming is superior to anything that we can do trivially. I’m particularly fond of the way that it handles splittable interfaces; if the first Ethernet interface is splittable (but is unsplit at the moment), then it’ll be named Ethernet1/1. If it’s not splittable, then it’ll be Ethernet1 without the /1. If you go ahead and split it 4 ways, then you’ll gain Ethernet1/2 through Ethernet1/4, but Ethernet1/1 won’t vanish. That makes it much easier to put split configs into the interface Ethernet1/1 part of config space because the name never changes.

I just don’t think it’s necessarily worth the effort to rewrite interface names that deeply. Also, / isn’t a valid character in Linux interface names, so we can’t copy the naming pattern as-is.

Looking at my two switchdev systems (the SN2010 and a Banana Pi R4), devlink port clearly shows that none of the interfaces on the BPi are splittable (splittable false for each), while all of the interfaces on the SN2010 are splittable, even the single-lane SFP28 ports. Here are a few examples:

$ devlink port
...
pci/0000:01:00.0/3: type eth netdev enp17 flavour physical port 17 splittable false lanes 1
pci/0000:01:00.0/4: type eth netdev enp18 flavour physical port 18 splittable false lanes 1
pci/0000:01:00.0/9: type eth netdev enp19 flavour physical port 19 splittable true lanes 4
pci/0000:01:00.0/5: type eth netdev enp20 flavour physical port 20 splittable true lanes 4
...

If we wanted to go down this road (and I’m kind of against it, given the expected number of users), then presumably we could rename splittable true lanes >1 as something like Ethernet${port}_1 and splittable false and splittable true lanes 1 as Ethernet${port}.
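As a rough sketch of that rule in Python, assuming the plain-text devlink port format shown above (a handle followed by flat key/value pairs): the function names are made up for illustration, and the sample lines are copied from the output above. devlink also has a JSON mode (-j), which would probably be sturdier than text parsing in a real implementation.

```python
def parse_devlink_port(line):
    """Split one line of `devlink port` text output into a dict of fields."""
    handle, _, rest = line.partition(": ")
    tokens = rest.split()
    # After the handle, the output is a flat run of "key value" pairs.
    fields = dict(zip(tokens[0::2], tokens[1::2]))
    fields["handle"] = handle
    return fields


def proposed_name(fields, prefix="Ethernet"):
    """Ethernet<port>_1 for multi-lane splittable ports, else Ethernet<port>."""
    if "split_group" in fields:
        # Already-split ports need a sub-port index, which this text
        # output does not carry; leave them to the caller.
        return None
    lanes = int(fields.get("lanes", "1"))
    base = f"{prefix}{fields['port']}"
    if fields.get("splittable") == "true" and lanes > 1:
        return f"{base}_1"
    return base


SAMPLE = [
    "pci/0000:01:00.0/3: type eth netdev enp17 flavour physical "
    "port 17 splittable false lanes 1",
    "pci/0000:01:00.0/9: type eth netdev enp19 flavour physical "
    "port 19 splittable true lanes 4",
]

for line in SAMPLE:
    fields = parse_devlink_port(line)
    print(fields["netdev"], "->", proposed_name(fields))
```

Both results stay under the kernel's 15-character interface name limit, though the Ethernet prefix doesn't leave much room for three-digit ports with sub-port suffixes.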

FWIW, after running devlink port split pci/0000:01:00.0/13 count 2, devlink port shows this:

pci/0000:01:00.0/13: type eth netdev enp22s0 flavour physical port 22 split_group 22 splittable false lanes 2
pci/0000:01:00.0/14: type eth netdev enp22s1 flavour physical port 22 split_group 22 splittable false lanes 2

So that’s what split interfaces look like. As an aside, notice that the PCI addresses and port numbers are not in the same order – enp19 and enp20 are inverted here. And (not shown here) enp1 is actually the second-highest bus address, so it’d probably be assigned eth21 out of the box without udev’s help. So if we stuck with ethX naming and assigned names in PCI bus order, then we’d end up making a total mess of things. So we really do need something in udev’s config to remap the names in order at the very minimum.

Yeah, I have nothing against Ethernet28_1 if Ethernet28/1 isn’t possible.

And the port order is why I prefer to rename the interfaces manually (on a regular Linux box such as Proxmox), because en0p4s4 really doesn’t tell me anything when I stare at the rear of a server that has 12 interfaces.

I have seen all kinds of ordering: top to bottom and bottom to top, as well as left to right and right to left, and sometimes you have something like ports 5-8 on the left and 1-4 on the right (yes - wtf!).

The only thing that really works for me is to give the interfaces custom names, either what’s already written on the chassis or just ETH0 up to however many there are. Then I number them left to right, and when I hit a PCIe cage, the upper left port is the first and the bottom right is the last.

We also have vendors like Mikrotik who put the first port in the bottom left (wtf!?).

As can be seen on these pictures:

This article by Pim seems to be very apropos: IPng Networks - Debian on Mellanox SN2700 (32x100G)

I personally like the udev rule that he proposes, and plan to emulate.


Yeah, Pim’s article is kind of what started this for me.

Having VyOS on a switch would be kind of nice, yes. :slight_smile:

While troubleshooting some issues with the interface named “wan” on my first VyOS build for the BPI-R4 this week together with some other people on the Banana Pi forum, I came across the ticket below regarding different interface names than the currently supported ones.

These Banana Pi development boards might be a bit of a special case since they don’t come pre-programmed with MAC addresses. You need to write a base MAC to the EEPROM yourself, or you need to hardcode one in the device tree to avoid having the MAC addresses changing on every boot.

Ahh, interesting. I have a Banana Pi R4 that I’d also like VyOS running on, so (a) it’s nice to see someone else doing the work and (b) it’s good to see that the same issue comes up.

I haven’t looked yet, does VyOS have a single regex for Ethernet interface names, or are there 2 or 3 slightly different flavors in different layers of the system? In any case, it’d be good to get sw, lan, and wan interfaces all added at the same time.
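As a strawman for what a combined check might look like, here is a small Python sketch. The pattern is hypothetical and is not VyOS's actual validator; it just puts eth, sw, lan, and wan names behind one regex plus the kernel's 15-character limit on interface names.

```python
import re

# Hypothetical combined pattern; VyOS's real validators may differ and
# may live in more than one place.
ETHERNET_NAME = re.compile(r"(?:eth\d+|sw[0-9a-z]+|lan\d+|wan\d*)")

# Linux interface names are limited to 15 characters (IFNAMSIZ minus NUL).
MAX_IFNAME_LEN = 15


def is_valid_ethernet_name(name):
    """True if the name fits the length limit and the sketch pattern."""
    if len(name) > MAX_IFNAME_LEN:
        return False
    return ETHERNET_NAME.fullmatch(name) is not None


for candidate in ("eth0", "sw22s1", "lan1", "wan", "Ethernet1/1"):
    print(candidate, is_valid_ethernet_name(candidate))
```

Note that sw[0-9a-z]+ as written would reject names with an underscore, such as sw22_1, so the character class would need widening if that style wins out.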

Okay, I actually read the ticket, and the pushback was interesting. I don’t entirely disagree, but there’s a tension here on devices that have silk-screened device names on the chassis. If the chassis says “LAN1”, then calling it “eth4” in software leads to a bad user experience.

On something like a Mellanox switch, with potentially over 100 interfaces, unstructured ethX interfaces go from “bad user experience” to “unusable” pretty quickly. Especially since (IIRC) the enumeration order on at least the SN2xxx switches doesn’t match the physical port order on the front panel.

As a start, using the altname kernel property to assign a secondary name to interfaces might be useful, although it probably wouldn’t appear anywhere in VyOS’s native UI. At least ip would work, and it’d make the names documented and visible someplace, which would be a start. I see that systemd supports altname (via AlternativeName in .link files); presumably udev does as well?
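For a quick experiment outside of udev, iproute2 can attach altnames directly with ip link property add. The Python below is a minimal sketch that only builds the commands from a label table; the eth4/lan1 mapping is hypothetical, and actually running the commands needs root.

```python
def altname_commands(labels):
    """Build one `ip link property add` argv per (device, label) pair."""
    return [
        ["ip", "link", "property", "add", "dev", dev, "altname", alt]
        for dev, alt in sorted(labels.items())
    ]


# Hypothetical mapping from kernel names to chassis labels.
LABELS = {"eth4": "lan1", "eth5": "lan2"}

for argv in altname_commands(LABELS):
    # To apply for real: subprocess.run(argv, check=True)
    print(" ".join(argv))
```

Once set, the altname can be used anywhere ip takes a device name, for example ip link show dev lan1.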


I assume unofficial builds shouldn’t be discussed here, but you can find a thread on the Banana Pi forum about it. I can confirm that hardware offloading works when configured properly with the right kernel. Since yesterday, it’s acting as my main router. :slightly_smiling_face:

These builds hardcode the MAC addresses in the device tree so that you don’t need to program the EEPROM. If you do, the EEPROM MACs will take precedence over the ones in the device tree for the SFP+ interfaces as well as the internal link to the switch (it shows up as an interface that you’ll just have to ignore).

  1. It also renames the interfaces connected to the switch. It’s technically not needed for lan1, lan2 and lan3, but I can personally confirm that it is for wan. :smile:
  2. You will also have to manually add the interfaces to the configuration as it contains a default configuration override to make it possible to access it over SSH, instead of requiring you to use the serial console. I assume that’s why VyOS doesn’t add the interfaces automatically the way it normally does.
  3. It boots using u-boot, not grub, so the update process is slightly different. Not that bad though. :slight_smile:
  4. No port LEDs with the 6.18 kernel.

Other than those minor quirks, it seems to be working well so far. I only use the SFP+ slots, but the guy who makes them and documented the build process seems to use only the RJ45 ports, so they all seem to work fine.

Yes, it would be preferable to have interface names matching the labels on the case, especially on devices with lots of ports.

Yeah, enumeration order isn’t necessarily intuitive unless you know how they are connected hardware wise. My Qotom C3758 mini PC is logical in terms of the 4 SFP+ slots, but very unintuitive for the 5 RJ45 ports. I ended up having to document it into the interface description in my config and constantly refer to it to double check. This is clearly not feasible for switches.

If you don’t get intuitive names in the VyOS CLI, it seems to me that it would still be very cumbersome.

I know you’re more referencing conf_mode and not op_mode, but in 1.4.4, rolling, and probably stream (I haven’t checked), there is the show interfaces kernel op_mode command that can be used to show all interfaces on the host; not just those VyOS knows how to configure. You can do something like this to help identify interfaces:

show interfaces kernel detail | match "(Interface|Device|Alternate Names|^$)"

 Interface               | eth0
 Device                  | Mellanox Technologies MT27500 Family [ConnectX-3] [15b3:1003]
 Alternate Names         | enp4s0d1

 Interface               | eth1
 Device                  | Intel Corporation Ethernet Controller I225-V [8086:15f3] (rev 03)
 Alternate Names         | enp1s0

 Interface               | eth2
 Device                  | Intel Corporation Ethernet Controller I225-V [8086:15f3] (rev 03)
 Alternate Names         | enp3s0

 Interface               | eth3
 Device                  | Intel Corporation Ethernet Controller I225-V [8086:15f3] (rev 03)
 Alternate Names         | enp2s0

 Interface               | eth4
 Device                  | Mellanox Technologies MT27500 Family [ConnectX-3] [15b3:1003]
 Alternate Names         | enp4s0

 Interface               | wlan0
 Device                  | Intel Corporation Wi-Fi 6 AX201 160MHz [8086:4df0] (rev 01)
 Alternate Names         | wlo1

The interface naming in that output does kind of underline how indiscriminate naming can cause issues, since my 2 SFP ports on the same Mellanox controller should logically be adjacent, but they are eth0 and eth4.

There’s probably half a dozen touch points for interface naming.

There’s probably more here and there. I think it could be useful to allow configuring from aliases. For instance, if I did something like this:

set interfaces ethernet enp1s0 address 10.1.2.3/24

Then the checks would simply check if that’s a valid alias for an interface using something like this:

ip -j link show dev enp1s0 | jq '.[].ifname'
"eth1"

When running the conf_mode script, it would just rename the interface to be configured as "eth1".

You’d need to wrap the command-time regex constraint in a bash or Python script so it can make sure it’s a valid alias, which is slower. But at the same time, you will only have “so many” ethernet interfaces, so that impact would be minimal.

All of that would be fairly trivial to execute.
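A minimal Python sketch of that lookup, assuming the JSON from ip -j link show carries an altnames list for each link. The sample data is hypothetical, modeled on the eth1/enp1s0 pair above, and resolve_ifname is a made-up helper name, not an existing VyOS function.

```python
import json


def resolve_ifname(ip_json, requested):
    """Return the kernel ifname for a primary name or altname, else None."""
    for link in json.loads(ip_json):
        if requested == link.get("ifname"):
            return link["ifname"]
        if requested in link.get("altnames", []):
            return link["ifname"]
    return None


# Hypothetical output of `ip -j link show`, trimmed to the relevant keys.
SAMPLE = json.dumps([
    {"ifindex": 2, "ifname": "eth0", "altnames": ["enp4s0d1"]},
    {"ifindex": 3, "ifname": "eth1", "altnames": ["enp1s0"]},
    {"ifindex": 4, "ifname": "lo"},
])

print(resolve_ifname(SAMPLE, "enp1s0"))
```

A validator could accept the name whenever this returns something other than None, and the conf_mode script could use the returned ifname for the actual configuration.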

Have you made any progress on this? :slight_smile:

I’m asking both because I’m curious and to avoid this thread being closed.

@ScottLaird has an excellent blog site here: https://scottstuff.net/

They actually have a blog post on this called “part 1”, so hopefully he’ll find time to do a part 2 with further testing, I know I’d be interested in reading it.

Part 1: Running VyOS on a Mellanox SN2010 Switch, Pt 1 - scottstuff.net


Life keeps getting in the way, and the SN2010 is slightly too loud to want running next to my desk 24x7, which makes experimentation a bit tricky. Hopefully I’ll be able to fix the fan situation this weekend (via a 3D printed cover w/ 120mm fans), and then I can start in on PRs for sw* interface names and the smaller, hopefully less-contentious changes.


I’d be very curious if the hardware offloading for flowtables works with it. When nftables added hardware support, it specifically worked with Mellanox to add the support.

That was indeed a good read. Thanks! :slight_smile:

“It feels like a real network device, not three shell scripts in a trenchcoat.” :smile:

Yeah, I’d also be interested in reading a part 2.

It would certainly be cool to try it. Mellanox and MediaTek are two of the few manufacturers that implement it. I don’t know whether Mellanox implemented support for other hardware than their NICs, but trying it can’t hurt. :slightly_smiling_face:

So part2 will be “how I replaced the noisy fans with silent fans from Noctua” before part3 will be on progress of using hardware offloading in VyOS? :slight_smile: