VyOS update breaks routing (FRR)

I’ve updated VyOS to two different releases and both of them broke routing.

When the update is applied, VyOS never comes back. But I see in the console that it allegedly finished loading the old config into the new image.

VyOS (the edge firewall) is at least two hops away from the user networks, with multiple paths that OSPF handles. I ran sh ip ro tab a to see if it could reach the intranet(s), but it told me something about zebra not running.

I restarted FRR (well, first I tried to restart zebra because I dumb) and it seems to become active successfully, but its status still shows a lot of red text for a CLI. That’s never good. :frowning:

Anyway, because of < things, lots of >, all information I could manage to gather is a sad screenshot from the console on a browser. I’m posting it in the off-chance it’s useful to a dev because I’m about to wipe this mistake of an OS as soon as I post it.


Updated to/from

  • 202410180006 (2024) → 202505250022 (May25th)
  • 202410180006 (2024) → 202505260020 (May26th)
    or:
  • 202410180006 (2024) → 202505250022 (May25th) → 202410180006 (2024) → 202505260020 (May26th)

The earlier version of VyOS I had (and still have) is 1.5-rolling-202410180006, which thankfully was easy to revert to.

I updated to the nightlies of the 25th and the 26th of May; both had the same zebra problem, though I didn’t check any further on the 25th’s nightly.

I did try to redo the OSPF setup, but everything was already there. It wasn’t a failed config import, it just didn’t run.

After installing the 25th’s image, I rebooted into the earlier one (from October 2024) and, still in the 2024 image, set it as the default so that I’d be allowed to delete the 25th’s image. I deleted it, then downloaded the 26th’s image and installed it.

In other words: I did not try[1] to move from the 25th to the 26th. Both were updates to the 2024 image.

I’ll try to make time to install from scratch later today; hopefully it fares better. I’ll update this if there’s something interesting. I don’t suppose this is the place to suggest the great[2] feature of popping a random D pic out of the blue in the console, right? :upside_down_face: D is for dynamic, of course. :roll_eyes: That would be an interesting find.

[1] I couldn’t even if I wanted to; the keyboard in the system is really messed up and there’s no clipboard access in the console.

[2] Well, “great” might be somewhat of a stretch, whereas a stretch might be just a given.

Hi, can you try installing a clean VyOS and using the config from the old one?

show configuration commands

Save the output and paste it into the clean install…

Sorry for the wait,

Yeah, I did reinstall a fresh one that same day. However, since there’s a container on it (a very important one) plus some symlinks to support it, and since I took the chance to clean up the ruleset and, more than anything, my aliases/groups, it took a lot more than the copy+paste that I too was expecting. I was tempted to replace the whole /config, which I had backed up, but opted not to in case the problem was due to corruption.

It worked. Now I have to test updating that image again though, to see what happens. Like I said, I installed it the same day as the post, but didn’t set it up until just today, so there were a few more images to test updating to. I got the latest (instead of my usual second to last, which gives a little leeway if something goes wrong; it’s stupid, I know). The install was so textbook that I thought the firewall was rebooting. It wasn’t.

“Starting VyOS router.” finally appeared, then “Mounting VyOS Config…done.” and it stayed there. Eventually I heard the Siri guy voice say the Internet connection was lost (it’s an automation thing).

I was starting to get uncomfortable when the prompt finally appeared. Huge relief… except that I never heard the Siri voice say I was back online.

Because I’m not.

show ip ospf
vyos@routelogic:~$ show ip ospf
vyos@routelogic:~$ show ip ospf neighbor
% OSPF is not enabled in vrf default
vyos@routelogic:~$

systemctl status frr
# systemctl status frr
● frr.service - FRRouting
     Loaded: loaded (/lib/systemd/system/frr.service; disabled; preset: enabled)
    Drop-In: /etc/systemd/system/frr.service.d
             └─override.conf
     Active: active (running) since Tue 2025-06-03 15:17:43 MST; 18min ago
       Docs: https://frrouting.readthedocs.io/en/latest/setup.html
   Main PID: 3122 (watchfrr)
     Status: "restarting all"
      Tasks: 20 (limit: 7069)
     Memory: 69.1M
        CPU: 11.050s
     CGroup: /system.slice/frr.service
             ├─3122 /usr/lib/frr/watchfrr -d -F traditional zebra mgmtd bgpd ripd ripngd ospfd ospf6d isisd babeld pim6d ldpd nhrpd staticd bfdd fabricd
             ├─9953 /usr/lib/frr/mgmtd -d -F traditional --daemon -A 127.0.0.1
             ├─9960 /usr/lib/frr/bgpd -d -F traditional --daemon -A 127.0.0.1 -M rpki -M snmp
             ├─9967 /usr/lib/frr/ripd -d -F traditional --daemon -A 127.0.0.1 -M snmp
             ├─9973 /usr/lib/frr/ripngd -d -F traditional --daemon -A ::1
             ├─9975 /usr/lib/frr/ospfd -d -F traditional --daemon -A 127.0.0.1 -M snmp
             ├─9977 /usr/lib/frr/ospf6d -d -F traditional --daemon -A ::1 -M snmp
             ├─9979 /usr/lib/frr/isisd -d -F traditional --daemon -A 127.0.0.1 -M snmp
             ├─9981 /usr/lib/frr/babeld -d -F traditional --daemon -A 127.0.0.1
             ├─9983 /usr/lib/frr/pim6d -d -F traditional --daemon -A ::1
             ├─9985 /usr/lib/frr/ldpd -L -u frr -g frr
             ├─9986 /usr/lib/frr/ldpd -E -u frr -g frr
             ├─9987 /usr/lib/frr/ldpd -d -F traditional --daemon -A 127.0.0.1 -M snmp
             ├─9989 /usr/lib/frr/nhrpd -d -F traditional --daemon -A 127.0.0.1
             ├─9994 /usr/lib/frr/staticd -d -F traditional --daemon -A 127.0.0.1
             ├─9996 /usr/lib/frr/bfdd -d -F traditional --daemon -A 127.0.0.1
             └─9998 /usr/lib/frr/fabricd -d -F traditional --daemon -A 127.0.0.1

Jun 03 15:36:04 routelogic bgpd[9960]: [VMFZK-56S5Y] bgp_zebra_label_manager_connect: failed connecting synchronous zclient!
Jun 03 15:36:05 routelogic ldpd[9987]: [G89VD-0S2H5] Error connecting synchronous zclient!
Jun 03 15:36:05 routelogic bgpd[9960]: [VMFZK-56S5Y] bgp_zebra_label_manager_connect: failed connecting synchronous zclient!
Jun 03 15:36:06 routelogic ldpd[9987]: [G89VD-0S2H5] Error connecting synchronous zclient!
Jun 03 15:36:06 routelogic bgpd[9960]: [VMFZK-56S5Y] bgp_zebra_label_manager_connect: failed connecting synchronous zclient!
Jun 03 15:36:07 routelogic ldpd[9987]: [G89VD-0S2H5] Error connecting synchronous zclient!
Jun 03 15:36:07 routelogic bgpd[9960]: [VMFZK-56S5Y] bgp_zebra_label_manager_connect: failed connecting synchronous zclient!
Jun 03 15:36:08 routelogic ldpd[9987]: [G89VD-0S2H5] Error connecting synchronous zclient!
Jun 03 15:36:08 routelogic bgpd[9960]: [VMFZK-56S5Y] bgp_zebra_label_manager_connect: failed connecting synchronous zclient!
Jun 03 15:36:09 routelogic ldpd[9987]: [G89VD-0S2H5] Error connecting synchronous zclient!
[edit]
@routelogic#

Not completely, at least: I can reach the router because I have access to that VLAN, but again OSPF was turned off. Not good news after all. I hope I can go back like last time. “But while I’m here…” I thought, “might as well gather as much info as I can for the gurus,” right?
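
One thing that stands out in the systemctl status frr output above: watchfrr is started with zebra first in its daemon list, yet no /usr/lib/frr/zebra process shows up in the CGroup tree. Here is a minimal Python sketch for spotting that kind of gap in a pasted status dump. missing_daemons is a hypothetical helper (not part of VyOS or FRR), and it assumes that -F is the only watchfrr option that takes a value, as in the output above.

```python
import re

def missing_daemons(status_text):
    """Return daemons that watchfrr was asked to supervise but that have
    no process line in the given `systemctl status frr` text."""
    expected = []
    running = set()
    for line in status_text.splitlines():
        m = re.search(r"/usr/lib/frr/(\w+)(.*)", line)
        if not m:
            continue
        name, args = m.group(1), m.group(2).split()
        if name != "watchfrr":
            running.add(name)
            continue
        # Walk watchfrr's arguments: skip options, keep bare daemon names.
        i = 0
        while i < len(args):
            if args[i] == "-F":          # assumption: -F consumes one value
                i += 2
            elif args[i].startswith("-"):
                i += 1
            else:
                expected.append(args[i])
                i += 1
    return [d for d in expected if d not in running]

# Shortened excerpt of the status output from the post.
STATUS = """
 ├─3122 /usr/lib/frr/watchfrr -d -F traditional zebra mgmtd bgpd ospfd staticd
 ├─9953 /usr/lib/frr/mgmtd -d -F traditional --daemon -A 127.0.0.1
 ├─9960 /usr/lib/frr/bgpd -d -F traditional --daemon -A 127.0.0.1 -M rpki -M snmp
 ├─9975 /usr/lib/frr/ospfd -d -F traditional --daemon -A 127.0.0.1 -M snmp
 └─9994 /usr/lib/frr/staticd -d -F traditional --daemon -A 127.0.0.1
"""

print(missing_daemons(STATUS))  # → ['zebra']
```

It only compares names in the text it is given; it does not say why a daemon is missing.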

I did systemctl reload frr. It got a little “what is it to you, bro?” like before; I solved the puzzle and then I got one of those friendly suggestions to check out journald, kinda like the software equivalent of “somebody in the morgue is looking to ask you some questions, it has to be there.”

I recorded the whole thing, as in a video recording, machine/console and all. I hope it’s worth something; it took forever to upload on a cell connection:

I don’t expect anybody to actually get anything useful out of it other than the frequency with which the logs are being spammed, so here’s that log too:
journalctl-stream.log (227.1 KB)
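
For anyone who wants a number for that frequency, here is a rough Python sketch that counts the “zclient” errors per daemon. It assumes the attached file uses the same short journal format as the excerpt pasted earlier (month, day, time, host, daemon[pid]: message); zclient_error_counts is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines like:
# Jun 03 15:36:04 routelogic bgpd[9960]: [VMFZK-56S5Y] ... zclient!
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<daemon>[\w.-]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)

def zclient_error_counts(lines):
    """Count zclient connection errors per daemon and report how many
    distinct one-second timestamps they were spread over."""
    per_daemon = Counter()
    seconds = set()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and "zclient" in m.group("msg"):
            per_daemon[m.group("daemon")] += 1
            seconds.add(m.group("ts"))
    return per_daemon, len(seconds)

# Four lines copied from the status output in this thread.
SAMPLE = [
    "Jun 03 15:36:04 routelogic bgpd[9960]: [VMFZK-56S5Y] bgp_zebra_label_manager_connect: failed connecting synchronous zclient!",
    "Jun 03 15:36:05 routelogic ldpd[9987]: [G89VD-0S2H5] Error connecting synchronous zclient!",
    "Jun 03 15:36:05 routelogic bgpd[9960]: [VMFZK-56S5Y] bgp_zebra_label_manager_connect: failed connecting synchronous zclient!",
    "Jun 03 15:36:06 routelogic ldpd[9987]: [G89VD-0S2H5] Error connecting synchronous zclient!",
]

counts, seconds = zclient_error_counts(SAMPLE)
print(dict(counts), "over", seconds, "distinct seconds")
```

To run it over the whole attachment, pass an open file object instead of SAMPLE, e.g. zclient_error_counts(open("journalctl-stream.log")).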

I edited only my username because of the obvious: the real one isn’t funny[1]. Everything else is verbatim.

Thanks for answering and, if you’re part of the team, for listening. :heart: I’m done now, though; it’s 19 o’clock, I started in the morning. I just found out there’s a :tent: (tent) unicode symbol. It’s not related to anything. :joy: (Ever.)


orig

Before I decided to “be thorough,” and “take one for the team”, and “it only hurts at the beginning” and all that, this ended in “It worked.” way above. Then I commented (below) on another thing I noticed that seemed like a bug. In hindsight, I guess it might not be that important, huh? :upside_down_face:


I noticed that commit errors are a lot more verbose, and a lot more pythony. Because of the cleanup, I left some empty nested firewall groups, and it screamed at me to tell me that, rather than giving the usual succinctly vague reference to the section it’s having problems with, e.g.:

[ system conntrack ]
Traceback (most recent call last):
  File "/usr/libexec/vyos/services/vyos-configd", line 146, in run_script
    script.apply(c)
  File "/usr/libexec/vyos//conf_mode/system_conntrack.py", line 249, in apply
    call_dependents()
  File "/usr/lib/python3/dist-packages/vyos/configdep.py", line 172, in call_dependents
    f()
  File "/usr/lib/python3/dist-packages/vyos/configdep.py", line 141, in func_impl
    run_conditionally(target, tag_value, config)
  File "/usr/lib/python3/dist-packages/vyos/configdep.py", line 132, in run_conditionally
    run_config_mode_script(target, config)
  File "/usr/lib/python3/dist-packages/vyos/configdep.py", line 111, in run_config_mode_script
    mod.apply(c)
  File "/usr/libexec/vyos//conf_mode/nat.py", line 259, in apply
    call_dependents()
  File "/usr/lib/python3/dist-packages/vyos/configdep.py", line 172, in call_dependents
    f()
  File "/usr/lib/python3/dist-packages/vyos/configdep.py", line 141, in func_impl
    run_conditionally(target, tag_value, config)
  File "/usr/lib/python3/dist-packages/vyos/configdep.py", line 132, in run_conditionally
    run_config_mode_script(target, config)
  File "/usr/lib/python3/dist-packages/vyos/configdep.py", line 110, in run_config_mode_script
    mod.generate(c)
  File "/usr/libexec/vyos//conf_mode/policy_route.py", line 191, in generate
    render(nftables_conf, 'firewall/nftables-policy.j2', policy)
  File "/usr/lib/python3/dist-packages/vyos/template.py", line 174, in render
    rendered = render_to_string(template, content, formater, location)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/vyos/template.py", line 143, in render_to_string
    rendered = template.render(content)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/jinja2/environment.py", line 1301, in render
    self.environment.handle_exception()
  File "/usr/lib/python3/dist-packages/jinja2/environment.py", line 936, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "/usr/share/vyos/templates/firewall/nftables-policy.j2", line 87, in top-level template code
    {{ group_tmpl.groups(firewall_group, True, True) }}
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/jinja2/runtime.py", line 777, in _invoke
    rv = self._func(*arguments)
         ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/vyos/templates/firewall/nftables-defines.j2", line 25, in template
    elements = { {{ group_conf.address | nft_nested_group(includes, group.ipv6_address_group, 'address') | join(",") }} }
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/vyos/template.py", line 674, in nft_nested_group
    add_includes(name)
  File "/usr/lib/python3/dist-packages/vyos/template.py", line 663, in add_includes
    if key in groups[name]:
              ~~~~~~^^^^^^
KeyError: 'def_namemasters'

[[system conntrack]] failed

The things I edited were firewall groups. Eventually I got it sorted out, the commit passed and I heard Siri say that I was back online. The error opens with [ system conntrack ] and closes with [[system conntrack]] (note the differing [ ]s), so I’m thinking maybe it’s a bug, because even without a GUI, VyOS has a pristine pfft!-who-needs-a-GUI-anyway? presentation.
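
The traceback above ends in a plain dictionary lookup, groups[name], so it looks like a nested group pointed at a name that wasn’t in the dictionary at render time. Here is a minimal Python sketch of checking for that before committing. The dict shape (group name mapped to a dict with an optional “include” list) is an assumption made for illustration and not VyOS’s actual data model; dangling_includes is a hypothetical helper.

```python
def dangling_includes(groups):
    """Return (group, included_name) pairs where a group includes
    another group that is not defined in `groups`."""
    missing = []
    for name, conf in groups.items():
        # An empty group may be stored as None or {}; treat both as empty.
        for included in (conf or {}).get("include", []):
            if included not in groups:
                missing.append((name, included))
    return missing

# Hypothetical groups: "masters" includes a group that was never defined.
GROUPS = {
    "masters": {"address": ["192.0.2.10"], "include": ["def_namemasters"]},
    "lan": {"address": ["192.0.2.0/24"]},
    "empty": None,
}

for group, target in dangling_includes(GROUPS):
    print(f"group {group!r} includes undefined group {target!r}")

# The unchecked lookup fails the same way the traceback does:
try:
    GROUPS["def_namemasters"]
except KeyError as err:
    print("KeyError:", err)
```

A check like this would turn the long traceback into a one-line message naming the offending group, which is closer to the succinct errors VyOS used to give.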

[1] Just kidding.

My unicode got replaced with emojis! :face_with_symbols_over_mouth: Where’s a voodoo doll when you need one.

– UPDATE

I’m now on the third image. As with the two images before, or rather, as with the image before and after it, it says that there’s a filesystem issue. I think. I don’t know, I’ve never seen a zebra if not through NatGeo or something like that. Zoology is not my thing.

If you set up a VRF, or at least set vrf bind-to-all and commit, it makes OSPF start. I don’t know how to set up VRFs correctly, though, and zebra is still dead.
– IN-UPDATE UPDATE: It dies a little later, perhaps due to inactivity, I think; i.e. # run show ip ospf returns nothing.

Static routes don’t work either because well… zebra.

“Seems like they went all in with FRR rather than this from here and this from there like in other platforms.” I say, pretending to know what the hell I’m talking about. :joy:

Curiously enough, the brave little router that couldn’t route, is online.

First hop is fine.

Now that really is it for me. I’ll come back if I find something— not that I’ll be lookin’. I think this might need checking from a dev for real.

Do you have an example of a minimal config that reproduces the bug?

UPDATE

I installed VyOS on bare metal (first time ever, too; I’m both in the middle of a problem and geeking out) so I could rule out potentially overly aggressive filesystem timeouts which, if that’s a thing, is about the only thing I can conceive of that could have corrupted the installation.

Like its VM counterparts, it started out okay until zebra eventually stopped running. Though the FRR service itself is running, it shows the familiar “zclient” red if you poke just a little bit deeper in the logs. Not even that: systemctl status alone will do.

The image I’m using is v2025.06.01-0024.

I’m not completely sure, mainly because it makes no sense, but I have a pretty strong hunch of what might be causing it. If I’m correct I’ll bother you one last time. :crossed_fingers:

It paid off… I don’t know why though.

Whether it’s pfSense, OPNsense, OpenWRT, plain Linux, whatever, I’ve never quite made SNMP work with FRR, or something “X-SNMP”; I don’t know. Instead, checking that box would make FRR crash.

FRR-crashing checkbox on pfSense

While editing the config file (directly over SSH), I copy it whole to a new editor each time to double-check for those fun JSON syntax errors. As you know, you’re always left at the end when you paste, which means scrolling or searching back to your position. Near the end, though, there are the settings for FRR, the package (??), not the protocols. They caught my attention while I was collecting info for one of the many drafts that don’t make it, I believe.

What I don’t understand is why it fails after a while and not immediately, or why it fails after changing settings, none of which are related to FRR, or routing, or even interfaces. It just flips out of nowhere. I deleted those lines, went to config mode, and issued load. It told me that I should reboot. I did, and a little after I heard Siri yapping about something.

I guess I fixed it. It’s fixed. I’m not complaining or anything, but I had nothing to do with it. I merely recognized something I had seen before (which I haven’t gotten to work either, BTW… on various platforms). It probably would’ve worked by getting rid of snmp{…} alone.

I really don’t feel like finding out for myself :roll_eyes: not today anyway :yawning_face: but thanks for answering and thanks to the forum gods for not banning me.