Rolling release breaks my internet access

Applying any rolling image after 1.3-rolling-202012230217 breaks my setup. I see no errors on boot, and I can "load" and "commit" the config without errors, but I get no internet access.

I can ping some external IP addresses but not all, and pings to any DNS server simply don't work.

Nothing has changed on the config side, and booting back into 1.3-rolling-202012230217 gets me up and running again.

Let me know what extra information I can provide to help track down the issue.

Happy Christmas everyone!

Still no joy with the latest automated rolling images.

The most recent one seems to break all WAN connectivity; I cannot even ping by IP address. Rolling back to VyOS 1.3-rolling-202012230217 is my only option.

Anyone else experiencing issues?

Just tried the latest rolling image, 1.3-rolling-202012271303, and it still breaks.

I compared the routing tables between that and my working image. It seems the default route now includes the WireGuard interfaces.

Differences are shown below:

Working Image (1.3-rolling-202012230217)

S>* 0.0.0.0/0 [1/0] via 172.31.255.5, eth0, weight 1, 00:00:16

Broken Image (1.3-rolling-202012271303)

S>* 0.0.0.0/0 [1/0] is directly connected, wg0, weight 1, 00:01:03
*                 is directly connected, wg1, weight 1, 00:01:03
*                 via 172.31.255.5, eth0, weight 1, 00:01:03

Not sure if I am on the right track here. Any help would be appreciated.

Hello @phillipmcmahon, could you provide your configuration commands so we can reproduce this in our lab?
Did you try temporarily disabling the WG interfaces?
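
For example, something like this from configure mode should temporarily take the WireGuard interfaces out of the picture (wg0/wg1 taken from your routing table output):

set interfaces wireguard wg0 disable
set interfaces wireguard wg1 disable
commit

Deleting the disable nodes and committing again should bring them back.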

For sure I can. Do you want that sanitised or not? If not, then please let me know how I can share it.

@phillipmcmahon, sure, use strip-private to protect your privacy.
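
Something like this in operational mode should give you a sanitised command listing you can attach (how you save the output is up to you):

show configuration commands | strip-private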

Here you go.

commands-stripped.txt (37.8 KB)

Thanks, I think it is a very interesting nuance and I want to reproduce this in our lab.

Let me know if you need any more information.

Hey Dmitry,

Any progress on this one, and anything I can do to assist?

Hello @phillipmcmahon, sorry. I'm a bit busy with another task. I will get back to you when I finish my current tasks.

Hello @phillipmcmahon, I think this issue comes from updating FRR to the latest stable version. What's interesting is that in the FRR config I see these routes in the main table:

vyos@vyos# sudo vtysh -c "show run" | grep wg
ip route 0.0.0.0/0 wg1
ip route 0.0.0.0/0 wg0

Note: it happens only right after the router boots; otherwise it does not. For example, right after boot:

vyos@vyos# run show ip route 0.0.0.0
Routing entry for 0.0.0.0/0
  Known via "static", distance 1, metric 0, best
  Last update 00:00:32 ago
  * directly connected, wg1, weight 1
  * directly connected, wg0, weight 1
  * 172.16.0.254, via eth0, weight 1

vyos@vyos# delete protocols static table
vyos@vyos# commit
vyos@vyos# set protocols static table 100 interface-route 0.0.0.0/0 next-hop-interface wg0
vyos@vyos# set protocols static table 100 route 0.0.0.0/0 blackhole distance '255'
vyos@vyos# set protocols static table 110 interface-route 0.0.0.0/0 next-hop-interface wg1
vyos@vyos# set protocols static table 110 route 0.0.0.0/0 blackhole distance '255'
vyos@vyos# commit
vyos@vyos# run show ip route 0.0.0.0
Routing entry for 0.0.0.0/0
  Known via "static", distance 1, metric 0, best
  Last update 00:01:46 ago
  * 172.16.0.254, via eth0, weight 1

The same issue is already described at https://phabricator.vyos.net/T3172.

Thanks for the update, very much appreciated.

So in the meantime, I can run a post-boot script to perform those commands and I should be good to go?

As this completely breaks routing, I hope it will be fixed in a short space of time.

Happy new year.

Hello @phillipmcmahon, happy new year.

Not exactly; this bug is related to the FRR version. @c-po has already created a bug report in the official FRR GitHub repo.

We need to wait for the fix.

I totally get that it's a bug, but what's stopping me from running those commands in vyos-postconfig-bootup.script to delete and re-add the routes on reboot/upgrade, as a temporary workaround until the source bug is fixed?
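
Something like this is what I had in mind for /config/scripts/vyos-postconfig-bootup.script, reusing the commands above (a rough, untested sketch):

#!/bin/vbash
# Workaround: re-create the policy-routing default routes after boot
source /opt/vyatta/etc/functions/script-template
configure
delete protocols static table
commit
set protocols static table 100 interface-route 0.0.0.0/0 next-hop-interface wg0
set protocols static table 100 route 0.0.0.0/0 blackhole distance '255'
set protocols static table 110 interface-route 0.0.0.0/0 next-hop-interface wg1
set protocols static table 110 route 0.0.0.0/0 blackhole distance '255'
commit
exit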

As I understand the source bug, it's dependent on the order in which the commands get executed.

@phillipmcmahon

I originated that Phabricator bug. In my case it appeared to fix itself when I deleted and re-added the configuration after boot. Unfortunately that was entirely by accident and, depending on the complexity of the config, it might be tough to replicate.

With that said, there’s nothing stopping you from using that methodology to fix it if you can find the order that works for you.

OK. I will stay on the rolling image from 17.12 and await a proper fix in FRR to come along.

Thanks for the help and the explanation.

I wonder if you’re seeing the same thing we did on a recent rolling build (30th December 2020), where after configure and commit we ended up in this situation:

Jan  3 21:30:47 dekker watchfrr[1141]: [EC 268435457] zebra state -> down : read returned EOF
Jan  3 21:30:47 dekker ospfd[1205]: [EC 134217741] Packet[DD]: Neighbor 46.227.201.3 MTU 2000 is larger than [eth7.1401:46.227.200.237]'s MTU 1500
Jan  3 21:30:47 dekker watchfrr[1141]: [EC 268435457] ripd state -> down : read returned EOF
Jan  3 21:30:48 dekker watchfrr[1141]: [EC 268435457] ripngd state -> down : read returned EOF
Jan  3 21:30:48 dekker watchfrr[1141]: [EC 268435457] ospfd state -> down : read returned EOF
Jan  3 21:30:48 dekker watchfrr[1141]: ospfd state -> up : connect succeeded
Jan  3 21:30:48 dekker watchfrr[1141]: [EC 268435457] ospfd state -> down : unexpected read error: Connection reset by peer
Jan  3 21:30:48 dekker watchfrr[1141]: [EC 268435457] ospf6d state -> down : read returned EOF
Jan  3 21:30:48 dekker watchfrr[1141]: [EC 268435457] ldpd state -> down : read returned EOF
Jan  3 21:30:48 dekker watchfrr[1141]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
Jan  3 21:30:48 dekker watchfrr[1141]: [EC 268435457] isisd state -> down : read returned EOF
Jan  3 21:30:48 dekker watchfrr[1141]: [EC 268435457] bfdd state -> down : read returned EOF
Jan  3 21:30:52 dekker watchfrr[1141]: Forked background command [pid 41571]: /usr/lib/frr/watchfrr.sh restart all
Jan  3 21:30:52 dekker staticd[1228]: Terminating on signal

Looks like a fix has been submitted and added as a PR.

With the recent introduction of 1.4, will this update make it into 1.3 LTS, as it's a bug that breaks core functionality?

Can you let me know when it is safe to try out a rolling 1.4 version with the fix rolled in, please?
