VyOS on GCP loses network connectivity after 30-35 minutes and can't ping LAN units

Hi
I have a default VyOS deployment on GCP with a static LAN IP and a static external IP.

Every 30-35 minutes it loses its default route, which was injected by the kernel. No routing works until the VM is rebooted.

The other strange issue is that ARP won’t resolve for local LAN units. Hence I can’t reach them from VyOS.

Any suggestions would be greatly appreciated.

Hello @aa7
What settings do you have on the external interface? Is it DHCP or a static setting?
Can you run the following commands before the problem occurs and again after it:

show ip route
show arp
sudo arp -an
sudo netstat -rn
sudo ip r
sudo ip a

Can you ping your default gw? Do you use conntrack?
After the problem occurs, also run:
sudo dmesg -T | tail -n 50
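
To make the before/after comparison easier, the Linux-level commands above could be wrapped in a small helper along these lines. This is only a sketch: the function name snapshot_net_state and the output paths are made up for illustration, and the VyOS operational commands (show ip route, show arp) are left out because they may not be available outside the interactive VyOS shell.

```shell
#!/bin/sh
# Sketch: save the Linux-level diagnostics into one file so that a
# "before" and an "after" capture can be compared with diff.
# snapshot_net_state is a hypothetical helper name, not a VyOS command.
snapshot_net_state() {
    out="$1"
    for cmd in "ip route" "ip addr" "arp -an" "netstat -rn"; do
        echo "=== $cmd ==="
        # Keep going even if a tool is missing or returns an error.
        $cmd 2>&1 || true
    done > "$out"
}
```

Run it as root once while routing works (snapshot_net_state /tmp/net-before.txt) and once after it breaks (snapshot_net_state /tmp/net-after.txt), then compare the two files with diff.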

Hi Viacheslav.

  • I can’t ping the dgw, but I think that is normal: none of my GCP instances (Linux & Windows) can ping the dgw.
  • I will run dmesg in a bit and submit it as a separate reply (I just restarted the system).
  • For the “before the problem” output below, I added ping 10.150.0.10 together with show arp so you can see that ARP is not resolving for LAN units (10.150.0.10 is one of my LAN units, with no OS firewall or any other protection).
  • Finally, the rest of the requested info is below (disregard the dum interfaces, created only for VPN testing purposes).

== after the problem ===

adm@rtr1:~$ ping 8.8.8.8
connect: Network is unreachable

adm@rtr1:~$ sh ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route

C>* 10.150.0.0/20 is directly connected, eth0, 09:08:51
K>* 10.150.0.1/32 [0/0] is directly connected, eth0, 09:08:55
C * 10.150.0.5/32 is directly connected, eth0, 09:08:57
C>* 10.150.0.5/32 is directly connected, eth0, 09:08:57
C>* 192.168.7.0/24 is directly connected, dum7, 09:08:51
C>* 192.168.8.0/24 is directly connected, dum8, 09:08:51

adm@rtr1:~$ sh arp
Address HWtype HWaddress Flags Mask Iface
10.150.0.11 (incomplete) eth0
10.150.0.1 ether 42:01:0a:96:00:01 C eth0
10.150.0.10 (incomplete) eth0

adm@rtr1:~$ sudo arp -an
? (10.150.0.11) at on eth0
? (10.150.0.1) at 42:01:0a:96:00:01 [ether] on eth0
? (10.150.0.10) at on eth0

adm@rtr1:~$ sudo netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
10.150.0.0 0.0.0.0 255.255.240.0 U 0 0 0 eth0
10.150.0.1 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
192.168.7.0 0.0.0.0 255.255.255.0 U 0 0 0 dum7
192.168.8.0 0.0.0.0 255.255.255.0 U 0 0 0 dum8

adm@rtr1:~$ sudo ip r
10.150.0.0/20 dev eth0 proto kernel scope link src 10.150.0.5
10.150.0.1 dev eth0 scope link
192.168.7.0/24 dev dum7 proto kernel scope link src 192.168.7.1
192.168.8.0/24 dev dum8 proto kernel scope link src 192.168.8.1

adm@rtr1:~$ sudo ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc pfifo_fast state UP group default qlen 1000
link/ether 42:01:0a:96:00:05 brd ff:ff:ff:ff:ff:ff
inet 10.150.0.5/32 brd 10.150.0.5 scope global eth0
valid_lft forever preferred_lft forever
inet 10.150.0.5/20 brd 10.150.15.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::4001:aff:fe96:5/64 scope link
valid_lft forever preferred_lft forever
3: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether d2:da:37:23:e9:b2 brd ff:ff:ff:ff:ff:ff
4: dum7: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 5e:59:92:b4:c4:b2 brd ff:ff:ff:ff:ff:ff
inet 192.168.7.1/24 brd 192.168.7.255 scope global dum7
valid_lft forever preferred_lft forever
inet6 fe80::5c59:92ff:feb4:c4b2/64 scope link
valid_lft forever preferred_lft forever
5: dum8: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether da:d7:c6:8a:b1:52 brd ff:ff:ff:ff:ff:ff
inet 192.168.8.1/24 brd 192.168.8.255 scope global dum8
valid_lft forever preferred_lft forever
inet6 fe80::d8d7:c6ff:fe8a:b152/64 scope link
valid_lft forever preferred_lft forever

== before the problem ===
adm@rtr1:~$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route
K * 0.0.0.0/0 [0/0] via 10.150.0.1, eth0 inactive, 00:02:39
C>* 10.150.0.0/20 is directly connected, eth0, 00:02:34
K>* 10.150.0.1/32 [0/0] is directly connected, eth0, 00:02:39
C * 10.150.0.5/32 is directly connected, eth0, 00:02:40
C>* 10.150.0.5/32 is directly connected, eth0, 00:02:41
C>* 192.168.7.0/24 is directly connected, dum7, 00:02:34
C>* 192.168.8.0/24 is directly connected, dum8, 00:02:34

adm@rtr1:~$ show arp
Address HWtype HWaddress Flags Mask Iface
10.150.0.1 ether 42:01:0a:96:00:01 C eth0

adm@rtr1:~$ sudo arp -an
? (10.150.0.1) at 42:01:0a:96:00:01 [ether] on eth0

adm@rtr1:~$ sudo netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.150.0.1 0.0.0.0 UG 0 0 0 eth0
10.150.0.0 0.0.0.0 255.255.240.0 U 0 0 0 eth0
10.150.0.1 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
192.168.7.0 0.0.0.0 255.255.255.0 U 0 0 0 dum7
192.168.8.0 0.0.0.0 255.255.255.0 U 0 0 0 dum8

adm@rtr1:~$ sudo ip r
default via 10.150.0.1 dev eth0
10.150.0.0/20 dev eth0 proto kernel scope link src 10.150.0.5
10.150.0.1 dev eth0 scope link
192.168.7.0/24 dev dum7 proto kernel scope link src 192.168.7.1
192.168.8.0/24 dev dum8 proto kernel scope link src 192.168.8.1

adm@rtr1:~$ sudo ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc pfifo_fast state UP group default qlen 1000
link/ether 42:01:0a:96:00:05 brd ff:ff:ff:ff:ff:ff
inet 10.150.0.5/32 brd 10.150.0.5 scope global eth0
valid_lft forever preferred_lft forever
inet 10.150.0.5/20 brd 10.150.15.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::4001:aff:fe96:5/64 scope link
valid_lft forever preferred_lft forever
3: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1e:c6:06:40:da:a7 brd ff:ff:ff:ff:ff:ff
4: dum7: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether e2:04:f9:24:49:90 brd ff:ff:ff:ff:ff:ff
inet 192.168.7.1/24 brd 192.168.7.255 scope global dum7
valid_lft forever preferred_lft forever
inet6 fe80::e004:f9ff:fe24:4990/64 scope link
valid_lft forever preferred_lft forever
5: dum8: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 42:b6:7f:d2:f7:2d brd ff:ff:ff:ff:ff:ff
inet 192.168.8.1/24 brd 192.168.8.255 scope global dum8
valid_lft forever preferred_lft forever
inet6 fe80::40b6:7fff:fed2:f72d/64 scope link
valid_lft forever preferred_lft forever

admin@rtr1:~$ ping 10.150.0.10
PING 10.150.0.10 (10.150.0.10) 56(84) bytes of data.
From 10.150.0.5 icmp_seq=1 Destination Host Unreachable
From 10.150.0.5 icmp_seq=2 Destination Host Unreachable
From 10.150.0.5 icmp_seq=3 Destination Host Unreachable
^C
--- 10.150.0.10 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3066ms
pipe 4
adm@rtr1:~$ sudo arp -an
? (10.150.0.10) at on eth0
? (10.150.0.1) at 42:01:0a:96:00:01 [ether] on eth0

adm@rtr1:~$ sho arp
Address HWtype HWaddress Flags Mask Iface
10.150.0.10 (incomplete) eth0
10.150.0.1 ether 42:01:0a:96:00:01 C eth0

Here it happened again: the router just “lost” its dgw. I’m posting the dmesg output.

adm@rtr1:~$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route

C>* 10.150.0.0/20 is directly connected, eth0, 00:37:25
K>* 10.150.0.1/32 [0/0] is directly connected, eth0, 00:37:30
C * 10.150.0.5/32 is directly connected, eth0, 00:37:31
C>* 10.150.0.5/32 is directly connected, eth0, 00:37:32
C>* 192.168.7.0/24 is directly connected, dum7, 00:37:25
C>* 192.168.8.0/24 is directly connected, dum8, 00:37:25
adm@rtr1:~$ sudo dmesg -T | tail -n 50
[Mon Dec 2 09:27:30 2019] usbcore: registered new interface driver usb-storage
[Mon Dec 2 09:27:36 2019] EXT4-fs (sda1): re-mounted. Opts: (null)
[Mon Dec 2 09:27:36 2019] random: systemd: uninitialized urandom read (16 bytes read)
[Mon Dec 2 09:27:36 2019] systemd[1]: systemd 215 running in system mode. (+PAM +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ -SECCOMP -APPARMOR)
[Mon Dec 2 09:27:36 2019] systemd[1]: Detected virtualization ‘kvm’.
[Mon Dec 2 09:27:36 2019] systemd[1]: Detected architecture ‘x86-64’.
[Mon Dec 2 09:27:36 2019] systemd[1]: Inserted module ‘autofs4’
[Mon Dec 2 09:27:36 2019] systemd[1]: Set hostname to .
[Mon Dec 2 09:27:36 2019] random: systemd-sysv-ge: uninitialized urandom read (16 bytes read)
[Mon Dec 2 09:27:37 2019] random: systemd-sysv-ge: uninitialized urandom read (16 bytes read)
[Mon Dec 2 09:27:37 2019] systemd[1]: Cannot add dependency job for unit display-manager.service, ignoring: Unit display-manager.service failed to load: No such file or directory.
[Mon Dec 2 09:27:37 2019] systemd[1]: Starting Forward Password Requests to Wall Directory Watch.
[Mon Dec 2 09:27:37 2019] systemd[1]: Started Forward Password Requests to Wall Directory Watch.
[Mon Dec 2 09:27:37 2019] systemd[1]: Expecting device dev-ttyS0.device…
[Mon Dec 2 09:27:37 2019] systemd[1]: Starting Remote File Systems (Pre).
[Mon Dec 2 09:27:38 2019] systemd-udevd[526]: starting version 215
[Mon Dec 2 09:27:38 2019] random: crng init done
[Mon Dec 2 09:27:38 2019] random: 6 urandom warning(s) missed due to ratelimiting
[Mon Dec 2 09:27:38 2019] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input3
[Mon Dec 2 09:27:38 2019] ACPI: Power Button [PWRF]
[Mon Dec 2 09:27:38 2019] input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input4
[Mon Dec 2 09:27:38 2019] ACPI: Sleep Button [SLPF]
[Mon Dec 2 09:27:38 2019] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 10737418240 ms ovfl timer
[Mon Dec 2 09:27:38 2019] RAPL PMU: hw unit of domain pp0-core 2^-0 Joules
[Mon Dec 2 09:27:38 2019] RAPL PMU: hw unit of domain package 2^-0 Joules
[Mon Dec 2 09:27:38 2019] RAPL PMU: hw unit of domain dram 2^-16 Joules
[Mon Dec 2 09:27:38 2019] cryptd: max_cpu_qlen set to 1000
[Mon Dec 2 09:27:38 2019] AVX2 version of gcm_enc/dec engaged.
[Mon Dec 2 09:27:38 2019] AES CTR mode by8 optimization enabled
[Mon Dec 2 09:27:38 2019] alg: No test for pcbc(aes) (pcbc-aes-aesni)
[Mon Dec 2 09:27:38 2019] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
[Mon Dec 2 09:27:38 2019] EDAC sbridge: Ver: 1.1.2
[Mon Dec 2 09:27:39 2019] systemd-journald[469]: Received request to flush runtime journal from PID 1
[Mon Dec 2 09:27:40 2019] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[Mon Dec 2 09:27:40 2019] Bridge firewalling registered
[Mon Dec 2 09:27:40 2019] fuse init (API version 7.27)
[Mon Dec 2 09:27:41 2019] NET: Registered protocol family 17
[Mon Dec 2 09:27:46 2019] Process accounting resumed
[Mon Dec 2 09:27:52 2019] systemd[1]: Started ACPI event daemon.
[Mon Dec 2 09:27:52 2019] systemd[1]: Listening on ACPID Listen Socket.
[Mon Dec 2 09:27:52 2019] systemd[1]: Mounted /.
[Mon Dec 2 09:27:58 2019] NET: Registered protocol family 15
[Mon Dec 2 09:27:58 2019] Initializing XFRM netlink socket
[Mon Dec 2 09:27:58 2019] NET: Registered protocol family 38
[Mon Dec 2 09:27:58 2019] alg: No test for xcbc(camellia) (xcbc(camellia-asm))
[Mon Dec 2 09:27:58 2019] alg: No test for rfc3686(ctr(camellia)) (rfc3686(ctr-camellia-aesni-avx2))
[Mon Dec 2 09:28:00 2019] systemd[1]: Started ACPI event daemon.
[Mon Dec 2 09:28:00 2019] systemd[1]: Listening on ACPID Listen Socket.
[Mon Dec 2 09:28:00 2019] systemd[1]: Mounted /.
[Mon Dec 2 09:28:04 2019] alg: No test for echainiv(authenc(hmac(sha1),cbc(aes))) (echainiv(authenc(hmac(sha1-generic),cbc-aes-aesni)))

update:
I just configured the dgw statically as:
adm@rtr1# set protocols static route 0.0.0.0/0 next-hop 10.150.0.1 distance 1
… and routing functionality is back.

My guess is that the instance receives its DHCP configuration (including the dgw) during deployment. After I change the eth0 IP from ephemeral to static, the dgw injected by the kernel expires after 30 minutes or so.
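
One way to pin down exactly when the default route disappears would be to poll the kernel routing table and log the moment it goes missing. The sketch below assumes a plain POSIX shell with iproute2 available; has_default_route and watch_default_route are hypothetical helper names, and the 60-second interval is arbitrary.

```shell
#!/bin/sh
# Succeeds if the `ip route` output read from stdin contains a default route.
has_default_route() {
    grep -q '^default '
}

# Hypothetical watcher: poll once a minute and print a timestamped line
# whenever no default route is present. Runs until interrupted.
watch_default_route() {
    while true; do
        if ! ip route show | has_default_route; then
            echo "$(date) default route missing"
        fi
        sleep 60
    done
}
```

Starting watch_default_route right after boot and redirecting its output to a file would give a timestamp that can be matched against the DHCP client messages in the system log.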

I can certainly live with a statically configured dgw as long as it is stable. The one remaining issue is ARP resolution. I absolutely need VyOS to “talk” to the rest of my GCP units. So far all pings fail, and there are no ARP entries except for the GW (10.150.0.1).

Update:
As a quick-and-dirty fix for the ARP issue, I added a static ARP IP-to-MAC mapping at the OS level.
Below is the full procedure.
adm@rtr1:~$ su -
Password:
root@rtr1:~# bash
root@rtr1:~# sudo arp -s 10.150.0.10 xx:xx:xx:xx:xx:0a
root@rtr1:~# exit
root@rtr1:~# exit
adm@rtr1:~# sh arp
Address HWtype HWaddress Flags Mask Iface
10.150.0.10 ether xx:xx:xx:xx:xx:0a CM eth0
10.150.0.1 ether xx:xx:xx:xx:xx:01 C eth0
adm@rtr1:~# ping 10.150.0.10
PING 10.150.0.10 (10.150.0.10) 56(84) bytes of data.
64 bytes from 10.150.0.10: icmp_seq=1 ttl=128 time=1.84 ms
64 bytes from 10.150.0.10: icmp_seq=2 ttl=128 time=0.299 ms
^C
--- 10.150.0.10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.299/1.070/1.841/0.771 ms

While I can proceed with my deployment, it would be really good to resolve the issue properly and have ARP resolution work as expected.

Hello, @aa7!

Do you use VyOS from Google Cloud Marketplace, or did you install it from another source? We want to investigate the route issue, but for that we need to be able to reproduce every step you took.

Regarding ARP: networking in Google Cloud works a bit differently from a classical Ethernet network. The main thing to remember is that all traffic must flow via the gateway, even inside the same VPC. Technically, all traffic sent via the interface goes to the gateway. So you do not need any ARP here; just make sure you have a correct route to the destination network via the gateway.

Hi Taras,

My VyOS deployment is from GCP Marketplace. Bit-by-bit as specified in the procedure your team posted. Zone us-east4-c. If you need more details, let me know.

ARP issue:
Thank you for clarifying the gw detail; I knew about it. External routing (with the proper gw) worked as expected from the beginning. It is intra-LAN communication that I had issues with. Without the static IP-to-MAC mapping (the way I described in my post), no traffic would flow between my VyOS unit and the rest of my LAN (with either a statically configured gateway or the DHCP-injected route from deployment). It is highly likely that the issue is on the VyOS side, since the rest of my VMs had their MACs resolved correctly. I think you should investigate that too.

Note: after setting the static mapping I got full connectivity not only between VyOS and the rest of the LAN but also between my remote units via the S2S VPN through VyOS. It all works as expected. The project has only a few units, so it can be handled manually, but for bigger deployments it might be a big concern.

Your product is amazing for what it does. I’d really like to use it and recommend/promote it further, as long as it is stable.

Hello, @aa7!

We need details about your network settings in Google and the full VyOS configuration. I deployed VyOS from the marketplace 10 hours ago, and the default route is still there. So there must be something else that is not obvious to us at the moment. Ideally, you would provide a step-by-step guide on how to reproduce the route issue (starting from creating the VPC in Google Cloud).

We are also missing something with ARP. By default, Google will not assign an address to a host with any netmask except /32, so the routing table must not contain any directly connected broadcast networks. Moreover, broadcast is not supported inside a Google VPC (https://cloud.google.com/vpc/docs/vpc), so ARP cannot operate there. In the ARP table we should see only the gateway’s MAC, learned from incoming packets. If your situation is different, we need to know your Google VPC configuration and, again, a step-by-step guide on how to configure a topology like yours.
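
A quick check for the /32 point above could look like the following sketch. It assumes the one-line output format of `ip -o -4 addr show`, where the fourth field is the address with its prefix length; non_host_addrs is a hypothetical helper name.

```shell
#!/bin/sh
# Print every IPv4 address from `ip -o -4 addr show` output (on stdin)
# whose prefix length is not /32. Any address printed here creates a
# directly connected network route, which is the situation described
# above as something that should not exist on a GCP instance.
non_host_addrs() {
    awk '$3 == "inet" && $4 !~ /\/32$/ { print $4 }'
}
```

For example, `ip -o -4 addr show dev eth0 | non_host_addrs` lists only the eth0 addresses configured with a wider netmask; empty output means every address on the interface is a /32.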

Route:
The default route starts disappearing only after I change the interface IP from ephemeral to static.
I didn’t create the VPC; I’m using the default one.

ARP:
Again: my VPC is the default one (with auto mode networks). The project deployment is flat: a single (auto) subnet with all VMs in the same (L2) domain. Right after I create the VyOS instance, I can’t ping any other IP from it. But that could be due to the firewall rules. I just realized that GCP is a huge /32 routing domain. Let me check something and get back to you.

We also had an issue with this, so we decided to create an interface-route:

set protocols static interface-route [default gateway address]/32 next-hop-interface [eth0]

eth0 is a placeholder for the interface that is directly connected to the default gateway.

The reason for this issue is that GCP does not use normal layer-2 networks, so ARP fails. But when you just forward the packets to the right network, the gateway will pick them up nonetheless.

Hope this helps.

@Frank, you are completely right. The only thing we need to care about is sending a packet out of the correct interface. The Google gateway will do everything else.
I prefer to do it this way, but the result should be the same in most cases:

set interfaces ethernet eth0 address 'X.X.X.X/32'
set protocols static interface-route X.X.X.1/32 next-hop-interface eth0
set protocols static route 0.0.0.0/0 next-hop X.X.X.1

@aa7, the information about changing the address type is the point from which we can start. Did you change it on the fly, or did you stop the VM and then start it with the static one? It would be good to get the output of the following command after the route is lost:

sudo journalctl | grep dhc