IPSec drops, then never recovers

Hi all,

Long-time EdgeOS/VyOS user, currently struggling with intermittent IPSec drops on VyOS 1.3.0-rc5. I’ve had the same issue across a few versions; I’m testing RC5 because it’s the latest and might include a fix.

Scenario
I’ve got a couple of VPNs up, each to a Ubiquiti EdgeRouter on the other end. Periodically (sometimes only after a day or more), the VPN drops, the IPSec SA disappears, and the VyOS router never tries to re-initiate the VPN until I explicitly tell it to (via reset vpn ipsec-peer or similar).

Environment

ryanb@ubnt01-sec:~$ show version

Version:          VyOS 1.3.0-rc5
Release Train:    equuleus

Built by:         Sentrium S.L.
Built on:         Tue 29 Jun 2021 08:26 UTC
Build UUID:       36f7c218-6ebb-497f-9ec5-676241e5c13a
Build Commit ID:  892e8689b3234e

Architecture:     x86_64
Boot via:         installed image
System type:      VMware guest

Hardware vendor:  VMware, Inc.
Hardware model:   VMware Virtual Platform
Hardware S/N:     VMware-42 39 17 9d b5 1f a4 1b-94 7f a3 b1 00 c7 51 5c
Hardware UUID:    9d173942-1fb5-1ba4-947f-a3b100c7515c

Copyright:        VyOS maintainers and contributors

VyOS IPSec configuration:

set vpn ipsec esp-group esp-azure compression 'disable'
set vpn ipsec esp-group esp-azure lifetime '3600'
set vpn ipsec esp-group esp-azure mode 'tunnel'
set vpn ipsec esp-group esp-azure pfs 'disable'
set vpn ipsec esp-group esp-azure proposal 1 encryption 'aes256'
set vpn ipsec esp-group esp-azure proposal 1 hash 'sha1'
set vpn ipsec esp-group esp-azure proposal 2 encryption 'aes256'
set vpn ipsec esp-group esp-azure proposal 2 hash 'sha256'
set vpn ipsec esp-group esp-azure proposal 3 encryption 'aes128'
set vpn ipsec esp-group esp-azure proposal 3 hash 'sha1'
set vpn ipsec ike-group ike-azure ikev2-reauth 'no'
set vpn ipsec ike-group ike-azure key-exchange 'ikev2'
set vpn ipsec ike-group ike-azure lifetime '28800'
set vpn ipsec ike-group ike-azure proposal 1 dh-group '2'
set vpn ipsec ike-group ike-azure proposal 1 encryption 'aes256'
set vpn ipsec ike-group ike-azure proposal 1 hash 'sha1'
set vpn ipsec ike-group ike-azure proposal 2 dh-group '2'
set vpn ipsec ike-group ike-azure proposal 2 encryption 'aes256'
set vpn ipsec ike-group ike-azure proposal 2 hash 'sha256'
set vpn ipsec ike-group ike-azure proposal 3 dh-group '2'
set vpn ipsec ike-group ike-azure proposal 3 encryption 'aes128'
set vpn ipsec ike-group ike-azure proposal 3 hash 'sha1'
set vpn ipsec ike-group ike-azure proposal 4 dh-group '2'
set vpn ipsec ike-group ike-azure proposal 4 encryption 'aes128'
set vpn ipsec ike-group ike-azure proposal 4 hash 'sha256'
set vpn ipsec ipsec-interfaces interface 'eth0'
set vpn ipsec logging log-level '1'
set vpn ipsec logging log-modes 'any'
set vpn ipsec logging log-modes 'ike'
set vpn ipsec logging log-modes 'esp'
set vpn ipsec logging log-modes 'net'
set vpn ipsec nat-traversal 'disable'
set vpn ipsec options disable-route-autoinstall
set vpn ipsec site-to-site peer 2.2.2.2 authentication mode 'pre-shared-secret'
set vpn ipsec site-to-site peer 2.2.2.2 authentication pre-shared-secret 'secret'
set vpn ipsec site-to-site peer 2.2.2.2 connection-type 'initiate'
set vpn ipsec site-to-site peer 2.2.2.2 default-esp-group 'esp-azure'
set vpn ipsec site-to-site peer 2.2.2.2 ike-group 'ike-azure'
set vpn ipsec site-to-site peer 2.2.2.2 ikev2-reauth 'inherit'
set vpn ipsec site-to-site peer 2.2.2.2 local-address '1.1.1.1'
set vpn ipsec site-to-site peer 2.2.2.2 vti bind 'vti10'

EdgeOS peer config:

set vpn ipsec allow-access-to-local-interface disable
set vpn ipsec auto-firewall-nat-exclude enable
set vpn ipsec disable-uniqreqids
set vpn ipsec esp-group FOO0 compression disable
set vpn ipsec esp-group FOO0 lifetime 3600
set vpn ipsec esp-group FOO0 mode tunnel
set vpn ipsec esp-group FOO0 pfs disable
set vpn ipsec esp-group FOO0 proposal 1 encryption aes256
set vpn ipsec esp-group FOO0 proposal 1 hash sha256
set vpn ipsec esp-group FOO2 compression disable
set vpn ipsec esp-group FOO2 lifetime 27000
set vpn ipsec esp-group FOO2 mode tunnel
set vpn ipsec esp-group FOO2 pfs disable
set vpn ipsec esp-group FOO2 proposal 1 encryption aes256
set vpn ipsec esp-group FOO2 proposal 1 hash sha256
set vpn ipsec ike-group FOO0 ikev2-reauth no
set vpn ipsec ike-group FOO0 key-exchange ikev2
set vpn ipsec ike-group FOO0 lifetime 28800
set vpn ipsec ike-group FOO0 proposal 1 dh-group 2
set vpn ipsec ike-group FOO0 proposal 1 encryption aes256
set vpn ipsec ike-group FOO0 proposal 1 hash sha256
set vpn ipsec ike-group FOO2 ikev2-reauth no
set vpn ipsec ike-group FOO2 key-exchange ikev2
set vpn ipsec ike-group FOO2 lifetime 28800
set vpn ipsec ike-group FOO2 proposal 1 dh-group 2
set vpn ipsec ike-group FOO2 proposal 1 encryption aes256
set vpn ipsec ike-group FOO2 proposal 1 hash sha256
set vpn ipsec site-to-site peer 1.1.1.1 authentication mode pre-shared-secret
set vpn ipsec site-to-site peer 1.1.1.1 authentication pre-shared-secret secret
set vpn ipsec site-to-site peer 1.1.1.1 connection-type respond
set vpn ipsec site-to-site peer 1.1.1.1 default-esp-group FOO2
set vpn ipsec site-to-site peer 1.1.1.1 description ipsec
set vpn ipsec site-to-site peer 1.1.1.1 ike-group FOO2
set vpn ipsec site-to-site peer 1.1.1.1 ikev2-reauth inherit
set vpn ipsec site-to-site peer 1.1.1.1 local-address 2.2.2.2
set vpn ipsec site-to-site peer 1.1.1.1 vti bind vti3
set vpn ipsec site-to-site peer 1.1.1.1 vti esp-group FOO2

Logs

7/19/2021 00:20,Info,ubnt01-sec.lan,06[IKE] <peer-2.2.2.2-tunnel-vti|13> giving up after 5 retransmits
7/19/2021 00:19,Info,ubnt01-sec.lan,%ADJCHANGE: neighbor 192.168.1.1(Unknown) in vrf default Down BGP Notification send
7/19/2021 00:19,Info,ubnt01-sec.lan,%NOTIFICATION: sent to neighbor 192.168.1.1 4/0 (Hold Timer Expired) 0 bytes 
7/19/2021 00:19,Info,ubnt01-sec.lan,05[NET] <peer-2.2.2.2-tunnel-vti|13> sending packet: from 1.1.1.1[4500] to 2.2.2.2[4500] (464 bytes)
7/19/2021 00:19,Info,ubnt01-sec.lan,05[IKE] <peer-2.2.2.2-tunnel-vti|13> retransmit 5 of request with message ID 22
7/19/2021 00:18,Info,ubnt01-sec.lan,12[NET] <peer-2.2.2.2-tunnel-vti|13> sending packet: from 1.1.1.1[4500] to 2.2.2.2[4500] (464 bytes)
7/19/2021 00:18,Info,ubnt01-sec.lan,12[IKE] <peer-2.2.2.2-tunnel-vti|13> retransmit 4 of request with message ID 22
7/19/2021 00:18,Info,ubnt01-sec.lan,08[NET] <peer-2.2.2.2-tunnel-vti|13> sending packet: from 1.1.1.1[4500] to 2.2.2.2[4500] (464 bytes)
7/19/2021 00:18,Info,ubnt01-sec.lan,08[IKE] <peer-2.2.2.2-tunnel-vti|13> retransmit 3 of request with message ID 22
7/19/2021 00:17,Info,ubnt01-sec.lan,02[NET] <peer-2.2.2.2-tunnel-vti|13> sending packet: from 1.1.1.1[4500] to 2.2.2.2[4500] (464 bytes)
7/19/2021 00:17,Info,ubnt01-sec.lan,02[IKE] <peer-2.2.2.2-tunnel-vti|13> retransmit 2 of request with message ID 22
7/19/2021 00:17,Info,ubnt01-sec.lan,11[NET] <peer-2.2.2.2-tunnel-vti|13> sending packet: from 1.1.1.1[4500] to 2.2.2.2[4500] (464 bytes)
7/19/2021 00:17,Info,ubnt01-sec.lan,11[IKE] <peer-2.2.2.2-tunnel-vti|13> retransmit 1 of request with message ID 22
7/19/2021 00:17,Info,ubnt01-sec.lan,14[NET] <peer-2.2.2.2-tunnel-vti|13> sending packet: from 1.1.1.1[4500] to 2.2.2.2[4500] (464 bytes)
7/19/2021 00:17,Info,ubnt01-sec.lan,14[ENC] <peer-2.2.2.2-tunnel-vti|13> generating CREATE_CHILD_SA request 22 [ SA No KE ]
7/19/2021 00:17,Info,ubnt01-sec.lan,14[IKE] <peer-2.2.2.2-tunnel-vti|13> initiating IKE_SA peer-2.2.2.2-tunnel-vti[15] to 2.2.2.2
7/19/2021 00:05,Info,ubnt01-sec.lan,06[IKE] <peer-2.2.2.2-tunnel-vti|13> CHILD_SA closed
7/19/2021 00:05,Info,ubnt01-sec.lan,06[IKE] <peer-2.2.2.2-tunnel-vti|13> received DELETE for ESP CHILD_SA with SPI c999eefe
7/19/2021 00:05,Info,ubnt01-sec.lan,06[ENC] <peer-2.2.2.2-tunnel-vti|13> parsed INFORMATIONAL response 21 [ D ]
7/19/2021 00:05,Info,ubnt01-sec.lan,06[NET] <peer-2.2.2.2-tunnel-vti|13> received packet: from 2.2.2.2[4500] to 1.1.1.1[4500] (80 bytes)
7/19/2021 00:05,Info,ubnt01-sec.lan,07[NET] <peer-2.2.2.2-tunnel-vti|13> sending packet: from 1.1.1.1[4500] to 2.2.2.2[4500] (80 bytes)
7/19/2021 00:05,Info,ubnt01-sec.lan,07[ENC] <peer-2.2.2.2-tunnel-vti|13> generating INFORMATIONAL request 21 [ D ]
7/19/2021 00:05,Info,ubnt01-sec.lan,07[IKE] <peer-2.2.2.2-tunnel-vti|13> sending DELETE for ESP CHILD_SA with SPI c12191de
7/19/2021 00:05,Info,ubnt01-sec.lan,07[IKE] <peer-2.2.2.2-tunnel-vti|13> closing CHILD_SA peer-2.2.2.2-tunnel-vti{134} with SPIs c12191de_i (5289 bytes) c999eefe_o (5289 bytes) and TS 0.0.0.0/0 === 0.0.0.0/0
7/19/2021 00:05,Info,ubnt01-sec.lan,07[IKE] <peer-2.2.2.2-tunnel-vti|13> outbound CHILD_SA peer-2.2.2.2-tunnel-vti{136} established with SPIs cd9504ec_i c4734dd7_o and TS 0.0.0.0/0 === 0.0.0.0/0
7/19/2021 00:05,Info,ubnt01-sec.lan,07[IKE] <peer-2.2.2.2-tunnel-vti|13> inbound CHILD_SA peer-2.2.2.2-tunnel-vti{136} established with SPIs cd9504ec_i c4734dd7_o and TS 0.0.0.0/0 === 0.0.0.0/0
7/19/2021 00:05,Info,ubnt01-sec.lan,07[CFG] <peer-2.2.2.2-tunnel-vti|13> selected proposal: ESP:AES_CBC_256/HMAC_SHA2_256_128/NO_EXT_SEQ
7/19/2021 00:05,Info,ubnt01-sec.lan,07[ENC] <peer-2.2.2.2-tunnel-vti|13> parsed CREATE_CHILD_SA response 20 [ SA No TSi TSr ]
7/19/2021 00:05,Info,ubnt01-sec.lan,07[NET] <peer-2.2.2.2-tunnel-vti|13> received packet: from 2.2.2.2[4500] to 1.1.1.1[4500] (208 bytes)
7/19/2021 00:05,Info,ubnt01-sec.lan,15[NET] <peer-2.2.2.2-tunnel-vti|13> sending packet: from 1.1.1.1[4500] to 2.2.2.2[4500] (288 bytes)
7/19/2021 00:05,Info,ubnt01-sec.lan,15[ENC] <peer-2.2.2.2-tunnel-vti|13> generating CREATE_CHILD_SA request 20 [ N(REKEY_SA) SA No TSi TSr ]
7/19/2021 00:05,Info,ubnt01-sec.lan,15[IKE] <peer-2.2.2.2-tunnel-vti|13> establishing CHILD_SA peer-2.2.2.2-tunnel-vti{136} reqid 2
7/19/2021 00:05,Info,ubnt01-sec.lan,15[KNL] creating rekey job for CHILD_SA ESP/0xc12191de/1.1.1.1

EdgeOS logs (peer)

Jul 19 00:00:39 08[KNL] creating acquire job for policy 192.168.1.1/32[tcp/34697] === 10.10.0.1/32[tcp/bgp] with reqid {2}
Jul 19 00:04:28 14[KNL] creating acquire job for policy 192.168.1.1/32[tcp/33693] === 10.10.0.1/32[tcp/bgp] with reqid {2}
Jul 19 00:05:42 05[IKE] <peer-1.1.1.1-tunnel-vti|11930> inbound CHILD_SA peer-1.1.1.1-tunnel-vti{1521} established with SPIs c4734dd7_i cd9504ec_o and TS 0.0.0.0/0 === 0.0.0.0/0
Jul 19 00:05:43 14[IKE] <peer-1.1.1.1-tunnel-vti|11930> closing CHILD_SA peer-1.1.1.1-tunnel-vti{1519} with SPIs c999eefe_i (5289 bytes) c12191de_o (5289 bytes) and TS 0.0.0.0/0 === 0.0.0.0/0
Jul 19 00:05:43 14[IKE] <peer-1.1.1.1-tunnel-vti|11930> outbound CHILD_SA peer-1.1.1.1-tunnel-vti{1521} established with SPIs c4734dd7_i cd9504ec_o and TS 0.0.0.0/0 === 0.0.0.0/0
Jul 19 00:08:01 08[KNL] creating acquire job for policy 192.168.1.1/32[tcp/40371] === 10.10.0.1/32[tcp/bgp] with reqid {2}
Jul 19 00:11:25 12[KNL] creating acquire job for policy 192.168.1.1/32[tcp/40541] === 10.10.0.1/32[tcp/bgp] with reqid {2}
Jul 19 00:14:24 09[IKE] <peer-1.1.1.1-tunnel-vti|11930> initiating IKE_SA peer-1.1.1.1-tunnel-vti[12071] to 1.1.1.1
Jul 19 00:14:53 06[KNL] creating acquire job for policy 192.168.1.1/32[tcp/45073] === 10.10.0.1/32[tcp/bgp] with reqid {2}
Jul 19 00:17:50 15[KNL] creating acquire job for policy 192.168.1.1/32[tcp/bgp] === 10.1.0.13/32[tcp/42177] with reqid {4}
Jul 19 00:17:50 10[IKE] <peer-1.1.1.1-tunnel-vti|12073> initiating IKE_SA peer-1.1.1.1-tunnel-vti[12073] to 1.1.1.1
Jul 19 00:18:34 06[KNL] creating acquire job for policy 192.168.1.1/32[tcp/42995] === 10.10.0.1/32[tcp/bgp] with reqid {2}
Jul 19 00:20:35 06[KNL] creating delete job for CHILD_SA ESP/0x00000000/1.1.1.1
Jul 19 00:21:54 05[KNL] creating acquire job for policy 192.168.1.1/32[tcp/36997] === 10.1.0.13/32[tcp/bgp] with reqid {4}
Jul 19 00:21:54 05[IKE] <peer-1.1.1.1-tunnel-vti|12075> initiating IKE_SA peer-1.1.1.1-tunnel-vti[12075] to 1.1.1.1
Jul 19 00:22:23 15[KNL] creating acquire job for policy 192.168.1.1/32[tcp/37755] === 10.10.0.1/32[tcp/bgp] with reqid {2}
Jul 19 00:24:39 12[KNL] creating delete job for CHILD_SA ESP/0x00000000/1.1.1.1
Jul 19 00:25:37 04[KNL] creating acquire job for policy 192.168.1.1/32[tcp/44323] === 10.1.0.13/32[tcp/bgp] with reqid {4}
Jul 19 00:25:37 04[IKE] <peer-1.1.1.1-tunnel-vti|12077> initiating IKE_SA peer-1.1.1.1-tunnel-vti[12077] to 1.1.1.1
Jul 19 00:25:50 12[KNL] creating acquire job for policy 192.168.1.1/32[tcp/38955] === 10.10.0.1/32[tcp/bgp] with reqid {2}
Jul 19 00:28:22 09[KNL] creating delete job for CHILD_SA ESP/0x00000000/1.1.1.1
Jul 19 00:29:28 13[KNL] creating acquire job for policy 192.168.1.1/32[tcp/44937] === 10.10.0.1/32[tcp/bgp] with reqid {2}
Jul 19 00:29:31 09[KNL] creating acquire job for policy 192.168.1.1/32[tcp/44945] === 10.1.0.13/32[tcp/bgp] with reqid {4}
Jul 19 00:29:31 07[IKE] <peer-1.1.1.1-tunnel-vti|12079> initiating IKE_SA peer-1.1.1.1-tunnel-vti[12079] to 1.1.1.1
Jul 19 00:32:16 07[KNL] creating delete job for CHILD_SA ESP/0x00000000/1.1.1.1
Jul 19 00:32:16 07[KNL] creating delete job for CHILD_SA ESP/0x00000000/1.1.1.1
Jul 19 00:32:41 05[KNL] creating acquire job for policy 192.168.1.1/32[tcp/41633] === 10.10.0.1/32[tcp/bgp] with reqid {2}
Jul 19 00:33:04 09[KNL] creating acquire job for policy 192.168.1.1/32[tcp/33455] === 10.1.0.13/32[tcp/bgp] with reqid {4}
Jul 19 00:33:04 07[IKE] <peer-1.1.1.1-tunnel-vti|12082> initiating IKE_SA peer-1.1.1.1-tunnel-vti[12082] to 1.1.1.1
Jul 19 00:35:49 13[KNL] creating delete job for CHILD_SA ESP/0x00000000/1.1.1.1
Jul 19 00:36:11 07[KNL] creating acquire job for policy 192.168.1.1/32[tcp/39795] === 10.10.0.1/32[tcp/bgp] with reqid {2}
Jul 19 00:36:29 10[KNL] creating acquire job for policy 192.168.1.1/32[tcp/35383] === 10.1.0.13/32[tcp/bgp] with reqid {4}
Jul 19 00:36:29 06[IKE] <peer-1.1.1.1-tunnel-vti|12083> initiating IKE_SA peer-1.1.1.1-tunnel-vti[12083] to 1.1.1.1
Jul 19 00:39:14 07[KNL] creating delete job for CHILD_SA ESP/0x00000000/1.1.1.1
Jul 19 00:39:36 06[KNL] creating acquire job for policy 192.168.1.1/32[tcp/42913] === 10.10.0.1/32[tcp/bgp] with reqid {2}
Jul 19 00:40:07 14[KNL] creating acquire job for policy 192.168.1.1/32[tcp/33337] === 10.1.0.13/32[tcp/bgp] with reqid {4}
Jul 19 00:40:07 05[IKE] <peer-1.1.1.1-tunnel-vti|12085> initiating IKE_SA peer-1.1.1.1-tunnel-vti[12085] to 1.1.1.1
Jul 19 00:42:52 06[KNL] creating delete job for CHILD_SA ESP/0x00000000/1.1.1.1

Notes

  • I’ve got monitoring set up for 2.2.2.2 and have never received a down notification from the endpoint, so I don’t think it’s actually down or not listening.
  • EdgeOS logs show it continually trying to create an acquire job for the VPN. I can confirm that even now, many hours after the VPN went down, I can see IKE packets coming in on my WAN interface, and the VyOS device effectively ignores them:
    ryanb@ubnt01-sec:~$ sudo tcpdump -vvv -ni eth0 host 2.2.2.2
    tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
    08:34:50.609267 IP (tos 0x0, ttl 53, id 50840, offset 0, flags [DF], proto UDP (17), length 364)
        2.2.2.2.500 > 1.1.1.1.500: [udp sum ok] isakmp 2.0 msgid 00000000 cookie 459e4349059a92c1->0000000000000000: parent_sa ikev2_init[I]:
        (sa: len=44
            (p: #1 protoid=isakmp transform=4 len=44
                (t: #1 type=encr id=aes (type=keylen value=0100))
                (t: #2 type=integ id=#12 )
                (t: #3 type=prf id=#5 )
                (t: #4 type=dh id=modp1024 )))
        (v2ke: len=128 group=modp1024 681c74c49c11cc9ee42a67b91cd8b3f4fcbdfddfe530bba3276bdc0023556a06fc687659c5bf0922c681c24e15389062b54b597257382eeb26308631b38a56b670d95872ae84d005ae21b266fab6cb7bab685702255a636851d6e60d23ab44f577021ccb803dc272e12c4713e0c7cd7a6beb7ec5cb2129b7bb6f5655da79c3ed)
        (nonce: len=32 nonce=(0f68cde8bd8dbf317fb897f60261b33fb7bd77c80b9080141f7f2eb133f6f4f0) )
        (n: prot_id=#0 type=16388(nat_detection_source_ip))
        (n: prot_id=#0 type=16389(nat_detection_destination_ip))
        (n: prot_id=#0 type=16430(status))
        (n: prot_id=#0 type=16431(status))
        (n: prot_id=#0 type=16406(status))
    08:35:32.599670 IP (tos 0x0, ttl 53, id 52672, offset 0, flags [DF], proto UDP (17), length 364)
        2.2.2.2.500 > 1.1.1.1.500: [udp sum ok] isakmp 2.0 msgid 00000000 cookie 459e4349059a92c1->0000000000000000: parent_sa ikev2_init[I]:
        (sa: len=44
            (p: #1 protoid=isakmp transform=4 len=44
                (t: #1 type=encr id=aes (type=keylen value=0100))
                (t: #2 type=integ id=#12 )
                (t: #3 type=prf id=#5 )
                (t: #4 type=dh id=modp1024 )))
        (v2ke: len=128 group=modp1024 681c74c49c11cc9ee42a67b91cd8b3f4fcbdfddfe530bba3276bdc0023556a06fc687659c5bf0922c681c24e15389062b54b597257382eeb26308631b38a56b670d95872ae84d005ae21b266fab6cb7bab685702255a636851d6e60d23ab44f577021ccb803dc272e12c4713e0c7cd7a6beb7ec5cb2129b7bb6f5655da79c3ed)
        (nonce: len=32 nonce=(0f68cde8bd8dbf317fb897f60261b33fb7bd77c80b9080141f7f2eb133f6f4f0) )
        (n: prot_id=#0 type=16388(nat_detection_source_ip))
        (n: prot_id=#0 type=16389(nat_detection_destination_ip))
        (n: prot_id=#0 type=16430(status))
        (n: prot_id=#0 type=16431(status))
        (n: prot_id=#0 type=16406(status))
    
  • Even using clear vpn ipsec-peer on the EdgeOS side does not ever get anything logged in VyOS. It just ignores the INIT request entirely.
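
A small sketch (not from the original post) for checking whether packets like the two captured above are retransmissions of a single request. It assumes the "cookie <initiator>-><responder>" format that tcpdump printed here; ike_spis is a hypothetical helper name.

```python
import re

# tcpdump prints the IKE SPIs as "cookie <initiator>-><responder>".
COOKIE_RE = re.compile(r"cookie ([0-9a-f]{16})->([0-9a-f]{16})")

def ike_spis(capture_text):
    """Return (initiator_spi, responder_spi) pairs found in tcpdump output."""
    return COOKIE_RE.findall(capture_text)

# The two packet header lines from the capture above, payload details omitted.
capture = """
08:34:50.609267 IP (tos 0x0, ttl 53, id 50840, offset 0, flags [DF], proto UDP (17), length 364)
    2.2.2.2.500 > 1.1.1.1.500: [udp sum ok] isakmp 2.0 msgid 00000000 cookie 459e4349059a92c1->0000000000000000: parent_sa ikev2_init[I]:
08:35:32.599670 IP (tos 0x0, ttl 53, id 52672, offset 0, flags [DF], proto UDP (17), length 364)
    2.2.2.2.500 > 1.1.1.1.500: [udp sum ok] isakmp 2.0 msgid 00000000 cookie 459e4349059a92c1->0000000000000000: parent_sa ikev2_init[I]:
"""

spis = ike_spis(capture)
# One distinct initiator SPI across packets means one request sent repeatedly.
same_request = len({initiator for initiator, _ in spis}) == 1
# An all-zero responder SPI means no responder has been bound to the exchange.
no_responder_spi = all(responder == "0" * 16 for _, responder in spis)
print(spis, same_request, no_responder_spi)
```

Both packets carry the same initiator SPI (and, in the full capture, the same nonce and key-exchange data), so the second looks like a retransmission of the first. That is what you would expect if the EdgeRouter never received an IKE_SA_INIT response.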

Thoughts on how I can proceed here?

Try these settings on the VyOS side.

set vpn ipsec ike-group ike-azure dead-peer-detection action 'restart'
set vpn ipsec ike-group ike-azure dead-peer-detection interval '30'
set vpn ipsec ike-group ike-azure dead-peer-detection timeout '120'

Thanks for that. Will try enabling DPD to see the behavior. Will report back in a few days with results.

Any chance there’s an option that doesn’t require DPD? Even a 30sec downtime (if I set my DPD to 10/30) is a lot in some prod cases, so trying to minimize that as much as possible.

Thanks!

How about using WireGuard?

I use WireGuard for many point-to-site things, but most platforms don’t support it natively (yet, I’m sure). EdgeOS has an installation script that’s mildly clunky but generally works, but things like Azure VPN Gateways don’t support it, and likely never will.

Hey there @Viacheslav, I’m still struggling with these IPSec drops. The same logging occurs as above.

I’ve implemented DPD via:

set vpn ipsec ike-group azure dead-peer-detection action 'restart'
set vpn ipsec ike-group azure dead-peer-detection interval '10'
set vpn ipsec ike-group azure dead-peer-detection timeout '30'

Yet I still struggle with the same issues as before - I see the IKE engine retransmitting 5 times, then it kills itself. It’s down long enough that BGP hold timers expire, etc.

Anything I can do to debug this? At first I thought that it could be an issue with actual network-layer connectivity issues, but I’ve ruled that out because:

  • I lose zero pings to anywhere else on the internet (I test several external endpoints and never see so much as a single failed ping) - meaning my own ISP, etc seem to be fine
  • I have this same behavior across multiple ISPs (including Microsoft Azure) on the other end of the IPSec tunnel, and the other end sees zero packet loss to the internet - meaning it’s not the other end’s ISP, either
  • It’s always the same exact behavior - 5 retransmits, then the VPN kills itself with “giving up after 5 retransmits” - never less than 5. I never see any other retransmits in the logs.
  • The tunnel comes RIGHT back up once it kills itself after the retransmits/timeouts.
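
For reference, a fixed count of five matches strongSwan's documented retransmission defaults (retransmit_timeout = 4.0, retransmit_base = 1.8, retransmit_tries = 5), where each wait is timeout * base ** n. The sketch below only evaluates that formula. It assumes VyOS 1.3 leaves those defaults unchanged, which is an assumption and not something confirmed in this thread; retransmit_schedule is a hypothetical helper name.

```python
def retransmit_schedule(timeout=4.0, base=1.8, tries=5):
    """Return the wait in seconds before each retransmit, plus the final
    wait after the last retransmit before the request is abandoned."""
    return [timeout * base ** n for n in range(tries + 1)]

waits = retransmit_schedule()
for n, wait in enumerate(waits[:-1], start=1):
    print(f"retransmit {n} after {wait:.1f}s")
print(f"give up {waits[-1]:.1f}s after the last retransmit")
print(f"total: {sum(waits):.0f}s")
```

With these values the whole sequence takes roughly 165 seconds, which is consistent with the 00:17 to 00:20 window in the first log above and is long enough for the BGP hold timer to expire, as that log shows. As far as the strongSwan documentation describes it, IKEv2 relies on this retransmission schedule rather than on the DPD timeout value, so lowering the DPD timeout alone may not shorten the outage.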

Hi @SlothCroissant, did you ever find a resolution to this? I have exactly the same issue. I have been using a Vyatta (V6.6) for years but needed something that supports IKEv2, so I went with the obvious choice, a VyOS appliance. Yet with the same config as the Vyatta (which has been rock solid), the VyOS drops the VPN about every hour! Thanks…

Hi @MartinCutts , could you please specify the following:

  1. VyOS installed platform
  2. VyOS software version
  3. Example IPsec configuration that exhibits the issue
  4. Remote side device/software?
  5. Any firewall on either end or in the middle that might affect the IPsec connection? (UDP ports 500 [IKE], 4500 [NAT-T], and IP protocol 50 [ESP])
  6. Logs from both ends showing the issue for the specified IPsec tunnel

Hi Elchin,

The VyOS appliance is from the AWS Marketplace:

Version:          VyOS 1.3.0-20220201090943
Release train:    equuleus

Built by:         support@vyos.io
Built on:         Tue 01 Feb 2022 09:10 UTC
Build UUID:       d9226fc7-8192-461e-bf00-deea5380a5ff
Build commit ID:  109f74f152ae42

Architecture:     x86_64
Boot via:         installed image
System type:      KVM guest

Hardware vendor:  Amazon EC2
Hardware model:   t3.small
Hardware S/N:     ec2654a4-f977-e442-5e7e-a5a8d422558a
Hardware UUID:    ec2654a4-f977-e442-5e7e-a5a8d422558a

Copyright:        VyOS maintainers and contributors

The VyOS config is as follows:

 esp-group ESP1 {
     compression disable
     lifetime 3600
     mode tunnel
     pfs disable
     proposal 1 {
         encryption aes256
         hash sha1
     }
     proposal 2 {
         encryption aes256
         hash md5
     }
     proposal 3 {
         encryption aes128
         hash sha1
     }
     proposal 4 {
         encryption aes128
         hash md5
     }
     proposal 5 {
         encryption 3des
         hash sha1
     }
     proposal 6 {
         encryption 3des
         hash md5
     }
 }
 esp-group ESP2 {
     compression disable
     lifetime 3600
     mode tunnel
     pfs dh-group14
     proposal 1 {
         encryption aes256
         hash sha256
     }
 }
 ike-group IKE1 {
     close-action none
     dead-peer-detection {
         action restart
         interval 30
         timeout 120
     }
     ikev2-reauth no
     key-exchange ikev1
     lifetime 28800
     proposal 1 {
         dh-group 5
         encryption aes256
         hash sha1
     }
     proposal 2 {
         dh-group 5
         encryption aes256
         hash md5
     }
     proposal 3 {
         dh-group 2
         encryption 3des
         hash sha1
     }
     proposal 4 {
         dh-group 2
         encryption 3des
         hash md5
     }
 }
 ike-group IKE2 {
     close-action none
     dead-peer-detection {
         action restart
         interval 30
         timeout 120
     }
     ikev2-reauth no
     key-exchange ikev2
     lifetime 28800
     mobike disable
     proposal 1 {
         dh-group 14
         encryption aes256
         hash sha256
     }
 }
 ipsec-interfaces {
     interface eth0
 }
 nat-networks {
     allowed-network 172.16.100.0/24 {
     }
     allowed-network 172.31.9.0/24 {
     }
 }
 nat-traversal disable
 site-to-site {
     peer xxx.xxx.xxx.xxx {
         authentication {
             mode pre-shared-secret
             pre-shared-secret xxxxxxxxxx
         }
         connection-type respond
         description "IKEv2 - xxxxxxxxxxxxxx"
         ike-group IKE2
         ikev2-reauth inherit
         local-address 172.31.253.254
         tunnel 1 {
             allow-nat-networks disable
             allow-public-networks disable
             esp-group ESP2
             local {
                 prefix 172.31.9.0/24
             }
             remote {
                 prefix 172.16.100.0/24
             }
         }
     }
     peer xxx.xxx.xxx.xxx {
         authentication {
             mode pre-shared-secret
             pre-shared-secret xxxxxxxxxx
         }
         connection-type respond
         description "IKEv2 - xxxxxxxxxxxxxxxx"
         ike-group IKE2
         ikev2-reauth inherit
         local-address 172.31.253.254
         tunnel 1 {
             allow-nat-networks disable
             allow-public-networks disable
             esp-group ESP2
             local {
                 prefix 172.31.9.0/24
             }
             remote {
                 prefix 172.16.100.0/24
             }
         }
     }
 }

The remote appliance is a DrayTek 3910 router. The same behavior is also being seen on a DrayTek 2862.

UDP 500, 4500 & ESP (50) are allowed via an AWS security group. The VPN connects just fine, just resets about every hour.

I haven’t got any logs from the DrayTek, but this is from the VyOS:

May 09 11:29:59 charon[2221]: 08[NET] <202> received packet: from xxx.xxx.xxx.xxx[500] to 172.31.253.254[500] (760 bytes)
May 09 11:29:59 charon[2221]: 08[ENC] <202> parsed IKE_SA_INIT request 0 [ SA KE No N(NATD_S_IP) N(NATD_D_IP) ]
May 09 11:29:59 charon[2221]: 08[IKE] <202> xxx.xxx.xxx.xxx is initiating an IKE_SA
May 09 11:29:59 charon[2221]: 08[CFG] <202> selected proposal: IKE:AES_CBC_256/HMAC_SHA2_256_128/PRF_HMAC_SHA2_256/MODP_2048
May 09 11:29:59 charon[2221]: 08[IKE] <202> local host is behind NAT, sending keep alives
May 09 11:29:59 charon[2221]: 08[ENC] <202> generating IKE_SA_INIT response 0 [ SA KE No N(NATD_S_IP) N(NATD_D_IP) N(MULT_AUTH) ]
May 09 11:29:59 charon[2221]: 08[NET] <202> sending packet: from 172.31.253.254[500] to xxx.xxx.xxx.xxx[500] (440 bytes)
May 09 11:29:59 charon[2221]: 14[NET] <202> received packet: from xxx.xxx.xxx.xxx[4500] to 172.31.253.254[4500] (224 bytes)
May 09 11:29:59 charon[2221]: 14[ENC] <202> parsed IKE_AUTH request 1 [ IDi AUTH SA TSi TSr ]
May 09 11:29:59 charon[2221]: 14[CFG] <202> looking for peer configs matching 172.31.253.254[%any]...xxx.xxx.xxx.xxx[xxx.xxx.xxx.xxx]
May 09 11:29:59 charon[2221]: 14[CFG] <peer-xxx.xxx.xxx.xxx-tunnel-1|202> selected peer config 'peer-xxx.xxx.xxx.xxx-tunnel-1'
May 09 11:29:59 charon[2221]: 14[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|202> authentication of 'xxx.xxx.xxx.xxx' with pre-shared key successful
May 09 11:29:59 charon[2221]: 14[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|202> authentication of '172.31.253.254' (myself) with pre-shared key
May 09 11:29:59 charon[2221]: 14[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|202> IKE_SA peer-xxx.xxx.xxx.xxx-tunnel-1[202] established between 172.31.253.254[172.31.253.254]...xxx.xxx.xxx.xxx[xxx.xxx.xxx.xxx]
May 09 11:29:59 charon[2221]: 14[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|202> scheduling rekeying in 27826s
May 09 11:29:59 charon[2221]: 14[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|202> maximum IKE_SA lifetime 28366s
May 09 11:29:59 charon[2221]: 14[CFG] <peer-xxx.xxx.xxx.xxx-tunnel-1|202> selected proposal: ESP:AES_CBC_256/HMAC_SHA2_256_128/NO_EXT_SEQ
May 09 11:29:59 charon[2221]: 14[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|202> CHILD_SA peer-xxx.xxx.xxx.xxx-tunnel-1{379} established with SPIs c0da03f8_i 26c388dd_o and TS 172.31.9.0/24 === 172.16.100.0/24
May 09 11:29:59 charon[2221]: 14[ENC] <peer-xxx.xxx.xxx.xxx-tunnel-1|202> generating IKE_AUTH response 1 [ IDr AUTH SA TSi TSr ]
May 09 11:29:59 charon[2221]: 14[NET] <peer-xxx.xxx.xxx.xxx-tunnel-1|202> sending packet: from 172.31.253.254[4500] to xxx.xxx.xxx.xxx[4500] (224 bytes)
May 09 11:30:01 CRON[16682]: pam_unix(cron:session): session opened for user root by (uid=0)
May 09 11:30:01 CRON[16683]: (root) CMD (/usr/libexec/vyos/vyos-check-wwan.py)
May 09 11:30:02 charon[2221]: 12[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> sending keep alive to xxx.xxx.xxx.xxx[4500]
May 09 11:30:02 CRON[16682]: pam_unix(cron:session): session closed for user root
May 09 11:30:05 charon[2221]: 07[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> retransmit 4 of request with message ID 1
May 09 11:30:05 charon[2221]: 07[NET] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> sending packet: from 172.31.253.254[4500] to xxx.xxx.xxx.xxx[4500] (256 bytes)
May 09 11:30:26 charon[2221]: 10[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> sending keep alive to xxx.xxx.xxx.xxx[4500]
May 09 11:30:46 charon[2221]: 15[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> sending keep alive to xxx.xxx.xxx.xxx[4500]
May 09 11:30:47 charon[2221]: 11[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> retransmit 5 of request with message ID 1
May 09 11:30:47 charon[2221]: 11[NET] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> sending packet: from 172.31.253.254[4500] to xxx.xxx.xxx.xxx[4500] (256 bytes)
May 09 11:31:08 charon[2221]: 07[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> sending keep alive to xxx.xxx.xxx.xxx[4500]
May 09 11:31:28 charon[2221]: 10[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> sending keep alive to xxx.xxx.xxx.xxx[4500]
May 09 11:31:48 charon[2221]: 06[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> sending keep alive to xxx.xxx.xxx.xxx[4500]
May 09 11:32:03 charon[2221]: 11[KNL] creating delete job for CHILD_SA ESP/0xc25f0538/172.31.253.254
May 09 11:32:03 charon[2221]: 08[JOB] CHILD_SA ESP/0xc25f0538/172.31.253.254 not found for delete
May 09 11:32:03 charon[2221]: 14[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> giving up after 5 retransmits
May 09 11:32:03 charon[2221]: 14[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> establishing IKE_SA failed, peer not responding

Thanks…
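
A rough sketch (not part of the original post) for grouping the retransmit and give-up events in a charon log by IKE_SA id, which helps tell apart the SA that times out from the one that was just established. The regular expression is based only on the line format shown above; summarize is a hypothetical helper name.

```python
import re
from collections import defaultdict

# charon tags each line with "<connection-name|unique-id>" before the message.
EVENT_RE = re.compile(
    r"<(?P<conn>[^|>]+)\|(?P<sa>\d+)> "
    r"(?P<event>retransmit (?P<n>\d+) of request with message ID (?P<mid>\d+)"
    r"|giving up after \d+ retransmits)"
)

def summarize(lines):
    """Map IKE_SA unique id -> retransmits seen and whether charon gave up."""
    summary = defaultdict(lambda: {"retransmits": [], "gave_up": False})
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue  # not a retransmit or give-up line
        entry = summary[m.group("sa")]
        if m.group("n"):
            entry["retransmits"].append((int(m.group("n")), int(m.group("mid"))))
        else:
            entry["gave_up"] = True
    return dict(summary)

# A few lines copied from the VyOS log above.
log_lines = [
    "May 09 11:29:59 charon[2221]: 14[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|202> scheduling rekeying in 27826s",
    "May 09 11:30:05 charon[2221]: 07[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> retransmit 4 of request with message ID 1",
    "May 09 11:30:47 charon[2221]: 11[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> retransmit 5 of request with message ID 1",
    "May 09 11:32:03 charon[2221]: 14[IKE] <peer-xxx.xxx.xxx.xxx-tunnel-1|201> giving up after 5 retransmits",
]

result = summarize(log_lines)
print(result)
```

On this excerpt only SA 201 shows retransmits and a give-up, while SA 202 (established at 11:29:59) does not appear in the summary. That suggests the failing exchange belongs to an older attempt, not to the SA the DrayTek had just negotiated.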

Anyone out there? I’ve managed to get the logs off the DrayTek router and the issue seems to be cookie related:

"2022-05-10 14:55:08", "## IKEv2 DBG : Process Packet : Receive IKEv2_AUTH request but can't find corresponding IKE SA for iCookie = 2d161d8e057724e7 rCookie = f0967ad13fd05ebf from xxx.xxx.xxx.xxx"
"2022-05-10 14:55:08", "## IKEv2 DBG : Recv IKEv2_AUTH[35] Request from xxx.xxx.xxx.xxx, Peer is IKEv2 Initiator"
"2022-05-10 14:54:27", "## IKEv2 DBG : Process Packet : Receive IKEv2_AUTH request but can't find corresponding IKE SA for iCookie = 2d161d8e057724e7 rCookie = f0967ad13fd05ebf from xxx.xxx.xxx.xxx"
"2022-05-10 14:54:27", "## IKEv2 DBG : Recv IKEv2_AUTH[35] Request from xxx.xxx.xxx.xxx, Peer is IKEv2 Initiator"
"2022-05-10 14:54:03", "## IKEv2 DBG : Process Packet : Receive IKEv2_AUTH request but can't find corresponding IKE SA for iCookie = 2d161d8e057724e7 rCookie = f0967ad13fd05ebf from xxx.xxx.xxx.xxx"
"2022-05-10 14:54:03", "## IKEv2 DBG : Recv IKEv2_AUTH[35] Request from xxx.xxx.xxx.xxx, Peer is IKEv2 Initiator"
"2022-05-10 14:53:50", "## IKEv2 DBG : Process Packet : Receive IKEv2_AUTH request but can't find corresponding IKE SA for iCookie = 2d161d8e057724e7 rCookie = f0967ad13fd05ebf from xxx.xxx.xxx.xxx"
"2022-05-10 14:53:50", "## IKEv2 DBG : Recv IKEv2_AUTH[35] Request from xxx.xxx.xxx.xxx, Peer is IKEv2 Initiator"
"2022-05-10 14:53:43", "## IKEv2 DBG : Process Packet : Receive IKEv2_AUTH request but can't find corresponding IKE SA for iCookie = 2d161d8e057724e7 rCookie = f0967ad13fd05ebf from xxx.xxx.xxx.xxx"
"2022-05-10 14:53:43", "## IKEv2 DBG : Recv IKEv2_AUTH[35] Request from xxx.xxx.xxx.xxx, Peer is IKEv2 Initiator"

Any ideas anyone?
Thanks
Martin
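
One way to read the DrayTek timestamps above is to compare the gaps between them with strongSwan's default retransmission backoff (4 * 1.8 ** n seconds). The sketch below does that comparison. Whether this VyOS build uses those defaults is an assumption, and gaps_seconds is a hypothetical helper name.

```python
from datetime import datetime

# Timestamps of the "can't find corresponding IKE SA" entries, oldest first.
stamps = [
    "2022-05-10 14:53:43",
    "2022-05-10 14:53:50",
    "2022-05-10 14:54:03",
    "2022-05-10 14:54:27",
    "2022-05-10 14:55:08",
]

def gaps_seconds(timestamps):
    """Return the whole-second gaps between consecutive timestamps."""
    times = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

gaps = gaps_seconds(stamps)
# Assumed strongSwan default backoff for retransmits 2 through 5.
expected = [4 * 1.8 ** n for n in range(1, 5)]
for gap, exp in zip(gaps, expected):
    print(f"observed {gap}s, assumed backoff {exp:.1f}s")
```

The gaps come out at 7, 13, 24 and 41 seconds, each within a second of the assumed backoff. If that reading is right, the DrayTek is receiving the VyOS retransmissions of a single IKE_AUTH request and rejecting them because it has no IKE SA for that SPI pair, rather than the packets being lost in transit.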

Hi, @MartinCutts!

Of course, for better analysis, you need to grab logs from the same moment of time from both sides. However, from what we see here, it seems like UDP/4500 traffic from 172.31.253.254 is disallowed.

I would check if all types of required traffic are allowed in AWS and on the remote side for both in/out directions first.

If everything is allowed, try to collect full configs and logs from both peers simultaneously when the problem occurs. It is better to have logs covering as long a period as possible, because IPSec relies on state that may not be clear from a short log.

AFAIK, to get logging on a DrayTek, you need to set up syslog on it.
I’d alter your IKE and ESP groups to have only a single proposal each.
Your IPsec lifetime is 1 hour, so it looks like renewing the SA is what gets you into trouble.
Try setting the side behind NAT as the initiator.

Sorry, not ignoring here, just saw that you’ve got some help from @e.khudiyev and I’m actually no longer using IPSEC VPN in my setup (we moved to WireGuard). Will be watching this closely to see how it progresses, though.

So, to add to the thread: I have had site-to-site VPNs configured for years (since 1.1.7) between my house and all my family members, and they have worked without issues. But after upgrading to 1.3 (1.3-rolling-202404242039), they broke in the way described above. The remote peers (in my case DHCP endpoints) stop retrying. From what I observed, the connection stays up until the rekey, then drops and no longer retries, almost as if the process died.

I tried to pull updates for 1.4 but they are blocked now.

I am in the process of rolling back to 1.2.6