I’ve got two firewalls in a cluster configuration that have been running happily for a couple of years (ver 1.1.6). Yesterday, out of the blue, all of my tunnels dropped at the same time. In the logs I’ve pulled down, I see a lot of events like this one at the time of the drop, for every one of my tunnels:
Apr 17 02:24:44 fw01 pluto: ERROR: "peer-x.x.x.x-tunnel-1" #494767: sendto on eth2 to x.x.x.x:500 failed in ISAKMP notify. Errno 22: Invalid argument
There were then subsequent similar messages that had “failed in delete notify” and “failed in EVENT_TRANSMIT”.
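In case it helps anyone trying to narrow down a similar log, here is a rough sketch of how the sendto failures could be tallied by interface, failure stage and errno. The regular expression is modelled only on the one log line quoted above, so it is an assumption that other pluto versions format the message the same way; the function name `tally_sendto_failures` is made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the single log line quoted in this post.
# Assumption: other pluto builds may word or punctuate this differently.
SENDTO_FAILURE = re.compile(
    r"sendto on (?P<iface>\S+) to (?P<peer>\S+?):(?P<port>\d+) "
    r"failed in (?P<stage>.+?)\. Errno (?P<errno>\d+): (?P<reason>.*)$"
)


def tally_sendto_failures(lines):
    """Count sendto failures per (interface, stage, errno).

    Lines that do not match the pattern are ignored.
    """
    counts = Counter()
    for line in lines:
        match = SENDTO_FAILURE.search(line)
        if match:
            key = (match["iface"], match["stage"], int(match["errno"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # The only real sample available here is the line quoted above.
    sample = [
        'Apr 17 02:24:44 fw01 pluto: ERROR: "peer-x.x.x.x-tunnel-1" '
        "#494767: sendto on eth2 to x.x.x.x:500 failed in ISAKMP notify. "
        "Errno 22: Invalid argument",
    ]
    for (iface, stage, errno), n in tally_sendto_failures(sample).items():
        print(f"{iface}  {stage}  errno={errno}  count={n}")
```

Feeding it the full log (for example by passing an open file object instead of the sample list) should show whether every failure was on the same interface and with the same errno, or whether the stages differ between tunnels.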
No changes have been made recently, I’ve never had this issue in the past, and nothing special was happening at the time. All other traffic continued as normal, but my tunnels just wouldn’t come up. I jumped on the primary and tried restarting the VPN process, but this made no difference. At this point I rebooted the primary, and when the secondary took over as active, the tunnels came up. However, when the primary came back online after the reboot and took over as active again, the tunnels failed to come up once more. This time, a restart of the VPN process got it all working again.
I then had what I believe to be a secondary issue: traffic through the firewall was failing to some servers but working to others, even though all of the servers were reachable from the firewall itself. I’m not sure what caused this, but I have a hunch that it may have been related to conntrack-sync. As soon as I rebooted the secondary, everything started working normally.
Anyway, I’m more interested in what may have caused the initial tunnel drop. A bit of searching turns up others with similar errors in their logs, but those cases usually involve PPPoE interfaces and/or DHCP addresses on the WAN, where tunnels don’t come up after the WAN interface obtains a new address. However, my links are Ethernet-connected with static IPs.
When the issue occurred, I was able to remote into the affected firewall via its WAN address and confirmed it could ping remote peers.