1.2.1-S2 stops responding to NAT, SSH slow

Don’t know if this is a bug since there haven’t been other reports.

Background: We have 4 VMs running VyOS: two transparent firewalls (on 10G fiber connections) and two that handle NAT only, with one VM of each type on each of two KVM hosts. Each VM has 4 processors and 12 GB of RAM. We route traffic through NAT using policy-based routing on our Brocade (now Ruckus) ICX7750 switches.

We upgraded all 4 systems from 1.1.7 to 1.2.1-S2 (using add system image) to get the fix for CVE-2019-11477. The firewall machines have been working fine, but the NAT machines seem to stop handling traffic about every 30 minutes. A reboot resolves the issue. We tried installing a new VM from scratch and it ran for about 2 hours and then started showing the same symptoms. There is nothing logged about the issue. We downgraded one machine back to 1.1.7 and it works fine, so there seems to be some issue in the new version, not with the VM or hardware.

We aren’t even sure where to start testing to see what might be happening.

Have you checked the logs/dmesg? Is there anything in there telling you that the conntrack table is overflowing or that ARP thresholds are being hit?
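
If it helps, here is a rough Python sketch of that check, to run on one of the NAT VMs while it is misbehaving. It assumes the standard Linux netfilter paths under /proc/sys/net/netfilter exist on your image and that dmesg is readable (it may need root). The function names are made up for illustration, and the message patterns are my guess at the usual kernel wording, which varies between kernel versions.

```python
import re
import subprocess
from pathlib import Path

# Standard Linux netfilter counters (assumed to be present on the VyOS image).
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")

# Assumed wording of the relevant kernel messages; adjust if your kernel differs.
OVERFLOW_PATTERN = re.compile(
    r"nf_conntrack: table full|neighbou?r table overflow",
    re.IGNORECASE,
)


def conntrack_usage(count, maximum):
    """Return the fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def find_overflow_lines(lines):
    """Return the log lines that look like conntrack or ARP table overflows."""
    return [line for line in lines if OVERFLOW_PATTERN.search(line)]


def report():
    """Print conntrack usage and any overflow messages found in dmesg."""
    count = int(COUNT_PATH.read_text())
    maximum = int(MAX_PATH.read_text())
    usage = conntrack_usage(count, maximum)
    print(f"conntrack: {count}/{maximum} ({usage:.1%} used)")

    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    )
    hits = find_overflow_lines(result.stdout.splitlines())
    if hits:
        print("possible overflow messages:")
        for line in hits:
            print("  " + line)
    else:
        print("no overflow messages found in dmesg")
```

Call report() from a Python 3.7+ shell on the router. If usage is close to 100% when the box stops passing NAT traffic, or if any overflow lines show up, that would point at table limits rather than at the VM or hardware.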