"send disconnect: Broken pipe" with ssh/scp after T3781:'Revert the NAT implementation in 1.3 back to iptables'

dutty · October 3, 2021, 8:46pm

Hello,

It looks like nftables was replaced with iptables for NAT under T3781 in VyOS 1.3 Equuleus (1.3.0-epa1).
I now consistently get broken pipes (disconnections) while copying large files over ssh from a client inside the LAN to an outside server, with VyOS NAT in between.

I have a very typical setup: a 192.168.0.0/24 local subnet (LAN) and a VyOS router (1.3.0-epa1) with masquerade SNAT and one external WAN IP. The problem is that a client in the LAN subnet cannot copy large files over ssh to the external (cloud) server: the connection randomly disconnects with “disconnect: Broken pipe” in the logs. The copy itself runs smoothly right up to the moment of the disconnect, which is clearly visible on the server side with iotop. The disconnect happens at a random point, 7 to 50 minutes in. I have never been able to complete the copy (it normally takes about 5 hours given the file size and link speed). If I boot the previous image (1.3-rolling-20210813) with exactly the same config, everything works fine and ssh never disconnects.

I double-checked the conntrack settings and timeouts: everything is identical between the epa1 image and 1.3-rolling-20210813. The only difference is that the Aug 13 version uses nftables, whereas epa1 uses iptables. Setting ServerAliveInterval in the client’s ssh config doesn’t help either.
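For anyone who wants to repeat the conntrack comparison, here is a minimal sketch in Python. It assumes the output of `sysctl -a` was saved on each image and compares only the conntrack keys. The two sample dumps are made-up illustrations, not values measured on either image, and the helper names are hypothetical.

```python
def parse_sysctl(text):
    """Parse `sysctl -a`-style lines ("key = value"), keeping conntrack keys only."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and "conntrack" in key:
            values[key.strip()] = value.strip()
    return values


def diff_settings(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Illustrative dumps only (assumed values, not taken from the thread).
old_dump = """\
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_be_liberal = 0
net.ipv4.ip_forward = 1
"""
new_dump = """\
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_be_liberal = 1
net.ipv4.ip_forward = 1
"""

changes = diff_settings(parse_sysctl(old_dump), parse_sysctl(new_dump))
for key, (before, after) in changes.items():
    print(f"{key}: {before} -> {after}")
```

An empty result would support the statement that the conntrack settings are identical on both images.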

Does anybody know how to fix this?

Thanks a lot.

dutty · October 4, 2021, 9:13pm

I can consistently reproduce the issue on every Equuleus rolling release image I have tried since mid-September. Every time I installed a new image, the backup failed the following night. When I go back to the image dated Aug 13 (before the switch to iptables for NAT), the next night’s backup completes normally.
Since the backup script (which copies via ssh) runs on several machines inside the LAN, and they either all fail in the same manner or all succeed, I’m certain the problem is in VyOS. I attribute the issue to the NAT implementation switch that happened on Aug 26, because no other relevant change after that date appears in the 1.3 Equuleus changelog in the VyOS 1.3.x (equuleus) documentation.

p252 · October 10, 2021, 2:28pm

Out of curiosity, does the same thing happen on VyOS 1.2 (which uses iptables)? Also, and this may not be relevant, are offloads disabled on the Aug 13 image as opposed to the epa1 image? Are any errors displayed in the interface statistics?

dutty · October 10, 2021, 3:22pm

@p252 Thank you for your reply.
Regarding VyOS 1.2, unfortunately I cannot test it on this site because it’s a near-production environment, and this site never had VyOS 1.2 installed. We used a UBNT Edge Router prior to moving to VyOS 1.3 this summer. We are happy with VyOS 1.3 except for this one very annoying issue, whose root cause we cannot find.
Thank you for pointing me to offloads, it may be relevant. The config of the older image (the one without the issue) contains

offload {
    gro
    gso
    lro
    rps
    tso
    ufo
}

enabled on all physical ethernet interfaces, so I believe offloads should also be enabled in epa1 in our case. We will check whether disabling them helps.

p252 · October 10, 2021, 3:48pm

Hi dutty,

Just a warning: disabling offloads “could” have a performance/throughput impact, as the Linux kernel unfortunately cannot handle high throughput without the trick of merging normal packets into super-sized packets before processing (offloads). I was just curious about the offloads because, IIRC, VyOS 1.3 disabled offloads for a while, but then they were re-enabled by default - see T3619

dutty · October 11, 2021, 7:48am

Yesterday I switched to the epa1 image again (we had been running the older image dated Aug 13). Based on T3619, I removed all offload settings from the configuration in the old image prior to the upgrade and installed epa1 afresh. No errors on first boot; the whole config loaded successfully. As expected, {offload gro, gso, tso, sg} appeared in the config automatically (as described in T3619). In short, no anomalies.

Then, during the following night, we again experienced broken pipes during remote copying over ssh. Here is how it looks in our log:

Oct 11 01:58:45 Start Remote Backup of sda for ...
Oct 11 03:24:47 client_loop: send disconnect: Broken pipe

and another broken pipe from the same machine:

Oct 11 01:05:57 Start Remote Backup of sda for ...
Oct 11 02:18:40 client_loop: send disconnect: Broken pipe

and another broken pipe from another machine:

Oct 11 09:02:41 Start Remote Backup of sda for ...
Oct 11 09:04:59 client_loop: send disconnect: Broken pipe

and another broken pipe from a third machine:

Oct 11 09:38:09 Start Remote Backup of sda for ...
Oct 11 09:49:43 client_loop: send disconnect: Broken pipe

So all ssh copies failed. If we look more closely at the timestamps, we find:

The breaks are per connection, which does not look like an interface problem. The first two records are from the same source machine copying to the same target cloud server. The second connection breaks while the first one is still running; they overlap.
In the third and fourth cases the connection breaks quite quickly, after 138 seconds and 11m 34s. 138 s is shorter than any known conntrack timeout setting. The first two connections lasted much longer: 1h 26m 02s and 1h 12m 43s. These timings look like random disconnects.
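The session lengths can be rechecked directly from the log timestamps. Below is a minimal sketch in Python: the start/end pairs are copied from the log excerpts above, the year 2021 is an assumption (the log lines carry no year), and the helper names are hypothetical. Parsing `Oct` with `%b` assumes an English/C locale.

```python
from datetime import datetime

# (start, end) pairs copied from the log excerpts, in the order they appear.
SESSIONS = [
    ("Oct 11 01:58:45", "Oct 11 03:24:47"),
    ("Oct 11 01:05:57", "Oct 11 02:18:40"),
    ("Oct 11 09:02:41", "Oct 11 09:04:59"),
    ("Oct 11 09:38:09", "Oct 11 09:49:43"),
]


def parse(stamp, year=2021):
    """Parse a syslog-style timestamp; the year is assumed, not logged."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def duration_seconds(start, end):
    """Seconds between the backup start and the broken pipe."""
    return int((parse(end) - parse(start)).total_seconds())


def overlaps(a, b):
    """True if two (start, end) sessions were active at the same time."""
    return parse(a[0]) < parse(b[1]) and parse(b[0]) < parse(a[1])


durations = [duration_seconds(start, end) for start, end in SESSIONS]
for (start, end), total in zip(SESSIONS, durations):
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    print(f"{start} -> {end}: {hours}h {minutes:02d}m {seconds:02d}s ({total} s)")

print("first two sessions overlap:", overlaps(SESSIONS[0], SESSIONS[1]))
```

Comparing the printed durations against the configured conntrack timeouts shows whether any disconnect lines up with a fixed timer.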

Nothing in the VyOS logs indicates a problem around the timestamps of the connection losses.

VyOS runs as a KVM guest. The identical config is used with our older image (dated Aug 13), and all backup scripts always work without any disconnects on that image.

Any help will be highly appreciated.