Set large MTU on a tunnel


#1

I have two VyOS routers connected by a link that supports very large frames. I have set the MTU of the main Ethernet card to 4000; it actually supports much larger, and each router can ping the other with a size of 4000 no problem.

I have set up a gre-bridge tunnel on the two routers and set the MTU to 3800 on each tunnel interface.

But the effective MTU seems to be smaller, because I can't ping with more than 1462 bytes.

show int tunnel… shows an MTU of 3800, but in reality the tunnel isn't honoring it.
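For what it's worth, 1462 looks suspiciously like a 1500-byte path MTU minus the gre-bridge (gretap) overhead (assuming an IPv4 outer header with no GRE key or checksum options):

```shell
# gre-bridge (gretap) per-packet overhead, assuming an IPv4 outer
# header and no optional GRE key/checksum fields:
OUTER_IP=20    # outer IPv4 header
GRE=4          # basic GRE header
INNER_ETH=14   # inner Ethernet header carried by the L2 tunnel
echo "max inner IP packet at 1500 path MTU: $((1500 - OUTER_IP - GRE - INNER_ETH))"
```

That comes out to exactly 1462, as if something on the path were still clamped to 1500.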

I also tried l2tpv3, and the limit was even lower.

I found this online. Is it related? https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1743746

any ideas how I can resolve this?

thank you!


#2

Anyone? I can really use some help here.

I have the physical NIC's MTU at 4000 and the tunnel MTU set to 3800, but it is still limited to 1462. Does anyone know how I can fix this? US$100 bounty to the first person who can provide a working fix. Thank you!


#3

Hi Barry,

I had a look into it, but I can't reproduce it; see my config below. Is there any other device within the path which could have a lower MTU and is sending you an ICMP back to adjust your MTU size?

tunnel tun0 {
    address 127.0.0.100/24
    encapsulation gre
    local-ip 127.0.0.10
    mtu 4000
}

tun0      Link encap:UNSPEC  HWaddr 7F-00-00-0A-FF-00-00-00-00-00-00-00-00-00-00-00
          inet addr:127.0.0.100  Mask:255.255.255.0
          inet6 addr: fe80::5efe:7f00:a/64 Scope:Link
          UP RUNNING NOARP  MTU:4000  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

tcpdump

22:18:28.363091 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 4042: 127.0.0.20 > 127.0.0.20: ICMP echo request, id 7055, seq 1, length 4008
22:18:28.363097 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 4042: 127.0.0.20 > 127.0.0.20: ICMP echo reply, id 7055, seq 1, length 4008
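One quick way to check for a 1500-byte hop: send DF-bit pings of increasing size directly between the tunnel endpoints (over the underlay, not through the tunnel). I'm not sure the 1.1.8 ping wrapper exposes a DF option, but from a Linux shell `ping -M do` does it; the payload size maps to the on-wire packet like this:

```shell
# From a Linux shell on the router (addresses are examples):
#   ping -M do -s 1472 185.239.9.9    # fits in a 1500-byte MTU
#   ping -M do -s 1473 185.239.9.9    # fails with "Frag needed" if a hop is at 1500
# The largest payload that gets through tells you the path MTU:
LARGEST_OK=1472
echo "path MTU: $((LARGEST_OK + 8 + 20))"   # payload + ICMP header + IPv4 header
```

Walk the size up toward 4000 and you'll find exactly where the path tops out.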


#4

Thank you for helping me. Yes, there will ultimately be a couple of devices in the path, which is why I'm making a gre-bridge, not just a simple gre tunnel.

In the end it’s going to be closer to

R1—R2—R3—R4

with the ends R1 and R4 each being one side of an L2 tunnel. So my encapsulation is gre-bridge, not 'gre'.

On the physical side, all links support very large 9000-byte frames. I have the MTU of the physical NICs set to 4000 and can ping without fragmenting no problem.

But with the MTU at 3800 on the tunnels and an IP configured on R1 and R2 for testing purposes, I can’t send any packet larger than 1462 bytes over the tunnel.

R1:

set interfaces bridge br32 address '192.168.149.246/30'
set interfaces bridge br32 aging '300'
set interfaces bridge br32 hello-time '2'
set interfaces bridge br32 max-age '20'
set interfaces bridge br32 priority '0'
set interfaces bridge br32 stp 'false'

set interfaces tunnel tun232 encapsulation 'gre-bridge'
set interfaces tunnel tun232 local-ip '195.66.212.1'
set interfaces tunnel tun232 mtu '3900'
set interfaces tunnel tun232 multicast 'enable'
set interfaces tunnel tun232 parameters ip bridge-group bridge 'br32'
set interfaces tunnel tun232 remote-ip '185.239.9.9'

R2:

set interfaces bridge br32 address '192.168.149.245/30'
set interfaces bridge br32 aging '300'
set interfaces bridge br32 hello-time '2'
set interfaces bridge br32 max-age '20'
set interfaces bridge br32 priority '0'
set interfaces bridge br32 stp 'false'

set interfaces tunnel tun263 encapsulation 'gre-bridge'
set interfaces tunnel tun263 local-ip '185.239.9.9'
set interfaces tunnel tun263 mtu '3900'
set interfaces tunnel tun263 multicast 'enable'
set interfaces tunnel tun263 parameters ip bridge-group bridge 'br32'
set interfaces tunnel tun263 remote-ip '195.66.212.1'

Ping test: size 1434 works, but even 2000 bytes fails:

ping 192.168.149.245 size 1434
PING 192.168.149.245 (192.168.149.245) 1434(1462) bytes of data.
1442 bytes from 192.168.149.245: icmp_req=1 ttl=64 time=1.58 ms
1442 bytes from 192.168.149.245: icmp_req=2 ttl=64 time=1.58 ms
^C
--- 192.168.149.245 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 1.586/1.586/1.586/0.000 ms
vyos@uk2-1:~$ ping 192.168.149.245 size 2000
PING 192.168.149.245 (192.168.149.245) 2000(2028) bytes of data.
^C
--- 192.168.149.245 ping statistics ---
16 packets transmitted, 0 received, 100% packet loss, time 15016ms
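To spell out what the transcript numbers mean: the ping "size" is just the ICMP payload, and the number in parentheses is the full IP packet (payload plus 8-byte ICMP header plus 20-byte IPv4 header):

```shell
# ping "size" is the ICMP payload; on the wire it's size + 8 (ICMP) + 20 (IPv4)
for SIZE in 1434 2000; do
  echo "size $SIZE -> $((SIZE + 8 + 20))-byte IP packet"
done
```

So 1462 is exactly the biggest IP packet that makes it through the tunnel.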


#5

Hi Barry,

Not sure if I can help, but let's try to find the issue. I'm quite sure it's some device which cuts back the MTU; I'll need until at least tomorrow to build a proper testing ground.
What you could do meanwhile, in case you haven't done it yet, is check the MTU on the br devices and the gre device.
e.g:
sh int detail | match br

br32: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
inet 10.200.200.2/24 brd 10.200.200.255 scope global br32

See, mine would cut it back to 1500; my eth1 is at 9000 and working. Basically I'm trying to mimic what you have in place.

eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP group default qlen 1000
inet 10.100.100.2/24 brd 10.100.100.255 scope global eth1


#6

Thank you for following up.

I can ping unfragmented packets between the physical NICs no problem. I confirmed this by monitoring the traffic.

I can also see, when I run show int bridge br32, etc., on both sides, that the MTU is set properly; it's just that when I send pings over, they don't go larger than ~1460. I agree it's quite odd.

I’ve been out all day. I will post full details tomorrow.

btw, the edge routers are running 1.1.8 and the middle routers vyos-1.2.0-rolling+201806150337-amd64, so in this test one of them is 1.1.8. I have not tried with only the rolling build yet.

thanks again!


#7

I tested your scenario, but only with the rolling releases, and everything is working just fine. Please verify the MTU on the bridge port interfaces; the gre bridge should inherit it from there, which it did correctly when I tested it. I've got the feeling that you hit a bug in 1.1.8.


#8

I ran a test again, this time with just the rolling builds. I notice when I set the MTU of the physical NIC, I can’t seem to send out any pings > 1500.

So I removed the MTU setting on the physical interface and am just trying simple pings. Check this out… does it look odd to you?

From router 1 I send pings over 3000 bytes long. But check the tcpdump on the receiving side! Length 1500??? Huh?

ping 23.227.1.3 size 3000
PING 23.227.1.3 (23.227.1.3) 3000(3028) bytes of data.
3008 bytes from 23.227.1.3: icmp_seq=1 ttl=52 time=68.9 ms
3008 bytes from 23.227.1.3: icmp_seq=2 ttl=52 time=69.0 ms
3008 bytes from 23.227.1.3: icmp_seq=3 ttl=52 time=69.0 ms
3008 bytes from 23.227.1.3: icmp_seq=4 ttl=52 time=69.0 ms
3008 bytes from 23.227.1.3: icmp_seq=5 ttl=52 time=69.1 ms
3008 bytes from 23.227.1.3: icmp_seq=6 ttl=52 time=68.8 ms
3008 bytes from 23.227.1.3: icmp_seq=7 ttl=52 time=69.3 ms
3008 bytes from 23.227.1.3: icmp_seq=8 ttl=52 time=68.9 ms
3008 bytes from 23.227.1.3: icmp_seq=9 ttl=52 time=69.0 ms
3008 bytes from 23.227.1.3: icmp_seq=10 ttl=52 time=68.8 ms
3008 bytes from 23.227.1.3: icmp_seq=11 ttl=52 time=69.1 ms
3008 bytes from 23.227.1.3: icmp_seq=12 ttl=52 time=69.0 ms
3008 bytes from 23.227.1.3: icmp_seq=13 ttl=52 time=69.2 ms
3008 bytes from 23.227.1.3: icmp_seq=14 ttl=52 time=69.0 ms
3008 bytes from 23.227.1.3: icmp_seq=15 ttl=52 time=69.1 ms

but check the tcpdump:

20:00:40.715075 IP (tos 0x28, ttl 64, id 1932, offset 0, flags [+], proto ICMP (1), length 1500)
23.227.1.3 > 185.239.1.14: ICMP echo reply, id 7024, seq 14, length 1480
20:00:40.715078 IP (tos 0x28, ttl 64, id 1932, offset 1480, flags [+], proto ICMP (1), length 1500)
23.227.1.3 > 185.239.1.14: icmp
20:00:40.715078 IP (tos 0x28, ttl 64, id 1932, offset 2960, flags [none], proto ICMP (1), length 68)
23.227.1.3 > 185.239.1.14: icmp
20:00:41.716192 IP (tos 0x28, ttl 52, id 46367, offset 0, flags [+], proto ICMP (1), length 1500)
185.239.1.14 > 23.227.1.3: ICMP echo request, id 7024, seq 15, length 1480
20:00:41.716200 IP (tos 0x28, ttl 52, id 46367, offset 1480, flags [+], proto ICMP (1), length 1500)
185.239.1.14 > 23.227.1.3: icmp
20:00:41.716201 IP (tos 0x28, ttl 52, id 46367, offset 2960, flags [none], proto ICMP (1), length 68)
185.239.1.14 > 23.227.1.3: icmp
20:00:41.716228 IP (tos 0x28, ttl 64, id 2015, offset 0, flags [+], proto ICMP (1), length 1500)
23.227.1.3 > 185.239.1.14: ICMP echo reply, id 7024, seq 15, length 1480
20:00:41.716229 IP (tos 0x28, ttl 64, id 2015, offset 1480, flags [+], proto ICMP (1), length 1500)
23.227.1.3 > 185.239.1.14: icmp

Shouldn't these pings be going through unfragmented? Yet the length in the dump says 1500.
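Doing the math on the dump, the fragments are at least self-consistent with a 1500-byte MTU somewhere on the path:

```shell
# A 3000-byte ping is a 3028-byte IP packet (3000 payload + 8 ICMP + 20 IP).
# At a 1500-byte MTU, each fragment carries 1480 data bytes (1500 - 20 header):
TOTAL=3028; HDR=20; PER_FRAG=$((1500 - HDR))
DATA=$((TOTAL - HDR))                                        # 3008 bytes to carry
echo "fragment offsets: 0 $PER_FRAG $((2 * PER_FRAG))"       # 0 1480 2960
echo "last fragment length: $((DATA - 2 * PER_FRAG + HDR))"  # 68
```

Those are exactly the offsets (0, 1480, 2960) and the 68-byte final fragment in the dump, so the ping "succeeds" only because the fragments get reassembled at the far end.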

what do you make of this?

thank you much!


#9

I think you are right: either something in the middle or a bug in 1.1.8.

I’ll find it and report back

thank you


#10

What hardware are you running it on? Normally if you set the MTU, it is handled by the kernel, so you would see 'x' bytes leaving the machine. Is it possible that offloading to your NICs is getting in the way? My tests were on virtual hardware; now I did it between 2 laptops, no issues. The only difference is that I use the rolling releases; the 1.1 branch is based on Debian Wheezy, which is out of support, so from a security perspective I wouldn't use it anymore.
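If you want to rule offload in or out, you could check and temporarily disable GSO/GRO on the NIC from a root shell. Note this is Linux-level configuration, not VyOS CLI, it doesn't survive a reboot, and `eth0` here is just an example interface name:

```shell
# Show the current offload settings:
ethtool -k eth0 | grep -E 'segmentation-offload|receive-offload'

# Temporarily turn off generic segmentation/receive offload for testing:
ethtool -K eth0 gso off gro off
```

With offload disabled, tcpdump shows packets as they actually appear on the wire, which makes MTU debugging much less confusing.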


#11

seems there is a hidden device in the middle blocking this. errrrr

I really need to get some small giants through. Any ideas? Is bonding tunnels an option?

I believe I owe you $$. Send me your PayPal, please.

thank you


#12

I don't use PayPal, and I also didn't answer for the money :). The only real solution is to find and replace the device/interface which lowers the MTU. If you have multiple uplinks and your connected switch supports LACP, you could do link aggregation (aka EtherChannel) to join multiple links into a bigger one. Your bottleneck will be any link with lower bandwidth than the total of your aggregated links. I think those are all the options you have on layer 2.
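If you do go the LACP route, a minimal VyOS bonding sketch looks like this (interface names and the address are examples; on 1.1.x members are attached with bond-group, while recent rolling builds use `set interfaces bonding bond0 member interface eth1` instead):

```shell
set interfaces bonding bond0 mode '802.3ad'
set interfaces bonding bond0 address '192.0.2.1/24'
set interfaces ethernet eth1 bond-group 'bond0'
set interfaces ethernet eth2 bond-group 'bond0'
```

Bear in mind this aggregates bandwidth, not frame size: it won't get a 4000-byte frame through a 1500-byte hop.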