Very long time commit

I have an installation with several hundred interfaces and about a thousand static routes. A commit of any change (even a minor one) takes several minutes. Is there a way to optimize the commit?

Is it faster or slower than a regular reboot in your case?

I'm mainly interested in the timestamps at the end of a boot/reboot:

  1. vyos-router[xxx]: Waiting for NICs to settle down: settled in 1secs…
  2. vyos-router[xxx]: Mounting VyOS Config…done.
  3. vyos-router[xxx]: Starting VyOS router: migrate configure.
  4. vyos-config[xxx]: Configuration success

In my case I run this (VyOS 1.4 rolling release) in Virtualbox with 4 virtual NICs, and the host CPU is an Intel i5-4250U (2 VCPU exposed). At boot these clock in at about:

  1. 24.89 sec
  2. 31.07 sec
  3. 49.12 sec
  4. 49.34 sec

For a minor change such as “set firewall all-ping disable” followed by a commit, the commit takes roughly 3-4 seconds.

The commit time depends on the size of the configuration. The more interfaces and routes I have, the slower the commit. Even if I add just one line, the commit takes several minutes. The commit time does not depend on the size of the change.

We have a DEV stand with those parameters:

Total lines of config:

show configuration commands | wc -l
3623

Total count of interfaces:

show interfaces  | wc -l
535

Total count of static routes:

show ip route  vrf internet static | wc -l
265

Time to show diffs:

time $(compare)

real	0m9.079s
user	0m1.211s
sys	0m2.255s

Time of commit:

time $(commit)

real	0m43.402s
user	0m2.943s
sys	0m6.771s

The increase in commit time depends only on the size of the config.
We noticed that during a commit the “unionfs-fuse -o cow -o allow_other ...” process takes most of the time.

I tested injecting 1024 static routes (fresh reboot of VyOS 1.4-rolling-202307200317, entered configuration mode, ran “commit” followed by “save”, and then ran “/config/inject.sh”).

Script used to inject a specific number of routes (/config/inject.sh):

#!/bin/vbash

# Include VyOS functions
source /opt/vyatta/etc/functions/script-template

# Script debugging:
#set -x

# Set variables
OCTET_A_START=0
OCTET_A_END=0
OCTET_B_START=0
OCTET_B_END=0
OCTET_C_START=0
OCTET_C_END=3
OCTET_D_START=0
OCTET_D_END=255

# Inject configuration
configure
for OCTET_A in `seq ${OCTET_A_START} ${OCTET_A_END}`
do
	for OCTET_B in `seq ${OCTET_B_START} ${OCTET_B_END}`
	do
		for OCTET_C in `seq ${OCTET_C_START} ${OCTET_C_END}`
		do
			for OCTET_D in `seq ${OCTET_D_START} ${OCTET_D_END}`
			do
				echo "set vrf name INTERNET protocols static route ${OCTET_A}.${OCTET_B}.${OCTET_C}.${OCTET_D}/32 next-hop 192.0.2.1 distance 1"
				set vrf name INTERNET protocols static route ${OCTET_A}.${OCTET_B}.${OCTET_C}.${OCTET_D}/32 next-hop 192.0.2.1 distance 1
			done
		done
	done
done
commit
exit

Observing through ssh, about 5 routes were added per second.

Also saw this:

root@vyos:/home/vyos# ps auxwww | grep -i union
root        3046 13.7  0.0 299624  2772 ?        Ssl  04:04   0:31 unionfs-fuse -o cow -o allow_other /opt/vyatta/config/tmp/changes_only_3030=RW:/opt/vyatta/config/active=RO /opt/vyatta/config/tmp/new_config_3030

Once it reached “commit”, the my_commit process took a few seconds, followed by unionfs-fuse, and then /usr/lib/frr/staticd started to work for some time.

Even though staticd took 90+% of CPU (according to htop), the other frr features that I don't currently use also took 2-3% CPU!?

Total time to inject 1024 static routes was a few minutes.

Afterwards, entering configuration mode and running “show” takes 5-6 seconds before it spits out the current configuration in json format.

Rebooting device (the config file is about 160kbyte in size) takes:

  1. vyos-router[xxx]: Waiting for NICs to settle down: settled in 1secs…
  2. vyos-router[xxx]: Mounting VyOS Config…done.
  3. vyos-router[xxx]: Starting VyOS router: migrate configure.
  4. vyos-config[xxx]: Configuration success

Timestamps as seen in console:

  1. 29.85 sec
  2. 36.06 sec
  3. 306.76 sec
  4. 306.77 sec
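
To make the two sets of boot timestamps easier to compare, here is a minimal sketch that prints the time spent between consecutive stages. The function name stage_durations is made up for this post; the numbers are the console timestamps quoted in this thread.

```shell
#!/bin/bash
# Print the time spent between consecutive boot-stage timestamps.
# Arguments: timestamps in seconds, in the order they appear on the console.
stage_durations() {
    local prev="" ts
    for ts in "$@"; do
        if [ -n "$prev" ]; then
            # awk handles the decimal subtraction (bash arithmetic is integer-only)
            awk -v a="$prev" -v b="$ts" 'BEGIN { printf "%.2f\n", b - a }'
        fi
        prev="$ts"
    done
}

echo "Baseline boot (small config):"
stage_durations 24.89 31.07 49.12 49.34

echo "Boot with 1024 static routes:"
stage_durations 29.85 36.06 306.76 306.77
```

The second value of each run is the gap between “Mounting VyOS Config” and “Starting VyOS router”, which is the step that grows from about 18 seconds to about 270 seconds.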

After the above reboot, entering configuration mode, changing something trivial like “set firewall all-ping disable” and then committing takes about 1 minute to complete. The same goes for changing it back to “set firewall all-ping enable”.

So yeah I think there might be room for improvement here :slight_smile:

I have filed this as a bug, because it's not good that booting takes more than 5 minutes (normally less than 1 minute) just from adding 1024 static routes. Something is broken somewhere (or there is plenty of room for improvement). Commit times in configuration mode also increased to about 1 minute:

https://vyos.dev/T5388

Hopefully it's an easy fix, knock on wood.

Perhaps a maintainer can answer this (anyone is free to chime in :slight_smile: )?

How come unionfs-fuse is being used during configuration mode when the root system already uses overlayfs?

root@vyos:/home/vyos# ps auxwww | grep -i union
root 3046 13.7 0.0 299624 2772 ? Ssl 04:04 0:31 unionfs-fuse -o cow -o allow_other /opt/vyatta/config/tmp/changes_only_3030=RW:/opt/vyatta/config/active=RO /opt/vyatta/config/tmp/new_config_3030

and:

root@vyos:~# mount | grep -i overlay
tmpfs on /usr/lib/live/mount/overlay type tmpfs (rw,relatime)
overlay on / type overlay (rw,noatime,lowerdir=/live/rootfs/1.4-rolling-202307220749.squashfs/,upperdir=/live/persistence/boot/1.4-rolling-202307220749/rw,workdir=/live/persistence/boot/1.4-rolling-202307220749/work)
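
For anyone who wants to check this on their own box, here is a rough sketch that lists overlay and FUSE-based mounts side by side. It assumes the unionfs-fuse mount shows up in /proc/mounts with a filesystem type starting with “fuse.”, which I have not verified on VyOS; the function name union_mounts is made up.

```shell
#!/bin/bash
# Read lines in /proc/mounts format on stdin and print "fstype mountpoint"
# for every kernel overlay mount and every FUSE-based mount.
# Assumption: unionfs-fuse mounts report a type of "fuse" or "fuse.<name>".
union_mounts() {
    awk '$3 == "overlay" || $3 == "fuse" || $3 ~ /^fuse\./ { print $3, $2 }'
}

# Run against the live mount table if it is readable
if [ -r /proc/mounts ]; then
    union_mounts < /proc/mounts
fi
```

Running it once outside configuration mode and once inside a configuration session should show whether the FUSE mount only exists while a session is open.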

This is an interesting design choice. Can you help me understand why you’re using so many static routes instead of using a dynamic protocol?

Some may remember the days before ipset was available, and loading large firewall rules on Linux took a very long time like you’re describing because it had to load each line of the firewall individually.

I assume a similar thing is happening in this case.

In my case it's just a test case to reproduce the issues the topic creator reported.

But having many static routes is not that uncommon for larger deployments, especially for a box with hundreds of physical or virtual (vlan) interfaces.

The problem here doesn't necessarily seem to be the static routes themselves but the number of config entries in the configuration file. At least from my investigation, it seems to have two parts.

The first part is the time it takes to add/remove/change configuration for a large configuration file (well, the commit along with boot time at least). It should be in the range of a few seconds but it quickly escalates to several minutes. Here unionfs-fuse seems to be part of the long commit times.

The second part is frr/staticd itself (part of the commit, and it also took a long time to complete).

For the second part, a solution might (if not already used) be to use the transactional cli (--tcli) as described in:

Also found this which I dont know if its related or still valid:

This would remove any overhead which exists when injecting one route entry at a time into frr/staticd (if that's what's happening, at least for the frr/staticd part of the long commit?).

This topic reads as if static routes are added one by one while the config is read, instead of all at once.
The FIB then has to be recalculated 1000 times.

@giga1699 Our use case is one interface per client. We use a virtual bridge interface and unnumbered addresses. Sample of a typical client’s interface config:

set interfaces bridge br3003797 address 'xxxx:xxxx:xxxx:xxxx::1/64'
set interfaces bridge br3003797 description 'TENANT SVI interface 270226'
set interfaces bridge br3003797 ip enable-arp-accept
set interfaces bridge br3003797 ip enable-proxy-arp
set interfaces bridge br3003797 ip source-validation 'strict'
set interfaces bridge br3003797 mac '26:28:B0:96:C0:C9'
set interfaces bridge br3003797 member interface vxlan3003797
set interfaces bridge br3003797 traffic-policy in '1G-in'
set interfaces bridge br3003797 traffic-policy out '1G-out'
set interfaces bridge br3003797 vrf 'internet'
set interfaces vxlan vxlan3003797 mtu '1500'
set interfaces vxlan vxlan3003797 parameters nolearning
set interfaces vxlan vxlan3003797 port '4789'
set interfaces vxlan vxlan3003797 source-address '10.32.0.34'
set interfaces vxlan vxlan3003797 vni '3003797'
set policy route-map UPLINK-OUT rule 185 match interface 'br3003797'
set protocols bgp address-family l2vpn-evpn vni 3003797 advertise-svi-ip
set protocols bgp address-family l2vpn-evpn vni 3003797 rd '65022:103003797'
set protocols bgp address-family l2vpn-evpn vni 3003797 route-target export '65002:1'
set protocols bgp address-family l2vpn-evpn vni 3003797 route-target import '65000:3003797'
set service router-advert interface br3003797 name-server 'xxxx:xxxx::xxxx'
set service router-advert interface br3003797 prefix xxxx:xxxx:xxxx:xxxx::/64
set vrf name internet protocols static route xxx.xxx.xxx.x0/32 interface br3003797
set vrf name internet protocols static route xxx.xxx.xxx.x1/32 interface br3003797
set vrf name internet protocols static route xxx.xxx.xxx.x2/32 interface br3003797
set vrf name internet protocols static route xxx.xxx.xxx.x3/32 interface br3003797
set vrf name internet protocols static route xxx.xxx.xxx.x4/32 interface br3003797

Each interface can have anywhere from one to several hundred unnumbered IP addresses.

@16again: Adding a line in config mode is no issue; committing (or booting) and waiting for more than 5 minutes is.

The long commit time does not depend on the number of static routes, but on the size of the configuration, that is, the number of lines in it. If a router has a large configuration (not necessarily static routes), committing any change takes several minutes.
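
To see which part of a large configuration contributes most of the lines, something like the following sketch could help. It only counts “set” lines per top-level node; the function name count_by_section and the sample address 192.0.2.10 are made up for illustration.

```shell
#!/bin/bash
# Read "set ..." commands on stdin and print "<count> <top-level node>",
# largest subtree first (ties sorted by name).
count_by_section() {
    awk '$1 == "set" { n[$2]++ } END { for (k in n) print n[k], k }' |
        sort -k1,1nr -k2,2
}

# Small self-contained demo with lines of the kind shown in this thread
count_by_section <<'EOF'
set interfaces bridge br3003797 vrf 'internet'
set interfaces vxlan vxlan3003797 vni '3003797'
set vrf name internet protocols static route 192.0.2.10/32 interface br3003797
EOF
```

On a router, the input would be the output of “show configuration commands” (the same command used for the line count earlier in the thread) piped into the function.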

Sure, but commit and boot times that exceed 5 minutes just because one added some static routes are ridiculous and should be fixed.

Some claim that VyOS isn't “enterprise grade” (whatever that means to begin with), but it's hard to argue against that with “bugs” like extremely long commit and boot times.

I think this can become a major showstopper to put VyOS into production.

Years ago Palo Alto Networks ran into ridiculous commit (and boot) times on their PAN-2000 series platform, which they resolved by discontinuing that model and also sending upgrade kits to customers who wanted them. The kits were basically an SSD along with a RAM upgrade, because the problem in their case was that compiling the FPGA ruleset started to use the operating system's swapfile, which was placed on an HDD (which then pushed the commit and boot times past 20 minutes).

The problem in the VyOS case doesn't seem to be bound to disk IO and use of a swapfile, but rather to some design error (or room for improvement) in which tools and methods are being used when dealing with large configurations.

One suspect is the use of unionfs-fuse (usermode) instead of overlayfs (kernelmode). The other suspect is how the static routes (as an example) are actually being injected into frr/staticd (gut feeling is that they are injected one by one by the config daemon instead of in batch mode through --tcli, or by just writing directly to the config file and reloading the staticd daemon).

Doing some more digging, it turned out that VyOS doesn't support nested routing, so the gateway must be reachable (at least IP-address wise) through a physical interface. I have updated the script and attached it at the bottom of this post (added variable GATEWAY).

This way one can verify that whatever staticd is digesting does show up in the kernel (when in bash mode):

ip route show vrf INTERNET

I also tried altering /run/frr/config/frr.conf (NOTE: this will be recreated the next time you commit config or reboot the device) and having staticd reloaded. I couldn't figure out which systemctl command to use, so I just did “kill -9” on the staticd pid, and watchfrr would then respawn the staticd process within a few seconds. Done this way, staticd would digest all 1024 routes within a second or so.

So part of the fix for the long commit/boot times would be to change how static routes are being loaded, so that they are instead loaded in a batch fashion.

That is, generate the /run/frr/config/frr.conf file and then have frr reload all its processes; an estimated 20% or so of the commit/boot times would be cut.
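
As an illustration of the batch idea, here is a minimal sketch that writes all routes as one FRR-style fragment instead of issuing one “set” per route. The function name gen_static_block is made up, and I am assuming the usual FRR syntax of a “vrf NAME ... exit-vrf” block containing “ip route PREFIX GATEWAY DISTANCE” lines; how VyOS itself renders frr.conf may differ.

```shell
#!/bin/bash
# Print an FRR-style config fragment with one /32 static route per address
# in <a>.<b>.<c_start..c_end>.<0..255>, all inside a single vrf block.
# Arguments: vrf gateway a b c_start c_end
gen_static_block() {
    local vrf="$1" gw="$2" a="$3" b="$4" c_start="$5" c_end="$6" c d
    echo "vrf ${vrf}"
    for c in $(seq "$c_start" "$c_end"); do
        for d in $(seq 0 255); do
            # Trailing "1" is the administrative distance, as in inject.sh
            echo " ip route ${a}.${b}.${c}.${d}/32 ${gw} 1"
        done
    done
    echo "exit-vrf"
}

# Same 1024 routes as inject.sh; show only the first few lines
gen_static_block INTERNET 192.168.1.254 0 0 0 3 | head -n 4
```

This only generates text and does not touch the running config. Feeding the result to frr is the part that would have to happen inside the commit scripts.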

This still leaves the question of what takes the other 80+% of the commit/boot times (that unionfs-fuse thingy seen through “ps auxwww”).

Updated /config/custom/inject.sh:

#!/bin/vbash

# Include VyOS functions
source /opt/vyatta/etc/functions/script-template

# Script debugging
#set -x

# Set variables
GATEWAY=192.168.1.254
OCTET_A_START=0
OCTET_A_END=0
OCTET_B_START=0
OCTET_B_END=0
OCTET_C_START=0
OCTET_C_END=3
OCTET_D_START=0
OCTET_D_END=255

# Inject configuration
configure
for OCTET_A in `seq ${OCTET_A_START} ${OCTET_A_END}`
do
	for OCTET_B in `seq ${OCTET_B_START} ${OCTET_B_END}`
	do
		for OCTET_C in `seq ${OCTET_C_START} ${OCTET_C_END}`
		do
			for OCTET_D in `seq ${OCTET_D_START} ${OCTET_D_END}`
			do
				echo "set vrf name INTERNET protocols static route ${OCTET_A}.${OCTET_B}.${OCTET_C}.${OCTET_D}/32 next-hop ${GATEWAY} distance 1"
				set vrf name INTERNET protocols static route ${OCTET_A}.${OCTET_B}.${OCTET_C}.${OCTET_D}/32 next-hop ${GATEWAY} distance 1
			done
		done
	done
done
commit
exit

Hi guys! I can confirm that commit time depends directly on the configuration file size.
It doesn't only depend on static routes; it could also be firewall rules or whatever.
In versions 1.2.x we also experienced some random missing configuration when rebooting a VyOS with a config.boot file larger than 1MB.
In version 1.3.x those issues are not so frequent, but commit time (and reboot time) is still very long for large configurations.
It would be much appreciated if this performance could be improved!
Thanks and regards
