watchfrr can't restore FRR processes as expected

Hi,

I recently ran into a problem with FRR in VyOS. My VyOS router ran out of memory, which led watchfrr to restart the other FRR processes, as shown below.

Apr 12 11:34:45 eu-hub zebra[1013]: [EC 4043309090] Unknown netlink nlmsg_type RTM_GETNEIGH(30) vrf 0
Apr 12 11:34:45 eu-hub watchfrr[983]: bgpd: slow echo response finally received after 690.958289 seconds
Apr 12 11:34:45 eu-hub zebra[1013]: [EC 4043309090] Unknown netlink nlmsg_type RTM_GETNEIGH(30) vrf 0
Apr 12 11:34:45 eu-hub watchfrr[983]: ospfd: slow echo response finally received after 691.008029 seconds
Apr 12 11:34:45 eu-hub zebra[1013]: [EC 4043309090] Unknown netlink nlmsg_type RTM_GETNEIGH(30) vrf 0
Apr 12 11:34:45 eu-hub watchfrr[983]: ospf6d: slow echo response finally received after 649.294706 seconds
Apr 12 11:34:45 eu-hub zebra[1013]: [EC 4043309090] Unknown netlink nlmsg_type RTM_GETNEIGH(30) vrf 0
Apr 12 11:34:45 eu-hub watchfrr[983]: zebra: slow echo response finally received after 766.866493 seconds
Apr 12 11:34:45 eu-hub zebra[1013]: [EC 4043309090] Unknown netlink nlmsg_type RTM_GETNEIGH(30) vrf 0
Apr 12 11:34:45 eu-hub zebra[1013]: message repeated 10 times: [ [EC 4043309090] Unknown netlink nlmsg_type RTM_GETNEIGH(30) vrf 0]
Apr 12 11:34:47 eu-hub zebra[1013]: Terminating on signal
Apr 12 11:34:47 eu-hub bgpd[1018]: Terminating on signal
Apr 12 11:34:47 eu-hub ripd[1034]: Terminating on signal
Apr 12 11:34:47 eu-hub ospfd[1042]: Terminating on signal
Apr 12 11:34:47 eu-hub ripngd[1038]: Terminating on signal
Apr 12 11:34:47 eu-hub ospf6d[1046]: Terminating on signal SIGINT

It’s fine that watchfrr tries to restore the FRR processes, but after they were all restored, all my routes were gone, including static and BGP routes. What is the point of watchfrr restoring the processes if it doesn’t restore the traffic?

Hi @MapleWang
We have a bug report on this; see T2175, "Rewriting all FRR processes allow for reloading and to XML/Python style".

Hi @Viacheslav,

I understand there is a legacy problem that makes this hard to support in 1.2.x, but it is the LTS branch. Even if the rewrite can’t be delivered there, there should be a workaround. Here is my proposal:

To fix the problem, we need to address two aspects:

  1. On each commit that involves protocols configuration, save the running-config in FRR, so that even if FRR is restarted it can restore the configuration and traffic by itself.
  2. If FRR saves its configuration at runtime, the saved configuration must not conflict with the configuration loaded at boot time.

To address the first aspect, each commit needs to check the VyOS configuration and decide whether to save the FRR configuration. In VyOS 1.2.x, as I understand it, all commits go through the following files:

  • /opt/vyatta/sbin/vyatta-boot-config-loader (used at boot time)
  • /opt/vyatta/sbin/vyatta-cfg-cmd-wrapper (used for API calls)
  • /opt/vyatta/share/vyatta-cfg/functions/interpreter/vyatta-cfg-run (used on the command line)

We can do the following in the commit function of each of the files above:

  • before commit
  # check whether the commit changes anything under protocols
  protocols_changed=0
  if [ -d "${VYATTA_CHANGES_ONLY_DIR}/protocols" ] || [ -d "${VYATTA_CHANGES_ONLY_DIR}/.unionfs/protocols" ]
  then
    protocols_changed=1
  fi
  • after commit
  # if the commit changed protocols, save the FRR configuration
  if [ "${protocols_changed}" = "1" ]
  then
    /usr/bin/vtysh -c "copy running-config startup-config"
  fi
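
The two fragments above could be folded into a pair of small helpers so that all three commit paths share the same logic. This is only a sketch: the function names frr_protocols_changed and save_frr_config and the VTYSH override variable are hypothetical, not existing VyOS code, and it assumes VYATTA_CHANGES_ONLY_DIR is set by the config session as in the fragments above.

```shell
# Hypothetical helpers; the names are illustrative, not existing VyOS functions.

# Path to vtysh; overridable (assumption) so the logic can be tried without FRR.
VTYSH="${VTYSH:-/usr/bin/vtysh}"

# Return 0 if the pending commit touches the "protocols" subtree.
# Call this BEFORE the commit, while the changes-only directory is populated.
frr_protocols_changed() {
    [ -d "${VYATTA_CHANGES_ONLY_DIR}/protocols" ] ||
    [ -d "${VYATTA_CHANGES_ONLY_DIR}/.unionfs/protocols" ]
}

# Persist FRR's running configuration so FRR can reload it after a restart.
save_frr_config() {
    "$VTYSH" -c "copy running-config startup-config"
}
```

In a commit wrapper, the result of frr_protocols_changed would be remembered in a variable before the commit runs, and save_frr_config called afterwards only if that variable is set and the commit succeeded.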

To address the second aspect, we could reset /etc/frr/frr.conf before the FRR service starts at boot time. Since FRR in VyOS is controlled by the vyos-router service, this is easy to do in /usr/libexec/vyos/init/vyos-router:

# cleanup /etc/frr/frr.conf
cp /etc/frr/frr.conf.empty /etc/frr/frr.conf
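
A slightly more defensive form of this boot-time reset might look like the sketch below. The function name reset_frr_conf and its optional directory argument are hypothetical (the argument exists only so the logic can be tried against a scratch directory); it assumes, as above, that frr.conf.empty sits next to frr.conf.

```shell
# Hypothetical helper for the vyos-router init script; not existing VyOS code.
# Resets frr.conf from the empty template before FRR starts, so a config saved
# at runtime does not conflict with the one VyOS loads at boot.
reset_frr_conf() {
    frr_dir="${1:-/etc/frr}"    # directory argument is for testing only
    if [ -f "${frr_dir}/frr.conf.empty" ]; then
        cp "${frr_dir}/frr.conf.empty" "${frr_dir}/frr.conf"
    else
        # Without the template, leave frr.conf untouched rather than guess.
        echo "frr.conf.empty not found in ${frr_dir}; skipping reset" >&2
        return 1
    fi
}
```

Checking for the template first means a missing frr.conf.empty is reported on stderr and leaves the existing frr.conf in place.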