Hi,
We’re running VyOS 1.3-rolling-202001190217 with FRRouting v7.3-dev-20191226-00-gd7cce42cc and can reproduce a zebra crash that appears to occur while it is setting a nexthop.
opened 12:39PM - 21 Jan 20 UTC
closed 08:02PM - 22 Jan 20 UTC
triage
Hi,
We're running FRR 7.3-dev-20191226-00-gd7cce42cc as part of VyOS 1.3 rolling (202001190217), and BGP dies when establishing a peering session to a specific router. The session is stable when it is the only one, and all other sessions run concurrently without problems. We can reproduce the problem by starting any other peering session after the problematic one, or by starting the problematic one after first establishing peering with any other working peer.
FRR logs after setting 'log file /tmp/frr_debug.log debugging':
```
ZEBRA: Received signal 11 at 1579599554 (si_addr 0x0, PC 0x7fa738f6485b); aborting...
Program counter: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x5785b)[0x7fa738f6485b]
Backtrace for 16 stack frames:
/usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x60)[0x7fa738f5f100]
/usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0x10c)[0x7fa738f5f57c]
/usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x72f74)[0x7fa738f7ff74]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7fa738df3730]
/usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x5785b)[0x7fa738f6485b]
/usr/lib/frr/zebra(+0x479c3)[0x55d7cd7f39c3]
/usr/lib/frr/zebra(+0x47909)[0x55d7cd7f3909]
/usr/lib/frr/zebra(zebra_nhg_rib_find+0x4d)[0x55d7cd7f364d]
/usr/lib/frr/zebra(nexthop_active_update+0x594)[0x55d7cd7f3fe4]
/usr/lib/frr/zebra(+0x5005f)[0x55d7cd7fc05f]
/usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(work_queue_run+0xc8)[0x7fa738f96d28]
/usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x56)[0x7fa738f8d6a6]
/usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xd8)[0x7fa738f5d3e8]
/usr/lib/frr/zebra(main+0x32e)[0x55d7cd7ca6fe]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7fa738c4409b]
/usr/lib/frr/zebra(_start+0x2a)[0x55d7cd7cadfa]
in thread work_queue_run scheduled from lib/workqueue.c:140
2020/01/21 11:39:15 BGP: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
2020/01/21 11:39:19 BFD: VRF disable default id 0
2020/01/21 11:39:19 BFD: VRF Deletion: default(0)
2020/01/21 11:39:19 BGP: Terminating on signal
2020/01/21 11:39:19 BGP: %NOTIFICATION: sent to neighbor 192.0.2.34 6/3 (Cease/Peer Unconfigured) 0 bytes
2020/01/21 11:39:19 BGP: %NOTIFICATION: sent to neighbor 192.0.2.35 6/3 (Cease/Peer Unconfigured) 0 bytes
2020/01/21 11:39:19 BGP: %NOTIFICATION: sent to neighbor 192.0.2.46 6/3 (Cease/Peer Unconfigured) 0 bytes
2020/01/21 11:39:19 BGP: %NOTIFICATION: sent to neighbor 192.0.2.34 6/2 (Cease/Administratively Shutdown) 0 bytes
2020/01/21 11:39:19 BGP: %NOTIFICATION: sent to neighbor 192.0.2.35 6/2 (Cease/Administratively Shutdown) 0 bytes
2020/01/21 11:39:19 BGP: %NOTIFICATION: sent to neighbor 192.0.2.46 6/2 (Cease/Administratively Shutdown) 0 bytes
2020/01/21 11:39:20 STATIC: Terminating on signal
2020/01/21 11:39:20 BGP: %ADJCHANGE: neighbor 192.0.2.34(Unknown) in vrf default Down Neighbor deleted
2020/01/21 11:39:20 BGP: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
2020/01/21 11:39:20 BGP: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
2020/01/21 11:39:20 OSPF6: Terminating on signal SIGINT
2020/01/21 11:39:20 BGP: %ADJCHANGE: neighbor 192.0.2.35(Unknown) in vrf default Down Neighbor deleted
2020/01/21 11:39:20 BGP: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
2020/01/21 11:39:20 BGP: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
2020/01/21 11:39:20 BGP: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
2020/01/21 11:39:20 BGP: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
2020/01/21 11:39:20 BGP: %ADJCHANGE: neighbor 192.0.2.46(zatjnb01-rr04) in vrf default Down Neighbor deleted
2020/01/21 11:39:20 BGP: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
2020/01/21 11:39:20 BGP: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
```
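Several frames in the backtrace above carry only a module-relative offset and no function name. As a rough aid for reading it, here is a minimal Python sketch that splits each frame into binary, symbol, offset and address, and lists the frames without a symbol. The names `parse_frames` and `unresolved_frames` are made up for illustration, and the regular expression only assumes the frame layout visible in this log.

```python
import re

# Matches backtrace frames of the form seen in the log, for example:
#   /usr/lib/frr/zebra(zebra_nhg_rib_find+0x4d)[0x55d7cd7f364d]
FRAME_RE = re.compile(
    r"^(?P<binary>\S+)\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\[(?P<address>0x[0-9a-f]+)\]$"
)

# Three frames copied from the log above, plus a non-frame header line.
SAMPLE = """\
Backtrace for 16 stack frames:
/usr/lib/frr/zebra(+0x479c3)[0x55d7cd7f39c3]
/usr/lib/frr/zebra(zebra_nhg_rib_find+0x4d)[0x55d7cd7f364d]
/usr/lib/frr/zebra(nexthop_active_update+0x594)[0x55d7cd7f3fe4]
"""


def parse_frames(log_text):
    """Return one dict per backtrace frame found in log_text."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.groupdict())
    return frames


def unresolved_frames(frames):
    """Return the frames that have an offset but no symbol name."""
    return [frame for frame in frames if not frame["symbol"]]


if __name__ == "__main__":
    for frame in unresolved_frames(parse_frames(SAMPLE)):
        print(frame["binary"], frame["offset"])
```

The binary and offset pairs it prints could then be handed to a symbol resolver such as `addr2line`, provided debug symbols matching this exact build are available.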
[x] Did you check if this is a duplicate issue?
[ ] Did you test it on the latest FRRouting/frr master branch?
- OS Kernel: Linux 4.19.91-amd64-vyos
- FRR Version: 7.3-dev-20191226-00-gd7cce42cc
- MPLS with LDP enabled
Configuration:
```
testing# show running-config
Building configuration...
Current configuration:
!
frr version 7.3-dev-20191226-00-gd7cce42cc
frr defaults traditional
hostname testing
log file /tmp/frr_full_debug.log
log syslog informational
service integrated-vtysh-config
!
ip route 198.18.0.0/24 Null0
ipv6 route fc00::198:18:0:0/118 Null0
!
interface eth0.35
ip ospf authentication message-digest
ip ospf cost 200
ip ospf dead-interval 10
ip ospf hello-interval 1
ip ospf message-digest-key 1 md5 ****************
ip ospf network point-to-point
!
interface eth0.37
ip ospf authentication message-digest
ip ospf cost 200
ip ospf dead-interval 10
ip ospf hello-interval 1
ip ospf message-digest-key 1 md5 ****************
ip ospf network point-to-point
!
interface eth0.39
ip ospf authentication message-digest
ip ospf cost 200
ip ospf dead-interval 10
ip ospf hello-interval 1
ip ospf message-digest-key 1 md5 ****************
ip ospf network point-to-point
!
router bgp 64500
bgp router-id 192.0.2.45
bgp log-neighbor-changes
bgp cluster-id 192.0.2.45
neighbor rr_client peer-group
neighbor rr_client remote-as 64500
neighbor rr_client update-source lo
neighbor 192.0.2.34 peer-group rr_client
neighbor 192.0.2.34 password ****************
neighbor 192.0.2.35 peer-group rr_client
neighbor 192.0.2.35 password ****************
neighbor 192.0.2.39 peer-group rr_client
neighbor 192.0.2.39 password ****************
!
address-family ipv4 unicast
redistribute connected route-map bgp-out-connected
redistribute static route-map bgp-out-static
neighbor rr_client route-reflector-client
neighbor rr_client soft-reconfiguration inbound
neighbor rr_client route-map bgp-in in
neighbor rr_client route-map bgp-out out
exit-address-family
!
address-family ipv6 unicast
redistribute connected route-map bgp-out-connected
redistribute static route-map bgp-out-static
neighbor rr_client activate
neighbor rr_client route-reflector-client
neighbor rr_client soft-reconfiguration inbound
neighbor rr_client route-map bgp-in in
neighbor rr_client route-map bgp-out out
exit-address-family
!
router ospf
ospf router-id 192.0.2.45
network 10.0.0.0/30 area 0
network 10.0.0.8/30 area 0
network 10.0.0.16/30 area 0
network 192.0.2.45/32 area 0
!
mpls ldp
router-id 192.0.2.45
!
address-family ipv4
discovery transport-address 192.0.2.45
label local allocate host-routes
!
interface eth0.35
discovery hello holdtime 10
discovery hello interval 1
!
interface eth0.37
discovery hello holdtime 10
discovery hello interval 1
!
interface eth0.39
discovery hello holdtime 10
discovery hello interval 1
!
exit-address-family
!
!
ip prefix-list filtered-static seq 10 permit 0.0.0.0/0
ip prefix-list filtered-static seq 20 permit 198.18.0.0/24 le 32
!
ipv6 prefix-list filtered-static seq 10 permit ::/0
ipv6 prefix-list filtered-static seq 20 permit fc00::198:18:0:0/118 le 128
!
bgp community-list expanded blackhole permit 64500:900
!
route-map ospf-in permit 10
set src 192.0.2.45
!
route-map bgp-in permit 10
match community blackhole
set ip next-hop 198.18.0.1
set ipv6 next-hop global fc00::198:18:0:1
!
route-map bgp-in permit 20
!
route-map bgp-out permit 10
!
route-map bgp-out-connected deny 10
match ip address prefix-list filtered-connected
!
route-map bgp-out-connected deny 20
match ipv6 address prefix-list filtered-connected
!
route-map bgp-out-connected permit 30
set community 64500:505
set local-preference 305
!
route-map bgp-out-static deny 10
match ip address prefix-list filtered-static
!
route-map bgp-out-static deny 20
match ipv6 address prefix-list filtered-static
!
route-map bgp-out-static permit 30
set community 64500:510
set local-preference 300
!
ip protocol ospf route-map ospf-in
!
line vty
!
end
```
I, perhaps mistakenly, opened an issue on FRRouting’s GitHub project where I provided a relatively simple FRR configuration and debug logs. Developers there are asking some questions that I need answers to, please:
What is the latest commit signature of FRR used in VyOS 1.3 rolling?
How do I generate a core dump in VyOS?
How often is FRR in VyOS 1.3-rolling updated? I see numerous commits relating to nexthop logic in the last 30 days.
A question I have:
Shouldn’t VyOS restart crashed processes? I would imagine that it would cycle continually in a situation where the problem is reproducible, but what about ‘once in a blue moon’ scenarios?
Regards
David Herselman
c-po
January 22, 2020, 7:17am
2
You can get the latest FRR commit signature by running vtysh -c "show version"
FRR is not updated automatically. An update is deployed by re-running our CI job at https://ci.vyos.net/job/vyos-build-frr/job/master/, which we do from time to time until there is a new FRR release; then we pin the Git tag.
As VyOS is a regular Linux system, core dump generation needs to be enabled by setting
ulimit -c unlimited
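`ulimit -c` changes a per-process resource limit (`RLIMIT_CORE`) that child processes inherit, so it only affects programs started from that shell afterwards. A minimal Python sketch of that inheritance, not specific to VyOS or FRR (the function names are hypothetical):

```python
import resource
import subprocess
import sys


def enable_core_dumps():
    """Raise this process's soft core-size limit up to its hard limit.

    An unprivileged process may raise the soft limit only as far as
    the hard limit, so the hard limit is left unchanged here.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def child_core_limit():
    """Start a child process and return the soft core limit it inherited."""
    output = subprocess.check_output(
        [
            sys.executable,
            "-c",
            "import resource;"
            "print(resource.getrlimit(resource.RLIMIT_CORE)[0])",
        ],
        text=True,
    )
    return int(output.strip())


if __name__ == "__main__":
    soft, hard = enable_core_dumps()
    # resource.RLIM_INFINITY is how Python reports "unlimited".
    print("parent soft/hard:", soft, hard)
    print("child soft:", child_core_limit())
```

A daemon that was already running, or one started by the init system rather than by this shell, does not see the change.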
bbs2web
January 22, 2020, 12:52pm
3
Many thanks!
To double-check: as VyOS 1.3-rolling-202001190217 reports the FRR version as 7.3-dev-20191226-00-gd7cce42cc, I presume it is the development branch with commits up to the 25th of December 2019, since I presume the builds happen shortly after midnight?
I presume the subscription version of VyOS uses FRR release versions and not rolling builds, as Releases · FRRouting/frr · GitHub only references ‘frr-7.3-dev’ with commit signature eef47e1 on the 6th of September 2019.
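For reference, the version string can be split mechanically. The sketch below assumes the layout `<release>-<YYYYMMDD>-<sequence>-g<commit>`, which is inferred from this one string and not from FRR documentation; the `g` prefix follows the `git describe` convention for an abbreviated commit hash. The function name `parse_frr_version` is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: <release>-<YYYYMMDD>-<sequence>-g<abbreviated commit hash>
VERSION_RE = re.compile(
    r"^(?P<release>.+)-(?P<stamp>\d{8})-(?P<seq>\d+)-g(?P<commit>[0-9a-f]+)$"
)


def parse_frr_version(version):
    """Split a version string of the assumed layout into its parts."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unexpected version format: {version!r}")
    return {
        "release": match["release"],
        "build_date": datetime.strptime(match["stamp"], "%Y%m%d").date(),
        "sequence": int(match["seq"]),
        "commit": match["commit"],
    }


if __name__ == "__main__":
    info = parse_frr_version("7.3-dev-20191226-00-gd7cce42cc")
    print(info["release"], info["build_date"], info["commit"])
```

If that assumption holds, the trailing part after `g` is the commit to look up in the FRR repository, and the date field only says when the build was stamped, not which commits it contains.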
VyOS 1.3 rolling is not generating core dumps when I set ‘ulimit -c unlimited’, presumably because it only applies to the current session. I tried adding this to /usr/libexec/vyos/init/vyos-router before ‘/usr/lib/frr/frrinit.sh start’ in the ‘start ()’ function, but it still did not generate a core dump file in /var/core.
Any suggestions on where I need to set this, to obtain a core dump when FRR dies?
PS: I presume watchfrr should be restarting processes but BGP remains non-functional whilst OSPF recovers. Running ‘vtysh -c “show running”’ shows everything except for BGP. Is this by design or another problem?
Apologies, I think I have a better understanding of the automated build environment now and really appreciate the granular detail Jenkins provides. It was great to see an updated build of FRR this afternoon and that the latest rolling release including it was already available.
The problem is gone but it would still be useful to know how to obtain a core dump in VyOS in future.
No core dumps in /var/core or /tmp after doing the following:
admin@testing:~$ sysctl -a -r core_pattern;
kernel.core_pattern = /var/core/core-%e-%p-%t
#sysctl -w kernel.core_pattern="/tmp/frr-core-%e-%p-%t"
pico /etc/security/limits.conf;
* soft core unlimited
init 6;
admin@testing:~$ ulimit -a | grep core;
core file size (blocks, -c) unlimited
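The `ulimit -a` output above describes the login shell. Entries in /etc/security/limits.conf are applied to PAM login sessions, so a daemon started at boot may still be running with a different core limit. One way to check is to read the limit of the running process itself from /proc/<pid>/limits. A minimal Python sketch, assuming a Linux /proc filesystem (the function names are hypothetical):

```python
from pathlib import Path

CORE_LABEL = "Max core file size"


def core_limit_from_text(limits_text):
    """Return (soft, hard) core size limits from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith(CORE_LABEL):
            # The remaining columns are: soft limit, hard limit, units.
            fields = line[len(CORE_LABEL):].split()
            return fields[0], fields[1]
    raise ValueError(f"no {CORE_LABEL!r} line found")


def core_limit_of(pid):
    """Read the core size limits of a running process by PID."""
    return core_limit_from_text(Path(f"/proc/{pid}/limits").read_text())


if __name__ == "__main__":
    import sys

    # Usage: pass the PID of the daemon, e.g. the output of `pidof zebra`.
    print(core_limit_of(sys.argv[1]))
```

A soft limit of 0 for the daemon would mean the kernel writes no core file for it, whatever the interactive shell reports.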