Clustering heartbeat dying randomly - help with troubleshooting (Rel 1.2.0-rc11)

Hi All

I have a two-node cluster that seems to die every now and then, and I don’t know enough about it to know where to start looking.

Every so often I notice my standby router has taken over, even though the primary is still responsive. If I look at the cluster status, though, I see this:

chris@router-vmware:~$ sh cl st
=== Status report on primary node router-vmware ===

  Primary router-vmware (this node): Unknown

  Secondary router-qnap: Unknown

  Monitor 192.168.0.24: Unknown

  Resources [192.168.0.1/24/eth1 192.168.10.1/24/eth2.10 192.168.11.1/29/eth2.11]:
    Unknown

The only thing I can see in the log is the following:

Feb 03 20:44:41 router-vmware cl_status[46914]: [46914]: ERROR: Cannot signon with heartbeat
Feb 03 20:44:41 router-vmware cl_status[46914]: [46914]: ERROR: REASON: hb_api_signon: Can't initiate connection  to heartbeat
Feb 03 20:44:41 router-vmware cl_status[46916]: [46916]: ERROR: Cannot signon with heartbeat
Feb 03 20:44:41 router-vmware cl_status[46916]: [46916]: ERROR: REASON: hb_api_signon: Can't initiate connection  to heartbeat
Feb 03 20:44:41 router-vmware cl_status[46918]: [46918]: ERROR: Cannot signon with heartbeat
Feb 03 20:44:41 router-vmware cl_status[46918]: [46918]: ERROR: REASON: hb_api_signon: Can't initiate connection  to heartbeat
Feb 03 20:44:41 router-vmware cl_status[46920]: [46920]: ERROR: Cannot signon with heartbeat
Feb 03 20:44:41 router-vmware cl_status[46920]: [46920]: ERROR: REASON: hb_api_signon: Can't initiate connection  to heartbeat
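I’m guessing those errors are a symptom rather than the cause, since cl_status presumably has to sign on to a running heartbeat daemon before it can report anything. Next time it happens I’ll check whether the daemon is even alive first, with something like this (command names assumed from the standard heartbeat package):

ps -C heartbeat
cl_status hbstatus
cl_status listnodes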

This is my config:

cluster {
    dead-interval 10000
    group home-cluster {
        auto-failback true
        monitor 192.168.0.24
        primary router-vmware
        secondary router-qnap
        service 192.168.0.1/24/eth1
        service 192.168.10.1/24/eth2.10
        service 192.168.11.1/29/eth2.11
    }
    interface eth1
    interface eth2.10
    keepalive-interval 2000
    monitor-dead-interval 10000
    pre-shared-secret ****************
}
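In case it helps anyone reproduce this, that config should correspond to roughly the following set commands (typed from memory, so double-check the paths on your own box):

set cluster dead-interval 10000
set cluster keepalive-interval 2000
set cluster monitor-dead-interval 10000
set cluster interface eth1
set cluster interface eth2.10
set cluster pre-shared-secret <secret>
set cluster group home-cluster auto-failback true
set cluster group home-cluster monitor 192.168.0.24
set cluster group home-cluster primary router-vmware
set cluster group home-cluster secondary router-qnap
set cluster group home-cluster service 192.168.0.1/24/eth1
set cluster group home-cluster service 192.168.10.1/24/eth2.10
set cluster group home-cluster service 192.168.11.1/29/eth2.11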

The only thing I can think of is that it could be occurring during VMware migrations (I’ll check that out).

Is anyone able to point me at where I should look to try to figure out what’s going on?

Thanks in advance!

Chris…

Hello, @cmonck!
Try to catch the moment when this happens and see what else is changing at the same time. For example, if your router migrates to another host, that can disrupt delivery of the multicast traffic that clustering relies on. And so on.
If you want to pin down the cause, you need to find that dependency.
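One simple check: run tcpdump on both nodes while a migration is in progress and watch whether the heartbeat multicast keeps arriving. Heartbeat normally uses UDP port 694, so something like this on each cluster interface (interface names taken from your config):

sudo tcpdump -ni eth1 udp port 694
sudo tcpdump -ni eth2.10 udp port 694

If the packets stop for longer than your dead-interval during a vMotion, you have found your dependency.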

Thanks @zsdc

Yeah, I’m attempting to do that. At this stage there’s nothing obvious. I have done some manual migrations without any issues, and it has also occurred when no migrations have taken place.

I am an idiot at times and realised it’s also just a standard Linux application (heartbeat), so I’ve been doing some digging and have enabled the debug log file. It’s showing some things, but I’m still getting my head around it. From what I can see it does look more like a config issue than a bug.
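For anyone curious, the extra output comes from heartbeat’s own logging directives rather than anything Vyatta-specific; on a stock heartbeat install they live in /etc/ha.d/ha.cf, roughly like this (the file names are the usual defaults, and the config system probably regenerates ha.cf, so hand edits may not survive a commit):

debugfile /var/log/ha-debug
logfile /var/log/ha-log
debug 1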

For what it’s worth, this is probably the interesting bit in the log:

Feb 07 10:47:12 router-vmware heartbeat: [13232]: CRIT: Cluster node router-qnap returning after partition.
Feb 07 10:47:12 router-vmware heartbeat: [13232]: info: For information on cluster partitions, See URL: http://linux-ha.org/wiki/Split_Brain
Feb 07 10:47:12 router-vmware heartbeat: [13232]: WARN: Deadtime value may be too small.
Feb 07 10:47:12 router-vmware heartbeat: [13232]: info: See FAQ for information on tuning deadtime.
Feb 07 10:47:12 router-vmware heartbeat: [13232]: info: URL: http://linux-ha.org/wiki/FAQ#Heavy_Load
Feb 07 10:47:12 router-vmware heartbeat: [13232]: info: Link router-qnap:eth1 up.
Feb 07 10:47:12 router-vmware heartbeat: [13232]: CRIT: Late heartbeat: Node router-qnap: interval 525700 ms (> deadtime)
Feb 07 10:47:12 router-vmware heartbeat: [13232]: WARN: node router-qnap: is dead
Feb 07 10:47:12 router-vmware heartbeat: [13232]: info: Dead node router-qnap gave up resources.
Feb 07 10:47:12 router-vmware heartbeat: [13232]: info: Status update for node router-qnap: status active
Feb 07 10:47:12 router-vmware heartbeat: [13232]: info: Link router-qnap:eth2.10 up.
Feb 07 10:47:12 router-vmware ipfail: [13244]: info: Link Status update: Link router-qnap/eth1 now has status up
Feb 07 10:47:12 router-vmware ipfail: [13244]: info: Status update: Node router-qnap now has status dead
Feb 07 10:47:12 router-vmware heartbeat: [20718]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc(default)[20718]:   2019/02/07_10:47:12 info: Running /etc/ha.d//rc.d/status status
Feb 07 10:47:12 router-vmware ipfail: [13244]: debug: Found ping node 192.168.0.24!
Feb 07 10:47:13 router-vmware ipfail: [13244]: info: NS: We are still alive!
Feb 07 10:47:13 router-vmware ipfail: [13244]: info: Status update: Node router-qnap now has status active
Feb 07 10:47:13 router-vmware ipfail: [13244]: info: Link Status update: Link router-qnap/eth2.10 now has status up
Feb 07 10:47:14 router-vmware heartbeat: [13232]: info: all clients are now paused
Feb 07 10:47:14 router-vmware heartbeat: [13232]: info: Heartbeat shutdown in progress. (13232)
Feb 07 10:47:14 router-vmware heartbeat: [20741]: info: Giving up all HA resources.
ResourceManager(default)[20754]:        2019/02/07_10:47:14 info: Releasing resource group: router-vmware IPaddr2-vyatta::192.168.0.1/24/eth1 IPaddr2-vyatta::192.168.10.1/24/eth2.10 IPaddr2-vyatta::192.16$
ResourceManager(default)[20754]:        2019/02/07_10:47:14 info: Running /etc/ha.d/resource.d/IPaddr2-vyatta 192.168.11.1/29/eth2.11 stop
IPaddr2-vyatta(IPaddr2-vyatta_192.168.11.1)[20833]:     2019/02/07_10:47:14 INFO: ip -f inet addr delete 192.168.11.1/29 dev eth2.11
/usr/lib/ocf/resource.d//heartbeat/IPaddr2-vyatta(IPaddr2-vyatta_192.168.11.1)[20792]:  2019/02/07_10:47:14 INFO:  Success
INFO:  Success
ResourceManager(default)[20754]:        2019/02/07_10:47:14 info: Running /etc/ha.d/resource.d/IPaddr2-vyatta 192.168.10.1/24/eth2.10 stop
IPaddr2-vyatta(IPaddr2-vyatta_192.168.10.1)[20976]:     2019/02/07_10:47:14 INFO: ip -f inet addr delete 192.168.10.1/24 dev eth2.10
/usr/lib/ocf/resource.d//heartbeat/IPaddr2-vyatta(IPaddr2-vyatta_192.168.10.1)[20937]:  2019/02/07_10:47:14 INFO:  Success
INFO:  Success
ResourceManager(default)[20754]:        2019/02/07_10:47:14 info: Running /etc/ha.d/resource.d/IPaddr2-vyatta 192.168.0.1/24/eth1 stop
IPaddr2-vyatta(IPaddr2-vyatta_192.168.0.1)[21087]:      2019/02/07_10:47:14 INFO: ip -f inet addr delete 192.168.0.1/24 dev eth1
/usr/lib/ocf/resource.d//heartbeat/IPaddr2-vyatta(IPaddr2-vyatta_192.168.0.1)[21058]:   2019/02/07_10:47:14 INFO:  Success
INFO:  Success
Feb 07 10:47:14 router-vmware heartbeat: [20741]: info: All HA resources relinquished.
Feb 07 10:47:15 router-vmware heartbeat: [13232]: info: all clients are now resumed
Feb 07 10:47:15 router-vmware heartbeat: [13232]: WARN: 1 lost packet(s) for [router-qnap] [6302:6304]
Feb 07 10:47:15 router-vmware heartbeat: [13232]: info: No pkts missing from router-qnap!
Feb 07 10:47:15 router-vmware ipfail: [13244]: debug: Other side is now stable.
Feb 07 10:47:15 router-vmware heartbeat: [13232]: info: killing /usr/lib/heartbeat/ipfail process group 13244 with signal 15
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: killing HBREAD process 13237 with signal 15
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: killing HBWRITE process 13238 with signal 15
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: killing HBREAD process 13239 with signal 15
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: killing HBWRITE process 13240 with signal 15
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: killing HBREAD process 13241 with signal 15
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: killing HBFIFO process 13235 with signal 15
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: killing HBWRITE process 13236 with signal 15
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: Core process 13239 exited. 7 remaining
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: Core process 13241 exited. 6 remaining
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: Core process 13236 exited. 5 remaining
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: Core process 13238 exited. 4 remaining
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: Core process 13235 exited. 3 remaining
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: Core process 13240 exited. 2 remaining
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: Core process 13237 exited. 1 remaining
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: router-vmware Heartbeat shutdown complete.
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: Heartbeat restart triggered.
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: Restarting heartbeat.
Feb 07 10:47:17 router-vmware heartbeat: [13232]: info: Performing heartbeat restart exec.
Feb 07 10:47:28 router-vmware heartbeat: [13232]: info: Pacemaker support: false
Feb 07 10:47:28 router-vmware heartbeat: [13232]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Feb 07 10:47:28 router-vmware heartbeat: [13232]: info: **************************
Feb 07 10:47:28 router-vmware heartbeat: [13232]: info: Configuration validated. Starting heartbeat 3.0.5
Feb 07 10:47:28 router-vmware heartbeat: [21300]: info: heartbeat: version 3.0.5
Feb 07 10:47:28 router-vmware heartbeat: [21300]: info: Heartbeat generation: 1545338682
Feb 07 10:47:28 router-vmware heartbeat: [21300]: info: glib: UDP multicast heartbeat started for group 239.251.252.253 port 694 interface eth1 (ttl=1 loop=0)
Feb 07 10:47:28 router-vmware heartbeat: [21300]: info: glib: UDP multicast heartbeat started for group 239.251.252.253 port 694 interface eth2.10 (ttl=1 loop=0)
Feb 07 10:47:28 router-vmware heartbeat: [21300]: info: glib: ping heartbeat started.
Feb 07 10:47:28 router-vmware heartbeat: [21300]: info: Local status now set to: 'up'
Feb 07 10:47:28 router-vmware heartbeat: [21300]: WARN: Shutdown delayed until Communication is up.
Feb 07 10:47:28 router-vmware heartbeat: [21300]: WARN: Managed HBWRITE process 21313 exited with return code 15.
Feb 07 10:47:28 router-vmware heartbeat: [21300]: info: Core process 21313 exited. 7 remaining
Feb 07 10:47:28 router-vmware heartbeat: [21300]: WARN: Managed HBREAD process 21314 exited with return code 15.
Feb 07 10:47:28 router-vmware heartbeat: [21300]: info: Core process 21314 exited. 6 remaining
Feb 07 10:47:28 router-vmware heartbeat: [21300]: WARN: Managed HBWRITE process 21315 exited with return code 15.
Feb 07 10:47:28 router-vmware heartbeat: [21300]: info: Core process 21315 exited. 5 remaining
Feb 07 10:47:28 router-vmware heartbeat: [21300]: WARN: Managed HBREAD process 21316 exited with return code 15.
Feb 07 10:47:28 router-vmware heartbeat: [21300]: info: Core process 21316 exited. 4 remaining
Feb 07 10:47:28 router-vmware heartbeat: [21300]: WARN: Managed HBWRITE process 21317 exited with return code 15.
Feb 07 10:47:28 router-vmware heartbeat: [21300]: info: Core process 21317 exited. 3 remaining
Feb 07 10:47:28 router-vmware heartbeat: [21300]: WARN: Managed HBREAD process 21318 exited with return code 15.
Feb 07 10:47:28 router-vmware heartbeat: [21300]: info: Core process 21318 exited. 2 remaining
Feb 07 10:47:29 router-vmware heartbeat: [21300]: WARN: Managed HBFIFO process 21312 exited with return code 15.
Feb 07 10:47:29 router-vmware heartbeat: [21300]: info: Core process 21312 exited. 1 remaining
Feb 07 10:47:29 router-vmware heartbeat: [21300]: info: router-vmware Heartbeat shutdown complete.

…so it seems to be detecting a split brain and restarting itself as a result. However, it doesn’t come back up properly, so it tries a couple of times and then just gives up.

I can manually restart the service and it seems to come up fine.
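By “manually restart” I mean bouncing the heartbeat daemon itself from the shell, roughly this (the init script name is the standard heartbeat one, so adjust if your build differs):

sudo /etc/init.d/heartbeat restart
cl_status hbstatus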

I have increased the deadtime to 20 seconds. We’ll see what happens…
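For reference, the knob is the dead-interval in the cluster config (value in milliseconds, matching the 10000 shown above):

configure
set cluster dead-interval 20000
commit
save
exit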

I’ve definitely seen split brain during vMotion. Can you pin the routers to specific hosts?

You may also need to allow MAC address changes on your vSwitch if you haven’t already.

Thanks @chief_dan

That’s definitely an option, but I would prefer another fix if possible.

So far so good with the deadtime at 20s, though: no issues after three days. We’ll see if that continues… If it does seem to fix it, I’ll drop back to 15 seconds and try again. Perhaps 10 seconds is just short enough to be upset by vMotion, etc.