[GH-ISSUE #26] Should fence_mpath agent be utilized instead of the fence_scsi agent? #26

Open
opened 2026-05-05 03:31:09 -06:00 by gitea-mirror · 7 comments

Originally created by @rcproam on GitHub (Apr 3, 2019).
Original GitHub issue: https://github.com/ewwhite/zfs-ha/issues/26

This is not an issue with the current design. Possibly label as enhancement?

In particular, due to the documented issue "RHEL 7 High Availability and Resilient Storage Pacemaker cluster experiences a fence race condition between nodes during network outages while using fence_scsi with multipath storage", would it be more reliable to utilize the fence_mpath agent than the fence_scsi agent?
I've encountered an issue very similar to the issue described here: https://access.redhat.com/solutions/3201072

Red Hat recommends using the fence_mpath agent instead of fence_scsi to resolve this particular issue; however, fence_mpath is more complex to configure and likely comes with its own unique caveats/issues.
https://access.redhat.com/articles/3078811

I still need to test the fence_mpath agent with my particular buildout to confirm whether it resolves the fencing / SCSI reservation issue I've encountered, but I'm opening this issue in case others have time to test the fence_mpath agent before I can.
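For anyone who wants to experiment before then: a minimal sketch of a fence_mpath stonith resource, based on Red Hat's documentation of the agent. The resource name, multipath device path, and node-to-key map below are placeholders for illustration, not values from this buildout:

```shell
# Sketch only: fence_mpath stonith resource. Substitute your own
# multipath device and node-to-key mapping.
pcs stonith create mpath-fence fence_mpath \
    devices=/dev/mapper/mpatha \
    pcmk_host_map="cluster-nas1:1;cluster-nas2:2" \
    pcmk_reboot_action="off" \
    meta provides=unfencing

# Unlike fence_scsi, fence_mpath acts on the dm-multipath device via
# mpathpersist, so multipath must be told to manage reservation keys,
# e.g. reservation_key "file" in /etc/multipath.conf, with per-node
# keys stored in /etc/multipath/prkeys.
```

The `meta provides=unfencing` part matters: like fence_scsi, the agent relies on Pacemaker unfencing nodes (re-registering their keys) at startup.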


@rcproam commented on GitHub (Apr 4, 2019):

Description of fence_mpath agent and how it functions compared to fence_scsi:

> fence_mpath: new fence agent for dm-multipath based on mpathpersist
> Previously, the scenario of multipath on top of underlying SCSI devices was handled by fence_scsi, which works correctly but has some limitations. The most important is that unfencing has to be done while all paths are available, as it is executed only once. This new fence agent handles the situation properly, as most of these cases are solved by mpathpersist, which is part of dm-multipath.
https://lists.fedorahosted.org/pipermail/cluster-commits/2014-November/004033.html
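When comparing the two agents in practice, it can help to inspect the persistent reservations directly at both layers. A sketch; the device paths are placeholders:

```shell
# fence_mpath registers keys through the dm-multipath layer:
mpathpersist --in -k -d /dev/mapper/mpatha   # list registered keys
mpathpersist --in -r -d /dev/mapper/mpatha   # show current reservation

# fence_scsi registers keys per underlying SCSI path; the same queries
# against one path show what it sees:
sg_persist --in --read-keys --device=/dev/sdb
sg_persist --in --read-reservation --device=/dev/sdb
```

If the per-path keys and the multipath-level view disagree after a path flap, that is the class of problem the mpathpersist-based agent is meant to avoid.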


@ewwhite commented on GitHub (Apr 4, 2019):

I'd still see if you can debug your specific issue. I don't know of anyone using fence_mpath for this type of setup, and there are plenty of folks using this guide with success.

Please note what I mentioned about diverse heartbeat network paths.
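On RHEL 7 / corosync 2.x, diverse heartbeat paths are typically expressed as redundant rings (RRP), with each node given one address per independent network. A sketch; the cluster name and hostnames are placeholders:

```shell
# Illustrative only: pcs configures corosync redundant rings when each
# node is given two comma-separated addresses, one per heartbeat
# network. Substitute real hostnames/addresses.
pcs cluster setup --name zfs-ha \
    cluster-nas1,cluster-nas1-alt \
    cluster-nas2,cluster-nas2-alt \
    --transport udpu --rrpmode passive
```

With `rrpmode passive`, corosync fails over to the second ring when the first stops passing traffic, so a single switch or NIC failure does not trigger a fence race.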


@rcproam commented on GitHub (Apr 8, 2019):

Thanks @ewwhite, I will try to debug some more... I'm still trying to understand how the pcs resource start and stop timeouts affect failover, as the suggested 90 seconds seems like a very large value (IIRC the TCP session timeout for NFS is only about 60 seconds).
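For reference, those operation timeouts can be inspected and adjusted per resource. A sketch; the resource name here is a placeholder:

```shell
# Show the configured operations (including start/stop timeouts) for a
# resource; "zfspool" is a placeholder name.
pcs resource show zfspool

# Adjust the start/stop operation timeouts (values illustrative).
# Note these cap how long Pacemaker waits before declaring the
# operation failed; they are not a fixed delay added to failover.
pcs resource update zfspool op start timeout=90s op stop timeout=90s
```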
Also, my particular deployment is utilizing a SuperMicro Storage Bridge Bay (SBB), which includes an internal Ethernet interconnect between nodes which I am using for heartbeats.


@rcproam commented on GitHub (Apr 8, 2019):

So I placed node #2 (cluster-nas2) into standby, then shut it down completely. When I subsequently start node #2 back up, it causes Pacemaker to crash on node #1. Below is an excerpt from the syslog on node #1 showing the sequence:

Apr 8 01:35:41 svr-lf-nas1 crmd[2850]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Apr 8 01:50:41 svr-lf-nas1 crmd[2850]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 8 01:50:41 svr-lf-nas1 pengine[2849]: notice: On loss of CCM Quorum: Ignore
Apr 8 01:50:41 svr-lf-nas1 pengine[2849]: notice: Calculated transition 3481, saving inputs in /var/lib/pacemaker/pengine/pe-input-367.bz2
Apr 8 01:50:41 svr-lf-nas1 crmd[2850]: notice: Transition 3481 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-367.bz2): Complete
Apr 8 01:50:41 svr-lf-nas1 crmd[2850]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Apr 8 02:05:41 svr-lf-nas1 crmd[2850]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 8 02:05:41 svr-lf-nas1 pengine[2849]: notice: On loss of CCM Quorum: Ignore
Apr 8 02:05:41 svr-lf-nas1 pengine[2849]: notice: Calculated transition 3482, saving inputs in /var/lib/pacemaker/pengine/pe-input-367.bz2
Apr 8 02:05:41 svr-lf-nas1 crmd[2850]: notice: Transition 3482 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-367.bz2): Complete
Apr 8 02:05:41 svr-lf-nas1 crmd[2850]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Apr 8 02:17:01 svr-lf-nas1 CRON[13384]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Apr 8 02:18:52 svr-lf-nas1 crmd[2850]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 8 02:18:52 svr-lf-nas1 pengine[2849]: notice: On loss of CCM Quorum: Ignore
Apr 8 02:18:52 svr-lf-nas1 pengine[2849]: notice: Calculated transition 3483, saving inputs in /var/lib/pacemaker/pengine/pe-input-368.bz2
Apr 8 02:18:52 svr-lf-nas1 crmd[2850]: notice: Transition 3483 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-368.bz2): Complete
Apr 8 02:18:52 svr-lf-nas1 crmd[2850]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Apr 8 02:19:15 svr-lf-nas1 crmd[2850]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 8 02:19:15 svr-lf-nas1 pengine[2849]: notice: On loss of CCM Quorum: Ignore
Apr 8 02:19:15 svr-lf-nas1 pengine[2849]: notice: Scheduling Node cluster-nas2 for shutdown
Apr 8 02:19:15 svr-lf-nas1 pengine[2849]: notice: Calculated transition 3484, saving inputs in /var/lib/pacemaker/pengine/pe-input-369.bz2
Apr 8 02:19:15 svr-lf-nas1 crmd[2850]: notice: Transition 3484 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-369.bz2): Complete
Apr 8 02:19:15 svr-lf-nas1 crmd[2850]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Apr 8 02:19:15 svr-lf-nas1 crmd[2850]: notice: do_shutdown of peer cluster-nas2 is complete
Apr 8 02:19:15 svr-lf-nas1 cib[2845]: notice: Node cluster-nas2 state is now lost
Apr 8 02:19:15 svr-lf-nas1 cib[2845]: notice: Purged 1 peers with id=2 and/or uname=cluster-nas2 from the membership cache
Apr 8 02:19:15 svr-lf-nas1 corosync[2768]: notice [TOTEM ] A new membership (198.51.100.1:884) was formed. Members left: 2
Apr 8 02:19:15 svr-lf-nas1 corosync[2768]: notice [QUORUM] Members[1]: 1
Apr 8 02:19:15 svr-lf-nas1 corosync[2768]: notice [MAIN ] Completed service synchronization, ready to provide service.
Apr 8 02:19:15 svr-lf-nas1 corosync[2768]: [TOTEM ] A new membership (198.51.100.1:884) was formed. Members left: 2
Apr 8 02:19:15 svr-lf-nas1 corosync[2768]: [QUORUM] Members[1]: 1
Apr 8 02:19:15 svr-lf-nas1 corosync[2768]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 8 02:19:15 svr-lf-nas1 pacemakerd[2840]: notice: Node cluster-nas2 state is now lost
Apr 8 02:19:15 svr-lf-nas1 crmd[2850]: notice: Node cluster-nas2 state is now lost
Apr 8 02:19:15 svr-lf-nas1 crmd[2850]: notice: do_shutdown of peer cluster-nas2 is complete
Apr 8 02:19:15 svr-lf-nas1 stonith-ng[2846]: notice: Node cluster-nas2 state is now lost
Apr 8 02:19:15 svr-lf-nas1 stonith-ng[2846]: notice: Purged 1 peers with id=2 and/or uname=cluster-nas2 from the membership cache
Apr 8 02:19:15 svr-lf-nas1 attrd[2848]: notice: Node cluster-nas2 state is now lost
Apr 8 02:19:15 svr-lf-nas1 attrd[2848]: notice: Removing all cluster-nas2 attributes for peer loss
Apr 8 02:19:15 svr-lf-nas1 attrd[2848]: notice: Lost attribute writer cluster-nas2
Apr 8 02:19:15 svr-lf-nas1 attrd[2848]: notice: Purged 1 peers with id=2 and/or uname=cluster-nas2 from the membership cache
Apr 8 02:19:25 svr-lf-nas1 kernel: [3133621.758535] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Down
Apr 8 02:19:27 svr-lf-nas1 ntpd[2809]: Deleting interface #11 eno3, 198.51.100.1#123, interface stats: received=0, sent=0, dropped=0, active_time=3133254 secs
Apr 8 02:19:28 svr-lf-nas1 kernel: [3133624.730941] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Apr 8 02:19:30 svr-lf-nas1 ntpd[2809]: Listen normally on 12 eno3 198.51.100.1:123
Apr 8 02:20:33 svr-lf-nas1 kernel: [3133689.895368] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Down
Apr 8 02:20:35 svr-lf-nas1 ntpd[2809]: Deleting interface #12 eno3, 198.51.100.1#123, interface stats: received=0, sent=0, dropped=0, active_time=65 secs
Apr 8 02:20:37 svr-lf-nas1 kernel: [3133692.983744] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Apr 8 02:20:38 svr-lf-nas1 ntpd[2809]: Listen normally on 13 eno3 198.51.100.1:123
Apr 8 02:20:42 svr-lf-nas1 kernel: [3133698.535494] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Down
Apr 8 02:20:44 svr-lf-nas1 ntpd[2809]: Deleting interface #13 eno3, 198.51.100.1#123, interface stats: received=0, sent=0, dropped=0, active_time=6 secs
Apr 8 02:20:45 svr-lf-nas1 kernel: [3133701.371873] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Apr 8 02:20:47 svr-lf-nas1 ntpd[2809]: Listen normally on 14 eno3 198.51.100.1:123
Apr 8 02:21:12 svr-lf-nas1 kernel: [3133728.815903] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Down
Apr 8 02:21:14 svr-lf-nas1 ntpd[2809]: Deleting interface #14 eno3, 198.51.100.1#123, interface stats: received=0, sent=0, dropped=0, active_time=27 secs
Apr 8 02:21:38 svr-lf-nas1 kernel: [3133754.760563] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Apr 8 02:21:38 svr-lf-nas1 kernel: [3133754.760633] igb 0000:05:00.0 eno3: Link Speed was downgraded by SmartSpeed
Apr 8 02:21:40 svr-lf-nas1 ntpd[2809]: Listen normally on 15 eno3 198.51.100.1:123
Apr 8 02:22:35 svr-lf-nas1 kernel: [3133811.692929] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Down
Apr 8 02:22:37 svr-lf-nas1 ntpd[2809]: Deleting interface #15 eno3, 198.51.100.1#123, interface stats: received=0, sent=0, dropped=0, active_time=57 secs
Apr 8 02:23:34 svr-lf-nas1 kernel: [3133870.401931] igb 0000:05:00.0 eno3: igb: eno3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Apr 8 02:23:34 svr-lf-nas1 kernel: [3133870.401997] igb 0000:05:00.0 eno3: Link Speed was downgraded by SmartSpeed
Apr 8 02:23:35 svr-lf-nas1 corosync[2768]: notice [TOTEM ] A new membership (198.51.100.1:888) was formed. Members joined: 2
Apr 8 02:23:35 svr-lf-nas1 corosync[2768]: [TOTEM ] A new membership (198.51.100.1:888) was formed. Members joined: 2
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: notice: do_shutdown of peer cluster-nas2 is complete
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: error: Node cluster-nas2[2] appears to be online even though we think it is dead
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: notice: Node cluster-nas2 state is now member
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: notice: State transition S_IDLE -> S_INTEGRATION
Apr 8 02:23:35 svr-lf-nas1 corosync[2768]: notice [QUORUM] Members[2]: 1 2
Apr 8 02:23:35 svr-lf-nas1 corosync[2768]: notice [MAIN ] Completed service synchronization, ready to provide service.
Apr 8 02:23:35 svr-lf-nas1 corosync[2768]: [QUORUM] Members[2]: 1 2
Apr 8 02:23:35 svr-lf-nas1 pacemakerd[2840]: notice: Node cluster-nas2 state is now member
Apr 8 02:23:35 svr-lf-nas1 corosync[2768]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 8 02:23:35 svr-lf-nas1 cib[2845]: notice: Node cluster-nas2 state is now member
Apr 8 02:23:35 svr-lf-nas1 attrd[2848]: notice: Node cluster-nas2 state is now member
Apr 8 02:23:35 svr-lf-nas1 stonith-ng[2846]: notice: Node cluster-nas2 state is now member
Apr 8 02:23:35 svr-lf-nas1 attrd[2848]: notice: Recorded attribute writer: cluster-nas2
Apr 8 02:23:35 svr-lf-nas1 cib[2845]: error: Cannot perform modification with no data
Apr 8 02:23:35 svr-lf-nas1 cib[2845]: warning: Completed cib_modify operation for section status: Invalid argument (rc=-22, origin=cluster-nas2/crmd/35, version=0.256.6)
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: warning: Another DC detected: cluster-nas2 (op=noop)
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: notice: State transition S_ELECTION -> S_INTEGRATION
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: notice: Syncing the Cluster Information Base from cluster-nas2 to rest of cluster
Apr 8 02:23:35 svr-lf-nas1 crmd[2850]: notice: Requested version <generation_tuple crm_feature_set="3.0.11" validate-with="pacemaker-2.6" epoch="256" num_updates="13" admin_epoch="0" cib-last-written="Mon Apr 8 02:18:52 2019" update-origin="cluster-nas2" update-client="crm_attribute" update-user="hacluster" have-quorum="1" dc-uuid="2"/>
Apr 8 02:23:35 svr-lf-nas1 attrd[2848]: notice: Updating all attributes after cib_refresh_notify event
Apr 8 02:23:36 svr-lf-nas1 ntpd[2809]: Listen normally on 16 eno3 198.51.100.1:123
Apr 8 02:23:36 svr-lf-nas1 stonith-ng[2846]: notice: Operation reboot of cluster-nas1 by cluster-nas2 for crmd.2716@cluster-nas2.f04b7ab5: OK
Apr 8 02:23:36 svr-lf-nas1 stonith-ng[2846]: notice: Operation on of cluster-nas2 by cluster-nas2 for crmd.2716@cluster-nas2.fbcdacb2: OK
Apr 8 02:23:37 svr-lf-nas1 crmd[2850]: crit: We were allegedly just fenced by cluster-nas2 for cluster-nas2!
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: warning: The crmd process (2850) can no longer be respawned, shutting the cluster down.
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Shutting down Pacemaker
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Stopping pengine
Apr 8 02:23:37 svr-lf-nas1 kernel: [3133873.286736] sd 0:0:13:0: Parameters changed
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: new_event_notification (2847-2850-7): Bad file descriptor (9)
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: Notification of client crmd/09eb8595-7f1f-4169-aa4a-8935aa1fb4b6 failed
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: Notification of client crmd/09eb8595-7f1f-4169-aa4a-8935aa1fb4b6 failed
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: Notification of client crmd/09eb8595-7f1f-4169-aa4a-8935aa1fb4b6 failed
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: Notification of client crmd/09eb8595-7f1f-4169-aa4a-8935aa1fb4b6 failed
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: Notification of client crmd/09eb8595-7f1f-4169-aa4a-8935aa1fb4b6 failed
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: warning: Notification of client crmd/09eb8595-7f1f-4169-aa4a-8935aa1fb4b6 failed
Apr 8 02:23:37 svr-lf-nas1 pengine[2849]: notice: Caught 'Terminated' signal
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Stopping attrd
Apr 8 02:23:37 svr-lf-nas1 attrd[2848]: notice: Caught 'Terminated' signal
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Stopping lrmd
Apr 8 02:23:37 svr-lf-nas1 lrmd[2847]: notice: Caught 'Terminated' signal
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Stopping stonith-ng
Apr 8 02:23:37 svr-lf-nas1 stonith-ng[2846]: notice: Caught 'Terminated' signal
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Stopping cib
Apr 8 02:23:37 svr-lf-nas1 cib[2845]: notice: Caught 'Terminated' signal
Apr 8 02:23:37 svr-lf-nas1 cib[2845]: notice: Disconnected from Corosync
Apr 8 02:23:37 svr-lf-nas1 cib[2845]: notice: Disconnected from Corosync
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Shutdown complete
Apr 8 02:23:37 svr-lf-nas1 pacemakerd[2840]: notice: Attempting to inhibit respawning after fatal error


@ewwhite commented on GitHub (Apr 8, 2019):

Can you show me the pcs resource creation string you used for the fencing?

Maybe also the cluster creation string... and also your hosts files?


@ewwhite commented on GitHub (Apr 12, 2019):

Any updates? @rcproam


@rcproam commented on GitHub (Apr 12, 2019):

Thanks so much for following-up on this @ewwhite and my apologies for the delay. My spare time has been focused on tax preparations this week.

Anyhow, I did try configuring the fence_mpath agent devices, but unfortunately unfencing didn't work for me :-\

Will try to revert to the fence_scsi agent tonight and provide the info you requested.

BTW, are you receiving Email to your @ewwhite.net address? I had sent an Email last week. If you’re located in Chicago maybe we could meet up one day? Would like to learn more about your consulting business in case I have the opportunity to refer some new business to you.

Reference: github-starred/zfs-ha#26