[GH-ISSUE #25] scsi reservations issue on failover #24

Open
opened 2026-05-05 03:29:55 -06:00 by gitea-mirror · 31 comments
Owner

Originally created by @WanWizard on GitHub (Apr 2, 2019).
Original GitHub issue: https://github.com/ewwhite/zfs-ha/issues/25

Hey, followed your great instructions to the letter, but I'm left with a situation that leaves me stumped.

I have a setup with two Supermicros, each connected to two 12-disk JBODs with SAS disks, but without a loop, so no multipath (and multipath is not installed). Both JBODs are used in mirrored vdevs, so I can lose an entire JBOD without much trouble.

OS: CentOS Linux release 7.6.1810 (Core)
ZFS: 0.7.13, from the zfs-kmod repo

This setup works fine until pacemaker decides there is a need to fail over. It doesn't matter whether that is because the active node is put into standby, because the hardware is switched off, etc.

When pacemaker fails over, the second node tries to import the pool, which fails because something on the first node has placed SCSI reservations on the disks:

[root@nas01 /]# sg_persist -r /dev/sdh
  NETAPP    X412_HVIPC560A15  NA02
  Peripheral device type: disk
  PR generation=0x1, Reservation follows:
    Key=0x666e0001
    scope: LU_SCOPE,  type: Write Exclusive, registrants only

As soon as the failover happens, the second node starts to log:

[ 5834.890588] sd 0:0:7:0: reservation conflict
[ 5834.890674] sd 0:0:7:0: reservation conflict
[ 5834.890693] sd 0:0:7:0: [sdh] Test Unit Ready failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[ 5834.891274] sd 0:0:7:0: reservation conflict
[ 5834.891369] sd 0:0:7:0: reservation conflict
[ 5834.891466] sd 0:0:7:0: reservation conflict
[ 5834.891560] sd 0:0:7:0: reservation conflict
[ 5834.921402]  sdh: sdh1 sdh9
[ 5834.922452] sd 0:0:7:0: reservation conflict
[ 5834.957331] sd 0:0:7:0: reservation conflict
[ 5834.958157] sd 0:0:7:0: reservation conflict
[ 5835.052881] sd 0:0:7:0: reservation conflict

which either causes the entire import to fail or, if the import succeeds, leaves disks offline due to excessive errors.

I've been pulling my hair out for about two weeks now, but have no clue what sets these reservations, or how I can have them released on a cluster start or a cluster failover. There seem to be lots of people building Linux HA clusters with ZFS judging by the discussions I found, but no one mentions this issue...
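For anyone hitting the same symptom: a stale SCSI-3 persistent reservation left behind by another node can be inspected and released manually with sg3_utils. A sketch under the assumption that the device and the key `0x666e0001` match the `sg_persist -r` output above; adapt both to your own disks before running anything:

```shell
# Inspect registered keys and the active reservation on the disk.
sg_persist --in --read-keys /dev/sdh
sg_persist --in --read-reservation /dev/sdh

# From the node that needs the disk: register a key of our own
# (0xdeadbeef is an arbitrary example), then preempt the stale key.
sg_persist --out --register --param-sark=0xdeadbeef /dev/sdh
sg_persist --out --preempt --param-rk=0xdeadbeef --param-sark=0x666e0001 /dev/sdh
```

This is hardware-dependent and destructive to the other node's reservation, so it only makes sense when the previous holder is known to be down.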

@ewwhite commented on GitHub (Apr 2, 2019):

Hello,

This doesn't follow the spirit of what I documented.
The idea of my instructions is to use JBODs with dual-ported SAS disks and multipath cabling.

Why doesn't your design use multipath?

@WanWizard commented on GitHub (Apr 3, 2019):

Don't have the space for a second HBA, servers have only a few low-profile slots. One has the HBA, the other a dual 10G card for NFS connectivity.

This is a low-budget op with second hand hardware, paid by donations, to provide storage for the build/compile servers of an open source development team I help out with some infra and admin work.

I understand that two HBA's and multipathing would provide additional availability, but unfortunately it is what it is. Until a big sponsor comes along... ;-)

@milleroff commented on GitHub (Apr 3, 2019):

We had the same problems with SCSI reservations as it never worked as expected. Sometimes when a failover happened, the new active server could not import the disks.
We ended up setting up IPMI fencing. If something goes wrong, the second server shuts off the first server over the IPMI protocol and takes control of the disks.

@WanWizard commented on GitHub (Apr 3, 2019):

Good to read I'm not alone. Not good that you needed to work around it like that. I'd hoped to avoid that.

@ewwhite commented on GitHub (Apr 3, 2019):

I haven't had such issues with any deployment.
You can use a single dual-port HBA in each host. Dual HBA cards are not required.

@milleroff Please make sure `meta provides=unfencing` and `pcmk_monitor_action="metadata"` are set, and that `pcmk_host_list=` is populated in your SCSI stonith resource. Also, this absolutely requires dual-port SAS drives everywhere.

@WanWizard commented on GitHub (Apr 3, 2019):

That is indeed how I've hooked it up now, one port going to each of the enclosures.

All my disks are Hitachi HUS156060VLS600, which are dual-port SAS drives. I don't have any SCSI fencing active, removing that was my first step in trying to find the problem.

@ewwhite commented on GitHub (Apr 3, 2019):

SCSI fencing is crucial to what you're doing. That's how the failover and pool import work.

@WanWizard commented on GitHub (Apr 3, 2019):

So what is setting the reservations, as it's not the fence_scsi agent?

I get that in production you need it to avoid imports on both nodes (which would be utter horror), but if it doesn't work in a controlled failover (where the pools are cleanly exported and a node is cleanly shut down to trigger a failover), I don't see how adding another layer of complexity will fix the issue.

@ewwhite commented on GitHub (Apr 3, 2019):

Controlled and uncontrolled failovers work in the setup I've described and documented.

I do not know what's unique about your environment, but removing critical components of the design isn't going to help the situation. What is the output of `zpool status -v`, `lsscsi` and `multipath -ll`?

This high-availability design assumes:

- Multipath SAS cabling
- Dual-ported disks
- Multipath service enabled
- ZFS pool creation using dm-multipath devices (not individual `/dev/sdX` SCSI disks)
- SCSI reservation fencing
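On CentOS 7, the multipath prerequisites above can be satisfied roughly like this (a sketch assuming the stock device-mapper-multipath package; you would still want to blacklist single-path boot devices such as a SATA DOM in `/etc/multipath.conf`):

```shell
# Install dm-multipath and generate a default configuration.
yum install -y device-mapper-multipath
mpathconf --enable --with_multipathd y   # writes /etc/multipath.conf and starts multipathd
systemctl enable --now multipathd

# Verify: each dual-ported disk should appear as one mpath device
# with two active paths underneath it.
multipath -ll
```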

@WanWizard commented on GitHub (Apr 3, 2019):

I get that. I'd rather work with a documented (and if possible supported) environment as well. But as said, it is what it is. ;-)

[root@nas01 /]# zpool status -v
  pool: sas01
 state: ONLINE
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        sas01                       ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000cca01fa899a4  ONLINE       0     0     0
            wwn-0x5000cca01fa81cfc  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x5000cca02a6a7e6c  ONLINE       0     0     0
            wwn-0x5000cca01fcf0128  ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            wwn-0x5000cca01f8f1394  ONLINE       0     0     0
            wwn-0x5000cca0411f4c48  ONLINE       0     0     0
          mirror-3                  ONLINE       0     0     0
            wwn-0x5000cca01fcc73ec  ONLINE       0     0     0
            wwn-0x5000cca02a018bf8  ONLINE       0     0     0
          mirror-4                  ONLINE       0     0     0
            wwn-0x5000cca01f47b644  ONLINE       0     0     0
            wwn-0x5000cca018511124  ONLINE       0     0     0
          mirror-5                  ONLINE       0     0     0
            wwn-0x5000cca01fda745c  ONLINE       0     0     0
            wwn-0x5000cca02a6b3e20  ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            wwn-0x5000cca01fa6e548  ONLINE       0     0     0
            wwn-0x5000cca018d80018  ONLINE       0     0     0
          mirror-7                  ONLINE       0     0     0
            wwn-0x5000cca02a67e1e0  ONLINE       0     0     0
            wwn-0x5000cca02a0070e0  ONLINE       0     0     0

errors: No known data errors
[root@nas01 /]# lsscsi
[0:0:0:0]    disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdb
[0:0:1:0]    disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdc
[0:0:2:0]    disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdd
[0:0:3:0]    disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sde
[0:0:4:0]    disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdf
[0:0:5:0]    disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdg
[0:0:6:0]    disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdh
[0:0:7:0]    disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdi
[0:0:8:0]    enclosu LSI      SAS2X28          0e12  -
[0:0:9:0]    disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdj
[0:0:10:0]   disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdk
[0:0:11:0]   disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdl
[0:0:12:0]   disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdm
[0:0:13:0]   disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdn
[0:0:14:0]   disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdo
[0:0:15:0]   disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdp
[0:0:16:0]   disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdq
[0:0:17:0]   enclosu LSI      SAS2X28          0e12  -
[2:0:0:0]    disk    ATA      SAMSUNG SSD CM85 3D2Q  /dev/sda

The SSD disk is a 32GB SATA DOM the server boots from.

I don't have a multipath setup, so no multipath service installed. And therefore no multipath output, and no dm-multipath devices.

I know my setup isn't as documented, but I was still hoping someone knew where these reservations came from, so I could work with/around them. I have no issues writing a resource agent to deal with those if needed, if that is what it takes.

Thanks so far. I have reinstated the fence_scsi agent now and will try another failover tomorrow. I'm in GMT+2, getting late here... ;-)

@ewwhite commented on GitHub (Apr 3, 2019):

I'd enable the multipath daemon, re-create the pool with the resulting /dev/mapper devices and add a fencing resource containing those /dev/mapper devices.

pcs stonith create fence-vol1 fence_scsi pcmk_monitor_action="metadata" pcmk_host_list="zfs-node1,zfs-node2" devices="/dev/mapper/35000c500236061b3,/dev/mapper/35000c500236032f7,/dev/mapper/35000c5007772e5ff,/dev/mapper/35000c50023614aef,/dev/mapper/35000a7203008de44,/dev/mapper/35000c500236004a3,/dev/mapper/35000c5002362ffab,/dev/mapper/35000c500236031ab,/dev/mapper/35000c50023605c1b,/dev/mapper/35000c500544508b7,/dev/mapper/35000c5002362f347" meta provides=unfencing

Your NetApp shelves should allow you to do this.

@WanWizard commented on GitHub (Apr 3, 2019):

OK, will do that tomorrow evening. Thanks for the help so far.

@WanWizard commented on GitHub (Apr 3, 2019):

Decided to restart the project completely. Formatted and cleared everything, reinstalled CentOS and ZFS, following the wiki, virtually to the letter.

The difference is that I decided not to use multipath. I had a chat with some Red Hat DC guys today, and they all advised me not to use it when there is no multipath in use, to avoid another layer of complexity. So I followed their advice and used `/dev/disk/by-id` instead of `/dev/mapper`. Seems to work fine so far.

Just tested a few failovers by switching nodes to standby and back and faking network issues, and that seems to work fine now, including the SCSI fencing. Happy days.

Only one issue left: when ZFS fails over, the shares aren't activated after the failover, and I need to do a `zfs share -a` to get them active again. Put one node in standby and restarted the active node, after which the same issue occurred. `/etc/default/zfs` has `ZFS_SHARE='yes'`, and the `zfs-share.service` is enabled.

No idea where to look next. I didn't have this problem before, so I seem to roll from one issue to the next...

@ewwhite commented on GitHub (Apr 3, 2019):

I don't understand what you are trying to do by avoiding multipath, as it is a key element of this design.

I understand you're seeking assistance, but you have not clearly articulated the reasoning behind not using multipath devices. If there's an architectural issue preventing multipath cabling, please explain.

@WanWizard commented on GitHub (Apr 3, 2019):

I don't have multiple paths to my storage, so it is totally pointless to install and use multipath; every device only has one path.

I get it is a key element of your design, but I don't have the hardware to match, I have already explained that to you (I only have one HBA per server, and only place for one).

@ewwhite commented on GitHub (Apr 3, 2019):

I'm sorry, but the guidelines are very clear. Single HBAs aren't a problem if they have two external ports. Your equipment choices and crafting workarounds are not a valid support issue.

@WanWizard commented on GitHub (Apr 3, 2019):

What an attitude. Disappointing.

I have servers with a single HBA. They have two ports each. One port is connected to shelf A, one port is connected to shelf B. The shelves themselves also only have two ports, so I CAN'T create a multipath even if I wanted to. As I wrote yesterday, it is what it is, and then you didn't have a problem with it.

Besides that, the fact that NFS shares don't become available after a zpool import has absolutely zero to do with whether multipath is in use or not. It doesn't work either if I boot up one node while the other is switched off...

@ewwhite commented on GitHub (Apr 3, 2019):

This is outside the scope of support because your solution is not built properly.

Regarding ZFS shares, filesystem exports are shared automatically on zpool import. sharenfs is a ZFS filesystem property, so if the filesystem is present and mounted, the sharing should work.

I suspect that your pools aren't actually exporting/importing, since the servers and disks have no knowledge of each other: there's no use of multipath devices/device names in your zpool.
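As a quick sanity check of that behavior: `sharenfs` travels with the dataset, so its value and the resulting exports can be verified right after an import. A sketch; the dataset name `sas01/data` is hypothetical:

```shell
# sharenfs is a dataset property and follows the pool on import.
zfs set sharenfs=on sas01/data      # or an option string, e.g. 'rw=@10.0.0.0/24'
zfs get -r sharenfs sas01           # confirm the property after importing the pool

# List what nfsd is actually exporting (requires the NFS server to be running).
showmount -e localhost
```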

@rcproam commented on GitHub (Apr 3, 2019):

@ewwhite Thanks so much for your excellent and hard work on this project! If you're interested, I was hoping to share some work I've done to integrate your design with ZnapZend, which (as I'm sure you're aware) stores all of the snapshot & replication configuration within properties of the ZFS filesystem itself. In my testing (with minor modification) ZnapZend meshes well with your design :-)

> We had the same problems with SCSI reservations as it never worked as expected. Sometimes when a failover happened, the new active server could not import the disks.

@milleroff
I've encountered a very similar issue with nodes not releasing SCSI reservations during graceful failover. However, my design is based on Debian Stretch (stable), which unfortunately at present only includes older versions of pacemaker and fence-agents compared to CentOS 7. As such, I was thinking that the failover/fencing issue I'm encountering is related to an issue Red Hat documented: "RHEL 7 High Availability and Resilient Storage Pacemaker cluster experiences a fence race condition between nodes during network outages while using fence_scsi with multipath storage"
https://access.redhat.com/solutions/3201072
...but as I don't have a RHEL account I can't see which versions of pacemaker and fence-agents are affected. If you have time, might you be able to share the versions of OS, pacemaker, and fence-agents used in your implementation?

@WanWizard If you have each SAS HBA port connected to a separate shelf on both server nodes, then the physical SAS connectivity IS already multipathed. However, for this ZFS-HA design to function, and as @ewwhite has described in the wiki, it is essential to install and configure device-mapper-multipath, and use the `/dev/mapper/` device IDs for the vdevs when you create the ZFS pool. Hope this helps.
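Concretely, pool creation under that design references the dm-multipath names rather than `/dev/sdX` or WWN symlinks. A sketch only; the pool layout mirrors the `zpool status` output above, but the WWIDs are placeholders to adapt from your own `multipath -ll` output:

```shell
# Create a mirrored pool from dm-multipath device names, so both
# cluster nodes address the same stable device identities.
zpool create sas01 \
  mirror /dev/mapper/35000cca01fa899a4 /dev/mapper/35000cca01fa81cfc \
  mirror /dev/mapper/35000cca02a6a7e6c /dev/mapper/35000cca01fcf0128
```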

@ewwhite commented on GitHub (Apr 3, 2019):

Here are the notes from the Red Hat support article linked above:

Resolution
- Utilizing fence_mpath instead of fence_scsi should prevent this race condition from occurring
- Ensure storage is always appropriately zoned so that all paths are functional

Root Cause
This issue occurs in the rare circumstance where your nodes have a faulty path to multipath storage but are unaware that the path is currently offline. This can be a result of improper zoning of devices from the storage side, misconfiguration of FCoE storage, or just a path failure occurring at the right time. If the node is unaware that one path is not accepting I/O to the device, and it hasn't been otherwise determined that the device is unavailable, it may temporarily prevent a node from adjusting SCSI reservations on the device long enough for another node in the cluster to fence that device.

@rcproam commented on GitHub (Apr 3, 2019):

Thanks for the prompt response and helpful info @ewwhite :-)

It seems the fence_mpath agent is a little more complex to setup, and requires "that /etc/multipath.conf be configured with a unique reservation_key hexadecimal value on each node, either in the defaults or in a multipath block for each cluster-shared device."
https://access.redhat.com/articles/3078811
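
For context, a fence_mpath setup along the lines of that Red Hat article would look roughly like this; a sketch only — the key values, node names, and device paths below are invented placeholders, not from this thread:

```shell
# /etc/multipath.conf — each node gets its own unique reservation_key,
# e.g. on node1:
#   defaults {
#       reservation_key 0x1
#   }
# (node2 would use a different key, e.g. 0x2)

# Hypothetical stonith resource mapping nodes to their keys;
# pcmk_host_map associates each cluster node with its key value
pcs stonith create fence-mpath fence_mpath \
    devices="/dev/mapper/mpatha" \
    pcmk_host_map="node1:1;node2:2" \
    meta provides=unfencing
```

The `meta provides=unfencing` part matters here too: it makes pacemaker re-register a node's key on the devices when the node rejoins, rather than leaving it fenced out.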

Have you tested using the fence_mpath agent with your design BTW?


@WanWizard commented on GitHub (Apr 3, 2019):

For those finding this issue because of a similar issue:

Multipath is not a requirement in my setup; creating (and failing over) a zpool with vdevs using WWNs works fine, since WWNs are fixed and unique. I had this confirmed by my company's Red Hat system engineer.

And I've checked the ZoL code: `zpool import` only does a `zfs share -a` on import **if it detects that NFS is already running**. That wasn't the case for me due to resource order constraints (I have the nfsserver status directory on a dataset in the zpool to be able to fail over NFS state, so ZFS must start before NFS).

I addressed this issue by modifying the nfsserver heartbeat script, adding a `zfs share -a` just before `ocf_log info "NFS server started"`.
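
The workaround amounts to a one-line addition to the nfsserver resource agent; a sketch of the edit (the path is where the resource-agents package typically installs the script — exact line placement may differ by version):

```shell
# In /usr/lib/ocf/resource.d/heartbeat/nfsserver, inside nfsserver_start(),
# re-share ZFS datasets once NFS is up, just before the success log line:
zfs share -a
ocf_log info "NFS server started"
```

Since NFS is now running by the time this executes, `zfs share -a` succeeds where the implicit share during `zpool import` did not.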


@ewwhite commented on GitHub (Apr 3, 2019):

@rcproam No, I have not had a need to use fence_mpath; I have not encountered SCSI reservation problems in my builds. Definitely make sure `meta provides=unfencing` and `pcmk_monitor_action="metadata"` are set, and that `pcmk_host_list=` is populated in your SCSI stonith resource.
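
As a concrete illustration, those attributes would appear in a fence_scsi resource roughly like this; the resource name, node names, and device path are placeholders, not taken from any actual config in this thread:

```shell
# Sketch of a fence_scsi stonith resource with the attributes above;
# replace the host list and device path with your own values
pcs stonith create fence-vol1 fence_scsi \
    pcmk_host_list="zfs1-node1 zfs1-node2" \
    pcmk_monitor_action="metadata" \
    devices="/dev/disk/by-id/wwn-0xEXAMPLE" \
    meta provides=unfencing
```

`pcmk_monitor_action="metadata"` keeps the monitor from issuing SCSI commands against the shared devices, and `provides=unfencing` lets a rejoining node re-register its reservation key.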

The other thing that I do these days is ensure there's a discrete heartbeat network path between nodes. I've been using a simple USB transfer cable between hosts to provide this additional link as the alternate Corosync ring.

I found this to be necessary in environments where I have MLAG/MC-LAG switches and multi-chassis LACP from the server to switches. A switch failure with collapsed VLANs for data, heartbeat, etc. would kill all of the network links, including the Corosync rings.

That's the only other modification I've needed. I don't suspect that SCSI reservation issues are commonplace.

@WanWizard - I advise leaving the NFS service running full time on both nodes. ZFS takes care of the rest. There's no need to start/stop that service for this purpose. Note that there's no NFS server resource. Just ZFS zpool, STONITH and IP address.

```
 fence-vol1	(stonith:fence_scsi):	Started zfs1-node1
 Resource Group: group-vol1
     vol1	(ocf::heartbeat:ZFS):	Started zfs1-node1
     vol1-ip	(ocf::heartbeat:IPaddr2):	Started zfs1-node1
```

@WanWizard commented on GitHub (Apr 3, 2019):

As I wrote, I have `nfs_shared_infodir` configured to point to a ZFS dataset, so I can fail over NFS state information.

This is at the suggestion of that same Red Hat engineer, and documented here: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_administration/s1-resourcegroupcreatenfs-haaa

Doing so requires the zpool and its datasets to be available before NFS starts, and that can only be achieved using the nfsserver resource in combination with order constraints.

It does create a chicken-and-egg problem, I understand that now. Redhat's examples are based on DRBD, which doesn't have this problem. I worked around it, I don't have a problem with that. Just wanted to report that back, for future reference.
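
The ordering described above (zpool first, then nfsserver) is expressed in pacemaker with constraints; a sketch, with resource names invented for illustration:

```shell
# Hypothetical resource names: "vol1" (ocf:heartbeat:ZFS) and
# "nfs-daemon" (ocf:heartbeat:nfsserver); the pool holding
# nfs_shared_infodir must be imported before NFS starts
pcs constraint order start vol1 then nfs-daemon
pcs constraint colocation add nfs-daemon with vol1 INFINITY
```

The colocation constraint keeps NFS on whichever node currently holds the pool, which is what makes the state-directory failover work at all.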


@rcproam commented on GitHub (Apr 3, 2019):

@ewwhite Thanks again for the prompt reply and helpful tips! :-)

As the fence_scsi vs fence_mpath agent topic is out of scope for this issue, I've opened a new issue to track if it resolves the fencing problem with my particular implementation:

#26


@rbicelli commented on GitHub (Apr 29, 2019):

Hello, I think I'm in the same situation as @WanWizard. I have a Dell MD1220, which basically has 2 EMM controllers with 2 SAS ports each. Unfortunately the 2nd port is unusable for multipath because it is reserved for the daisy chain between enclosures only. So I'm forced to connect only one SAS cable per EMM (EMM1 to the host1 HBA, EMM2 to the host2 HBA). The disks are dual-port SAS.

I have multipath enabled, but obviously `multipath -ll` shows a single path for each disk on each host.

However it seems that failover is working with no issues.

My question is: could I stay safe with this setup or have I to migrate to a full dual-port SAS solution?


@WanWizard commented on GitHub (Apr 30, 2019):

The only additional risk you run is that a cable or connection issue between the active node and one of the enclosures will trigger a failover, whereas with multipath the active node would remain active and use the second path.

It depends on your situation, but in my case everything runs in a locked rack that nobody ever opens, the chance of connection issues is very slim, and a failover because of one is not a problem (that is why I have two nodes, right?). In my case the replacement costs by far outweigh the risks.


@ewwhite commented on GitHub (Apr 30, 2019):

I just read through the technical guidebook for the Dell MD1220. The manual says that clustering is not supported on the enclosure.

What happens if you create a SAS Multipath ring and use the Out ports on the EMM? The manual says the ports may be disabled depending on the enclosure mode (split/unified). If this doesn’t work, I guess that means this Dell is not an ideal enclosure for ZFS clustering purposes.



@rbicelli commented on GitHub (Apr 30, 2019):

By connecting a SAS HBA to the Out port on the EMM, nothing happens; `multipath -ll` still says there's only one possible path.

However I did some failover tests and everything seemed fine:

- Soft power off of a node: ok
- Hard power off of a node: ok
- Kernel panic of a node: ok
- Pulling a SAS cable out of a node: ok

"ok" = the remaining active node takes over the resources, all while a Windows VM ran iometer workload tests.

At this point I don't understand if dual SAS cables per HBA are really a requirement for the solution to work.

I also think the Dell MD1200 should be removed from the wiki, since it is equipped with the same EMM controllers as the MD1220.

Riccardo


@ewwhite commented on GitHub (Apr 30, 2019):

@rbicelli The point of the dual cabling is to provide HBA, port, cable and controller resilience.
I suppose you could have a situation where you lose a cable, and that's tantamount to losing the entire node. So technically, things would work. It just means a cluster failover is triggered in more circumstances.

The limitation of the MD1220 controller is disappointing to see.


@Nooby1 commented on GitHub (Nov 1, 2021):

I am having this issue as well. My setup is:

- two HP BL460c G8 blades with HP P721m RAID controllers in HBA mode, firmware 8.32
- two 6G SAS blade switches (latest firmware)
- SAS switches zoned in bay mode
- D2700 plugged in with the top and bottom modules in the left and right SAS switches
- HP D2700 DAS (firmware 150) with 20 600GB 10K SAS disks

Red Hat 8, ZFS 2.05.
Encrypted zpool, imported with `-l` and `-d /dev/multipath/`, with the key on a different disk.

`multipath -ll` shows two paths for each disk.
The STONITH resource has `meta provides=unfencing`, `pcmk_monitor_action="metadata"` and `pcmk_host_list=` populated.

multipathd was initially using TUR as its path checker, which gave reservation errors on the passive node every time it checked whether the path was available. I changed it to use directio, and this stopped the errors on the passive node during path checks.
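
For anyone reproducing this, the path-checker change described above lives in `/etc/multipath.conf`; a minimal sketch (the setting can also go in a per-device block instead of `defaults`):

```shell
# /etc/multipath.conf — sketch: use directio instead of TUR so the
# passive node's periodic path checks don't trip SCSI reservation
# conflicts on fenced devices
#
# defaults {
#     path_checker directio
# }

# Apply the change without restarting multipathd
multipathd reconfigure
```

The TUR (Test Unit Ready) checker sends a SCSI command that a "Write Exclusive, registrants only" reservation can reject, whereas directio reads through the block layer.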

However, on failover I still get reservation conflicts causing a failure to fail over:

```
sd 1:0:62:0: reservation conflict
hpsa 0000:21:00.0 cp xxxxxxxx has status 0x18 sense: 0xff, ASC: 0xff, ASCQ: 0xff, Returning result: 0x18
zio pool=zpool1 dev=/dev/mapper/disk error=52 type=2 offset=78633883483416 size=8192 flags=b08c1
```

`sg_persist` shows the disks have reservations as well, just like the original poster's.
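
For anyone debugging the same symptom, the registered keys and the active reservation can be inspected with sg_persist (the device path here is just an example):

```shell
# List registered keys and the active reservation on a shared disk
sg_persist --in --read-keys /dev/sdh
sg_persist --in --read-reservation /dev/sdh
```

If a stale key from a fenced node is still present, `fence_scsi` normally clears or re-registers it during unfencing; manipulating keys by hand with `sg_persist --out` is possible but should be a last resort on a live cluster.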
