[GH-ISSUE #33] Issues with unfencing #33

Open
opened 2026-05-05 03:31:53 -06:00 by gitea-mirror · 5 comments

Originally created by @ACiDGRiM on GitHub (Jun 18, 2020).
Original GitHub issue: https://github.com/ewwhite/zfs-ha/issues/33

I'm having trouble getting the fencing component working, and frequently neither system can mount the array.

I frequently end up with either no reservation key or both hosts' reservation keys registered on the storage array. I've followed the wiki nearly exactly, other than using a dual-port HBA instead of two single-port HBAs.

I'm using a Dell MD 1200 in my case.


@ewwhite commented on GitHub (Jun 18, 2020):

Can you post log snippets?


@ACiDGRiM commented on GitHub (Jun 20, 2020):

I've isolated it to an issue with SCSI reservations and write access to the array. If I import the zpool with readonly=on, I can mount it and there are no disk failures; otherwise I get the kernel errors below:

```
[ 1436.487415] WARNING: MMP writes to pool 'zfs_storage-array01' have not succeeded in over 167981 ms; suspending pool. Hrtime 1436487432373
[ 1436.487418] WARNING: Pool 'zfs_storage-array01' has encountered an uncorrectable I/O failure and has been suspended.

[ 1475.232670] INFO: task l2arc_feed:3466 blocked for more than 120 seconds.
[ 1475.232720]       Tainted: P          IOE    --------- -  - 4.18.0-193.6.3.el8_2.centos.plus.x86_64 #1
[ 1475.232721] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1475.232723] l2arc_feed      D    0  3466      2 0x80004000
[ 1475.232726] Call Trace:
[ 1475.232740]  ? __schedule+0x24f/0x650
[ 1475.232745]  schedule+0x2f/0xa0
[ 1475.232748]  schedule_preempt_disabled+0xa/0x10
[ 1475.232750]  __mutex_lock.isra.5+0x2d0/0x4a0
[ 1475.232768]  ? __cv_timedwait_common+0xec/0x160 [spl]
[ 1475.232885]  l2arc_feed_thread+0xdb/0x420 [zfs]
[ 1475.232957]  ? l2arc_evict+0x2a0/0x2a0 [zfs]
[ 1475.232965]  ? __thread_exit+0x20/0x20 [spl]
[ 1475.232975]  thread_generic_wrapper+0x6f/0x80 [spl]
[ 1475.232980]  kthread+0x112/0x130
[ 1475.232983]  ? kthread_flush_work_fn+0x10/0x10
[ 1475.232985]  ret_from_fork+0x35/0x40
```

Manually testing fence_scsi, with only one path connected to the backplane:

```
fence_scsi -d /dev/mapper/35000c500b6f6b607,/dev/mapper/35000c500b6f71b8b,/dev/mapper/35000c500b6f71c27,/dev/mapper/35000c500b6f71c37,/dev/mapper/35000c500b6f7df7b,/dev/mapper/35000c500b6f80333,/dev/mapper/35000c500b6f8070f,/dev/mapper/35000c500b6f8072b,/dev/mapper/35000cca0131663d4,/dev/mapper/35000cca01317b5d8,/dev/mapper/35000cca01317bb94,/dev/mapper/35000cca01317beb8 -o on -k 3bcc0000 -v

/usr/bin/sg_persist -n -i -k -d /dev/mapper/35000c500b6f6b607
  PR generation=0x5, 1 registered reservation key follows:
    0x3bcc0000
```

I've power-cycled the backplane and the reservation key is cleared. Without writing to the disks, do you know of a way to test whether this issue is caused by write reservations on the backplane or by a ZFS filesystem issue?
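For what it's worth, sg_persist's PERSISTENT RESERVE IN service actions are read-only, so the reservation state can be inspected without writing to the disks. A minimal sketch, reusing one of the device paths from the commands above:

```shell
# Read-only inspection of SCSI-3 persistent reservations.
# All three commands issue PERSISTENT RESERVE IN service actions,
# which do not modify the media or the reservation state.
DEV=/dev/mapper/35000c500b6f6b607

# List all registered reservation keys on the device
sg_persist -n -i -k -d "$DEV"

# Show the active reservation (holder key and reservation type), if any
sg_persist -n -i -r -d "$DEV"

# Full status: every registered initiator and whether it holds the reservation
sg_persist -n -i -s -d "$DEV"
```

Comparing the output from both hosts should show whether one host's key is missing or whether both keys ended up registered, without touching the data.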


@ewwhite commented on GitHub (Jun 20, 2020):

What hardware are you using here? What's connected to what?


@ewwhite commented on GitHub (Jun 20, 2020):

I see MMP enabled. Disable that zpool option. It's likely causing the issue here.
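For reference, MMP is controlled by the `multihost` pool property; a minimal sketch of checking and disabling it (pool name taken from the kernel log above; the pool must be imported read-write for the change to take effect):

```shell
# Check whether MMP (multihost) is enabled on the pool
zpool get multihost zfs_storage-array01

# Disable MMP; requires the pool to be imported read-write
zpool set multihost=off zfs_storage-array01
```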


@ACiDGRiM commented on GitHub (Jun 20, 2020):

I've had multihost enabled on the zpool since it was created and haven't had this issue before. However, I can't disable MMP because the ZFS module locks up, even when updating parameters in /sys; I can only access the data if I import with readonly=on.

Right now there is just one host connected to the enclosure with a single path and I have this issue on both hosts, even if one is shutoff.

The full system:

```
2x Dell R610
    9211-8e HBA in IT mode
    Intel x520-2x 10G uplink
1x Dell MD1200
    8x Seagate ST10000NM0096 (storage)
    4x Hitachi HUSRL402 CLAR200 (ZIL and L2ARC)
CentOS 8.2 plus kernel (for internal RAID compatibility)
    4.18.0-193.6.3.el8_2.centos.plus.x86_64
    ZFS 0.8.4
```

I verified all devices have the latest firmware/BIOS.

Host 1 is connected to port 1 on each enclosure controller; host 2 is connected to port 2 on each enclosure controller. I've tried connecting in an X (host 1 connects to port 1 on controller 1 and port 2 on controller 2), but both hosts hang on boot when trying to initialize the disks.
