Alexander Eiblinger Posted January 10, 2021 Posted January 10, 2021 Hi, I have major issues with my helios64 / ZFS setup - maybe anyone can give me a good hint. I'm running the following system: Linux helios64 5.9.14-rockchip64 #20.11.4 SMP PREEMPT Tue Dec 15 08:52:20 CET 2020 aarch64 GNU/Linux I have a three WD 4TB Plus disk - on each disk is encrypted with luks. The three encrypted disks are bundled into a raidz1 zpool "archive" using zfs. Basically this setup works pretty good, but with rather high disc usage, e.g. during a scrub, the whole pool degrades due to read / crc errors. As example: zpool status -x pool: archive state: DEGRADED status: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Replace the faulted device, or use 'zpool clear' to mark the device repaired. scan: scrub in progress since Sun Jan 10 09:54:30 2021 613G scanned at 534M/s, 363G issued at 316M/s, 1.46T total 7.56M repaired, 24.35% done, 0 days 01:00:57 to go config: NAME STATE READ WRITE CKSUM archive DEGRADED 0 0 0 raidz1-0 DEGRADED 87 0 0 EncSDC DEGRADED 78 0 0 too many errors (repairing) EncSDD FAULTED 64 0 0 too many errors (repairing) EncSDE DEGRADED 56 0 20 too many errors (repairing) errors: No known data errors dmesg shows a lot of these errors: [Jan10 09:59] ata3.00: NCQ disabled due to excessive errors [ +0.000009] ata3.00: exception Emask 0x2 SAct 0x18004102 SErr 0x400 action 0x6 [ +0.000002] ata3.00: irq_stat 0x08000000 [ +0.000004] ata3: SError: { Proto } [ +0.000006] ata3.00: failed command: READ FPDMA QUEUED [ +0.000007] ata3.00: cmd 60/00:08:e8:92:01/08:00:68:00:00/40 tag 1 ncq dma 1048576 in res 40/00:d8:68:8b:01/00:00:68:00:00/40 Emask 0x2 (HSM violation) [ +0.000002] ata3.00: status: { DRDY } [ +0.000004] ata3.00: failed command: READ FPDMA QUEUED [ +0.000006] ata3.00: cmd 60/80:40:e8:9a:01/03:00:68:00:00/40 tag 8 ncq dma 458752 in res 40/00:d8:68:8b:01/00:00:68:00:00/40 Emask 0x2 (HSM violation) [ +0.000002] ata3.00: status: { DRDY } [ +0.000003] ata3.00: failed command: READ FPDMA QUEUED [ +0.000005] ata3.00: cmd 60/80:70:68:9e:01/00:00:68:00:00/40 tag 14 ncq dma 65536 in res 40/00:d8:68:8b:01/00:00:68:00:00/40 Emask 0x2 (HSM violation) [ +0.000003] ata3.00: status: { DRDY } [ +0.000002] ata3.00: failed command: READ FPDMA QUEUED [ +0.000006] ata3.00: cmd 60/80:d8:68:8b:01/01:00:68:00:00/40 tag 27 ncq dma 196608 in res 40/00:d8:68:8b:01/00:00:68:00:00/40 Emask 0x2 (HSM violation) [ +0.000002] ata3.00: status: { DRDY } [ +0.000002] ata3.00: failed command: READ FPDMA QUEUED [ +0.000005] ata3.00: cmd 60/00:e0:e8:8c:01/06:00:68:00:00/40 tag 28 ncq dma 786432 in res 40/00:d8:68:8b:01/00:00:68:00:00/40 Emask 0x2 (HSM violation) [ +0.000002] ata3.00: status: { DRDY } [ +0.000007] ata3: hard resetting link [ +0.475899] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ +0.001128] ata3.00: configured for UDMA/133 [ +0.000306] scsi_io_completion_action: 8 callbacks suppressed [ +0.000009] sd 2:0:0:0: [sdd] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s [ +0.000006] sd 2:0:0:0: [sdd] tag#1 Sense Key : 0x5 [current] [ +0.000003] sd 2:0:0:0: [sdd] tag#1 ASC=0x21 ASCQ=0x4 [ +0.000006] sd 2:0:0:0: [sdd] tag#1 CDB: opcode=0x88 88 00 00 00 00 00 68 01 92 e8 00 00 08 00 00 00 [ +0.000003] print_req_error: 8 callbacks suppressed [ +0.000003] blk_update_request: I/O error, dev sdd, sector 1744933608 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0 [ +0.000019] zio pool=archive vdev=/dev/mapper/EncSDC error=5 type=1 offset=893389230080 size=1048576 flags=40080cb0 [ +0.000069] sd 2:0:0:0: [sdd] tag#8 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s [ +0.000005] sd 2:0:0:0: [sdd] tag#8 Sense Key : 0x5 [current] [ +0.000003] sd 2:0:0:0: [sdd] tag#8 ASC=0x21 ASCQ=0x4 [ +0.000004] sd 2:0:0:0: [sdd] tag#8 CDB: opcode=0x88 88 00 00 00 00 00 68 01 9a e8 00 00 03 80 00 00 [ +0.000004] blk_update_request: I/O error, dev sdd, sector 1744935656 op 0x0:(READ) flags 0x700 phys_seg 7 prio class 0 [ +0.000014] zio pool=archive vdev=/dev/mapper/EncSDC error=5 type=1 offset=893390278656 size=458752 flags=40080cb0 [ +0.000049] sd 2:0:0:0: [sdd] tag#14 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s [ +0.000004] sd 2:0:0:0: [sdd] tag#14 Sense Key : 0x5 [current] [ +0.000004] sd 2:0:0:0: [sdd] tag#14 ASC=0x21 ASCQ=0x4 [ +0.000003] sd 2:0:0:0: [sdd] tag#14 CDB: opcode=0x88 88 00 00 00 00 00 68 01 9e 68 00 00 00 80 00 00 [ +0.000004] blk_update_request: I/O error, dev sdd, sector 1744936552 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0 [ +0.000013] zio pool=archive vdev=/dev/mapper/EncSDC error=5 type=1 offset=893390737408 size=65536 flags=1808b0 The S.M.A.R.T values of the discs are OK - only "UDMA_CRC_ERROR_COUNT" are increased (values ~25, increasing). What's also worth mentioning: If I start a scrub I get these errors after about 10 seconds (rather quickly) - but it seems that if I let the scrub continue, there are no more errors occuring. Does this indicate bad discs? The discs itself do not indicate any problems (selftest does not show issues). Is this related to the zfs implementations for armbian? Anything I can do to get this reliable? Thank!
ShadowDance Posted January 10, 2021 Posted January 10, 2021 This looks like the same issue as I’ve been having, it’s possibly due to the SATA cables in the harness. I’ve tested with other SATA cables and can’t reproduce the issue with those. So I believe it’s not likely to be bad disks but that is a possibility of course. You can start by trying to limit SATA speed to 3 Gbps and see if that helps, if it does it’s very likely the same issue. I also recommend doing SMART scans and scrub _after_ limiting the SATA speed. See my other topic for more information: Good luck! 1
Alexander Eiblinger Posted January 11, 2021 Author Posted January 11, 2021 Thanks for the hint with the cables, unfortunately they are rather hard to replace Interesting enough, I've today removed the luks layer for all three disks, so my pool is now using the disks directly. As (unexpected) result, scrub now completes without degrading the pool - at least it worked three times today without problems. So it seems for the moment that the lusk encryption was the source of my issue.
ShadowDance Posted January 12, 2021 Posted January 12, 2021 Hmm, I think one reason you’re not seeing it now could also be that disk access patterns changed by removing the encryption. For me the issue is easily reproducible even without encryption. Simply do a read via dd on a disk. For me speed of reproduction varies depending in which disk slot is read from. Here’s an example command: Quote dd if=/dev/sdd of=/dev/null bs=1M status=progress You may want to try this on each disk just to confirm the issue isn’t still lurking around. Good luck!
Recommended Posts