ShadowDance Posted November 13, 2020 (edited)

This is a continuation of "ata1.00: failed command: READ FPDMA QUEUED" noted by @djurny. I've experienced the same issue and have some additional data points to provide. My observations so far:

- I'm using WDC WD60EFRX (68MYMN1 and 68L0BN1) drives
- The drives previously worked without issue behind an ASMedia ASM1062 SATA controller; I've also used some of them behind an ASM1542 (external eSATA enclosure)
- I can reproduce the issue on a clean install of Armbian 20.08.21 Buster and Focal
- I can reproduce it via a simple `dd` to `/dev/null` from the drive, so the filesystem does not seem to be the underlying cause
- Every drive is affected (i.e. each SATA slot)
- The point at which dd produces an error varies from SATA slot to SATA slot (it is not drive dependent); SATA slot 4 can reproducibly trigger the error almost immediately after starting a read
- The problem goes away when setting `extraargs=libata.force=3.0` in `/boot/armbianEnv.txt` [1]

[1] However, even with SATA limited to 3 Gbps, the problem reappeared when hot-plugging a drive. This reset happened on drive slot 3 when I hot-plugged a drive into slot 5. This seems weird to me considering they are supposed to be on different power rails. It may suggest there is a general problem with either the PSU or power delivery to the drives. Here's an excerpt from the reset:

[152957.354311] ata3.00: exception Emask 0x10 SAct 0x80000000 SErr 0x9b0000 action 0xe frozen
[152957.354318] ata3.00: irq_stat 0x00400000, PHY RDY changed
[152957.354322] ata3: SError: { PHYRdyChg PHYInt 10B8B Dispar LinkSeq }
[152957.354328] ata3.00: failed command: READ FPDMA QUEUED
[152957.354335] ata3.00: cmd 60/58:f8:00:f8:e7/01:00:71:02:00/40 tag 31 ncq dma 176128 in
                res 40/00:f8:00:f8:e7/00:00:71:02:00/40 Emask 0x10 (ATA bus error)
[152957.354338] ata3.00: status: { DRDY }
[152957.354345] ata3: hard resetting link

And the full dmesg from when the error happened is below:

Spoiler:
[152957.354311] ata3.00: exception Emask 0x10 SAct 0x80000000 SErr 0x9b0000 action 0xe frozen
[152957.354318] ata3.00: irq_stat 0x00400000, PHY RDY changed
[152957.354322] ata3: SError: { PHYRdyChg PHYInt 10B8B Dispar LinkSeq }
[152957.354328] ata3.00: failed command: READ FPDMA QUEUED
[152957.354335] ata3.00: cmd 60/58:f8:00:f8:e7/01:00:71:02:00/40 tag 31 ncq dma 176128 in
                res 40/00:f8:00:f8:e7/00:00:71:02:00/40 Emask 0x10 (ATA bus error)
[152957.354338] ata3.00: status: { DRDY }
[152957.354345] ata3: hard resetting link
[152962.114138] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[152962.115442] ata4.00: configured for UDMA/133
[152962.334140] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[152962.335386] ata3.00: configured for UDMA/133
[152962.335438] sd 2:0:0:0: [sdc] tag#31 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=5s
[152962.335444] sd 2:0:0:0: [sdc] tag#31 Sense Key : 0x5 [current]
[152962.335448] sd 2:0:0:0: [sdc] tag#31 ASC=0x21 ASCQ=0x4
[152962.335454] sd 2:0:0:0: [sdc] tag#31 CDB: opcode=0x88 88 00 00 00 00 02 71 e7 f8 00 00 00 01 58 00 00
[152962.335458] blk_update_request: I/O error, dev sdc, sector 10500962304 op 0x0:(READ) flags 0x700 phys_seg 5 prio class 0
[152962.335477] zio pool=rpool vdev=/dev/mapper/luks-ata-WDC_WD60EFRX-68MYMN1_WD-************-part4 error=5 type=1 offset=5374864261120 size=176128 flags=40080cb0
[152962.335558] ata3: EH complete
[152967.562044] ata5: softreset failed (device not ready)
[152970.353986] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[152970.354724] ata5.00: ATA-9: WDC WD60EFRX-68L0BN1, 82.00A82, max UDMA/133
[152970.354732] ata5.00: 11721045168 sectors, multi 0: LBA48 NCQ (depth 32), AA
[152970.355605] ata5.00: configured for UDMA/133
[152970.356728] scsi 4:0:0:0: Direct-Access ATA WDC WD60EFRX-68L 0A82 PQ: 0 ANSI: 5
[152970.362585] scsi 4:0:0:0: Attached scsi generic sg4 type 0
[152970.367108] sd 4:0:0:0: [sde] 11721045168 512-byte logical blocks: (6.00 TB/5.46 TiB)
[152970.367123] sd 4:0:0:0: [sde] 4096-byte physical blocks
[152970.367150] sd 4:0:0:0: [sde] Write Protect is off
[152970.367154] sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
[152970.367185] sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[152970.453251] sde: sde1 sde9
[152970.454796] sd 4:0:0:0: [sde] Attached SCSI disk

Edited November 13, 2020 by ShadowDance: Removed drive serial number from dmesg
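For anyone who wants to repeat the dd reproduction described at the top of this post, a minimal sketch of such a read test is below (the device name and block size are illustrative assumptions, not taken from the original post):

sudo dd if=/dev/sdX of=/dev/null bs=1M status=progress   # sequential read of the whole drive, data discarded

Watching dmesg -w in a second terminal while this runs should show the FPDMA/bus errors the moment they occur; per the observations above, the failure point depends on the SATA slot rather than the drive.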
gprovost Posted November 14, 2020

12 hours ago, ShadowDance said:
    [1] However, even with SATA limited to 3 Gbps, the problem reappeared when hot-plugging a drive. This reset happened on drive slot 3 when I hot-plugged a drive into slot 5. This seems weird to me considering they are supposed to be on different power rails. It may suggest there is a general problem with either the PSU or power delivery to the drives. Here's an excerpt from the reset:

That's very useful information. Slot 3 and Slot 5 are on the same power rail. Such an issue shouldn't happen with the capacitors on the HDD harness. Wondering if this could be some kind of root cause. We need to investigate.
ShadowDance Posted November 25, 2020 (Author)

I had a hard drive reset again after a reboot and ~1h15min of uptime, this time without plugging any drives in or out, and with the drives limited to 3 Gbps. A ZFS scrub had been running since boot and I can't spot anything peculiar in my graphs, e.g. CPU utilization was constant at ~70%, RAM ~2.5 GiB used, etc. Here's the dmesg of it happening, but it's pretty much the same as earlier times:

[ 4477.151219] ata1.00: exception Emask 0x2 SAct 0x3000000 SErr 0x400 action 0x6
[ 4477.151226] ata1.00: irq_stat 0x08000000
[ 4477.151229] ata1: SError: { Proto }
[ 4477.151235] ata1.00: failed command: READ FPDMA QUEUED
[ 4477.151243] ata1.00: cmd 60/58:c0:50:81:14/00:00:1d:00:00/40 tag 24 ncq dma 45056 in
               res 40/00:c0:50:81:14/00:00:1d:00:00/40 Emask 0x2 (HSM violation)
[ 4477.151245] ata1.00: status: { DRDY }
[ 4477.151248] ata1.00: failed command: READ FPDMA QUEUED
[ 4477.151254] ata1.00: cmd 60/08:c8:a8:81:14/00:00:1d:00:00/40 tag 25 ncq dma 4096 in
               res 40/00:c0:50:81:14/00:00:1d:00:00/40 Emask 0x2 (HSM violation)
[ 4477.151256] ata1.00: status: { DRDY }
[ 4477.151263] ata1: hard resetting link
[ 4477.635201] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 4477.636449] ata1.00: configured for UDMA/133
[ 4477.636488] sd 0:0:0:0: [sda] tag#24 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[ 4477.636493] sd 0:0:0:0: [sda] tag#24 Sense Key : 0x5 [current]
[ 4477.636497] sd 0:0:0:0: [sda] tag#24 ASC=0x21 ASCQ=0x4
[ 4477.636503] sd 0:0:0:0: [sda] tag#24 CDB: opcode=0x88 88 00 00 00 00 00 1d 14 81 50 00 00 00 58 00 00
[ 4477.636508] blk_update_request: I/O error, dev sda, sector 487883088 op 0x0:(READ) flags 0x700 phys_seg 11 prio class 0
[ 4477.636527] zio pool=rpool vdev=/dev/mapper/luks-ata-WDC_WD60EFRX-68L0BN1_WD-XXXXXXXXXXXX-part4 error=5 type=1 offset=248167702528 size=45056 flags=1808b0
[ 4477.636579] sd 0:0:0:0: [sda] tag#25 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[ 4477.636584] sd 0:0:0:0: [sda] tag#25 Sense Key : 0x5 [current]
[ 4477.636587] sd 0:0:0:0: [sda] tag#25 ASC=0x21 ASCQ=0x4
[ 4477.636591] sd 0:0:0:0: [sda] tag#25 CDB: opcode=0x88 88 00 00 00 00 00 1d 14 81 a8 00 00 00 08 00 00
[ 4477.636595] blk_update_request: I/O error, dev sda, sector 487883176 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0
[ 4477.636605] zio pool=data vdev=/dev/mapper/luks-ata-WDC_WD60EFRX-68L0BN1_WD-XXXXXXXXXXXX-part4 error=5 type=1 offset=248167747584 size=4096 flags=1808b0
[ 4477.636638] ata1: EH complete
TheLinuxBug Posted November 25, 2020

I hate to say this because it won't be what you want to hear, but generally these errors are caused by only a few things:

1. Bad SATA cable
2. Bad hard drive
3. Bad SATA controller
4. Under-powering the drives
5. Some really bad bug in the SATA drivers that causes a reset (least likely, as others are not reporting the same)

A good test would be to see if you can reproduce this with a hard drive you haven't tested yet, using a new SATA cable, just to be sure. Alternatively, you could supply power to the drives from an ATX power supply or bench supply and see if the same thing continues. The board could be overdrawing power under heavy load, causing the drives to reset when they don't get the right amount of power. You could also test this by running fewer drives at once and see if this helps lower your power budget. I have seen this issue myself in a few deployments where a power supply was too weak to meet the 12V rail requirements under load, causing the drives to drop out and the RAID to fail them, so this is actually a real possibility. It could be that your drives are pulling more power than was expected in the spec. Hope this helps some. my 2 cents. Cheers!
ShadowDance Posted November 25, 2020 (Author)

Thanks for the reply @TheLinuxBug. It's not apparent from my top post, so to provide a bit more context: this happens even with no system load, and it happens with both four and five drives plugged in. Reading from only one drive at a time also triggers the issue (while the rest are idle). Considering the four-drive scenario, I find it hard to believe that a single drive could be pulling too much power from the power rail (one rail powering 3 drives and the other 1 drive). Testing with a separate PSU is a good suggestion; in hindsight I should've done that before taking the NAS into "production" use. I also have some spare SATA cables I could try, but right now I won't be able to experiment. I have to see if I can find other drives on which I can reproduce the issue, because as it stands the SATA resets are not kind to my drives: they cause read errors, and even with double parity I don't want to risk it. (A ZFS scrub fixes the read errors by writing to the then-unreadable sector.)
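For context, the scrub mentioned in the parenthetical above is started and monitored like this (a minimal sketch; the pool name rpool is taken from the dmesg output earlier in the thread and may differ on other setups):

sudo zpool scrub rpool       # re-reads all data; rewrites sectors that fail but can be reconstructed from redundancy
sudo zpool status -v rpool   # shows scrub progress and per-device READ/WRITE/CKSUM error counters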
TheLinuxBug Posted November 25, 2020

Interesting. Though, as others are seeing freezes during heavy load, I am still curious based on all these descriptions whether there is some issue with the power supply. Someone will need to actually test that so it can be ruled out as a possibility; even in your situation it would likely be worth a test just to see. If there weren't so many people reporting lock-ups or freezes related to drives under load, I would lean more towards the SATA controller, though one would hope it isn't that, since it isn't something you can easily replace or test. Maybe attach one drive via USB 3.0 to the board and perform the same operations, though with 3.5-inch drives you would obviously need an adapter that will provide 12V power. my 2 cents. Cheers!
aprayoga Posted November 26, 2020 (edited)

@ShadowDance have you tried to unplug and replug the SATA cable from the board? I had a similar issue (link reset and downgrade to 3.0 Gbps) in the past, and it was fixed after I unplugged and replugged the SATA cable. At that time I used an off-the-shelf SATA cable, an ATX power supply and 5x 2.5" SanDisk SATA SSDs.

I tried to reproduce the issue with the hardware I have: 3x WD20EFRX and 2x ST1000DM010-2EP102. The WD drives are connected to SATA ports 3, 4, 5 and configured as mdadm RAID 5. I scrubbed the array, which took ~4 hrs to finish; no ATA issue occurred. Then I prepared the drive for hotplug:

mdadm -S md127
echo 1 > /sys/block/sdc/device/delete

removed the drive, waited for a while, then replugged the drive and re-assembled the array. The drive is still connected at 6.0 Gbps.

The log:

Spoiler:
Nov 26 14:47:38 helios64 kernel: sd 2:0:0:0: [sdc] Synchronizing SCSI cache
Nov 26 14:47:38 helios64 kernel: sd 2:0:0:0: [sdc] Stopping disk
Nov 26 14:47:38 helios64 systemd-udevd[575]: Process '/sbin/mdadm -If sdc1 --path platform-f8000000.pcie-pci-0000:01:00.0-ata-3' failed with exit code 1.
Nov 26 14:47:38 helios64 kernel: ata3.00: disabled
Nov 26 14:47:42 helios64 login[2096]: pam_unix(login:session): session opened for user root by LOGIN(uid=0)
Nov 26 14:47:42 helios64 systemd-logind[1527]: New session 9 of user root.
Nov 26 14:47:42 helios64 systemd[1]: Started Session 9 of user root.
Nov 26 14:47:42 helios64 login[2520]: ROOT LOGIN on '/dev/ttyS2'
Nov 26 14:47:57 helios64 kernel: ata3: SATA link down (SStatus 0 SControl 300)
Nov 26 14:48:01 helios64 CRON[2530]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:48:01 helios64 CRON[2531]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:48:01 helios64 CRON[2530]: pam_unix(cron:session): session closed for user root
Nov 26 14:48:33 helios64 kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 26 14:48:33 helios64 kernel: ata3.00: ATA-9: WDC WD20EFRX-68EUZN0, 82.00A82, max UDMA/133
Nov 26 14:48:33 helios64 kernel: ata3.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 32), AA
Nov 26 14:48:33 helios64 kernel: ata3.00: configured for UDMA/133
Nov 26 14:48:33 helios64 kernel: scsi 2:0:0:0: Direct-Access ATA WDC WD20EFRX-68E 0A82 PQ: 0 ANSI: 5
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: Attached scsi generic sg2 type 0
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: [sdc] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: [sdc] 4096-byte physical blocks
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: [sdc] Write Protect is off
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 26 14:48:33 helios64 kernel: sdc: sdc1
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: [sdc] Attached SCSI disk
Nov 26 14:49:01 helios64 CRON[2559]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:49:01 helios64 CRON[2560]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:49:01 helios64 CRON[2559]: pam_unix(cron:session): session closed for user root
Nov 26 14:50:01 helios64 CRON[2598]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:50:01 helios64 CRON[2599]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:50:01 helios64 CRON[2598]: pam_unix(cron:session): session closed for user root
Nov 26 14:51:01 helios64 CRON[2617]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:51:01 helios64 CRON[2618]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:51:01 helios64 CRON[2617]: pam_unix(cron:session): session closed for user root
Nov 26 14:52:01 helios64 CRON[2634]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:52:01 helios64 CRON[2635]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:52:02 helios64 CRON[2634]: pam_unix(cron:session): session closed for user root
Nov 26 14:53:01 helios64 CRON[2656]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:53:01 helios64 CRON[2657]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:53:01 helios64 CRON[2656]: pam_unix(cron:session): session closed for user root
Nov 26 14:54:01 helios64 CRON[2677]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:54:01 helios64 CRON[2678]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:54:01 helios64 CRON[2677]: pam_unix(cron:session): session closed for user root
Nov 26 14:54:12 helios64 kernel: md: array md127 already has disks!
Nov 26 14:54:12 helios64 kernel: md/raid:md127: device sde1 operational as raid disk 2
Nov 26 14:54:12 helios64 kernel: md/raid:md127: device sdd1 operational as raid disk 1
Nov 26 14:54:12 helios64 kernel: md/raid:md127: device sdc1 operational as raid disk 0
Nov 26 14:54:12 helios64 kernel: md/raid:md127: raid level 5 active with 3 out of 3 devices, algorithm 2
Nov 26 14:54:12 helios64 kernel: md127: detected capacity change from 0 to 4000527155200
Nov 26 14:54:12 helios64 systemd[1]: Started MD array monitor.

I used Armbian 20.11 Buster (LK 5.9.10) for the test.

Edited November 26, 2020 by aprayoga: clarify Armbian version
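For reference, the remove/replug sequence described above can be scripted; this is a sketch under the assumption that the drive re-appears on SCSI host 2 as sdc and the array is md127, matching the log (the rescan path is the standard kernel sysfs interface, not something from the original post):

mdadm -S /dev/md127                             # stop the array
echo 1 > /sys/block/sdc/device/delete           # tell the kernel to drop the drive before unplugging
# ...physically unplug, wait, replug...
echo "- - -" > /sys/class/scsi_host/host2/scan  # rescan the host so the drive is re-detected
mdadm --assemble /dev/md127 /dev/sdc1 /dev/sdd1 /dev/sde1   # re-assemble the RAID 5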
ShadowDance Posted November 26, 2020 (Author, edited)

I was still seeing the issue on Armbian 20.11 Buster (LK 5.9), and it actually seemed worse (even with the 3 Gbps limit). So, considering my suspicion of a power issue (the push from @TheLinuxBug helped too), I took the plunge and got a bit creative with my testing. Here's the TL;DR:

- Stock power/harness has the issue even with only one drive connected
- Stock power/harness has the issue with the UPS disconnected
- Stock harness has the issue when powering only from the UPS battery (no PSU); this triggered the issue without even reading from the disk (idle disk, it happened immediately after boot in initrd)
- Separate PSU for the disk and a different SATA cable (no harness) == no issue

The 3 Gbps limit was removed for these tests, and I picked the ata4 port because I knew it could reliably reproduce the issue soon after boot. I've attached dmesgs from cases 2-4 and annotated some of them. I'm leaning towards a power issue here, considering cases 3 and 4. For reference, these drives are rated 0.60A @ 5V and 0.45A @ 12V.

I would've liked to test the harness power and harness SATA cable separately, but that's unfortunately quite hard, and I didn't have any SATA extension cables (female/male) that would have allowed me to test the harness SATA cable separately from power. In case someone finds it amusing, here's my testing rig (case 4):

Notable here is that I performed the 4th test 3 times. The first time I had the drive powered on before powering on the Helios64, but the link was set to 3 Gbps for some reason. I powered both down, powered on the Helios64 first, waited 2 seconds, then powered on the drive; this way the link came up at 6 Gbps and there were no issues when dd'ing. For good measure I repeated this a second time, and there was still no issue.

@gprovost Is there anything else I could test? Or would the next step be to try a different harness, PSU, etc.?

@aprayoga I have indeed juggled them around, but my issue is with all SATA ports; it's unlikely that all of them are loose. And thanks for attempting to reproduce. I'll still need to check if it happens with other drives too, but that's a task for another day.

2-dmesg-internal-harness-no-ups.txt
3-dmesg-no-psu-only-ups.txt
4-dmesg-external-psu-sata-cable.txt

Edited November 26, 2020 by ShadowDance: Add more details about test case 4
gprovost Posted November 27, 2020

@ShadowDance What is the serial number of your board? We want to check the factory test report on this one, just to be sure the SATA controller was properly flashed.

cat /proc/device-tree/serial-number

Honestly, it could also be a SATA data cable issue (a faulty SATA cable is not uncommon), since your test doesn't narrow it down to either SATA power or SATA data. We can send you a new harness along with one power-only harness that you can combine with a normal SATA cable. Please contact us at support@kokol.io to arrange that. Then you can do more testing and finally figure out what the issue is.
Fred Fettinger Posted January 27, 2021 (edited)

On 11/26/2020 at 9:25 PM, gprovost said:
    cat /proc/device-tree/serial-number

Are there any known serial numbers which were not flashed properly? I'm starting to troubleshoot similar behavior, and mine outputs: 000100001425

Edited January 27, 2021 by Fred Fettinger: Clarifying question
clostro Posted January 27, 2021

Just started seeing these 'ata1.00: failed command: READ FPDMA QUEUED' lines in my log as well. I assume ata1 is sda. I had a whole bunch of them a while after boot and nothing afterwards, but the device was not accessed a whole lot during this time. It had just booted up after a crash, and the LVM cache was cleaning the dirty bits on the cache SSD connected to sda. sdb-e are populated with 4x 4TB ST4000VN008 IronWolfs, and sda is hooked up to an old (and I mean old) SanDisk Extreme 240GB SSD, SDSSDX240GG25. I attached the smartctl report for the SSD below; it passed a short SMART test just now, and I'll start a long test in a minute.

Using Linux helios64 5.9.14-rockchip64 #20.11.4 SMP PREEMPT Tue Dec 15 08:52:20 CET 2020 aarch64 GNU/Linux. Serial number of the board is 000100000530, if that could help.

Spoiler:
[ 4101.251887] ata1.00: exception Emask 0x10 SAct 0xffffffff SErr 0xb80100 action 0x6
[ 4101.251895] ata1.00: irq_stat 0x08000000
[ 4101.251903] ata1: SError: { UnrecovData 10B8B Dispar BadCRC LinkSeq }
[ 4101.251911] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.251924] ata1.00: cmd 60/00:00:00:02:b6/02:00:08:00:00/40 tag 0 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.251929] ata1.00: status: { DRDY }
[ 4101.251934] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.251946] ata1.00: cmd 60/00:08:00:a0:b3/02:00:08:00:00/40 tag 1 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.251950] ata1.00: status: { DRDY }
[ 4101.251956] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.251968] ata1.00: cmd 60/00:10:00:7c:b3/02:00:08:00:00/40 tag 2 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.251972] ata1.00: status: { DRDY }
[ 4101.251977] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.251989] ata1.00: cmd 60/00:18:00:58:b3/02:00:08:00:00/40 tag 3 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.251993] ata1.00: status: { DRDY }
[ 4101.251998] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252010] ata1.00: cmd 60/00:20:00:34:b3/02:00:08:00:00/40 tag 4 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252014] ata1.00: status: { DRDY }
[ 4101.252020] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252032] ata1.00: cmd 60/00:28:00:3e:b1/02:00:08:00:00/40 tag 5 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252036] ata1.00: status: { DRDY }
[ 4101.252041] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252053] ata1.00: cmd 60/00:30:00:1a:b1/02:00:08:00:00/40 tag 6 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252057] ata1.00: status: { DRDY }
[ 4101.252062] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252074] ata1.00: cmd 60/00:38:00:f6:b0/02:00:08:00:00/40 tag 7 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252078] ata1.00: status: { DRDY }
[ 4101.252083] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252095] ata1.00: cmd 60/00:40:00:d2:b0/02:00:08:00:00/40 tag 8 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252099] ata1.00: status: { DRDY }
[ 4101.252105] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252116] ata1.00: cmd 60/00:48:00:ae:b0/02:00:08:00:00/40 tag 9 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252120] ata1.00: status: { DRDY }
[ 4101.252126] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252138] ata1.00: cmd 60/00:50:00:b8:ae/02:00:08:00:00/40 tag 10 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252142] ata1.00: status: { DRDY }
[ 4101.252147] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252159] ata1.00: cmd 60/00:58:00:94:ae/02:00:08:00:00/40 tag 11 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252163] ata1.00: status: { DRDY }
[ 4101.252169] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252181] ata1.00: cmd 60/00:60:00:b8:bd/02:00:08:00:00/40 tag 12 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252185] ata1.00: status: { DRDY }
[ 4101.252190] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252202] ata1.00: cmd 60/00:68:00:94:bd/02:00:08:00:00/40 tag 13 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252206] ata1.00: status: { DRDY }
[ 4101.252211] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252223] ata1.00: cmd 60/00:70:00:70:bd/02:00:08:00:00/40 tag 14 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252228] ata1.00: status: { DRDY }
[ 4101.252232] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252244] ata1.00: cmd 60/00:78:00:4c:bd/02:00:08:00:00/40 tag 15 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252249] ata1.00: status: { DRDY }
[ 4101.252254] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252266] ata1.00: cmd 60/00:80:00:56:bb/02:00:08:00:00/40 tag 16 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252270] ata1.00: status: { DRDY }
[ 4101.252276] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252288] ata1.00: cmd 60/00:88:00:32:bb/02:00:08:00:00/40 tag 17 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252292] ata1.00: status: { DRDY }
[ 4101.252297] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252309] ata1.00: cmd 60/00:90:00:0e:bb/02:00:08:00:00/40 tag 18 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252314] ata1.00: status: { DRDY }
[ 4101.252319] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252330] ata1.00: cmd 60/00:98:00:ea:ba/02:00:08:00:00/40 tag 19 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252335] ata1.00: status: { DRDY }
[ 4101.252340] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252352] ata1.00: cmd 60/00:a0:00:c6:ba/02:00:08:00:00/40 tag 20 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252356] ata1.00: status: { DRDY }
[ 4101.252361] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252373] ata1.00: cmd 60/00:a8:00:d0:b8/02:00:08:00:00/40 tag 21 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252377] ata1.00: status: { DRDY }
[ 4101.252382] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252394] ata1.00: cmd 60/00:b0:00:ac:b8/02:00:08:00:00/40 tag 22 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252399] ata1.00: status: { DRDY }
[ 4101.252404] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252416] ata1.00: cmd 60/00:b8:00:88:b8/02:00:08:00:00/40 tag 23 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252420] ata1.00: status: { DRDY }
[ 4101.252426] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252437] ata1.00: cmd 60/00:c0:00:64:b8/02:00:08:00:00/40 tag 24 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252442] ata1.00: status: { DRDY }
[ 4101.252447] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252459] ata1.00: cmd 60/00:c8:00:40:b8/02:00:08:00:00/40 tag 25 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252463] ata1.00: status: { DRDY }
[ 4101.252468] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252480] ata1.00: cmd 60/00:d0:00:4a:b6/02:00:08:00:00/40 tag 26 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252485] ata1.00: status: { DRDY }
[ 4101.252490] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252502] ata1.00: cmd 60/00:d8:00:26:b6/02:00:08:00:00/40 tag 27 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252506] ata1.00: status: { DRDY }
[ 4101.252511] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252523] ata1.00: cmd 60/00:e0:00:de:b5/02:00:08:00:00/40 tag 28 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252527] ata1.00: status: { DRDY }
[ 4101.252532] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252545] ata1.00: cmd 60/00:e8:00:ba:b5/02:00:08:00:00/40 tag 29 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252549] ata1.00: status: { DRDY }
[ 4101.252554] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252566] ata1.00: cmd 60/00:f0:00:c4:b3/02:00:08:00:00/40 tag 30 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252570] ata1.00: status: { DRDY }
[ 4101.252575] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252587] ata1.00: cmd 60/00:f8:00:70:ae/02:00:08:00:00/40 tag 31 ncq dma 262144 in
               res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252592] ata1.00: status: { DRDY }
[ 4101.252603] ata1: hard resetting link
[ 4101.727761] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 4101.749701] ata1.00: configured for UDMA/133
[ 4101.752101] ata1: EH complete

Spoiler:
smartctl -x /dev/sda
smartctl 6.6 2017-11-05 r4594 [aarch64-linux-5.9.14-rockchip64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     SanDisk SDSSDX240GG25
Serial Number:    123273400127
LU WWN Device Id: 5 001b44 7bf3fcb3f
Firmware Version: R201
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jan 27 13:24:04 2021 +03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Disabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (    0) seconds.
Offline data collection capabilities:    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine recommended polling time:        (   1) minutes.
Extended self-test routine recommended polling time:     (  48) minutes.
Conveyance self-test routine recommended polling time:   (   2) minutes.
SCT capabilities:              (0x0021) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS   VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--  119   119   050    -    0/225624030
  5 Retired_Block_Count     PO--CK  100   100   003    -    7
  9 Power_On_Hours_and_Msec -O--CK  000   000   000    -    30893h+27m+05.090s
 12 Power_Cycle_Count       -O--CK  099   099   000    -    1951
171 Program_Fail_Count      -O--CK  000   000   000    -    0
172 Erase_Fail_Count        -O--CK  000   000   000    -    0
174 Unexpect_Power_Loss_Ct  ----CK  000   000   000    -    999
177 Wear_Range_Delta        ------  000   000   000    -    3
181 Program_Fail_Count      -O--CK  000   000   000    -    0
182 Erase_Fail_Count        -O--CK  000   000   000    -    0
187 Reported_Uncorrect      -O--CK  100   100   000    -    0
194 Temperature_Celsius     -O---K  028   055   000    -    28 (Min/Max 10/55)
195 ECC_Uncorr_Error_Count  --SRC-  120   120   000    -    0/225624030
196 Reallocated_Event_Count PO--CK  100   100   003    -    7
201 Unc_Soft_Read_Err_Rate  --SRC-  120   120   000    -    0/225624030
204 Soft_ECC_Correct_Rate   --SRC-  120   120   000    -    0/225624030
230 Life_Curve_Status       PO--C-  100   100   000    -    100
231 SSD_Life_Left           PO--C-  100   100   010    -    0
233 SandForce_Internal      ------  000   000   000    -    10541
234 SandForce_Internal      -O--CK  000   000   000    -    13196
241 Lifetime_Writes_GiB     -O--CK  000   000   000    -    13196
242 Lifetime_Reads_GiB      -O--CK  000   000   000    -    13491
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x07       GPL     R/O      1  Extended self-test log
0x09       SL      R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL,SL  R/O      1  SATA Phy Event Counters log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xb7       GPL,SL  VS      16  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log (GP Log 0x03) not supported

SMART Error Log not supported

SMART Extended Self-test Log Version: 0 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       0 (0x0000)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    28 Celsius
Power Cycle Min/Max Temperature:     27/31 Celsius
Lifetime    Min/Max Temperature:     10/55 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     10 (Unknown, should be 2)
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        10 minutes
Min/Max recommended Temperature:      0/120 Celsius
Min/Max Temperature Limit:            0/ 0 Celsius
Temperature History Size (Index):    56576 (7)
Invalid Temperature History Size or Index

SCT Error Recovery Control command not supported

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            4  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            4  Device-to-host register FISes sent due to a COMRESET
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC
0x0002  2            0  R_ERR response for data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
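For reference, the short/long self-tests mentioned in this post are run and checked like this (a sketch; /dev/sda matches the SSD described above):

sudo smartctl -t short /dev/sda      # ~1 minute per the report above
sudo smartctl -t long /dev/sda       # ~48 minutes per the report above
sudo smartctl -l selftest /dev/sda   # read the results once a test completes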
Gareth Halfacree Posted January 27, 2021 (edited)

I'm seeing the same on my Helios64, on ata2.00: a bunch on Monday January 4th, then even more on Thursday January 14th. None since. ata2.00 is a Western Digital Red Plus 6TB, purchased new for the Helios64. It's in a mirror with a 6TB Seagate IronWolf; the only other drive in the system is a 240GB M.2 SSD on the mainboard. The drive has completed SMART tests without errors, and the only thing of interest in the SMART attributes is 199 UDMA CRC Error Count at 5. The hard drives are in bays 2 & 4; 1, 3, and 5 are empty. Board serial number is 000100001123, running Ubuntu 20.04.1 LTS 5.9.14-rockchip64.

Edited January 27, 2021 by Gareth Halfacree
ShadowDance Posted January 27, 2021 (Author)

Hey, sorry I haven't updated this thread until now. The Kobol team sent me, as promised, a new harness and a power-only harness so that I could do some testing:

- Cutting the capacitors off my original harness did not make a difference
- The new (normal) harness had the exact same issue as the original one
- With the power-only harness and my own SATA cables, I was unable to reproduce the issue (even at 6 Gbps)
- The final test was to go to town on my original harness and cut the connector in two; this allowed me to use my own SATA cable with the original harness, and there was, again, no issue (at 6 Gbps)

Judging from my initial results, it would seem that there is an issue with the SATA cables in the stock harness. But I should test this over a longer period of time; the problem was that I didn't have SATA cables for all disks. Once I do, I'll try a week-long stress test. I reported my results to the Kobol team but haven't heard back yet.

Even with the 3.0 Gbps limit, I still occasionally run into this issue with the original harness; it has happened twice since I did the experiment.

If someone else is willing to repeat this experiment with a good set of SATA cables, please do contact Kobol to see if they'd be willing to ship out another set of test harnesses, or perhaps they have other plans. Here are some pics of my test setup, including the mutilated connector:
clostro Posted January 27, 2021 (edited)

Thanks for the update. I really hope it's not the cables in my case. I mean, I was not getting these lines in the log before; I just got them for the first time. The only difference from the last few boots is the CPU frequency and speed governor, per the quote below. I don't think they are related; this was originally suggested for troubleshooting a 'sync and drop_caches' issue, which works fine on my device. Later it was also suggested for the 'mystery red light' issue, which was a problem on my device. But this could be something else. Hopefully not the cables; I would rather have the SSD connected to that slot fail than change the harness.

On 1/4/2021 at 6:48 AM, gprovost said:
    Hi, could you try the following tweak and redo the same methods that trigger the crash:
    Run armbian-config, go to -> System -> CPU
    And set:
    Minimum CPU speed = 1200000
    Maximum CPU speed = 1200000
    CPU governor = performance
    This will help us understand if instability is still due to DVFS.

Edited January 27, 2021 by clostro: clarification
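As an aside, on Armbian builds that use cpufrequtils, armbian-config persists those three settings in /etc/default/cpufrequtils; under that assumption (not stated in the quoted post), the equivalent file contents would be:

# /etc/default/cpufrequtils (hypothetical, matching the armbian-config steps above)
ENABLE=true
MIN_SPEED=1200000
MAX_SPEED=1200000
GOVERNOR=performance

followed by a reboot for the settings to take effect.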
Gareth Halfacree Posted February 4, 2021

My Helios64 ran a btrfs scrub on the 1st of the month, and that triggered a whole bunch more READ FPDMA QUEUED errors: 41 over the space of two hours, clustered in two big groups at around 57 minutes and 1h46m into the scrub. Despite the errors, the scrub completed successfully without reporting any read errors.
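For anyone wanting to check what a monthly scrub like this actually recorded, btrfs keeps per-device error counters independent of the scrub summary (a sketch; the mount point is an illustrative assumption):

sudo btrfs scrub status /media/pool   # summary of the last scrub: bytes scrubbed, errors found
sudo btrfs device stats /media/pool   # cumulative read/write/flush/corruption/generation errors per device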
gprovost Posted February 9, 2021

Hey guys, could you disable NCQ and see how it impacts your system when trying to reproduce the issue? Edit /boot/armbianEnv.txt and add the following line:

extraargs=libata.force=noncq
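After rebooting with that setting, it can be verified that NCQ is really off (a sketch; sda stands in for whichever drive is being tested):

cat /sys/block/sda/device/queue_depth   # typically 31 or 32 with NCQ enabled, 1 with NCQ disabled
dmesg | grep -i ncq                     # libata should report something like "NCQ (not used)" per device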
ShadowDance Posted February 9, 2021 (Author)

@gprovost I can only speak for my part, but I've pretty much gone through all the libata options and combinations thereof (NCQ included); no improvement in my situation.
alban Posted February 9, 2021

Hello, I'm getting the same issue:

[ 3613.284995] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0xb80100 action 0x6
[ 3613.285019] ata4.00: irq_stat 0x08000000
[ 3613.285035] ata4: SError: { UnrecovData 10B8B Dispar BadCRC LinkSeq }
[ 3613.285053] ata4.00: failed command: READ DMA EXT
[ 3613.285081] ata4.00: cmd 25/00:40:f0:a9:36/00:05:04:00:00/e0 tag 26 dma 688128 in
               res 50/00:00:f0:a9:36/00:00:04:00:00/e0 Emask 0x10 (ATA bus error)
[ 3613.285092] ata4.00: status: { DRDY }
[ 3613.285113] ata4: hard resetting link
[ 3614.116878] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 3614.117787] ata4.00: configured for UDMA/133
[ 3614.117946] ata4: EH complete

I had been running Armbian 20.11 with the 5.9 kernel from an SD card since the first install, and this problem appeared in the last few days. I installed a new system from the latest Armbian 21.02 image on eMMC to rule out the SD card. The error occurs within hours of boot. It ends up reducing the ATA link speed and finally disabling the first two drives (corresponding to sda and sdb in my case): the LEDs were off when I noticed the problem. I'm using 5 WDC WD30EFRX drives as a RAID 6 array.

I tried disabling NCQ tonight and got the error not long after boot. I forced an array recheck, and as I'm writing this post I'm getting another error:

[ 4003.337231] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0xb00100 action 0x6
[ 4003.337248] ata4.00: irq_stat 0x08000000
[ 4003.337256] ata4: SError: { UnrecovData Dispar BadCRC LinkSeq }
[ 4003.337265] ata4.00: failed command: READ DMA EXT
[ 4003.337278] ata4.00: cmd 25/00:40:80:41:09/00:05:09:00:00/e0 tag 7 dma 688128 in
               res 50/00:00:80:41:09/00:00:09:00:00/e0 Emask 0x10 (ATA bus error)
[ 4003.337283] ata4.00: status: { DRDY }
[ 4003.337298] ata4: hard resetting link
[ 4004.169150] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 4004.170064] ata4.00: configured for UDMA/133
[ 4004.170222] ata4: EH complete

If this is SATA cable related, I don't get why the problem appeared after months of working fine. Any other ideas? In case it helps, here is my serial:

cat /proc/device-tree/serial-number
000100001235

Regards
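As an aside, the "forced array recheck" mentioned above is normally triggered through the md sysfs interface (a sketch; the array name md127 is an assumption, as alban's post doesn't name it):

echo check > /sys/block/md127/md/sync_action   # start a full read/compare pass over the RAID 6
cat /proc/mdstat                               # watch recheck progress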
gprovost Posted February 10, 2021

@ShadowDance @alban I guess it was a shot in the dark to check if NCQ has any sort of impact.
griefman Posted February 10, 2021

I have been following the discussion for a while and would like to report where I am at. I would also like to request that this information be recorded in a systematic way by Kobol, of course only if this isn't already happening. I have the following setup:

  pool: mypool
 state: ONLINE
  scan: resilvered 12K in 00:00:00 with 0 errors on Tue Feb  9 17:22:03 2021
config:
        NAME                                          STATE     READ WRITE CKSUM
        mypool                                        ONLINE       0     0     0
          raidz1-0                                    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0

Some weeks ago I noticed that the health of the zpool was DEGRADED. I checked, and one device had READ errors and was marked as FAULTED. This also resulted in UDMA CRC errors being stored in the SMART stats for this drive. So I cleared the errors and ran a scrub to see what was going on:

sudo zpool scrub mypool

I monitored the scrub with:

sudo watch zpool status

And I saw quite quickly that all drives had started getting READ errors. SMART also reported that all drives now have UDMA CRC errors. It was clear that something bad was going on, so I contacted Kobol and we started to debug the issue together. First I changed the SATA speed to 3 Gbps by adding the following line to /boot/armbianEnv.txt:

extraargs=libata.force=3.0

The results were similar, but I noticed that the errors started to show up a bit later in the scrubbing process. The UDMA CRC error count increased. Then I replaced the power and SATA cables with new ones, which unfortunately did not bring any improvement. Then I disabled NCQ by adding the following to /boot/armbianEnv.txt:

extraargs=libata.force=noncq

and reverted back to SATA 6 Gbps by removing the 3 Gbps line introduced earlier. This had a positive result, and I was able to run the scrub without any errors. Then I went back, installed the old (original) cable harness again and retested: all good.

While disabling NCQ has a positive impact on the errors, it also has a negative impact on speed and, to some extent, on the disk drives' health. I have also tried reducing NCQ depth to 31, which is the recommended value; however, this did not have any impact.

I hope that using this information Kobol will try to reproduce this issue themselves, to see if only certain boards are affected or if it's every board.
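For anyone tracking the same symptom, the UDMA CRC counter mentioned throughout this thread can be read per drive like this (a sketch; the device name is an assumption, and attribute 199 is the standard UDMA CRC error count):

sudo smartctl -A /dev/sda | grep -i crc   # a raw value that rises after scrubs points at the link/cabling, not the platters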
ShadowDance Posted February 10, 2021 (Author, edited)

On 2/10/2021 at 12:10 PM, griefman said:
    Then I replaced the power and SATA cables with new ones, which unfortunately did not bring any improvement.

Was this a replacement harness from Kobol, or 3rd-party SATA cables? I saw no improvement with a replacement harness, but 3rd-party SATA cables seem to have done the trick. Either way, thanks for sharing your insights @griefman; it's nice to see others digging into this issue (or these issues) as well.

Another interesting observation I made along the way is that a reboot helps "fix" the read errors without writing to the "bad sectors" (not actually bad sectors!). I.e. if 6 Gbps triggered an error, limiting to 3 Gbps and rebooting would "fix it". So the errors seem to stem, in my case, from the system resetting the drive, which in turn seems to happen because communication errors occur with the stock SATA cables.

Edited February 11, 2021 by ShadowDance: Clarify quoted bad sectors
alban Posted February 11, 2021 (edited)

UPDATE: Slots 1 & 2 won't start anymore. After many tests:

- Battery connected or not doesn't change anything
- With external power and a brand new SATA 6Gbit cable: working fine
- With a SATA power Y-splitter from the slot 1 or slot 2 line: the disks don't start, even with an external SATA cable
- With a SATA power splitter connected from slot 3 powering slots 1 & 2: drives 1 / 2 / 4 / 5 are working: seems OK
- I will try with 2 SATA Y-splitters (delivery planned for tonight)

It seems there is trouble with the power lines from the stock harness. Conclusion for now: power isn't being properly delivered through the stock harness. I had suspected the SATA data cables, but the problem on my unit seems to come mainly from the SATA power connectors. I will keep you updated.
griefman Posted February 11, 2021 (edited)

On 2/10/2021 at 11:31 AM, ShadowDance said:
    Was this a replacement harness from Kobol, or 3rd-party SATA cables? I saw no improvement with a replacement harness, but 3rd-party SATA cables seem to have done the trick. Either way, thanks for sharing your insights @griefman; it's nice to see others digging into this issue (or these issues) as well. Another interesting observation I made along the way is that a reboot helps "fix" the read errors without writing to the "bad sectors". I.e. if 6 Gbps triggered an error, limiting to 3 Gbps and rebooting would "fix it". So the errors seem to stem, in my case, from the system resetting the drive, which in turn seems to happen because communication errors occur with the stock SATA cables.

I used a power cable that was provided by Kobol and brand new SATA cables from Amazon. Also, I did not have any bad sectors written on the disks, only UDMA CRC errors. Also, my drives were brand new when I connected them.

Edited February 11, 2021 by griefman
ShadowDance Posted February 11, 2021 (Author)

5 minutes ago, griefman said:
    I used a power cable that was provided by Kobol and brand new SATA cables from Amazon. Also, I did not have any bad sectors written on the disks, only UDMA CRC errors. Also, my drives were brand new when I connected them.

Ok, good to know. And just to be clear: no bad sectors here either. However, the resets done by the system put the drives in a weird state (speculating) which behaved like bad sectors (but weren't, hence the quotes); a reboot would clear those up. UDMA CRC errors happen on my drives too (they never happened before putting them in the Helios64).
Gareth Halfacree Posted March 1, 2021

It's the first of the month, so another automated btrfs scrub took place, and again it triggered a bunch of DRDY drive resets on ata2.00. Are we any closer to figuring out what the problem is here?
0utc45t Posted March 1, 2021

I'm suffering from this issue too. A SATA SSD in slot 1 (with the system on it) and a ZFS pool (mirror) in the rest. I started with 2 older disks... a scrub initiated problems. I added one new disk to get redundancy back... and started to get errors on that too. I added a second new IronWolf, and now my pool is resilvering, with 1 old WD disk and 1 new Seagate in a degraded state (too many errors). Waiting for the resilvering to finish on the second new Seagate IronWolf to have my data mirrored... Any real solution yet?
gprovost Posted March 2, 2021

Right now we still have some difficulty distinguishing between ATA errors related to a cable issue (something we are addressing with a new PCB back-plane) and ATA errors potentially due to the SATA controller. For the latter we are asking JMicron for help, but the issue is that we don't have an easy way to reproduce the error. The issue that might be related to the SATA controller seems to show up only with BTRFS and ZFS while doing some scrubbing. Do you guys have a suggestion for a simple reproducible test starting from scratch (no data on the HDDs)? Thanks for your help.
Gareth Halfacree Posted March 2, 2021

4 hours ago, gprovost said:
    The issue that might be related to the SATA controller seems to show up only with BTRFS and ZFS while doing some scrubbing. Do you guys have a suggestion for a simple reproducible test starting from scratch (no data on the HDDs)? Thanks for your help.

I'd suggest getting one or more drives, setting them up in a btrfs mirror, dd'ing a bunch of random data onto them, and scrubbing. It's possible dd'ing alone will trigger the error, but if it doesn't, then scrubbing certainly should.

$ sudo mkfs.btrfs -L testdisks -m raid1 -d raid1 /dev/sdX /dev/sdY /dev/sdZ
$ sudo mkdir /media/testdisks
$ sudo chown USER /media/testdisks
$ sudo mount /dev/sdX /media/testdisks -o user,defaults,noatime,nodiratime,compress=zstd,space_cache
$ cd /media/testdisks
$ dd if=/dev/urandom of=/media/testdisks/001.testfile bs=10M count=1024
$ for i in {002..100}; do cp /media/testdisks/001.testfile /media/testdisks/$i.testfile; done
$ sudo btrfs scrub start -Bd /media/testdisks

The options on the mount are set to mimic my own. Obviously, change the number and names of the devices and the username to match your own. It's also set to create 100 files of 10GB each for a total of 1TB of data; you could try 10 files for 100GB total first to speed things up, but I have around a terabyte of data on my system, so I figured it was worth matching the real-world scenario as closely as possible.
grek Posted March 23, 2021

Today I manually ran a zfs scrub on my ZFS mirror pool and got similar errors. The funny thing is that normally the scrub was initiated by cron without any issue. Any suggestion as to what I should do? I ran:

zpool clear data sda-crypt
zpool clear data sdb-crypt

on my pool, and now I have an ONLINE state again, but I'm a little afraid about my data...

Output before today's manual zfs scrub (a scrub without errors on Sun Mar 14):

root@helios64:/# zpool status -v data
  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 06:17:38 with 0 errors on Sun Mar 14 06:41:40 2021
config:
        NAME           STATE     READ WRITE CKSUM
        data           ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            sda-crypt  ONLINE       0     0     0
            sdb-crypt  ONLINE       0     0     0
        cache
          sdc2         ONLINE       0     0     0

root@helios64:/var/log# cat /etc/cron.d/zfsutils-linux
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
# TRIM the first Sunday of every month.
24 0 1-7 * * root if [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/trim ]; then /usr/lib/zfs-linux/trim; fi
# Scrub the second Sunday of every month.
24 0 8-14 * * root if [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ]; then /usr/lib/zfs-linux/scrub; fi

Quote:
root@helios64:/var/log# zpool status -v data
  pool: data
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 4.87M in 06:32:31 with 0 errors on Tue Mar 23 15:21:50 2021
config:
        NAME           STATE     READ WRITE CKSUM
        data           DEGRADED     0     0     0
          mirror-0     DEGRADED     0     0     0
            sda-crypt  DEGRADED     6     0    32  too many errors
            sdb-crypt  DEGRADED     9     0    29  too many errors
        cache
          sdc2         ONLINE       0     0     0

Quote:
[ 8255.882611] ata2.00: irq_stat 0x08000000
[ 8255.882960] ata2: SError: { Proto }
[ 8255.883273] ata2.00: failed command: READ FPDMA QUEUED
[ 8255.883732] ata2.00: cmd 60/00:10:18:4d:9d/08:00:15:00:00/40 tag 2 ncq dma 1048576 in
[ 8255.883732]          res 40/00:10:18:4d:9d/00:00:15:00:00/40 Emask 0x2 (HSM violation)
[ 8255.885101] ata2.00: status: { DRDY }
[ 8255.885426] ata2.00: failed command: READ FPDMA QUEUED
[ 8255.885885] ata2.00: cmd 60/00:18:90:5b:9d/08:00:15:00:00/40 tag 3 ncq dma 1048576 in
[ 8255.885885]          res 40/00:10:18:4d:9d/00:00:15:00:00/40 Emask 0x2 (HSM violation)
[ 8255.887294] ata2.00: status: { DRDY }
[ 8255.887623] ata2.00: failed command: READ FPDMA QUEUED
[ 8255.888083] ata2.00: cmd 60/10:98:18:55:9d/06:00:15:00:00/40 tag 19 ncq dma 794624 in
[ 8255.888083]          res 40/00:10:18:4d:9d/00:00:15:00:00/40 Emask 0x2 (HSM violation)
[ 8255.889450] ata2.00: status: { DRDY }
[ 8255.889783] ata2: hard resetting link
[ 8256.365961] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 8256.367771] ata2.00: configured for UDMA/133
[ 8256.368540] sd 1:0:0:0: [sdb] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[ 8256.369343] sd 1:0:0:0: [sdb] tag#2 Sense Key : 0x5 [current]
[ 8256.369858] sd 1:0:0:0: [sdb] tag#2 ASC=0x21 ASCQ=0x4
[ 8256.370354] sd 1:0:0:0: [sdb] tag#2 CDB: opcode=0x88 88 00 00 00 00 00 15 9d 4d 18 00 00 08 00 00 00
[ 8256.371158] blk_update_request: I/O error, dev sdb, sector 362630424 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0
[ 8256.372125] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=185649999872 size=1048576 flags=40080cb0
[ 8256.373117] sd 1:0:0:0: [sdb] tag#3 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[ 8256.373964] sd 1:0:0:0: [sdb] tag#3 Sense Key : 0x5 [current]
[ 8256.374480] sd 1:0:0:0: [sdb] tag#3 ASC=0x21 ASCQ=0x4
[ 8256.374940] sd 1:0:0:0: [sdb] tag#3 CDB: opcode=0x88 88 00 00 00 00 00 15 9d 5b 90 00 00 08 00 00 00
[ 8256.375742] blk_update_request: I/O error, dev sdb, sector 362634128 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0
[ 8256.376704] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=185651896320 size=1048576 flags=40080cb0
[ 8256.377724] sd 1:0:0:0: [sdb] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[ 8256.378568] sd 1:0:0:0: [sdb] tag#19 Sense Key : 0x5 [current]
[ 8256.379096] sd 1:0:0:0: [sdb] tag#19 ASC=0x21 ASCQ=0x4
[ 8256.379561] sd 1:0:0:0: [sdb] tag#19 CDB: opcode=0x88 88 00 00 00 00 00 15 9d 55 18 00 00 06 10 00 00
[ 8256.380371] blk_update_request: I/O error, dev sdb, sector 362632472 op 0x0:(READ) flags 0x700 phys_seg 13 prio class 0
[ 8256.381336] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=185651048448 size=794624 flags=40080cb0
[ 8256.382389] ata2: EH complete
[15116.869369] ata2.00: exception Emask 0x2 SAct 0x30000200 SErr 0x400 action 0x6
[15116.870017] ata2.00: irq_stat 0x08000000
[15116.870366] ata2: SError: { Proto }
[15116.870680] ata2.00: failed command: READ FPDMA QUEUED
[15116.871140] ata2.00: cmd 60/00:48:98:41:1b/07:00:55:00:00/40 tag 9 ncq dma 917504 in
[15116.871140]          res 40/00:e0:98:40:1b/00:00:55:00:00/40 Emask 0x2 (HSM violation)
[15116.872501] ata2.00: status: { DRDY }
[15116.872826] ata2.00: failed command: READ FPDMA QUEUED
[15116.873284] ata2.00: cmd 60/00:e0:98:40:1b/01:00:55:00:00/40 tag 28 ncq dma 131072 in
[15116.873284]          res 40/00:e0:98:40:1b/00:00:55:00:00/40 Emask 0x2 (HSM violation)
[15116.874696] ata2.00: status: { DRDY }
[15116.875023] ata2.00: failed command: READ FPDMA QUEUED
[15116.875482] ata2.00: cmd 60/00:e8:98:48:1b/01:00:55:00:00/40 tag 29 ncq dma 131072 in
[15116.875482]          res 40/00:e0:98:40:1b/00:00:55:00:00/40 Emask 0x2 (HSM violation)
[15116.876850] ata2.00: status: { DRDY }
[15116.877182] ata2: hard resetting link
[15117.353348] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[15117.355108] ata2.00: configured for UDMA/133
[15117.355678] sd 1:0:0:0: [sdb] tag#9 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[15117.356481] sd 1:0:0:0: [sdb] tag#9 Sense Key : 0x5 [current]
[15117.356996] sd 1:0:0:0: [sdb] tag#9 ASC=0x21 ASCQ=0x4
[15117.357492] sd 1:0:0:0: [sdb] tag#9 CDB: opcode=0x88 88 00 00 00 00 00 55 1b 41 98 00 00 07 00 00 00
[15117.358296] blk_update_request: I/O error, dev sdb, sector 1427849624 op 0x0:(READ) flags 0x700 phys_seg 18 prio class 0
[15117.359273] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=731042230272 size=917504 flags=40080cb0
[15117.360297] sd 1:0:0:0: [sdb] tag#28 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[15117.361152] sd 1:0:0:0: [sdb] tag#28 Sense Key : 0x5 [current]
[15117.361716] sd 1:0:0:0: [sdb] tag#28 ASC=0x21 ASCQ=0x4
[15117.362181] sd 1:0:0:0: [sdb] tag#28 CDB: opcode=0x88 88 00 00 00 00 00 55 1b 40 98 00 00 01 00 00 00
[15117.362990] blk_update_request: I/O error, dev sdb, sector 1427849368 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[15117.363944] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=731042099200 size=131072 flags=1808b0
[15117.364883] sd 1:0:0:0: [sdb] tag#29 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[15117.365719] sd 1:0:0:0: [sdb] tag#29 Sense Key : 0x5 [current]
[15117.366243] sd 1:0:0:0: [sdb] tag#29 ASC=0x21 ASCQ=0x4
[15117.366706] sd 1:0:0:0: [sdb] tag#29 CDB: opcode=0x88 88 00 00 00 00 00 55 1b 48 98 00 00 01 00 00 00
[15117.367514] blk_update_request: I/O error, dev sdb, sector 1427851416 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[15117.368466] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=731043147776 size=131072 flags=1808b0
[15117.369450] ata2: EH complete
[18959.250478] FS-Cache: Loaded
[18959.292215] FS-Cache: Netfs 'nfs' registered for caching
[18959.613399] NFS: Registering the id_resolver key type
[18959.613885] Key type id_resolver registered
[18959.614253] Key type id_legacy registered
[19150.007895] ata1.00: exception Emask 0x2 SAct 0xe0000 SErr 0x400 action 0x6
[19150.008525] ata1.00: irq_stat 0x08000000
[19150.008874] ata1: SError: { Proto }
[19150.009188] ata1.00: failed command: READ FPDMA QUEUED
[19150.009648] ata1.00: cmd 60/78:88:30:40:05/00:00:78:00:00/40 tag 17 ncq dma 61440 in
[19150.009648]          res 40/00:90:60:41:05/00:00:78:00:00/40 Emask 0x2 (HSM violation)
[19150.011011] ata1.00: status: { DRDY }
[19150.011341] ata1.00: failed command: READ FPDMA QUEUED
[19150.011901] ata1.00: cmd 60/98:90:60:41:05/00:00:78:00:00/40 tag 18 ncq dma 77824 in
[19150.011901]          res 40/00:90:60:41:05/00:00:78:00:00/40 Emask 0x2 (HSM violation)
[19150.013357] ata1.00: status: { DRDY }
[19150.013689] ata1.00: failed command: READ FPDMA QUEUED
[19150.014148] ata1.00: cmd 60/60:98:88:42:05/00:00:78:00:00/40 tag 19 ncq dma 49152 in
[19150.014148]          res 40/00:90:60:41:05/00:00:78:00:00/40 Emask 0x2 (HSM violation)
[19150.015507] ata1.00: status: { DRDY }
[19150.015892] ata1: hard resetting link
[19150.491836] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[19150.493606] ata1.00: configured for UDMA/133
[19150.494071] sd 0:0:0:0: [sda] tag#17 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[19150.494880] sd 0:0:0:0: [sda] tag#17 Sense Key : 0x5 [current]
[19150.495401] sd 0:0:0:0: [sda] tag#17 ASC=0x21 ASCQ=0x4
[19150.495888] sd 0:0:0:0: [sda] tag#17 CDB: opcode=0x88 88 00 00 00 00 00 78 05 40 30 00 00 00 78 00 00
[19150.496698] blk_update_request: I/O error, dev sda, sector 2013610032 op 0x0:(READ) flags 0x700 phys_seg 14 prio class 0
[19150.497673] zio pool=data vdev=/dev/mapper/sda-crypt error=5 type=1 offset=1030951559168 size=61440 flags=40080cb0
[19150.498675] sd 0:0:0:0: [sda] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[19150.499485] sd 0:0:0:0: [sda] tag#18 Sense Key : 0x5 [current]
[19150.500044] sd 0:0:0:0: [sda] tag#18 ASC=0x21 ASCQ=0x4
[19150.500511] sd 0:0:0:0: [sda] tag#18 CDB: opcode=0x88 88 00 00 00 00 00 78 05 41 60 00 00 00 98 00 00
[19150.501321] blk_update_request: I/O error, dev sda, sector 2013610336 op 0x0:(READ) flags 0x700 phys_seg 17 prio class 0
[19150.502290] zio pool=data vdev=/dev/mapper/sda-crypt error=5 type=1 offset=1030951714816 size=77824 flags=40080cb0
[19150.503304] sd 0:0:0:0: [sda] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[19150.504141] sd 0:0:0:0: [sda] tag#19 Sense Key : 0x5 [current]
[19150.504664] sd 0:0:0:0: [sda] tag#19 ASC=0x21 ASCQ=0x4
[19150.505128] sd 0:0:0:0: [sda] tag#19 CDB: opcode=0x88 88 00 00 00 00 00 78 05 42 88 00 00 00 60 00 00
[19150.505936] blk_update_request: I/O error, dev sda, sector 2013610632 op 0x0:(READ) flags 0x700 phys_seg 11 prio class 0
[19150.506896] zio pool=data vdev=/dev/mapper/sda-crypt error=5 type=1 offset=1030951866368 size=49152 flags=40080cb0
[19150.507875] ata1: EH complete
[21308.943111] ata1.00: exception Emask 0x2 SAct 0x60010 SErr 0x400 action 0x6
[21308.943736] ata1.00: irq_stat 0x08000000
[21308.944086] ata1: SError: { Proto }
[21308.944399] ata1.00: failed command: READ FPDMA QUEUED
[21308.944859] ata1.00: cmd 60/00:20:70:11:6d/01:00:21:00:00/40 tag 4 ncq dma 131072 in
[21308.944859]          res 40/00:88:70:0e:6d/00:00:21:00:00/40 Emask 0x2 (HSM violation)
[21308.946219] ata1.00: status: { DRDY }
[21308.946544] ata1.00: failed command: READ FPDMA QUEUED
[21308.947003] ata1.00: cmd 60/00:88:70:0e:6d/02:00:21:00:00/40 tag 17 ncq dma 262144 in
[21308.947003]          res 40/00:88:70:0e:6d/00:00:21:00:00/40 Emask 0x2 (HSM violation)
[21308.948419] ata1.00: status: { DRDY }
[21308.948747] ata1.00: failed command: READ FPDMA QUEUED
[21308.949207] ata1.00: cmd 60/00:90:70:12:6d/01:00:21:00:00/40 tag 18 ncq dma 131072 in
[21308.949207]          res 40/00:88:70:0e:6d/00:00:21:00:00/40 Emask 0x2 (HSM violation)
[21308.950575] ata1.00: status: { DRDY }
[21308.950909] ata1: hard resetting link
[21309.427090] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[21309.428854] ata1.00: configured for UDMA/133
[21309.429342] sd 0:0:0:0: [sda] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[21309.430145] sd 0:0:0:0: [sda] tag#4 Sense Key : 0x5 [current]
[21309.430659] sd 0:0:0:0: [sda] tag#4 ASC=0x21 ASCQ=0x4
[21309.431157] sd 0:0:0:0: [sda] tag#4 CDB: opcode=0x88 88 00 00 00 00 00 21 6d 11 70 00 00 01 00 00 00
[21309.431964] blk_update_request: I/O error, dev sda, sector 560796016 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[21309.432934] zio pool=data vdev=/dev/mapper/sda-crypt error=5 type=1 offset=287110782976 size=131072 flags=1808b0
[21309.433947] sd 0:0:0:0: [sda] tag#17 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[21309.434762] sd 0:0:0:0: [sda] tag#17 Sense Key : 0x5 [current]
[21309.435329] sd 0:0:0:0: [sda] tag#17 ASC=0x21 ASCQ=0x4
[21309.435799] sd 0:0:0:0: [sda] tag#17 CDB: opcode=0x88 88 00 00 00 00 00 21 6d 0e 70 00 00 02 00 00 00
[21309.436609] blk_update_request: I/O error, dev sda, sector 560795248 op 0x0:(READ) flags 0x700 phys_seg 4 prio class 0
[21309.437561] zio pool=data vdev=/dev/mapper/sda-crypt error=5 type=1 offset=287110389760 size=262144 flags=40080cb0
[21309.438558] sd 0:0:0:0: [sda] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[21309.439397] sd 0:0:0:0: [sda] tag#18 Sense Key : 0x5 [current]
[21309.439923] sd 0:0:0:0: [sda] tag#18 ASC=0x21 ASCQ=0x4
[21309.440388] sd 0:0:0:0: [sda] tag#18 CDB: opcode=0x88 88 00 00 00 00 00 21 6d 12 70 00 00 01 00 00 00
[21309.441198] blk_update_request: I/O error, dev sda, sector 560796272 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[21309.442148] zio pool=data vdev=/dev/mapper/sda-crypt error=5 type=1 offset=287110914048 size=131072 flags=1808b0
[21309.443166] ata1: EH complete
[23343.877842] ata2.00: exception Emask 0x2 SAct 0x280002 SErr 0x400 action 0x6
[23343.878472] ata2.00: irq_stat 0x08000000
[23343.878821] ata2: SError: { Proto }
[23343.879134] ata2.00: failed command: READ FPDMA QUEUED
[23343.879594] ata2.00: cmd 60/00:08:10:c4:2e/01:00:6e:00:00/40 tag 1 ncq dma 131072 in
[23343.879594]          res 40/00:08:10:c4:2e/00:00:6e:00:00/40 Emask 0x2 (HSM violation)
[23343.880954] ata2.00: status: { DRDY }
[23343.881279] ata2.00: failed command: READ FPDMA QUEUED
[23343.881738] ata2.00: cmd 60/00:98:10:c6:2e/01:00:6e:00:00/40 tag 19 ncq dma 131072 in
[23343.881738] res 40/00:08:10:c4:2e/00:00:6e:00:00/40 Emask 0x2 (HSM violation) [23343.883162] ata2.00: status: { DRDY } [23343.883492] ata2.00: failed command: READ FPDMA QUEUED [23343.883953] ata2.00: cmd 60/00:a8:10:c5:2e/01:00:6e:00:00/40 tag 21 ncq dma 131072 in [23343.883953] res 40/00:08:10:c4:2e/00:00:6e:00:00/40 Emask 0x2 (HSM violation) [23343.885321] ata2.00: status: { DRDY } [23343.885654] ata2: hard resetting link [23344.361826] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [23344.363595] ata2.00: configured for UDMA/133 [23344.364067] sd 1:0:0:0: [sdb] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s [23344.364869] sd 1:0:0:0: [sdb] tag#1 Sense Key : 0x5 [current] [23344.365384] sd 1:0:0:0: [sdb] tag#1 ASC=0x21 ASCQ=0x4 [23344.365882] sd 1:0:0:0: [sdb] tag#1 CDB: opcode=0x88 88 00 00 00 00 00 6e 2e c4 10 00 00 01 00 00 00 [23344.366687] blk_update_request: I/O error, dev sdb, sector 1848558608 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0 [23344.367660] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=946445230080 size=131072 flags=1808b0 [23344.368660] sd 1:0:0:0: [sdb] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s [23344.369472] sd 1:0:0:0: [sdb] tag#19 Sense Key : 0x5 [current] [23344.370040] sd 1:0:0:0: [sdb] tag#19 ASC=0x21 ASCQ=0x4 [23344.370510] sd 1:0:0:0: [sdb] tag#19 CDB: opcode=0x88 88 00 00 00 00 00 6e 2e c6 10 00 00 01 00 00 00 [23344.371320] blk_update_request: I/O error, dev sdb, sector 1848559120 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0 [23344.372281] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=946445492224 size=131072 flags=1808b0 [23344.373266] sd 1:0:0:0: [sdb] tag#21 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s [23344.374113] sd 1:0:0:0: [sdb] tag#21 Sense Key : 0x5 [current] [23344.374639] sd 1:0:0:0: [sdb] tag#21 ASC=0x21 ASCQ=0x4 [23344.375108] sd 1:0:0:0: [sdb] tag#21 CDB: opcode=0x88 88 00 00 00 00 00 6e 2e c5 10 00 00 01 00 00 00 [23344.375920] blk_update_request: I/O error, dev sdb, sector 1848558864 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0 [23344.376877] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=946445361152 size=131072 flags=1808b0 [23344.377899] ata2: EH complete 0 Quote
gprovost Posted March 24, 2021 @grek Can you change the I/O scheduler to `none`, as recommended here, and see if that makes any difference?