ShadowDance Posted November 13, 2020

This is a continuation of "ata1.00: failed command: READ FPDMA QUEUED" noted by @djurny. I've experienced the same issue and have some additional data points to provide. My observations so far:

- I'm using WDC WD60EFRX (68MYMN1 and 68L0BN1) drives.
- The drives previously worked without issue behind an ASMedia ASM1062 SATA controller; I've also used some of them behind an ASM1542 (external eSATA enclosure).
- I can reproduce the issue on a clean install of Armbian 20.08.21, both Buster and Focal.
- I can reproduce it via a simple `dd` from the drive to `/dev/null`, so the filesystem does not seem to be the underlying cause (see the sketch at the end of this post).
- Every drive is affected (i.e. each SATA slot).
- The point at which `dd` produces an error varies from SATA slot to SATA slot (it is not drive dependent); SATA slot 4 reproducibly triggers the error almost immediately after starting a read.
- The problem goes away when setting `extraargs=libata.force=3.0` in `/boot/armbianEnv.txt` [1]

[1] However, even with SATA limited to 3 Gbps, the problem did reappear when hot-plugging a drive. This reset happened on drive slot 3 when I hot-plugged a drive into slot 5. This seems weird to me considering they are supposed to be on different power rails, and may suggest there is in general a problem with either the PSU or power delivery to the drives.

Here's an excerpt from the reset:

[152957.354311] ata3.00: exception Emask 0x10 SAct 0x80000000 SErr 0x9b0000 action 0xe frozen
[152957.354318] ata3.00: irq_stat 0x00400000, PHY RDY changed
[152957.354322] ata3: SError: { PHYRdyChg PHYInt 10B8B Dispar LinkSeq }
[152957.354328] ata3.00: failed command: READ FPDMA QUEUED
[152957.354335] ata3.00: cmd 60/58:f8:00:f8:e7/01:00:71:02:00/40 tag 31 ncq dma 176128 in
                         res 40/00:f8:00:f8:e7/00:00:71:02:00/40 Emask 0x10 (ATA bus error)
[152957.354338] ata3.00: status: { DRDY }
[152957.354345] ata3: hard resetting link

And the full dmesg from when the error happened is below:

[152957.354311] ata3.00: exception Emask 0x10 SAct 0x80000000 SErr 0x9b0000 action 0xe frozen
[152957.354318] ata3.00: irq_stat 0x00400000, PHY RDY changed
[152957.354322] ata3: SError: { PHYRdyChg PHYInt 10B8B Dispar LinkSeq }
[152957.354328] ata3.00: failed command: READ FPDMA QUEUED
[152957.354335] ata3.00: cmd 60/58:f8:00:f8:e7/01:00:71:02:00/40 tag 31 ncq dma 176128 in
                         res 40/00:f8:00:f8:e7/00:00:71:02:00/40 Emask 0x10 (ATA bus error)
[152957.354338] ata3.00: status: { DRDY }
[152957.354345] ata3: hard resetting link
[152962.114138] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[152962.115442] ata4.00: configured for UDMA/133
[152962.334140] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[152962.335386] ata3.00: configured for UDMA/133
[152962.335438] sd 2:0:0:0: [sdc] tag#31 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=5s
[152962.335444] sd 2:0:0:0: [sdc] tag#31 Sense Key : 0x5 [current]
[152962.335448] sd 2:0:0:0: [sdc] tag#31 ASC=0x21 ASCQ=0x4
[152962.335454] sd 2:0:0:0: [sdc] tag#31 CDB: opcode=0x88 88 00 00 00 00 02 71 e7 f8 00 00 00 01 58 00 00
[152962.335458] blk_update_request: I/O error, dev sdc, sector 10500962304 op 0x0:(READ) flags 0x700 phys_seg 5 prio class 0
[152962.335477] zio pool=rpool vdev=/dev/mapper/luks-ata-WDC_WD60EFRX-68MYMN1_WD-************-part4 error=5 type=1 offset=5374864261120 size=176128 flags=40080cb0
[152962.335558] ata3: EH complete
[152967.562044] ata5: softreset failed (device not ready)
[152970.353986] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[152970.354724] ata5.00: ATA-9: WDC WD60EFRX-68L0BN1, 82.00A82, max UDMA/133
[152970.354732] ata5.00: 11721045168 sectors, multi 0: LBA48 NCQ (depth 32), AA
[152970.355605] ata5.00: configured for UDMA/133
[152970.356728] scsi 4:0:0:0: Direct-Access     ATA      WDC WD60EFRX-68L 0A82 PQ: 0 ANSI: 5
[152970.362585] scsi 4:0:0:0: Attached scsi generic sg4 type 0
[152970.367108] sd 4:0:0:0: [sde] 11721045168 512-byte logical blocks: (6.00 TB/5.46 TiB)
[152970.367123] sd 4:0:0:0: [sde] 4096-byte physical blocks
[152970.367150] sd 4:0:0:0: [sde] Write Protect is off
[152970.367154] sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
[152970.367185] sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[152970.453251] sde: sde1 sde9
[152970.454796] sd 4:0:0:0: [sde] Attached SCSI disk
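For anyone who wants to repeat the read test described above, a minimal sketch of what this boils down to (device name and read size are examples, adjust to your setup):

# Sequential read from the raw device, bypassing any filesystem.
# Watch the kernel log in a second terminal with: dmesg -w
sudo dd if=/dev/sdc of=/dev/null bs=1M count=10240 status=progress

# Workaround: limit the SATA link to 3.0 Gbps, then reboot.
# If an extraargs line already exists in armbianEnv.txt, extend it
# instead of adding a second one.
echo 'extraargs=libata.force=3.0' | sudo tee -a /boot/armbianEnv.txt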
gprovost Posted November 14, 2020

12 hours ago, ShadowDance said:
"[1] However, even with SATA limited to 3 Gbps, the problem did reappear when hot-plugging a drive. This reset happened on drive slot 3 when I hot-plugged a drive into slot 5. This seems weird to me considering they are supposed to be on different power rails, and may suggest there is in general a problem with either the PSU or power delivery to the drives."

That's very useful information. Slot 3 and slot 5 are actually on the same power rail. Such an issue shouldn't happen with the capacitors on the HDD harness, though. Wondering if this could be some kind of root cause; we need to investigate.
ShadowDance Posted November 25, 2020

I had a hard drive reset again after a reboot and ~1h15min of uptime, this time without plugging in or unplugging any drives, and with the drives limited to 3 Gbps. A ZFS scrub had been running since boot, and I can't spot anything peculiar in my graphs, e.g. CPU utilization was a constant ~70%, RAM ~2.5 GiB used, etc. Here's the dmesg of it happening, pretty much the same as earlier times:

[ 4477.151219] ata1.00: exception Emask 0x2 SAct 0x3000000 SErr 0x400 action 0x6
[ 4477.151226] ata1.00: irq_stat 0x08000000
[ 4477.151229] ata1: SError: { Proto }
[ 4477.151235] ata1.00: failed command: READ FPDMA QUEUED
[ 4477.151243] ata1.00: cmd 60/58:c0:50:81:14/00:00:1d:00:00/40 tag 24 ncq dma 45056 in
                        res 40/00:c0:50:81:14/00:00:1d:00:00/40 Emask 0x2 (HSM violation)
[ 4477.151245] ata1.00: status: { DRDY }
[ 4477.151248] ata1.00: failed command: READ FPDMA QUEUED
[ 4477.151254] ata1.00: cmd 60/08:c8:a8:81:14/00:00:1d:00:00/40 tag 25 ncq dma 4096 in
                        res 40/00:c0:50:81:14/00:00:1d:00:00/40 Emask 0x2 (HSM violation)
[ 4477.151256] ata1.00: status: { DRDY }
[ 4477.151263] ata1: hard resetting link
[ 4477.635201] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 4477.636449] ata1.00: configured for UDMA/133
[ 4477.636488] sd 0:0:0:0: [sda] tag#24 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[ 4477.636493] sd 0:0:0:0: [sda] tag#24 Sense Key : 0x5 [current]
[ 4477.636497] sd 0:0:0:0: [sda] tag#24 ASC=0x21 ASCQ=0x4
[ 4477.636503] sd 0:0:0:0: [sda] tag#24 CDB: opcode=0x88 88 00 00 00 00 00 1d 14 81 50 00 00 00 58 00 00
[ 4477.636508] blk_update_request: I/O error, dev sda, sector 487883088 op 0x0:(READ) flags 0x700 phys_seg 11 prio class 0
[ 4477.636527] zio pool=rpool vdev=/dev/mapper/luks-ata-WDC_WD60EFRX-68L0BN1_WD-XXXXXXXXXXXX-part4 error=5 type=1 offset=248167702528 size=45056 flags=1808b0
[ 4477.636579] sd 0:0:0:0: [sda] tag#25 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[ 4477.636584] sd 0:0:0:0: [sda] tag#25 Sense Key : 0x5 [current]
[ 4477.636587] sd 0:0:0:0: [sda] tag#25 ASC=0x21 ASCQ=0x4
[ 4477.636591] sd 0:0:0:0: [sda] tag#25 CDB: opcode=0x88 88 00 00 00 00 00 1d 14 81 a8 00 00 00 08 00 00
[ 4477.636595] blk_update_request: I/O error, dev sda, sector 487883176 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0
[ 4477.636605] zio pool=rpool vdev=/dev/mapper/luks-ata-WDC_WD60EFRX-68L0BN1_WD-XXXXXXXXXXXX-part4 error=5 type=1 offset=248167747584 size=4096 flags=1808b0
[ 4477.636638] ata1: EH complete
TheLinuxBug Posted November 25, 2020

I hate to say this because it won't be what you want to hear, but these errors are generally caused by only a few things:

1. A bad SATA cable
2. A bad hard drive
3. A bad SATA controller
4. Under-powering the drives
5. Some really bad bug in the SATA drivers that causes resets (least likely, as others are not reporting the same)

A good test would be to try a hard drive you haven't tested with yet, with a new SATA cable to the drive, just to make sure. Alternatively, you could supply power to the drives via an ATX power supply or bench supply and see if the same thing continues. The board could be over-drawing power under heavy load, causing the drives to reset when they don't get the right amount of power. You could also test this by running fewer drives at once and seeing if that helps lower your power budget (see the sketch below for a quick way to compare single-drive load against all drives at once).

I have seen this issue myself in a few deployments where I was using a power supply too weak to meet the 12V rail requirements under load; the drives would drop out and the RAID would fail them, so this is actually a real possibility. It could be that your drives are pulling more power than was expected in the spec.

Hope this helps some. my 2 cents. Cheers!
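To compare a light load against a worst-case power draw, something like the following could be used (a sketch; the device names and read sizes are examples):

#!/bin/sh
# Read from every drive in parallel to maximize 5V/12V draw,
# then repeat with a single drive and compare when errors appear.
for dev in /dev/sd[a-e]; do
    dd if="$dev" of=/dev/null bs=1M count=20480 &   # ~20 GiB per drive
done
wait
dmesg | grep -E 'ata[0-9]' | tail -n 50   # check for link resets afterwards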
ShadowDance Posted November 25, 2020

Thanks for the reply @TheLinuxBug. It's not apparent from my top post, so to provide a bit more context: this happens even with no system load, and it happens with both four and five drives plugged in. Reading from only one drive at a time also triggers the issue (while the rest are idle). Considering the four-drive scenario, I find it hard to believe that a single drive could be pulling too much power from the power rail (one rail powering three drives and the other one drive).

Testing with a separate PSU is a good suggestion; in hindsight I should've done that before taking the NAS into "production" use. I also have some spare SATA cables I could try, but right now I won't be able to experiment. I have to see if I can find other drives on which I can reproduce the issue, because as it stands the SATA resets are not kind to my drives: they cause read errors on the drives, and even with double parity I don't want to risk it. (A ZFS scrub fixes the read errors by writing to the then-unreadable sector.)
TheLinuxBug Posted November 25, 2020

Interesting. Since others are seeing freezes during heavy load, I am still curious, based on all these descriptions, whether there is indeed some issue with the power supply. Someone will need to actually test that so it can be ruled out as a possibility; even in your situation it would likely be worth a test just to see. If there weren't so many people reporting lock-ups or freezes related to drives under load, I would lean more towards the SATA controller, though one would hope it isn't that, since it isn't something you can easily replace or test. Maybe attach one drive via USB 3.0 to the board and perform the same operations -- though for 3.5 inch drives you would obviously need an adapter that provides 12V power.

my 2 cents. Cheers!
aprayoga Posted November 26, 2020

@ShadowDance have you tried unplugging and replugging the SATA cable from the board? I had a similar issue (link reset and downgrade to 3.0 Gbps) in the past, and it was fixed after I unplugged and replugged the SATA cable. At that time I used an off-the-shelf SATA cable, an ATX power supply, and 5x 2.5" SanDisk SATA SSDs.

I tried to reproduce the issue with the hardware I have: 3x WD20EFRX and 2x ST1000DM010-2EP102. The WD drives are connected to SATA ports 3, 4, 5 and configured as an mdadm RAID 5. I scrubbed the array, which took ~4 hrs to finish; no ATA issue occurred. Then I prepared the drive for hotplug (see the expanded sketch after this post):

mdadm -S md127
echo 1 > /sys/block/sdc/device/delete

I removed the drive, waited for a while, then replugged the drive and re-assembled the array. The drive still connected at 6.0 Gbps. The log:

Nov 26 14:47:38 helios64 kernel: sd 2:0:0:0: [sdc] Synchronizing SCSI cache
Nov 26 14:47:38 helios64 kernel: sd 2:0:0:0: [sdc] Stopping disk
Nov 26 14:47:38 helios64 systemd-udevd[575]: Process '/sbin/mdadm -If sdc1 --path platform-f8000000.pcie-pci-0000:01:00.0-ata-3' failed with exit code 1.
Nov 26 14:47:38 helios64 kernel: ata3.00: disabled
Nov 26 14:47:42 helios64 login[2096]: pam_unix(login:session): session opened for user root by LOGIN(uid=0)
Nov 26 14:47:42 helios64 systemd-logind[1527]: New session 9 of user root.
Nov 26 14:47:42 helios64 systemd[1]: Started Session 9 of user root.
Nov 26 14:47:42 helios64 login[2520]: ROOT LOGIN on '/dev/ttyS2'
Nov 26 14:47:57 helios64 kernel: ata3: SATA link down (SStatus 0 SControl 300)
Nov 26 14:48:01 helios64 CRON[2530]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:48:01 helios64 CRON[2531]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:48:01 helios64 CRON[2530]: pam_unix(cron:session): session closed for user root
Nov 26 14:48:33 helios64 kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 26 14:48:33 helios64 kernel: ata3.00: ATA-9: WDC WD20EFRX-68EUZN0, 82.00A82, max UDMA/133
Nov 26 14:48:33 helios64 kernel: ata3.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 32), AA
Nov 26 14:48:33 helios64 kernel: ata3.00: configured for UDMA/133
Nov 26 14:48:33 helios64 kernel: scsi 2:0:0:0: Direct-Access     ATA      WDC WD20EFRX-68E 0A82 PQ: 0 ANSI: 5
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: Attached scsi generic sg2 type 0
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: [sdc] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: [sdc] 4096-byte physical blocks
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: [sdc] Write Protect is off
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 26 14:48:33 helios64 kernel: sdc: sdc1
Nov 26 14:48:33 helios64 kernel: sd 2:0:0:0: [sdc] Attached SCSI disk
Nov 26 14:49:01 helios64 CRON[2559]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:49:01 helios64 CRON[2560]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:49:01 helios64 CRON[2559]: pam_unix(cron:session): session closed for user root
Nov 26 14:50:01 helios64 CRON[2598]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:50:01 helios64 CRON[2599]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:50:01 helios64 CRON[2598]: pam_unix(cron:session): session closed for user root
Nov 26 14:51:01 helios64 CRON[2617]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:51:01 helios64 CRON[2618]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:51:01 helios64 CRON[2617]: pam_unix(cron:session): session closed for user root
Nov 26 14:52:01 helios64 CRON[2634]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:52:01 helios64 CRON[2635]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:52:02 helios64 CRON[2634]: pam_unix(cron:session): session closed for user root
Nov 26 14:53:01 helios64 CRON[2656]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:53:01 helios64 CRON[2657]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:53:01 helios64 CRON[2656]: pam_unix(cron:session): session closed for user root
Nov 26 14:54:01 helios64 CRON[2677]: pam_unix(cron:session): session opened for user root by (uid=0)
Nov 26 14:54:01 helios64 CRON[2678]: (root) CMD (/usr/sbin/omv-ionice >/dev/null 2>&1)
Nov 26 14:54:01 helios64 CRON[2677]: pam_unix(cron:session): session closed for user root
Nov 26 14:54:12 helios64 kernel: md: array md127 already has disks!
Nov 26 14:54:12 helios64 kernel: md/raid:md127: device sde1 operational as raid disk 2
Nov 26 14:54:12 helios64 kernel: md/raid:md127: device sdd1 operational as raid disk 1
Nov 26 14:54:12 helios64 kernel: md/raid:md127: device sdc1 operational as raid disk 0
Nov 26 14:54:12 helios64 kernel: md/raid:md127: raid level 5 active with 3 out of 3 devices, algorithm 2
Nov 26 14:54:12 helios64 kernel: md127: detected capacity change from 0 to 4000527155200
Nov 26 14:54:12 helios64 systemd[1]: Started MD array monitor.

I used Armbian 20.11 Buster (LK 5.9.10) for the test.
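For anyone wanting to repeat this hotplug test, a rough expansion of the procedure above (a sketch; the array name, member devices, and host number are taken from this post and will likely differ on your system):

# 1. Stop the array and tell the kernel to detach the disk
mdadm -S /dev/md127
echo 1 > /sys/block/sdc/device/delete

# 2. Physically unplug the drive, wait a moment, plug it back in.
#    If the link doesn't come back on its own, force a SCSI rescan:
echo '- - -' > /sys/class/scsi_host/host2/scan

# 3. Re-assemble the array and verify the negotiated link speed
mdadm --assemble /dev/md127 /dev/sd[cde]1
dmesg | grep 'SATA link up'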
ShadowDance Posted November 26, 2020

I was still seeing the issue on Armbian 20.11 Buster (LK 5.9), and it actually seemed worse (even with the 3 Gbps limit). So, considering my suspicion of a power issue (the push from @TheLinuxBug helped too), I took the plunge and got a bit creative with my testing. Here's the TL;DR:

- Stock power/harness has the issue even with only one drive connected
- Stock power/harness has the issue with the UPS disconnected
- Stock harness has the issue when powering only via the UPS battery (no PSU); this triggered the issue without even reading from the disk (idle disk, it happened immediately after boot in initrd)
- Separate PSU for the disk and a different SATA cable (no harness) == no issue

The 3 Gbps limit was removed for these tests, and I picked the ata4 port because I knew it could reliably reproduce the issue soon after boot. I've attached dmesgs from cases 2-4 and annotated some of them. I'm leaning towards a power issue here, considering cases 3 and 4. For reference, these drives are rated 0.60 A @ 5 V and 0.45 A @ 12 V.

I would've liked to test the harness power and harness SATA cable separately, but that's unfortunately quite hard, and I didn't have any SATA extension cables (female/male) that would have allowed me to test the harness SATA cable separately from power. In case someone finds it amusing, here's my testing rig for case 4 (photos attached).

Notable here is that I performed the 4th test three times. The first time, I had the drive powered on before powering on the Helios64, and the link was set to 3 Gbps for some reason. I powered both down, powered on the Helios64 first, waited 2 seconds, then powered on the drive; this way the link was set up at 6 Gbps, with no issues when dd'ing. For good measure I repeated this a second time, and still no issue.

@gprovost Is there anything else I could test? Or would the next step be to try a different harness, PSU, etc.?

@aprayoga I have indeed juggled them around, but my issue is with all SATA ports; it's unlikely that all of them are loose. And thanks for attempting to reproduce. I still need to check whether it happens with other drives too, but that's a task for another day.

Attachments: 2-dmesg-internal-harness-no-ups.txt, 3-dmesg-no-psu-only-ups.txt, 4-dmesg-external-psu-sata-cable.txt
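When reproducing per port like this, it helps to know which ataN port a given /dev/sdX maps to; the sysfs device path for a SATA disk contains its ata link name (a sketch, not specific to the Helios64):

# Print the ata port behind each block device
for dev in /sys/block/sd?; do
    printf '%s -> %s\n' "${dev##*/}" "$(readlink -f "$dev" | grep -o 'ata[0-9]*' | head -n1)"
done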
gprovost Posted November 27, 2020

@ShadowDance What is the serial number of your board? We want to check the factory test report for this one, just to be sure the SATA controller was properly flashed:

cat /proc/device-tree/serial-number

Honestly, it could also be a SATA data cable issue (a faulty SATA cable is not uncommon), since your tests don't narrow it down to either SATA power or SATA data. We can send you a new harness along with a power-only harness that you can combine with a normal SATA cable. Please contact us at support@kobol.io to arrange that. Then you can do more testing and finally figure out what the issue is.
Fred Fettinger Posted January 27, 2021

On 11/26/2020 at 9:25 PM, gprovost said:
"cat /proc/device-tree/serial-number"

Are there any known serial numbers which were not flashed properly? I'm starting to troubleshoot similar behavior and mine outputs: 000100001425
clostro Posted January 27, 2021

Just started seeing these 'ata1.00: failed command: READ FPDMA QUEUED' lines in my log as well. I assume ata1 is sda. Had a whole bunch of them a while after boot and nothing afterwards, but the device was not accessed a whole lot during this time: it had just booted up after a crash, and the LVM cache was cleaning the dirty bits on the cache SSD connected to sda.

sdb-e are populated with 4x 4TB ST4000VN008 IronWolfs, and sda is hooked up to an old (and I mean old) SanDisk Extreme 240GB SSD, SDSSDX240GG25. I attached the smartctl report for the SSD below, and it passed a short SMART test just now; I'll start a long test in a minute (see the sketch after this post).

Using Linux helios64 5.9.14-rockchip64 #20.11.4 SMP PREEMPT Tue Dec 15 08:52:20 CET 2020 aarch64 GNU/Linux. The serial number of the board is 000100000530, if that could help.

[ 4101.251887] ata1.00: exception Emask 0x10 SAct 0xffffffff SErr 0xb80100 action 0x6
[ 4101.251895] ata1.00: irq_stat 0x08000000
[ 4101.251903] ata1: SError: { UnrecovData 10B8B Dispar BadCRC LinkSeq }
[ 4101.251911] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.251924] ata1.00: cmd 60/00:00:00:02:b6/02:00:08:00:00/40 tag 0 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.251929] ata1.00: status: { DRDY }
[ 4101.251934] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.251946] ata1.00: cmd 60/00:08:00:a0:b3/02:00:08:00:00/40 tag 1 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.251950] ata1.00: status: { DRDY }
[ 4101.251956] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.251968] ata1.00: cmd 60/00:10:00:7c:b3/02:00:08:00:00/40 tag 2 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.251972] ata1.00: status: { DRDY }
[ 4101.251977] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.251989] ata1.00: cmd 60/00:18:00:58:b3/02:00:08:00:00/40 tag 3 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.251993] ata1.00: status: { DRDY }
[ 4101.251998] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252010] ata1.00: cmd 60/00:20:00:34:b3/02:00:08:00:00/40 tag 4 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252014] ata1.00: status: { DRDY }
[ 4101.252020] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252032] ata1.00: cmd 60/00:28:00:3e:b1/02:00:08:00:00/40 tag 5 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252036] ata1.00: status: { DRDY }
[ 4101.252041] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252053] ata1.00: cmd 60/00:30:00:1a:b1/02:00:08:00:00/40 tag 6 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252057] ata1.00: status: { DRDY }
[ 4101.252062] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252074] ata1.00: cmd 60/00:38:00:f6:b0/02:00:08:00:00/40 tag 7 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252078] ata1.00: status: { DRDY }
[ 4101.252083] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252095] ata1.00: cmd 60/00:40:00:d2:b0/02:00:08:00:00/40 tag 8 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252099] ata1.00: status: { DRDY }
[ 4101.252105] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252116] ata1.00: cmd 60/00:48:00:ae:b0/02:00:08:00:00/40 tag 9 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252120] ata1.00: status: { DRDY }
[ 4101.252126] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252138] ata1.00: cmd 60/00:50:00:b8:ae/02:00:08:00:00/40 tag 10 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252142] ata1.00: status: { DRDY }
[ 4101.252147] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252159] ata1.00: cmd 60/00:58:00:94:ae/02:00:08:00:00/40 tag 11 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252163] ata1.00: status: { DRDY }
[ 4101.252169] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252181] ata1.00: cmd 60/00:60:00:b8:bd/02:00:08:00:00/40 tag 12 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252185] ata1.00: status: { DRDY }
[ 4101.252190] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252202] ata1.00: cmd 60/00:68:00:94:bd/02:00:08:00:00/40 tag 13 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252206] ata1.00: status: { DRDY }
[ 4101.252211] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252223] ata1.00: cmd 60/00:70:00:70:bd/02:00:08:00:00/40 tag 14 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252228] ata1.00: status: { DRDY }
[ 4101.252232] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252244] ata1.00: cmd 60/00:78:00:4c:bd/02:00:08:00:00/40 tag 15 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252249] ata1.00: status: { DRDY }
[ 4101.252254] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252266] ata1.00: cmd 60/00:80:00:56:bb/02:00:08:00:00/40 tag 16 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252270] ata1.00: status: { DRDY }
[ 4101.252276] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252288] ata1.00: cmd 60/00:88:00:32:bb/02:00:08:00:00/40 tag 17 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252292] ata1.00: status: { DRDY }
[ 4101.252297] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252309] ata1.00: cmd 60/00:90:00:0e:bb/02:00:08:00:00/40 tag 18 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252314] ata1.00: status: { DRDY }
[ 4101.252319] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252330] ata1.00: cmd 60/00:98:00:ea:ba/02:00:08:00:00/40 tag 19 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252335] ata1.00: status: { DRDY }
[ 4101.252340] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252352] ata1.00: cmd 60/00:a0:00:c6:ba/02:00:08:00:00/40 tag 20 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252356] ata1.00: status: { DRDY }
[ 4101.252361] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252373] ata1.00: cmd 60/00:a8:00:d0:b8/02:00:08:00:00/40 tag 21 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252377] ata1.00: status: { DRDY }
[ 4101.252382] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252394] ata1.00: cmd 60/00:b0:00:ac:b8/02:00:08:00:00/40 tag 22 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252399] ata1.00: status: { DRDY }
[ 4101.252404] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252416] ata1.00: cmd 60/00:b8:00:88:b8/02:00:08:00:00/40 tag 23 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252420] ata1.00: status: { DRDY }
[ 4101.252426] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252437] ata1.00: cmd 60/00:c0:00:64:b8/02:00:08:00:00/40 tag 24 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252442] ata1.00: status: { DRDY }
[ 4101.252447] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252459] ata1.00: cmd 60/00:c8:00:40:b8/02:00:08:00:00/40 tag 25 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252463] ata1.00: status: { DRDY }
[ 4101.252468] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252480] ata1.00: cmd 60/00:d0:00:4a:b6/02:00:08:00:00/40 tag 26 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252485] ata1.00: status: { DRDY }
[ 4101.252490] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252502] ata1.00: cmd 60/00:d8:00:26:b6/02:00:08:00:00/40 tag 27 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252506] ata1.00: status: { DRDY }
[ 4101.252511] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252523] ata1.00: cmd 60/00:e0:00:de:b5/02:00:08:00:00/40 tag 28 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252527] ata1.00: status: { DRDY }
[ 4101.252532] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252545] ata1.00: cmd 60/00:e8:00:ba:b5/02:00:08:00:00/40 tag 29 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252549] ata1.00: status: { DRDY }
[ 4101.252554] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252566] ata1.00: cmd 60/00:f0:00:c4:b3/02:00:08:00:00/40 tag 30 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252570] ata1.00: status: { DRDY }
[ 4101.252575] ata1.00: failed command: READ FPDMA QUEUED
[ 4101.252587] ata1.00: cmd 60/00:f8:00:70:ae/02:00:08:00:00/40 tag 31 ncq dma 262144 in
                        res 40/00:c8:00:40:b8/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
[ 4101.252592] ata1.00: status: { DRDY }
[ 4101.252603] ata1: hard resetting link
[ 4101.727761] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 4101.749701] ata1.00: configured for UDMA/133
[ 4101.752101] ata1: EH complete

smartctl -x /dev/sda
smartctl 6.6 2017-11-05 r4594 [aarch64-linux-5.9.14-rockchip64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     SanDisk SDSSDX240GG25
Serial Number:    123273400127
LU WWN Device Id: 5 001b44 7bf3fcb3f
Firmware Version: R201
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jan 27 13:24:04 2021 +03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Disabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed without error or
                                        no self-test has ever been run.
Total time to complete Offline data collection:         (    0) seconds.
Offline data collection capabilities:            (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine recommended polling time:        (   1) minutes.
Extended self-test routine recommended polling time:     (  48) minutes.
Conveyance self-test routine recommended polling time:   (   2) minutes.
SCT capabilities:              (0x0021) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   119   119   050    -    0/225624030
  5 Retired_Block_Count     PO--CK   100   100   003    -    7
  9 Power_On_Hours_and_Msec -O--CK   000   000   000    -    30893h+27m+05.090s
 12 Power_Cycle_Count       -O--CK   099   099   000    -    1951
171 Program_Fail_Count      -O--CK   000   000   000    -    0
172 Erase_Fail_Count        -O--CK   000   000   000    -    0
174 Unexpect_Power_Loss_Ct  ----CK   000   000   000    -    999
177 Wear_Range_Delta        ------   000   000   000    -    3
181 Program_Fail_Count      -O--CK   000   000   000    -    0
182 Erase_Fail_Count        -O--CK   000   000   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   028   055   000    -    28 (Min/Max 10/55)
195 ECC_Uncorr_Error_Count  --SRC-   120   120   000    -    0/225624030
196 Reallocated_Event_Count PO--CK   100   100   003    -    7
201 Unc_Soft_Read_Err_Rate  --SRC-   120   120   000    -    0/225624030
204 Soft_ECC_Correct_Rate   --SRC-   120   120   000    -    0/225624030
230 Life_Curve_Status       PO--C-   100   100   000    -    100
231 SSD_Life_Left           PO--C-   100   100   010    -    0
233 SandForce_Internal      ------   000   000   000    -    10541
234 SandForce_Internal      -O--CK   000   000   000    -    13196
241 Lifetime_Writes_GiB     -O--CK   000   000   000    -    13196
242 Lifetime_Reads_GiB      -O--CK   000   000   000    -    13491
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x07       GPL     R/O      1  Extended self-test log
0x09       SL      R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL,SL  R/O      1  SATA Phy Event Counters log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xb7       GPL,SL  VS      16  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log (GP Log 0x03) not supported
SMART Error Log not supported
SMART Extended Self-test Log Version: 0 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       0 (0x0000)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    28 Celsius
Power Cycle Min/Max Temperature:     27/31 Celsius
Lifetime    Min/Max Temperature:     10/55 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     10 (Unknown, should be 2)
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        10 minutes
Min/Max recommended Temperature:     0/120 Celsius
Min/Max Temperature Limit:           0/ 0 Celsius
Temperature History Size (Index):    56576 (7)
Invalid Temperature History Size or Index

SCT Error Recovery Control command not supported

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            4  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            4  Device-to-host register FISes sent due to a COMRESET
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC
0x0002  2            0  R_ERR response for data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
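For reference, the short/long self-tests mentioned above can be started and checked like this (a sketch; nothing here is specific to this SSD):

smartctl -t short /dev/sda        # ~1 minute on this drive, per the report above
smartctl -t long /dev/sda         # ~48 minutes, per the report above
smartctl -l selftest /dev/sda     # view completed self-test results
smartctl -A /dev/sda | grep -i -E 'crc|uncorrect'   # CRC/uncorrectable counters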
Gareth Halfacree Posted January 27, 2021

I'm seeing the same on my Helios64, on ata2.00: a bunch on Monday January 4th, then even more on Thursday January 14th. None since. ata2.00 is a Western Digital Red Plus 6TB, purchased new for the Helios64. It's in a mirror with a 6TB Seagate IronWolf; the only other drive in the system is a 240GB M.2 SSD on the mainboard.

The drive has completed SMART tests without errors, and the only thing of interest in the SMART attributes is 199 UDMA CRC Error Count at 5. The hard drives are in bays 2 & 4; 1, 3, and 5 are empty. The board serial number is 000100001123, running Ubuntu 20.04.1 LTS with kernel 5.9.14-rockchip64.
ShadowDance Posted January 27, 2021

Hey, sorry I haven't updated this thread until now. The Kobol team sent me, as promised, a new harness and a power-only harness so that I could do some testing:

- Cutting the capacitors off my original harness did not make a difference
- The new (normal) harness had the exact same issue as the original one
- With the power-only harness and my own SATA cables, I was unable to reproduce the issue (even at 6 Gbps)
- The final test was to go to town on my original harness and cut the connector in two; this allowed me to use my own SATA cable with the original harness, and there was, again, no issue (at 6 Gbps)

Judging from my initial results, it would seem there is an issue with the SATA cables in the stock harness. But I should run this test for a longer period of time -- the problem was that I didn't have SATA cables for all disks; once I do, I'll try a week-long stress test. I reported my results to the Kobol team but haven't heard back yet.

Even with the 3.0 Gbps limit, I still occasionally run into this issue with the original harness; it has happened twice since I did the experiment. If someone else is willing to repeat this experiment with a good set of SATA cables, please do contact Kobol to see if they'd be willing to ship out another set of test harnesses, or perhaps they have other plans.

Here are some pics of my test setup, including the mutilated connector: (photos attached)
clostro Posted January 27, 2021

Thanks for the update. I really hope it's not the cables in my case; I was not getting these lines in the log before, they just appeared for the first time. The only difference from the last few boots is the CPU frequency and speed governor, per the quote below. I don't think they are related: this was originally suggested for troubleshooting a 'sync and drop_caches' issue, which works fine on my device. Later it was also suggested for the 'mystery red light' issue, which was a problem on my device. But this could be something else. Hopefully not the cables; I would rather have the SSD connected to that slot fail than change the harness.

On 1/4/2021 at 6:48 AM, gprovost said:
"Hi, could you try the following tweak and redo the same methods that trigger the crash: Run armbian-config, go to -> System -> CPU, and set: Minimum CPU speed = 1200000, Maximum CPU speed = 1200000, CPU governor = performance. This will help us understand if the instability is still due to DVFS."
Gareth Halfacree Posted February 4, 2021

My Helios64 ran a btrfs scrub on the 1st of the month, and that triggered a whole bunch more READ FPDMA QUEUED errors - 41 over the space of two hours, clustered in two big groups at around 57 minutes and 1h46m into the scrub. Despite the errors, the scrub completed successfully without reporting any read errors.
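One way to cross-check whether btrfs itself recorded anything during such a scrub is its per-device error counters (a sketch; the mount point is an example):

sudo btrfs device stats /media/storage    # read_io_errs, corruption_errs, etc.
sudo btrfs scrub status /media/storage    # summary of the last scrub run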
gprovost Posted February 9, 2021

Hey guys, could you disable NCQ and see how it impacts your system when trying to reproduce the issue? Edit /boot/armbianEnv.txt and add the following line:

extraargs=libata.force=noncq
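After a reboot you can verify NCQ is actually off; with NCQ disabled the kernel should report a queue depth of 1 instead of 31/32 (a sketch, based on standard libata/sysfs behavior):

cat /sys/block/sda/device/queue_depth   # 31 or 32 with NCQ, 1 with noncq
dmesg | grep -i ncq                     # device line typically shows "NCQ (not used)" when forced off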
ShadowDance Posted February 9, 2021

@gprovost I can only speak for my part, but I've pretty much gone through all the libata options and combinations thereof (NCQ included), with no improvement in my situation.
alban Posted February 9, 2021

Hello, I'm getting the same issue:

[ 3613.284995] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0xb80100 action 0x6
[ 3613.285019] ata4.00: irq_stat 0x08000000
[ 3613.285035] ata4: SError: { UnrecovData 10B8B Dispar BadCRC LinkSeq }
[ 3613.285053] ata4.00: failed command: READ DMA EXT
[ 3613.285081] ata4.00: cmd 25/00:40:f0:a9:36/00:05:04:00:00/e0 tag 26 dma 688128 in
                        res 50/00:00:f0:a9:36/00:00:04:00:00/e0 Emask 0x10 (ATA bus error)
[ 3613.285092] ata4.00: status: { DRDY }
[ 3613.285113] ata4: hard resetting link
[ 3614.116878] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 3614.117787] ata4.00: configured for UDMA/133
[ 3614.117946] ata4: EH complete

I had been running Armbian 20.11 with the 5.9 kernel since the first install on an SD card, and this problem appeared during the last few days. I installed a new system from the latest Armbian 21.02 image on eMMC to be SD-card independent, but the error still occurs within hours after boot. It ends up reducing the ATA link speed and finally disabling the first two drives (corresponding to sda and sdb in my case): their LEDs were off when I noticed the problem. I'm using 5 WDC WD30EFRX drives as a RAID 6 array.

I tried disabling NCQ tonight and got the error not long after boot. I forced an array recheck (see the sketch below), and as I'm writing this post I'm getting another error:

[ 4003.337231] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0xb00100 action 0x6
[ 4003.337248] ata4.00: irq_stat 0x08000000
[ 4003.337256] ata4: SError: { UnrecovData Dispar BadCRC LinkSeq }
[ 4003.337265] ata4.00: failed command: READ DMA EXT
[ 4003.337278] ata4.00: cmd 25/00:40:80:41:09/00:05:09:00:00/e0 tag 7 dma 688128 in
                        res 50/00:00:80:41:09/00:00:09:00:00/e0 Emask 0x10 (ATA bus error)
[ 4003.337283] ata4.00: status: { DRDY }
[ 4003.337298] ata4: hard resetting link
[ 4004.169150] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 4004.170064] ata4.00: configured for UDMA/133
[ 4004.170222] ata4: EH complete

If this is SATA-cable related, I don't get why the problem appeared after a few months of working non-stop. Any other ideas? In case it helps, here's my serial:

cat /proc/device-tree/serial-number
000100001235

Regards
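For reference, an md array recheck like the one mentioned above is typically forced through sysfs (a sketch; md0 is an example array name):

echo check > /sys/block/md0/md/sync_action   # start a read-only consistency check
cat /proc/mdstat                             # watch progress
cat /sys/block/md0/md/mismatch_cnt           # mismatches found by the check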
gprovost Posted February 10, 2021

@ShadowDance @alban I guess it was a shot in the dark to check if NCQ has any sort of impact.
griefman Posted February 10, 2021

I have been following the discussion for a while and would like to report where I am at. I would also like to request that this information be recorded in a systematic way by Kobol, of course only if this isn't already happening. I have the following setup:

  pool: mypool
 state: ONLINE
  scan: resilvered 12K in 00:00:00 with 0 errors on Tue Feb  9 17:22:03 2021
config:

	NAME                                          STATE     READ WRITE CKSUM
	mypool                                        ONLINE       0     0     0
	  raidz1-0                                    ONLINE       0     0     0
	    ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
	    ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
	    ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
	    ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
	    ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0

Some weeks ago I noticed that the health of the zpool was DEGRADED. I checked, and one device had READ errors and was marked as FAULTED. This also resulted in UDMA CRC errors being stored in the SMART stats for that drive. So I cleared the errors and ran a scrub to see what was going on:

sudo zpool scrub mypool

I monitored the scrub with:

sudo watch zpool status

And I saw quite quickly that all drives started to get READ errors. SMART also reported that all drives now have UDMA CRC errors. It was clear that something bad was going on, so I contacted Kobol and we started to debug the issue together.

First I changed the SATA speed to 3 Gbps by adding the following line to /boot/armbianEnv.txt:

extraargs=libata.force=3.0

The results were similar, but I noticed that the errors started to show up a bit later in the scrubbing process. The UDMA CRC error count still increased. Then I replaced the power and SATA cables with new ones, which unfortunately did not bring any improvement. Then I disabled NCQ by adding the following to /boot/armbianEnv.txt:

extraargs=libata.force=noncq

and reverted back to SATA 6 Gbps by removing the 3 Gbps line introduced earlier. This had a positive result, and I was able to run the scrub without any errors. Then I went back, installed the old (original) cable harness again, and retested: all good.

While disabling NCQ has a positive impact on the errors, it also has a negative impact on speed and, to some extent, on the disk drives' health. I also tried reducing the NCQ depth to 31, which is the recommended value; however, this did not have any impact (see the sketch below for the commands involved).

I hope that using this information Kobol will try to reproduce this issue themselves, to see whether only certain boards are affected or every board.
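For anyone following along, the per-drive knobs mentioned here are all reachable from the shell (a sketch; device names are examples):

# Reduce the NCQ queue depth at runtime (32 is the hardware max; 31 was tried above)
echo 31 | sudo tee /sys/block/sda/device/queue_depth

# Watch the UDMA CRC error counter across scrub runs
for d in /dev/sd[a-e]; do
    echo "$d: $(sudo smartctl -A "$d" | awk '/UDMA_CRC_Error_Count/ {print $NF}')"
done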
ShadowDance Posted February 10, 2021

On 2/10/2021 at 12:10 PM, griefman said:
"Then I replaced the power and the SATA Cables with new ones, which unfortunately did not bring any improvement."

Was this a replacement harness from Kobol, or 3rd-party SATA cables? I saw no improvement with a replacement harness, but 3rd-party SATA cables seem to have done the trick. Either way, thanks for sharing your insights @griefman; it's nice to see others digging into this/these issue(s) as well.

Another interesting observation I made along the way is that a reboot helps "fix" the read errors without writing to the "bad sectors" (not actually bad sectors!). I.e. if 6 Gbps triggered an error, limiting to 3 Gbps and rebooting would "fix it". So the errors seem to stem, in my case, from the system resetting the drive, which in turn seems to happen because communication errors occur with the stock SATA cables.
alban Posted February 11, 2021

UPDATE: Slots 1 & 2 won't start anymore. After many tests:

- Battery connected or not doesn't change anything
- With external power and a brand-new SATA 6 Gbit cable, the drives work fine
- With a SATA power Y-splitter fed from the slot 1 or slot 2 line, the disks don't start even with an external SATA cable: there seems to be trouble with the power lines of the stock harness
- With a SATA power splitter connected from slot 3 powering slots 1 & 2, drives 1 / 2 / 4 / 5 are working: seems OK
- I will try with two SATA Y-splitters (delivery planned for tonight)

Conclusion for now: power doesn't seem to be properly delivered through the stock harness. I was suspecting the SATA data cables, but the problem on my unit seems to come mainly from the SATA power connectors. I will keep you updated.
griefman Posted February 11, 2021

On 2/10/2021 at 11:31 AM, ShadowDance said:
"Was this a replacement harness from Kobol or 3rd party SATA cables? [...]"

I used a power cable that was provided by Kobol and brand-new SATA cables from Amazon. Also, I did not have any bad sectors written on the disks, only UDMA CRC errors. My drives were brand new when I connected them.
ShadowDance Posted February 11, 2021

5 minutes ago, griefman said:
"I used a power cable that was provided by Kobol and brand new SATA cables from Amazon. [...]"

Ok, good to know. And just to be clear: no bad sectors here either. However, the resets done by the system put the drives in a weird state (speculating) which behaved like bad sectors (but weren't, hence the quotes); a reboot would clear those up. UDMA CRC errors happen on my drives too (they never happened before putting them in the Helios64).
Gareth Halfacree Posted March 1, 2021

It's the first of the month, so another automated btrfs scrub took place - and again triggered a bunch of DRDY drive resets on ata2.00. Are we any closer to figuring out what the problem is here?
0utc45t Posted March 1, 2021

I'm suffering from this issue too: a SATA SSD in slot 1 (with the system on it) and a ZFS pool (mirror) in the rest. It started with 2 older disks; a scrub initiated problems, so I added one new disk to get redundancy back, started to get errors on that too, and then added a second new IronWolf. Now my pool is resilvering with 1 old WD disk and 1 new Seagate in a degraded state (too many errors), and I'm waiting for the resilvering to finish on the second new Seagate IronWolf to have my data mirrored. Any real solution yet?
gprovost Posted March 2, 2021

Right now we still have some difficulty distinguishing between ATA errors related to a cable issue (something we are addressing with a new PCB back-plane) and ATA errors potentially due to the SATA controller. For the latter we are asking JMicron for help, but the issue is that we don't have an easy way to reproduce the error. The issue that might be related to the SATA controller seems to only show up with BTRFS and ZFS while doing some scrubbing.

Do you guys have a suggestion for a simple reproducible test starting from scratch (no data on the HDDs)? Thanks for your help.
Gareth Halfacree Posted March 2, 2021

4 hours ago, gprovost said:
"The issue that might be related to the SATA controller seems to only show up with BTRFS and ZFS while doing some scrubbing. Do you guys have a suggestion for a simple reproducible test starting from scratch (no data on the HDDs)? Thanks for your help."

I'd suggest getting one or more drives, setting them up in a btrfs mirror, dd'ing a bunch of random data onto them, and scrubbing. It's possible dd'ing alone will trigger the error, but if it doesn't, then scrubbing certainly should.

$ sudo mkfs.btrfs -L testdisks -m raid1 -d raid1 /dev/sdX /dev/sdY /dev/sdZ
$ sudo mkdir /media/testdisks
$ sudo chown USER /media/testdisks
$ sudo mount /dev/sdX /media/testdisks -o user,defaults,noatime,nodiratime,compress=zstd,space_cache
$ cd /media/testdisks
$ dd if=/dev/urandom of=/media/testdisks/001.testfile bs=10M count=1024
$ for i in {002..100}; do cp /media/testdisks/001.testfile /media/testdisks/$i.testfile; done
$ sudo btrfs scrub start -Bd /media/testdisks

The options on the mount are set to mimic my own. Obviously change the number and names of the devices and the username to match your own. As written, it creates 100 files of 10 GiB each, for a total of about 1 TB of data; you could try with 10 files (100 GB total) first to speed things up, but I have around a terabyte of data on my system, so I figured it was worth matching the real-world scenario as closely as possible.
grek Posted March 23, 2021 Share Posted March 23, 2021 Today I manually ran zfs scrub at my zfs mirror pool, I got similar errors. The funny is, that normally ,scrub was initiated by the cron without any issue. Any suggestion what should I do? I done: zpool clear data sda-crypt and zpool clear data sdb-crypt ,at my pools and now I have online state, but I'm little afraid about my data ...... output before today manual zfs scrub - Scrub without errors at Sun Mar 14...... root@helios64:/# zpool status -v data pool: data state: ONLINE scan: scrub repaired 0B in 06:17:38 with 0 errors on Sun Mar 14 06:41:40 2021 config: NAME STATE READ WRITE CKSUM data ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 sda-crypt ONLINE 0 0 0 sdb-crypt ONLINE 0 0 0 cache sdc2 ONLINE 0 0 0 root@helios64:/var/log# cat /etc/cron.d/zfsutils-linux PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin # TRIM the first Sunday of every month. 24 0 1-7 * * root if [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/trim ]; then /usr/lib/zfs-linux/trim; fi # Scrub the second Sunday of every month. 24 0 8-14 * * root if [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ]; then /usr/lib/zfs-linux/scrub; fi root@helios64:/var/log# Quote root@helios64:/var/log# zpool status -v data pool: data state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P scan: scrub repaired 4.87M in 06:32:31 with 0 errors on Tue Mar 23 15:21:50 2021 config: NAME STATE READ WRITE CKSUM data DEGRADED 0 0 0 mirror-0 DEGRADED 0 0 0 sda-crypt DEGRADED 6 0 32 too many errors sdb-crypt DEGRADED 9 0 29 too many errors cache sdc2 ONLINE 0 0 0 Quote [ 8255.882611] ata2.00: irq_stat 0x08000000 [ 8255.882960] ata2: SError: { Proto } [ 8255.883273] ata2.00: failed command: READ FPDMA QUEUED [ 8255.883732] ata2.00: cmd 60/00:10:18:4d:9d/08:00:15:00:00/40 tag 2 ncq dma 1048576 in [ 8255.883732] res 40/00:10:18:4d:9d/00:00:15:00:00/40 Emask 0x2 (HSM violation) [ 8255.885101] ata2.00: status: { DRDY } [ 8255.885426] ata2.00: failed command: READ FPDMA QUEUED [ 8255.885885] ata2.00: cmd 60/00:18:90:5b:9d/08:00:15:00:00/40 tag 3 ncq dma 1048576 in [ 8255.885885] res 40/00:10:18:4d:9d/00:00:15:00:00/40 Emask 0x2 (HSM violation) [ 8255.887294] ata2.00: status: { DRDY } [ 8255.887623] ata2.00: failed command: READ FPDMA QUEUED [ 8255.888083] ata2.00: cmd 60/10:98:18:55:9d/06:00:15:00:00/40 tag 19 ncq dma 794624 in [ 8255.888083] res 40/00:10:18:4d:9d/00:00:15:00:00/40 Emask 0x2 (HSM violation) [ 8255.889450] ata2.00: status: { DRDY } [ 8255.889783] ata2: hard resetting link [ 8256.365961] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 8256.367771] ata2.00: configured for UDMA/133 [ 8256.368540] sd 1:0:0:0: [sdb] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s [ 8256.369343] sd 1:0:0:0: [sdb] tag#2 Sense Key : 0x5 [current] [ 8256.369858] sd 1:0:0:0: [sdb] tag#2 ASC=0x21 ASCQ=0x4 [ 8256.370354] sd 1:0:0:0: [sdb] tag#2 CDB: opcode=0x88 88 00 00 00 00 00 15 9d 4d 18 00 00 08 00 00 00 [ 8256.371158] blk_update_request: I/O error, dev sdb, sector 362630424 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0 [ 8256.372125] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=185649999872 size=1048576 flags=40080cb0 [ 8256.373117] sd 1:0:0:0: 
[ 8256.373117] sd 1:0:0:0: [sdb] tag#3 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[ 8256.373964] sd 1:0:0:0: [sdb] tag#3 Sense Key : 0x5 [current]
[ 8256.374480] sd 1:0:0:0: [sdb] tag#3 ASC=0x21 ASCQ=0x4
[ 8256.374940] sd 1:0:0:0: [sdb] tag#3 CDB: opcode=0x88 88 00 00 00 00 00 15 9d 5b 90 00 00 08 00 00 00
[ 8256.375742] blk_update_request: I/O error, dev sdb, sector 362634128 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0
[ 8256.376704] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=185651896320 size=1048576 flags=40080cb0
[ 8256.377724] sd 1:0:0:0: [sdb] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[ 8256.378568] sd 1:0:0:0: [sdb] tag#19 Sense Key : 0x5 [current]
[ 8256.379096] sd 1:0:0:0: [sdb] tag#19 ASC=0x21 ASCQ=0x4
[ 8256.379561] sd 1:0:0:0: [sdb] tag#19 CDB: opcode=0x88 88 00 00 00 00 00 15 9d 55 18 00 00 06 10 00 00
[ 8256.380371] blk_update_request: I/O error, dev sdb, sector 362632472 op 0x0:(READ) flags 0x700 phys_seg 13 prio class 0
[ 8256.381336] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=185651048448 size=794624 flags=40080cb0
[ 8256.382389] ata2: EH complete
[15116.869369] ata2.00: exception Emask 0x2 SAct 0x30000200 SErr 0x400 action 0x6
[15116.870017] ata2.00: irq_stat 0x08000000
[15116.870366] ata2: SError: { Proto }
[15116.870680] ata2.00: failed command: READ FPDMA QUEUED
[15116.871140] ata2.00: cmd 60/00:48:98:41:1b/07:00:55:00:00/40 tag 9 ncq dma 917504 in
[15116.871140]          res 40/00:e0:98:40:1b/00:00:55:00:00/40 Emask 0x2 (HSM violation)
[15116.872501] ata2.00: status: { DRDY }
[15116.872826] ata2.00: failed command: READ FPDMA QUEUED
[15116.873284] ata2.00: cmd 60/00:e0:98:40:1b/01:00:55:00:00/40 tag 28 ncq dma 131072 in
[15116.873284]          res 40/00:e0:98:40:1b/00:00:55:00:00/40 Emask 0x2 (HSM violation)
[15116.874696] ata2.00: status: { DRDY }
[15116.875023] ata2.00: failed command: READ FPDMA QUEUED
[15116.875482] ata2.00: cmd 60/00:e8:98:48:1b/01:00:55:00:00/40 tag 29 ncq dma 131072 in
[15116.875482]          res 40/00:e0:98:40:1b/00:00:55:00:00/40 Emask 0x2 (HSM violation)
[15116.876850] ata2.00: status: { DRDY }
[15116.877182] ata2: hard resetting link
[15117.353348] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[15117.355108] ata2.00: configured for UDMA/133
[15117.355678] sd 1:0:0:0: [sdb] tag#9 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[15117.356481] sd 1:0:0:0: [sdb] tag#9 Sense Key : 0x5 [current]
[15117.356996] sd 1:0:0:0: [sdb] tag#9 ASC=0x21 ASCQ=0x4
[15117.357492] sd 1:0:0:0: [sdb] tag#9 CDB: opcode=0x88 88 00 00 00 00 00 55 1b 41 98 00 00 07 00 00 00
[15117.358296] blk_update_request: I/O error, dev sdb, sector 1427849624 op 0x0:(READ) flags 0x700 phys_seg 18 prio class 0
[15117.359273] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=731042230272 size=917504 flags=40080cb0
[15117.360297] sd 1:0:0:0: [sdb] tag#28 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[15117.361152] sd 1:0:0:0: [sdb] tag#28 Sense Key : 0x5 [current]
[15117.361716] sd 1:0:0:0: [sdb] tag#28 ASC=0x21 ASCQ=0x4
[15117.362181] sd 1:0:0:0: [sdb] tag#28 CDB: opcode=0x88 88 00 00 00 00 00 55 1b 40 98 00 00 01 00 00 00
[15117.362990] blk_update_request: I/O error, dev sdb, sector 1427849368 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[15117.363944] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=731042099200 size=131072 flags=1808b0
[15117.364883] sd 1:0:0:0: [sdb] tag#29 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[15117.365719] sd 1:0:0:0: [sdb] tag#29 Sense Key : 0x5 [current]
[15117.366243] sd 1:0:0:0: [sdb] tag#29 ASC=0x21 ASCQ=0x4
[15117.366706] sd 1:0:0:0: [sdb] tag#29 CDB: opcode=0x88 88 00 00 00 00 00 55 1b 48 98 00 00 01 00 00 00
[15117.367514] blk_update_request: I/O error, dev sdb, sector 1427851416 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[15117.368466] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=731043147776 size=131072 flags=1808b0
[15117.369450] ata2: EH complete
[18959.250478] FS-Cache: Loaded
[18959.292215] FS-Cache: Netfs 'nfs' registered for caching
[18959.613399] NFS: Registering the id_resolver key type
[18959.613885] Key type id_resolver registered
[18959.614253] Key type id_legacy registered
[19150.007895] ata1.00: exception Emask 0x2 SAct 0xe0000 SErr 0x400 action 0x6
[19150.008525] ata1.00: irq_stat 0x08000000
[19150.008874] ata1: SError: { Proto }
[19150.009188] ata1.00: failed command: READ FPDMA QUEUED
[19150.009648] ata1.00: cmd 60/78:88:30:40:05/00:00:78:00:00/40 tag 17 ncq dma 61440 in
[19150.009648]          res 40/00:90:60:41:05/00:00:78:00:00/40 Emask 0x2 (HSM violation)
[19150.011011] ata1.00: status: { DRDY }
[19150.011341] ata1.00: failed command: READ FPDMA QUEUED
[19150.011901] ata1.00: cmd 60/98:90:60:41:05/00:00:78:00:00/40 tag 18 ncq dma 77824 in
[19150.011901]          res 40/00:90:60:41:05/00:00:78:00:00/40 Emask 0x2 (HSM violation)
[19150.013357] ata1.00: status: { DRDY }
[19150.013689] ata1.00: failed command: READ FPDMA QUEUED
[19150.014148] ata1.00: cmd 60/60:98:88:42:05/00:00:78:00:00/40 tag 19 ncq dma 49152 in
[19150.014148]          res 40/00:90:60:41:05/00:00:78:00:00/40 Emask 0x2 (HSM violation)
[19150.015507] ata1.00: status: { DRDY }
[19150.015892] ata1: hard resetting link
[19150.491836] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[19150.493606] ata1.00: configured for UDMA/133
[19150.494071] sd 0:0:0:0: [sda] tag#17 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[19150.494880] sd 0:0:0:0: [sda] tag#17 Sense Key : 0x5 [current]
[19150.495401] sd 0:0:0:0: [sda] tag#17 ASC=0x21 ASCQ=0x4
[19150.495888] sd 0:0:0:0: [sda] tag#17 CDB: opcode=0x88 88 00 00 00 00 00 78 05 40 30 00 00 00 78 00 00
[19150.496698] blk_update_request: I/O error, dev sda, sector 2013610032 op 0x0:(READ) flags 0x700 phys_seg 14 prio class 0
[19150.497673] zio pool=data vdev=/dev/mapper/sda-crypt error=5 type=1 offset=1030951559168 size=61440 flags=40080cb0
[19150.498675] sd 0:0:0:0: [sda] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[19150.499485] sd 0:0:0:0: [sda] tag#18 Sense Key : 0x5 [current]
[19150.500044] sd 0:0:0:0: [sda] tag#18 ASC=0x21 ASCQ=0x4
[19150.500511] sd 0:0:0:0: [sda] tag#18 CDB: opcode=0x88 88 00 00 00 00 00 78 05 41 60 00 00 00 98 00 00
[19150.501321] blk_update_request: I/O error, dev sda, sector 2013610336 op 0x0:(READ) flags 0x700 phys_seg 17 prio class 0
[19150.502290] zio pool=data vdev=/dev/mapper/sda-crypt error=5 type=1 offset=1030951714816 size=77824 flags=40080cb0
[19150.503304] sd 0:0:0:0: [sda] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[19150.504141] sd 0:0:0:0: [sda] tag#19 Sense Key : 0x5 [current]
[19150.504664] sd 0:0:0:0: [sda] tag#19 ASC=0x21 ASCQ=0x4
[19150.505128] sd 0:0:0:0: [sda] tag#19 CDB: opcode=0x88 88 00 00 00 00 00 78 05 42 88 00 00 00 60 00 00
[19150.505936] blk_update_request: I/O error, dev sda, sector 2013610632 op 0x0:(READ) flags 0x700 phys_seg 11 prio class 0
[19150.506896] zio pool=data vdev=/dev/mapper/sda-crypt error=5 type=1 offset=1030951866368 size=49152 flags=40080cb0
[19150.507875] ata1: EH complete
[21308.943111] ata1.00: exception Emask 0x2 SAct 0x60010 SErr 0x400 action 0x6
[21308.943736] ata1.00: irq_stat 0x08000000
[21308.944086] ata1: SError: { Proto }
[21308.944399] ata1.00: failed command: READ FPDMA QUEUED
[21308.944859] ata1.00: cmd 60/00:20:70:11:6d/01:00:21:00:00/40 tag 4 ncq dma 131072 in
[21308.944859]          res 40/00:88:70:0e:6d/00:00:21:00:00/40 Emask 0x2 (HSM violation)
[21308.946219] ata1.00: status: { DRDY }
[21308.946544] ata1.00: failed command: READ FPDMA QUEUED
[21308.947003] ata1.00: cmd 60/00:88:70:0e:6d/02:00:21:00:00/40 tag 17 ncq dma 262144 in
[21308.947003]          res 40/00:88:70:0e:6d/00:00:21:00:00/40 Emask 0x2 (HSM violation)
[21308.948419] ata1.00: status: { DRDY }
[21308.948747] ata1.00: failed command: READ FPDMA QUEUED
[21308.949207] ata1.00: cmd 60/00:90:70:12:6d/01:00:21:00:00/40 tag 18 ncq dma 131072 in
[21308.949207]          res 40/00:88:70:0e:6d/00:00:21:00:00/40 Emask 0x2 (HSM violation)
[21308.950575] ata1.00: status: { DRDY }
[21308.950909] ata1: hard resetting link
[21309.427090] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[21309.428854] ata1.00: configured for UDMA/133
[21309.429342] sd 0:0:0:0: [sda] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[21309.430145] sd 0:0:0:0: [sda] tag#4 Sense Key : 0x5 [current]
[21309.430659] sd 0:0:0:0: [sda] tag#4 ASC=0x21 ASCQ=0x4
[21309.431157] sd 0:0:0:0: [sda] tag#4 CDB: opcode=0x88 88 00 00 00 00 00 21 6d 11 70 00 00 01 00 00 00
[21309.431964] blk_update_request: I/O error, dev sda, sector 560796016 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[21309.432934] zio pool=data vdev=/dev/mapper/sda-crypt error=5 type=1 offset=287110782976 size=131072 flags=1808b0
[21309.433947] sd 0:0:0:0: [sda] tag#17 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[21309.434762] sd 0:0:0:0: [sda] tag#17 Sense Key : 0x5 [current]
[21309.435329] sd 0:0:0:0: [sda] tag#17 ASC=0x21 ASCQ=0x4
[21309.435799] sd 0:0:0:0: [sda] tag#17 CDB: opcode=0x88 88 00 00 00 00 00 21 6d 0e 70 00 00 02 00 00 00
[21309.436609] blk_update_request: I/O error, dev sda, sector 560795248 op 0x0:(READ) flags 0x700 phys_seg 4 prio class 0
[21309.437561] zio pool=data vdev=/dev/mapper/sda-crypt error=5 type=1 offset=287110389760 size=262144 flags=40080cb0
[21309.438558] sd 0:0:0:0: [sda] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[21309.439397] sd 0:0:0:0: [sda] tag#18 Sense Key : 0x5 [current]
[21309.439923] sd 0:0:0:0: [sda] tag#18 ASC=0x21 ASCQ=0x4
[21309.440388] sd 0:0:0:0: [sda] tag#18 CDB: opcode=0x88 88 00 00 00 00 00 21 6d 12 70 00 00 01 00 00 00
[21309.441198] blk_update_request: I/O error, dev sda, sector 560796272 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[21309.442148] zio pool=data vdev=/dev/mapper/sda-crypt error=5 type=1 offset=287110914048 size=131072 flags=1808b0
[21309.443166] ata1: EH complete
[23343.877842] ata2.00: exception Emask 0x2 SAct 0x280002 SErr 0x400 action 0x6
[23343.878472] ata2.00: irq_stat 0x08000000
[23343.878821] ata2: SError: { Proto }
[23343.879134] ata2.00: failed command: READ FPDMA QUEUED
[23343.879594] ata2.00: cmd 60/00:08:10:c4:2e/01:00:6e:00:00/40 tag 1 ncq dma 131072 in
[23343.879594]          res 40/00:08:10:c4:2e/00:00:6e:00:00/40 Emask 0x2 (HSM violation)
[23343.880954] ata2.00: status: { DRDY }
[23343.881279] ata2.00: failed command: READ FPDMA QUEUED
[23343.881738] ata2.00: cmd 60/00:98:10:c6:2e/01:00:6e:00:00/40 tag 19 ncq dma 131072 in
[23343.881738]          res 40/00:08:10:c4:2e/00:00:6e:00:00/40 Emask 0x2 (HSM violation)
[23343.883162] ata2.00: status: { DRDY }
[23343.883492] ata2.00: failed command: READ FPDMA QUEUED
[23343.883953] ata2.00: cmd 60/00:a8:10:c5:2e/01:00:6e:00:00/40 tag 21 ncq dma 131072 in
[23343.883953]          res 40/00:08:10:c4:2e/00:00:6e:00:00/40 Emask 0x2 (HSM violation)
[23343.885321] ata2.00: status: { DRDY }
[23343.885654] ata2: hard resetting link
[23344.361826] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[23344.363595] ata2.00: configured for UDMA/133
[23344.364067] sd 1:0:0:0: [sdb] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[23344.364869] sd 1:0:0:0: [sdb] tag#1 Sense Key : 0x5 [current]
[23344.365384] sd 1:0:0:0: [sdb] tag#1 ASC=0x21 ASCQ=0x4
[23344.365882] sd 1:0:0:0: [sdb] tag#1 CDB: opcode=0x88 88 00 00 00 00 00 6e 2e c4 10 00 00 01 00 00 00
[23344.366687] blk_update_request: I/O error, dev sdb, sector 1848558608 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[23344.367660] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=946445230080 size=131072 flags=1808b0
[23344.368660] sd 1:0:0:0: [sdb] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[23344.369472] sd 1:0:0:0: [sdb] tag#19 Sense Key : 0x5 [current]
[23344.370040] sd 1:0:0:0: [sdb] tag#19 ASC=0x21 ASCQ=0x4
[23344.370510] sd 1:0:0:0: [sdb] tag#19 CDB: opcode=0x88 88 00 00 00 00 00 6e 2e c6 10 00 00 01 00 00 00
[23344.371320] blk_update_request: I/O error, dev sdb, sector 1848559120 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[23344.372281] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=946445492224 size=131072 flags=1808b0
[23344.373266] sd 1:0:0:0: [sdb] tag#21 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[23344.374113] sd 1:0:0:0: [sdb] tag#21 Sense Key : 0x5 [current]
[23344.374639] sd 1:0:0:0: [sdb] tag#21 ASC=0x21 ASCQ=0x4
[23344.375108] sd 1:0:0:0: [sdb] tag#21 CDB: opcode=0x88 88 00 00 00 00 00 6e 2e c5 10 00 00 01 00 00 00
[23344.375920] blk_update_request: I/O error, dev sdb, sector 1848558864 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[23344.376877] zio pool=data vdev=/dev/mapper/sdb-crypt error=5 type=1 offset=946445361152 size=131072 flags=1808b0
[23344.377899] ata2: EH complete
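For anyone hitting the same thing, here is a minimal sketch of the clear-and-reverify cycle described above, plus a SMART check to help rule out the disks themselves (the pool name `data` comes from the output above; swap in your own pool and device names):

# Reset the error counters, then force a full re-read of the pool
zpool clear data
zpool scrub data
zpool status -v data    # repeat until the scrub finishes; state should return to ONLINE

# Watch for new ATA bus errors while the scrub runs
dmesg --follow | grep -Ei 'ata[0-9]|blk_update_request'

# Interface CRC errors on the drive side would point at cabling/backplane rather than the media
smartctl -A /dev/sda | grep -i crc

Note that `smartctl` is part of smartmontools and may need to be installed first. Since the scrub repaired 4.87M with 0 unrecoverable errors, the mirror redundancy covered the bad reads, so the data itself should still be intact.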
gprovost Posted March 24, 2021

@grek Can you change the IO scheduler to none, as recommended here, and see if that makes any difference?
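For reference, a minimal sketch of what that change looks like on a running system (device names match grek's logs; the udev rule file name is just an example):

cat /sys/block/sda/queue/scheduler     # current scheduler is shown in brackets, e.g. [mq-deadline]
echo none | sudo tee /sys/block/sda/queue/scheduler
echo none | sudo tee /sys/block/sdb/queue/scheduler

# To make it persist across reboots, one option is a udev rule, e.g. in
# /etc/udev/rules.d/60-io-scheduler.rules:
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"

This only changes the block-layer queueing policy, not the SATA link itself; whether it affects these NCQ bus errors is exactly what the test would show.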