SATA issue, drive resets: ataX.00: failed command: READ FPDMA QUEUED


ShadowDance


I have some new things to report.

 

I finally got around to replacing all the SATA cables with third-party cables and using the power-only harness that the team at Kobol sent me. So I re-enabled 6 Gbps, but to my disappointment I ran into failed command: READ FPDMA QUEUED errors again. Because of this I tried once more to disable NCQ (extraargs=libata.force=noncq) and, lo and behold, I didn't run into disk issues for the 1+ week I was testing it (kernel 5.9.14). This made me hopeful that maybe, just maybe, my previous test with NCQ disabled was bad, so I re-installed the new standard harness I received from Kobol, but unfortunately the issues started again immediately.

 

Note that this is with the I/O scheduler set to none.

 

ata4.00: failed command: READ DMA EXT


[  173.700955] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x300100 action 0x6
[  173.701578] ata4.00: irq_stat 0x08000000
[  173.701928] ata4: SError: { UnrecovData Dispar BadCRC }
[  173.702391] ata4.00: failed command: READ DMA EXT
[  173.702814] ata4.00: cmd 25/00:38:d8:66:ba/00:00:f9:01:00/e0 tag 22 dma 28672 in
                        res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
[  173.704154] ata4.00: status: { DRDY }
[  173.704491] ata4: hard resetting link
[  174.184891] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  174.186685] ata4.00: configured for UDMA/133
[  174.187111] sd 3:0:0:0: [sdd] tag#22 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[  174.187920] sd 3:0:0:0: [sdd] tag#22 Sense Key : 0x5 [current]
[  174.188440] sd 3:0:0:0: [sdd] tag#22 ASC=0x21 ASCQ=0x4
[  174.188922] sd 3:0:0:0: [sdd] tag#22 CDB: opcode=0x88 88 00 00 00 00 01 f9 ba 66 d8 00 00 00 38 00 00
[  174.189732] blk_update_request: I/O error, dev sdd, sector 8484710104 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0
[  174.190695] zio pool=rpool vdev=/dev/disk/by-id/dm-uuid-CRYPT-LUKS2-356ca7bf038a4fda94bf84e9abfabb4c-luks-ata-WDC_WD60EFRX-68L0BN1_WD-WX51D583KTSE-part4 error=5 type=1 offset=4343080005632 size=28672 flags=180880
[  174.192407] ata4: EH complete
[  177.185563] kmem.limit_in_bytes is deprecated and will be removed. Please report your usecase to linux-mm@kvack.org if you depend on this functionality.
[  177.187069] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
[  178.311447] eth0: renamed from veth863bc6b
[  178.349209] eth0: renamed from vethb8208e6
[  178.370851] IPv6: ADDRCONF(NETDEV_CHANGE): veth6d9db95: link becomes ready
[  178.371722] br-429aece1ddc2: port 1(veth6d9db95) entered blocking state
[  178.372324] br-429aece1ddc2: port 1(veth6d9db95) entered forwarding state
[  178.373185] IPv6: ADDRCONF(NETDEV_CHANGE): veth8810431: link becomes ready
[  178.373987] br-429aece1ddc2: port 2(veth8810431) entered blocking state
[  178.374576] br-429aece1ddc2: port 2(veth8810431) entered forwarding state
[  178.375585] IPv6: ADDRCONF(NETDEV_CHANGE): br-429aece1ddc2: link becomes ready
[  301.509646] perf: interrupt took too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[  331.652485] perf: interrupt took too long (3168 > 3138), lowering kernel.perf_event_max_sample_rate to 63000
[  350.695350] perf: interrupt took too long (3986 > 3960), lowering kernel.perf_event_max_sample_rate to 50000
[  352.017716] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x300100 action 0x6
[  352.018339] ata4.00: irq_stat 0x08000000
[  352.018688] ata4: SError: { UnrecovData Dispar BadCRC }
[  352.019149] ata4.00: failed command: READ DMA EXT
[  352.019571] ata4.00: cmd 25/00:b0:e8:1a:6c/00:07:75:02:00/e0 tag 18 dma 1007616 in
                        res 50/00:00:88:16:6c/00:00:75:02:00/e0 Emask 0x10 (ATA bus error)
[  352.020923] ata4.00: status: { DRDY }
[  352.021255] ata4: hard resetting link
[  352.497484] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  352.499169] ata4.00: configured for UDMA/133
[  352.499721] sd 3:0:0:0: [sdd] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[  352.500539] sd 3:0:0:0: [sdd] tag#18 Sense Key : 0x5 [current]
[  352.501057] sd 3:0:0:0: [sdd] tag#18 ASC=0x21 ASCQ=0x4
[  352.501548] sd 3:0:0:0: [sdd] tag#18 CDB: opcode=0x88 88 00 00 00 00 02 75 6c 1a e8 00 00 07 b0 00 00
[  352.502357] blk_update_request: I/O error, dev sdd, sector 10559953640 op 0x0:(READ) flags 0x700 phys_seg 98 prio class 0
[  352.503327] zio pool=rpool vdev=/dev/disk/by-id/dm-uuid-CRYPT-LUKS2-356ca7bf038a4fda94bf84e9abfabb4c-luks-ata-WDC_WD60EFRX-68L0BN1_WD-WX51D583KTSE-part4 error=5 type=1 offset=5405604696064 size=1007616 flags=40080cb0
[  352.505047] ata4: EH complete
[  365.833358] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x300100 action 0x6
[  365.834109] ata4.00: irq_stat 0x08000000
[  365.834454] ata4: SError: { UnrecovData Dispar BadCRC }
[  365.834913] ata4.00: failed command: READ DMA EXT
[  365.835329] ata4.00: cmd 25/00:b0:b8:07:23/00:01:76:02:00/e0 tag 2 dma 221184 in
                        res 50/00:00:b8:07:23/00:00:76:02:00/e0 Emask 0x10 (ATA bus error)
[  365.836795] ata4.00: status: { DRDY }
[  365.837123] ata4: hard resetting link
[  366.317315] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  366.319003] ata4.00: configured for UDMA/133
[  366.319433] sd 3:0:0:0: [sdd] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[  366.320230] sd 3:0:0:0: [sdd] tag#2 Sense Key : 0x5 [current]
[  366.320741] sd 3:0:0:0: [sdd] tag#2 ASC=0x21 ASCQ=0x4
[  366.321193] sd 3:0:0:0: [sdd] tag#2 CDB: opcode=0x88 88 00 00 00 00 02 76 23 07 b8 00 00 01 b0 00 00
[  366.322009] blk_update_request: I/O error, dev sdd, sector 10571941816 op 0x0:(READ) flags 0x700 phys_seg 23 prio class 0
[  366.322978] zio pool=rpool vdev=/dev/disk/by-id/dm-uuid-CRYPT-LUKS2-356ca7bf038a4fda94bf84e9abfabb4c-luks-ata-WDC_WD60EFRX-68L0BN1_WD-WX51D583KTSE-part4 error=5 type=1 offset=5411742642176 size=221184 flags=40080cb0
[  366.324676] ata4: EH complete
[  369.905366] ata4: limiting SATA link speed to 3.0 Gbps
[  369.905887] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x300100 action 0x6
[  369.906499] ata4.00: irq_stat 0x08000000
[  369.906849] ata4: SError: { UnrecovData Dispar BadCRC }
[  369.907312] ata4.00: failed command: READ DMA EXT
[  369.907737] ata4.00: cmd 25/00:c0:e0:ae:54/00:07:89:02:00/e0 tag 20 dma 1015808 in
                        res 50/00:00:20:a7:54/00:00:89:02:00/e0 Emask 0x10 (ATA bus error)
[  369.909093] ata4.00: status: { DRDY }
[  369.909507] ata4: hard resetting link
[  370.385268] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  370.386955] ata4.00: configured for UDMA/133
[  370.387493] sd 3:0:0:0: [sdd] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[  370.388298] sd 3:0:0:0: [sdd] tag#20 Sense Key : 0x5 [current]
[  370.388816] sd 3:0:0:0: [sdd] tag#20 ASC=0x21 ASCQ=0x4
[  370.389294] sd 3:0:0:0: [sdd] tag#20 CDB: opcode=0x88 88 00 00 00 00 02 89 54 ae e0 00 00 07 c0 00 00
[  370.390098] blk_update_request: I/O error, dev sdd, sector 10893962976 op 0x0:(READ) flags 0x700 phys_seg 88 prio class 0
[  370.391066] zio pool=rpool vdev=/dev/disk/by-id/dm-uuid-CRYPT-LUKS2-356ca7bf038a4fda94bf84e9abfabb4c-luks-ata-WDC_WD60EFRX-68L0BN1_WD-WX51D583KTSE-part4 error=5 type=1 offset=5576617476096 size=1015808 flags=40080cb0
[  370.392770] ata4: EH complete
[  379.682552] perf: interrupt took too long (4987 > 4982), lowering kernel.perf_event_max_sample_rate to 40000
[  411.315139] perf: interrupt took too long (6282 > 6233), lowering kernel.perf_event_max_sample_rate to 31750
[  450.954583] perf: interrupt took too long (7858 > 7852), lowering kernel.perf_event_max_sample_rate to 25250
[  505.281348] perf: interrupt took too long (9867 > 9822), lowering kernel.perf_event_max_sample_rate to 20250
[  681.003734] perf: interrupt took too long (12378 > 12333), lowering kernel.perf_event_max_sample_rate to 16000

 

 

@gprovost are you any closer to figuring out what could be wrong here, or do you have a potential fix in the pipeline?

 

@grek I would recommend adding extraargs=libata.force=noncq to /boot/armbianEnv.txt and seeing if it helps. It might not completely fix the issue, but it could make things more stable.
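
In case it's useful, a rough sketch of that edit (this assumes there's no extraargs line in your /boot/armbianEnv.txt yet; if there is one, append to it instead of adding a second line):

# Add the kernel argument, then reboot for it to take effect:
echo 'extraargs=libata.force=noncq' >> /boot/armbianEnv.txt
reboot
# After the reboot, a queue depth of 1 means NCQ is off:
cat /sys/block/sdd/device/queue_depth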


I have the same issue with three 6TB WD Red Plus drives. There is no problem until a ZFS scrub runs. Maybe it is not a SATA bus problem but a disk problem. All the reports here seem to involve WD Red disks, and you can find similar reports for these disks with ZFS on other hardware.

I will test my three disks in another machine this weekend to check whether it is a Helios problem or a disk problem.

 

Model Family:     Western Digital Red
Device Model:     WDC WD60EFRX-68L0BN1
Firmware Version: 82.00A82

 

[363042.880705] ata2.00: exception Emask 0x2 SAct 0x100060 SErr 0x400 action 0x6
[363042.880715] ata2.00: irq_stat 0x08000000
[363042.880722] ata2: SError: { Proto }
[363042.880730] ata2.00: failed command: READ FPDMA QUEUED
[363042.880741] ata2.00: cmd 60/80:28:20:14:60/00:00:0b:00:00/40 tag 5 ncq dma 65536 in
                         res 40/00:28:20:14:60/00:00:0b:00:00/40 Emask 0x2 (HSM violation)
[363042.880746] ata2.00: status: { DRDY }
[363042.880751] ata2.00: failed command: READ FPDMA QUEUED
[363042.880761] ata2.00: cmd 60/00:30:a0:14:60/01:00:0b:00:00/40 tag 6 ncq dma 131072 in
                         res 40/00:28:20:14:60/00:00:0b:00:00/40 Emask 0x2 (HSM violation)
[363042.880765] ata2.00: status: { DRDY }
[363042.880770] ata2.00: failed command: READ FPDMA QUEUED
[363042.880779] ata2.00: cmd 60/80:a0:a0:15:60/00:00:0b:00:00/40 tag 20 ncq dma 65536 in
                         res 40/00:28:20:14:60/00:00:0b:00:00/40 Emask 0x2 (HSM violation)
[363042.880783] ata2.00: status: { DRDY }
[363042.880793] ata2: hard resetting link
[363043.356670] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[363043.357984] ata2.00: configured for UDMA/133
[363043.358054] sd 1:0:0:0: [sdb] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[363043.358062] sd 1:0:0:0: [sdb] tag#5 Sense Key : 0x5 [current] 
[363043.358068] sd 1:0:0:0: [sdb] tag#5 ASC=0x21 ASCQ=0x4 
[363043.358076] sd 1:0:0:0: [sdb] tag#5 CDB: opcode=0x88 88 00 00 00 00 00 0b 60 14 20 00 00 00 80 00 00
[363043.358084] blk_update_request: I/O error, dev sdb, sector 190845984 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0
[363043.358098] zio pool=datastore vdev=/dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX72D507JSP5-part1 error=5 type=1 offset=97712095232 size=65536 flags=1808b0
[363043.358145] sd 1:0:0:0: [sdb] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[363043.358152] sd 1:0:0:0: [sdb] tag#6 Sense Key : 0x5 [current] 
[363043.358157] sd 1:0:0:0: [sdb] tag#6 ASC=0x21 ASCQ=0x4 
[363043.358163] sd 1:0:0:0: [sdb] tag#6 CDB: opcode=0x88 88 00 00 00 00 00 0b 60 14 a0 00 00 01 00 00 00
[363043.358170] blk_update_request: I/O error, dev sdb, sector 190846112 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[363043.358178] zio pool=datastore vdev=/dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX72D507JSP5-part1 error=5 type=1 offset=97712160768 size=131072 flags=40080cb0
[363043.358214] sd 1:0:0:0: [sdb] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[363043.358220] sd 1:0:0:0: [sdb] tag#20 Sense Key : 0x5 [current] 
[363043.358226] sd 1:0:0:0: [sdb] tag#20 ASC=0x21 ASCQ=0x4 
[363043.358232] sd 1:0:0:0: [sdb] tag#20 CDB: opcode=0x88 88 00 00 00 00 00 0b 60 15 a0 00 00 00 80 00 00
[363043.358239] blk_update_request: I/O error, dev sdb, sector 190846368 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0
[363043.358248] zio pool=datastore vdev=/dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX72D507JSP5-part1 error=5 type=1 offset=97712291840 size=65536 flags=1808b0
[363043.358275] ata2: EH complete

 

3 hours ago, Wofferl said:

All the reports here seem to involve WD Red disks, and you can find similar reports for these disks with ZFS on other hardware.

Taking a look around, this definitely seems to be the case:

https://www.truenas.com/community/threads/warning-wd-red-wd60efrx-68l0bn1.58676/

 

Additionally, people see a lot more errors with the *EFAX drives than with the *EFRX, since the *EFAX drives are SMR and don't hold up.


Thanks @ShadowDance for the tip, my scrub test completed without errors.

 

Quote

root@helios64:~# zpool status -v data
  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 07:47:07 with 0 errors on Thu Mar 25 03:55:37 2021
config:

    NAME           STATE     READ WRITE CKSUM
    data           ONLINE       0     0     0
      mirror-0     ONLINE       0     0     0
        sda-crypt  ONLINE       0     0     0
        sdb-crypt  ONLINE       0     0     0
    cache
      sdc2         ONLINE       0     0     0
 

 

Quote

root@helios64:~# cat /boot/armbianEnv.txt 
verbosity=7
console=serial
extraargs=earlyprintk ignore_loglevel libata.force=noncq
 

 

Quote

root@helios64:~# cat /etc/default/cpufrequtils 
ENABLE="true"
GOVERNOR=performance
MAX_SPEED=1416000
MIN_SPEED=1416000
 

 

 

Quote

root@helios64:~# cat /sys/block/sda/queue/scheduler 
mq-deadline kyber [bfq] none
root@helios64:~# cat /sys/block/sdb/queue/scheduler 
mq-deadline kyber [bfq] none
root@helios64:~# 
 

 

 

Quote

root@helios64:~# hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
    Model Number:       WDC WD40EFRX-68N32N0                    
    Firmware Revision:  82.00A82
 

root@helios64:~# hdparm -I /dev/sdb

/dev/sdb:

ATA device, with non-removable media
    Model Number:       WDC WD40EFRX-68N32N0                    
    Firmware Revision:  82.00A82
 

 


@Wofferl those are the exact same model as three of my disks (but mine aren't "Plus"). I've used these disks in another machine with ZFS and had zero issues (ASM1062 SATA controller). So if we assume the problem lies between the SATA controller and the disk, then while I agree with you that it's probably in part a disk issue, I'm convinced it's something that should be fixable in the SATA controller firmware. Perhaps these disks do something funny that the SATA controller doesn't expect? And based on all my testing so far, the SATA cable also plays a role, meaning perhaps there's a noise factor in play (as well).

 

Side note: Western Digital really screwed us over with this whole SMR fiasco, didn't they? I'd be pretty much ready to throw these disks in the trash if it weren't for the fact that they worked perfectly on another SATA controller.

 

@grek glad it helped! By the way, I would still recommend changing the I/O scheduler to none, because bfq is CPU intensive and ZFS does its own scheduling. It probably won't fix the issues, but it might reduce some CPU overhead.
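
A quick sketch of how that can be done (the device names are assumptions, adjust for your drives):

# Takes effect immediately, but does not persist across reboots:
echo none > /sys/block/sda/queue/scheduler
echo none > /sys/block/sdb/queue/scheduler
# To make it persistent, a udev rule along these lines should work,
# e.g. in /etc/udev/rules.d/60-ioschedulers.rules:
# ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"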


We will soon provide instructions on how to update the SATA controller (JMB585) firmware to the latest version. Let's see whether it has a positive impact.

 

14 hours ago, ShadowDance said:

perhaps there's a noise factor in play (as well).

For the new revision we will reinforce the noise filtering on the SATA controller, in case this is part of the root cause.

4 hours ago, gprovost said:

We will soon provide instructions on how to update the SATA controller (JMB585) firmware to the latest version. Let's see whether it has a positive impact.

 

That's great news, looking forward to it!

 

4 hours ago, gprovost said:

For the new revision we will reinforce the noise filtering on the SATA controller, in case this is part of the root cause.

 

Also nice! If you do figure out a trick we can implement ourselves (i.e. via soldering or, dare I say, foil wrapping) to reduce the noise there, let us know.

 

Speaking of noise, I have two new thoughts:

  • My Helios64 PSU is not connected to a grounded wall outlet. Unfortunately, here in my apartment there are only two such outlets, in very inconvenient places, but I'll see if it's feasible to test it on one of them. Perhaps this introduces noise into the system? Perhaps other devices are leaking noise via the grounded extension cord (I recently learned that using a grounded extension cord on a non-grounded outlet is not ideal; I hadn't even thought to consider it before.)
  • I read somewhere that the metal frame of the HDD case should be connected to device ground, but since we're using plastic mounting brackets, I doubt this is the case? To be honest, I don't know whether this is a real issue; it's just something I read on a forum.

Guys, those drives are CMR (the ones I could pick out from the posted logs), not SMR. EFRX is CMR, EFAX is SMR.

Also, SMR is the one to avoid at all costs.

 

From https://www.servethehome.com/buyers-guides/top-hardware-components-freenas-nas-servers/top-picks-freenas-hard-drives/, ServeTheHome runs a lot of ZFS pools in their infra.

Quote

As a quick note. We do not recommend the new WD Red SMR drives. See WD Red SMR vs CMR Tested Avoid Red SMR.

 

Please refer to manufacturer spec sheets and tables before buying an HDD. Here are a few lists:

https://nascompares.com/answer/list-of-wd-cmr-and-smr-hard-drives-hdd/

https://blog.westerndigital.com/wd-red-nas-drives/

https://www.seagate.com/internal-hard-drives/cmr-smr-list/

 

As far as I know, all Hitachi drives are CMR.

6 minutes ago, clostro said:

Guys, those drives are CMR (the ones I could pick out from the posted logs), not SMR. EFRX is CMR, EFAX is SMR.

Also, SMR is the one to avoid at all costs.

 

Thanks for sharing, and it seems you are right. I've read some confusing information back and forth and, to be honest, was never really sure whether mine were SMR or CMR. It doesn't help that WD introduced the Plus series and then claims all plain Red (1-6TB) drives are SMR. Good to know mine are CMR; that makes sense, since I haven't noticed performance issues with them.

On 3/26/2021 at 5:05 PM, ShadowDance said:

My Helios64 PSU is not connected to a grounded wall outlet. Unfortunately, here in my apartment there are only two such outlets, in very inconvenient places, but I'll see if it's feasible to test it on one of them. Perhaps this introduces noise into the system? Perhaps other devices are leaking noise via the grounded extension cord (I recently learned that using a grounded extension cord on a non-grounded outlet is not ideal; I hadn't even thought to consider it before...)

 

That is a valid point you are raising, and we have actually asked some customers to pay attention to this aspect. FYI, we decided to have the DC ground connected to AC earth when we ordered the AC/DC adapter, which should be the safe and standard approach, but we know many manufacturers don't bother doing it. However, if you plug your device into a power extension along with other devices, you might be impacted by the noise of other poorly isolated devices. You could try to isolate the Helios64 by using, for example, a travel adapter that does not use the earth pin.

 

On 3/26/2021 at 5:05 PM, ShadowDance said:

I read somewhere that the metal frame of the HDD case should be connected to device ground, but since we're using plastic mounting brackets, I doubt this is the case? To be honest, I don't know whether this is a real issue; it's just something I read on a forum.

Hmmm, my expectation would be that the HDD enclosure is connected to the ground of the HDD controller PCB, which in turn is connected via the power cable to the ground of the motherboard, which in turn is connected to the ground/earth of the AC/DC PSU.

 


Finally got around to testing my Helios64 on the grounded wall outlet (alone); unfortunately, there was no change. The disks behaved the same as before, and I also had those kernel panics during boot that I seem to get on ~3 out of 5 bootups.

 

On 3/29/2021 at 5:58 AM, gprovost said:

Hmmm, my expectation would be that the HDD enclosure is connected to the ground of the HDD controller PCB, which in turn is connected via the power cable to the ground of the motherboard, which in turn is connected to the ground/earth of the AC/DC PSU.

 

Yes, that sounds very reasonable, and it's my expectation too. I didn't put too much weight on what I read, but over there one user had issues with his disks until he grounded the HDD cases. It sounded strange, but at this point I'm open to pretty much any strange solution, haha. It could also just have been poorly designed/manufactured HDDs.


To add another data point, I have five of the following drives in a RAID6 configuration with an ext4 filesystem:

Model Family:     Western Digital Caviar Black
Device Model:     WDC WD2002FAEX-007BA0
Serial Number:    WD-WMAY02794530
LU WWN Device Id: 5 0014ee 002d98d8f
Firmware Version: 05.01D05
User Capacity:    2,000,398,934,016 bytes [2.00 TB]

I moved the drives into the Helios64 from another server, where they had all been running without issues.

 

As far as I know, I have not yet experienced a failure that caused data loss, or even a temporary loss of access to a disk that would require a reboot to fix. I noticed errors similar to the ones below during the initial RAID sync. As others have noted in this thread, this indicates data corruption during the transfer, but in my case the retries seem to succeed. However, after enough failures the device is limited to 3.0 Gbps, and afterwards I don't see any more errors on that device.

Apr  2 23:00:17 localhost kernel: [ 1070.137017] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x380100 action 0x6
Apr  2 23:00:17 localhost kernel: [ 1070.137032] ata3.00: irq_stat 0x08000000
Apr  2 23:00:17 localhost kernel: [ 1070.137040] ata3: SError: { UnrecovData 10B8B Dispar BadCRC }
Apr  2 23:00:17 localhost kernel: [ 1070.137049] ata3.00: failed command: READ DMA EXT
Apr  2 23:00:17 localhost kernel: [ 1070.137061] ata3.00: cmd 25/00:00:00:80:2f/00:04:00:00:00/e0 tag 20 dma 524288 in
Apr  2 23:00:17 localhost kernel: [ 1070.137061]          res 50/00:00:ff:87:2f/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Apr  2 23:00:17 localhost kernel: [ 1070.137066] ata3.00: status: { DRDY }
Apr  2 23:00:17 localhost kernel: [ 1070.137079] ata3: hard resetting link
Apr  2 23:00:18 localhost kernel: [ 1070.612946] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr  2 23:00:18 localhost kernel: [ 1070.622339] ata3.00: configured for UDMA/133
Apr  2 23:00:18 localhost kernel: [ 1070.622453] ata3: EH complete
Apr  2 23:00:55 localhost kernel: [ 1108.163913] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x380100 action 0x6
Apr  2 23:00:55 localhost kernel: [ 1108.163927] ata3.00: irq_stat 0x08000000
Apr  2 23:00:55 localhost kernel: [ 1108.163935] ata3: SError: { UnrecovData 10B8B Dispar BadCRC }
Apr  2 23:00:55 localhost kernel: [ 1108.163944] ata3.00: failed command: READ DMA EXT
Apr  2 23:00:55 localhost kernel: [ 1108.163957] ata3.00: cmd 25/00:00:00:cc:2e/00:04:00:00:00/e0 tag 31 dma 524288 in
Apr  2 23:00:55 localhost kernel: [ 1108.163957]          res 50/00:00:4f:46:2f/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Apr  2 23:00:55 localhost kernel: [ 1108.163962] ata3.00: status: { DRDY }
Apr  2 23:00:55 localhost kernel: [ 1108.163975] ata3: hard resetting link
Apr  2 23:00:56 localhost kernel: [ 1108.639872] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr  2 23:00:56 localhost kernel: [ 1108.649317] ata3.00: configured for UDMA/133
Apr  2 23:00:56 localhost kernel: [ 1108.649434] ata3: EH complete
Apr  2 23:01:27 localhost kernel: [ 1139.690980] ata3: limiting SATA link speed to 3.0 Gbps
Apr  2 23:01:27 localhost kernel: [ 1139.690997] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x980000 action 0x6 frozen
Apr  2 23:01:27 localhost kernel: [ 1139.691004] ata3: SError: { 10B8B Dispar LinkSeq }
Apr  2 23:01:27 localhost kernel: [ 1139.691014] ata3.00: failed command: WRITE DMA EXT
Apr  2 23:01:27 localhost kernel: [ 1139.691026] ata3.00: cmd 35/00:40:80:b2:2f/00:05:00:00:00/e0 tag 16 dma 688128 out
Apr  2 23:01:27 localhost kernel: [ 1139.691026]          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr  2 23:01:27 localhost kernel: [ 1139.691032] ata3.00: status: { DRDY }
Apr  2 23:01:27 localhost kernel: [ 1139.691045] ata3: hard resetting link
Apr  2 23:01:27 localhost kernel: [ 1140.166952] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Apr  2 23:01:27 localhost kernel: [ 1140.177164] ata3.00: configured for UDMA/133
Apr  2 23:01:27 localhost kernel: [ 1140.177269] ata3: EH complete

 

I'm able to reproduce similar errors on one of the drives within a few minutes by running stress (with the current directory on the RAID6 filesystem):

stress --hdd 4 &

 

Within a few hours, I will see many similar errors on all 5 drives, and all will be reset to 3.0 Gbps.
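
For anyone who wants to try reproducing this, a rough sketch of the procedure (the mount point is an assumption, adjust to your setup):

# Generate heavy disk I/O on the array...
cd /mnt/raid6
stress --hdd 4 &
# ...and watch the kernel log for ATA exceptions and link resets:
dmesg --follow | grep -E 'ata[0-9]'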

 

I think it's reasonable to conclude that these errors occur in transit on the cable, because the drives' SMART stats do not seem to show a corresponding increase when the kernel log is full of BadCRC errors. I was checking UDMA_CRC_Error_Count specifically, since it seemed most relevant to the errors above, but I'm not very familiar with SMART, so there may be something else I should check.
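
This is roughly how I check it (a sketch assuming smartmontools is installed; the device names will differ on your system):

# Snapshot SMART attribute 199 (UDMA_CRC_Error_Count) for each drive before
# and after a stress run, to see whether the raw value tracks the kernel's
# BadCRC messages:
for d in /dev/sd[abcde]; do
  printf '%s ' "$d"
  smartctl -A "$d" | awk '/UDMA_CRC_Error_Count/ {print $NF}'
done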

 

I tried some modifications to /boot/armbianEnv.txt to see how they would affect the problem:

1. extraargs=libata.force=noncq had no impact; I saw similar errors on multiple disks within 30 minutes

2. extraargs=libata.force=noncq,3.0 saw no errors after 2 hours of testing

3. extraargs=libata.force=3.0 saw no errors after 2 hours of testing

 

I'd prefer to err on the side of reliability, so I'm going to stick with extraargs=libata.force=3.0 for now.
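
Concretely, that's just one line in /boot/armbianEnv.txt (a sketch; merge it into any existing extraargs line rather than adding a second one):

# /boot/armbianEnv.txt -- cap all SATA links at 3.0 Gbps
extraargs=libata.force=3.0

After a reboot, dmesg | grep 'SATA link up' should show the links coming up at 3.0 Gbps.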

 

I wonder how many people are seeing behavior similar to mine who simply haven't noticed these errors in their kernel logs?

 

I had similar errors with a Banana Pi board a few years ago, when I used an internal SATA cable to connect to an external drive enclosure. In that case, switching to an eSATA cable, which had the appropriate shielding, resolved the issue.

 

I could try replacing the SATA cables to see if it reduces the error rate. Does anyone have recommendations for a specific set of replacement cables that has worked well?


@Fred Fettinger Your errors are most likely due to an unstable HDD harness. You can contact us at support@kobol.io to see whether a replacement of the HDD wire harness is needed.

 

@ShadowDance Thanks for the feedback; at least we can remove the grounding issue from the list of possible root causes.

It looks like, under the very heavy HDD I/O load that arises mainly during scrubbing operations, too much noise is generated, resulting in those HSM violations.

Thanks to all your tests, the issue unfortunately seems to point to a noise filtering problem. As previously mentioned, we will fix that in the next revision.

We will see if the new SATA controller firmware has any impact, but we doubt it. I think the only solution for the current revision is to limit the SATA speed to 3 Gbps when using btrfs or ZFS.


Hi all, you can download the SATA firmware update from https://cdn-kobol-io.s3-ap-southeast-1.amazonaws.com/Helios64/SATA_FW/Helios64_SATA_FW_update_00021_200624.img.xz

 

Instructions:

1. Download the SD card image

2. Flash the image onto a microSD card

3. Insert the microSD card into Helios64 and power on. Helios64 should automatically boot from the microSD card. If Helios64 still boots from eMMC, disable the eMMC.

4. Wait a while; the system will reboot and then power off if the firmware flashing succeeded.

   If it failed, both the System Status and System Fault LEDs will blink.

5. Remove the microSD card and boot Helios64 normally. See if there is any improvement.

 

Our officially supported stock firmware can be downloaded from https://cdn-kobol-io.s3-ap-southeast-1.amazonaws.com/Helios64/SATA_FW/Helios64_SATA_FW_factory_00020_190702.img.xz. If there is no improvement with the newer firmware, please revert to this stock firmware.

 

SHA256SUM:

e5dfbe84f4709a3e2138ffb620f0ee62ecbcc79a8f83692c1c1d7a4361f0d30f *Helios64_SATA_FW_factory_00020_190702.img.xz
0d78fec569dd699fd667acf59ba7b07c420a2865e1bcb8b85b26b61d404998c5 *Helios64_SATA_FW_update_00021_200624.img.xz
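
For steps 1-2, something along these lines should work (a sketch; /dev/sdX is a placeholder for your microSD card, double-check with lsblk before writing):

# Download, verify the checksum against the list above, and flash:
wget https://cdn-kobol-io.s3-ap-southeast-1.amazonaws.com/Helios64/SATA_FW/Helios64_SATA_FW_update_00021_200624.img.xz
sha256sum Helios64_SATA_FW_update_00021_200624.img.xz
xzcat Helios64_SATA_FW_update_00021_200624.img.xz | sudo dd of=/dev/sdX bs=4M conv=fsync status=progress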

 
