SATA issue, drive resets: ataX.00: failed command: READ FPDMA QUEUED


ShadowDance


I have some new things to report.

 

I finally got around to replacing all SATA cables with 3rd-party cables and using the power-only harness that the team at Kobol sent me. So I went ahead and re-enabled 6 Gbps, but to my disappointment I ran into failed command: READ FPDMA QUEUED errors again. Because of this, I tried once again to disable NCQ (extraargs=libata.force=noncq) and, lo and behold, I didn't run into disk issues for the 1+ week I was testing it (kernel 5.9.14). This made me hopeful that maybe, just maybe, my previous test with NCQ disabled was flawed, so I re-installed the new standard harness I received from Kobol, but unfortunately I immediately started having issues again.

 

Note that this is with the io scheduler set to none.

 

ata4.00: failed command: READ DMA EXT

Spoiler

[  173.700955] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x300100 action 0x6
[  173.701578] ata4.00: irq_stat 0x08000000
[  173.701928] ata4: SError: { UnrecovData Dispar BadCRC }
[  173.702391] ata4.00: failed command: READ DMA EXT
[  173.702814] ata4.00: cmd 25/00:38:d8:66:ba/00:00:f9:01:00/e0 tag 22 dma 28672 in
                        res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
[  173.704154] ata4.00: status: { DRDY }
[  173.704491] ata4: hard resetting link
[  174.184891] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  174.186685] ata4.00: configured for UDMA/133
[  174.187111] sd 3:0:0:0: [sdd] tag#22 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[  174.187920] sd 3:0:0:0: [sdd] tag#22 Sense Key : 0x5 [current]
[  174.188440] sd 3:0:0:0: [sdd] tag#22 ASC=0x21 ASCQ=0x4
[  174.188922] sd 3:0:0:0: [sdd] tag#22 CDB: opcode=0x88 88 00 00 00 00 01 f9 ba 66 d8 00 00 00 38 00 00
[  174.189732] blk_update_request: I/O error, dev sdd, sector 8484710104 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0
[  174.190695] zio pool=rpool vdev=/dev/disk/by-id/dm-uuid-CRYPT-LUKS2-356ca7bf038a4fda94bf84e9abfabb4c-luks-ata-WDC_WD60EFRX-68L0BN1_WD-WX51D583KTSE-part4 error=5 type=1 offset=4343080005632 size=28672 flags=180880
[  174.192407] ata4: EH complete
[  177.185563] kmem.limit_in_bytes is deprecated and will be removed. Please report your usecase to linux-mm@kvack.org if you depend on this functionality.
[  177.187069] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
[  178.311447] eth0: renamed from veth863bc6b
[  178.349209] eth0: renamed from vethb8208e6
[  178.370851] IPv6: ADDRCONF(NETDEV_CHANGE): veth6d9db95: link becomes ready
[  178.371722] br-429aece1ddc2: port 1(veth6d9db95) entered blocking state
[  178.372324] br-429aece1ddc2: port 1(veth6d9db95) entered forwarding state
[  178.373185] IPv6: ADDRCONF(NETDEV_CHANGE): veth8810431: link becomes ready
[  178.373987] br-429aece1ddc2: port 2(veth8810431) entered blocking state
[  178.374576] br-429aece1ddc2: port 2(veth8810431) entered forwarding state
[  178.375585] IPv6: ADDRCONF(NETDEV_CHANGE): br-429aece1ddc2: link becomes ready
[  301.509646] perf: interrupt took too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[  331.652485] perf: interrupt took too long (3168 > 3138), lowering kernel.perf_event_max_sample_rate to 63000
[  350.695350] perf: interrupt took too long (3986 > 3960), lowering kernel.perf_event_max_sample_rate to 50000
[  352.017716] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x300100 action 0x6
[  352.018339] ata4.00: irq_stat 0x08000000
[  352.018688] ata4: SError: { UnrecovData Dispar BadCRC }
[  352.019149] ata4.00: failed command: READ DMA EXT
[  352.019571] ata4.00: cmd 25/00:b0:e8:1a:6c/00:07:75:02:00/e0 tag 18 dma 1007616 in
                        res 50/00:00:88:16:6c/00:00:75:02:00/e0 Emask 0x10 (ATA bus error)
[  352.020923] ata4.00: status: { DRDY }
[  352.021255] ata4: hard resetting link
[  352.497484] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  352.499169] ata4.00: configured for UDMA/133
[  352.499721] sd 3:0:0:0: [sdd] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[  352.500539] sd 3:0:0:0: [sdd] tag#18 Sense Key : 0x5 [current]
[  352.501057] sd 3:0:0:0: [sdd] tag#18 ASC=0x21 ASCQ=0x4
[  352.501548] sd 3:0:0:0: [sdd] tag#18 CDB: opcode=0x88 88 00 00 00 00 02 75 6c 1a e8 00 00 07 b0 00 00
[  352.502357] blk_update_request: I/O error, dev sdd, sector 10559953640 op 0x0:(READ) flags 0x700 phys_seg 98 prio class 0
[  352.503327] zio pool=rpool vdev=/dev/disk/by-id/dm-uuid-CRYPT-LUKS2-356ca7bf038a4fda94bf84e9abfabb4c-luks-ata-WDC_WD60EFRX-68L0BN1_WD-WX51D583KTSE-part4 error=5 type=1 offset=5405604696064 size=1007616 flags=40080cb0
[  352.505047] ata4: EH complete
[  365.833358] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x300100 action 0x6
[  365.834109] ata4.00: irq_stat 0x08000000
[  365.834454] ata4: SError: { UnrecovData Dispar BadCRC }
[  365.834913] ata4.00: failed command: READ DMA EXT
[  365.835329] ata4.00: cmd 25/00:b0:b8:07:23/00:01:76:02:00/e0 tag 2 dma 221184 in
                        res 50/00:00:b8:07:23/00:00:76:02:00/e0 Emask 0x10 (ATA bus error)
[  365.836795] ata4.00: status: { DRDY }
[  365.837123] ata4: hard resetting link
[  366.317315] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  366.319003] ata4.00: configured for UDMA/133
[  366.319433] sd 3:0:0:0: [sdd] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[  366.320230] sd 3:0:0:0: [sdd] tag#2 Sense Key : 0x5 [current]
[  366.320741] sd 3:0:0:0: [sdd] tag#2 ASC=0x21 ASCQ=0x4
[  366.321193] sd 3:0:0:0: [sdd] tag#2 CDB: opcode=0x88 88 00 00 00 00 02 76 23 07 b8 00 00 01 b0 00 00
[  366.322009] blk_update_request: I/O error, dev sdd, sector 10571941816 op 0x0:(READ) flags 0x700 phys_seg 23 prio class 0
[  366.322978] zio pool=rpool vdev=/dev/disk/by-id/dm-uuid-CRYPT-LUKS2-356ca7bf038a4fda94bf84e9abfabb4c-luks-ata-WDC_WD60EFRX-68L0BN1_WD-WX51D583KTSE-part4 error=5 type=1 offset=5411742642176 size=221184 flags=40080cb0
[  366.324676] ata4: EH complete
[  369.905366] ata4: limiting SATA link speed to 3.0 Gbps
[  369.905887] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x300100 action 0x6
[  369.906499] ata4.00: irq_stat 0x08000000
[  369.906849] ata4: SError: { UnrecovData Dispar BadCRC }
[  369.907312] ata4.00: failed command: READ DMA EXT
[  369.907737] ata4.00: cmd 25/00:c0:e0:ae:54/00:07:89:02:00/e0 tag 20 dma 1015808 in
                        res 50/00:00:20:a7:54/00:00:89:02:00/e0 Emask 0x10 (ATA bus error)
[  369.909093] ata4.00: status: { DRDY }
[  369.909507] ata4: hard resetting link
[  370.385268] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  370.386955] ata4.00: configured for UDMA/133
[  370.387493] sd 3:0:0:0: [sdd] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[  370.388298] sd 3:0:0:0: [sdd] tag#20 Sense Key : 0x5 [current]
[  370.388816] sd 3:0:0:0: [sdd] tag#20 ASC=0x21 ASCQ=0x4
[  370.389294] sd 3:0:0:0: [sdd] tag#20 CDB: opcode=0x88 88 00 00 00 00 02 89 54 ae e0 00 00 07 c0 00 00
[  370.390098] blk_update_request: I/O error, dev sdd, sector 10893962976 op 0x0:(READ) flags 0x700 phys_seg 88 prio class 0
[  370.391066] zio pool=rpool vdev=/dev/disk/by-id/dm-uuid-CRYPT-LUKS2-356ca7bf038a4fda94bf84e9abfabb4c-luks-ata-WDC_WD60EFRX-68L0BN1_WD-WX51D583KTSE-part4 error=5 type=1 offset=5576617476096 size=1015808 flags=40080cb0
[  370.392770] ata4: EH complete
[  379.682552] perf: interrupt took too long (4987 > 4982), lowering kernel.perf_event_max_sample_rate to 40000
[  411.315139] perf: interrupt took too long (6282 > 6233), lowering kernel.perf_event_max_sample_rate to 31750
[  450.954583] perf: interrupt took too long (7858 > 7852), lowering kernel.perf_event_max_sample_rate to 25250
[  505.281348] perf: interrupt took too long (9867 > 9822), lowering kernel.perf_event_max_sample_rate to 20250
[  681.003734] perf: interrupt took too long (12378 > 12333), lowering kernel.perf_event_max_sample_rate to 16000

 

 

@gprovost are you any closer to figuring out what could be wrong here or have a potential fix in the pipeline?

 

@grek I would recommend adding extraargs=libata.force=noncq to /boot/armbianEnv.txt and seeing if it helps. It might not completely fix the issue, but it could make things more stable.
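
For reference, a minimal sketch of the change (assuming there is no existing extraargs line in /boot/armbianEnv.txt; if there is one, append the option to it instead):

root@helios64:~# echo 'extraargs=libata.force=noncq' >> /boot/armbianEnv.txt
root@helios64:~# reboot
# after the reboot, confirm the kernel picked the option up:
root@helios64:~# grep -o 'libata.force=[^ ]*' /proc/cmdline
libata.force=noncq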


I have the same issue with three 6TB WD Red Plus drives. There is no problem until a zfs scrub runs. Maybe it is not a SATA bus problem but a disk problem. All reports here seem to involve WD Red disks, and you can find similar reports for these disks with ZFS on other hardware.

I will test my three disks in another machine this weekend to check whether it is a Helios64 problem or a disk problem.
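
For reference, a minimal way to reproduce it here (pool name from my setup; I watch the kernel log from a second shell while the scrub runs):

root@helios64:~# zpool scrub datastore
root@helios64:~# dmesg --follow | grep -E 'ata[0-9]'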

 

Model Family:     Western Digital Red
Device Model:     WDC WD60EFRX-68L0BN1
Firmware Version: 82.00A82

 

[363042.880705] ata2.00: exception Emask 0x2 SAct 0x100060 SErr 0x400 action 0x6
[363042.880715] ata2.00: irq_stat 0x08000000
[363042.880722] ata2: SError: { Proto }
[363042.880730] ata2.00: failed command: READ FPDMA QUEUED
[363042.880741] ata2.00: cmd 60/80:28:20:14:60/00:00:0b:00:00/40 tag 5 ncq dma 65536 in
                         res 40/00:28:20:14:60/00:00:0b:00:00/40 Emask 0x2 (HSM violation)
[363042.880746] ata2.00: status: { DRDY }
[363042.880751] ata2.00: failed command: READ FPDMA QUEUED
[363042.880761] ata2.00: cmd 60/00:30:a0:14:60/01:00:0b:00:00/40 tag 6 ncq dma 131072 in
                         res 40/00:28:20:14:60/00:00:0b:00:00/40 Emask 0x2 (HSM violation)
[363042.880765] ata2.00: status: { DRDY }
[363042.880770] ata2.00: failed command: READ FPDMA QUEUED
[363042.880779] ata2.00: cmd 60/80:a0:a0:15:60/00:00:0b:00:00/40 tag 20 ncq dma 65536 in
                         res 40/00:28:20:14:60/00:00:0b:00:00/40 Emask 0x2 (HSM violation)
[363042.880783] ata2.00: status: { DRDY }
[363042.880793] ata2: hard resetting link
[363043.356670] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[363043.357984] ata2.00: configured for UDMA/133
[363043.358054] sd 1:0:0:0: [sdb] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[363043.358062] sd 1:0:0:0: [sdb] tag#5 Sense Key : 0x5 [current] 
[363043.358068] sd 1:0:0:0: [sdb] tag#5 ASC=0x21 ASCQ=0x4 
[363043.358076] sd 1:0:0:0: [sdb] tag#5 CDB: opcode=0x88 88 00 00 00 00 00 0b 60 14 20 00 00 00 80 00 00
[363043.358084] blk_update_request: I/O error, dev sdb, sector 190845984 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0
[363043.358098] zio pool=datastore vdev=/dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX72D507JSP5-part1 error=5 type=1 offset=97712095232 size=65536 flags=1808b0
[363043.358145] sd 1:0:0:0: [sdb] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[363043.358152] sd 1:0:0:0: [sdb] tag#6 Sense Key : 0x5 [current] 
[363043.358157] sd 1:0:0:0: [sdb] tag#6 ASC=0x21 ASCQ=0x4 
[363043.358163] sd 1:0:0:0: [sdb] tag#6 CDB: opcode=0x88 88 00 00 00 00 00 0b 60 14 a0 00 00 01 00 00 00
[363043.358170] blk_update_request: I/O error, dev sdb, sector 190846112 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[363043.358178] zio pool=datastore vdev=/dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX72D507JSP5-part1 error=5 type=1 offset=97712160768 size=131072 flags=40080cb0
[363043.358214] sd 1:0:0:0: [sdb] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[363043.358220] sd 1:0:0:0: [sdb] tag#20 Sense Key : 0x5 [current] 
[363043.358226] sd 1:0:0:0: [sdb] tag#20 ASC=0x21 ASCQ=0x4 
[363043.358232] sd 1:0:0:0: [sdb] tag#20 CDB: opcode=0x88 88 00 00 00 00 00 0b 60 15 a0 00 00 00 80 00 00
[363043.358239] blk_update_request: I/O error, dev sdb, sector 190846368 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0
[363043.358248] zio pool=datastore vdev=/dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX72D507JSP5-part1 error=5 type=1 offset=97712291840 size=65536 flags=1808b0
[363043.358275] ata2: EH complete

 

3 hours ago, Wofferl said:

All reports here seem to involve WD Red disks, and you can find similar reports for these disks with ZFS on other hardware.

Taking a look around, this definitely seems to be the case:

https://www.truenas.com/community/threads/warning-wd-red-wd60efrx-68l0bn1.58676/

 

Additionally, people see a lot more errors with the *EFAX drives than with the *EFRX drives, since the *EFAX drives are SMR and don't hold up.


Thanks @ShadowDance for the tip, my scrub test completed without errors.

 

Quote

root@helios64:~# zpool status -v data
  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 07:47:07 with 0 errors on Thu Mar 25 03:55:37 2021
config:

    NAME           STATE     READ WRITE CKSUM
    data           ONLINE       0     0     0
      mirror-0     ONLINE       0     0     0
        sda-crypt  ONLINE       0     0     0
        sdb-crypt  ONLINE       0     0     0
    cache
      sdc2         ONLINE       0     0     0
 

 

Quote

root@helios64:~# cat /boot/armbianEnv.txt 
verbosity=7
console=serial
extraargs=earlyprintk ignore_loglevel libata.force=noncq
 

 

Quote

root@helios64:~# cat /etc/default/cpufrequtils 
ENABLE="true"
GOVERNOR=performance
MAX_SPEED=1416000
MIN_SPEED=1416000
 

 

 

Quote

root@helios64:~# cat /sys/block/sda/queue/scheduler 
mq-deadline kyber [bfq] none
root@helios64:~# cat /sys/block/sdb/queue/scheduler 
mq-deadline kyber [bfq] none
root@helios64:~# 
 

 

 

Quote

root@helios64:~# hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
    Model Number:       WDC WD40EFRX-68N32N0                    
    Firmware Revision:  82.00A82
 

root@helios64:~# hdparm -I /dev/sdb

/dev/sdb:

ATA device, with non-removable media
    Model Number:       WDC WD40EFRX-68N32N0                    
    Firmware Revision:  82.00A82
 

 


@Wofferl those are the exact same model as three of my disks (though mine aren't "Plus"). I've used these disks in another machine with ZFS and had zero issues (ASM1062 SATA controller). So, assuming the problem is between the SATA controller and the disk, and while I agree with you that it's probably in part a disk issue, I'm convinced it's something that could be fixed in the SATA controller firmware. Perhaps these disks do something funny that the SATA controller doesn't expect? And based on all my testing so far, the SATA cable also plays a role, meaning perhaps there's a noise-factor in play (as well).

 

Side note: Western Digital really screwed us over with this whole SMR fiasco, didn't they. I'd be just about ready to throw these disks in the trash if it weren't for the fact that they worked perfectly on another SATA controller.

 

@grek glad it helped! By the way, I would still recommend changing the io scheduler to none, because bfq is CPU intensive and ZFS does its own scheduling. It probably won't fix the issues, but it might reduce some CPU overhead.
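
A minimal sketch of the change (per disk; note it does not survive a reboot, so you'd need e.g. a udev rule to make it persistent):

root@helios64:~# echo none > /sys/block/sda/queue/scheduler
root@helios64:~# cat /sys/block/sda/queue/scheduler
mq-deadline kyber bfq [none]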


We will soon provide instructions on how to update the SATA controller (JMB585) firmware to the latest version. Let's see whether it has a positive impact or not.

 

14 hours ago, ShadowDance said:

perhaps there's a noise-factor in play (as well).

For the new revision we will reinforce noise filtering on the SATA controller, in case this is part of the root cause in some setups.

4 hours ago, gprovost said:

We will soon provide instructions on how to update the SATA controller (JMB585) firmware to the latest version. Let's see whether it has a positive impact or not.

 

That's great news, looking forward to it!

 

4 hours ago, gprovost said:

For the new revision we will reinforce noise filtering on the SATA controller, in case this is part of the root cause in some setups.

 

Also nice! If you do figure out a trick we can implement ourselves (e.g. via soldering or, dare I say, foil wrapping) to reduce the noise there, let us know.

 

Speaking of noise, I have two new thoughts:

  • My Helios64 PSU is not connected to a grounded wall outlet. Unfortunately, here in my apartment there are only two such outlets, in very inconvenient places, but I'll see if it's feasible to test it on one of them. Perhaps this introduces noise into the system? Perhaps other devices are leaking their noise via the grounded extension cord (I recently learned that using a grounded extension cord on a non-grounded outlet is not ideal; I hadn't even thought to consider it before...)
  • I read somewhere that the metal frame of the HDD cases should be connected to device ground, but since we're using plastic mounting brackets I doubt this is the case here? To be honest, I don't know whether this is a real issue; it's just something I read on a forum.

Guys, those drives are CMR (the ones I could pick out from the posted logs), not SMR. EFRX is CMR, EFAX is SMR.

Also, SMR is the one to avoid at all costs.

 

From https://www.servethehome.com/buyers-guides/top-hardware-components-freenas-nas-servers/top-picks-freenas-hard-drives/, ServeTheHome runs a lot of ZFS pools in their infra.

Quote

As a quick note. We do not recommend the new WD Red SMR drives. See WD Red SMR vs CMR Tested Avoid Red SMR.

 

Please refer to manufacturer spec sheets and tables before buying an HDD. Here are a few lists:

https://nascompares.com/answer/list-of-wd-cmr-and-smr-hard-drives-hdd/

https://blog.westerndigital.com/wd-red-nas-drives/

https://www.seagate.com/internal-hard-drives/cmr-smr-list/

 

As far as I know, all Hitachi drives are CMR.

6 minutes ago, clostro said:

Guys, those drives are CMR (the ones I could pick out from the posted logs), not SMR. EFRX is CMR, EFAX is SMR.

Also, SMR is the one to avoid at all costs.

 

Thanks for sharing, and it seems you are right. I've read some confusing information back and forth and, to be honest, was never really sure whether mine were SMR or CMR. It doesn't help that WD introduced the Plus series and then claimed all plain Reds (1-6TB) are SMR. Good to know mine are CMR; that makes sense, since I haven't noticed performance issues with them.

On 3/26/2021 at 5:05 PM, ShadowDance said:

My Helios64 PSU is not connected to a grounded wall outlet. Unfortunately, here in my apartment there are only two such outlets, in very inconvenient places, but I'll see if it's feasible to test it on one of them. Perhaps this introduces noise into the system? Perhaps other devices are leaking their noise via the grounded extension cord (I recently learned that using a grounded extension cord on a non-grounded outlet is not ideal; I hadn't even thought to consider it before...

 

That is a valid point you are bringing up, and we actually did ask some customers to pay attention to this aspect. FYI, we decided to have the DC ground connected to AC earth when we ordered the AC/DC adapter, which should be the safe and standard approach, but we know many manufacturers don't bother doing it. However, if you plug your device into a power extension alongside other devices, you might be impacted by the noise of other poorly isolated devices. You could try to isolate Helios64 by using, for example, a travel adapter that doesn't use the earth pin.

 

On 3/26/2021 at 5:05 PM, ShadowDance said:

I read somewhere that the metal frame of the HDD cases should be connected to device ground, but since we're using plastic mounting brackets I doubt this is the case here? To be honest, I don't know whether this is a real issue; it's just something I read on a forum.

Hmmm, my expectation would be that the HDD enclosure is connected to the ground of the HDD controller PCB, which in turn is connected via the power cable to the ground of the motherboard, which in turn is connected to the ground / earth of the AC/DC PSU.

 


Finally got around to testing my Helios64 on the grounded wall outlet (alone); unfortunately, there was no change. The disks behaved the same as before, and I also had those kernel panics during boot that I seem to get on ~3 out of 5 bootups.

 

On 3/29/2021 at 5:58 AM, gprovost said:

Hmmm, my expectation would be that the HDD enclosure is connected to the ground of the HDD controller PCB, which in turn is connected via the power cable to the ground of the motherboard, which in turn is connected to the ground / earth of the AC/DC PSU.

 

Yes, that sounds very reasonable, and it's my expectation too. I didn't put too much weight behind what I read, but over there one user had issues with his disks until he grounded the HDD cases. It sounded strange, but at this point I'm open to pretty much any strange solution, haha. It could also just have been poorly designed/manufactured HDDs.


To add another data point, I have 5 of these drives in a raid6 configuration with an ext4 filesystem:

Model Family:     Western Digital Caviar Black
Device Model:     WDC WD2002FAEX-007BA0
Serial Number:    WD-WMAY02794530
LU WWN Device Id: 5 0014ee 002d98d8f
Firmware Version: 05.01D05
User Capacity:    2,000,398,934,016 bytes [2.00 TB]

I moved the drives into the Helios64 from another server, where they were all running without issues.

 

As far as I know, I have not yet experienced a failure that caused data loss, or even a temporary loss of access to a disk that needed a reboot to restore. I noticed errors similar to the ones below during the initial raid sync. As others have noted in this thread, this indicates data corruption during the transfer, but in my case the retries seem to succeed. However, after enough failures the device is dropped to 3.0 Gbps link speed, and afterwards I don't see any more errors on that device.

Apr  2 23:00:17 localhost kernel: [ 1070.137017] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x380100 action 0x6
Apr  2 23:00:17 localhost kernel: [ 1070.137032] ata3.00: irq_stat 0x08000000
Apr  2 23:00:17 localhost kernel: [ 1070.137040] ata3: SError: { UnrecovData 10B8B Dispar BadCRC }
Apr  2 23:00:17 localhost kernel: [ 1070.137049] ata3.00: failed command: READ DMA EXT
Apr  2 23:00:17 localhost kernel: [ 1070.137061] ata3.00: cmd 25/00:00:00:80:2f/00:04:00:00:00/e0 tag 20 dma 524288 in
Apr  2 23:00:17 localhost kernel: [ 1070.137061]          res 50/00:00:ff:87:2f/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Apr  2 23:00:17 localhost kernel: [ 1070.137066] ata3.00: status: { DRDY }
Apr  2 23:00:17 localhost kernel: [ 1070.137079] ata3: hard resetting link
Apr  2 23:00:18 localhost kernel: [ 1070.612946] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr  2 23:00:18 localhost kernel: [ 1070.622339] ata3.00: configured for UDMA/133
Apr  2 23:00:18 localhost kernel: [ 1070.622453] ata3: EH complete
Apr  2 23:00:55 localhost kernel: [ 1108.163913] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x380100 action 0x6
Apr  2 23:00:55 localhost kernel: [ 1108.163927] ata3.00: irq_stat 0x08000000
Apr  2 23:00:55 localhost kernel: [ 1108.163935] ata3: SError: { UnrecovData 10B8B Dispar BadCRC }
Apr  2 23:00:55 localhost kernel: [ 1108.163944] ata3.00: failed command: READ DMA EXT
Apr  2 23:00:55 localhost kernel: [ 1108.163957] ata3.00: cmd 25/00:00:00:cc:2e/00:04:00:00:00/e0 tag 31 dma 524288 in
Apr  2 23:00:55 localhost kernel: [ 1108.163957]          res 50/00:00:4f:46:2f/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Apr  2 23:00:55 localhost kernel: [ 1108.163962] ata3.00: status: { DRDY }
Apr  2 23:00:55 localhost kernel: [ 1108.163975] ata3: hard resetting link
Apr  2 23:00:56 localhost kernel: [ 1108.639872] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr  2 23:00:56 localhost kernel: [ 1108.649317] ata3.00: configured for UDMA/133
Apr  2 23:00:56 localhost kernel: [ 1108.649434] ata3: EH complete
Apr  2 23:01:27 localhost kernel: [ 1139.690980] ata3: limiting SATA link speed to 3.0 Gbps
Apr  2 23:01:27 localhost kernel: [ 1139.690997] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x980000 action 0x6 frozen
Apr  2 23:01:27 localhost kernel: [ 1139.691004] ata3: SError: { 10B8B Dispar LinkSeq }
Apr  2 23:01:27 localhost kernel: [ 1139.691014] ata3.00: failed command: WRITE DMA EXT
Apr  2 23:01:27 localhost kernel: [ 1139.691026] ata3.00: cmd 35/00:40:80:b2:2f/00:05:00:00:00/e0 tag 16 dma 688128 out
Apr  2 23:01:27 localhost kernel: [ 1139.691026]          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr  2 23:01:27 localhost kernel: [ 1139.691032] ata3.00: status: { DRDY }
Apr  2 23:01:27 localhost kernel: [ 1139.691045] ata3: hard resetting link
Apr  2 23:01:27 localhost kernel: [ 1140.166952] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Apr  2 23:01:27 localhost kernel: [ 1140.177164] ata3.00: configured for UDMA/133
Apr  2 23:01:27 localhost kernel: [ 1140.177269] ata3: EH complete

 

I'm able to reproduce similar errors on one of the drives within a few minutes by running stress (with the current directory on the raid6 filesystem):

stress --hdd 4 &

 

Within a few hours, I will see many similar errors on all 5 drives, and all will be reset to 3.0 Gbps.

 

I think it's reasonable to conclude that these errors occur in transit on the cable, because the drives' SMART stats do not seem to show a corresponding increase when the kernel log is full of BadCRC errors. I was checking UDMA_CRC_Error_Count specifically since it seemed most relevant to the errors above, but I'm not very familiar with SMART so there may be something else that I should check.
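
For anyone wanting to run the same check, roughly (assuming smartmontools is installed; compare the raw value before and after a stress run):

smartctl -A /dev/sdd | grep -i udma_crc    # attribute 199, UDMA_CRC_Error_Count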

 

I tried some modifications to /boot/armbianEnv.txt to see how they would affect the problem:

1. extraargs=libata.force=noncq didn't have an impact; I saw similar errors on multiple disks within 30 minutes

2. extraargs=libata.force=noncq,3.0 saw no errors after 2 hours of testing

3. extraargs=libata.force=3.0 saw no errors after 2 hours of testing

 

I'd prefer to err on the side of reliability, so I'm going to stick with extraargs=libata.force=3.0 for now.
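
One way to sanity-check after a reboot that the limit took effect (these are the same 'SATA link up' messages that appear in the logs above):

dmesg | grep 'SATA link up'
# expect lines like: ataX: SATA link up 3.0 Gbps (SStatus 123 SControl 320)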

 

I wonder how many people are seeing behavior similar to mine and simply haven't noticed these errors in their kernel logs?

 

I had similar errors with a Banana Pi board a few years ago when I used an internal SATA cable to connect to an external drive enclosure. In that case, switching to an eSATA cable with the appropriate shielding resolved the issue.

 

I could try replacing the SATA cables to see if it reduces the error rate. Does anyone have recommendations for a specific set of replacement cables that has worked well?


@Fred Fettinger Your errors are most likely due to an unstable HDD harness. You can contact us at support@kobol.io to see if a replacement of the HDD wire harness is needed.

 

@ShadowDance Thanks for the feedback; at least we can remove grounding issues from the list of possible root causes.

It looks like, in the use case of very heavy HDD I/O load (arising mainly during scrub operations), too much noise is generated, resulting in those HSM violations.

Thanks to all your tests, the issue unfortunately seems to point to a noise-filtering problem. As previously mentioned, we will fix that in the next revision.

We will see if the new SATA controller firmware has any impact, but we doubt it. I think the only solution for the current revision is to limit the SATA speed to 3 Gbps when using btrfs or zfs.


Hi all, you can download the SATA firmware update at https://cdn-kobol-io.s3-ap-southeast-1.amazonaws.com/Helios64/SATA_FW/Helios64_SATA_FW_update_00021_200624.img.xz

 

Instructions:

1. Download the SD card image

2. Flash the image onto a microSD card (see the sketch after the checksums below)

3. Insert the microSD card into Helios64 and power on. Helios64 should automatically boot from the microSD. If Helios64 still boots from eMMC, disable the eMMC

4. Wait a while; the system will reboot and then power off if the firmware flashing succeeds.

   If it failed, both the System Status and System Fault LEDs will blink

5. Remove the microSD card and boot Helios64 normally. See if there is any improvement.

 

Our officially supported stock firmware can be downloaded from https://cdn-kobol-io.s3-ap-southeast-1.amazonaws.com/Helios64/SATA_FW/Helios64_SATA_FW_factory_00020_190702.img.xz. If there is no improvement with the newer firmware, please revert to this stock firmware.

 

SHA256SUM:

e5dfbe84f4709a3e2138ffb620f0ee62ecbcc79a8f83692c1c1d7a4361f0d30f *Helios64_SATA_FW_factory_00020_190702.img.xz
0d78fec569dd699fd667acf59ba7b07c420a2865e1bcb8b85b26b61d404998c5 *Helios64_SATA_FW_update_00021_200624.img.xz
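
A minimal sketch of step 2 from Linux (assuming the microSD card appears as /dev/sdX; double-check the device name before writing):

# verify the download against the checksum above
echo "0d78fec569dd699fd667acf59ba7b07c420a2865e1bcb8b85b26b61d404998c5 *Helios64_SATA_FW_update_00021_200624.img.xz" | sha256sum -c
# decompress, then write the image to the card
xz -dk Helios64_SATA_FW_update_00021_200624.img.xz
dd if=Helios64_SATA_FW_update_00021_200624.img of=/dev/sdX bs=4M conv=fsync status=progress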

 


@aprayoga I have updated to the new firmware, but the problem is still there. Shortly after starting a zfs scrub, I get the HSM violations.

 

[  544.719820] ata3.00: exception Emask 0x2 SAct 0x30000030 SErr 0x400 action 0x6
[  544.719830] ata3.00: irq_stat 0x08000000
[  544.719836] ata3: SError: { Proto }
[  544.719845] ata3.00: failed command: READ FPDMA QUEUED
[  544.719856] ata3.00: cmd 60/80:20:d0:ec:63/00:00:0d:00:00/40 tag 4 ncq dma 65536 in
                        res 40/00:e0:d0:eb:63/00:00:0d:00:00/40 Emask 0x2 (HSM violation)
[  544.719860] ata3.00: status: { DRDY }

 

Is there a way to check the installed firmware version to verify that the update was really successful?


Thanks for the updated firmware. Unfortunately, I'm seeing the same as Wofferl:

 

Spoiler


[  156.746403] ata4.00: exception Emask 0x10 SAct 0x4 SErr 0x300100 action 0x6
[  156.747037] ata4.00: irq_stat 0x08000000
[  156.747396] ata4: SError: { UnrecovData Dispar BadCRC }
[  156.747869] ata4.00: failed command: READ FPDMA QUEUED
[  156.748346] ata4.00: cmd 60/00:10:e0:d5:54/01:00:00:00:00/40 tag 2 ncq dma 131072 in
[  156.748346]          res 40/00:10:e0:d5:54/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[  156.749723] ata4.00: status: { DRDY }
[  156.750075] ata4: hard resetting link
[  157.226341] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  157.228668] ata4.00: configured for UDMA/133
[  157.229175] sd 3:0:0:0: [sdd] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[  157.229993] sd 3:0:0:0: [sdd] tag#2 Sense Key : 0x5 [current]
[  157.230552] sd 3:0:0:0: [sdd] tag#2 ASC=0x21 ASCQ=0x4
[  157.231026] sd 3:0:0:0: [sdd] tag#2 CDB: opcode=0x88 88 00 00 00 00 00 00 54 d5 e0 00 00 01 00 00 00
[  157.231843] blk_update_request: I/O error, dev sdd, sector 5559776 op 0x0:(READ) flags 0x80700 phys_seg 12 prio class 0
[  157.232994] ata4: EH complete
[  193.546385] ata4.00: exception Emask 0x10 SAct 0xc00 SErr 0x300100 action 0x6 frozen
[  193.547085] ata4.00: irq_stat 0x08000000
[  193.547446] ata4: SError: { UnrecovData Dispar BadCRC }
[  193.547919] ata4.00: failed command: READ FPDMA QUEUED
[  193.548398] ata4.00: cmd 60/00:50:e0:c5:6d/01:00:00:00:00/40 tag 10 ncq dma 131072 in
[  193.548398]          res 40/00:50:e0:c5:6d/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[  193.549784] ata4.00: status: { DRDY }
[  193.550121] ata4.00: failed command: READ FPDMA QUEUED
[  193.550703] ata4.00: cmd 60/00:58:e0:c6:6d/01:00:00:00:00/40 tag 11 ncq dma 131072 in
[  193.550703]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
[  193.552096] ata4.00: status: { DRDY }
[  193.552446] ata4: hard resetting link
[  194.030341] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  194.032750] ata4.00: configured for UDMA/133
[  194.033243] ata4.00: device reported invalid CHS sector 0
[  194.033787] sd 3:0:0:0: [sdd] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=31s
[  194.034671] sd 3:0:0:0: [sdd] tag#10 Sense Key : 0x5 [current]
[  194.035209] sd 3:0:0:0: [sdd] tag#10 ASC=0x21 ASCQ=0x4
[  194.035688] sd 3:0:0:0: [sdd] tag#10 CDB: opcode=0x88 88 00 00 00 00 00 00 6d c5 e0 00 00 01 00 00 00
[  194.036516] blk_update_request: I/O error, dev sdd, sector 7194080 op 0x0:(READ) flags 0x80700 phys_seg 8 prio class 0
[  194.037704] sd 3:0:0:0: [sdd] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=31s
[  194.038573] sd 3:0:0:0: [sdd] tag#11 Sense Key : 0x5 [current]
[  194.039111] sd 3:0:0:0: [sdd] tag#11 ASC=0x21 ASCQ=0x4
[  194.039590] sd 3:0:0:0: [sdd] tag#11 CDB: opcode=0x88 88 00 00 00 00 00 00 6d c6 e0 00 00 01 00 00 00
[  194.040414] blk_update_request: I/O error, dev sdd, sector 7194336 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
[  194.041545] ata4: EH complete
[  194.286419] ata4.00: exception Emask 0x2 SAct 0x7c000000 SErr 0x400 action 0x6
[  194.287073] ata4.00: irq_stat 0x08000000
[  194.287432] ata4: SError: { Proto }
[  194.287754] ata4.00: failed command: READ FPDMA QUEUED
[  194.288230] ata4.00: cmd 60/01:d0:3b:c6:6d/00:00:00:00:00/40 tag 26 ncq dma 512 in
[  194.288230]          res 40/00:e8:3e:c6:6d/00:00:00:00:00/40 Emask 0x2 (HSM violation)
[  194.289585] ata4.00: status: { DRDY }
[  194.289920] ata4.00: failed command: READ FPDMA QUEUED
[  194.290429] ata4.00: cmd 60/01:d8:3c:c6:6d/00:00:00:00:00/40 tag 27 ncq dma 512 in
[  194.290429]          res 40/00:e8:3e:c6:6d/00:00:00:00:00/40 Emask 0x2 (HSM violation)
[  194.291787] ata4.00: status: { DRDY }
[  194.292125] ata4.00: failed command: READ FPDMA QUEUED
[  194.292600] ata4.00: cmd 60/01:e0:3d:c6:6d/00:00:00:00:00/40 tag 28 ncq dma 512 in
[  194.292600]          res 40/00:e8:3e:c6:6d/00:00:00:00:00/40 Emask 0x2 (HSM violation)
[  194.293954] ata4.00: status: { DRDY }
[  194.294397] ata4.00: failed command: READ FPDMA QUEUED
[  194.294881] ata4.00: cmd 60/01:e8:3e:c6:6d/00:00:00:00:00/40 tag 29 ncq dma 512 in
[  194.294881]          res 40/00:e8:3e:c6:6d/00:00:00:00:00/40 Emask 0x2 (HSM violation)
[  194.296233] ata4.00: status: { DRDY }
[  194.296568] ata4.00: failed command: READ FPDMA QUEUED
[  194.297043] ata4.00: cmd 60/01:f0:3f:c6:6d/00:00:00:00:00/40 tag 30 ncq dma 512 in
[  194.297043]          res 40/00:e8:3e:c6:6d/00:00:00:00:00/40 Emask 0x2 (HSM violation)
[  194.298468] ata4.00: status: { DRDY }
[  194.298823] ata4: hard resetting link
[  194.774377] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  194.776810] ata4.00: configured for UDMA/133
[  194.777458] ata4: EH complete
[  195.010433] ata4: limiting SATA link speed to 3.0 Gbps
[  195.010919] ata4.00: exception Emask 0x2 SAct 0x3f000 SErr 0x400 action 0x6
[  195.011536] ata4.00: irq_stat 0x08000000
[  195.011894] ata4: SError: { Proto }
[  195.012216] ata4.00: failed command: READ FPDMA QUEUED
[  195.012693] ata4.00: cmd 60/01:60:aa:c6:6d/00:00:00:00:00/40 tag 12 ncq dma 512 in
[  195.012693]          res 40/00:78:ad:c6:6d/00:00:00:00:00/40 Emask 0x2 (HSM violation)
[  195.014048] ata4.00: status: { DRDY }
[  195.014422] ata4.00: failed command: READ FPDMA QUEUED
[  195.014900] ata4.00: cmd 60/01:68:ab:c6:6d/00:00:00:00:00/40 tag 13 ncq dma 512 in
[  195.014900]          res 40/00:78:ad:c6:6d/00:00:00:00:00/40 Emask 0x2 (HSM violation)
[  195.016253] ata4.00: status: { DRDY }
[  195.016587] ata4.00: failed command: READ FPDMA QUEUED
[  195.017063] ata4.00: cmd 60/01:70:ac:c6:6d/00:00:00:00:00/40 tag 14 ncq dma 512 in
[  195.017063]          res 40/00:78:ad:c6:6d/00:00:00:00:00/40 Emask 0x2 (HSM violation)
[  195.018440] ata4.00: status: { DRDY }
[  195.018778] ata4.00: failed command: READ FPDMA QUEUED
[  195.019254] ata4.00: cmd 60/01:78:ad:c6:6d/00:00:00:00:00/40 tag 15 ncq dma 512 in
[  195.019254]          res 40/00:78:ad:c6:6d/00:00:00:00:00/40 Emask 0x2 (HSM violation)
[  195.020607] ata4.00: status: { DRDY }
[  195.020942] ata4.00: failed command: READ FPDMA QUEUED
[  195.021418] ata4.00: cmd 60/01:80:ae:c6:6d/00:00:00:00:00/40 tag 16 ncq dma 512 in
[  195.021418]          res 40/00:78:ad:c6:6d/00:00:00:00:00/40 Emask 0x2 (HSM violation)
[  195.022900] ata4.00: status: { DRDY }
[  195.023240] ata4.00: failed command: READ FPDMA QUEUED
[  195.023715] ata4.00: cmd 60/01:88:af:c6:6d/00:00:00:00:00/40 tag 17 ncq dma 512 in
[  195.023715]          res 40/00:78:ad:c6:6d/00:00:00:00:00/40 Emask 0x2 (HSM violation)
[  195.025069] ata4.00: status: { DRDY }
[  195.025420] ata4: hard resetting link
[  195.506345] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  195.508591] ata4.00: configured for UDMA/133
[  195.509252] ata4: EH complete

 

 

At first the programming didn't seem to work (I used balenaEtcher without unpacking the file; it seemed to understand that it was compressed, though, so it went ahead with it), and there was no reboot either. Here's the output:

 

Spoiler


U-Boot 2020.07-armbian (Aug 19 2020 - 11:56:02 +0700)

SoC: Rockchip rk3399
Reset cause: POR
DRAM:  3.9 GiB
PMIC:  RK808 
SF: Detected w25q128 with page size 256 Bytes, erase size 4 KiB, total 16 MiB
MMC:   mmc@fe320000: 1, sdhci@fe330000: 0
Loading Environment from MMC... unable to select a mode
*** Warning - No block device, using default environment

In:    serial
Out:   serial
Err:   serial
Model: Helios64
Revision: 1.2 - 4GB non ECC
Net:   eth0: ethernet@fe300000
Hit any key to stop autoboot:  0 
unable to select a mode
unable to select a mode
starting USB...
Bus usb@fe380000: USB EHCI 1.00
Bus dwc3: usb maximum-speed not found
Register 2000140 NbrPorts 2
Starting the controller
USB XHCI 1.10
scanning bus usb@fe380000 for devices... 1 USB Device(s) found
scanning bus dwc3 for devices... 4 USB Device(s) found
       scanning usb for storage devices... 0 Storage Device(s) found

Device 0: unknown device
scanning bus for devices...

Device 0: unknown device
Speed: 1000, full duplex
BOOTP broadcast 1
BOOTP broadcast 2
BOOTP broadcast 3
BOOTP broadcast 4
DHCP client bound to address 192.168.1.106 (1932 ms)
*** ERROR: `serverip' not set
Cannot autoload with TFTPGET
missing environment variable: pxeuuid
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/01-64-62-66-d0-0b-12
Speed: 1000, full duplex
*** ERROR: `serverip' not set
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/C0A8016A
Speed: 1000, full duplex
*** ERROR: `serverip' not set
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/C0A8016
Speed: 1000, full duplex
*** ERROR: `serverip' not set
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/C0A801
Speed: 1000, full duplex
*** ERROR: `serverip' not set
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/C0A80
Speed: 1000, full duplex
*** ERROR: `serverip' not set
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/C0A8
Speed: 1000, full duplex
*** ERROR: `serverip' not set
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/C0A
Speed: 1000, full duplex
*** ERROR: `serverip' not set
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/C0
Speed: 1000, full duplex
*** ERROR: `serverip' not set
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/C
Speed: 1000, full duplex
*** ERROR: `serverip' not set
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/default-arm-rk3399-helios64
Speed: 1000, full duplex
*** ERROR: `serverip' not set
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/default-arm-rk3399
Speed: 1000, full duplex
*** ERROR: `serverip' not set
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/default-arm
Speed: 1000, full duplex
*** ERROR: `serverip' not set
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/default
Speed: 1000, full duplex
*** ERROR: `serverip' not set
Config file not found
Speed: 1000, full duplex
BOOTP broadcast 1
DHCP client bound to address 192.168.1.106 (53 ms)
*** ERROR: `serverip' not set
Cannot autoload with TFTPGET
Speed: 1000, full duplex
BOOTP broadcast 1
DHCP client bound to address 192.168.1.106 (23 ms)
*** ERROR: `serverip' not set
Cannot autoload with TFTPGET
=> 

 

 

Then I tried flashing it again, unpacking it first this time around, and the flashing worked as described:

 

Spoiler


DDR Version 1.24 20191016
In
soft reset
SRX
channel 0
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 1
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 0 training pass!
channel 1 training pass!
change freq to 416MHz 0,1
Channel 0: LPDDR4,416MHz
Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
Channel 1: LPDDR4,416MHz
Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
256B stride
channel 0
CS = 0
MR0=0x18
MR4=0x82
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 1
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 0 training pass!
channel 1 training pass!
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
change freq to 856MHz 1,0
ch 0 ddrconfig = 0x101, ddrsize = 0x40
ch 1 ddrconfig = 0x101, ddrsize = 0x40
pmugrf_os_reg[2] = 0x32C1F2C1, stride = 0xD
ddr_set_rate to 328MHZ
ddr_set_rate to 666MHZ
ddr_set_rate to 928MHZ
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
ddr_set_rate to 416MHZ, ctl_index 0
ddr_set_rate to 856MHZ, ctl_index 1
support 416 856 328 666 928 MHz, current 856MHz
OUT
Boot1: 2019-03-14, version: 1.19
CPUId = 0x0
ChipType = 0x10, 322
SdmmcInit=2 0
BootCapSize=100000
UserCapSize=14910MB
FwPartOffset=2000 , 100000
mmc0:cmd5,20
SdmmcInit=0 0
BootCapSize=0
UserCapSize=15247MB
FwPartOffset=2000 , 0
StorageInit ok = 196695
SecureMode = 0
SecureInit read PBA: 0x4
SecureInit read PBA: 0x404
SecureInit read PBA: 0x804
SecureInit read PBA: 0xc04
SecureInit read PBA: 0x1004
SecureInit read PBA: 0x1404
SecureInit read PBA: 0x1804
SecureInit read PBA: 0x1c04
SecureInit ret = 0, SecureMode = 0
atags_set_bootdev: ret:(0)
GPT 0x3380ec0 signature is wrong
recovery gpt...
GPT 0x3380ec0 signature is wrong
recovery gpt fail!
LoadTrust Addr:0x4000
No find bl30.bin
No find bl32.bin
Load uboot, ReadLba = 2000
Load OK, addr=0x200000, size=0xdedec
RunBL31 0x40000
NOTICE:  BL31: v1.3(debug):42583b6
NOTICE:  BL31: Built : 07:55:13, Oct 15 2019
NOTICE:  BL31: Rockchip release version: v1.1
INFO:    GICv3 with legacy support detected. ARM GICV3 driver initialized in EL3
INFO:    Using opteed sec cpu_context!
INFO:    boot cpu mask: 0
INFO:    plat_rockchip_pmu_init(1190): pd status 3e
INFO:    BL31: Initializing runtime services
WARNING: No OPTEE provided by BL2 boot loader, Booting device without OPTEE initialization. SMC`s destined for OPTEE will return SMC_UNK
ERROR:   Error initializing runtime service opteed_fast
INFO:    BL31: Preparing for EL3 exit to normal world
INFO:    Entry point address = 0x200000
INFO:    SPSR = 0x3c9


U-Boot 2020.07-armbian (Aug 19 2020 - 11:56:02 +0700)

SoC: Rockchip rk3399
Reset cause: RST
DRAM:  3.9 GiB
PMIC:  RK808 
SF: Detected w25q128 with page size 256 Bytes, erase size 4 KiB, total 16 MiB
MMC:   mmc@fe320000: 1, sdhci@fe330000: 0
Loading Environment from MMC... *** Warning - !read failed, using default environment

In:    serial
Out:   serial
Err:   serial
Model: Helios64
Revision: 1.2 - 4GB non ECC
Net:   eth0: ethernet@fe300000
Hit any key to stop autoboot:  0 
switch to partitions #0, OK
mmc1 is current device
Scanning mmc 1:1...
Found U-Boot script /boot/boot.scr
876 bytes read in 6 ms (142.6 KiB/s)
## Executing script at 00500000
gpio: pin E24 (gpio 168) value is 1
SF: Detected w25x20cl with page size 256 Bytes, erase size 4 KiB, total 256 KiB
65568 bytes read in 10 ms (6.3 MiB/s)
device 0 offset 0x0, size 0x10020
SF: 65568 bytes @ 0x0 Read: OK
byte at 0x0000000002080060 (0x78) != byte at 0x0000000006000060 (0x24)
Total of 96 byte(s) were the same
Writing SATA Flash
SF: 262144 bytes @ 0x0 Erased: OK
device 0 offset 0x0, size 0x10020
   
Updating, 7% 38836 B/s   
Updating, 13% 39016 B/s   
Updating, 19% 39321 B/s   
Updating, 25% 39199 B/s   
Updating, 32% 39199 B/s   
Updating, 38% 39321 B/s   
Updating, 44% 39409 B/s   
Updating, 50% 39522 B/s   
Updating, 57% 39527 B/s   
Updating, 63% 39531 B/s   
Updating, 69% 39534 B/s   
Updating, 75% 39568 B/s   
Updating, 82% 39511 B/s   
Updating, 88% 39542 B/s   
Updating, 94% 39568 B/s   
Updating, 100% 39522 B/s
65568 bytes written, 0 bytes skipped in 1.802s, speed 37259 B/s
gpio: pin E24 (gpio 168) value is 0
gpio: pin C28 (gpio 108) value is 0
resetting ...
DDR Version 1.24 20191016
In
soft reset
SRX
channel 0
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 1
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 0 training pass!
channel 1 training pass!
change freq to 416MHz 0,1
Channel 0: LPDDR4,416MHz
Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
Channel 1: LPDDR4,416MHz
Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
256B stride
channel 0
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 1
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 0 training pass!
channel 1 training pass!
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
change freq to 856MHz 1,0
ch 0 ddrconfig = 0x101, ddrsize = 0x40
ch 1 ddrconfig = 0x101, ddrsize = 0x40
pmugrf_os_reg[2] = 0x32C1F2C1, stride = 0xD
ddr_set_rate to 328MHZ
ddr_set_rate to 666MHZ
ddr_set_rate to 928MHZ
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
ddr_set_rate to 416MHZ, ctl_index 0
ddr_set_rate to 856MHZ, ctl_index 1
support 416 856 328 666 928 MHz, current 856MHz
OUT
Boot1: 2019-03-14, version: 1.19
CPUId = 0x0
ChipType = 0x10, 323
SdmmcInit=2 0
BootCapSize=100000
UserCapSize=14910MB
FwPartOffset=2000 , 100000
mmc0:cmd5,20
SdmmcInit=0 0
BootCapSize=0
UserCapSize=15247MB
FwPartOffset=2000 , 0
StorageInit ok = 66832
SecureMode = 0
SecureInit read PBA: 0x4
SecureInit read PBA: 0x404
SecureInit read PBA: 0x804
SecureInit read PBA: 0xc04
SecureInit read PBA: 0x1004
SecureInit read PBA: 0x1404
SecureInit read PBA: 0x1804
SecureInit read PBA: 0x1c04
SecureInit ret = 0, SecureMode = 0
atags_set_bootdev: ret:(0)
GPT 0x3380ec0 signature is wrong
recovery gpt...
GPT 0x3380ec0 signature is wrong
recovery gpt fail!
LoadTrust Addr:0x4000
No find bl30.bin
No find bl32.bin
Load uboot, ReadLba = 2000
Load OK, addr=0x200000, size=0xdedec
RunBL31 0x40000
NOTICE:  BL31: v1.3(debug):42583b6
NOTICE:  BL31: Built : 07:55:13, Oct 15 2019
NOTICE:  BL31: Rockchip release version: v1.1
INFO:    GICv3 with legacy support detected. ARM GICV3 driver initialized in EL3
INFO:    Using opteed sec cpu_context!
INFO:    boot cpu mask: 0
INFO:    plat_rockchip_pmu_init(1190): pd status 3e
INFO:    BL31: Initializing runtime services
WARNING: No OPTEE provided by BL2 boot loader, Booting device without OPTEE initialization. SMC`s destined for OPTEE will return SMC_UNK
ERROR:   Error initializing runtime service opteed_fast
INFO:    BL31: Preparing for EL3 exit to normal world
INFO:    Entry point address = 0x200000
INFO:    SPSR = 0x3c9


U-Boot 2020.07-armbian (Aug 19 2020 - 11:56:02 +0700)

SoC: Rockchip rk3399
Reset cause: RST
DRAM:  3.9 GiB
PMIC:  RK808 
SF: Detected w25q128 with page size 256 Bytes, erase size 4 KiB, total 16 MiB
MMC:   mmc@fe320000: 1, sdhci@fe330000: 0
Loading Environment from MMC... unable to select a mode
*** Warning - No block device, using default environment

In:    serial
Out:   serial
Err:   serial
Model: Helios64
Revision: 1.2 - 4GB non ECC
Net:   eth0: ethernet@fe300000
Hit any key to stop autoboot:  0 
switch to partitions #0, OK
mmc1 is current device
Scanning mmc 1:1...
Found U-Boot script /boot/boot.scr
876 bytes read in 6 ms (142.6 KiB/s)
## Executing script at 00500000
gpio: pin E24 (gpio 168) value is 1
SF: Detected w25x20cl with page size 256 Bytes, erase size 4 KiB, total 256 KiB
65568 bytes read in 10 ms (6.3 MiB/s)
device 0 offset 0x0, size 0x10020
SF: 65568 bytes @ 0x0 Read: OK
Total of 65568 byte(s) were the same
Continue boot
gpio: pin E24 (gpio 168) value is 0
Power off system
Setting bus to 0

 

 


@ShadowDance

 

Someone from our helpdesk mentioned today that they encountered a warning message from Synology DSM after a drive-failure replacement. The NAS was loaded with EFRX-series drives. Here is what you can now find in the Synology compatibility list; for the 4TB model the recommended number of drives is even lower.

Quote

It is highly recommended to not configure RAID 5 or RAID 6 with 7 or more drives due to the possibility of a performance drop.

 

I wish I had a better

 

Oli

 


I am having the same issue, though I want to mention that I use Seagate drives, so the issue is not restricted to WD. I can reliably trigger it with a zfs scrub. Currently 4 disks are in use, and it always happens on /dev/sdd, so disk 4; disk 5 is the empty slot.
I have read most of the comments, but did anyone really succeed with a new power harness or the new firmware?

[16218.668509] ata4.00: exception Emask 0x10 SAct 0x1000084 SErr 0xb80100 action 0x6 frozen
[16218.668523] ata4.00: irq_stat 0x08000000
[16218.668530] ata4: SError: { UnrecovData 10B8B Dispar BadCRC LinkSeq }
[16218.668539] ata4.00: failed command: READ FPDMA QUEUED
[16218.668551] ata4.00: cmd 60/b8:10:e0:43:37/07:00:11:03:00/40 tag 2 ncq dma 1011712 in
                        res 40/00:38:98:4b:37/00:00:11:03:00/40 Emask 0x10 (ATA bus error)
[16218.668556] ata4.00: status: { DRDY }
[16218.668562] ata4.00: failed command: READ FPDMA QUEUED
[16218.668574] ata4.00: cmd 60/b8:38:98:4b:37/07:00:11:03:00/40 tag 7 ncq dma 1011712 in
                        res 40/00:38:98:4b:37/00:00:11:03:00/40 Emask 0x10 (ATA bus error)
[16218.668578] ata4.00: status: { DRDY }
[16218.668584] ata4.00: failed command: READ FPDMA QUEUED
[16218.668595] ata4.00: cmd 60/b8:c0:58:53:37/07:00:11:03:00/40 tag 24 ncq dma 1011712 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
[16218.668600] ata4.00: status: { DRDY }
[16218.668612] ata4: hard resetting link
[16219.144463] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[16219.163619] ata4.00: configured for UDMA/133
[16219.164082] ata4.00: device reported invalid CHS sector 0
[16219.164110] sd 3:0:0:0: [sdd] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=31s
[16219.164116] sd 3:0:0:0: [sdd] tag#2 Sense Key : 0x5 [current] 
[16219.164121] sd 3:0:0:0: [sdd] tag#2 ASC=0x21 ASCQ=0x4 
[16219.164127] sd 3:0:0:0: [sdd] tag#2 CDB: opcode=0x88 88 00 00 00 00 03 11 37 43 e0 00 00 07 b8 00 00
[16219.164133] blk_update_request: I/O error, dev sdd, sector 13173736416 op 0x0:(READ) flags 0x700 phys_seg 125 prio class 0
[16219.164148] zio pool=r2pool vdev=/dev/disk/by-id/ata-ST10000VN0004-1ZD101_$serial-part1 error=5 type=1 offset=6744951996416 size=1011712 flags=40080cb0
[16219.164209] sd 3:0:0:0: [sdd] tag#7 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=31s
[16219.164213] sd 3:0:0:0: [sdd] tag#7 Sense Key : 0x5 [current] 
[16219.164218] sd 3:0:0:0: [sdd] tag#7 ASC=0x21 ASCQ=0x4 
[16219.164222] sd 3:0:0:0: [sdd] tag#7 CDB: opcode=0x88 88 00 00 00 00 03 11 37 4b 98 00 00 07 b8 00 00
[16219.164227] blk_update_request: I/O error, dev sdd, sector 13173738392 op 0x0:(READ) flags 0x700 phys_seg 142 prio class 0
[16219.164235] zio pool=r2pool vdev=/dev/disk/by-id/ata-ST10000VN0004-1ZD101_$serial-part1 error=5 type=1 offset=6744953008128 size=1011712 flags=40080cb0
[16219.164286] sd 3:0:0:0: [sdd] tag#24 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=31s
[16219.164291] sd 3:0:0:0: [sdd] tag#24 Sense Key : 0x5 [current] 
[16219.164295] sd 3:0:0:0: [sdd] tag#24 ASC=0x21 ASCQ=0x4 
[16219.164299] sd 3:0:0:0: [sdd] tag#24 CDB: opcode=0x88 88 00 00 00 00 03 11 37 53 58 00 00 07 b8 00 00
[16219.164304] blk_update_request: I/O error, dev sdd, sector 13173740376 op 0x0:(READ) flags 0x700 phys_seg 127 prio class 0
[16219.164311] zio pool=r2pool vdev=/dev/disk/by-id/ata-ST10000VN0004-1ZD101_$serial-part1 error=5 type=1 offset=6744954023936 size=1011712 flags=40080cb0
[16219.164339] ata4: EH complete
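
Since it only ever hits ata4 here, one thing I'm tempted to try is a per-port variant of the noncq workaround from earlier in the thread. The kernel's libata.force parameter accepts PORT:VAL pairs (the port ID matches the ataX name in the log), so something like this (a sketch, untested on my box) should leave NCQ enabled on the healthy disks:

extraargs=libata.force=4:noncq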

 
