Alexander Eiblinger
Posted Sunday at 06:59 PM

Servus! I'm running an "old" Helios64, still on Debian buster with kernel 5.9.14-rockchip64 - pretty much the same installation as when I set the Helios64 up years ago. So far it has been running great, with no real issues. I have ZFS running on it, with 3 disks.

Today I decided to upgrade to something new and installed Armbian 25.5.1 / Noble, with kernel 6.12.32-current-rockchip64. The Helios64 boots up and everything seems to work fine. I was able to import my ZFS pool and it was accessible - however, when I do heavy I/O, e.g. an md5sum on a big file, I get a lot of these errors:

Jun 08 20:06:36 helios64 kernel: ata2.00: failed command: READ FPDMA QUEUED
Jun 08 20:06:36 helios64 kernel: ata2.00: cmd 60/80:90:40:3a:2f/00:00:06:00:00/40 tag 18 ncq dma 65536 in res 41/84:00:00:00:00/00:00:00:00:00/00 Emask 0x12 (ATA bus error)
Jun 08 20:06:36 helios64 kernel: ata2.00: status: { DRDY ERR }
Jun 08 20:06:36 helios64 kernel: ata2.00: error: { ICRC ABRT }
Jun 08 20:06:36 helios64 kernel: ata2: hard resetting link
Jun 08 20:06:36 helios64 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Jun 08 20:06:36 helios64 kernel: ata2.00: configured for UDMA/133
Jun 08 20:06:36 helios64 kernel: ata2: EH complete
Jun 08 20:07:04 helios64 kernel: ata2: limiting SATA link speed to 1.5 Gbps
Jun 08 20:07:04 helios64 kernel: ata2.00: exception Emask 0x2 SAct 0x202003 SErr 0x400 action 0x6
Jun 08 20:07:04 helios64 kernel: ata2.00: irq_stat 0x08000000
Jun 08 20:07:04 helios64 kernel: ata2: SError: { Proto }
Jun 08 20:07:04 helios64 kernel: ata2.00: failed command: READ FPDMA QUEUED
Jun 08 20:07:04 helios64 kernel: ata2.00: cmd 60/00:00:98:55:7f/08:00:07:00:00/40 tag 0 ncq dma 1048576 in res 41/84:00:00:00:00/00:00:00:00:00/00 Emask 0x12 (ATA bus error)
Jun 08 20:07:04 helios64 kernel: ata2.00: status: { DRDY ERR }
Jun 08 20:07:04 helios64 kernel: ata2.00: error: { ICRC ABRT }

This is for ata2, but the same happens for ata3 and ata4 - so all three disks of my ZFS pool. Interestingly, ZFS itself does not detect any read/write errors - but if you copy files, for example, you see a noticeable drop in throughput, and disk access hangs for a moment.

I installed 25.5.1 on an SD card, so I was able to switch back to my original system ... and voilà, no such errors anymore. So it does seem to be related to 25.5.1. Does anyone know this problem? Is there a solution, or anything I'm missing?

Thanks
Alex
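For reference, reproducing it is as simple as watching the kernel log while reading a big file; a minimal sketch using standard tools (the file path is just a placeholder for any large file on the pool):

# terminal 1: watch for FPDMA errors and link resets as they happen
dmesg --follow | grep --line-buffered 'ata[0-9]'

# terminal 2: generate sustained sequential reads
md5sum /tank/some-big-file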
djurny
Posted Sunday at 07:43 PM (marked as Solution)

Hi,

Did you also make sure to copy any ata options over from your old installation? For my own Helios64 I have had similar issues, and they were resolved by limiting the SATA link speed to 3.0 Gbps:

extraargs=libata.force=1:3.0G,2:3.0G,3:3.0G,4:3.0G

The internets say that these ATA errors are caused by the disk not responding to a controller request in time. Mostly it is advised to replace the [S]ATA cables and to make sure the contacts are clean and no crosstalk/EMI can occur. A reduction in the ATA link speed also helps - in my case it certainly did.

Do keep in mind that if you have regular spinning drives, the most throughput you will get out of them is around 100~140 MB/s, which translates to roughly 1.5 Gb/s on the wire. So decreasing the link speed will not cause any slowdown; your drives' throughput will remain the performance bottleneck.

For my Helios64, I have also had some issues with the left-most (or top) disk: the connectors did not protrude far enough, or had a little too much flex, which sometimes caused the left-most disk not to be detected at all. Reseating it and pressing on the connector from the rear fixed that.

As the internets mention, these errors are the ATA driver complaining about not hearing back from the drive(s) in time (or at all), so you might also want to experiment with a lower CPU frequency, or with not using any of the dynamic frequency scaling governors and sticking to the powersave or performance one instead.

Groetjes,
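For completeness, on Armbian the usual place for that line is /boot/armbianEnv.txt, whose extraargs value the stock boot script appends to the kernel command line. A sketch (ports 1-4 simply cover all Helios64 bays):

# /boot/armbianEnv.txt
extraargs=libata.force=1:3.0G,2:3.0G,3:3.0G,4:3.0G

# after a reboot, confirm the kernel picked it up:
cat /proc/cmdline
dmesg | grep -i force    # libata should log the forced limit per port

And to pin the governor instead of using a dynamic one, as mentioned above:

echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor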
Alexander Eiblinger
Posted 20 hours ago (Author)

Hi,

thanks for the hint. On the old system I had indeed forced the SATA link speed to 3.0 Gbps, which I had not done on the new installation. Setting the link speed to 3.0 Gbps did not solve the issue here, however - I had to go down to 1.5 Gbps. Now the issue seems to be gone. Performance-wise it is not noticeable for the hard disks, but for the SSD I have it is - I need to look further into that.

I'm still wondering what the difference is, though. Forcing the link speed down to 1.5 Gbps was not needed before. I tend to rule out hardware as the source of the issue, as the hardware is identical in both cases - not a single cable was touched.

CU
Alex
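Note that libata.force accepts per-port values, so if the SSD sits on its own port, the 1.5 Gbps limit only needs to apply to the ports that actually throw errors. A sketch, assuming (hypothetically) the SSD is on port 4:

# /boot/armbianEnv.txt - limit only the spinning disks, keep the SSD at 3.0 Gbps
extraargs=libata.force=1:1.5G,2:1.5G,3:1.5G,4:3.0G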
djurny
Posted 14 hours ago

Hi,

Not an expert here, but an educated guess would be that it is related to timing, or to block sizes towards the devices. Anything that might change or increase the number of ATA commands going to the disks would be my first place to look. You could experiment with enabling/disabling NCQ, reducing read-ahead, increasing chunk sizes in case you use [software] RAID, or increasing caching by tweaking the vm options. Perhaps also the I/O schedulers: the new kernel may have introduced I/O schedulers that behave differently from the older kernel's. A few of those knobs are sketched below.

Groetjes,
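For reference, a few of those knobs (sda stands for whichever disk sits behind the affected port; all paths are the standard sysfs ones):

# disable NCQ at boot, per port, via the same libata.force mechanism:
extraargs=libata.force=1:noncq,2:noncq,3:noncq,4:noncq

# or at runtime, per disk - a queue depth of 1 effectively disables NCQ:
echo 1 | sudo tee /sys/block/sda/device/queue_depth

# inspect / switch the I/O scheduler (the active one is shown in brackets):
cat /sys/block/sda/queue/scheduler
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

# reduce read-ahead (value is in 512-byte sectors, i.e. 256 = 128 KiB):
sudo blockdev --setra 256 /dev/sda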