Posts posted by Fred Fettinger
On 11/26/2020 at 9:25 PM, gprovost said:
cat /proc/device-tree/serial-number
Are there any known serial numbers which were not flashed properly?
I'm starting to troubleshoot similar behavior and mine outputs:
000100001425
Posted in: SATA issue, drive resets: ataX.00: failed command: READ FPDMA QUEUED (Rockchip forum)
To add another data point, I have 5 of these drives in a raid6 configuration with an ext4 filesystem.
I moved the drives into the helios64 from another server where they were all running without issues.
As far as I know, I have not yet experienced a failure that caused data loss, or even a temporary loss of access to a disk that needed a reboot to fix. I noticed errors similar to the below during the initial raid sync. As others have noted in this thread, this indicates data corruption during the transfer, but in my case the retries seem to succeed. However, after enough failures the link is dropped to 3.0 Gbps, and after that I don't see any more errors on that device.
I'm able to reproduce similar errors on one of the drives within a few minutes by running stress (with the current directory on the raid6 filesystem). Within a few hours, I will see many similar errors on all 5 drives, and all of them get reset to 3.0 Gbps.
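While a stress run is going, the resulting kernel messages can be tallied per ATA port with a quick grep/sed pipeline. A minimal sketch, using sample dmesg-style lines in place of the real log (the function name `count_sata_errors` is mine, not from any tool):

```shell
# Count SATA error lines per ATA port from kernel-log text on stdin.
count_sata_errors() {
  grep -Ei 'badcrc|failed command: READ FPDMA QUEUED' \
    | sed -E 's/.*(ata[0-9]+)\.00.*/\1/' \
    | sort | uniq -c
}

# Sample lines standing in for real `dmesg` output:
printf '%s\n' \
  'ata1.00: failed command: READ FPDMA QUEUED' \
  'ata1.00: status: { DRDY ERR }' \
  'ata3.00: BadCRC' \
  'ata1.00: failed command: READ FPDMA QUEUED' \
  | count_sata_errors
```

Piping real `dmesg` output through `count_sata_errors` gives a rough per-port error count to watch while testing.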
I think it's reasonable to conclude that these errors occur in transit on the cable, because the drives' SMART stats do not seem to show a corresponding increase when the kernel log is full of BadCRC errors. I was checking UDMA_CRC_Error_Count specifically since it seemed most relevant to the errors above, but I'm not very familiar with SMART so there may be something else that I should check.
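For anyone wanting to script that check: in the usual `smartctl -A` table, the raw attribute value is the last field of the UDMA_CRC_Error_Count line. A sketch using a captured sample line in place of live output (the real command would be something like `smartctl -A /dev/sda`):

```shell
# Pull the raw UDMA_CRC_Error_Count value out of `smartctl -A` output.
# A sample attribute line stands in for live smartctl output here.
smart_sample='199 UDMA_CRC_Error_Count  0x003e  200  200  000  Old_age  Always  -  7'
printf '%s\n' "$smart_sample" | awk '/UDMA_CRC_Error_Count/ { print $NF }'
```

Recording this value for each drive before and after a stress run makes it easy to see whether the cable-level CRC counter actually moves.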
I tried some modifications to /boot/armbianEnv.txt to see how they would affect the problem:
1. extraargs=libata.force=noncq had no impact; I saw similar errors on multiple disks within 30 minutes
2. extraargs=libata.force=noncq,3.0 produced no errors after 2 hours of testing
3. extraargs=libata.force=3.0 produced no errors after 2 hours of testing
I'd prefer to err on the side of reliability, so I'm going to stick with extraargs=libata.force=3.0 for now.
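After rebooting with that setting, the negotiated speed can be confirmed from the kernel log (e.g. `dmesg | grep -i 'SATA link up'`). A sketch that extracts the speed from a sample log line in the usual format:

```shell
# Extract the negotiated link speed from a "SATA link up" kernel message.
# The sample line below stands in for real `dmesg` output.
printf '%s\n' 'ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)' \
  | grep -oE '[0-9.]+ Gbps'
```

If libata.force=3.0 took effect, every drive's link should report 3.0 Gbps rather than 6.0 Gbps.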
I wonder how many people are seeing behavior similar to mine and simply haven't noticed these errors in their kernel logs?
I had similar errors with a Banana Pi board a few years ago, when I used an internal SATA cable to connect to an external drive enclosure. In that case, switching to a properly shielded eSATA cable resolved the issue.
I could try replacing the SATA cables to see if that reduces the error rate. Does anyone have recommendations for a specific set of replacement cables that has worked well?