Posts posted by Fred Fettinger
On 11/26/2020 at 9:25 PM, gprovost said:
cat /proc/device-tree/serial-number
Are there any known serial numbers which were not flashed properly?
I'm starting to troubleshoot similar behavior and mine outputs:
000100001425
Posted in: SATA issue, drive resets: ataX.00: failed command: READ FPDMA QUEUED (Rockchip forum)
To add another data point, I have 5 of these drives in a raid6 configuration with an ext4 filesystem.
I moved the drives into the helios64 from another server where they were all running without issues.
As far as I know, I have not yet experienced a failure that caused data loss, or even a temporary loss of access to a disk that needed a reboot to fix. I noticed errors similar to the below during the initial raid sync. As others have noted in this thread, this indicates data corruption during the transfer, but in my case the retries seem to succeed. However, after enough failures the link is dropped to 3.0 Gbps, and after that I don't see any more errors on that device.
I'm able to reproduce similar errors on one of the drives within a few minutes by running stress (with the current directory on the raid6 filesystem). Within a few hours, I will see many similar errors on all 5 drives, and all of them get reset to 3.0 Gbps.
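While a stress run is going, the resulting kernel messages can be tallied per ATA port with a quick grep/sed pipeline. A minimal sketch, using sample dmesg-style lines in place of the real log (the function name `count_sata_errors` is mine, not from any tool):

```shell
# Count SATA error lines per ATA port from kernel-log text on stdin.
count_sata_errors() {
  grep -Ei 'badcrc|failed command: READ FPDMA QUEUED' \
    | sed -E 's/.*(ata[0-9]+)\.00.*/\1/' \
    | sort | uniq -c
}

# Sample lines standing in for real `dmesg` output:
printf '%s\n' \
  'ata1.00: failed command: READ FPDMA QUEUED' \
  'ata1.00: status: { DRDY ERR }' \
  'ata3.00: BadCRC' \
  'ata1.00: failed command: READ FPDMA QUEUED' \
  | count_sata_errors
```

Piping real `dmesg` output through `count_sata_errors` gives a rough per-port error count to watch while testing.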
I think it's reasonable to conclude that these errors occur in transit on the cable, because the drives' SMART stats do not seem to show a corresponding increase when the kernel log is full of BadCRC errors. I was checking UDMA_CRC_Error_Count specifically since it seemed most relevant to the errors above, but I'm not very familiar with SMART so there may be something else that I should check.
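For anyone wanting to script that check: in the usual `smartctl -A` table, the raw attribute value is the last field of the UDMA_CRC_Error_Count line. A sketch using a captured sample line in place of live output (the real command would be something like `smartctl -A /dev/sda`):

```shell
# Pull the raw UDMA_CRC_Error_Count value out of `smartctl -A` output.
# A sample attribute line stands in for live smartctl output here.
smart_sample='199 UDMA_CRC_Error_Count  0x003e  200  200  000  Old_age  Always  -  7'
printf '%s\n' "$smart_sample" | awk '/UDMA_CRC_Error_Count/ { print $NF }'
```

Recording this value for each drive before and after a stress run makes it easy to see whether the cable-level CRC counter actually moves.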
I tried some modifications to /boot/armbianEnv.txt to see how they would affect the problem:
1. extraargs=libata.force=noncq had no impact; I saw similar errors on multiple disks within 30 minutes
2. extraargs=libata.force=noncq,3.0 produced no errors after 2 hours of testing
3. extraargs=libata.force=3.0 produced no errors after 2 hours of testing
I'd prefer to err on the side of reliability, so I'm going to stick with extraargs=libata.force=3.0 for now.
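After rebooting with that setting, the negotiated speed can be confirmed from the kernel log (e.g. `dmesg | grep -i 'SATA link up'`). A sketch that extracts the speed from a sample log line in the usual format:

```shell
# Extract the negotiated link speed from a "SATA link up" kernel message.
# The sample line below stands in for real `dmesg` output.
printf '%s\n' 'ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)' \
  | grep -oE '[0-9.]+ Gbps'
```

If libata.force=3.0 took effect, every drive's link should report 3.0 Gbps rather than 6.0 Gbps.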
I wonder how many people are seeing behavior similar to mine and simply haven't noticed these errors in their kernel logs?
I had similar errors with a Banana Pi board a few years ago, when I used an internal SATA cable to connect to an external drive enclosure. In that case, switching to a properly shielded eSATA cable resolved the issue.
I could try replacing the SATA cables to see if that reduces the error rate. Does anyone have recommendations for a specific set of replacement cables that has worked well?