Fred Fettinger

Everything posted by Fred Fettinger

  1. To add another data point, I have 5 of these drives in a RAID6 configuration with an ext4 filesystem:

     Model Family: Western Digital Caviar Black
     Device Model: WDC WD2002FAEX-007BA0
     Serial Number: WD-WMAY02794530
     LU WWN Device Id: 5 0014ee 002d98d8f
     Firmware Version: 05.01D05
     User Capacity: 2,000,398,934,016 bytes [2.00 TB]

     I moved the drives into the Helios64 from another server where they were all running without issues. As far as I know, I have not yet experienced a failure that caused data loss, or even a temporary loss of access to a disk that had to be fixed with a reboot.

     I noticed errors similar to the ones below during the initial RAID sync. As others have noted on this thread, this indicates data corruption during the transfer, but in my case the retries seem to succeed. However, after enough failures the kernel limits the link to 3.0 Gbps, and afterwards I don't see any more errors on that device.

     Apr 2 23:00:17 localhost kernel: [ 1070.137017] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x380100 action 0x6
     Apr 2 23:00:17 localhost kernel: [ 1070.137032] ata3.00: irq_stat 0x08000000
     Apr 2 23:00:17 localhost kernel: [ 1070.137040] ata3: SError: { UnrecovData 10B8B Dispar BadCRC }
     Apr 2 23:00:17 localhost kernel: [ 1070.137049] ata3.00: failed command: READ DMA EXT
     Apr 2 23:00:17 localhost kernel: [ 1070.137061] ata3.00: cmd 25/00:00:00:80:2f/00:04:00:00:00/e0 tag 20 dma 524288 in
     Apr 2 23:00:17 localhost kernel: [ 1070.137061] res 50/00:00:ff:87:2f/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
     Apr 2 23:00:17 localhost kernel: [ 1070.137066] ata3.00: status: { DRDY }
     Apr 2 23:00:17 localhost kernel: [ 1070.137079] ata3: hard resetting link
     Apr 2 23:00:18 localhost kernel: [ 1070.612946] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
     Apr 2 23:00:18 localhost kernel: [ 1070.622339] ata3.00: configured for UDMA/133
     Apr 2 23:00:18 localhost kernel: [ 1070.622453] ata3: EH complete
     Apr 2 23:00:55 localhost kernel: [ 1108.163913] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x380100 action 0x6
     Apr 2 23:00:55 localhost kernel: [ 1108.163927] ata3.00: irq_stat 0x08000000
     Apr 2 23:00:55 localhost kernel: [ 1108.163935] ata3: SError: { UnrecovData 10B8B Dispar BadCRC }
     Apr 2 23:00:55 localhost kernel: [ 1108.163944] ata3.00: failed command: READ DMA EXT
     Apr 2 23:00:55 localhost kernel: [ 1108.163957] ata3.00: cmd 25/00:00:00:cc:2e/00:04:00:00:00/e0 tag 31 dma 524288 in
     Apr 2 23:00:55 localhost kernel: [ 1108.163957] res 50/00:00:4f:46:2f/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
     Apr 2 23:00:55 localhost kernel: [ 1108.163962] ata3.00: status: { DRDY }
     Apr 2 23:00:55 localhost kernel: [ 1108.163975] ata3: hard resetting link
     Apr 2 23:00:56 localhost kernel: [ 1108.639872] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
     Apr 2 23:00:56 localhost kernel: [ 1108.649317] ata3.00: configured for UDMA/133
     Apr 2 23:00:56 localhost kernel: [ 1108.649434] ata3: EH complete
     Apr 2 23:01:27 localhost kernel: [ 1139.690980] ata3: limiting SATA link speed to 3.0 Gbps
     Apr 2 23:01:27 localhost kernel: [ 1139.690997] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x980000 action 0x6 frozen
     Apr 2 23:01:27 localhost kernel: [ 1139.691004] ata3: SError: { 10B8B Dispar LinkSeq }
     Apr 2 23:01:27 localhost kernel: [ 1139.691014] ata3.00: failed command: WRITE DMA EXT
     Apr 2 23:01:27 localhost kernel: [ 1139.691026] ata3.00: cmd 35/00:40:80:b2:2f/00:05:00:00:00/e0 tag 16 dma 688128 out
     Apr 2 23:01:27 localhost kernel: [ 1139.691026] res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
     Apr 2 23:01:27 localhost kernel: [ 1139.691032] ata3.00: status: { DRDY }
     Apr 2 23:01:27 localhost kernel: [ 1139.691045] ata3: hard resetting link
     Apr 2 23:01:27 localhost kernel: [ 1140.166952] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
     Apr 2 23:01:27 localhost kernel: [ 1140.177164] ata3.00: configured for UDMA/133
     Apr 2 23:01:27 localhost kernel: [ 1140.177269] ata3: EH complete

     I'm able to reproduce similar errors on one of the drives within a few minutes by running stress (with the current directory on the RAID6 filesystem):

     stress --hdd 4 &

     Within a few hours, I see many similar errors on all 5 drives, and all of them get limited to 3.0 Gbps.

     I think it's reasonable to conclude that these errors occur in transit on the cable, because the drives' SMART stats do not seem to show a corresponding increase when the kernel log is full of BadCRC errors. I was checking UDMA_CRC_Error_Count specifically since it seemed most relevant to the errors above, but I'm not very familiar with SMART, so there may be something else that I should check.

     I tried some modifications to /boot/armbianEnv.txt to see how they would affect the problem:

       1. extraargs=libata.force=noncq didn't have an impact; I saw similar errors on multiple disks within 30 minutes
       2. extraargs=libata.force=noncq,3.0 saw no errors after 2 hours of testing
       3. extraargs=libata.force=3.0 saw no errors after 2 hours of testing

     I'd prefer to err on the side of reliability, so I'm going to stick with extraargs=libata.force=3.0 for now.

     I wonder how many people are seeing behavior similar to mine but simply haven't noticed these errors in their kernel logs?

     I had similar errors with a Banana Pi board a few years ago when I used an internal SATA cable to connect to an external drive enclosure. In that case, switching to an eSATA cable with the appropriate shielding resolved the issue. I could try replacing the SATA cables to see if that reduces the error rate. Does anyone have recommendations for a specific set of replacement cables that has worked well?
  2. Are there any known serial numbers which were not flashed properly? I'm starting to troubleshoot similar behavior and mine outputs: 000100001425
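
For anyone who wants to check their own kernel log for the pattern described in the first post (BadCRC errors followed by the link being limited to 3.0 Gbps), here is a rough Python sketch that tallies both per ATA port. It is not from the original poster: the function name `summarize` and the regular expressions are my own, and they assume the libata message format looks exactly like the log lines quoted in that post. The sample lines are copied from it.

```python
import re
import sys
from collections import Counter

# Assumed message formats, based only on the log lines quoted in the post.
SERROR_RE = re.compile(r"\b(ata\d+): SError: \{ ([^}]*)\}")
LIMIT_RE = re.compile(r"\b(ata\d+): limiting SATA link speed to ([\d.]+ Gbps)")

# Four lines copied from the post, used as a small self-check.
SAMPLE = """\
Apr 2 23:00:17 localhost kernel: [ 1070.137040] ata3: SError: { UnrecovData 10B8B Dispar BadCRC }
Apr 2 23:00:55 localhost kernel: [ 1108.163935] ata3: SError: { UnrecovData 10B8B Dispar BadCRC }
Apr 2 23:01:27 localhost kernel: [ 1139.690980] ata3: limiting SATA link speed to 3.0 Gbps
Apr 2 23:01:27 localhost kernel: [ 1139.691004] ata3: SError: { 10B8B Dispar LinkSeq }
"""


def summarize(log_text):
    """Return (BadCRC count per port, link speed limit per port)."""
    crc_errors = Counter()
    limited = {}
    for line in log_text.splitlines():
        m = SERROR_RE.search(line)
        # Only count SError lines whose flag list includes BadCRC.
        if m and "BadCRC" in m.group(2).split():
            crc_errors[m.group(1)] += 1
        m = LIMIT_RE.search(line)
        if m:
            # Remember the most recent speed limit reported for this port.
            limited[m.group(1)] = m.group(2)
    return crc_errors, limited


if __name__ == "__main__":
    # Read a log file given as the first argument, else fall back to SAMPLE.
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as fh:
            text = fh.read()
    else:
        text = SAMPLE
    crc_errors, limited = summarize(text)
    for port in sorted(set(crc_errors) | set(limited)):
        speed = limited.get(port, "not limited")
        print(f"{port}: {crc_errors[port]} BadCRC, link limit: {speed}")
```

Run it with the path to a saved kernel log as the only argument; with no argument it just processes the sample lines. If your kernel words these messages differently, the patterns will need adjusting.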