I have been following the discussion for a while and would like to report where I am at. I would also like to ask that Kobol record this information in a systematic way, if that isn't happening already.
I have the following setup:
  pool: mypool
 state: ONLINE
  scan: resilvered 12K in 00:00:00 with 0 errors on Tue Feb 9 17:22:03 2021
config:

        NAME                                          STATE     READ WRITE CKSUM
        mypool                                        ONLINE       0     0     0
          raidz1-0                                    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7XXXXXXXX  ONLINE       0     0     0
Some weeks ago I noticed that the health of the zpool was DEGRADED. I checked and found that one device had READ errors and was marked as FAULTED. This had also resulted in UDMA CRC errors being recorded in the SMART stats for that drive. So I cleared the errors and ran a scrub to see what was going on:
sudo zpool scrub mypool
I monitored the scrub with
sudo watch zpool status
I saw quite quickly that all drives had started to accumulate READ errors. SMART also reported that all drives now had UDMA CRC errors.
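For anyone who wants to watch the same counters, this is roughly how the CRC errors can be read out of SMART (the device names here are examples; adjust them to your system):

```shell
# Check SMART attribute 199 (UDMA_CRC_Error_Count) on each drive.
# /dev/sd[a-e] is an assumption for a five-drive setup; adjust as needed.
for dev in /dev/sd[a-e]; do
  echo "== ${dev} =="
  sudo smartctl -A "${dev}" | grep -i 'UDMA_CRC_Error'
done
```

A rising value here between runs indicates new link-level transfer errors, which matches what I was seeing during the scrubs.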
It was clear that something bad was going on, so I contacted Kobol and we started to debug the issue together.
First I changed the SATA speed to 3Gbps by adding the following line to /boot/armbianEnv.txt
extraargs=libata.force=3.0
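To confirm after the reboot that the forced speed actually took effect, the negotiated link speed can be checked in the kernel log (a sketch; the exact wording can vary between kernel versions):

```shell
# Check the negotiated SATA link speed after boot.
# With libata.force=3.0 in place, the links should report 3.0 Gbps.
sudo dmesg | grep -i 'SATA link up'
```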
The results were similar, but I noticed that the errors started to show up a bit later in the scrubbing process. The UDMA CRC error count increased again.
Then I replaced the power and SATA cables with new ones, which unfortunately did not bring any improvement.
Then I disabled NCQ by adding the following to /boot/armbianEnv.txt
extraargs=libata.force=noncq
and reverted to SATA 6 Gbps by removing the 3 Gbps line introduced earlier.
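Whether NCQ is actually disabled can be verified through sysfs; with libata.force=noncq the reported queue depth should drop to 1 (again, the device names are examples):

```shell
# With NCQ disabled, each drive's queue depth should read 1.
# sda..sde are assumed names for a five-drive setup; adjust as needed.
for dev in sda sdb sdc sdd sde; do
  echo -n "${dev}: "
  cat "/sys/block/${dev}/device/queue_depth"
done
```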
This had a positive result and I was able to run the scrub without any errors.
Then I went back, installed the old (original) cable harness again and retested - all good.
While disabling NCQ has a positive impact on the errors, it also has a negative impact on speed and, to some extent, on the drives' health.
I have also tried reducing the NCQ queue depth to 31, which is the recommended value, however this did not have any impact.
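For reference, the queue depth can also be changed at runtime without a reboot, which makes it quick to test different values (a sketch; device names are assumptions and the setting does not persist across reboots):

```shell
# Lower the NCQ queue depth at runtime for each drive.
# Takes effect immediately, but is lost on reboot.
for dev in sda sdb sdc sdd sde; do
  echo 31 | sudo tee "/sys/block/${dev}/device/queue_depth"
done
```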
I hope that Kobol will use this information to try to reproduce the issue themselves, to see whether only certain boards are affected or every board.