freed00m Posted February 8, 2021

Hello, I've bought 3 new identical drives and put 2 of them in the Helios64 and 1 in my desktop. I ran the smartctl long and short tests on both Helios64 drives simultaneously, but all of them were aborted/interrupted by the host.

`# /usr/sbin/smartctl -a /dev/sda`

Output: http://ix.io/2OLt

On my desktop the long test from smartctl succeeds without error, and all 3 drives received the same care (I just bought them and installed them), so it's unlikely the drives are physically damaged.

The complete diagnostic log: http://ix.io/2OBr

So, does anyone have an idea why my SMART extended tests are being canceled?

Note: I've even tried the trick of running a background task to keep the drives from entering some vendor sleep mode:

`while true; do dd if=/dev/sda of=/dev/null count=1; sleep 60; done`
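For reference, this is roughly how I start and check the tests (a sketch; the device name varies per drive):

```
# Kick off the extended (long) self-test; it runs inside the drive's firmware
sudo smartctl -t long /dev/sda

# Poll the status; an aborted run shows up in the self-test log
sudo smartctl -a /dev/sda | grep -i -A 1 'self-test'
```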
Gareth Halfacree Posted February 8, 2021

Check your logs for host resets:

`dmesg -T | grep DRDY`

If you're seeing those, then you've likely got this issue.
gprovost Posted February 9, 2021

@freed00m Could you disable NCQ and see if you manage to complete the SMART extended test? Edit /boot/armbianEnv.txt and add the following line:

`extraargs=libata.force=noncq`
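If no extraargs line exists yet, something like this would do it (append the option to the existing line instead if there is one); the kernel parameter only takes effect after a reboot:

```
echo 'extraargs=libata.force=noncq' | sudo tee -a /boot/armbianEnv.txt
sudo reboot
```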
freed00m Posted February 9, 2021 (edited)

Hi, sorry for the slow reply; the forum rules only allow newbies to make a 2nd post after 24h.

@Gareth Halfacree Thanks for the tip, but there was no DRDY event in dmesg after running the test. I've also tried moving /dev/sdb from the 2nd SATA position to the 3rd. No effect.

@gprovost Yes, I've added the line, and this is how armbianEnv.txt in /boot looks now. The tests are still being interrupted.

```
verbosity=1
bootlogo=false
overlay_prefix=rockchip
rootdev=UUID=e4e3bcd6-3f03-4362-bbe0-f1654138c5d8
rootfstype=ext4
extraargs=libata.force=noncq
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u
```

usbstoragequirks? How did that get there? Does it mean the drives attached themselves as UAS, and could that make the test fail?

I've never formatted or used the drives; I wanted a successful long test before putting them into daily use. I will do some more testing and troubleshooting. I really hope the SATA harness cable is not damaged. Will post more.

----

Fun: reading threads like https://community.synology.com/enu/forum/1/post/123516 makes me think the problem is very common for some drives.
gprovost Posted February 10, 2021

9 hours ago, freed00m said:

> usbstoragequirks? How did that get there? Does it mean the drives attached themselves as UAS, and could that make the test fail?

This is a quirk for UAS devices applied to every Armbian release; it is not specific to Helios64.

https://github.com/armbian/build/blob/2b1306443d973033c6f2cef7b221f5c25f0af98d/packages/bsp/common/usr/lib/armbian/armbian-hardware-optimization#L379
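As a gloss (my reading, not from the script itself): the `u` flag in each `VID:PID:u` entry tells the kernel's usb-storage layer to ignore UAS for that device, so it falls back to plain usb-storage. Roughly, the runtime effect is something like:

```
# Illustrative sketch only; the exact mechanism is in the linked script.
# ':u' means "ignore UAS" for the given VID:PID pair.
echo '0x2537:0x1066:u,0x2537:0x1068:u' | sudo tee /sys/module/usb_storage/parameters/quirks
```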
freed00m Posted February 11, 2021

Mystery solved! After discovering that the SMART tests succeed in the Windows WD WinDTF tool but fail in the same manner on a different machine running Arch Linux, I knew it was an incompatibility between the drives and smartctl.

The problem with `while true; do dd if=/dev/sda of=/dev/null count=1; sleep 60; done` was that it did not prevent the disks from sleeping: the WD Gold drives have a 256MB buffer cache, so the periodic reads were apparently served from cache without waking the drive. To prevent this, the dd has to use iflag=direct so the drive won't go to sleep, though I really don't understand why.

An even better solution is to query smartctl -a periodically to keep the drive awake. Running this lets my tests complete:

`# watch -n 60 /usr/sbin/smartctl -a /dev/sda`

Anyhow, is this solvable on the smartctl side? Should I open an issue on smartmontools, or is this common behavior?
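Spelled out, the keep-awake loop that actually works looks like this (same idea as before, just with direct I/O; adjust the device and interval as needed):

```
# Read one sector straight from the disk every 60 s; iflag=direct bypasses
# caching so the read reaches the drive itself and it stays awake.
while true; do
    dd if=/dev/sda of=/dev/null bs=512 count=1 iflag=direct 2>/dev/null
    sleep 60
done
```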
clostro Posted February 11, 2021

Have you tried checking the sleep/spin-down timers on the disks with hdparm?

`hdparm -I /dev/sd[a-e] | grep level`

And here is a chart for interpreting the APM values in the output: http://www.howtoeverything.net/linux/hardware/list-timeout-values-hdparm-s
freed00m Posted February 13, 2021 Author Posted February 13, 2021 (edited) kobol:~:% sudo hdparm -I /dev/sd[a-e] | grep level [sudo] password for frdm: Advanced power management level: 254 Advanced power management level: 254 Hi, the levels ar 254, which is by that APM value list @clostro posted a reserved value. Maybe if I set a non spindown value it might solve the testing issue, I might try next time I want to run longtest.- It was just Armbian 21.02.1 Buster with Linux 5.10.12-rockchip64 image installed recently, so everything is somewhat default. Edited February 13, 2021 by freed00m 0 Quote
clostro Posted February 14, 2021

Putting aside the discussion about disk health and spin-ups and spin-downs, a non-spin-down value might solve your issue here. You can take a look at both the -S and -B options. I couldn't entirely figure out the difference between their 'set' values. They both supposedly set the APM value, but aside from -S putting the drives to sleep immediately and then setting the sleep timer, they have different definitions for the level values.

From https://man7.org/linux/man-pages/man8/hdparm.8.html

For instance, -B:

> Possible settings range from values 1 through 127 (which permit spin-down), and values 128 through 254 (which do not permit spin-down). The highest degree of power management is attained with a setting of 1, and the highest I/O performance with a setting of 254. A value of 255 tells hdparm to disable Advanced Power Management altogether on the drive

and -S:

> Put the drive into idle (low-power) mode, and also set the standby (spindown) timeout for the drive.

> A value of zero means "timeouts are disabled": the device will not automatically enter standby mode. Values from 1 to 240 specify multiples of 5 seconds, yielding timeouts from 5 seconds to 20 minutes. Values from 241 to 251 specify from 1 to 11 units of 30 minutes, yielding timeouts from 30 minutes to 5.5 hours. A value of 252 signifies a timeout of 21 minutes. A value of 253 sets a vendor-defined timeout period between 8 and 12 hours, and the value 254 is reserved. 255 is interpreted as 21 minutes plus 15 seconds. Note that some older drives may have very different interpretations of these values.

As you can see, the value of 255 and the other special levels differ between -S and -B, but the definitions also sound like they are doing the same thing. I would like to learn if anyone can clarify the difference.
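For reference, here is how you'd set each (a sketch based on the man page excerpts above; adjust device names as needed):

```
# Forbid APM-initiated spin-down (128-254 do not permit spin-down;
# 254 is the highest-performance setting)
sudo hdparm -B 254 /dev/sda

# Disable the standby (spin-down) timer entirely (0 = timeouts disabled)
sudo hdparm -S 0 /dev/sda

# Confirm the APM setting took effect
sudo hdparm -I /dev/sda | grep -i 'power management level'
```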