smartctl tests are always cancelled by host.



Hello,

 

I've bought 3 new identical drives and put 2 of them in the Helios64 and 1 onto my desktop.

I ran the smartctl long and short self-tests on both drives simultaneously, but all of them were aborted/interrupted by the host.

 

`# /usr/sbin/smartctl -a /dev/sda`

Output: http://ix.io/2OLt
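
For anyone reading the linked output, the number in parentheses on the `Self-test execution status` line of `smartctl -a` can be decoded by hand. The sketch below assumes the usual ATA convention (high nibble is the status code, low nibble is the remaining percentage in tens); the helper name `decode_selftest_status` is made up for illustration and is not part of smartmontools.

```shell
#!/bin/sh
# Hypothetical helper: decode the value shown in parentheses on
# smartctl's "Self-test execution status" line.
decode_selftest_status() {
    value="$1"
    code=$((value / 16))               # high nibble: status code
    remaining=$(((value % 16) * 10))   # low nibble: percent remaining, in tens
    case "$code" in
        0)  msg="completed without error or never run" ;;
        1)  msg="aborted by host" ;;
        2)  msg="interrupted by host reset" ;;
        15) msg="in progress" ;;
        *)  msg="other status code $code" ;;
    esac
    echo "$msg, ${remaining}% remaining"
}
```

For example, `decode_selftest_status 25` prints `aborted by host, 90% remaining`.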

 

On my desktop the smartctl long test succeeds without error. All 3 drives received the same care (I just bought and installed them), so it is unlikely that the drives are physically damaged.

 

The complete diagnostic log: http://ix.io/2OBr

 

Does anyone have an idea why my SMART extended tests are being cancelled?

 

Note: I've even tried the trick of running a background task to keep the drives out of any vendor-specific sleep: `while true; do dd if=/dev/sda of=/dev/null count=1; sleep 60; done`


Hi, sorry for replying after such a long time; the forum rules only allow newbies to make a 2nd post after 24h.

 

@Gareth Halfacree thx for tip, there was no DRDY event in dmesg after running the test.

I've also tried moving /dev/sdb from the 2nd SATA position to the 3rd SATA position. No effect.

 

@gprovost yes, I've added the line, but the tests are still being interrupted. This is how armbianEnv.txt in /boot looks now:

```
verbosity=1
bootlogo=false
overlay_prefix=rockchip
rootdev=UUID=e4e3bcd6-3f03-4362-bbe0-f1654138c5d8
rootfstype=ext4
extraargs=libata.force=noncq
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u
```
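
To confirm that the `extraargs` line actually reached the running kernel, one option is to look for it in `/proc/cmdline`. This is only a sketch: it assumes the Armbian boot script passes `extraargs` through to the kernel command line, and `has_kernel_arg` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel command line
# contains the exact argument.
has_kernel_arg() {
    cmdline="$1"
    wanted="$2"
    for arg in $cmdline; do
        [ "$arg" = "$wanted" ] && return 0
    done
    return 1
}

# Only inspect the running kernel when /proc/cmdline is readable.
if [ -r /proc/cmdline ]; then
    if has_kernel_arg "$(cat /proc/cmdline)" "libata.force=noncq"; then
        echo "libata.force=noncq is active"
    else
        echo "libata.force=noncq not found on the kernel command line"
    fi
fi
```

If the argument is missing after a reboot, the edit to armbianEnv.txt probably did not take effect.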

 

usbstoragequirks? How did that get there? Does it mean the drives attached themselves as UAS, and could that make the test fail?

 

I've never formatted and used the drives, I wanted to have successful longtest before using them daily.

 

I will do some more testing and troubleshooting. I really hope the SATA Harness cable is not damaged. Will post more.

 

----

 

Fun, reading threads like these https://community.synology.com/enu/forum/1/post/123516 makes me think the problem is very common for some drives.

 

Edited by freed00m
9 hours ago, freed00m said:

usbstoragequirks? How did that get there? Does it mean the drives attached themselves as UAS, and could that make the test fail?

 

This is a quirk for UAS devices that is applied to all Armbian releases; it is not specific to Helios64.

https://github.com/armbian/build/blob/2b1306443d973033c6f2cef7b221f5c25f0af98d/packages/bsp/common/usr/lib/armbian/armbian-hardware-optimization#L379


Mystery solved!

 

After I discovered that the SMART tests succeed in the Windows WD WinDTF tool and fail in the same manner on a different machine running Archlinux, I knew the issue was an incompatibility between these drives and smartctl.

 

The issue with `while true; do dd if=/dev/sda of=/dev/null count=1; sleep 60; done` was that it did not prevent the disk from going to sleep, due to the WD Gold drives having 256MB of cache.

 

To prevent this, dd has to be run with iflag=direct so the drive won't go to sleep, though I really don't understand why.

 

An even better solution is to query smartctl -a periodically to keep the drive from sleeping. Running this lets my tests complete:

 

`# watch -n 60 /usr/sbin/smartctl -a /dev/sda`
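
The same idea can be wrapped into a script that starts the long test and keeps polling the drive until the test is no longer running. This is a rough sketch, not a tested fix: it assumes `smartctl -c` prints the text `Self-test routine in progress` while a test is running, and `run_long_test` and `INTERVAL` are names invented here.

```shell
#!/bin/sh
# Poll interval in seconds; 60 matches the watch command above.
INTERVAL="${INTERVAL:-60}"

# Succeed while the drive reports a running self-test.
selftest_in_progress() {
    smartctl -c "$1" | grep -q 'Self-test routine in progress'
}

# Start an extended self-test, poll the drive until the test stops
# (each poll doubles as the keep-alive), then print the self-test log.
run_long_test() {
    dev="$1"
    smartctl -t long "$dev" || return 1
    while selftest_in_progress "$dev"; do
        sleep "$INTERVAL"
    done
    smartctl -l selftest "$dev"
}
```

Run as root it would be called as `run_long_test /dev/sda`; the final log shows whether the test completed or was aborted again.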

 

Anyhow, is this solvable on the smartctl side? Should I open an issue on smartmontools, or is this common behavior?

```
kobol:~:% sudo hdparm -I /dev/sd[a-e] | grep level
[sudo] password for frdm:
	Advanced power management level: 254
	Advanced power management level: 254
```

 

Hi, the levels are 254, which according to the APM value list @clostro posted is a reserved value. Maybe setting a non-spin-down value would solve the testing issue; I might try that next time I want to run a long test.

This is just the Armbian 21.02.1 Buster image with Linux 5.10.12-rockchip64, installed recently, so everything is more or less default.
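
As a quick cross-check of the reported level, the sketch below classifies an APM level using the ranges the hdparm man page gives for `-B`. It assumes the level printed by `hdparm -I` is on that same `-B` scale; `apm_class` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: classify an APM level using the ranges from
# the hdparm man page entry for -B.
apm_class() {
    level="$1"
    if [ "$level" -ge 1 ] && [ "$level" -le 127 ]; then
        echo "spin-down permitted"
    elif [ "$level" -ge 128 ] && [ "$level" -le 254 ]; then
        echo "spin-down not permitted"
    elif [ "$level" -eq 255 ]; then
        echo "APM disabled"
    else
        echo "out of range"
    fi
}
```

Under that assumption, `apm_class 254` prints `spin-down not permitted`, which would mean APM is not what puts these drives to sleep.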

 

 

 

 

Edited by freed00m

Putting aside the discussion about disk health and spin-ups and spin-downs, a non-spin-down value might solve your issue here.

 

You can take a look at both -S and -B options. I couldn't figure out the difference between their 'set' values entirely.  They are both supposedly setting the APM value, but aside from -S putting the drives to sleep immediately and then setting the sleep timer, they have different definitions for the level values.

 

From https://man7.org/linux/man-pages/man8/hdparm.8.html

For instance -B 

Quote

Possible settings range from values 1 through 127 (which permit spin-down), and values 128 through 254 (which do not permit spin-down). The highest degree of power management is attained with a setting of 1, and the highest I/O performance with a setting of 254. A value of 255 tells hdparm to disable Advanced Power Management altogether on the drive

 

and -S 

Quote

Put the drive into idle (low-power) mode, and also set the standby (spindown) timeout for the drive.

Quote

A value of zero means "timeouts are disabled": the device will not automatically enter standby mode. Values from 1 to 240 specify multiples of 5 seconds, yielding timeouts from 5 seconds to 20 minutes. Values from 241 to 251 specify from 1 to 11 units of 30 minutes, yielding timeouts from 30 minutes to 5.5 hours. A value of 252 signifies a timeout of 21 minutes. A value of 253 sets a vendor-defined timeout period between 8 and 12 hours, and the value 254 is reserved. 255 is interpreted as 21 minutes plus 15 seconds. Note that some older drives may have very different interpretations of these values.

 

 

As you can see, the value of 255 and the other special levels differ between -S and -B, yet the definitions also sound like they are doing the same thing.

I would like to learn if anyone can clarify the difference.
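
Going only by the man page excerpts quoted above, `-B` selects an APM level while `-S` sets an idle timer after which the drive enters standby, so the two number scales are unrelated. To make the `-S` encoding concrete, here is a small sketch that turns an `-S` value into the timeout the man page describes; `standby_timeout` is a name made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: translate an hdparm -S value into the standby
# timeout described in the man page excerpt quoted above.
standby_timeout() {
    v="$1"
    if [ "$v" -eq 0 ]; then
        echo "disabled"
    elif [ "$v" -ge 1 ] && [ "$v" -le 240 ]; then
        echo "$((v * 5)) seconds"            # multiples of 5 seconds
    elif [ "$v" -ge 241 ] && [ "$v" -le 251 ]; then
        echo "$(((v - 240) * 30)) minutes"   # 1 to 11 units of 30 minutes
    elif [ "$v" -eq 252 ]; then
        echo "21 minutes"
    elif [ "$v" -eq 253 ]; then
        echo "vendor-defined (8 to 12 hours)"
    elif [ "$v" -eq 254 ]; then
        echo "reserved"
    elif [ "$v" -eq 255 ]; then
        echo "21 minutes 15 seconds"
    else
        echo "out of range"
    fi
}
```

For instance, `standby_timeout 240` prints `1200 seconds` (20 minutes), while `standby_timeout 254` prints `reserved`, unlike `-B 254`, which the man page describes as the highest-performance APM setting.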
