freed00m Posted February 8, 2021

Hello, I've bought 3 new identical drives and put 2 of them in the Helios64 and 1 in my desktop. I ran the smartctl long and short tests on both Helios64 drives simultaneously, but all of them were aborted/interrupted by the host.

`# /usr/sbin/smartctl -a /dev/sda`

Output: http://ix.io/2OLt

On my desktop the long test from smartctl succeeds without error, and all 3 drives received the same care (I just bought them and installed them), so it's unlikely the drives are physically damaged.

The complete diagnostic log: http://ix.io/2OBr

So, does anyone have an idea why my SMART extended tests are being canceled?

Note: I've even tried the trick of running a background task to keep the drives from entering some vendor sleep mode:

`while true; do dd if=/dev/sda of=/dev/null count=1; sleep 60; done`
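For reference, this is roughly how I start and check the tests (a sketch; the device name varies per drive):

```
# Kick off the extended (long) self-test; it runs inside the drive's firmware
sudo smartctl -t long /dev/sda

# Poll the status; an aborted run shows up in the self-test log
sudo smartctl -a /dev/sda | grep -i -A 1 'self-test'
```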
Gareth Halfacree Posted February 8, 2021

Check your logs for host resets:

`dmesg -T | grep DRDY`

If you're seeing those, then you've likely got this issue.
gprovost Posted February 9, 2021

@freed00m Could you disable NCQ and see if you manage to complete the SMART extended test? Edit /boot/armbianEnv.txt and add the following line:

`extraargs=libata.force=noncq`
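If no extraargs line exists yet, something like this would do it (append the option to the existing line instead if there is one); the kernel parameter only takes effect after a reboot:

```
echo 'extraargs=libata.force=noncq' | sudo tee -a /boot/armbianEnv.txt
sudo reboot
```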
freed00m Posted February 9, 2021 (edited)

Hi, sorry for the slow reply; the forum rules only allow newbies to make a 2nd post after 24h.

@Gareth Halfacree Thanks for the tip, but there was no DRDY event in dmesg after running the test. I've also tried moving /dev/sdb from the 2nd SATA position to the 3rd. No effect.

@gprovost Yes, I've added the line, and this is how armbianEnv.txt in /boot looks now. The tests are still being interrupted.

```
verbosity=1
bootlogo=false
overlay_prefix=rockchip
rootdev=UUID=e4e3bcd6-3f03-4362-bbe0-f1654138c5d8
rootfstype=ext4
extraargs=libata.force=noncq
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u
```

usbstoragequirks? How did that get there? Does it mean the drives attached themselves as UAS, and could that make the test fail?

I've never formatted or used the drives; I wanted a successful long test before putting them into daily use. I will do some more testing and troubleshooting. I really hope the SATA harness cable is not damaged. Will post more.

----

Fun: reading threads like https://community.synology.com/enu/forum/1/post/123516 makes me think the problem is very common for some drives.
gprovost Posted February 10, 2021

9 hours ago, freed00m said:

> usbstoragequirks? How did that get there? Does it mean the drives attached themselves as UAS, and could that make the test fail?

This is a quirk for UAS devices applied to every Armbian release; it is not specific to Helios64.

https://github.com/armbian/build/blob/2b1306443d973033c6f2cef7b221f5c25f0af98d/packages/bsp/common/usr/lib/armbian/armbian-hardware-optimization#L379
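As a gloss (my reading, not from the script itself): the `u` flag in each `VID:PID:u` entry tells the kernel's usb-storage layer to ignore UAS for that device, so it falls back to plain usb-storage. Roughly, the runtime effect is something like:

```
# Illustrative sketch only; the exact mechanism is in the linked script.
# ':u' means "ignore UAS" for the given VID:PID pair.
echo '0x2537:0x1066:u,0x2537:0x1068:u' | sudo tee /sys/module/usb_storage/parameters/quirks
```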
freed00m Posted February 11, 2021

Mystery solved! After discovering that the SMART tests succeed in the Windows WD WinDTF tool but fail in the same manner on a different machine running Arch Linux, I knew it was an incompatibility between the drives and smartctl.

The problem with `while true; do dd if=/dev/sda of=/dev/null count=1; sleep 60; done` was that it did not prevent the disks from sleeping: the WD Gold drives have a 256MB buffer cache, so the periodic reads were apparently served from cache without waking the drive. To prevent this, the dd has to use iflag=direct so the drive won't go to sleep, though I really don't understand why.

An even better solution is to query smartctl -a periodically to keep the drive awake. Running this lets my tests complete:

`# watch -n 60 /usr/sbin/smartctl -a /dev/sda`

Anyhow, is this solvable on the smartctl side? Should I open an issue on smartmontools, or is this common behavior?
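Spelled out, the keep-awake loop that actually works looks like this (same idea as before, just with direct I/O; adjust the device and interval as needed):

```
# Read one sector straight from the disk every 60 s; iflag=direct bypasses
# caching so the read reaches the drive itself and it stays awake.
while true; do
    dd if=/dev/sda of=/dev/null bs=512 count=1 iflag=direct 2>/dev/null
    sleep 60
done
```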
clostro Posted February 11, 2021

Have you tried checking the sleep/spin-down timers on the disks with hdparm?

`hdparm -I /dev/sd[a-e] | grep level`

And here is a chart for interpreting the APM values in the output: http://www.howtoeverything.net/linux/hardware/list-timeout-values-hdparm-s
freed00m Posted February 13, 2021 Author Posted February 13, 2021 (edited) kobol:~:% sudo hdparm -I /dev/sd[a-e] | grep level [sudo] password for frdm: Advanced power management level: 254 Advanced power management level: 254 Hi, the levels ar 254, which is by that APM value list @clostro posted a reserved value. Maybe if I set a non spindown value it might solve the testing issue, I might try next time I want to run longtest.- It was just Armbian 21.02.1 Buster with Linux 5.10.12-rockchip64 image installed recently, so everything is somewhat default. Edited February 13, 2021 by freed00m 0 Quote
clostro Posted February 14, 2021

Putting aside the discussion about disk health and spin-ups and spin-downs, a non-spin-down value might solve your issue here. You can take a look at both the -S and -B options. I couldn't entirely figure out the difference between their 'set' values. They both supposedly set the APM value, but aside from -S putting the drives to sleep immediately and then setting the sleep timer, they have different definitions for the level values.

From https://man7.org/linux/man-pages/man8/hdparm.8.html

For instance, -B:

> Possible settings range from values 1 through 127 (which permit spin-down), and values 128 through 254 (which do not permit spin-down). The highest degree of power management is attained with a setting of 1, and the highest I/O performance with a setting of 254. A value of 255 tells hdparm to disable Advanced Power Management altogether on the drive

and -S:

> Put the drive into idle (low-power) mode, and also set the standby (spindown) timeout for the drive.

> A value of zero means "timeouts are disabled": the device will not automatically enter standby mode. Values from 1 to 240 specify multiples of 5 seconds, yielding timeouts from 5 seconds to 20 minutes. Values from 241 to 251 specify from 1 to 11 units of 30 minutes, yielding timeouts from 30 minutes to 5.5 hours. A value of 252 signifies a timeout of 21 minutes. A value of 253 sets a vendor-defined timeout period between 8 and 12 hours, and the value 254 is reserved. 255 is interpreted as 21 minutes plus 15 seconds. Note that some older drives may have very different interpretations of these values.

As you can see, the value of 255 and the other special levels differ between -S and -B, but the definitions also sound like they are doing the same thing. I would like to learn if anyone can clarify the difference.
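For reference, here is how you'd set each (a sketch based on the man page excerpts above; adjust device names as needed):

```
# Forbid APM-initiated spin-down (128-254 do not permit spin-down;
# 254 is the highest-performance setting)
sudo hdparm -B 254 /dev/sda

# Disable the standby (spin-down) timer entirely (0 = timeouts disabled)
sudo hdparm -S 0 /dev/sda

# Confirm the APM setting took effect
sudo hdparm -I /dev/sda | grep -i 'power management level'
```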