Armbian 21.05.2 Focal with Linux 5.10.35-rockchip64: fancontrol dies with an error, fans not spinning



Posting here, as was recommended on Twitter.

After updating my helios64 earlier this week and rebooting to get the new kernel, I realized it was suspiciously silent.

A quick look at the sensor temperature readings and a physical check confirmed that the fans were not spinning.

 

After a quick read on the wiki, I checked fancontrol which was indeed failing:

root@helios64:~ # systemctl status fancontrol.service
● fancontrol.service - fan speed regulator
     Loaded: loaded (/lib/systemd/system/fancontrol.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/fancontrol.service.d
             └─pid.conf
     Active: failed (Result: exit-code) since Fri 2021-05-28 00:08:13 CEST; 1min 42s ago
       Docs: man:fancontrol(8)
             man:pwmconfig(8)
    Process: 2495 ExecStartPre=/usr/sbin/fancontrol --check (code=exited, status=0/SUCCESS)
    Process: 2876 ExecStart=/usr/sbin/fancontrol (code=exited, status=1/FAILURE)
   Main PID: 2876 (code=exited, status=1/FAILURE)

May 28 00:08:13 helios64 fancontrol[2876]:   MINPWM=0
May 28 00:08:13 helios64 fancontrol[2876]:   MAXPWM=255
May 28 00:08:13 helios64 fancontrol[2876]:   AVERAGE=1
May 28 00:08:13 helios64 fancontrol[2876]: Error: file /dev/thermal-cpu/temp1_input doesn't exist
May 28 00:08:13 helios64 fancontrol[2876]: Error: file /dev/thermal-cpu/temp1_input doesn't exist
May 28 00:08:13 helios64 fancontrol[2876]: At least one referenced file is missing. Either some required kernel
May 28 00:08:13 helios64 fancontrol[2876]: modules haven't been loaded, or your configuration file is outdated.
May 28 00:08:13 helios64 fancontrol[2876]: In the latter case, you should run pwmconfig again.
May 28 00:08:13 helios64 systemd[1]: fancontrol.service: Main process exited, code=exited, status=1/FAILURE
May 28 00:08:13 helios64 systemd[1]: fancontrol.service: Failed with result 'exit-code'.

 

Basically, fancontrol expects a device in /dev to read the sensor values from, and that device is missing. After a bit of poking around and learning about udev, I managed to work around the issue by recreating the device symlink manually:

/usr/bin/mkdir /dev/thermal-cpu/
ln -s /sys/devices/virtual/thermal/thermal_zone0/temp /dev/thermal-cpu/temp1_input
systemctl restart fancontrol.service
systemctl status fancontrol.service
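
For anyone who wants to script the manual workaround above, here is a minimal sketch of a re-runnable version. link_sensor is a hypothetical helper name, not something shipped by Armbian. It assumes the same source and target paths as the commands above, and uses mkdir -p and ln -sf so that a second run does not fail on an existing directory or link.

```shell
#!/bin/sh
# Sketch only: link_sensor is a hypothetical helper, not part of Armbian.
# $1 = sysfs file to read from, $2 = path fancontrol is configured to read.
link_sensor() {
    src=$1
    dst=$2
    # Refuse to create a link to something that cannot be read.
    if [ ! -r "$src" ]; then
        echo "source $src is not readable" >&2
        return 1
    fi
    # -p: no error if the directory already exists; -f: replace an old link.
    mkdir -p "$(dirname "$dst")" && ln -sf "$src" "$dst"
}

# On the Helios64 (as root), with the paths from this post:
#   link_sensor /sys/devices/virtual/thermal/thermal_zone0/temp /dev/thermal-cpu/temp1_input
#   systemctl restart fancontrol.service
```

Like the manual commands, this does not survive a reboot, since /dev is recreated at boot.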

Digging further, this issue happens because udev is not creating the symlink like it should. After reading the rule in /etc/udev/rules.d/90-helios64-hwmon-legacy.rules and a bit of udev documentation, I found out how to test it:

root@helios64:~ # udevadm test /sys/devices/virtual/thermal/thermal_zone0
[...]
Reading rules file: /etc/udev/rules.d/90-helios64-hwmon-legacy.rules
Reading rules file: /etc/udev/rules.d/90-helios64-ups.rules
[...]
DEVPATH=/devices/virtual/thermal/thermal_zone0
ACTION=add
SUBSYSTEM=thermal
IS_HELIOS64_HWMON=1
HWMON_PATH=/sys/devices/virtual/thermal/thermal_zone0
USEC_INITIALIZED=7544717
run: '/bin/ln -sf /sys/devices/virtual/thermal/thermal_zone0 ' <-- something is wrong here, there is no target
Unload module index
Unloaded link configuration context.

After spending a bit more time reading the udev rule, I realized that the second argument was empty because we don't match the ATTR{type}=="soc-thermal" condition. We can look up the types like this:

root@helios64:~ # find /sys/ -name type | grep thermal
/sys/devices/virtual/thermal/cooling_device1/type
/sys/devices/virtual/thermal/thermal_zone0/type
/sys/devices/virtual/thermal/cooling_device4/type
/sys/devices/virtual/thermal/cooling_device2/type
/sys/devices/virtual/thermal/thermal_zone1/type
/sys/devices/virtual/thermal/cooling_device0/type
/sys/devices/virtual/thermal/cooling_device3/type
/sys/firmware/devicetree/base/thermal-zones/gpu/trips/gpu_alert0/type
/sys/firmware/devicetree/base/thermal-zones/gpu/trips/gpu_crit/type
/sys/firmware/devicetree/base/thermal-zones/cpu/trips/cpu_crit/type
/sys/firmware/devicetree/base/thermal-zones/cpu/trips/cpu_alert0/type
/sys/firmware/devicetree/base/thermal-zones/cpu/trips/cpu_alert1/type
root@helios64:~ # cat /sys/devices/virtual/thermal/thermal_zone0/type
cpu <-- we were expecting soc-thermal
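
A shorter way to see which type each zone reports, as a sketch: list_thermal_types is a hypothetical helper that walks the thermal_zone* directories under /sys/devices/virtual/thermal (the path seen above) and prints each zone name next to the content of its type file.

```shell
#!/bin/sh
# Sketch only: list_thermal_types is a hypothetical helper name.
# $1 = directory holding the thermal_zone* entries (defaults to the path
#      seen in this post).
list_thermal_types() {
    base=${1:-/sys/devices/virtual/thermal}
    for zone in "$base"/thermal_zone*; do
        # Skip anything without a readable type file (also covers the
        # case where the glob matched nothing).
        [ -r "$zone/type" ] || continue
        printf '%s %s\n' "$(basename "$zone")" "$(cat "$zone/type")"
    done
}

# On the board: list_thermal_types
```

On the system from this post, thermal_zone0 would show up with the type cpu, which is the value the legacy rule does not match.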

After rewriting the line with the new type, udev is happy again:

# Edit in /etc/udev/rules.d/90-helios64-hwmon-legacy.rules and add the following line after the original one
ATTR{type}=="cpu", ENV{HWMON_PATH}="/sys%p/temp", ENV{HELIOS64_SYMLINK}="/dev/thermal-cpu/temp1_input", RUN+="/usr/bin/mkdir /dev/thermal-cpu/"

root@helios64:~ # udevadm control --reload
root@helios64:~ # udevadm test /sys/devices/virtual/thermal/thermal_zone0
[...]
DEVPATH=/devices/virtual/thermal/thermal_zone0
ACTION=add
SUBSYSTEM=thermal
IS_HELIOS64_HWMON=1
HWMON_PATH=/sys/devices/virtual/thermal/thermal_zone0/temp
HELIOS64_SYMLINK=/dev/thermal-cpu/temp1_input
USEC_INITIALIZED=7544717
run: '/usr/bin/mkdir /dev/thermal-cpu/'
run: '/bin/ln -sf /sys/devices/virtual/thermal/thermal_zone0/temp /dev/thermal-cpu/temp1_input'
Unload module index
Unloaded link configuration context.
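
Note that udevadm test only simulates the event, so the symlink still has to be created for real. One possible way to do that without rebooting, as a sketch: reload the rules, replay add events for the thermal subsystem, then restart fancontrol. This assumes the rule fires on add events, as the ACTION=add line in the output above suggests. apply_thermal_rule and link_ok are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical names, not Armbian tooling.

# Reload the udev rules, replay "add" events for thermal devices, wait for
# udev to finish, then restart fancontrol. Needs root on the real board.
apply_thermal_rule() {
    udevadm control --reload &&
    udevadm trigger --action=add --subsystem-match=thermal &&
    udevadm settle &&
    systemctl restart fancontrol.service
}

# Succeeds only if $1 is a symlink whose target exists and is readable.
link_ok() {
    [ -L "$1" ] && [ -r "$1" ]
}

# On the Helios64:
#   apply_thermal_rule && link_ok /dev/thermal-cpu/temp1_input
```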

Apparently for some reason the device-tree changed upstream and the thermal type changed from soc-thermal to cpu?


For anybody passing by: the issue is that, for some reason, the armbian-bsp-cli-helios64 package for 21.05.2 (EDIT: to clarify, 21.05.1 is fine, as seen below) was built with the old udev rule (the one for 4.4 kernels):

$ ls armbian-bsp-cli-helios64_21.05.1_arm64.deb\data.tar\.\etc\udev\rules.d\
10-wifi-disable-powermanagement.rules
50-mali.rules
50-rk3399-vpu.rules
50-usb-realtek-net.rules
70-keep-usb-lan-as-eth1.rules
90-helios64-hwmon.rules
90-helios64-ups.rules

$ ls armbian-bsp-cli-helios64_21.05.2_arm64.deb\data.tar\.\etc\udev\rules.d\
10-wifi-disable-powermanagement.rules
50-mali.rules
50-rk3399-vpu.rules
50-usb-realtek-net.rules
70-keep-usb-lan-as-eth1.rules
90-helios64-hwmon-legacy.rules
90-helios64-ups.rules
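
The same comparison can be sketched on a Debian-based system with dpkg-deb instead of browsing the archive by hand. udev_rules_from_listing is a hypothetical helper that reduces a list of archive member paths on stdin to the udev rule file names. The .deb file names in the comments are the ones from the listings above and are assumed to be in the current directory.

```shell
#!/bin/sh
# Sketch only: udev_rules_from_listing is a hypothetical helper name.
# Reads archive member paths on stdin (one per line) and prints the file
# names found directly under etc/udev/rules.d/.
udev_rules_from_listing() {
    # Require at least one character after "rules.d/" to drop the
    # directory entry itself, then strip everything up to the last slash.
    grep 'etc/udev/rules\.d/.' | sed 's|.*/||'
}

# Possible use:
#   dpkg-deb --fsys-tarfile armbian-bsp-cli-helios64_21.05.1_arm64.deb | tar -tf - | udev_rules_from_listing
#   dpkg-deb --fsys-tarfile armbian-bsp-cli-helios64_21.05.2_arm64.deb | tar -tf - | udev_rules_from_listing
```

Saving both outputs to files and running diff on them makes the swapped hwmon rule stand out.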

The content of 90-helios64-hwmon.rules is indeed correct and matches the 5.10.x kernel device tree: https://github.com/armbian/build/blob/master/packages/bsp/helios64/90-helios64-hwmon.rules
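
To check quickly whether an installed rules file can match the zone type the running kernel reports, something like this could be used. It is only a sketch: rules_match_type is a hypothetical helper, and it assumes the rule selects the zone with an ATTR{type}=="..." match like the line quoted in the first post.

```shell
#!/bin/sh
# Sketch only: rules_match_type is a hypothetical helper name.
# $1 = udev rules file, $2 = thermal zone type (e.g. cpu or soc-thermal).
# Succeeds if the file contains an ATTR{type}=="<type>" match.
rules_match_type() {
    # -F: treat the pattern as a fixed string, -q: exit status only.
    grep -qF -- "ATTR{type}==\"$2\"" "$1"
}

# Possible use on the board:
#   zone_type=$(cat /sys/devices/virtual/thermal/thermal_zone0/type)
#   rules_match_type /etc/udev/rules.d/90-helios64-hwmon-legacy.rules "$zone_type" \
#       || echo "installed rule does not match type $zone_type"
```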

 

I tried reverse-engineering the build system to find out why the old file was used instead of the new one, but the best I could find is:

# in config/sources/families/include/rockchip64_common.inc
### Fancontrol tweaks
# copy hwmon rules to fix device mapping
if [[ $BRANCH == legacy ]]; then
    install -m 644 $SRC/packages/bsp/helios64/90-helios64-hwmon-legacy.rules $destination/etc/udev/rules.d/
else
    install -m 644 $SRC/packages/bsp/helios64/90-helios64-hwmon.rules $destination/etc/udev/rules.d/
fi

 


To confirm I checked with https://armbian.systemonachip.net/apt/pool/focal-utils/a/armbian-bsp-cli-helios64/ 

 

That really shows that there is a wrong file in 21.05.2 (/etc/udev/rules.d/90-helios64-hwmon...). Interestingly, the nightly build from beta.armbian.com does have it right. @Igor, pinging you here, as I am not familiar with the new packaging. Was this only an issue in one version that will be fixed automatically with the next release/minor version? Or do we have to fix some packaging somewhere?


One solution to this would be to merge both the old and the new rule into the same file (like I ended up doing above), but I would highly suggest that we package a new version of the bsp with the correct rule in a 21.05.3 version to avoid issues with non-spinning fans. Let me know if I can assist by any means.

34 minutes ago, snakekick said:

Hi,

A new version of armbian-bsp-cli-helios64 (21.05.4) was released today, but it still has the same error. ;((

 

Do you support the project at least this way? https://forum.armbian.com/subscriptions/ So you don't make additional expenses when asking for support you are far away from.

 

Software development and support / bug fixing takes time. It is also very expensive, since people need to have a lot of knowledge, which is highly paid and very desirable on the market. Here you expect this service for free. Well, then you have to wait with a partially broken system without complaining ... or you can fix it on your own. Or hire someone to fix this for all of us. Why should this come at our private expense?

 

There are "1000 bugs and 1000 people" before this one, and this update fixed some other bugs. We made a few people happy, but it is not possible to make everyone happy.

 

The bug was recorded in our system and it is waiting for a free time slot, for our donation to you. A week, a month or years. Up to you.


While you are right, of course, for a professional kind of support there are quite a lot of alternatives which are better suited. Donations to the project are essential, so that things can be done.

 

Where I disagree a little is that this issue causes the fans to stop. I find that to be a serious issue, as it can cause hardware damage. I'm sure there is thermal throttling and auto-shutdown if the temperature reaches some thresholds, but it's never good to go into that area. So I don't agree that there are "1000 bugs" before this one.

Now, to avoid any kind of unnecessary pressure, my suggestion to all users is: revert back to 21.05.1 until a solution is deployed in the latest release.

 

 

24 minutes ago, Zac said:

I find that to be a serious issue


How about putting it this way: "Sadly, we ran out of money to fix things for this year." In reality, that already happens on the second day of the year. But hey, this is open source. Anyone can fix things.

 

26 minutes ago, Zac said:

I'm sure there is thermal throttling and auto-shutdown if the temperature reaches some thresholds

 

There is.

 

26 minutes ago, Zac said:

Now, to avoid any kind of unnecessary pressure, my suggestion to all users is: revert back to 21.05.1 until a solution is deployed in the latest release.


That would be a workaround, but sadly we have no way to communicate such a message effectively.


Another temporary solution is already provided in the first post btw.

 

Anyone struggling with this issue and only wants the fan to work again can just:

 

On 5/28/2021 at 1:19 AM, halfa said:

and after rewriting the line with the new type, udev is happy again


# Edit in /etc/udev/rules.d/90-helios64-hwmon-legacy.rules and add the following line after the original one
ATTR{type}=="cpu", ENV{HWMON_PATH}="/sys%p/temp", ENV{HELIOS64_SYMLINK}="/dev/thermal-cpu/temp1_input", RUN+="/usr/bin/mkdir /dev/thermal-cpu/"

root@helios64:~ # udevadm control --reload
root@helios64:~ # udevadm test /sys/devices/virtual/thermal/thermal_zone0
[...]
DEVPATH=/devices/virtual/thermal/thermal_zone0
ACTION=add
SUBSYSTEM=thermal
IS_HELIOS64_HWMON=1
HWMON_PATH=/sys/devices/virtual/thermal/thermal_zone0/temp
HELIOS64_SYMLINK=/dev/thermal-cpu/temp1_input
USEC_INITIALIZED=7544717
run: '/usr/bin/mkdir /dev/thermal-cpu/'
run: '/bin/ln -sf /sys/devices/virtual/thermal/thermal_zone0/temp /dev/thermal-cpu/temp1_input'
Unload module index
Unloaded link configuration context.

 
