
Armbian 21.05.2 Focal with Linux 5.10.35-rockchip64: fancontrol dies with an error, fans not spinning



Posted

Posting here following what was recommended on twitter.

After updating my helios64 earlier this week and rebooting to get the new kernel, I realized it was suspiciously silent.

A quick look at the temperature sensor readings and a physical check confirmed that the fans were not spinning.

 

After a quick read on the wiki, I checked fancontrol which was indeed failing:

root@helios64:~ # systemctl status fancontrol.service
● fancontrol.service - fan speed regulator
     Loaded: loaded (/lib/systemd/system/fancontrol.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/fancontrol.service.d
             └─pid.conf
     Active: failed (Result: exit-code) since Fri 2021-05-28 00:08:13 CEST; 1min 42s ago
       Docs: man:fancontrol(8)
             man:pwmconfig(8)
    Process: 2495 ExecStartPre=/usr/sbin/fancontrol --check (code=exited, status=0/SUCCESS)
    Process: 2876 ExecStart=/usr/sbin/fancontrol (code=exited, status=1/FAILURE)
   Main PID: 2876 (code=exited, status=1/FAILURE)

May 28 00:08:13 helios64 fancontrol[2876]:   MINPWM=0
May 28 00:08:13 helios64 fancontrol[2876]:   MAXPWM=255
May 28 00:08:13 helios64 fancontrol[2876]:   AVERAGE=1
May 28 00:08:13 helios64 fancontrol[2876]: Error: file /dev/thermal-cpu/temp1_input doesn't exist
May 28 00:08:13 helios64 fancontrol[2876]: Error: file /dev/thermal-cpu/temp1_input doesn't exist
May 28 00:08:13 helios64 fancontrol[2876]: At least one referenced file is missing. Either some required kernel
May 28 00:08:13 helios64 fancontrol[2876]: modules haven't been loaded, or your configuration file is outdated.
May 28 00:08:13 helios64 fancontrol[2876]: In the latter case, you should run pwmconfig again.
May 28 00:08:13 helios64 systemd[1]: fancontrol.service: Main process exited, code=exited, status=1/FAILURE
May 28 00:08:13 helios64 systemd[1]: fancontrol.service: Failed with result 'exit-code'.

 

Basically, fancontrol expects a device in /dev to read the sensor values from, and that device seems to be missing. After a bit of poking around and learning about udev, I managed to work around the issue by recreating the device symlink manually:

/usr/bin/mkdir /dev/thermal-cpu/
ln -s /sys/devices/virtual/thermal/thermal_zone0/temp /dev/thermal-cpu/temp1_input
systemctl restart fancontrol.service
systemctl status fancontrol.service
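
For anyone who wants to re-run this safely, here is a minimal sketch of the same workaround as an idempotent shell function. The name link_thermal_input is made up for illustration. It only wraps the mkdir/ln steps above, adds a check that the sensor file exists, and does nothing about the underlying udev problem.

```shell
#!/bin/sh
# Sketch only: recreate the temperature symlink that fancontrol expects.
# link_thermal_input SRC LINK  (hypothetical helper name)
#   SRC  - sysfs temperature file
#   LINK - path that /etc/fancontrol reads
link_thermal_input() {
    src=$1
    link=$2
    # Refuse to create a dangling link if the sensor file is missing.
    if [ ! -e "$src" ]; then
        echo "sensor file $src not found" >&2
        return 1
    fi
    # -p: no error if the directory already exists (plain mkdir would fail).
    mkdir -p "$(dirname "$link")" || return 1
    # -f replaces an existing link, -n does not follow an existing link.
    ln -sfn "$src" "$link"
}

# Usage on the Helios64 (as root):
#   link_thermal_input /sys/devices/virtual/thermal/thermal_zone0/temp \
#       /dev/thermal-cpu/temp1_input
#   systemctl restart fancontrol.service
```

Since /dev is recreated at boot, a link made this way is gone after a reboot.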

Digging further, this issue happens because udev is not creating the symlink like it should. After reading the rule in /etc/udev/rules.d/90-helios64-hwmon-legacy.rules and a bit of udev documentation, I found out how to test it:

root@helios64:~ # udevadm test /sys/devices/virtual/thermal/thermal_zone0
[...]
Reading rules file: /etc/udev/rules.d/90-helios64-hwmon-legacy.rules
Reading rules file: /etc/udev/rules.d/90-helios64-ups.rules
[...]
DEVPATH=/devices/virtual/thermal/thermal_zone0
ACTION=add
SUBSYSTEM=thermal
IS_HELIOS64_HWMON=1
HWMON_PATH=/sys/devices/virtual/thermal/thermal_zone0
USEC_INITIALIZED=7544717
run: '/bin/ln -sf /sys/devices/virtual/thermal/thermal_zone0 ' <-- something is wrong here, there is no target
Unload module index
Unloaded link configuration context.

After spending a bit more time reading the udev rule, I realized that the second argument was empty because we don't match the ATTR{type}=="soc-thermal" condition. We can look up the types like this:

root@helios64:~ # find /sys/ -name type | grep thermal
/sys/devices/virtual/thermal/cooling_device1/type
/sys/devices/virtual/thermal/thermal_zone0/type
/sys/devices/virtual/thermal/cooling_device4/type
/sys/devices/virtual/thermal/cooling_device2/type
/sys/devices/virtual/thermal/thermal_zone1/type
/sys/devices/virtual/thermal/cooling_device0/type
/sys/devices/virtual/thermal/cooling_device3/type
/sys/firmware/devicetree/base/thermal-zones/gpu/trips/gpu_alert0/type
/sys/firmware/devicetree/base/thermal-zones/gpu/trips/gpu_crit/type
/sys/firmware/devicetree/base/thermal-zones/cpu/trips/cpu_crit/type
/sys/firmware/devicetree/base/thermal-zones/cpu/trips/cpu_alert0/type
/sys/firmware/devicetree/base/thermal-zones/cpu/trips/cpu_alert1/type
root@helios64:~ # cat /sys/devices/virtual/thermal/thermal_zone0/type
cpu <-- we were expecting soc-thermal!
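
A shorter way to get the same overview is to loop over the thermal zones directly. This is just a sketch: list_thermal_zones is a made-up helper name, and it assumes the zones are also reachable under /sys/class/thermal.

```shell
#!/bin/sh
# Sketch only: print "<zone> <type>" for every thermal zone.
# list_thermal_zones [BASE]  (hypothetical helper, BASE defaults to /sys/class/thermal)
list_thermal_zones() {
    base=${1:-/sys/class/thermal}
    for zone in "$base"/thermal_zone*; do
        # Skip the unexpanded pattern when no zone exists.
        [ -r "$zone/type" ] || continue
        printf '%s %s\n' "$(basename "$zone")" "$(cat "$zone/type")"
    done
}

# Usage:
#   list_thermal_zones
```

The type printed for a zone is the value that the ATTR{type} match in the udev rule is compared against.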

and after rewriting the line with the new type, udev is happy again

# Edit in /etc/udev/rules.d/90-helios64-hwmon-legacy.rules and add the following line after the original one
ATTR{type}=="cpu", ENV{HWMON_PATH}="/sys%p/temp", ENV{HELIOS64_SYMLINK}="/dev/thermal-cpu/temp1_input", RUN+="/usr/bin/mkdir /dev/thermal-cpu/"

root@helios64:~ # udevadm control --reload
root@helios64:~ # udevadm test /sys/devices/virtual/thermal/thermal_zone0
[...]
DEVPATH=/devices/virtual/thermal/thermal_zone0
ACTION=add
SUBSYSTEM=thermal
IS_HELIOS64_HWMON=1
HWMON_PATH=/sys/devices/virtual/thermal/thermal_zone0/temp
HELIOS64_SYMLINK=/dev/thermal-cpu/temp1_input
USEC_INITIALIZED=7544717
run: '/usr/bin/mkdir /dev/thermal-cpu/'
run: '/bin/ln -sf /sys/devices/virtual/thermal/thermal_zone0/temp /dev/thermal-cpu/temp1_input'
Unload module index
Unloaded link configuration context.

Apparently for some reason the device-tree changed upstream and the thermal type changed from soc-thermal to cpu?

Posted

For anybody passing by: the issue is that, for some reason, the armbian-bsp-cli-helios64 package for 21.05.2 (EDIT: to clarify, 21.05.1 is fine, as seen below) was built with the old udev rule (for 4.4 kernels):

$ ls armbian-bsp-cli-helios64_21.05.1_arm64.deb\data.tar\.\etc\udev\rules.d\
10-wifi-disable-powermanagement.rules
50-mali.rules
50-rk3399-vpu.rules
50-usb-realtek-net.rules
70-keep-usb-lan-as-eth1.rules
90-helios64-hwmon.rules
90-helios64-ups.rules

$ ls armbian-bsp-cli-helios64_21.05.2_arm64.deb\data.tar\.\etc\udev\rules.d\
10-wifi-disable-powermanagement.rules
50-mali.rules
50-rk3399-vpu.rules
50-usb-realtek-net.rules
70-keep-usb-lan-as-eth1.rules
90-helios64-hwmon-legacy.rules
90-helios64-ups.rules
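
The same comparison can be done on a Debian-based machine without unpacking anything. This is a minimal sketch: udev_rules_in_listing is a made-up helper that filters the output of dpkg-deb -c (which lists the files in a .deb) down to the udev rule names.

```shell
#!/bin/sh
# Sketch only: read `dpkg-deb -c` output on stdin and print the names of the
# files shipped in /etc/udev/rules.d/ (hypothetical helper name).
udev_rules_in_listing() {
    # The file path is the last field of each listing line.
    awk '{ print $NF }' \
        | grep '/etc/udev/rules\.d/.*\.rules$' \
        | sed 's|.*/||' \
        | sort
}

# Usage:
#   dpkg-deb -c armbian-bsp-cli-helios64_21.05.2_arm64.deb | udev_rules_in_listing
```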

The content of 90-helios64-hwmon.rules is indeed correct and matches the 5.10.x kernel device tree: https://github.com/armbian/build/blob/master/packages/bsp/helios64/90-helios64-hwmon.rules

 

I tried reverse-engineering the build system to find out why the old file was used instead of the new one, but the best I could find is

# in config/sources/families/include/rockchip64_common.inc
### Fancontrol tweaks
# copy hwmon rules to fix device mapping
if [[ $BRANCH == legacy ]]; then
    install -m 644 $SRC/packages/bsp/helios64/90-helios64-hwmon-legacy.rules $destination/etc/udev/rules.d/
else
    install -m 644 $SRC/packages/bsp/helios64/90-helios64-hwmon.rules $destination/etc/udev/rules.d/
fi

 

Posted

To confirm I checked with https://armbian.systemonachip.net/apt/pool/focal-utils/a/armbian-bsp-cli-helios64/ 

 

This really shows that 21.05.2 ships the wrong file (/etc/udev/rules.d/90-helios64-hwmon...). Interestingly, the nightly build from beta.armbian.com does have it right... @Igor, pinging you here, as I am not familiar with the new packaging. Was this only an issue in one version that will be fixed automatically with the next release/minor version? Or do we have to fix some packaging somewhere...?

Posted
4 hours ago, Heisath said:

pinging you here, as I am not familiar with the new packaging.


The random behaviour is because there is no longer a per-branch BSP. Decisions have to be refactored to happen at runtime. I keep forgetting we still have legacy stuff here ... so this will complicate things a bit :(

 

https://armbian.atlassian.net/browse/AR-779

Posted

One solution to this would be to merge both the old and the new rule into the same file (like I ended up doing above), but I would highly suggest that we package a new version of the bsp with the correct rule in a 21.05.3 version to avoid issues with non-spinning fans. Let me know if I can assist by any means.
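
To illustrate what "merging both rules" boils down to, here is a minimal shell sketch of the same decision made at runtime: accept either thermal zone type seen in this thread ("soc-thermal" expected by the legacy rule, "cpu" on 5.10) and use whichever zone matches. This is not the udev rule itself and not code from the Armbian build system; cpu_temp_source is a made-up name.

```shell
#!/bin/sh
# Sketch only: print the temp file of the first thermal zone whose type is
# "soc-thermal" (what the legacy rule expects) or "cpu" (seen on 5.10).
# cpu_temp_source [BASE]  (hypothetical helper, BASE defaults to /sys/class/thermal)
cpu_temp_source() {
    base=${1:-/sys/class/thermal}
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/type" ] || continue
        case "$(cat "$zone/type")" in
            soc-thermal|cpu)
                printf '%s\n' "$zone/temp"
                return 0
                ;;
        esac
    done
    echo "no CPU thermal zone found under $base" >&2
    return 1
}

# Usage:
#   cpu_temp_source
```

A merged udev rule would do the equivalent with two ATTR{type} matches, one per kernel, like the extra line I added in my first post.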

Posted
34 minutes ago, snakekick said:

Hi,

a new version of armbian-bsp-cli-helios64 (21.05.4) was released today but still has the same error. ;((

 

Do you support the project at least this way? https://forum.armbian.com/subscriptions/ So you don't make additional expenses when asking for support you are far away from.

 

Software development and support / bug fixing take time. It is also very expensive, since people need a lot of knowledge, which is highly paid and very desirable on the market. Here you expect this service for free. Well, then you have to wait with a partially broken system without complaining ... or you can fix it on your own. Or hire someone to fix this for all of us. Why should this come at our private expense?

 

There are "1000 bugs and 1000 people" before this one, and this update fixed some other bugs. We made a few people happy, but it is not possible to make everyone happy.

 

The bug was recorded in our system and it's waiting for a free time slot. For our donation to you. A week, a month or years. Up to you.

Posted

While you are right of course, for professional kind of support there's quite a lot of other alternatives which suit better. Donations to the project are essential, so that things can be done. 

 

But where I disagree a little, is that this issue causes the fans to stop. I find that to be a serious issue, it can cause hardware damage. I'm sure there is thermal throttling and auto-shutdown if the temperature reaches some thresholds, but it's never good to go into that area. So the "1000 bugs" before this one, I don't agree with it. 

Now, to avoid any kind of unnecessary pressure, my suggestion to all users is: revert to 21.05.01 until a solution is deployed in the latest release.

 

 

Posted
24 minutes ago, Zac said:

I find that to be a serious issue


How about this way - "Sadly we ran out of money to fix things for this year. In reality already second day of the year.". But hey, this is open source. Anyone can fix things.

 

26 minutes ago, Zac said:

I'm sure there is thermal throttling and auto-shutdown if the temperature reaches some thresholds

 

It is.

 

26 minutes ago, Zac said:

Now, to avoid any kind of unnecessary pressure, my suggestion to all users is: revert to 21.05.01 until a solution is deployed in the latest release.


That would be some workaround but sadly we have no ability to effectively communicate such message.

Posted

Another temporary solution is already provided in the first post btw.

 

Anyone struggling with this issue and only wants the fan to work again can just:

 

On 5/28/2021 at 1:19 AM, halfa said:

and after rewriting the line with the new type, udev is happy again


# Edit in /etc/udev/rules.d/90-helios64-hwmon-legacy.rules and add the following line after the original one
ATTR{type}=="cpu", ENV{HWMON_PATH}="/sys%p/temp", ENV{HELIOS64_SYMLINK}="/dev/thermal-cpu/temp1_input", RUN+="/usr/bin/mkdir /dev/thermal-cpu/"

root@helios64:~ # udevadm control --reload
root@helios64:~ # udevadm test /sys/devices/virtual/thermal/thermal_zone0
[...]
DEVPATH=/devices/virtual/thermal/thermal_zone0
ACTION=add
SUBSYSTEM=thermal
IS_HELIOS64_HWMON=1
HWMON_PATH=/sys/devices/virtual/thermal/thermal_zone0/temp
HELIOS64_SYMLINK=/dev/thermal-cpu/temp1_input
USEC_INITIALIZED=7544717
run: '/usr/bin/mkdir /dev/thermal-cpu/'
run: '/bin/ln -sf /sys/devices/virtual/thermal/thermal_zone0/temp /dev/thermal-cpu/temp1_input'
Unload module index
Unloaded link configuration context.

 

Posted
15 hours ago, Zac said:

Donations to the project are essential, so that things can be done. 


All three customers that are complaining here - start with an Angel monthly subscription and it will quickly be enough to cover the expenses of fixing this bug. Donations are voluntary acts which, by the way, don't even cover electricity costs. I wouldn't call that essential.

Posted

I'm having the same issue with a slightly different presentation.

The fans don't run, which caused a couple of thermal shutdowns before I realised.

PRETTY_NAME="Armbian 21.05.4 Buster"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

 

output from systemctl status fancontrol.service is similar to yours @halfa but not identical

 

The odd thing is that I don't have any /etc/udev/rules.d/90-helios64-hwmon-legacy.rules at all.

The directory /etc/udev/rules.d/ has some rules, but not that one.

 

I've currently got a dead NAS (kernel panic every other boot) without fans, but I can't reinstall, because the latest image is still broken.

Are any historic images available? I couldn't see anywhere to get them.

Posted

The 21.05.6 release fixed the regression in the package build process (couldn't find a related commit, maybe during the build targets reworks?), removing the "legacy" udev rule and adding the correct one.

root@helios64:~ # apt info armbian-bsp-cli-helios64
Package: armbian-bsp-cli-helios64
Version: 21.05.6

Given that this is fixed, I don't know if there is still a need to patch upstream, Heisath. I'm willing to test your version of the merged udev rule, but I don't have a legacy environment to properly test the other one.

Posted

Question is, is it really fixed? Or is it just by chance now always including the current udev rule? Also maybe on legacy now using the current file? I'd think so...

 

I will leave my branch open for now. Maybe someone has the time & hardware to fully test.

Posted

I've reinstalled with Armbian 21.02.3 Buster with Linux 5.10.21-rockchip64.

I didn't see any mention of a fix in release notes, and sadly, WFH, and with a newborn, I don't have time to reinstall a version where it's uncertain this is in it.

Posted
7 hours ago, gershwin said:

with a newborn, I don't have time to reinstall a version

 

Congratulations :) You won't believe it - I have two small kids, a wife, a cat (OK, this one is low maintenance), a full-time job and a full-time project to maintain. I don't have time to test this, even though we are providing this software for you and I had this hardware. The alternative is to hire help, but since you are not interested, this is not going to happen.

 

Posted

Hello,

Here is my workaround or patch. Maybe it's bad, sad or unclear, but it's KISS and seems to work!

Bye

 

root@helios64:~# cat /etc/fancontrol

# Helios64 PWM Fan Control Configuration
# Temp source : /dev/thermal-cpu
INTERVAL=10
#FCTEMPS=/dev/fan-p6/pwm1=/dev/thermal-cpu/temp1_input /dev/fan-p7/pwm1=/dev/thermal-cpu/temp1_input
FCTEMPS=/dev/fan-p6/pwm1=/sys/class/thermal/thermal_zone0/temp /dev/fan-p7/pwm1=/sys/class/thermal/thermal_zone0/temp
MINTEMP=/dev/fan-p6/pwm1=40 /dev/fan-p7/pwm1=40
MAXTEMP=/dev/fan-p6/pwm1=110 /dev/fan-p7/pwm1=110
MINSTART=/dev/fan-p6/pwm1=60 /dev/fan-p7/pwm1=60
MINSTOP=/dev/fan-p6/pwm1=40 /dev/fan-p7/pwm1=40
MINPWM=20

Posted

Thanks a lot @BipBip1981 

 

Changing the values at FCTEMPS from `/dev/thermal-cpu/temp1_input`  to `/sys/class/thermal/thermal_zone0/temp`  fixed it.
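
In case someone wants to script that edit, here is a minimal sketch. retarget_fctemps is a made-up helper. It keeps a .bak copy and only touches the active FCTEMPS line, not the commented-out one. Note that OLD is treated as a basic regular expression and NEW as a sed replacement, which is fine for plain paths like these.

```shell
#!/bin/sh
# Sketch only: point the FCTEMPS entries of a fancontrol config at another
# temperature file.
# retarget_fctemps CONFIG OLD NEW  (hypothetical helper name)
retarget_fctemps() {
    conf=$1
    old=$2
    new=$3
    # Keep a backup so the change is easy to undo.
    cp "$conf" "$conf.bak" || return 1
    # Only lines starting with FCTEMPS= are changed; "|" is the sed delimiter
    # because the paths contain "/".
    sed "/^FCTEMPS=/ s|=$old|=$new|g" "$conf.bak" > "$conf"
}

# Usage (as root), then restart the service:
#   retarget_fctemps /etc/fancontrol /dev/thermal-cpu/temp1_input \
#       /sys/class/thermal/thermal_zone0/temp
#   systemctl restart fancontrol.service
```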

 

My dumb ass then updated the system via OVM and after a reboot the fan started spinning non-stop again, this time:
 

root@helios64:~# systemctl status fancontrol.service

 fancontrol.service - fan speed regulator
   Loaded: loaded (/lib/systemd/system/fancontrol.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2024-10-06 10:49:39 UTC; 6s ago
     Docs: man:fancontrol(8)
           man:pwmconfig(8)
  Process: 1653 ExecStartPre=/usr/sbin/fancontrol --check (code=exited, status=0/SUCCESS)
  Process: 1773 ExecStart=/usr/sbin/fancontrol (code=exited, status=1/FAILURE)
 Main PID: 1773 (code=exited, status=1/FAILURE)

Oct 06 10:49:39 helios64 fancontrol[1773]:   MINSTOP=40
Oct 06 10:49:39 helios64 fancontrol[1773]:   MINPWM=0
Oct 06 10:49:39 helios64 fancontrol[1773]:   MAXPWM=255
Oct 06 10:49:39 helios64 fancontrol[1773]: Error: file /dev/fan-p6/pwm1 doesn't exist
Oct 06 10:49:39 helios64 fancontrol[1773]: Error: file /dev/fan-p7/pwm1 doesn't exist

 

I somehow pieced the new value together via:

Fan is silent now and fancontrol.service is active (running) with:

 

root@helios64:~# cat fancontrol/fancontrol 
INTERVAL=10
FCTEMPS=/sys/devices/platform/p6-fan/hwmon/hwmon5/pwm1=/sys/class/thermal/thermal_zone0/temp /sys/devices/platform/p7-fan/hwmon/hwmon4/pwm1=/sys/class/thermal/thermal_zone1/temp
MINTEMP=/sys/devices/platform/p6-fan/hwmon/hwmon5/pwm1=40 /sys/devices/platform/p7-fan/hwmon/hwmon4/pwm1=40
MAXTEMP=/sys/devices/platform/p6-fan/hwmon/hwmon5/pwm1=110 /sys/devices/platform/p7-fan/hwmon/hwmon4/pwm1=110
MINSTART=/sys/devices/platform/p6-fan/hwmon/hwmon5/pwm1=60 /sys/devices/platform/p7-fan/hwmon/hwmon4/pwm1=60
MINSTOP=/sys/devices/platform/p6-fan/hwmon/hwmon5/pwm1=40 /sys/devices/platform/p7-fan/hwmon/hwmon4/pwm1=40
MINPWM=20
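
One caveat with this config: the hwmon5/hwmon4 numbers are assigned by the kernel as the drivers register, so they may change after a reboot or kernel update, and the hard-coded paths could break again. A minimal sketch for looking up the current path instead of guessing (resolve_pwm is a made-up helper name):

```shell
#!/bin/sh
# Sketch only: print the pwm1 file below PLATFORM_DIR/hwmon/hwmon*/,
# whatever index the kernel assigned this boot.
# resolve_pwm PLATFORM_DIR  (hypothetical helper name)
resolve_pwm() {
    for f in "$1"/hwmon/hwmon*/pwm1; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    echo "no pwm1 under $1" >&2
    return 1
}

# Usage:
#   resolve_pwm /sys/devices/platform/p6-fan
#   resolve_pwm /sys/devices/platform/p7-fan
```

The printed paths can then be pasted into the config in place of the hwmonN ones if they no longer match.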

 

This is working on:

root@helios64:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 10 (buster)
Release:        10
Codename:       buster

root@helios64:~# uname -r
6.6.47-current-rockchip64

 
