Helios64 edge kernel: rcu: rcu_preempt kthread starved for 5234524 jiffies (and unresponsive)

crosser · September 1, 2024

I found my Helios64 unresponsive after about a week of uptime. Even the heartbeat LED was not blinking (it stayed permanently on). I see this repeating on the serial console:

[495778.879711] rcu: rcu_preempt kthread timer wakeup didn't happen for 5324533 jiffies! g9706925 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200
[495778.880747] rcu: 	Possible timer handling issue on cpu=3 timer-softirq=2513452
[495778.881383] rcu: rcu_preempt kthread starved for 5324534 jiffies! g9706925 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=3
[495778.882336] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[495778.883137] rcu: RCU grace-period kthread stack dump:
[495778.883584] task:rcu_preempt     state:R stack:0     pid:16    tgid:16    ppid:2      flags:0x00000008
[495778.884404] Call trace:
[495778.884627]  __switch_to+0xe0/0x124
[495778.884946]  __schedule+0x308/0xa8c
[495778.885263]  schedule+0x34/0xf8
[495778.885549]  schedule_timeout+0x98/0x1bc
[495778.885902]  rcu_gp_fqs_loop+0x150/0x670
[495778.886256]  rcu_gp_kthread+0x234/0x274
[495778.886603]  kthread+0x114/0x118
[495778.886896]  ret_from_fork+0x10/0x20
[495778.887219] rcu: Stack dump where RCU GP kthread last ran:
[495778.887705] Sending NMI from CPU 4 to CPUs 3:
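
To get a feel for how long the kthread had been starved, the jiffies count in the "starved for" line can be converted to wall-clock time. The following is a minimal sketch, not a definitive tool: the helper names `parse_stall` and `jiffies_to_seconds` are made up for illustration, and the tick rate has to be supplied by the caller because it depends on the kernel's CONFIG_HZ, which is not shown in this log.

```python
import re

# Matches lines such as:
#   "rcu: rcu_preempt kthread starved for 5324534 jiffies! ... ->cpu=3"
STALL_RE = re.compile(r"kthread starved for (\d+) jiffies.*->cpu=(\d+)")


def parse_stall(line):
    """Return (jiffies, cpu) from a 'kthread starved' line, or None if it does not match."""
    m = STALL_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def jiffies_to_seconds(jiffies, hz):
    """Convert jiffies to seconds; hz must be the kernel's CONFIG_HZ (assumed by the caller)."""
    return jiffies / hz
```

With a purely hypothetical tick rate of 250 Hz, 5324534 jiffies would correspond to roughly 5.9 hours; the real figure depends on the CONFIG_HZ the kernel was built with.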

Aside from the usual NFS server (unused at the time) and syncthing, it was running a duplicity backup at the time. I cannot rule out that it ran out of memory, though it did complete a full backup successfully a couple of days earlier.

helios64-rcu-stall.txt

prahal · September 5, 2024

I guess this is one core not responding anymore, likely CPU 5 (one of the big cores). Which kernel do you run?

Is this the first time you have encountered this bug?

You might want to run ebin-dev's dtb (it contains voltage hacks for the big CPUs).

crosser · September 15, 2024

Edge from the deb package (6.8.11-edge-rockchip64).

It was the first time; this Saturday I got a similar situation, _but_ I was able to ssh in (after minutes of waiting) and save the syslog. The first anomaly in the log was (this time):

Sep 14 22:11:52 kobol kernel: BUG: Bad page state in process kcompactd0  pfn:1e320
Sep 14 22:11:52 kobol kernel: page:000000001709b832 refcount:0 mapcount:0 mapping:000000004953ae39 index:0x4c1a1c30 pfn:0x1e320
Sep 14 22:11:52 kobol kernel: aops:0xffff800081149ed8 ino:1
Sep 14 22:11:52 kobol kernel: flags: 0xffff1800000020c(referenced|uptodate|workingset|node=0|zone=0|lastcpupid=0xffff)
Sep 14 22:11:52 kobol kernel: page_type: 0xffffffff()
Sep 14 22:11:52 kobol kernel: raw: 0ffff1800000020c dead000000000100 dead000000000122 ffff0000009e8338
Sep 14 22:11:52 kobol kernel: raw: 000000004c1a1c30 0000000000000000 00000000ffffffff 0000000000000000
Sep 14 22:11:52 kobol kernel: page dumped because: non-NULL mapping

and later there are repeated "rcu: INFO: rcu_preempt detected stalls on CPUs/tasks ..." (see attached file).

I will try that other dtb, thanks!

rcu-stall.txt

Trillien · November 21, 2024

Hi,

For info, I had the same issue running OMV 6.0 with PhotoPrism. PhotoPrism seemed to consume a lot of CPU while processing the pictures, and at some point (between a quarter of an hour and two hours in), the Helios64 failed with the "rcu_preempt detected stalls on CPUs" error.

This error was probably linked to the jumps between frequencies: I solved it by limiting the maximum frequency of the big cores to 1200 MHz. Note that this was before Prahal's DTB that increases the CPU voltage.
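
For anyone wanting to try the same workaround, a frequency cap can be applied through the cpufreq sysfs interface. This is only a sketch under assumptions: the function names are hypothetical, the policy directory for the big cores is assumed to be `/sys/devices/system/cpu/cpufreq/policy4` (check which policy covers your big cores), and the driver is assumed to expose `scaling_available_frequencies`. It is not necessarily how the cap was set in the post above.

```python
from pathlib import Path


def pick_cap(available_khz, cap_khz):
    """Return the highest available frequency (in kHz) that does not exceed cap_khz."""
    candidates = [f for f in available_khz if f <= cap_khz]
    if not candidates:
        raise ValueError("no available frequency at or below the cap")
    return max(candidates)


def cap_policy(policy_dir, cap_khz):
    """Write a capped value into scaling_max_freq of one cpufreq policy directory.

    policy_dir is supplied by the caller; on real sysfs this needs root.
    """
    policy = Path(policy_dir)
    text = (policy / "scaling_available_frequencies").read_text()
    available = [int(f) for f in text.split()]
    chosen = pick_cap(available, cap_khz)
    (policy / "scaling_max_freq").write_text(f"{chosen}\n")
    return chosen
```

cpufreq sysfs values are in kHz, so 1200 MHz is passed as `cap_policy("/sys/devices/system/cpu/cpufreq/policy4", 1200000)`. A value written this way does not survive a reboot, so it would have to be reapplied at boot.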

prahal · December 11, 2024

@crosser was it stable with the other dtb?

crosser · January 28

@prahal sorry for dropping off.

I was too lazy to build the needed version of the kernel, and before long a new edge release came out, and the device has been rock solid since then. Knock on wood.

(As a side note, I _do_ have a real problem with the hardware: after some research I figured out that one of the two disk power circuits has gone bad, so now I have two disks (and an SSD) working, and three slots empty.)

crosser@kobol:~$ uname -a
Linux kobol 6.12.1-edge-rockchip64 #1 SMP PREEMPT Fri Nov 22 14:30:26 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
crosser@kobol:~$ uptime
20:32:31 up 56 days, 12 min, 1 user, load average: 0.04, 0.05, 0.00

(last reboot to upgrade the kernel, no crashes for quite a while)

prahal · January 29

@crosser thanks for the feedback. By the new stable edge release, do you mean the vanilla Armbian one? That is, without using ebin-dev's dtb with the new edge kernel?

crosser · February 7

@prahal I am running this now:

crosser@kobol:~$ cat /etc/os-release 
PRETTY_NAME="Armbian 24.11.3 noble"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.armbian.com"
SUPPORT_URL="https://forum.armbian.com"
BUG_REPORT_URL="https://www.armbian.com/bugs"
PRIVACY_POLICY_URL="https://www.armbian.com"
UBUNTU_CODENAME=noble
LOGO="armbian-logo"
ARMBIAN_PRETTY_NAME="Armbian 24.11.3 noble"
crosser@kobol:~$ uname -a
Linux kobol 6.12.1-edge-rockchip64 #1 SMP PREEMPT Fri Nov 22 14:30:26 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
crosser@kobol:~$ dpkg --list|grep dtb
ii  linux-dtb-edge-rockchip64        24.11.1                               arm64        Armbian Linux edge DTBs in /boot/dtb-6.12.1-edge-rockchip64
crosser@kobol:~$

without any kernel-level alterations.
(A few minor userspace config modifications are needed to make the LED work and to make the NFS services start in the right order, but that is all.)

prahal · February 15

If the LED issue was helios64:green:status appearing in sysfs instead of helios64::status, it is fixed in git.
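
Until a fixed build is installed, a userspace script can cope with either name by checking which one exists. The sketch below is an illustration only: the function names are hypothetical, it assumes the standard LED class layout under `/sys/class/leds` with a `trigger` attribute, and it is not necessarily the modification crosser made.

```python
from pathlib import Path

# Both names are taken from the post above; which one exists depends on the kernel/DTB.
CANDIDATES = ("helios64::status", "helios64:green:status")


def find_status_led(leds_root="/sys/class/leds"):
    """Return the Path of the first candidate LED directory that exists, else None."""
    root = Path(leds_root)
    for name in CANDIDATES:
        if (root / name).is_dir():
            return root / name
    return None


def set_trigger(led_dir, trigger="heartbeat"):
    """Write a trigger name into the LED's 'trigger' attribute (needs root on real sysfs)."""
    (Path(led_dir) / "trigger").write_text(trigger + "\n")
```

A boot-time script could call `find_status_led()` and, if it returns a path, pass that path to `set_trigger()` to restore the heartbeat blink.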
