Helios64 edge kernel: rcu: rcu_preempt kthread starved for 5234524 jiffies (and unresponsive)


Posted

I found my Helios64 unresponsive after about a week or so. Even the heartbeat LED was not blinking (it stayed permanently on). I see this repeating on the serial console:

[495778.879711] rcu: rcu_preempt kthread timer wakeup didn't happen for 5324533 jiffies! g9706925 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200
[495778.880747] rcu: 	Possible timer handling issue on cpu=3 timer-softirq=2513452
[495778.881383] rcu: rcu_preempt kthread starved for 5324534 jiffies! g9706925 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=3
[495778.882336] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[495778.883137] rcu: RCU grace-period kthread stack dump:
[495778.883584] task:rcu_preempt     state:R stack:0     pid:16    tgid:16    ppid:2      flags:0x00000008
[495778.884404] Call trace:
[495778.884627]  __switch_to+0xe0/0x124
[495778.884946]  __schedule+0x308/0xa8c
[495778.885263]  schedule+0x34/0xf8
[495778.885549]  schedule_timeout+0x98/0x1bc
[495778.885902]  rcu_gp_fqs_loop+0x150/0x670
[495778.886256]  rcu_gp_kthread+0x234/0x274
[495778.886603]  kthread+0x114/0x118
[495778.886896]  ret_from_fork+0x10/0x20
[495778.887219] rcu: Stack dump where RCU GP kthread last ran:
[495778.887705] Sending NMI from CPU 4 to CPUs 3:
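
For anyone triaging a similar log, here is a minimal Python sketch that pulls the stalled CPU and the starvation time out of the "kthread starved" line. The function name `parse_starved` and the default `hz=250` are assumptions for illustration; the real tick rate is whatever CONFIG_HZ the running kernel was built with, so the converted duration is only as good as that value.

```python
import re

# Matches the "kthread starved" line printed by the RCU stall detector,
# in the format shown in the console output above.
STARVED_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+rcu:\s+(?P<kthread>\S+) kthread starved for "
    r"(?P<jiffies>\d+) jiffies!.*?->cpu=(?P<cpu>\d+)"
)


def parse_starved(line, hz=250):
    """Return stall details from one console line, or None if it does not match.

    hz is an ASSUMPTION (the kernel's CONFIG_HZ); pass the real value if known.
    """
    m = STARVED_RE.search(line)
    if m is None:
        return None
    jiffies = int(m.group("jiffies"))
    return {
        "uptime_s": float(m.group("uptime")),  # kernel timestamp of the message
        "kthread": m.group("kthread"),         # e.g. rcu_preempt
        "jiffies": jiffies,                    # raw starvation counter
        "cpu": int(m.group("cpu")),            # CPU the kthread last ran on
        "starved_s": jiffies / hz,             # seconds, under the assumed hz
    }


line = ("[495778.881383] rcu: rcu_preempt kthread starved for 5324534 jiffies! "
        "g9706925 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=3")
info = parse_starved(line)
print(info["cpu"], info["jiffies"], info["starved_s"])
```

Feeding every line of a saved serial log through `parse_starved` and collecting the `cpu` values is a quick way to see whether the stalls always point at the same core.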

Aside from the usual NFS server (unused at the time) and syncthing, it was running a duplicity backup at the time. I cannot rule out that it ran out of memory, though it did complete a full backup successfully a couple of days earlier.

helios64-rcu-stall.txt

Posted

I guess this is one core not responding anymore, likely CPU 5 (one of the big cores). Which kernel do you run?

Is this the first time you have encountered this bug?

 

You might want to run ebin-dev's dtb (it contains voltage hacks for the big CPUs).

Posted

Edge from the deb package (6.8.11-edge-rockchip64).

That was the first time. This Saturday I hit a similar situation, _but_ I was able to ssh in (after minutes of waiting) and save the syslog. This time, the first anomaly in the log was:

Sep 14 22:11:52 kobol kernel: BUG: Bad page state in process kcompactd0  pfn:1e320
Sep 14 22:11:52 kobol kernel: page:000000001709b832 refcount:0 mapcount:0 mapping:000000004953ae39 index:0x4c1a1c30 pfn:0x1e320
Sep 14 22:11:52 kobol kernel: aops:0xffff800081149ed8 ino:1
Sep 14 22:11:52 kobol kernel: flags: 0xffff1800000020c(referenced|uptodate|workingset|node=0|zone=0|lastcpupid=0xffff)
Sep 14 22:11:52 kobol kernel: page_type: 0xffffffff()
Sep 14 22:11:52 kobol kernel: raw: 0ffff1800000020c dead000000000100 dead000000000122 ffff0000009e8338
Sep 14 22:11:52 kobol kernel: raw: 000000004c1a1c30 0000000000000000 00000000ffffffff 0000000000000000
Sep 14 22:11:52 kobol kernel: page dumped because: non-NULL mapping

and later there are repeated "rcu: INFO: rcu_preempt detected stalls on CPUs/tasks ..." (see attached file).

I will try that other dtb, thanks!

rcu-stall.txt

Posted

Hi,

For info, I had the same issue running OMV 6.0 with PhotoPrism. PhotoPrism seemed to consume a lot of CPU while processing the pictures, and at some point (between a quarter of an hour and two hours in) the Helios64 failed with the "rcu_preempt detected stalls on CPUs" error.

This error was probably linked to the jumps between CPU frequencies. I solved it by limiting the maximum frequency of the big cores to 1200 MHz. Note that this was before Prahal's DTB, which increases the CPU voltage.
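
A frequency cap like the one described above can be applied through the standard cpufreq sysfs file `scaling_max_freq`, which takes a value in kHz. The sketch below is only an illustration, not necessarily how the cap was set in this case: the helper name `cap_max_freq` is hypothetical, and the policy path in the usage comment assumes the big cores sit under `policy4`, which should be checked on the actual board.

```python
from pathlib import Path


def cap_max_freq(policy_dir, max_khz):
    """Write max_khz (in kHz) to scaling_max_freq inside a cpufreq policy
    directory and return the value that was there before."""
    path = Path(policy_dir) / "scaling_max_freq"
    previous = int(path.read_text().strip())
    path.write_text(f"{max_khz}\n")
    return previous


# Usage on the board (run as root). The path is an ASSUMPTION: check which
# policy directory lists the big cores in its related_cpus file first.
#
#   old = cap_max_freq("/sys/devices/system/cpu/cpufreq/policy4", 1_200_000)
#   print("previous limit (kHz):", old)
```

A value written this way does not survive a reboot, so it would have to be reapplied at boot to make the cap permanent.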

Posted

@prahal sorry for dropping off.

I was too lazy to build the needed version of the kernel, and before long a new edge release came out, and the device has been rock solid since then. Knock on wood.

(As a side note, I _do_ have a real problem with the hardware: after some research I figured out that one of the two disk power circuits has gone bad, so now I have two disks (and an SSD) working, and three slots empty.)

 

crosser@kobol:~$ uname -a
Linux kobol 6.12.1-edge-rockchip64 #1 SMP PREEMPT Fri Nov 22 14:30:26 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
crosser@kobol:~$ uptime
 20:32:31 up 56 days, 12 min,  1 user,  load average: 0.04, 0.05, 0.00
 

(last reboot to upgrade the kernel, no crashes for quite a while)
