
Helios64 edge kernel: rcu: rcu_preempt kthread starved for 5234524 jiffies (and unresponsive)



Posted

I found my Helios64 unresponsive after about a week of uptime. Even the heartbeat LED was not blinking (it stayed permanently on). I see this repeating on the serial console:

[495778.879711] rcu: rcu_preempt kthread timer wakeup didn't happen for 5324533 jiffies! g9706925 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200
[495778.880747] rcu: 	Possible timer handling issue on cpu=3 timer-softirq=2513452
[495778.881383] rcu: rcu_preempt kthread starved for 5324534 jiffies! g9706925 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=3
[495778.882336] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[495778.883137] rcu: RCU grace-period kthread stack dump:
[495778.883584] task:rcu_preempt     state:R stack:0     pid:16    tgid:16    ppid:2      flags:0x00000008
[495778.884404] Call trace:
[495778.884627]  __switch_to+0xe0/0x124
[495778.884946]  __schedule+0x308/0xa8c
[495778.885263]  schedule+0x34/0xf8
[495778.885549]  schedule_timeout+0x98/0x1bc
[495778.885902]  rcu_gp_fqs_loop+0x150/0x670
[495778.886256]  rcu_gp_kthread+0x234/0x274
[495778.886603]  kthread+0x114/0x118
[495778.886896]  ret_from_fork+0x10/0x20
[495778.887219] rcu: Stack dump where RCU GP kthread last ran:
[495778.887705] Sending NMI from CPU 4 to CPUs 3:

Aside from the usual NFS server (unused at the time) and syncthing, it was running a duplicity backup. I cannot rule out that it ran out of memory, though it did complete a full backup successfully a couple of days earlier.
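
For anyone comparing their own serial logs against the one above, here is a minimal Python sketch that pulls the starved duration and the CPU number out of a "kthread starved" line. The function names `parse_stall` and `jiffies_to_seconds` are made up for illustration, and the tick rate is an assumption: `hz` has to match the `CONFIG_HZ` of the running kernel, which the log itself does not show.

```python
import re


def parse_stall(line):
    """Extract kthread name, starved jiffies and CPU from an RCU 'starved' line.

    Returns None if the line is not a 'kthread starved' message.
    """
    m = re.search(r"rcu: (\S+) kthread starved for (\d+) jiffies!", line)
    if m is None:
        return None
    # The '->cpu=N' suffix is present on the 'starved' line in the log above;
    # treat it as optional in case other kernels format it differently.
    cpu = re.search(r"->cpu=(\d+)", line)
    return {
        "kthread": m.group(1),
        "jiffies": int(m.group(2)),
        "cpu": int(cpu.group(1)) if cpu else None,
    }


def jiffies_to_seconds(jiffies, hz):
    """Convert jiffies to seconds; hz must be the kernel's CONFIG_HZ (assumption)."""
    return jiffies / hz


# Example use (hz=250 is only a guess, check the kernel config):
# info = parse_stall(line_from_serial_console)
# print(info["cpu"], jiffies_to_seconds(info["jiffies"], hz=250))
```

If the tick rate really is 250, the stall above would have been going on for several hours before this message was printed, which fits the box looking completely dead.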

helios64-rcu-stall.txt

Posted

I guess this is one core not responding anymore, likely CPU 5 (one of the big cores). Which kernel do you run?

Is this the first time you have encountered this bug?

You might want to try ebin-dev's dtb (it includes voltage hacks for the big CPUs).

Posted

Edge from the deb package (6.8.11-edge-rockchip64).

It was the first time. This Saturday I hit a similar situation, _but_ this time I was able to ssh in (after minutes of waiting) and save the syslog. The first anomaly in the log was:

Sep 14 22:11:52 kobol kernel: BUG: Bad page state in process kcompactd0  pfn:1e320
Sep 14 22:11:52 kobol kernel: page:000000001709b832 refcount:0 mapcount:0 mapping:000000004953ae39 index:0x4c1a1c30 pfn:0x1e320
Sep 14 22:11:52 kobol kernel: aops:0xffff800081149ed8 ino:1
Sep 14 22:11:52 kobol kernel: flags: 0xffff1800000020c(referenced|uptodate|workingset|node=0|zone=0|lastcpupid=0xffff)
Sep 14 22:11:52 kobol kernel: page_type: 0xffffffff()
Sep 14 22:11:52 kobol kernel: raw: 0ffff1800000020c dead000000000100 dead000000000122 ffff0000009e8338
Sep 14 22:11:52 kobol kernel: raw: 000000004c1a1c30 0000000000000000 00000000ffffffff 0000000000000000
Sep 14 22:11:52 kobol kernel: page dumped because: non-NULL mapping

and later there are repeated "rcu: INFO: rcu_preempt detected stalls on CPUs/tasks ..." (see attached file).

I will try that other dtb, thanks!

rcu-stall.txt

Posted

Hi,

For info, I had the same issue running OMV 6.0 with PhotoPrism. PhotoPrism seems to consume a lot of CPU while processing the pictures, and at some point (between a quarter of an hour and two hours in) the Helios64 failed with the "rcu_preempt detected stalls on CPUs" error.

This error was probably linked to the jumps between frequencies. I solved it by limiting the max frequency of the big cores to 1200 MHz. Note that this was before Prahal's DTB that increases the CPU voltage.
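
In case it helps someone, a rough sketch of how such a cap can be set through the standard cpufreq sysfs interface, written in Python. Assumptions on my side: the big cores are grouped under `policy4` (check which policy directories exist on your board), the script runs as root, and `cap_max_freq` is just a name chosen here. `scaling_max_freq` takes the value in kHz, so 1200 MHz is 1200000.

```python
from pathlib import Path

# Standard location of the per-policy cpufreq nodes in sysfs.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu/cpufreq")


def cap_max_freq(policy, max_khz, root=CPUFREQ_ROOT):
    """Write max_khz to <root>/<policy>/scaling_max_freq and return the value read back.

    The kernel may round the request to a supported frequency, so the
    returned value is what is actually in effect.
    """
    node = Path(root) / policy / "scaling_max_freq"
    node.write_text(f"{max_khz}\n")
    return int(node.read_text().strip())


# Example use (as root; "policy4" for the big cores is an assumption):
# print(cap_max_freq("policy4", 1_200_000))
```

A value written this way does not survive a reboot, so it has to be reapplied at boot or set through whatever cpufreq configuration your distribution provides.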
