Helios64 edge kernel: rcu: rcu_preempt kthread starved for 5234524 jiffies (and unresponsive)

crosser · September 1

I found my Helis64 unresponsive after about a week or so. Even heartbeat LED is not blinking (permanently on). I see this repeating on the serial console

[495778.879711] rcu: rcu_preempt kthread timer wakeup didn't happen for 5324533 jiffies! g9706925 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200
[495778.880747] rcu: 	Possible timer handling issue on cpu=3 timer-softirq=2513452
[495778.881383] rcu: rcu_preempt kthread starved for 5324534 jiffies! g9706925 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=3
[495778.882336] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[495778.883137] rcu: RCU grace-period kthread stack dump:
[495778.883584] task:rcu_preempt     state:R stack:0     pid:16    tgid:16    ppid:2      flags:0x00000008
[495778.884404] Call trace:
[495778.884627]  __switch_to+0xe0/0x124
[495778.884946]  __schedule+0x308/0xa8c
[495778.885263]  schedule+0x34/0xf8
[495778.885549]  schedule_timeout+0x98/0x1bc
[495778.885902]  rcu_gp_fqs_loop+0x150/0x670
[495778.886256]  rcu_gp_kthread+0x234/0x274
[495778.886603]  kthread+0x114/0x118
[495778.886896]  ret_from_fork+0x10/0x20
[495778.887219] rcu: Stack dump where RCU GP kthread last ran:
[495778.887705] Sending NMI from CPU 4 to CPUs 3:

Aside from usual NFS server (unused at the time) and syncthing, it was running duplicity backup at the time. I cannot rule out that it ran out of memory. Though it did run full backup successfully a couple of days ago.

helios64-rcu-stall.txt

prahal · September 5

I guess this is one core not responding anymore, likely CPU 5 (one of the big cores). Which kernel do you run?

Is this the first time you encounter this bug?

You might want to run ebin-dev dtb (there are voltage hacks for the big CPUs in it).

crosser · Sunday at 08:12 PM

Edge from the deb package (6.8.11-edge-rockchip64).

It was the first time; this Saturday I got a similar situation, _but_ I was able to ssh (after minutes of waiting) and save syslog. The first anomaly in the log was (this time):

Sep 14 22:11:52 kobol kernel: BUG: Bad page state in process kcompactd0  pfn:1e320
Sep 14 22:11:52 kobol kernel: page:000000001709b832 refcount:0 mapcount:0 mapping:000000004953ae39 index:0x4c1a1c30 pfn:0x1e320
Sep 14 22:11:52 kobol kernel: aops:0xffff800081149ed8 ino:1
Sep 14 22:11:52 kobol kernel: flags: 0xffff1800000020c(referenced|uptodate|workingset|node=0|zone=0|lastcpupid=0xffff)
Sep 14 22:11:52 kobol kernel: page_type: 0xffffffff()
Sep 14 22:11:52 kobol kernel: raw: 0ffff1800000020c dead000000000100 dead000000000122 ffff0000009e8338
Sep 14 22:11:52 kobol kernel: raw: 000000004c1a1c30 0000000000000000 00000000ffffffff 0000000000000000
Sep 14 22:11:52 kobol kernel: page dumped because: non-NULL mapping

and later there are repeated "rcu: INFO: rcu_preempt detected stalls on CPUs/tasks ..." (see attached file).

I will try that other dtb, thanks!

rcu-stall.txt

Sign In

Helios64 edge kernel: rcu: rcu_preempt kthread starved for 5234524 jiffies (and unresponsive)

Recommended Posts

crosser

Link to comment

Share on other sites

Armbian & Khadas are rewarding contributors

prahal

Link to comment

Share on other sites

crosser

Link to comment

Share on other sites

Join the conversation

Forums

My Activity Streams

Download

Store

Important Information