4 4
Mikhail Kulinich

Linux kernel crash and system moves to unresponsive state

Recommended Posts

(edited)

At 19:54:55 kernel crash happens. That for some reason forces system to change time to Aug 24 1978.
systemd reacts to that situation strangely, which puts system under unresponsive state, like infinite loop, potentially causing 100% cpu occupation (this is guess now)
The printouts "Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed" follows many times per second until node is rebooted manually.

 

The kernel version: 4.19.50-sunxi #5.89 SMP Fri Jun 14 01:50:58 EDT 2019 armv7l armv7l armv7l GNU/Linux

Aug 10 19:45:01 ptc_lab_collab3_node17 CRON[32019]: pam_unix(cron:session): session closed for user root
Aug 10 19:45:01 ptc_lab_collab3_node17 CRON[32018]: pam_unix(cron:session): session closed for user root
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: rcu:         1-...!: (2 GPs behind) idle=bfa/0/0x1 softirq=4388853/4388853 fqs=0
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: rcu:         2-...!: (8 GPs behind) idle=114/0/0x0 softirq=4692641/4692641 fqs=0
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: rcu:         (detected by 3, t=39728 jiffies, g=15607817, q=14)
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: Sending NMI from CPU 3 to CPUs 1:
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: Sending NMI from CPU 3 to CPUs 2:
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: NMI backtrace for cpu 2
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: CPU: 2 PID: 0 Comm: swapper/2 Tainted: G        W  O      4.19.50-sunxi #5.89
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: Hardware name: Allwinner sun8i Family
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: PC is at arch_cpu_idle+0x28/0x2c
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: LR is at arch_cpu_idle+0x27/0x2c
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: pc : [<c01078f4>]    lr : [<c01078f3>]    psr: 40010033
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: sp : df52dfb0  ip : c0e03d00  fp : 00000000
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: r10: c0db9870  r9 : c0e04d48  r8 : 00000000
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: r7 : 00000004  r6 : c0e04db8  r5 : c0e04d70  r4 : ffffe000
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: r3 : c0116441  r2 : dfb03438  r1 : 0739e3cc  r0 : 00000000
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA Thumb  Segment none
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: Control: 50c5387d  Table: 5e67806a  DAC: 00000051
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: CPU: 2 PID: 0 Comm: swapper/2 Tainted: G        W  O      4.19.50-sunxi #5.89
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: Hardware name: Allwinner sun8i Family
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: [<c010d745>] (unwind_backtrace) from [<c010a2f1>] (show_stack+0x11/0x14)
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: [<c010a2f1>] (show_stack) from [<c08efb01>] (dump_stack+0x69/0x78)
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: [<c08efb01>] (dump_stack) from [<c08f3d4d>] (nmi_cpu_backtrace+0x59/0x90)
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: [<c08f3d4d>] (nmi_cpu_backtrace) from [<c010c5d1>] (handle_IPI+0x85/0x2c0)
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: [<c010c5d1>] (handle_IPI) from [<c05c9287>] (gic_handle_irq+0x67/0x68)
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: [<c05c9287>] (gic_handle_irq) from [<c0101a65>] (__irq_svc+0x65/0x94)
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: Exception stack(0xdf52df60 to 0xdf52dfa8)
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: df60: 00000000 0739e3cc dfb03438 c0116441 ffffe000 c0e04d70 c0e04db8 00000004
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: df80: 00000000 c0e04d48 c0db9870 00000000 c0e03d00 df52dfb0 c01078f3 c01078f4
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: dfa0: 40010033 ffffffff
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: [<c0101a65>] (__irq_svc) from [<c01078f4>] (arch_cpu_idle+0x28/0x2c)
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: [<c01078f4>] (arch_cpu_idle) from [<c013e973>] (do_idle+0x14b/0x1d8)
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: [<c013e973>] (do_idle) from [<c013ebf5>] (cpu_startup_entry+0x19/0x1c)
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: [<c013ebf5>] (cpu_startup_entry) from [<401024b1>] (0x401024b1)
Aug 10 19:55:01 ptc_lab_collab3_node17 CRON[32672]: pam_unix(cron:account): account root has password changed in future
Aug 10 19:55:01 ptc_lab_collab3_node17 CRON[32672]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 10 19:55:01 ptc_lab_collab3_node17 CRON[32673]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 10 19:55:01 ptc_lab_collab3_node17 CRON[32672]: pam_unix(cron:session): session closed for user root
Aug 10 19:58:32 ptc_lab_collab3_node17 systemd[1]: systemd-udevd.service: Watchdog timeout (limit 3min)!
Aug 10 20:00:01 ptc_lab_collab3_node17 CRON[32754]: pam_unix(cron:account): account root has password changed in future
Aug 10 20:00:01 ptc_lab_collab3_node17 CRON[32754]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 10 20:00:01 ptc_lab_collab3_node17 CRON[32755]: (root) CMD (/usr/lib/armbian/armbian-truncate-logs)
Aug 10 20:00:02 ptc_lab_collab3_node17 systemd[1]: systemd-udevd.service: State 'stop-sigabrt' timed out. Terminating.
Aug 10 20:00:06 ptc_lab_collab3_node17 CRON[32754]: pam_unix(cron:session): session closed for user root
Aug 10 20:01:33 ptc_lab_collab3_node17 systemd[1]: systemd-udevd.service: State 'stop-sigterm' timed out. Killing.
Aug 10 20:03:03 ptc_lab_collab3_node17 systemd[1]: systemd-udevd.service: Processes still around after SIGKILL. Ignoring.
Aug 10 20:04:33 ptc_lab_collab3_node17 systemd[1]: systemd-udevd.service: State 'stop-final-sigterm' timed out. Killing.
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[1]: systemd-udevd.service: Processes still around after final SIGKILL. Entering failed mode.
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[1]: systemd-udevd.service: Unit entered failed state.
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[1]: systemd-udevd.service: Failed with result 'timeout'.
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[1]: icecast2.service: Service reached runtime time limit. Stopping.
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[1]: farm-assets.service: Service reached runtime time limit. Stopping.
Aug 24 07:45:10 ptc_lab_collab3_node17 CRON[32766]: pam_unix(cron:account): account root has password changed in future
Aug 24 07:45:10 ptc_lab_collab3_node17 CRON[32766]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 24 07:45:10 ptc_lab_collab3_node17 CRON[301]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 24 07:45:10 ptc_lab_collab3_node17 CRON[32766]: pam_unix(cron:session): session closed for user root
Aug 24 07:45:10 ptc_lab_collab3_node17 icecast2[300]:  * Stopping streaming media server icecast2
Aug 24 07:45:10 ptc_lab_collab3_node17 CRON[32765]: pam_unix(cron:account): account root has password changed in future
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[1]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 CRON[32765]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 24 07:45:10 ptc_lab_collab3_node17 icsneo-socketcan[797]: ValueCAN 3 126241 disconnected
Aug 24 07:45:10 ptc_lab_collab3_node17 icsneo-socketcan[797]: Removing device mscan_126241 with handle 1
Aug 24 07:45:10 ptc_lab_collab3_node17 kernel: intrepid: Removing device 1 can1 0x17c453e0
Aug 24 07:45:10 ptc_lab_collab3_node17 CRON[309]: (root) CMD (/usr/lib/armbian/armbian-truncate-logs)
Aug 24 07:45:10 ptc_lab_collab3_node17 systemctl[308]: Failed to stop assets.mount: Connection timed out
Aug 24 07:45:10 ptc_lab_collab3_node17 systemctl[308]: See system logs and 'systemctl status assets.mount' for details.
Aug 24 07:45:10 ptc_lab_collab3_node17 systemctl[308]: Failed to get load state of assets.mount: Connection timed out
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 icsneo-socketcan[797]: Removed device mscan_126241 with handle 1, result 0
Aug 24 07:45:10 ptc_lab_collab3_node17 kernel: intrepid: Removed device 1
Aug 24 07:45:10 ptc_lab_collab3_node17 kernel: intrepid: Removing device 0 can0 0x84d5895f
Aug 24 07:45:10 ptc_lab_collab3_node17 icsneo-socketcan[797]: Removing device hscan_126241 with handle 0
Aug 24 07:45:10 ptc_lab_collab3_node17 CRON[32765]: pam_unix(cron:session): session closed for user root
Aug 24 07:45:10 ptc_lab_collab3_node17 icsneo-socketcan[797]: Removed device hscan_126241 with handle 0, result 0
Aug 24 07:45:10 ptc_lab_collab3_node17 kernel: intrepid: Removed device 0
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed
Aug 24 07:45:10 ptc_lab_collab3_node17 systemd[20635]: Time has been changed

 

Would be nice to receive any suggestions to root cause this issue.

Edited by Mikhail Kulinich

Share this post


Link to post
Share on other sites
7 hours ago, Mikhail Kulinich said:

Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: rcu:         1-...!: (2 GPs behind) idle=bfa/0/0x1 softirq=4388853/4388853 fqs=0
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: rcu:         2-...!: (8 GPs behind) idle=114/0/0x0 softirq=4692641/4692641 fqs=0
Aug 10 19:54:55 ptc_lab_collab3_node17 kernel: rcu:         (detected by 3, t=39728 jiffies, g=15607817, q=14)

 

Looks like a CPU stall on cores 1 and 2. Nothing obvious in the call stack. The systemd errors are a symptom of the problem, not the root cause of the problem. What type of workload is this system running?

 

See the following link for some possible causes of CPU stalls: https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt

Share this post


Link to post
Share on other sites
4 hours ago, drice said:

What type of workload is this system running?

 

The system runs automated tests (using Robot Framework, python tool), interacts via CAN bus (using Intrepid ValueCAN3) with system under test, does some unheavy networking. This was not happening on kernel 4.13.15, but started to happen on 4.19.50.

 

I've separate issue on Github for Intrepid as their driver gives a call stack during boot up.  However, not sure if this is related.

Share this post


Link to post
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
4 4