Kernel panics, stalls and bugs on 5.10.21


ShadowDance
 Share

2 2

Recommended Posts

I'm experiencing a lot of issues on 5.10.x and thought a dedicated thread would make sense. I've posted some reports in other threads too. Note that 5.9.x is much more stable for me, but not without problems.

 

I'm also experiencing what I can only refer to as boot stalls, i.e. nothing happens. Occasionally when these stalls happen there is a rcu: INFO: rcu_preempt detected stalls on CPUs/tasks trace printed. This happens on both 5.9 and 5.10. Looking at where it stalls it seems to be around the time the hardware optimizations are applied and/or graphics modules are loaded. I have currently blocked the graphics modules from loading and have moved hardware optimizations to initrd so that they will be performed earlier. Time will tell if this has an effect or not.

 

  • Distro: Armbian Buster 21.02.3
  • Kernel: 5.10.21
  • /proc/cmdline: root=ZFS=rpool/ROOT/debian rootwait rootfstype=zfs earlycon console=ttyS2,1500000  consoleblank=0 loglevel=7 ubootpart=794023b5-01 usb-storage.quirks=0x2537:0x1066:u,0x2537:0x1068:u libata.force=noncq earlyprintk ignore_loglevel  cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory swapaccount=1
  • Filesystem: ZFS 2.0.3 (root on ZFS, boot from eMMC)
  • LUKS: Yes, full disk encryption (HDD -> Partition -> LUKS -> ZFS)
  • IO scheduler: None

 

Panic during boot: Unable to handle kernel write to read-only memory at virtual address ffff8000100d3a54

Spoiler


[  OK  ] Started Flush Journal to Persistent Storage.
[  120.234882] Unable to handle kernel write to read-only memory at virtual address ffff8000100d3a54
[  120.235662] Mem abort info:
[  120.235912]   ESR = 0x9600004e
[  120.236186]   EC = 0x25: DABT (current EL), IL = 32 bits
[  120.236654]   SET = 0, FnV = 0
[  120.236926]   EA = 0, S1PTW = 0
[  120.237205] Data abort info:
[  120.237463]   ISV = 0, ISS = 0x0000004e
[  120.237802]   CM = 0, WnR = 1
[  120.238069] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000003669000
[  120.238657] [ffff8000100d3a54] pgd=00000000f7fff003, p4d=00000000f7fff003, pud=00000000f7ffe003, pmd=0040000002200781
[  120.239600] Internal error: Oops: 9600004e [#1] PREEMPT SMP
[  120.240093] Modules linked in: r8152 snd_soc_rockchip_i2s snd_soc_core snd_pcm_dmaengine snd_pcm fusb302 sg tcpm snd_timer typec snd gpio_charger soundcore leds_pwm gpio_beeper cpufreq_dt nfsd auth_rpcgss nfs_acl lockd grace lm75 sunrpc ip_tables x_tables autofs4 zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) algif_skcipher af_alg dm_crypt dm_mod realtek dwmac_rk stmmac_platform stmmac pcs_xpcs adc_keys pwm_fan
[  120.243782] CPU: 5 PID: 0 Comm: swapper/5 Tainted: P           OE     5.10.21-rockchip64 #21.02.3
[  120.244559] Hardware name: Helios64 (DT)
[  120.244910] pstate: 80000085 (Nzcv daIf -PAN -UAO -TCO BTYPE=--)
[  120.245448] pc : sched_clock+0x1c/0x90
[  120.245784] lr : sched_clock_cpu+0x14/0x30
[  120.246145] sp : ffff800011c13de0
[  120.246440] x29: ffff800011c13eb8 x28: ffff000040311d00
[  120.246914] x27: ffff000040004300 x26: ffff800011cf3de0
[  120.247387] x25: 0000000000000001 x24: 0000000000000000
[  120.247860] x23: 0000000000000000 x22: ffff8000100d42f4
[  120.248333] x21: ffff8000118b9948 x20: ffff0000f77c7d00
[  120.248805] x19: ffff0000f77c7d00 x18: 0000000000000000
[  120.249278] x17: 0000000000000000 x16: 0000000000000000
[  120.249750] x15: 0000000000000000 x14: 00000000000002db
[  120.250222] x13: 0000000000000010 x12: 0000000000000001
[  120.250696] x11: 0000000000000040 x10: ffff80001194ef28
[  120.251168] x9 : ffff80001194ef20 x8 : ffff000040400270
[  120.251640] x7 : 0000000000000000 x6 : 0000000000000000
[  120.252113] x5 : ffff000040400248 x4 : ffff8000e6248000
[  120.252585] x3 : 0000000000000000 x2 : 0000000000000001
[  120.253057] x1 : 0000000000000004 x0 : 0000000000000005
[  120.253530] Call trace:
[  120.253752]  sched_clock+0x1c/0x90
[  120.254060] Code: a9025bf5 d000c056 912302d6 a90363f7 (910022d7)
[  120.254606] ---[ end trace 45509e5a34da5642 ]---
[  120.255014] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[  120.255620] SMP: stopping secondary CPUs
[  120.255974] Kernel Offset: disabled
[  120.256285] CPU features: 0x0240022,6100200c
[  120.256662] Memory Limit: none
[  120.256946] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---

 

 

Kernel bug (system left inoperable): Internal error: Oops - BUG: 0 [#1] PREEMPT SMP

Spoiler


[  103.576970] usb 4-1.4: reset SuperSpeed Gen 1 USB device number 3 using xhci-hcd
[  103.661197] ------------[ cut here ]------------
[  103.661613] kernel BUG at arch/arm64/kernel/traps.c:470!
[  103.662083] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[  103.662565] Modules linked in: snd_soc_rockchip_i2s r8152(+) snd_soc_core snd_pcm_dmaengine snd_pcm sg snd_timer snd fusb302 tcpm soundcore gpio_charger typec leds_pwm gpio_beeper cpufreq_dt nfsd auth_rpcgss nfs_acl lockd grace sunrpc
lm75 ip_tables x_tables autofs4 zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) algif_skcipher af_alg dm_crypt dm_mod realtek dwmac_rk stmmac_platform stmmac mdio_xpcs adc_keys pwm_fan
[  103.666178] CPU: 5 PID: 423 Comm: kworker/5:2 Tainted: P           OE     5.9.14-rockchip64 #20.11.4
[  103.666975] Hardware name: Helios64 (DT)
[  103.667337] Workqueue: events dbs_work_handler
[  103.667733] pstate: 00000085 (nzcv daIf -PAN -UAO BTYPE=--)
[  103.668229] pc : do_undefinstr+0x2ec/0x318
[  103.668590] lr : do_undefinstr+0x1e4/0x318
[  103.668950] sp : ffff80001250b310
[  103.669244] x29: ffff80001250b310 x28: ffff0000e3ab4880
[  103.669712] x27: 00000000000000fa x26: 0000000001000001
[  103.670180] x25: 0000000000000000 x24: ffff0000f59edcf0
[  103.670648] x23: 0000000080000085 x22: ffff8000100221a8
[  103.671115] x21: ffff80001250b4c0 x20: ffff80001250b380
[  103.671582] x19: ffff800011839948 x18: 00001a54e93b15d3
[  103.672049] x17: 00000020a0fc205e x16: 000000a89c508b62
[  103.672517] x15: ffff80001250b47f x14: ffff0000d90f7ff8
[  103.672984] x13: 00000000000003d7 x12: ffff8000091fb000
[  103.673451] x11: 0000000000000001 x10: 0000000000000a30
[  103.673919] x9 : ffff80001250b440 x8 : ffff0000e3ab5310
[  103.674386] x7 : 0000000000000000 x6 : ffff80001250b368
[  103.674853] x5 : 00000000d5300000 x4 : ffff800011846258
[  103.675320] x3 : 0000000000000002 x2 : 0000000000000001
[  103.675787] x1 : ffff0000e3ab4880 x0 : 0000000080000085
[  103.676255] Call trace:
[  103.676475]  do_undefinstr+0x2ec/0x318
[  103.676809]  el1_sync_handler+0xbc/0x120
[  103.677155]  el1_sync+0x7c/0x100
[  103.677444]  save_return_addr+0x30/0x38
[  103.677784]  return_address+0x5c/0x90
[  103.678112]  preempt_count_add+0xb8/0x158
[  103.678468]  _raw_spin_lock_irqsave+0x28/0x2f8
[  103.678863]  prepare_to_wait_event+0x24/0xe8
[  103.679245]  rk3x_i2c_xfer_common.isra.6+0x3f0/0x488
[  103.679683]  rk3x_i2c_xfer+0x18/0x50
[  103.680000]  __i2c_transfer+0x14c/0x670
[  103.680339]  i2c_transfer+0x60/0x128
[  103.680656]  i2c_transfer_buffer_flags+0x5c/0x88
[  103.681067]  regmap_i2c_write+0x20/0x58
[  103.681406]  _regmap_raw_write_impl+0x6fc/0x868
[  103.681805]  _regmap_bus_raw_write+0x68/0x88
[  103.682182]  _regmap_write+0x6c/0x160
[  103.682507]  _regmap_update_bits+0xf8/0x110
[  103.682876]  regmap_update_bits_base+0x64/0x98
[  103.683272]  regulator_set_voltage_sel_regmap+0x48/0x98
[  103.683733]  _regulator_call_set_voltage_sel+0x80/0xd0
[  103.684185]  _regulator_do_set_voltage+0x460/0x5e0
[  103.684608]  regulator_set_voltage_rdev+0xac/0x228
[  103.685031]  regulator_do_balance_voltage+0x29c/0x400
[  103.685476]  regulator_balance_voltage+0x50/0x90
[  103.685884]  regulator_set_voltage_unlocked+0x94/0x128
[  103.686338]  regulator_set_voltage+0x54/0x98
[  103.686718]  _set_opp_voltage+0x44/0x148
[  103.687067]  dev_pm_opp_set_rate+0x274/0x638
[  103.687452]  set_target+0x40/0x88 [cpufreq_dt]
[  103.687846]  __cpufreq_driver_target+0x2b0/0x698
[  103.688255]  od_dbs_update+0xbc/0x1a0
[  103.688580]  dbs_work_handler+0x40/0x78
[  103.688922]  process_one_work+0x1ec/0x4b0
[  103.689276]  worker_thread+0x48/0x478
[  103.689602]  kthread+0x140/0x150
[  103.689890]  ret_from_fork+0x10/0x34
[  103.690213] Code: f9401bf7 17ffff7d a9025bf5 f9001bf7 (d4210000)
[  103.690751] ---[ end trace 95a6ab813609b0f6 ]---
[  103.691159] note: kworker/5:2[423] exited with preempt_count 2
[  103.692007] ------------[ cut here ]------------
[  103.692435] WARNING: CPU: 5 PID: 0 at kernel/rcu/tree.c:630 rcu_eqs_enter.isra.76+0xa8/0xb0
[  103.693164] Modules linked in: snd_soc_rockchip_i2s r8152(+) snd_soc_core snd_pcm_dmaengine snd_pcm sg snd_timer snd fusb302 tcpm soundcore gpio_charger typec leds_pwm gpio_beeper cpufreq_dt nfsd auth_rpcgss nfs_acl lockd grace sunrpc
lm75 ip_tables x_tables autofs4 zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) algif_skcipher af_alg dm_crypt dm_mod realtek dwmac_rk stmmac_platform stmmac mdio_xpcs adc_keys pwm_fan
[  103.696768] CPU: 5 PID: 0 Comm: swapper/5 Tainted: P      D    OE     5.9.14-rockchip64 #20.11.4
[  103.697534] Hardware name: Helios64 (DT)
[  103.697883] pstate: 200003c5 (nzCv DAIF -PAN -UAO BTYPE=--)
[  103.698375] pc : rcu_eqs_enter.isra.76+0xa8/0xb0
[  103.698783] lr : rcu_eqs_enter.isra.76+0x1c/0xb0
[  103.699189] sp : ffff800011bcbf00
[  103.699482] x29: ffff800011bcbf00 x28: 0000000000000000
[  103.699950] x27: 0000000000000000 x26: ffff0000f6eae580
[  103.700417] x25: 0000000000000000 x24: ffff80001128fd00
[  103.700884] x23: ffff800011523838 x22: ffff800011839948
[  103.701351] x21: ffff80001183a2a8 x20: ffff800011839988
[  103.701818] x19: ffff8000115258c0 x18: 0000000000000004
[  103.702285] x17: 0000000000000001 x16: 0000000000000019
[  103.702752] x15: ffff80001185a098 x14: ffff0000e3dc3a68
[  103.703219] x13: 0000000000000131 x12: 0000000000000000
[  103.703686] x11: 0000000000000001 x10: 0000000000000a30
[  103.704153] x9 : ffff800011bcbe80 x8 : ffff0000f6eaf010
[  103.704620] x7 : 00000000ffffffff x6 : 00000000af348f7b
[  103.705088] x5 : 00ffffffffffffff x4 : ffff8000e62a3000
[  103.705555] x3 : 0000000000000001 x2 : 4000000000000000
[  103.706022] x1 : 4000000000000002 x0 : ffff0000f77c88c0
[  103.706489] Call trace:
[  103.706710]  rcu_eqs_enter.isra.76+0xa8/0xb0
[  103.707093]  rcu_idle_enter+0x10/0x20
[  103.707419]  default_idle_call+0x44/0x1d0
[  103.707774]  do_idle+0x1fc/0x270
[  103.708061]  cpu_startup_entry+0x24/0x60
[  103.708411]  secondary_start_kernel+0x16c/0x180
[  103.708809] ---[ end trace 95a6ab813609b0f7 ]---
[  103.749437] r8152 4-1.4:1.0 (unnamed net_device) (uninitialized): netif_napi_add() called with weight 256
[  103.769440] r8152 4-1.4:1.0 eth1: v2.14.0 (2020/09/24)
[  103.769923] r8152 4-1.4:1.0 eth1: This product is covered by one or more of the following patents:
[  103.769923]          US6,570,884, US6,115,776, and US6,327,625.
[  103.769923]
[  103.771654] usbcore: registered new interface driver r8152

 

 

And the latest issue: WARNING: Pool 'rpool' has encountered an uncorrectable I/O failure and has been suspended.

Note: The warning above was only printed the latter time, but the same symptom occurred two times within two days. I.e. everything is stuck in iowait and it's basically impossible to do anything with the system. This only happens on 5.10, and happened two times within two days. On 5.9 (ZFS 2.0.3 as well) there is no such issue.

Spoiler


[11651.128098] WARNING: Pool 'rpool' has encountered an uncorrectable I/O failure and has been suspended.
---[ cut ]--- (Sorry, part missing here because it had scrolled past scrollback history, but essentially same kinds of errors as below)
[11663.588993] systemd[1]: Failed to start Journal Service.
[11663.589901] systemd[1]: Dependency failed for Flush Journal to Persistent Storage.
[11663.590644] systemd[1]: systemd-journal-flush.service: Job systemd-journal-flush.service/start failed with result 'dependency'.
[11663.594603] systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[11663.595694] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 2.
[11663.597479] systemd[1]: Stopped Journal Service.
[11663.598342] systemd[1]: systemd-journald.service: Found left-over process 6620 (systemd-journal) in control group while starting unit. Ignoring.
[11663.599484] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[11753.833991] systemd[1]: systemd-journald.service: Start operation timed out. Terminating.
[11844.081183] systemd[1]: systemd-journald.service: State 'stop-sigterm' timed out. Killing.
[11844.082126] systemd[1]: systemd-journald.service: Killing process 11419 (systemd-journal) with signal SIGKILL.
[11844.083288] systemd[1]: systemd-journald.service: Killing process 6620 (systemd-journal) with signal SIGKILL.
[11844.084224] systemd[1]: systemd-journald.service: Killing process 11167 (systemd-journal) with signal SIGKILL.
[11896.683709] systemd[1]: NetworkManager-dispatcher.service: Succeeded.
[11934.328404] systemd[1]: systemd-journald.service: Processes still around after SIGKILL. Ignoring.
[12024.575976] systemd[1]: systemd-journald.service: State 'stop-final-sigterm' timed out. Killing.
[12024.576976] systemd[1]: systemd-journald.service: Killing process 11419 (systemd-journal) with signal SIGKILL.
[12024.578042] systemd[1]: systemd-journald.service: Killing process 6620 (systemd-journal) with signal SIGKILL.
[12024.578966] systemd[1]: systemd-journald.service: Killing process 11167 (systemd-journal) with signal SIGKILL.
[12114.822800] systemd[1]: systemd-journald.service: Processes still around after final SIGKILL. Entering failed mode.
[12114.823731] systemd[1]: systemd-journald.service: Failed with result 'timeout'.
[12114.824712] systemd[1]: Failed to start Journal Service.
[12114.826630] systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[12114.827632] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 3.
[12114.828834] systemd[1]: Stopped Journal Service.
[12114.829439] systemd[1]: systemd-journald.service: Found left-over process 6620 (systemd-journal) in control group while starting unit. Ignoring.
[12114.830593] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12114.831625] systemd[1]: systemd-journald.service: Found left-over process 11167 (systemd-journal) in control group while starting unit. Ignoring.
[12114.832761] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12205.070123] systemd[1]: systemd-journald.service: Start operation timed out. Terminating.
[12295.317332] systemd[1]: systemd-journald.service: State 'stop-sigterm' timed out. Killing.
[12295.318234] systemd[1]: systemd-journald.service: Killing process 11872 (systemd-journal) with signal SIGKILL.
[12295.319307] systemd[1]: systemd-journald.service: Killing process 6620 (systemd-journal) with signal SIGKILL.
[12295.320232] systemd[1]: systemd-journald.service: Killing process 11167 (systemd-journal) with signal SIGKILL.
[12295.321220] systemd[1]: systemd-journald.service: Killing process 11419 (systemd-journal) with signal SIGKILL.
[12385.564582] systemd[1]: systemd-journald.service: Processes still around after SIGKILL. Ignoring.
[12475.811853] systemd[1]: systemd-journald.service: State 'stop-final-sigterm' timed out. Killing.
[12475.812831] systemd[1]: systemd-journald.service: Killing process 11872 (systemd-journal) with signal SIGKILL.
[12475.813923] systemd[1]: systemd-journald.service: Killing process 6620 (systemd-journal) with signal SIGKILL.
[12475.814848] systemd[1]: systemd-journald.service: Killing process 11167 (systemd-journal) with signal SIGKILL.
[12475.815957] systemd[1]: systemd-journald.service: Killing process 11419 (systemd-journal) with signal SIGKILL.
[12566.059169] systemd[1]: systemd-journald.service: Processes still around after final SIGKILL. Entering failed mode.
[12566.060119] systemd[1]: systemd-journald.service: Failed with result 'timeout'.
[12566.061326] systemd[1]: Failed to start Journal Service.
[12566.064617] systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[12566.065669] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 4.
[12566.067254] systemd[1]: Stopped Journal Service.
[12566.068003] systemd[1]: systemd-journald.service: Found left-over process 6620 (systemd-journal) in control group while starting unit. Ignoring.
[12566.069316] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12566.070377] systemd[1]: systemd-journald.service: Found left-over process 11167 (systemd-journal) in control group while starting unit. Ignoring.
[12566.071518] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12571.809061] systemd[1]: systemd-timesyncd.service: Watchdog timeout (limit 3min)!
[12571.809922] systemd[1]: systemd-timesyncd.service: Killing process 7563 (systemd-timesyn) with signal SIGABRT.
[12656.306362] systemd[1]: systemd-journald.service: Start operation timed out. Terminating.
[12662.056236] systemd[1]: systemd-timesyncd.service: State 'stop-watchdog' timed out. Terminating.
[12746.553671] systemd[1]: systemd-journald.service: State 'stop-sigterm' timed out. Killing.

 

 

Anyway, I'm back on 5.9.14 for the time being. Just finished a full scrub of the ZFS filesystem as well, no issues or errors.

Link to post
Share on other sites

Donate and support the project!

 Share

2 2