Helios64: Kernel oops when starting docker container

mikeakers · January 29, 2024

Hello! I'm able to reproducibly get a kernel oops whenever I start a docker container on my Helios64.

The SD card I was running my Helios64 from died and so I reinstalled Armbian using the steps to install bookworm from this post. I did skip the steps related to hs400 and L2 cache as I'm currently running the system from an SD card.

After the install seemed to be working well I installed zfs, imported my existing pool, and then installed docker. My previous setup was using docker's ZFS storage driver so I got that configured as well.

Now, whenever I start up a docker container (e.g. by running "sudo docker run hello-world") I get the following oops from the kernel:

[  246.422387] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000038
[  246.422400] Mem abort info:
[  246.422403]   ESR = 0x0000000096000005
[  246.422406]   EC = 0x25: DABT (current EL), IL = 32 bits
[  246.422411]   SET = 0, FnV = 0
[  246.422414]   EA = 0, S1PTW = 0
[  246.422417]   FSC = 0x05: level 1 translation fault
[  246.422421] Data abort info:
[  246.422423]   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
[  246.422426]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  246.422430]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  246.422434] user pgtable: 4k pages, 48-bit VAs, pgdp=000000000942d000
[  246.422439] [0000000000000038] pgd=0800000016225003, p4d=0800000016225003, pud=0000000000000000
[  246.422452] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
[  246.423015] Modules linked in: veth xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge rfkill sunrpc lz4hc lz4 zram binfmt_misc zfs(PO) crct10dif_ce leds_pwm panfrost spl(O) gpio_charger gpu_sched drm_shmem_helper pwm_fan snd_soc_hdmi_codec snd_soc_rockchip_i2s rockchip_vdec(C) hantro_vpu rk_crypto snd_soc_core rockchip_rga snd_compress v4l2_vp9 v4l2_h264 videobuf2_dma_sg videobuf2_dma_contig ac97_bus v4l2_mem2mem snd_pcm_dmaengine snd_pcm videobuf2_memops videobuf2_v4l2 snd_timer videodev nvmem_rockchip_efuse videobuf2_common mc snd soundcore gpio_beeper ledtrig_netdev lm75 ip_tables x_tables autofs4 efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear cdc_ncm cdc_ether usbnet r8152 realtek rockchipdrm dw_mipi_dsi fusb302 dw_hdmi tcpm analogix_dp dwmac_rk typec drm_display_helper stmmac_platform cec
[  246.423214]  drm_dma_helper stmmac drm_kms_helper drm adc_keys pcs_xpcs
[  246.431612] CPU: 5 PID: 2686 Comm: dockerd Tainted: P         C O       6.6.8-edge-rockchip64 #1
[  246.432383] Hardware name: Helios64 (DT)
[  246.432731] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  246.433344] pc : get_device_state+0x28/0x88 [ledtrig_netdev]
[  246.433854] lr : netdev_trig_notify+0x13c/0x1fc [ledtrig_netdev]
[  246.434387] sp : ffff80008b2ab3e0
[  246.434680] x29: ffff80008b2ab3e0 x28: ffff0000154d3378 x27: ffff80007a36dc30
[  246.435313] x26: ffff800081c89a40 x25: ffff800081907008 x24: ffff80008b2ab5b8
[  246.435944] x23: 000000000000000b x22: ffff000005efb000 x21: ffff0000154d3300
[  246.436575] x20: ffff0000154d3378 x19: ffff0000154d3300 x18: ffff80008d293c58
[  246.437206] x17: 000000040044ffff x16: 00500074b5503510 x15: 0000000000000000
[  246.437837] x14: ffffffffffffffff x13: 0000000000000020 x12: 0101010101010101
[  246.438468] x11: 7f7f7f7f7f7f7f7f x10: 000000000f5d83a0 x9 : 0000000000000020
[  246.439099] x8 : 0101010101010101 x7 : 0000000080000000 x6 : 0000000080303663
[  246.439729] x5 : 6336300000000000 x4 : 0000000000000000 x3 : 0000000000000000
[  246.440360] x2 : ffff000021bdbc00 x1 : ffff000021bdbc00 x0 : 0000000000000000
[  246.440991] Call trace:
[  246.441210]  get_device_state+0x28/0x88 [ledtrig_netdev]
[  246.441683]  netdev_trig_notify+0x13c/0x1fc [ledtrig_netdev]
[  246.442185]  notifier_call_chain+0x74/0x144
[  246.442561]  raw_notifier_call_chain+0x18/0x24
[  246.442956]  call_netdevice_notifiers_info+0x58/0xa4
[  246.443399]  dev_change_name+0x190/0x318
[  246.443751]  do_setlink+0xb7c/0xde4
[  246.444064]  rtnl_setlink+0xf8/0x194
[  246.444383]  rtnetlink_rcv_msg+0x12c/0x398
[  246.444748]  netlink_rcv_skb+0x5c/0x128
[  246.445089]  rtnetlink_rcv+0x18/0x24
[  246.445408]  netlink_unicast+0x2e8/0x350
[  246.445755]  netlink_sendmsg+0x1d4/0x444
[  246.446102]  __sock_sendmsg+0x5c/0xac
[  246.446432]  __sys_sendto+0x124/0x150
[  246.446760]  __arm64_sys_sendto+0x28/0x38
[  246.447117]  invoke_syscall+0x48/0x114
[  246.447453]  el0_svc_common.constprop.0+0x40/0xe8
[  246.447871]  do_el0_svc+0x20/0x2c
[  246.448168]  el0_svc+0x40/0xf4
[  246.448442]  el0t_64_sync_handler+0x13c/0x158
[  246.448829]  el0t_64_sync+0x190/0x194
[  246.449158] Code: f9423c20 f90047e0 d2800000 f9404e60 (f9401c01)
[  246.449695] ---[ end trace 0000000000000000 ]---
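
A quick way to see which module a trace like the one above points at is to pull the function and module name out of the "pc :" line. This is only a sketch: it assumes the oops has been saved to a text file in the format shown here, and the helper name oops_culprit is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "function module" from the first "pc :" line
# of a saved oops log. Assumes the format "pc : func+0xOFF/0xLEN [module]".
# Prints nothing if the faulting function is built into the kernel
# (no "[module]" suffix on the line).
oops_culprit() {
    sed -n 's/.*pc : \([A-Za-z0-9_.]*\)+0x[0-9a-f]*\/0x[0-9a-f]* \[\([A-Za-z0-9_]*\)\].*/\1 \2/p' "$1" | head -n 1
}

# Usage: sudo dmesg > oops.txt; oops_culprit oops.txt
```

For the trace in this post, the "pc :" line names get_device_state in the ledtrig_netdev module.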

TIA for the help and for keeping this unsupported hardware alive!

Edited January 29, 2024 by mikeakers
Typo fix

snakekick · January 29, 2024

hello mikeakers,

You can test your memory with:

for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done

It is possible that the bootloader was also updated during the update to bookworm, and that can be a problem.

If this runs 5-6 times without errors, try limiting the CPU speed with armbian-config.

My system only runs "stable" at 600-1200 MHz.

Edited January 29, 2024 by snakekick

mikeakers · January 29, 2024

Memory seems good. I was able to run that command 6 times without an error

BipBip1981 · January 30, 2024

Hi,

Try these settings in armbian-config:

min freq: 1200 MHz

max freq: 1200 MHz

governor: performance

This is the most stable configuration on my Helios64.
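
To confirm that settings like these actually took effect, the standard cpufreq sysfs files can be read back. The sketch below is not part of armbian-config; the function name show_cpufreq and its optional path argument are assumptions added so it can be tried on a copy of the directory. Frequencies in sysfs are in kHz, so 1200 MHz shows up as 1200000.

```shell
#!/bin/sh
# Print governor and min/max frequency for every cpufreq policy.
# The optional argument (an alternate sysfs root) is hypothetical and
# exists only so the function can be tested away from a real board.
show_cpufreq() {
    root="${1:-/sys/devices/system/cpu/cpufreq}"
    for p in "$root"/policy*; do
        [ -d "$p" ] || continue    # skip if the glob matched nothing
        printf '%s: governor=%s min=%s max=%s\n' \
            "$(basename "$p")" \
            "$(cat "$p/scaling_governor")" \
            "$(cat "$p/scaling_min_freq")" \
            "$(cat "$p/scaling_max_freq")"
    done
}

# Usage: show_cpufreq
```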

mikeakers · January 30, 2024

I'll try that, BipBip1981, but this seems more like a logic error than system instability, especially since I was able to do the exact same thing with my previous OS install.

The other thing I'll probably try tonight is reverting the kernel to 5.10.63-rockchip64. That was stable in my old install.

BipBip1981 · January 30, 2024

OK, keep in touch, I am interested in your feedback.

mikeakers · January 31, 2024

Some more findings:

linux-6.6.8 still oopses at 1200 MHz...
Downgrading to linux 5.15.93 does seem to fix the issue though!
1. "sudo docker run hello-world" now works correctly without oopsing the kernel.
2. Still stable with max clock set to 1400 MHz.
3. So far so good, I'll report back if anything goes sideways once I get all my containers running again.

Edited January 31, 2024 by mikeakers

BipBip1981 · January 31, 2024

Hi, OK.

On my side I am on kernel 6.6.12 at 1200 MHz with the performance governor.

"podman run hello-world" works without problems, and my Helios64 has been running my freeze test pattern for 2 days without freezing so far.

Good night or day.

steki · February 4, 2024

Regarding my last message...

Just remove it from /etc/modules-load.d/modules.conf so that it is no longer loaded.

Leave only lm75 in there and that is it.

steki · February 5, 2024

It looks like my original message did not get through, as I posted it before creating an account,

so to make it clear, what I meant is:

remove from /etc/modules-load.d/modules.conf the line that says

"ledtrig_netdev", as that is the kernel module that triggers the kernel oops on our systems.

The best option is to follow the steps below, which avoid the need to reboot.

If you plan to reboot anyway, just remove that module from /etc/modules-load.d/modules.conf and that is it; there is no need to run these commands at all.

rmmod ledtrig_netdev

find /lib/modules/$(uname -r) -name ledtrig-netdev.ko -exec mv {} {}.backup ';'
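
For the "remove it from modules.conf" part, a minimal sketch of commenting the line out instead of deleting it is shown here. The function name disable_ledtrig_autoload is hypothetical; it keeps a .bak copy next to the file and matches both the ledtrig_netdev and ledtrig-netdev spellings, since either may appear in the file.

```shell
#!/bin/sh
# Hypothetical helper: comment out the ledtrig_netdev line in a
# modules-load.d style file, keeping the original as <file>.bak.
disable_ledtrig_autoload() {
    file="$1"
    cp "$file" "$file.bak" || return 1
    # Prefix the matching line with "# " (lines starting with # are ignored).
    sed 's/^[[:space:]]*ledtrig[_-]netdev[[:space:]]*$/# &/' "$file.bak" > "$file"
}

# Usage (as root): disable_ledtrig_autoload /etc/modules-load.d/modules.conf
```

Other lines, such as lm75, are left untouched.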

fri.K · March 20, 2024

Hi,
I had exactly the same issue with kernel 6.6.16-current-rockchip64 on Helios64, but commenting out ledtrig-netdev in /etc/modules-load.d/modules.conf fixed it. Thanks @steki
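
After a reboot, it may be worth checking that the module really is no longer loaded. The sketch below reads /proc/modules; the function name module_loaded and its optional second argument are assumptions made for illustration and testing.

```shell
#!/bin/sh
# Return success if the named module is listed in /proc/modules.
# The optional second argument (an alternate file) is hypothetical and
# only there so the function can be tried on sample data.
module_loaded() {
    # Loaded modules are listed with underscores, so normalise dashes.
    name=$(printf '%s' "$1" | tr '-' '_')
    grep -q "^$name " "${2:-/proc/modules}"
}

# Usage:
#   if module_loaded ledtrig_netdev; then echo "still loaded"; else echo "not loaded"; fi
```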