
Helios64 boots then crashes - kernel panic


Rmleonard


[  184.130515] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT SMP
[  184.131281] Modules linked in: rfkill lz4hc lz4 snd_soc_hdmi_codec snd_soc_rockchip_i2s rockchip_vdec(C) hantro_vpu v4l2_vp9 leds_pwm gpio_charger videobuf2_dma_contig pwm_fan rockchip_rga panfrost videobuf2_dma_sg v4l2_h264 gpu_sched v4l2_mem2mem snd_soc_core drm_shmem_helper snd_compress rockchip_rng rng_core snd_pcm_dmaengine videobuf2_memops snd_pcm videobuf2_v4l2 videobuf2_common snd_timer videodev snd soundcore mc zram binfmt_misc gpio_beeper cpufreq_dt ledtrig_netdev lm75 dm_mod ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod r8152 cdc_ncm cdc_ether usbnet realtek fusb302 tcpm typec dwmac_rk stmmac_platform stmmac pcs_xpcs adc_keys
[  184.137145] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G         C         6.1.36-rockchip64 #3
[  184.137890] Hardware name: Helios64 (DT)
[  184.138243] pstate: 000003c5 (nzcv DAIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  184.138866] pc : debug_smp_processor_id+0x28/0x2c
[  184.139305] lr : ct_nmi_enter+0x68/0x1a4
[  184.139663] sp : ffff800009cabb90
[  184.139963] x29: ffff800009cabb90 x28: ffff000000761e00 x27: 0000000000000000
[  184.140608] x26: ffff80000901a180 x25: ffff8000093d9e80 x24: 0000000000000001
[  184.141251] x23: ffff8000099cdc70 x22: ffff0000f777e8f0 x21: ffff8000094797a8
[  184.141893] x20: 0000000096000007 x19: ffff8000096f68f0 x18: ffff800010813c58
[  184.142534] x17: ffff8000ee088000 x16: ffff800009ca8000 x15: 0000000000000001
[  184.143176] x14: 0000000000000000 x13: 00000000000002da x12: 000000000041a201
[  184.143818] x11: 0000000000000040 x10: ffff000000404470 x9 : ffff000000404468
[  184.144459] x8 : ffff0000008004b8 x7 : 0000000000000000 x6 : 000000001b4110bc
[  184.145100] x5 : ffff800009cabc60 x4 : 0000000000010002 x3 : ffff8000096e7008
[  184.145741] x2 : ffff000000761e00 x1 : ffff800009415a68 x0 : 0000000000000004
[  184.146382] Call trace:
[  184.146606]  debug_smp_processor_id+0x28/0x2c
[  184.147004]  ct_irq_enter+0x10/0x1c
[  184.147326]  enter_from_kernel_mode+0x28/0x74
[  184.147720]  el1_abort+0x24/0x64
[  184.148016]  el1h_64_sync_handler+0xd8/0xe4
[  184.148397]  el1h_64_sync+0x64/0x68
[  184.148715]  update_curr+0x84/0x1fc
[  184.149040]  enqueue_entity+0x16c/0x32c
[  184.149387]  enqueue_task_fair+0x84/0x3e0
[  184.149749]  ttwu_do_activate+0x78/0x164
[  184.150106]  sched_ttwu_pending+0xec/0x1e0
[  184.150480]  __flush_smp_call_function_queue+0xec/0x254
[  184.150949]  generic_smp_call_function_single_interrupt+0x14/0x20
[  184.151494]  ipi_handler+0x90/0x350
[  184.151816]  handle_percpu_devid_irq+0xa4/0x230
[  184.152227]  generic_handle_domain_irq+0x2c/0x44
[  184.152648]  gic_handle_irq+0x50/0x130
[  184.152991]  call_on_irq_stack+0x24/0x4c
[  184.153349]  do_interrupt_handler+0xd4/0xe0
[  184.153730]  el1_interrupt+0x34/0x6c
[  184.154058]  el1h_64_irq_handler+0x18/0x2c
[  184.154430]  el1h_64_irq+0x64/0x68
[  184.154738]  arch_cpu_idle+0x18/0x2c
[  184.155065]  default_idle_call+0x38/0x17c
[  184.155428]  do_idle+0x23c/0x2b0
[  184.155727]  cpu_startup_entry+0x24/0x30
[  184.156085]  secondary_start_kernel+0x124/0x150
[  184.156496]  __secondary_switched+0xb0/0xb4
[  184.156882] Code: 9107c000 97ffffb0 a8c17bfd d50323bf (d65f03c0) 
[  184.157427] ---[ end trace 0000000000000000 ]---
[  184.157841] Kernel panic - not syncing: Oops - Undefined instruction: Fatal exception in interrupt
[  184.158632] SMP: stopping secondary CPUs
[  184.158993] Kernel Offset: disabled
[  184.159307] CPU features: 0x20000,20834084,0000421b
[  184.159745] Memory Limit: none
[  184.160030] ---[ end Kernel panic - not syncing: Oops - Undefined instruction: Fatal exception in interrupt ]---

These crashes are getting more and more frequent.
The CPU number in the dump varies, but it is mostly 4 or 5.
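
For anyone collecting several of these dumps, here is a minimal sketch of how the faulting CPU and function could be tallied across them. It assumes each dump is saved as plain text in the same format as the oops above; the names `summarize_oops` and `tally` are made up for this illustration and are not part of any Armbian tool.

```python
import re
from collections import Counter

# Matches the "CPU: 4 PID: 0 Comm: swapper/4" header line of an oops.
CPU_RE = re.compile(r"CPU:\s*(\d+)\s+PID:\s*(\d+)\s+Comm:\s*(\S+)")
# Matches the "pc : debug_smp_processor_id+0x28/0x2c" register line.
PC_RE = re.compile(r"\bpc\s*:\s*(\S+)")


def summarize_oops(text):
    """Return (cpu, comm, pc_symbol) for the first oops in text, or None."""
    cpu = CPU_RE.search(text)
    if cpu is None:
        return None
    pc = PC_RE.search(text)
    # Drop the "+offset/size" suffix so identical functions group together.
    symbol = pc.group(1).split("+")[0] if pc else None
    return int(cpu.group(1)), cpu.group(3), symbol


def tally(dumps):
    """Count how often each CPU and each faulting function appears."""
    cpus, symbols = Counter(), Counter()
    for text in dumps:
        info = summarize_oops(text)
        if info is None:
            continue  # not an oops, skip it
        cpus[info[0]] += 1
        symbols[info[2]] += 1
    return cpus, symbols
```

If one CPU number dominates the first counter while the function names in the second are scattered, that points more toward a particular core (or its voltage/clock) than toward one specific kernel bug.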

I have no idea as to where to start/what to do next - 

So I came here...

 

A friend at work says the boards should be cheap enough that I could just buy a new one.

I don't seem to be able to find a source for this particular board, though.
I think it is a custom "tinkerboard".
Overall this is (was) the Helios64 NAS,
which worked rock solid until it didn't.
I can't remember whether I updated the OS and then it started failing, or what.

At the moment it is running whatever was the latest release 2 weeks ago.

Above is the dump it leaves when it crashes.
I have the board out of the case and on my test desk.
When it does boot (if it boots at all), I shell into it, and if it stays up I run btop.

After some length of time I try to shell in for a second session and run armbian-config, but I don't often get that far.
Help?

 

If asked, I can provide the boot sequence text.

 

Rich Leonard


I had a similar problem (not quite the same, from what I remember): I was getting a kernel panic on Armbian 23.5, kernel 6.1.36, and the issue was triggered by the start of armbian-ramlog.service. Once the service was disabled, it booted every time. Disabling the service was discussed here as well:

 


This was "sort of" a solution.

I've edited both armbian-ramlog and armbian-zram-config so that they are essentially disabled (but still in the system, in case something I don't know about references them).

The edits are to the files in /etc/default/:

In armbian-ramlog:

ENABLED=false

In armbian-zram-config:

ENABLED=false

SWAP=false

Everything else is commented out.
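
To double-check that the edited files really contain what is described above, something like the following sketch could be used. It assumes the files are simple shell-style `KEY=value` lists; `parse_defaults` is a made-up helper name, and the two paths in the usage note are the standard Armbian locations as far as I know.

```python
def parse_defaults(text):
    """Parse shell-style KEY=value lines, ignoring blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # commented-out or empty lines do not count as set
        key, _, value = line.partition("=")
        # Remove optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def is_disabled(text):
    """True if the file explicitly sets ENABLED=false."""
    return parse_defaults(text).get("ENABLED", "").lower() == "false"
```

Usage would be along the lines of `is_disabled(open("/etc/default/armbian-ramlog").read())`, and the same for `/etc/default/armbian-zram-config`.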

 

Does this stop the random lockups? No, but they are MUCH less frequent, and more often than not I can now find log files with information.
I've also just gotten saving to and booting from eMMC to work, so I'm finally running from eMMC and not from the SD card.

I have the board out of the Helios box and I've been running CPU stress tests (s-tui + stress). It runs and cooks until it doesn't. When it quits there doesn't seem to be a "panic" or "system dump", and the error LED doesn't even blink. It just "stops".

 

I'm not convinced that this is the correct spot for this discussion and I'll keep looking for more info... and a better forum to post....

 

Thank You rumking for the direction you pointed me towards!!!

 

Rich
 

 

 


Our use cases may be different (I personally don't mind running old buster or bullseye, but I want kernel 6.1 because I use btrfs raid1).

It has been discussed that the CPU settings in armbian-config improve stability.

I am running:

- min CPU speed 408MHz

- max CPU speed 1.2GHz

- governor set on performance 

Stability improved DRAMATICALLY: the system now runs for several days, and I tested btrfs balance/replace without issues.
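
To confirm that settings like these are actually in effect after a reboot, here is a rough sketch that reads the kernel's cpufreq files in sysfs. sysfs reports frequencies in kHz, so 408MHz is 408000 and 1.2GHz is 1200000. The function names are made up for this example, and the default values simply mirror the list above.

```python
from pathlib import Path

FIELDS = ("scaling_governor", "scaling_min_freq", "scaling_max_freq")


def read_cpufreq(root="/sys/devices/system/cpu/cpufreq"):
    """Return {policy_name: {field: value}} for every cpufreq policy found."""
    report = {}
    for policy in sorted(Path(root).glob("policy*")):
        entry = {}
        for name in FIELDS:
            f = policy / name
            entry[name] = f.read_text().strip() if f.exists() else None
        report[policy.name] = entry
    return report


def settings_applied(report, governor="performance",
                     min_khz=408000, max_khz=1200000):
    """True if every policy uses the expected governor and frequency limits."""
    if not report:
        return False  # no cpufreq policies found at all
    return all(
        entry["scaling_governor"] == governor
        and entry["scaling_min_freq"] == str(min_khz)
        and entry["scaling_max_freq"] == str(max_khz)
        for entry in report.values()
    )
```

Running `settings_applied(read_cpufreq())` on the board should return `True` only if both CPU clusters picked up the limits.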

My system is:

- image Armbian_21.08.1_Helios64_bullseye_current_5.10.60.img.xz

- newer images occasionally freeze during the boot sequence; I don't mind an older system...

- apt install linux-image-edge-current - to get kernel 6.1; you could install the headers as well for DKMS

- I had SERIOUS network issues all the time - I gave up and bought a USB-to-Ethernet adapter; after that, no issues with the network

 

Hope this helps someone.

 


@Rmleonard I have similar crashes but have not found a simple reproducer.

When you say you are able to reproduce with the board out, do you mean without anything plugged into it except the power adapter and ethernet?

I was testing whether it is related to the PCIe/SATA part of the board, so if you are able to reproduce without any disks plugged into the board, that would help me rule out this option.

 

Also, you don't remember if you updated the OS. But do you remember if you ever updated before? And if so when was the last update you remember doing?


I've taken the whole bloody thing to pieces. I have the motherboard alone on my workbench; I feed it power, ethernet, and USB-C for console access, and it boots off of eMMC. So it is EXTREMELY unlikely that there is a ground/short/earthed connection. I use picocom to sign in to the console to watch the pre-boot. Everything else is via ssh.


I collected 10 crash dumps (over time) to see if it always died/crashed at the same memory location. Answer: no...
BUT, of this latest batch of 10 dump files, all were "swapper/4", CPU 4,
which makes me wish I could disable the 2-core sub-processor and run the system off the main 4 cores. If I still have crashes, flip the test: make the system run off the secondary CPU alone and see how it runs.

Just a thought.
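
One way this experiment could be tried without rebuilding anything is the kernel's CPU hotplug interface in sysfs. A minimal sketch follows, assuming CPUs 4 and 5 are the 2-core cluster (which matches the CPU numbers in the dumps), that the kernel has CPU hotplug enabled, and that the script runs as root. The function names are made up for this example, and the change does not persist across a reboot.

```python
from pathlib import Path


def set_cpu_online(cpu, online, root="/sys/devices/system/cpu"):
    """Write 1 or 0 to cpuN/online to bring a core up or take it down."""
    path = Path(root) / f"cpu{cpu}" / "online"
    if not path.exists():
        # Some cores (often cpu0) may not expose this file at all.
        raise FileNotFoundError(f"{path} not found; hotplug unavailable?")
    path.write_text("1" if online else "0")


def offline_cpus(cpus, root="/sys/devices/system/cpu"):
    """Take every CPU in the list offline."""
    for cpu in cpus:
        set_cpu_online(cpu, False, root)


def online_cpus(cpus, root="/sys/devices/system/cpu"):
    """Bring every CPU in the list back online."""
    for cpu in cpus:
        set_cpu_online(cpu, True, root)
```

Calling `offline_cpus([4, 5])` and then rerunning the stress test would cover the first half of the idea. The flipped test would be `offline_cpus([0, 1, 2, 3])`, but whether cpu0 can be taken offline on this board is something I have not verified.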
 

