dieKatze88 Posted March 21, 2021 I backed this early, and I have had nothing but stability problems since I built the thing. Sometimes my machine runs for as little as 6 minutes before crashing, and helpfully, it keeps clearing the systemd journal every time it starts up, so I can't even see what happened just before it crashed. It crashes running OMV and Syncthing under high load, it crashes doing absolutely nothing but watching the systemd journal. It crashes doing nothing at all. When it crashes, it corrupts my files, and often the OMV database, requiring me to CONSTANTLY reset the GUI password for OMV and then find that half of OMV isn't working. It does this on both uSD cards and on the built-in eMMC. I'm nearing my end with this thing. How can I do a full hardware test on it in a way that will say "Yes, this is working as expected" or "No, this is defective"?
Werner Posted March 22, 2021 Providing logs with armbianmonitor -u helps with troubleshooting and significantly raises the chances that the issue gets addressed.
gprovost Posted March 22, 2021 @dieKatze88 Which kernel version are you running? There has been a significant stability improvement with the latest one (version 5.10.21). Any chance you can keep the serial console open in the hope of catching something when it crashes?
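For the kernel question above, a minimal sketch of a version check. It assumes GNU sort with -V is available; kernel_at_least is a hypothetical helper name, not an Armbian tool.

```shell
#!/bin/sh
# Hypothetical helper: succeed if kernel release $1 is at least version $2.
# Only the leading numeric part is compared (5.10.21 from 5.10.21-rockchip64).
kernel_at_least() {
    have=$(printf '%s\n' "$1" | sed 's/[^0-9.].*$//')
    want=$2
    # sort -V orders version strings, so the older of the two comes first.
    oldest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$oldest" = "$want" ]
}

# Usage on the board (uname -r prints the running kernel release):
#   kernel_at_least "$(uname -r)" 5.10.21 && echo "kernel is 5.10.21 or newer"
```

The comparison ignores the -rockchip64 suffix on purpose, since only the upstream version number matters for the fix mentioned in the post.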
dieKatze88 Posted March 25, 2021 Author Posted March 25, 2021 I have disabled zram, as it was suggested by someone on Reddit. I am now running the latest kernel, but absolutely no kernel in my history of this thing has been stable. I got the following serial console the last time it crashed (But could not edit my post due to limits): [10105.431800] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: rcu_sched_clock_irq+0x7a4/0xce0 [10105.432752] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G C 5.10.21-rockchip64 #21.02.3 [10105.433526] Hardware name: Helios64 (DT) [10105.433872] Call trace: [10105.434093] dump_backtrace+0x0/0x200 [10105.434418] show_stack+0x18/0x68 [10105.434714] dump_stack+0xcc/0x124 [10105.435016] panic+0x174/0x374 [10105.435288] __stack_chk_fail+0x3c/0x40 [10105.435626] rcu_sched_clock_irq+0x7a4/0xce0 [10105.436004] update_process_times+0x60/0xa0 [10105.436373] tick_sched_handle.isra.19+0x40/0x58 [10105.436778] tick_sched_timer+0x58/0xb0 [10105.437118] __hrtimer_run_queues+0x104/0x388 [10105.437502] hrtimer_interrupt+0xf4/0x250 [10105.437861] arch_timer_handler_phys+0x30/0x40 [10105.438258] handle_percpu_devid_irq+0xa0/0x298 [10105.438659] generic_handle_irq+0x30/0x48 [10105.439012] __handle_domain_irq+0x94/0x108 [10105.439384] gic_handle_irq+0xc0/0x140 [10105.439715] el1_irq+0xc0/0x180 [10105.439995] arch_cpu_idle+0x18/0x28 [10105.440310] default_idle_call+0x44/0x1bc [10105.440665] do_idle+0x204/0x278 [10105.440950] cpu_startup_entry+0x28/0x60 [10105.441298] secondary_start_kernel+0x170/0x180 [10105.441700] SMP: stopping secondary CPUs [10105.442057] Kernel Offset: disabled [10105.442365] CPU features: 0x0240022,6100200c [10105.442740] Memory Limit: none [10105.443021] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: rcu_sched_clock_irq+0x7a4/0xce0 ]--- root@helios64:~# uname -a Linux helios64 5.10.21-rockchip64 #21.02.3 SMP PREEMPT Mon Mar 8 01:05:08 UTC 2021 aarch64 GNU/Linux 
root@helios64:~# my armbian monitor: http://ix.io/2U0J 2 Quote
gprovost Posted March 25, 2021 Sorry, I didn't see that you needed a like in order to remove the 1 message/day limitation. You still need to wait 12 hours for it to be lifted. Actually, I'm wondering whether the U-Boot you are using on the eMMC is the correct one. Could you post the early-stage boot log of your unit here? You will need the serial console connected before you switch on the board.
dieKatze88 Posted March 25, 2021 Author Posted March 25, 2021 It crashed again last night. I Have two blocks for you. Both of the Kernel Panics posted here have been from a SD boot with the emmc wiped. [21004.776415] Internal error: Oops: 96000004 [#1] PREEMPT SMP [21004.776915] Modules linked in: rfkill governor_performance snd_soc_hdmi_codec r8152 hantro_vpu(C) rockchip_vdec(C) rockchip_rga v4l2_h264 videobuf2_dma_contig v4l2_mem2mem snd_soc_rockchip_i2s videobuf2_dma_sg videobuf2_vmalloc panfrost videobuf2_memops rockchipdrm dw_mipi_dsi dw_hdmi leds_pwm analogix_dp pwm_fan snd_soc_core gpu_sched videobuf2_v4l2 gpio_charger videobuf2_common snd_pcm_dmaengine snd_pcm drm_kms_helper fusb302 snd_timer cec tcpm snd videodev rc_core soundcore typec mc drm drm_panel_orientation_quirks gpio_beeper cpufreq_dt ledtrig_netdev lm75 ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek dwmac_rk stmmac_platform stmmac pcs_xpcs adc_keys [21004.782817] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G C 5.10.21 -rockchip64 #21.02.3 [21004.783592] Hardware name: Helios64 (DT) [21004.783938] pstate: 40000085 (nZcv daIf -PAN -UAO -TCO BTYPE=--) [21004.784476] pc : rcu_sched_clock_irq+0x208/0xce0 [21004.784883] lr : rcu_sched_clock_irq+0x1f8/0xce0 [21004.785288] sp : ffff800011c13cd0 [21004.785580] x29: ffff800011c13cd0 x28: ffff800011952440 [21004.786049] x27: ffff8000118ba000 x26: ffff0000f77c8980 [21004.786518] x25: ffff800011580980 x24: ffff8000e6248000 [21004.786986] x23: 0000000000000000 x22: ffff8000118b9948 [21004.787454] x21: ffff800011b27ad8 x20: ffff0000f77c89f0 [21004.787921] x19: 0000000000000001 x18: 0000000000000000 [21004.788390] x17: 0000000000000000 x16: 0000000000000000 [21004.788858] x15: 0000000000000001 x14: 00000000000002d8 [21004.789326] x13: 00000001004efb7a x12: 000000000010e229 [21004.789794] x11: ffff8000118b7000 x10: ffff80001194ef28 [21004.790262] x9 : ffff80001194ef20 x8 
: ffff800011b72320 [21004.790730] x7 : ffff800011952000 x6 : 000000757b1fbc62 [21004.791197] x5 : d29eb8946b701b4f x4 : ffff8000e6248000 [21004.791665] x3 : 0000000000010001 x2 : ffff8000e6248000 [21004.792133] x1 : ffff0000f77c89f0 x0 : fffe800011952440 [21004.792602] Call trace: [21004.792822] rcu_sched_clock_irq+0x208/0xce0 [21004.793200] update_process_times+0x60/0xa0 [21004.793569] tick_sched_handle.isra.19+0x40/0x58 [21004.793974] tick_sched_timer+0x58/0xb0 [21004.794313] __hrtimer_run_queues+0x104/0x388 [21004.794697] hrtimer_interrupt+0xf4/0x250 [21004.795054] arch_timer_handler_phys+0x30/0x40 [21004.795447] handle_percpu_devid_irq+0xa0/0x298 [21004.795845] generic_handle_irq+0x30/0x48 [21004.796199] __handle_domain_irq+0x94/0x108 [21004.796570] gic_handle_irq+0xc0/0x140 [21004.796902] el1_irq+0xc0/0x180 [21004.797182] arch_cpu_idle+0x18/0x28 [21004.797498] default_idle_call+0x44/0x1bc [21004.797853] do_idle+0x204/0x278 [21004.798138] cpu_startup_entry+0x24/0x60 [21004.798486] secondary_start_kernel+0x170/0x180 [21004.798887] Code: 72001c1f 54fffda1 34fffcd3 f94033e0 (f9400401) [21004.799427] ---[ end trace 730e9802b6c79383 ]--- [21004.799833] Kernel panic - not syncing: Oops: Fatal exception in interrupt [21004.800436] SMP: stopping secondary CPUs [21004.800793] Kernel Offset: disabled [21004.801103] CPU features: 0x0240022,6100200c [21004.801477] Memory Limit: none [21004.801756] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]--- DDR Version 1.24 20191016 In channel 0 CS = 0 MR0=0x18 MR4=0x2 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0 channel 1 CS = 0 MR0=0x18 MR4=0x2 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0 channel 0 training pass! channel 1 training pass! 
change freq to 416MHz 0,1 Channel 0: LPDDR4,416MHz Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB Channel 1: LPDDR4,416MHz Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB 256B stride channel 0 CS = 0 MR0=0x18 MR4=0x2 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0 channel 1 CS = 0 MR0=0x18 MR4=0x2 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0 channel 0 training pass! channel 1 training pass! channel 0, cs 0, advanced training done channel 1, cs 0, advanced training done change freq to 856MHz 1,0 ch 0 ddrconfig = 0x101, ddrsize = 0x40 ch 1 ddrconfig = 0x101, ddrsize = 0x40 pmugrf_os_reg[2] = 0x32C1F2C1, stride = 0xD ddr_set_rate to 328MHZ ddr_set_rate to 666MHZ ddr_set_rate to 928MHZ channel 0, cs 0, advanced training done channel 1, cs 0, advanced training done ddr_set_rate to 416MHZ, ctl_index 0 ddr_set_rate to 856MHZ, ctl_index 1 support 416 856 328 666 928 MHz, current 856MHz OUT Boot1: 2019-03-14, version: 1.19 CPUId = 0x0 ChipType = 0x10, 254 SdmmcInit=2 0 BootCapSize=100000 UserCapSize=14910MB FwPartOffset=2000 , 100000 mmc0:cmd5,20 SdmmcInit=0 0 BootCapSize=0 UserCapSize=30436MB FwPartOffset=2000 , 0 StorageInit ok = 83460 SecureMode = 0 SecureInit read PBA: 0x4 SecureInit read PBA: 0x404 SecureInit read PBA: 0x804 SecureInit read PBA: 0xc04 SecureInit read PBA: 0x1004 SecureInit read PBA: 0x1404 SecureInit read PBA: 0x1804 SecureInit read PBA: 0x1c04 SecureInit ret = 0, SecureMode = 0 atags_set_bootdev: ret:(0) GPT 0x3380ec0 signature is wrong recovery gpt... GPT 0x3380ec0 signature is wrong recovery gpt fail! LoadTrust Addr:0x4000 No find bl30.bin No find bl32.bin Load uboot, ReadLba = 2000 Load OK, addr=0x200000, size=0xe5b60 RunBL31 0x40000 NOTICE: BL31: v1.3(debug):42583b6 NOTICE: BL31: Built : 07:55:13, Oct 15 2019 NOTICE: BL31: Rockchip release version: v1.1 INFO: GICv3 with legacy support detected. 
ARM GICV3 driver initialized in EL3 INFO: Using opteed sec cpu_context! INFO: boot cpu mask: 0 INFO: plat_rockchip_pmu_init(1190): pd status 3e INFO: BL31: Initializing runtime services WARNING: No OPTEE provided by BL2 boot loader, Booting device without OPTEE initialization. SMC`s destined for OPTEE will return SMC_UNK ERROR: Error initializing runtime service opteed_fast INFO: BL31: Preparing for EL3 exit to normal world INFO: Entry point address = 0x200000 INFO: SPSR = 0x3c9 U-Boot 2020.10-armbian (Mar 08 2021 - 14:54:58 +0000) SoC: Rockchip rk3399 Reset cause: POR DRAM: 3.9 GiB PMIC: RK808 SF: Detected w25q128 with page size 256 Bytes, erase size 4 KiB, total 16 MiB MMC: mmc@fe320000: 1, sdhci@fe330000: 0 Loading Environment from MMC... *** Warning - bad CRC, using default environment In: serial Out: serial Err: serial Model: Helios64 Revision: 1.2 - 4GB non ECC Net: eth0: ethernet@fe300000 scanning bus for devices... starting USB... Bus usb@fe380000: USB EHCI 1.00 Bus dwc3: usb maximum-speed not found Register 2000140 NbrPorts 2 Starting the controller USB XHCI 1.10 scanning bus usb@fe380000 for devices... 1 USB Device(s) found scanning bus dwc3 for devices... cannot reset port 4!? 4 USB Device(s) found scanning usb for storage devices... 0 Storage Device(s) found Hit any key to stop autoboot: 0 switch to partitions #0, OK mmc1 is current device Scanning mmc 1:1... Found U-Boot script /boot/boot.scr 3185 bytes read in 9 ms (344.7 KiB/s) ## Executing script at 00500000 Boot script loaded from mmc 1 166 bytes read in 12 ms (12.7 KiB/s) 13851809 bytes read in 606 ms (21.8 MiB/s) 28582400 bytes read in 1214 ms (22.5 MiB/s) 81913 bytes read in 16 ms (4.9 MiB/s) 2698 bytes read in 13 ms (202.1 KiB/s) Applying kernel provided DT fixup script (rockchip-fixup.scr) ## Executing script at 09000000 Moving Image from 0x2080000 to 0x2200000, end=3de0000 ## Loading init Ramdisk from Legacy Image at 06000000 ... 
Image Name: uInitrd Image Type: AArch64 Linux RAMDisk Image (gzip compressed) Data Size: 13851745 Bytes = 13.2 MiB Load Address: 00000000 Entry Point: 00000000 Verifying Checksum ... OK ## Flattened Device Tree blob at 01f00000 Booting using the fdt blob at 0x1f00000 Loading Ramdisk to f51b9000, end f5eeec61 ... OK Loading Device Tree to 00000000f513c000, end 00000000f51b8fff ... OK Starting kernel ... 0 Quote
gprovost Posted March 26, 2021 OK, that's the correct U-Boot. Hmm, it still seems to be a stability issue related to the scheduler / DFS. Have you tried the performance governor?
armbian-config > System > CPU
Minimum CPU speed = 1200000
Maximum CPU speed = 1200000
CPU governor = performance
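The same settings can also be applied for the current boot through the kernel's cpufreq sysfs files. This is only a sketch, not what armbian-config does internally: pin_cpufreq and the CPUFREQ_ROOT override are hypothetical names, it needs root, and the writes do not persist across a reboot.

```shell
#!/bin/sh
# Standard sysfs location of the cpufreq policies; overridable for dry runs.
CPUFREQ_ROOT=${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpufreq}

# Hypothetical helper: pin every cpufreq policy to one frequency (kHz) and governor.
pin_cpufreq() {
    freq_khz=$1
    governor=$2
    for policy in "$CPUFREQ_ROOT"/policy*; do
        [ -d "$policy" ] || continue
        echo "$governor" > "$policy/scaling_governor"
        # Lower the maximum first, then raise the minimum; this order suits
        # a target that lies inside the currently allowed range.
        echo "$freq_khz" > "$policy/scaling_max_freq"
        echo "$freq_khz" > "$policy/scaling_min_freq"
    done
}

# Usage (as root), matching the values from the post:
#   pin_cpufreq 1200000 performance
```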
dieKatze88 Posted March 28, 2021 Author Posted March 28, 2021 I have set those settings tonight and we'll see if it crashes by morning. 0 Quote
dieKatze88 Posted March 28, 2021 Author Posted March 28, 2021 OK After 13 hours we're still up (Even with a light load of sending massive pings on the 2.5g interface to my desktop) I'm going to give it one more day before I call it good and try reinstalling to the internal flash again. 0 Quote
dieKatze88 Posted March 29, 2021 Author I've gone ahead and reinstalled to the internal flash, set up a more minimal system (not using OMV), and am monitoring it for failures. Any reason why some units are only stable at 1.2 GHz?
gprovost Posted March 29, 2021 @dieKatze88 You can set the highest frequency and the outcome will most likely be the same. The issue is not the frequency itself; it is Dynamic Frequency Scaling (DFS), which constantly changes the CPU frequency and seems to create some instability. I recommended 1.2 GHz just to ensure the system runs cool, thereby minimizing fan noise.
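Since the point is to stop the frequency from changing at all, it can be worth confirming that no policy has room left to scale. A minimal sketch, assuming the standard cpufreq sysfs layout; cpufreq_is_pinned is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print each cpufreq policy and succeed only if every one
# has scaling_min_freq equal to scaling_max_freq (no room for frequency changes).
cpufreq_is_pinned() {
    root=${1:-/sys/devices/system/cpu/cpufreq}
    found=0
    for policy in "$root"/policy*; do
        [ -d "$policy" ] || continue
        found=1
        min=$(cat "$policy/scaling_min_freq")
        max=$(cat "$policy/scaling_max_freq")
        echo "${policy##*/}: governor=$(cat "$policy/scaling_governor") min=$min max=$max"
        [ "$min" = "$max" ] || return 1
    done
    # Fail as well if no policy directory was found at all.
    [ "$found" -eq 1 ]
}

# Usage on the board:
#   cpufreq_is_pinned && echo "frequency is fixed" || echo "DFS can still change the frequency"
```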
dieKatze88 Posted March 30, 2021 Author Posted March 30, 2021 It didn't stay as stable as we thought. Unfortunately the serial console failed at some point. I'll reconnect with it and try to keep it up again to see if I can catch it crashing again. At least it lasted about 30 hours this time. 0 Quote
clostro Posted March 30, 2021 May I suggest outputting dmesg live to a network location? I'm not sure if the serial console output is the same as 'dmesg', but if it is, you can run it in the background with 'nohup ... &' and send it to any file. That way you wouldn't have to stay connected to the console or SSH all the time. Just don't output it to a local file system, as writing to a local file system during a crash might corrupt it and cause more problems.
nohup dmesg --follow > /network/location/folder/helios64-log.txt 2>&1 &
exit
It needed a single >, and apparently you have to leave the session with 'exit'.
Edited March 30, 2021 by clostro: edited the command
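The suggestion above can be wrapped so that each capture gets its own timestamped file and a wrong target directory is caught before logging starts. This is a sketch only: capture_log_path and start_dmesg_capture are hypothetical helper names, and the network path in the usage line is the placeholder from the post.

```shell
#!/bin/sh
# Hypothetical helper: build a per-run log file name inside directory $1.
capture_log_path() {
    echo "$1/$(uname -n)-dmesg-$(date +%Y%m%d-%H%M%S).log"
}

# Hypothetical helper: start a background kernel-log capture that survives logout.
start_dmesg_capture() {
    dir=$1
    [ -d "$dir" ] && [ -w "$dir" ] || { echo "not a writable directory: $dir" >&2; return 1; }
    logfile=$(capture_log_path "$dir")
    # Redirections go before the trailing '&'; 2>&1 sends stderr to the same file.
    nohup dmesg --follow > "$logfile" 2>&1 &
    echo "$logfile"
}

# Usage:
#   start_dmesg_capture /network/location/folder
```

Whether the last lines before a panic actually reach the network share depends on how quickly the kernel stops, so the serial console remains the more reliable source.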
dieKatze88 Posted March 30, 2021 Author Posted March 30, 2021 I did manage to catch the output for the 3rd crash yesterday. [22793.372295] Internal error: Oops: 96000004 [#1] PREEMPT SMP [22793.372795] Modules linked in: governor_performance rfkill zram snd_soc_hdmi_ codec r8152 leds_pwm gpio_charger pwm_fan snd_soc_rockchip_i2s snd_soc_core snd_ pcm_dmaengine hantro_vpu(C) snd_pcm rockchip_vdec(C) rockchip_rga snd_timer vide obuf2_dma_sg v4l2_h264 videobuf2_dma_contig videobuf2_vmalloc panfrost v4l2_mem2 mem gpu_sched videobuf2_memops snd videobuf2_v4l2 videobuf2_common fusb302 sound core tcpm rockchipdrm videodev typec mc dw_mipi_dsi dw_hdmi analogix_dp drm_kms_ helper cec sg rc_core drm drm_panel_orientation_quirks gpio_beeper cpufreq_dt le dtrig_netdev lm75 ip_tables x_tables autofs4 raid10 raid1 raid0 multipath linear dm_mirror dm_region_hash dm_log raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx realtek dm_mod md_mod dwmac_rk stmmac_platform stmmac pcs_xp cs adc_keys [22793.379068] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G C 5.10.21 -rockchip64 #21.02.3 [22793.379844] Hardware name: Helios64 (DT) [22793.380191] pstate: 40000085 (nZcv daIf -PAN -UAO -TCO BTYPE=--) [22793.380728] pc : rcu_sched_clock_irq+0x208/0xce0 [22793.381134] lr : rcu_sched_clock_irq+0x1f8/0xce0 [22793.381539] sp : ffff800011c13cd0 [22793.381832] x29: ffff800011c13cd0 x28: ffff800011952440 [22793.382301] x27: ffff8000118ba000 x26: ffff0000f77c8980 [22793.382769] x25: ffff800011580980 x24: ffff8000e6248000 [22793.383237] x23: 0000000000000000 x22: ffff8000118b9948 [22793.383705] x21: ffff800011b27ad8 x20: ffff0000f77c89f0 [22793.384173] x19: 0000000000000001 x18: 0000000000000000 [22793.384641] x17: 0000000000000000 x16: 0000000000000000 [22793.385109] x15: 0000002d01e1f6ac x14: 000000000000006a [22793.385577] x13: 000000010055ce03 x12: 00000000000ab681 [22793.386045] x11: ffff8000118b7000 x10: ffff80001194ef28 [22793.386513] x9 : ffff80001194ef20 x8 : ffff800011b72320 
[22793.386981] x7 : ffff800011952000 x6 : 0000007f7ced25ad [22793.387449] x5 : 7ab3901a5062db37 x4 : ffff8000e6248000 [22793.387917] x3 : 0000000000010001 x2 : ffff8000e6248000 [22793.388385] x1 : ffff0000f77c89f0 x0 : fffe800011952440 [22793.388854] Call trace: [22793.389076] rcu_sched_clock_irq+0x208/0xce0 [22793.389454] update_process_times+0x60/0xa0 [22793.389825] tick_sched_handle.isra.19+0x40/0x58 [22793.390231] tick_sched_timer+0x58/0xb0 [22793.390572] __hrtimer_run_queues+0x104/0x388 [22793.390956] hrtimer_interrupt+0xf4/0x250 [22793.391311] arch_timer_handler_phys+0x30/0x40 [22793.391704] handle_percpu_devid_irq+0xa0/0x298 [22793.392103] generic_handle_irq+0x30/0x48 [22793.392456] __handle_domain_irq+0x94/0x108 [22793.392827] gic_handle_irq+0xc0/0x140 [22793.393159] el1_irq+0xc0/0x180 [22793.393440] arch_cpu_idle+0x18/0x28 [22793.393757] default_idle_call+0x44/0x1bc [22793.394111] do_idle+0x204/0x278 [22793.394397] cpu_startup_entry+0x24/0x60 [22793.394745] secondary_start_kernel+0x170/0x180 [22793.395147] Code: 72001c1f 54fffda1 34fffcd3 f94033e0 (f9400401) [22793.395690] ---[ end trace a14f0598db2feff1 ]--- [22793.396097] Kernel panic - not syncing: Oops: Fatal exception in interrupt [22793.396700] SMP: stopping secondary CPUs [22793.397053] Kernel Offset: disabled [22793.397361] CPU features: 0x0240022,6100200c [22793.397736] Memory Limit: none [22793.398014] ---[ end Kernel panic - not syncing: Oops: Fatal exception in int errupt ]--- 0 Quote
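When several crashes have been captured to files, it helps to see at a glance whether they fault in the same place. A rough sketch that counts the function named on each "pc :" line or in a stack-protector message; faulting_functions is a hypothetical helper name and the log file names in the usage line are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: count faulting functions in a captured console log.
# It looks at "pc : <function>" lines and at stack-protector messages
# ("corrupted in: <function>"), skipping the repeated "end Kernel panic" line.
faulting_functions() {
    grep -v 'end Kernel panic' "$1" \
        | grep -oE '(pc : |corrupted in: )[A-Za-z0-9_.]+' \
        | sed 's/.*: //' \
        | sort | uniq -c
}

# Usage, with all captured crashes concatenated into one file:
#   cat crash1.log crash2.log crash3.log > all-crashes.log
#   faulting_functions all-crashes.log
```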