How to do a full hardware test?


dieKatze88

I backed this early, and I have had nothing but stability problems since I built the thing. Sometimes my machine runs for as little as 6 minutes before crashing, and helpfully, it keeps clearing the systemd journal every time it starts up, so I can't even see what happened just before it crashed. It crashes running OMV and Syncthing under high load, and it crashes doing absolutely nothing but watching the systemd journal. It crashes doing nothing at all.

When it crashes, it corrupts my files and often the OMV database, so I CONSTANTLY have to reset the GUI password for OMV, only to find that half of OMV isn't working. It does this both on uSD cards and on the built-in eMMC.

I'm nearing my end with this thing. How can I do a full hardware test on it in a way that will say "Yes, this is working as expected" or "No, this is defective"?


I have disabled zram, as it was suggested by someone on Reddit.

 

I am now running the latest kernel, but no kernel I have ever run on this thing has been stable.

I got the following serial console output the last time it crashed (but could not edit my post due to limits):
 

[10105.431800] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: rcu_sched_clock_irq+0x7a4/0xce0
[10105.432752] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G         C        5.10.21-rockchip64 #21.02.3
[10105.433526] Hardware name: Helios64 (DT)
[10105.433872] Call trace:
[10105.434093]  dump_backtrace+0x0/0x200
[10105.434418]  show_stack+0x18/0x68
[10105.434714]  dump_stack+0xcc/0x124
[10105.435016]  panic+0x174/0x374
[10105.435288]  __stack_chk_fail+0x3c/0x40
[10105.435626]  rcu_sched_clock_irq+0x7a4/0xce0
[10105.436004]  update_process_times+0x60/0xa0
[10105.436373]  tick_sched_handle.isra.19+0x40/0x58
[10105.436778]  tick_sched_timer+0x58/0xb0
[10105.437118]  __hrtimer_run_queues+0x104/0x388
[10105.437502]  hrtimer_interrupt+0xf4/0x250
[10105.437861]  arch_timer_handler_phys+0x30/0x40
[10105.438258]  handle_percpu_devid_irq+0xa0/0x298
[10105.438659]  generic_handle_irq+0x30/0x48
[10105.439012]  __handle_domain_irq+0x94/0x108
[10105.439384]  gic_handle_irq+0xc0/0x140
[10105.439715]  el1_irq+0xc0/0x180
[10105.439995]  arch_cpu_idle+0x18/0x28
[10105.440310]  default_idle_call+0x44/0x1bc
[10105.440665]  do_idle+0x204/0x278
[10105.440950]  cpu_startup_entry+0x28/0x60
[10105.441298]  secondary_start_kernel+0x170/0x180
[10105.441700] SMP: stopping secondary CPUs
[10105.442057] Kernel Offset: disabled
[10105.442365] CPU features: 0x0240022,6100200c
[10105.442740] Memory Limit: none
[10105.443021] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: rcu_sched_clock_irq+0x7a4/0xce0 ]---


root@helios64:~# uname -a
Linux helios64 5.10.21-rockchip64 #21.02.3 SMP PREEMPT Mon Mar 8 01:05:08 UTC 2021 aarch64 GNU/Linux
root@helios64:~#

 

My armbianmonitor output:
http://ix.io/2U0J


Sorry, I didn't see that you needed a like in order to remove the 1 msg/day limitation. There is still a 12-hour wait before it is lifted.

 

Actually, I'm wondering if the U-Boot you are using on eMMC is the correct one.

 

Could you post the early-stage boot log of your unit here? You will need the serial console connected before you switch on the board.


It crashed again last night. I have two blocks for you. Both of the kernel panics posted here are from an SD boot with the eMMC wiped.

[21004.776415] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[21004.776915] Modules linked in: rfkill governor_performance snd_soc_hdmi_codec r8152 hantro_vpu(C) rockchip_vdec(C) rockchip_rga v4l2_h264 videobuf2_dma_contig v4l2_mem2mem snd_soc_rockchip_i2s videobuf2_dma_sg videobuf2_vmalloc panfrost videobuf2_memops rockchipdrm dw_mipi_dsi dw_hdmi leds_pwm analogix_dp pwm_fan snd_soc_core gpu_sched videobuf2_v4l2 gpio_charger videobuf2_common snd_pcm_dmaengine snd_pcm drm_kms_helper fusb302 snd_timer cec tcpm snd videodev rc_core soundcore typec mc drm drm_panel_orientation_quirks gpio_beeper cpufreq_dt ledtrig_netdev lm75 ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek dwmac_rk stmmac_platform stmmac pcs_xpcs adc_keys
[21004.782817] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G         C        5.10.21-rockchip64 #21.02.3
[21004.783592] Hardware name: Helios64 (DT)
[21004.783938] pstate: 40000085 (nZcv daIf -PAN -UAO -TCO BTYPE=--)
[21004.784476] pc : rcu_sched_clock_irq+0x208/0xce0
[21004.784883] lr : rcu_sched_clock_irq+0x1f8/0xce0
[21004.785288] sp : ffff800011c13cd0
[21004.785580] x29: ffff800011c13cd0 x28: ffff800011952440
[21004.786049] x27: ffff8000118ba000 x26: ffff0000f77c8980
[21004.786518] x25: ffff800011580980 x24: ffff8000e6248000
[21004.786986] x23: 0000000000000000 x22: ffff8000118b9948
[21004.787454] x21: ffff800011b27ad8 x20: ffff0000f77c89f0
[21004.787921] x19: 0000000000000001 x18: 0000000000000000
[21004.788390] x17: 0000000000000000 x16: 0000000000000000
[21004.788858] x15: 0000000000000001 x14: 00000000000002d8
[21004.789326] x13: 00000001004efb7a x12: 000000000010e229
[21004.789794] x11: ffff8000118b7000 x10: ffff80001194ef28
[21004.790262] x9 : ffff80001194ef20 x8 : ffff800011b72320
[21004.790730] x7 : ffff800011952000 x6 : 000000757b1fbc62
[21004.791197] x5 : d29eb8946b701b4f x4 : ffff8000e6248000
[21004.791665] x3 : 0000000000010001 x2 : ffff8000e6248000
[21004.792133] x1 : ffff0000f77c89f0 x0 : fffe800011952440
[21004.792602] Call trace:
[21004.792822]  rcu_sched_clock_irq+0x208/0xce0
[21004.793200]  update_process_times+0x60/0xa0
[21004.793569]  tick_sched_handle.isra.19+0x40/0x58
[21004.793974]  tick_sched_timer+0x58/0xb0
[21004.794313]  __hrtimer_run_queues+0x104/0x388
[21004.794697]  hrtimer_interrupt+0xf4/0x250
[21004.795054]  arch_timer_handler_phys+0x30/0x40
[21004.795447]  handle_percpu_devid_irq+0xa0/0x298
[21004.795845]  generic_handle_irq+0x30/0x48
[21004.796199]  __handle_domain_irq+0x94/0x108
[21004.796570]  gic_handle_irq+0xc0/0x140
[21004.796902]  el1_irq+0xc0/0x180
[21004.797182]  arch_cpu_idle+0x18/0x28
[21004.797498]  default_idle_call+0x44/0x1bc
[21004.797853]  do_idle+0x204/0x278
[21004.798138]  cpu_startup_entry+0x24/0x60
[21004.798486]  secondary_start_kernel+0x170/0x180
[21004.798887] Code: 72001c1f 54fffda1 34fffcd3 f94033e0 (f9400401)
[21004.799427] ---[ end trace 730e9802b6c79383 ]---
[21004.799833] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[21004.800436] SMP: stopping secondary CPUs
[21004.800793] Kernel Offset: disabled
[21004.801103] CPU features: 0x0240022,6100200c
[21004.801477] Memory Limit: none
[21004.801756] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---


 

DDR Version 1.24 20191016
In
channel 0
CS = 0
MR0=0x18
MR4=0x2
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 1
CS = 0
MR0=0x18
MR4=0x2
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 0 training pass!
channel 1 training pass!
change freq to 416MHz 0,1
Channel 0: LPDDR4,416MHz
Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
Channel 1: LPDDR4,416MHz
Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
256B stride
channel 0
CS = 0
MR0=0x18
MR4=0x2
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 1
CS = 0
MR0=0x18
MR4=0x2
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 0 training pass!
channel 1 training pass!
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
change freq to 856MHz 1,0
ch 0 ddrconfig = 0x101, ddrsize = 0x40
ch 1 ddrconfig = 0x101, ddrsize = 0x40
pmugrf_os_reg[2] = 0x32C1F2C1, stride = 0xD
ddr_set_rate to 328MHZ
ddr_set_rate to 666MHZ
ddr_set_rate to 928MHZ
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
ddr_set_rate to 416MHZ, ctl_index 0
ddr_set_rate to 856MHZ, ctl_index 1
support 416 856 328 666 928 MHz, current 856MHz
OUT
Boot1: 2019-03-14, version: 1.19
CPUId = 0x0
ChipType = 0x10, 254
SdmmcInit=2 0
BootCapSize=100000
UserCapSize=14910MB
FwPartOffset=2000 , 100000
mmc0:cmd5,20
SdmmcInit=0 0
BootCapSize=0
UserCapSize=30436MB
FwPartOffset=2000 , 0
StorageInit ok = 83460
SecureMode = 0
SecureInit read PBA: 0x4
SecureInit read PBA: 0x404
SecureInit read PBA: 0x804
SecureInit read PBA: 0xc04
SecureInit read PBA: 0x1004
SecureInit read PBA: 0x1404
SecureInit read PBA: 0x1804
SecureInit read PBA: 0x1c04
SecureInit ret = 0, SecureMode = 0
atags_set_bootdev: ret:(0)
GPT 0x3380ec0 signature is wrong
recovery gpt...
GPT 0x3380ec0 signature is wrong
recovery gpt fail!
LoadTrust Addr:0x4000
No find bl30.bin
No find bl32.bin
Load uboot, ReadLba = 2000
Load OK, addr=0x200000, size=0xe5b60
RunBL31 0x40000
NOTICE:  BL31: v1.3(debug):42583b6
NOTICE:  BL31: Built : 07:55:13, Oct 15 2019
NOTICE:  BL31: Rockchip release version: v1.1
INFO:    GICv3 with legacy support detected. ARM GICV3 driver initialized in EL3
INFO:    Using opteed sec cpu_context!
INFO:    boot cpu mask: 0
INFO:    plat_rockchip_pmu_init(1190): pd status 3e
INFO:    BL31: Initializing runtime services
WARNING: No OPTEE provided by BL2 boot loader, Booting device without OPTEE initialization. SMC`s destined for OPTEE will return SMC_UNK
ERROR:   Error initializing runtime service opteed_fast
INFO:    BL31: Preparing for EL3 exit to normal world
INFO:    Entry point address = 0x200000
INFO:    SPSR = 0x3c9


U-Boot 2020.10-armbian (Mar 08 2021 - 14:54:58 +0000)

SoC: Rockchip rk3399
Reset cause: POR
DRAM:  3.9 GiB
PMIC:  RK808
SF: Detected w25q128 with page size 256 Bytes, erase size 4 KiB, total 16 MiB
MMC:   mmc@fe320000: 1, sdhci@fe330000: 0
Loading Environment from MMC... *** Warning - bad CRC, using default environment

In:    serial
Out:   serial
Err:   serial
Model: Helios64
Revision: 1.2 - 4GB non ECC
Net:   eth0: ethernet@fe300000
scanning bus for devices...
starting USB...
Bus usb@fe380000: USB EHCI 1.00
Bus dwc3: usb maximum-speed not found
Register 2000140 NbrPorts 2
Starting the controller
USB XHCI 1.10
scanning bus usb@fe380000 for devices... 1 USB Device(s) found
scanning bus dwc3 for devices... cannot reset port 4!?
4 USB Device(s) found
       scanning usb for storage devices... 0 Storage Device(s) found
Hit any key to stop autoboot:  0
switch to partitions #0, OK
mmc1 is current device
Scanning mmc 1:1...
Found U-Boot script /boot/boot.scr
3185 bytes read in 9 ms (344.7 KiB/s)
## Executing script at 00500000
Boot script loaded from mmc 1
166 bytes read in 12 ms (12.7 KiB/s)
13851809 bytes read in 606 ms (21.8 MiB/s)
28582400 bytes read in 1214 ms (22.5 MiB/s)
81913 bytes read in 16 ms (4.9 MiB/s)
2698 bytes read in 13 ms (202.1 KiB/s)
Applying kernel provided DT fixup script (rockchip-fixup.scr)
## Executing script at 09000000
Moving Image from 0x2080000 to 0x2200000, end=3de0000
## Loading init Ramdisk from Legacy Image at 06000000 ...
   Image Name:   uInitrd
   Image Type:   AArch64 Linux RAMDisk Image (gzip compressed)
   Data Size:    13851745 Bytes = 13.2 MiB
   Load Address: 00000000
   Entry Point:  00000000
   Verifying Checksum ... OK
## Flattened Device Tree blob at 01f00000
   Booting using the fdt blob at 0x1f00000
   Loading Ramdisk to f51b9000, end f5eeec61 ... OK
   Loading Device Tree to 00000000f513c000, end 00000000f51b8fff ... OK

Starting kernel ...

 


Ok that's the correct U-Boot.

 

Hmm, this still seems to be a stability issue related to the scheduler / DFS.

 

Have you tried the performance governor?

 

armbian-config > System > CPU

 

Minimum CPU speed = 1200000

Maximum CPU speed = 1200000

CPU governor = performance
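
For reference, the same three settings can also be applied at runtime through the kernel's cpufreq sysfs interface. The following is only a sketch, not what armbian-config does internally: `pin_cpufreq` and `CPUFREQ_ROOT` are names made up for this example, it has to run as root, it assumes the target frequency lies inside each policy's current min..max range, and the values do not survive a reboot.

```shell
#!/bin/sh
# Sketch: pin every cpufreq policy to a single frequency with the
# performance governor. CPUFREQ_ROOT is a hypothetical override so the
# function can be tried against a scratch directory first.
CPUFREQ_ROOT="${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpufreq}"

pin_cpufreq() {
    pin_khz="$1"    # frequency in kHz, e.g. 1200000
    for pin_policy in "$CPUFREQ_ROOT"/policy*; do
        [ -d "$pin_policy" ] || continue
        echo performance > "$pin_policy/scaling_governor"
        # Assumes pin_khz is between the current min and max of this policy.
        echo "$pin_khz" > "$pin_policy/scaling_max_freq"
        echo "$pin_khz" > "$pin_policy/scaling_min_freq"
    done
}
```

Running `pin_cpufreq 1200000` as root would apply the values above to every policy directory that exists on the board.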

 


OK, after 13 hours we're still up (even with a light load of sending massive pings on the 2.5G interface to my desktop).

I'm going to give it one more day before I call it good and try reinstalling to the internal flash again.


@dieKatze88 You can set the highest frequency and the outcome will most likely be the same. The issue is not the frequency itself; it is the Dynamic Frequency Scaling (DFS), which constantly changes the CPU frequency and seems to create some instability.

I recommended 1.2 GHz just to ensure the system runs cool, therefore minimizing fan noise.
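
A quick way to confirm that frequency scaling really is out of the picture is to check that each cpufreq policy reports the same minimum and maximum. This is a minimal sketch, assuming the standard cpufreq sysfs layout; `report_cpufreq` and `CPUFREQ_ROOT` are hypothetical names used only here.

```shell
#!/bin/sh
# Sketch: print governor/min/max per cpufreq policy and return non-zero
# if any policy still has room to change frequency (min != max).
CPUFREQ_ROOT="${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpufreq}"

report_cpufreq() {
    rep_status=0
    for rep_policy in "$CPUFREQ_ROOT"/policy*; do
        [ -d "$rep_policy" ] || continue
        rep_gov=$(cat "$rep_policy/scaling_governor")
        rep_min=$(cat "$rep_policy/scaling_min_freq")
        rep_max=$(cat "$rep_policy/scaling_max_freq")
        echo "${rep_policy##*/}: governor=$rep_gov min=$rep_min max=$rep_max"
        # With min == max there is only one frequency left to select.
        [ "$rep_min" = "$rep_max" ] || rep_status=1
    done
    return "$rep_status"
}
```

Something like `report_cpufreq || echo "scaling still possible"` could be run after each reboot, since a setting that was lost on the way would otherwise go unnoticed.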


It didn't stay as stable as we thought. Unfortunately the serial console failed at some point. I'll reconnect with it and try to keep it up again to see if I can catch it crashing again. At least it lasted about 30 hours this time.


May I suggest outputting dmesg live to a network location?

I'm not sure if the serial console output is the same as 'dmesg', but if it is, you can run it in the background with 'nohup ... &' and write its live output to any file. That way you wouldn't have to stay connected to the console or ssh all the time. Just don't write it to a local file system, as writing to a local file system during a crash might corrupt it and cause more problems.

 

nohup dmesg --follow > /network/location/folder/helios64-log.txt 2>&1 &

exit

 

It needed a single >, and apparently you have to leave the session with 'exit'.
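
The idea above could be wrapped in a small script that refuses to write to a local file system. This is only a sketch under stated assumptions: `is_network_fstype` and `start_dmesg_log` are hypothetical helper names, the list of file system types is a guess and not exhaustive, and it relies on `findmnt` from util-linux being installed. It uses a timestamped file name so that a restart does not overwrite the log from the previous boot.

```shell
#!/bin/sh
# Sketch: follow the kernel log into a file on a network mount.

# Return 0 if the given file system type looks like a network file system.
# The list is an assumption; extend it for your own setup.
is_network_fstype() {
    case "$1" in
        nfs|nfs4|cifs|smb3|fuse.sshfs) return 0 ;;
        *) return 1 ;;
    esac
}

start_dmesg_log() {
    log_dir="$1"
    log_fstype=$(findmnt -n -o FSTYPE --target "$log_dir") || return 1
    if ! is_network_fstype "$log_fstype"; then
        echo "refusing: $log_dir is on $log_fstype, not a network mount" >&2
        return 1
    fi
    log_file="$log_dir/helios64-$(date +%Y%m%d-%H%M%S).txt"
    # Redirections go before the trailing &, which backgrounds the command.
    nohup dmesg --follow > "$log_file" 2>&1 &
    echo "logging to $log_file"
}
```

It would be started as root with something like `start_dmesg_log /network/location/folder`, using the placeholder path from the command above.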

Edited by clostro
edited the command

I did manage to catch the output for the 3rd crash yesterday.

 

[22793.372295] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[22793.372795] Modules linked in: governor_performance rfkill zram snd_soc_hdmi_codec r8152 leds_pwm gpio_charger pwm_fan snd_soc_rockchip_i2s snd_soc_core snd_pcm_dmaengine hantro_vpu(C) snd_pcm rockchip_vdec(C) rockchip_rga snd_timer videobuf2_dma_sg v4l2_h264 videobuf2_dma_contig videobuf2_vmalloc panfrost v4l2_mem2mem gpu_sched videobuf2_memops snd videobuf2_v4l2 videobuf2_common fusb302 soundcore tcpm rockchipdrm videodev typec mc dw_mipi_dsi dw_hdmi analogix_dp drm_kms_helper cec sg rc_core drm drm_panel_orientation_quirks gpio_beeper cpufreq_dt ledtrig_netdev lm75 ip_tables x_tables autofs4 raid10 raid1 raid0 multipath linear dm_mirror dm_region_hash dm_log raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx realtek dm_mod md_mod dwmac_rk stmmac_platform stmmac pcs_xpcs adc_keys
[22793.379068] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G         C        5.10.21-rockchip64 #21.02.3
[22793.379844] Hardware name: Helios64 (DT)
[22793.380191] pstate: 40000085 (nZcv daIf -PAN -UAO -TCO BTYPE=--)
[22793.380728] pc : rcu_sched_clock_irq+0x208/0xce0
[22793.381134] lr : rcu_sched_clock_irq+0x1f8/0xce0
[22793.381539] sp : ffff800011c13cd0
[22793.381832] x29: ffff800011c13cd0 x28: ffff800011952440
[22793.382301] x27: ffff8000118ba000 x26: ffff0000f77c8980
[22793.382769] x25: ffff800011580980 x24: ffff8000e6248000
[22793.383237] x23: 0000000000000000 x22: ffff8000118b9948
[22793.383705] x21: ffff800011b27ad8 x20: ffff0000f77c89f0
[22793.384173] x19: 0000000000000001 x18: 0000000000000000
[22793.384641] x17: 0000000000000000 x16: 0000000000000000
[22793.385109] x15: 0000002d01e1f6ac x14: 000000000000006a
[22793.385577] x13: 000000010055ce03 x12: 00000000000ab681
[22793.386045] x11: ffff8000118b7000 x10: ffff80001194ef28
[22793.386513] x9 : ffff80001194ef20 x8 : ffff800011b72320
[22793.386981] x7 : ffff800011952000 x6 : 0000007f7ced25ad
[22793.387449] x5 : 7ab3901a5062db37 x4 : ffff8000e6248000
[22793.387917] x3 : 0000000000010001 x2 : ffff8000e6248000
[22793.388385] x1 : ffff0000f77c89f0 x0 : fffe800011952440
[22793.388854] Call trace:
[22793.389076]  rcu_sched_clock_irq+0x208/0xce0
[22793.389454]  update_process_times+0x60/0xa0
[22793.389825]  tick_sched_handle.isra.19+0x40/0x58
[22793.390231]  tick_sched_timer+0x58/0xb0
[22793.390572]  __hrtimer_run_queues+0x104/0x388
[22793.390956]  hrtimer_interrupt+0xf4/0x250
[22793.391311]  arch_timer_handler_phys+0x30/0x40
[22793.391704]  handle_percpu_devid_irq+0xa0/0x298
[22793.392103]  generic_handle_irq+0x30/0x48
[22793.392456]  __handle_domain_irq+0x94/0x108
[22793.392827]  gic_handle_irq+0xc0/0x140
[22793.393159]  el1_irq+0xc0/0x180
[22793.393440]  arch_cpu_idle+0x18/0x28
[22793.393757]  default_idle_call+0x44/0x1bc
[22793.394111]  do_idle+0x204/0x278
[22793.394397]  cpu_startup_entry+0x24/0x60
[22793.394745]  secondary_start_kernel+0x170/0x180
[22793.395147] Code: 72001c1f 54fffda1 34fffcd3 f94033e0 (f9400401)
[22793.395690] ---[ end trace a14f0598db2feff1 ]---
[22793.396097] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[22793.396700] SMP: stopping secondary CPUs
[22793.397053] Kernel Offset: disabled
[22793.397361] CPU features: 0x0240022,6100200c
[22793.397736] Memory Limit: none
[22793.398014] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
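
To compare crashes across several captures, something like the sketch below could pull the faulting function out of each saved console log and count how often it appears. `panic_sites` is a hypothetical helper; it assumes the logs look like the ones in this thread, with either a `pc :` line for an Oops or a `Kernel stack is corrupted in:` line for a stack-protector panic.

```shell
#!/bin/sh
# Sketch: list the function each crash happened in, with a count.
panic_sites() {
    # Skip the closing "---[ end ... ]---" lines so that a stack-protector
    # panic is not counted twice, then keep only the symbol name before "+".
    sed -n \
        -e '/---\[ end/d' \
        -e 's/.*\] pc : \([^+]*\)+.*/\1/p' \
        -e 's/.*Kernel stack is corrupted in: \([^+]*\)+.*/\1/p' \
        "$@" | sort | uniq -c
}
```

It takes one or more saved log files as arguments, for example `panic_sites crash1.txt crash2.txt` (file names are placeholders).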

 
