Helios64 - freeze whatever the kernel is.


SR-G

Question

As stated in my comments here (but it seems no one is reading this blog post's comments anymore ...): https://blog.kobol.io/2020/10/27/helios64-software-issue/

 

- With 5.8.17 I have kernel panics.

- With 5.8.14 I have freezes.

 

So what can I do to get a stable system?!

 

 

Quote

I installed Armbian 20.08.21 with kernel 5.8.17 (not 5.8.14 or 5.8.16), and the whole system has just frozen during an mdadm RAID5 array creation, after ~20h of rebuild, so I had to power down the NAS (the first LED was blinking red).

Linux helios64 5.8.17-rockchip64 #20.08.21 SMP PREEMPT Sat Oct 31 08:22:59 CET 2020 aarch64 GNU/Linux

I don't have any logs (by default the journal is kept in RAM), so I have no clue about what happened (nothing in kern.log), but could this be linked to the issue described in this post?
Or did the identified kernel crash only happen on 5.8.16, and is it now fixed in 5.8.17?
So should I stay on 5.8.17, or is it advised to revert to 5.8.14?

 

Quote

And today I got the following, without an immediate freeze and without the blinking red light, but I still had to reboot after a few minutes because of weird behavior, even though SSH sessions were still online (still with 5.8.17):

[Sat Nov 28 01:08:46 2020] Unable to handle kernel paging request at virtual address bfff800010c4132c
[Sat Nov 28 01:08:46 2020] Mem abort info:
[Sat Nov 28 01:08:46 2020] ESR = 0x86000004
[Sat Nov 28 01:08:46 2020] EC = 0x21: IABT (current EL), IL = 32 bits
[Sat Nov 28 01:08:46 2020] SET = 0, FnV = 0
[Sat Nov 28 01:08:46 2020] EA = 0, S1PTW = 0
[Sat Nov 28 01:08:46 2020] [bfff800010c4132c] address between user and kernel address ranges
[Sat Nov 28 01:08:46 2020] Internal error: Oops: 86000004 [#1] PREEMPT SMP
[Sat Nov 28 01:08:46 2020] Modules linked in: rfkill governor_performance snd_soc_hdmi_codec r8152 snd_soc_rockchip_i2s hantro_vpu(C) rockchip_rga rockchip_vdec(C) snd_soc_core rockchipdrm v4l2_h264 snd_pcm_dmaengine videobuf2_vmalloc videobuf2_dma_sg videobuf2_dma_contig snd_pcm dw_mipi_dsi v4l2_mem2mem dw_hdmi videobuf2_memops snd_timer analogix_dp videobuf2_v4l2 snd zstd panfrost videobuf2_common fusb302 pwm_fan leds_pwm drm_kms_helper soundcore gpio_charger tcpm videodev gpu_sched cec typec mc rc_core sg drm drm_panel_orientation_quirks cpufreq_dt gpio_beeper zram lm75 ip_tables x_tables autofs4 raid10 raid1 raid0 multipath linear raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx md_mod realtek dwmac_rk stmmac_platform stmmac mdio_xpcs adc_keys
[Sat Nov 28 01:08:46 2020] CPU: 4 PID: 7457 Comm: sshd Tainted: G C 5.9.10-rockchip64 #20.11
[Sat Nov 28 01:08:46 2020] Hardware name: Helios64 (DT)
[Sat Nov 28 01:08:46 2020] pstate: 00000005 (nzcv daif -PAN -UAO BTYPE=--)
[Sat Nov 28 01:08:46 2020] pc : 0xbfff800010c4132c
[Sat Nov 28 01:08:46 2020] lr : simple_copy_to_iter+0x34/0x68
[Sat Nov 28 01:08:46 2020] sp : ffff8000132eba30
[Sat Nov 28 01:08:46 2020] x29: ffff8000132eba30 x28: 0000000000000000
[Sat Nov 28 01:08:46 2020] x27: ffff000043d97d00 x26: 00000000000005a8
[Sat Nov 28 01:08:46 2020] x25: 0000000000001228 x24: 00000000000005a8
[Sat Nov 28 01:08:46 2020] x23: ffff8000132ebcc8 x22: 0000000000000001
[Sat Nov 28 01:08:46 2020] x21: ffff8000132ebcc8 x20: ffff0000a0aa0882
[Sat Nov 28 01:08:46 2020] x19: 00000000000005a8 x18: 0000000000000000
[Sat Nov 28 01:08:46 2020] x17: 0000000000000000 x16: 0000000000000000
[Sat Nov 28 01:08:46 2020] x15: 0000000000000000 x14: 34d9ba8f70dd5728
[Sat Nov 28 01:08:46 2020] x13: abbcee5b1190622f x12: a4a3849f1551ce47
[Sat Nov 28 01:08:46 2020] x11: 9ce64aa033a6ddcb x10: 0000000000000882
[Sat Nov 28 01:08:46 2020] x9 : 908fce41edbdd9d6 x8 : 00000000f7e00000
[Sat Nov 28 01:08:46 2020] x7 : 0000000000000018 x6 : ffff800011a84510
[Sat Nov 28 01:08:46 2020] x5 : ffff800011a84510 x4 : 0000000000000000
[Sat Nov 28 01:08:46 2020] x3 : ffff800010010000 x2 : ffff80000de00000
[Sat Nov 28 01:08:46 2020] x1 : ffff000002210000 x0 : ffff000003040000
[Sat Nov 28 01:08:46 2020] Call trace:
[Sat Nov 28 01:08:46 2020] 0xbfff800010c4132c
[Sat Nov 28 01:08:46 2020] __skb_datagram_iter+0x144/0x240
[Sat Nov 28 01:08:46 2020] skb_copy_datagram_iter+0x50/0x110
[Sat Nov 28 01:08:46 2020] tcp_recvmsg+0x590/0x950
[Sat Nov 28 01:08:46 2020] inet_recvmsg+0x50/0x120
[Sat Nov 28 01:08:46 2020] sock_recvmsg+0x4c/0x60
[Sat Nov 28 01:08:46 2020] sock_read_iter+0x88/0xe0
[Sat Nov 28 01:08:46 2020] new_sync_read+0x16c/0x180
[Sat Nov 28 01:08:46 2020] vfs_read+0x148/0x1d8
[Sat Nov 28 01:08:46 2020] ksys_read+0xe0/0xf8
[Sat Nov 28 01:08:46 2020] __arm64_sys_read+0x1c/0x28
[Sat Nov 28 01:08:46 2020] el0_svc_common.constprop.0+0x70/0x188
[Sat Nov 28 01:08:46 2020] do_el0_svc+0x24/0x90
[Sat Nov 28 01:08:46 2020] el0_sync_handler+0x90/0x198
[Sat Nov 28 01:08:46 2020] el0_sync+0x158/0x180
[Sat Nov 28 01:08:46 2020] Code: bad PC value
[Sat Nov 28 01:08:46 2020] ---[ end trace 82cecfd63ab60ea2 ]---
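The ESR value printed in an arm64 Oops can be decoded by hand, which helps when comparing the traces in this thread. A minimal sketch, assuming the standard ESR_ELx layout (exception class EC in bits 31:26, instruction length IL in bit 25); `esr_decode` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Decode the EC and IL fields of an arm64 ESR value as printed in an Oops,
# e.g. "ESR = 0x86000004" or "Internal error: Oops: 96000044".
# esr_decode is a hypothetical helper, written only for illustration.
esr_decode() {
    esr=$(( $1 ))
    ec=$(( (esr >> 26) & 0x3f ))   # exception class, bits [31:26]
    il=$(( (esr >> 25) & 1 ))      # 1 = 32-bit instruction
    printf 'EC=0x%02x IL=%d\n' "$ec" "$il"
}

esr_decode 0x86000004   # prints "EC=0x21 IL=1" (the sshd Oops above)
esr_decode 0x96000044   # prints "EC=0x25 IL=1" (first Oops reported on 5.8.14)
```

EC 0x21 is an instruction abort and EC 0x25 a data abort, both taken at the current exception level, which matches the "IABT (current EL)" line in the trace.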

 

Quote

So I switched back to 5.8.14 ... and I also have a lot of issues and kernel panics !!!

 

> Linux helios64 5.8.14-rockchip64 #20.08.10 SMP PREEMPT Tue Oct 13 16:58:01 CEST 2020 aarch64 GNU/Linux
 

From today (board frozen + red light blinking + had to reboot):


Message from syslogd@localhost at Nov 29 14:48:57 ...
kernel:[16987.580304] Internal error: Oops: 96000044 [#1] PREEMPT SMP

Message from syslogd@localhost at Nov 29 14:48:57 ...
kernel:[16987.601169] Code: aa0103e0 f9400422 f85f8403 f9000462 (f9000043)

Message from syslogd@localhost at Nov 29 14:48:58 ...
kernel:[16987.780586] Internal error: Oops: 96000004 [#2] PREEMPT SMP

Message from syslogd@localhost at Nov 29 14:48:58 ...
kernel:[16987.801529] Code: f2fbd5a7 d0008408 f9400f01 aa0103e0 (f9400422)

Message from syslogd@localhost at Nov 29 14:48:58 ...
kernel:[16987.807055] Internal error: Oops: 96000004 [#3] PREEMPT SMP

Message from syslogd@localhost at Nov 29 14:48:58 ...
kernel:[16987.833634] Code: f2fbd5a7 d0008408 f9400f01 aa0103e0 (f9400422)

 

So what can I do to get a stable system?
Is it even possible?
With which kernel?

 

It seems to happen under load (one freeze while the RAID array was being built, and several, with the .14 or .17 kernels, while files were being copied over the 1Gb/s network interface).

- System newly installed and running on a fresh SD card.

- 5*3.5" WD HDD plugged in / RAID5 mdadm array (so no M.2, ...)

- nothing else done on the NAS (nothing has been installed on the OS, no processes in memory outside raid + rsync or scp file copying through SSH)

  • 0

On my side

- OS (debian helios64 image) installed on SD card, SD card is a samsung one (128G)

- 5x Western Digital HDD (all the same) WDBBGB0140HBK-EESN 14TB, plugged in a regular way sda > sde (and so obviously no M.2 plugged in)

- I have the internal battery plugged in

 

At OS level : 

- docker with netdata container (and nothing else for now)

- mdadm activated for the RAID-5 array
- SMBFS with a few shares

- NFS with shares mounted on other servers (crashes were already happening before switching to NFS instead of SSHFS for regular accesses)

 

Current workarounds:

- latest kernel 5.10.21

- with patched boot.scr

- governor = powersave, min speed = max speed = 1.6GHz (and not 1.8GHz)

This seems to be the "least problematic" configuration (one crash every two days rather than every day ...)

 

About load

- rclone each day during a few hours to mirror everything in the cloud (limited to 750GB per day so it takes quite some days) - nearly no freezes during this

- borgbackup fetching from another server and through NFS some file to backup each night - i suspect some freezes there

- some NFS shares being accessed for various tasks all the times (sometimes with a lot of IO) - i suspect some freezes there

- all this is quite reasonable about load and is not generating a lot of IO in the end

 

Helios64 board ordered on 12 Jan 2020 (order 1312), shipped on 21 Sept 2020 and received around the beginning of October (NAS not installed before December).

 

By the way, I have also had a Helios4 board for a long time, and I never got any freeze with it.

  • 0
5 hours ago, snakekick said:

when i now read the Helios64 Production Update 4

https://blog.kobol.io/2020/08/23/helios64-update4/

"Our test successful rate is not as high as we expected. It turned out that our FAT software was a bit too strict resulting in some board failing the FAT while they are actually OK"

 

it leaves a bit of a bad taste...

 

Thank you for your feedback.
Don't bet on winning horses, but on a solution to a specific problem.

 

Every one of my problems was software!

 

I'm not minimizing yours!


  • 0
14 hours ago, snakekick said:

"Our test successful rate is not as high as we expected. It turned out that our FAT software was a bit too strict resulting in some board failing the FAT while they are actually OK"

 

The issue with our Factory Acceptance Test (FAT) we mentioned was related to PCIe link training. It took us a bit of time to realize that the PCIe device (SATA controller) was getting reset by software twice at boot instead of once, which sometimes caused the SATA controller to fail the link training sequence... and therefore to fail the FAT. It's completely unrelated to the DFS problem and was fixed within days.


  • 0
On 4/5/2021 at 6:41 AM, gprovost said:

Yes sorry about the lack of centralized news on that stability improvement effort.

We don't want to confuse people by posting things when the effort is still ongoing, and that we still don't have a clear understanding why the instability issue only impact some of the boards.

I can understand that, and it also doesn't make a good impression, yet I assume there must be way more people out there with an unstable system wondering what is happening and being disappointed.

I would have appreciated any information, even if it's just that users reported stability problems and you are on it, trying to improve this or that section of the code.

 

Also, I wonder what you are working on to stabilize the system if you don't know the cause? I already tried the voltage and scheduler workarounds without success, and those are the only two specific issues that were pointed out so far.

 

 

On 4/5/2021 at 6:41 AM, gprovost said:

Are you running from eMMC or SDcard ?

Can you share some crash logs.

I'm running on both. Since I have had to reflash the system so many times, it's currently on the SD card, but I also copied it to NAND without any difference in stability.

 

I already shared some crash logs in other topics. Wouldn't a dedicated topic with just crash logs be easier for you guys?


  • 0

I'm a bit confused to not have the same values on my side for all policies ...

 

QBOMdtN.png

 

Whereas : 

rxRHNJm.png

 

(> I set min = max = 1.6GHz through armbian-config)

 

What are these two different policies in /cpufreq/ ? (policy0 + policy4 in /cpufreq/ on my side)

Is it like "policy0" is used by "performance" governor mode and policy4 by "powersave" ? (in which case it would make sense for me to have different values)
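A minimal sketch for checking this from the sysfs side, assuming the usual cpufreq layout in which each policyN directory covers the CPUs that share one clock and is named after the first of them. `show_policies` is a hypothetical helper name; `related_cpus`, `scaling_min_freq` and `scaling_max_freq` are standard cpufreq attributes.

```shell
#!/bin/sh
# Print, for each cpufreq policy, the CPUs it covers and its current limits.
# $1: cpufreq sysfs directory. It defaults to the standard location and is a
#     parameter only so the function can be tried against a copy of the tree.
# show_policies is a hypothetical helper, written only for illustration.
show_policies() {
    base=${1:-/sys/devices/system/cpu/cpufreq}
    for p in "$base"/policy*; do
        [ -d "$p" ] || continue
        printf '%s: cpus=[%s] min=%s max=%s\n' \
            "${p##*/}" \
            "$(cat "$p/related_cpus")" \
            "$(cat "$p/scaling_min_freq")" \
            "$(cat "$p/scaling_max_freq")"
    done
}

show_policies
```

If the two policies list different CPUs, they are per-cluster settings rather than per-governor ones, and different min/max values between them are then not surprising.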

 


  • 0
Quote

What are these two different policies in /cpufreq/ ? (policy0 + policy4 in /cpufreq/ on my side)

Not sure. I changed both policies to 1.6GHz and I get the same min/max as you:

image.png.9eac49871a25ceb8c5841dda84d1bbf7.png


Changed back to 1GHz and both policies are the same again.
image.png.5e3200d0d7d6acf04fa14a0fb5380134.png

 

Okay, and I just noticed that not all cores have the same frequencies available:
image.png.f2f5a0caf2583523b923a0fc891cf698.png

So maybe the policies are made to fit the available frequencies?

And yep, setting the min max to 1.4GHz makes both policies equal.

image.png.4eefa8ec45be85f45ca3fbc67483935d.png
 
Not sure why some cores don't accept 1.6 and 1.8GHz tho. Isn't this supposed to be the same clock for every core?
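The observation that 1.4GHz makes both policies equal can be checked without screenshots. A rough sketch, assuming the cpufreq driver exposes `scaling_available_frequencies` (space-separated values in kHz) in each policy directory; `highest_common_freq` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the highest frequency (kHz) that every cpufreq policy lists in
# scaling_available_frequencies.
# $1: cpufreq sysfs directory (defaults to the standard location; it is a
#     parameter only to allow dry runs on a copy of the tree).
# highest_common_freq is a hypothetical helper, written only for illustration.
highest_common_freq() {
    base=${1:-/sys/devices/system/cpu/cpufreq}
    n=0
    all=""
    for p in "$base"/policy*; do
        [ -r "$p/scaling_available_frequencies" ] || continue
        n=$((n + 1))
        all="$all $(cat "$p/scaling_available_frequencies")"
    done
    [ "$n" -gt 0 ] || return 0
    # A frequency is common to all policies when it shows up n times.
    # Input is sorted ascending, so the last match is the highest one.
    printf '%s\n' $all | sort -n | uniq -c |
        awk -v n="$n" '$1 == n { best = $2 } END { if (best != "") print best }'
}

highest_common_freq
```

Whatever value this prints is the highest setting that can be applied as min = max to every policy at once.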

 


  • 0

@tionebrr

Quote

Not sure why some cores don't accept 1.6 and 1.8GHz tho. Isn't this supposed to be the same clock for every core?

Source https://www.rockchip.fr/RK3399 datasheet V1.8.pdf

 

1.2.1 Microprocessor
- Dual-core ARM Cortex-A72 MPCore processor and Quad-core ARM Cortex-A53 MPCore processor, both are high-performance, low-power and cached application processor

- Two CPU clusters. Big cluster with dual-core Cortex-A72 is optimized for high-performance and little cluster with quad-core Cortex-A53 is optimized for low power.
<...>
- PD_A72_B0: 1st Cortex-A72 + Neon + FPU + L1 I/D cache of big cluster
- PD_A72_B1: 2nd Cortex-A72 + Neon + FPU + L1 I/D cache of big cluster
<...>
- PD_A53_L0: 1st Cortex-A53 + Neon + FPU + L1 I/D Cache of little cluster
- PD_A53_L1: 2nd Cortex-A53 + Neon + FPU + L1 I/D Cache of little cluster
- PD_A53_L2: 3rd Cortex-A53 + Neon + FPU + L1 I/D Cache of little cluster
- PD_A53_L3: 4th Cortex-A53 + Neon + FPU + L1 I/D Cache of little cluster

<...>

 

3.2 Recommended Operating Conditions

The below table describes the recommended operating condition for every clock domain.

Table 3-2 Recommended operating conditions

Parameters                          Symbol      Min   Typ   Max   Units
Supply voltage for Cortex A72 CPU   BIGCPU_VDD  0.80  0.90  1.25  V
Supply voltage for Cortex A53 CPU   LITCPU_VDD  0.80  0.90  1.20  V
Max frequency of Cortex A72 CPU                             1.8   GHz
Max frequency of Cortex A53 CPU                             1.4   GHz

 


  • 0
On 4/6/2021 at 4:05 PM, tionebrr said:

I had instability too, I just dialed back on the performances and it hasn't crashed in a while.
image.thumb.png.15c0b218b889f0406724d2811842cdbc.png

 

However, throttling is disabled. I fixed the cpufreq at about 1GHz. Hadn't had the time to do more testing.
image.png.14c0d42e12cb7e30f957ac5b5e2deb68.png

 

Well, fixing the CPU frequency at 1.2GHz also did the trick for me, whereas at 1.6GHz it still crashed regularly.

 

2044791689_UnbenanntesBild.png.e76bca0531558c2b28c5a5098b044cc6.png


  • 0

Today is the day.

After 7 days, probably a corrupted OS again.

 

Spoiler

DDR Version 1.24 20191016
In
channel 0
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 1
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 0 training pass!
channel 1 training pass!
change freq to 416MHz 0,1
Channel 0: LPDDR4,416MHz
Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
Channel 1: LPDDR4,416MHz
Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
256B stride
channel 0
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 1
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 0 training pass!
channel 1 training pass!
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
change freq to 856MHz 1,0
ch 0 ddrconfig = 0x101, ddrsize = 0x40
ch 1 ddrconfig = 0x101, ddrsize = 0x40
pmugrf_os_reg[2] = 0x32C1F2C1, stride = 0xD
ddr_set_rate to 328MHZ
ddr_set_rate to 666MHZ
ddr_set_rate to 928MHZ
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
ddr_set_rate to 416MHZ, ctl_index 0
ddr_set_rate to 856MHZ, ctl_index 1
support 416 856 328 666 928 MHz, current 856MHz
OUT
Boot1: 2019-03-14, version: 1.19
CPUId = 0x0
ChipType = 0x10, 252
SdmmcInit=2 0
BootCapSize=100000
UserCapSize=14910MB
FwPartOffset=2000 , 100000
mmc0:cmd5,20
SdmmcInit=0 0
BootCapSize=0
UserCapSize=30528MB
FwPartOffset=2000 , 0
StorageInit ok = 65661
SecureMode = 0
SecureInit read PBA: 0x4
SecureInit read PBA: 0x404
SecureInit read PBA: 0x804
SecureInit read PBA: 0xc04
SecureInit read PBA: 0x1004
SecureInit read PBA: 0x1404
SecureInit read PBA: 0x1804
SecureInit read PBA: 0x1c04
SecureInit ret = 0, SecureMode = 0
atags_set_bootdev: ret:(0)
GPT 0x3380ec0 signature is wrong
recovery gpt...
GPT 0x3380ec0 signature is wrong
recovery gpt fail!
LoadTrust Addr:0x4000
No find bl30.bin
No find bl32.bin
Load uboot, ReadLba = 2000
Load OK, addr=0x200000, size=0xe5b60
RunBL31 0x40000
NOTICE:  BL31: v1.3(debug):42583b6
NOTICE:  BL31: Built : 07:55:13, Oct 15 2019
NOTICE:  BL31: Rockchip release version: v1.1
INFO:    GICv3 with legacy support detected. ARM GICV3 driver initialized in EL3
INFO:    Using opteed sec cpu_context!
INFO:    boot cpu mask: 0
INFO:    plat_rockchip_pmu_init(1190): pd status 3e
INFO:    BL31: Initializing runtime services
WARNING: No OPTEE provided by BL2 boot loader, Booting device without OPTEE initialization. SMC`s destined for OPTEE will return SMC_UNK
ERROR:   Error initializing runtime service opteed_fast
INFO:    BL31: Preparing for EL3 exit to normal world
INFO:    Entry point address = 0x200000
INFO:    SPSR = 0x3c9


U-Boot 2020.10-armbian (Mar 08 2021 - 14:54:58 +0000)

SoC: Rockchip rk3399
Reset cause: POR
DRAM:  3.9 GiB
PMIC:  RK808
SF: Detected w25q128 with page size 256 Bytes, erase size 4 KiB, total 16 MiB
MMC:   mmc@fe320000: 1, sdhci@fe330000: 0
Loading Environment from MMC... *** Warning - bad CRC, using default environment

In:    serial
Out:   serial
Err:   serial
Model: Helios64
Revision: 1.2 - 4GB non ECC
Net:   eth0: ethernet@fe300000
scanning bus for devices...
starting USB...
Bus usb@fe380000: USB EHCI 1.00
Bus dwc3: usb maximum-speed not found
Register 2000140 NbrPorts 2
Starting the controller
USB XHCI 1.10
scanning bus usb@fe380000 for devices... 1 USB Device(s) found
scanning bus dwc3 for devices... cannot reset port 4!?
5 USB Device(s) found
       scanning usb for storage devices... 0 Storage Device(s) found
Hit any key to stop autoboot:  0
switch to partitions #0, OK
mmc1 is current device
Scanning mmc 1:1...
Found U-Boot script /boot/boot.scr
3185 bytes read in 5 ms (622.1 KiB/s)
## Executing script at 00500000
Boot script loaded from mmc 1
25 bytes read in 4 ms (5.9 KiB/s)
16208157 bytes read in 690 ms (22.4 MiB/s)
28582400 bytes read in 1213 ms (22.5 MiB/s)
81913 bytes read in 12 ms (6.5 MiB/s)
Failed to load '/boot/dtb/rockchip/overlay/-fixup.scr'
Moving Image from 0x2080000 to 0x2200000, end=3de0000
## Loading init Ramdisk from Legacy Image at 06000000 ...
   Image Name:   uInitrd
   Image Type:   AArch64 Linux RAMDisk Image (gzip compressed)
   Data Size:    16208093 Bytes = 15.5 MiB
   Load Address: 00000000
   Entry Point:  00000000
   Verifying Checksum ... OK
## Flattened Device Tree blob at 01f00000
   Booting using the fdt blob at 0x1f00000
   Loading Ramdisk to f4f7a000, end f5eef0dd ... OK
   Loading Device Tree to 00000000f4efd000, end 00000000f4f79fff ... OK

Starting kernel ...

[    3.703315] SError Interrupt on CPU5, code 0xbf000002 -- SError
[    3.703320] CPU: 5 PID: 231 Comm: kworker/5:2 Not tainted 5.10.21-rockchip64 #21.02.3
[    3.703323] Hardware name: Helios64 (DT)
[    3.703326] Workqueue: events deferred_probe_work_func
[    3.703332] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
[    3.703335] pc : rockchip_pcie_rd_conf+0xb0/0x268
[    3.703338] lr : rockchip_pcie_rd_conf+0x1b4/0x268
[    3.703341] sp : ffff80001280b830
[    3.703344] x29: ffff80001280b830 x28: 0000000000000000
[    3.703351] x27: 0000000000000000 x26: 0000000000000000
[    3.703358] x25: 0000000000000000 x24: ffff80001280b974
[    3.703365] x23: ffff0000f542a800 x22: ffff0000f5429b80
[    3.703372] x21: ffff80001280b8b4 x20: 0000000000000004
[    3.703379] x19: 0000000000000000 x18: 0000000000000000
[    3.703385] x17: 0000000000000020 x16: 000000007d755c46
[    3.703392] x15: ffffffffffffffff x14: ffff8000118b9948
[    3.703398] x13: ffff0000448f1a1c x12: ffff0000448f1290
[    3.703405] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
[    3.703412] x9 : 0000000001001d87 x8 : 000000000000ea60
[    3.703419] x7 : ffff80001280b800 x6 : 0000000000000001
[    3.703425] x5 : 0000000000100000 x4 : 0000000000000000
[    3.703432] x3 : 0000000000c00008 x2 : 000000000080000a
[    3.703438] x1 : ffff80001dc00008 x0 : ffff80001a000000
[    3.703446] Kernel panic - not syncing: Asynchronous SError Interrupt
[    3.703450] CPU: 5 PID: 231 Comm: kworker/5:2 Not tainted 5.10.21-rockchip64 #21.02.3
[    3.703453] Hardware name: Helios64 (DT)
[    3.703456] Workqueue: events deferred_probe_work_func
[    3.703460] Call trace:
[    3.703463]  dump_backtrace+0x0/0x200
[    3.703465]  show_stack+0x18/0x68
[    3.703468]  dump_stack+0xcc/0x124
[    3.703471]  panic+0x174/0x374
[    3.703473]  nmi_panic+0x64/0x98
[    3.703476]  arm64_serror_panic+0x74/0x88
[    3.703479]  do_serror+0x38/0x98
[    3.703481]  el1_error+0x84/0x104
[    3.703484]  rockchip_pcie_rd_conf+0xb0/0x268
[    3.703487]  pci_bus_read_config_dword+0x84/0xd8
[    3.703490]  pci_bus_generic_read_dev_vendor_id+0x34/0x1b0
[    3.703493]  pci_bus_read_dev_vendor_id+0x4c/0x70
[    3.703496]  pci_scan_single_device+0x84/0xe0
[    3.703499]  pci_scan_slot+0x38/0x120
[    3.703502]  pci_scan_child_bus_extend+0x58/0x330
[    3.703505]  pci_scan_bridge_extend+0x340/0x5a0
[    3.703508]  pci_scan_child_bus_extend+0x1fc/0x330
[    3.703511]  pci_scan_root_bus_bridge+0xd4/0xf0
[    3.703513]  pci_host_probe+0x18/0xb0
[    3.703516]  rockchip_pcie_probe+0x268/0x478
[    3.703519]  platform_drv_probe+0x54/0xa8
[    3.703521]  really_probe+0xe8/0x4d0
[    3.703524]  driver_probe_device+0xf4/0x160
[    3.703527]  __device_attach_driver+0x8c/0x118
[    3.703530]  bus_for_each_drv+0x7c/0xd0
[    3.703533]  __device_attach+0xe8/0x168
[    3.703535]  device_initial_probe+0x14/0x20
[    3.703538]  bus_probe_device+0x9c/0xa8
[    3.703541]  deferred_probe_work_func+0x88/0xd8
[    3.703544]  process_one_work+0x1ec/0x4d0
[    3.703547]  worker_thread+0x208/0x478
[    3.703549]  kthread+0x140/0x150
[    3.703552]  ret_from_fork+0x10/0x34
[    3.703582] SMP: stopping secondary CPUs
[    3.703585] Kernel Offset: disabled
[    3.703589] CPU features: 0x0240022,6100200c
[    3.703591] Memory Limit: none

 


  • 0

So on my side, after my latest reinstallation (due to corrupted OS)

- with default installation / configuration out of the box, i had one freeze every 24h

- by switching to "powersave" or to "performance" mode, with min CPU frequency = max CPU frequency = either 1.8GHz or 1.6GHz: still the same (one freeze per day)

- by switching to "performance" mode, with min CPU frequency = max CPU frequency = 1.4GHz, it now seems more stable (uptime = 5 days so far)

 

So my gut feeling is really that these issues:

- are mainly related to the cpufreq mechanism

- and are probably related to what was nicely spotted before (by Vin): the fact that 2 cores have a different max frequency range (as expected per the specs, but maybe with a corner case in the cpufreq governance)

 

Quote

8:38 root@helios64 ~# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq
1416000
1416000

 

8:38 root@helios64 ~# uptime
 08:38:34 up 5 days, 17:25,  1 user,  load average: 0.00, 0.00, 0.00
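For anyone who wants to try the same min = max setting directly through sysfs, here is a minimal sketch. It assumes the standard `scaling_min_freq` / `scaling_max_freq` attributes shown above; `pin_cpufreq` is a hypothetical helper name. Values written this way need root and are lost at reboot (this thread used armbian-config for the persistent setting).

```shell
#!/bin/sh
# Set scaling_min_freq and scaling_max_freq of every policy to one value.
# $1: frequency in kHz (1416000 is the value shown in the post above)
# $2: cpufreq sysfs directory (defaults to the standard location; it is a
#     parameter only so the function can be tried on a scratch copy)
# pin_cpufreq is a hypothetical helper, written only for illustration.
pin_cpufreq() {
    khz=$1
    base=${2:-/sys/devices/system/cpu/cpufreq}
    for p in "$base"/policy*; do
        [ -w "$p/scaling_max_freq" ] || continue
        cur_max=$(cat "$p/scaling_max_freq")
        # Write in an order that never leaves min above max, in case the
        # kernel rejects such an intermediate state.
        if [ "$khz" -ge "$cur_max" ]; then
            echo "$khz" > "$p/scaling_max_freq"
            echo "$khz" > "$p/scaling_min_freq"
        else
            echo "$khz" > "$p/scaling_min_freq"
            echo "$khz" > "$p/scaling_max_freq"
        fi
        printf '%s: min=%s max=%s\n' "${p##*/}" \
            "$(cat "$p/scaling_min_freq")" "$(cat "$p/scaling_max_freq")"
    done
}

# Example (as root): pin_cpufreq 1416000
```

The call is left commented out on purpose, since it changes live CPU settings.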


  • 0
On 3/16/2021 at 3:27 AM, ShadowDance said:

@jbergler I recently noticed the armbian-hardware-optimization script for Helios64 changes the IO scheduler to `bfq` for spinning disks, however, for ZFS we should be using `none` because it has its own scheduler. Normally ZFS would change the scheduler itself, but that would only happen if you're using raw disks (not partitions) and if you import the zpool _after_ the hardware optimization script has run.

 

You can try changing it (e.g. `echo none >/sys/block/sda/queue/scheduler`) for each ZFS disk and see if anything changes. I still haven't figured out if this is a cause for any problems, but it's worth a shot.

 

Currently giving this a try. Note that OMV creates a rule to override the scheduler to `bfq` for all rotating disks. Since all 5 of the HDDs are participating in my ZFS pool, I simply changed `bfq` to `none` in `/etc/udev/rules.d/99-openmediavault-scheduler.rules`, then ran the following to apply:

 

$ sudo udevadm control --reload-rules
$ sudo udevadm trigger --type=devices --action=change
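To confirm the rule took effect, the active scheduler can be read back from sysfs, where the kernel marks it with square brackets. A small sketch; `active_schedulers` is a hypothetical helper name, and it only looks at `sd*` devices on the assumption that the pool disks show up as sda to sde.

```shell
#!/bin/sh
# Print the active I/O scheduler of each sd* disk. queue/scheduler lists the
# available schedulers with the active one in brackets, for example
# "mq-deadline kyber [bfq] none".
# $1: block sysfs directory (defaults to /sys/block; a parameter for dry runs)
# active_schedulers is a hypothetical helper, written only for illustration.
active_schedulers() {
    base=${1:-/sys/block}
    for d in "$base"/sd*; do
        [ -r "$d/queue/scheduler" ] || continue
        printf '%s: %s\n' "${d##*/}" \
            "$(sed 's/.*\[\(.*\)\].*/\1/' "$d/queue/scheduler")"
    done
}

active_schedulers
```

After the udev change, every ZFS disk should report `none` here.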

 

Edited by roadkill

  • 0

Hi,

 

I can confirm having the same problems with CPU freezes.

 

Initially I fixed the problem by setting min CPU frequency = max CPU frequency = 1.4GHz. This configuration ran very solidly.

A few days ago I reinstalled my system (due to a different problem) and left the CPU settings on default. I'm now running kernel 5.10.63-rockchip64. Yesterday I noticed that the system was unresponsive again. This time the red fault LED was blinking, which I cannot remember it doing before. Contrary to the documentation, it was set to trigger on (kernel) panic.

 

Now I have set the max CPU frequency to 1.4GHz, leaving the min CPU frequency at 400MHz (performance mode). Let's see how this goes...

 

Meanwhile, I'm wondering if anyone has heard of an actual fix. Could it actually be a problem with my hardware, or is it purely software? And if it is, why are there so few complaints? I mean, this system is designed to run 24/7, isn't it? I don't want to complain about Helios64, I'm just trying to understand if there is anything I'm missing to get around this problem.

Many Thanks,
Julius


  • 0

Since I got my Helios64, I have had one hardware problem (the voltage on some disks dropped too low; it ended with the motherboard being replaced).

But I have also had many panics/freezes since the first days.

 

I was never able to get past 10 days without problems until I read this topic:

Last month I decided to set the freq policy as described here:

root@helios64:~# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
1416000
1416000
root@helios64:~# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq
1416000
1416000
root@helios64:~# cat /sys/devices/system/cpu/cpufreq/policy*/cpuinfo_cur_freq
1416000
1416000


And I have a much more stable system 

root@helios64:~# uptime
 18:35:42 up 30 days,  1:34,  1 user,  load average: 0.01, 0.03, 0.00
root@helios64:~# uname -a
Linux helios64 5.10.60-rockchip64 #21.08.1 SMP PREEMPT Wed Aug 25 18:56:55 UTC 2021 aarch64 GNU/Linux
root@helios64:~# cat /etc/issue
Armbian 21.08.1 Buster \l

 

 

