Helios64 - freezes whatever the kernel is.


SR-G

19 hours ago, Vin said:

I know you are a small team and you had to tackle many obstacles to release the Helios64. Believe me, I'm a fan of your work, your product and Armbian in general; I'm aware of the effort every involved person puts into this project.

 

But I would appreciate a bit more news about the current status of development.

 

As far as I can see there is no status or development overview on the current issues, neither on your blog nor on Twitter.

 

The only information we get is scattered across various topics on the Armbian forum.

 

Yes, sorry about the lack of centralized news on that stability improvement effort.

We don't want to confuse people by posting things while the effort is still ongoing and we still don't have a clear understanding of why the instability issue only impacts some of the boards.

 

19 hours ago, Vin said:

Also I wonder whether those issues are general RK3399 governor etc. problems, or whether they apply specifically to the Helios64?

 

It is not an issue that impacts all RK3399 boards. However, the same problem is impacting the NanoPi M4V2, and it's thanks to @piter75's work on the NanoPi that Helios64 stability has improved.

 

19 hours ago, Vin said:

Mine is still crashing like clockwork every 24 hours and generally leaves me with a corrupted OS and data.

Are you running from eMMC or SD card?

Can you share some crash logs?


On my side

- OS (Debian Helios64 image) installed on an SD card; the SD card is a Samsung one (128 GB)

- 5x Western Digital HDDs (all the same) WDBBGB0140HBK-EESN 14 TB, plugged in the regular way sda > sde (and so obviously no M.2 plugged in)

- I have the internal battery plugged in

 

At the OS level:

- docker with a netdata container (and nothing else for now)

- mdadm activated for the RAID-5 array
- SMBFS with a few shares

- NFS with shares mounted on other servers (crashes were already happening before switching from SSHFS to NFS for regular accesses)

 

At this time, on the "workaround" side:

- latest kernel 5.10.21

- with patched boot.scr 

- governor = powersave, min speed = max speed = 1.6 GHz (and not 1.8 GHz)

It seems to be the "least problematic" configuration (one crash every two days instead of every day ...)
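For reference, this workaround can also be applied directly through sysfs; the following is a minimal sketch assuming the standard cpufreq interface (the kernel clamps the requested value to the nearest OPP each cluster actually supports, so the little cluster will end up below 1.6 GHz):

```shell
# Pin every cpufreq policy to a fixed frequency with the powersave governor.
# Run as root on the board; policy numbers and available OPPs are hardware-dependent.
for policy in /sys/devices/system/cpu/cpufreq/policy*; do
    echo powersave > "$policy/scaling_governor"
    echo 1608000   > "$policy/scaling_max_freq"   # 1.6 GHz, clamped per cluster
    echo 1608000   > "$policy/scaling_min_freq"   # set max first so min <= max holds
done
```

armbian-config does essentially the same thing, but writing the values by hand makes it easier to experiment with different frequencies.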

 

About load

- rclone each day for a few hours to mirror everything to the cloud (limited to 750 GB per day, so it takes quite a few days) - nearly no freezes during this

- borgbackup fetching some files to back up each night from another server through NFS - I suspect some freezes there

- some NFS shares being accessed for various tasks all the time (sometimes with a lot of IO) - I suspect some freezes there

- all this is quite reasonable in terms of load and does not generate a lot of IO in the end

 

Helios64 board ordered on Jan 12, 2020 (order 1312), shipped on Sept 21, 2020, and received around the beginning of October (NAS not installed before December).

 

By the way, I have also had a Helios4 board for a long time, and I never got any freeze with it.


5 hours ago, snakekick said:

When I now read the Helios64 Production Update 4

https://blog.kobol.io/2020/08/23/helios64-update4/

"Our test successful rate is not as high as we expected. It turned out that our FAT software was a bit too strict resulting in some board failing the FAT while they are actually OK"

 

it leaves a little bit of a bad taste...

 

Thank you for your feedback.
Don't bet on winning horses, but on a solution to a specific problem.

 

Every one of my problems was software...!!

 

I don't minimize yours!


14 hours ago, snakekick said:

"Our test successful rate is not as high as we expected. It turned out that our FAT software was a bit too strict resulting in some board failing the FAT while they are actually OK"

 

The issue with our Factory Acceptance Test (FAT) we mentioned was related to PCIe link training. It took us a bit of time to realize that the PCIe device (SATA controller) was getting reset by software twice at boot instead of once, which sometimes caused the SATA controller to fail the link training sequence... therefore failing the FAT. It's completely unrelated to the DFS problem and was fixed within days.


On 5.4.2021 at 06:41, gprovost said:

Yes, sorry about the lack of centralized news on that stability improvement effort.

We don't want to confuse people by posting things while the effort is still ongoing and we still don't have a clear understanding of why the instability issue only impacts some of the boards.

I can understand that, although it doesn't make a good impression; I assume there must be way more people out there with an unstable system, wondering what is happening and being disappointed.

I would have appreciated any information, even if it's just that users reported stability problems and you are onto it, trying to improve this or that section of the code.

 

Also I wonder what you are working on to stabilize the system if you don't know the cause? I already tried the voltage and scheduler workarounds without success, and those are the only two specific issues that were pointed out so far.

 

 

On 5.4.2021 at 06:41, gprovost said:

Are you running from eMMC or SD card?

Can you share some crash logs?

I'm running from both; since I have had to reflash the system so many times, it's currently on the SD card, but I also copied it to eMMC without any difference in stability.

 

I already shared some crash logs in other topics; wouldn't a dedicated topic with just crash logs be easier for you guys?


I had instability too; I just dialed back the performance and it hasn't crashed in a while.
(screenshot)

 

However, throttling is disabled. I fixed the cpufreq at about 1 GHz. Haven't had the time to do more testing.
(screenshot)


I'm a bit confused not to have the same values on my side for all policies...

 

(screenshot)

 

Whereas:

(screenshot)

 

(I set min = max = 1.6 GHz through armbian-config)

 

What are these two different policies in /cpufreq/? (policy0 + policy4 in /cpufreq/ on my side)

Is it that "policy0" is used by the "performance" governor mode and policy4 by "powersave"? (in which case it would make sense for me to have different values)

 


Quote

What are these two different policies in /cpufreq/ ? (policy0 + policy4 in /cpufreq/ on my side)

Not sure. I changed both policies to 1.6 GHz and I get the same min/max as you:

(screenshot)


Changed back to 1 GHz and both policies are the same again.
(screenshot)

 

Okay, and I just noticed that not all cores have the same frequencies available:
(screenshot)

So maybe the policies are made to fit the available frequencies?

And yep, setting the min/max to 1.4 GHz makes both policies equal.

(screenshot)

Not sure why some cores don't accept 1.6 and 1.8 GHz though. Isn't this supposed to be the same clock for every core?

 

Link to post
Share on other sites

@tionebrr

Quote

Not sure why some cores don't accept 1.6 and 1.8 GHz though. Isn't this supposed to be the same clock for every core?

Source https://www.rockchip.fr/RK3399 datasheet V1.8.pdf

 

1.2.1 Microprocessor
 Dual-core ARM Cortex-A72 MPCore processor and quad-core ARM Cortex-A53 MPCore processor, both are high-performance, low-power and cached application processors

 Two CPU clusters. Big cluster with dual-core Cortex-A72 is optimized for high performance and little cluster with quad-core Cortex-A53 is optimized for low power.
<...>
 PD_A72_B0: 1st Cortex-A72 + Neon + FPU + L1 I/D cache of big cluster
 PD_A72_B1: 2nd Cortex-A72 + Neon + FPU + L1 I/D cache of big cluster
<...>
 PD_A53_L0: 1st Cortex-A53 + Neon + FPU + L1 I/D cache of little cluster
 PD_A53_L1: 2nd Cortex-A53 + Neon + FPU + L1 I/D cache of little cluster
 PD_A53_L2: 3rd Cortex-A53 + Neon + FPU + L1 I/D cache of little cluster
 PD_A53_L3: 4th Cortex-A53 + Neon + FPU + L1 I/D cache of little cluster

<...>

 

3.2 Recommended Operating Conditions

The table below describes the recommended operating conditions for every clock domain.

Table 3-2 Recommended operating conditions

Parameters                        | Symbol     | Min  | Typ  | Max  | Units
Supply voltage for Cortex-A72 CPU | BIGCPU_VDD | 0.80 | 0.90 | 1.25 | V
Supply voltage for Cortex-A53 CPU | LITCPU_VDD | 0.80 | 0.90 | 1.20 | V
Max frequency of Cortex-A72 CPU   |            |      |      | 1.8  | GHz
Max frequency of Cortex-A53 CPU   |            |      |      | 1.4  | GHz
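This big.LITTLE split is exactly what the two cpufreq policies reflect: policy0 governs the four A53 cores and policy4 the two A72 cores (a policy is named after the first CPU it covers). A quick way to confirm the mapping on the board, assuming the standard cpufreq sysfs layout:

```shell
# Show which CPUs each policy governs and its hardware max frequency (kHz).
# On the RK3399, expect policy0 -> cpus 0-3 (A53) and policy4 -> cpus 4-5 (A72).
for p in /sys/devices/system/cpu/cpufreq/policy*; do
    echo "$p: cpus=$(cat "$p/related_cpus") max=$(cat "$p/cpuinfo_max_freq")"
done
```

So the different min/max values per policy are expected whenever the requested frequency exceeds what one of the clusters supports.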

 


On 6.4.2021 at 16:05, tionebrr said:

I had instability too; I just dialed back the performance and it hasn't crashed in a while.
(screenshot)

 

However, throttling is disabled. I fixed the cpufreq at about 1 GHz. Haven't had the time to do more testing.
(screenshot)

 

Well, fixing the governor at 1.2 GHz also did the trick for me, whereas at 1.6 GHz it still crashed regularly.

 

(screenshot)


Today is the day.

After 7 days, probably a corrupted OS again.

 

Spoiler

DDR Version 1.24 20191016
In
channel 0
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 1
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 0 training pass!
channel 1 training pass!
change freq to 416MHz 0,1
Channel 0: LPDDR4,416MHz
Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
Channel 1: LPDDR4,416MHz
Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
256B stride
channel 0
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 1
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 0 training pass!
channel 1 training pass!
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
change freq to 856MHz 1,0
ch 0 ddrconfig = 0x101, ddrsize = 0x40
ch 1 ddrconfig = 0x101, ddrsize = 0x40
pmugrf_os_reg[2] = 0x32C1F2C1, stride = 0xD
ddr_set_rate to 328MHZ
ddr_set_rate to 666MHZ
ddr_set_rate to 928MHZ
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
ddr_set_rate to 416MHZ, ctl_index 0
ddr_set_rate to 856MHZ, ctl_index 1
support 416 856 328 666 928 MHz, current 856MHz
OUT
Boot1: 2019-03-14, version: 1.19
CPUId = 0x0
ChipType = 0x10, 252
SdmmcInit=2 0
BootCapSize=100000
UserCapSize=14910MB
FwPartOffset=2000 , 100000
mmc0:cmd5,20
SdmmcInit=0 0
BootCapSize=0
UserCapSize=30528MB
FwPartOffset=2000 , 0
StorageInit ok = 65661
SecureMode = 0
SecureInit read PBA: 0x4
SecureInit read PBA: 0x404
SecureInit read PBA: 0x804
SecureInit read PBA: 0xc04
SecureInit read PBA: 0x1004
SecureInit read PBA: 0x1404
SecureInit read PBA: 0x1804
SecureInit read PBA: 0x1c04
SecureInit ret = 0, SecureMode = 0
atags_set_bootdev: ret:(0)
GPT 0x3380ec0 signature is wrong
recovery gpt...
GPT 0x3380ec0 signature is wrong
recovery gpt fail!
LoadTrust Addr:0x4000
No find bl30.bin
No find bl32.bin
Load uboot, ReadLba = 2000
Load OK, addr=0x200000, size=0xe5b60
RunBL31 0x40000
NOTICE:  BL31: v1.3(debug):42583b6
NOTICE:  BL31: Built : 07:55:13, Oct 15 2019
NOTICE:  BL31: Rockchip release version: v1.1
INFO:    GICv3 with legacy support detected. ARM GICV3 driver initialized in EL3
INFO:    Using opteed sec cpu_context!
INFO:    boot cpu mask: 0
INFO:    plat_rockchip_pmu_init(1190): pd status 3e
INFO:    BL31: Initializing runtime services
WARNING: No OPTEE provided by BL2 boot loader, Booting device without OPTEE initialization. SMC`s destined for OPTEE will return SMC_UNK
ERROR:   Error initializing runtime service opteed_fast
INFO:    BL31: Preparing for EL3 exit to normal world
INFO:    Entry point address = 0x200000
INFO:    SPSR = 0x3c9


U-Boot 2020.10-armbian (Mar 08 2021 - 14:54:58 +0000)

SoC: Rockchip rk3399
Reset cause: POR
DRAM:  3.9 GiB
PMIC:  RK808
SF: Detected w25q128 with page size 256 Bytes, erase size 4 KiB, total 16 MiB
MMC:   mmc@fe320000: 1, sdhci@fe330000: 0
Loading Environment from MMC... *** Warning - bad CRC, using default environment

In:    serial
Out:   serial
Err:   serial
Model: Helios64
Revision: 1.2 - 4GB non ECC
Net:   eth0: ethernet@fe300000
scanning bus for devices...
starting USB...
Bus usb@fe380000: USB EHCI 1.00
Bus dwc3: usb maximum-speed not found
Register 2000140 NbrPorts 2
Starting the controller
USB XHCI 1.10
scanning bus usb@fe380000 for devices... 1 USB Device(s) found
scanning bus dwc3 for devices... cannot reset port 4!?
5 USB Device(s) found
       scanning usb for storage devices... 0 Storage Device(s) found
Hit any key to stop autoboot:  0
switch to partitions #0, OK
mmc1 is current device
Scanning mmc 1:1...
Found U-Boot script /boot/boot.scr
3185 bytes read in 5 ms (622.1 KiB/s)
## Executing script at 00500000
Boot script loaded from mmc 1
25 bytes read in 4 ms (5.9 KiB/s)
16208157 bytes read in 690 ms (22.4 MiB/s)
28582400 bytes read in 1213 ms (22.5 MiB/s)
81913 bytes read in 12 ms (6.5 MiB/s)
Failed to load '/boot/dtb/rockchip/overlay/-fixup.scr'
Moving Image from 0x2080000 to 0x2200000, end=3de0000
## Loading init Ramdisk from Legacy Image at 06000000 ...
   Image Name:   uInitrd
   Image Type:   AArch64 Linux RAMDisk Image (gzip compressed)
   Data Size:    16208093 Bytes = 15.5 MiB
   Load Address: 00000000
   Entry Point:  00000000
   Verifying Checksum ... OK
## Flattened Device Tree blob at 01f00000
   Booting using the fdt blob at 0x1f00000
   Loading Ramdisk to f4f7a000, end f5eef0dd ... OK
   Loading Device Tree to 00000000f4efd000, end 00000000f4f79fff ... OK

Starting kernel ...

[    3.703315] SError Interrupt on CPU5, code 0xbf000002 -- SError
[    3.703320] CPU: 5 PID: 231 Comm: kworker/5:2 Not tainted 5.10.21-rockchip64 #21.02.3
[    3.703323] Hardware name: Helios64 (DT)
[    3.703326] Workqueue: events deferred_probe_work_func
[    3.703332] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
[    3.703335] pc : rockchip_pcie_rd_conf+0xb0/0x268
[    3.703338] lr : rockchip_pcie_rd_conf+0x1b4/0x268
[    3.703341] sp : ffff80001280b830
[    3.703344] x29: ffff80001280b830 x28: 0000000000000000
[    3.703351] x27: 0000000000000000 x26: 0000000000000000
[    3.703358] x25: 0000000000000000 x24: ffff80001280b974
[    3.703365] x23: ffff0000f542a800 x22: ffff0000f5429b80
[    3.703372] x21: ffff80001280b8b4 x20: 0000000000000004
[    3.703379] x19: 0000000000000000 x18: 0000000000000000
[    3.703385] x17: 0000000000000020 x16: 000000007d755c46
[    3.703392] x15: ffffffffffffffff x14: ffff8000118b9948
[    3.703398] x13: ffff0000448f1a1c x12: ffff0000448f1290
[    3.703405] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
[    3.703412] x9 : 0000000001001d87 x8 : 000000000000ea60
[    3.703419] x7 : ffff80001280b800 x6 : 0000000000000001
[    3.703425] x5 : 0000000000100000 x4 : 0000000000000000
[    3.703432] x3 : 0000000000c00008 x2 : 000000000080000a
[    3.703438] x1 : ffff80001dc00008 x0 : ffff80001a000000
[    3.703446] Kernel panic - not syncing: Asynchronous SError Interrupt
[    3.703450] CPU: 5 PID: 231 Comm: kworker/5:2 Not tainted 5.10.21-rockchip64 #21.02.3
[    3.703453] Hardware name: Helios64 (DT)
[    3.703456] Workqueue: events deferred_probe_work_func
[    3.703460] Call trace:
[    3.703463]  dump_backtrace+0x0/0x200
[    3.703465]  show_stack+0x18/0x68
[    3.703468]  dump_stack+0xcc/0x124
[    3.703471]  panic+0x174/0x374
[    3.703473]  nmi_panic+0x64/0x98
[    3.703476]  arm64_serror_panic+0x74/0x88
[    3.703479]  do_serror+0x38/0x98
[    3.703481]  el1_error+0x84/0x104
[    3.703484]  rockchip_pcie_rd_conf+0xb0/0x268
[    3.703487]  pci_bus_read_config_dword+0x84/0xd8
[    3.703490]  pci_bus_generic_read_dev_vendor_id+0x34/0x1b0
[    3.703493]  pci_bus_read_dev_vendor_id+0x4c/0x70
[    3.703496]  pci_scan_single_device+0x84/0xe0
[    3.703499]  pci_scan_slot+0x38/0x120
[    3.703502]  pci_scan_child_bus_extend+0x58/0x330
[    3.703505]  pci_scan_bridge_extend+0x340/0x5a0
[    3.703508]  pci_scan_child_bus_extend+0x1fc/0x330
[    3.703511]  pci_scan_root_bus_bridge+0xd4/0xf0
[    3.703513]  pci_host_probe+0x18/0xb0
[    3.703516]  rockchip_pcie_probe+0x268/0x478
[    3.703519]  platform_drv_probe+0x54/0xa8
[    3.703521]  really_probe+0xe8/0x4d0
[    3.703524]  driver_probe_device+0xf4/0x160
[    3.703527]  __device_attach_driver+0x8c/0x118
[    3.703530]  bus_for_each_drv+0x7c/0xd0
[    3.703533]  __device_attach+0xe8/0x168
[    3.703535]  device_initial_probe+0x14/0x20
[    3.703538]  bus_probe_device+0x9c/0xa8
[    3.703541]  deferred_probe_work_func+0x88/0xd8
[    3.703544]  process_one_work+0x1ec/0x4d0
[    3.703547]  worker_thread+0x208/0x478
[    3.703549]  kthread+0x140/0x150
[    3.703552]  ret_from_fork+0x10/0x34
[    3.703582] SMP: stopping secondary CPUs
[    3.703585] Kernel Offset: disabled
[    3.703589] CPU features: 0x0240022,6100200c
[    3.703591] Memory Limit: none

 


So on my side, after my latest reinstallation (due to a corrupted OS):

- with the default installation/configuration out of the box, I had one freeze every 24 h

- after switching to "powersave" or "performance" mode, with min CPU frequency = max CPU frequency = either 1.8 GHz or 1.6 GHz: still the same (one freeze per day)

- after switching to "performance" mode, with min CPU frequency = max CPU frequency = 1.4 GHz, it now seems more stable (uptime = 5 days so far)

 

So my gut feeling is really that these issues:

- are mainly related to the cpufreq mechanism

- and are probably related to what has been nicely spotted before (by Vin), the fact that the two clusters have different max frequency ranges (as expected per the specs, but maybe with a corner case in the cpufreq governance)

 

Quote

8:38 root@helios64 ~# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq
1416000
1416000

 

8:38 root@helios64 ~# uptime
 08:38:34 up 5 days, 17:25,  1 user,  load average: 0.00, 0.00, 0.00


On 3/16/2021 at 3:27 AM, ShadowDance said:

@jbergler I recently noticed the armbian-hardware-optimization script for Helios64 changes the IO scheduler to `bfq` for spinning disks; however, for ZFS we should be using `none` because it has its own scheduler. Normally ZFS would change the scheduler itself, but that would only happen if you're using raw disks (not partitions) and if you import the zpool _after_ the hardware optimization script has run.

 

You can try changing it (e.g. `echo none >/sys/block/sda/queue/scheduler`) for each ZFS disk and see if anything changes. I still haven't figured out if this is a cause for any problems, but it's worth a shot.

 

Currently giving this a try. Note that OMV creates a rule to override the scheduler to `bfq` for all rotating disks. Since all 5 of the HDDs are participating in my ZFS pool, I simply changed `bfq` to `none` in `/etc/udev/rules.d/99-openmediavault-scheduler.rules`, then ran the following to apply:

 

$ sudo udevadm control --reload-rules
$ sudo udevadm trigger --type=devices --action=change
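After reloading the rules, it's worth verifying that the change actually landed; the active scheduler is shown in brackets. A small sketch, assuming the five pool disks sit at sda through sde as in my setup:

```shell
# Print the scheduler line for each pool disk; ZFS members should show [none].
for d in /sys/block/sd[a-e]; do
    echo "$d: $(cat "$d/queue/scheduler")"
done
```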

 


Hi,

 

I can confirm having the same problems with CPU freezes.

 

Initially I fixed the problem by setting min CPU frequency = max CPU frequency = 1.4 GHz. This configuration ran very solidly.

A few days ago I reinstalled my system (due to a different problem) and left the CPU settings at default. I'm now running kernel 5.10.63-rockchip64. Yesterday I noticed that the system was unresponsive again. Actually the red fault LED was blinking, which I cannot remember it doing before. As documented, it was set to trigger on (kernel) panic.

 

Now I have set the max CPU frequency to 1.4 GHz, leaving the min CPU frequency at 400 MHz (performance mode). Let's see how this goes...

 

Meanwhile, I'm wondering if anyone has heard of an actual fix. Could it actually be a problem with my hardware, or is it purely software? And if it is, why are there so few complaints? I mean, this system is designed to run 24/7, isn't it? I don't want to complain about the Helios64, I'm just trying to understand if there is anything I'm missing to get around this problem.

Many Thanks,
Julius


Same here; the instability of my Helios64, combined with Armbian not having a test suite for it (and thus breaking it at any point), led me to splurge on hardware that cost 4x as much. A NAS should be out of sight and out of mind, not a constant source of worry.


Since I got my Helios64, I have had one hardware problem (the voltage on some disks dropped too low; it ended with a motherboard replacement).

But I have had many panics/freezes since the first days.

 

I was never able to go 10 days without problems until I read this topic.

Last month I decided to set the frequency policy as described here:

root@helios64:~# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
1416000
1416000
root@helios64:~# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq
1416000
1416000
root@helios64:~# cat /sys/devices/system/cpu/cpufreq/policy*/cpuinfo_cur_freq
1416000
1416000


And I have a much more stable system:

root@helios64:~# uptime
 18:35:42 up 30 days,  1:34,  1 user,  load average: 0.01, 0.03, 0.00
root@helios64:~# uname -a
Linux helios64 5.10.60-rockchip64 #21.08.1 SMP PREEMPT Wed Aug 25 18:56:55 UTC 2021 aarch64 GNU/Linux
root@helios64:~# cat /etc/issue
Armbian 21.08.1 Buster \l
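To make a pinned frequency like this survive reboots on Armbian, the usual route is `armbian-config` (System settings), which writes `/etc/default/cpufrequtils`. A sketch of what the resulting file would look like for the 1416000 kHz setting shown above (the governor choice barely matters once min = max, so "performance" here is just one option):

```shell
# /etc/default/cpufrequtils -- read at boot by Armbian's cpufrequtils service.
# 1416000 kHz = 1.416 GHz, the highest OPP common to both RK3399 clusters.
ENABLE=true
GOVERNOR=performance
MIN_SPEED=1416000
MAX_SPEED=1416000
```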

 

 
