
Helios64 - Armbian 23.08 Bookworm issues (solved)


ebin-dev


Hi,

I have not changed the millivolt values because I am afraid of making a mistake.

I have only been setting different frequency values.

For the moment I am at 400-1200 MHz and I do not lose the connection...

Can you explain the command to change the millivolt setting, in case I find the courage (or the craziness) to try this idea in the future?

I also have a question: I installed my 6.6.27 kernel .deb, which I built myself with the official framework, then rebooted and ran my test pattern 3 times without a crash...

Today I rebooted and lost the connection approximately 10 min after boot when I set the maximum above 1400 MHz... The millivolt values worked before the reboot, so why not after, when the setting has not changed?

 

Now I did:

git clone --depth=1 --branch=main https://github.com/armbian/build ; cd build/

./compile.sh kernel BOARD=helios64 BRANCH=current

dpkg -i *.deb to install kernel 6.6.28 (4 packages)

reboot

and... uptime 18 min... no lost connection! This is driving me crazy!

 

 

Edited by BipBip1981

@ebin-dev nearly, I do not change the max value of the voltage (only the min and the central value):

change opp-microvolt = <0xc96a8 0xc96a8 0x1312d0>; to opp-microvolt = <0xdbba0 0xdbba0 0x1312d0>;

It may be possible to increase the max value as well, but I don't know whether that is safe or how to find out.

Note that in the edited dts (be it via armbian-config or otherwise) you can replace the hex numbers with decimals, i.e. you can write:

opp-microvolt = <900000 900000 0x1312d0>;

It is way easier than computing the hex of the initial voltage with 75mV added.
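
For anyone who wants to double-check the numbers, here is a minimal Python sketch of the arithmetic described above. The helper names `bump_microvolt` and `as_dts` are made up for illustration; the only values taken from this thread are the 0xc96a8 starting voltage, the untouched 0x1312d0 maximum and the 75mV increase.

```python
def bump_microvolt(triplet, delta_uv=75_000):
    """Raise the first two cells of an opp-microvolt triplet, keep the third.

    The post above calls the first two cells the "min" and "central"
    values and leaves the last one (the max) untouched.
    """
    first, second, vmax = triplet
    bumped = (first + delta_uv, second + delta_uv, vmax)
    # Assumption: never go above the existing max cell, since the thread
    # does not establish whether raising it is safe.
    if max(bumped[:2]) > vmax:
        raise ValueError("bumped voltage exceeds the max cell")
    return bumped


def as_dts(triplet, hexadecimal=False):
    """Format a triplet as a dts property line, in decimal or in hex."""
    fmt = "0x{:x}" if hexadecimal else "{}"
    cells = " ".join(fmt.format(v) for v in triplet)
    return f"opp-microvolt = <{cells}>;"


original = (0xC96A8, 0xC96A8, 0x1312D0)  # 825000 825000 1250000 microvolts
print(as_dts(bump_microvolt(original)))
print(as_dts(bump_microvolt(original), hexadecimal=True))
```

The two printed lines are the decimal and hex spellings of the same property, so either can be pasted into the dts.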

 

 

@BipBip1981 the best approach is to have a reproducible way to trigger the crash. Then you can tell when the issue is gone.

My test case is https://gist.github.com/prahal/316111da0a9b8cc0d0791d26659dc682

If you can run it without a crash with any kernel, that is news to me. (I believe I even got the first Linux 4.4 helios64 kernel to crash with this test case.)

With this patch, which increases the min and "central" voltage (I believe the requested voltage) by 75mV, I cannot get my above test case to crash the helios64 (mind that there are other helios64 crashers, so it is best to run the test case in systemd emergency mode, but I managed to run it 100 times in "full" session mode):

EDIT: this patch is incomplete: since then I have added opp-00 and opp-01 with the same values as opp-02 (i.e. 900000 and the appropriate frequencies)

diff --git a/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts b/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts
index 77844650e2fe..34d94e4d6ada 100644
--- a/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts
+++ b/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts
@@ -1160,10 +1160,36 @@ &cluster0_opp {
        /delete-node/ opp06;
 };
 
 &cluster1_opp {
        /delete-node/ opp08;
+
+       /delete-node/ opp02;
+       /delete-node/ opp03;
+       /delete-node/ opp04;
+       /delete-node/ opp05;
+       /delete-node/ opp06;
+       opp02 {
+               opp-hz = /bits/ 64 <816000000>;
+               opp-microvolt = <900000 900000 1250000>;
+       };
+       opp03 {
+               opp-hz = /bits/ 64 <1008000000>;
+               opp-microvolt = <950000 950000 1250000>;
+       };
+       opp04 {
+               opp-hz = /bits/ 64 <1200000000>;
+               opp-microvolt = <1025000 1025000 1250000>;
+       };
+       opp05 {
+               opp-hz = /bits/ 64 <1416000000>;
+               opp-microvolt = <1100000 1100000 1250000>;
+       };
+       opp06 {
+               opp-hz = /bits/ 64 <1608000000>;
+               opp-microvolt = <1175000 1175000 1250000>;
+       };
 };
 &cpu_thermal {
        trips {
                cpu_warm: cpu_warm {

 

Mind that this patch will not apply as is (the cpu_thermal context comes from another patch of mine), but it gives you an idea of what you should write.

 

 

 

 

Also, you should take into account that crashes might be related to the load, or to the speed of transitions in the load. So a kernel version might help, but it will merely hide the crash or make it less frequent. That is not even a workaround. There might still be a kernel bug that only affects helios64, but it is unlikely.

I think my helios64 has always had these crashes (even on the first boot after I assembled the box) because I have an mdadm raid10 with ext4 setup. The raid10 stresses the board (and especially the big CPUs).

If you could try my stress test with your stable kernel, that would help determine whether this kernel is really stable with regard to the big CPUs.

Mind that even with this cpu-b 75mV workaround I still get crashes on my board, but not with my test case, and far less often. I don't yet have a test case for these remaining crashes, nor do I know what triggers them.

Also, the fact that upping the voltage by 75mV works around crashes when cpufreq switches the big CPUs does not mean it fixes the root cause. I am not able to analyze the schematic on my own. We would need someone to do so to get a clearer idea of why this helps and why it could be required.

Finally, the rk3399 is said to be very robust. So it could be that it sometimes works with invalid voltages, but not all the time.

Edited by prahal
note that the patch is incomplete - ie missing opp-00 and opp-01

Hi,

This morning I did a cold boot and it crashed after a few minutes (less than 10 min...)

The boot is OK, but I lose the SSH connection, and the same happens with the USB serial console: I get the login and password prompts, but then I am blocked... (I can't tell whether the problem is hardware, systemd, software...)

I used the reset button, and after 15 min of uptime I tried the same thing and did not lose the connection (I don't understand why; I did nothing and the Helios64 seems OK...)

I ran this (3 times):

 

root@helios64:/tmp# ./cpufreq-switching
allocated 64MB
test: toggle freq before write
99/100  
test: toggle freq before read
9/10, 99/100  
root@helios64:/tmp# ./cpufreq-switching
allocated 64MB
test: toggle freq before write
99/100  
test: toggle freq before read
9/10, 99/100  
root@helios64:/tmp# ./cpufreq-switching
allocated 64MB
test: toggle freq before write
99/100  
test: toggle freq before read
9/10, 99/100  
root@helios64:/tmp#

 

No crash/freeze, and during this test a Samba Time Machine backup was running with a lot of network I/O; about 1 GB of data was transferred from my Mac to the Helios.

I did not tune the voltage; I just use the 6.6.28 kernel and my standard configuration...

I ran your program again, once more with a Time Machine backup (Samba share) running in the background:

 

| | | | ___| (_) ___  ___ / /_ | || |  
| |_| |/ _ \ | |/ _ \/ __| '_ \| || |_
|  _  |  __/ | | (_) \__ \ (_) |__   _|
|_| |_|\___|_|_|\___/|___/\___/   |_|  
                                       
Welcome to Armbian-unofficial 24.5.0-trunk Bookworm with Linux 6.6.28-current-rockchip64

No end-user support: built from trunk

System load:   86%               Up time:       27 min    Local users:   2                
Memory usage:  11% of 3.77G      IP:           10.0.0.155
CPU temp:      41°C               Usage of /:    47% of 14G        
RX today:      1.5 GiB      

[ General system configuration (beta): armbian-config ]

Web console: https://helios64:9090/

You have no mail.
helios64@helios64:~$ su -
Mot de passe :
root@helios64:~# cd /tmp/
root@helios64:/tmp# uptime ; ./cpufreq-switching ; uptime
 06:34:39 up 28 min,  3 users,  load average: 5.87, 4.75, 3.61
allocated 64MB
test: toggle freq before write
99/100  
test: toggle freq before read
9/10, 99/100  
 06:36:12 up 29 min,  3 users,  load average: 4.99, 4.71, 3.70
root@helios64:/tmp#

 

No problem.

To conclude for the moment, on my side: 6.6.28 is stable, but not on cold boot... it is stable afterwards.

Something (hardware or software) crashes or misbehaves on cold boot... and after pressing the reset button everything is OK...

It's crazy, I know! (A possible cause of the cold-boot problem is that the /var/log ramlog is full... I just saw that it is full... I will investigate next week...)

If you want, next week I can build a vanilla Armbian from source with the official framework and run your cpufreq-switching on it. I think I will get the same result as today with my standard configuration, but maybe without the crash at cold boot.

If you read my message history about the Helios64 over the last 3 years or so... it has never been stable with the standard parameters.

I did many tests and changed kernels, and to this day the best kernel I have ever used is 6.6.27 and above, because it does not crash at the standard frequencies with the schedutil governor.

The worst kernels were in the 6.X branch; with those kernels the Helios often crashed just when I unlocked my raid10 with LUKS cryptsetup.

And 5.15.(something)69, or one just before it, was the most stable kernel with 400-1400 MHz schedutil (I wrote about this in a very old post).

I tried your program again, now that the Time Machine backup has finished...

 

root@helios64:/tmp# uptime ; ./cpufreq-switching ; uptime
 06:51:25 up 44 min,  3 users,  load average: 3.98, 4.58, 4.24
allocated 64MB
test: toggle freq before write
99/100  
test: toggle freq before read
9/10, 99/100  
 06:53:01 up 46 min,  3 users,  load average: 3.03, 4.09, 4.10
root@helios64:/tmp#

 

No crash. Now I am going to the office and will then spend the weekend with my family; I will be in touch next week.

During this weekend, I am running a script on my helios64 that does the following in a loop:

echo check > /sys/block/md0/md/sync_action

and:

btrfs check --readonly  --check-data-csum  --progress /dev/disk/by-uuid/1d4e2c84-1c43-4d73-8acb-XXXXXXXXXXXXXX

 

If my Helios64 has not crashed/frozen when I am back on Monday morning, then for me 6.6.28 is a good kernel.

 

Have a good day

 

 

Edited by BipBip1981

13 hours ago, prahal said:

I do not change the max value of the voltage (only the min and the central value):

change opp-microvolt = <0xc96a8 0xc96a8 0x1312d0>; to opp-microvolt = <0xdbba0 0xdbba0 0x1312d0>

 

OK - I am currently testing your opp-table-1 values (added the missing ones):

 

        opp-table-1 {
                compatible = "operating-points-v2";
                opp-shared;
                phandle = <0x0d>;

                opp00 {
                        opp-hz = <0x00 0x18519600>;
                        opp-microvolt = <0xdbba0 0xdbba0 0x1312d0>;
                        clock-latency-ns = <0x9c40>;
                };

                opp01 {
                        opp-hz = <0x00 0x23c34600>;
                        opp-microvolt = <0xdbba0 0xdbba0 0x1312d0>;
                };

                opp02 {
                        opp-hz = <0x00 0x30a32c00>;
                        opp-microvolt = <0xdbba0 0xdbba0 0x1312d0>;
                };

                opp03 {
                        opp-hz = <0x00 0x3c14dc00>;
                        opp-microvolt = <0xe7ef0 0xe7ef0 0x1312d0>;
                };

                opp04 {
                        opp-hz = <0x00 0x47868c00>;
                        opp-microvolt = <0xfa3e8 0xfa3e8 0x1312d0>;
                };

                opp05 {
                        opp-hz = <0x00 0x54667200>;
                        opp-microvolt = <0x10c8e0 0x10c8e0 0x1312d0>;
                };

                opp06 {
                        opp-hz = <0x00 0x5fd82200>;
                        opp-microvolt = <0x11edd8 0x11edd8 0x1312d0>;
                };

                opp07 {
                        opp-hz = <0x00 0x6b49d200>;
                        opp-microvolt = <0x1312d0 0x1312d0 0x1312d0>;
                };
        };
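
To make the hex cells in this listing easier to read, here is a small Python sketch that decodes them into MHz and mV. The cell values are copied from the opp-table-1 node above; the one-line text layout and the `decode` helper are only illustrative. Note that opp-hz is a 64-bit value written as two 32-bit cells (high word first), which is why each entry starts with 0x00.

```python
import re

# Cells copied from the opp-table-1 listing: name, opp-hz, opp-microvolt.
OPP_TABLE = """
opp00 <0x00 0x18519600> <0xdbba0 0xdbba0 0x1312d0>
opp01 <0x00 0x23c34600> <0xdbba0 0xdbba0 0x1312d0>
opp02 <0x00 0x30a32c00> <0xdbba0 0xdbba0 0x1312d0>
opp03 <0x00 0x3c14dc00> <0xe7ef0 0xe7ef0 0x1312d0>
opp04 <0x00 0x47868c00> <0xfa3e8 0xfa3e8 0x1312d0>
opp05 <0x00 0x54667200> <0x10c8e0 0x10c8e0 0x1312d0>
opp06 <0x00 0x5fd82200> <0x11edd8 0x11edd8 0x1312d0>
opp07 <0x00 0x6b49d200> <0x1312d0 0x1312d0 0x1312d0>
"""


def decode(line):
    """Turn one line of OPP_TABLE into (name, Hz, (microvolt cells))."""
    name, rest = line.split(maxsplit=1)
    cells = [int(c, 16) for c in re.findall(r"0x[0-9a-fA-F]+", rest)]
    hi, lo, *volts = cells
    hz = (hi << 32) | lo  # join the two 32-bit cells into one 64-bit value
    return name, hz, tuple(volts)


rows = [decode(line) for line in OPP_TABLE.strip().splitlines()]
for name, hz, volts in rows:
    millivolts = " / ".join(f"{v / 1000:g} mV" for v in volts)
    print(f"{name}: {hz // 1_000_000:>4} MHz  {millivolts}")
```

Read this way, opp00 to opp02 all sit at 900 mV, the steps from opp03 to opp06 are each 75 mV above the values being replaced in the patch earlier in the thread, and opp07 stays at the 1250 mV maximum.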

 

2 hours ago, Trillien said:

I notice an error at the beginning of linux boot.

"mdadm: initramfs boot message: /scripts/local-bottom/mdadm: rm: not found"

 

can't confirm (my system is now based on Armbian 23.5.4):

 

# cat /var/log/syslog | grep mdadm
# 

 

Edited by ebin-dev

17 hours ago, BipBip1981 said:

root@helios64:/tmp# ./cpufreq-switching
allocated 64MB
test: toggle freq before write
99/100  
test: toggle freq before read
9/10, 99/100  

could you try my older test case code:

 

It turns out I did not compile my test case anew before pasting it to the GitHub gist, and the new one I pasted there may not be testing what I expected (I may have changed it to test CPU frequency changes from max to min instead of each step).

Mind that for my tests I use a binary of the test case that I built long ago, which is the one in the link above. I did not feel that sharing a binary test case was a good idea; I prefer that you be able to audit the code (or have someone audit it for you). I did not have much time to devote to sharing my findings, so I checked that the source was fine, but not whether the test was the same as the one I use on my side to stress test the big CPUs. Sorry.

It seems normal to me that the test case I shared with you works fine, since, as far as I know, 1.8GHz at 1.2V and 408MHz at 825mV are pretty stable. They could crash, I am not sure, but it would take more than 50 runs of the test for that to happen (at least it took 80 runs for 600MHz to fail at 825mV).

Mind that you should do at least 5 runs of the above test case to be somewhat confident that you cannot get the cpu b to crash. The fact that it does not crash is not the point of the test. Its usefulness is that it nearly always crashes the big CPU on the first run.
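
A minimal Python sketch of how the repeated runs could be automated. This is not part of the test case itself: the function name `run_repeatedly` and the default log file name are made up for illustration. The idea is to flush the run number to disk before each run, so that after a hard freeze or reset the log still shows which run was in progress. For that to work the log has to live on persistent storage (an assumption about your setup: /tmp and a ramlog-backed /var/log may not survive a reset).

```python
import os
import subprocess


def run_repeatedly(cmd, runs=5, log_path="cpufreq-runs.log"):
    """Run cmd up to `runs` times and return how many runs exited with status 0.

    Stops at the first run that exits with a non-zero status.
    """
    completed = 0
    with open(log_path, "a") as log:
        for i in range(1, runs + 1):
            # Write and sync *before* starting, so a freeze leaves a trace.
            log.write(f"starting run {i}/{runs}\n")
            log.flush()
            os.fsync(log.fileno())

            result = subprocess.run(cmd)

            log.write(f"run {i} exited with status {result.returncode}\n")
            log.flush()
            os.fsync(log.fileno())
            if result.returncode != 0:
                break
            completed += 1
    return completed


# Example call (path to the compiled test case is hypothetical):
# run_repeatedly(["./cpufreq-switching"], runs=5, log_path="/root/cpufreq-runs.log")
```

If the board freezes, the last "starting run" line without a matching "exited" line tells you on which run it happened.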

 

EDIT: the previous gist I gave you as a test case was my v1. The current test case is https://gist.github.com/prahal/8fab73325eb0d7091ad7c4627bf8e25a which is in the other thread I linked in this comment.

Edited by prahal
tell why the first test case code was not the correct test case for the current issue

5 hours ago, ebin-dev said:

can't confirm (my system is now based on Armbian 23.5.4):

 

# cat /var/log/syslog | grep mdadm
# 

 

 

@ebin-dev I believe initramfs messages are not written to syslog.

@Trillien do you see that message on the serial console?

/usr/share/initramfs-tools/scripts/local-bottom/mdadm is part of the mdadm package, which is packaged by Debian. See "dpkg -S /usr/share/initramfs-tools/scripts/local-bottom/mdadm" and "apt policy mdadm".

Though it could be that the generated initramfs lacking /bin/rm is Armbian specific. You might want to open a bug against Armbian, or at least open a topic in the forum. But this is nothing helios64 specific as far as I know. It could even be a Debian bug.

I don't even know if we ought to fix this missing /bin/rm for mdadm at the board level, even as a workaround.

 


Hi @prahal,
Yes, for a few weeks I have been in the habit of using the serial console to better understand whether the boot crashes or gets stuck at some step (I have a very long network configuration step with systemd-networkd).

The message appears after U-Boot hands over to Linux, and is among the very first lines, where Linux checks the hard drive filesystem.


Hi,

For the moment, no crash with my test pattern.

This evening I will run cpufreq-switching-2-b and post the result.

I'll keep in touch.

 

root@helios64:~#  btrfs check --readonly  --check-data-csum  --progress /dev/disk/by-uuid/1d4e2c84-1c43-4d73-8acb-14d5a7aa1c4d
Opening filesystem to check...
Checking filesystem on /dev/disk/by-uuid/1d4e2c84-1c43-4d73-8acb-14d5a7aa1c4d
UUID: 1d4e2c84-1c43-4d73-8acb-14d5a7aa1c4d
[1/7] checking root items                      (0:04:20 elapsed, 6258640 items checked)
[2/7] checking extents                         (0:22:26 elapsed, 613032 items checked)
[3/7] checking free space cache                (0:08:50 elapsed, 5519 items checked)
[4/7] checking fs roots                        (8:40:20 elapsed, 169002 items checked)
[5/7] checking csums against data              (35:54:03 elapsed, 2619191 items checked)

 

helios64@helios64:~$ cat /proc/mdstat
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
md0 : active raid10 sdd1[0] sdc1[5] sde1[4] sda1[2]
      15627581440 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      [=>...................]  check =  5.2% (821480320/15627581440) finish=1681.3min speed=146770K/sec
      bitmap: 0/117 pages [0KB], 65536KB chunk

unused devices: <none>

 

helios64@helios64:~$ uptime
 11:05:34 up 1 day, 22:02,  3 users,  load average: 2,98, 2,82, 2,78

Edited by BipBip1981

Hi, I used cpufreq-switching-2-b and... full crash and automatic reboot after less than 15 s...

After the reboot, no crash on the second try, run just after the reboot...

For the moment, my test pattern does not crash, but cpufreq-switching-2-b continues to cause crashes.

I will test the 2 versions from here soon:

https://gist.github.com/prahal

Edited by BipBip1981

14 minutes ago, magostinelli said:

Any update regarding the new kernel? I'm on 6.6.8-edge, and every time I power up the helios64 I must reset it, because it doesn't start on the first attempt.

 

I am using 6.6.8 with the modified dtb (cache awareness, emmc with hs400, 75 mV bump for the big cores). It works perfectly fine. For your convenience I attach the dtb.

Since Helios64 is used throughout the day by several people - time for tests is very limited. Everybody is invited to test kernels and provide feedback here.

rk3399-kobol-helios64.dtb-6.6.8-L2-hs400-opp


37 minutes ago, prahal said:

@ebin-dev note that for a few days, I have upped the cpub opp3 and above to all 1.2V. I still had the box crash around once a day with 75mV.

 

But isn't that quite an improvement for your box ? Your latest suggestion (voltage increase of 75mV for the big cores) had a very positive effect !

It seems that also my helios benefits from that (but too early to be sure about it).

Edited by ebin-dev

21 hours ago, ebin-dev said:

But isn't that quite an improvement for your box ? Your latest suggestion (voltage increase of 75mV for the big cores) had a very positive effect !

It seems that also my helios benefits from that (but too early to be sure about it).

 

TLDR; yes upping 75mV helps drastically, but is not enough at least for all frequencies.

 

Indeed, before upping by 75mV I could not boot most of the time (only "emergency" mode boot was reliable, i.e. no raid10 and services off).

But it seems 75mV is not always enough to compensate for the underlying issue.

The thing is, I don't know what root issue the 75mV increase works around. It could be that 100mV is enough, but 75mV is a value based on testing, not one derived from a theory (the proper value could be higher, or upping the voltage may only help cope with voltage drops, making them fall less often below the level where cpub crashes).

The datasheet for the cpub regulator requires a bigger capacitor on the voltage input than the one on the helios64. The weird thing is that most rk3399 boards use the same weak, below-spec capacitor value at this place.

At my level (with little or no understanding of the hardware interactions), the next step would be to test whether my test case also crashes these other boards with the same undersized vin capacitor... If they crash, we could guess that the design is bad and that without a bigger capacitor the regulator cannot reliably deliver the voltage for cpub.

It could be that we can work around this in software, but I am not qualified to tell, at least at this point (I have read about how these components work, but I am not an expert).

 

Mind also that I will be testing the board far less for the time being: now that it is quite reliable, I have started using it again (it had been down for months, then I extracted the motherboard to test with the least complex setup possible, in emergency mode).

NB: upping the voltage makes the CPU hotter, so you might want to check the temperature values (with "sensors"). Mine were fine, way below the throttling temp of 80°C for the rk3399, even with all of opp3 and above at 1.2V. The concern seems to be mostly about keeping power consumption low, but I wonder whether it has a noticeable effect on helios64 power consumption.
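
As an alternative to reading "sensors" by hand, here is a minimal Python sketch that reads the kernel's thermal zones directly. It relies on the standard Linux sysfs layout (/sys/class/thermal/thermal_zone*/temp, in millidegrees Celsius); which zones exist on a given board is not something this thread states, so the script simply lists whatever it finds. The 80°C limit is the throttling value quoted above, and the function names are illustrative.

```python
from pathlib import Path

THROTTLE_C = 80.0  # throttling temperature quoted in the post above


def read_zone_temps(base="/sys/class/thermal"):
    """Return {zone type: degrees Celsius} for every thermal zone under base."""
    temps = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone missing or unreadable: skip it
        temps[kind] = millideg / 1000.0
    return temps


def headroom(temps, limit_c=THROTTLE_C):
    """Degrees left before the hottest zone reaches limit_c (None if no zones)."""
    return limit_c - max(temps.values()) if temps else None


temps = read_zone_temps()
for kind, deg in temps.items():
    print(f"{kind}: {deg:.1f} °C")
print("headroom:", headroom(temps))
```

Running it once at idle and once during a stress run gives a rough idea of how much margin the voltage bump leaves.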


On 4/27/2024 at 9:06 PM, prahal said:

TLDR; yes upping 75mV helps drastically, but is not enough at least for all frequencies.

 

There are only a few frequencies involved - at least on my board (see the transition tables in my other message in the parallel thread).

Upping the voltage by 75mV - as you suggested - helped my board get rid of the few remaining occasional issues during the boot process!

I am measuring the combined power consumption of Helios64 and two 2.5G switches (together 12.66W idle) - it does not seem to be affected at all by that small change.

Temperature reported by the Armbian welcome screen is 44 °C (ambient temp 19°C in the basement).

 

Is the remaining crash - once a day on your Helios64 - positively affected by upping the voltage to 1.2V for all states ?

If not it may be caused by something else.

 

Edit: I am currently testing 6.6.29 using a modified dtb (L2, hs400, opp up by 75mV)

Edited by ebin-dev

On 4/26/2024 at 9:42 PM, magostinelli said:

Any update regarding the new kernel? I'm on 6.6.8-edge, and every time I power up the helios64 I must reset it, because it doesn't start on the first attempt.

 

Linux 6.6.29 is stable on my system so far when used with the modified dtb (in particular applying the 75mV bump for the big cores, dtb attached below). 

Could you give it a try ?

rk3399-kobol-helios64.dtb-6.6.29-L2-hs400-opp


I am using the default values (after various tests).

 

# cat /etc/default/cpufrequtils 
ENABLE=true
MIN_SPEED=408000
MAX_SPEED=1800000
GOVERNOR=ondemand

 

Edit: 

I would love to use schedutil but low priority tasks take three times as long with it to complete compared to ondemand (!!) (I tested that with a complete file scan of my nextcloud installation).

There was a lot of discussion going on in armbian/build on GitHub. In the end the Armbian default governor was set to schedutil again instead of ondemand. But the schedutil parameters need to be fine-tuned so that EAS works correctly, and from my own observations I can confirm that this is clearly not (yet) the case, at least for the rk3399.

 

Anyway - you can set min_speed to 600000 or 816000 for all cores (ondemand): it does not affect power consumption at all, but it renders your Helios64 more responsive.

And disable armbian-hardware-optimize - it degrades performance.
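
For illustration, a small Python sketch of the min_speed change suggested above, applied to the text of /etc/default/cpufrequtils. The values in that file are in kHz (408000 is the 408 MHz step). The helper names are made up, and the sketch only works on a string; writing the result back to the file and restarting the cpufrequtils service is left out on purpose.

```python
def parse_defaults(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def with_min_speed(text, khz):
    """Return text with the MIN_SPEED line replaced, other lines untouched."""
    out = []
    for line in text.splitlines():
        if line.strip().startswith("MIN_SPEED="):
            line = f"MIN_SPEED={khz}"
        out.append(line)
    return "\n".join(out) + "\n"


# Content of the file as posted above.
DEFAULTS = """ENABLE=true
MIN_SPEED=408000
MAX_SPEED=1800000
GOVERNOR=ondemand
"""

cfg = parse_defaults(with_min_speed(DEFAULTS, 816000))
print(cfg["GOVERNOR"], int(cfg["MIN_SPEED"]) // 1000, "-", int(cfg["MAX_SPEED"]) // 1000, "MHz")
```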

Edited by ebin-dev