Helios64 u-boot does not build anymore after we bumped to 2022.07


Posted
On 10/26/2023 at 3:17 AM, BinaryWaves said:

Getting this in the serial output:

Unknown command 'kaslrseed' - try 'help'

 

There was a commit regarding this file and rockchip here:

https://github.com/armbian/build/pull/4352

 

 

Is this a regression bug? And is this comment fatal or just some warning message?

This is totally harmless.

This is explained in the link you gave "If the kaslrseed command hasn't been compiled in to u-boot, it gracefully skips generating the kASLR".

Posted
On 10/30/2023 at 2:26 PM, ebin-dev said:

My system also had the "free() invalid pointer" issue and I repaired it by flashing a new bootloader (linux-u-boot-edge-helios64_22.02.1_arm64 , as discussed here).

All Helios64 systems that have been flashed with the Armbian u-boot since it switched to fully mainline u-boot have this issue. The only ways to avoid it are to never have flashed the fully mainlined u-boot (the one without the rockchip DDR blob) or to never stress the RAM.

I have not yet sent the PR with the workaround (which is to build mainline u-boot with the rockchip DDR blob).

 

Mind that this is only a workaround. The upstream u-boot code that configures the LPDDR4 settings is what should be fixed.

One step would be to find out whether other rk3399 boards with LPDDR4 (Nanopi M4 v2, Rock Pi 4, Orange Pi 4, etc.) are affected too. If you own such a board, it would be great to check whether you can reproduce the issue with my test case and a similar u-boot.

 

Posted (edited)
On 10/27/2023 at 9:07 PM, ebin-dev said:

The only remaining issue is: while the heartbeat LED starts to operate, the red LEDs on the front panel briefly light up (sata1 to sata5, bus rescan) and the fans spin up for a few seconds , then turn to normal operation.

 

Could this be u-boot related ? Would you have an idea ? (see the parallel thread)

 

Edit: I was wrong. First of all, the import of the upstream Linux dts into Armbian was done for 6.3, not 6.1.

Quote

Would you like to look at the remaining glitch that I observe with linux 6.1.60 during boot (using a spare sd with bookworm on it) ? The sata bus is rescanned during boot and the red sata 1-5 LEDs flash one after the other at the time when the heartbeat LED starts to blink. This was not the case with linux 6.1.36.

 

Then checking the new armbian helios64 patchset it does remove this code to setup the sata power lines https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts?id=8169b9894dbd2d4e440cfbc5fe9f733e5876a564

 

I would have to investigate why the sata lines are flashing at kernel startup.

 

Quote

To exclude that this is u-boot related: which version of u-boot do you use (on sd/emmc) ? (A stock image ?)

I do have these flashing leds.

I have U-Boot 2022.07-armbian (Jul 21 2023 - 02:01:45 +0000).

Edited by prahal
I was wrong
Posted (edited)

@prahal Thank you for the hints! I would like to test the modified u-boot. Would you send a link (pm) so that I can try ?

 

My setup is now stable as discussed in the parallel thread. I had to go back to a linux kernel (5.10.43) using the realtek r8152 driver v2.14.0 (2020/09/24) instead of the mainline driver and to downgrade the boot loader to linux-u-boot-edge-helios64_22.02.1_arm64 on emmc  (no flashing LEDs).

 

As a next step I plan to compile and test a kernel based on LTS 6.6.x including the code to setup the sata power lines and a working version of the Realtek driver r8152 . The mainline version of that driver is still under heavy development ...

 

 

Edited by ebin-dev
Posted
On 11/18/2023 at 12:06 PM, prahal said:

Then checking the new armbian helios64 patchset it does remove this code to setup the sata power lines https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts?id=8169b9894dbd2d4e440cfbc5fe9f733e5876a564

 

I would have to investigate why the sata lines are flashing at kernel startup.

 

The helios64 patch to setup the sata power lines is already mainlined (line 92 ff). So it remains unclear why the sata LEDs are flashing at kernel startup.

Posted

Apologies it's been a while. I have been busy and haven't had a lot of time to mess with my helios. I had gotten it to a stable point with updates and then kernel headers, etc, and then froze.

 

I just did an armbian update and it installed updates but now it won't boot :\ boot.log

Posted (edited)
On 12/3/2023 at 11:43 PM, BinaryWaves said:

I just did an armbian update and it installed updates but now it won't boot

 

As nobody maintains helios64, installing Armbian updates is like playing russian roulette (even worse than that).

 

In the parallel thread I tested various combinations of OS, bootloader and kernel and ended up with this configuration (adding linux 5.15.52 to the list).

Edited by ebin-dev
Posted (edited)
On 12/3/2023 at 11:43 PM, BinaryWaves said:

I had gotten it to a stable point with updates and then kernel headers, etc, and then froze.

 

I just did an armbian update and it installed updates but now it won't boot 😕

switch to partitions #0, OK
mmc1 is current device
Scanning mmc 1:1...
Found U-Boot script /boot/boot.scr
3185 bytes read in 6 ms (517.6 KiB/s)
## Executing script at 00500000
Boot script loaded from mmc 1
166 bytes read in 4 ms (40 KiB/s)
14541965 bytes read in 620 ms (22.4 MiB/s)
Failed to load '/boot/Image'
86896 bytes read in 14 ms (5.9 MiB/s)
2698 bytes read in 10 ms (262.7 KiB/s)
Applying kernel provided DT fixup script (rockchip-fixup.scr)
## Executing script at 09000000
Bad Linux ARM64 Image magic!
SCRIPT FAILED: continuing...

 

To me this looks like a file required by u-boot got corrupted during your "freeze".

If mmc1 is an SD card, you could mount it from a computer and check that /boot/armbianEnv.txt, the kernel image and the initrd are present in /boot.

If the kernel or initrd is the issue (due, I suppose, to your freeze during the upgrade), the best option would be to chroot into the SD card and reinstall the linux-image package.
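
The "Bad Linux ARM64 Image magic!" line in the log above can be checked offline once the card is mounted on another computer. A minimal sketch, assuming the card's root filesystem is mounted at the hypothetical path /mnt/sd: an arm64 kernel Image carries the four bytes "ARM\x64" at byte offset 56 of its header, so a truncated or overwritten file will usually fail this test. The function name check_arm64_magic is made up for illustration.

```sh
# check_arm64_magic FILE
# Prints "ok" if FILE has the arm64 Image magic (bytes 41 52 4d 64, "ARM\x64")
# at byte offset 56, otherwise prints "bad" (also for missing or short files).
check_arm64_magic() {
    magic=$(dd if="$1" bs=1 skip=56 count=4 2>/dev/null | od -An -tx1 | tr -d ' \n') || magic=""
    if [ "$magic" = "41524d64" ]; then
        echo ok
    else
        echo bad
    fi
}

# Example (hypothetical mount point, adjust to your system):
# check_arm64_magic /mnt/sd/boot/Image
```

This only looks at the header, so "ok" does not prove the rest of the file is intact; "bad" is a strong hint that the image needs to be reinstalled.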

 

An example of a valid /boot/armbianEnv.txt:
 

verbosity=7
bootlogo=false
overlay_prefix=rockchip
rootdev=UUID=a79a14c0-3cf4-4fb9-a6c6-838571351371
rootfstype=ext4
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u,0x0bc2:0x231a:u,0x1058:0x2621:u

Note that the rootdev UUID will vary, as may the usbstoragequirks.

Edited by prahal
Posted
On 11/22/2023 at 1:01 PM, ebin-dev said:
On 11/18/2023 at 12:06 PM, prahal said:

Then checking the new armbian helios64 patchset it does remove this code to setup the sata power lines https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts?id=8169b9894dbd2d4e440cfbc5fe9f733e5876a564

 

I would have to investigate why the sata lines are flashing at kernel startup.

 

The helios64 patch to setup the sata power lines is already mainlined (line 92 ff). So it remains unclear why the sata LEDs are flashing at kernel startup.

 

@ebin-dev yes, and I thought that this code had been kept when the Armbian dts for helios64 was migrated to be based on the mainline one, so I took it to be the cause of the new behavior.

But on closer inspection, the Armbian patch removes the mainline dts code for the SATA power lines.

So the cause must lie elsewhere.

 

I admit that I give a higher priority to the crashes I get because these flashing leds seem harmless.

Posted

@ebin-dev note that u-boot is not updated in the device when the u-boot package is updated.

I really should have pushed the workaround as soon as I cooked it up, but in any case apt upgrade cannot break the u-boot on the board.

I will try to send the pull request for this u-boot workaround in January.

Note also that it is only a workaround. The support for LPDDR4 on u-boot mainline might be buggy. But I don't know how I can fix it on my side. The rockchip blob that works is a binary without sources.

Maybe one could ask rockchip devs to cook a fix for u-boot mainline?

Posted
On 10/27/2023 at 9:07 PM, ebin-dev said:

The only remaining issue is: while the heartbeat LED starts to operate, the red LEDs on the front panel briefly light up (sata1 to sata5, bus rescan) and the fans spin up for a few seconds , then turn to normal operation.

I can no longer reproduce this rescan behavior (LEDs stay solid blue before triggering on disk accesses). It could have been fixed days ago.

I am currently on 6.6.7-edge-rockchip64, fetched and built from Armbian three days ago.

U-Boot 2022.07-armbian (Jul 21 2023 - 02:01:45 +0000)

Posted

@BinaryWaves see my post above for how to restore the files u-boot needs to start the kernel

 

I have not yet been able to find the cause of the board's random crashes (sometimes frequent, sometimes months apart). However, such crashes tend to corrupt files more easily on SD and eMMC than on HDD, probably due to the block size.

So you can end up unable to boot after such a crash because of file corruption. I hope one day to get rid of at least one cause of these crashes (if there is more than one), but I am not there yet.

So one has to resort to chroot to the SD or EMMC and reinstall. Or reinstall a new image if one does not want to bother learning the required steps for chroot.

 

 

You could also take OS FS backup images regularly until the issue is sorted out and reimage after a crash that broke boot (though they are pretty rare, so one might end up stopping imaging until it happens).

Posted (edited)

@prahal Current helios64-u-boot-edge (2023-Dec-28 08:32) is supposed to include the rockchip DDR blob, but unfortunately stable operation of helios64 is still not possible with it: the r8152 is reset very frequently if this bootloader is used (contrary to linux-u-boot-edge-helios64_22.02.1_arm64.deb, where the r8152 is reset only occasionally under load).

Edited by ebin-dev
Posted
On 12/28/2023 at 4:32 PM, ebin-dev said:

Current helios64-u-boot-edge (2023-Dec-28 08:32) is supposed to include the rockchip DDR blob, but unfortunately stable operation of helios64 is still not possible with it: the r8152 is reset very frequently if this bootloader is used (contrary to linux-u-boot-edge-helios64_22.02.1_arm64.deb, where the r8152 is reset only occasionally under load).

 Thank you for the feedback (I do not use the r8152).

Do you mean you have issues with only the r8152 2.5Gb interface? Or that this is the most obvious issue with the latest u-boot?

Sorry, I don't know: is linux-u-boot-edge-helios64_22.02.1_arm64.deb the fully mainline u-boot from before I restored the rockchip DDR blob, or a completely different u-boot (is it based on u-boot 2022.07)?

I mean, does restoring the rockchip DDR blob cause a regression, or do you mean that even with that workaround the current u-boot is still less stable than a much older u-boot?

Posted (edited)
38 minutes ago, prahal said:

Do you mean you have issues with only the r8152 2.5Gb interface?

 

The only real issue I had was the r8152 driver for the 2.5G interface. Under load the mainline r8152 driver was reset by the NETDEV Watchdog - more or less often:

If I use linux-u-boot-edge-helios64_22.02.1_arm64.deb to boot from emmc I do not observe any problems anymore (see the parallel thread). I assume that the rockchip DDR blob is used.

However, using the latest u-boot, the mainline r8152 driver was reset multiple times during a single download. Maybe that version still does not contain the rockchip DDR blob...

Edited by ebin-dev
Posted (edited)
4 hours ago, ebin-dev said:

However, using the latest u-boot, the mainline r8152 driver was reset multiple times during a single download. Maybe that version still does not contain the rockchip DDR blob...

You can easily tell if you run uboot with the DDR blob. Then the uboot serial output starts with:

`DDR Version 1.25 20210517
In
soft reset
SRX
channel 0
CS = 0`

 

(The full mainline u-boot starts with a TPL message, if I remember correctly.)

What might also matter is the ATF shipped with the u-boot deb.

You can tweak which ATF release is used in the Armbian build framework.

I would also be interested in knowing the u-boot and ATF versions in linux-u-boot-edge-helios64_22.02.1_arm64.deb (I guess the DDR blob version is 1.25 as above, since it does not seem to have been upgraded for a long time, but tell me if it is not 1.25).

 

For uboot version you have a line about SPL on the serial output:

`U-Boot SPL 2022.07-armbian (Dec 20 2023 - 09:16:29 +0000)`

 

The ATF version is printed after BL31:

`NOTICE:  BL31: lts-v2.8.8(release):armbian
NOTICE:  BL31: Built : 09:16:22, Dec 20 2023`

Note that the ATF keeps running at runtime: the kernel calls into it.
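
The markers described above can be pulled out of a saved serial capture automatically. A rough sketch, assuming the log was saved to a text file and that a fully mainline build announces itself with a line starting with "U-Boot TPL" (that prefix is an assumption based on the recollection above); the function name summarize_boot_log is hypothetical.

```sh
# summarize_boot_log < serial.log
# Reads a serial capture on stdin and prints three key=value lines:
#   ddr_init  rockchip-blob | mainline-tpl | unknown
#   spl       the "U-Boot SPL ..." line, or "none"
#   atf       the text after the first "NOTICE:  BL31: " prefix, or "none"
summarize_boot_log() {
    awk '
        { sub(/\r$/, "") }                      # drop CR from serial captures
        /^DDR Version/ { blob = 1 }             # printed by the rockchip DDR blob
        /^U-Boot TPL/  { tpl = 1 }              # assumed mainline TPL banner
        /^U-Boot SPL/ && spl == "" { spl = $0 }
        /^NOTICE: +BL31: / && atf == "" {
            atf = $0
            sub(/^NOTICE: +BL31: /, "", atf)
        }
        END {
            if (blob)     print "ddr_init=rockchip-blob"
            else if (tpl) print "ddr_init=mainline-tpl"
            else          print "ddr_init=unknown"
            print "spl=" (spl == "" ? "none" : spl)
            print "atf=" (atf == "" ? "none" : atf)
        }
    '
}

# Example (hypothetical file name):
# summarize_boot_log < serial.log
```

If both a "DDR Version" line and an SPL line are present, the blob takes precedence, since that combination is what a u-boot built with the DDR blob prints.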

 

 

Also do you always run with serial attached? Just to check it is not related to my stability issues.

 

I will not be able to test the r8152 stability as I have not even made the soldering fix for it (I have an early helios64 board).

 

To be complete, could you test the fully mainline u-boot, i.e. the latest one from before I reintroduced the DDR binary blob? That would show whether the r8152 behaved the same before and after I added the rockchip DDR blob back.

Edited by prahal
ask more details
Posted (edited)
6 hours ago, prahal said:

You can easily tell if you run uboot with the DDR blob.

 

Not so easy:

Helios64 is installed in a 10" rack in the basement 2m above the floor (so that it can't be easily accessed) and it is frequently used by all my family members 🙂.

Current bootloader (linux-u-boot-edge-helios64_22.02.1_arm64.deb):

 

DDR Version 1.25 20210517
In
soft reset
SRX
channel 0
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 1
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 0 training pass!
channel 1 training pass!
change freq to 416MHz 0,1
Channel 0: LPDDR4,416MHz
Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
Channel 1: LPDDR4,416MHz
Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
256B stride
channel 0
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 1
CS = 0
MR0=0x18
MR4=0x1
MR5=0x1
MR8=0x10
MR12=0x72
MR14=0x72
MR18=0x0
MR19=0x0
MR24=0x8
MR25=0x0
channel 0 training pass!
channel 1 training pass!
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
change freq to 856MHz 1,0
ch 0 ddrconfig = 0x101, ddrsize = 0x40
ch 1 ddrconfig = 0x101, ddrsize = 0x40
pmugrf_os_reg[2] = 0x32C1F2C1, stride = 0xD
ddr_set_rate to 328MHZ
ddr_set_rate to 666MHZ
ddr_set_rate to 928MHZ
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
ddr_set_rate to 416MHZ, ctl_index 0
ddr_set_rate to 856MHZ, ctl_index 1
support 416 856 328 666 928 MHz, current 856MHz
OUT
Boot1 Release Time: May 29 2020 17:36:36, version: 1.26
CPUId = 0x0
ChipType = 0x10, 449
SdmmcInit=2 0
BootCapSize=100000
UserCapSize=14910MB
FwPartOffset=2000 , 100000
mmc0:cmd8,20
mmc0:cmd5,20
mmc0:cmd55,20
mmc0:cmd1,20
mmc0:cmd8,20
mmc0:cmd5,20
mmc0:cmd55,20
mmc0:cmd1,20
mmc0:cmd8,20
mmc0:cmd5,20
mmc0:cmd55,20
mmc0:cmd1,20
SdmmcInit=0 1
StorageInit ok = 69151
SecureMode = 0
SecureInit read PBA: 0x4
SecureInit read PBA: 0x404
SecureInit read PBA: 0x804
SecureInit read PBA: 0xc04
SecureInit read PBA: 0x1004
SecureInit read PBA: 0x1404
SecureInit read PBA: 0x1804
SecureInit read PBA: 0x1c04
SecureInit ret = 0, SecureMode = 0
atags_set_bootdev: ret:(0)
GPT 0x3335db8 signature is wrong
recovery gpt...
GPT 0x3335db8 signature is wrong
recovery gpt fail!
Trust Addr:0x4000, 0x58334c42
No find bl30.bin
No find bl32.bin
Load uboot, ReadLba = 2000
Load OK, addr=0x200000, size=0xea92c
RunBL31 0x40000 @ 97786 us
NOTICE:  BL31: v1.3(release):845ee93
NOTICE:  BL31: Built : 15:51:11, Jul 22 2020
NOTICE:  BL31: Rockchip release version: v1.1
INFO:    GICv3 with legacy support detected. ARM GICV3 driver initialized in EL3
INFO:    Using opteed sec cpu_context!
INFO:    boot cpu mask: 0
INFO:    plat_rockchip_pmu_init(1196): pd status 3e
INFO:    BL31: Initializing runtime services
WARNING: No OPTEE provided by BL2 boot loader, Booting device without OPTEE initialization. SMC`s destined for OPTEE will return SMC_UNK
ERROR:   Error initializing runtime service opteed_fast
INFO:    BL31: Preparing for EL3 exit to normal world
INFO:    Entry point address = 0x200000
INFO:    SPSR = 0x3c9


U-Boot 2021.07-armbian (Feb 27 2022 - 08:44:53 +0000)

SoC: Rockchip rk3399
Reset cause: RST
DRAM:  3.9 GiB
PMIC:  RK808 
SF: Detected w25q128 with page size 256 Bytes, erase size 4 KiB, total 16 MiB
MMC:   mmc@fe320000: 1, sdhci@fe330000: 0
Loading Environment from MMC... *** Warning - bad CRC, using default environment

 

If you could point me towards a version of a current u-boot that was built taking into account your recent pull requests, I will give it another try.

 

 

Edited by ebin-dev
Posted

@ebin-dev I will test the latest u-boot from https://fi.mirror.armbian.de/beta/pool/main/l/linux-u-boot-helios64-edge/ and tell you if it has the rockchip DDR as soon as I can.

 

Your current u-boot, linux-u-boot-edge-helios64_22.02.1_arm64.deb, has the same rockchip DDR blob that I put back in the latest merge request. But your ATF (which is called by the Linux kernel at runtime) is much older (version 1.3 from July 2020, while the current ATF LTS is version 2.8) and seems to have rockchip tweaks. Your u-boot is v2021.07.

My Helios64 suffers from random crashes at runtime. I will try with the ATF you have. Thanks for providing your version. Do you get any Linux oops, say once a month, or is Helios64 perfectly stable with your setup? (I mean apart from the r8152 triggering the netdev watchdog, i.e. a plain crash that requires a reboot to restore functionality.)

Posted

@prahal  Linux 6.6.8 and linux-u-boot-edge-helios64_22.02.1_arm64 is used since December 23rd without any Linux oops (despite the NETDEV Watchdog having to reset occasionally the mainline r8152 driver during iperf3 stress tests - but not during operation).

 

@alchemist observed however, that NFS causes issues with 6.6.8 but not with 6.1.70 but that would not appear to be Helios64 specific.

 

My use case: 24/7 as a DNS server, file server, nextcloud server, music server, plex server, and for home automation - kept everything simple (i.e. ext4 file system, no NFS).

Posted (edited)

@ebin-dev I confirm that latest u-boot https://fi.mirror.armbian.de/beta/pool/main/l/linux-u-boot-helios64-edge/ has the rockchip DDR.

 

! You might want to wait as it seems uboot compiling is broken in armbian !

You could test with rockchip ATF blob too (which I guess is what is inside `linux-u-boot-edge-helios64_22.02.1_arm64.deb`).

To do so edit `config/boards/helios64.csc` in armbian build clone and replace `BOOT_SCENARIO="tpl-blob-atf-mainline"` by `BOOT_SCENARIO="spl-blobs"`

(if you want details, check the comments in `config/sources/families/include/rockchip64_common.inc`).
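
The edit described above can also be applied non-interactively. A small sketch, assuming it is run from the root of an Armbian build checkout and that the board file contains the BOOT_SCENARIO line exactly as quoted above; the function name switch_boot_scenario is hypothetical.

```sh
# switch_boot_scenario FILE
# Rewrites BOOT_SCENARIO="tpl-blob-atf-mainline" to BOOT_SCENARIO="spl-blobs"
# in FILE, leaving every other line untouched.
switch_boot_scenario() {
    tmp=$(mktemp) || return 1
    sed 's/BOOT_SCENARIO="tpl-blob-atf-mainline"/BOOT_SCENARIO="spl-blobs"/' "$1" > "$tmp" &&
        cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example, from the root of the build checkout:
# switch_boot_scenario config/boards/helios64.csc
# grep BOOT_SCENARIO config/boards/helios64.csc
```

Checking the result with grep before building is worthwhile, because the substitution silently does nothing if the line in the board file differs from the one quoted here.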

Then build u-boot deb with:

./compile.sh uboot BOARD=helios64 BRANCH=edge RELEASE=bookworm

 

After installing the deb, you can write the u-boot to the eMMC (even if your OS is on SD, Helios64 reads u-boot from eMMC first, unless you set the jumper) with:

source /usr/lib/u-boot/platform_install.sh
write_uboot_platform $DIR /dev/mmcblk0

 

(where /dev/mmcblk0 is the emmc)

That would help confirm whether your r8152 issue is related to mainline ATF vs rockchip ATF.

Edited by prahal
warn that uboot compilation seems broken in January 2024
Posted (edited)

@ebin-dev can you check whether your box crashes before completing this program: cpufreq-switching-2.c

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <malloc.h>
#include <unistd.h>
#include <sys/mman.h>

#define MAIN_LOOPS (100)
#define TRIALS_PER_TOGGLE (10)

#define MAX_MEGS (64)


#define CPUL 0
#define CPUB 1

const char *cpul_freqs[] = {
	"408000",
	"600000",
	"816000",
	"1008000",
	"1200000",
	"1416000"
};

const char *cpub_freqs[] = {
	"408000",
	"600000",
	"816000",
	"1008000",
	"1200000",
	"1416000",
	"1608000",
	"1800000"
};

uint32_t *megs[MAX_MEGS];

int checked_open(char *name) {
	int fd = open(name, O_RDWR);
	char err[128];
	if (fd < 0) {
		snprintf(err, 128, "cannot open %s", name);
		perror(err);
		exit(1);
	}
	return fd;
}



#define SCALING_PATHL "/sys/devices/system/cpu/cpu0/cpufreq/"
#define SCALING_PATHB "/sys/devices/system/cpu/cpu4/cpufreq/"



void browse_freq(int *cpul_index, int *cpub_index, int *cpul_step, int *cpub_step) {
	static int inited = 0;
	int freql_target_len;
	int freqb_target_len;
	int freqfd;
	int cpul_freqs_count = 0;
	int cpub_freqs_count = 0;

	cpul_freqs_count = sizeof(cpul_freqs)/sizeof(cpul_freqs[0]);
	cpub_freqs_count = sizeof(cpub_freqs)/sizeof(cpub_freqs[0]);

	if (!inited) {
#if CPUL
		freqfd = checked_open(SCALING_PATHL "scaling_governor");
		write(freqfd, "userspace", 9);
		close(freqfd);
#endif
#if CPUB
		freqfd = checked_open(SCALING_PATHB "scaling_governor");
		write(freqfd, "userspace", 9);
		close(freqfd);
#endif
		inited = 1;
	}

	if (*cpul_index >= cpul_freqs_count - 1)
		*cpul_step = -1;
	if (*cpul_index <= 0)
		*cpul_step = 1;

	if (*cpub_index >= cpub_freqs_count - 1)
		*cpub_step = -1;
	if (*cpub_index <= 0)
		*cpub_step = 1;

	*cpul_index += *cpul_step;
	*cpub_index += *cpub_step;
#if CPUL
	printf("cpul_freq %s\n", cpul_freqs[*cpul_index]);
	freql_target_len = strlen(cpul_freqs[*cpul_index]);
	freqfd = checked_open(SCALING_PATHL "scaling_setspeed");
	write(freqfd, cpul_freqs[*cpul_index], freql_target_len);
	close(freqfd);
#endif
#if CPUB
	printf("cpub_freq %s\n", cpub_freqs[*cpub_index]);
	freqb_target_len = strlen(cpub_freqs[*cpub_index]);
	freqfd = checked_open(SCALING_PATHB "scaling_setspeed");
	write(freqfd, cpub_freqs[*cpub_index], freqb_target_len);
	close(freqfd);
#endif
}

void write_test_data(int nmegs, int toggle) {
	int cpul_index = 0;
	int cpub_index = 0;
	int cpul_step = 1;
	int cpub_step = 1;
	while (nmegs--) {
		browse_freq(&cpul_index, &cpub_index, &cpul_step, &cpub_step);
	}
}
void check_test_data(int nmegs, int toggle) {
	int cpul_index = 0;
	int cpub_index = 0;
	int cpul_step = 1;
	int cpub_step = 1;
	while (nmegs--) {
		browse_freq(&cpul_index, &cpub_index, &cpul_step, &cpub_step);
	}
}



int main(int argc, char **argv) {
	int nmegs = MAX_MEGS;
	printf("allocated %dMB\n", nmegs);

	int nloop, ntoggle, ntrial;

	printf("test: toggle freq before write\n");
	for (nloop = 0; nloop < MAIN_LOOPS; nloop++) {
		printf("\r%d/%d  ", nloop, MAIN_LOOPS);
		fflush(stdout);

		write_test_data(nmegs, 1);
		usleep(50);
		check_test_data(nmegs, 0);
	}
	printf("\n");

	printf("test: toggle freq before read\n");
	for (nloop = 0; nloop < MAIN_LOOPS; nloop++) {
		write_test_data(nmegs, 0);
		usleep(50);
		for (ntrial=0; ntrial < TRIALS_PER_TOGGLE; ntrial++) {
			printf("\r%d/%d, %d/%d  ", ntrial, TRIALS_PER_TOGGLE, nloop, MAIN_LOOPS);
			fflush(stdout);

			check_test_data(nmegs, 1);
		}
	}
	printf("\n");

	return 0;
}

 

 

gcc -o cpufreq-switching-2-b cpufreq-switching-2.c

 

Then run it:

sudo ./cpufreq-switching-2-b

 

I was able to reproduce the crash even with linux-u-boot-edge-helios64_22.02.1_arm64.deb (that is, the rockchip DDR binary, rockchip ATF and u-boot 2021.07), as well as with the current one.

Since your box is pretty stable and mine does not last long, that would help me work out whether my board has a hardware issue or whether the load I apply to it is at fault (the electrical environment my Helios64 lives in could play a part too, but that is another topic).

 

Edited by prahal
Posted (edited)

@prahal It would appear that your system has some kind of hardware issue if it is not stable with linux-u-boot-edge-helios64_22.02.1_arm64.deb and kernel 5.15.93. In my use-case it is stable even with kernel 6.6.8.

 

Regarding testing a potentially broken Armbian-built u-boot: I am a bit reluctant to undertake such endeavors. Helios64 is used 24/7 (by 5 people) and is not easily accessible (stored away in a rack somewhere in the basement).

Maybe someone else could do the u-boot testing (having the board on a desk would be useful)? Otherwise I could give it a try in about 4 weeks' time, after I return from a planned absence.

 

Regarding the crash-test switching cpu frequencies: my system died after switching cpu frequencies about 580 times (in less than a second), with linux-u-boot-edge-helios64_22.02.1_arm64.deb on kernel 6.6.8 (see the attached log).

 

Looking at the output of cpufreq-info, one can see which cpu-frequency states are used most often. My system normally jumps almost exclusively between 600MHz <-> 1.8GHz (big cores) and between 408MHz <-> 600MHz or 408MHz <-> 1.42GHz (little cores). The only thing I did in that context was to run sbc-bench -r, which supposedly changed some performance-related settings permanently. I think that omitting the intermediate states reduces switching between states and thus enhances responsiveness and stability while reducing the burden on the scheduler.

 

I don't know if this helps, but I attached the cpu frequency transition tables for cpu5 and cpu0 (after about 3h uptime)

 

# cat /sys/devices/system/cpu/cpu5/cpufreq/stats/trans_table 
   From  :    To
         :    408000    600000    816000   1008000   1200000   1416000   1608000   1800000 
   408000:         0         0         0         0         0         0         0         0 
   600000:         0         0       140        13         7         7         1      1126 
   816000:         0       130         0        13         3         1         2        48 
  1008000:         0        15        18         0         4         2         2         1 
  1200000:         0         5         6         7         0         9         3         7 
  1416000:         0         3         3         4        10         0        15         9 
  1608000:         0         2         1         1         8        18         0        18 
  1800000:         0      1139        29         4         5         7        25         0 

# cat /sys/devices/system/cpu/cpu0/cpufreq/stats/trans_table 
   From  :    To
         :    408000    600000    816000   1008000   1200000   1416000 
   408000:         0      1133        14         9         3      1002 
   600000:      1081         0         5         3         2       134 
   816000:        12         6         0        46         3        10 
  1008000:         7         2        44         0        11        21 
  1200000:         1         4         6        17         0        28 
  1416000:      1061        79         8        10        37         0 
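
Reading such a table by eye is error-prone, so here is a small sketch that picks out the single most frequent transition. It assumes the layout shown above (a "From : To" line, a header row of target frequencies, then one row per source frequency); the function name top_transition is made up for illustration.

```sh
# top_transition < trans_table
# Prints "FROM -> TO : COUNT" for the largest entry in a cpufreq trans_table.
top_transition() {
    awk '
        NR == 2 {                                # header row: ":" then target freqs
            for (i = 2; i <= NF; i++) to[i] = $i
            next
        }
        NR > 2 && NF > 1 {                       # data row: "FROM:" then counts
            from = $1
            sub(/:$/, "", from)
            for (i = 2; i <= NF; i++)
                if ($i + 0 > best) { best = $i + 0; bf = from; bt = to[i] }
        }
        END { if (best > 0) print bf " -> " bt " : " best }
    '
}

# Example:
# top_transition < /sys/devices/system/cpu/cpu5/cpufreq/stats/trans_table
```

Frequencies are printed in kHz, as in the table itself.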

 

 

cpufreq-switching-2-b.log

Edited by ebin-dev
log file added, transition tables added
Posted

I did indeed manage to get one of my Helios64 boards to crash with the above code, with linux-u-boot-edge-helios64_22.02.1_arm64.deb on kernel 5.15.93.

 

Armbian 23.8.1 bullseye ttyS2 
            [  115.729058] Internal error: Oops: 86000005 [#1] PREEMPT SMP
[  115.729568] Modules linked in: bluetooth unix_diag veth nft_masq nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge dm_mod ipt_REJECT nf_reject_ipv4 xt_multiport nft_compat nft_counter nf_tables nfnetlink binfmt_misc rfkill lz4hc lz4 zram raid456 async_memcpy async_raid6_recov async_pq async_xor async_tx md_mod r8152 cdc_acm snd_soc_hdmi_codec snd_soc_rockchip_i2s snd_soc_rockchip_pcm leds_pwm pwm_fan snd_soc_core gpio_charger panfrost snd_pcm_dmaengine snd_pcm gpu_sched snd_timer snd soundcore realtek rockchip_vdec(C) hantro_vpu(C) rockchip_iep rockchip_rga v4l2_h264 videobuf2_dma_contig videobuf2_vmalloc videobuf2_dma_sg v4l2_mem2mem videobuf2_memops fusb302 sg videobuf2_v4l2 videobuf2_common dwmac_rk tcpm stmmac_platform typec videodev mc stmmac pcs_xpcs adc_keys gpio_beeper cpufreq_dt ledtrig_netdev lm75 sunrpc ip_tables x_tables autofs4
[  115.736491] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G         C        5.15.93-rockchip64 #23.02.2
[  115.737279] Hardware name: Helios64 (DT)
[  115.737631] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  115.738252] pc : 0xffff8004080d5e8c
[  115.738573] lr : 0xffff8004080d5e8c
[  115.738887] sp : ffff800009df3e60
[  115.739185] x29: ffff800009df3e60 x28: ffff00000078bb00 x27: 0000000000000000
[  115.739826] x26: ffff800009eebc80 x25: 0000000000000001 x24: ffff000000404300
[  115.740467] x23: 00000000000000c0 x22: ffffffffffffffd0 x21: ffff8000095504a8
[  115.741105] x20: ffff0000f77ab980 x19: ffffffffffffffd0 x18: 0000000000000000
[  115.741744] x17: ffff8000ee06c000 x16: ffff800009df4000 x15: 00001f1e8e1e9e92
[  115.742384] x14: 00000000000003f6 x13: 0000000000000056 x12: 0000000000000000
[  115.743023] x11: 0000000000000001 x10: 0000000000000000 x9 : 0000000000000056
[  115.743662] x8 : ffff0000f77aba00 x7 : ffff0000f77aba30 x6 : 0000000000000001
[  115.744301] x5 : ffff8000ee06c000 x4 : 0000000000010002 x3 : 000000000001b663
[  115.744940] x2 : ffffffffffffa88d x1 : 00000000ffff4b2f x0 : 000000000000d5ba
[  115.745580] Call trace:
[  115.745802]  0xffff8004080d5e8c
[  115.746088]  flush_smp_call_function_queue+0x114/0x250
[  115.746557]  generic_smp_call_function_single_interrupt+0x14/0x20
[  115.747103]  ipi_handler+0x7c/0x340
[  115.747423]  handle_percpu_devid_irq+0xa0/0x240
[  115.747830]  handle_domain_irq+0x90/0xd8
[  115.748187]  gic_handle_irq+0xb8/0x134
[  115.748528]  call_on_irq_stack+0x28/0x50
[  115.748883]  do_interrupt_handler+0x58/0x68
[  115.749261]  el1_interrupt+0x30/0x78
[  115.749585]  el1h_64_irq_handler+0x18/0x28
[  115.749954]  el1h_64_irq+0x74/0x78
[  115.750261]  arch_cpu_idle+0x18/0x28
[  115.750584]  default_idle_call+0x40/0x184
[  115.750949]  do_idle+0x1fc/0x270
[  115.751245]  cpu_startup_entry+0x28/0x50
[  115.751602]  secondary_start_kernel+0x164/0x178
[  115.752011]  __secondary_switched+0x90/0x94
[  115.752396] Code: bad PC value
[  115.752677] ---[ end trace 0ceb9c6e6a618ff5 ]---
[  115.753092] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[  115.753699] SMP: stopping secondary CPUs
[  116.920717] SMP: failed to stop secondary CPUs 0,5
[  116.921146] Kernel Offset: disabled
[  116.921458] CPU features: 0x800820f1,20000846
[  116.921847] Memory Limit: none
[  116.922129] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---

 

Posted

@ebin-dev @OdyX if time permits, you could try changing "CPUL" to "1" and "CPUB" to "0" in my code above ("#define CPUL 1", for example). Running the program on cpu_l (the 4 slower CPUs) should not crash.

You can then compile it as:

gcc -o cpufreq-switching-2-l cpufreq-switching-2.c

and run it.
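
The CPUL/CPUB switch described above can be applied to a copy of the source so that the big-cluster variant stays untouched. A sketch, assuming the two #define lines appear exactly as in the program posted earlier; the function name and the output file name are hypothetical.

```sh
# make_little_variant SRC DST
# Copies SRC to DST with CPUL set to 1 and CPUB set to 0.
make_little_variant() {
    sed -e 's/^#define CPUL 0$/#define CPUL 1/' \
        -e 's/^#define CPUB 1$/#define CPUB 0/' "$1" > "$2"
}

# Example:
# make_little_variant cpufreq-switching-2.c cpufreq-switching-2-little.c
# gcc -o cpufreq-switching-2-l cpufreq-switching-2-little.c
```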

 

 

@ebin-dev if yours crashes in one second, it seems my hardware is as stable as your board... sad. If it were a matter of soldering a component, or even a new RK3399 CPU, I would have tried. I believe the fact that some have the issue often and others only sporadically has to do with the load (and maybe the mains power and ground could make it even more frequent, but that is just a guess).

I believe something is wrong with the cpu_b regulator or the voltage it is fed.

I tested the 12V input voltage on the board and it was fine.

 

Note that CPU big (CPU 4 and 5) loads are related to PCI/SATA and r8152 (in armbian build repository):
 

Quote

 

commit c242d07397ecec40bd0876054b862ad51a45b4d3

    * armbian-hardware-optimization: SATA & 2.5GbE IRQ pinning on Helios64
    
    - 2.5GbE USB LAN which is attached to XHCI, assigned to CPU4
    - SATA controller assigned to CPU4 and CPU5

 

(I believe the r8152 assignment to CPU4 is an assignment of the whole USB3, not r8152 only).

 

From the last Kobol team posts, the cause of the instability is unknown.

Some said it was DFS, i.e. the instability would come not from the frequencies themselves but from how fast the transitions between them happen (according to the Odroid forum post https://forum.odroid.com/viewtopic.php?t=30303, the bigger the frequency change in one step, the more unstable). However, it remains to be confirmed that this is what makes the big CPU cluster on our board unstable: the Odroid N1 post and @piter75's patchset for NanoPi M4V2 were about the little cores, not the big ones.

That is, they added "max-buck-steps-per-change = 4;" to help with instability, but this setting applies to the rk808-D regulator, which as far as I can tell only affects the little CPU cluster and not the A72 CPUs (I have not yet checked whether the little cluster is unstable without this setting, though).

As confirmed by the patch submitter @piter75  these max-buck-steps-per-change were to fix little cores:

 

I believe the big CPU cluster is stable on other rk3399 boards (even those with the same syr827 regulator), though it is just a guess. If one could try my cpufreq-switching-2-b test on another rk3399 board that would help.

 

The Kobol team also took a patch from the Odroid team repository (https://forum.odroid.com/viewtopic.php?t=30303) which changes the vdd_cpu_b regulator-ramp-delay from 1000 to 40000 to improve stability, though I believe they misunderstood it (the aim of the Odroid patch was to speed up the transition, because it had been tested as still stable). Increasing regulator-ramp-delay does not lengthen the delay between frequency transitions but shortens it, which is the opposite of what they meant to do to fix the instability, namely slowing down frequency switching.

See https://patchwork.ozlabs.org/project/uboot/patch/20190216094548.911-7-krzk@kernel.org/: regulator-ramp-delay is in uV/us, which means it is the number of uV the regulator switches per microsecond. Increasing it makes the switch faster.
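To put numbers on the unit, here is a minimal sketch of the arithmetic. It only illustrates what "uV per us" implies for the ramp time; it is not taken from the kernel or from the Kobol patches, the `ramp_time_us` helper is made up for this post, and the 200 mV step is a hypothetical value, not a measured one.

```shell
# ramp_time_us DELTA_UV RAMP_UV_PER_US
# Time in microseconds needed to slew DELTA_UV microvolts at the given
# ramp rate, rounded up to a whole microsecond.
ramp_time_us() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical 200 mV step between two operating points:
ramp_time_us 200000 1000     # regulator-ramp-delay = 1000
ramp_time_us 200000 40000    # regulator-ramp-delay = 40000
```

With these assumed numbers the first call prints 200 and the second prints 5, so the larger value corresponds to the shorter ramp.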

Maybe we could try the opposite, that is, lower this value and rerun the test program.

Posted
3 hours ago, prahal said:

Increasing this regulator-ramp-delay does not lengthen the delay between frequency transitions but shortens it (thus doing the opposite of what they meant to do to fix the instability, that is, slowing down frequency switching)

 

This is very interesting. For regulator vdd_cpu_b, 'regulator-ramp-delay' is still set to decimal 40000 in the current dtb (6.6.8). You could try reducing that number in your dtb to increase the delay until your frequency switching program finishes its task. If, with the resulting value, your CPU still responds quickly enough to scheduled tasks, then you may have eliminated a source of instability.
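To check which value the running kernel actually got, the property can be read back from the live device tree. This is a rough sketch: `dt_u32`, `list_ramp_delays` and the `DT_ROOT` override are names made up here, and I am assuming the device tree is exposed under /sys/firmware/devicetree/base, where each property is a file holding big-endian 32-bit cells.

```shell
# dt_u32 FILE: print the first 32-bit big-endian cell of a device tree
# property file as a decimal number.
dt_u32() {
    set -- $(od -An -tu1 -N4 "$1")
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# list_ramp_delays: print "<value> <node path>" for every
# regulator-ramp-delay property found in the live device tree.
# DT_ROOT is a hypothetical override, only useful for testing.
list_ramp_delays() {
    base=${DT_ROOT:-/sys/firmware/devicetree/base}
    find "$base" -name regulator-ramp-delay | while read -r prop; do
        printf '%s %s\n' "$(dt_u32 "$prop")" "${prop#"$base"}"
    done
}

# Usage on the board:
# list_ramp_delays
```

The vdd_cpu_b node then has to be picked out of the list by its path; I have not listed the exact node name here because it depends on the dts in use.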

 

Since kernel 6.6.8 uses a more efficient scheduler you could use that one for your experiments.

 

I actually do not think that the Kobol team was mistaken: their commit states that the 'existing value make clock transisition time large and could causing random kernel crash'. Therefore regulator-ramp-delay was increased from decimal 1000 to 40000, thereby decreasing the clock transition time. This was a step in the right direction; maybe that step was too large ...

Posted

@ebin-dev regarding regulator-ramp-delay, you should take the rationale in the commit that introduced this setting in the kernel as the reference, not the comment in the Kobol team commit. That comment states that increasing this value has slowed down the frequency switching; as I understand it, they misread the Odroid post https://forum.odroid.com/viewtopic.php?t=30303, which was about speeding up the transitions, not slowing them down: the poster wanted faster transitions and tested that the CPU was still stable even with a faster transition, i.e. a greater regulator-ramp-delay.

As the Linux mainline commit states, regulator ramp delay is in uV per us; that is, the greater it is, the more voltage is switched per unit of time.

I already reverted it to its previous value of 1000, but as the board was already unstable before the value was increased to 40000, I am not surprised it is still unstable (though my program ran longer than yours, but that might be random). I will try decreasing it further on my next attempt.

 

 

Still, to me, something else must be at play; otherwise I do not understand why the same CPU would require very slow transitions on Helios64 and very fast ones on Odroid N1 😕

At best, if lowering regulator-ramp-delay works, this would be a workaround in my opinion. I am beginning to doubt the correctness of the dts nodes set by the Kobol team (they could have set the wrong regulator type for vdd_cpu_b or the like, or the wrong pinctrl definition for this regulator), none of which can be confirmed as they did not provide the schematics. I found a picture of the board without the heatsink (from the Kobol team on Twitter https://twitter.com/kobol_io/status/1281088456391667713), but I believe the picture is not detailed enough to see the marking on the syr827 regulator for cpu_b. And it will not tell us the wiring and pulldown. Maybe we could ask @aprayoga, as he said in September 2021 that he would still be around (https://forum.armbian.com/topic/18844-kobol-team-is-pulling-the-plug/?do=findComment&comment=128364).

 

And I do not exclude DDR timings, even though, judging from the previous DDR issue (which led me to revert to the Rockchip DDR setting blob in u-boot), such an issue seems to also affect userspace, and with the current instability I do not get userspace programs crashing, only kernel errors (but this is based on a single experience of a DDR setting issue).

 

I also want to try other things, like an ATX power supply plugged into the board instead of the power adapter (even though my multimeter showed above 12V on the board with the power adapter, power is a common cause of kernel issues on SBCs).

 

Posted (edited)

@prahal There are many values to choose from between 1000 and 40000 (regulator-ramp-delay). Why don't you try 2000, 4000, 10000, 20000 ? (It might solve your problem)

Edited by ebin-dev
Posted

@ebin-dev I am currently cleaning a backup archive on the helios64. I will test values below 1000 asap, but I do not expect much (I already had regulator-ramp-delay set at 1000 for months and it is not stable. Though it could be that regulator-ramp-delay is not the issue ... I already tried adding "regulator-settling-time-us = 5000", no better). I will also try having my test program ask for a frequency switch only every 5 seconds instead of every 50 microseconds. I will also try skipping some frequencies to test whether only specific frequencies are at play.
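For anyone who wants to try the "slow switching" and "skip some frequencies" variants without touching the C program, something along these lines might do. This is only a sketch and not the cpufreq-switching-2 program itself: `switch_loop` and the `CPUFREQ_ROOT` override are made-up names, and I am assuming the userspace cpufreq governor is available in the kernel and that policy4 is the big cluster.

```shell
# switch_loop POLICY INTERVAL LOOPS [FREQ_TO_SKIP...]
# Cycle a cpufreq policy through its available frequencies, waiting
# INTERVAL seconds after each switch, and never selecting the listed
# frequencies. CPUFREQ_ROOT is a hypothetical override for testing.
switch_loop() {
    policy=${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpufreq}/$1
    interval=$2
    loops=$3
    shift 3
    skip=" $* "
    echo userspace > "$policy/scaling_governor"
    i=0
    while [ "$i" -lt "$loops" ]; do
        for f in $(cat "$policy/scaling_available_frequencies"); do
            case "$skip" in *" $f "*) continue ;; esac
            echo "$f" > "$policy/scaling_setspeed"
            sleep "$interval"
        done
        i=$((i + 1))
        echo "loop $i done"
    done
}

# Usage on the board, as root:
# switch_loop policy4 5 100            # one switch every 5 seconds
# switch_loop policy4 5 100 1800000    # same, never selecting 1800000
```

The second usage line assumes 1800000 appears in scaling_available_frequencies on your board. The last "loop N done" line printed before a crash tells how far the run got.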

At least with a reliable crasher (the above test program), it is easier to tell whether a setting helps or not (as opposed to "it did not crash for a week so it is better", when the trigger for the crash might simply not have happened that week).

 

The test program helps, but I am out of ideas about what other settings to try.

If it turned out that this test program also crashes other rk3399 boards (or even knowing it does not) that would help.

 

I would also like to test with the xhci and ahci interrupts removed from the big cores. This is the main difference with other boards.

Posted (edited)

Hello @prahal and @ebin-dev any news on your stability tests? I also have occasional freezes and reboot problems, so I'm very curious to see if anything will change.

Thank you very much for your commitment!

Edited by snakekick
Posted (edited)

Hi @prahal

 

I've just done a test with your cpufreq-switching-2 program.

 

I'm running Helios64 on Armbian 23.08.0-trunk Bookworm with Linux 6.6.8-edge-rockchip64

 

I've started with LITTLE (CPUL = 1)

The program ran the 100 loops without issue.

 

Then I ran with big (CPUB = 1)

So far it failed at the 6th loop

 

Before a third run, I tried to change the interrupt allocation on xhci and ahci as you suggested

Please note the interrupts may vary after reboot (e.g. ahci was 76-80, after reboot it is 75-79)

# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
 18:          0          0          0          0          0          0     GICv3  25 Level     vgic
 20:          0          0          0          0          0          0     GICv3  27 Level     kvm guest vtimer
 23:       7947       8876       6014       7156      18916      24271     GICv3  30 Level     arch_timer
 25:       6601       5232       4476       4609      11249       4343     GICv3 113 Level     rk_timer
 31:          0          0          0          0          0          0     GICv3  37 Level     ff6d0000.dma-controller
 32:          0          0          0          0          0          0     GICv3  38 Level     ff6d0000.dma-controller
 33:          0          0          0          0          0          0     GICv3  39 Level     ff6e0000.dma-controller
 34:          0          0          0          0          0          0     GICv3  40 Level     ff6e0000.dma-controller
 36:        915          0          0          0          0          0     GICv3 132 Level     ttyS2
 37:          0          0          0          0          0          0     GICv3 147 Level     ff650800.iommu
 38:          0          0          0          0          0          0     GICv3 149 Level     ff660480.iommu
 39:          0          0          0          0          0          0     GICv3 151 Level     ff8f3f00.iommu, ff8f0000.vop
 40:          0          0          0          0          0          0     GICv3 150 Level     ff903f00.iommu, ff900000.vop
 41:          0          0          0          0          0          0     GICv3  75 Level     ff914000.iommu
 42:          0          0          0          0          0          0     GICv3  76 Level     ff924000.iommu
 43:          0          0          0          0          0          0     GICv3  85 Level     ff1d0000.spi
 44:          0          0          0          0          0          0     GICv3  84 Level     ff1e0000.spi
 45:          0          0          0          0          0          0     GICv3 164 Level     ff200000.spi
 46:       1399          0          0          0       1775          0     GICv3 142 Level     xhci-hcd:usb1
 47:         30          0          0          0          0          0     GICv3  67 Level     ff120000.i2c
 48:          0          0          0          0          0          0     GICv3  68 Level     ff160000.i2c
 49:       5031          0          0          0          0          0     GICv3  89 Level     ff3c0000.i2c
 50:        540          0          0          0          0          0     GICv3  88 Level     ff3d0000.i2c
 51:          0          0          0          0          0          0     GICv3  90 Level     ff3e0000.i2c
 52:          0          0          0          0          0          0     GICv3 129 Level     rockchip_thermal
 53:          0          0          0          0          0          0     GICv3 152 Edge      ff848000.watchdog
 54:          0          0          0          0          0          0  GICv3-23   0 Level     arm-pmu
 55:          0          0          0          0          0          0  GICv3-23   1 Level     arm-pmu
 56:          0          0          0          0          0          0  rockchip_gpio_irq   9 Edge      2-0020
 57:          0          0          0          0          0          0  rockchip_gpio_irq  10 Level     rk808
 63:          0          0          0          0          0          0     rk808   5 Edge      RTC alarm
 67:          2          0          0          0          0          0     GICv3  94 Level     ff100000.saradc
 68:          0          0          0          0          0          0     GICv3  97 Level     dw-mci
 69:          0          0          0          0          0          0  rockchip_gpio_irq   7 Edge      fe320000.mmc cd
 70:          0          0          0          0          0          0     GICv3  81 Level     pcie-sys
 72:          0          0          0          0          0          0     GICv3  83 Level     pcie-client
 74:          0          0          0          0          0          0   ITS-MSI   0 Edge      PCIe PME, aerdrv
 75:          0        489          0          0        524          0   ITS-MSI 524288 Edge      ahci0
 76:          0          0        237          0          0        904   ITS-MSI 524289 Edge      ahci1
 77:          0          0          0        489      31578          0   ITS-MSI 524290 Edge      ahci2
 78:          0          0          0          0        249          0   ITS-MSI 524291 Edge      ahci3
 79:          0          0          0          0          0        248   ITS-MSI 524292 Edge      ahci4
 83:      14093          0          0          0          0          0     GICv3  43 Level     mmc1
 84:          0          0          0          0          0          0  rockchip_gpio_irq   5 Edge      Power
 85:          0          0          0          0          0          0  rockchip_gpio_irq   3 Edge      User Button 1
 86:          0          0          0        931          0          0     GICv3  44 Level     end0
 87:          5          0          0          0          0          0  rockchip_gpio_irq   2 Level     fsc_interrupt_int_n
 88:          0          0          0          0          0          0     GICv3  59 Level     rockchip_usb2phy
 89:          0          0          0          0          0          0     GICv3 135 Level     rockchip_usb2phy_bvalid
 90:          0          0          0          0          0          0     GICv3 136 Level     rockchip_usb2phy_id
 91:          0          0          0          0          0          0     GICv3  60 Level     ohci_hcd:usb4
 92:          0          0          0          0          0          0     GICv3  58 Level     ehci_hcd:usb3
 93:          0          0          0          0          0          0     GICv3 137 Level     dwc3-otg, xhci-hcd:usb5
 94:          0          0          0          0          0          0     GICv3  32 Level     rk-crypto
 95:          0          0          0          0          0          0     GICv3 146 Level     ff650000.video-codec
 96:          0          0          0          0          0          0     GICv3  87 Level     ff680000.rga
 97:          0          0          0          0          0          0     GICv3 145 Level     ff650000.video-codec
 98:          0          0          0          0          0          0     GICv3 148 Level     ff660000.video-codec
 99:          0          0          0          0          0          0  rockchip_gpio_irq   2 Edge      gpio-charger
100:          0          0          0          0          0          0  rockchip_gpio_irq  27 Edge      gpio-charger
101:          2          0          0          0          0          0     GICv3  51 Level     panfrost-gpu
102:          0          0          0          0          0          0     GICv3  53 Level     panfrost-mmu
103:          0          0          0          0          0          0     GICv3  52 Level     panfrost-job
IPI0:      1384       1517       1472       1311       4816       7551       Rescheduling interrupts
IPI1:     12225      10971       9100       9240      10161      26978       Function call interrupts
IPI2:         0          0          0          0          0          0       CPU stop interrupts
IPI3:         0          0          0          0          0          0       CPU stop (for crash dump) interrupts
IPI4:      2213       2003       2357       2402       2137       1671       Timer broadcast interrupts
IPI5:       598        601        747        496       1106        784       IRQ work interrupts
IPI6:         0          0          0          0          0          0       CPU wake-up interrupts
Err:          0

 

I reallocated the interrupts to the little cores.

# echo 0 > /proc/irq/46/smp_affinity_list

# echo 1 > /proc/irq/75/smp_affinity_list
# echo 2 > /proc/irq/76/smp_affinity_list
# echo 3 > /proc/irq/77/smp_affinity_list
# echo 0 > /proc/irq/78/smp_affinity_list
# echo 1 > /proc/irq/79/smp_affinity_list
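Since the IRQ numbers can change after a reboot, the same reallocation can be done by handler name instead of by number. This is a small sketch under that assumption: `irqs_by_name`, `pin_irqs_to_little` and the `IRQ_ROOT` override are hypothetical helper names, and CPU0-3 are taken to be the little cores as in the listing above.

```shell
# irqs_by_name REGEX: read /proc/interrupts-style text on stdin and print
# the IRQ numbers whose last column (the handler name) matches REGEX.
irqs_by_name() {
    awk -v pat="$1" '$1 ~ /^[0-9]+:$/ && $NF ~ pat { n = $1; sub(/:$/, "", n); print n }'
}

# pin_irqs_to_little IRQ...: spread the given IRQs round-robin over CPU0-3.
# IRQ_ROOT is a hypothetical override, only useful for dry runs.
pin_irqs_to_little() {
    root=${IRQ_ROOT:-/proc/irq}
    cpu=0
    for irq in "$@"; do
        echo "$cpu" > "$root/$irq/smp_affinity_list"
        cpu=$(( (cpu + 1) % 4 ))
    done
}

# Usage on the board, as root:
# pin_irqs_to_little $(irqs_by_name '^(xhci-hcd:usb1|ahci[0-9]+)$' < /proc/interrupts)
```

With the listing above this selects IRQ 46 and 75 to 79 and assigns them to CPUs 0, 1, 2, 3, 0, 1, which is the same allocation as the manual commands.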

 

Then I ran the program on the big cores again (CPUB = 1).

It reached the 25th loop before it failed.

Edited by Trillien
