prahal Posted November 18, 2023
On 10/26/2023 at 3:17 AM, BinaryWaves said: Getting this in the serial output: Unknown command 'kaslrseed' - try 'help' There was a commit regarding this file and rockchip here: https://github.com/armbian/build/pull/4352 Is this a regression bug? And is this message fatal or just a warning?
This is totally harmless. It is explained in the link you gave: "If the kaslrseed command hasn't been compiled in to u-boot, it gracefully skips generating the kASLR".
prahal Posted November 18, 2023
On 10/30/2023 at 2:26 PM, ebin-dev said: My system also had the "free() invalid pointer" issue and I repaired it by flashing a new bootloader (linux-u-boot-edge-helios64_22.02.1_arm64, as discussed here).
Every helios64 system that has flashed the Armbian u-boot since it was switched to full mainline u-boot has this issue. The only ways to avoid it are to not have flashed the fully mainlined u-boot (the one without the rockchip DDR blob) or to never stress the RAM. I have not yet sent the PR with the workaround (which is to build mainline u-boot with the rockchip DDR blob). Mind that this is only a workaround: the upstream u-boot code that sets the LPDDR4 settings should be fixed. One step would be to find out whether other rk3399 boards with LPDDR4 (Nanopi M4 v2, Rock Pi 4, Orange Pi 4, etc.) are affected as well. If you own such a board, it would be great to check whether you can reproduce the issue with my test case and a similar u-boot.
prahal Posted November 18, 2023 Posted November 18, 2023 (edited) On 10/27/2023 at 9:07 PM, ebin-dev said: The only remaining issue is: while the heartbeat LED starts to operate, the red LEDs on the front panel briefly light up (sata1 to sata5, bus rescan) and the fans spin up for a few seconds , then turn to normal operation. Could this be u-boot related ? Would you have an idea ? (see the parallel thread) Edit: I was wrong. as first the import of the upstream linux dts was done for 6.3 not 6.1 in armbian. Quote Would you like to look at the remaining glitch that I observe with linux 6.1.60 during boot (using a spare sd with bookworm on it) ? The sata bus is rescanned during boot and the red sata 1-5 LEDs flash one after the other at the time when the heartbeat LED starts to blink. This was not the case with linux 6.1.36. Then checking the new armbian helios64 patchset it does remove this code to setup the sata power lines https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts?id=8169b9894dbd2d4e440cfbc5fe9f733e5876a564 I woudl have to investigate why the sata lines are flashing at kernel startup. Quote To exclude that this is u-boot related: which version of u-boot do you use (on sd/emmc) ? (A stock image ?) I do have these flashing leds. I have U-Boot 2022.07-armbian (Jul 21 2023 - 02:01:45 +0000). Edited November 18, 2023 by prahal I was wrong 0 Quote
ebin-dev Posted November 18, 2023 (edited)
@prahal Thank you for the hints! I would like to test the modified u-boot. Could you send a link (pm) so that I can try it? My setup is now stable, as discussed in the parallel thread. I had to go back to a Linux kernel (5.10.43) that uses the Realtek r8152 driver v2.14.0 (2020/09/24) instead of the mainline driver, and to downgrade the bootloader to linux-u-boot-edge-helios64_22.02.1_arm64 on emmc (no flashing LEDs). As a next step I plan to compile and test a kernel based on LTS 6.6.x, including the code to set up the sata power lines and a working version of the Realtek r8152 driver. The mainline version of that driver is still under heavy development ...
Edited November 18, 2023 by ebin-dev
ebin-dev Posted November 22, 2023
On 11/18/2023 at 12:06 PM, prahal said: Then checking the new armbian helios64 patchset it does remove this code to setup the sata power lines https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts?id=8169b9894dbd2d4e440cfbc5fe9f733e5876a564 I would have to investigate why the sata lines are flashing at kernel startup.
The helios64 patch to set up the sata power lines is already mainlined (line 92 ff). So it remains unclear why the sata LEDs are flashing at kernel startup.
BinaryWaves Posted December 3, 2023
Apologies, it's been a while. I have been busy and haven't had a lot of time to mess with my helios. I had gotten it to a stable point with updates and then kernel headers, etc, and then froze. I just did an armbian update and it installed updates but now it won't boot :\
boot.log
ebin-dev Posted December 9, 2023 (edited)
On 12/3/2023 at 11:43 PM, BinaryWaves said: I just did an armbian update and it installed updates but now it won't boot
As nobody maintains helios64, installing Armbian updates is like playing Russian roulette (even worse than that). In the parallel thread I tested various combinations of OS, bootloader and kernel and ended up with this configuration (adding linux 5.15.52 to the list).
Edited December 9, 2023 by ebin-dev
prahal Posted December 12, 2023 (edited)
On 12/3/2023 at 11:43 PM, BinaryWaves said: I had gotten it to a stable point with updates and then kernel headers, etc, and then froze. I just did an armbian update and it installed updates but now it won't boot 😕
switch to partitions #0, OK
mmc1 is current device
Scanning mmc 1:1...
Found U-Boot script /boot/boot.scr
3185 bytes read in 6 ms (517.6 KiB/s)
## Executing script at 00500000
Boot script loaded from mmc 1
166 bytes read in 4 ms (40 KiB/s)
14541965 bytes read in 620 ms (22.4 MiB/s)
Failed to load '/boot/Image'
86896 bytes read in 14 ms (5.9 MiB/s)
2698 bytes read in 10 ms (262.7 KiB/s)
Applying kernel provided DT fixup script (rockchip-fixup.scr)
## Executing script at 09000000
Bad Linux ARM64 Image magic!
SCRIPT FAILED: continuing...
To me this looks like a file required by u-boot got corrupted during your "freeze". If mmc1 is an SD card, you could mount it from a computer and check that files like /boot/armbianEnv.txt, the kernel image and the initrd are present in /boot. If the kernel or the initrd is the issue (due, I suppose, to your freeze during the upgrade), the best option would be to chroot into the SD card and reinstall the linux-image package. An example of a valid /boot/armbianEnv.txt:
verbosity=7
bootlogo=false
overlay_prefix=rockchip
rootdev=UUID=a79a14c0-3cf4-4fb9-a6c6-838571351371
rootfstype=ext4
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u,0x0bc2:0x231a:u,0x1058:0x2621:u
Note that the rootdev=UUID value could vary, as could the usbstoragequirks.
Edited December 12, 2023 by prahal
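For anyone who mounts the card on another computer as suggested above, here is a minimal sketch of how the "Bad Linux ARM64 Image magic!" condition could be checked offline. It assumes /boot/Image is an uncompressed arm64 kernel image, whose header carries the four bytes "ARM\x64" at offset 56; the function names and the mount path in the usage note are hypothetical, not part of any Armbian tool.

```python
ARM64_MAGIC = b"ARM\x64"  # magic field of the arm64 Image header
MAGIC_OFFSET = 56         # byte offset of the magic field in the header


def has_arm64_image_magic(data: bytes) -> bool:
    """Return True if the given header bytes carry the arm64 Image magic."""
    end = MAGIC_OFFSET + len(ARM64_MAGIC)
    return len(data) >= end and data[MAGIC_OFFSET:end] == ARM64_MAGIC


def check_image_file(path: str) -> bool:
    """Read the first 64 bytes of a kernel image file and test the magic."""
    with open(path, "rb") as f:
        return has_arm64_image_magic(f.read(64))
```

With the card mounted at, say, /mnt/sd (a made-up mount point), `check_image_file("/mnt/sd/boot/Image")` returning False, or raising because the file is missing, would point to the same corruption that u-boot complains about.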
prahal Posted December 12, 2023
On 11/22/2023 at 1:01 PM, ebin-dev said: On 11/18/2023 at 12:06 PM, prahal said: Then checking the new armbian helios64 patchset it does remove this code to setup the sata power lines https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts?id=8169b9894dbd2d4e440cfbc5fe9f733e5876a564 I would have to investigate why the sata lines are flashing at kernel startup. The helios64 patch to setup the sata power lines is already mainlined (line 92 ff). So it remains unclear why the sata LEDs are flashing at kernel startup.
@ebin-dev yes, and I thought that this code had been kept when the Armbian dts for helios64 was migrated to be based on the mainline one, so I took it to be the cause of the new behavior. But a more thorough inspection of the Armbian patch shows that it removes the mainline dts code for the sata power lines. So the cause must lie elsewhere. I admit that I give a higher priority to the crashes I get, because these flashing LEDs seem harmless.
prahal Posted December 14, 2023
@ebin-dev note that the u-boot on the device is not updated when the u-boot package is updated. I really should have pushed the workaround as soon as I cooked it up, but an apt upgrade cannot break the u-boot on the board. I will try to send the pull request for this u-boot workaround in January. Note also that it is only a workaround. The LPDDR4 support in mainline u-boot might be buggy, but I don't know how I can fix it on my side: the rockchip blob that works is a binary without sources. Maybe one could ask the rockchip devs to cook up a fix for mainline u-boot?
prahal Posted December 16, 2023
On 10/27/2023 at 9:07 PM, ebin-dev said: The only remaining issue is: while the heartbeat LED starts to operate, the red LEDs on the front panel briefly light up (sata1 to sata5, bus rescan) and the fans spin up for a few seconds, then turn to normal operation.
I can no longer reproduce this rescan behavior (the LEDs stay solid blue before triggering on disk accesses). It could have been fixed days ago. I am currently on 6.6.7-edge-rockchip64, fetched and built on Armbian three days ago, with U-Boot 2022.07-armbian (Jul 21 2023 - 02:01:45 +0000).
prahal Posted December 16, 2023
@BinaryWaves see my post above for how to restore the files u-boot needs to start the kernel. I have not yet been able to find the cause of the random crashes of the board (sometimes frequent, sometimes months apart). However, such crashes tend to corrupt files more easily on SD and eMMC than on HDD, probably due to the block size. So you can end up unable to boot after such a crash because of file corruption. I hope one day to get rid of at least one cause of these crashes (if there is more than one), but I am not there yet. So one has to resort to chrooting into the SD or eMMC and reinstalling, or to installing a fresh image if one does not want to bother learning the required steps for chroot. You could also take backup images of the OS file system regularly until the issue is sorted out, and reimage after a crash that broke boot (though such crashes are pretty rare, so one might have stopped imaging by the time it happens).
ebin-dev Posted December 28, 2023 (edited)
@prahal The current helios64-u-boot-edge (2023-Dec-28 08:32) is supposed to include the rockchip DDR blob, but unfortunately stable operation of helios64 is still not possible with it: the r8152 is reset very frequently if this bootloader is used (contrary to linux-u-boot-edge-helios64_22.02.1_arm64.deb, where the r8152 is reset only occasionally under load).
Edited December 28, 2023 by ebin-dev
prahal Posted December 30, 2023
On 12/28/2023 at 4:32 PM, ebin-dev said: Current helios64-u-boot-edge (2023-Dec-28 08:32) is supposed to include the rockchip DDR blob, but unfortunately stable operation of helios64 is still not possible with it: the r8152 is reset very frequently if this bootloader is used (contrary to linux-u-boot-edge-helios64_22.02.1_arm64.deb, where the r8152 is reset only occasionally under load).
Thank you for the feedback (I do not use the r8152). Do you mean you have issues with only the r8152 2.5Gb interface? Or that this is the most obvious issue with the latest u-boot? Sorry, I don't know: is linux-u-boot-edge-helios64_22.02.1_arm64.deb the fully mainline u-boot from before I restored the rockchip DDR blob, or a completely different u-boot (is it based on u-boot 2022.07?) I mean, does restoring the rockchip DDR blob cause a regression, or do you mean that even with that workaround the current u-boot is still less stable than a much older u-boot?
ebin-dev Posted December 30, 2023 (edited)
38 minutes ago, prahal said: Do you mean you have issues with only the r8152 2.5Gb interface?
The only real issue I had was the r8152 driver for the 2.5G interface. Under load the mainline r8152 driver was reset by the NETDEV watchdog, more or less often. If I use linux-u-boot-edge-helios64_22.02.1_arm64.deb to boot from emmc, I do not observe any problems anymore (see the parallel thread). I assume that the rockchip DDR blob is used. However, with the latest u-boot the mainline r8152 driver was reset multiple times during a single download. Maybe that version still does not contain the rockchip DDR blob...
Edited December 30, 2023 by ebin-dev
prahal Posted December 30, 2023 (edited)
4 hours ago, ebin-dev said: However using the latest u-boot the mainline r8152 driver was reset multiple times during a single download. May be that version still does not contain the rockchip DDR blob...
You can easily tell if you are running u-boot with the DDR blob. In that case the u-boot serial output starts with: `DDR Version 1.25 20210517 In soft reset SRX channel 0 CS = 0` (the full mainline u-boot starts with a TPL message, if I remember correctly). What might also matter is the ATF shipped with the u-boot deb. You can tweak which ATF release is used in the Armbian build framework. I would also be interested in knowing the u-boot and ATF versions in linux-u-boot-edge-helios64_22.02.1_arm64.deb (I guess the DDR blob version is 1.25 as above, as it does not seem to have been upgraded in a long time, but tell me if it is not 1.25). For the u-boot version, there is a line about SPL in the serial output: `U-Boot SPL 2022.07-armbian (Dec 20 2023 - 09:16:29 +0000)` The ATF version is given after BL31: `NOTICE: BL31: lts-v2.8.8(release):armbian NOTICE: BL31: Built : 09:16:22, Dec 20 2023` Note that the ATF runs at runtime: the kernel calls it. Also, do you always run with serial attached? Just to check that it is not related to my stability issues. I will not be able to test the r8152 stability, as I have not even made the soldering fix for it (I have an early helios64 board). To be complete, could you test the full mainline u-boot, i.e. the latest one from before I reintroduced the DDR binary blob, to check whether the r8152 behaved the same before and after I added the rockchip DDR blob back?
Edited December 30, 2023 by prahal (ask more details)
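The markers described above can also be pulled out of a captured serial log automatically. The following is only a sketch: the function name and the dictionary keys are made up for illustration, and the regular expressions are derived from the sample lines quoted in this thread, so other builds may print slightly different banners.

```python
import re


def summarize_boot_log(log: str) -> dict:
    """Extract DDR blob, TPL, SPL and BL31 (ATF) versions from serial output.

    A value of None means the corresponding banner was not found.
    """
    patterns = {
        # Printed by the rockchip DDR blob, e.g. "DDR Version 1.25 20210517"
        "ddr_blob": r"DDR Version (\S+)",
        # Assumed banner of a mainline TPL build (no DDR blob)
        "tpl": r"U-Boot TPL (\S+)",
        # e.g. "U-Boot SPL 2022.07-armbian (Dec 20 2023 - 09:16:29 +0000)"
        "spl": r"U-Boot SPL (\S+)",
        # e.g. "NOTICE: BL31: lts-v2.8.8(release):armbian"
        "bl31": r"NOTICE:\s+BL31:\s+(\S*v[\d.]+\(\w+\):\S+)",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log)
        result[key] = match.group(1) if match else None
    return result
```

A log where "ddr_blob" is set and "tpl" is None would indicate a u-boot built with the rockchip DDR blob, under the assumptions stated above.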
ebin-dev Posted December 30, 2023 (edited)
6 hours ago, prahal said: You can easily tell if you run uboot with the DDR blob.
Not so easy: Helios64 is installed in a 10" rack in the basement 2m above the floor (so that it can't be easily accessed) and it is frequently used by all my family members 🙂.
Current bootloader (linux-u-boot-edge-helios64_22.02.1_arm64.deb):
DDR Version 1.25 20210517
In
soft reset
SRX
channel 0 CS = 0 MR0=0x18 MR4=0x1 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0
channel 1 CS = 0 MR0=0x18 MR4=0x1 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0
channel 0 training pass!
channel 1 training pass!
change freq to 416MHz 0,1
Channel 0: LPDDR4,416MHz Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
Channel 1: LPDDR4,416MHz Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB
256B stride
channel 0 CS = 0 MR0=0x18 MR4=0x1 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0
channel 1 CS = 0 MR0=0x18 MR4=0x1 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0
channel 0 training pass!
channel 1 training pass!
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
change freq to 856MHz 1,0
ch 0 ddrconfig = 0x101, ddrsize = 0x40
ch 1 ddrconfig = 0x101, ddrsize = 0x40
pmugrf_os_reg[2] = 0x32C1F2C1, stride = 0xD
ddr_set_rate to 328MHZ
ddr_set_rate to 666MHZ
ddr_set_rate to 928MHZ
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
ddr_set_rate to 416MHZ, ctl_index 0
ddr_set_rate to 856MHZ, ctl_index 1
support 416 856 328 666 928 MHz, current 856MHz
OUT
Boot1 Release Time: May 29 2020 17:36:36, version: 1.26
CPUId = 0x0
ChipType = 0x10, 449
SdmmcInit=2 0
BootCapSize=100000
UserCapSize=14910MB
FwPartOffset=2000 , 100000
mmc0:cmd8,20 mmc0:cmd5,20 mmc0:cmd55,20 mmc0:cmd1,20
mmc0:cmd8,20 mmc0:cmd5,20 mmc0:cmd55,20 mmc0:cmd1,20
mmc0:cmd8,20 mmc0:cmd5,20 mmc0:cmd55,20 mmc0:cmd1,20
SdmmcInit=0 1
StorageInit ok = 69151
SecureMode = 0
SecureInit read PBA: 0x4
SecureInit read PBA: 0x404
SecureInit read PBA: 0x804
SecureInit read PBA: 0xc04
SecureInit read PBA: 0x1004
SecureInit read PBA: 0x1404
SecureInit read PBA: 0x1804
SecureInit read PBA: 0x1c04
SecureInit ret = 0, SecureMode = 0
atags_set_bootdev: ret:(0)
GPT 0x3335db8 signature is wrong
recovery gpt...
GPT 0x3335db8 signature is wrong
recovery gpt fail!
Trust Addr:0x4000, 0x58334c42
No find bl30.bin
No find bl32.bin
Load uboot, ReadLba = 2000
Load OK, addr=0x200000, size=0xea92c
RunBL31 0x40000 @ 97786 us
NOTICE: BL31: v1.3(release):845ee93
NOTICE: BL31: Built : 15:51:11, Jul 22 2020
NOTICE: BL31: Rockchip release version: v1.1
INFO: GICv3 with legacy support detected. ARM GICV3 driver initialized in EL3
INFO: Using opteed sec cpu_context!
INFO: boot cpu mask: 0
INFO: plat_rockchip_pmu_init(1196): pd status 3e
INFO: BL31: Initializing runtime services
WARNING: No OPTEE provided by BL2 boot loader, Booting device without OPTEE initialization. SMC`s destined for OPTEE will return SMC_UNK
ERROR: Error initializing runtime service opteed_fast
INFO: BL31: Preparing for EL3 exit to normal world
INFO: Entry point address = 0x200000
INFO: SPSR = 0x3c9
U-Boot 2021.07-armbian (Feb 27 2022 - 08:44:53 +0000)
SoC: Rockchip rk3399
Reset cause: RST
DRAM: 3.9 GiB
PMIC: RK808
SF: Detected w25q128 with page size 256 Bytes, erase size 4 KiB, total 16 MiB
MMC: mmc@fe320000: 1, sdhci@fe330000: 0
Loading Environment from MMC... *** Warning - bad CRC, using default environment
If you could point me towards a version of a current u-boot that was built taking into account your recent pull requests, I will give it another try.
Edited December 30, 2023 by ebin-dev
prahal Posted January 3, 2024
@ebin-dev I will test the latest u-boot from https://fi.mirror.armbian.de/beta/pool/main/l/linux-u-boot-helios64-edge/ and tell you whether it has the rockchip DDR blob as soon as I can. Your current u-boot, linux-u-boot-edge-helios64_22.02.1_arm64.deb, has the same rockchip DDR blob as the one I put back in the latest merge request. But your ATF (which is called by the Linux kernel at runtime) is much older (version 1.3 from July 2020, while the current ATF LTS is version 2.8) and seems to have rockchip tweaks. Your u-boot is v2021.07. My Helios64 suffers from random crashes at runtime. I will try with the ATF you have. Thanks for providing your versions. Do you get any Linux oops, say once a month, or is helios64 perfectly stable with your setup? (I mean apart from the r8152 triggering the netdev watchdog, that is, a plain crash that requires a reboot to restore functionality.)
ebin-dev Posted January 3, 2024
@prahal Linux 6.6.8 with linux-u-boot-edge-helios64_22.02.1_arm64 has been in use since December 23rd without any Linux oops (although the NETDEV watchdog occasionally has to reset the mainline r8152 driver during iperf3 stress tests, but not during normal operation). @alchemist observed, however, that NFS causes issues with 6.6.8 but not with 6.1.70, though that would not appear to be Helios64 specific. My use case: 24/7 as a DNS server, file server, nextcloud server, music server, plex server, and for home automation. I kept everything simple (i.e. ext4 file system, no NFS).
prahal Posted January 8, 2024 Posted January 8, 2024 (edited) @ebin-dev I confirm that latest u-boot https://fi.mirror.armbian.de/beta/pool/main/l/linux-u-boot-helios64-edge/ has the rockchip DDR. ! You might want to wait as it seems uboot compiling is broken in armbian ! You could test with rockchip ATF blob too (which I guess is what is inside `linux-u-boot-edge-helios64_22.02.1_arm64.deb`). To do so edit `config/boards/helios64.csc` in armbian build clone and replace `BOOT_SCENARIO="tpl-blob-atf-mainline"` by `BOOT_SCENARIO="spl-blobs"` (if you details check the comments in `config/sources/families/include/rockchip64_common.inc`). Then build u-boot deb with: ./compile.sh uboot BOARD=helios64 BRANCH=edge RELEASE=bookworm After installing the deb you can install the u-boot to the emmc (even if your OS is on SD u-boot is read from emmc first by helios64, except if you set the jumper) wit: source /usr/lib/u-boot/platform_install.sh write_uboot_platform $DIR /dev/mmcblk0 (where /dev/mmcblk0 is the emmc) That would help confirm your r8192 issue is related to mainline ATF vs rockchip ATF. Edited January 9, 2024 by prahal warn that uboot compilation seems broken in January 2024 0 Quote
prahal Posted January 8, 2024 Posted January 8, 2024 (edited) @ebin-dev can you confirm your box crashed before completing this program: cpufreq-switching-2.c #include <stdio.h> #include <stdint.h> #include <stdlib.h> #include <string.h> #include <fcntl.h> #include <malloc.h> #include <unistd.h> #include <sys/mman.h> #define MAIN_LOOPS (100) #define TRIALS_PER_TOGGLE (10) #define MAX_MEGS (64) #define CPUL 0 #define CPUB 1 const char *cpul_freqs[] = { "408000", "600000", "816000", "1008000", "1200000", "1416000" }; const char *cpub_freqs[] = { "408000", "600000", "816000", "1008000", "1200000", "1416000", "1608000", "1800000" }; uint32_t *megs[MAX_MEGS]; int checked_open(char *name) { int fd = open(name, O_RDWR); char err[128]; if (fd < 0) { snprintf(err, 128, "cannot open %s", name); perror(err); exit(1); } return fd; } #define SCALING_PATHL "/sys/devices/system/cpu/cpu0/cpufreq/" #define SCALING_PATHB "/sys/devices/system/cpu/cpu4/cpufreq/" void browse_freq(int *cpul_index, int *cpub_index, int *cpul_step, int *cpub_step) { static int inited = 0; int freql_target_len; int freqb_target_len; int freqfd; int cpul_freqs_count = 0; int cpub_freqs_count = 0; cpul_freqs_count = sizeof(cpul_freqs)/sizeof(cpul_freqs[0]); cpub_freqs_count = sizeof(cpub_freqs)/sizeof(cpub_freqs[0]); if (!inited) { #if CPUL freqfd = checked_open(SCALING_PATHL "scaling_governor"); write(freqfd, "userspace", 9); close(freqfd); #endif #if CPUB freqfd = checked_open(SCALING_PATHB "scaling_governor"); write(freqfd, "userspace", 9); close(freqfd); #endif inited = 1; } if (*cpul_index >= cpul_freqs_count - 1) *cpul_step = -1; if (*cpul_index <= 0) *cpul_step = 1; if (*cpub_index >= cpub_freqs_count - 1) *cpub_step = -1; if (*cpub_index <= 0) *cpub_step = 1; *cpul_index += *cpul_step; *cpub_index += *cpub_step; #if CPUL printf("cpul_freq %s\n", cpul_freqs[*cpul_index]); freql_target_len = strlen(cpul_freqs[*cpul_index]); freqfd = checked_open(SCALING_PATHL "scaling_setspeed"); write(freqfd, 
cpul_freqs[*cpul_index], freql_target_len); close(freqfd); #endif #if CPUB printf("cpub_freq %s\n", cpub_freqs[*cpub_index]); freqb_target_len = strlen(cpub_freqs[*cpub_index]); freqfd = checked_open(SCALING_PATHB "scaling_setspeed"); write(freqfd, cpub_freqs[*cpub_index], freqb_target_len); close(freqfd); #endif } void write_test_data(int nmegs, int toggle) { int cpul_index = 0; int cpub_index = 0; int cpul_step = 1; int cpub_step = 1; while (nmegs--) { browse_freq(&cpul_index, &cpub_index, &cpul_step, &cpub_step); } } void check_test_data(int nmegs, int toggle) { int cpul_index = 0; int cpub_index = 0; int cpul_step = 1; int cpub_step = 1; while (nmegs--) { browse_freq(&cpul_index, &cpub_index, &cpul_step, &cpub_step); } } int main(int argc, char **argv) { int nmegs = MAX_MEGS; printf("allocated %dMB\n", nmegs); int nloop, ntoggle, ntrial; printf("test: toggle freq before write\n"); for (nloop = 0; nloop < MAIN_LOOPS; nloop++) { printf("\r%d/%d ", nloop, MAIN_LOOPS); fflush(stdout); write_test_data(nmegs, 1); usleep(50); check_test_data(nmegs, 0); } printf("\n"); printf("test: toggle freq before read\n"); for (nloop = 0; nloop < MAIN_LOOPS; nloop++) { write_test_data(nmegs, 0); usleep(50); for (ntrial=0; ntrial < TRIALS_PER_TOGGLE; ntrial++) { printf("\r%d/%d, %d/%d ", ntrial, TRIALS_PER_TOGGLE, nloop, MAIN_LOOPS); fflush(stdout); check_test_data(nmegs, 1); } } printf("\n"); return 0; } gcc -o cpufreq-switching-2-b cpufreq-switching-2.c then running it: sudo ./cpufreq-switching-2-b I was able to reproduce the crash even with linux-u-boot-edge-helios64_22.02.1_arm64.deb. That is rockchip ddr binary and atf and u-boot 2021.07, as well as the current one. Your box being pretty stable and mine not lasting long that would help me decipher if my board has a hardware issue or if the load I apply to the board is at fault (the electrical environment my helios64 lives in could be at play too, but that is another topic) Edited January 8, 2024 by prahal 0 Quote
ebin-dev Posted January 9, 2024 (edited)
@prahal It would appear that your system has some kind of hardware issue if it is not stable with linux-u-boot-edge-helios64_22.02.1_arm64.deb and kernel 5.15.93. In my use case it is stable even with kernel 6.6.8.
Regarding testing a potentially corrupt Armbian-built u-boot: I am a bit reluctant to undertake such endeavors. Helios64 is used 24/7 (by 5 people) and is not easily accessible (stored away in a rack somewhere in the basement). Maybe someone else could do the u-boot testing (having the board on a desk would be useful)? Otherwise I could give it a try in about 4 weeks' time, after I return from a planned absence.
Regarding the crash test that switches cpu frequencies: my system died after switching cpu frequencies about 580 times (in less than a second), with linux-u-boot-edge-helios64_22.02.1_arm64.deb on kernel 6.6.8 (see the attached log).
The output of cpufreq-info shows which cpu-frequency states are used most often. My system normally jumps almost exclusively between 600MHz <-> 1.8GHz (big cores) and between 408MHz <-> 600MHz or between 408MHz <-> 1.42GHz (little cores). The only thing I did in that context was to run sbc-bench -r, which supposedly changed some performance-related settings permanently. I think that omitting the intermediate states reduces switching between states and thus enhances responsiveness and stability while reducing the burden on the scheduler.
I don't know if this helps, but I attached the cpu frequency transition tables for cpu5 and cpu0 (after about 3h uptime):
# cat /sys/devices/system/cpu/cpu5/cpufreq/stats/trans_table
   From  :    To
         :    408000    600000    816000   1008000   1200000   1416000   1608000   1800000
   408000:         0         0         0         0         0         0         0         0
   600000:         0         0       140        13         7         7         1      1126
   816000:         0       130         0        13         3         1         2        48
  1008000:         0        15        18         0         4         2         2         1
  1200000:         0         5         6         7         0         9         3         7
  1416000:         0         3         3         4        10         0        15         9
  1608000:         0         2         1         1         8        18         0        18
  1800000:         0      1139        29         4         5         7        25         0
# cat /sys/devices/system/cpu/cpu0/cpufreq/stats/trans_table
   From  :    To
         :    408000    600000    816000   1008000   1200000   1416000
   408000:         0      1133        14         9         3      1002
   600000:      1081         0         5         3         2       134
   816000:        12         6         0        46         3        10
  1008000:         7         2        44         0        11        21
  1200000:         1         4         6        17         0        28
  1416000:      1061        79         8        10        37         0
cpufreq-switching-2-b.log
Edited January 9, 2024 by ebin-dev (log file added, transition tables added)
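To make tables like the ones above easier to compare between boards, here is a small sketch that parses the text of a trans_table file and ranks the transitions by count. It assumes the layout shown in this post (a "From : To" line, a header row of target frequencies, then one row per source frequency); the function names are made up for this example.

```python
def parse_trans_table(text: str) -> dict:
    """Parse cpufreq trans_table text into {(from_khz, to_khz): count}."""
    to_freqs = None
    counts = {}
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        if not sep:
            continue  # no colon: not a table line
        head = head.strip()
        cells = rest.split()
        if not head:
            # Header row: the part before the colon is blank
            to_freqs = [int(c) for c in cells]
        elif head.isdigit() and to_freqs is not None:
            # Data row: source frequency, then one count per target
            for to_khz, n in zip(to_freqs, cells):
                counts[(int(head), to_khz)] = int(n)
    return counts


def busiest(counts: dict, top: int = 3) -> list:
    """Return the `top` most frequent (from, to) transitions with counts."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Reading the file with `open("/sys/devices/system/cpu/cpu0/cpufreq/stats/trans_table").read()` and passing the result to `busiest(parse_trans_table(...))` would list the dominant jumps, such as the 408000 -> 600000 transition in the cpu0 table.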
OdyX Posted January 9, 2024
I did indeed manage to get one of my helios64 to crash with the above code, with linux-u-boot-edge-helios64_22.02.1_arm64.deb on kernel 5.15.93.
Armbian 23.8.1 bullseye ttyS2
[ 115.729058] Internal error: Oops: 86000005 [#1] PREEMPT SMP
[ 115.729568] Modules linked in: bluetooth unix_diag veth nft_masq nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge dm_mod ipt_REJECT nf_reject_ipv4 xt_multiport nft_compat nft_counter nf_tables nfnetlink binfmt_misc rfkill lz4hc lz4 zram raid456 async_memcpy async_raid6_recov async_pq async_xor async_tx md_mod r8152 cdc_acm snd_soc_hdmi_codec snd_soc_rockchip_i2s snd_soc_rockchip_pcm leds_pwm pwm_fan snd_soc_core gpio_charger panfrost snd_pcm_dmaengine snd_pcm gpu_sched snd_timer snd soundcore realtek rockchip_vdec(C) hantro_vpu(C) rockchip_iep rockchip_rga v4l2_h264 videobuf2_dma_contig videobuf2_vmalloc videobuf2_dma_sg v4l2_mem2mem videobuf2_memops fusb302 sg videobuf2_v4l2 videobuf2_common dwmac_rk tcpm stmmac_platform typec videodev mc stmmac pcs_xpcs adc_keys gpio_beeper cpufreq_dt ledtrig_netdev lm75 sunrpc ip_tables x_tables autofs4
[ 115.736491] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G C 5.15.93-rockchip64 #23.02.2
[ 115.737279] Hardware name: Helios64 (DT)
[ 115.737631] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 115.738252] pc : 0xffff8004080d5e8c
[ 115.738573] lr : 0xffff8004080d5e8c
[ 115.738887] sp : ffff800009df3e60
[ 115.739185] x29: ffff800009df3e60 x28: ffff00000078bb00 x27: 0000000000000000
[ 115.739826] x26: ffff800009eebc80 x25: 0000000000000001 x24: ffff000000404300
[ 115.740467] x23: 00000000000000c0 x22: ffffffffffffffd0 x21: ffff8000095504a8
[ 115.741105] x20: ffff0000f77ab980 x19: ffffffffffffffd0 x18: 0000000000000000
[ 115.741744] x17: ffff8000ee06c000 x16: ffff800009df4000 x15: 00001f1e8e1e9e92
[ 115.742384] x14: 00000000000003f6 x13: 0000000000000056 x12: 0000000000000000
[ 115.743023] x11: 0000000000000001 x10: 0000000000000000 x9 : 0000000000000056
[ 115.743662] x8 : ffff0000f77aba00 x7 : ffff0000f77aba30 x6 : 0000000000000001
[ 115.744301] x5 : ffff8000ee06c000 x4 : 0000000000010002 x3 : 000000000001b663
[ 115.744940] x2 : ffffffffffffa88d x1 : 00000000ffff4b2f x0 : 000000000000d5ba
[ 115.745580] Call trace:
[ 115.745802] 0xffff8004080d5e8c
[ 115.746088] flush_smp_call_function_queue+0x114/0x250
[ 115.746557] generic_smp_call_function_single_interrupt+0x14/0x20
[ 115.747103] ipi_handler+0x7c/0x340
[ 115.747423] handle_percpu_devid_irq+0xa0/0x240
[ 115.747830] handle_domain_irq+0x90/0xd8
[ 115.748187] gic_handle_irq+0xb8/0x134
[ 115.748528] call_on_irq_stack+0x28/0x50
[ 115.748883] do_interrupt_handler+0x58/0x68
[ 115.749261] el1_interrupt+0x30/0x78
[ 115.749585] el1h_64_irq_handler+0x18/0x28
[ 115.749954] el1h_64_irq+0x74/0x78
[ 115.750261] arch_cpu_idle+0x18/0x28
[ 115.750584] default_idle_call+0x40/0x184
[ 115.750949] do_idle+0x1fc/0x270
[ 115.751245] cpu_startup_entry+0x28/0x50
[ 115.751602] secondary_start_kernel+0x164/0x178
[ 115.752011] __secondary_switched+0x90/0x94
[ 115.752396] Code: bad PC value
[ 115.752677] ---[ end trace 0ceb9c6e6a618ff5 ]---
[ 115.753092] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 115.753699] SMP: stopping secondary CPUs
[ 116.920717] SMP: failed to stop secondary CPUs 0,5
[ 116.921146] Kernel Offset: disabled
[ 116.921458] CPU features: 0x800820f1,20000846
[ 116.921847] Memory Limit: none
[ 116.922129] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
prahal Posted January 11, 2024 Posted January 11, 2024 @ebin-dev@OdyX if time permits you could try changing "CPUL" to "1" and "CPUB" to "0" in my above code ("#define CPUL 1" for example). Running the program on cpu_l (slower 4 CPUs) should not crash. you can then compile as: gcc -o cpufreq-switching-2-l cpufreq-switching-2.c and run it. @ebin-dev if yours crashes in one second it seems my hardware is as stable as your board... sad. If it was a matter of soldering a component or even a new RK3399 CPU I would have tried. I believe that the fact any have the issue often and other sporadically has to do with the load (and maybe the mains power and ground could make it even more frequent but it is just a guess). I believe something is wrong with the cpu_b regulator or the voltage it is fed. I tested the 12V input voltage on the board and it was fine. Note that CPU big (CPU 4 and 5) loads are related to PCI/SATA and r8152 (in armbian build repository): Quote commit c242d07397ecec40bd0876054b862ad51a45b4d3 * armbian-hardware-optimization: SATA & 2.5GbE IRQ pinning on Helios64 - 2.5GbE USB LAN which is attached to XHCI, assigned to CPU4 - SATA controller assigned to CPU4 and CPU5 (I believe the r8152 assignment to CPU4 is an assignment of the whole USB3, not r8152 only). from the last Kobol team posts the cause instability of the instability is unknown Any told it was DFS, ie the instability would not come from the frequencies per se that are set during the transitions but the speed between these transitions (from the odroid forum post https://forum.odroid.com/viewtopic.php?t=30303 the bigger the frequency switch at once the more unstable). However this remains to be confirmed that this is what makes the big CPU on our board unstable, the Odroid n1 post and @piter75 patchset for NanoPi M4V2 were about the little cores, not the big ones. 
That is, they added "max-buck-steps-per-change = 4;" to help with instability, but this setting applies to the rk808-D regulator, which as far as I can tell only affects the little CPU cluster, i.e. not the A72 CPUs (I have not yet tested whether the little cluster is unstable without this setting, though). As confirmed by the patch submitter @piter75, these max-buck-steps-per-change settings were meant to fix the little cores.

I believe the big CPU cluster is stable on other rk3399 boards (even those with the same syr827 regulator), though it is just a guess. If someone could try my cpufreq-switching-2-b test on another rk3399 board, that would help.

The Kobol team also took a patch from the Odroid team repository (https://forum.odroid.com/viewtopic.php?t=30303) which switches the vdd_cpu_b regulator-ramp-delay from 1000 to 40000 to improve stability ... though I believe they misunderstood it (the aim of the Odroid patch was to speed up the transitions, because testing showed the CPU was still stable). Increasing regulator-ramp-delay does not lengthen the delay between frequency transitions but speeds them up, thus doing the opposite of what they intended, which was to fix the instability by slowing down frequency switching. See https://patchwork.ozlabs.org/project/uboot/patch/20190216094548.911-7-krzk@kernel.org/: regulator-ramp-delay is in uV/us, i.e. it is the number of uV the regulator switches per us. Increasing it makes the switch faster. Maybe we could try the opposite, that is, lower this value and rerun the test program.
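To make the uV/us reasoning concrete, here is a minimal sketch of the arithmetic. It assumes the time needed for a voltage change is simply the voltage difference divided by the ramp rate. The 200 mV step is an illustrative assumption, not a value taken from the Helios64 operating points, and the function name is hypothetical.

```python
def ramp_time_us(delta_uv, ramp_delay_uv_per_us):
    """Microseconds needed to slew by delta_uv at the given ramp rate (uV/us)."""
    if ramp_delay_uv_per_us <= 0:
        raise ValueError("ramp rate must be positive")
    return abs(delta_uv) / ramp_delay_uv_per_us


# Hypothetical 200 mV change between two operating points (assumption).
STEP_UV = 200_000

# The two regulator-ramp-delay values discussed in the thread.
for rate in (1000, 40000):
    print(f"{rate:>6} uV/us -> {ramp_time_us(STEP_UV, rate):.1f} us")
```

Under these assumptions, the same 200 mV change takes 200 us at 1000 uV/us but only 5 us at 40000 uV/us, so a larger regulator-ramp-delay means a faster transition.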
ebin-dev Posted January 11, 2024

3 hours ago, prahal said: Increasing this regulator-ramp-delay does not up the delay between frequencies transition but fasten it (thus doing the opposite to what they meant to fix the instability that is slowing down frequency switching

This is very interesting. For regulator vdd_cpu_b, 'regulator-ramp-delay' is still set to decimal 40000 in the current dtb (6.6.8). You could try reducing that number in your dtb, which increases the delay, until your frequency switching program finishes its task. If, with the resulting value, your CPU still responds quickly enough to scheduled tasks, then you may have eliminated a source of instability. Since kernel 6.6.8 uses a more efficient scheduler, you could use that one for your experiments.

I actually do not think that the Kobol team was mistaken: their commit states that the 'existing value make clock transisition time large and could causing random kernel crash'. Therefore the regulator-ramp-delay was increased from decimal 1000 to 40000, thereby decreasing the clock transition time. This was a step in the right direction; maybe that step was just too large ...
prahal Posted January 12, 2024

@ebin-dev regarding regulator-ramp-delay, you should take the rationale in the commit that introduced this setting in the kernel as the reference, not the comment in the Kobol team commit (which states that increasing this value has slowed down the frequency switching; in my understanding they misunderstood the Odroid post https://forum.odroid.com/viewtopic.php?t=30303, which was about speeding up the transitions, not slowing them down: the poster wanted faster transitions and tested that the CPU was still stable even with a faster transition, i.e. a greater regulator-ramp-delay). As the Linux mainline commit states, the regulator ramp delay is given in uV per us, that is, the greater it is, the more voltage is switched per unit of time.

I already reverted it to its previous value of 1000, but since the board was already unstable before the value was increased to 40000, I am not surprised it is still unstable (my program did run longer than yours, but that might be random). I will try to decrease it in the next attempt. Still, to me, something else must be at play; otherwise I do not understand why the same CPU would require very slow transitions on the Helios64 and very fast ones on the Odroid N1 😕 At best, if lowering regulator-ramp-delay works, it would be a workaround in my opinion.

I am beginning to doubt the correctness of the dts nodes set by the Kobol team (they could have set the wrong regulator type for vdd_cpu_b or the like, or maybe the wrong pinctrl definition for this regulator ...), all things that cannot be confirmed as they did not provide the schematics. I found a picture of the board without the heatsink (from the Kobol team on Twitter https://twitter.com/kobol_io/status/1281088456391667713), but I believe the picture is not detailed enough to read the marking on the syr827 regulator for cpu_b. And it would not show the wiring and pull-downs. Maybe we could ask @aprayoga, as he said in September 2021 that he would still be around (https://forum.armbian.com/topic/18844-kobol-team-is-pulling-the-plug/?do=findComment&comment=128364).

I also do not exclude DDR timings, even though, judging from the previous DDR issue (which led me to revert to the rockchip DDR settings blob in u-boot), such an issue seems to also affect userspace, and with the current instability I do not get userspace programs crashing, only kernel errors (but this is based on a single experience of a DDR settings issue).

I also want to try other things, like an ATX power supply plugged into the board instead of the power adapter (even though my multimeter showed above 12V on the board with the power adapter, power is a common cause of kernel issues on SBCs).
ebin-dev Posted January 12, 2024

@prahal There are many values to choose from between 1000 and 40000 (regulator-ramp-delay). Why don't you try 2000, 4000, 10000, 20000? (It might solve your problem.)

Edited January 12, 2024 by ebin-dev
prahal Posted January 12, 2024

@ebin-dev I am currently cleaning a backup archive on the helios64. I will test values below 1000 asap, but I do not expect much (I already had regulator-ramp-delay set to 1000 for months and it is not stable. Though it could be that regulator-ramp-delay is not the issue ... I already tried adding "regulator-settling-time-us = 5000", with no improvement).

I will also try making my test program request a frequency switch only every 5 seconds instead of every 50 microseconds. I will also try skipping some frequencies, to test whether only specific frequencies are involved.

At least with a reliable crasher (the test program above), it is easier to tell whether a setting helps or not (rather than "it did not crash for a week, so it is better", when the trigger for the crash might simply not have occurred that week).

The test program helps, but I have run out of ideas about which other settings to try. Knowing whether this test program also crashes other rk3399 boards (or that it does not) would help.

I would also like to test with the xhci and ahci interrupts moved away from the big cores. This is the main difference from other boards.
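The two variations mentioned above (a longer pause between switches, and leaving out some frequencies) could be sketched as follows. This is not the cpufreq-switching-2 program from this thread, only a rough illustration using the standard cpufreq sysfs files. It assumes the big cluster is exposed as policy4, that the "userspace" governor is available in the running kernel, and that the script is run as root. The function names are hypothetical.

```python
import time
from pathlib import Path

# Assumption: on rk3399 the big cluster (CPU4-5) is cpufreq policy4.
POLICY = Path("/sys/devices/system/cpu/cpufreq/policy4")


def available_frequencies(policy):
    """Read the frequencies (in kHz) the policy advertises, sorted ascending."""
    text = (policy / "scaling_available_frequencies").read_text()
    return sorted(int(f) for f in text.split())


def build_sequence(freqs, skip=()):
    """Frequencies to cycle through, lowest to highest, minus the skipped ones."""
    skipped = set(skip)
    return [f for f in sorted(freqs) if f not in skipped]


def run(policy, interval_s, loops, skip=()):
    """Switch through the sequence `loops` times, pausing interval_s each time."""
    seq = build_sequence(available_frequencies(policy), skip)
    # scaling_setspeed only accepts writes under the userspace governor.
    (policy / "scaling_governor").write_text("userspace")
    for _ in range(loops):
        for khz in seq:
            (policy / "scaling_setspeed").write_text(str(khz))
            time.sleep(interval_s)
```

For example, run(POLICY, interval_s=5.0, loops=100) would switch every 5 seconds, and passing skip=[...] with one or more values from scaling_available_frequencies would leave those frequencies out of the cycle.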
snakekick Posted January 22, 2024

Hello @prahal and @ebin-dev, any news on your stability tests? I also have occasional freezes and reboot problems, so I'm very curious to see if anything will change. Thank you very much for your commitment!

Edited January 23, 2024 by snakekick
Trillien Posted April 19, 2024

Hi @prahal, I've just done a test with your cpufreq-switching-2 program. I'm running the Helios64 on Armbian 23.08.0-trunk Bookworm with Linux 6.6.8-edge-rockchip64.

I started with LITTLE (CPUL = 1): the program ran the 100 loops without issue.
Then I ran it on big (CPUB = 1): it failed at the 6th loop.

Before a third run, I tried changing the interrupt allocation for xhci and ahci as you suggested. Please note that the interrupt numbers may vary after a reboot (e.g. ahci was 76-80; after reboot it is 75-79).

# cat /proc/interrupts
       CPU0  CPU1  CPU2  CPU3  CPU4  CPU5
  18:     0     0     0     0     0     0  GICv3                 25 Level vgic
  20:     0     0     0     0     0     0  GICv3                 27 Level kvm guest vtimer
  23:  7947  8876  6014  7156 18916 24271  GICv3                 30 Level arch_timer
  25:  6601  5232  4476  4609 11249  4343  GICv3                113 Level rk_timer
  31:     0     0     0     0     0     0  GICv3                 37 Level ff6d0000.dma-controller
  32:     0     0     0     0     0     0  GICv3                 38 Level ff6d0000.dma-controller
  33:     0     0     0     0     0     0  GICv3                 39 Level ff6e0000.dma-controller
  34:     0     0     0     0     0     0  GICv3                 40 Level ff6e0000.dma-controller
  36:   915     0     0     0     0     0  GICv3                132 Level ttyS2
  37:     0     0     0     0     0     0  GICv3                147 Level ff650800.iommu
  38:     0     0     0     0     0     0  GICv3                149 Level ff660480.iommu
  39:     0     0     0     0     0     0  GICv3                151 Level ff8f3f00.iommu, ff8f0000.vop
  40:     0     0     0     0     0     0  GICv3                150 Level ff903f00.iommu, ff900000.vop
  41:     0     0     0     0     0     0  GICv3                 75 Level ff914000.iommu
  42:     0     0     0     0     0     0  GICv3                 76 Level ff924000.iommu
  43:     0     0     0     0     0     0  GICv3                 85 Level ff1d0000.spi
  44:     0     0     0     0     0     0  GICv3                 84 Level ff1e0000.spi
  45:     0     0     0     0     0     0  GICv3                164 Level ff200000.spi
  46:  1399     0     0     0  1775     0  GICv3                142 Level xhci-hcd:usb1
  47:    30     0     0     0     0     0  GICv3                 67 Level ff120000.i2c
  48:     0     0     0     0     0     0  GICv3                 68 Level ff160000.i2c
  49:  5031     0     0     0     0     0  GICv3                 89 Level ff3c0000.i2c
  50:   540     0     0     0     0     0  GICv3                 88 Level ff3d0000.i2c
  51:     0     0     0     0     0     0  GICv3                 90 Level ff3e0000.i2c
  52:     0     0     0     0     0     0  GICv3                129 Level rockchip_thermal
  53:     0     0     0     0     0     0  GICv3                152 Edge  ff848000.watchdog
  54:     0     0     0     0     0     0  GICv3-23               0 Level arm-pmu
  55:     0     0     0     0     0     0  GICv3-23               1 Level arm-pmu
  56:     0     0     0     0     0     0  rockchip_gpio_irq      9 Edge  2-0020
  57:     0     0     0     0     0     0  rockchip_gpio_irq     10 Level rk808
  63:     0     0     0     0     0     0  rk808                  5 Edge  RTC alarm
  67:     2     0     0     0     0     0  GICv3                 94 Level ff100000.saradc
  68:     0     0     0     0     0     0  GICv3                 97 Level dw-mci
  69:     0     0     0     0     0     0  rockchip_gpio_irq      7 Edge  fe320000.mmc cd
  70:     0     0     0     0     0     0  GICv3                 81 Level pcie-sys
  72:     0     0     0     0     0     0  GICv3                 83 Level pcie-client
  74:     0     0     0     0     0     0  ITS-MSI                0 Edge  PCIe PME, aerdrv
  75:     0   489     0     0   524     0  ITS-MSI           524288 Edge  ahci0
  76:     0     0   237     0     0   904  ITS-MSI           524289 Edge  ahci1
  77:     0     0     0   489 31578     0  ITS-MSI           524290 Edge  ahci2
  78:     0     0     0     0   249     0  ITS-MSI           524291 Edge  ahci3
  79:     0     0     0     0     0   248  ITS-MSI           524292 Edge  ahci4
  83: 14093     0     0     0     0     0  GICv3                 43 Level mmc1
  84:     0     0     0     0     0     0  rockchip_gpio_irq      5 Edge  Power
  85:     0     0     0     0     0     0  rockchip_gpio_irq      3 Edge  User Button 1
  86:     0     0     0   931     0     0  GICv3                 44 Level end0
  87:     5     0     0     0     0     0  rockchip_gpio_irq      2 Level fsc_interrupt_int_n
  88:     0     0     0     0     0     0  GICv3                 59 Level rockchip_usb2phy
  89:     0     0     0     0     0     0  GICv3                135 Level rockchip_usb2phy_bvalid
  90:     0     0     0     0     0     0  GICv3                136 Level rockchip_usb2phy_id
  91:     0     0     0     0     0     0  GICv3                 60 Level ohci_hcd:usb4
  92:     0     0     0     0     0     0  GICv3                 58 Level ehci_hcd:usb3
  93:     0     0     0     0     0     0  GICv3                137 Level dwc3-otg, xhci-hcd:usb5
  94:     0     0     0     0     0     0  GICv3                 32 Level rk-crypto
  95:     0     0     0     0     0     0  GICv3                146 Level ff650000.video-codec
  96:     0     0     0     0     0     0  GICv3                 87 Level ff680000.rga
  97:     0     0     0     0     0     0  GICv3                145 Level ff650000.video-codec
  98:     0     0     0     0     0     0  GICv3                148 Level ff660000.video-codec
  99:     0     0     0     0     0     0  rockchip_gpio_irq      2 Edge  gpio-charger
 100:     0     0     0     0     0     0  rockchip_gpio_irq     27 Edge  gpio-charger
 101:     2     0     0     0     0     0  GICv3                 51 Level panfrost-gpu
 102:     0     0     0     0     0     0  GICv3                 53 Level panfrost-mmu
 103:     0     0     0     0     0     0  GICv3                 52 Level panfrost-job
IPI0:  1384  1517  1472  1311  4816  7551  Rescheduling interrupts
IPI1: 12225 10971  9100  9240 10161 26978  Function call interrupts
IPI2:     0     0     0     0     0     0  CPU stop interrupts
IPI3:     0     0     0     0     0     0  CPU stop (for crash dump) interrupts
IPI4:  2213  2003  2357  2402  2137  1671  Timer broadcast interrupts
IPI5:   598   601   747   496  1106   784  IRQ work interrupts
IPI6:     0     0     0     0     0     0  CPU wake-up interrupts
 Err:     0

I reallocated the interrupts to the little cores:

# echo 0 > /proc/irq/46/smp_affinity_list
# echo 1 > /proc/irq/75/smp_affinity_list
# echo 2 > /proc/irq/76/smp_affinity_list
# echo 3 > /proc/irq/77/smp_affinity_list
# echo 0 > /proc/irq/78/smp_affinity_list
# echo 1 > /proc/irq/79/smp_affinity_list

Then I ran the program on big again (CPUB = 1), and it reached the 25th loop before it failed.

Edited April 19, 2024 by Trillien
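Because the IRQ numbers can change after a reboot, the reassignment above could be derived from the interrupt names rather than hard-coded numbers. The following is a minimal sketch under stated assumptions: the little cores are CPU0-3, the device names are the ones shown in the listing above, and the helper names are hypothetical. Writing to smp_affinity_list requires root.

```python
import re
from pathlib import Path

LITTLE_CPUS = (0, 1, 2, 3)  # assumption: rk3399 little cores are CPU0-3


def find_irqs(interrupts_text, pattern):
    """Return numeric IRQs whose last field (the action name) matches pattern."""
    rx = re.compile(pattern)
    irqs = []
    for line in interrupts_text.splitlines():
        m = re.match(r"\s*(\d+):", line)  # skips the header, IPI and Err rows
        if m and rx.search(line.split()[-1]):
            irqs.append(int(m.group(1)))
    return irqs


def plan_affinity(irqs, cpus=LITTLE_CPUS):
    """Spread the IRQs round-robin over the given CPUs."""
    return [(irq, cpus[i % len(cpus)]) for i, irq in enumerate(irqs)]


def apply_plan(plan, proc_root="/proc"):
    """Write each assignment to /proc/irq/<n>/smp_affinity_list (needs root)."""
    for irq, cpu in plan:
        Path(proc_root, "irq", str(irq), "smp_affinity_list").write_text(str(cpu))
```

With the listing above, find_irqs(Path("/proc/interrupts").read_text(), r"^(xhci-hcd:usb1|ahci\d+)$") selects IRQs 46 and 75 to 79, and plan_affinity then yields the same six assignments as the echo commands. The anchored pattern deliberately leaves out IRQ 93 (dwc3-otg, xhci-hcd:usb5), which was not moved in the test.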