
prahal

Everything posted by prahal

  1. Don't expect much yet. If you were already on 6.6 you should not see any changes. "Official support" means that I took over maintainership. I have a fix for eMMC HS400 pending, but it is not in Armbian yet, and a fix for a regression in the network interface MAC address, but only in 6.9 and up (Armbian edge). There is no fix yet for the frequency policy. Tentative voltage hacks in the dtb for the big CPUs seem to help, but nothing is in Armbian at this stage.
  2. Hi. I am interested as I want to ease maintaining Helios64 by getting a second board. Would you be shipping to France?
  3. I guess this is one core not responding anymore, likely CPU 5 (one of the big cores). Which kernel do you run? Is this the first time you have encountered this bug? You might want to run ebin-dev's dtb (it includes voltage hacks for the big CPUs).
  4. What kind of additional communication from the serial terminal? The boot parameter I am thinking of might not be applicable to a USB SATA adapter. It is libata.force=3.0Gbps or libata.force=1.5Gbps, but I have only used it for PCIe SATA. You might want to try a dual-power USB 3 cable to get enough current for the USB M.2 data drive. Have you tried ebin-dev's patched DTS with your USB M.2 SATA device attached (it might require the edge kernel though)? Edit: can you try with your USB adapter plugged into a powered USB 3 hub, itself plugged into the helios64?
  5. $ grep I2C_RK3X /boot/config-6.9.3-edge-rockchip64
     CONFIG_I2C_RK3X=y

     I2C is a bus that lets a microprocessor communicate with other circuits. It is enabled and cannot be disabled on Armbian without rebuilding your kernel. If I remember correctly, it is even required on the helios64 for the PMIC (the circuitry that controls power, reboot, etc.). You are likely getting I2C errors because the board became unstable at the hardware level.
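The grep above can also be done programmatically, which is handy when checking several options at once. This is only a minimal sketch: the function name and the sample text are made up for illustration, and on a real box the text would be read from the /boot/config-* file. A value of "y" means the driver is built into the kernel image (so it cannot be unloaded), "m" means it is a module.

```python
def kernel_config_value(config_text, option):
    """Return the value of a CONFIG_ option ("y", "m", ...), or None if unset."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == f"# {option} is not set":
            return None
    return None


# Hypothetical excerpt; only the CONFIG_I2C_RK3X line comes from the post above.
SAMPLE_CONFIG = """\
CONFIG_I2C=y
CONFIG_I2C_RK3X=y
# CONFIG_I2C_DEBUG_CORE is not set
"""

print(kernel_config_value(SAMPLE_CONFIG, "CONFIG_I2C_RK3X"))
```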
  6. Did you mean you also experienced a crash "with" the patch when you plugged in a SATA to USB3 device? (You wrote "without the patch".) Do you know the current draw of your device? (Is the device an enclosure with a drive that was not bundled with it? Is it a 2.5 inch HDD?)
  7. The USB board is USB 3.0, so it should be limited to 0.9 A at 5 V. Your SSD is rated 1.5 A at 3.3 V, i.e. 4.95 W, versus the USB 3.0 maximum of 4.5 W, though I believe the SSD does not always draw its maximum. It could be that later kernels drive the SSD to its maximum and so make it draw too much. That does not necessarily mean there is a bug in the newer kernel. It could also be that this extra consumption only destabilizes the board at boot, because other components on the board also draw more current at boot. There might be tests that can help sort this out. I think there might be ways to lower the current consumption of the M.2 SSD (maybe by lowering the libata link speed via a kernel boot parameter). We could also try to find another USB 3 device that stretches the limit, and check whether the behavior under 5.10 and 6.6 is reproducible. Can you give a link to the SSD you put on the USB M.2 board? If it is not too expensive I could try to reproduce the setup. I have a USB multimeter and could check whether the USB 3 limit is really exceeded (and whether the current drawn differs between 5.10 and 6.6), but I won't be able to tell soon.

     The COM error likely comes from your serial console program and might be related to the helios64 board crashing on the other side. Mind that a hardware hang just freezes the board; no messages are output. In the cases I encountered, it meant the voltage was too low for the load. When you say you plugged the SSD into the back USB port, do you mean you plugged it in after boot, or before boot when it failed?
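The power budget arithmetic in this post can be written down as a small sketch. The figures are the ones quoted above (0.9 A at 5 V for the port, 1.5 A at 3.3 V for the SSD label); the function names are made up for illustration, and the sketch ignores conversion losses in the USB to M.2 adapter, which would only make the margin worse.

```python
def power_watts(volts, amps):
    """Electrical power P = V * I, in watts."""
    return volts * amps


def headroom_watts(budget_w, load_w):
    """Positive when the load fits in the budget, negative when it exceeds it."""
    return budget_w - load_w


usb3_budget = power_watts(5.0, 0.9)   # USB 3.0 port limit quoted in the post
ssd_peak = power_watts(3.3, 1.5)      # SSD label rating quoted in the post

print(f"USB 3.0 budget: {usb3_budget:.2f} W")
print(f"SSD peak draw:  {ssd_peak:.2f} W")
print(f"Headroom:       {headroom_watts(usb3_budget, ssd_peak):+.2f} W")
```

A negative headroom only says the label rating exceeds the port limit; whether the drive actually reaches that draw is what the USB multimeter measurement would show.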
  8. My change was about restoring the eth0/end0 MAC address to its intended value (i.e. grabbed from OTP via SPI, as it was designed to work until it was broken at some point). It does not change the interface names. I warned about it because anyone who had set a static DHCP lease on the non-OTP MAC that had been in use for a few years would find that the lease no longer applies. The change is included in 6.9 and up; the issue reported above is for 6.8 (though it is likely a user space issue, as interface renaming is done by udev in user space). So I guess @crosser upgraded from a kernel with the initial behavior I restored to an intermediate one which did not grab the MAC from the OTP. Note also that I did not apply these changes to the 6.6 kernel (-current), only to -edge. The MAC address and the interface rename issue are unrelated.
  9. Thanks for the logs. I don't know what is wrong with your boot. You said that your eMMC setup broke (what you call MMC, I guess)? It was probably the eMMC breakage that affected most rk3399 boards and requires a property to be added for eMMC HS400 to boot. Since then HS400 has been disabled. I will re-enable it after adding the property to the helios64 Armbian dts. If I understood correctly, you would also like to boot from SD (and you are currently booting from eMMC?). But when you plug a USB external SSD (that you use to store downloads, temp files, and anything not OS related) into the front USB socket, boot fails (you said "I use the USB header for the front panel for an internal SSD"; I guess you mean an external USB SSD, not an internal SSD). And you have 5 disks in the internal SATA slots as a RAID6. Mind that the bootloader will stay on eMMC, but you can move the OS to SD. This likely won't solve your boot problem with a USB external SSD plugged into the front USB socket... Mind that I have multiple USB external HDDs plugged into the back USB socket and boot works. What current does the external USB drive you plug into the front USB socket require? (This socket can output at most 900 mA.) ff3d0000.i2c is related to USB-C: [ 6.412784] OF: graph: no port node found in /i2c@ff3d0000/typec-portc@22
  10. It might be of interest in case it is not the same random corruption, because then we would be able to fix the kernel. With the random corruption (which I believe happens at the CPU level), most of the time we get a weird unrecognized instruction, but the issue still looks random (even if it is far more likely during a btrfs scrub or a zfs check). I really need to talk to the hardware people at Armbian to sort out what to take note of (USB devices, power bank, PCI stress, ...). Either way, you are much better off with a USB-C to serial cable connected to a computer to get the logs. You can even save the output of your serial terminal application with the "script" command, maybe in a tmux/screen session.
  11. Mind that I am uneasy about pushing the voltage changes to main until either I get no more crashes for a long, long time or the issue is nailed down to its cause. I might revisit this choice. I plan to get the eMMC HS400 fix into Armbian and vanilla Linux. Probably not for the August Armbian release, but I expect it for the next one.
  12. Adding more clues from your issue: this u-boot is, I guess, mainline u-boot and ATF with the rockchip DDR blob. linux-u-boot-edge-helios64_22.02.1 is likely the rockchip ATF, miniloader and DDR blob. It seems linux-u-boot-edge-helios64_22.02.1_arm64 is no longer available for download? Either way, this is a binary blob, so it could only be used to test feature parity.

      $ apt policy linux-u-boot-helios64-edge
      linux-u-boot-helios64-edge:
        Installed: 24.5.1
        Candidate: 24.5.1
        Version table:
       *** 24.5.1 500
              500 http://apt.armbian.com bookworm/main arm64 Packages
              100 /var/lib/dpkg/status
           24.2.1 500
              500 http://apt.armbian.com bookworm/main arm64 Packages

      The link you provided looks like a match for this issue. I noted that

      rockchip-i2s ff8a0000.i2s: Could not register PCM

      is printed each time a USB device is plugged in (probably also when it is probed at boot). Also, for a long time I have noticed:

      juin 08 16:18:53 helios64 kernel: platform ff1e0000.spi: deferred probe pending: (reason unknown)
      juin 08 16:18:53 helios64 kernel: platform ff200000.spi: deferred probe pending: (reason unknown)
      juin 08 16:18:53 helios64 kernel: platform ff8a0000.i2s: deferred probe pending: (reason unknown)
      juin 08 16:18:53 helios64 kernel: amba ff6d0000.dma-controller: deferred probe pending: (reason unknown)
      juin 08 16:18:53 helios64 kernel: amba ff6e0000.dma-controller: deferred probe pending: (reason unknown)
      juin 08 16:18:53 helios64 kernel: platform ff1d0000.spi: deferred probe pending: (reason unknown)

      at the end of the kernel boot log (i.e. the kernel giving up trying to load the drivers for these). This seems related too.
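To compare boots, it can help to pull the list of devices stuck in deferred probe out of a saved log. The following is a rough sketch, assuming the log lines keep the "kernel: <bus> <device>: deferred probe pending" shape shown in this post; the regular expression and function name are made up for illustration.

```python
import re

# Matches e.g. "kernel: platform ff1e0000.spi: deferred probe pending: (reason unknown)"
DEFERRED = re.compile(r"kernel: (?P<bus>\w+) (?P<dev>\S+): deferred probe pending")


def deferred_devices(log_text):
    """Return (bus, device) pairs for every deferred-probe line in log_text."""
    return [(m["bus"], m["dev"]) for m in DEFERRED.finditer(log_text)]


# Two of the lines quoted in the post, used as sample input.
SAMPLE_LOG = (
    "juin 08 16:18:53 helios64 kernel: platform ff1e0000.spi: "
    "deferred probe pending: (reason unknown)\n"
    "juin 08 16:18:53 helios64 kernel: amba ff6d0000.dma-controller: "
    "deferred probe pending: (reason unknown)\n"
)

for bus, dev in deferred_devices(SAMPLE_LOG):
    print(bus, dev)
```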
  13. prahal

  14. Linux kernel ML discussion about an upstream fix WIP as of the 11th of June 2024: https://lore.kernel.org/lkml/20240326-rk-default-enable-strobe-pulldown-v1-3-f410c71605c0@folker-schwesinger.de/t/
  15. Started work on syncing the helios64 dts to upstream for 6.9: https://github.com/prahal/build/tree/helios64-6.9 . I removed the overclock-disabling patch, as it disables an overclock that is, at least nowadays, not in the included rk3399-opp.dtsi (i.e. cluster0 has no opp6 and cluster1 no opp8). It was not a high priority beforehand, but now that the helios64 dts is starting to change, the patch causes unnecessary work. The patchset applies. The helios64 dts compiles fine to dtb. The kernel built and booted (see below for the network connection, i.e. the ethernet MAC fixup). There are not many functional changes on my side (there are in the upstream dts). There are a few differences from the upstream dts that I did not bring in, as I don't know whether they are leftovers from the initial patch set or new fixups. But this could already have been an issue before 6.9 for most of these changes (there is at least one new upstream change, 93b36e1d3748c352a70c69aa378715e6572e51d1 "arm64: dts: rockchip: Fix USB interface compatible string on kobol-helios64", that I brought forward). I also brought in the vcc3v0_sd node "enable-active-high;" and "gpio = <&gpio0 RK_PA1 GPIO_ACTIVE_HIGH>;". Beware: the ethernet MAC changes (it now works as designed). Keeping the aliases from upstream restores the ability for the eth0 (then renamed to end0 by armbian-hardware-optimization) ethernet MAC, grabbed from OTP via SPI in u-boot, to be applied. Thus, if you bound the MAC address that was generated by the kernel, instead of the one from the hardware, to a host and IP, you will have to find the new IP assigned by the DHCP server to connect to the helios64. I believe this is fine for edge even if not for current (so it should be good for 6.9?).
  16. @Trillien thanks, that confirms that I am not alone with a setup that crashes even with a 5 millisecond delay 🙂 If time permits, could you try with the TRANSITION_DELAY value in the test case code increased 10 times (to 50 milliseconds, i.e. 50000), then 100 times (to 500 milliseconds, i.e. 500000)?
  17. @BipBip1981 I agree, and I did not plan on doing it on my own. But phone repair shops have skilled technicians who can do it. Still, the need to replace a hardware component is a wild guess. At this point, I was merely saying that I was ready to test a hardware change on my board to find out whether the problem is a hardware issue. In the end, I believe that if we better understand what is wrong, even if it is the hardware, we might be able to work around such a hardware shortcoming in software. I would not suggest messing with the hardware to test whether it works better unless you are ready to lose the board. But mine is so unstable (probably due to my raid10 setup inherited from the helios4) that I could barely use it for years. So it is a matter of either testing whether I can get it stable or buying a new NAS and sending this helios64 to the trash. I hope to be able to tell you a good governor/frequency, but I need to test more. At least, the most reliable frequencies for the big CPUs without voltage quirks seemed to be the lowest, 408000, and the highest, 1800000. So you might want to force the "userspace" governor and "1800000" as the frequency.
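Forcing the "userspace" governor and a fixed frequency goes through the standard cpufreq sysfs files (scaling_governor, scaling_setspeed, scaling_available_frequencies). Below is a minimal sketch of that; the function name is made up, and the assumption that the big cores are exposed as /sys/devices/system/cpu/cpufreq/policy4 should be checked on your own board. It needs root, and it does not restore the previous governor afterwards.

```python
from pathlib import Path


def pin_frequency(policy_dir, khz):
    """Switch a cpufreq policy to the userspace governor and pin it to khz.

    policy_dir is the sysfs policy directory (assumed to be
    /sys/devices/system/cpu/cpufreq/policy4 for the rk3399 big cores).
    """
    policy = Path(policy_dir)
    available = (policy / "scaling_available_frequencies").read_text().split()
    if str(khz) not in available:
        raise ValueError(f"{khz} kHz is not offered by {policy}: {available}")
    # The governor has to be set first: scaling_setspeed is only
    # writable while the userspace governor is active.
    (policy / "scaling_governor").write_text("userspace")
    (policy / "scaling_setspeed").write_text(str(khz))
```

On the board this would be called as pin_frequency("/sys/devices/system/cpu/cpufreq/policy4", 1800000), again assuming that policy path is the right one.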
  18. TL;DR: yes, upping by 75mV helps drastically, but it is not enough, at least not for all frequencies. Indeed, before upping by 75mV I could not boot most of the time (only "emergency" mode boot was reliable, i.e. no raid10 and services off). But it seems 75mV is not always enough to compensate for the issue at stake. The thing is, I don't know what root issue the 75mV increase works around. It could be that 100mV is enough, but this is a value based on testing, not on a theory that calls for 75mV (it could be that the proper value is higher, or that upping the voltage only helps to cope with voltage drops, making them less likely to fall below the value at which cpub crashes). The datasheet for the cpub regulator requires a bigger capacitor on the voltage input than the one on the helios64. But the weird thing is that most rk3399 boards also use the same weak, below-spec capacitor value at this place. At my level (with little or no understanding of the hardware interactions) the next step would be to test whether my test case also crashes these other boards with the same too-low VIN capacitor... If they crash, we could guess that the design is bad and that without a bigger capacitor the regulator cannot reliably deliver the voltage for cpub. It could be that we can work around this in software, but I am not qualified to tell, at least at this point (I have read about how these components work, but I am not an expert). Mind also that I will be testing the board far less for the time being, since now that it is quite reliable I have started using it again (it had been down for months, then I extracted the motherboard to test with the least complex setup possible, in emergency mode). NB: upping the voltage makes the CPU hotter, so you might want to check the temperature values (with "sensors"). Mine were fine, way below the throttling temperature of 80°C for the rk3399, even with all of opp3 and above at 1.2V. The concern with higher voltages seems to be mostly about keeping power consumption low, but I wonder whether it has a noticeable effect on the helios64's power consumption.
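For anyone preparing a test dts by hand, the "up every big-CPU OPP by 75mV" arithmetic can be sketched as follows. Everything here is illustrative: the table values, the 1.2V ceiling and the function name are assumptions made for the example, not values taken from any shipped dts, so check them against your own rk3399-opp.dtsi before using the numbers.

```python
def raise_opp_voltages(opps_uv, delta_uv=75_000, ceiling_uv=1_200_000):
    """Return a copy of an OPP table with each voltage raised by delta_uv.

    opps_uv maps frequency in Hz to voltage in microvolts. Results are
    clamped to ceiling_uv (assumed here to be 1.2 V, the highest OPP voltage).
    """
    return {hz: min(uv + delta_uv, ceiling_uv) for hz, uv in opps_uv.items()}


# Illustrative big-cluster table (Hz -> microvolts); values are assumptions.
EXAMPLE_OPPS = {
    408_000_000: 800_000,
    1_008_000_000: 875_000,
    1_416_000_000: 1_025_000,
    1_800_000_000: 1_200_000,
}

for hz, uv in raise_opp_voltages(EXAMPLE_OPPS).items():
    print(f"{hz // 1_000_000} MHz -> {uv / 1_000_000:.3f} V")
```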
  19. @ebin-dev note that for a few days now I have had cpub opp3 and above all raised to 1.2V. With the 75mV increase, I still had the box crash around once a day.
  20. @snakekick Thanks, that seems to confirm my findings from a few months ago. Adding a 5ms delay in the test case did not prevent the crash, though it could be that the system load is at play. Maybe adding a delay at the kernel level would do. PCIe is tied to the big CPUs, so the SATA disks seem to matter (as does the ethernet port). One could try in emergency mode, passing "emergency" to the kernel (I do it with "setenv extraboardargs emergency" after halting u-boot with a key press, then entering "boot"). You will have just the root partition mounted read-only (so no network connection; a serial console is required). Then run the test. Also note that the design of the GPU regulator has the same issue as the CPU b one... (for my tests I blacklisted panfrost, i.e. the GPU driver).

      After looking at the RockPro64 board schematics, the design around the CPU b regulator is not merely similar but exactly the same as the helios64 one (the RockPro64 uses a tcs4545 regulator for cpu b and a tcs4546 for the GPU). I wonder if the easiest fix would be to pay someone to desolder the syr837 regulator and solder on a tcs4545 instead, and likewise a tcs4546 instead of the syr838 for the GPU regulator... except that these chips from Torch Chip seem nowhere to be found. Maybe rip them from a RockPro64 board.

      @aprayoga can you confirm that the Helios64 design for the rk3399 big cpu and gpu regulators is the same as the RockPro64 one? Would it make sense (and would it fix the unstable cpu_b) to desolder the syr837/syr838 and replace them with tcs4545/tcs4546? I.e. the tcs4535 datasheet I found (I am still unable to find the tcs4545 datasheet) says the tcs4535 has internal pulldowns for VSEL and EN, which the syr837 does not have; the syr837 datasheet requires a 22uF capacitor for VIN, but the helios64 has a 10uF one, like the RockPro64 does for the tcs4545. The SW pin on the helios64 has a 0.470uH inductor with 4 x 22uF capacitors, like the RockPro64 for the tcs4545 (and like the typical application in the tcs4535 datasheet, with a 0.470uH inductor and two 22uF capacitors)? Do you know a replacement for the TCS4545/TCS4546 that has closer specs than the syr837/syr838? I cannot seem to find the TCS4545/TCS4546 for sale (maybe I could buy a RockPro64 to desolder them, at least for a test)... Or could you check on your side, with a helios64 board, that the cpufreq-switching-2-b test above crashes with the syr837 but not with the tcs4545, with vanilla rk3399 opp definitions in the dts?

      Sadly, the Helios64 filled a market niche that is now left unfilled: people who do not have the know-how to go full low-wattage DIY NAS and who also cannot afford to pay 1K€ for a NAS (and who, to make things worse, might need two NASs). In the meantime, I spend a lot of time learning about DIY NAS builds, but it is still hard to get the wattage at full load (they tend to give only idle power usage). I will probably end up gambling, buying one build and praying... but with the Helios64 I had the metrics before buying.

      I found that the Rock960 has the same design for the cpu_b and gpu regulators, except for the inductor, which is 0.240uH on the Rock960 and 0.470uH on the Helios64. But it is hard to tell whether the Rock960 is stable with my cpufreq switching test for the big cpus of the rk3399; it might be that typical use of that board does not stress it as much as a raid10 on the helios64 PCIe SATA, which is tied to cpu_b... (initially it was 4 x 3TB WD Red, the old CMR model WD30EFRX-68EUZN0, from a Helios4 setup as advertised by the Kobol wiki for the Helios4... the board crashed on first boot after assembly with this raid setup). Mind that I found the Pinebook Pro also has the same design as the Helios64, this time built around the syr837/syr838... I begin to wonder whether they are all broken (it could be that the amount of stress from NAS ethernet or raid10 over PCIe is not that common) or whether this is not the issue at stake.
  21. I wonder whether upping the voltage was the correct fix (and whether it would always work). From other rk3399 board schematics and the TCS4525 datasheet, it seems the Kobol team designed the board for the TCS4525 regulator, which is used in front of the big CPU cluster in a lot of designs, and replaced the TCS4525 with the SYR837 later on (without taking into account the different recommendations for the SYR837, i.e. a 22uF capacitor on VIN instead of the 10uF for the TCS4525). All the components around the SYR837 in the helios64 schematics match the reference design for the TCS4525 (from Torch Chip, datasheet behind Chinese paywalls). I don't know whether replacing the VIN capacitor would be enough to get stable big CPUs...
  22. This should take time; how long did it take to complete? Could you paste the last 10 lines of output from the command (or even from a single run)? And maybe run the test with "time for i in $(seq 1 100); do ./cpufreq-switching-2-b ; done" to get the time it took at the end (but if it took ages, it ran fine, and you do not need to run the 100 iterations again). It might be that the test runs fine on your hardware. That would be interesting. But as I said, since it crashed once, I doubt it. One possibility is that on the first attempt you tried: and on the second you tried https://gist.github.com/prahal/8fab73325eb0d7091ad7c4627bf8e25a which has a delay of 5 milliseconds between cpub frequency transitions, while the first has no delay at this point (again, sorry, I did not notice the GitHub gist one had this 5 msec delay, which I added to test whether a delay would help). To check, you can replace "#define TRANSITION_DELAY 5000" with "#define TRANSITION_DELAY 0" and see if it crashes. If it does, that points to an issue with the delay between switching operating points for the big CPU. Do you know which kernel was running when your box crashed? Also, do you know which u-boot you have? (This requires serial console output.) Mind that you don't need to paste cpufrequtils data, because the test case bypasses the cpufrequtils settings and manages the chosen frequencies and how to switch between them on its own. Note: if you want to quote text from this forum, select it with your mouse; a "Quote selection" popover will appear above the selection, click on it. You can quote more than one selection in the same post.
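As an alternative to the shell loop, here is a rough sketch of a runner that times the whole series and keeps the last 10 lines of output from the most recent run, which are the two things asked for above. The names are made up for illustration; unlike the shell loop, it stops at the first run that exits with a non-zero status. If the board hard-crashes, the script dies with it, so it is still best run from a serial console.

```python
import subprocess
import sys
import time

# Harmless stand-in command so the loop can be tried without stressing the board.
DRY_RUN_CMD = [sys.executable, "-c", "pass"]


def run_repeatedly(cmd, runs=100, tail_lines=10):
    """Run cmd up to `runs` times, stopping at the first non-zero exit.

    Returns (completed_runs, elapsed_seconds, tail), where tail holds the
    last `tail_lines` lines of stdout from the most recent run.
    """
    start = time.monotonic()
    completed = 0
    tail = []
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        tail = result.stdout.splitlines()[-tail_lines:]
        if result.returncode != 0:
            break
        completed += 1
    return completed, time.monotonic() - start, tail
```

For the real test this would be called as run_repeatedly(["./cpufreq-switching-2-b"]), assuming the binary was built in the current directory as described in the thread.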
  23. @BipBip1981 you mean you have no crash running cpufreq-switching-2-b five times with 6.6.16 and 6.6.28? No, that is not what I expected, especially since you told me that the first time it crashed and rebooted. So which kernel did it crash with? Did you rebuild cpufreq-switching-2-b between the test that crashed and the ones that did not? You can run the test in a loop 100 times with:

      for i in $(seq 1 100); do ./cpufreqswitching/cpufreq-switching-2-b; done

      With only one opp not upped by 75mV I have seen tests crash only after 80 runs, but without any changes that seems unlikely. Could you paste the last 10 lines from a cpufreq-switching-2-b run? I could imagine that some boards have defective components... but then why did your board crash once and then no more? By the way, do not compile cpufreq-switching.c; as I told you previously, it was not the correct code for the test case. It is of no use for the issue at stake. It was a first attempt, because the Kobol team said the crash could be due to switching frequency too fast, so I tried only the extremes. But it turns out these are the most stable frequencies, and likely the only ones that survive without upping the opp voltage by 75mV. Yes, because it sets the governor to userspace to be able to force the frequency switch from code. After the run, it does not restore the cpufrequtils governor (/etc/default/cpufrequtils); "systemctl restart cpufrequtils.service" should restore it for you. You mean the cpufreq-switching-2-b that outputs thousands of lines for each run, built with "gcc -o cpufreq-switching-2-b cpufreq-switching-2.c", from: and https://gist.github.com/prahal/8fab73325eb0d7091ad7c4627bf8e25a (note there is a small diff between the two, the "usleep(50);", which should not matter).
  24. @BipBip1981 I don't understand what was not crashing on the second try just after reboot. Still, thank you for running my test case v2 (and again, sorry for first pasting you the v1, which was not the correct one to reproduce the crash). The v2 is expected to crash the board quite fast. Even if it survives one run, you should test a few runs (at least 5). Next, we need to find someone to look into the schematics to find out whether upping the voltage is the best course of action. If so, we ship the upped opp voltages in the helios64 dts.
  25. If the issue is that the CPU frequency is switched too fast, and I can reproduce the crash with a regulator-ramp-delay of 1000, then there is no point in testing anything above 1000, which would make the issue worse. regulator-ramp-delay is badly named. It is not a delay; it is a divider for the delay. The greater the regulator-ramp-delay, the faster the transition (I believe the Kobol team made this mistake, but as I also believe the issue could be something other than the delay between transitions, this is not a big deal). I still have not tried a value lower than 1000 for regulator-ramp-delay (i.e. without tweaking the opp voltages as I am currently doing).
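The "divider" point can be made concrete. In the devicetree regulator binding, regulator-ramp-delay is a ramp rate in µV/µs, so the time allowed for a voltage change is the voltage step divided by that rate. The sketch below only shows this first-order relationship; the function name and the example values are made up, and a real regulator driver may add its own settling time on top.

```python
def ramp_time_us(from_uv, to_uv, ramp_delay_uv_per_us):
    """Time in microseconds to ramp between two voltages.

    ramp_delay_uv_per_us is the regulator-ramp-delay value (µV/µs):
    a larger value means a shorter wait, not a longer one.
    """
    return abs(to_uv - from_uv) / ramp_delay_uv_per_us


# Hypothetical 0.8 V -> 1.2 V step at two different ramp rates.
for rate in (1_000, 40_000):
    print(f"regulator-ramp-delay={rate}: {ramp_time_us(800_000, 1_200_000, rate)} us")
```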