

Everything posted by prahal
-
You could check the 12V rail on the board with a voltmeter. Schematics are here: https://wiki.kobol.io/helios64/docs/ . Do you have serial console output? Though I believe that with such an issue this is highly unlikely. From the comments at https://wiki.kobol.io/helios64/led/ , the system status LED and LAN LED are software controlled and the HDD activity LEDs are hardware controlled. LED1 (system rail power) is most likely hardware controlled too.
-
Awesome, and thanks a lot for the feedback. Could you explain which side you disconnected and reconnected? The HDD side only, or the motherboard side too? I believe the HDD side is enough, but I would like a clue in case that is wrong. I believe the connectors were not clean, or a bit oxidized out of the factory (maybe the connectors were stored in an area with an aggressive climate... I am no expert, but I guess that with the parts being handled during the Covid mess, some unusual process happened). It is not that the hardware is bad, only "dirty". Here I badly lost connectivity to an HDD; extracting and reinserting it a few times seemed to clean the connectors (I also put isopropanol on them, but I don't remember if I brushed them at that time).
-
The -19 in "ata1: link is slow to respond, please be patient (ready=-19)" is ENODEV, see https://github.com/torvalds/linux/blob/6efbea77b390604a7be7364583e19cd2d6a1291b/drivers/ata/libata-core.c#L3594 , but ready is reset to 0, otherwise the function would exit before this message: https://github.com/torvalds/linux/blob/6efbea77b390604a7be7364583e19cd2d6a1291b/drivers/ata/libata-core.c#L3577 . As I said, I had bad contacts on my SATA ports; try removing and inserting the HDD in the SATA socket a few times. That might remove deposits from the contacts. On my side I also cleaned the SATA sockets with isopropanol once I had bought some (I use 99.5% isopropanol, but I don't know if a less concentrated one is OK). You might paste the complete log around the ata1 lines in the kernel log. If the issue is always with the same SATA port, you could try swapping the HDDs to see whether the issue follows the HDD or stays with the link or port. But indeed I guess the issue is hardware.
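The mapping from -19 to ENODEV can be double-checked without reading the kernel sources. This is only a minimal sketch, assuming Python 3 on a Linux machine; the helper name decode_kernel_errno is made up for illustration. Kernel functions return negative errno values, so the sign is dropped before the lookup.

```python
import errno


def decode_kernel_errno(value: int) -> str:
    """Return the symbolic errno name for a kernel-style negative return code.

    Hypothetical helper for illustration; it only wraps the standard errno table.
    """
    # Kernel code returns -ERRNO, while the table is keyed by the positive number.
    return errno.errorcode.get(abs(value), "unknown")


# The value printed as ready=-19 in the libata message.
print(decode_kernel_errno(-19))
```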
-
@BipBip1981 Could you grep for "ata", or check the logs for ata errors and paste them? Or tell us if there are no other messages with ata in the logs? Don't you have any "hard resetting link" messages in the kernel logs? On my side I once had drives that were not detected; extracting the drive from the SATA power and data socket a few times (and cleaning them with isopropanol once, which might have helped) fixed it. I believe my sockets were oxidized, though that is a wild guess. The issue is gone either way.
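If grepping by hand is tedious, the filtering can be scripted. The sketch below is only an illustration, assuming Python 3 and a kernel log saved as text (for example the output of dmesg). The function names and the sample lines are hypothetical; the timestamps and the usb line are made up, and only the two ata message texts are the ones quoted in this thread.

```python
import re
from typing import List

# Matches libata port prefixes such as "ata1:" or "ata1.00:".
ATA_LINE = re.compile(r"\bata\d+(\.\d+)?:")


def ata_messages(log_text: str) -> List[str]:
    """Return the kernel log lines that mention a libata port."""
    return [line for line in log_text.splitlines() if ATA_LINE.search(line)]


def has_hard_reset(log_text: str) -> bool:
    """True if any ata line reports a 'hard resetting link' event."""
    return any("hard resetting link" in line for line in ata_messages(log_text))


# Made-up sample for illustration only, not a real log from a helios64.
sample = (
    "[    5.1] ata1: link is slow to respond, please be patient (ready=-19)\n"
    "[    6.0] usb 1-1: new high-speed USB device\n"
    "[   10.2] ata1: hard resetting link\n"
)
for line in ata_messages(sample):
    print(line)
```

Pasting the lines such a filter returns, plus a few lines of context around them, is usually enough to tell whether the link keeps resetting.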
-
Seeing how well the voltage hacks work on your boards, I will include them in Armbian (even though I still get crashes on my own board with only this 75mV hack, albeit far fewer). But not upstream, at least until I sort out why they work. (I am close to sending the eMMC fix upstream; I only need to reread the backlog there to avoid too much back and forth, so the patch is up to the standard.) I was told to try them by a board designer who told me there was a design issue with the voltage regulator, which I am not able to sort out myself. But I checked the schematics of other rk3399 Armbian boards and, as far as I understand, they have the same design. So either all of these boards are broken and are somewhat stable for an unknown reason (maybe less stress on the big CPUs), or I misunderstood what was wrong with the helios64 hardware. I need to talk to a hardware engineer. I am also trying to sort out a few other software and hardware issues. But I expect to have those in by mid October, maybe earlier.
-
Don't expect much yet. If you were already on 6.6 you should not see changes. Official support means that I took over maintainership. I have a fix for eMMC hs400 pending but not yet in Armbian, and a fix for the network interface MAC that repairs a regression, but only in 6.9 and up (Armbian edge). There is no fix yet for the frequency policy. Tentative hacks to the big CPUs' voltages in the dtb seem to help, but nothing is in Armbian at this stage.
-
Hi. I am interested as I want to ease maintaining Helios64 by getting a second board. Would you be shipping to France?
-
What kind of additional communication from the serial terminal? The boot parameter I am thinking of might have no way to be applied to a USB SATA adapter. It is libata.force=3.0Gbps or libata.force=1.5Gbps, but I have only used it for PCIe SATA. You might want to try a dual-power USB 3 cable to get enough current for the USB M.2 data drive. Have you tried the ebin-dev patched DTS with your USB M.2 SATA device attached (it might require the edge kernel though)? Edit: can you try with your USB adapter plugged into a powered USB 3 hub, itself plugged into the helios64?
-
$ grep I2C_RK3X /boot/config-6.9.3-edge-rockchip64
CONFIG_I2C_RK3X=y
I2C is a bus that lets a microprocessor communicate with other circuits. It is enabled and cannot be disabled on Armbian without rebuilding your kernel. If I remember well, it is even required on helios64 for the PMIC (the circuitry that controls power, reboot, etc). You likely get i2c errors because the board became unstable hardware-wise.
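The same check can be scripted when several options need to be looked at. A minimal sketch, assuming Python 3 and the usual kernel config text format; the function name kernel_config_state is hypothetical. A value of "y" means the driver is built into the kernel image, which is why it cannot be turned off the way a module ("m") could be left unloaded.

```python
def kernel_config_state(config_text: str, option: str) -> str:
    """Return the value of a CONFIG_ option ('y', 'm', ...), 'not set' or 'absent'.

    Hypothetical helper for illustration; it parses kernel config text only.
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == "# " + option + " is not set":
            return "not set"
    return "absent"


# Example use on the board (path taken from the grep command in the post):
#   text = open("/boot/config-6.9.3-edge-rockchip64").read()
#   print(kernel_config_state(text, "CONFIG_I2C_RK3X"))
```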
-
Did you mean you also experienced a crash "with" the patch when you plugged a SATA to USB3 device? (You wrote "without the patch"). Do you know the current drain of your device? ( Is the device an enclosure with a non bundled drive? Is it a 2.5 inch HDD?)
-
The USB board is USB 3.0, so it should be limited to 0.9A at 5V. Your SSD is rated 1.5A at 3.3V, ie 4.95W, versus the USB 3.0 maximum of 4.5W. I believe the SSD does not always consume the maximum, though. It could be that later kernels drive the SSD to its maximum and thus make it consume too much. It is not certain that there is a bug in the newer kernel. It could also be that this extra consumption only destabilizes the board at boot, because other components on the board also draw more current at boot. There might be tests that can help sort this out. I think there might be ways to lower the current consumption of the M.2 SSD (maybe by lowering the libata link speed via a kernel boot param). We could also try to find another USB 3 device that also stretches the limit, and check whether the behavior under 5.10 and 6.6 is reproducible. Can you give a link to the SSD you put on the USB M.2 board? If it is not too expensive I could try to reproduce the setup. I have a USB multimeter and could check whether the USB 3 limit is really exceeded (and whether the current drawn differs between 5.10 and 6.6). But I won't be able to tell soon. The COM error likely comes from your serial console program. It might be related to the helios64 board crashing on the other side. Mind that a hardware hang just freezes the board; no messages are output. In the cases I encountered, it means the voltage was too low for the load. When you say you plugged the SSD into the back USB port, do you mean you plugged it in after boot, or before boot when it failed?
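The power comparison above is plain P = V x I arithmetic and can be written out as follows. This is only a sketch in Python 3; the variable names are made up, the figures are the ones quoted in the post, and it assumes the rated SSD current is a peak value and ignores conversion losses in the USB adapter.

```python
def power_w(voltage_v: float, current_a: float) -> float:
    """Electrical power in watts from voltage and current."""
    return voltage_v * current_a


# Figures quoted in the post; conversion losses in the adapter are ignored (assumption).
ssd_peak_w = power_w(3.3, 1.5)    # SSD rated 1.5 A on its 3.3 V rail
usb3_budget_w = power_w(5.0, 0.9)  # USB 3.0 port: 0.9 A at 5 V

excess_w = ssd_peak_w - usb3_budget_w
print(f"SSD peak {ssd_peak_w:.2f} W, USB 3.0 budget {usb3_budget_w:.2f} W, "
      f"excess {excess_w:.2f} W")
```

The excess only matters while the SSD actually draws its rated peak, which is why measuring the real current with a USB multimeter would be more telling than the datasheet figure.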
-
My change was about restoring the eth0/end0 MAC address to its intended value (ie grabbed from OTP via SPI, as it was designed to work until it was broken at some point). It does not change the interface names. I warned about it because if one had a static DHCP lease bound to the MAC that was in use for the last few years, rather than the OTP one, the lease would no longer apply. The change is included in 6.9 and up; the issue reported above is on 6.8 (though it is likely a user space issue, as interface renaming is done by udev in userspace). So I guess @crosser upgraded from a kernel with the initial behavior I restored to an intermediate one which did not grab the MAC from the OTP. Note also that I did not apply these changes to the 6.6 kernel (-current), only to -edge. The MAC address issue and the interface rename issue are unrelated.
-
Thanks for the logs. I don't know what is wrong with your boot. You said that your eMMC setup broke (what you call MMC, I guess)? It was probably the eMMC breakage that affected most rk3399 boards and requires a property to be added for eMMC hs400 to boot. Since then hs400 has been disabled. I will reenable it after adding the property to the helios64 Armbian dts. If I understood correctly, you would also like to boot from SD (and you are currently booting from eMMC?). But when you plug a USB external SSD (that you use to store downloads, temp files, and anything not OS related) into the front USB socket, boot fails (you said "I use the USB header for the front panel for an internal SSD"; I guess you mean an external USB SSD, not an internal SSD). And you have 5 disks in the internal SATA slots as a RAID6. Mind that the bootloader will stay on eMMC, but you can move the OS to SD. This likely won't solve your boot problem with a USB external SSD plugged into the front USB socket... Mind that I have multiple USB external HDDs plugged into the back USB socket and the boot is working. What is the amperage required by the external USB drive you plug into the front USB socket? (This socket can output 900mA max.) ff3d0000.i2c is related to usb-c:
[ 6.412784] OF: graph: no port node found in /i2c@ff3d0000/typec-portc@22
-
It might be of interest in case it is not the same random corruption; then we would be able to fix the kernel. With the random corruption (which I believe happens at the CPU stage), most of the time we get a weird unrecognized instruction, but the issue still looks random (even if it is far more likely during a btrfs scrub or zfs check). I really need to talk to the hardware guys from Armbian to sort out what to take note of (USB devices, power bank, PCI stress, ...). Either way you are much better off with a USB-C to serial cable to a computer to get the logs. You can even save the output of your serial terminal application with the "script" command, maybe in a tmux/screen session.
-
Mind that I am uneasy about pushing the voltage changes to main until I either get no more crashes for a long, long time or the issue is nailed down to its cause. I might revisit this choice. I plan to get the eMMC hs400 fix into Armbian and vanilla Linux. Probably not for the August Armbian release, but I expect for the next one.
-
Adding more clues from your issue: this u-boot is, I guess, mainline u-boot and ATF with the rockchip DDR blob. linux-u-boot-edge-helios64_22.02.1 is likely the rockchip ATF, miniloader and DDR blob. It seems linux-u-boot-edge-helios64_22.02.1_arm64 is not available for download anymore? Either way, it is a binary blob, so it could only be used to test feature parity.
$ apt policy linux-u-boot-helios64-edge
linux-u-boot-helios64-edge:
  Installed: 24.5.1
  Candidate: 24.5.1
  Version table:
 *** 24.5.1 500
        500 http://apt.armbian.com bookworm/main arm64 Packages
        100 /var/lib/dpkg/status
     24.2.1 500
        500 http://apt.armbian.com bookworm/main arm64 Packages
The link you provided looks like a match for this issue. I noted that "rockchip-i2s ff8a0000.i2s: Could not register PCM" is printed each time a USB device is plugged in (probably also when probed at boot). Also, I have long noticed the following at the end of the kernel boot log (ie the kernel giving up trying to load the drivers for these devices):
Jun 08 16:18:53 helios64 kernel: platform ff1e0000.spi: deferred probe pending: (reason unknown)
Jun 08 16:18:53 helios64 kernel: platform ff200000.spi: deferred probe pending: (reason unknown)
Jun 08 16:18:53 helios64 kernel: platform ff8a0000.i2s: deferred probe pending: (reason unknown)
Jun 08 16:18:53 helios64 kernel: amba ff6d0000.dma-controller: deferred probe pending: (reason unknown)
Jun 08 16:18:53 helios64 kernel: amba ff6e0000.dma-controller: deferred probe pending: (reason unknown)
Jun 08 16:18:53 helios64 kernel: platform ff1d0000.spi: deferred probe pending: (reason unknown)
This seems related too.
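To compare which devices end up in the "deferred probe pending" state across kernel versions, the list can be extracted from a saved log. A minimal sketch, assuming Python 3; the function name deferred_devices is hypothetical and the sample text reuses two of the lines quoted above without their timestamps.

```python
import re
from typing import List, Tuple

# Captures the bus ("platform", "amba") and the device name before the message.
DEFERRED = re.compile(r"(\w+) (\S+): deferred probe pending")


def deferred_devices(log_text: str) -> List[Tuple[str, str]]:
    """Return (bus, device) pairs for every 'deferred probe pending' message."""
    return DEFERRED.findall(log_text)


sample = (
    "helios64 kernel: platform ff8a0000.i2s: deferred probe pending: (reason unknown)\n"
    "helios64 kernel: amba ff6d0000.dma-controller: deferred probe pending: (reason unknown)\n"
)
for bus, device in deferred_devices(sample):
    print(bus, device)
```

Diffing the resulting lists between two boots is a quick way to see whether the i2s, spi and dma-controller entries always appear together.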
-
Linux kernel ML discussion about an upstream fix WIP as of the 11th of June 2024: https://lore.kernel.org/lkml/20240326-rk-default-enable-strobe-pulldown-v1-3-f410c71605c0@folker-schwesinger.de/t/
-
Started work on syncing the helios64 dts to upstream for 6.9: https://github.com/prahal/build/tree/helios64-6.9 . I removed the overclock-disabling patch, as it disables an overclock that, at least nowadays, is not in the included rk3399-opp.dtsi (ie cluster0 has no opp6 and cluster1 no opp8). Removing it was not a high priority beforehand, but as the helios64 dts starts to change, keeping it carries unnecessary work. The patchset applies. The helios64 dts compiles fine to dtb. The kernel built and booted (see below for the network connection, ie the ethernet MAC fixup). There are not many functional changes on my side (there are in the upstream dts). There are a few differences with the upstream dts that I did not bring in, as I don't know whether they are leftovers from the initial patch set or new fixups. But for most of these changes this could already have been an issue before 6.9 (there is at least one new upstream change that I brought forward: 93b36e1d3748c352a70c69aa378715e6572e51d1 "arm64: dts: rockchip: Fix USB interface compatible string on kobol-helios64"). I also brought in the vcc3v0_sd node properties "enable-active-high;" and "gpio = <&gpio0 RK_PA1 GPIO_ACTIVE_HIGH>;". Beware: ethernet MAC change (now working as designed). Keeping the aliases from upstream restores the ability for the eth0 (then renamed to end0 by armbian-hardware-optimization) ethernet MAC, grabbed from OTP via SPI in u-boot, to be applied. Thus if you bound the MAC address that was generated by the kernel, instead of the one from the hardware, to a host and IP, you will have to find the new IP assigned by the DHCP server to connect to the helios64. I believe this is fine for edge even if not for current (so it should be good for 6.9?).
-
@Trillien thanks, that confirms that I am not alone with a setup that crashes even with a 5 millisecond delay 🙂 If time permits, could you try with the TRANSITION_DELAY value in the test case code increased 10 times (to 50 milliseconds, ie 50000), then 100 times (to 500000)?
-
@BipBip1981 I agree, and I did not plan on doing it on my own. But phone repair shops have skilled technicians who can do it. Still, the need to replace a hardware component is a wild guess. At this point, I was merely saying that I was ready to test a hardware change on my board to find out whether the problem is a hardware issue. In the end, I believe that if we better understand what is wrong, even if it is the hardware, we might be able to work around such a hardware shortcoming in software. I would not suggest messing with the hardware to test if it works better unless you are ready to lose the board. But mine is so unstable (probably due to my raid10 setup inherited from the helios4) that I could barely use it for years. So it is a matter of either testing whether I can get it stable or buying a new NAS and sending this helios64 to the trash. I hope to be able to tell you a good governor/frequency, but I need to test more. At least the most reliable frequencies for the big cpu without voltage quirks seemed to be the lowest, 408000, and the highest, 1800000. So you might want to force the "userspace" governor and "1800000" as the frequency.
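Forcing the governor and frequency comes down to two writes to the cpufreq sysfs files. The sketch below is only an illustration, assuming Python 3 run as root and assuming the big cluster is exposed as /sys/devices/system/cpu/cpufreq/policy4 (on the rk3399 the big cores are usually cpu4 and cpu5, but check on your own board). The function name pin_frequency is made up.

```python
from pathlib import Path

# Assumption: policy4 is the big (cpu_b) cluster; verify on your board.
BIG_CLUSTER_POLICY = "/sys/devices/system/cpu/cpufreq/policy4"


def pin_frequency(policy_dir: str, khz: int) -> None:
    """Select the 'userspace' governor, then request a fixed frequency in kHz."""
    policy = Path(policy_dir)
    # The governor has to be switched first: scaling_setspeed is only
    # honoured while the userspace governor is active.
    (policy / "scaling_governor").write_text("userspace\n")
    (policy / "scaling_setspeed").write_text(f"{khz}\n")


# Example (as root): pin_frequency(BIG_CLUSTER_POLICY, 1800000)
```

The setting does not survive a reboot, so it is easy to back out if the board turns out to be less stable at that frequency.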
-
TL;DR: yes, upping the voltage by 75mV helps drastically, but it is not enough, at least not for all frequencies. Indeed, before upping by 75mV I could not boot most of the time (only "emergency" mode boot was reliable, ie no raid10 and services off). But it seems 75mV is not always enough to compensate for the issue at stake. The thing is, I don't know what root issue the 75mV increase works around. It could be that 100mV is enough, but this is a value based on testing, not on a theory that requires 75mV (it could be that the proper value is higher, or that upping the voltage only helps to cope with voltage drops, making them less frequently fall below the value at which cpub crashes). The datasheet for the cpub regulator requires a bigger capacitor on the voltage input than the one on the helios64. But the weird thing is that most rk3399 boards also use the same weak, below-spec capacitor value at this place. At my level (with little or no understanding of the hardware interactions), the next step would be to test whether my test case also crashes these other boards with the same too-small vin capacitor ... if they crash, we could guess that the design is bad and that without a bigger capacitor the regulator cannot reliably deliver the voltage for cpub. It could be that we can work around this in software, but I am not qualified to tell, at least at this point (I read about how these components work, but I am not an expert). Mind also that I will test the board far less for the time being, as now that it is quite reliable I have started using it again (it had been down for months, then I extracted the motherboard to test with the least complex setup possible, in emergency mode). NB: upping the voltage makes the CPU hotter, so you might want to check the temperature values (with "sensors"). Mine were fine, way below the throttling temperature of 80°C for the rk3399, even with all opp3 and above at 1.2V. The lower stock voltages seem to be mostly about keeping the power consumption low, but I wonder whether the increase has a noticeable effect on the helios64's power consumption.
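Besides "sensors", the temperatures can be read straight from the kernel's thermal sysfs files, which report millidegrees Celsius. A minimal sketch, assuming Python 3; the function names are hypothetical, and the 80°C limit is simply the throttling value quoted above, not something read from the board.

```python
from pathlib import Path
from typing import Dict

THROTTLE_C = 80.0  # throttling temperature quoted in the post (assumption)


def read_zone_temps(thermal_root: str = "/sys/class/thermal") -> Dict[str, float]:
    """Return {zone name: temperature in °C} for every thermal zone found."""
    temps = {}
    for zone in sorted(Path(thermal_root).glob("thermal_zone*")):
        temp_file = zone / "temp"
        if temp_file.is_file():
            # The kernel exposes the value in millidegrees Celsius.
            temps[zone.name] = int(temp_file.read_text().strip()) / 1000.0
    return temps


def margin_to_throttle(temps: Dict[str, float], limit_c: float = THROTTLE_C) -> Dict[str, float]:
    """Return how many °C each zone is below the given limit."""
    return {name: limit_c - value for name, value in temps.items()}


# Example on the board: print(margin_to_throttle(read_zone_temps()))
```

Running it once at idle and once under the load that usually triggers the crash shows how much headroom the raised voltages leave.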
-
@ebin-dev note that for a few days, I have upped the cpub opp3 and above to all 1.2V. I still had the box crash around once a day with 75mV.
-
@snakekick Thanks, that seems to confirm my findings from a few months ago. Adding a 5ms delay in the test case did not prevent the crash, though it could be that the system load is at play. Maybe adding a delay at the kernel level would do. PCIe is tied to the big CPUs, so the SATA disks seem to matter (as does the ethernet port). One could try in emergency mode, by passing "emergency" to the kernel (I do it with "setenv extraboardargs emergency" after halting u-boot with a key press, then entering "boot"). You will have just the root partition mounted read-only (so no network connection; a serial console is required). Then run the test. Also note that the design of the GPU regulator has the same issue as the CPU b one ... (for my tests I blacklisted panfrost, ie the GPU driver).

After looking at the RockPro64 board schematics, the design around the CPU b regulator is not merely similar but exactly the same as the helios64 one (the RockPro64 uses a tcs4545 regulator for cpu b and a tcs4546 for the GPU). I wonder if the easiest fix would not be to pay someone to desolder the syr837 regulator and solder a tcs4545 instead, and likewise a tcs4546 instead of the syr838 for the GPU regulator... except that these chips from Torch Chip seem nowhere to be found. Maybe rip them from a RockPro64 board.

@aprayoga can you confirm that the Helios64 design for the rk3399 big cpu and gpu regulators is the same as the RockPro64 one? Would it make sense (and would it fix the unstable cpu_b) to desolder the syr837/syr838 and replace them with tcs4545/tcs4546? Ie the tcs4535 datasheet I found (I am still unable to find the tcs4545 datasheet) says the tcs425 has internal pulldowns for VSEL and EN, which the syr837 does not have; the syr837 datasheet requires a 22uF capacitor on VIN, but the helios64 has a 10uF one, like the rockpro64 has for the tcs4545. The SW pin on the helios64 has a 0.47uH inductor with 4 x 22uF capacitors, like the rockpro64 for the tcs4545 (and like the typical application in the tcs4535 datasheet, with a 0.47uH inductor and two 22uF capacitors)? Do you know a replacement for the TCS4545/TCS4546 that has closer specs than the syr837/syr838? I cannot seem to find the TCS4545/TCS4546 for sale (maybe I could buy a rockpro64 to desolder them, at least for a test...). Or could you check on your side, with a helios64 board, that the cpufreq-switching-2-b test above crashes with the syr837 but not with the tcs4545, with vanilla rk3399 opp definitions in the dts?

Sadly the Helios64 filled a market that is now left unfilled: people who do not have the know-how to go full low-wattage DIY NAS and who also cannot afford to pay 1K€ for a NAS (and who might need two NASs, to make things worse). In the meantime, I spend a lot of time learning about DIY NAS, but it is still hard to get the wattage at full load (they tend to give only idle power usage). I will probably end up gambling, buying one build and praying... but with the Helios64 I had the metrics before buying.

I found that the Rock960 has the same design for the cpu_b and gpu regulators, except for the inductor, which is 0.240uH on the Rock960 and 0.470uH on the Helios64. But it is hard to tell whether the Rock960 is stable with my cpufreq switching test for the big cpus of the rk3399; it might be that typical use of that board does not stress it as much as a raid10 on the helios64 PCIe SATA, which is tied to cpu_b ... (initially it was 4 3TB WD Red drives, the old CMR model WD30EFRX-68EUZN0, from the Helios4 setup as advertised by the Kobol wiki for the Helios4... the board crashed on first boot after assembly with this raid setup). Mind that I found that the Pinebook Pro also has the same design as the Helios64, this time with the syr837/syr838 ... I begin to wonder whether they are all broken (it could be that the amount of stress of a NAS with ethernet or raid10 over PCIe is not that common) or whether this is not the issue at stake.