prahal

Members
  • Posts

    120

  1. @Trillien thanks, that confirms that I am not alone with a setup that crashes even with a 5 millisecond delay 🙂 If time permits, could you try with the TRANSITION_DELAY value in the test case code increased 10 times (to 50 milliseconds, i.e. 50000), then 100 times (to 500 milliseconds, i.e. 500000)?
  2. @BipBip1981 I agree, and I did not plan on doing it on my own. But phone repair shops have skilled technicians who can do it. Still, the need to replace a hardware component is a wild guess. At this point, I was merely saying that I am ready to test a hardware change on my board to find out whether the problem is a hardware issue. In the end, I believe that if we better understand what is wrong, even if it is the hardware, we might be able to work around the shortcoming in software. I would not suggest messing with the hardware to test if it works better unless you are ready to lose the board. But mine is so unstable (probably due to my raid10 setup inherited from the helios4) that I could barely use it for years. So it is a matter of either testing whether I can get it stable or buying a new NAS and sending this helios64 to the trash. I hope to be able to tell you a good governor/frequency, but I need to test more. At least the most reliable frequencies without voltage quirks for the big cpu seemed to be the lowest, 408000, and the highest, 1800000. So you might want to force the "userspace" governor and "1800000" as the frequency.
  3. TLDR; yes, upping by 75mV helps drastically, but it is not enough, at least not for all frequencies. Indeed, before upping by 75mV I could not boot most of the time (only "emergency" mode boot was reliable, i.e. no raid10 and services off). But it seems 75mV is not enough to compensate for the issue at stake all the time. The thing is, I don't know what root issue the 75mV increase works around. It could be that 100mV is enough, but 75mV is a value based on testing, not on a theory that requires 75mV (it could be that the proper value is higher, or that upping the voltage only helps to cope with voltage drops, making them fall less frequently below a certain value where cpub crashes). The datasheet for the cpub regulator requires a bigger capacitor on the voltage input than the one on the helios64. But the weird thing is that most rk3399 boards also use the same weak, below-spec capacitor value at this place. At my level (with barely any understanding of the hardware interactions) the next step would be to test whether my test case also crashes these other boards with the same too-low vin capacitor ... if they crash, we could guess that the design is bad and that without a bigger capacitor the regulator cannot deliver the voltage for cpub reliably. It could be that we can work around this in software, but I am not qualified to tell, at least at this point (I read about how these components work, but I am not an expert). Mind also that I will test the board way less for the time to come: now that it is quite reliable I have started using it again (it had been down for months, then I extracted the motherboard to test with the least complex setup possible, in emergency mode). NB: upping the voltage makes the CPU hotter, so you might want to check the temperature values (with "sensors"). Mine were fine, way below the throttling temp of 80°C for the rk3399, even with all opp3 and above at 1.2V. The lower stock voltages seem to be mostly about keeping the power consumption low. But I wonder if upping them has a noticeable effect on helios64 power consumption.
  4. @ebin-dev note that for a few days now, I have upped the cpub opp3 and above to 1.2V each. With 75mV, I still had the box crash around once a day.
  5. @snakekick Thanks, that seems to confirm my findings from a few months ago. Adding a 5ms delay in the test case did not prevent the crash. Though it could be that the system load is at play. Maybe adding a delay at the kernel level would do. pcie is tagged on the big CPUs, so the SATA disks seem to matter (as does the ethernet port). One could try in emergency mode by passing emergency to the kernel (I do it with "setenv extraboardargs emergency" after halting u-boot with a key press, then I enter "boot"). You will have just the root partition mounted read-only (so no network connection, a serial console is required). Then run the test. Also note that the design of the GPU regulator has the same issue as the CPU b one ... (for my tests I blacklisted panfrost, i.e. the GPU driver). After looking at the rockchip64 board schematics, the design around the CPU b regulator is not merely similar but exactly the same as the helios64 one (rockchip64 uses a tcs4545 regulator for cpu b and a tcs4546 for the GPU). I wonder if the easiest fix would not be to pay someone to desolder the syr837 regulator and solder a tcs4545 instead, and likewise a tcs4546 instead of the syr838 for the GPU regulator... except that these chips from Torch Chip seem nowhere to be found. Maybe rip them from a rockpro64 board. @aprayoga can you confirm that the Helios64 design for the rk3399 big cpu and gpu regulators is the same as the RockPro64 one? Would it make sense (and would it fix the unstable cpu_b) to desolder the syr837/syr838 and replace them with tcs4545/tcs4546? I.e. the tcs4535 datasheet I found (I am still unable to find the tcs4545 datasheet) says that chip has internal pulldowns for VSEL and EN, which the syr837 does not have, and the syr837 datasheet requires a 22uF capacitor for VIN, but the helios64 has a 10uF one, like the rockpro64 has for the tcs4545.
The SW pin of the helios64 has a 0.47uH inductor with 4 x 22uF capacitors, like the rockpro64 for the tcs4545 (and like the typical application in the tcs4535 datasheet, a 0.47uH inductor with two 22uF capacitors)? Do you know a replacement for the TCS4545/TCS4546 that has closer specs than the syr837/syr838? I cannot seem to find the TCS4545/TCS4546 for sale (maybe I could buy a rockpro64 to desolder them, at least for a test...). Or could you check on your side, with a helios64 board, that the cpufreq-switching-2-b test above crashes with the syr837 but not with the tcs4545, with vanilla rk3399 opp definitions in the dts? Sadly the Helios64 filled a market that is now left unfilled: people who do not have the know-how to go for a full low-wattage DIY NAS and who also cannot afford to pay 1K€ for a NAS (and who might need two NASs, to make things worse). In the meantime, I spend a lot of time learning about DIY NAS, but it is still hard to get the wattage at full load (builders tend to give only idle power usage). I will probably end up gambling, buying one build and praying... but with the Helios64 I had the metrics before buying. I found that the Rock960 has the same design for the cpu_b and gpu regulators except for the inductor, which is 0.240uH on the rock960 and 0.470uH on the Helios64. But it is hard to tell if the Rock960 is stable with my cpufreq switching test for the big cpus of the rk3399; it might be that the typical use of that board does not stress it as much as a raid10 on the helios64 pcie sata, which is tagged to the cpu_b ... (initially it was 4 3TB WD Red disks, the old CMR model WD30EFRX-68EUZN0, from a Helios4 setup as advertised by the Kobol wiki for the Helios4... the board crashed on first boot after assembly with this raid setup). Mind that I found that the Pinebook Pro also has the same design as the Helios64, this time around the syr837/syr838 ... I begin to wonder if either they are all broken (it could be that the amount of stress from NAS ethernet or raid10 over pcie is not that common) or if this is not the issue at stake.
  6. I wonder if upping the voltage was the correct fix (and if it would always work). From other rk3399 board schematics and the TCS4525 datasheet ... it seems the Kobol team designed the board for the TCS4525 regulator, which sits in front of the big CPU in a lot of designs, and replaced the TCS4525 with the SYR837 later on (without taking into account the different recommendations for the SYR837 ... i.e. a 22uF capacitor on VIN instead of the 10uF one for the TCS4525). All the components around the SYR837 on the helios64 datasheet match the reference design for the TCS4525 (from Torch Chip, datasheet behind Chinese paywalls). I don't know if replacing the VIN capacitor would be enough to get stable big CPUs...
  7. This should take time; how long did it take to complete? Could you paste the last 10 lines of output from the command (or even from a single run)? And maybe run the test with "time for i in $(seq 1 100); do ./cpufreq-switching-2-b ; done" to get the time it took at the end (but if it took ages it ran fine, and you do not need to run the 100 iterations anew). It might be that the test runs fine on your hardware. That would be interesting. But as I said, since it crashed once, I doubt it. One option is that on the first attempt you tried: and on the second you tried https://gist.github.com/prahal/8fab73325eb0d7091ad7c4627bf8e25a which has a delay of 5 milliseconds between cpub frequency transitions, while the first has no delay at this point (again, sorry, I did not notice that the gist github one had this 5msec delay, which I added to test if a delay would help). To check, you can replace "#define TRANSITION_DELAY 5000" with "#define TRANSITION_DELAY 0" and see if it crashes. If it does, that points to an issue with the delay between switching operating points for the big CPU. Do you know which kernel was running when your box crashed? Also, do you know which u-boot you have? (This requires serial console output.) Mind that you don't need to paste cpufrequtils data, because the test case bypasses the cpufrequtils settings and manages the chosen frequencies and how to switch between them on its own. Note: if you want to quote a text from this forum, select it with your mouse; a popover box "Quote selection" will appear above the selection, click on it. You can quote more than one selection in the same post.
  8. @BipBip1981 you mean you have no crash running cpufreq-switching-2-b five times with 6.6.16 and 6.6.28? No, that is not what I expected, especially since you told me that the first time it crashed and rebooted. So with which kernel did it crash? Did you rebuild cpufreq-switching-2-b between the test that crashed and the ones that did not? You can run the test in a loop 100 times with: for i in $(seq 1 100); do ./cpufreqswitching/cpufreq-switching-2-b; done With only one opp not upped by 75mV I have seen tests crash only after 80 runs, but without any changes that seems unlikely. Could you paste the last 10 lines from a cpufreq-switching-2-b run? I could imagine that some boards have defective components ... but then why did your board crash once and then no more? By the way, do not compile cpufreq-switching.c; as I told you previously, it was not the correct code for the test case. It is of no use for the issue at stake. It was a first attempt, made because the Kobol team said the crash could be due to too-fast frequency switching, so I tried the extremes only. But it turns out these are the most stable frequencies and likely the only ones that survive without upping the opp voltage by 75 mV. Yes, because it sets the governor to userspace to be able to force-switch the frequency via code. After the run, it does not restore the cpufreq-utils governor (/etc/default/cpufrequtils); "systemctl restart cpufrequtils.service" should restore it for you. You mean cpufreq-switching-2-b, which outputs thousands of lines for each run, built with "gcc -o cpufreq-switching-2-b cpufreq-switching-2.c", from: and https://gist.github.com/prahal/8fab73325eb0d7091ad7c4627bf8e25a (note there is a small diff between the two, the "usleep(50);", which should not matter).
  9. @BipBip1981 I don't understand what was not crashing on the second try just after reboot. Still, thank you for running my test case v2 (and again, sorry for pasting you the v1 at first, which was not the correct one to reproduce the crash). The v2 is expected to crash the board quite fast. Even if it survives one run, you should test a few runs (at least 5). Next, we need to find someone to look into the schematics to find out if upping the voltage is the best course of action. If so, we should ship the upped opp voltages in the helios64 dts.
  10. If the issue is that the cpu frequency is switched too fast and I can reproduce the crash with a regulator-ramp-delay of 1000, then there is no point in testing anything above 1000, as that will make the issue worse. regulator-ramp-delay is badly named. It is not a delay, it is a divider for the delay. The greater the regulator-ramp-delay, the faster the transition (I believe the Kobol team made this mistake, but as I also believe the issue could lie elsewhere than in the delay between transitions, this is not a big deal). I still have not tried a value lower than 1000 for regulator-ramp-delay (i.e. without tweaking the opp voltages as I am currently doing).
  11. @ebin-dev I believe initramfs messages are not written to syslog. @Trillien do you see that message on the serial console? /usr/share/initramfs-tools/scripts/local-bottom/mdadm is part of the mdadm package, which is packaged by Debian (see "dpkg -S /usr/share/initramfs-tools/scripts/local-bottom/mdadm" and "apt policy mdadm"). Though it could be that the generated initramfs lacking /bin/rm is armbian specific. You might want to open a bug against armbian or at least open a topic in the forum. But it is nothing helios64 specific as far as I know. It could even be a Debian bug. I don't even know if we ought to fix this missing /bin/rm for mdadm at the board level, even as a workaround.
  12. Could you try my older test case code: It turns out I did not compile my test case anew before pasting it to github gist, and it could be that the new one I pasted there is not testing what I expected (in that I may have changed it to test CPU frequency changes from max to min instead of each step). Mind that for my tests I use a binary of the test case that I built long ago, which is the one in the link above. I did not feel that sharing a binary test case was a good idea. I prefer you to be able to audit the code (or have someone audit it for you). I did not have much time to devote to sharing my findings, so I checked that the source was fine but not that the test was the same as the one I use on my side to stress test the big cpu. Sorry. It looks normal that the test case I shared with you works fine, since as far as I know 1.8GHz at 1.2V and 408MHz at 825mV are pretty stable. They could crash, I am not sure of that, but it would take more than 50 runs of the test for it to happen (at least it took 80 of them for 600MHz to fail at 825mV). Mind that you should do at least 5 runs of the above test case to be somewhat confident that you cannot get the cpu b to crash. The fact that it does not crash is not the point of the test. Its usefulness is that it nearly always crashes the big cpu on the first run. EDIT: the previous gist I gave you as a test case was my v1. The current test case is https://gist.github.com/prahal/8fab73325eb0d7091ad7c4627bf8e25a which is in the other thread I linked in this comment.
  13. @ebin-dev nearly: I do not change the max value of the voltage (only the min and the central value). Change opp-microvolt = <0xc96a8 0xc96a8 0x1312d0>; to opp-microvolt = <0xdbba0 0xdbba0 0x1312d0>; though it could be that you are able to increase the max value as well, only I don't know if it is safe or how to find out. Note that in the edited dts (be it via armbian-config or otherwise) you can replace the hex numbers with decimals. I.e. you can write: opp-microvolt = <900000 900000 0x1312d0>; It is way easier than computing the hex of the initial voltage with 75mV added. @BipBip1981 the best is to have a reproducible way to trigger the crash. Then you can tell when the issue is gone. My test case is https://gist.github.com/prahal/316111da0a9b8cc0d0791d26659dc682 If you can run it without a crash with any kernel, that is new to me (I believe I even got the first helios64 kernel, linux 4.4, to crash with this test case). With this patch to increase the min and "central" voltage (I believe the requested voltage) by 75mV, I cannot get my above test case to crash the helios64 (mind that there are other helios64 crashers, so it is best to run the test case in systemd emergency mode, but I managed to run it 100 times in "full" session mode). EDIT: this patch is incomplete: since then I have added opp-00 and opp-01 with the same values as opp-02 (i.e. 900000 and the appropriate frequencies).

      diff --git a/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts b/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts
      index 77844650e2fe..34d94e4d6ada 100644
      --- a/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts
      +++ b/arch/arm64/boot/dts/rockchip/rk3399-kobol-helios64.dts
      @@ -1160,10 +1160,36 @@ &cluster0_opp {
       	/delete-node/ opp06;
       };
       
       &cluster1_opp {
       	/delete-node/ opp08;
      +
      +	/delete-node/ opp02;
      +	/delete-node/ opp03;
      +	/delete-node/ opp04;
      +	/delete-node/ opp05;
      +	/delete-node/ opp06;
      +	opp02 {
      +		opp-hz = /bits/ 64 <816000000>;
      +		opp-microvolt = <900000 900000 1250000>;
      +	};
      +	opp03 {
      +		opp-hz = /bits/ 64 <1008000000>;
      +		opp-microvolt = <950000 950000 1250000>;
      +	};
      +	opp04 {
      +		opp-hz = /bits/ 64 <1200000000>;
      +		opp-microvolt = <1025000 1025000 1250000>;
      +	};
      +	opp05 {
      +		opp-hz = /bits/ 64 <1416000000>;
      +		opp-microvolt = <1100000 1100000 1250000>;
      +	};
      +	opp06 {
      +		opp-hz = /bits/ 64 <1608000000>;
      +		opp-microvolt = <1175000 1175000 1250000>;
      +	};
       };
       
       &cpu_thermal {
       	trips {
       		cpu_warm: cpu_warm {

      Mind that this patch will not apply (the cpu_thermal part is from another patch of mine). But it gives you an idea of what you should write. Also, you should take into account that crashes might be related to the load, or to the speed of transitions under load. So a given kernel version might help, but it will merely hide the crash or render it less frequent. It is not even a workaround; it merely makes the crash more or less frequent. It might be that there is still a bug in the kernel that only affects the helios64, but it is unlikely. I think I always had the helios64 crash (even on the first boot after I assembled the box) because I have a mdadm raid10 with ext4 setup. The raid10 stresses the board (and especially the big cpus). If you could try my stress test with your stable kernel, that would help decipher whether this kernel is really stable with regard to the big cpu. Mind that even with this cpu-b 75mV workaround I still get crashes from my board, but not with my test case, and way less often. I don't have a test case for these remaining crashes yet, nor do I know what triggers them. Also, the fact that upping by 75mV works around crashes when cpufreq switching the big cpus does not mean it fixes the root cause. I am not able to analyze the schematic on my own. We would need someone to do so to get a clearer clue as to why this helps and why it could be required. Finally, the rk3399 is said to be very robust. So it could be that it sometimes works with invalid voltages, but not all the time.
  14. Can you provide the exact commands you run to get the crash? Also, for most of the instability (big CPU cluster) see my comment above, that is, up the opp-table-1 voltages by 75mV. Otherwise I will post the DTS block to up the opp-table-1 voltages by 75mV in a few days at most, I hope.
  15. @ebin-dev I discussed this on IRC #u-boot, and I believe one board designer told me that there could be an issue with the hardware design of the regulator (CPU big). After looking at the schematics (which are available in the wiki, in the documents section of the left pane), he suggested that I up the voltage to the max to see if it fixed my crashes, and this indeed fixed lots of them. That is, I first tried every opp-table-1 (i.e. cpu-b) opp at 1.2V, then I tried with voltages closer to the vanilla rk3399 ones. In the end I was able to run the cpufreq switching test I gave you 100 times without a crash by upping all the opp voltages for cpu-b by 75mV. Every opp ran mostly stable with only 50mV, but I still had crashes. So upping by 75mV looks fine. I still have a crash around once a day, but not with my cpufreq test case as far as I know. I am now on the ondemand cpufreq governor with frequencies from 408MHz to 1.8GHz. Still, I would really like to be able to reproduce the crashes I still get. They might come from the gpu opp voltages, as the gpu regulator has the same hardware design as the CPU-big one. Or from something else. But I doubt the kernel is involved, except that a given kernel version might stress the board less. For one, I had added a big delay between CPU-b frequency switches and still had crashes, so I doubt the speed has anything to do with it. And in my test I tried with all cpu-b OPP voltages to 1.2V except even the lowest one and was still able to get random crashes with my cpufreq test case, so I doubt this has anything to do with high frequencies. Only that 1.6GHz was the one most sensitive to a voltage without the 75mV increase over the upstream rk3399 OPP voltage values. And 408/600 were the least likely to crash, but still crashed from time to time. I don't know if you know how to redefine the opp voltage values for cpu-b. I will try to post you my patch asap (currently on my phone).
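
On the opp-microvolt edit discussed in post 13: the hex-to-decimal arithmetic can be checked with a few lines of Python. This is only a sketch under stated assumptions. The function names (parse_cells, bump_opp_microvolt, format_cells) are made up for illustration, the first two cells of the triplet are the ones raised and the last one (the max) is left alone as described in the post, and the clamp to the max value is an extra safety choice that is not in the original patch.

```python
def parse_cells(text):
    """Parse a dts cell list such as '<0xc96a8 0xc96a8 0x1312d0>' into ints.

    int(cell, 0) accepts both hex (0x...) and decimal notation.
    """
    return tuple(int(cell, 0) for cell in text.strip("<>; \t").split())


def bump_opp_microvolt(triplet, delta_uv=75_000):
    """Raise the first two cells by delta_uv and keep the max cell unchanged.

    Assumption (not in the original patch): results are clamped to the max cell.
    """
    first, second, vmax = triplet
    return (min(first + delta_uv, vmax), min(second + delta_uv, vmax), vmax)


def format_cells(values):
    """Format the values back as a decimal dts cell list."""
    return "<" + " ".join(str(v) for v in values) + ">"


if __name__ == "__main__":
    stock = "<0xc96a8 0xc96a8 0x1312d0>"  # 825000 825000 1250000 in decimal
    print(format_cells(bump_opp_microvolt(parse_cells(stock))))
```

Run on the stock triplet from the post, this prints <900000 900000 1250000>, which matches the opp02 entry in the patch above.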
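
On the regulator-ramp-delay remark in post 10: in the Linux regulator bindings the property is expressed in uV/us, so the time allowed for a voltage change is roughly the voltage step divided by that value. The sketch below only illustrates this division. The function name ramp_time_us is hypothetical, the value 10000 is an arbitrary comparison point and not a value taken from any helios64 dts, and a given regulator driver may handle the value differently.

```python
def ramp_time_us(old_uv, new_uv, ramp_delay_uv_per_us):
    """Approximate settle time in microseconds for a voltage change.

    time = ceil(|new - old| / ramp_delay), so a larger ramp delay value
    means a shorter wait, i.e. a faster transition.
    """
    if ramp_delay_uv_per_us <= 0:
        raise ValueError("ramp delay must be a positive number of uV/us")
    step_uv = abs(new_uv - old_uv)
    # Integer ceiling division without floating point.
    return -(-step_uv // ramp_delay_uv_per_us)


if __name__ == "__main__":
    # Step from 825 mV to 1.2 V, the two extreme voltages mentioned in the posts.
    for ramp in (1000, 10000):  # 10000 is an arbitrary value for comparison
        print(ramp, "uV/us ->", ramp_time_us(825_000, 1_200_000, ramp), "us")
```

With a ramp delay of 1000 the 375 mV step gets 375 us, while with 10000 it gets only 38 us, which is the "greater value, faster transition" behaviour described in the post.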
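
On forcing the "userspace" governor at "1800000" as suggested in post 2: one possible way to do it is to write to the standard cpufreq sysfs files scaling_governor and scaling_setspeed. This is a rough sketch, not the method used by the test case. The function name pin_frequency is hypothetical, and the policy directory for the big cluster is an assumption (on the rk3399 the big cores are usually cpu4 and cpu5, so it is often /sys/devices/system/cpu/cpufreq/policy4, but check on your board). Writing there requires root.

```python
from pathlib import Path


def pin_frequency(policy_dir, freq_khz):
    """Select the userspace governor, then request a fixed frequency in kHz.

    policy_dir is a cpufreq policy directory; the governor has to be set
    first because scaling_setspeed is only usable with "userspace".
    """
    policy = Path(policy_dir)
    (policy / "scaling_governor").write_text("userspace\n")
    (policy / "scaling_setspeed").write_text(f"{freq_khz}\n")
```

Usage as root would look like pin_frequency("/sys/devices/system/cpu/cpufreq/policy4", 1800000), with the path adjusted to whatever policy directory covers the big cores on your system. As noted in post 8, "systemctl restart cpufrequtils.service" should restore the configured governor afterwards.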