Posts posted by Pali

  1. 16 hours ago, Anders said:

    Where should I direct my frustrations on why this has taken so long? Is it the fault of GlobalScale who failed to deliver on their mainline Linux support promise? Or is it Marvell who should have been more active in upstreaming their patches, instead of just providing an old kernel + a set of patches?

     

    Well, I do not know. But I have not seen any contribution from GlobalScale in upstream. They did not even bother to document or mention the DDR4 initialization changes which they made in their custom repository forks as big hexdump blobs (@barish identified them and then prepared pull requests to Marvell with fixes!)

     

    Most of the initial A3720 work in upstream was done by Bootlin developers, and later by other companies and individuals.

     

    If you look at the source code and git history of the kernel cpufreq driver for A37xx, you will find a description of a HW erratum: CPU voltage is not stable when the CPU is running at 1.2 GHz. This clearly identifies a HW bug in the A3720 SoC. So this is a Marvell issue and it affects all products based on A3720 (not only Espressobin). And if you look at the patches which we sent to the mailing list, we found other similar (HW?) bugs which also affect 1.0 GHz mode. So this is on Marvell, as they have not addressed these issues yet. Thanks to Marek, who very quickly wrote the first three workaround patches.

  2. Well, External Abort means that the CPU tried to access memory on the bus (not internal CPU memory), the access (read or write) could not be completed for some reason, and an exception was raised. Overheating may cause various issues, and if some block on the SoC stops working then an abort may happen.

     

    From your log it can be seen that the base VDD voltage for 1.2 GHz CPU frequency is set to 1.202 V. The typical voltage for 1.2 GHz is documented by Marvell as 1.155 V. A higher voltage makes the CPU more stable, but also increases power usage and temperature. And of course a higher temperature makes the CPU less stable.

     

    While testing 1 GHz mode with the CPU voltage too low, I often saw either this External Abort or segfaults of random processes. So an External Abort can definitely be caused by CPU instability. But in your case it may be because of high temperature, low voltage or something else... It may also be that your SoC/CPU was tested during manufacturing and found to be not really stable at 1.2 GHz (and so this information was written into OTP: no VDD value defined for 1.2 GHz).

  3. The log only says that an External Abort happened. I am not sure what the issue is.

     

    Check that you have a kernel with the v3 version of the a3720 cpufreq patches, as v3 updated the VDD value for load L1 when the base freq is 1.2 GHz.

     

    Also, can you apply the updated cpu_vdd_fallback.patch? It does not change behavior, it only fixes the printf for the SVC REV: line. But it can be useful to verify which CPU VDD value is used.

  4. 5 minutes ago, Anders said:

    Another nice-to-have is proper powerdown handling:

    
    anders@espressobin:~$ sudo shutdown -hP now
    [...]
    [  921.950348] reboot: Power down
    ERROR:   a3700_system_off needs to be implemented
    BACKTRACE: START: a3700_system_off
    0: EL3: 0x4024d34
    1: EL3: 0x4023834
    2: EL3: 0x4029c2c
    3: EL3: 0x4029b34
    4: EL3: 0x402a43c
    5: EL3: 0x4024ba0
    BACKTRACE: END: a3700_system_off
    Unhandled Exception from EL1
    x0             = 0x0000000084000008
    [...]

     

     

    This backtrace is from a debug build of Trusted Firmware. If you compile it without DEBUG=1, the backtrace is omitted.

     

    The A3720 SoC does not have full power management like ATX on a PC, which can cut power to all components. This is why the function a3700_system_off() in Trusted Firmware is not implemented.

     

    Basically, I just do not know what a3700_system_off() should do (if I decide to implement it). It could just call reset, or maybe call the suspend procedure. The A3720 supports some kind of low-power suspend/sleep mode, so maybe that is the best approximation of a full power off.
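
    For reference, the x0 value in the quoted register dump (0x84000008) matches the PSCI function ID of SYSTEM_OFF, which fits the kernel asking the firmware to power the board off. The following is only a small illustrative sketch for reading such dumps; the table lists the 32-bit PSCI function IDs and the helper name psci_name() is made up for this example.

```python
# Sketch: map an SMC function ID (as seen in x0) to a PSCI call name.
# Only the SMC32 IDs in the range 0x84000000..0x84000009 are listed here.
PSCI_FN_BASE = 0x84000000

PSCI_FUNCTIONS = {
    0: "PSCI_VERSION",
    1: "CPU_SUSPEND",
    2: "CPU_OFF",
    3: "CPU_ON",
    4: "AFFINITY_INFO",
    5: "MIGRATE",
    6: "MIGRATE_INFO_TYPE",
    7: "MIGRATE_INFO_UP_CPU",
    8: "SYSTEM_OFF",
    9: "SYSTEM_RESET",
}


def psci_name(x0):
    """Return the PSCI call name for the value in x0, or None if unknown."""
    function_id = x0 & 0xFFFFFFFF  # the function ID is in the low 32 bits
    return PSCI_FUNCTIONS.get(function_id - PSCI_FN_BASE)


if __name__ == "__main__":
    print(psci_name(0x0000000084000008))
```

    So whatever a3700_system_off() ends up doing, it is the handler reached by this SYSTEM_OFF call.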

  5. 4 minutes ago, Anders said:

    @Pali

    That actually seems to work:

     

    Perfect! Thank you for testing. I will send this patch to Marvell.

     

    4 minutes ago, Anders said:
    
    SVC REV: 536817928, CPU VDD voltage is invalid, using default value: 

     

     

    Ouch, this is a bug :-( I made a mistake in that patch: the svc_rev parameter is missing in the printf call. I have updated the post with my patch.

     

    4 minutes ago, Anders said:

    I'll let you know if I get a crash.

     

    Should I be concerned about temperatures? How can I read the CPU temperature?

     

    Yes, if your board does not have a voltage value for 1.2 GHz mode in OTP, then it means that your board does not (officially) support 1.2 GHz mode, and this could be for various reasons... One of them can be temperature.

     

    But unfortunately the A3720 SoC does not have a temperature sensor that can be read.

     

    So just check that your Espressobin is not too hot.

     

    I was told that the SoC should have some sensor for HW overheat detection, but I am not sure how it works.

  6. 2 hours ago, Anders said:
    
    
    $ git clone https://github.com/MarvellEmbeddedProcessors/u-boot-marvell.git -b u-boot-2017.03-armada-17.10
    $ git clone https://github.com/MarvellEmbeddedProcessors/atf-marvell.git -b atf-v1.3-armada-17.10

     

     

    These are old repositories. You should use upstream U-Boot (https://source.denx.de/u-boot/u-boot.git) and upstream TF-A (https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git). Also make sure that you have set the correct DDR_TOPOLOGY= parameter.

     

    All build parameters are described in the TF-A documentation at: https://trustedfirmware-a.readthedocs.io/en/latest/plat/marvell/armada/build.html Search that page for the full build example for inspiration. But basically your steps are correct. You do not need to set the DEVICE_TREE variable.

  7. 2 minutes ago, Rötti said:

    I mean, how long does it take to get the patch into the kernel (weeks, month)?

     

    It would be merged either into the next -rc version or into the next mainline version (depending on what the maintainers decide). See https://www.kernel.org/ for the currently released versions, and see https://www.kernel.org/category/faq.html for the question "When will the next kernel be released?"

     

    Once it is merged into an rc or mainline version, this patch (because it is marked as a bugfix) would automatically be included in all longterm versions as well.

     

    2 minutes ago, Rötti said:

    How likely is it to be rejected?

     

    Unlikely. If it is rejected, it would mean that the patch needs to be updated (to fix issues), and Marek or I will do it.

     

    For the rest of the Armbian-related questions, ask the Armbian people.

     

  8. @Anders Here is a patch for the A3700-utils-marvell repository which adds a fallback for the CPU VDD voltage to the default value when there is no value burned in OTP. Could you test it in 1.2 GHz mode?

     

    diff --git a/wtmi/sys_init/avs.c b/wtmi/sys_init/avs.c
    index c25fae087483..b993a80d9c5d 100644
    --- a/wtmi/sys_init/avs.c
    +++ b/wtmi/sys_init/avs.c
    @@ -196,12 +196,21 @@ int init_avs(u32 speed)
     	}
     
     	if (svc_rev >= SVC_REVISION_2) {
    -		vdd_otp = ((otp_data[OTP_DATA_SVC_SPEED_ID] >> shift) +
    -			   AVS_VDD_BASE) & AVS_VDD_MASK;
    -		regval |= (vdd_otp << HIGH_VDD_LIMIT_OFF);
    -		regval |= (vdd_otp << LOW_VDD_LIMIT_OFF);
    -		printf("SVC REV: %d, CPU VDD voltage: %s\n", svc_rev,
    -			avis_dump[vdd_otp].desc);
    +		vdd_otp = (otp_data[OTP_DATA_SVC_SPEED_ID] >> shift) &
    +			  AVS_VDD_MASK;
    +		if (!vdd_otp || vdd_otp + AVS_VDD_BASE > AVS_VDD_MASK) {
    +			regval |= (vdd_default << HIGH_VDD_LIMIT_OFF);
    +			regval |= (vdd_default << LOW_VDD_LIMIT_OFF);
    +			printf("SVC REV: %d, CPU VDD voltage is invalid,"
    +				" using default value: %s\n", svc_rev,
    +				avis_dump[vdd_default].desc);
    +		} else {
    +			vdd_otp += AVS_VDD_BASE;
    +			regval |= (vdd_otp << HIGH_VDD_LIMIT_OFF);
    +			regval |= (vdd_otp << LOW_VDD_LIMIT_OFF);
    +			printf("SVC REV: %d, CPU VDD voltage: %s\n",
    +				svc_rev, avis_dump[vdd_otp].desc);
    +		}
     	} else {
     		regval |= (vdd_default << HIGH_VDD_LIMIT_OFF);
     		regval |= (vdd_default << LOW_VDD_LIMIT_OFF);
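
    The selection logic of the patched branch can be sketched outside of the firmware as follows. This is only an illustration of the control flow in the diff above: the numeric values of AVS_VDD_BASE and AVS_VDD_MASK are placeholders chosen for the example (not taken from avs.c), and select_vdd() is a hypothetical helper name.

```python
# Sketch of the fallback decision from the patch. Placeholder constants,
# NOT the real values from wtmi/sys_init/avs.c.
AVS_VDD_BASE = 0x10  # assumed offset added to the OTP field
AVS_VDD_MASK = 0x3F  # assumed width of the VDD field


def select_vdd(otp_field, vdd_default):
    """Return (vdd_value, from_otp) for an already shifted OTP field.

    An empty field (0) or a field that would overflow the mask after
    adding the base is treated as invalid and replaced by vdd_default.
    """
    vdd_otp = otp_field & AVS_VDD_MASK
    if vdd_otp == 0 or vdd_otp + AVS_VDD_BASE > AVS_VDD_MASK:
        return vdd_default, False
    return vdd_otp + AVS_VDD_BASE, True


if __name__ == "__main__":
    print(select_vdd(0, 38))   # nothing burned in OTP: default is used
    print(select_vdd(10, 38))  # valid OTP value: base offset is added
```

    The difference from the old code is that the old expression added the base before masking, so an empty OTP field silently became a "valid" value equal to the base instead of being detected as missing.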

     

  9. On 3/4/2021 at 11:32 AM, Igor said:

    we wasted insane a lot of time for little gain

     

    Excuse me, but what have you done? Can you show e.g. the changes which you have made in the upstream kernel, U-Boot or any other project, including links to the git commits in those projects? Or issues which you have fixed?

     

    On 3/4/2021 at 11:32 AM, Igor said:

    We also never received a cent from anyone for trying to keep this HW usable.

     

    Because the only thing I saw was the removal of MAC addresses and the claim that this is the best for everybody.

     

    Sorry, but nobody is going to pay a project which just complains about wasting an insane amount of time, or a project which erases MAC addresses, or a project which wastes time without any result.

     

     

    We have made the 1 GHz variant of the A3720 SoC stable.

  10. This is an issue in the ASMedia SATA controller card, not in the Espressobin PCIe. The card announces support for 512-byte PCIe packets, but when the PCIe controller is configured for such a payload size, the card causes a system crash. We have reproduced this issue on another platform too.

     

    Marek sent a kernel patch which adds a quirk for this ASMedia SATA controller to set the maximal payload size to 256 bytes (https://lore.kernel.org/linux-pci/20210317115924.31885-1-kabel@kernel.org/T/#u), which should work around this issue.
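
    As background on what "payload size" means at the register level: PCIe encodes Max Payload Size as a 3-bit code, where the size in bytes is 128 shifted left by the code. The supported maximum is in bits 2:0 of the Device Capabilities register and the configured value is in bits 7:5 of the Device Control register. The sketch below only illustrates this encoding; the helper names and the example register values are made up, and it is not the quirk code itself.

```python
# Sketch of the PCIe Max Payload Size (MPS) encoding. Helper names and
# example register values are illustrative only.
MPS_FIELD_SHIFT = 5                       # Device Control bits 7:5
MPS_FIELD_MASK = 0x7 << MPS_FIELD_SHIFT
VALID_SIZES = (128, 256, 512, 1024, 2048, 4096)


def mps_bytes(code):
    """Convert a 3-bit MPS code to a size in bytes."""
    if not 0 <= code <= 5:
        raise ValueError("reserved MPS code: %d" % code)
    return 128 << code


def mps_code(nbytes):
    """Convert a size in bytes to the 3-bit MPS code."""
    if nbytes not in VALID_SIZES:
        raise ValueError("invalid MPS size: %d" % nbytes)
    return nbytes.bit_length() - 8        # 128 -> 0, 256 -> 1, 512 -> 2


def devcap_max_supported(devcap):
    """Size the device claims to support (Device Capabilities bits 2:0)."""
    return mps_bytes(devcap & 0x7)


def devctl_with_mps(devctl, nbytes):
    """Return a Device Control value with the MPS field set to nbytes."""
    return (devctl & ~MPS_FIELD_MASK) | (mps_code(nbytes) << MPS_FIELD_SHIFT)


if __name__ == "__main__":
    print(devcap_max_supported(0x8FC2))         # card announces 512 bytes
    print(hex(devctl_with_mps(0x2850, 256)))    # limit to 256 bytes
```

    In these terms the card advertises code 2 (512 bytes) and the quirk makes the kernel treat it as if it only supported code 1 (256 bytes).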

  11. 14 minutes ago, Rötti said:

    As you can see in the output below I already have these parameters in the console variable.
    Is there a special way to boot with this parameter, or is it automatically used
    when I call 'boot' because of 'set_bootargs' which contains 'console' already?

     

    If you call the 'boot' command, it executes the 'bootcmd' variable. And if you trace 'bootcmd' through your printenv output, it becomes clear that 'set_bootargs' is not called on this path.

     

    It seems that your 'bootcmd' ends in the 'boot_a_script' variable, which loads an external boot script (from the uSD card?), and that script boots the kernel. The script can do anything, including setting new variables, etc. So this script may or may not set 'console' in 'bootargs'. You need to investigate it.

     

    You could try to unset 'console' (= booting without console=ttyMV0,115200); maybe it helps. For recent kernels this console parameter should not be needed.
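
    Tracing 'bootcmd' by hand means following every 'run <variable>' reference starting from 'bootcmd'. A rough sketch of that procedure is below. The environment in the example is invented for illustration (it is not your printenv output), reachable_vars() is a hypothetical helper, and references built from other variables such as ${target} are not expanded.

```python
# Sketch: find which U-Boot environment variables are reached via `run`
# when starting from bootcmd. The example environment is made up.
import re


def reachable_vars(env, start="bootcmd"):
    """Return the variables reached from `start` by following `run`."""
    seen = []
    stack = [start]
    while stack:
        name = stack.pop()
        if name in seen or name not in env:
            continue
        seen.append(name)
        # `run a b c` may list several variables; stop at ; && ||
        for match in re.finditer(r"\brun\s+([^;&|]+)", env[name]):
            stack.extend(reversed(match.group(1).split()))
    return seen


if __name__ == "__main__":
    env = {
        "bootcmd": "run scan_dev_for_scripts",
        "scan_dev_for_scripts":
            "if test -e mmc 0:1 boot.scr; then run boot_a_script; fi",
        "boot_a_script":
            "load mmc 0:1 ${scriptaddr} boot.scr; source ${scriptaddr}",
        "set_bootargs": "setenv bootargs $console root=/dev/mmcblk0p1",
    }
    path = reachable_vars(env)
    print(path)
    print("set_bootargs" in path)
```

    In an environment shaped like this example, 'set_bootargs' is defined but never reached from 'bootcmd', so whatever it puts into 'bootargs' is not used when you just call 'boot'.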

  12. See the public HW documentation https://www.marvell.com/content/dam/marvell/en/public-collateral/embedded-processors/marvell-embedded-processors-armada-37xx-hardware-specifications-2019-09.pdf pages 153 and 154. It describes what all the numbers on the label mean. The public documentation is older and the 1.2 GHz variant is missing there. But you can deduce that its Speed Code is 120.

     

    From your photos I cannot read all the numbers on that chip. I see that the first line is 88F3-BVB2, but I cannot read the other lines.

  13. 14 minutes ago, Anders said:

     

    
    Marvell>> md d0012604 1; md d0012604 1; md d0012604 1
    d0012604: 00000000                               ....
    d0012604: 27236501                               .e#'
    d0012604: 0000f580                               ....

     

     

    Content of known bits in OTP:

    SVC revision is 5

    600 MHz mode is not supported

    800 MHz mode uses 1.073V

    1000 MHz mode uses 1.155V

    1200 MHz mode is not supported

     

    14 minutes ago, Anders said:

     

    
    TIM-1.0
    WTMI-devel-18.12.1-0967979
    WTMI: system early-init
    SVC REV: 5, CPU VDD voltage: 0.898V

    That voltage seems low, right?

     

    Do you still need me to pull the heat sink to get the labels / identifiers printed on SoC chip?

     

    I guess it is not needed. OTP says that 1.2 GHz mode is not supported, so WTMI chose some lowest default value.

     

    So the result is that your SoC most probably does not contain a 1.2 GHz CPU, but only a 1 GHz CPU.

     

    If you want a definite answer, you need to check whether the package has C120 (1.2 GHz variant) or C100/I100 (1 GHz variant) printed on it. But even if you have an SoC with a 1.2 GHz CPU and want to use 1.2 GHz mode, you would have to calibrate / figure out the voltage level and then patch the WTMI firmware to use that calibrated / measured voltage value.

  14. On 2/23/2021 at 5:46 AM, Anders said:

    Nice work. I'm unable to test it on my V7 1gb board though. Every time I flash it with a 1200mhz image, it bricks, and I can't even get to the uboot prompt.

     

    Could you please provide the following information?

    • full UART output with the 1200 MHz image, if there is at least something
    • full UART output with some image which is working, up to the point where U-Boot boots
    • output of this U-Boot command (from the working image): md d0012604 1; md d0012604 1; md d0012604 1
    • labels / identifiers printed on the SoC chip package (the one identified by 88F3720)

     

    On 2/23/2021 at 5:46 AM, Anders said:

    Are there anyone who's been able to boot the Espressobin V7 at 1.2 Ghz?

     

    In the email thread with the mentioned kernel patches on the linux-arm-kernel mailing list there is one tester, and the final version of the patches is also stable in 1.2 GHz mode.

     

    On 2/23/2021 at 5:46 AM, Anders said:

    By the way, who's sponsoring your work on this chip? :)

     

    It is for Turris MOX, modular router: https://www.turris.com/en/mox/overview/

  15. On 2/7/2021 at 11:19 PM, Rötti said:

    @Pali I posted the problem to the ide-linux kernel mailing list as proposed, but unfortunately received no answer.
    Here is the link: https://www.spinics.net/lists/linux-ide/msg60178.html

     

    Furthermore I was able to narrow down the kernel versions and exact image version of Armbian where it broke:
    Armbian 19.11.3 with Kernel 4.14.135 <- last version which was working
    Armbian      5.65 with Kernel 4.18.16   <- first version which is not working

     

    I have looked at the email which you sent to the mailing list https://lore.kernel.org/linux-ide/cbbb2496501fed013ccbeba524e8d573@posteo.de/T/#u and you did not provide all / enough information. At least the output of lspci -nn -vv is needed to correctly identify the type of your PCIe SATA controller. Also, the dmesg output between [ 0.000000] and [ 3.694604] is missing. Please provide this information (to the mailing list).