Posts posted by AxelFoley

1. I have been monitoring the voltage and current draw of an individual RockPro64 on boot, with the PCIe NVMe card connected, booting to the console.

I then launched the desktop both as root (Armbian desktop) and as the pico user (which looks like XFCE) and looked for any power spikes.

There were none!

Both tests caused a lockup and HW freeze.

On boot the peak current draw was 0.7 A with 12 V steady, with the current draw then dropping to a steady 0.3 A.

Then I launched Xorg as root (which in the past tended to be more stable) and as user pico (which always locked up immediately).

     

I did notice a difference when I ran startx as the pico user and it launched XFCE: its current draw peaked at 0.7 A and the voltage dropped to 11.96 V.

The desktop locked up the board immediately on launch.

     

When I ran startx as root and it launched the Armbian desktop, it only peaked at 0.6 A and the voltage never dipped below 12 V.

The desktop was responsive until I loaded the Armbian forum in Chromium and triggered the board lockup. There was no current spike or voltage drop! The board locked up while drawing only 0.29 A.

     

I checked the Pine64 forum and other people are reporting exactly the same issue as me with the PCIe/NVMe setup.

One person in 2018 reported that he fixed the issue by booting a different kernel.

     

I have logged the problem on the Pine64 forum and got this response:

     

    "Currently working on the issue. It seems - as odd is its sounds - that the problem is somehow linked to pulseaudio. If you uninstall pulseaudio, and use alsa instead, the issue will just vanish. We have tried blacklisting PCIe for pulse in udev, and it prevents the issue from happening, but it also returns a segmentation error (SATA card / other adapter not accessible). Its very very strange".

     

Should I stop posting here and move the discussion over to the Pine64 forum?

    XFCEDesktopPowerDraw.jpg

    ArmbianDesktopPowerDraw.jpg

    ArmbianDesktopLockup.jpg

  2. @pfry

     

Looking at the power specifications for PCIe x4: they suggest it can be powered from 3.3 V (9.9 W) and 12 V (25 W). However, from the RockPro64 power schema it looks like the board designers chose not to feed the 12 V supply rail to the PCIe slot.

     

Instead, Pine64 feeds the PCIe interface on the board only from the 3.3 V (3 A) rail (9.9 W).

From the power schema it also looks like the board designers feed the PCIe from the 5.1 V rail, converted by the RK808 PMU to 1.8 V on vcc1v8_pcie (I am not sure how this is intended to be used on PCIe).

     

     

I am using the Pine64 PCIe v3.0 NVMe adapter card with a Samsung 970 EVO 500GB and a RockPro64 v2.1 board. The PSU is a 102 W 12 V LRS-100-12.

     

The max power draw of the 970 EVO NVMe is 5.8 W (1.76 A at 3.3 V), which should be within spec for that 3.3 V rail.
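Running the numbers quickly (simple arithmetic, assuming the drive draws everything from the 3.3 V rail):

    # sanity-check the rail budget against the drive's max draw
    echo '3.3 * 3' | bc              # 9.9   -> watts available on the 3.3 V rail
    echo 'scale=3; 5.8 / 3.3' | bc   # 1.757 -> amps at max draw, well under 3 A

So the steady-state draw fits comfortably; only a transient spike should be able to drag that rail down.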

     

But this bit worries me: "vcc_sys: could not add device link regulator.6 err -2"

     

vcc_sys is the 3.3 V rail that feeds vcc3v3_pcie into the PCIe socket.

Although dmesg later says it can enable a regulator for the 3.3 V vcc3v3_pcie rail hanging off vcc3v3_sys.

     

I also see this warning:

     


    "Apr  8 19:09:24 localhost kernel: [    2.010352] pci_bus 0000:01: busn_res: can not insert [bus 01-ff] under [bus 00-1f] (conflicts with (null) [bus 00-1f])"

     

I may have to get out a JTAG debugger to work this one out :-(
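Before going that far, the kernel's regulator framework already reports what it thinks each rail is doing via sysfs, which should confirm whether vcc3v3_pcie ever actually comes up (a sketch; regulator numbering and available attributes vary by kernel build):

    # list every regulator the kernel registered, with its state and voltage
    for r in /sys/class/regulator/regulator.*; do
        printf '%s: %s state=%s uV=%s\n' "$r" \
            "$(cat "$r/name")" "$(cat "$r/state")" \
            "$(cat "$r/microvolts" 2>/dev/null)"
    done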

     

Not sure if this is a kernel driver issue or a HW power-design issue.

     

I'll see if I can get the Gerber files so I can scope out the voltage and current spikes.

     

     

3. I think I have found the issue!

     

It's the PCIe NVMe card.

     

If I remove it, the desktop seems not to hang; if I add it back in, the desktop hangs.

     

I wonder if this has something to do with power spikes when there is graphics activity. The errors in dmesg indicate that:

vpcie1v8 = 5.1 V rail, shared with the GPU

vpcie0v9 = 3 V rail

     

Both of these rails hang off the same core buck converter (SY8113B); the other SY8113B manages the USB peripherals separately.

     

@AndrewDB ... looks like you may be correct that it was power all along, but it looks like it's a kernel issue with PCIe power management?
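If it is PCIe power management, one cheap experiment before blaming the rails is to check and then pin the kernel's ASPM (Active State Power Management) policy (standard kernel knobs, assuming ASPM support is compiled into this kernel):

    # show the current ASPM policy (default / performance / powersave)
    cat /sys/module/pcie_aspm/parameters/policy
    # force the "stay awake" policy at runtime
    echo performance | sudo tee /sys/module/pcie_aspm/parameters/policy
    # or disable ASPM entirely by adding this to the kernel command line:
    #   pcie_aspm=off

If the lockups stop with ASPM off, that would point at link power-state transitions rather than the supply rails.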

     

    P_20190408_215025.jpg

    P_20190408_215328.jpg

4. Interesting! Fresh build on the spare 64GB eMMC: Armbian_5.75_Rockpro64_Ubuntu_bionic_default_4.4.174_desktop.img

I created a new user, pico, after changing the root password.

     

The device initiated nodm, which initiated Xorg and loaded the Armbian desktop ... which hung immediately! HW freeze and locked screen.

     

Subsequent boots only boot to the command line with a login prompt, not the desktop. /etc/default/nodm still has root as the default login user, not pico.

     

Manually running startx results in a black screen of death as the pico user (or, on a second try, a HW lockup - see attached), but as root the desktop loads OK when running startx.
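When startx black-screens like that, it is worth grepping the X server logs before suspecting hardware - Xorg usually manages to write its (EE) lines even when the session dies (stock Xorg log paths; Armbian may place them elsewhere):

    # errors (EE) and warnings (WW) from the last X start as root
    grep -E '\((EE|WW)\)' /var/log/Xorg.0.log
    # rootless Xorg (e.g. started by the pico user) logs to the home dir instead
    grep '(EE)' ~/.local/share/xorg/Xorg.0.log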

     

I have not run apt update && apt upgrade.

     

    armbianmonitor -u results below.

    http://ix.io/1FGs

     

MMC driver issues:

    [Mon Apr  8 19:23:29 2019] rockchip_mmc_get_phase: invalid clk rate
    [Mon Apr  8 19:23:29 2019] rockchip_mmc_get_phase: invalid clk rate
    [Mon Apr  8 19:23:29 2019] rockchip_mmc_get_phase: invalid clk rate
    [Mon Apr  8 19:23:29 2019] rockchip_mmc_get_phase: invalid clk rate

     

PMU issues that may affect the efficiency of CPU idle when not under load:
    [Mon Apr  8 19:23:29 2019] rockchip_clk_register_frac_branch: could not find dclk_vop0_frac as parent of dclk_vop0, rate changes may not work
    [Mon Apr  8 19:23:29 2019] rockchip_clk_register_frac_branch: could not find dclk_vop1_frac as parent of dclk_vop1, rate changes may not work

     

some possible PCIe/NVMe issues:

    [Mon Apr  8 19:23:30 2019] rockchip-pcie f8000000.pcie: Looking up vpcie1v8-supply property in node /pcie@f8000000 failed
    [Mon Apr  8 19:23:30 2019] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
    [Mon Apr  8 19:23:30 2019] rockchip-pcie f8000000.pcie: Looking up vpcie0v9-supply from device tree
    [Mon Apr  8 19:23:30 2019] rockchip-pcie f8000000.pcie: Looking up vpcie0v9-supply property in node /pcie@f8000000 failed

     

    [Mon Apr  8 19:23:31 2019] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
    [Mon Apr  8 19:23:31 2019] pci_bus 0000:01: busn_res: can not insert [bus 01-ff] under [bus 00-1f] (conflicts with (null) [bus 00-1f])

     

some PWM issues:

    [Mon Apr  8 19:23:31 2019] pwm-regulator: supplied by vcc_sys
    [Mon Apr  8 19:23:31 2019] vcc_sys: could not add device link regulator.8 err -2
    [Mon Apr  8 19:23:31 2019] vcc_sys: could not add device link regulator.8 err -2

    [Mon Apr  8 19:23:31 2019] vcc_sys: could not add device link regulator.11 err -2

... etc. - there are a load of these.

     

some sound driver issues:

    [Mon Apr  8 19:23:32 2019] of_get_named_gpiod_flags: can't parse 'simple-audio-card,hp-det-gpio' property of node '/spdif-sound[0]'
    [Mon Apr  8 19:23:32 2019] of_get_named_gpiod_flags: can't parse 'simple-audio-card,mic-det-gpio' property of node '/spdif-sound[0]'
    [Mon Apr  8 19:23:32 2019] rockchip-spdif ff870000.spdif: Missing dma channel for stream: 0
    [Mon Apr  8 19:23:32 2019] rockchip-spdif ff870000.spdif: ASoC: pcm constructor failed: -22
    [Mon Apr  8 19:23:32 2019] asoc-simple-card spdif-sound: ASoC: can't create pcm ff870000.spdif-dit-hifi :-22
    [Mon Apr  8 19:23:32 2019] asoc-simple-card spdif-sound: ASoC: failed to instantiate card -22

     

some cdn-dp (DisplayPort) firmware issues (though the board still works):

    [Mon Apr  8 19:23:46 2019] cdn-dp fec00000.dp: Direct firmware load for rockchip/dptx.bin failed with error -2
    [Mon Apr  8 19:24:02 2019] cdn-dp fec00000.dp: Direct firmware load for rockchip/dptx.bin failed with error -2
    [Mon Apr  8 19:24:35 2019] cdn-dp fec00000.dp: Direct firmware load for rockchip/dptx.bin failed with error -2
    [Mon Apr  8 19:25:39 2019] cdn-dp fec00000.dp: [drm:cdn_dp_request_firmware] *ERROR* Timed out trying to load firmware

    [Mon Apr  8 19:23:32 2019] asoc-simple-card: probe of spdif-sound failed with error -22

    P_20190408_212044.jpg

  5. 1 hour ago, balbes150 said:

    For what purpose do you need a cluster ?

@balbes150 It's for prototyping, education and engineering ... essentially enabling a quick-and-dirty evaluation of SOA (Service Oriented Architecture) concepts for myself and some other devs: e.g. NoSQL databases, and in particular how to integrate service discovery and RDMA paradigms such as the OFED stack (RoCE), plus building RESTful interfaces and API abstractions while understanding principles such as standardized data and message models. In essence, to evangelize open source software and frameworks as a solution to proprietary software integration and interoperation inertia.

  6. 4 hours ago, AndrewDB said:

    This is all you need to know, actually. As I wrote before, this is a problem with hardware acceleration being used by Chromium. You can turn it off in advanced settings. 

@AndrewDB apologies, I missed the suggestion ... I disabled the HW acceleration and restarted Chromium from the terminal. Disabling the HW acceleration stopped the error messages being printed to stdout.

However, the RockPro64 still locked up with a HW freeze loading the Armbian forum. But I think the graphics drivers on some of my cluster nodes have gone foo-bar for some reason.

I may need to "apt install --reinstall [graphics subsystem packages]" to be sure this is a genuine bug and not a library left missing by an interrupted apt update && apt upgrade.
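For a faster A/B test than the settings UI, the same acceleration toggle can be forced from the command line (a standard Chromium switch, though the binary name varies by distro):

    # launch with GPU compositing disabled entirely
    chromium-browser --disable-gpu
    # then type chrome://gpu into the address bar to see what Chromium
    # thinks its acceleration state is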

7. Right, I have now caught up with everybody's comments and questions and have hopefully answered them.

    I now need some guidance on what to do next ....

     

    ### Hypothesis 1 ###: 

The root cause of the instability is a fundamental issue with the current Armbian code base (kernel/driver) on the RockPro64 4GB v2.1 + NVMe + 64GB eMMC, triggered by Chromium loading the Armbian forum (100% reproducible).

    *** Action ***

Continue to troubleshoot the unstable cluster master when using Chromium, and figure out how to trap the HW lockup when Chromium loads the Armbian forum (nothing is captured in the system log files).

The only hint I get is from launching Chromium from a terminal (thanks @NicoD - I had no leads until you suggested this, as it is a complete HW freeze/lockup):

     

    root@rockpro64_0:~# chromium-browser
    libGL error: unable to load driver: rockchip_dri.so
    libGL error: driver pointer missing
    libGL error: failed to load driver: rockchip
    [17517:17517:0406/144119.625593:ERROR:sandbox_linux.cc(364)] InitializeSandbox() called with multiple threads in process gpu-process.
    [17647:17672:0406/144121.185188:ERROR:command_buffer_proxy_impl.cc(124)] ContextResult::kTransientFailure: Failed to send GpuChannelMsg_CreateCommandBuffer.
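One way to trap a freeze that never reaches the disk is to stream the kernel log off the box in real time: netconsole forwards printk output over UDP to another machine, so the last messages survive the lockup (a sketch; the IPs, MAC and interface name are placeholders for whatever the cluster actually uses):

    # on a second machine, listen for the log stream
    nc -u -l 6666          # (or: nc -u -l -p 6666, depending on netcat flavour)
    # on the RockPro64, point netconsole at that machine
    sudo modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/aa:bb:cc:dd:ee:ff
    # raise the console log level so more printk output is forwarded
    sudo dmesg -n 8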

     

     

### Hypothesis 2 ###:

The root cause of the issue is a corrupted Armbian installation caused by an interruption to a salt-distributed "apt-get update && apt-get upgrade" (indicated by the need to force reinstall libegl1 due to missing libraries).

    *** Action ***

Reformat the entire cluster using a desktop edition for the master and a CLI version for the nodes, then retest to see if we can recreate the issue with Chromium, and also see whether 20% of the nodes are still rebooting constantly.

     

I have a spare 64GB eMMC chip, so I can save the current boot image of the cluster master if I need to go back to it.

     

Q). Should I restart from scratch and reformat the cluster with an image recommended by this forum, so that I start testing from a known good, stable place?

(It seems Armbian does not have anybody offering to test the RockPro64 4GB v2.1 + PCIe NVMe + Bluetooth/WiFi module; seeing as I have 10 of them, I may be a good volunteer.)

     

    Or

     

Or should I look at validating the current master's Armbian packages and kernel driver installs, to make sure this was not a corrupted installation / driver configuration issue, before unearthing a genuine bug report?
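If validation is the route, dpkg can compare every installed file against its package checksums, which is much quicker than reinstalling packages by guesswork (stock Debian tooling):

    # report any installed files whose md5sum no longer matches the package
    sudo apt install debsums
    sudo debsums -s
    # dpkg's built-in equivalent
    sudo dpkg --verify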

     

     

     

     

  8. On 4/1/2019 at 11:50 PM, chwe said:

    Just another reminder: https://forum.armbian.com/guidelines

especially these points here:

    .... post some logs etc etc

@chwe My bad ... to be fair, every time I loaded the forum to post logs from the unit itself, Chromium crashed the board :-)

I did not realise that armbianmonitor posted to a URL upload site ... smart!

     

    http://ix.io/1Fs4

     

See the link above for the results from armbianmonitor -u.

  9. On 4/1/2019 at 11:50 PM, chwe said:

    what makes you sure the powering issues are gone? Did you check voltage on 5v rail after replacing the PSU under stress?

@chwe Yes, I have been working with the PicoCluster guys to revise the power supply unit they ship with their cluster, and we have upgraded it to a 12 V unit instead of a 5 V unit. I have also completely rewired the power cable loom and installed a buck converter for the 5 V switches and fans.

I have not gone to the extreme of checking individual output voltages and currents from the PSU to the boards, but I have been monitoring the cluster's total power consumption. The cluster is not loaded at all, because its instability has kept me from doing any work on it. At peak it has never pulled more than 64 W (5.3 A). Most of the load is Cassandra recovering after node reboots; as it is multi-threaded, it can load all 6 cores.

     

The 12 V DC input goes via two buck converters (SY8113B) to create 2 x 5.1 V rails. One of those rails goes into an RK808 PMIC (which I had thought was embedded in the RK3399, but it is a separate chip).

That RK808 feeds the GPIO pins, which I have measured at 5.190 V.

There is a 4-pin SATA JST connector that is said to be wired to the raw DC in (12 V) (but SATA also has 12 V, 5 V and 3.3 V specifications, so I am not sure whether one of those pins is additionally fed from the 5 V rail directly after the SY8113B).

     

@chwe Do you want me to monitor the 5 V rail (after the SY8113B) with an oscilloscope prior to lockup? Do you have some concerns with the board's power design?

     

     

P_20190406_132945.jpg

     

    P_20190406_132903.jpg

    P_20190406_133000.jpg

  10. On 4/1/2019 at 11:50 PM, chwe said:

If I didn't miss something, the RockPro is still marked as WIP on the download page, right?

     

rockpro.png

     

but well, maybe we should rename it back to WIP so that only expert=yes can build images for it ... to make it more obvious ...

     

    but also this sub-forum has a nice reminder:

rk3399.PNG

     

My RockPi 4b was used in a pure CPU number-crunching project for 17 days at between 75°C and 80°C without any crash. Would I call the RockPi stable? For sure not. It seems it did well for this test, but I've no idea whether it runs stable under all the other use-cases people can imagine for the board in question.

     

    Just another reminder: https://forum.armbian.com/guidelines

especially these points here:

     

     

    what makes you sure the powering issues are gone? Did you check voltage on 5v rail after replacing the PSU under stress?

@chwe A quick heads up ... people may not be aware, but the Pine64 guys have their own image installer based on Etcher that automatically pulls the Armbian image down without any indication of the project's WIP or stable status (see attached). This is where Pine64 falls down a bit ... they should focus on recommending one main desktop and one command-line release to users if they are going to obscure project status in their installer.

     

I only found out about the WIP status when I reported the issues.

     

I have been testing several desktop images, and to be fair to Armbian, they are all far away from where Armbian is on the RockPro64; it is by far the best performing I have found (the mrfixit build is broken at the moment if you boot from eMMC). Maybe this is just me, but I only need a desktop when developing, to access a web browser ... it saves me having to kill trees ... I have had to print this whole thread out so I could continue working, because of the Chromium / Mali driver issue.

     

However, I think I have no choice but to reformat the whole cluster ... with all these lockups I have seen evidence of packages installed through apt that are actually missing libraries, and I had to do force reinstalls:

     

    apt install --reinstall lightdm

apt install --reinstall libegl1

     

I cannot tell if it was an issue with the Armbian package release at the time, or if the 10 RockPro64s I have were HW-locking and resetting during a package install.

     

Sometimes it is better to go back to square one. What are people's opinions?

     

    I am happy to act as a tester for this board and Armbian and find out where the real issues are.

    pineInstaller.png

    P_20190406_131504.jpg

  11. On 4/1/2019 at 2:11 PM, AndrewDB said:
    
    tail -f /var/log/{messages,kernel,dmesg,syslog}

@AndrewDB Thanks for the suggestions. I did try this, but the lockup is so instantaneous that nothing gets logged to disk ... see attached for before and after screenshots of the syslog / kernel log and the Xorg log ... this is why it has been so hard to troubleshoot. However, somebody's brilliant idea to launch Chromium from a terminal has given me some interesting new leads!

See attached: before crash, after crash, and the Chromium errors printed to the command line when launched from a terminal.
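Since the logs never make it to disk, one low-tech trick is to follow the kernel ring buffer from a second machine over SSH; whatever prints in the final second before the freeze is preserved on the remote terminal (assumes SSH access from a laptop or another node; rockpro64_0 is my cluster master's hostname):

    # from a laptop / another node, stream the RockPro64's kernel log live
    ssh root@rockpro64_0 'dmesg --follow'     # dmesg -w on older util-linux
    # same idea for the system journal
    ssh root@rockpro64_0 'journalctl -f'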

     

I really don't care about Chromium ... when working on these boards I try to keep things as simple as possible and use stock packages, in the hope that more bugs will have been squashed. I am now using Firefox, so I can at least start working on the GPIO code again.

     

    Thanks for the suggestions.

     

It could be that I need to run a debug kernel, as I at least expected to see some issues there ... but the nature of the lockup could mean it is a HW issue in the Mali graphics chip, triggered only by Chromium.

     

    beforeCrash.jpg

    AfterCrash.jpg

    ChromiumError.jpg

  12. On 4/1/2019 at 1:26 PM, Da Alchemist said:

    Using Rockpro64 for Desktop Scenarios is a really bad idea at this state of development.

@Da Alchemist That is a good point ... I made the rookie mistake of thinking I could put up with a few glitches and still code. I then compounded it by using my cluster master as my development IDE, git master, salt master, Prometheus master and Grafana server ... it is my own fault. I think it is time to restart from scratch, move my code to GitHub and reformat everything. I have clearly managed to bugger up the Mali graphics drivers on the cluster master/dev box through continual upgrades, plus a link on the wiki to install HW-accelerated Mali drivers that are probably not feature-complete.

     

The decision was in part due to the fact that the PicoCluster case only allows easy HDMI output from one board, and that was my cluster master.

I don't believe there is Thunderbolt support over USB-C on the RockPro64, so a USB-C to HDMI dongle is not an option.

     

I can reformat the cluster without a desktop, and I have a KVM, so I can access a web browser and the cluster more easily and use vim (mostly I am writing C and Python).

     

  13. On 4/1/2019 at 1:09 AM, NicoD said:

@JMCC You could run Chromium from the terminal (just type chromium) and see if it gives any clues as to what's going on.

@NicoD Brilliant idea ... I don't know why I did not think of that ...

See attached ... I got those errors when first launching the Chromium browser. It works for a while as I navigate to the Armbian forum ... then locks up with no more command-line output. Interestingly, this goes back to my hunch that a graphics driver issue is causing the instability. It is just strange that Chromium triggers the issue but not Firefox!

     

     

    P_20190406_113943.jpg

14. armbian-config does not do a good job of installing lightdm; it failed to launch lightdm on boot, with errors about trying to restart the service too quickly after it had initially failed to start.
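For the record, the "restarted too quickly" error is systemd rate-limiting a unit that keeps crashing, so the real failure reason is a few lines further up in the journal (standard systemd commands; lightdm.service is the stock unit name):

    # why did the unit actually fail?
    systemctl status lightdm.service
    journalctl -u lightdm -b --no-pager
    # clear the rate-limit counter and try again
    sudo systemctl reset-failed lightdm
    sudo systemctl start lightdm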

     

    I don't have time to troubleshoot somebody else's sloppy mess.

I have reverted to nodm so I can get back to writing some code.

     

The display lockup when nodm boots and logs in as any user other than root is not going to be nodm's problem, I suspect.

More likely an Xorg configuration issue / graphics driver.

     

Good to know I can avoid 90% of my stability issues just by not using the default Armbian Chromium browser.

     

I am soon going to have to face facts and accept that I need to reformat my entire cluster with a different distro; I just cannot get Armbian stable on the RockPro64.

I have been trying for 4 months. I have these lockup issues with my master node, and I have 20% of my nodes constantly restarting.

    I have eliminated power issues.

I have ordered a load of heatsink fans to eliminate the last thing I can think of: heat (despite them all having heatsinks and a PicoCluster case).

     

The fans are the last thing I will try before giving up on Armbian, starting from scratch, and losing all my salt/cassandra/prometheus/grafana cluster configuration.

15. Another reproducible issue: I can get exactly the same screen lockup on boot.

     

When I reinstalled the desktop from armbian-config, it installed lightdm and removed nodm.

I backed out this change - uninstalled lightdm and reinstalled nodm - in case that was what was causing the instability.

     

Unfortunately, in doing so it changed the default nodm auto-login to root instead of my default user (pico).

     

If I edit /etc/default/nodm back to pico instead of root, it locks up in the same manner, 100% of the time.

     

To fix the issue, I have to mount the eMMC module externally and edit the file back to the default login of root, to be able to access the desktop.
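For reference, the auto-login user lives in /etc/default/nodm, roughly like this (a sketch from memory of the stock Debian nodm package; variable names may differ slightly between versions):

    # /etc/default/nodm
    NODM_ENABLED=true
    # the user whose X session is started automatically at boot
    NODM_USER=pico
    NODM_X_OPTIONS='-nolisten tcp'

So the root-vs-pico difference comes down to that one NODM_USER line, which at least makes it a clean A/B switch for the lockup.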

     

     

     

     

16. Right ... I think I can recreate the problem 100% of the time by simply loading the Armbian forum in Chromium.

I have tried for 30 minutes to create the same OS lockup using Firefox, with no problems whatsoever. Indeed, video on sites like BBC News plays flawlessly.

So it looks like the issue is specific to the default Chromium install of Armbian.

     

I had overlooked it being Chromium, simply because it looked like a HW video driver glitch: the entire desktop would lock up and the mouse cursor would typically not even move.

     

But bingo: 30 seconds after I loaded Chromium, I went straight to the Armbian forum and it locked up on the default forum page.

     

It did not used to be this reproducible ... I avoided media sites, and generally I could get on with coding before a crash/lockup would occur.

     

I am not sure what has made the environment so much more unstable and the crash so easy to reproduce, as I run apt update && apt upgrade several times a day.

     

Is there a way I can run Chromium in debug mode and get it not to delete the log files when I have to hard-reboot the device?
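Chromium does have stock switches for this: logging can be sent to stderr instead of the profile's log file, so nothing gets cleaned up on restart, and tee keeps a copy on disk (real Chromium flags; how verbose the output is varies by build):

    # verbose logging to the terminal, mirrored into a file that survives reboots
    chromium-browser --enable-logging=stderr --v=1 2>&1 | tee ~/chromium-debug.log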

17. I have had these lockup issues ever since I built the cluster in December 2018.

I initially thought it was a power issue (I was seeing some behaviour that suggested power was a problem).

However, since then I have resolved/upgraded my power source and eliminated that instability.

When I installed the base Armbian image, I did follow some instructions to install some libmali drivers.

I suspected that may have been the issue, and I associated the crashes with trying to run video in a browser, so I avoided that.

However, now more than ever I am seeing lockups just from loading a browser and going to the Armbian forum.

I will try installing Firefox and see if that has similar problems.

     

     

18. I had to transfer the script to a USB stick on a laptop and then transfer it to the RockPro64, because the browser kept locking up when trying to point Chromium at the Armbian media script site. I selected the defaults of System, MPV and GStreamer from the install CLI. I then opted for the Armsoc fullscreen vsync.

     

Tonight's development is a write-off, so I will start again tomorrow.

     

Thanks for all your help ... fingers crossed I can stop the desktop from locking up.

It has only been this bad recently, since the distro upgrades (I upgrade daily).

     

Normally the forums are fine ... and I avoid sites with media content (video always locks up), but recently even the Armbian forum causes the OS to lock up the RockPro64 OS/HW.

     

The install script threw errors (see attached).

     

I will do an apt autoremove and reboot,

     

and report back if stability improves.

    install.log

19. Is it just me? Every time I start to use the Armbian web browser, including simply loading this forum, the whole RockPro64 OS locks up and I have to reset.

Typically it is when I load media websites like the BBC ... but it even occurs when I load this forum!

     

I am trying to develop a C GPIO library to help the RockPro64 community, but it is hopeless. I have to work on the console and browse the web on a laptop.

I have ordered a KVM to help, but this is really frustrating.

     

Because the HW locks up, I cannot detect any issues in the logs, and I don't know how to set up a HW trace in the graphics drivers.

     

I have eliminated any power issues. I have a PicoCluster case and a decent heatsink ... I have ordered some heatsink fans, but the CPU stays around 40°C.

     

Any ideas how I can find out why the HW just freezes?
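One quick way to tell a true hardware hang from a wedged GUI is the kernel's magic SysRq interface: if the kernel still answers SysRq after a "freeze", the hang is in userspace/graphics rather than the SoC itself (a standard kernel feature, assuming it is compiled in):

    # enable all SysRq functions
    echo 1 | sudo tee /proc/sys/kernel/sysrq
    # after a freeze: press Alt+SysRq+h on an attached keyboard, or if an SSH
    # session is still alive, trigger it manually:
    echo h | sudo tee /proc/sysrq-trigger
    # if help output appears in the kernel log / on serial, the kernel is
    # still alive and this is a graphics/userspace hang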

     

It must have something to do with the Mali drivers?

     

Any ideas welcome.