Posts posted by tkaiser

  1. 1 hour ago, ag123 said:

I'm using zram-config in Armbian Stretch

     

    My personal opinion on this: Linux sucks.

     

While zram is a nice way to make more use of the physically available DRAM, the whole approach still sucks since we would also need to take care of attempts to store browser profiles and caches in RAM (uncompressed -- what a waste). Since there's nothing like a globally acting memory compressor task running in the background (as in macOS) it would need some more work to enhance psd/cache behavior (using compressed RAM, of course).

     

And then still the only reasonable way to run a full-blown desktop environment on those boards with low RAM is adding a fast UAS-connected SSD, putting the rootfs on it and using properly configured zswap instead of zram. But why? Adding all the costs together, a properly sized board with eMMC is the cheaper variant that sucks less.
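
For reference, zswap (unlike zram) is enabled via kernel command line parameters rather than a swap device of its own. A minimal sketch; the values are purely illustrative, not a tuned recommendation:

    # kernel command line parameters to enable zswap (example values)
    zswap.enabled=1 zswap.compressor=lz4 zswap.max_pool_percent=20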

  2. 10 minutes ago, ag123 said:

I tried the above

     

Wrong approach, since in the meantime we implemented our own zram control mechanism, already available in nightlies and supposed to be rolled out with the next major update.

     

For anyone coming across this: do NOT follow the above recipe, it's deprecated in Armbian and causes more harm than good.

     

@Igor: IMO we need to make the amount of zram configurable. Currently it's set to 'half of available RAM' in line 35 of the initialization routine. But since updates will overwrite this, users who want to benefit from massive zram overcommitment (since it just works brilliantly) are forced to edit this script over and over again.

     

I propose to define the amount used as $ZRAM_PERCENTAGE, defaulting to 50 and overridable in a yet-to-be-created /etc/defaults/armbian-zram-config file. Any opinions?
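
A minimal sketch of how the script could consume such a file (file name taken from the proposal above, everything else hypothetical):

    # default, can be overridden via /etc/defaults/armbian-zram-config
    ZRAM_PERCENTAGE=50
    [ -f /etc/defaults/armbian-zram-config ] && . /etc/defaults/armbian-zram-config
    # assign the requested percentage of physical memory to zram
    mem_total_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
    zram_size_kb=$(( mem_total_kb * ZRAM_PERCENTAGE / 100 ))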

  3. 1 hour ago, Dan Christian said:

    Is the memory bus the same width for the 2GB and 4GB versions?

     

Yep. Check the link to the review and compare @hjc's tinymembench numbers from his 2GB M4 (4 x DDR3) with e.g. RockPro64 (2 x LPDDR4). Both are dual-channel DRAM configurations, but it's possible that we get more recent DRAM initialization BLOBs from Rockchip and then RockPro64 with LPDDR4 might be slightly faster (I don't think this will change anything for the larger 4GB M4 configuration using LPDDR3).

     

    BTW: For most use cases memory bandwidth is pretty much irrelevant.

  4. 1 hour ago, t-minik said:

there is only scaling_cur_freq but if I understand correctly, it seems to be the value the kernel thinks the CPU is running at instead of the real value (in throttling situations it's inaccurate)

     

Yes, scaling_cur_freq is just some number the kernel believes in, while cpuinfo_cur_freq is determined from the hardware: https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt

     

Querying the correct sysfs node is also more 'expensive' and therefore only allowed for root. Please see also @jeanrhum's adventure with the very same Atom and an obviously similar kernel:
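
To illustrate the difference (standard cpufreq sysfs nodes; cpuinfo_cur_freq usually requires root):

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq         # what the kernel thinks
    sudo cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq    # what the hardware reports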

     

  5. 5 hours ago, coollofty said:
    
    4.286s armbian-hardware-monitor.service
    1.342s armbian-ramlog.service

     

Since you claim you have to wait 30 seconds, obviously these two services (collecting debug information and setting up efficient logging to RAM) are NOT the culprit, true?

     

You'd better provide the output of

    systemd-analyze critical-chain
    armbianmonitor -u

     

  6. 7 hours ago, chrisf said:

    Have you got a tool to check the latency to compare USB2 and USB3? Or CPU usage when doing the same workload?

     

I still had the SSH session window open and collected the relevant logging portions from 'iostat 1800' while running the test with USB3, USB2 and then again zram/lzo (which, surprisingly, again outperformed lz4):

    USB3:     %user   %nice %system %iowait  %steal   %idle
              82.31    0.00   12.56    4.68    0.00    0.45
              74.77    0.00   16.80    8.25    0.00    0.18
              55.24    0.00   19.84   24.44    0.00    0.48
              72.22    0.00   16.94   10.43    0.00    0.41
              50.96    0.00   22.24   26.09    0.00    0.71
    
    USB2:     %user   %nice %system %iowait  %steal   %idle
              81.77    0.00   11.95    5.30    0.00    0.99
              75.99    0.00   16.95    6.71    0.00    0.35
              66.50    0.00   19.19   13.81    0.00    0.49
              77.64    0.00   18.31    3.97    0.00    0.08
              44.17    0.00   12.99   13.09    0.00   29.74
    
    zram/lzo: %user   %nice %system %iowait  %steal   %idle
              84.83    0.00   14.68    0.01    0.00    0.48
              82.94    0.00   17.06    0.00    0.00    0.00
              81.51    0.00   18.49    0.00    0.00    0.00
              78.33    0.00   21.66    0.00    0.00    0.01

     

    7 hours ago, chrisf said:

    maybe at the hardware level USB3 requires more resources, all the interrupts could be causing excessive context switching

     

That's an interesting point and clearly something I forgot to check. But I was running with the latest IRQ assignment settings (USB2 on CPU1 and USB3 on CPU2) so there shouldn't have been a problem with my crippled setup (hiding CPUs 4 and 5). But the iostat output above reveals that %iowait with USB3 was much higher compared to USB2, so this is clearly something that needs more investigation.
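
Checking where the USB IRQs end up is simple; a sketch (the IRQ number below is hypothetical, yours will differ):

    grep -iE 'usb|xhci|ehci' /proc/interrupts
    # pin a hypothetical IRQ 226 to CPU2 (affinity bitmask 0x4)
    echo 4 >/proc/irq/226/smp_affinity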

  7. 3 hours ago, tkaiser said:

    real 155m7.422s

     

This was 'swap with SSD connected to a USB3 port'. Now a final number. I was curious how long the whole build orgy would take if I used the same UAS-attached EVO840 SSD connected to a USB2 port. Before and after (lsusb -t):

    /:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
        |__ Port 1: Dev 3, If 0, Class=Mass Storage, Driver=uas, 5000M
    
    /:  Bus 05.Port 1: Dev 1, Class=root_hub, Driver=ehci-platform/1p, 480M
        |__ Port 1: Dev 3, If 0, Class=Mass Storage, Driver=uas, 480M

The SSD is now connected via Hi-Speed but UAS is still usable. Here are the (somewhat surprising) results:

    tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
    ...
    real	145m37.703s
    user	410m38.084s
    sys	66m56.026s
    
    tk@nanopct4:~/ComputeLibrary-18.03$ free
                  total        used        free      shared  buff/cache   available
    Mem:        1014192       67468      758332        3312      188392      869388
    Swap:       3071996       31864     3040132

    That's almost 10 minutes faster compared to USB3 above. Another surprising result is the amount of data written to the SSD: this time only 49.5 GB:

    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda             905.22      3309.40      6821.28    5956960   12278368
    sda            1819.48      4871.02      5809.35    8767832   10456832
    sda            2505.42      6131.65      6467.18   11036972   11640928
    sda            1896.49      5149.54      4429.97    9269216    7973988
    sda            1854.91      3911.03      5293.68    7039848    9528616

And this time I also queried the SSD via SMART before and after for 'Total_LBAs_Written' (one LBA is 512 bytes with Samsung SSDs):

    241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       16901233973
    241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       17004991437

The same 49.5 GB number, so unfortunately my EVO840 doesn't expose the amount of data written at the flash layer but just at the block device layer.
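
The arithmetic behind that number: the difference between both counters times 512 bytes per LBA:

    echo $(( (17004991437 - 16901233973) * 512 ))   # 53123821568 bytes = ~49.5 GiB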

     

Well, the result is surprising (a storage-relevant task performing faster with the same SSD connected to USB2 compared to USB3) but most probably I did something wrong. No idea and no time to dig any further. I checked my bash history and I repeated the test exactly as all the times before, and the iozone results also look as expected:

       39  cd ../
       40  rm -rf ComputeLibrary-18.03/
       41  tar xvf v18.03.tar.gz
       42  lsusb -t
       43  cd ComputeLibrary-18.03/
       44  grep -r lala *
       45  time scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
    
    EVO840 / USB3                                                 random    random
                  kB  reclen    write  rewrite    read    reread    read     write
              102400       4    16524    20726    19170    19235    19309    20479
              102400      16    53314    64717    65279    66016    64425    65024
              102400     512   255997   275974   254497   255720   255696   274090
              102400    1024   294096   303209   290610   292860   288668   299653
              102400   16384   349175   352628   350241   353221   353234   350942
             1024000   16384   355773   362711   354363   354632   354731   362887
    
    EVO840 / USB2                                                 random    random
                  kB  reclen    write  rewrite    read    reread    read     write
              102400       4     5570     7967     8156     7957     8156     7971
              102400      16    19057    19137    21165    21108    20993    19130
              102400     512    32625    32660    32586    32704    32696    32642
              102400    1024    33121    33179    33506    33467    33573    33226
              102400   16384    33925    33953    35436    35500    34695    33923
             1024000   16384    34120    34193    34927    34935    34933    34169

     

Now the tests with the RK3399 crippled down to a quad-core A53 running at 800 MHz are done. One time with 4 GB DRAM and without swapping, the other time again with zram/lz4 and just 1 GB DRAM assigned to provoke swapping:

     

    Without swapping:

    tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
    ...
    real	99m39.537s
    user	385m51.276s
    sys	11m2.063s
    
    tk@nanopct4:~/ComputeLibrary-18.03$ free
                  total        used        free      shared  buff/cache   available
    Mem:        3902736      102648     3124104       13336      675984     3696640
    Swap:       6291440           0     6291440

    Vs. zram/lz4:

    tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
    ...
    real	130m3.264s
    user	403m18.539s
    sys	39m7.080s
    
    tk@nanopct4:~/ComputeLibrary-18.03$ free
                  total        used        free      shared  buff/cache   available
    Mem:        1014192       82940      858740        3416       72512      859468
    Swap:       3042560       27948     3014612

This is a 30% performance drop. Still great given that I crippled the RK3399 to a quad-core A53 running at just 800 MHz. Funnily enough, lzo again outperforms lz4:

    real	123m47.246s
    user	401m20.097s
    sys	35m14.423s

As a comparison: swapping the probably fastest way possible on all common SBCs (except those RK3399 boards that can interact with NVMe SSDs). Now I test with a USB3-connected EVO840 SSD (I created a swapfile on an ext4 FS on the SSD and deactivated zram based swap entirely):

    tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
    ...
    real	155m7.422s
    user	403m34.509s
    sys	67m11.278s
    
    tk@nanopct4:~/ComputeLibrary-18.03$ free
                  total        used        free      shared  buff/cache   available
    Mem:        1014192       66336      810212        4244      137644      869692
    Swap:       3071996       26728     3045268
    
    tk@nanopct4:~/ComputeLibrary-18.03$ /sbin/swapon
    NAME                 TYPE SIZE USED PRIO
    /mnt/evo840/swapfile file   3G  26M   -1
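
For reference, the swapfile setup could have looked like this (a sketch based on the swapon output above, not my exact command history):

    fallocate -l 3G /mnt/evo840/swapfile
    chmod 600 /mnt/evo840/swapfile
    mkswap /mnt/evo840/swapfile
    swapoff -a                     # deactivate zram based swap first
    swapon /mnt/evo840/swapfile    # ends up with the default priority -1 shown above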

With this ultra fast swap on SSD execution time further increases by 25 minutes, so clearly zram is the winner. I also let 'iostat 1800' run in parallel to get a clue how much data has been transferred between board and SSD (at the block device layer -- below that, at the flash layer, the amount of writes could have been significantly higher):

    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda             965.11      3386.99      7345.81    6096576   13222460
    sda            1807.44      4788.42      5927.86    8619208   10670216
    sda            2868.95      7041.86      7431.29   12675496   13376468
    sda            1792.79      4770.62      4828.07    8587116    8690528
    sda            2984.65      7850.61      9276.85   14131184   16698424

I stopped a bit too early but what these numbers tell us is that this compile job swapping on SSD resulted in +60 GB writes and +48 GB reads to/from flash storage. Now imagine running this on a crappy SD card. It would take ages and maybe the card would die in between :)
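
For the record, summing up the iostat samples above (values in kB) is how these estimates come together -- and since iostat was stopped a bit too early the real totals are higher:

    echo $(( (13222460+10670216+13376468+8690528+16698424) / 1024 / 1024 ))   # ~59 GiB written
    echo $(( (6096576+8619208+12675496+8587116+14131184) / 1024 / 1024 ))     # ~47 GiB read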

     

@Igor: IMO we can switch to the new behaviour. We need to take care of two things when upgrading/replacing packages:

# remove the conflicting zram-config package first
apt purge zram-config
# then ensure vm.swappiness=100: adjust an existing entry or append a new one
grep -q vm.swappiness /etc/sysctl.conf
case $? in
	0)
		sed -i 's/vm\.swappiness.*/vm.swappiness=100/' /etc/sysctl.conf
		;;
	*)
		echo vm.swappiness=100 >>/etc/sysctl.conf
		;;
esac

     

  9. http://forum.banana-pi.org/t/bpi-w2-sources-for-linux-etc/5780/

    • 'We’re working on it, and when it’s ready, it’s updated to github'
    • 'we will update code and image soon'
    • (just the usual blabla as always)

    @Nora Lee: Is this all unfortunate W2 customers can expect: No sources but only pre-compiled BLOBs?

     

If you already got u-boot and kernel sources from RealTek, why don't you share them as the GPL requires anyway? Do you understand that you're violating the GPL?

  10. 6 hours ago, hjc said:

    This result is lower than that tested on my Intel NUC

     

It's also lower compared to the numbers I got with my first RK3399 device some time ago: ODROID-N1 (Hardkernel built RK's 4.4 just like ayufan, without CONFIG_ARM_ROCKCHIP_DMC_DEVFREQ). But on NanoPi M4 there's always the internal VIA VL817 hub sitting between SuperSpeed devices and the USB3 host controller, and this usually affects performance as well.

     

Update: EVO840 behind a JMS567 attached to NanoPC-T4 with the same 4.4 kernel and the dmc governor set to performance (but the CPU crippled down to a quad-core A53 clocked at 800 MHz):

                                                                  random    random
                  kB  reclen    write  rewrite    read    reread    read     write
             1024000   16384   396367   402479   372088   373177   373097   402228

373/400 MB/s read/write. So the M4 numbers above need a second test anyway.

  11. 14 hours ago, tkaiser said:

    Now on RockPro64 without any swapping happened we get 73m27.934s. So given the test has been executed appropriately we're talking about ... 16% performance decrease

     

Since I was not entirely sure whether the 'test has been executed appropriately' I went a bit further and tested no swap vs. zram on an RK3399 device directly. I had to move from RockPro64 to NanoPC-T4 since with ayufan's OS image on RockPro64 I didn't manage to restrict available DRAM in extlinux.conf.

     

So I did my test with Armbian on a NanoPC-T4. One time I let the build job run with 4 GB DRAM available and no swapping, the next time I limited available physical memory to 1 GB via extraargs="mem=1110M" in /boot/armbianEnv.txt and swapping happened with lz4 compression.

     

    We're talking about a 12% difference in performance: 4302 seconds without swapping vs. 4855 seconds with zram/lz4:

    tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
    ...
    real	71m42.193s
    user	277m55.787s
    sys	8m7.028s
    
    tk@nanopct4:~/ComputeLibrary-18.03$ free
                  total        used        free      shared  buff/cache   available
    Mem:        3902736      105600     3132652        8456      664484     3698568
    Swap:       6291440           0     6291440

    And now with zram/lz4:

    tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
    ...
    real	80m55.042s
    user	293m12.371s
    sys	27m48.478s
    
    tk@nanopct4:~/ComputeLibrary-18.03$ free
                  total        used        free      shared  buff/cache   available
    Mem:        1014192       85372      850404        3684       78416      853944
    Swap:       3042560       27608     3014952

     

The problem is: this test is not that representative of real-world workloads since I artificially limited the build job to CPUs 0-3 (the little cores) and therefore all the memory compression stuff happened on the two free A72 cores. So the next test: disabling the two big cores in the RK3399 entirely. For whatever reason, setting extraargs="mem=1110M maxcpus=4" in /boot/armbianEnv.txt didn't work (obviously a problem with the boot.cmd used for the board) so I ended up with:

    extraargs="mem=1110M"
    extraboardargs="maxcpus=4"

After a reboot /proc/cpuinfo confirms that only the little cores are available any more and that we're running with just 1 GB DRAM. Only caveat: cpufreq scaling is also gone and the little cores are now clocked at ~806 MHz:

    root@nanopct4:~# /usr/local/src/mhz/mhz 3 100000
    count=330570 us50=20515 us250=102670 diff=82155 cpu_MHz=804.747
    count=330570 us50=20540 us250=102614 diff=82074 cpu_MHz=805.541
    count=330570 us50=20542 us250=102645 diff=82103 cpu_MHz=805.257
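
A quick way to double-check both limits after such a reboot (a sketch, not from my original session):

    grep -c ^processor /proc/cpuinfo              # expect 4 with maxcpus=4
    awk '/^MemTotal/ {print $2}' /proc/meminfo    # expect roughly 1 GB with mem=1110M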

So this test will now answer a different question: how much overhead zram based swapping adds on much slower boards. That's ok too :)

     

    To be continued...

Just a quick note about DRAM latency effects. We noticed that with default kernel settings, when the dmc code is active, the memory controller increases memory access latency a lot (details). When testing the efficiency of zram swap compression I, more or less by accident, tested another use case that is highly affected by higher memory latency.

     

    When trying to build ARM's Compute Library on a NanoPC T4 limited to 1 GB DRAM (adding extraargs="mem=1110M" to /boot/armbianEnv.txt) my first run was with default settings (/sys/bus/platform/drivers/rockchip-dmc/dmc/devfreq/dmc/governor set to dmc_ondemand). Execution time with the build job relying heavily on zram based swap: 107m9.612s.

     

    Next try with /sys/bus/platform/drivers/rockchip-dmc/dmc/devfreq/dmc/governor set to performance: 80m55.042s.
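
For everyone who wants to reproduce this: it's a one-liner using the sysfs node mentioned above:

    echo performance >/sys/bus/platform/drivers/rockchip-dmc/dmc/devfreq/dmc/governor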

     

A massive difference, only due to some code trying to save some energy and thereby increasing memory latency (details). In Armbian we'll use an ugly hack to 'fix' this, but this is something board makers who provide their own OS images should also care about (@mindee for example)

     

  13. On 5/21/2018 at 1:43 PM, tkaiser said:

    In the meantime I started over with my Fire3 and tested through different values of vm.swappiness and count of active CPU cores (adding e.g. extraargs="maxcpus=4" to /boot/armbianEnv.txt) using this script started from /etc/rc.local.

     

As a comparison, now the same task (building ARM's Compute Library on an SBC) on a device where swapping does not occur. The purpose of the original test was to check the efficiency of different swapping implementations on a device running low on memory (NanoPi Fire3 with 8 Cortex-A53 cores @ 1.4GHz but just 1 GB DRAM). Results back then when running on all 8 CPU cores (full details):

    zram lzo                  46m57.427s
    zram lz4                  47m18.022s
    SSD via USB2             144m40.780s
    SanDisk Ultra A1 16 GB   247m56.744s
    HDD via USB2             570m14.406s

I used my RockPro64 with 4 GB DRAM and pinned execution of the compilation to the 4 Cortex-A53 cores, running at 1.4 GHz just like the Fire3's:

    time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native

    This is a quick htop check (pinned to an A72 core) confirming that only the 4 A53 cores are busy:

     

(screenshot: htop showing only the four A53 cores busy)

On NanoPi Fire3, when limited to 4 CPU cores and with just 1 GB DRAM, we got the following execution times (slightly faster with lzo, in contrast to 'common knowledge' telling us lz4 would always be the better choice):

    Sun May 20 16:05:17 UTC 2018   100    4    lzo [lz4] deflate lz4hc    real    86m55.073s
    Mon May 21 11:41:36 UTC 2018   100    4    [lzo] lz4 deflate lz4hc    real    85m24.440s

Now on RockPro64 without any swapping happening we get 73m27.934s. So given the test has been executed appropriately we're talking about a performance impact of below 20% when swapping to a compressed block device with a quad-core A53 @ 1.4 GHz (5125 seconds with lzo zram on NanoPi Fire3 vs. 4408 seconds without any swapping at all on RockPro64 --> a 16% performance decrease). I watched the free output and the maximum I observed was 2.6 GB RAM used:

    root@rockpro64:/home/rock64# free
                  total        used        free      shared  buff/cache   available
    Mem:        3969104     2666692      730212        8468      572200     1264080
    Swap:             0           0           0

    'Used' DRAM over the whole benchmark execution was almost always well above 1 GB and often in the 2 GB region.

     

  14. 22 hours ago, hjc said:

What if I connected a lot of USB 3.0 devices and exceeded the 5V/2A limit? Well, I did try that (connect 4 USB HDDs and run cpuburn, or even connect 2 SBCs to the USB), and the answer is simple: the board crashed

     

Thank you for the detailed test and especially for also covering the underpowering situation (which a lot of users might run into). I don't know whether it was available from the beginning but FriendlyELEC now lists these options in their web shop:

    • 5V 4A Power Adapter (+$8.99)
• German Plug Adapter (applies to: France, Germany, Portugal, Spain, Korea) (+$5.99)

    The PSU as well as the heatsink seem like mandatory accessories to me.

  15. 46 minutes ago, lanefu said:

Effectively cifs is deprecated

     

    Which doesn't change much wrt the name of Linux kernel modules that provide SMB3 client functionality ;) 

     

Speaking about Linux naming conventions: 'mount_smb' is deprecated and 'mount_cifs' is the variant that should be used today. This most probably originated from the history of SMB/CIFS support in Linux. Two decades ago SMB could've been best described as a pile of rotten protocols that were pretty much useless since Microsoft's implementations differed in almost every detail. Compared to that, CIFS was an advancement (read through the Linux manual pages -- much of this historical stuff is still there). SMB2 and SMB3 have nothing in common with the SMB we knew two decades ago: robust protocols with lots of cool features and specifications worth the name.
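
So despite its name the kernel's cifs module is what speaks SMB3 today. A minimal mount sketch (server, share and user are made up):

    mount -t cifs //server/share /mnt/share -o vers=3.0,username=someuser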

     

Anyway: CONFIG_CIFS is not set on just 5 kernel variants (by accident), so @sunxishan please send a PR with it enabled as a module.

  16. 10 minutes ago, TLLim said:

    What is your copper shim thickness and dimension

     

20x20x1mm. I ordered them 18 months ago on Aliexpress for 2 bucks (5 pieces) but the link is dead. Anyway: I don't think such a copper shim is a good solution for end users. A heatsink that can be attached directly to the SoC is better.

     

I'll try again with my next RK3399 board, using thermal glue between heatsink and copper shim and normal thermal paste between shim and SoC. Currently I fear a bit that the shim could move when vibrations occur.

  17. On 8/23/2018 at 7:05 PM, danglin said:

    Here are results for 800 and 1000:

    http://ix.io/1l2a

    http://ix.io/1l2n

     

And here's the result for the flash-image-1g-2cs-1200_750_boot_sd_and_usb.bin BLOB: http://ix.io/1lCe (if you look closely you'll see that between 23:01:08 and 23:06:24 some background activity happened -- one of my Macs backing up to the EspressoBin -- so I had to repeat the OpenSSL test on an idle system later)

     

With 'working' cpufreq my EspressoBin idles at 200 MHz consuming 5.8W (measured at the wall) and the SoC is hot as hell. Now, without CONFIG_ARM_ARMADA_37XX_CPUFREQ and with the 1200 MHz settings, the board idles at 6.5W while running at 1190 MHz all the time, still hot as hell.

     

A difference of 0.7W is a joke given that these idle numbers are way too high anyway. There's only one SATA HDD connected (in standby) and one LAN connection. A RockPro64 with the same 12V PSU measures below 3.7W.

     

    Update: with flash-image-1g-2cs-800_800_boot_sd_and_usb.bin the board idles at 5.9W.
