chwe Posted May 18, 2018 Posted May 18, 2018 3 hours ago, zador.blood.stained said: 3 hours ago, Tido said: IIRC I read from TK somewhere that ZRAM is only used on Debian or Ubuntu, why not on both? Because zram-config package exists only in Ubuntu by default, and for reasons I don't remember (probably version numbering / potential repository priority issues?) we decided to not copy it to our repository to be available for all releases. But then wouldn't it make sense to keep ZRAM by default as an Ubuntu feature and summarize Thomas' findings on the tutorial page, together with a little tutorial on how you can implement ZRAM on your own for Debian? As you said once: Quote I see it as Ubuntu/Debian + bootloader with customizations + available kernel(s) with customizations + minimal OS (userspace) customizations and some optional tools/scripts like armbian-config. So if ZRAM is a standard Ubuntu feature but not Debian then let it be a standard Armbian Ubuntu feature too and show our users how they can use ZRAM under Debian if they want it (or a possibility for ZRAM over Armbianconfig). 3 hours ago, zador.blood.stained said: Recent examples to add to that - there is no purpose in recently added "development" branch if "master" is completely abandoned as a result and suggesting to "switch to beta" to fix any issues defeats the purpose of "beta" - fixes should be immediately pushed to the stable branch This branch is 242 commits ahead, 42 commits behind master. The longer we wait, the more problematic a merge will be. I think the dev branch was opened a bit too early, without clear 'rules' about its purpose. Clean it up before it gets more and more problematic?
zador.blood.stained Posted May 18, 2018 Posted May 18, 2018 5 minutes ago, chwe said: So if ZRAM is a standard Ubuntu feature but not Debian then let it be a standard Armbian Ubuntu feature too and show our users how they can use ZRAM under Debian if they want it (or a possibility for ZRAM over Armbianconfig). It's not a feature, it's just an independent package with a dedicated maintainer (in upstream Ubuntu), similar to other packages in Ubuntu that don't exist in standard Debian repositories.
tkaiser Posted May 18, 2018 Author Posted May 18, 2018 1 hour ago, chwe said: So if ZRAM is a standard Ubuntu feature but not Debian then let it be a standard Armbian Ubuntu feature too and show our users how they can use ZRAM under Debian if they want it (or a possibility for ZRAM over Armbianconfig). This has been discussed already in this thread. We could enable this on our own (and make it configurable in a sane way), but the way it's prepared by me (as part of our armhwinfo script) is not a good idea. So it would need someone to split all the armhwinfo functionality into different parts that can also be configured by the user. Also, our current policy of vm.swappiness=0 is something that could be discussed/changed or at least made configurable by the user in an easy way. But since I, at least, have not the slightest idea in which direction Armbian is moving, since I have a hard time understanding what time is wasted on, and especially since I really hate to waste my own time on stuff I don't like (e.g. trying to repair broken patches or insecure scripting), I simply do not care any more. Same with the 'speed' of progress. I started this zram journey over one year ago, wasted my time analyzing 'armbianmonitor -u' output from various installations, waited another few months to see whether there were complaints from Ubuntu users about zram now being active (I'm not aware of a single one), and would like to see better usage of RAM sooner rather than later. But as things stand today I simply don't care any more since all this stuff just feels like time wasted.
JMCC Posted May 18, 2018 Posted May 18, 2018 From a user's point of view, all that stuff is useful for people who want to experiment and learn about computing, whether students (I know many kids who are learning a lot with SBCs, using them for robotics projects) or hobbyists. I wouldn't say it is a waste of time cooperating with that; I think it is useful for many people. Of course, provided it doesn't interfere with other commitments the developer may have. Getting back on-topic, on the XU4 the compiling took 27m 9s, with the swap numbers I posted above: Spoiler
$ cat /proc/swaps
Filename                Type       Size     Used   Priority
/dev/zram0              partition  127612   11516  5
/dev/zram1              partition  127612   11492  5
/dev/zram2              partition  127612   11684  5
/dev/zram3              partition  127612   11552  5
/dev/zram4              partition  127612   11720  5
/dev/zram5              partition  127612   11448  5
/dev/zram6              partition  127612   11668  5
/dev/zram7              partition  127612   11600  5
/mnt/externo/swapfile1  file       2097148  0      -1
I'll post more numbers about the TinkerBoard and 3 GB Khadas Vim2.
JMCC Posted May 19, 2018 Posted May 19, 2018 More numbers: Tinkerboard (I set 8 jobs, though it has 4 cores): 41m 2s Spoiler
$ cat /proc/swaps
Filename    Type       Size    Used   Priority
/dev/zram0  partition  257440  39396  5
/dev/zram1  partition  257440  39428  5
/dev/zram2  partition  257440  39384  5
/dev/zram3  partition  257440  39388  5
Khadas VIM2 3 GB: 48m 32s Spoiler
$ cat /proc/swaps
Filename    Type       Size    Used  Priority
/dev/zram0  partition  172164  3856  5
/dev/zram1  partition  172164  3884  5
/dev/zram2  partition  172164  3896  5
/dev/zram3  partition  172164  3892  5
/dev/zram4  partition  172164  3848  5
/dev/zram5  partition  172164  3804  5
/dev/zram6  partition  172164  3804  5
/dev/zram7  partition  172164  3820  5
(Note: To be fair, we must remember that TB and XU4 are compiling 32-bit binaries, while Nanopi Fire3 and VIM2 compile 64-bit). So I would conclude that zram does have a performance impact, but it is not too big: Fire3, having a CPU speed about 16% faster than VIM2 (and probably slower DDR3 RAM but at the same time with a smaller latency), performed about 4% better. @tkaiser: Did I speak wisely?
tkaiser Posted May 20, 2018 Author Posted May 20, 2018 13 hours ago, JMCC said: Khadas VIM2 3 GB: 48m 32s Thank you. I just repeated the test while limiting my NanoPi Fire3 to 1200 MHz with zram/lz4 and vm.swappiness=100 (/etc/sysctl.conf): 51m34.139s (and with lzo it was 50m30.884s -- so again with this workload no advantage for lz4, for whatever reasons). But since we know that the Vim2 unfortunately relies on an Amlogic SoC with a cheating firmware blob (fake clockspeeds), the only reasonable way to get a real comparison would be you repeating the test twice:

First time with purged zram-config package and a commented swap entry in fstab to force the board to do no zram paging at all.

Then again, this time with the Vim2 limited to 1 GB DRAM ('mem=1G' added to kernel cmdline), setting vm.swappiness=100 and activating zram with the following modified activate_zram routine in /etc/init.d/armhwinfo (needs to be uncommented of course too):

activate_zram() {
	# Do not interfere with already present config-zram package
	dpkg -l | grep -q 'zram-config' && return

	# Load zram module with n instances (one per CPU core, 4 are the maximum)
	zram_devices=5
	module_args="$(modinfo zram | awk -F" " '/num_devices/ {print $2}' | cut -f1 -d:)"
	[[ -n ${module_args} ]] && modprobe zram ${module_args}=${zram_devices} || return

	# Use half of the real memory by default --> 1/${ram_divisor}
	ram_divisor=2
	mem_info=$(LC_ALL=C free -w 2>/dev/null | grep "^Mem" || LC_ALL=C free | grep "^Mem")
	memory_total=$(awk '{printf("%d",$2*1024)}' <<<${mem_info})
	mem_per_zram_device=$(( ${memory_total} / ${ram_divisor} ))
	for (( i=0; i<zram_devices; i++ )); do
		[[ -f /sys/block/zram${i}/comp_algorithm ]] && echo lz4 >/sys/block/zram${i}/comp_algorithm 2>/dev/null
		echo -n ${mem_per_zram_device} > /sys/block/zram${i}/disksize
		mkswap /dev/zram${i}
		swapon -p 5 /dev/zram${i}
	done
	echo -e "\n### Activated ${zram_devices} zram swap devices with ${mem_per_zram_device} MB each\n" >>${Log}
} # activate_zram

Edit: Added lzo numbers above.
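[Editorial note: in case it helps anyone following along, the first step above (no zram paging at all) boils down to roughly the following sketch. The sed pattern is only an illustration -- which fstab line needs commenting depends on the image, so check the result before rebooting.]

apt purge zram-config                   # remove Ubuntu's zram package if installed
sed -i '/\sswap\s/s/^/#/' /etc/fstab    # comment out existing swap entries (verify afterwards!)
swapoff -a                              # disable swap for the running session as well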
JMCC Posted May 20, 2018 Posted May 20, 2018 (edited) Here are the results. Some preliminary notes: This particular test is not too accurate on HMP CPUs, because it has some big static libraries compiled at the end, and depending on whether they fall on a slow or a fast core, results can vary by a few minutes. That explains why the no-swap time posted here is a little higher than the one I posted before. I wasn't able to make the kernel parameter work with balbes150's image, so I decided to take off the white gloves and do a dirty hack: stress --vm-bytes 1677721600 --vm-keep -m 1 which created an initial memory state that can more or less do the job: Spoiler
$ cat /proc/meminfo
MemTotal:        2754696 kB
MemFree:          814840 kB
MemAvailable:     952124 kB
Buffers:           13088 kB
Cached:           146168 kB
SwapCached:            0 kB
Active:          1744304 kB
Inactive:          84700 kB
Active(anon):    1673024 kB
Inactive(anon):    13312 kB
Active(file):      71280 kB
Inactive(file):    71388 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       5242860 kB
SwapFree:        5242860 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:       1669820 kB
Mapped:            21968 kB
Shmem:             16592 kB
Slab:              38316 kB
SReclaimable:      19492 kB
SUnreclaim:        18824 kB
KernelStack:        2624 kB
PageTables:         4428 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6620208 kB
Committed_AS:    1849200 kB
VmallocTotal:    1048576 kB
VmallocUsed:       72872 kB
VmallocChunk:     966844 kB
TotalCMA:         221184 kB
UsedCMA:            3244 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
After it, I changed vm.swappiness to 100. That being said, these are the numbers: 3 GB RAM, no swap: 49:14 Spoiler
$ cat /proc/swaps
Filename    Type       Size     Used    Priority
Pseudo-1 GB RAM, high swappiness: 77m 34s Spoiler
$ cat /proc/swaps
Filename    Type       Size     Used    Priority
/dev/zram0  partition  1048572  340600  5
/dev/zram1  partition  1048572  340652  5
/dev/zram2  partition  1048572  339484  5
/dev/zram3  partition  1048572  340616  5
/dev/zram4  partition  1048572  340264  5
Edited May 20, 2018 by JMCC EDIT: corrected some numbers
JMCC Posted May 20, 2018 Posted May 20, 2018 (edited) On second thought, I realized that "stress --vm-keep" was probably using lots of memory bandwidth by itself, so the second number is of no use. I think the parameter I should have used is "--vm-hang 0". But I already put the VIM2 away, so it'll have to wait. [EDIT: No, that won't work either, because if the memory hog doesn't change, it will get swapped away and the kernel will use the physical RAM for compiling. Any suggestion is welcome] Edited May 20, 2018 by JMCC
tkaiser Posted May 21, 2018 Author Posted May 21, 2018 On 5/20/2018 at 10:35 PM, JMCC said: On second thought, I realized that "stress --vm-keep" was probably using lots of memory bandwith by itself, so the second number is of no use Yep, I agree (especially since Fire3 clocked down to 1.2GHz scores around 51m with zram so the 77 minutes seem just wrong). In the meantime I started over with my Fire3 and tested through different values of vm.swappiness and count of active CPU cores (adding e.g. extraargs="maxcpus=4" to /boot/armbianEnv.txt) using this script started from /etc/rc.local. I tested again with lz4 and 2 CPU cores another time since first run results looked bogus: Timestamp vm.swappiness cores algorithm execution time Sun May 20 12:45:12 UTC 2018 100 8 lzo [lz4] deflate lz4hc real 47m53.246s Sun May 20 13:34:26 UTC 2018 80 8 lzo [lz4] deflate lz4hc real 48m9.429s Sun May 20 14:23:55 UTC 2018 60 8 lzo [lz4] deflate lz4hc real 48m25.700s Sun May 20 15:13:40 UTC 2018 40 8 lzo [lz4] deflate lz4hc real 49m40.919s Sun May 20 16:05:17 UTC 2018 100 4 lzo [lz4] deflate lz4hc real 86m55.073s Sun May 20 17:33:34 UTC 2018 80 4 lzo [lz4] deflate lz4hc real 87m50.534s Sun May 20 19:02:49 UTC 2018 60 4 lzo [lz4] deflate lz4hc real 88m43.067s Sun May 20 20:32:55 UTC 2018 40 4 lzo [lz4] deflate lz4hc real 98m43.243s Sun May 20 22:15:55 UTC 2018 100 2 lzo [lz4] deflate lz4hc real 148m58.772s Mon May 21 00:46:19 UTC 2018 80 2 lzo [lz4] deflate lz4hc real 146m58.757s Mon May 21 03:14:40 UTC 2018 60 2 lzo [lz4] deflate lz4hc real 147m3.493s Mon May 21 05:43:08 UTC 2018 40 2 lzo [lz4] deflate lz4hc real 155m22.952s Mon May 21 08:20:34 UTC 2018 100 8 [lzo] lz4 deflate lz4hc real 46m56.667s Mon May 21 09:08:59 UTC 2018 80 8 [lzo] lz4 deflate lz4hc real 47m25.969s Mon May 21 09:57:58 UTC 2018 60 8 [lzo] lz4 deflate lz4hc real 47m45.961s Mon May 21 10:47:16 UTC 2018 40 8 [lzo] lz4 deflate lz4hc real 48m14.999s Mon May 21 11:41:36 UTC 2018 100 4 [lzo] lz4 deflate lz4hc real 85m24.440s Mon May 21 13:08:31 UTC 2018 80 4 [lzo] lz4 deflate lz4hc real 85m47.343s Mon May 21 14:35:44 UTC 2018 60 4 [lzo] lz4 deflate lz4hc real 85m59.063s Mon May 21 16:03:11 UTC 2018 40 4 [lzo] lz4 deflate lz4hc real 86m49.615s Mon May 21 21:53:07 UTC 2018 100 2 [lzo] lz4 deflate lz4hc real 143m1.995s Tue May 22 00:17:40 UTC 2018 80 2 [lzo] lz4 deflate lz4hc real 144m0.501s Tue May 22 02:43:08 UTC 2018 60 2 [lzo] lz4 deflate lz4hc real 144m37.204s Tue May 22 05:09:14 UTC 2018 40 2 [lzo] lz4 deflate lz4hc real 146m51.361s Tue May 22 07:56:42 UTC 2018 100 2 lzo [lz4] deflate lz4hc real 147m15.069s Tue May 22 10:25:33 UTC 2018 80 2 lzo [lz4] deflate lz4hc real 147m31.538s Tue May 22 12:54:31 UTC 2018 60 2 lzo [lz4] deflate lz4hc real 147m27.517s Tue May 22 15:23:28 UTC 2018 40 2 lzo [lz4] deflate lz4hc real 150m54.700s So as expected with zram based swap increasing vm.swappiness to the maximum helps with performance in such memory overcommitment situations like doing this huge compile job (Arm Compute Library) that needs up to 2.6GB with a 64-bit userland -- just 2 GB when doing a 32-bit build). And for whatever reasons at least with kernel 4.14 and defaults lz4 does not perform better compared to lzo, it's quite the opposite and with lzo the jobs finish even faster.
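[Editorial note: the test script itself is only linked above. As a rough idea of what such a sweep looks like, here is a minimal sketch -- not tkaiser's actual script; the paths, the single zram device and the 1G disksize are assumptions -- that re-creates the zram swap device per run, sets vm.swappiness and logs the build time.]

#!/bin/bash
BUILDDIR=/root/ComputeLibrary-18.03   # assumption: sources already unpacked here
LOG=/root/zram-sweep.log

for algo in lzo lz4; do
	for swappiness in 100 80 60 40; do
		sysctl -w vm.swappiness=${swappiness}
		# re-create the zram swap device with the desired compression algorithm
		swapoff /dev/zram0 2>/dev/null
		echo 1 > /sys/block/zram0/reset
		echo ${algo} > /sys/block/zram0/comp_algorithm
		echo 1G > /sys/block/zram0/disksize
		mkswap /dev/zram0 && swapon -p 5 /dev/zram0
		cd ${BUILDDIR} && scons -c >/dev/null 2>&1   # clean previous build artifacts
		printf '%s swappiness=%s algo=%s ' "$(date)" "${swappiness}" "${algo}" >>${LOG}
		{ time scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native >/dev/null ; } 2>>${LOG}
	done
done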
Regis Michel LeClerc Posted May 21, 2018 Posted May 21, 2018 Ok, brilliant... So, just to summarise, if I want my Orange Pi PC2 to use zRam instead of using a static 8GB off of my SDCard and compile the Monero thing (an operation that seems to require about 4GB anyway), how do I configure that, from start to end?
tkaiser Posted May 23, 2018 Author Posted May 23, 2018 On 5/21/2018 at 11:21 PM, Regis Michel LeClerc said: I want my Orange Pi PC2 to use zRam instead of using a static 8GB off of my SDCard OPi PC2 has just one GB DRAM so trying to use 8 GB zram won't work. The average compression ratio I've seen in all tests so far was between 3:1 and 3.5:1 and also zram needs a small amount of DRAM for itself. So zram using 3 times the available RAM can be considered maximum and might even fail already when memory contents aren't compressable at such a ratio. If you look at page 1 of this thread you'll see that using an UAS attached SSD is the way to go in such situations. And maybe switching from zram to zcache when you want to use both DRAM and storage for swapping. Configuring zram and 'disk' as swap at the same time has some caveats.
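[Editorial note: to make the sizing rule of thumb concrete: with roughly 3:1 compression, a zram disksize of about twice the physical RAM is a conservative upper bound on a 1 GB board. A minimal sketch follows; the 2x factor, the lzo choice and the single device are illustrative assumptions, not what Armbian ships.]

ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)   # physical RAM in kB
modprobe zram num_devices=1
echo lzo > /sys/block/zram0/comp_algorithm              # algorithm must be set before disksize
echo "$(( ram_kb * 2 ))K" > /sys/block/zram0/disksize   # ~2x RAM, well below the ~3:1 ratio seen above
mkswap /dev/zram0
swapon -p 5 /dev/zram0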
tkaiser Posted September 3, 2018 Author Posted September 3, 2018 On 5/21/2018 at 1:43 PM, tkaiser said: In the meantime I started over with my Fire3 and tested through different values of vm.swappiness and count of active CPU cores (adding e.g. extraargs="maxcpus=4" to /boot/armbianEnv.txt) using this script started from /etc/rc.local. As a comparison now the same task (building ARM's Compute Library on a SBC) on a device where swapping does not occur. The purpose of this test was to check for efficiency of different swapping implementations on a device running low on memory (NanoPi Fire3 with 8 Cortex-A53 cores @ 1.4GHz but just 1 GB DRAM). Results back then when running on all 8 CPU cores (full details): zram lzo 46m57.427s zram lz4 47m18.022s SSD via USB2 144m40.780s SanDisk Ultra A1 16 GB 247m56.744s HDD via USB2 570m14.406s I used my RockPro64 with 4 GB DRAM and pinned execution of the compilation to the 4 Cortex-A53 cores running also at 1.4 GHz like the Fire3: time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native This is a quick htop check (pinned to an A72 core) confirming that only the 4 A53 cores are busy: On NanoPi Fire3 when being limited to 4 CPU cores and with just 1 GB DRAM we got the following execution times (slightly faster with lzo in contrast to 'common knowledge' telling us lz4 would always be the better choice): Sun May 20 16:05:17 UTC 2018 100 4 lzo [lz4] deflate lz4hc real 86m55.073s Mon May 21 11:41:36 UTC 2018 100 4 [lzo] lz4 deflate lz4hc real 85m24.440s Now on RockPro64 without any swapping happened we get 73m27.934s. So given the test has been executed appropriately we're talking about a performance impact of below 20% when swapping to a compressed block device with a quad-core A53 @ 1.4 GHz (5125 seconds with lzo zram on NanoPi Fire3 vs. 4408 seconds without any swapping at all on RockPro64 --> 16% performance decrease). I looked at the free output and the maximum I observed was 2.6GB RAM used: root@rockpro64:/home/rock64# free total used free shared buff/cache available Mem: 3969104 2666692 730212 8468 572200 1264080 Swap: 0 0 0 'Used' DRAM over the whole benchmark execution was almost always well above 1 GB and often in the 2 GB region.
tkaiser Posted September 4, 2018 Author Posted September 4, 2018 14 hours ago, tkaiser said: Now on RockPro64 without any swapping happened we get 73m27.934s. So given the test has been executed appropriately we're talking about ... 16% performance decrease Since I was not entirely sure whether 'test has been executed appropriately' I went a bit further to test no swap vs. zram on a RK3399 device directly. I had to move from RockPro64 to NanoPC-T4 since with ayufan OS image on RockPro64 I didn't manage to restrict available DRAM in extlinux.conf So I did my test with Armbian on a NanoPC-T4. One time I let the build job run with 4 GB DRAM available and no swapping, next time I limited available physical memory to 1 GB via extraargs="mem=1110M" in /boot/armbianEnv.txt and swapping happened with lz4 compression. We're talking about a 12% difference in performance: 4302 seconds without swapping vs. 4855 seconds with zram/lz4: tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native ... real 71m42.193s user 277m55.787s sys 8m7.028s tk@nanopct4:~/ComputeLibrary-18.03$ free total used free shared buff/cache available Mem: 3902736 105600 3132652 8456 664484 3698568 Swap: 6291440 0 6291440 And now with zram/lz4: tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native ... real 80m55.042s user 293m12.371s sys 27m48.478s tk@nanopct4:~/ComputeLibrary-18.03$ free total used free shared buff/cache available Mem: 1014192 85372 850404 3684 78416 853944 Swap: 3042560 27608 3014952 Problem is: this test is not that representative for real-world workloads since I artificially limited the build job to CPUs 0-3 (little cores) and therefore all the memory compression stuff happened on the two free A72 cores. So next test: trying to disable the two big cores in RK3399 entirely. For whatever reasons setting extraargs="mem=1110M maxcpus=4" in /boot/armbianEnv.txt didn't work (obviously a problem with boot.cmd used for the board) so I ended up with: extraargs="mem=1110M" extraboardargs="maxcpus=4" After a reboot /proc/cpuinfo confirms that only little cores are available any more and we're running with just 1 GB DRAM. Only caveat: cpufreq scaling is also gone and now the little cores are clocked with ~806 MHz: root@nanopct4:~# /usr/local/src/mhz/mhz 3 100000 count=330570 us50=20515 us250=102670 diff=82155 cpu_MHz=804.747 count=330570 us50=20540 us250=102614 diff=82074 cpu_MHz=805.541 count=330570 us50=20542 us250=102645 diff=82103 cpu_MHz=805.257 So then this test will answer a different question: how much overhead adds zram based swapping on much slower boards. That's ok too To be continued... 1
tkaiser Posted September 4, 2018 Author Posted September 4, 2018 Now the tests with the RK3399 crippled down to a quad-core A53 running at 800 MHz are done. One time with 4 GB DRAM and no swapping, the other time again with zram/lz4 and just 1 GB DRAM assigned to provoke swapping. Without swapping:

tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
...
real	99m39.537s
user	385m51.276s
sys	11m2.063s
tk@nanopct4:~/ComputeLibrary-18.03$ free
              total        used        free      shared  buff/cache   available
Mem:        3902736      102648     3124104       13336      675984     3696640
Swap:       6291440           0     6291440

Vs. zram/lz4:

tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
...
real	130m3.264s
user	403m18.539s
sys	39m7.080s
tk@nanopct4:~/ComputeLibrary-18.03$ free
              total        used        free      shared  buff/cache   available
Mem:        1014192       82940      858740        3416       72512      859468
Swap:       3042560       27948     3014612

This is a 30% performance drop. Still great given that I crippled the RK3399 to a quad-core A53 running at just 800 MHz. Funnily, lzo again outperforms lz4:

real	123m47.246s
user	401m20.097s
sys	35m14.423s

As a comparison: swap done in probably the fastest way possible on all common SBCs (except those RK3399 boards that can interact with NVMe SSDs). Now I test with a USB3-connected EVO840 SSD (I created a swapfile on an ext4 FS on the SSD and deactivated zram-based swap entirely):

tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
...
real	155m7.422s
user	403m34.509s
sys	67m11.278s
tk@nanopct4:~/ComputeLibrary-18.03$ free
              total        used        free      shared  buff/cache   available
Mem:        1014192       66336      810212        4244      137644      869692
Swap:       3071996       26728     3045268
tk@nanopct4:~/ComputeLibrary-18.03$ /sbin/swapon
NAME                 TYPE SIZE USED PRIO
/mnt/evo840/swapfile file   3G  26M   -1

With ultra-fast swap on SSD the execution time further increases by 25 minutes, so clearly zram is the winner. I also let 'iostat 1800' run in parallel to get a clue how much data has been transferred between board and SSD (at the block device layer -- at the flash layer below, the amount of writes could have been significantly higher):

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             965.11      3386.99      7345.81    6096576   13222460
sda            1807.44      4788.42      5927.86    8619208   10670216
sda            2868.95      7041.86      7431.29   12675496   13376468
sda            1792.79      4770.62      4828.07    8587116    8690528
sda            2984.65      7850.61      9276.85   14131184   16698424

I stopped a bit too early, but what these numbers tell us is that this compile job swapping on SSD resulted in more than 60 GB of writes and more than 48 GB of reads to/from flash storage. Now imagine running this on a crappy SD card: it would take ages and maybe the card would die in between.

@Igor: IMO we can switch to the new behaviour. We need to take care about two things when upgrading/replacing packages:

apt purge zram-config
grep -q vm.swappiness /etc/sysctl.conf
case $? in
	0)
		sed -i 's/vm\.swappiness.*/vm.swappiness=100/' /etc/sysctl.conf
		;;
	*)
		echo vm.swappiness=100 >>/etc/sysctl.conf
		;;
esac
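[Editorial note: a quick way to verify the result on an upgraded system -- just a suggested check, not part of the snippet above.]

swapon --show           # should list only /dev/zram* devices, no swapfile on SD/eMMC
sysctl vm.swappiness    # should now report vm.swappiness = 100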
tkaiser Posted September 4, 2018 Author Posted September 4, 2018 3 hours ago, tkaiser said: real 155m7.422s This was 'swap with SSD connected to USB3 port'. Now a final number. I was curious how long the whole build orgy will take if I use the same UAS attached EVO840 SSD and connect it to an USB2 port. Before and after (lsusb -t): /: Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M |__ Port 1: Dev 3, If 0, Class=Mass Storage, Driver=uas, 5000M /: Bus 05.Port 1: Dev 1, Class=root_hub, Driver=ehci-platform/1p, 480M |__ Port 1: Dev 3, If 0, Class=Mass Storage, Driver=uas, 480M The SSD is now connected via Hi-Speed but still UAS is usable. Here the (somewhat surprising) results: tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native ... real 145m37.703s user 410m38.084s sys 66m56.026s tk@nanopct4:~/ComputeLibrary-18.03$ free total used free shared buff/cache available Mem: 1014192 67468 758332 3312 188392 869388 Swap: 3071996 31864 3040132 That's almost 10 minutes faster compared to USB3 above. Another surprising result is the amount of data written to the SSD: this time only 49.5 GB: Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 905.22 3309.40 6821.28 5956960 12278368 sda 1819.48 4871.02 5809.35 8767832 10456832 sda 2505.42 6131.65 6467.18 11036972 11640928 sda 1896.49 5149.54 4429.97 9269216 7973988 sda 1854.91 3911.03 5293.68 7039848 9528616 And this time I also queried the SSD via SMART before and after about 'Total_LBAs_Written' (that's 512 bytes with Samsung SSDs): 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 16901233973 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 17004991437 Same 49.5 GB number so unfortunately my EVO840 doesn't expose amount of data written at the flash layer but just at the block device layer. Well, result is surprising (a storage relevant task performing faster with same SSD connected to USB2 compared to USB3) but most probably I did something wrong. No idea and no time any further. I checked my bash history but I repeated the test as I did all the time before and also iozone results look as expected: 39 cd ../ 40 rm -rf ComputeLibrary-18.03/ 41 tar xvf v18.03.tar.gz 42 lsusb -t 43 cd ComputeLibrary-18.03/ 44 grep -r lala * 45 time scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native EVO840 / USB3 random random kB reclen write rewrite read reread read write 102400 4 16524 20726 19170 19235 19309 20479 102400 16 53314 64717 65279 66016 64425 65024 102400 512 255997 275974 254497 255720 255696 274090 102400 1024 294096 303209 290610 292860 288668 299653 102400 16384 349175 352628 350241 353221 353234 350942 1024000 16384 355773 362711 354363 354632 354731 362887 EVO840 / USB2 random random kB reclen write rewrite read reread read write 102400 4 5570 7967 8156 7957 8156 7971 102400 16 19057 19137 21165 21108 20993 19130 102400 512 32625 32660 32586 32704 32696 32642 102400 1024 33121 33179 33506 33467 33573 33226 102400 16384 33925 33953 35436 35500 34695 33923 1024000 16384 34120 34193 34927 34935 34933 34169
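[Editorial note: for anyone wanting to double-check the 49.5 GB figure: Total_LBAs_Written on this Samsung SSD counts 512-byte units, so the difference between the two SMART readings converts like this -- simple arithmetic, no new data.]

echo '(17004991437 - 16901233973) * 512' | bc                                    # 53123821568 bytes written
awk 'BEGIN {printf "%.1f GiB\n", (17004991437 - 16901233973) * 512 / 1024^3}'    # ~49.5 GiB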
chrisf Posted September 4, 2018 Posted September 4, 2018 Have you got a tool to check the latency to compare USB2 and USB3? Or CPU usage when doing the same workload? My understanding of the difference between USB2 and 3 is USB2 is polled, while USB3 is interrupt driven. Assuming you haven't done something wrong and your numbers are an accurate representation, maybe at the hardware level USB3 requires more resources, all the interrupts could be causing excessive context switching. Or the drivers aren't as optimised yet. would be interesting to compare between different hardware USB3 implementations. 1
tkaiser Posted September 5, 2018 Author Posted September 5, 2018 7 hours ago, chrisf said: Have you got a tool to check the latency to compare USB2 and USB3? Or CPU usage when doing the same workload? I had the SSH session window still open and collected the relevant logging portions from 'iostat 1800' while running the test with USB3, USB2 and then again zram/lzo (which also surprisingly again outperformed lz4): USB3: %user %nice %system %iowait %steal %idle 82.31 0.00 12.56 4.68 0.00 0.45 74.77 0.00 16.80 8.25 0.00 0.18 55.24 0.00 19.84 24.44 0.00 0.48 72.22 0.00 16.94 10.43 0.00 0.41 50.96 0.00 22.24 26.09 0.00 0.71 USB2: %user %nice %system %iowait %steal %idle 81.77 0.00 11.95 5.30 0.00 0.99 75.99 0.00 16.95 6.71 0.00 0.35 66.50 0.00 19.19 13.81 0.00 0.49 77.64 0.00 18.31 3.97 0.00 0.08 44.17 0.00 12.99 13.09 0.00 29.74 zram/lzo: %user %nice %system %iowait %steal %idle 84.83 0.00 14.68 0.01 0.00 0.48 82.94 0.00 17.06 0.00 0.00 0.00 81.51 0.00 18.49 0.00 0.00 0.00 78.33 0.00 21.66 0.00 0.00 0.01 7 hours ago, chrisf said: maybe at the hardware level USB3 requires more resources, all the interrupts could be causing excessive context switching That's an interesting point and clearly something I forgot to check. But I was running with latest IRQ assignment settings (USB2 on CPU1 and USB3 on CPU2) so there shouldn't have been a problem with my crippled setup (hiding CPUs 4 and 5). But iostat output above reveals that %iowait with USB3 was much higher compared to USB2 so this is clearly something that needs more investigations. 1
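[Editorial note: a sketch of how the interrupt/context-switch theory could be checked on the next run, using standard kernel interfaces only; the exact controller names in /proc/interrupts differ per board.]

grep -iE 'xhci|ehci' /proc/interrupts        # per-CPU interrupt counts for USB3/USB2 host controllers
for irq in $(awk -F: '/xhci|ehci/ {gsub(/ /,"",$1); print $1}' /proc/interrupts); do
	echo "IRQ ${irq} handled on CPU(s): $(cat /proc/irq/${irq}/smp_affinity_list)"
done
vmstat 5                                     # 'cs' column = context switches/s, 'wa' = iowait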
sfx2000 Posted September 17, 2018 Posted September 17, 2018 (edited) oh hai! noob here - but zram is interesting... zram-config is always good, as it kinda sorts things out, but for distros where it's not available as a package, a simple short shell script does the job (cribbed this from somewhere else, forget where)... Anyways - you just need to ensure that zram is enabled in the kernel config. zram.sh - put this over in /usr/bin/zram.sh and make it executable... then add it to /etc/rc.local, and add vm.swappiness = 10 to /etc/sysctl.conf to keep pressure off unless it's needed. Spoiler
#!/bin/bash
# Create one zram swap device per CPU core and split total RAM between them
cores=$(nproc --all)
modprobe zram num_devices=$cores
swapoff -a
totalmem=`free | grep -e "^Mem:" | awk '{print $2}'`   # total RAM in kB
mem=$(( ($totalmem / $cores) * 1024 ))                 # per-device disksize in bytes
core=0
while [ $core -lt $cores ]; do
	echo $mem > /sys/block/zram$core/disksize
	mkswap /dev/zram$core
	swapon -p 5 /dev/zram$core
	let core=core+1
done
The memory manager sorts things out here, and this is a good item for small-memory devices. Edited September 17, 2018 by Tido: added spoiler | see message below, not recommended to use
tkaiser Posted September 17, 2018 Author Posted September 17, 2018 2 hours ago, sfx2000 said: zram.sh - put this over in /usr/bin/zram.sh and make it executable For anyone else reading this: do NOT do this. Just use Armbian -- we care about zram at the system level and also set vm.swappiness accordingly (low values are bad)
Igor Posted September 18, 2018 Posted September 18, 2018 There is a problem at a kernel change, more precisely if/when initrd is regenerated. update-initramfs: Generating /boot/initrd.img-4.18.8-odroidc2 I: The initramfs will attempt to resume from /dev/zram4 I: (UUID=368b4521-07d1-43df-803d-159c60c5c833) I: Set the RESUME variable to override this. update-initramfs: Converting to u-boot format This leads to boot delay: Spoiler Starting kernel ... Loading, please wait... starting version 232 Begin: Loading essential drivers ... done. Begin: Running /scripts/init-premount ... done. Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done. Begin: Running /scripts/local-premount ... Scanning for Btrfs filesystems Begin: Waiting for suspend/resume device ... Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. Begin: Running /scripts/local-block ... done. done. Gave up waiting for suspend/resume device done. Begin: Will now check root file system ... fsck from util-linux 2.29.2 [/sbin/fsck.ext4 (1) -- /dev/mmcblk1p1] fsck.ext4 -a -C0 /dev/mmcblk1p1 /dev/mmcblk1p1: clean, 77231/481440 files, 600696/1900644 blocks done. done. Begin: Running /scripts/local-bottom ... done. Begin: Running /scripts/init-bottom ... done. Welcome to Debian GNU/Linux 9 (stretch)! Ideas on how to fix it best?
tkaiser Posted September 18, 2018 Author Posted September 18, 2018 10 minutes ago, Igor said: Ideas on how to fix it best? https://lists.debian.org/debian-kernel/2017/04/msg00333.html 2
sfx2000 Posted September 21, 2018 Posted September 21, 2018 @Igor @tkaiser beat me to the punch on the initramfs 'glitch'... but it's an easy fix. Edit (if the file isn't there, create it):

/etc/initramfs-tools/conf.d/resume

Add/modify the line there - can be none, or point it to another location other than zram:

RESUME=none

Then refresh the initramfs:

update-initramfs -u -k all
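[Editorial note: the same fix condensed into two commands, run as root; only do this if you really don't resume from a swap partition.]

echo "RESUME=none" > /etc/initramfs-tools/conf.d/resume
update-initramfs -u -k all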
sfx2000 Posted September 21, 2018 Posted September 21, 2018 On 9/16/2018 at 10:37 PM, tkaiser said: we care about zram at the system level and also set vm.swappiness accordingly (low values are bad) Agree - we can agree that we're both concerned about the memory manager in general - and the zram.sh script is something that's been tuned over quite some time and with experience across multiple archs/distros... I'm more for not aggressively swapping out - the range is zero to 100 - looking at the rk3288-tinker image, it's set to 100, which is very aggressive at swapping pages... keep in mind that the default is usually 60. My thought is that lower values are better in most cases - the value of 10 is reasonable for most - it keeps pressure off the swap partitions, which is important if not running zram, as going to swap on SD/eMMC is going to be a real hit on performance, and even with zram we only want to swap if we really need to, as hitting the zram is going to have a cost in overall performance.
tkaiser Posted September 22, 2018 Author Posted September 22, 2018 10 hours ago, sfx2000 said: the zram.sh script is something that's been tuned for quite some time and experience across multiple archs/distros... Huh? This script is not 'tuned' whatsoever. It basically sets up some zram devices in an outdated way (since recent kernels do not need one zram device per CPU core, this could have even negative effects on big.LITTLE designs and that's why we made all of this configurable in Armbian via /etc/default/armbian-zram-config). vm.swappiness... the 'default' is from 100 years ago when we had neither fast flash storage nor compressed zram block devices. Back then swapping happened on spinning rust! With zram any value lower than 100 makes no sense at all. https://forum.armbian.com/topic/8161-swap-on-sbc/ https://github.com/armbian/build/commit/a23b02d11e14510999de9d50d58c0a192a5d667b
sfx2000 Posted September 23, 2018 Posted September 23, 2018 On 9/22/2018 at 2:07 AM, tkaiser said: vm.swappiness... the 'default' is from 100 years ago when we had neither fast flash storage nor compressed zram block devices. Back then swapping happened on spinning rust! With zram any value lower than 100 makes no sense at all. I think we're going to have to agree to disagree here - and frank discussion is always good... What you have to look at is the tendency to swap, and what that cost actually is - one can end up unmapping pages if not careful, and have a less responsive system - spinning rust, compcache, nvme, etc... swap is still swap.

swap_tendency = mapped_ratio/2 + distress + vm_swappiness

(for the lay folks - the 0-100 value in vm.swappiness is akin to the amount of free memory in use before swapping is initiated - so a value of 60 says that as long as we have 60 percent free memory, we don't swap; if less than that, we start swapping out pages - it's a weighted value) So if you want to spend time thrashing memory, keep it high - higher does keep the caches free, which may or may not be desired depending on the particular workload in play... worst case, if set too high, app responsiveness may suffer... One of the other considerations is that some apps do try to manage their own memory - mysql/mariadb is a good example, where it can really send the memory manager off the deep end if heavily loaded... So it's ok to have different opinions here, and easy enough to test/modify/test again... for those that want to play - it's easy enough to change on the fly...

sudo sysctl -w vm.swappiness=<value> # the range here is 0-100 - 0 is swap disabled
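[Editorial note: one small addition to the command above: sysctl -w only lasts until the next reboot. To make whatever value you settle on persistent, drop it into sysctl configuration as well; the file name below is just an example.]

sudo sysctl -w vm.swappiness=10                                        # effective immediately, lost on reboot
echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-swappiness.conf    # survives reboots
cat /proc/sys/vm/swappiness                                            # verify the active value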
sfx2000 Posted September 23, 2018 Posted September 23, 2018 On 9/4/2018 at 10:51 PM, tkaiser said: That's an interesting point and clearly something I forgot to check. But I was running with latest IRQ assignment settings (USB2 on CPU1 and USB3 on CPU2) so there shouldn't have been a problem with my crippled setup (hiding CPUs 4 and 5). But iostat output above reveals that %iowait with USB3 was much higher compared to USB2 so this is clearly something that needs more investigations. Hint - putting a task to observe changes the behavior, as the task itself takes up time and resources... Even JTAG does this, and I've had more than a few junior engineers learn this the hard way... Back in the days when I was doing Qualcomm MSM work - running the DIAG task on REX changed timing, or running additional debug/tracing in userland - so things that would crash the MSM standalone, wouldn't crash when actually trying to chase the problem and fix it. This was especially true with the first MSM's that did DVFS - the MSM6100 was the first one I ran into... It's a lightweight version of Schrödinger's Cat -- https://en.wikipedia.org/wiki/Schrödinger's_cat I always asked my guys - "did you kill the cat?" on their test results....
sfx2000 Posted September 24, 2018 Posted September 24, 2018 On 9/22/2018 at 2:07 AM, tkaiser said: Huh? This script is not 'tuned' whatsoever. It basically sets up some zram devices in an outdated way (since recent kernels do not need one zram device per CPU core, this could have even negative effects on big.LITTLE designs and that's why we made all of this configurable in Armbian via /etc/default/armbian-zram-config). Actually it does and doesn't - with big.LITTLE, we have ARM GTS on our side which makes things a bit transparent, so one can always do a single zram pool and let the cores sort it out with the appropriate kernel patches from ARM... my little script assumes all cores are the same, so we do take some liberty there with allocations...
sfx2000 Posted September 24, 2018 Posted September 24, 2018 On 9/16/2018 at 10:37 PM, tkaiser said: For anyone else reading this: do NOT do this. Just use Armbian -- we care about zram at the system level and also set vm.swappiness accordingly (low values are bad) Apologies up front - after digging thru the forums, you have a fair investment in your methods and means... fair enough, and much appreciated. Just ask that you keep an open mind on this item - I've got other things to worry about... current tasks are rk3288 clocks and temps, and an ask to look at rk_cypto performance overall... Keep it simple there... many use cases to consider - one can always find a benchmark to prove a case... I've been there, and this isn't the first ARM platform I've worked with - I've done BSP's for imx6, mvedbu, broadcom, and QCA... not my first rodeo here. Just trying to help. 2
Tido Posted September 24, 2018 Posted September 24, 2018 12 hours ago, sfx2000 said: (for the lay folks - the 0-100 value in vm.swappiness is akin to the amount of free memory in use before swapping is initiated - so a value of 60 says that as long as we have 60 percent free memory, we don't swap; if less than that, we start swapping out pages - it's a weighted value) So if you want to spend time thrashing memory, keep it high - higher does keep the caches free, which may or may not be desired depending on the particular workload in play... worst case, if set too high, app responsiveness may suffer... One of the other considerations is that some apps do try to manage their own memory - mysql/mariadb is a good example, where it can really send the memory manager off the deep end if heavily loaded... So, @tkaiser would have to put load (use RAM) on an SBC when doing the benchmarking. And to be frank, you would have to test it with different scenarios of load before you go to a level as high as 100. However, you could create 20 scenarios and would still not catch every situation/combination. That said, I am with you that it is better to have a value lower than 100.
tkaiser Posted September 24, 2018 Author Posted September 24, 2018 1 hour ago, Tido said: @tkaiser would have to put load (use RAM) on an SBC when doing the benchmarking I started with this 'zram on SBC' journey more than 2 years ago, testing with GUI use cases on PineBook, searching for other use cases that require huge amounts of memory, testing with old as well as brand new kernel versions and ending up with huge compile jobs as an example where heavy DRAM overcommitment is possible and zram shows its strengths. Days of work, zero help/contributions by others until recently (see @botfap contribution in the other thread). Now that as an result of this work a new default is set additional time is needed to discuss about feelings and believes? Really impressive... 12 hours ago, sfx2000 said: putting a task to observe changes the behavior, as the task itself takes up time and resources Care to elaborate what I did wrong when always running exactly the same set of 'monitoring' with each test (using a pretty lightweight 'iostat 1800' call which simply queries the kernel's counters and displays some numbers every 30 minutes)? 13 hours ago, sfx2000 said: it's ok to have different opinions here, and easy enough to test/modify/test again... Why should opinions matter if there's no reasoning provided? I'm happy to learn how and what I could test/modify again since when starting with this zram journey and GUI apps I had no way to measure different settings since everything is just 'feeling' (with zram and massive overcommitment you can open 10 more browsers tabs without the system becoming unresponsive which is not news anyway but simply as expected). So I ended up with one huge compile job as worst case test scenario. I'm happy to learn in which situations with zram only a vm.swappiness value higher than 60 results in lower performance or problems. We're talking about Armbian's new defaults: that's zram only without any other swap file mechanism on physical storage active. If users want to add additional swap space they're responsible for tuning their system on their own (and hopefully know about zswap which seems to me the way better alternative in such scenarios) so now it's really just about 'zram only'. I'm not interested in 'everyone will tell you' stories or 'in theory this should happen' but real experiences. See the reason why we switched back to lzo as default also for zram even if everyone on the Internet tells you that would be stupid and lz4 always the better option. 1