tkaiser

zram vs swap


3 hours ago, zador.blood.stained said:
3 hours ago, Tido said:

IIRC I read from TK somewhere that ZRAM is only used on Debian or Ubuntu, why not on both?

Because zram-config package exists only in Ubuntu by default, and for reasons I don't remember (probably version numbering / potential repository priority issues?) we decided to not copy it to our repository to be available for all releases.

But then wouldn't it make sense to keep ZRAM by default as an Ubuntu feature and summarize Thomas' findings together with a little tutorial on how to implement ZRAM on your own for Debian on the tutorial page? As you said once:

Quote

I see it as Ubuntu/Debian + bootloader with customizations + available kernel(s) with customizations + minimal OS (userspace) customizations and some optional tools/scripts like armbian-config.

So if ZRAM is a standard Ubuntu feature but not Debian then let it be a standard Armbian Ubuntu feature too and show our users how they can use ZRAM under Debian if they want it (or a possibility for ZRAM over Armbianconfig). 

 

3 hours ago, zador.blood.stained said:

Recent examples to add to that - there is no purpose in recently added "development" branch if "master" is completely abandoned as a result and suggesting to "switch to beta" to fix any issues defeats the purpose of "beta" - fixes should be immediately pushed to the stable branch

This branch is 242 commits ahead, 42 commits behind master.

The longer we wait, the more problematic a merge will be. I think the dev branch was opened a bit too early, without clear 'rules' about its purpose. Clean it up before it gets more and more problematic?

5 minutes ago, chwe said:

So if ZRAM is a standard Ubuntu feature but not Debian then let it be a standard Armbian Ubuntu feature too and show our users how they can use ZRAM under Debian if they want it (or a possibility for ZRAM over Armbianconfig). 

It's not a feature, it's just an independent package with a dedicated maintainer (in upstream Ubuntu), similar to other packages in Ubuntu that don't exist in standard Debian repositories. 

1 hour ago, chwe said:

So if ZRAM is a standard Ubuntu feature but not Debian then let it be a standard Armbian Ubuntu feature too and show our users how they can use ZRAM under Debian if they want it (or a possibility for ZRAM over Armbianconfig).

 

This has been discussed already in this thread. We could enable this on our own (and make it configurable in a sane way) but the way I prepared it (as part of our armhwinfo script) is not a good idea. So it would need someone to split all the armhwinfo functionality into different parts that can also be configured by the user. Also our current policy with vm.swappiness=0 is something that could be discussed/changed or at least be made configurable by the user in an easy way.

 

But since at least I have not the slightest idea in which direction Armbian is moving and since I have a hard time understanding for what time is wasted and especially since I really hate to waste my own time with stuff I don't like (e.g. trying to repair broken patches or insecure scripting) I simply do not care any more. Same with 'speed' of progress. I started with this zram journey over one year ago, wasted my time analyzing 'armbianmonitor -u' output from various installations, waited another few months to see whether there are complaints from Ubuntu users about zram now being active (I'm not aware of a single one) and would like to see better usage of RAM rather sooner than later. But as it is today I simply don't care any more since all this stuff simply feels like time wasted.


From a user's point of view, all that stuff is useful for people who want to experiment and learn about computing, whether students (I know many kids who are learning a lot with SBCs, using them for robotics projects) or hobbyists. I wouldn't say cooperating with that is a waste of time; I think it is useful for many people. Of course, provided it doesn't interfere with other commitments the developer may have.

 

Getting back on-topic, on the XU4 the compiling took 27m 9s, with the swap numbers I posted above:


$ cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/zram0                              partition       127612  11516   5
/dev/zram1                              partition       127612  11492   5
/dev/zram2                              partition       127612  11684   5
/dev/zram3                              partition       127612  11552   5
/dev/zram4                              partition       127612  11720   5
/dev/zram5                              partition       127612  11448   5
/dev/zram6                              partition       127612  11668   5
/dev/zram7                              partition       127612  11600   5
/mnt/externo/swapfile1                  file            2097148 0       -1
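
To total the per-device numbers in a listing like this, a minimal sketch follows. The function name `zram_swap_totals` is made up for illustration, and the sample lines are simply copied from the output above; on a live system the input would be `/proc/swaps`.

```shell
#!/bin/bash
# Sum size and usage (kB) of all zram swap devices.
# Expects /proc/swaps formatted input on stdin; non-zram lines are ignored.
zram_swap_totals() {
    awk '$1 ~ /^\/dev\/zram/ { size += $3; used += $4; n++ }
         END { printf "%d devices, %d kB total, %d kB used\n", n, size, used }'
}

# Sample input: three lines taken from the listing above.
zram_swap_totals <<'EOF'
Filename                                Type            Size    Used    Priority
/dev/zram0                              partition       127612  11516   5
/dev/zram1                              partition       127612  11492   5
/mnt/externo/swapfile1                  file            2097148 0       -1
EOF
```

On the board itself this would be called as `zram_swap_totals < /proc/swaps`.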

 

I'll post more numbers about the TinkerBoard and 3Gb Khadas Vim2


More numbers:

 

Tinkerboard (I set 8 jobs, though it has 4 cores): 41m 2s


$ cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/zram0                              partition       257440  39396   5
/dev/zram1                              partition       257440  39428   5
/dev/zram2                              partition       257440  39384   5
/dev/zram3                              partition       257440  39388   5

 

 

Khadas VIM2 3Gb: 48m 32s


$ cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/zram0                              partition       172164  3856    5
/dev/zram1                              partition       172164  3884    5
/dev/zram2                              partition       172164  3896    5
/dev/zram3                              partition       172164  3892    5
/dev/zram4                              partition       172164  3848    5
/dev/zram5                              partition       172164  3804    5
/dev/zram6                              partition       172164  3804    5
/dev/zram7                              partition       172164  3820    5

 

 

(Note: To be fair, we must remember that TB and XU4 are compiling 32-bit binaries, while Nanopi Fire3 and VIM2 compile 64-bit).

 

So I would conclude that zram does have a performance impact, but it is not too big: Fire3, having a CPU speed about 16% faster than VIM2 (and probably slower DDR3 RAM but at the same time with a smaller latency), performed about 4% better. @tkaiser: Did I speak wisely?

13 hours ago, JMCC said:

Khadas VIM2 3Gb: 48m 32s

 

Thank you. I just repeated the test while limiting my NanoPi Fire3 to 1200 MHz with zram/lz4 and vm.swappiness=100 (/etc/sysctl.conf): 51m34.139s (and with lzo it was 50m30.884s -- so again with this workload no advantage for lz4 for whatever reasons)

 

But since we know that Vim2 unfortunately relies on an Amlogic SoC with a cheating firmware blob (fake clockspeeds) the only reasonable way to get a real comparison would be you repeating the test twice:

  1. First time with purged zram-config package and commented swap entry in fstab to force the board to do no zram paging at all
  2. Then again this time with the Vim2 limited to 1 GB DRAM ('mem=1G' added to kernel cmdline), setting up vm.swappiness=100 and activating zram with the following modified activate_zram routine in /etc/init.d/armhwinfo (needs to be uncommented of course too):
 

activate_zram() {
	# Do not interfere with already present config-zram package
	dpkg -l | grep -q 'zram-config' && return

	# Load zram module with n instances (normally one per CPU core with 4 as the maximum; hard-coded to 5 in this modified version)
	zram_devices=5
	module_args="$(modinfo zram | awk -F" " '/num_devices/ {print $2}' | cut -f1 -d:)"
	[[ -n ${module_args} ]] && modprobe zram ${module_args}=${zram_devices} || return

	# Use half of the real memory by default --> 1/${ram_divisor}
	ram_divisor=2
	mem_info=$(LC_ALL=C free -w 2>/dev/null | grep "^Mem" || LC_ALL=C free | grep "^Mem")
	memory_total=$(awk '{printf("%d",$2*1024)}' <<<${mem_info})
	mem_per_zram_device=$(( ${memory_total} / ${ram_divisor} ))

	for (( i=0; i<zram_devices; i++ )); do
		[[ -f /sys/block/zram${i}/comp_algorithm ]] && echo lz4 >/sys/block/zram${i}/comp_algorithm 2>/dev/null
		echo -n ${mem_per_zram_device} > /sys/block/zram${i}/disksize
		mkswap /dev/zram${i}
		swapon -p 5 /dev/zram${i}
	done
	echo -e "\n### Activated ${zram_devices} zram swap devices with ${mem_per_zram_device} bytes each\n" >>${Log}
} # activate_zram
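
The sizing arithmetic in this routine can be tried out without touching any zram device. The sketch below only reproduces the calculation (MemTotal in kB, times 1024, divided by `ram_divisor`); the helper name `zram_disksize_bytes` and the sample MemTotal value are assumptions for illustration.

```shell
#!/bin/bash
# Per-device zram disksize in bytes, following the arithmetic of the routine above:
# MemTotal (kB) * 1024 / ram_divisor. Every device gets this size, so the
# combined zram capacity is zram_devices times larger.
zram_disksize_bytes() {
    local mem_total_kb="$1"
    local ram_divisor="${2:-2}"
    echo $(( mem_total_kb * 1024 / ram_divisor ))
}

# Illustrative values: a board that sees roughly 1 GB of DRAM, 5 zram devices.
mem_total_kb=1014192
zram_devices=5
per_device=$(zram_disksize_bytes "${mem_total_kb}")
echo "per device: ${per_device} bytes, combined: $(( per_device * zram_devices )) bytes"
```

Note that with 5 devices of half the RAM each, the combined zram size is 2.5 times the physical memory, which only works as long as the contents compress well enough.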

 

 

Edit: Added lzo numbers above.


Here are the results. Some preliminary notes:

  • This particular test is not too accurate on HMP CPUs, because some big static libraries are compiled at the end, and depending on whether they land on a slow or a fast core, results can vary by a few minutes. That explains why the no-swap time posted here is a little higher than the one I posted before.
  • I wasn't able to make the kernel parameter work with balbes150's image, so I decided to take off the white gloves and do a dirty hack:
stress --vm-bytes 1677721600 --vm-keep -m 1

Which created an initial memory status that can more or less do the job:


$ cat /proc/meminfo
MemTotal:        2754696 kB
MemFree:          814840 kB
MemAvailable:     952124 kB
Buffers:           13088 kB
Cached:           146168 kB
SwapCached:            0 kB
Active:          1744304 kB
Inactive:          84700 kB
Active(anon):    1673024 kB
Inactive(anon):    13312 kB
Active(file):      71280 kB
Inactive(file):    71388 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       5242860 kB
SwapFree:        5242860 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:       1669820 kB
Mapped:            21968 kB
Shmem:             16592 kB
Slab:              38316 kB
SReclaimable:      19492 kB
SUnreclaim:        18824 kB
KernelStack:        2624 kB
PageTables:         4428 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6620208 kB
Committed_AS:    1849200 kB
VmallocTotal:    1048576 kB
VmallocUsed:       72872 kB
VmallocChunk:     966844 kB
TotalCMA:         221184 kB
UsedCMA:            3244 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

After it, I changed vm.swappiness to 100.

 

That being said, these are the numbers:

3Gb RAM,  no swap:  49:14


$ cat /proc/swaps
Filename                                Type            Size    Used    Priority

 

 

Pseudo-1Gb RAM, high swappiness: 77m 34s


$ cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/zram0                              partition       1048572 340600  5
/dev/zram1                              partition       1048572 340652  5
/dev/zram2                              partition       1048572 339484  5
/dev/zram3                              partition       1048572 340616  5
/dev/zram4                              partition       1048572 340264  5


 

 

Edited by JMCC
EDIT: corrected some numbers


On second thought, I realized that "stress --vm-keep" was probably using lots of memory bandwidth by itself, so the second number is of no use. I think the parameter I should have used is "--vm-hang 0". But I already put the VIM2 away, so it'll have to wait.

 

[EDIT: No, that won't work either, because if the memory hog doesn't change, it will get swapped away and the kernel will use the physical RAM for compiling. Any suggestion is welcome]

Edited by JMCC

On 5/20/2018 at 10:35 PM, JMCC said:

On second thought, I realized that "stress --vm-keep" was probably using lots of memory bandwidth by itself, so the second number is of no use

 

Yep, I agree (especially since Fire3 clocked down to 1.2GHz scores around 51m with zram so the 77 minutes seem just wrong).

 

In the meantime I started over with my Fire3 and tested through different values of vm.swappiness and count of active CPU cores (adding e.g. extraargs="maxcpus=4" to /boot/armbianEnv.txt) using this script started from /etc/rc.local. I tested again with lz4 and 2 CPU cores another time since first run results looked bogus:

Timestamp              vm.swappiness cores        algorithm             execution time
Sun May 20 12:45:12 UTC 2018   100    8    lzo [lz4] deflate lz4hc    real    47m53.246s
Sun May 20 13:34:26 UTC 2018    80    8    lzo [lz4] deflate lz4hc    real    48m9.429s
Sun May 20 14:23:55 UTC 2018    60    8    lzo [lz4] deflate lz4hc    real    48m25.700s
Sun May 20 15:13:40 UTC 2018    40    8    lzo [lz4] deflate lz4hc    real    49m40.919s
Sun May 20 16:05:17 UTC 2018   100    4    lzo [lz4] deflate lz4hc    real    86m55.073s
Sun May 20 17:33:34 UTC 2018    80    4    lzo [lz4] deflate lz4hc    real    87m50.534s
Sun May 20 19:02:49 UTC 2018    60    4    lzo [lz4] deflate lz4hc    real    88m43.067s
Sun May 20 20:32:55 UTC 2018    40    4    lzo [lz4] deflate lz4hc    real    98m43.243s
Sun May 20 22:15:55 UTC 2018   100    2    lzo [lz4] deflate lz4hc    real   148m58.772s
Mon May 21 00:46:19 UTC 2018    80    2    lzo [lz4] deflate lz4hc    real   146m58.757s
Mon May 21 03:14:40 UTC 2018    60    2    lzo [lz4] deflate lz4hc    real   147m3.493s
Mon May 21 05:43:08 UTC 2018    40    2    lzo [lz4] deflate lz4hc    real   155m22.952s
Mon May 21 08:20:34 UTC 2018   100    8    [lzo] lz4 deflate lz4hc    real    46m56.667s
Mon May 21 09:08:59 UTC 2018    80    8    [lzo] lz4 deflate lz4hc    real    47m25.969s
Mon May 21 09:57:58 UTC 2018    60    8    [lzo] lz4 deflate lz4hc    real    47m45.961s
Mon May 21 10:47:16 UTC 2018    40    8    [lzo] lz4 deflate lz4hc    real    48m14.999s
Mon May 21 11:41:36 UTC 2018   100    4    [lzo] lz4 deflate lz4hc    real    85m24.440s
Mon May 21 13:08:31 UTC 2018    80    4    [lzo] lz4 deflate lz4hc    real    85m47.343s
Mon May 21 14:35:44 UTC 2018    60    4    [lzo] lz4 deflate lz4hc    real    85m59.063s
Mon May 21 16:03:11 UTC 2018    40    4    [lzo] lz4 deflate lz4hc    real    86m49.615s
Mon May 21 21:53:07 UTC 2018   100    2    [lzo] lz4 deflate lz4hc    real   143m1.995s
Tue May 22 00:17:40 UTC 2018    80    2    [lzo] lz4 deflate lz4hc    real   144m0.501s
Tue May 22 02:43:08 UTC 2018    60    2    [lzo] lz4 deflate lz4hc    real   144m37.204s
Tue May 22 05:09:14 UTC 2018    40    2    [lzo] lz4 deflate lz4hc    real   146m51.361s
Tue May 22 07:56:42 UTC 2018   100    2    lzo [lz4] deflate lz4hc    real   147m15.069s
Tue May 22 10:25:33 UTC 2018    80    2    lzo [lz4] deflate lz4hc    real   147m31.538s
Tue May 22 12:54:31 UTC 2018    60    2    lzo [lz4] deflate lz4hc    real   147m27.517s
Tue May 22 15:23:28 UTC 2018    40    2    lzo [lz4] deflate lz4hc    real   150m54.700s

 

So as expected with zram based swap increasing vm.swappiness to the maximum helps with performance in memory overcommitment situations like this huge compile job (Arm Compute Library, which needs up to 2.6 GB with a 64-bit userland and just 2 GB when doing a 32-bit build). And for whatever reasons, at least with kernel 4.14 and defaults, lz4 does not perform better than lzo. It's quite the opposite: with lzo the jobs finish even faster.
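
To compare rows of the table numerically, the `real` values first need to be turned into seconds. A small sketch, assuming the `XmY.ZZZs` format that bash's `time` prints; the helper name `to_seconds` is hypothetical, and the two values are the 8-core, vm.swappiness=100 rows from the table.

```shell
#!/bin/bash
# Convert a bash `time` value such as 47m53.246s into seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, p, "m")          # p[1] = minutes, p[2] = seconds with trailing "s"
        sub(/s$/, "", p[2])
        printf "%.3f\n", p[1] * 60 + p[2]
    }'
}

# 8 cores, vm.swappiness=100, values copied from the table above.
lz4=$(to_seconds 47m53.246s)
lzo=$(to_seconds 46m56.667s)
awk -v a="${lz4}" -v b="${lzo}" \
    'BEGIN { printf "lz4 %.0f s, lzo %.0f s, lzo faster by %.1f%%\n", a, b, (a - b) / a * 100 }'
```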


Ok, brilliant...

So, just to summarise, if I want my Orange Pi PC2 to use zRam instead of using a static 8GB off of my SDCard and compile the Monero thing (an operation that seems to require about 4GB anyway), how do I configure that, from start to end?

On 5/21/2018 at 11:21 PM, Regis Michel LeClerc said:

I want my Orange Pi PC2 to use zRam instead of using a static 8GB off of my SDCard

 

OPi PC2 has just 1 GB DRAM so trying to use 8 GB zram won't work. The average compression ratio I've seen in all tests so far was between 3:1 and 3.5:1, and zram also needs a small amount of DRAM for itself. So zram using 3 times the available RAM can be considered the maximum and might already fail when memory contents aren't compressible at such a ratio.
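
The compression ratio actually achieved can be read from a device's `mm_stat` file in sysfs, where the first two fields are the original and the compressed data size in bytes. A minimal sketch; the helper name `zram_ratio` is made up and the sample line is invented for illustration, not measured on any board in this thread.

```shell
#!/bin/bash
# Print the compression ratio of a zram device from an mm_stat line on stdin.
# Field 1 is orig_data_size, field 2 is compr_data_size (both in bytes).
zram_ratio() {
    awk '$2 > 0 { printf "%.2f:1\n", $1 / $2 }'
}

# On a live system: zram_ratio < /sys/block/zram0/mm_stat
# Invented sample line, for illustration only:
echo "332000000 100000000 105000000 0 105000000 0 0 0" | zram_ratio
```

If the ratio on a given workload stays well below 3:1, a zram size of 3 times the RAM is too optimistic.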

 

If you look at page 1 of this thread you'll see that using a UAS attached SSD is the way to go in such situations. And maybe switching from zram to zcache when you want to use both DRAM and storage for swapping. Configuring zram and 'disk' as swap at the same time has some caveats.

On 5/21/2018 at 1:43 PM, tkaiser said:

In the meantime I started over with my Fire3 and tested through different values of vm.swappiness and count of active CPU cores (adding e.g. extraargs="maxcpus=4" to /boot/armbianEnv.txt) using this script started from /etc/rc.local.

 

As a comparison now the same task (building ARM's Compute Library on a SBC) on a device where swapping does not occur. The purpose of this test was to check for efficiency of different swapping implementations on a device running low on memory (NanoPi Fire3 with 8 Cortex-A53 cores @ 1.4GHz but just 1 GB DRAM). Results back then when running on all 8 CPU cores (full details):

zram lzo                  46m57.427s
zram lz4                  47m18.022s
SSD via USB2             144m40.780s
SanDisk Ultra A1 16 GB   247m56.744s
HDD via USB2             570m14.406s

I used my RockPro64 with 4 GB DRAM and pinned execution of the compilation to the 4 Cortex-A53 cores running also at 1.4 GHz like the Fire3:

time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native

This is a quick htop check (pinned to an A72 core) confirming that only the 4 A53 cores are busy:

 

[Screenshot, 2018-09-03: htop output]

On NanoPi Fire3 when being limited to 4 CPU cores and with just 1 GB DRAM we got the following execution times (slightly faster with lzo in contrast to 'common knowledge' telling us lz4 would always be the better choice):

Sun May 20 16:05:17 UTC 2018   100    4    lzo [lz4] deflate lz4hc    real    86m55.073s
Mon May 21 11:41:36 UTC 2018   100    4    [lzo] lz4 deflate lz4hc    real    85m24.440s

Now on RockPro64 without any swapping happening we get 73m27.934s. So given the test has been executed appropriately we're talking about a performance impact of below 20% when swapping to a compressed block device with a quad-core A53 @ 1.4 GHz (5125 seconds with lzo zram on NanoPi Fire3 vs. 4408 seconds without any swapping at all on RockPro64 --> 16% performance decrease). I looked at the free output and the maximum I observed was 2.6GB RAM used:

root@rockpro64:/home/rock64# free
              total        used        free      shared  buff/cache   available
Mem:        3969104     2666692      730212        8468      572200     1264080
Swap:             0           0           0

'Used' DRAM over the whole benchmark execution was almost always well above 1 GB and often in the 2 GB region.
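
The percentage quoted above is the extra run time relative to the run without swapping. As a sketch of that calculation (the helper name `slowdown_pct` is an assumption, the two inputs are the second counts from the post):

```shell
#!/bin/bash
# Relative slowdown in percent of a run against a baseline, both in seconds.
slowdown_pct() {
    awk -v base="$1" -v run="$2" 'BEGIN { printf "%.1f\n", (run - base) / base * 100 }'
}

# RockPro64 without swap (4408 s) vs. NanoPi Fire3 with zram/lzo (5125 s)
slowdown_pct 4408 5125
```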

 

14 hours ago, tkaiser said:

Now on RockPro64 without any swapping happened we get 73m27.934s. So given the test has been executed appropriately we're talking about ... 16% performance decrease

 

Since I was not entirely sure whether 'test has been executed appropriately' I went a bit further to test no swap vs. zram on a RK3399 device directly. I had to move from RockPro64 to NanoPC-T4 since with ayufan OS image on RockPro64 I didn't manage to restrict available DRAM in extlinux.conf

 

So I did my test with Armbian on a NanoPC-T4. One time I let the build job run with 4 GB DRAM available and no swapping, next time I limited available physical memory to 1 GB via extraargs="mem=1110M" in /boot/armbianEnv.txt and swapping happened with lz4 compression.

 

We're talking about a roughly 13% difference in performance: 4302 seconds without swapping vs. 4855 seconds with zram/lz4:

tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
...
real	71m42.193s
user	277m55.787s
sys	8m7.028s

tk@nanopct4:~/ComputeLibrary-18.03$ free
              total        used        free      shared  buff/cache   available
Mem:        3902736      105600     3132652        8456      664484     3698568
Swap:       6291440           0     6291440

And now with zram/lz4:

tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
...
real	80m55.042s
user	293m12.371s
sys	27m48.478s

tk@nanopct4:~/ComputeLibrary-18.03$ free
              total        used        free      shared  buff/cache   available
Mem:        1014192       85372      850404        3684       78416      853944
Swap:       3042560       27608     3014952

 

Problem is: this test is not that representative for real-world workloads since I artificially limited the build job to CPUs 0-3 (little cores) and therefore all the memory compression stuff happened on the two free A72 cores. So next test: trying to disable the two big cores in RK3399 entirely. For whatever reasons setting extraargs="mem=1110M maxcpus=4" in /boot/armbianEnv.txt didn't work (obviously a problem with boot.cmd used for the board) so I ended up with:

extraargs="mem=1110M"
extraboardargs="maxcpus=4"

After a reboot /proc/cpuinfo confirms that only little cores are available any more and we're running with just 1 GB DRAM. Only caveat: cpufreq scaling is also gone and now the little cores are clocked with ~806 MHz:

root@nanopct4:~# /usr/local/src/mhz/mhz 3 100000
count=330570 us50=20515 us250=102670 diff=82155 cpu_MHz=804.747
count=330570 us50=20540 us250=102614 diff=82074 cpu_MHz=805.541
count=330570 us50=20542 us250=102645 diff=82103 cpu_MHz=805.257

So then this test will answer a different question: how much overhead does zram based swapping add on much slower boards? That's ok too :)
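
Whether parameters like `mem=` and `maxcpus=` really reached the kernel can be checked against the kernel command line. A minimal sketch; the helper name `cmdline_value` and the sample command line are hypothetical, and values that themselves contain `=` are cut at the second `=`.

```shell
#!/bin/bash
# Extract the value of a key=value kernel parameter.
# Uses the string given as second argument, or /proc/cmdline if none is given.
cmdline_value() {
    local key="$1"
    local cmdline="${2:-$(cat /proc/cmdline)}"
    printf '%s\n' "${cmdline}" | tr ' ' '\n' | awk -F= -v k="${key}" '$1 == k { print $2 }'
}

# Hypothetical command line, for illustration only.
sample="console=ttyS2,1500000 rootwait mem=1110M maxcpus=4"
echo "mem=$(cmdline_value mem "${sample}") maxcpus=$(cmdline_value maxcpus "${sample}")"
```

On the board, `cmdline_value maxcpus` without a second argument reads `/proc/cmdline`; an empty result means the parameter was not passed through by the boot script.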

 

To be continued...


Now tests with the RK3399 crippled down to a quad-core A53 running at 800 MHz done. One time with 4 GB DRAM w/o swapping and the other time again with zram/lz4 and just 1 GB DRAM assigned to provoke swapping:

 

Without swapping:

tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
...
real	99m39.537s
user	385m51.276s
sys	11m2.063s

tk@nanopct4:~/ComputeLibrary-18.03$ free
              total        used        free      shared  buff/cache   available
Mem:        3902736      102648     3124104       13336      675984     3696640
Swap:       6291440           0     6291440

Vs. zram/lz4:

tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
...
real	130m3.264s
user	403m18.539s
sys	39m7.080s

tk@nanopct4:~/ComputeLibrary-18.03$ free
              total        used        free      shared  buff/cache   available
Mem:        1014192       82940      858740        3416       72512      859468
Swap:       3042560       27948     3014612

This is a 30% performance drop. Still great given that I crippled the RK3399 to a quad-core A53 running at just 800 MHz. Funnily lzo again outperforms lz4:

real	123m47.246s
user	401m20.097s
sys	35m14.423s

As a comparison: swap with probably the fastest way possible on all common SBC (except those RK3399 boards that can interact with NVMe SSDs). Now I test with a USB3 connected EVO840 SSD (I created a swapfile on an ext4 FS on the SSD and deactivated zram based swap entirely):

tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
...
real	155m7.422s
user	403m34.509s
sys	67m11.278s

tk@nanopct4:~/ComputeLibrary-18.03$ free
              total        used        free      shared  buff/cache   available
Mem:        1014192       66336      810212        4244      137644      869692
Swap:       3071996       26728     3045268

tk@nanopct4:~/ComputeLibrary-18.03$ /sbin/swapon
NAME                 TYPE SIZE USED PRIO
/mnt/evo840/swapfile file   3G  26M   -1

With ultra fast swap on SSD execution time further increases by 25 minutes so clearly zram is the winner. I also let 'iostat 1800' run in parallel to get a clue how much data has been transferred between board and SSD (at the blockdevice layer -- below at the flash layer amount of writes could have been significantly higher):

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             965.11      3386.99      7345.81    6096576   13222460
sda            1807.44      4788.42      5927.86    8619208   10670216
sda            2868.95      7041.86      7431.29   12675496   13376468
sda            1792.79      4770.62      4828.07    8587116    8690528
sda            2984.65      7850.61      9276.85   14131184   16698424

I stopped a bit too early but what these numbers tell is that this compile job swapping on SSD resulted in roughly 60 GB of writes and 48 GB of reads to/from flash storage. Now imagine running this on a crappy SD card. Would take ages and maybe the card will die in between :)
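
The totals can be reproduced by summing the kB_read and kB_wrtn columns of the iostat samples. A sketch under the assumption that the device lines have the six-column layout shown above; the helper name `iostat_totals` is made up, and the input is the five samples from the USB3 run.

```shell
#!/bin/bash
# Sum the kB_read and kB_wrtn columns (fields 5 and 6) of iostat device lines
# read from stdin and print the totals in GiB.
iostat_totals() {
    awk -v dev="${1:-sda}" '$1 == dev { r += $5; w += $6 }
         END { printf "read %.1f GiB, written %.1f GiB\n", r / 1048576, w / 1048576 }'
}

# The five 1800-second samples from the USB3 run above.
iostat_totals sda <<'EOF'
sda             965.11      3386.99      7345.81    6096576   13222460
sda            1807.44      4788.42      5927.86    8619208   10670216
sda            2868.95      7041.86      7431.29   12675496   13376468
sda            1792.79      4770.62      4828.07    8587116    8690528
sda            2984.65      7850.61      9276.85   14131184   16698424
EOF
```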

 

@Igor: IMO we can switch to new behaviour. We need to take care about two things when upgrading/replacing packages:

apt purge zram-config
grep -q vm.swappiness /etc/sysctl.conf
case $? in
	0)
		sed -i 's/vm\.swappiness.*/vm.swappiness=100/' /etc/sysctl.conf
		;;
	*)
		echo vm.swappiness=100 >>/etc/sysctl.conf
		;;
esac
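
One caveat with the snippet above: a commented-out line such as `#vm.swappiness=60` also matches the grep, and sed then rewrites it while leaving the `#` in place, so the setting stays inactive. A possible variant that only touches active entries is sketched below; the function name `set_swappiness` is hypothetical, `sed -i` assumes GNU sed, and the demo runs against a scratch file rather than the real /etc/sysctl.conf.

```shell
#!/bin/bash
# Set vm.swappiness in a sysctl-style file: replace an active entry if one
# exists, otherwise append a new line. Commented-out entries are left alone.
set_swappiness() {
    local value="$1"
    local file="${2:-/etc/sysctl.conf}"
    if grep -q '^[[:space:]]*vm\.swappiness' "${file}" 2>/dev/null; then
        sed -i "s/^[[:space:]]*vm\.swappiness.*/vm.swappiness=${value}/" "${file}"
    else
        echo "vm.swappiness=${value}" >>"${file}"
    fi
}

# Dry run against a scratch file.
scratch=$(mktemp)
echo "vm.swappiness = 0" >"${scratch}"
set_swappiness 100 "${scratch}"
cat "${scratch}"
rm -f "${scratch}"
```

On a real system the change would still need `sysctl -p` or a reboot to take effect.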

 

3 hours ago, tkaiser said:

real 155m7.422s

 

This was 'swap with SSD connected to USB3 port'. Now a final number. I was curious how long the whole build orgy will take if I use the same UAS attached EVO840 SSD and connect it to a USB2 port. Before and after (lsusb -t):

/:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
    |__ Port 1: Dev 3, If 0, Class=Mass Storage, Driver=uas, 5000M

/:  Bus 05.Port 1: Dev 1, Class=root_hub, Driver=ehci-platform/1p, 480M
    |__ Port 1: Dev 3, If 0, Class=Mass Storage, Driver=uas, 480M

The SSD is now connected via Hi-Speed but still UAS is usable. Here the (somewhat surprising) results:

tk@nanopct4:~/ComputeLibrary-18.03$ time taskset -c 0-3 scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
...
real	145m37.703s
user	410m38.084s
sys	66m56.026s

tk@nanopct4:~/ComputeLibrary-18.03$ free
              total        used        free      shared  buff/cache   available
Mem:        1014192       67468      758332        3312      188392      869388
Swap:       3071996       31864     3040132

That's almost 10 minutes faster compared to USB3 above. Another surprising result is the amount of data written to the SSD: this time only 49.5 GB:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             905.22      3309.40      6821.28    5956960   12278368
sda            1819.48      4871.02      5809.35    8767832   10456832
sda            2505.42      6131.65      6467.18   11036972   11640928
sda            1896.49      5149.54      4429.97    9269216    7973988
sda            1854.91      3911.03      5293.68    7039848    9528616

And this time I also queried the SSD via SMART before and after about 'Total_LBAs_Written' (one LBA is 512 bytes with Samsung SSDs):

241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       16901233973
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       17004991437

Same 49.5 GB number so unfortunately my EVO840 doesn't expose amount of data written at the flash layer but just at the block device layer.
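
The conversion from the two SMART readings to the amount of data written is a simple difference times the LBA size. A sketch, with the helper name `lba_delta_gib` being an assumption and 512 bytes per LBA taken from the statement above:

```shell
#!/bin/bash
# Data written between two Total_LBAs_Written readings, in GiB.
# The LBA size defaults to 512 bytes.
lba_delta_gib() {
    local before="$1"
    local after="$2"
    local lba_bytes="${3:-512}"
    awk -v d="$(( (after - before) * lba_bytes ))" 'BEGIN { printf "%.1f\n", d / 1073741824 }'
}

# Readings before and after the USB2 run
lba_delta_gib 16901233973 17004991437
```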

 

Well, result is surprising (a storage relevant task performing faster with same SSD connected to USB2 compared to USB3) but most probably I did something wrong. No idea and no time any further. I checked my bash history but I repeated the test as I did all the time before and also iozone results look as expected:

   39  cd ../
   40  rm -rf ComputeLibrary-18.03/
   41  tar xvf v18.03.tar.gz
   42  lsusb -t
   43  cd ComputeLibrary-18.03/
   44  grep -r lala *
   45  time scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native

EVO840 / USB3                                                 random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       4    16524    20726    19170    19235    19309    20479
          102400      16    53314    64717    65279    66016    64425    65024
          102400     512   255997   275974   254497   255720   255696   274090
          102400    1024   294096   303209   290610   292860   288668   299653
          102400   16384   349175   352628   350241   353221   353234   350942
         1024000   16384   355773   362711   354363   354632   354731   362887

EVO840 / USB2                                                 random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       4     5570     7967     8156     7957     8156     7971
          102400      16    19057    19137    21165    21108    20993    19130
          102400     512    32625    32660    32586    32704    32696    32642
          102400    1024    33121    33179    33506    33467    33573    33226
          102400   16384    33925    33953    35436    35500    34695    33923
         1024000   16384    34120    34193    34927    34935    34933    34169

 


Have you got a tool to check the latency to compare USB2 and USB3? Or CPU usage when doing the same workload?

My understanding of the difference between USB2 and 3 is USB2 is polled, while USB3 is interrupt driven.

 

Assuming you haven't done something wrong and your numbers are an accurate representation, maybe at the hardware level USB3 requires more resources, all the interrupts could be causing excessive context switching. Or the drivers aren't as optimised yet.

It would be interesting to compare different hardware USB3 implementations.

7 hours ago, chrisf said:

Have you got a tool to check the latency to compare USB2 and USB3? Or CPU usage when doing the same workload?

 

I had the SSH session window still open and collected the relevant logging portions from 'iostat 1800' while running the test with USB3, USB2 and then again zram/lzo (which also surprisingly again outperformed lz4):

USB3:     %user   %nice %system %iowait  %steal   %idle
          82.31    0.00   12.56    4.68    0.00    0.45
          74.77    0.00   16.80    8.25    0.00    0.18
          55.24    0.00   19.84   24.44    0.00    0.48
          72.22    0.00   16.94   10.43    0.00    0.41
          50.96    0.00   22.24   26.09    0.00    0.71

USB2:     %user   %nice %system %iowait  %steal   %idle
          81.77    0.00   11.95    5.30    0.00    0.99
          75.99    0.00   16.95    6.71    0.00    0.35
          66.50    0.00   19.19   13.81    0.00    0.49
          77.64    0.00   18.31    3.97    0.00    0.08
          44.17    0.00   12.99   13.09    0.00   29.74

zram/lzo: %user   %nice %system %iowait  %steal   %idle
          84.83    0.00   14.68    0.01    0.00    0.48
          82.94    0.00   17.06    0.00    0.00    0.00
          81.51    0.00   18.49    0.00    0.00    0.00
          78.33    0.00   21.66    0.00    0.00    0.01

 

7 hours ago, chrisf said:

maybe at the hardware level USB3 requires more resources, all the interrupts could be causing excessive context switching

 

That's an interesting point and clearly something I forgot to check. But I was running with the latest IRQ assignment settings (USB2 on CPU1 and USB3 on CPU2) so there shouldn't have been a problem with my crippled setup (hiding CPUs 4 and 5). But the iostat output above reveals that %iowait with USB3 was much higher than with USB2, so this is clearly something that needs more investigation.
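To put a number on that %iowait difference, the samples above can be averaged per run. This is only a sketch: the helper name `mean_column` is made up for illustration, and the sample rows are copied from the iostat output quoted in this post.

```shell
#!/bin/bash
# Average one column of iostat-style CPU samples read from stdin.
# Column layout: %user %nice %system %iowait %steal %idle (so %iowait is column 4).
mean_column() {
  local col="${1:-4}"
  awk -v c="$col" 'NF >= c { sum += $c; n++ } END { if (n) printf "%.2f\n", sum / n }'
}

# Samples copied from the USB3 and USB2 runs above
usb3_samples='82.31 0.00 12.56 4.68 0.00 0.45
74.77 0.00 16.80 8.25 0.00 0.18
55.24 0.00 19.84 24.44 0.00 0.48
72.22 0.00 16.94 10.43 0.00 0.41
50.96 0.00 22.24 26.09 0.00 0.71'

usb2_samples='81.77 0.00 11.95 5.30 0.00 0.99
75.99 0.00 16.95 6.71 0.00 0.35
66.50 0.00 19.19 13.81 0.00 0.49
77.64 0.00 18.31 3.97 0.00 0.08
44.17 0.00 12.99 13.09 0.00 29.74'

echo "USB3 mean %iowait: $(printf '%s\n' "$usb3_samples" | mean_column 4)"
echo "USB2 mean %iowait: $(printf '%s\n' "$usb2_samples" | mean_column 4)"
```

With these samples the mean %iowait comes out at 14.78 for USB3 and 8.58 for USB2. Each sample already covers a 30 minute interval, so a plain mean is a rough summary rather than a precise measurement.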


oh hai! noob here - but zram is interesting...

 

zram-config is always good, as it kinda sorts things out, but for distros where it's not available as a package, here's a simple short shell script (cribbed from somewhere else, I forget where)...

 

Anyways - just needs to ensure that zram is enabled in the kernel config.

 

zram.sh - put this in /usr/bin/zram.sh and make it executable, then call it from /etc/rc.local. Also add vm.swappiness = 10 to /etc/sysctl.conf to keep pressure off swap unless it's needed.

Spoiler

 


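#!/bin/bash
# Create one zram swap device per CPU core, together sized to total RAM
cores=$(nproc --all)
modprobe zram num_devices="$cores"

# Turn off any swap that is currently active
swapoff -a

# Total RAM in KiB, split evenly across the cores and converted to bytes
totalmem=$(free | awk '/^Mem:/ {print $2}')
mem=$(( (totalmem / cores) * 1024 ))

core=0
while [ "$core" -lt "$cores" ]; do
  echo "$mem" > "/sys/block/zram$core/disksize"
  mkswap "/dev/zram$core"
  swapon -p 5 "/dev/zram$core"
  core=$((core + 1))
done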

The memory manager sorts things out from here, and this is a good option for small-memory devices.

 

 

Edited by Tido
added spoiler | see message below not recommended to use

2 hours ago, sfx2000 said:

zram.sh - put this over in /usr/bin/zram.sh and make it executable

 

For anyone else reading this: do NOT do this. Just use Armbian -- we care about zram at the system level and also set vm.swappiness accordingly (low values are bad)


There is a problem on kernel changes, more precisely whenever the initrd is regenerated.

update-initramfs: Generating /boot/initrd.img-4.18.8-odroidc2
I: The initramfs will attempt to resume from /dev/zram4
I: (UUID=368b4521-07d1-43df-803d-159c60c5c833)
I: Set the RESUME variable to override this.
update-initramfs: Converting to u-boot format

This leads to boot delay:

Spoiler

Starting kernel ...

Loading, please wait...
starting version 232
Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... Scanning for Btrfs filesystems
Begin: Waiting for suspend/resume device ... Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
done.
Gave up waiting for suspend/resume device
done.
Begin: Will now check root file system ... fsck from util-linux 2.29.2
[/sbin/fsck.ext4 (1) -- /dev/mmcblk1p1] fsck.ext4 -a -C0 /dev/mmcblk1p1 
/dev/mmcblk1p1: clean, 77231/481440 files, 600696/1900644 blocks
done.
done.
Begin: Running /scripts/local-bottom ... done.
Begin: Running /scripts/init-bottom ... done.

Welcome to Debian GNU/Linux 9 (stretch)!

 

Ideas on how to fix it best?


@Igor

 

@tkaiser beat me to the punch on the initramfs 'glitch'... but it's an easy fix

 

edit (if the file isn't there, create it)

 

    /etc/initramfs-tools/conf.d/resume

 

add/modify the RESUME line there - set it to none, or point it at a location other than zram

 

    RESUME=none

 

then refresh the initramfs

 

    update-initramfs -u -k all
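The three steps above can be wrapped in a small helper. This is a minimal sketch: the function name `set_resume_none` is made up here, and it assumes GNU sed (for `-i`) as shipped with Debian/Ubuntu.

```shell
#!/bin/bash
# Write RESUME=none into an initramfs-tools resume config file.
# The path defaults to the file named in the steps above.
set_resume_none() {
  local conf="${1:-/etc/initramfs-tools/conf.d/resume}"
  mkdir -p "$(dirname "$conf")"
  if [ -f "$conf" ] && grep -q '^RESUME=' "$conf"; then
    # Replace an existing RESUME= line in place
    sed -i 's|^RESUME=.*|RESUME=none|' "$conf"
  else
    # Create the file, or append the setting if no RESUME= line exists yet
    echo 'RESUME=none' >> "$conf"
  fi
}

# Typical use as root, followed by regenerating the initramfs:
#   set_resume_none
#   update-initramfs -u -k all
```

Passing a different path as the first argument makes it possible to try the helper on a scratch file before touching /etc.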

 

 

On 9/16/2018 at 10:37 PM, tkaiser said:

we care about zram at the system level and also set vm.swappiness accordingly (low values are bad)

 

Agreed - I think we can say that we're both concerned about the memory manager in general - and the zram.sh script is something that's been tuned for quite some time and experience across multiple archs/distros...

 

I'm more for not aggressively swapping out - the range is Zero to 100 - looking at the rk3288-tinker image, it's set to 100, which is very aggressive at swapping pages... keep in mind that the default is usually 60

 

My thought is that lower values are better in most cases - the value of 10 is reasonable for most - it keeps pressure off the swap partitions, which is important if not running zram, as going to swap on SD/eMMC is going to be a real hit on performance, and even with zram, we only want to swap if we really need to, as hitting the zram is going to have a cost in overall performance.

10 hours ago, sfx2000 said:

the zram.sh script is something that's been tuned for quite some time and experience across multiple archs/distros...

 

Huh? This script is not 'tuned' whatsoever. It basically sets up some zram devices in an outdated way (since recent kernels do not need one zram device per CPU core, this could have even negative effects on big.LITTLE designs and that's why we made all of this configurable in Armbian via /etc/default/armbian-zram-config).
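For comparison, a single-device setup could look roughly like the sketch below. It is not Armbian's actual implementation: the function names and the 50 percent share are illustrative assumptions, and lzo is used only because it is the algorithm discussed in this thread.

```shell
#!/bin/bash
# Size in bytes of a zram device covering <percent> of <total_kib> KiB of RAM.
zram_size_bytes() {
  local total_kib="$1" percent="$2"
  echo $(( total_kib * 1024 * percent / 100 ))
}

# Set up ONE zram swap device (run as root).
# The 50 percent share is an illustrative choice, not a recommendation.
setup_single_zram() {
  local total_kib size
  total_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  size=$(zram_size_bytes "$total_kib" 50)
  modprobe zram num_devices=1
  echo lzo > /sys/block/zram0/comp_algorithm   # has to be set before disksize
  echo "$size" > /sys/block/zram0/disksize
  mkswap /dev/zram0
  swapon -p 5 /dev/zram0
}
```

Only `zram_size_bytes` is safe to call as a normal user; `setup_single_zram` changes the running system and is defined here but not invoked.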

 

vm.swappiness... the 'default' is from 100 years ago when we had neither fast flash storage nor compressed zram block devices. Back then swapping happened on spinning rust! With zram any value lower than 100 makes no sense at all.

On 9/22/2018 at 2:07 AM, tkaiser said:

vm.swappiness... the 'default' is from 100 years ago when we had neither fast flash storage nor compressed zram block devices. Back then swapping happened on spinning rust! With zram any value lower than 100 makes no sense at all.

 

I think we're going to have to agree to disagree here - and frank discussion is always good...

 

What you have to look at is the tendency to swap, and what that cost actually is - one can end up unmapping pages if not careful, and have a less responsive system - spinning rust, compcache, nvme, etc... swap is still swap.

 

swap_tendency = mapped_ratio/2 + distress + vm_swappiness
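As a rough worked example of that formula: it comes from older (2.6-era) kernels, where mapped pages became candidates for reclaim once swap_tendency reached 100. Later kernels balance reclaim differently, so treat this purely as arithmetic on the formula above. The input values are made up for illustration.

```shell
#!/bin/bash
# swap_tendency = mapped_ratio/2 + distress + vm_swappiness (integer arithmetic)
swap_tendency() {
  local mapped_ratio="$1" distress="$2" swappiness="$3"
  echo $(( mapped_ratio / 2 + distress + swappiness ))
}

# 40% of memory mapped, no reclaim distress:
swap_tendency 40 0 60    # prints 80
swap_tendency 40 0 100   # prints 120
```

With the same mapped ratio and no distress, a swappiness of 60 stays below the threshold of 100 while a swappiness of 100 is above it from the start.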

 

(for the lay folks - the 0-100 value in vm.swappiness is akin to the amount free memory in use before swapping is initiated - so a value of 60 says that as long as we have free memory of 60 percent, we don't swap, if less than that, we start swapping out pages - it's a weighted value)

 

So if you want to spend time thrashing memory, keep it high - higher does keep the caches free, which may or may not be desired depending on the particular workload in play... worst case if set too high, app responsiveness may suffer...

 

One other consideration is that some apps try to manage their own memory - mysql/mariadb is a good example, where it can really send the memory manager off the deep end if heavily loaded...

 

So it's ok to have different opinions here, and easy enough to test/modify/test again...

 

for those that want to play - it's easy enough to change on the fly....

 

 

sudo sysctl -w vm.swappiness=<value> # the range here is 0-100 - 0 avoids swapping for as long as possible, but does not disable swap
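To check the current value and make a change survive a reboot, something like the following sketch can be used. The helper name `swappiness_conf_line` and the file name under /etc/sysctl.d are hypothetical, and the 0-100 range is the one quoted in this thread.

```shell
#!/bin/bash
# Print a sysctl config line for vm.swappiness after validating the value.
# Assumes the 0-100 range quoted in this thread.
swappiness_conf_line() {
  local value="$1"
  case "$value" in
    ''|*[!0-9]*) echo "not a non-negative integer: $value" >&2; return 1 ;;
  esac
  if [ "$value" -gt 100 ]; then
    echo "out of range (0-100): $value" >&2
    return 1
  fi
  echo "vm.swappiness = $value"
}

# Show the value currently in effect:
#   cat /proc/sys/vm/swappiness
# Persist a value across reboots (as root; the file name is only an example):
#   swappiness_conf_line 100 > /etc/sysctl.d/99-swappiness.conf
#   sysctl --system
```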

On 9/4/2018 at 10:51 PM, tkaiser said:

That's an interesting point and clearly something I forgot to check. But I was running with the latest IRQ assignment settings (USB2 on CPU1 and USB3 on CPU2) so there shouldn't have been a problem with my crippled setup (hiding CPUs 4 and 5). But the iostat output above reveals that %iowait with USB3 was much higher than with USB2, so this is clearly something that needs more investigation.

 

Hint - putting a task to observe changes the behavior, as the task itself takes up time and resources... Even JTAG does this, and I've had more than a few junior engineers learn this the hard way... 

 

Back in the days when I was doing Qualcomm MSM work - running the DIAG task on REX changed timing, or running additional debug/tracing in userland - so things that would crash the MSM standalone, wouldn't crash when actually trying to chase the problem and fix it. This was especially true with the first MSM's that did DVFS - the MSM6100 was the first one I ran into...

 

It's a lightweight version of Schrödinger's Cat -- https://en.wikipedia.org/wiki/Schrödinger's_cat

 

I always asked my guys - "did you kill the cat?" on their test results....

On 9/22/2018 at 2:07 AM, tkaiser said:

Huh? This script is not 'tuned' whatsoever. It basically sets up some zram devices in an outdated way (since recent kernels do not need one zram device per CPU core, this could have even negative effects on big.LITTLE designs and that's why we made all of this configurable in Armbian via /etc/default/armbian-zram-config).

 

Actually it does and doesn't - with big.LITTLE, we have ARM GTS on our side which makes things a bit transparent, so one can always do a single zram pool and let the cores sort it out with the appropriate kernel patches from ARM...

 

my little script assumes all cores are the same, so we do take some liberty there with allocations...

On 9/16/2018 at 10:37 PM, tkaiser said:

For anyone else reading this: do NOT do this. Just use Armbian -- we care about zram at the system level and also set vm.swappiness accordingly (low values are bad)

 

Apologies up front - after digging thru the forums, you have a fair investment in your methods and means...  fair enough, and much appreciated.

 

Just ask that you keep an open mind on this item - I've got other things to worry about...

 

current tasks are rk3288 clocks and temps, and an ask to look at rk_crypto performance overall...

 

Keep it simple there... many use cases to consider - one can always find a benchmark to prove a case...

 

I've been there, and this isn't the first ARM platform I've worked with - I've done BSPs for imx6, mvebu, broadcom, and QCA... not my first rodeo here.

 

Just trying to help.

12 hours ago, sfx2000 said:

(for the lay folks - the 0-100 value in vm.swappiness is akin to the amount free memory in use before swapping is initiated - so a value of 60 says that as long as we have free memory of 60 percent, we don't swap, if less than that, we start swapping out pages - it's a weighted value)

 

So if you want to spend time thrashing memory, keep it high - higher does keep the caches free, which may or may not be desired depending on the particular workload in play... worst case if set too high, app responsiveness may suffer...

 

One other consideration is that some apps try to manage their own memory - mysql/mariadb is a good example, where it can really send the memory manager off the deep end if heavily loaded...

So, @tkaiser would have to put load (use RAM) on an SBC when doing the benchmarking. And to be frank, you would have to test it with different load scenarios before going to a level as high as 100.

However, you could create 20 scenarios and would still not catch every situation/combination. That said, I am with you that it is better to have a value lower than 100.

 

1 hour ago, Tido said:

@tkaiser would have to put load (use RAM) on an SBC when doing the benchmarking

 

 

I started with this 'zram on SBC' journey more than 2 years ago, testing with GUI use cases on PineBook, searching for other use cases that require huge amounts of memory, testing with old as well as brand new kernel versions and ending up with huge compile jobs as an example where heavy DRAM overcommitment is possible and zram shows its strengths. Days of work, zero help/contributions by others until recently (see @botfap contribution in the other thread). Now that a new default has been set as a result of this work, additional time is needed to discuss feelings and beliefs? Really impressive...

 

12 hours ago, sfx2000 said:

putting a task to observe changes the behavior, as the task itself takes up time and resources

 

Care to elaborate what I did wrong when always running exactly the same set of 'monitoring' with each test (using a pretty lightweight 'iostat 1800' call which simply queries the kernel's counters and displays some numbers every 30 minutes)?
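To illustrate how little work reading those counters involves: the CPU percentages can be derived from two snapshots of the aggregate "cpu" line in /proc/stat. The sketch below is not iostat's actual code, and the function name `iowait_percent` is made up here.

```shell
#!/bin/bash
# Percentage of CPU time spent in iowait between two "cpu" lines from /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal ...
iowait_percent() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    { total[NR] = 0; for (i = 2; i <= 9; i++) total[NR] += $i; iow[NR] = $6 }
    END {
      dt = total[2] - total[1]
      if (dt > 0) printf "%.2f\n", 100 * (iow[2] - iow[1]) / dt
    }'
}

# Live use: take a snapshot, wait, take another one.
#   a=$(head -n 1 /proc/stat); sleep 1800; b=$(head -n 1 /proc/stat)
#   iowait_percent "$a" "$b"
```

Two reads of one small file per 30 minute interval is all such a monitor needs, so the same monitoring is applied with negligible cost to every test run.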

 

13 hours ago, sfx2000 said:

it's ok to have different opinions here, and easy enough to test/modify/test again...

 

Why should opinions matter if there's no reasoning provided? I'm happy to learn how and what I could test/modify again since when starting with this zram journey and GUI apps I had no way to measure different settings since everything is just 'feeling' (with zram and massive overcommitment you can open 10 more browser tabs without the system becoming unresponsive, which is not news anyway but simply as expected). So I ended up with one huge compile job as worst case test scenario.

 

I'm happy to learn in which situations with zram only a vm.swappiness value higher than 60 results in lower performance or problems. We're talking about Armbian's new defaults: that's zram only without any other swap file mechanism on physical storage active. If users want to add additional swap space they're responsible for tuning their system on their own (and hopefully know about zswap which seems to me the way better alternative in such scenarios) so now it's really just about 'zram only'.

 

I'm not interested in 'everyone will tell you' stories or 'in theory this should happen' but real experiences. See the reason why we switched back to lzo as default also for zram even if everyone on the Internet tells you that would be stupid and lz4 always the better option.

