

Posted

This is some more research based on prior efforts.

 

The goal is to make more efficient use of the available RAM. If the system runs low on memory only two options are possible: either the kernel invokes the oom-killer (oom --> out of memory) to kill tasks and free memory, or it starts to swap.

 

Swap is a problem if it happens on slow media, and 'slow' media is the usual situation on SBCs. 'Average' SD cards (not A1 rated) are slow as hell when it comes to random IO performance. So swapping is usually something that should be avoided. But... technology improves over time.

 

In Linux we're able to swap not only to physical storage but, for a few years now, also to compressed memory. If you want the details simply do a web search for zram or check Wikipedia first.
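For reference, a minimal zram swap setup boils down to a handful of commands (device name and size below are just examples; Armbian's zram service does this automatically at boot):

# load the zram module with a single device
modprobe zram num_devices=1

# size the compressed swap device, e.g. half of the physical RAM
echo 2G > /sys/block/zram0/disksize

# create swap on it and activate it with higher priority than any disk based swap
mkswap /dev/zram0
swapon -p 5 /dev/zram0

# verify
swapon --show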

 

Test setup is a NanoPC-T4 equipped with 4 GB RAM (RK3399 based, so a big.LITTLE design with 2 x A72 and 4 x A53). I crippled the board down to a quad-core A53 running at 800 MHz, where I can easily switch between 4 GB RAM and lower amounts: adding 'mem=1110M maxcpus=4' to the kernel cmdline results in the A72 cores being inactive, the kernel using only 1 GB DRAM and, for whatever reason, cpufreq scaling not working, so the RK3399 is statically clocked at 808 MHz. All tests were done with Rockchip's 4.4 kernel (4.4.152).
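On Armbian images this can be done by adding the parameters to the boot environment's extra arguments (a sketch; the exact file depends on image and bootloader):

# /boot/armbianEnv.txt -- gets appended to the kernel cmdline at boot
extraargs=mem=1110M maxcpus=4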

 

This test setup is meant as 'worst case possible'. A quad-core A53 at 800 MHz is more or less equivalent to a quad-core A7 running at ~1000-1100 MHz. So we're testing at the lower limit.

 

I used a compile job that requires up to 2.6 GB RAM (based on this blog post). The task is to build ARM's Compute Library, which involves swapping on systems with less than 3 GB memory. Let's have a look:
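The build itself was run roughly along these lines (scons options as documented by ARM for the Compute Library; treat the exact flags as an assumption, -j4 matches the four active cores):

git clone https://github.com/ARM-software/ComputeLibrary.git
cd ComputeLibrary
# parallel build on all four A53 cores -- this is what pushes memory usage beyond 2.5 GB
scons -j4 Werror=1 debug=0 neon=1 opencl=1 os=linux arch=arm64-v8a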

 

In the following I tried a couple of different scenarios: swap on physical media (a sketch of that setup follows the list) and two different zram algorithms:

 

  • w/o: no swapping happened since the board booted with the full 4 GB RAM active
  • nvme: Transcend TS128GMTE110S SSD in the M.2 slot, PCIe link established at Gen2 x4
  • emmc: the ultra-fast 16 GB Samsung eMMC 5.1 on the NanoPC-T4
  • usb2: Samsung EVO840 SSD in a JMS567 disk enclosure, attached to a USB2 port (UAS works)
  • usb3: Samsung EVO840 SSD in a JMS567 disk enclosure, attached to a USB3 port (UAS works)
  • hdd: Samsung HM500JI 2.5" HDD in a JMS567 disk enclosure, attached to a USB2 port (UAS works)
  • sd card: 'average' SanDisk 8 GB SD card (not A1 rated, so horribly low random IO performance)
  • lzo: zram with lzo as compression algorithm
  • lz4: zram with lz4 as compression algorithm
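For the physical media the swap setup is assumed to have been as simple as dedicating a partition on the respective device (the device name below is just an example for the NVMe case):

# hypothetical partition on the test medium
mkswap /dev/nvme0n1p2
swapon /dev/nvme0n1p2

# check active swap devices and their priorities
swapon --show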
     

And the numbers are:

          w/o    nvme     lzo     lz4    emmc    usb2    usb3     hdd    sd card
real	100m39  118m47  125m26  127m46  133m34  146m49  154m51  481m19   1151m21
user    389m48  415m38  405m39  402m52  415m38  415m29  407m18  346m28    342m49
sys      11m05   29m37   36m14   60m01   34m35   66m59   65m44   23m05    216m25

You need to look at the first row ('real'): that's the wall-clock time the whole job took. For more details consult the 'time' manual page.
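For reference, the three rows are what the shell's time builtin prints when wrapping the build (build command as assumed above):

# real = wall-clock duration of the whole job
# user = CPU time spent in userspace, summed over all cores (can exceed 'real')
# sys  = CPU time spent in the kernel (page faults, IO and swap handling show up here)
time scons -j4 Werror=1 debug=0 neon=1 opencl=1 os=linux arch=arm64-v8a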

 

In other words: when limiting the RK3399 on the NanoPC-T4 to just the four A53 cores running at 800 MHz, the compile job takes 100 minutes with 4 GB RAM. As soon as we limit the available RAM to 1 GB swapping has to occur, so it gets interesting to see how efficient the various approaches are:

 

  • NVMe SSD is the fastest option. The performance drop is only 18%. That's due to NVMe being a modern storage protocol suited for modern (multi-core) CPUs. Problem: there's no PCIe and therefore no NVMe on the majority of SBCs
  • Zram with the lzo and lz4 algorithms performs more or less the same (interestingly lzo is slightly faster)
  • Slightly slower: the fast Samsung eMMC 5.1
  • Surprisingly the EVO840 SSD performs better connected via USB2 than via USB3 (some thoughts on this)
  • Using a HDD for swap is BS (and has been for the last four decades, but we had no alternative until SSDs appeared). The compile job needs almost 5 times longer to complete since all HDDs suck at random IO
  • Using an average SD card for swap is just horrible. The job that finished within 100 minutes with 4 GB DRAM available took over 19 HOURS with swap on an average SD card (please note that today's usual A1 rated SD cards are magnitudes faster and easily outperform HDDs)

 

Summarizing: NVMe SSDs are not a general option (they're only available on some RK3399 boards). Swap on HDD or SD card is insane. Swap on USB-connected SSDs performs ok-ish (~1.5 times slower), so the best option is to use compressed DRAM: we get a performance drop of just 25% at no additional cost. That's amazing.

 

The above numbers were 'worst case'. That's why I crippled the RK3399 down to a slow quad-core A53. This gives you an idea of how much worse zram might perform on the slowest SBCs Armbian runs on (I know there are still the boring Allwinner A20 boards around -- yep, they're too slow for this).

 

When I did all this boring test stuff I always recorded the environment using 'iostat 1800' (one report every 30 minutes showing what really happened: how much data was transferred and what the CPU cores spent their time on). Comparing the %user, %sys and especially %iowait percentages is quite interesting.
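The monitoring itself is just this (1800 second interval, i.e. one report every 30 minutes):

# each report contains an avg-cpu line (%user, %system, %iowait, ...) plus
# per-device lines showing how much data was read/written in the interval
iostat 1800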

Posted

Next test: RK3399 with unlocked performance (all six CPU cores active at the usual clockspeeds of 1.5/2.0 GHz). In the table 'lzo/2' means zram with lzo and 2 zram devices, 'lzo/6' with 6 devices, and accordingly for lz4:

          w/o     nvme     lzo/2    lzo/6    lz4/2    lz4/6
real	 31m55    40m32    41m56    41m38    43m57    44m26
user    184m16   194m58   200m37   202m20   195m17   197m51
sys       6m04    16m17    25m02    23m14    40m59    42m15

Full test output:


 

 

For obvious reasons I did not test the crappy variants again (HDD, SD card, USB attached anything). So we're only looking at performance without swap, swap on NVMe SSD and zram.

 

The RK3399, when allowed to run at full speed, finishes the same compile job in less than 32 minutes. Swap on the NVMe SSD increases the time by almost 30% now. I also compared whether the count of zram devices makes a difference (still on Rockchip's 4.4 kernel). Lzo still outperforms lz4 (which is irritating since everyone tells you lz4 would be an improvement over lzo), but there is no clear answer regarding the count of zram devices. In fact the kernel uses 1-n streams to access each device, so with modern kernels even a single zram device should suffice since the kernel takes care of distributing the load across all the CPU cores.
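The count of devices and streams can be checked and changed at runtime via sysfs, roughly like this (standard zram sysfs attributes; availability depends on kernel version):

# number of parallel compression streams of the first device
cat /sys/block/zram0/max_comp_streams

# add/remove additional zram devices at runtime (kernel 4.2+)
cat /sys/class/zram-control/hot_add        # prints the index of the newly created device
echo 1 > /sys/class/zram-control/hot_remove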

Posted

Thanks for your testing on the NanoPC-T4, this would be very helpful for someone who wants to know about the real NVMe SSD performance on RK3399 boards.

Posted
  On 9/9/2018 at 11:55 PM, mindee said:

very helpful for someone who wants to know about the real NVMe SSD performance on RK3399 boards


 

In fact I bought cheap since I got the NVMe SSD for less than 40 bucks on sale. My small TS128GMTE110S has only 2 flash chips soldered on it; if the maximum (8) were present it would be much, much faster since all modern SSD controllers make heavy use of parallelism (the more flash chips, the faster).

 

I did a quick iozone test with kernel 4.4 and the results don't look that great compared to, for example, an EVO 960.


 

But that's not relevant since the protocol makes the difference. NVMe was invented in this century, unlike all the other storage protocols we use today, and this makes a real difference since NVMe was designed with efficient access to flash storage in mind. All the other protocols we might use (including SATA) were designed decades ago for way slower storage and all of them bottleneck access to fast flash.

 

With swap on the NVMe SSD the maximum %iowait percentage according to the iostat monitoring was 0.37%. That's almost 70 times less compared to up to 25.07% with USB3!

Posted
  On 9/9/2018 at 10:32 PM, tkaiser said:

Next test: RK3399 with unlocked performance (all 6 CPU cores active at usual clockspeeds: 1.5/2.0GHz)

          w/o     nvme     lzo/2    lzo/6    lz4/2    lz4/6
real	 31m55    40m32    41m56    41m38    43m57    44m26
user    184m16   194m58   200m37   202m20   195m17   197m51
sys       6m04    16m17    25m02    23m14    40m59    42m15

 


 

All those tests I did before were done with Rockchip's 4.4 kernel.

 

Since stuff in the kernel improves over time, let's now test with the brand new 4.19.0-rc1. I just did a quick build (with the default device tree that limits maximum cpufreq to 1.8 GHz on the big and 1.4 GHz on the little cores) and only tested performance without swapping and with zram-based swap using the available algorithms (more recent kernels provide more compression algorithms to choose from):

          w/o      lzo      lz4      zstd    lz4hc
real	 29m11    35m58    36m59    48m38    58m55
user    167m59   177m24   175m22   182m02   173m57
sys       5m32    21m10    22m59    69m35   123m46

 

Results:

  • More recent kernel --> better results. Even with lower clockspeeds (1.8/1.4 GHz vs. 2.0/1.5 GHz) the test with kernel 4.19 runs 8% faster, so at the same clockspeeds this would mean even ~10% better performance
  • The performance drop with zram/lzo compared to no swap was 31% with 4.4; with 4.19 it's just 23%. So the efficiency/performance of the zram implementation itself also improved a lot
  • Again lzo is slightly faster than lz4; both zstd and lz4hc are poor candidates for this use case (but zstd is a great candidate for Armbian's new ramlog approach since it provides higher compression -- more on this later)

In other words: with a mainline kernel it makes even more sense to swap to a compressed block device in RAM since performance increased further. With this specific use case (a large compile job) the performance drop when running out of memory and the kernel starting to swap to zram stays below the 25% margin, which is just awesome :)
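For those who want to experiment themselves: switching the algorithm by hand works roughly like this (the algorithm has to be selected while the device is unused, i.e. before disksize is written):

cat /sys/block/zram0/comp_algorithm        # lists available algorithms, active one in brackets
swapoff /dev/zram0
echo 1 > /sys/block/zram0/reset            # device must be reset before reconfiguring
echo lzo > /sys/block/zram0/comp_algorithm
echo 2G > /sys/block/zram0/disksize
mkswap /dev/zram0 && swapon -p 5 /dev/zram0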

 

Posted

Little update: in the meantime I also tested with the really fast Samsung 16 GB eMMC 5.1 on the NanoPC-T4 (again crippled down to a quad-core A53 at 800 MHz). The board runs off the NVMe SSD; I mounted the eMMC as an ext4 partition, put ARM's ComputeLibrary install and the swapfile there and fired up the test again. The first post above has been updated.
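The swapfile setup is assumed to have looked something like this (partition name and mount point are hypothetical):

# eMMC partition mounted as ext4
mkfs.ext4 /dev/mmcblk1p1
mkdir -p /mnt/emmc && mount /dev/mmcblk1p1 /mnt/emmc

# create a fully allocated 4 GB file (swap files must not contain holes), then enable it
dd if=/dev/zero of=/mnt/emmc/swapfile bs=1M count=4096
chmod 600 /mnt/emmc/swapfile
mkswap /mnt/emmc/swapfile
swapon /mnt/emmc/swapfile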

 

With 4 GB RAM and no swap the job took 100m39, with swap on the fast NVMe SSD 118m47 and just 133m34 with swap on the eMMC:

          w/o    nvme     lzo     lz4    emmc    usb2    usb3     hdd    sd card
real	100m39  118m47  125m26  127m46  133m34  146m49  154m51  481m19   1151m21

That's impressive. But this Samsung eMMC 5.1 on the NanoPC-T4 (and also on the ODROID-N1) is most probably the fastest eMMC we get on SBCs today (see benchmark numbers). And still zram is faster and we get 'more RAM' for free (while swap on flash media of course contributes to the medium wearing out).

 

 

Posted

Interesting tests, especially the lzo vs lz4 outcomes. We found the same difference last year and went for lzo zram in the end because of the overall performance benefit and the marginal difference in compression.

 

Lzo was faster and less CPU intensive overall despite lz4 being better on paper (and in the opinion of almost every internet warrior). In our case the difference came from the compression overhead of lz4 being much higher than lzo's, and while lz4's decompression was faster it wasn't enough to claw back what it loses in compression time. Interestingly, when we ran the same tests on Intel CPUs (Core m3, 4.5 W) instead of Arm64 the situation was reversed, with lz4 coming out on top.

 

Do you know which variant of lzo is being used by Armbian? We used 1x-1-15 for 4+ core Arm devices and 1x-1 for single / dual core devices

Posted
  On 9/11/2018 at 10:20 AM, botfap said:

Lzo was faster and less cpu intensive overall despite lz4 being better on paper (and from the opinions of almost every internet warrior)


 

OK, so the same observation once more. Maybe we should switch back to lzo then. I feared my test, always using the same task, was somewhat flawed. At least it's configurable as SWAP_ALGORITHM in /etc/default/armbian-zram-config. But starting with the best default is for sure a good thing prior to the next major release when this stuff gets rolled out.
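So switching the default is a one-line change; a sketch of the relevant line in /etc/default/armbian-zram-config (other variables in that file are omitted here):

# compression algorithm for the zram swap devices -- must be one the running kernel
# lists in /sys/block/zram0/comp_algorithm (e.g. lzo, lz4, zstd)
SWAP_ALGORITHM=lzo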

 

  On 9/11/2018 at 10:20 AM, botfap said:

Do you know which variant of lzo is being used by Armbian?


 

Nope. I simply used default kernel settings (and only tested with Rockchip 4.4, mainline 4.14 on NanoPi Fire3 and 4.19 on RK3399). How to configure the specific algorithm?

 

Edit: another interesting observation: https://bugs.chromium.org/p/chromium/issues/detail?id=584437#c15 -- I really wonder whether the compression algorithms on ARM use NEON optimizations or not (the performance boost can be huge).

Posted
  On 9/11/2018 at 11:05 AM, tkaiser said:

 

OK, so the same observation once more. Maybe we should switch back to lzo then. I feared my test, always using the same task, was somewhat flawed. At least it's configurable as SWAP_ALGORITHM in /etc/default/armbian-zram-config. But starting with the best default is for sure a good thing prior to the next major release when this stuff gets rolled out.


 

If I were building a modern x86 target then I would go with lz4 without question, but on Arm things seem to flip around and lzo comes out on top in our testing for both ARMv7 and ARMv8. I have no idea why this is; maybe Intel's vector instructions are more efficient. Anyone have any idea?

 

  On 9/11/2018 at 11:05 AM, tkaiser said:

 

Nope. I simply used default kernel settings (and only tested with Rockchip 4.4, mainline 4.14 on NanoPi Fire3 and 4.19 on RK3399). How to configure the specific algorithm?


 

I just had a look at an Armbian build for a Tinker Board (Rockchip 4.4) I have on my desk. The default kernel lzo algo is 1x_1, which is the fastest but least efficient of the variants and probably the best default option for anything with 1 GB+ RAM. For 256 MB/512 MB boards lz4 would probably offer 10-15% more in the way of committable RAM at the expense of significantly slower compression and higher CPU utilization.

 

There was also support for lzo 1x_999, which has almost double the compression efficiency of lzo 1x_1 but takes twice as long to compress, making it worse than standard lz4 for zram use. There was no specific support for lzo 1x_1_15, which is primarily just a multi-core optimization of the 1x_1 algo, but I think in newer kernels and lzo > 2.07 1x_1_15 is automatically used instead of 1x_1 when 4 or more cores are initialized.

 

Your results suggest that you did the test with lzo 1x_1_15 because 1x_1 would have only used a single core and been slower than lz4 in theory

Posted

I did some quick research after noticing your edit and wanting to verify my vector instruction suspicions. Neither lzo nor lz4 uses NEON vector instructions on ARMv7 or ARMv8, which is far from ideal and explains why the performance of lzo and lz4 is much better on Intel than on Arm.

 

 
