tkaiser Posted September 9, 2018

This is some more research based on prior efforts. The goal is to make more efficient use of available RAM. If the system runs low on memory only two options are possible: either the kernel invokes the oom-killer to kill tasks and free memory (oom --> out of memory) or it starts to swap. Swap is a problem if it happens on slow media, and 'slow' media is the usual situation on SBCs: 'average' SD cards (not A1 rated) are slow as hell when it comes to random IO performance. So swapping is usually something that should be avoided.

But... technology improves over time. In Linux we're able to swap not only to physical storage but, for a few years now, also to compressed memory. If you want the details simply do a web search for zram or check Wikipedia first.

Test setup is a NanoPC-T4 equipped with 4 GB RAM (RK3399 based, so a big.LITTLE design with 2xA72 and 4xA53). I crippled the board down to a quad-core A53 running at 800 MHz where I can easily switch between 4 GB RAM and lower numbers: adding 'mem=1110M maxcpus=4' to the kernel cmdline results in the A72 cores being inactive, the kernel only using 1 GB DRAM and, for whatever reason, cpufreq scaling not working, so the RK3399 is statically clocked at 808 MHz. All tests were done with RK's 4.4 kernel (4.4.152).

This test setup is meant as 'worst case possible'. A quad-core A53 at 800 MHz is more or less equivalent to a quad-core A7 running at ~1000-1100 MHz. So we're trying to test at the lower limit.

I used a compile job that requires up to 2.6 GB RAM to be built (based on this blog post). The task is to build ARM's Compute Library, which involves swapping on systems with less than 3 GB memory.
Let's have a look. In the following I tried a couple of different scenarios: swap on physical media and also two different zram algorithms:

w/o: no swapping happened since the board booted with the full 4 GB RAM active
nvme: Transcend TS128GMTE110S SSD in M.2 slot, link is established at x4 Gen2
emmc: the 16GB ultra fast Samsung eMMC 5.1 on NanoPC-T4
usb2: Samsung EVO840 SSD in JMS567 disk enclosure, attached to USB2 port (UAS works)
usb3: Samsung EVO840 SSD in JMS567 disk enclosure, attached to USB3 port (UAS works)
hdd: Samsung HM500JI 2.5" HDD in JMS567 disk enclosure, attached to USB2 port (UAS works)
sd card: 'average' SanDisk 8 GB SD card (not A1 rated so horribly low random IO performance)
lzo: zram with lzo as compression algorithm
lz4: zram with lz4 as compression algorithm

And the numbers are:

        w/o      nvme     lzo      lz4      emmc     usb2     usb3     hdd      sd card
real    100m39   118m47   125m26   127m46   133m34   146m49   154m51   481m19   1151m21
user    389m48   415m38   405m39   402m52   415m38   415m29   407m18   346m28   342m49
sys     11m05    29m37    36m14    60m01    34m35    66m59    65m44    23m05    216m25

You need to look at the 1st row: that's the time the whole job took. For more details consult the 'time' manual page.

In other words: when limiting the RK3399 on NanoPC-T4 to just the four A53 cores running at 800 MHz, the compile job takes 100 minutes with 4 GB RAM. As soon as we limit the available RAM to 1 GB, swapping has to occur, so it gets interesting how efficient the various approaches are:

- NVMe SSD is the fastest option. Performance drop is only 18%. That's due to NVMe being a modern storage protocol suited for modern (multi-core) CPUs. Problem: there's no PCIe and therefore no NVMe on the majority of SBCs
- Zram with both lzo and lz4 algorithms performs more or less the same (interestingly lzo is slightly faster)
- Slightly slower: the fast Samsung eMMC 5.1
- Surprisingly the EVO840 SSD connected via USB2 performs better than connected via USB3 (some thoughts on this)
- Using a HDD for swap is BS (and was BS already the last 4 decades but we had no alternative until SSDs appeared). The compile job needs almost 5 times longer to complete since all HDDs suck at random IO
- Using an average SD card for swap is just horrible. The job that finished within 100 minutes with 4 GB DRAM available took over 19 HOURS with swap on an average SD card (please note that today's usual A1 rated SD cards are magnitudes faster and easily outperform HDDs)

Summarizing: NVMe SSDs are no general option (since only available on some RK3399 boards). Swap on HDD or SD card is insane. Swap on USB connected SSDs performs ok-ish (~1.5 times slower), so the best option is to use compressed DRAM. We get a performance drop of just 25% at no additional cost. That's amazing.

The above numbers were 'worst case'. That's why I crippled the RK3399 to a slow performing quad-core A53. You get the idea how 'bad' zram might be on the slowest SBCs Armbian runs on (I know that there are still the boring Allwinner A20 boards around -- yep, they're too slow for this).

When I did all this boring test stuff I always recorded the environment using 'iostat 1800' (it reports every 30 minutes what really happened and shows in detail how much data has been transferred and what the CPU cores spent their time on).
Quite interesting to compare %user, %sys and especially %iowait percentages:

Without swap:

real 100m39.355s
user 389m48.308s
sys 11m5.366s

avg-cpu: %user %nice %system %iowait %steal %idle
97.14 0.00 2.81 0.00 0.00 0.05
98.11 0.00 1.89 0.00 0.00 0.00
96.49 0.00 3.51 0.00 0.00 0.00
33.63 0.00 1.17 0.00 0.00 65.20

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 0.39 6.19 15.34 11136 27604
mmcblk1 0.12 0.01 7.08 24 12748
mmcblk1 0.29 0.04 29.12 76 52408
mmcblk1 0.15 0.16 17.47 280 31444

128 GB NVMe SSD (Transcend TS128GMTE110S):

real 118m47.028s
user 415m38.041s
sys 29m37.947s

avg-cpu: %user %nice %system %iowait %steal %idle
90.17 0.00 9.83 0.00 0.00 0.00
89.09 0.00 10.65 0.24 0.00 0.02
89.06 0.00 10.87 0.06 0.00 0.01
79.83 0.00 13.75 0.37 0.00 6.05

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
nvme0n1 3531.51 4049.97 10076.08 7290020 18137140
nvme0n1 4389.70 6759.53 10799.27 12167156 19438688
nvme0n1 4196.89 7548.11 9239.46 13586596 16631036
nvme0n1 5397.18 7772.77 13815.96 13990984 24868736

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 8.60 494.36 15.36 889860 27656
mmcblk1 5.43 300.62 5.71 541120 10276
mmcblk1 7.35 332.49 29.08 598480 52336
mmcblk1 11.18 587.00 20.49 1056608 36876

Samsung eMMC 5.1:

real 133m34.405s
user 415m38.955s
sys 34m35.487s

avg-cpu: %user %nice %system %iowait %steal %idle
86.06 0.00 9.33 4.02 0.00 0.59
82.91 0.00 11.93 4.61 0.00 0.54
78.06 0.00 13.79 7.60 0.00 0.55
79.67 0.00 12.85 6.67 0.00 0.81
23.34 0.00 4.78 5.33 0.00 66.55

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 885.85 3661.18 7399.21 6590168 13318660
mmcblk1 1525.72 5780.62 7345.17 10405124 13221300
mmcblk1 2074.55 7532.86 6865.80 13559216 12358516
mmcblk1 1465.59 5757.48 7218.02 10363516 12992516
mmcblk1 768.81 2683.68 3888.00 4830624 6998408

Samsung EVO840 USB3 (Class=Mass Storage, Driver=uas, 5000M):

real 154m51.541s
user 407m18.963s
sys 65m44.394s

avg-cpu: %user %nice %system %iowait %steal %idle
83.58 0.00 12.39 3.89 0.00 0.14
75.49 0.00 16.36 7.95 0.00 0.20
54.74 0.00 19.54 25.07 0.00 0.65
74.57 0.00 16.33 8.83 0.00 0.27
51.38 0.00 22.86 24.88 0.00 0.88
5.81 0.00 1.01 0.57 0.00 92.60

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 899.99 3277.29 7391.12 5899448 13304748
sda 1771.87 4602.59 5641.83 8284336 10154892
sda 2859.21 7020.32 7443.85 12636572 13398932
sda 1627.26 4463.28 4558.92 8033988 8206140
sda 2986.07 8075.17 10204.91 14535304 18368832
sda 114.34 433.28 234.57 779896 422224

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 8.76 449.59 14.97 809300 26944
mmcblk1 6.06 225.07 4.76 405108 8564
mmcblk1 6.22 212.92 4.06 383248 7304
mmcblk1 7.90 290.09 28.89 522164 52008
mmcblk1 8.65 373.69 4.33 672648 7796
mmcblk1 0.74 25.53 15.89 45956 28596

Samsung EVO840 USB2 (Class=Mass Storage, Driver=uas, 480M):

real 146m49.211s
user 415m29.511s
sys 66m59.827s

avg-cpu: %user %nice %system %iowait %steal %idle
82.26 0.00 11.60 5.52 0.00 0.63
77.11 0.00 16.94 5.68 0.00 0.27
67.59 0.00 19.14 12.92 0.00 0.35
78.39 0.00 18.05 3.50 0.00 0.07
44.69 0.00 13.18 15.43 0.00 26.70

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 875.49 3208.80 6997.15 5775896 12595008
sda 1791.04 4718.25 5574.88 8493692 10035784
sda 2491.66 6055.81 6341.38 10900164 11414168
sda 1844.43 5053.24 4258.27 9095928 7664972
sda 1854.93 4253.01 5503.57 7655424 9906428

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 5.74 360.25 14.36 648456 25856
mmcblk1 3.96 176.13 4.72 317068 8504
mmcblk1 5.94 173.85 4.32 312916 7776
mmcblk1 6.75 182.92 7.72 329256 13888
mmcblk1 6.43 259.09 39.36 466356 70856

SAMSUNG HM500JI USB2 (Class=Mass Storage, Driver=uas, 480M):

real 481m19.903s
user 346m28.221s
sys 23m5.888s

avg-cpu: %user %nice %system %iowait %steal %idle
62.34 0.00 5.30 26.52 0.00 5.84
47.05 0.00 3.66 45.03 0.00 4.27
30.16 0.00 2.33 60.95 0.00 6.55
37.39 0.00 3.19 52.27 0.00 7.15
47.45 0.00 4.44 42.35 0.00 5.76
16.43 0.00 1.47 52.25 0.00 29.85
1.86 0.00 0.72 46.10 0.00 51.32
2.40 0.00 0.70 54.83 0.00 42.08
1.83 0.00 0.60 52.51 0.00 45.06
2.38 0.00 0.61 54.73 0.00 42.28
3.41 0.00 0.72 60.17 0.00 35.70
2.12 0.00 0.60 53.32 0.00 43.97
2.89 0.00 0.58 58.69 0.00 37.84
3.39 0.00 0.68 60.29 0.00 35.63
3.35 0.00 0.65 52.47 0.00 43.53
24.89 0.00 0.51 23.60 0.00 51.00
0.46 0.00 0.05 0.43 0.00 99.05

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 202.08 1103.73 2568.98 1987224 4625352
sda 210.18 823.92 759.15 1483160 1366572
sda 251.15 614.00 1118.09 1105060 2012324
sda 243.75 637.75 562.81 1147928 1013036
sda 242.72 535.27 757.41 963668 1363592
sda 179.62 584.52 240.07 1052216 432164
sda 183.93 595.70 468.54 1072096 843248
sda 196.65 569.87 352.33 1025772 634192
sda 185.96 533.97 285.93 961144 514676
sda 190.90 503.49 303.76 906288 546772
sda 189.79 479.81 467.16 863660 840892
sda 174.72 498.87 315.20 897960 567356
sda 178.52 462.67 278.10 832800 500588
sda 192.74 449.58 405.82 809244 730484
sda 183.37 497.22 350.15 894996 630268
sda 74.32 312.27 14.20 562084 25560
sda 2.92 12.27 0.00 22092 0

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 5.04 331.77 14.07 597332 25332
mmcblk1 0.85 27.32 5.08 49172 9144
mmcblk1 0.47 45.07 1.68 81124 3032
mmcblk1 1.67 41.63 5.73 74924 10312
mmcblk1 3.30 79.95 5.89 143936 10612
mmcblk1 0.99 24.50 2.61 44108 4692
mmcblk1 0.06 8.91 0.48 16036 860
mmcblk1 0.11 10.18 1.02 18328 1828
mmcblk1 0.08 10.23 1.04 18416 1876
mmcblk1 0.03 4.24 0.98 7640 1764
mmcblk1 0.02 3.66 0.45 6580 812
mmcblk1 0.05 7.84 1.00 14120 1804
mmcblk1 0.02 2.26 0.45 4068 812
mmcblk1 0.03 4.56 0.98 8216 1764
mmcblk1 0.09 15.84 1.41 28520 2536
mmcblk1 0.69 23.72 20.68 42696 37220
mmcblk1 0.26 6.97 22.35 12544 40232

zram lz4 4 streams:

real 127m46.928s
user 402m52.389s
sys 60m1.737s

avg-cpu: %user %nice %system %iowait %steal %idle
84.16 0.00 15.81 0.00 0.00 0.02
81.10 0.00 18.89 0.00 0.00 0.01
76.31 0.00 23.68 0.00 0.00 0.01
79.77 0.00 20.22 0.00 0.00 0.01
16.56 0.00 8.59 0.01 0.00 74.85

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
zram1 1203.30 1641.51 3171.71 2954732 5709104
zram2 1202.68 1638.97 3171.76 2950160 5709196
zram3 1203.74 1643.14 3171.81 2957672 5709288
zram4 1202.88 1639.80 3171.70 2951664 5709096
zram1 1634.75 2491.59 4047.41 4485068 7285668
zram2 1632.13 2481.02 4047.48 4466040 7285792
zram3 1632.28 2481.82 4047.32 4467468 7285496
zram4 1633.93 2488.36 4047.37 4479248 7285592
zram1 2142.44 3778.25 4791.51 6800844 8624712
zram2 2141.56 3774.83 4791.39 6794700 8624500
zram3 2141.69 3775.36 4791.42 6795640 8624556
zram4 2142.77 3779.59 4791.50 6803260 8624692
zram1 1714.65 2936.11 3922.50 5285296 7060884
zram2 1713.89 2933.00 3922.56 5279696 7061000
zram3 1712.34 2926.74 3922.62 5268420 7061116
zram4 1714.12 2933.77 3922.69 5281088 7061232
zram1 755.08 1467.73 1552.57 2641884 2794600
zram2 756.14 1472.05 1552.53 2649652 2794528
zram3 754.55 1465.77 1552.42 2638356 2794328
zram4 755.22 1468.43 1552.46 2643152 2794400

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 16.72 1025.73 15.19 1846324 27340
mmcblk1 17.52 957.67 6.43 1723876 11572
mmcblk1 21.79 1547.45 26.22 2785408 47188
mmcblk1 18.02 970.77 7.58 1747492 13652
mmcblk1 12.51 803.49 26.94 1446264 48500

zram lzo 4 streams:

real 125m26.180s
user 405m39.383s
sys 36m14.588s

avg-cpu: %user %nice %system %iowait %steal %idle
85.31 0.00 14.63 0.00 0.00 0.05
82.56 0.00 17.44 0.00 0.00 0.00
80.95 0.00 19.05 0.00 0.00 0.00
79.52 0.00 20.47 0.00 0.00 0.01
11.89 0.00 5.46 0.00 0.00 82.65

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
zram1 1174.74 1582.76 3116.19 2848960 5609136
zram2 1174.93 1583.45 3116.27 2850208 5609284
zram3 1175.11 1584.21 3116.24 2851580 5609232
zram4 1174.92 1583.52 3116.15 2850328 5609072
zram1 1588.69 2414.99 3939.76 4346988 7091564
zram2 1589.01 2416.06 3939.99 4348900 7091976
zram3 1588.51 2414.16 3939.86 4345492 7091748
zram4 1588.29 2413.34 3939.81 4344016 7091660
zram1 1816.51 3200.85 4065.19 5761560 7317376
zram2 1815.69 3197.45 4065.31 5755436 7317592
zram3 1816.19 3199.44 4065.32 5759020 7317620
zram4 1816.44 3200.38 4065.37 5760724 7317700
zram1 1823.10 3125.61 4166.79 5626100 7500224
zram2 1821.87 3120.63 4166.85 5617132 7500332
zram3 1822.11 3121.60 4166.83 5618876 7500288
zram4 1822.79 3124.31 4166.83 5623760 7500292
zram1 517.68 987.91 1082.82 1778232 1949072
zram2 517.41 986.80 1082.82 1776248 1949084
zram3 517.20 985.93 1082.87 1774680 1949168
zram4 517.13 985.70 1082.82 1774256 1949072

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 12.29 782.37 13.75 1408268 24752
mmcblk1 10.81 635.53 6.22 1143952 11188
mmcblk1 14.39 921.31 26.93 1658368 48476
mmcblk1 14.46 857.51 7.68 1543516 13828
mmcblk1 5.61 351.26 20.07 632264 36120

zram lzo 1 stream:

real 124m52.403s
user 397m20.110s
sys 58m55.228s

avg-cpu: %user %nice %system %iowait %steal %idle
84.59 0.00 15.28 0.00 0.00 0.13
81.48 0.00 18.35 0.00 0.00 0.18
79.98 0.00 19.84 0.00 0.00 0.17
76.64 0.00 22.98 0.00 0.00 0.38
10.40 0.00 5.06 0.00 0.00 84.54

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
zram1 4558.09 6483.94 11748.41 11671812 21148436
zram1 6377.26 10079.11 15429.92 18142504 27774004
zram1 7050.71 12673.06 15529.80 22811500 27953632
zram1 7258.40 12565.52 16468.06 22617944 29642512
zram1 1644.30 3404.62 3172.57 6128356 5710652

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 15.91 968.35 16.16 1743132 29092
mmcblk1 15.08 852.01 5.87 1533628 10568
mmcblk1 17.93 1064.32 26.88 1915772 48376
mmcblk1 18.86 1036.18 9.54 1865128 17168
mmcblk1 8.58 494.16 22.66 889484 40780

SD card:

real 1151m21.149s
user 342m49.658s
sys 216m25.202s

avg-cpu: %user %nice %system %iowait %steal %idle
11.12 0.00 1.60 80.44 0.00 6.83
9.03 0.00 1.71 78.10 0.00 11.15
10.69 0.00 1.42 82.05 0.00 5.84
21.97 0.00 1.34 70.34 0.00 6.35
9.90 0.00 1.97 69.94 0.00 18.18
0.35 0.00 0.83 88.24 0.00 10.57
2.14 0.00 1.50 81.17 0.00 15.20
2.68 0.00 1.25 81.60 0.00 14.47
2.87 0.00 1.10 84.92 0.00 11.11
8.89 0.00 2.19 77.69 0.00 11.23
3.54 0.00 1.47 84.79 0.00 10.21
10.19 0.00 2.91 76.93 0.00 9.96
10.22 0.00 3.20 78.63 0.00 7.96
8.12 0.00 4.10 73.85 0.00 13.93
2.93 0.00 9.44 68.68 0.00 18.95
2.02 0.00 5.03 62.21 0.00 30.73
2.30 0.00 6.99 69.24 0.00 21.47
7.43 0.00 6.35 68.76 0.00 17.45
6.63 0.00 15.24 61.14 0.00 16.99
6.37 0.00 12.22 66.69 0.00 14.73
7.44 0.00 13.48 64.41 0.00 14.67
1.59 0.00 2.16 81.48 0.00 14.76
8.89 0.00 13.72 64.28 0.00 13.11
4.74 0.00 5.52 77.09 0.00 12.65
5.57 0.00 9.39 72.62 0.00 12.42
9.20 0.00 13.25 63.39 0.00 14.16
7.95 0.00 12.44 65.51 0.00 14.11
11.55 0.00 14.41 60.93 0.00 13.11
9.60 0.00 15.77 62.18 0.00 12.44
3.92 0.00 15.74 59.88 0.00 20.46
5.34 0.00 19.68 55.40 0.00 19.58
7.11 0.00 19.83 55.26 0.00 17.81
6.17 0.00 16.61 57.38 0.00 19.84
6.29 0.00 16.96 56.82 0.00 19.93
6.26 0.00 17.32 55.47 0.00 20.95
6.20 0.00 8.00 56.80 0.00 29.00
9.91 0.00 7.30 50.51 0.00 32.28
18.88 0.00 6.58 46.38 0.00 28.16
7.03 0.00 0.15 0.43 0.00 92.39

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk2 76.61 450.41 1382.78 810772 2489124
mmcblk2 118.21 657.84 1485.70 1204804 2721000
mmcblk2 71.49 515.17 1415.50 955424 2625136
mmcblk2 35.76 296.81 1296.73 568028 2481664
mmcblk2 201.44 977.12 848.80 1840180 1598524
mmcblk2 29.22 16.47 110.05 28892 193032
mmcblk2 97.93 293.55 159.87 535056 291392
mmcblk2 78.06 206.14 127.74 369048 228684
mmcblk2 65.40 167.94 148.04 324444 286004
mmcblk2 135.49 394.42 180.17 685148 312972
mmcblk2 75.55 201.50 153.93 355436 271520
mmcblk2 143.00 401.61 227.92 749180 425180
mmcblk2 156.31 396.85 262.57 713496 472064
mmcblk2 139.27 240.70 347.53 435308 628516
mmcblk2 161.35 251.15 506.99 449260 906896
mmcblk2 239.78 528.38 476.98 949888 857484
mmcblk2 172.31 352.63 400.22 642808 729560
mmcblk2 178.92 468.90 311.74 844492 561436
mmcblk2 340.91 799.06 710.93 1439004 1280280
mmcblk2 277.69 683.66 547.90 1232240 987552
mmcblk2 321.37 796.36 631.85 1443872 1145604
mmcblk2 77.21 181.19 156.36 327988 283040
mmcblk2 317.58 788.05 607.13 1407312 1084220
mmcblk2 157.16 407.68 282.06 744680 515216
mmcblk2 198.91 439.60 443.84 787068 794652
mmcblk2 337.17 818.63 667.23 1474116 1201480
mmcblk2 332.78 796.49 659.56 1431248 1185196
mmcblk2 373.29 972.52 664.27 1750564 1195704
mmcblk2 335.74 778.88 707.79 1405296 1277036
mmcblk2 326.77 801.76 669.36 1445520 1206824
mmcblk2 357.16 1028.33 668.17 1845528 1199148
mmcblk2 328.49 891.48 669.25 1604684 1204672
mmcblk2 325.89 861.09 661.31 1553152 1192808
mmcblk2 358.29 981.11 661.84 1776640 1198484
mmcblk2 420.42 1245.85 659.41 2232780 1181772
mmcblk2 470.04 1186.00 723.31 2133452 1301144
mmcblk2 518.43 1405.45 696.22 2534760 1255640
mmcblk2 477.34 1259.47 666.39 2264684 1198256
mmcblk2 12.11 69.58 49.24 125248 88636
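To make the relative cost of each swap variant explicit, here is a minimal Python sketch that turns the 'real' row of the summary table in this post into slowdown factors against the 4 GB run without swap. The names REAL_TIMES, to_seconds and slowdown are illustrative choices only; the times themselves are copied from the table.

```python
# Wall-clock ('real') times from the summary table, written as <minutes>m<seconds>.
REAL_TIMES = {
    "w/o": "100m39", "nvme": "118m47", "lzo": "125m26", "lz4": "127m46",
    "emmc": "133m34", "usb2": "146m49", "usb3": "154m51",
    "hdd": "481m19", "sd card": "1151m21",
}


def to_seconds(stamp: str) -> int:
    """Convert a 'time'-style value such as '100m39' into seconds."""
    minutes, seconds = stamp.split("m")
    return int(minutes) * 60 + int(seconds)


def slowdown(times: dict, baseline: str = "w/o") -> dict:
    """Return each run's wall-clock time as a multiple of the baseline run."""
    base = to_seconds(times[baseline])
    return {name: to_seconds(value) / base for name, value in times.items()}


if __name__ == "__main__":
    for name, factor in slowdown(REAL_TIMES).items():
        print(f"{name:8s} {factor:5.2f}x  (+{(factor - 1) * 100:.0f}%)")
```

With these inputs NVMe comes out at about 1.18x and zram/lzo at about 1.25x, while the HDD run lands near 4.8x and the SD card run near 11.4x, which matches the percentages quoted in the post.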
tkaiser (Author) Posted September 9, 2018

Next test: RK3399 with unlocked performance (all 6 CPU cores active at usual clockspeeds: 1.5/2.0GHz)

        w/o      nvme     lzo/2    lzo/6    lz4/2    lz4/6
real    31m55    40m32    41m56    41m38    43m57    44m26
user    184m16   194m58   200m37   202m20   195m17   197m51
sys     6m04     16m17    25m02    23m14    40m59    42m15

Full test output:

Without swap:

real 31m55.360s
user 184m16.317s
sys 6m3.999s

avg-cpu: %user %nice %system %iowait %steal %idle
96.78 0.00 3.11 0.00 0.00 0.11
96.56 0.00 3.30 0.00 0.00 0.14
11.57 0.00 0.42 0.00 0.00 88.01

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 0.56 0.09 36.40 80 32758
mmcblk1 1.53 6.29 95.42 5664 85876
mmcblk1 0.14 0.26 29.38 232 26440

lzo 2 streams:

real 41m56.261s
user 200m36.964s
sys 25m2.247s

avg-cpu: %user %nice %system %iowait %steal %idle
79.01 0.00 20.70 0.04 0.00 0.25
83.46 0.00 16.40 0.00 0.00 0.14
61.40 0.00 17.19 0.01 0.00 21.40

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
zram1 10000.57 15515.19 24487.10 13963672 22038392
zram2 9994.63 15491.28 24487.23 13942156 22038508
zram1 10381.19 17556.42 23968.32 15800952 21571732
zram2 10378.70 17546.46 23968.32 15791988 21571732
zram1 9469.21 16759.32 21117.53 15083392 19005776
zram2 9455.55 16704.60 21117.59 15034136 19005828

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 83.01 8366.31 33.08 7529680 29776
mmcblk1 38.71 2303.10 58.27 2072816 52444
mmcblk1 46.68 2686.10 50.05 2417489 45048

lzo 6 streams:

real 41m38.302s
user 202m20.016s
sys 23m14.408s

avg-cpu: %user %nice %system %iowait %steal %idle
83.18 0.00 16.71 0.00 0.00 0.11
82.59 0.00 17.29 0.01 0.00 0.11
59.98 0.00 16.90 0.01 0.00 23.11

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
zram1 3039.69 4346.14 7812.62 3911524 7031360
zram2 3040.84 4350.74 7812.63 3915668 7031364
zram3 3038.73 4342.47 7812.44 3908220 7031196
zram4 3039.70 4346.43 7812.39 3911784 7031148
zram5 3038.88 4342.94 7812.59 3908648 7031328
zram6 3040.02 4347.62 7812.47 3912860 7031220
zram1 3416.83 5791.84 7875.48 5212656 7087936
zram2 3417.75 5795.52 7875.49 5215968 7087944
zram3 3419.16 5801.10 7875.53 5220992 7087980
zram4 3417.88 5795.81 7875.70 5216232 7088132
zram5 3418.87 5799.86 7875.61 5219872 7088048
zram6 3419.90 5804.16 7875.44 5223740 7087896
zram1 3060.02 5409.21 6830.87 4868344 6147848
zram2 3059.34 5406.76 6830.61 4866136 6147616
zram3 3059.39 5406.96 6830.61 4866316 6147616
zram4 3060.38 5410.83 6830.70 4869800 6147696
zram5 3060.49 5411.10 6830.87 4870040 6147848
zram6 3059.70 5407.84 6830.96 4867112 6147932

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 52.03 3429.48 32.36 3086532 29120
mmcblk1 37.18 2246.65 58.60 2021988 52740
mmcblk1 47.53 2735.41 52.83 2461900 47544

lz4 2 streams:

real 43m57.637s
user 195m17.556s
sys 40m59.904s

avg-cpu: %user %nice %system %iowait %steal %idle
77.08 0.00 22.16 0.03 0.00 0.73
75.15 0.00 23.91 0.02 0.00 0.92
65.67 0.00 25.35 0.03 0.00 8.94

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
zram1 8801.74 12553.56 22653.41 11298200 20388072
zram2 8800.08 12547.31 22653.00 11292580 20387704
zram1 11100.47 18802.53 25599.34 16922840 23040176
zram2 11099.44 18798.44 25599.32 16919156 23040156
zram1 10362.92 18539.87 22911.82 16685884 20620640
zram2 10355.76 18511.28 22911.76 16660152 20620584

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 66.22 4788.56 31.21 4309700 28092
mmcblk1 57.70 3918.69 57.91 3526936 52124
mmcblk1 66.55 3865.90 57.75 3479308 51976

lz4 6 streams:

real 44m26.940s
user 197m51.586s
sys 42m15.525s

avg-cpu: %user %nice %system %iowait %steal %idle
77.44 0.00 22.18 0.02 0.00 0.35
75.06 0.00 24.54 0.01 0.00 0.38
68.29 0.00 26.81 0.02 0.00 4.88

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
zram1 2866.07 4112.82 7351.47 3701580 6616400
zram2 2866.50 4114.47 7351.54 3703064 6616464
zram3 2866.32 4113.75 7351.52 3702416 6616444
zram4 2868.42 4122.26 7351.44 3710072 6616372
zram5 2865.52 4110.65 7351.45 3699624 6616380
zram6 2867.04 4116.73 7351.44 3705100 6616368
zram1 3610.06 6068.57 8371.69 5461832 7534688
zram2 3608.80 6063.51 8371.70 5457276 7534700
zram3 3608.88 6063.82 8371.70 5457560 7534700
zram4 3612.11 6076.78 8371.65 5469228 7534652
zram5 3608.69 6062.87 8371.88 5456704 7534856
zram6 3609.15 6064.71 8371.88 5458364 7534856
zram1 3628.46 6460.98 8052.86 5814944 7247656
zram2 3626.32 6452.34 8052.93 5807168 7247720
zram3 3626.74 6454.27 8052.68 5808912 7247492
zram4 3627.31 6456.32 8052.91 5810752 7247700
zram5 3627.74 6458.25 8052.71 5812488 7247520
zram6 3629.89 6466.64 8052.92 5820044 7247708

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 64.98 4413.62 30.87 3972300 27780
mmcblk1 56.31 3752.78 59.29 3377580 53364
mmcblk1 67.02 4039.19 58.18 3635312 52360

NVMe SSD:

real 40m32.985s
user 194m58.775s
sys 16m17.506s

avg-cpu: %user %nice %system %iowait %steal %idle
87.55 0.00 11.60 0.62 0.00 0.23
85.85 0.00 11.29 2.42 0.00 0.45
85.85 0.00 11.29 2.42 0.00 0.45

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
nvme0n1 10178.30 12957.70 27755.50 11662060 24980224
nvme0n1 12027.94 22808.95 25302.80 20528052 22772516
nvme0n1 12027.94 22808.95 25302.80 20528052 22772516

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mmcblk1 44.03 2983.29 31.68 2684992 28508
mmcblk1 22.29 930.41 17.25 837372 15524
mmcblk1 22.29 930.41 17.25 837372 15524

For obvious reasons I did not test the crappy variants again (HDD, SD card, USB attached anything). So we're only looking at performance without swap, swap on NVMe SSD and zram.

RK3399 when allowed to run at full speed finishes the same compile job in less than 32 minutes. Swap on NVMe SSD now increases the time by about 27% (40m32 vs. 31m55).

I now also compared whether the count of zram devices makes a difference (still on RK's 4.4 kernel). lzo still outperforms lz4 (which is irritating since everyone tells you lz4 would be an improvement over lzo), but there is no clear answer about the count of zram devices (in fact the kernel uses 1-n streams to access each device, so with modern kernels even a single zram device should suffice since the kernel takes care of distributing the load across all the CPU cores).
mindee Posted September 9, 2018

Thanks for your testing on NanoPC-T4, this will be very helpful for anyone who wants to know about the real NVMe SSD performance on RK3399 boards.
tkaiser (Author) Posted September 10, 2018

5 hours ago, mindee said: "very helpful for anyone who wants to know about the real NVMe SSD performance on RK3399 boards"

In fact I bought cheap: I got the NVMe SSD for less than 40 bucks on sale. My small TS128GMTE110S has only 2 flash chips soldered on it. If the maximum (8) were present it would be much, much faster since all modern SSD controllers make heavy use of parallelism (the more flash chips the faster). I did a quick iozone test with kernel 4.4 and the results do not look that great compared to an EVO 960 for example.

Transcend TS128GMTE110S, ext4 (throughput in kB/s):

                                                       random   random
      kB  reclen    write  rewrite     read   reread     read    write
  102400       4    81473   115499   145683   147298    40272    82992
  102400      16   179264   249123   293189   293318   111869   246077
  102400     512   578704   579090   829601   832798   658995   567873
  102400    1024   585086   577757   928864   935910   789642   565868
  102400   16384   527840   531048  1045632  1056965  1031390   546275
 2048000   16384   544678   549905  1064665  1064439  1039331   545381

But that's not relevant since the protocol makes the difference. NVMe was invented in this century and not in the last one like all the other storage protocols we use today. And this makes a real difference since NVMe was developed with efficient access to flash storage in mind. All the other protocols we might use (including SATA) were designed decades ago for way slower storage and all of them bottleneck access to fast flash.

In the first test (quad-core A53 at 800 MHz) the maximum %iowait percentage with swap on the NVMe SSD was 0.37% according to iostat monitoring. That's almost 70 times less than the up to 25.07% seen with USB3!
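The %iowait comparison can be reproduced from the logged avg-cpu rows. The following Python sketch assumes the rows are available as plain text in iostat's column order (%user %nice %system %iowait %steal %idle); the constants NVME_ROWS and USB3_ROWS and the helper max_iowait are hypothetical names, and the values are copied from the NVMe and USB3 runs of the first post.

```python
# avg-cpu rows from 'iostat 1800', one 30 minute interval per line.
# Column order: %user %nice %system %iowait %steal %idle
NVME_ROWS = """\
90.17 0.00 9.83 0.00 0.00 0.00
89.09 0.00 10.65 0.24 0.00 0.02
89.06 0.00 10.87 0.06 0.00 0.01
79.83 0.00 13.75 0.37 0.00 6.05
"""

USB3_ROWS = """\
83.58 0.00 12.39 3.89 0.00 0.14
75.49 0.00 16.36 7.95 0.00 0.20
54.74 0.00 19.54 25.07 0.00 0.65
74.57 0.00 16.33 8.83 0.00 0.27
51.38 0.00 22.86 24.88 0.00 0.88
5.81 0.00 1.01 0.57 0.00 92.60
"""

IOWAIT_COLUMN = 3  # zero-based index of %iowait within a row


def max_iowait(rows: str) -> float:
    """Return the highest %iowait value among the given avg-cpu rows."""
    return max(
        float(line.split()[IOWAIT_COLUMN])
        for line in rows.splitlines()
        if line.strip()
    )


if __name__ == "__main__":
    nvme, usb3 = max_iowait(NVME_ROWS), max_iowait(USB3_ROWS)
    print(f"nvme max %iowait: {nvme}, usb3 max %iowait: {usb3}")
    print(f"ratio: {usb3 / nvme:.1f}")
```

The ratio works out to roughly 68, which is where the 'almost 70 times' figure comes from.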
tkaiser (Author) Posted September 11, 2018

On 9/10/2018 at 12:32 AM, tkaiser said:

Next test: RK3399 with unlocked performance (all 6 CPU cores active at usual clockspeeds: 1.5/2.0GHz)

        w/o      nvme     lzo/2    lzo/6    lz4/2    lz4/6
real    31m55    40m32    41m56    41m38    43m57    44m26
user    184m16   194m58   200m37   202m20   195m17   197m51
sys     6m04     16m17    25m02    23m14    40m59    42m15

All those tests I did before were done with Rockchip's 4.4 kernel. Since stuff in the kernel improves over time, let's now test with brand new 4.19.0-rc1. I just did a quick build (with the default device tree that limits maximum cpufreq to 1.8 GHz on the big and 1.4 GHz on the little cores) and only tested performance without swapping and with zram based swap using the available algorithms (more recent kernels provide more compression algorithms to choose from):

        w/o      lzo      lz4      zstd     lz4hc
real    29m11    35m58    36m59    48m38    58m55
user    167m59   177m24   175m22   182m02   173m57
sys     5m32     21m10    22m59    69m35    123m46

Results:

- More recent kernel --> better results. Even with lower clockspeeds (1.8/1.4 GHz vs. 2.0/1.5 GHz) the test with kernel 4.19 finishes in about 8% less time (29m11 vs. 31m55). So at the same clockspeed this would result in even ~10% better performance
- Performance drop with zram/lzo compared to no swap was 31% with 4.4. With 4.19 it's just 23%. So efficiency/performance of the zram implementation itself also improved a lot
- Again lzo is slightly faster than lz4. Both zstd and lz4hc are no good candidates for this use case (but zstd is a great candidate for Armbian's new ramlog approach since it provides higher compression -- more on this later)

In other words: with mainline kernel it makes even more sense to swap to a compressed block device in RAM since performance further increased. With this specific use case (large compile job) the performance drop when running out of memory and the kernel starting to swap to zram is below the 25% margin, which is just awesome.
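Which compression algorithms a given kernel offers for zram can be read from sysfs: each zram device exposes a comp_algorithm file that lists the available algorithms on one line, with the currently selected one in square brackets. A minimal Python sketch for reading it follows; the function names and the sample line are illustrative and not captured from the test board.

```python
from pathlib import Path


def parse_comp_algorithm(text: str):
    """Split a comp_algorithm line into (active, list_of_all_algorithms)."""
    active = ""
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            # The bracketed entry is the algorithm currently in use.
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def read_comp_algorithm(device: str = "zram0"):
    """Read the algorithm list of one zram device (the device must exist)."""
    path = Path(f"/sys/block/{device}/comp_algorithm")
    return parse_comp_algorithm(path.read_text())


if __name__ == "__main__":
    # Hypothetical sample line, used so the sketch runs without a zram device.
    print(parse_comp_algorithm("lzo [lz4] zstd lz4hc"))
```

On a system with zram loaded, read_comp_algorithm("zram0") should return the same kind of tuple for the real device; the algorithm can only be changed before the device's disksize is set.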
tkaiser (Author) Posted September 11, 2018

Little update: in the meantime I also tested with the really fast Samsung 16GB eMMC 5.1 on NanoPC-T4 (again crippled down to a quad-core A53 at 800 MHz). The board runs off the NVMe SSD. I mounted the eMMC as an ext4 partition, put ARM's ComputeLibrary install and the swapfile on it and fired up the test again. First post above is updated. With 4 GB RAM and no swap the job took 100:39 minutes, with swapping on the fast NVMe SSD 118:47 and just 133:34 on the eMMC:

        w/o      nvme     lzo      lz4      emmc     usb2     usb3     hdd      sd card
real    100m39   118m47   125m26   127m46   133m34   146m49   154m51   481m19   1151m21

That's impressive. But this Samsung eMMC 5.1 on NanoPC-T4 (and also on ODROID-N1) is most probably the fastest eMMC we get on SBCs today (see benchmark numbers). And still zram is faster and we get 'more RAM' for free (swap on flash media, by contrast, contributes to the medium wearing out).
botfap Posted September 11, 2018

Interesting tests, especially the lzo vs lz4 outcomes. We found the same difference last year and went for lzo zram in the end because of the overall performance benefit and the marginal difference in compression.

Lzo was faster and less CPU intensive overall despite lz4 being better on paper (and from the opinions of almost every internet warrior). In our case the difference came from the compression overhead of lz4 being much higher than that of lzo, and while the decompression of lz4 was faster it wasn't enough to claw back what it loses in compression time. Interestingly, when we ran the same tests on Intel CPUs (Core m3 4.5W) instead of Arm64 the situation was reversed, with lz4 coming out on top.

Do you know which variant of lzo is being used by Armbian? We used 1x-1-15 for 4+ core Arm devices and 1x-1 for single / dual core devices.
tkaiser (Author) Posted September 11, 2018

1 hour ago, botfap said: "Lzo was faster and less cpu intensive overall despite lz4 being better on paper (and from the opinions of almost every internet warrior)"

Ok, so once again the same observation. Maybe we should switch back to lzo already then. I feared my test, always using the same task, is somewhat flawed. At least it's configurable as SWAP_ALGORITHM in /etc/default/armbian-zram-config. But starting with the best default is for sure a good thing prior to the next major release when this stuff gets rolled out.

1 hour ago, botfap said: "Do you know which variant of lzo is being used by Armbian?"

Nope. I simply used default kernel settings (and only tested with Rockchip 4.4, mainline 4.14 on NanoPi Fire3 and 4.19 on RK3399). How to configure the specific algorithm?

Edit: another interesting observation: https://bugs.chromium.org/p/chromium/issues/detail?id=584437#c15

I really wonder whether the compression algorithms on ARM use NEON optimizations or not (the performance boost can be huge)
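For checking which algorithm a board is configured to use, here is a small Python sketch that reads SWAP_ALGORITHM from /etc/default/armbian-zram-config. It assumes the file uses plain shell-style KEY=value lines with '#' comments, as files under /etc/default commonly do; the helper read_setting and the sample content are hypothetical and not taken from an actual Armbian install.

```python
from pathlib import Path
from typing import Optional

CONFIG_PATH = Path("/etc/default/armbian-zram-config")


def read_setting(text: str, key: str, default: Optional[str] = None) -> Optional[str]:
    """Return the last uncommented KEY=value assignment for key, or default."""
    value = default
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, raw = line.partition("=")
        if name.strip() == key:
            value = raw.strip().strip("'\"")
    return value


if __name__ == "__main__":
    # Hypothetical file content, for illustration only.
    sample = "# SWAP_ALGORITHM=lz4\nENABLED=true\nSWAP_ALGORITHM=lzo\n"
    print(read_setting(sample, "SWAP_ALGORITHM"))
    if CONFIG_PATH.exists():
        print(read_setting(CONFIG_PATH.read_text(), "SWAP_ALGORITHM"))
```

A commented-out assignment is ignored, so the function returns the supplied default when the variable is not set explicitly.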
botfap Posted September 11, 2018

1 hour ago, tkaiser said: "Ok, so another time the same observation. Maybe we should switch back to lzo then already. I feared my test always using the same task is somewhat flawed. At least it's configurable as SWAP_ALGORITHM in /etc/default/armbian-zram-config. But starting with the best default is for sure a good thing prior to next major release when this stuff gets rolled out."

If I were building a modern x86 target then I would go lz4 without question, but on Arm there seems to be a flip around and lzo comes out on top in our testing for both ARMv7 and ARMv8. I have no idea why this is, maybe Intel's vector instructions are more efficient. Anyone have any idea?

1 hour ago, tkaiser said: "Nope. I simply used default kernel settings (and only tested with Rockchip 4.4, mainline 4.14 on NanoPi Fire3 and 4.19 on RK3399). How to configure the specific algorithm?"

Just had a look at an Armbian build for a tinkerboard (rk4.4) I have on my desk. The default kernel lzo algo is 1x_1, which is the fastest but least efficient of the variants and probably the best default option for anything with 1GB+ RAM. For 256MB/512MB boards lz4 would probably offer 10-15% more in the way of committable RAM at the expense of significantly slower compression speed and higher CPU utilization.

There was also support for lzo 1x_999, which has almost double the compression efficiency of lzo 1x_1 but takes twice as long to compress, making it worse than standard lz4 for zram use. There was no specific support for lzo 1x_1_15, which is primarily just a multi core optimization of the 1x_1 algo, but I think in newer kernels and lzo > 2.07, 1x_1_15 is automatically used instead of 1x_1 when 4 cores or more are initialized.

Your results suggest that you did the test with lzo 1x_1_15 because 1x_1 would have only used a single core and been slower than lz4 in theory.
botfap Posted September 11, 2018

Did some quick research after noticing your edit and wanting to verify my vector instruction suspicions. Neither lzo nor lz4 uses NEON vector instructions on ARMv7 or ARMv8, which is very far from ideal and explains why the performance of lzo and lz4 is much better on Intel than on Arm.