zador.blood.stained Posted August 1, 2018

On 8/1/2018 at 11:37 PM, tkaiser said: If ARMv8 Crypto Extensions are available funnily openssl numbers with smaller data chunks are higher with 1.0.2g than 1.1.0g (see NanoPC T3+ numbers or those for Vim2). Did you already check whether all our kernels have the respective AF_ALG and NEON switches enabled?

I thought more about this, ran openssl and cryptsetup through strace, and checked the openssl build configuration in Ubuntu. Stock Ubuntu (and most likely Debian) OpenSSL will use userspace crypto. So if CPU instructions are available (NEON, ARMv8 CE) it should use them, but it won't use HW engines like sun4i-ss or CESA. At least we have some comparable numbers as long as we don't compare OpenSSL 1.0.x results with 1.1.x directly. This means that the AES numbers in the table will not resemble performance in real-world scenarios that use in-kernel crypto (like disk and filesystem encryption).

On 8/1/2018 at 11:37 PM, tkaiser said: Well, identifying such stuff is IMO part of the journey.

But people will still use your results table to compare boards, so IMO it's worth adding a note for boards where HW crypto engines are available. ARMv8 CE is not a crypto engine; its numbers should depend on CPU performance and should be affected by throttling, unlike, e.g., CESA, which runs at a fixed clock.
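A rough way to check this yourself on any board is to watch for AF_ALG socket() calls (a sketch, assuming strace is installed; in-kernel crypto users have to open such a socket):

$> strace -f -e trace=socket openssl speed -evp aes-128-cbc 2>&1 | grep AF_ALG
$> strace -f -e trace=socket cryptsetup benchmark 2>&1 | grep AF_ALG

With a stock userspace-crypto OpenSSL the first command should print nothing, while cryptsetup would be expected to open an AF_ALG socket.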
tkaiser Posted August 1, 2018

43 minutes ago, NicoD said: XU4 Armbian Stretch http://ix.io/1iWL

Thank you. Unfortunately our DT clocks the little cores at just 1400 MHz, and some minor throttling happened:

2000 MHz: 2083.27 sec
1900 MHz: 25.23 sec
1800 MHz: 7.58 sec

Anyway, the numbers are usable. Will add them with the next Results.md update.
tkaiser Posted August 2, 2018

8 hours ago, zador.blood.stained said: But people will still use your results table to compare boards, so IMO it's worth adding a note for boards where HW crypto engines are available.

For now I added just a warning: https://github.com/ThomasKaiser/sbc-bench/blob/master/Results.md -- I'll soon add a TODO section where I'll try to explain how to deal with 'numbers generated' vs. 'insights generated': coming up with other benchmarks that more properly describe real-world use cases. Wrt the crypto stuff: most probably using cryptsetup and then measuring some real-world tasks (which involve a ton of other dependencies like filesystem performance and so on).
ccbur Posted August 2, 2018

On 7/31/2018 at 11:50 PM, tkaiser said: Eagerly waiting for more results (from other boards) since we start to get some understanding why common benchmark results are that weird.

Here is the result for a Jetson TK1 with a more or less mainline kernel 4.14 and Debian Buster (sorry, currently no Stretch available): http://ix.io/1j0f

There are some discrepancies in both cpufreq OPP tables!? Any idea how to read those numbers? I've never had a stability issue, and normally the CPU is not a bottleneck for me, so I'm not really worried about throttling. I'm using my Jetsons with the little standard fan, but maybe it's time to enable the temperature sensors somehow :).
tkaiser Posted August 2, 2018

1 hour ago, ccbur said: Jetson TK1 with more or less mainline kernel 4.14 and Debian Buster (sorry, currently no Stretch available): http://ix.io/1j0f

In this case the distro version doesn't matter since the cpuminer test fails on 32-bit platforms anyway (and that is the test where libs and GCC version would have been important). Buster still uses 7-zip v16.02, so the numbers are somewhat comparable.

1 hour ago, ccbur said: There are some discrepancies in both cpufreq OPP table!? Any idea, how to read those numbers?

Well, sbc-bench uses @wtarreau's nice mhz tool to calculate real clockspeeds, and I hope I use it correctly. It seems CPU clockspeeds are in reality controlled by some firmware and the cpufreq driver reports nonsense, since on an idle system we see measured clockspeeds well above the OPPs, only for them to be limited to ~1565 MHz while running the 7-zip benchmark:

Cpufreq OPP: 204 Measured: 224.991/224.846/224.977
Cpufreq OPP: 306 Measured: 334.677/334.810/334.586
Cpufreq OPP: 408 Measured: 444.497/444.497/443.872
Cpufreq OPP: 510 Measured: 554.177/554.061/554.038
Cpufreq OPP: 612 Measured: 671.761/670.167/670.167
Cpufreq OPP: 714 Measured: 779.803/780.207/779.591
Cpufreq OPP: 816 Measured: 889.699/889.881/889.642
Cpufreq OPP: 918 Measured: 1003.636/1006.018/1005.933
Cpufreq OPP: 1020 Measured: 1116.065/1115.752/1115.655
Cpufreq OPP: 1122 Measured: 1225.857/1225.945/1225.916
Cpufreq OPP: 1224 Measured: 1335.698/1333.532/1335.864
Cpufreq OPP: 1326 Measured: 1451.911/1452.270/1451.944
Cpufreq OPP: 1428 Measured: 1562.109/1561.977/1562.147
Cpufreq OPP: 1530 Measured: 1562.373/1562.241/1558.959
Cpufreq OPP: 1632 Measured: 1561.882/1561.769/1561.769
Cpufreq OPP: 1734 Measured: 1561.731/1561.580/1561.901
Cpufreq OPP: 1836 Measured: 1561.939/1561.901/1561.920
Cpufreq OPP: 1938 Measured: 1562.241/1562.090/1561.693
Cpufreq OPP: 2014 Measured: 1561.750/1562.165/1561.693
Cpufreq OPP: 2116 Measured: 1561.807/1561.977/1561.825
Cpufreq OPP: 2218 Measured: 1561.825/1559.053/1561.637
Cpufreq OPP: 2320 Measured: 1561.825/1561.958/1561.618

7-zip contains its own measuring routine and seems to agree:

CPU Freq: 1425 1524 1557 1558 1557 1558 1558 1557 1557
CPU Freq: 1524 1540 1558 1558 1557 1558 1557 1558 1558
CPU Freq: 1557 1560 1560 1560 1560 1559 1560 1559 1559
CPU Freq: 1562 1562 1563 1563 1563 1563 1563 1563 1563

As a reference, a Tinkerboard (quad-core A17 in RK3288) scores 5350 7-zip MIPS at 1730 MHz while your Jetson scores less: 5290. Memory bandwidth is much higher on the Jetson, so the 1565 MHz start to look plausible. I have sysbench numbers from an ODROID-XU4 here, made only on the A15 cluster: 'sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4' took 62 seconds, but that was on Debian Jessie (GCC 4.7), so numbers on Buster (GCC 8.1) will be completely different.

Are the reported temperatures real? ~33°C seems way too low, and most probably the sysfs node for CPU temperature is a different one than the one my script used. Can you provide the output of the following please?

find /sys -name "thermal_zone*" | while read ; do
    echo "${REPLY}: $(cat ${REPLY}/type) $(cat ${REPLY}/temp)"
done
wtarreau Posted August 2, 2018 (edited)

49 minutes ago, tkaiser said: Well, sbc-bench is using @wtarreau's nice mhz tool to calculate real clockspeeds and I hope I use it correctly.

What you can do is increase the 2nd argument, which is the number of loops you want to run. At 1000 you can miss some precision. I tend to use 100000 on medium-power boards like NanoPis. On the Clearfog at 2 GHz, "mhz 3 100000" takes 150ms. That may be too much for your use case. It reports 1999 MHz. With 1000 it has a slightly larger variation (1996 to 2000). Well, it's probably OK at 10000. I picked up bad habits on x86, where intel_pstate takes a while to kick in.

Maybe you should always take a small and a large count in your tests. This would more easily show if there's some automatic frequency adjustment: the larger count would report a significantly higher frequency in that case because part of the loop would run at a higher frequency. Just an idea. Or probably you should have two distinct tools: "sbc-bench" and "sbc-diag". The former would report measured values over short periods, and the latter would be used with deeper tests to try to figure out what's wrong when the first values look suspicious.

Edited August 2, 2018 by wtarreau: wrong version of the tool = wrong execution time :-)
tkaiser Posted August 3, 2018

7 hours ago, wtarreau said: What you can do is increase the 2nd argument, it's the number of loops you want to run. At 1000 you can miss some precision. I tend to use 100000 on medium-power boards like nanopis.

Thanks for the suggestion. I just did a test on the slowest device I have around (single Cortex-A8 at 912 MHz). Before:

Cpufreq OPP: 240 Measured: 113.205/238.015/271.572
Cpufreq OPP: 624 Measured: 622.264/620.388/622.821
Cpufreq OPP: 864 Measured: 859.927/863.043/861.208
Cpufreq OPP: 912 Measured: 910.931/870.387/432.613

And after (now with 100000):

Cpufreq OPP: 240 Measured: 216.979/237.445/237.738
Cpufreq OPP: 624 Measured: 584.162/285.411/622.403
Cpufreq OPP: 864 Measured: 862.854/825.796/862.574
Cpufreq OPP: 912 Measured: 908.568/910.098/869.719

Ok, benchmarking such a slow system, where background activity is pretty high all the time due to there being just a single CPU core, is rather pointless. But I'll keep the 100000 for now and change the OPP checking routine that runs after the most demanding benchmark to step down from the highest to the lowest OPP, to hopefully spot hidden throttling as much as possible (like on the Jetson, Amlogic S905X/S912 and of course the Raspberry Pi).

Providing a separate sbc-diag is a good idea. This tool could also focus on testing for anomalies wrt cpufreq scaling (sbc-bench immediately switches to the performance governor to prevent results being harmed by strange cpufreq scaling behaviour, so it is not suited to answer 'why is my system so slow?').
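A minimal sketch of such a descending OPP check (it assumes root, the userspace governor and wtarreau's mhz binary in PATH; the actual sbc-bench code differs):

cd /sys/devices/system/cpu/cpu0/cpufreq
echo userspace > scaling_governor
for f in $(tr ' ' '\n' < scaling_available_frequencies | sort -rn) ; do
    echo ${f} > scaling_setspeed
    echo "Cpufreq OPP: $(( f / 1000 ))"
    mhz 3 100000    # 3 measurements, 100000 loops each
done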
tkaiser Posted August 3, 2018

13 hours ago, ccbur said: There are some discrepancies in both cpufreq OPP table!?

The numbers above (mhz, 7-zip) already suggested the Jetson does 'hidden' throttling at around ~1565 MHz. The openssl benchmark might also be useful to compare results and to determine clockspeeds. The first row is from an ODROID-XU4 (A15 cores at 2000 MHz), the 2nd from your Jetson:

type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
ODROID-XU4 59359.98k 66782.42k 70469.97k 71398.06k 71740.07k 71527.08k
Jetson TK1 40004.73k 50301.27k 54554.88k 55691.61k 56008.70k 55973.21k

If we use the XU4 numbers as base (1998 MHz) and do some simple math we may determine the real clockspeed of the NVIDIA SoC:

1998 / (59359.98 / 40004.73) = 1347
1998 / (66782.42 / 50301.27) = 1505
1998 / (70469.97 / 54554.88) = 1547
1998 / (71398.06 / 55691.61) = 1558
1998 / (71740.07 / 56008.70) = 1560
1998 / (71527.08 / 55973.21) = 1563

With less initialization overhead, i.e. larger data chunks (1K or above), this calculation approach seems to work fairly well (the openssl benchmark does not rely on memory bandwidth, so two different A15 cores can be compared directly). Your A15 cores being limited to ~1565 MHz under load seems very plausible.
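The same ratio calculation as a copy-paste check, using the 16 KB column values from the table above (awk's %d truncates just like the figures in the post):

$> awk 'BEGIN { printf "%d MHz\n", 1998 / (71527.08 / 55973.21) }'
1563 MHz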
NicoD Posted August 3, 2018

@tkaiser What do you think about the fact that 7-zip multicore never uses close to 100% of the CPUs? Would we need to take that into account? Here the XU4 system health while running the 7-zip multi core benchmark:

Time big.LITTLE load %cpu %sys %usr %nice %io %irq Temp
20:53:32: 2000/1400MHz 5.96 11% 0% 10% 0% 0% 0% 51.0°C
20:54:02: 2000/1400MHz 5.86 66% 1% 65% 0% 0% 0% 55.0°C
20:54:32: 2000/1400MHz 5.14 61% 2% 59% 0% 0% 0% 61.0°C
20:55:02: 2000/1400MHz 5.43 71% 1% 70% 0% 0% 0% 63.0°C
20:55:34: 1900/1400MHz 5.47 65% 1% 63% 0% 0% 0% 75.0°C
20:56:05: 2000/1400MHz 5.54 67% 1% 66% 0% 0% 0% 56.0°C
20:56:35: 2000/1400MHz 5.22 68% 1% 66% 0% 0% 0% 65.0°C
20:57:05: 2000/1400MHz 5.27 69% 1% 67% 0% 0% 0% 69.0°C
20:57:36: 2000/1400MHz 5.12 60% 1% 58% 0% 0% 0% 73.0°C
20:58:07: 2000/1400MHz 5.31 66% 1% 64% 0% 0% 0% 64.0°C

Here the NanoPC-T3+ system health while running the 7-zip multi core benchmark:

Time big.LITTLE load %cpu %sys %usr %nice %io %irq Temp
18:22:37: 1400/1400MHz 5.55 10% 0% 9% 0% 0% 0% 41.0°C
18:23:07: 1400/1400MHz 6.21 72% 0% 71% 0% 0% 0% 48.0°C
18:23:37: 1400/1400MHz 5.91 75% 0% 74% 0% 0% 0% 49.0°C
18:24:08: 1400/1400MHz 5.18 76% 0% 75% 0% 0% 0% 49.0°C
18:24:38: 1400/1400MHz 5.95 78% 0% 77% 0% 0% 0% 46.0°C
18:25:08: 1400/1400MHz 6.23 77% 1% 76% 0% 0% 0% 50.0°C
18:25:38: 1400/1400MHz 6.81 73% 0% 72% 0% 0% 0% 49.0°C
18:26:08: 1400/1400MHz 6.62 74% 0% 74% 0% 0% 0% 49.0°C

The NanoPC is around 75% and the XU4 is well below 70%. Would we get "more useful" numbers if we normalized, for example for the XU4: 8734/68 * 100 = 12844, and for the T3+: 10226/75 * 100 = 13634.67? Just a question. I use Blender because that uses 100% of all cores.
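That normalization written out as a one-liner (it is just the arithmetic from the post, scaling 7-zip MIPS by the observed CPU utilization):

$> awk 'BEGIN { printf "XU4: %.1f\nT3+: %.1f\n", 8734/68*100, 10226/75*100 }'
XU4: 12844.1
T3+: 13634.7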
tkaiser Posted August 3, 2018

43 minutes ago, NicoD said: What do you think about the fact that 7-zip multicore never uses close to 100% of the cpu's?

That's great, since the purpose of this test is an overall estimate of how performant the relevant board would be when doing 'server stuff'. If you click directly on the 7-zip link here https://github.com/ThomasKaiser/sbc-bench#7-zip you get an explanation of what's going on at the bottom of that page:

Quote: The test code doesn't use FPU and SSE. Most of the code is 32-bit integer code. Only some minor part in compression code uses also 64-bit integers. RAM and Cache bandwidth are not so important for these tests. The latencies are much more important. The CPU's IPC (Instructions per cycle) rate is not very high for these tests. The estimated value of test's IPC is 1 (one instruction per cycle) for modern CPU. The compression test has big number of random accesses to RAM and Data Cache. So big part of execution time the CPU waits the data from Data Cache or from RAM. The decompression test has big number of pipeline flushes after mispredicted branches. Such low IPC means that there are some unloaded CPU resources. But the CPU with Hyper-Threading feature can load these CPU resources using two threads. So Hyper-Threading provides pretty big improvement in these tests.

In other words: as expected. Your use case with Blender is something entirely different, and 7-zip scores are useless for it, primarily because Blender involves floating point stuff (while 7-zip focuses on integer work and low memory latency). As always, it's about the use case.

If we look closely at the other results we see that S905, for example, has an advantage with cpuminer compared to the rather similar A53 SoCs S905X and RK3328 (which perform rather identically with 7-zip, for example). Maybe the root cause of cpuminer's better scores will also be responsible for better Blender results on S905 compared to other A53 SoCs? It needs a different benchmark and a lot of cross-testing with the real application to get an idea how to reliably test for your use case.
NicoD Posted August 3, 2018

3 hours ago, tkaiser said: the 7-zip link here https://github.com/ThomasKaiser/sbc-bench#7-zip then you get an explanation what's going on at the bottom of that page:

Just saw that your Rock64 results are from a 2GB board. I'll do the same with my 4GB one; just interested whether there's a difference in 7-zip multicore. It's only 4 cores though. Now doing a Blender bench between Xenial default and nightly, again just out of curiosity. I read it's now clocked at 1.39GHz. I'll give you the sbc-bench results later if they're interesting; I don't think they will be. It's already 100%...

Edit: Blender just crashed at 1.39GHz after 30 minutes. At 1.3GHz no problem: 1h17m55s. I'll try again, maybe it's not stable enough. The 2nd try made it, so the crash could have been a fluke. Xenial Rock64 4GB results: http://ix.io/1j7d. Again very different from other distros. I'm also wondering if zram would make a big difference on the 8-core 2GB devices? Cheers
gprovost Posted August 3, 2018

On 8/2/2018 at 3:54 AM, zador.blood.stained said: EDIT: not sure if OpenSSL uses AF_ALG by default, but pretty sure that cryptsetup does

@zador.blood.stained I don't think there is any distro OpenSSL package that is built with hardware engine support. Also, even if an engine is installed, OpenSSL doesn't use any engine by default; you need to configure it in openssl.cnf. But you're right about cryptsetup (dm-crypt): it uses AF_ALG by default. I was wondering why there was so much delta between my 'cryptsetup benchmark' and 'openssl speed' tests on Helios4.

I just did a test by compiling openssl-1.1.1-pre8 with AF_ALG (... enable-engine enable-afalgeng ...) and here are the benchmark results on Helios4:

$> openssl speed -evp aes-xxx-cbc -engine afalg -elapsed
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 745.71k 3018.47k 11270.23k 36220.25k 90355.03k 101094.74k
aes-256-cbc 739.49k 2964.93k 11085.23k 34178.05k 82597.21k 90461.53k

The difference is quite interesting: with AF_ALG it performs much better on bigger block sizes, but poorly on very small block sizes.

$> openssl speed -evp aes-xxx-cbc -elapsed
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 44795.07k 55274.84k 59076.27k 59920.04k 59719.68k 59353.77k
aes-256-cbc 34264.93k 40524.42k 42168.92k 42496.68k 42535.59k 42500.10k

System: Linux helios4 4.14.57-mvebu #2 SMP Tue Jul 24 08:29:55 UTC 2018 armv7l GNU/Linux

Supposedly you can get even better perf with cryptodev, but I think the Crypto API AF_ALG is more elegant and easier to set up. Once I have a cleaner install of AF_ALG (or cryptodev), I will run sbc-bench and send you (@tkaiser) the output.
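For reference, making OpenSSL load an engine by default is done in openssl.cnf along these lines (a sketch; the section names after the '=' signs are arbitrary, and the afalg engine must have been compiled in):

# at the top of openssl.cnf, outside any section
openssl_conf = openssl_init

[openssl_init]
engines = engine_section

[engine_section]
afalg = afalg_section

[afalg_section]
engine_id = afalg
default_algorithms = ALL
init = 1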
zador.blood.stained Posted August 3, 2018

4 minutes ago, gprovost said: The difference is quite interesting, with AF_ALG it performs much better on bigger block size, but poorly on very small block size.

Quote: With the presence of numerous user space cryptographic libraries, one may ask why is there a need for the kernel to expose its kernel crypto API to user space. As there are system calls and potentially memory copies needed before a cipher can be invoked, it should be typically slower than user space shared libraries.

So most likely the syscall + data copy overhead for each block kills the performance with smaller block sizes.

7 minutes ago, gprovost said: Supposedly you can even have better perf with cryptodev, but I think Crypto API AF_ALG is more elegant and easier to set up.

If I remember correctly I compared cryptodev with AF_ALG on some platform (most likely with cryptotest), and AF_ALG was slightly faster.
ccbur Posted August 5, 2018

On 8/2/2018 at 10:05 PM, tkaiser said: Are the reported temperatures real? ~33°C seem way too low and most probably the sysfs node for CPU temperature is a different one than the one my script used. Can you provide the output of the following please?

I needed to add CONFIG_THERMAL / CONFIG_TEGRA_THERM to my kernel. Now the temperatures are accessible via sysfs. Here some numbers taken while 7-zip was running during a sbc-bench run:

/sys/devices/virtual/thermal/thermal_zone3: pllx 51000
/sys/devices/virtual/thermal/thermal_zone1: mem 47000
/sys/devices/virtual/thermal/thermal_zone2: gpu 47000
/sys/devices/virtual/thermal/thermal_zone0: cpu 51000
/sys/class/thermal/thermal_zone3: pllx 51000
/sys/class/thermal/thermal_zone1: mem 47000
/sys/class/thermal/thermal_zone2: gpu 47000
/sys/class/thermal/thermal_zone0: cpu 51000

But the temperatures reported during this run seem much lower (cf. http://ix.io/1j4m). I ran

watch 'cat /sys/devices/virtual/thermal/thermal_zone0/temp && cat /sys/class/hwmon/hwmon0/temp1_input'

during a sbc-bench run, and the thermal_zone0 temperature was always between 38 and 53°C while hwmon0 was way lower.

And regarding the OPP cpufreq calculation: now the cpufreq is throttled to ~1565 MHz on the first measurement (no load upfront!) and boosted beyond 2400 MHz on the second OPP calculation at the end of the run. Strange, isn't it?

Cpufreq OPP: 204 Measured: 346.544/335.861/346.856
Cpufreq OPP: 306 Measured: 503.798/516.667/515.484
Cpufreq OPP: 408 Measured: 685.631/684.615/676.372
Cpufreq OPP: 510 Measured: 854.971/849.269/856.052
Cpufreq OPP: 612 Measured: 1030.608/1035.536/1035.546
Cpufreq OPP: 714 Measured: 1205.473/1201.910/1202.302
Cpufreq OPP: 816 Measured: 1374.352/1372.307/1375.552
Cpufreq OPP: 918 Measured: 1554.848/1562.581/1555.110
Cpufreq OPP: 1020 Measured: 1721.640/1725.044/1721.787
Cpufreq OPP: 1122 Measured: 1894.246/1895.047/1894.980
Cpufreq OPP: 1224 Measured: 2064.348/2064.053/2065.067
Cpufreq OPP: 1326 Measured: 2238.183/2244.731/2240.394
Cpufreq OPP: 1428 Measured: 2414.822/2414.966/2414.735
Cpufreq OPP: 1530 Measured: 2414.909/2414.533/2415.227
Cpufreq OPP: 1632 Measured: 2415.284/2415.458/2408.336
Cpufreq OPP: 1734 Measured: 2419.049/2408.164/2415.892
Cpufreq OPP: 1836 Measured: 2415.342/2421.255/2414.764
Cpufreq OPP: 1938 Measured: 2408.595/2416.615/2416.442
Cpufreq OPP: 2014 Measured: 2416.702/2411.013/2416.818
Cpufreq OPP: 2116 Measured: 2416.905/2417.339/2408.739
Cpufreq OPP: 2218 Measured: 2417.455/2417.542/2424.252
Cpufreq OPP: 2320 Measured: 2417.513/2410.725/2417.629

This behaviour doesn't make benchmarking easy... ;-)
tkaiser Posted August 5, 2018

35 minutes ago, ccbur said: during a sbc-bench and the thermal_zone0 temperature was always between 38 and 53°C while hwmon0 was way lower.

Ok, then to get more reasonable readouts on your system the following will 'fix' this:

mkdir -p /etc/armbianmonitor/datasources
ln -s /sys/devices/virtual/thermal/thermal_zone0/temp /etc/armbianmonitor/datasources/soctemp

The real clockspeeds reported look ok (though I've no idea why), since this time you got 7680 7-zip MIPS while it was 5290 before. On the other hand, your tinymembench numbers were lower in the beginning this time. So this time your CPU cores were bottlenecked to ~1565 MHz in the beginning and later clockspeeds jumped up to ~2415 MHz.

35 minutes ago, ccbur said: This behaviour makes benchmarking not easy... ;-)

Benchmarking is never easy, and that's what the various monitoring stuff in sbc-bench wants to take care of. At least we identified that the cpufreq driver has no control over cpufreq on your Jetson (just like on Raspberries or various Amlogic SoCs). No idea what's responsible for this, but most probably it's an MCU running inside the SoC controlled by a firmware BLOB? Maybe another run, after creating /etc/armbianmonitor/datasources/soctemp first and with the latest sbc-bench, gives more clues. Or you might search sysfs for "*volts" entries (see the sketch below), since maybe cpufreq/DVFS scaling depends on undervoltage (just like on the Raspberry Pi) or something like that.

BTW: I prepared sbc-bench to also run on x86 SBC in the meantime, but being too lazy to search for my UP Board I did the changes on a Xeon box I had to check cpufreq governor behaviour on anyway. With the ondemand and performance governors, TurboBoost can be spotted nicely: http://ix.io/1j79
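The suggested sysfs search, done the same way as the thermal_zone loop earlier in the thread (node names are platform specific, so this is just a sketch):

find /sys -name "*volts*" | while read ; do
    echo "${REPLY}: $(cat ${REPLY})"
done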
gprovost Posted August 8, 2018

@tkaiser Here is the link to my latest benchmark of Helios4: http://ix.io/1jCy

I get the following results for OpenSSL speed:

OpenSSL results:
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 1280.56k 5053.40k 18249.13k 52605.27k 102288.04k 109390.51k
aes-128-cbc 1285.51k 5030.68k 18256.13k 53001.90k 100128.09k 109188.44k
aes-192-cbc 1276.82k 4959.19k 18082.22k 51421.53k 96897.71k 103093.59k
aes-192-cbc 1290.35k 4961.09k 17777.24k 51629.74k 95647.06k 102596.61k
aes-256-cbc 1292.07k 5037.99k 17762.90k 50542.25k 92782.59k 98298.54k
aes-256-cbc 1281.35k 5050.94k 17874.77k 49915.90k 93164.89k 98822.83k

In order to leverage the hw crypto engine, I had no choice but to use the OpenSSL 1.1.1 lib (openssl-1.1.1-pre8), and I decided to use cryptodev-linux instead of AF_ALG since it gives me slightly better results (+5-10%). Here are a few findings regarding the OpenSSL engine implementations.

As stated in the changelog:

Changes between 1.0.2h and 1.1.0 [25 Aug 2016]
*) Added the AFALG engine. This is an async capable engine which is able to offload work to the Linux kernel. In this initial version it only supports AES128-CBC. The kernel must be version 4.1.0 or greater. [Catriona Lucey]

So using the Debian Stretch package OpenSSL 1.1.0f, or any more recent 1.1.0 version, the only cipher supported by the AFALG engine was effectively AES-128-CBC:

$> openssl engine -c
(dynamic) Dynamic engine loading support
(afalg) AFALG engine support
[AES-128-CBC]

Starting with OpenSSL 1.1.1, even though it is not mentioned anywhere in the changelog, AES-192-CBC and AES-256-CBC are supported by the AFALG engine:

$> openssl engine -c
(dynamic) Dynamic engine loading support
(afalg) AFALG engine support
[AES-128-CBC, AES-192-CBC, AES-256-CBC]

But one thing much more exciting about OpenSSL 1.1.1 is the following:

Changes between 1.1.0h and 1.1.1 [xx XXX xxxx]
*) Add devcrypto engine. This has been implemented against cryptodev-linux, then adjusted to work on FreeBSD 8.4 as well. Enable by configuring with 'enable-devcryptoeng'. This is done by default on BSD implementations, as cryptodev.h is assumed to exist on all of them. [Richard Levitte]

So now with 1.1.1 it is pretty straightforward to use cryptodev: no need to patch or configure anything in openssl; openssl will automatically detect whether the cryptodev module is loaded and offload crypto operations to it if present.

$> openssl engine -c
(devcrypto) /dev/crypto engine
[DES-CBC, DES-EDE3-CBC, BF-CBC, AES-128-CBC, AES-192-CBC, AES-256-CBC, AES-128-CTR, AES-192-CTR, AES-256-CTR, AES-128-ECB, AES-192-ECB, AES-256-ECB, CAMELLIA-128-CBC, CAMELLIA-192-CBC, CAMELLIA-256-CBC, MD5, SHA1]
(dynamic) Dynamic engine loading support

Based on this info, and making the assumption that sooner or later openssl 1.1.1 will be available in Debian Stretch (via backports most probably), I think the best approach to add openssl crypto engine support in Armbian is the cryptodev approach. This way we can support all the ciphers now. I will look at how to properly patch the openssl_1.1.0f-3+deb9u2 package to activate cryptodev support. @zador.blood.stained maybe you have a different opinion on the topic?
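A minimal check that the devcrypto engine actually gets picked up could look like this (assuming the cryptodev-linux module has been built for the running kernel):

$> modprobe cryptodev
$> openssl engine -c devcrypto
$> openssl speed -evp aes-128-cbc -elapsed

If the offload works, armbianmonitor should show the load moving from %usr to %sys during the speed test.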
tkaiser Posted August 8, 2018

23 minutes ago, gprovost said: In order to leverage the hw crypto engine, I had no choice but to use the OpenSSL 1.1.1 lib (openssl-1.1.1-pre8)

So you simply replaced a lib somewhere, with the openssl binary still being 1.1.0f? Interesting: the initialization overhead with data chunks below 1KB. These were my numbers with the same SoC at the same clockspeed on the Clearfog Pro:

type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 44566.57k 54236.33k 58027.01k 58917.21k 59400.19k 59408.38k
aes-128-cbc 44378.63k 54520.66k 58114.47k 59056.47k 59225.43k 59331.93k
aes-192-cbc 39354.78k 46780.01k 49606.14k 50319.02k 50328.92k 50435.41k
aes-192-cbc 39281.18k 46864.04k 49414.40k 50279.08k 50536.45k 50369.88k
aes-256-cbc 35510.68k 41226.05k 43100.59k 43753.13k 43950.08k 43816.28k
aes-256-cbc 35009.43k 41104.00k 43218.77k 43696.81k 43909.12k 43958.27k
gprovost Posted August 8, 2018

As Zador said, offloading to the hw engine has a little overhead that becomes obvious in tiny-block benchmarks.

On 8/4/2018 at 2:40 AM, zador.blood.stained said: so most likely the syscall + data copy overhead for each block kills the performance with smaller block sizes.

5 minutes ago, tkaiser said: So you simply replaced a lib somewhere with openssl binary still being 1.1.0f?

That's correct. I just downloaded the openssl src from the official website and then did the following (on the target directly):

$> ./config shared enable-engine enable-dso enable-afalgeng enable-devcryptoeng --prefix=/opt/openssl --openssldir=/opt/openssl
$> make
$> sudo make install
$> export LD_LIBRARY_PATH=/opt/openssl/lib
$> openssl version
OpenSSL 1.1.0f 25 May 2017 (Library: OpenSSL 1.1.1-pre8 (beta) 20 Jun 2018)
tkaiser Posted August 8, 2018

2 hours ago, gprovost said: OpenSSL 1.1.0f 25 May 2017 (Library: OpenSSL 1.1.1-pre8 (beta) 20 Jun 2018)

Thank you, the latest sbc-bench version should log this better. Also interesting: it's not only twice the crypto performance but also half the CPU utilization at the same time. When I tested on the Clearfog I had a constant 50% CPU utilization (%usr) with this single-threaded test, while with cryptodev on your Helios4 it looks like this (all %sys):

Time CPU load %cpu %sys %usr %nice %io %irq Temp
08:17:40: 1600MHz 1.07 33% 33% 0% 0% 0% 0% 54.0°C
08:17:50: 1600MHz 1.14 17% 16% 0% 0% 0% 0% 54.0°C
08:18:00: 1600MHz 1.12 31% 30% 0% 0% 0% 0% 54.0°C
08:18:10: 1600MHz 1.10 22% 21% 0% 0% 0% 0% 54.0°C
08:18:20: 1600MHz 1.16 24% 23% 0% 0% 0% 0% 54.0°C
08:18:30: 1600MHz 1.13 28% 27% 0% 0% 0% 0% 54.0°C
08:18:40: 1600MHz 1.11 20% 19% 0% 0% 0% 0% 54.0°C
08:18:50: 1600MHz 1.18 31% 30% 0% 0% 0% 0% 54.0°C
08:19:00: 1600MHz 1.15 15% 15% 0% 0% 0% 0% 54.0°C
08:19:10: 1600MHz 1.12 31% 30% 0% 0% 0% 0% 53.0°C
gprovost Posted August 10, 2018

On 8/8/2018 at 6:54 PM, tkaiser said: Also interesting: it's not only twice as much crypto performance but also half as much CPU utilization at the same time.

I must admit it's quite cool. Now I'm trying to make apache2 (for the nextcloud use case) offload HTTPS onto cryptodev. Not trivial at all. Need to recompile stuff.
gprovost Posted August 13, 2018

@tkaiser Why in your benchmark table are you showing the 16-byte block results for AES-128-CBC while for AES-256-CBC you show the 16KByte results? I'm obviously asking because for Helios4, showing the perf on 16-byte blocks while using cryptodev is a bit unfair.
tkaiser Posted August 13, 2018

18 minutes ago, gprovost said: Why in your benchmark table are you showing the 16-byte block results for AES-128-CBC while for AES-256-CBC you show the 16KByte results?

To create awareness that the amount of data to test with is important, and to outline that initialization overhead is something to consider. The whole approach is not to generate graphs and numbers to compare without using the brain, but to generate insights instead.
gprovost Posted August 14, 2018

15 hours ago, tkaiser said: The whole approach is not to generate graphs and numbers to compare without using the brain, but to generate insights instead.

I understand that, and I'm sure everyone greatly appreciates the work you are doing with sbc-bench, since it was something clearly missing out there... I think it could actually become the baseline reference for SBC benchmarks. But I'm pretty sure many people will look at the numbers without reading all the explanations you provided or without reading this thread. The fact that the Clearfog, which is based on the same SoC as the Helios4, shows completely different numbers for AES-128-CBC with 16-byte blocks is clearly confusing. I would prefer we display both values, with | without hw engine.
ag123 Posted September 18, 2018

I'm not sure whether to even call this a benchmark: https://forum.armbian.com/topic/8203-orangepione-h3-an-experiment-with-small-heatsink/

In an attempt to verify the effectiveness of a heatsink, I used some code which basically does a 1000x1000 matrix multiplication (single precision). I think 'traditionally' that's done with BLAS, calling the sgemm (SP) or dgemm (DP) functions. A 1000x1000 matrix multiply has a formula for the number of floating point ops, which is 2N^3 (i.e. 2 billion floating point ops). I'm not using BLAS but rather some C++ code that only does the matrix multiply. That allows me to do some basic verification of whether a small heatsink would after all even make a difference. I did some optimization, unrolled the loop, and surprisingly got a 10-fold increase in performance in terms of MFLOPS. The code is attached in the 2nd post of the thread.

I think matrix multiplication doesn't always reflect real-world scenarios: even where there are matrix multiplications, in real cases matrices are not necessarily sized this way (they could be much smaller or bigger) and are not necessarily square. But doing strictly square matrix computations does give a way to 'get a feel' for how the same matrix multiplication differs between frequencies etc.

Doing things like Linpack may give lower MFLOPS compared to this, as Linpack involves solving the matrices rather than simply multiplying them. In addition, how the optimization is applied can drastically change the MFLOPS throughput (e.g. in my case on the H3 the loop unrolling optimization achieved a 10-fold increase in MFLOPS, but the same optimization may not work as well on a different SoC, and on a superscalar SoC the CPU may be able to 'unroll the loop' in hardware even without this optimization). This is very much simply 'synthetic'.
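The 2N^3 bookkeeping as a quick calculation (the 2.5 s runtime is made up, purely to illustrate the MFLOPS formula):

$> awk -v n=1000 -v t=2.5 'BEGIN { printf "%.1f MFLOPS\n", 2 * n^3 / t / 1e6 }'
800.0 MFLOPS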
sfx2000 Posted October 1, 2018

@tkaiser Little quirk with sbc-bench from the git... Once the tests are done, it looks like the cores are left on their own and cooking... In a recent test, dropping the clocks again only happened after a shutdown, pulling power, and starting back up... Example here is a NanoPI-NEO, current Armbian Bionic image: http://ix.io/1nY2

Oddly enough, a couple of weeks back Tinker got into a bad place where it did a hard shutdown...
tkaiser Posted October 1, 2018

4 hours ago, sfx2000 said: Once tests are done, looks like the cores are left on their own and cooking....

'Cooking'? The 'performance' cpufreq governor is chosen, that's all. On an otherwise idle system there's not much consumption and temperature difference between idling at the lowest or highest cpufreq OPP anyway.
sfx2000 Posted October 3, 2018

On 9/30/2018 at 11:46 PM, tkaiser said: 'Cooking'? The 'performance' cpufreq governor is chosen, that's all. On an otherwise idle system there's not much consumption and temperature difference between idling at lowest or highest cpufreq OPP anyway.

Opened an issue on the git - don't change the governor, report what is in use, so folks can A/B changes... Anyway - the RK3288 Tinker, which can be (and often is) thermally challenged - I had to hack the sbc-bench script to not set the CPU governor to performance... gov/sched/swappiness:

perf - noop - 100 http://ix.io/1o7Y
Throttling statistics (time spent on each cpufreq OPP):
1800 MHz: 86.82 sec
1704 MHz: 83.66 sec
1608 MHz: 114.67 sec
1512 MHz: 155.11 sec
1416 MHz: 205.71 sec
1200 MHz: 169.03 sec
1008 MHz: 109.07 sec
816 MHz: 144.17 sec
600 MHz: 106.14 sec

schedutil - CFQ - 10 http://ix.io/1o9Y
Throttling statistics (time spent on each cpufreq OPP):
1800 MHz: 350.94 sec
1704 MHz: 121.37 sec
1608 MHz: 73.64 sec
1512 MHz: 92.46 sec
1416 MHz: 104.79 sec
1200 MHz: 92.22 sec
1008 MHz: 66.33 sec
816 MHz: 149.59 sec
600 MHz: 132.10 sec

Interesting numbers... feel free to walk through the rest... Without getting into u-boot and the DT to reset the lower limit - it would be nice to see if the RK can sprint to get fast, fall back a bit to recover, and sprint again... The RK3288 can do the limbo as low as 126MHz - right now with Armbian we're capped at the bottom at 600MHz - so the dynamic range between idle and maxed out is 10C, as the RK3288 with the provided HS idles at 60C in the current build. That, and going with PREEMPT vs normal... the PREEMPT stuff does weird things with IO and drivers; it would be nice to have a regular kernel within Armbian to bounce against.
sfx2000 Posted October 3, 2018

On 9/30/2018 at 11:46 PM, tkaiser said: 'Cooking'? The 'performance' cpufreq governor is chosen, that's all. On an otherwise idle system there's not much consumption and temperature difference between idling at lowest or highest cpufreq OPP anyway.

Same configs as the Tinker - testing the NanoPi NEO with stock Armbian clocks on mainline. My NEO isn't likely to throttle as it has good power and very good thermals with the heatsink and the case it's in... stock clocks have it underclocked a bit as it is, for power reasons (cf. the other Armbian docs on this board for IoT etc.)... gov/sched/swappiness:

perf - noop - 100 http://ix.io/1oac
schedutil - CFQ - 10 http://ix.io/1oai (this one throttled just a bit... but that's one sample)
tkaiser Posted October 3, 2018

10 hours ago, sfx2000 said: Interesting numbers

100% meaningless numbers now, since they no longer provide any insights (you cannot differentiate between throttling and, potentially inappropriate, cpufreq scaling behaviour).

10 hours ago, sfx2000 said: RK3288 can do the limbo as low as 126MHz - right now with Armbian we're capped at the bottom at 600MHz

For two simple reasons:

- allowing very low cpufreq OPPs trashes performance (on your Tinkerboard it's storage performance, due to the clockspeeds not increasing fast enough)
- the way DVFS works at the lower end of the scale, the differences in consumption and generated heat between silly ultra-low MHz values and a reasonable lower limit are negligible (we in Armbian TEST for stuff like this with every new SoC family!)

Closed your issue just now, spending additional time to explain: https://github.com/ThomasKaiser/sbc-bench/issues/4