
Posted
On 8/1/2018 at 11:37 PM, tkaiser said:

If ARMv8 Crypto Extensions are available, funnily enough the openssl numbers with smaller data chunks are higher with 1.0.2g than with 1.1.0g (see the NanoPC T3+ numbers or those for the Vim2). Did you already check whether all our kernels have the respective AF_ALG and NEON switches enabled?

I thought more about this and also ran openssl and cryptsetup through strace and checked openssl build configuration in Ubuntu.

  • Stock Ubuntu (and most likely Debian) OpenSSL will use userspace crypto. So if there are CPU instructions (NEON, ARMv8 CE) it should use them, but it won't use HW engines like sun4i-ss or CESA. At least we get comparable numbers as long as we don't compare OpenSSL 1.0.x results with 1.1.x directly.
  • This means that the AES numbers in the table will not resemble performance in real-world scenarios that use in-kernel crypto (like disk and filesystem encryption)
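tkaiser's question about the AF_ALG kernel switches can be answered directly from the kernel config. A minimal sketch; the option name CONFIG_CRYPTO_USER_API_SKCIPHER is the usual userspace skcipher interface switch, and the config path in the comment is an assumption to adjust per board:

```shell
# Check an (uncompressed) kernel config for the AF_ALG skcipher switch
# that in-kernel crypto users like cryptsetup depend on.
check_af_alg() {
    # $1: path to a kernel config file
    if grep -q '^CONFIG_CRYPTO_USER_API_SKCIPHER=[ym]' "$1"; then
        echo "AF_ALG skcipher: enabled"
    else
        echo "AF_ALG skcipher: disabled"
    fi
}

# typical invocation (path is an assumption, adjust for your kernel):
# check_af_alg "/boot/config-$(uname -r)"
```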
On 8/1/2018 at 11:37 PM, tkaiser said:

Well, identifying such stuff is IMO part of the journey.

But people will still use your results table to compare boards, so IMO it's worth adding a note for boards where HW crypto engines are available. ARMv8 CE is not a crypto engine; its numbers should depend on CPU performance and be affected by throttling, compared to, e.g., CESA which runs at a fixed clock.

Posted
43 minutes ago, NicoD said:

XU4 Armbian Stretch http://ix.io/1iWL

 

Thank you. Unfortunately our DT clocks the little cores at just 1400 MHz, and some minor throttling happened:

2000 MHz: 2083.27 sec
1900 MHz:   25.23 sec
1800 MHz:    7.58 sec

Anyway, the numbers are usable. I'll add them with the next Results.md update.

Posted
8 hours ago, zador.blood.stained said:

But people will still use your results table to compare boards, so IMO it's worth adding a note for boards where HW crypto engines are available.

 

For now I added just a warning: https://github.com/ThomasKaiser/sbc-bench/blob/master/Results.md -- I'll add a TODO section soon where I'll try to explain how to deal with 'numbers generated' vs. 'insights generated': coming up with other benchmarks that describe real-world use cases more properly. Wrt the crypto stuff: most probably using cryptsetup and then also doing some real-world tasks that can be measured (involving a ton of other dependencies like filesystem performance and so on).

Posted
On 7/31/2018 at 11:50 PM, tkaiser said:

Eagerly waiting for more results (from other boards) since we're starting to get some understanding of why common benchmark results are that weird.

 

Here is the result for a Jetson TK1 with a more or less mainline kernel 4.14 and Debian Buster (sorry, currently no Stretch available): http://ix.io/1j0f

 

There are some discrepancies in both cpufreq OPP tables!? Any idea how to read those numbers?

 

Never had a stability issue, and normally the CPU is not a bottleneck for me, so I'm not really worried about throttling. I'm using my Jetsons with the standard little fan, but maybe it's time to enable the temperature sensors somehow :).

Posted
1 hour ago, ccbur said:

Jetson TK1 with a more or less mainline kernel 4.14 and Debian Buster (sorry, currently no Stretch available): http://ix.io/1j0f

 

In this case the distro version doesn't matter since the cpuminer test fails on 32-bit platforms anyway (and it's there that lib and GCC versions would have been important). Buster still uses 7-zip v16.02 so the numbers are somewhat comparable.

 

1 hour ago, ccbur said:

There are some discrepancies in both cpufreq OPP tables!? Any idea how to read those numbers?

 

Well, sbc-bench uses @wtarreau's nice mhz tool to calculate real clockspeeds and I hope I use it correctly. It seems the CPU clockspeeds are in reality controlled by some firmware and the cpufreq driver reports nonsense, since on an idle system we see measured clockspeeds much higher than the OPPs, only to be limited to ~1565 MHz while running the 7-zip benchmark:

Cpufreq OPP:  204    Measured: 224.991/224.846/224.977
Cpufreq OPP:  306    Measured: 334.677/334.810/334.586
Cpufreq OPP:  408    Measured: 444.497/444.497/443.872
Cpufreq OPP:  510    Measured: 554.177/554.061/554.038
Cpufreq OPP:  612    Measured: 671.761/670.167/670.167
Cpufreq OPP:  714    Measured: 779.803/780.207/779.591
Cpufreq OPP:  816    Measured: 889.699/889.881/889.642
Cpufreq OPP:  918    Measured: 1003.636/1006.018/1005.933
Cpufreq OPP: 1020    Measured: 1116.065/1115.752/1115.655
Cpufreq OPP: 1122    Measured: 1225.857/1225.945/1225.916
Cpufreq OPP: 1224    Measured: 1335.698/1333.532/1335.864
Cpufreq OPP: 1326    Measured: 1451.911/1452.270/1451.944
Cpufreq OPP: 1428    Measured: 1562.109/1561.977/1562.147
Cpufreq OPP: 1530    Measured: 1562.373/1562.241/1558.959
Cpufreq OPP: 1632    Measured: 1561.882/1561.769/1561.769
Cpufreq OPP: 1734    Measured: 1561.731/1561.580/1561.901
Cpufreq OPP: 1836    Measured: 1561.939/1561.901/1561.920
Cpufreq OPP: 1938    Measured: 1562.241/1562.090/1561.693
Cpufreq OPP: 2014    Measured: 1561.750/1562.165/1561.693
Cpufreq OPP: 2116    Measured: 1561.807/1561.977/1561.825
Cpufreq OPP: 2218    Measured: 1561.825/1559.053/1561.637
Cpufreq OPP: 2320    Measured: 1561.825/1561.958/1561.618

7-zip contains its own measuring routine and seems to agree:

CPU Freq:  1425  1524  1557  1558  1557  1558  1558  1557  1557
CPU Freq:  1524  1540  1558  1558  1557  1558  1557  1558  1558
CPU Freq:  1557  1560  1560  1560  1560  1559  1560  1559  1559
CPU Freq:  1562  1562  1563  1563  1563  1563  1563  1563  1563

As a reference, the Tinkerboard (quad-core A17 in RK3288) scores 5350 7-zip MIPS at 1730 MHz while your Jetson scores less: 5290. Memory bandwidth is much higher on the Jetson, so the 1565 MHz starts to look plausible.
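A quick sanity check of that comparison is to normalize both scores to MIPS per MHz. A rough sketch with the numbers quoted above; it deliberately ignores that A17 and A15 are not identical in IPC:

```shell
# Normalize a 7-zip score to MIPS per MHz so boards at different
# clockspeeds become roughly comparable.
mips_per_mhz() {
    awk -v mips="$1" -v mhz="$2" 'BEGIN { printf "%.2f\n", mips / mhz }'
}

mips_per_mhz 5350 1730   # Tinkerboard (A17) -> 3.09 MIPS/MHz
mips_per_mhz 5290 1565   # Jetson TK1 (A15) -> 3.38 MIPS/MHz
```

With the assumed ~1565 MHz ceiling the Jetson's per-MHz score lands in the same ballpark, which fits the throttling theory.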

 

I have sysbench numbers from an ODROID-XU4 here, made only on the A15 cluster. 'sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4' took 62 seconds, but that was on Debian Jessie (GCC 4.7), so numbers on Buster (GCC 8.1) will be completely different.

 

Are the reported temperatures real? ~33°C seems way too low, and most probably the sysfs node for CPU temperature is a different one than the one my script used. Can you provide the output of the following please?

find /sys -name "thermal_zone*" | while read ; do
    echo "${REPLY}: $(cat ${REPLY}/type) $(cat ${REPLY}/temp)"
done

 

Posted (edited)
49 minutes ago, tkaiser said:

Well, sbc-bench uses @wtarreau's nice mhz tool to calculate real clockspeeds and I hope I use it correctly.

What you can do is increase the 2nd argument; it's the number of loops you want to run. At 1000 you can miss some precision. I tend to use 100000 on medium-power boards like NanoPis. On the Clearfog at 2 GHz, "mhz 3 100000" takes 150 ms, which can be too much for your use case. It reports 1999 MHz. With 1000 it has a slightly larger variation (1996 to 2000). It's probably OK at 10000. I picked up bad habits on x86 with intel_pstate taking a while to start.

 

Maybe you should always take a small and a large count in your tests. This would more easily show if there's some automatic frequency adjustment: the larger count would report a significantly higher frequency in this case, because part of the loop would run at the higher frequency. Just an idea.
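The two-run idea could be sketched like this. The mhz invocation in the comment is an assumption based on the "mhz 3 100000" example above, and the 2% threshold is an arbitrary choice:

```shell
# Flag a suspicious delta between a short and a long measurement,
# which would hint at frequency adjustment happening mid-run that
# the cpufreq driver doesn't report.
flag_freq_adjustment() {
    # $1: MHz from a short run, $2: MHz from a long run
    awk -v s="$1" -v l="$2" 'BEGIN {
        delta = (l - s) / s * 100
        if (delta > 2) print "suspicious: long run " delta "% faster"
        else print "ok"
    }'
}

# hypothetical usage -- parsing depends on the mhz output format:
# short=$(./mhz 3 1000 | ...); long=$(./mhz 3 100000 | ...)
# flag_freq_adjustment "$short" "$long"
```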

 

Or probably you should have two distinct tools: "sbc-bench" and "sbc-diag". The former would report measured values over short periods, and the latter would be used with deeper tests to try to figure out what's wrong when the first values look suspicious.

 

Edited by wtarreau
wrong version of the tool = wrong execution time :-)
Posted
7 hours ago, wtarreau said:

What you can do is increase the 2nd argument; it's the number of loops you want to run. At 1000 you can miss some precision. I tend to use 100000 on medium-power boards like NanoPis.

 

Thanks for the suggestion. Just did a test on the slowest device I have around (a single Cortex-A8 at 912 MHz). Before:

Cpufreq OPP:  240    Measured: 113.205/238.015/271.572
Cpufreq OPP:  624    Measured: 622.264/620.388/622.821
Cpufreq OPP:  864    Measured: 859.927/863.043/861.208
Cpufreq OPP:  912    Measured: 910.931/870.387/432.613

And after (now with 100000):

Cpufreq OPP:  240    Measured: 216.979/237.445/237.738
Cpufreq OPP:  624    Measured: 584.162/285.411/622.403
Cpufreq OPP:  864    Measured: 862.854/825.796/862.574
Cpufreq OPP:  912    Measured: 908.568/910.098/869.719

Ok, benchmarking such a slow system, where background activity is pretty high all the time due to there being just a single CPU core, is rather pointless. But I'll keep the 100000 for now and change the OPP checking routine that runs after the most demanding benchmark to step down from the highest to the lowest OPP, to hopefully spot hidden throttling as much as possible (like on the Jetson, Amlogic S905X/S912 and of course the Raspberry Pi).

 

Providing a separate sbc-diag is a good idea. This tool could also focus on testing for anomalies wrt cpufreq scaling (sbc-bench immediately switches to the performance governor to prevent results being harmed by strange cpufreq scaling behaviour, so it's not suited to answer 'why does my system behave so slowly?')

Posted
13 hours ago, ccbur said:

There are some discrepancies in both cpufreq OPP table!?

 

The numbers above (mhz, 7-zip) already suggested the Jetson does 'hidden' throttling at around ~1565 MHz. The openssl benchmark might also be useful to compare results and to determine clockspeeds. The first row is from an ODROID-XU4 (A15 cores at 2000 MHz), the 2nd from your Jetson:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ODROID-XU4       59359.98k    66782.42k    70469.97k    71398.06k    71740.07k    71527.08k
Jetson TK1       40004.73k    50301.27k    54554.88k    55691.61k    56008.70k    55973.21k

If we use the XU4 numbers as a base (1998 MHz) and do some simple math, we can estimate the real clockspeed of the NVIDIA:

  • 1998 / (59359.98 / 40004.73) = 1347
  • 1998 / (66782.42 / 50301.27) = 1505
  • 1998 / (70469.97 / 54554.88) = 1547
  • 1998 / (71398.06 / 55691.61) = 1558
  • 1998 / (71740.07 / 56008.70) = 1560
  • 1998 / (71527.08 / 55973.21) = 1563

With less initialization overhead, i.e. larger data chunks (1K or above), this calculation approach seems to work fairly well (the openssl benchmark doesn't rely on memory bandwidth, so the two different A15 implementations can be compared directly). Your A15 cores being limited to ~1565 MHz under load seems very plausible.
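The per-chunk-size math above as a tiny helper; the inputs are the table values, with 1998 MHz as the XU4 reference clock:

```shell
# Scale a reference board's known clockspeed by the ratio of the two
# openssl throughputs to estimate the other board's real clockspeed.
est_clock() {
    # $1: reference MHz, $2: reference kB/s, $3: other board's kB/s
    awk -v mhz="$1" -v ref="$2" -v other="$3" \
        'BEGIN { printf "%.0f\n", mhz / (ref / other) }'
}

est_clock 1998 71398.06 55691.61   # 1024-byte chunks -> ~1558 MHz
est_clock 1998 71740.07 56008.70   # 8192-byte chunks -> ~1560 MHz
```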

Posted

 

@tkaiser

What do you think about the fact that 7-zip multicore never uses close to 100% of the CPUs? Would we need to take that into account?

Here XU4
 

System health while running 7-zip multi core benchmark:

Time       big.LITTLE   load %cpu %sys %usr %nice %io %irq   Temp
20:53:32: 2000/1400MHz  5.96  11%   0%  10%   0%   0%   0%  51.0°C
20:54:02: 2000/1400MHz  5.86  66%   1%  65%   0%   0%   0%  55.0°C
20:54:32: 2000/1400MHz  5.14  61%   2%  59%   0%   0%   0%  61.0°C
20:55:02: 2000/1400MHz  5.43  71%   1%  70%   0%   0%   0%  63.0°C
20:55:34: 1900/1400MHz  5.47  65%   1%  63%   0%   0%   0%  75.0°C
20:56:05: 2000/1400MHz  5.54  67%   1%  66%   0%   0%   0%  56.0°C
20:56:35: 2000/1400MHz  5.22  68%   1%  66%   0%   0%   0%  65.0°C
20:57:05: 2000/1400MHz  5.27  69%   1%  67%   0%   0%   0%  69.0°C
20:57:36: 2000/1400MHz  5.12  60%   1%  58%   0%   0%   0%  73.0°C
20:58:07: 2000/1400MHz  5.31  66%   1%  64%   0%   0%   0%  64.0°C

Here the NanoPC-T3+
 

System health while running 7-zip multi core benchmark:

Time       big.LITTLE   load %cpu %sys %usr %nice %io %irq   Temp
18:22:37: 1400/1400MHz  5.55  10%   0%   9%   0%   0%   0%  41.0°C
18:23:07: 1400/1400MHz  6.21  72%   0%  71%   0%   0%   0%  48.0°C
18:23:37: 1400/1400MHz  5.91  75%   0%  74%   0%   0%   0%  49.0°C
18:24:08: 1400/1400MHz  5.18  76%   0%  75%   0%   0%   0%  49.0°C
18:24:38: 1400/1400MHz  5.95  78%   0%  77%   0%   0%   0%  46.0°C
18:25:08: 1400/1400MHz  6.23  77%   1%  76%   0%   0%   0%  50.0°C
18:25:38: 1400/1400MHz  6.81  73%   0%  72%   0%   0%   0%  49.0°C
18:26:08: 1400/1400MHz  6.62  74%   0%  74%   0%   0%   0%  49.0°C


The NanoPC is around 75% and the XU4 is well below 70%.
Would we get "more useful" numbers if we did, for example:
for the XU4: 8734 / 68 * 100 = 12844
for the T3+: 10226 / 75 * 100 = 13634.67
Just a question.
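For what it's worth, the proposed normalization as a one-liner; whether it yields "more useful" numbers is exactly the open question:

```shell
# Scale a measured 7-zip MIPS score up to what 100% CPU utilization
# would hypothetically give.
normalize_mips() {
    awk -v mips="$1" -v util="$2" 'BEGIN { printf "%.0f\n", mips / util * 100 }'
}

normalize_mips 8734 68    # XU4 at 68% -> 12844
normalize_mips 10226 75   # NanoPC-T3+ at 75% -> 13635
```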

I use Blender because that uses 100% of all cores.

Posted
43 minutes ago, NicoD said:

What do you think about the fact that 7-zip multicore never uses close to 100% of the CPUs?

 

That's great, since the purpose of this test is an overall estimate of how performant the relevant board will be when doing 'server stuff'. If you click directly on the 7-zip link here https://github.com/ThomasKaiser/sbc-bench#7-zip you get an explanation of what's going on at the bottom of that page:

 

Quote

The test code doesn't use FPU and SSE. Most of the code is 32-bit integer code. Only some minor part in compression code uses also 64-bit integers. RAM and Cache bandwidth are not so important for these tests. The latencies are much more important.

 

The CPU's IPC (Instructions per cycle) rate is not very high for these tests. The estimated value of test's IPC is 1 (one instruction per cycle) for modern CPU. The compression test has big number of random accesses to RAM and Data Cache. So big part of execution time the CPU waits the data from Data Cache or from RAM. The decompression test has big number of pipeline flushes after mispredicted branches. Such low IPC means that there are some unloaded CPU resources. But the CPU with Hyper-Threading feature can load these CPU resources using two threads. So Hyper-Threading provides pretty big improvement in these tests.

 

In other words: as expected.

 

Your use case with Blender is something entirely different and 7-zip scores are useless for it, primarily because Blender involves floating point stuff (while 7-zip focuses on integer performance and low memory latency). It's always about the use case :)

 

If we look closely at the other results we see that the S905, for example, has an advantage with cpuminer compared to the rather similar A53 SoCs S905X and RK3328 (which perform rather identically with 7-zip). Maybe the root cause of cpuminer's better scores is also responsible for better Blender results on the S905 compared to other A53 SoCs? It needs a different benchmark and a lot of cross-testing with the real application to get an idea of how to reliably test for your use case.

Posted
3 hours ago, tkaiser said:

the 7-zip link here https://github.com/ThomasKaiser/sbc-bench#7-zip then you get an explanation what's going on at the bottom of that page:


Just seen that your Rock64 results are with a 2GB model. I'll do the same with my 4GB one. Just interested if there's a difference with 7-zip multicore. It's only 4 cores though.
Now doing a Blender bench between Xenial default and nightly, again just out of curiosity. I read it's now clocked at 1.39 GHz.
I'll give you the sbc-bench results later if they're interesting. I don't think they will be. It's already 100%...

Edit: Blender just crashed at 1.39 GHz after 30 minutes. At 1.3 GHz no problem: 1h17m55s.
I'll try again; maybe it's not stable enough.
The 2nd try did it, so the crash could have been a fluke.

Xenial Rock64 4GB results: http://ix.io/1j7d
Again very different from the other distros.

I'm also wondering if zram would make a big difference on the 8-core 2GB devices?
Cheers

Posted
On 8/2/2018 at 3:54 AM, zador.blood.stained said:

EDIT: not sure if OpenSSL uses AF_ALG by default, but pretty sure that cryptsetup does

 

@zador.blood.stained I don't think there are any distro OpenSSL packages that are built with hardware engine support.

Also, even if an engine is installed, OpenSSL doesn't use any engine by default; you need to configure it in openssl.cnf.
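For reference, enabling an engine by default in openssl.cnf looks roughly like the following. This is a sketch of the usual ENGINE config stanza (the section names are arbitrary labels, and it is not taken from a tested Armbian setup):

```ini
# top of openssl.cnf: must appear in the default (unnamed) section
openssl_conf = openssl_init

[openssl_init]
engines = engine_section

[engine_section]
afalg = afalg_section

[afalg_section]
engine_id = afalg
default_algorithms = ALL
```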

But you're right about cryptsetup (dm-crypt), it uses AF_ALG by default. I was wondering why there was such a delta between my 'cryptsetup benchmark' and 'openssl speed' tests on the Helios4.

 

I just did a test compiling openssl-1.1.1-pre8 with AF_ALG enabled (... enable-engine enable-afalgeng ...) and here are the benchmark results on the Helios4:

 

$> openssl speed -evp aes-xxx-cbc -engine afalg -elapsed

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc        745.71k     3018.47k    11270.23k    36220.25k    90355.03k   101094.74k
aes-256-cbc        739.49k     2964.93k    11085.23k    34178.05k    82597.21k    90461.53k

 

The difference is quite interesting: with AF_ALG it performs much better on bigger block sizes, but poorly on very small block sizes.

 

$> openssl speed -evp aes-xxx-cbc -elapsed

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc      44795.07k    55274.84k    59076.27k    59920.04k    59719.68k    59353.77k
aes-256-cbc      34264.93k    40524.42k    42168.92k    42496.68k    42535.59k    42500.10k

 

System : Linux helios4 4.14.57-mvebu #2 SMP Tue Jul 24 08:29:55 UTC 2018 armv7l GNU/Linux

 

Supposedly you can get even better performance with cryptodev, but I think the Crypto API (AF_ALG) is more elegant and easier to set up.

 

Once I have a cleaner install of this AF_ALG (or cryptodev) setup, I will run sbc-bench and send you (@tkaiser) the output.

Posted
4 minutes ago, gprovost said:

The difference is quite interesting: with AF_ALG it performs much better on bigger block sizes, but poorly on very small block sizes.

 

 

Quote

With the presence of numerous user space cryptographic libraries, one may ask why is there a need for the kernel to expose its kernel crypto API to user space. As there are system calls and potentially memory copies needed before a cipher can be invoked, it should be typically slower than user space shared libraries.

So most likely the syscall + data copy overhead for each block kills the performance with smaller block sizes.
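A back-of-envelope calculation supports this. From the Helios4 numbers above, the implied cost per 16-byte block (a sketch, treating openssl's 'k' as 1000 bytes/s):

```shell
# Per-block cost implied by an 'openssl speed' throughput figure:
# throughput in kB/s and block size in bytes -> microseconds per block.
per_block_us() {
    awk -v kbps="$1" -v bs="$2" \
        'BEGIN { printf "%.1f\n", bs / (kbps * 1000) * 1000000 }'
}

per_block_us 745.71 16     # AF_ALG, 16-byte blocks -> ~21.5 us each
per_block_us 44795.07 16   # userspace, 16-byte blocks -> ~0.4 us each
```

Roughly 20 µs of fixed overhead per operation would be invisible at 16 KB blocks but completely dominates at 16 bytes.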

 

7 minutes ago, gprovost said:

Supposedly you can even have better perf with cryptodev, but I think Crypto API AF_ALK is more elegant and easier to setup.

If I remember correctly I compared cryptodev with AF_ALG on some platform (most likely with cryptotest) and AF_ALG was slightly faster than cryptodev.

Posted
On 8/2/2018 at 10:05 PM, tkaiser said:

Are the reported temperatures real? ~33°C seem way too low and most probably the sysfs node for CPU temperature is a different one than the one my script used. Can you provide the output of the following please?


find /sys -name "thermal_zone*" | while read ; do
    echo "${REPLY}: $(cat ${REPLY}/type) $(cat ${REPLY}/temp)"
done

 

 

I needed to add CONFIG_THERMAL / CONFIG_TEGRA_THERM to my kernel. Now the temperatures are accessible via sysfs. Here are some numbers taken during an sbc-bench run (while 7-zip was running):

/sys/devices/virtual/thermal/thermal_zone3: pllx 51000
/sys/devices/virtual/thermal/thermal_zone1: mem 47000
/sys/devices/virtual/thermal/thermal_zone2: gpu 47000
/sys/devices/virtual/thermal/thermal_zone0: cpu 51000
/sys/class/thermal/thermal_zone3: pllx 51000
/sys/class/thermal/thermal_zone1: mem 47000
/sys/class/thermal/thermal_zone2: gpu 47000
/sys/class/thermal/thermal_zone0: cpu 51000

 

But the temperatures reported during this run seem much lower (cp. http://ix.io/1j4m)

 

I did a

watch 'cat /sys/devices/virtual/thermal/thermal_zone0/temp&&cat /sys/class/hwmon/hwmon0/temp1_input'

during an sbc-bench run, and the thermal_zone0 temperature was always between 38 and 53°C while hwmon0 was way lower.

 

And regarding the OPP cpufreq calculation:

Now the measured clockspeed is throttled to ~1565 MHz during the first OPP check (no load upfront!), and boosted beyond 2400 MHz in the second OPP check at the end of the run. Strange, isn't it?

Cpufreq OPP:  204    Measured: 346.544/335.861/346.856
Cpufreq OPP:  306    Measured: 503.798/516.667/515.484
Cpufreq OPP:  408    Measured: 685.631/684.615/676.372
Cpufreq OPP:  510    Measured: 854.971/849.269/856.052
Cpufreq OPP:  612    Measured: 1030.608/1035.536/1035.546
Cpufreq OPP:  714    Measured: 1205.473/1201.910/1202.302
Cpufreq OPP:  816    Measured: 1374.352/1372.307/1375.552
Cpufreq OPP:  918    Measured: 1554.848/1562.581/1555.110
Cpufreq OPP: 1020    Measured: 1721.640/1725.044/1721.787
Cpufreq OPP: 1122    Measured: 1894.246/1895.047/1894.980
Cpufreq OPP: 1224    Measured: 2064.348/2064.053/2065.067
Cpufreq OPP: 1326    Measured: 2238.183/2244.731/2240.394
Cpufreq OPP: 1428    Measured: 2414.822/2414.966/2414.735
Cpufreq OPP: 1530    Measured: 2414.909/2414.533/2415.227
Cpufreq OPP: 1632    Measured: 2415.284/2415.458/2408.336
Cpufreq OPP: 1734    Measured: 2419.049/2408.164/2415.892
Cpufreq OPP: 1836    Measured: 2415.342/2421.255/2414.764
Cpufreq OPP: 1938    Measured: 2408.595/2416.615/2416.442
Cpufreq OPP: 2014    Measured: 2416.702/2411.013/2416.818
Cpufreq OPP: 2116    Measured: 2416.905/2417.339/2408.739
Cpufreq OPP: 2218    Measured: 2417.455/2417.542/2424.252
Cpufreq OPP: 2320    Measured: 2417.513/2410.725/2417.629

This behaviour doesn't make benchmarking easy... ;-)

 

Posted
35 minutes ago, ccbur said:

during a sbc-bench and the thermal_zone0 temperature was always between 38 and 53°C while hwmon0 was way lower.

 

Ok, then to get more reasonable readouts on your system the following will 'fix' this:

mkdir -p /etc/armbianmonitor/datasources
ln -s /sys/devices/virtual/thermal/thermal_zone0/temp /etc/armbianmonitor/datasources/soctemp

The real clockspeeds reported look OK (though I've no idea why) since this time you get 7680 7-zip MIPS while it was 5290 before. On the other hand, your tinymembench numbers at the beginning were lower than last time. So this time your CPU cores were bottlenecked to ~1565 MHz in the beginning, and later the clockspeeds jumped up to ~2415 MHz.

 

35 minutes ago, ccbur said:

This behaviour doesn't make benchmarking easy... ;-)

 

Benchmarking is never easy, and that's what the various monitoring stuff in sbc-bench tries to take care of. At least we identified that the cpufreq driver has no control over cpufreq on your Jetson (just like on Raspberries or various Amlogic SoCs). No idea what's responsible for this, but most probably it's an MCU running inside the SoC controlled by a firmware BLOB?

 

Maybe another run with /etc/armbianmonitor/datasources/soctemp created first and the latest sbc-bench will give more clues. Or you might search sysfs for "*volts" entries, since maybe cpufreq/DVFS scaling also depends on undervoltage (just like on the Raspberry Pi ;) ) or something like that?

 

BTW: In the meantime I prepared sbc-bench to run on x86 SBCs too, but being too lazy to search for my UP Board I did the changes on a Xeon box I had to check cpufreq governor behaviour on anyway. With the ondemand and performance governors TurboBoost can be spotted nicely: http://ix.io/1j79

Posted

@tkaiser Here is the link to my latest benchmark of the Helios4: http://ix.io/1jCy

 

I get the following results for OpenSSL speed:

OpenSSL results:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc       1280.56k     5053.40k    18249.13k    52605.27k   102288.04k   109390.51k
aes-128-cbc       1285.51k     5030.68k    18256.13k    53001.90k   100128.09k   109188.44k
aes-192-cbc       1276.82k     4959.19k    18082.22k    51421.53k    96897.71k   103093.59k
aes-192-cbc       1290.35k     4961.09k    17777.24k    51629.74k    95647.06k   102596.61k
aes-256-cbc       1292.07k     5037.99k    17762.90k    50542.25k    92782.59k    98298.54k
aes-256-cbc       1281.35k     5050.94k    17874.77k    49915.90k    93164.89k    98822.83k

 

In order to leverage the hw crypto engine, I had no choice but to use the OpenSSL 1.1.1 lib (openssl-1.1.1-pre8), and I decided to use cryptodev-linux instead of AF_ALG since it gives me slightly better results (+5-10%).

 

Here are a few findings regarding the OpenSSL engines implementation:

 

As stated in the changelog

Changes between 1.0.2h and 1.1.0  [25 Aug 2016]

 *) Added the AFALG engine. This is an async capable engine which is able to
     offload work to the Linux kernel. In this initial version it only supports
     AES128-CBC. The kernel must be version 4.1.0 or greater.
     [Catriona Lucey]

So with the Debian Stretch package OpenSSL 1.1.0f, or any more recent 1.1.0 version, the only cipher supported by the AFALG engine was effectively AES-128-CBC:

$> openssl engine -c
(dynamic) Dynamic engine loading support
(afalg) AFALG engine support
 [AES-128-CBC]

 

Starting with OpenSSL 1.1.1, even though it is not mentioned anywhere in the changelog, AES-192-CBC and AES-256-CBC are supported by the AFALG engine:

$> openssl engine -c
(dynamic) Dynamic engine loading support
(afalg) AFALG engine support
 [AES-128-CBC, AES-192-CBC, AES-256-CBC]

 

But one thing much more exciting about OpenSSL 1.1.1 is the following:

 Changes between 1.1.0h and 1.1.1 [xx XXX xxxx]

  *) Add devcrypto engine.  This has been implemented against cryptodev-linux,
     then adjusted to work on FreeBSD 8.4 as well.
     Enable by configuring with 'enable-devcryptoeng'.  This is done by default
     on BSD implementations, as cryptodev.h is assumed to exist on all of them.
     [Richard Levitte]

So with 1.1.1 it is pretty straightforward to use cryptodev: no need to patch or configure anything in openssl. OpenSSL automatically detects whether the cryptodev module is loaded and offloads crypto operations to it if present.

$> openssl engine -c
(devcrypto) /dev/crypto engine
 [DES-CBC, DES-EDE3-CBC, BF-CBC, AES-128-CBC, AES-192-CBC, AES-256-CBC, AES-128-CTR, AES-192-CTR, AES-256-CTR, AES-128-ECB, AES-192-ECB, AES-256-ECB, CAMELLIA-128-CBC, CAMELLIA-192-CBC, CAMELLIA-256-CBC, MD5, SHA1]
(dynamic) Dynamic engine loading support

 

Based on this info, and assuming that sooner rather than later openssl 1.1.1 will be available in Debian Stretch (via backports most probably), I think the best approach to add openssl crypto engine support in Armbian is the cryptodev one. This way we can support all the ciphers now. I will look at how to properly patch the openssl_1.1.0f-3+deb9u2 dpkg to activate cryptodev support. @zador.blood.stained maybe you have a different opinion on the topic?

 

 

Posted
23 minutes ago, gprovost said:

I get the following result for OpenSSL speed


OpenSSL results:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc       1280.56k     5053.40k    18249.13k    52605.27k   102288.04k   109390.51k
aes-128-cbc       1285.51k     5030.68k    18256.13k    53001.90k   100128.09k   109188.44k
aes-192-cbc       1276.82k     4959.19k    18082.22k    51421.53k    96897.71k   103093.59k
aes-192-cbc       1290.35k     4961.09k    17777.24k    51629.74k    95647.06k   102596.61k
aes-256-cbc       1292.07k     5037.99k    17762.90k    50542.25k    92782.59k    98298.54k
aes-256-cbc       1281.35k     5050.94k    17874.77k    49915.90k    93164.89k    98822.83k

 

In order to leverage the hw crypto engine, I had no choice but to use the OpenSSL 1.1.1 lib (openssl-1.1.1-pre8)

 

So you simply replaced a lib somewhere, with the openssl binary still being 1.1.0f?

 

The initialization overhead with data chunks below 1KB is interesting. These were my numbers with the same SoC at the same clockspeed on the Clearfog Pro:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc      44566.57k    54236.33k    58027.01k    58917.21k    59400.19k    59408.38k
aes-128-cbc      44378.63k    54520.66k    58114.47k    59056.47k    59225.43k    59331.93k
aes-192-cbc      39354.78k    46780.01k    49606.14k    50319.02k    50328.92k    50435.41k
aes-192-cbc      39281.18k    46864.04k    49414.40k    50279.08k    50536.45k    50369.88k
aes-256-cbc      35510.68k    41226.05k    43100.59k    43753.13k    43950.08k    43816.28k
aes-256-cbc      35009.43k    41104.00k    43218.77k    43696.81k    43909.12k    43958.27k

 

Posted

As Zador said, offloading to the hw engine has a little overhead that becomes obvious in tiny-block benchmarks.

 

On 8/4/2018 at 2:40 AM, zador.blood.stained said:

so most likely the syscall + data copy overhead for each block kills the performance with smaller block sizes. 

 

 

5 minutes ago, tkaiser said:

So you simply replaced a lib somewhere with openssl binary still being 1.1.0f? 

 

That's correct. I just downloaded the openssl src from the official website and then did the following (directly on the target):

$> ./config shared enable-engine enable-dso enable-afalgeng enable-devcryptoeng --prefix=/opt/openssl --openssldir=/opt/openssl

$> make

$> sudo make install

$> export LD_LIBRARY_PATH=/opt/openssl/lib

$> openssl version
OpenSSL 1.1.0f  25 May 2017 (Library: OpenSSL 1.1.1-pre8 (beta) 20 Jun 2018)

 

Posted
2 hours ago, gprovost said:

OpenSSL 1.1.0f 25 May 2017 (Library: OpenSSL 1.1.1-pre8 (beta) 20 Jun 2018)

 

Thank you, the latest sbc-bench version should log this better. Also interesting: it's not only twice the crypto performance but also half the CPU utilization at the same time. When I tested on the Clearfog I had a constant 50% CPU utilization (%usr) with this single-threaded test, while with cryptodev on your Helios4 it looks like this (all %sys):

Time        CPU    load %cpu %sys %usr %nice %io %irq   Temp
08:17:40: 1600MHz  1.07  33%  33%   0%   0%   0%   0%  54.0°C
08:17:50: 1600MHz  1.14  17%  16%   0%   0%   0%   0%  54.0°C
08:18:00: 1600MHz  1.12  31%  30%   0%   0%   0%   0%  54.0°C
08:18:10: 1600MHz  1.10  22%  21%   0%   0%   0%   0%  54.0°C
08:18:20: 1600MHz  1.16  24%  23%   0%   0%   0%   0%  54.0°C
08:18:30: 1600MHz  1.13  28%  27%   0%   0%   0%   0%  54.0°C
08:18:40: 1600MHz  1.11  20%  19%   0%   0%   0%   0%  54.0°C
08:18:50: 1600MHz  1.18  31%  30%   0%   0%   0%   0%  54.0°C
08:19:00: 1600MHz  1.15  15%  15%   0%   0%   0%   0%  54.0°C
08:19:10: 1600MHz  1.12  31%  30%   0%   0%   0%   0%  53.0°C

 

Posted
On 8/8/2018 at 6:54 PM, tkaiser said:

Also interesting: it's not only twice as much crypto performance but also half as much CPU utilization at the same time. When I tested on the Clearfog I had constant 50% CPU utilization (%usr) with this single threaded test while with cryptodev on your Helios4 it looks like this (all %sys)

I must admit it's quite cool :P

 

Now I'm trying to make apache2 (for a Nextcloud use case) offload HTTPS onto cryptodev. Not trivial at all :( Need to recompile stuff.

Posted

@tkaiser Why in your benchmark table are you showing the 16-byte results for AES-128-CBC while for AES-256-CBC you show the 16KByte results?

I'm obviously asking because for the Helios4 showing the perf on 16-byte blocks while using cryptodev is a bit unfair :P

Posted
18 minutes ago, gprovost said:

Why in your benchmark table are you showing the 16-byte results for AES-128-CBC while for AES-256-CBC you show the 16KByte results?

 

To create awareness that the amount of data to test with is important, and to point out that initialization overhead is something to consider.

 

The whole approach is not to generate graphs and numbers to compare without using the brain but to generate insights instead.

Posted
15 hours ago, tkaiser said:

The whole approach is not to generate graphs and numbers to compare without using the brain but to generate insights instead.

I understand that, and I'm sure everyone greatly appreciates the work you are doing with sbc-bench, since it was something clearly missing out there... I think it could actually become the baseline reference for SBC benchmarking.

I'm pretty sure many people will look at the numbers without reading all the explanations you provided or without reading this thread. The fact that the Clearfog, which is based on the same SoC as the Helios4, shows completely different numbers for AES-128-CBC with 16-byte blocks is clearly confusing. I would prefer we display both values, with | without the hw engine.

Posted

I'm not sure whether to even call this a benchmark:

https://forum.armbian.com/topic/8203-orangepione-h3-an-experiment-with-small-heatsink/

In an attempt to verify the effectiveness of a heatsink, I used some code which basically does a 1000x1000 matrix multiplication (single precision).

I think 'traditionally' that's done with BLAS, calling the sgemm (SP) or dgemm (DP) functions. A 1000x1000 matrix multiply has a formula for the number of floating point ops, 2N^3 (i.e. 2 billion floating point ops). I'm not using BLAS but rather some C++ code that only does the matrix multiply.

That allows me to do some basic verification of whether a small heatsink would even make a difference. I did some optimization, unrolled the loop, and surprisingly got a 10-fold increase in performance in terms of MFLOPS. The code is attached in the 2nd post in the thread.
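The 2N^3 bookkeeping described above can be sketched like this (plain Python with a small N so it finishes quickly; the post above used C++, so the absolute MFLOPS figures are not comparable):

```python
import random
import time

def matmul_mflops(n):
    """Time a naive NxN matrix multiply and derive MFLOPS from the
    conventional 2*N^3 floating point operation count."""
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [[random.random() for _ in range(n)] for _ in range(n)]
    c = [[0.0] * n for _ in range(n)]
    t0 = time.perf_counter()
    for i in range(n):
        for k in range(n):       # i-k-j loop order for better locality
            aik = a[i][k]
            row_b = b[k]
            row_c = c[i]
            for j in range(n):
                row_c[j] += aik * row_b[j]
    elapsed = time.perf_counter() - t0
    return 2.0 * n ** 3 / elapsed / 1e6   # MFLOPS

print(round(matmul_mflops(100), 1), "MFLOPS")
```

As the post notes, the same formula applied to the same workload at different cpufreq settings is what makes the heatsink comparison possible in the first place.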

 

I think matrix multiplication doesn't always reflect real-world scenarios: even where real code does multiply matrices, they are not necessarily sized this way (they could be much smaller or bigger) and are not necessarily square. But doing strictly square matrix computations does give a way to 'get a feel' for how the same matrix multiplication differs between frequencies etc.

Running something like Linpack may give lower MFLOPS compared to this, as Linpack involves solving systems of equations rather than simply multiplying matrices.

In addition, how the optimization is applied can drastically change the MFLOPS throughput (e.g. in my case for the H3, the loop unrolling optimization achieved a 10-fold increase in MFLOPS, but the same optimization may not work as well on a different SoC, or a superscalar CPU may be able to 'unroll the loop' in hardware even without this optimization).

So this is very much just 'synthetic'.

Posted

@tkaiser

 

Little quirk with sbc-bench on the git...

 

Once the tests are done, it looks like the cores are left on their own and cooking....

 

Recent test, and after.... getting the clocks back down required a shutdown, pulling power, and a restart...

 

Example here is NanoPI-NEO, current Armbian bionic image.... http://ix.io/1nY2

 

(screenshot attached: Screen Shot 2018-09-30 at 6:36:48 PM)

 

Oddly enough, a couple of weeks back Tinker got into a bad place where it did a hard shutdown...

Posted
4 hours ago, sfx2000 said:

Once tests are done, looks like the cores are left on their own and cooking....

 

'Cooking'? The 'performance' cpufreq governor is chosen, that's all. On an otherwise idle system there's not much consumption and temperature difference between idling at lowest or highest cpufreq OPP anyway.

Posted
On 9/30/2018 at 11:46 PM, tkaiser said:

'Cooking'? The 'performance' cpufreq governor is chosen, that's all. On an otherwise idle system there's not much consumption and temperature difference between idling at lowest or highest cpufreq OPP anyway.

 

Opened up an issue on the git - don't change the governor - report what is in use, so folks can A/B changes...

 

Anyways - RK3288 Tinker, which can be / often is thermally challenged - I had to hack the sbc-bench script to not set the CPU governor to performance...

 

gov / sched / swappiness...

 

perf - noop - 100

 

http://ix.io/1o7Y

Throttling statistics (time spent on each cpufreq OPP): 
1800 MHz: 86.82 sec 
1704 MHz: 83.66 sec 
1608 MHz: 114.67 sec 
1512 MHz: 155.11 sec 
1416 MHz: 205.71 sec 
1200 MHz: 169.03 sec 
1008 MHz: 109.07 sec 
816 MHz: 144.17 sec 
600 MHz: 106.14 sec

schedutil - CFQ - 10

 

http://ix.io/1o9Y

Throttling statistics (time spent on each cpufreq OPP): 
1800 MHz: 350.94 sec 
1704 MHz: 121.37 sec 
1608 MHz: 73.64 sec 
1512 MHz: 92.46 sec 
1416 MHz: 104.79 sec 
1200 MHz: 92.22 sec 
1008 MHz: 66.33 sec 
816 MHz: 149.59 sec 
600 MHz: 132.10 sec
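Tables like the two above come from the kernel's per-CPU cpufreq residency counters (`/sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state`, one `<freq_kHz> <time>` pair per line, with time in USER_HZ ticks, typically 10 ms). A minimal sketch of that bookkeeping, assuming USER_HZ=100 and using sample counter values chosen to reproduce two rows from the tables above:

```python
# Turn two time_in_state snapshots (before/after a benchmark run)
# into "time spent on each cpufreq OPP" in seconds.

def parse_time_in_state(text):
    """Parse '<freq_kHz> <ticks>' lines into a {freq: ticks} dict."""
    stats = {}
    for line in text.strip().splitlines():
        freq_khz, ticks = line.split()
        stats[int(freq_khz)] = int(ticks)
    return stats

def residency_seconds(before, after):
    # one tick = 10 ms assuming the usual USER_HZ of 100
    return {f: (after[f] - before.get(f, 0)) / 100.0
            for f in sorted(after, reverse=True)}

# Hypothetical snapshot values (1800 MHz row works out to 86.82 s,
# 600 MHz row to 106.14 s, matching the first table above)
before = parse_time_in_state("1800000 100\n600000 500\n")
after = parse_time_in_state("1800000 8782\n600000 11114\n")
for freq, secs in residency_seconds(before, after).items():
    print(freq // 1000, "MHz:", secs, "sec")
```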

 

Interesting numbers... feel free to walk thru the rest...

 

Without getting into u-boot and the DT to reset the lower limit - it would be nice to see if the RK can sprint to get fast, fall back a bit to recover, and sprint again... The RK3288 can do the limbo as low as 126 MHz - right now with Armbian we're capped at 600 MHz at the bottom - so the dynamic range between idle and maxed out is 10°C, as the RK3288 with the provided heatsink idles at 60°C in the current build.

 

That, and going with PREEMPT vs. a normal kernel there... the PREEMPT stuff does weird things with IO and drivers, so it would be nice to have a regular kernel within Armbian to bounce against.

Posted
On 9/30/2018 at 11:46 PM, tkaiser said:

'Cooking'? The 'performance' cpufreq governor is chosen, that's all. On an otherwise idle system there's not much consumption and temperature difference between idling at lowest or highest cpufreq OPP anyway.

 

Same configs as the Tinker - testing the NanoPi NEO with stock Armbian clocks on mainline. My NEO isn't likely to throttle since it has good power and very good thermals with the heatsink and the case it's in... stock clocks have it underclocked a bit as it is, for power reasons - see the other Armbian docs on this board for IoT etc...

 

gov / sched / swappiness

 

perf - noop - 100

 

http://ix.io/1oac

 

schedutil - CFQ - 10

 

http://ix.io/1oai

 

(this one throttled just a bit... but that's one sample)

 

Posted
10 hours ago, sfx2000 said:

Interesting numbers

 

100% meaningless numbers now, since they do not provide any insights any more (you cannot differentiate between throttling and -- potentially inappropriate -- cpufreq scaling behavior).

 

10 hours ago, sfx2000 said:

RK3288 can do the limbo as low as 126MHz - right now with Armbian we're capped at the bottom at 600MHz

 

For two simple reasons:

  • allowing very low cpufreq OPPs trashes performance (on your Tinker Board it's storage performance, due to the clockspeeds not increasing fast enough)
  • the way DVFS works at the lower end of the scale, the differences in consumption and generated heat between silly ultra-low MHz values and a reasonable lower limit are negligible (we in Armbian TEST for stuff like this with every new SoC family!)
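The second point follows from the usual first-order DVFS model, P ≈ k·f·V²: once the regulator hits its voltage floor, dropping the clock further only scales an already small power figure linearly while performance drops proportionally. A sketch with an entirely hypothetical OPP table (these voltages and the constant are illustrative, not measurements of the RK3288):

```python
# First-order dynamic power model: P = k * f * V^2.
# Hypothetical OPP table with a voltage floor at 1.0 V.

K = 1.0e-9  # arbitrary constant absorbing switched capacitance etc.

opp_table = {  # MHz -> volts (illustrative values)
    1800: 1.30,
    1200: 1.10,
    600: 1.00,   # regulator floor reached
    126: 1.00,   # voltage can't drop any further
}

def dynamic_power(mhz, volts):
    """Dynamic power in watts for a given clock and core voltage."""
    return K * (mhz * 1e6) * volts ** 2

for mhz, v in sorted(opp_table.items(), reverse=True):
    print(mhz, "MHz @", v, "V:", round(dynamic_power(mhz, v), 3), "W")
```

In this model the absolute savings from 600 MHz down to 126 MHz are a small fraction of the top-OPP power, while performance drops almost 5x; and on a real idle system the cores are mostly clock-gated in idle states anyway, shrinking the difference further.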

Closed your issue just now, spending additional time to explain: https://github.com/ThomasKaiser/sbc-bench/issues/4

 
