Some basic benchmarks for Le Potato?

tkaiser · November 3, 2017

EDIT: Please don't trust the numbers at the top of this thread. Obviously there were bootloader/firmware issues that needed to be resolved; with those fixed, Potato performance will be somewhat higher.

Since I didn't find any numbers, may I ask owners of the device (@Igor, @TonyMac32?) to run 4 quick benchmarks? Two openssl runs, 7-zip and tinymembench. Please also mention which clockspeeds and which distro you used (Xenial preferred).

Thanks!

Tido · November 3, 2017

SDcard Samsung Evo+ 32GB

Linux lepotato 4.13.11-meson64 #96 SMP PREEMPT Fri Nov 3 01:27:06 CET 2017 aarch64

openssl speed rsa4096 -multi 4


root@lepotato:~# openssl version -a
OpenSSL 1.0.2g  1 Mar 2016
built on: reproducible build, date unspecified
platform: debian-arm64
options:  bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) blowfish(ptr) 
compiler: cc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
OPENSSLDIR: "/usr/lib/ssl"
root@lepotato:~# 
root@lepotato:~# openssl speed rsa4096 -multi 4
Forked child 0
Forked child 1
Forked child 2
Forked child 3
+DTP:4096:private:rsa:10
+DTP:4096:private:rsa:10
+DTP:4096:private:rsa:10
+DTP:4096:private:rsa:10
+R1:111:4096:10.05
+DTP:4096:public:rsa:10
+R1:111:4096:10.05
+R1:111:4096:10.05
+DTP:4096:public:rsa:10
+DTP:4096:public:rsa:10
+R1:111:4096:10.07
+DTP:4096:public:rsa:10
+R2:7770:4096:10.00
Got: +F2:3:4096:0.090541:0.001287 from 0
+R2:7770:4096:10.01
+R2:7777:4096:10.00
Got: +F2:3:4096:0.090541:0.001286 from 1
+R2:7783:4096:10.00
Got: +F2:3:4096:0.090721:0.001285 from 2
Got: +F2:3:4096:0.090541:0.001288 from 3
OpenSSL 1.0.2g  1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr) 
compiler: cc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
                  sign    verify    sign/s verify/s
rsa 4096 bits 0.022646s 0.000322s     44.2   3109.2

for i in 128 192 256; do openssl speed -elapsed -evp aes-${i}-cbc ; done


root@lepotato:~# for i in 128 192 256; do openssl speed -elapsed -evp aes-${i}-cbc ; done
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 32264602 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 22322941 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 9329487 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2899974 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 390089 aes-128-cbc's in 3.00s
OpenSSL 1.0.2g  1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr) 
compiler: cc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     172077.88k   476222.74k   796116.22k   989857.79k  1065203.03k
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-192-cbc for 3s on 16 size blocks: 31184019 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 64 size blocks: 19253234 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 256 size blocks: 7453968 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 1024 size blocks: 2217062 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 8192 size blocks: 293332 aes-192-cbc's in 3.00s
OpenSSL 1.0.2g  1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr) 
compiler: cc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-192-cbc     166314.77k   410735.66k   636071.94k   756757.16k   800991.91k
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 29870994 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 17384233 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 6378728 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1846681 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 241963 aes-256-cbc's in 3.00s
OpenSSL 1.0.2g  1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr) 
compiler: cc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     159311.97k   370863.64k   544318.12k   630333.78k   660720.30k

7z b


root@lepotato:~# 7z b

7-Zip 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=C,Utf16=off,HugeFiles=on,4 CPUs)

RAM size:    1850 MB,  # CPU hardware threads:   4
RAM usage:    850 MB,  # Benchmark threads:      4

Dict        Compressing          |        Decompressing
      Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
       KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS

22:    1942   289    652   1889  |    51700   398   1170   4664
23:    1942   291    680   1979  |    50868   398   1169   4655
24:    1928   291    712   2073  |    50381   399   1170   4674
25:    1921   292    752   2193  |    49890   399   1174   4691
----------------------------------------------------------------
Avr:          291    699   2034               399   1171   4671
Tot:          345    935   3352

Tido · November 3, 2017

Funny enough, I threw that code at my tinker board: openssl speed rsa4096 -multi 4 - it switched off about 2 seconds later. Maybe a hardware failure

tkaiser · November 3, 2017

1 hour ago, Tido said:

Funny enough, I threw that code at my tinker board: openssl speed rsa4096 -multi 4 - it switched off about 2 seconds later. Maybe a hardware failure

Undervoltage, it seems you forgot that the Tinkerboard is a pile of crap you can switch off even with light loads.

Regarding the S905X benchmark results, unfortunately the distro and clockspeed info is missing. Based on the information I assume it's Ubuntu Xenial and the Potato is running at slightly above 1.4 GHz.

These are your results:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     172077.88k   476222.74k   796116.22k   989857.79k  1065203.03k
aes-192-cbc     166314.77k   410735.66k   636071.94k   756757.16k   800991.91k
aes-256-cbc     159311.97k   370863.64k   544318.12k   630333.78k   660720.30k

And this is ROCK64 at stable 1.3 GHz:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     163161.40k   436259.80k   729289.90k   906723.33k   975929.34k
aes-192-cbc     152362.85k   375675.22k   582690.99k   693259.95k   733563.56k
aes-256-cbc     145928.50k   337163.26k   498586.20k   577371.48k   605145.77k

Smells like 1.3 GHz vs. 1.42 GHz (and not 1.5GHz). Would be great if someone could provide tinymembench results too.
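A rough sketch of the arithmetic behind that guess, assuming AES-CBC throughput on the same kind of cores and the same OpenSSL build scales linearly with CPU clock. The helper name `scale_mhz` is made up for illustration; the inputs are the 8192-byte columns from the two tables above.

```shell
#!/bin/sh
# Hypothetical helper: scale a known reference clock by a throughput ratio.
# Assumption: throughput is proportional to CPU clock speed.
# usage: scale_mhz REF_MHZ REF_THROUGHPUT MEASURED_THROUGHPUT
scale_mhz() {
    awk -v ref_mhz="$1" -v ref="$2" -v got="$3" \
        'BEGIN { printf "%.0f\n", ref_mhz * got / ref }'
}

# ROCK64 at 1300 MHz as reference vs. the Le Potato results
scale_mhz 1300 975929.34 1065203.03   # aes-128-cbc
scale_mhz 1300 733563.56 800991.91    # aes-192-cbc
scale_mhz 1300 605145.77 660720.30    # aes-256-cbc
```

All three key sizes land at roughly 1419 MHz, which is where the 1.42 GHz figure comes from.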

7-zip numbers are not that great since even an RPi 3 at 1200 MHz performs at this level. Strange.

Tido · November 4, 2017

8 hours ago, tkaiser said:

Undervoltage, it seems you forgot that the Tinkerboard is a pile of

If you don't catch irony, I will add a tag for you next time; I thought a smiley was enough.

Anyway, I unplugged the HDMI cable and replaced the 119cm Nexus 4 Micro USB cable with a 35cm cable

et voila:


root@tinkerboard:~# openssl speed rsa4096 -multi 4 
Forked child 0
Forked child 1
Forked child 2
Forked child 3
+DTP:4096:private:rsa:10
+DTP:4096:private:rsa:10
+DTP:4096:private:rsa:10
+DTP:4096:private:rsa:10
+R1:154:4096:10.01
+DTP:4096:public:rsa:10
+R1:154:4096:10.01
+DTP:4096:public:rsa:10
+R1:152:4096:10.04
+DTP:4096:public:rsa:10
+R1:155:4096:10.07
+DTP:4096:public:rsa:10
+R2:8670:4096:10.00
+R2:8740:4096:10.00
+R2:8601:4096:10.00
+R2:8734:4096:10.00
Got: +F2:3:4096:0.064968:0.001145 from 0
Got: +F2:3:4096:0.065000:0.001153 from 1
Got: +F2:3:4096:0.065000:0.001144 from 2
Got: +F2:3:4096:0.066053:0.001163 from 3
OpenSSL 1.0.2g  1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) blowfish(ptr) 
compiler: cc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM
                  sign    verify    sign/s verify/s
rsa 4096 bits 0.016313s 0.000288s     61.3   3474.6

and this tinker board runs as if it didn't know any better - Bazinga

- Btw, still no heatsink attached -

You specifically asked for Xenial, so that is what I used. Latest nightly:

Armbian_5.34.171104_Lepotato_Ubuntu_xenial_next_4.13.11.img

Edited November 4, 2017 by Tido
added distro of Le Potato

tkaiser · November 4, 2017

Seems I need to ask a second time: is anyone here able and willing to do a quick but correct benchmark test on Le Potato (not Tinkerboard, that's not interesting)?

I'm still interested in tinymembench numbers for this board and both openssl and 7-zip tests in an environment where cpufreq is monitored and throttling (if it would happen) gets noticed and avoided in a subsequent run.

Da Xue · November 4, 2017

@tkaiser You want the numbers without throttling? Memory speeds are set in u-boot and drastically affect performance. The boards have DDR3-2133, but higher doesn't always equate to better for tinymembench due to timings.

tkaiser · November 4, 2017

15 minutes ago, Da Xue said:

you want the numbers without throttling?

That would be great. Or at least I need to know which clockspeeds are used (I know nothing about current state of Le Potato kernel, whether cpufreq/DVFS is already working and if it's working how throttling is configured). I assumed S905X would run with 1500 MHz but both openssl and especially 7z numbers are too low for that.

17 minutes ago, Da Xue said:

memory speeds are set in u-boot and drastically affect performance

I know and that's the reason I was asking for tinymembench numbers (low 7-zip compression speed is often related to high memory latency).

Da Xue · November 4, 2017

@tkaiser The default bl30.bin permits the S905X to run at 1512MHz vs 1536MHz for the S905. I have a bl30.bin that can be configured up to 1680MHz but the performance isn't linear due to throttling or some other logic in the hardware/firmware. With the hacked pre-1.0 SCPI that Amlogic implemented, I really don't know how to monitor clock speeds to ensure that it doesn't throttle down. I have no experience in reliably monitoring clock speeds in software for ARM so if you know a way, please let me know and I will run the numbers for you.

tkaiser · November 4, 2017

46 minutes ago, Da Xue said:

I really don't know how to monitor clock speeds to ensure that it doesn't throttle down.

So there is either no cpufreq support or the numbers are bogus? Anyway, funnily enough the mostly useless sysbench pseudo-benchmark can be used for exactly this: estimating the clockspeed a specific CPU is running at, provided the CPU architecture and the build options of the binary are known (easy with upstream distro packages).

Can you please execute

sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4
sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=2
sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1
find /sys -name "scaling_available_frequencies"
cat /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state
armbianmonitor -u

Da Xue · November 6, 2017

I don't have armbian monitor but here are the results with the CPU set to 1680MHz and the DDR set to 2108MHz. The DDR can go past 2200MHz but I haven't tested the performance because it seemed that the preconfigured timing in u-boot negatively affects performance at higher speeds. I'll run the test at stock tomorrow or the day after.

7-Zip 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=C,Utf16=off,HugeFiles=on,4 CPUs)

RAM size:    1852 MB,  # CPU hardware threads:   4
RAM usage:    850 MB,  # Benchmark threads:      4

Dict        Compressing          |        Decompressing
      Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
       KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS

22:    2019   288    682   1964  |    53850   399   1218   4858
23:    2016   289    709   2054  |    51161   385   1217   4681
24:    2009   290    744   2160  |    52508   399   1219   4871
25:    2002   290    788   2286  |    52036   400   1223   4893
----------------------------------------------------------------
Avr:          289    731   2116               396   1219   4826
Tot:          342    975   3471

sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000


Test execution summary:
    total time:                          6.2510s
    total number of events:              10000
    total time taken by event execution: 24.9940
    per-request statistics:
         min:                                  2.50ms
         avg:                                  2.50ms
         max:                                  2.61ms
         approx.  95 percentile:               2.50ms

Threads fairness:
    events (avg/stddev):           2500.0000/0.71
    execution time (avg/stddev):   6.2485/0.00

sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 2

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000


Test execution summary:
    total time:                          12.5001s
    total number of events:              10000
    total time taken by event execution: 24.9964
    per-request statistics:
         min:                                  2.50ms
         avg:                                  2.50ms
         max:                                  2.60ms
         approx.  95 percentile:               2.50ms

Threads fairness:
    events (avg/stddev):           5000.0000/1.00
    execution time (avg/stddev):   12.4982/0.00

sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000


Test execution summary:
    total time:                          25.0038s
    total number of events:              10000
    total time taken by event execution: 25.0010
    per-request statistics:
         min:                                  2.50ms
         avg:                                  2.50ms
         max:                                  2.56ms
         approx.  95 percentile:               2.50ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   25.0010/0.00

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :   1941.6 MB/s (1.4%)
 C copy backwards (32 byte blocks)                    :   1944.8 MB/s (1.6%)
 C copy backwards (64 byte blocks)                    :   1915.5 MB/s (1.6%)
 C copy                                               :   1951.1 MB/s (1.5%)
 C copy prefetched (32 bytes step)                    :   1514.9 MB/s (0.3%)
 C copy prefetched (64 bytes step)                    :   1629.5 MB/s
 C 2-pass copy                                        :   1766.9 MB/s
 C 2-pass copy prefetched (32 bytes step)             :   1247.8 MB/s
 C 2-pass copy prefetched (64 bytes step)             :   1258.8 MB/s (0.2%)
 C fill                                               :   6068.0 MB/s
 C fill (shuffle within 16 byte blocks)               :   6068.4 MB/s
 C fill (shuffle within 32 byte blocks)               :   6068.3 MB/s
 C fill (shuffle within 64 byte blocks)               :   6068.5 MB/s
 ---
 standard memcpy                                      :   2026.4 MB/s (0.4%)
 standard memset                                      :   6072.0 MB/s
 ---
 NEON LDP/STP copy                                    :   2016.4 MB/s
 NEON LDP/STP copy pldl2strm (32 bytes step)          :   1365.3 MB/s (0.5%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :   1805.1 MB/s (0.3%)
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   2388.2 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   2385.8 MB/s
 NEON LD1/ST1 copy                                    :   2003.2 MB/s (1.5%)
 NEON STP fill                                        :   6072.4 MB/s
 NEON STNP fill                                       :   6015.9 MB/s
 ARM LDP/STP copy                                     :   2020.0 MB/s (0.2%)
 ARM STP fill                                         :   6072.4 MB/s
 ARM STNP fill                                        :   6015.9 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    4.0 ns          /     7.2 ns 
    131072 :    6.1 ns          /    10.5 ns 
    262144 :    7.6 ns          /    12.5 ns 
    524288 :   69.2 ns          /   118.9 ns 
   1048576 :   95.9 ns          /   127.8 ns 
   2097152 :  108.5 ns          /   160.0 ns 
   4194304 :  143.3 ns          /   179.9 ns 
   8388608 :  164.1 ns          /   190.8 ns 
  16777216 :  173.4 ns          /   196.3 ns 
  33554432 :  182.1 ns          /   199.9 ns 
  67108864 :  190.7 ns          /   215.9 ns 

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    4.0 ns          /     7.0 ns 
    131072 :    6.1 ns          /    10.4 ns 
    262144 :    7.7 ns          /    12.7 ns 
    524288 :   68.6 ns          /   105.3 ns 
   1048576 :   95.9 ns          /   127.8 ns 
   2097152 :  108.0 ns          /   159.6 ns 
   4194304 :  111.1 ns          /   131.2 ns 
   8388608 :  113.7 ns          /   138.4 ns 
  16777216 :  115.1 ns          /   149.5 ns 
  33554432 :  115.4 ns          /   145.4 ns 
  67108864 :  115.7 ns          /   149.9 ns

tkaiser · November 6, 2017

2 minutes ago, Da Xue said:

I don't have armbian monitor but here are the results with the CPU set to 1680MHz and the DDR set to 2108MHz.

Thank you! Just to interpret the sysbench numbers: which distro do you use? I need the output from 'lsb_release -c' and 'file /usr/bin/sysbench'.

Da Xue · November 6, 2017

Just now, tkaiser said:

Thank you! Just to interpret the sysbench numbers: which distro do you use? I need the output from 'lsb_release -c' and 'file /usr/bin/sysbench'.

ubuntu xenial

/usr/bin/sysbench: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, for GNU/Linux 3.7.0, BuildID[sha1]=01b3ec2b7f6a203ed

This is just a bootstrapped Ubuntu image with the kernel on GitHub.

Da Xue · November 6, 2017

No matter what I change the clock speed to in uboot to populate the PSCI, it will always say 1512 in the sys frequencies.

tkaiser · November 6, 2017

6 minutes ago, Da Xue said:

No matter what I change the clock speed to in uboot to populate the PSCI, it will always say 1512 in the sys frequencies

And according to sysbench it's even slightly lower (more like 1470 MHz but that is pretty close to the reported 1512 MHz). In other words: too early to do any benchmarking now

Da Xue · November 6, 2017

No matter what I change the clock speed tables to in uboot to populate the PSCI, it will always say 1512MHz in the sysfs frequencies. I think the A53 cores with crypto extensions perform slightly slower than their non-crypto counterparts in generic workloads.

7 minutes ago, tkaiser said:

And according to sysbench it's even slightly lower (more like 1470 MHz but that is pretty close to the reported 1512 MHz). In other words: too early to do any benchmarking now

Mind you that I am running this at 1680MHz and not 1512MHz for those numbers. The stock numbers are slower but not proportionally. The gains from anything over 1584MHz are very small if not negative.

tkaiser · November 6, 2017

5 minutes ago, Da Xue said:

I think the A53 cores with crypto extensions perform slightly slower than their non-crypto counterparts in generic workloads.

At least I found no evidence for this. What about the output from

find /sys -name "scaling_available_frequencies"
cat /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state

The sysbench numbers you generated indicate that the SoC's real clockspeed is slightly lower than 1.5GHz (this is the only good use case for sysbench, since it does not depend on external memory bandwidth/latency). But we know from the past that two SoC vendors are cheating on us: Amlogic reports bogus values through the sysfs interface, and so do the RPi folks (the same happens on RPi 2 and 3).

Da Xue · November 6, 2017

The scaling_available_frequencies are bogus and hard-coded in the trusted firmware, so I have no visibility into the exact speed. It always reports 1512. I am not getting this node: /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state. Do I have to enable a module? I am running numbers now for 1584/1056.

tkaiser · November 6, 2017

12 minutes ago, Da Xue said:

I am not getting this node: /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state. Do I have to enable a module?

No idea, I only know that once cpufreq support is enabled in any of the kernels we use, this node should appear. Maybe it's at a different path (find /sys -name time_in_state)?
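Once the node exists, a minimal sketch of how it could be used to notice throttling: snapshot time_in_state before and after a benchmark run and print only the frequencies that accumulated time in between. The function name `time_in_state_delta` and the /tmp file names are made up for illustration, and the sketch assumes the usual two-column format (frequency in kHz, time spent in ticks).

```shell
#!/bin/sh
# Hypothetical helper, not part of armbianmonitor or sysbench.
# usage: time_in_state_delta BEFORE_FILE AFTER_FILE
# Prints "<MHz> <ticks>" for every frequency whose counter grew between snapshots.
time_in_state_delta() {
    awk 'NR == FNR { before[$1] = $2; next }          # first file: remember counters
         ($2 - before[$1]) > 0 {                      # second file: report growth only
             printf "%d %d\n", $1 / 1000, $2 - before[$1]
         }' "$1" "$2"
}

# Intended use on a board with working cpufreq stats (path as quoted in this thread):
#   stats=/sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state
#   cat "$stats" > /tmp/before
#   sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4
#   cat "$stats" > /tmp/after
#   time_in_state_delta /tmp/before /tmp/after
```

If more than one frequency shows up for a single run, the clock changed while the benchmark was running and the result should be repeated. This only helps when the reported frequencies themselves are trustworthy.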

Anyway, the most important information for me was: not ready yet, so we have to take the above benchmark results with a huge grain of salt. @Tido's numbers look like slightly above 1.4GHz, yours like slightly below 1.5GHz. Once you have figured out how to reliably enable the desired clockspeeds and we can monitor them through sysfs, it gets interesting again.

Da Xue · November 6, 2017

Here's the bizarre part. When I run it with the default bl30.bin (supposed to be 1512MHz), I get the following:

sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000


Test execution summary:
    total time:                          7.6267s
    total number of events:              10000
    total time taken by event execution: 30.4950
    per-request statistics:
         min:                                  3.05ms
         avg:                                  3.05ms
         max:                                  3.15ms
         approx.  95 percentile:               3.05ms

Threads fairness:
    events (avg/stddev):           2500.0000/0.00
    execution time (avg/stddev):   7.6237/0.00

Do you have any results I can compare with for other boards?

tkaiser · November 6, 2017

11 minutes ago, Da Xue said:

Do you have any results I can compare with for other boards?

Pinebook (A64) and ROCK64 (RK3328), also with the Ubuntu Xenial arm64 sysbench distro package (that's important! Otherwise the numbers are meaningless, since sysbench is a compiler settings benchmark and not able to measure hardware performance):

echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
for i in $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies) ; do
	echo $i >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
	echo -e "$(( $i / 1000)): \c"
	sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4 2>&1 | grep 'execution time'
done

Pinebook (no throttling happening, just compare time_in_state before/after every run). CPU clockspeeds above 1152 MHz are disabled by Allwinner's budget cooling settings, which is why the 1200 and 1344 results are the same as for 1152, since that is the real clockspeed:

480:     execution time (avg/stddev):   19.1398/0.01
600:     execution time (avg/stddev):   15.2882/0.01
720:     execution time (avg/stddev):   12.7485/0.01
816:     execution time (avg/stddev):   11.2629/0.01
912:     execution time (avg/stddev):   10.1254/0.01
960:     execution time (avg/stddev):   9.5806/0.00
1008:     execution time (avg/stddev):   9.0986/0.01
1056:     execution time (avg/stddev):   8.6765/0.01
1104:     execution time (avg/stddev):   8.3067/0.01
1152:     execution time (avg/stddev):   7.9538/0.00
1200:     execution time (avg/stddev):   7.9521/0.00
1344:     execution time (avg/stddev):   7.9843/0.01
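To make the cap visible in numbers, here is a small sketch that turns the loop's output into estimated real clockspeeds. It assumes one row reflects a real clock (1152 MHz here) and that sysbench runtime is inversely proportional to clock speed; the function name `effective_clocks` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper. usage: effective_clocks REF_MHZ < loop_output
# Reads lines like "1152:     execution time (avg/stddev):   7.9538/0.00" and prints
# "<nominal MHz> <estimated MHz>". REF_MHZ must be one of the rows in the input.
effective_clocks() {
    awk -v ref_mhz="$1" '
        NF == 0 { next }                          # skip blank lines
        {
            mhz = $1; sub(/:$/, "", mhz)          # "1152:" -> "1152"
            split($NF, t, "/")                    # "7.9538/0.00" -> t[1] is avg seconds
            n++; nominal[n] = mhz; secs[n] = t[1]
            if ((mhz + 0) == (ref_mhz + 0)) ref_secs = t[1]
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%d %.0f\n", nominal[i], ref_mhz * ref_secs / secs[i]
        }'
}

# A few of the Pinebook rows from above, with 1152 MHz as the trusted reference
effective_clocks 1152 <<'EOF'
960:     execution time (avg/stddev):   9.5806/0.00
1152:     execution time (avg/stddev):   7.9538/0.00
1200:     execution time (avg/stddev):   7.9521/0.00
1344:     execution time (avg/stddev):   7.9843/0.01
EOF
```

The rows labelled 1200 and 1344 come out at roughly 1152 and 1148 MHz, so the nominal values above 1152 MHz are not what the CPU is really running at.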

ROCK64 (no throttling happened):

408:     execution time (avg/stddev):   23.4966/0.01
600:     execution time (avg/stddev):   15.4553/0.00
816:     execution time (avg/stddev):   11.3848/0.01
1008:     execution time (avg/stddev):   9.1798/0.01
1200:     execution time (avg/stddev):   7.6882/0.00
1296:     execution time (avg/stddev):   7.1025/0.00

ROCK64 is slightly slower, which is most probably related to L1 cache latencies or something like that. You'll find a lot of additional information here: https://forum.armbian.com/topic/1748-sbc-consumptionperformance-comparisons/ (note especially that how the sysbench binary was built is the most important factor, and that devices with totally different DRAM configuration/performance show identical sysbench scores)

tkaiser · November 6, 2017

37 minutes ago, Da Xue said:

When I run it with the default bl30.bin (supposed to be 1512MHz), I get the following:

That's 1200 MHz in reality.
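A sketch of how such an estimate can be derived from the numbers in this thread, taking the ROCK64 4-thread result at 1296 MHz (7.1025 s) as reference. It assumes the same sysbench binary, the same kind of cores, and runtime inversely proportional to clock speed; the helper name `estimate_mhz` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: estimate a real clock from sysbench execution times.
# usage: estimate_mhz REF_MHZ REF_SECONDS MEASURED_SECONDS
estimate_mhz() {
    awk -v ref_mhz="$1" -v ref_s="$2" -v got_s="$3" \
        'BEGIN { printf "%.0f\n", ref_mhz * ref_s / got_s }'
}

estimate_mhz 1296 7.1025 7.6237   # Le Potato, default bl30.bin
estimate_mhz 1296 7.1025 6.2485   # Le Potato, bl30.bin set to 1680 MHz
```

This gives about 1207 MHz for the default bl30.bin and about 1473 MHz for the 1680 MHz setting. ROCK64 is slightly slower per clock than the Pinebook, so the choice of reference board shifts the result by a few MHz.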

Da Xue · November 6, 2017

So it would appear that the S905X is running somewhere around 1210MHz with the default bl30.bin. With the modified one, it operates around 1475MHz? I think HardKernel has the source for the S905 trusted firmware and I do not. I'm not quite sure how the throttling logic works with 4 cores or if there's a current throttler.

@tkaiser do you have the single thread results?

tkaiser · November 6, 2017

2 minutes ago, Da Xue said:

So it would appear that the S905X is running somewhere around 1210MHz with the default bl30.bin. With the modified one, it operates around 1475MHz?

Yes. That's what the benchmarks are telling us.

2 minutes ago, Da Xue said:

do you have the single thread results?

I'll generate them only for ROCK64, since I have tried this many times in the past and sysbench's cpu test scales linearly with both the count of CPU cores and clockspeed (in other words: it's a 'benchmark' that cannot be used to model any real-world task out there, since it just calculates prime numbers inside the CPU cores/caches):

408:     execution time (avg/stddev):   91.3720/0.00
600:     execution time (avg/stddev):   61.7398/0.00
816:     execution time (avg/stddev):   45.2492/0.00
1008:     execution time (avg/stddev):   36.5560/0.00
1200:     execution time (avg/stddev):   30.6810/0.00
1296:     execution time (avg/stddev):   28.3843/0.00

I would suggest contacting Amlogic pretty soon since their next 'Amlogic is cheating on us!!1!!' drama is just around the corner (like last year when Willy Tarreau discovered that all reported S905 CPU clockspeeds above 1500 MHz were bogus)

tkaiser · November 6, 2017

@willmore In case you have your 'overclocked' ODROID-C2 around and running an arm64 Ubuntu Xenial (!!!), it might be worth giving the above simple sysbench run a try, walking through the available cpufreq OPPs and reporting results for '--num-threads=1' and '--num-threads=4'.

willmore · November 6, 2017

Someone said my name? Sorry it took me a while to run this, but they offer a 100MHz clock speed and that takes a very long time to run--especially with one thread. I have a high degree of confidence that there is no throttling as IIRC, I tested this setup with cpuburn and got no throttling. I can't imagine this being more demanding than that!

Here's the data:

num-threads=4

100:     execution time (avg/stddev):   99.4764/0.02
250:     execution time (avg/stddev):   37.7647/0.01
500:     execution time (avg/stddev):   18.6581/0.00
1000:     execution time (avg/stddev):   9.2395/0.00
1296:     execution time (avg/stddev):   7.1300/0.00
1536:     execution time (avg/stddev):   6.0117/0.00
1656:     execution time (avg/stddev):   5.5794/0.01
1680:     execution time (avg/stddev):   5.4853/0.00
1752:     execution time (avg/stddev):   5.2694/0.01

num-threads=1

100:     execution time (avg/stddev):   369.1851/0.00
250:     execution time (avg/stddev):   146.8992/0.00
500:     execution time (avg/stddev):   73.3360/0.00
1000:     execution time (avg/stddev):   36.6221/0.00
1296:     execution time (avg/stddev):   28.2551/0.00
1536:     execution time (avg/stddev):   24.4123/0.00
1656:     execution time (avg/stddev):   22.0989/0.00
1680:     execution time (avg/stddev):   21.7828/0.00
1752:     execution time (avg/stddev):   21.3559/0.00

tkaiser · November 6, 2017

16 minutes ago, willmore said:

I can't imagine this being more demanding than that!

Sysbench is pretty lightweight, yes. If throttling happened, the stddev value would increase (in your case they're between 0.00 and 0.02, so that's just some background activity). And we can also do the math:

6.0117 / 5.2694 * 1536 --> 1752.3762099
24.4123 / 21.3559 * 1536 --> 1755.828263

So, thanks. Your numbers confirm that we can use sysbench in a very special mode to report real CPU clockspeeds, comparing the same CPU cores using the same binaries, where sysfs nodes and cpufreq drivers are cheating on us (really can't believe that we see this with Amlogic again after their S905/S912 disaster last year).
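The two calculations above can be written as a small helper. This is only a sketch: `estimate_real_mhz` is a made-up name, and the result is only as good as the assumption that the reference OPP (1536 MHz here) is reported honestly and that the benchmark scales linearly with clockspeed.

```python
def estimate_real_mhz(ref_mhz, ref_secs, test_secs):
    """Scale a trusted reference clock by the ratio of execution times."""
    return ref_mhz * ref_secs / test_secs

# willmore's ODROID-C2 numbers: 1536 MHz as reference, 1752 MHz as the OPP under test
four_threads = estimate_real_mhz(1536, 6.0117, 5.2694)
one_thread = estimate_real_mhz(1536, 24.4123, 21.3559)

print(f"4 threads: {four_threads:.1f} MHz")  # 4 threads: 1752.4 MHz
print(f"1 thread:  {one_thread:.1f} MHz")    # 1 thread:  1755.8 MHz
```

Both estimates land within a few MHz of the reported 1752 MHz, so on this board the top OPP appears to be real.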

willmore · November 6, 2017

3 minutes ago, tkaiser said:

Sysbench is pretty lightweight, yes. If throttling happened, the stddev value would increase (in your case they're between 0.00 and 0.02, so that's just some background activity). And we can also do the math:

6.0117 / 5.2694 * 1536 --> 1752.3762099

24.4123 / 21.3559 * 1536 --> 1755.828263

So, thanks. Your numbers confirm that we can use sysbench in a very special mode to report real CPU clockspeeds, comparing the same CPU cores using the same binaries, where sysfs nodes and cpufreq drivers are cheating on us (really can't believe that we see this with Amlogic again after their S905/S912 disaster last year).

Ahh, yes, for a clock speed surrogate, you'd want exactly a task like that--something that doesn't stress the CPU so much that power and thermal issues come into play. You'd also want to avoid anything with a lot of memory usage as that will be inelastic with CPU speed.

Yeah, the clock speed issue for the S905. I remember that well. Maybe some day I can release the info I have on how that was detected.....

tkaiser · November 6, 2017

5 minutes ago, willmore said:

Yeah, the clock speed issue for the S905. I remember that well. Maybe some day I can release the info I have on how that was detected.....

Huh? The few script lines above are sufficient to 'detect' this sort of cheating. It's just that almost nobody is benchmarking correctly since people prefer Phoronix or Geekbench BS / pseudo benchmarks.

See also http://wiki.ant-computing.com/Choosing_a_processor_for_a_build_farm#Devices (it's as easy as walking through all cpufreq OPPs and checking whether a benchmark performs differently at different clockspeeds; if it doesn't, you either have real throttling happening or some mechanism preventing the use of some clockspeeds -- that's what we have on almost all Android kernels by now)
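As a rough sketch of that check in Python: parse the 'execution time' line that sysbench prints (format taken from the results in this thread) and flag every OPP where the time did not drop in proportion to the reported clock. The names `parse_exec_time` and `suspicious_opps` and the 2% tolerance are assumptions made for illustration, not part of sysbench or any existing tool.

```python
import re

# Matches the sysbench summary line quoted in this thread, e.g.
# "execution time (avg/stddev):   6.0117/0.00"
_EXEC_TIME = re.compile(r"execution time \(avg/stddev\):\s*([\d.]+)/([\d.]+)")

def parse_exec_time(sysbench_output):
    """Return (avg_seconds, stddev) from sysbench output, or None if absent."""
    match = _EXEC_TIME.search(sysbench_output)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

def suspicious_opps(results, tolerance=0.02):
    """Flag OPPs whose time does not drop as the reported clock rises.

    results: list of (reported_mhz, seconds), sorted by reported_mhz.
    tolerance: hypothetical threshold for "no real improvement".
    """
    flagged = []
    for (prev_mhz, prev_secs), (mhz, secs) in zip(results, results[1:]):
        expected = prev_secs * prev_mhz / mhz  # time if the reported clock were real
        if secs > expected * (1 + tolerance):
            flagged.append(mhz)
    return flagged

# willmore's ODROID-C2 results with --num-threads=4
c2_four_threads = [
    (100, 99.4764), (250, 37.7647), (500, 18.6581), (1000, 9.2395),
    (1296, 7.1300), (1536, 6.0117), (1656, 5.5794), (1680, 5.4853),
    (1752, 5.2694),
]
print(suspicious_opps(c2_four_threads))  # [] : every step scales as expected
```

A board whose firmware silently caps the clock would show the same execution time at two different reported OPPs, and the higher one would end up in the returned list.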

willmore · November 6, 2017

6 minutes ago, tkaiser said:

Huh? The few script lines above are sufficient to 'detect' this sort of cheating. It's just that almost nobody is benchmarking correctly since people prefer Phoronix or Geekbench BS / pseudo benchmarks.

See also http://wiki.ant-computing.com/Choosing_a_processor_for_a_build_farm#Devices (it's as easy as walking through all cpufreq OPPs and checking whether a benchmark performs differently at different clockspeeds; if it doesn't, you either have real throttling happening or some mechanism preventing the use of some clockspeeds -- that's what we have on almost all Android kernels by now)

Back when the clock speed cheating was first detected, I had been doing just the kind of benchmarking that detected it. I approached a vendor in the area with my results--looking to find out why there was a performance plateau at 1.5GHz. That was one of the things that triggered the investigation into the issue, which led to the revelation that the firmware was lying. There's some data there that might be of interest historically. Nothing that matters to this thread, really.

The method that you've shown in this thread is similar to what I was doing back then and I'm confident that you can detect cheating with this.
