tkaiser Posted November 3, 2017 Posted November 3, 2017 EDIT: Please don't trust the numbers appearing at the top of this thread. Obviously there were bootloader/firmware issues that needed to be resolved, and afterwards Potato performance will be somewhat higher.

Since I didn't find any numbers, may I ask owners of the device ( @Igor, @TonyMac32 ?) to run 4 quick benchmarks? Two times openssl, 7-zip and tinymembench. Please also mention which clockspeeds and which distro you used (Xenial preferred). Thanks!
Tido Posted November 3, 2017 Posted November 3, 2017 SDcard Samsung Evo+ 32GB

Linux lepotato 4.13.11-meson64 #96 SMP PREEMPT Fri Nov 3 01:27:06 CET 2017 aarch64

openssl speed rsa4096 -multi 4

Spoiler

root@lepotato:~# openssl version -a
OpenSSL 1.0.2g 1 Mar 2016
built on: reproducible build, date unspecified
platform: debian-arm64
options: bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) blowfish(ptr)
compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
OPENSSLDIR: "/usr/lib/ssl"
root@lepotato:~#
root@lepotato:~# openssl speed rsa4096 -multi 4
Forked child 0
Forked child 1
Forked child 2
Forked child 3
+DTP:4096:private:rsa:10
+DTP:4096:private:rsa:10
+DTP:4096:private:rsa:10
+DTP:4096:private:rsa:10
+R1:111:4096:10.05
+DTP:4096:public:rsa:10
+R1:111:4096:10.05
+R1:111:4096:10.05
+DTP:4096:public:rsa:10
+DTP:4096:public:rsa:10
+R1:111:4096:10.07
+DTP:4096:public:rsa:10
+R2:7770:4096:10.00
Got: +F2:3:4096:0.090541:0.001287 from 0
+R2:7770:4096:10.01
+R2:7777:4096:10.00
Got: +F2:3:4096:0.090541:0.001286 from 1
+R2:7783:4096:10.00
Got: +F2:3:4096:0.090721:0.001285 from 2
Got: +F2:3:4096:0.090541:0.001288 from 3
OpenSSL 1.0.2g 1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr)
compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
                  sign    verify    sign/s verify/s
rsa 4096 bits 0.022646s 0.000322s     44.2   3109.2

for i in 128 192 256; do openssl speed -elapsed -evp aes-${i}-cbc ; done

Spoiler

root@lepotato:~# for i in 128 192 256; do openssl speed -elapsed -evp aes-${i}-cbc ; done
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 32264602 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 22322941 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 9329487 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2899974 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 390089 aes-128-cbc's in 3.00s
OpenSSL 1.0.2g 1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr)
compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     172077.88k   476222.74k   796116.22k   989857.79k  1065203.03k
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-192-cbc for 3s on 16 size blocks: 31184019 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 64 size blocks: 19253234 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 256 size blocks: 7453968 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 1024 size blocks: 2217062 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 8192 size blocks: 293332 aes-192-cbc's in 3.00s
OpenSSL 1.0.2g 1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr)
compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-192-cbc     166314.77k   410735.66k   636071.94k   756757.16k   800991.91k
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 29870994 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 17384233 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 6378728 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1846681 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 241963 aes-256-cbc's in 3.00s
OpenSSL 1.0.2g 1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr)
compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     159311.97k   370863.64k   544318.12k   630333.78k   660720.30k

7z b

Spoiler

root@lepotato:~# 7z b

7-Zip 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=C,Utf16=off,HugeFiles=on,4 CPUs)

RAM size:    1850 MB,  # CPU hardware threads:   4
RAM usage:    850 MB,  # Benchmark threads:      4

Dict        Compressing          |        Decompressing
      Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
       KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS

22:    1942   289    652   1889  |    51700   398   1170   4664
23:    1942   291    680   1979  |    50868   398   1169   4655
24:    1928   291    712   2073  |    50381   399   1170   4674
25:    1921   292    752   2193  |    49890   399   1174   4691
----------------------------------------------------------------
Avr:          291    699   2034               399   1171   4671
Tot:          345    935   3352
Tido Posted November 3, 2017 Posted November 3, 2017 Funny enough, I threw that code at my tinker board: openssl speed rsa4096 -multi 4 - it switched off about 2 seconds later. Maybe a hardware failure
tkaiser Posted November 3, 2017 Author Posted November 3, 2017 1 hour ago, Tido said: Funny enough, I threw that code at my tinker board: openssl speed rsa4096 -multi 4 - it switched off about 2 seconds later. Maybe a hardware failure

Undervoltage. It seems you forgot that the Tinkerboard is a pile of crap you can switch off even with light loads.

Regarding the S905X benchmark results: unfortunately distro and clockspeed info are missing. Based on the numbers I assume it's Ubuntu Xenial and the potato is running at slightly above 1.4 GHz. These are your results:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     172077.88k   476222.74k   796116.22k   989857.79k  1065203.03k
aes-192-cbc     166314.77k   410735.66k   636071.94k   756757.16k   800991.91k
aes-256-cbc     159311.97k   370863.64k   544318.12k   630333.78k   660720.30k

And this is ROCK64 at stable 1.3 GHz:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     163161.40k   436259.80k   729289.90k   906723.33k   975929.34k
aes-192-cbc     152362.85k   375675.22k   582690.99k   693259.95k   733563.56k
aes-256-cbc     145928.50k   337163.26k   498586.20k   577371.48k   605145.77k

Smells like 1.3 GHz vs. 1.42 GHz (and not 1.5 GHz). Would be great if someone could provide tinymembench results too. The 7-zip numbers are not that great, since even an RPi 3 at 1200 MHz performs at this level. Strange.
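The "1.3 GHz vs. 1.42 GHz" estimate above can be reproduced with a little arithmetic. The following is only a sketch: it assumes that AES throughput on large blocks scales linearly with CPU clock when the same cores and the same openssl binary are used, and the helper name estimate_mhz is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): scale a known clockspeed by the
# ratio of two throughput figures measured with the same binary.
# usage: estimate_mhz REF_MHZ REF_THROUGHPUT MEASURED_THROUGHPUT
estimate_mhz() {
    awk -v mhz="$1" -v ref="$2" -v cur="$3" \
        'BEGIN { printf "%.0f\n", mhz * cur / ref }'
}

# aes-128-cbc, 8192 byte blocks: ROCK64 at 1300 MHz vs. the Le Potato run
estimate_mhz 1300 975929.34 1065203.03
# aes-256-cbc, 8192 byte blocks
estimate_mhz 1300 605145.77 660720.30
```

Both calls land at roughly 1.42 GHz, which is where the guess in the post comes from. The small block sizes are less suitable for this because call overhead dominates there.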
Tido Posted November 4, 2017 Posted November 4, 2017 (edited) 8 hours ago, tkaiser said: Undervoltage, it seems you forgot that the Tinkerboard is a pile of

If you don't catch irony, I will place a tag for you next time; I thought a smiley was enough. Anyway, I unplugged the HDMI cable, replaced the 119cm Nexus 4 Micro USB cable with a 35cm cable, et voilà:

Spoiler

root@tinkerboard:~# openssl speed rsa4096 -multi 4
Forked child 0
Forked child 1
Forked child 2
Forked child 3
+DTP:4096:private:rsa:10
+DTP:4096:private:rsa:10
+DTP:4096:private:rsa:10
+DTP:4096:private:rsa:10
+R1:154:4096:10.01
+DTP:4096:public:rsa:10
+R1:154:4096:10.01
+DTP:4096:public:rsa:10
+R1:152:4096:10.04
+DTP:4096:public:rsa:10
+R1:155:4096:10.07
+DTP:4096:public:rsa:10
+R2:8670:4096:10.00
+R2:8740:4096:10.00
+R2:8601:4096:10.00
+R2:8734:4096:10.00
Got: +F2:3:4096:0.064968:0.001145 from 0
Got: +F2:3:4096:0.065000:0.001153 from 1
Got: +F2:3:4096:0.065000:0.001144 from 2
Got: +F2:3:4096:0.066053:0.001163 from 3
OpenSSL 1.0.2g 1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) blowfish(ptr)
compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM
                  sign    verify    sign/s verify/s
rsa 4096 bits 0.016313s 0.000288s     61.3   3474.6

and this tinker board runs as if it wouldn't know it any better - Bazinga - Btw, still no heatsink attached -

You have specifically asked for Xenial - so I did. Latest nightly: Armbian_5.34.171104_Lepotato_Ubuntu_xenial_next_4.13.11.img

Edited November 4, 2017 by Tido added distro of Le Potato
tkaiser Posted November 4, 2017 Author Posted November 4, 2017 Seems I need to ask a second time: is anyone here able and willing to do a quick but correct benchmark test on Le Potato (not Tinkerboard, that's not interesting). I'm still interested in tinymembench numbers for this board and both openssl and 7-zip tests in an environment where cpufreq is monitored and throttling (if it would happen) gets noticed and avoided in a subsequent run.
Da Xue Posted November 4, 2017 Posted November 4, 2017 @tkaiser you want the numbers without throttling? memory speeds are set in u-boot and drastically affect performance. the boards have ddr3-2133 but higher doesn't always equate to better for tinymembench due to timings.
tkaiser Posted November 4, 2017 Author Posted November 4, 2017 15 minutes ago, Da Xue said: you want the numbers without throttling? That would be great. Or at least I need to know which clockspeeds are used (I know nothing about current state of Le Potato kernel, whether cpufreq/DVFS is already working and if it's working how throttling is configured). I assumed S905X would run with 1500 MHz but both openssl and especially 7z numbers are too low for that. 17 minutes ago, Da Xue said: memory speeds are set in u-boot and drastically affect performance I know and that's the reason I was asking for tinymembench numbers (low 7-zip compression speed is often related to high memory latency).
Da Xue Posted November 4, 2017 Posted November 4, 2017 @tkaiser The default bl30.bin permits the s905x to run at 1512MHz vs 1536MHz for s905. I have a bl30.bin that can be configured up to 1680MHz but the performance isn't linear due to throttling or some other logic in the hardware/firmware. With the hacked pre-1.0 SCPI that Amlogic implemented, I really don't know how to monitor clock speeds to ensure that it doesn't throttle down. I have no experience in reliably monitoring clock speeds in software for ARM so if you know a way, please let me know and I will run the numbers for you.
tkaiser Posted November 4, 2017 Author Posted November 4, 2017 46 minutes ago, Da Xue said: I really don't know how to monitor clock speeds to ensure that it doesn't throttle down.

So there is either no cpufreq support or the numbers are bogus? Anyway, the mostly useless sysbench pseudo benchmark can, funnily enough, be used for exactly this: estimating at which clockspeed a specific CPU is running, provided the CPU architecture and the build options for the binary are known (easy with upstream distro packages). Can you please execute

sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4
sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=2
sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1
find /sys -name "scaling_available_frequencies"
cat /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state
armbianmonitor -u
Da Xue Posted November 6, 2017 Posted November 6, 2017 I don't have armbianmonitor, but here are the results with the CPU set to 1680MHz and the DDR set to 2108MHz. The DDR can go past 2200MHz, but I haven't tested the performance because it seemed that the preconfigured timing in uboot negatively affects performance at higher speeds. I'll run the test at stock tomorrow or the day after.

7-Zip 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=C,Utf16=off,HugeFiles=on,4 CPUs)

RAM size:    1852 MB,  # CPU hardware threads:   4
RAM usage:    850 MB,  # Benchmark threads:      4

Dict        Compressing          |        Decompressing
      Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
       KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS

22:    2019   288    682   1964  |    53850   399   1218   4858
23:    2016   289    709   2054  |    51161   385   1217   4681
24:    2009   290    744   2160  |    52508   399   1219   4871
25:    2002   290    788   2286  |    52036   400   1223   4893
----------------------------------------------------------------
Avr:          289    731   2116               396   1219   4826
Tot:          342    975   3471

sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000

Test execution summary:
    total time:                          6.2510s
    total number of events:              10000
    total time taken by event execution: 24.9940
    per-request statistics:
         min:                                  2.50ms
         avg:                                  2.50ms
         max:                                  2.61ms
         approx. 95 percentile:                2.50ms

Threads fairness:
    events (avg/stddev):           2500.0000/0.71
    execution time (avg/stddev):   6.2485/0.00

sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 2

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000

Test execution summary:
    total time:                          12.5001s
    total number of events:              10000
    total time taken by event execution: 24.9964
    per-request statistics:
         min:                                  2.50ms
         avg:                                  2.50ms
         max:                                  2.60ms
         approx. 95 percentile:                2.50ms

Threads fairness:
    events (avg/stddev):           5000.0000/1.00
    execution time (avg/stddev):   12.4982/0.00

sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000

Test execution summary:
    total time:                          25.0038s
    total number of events:              10000
    total time taken by event execution: 25.0010
    per-request statistics:
         min:                                  2.50ms
         avg:                                  2.50ms
         max:                                  2.56ms
         approx. 95 percentile:                2.50ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   25.0010/0.00

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

C copy backwards                             :   1941.6 MB/s (1.4%)
C copy backwards (32 byte blocks)            :   1944.8 MB/s (1.6%)
C copy backwards (64 byte blocks)            :   1915.5 MB/s (1.6%)
C copy                                       :   1951.1 MB/s (1.5%)
C copy prefetched (32 bytes step)            :   1514.9 MB/s (0.3%)
C copy prefetched (64 bytes step)            :   1629.5 MB/s
C 2-pass copy                                :   1766.9 MB/s
C 2-pass copy prefetched (32 bytes step)     :   1247.8 MB/s
C 2-pass copy prefetched (64 bytes step)     :   1258.8 MB/s (0.2%)
C fill                                       :   6068.0 MB/s
C fill (shuffle within 16 byte blocks)       :   6068.4 MB/s
C fill (shuffle within 32 byte blocks)       :   6068.3 MB/s
C fill (shuffle within 64 byte blocks)       :   6068.5 MB/s
---
standard memcpy                              :   2026.4 MB/s (0.4%)
standard memset                              :   6072.0 MB/s
---
NEON LDP/STP copy                            :   2016.4 MB/s
NEON LDP/STP copy pldl2strm (32 bytes step)  :   1365.3 MB/s (0.5%)
NEON LDP/STP copy pldl2strm (64 bytes step)  :   1805.1 MB/s (0.3%)
NEON LDP/STP copy pldl1keep (32 bytes step)  :   2388.2 MB/s
NEON LDP/STP copy pldl1keep (64 bytes step)  :   2385.8 MB/s
NEON LD1/ST1 copy                            :   2003.2 MB/s (1.5%)
NEON STP fill                                :   6072.4 MB/s
NEON STNP fill                               :   6015.9 MB/s
ARM LDP/STP copy                             :   2020.0 MB/s (0.2%)
ARM STP fill                                 :   6072.4 MB/s
ARM STNP fill                                :   6015.9 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.0 ns          /     7.2 ns
    131072 :    6.1 ns          /    10.5 ns
    262144 :    7.6 ns          /    12.5 ns
    524288 :   69.2 ns          /   118.9 ns
   1048576 :   95.9 ns          /   127.8 ns
   2097152 :  108.5 ns          /   160.0 ns
   4194304 :  143.3 ns          /   179.9 ns
   8388608 :  164.1 ns          /   190.8 ns
  16777216 :  173.4 ns          /   196.3 ns
  33554432 :  182.1 ns          /   199.9 ns
  67108864 :  190.7 ns          /   215.9 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.0 ns          /     7.0 ns
    131072 :    6.1 ns          /    10.4 ns
    262144 :    7.7 ns          /    12.7 ns
    524288 :   68.6 ns          /   105.3 ns
   1048576 :   95.9 ns          /   127.8 ns
   2097152 :  108.0 ns          /   159.6 ns
   4194304 :  111.1 ns          /   131.2 ns
   8388608 :  113.7 ns          /   138.4 ns
  16777216 :  115.1 ns          /   149.5 ns
  33554432 :  115.4 ns          /   145.4 ns
  67108864 :  115.7 ns          /   149.9 ns
tkaiser Posted November 6, 2017 Author Posted November 6, 2017 2 minutes ago, Da Xue said: I don't have armbian monitor but heres the results with the CPU set to 1680MHz and the DDR set to 2108MHz. Thank you! Just to interpret the sysbench numbers: Which distro do you use, I need output from 'lsb_release -c' and 'file /usr/bin/sysbench'.
Da Xue Posted November 6, 2017 Posted November 6, 2017 Just now, tkaiser said: Thank you! Just to interpret the sysbench numbers: Which distro do you use, I need output from 'lsb_release -c' and 'file /usr/bin/sysbench'.

ubuntu xenial

/usr/bin/sysbench: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, for GNU/Linux 3.7.0, BuildID[sha1]=01b3ec2b7f6a203ed

This is just a bootstrapped ubuntu image with the kernel on github.
Da Xue Posted November 6, 2017 Posted November 6, 2017 No matter what I change the clock speed to in uboot to populate the PSCI, it will always say 1512 in the sys frequencies.
tkaiser Posted November 6, 2017 Author Posted November 6, 2017 6 minutes ago, Da Xue said: No matter what I change the clock speed to in uboot to populate the PSCI, it will always say 1512 in the sys frequencies And according to sysbench it's even slightly lower (more like 1470 MHz but that is pretty close to the reported 1512 MHz). In other words: too early to do any benchmarking now
Da Xue Posted November 6, 2017 Posted November 6, 2017 No matter what I change the clock speed tables to in uboot to populate the PSCI, it will always say 1512MHz in the sysfs frequencies. I think the A53 cores with crypto extensions perform slightly slower than their non-crypto counterparts in generic workloads. 7 minutes ago, tkaiser said: And according to sysbench it's even slightly lower (more like 1470 MHz but that is pretty close to the reported 1512 MHz). In other words: too early to do any benchmarking now Mind you that I am running this at 1680MHz and not 1512MHz for those numbers. The stock numbers are slower but not proportionally. The gains from anything over 1584MHz are very small if not negative.
tkaiser Posted November 6, 2017 Author Posted November 6, 2017 5 minutes ago, Da Xue said: I think the A53 cores with crypto extensions perform slightly slower than their non-crypto counterparts in generic workloads.

At least I found no evidence for this. What about the output from

find /sys -name "scaling_available_frequencies"
cat /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state

The sysbench numbers you generated indicate that the SoC's real clockspeed is slightly lower than 1.5GHz (the only good use case for sysbench since not depending on external memory bandwidth/latency). But we know from the past that 2 SoC vendors are cheating on us: Amlogic reporting bogus stuff through sysfs interface and same with RPi folks (where the same happens on RPi 2 and 3).
Da Xue Posted November 6, 2017 Posted November 6, 2017 The scaling_available_frequencies are bogus and are hard-coded in the trusted firmware, so I have no visibility into the exact speed. It always reports 1512. I am not getting this node: /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state. Do I have to enable a module? I am running numbers now for 1584/1056.
tkaiser Posted November 6, 2017 Author Posted November 6, 2017 12 minutes ago, Da Xue said: I am not getting this node: /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state. Do I have to enable a module?

No idea, I only know that once cpufreq support is enabled in any of the kernels we use, this node should appear. Maybe it's at a different path (find /sys -name time_in_state)? Anyway, the most important information for me was: not ready yet, so we have to take the above benchmark results with a huge grain of salt (@Tido's numbers look like slightly above 1.4GHz, yours like slightly below 1.5GHz). Once you have figured out how to reliably enable the desired clockspeeds and we can monitor the stuff through sysfs, it gets interesting again.
Da Xue Posted November 6, 2017 Posted November 6, 2017 Here's the bizarre part. When I run it with the default bl30.bin (supposed to be 1512MHz), I get the following:

sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000

Test execution summary:
    total time:                          7.6267s
    total number of events:              10000
    total time taken by event execution: 30.4950
    per-request statistics:
         min:                                  3.05ms
         avg:                                  3.05ms
         max:                                  3.15ms
         approx. 95 percentile:                3.05ms

Threads fairness:
    events (avg/stddev):           2500.0000/0.00
    execution time (avg/stddev):   7.6237/0.00

Do you have any results I can compare with for other boards?
tkaiser Posted November 6, 2017 Author Posted November 6, 2017 11 minutes ago, Da Xue said: Do you have any results I can compare with for other boards?

Pinebook (A64) and ROCK64 (RK3328), also with the Ubuntu Xenial arm64 sysbench distro package (that's important! Otherwise these are just numbers without meaning, since sysbench is a compiler settings benchmark and not able to measure hardware performance):

echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
for i in $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies) ; do
    echo $i >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
    echo -e "$(( $i / 1000)): \c"
    sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4 2>&1 | grep 'execution time'
done

Pinebook (no throttling happening -- just compare time_in_state before/after every run). CPU clockspeeds above 1152 MHz are disabled by Allwinner's budget cooling settings, which is why the 1200 and 1344 results are the same as for 1152, since that's the real clockspeed:

480: execution time (avg/stddev): 19.1398/0.01
600: execution time (avg/stddev): 15.2882/0.01
720: execution time (avg/stddev): 12.7485/0.01
816: execution time (avg/stddev): 11.2629/0.01
912: execution time (avg/stddev): 10.1254/0.01
960: execution time (avg/stddev): 9.5806/0.00
1008: execution time (avg/stddev): 9.0986/0.01
1056: execution time (avg/stddev): 8.6765/0.01
1104: execution time (avg/stddev): 8.3067/0.01
1152: execution time (avg/stddev): 7.9538/0.00
1200: execution time (avg/stddev): 7.9521/0.00
1344: execution time (avg/stddev): 7.9843/0.01

ROCK64 (no throttling happened):

408: execution time (avg/stddev): 23.4966/0.01
600: execution time (avg/stddev): 15.4553/0.00
816: execution time (avg/stddev): 11.3848/0.01
1008: execution time (avg/stddev): 9.1798/0.01
1200: execution time (avg/stddev): 7.6882/0.00
1296: execution time (avg/stddev): 7.1025/0.00

ROCK64 is slightly slower, which most probably is related to L1 cache latencies or something like that.
You'll find a lot of additional information here: https://forum.armbian.com/topic/1748-sbc-consumptionperformance-comparisons/ (note especially there that how the sysbench binary has been built is the most important factor, and that devices with totally different DRAM configuration/performance show identical sysbench scores)
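The "compare time_in_state before/after every run" check mentioned above could be scripted roughly as follows. This is a sketch, not something posted in the thread: the function name time_in_state_delta is invented here, and it assumes the usual cpufreq-stats format of one "frequency in kHz, time in 10 ms units" pair per line.

```shell
#!/bin/sh
# Hypothetical helper: compare two snapshots of cpufreq's time_in_state and
# list every frequency that accumulated time between them.
# Assumed input format per line: "<frequency in kHz> <time in 10 ms units>"
# usage: time_in_state_delta BEFORE_FILE AFTER_FILE
time_in_state_delta() {
    awk 'NR == FNR { before[$1] = $2; next }
         ($2 + 0) > (before[$1] + 0) {
             printf "%d MHz: +%d ticks\n", $1 / 1000, $2 - before[$1]
         }' "$1" "$2"
}

# Intended use on a board where the stats node exists (left commented out
# because the node is missing on some kernels, as reported in this thread):
#   stats=/sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state
#   cat "$stats" > /tmp/before
#   sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4
#   cat "$stats" > /tmp/after
#   time_in_state_delta /tmp/before /tmp/after
```

If only the highest frequency shows up in the output, the run was not throttled as far as the kernel knows. This says nothing about firmware that silently runs at a different clock than the one reported.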
tkaiser Posted November 6, 2017 Author Posted November 6, 2017 37 minutes ago, Da Xue said: When I run it with the default bl30.bin (suppose to be 1512MHz), I get the following: That's 1200 MHz in reality.
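How a sysbench execution time turns into a clockspeed estimate can be sketched in a few lines. The assumption, taken from the discussion in this thread, is that the cpu test's execution time is inversely proportional to clockspeed when the same binary runs on the same type of core; the function name mhz_from_time is hypothetical, and the ROCK64 figures serve as the reference.

```shell
#!/bin/sh
# Hypothetical helper: estimate the real clockspeed from a sysbench run by
# comparing against a reference run of the same binary on the same core type.
# usage: mhz_from_time REF_MHZ REF_SECONDS MEASURED_SECONDS
mhz_from_time() {
    awk -v mhz="$1" -v ref="$2" -v cur="$3" \
        'BEGIN { printf "%.0f\n", mhz * ref / cur }'
}

# Reference: ROCK64 at 1200 MHz needed 7.6882 s with 4 threads.
mhz_from_time 1200 7.6882 7.6237    # Le Potato, default bl30.bin
# Reference: ROCK64 at 1296 MHz needed 7.1025 s with 4 threads.
mhz_from_time 1296 7.1025 6.2485    # Le Potato, modified bl30.bin
```

The results come out close to 1210 MHz and 1473 MHz. Since ROCK64 is slightly slower per clock than other A53 boards, these are rough estimates rather than exact readings.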
Da Xue Posted November 6, 2017 Posted November 6, 2017 So it would appear that the S905X is running somewhere around 1210MHz with the default bl30.bin. With the modified one, it operates around 1475MHz? I think HardKernel has the source for the S905 trusted firmware and I do not. I'm not quite sure how the throttling logic works with 4 cores or if there's a current throttler. @tkaiser do you have the single thread results?
tkaiser Posted November 6, 2017 Author Posted November 6, 2017 2 minutes ago, Da Xue said: So it would appear that the S905X is running somewhere around 1210MHz with the default bl30.bin. With the modified one, it operates around 1475MHz?

Yes. That's what the benchmarks are telling.

2 minutes ago, Da Xue said: do you have the single thread results?

I'll generate them only for ROCK64 since I tried this many times already in the past and sysbench's cpu test scales linearly with both count of CPU cores and clockspeed (in other words: it's a 'benchmark' that can not be used to model any real-world task out there since just calculating prime numbers inside the CPU cores/caches):

408: execution time (avg/stddev): 91.3720/0.00
600: execution time (avg/stddev): 61.7398/0.00
816: execution time (avg/stddev): 45.2492/0.00
1008: execution time (avg/stddev): 36.5560/0.00
1200: execution time (avg/stddev): 30.6810/0.00
1296: execution time (avg/stddev): 28.3843/0.00

I would suggest contacting Amlogic pretty soon since their next 'Amlogic is cheating on us!!1!!' drama is just around the corner (like last year when Willy Tarreau discovered that all reported S905 CPU clockspeeds above 1500 MHz were bogus)
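The claim that the cpu test scales linearly with core count can be checked against the ROCK64 numbers in this thread by dividing the single-thread time by the 4-thread time at the same clockspeed. A minimal sketch (core_scaling is a made-up name for this illustration):

```shell
#!/bin/sh
# Hypothetical helper: speedup of a multi-threaded sysbench run over the
# single-threaded run at the same clockspeed.
# usage: core_scaling ONE_THREAD_SECONDS N_THREAD_SECONDS
core_scaling() {
    awk -v one="$1" -v many="$2" 'BEGIN { printf "%.2f\n", one / many }'
}

# ROCK64 at 1296 MHz: 28.3843 s with 1 thread, 7.1025 s with 4 threads
core_scaling 28.3843 7.1025
# ROCK64 at 1200 MHz: 30.6810 s with 1 thread, 7.6882 s with 4 threads
core_scaling 30.6810 7.6882
```

A value close to 4 on a quad-core board means the run was neither throttled nor starved by background load.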
tkaiser Posted November 6, 2017 Author Posted November 6, 2017 @willmore In case you've your 'overclocked' ODROID-C2 around running an arm64 Ubuntu Xenial (!!!) it might be worth to give the above simple sysbench run a try walking through the available cpufreq OPP and reporting results for '--num-threads=1' and '--num-threads=4'.
willmore Posted November 6, 2017 Posted November 6, 2017 Someone said my name? Sorry it took me a while to run this, but they offer a 100MHz clock speed and that takes a very long time to run--especially with one thread. I have a high degree of confidence that there is no throttling as IIRC, I tested this setup with cpuburn and got no throttling. I can't imagine this being more demanding than that! Here's the data:

num-threads=4
100: execution time (avg/stddev): 99.4764/0.02
250: execution time (avg/stddev): 37.7647/0.01
500: execution time (avg/stddev): 18.6581/0.00
1000: execution time (avg/stddev): 9.2395/0.00
1296: execution time (avg/stddev): 7.1300/0.00
1536: execution time (avg/stddev): 6.0117/0.00
1656: execution time (avg/stddev): 5.5794/0.01
1680: execution time (avg/stddev): 5.4853/0.00
1752: execution time (avg/stddev): 5.2694/0.01

num-threads=1
100: execution time (avg/stddev): 369.1851/0.00
250: execution time (avg/stddev): 146.8992/0.00
500: execution time (avg/stddev): 73.3360/0.00
1000: execution time (avg/stddev): 36.6221/0.00
1296: execution time (avg/stddev): 28.2551/0.00
1536: execution time (avg/stddev): 24.4123/0.00
1656: execution time (avg/stddev): 22.0989/0.00
1680: execution time (avg/stddev): 21.7828/0.00
1752: execution time (avg/stddev): 21.3559/0.00
tkaiser Posted November 6, 2017 Author Posted November 6, 2017 16 minutes ago, willmore said: I can't imagine this being more demanding than that!

Sysbench is pretty lightweight, yes. If throttling happened, the stddev value would increase (in your case the values are between 0.00 and 0.02, so that's just some background activity). And we can also do the math:

6.0117 / 5.2694 * 1536 --> 1752.3762099
24.4123 / 21.3559 * 1536 --> 1755.828263

So, thanks. Your numbers confirm that we can use sysbench in a very special mode to report real CPU clockspeeds, comparing same CPU cores using same binaries, where sysfs nodes and cpufreq drivers are cheating on us (really can't believe that we see this with Amlogic again after their S905/S912 disaster last year)
willmore Posted November 6, 2017 Posted November 6, 2017 3 minutes ago, tkaiser said: Sysbench is pretty lightweight, yes. If throttling happened, the stddev value would increase (in your case the values are between 0.00 and 0.02, so that's just some background activity). And we can also do the math:

6.0117 / 5.2694 * 1536 --> 1752.3762099
24.4123 / 21.3559 * 1536 --> 1755.828263

So, thanks. Your numbers confirm that we can use sysbench in a very special mode to report real CPU clockspeeds, comparing same CPU cores using same binaries, where sysfs nodes and cpufreq drivers are cheating on us (really can't believe that we see this with Amlogic again after their S905/S912 disaster last year)

Ahh, yes, for a clock speed surrogate, you'd want exactly a task like that--something that doesn't stress the CPU so much that power and thermal issues come into play. You'd also want to avoid anything with a lot of memory usage as that will be inelastic with CPU speed. Yeah, the clock speed issue for the S905. I remember that well. Maybe some day I can release the info I have on how that was detected.....
tkaiser Posted November 6, 2017 Author Posted November 6, 2017 5 minutes ago, willmore said: Yeah, the clock speed issue for the S905. I remember that well. Maybe some day I can release the info I have on how that was detected..... Huh? The few script lines above are sufficient to 'detect' this sort of cheating. It's just that almost nobody is benchmarking correctly since people prefer Phoronix or Geekbench BS / pseudo benchmarks. See also http://wiki.ant-computing.com/Choosing_a_processor_for_a_build_farm#Devices (it's as easy as walking through all cpufreq opp and look whether a benchmark performs different or not at different clockspeeds, then you either have real throttling happening or some mechanism preventing to use some clockspeeds -- that's what we have on almost all Android kernels in the meantime)
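Walking through all OPPs and checking whether the benchmark really gets faster can be automated on the output of the sweep loop posted earlier in this thread. The following is a sketch under stated assumptions: the function name flag_bogus_opps and the 2% tolerance are choices made here, the input is expected in the "MHz: execution time (avg/stddev): AVG/SD" form, and the first line is trusted as the reference. For an honest cpufreq driver, frequency times execution time stays roughly constant.

```shell
#!/bin/sh
# Hypothetical helper: read sweep results from stdin and report every OPP
# whose execution time is too long for the advertised clockspeed.
# usage: flag_bogus_opps [TOLERANCE] < sweep_output   (default tolerance 0.02)
flag_bogus_opps() {
    awk -v tol="${1:-0.02}" '
        {
            mhz = $1; sub(/:$/, "", mhz)   # "1152:" -> 1152
            split($NF, t, "/")             # "7.9538/0.00" -> avg, stddev
            work = mhz * t[1]              # constant if the clock is real
            if (NR == 1) ref = work        # first line is the reference
            if (work > ref * (1 + tol))
                printf "%d MHz looks like %.0f MHz\n", mhz, mhz * ref / work
        }'
}

# A few of the Pinebook lines from this thread as sample input:
flag_bogus_opps <<'EOF'
480: execution time (avg/stddev): 19.1398/0.01
1152: execution time (avg/stddev): 7.9538/0.00
1200: execution time (avg/stddev): 7.9521/0.00
1344: execution time (avg/stddev): 7.9843/0.01
EOF
```

On this sample only the 1200 and 1344 MHz entries get flagged, matching the explanation that the Pinebook is capped at 1152 MHz. The same check cannot tell throttling apart from a capped clock, so the stddev column and time_in_state still need a look.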
willmore Posted November 6, 2017 Posted November 6, 2017 6 minutes ago, tkaiser said: Huh? The few script lines above are sufficient to 'detect' this sort of cheating. It's just that almost nobody is benchmarking correctly since people prefer Phoronix or Geekbench BS / pseudo benchmarks. See also http://wiki.ant-computing.com/Choosing_a_processor_for_a_build_farm#Devices (it's as easy as walking through all cpufreq opp and look whether a benchmark performs different or not at different clockspeeds, then you either have real throttling happening or some mechanism preventing to use some clockspeeds -- that's what we have on almost all Android kernels in the meantime)

Back when the clock speed cheating was first detected, I had been doing just the kind of benchmarking that detected it. I approached a vendor in the area with my results--looking to find out why there was a performance plateau at 1.5GHz. That was one of the things that triggered the investigation into the issue, which led to the revelation that the firmware was lying. There's some data there that might be of interest historically. Nothing that matters to this thread, really. The method that you've shown in this thread is similar to what I was doing back then and I'm confident that you can detect cheating with this.