tkaiser Posted August 19, 2017

On 18.8.2017 at 8:26 AM, Stuart Naylor said: has anyone tried port trunking a USB3.0 1Gb with the on-board just to have a glance at the workload & throughput?

RK3328 can saturate the internal GbE MAC in combination with the external RTL8211 PHY (please note: on pre-production samples we had the 8211E while production boards feature the 8211F, no idea yet what this means wrt performance/consumption) as well as an RTL8153 USB GbE dongle. With appropriate IRQ affinity it can do both at the same time. With synthetic benchmarks you then get ~1700 Mbits/sec combined.

Now let's talk about use cases:

When we're talking about 'trunks' then this is usually called link aggregation (IEEE 802.1AX-2008, formerly known as IEEE 802.3ad). This mode does NOT increase bandwidth for an individual network connection but only provides a mechanism to put individual node connections on either link. So you will NOT end up with 2 Gbits/sec but with 2 x 1 Gbits/sec instead. The algorithm used to determine which link to put which connection on has to be chosen carefully since it's pretty easy to configure everything in a way that all traffic remains on one link while the other is unused. In n-to-1 topologies n should be at least 10 for trunking/bonding/LACP to become useful.

What to do with 2 x 1 Gbits/sec? Which data to transmit? If the USB3 port is occupied by an RTL8153 the remaining interfaces are 2 x USB2 (each ~40 MB/s when 'USB Attached SCSI' (UAS) can be used, otherwise it's safe to assume 35 MB/s max) or eMMC and SD card. Even with RAID-0 across the 2 USB2 ports we're not close to getting an IO bandwidth that satisfies a 2nd GbE NIC. So we would need USB3 storage too, and then there's the need to add a USB3 hub. There exist different kinds of USB3 hubs and especially older ones are error prone. Based on some research a year ago I believe(d) choosing a USB3 hub based on VIA812 is a good idea.
There also exist some VIA812/RTL8153 combinations (like the one you can see on this picture, which I bought for ~20 bucks a few months ago -- in the same thread at the top you also see some performance numbers, and you should keep in mind that ODROID XU4/HC1 use this very same chip for their onboard GbE). To make use of an additional RTL8153 with storage use cases we would need to put a USB3 hub in between, and then I'm already somewhat concerned with regard to reliability (the more complexity the less reliability).

Next problem: sequential transfer speed limits of HDDs. Even with the fastest 3.5" HDDs currently available, due to ZBR (zone bit recording) sequential transfer speeds drop below 100 MB/s once the disk gets filled (top sequential speeds are only possible with empty disks when you benchmark on the outer disk tracks). To make use of 2 GbE links we would also need to combine at least two disks in a RAID-0 fashion. This is dangerous since a single disk failure will render all your data unusable.

So what about redundant RAID modes? If we use such a VIA812/RTL8153 combination we could at least connect 3 disks and play RAID-5. Then we're switching from dangerous to insanely dangerous, since from then on the USB hub acts as a single point of failure.
Let some USB resets happen for whatever reason: all 3 disks behind the hub become inaccessible, so the mdraid code will trash the whole array (please believe me: I have been dealing with failing RAIDs for exactly 2 decades now and can tell you that RAID is only great until you need it).

Then you need a bunch of external PSUs and a whole mess of cables to set up such a multi disk environment, and if you add up all the costs you might realize that ROCK64 is a great single disk NAS, but if you have to add more than one disk other solutions like Helios4 or an x64 based HP Microserver look more suitable (or Marvell based solutions like Clearfog or Espressobin where you get between 1 and 5 or even 9 real SATA ports without any USB3 crappiness in between).

TL;DR: It's possible to implement trunking, and performance with synthetic benchmarks will look nice if you benchmark with 2 clients and add up the bandwidth (quite unrealistic of course), but I fail to identify a single use case that would justify trunking with an RK3328 device like ROCK64.

Most probably the idea is not trunking but aggregated bandwidth as is possible with some LAN protocols (for example 'SMB Multichannel', available since Windows Server 2012 -- really impressive stuff) or SAN topologies (iSCSI multipathing for example). But even then, due to the single USB3 port on RK3328, it's a really bad idea since added storage means more complexity (USB hub in between) and this negatively affects reliability.

Wrt JMS561 (USB-to-SATA bridge combined with SATA port multiplier and primitive RAID engine) please check @Kosmatik's experiences (ODROID-XU4 user running into a lot of problems with Hardkernel's Cloudshell 2 device that relies on this chip). Due to the issues reported here and there I would not use any device based on JMS561 (or older such chips like JMS539). But based on my experiences with failed RAID and my attempts to avoid single points of failure I would never use any of these proprietary chips anyway.
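To make the link aggregation point from this post concrete, here is a minimal Python sketch of how a transmit hash pins every connection to exactly one link. The hash is a simplified stand-in in the spirit of a layer3+4 policy (XOR of addresses and ports, modulo the number of links). It is an assumption for illustration only and not the exact formula of the Linux bonding driver or of any switch; the addresses, ports and the function name are made up.

```python
import ipaddress
from collections import Counter

def pick_link(src_ip, dst_ip, src_port, dst_port, n_links=2):
    """Map one connection to one link index (simplified, hypothetical hash)."""
    s = int(ipaddress.ip_address(src_ip))
    d = int(ipaddress.ip_address(dst_ip))
    return ((s ^ d) ^ (src_port ^ dst_port)) % n_links

SERVER = "192.168.1.10"  # hypothetical NAS address

# The same connection always lands on the same link,
# so a single transfer never gets more than one link's bandwidth.
one = {pick_link("192.168.1.20", SERVER, 50000, 445) for _ in range(1000)}

# Only many different clients (n-to-1) spread over both links.
clients = [f"192.168.1.{i}" for i in range(20, 40)]
spread = Counter(pick_link(c, SERVER, 50000, 445) for c in clients)

print("links used by one connection:", one)
print("connections per link for 20 clients:", dict(spread))
```

With only two or three clients it is easy to end up with all of them on the same link, which is why a larger n is needed before aggregation pays off.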
willmore Posted August 19, 2017

@tkaiser, while all of that is correct, there is another use case that you didn't consider. What about using this for a NAS/router? One interface towards the internal network and the other to the outside. Not all of the traffic has to terminate on the Rock64.

zador.blood.stained Posted August 19, 2017

Rock64 also shows exceptionally high numbers for the AES encryption in cryptsetup benchmark, and I wonder if it would also show such high numbers in Syncthing, which would make it a very good node for personal backup infrastructure based on Syncthing (or BTSync which also uses AES).
Stuart Naylor Posted August 19, 2017

@tkaiser Yeah, I know what port trunking is and to be honest was just wondering if anyone had done any ioperf tests/helios. I was just wondering about the Rock64 being a server that can supply multiple clients. The NAS/router idea posted later is also more curiosity about what can be achieved with that USB3.0 on a SoC when much of its bandwidth is bottlenecked by the ethernet.

~1700 Mbits/sec combined: is that pulling from an SSD on the same USB? Anyone with any IO and CPU stats? Whether this is wise or not is very much a matter of choice and purpose.

Regarding @Kosmatik's experiences with the JMS561: I posted the fix to Smartmon to stop it sending the wrong call to the controller. It's at the end of that thread: https://www.smartmontools.org/ticket/552

Again curiosity, but I find it hard to differentiate a RAID1 JMS561 for $20 running two disks from, say, the $10 single USB adaptor that many are doing perf tests with. It is also completely dependent on the disks you choose being SSD, HDD or even hybrid: http://www.seagate.com/www-content/product-content/seagate-laptop-fam/barracuda_25/en-us/docs/100807728d.pdf

Irrespective of cables and PSUs, where you could argue the only difference is what is hidden in an enclosure, and do we have PSUs & cables when it's USB3.0 & 3.1? Has anyone tested a cheap $20 JMS561 after fixing the smartmon bug? I posted a $20 adapter but guess you guys might have a 2-bay with the same chipset, and I just wondered how the SoC would cope and what is achievable. If you ever have the time I would be really interested; if anyone fancies giving it a go as they already have the equipment, I think it would be of interest to many.

The JMS561 was just an example of a single chipset; there are others, including ones that do 4 bays and above. They may be cheap & nasty, but they are getting really cheap and that might make them more fit for purpose.
We could have newer forms of mediastore that are more suited to how we use data, especially media. OverlayFS could have an SSD upper with an HDD lower mounted over NFS, with a cheap SoC supplying numerous users, where you archive down to the lower. You could even have a decentralized volume spread over network nodes or a cluster, where adding capacity is just adding another node. USB3.0 could well be a precursor to the next rake of 3.1 systems with C connectors...

There might be objections, but I think it's interesting and also useful to know what these SoCs are capable of without any assumptions of a single method or use. https://wdullaer.com/blog/2016/03/19/create-a-nas-with-redundancy-using-snapraid/

I keep thinking the Rock64 could make a great Kodi box that shares a USB attached mirror via NFS. No link aggregation, no Snapraid, just a cheap hardware mirror. For home use a few boxes can pool those shares via aufs, making a very simple node and collective system that scales by just adding another. You gain bandwidth by diversification as you are not always sharing from a central store. There are all sorts of ways you could use bandwidth when it starts to become available at this cost; maybe it would be informative to give it a try.
Stuart Naylor Posted August 19, 2017

1 hour ago, zador.blood.stained said: Rock64 also shows exceptionally high numbers for the AES encryption in cryptsetup benchmark, and I wonder if it would also show such high numbers in Syncthing, which would make it a very good node for personal backup infrastructure based on Syncthing (or BTSync which also uses AES).

AES encryption could be of interest, but isn't the performance due to the embedded cipher engine, or is there a way to use it with something like Snapraid? You might have link aggregated Rock64s in a Snapraid cluster or decentralized node array.
willmore Posted August 20, 2017

2 hours ago, zador.blood.stained said: Rock64 also shows exceptionally high numbers for the AES encryption in cryptsetup benchmark, and I wonder if it would also show such high numbers in Syncthing, which would make it a very good node for personal backup infrastructure based on Syncthing (or BTSync which also uses AES).

Where have you seen cryptsetup benchmark results for the rock64? I searched this thread and didn't find any. It's just an A53 with the AES extensions, right? So, we'd expect something like the H5 +/- some for clock speed differences?

Orange Pi PC2 (AllWinner H5)

root@orangepipc2:~# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       129262 iterations per second
PBKDF2-sha256      76293 iterations per second
PBKDF2-sha512      70773 iterations per second
PBKDF2-ripemd160  109409 iterations per second
PBKDF2-whirlpool   24435 iterations per second
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   238.4 MiB/s   296.1 MiB/s
 serpent-cbc   128b    17.0 MiB/s    19.2 MiB/s
 twofish-cbc   128b    25.9 MiB/s    28.2 MiB/s
     aes-cbc   256b   204.6 MiB/s   267.8 MiB/s
 serpent-cbc   256b    17.2 MiB/s    19.1 MiB/s
 twofish-cbc   256b    26.1 MiB/s    28.2 MiB/s
     aes-xts   256b   259.8 MiB/s   261.3 MiB/s
 serpent-xts   256b    17.7 MiB/s    19.5 MiB/s
 twofish-xts   256b    27.7 MiB/s    28.7 MiB/s
     aes-xts   512b   240.3 MiB/s   239.8 MiB/s
 serpent-xts   512b    18.1 MiB/s    19.5 MiB/s
 twofish-xts   512b    28.2 MiB/s    28.6 MiB/s

By way of comparison, a faster clocked A53 without AES (Odroid-C2 Amlogic S905):

root@odroid64:~# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       275941 iterations per second
PBKDF2-sha256     165913 iterations per second
PBKDF2-sha512     152409 iterations per second
PBKDF2-ripemd160  238312 iterations per second
PBKDF2-whirlpool   52851 iterations per second
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b    42.4 MiB/s    44.2 MiB/s
 serpent-cbc   128b    34.5 MiB/s    37.7 MiB/s
 twofish-cbc   128b    42.6 MiB/s    42.2 MiB/s
     aes-cbc   256b    32.7 MiB/s    33.0 MiB/s
 serpent-cbc   256b    35.2 MiB/s    37.7 MiB/s
 twofish-cbc   256b    43.5 MiB/s    42.2 MiB/s
     aes-xts   256b    45.2 MiB/s    44.7 MiB/s
 serpent-xts   256b    36.5 MiB/s    38.1 MiB/s
 twofish-xts   256b    45.5 MiB/s    42.7 MiB/s
     aes-xts   512b    34.1 MiB/s    33.3 MiB/s
 serpent-xts   512b    36.9 MiB/s    38.1 MiB/s
 twofish-xts   512b    45.9 MiB/s    42.7 MiB/s
Stuart Naylor Posted August 20, 2017

@willmore http://opensource.rock-chips.com/images/d/d7/Rockchip_RK3328_Datasheet_V1.1-20170309.pdf

Quad-core Cortex-A53 is integrated with separate Neon and FPU coprocessor, also with shared L2 Cache. The Quad-core GPU supports high-resolution display and game. Lots of high-performance interface to get very flexible solution, such as multi-channel display including HDMI2.0a and TV Encoder (CVBS). TrustZone and crypto hardware are integrated for security. 32bits DDR3/DDR3L/DDR4/LPDDR3 provides high memory bandwidth.

Cipher engine
Support AES 128/192/256
Supports the DES (ECB and CBC modes) and TDES (EDE and DED) algorithms
Supports MD5, SHA-1 and SHA-256 HASH algorithms
Support PKA(RSA) 512/1024/2048 bit Exp Modulator
Support 160-bit Pseudo Random Number Generator (PRNG)
Support 256-bit True Random Number Generator (TRNG)

Apart from that I dunno, and I'm not sure how well supported it is or whether anyone has done any benchmarks yet. So yeah, the NEON extensions. Maybe @zador.blood.stained will supply some.
zador.blood.stained Posted August 20, 2017

8 hours ago, willmore said: Where have you seen cryptsetup benchmark results for the rock64?

I tested it myself while making a Rock64 configuration for Armbian. I'm still not sure why cryptsetup shows much higher numbers than openssl, so I decided not to post them right away without making some real-world tests with cryptsetup on real storage. But even if I had a spare SSD to make a benchmark, I broke the USB3 port while desoldering the protection diodes.

8 hours ago, willmore said: It's just an A53 with the AES extensions, right?

Yes, and a relatively fast DRAM. So just by numbers Rock64 (4.4 kernel, performance governor) is more or less twice as fast as the Pinebook (3.10 kernel, performance governor, A64 has AES instructions too) and more or less 4 times as fast as Armada A388 with CESA.

8 hours ago, willmore said: So, we'd expect something like the H5 +/- some for clock speed differences?

AFAIK A53 cores in H5 don't have AES support? Can you post the contents of /proc/cpuinfo?
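Whether the cores expose the ARMv8 crypto extensions can be read from the Features line of /proc/cpuinfo. Below is a small Python sketch of that check, assuming the usual Linux layout of that file on ARM; the function names and the sample text are made up for illustration.

```python
def cpu_features(cpuinfo_text):
    """Return the flags from the first 'Features' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Features":
            return set(value.split())
    return set()

def has_aes(cpuinfo_text):
    """True if the kernel reports the AES instructions for this CPU."""
    return "aes" in cpu_features(cpuinfo_text)

# Illustrative sample of the kind of line an A53 with crypto extensions reports.
sample = "processor\t: 0\nFeatures\t: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid\n"
print(has_aes(sample))

# On a real board: has_aes(open("/proc/cpuinfo").read())
```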
nobe Posted August 20, 2017

@zador.blood.stained regarding openssl, this post from the odroid forums is worth investigating: https://forum.odroid.com/viewtopic.php?p=105358&sid=c631e8aeddd1378f32ac6d439904b69b#p105358

Short version: you might need to check if openssl is compiled with cryptodev enabled.

zador.blood.stained Posted August 20, 2017

9 minutes ago, nobe said: short version -> you might need to check if openssl is compiled with cryptodev enabled

AFAIK cryptodev is not in mainline, so it requires an out-of-tree kernel module. AF_ALG on the other hand may explain the performance, and it requires OpenSSL 1.1 or higher, which can be found e.g. on Debian Stretch, while Ubuntu Xenial and Debian Jessie have OpenSSL 1.0.x.
zador.blood.stained Posted August 20, 2017

Took some time to get some actual numbers. "cryptsetup benchmark" shows similar (within ±5% margin) results on both Xenial and Stretch:

root@rock64:~# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       273066 iterations per second for 256-bit key
PBKDF2-sha256     514007 iterations per second for 256-bit key
PBKDF2-sha512     214872 iterations per second for 256-bit key
PBKDF2-ripemd160  161817 iterations per second for 256-bit key
PBKDF2-whirlpool   72817 iterations per second for 256-bit key
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   366.3 MiB/s   455.7 MiB/s
 serpent-cbc   128b    25.0 MiB/s    27.4 MiB/s
 twofish-cbc   128b    29.4 MiB/s    30.9 MiB/s
     aes-cbc   256b   314.2 MiB/s   412.9 MiB/s
 serpent-cbc   256b    25.3 MiB/s    27.4 MiB/s
 twofish-cbc   256b    29.5 MiB/s    30.9 MiB/s
     aes-xts   256b   401.9 MiB/s   403.9 MiB/s
 serpent-xts   256b    26.7 MiB/s    28.0 MiB/s
 twofish-xts   256b    31.3 MiB/s    31.6 MiB/s
     aes-xts   512b   365.8 MiB/s   365.4 MiB/s
 serpent-xts   512b    26.7 MiB/s    27.9 MiB/s
 twofish-xts   512b    31.4 MiB/s    31.6 MiB/s

openssl benchmark results are a little bit different, so I'm not sure if this is "benchmarking gone wrong" or what.

Xenial:

root@rock64:~# openssl speed -elapsed -evp aes-128-cbc aes-192-cbc aes-256-cbc
(cut)
OpenSSL 1.0.2g 1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr)
compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type  16 bytes  64 bytes  256 bytes  1024 bytes  8192 bytes
aes-128-cbc  163161.40k  436259.80k  729289.90k  906723.33k  975929.34k
aes-192-cbc  152362.85k  375675.22k  582690.99k  693259.95k  733563.56k
aes-256-cbc  145928.50k  337163.26k  498586.20k  577371.48k  605145.77k

Stretch:

OpenSSL 1.1.0f 25 May 2017
built on: reproducible build, date unspecified
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/lib/ssl\"" -DENGINESDIR="\"/usr/lib/aarch64-linux-gnu/engines-1.1\""
The 'numbers' are in 1000s of bytes per second processed.
type  16 bytes  64 bytes  256 bytes  1024 bytes  8192 bytes  16384 bytes
aes-128-cbc  89075.61k  281317.21k  589750.10k  844657.32k  965124.10k  975323.14k
aes-192-cbc  85167.28k  252748.95k  487843.41k  655406.42k  727607.98k  733538.99k
aes-256-cbc  83124.71k  235290.07k  427535.10k  550874.11k  600997.89k  603417.26k

Edit: looks like benchmarking actually went wrong and the "-evp" parameter placement (or existence) on the command line affects the benchmark.
Edit 2: Redid and updated Stretch numbers.
Edit 3: Redid and updated Xenial numbers. Had to run "openssl speed -elapsed -evp <alg>" for each algorithm separately.
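The two tools report in different units: cryptsetup prints MiB/s while openssl speed prints "1000s of bytes per second". A short Python sketch for converting the openssl figures so both can be put side by side; the helper name is an assumption of mine and the sample values are the ROCK64 numbers from this post.

```python
def openssl_k_to_mib(value_k):
    """Convert an 'openssl speed' figure (1000s of bytes/s) to MiB/s."""
    return value_k * 1000 / (1024 * 1024)

# aes-128-cbc with 8192-byte blocks, OpenSSL 1.0.2g result on ROCK64
openssl_mib = openssl_k_to_mib(975929.34)

# aes-cbc 128b encryption as reported by 'cryptsetup benchmark' on ROCK64
cryptsetup_mib = 366.3

print(f"openssl:    {openssl_mib:.1f} MiB/s")
print(f"cryptsetup: {cryptsetup_mib:.1f} MiB/s")
```

Equal units do not make the two results directly equivalent: cryptsetup goes through the kernel crypto API while openssl speed measures the userspace library, so a gap between them is not by itself a sign of a measurement error.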
Stuart Naylor Posted August 20, 2017

@willmore Can you supply the same for the Odroid64 & orangepipc2?

openssl speed -elapsed -evp aes-128-cbc aes-192-cbc aes-256-cbc

Must be the Neon AES & SHA support, and boy is the AES optimization off the chart for the Rock64, with the OrangePiPC2 not being bad either.
willmore Posted August 21, 2017

@zador.blood.stained From OrangePiPC2:

Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

root@orangepipc2:~# openssl speed -elapsed -evp aes-128-cbc aes-192-cbc aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-192 cbc for 3s on 16 size blocks: 4382225 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 64 size blocks: 1168568 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 256 size blocks: 299007 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 1024 size blocks: 75171 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 8192 size blocks: 9412 aes-192 cbc's in 3.00s
Doing aes-256 cbc for 3s on 16 size blocks: 3942328 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 64 size blocks: 1028331 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 256 size blocks: 262540 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 1024 size blocks: 65973 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 8302 aes-256 cbc's in 3.00s
Doing aes-128-cbc for 3s on 16 size blocks: 19229648 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 12855383 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 5371646 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 1669660 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 224669 aes-128-cbc's in 3.00s
OpenSSL 1.0.2g 1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr)
compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type  16 bytes  64 bytes  256 bytes  1024 bytes  8192 bytes
aes-192 cbc  23371.87k  24929.45k  25515.26k  25658.37k  25701.03k
aes-256 cbc  21025.75k  21937.73k  22403.41k  22518.78k  22669.99k
aes-128-cbc  102558.12k  274248.17k  458380.46k  569910.61k  613496.15k

root@odroid64:~# openssl speed -elapsed -evp aes-128-cbc aes-192-cbc aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-192 cbc for 3s on 16 size blocks: 9426226 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 64 size blocks: 2513241 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 256 size blocks: 642946 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 1024 size blocks: 161675 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 8192 size blocks: 20241 aes-192 cbc's in 3.00s
Doing aes-256 cbc for 3s on 16 size blocks: 8471996 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 64 size blocks: 2211530 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 256 size blocks: 564468 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 1024 size blocks: 141815 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 17766 aes-256 cbc's in 3.00s
Doing aes-128-cbc for 3s on 16 size blocks: 9706011 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 2782108 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 727117 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 183869 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 23058 aes-128-cbc's in 3.00s
OpenSSL 1.0.2g 1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr)
compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type  16 bytes  64 bytes  256 bytes  1024 bytes  8192 bytes
aes-192 cbc  50273.21k  53615.81k  54864.73k  55185.07k  55271.42k
aes-256 cbc  45183.98k  47179.31k  48167.94k  48406.19k  48513.02k
aes-128-cbc  51765.39k  59351.64k  62047.32k  62760.62k  62963.71k

Looks like openssl uses the AES instructions for the 128 bit key length, but not for 192 or 256, which is a bit strange. Then again, it's an old version. The Odroid C2 is running Xenial and the PC2 is running current Armbian.
zador.blood.stained Posted August 21, 2017

11 minutes ago, willmore said: Looks like openssl uses the AES instructions for the 128 bit keylength, but not 192 nor 256 which is a bit strange. Then again, it's an old version.

It's most likely "benchmarking gone wrong".

11 minutes ago, willmore said: openssl speed -elapsed -evp aes-128-cbc aes-192-cbc aes-256-cbc

-evp here applies only to the next algo on the command line (aes-128-cbc); the next 2 are not affected by this option. So I would advise rerunning the test 3 times, 1 algo at a time, and editing/posting a combined table.

openssl speed -elapsed -evp aes-128-cbc
openssl speed -elapsed -evp aes-192-cbc
openssl speed -elapsed -evp aes-256-cbc
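The "one algo per invocation" advice is easy to script. The Python sketch below only builds the three command lines, so that -evp sits directly in front of each cipher; the helper name is hypothetical, and actually running the commands is left as a commented hint since each run takes a while and needs openssl installed.

```python
import shlex

ALGOS = ["aes-128-cbc", "aes-192-cbc", "aes-256-cbc"]

def speed_commands(algos):
    """Build one 'openssl speed' argv per cipher so -evp applies to each of them."""
    return [["openssl", "speed", "-elapsed", "-evp", alg] for alg in algos]

for argv in speed_commands(ALGOS):
    print(shlex.join(argv))
    # To run it for real: subprocess.run(argv, check=True)
```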
willmore Posted August 21, 2017

Okay, composited:

root@orangepipc2
Doing aes-128-cbc for 3s on 16 size blocks: 19231577 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 12853395 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 5372534 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 1669698 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 224642 aes-128-cbc's in 3.00s
Doing aes-192-cbc for 3s on 16 size blocks: 17959061 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 64 size blocks: 11051987 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 256 size blocks: 4292528 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 1024 size blocks: 1276599 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 8192 size blocks: 168931 aes-192-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16 size blocks: 17198520 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 9922363 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 3673052 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1063205 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 139337 aes-256-cbc's in 3.00s
The 'numbers' are in 1000s of bytes per second processed.
type  16 bytes  64 bytes  256 bytes  1024 bytes  8192 bytes
aes-128-cbc  102568.41k  274205.76k  458456.23k  569923.58k  613422.42k
aes-192-cbc  95781.66k  235775.72k  366295.72k  435745.79k  461294.25k
aes-256-cbc  91725.44k  211677.08k  313433.77k  362907.31k  380482.90k

root@odroid64
Doing aes-128-cbc for 3s on 16 size blocks: 9702869 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 2781948 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 727164 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 183877 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 23058 aes-128-cbc's in 3.00s
Doing aes-192-cbc for 3s on 16 size blocks: 8720919 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 64 size blocks: 2461310 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 256 size blocks: 639833 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 1024 size blocks: 161576 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 8192 size blocks: 20256 aes-192-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16 size blocks: 7892666 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 2170451 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 561814 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 141717 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 17766 aes-256-cbc's in 3.00s
type  16 bytes  64 bytes  256 bytes  1024 bytes  8192 bytes
aes-128-cbc  51748.63k  59348.22k  62051.33k  62763.35k  62963.71k
aes-192-cbc  46511.57k  52507.95k  54599.08k  55151.27k  55312.38k
aes-256-cbc  42094.22k  46302.95k  47941.46k  48372.74k  48513.02k

Edited August 21, 2017 by willmore: Remove html formatting for plain text
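The summary table follows directly from the "Doing ..." progress lines: operations times block size, divided by elapsed seconds, expressed in 1000s of bytes. A small Python sketch of that arithmetic, using one line from the OrangePi PC2 run above; the regular expression and the function name are my own assumptions about the line format, not something openssl documents.

```python
import re

# Matches both the -evp form ("aes-128-cbc") and the non-evp form ("aes-192 cbc").
LINE = re.compile(
    r"Doing (.+?) for (\d+)s on (\d+) size blocks: (\d+) .+? in ([\d.]+)s"
)

def throughput_k(line):
    """Return (cipher, block size, throughput in 1000s of bytes/s) for one progress line."""
    m = LINE.search(line)
    if m is None:
        raise ValueError("not an 'openssl speed' progress line")
    cipher, _, block, count, elapsed = m.groups()
    return cipher, int(block), int(count) * int(block) / float(elapsed) / 1000

sample = "Doing aes-128-cbc for 3s on 16 size blocks: 19231577 aes-128-cbc's in 3.00s"
cipher, block, k = throughput_k(sample)
print(f"{cipher} {block} bytes: {k:.2f}k")
```

For the sample line this gives 19231577 x 16 / 3.00 / 1000, which matches the 102568.41k shown in the aes-128-cbc row of the table.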
tkaiser Posted August 21, 2017 Posted August 21, 2017 Hmm... to summarize the 'OpenSSL 1.0.2g 1 Mar 2016' results for the 3 boards/SoC tested above with some more numbers added (on all A53 cores with crypto extensions enabled performance is directly proportional to CPU clockspeeds -- nice): ODROID N1 / RK3399 A72 @ 2.0GHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 377879.56k 864100.25k 1267985.24k 1412154.03k 1489756.16k aes-192-cbc 325844.85k 793977.30k 1063641.34k 1242280.28k 1312189.10k aes-256-cbc 270982.47k 721167.51k 992207.02k 1079193.94k 1122691.75k ODROID N1 / RK3399 A53 @ 1.5GHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 103350.94k 326209.49k 683714.13k 979303.08k 1118808.75k aes-192-cbc 98758.18k 291794.65k 565252.01k 759266.99k 843298.13k aes-256-cbc 96390.77k 273654.98k 495746.99k 638750.04k 696857.94k MacchiatoBin / ARMADA 8040 @ 1.3GHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 360791.31k 684250.01k 885927.34k 943325.18k 977362.94k aes-192-cbc 133711.13k 382607.98k 685033.56k 786573.31k 854780.59k aes-256-cbc 314631.74k 553833.58k 683859.97k 719003.99k 738915.67k Orange Pi One Plus / H6 @ 1800 MHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 226657.97k 606014.83k 1013054.98k 1259576.66k 1355773.27k aes-192-cbc 211655.34k 517779.82k 809443.75k 963041.96k 1019251.37k aes-256-cbc 202708.41k 470698.97k 692581.21k 802039.13k 840761.34k NanoPi Fire3 / Nexell S5P6818 @ 1400 MHz (4.14.40 64-bit kernel): type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 96454.85k 303549.92k 637307.56k 909027.59k 1041484.46k aes-192-cbc 91930.59k 274220.78k 527673.43k 705704.40k 785708.37k aes-256-cbc 89652.23k 254797.65k 460436.75k 594723.84k 648388.61k ROCK64 / Rockchip RK3328 @ 1296 MHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 163161.40k 436259.80k 729289.90k 906723.33k 975929.34k aes-192-cbc 152362.85k 375675.22k 582690.99k 
693259.95k 733563.56k aes-256-cbc 145928.50k 337163.26k 498586.20k 577371.48k 605145.77k PineBook / Allwinner A64 @ 1152 MHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 144995.37k 387488.51k 648090.20k 805775.36k 867464.53k aes-192-cbc 135053.95k 332235.56k 516605.95k 609853.78k 650671.45k aes-256-cbc 129690.99k 300415.98k 443108.44k 513158.49k 537903.10k Espressobin / Marvell Armada 3720 @ 1000 MHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 68509.24k 216097.11k 453277.35k 649243.99k 741862.06k aes-192-cbc 65462.17k 194529.30k 375030.70k 503817.22k 559303.34k aes-256-cbc 63905.67k 181436.03k 328664.06k 423431.51k 462012.42k OPi PC2 / Allwinner H5 @ 816 MHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 102568.41k 274205.76k 458456.23k 569923.58k 613422.42k aes-192-cbc 95781.66k 235775.72k 366295.72k 435745.79k 461294.25k aes-256-cbc 91725.44k 211677.08k 313433.77k 362907.31k 380482.90k Banana Pi R2 / MediaTek MT7623 @ 1040 MHz and MTK Crypto Engine active type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 519.15k 1784.13k 6315.78k 25199.27k 124499.22k aes-192-cbc 512.39k 1794.01k 6375.59k 25382.23k 118693.89k aes-256-cbc 508.30k 1795.05k 6339.93k 25042.60k 112943.10k MiQi / RK3288 @ 2000 MHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128 cbc 87295.72k 94739.03k 98363.39k 99325.95k 99562.84k ODROID-HC1 / Samsung Exynos 5244 @ (A15 core @ 2000 MHz): type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 78690.05k 89287.85k 94056.79k 95104.34k 95638.87k aes-192-cbc 69102.10k 77545.47k 81156.61k 81964.71k 82351.45k aes-256-cbc 61715.85k 68172.80k 71120.73k 71710.72k 72040.45k ODROID-C2 / Amlogic S905 @ 1752 MHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 51748.63k 59348.22k 62051.33k 62763.35k 62963.71k aes-192-cbc 46511.57k 52507.95k 54599.08k 55151.27k 55312.38k aes-256-cbc 42094.22k 46302.95k 47941.46k 48372.74k 48513.02k NanoPi 
M3 / Nexell S5P6818 @ 1400 MHz (3.4.39 32-bit kernel): type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 44264.22k 54627.49k 58849.88k 59756.35k 60257.62k aes-192-cbc 39559.11k 47999.32k 51095.30k 51736.15k 52158.46k aes-256-cbc 35803.41k 42665.24k 44926.47k 45733.21k 45883.39k Clearfog Pro / Marvell Armada 38x @ 1600 MHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 47352.87k 54746.43k 57855.57k 58686.12k 58938.71k aes-192-cbc 41516.52k 47126.91k 49317.55k 49932.63k 50151.42k aes-256-cbc 36960.26k 41269.63k 43042.65k 43512.15k 43649.71k Raspberry Pi 3 / BCM2837 @ 1200 MHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 31186.04k 47189.70k 52744.87k 54331.73k 54799.02k aes-192-cbc 30170.93k 40512.11k 44541.35k 45672.11k 45992.62k aes-256-cbc 27073.50k 35401.37k 38504.70k 39369.39k 39616.51k Banana Pi M3 / Allwinner A83T @ 1800 MHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 36122.38k 43447.94k 45895.34k 46459.56k 46713.51k aes-192-cbc 32000.05k 37428.74k 39234.30k 39661.91k 39718.95k aes-256-cbc 28803.39k 33167.72k 34550.53k 34877.10k 35042.65k Banana Pi R2 / MediaTek MT7623 @ 1040 MHz: type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 22082.67k 25522.92k 26626.22k 26912.77k 26995.37k aes-192-cbc 19340.79k 21932.39k 22739.54k 22932.82k 23008.60k aes-256-cbc 17379.62k 19425.11k 20058.03k 20223.66k 20267.01k Edit: Added results for Pinebook and ODROID-HC1 ensuring both were running at max cpufreq Edit 2: Added cpufreq settings for each tested device. Please note throttling dependencies and multi-threaded results below Edit 3: Added Banana Pi M3 single thread performance above. Performance with 8 threads sucks since A83T throttles down to 1.2GHz within 10 minutes and overall AES253 score is below 190000k. Edit 4: Added EspressoBin numbers from here. Another nice example for the efficiency of ARMv8 crypto extensions. Edit 5: Added NanoPi M3 numbers from there. 
Edit 6: Added Clearfog Pro numbers (Cortex-A9 -- unfortunately OpenSSL currently doesn't make use of the CESA crypto engine, otherwise numbers would be 3 to 4 times higher).
Edit 7: Added Banana Pi R2 numbers from here (Cortex-A7; cpufreq scaling has always been broken so the SoC only runs at 1040 MHz, numbers might slightly improve once MTK manages to fix cpufreq scaling).
Edit 8: Added numbers for ARMADA8040 (A72) from CNX comment thread.
Edit 9: Added RK3288 (Cortex A17) numbers from here.
Edit 10: Added RPi 3 (BCM2837) numbers. Please be aware that these are not Raspbian numbers but were made with a 64-bit kernel and Debian arm64 userland. When using Raspbian you get lower numbers!
Edit 11: Added Allwinner H6 numbers from here.
Edit 12: Added RK3399 numbers from here.
Edit 13: Added new S5P6818 numbers since ARMv8 crypto extensions are now available with the mainline 64-bit kernel.
willmore Posted August 21, 2017
@tkaiser Nice summary. The Rock64 looks pretty good. Do you have XU4 results to add as context? Here are results for an i5-3220m (3.2GHz IVB core):
type  16 bytes  64 bytes  256 bytes  1024 bytes  8192 bytes
aes-128-cbc  569283.29k  617659.88k  627125.93k  629601.96k  630164.14k
aes-192-cbc  479330.65k  508491.20k  514591.74k  514747.05k  517674.33k
aes-256-cbc  399388.34k  429790.57k  440986.54k  448194.56k  445876.91k
tkaiser Posted August 21, 2017
36 minutes ago, willmore said: Do you have XU4 results to add as context?
I added them above already (for HC1, which shows better heat dissipation than XU4, but that doesn't matter since 'while true ; do openssl speed -elapsed -evp aes-256-cbc 2>&1 | grep "^aes-256-cbc" ; done' doesn't exceed 70°C reported SoC temperature after 15 minutes).
The summary above might not be 'benchmarking gone wrong' any more but it is still 'numbers without meaning', since without reported cpufreq the numbers don't tell much. Assuming all the A53 cores perform identically at the same clockspeed, I calculated the clockspeeds based on the Pinebook cpufreq: for ROCK64 605 / 537 * 1152 --> 1297, and for OPi PC2 380 / 537 * 1152 --> 815. In other words: with an Armbian image on PC2 that currently does not implement cpufreq scaling we're running the benchmark with just 816 MHz (set by u-boot), so once cpufreq scaling is working the numbers of RK3328, H5 and A64 devices depend solely on cpufreq.
Now the important question: At which clockspeed did Amlogic's S905 run? And what do the numbers look like after we allow the C2 to try some throttling:
timeout 1200 bash -c 'while true ; do openssl speed -elapsed -evp aes-256-cbc 2>&1 | grep "^aes-256-cbc" ; done'
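The clockspeed estimate in the post above can be redone with the unrounded scores. This is only a rough sketch under the same assumption the post makes (all A53 cores with crypto extensions deliver the same AES throughput per MHz); `estimate_mhz` is a made-up helper name, and the inputs are the aes-256-cbc 8192 byte scores from the tables, taking 605145.77k as the ROCK64 value the '605' refers to.

```shell
#!/bin/bash
# Hypothetical helper (not part of openssl or Armbian): estimate an unknown
# clockspeed from two single-threaded 'openssl speed' scores, assuming the
# score scales linearly with clockspeed on identical cores.
estimate_mhz() {
    local score_unknown=$1 score_ref=$2 ref_mhz=$3
    awk -v a="$score_unknown" -v b="$score_ref" -v f="$ref_mhz" \
        'BEGIN { printf "%.0f\n", a / b * f }'
}

# aes-256-cbc, 8192 byte column, trailing "k" dropped; Pinebook (1152 MHz) is the reference
estimate_mhz 605145.77 537903.10 1152   # ROCK64  -> 1296
estimate_mhz 380482.90 537903.10 1152   # OPi PC2 -> 815
```

With the full-precision scores the ROCK64 estimate lands at about 1296 rather than 1297, which is only a rounding difference from using 605 / 537.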
zador.blood.stained Posted August 21, 2017
4 minutes ago, tkaiser said: Now the important question: At which clockspeed did Amlogic's S905 run?
AFAIK it doesn't have ARM crypto extensions so that should be the main reason for lower performance.
tkaiser Posted August 21, 2017
1 minute ago, zador.blood.stained said: AFAIK it doesn't have ARM crypto extensions so that should be the main reason for lower performance.
Of course that's the reason for the much lower numbers, but it would be interesting to know whether the C2 numbers were done at 1.75 GHz or the stock 1.5 GHz, and how performance looks after 20 minutes of running the benchmark continuously. That makes the benchmark behave more like a real world workload, where high AES performance is needed not only for 15 seconds but for longer periods of time. If the SoC for example starts to throttle down after 3 minutes this should also be considered -- unlikely though, since this test seems to be singlethreaded anyway.
tkaiser Posted August 21, 2017
44 minutes ago, tkaiser said: this test seems to be singlethreaded anyway
It is indeed. So to get a more realistic idea about the AES encryption potential when more than one CPU core is involved I would suggest running:
tk@pinebook:~$ cat check-ssl-speed.sh
#!/bin/bash
while true; do
    for i in 0 1 2 3 ; do
        openssl speed -elapsed -evp aes-256-cbc 2>/dev/null &
    done
    wait
done
tk@pinebook:~$ ./check-ssl-speed.sh | grep "^aes-256-cbc"
With Pinebook I'm throttled 'down' to 1056 MHz after a few minutes and the total AES-256 score remains below 2,000,000k: https://pastebin.com/hYDvaRdH
On ODROID-HC1 I prefixed with 'taskset -c 4-7' to let the stuff run on the big cores only. They throttled down to 1.5 GHz after some time and overall performance is slightly above 220,000k: https://pastebin.com/HbZVnp87
Now back on-topic (ROCK64, RK3328, 28nm vs. 40nm with H5/A64): I would believe ROCK64, when making use of the ARM crypto extensions, can remain on 1.3GHz all the time while calculating the stuff with 4 threads in parallel. @zador.blood.stained will you give it a try?
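The script above prints one result line per parallel process, so the 'total' score has to be added up by hand. A possible way to do that automatically is sketched below; `sum_rounds` is a hypothetical helper, and it assumes every round produces exactly as many "aes-256-cbc" lines as there are parallel jobs and that the last column is the 8192 byte score.

```shell
#!/bin/bash
# Hypothetical helper: read 'openssl speed' result lines on stdin and print
# the summed 8192 byte score for every group of N lines (N = parallel jobs).
sum_rounds() {
    local jobs=${1:-4}
    awk -v n="$jobs" '
        /^aes-256-cbc/ {
            v = $NF; sub(/k$/, "", v)   # last column without the trailing "k"
            total += v
            if (++count == n) {         # one full round collected
                printf "%.2fk\n", total
                total = 0; count = 0
            }
        }'
}

# Mechanics only, not a measurement: the Pinebook single-thread line repeated four times
for i in 1 2 3 4; do
    echo "aes-256-cbc 129690.99k 300415.98k 443108.44k 513158.49k 537903.10k"
done | sum_rounds 4                     # prints 2151612.40k
```

In real use something like `./check-ssl-speed.sh | sum_rounds 4` should do, since the awk pattern already filters for the result lines.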
willmore Posted August 21, 2017
@tkaiser The C2 is at 1.752GHz, not the stock 1.5GHz. I'll start running the multi-threaded burn-in and see where it goes.
zador.blood.stained Posted August 21, 2017
10 minutes ago, tkaiser said: I would believe ROCK64 when making use of the ARM crypto extensions can remain on 1.3GHz all the time while calculating the stuff with 4 threads in parallel. @zador.blood.stained will you give it a try?
I'm not sure if the DRAM throughput affects the results in addition to the CPU clock speed (especially in multithreaded/multiprocess scenarios), so I'm not sure if we should push our benchmarking attempts in this direction. In addition I believe I already erased the card I used for tests, so I'll postpone them for now.
IMO storage benchmarks on different boards when using LUKS/cryptsetup with AES encryption would be a more real world scenario, and we could see how disk encryption affects the usual NAS performance on, for example, XU4, Rock64, Clearfog (with the mainline kernel) and something like OPi Plus2E.
tkaiser Posted August 21, 2017
51 minutes ago, zador.blood.stained said: In addition I believe I already erased the card I used for tests
Haha, same problem here -- running out of SD cards (in the meantime I could already collect some 'not booting' experiences with combinations of board + SD card adapter + eMMC modules: Marvell Armada 38x + Pine's FORESEE eMMC --> no boot).
54 minutes ago, zador.blood.stained said: IMO storage benchmarks on different boards when using LUKS/cryptsetup with AES encryption would be a more real world scenario and we could see how disk encryption affects the usual NAS performance on, for example, XU4, Rock64, Clearfog (with the mainline kernel) and something like OPi Plus2E.
Agreed. But while we're at it, a 10 minute check of the multi-threaded 'openssl speed' benchmark should IMO also be done, at least to be able to educate users about the meaning of some numbers (IMO problem N° 1 with benchmarks: how do they correlate with real-world workloads?). And then another use case would be interesting: using such a board as an OpenVPN/IPSec box -- so no storage influence but interesting real-world numbers. In case the 'application' can then benefit from ARM crypto extensions on H5 while not being able to use them with H2+/H3, the demand for a potential OPi R1 upgrade might increase.
willmore Posted August 21, 2017
Okay, so I let them bake for a few hours. The PC2 quickly climbed to 80C and then slowly up to 100C where it stayed. Performance did not change during the run. The C2 slowly climbed up to 49C and stayed there. It also showed no performance change during the test. I have no idea what the clock speed of the PC2 is. It's whatever current mainline Armbian provides -- which @tkaiser said was 815MHz. So, we could expect that one to come up a little, but not much -- 100C seems to be pretty toasty.
zador.blood.stained Posted August 21, 2017
5 minutes ago, willmore said: Okay, so I let them bake for a few hours. The PC2 quickly climbed to 80C and then slowly up to 100C where it stayed.
I would recommend killing the test and turning the board off immediately. There is no DVFS or THS in the 4.11 branch that we are using for H5, so the board may literally bake itself to death without even trying to throttle.
willmore Posted August 21, 2017
@zador.blood.stained Yeah, I turned it off. Thanks, though.
tkaiser Posted August 21, 2017
1 hour ago, willmore said: I have no idea what the clock speed of the PC2 is. It's whatever current mainline Armbian provides -- which @tkaiser said was 815MHz
That was just doing the math: I knew A64 was clocked at 1152 MHz and calculated the clockspeed based on the values for Pinebook and OPi PC 2 --> 815. So I assumed PC2 is running at 816 MHz. In the meantime I tested with my only H5 board (OPi Zero Plus 2 H5): openssl speed -elapsed -evp aes-256-cbc --> 26782.38k (Huh? What's going on here? Debug output).
I again did some math: I ran sysbench on OPi PC2 and ROCK64, took both execution times and, naively assuming PC2 was running at 816 MHz again, calculated 'echo "11.2318 / 7.1657 * 816" | bc -l' --> 1279.03049248503286489152 (ROCK64 was running at 1296 MHz). Why are my AES scores that low?
Edit: Found a bug in armhwinfo. New armbianmonitor -u output here: http://sprunge.us/MdKL
Since I found a spare SD card I couldn't resist testing with ROCK64 again (my first board with 2GB and an el cheapo heatsink applied). Debian Stretch, 'OpenSSL 1.1.0f 25 May 2017', same numbers as with Jessie when running single threaded. When testing AES256 with 4 threads it starts at almost 2,400,000k and after some time throttles down to 2,100,000k (it was even below 2,050,000k, but that was due to 'rock64_health.sh -w' running in parallel, which is way too resource hungry in this mode, updating every 0.5s): https://pastebin.com/Ck15UQv4
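The sysbench cross-check from the post above works with execution times instead of throughput, so the ratio is inverted: the board that finishes faster is assumed to run at the proportionally higher clock. A small sketch of that calculation follows, using awk instead of bc; `mhz_from_times` is a made-up name, and the linear scaling between the two A53 boards is the same naive assumption as in the post.

```shell
#!/bin/bash
# Hypothetical helper: estimate a clockspeed from two sysbench execution times,
# assuming execution time is inversely proportional to clockspeed.
mhz_from_times() {
    local t_ref=$1 t_other=$2 ref_mhz=$3
    awk -v tr="$t_ref" -v to="$t_other" -v f="$ref_mhz" \
        'BEGIN { printf "%.0f\n", tr / to * f }'
}

# OPi PC2: 11.2318 s at an assumed 816 MHz; ROCK64: 7.1657 s
mhz_from_times 11.2318 7.1657 816   # prints 1279
```

That is within roughly 1.3% of the 1296 MHz the ROCK64 was actually running at, which supports the 816 MHz assumption for the PC2.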
valant Posted August 24, 2017
A noob question. Rock64 has a USB3 port, and it is possible to attach a SATA-USB adapter there, like the one Pine64 sells in their store. Hardkernel has added such a bridge right onto the board with their new HC1. I am wondering (since I have no possibility to check this directly) how these bridges present themselves to the host. More specifically: do they expose themselves as AHCI/SATA controllers, the same way as their x86 brothers sitting on PCIe do? Everybody is talking about UAS here, so probably they don't expose their SATA nature (other than that SAT thing), but I am totally ignorant with respect to USB internals so far, so I might be wrong in presuming that if they exposed themselves as SATA controllers sitting on a USB bus, there would be no need for UAS with them. If they don't act like their pci-ide/ahci x86 counterparts, then why? Why can a SATA controller put on the SoC interconnect or PCIe bus be "native SATA", but not when put on USB?
TonyMac32 Posted August 24, 2017
20 minutes ago, valant said: but I am totally ignorant with respect to USB internals
The more ignorant of those internals you are, the happier your life will be. I wrote some simple drivers in assembly language for USB 2, and being totally honest, I kind of wish that standard had sunk to the bottom of the ocean in the late 90's before it hit revision 2.0. USB has some special bus modes to handle mass data traffic (UAS). All data on USB is of a "type"; it doesn't easily have a type-agnostic transfer like PCI(e). If it were exposed as a SATA device, this would require a software layer doing the translation, and would most likely be less efficient. Now, the experts will know more and (most probably) make me look foolish, but it's part of the job.