Xalius

ROCK64


On 18.8.2017 at 8:26 AM, Stuart Naylor said:

has anyone tried port trunking a USB3.0 1Gb with the on-board just to have a glance at the workload & throughput?

 

RK3328 can saturate the internal GbE MAC in combination with the external RTL8211 PHY (please note: pre-production samples had the 8211E while production boards feature the 8211F; no idea yet what this means wrt performance/consumption) as well as an RTL8153 USB GbE dongle. With appropriate IRQ affinity it can do both at the same time. With synthetic benchmarks you then get ~1700 Mbits/sec combined.

 

Now let's talk about use cases:

  • When we're talking about 'trunks', this is usually called link aggregation (IEEE 802.1AX-2008, formerly known as IEEE 802.3ad). This mode does NOT increase bandwidth for an individual network connection; it only provides a mechanism to put individual node connections on either link. So you will NOT end up with 2 Gbits/sec but with 2 x 1 Gbits/sec instead. The algorithm used to determine which link carries which connection has to be chosen carefully, since it's pretty easy to configure everything in a way that all traffic remains on one link while the other is unused. In n-to-1 topologies n should be at least 10 for trunking/bonding/LACP to become useful.
  • What to do with 2 x 1 Gbits/sec? Which data to transmit? If the USB3 port is occupied by an RTL8153, the remaining interfaces are 2 x USB2 (each ~40 MB/s when 'USB Attached SCSI' (UAS) can be used, otherwise it's safe to assume 35 MB/s max) or eMMC and SD card. Even with RAID-0 across the 2 USB2 ports we're not close to an IO bandwidth that would satisfy a 2nd GbE NIC. So we would need USB3 storage too, and then there's the need to add a USB3 hub.
  • There exist different kinds of USB3 hubs, and especially older ones are error prone. Based on some research a year ago I believe(d) choosing a USB3 hub based on VIA812 is a good idea. There also exist some VIA812/RTL8153 combinations (like the one you can see on this picture, which I bought for ~20 bucks a few months ago -- in the same thread at the top you also see some performance numbers, and you should also keep in mind that ODROID XU4/HC1 use this very same chip for their onboard GbE).
  • To make use of an additional RTL8153 with storage use cases we would need to put a USB3 hub in between, and then I'm already somewhat concerned with regard to reliability (the more complexity the less reliability).
  • Next problem: sequential transfer speed limits of HDDs. Even with the fastest 3.5" HDDs currently available, sequential transfer speeds drop below 100 MB/s as the disk fills up due to ZBR (zone bit recording); top sequential speeds are only possible with empty disks, when you benchmark on the outer disk tracks. To make use of 2 GbE links we would also need to combine at least two disks in a RAID-0 fashion. This is dangerous since a single disk failure will render all your data unusable.
  • So what about redundant RAID modes? If we use such a VIA812/RTL8153 combination we could at least connect 3 disks and play RAID-5. Then we're switching from dangerous to insanely dangerous, since from then on the USB hub acts as a single point of failure. Let some USB resets happen for whatever reason: all 3 disks behind the hub become inaccessible, so the mdraid code will trash the whole array (please believe me: I have been dealing with failing RAIDs for exactly 2 decades now and can tell you that RAID is only great until you need it).
  • Then you need a bunch of external PSUs and a whole mess of cables to set up such a multi-disk environment, and if you add up all the costs you might realize that ROCK64 is a great single-disk NAS, but if you have to add more than one disk, other solutions like Helios4 or an x64-based HP Microserver look more suitable (or Marvell-based solutions like Clearfog or Espressobin, where you get between 1 and 5 or even 9 real SATA ports without any USB3 crappiness in between).
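
To illustrate the point in the first bullet about the link selection algorithm, here is a rough Python sketch. It is a deliberately simplified model loosely inspired by layer-2 style hash policies (XOR of MAC address bytes, modulo the number of links); it is not the exact code any bonding driver uses, and the MAC addresses and the function name are made up for the example.

```python
# Simplified model of a layer-2 style transmit hash (an assumption for
# illustration, not the exact algorithm of any real bonding driver):
# XOR the last byte of source and destination MAC, then take it modulo
# the number of links in the trunk.

def pick_link(src_mac: str, dst_mac: str, n_links: int = 2) -> int:
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % n_links

# Hypothetical addresses: one server (the ROCK64) and three clients.
SERVER = "02:00:00:00:00:10"
CLIENTS = [
    "02:00:00:00:00:21",
    "02:00:00:00:00:23",
    "02:00:00:00:00:25",
]

links = [pick_link(SERVER, client) for client in CLIENTS]
for client, link in zip(CLIENTS, links):
    print(f"{client} -> link {link}")
```

With these made-up addresses all three clients end up on the same link and the second link stays idle, which is exactly the failure mode described above. A given client also always maps to one link, so a single connection never gets more than 1 Gbit/sec.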

TL;DR: It's possible to implement trunking, and performance with synthetic benchmarks will look nice if you benchmark with 2 clients and add up the bandwidth (quite unrealistic of course), but I fail to identify a single use case that would justify trunking with an RK3328 device like ROCK64. Most probably the idea is not trunking but aggregated bandwidth, as is possible with some LAN protocols (for example 'SMB Multichannel', available since Windows Server 2012 -- really impressive stuff) or SAN topologies (iSCSI multipathing for example). But even then, due to the single USB3 port on RK3328, it's a really bad idea, since added storage means more complexity (USB hub in between) and this negatively affects reliability.
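
The storage-side arithmetic behind this can be sketched in a few lines of Python. The 40 MB/s per USB2 port and the ~1700 Mbits/sec figure are taken from this post; the 125 MB/s for one GbE link is simply the raw line rate (1000 Mbit/s divided by 8) and ignores protocol overhead, so treat the output as a rough estimate only.

```python
# Back-of-the-envelope comparison of storage vs. network bandwidth.
# Assumption: decimal units throughout (1 MB = 10^6 bytes), no protocol overhead.

def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert Mbit/s to MB/s."""
    return mbit_per_s / 8

USB2_UAS_MB_S = 40                    # per USB2 port with UAS, from the post
raid0_two_usb2 = 2 * USB2_UAS_MB_S    # best case: RAID-0 across both USB2 ports
one_gbe = mbit_to_mbyte(1000)         # raw line rate of a single GbE link
combined = mbit_to_mbyte(1700)        # measured combined synthetic throughput

print(f"RAID-0 on 2 x USB2: {raid0_two_usb2} MB/s")
print(f"One GbE link:       {one_gbe} MB/s")
print(f"Both NICs combined: {combined} MB/s")
```

Even the optimistic RAID-0 figure stays below what a single GbE link can carry, so the second NIC has nothing to feed it unless USB3 storage is added.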

 

Wrt JMS561 (USB-to-SATA bridge combined with SATA port multiplier and primitive RAID engine) please check @Kosmatik's experiences (ODROID-XU4 user running into a lot of problems with Hardkernel's Cloudshell 2 device that relies on this chip). Due to the issues reported here and there I would not use any device based on JMS561 (or older such chips like JMS539). But based on my experiences with failed RAIDs and my attempts to avoid single points of failure, I would never use any of these proprietary chips anyway.


@tkaiser, while all of that is correct, there is another use case that you didn't consider.  What about using this for a NAS/router?  One interface towards the internal network and the other to the outside.  Not all of the traffic has to terminate on the Rock64.

 

 


@tkaiser
Yeah, I know what port trunking is, and to be honest I was just wondering if anyone had done any ioperf tests/helios.

I was just wondering about the Rock64 being a server that can supply multiple clients.
As for the NAS/router idea posted later, it's also more curiosity about what can be achieved with that USB3.0 on a SoC when much of its bandwidth is bottlenecked by the Ethernet.
Is the ~1700 Mbits/sec combined figure from pulling from an SSD on the same USB?

Anyone with any IO and CPU stats?

If this is wise or not is very much a matter of choice and purpose.

Regarding @Kosmatik's experiences with the JMS561: I posted the fix for Smartmon to stop it sending the wrong call to the controller. It's at the end of that thread.
https://www.smartmontools.org/ticket/552
Again, it's curiosity, but I find it hard to differentiate a $20 JMS561 running two disks in RAID1 from, say, the $10 single USB adaptor that many are doing perf tests with.

It is also completely dependent on the disks you choose, be they SSD, HDD or even hybrid.
http://www.seagate.com/www-content/product-content/seagate-laptop-fam/barracuda_25/en-us/docs/100807728d.pdf
As for cables and PSUs, you could argue the only difference is what is hidden in an enclosure, and don't we have PSUs and cables anyway when it's USB3.0 and 3.1?
Has anyone tested a cheap $20 JMS561 after fixing the smartmon bug?

I posted a $20 adapter but guess you guys might have a 2bay with the same chipset and just wondered how the SoC would cope and what is achievable.

If anyone ever has the time and fancies giving it a go because they already have the equipment, I would be really interested, and I think it would be of interest to many.

The JMS561 was just an example of a single chipset; there are others, including ones that do 4 bays and above.
They may be cheap & nasty, but they are getting really cheap and that might make them more fit for purpose.

We could have newer forms of mediastore that are more suited to how we use data especially media.
OverlayFS could have an SSD Upper with a HDD Lower mounted over NFS with a cheap SoC supplying numerous users.
Where you archive down to the lower.
Could even have a decentralized volume spread over network nodes or a cluster, where capacity is just add another node.

USB3.0 could well be a precursor to the next rake of 3.1 systems with C connectors...

There might be objections, but I think it's interesting and also useful to know what these SoCs are capable of, without assuming a single method or use.

https://wdullaer.com/blog/2016/03/19/create-a-nas-with-redundancy-using-snapraid/

I keep thinking the Rock64 could make a great Kodi box that shares a USB attached mirror via NFS. No link aggregation no Snapraid just a cheap hardware mirror.
For home a few boxes can pool those shares via aufs making a very simple node and collective system that scales by just adding another.
You gain bandwidth by diversification as you are not always sharing from a central store.
There are all sorts of ways you could use bandwidth when it starts to become available at this cost, maybe it would be informative to give it a try.
 

1 hour ago, zador.blood.stained said:

Rock64 also shows exceptionally high numbers for the AES encryption in cryptsetup benchmark, and I wonder if it would also show such high numbers in Syncthing, which would make it a very good node for personal backup infrastructure based on Syncthing (or BTSync which also uses AES).

AES encryption could be of interest but isn't the performance due to the embedded cipher engine or is there a way to use it with something like Snapraid?
You might have link aggregated Rock64s in a Snapraid cluster or decentralized node array :)   

2 hours ago, zador.blood.stained said:

Rock64 also shows exceptionally high numbers for the AES encryption in cryptsetup benchmark, and I wonder if it would also show such high numbers in Syncthing, which would make it a very good node for personal backup infrastructure based on Syncthing (or BTSync which also uses AES).

Where have you seen cryptsetup benchmark results for the rock64?  I searched this thread and didn't find any.  It's just an A53 with the AES extensions, right?  So, we'd expect something like the H5 +/- some for clock speed differences?

Orange Pi PC2 (AllWinner H5)

root@orangepipc2:~# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       129262 iterations per second
PBKDF2-sha256      76293 iterations per second
PBKDF2-sha512      70773 iterations per second
PBKDF2-ripemd160  109409 iterations per second
PBKDF2-whirlpool   24435 iterations per second
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   238.4 MiB/s   296.1 MiB/s
 serpent-cbc   128b    17.0 MiB/s    19.2 MiB/s
 twofish-cbc   128b    25.9 MiB/s    28.2 MiB/s
     aes-cbc   256b   204.6 MiB/s   267.8 MiB/s
 serpent-cbc   256b    17.2 MiB/s    19.1 MiB/s
 twofish-cbc   256b    26.1 MiB/s    28.2 MiB/s
     aes-xts   256b   259.8 MiB/s   261.3 MiB/s
 serpent-xts   256b    17.7 MiB/s    19.5 MiB/s
 twofish-xts   256b    27.7 MiB/s    28.7 MiB/s
     aes-xts   512b   240.3 MiB/s   239.8 MiB/s
 serpent-xts   512b    18.1 MiB/s    19.5 MiB/s
 twofish-xts   512b    28.2 MiB/s    28.6 MiB/s

By way of comparison, a faster clocked A53 without AES (Odroid-C2 Amlogic S905):

root@odroid64:~# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       275941 iterations per second
PBKDF2-sha256     165913 iterations per second
PBKDF2-sha512     152409 iterations per second
PBKDF2-ripemd160  238312 iterations per second
PBKDF2-whirlpool   52851 iterations per second
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b    42.4 MiB/s    44.2 MiB/s
 serpent-cbc   128b    34.5 MiB/s    37.7 MiB/s
 twofish-cbc   128b    42.6 MiB/s    42.2 MiB/s
     aes-cbc   256b    32.7 MiB/s    33.0 MiB/s
 serpent-cbc   256b    35.2 MiB/s    37.7 MiB/s
 twofish-cbc   256b    43.5 MiB/s    42.2 MiB/s
     aes-xts   256b    45.2 MiB/s    44.7 MiB/s
 serpent-xts   256b    36.5 MiB/s    38.1 MiB/s
 twofish-xts   256b    45.5 MiB/s    42.7 MiB/s
     aes-xts   512b    34.1 MiB/s    33.3 MiB/s
 serpent-xts   512b    36.9 MiB/s    38.1 MiB/s
 twofish-xts   512b    45.9 MiB/s    42.7 MiB/s
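
To put the two cryptsetup runs side by side, here is a rough Python sketch that computes the AES speedup from the encryption numbers posted above. Only the figures come from the two transcripts; the dictionary layout and the names are my own choices for illustration.

```python
# AES encryption throughput in MiB/s, copied from the two cryptsetup runs above.
H5 = {
    "aes-cbc-128": 238.4,
    "aes-cbc-256": 204.6,
    "aes-xts-256": 259.8,
    "aes-xts-512": 240.3,
}
S905 = {
    "aes-cbc-128": 42.4,
    "aes-cbc-256": 32.7,
    "aes-xts-256": 45.2,
    "aes-xts-512": 34.1,
}

def speedup(fast: dict, slow: dict) -> dict:
    """Ratio fast/slow for every cipher present in 'fast'."""
    return {name: fast[name] / slow[name] for name in fast}

ratios = speedup(H5, S905)
for name, ratio in ratios.items():
    print(f"{name}: H5 is {ratio:.1f}x faster than S905")
```

The PBKDF2 lines point the other way (the S905 does roughly twice the iterations per second), so the gap is specific to AES and not a general CPU speed difference.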

 


@willmore

http://opensource.rock-chips.com/images/d/d7/Rockchip_RK3328_Datasheet_V1.1-20170309.pdf
Quad-core Cortex-A53 is integrated with separate Neon and FPU coprocessor, also with shared L2 Cache. The Quad-core GPU supports high-resolution display and game. Lots of high-performance interface to get very flexible solution, such as multi-channel display including HDMI2.0a and TV Encoder (CVBS). TrustZone and crypto hardware are integrated for security. 32bits DDR3/DDR3L/DDR4/LPDDR3 provides high memory bandwidth.

Cipher engine
  • Support AES 128/192/256
  • Supports the DES (ECB and CBC modes) and TDES (EDE and DED) algorithms
  • Supports MD5, SHA-1 and SHA-256 HASH algorithms
  • Support PKA(RSA) 512/1024/2048 bit Exp Modulator
  • Support 160-bit Pseudo Random Number Generator (PRNG)
  • Support 256-bit True Random Number Generator (TRNG)

 

Apart from that I don't know, and I'm not sure how well supported it is or whether anyone has done any benchmarks yet.
So yeah the NEON extensions.
Maybe @zador.blood.stained will supply some.

8 hours ago, willmore said:

Where have you seen cryptsetup benchmark results for the rock64?

I tested it myself while making a Rock64 configuration for Armbian. I'm still not sure why cryptsetup shows much higher numbers than openssl (so I decided not to post them right away without making some real-world tests with cryptsetup on real storage, but even if I had a spare SSD for a benchmark, I broke the USB3 port while desoldering the protection diodes).

 

8 hours ago, willmore said:

It's just an A53 with the AES extensions, right?

Yes, and a relatively fast DRAM. So just by numbers Rock64 (4.4 kernel, performance governor) is more or less twice as fast as the Pinebook (3.10 kernel, performance governor, A64 has AES instructions too) and more or less 4 times as fast as Armada A388 with CESA.

 

8 hours ago, willmore said:

So, we'd expect something like the H5 +/- some for clock speed differences?

AFAIK A53 cores in H5 don't have AES support? Can you post contents of /proc/cpuinfo ?

9 minutes ago, nobe said:

short version -> you might need to check if openssl is compiled with cryptodev enabled 

AFAIK cryptodev is not in mainline, so it requires an out-of-tree kernel module. AF_ALG, on the other hand, may explain the performance, and it requires OpenSSL 1.1 or higher, which can be found e.g. on Debian Stretch, while Ubuntu Xenial and Debian Jessie have OpenSSL 1.0.x.


Took some time to get some actual numbers.

"cryptsetup benchmark" shows similar (within ±5% margin) results on both Xenial and Stretch:

root@rock64:~# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       273066 iterations per second for 256-bit key
PBKDF2-sha256     514007 iterations per second for 256-bit key
PBKDF2-sha512     214872 iterations per second for 256-bit key
PBKDF2-ripemd160  161817 iterations per second for 256-bit key
PBKDF2-whirlpool   72817 iterations per second for 256-bit key
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   366.3 MiB/s   455.7 MiB/s
 serpent-cbc   128b    25.0 MiB/s    27.4 MiB/s
 twofish-cbc   128b    29.4 MiB/s    30.9 MiB/s
     aes-cbc   256b   314.2 MiB/s   412.9 MiB/s
 serpent-cbc   256b    25.3 MiB/s    27.4 MiB/s
 twofish-cbc   256b    29.5 MiB/s    30.9 MiB/s
     aes-xts   256b   401.9 MiB/s   403.9 MiB/s
 serpent-xts   256b    26.7 MiB/s    28.0 MiB/s
 twofish-xts   256b    31.3 MiB/s    31.6 MiB/s
     aes-xts   512b   365.8 MiB/s   365.4 MiB/s
 serpent-xts   512b    26.7 MiB/s    27.9 MiB/s
 twofish-xts   512b    31.4 MiB/s    31.6 MiB/s

openssl benchmark results are a little bit different, so I'm not sure if "benchmarking gone wrong" or what

Xenial:

root@rock64:~# openssl speed -elapsed -evp aes-128-cbc aes-192-cbc aes-256-cbc
(cut)
OpenSSL 1.0.2g  1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr)
compiler: cc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     163161.40k   436259.80k   729289.90k   906723.33k   975929.34k
aes-192-cbc     152362.85k   375675.22k   582690.99k   693259.95k   733563.56k
aes-256-cbc     145928.50k   337163.26k   498586.20k   577371.48k   605145.77k

Stretch:

OpenSSL 1.1.0f  25 May 2017
built on: reproducible build, date unspecified
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/lib/ssl\"" -DENGINESDIR="\"/usr/lib/aarch64-linux-gnu/engines-1.1\""

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc      89075.61k   281317.21k   589750.10k   844657.32k   965124.10k   975323.14k
aes-192-cbc      85167.28k   252748.95k   487843.41k   655406.42k   727607.98k   733538.99k
aes-256-cbc      83124.71k   235290.07k   427535.10k   550874.11k   600997.89k   603417.26k

Edit: looks like benchmarking actually went wrong and "-evp" parameter placement (or existence) on the command line affects the benchmark

Edit 2: Redid and updated Stretch numbers

Edit 3: Redid and updated Xenial numbers

 

Had to run "openssl speed -elapsed -evp <alg>" for each algorithm separately.
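
Since cryptsetup reports MiB/s while openssl reports "1000s of bytes per second", the two sets of numbers are easier to compare after a unit conversion. A small Python sketch (the function name is my own; the figures are the ROCK64 numbers from this post):

```python
# openssl speed prints throughput in units of 1000 bytes per second ("k").
# cryptsetup benchmark prints MiB/s (1 MiB = 1024 * 1024 bytes).

def openssl_k_to_mib(k_value: float) -> float:
    """Convert an openssl 'k' figure (1000 bytes/s) to MiB/s."""
    return k_value * 1000 / (1024 * 1024)

openssl_aes128_8k = 975929.34    # aes-128-cbc, 8192-byte blocks
cryptsetup_aes128_enc = 366.3    # aes-cbc 128b encryption, already in MiB/s

converted = openssl_k_to_mib(openssl_aes128_8k)
print(f"openssl aes-128-cbc:   {converted:.1f} MiB/s")
print(f"cryptsetup aes-cbc128: {cryptsetup_aes128_enc:.1f} MiB/s")
```

Keep in mind the two tools do not measure the same thing: as far as I know cryptsetup benchmarks the kernel crypto API while openssl speed runs its own userspace implementation, so a gap between them is not necessarily a measurement error.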


@willmore

 

Can you supply the same for the Odroid64 & orangepipc2?

openssl speed -elapsed -evp aes-128-cbc aes-192-cbc aes-256-cbc

Must be the Neon AES & SHA support and boy is the AES optimization off the chart for the Rock64 with the OrangePiPC2 not being bad either. 
 


@zador.blood.stained

 

From OrangePiPC2:

Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
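
For anyone who wants to check their own board, a minimal Python sketch that looks for the relevant flags in /proc/cpuinfo text. The helper name is made up, and it assumes the ARM layout shown above (a "Features" line with space-separated flags); other architectures use a different key.

```python
# Parse the "Features" line of /proc/cpuinfo (ARM layout, as posted above)
# and report whether the AES/SHA instruction flags are present.

def cpu_features(cpuinfo_text: str) -> set:
    """Return the set of flags from the first 'Features' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            return set(value.split())
    return set()

# The line posted above for the Orange Pi PC2 (Allwinner H5).
SAMPLE = "Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid"

flags = cpu_features(SAMPLE)
for wanted in ("aes", "pmull", "sha1", "sha2"):
    print(f"{wanted}: {'yes' if wanted in flags else 'no'}")
```

On a real board, pass the contents of /proc/cpuinfo instead of SAMPLE, for example `cpu_features(open("/proc/cpuinfo").read())`.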

 

root@orangepipc2:~# openssl speed -elapsed -evp aes-128-cbc aes-192-cbc aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-192 cbc for 3s on 16 size blocks: 4382225 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 64 size blocks: 1168568 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 256 size blocks: 299007 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 1024 size blocks: 75171 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 8192 size blocks: 9412 aes-192 cbc's in 3.00s
Doing aes-256 cbc for 3s on 16 size blocks: 3942328 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 64 size blocks: 1028331 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 256 size blocks: 262540 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 1024 size blocks: 65973 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 8302 aes-256 cbc's in 3.00s
Doing aes-128-cbc for 3s on 16 size blocks: 19229648 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 12855383 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 5371646 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 1669660 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 224669 aes-128-cbc's in 3.00s
OpenSSL 1.0.2g  1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr)
compiler: cc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-192 cbc      23371.87k    24929.45k    25515.26k    25658.37k    25701.03k
aes-256 cbc      21025.75k    21937.73k    22403.41k    22518.78k    22669.99k
aes-128-cbc     102558.12k   274248.17k   458380.46k   569910.61k   613496.15k
root@odroid64:~# openssl speed -elapsed -evp aes-128-cbc aes-192-cbc aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-192 cbc for 3s on 16 size blocks: 9426226 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 64 size blocks: 2513241 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 256 size blocks: 642946 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 1024 size blocks: 161675 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 8192 size blocks: 20241 aes-192 cbc's in 3.00s
Doing aes-256 cbc for 3s on 16 size blocks: 8471996 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 64 size blocks: 2211530 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 256 size blocks: 564468 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 1024 size blocks: 141815 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 17766 aes-256 cbc's in 3.00s
Doing aes-128-cbc for 3s on 16 size blocks: 9706011 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 2782108 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 727117 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 183869 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 23058 aes-128-cbc's in 3.00s
OpenSSL 1.0.2g  1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr) 
compiler: cc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-192 cbc      50273.21k    53615.81k    54864.73k    55185.07k    55271.42k
aes-256 cbc      45183.98k    47179.31k    48167.94k    48406.19k    48513.02k
aes-128-cbc      51765.39k    59351.64k    62047.32k    62760.62k    62963.71k

Looks like openssl uses the AES instructions for the 128 bit keylength, but not 192 nor 256 which is a bit strange.  Then again, it's an old version.

 

The Odroid C2 is running Xenial and the PC2 is running Armbian current.

11 minutes ago, willmore said:

Looks like openssl uses the AES instructions for the 128 bit keylength, but not 192 nor 256 which is a bit strange.  Then again, it's an old version.

It's most likely "benchmarking gone wrong".

11 minutes ago, willmore said:

openssl speed -elapsed -evp aes-128-cbc aes-192-cbc aes-256-cbc

-evp here applies only to the next algo on the command line (aes-128-cbc); the 2 following ones are not affected by this option. So I would advise rerunning the test 3 times, 1 algo at a time, and editing/posting a combined table.

openssl speed -elapsed -evp aes-128-cbc
openssl speed -elapsed -evp aes-192-cbc
openssl speed -elapsed -evp aes-256-cbc
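
If you want to script the three separate runs, something along these lines should do. It is only a sketch: the helper names are made up, it assumes an `openssl` binary on the PATH, and `run_all()` is not called automatically because each run takes a while.

```python
import subprocess

ALGOS = ["aes-128-cbc", "aes-192-cbc", "aes-256-cbc"]

def build_commands(algos=ALGOS):
    """One 'openssl speed' invocation per algorithm, so -evp applies to each."""
    return [["openssl", "speed", "-elapsed", "-evp", algo] for algo in algos]

def run_all(algos=ALGOS):
    """Run the benchmarks one after another; output goes to the terminal."""
    for cmd in build_commands(algos):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

for cmd in build_commands():
    print(" ".join(cmd))
```

Calling `run_all()` on the board then produces three separate result tables that can be merged by hand.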

 


Okay, composited:

root@orangepipc2
Doing aes-128-cbc for 3s on 16 size blocks: 19231577 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 12853395 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 5372534 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 1669698 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 224642 aes-128-cbc's in 3.00s
Doing aes-192-cbc for 3s on 16 size blocks: 17959061 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 64 size blocks: 11051987 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 256 size blocks: 4292528 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 1024 size blocks: 1276599 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 8192 size blocks: 168931 aes-192-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16 size blocks: 17198520 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 9922363 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 3673052 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1063205 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 139337 aes-256-cbc's in 3.00s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     102568.41k   274205.76k   458456.23k   569923.58k   613422.42k
aes-192-cbc      95781.66k   235775.72k   366295.72k   435745.79k   461294.25k
aes-256-cbc      91725.44k   211677.08k   313433.77k   362907.31k   380482.90k
root@odroid64
Doing aes-128-cbc for 3s on 16 size blocks: 9702869 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 2781948 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 727164 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 183877 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 23058 aes-128-cbc's in 3.00s
Doing aes-192-cbc for 3s on 16 size blocks: 8720919 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 64 size blocks: 2461310 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 256 size blocks: 639833 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 1024 size blocks: 161576 aes-192-cbc's in 3.00s
Doing aes-192-cbc for 3s on 8192 size blocks: 20256 aes-192-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16 size blocks: 7892666 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 2170451 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 561814 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 141717 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 17766 aes-256-cbc's in 3.00s
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      51748.63k    59348.22k    62051.33k    62763.35k    62963.71k
aes-192-cbc      46511.57k    52507.95k    54599.08k    55151.27k    55312.38k
aes-256-cbc      42094.22k    46302.95k    47941.46k    48372.74k    48513.02k
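
For reference, the summary table can be reproduced from the raw "Doing ..." lines: throughput is the number of operations times the block size, divided by the elapsed time. A short Python sketch (the regular expression and function name are my own; the sample lines are from the Orange Pi PC2 run above):

```python
import re

# Matches lines such as:
# Doing aes-128-cbc for 3s on 16 size blocks: 19231577 aes-128-cbc's in 3.00s
LINE_RE = re.compile(
    r"Doing (?P<alg>.+?) for \d+s on (?P<size>\d+) size blocks: "
    r"(?P<count>\d+) .+ in (?P<secs>[\d.]+)s"
)

def throughput_k(line: str):
    """Return (algorithm, block size, throughput in 1000 bytes/s) for one line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    size = int(m.group("size"))
    count = int(m.group("count"))
    secs = float(m.group("secs"))
    return m.group("alg"), size, count * size / secs / 1000

SAMPLES = [
    "Doing aes-128-cbc for 3s on 16 size blocks: 19231577 aes-128-cbc's in 3.00s",
    "Doing aes-128-cbc for 3s on 8192 size blocks: 224642 aes-128-cbc's in 3.00s",
]

for sample in SAMPLES:
    alg, size, k = throughput_k(sample)
    print(f"{alg} {size:>5} bytes: {k:.2f}k")
```

The two sample lines give the same values as the first and last aes-128-cbc columns of the orangepipc2 table.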

 

Edited by willmore
Remove html formatting for plain text


Hmm... to summarize the 'OpenSSL 1.0.2g  1 Mar 2016' results for the 3 boards/SoC tested above with some more numbers added (on all A53 cores with crypto extensions enabled performance is directly proportional to CPU clockspeeds -- nice):

ODROID N1 / RK3399 A72 @ 2.0GHz:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     377879.56k   864100.25k  1267985.24k  1412154.03k  1489756.16k
aes-192-cbc     325844.85k   793977.30k  1063641.34k  1242280.28k  1312189.10k
aes-256-cbc     270982.47k   721167.51k   992207.02k  1079193.94k  1122691.75k

ODROID N1 / RK3399 A53 @ 1.5GHz:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     103350.94k   326209.49k   683714.13k   979303.08k  1118808.75k
aes-192-cbc      98758.18k   291794.65k   565252.01k   759266.99k   843298.13k
aes-256-cbc      96390.77k   273654.98k   495746.99k   638750.04k   696857.94k

MacchiatoBin / ARMADA 8040 @ 1.3GHz:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     360791.31k   684250.01k   885927.34k   943325.18k   977362.94k
aes-192-cbc     133711.13k   382607.98k   685033.56k   786573.31k   854780.59k
aes-256-cbc     314631.74k   553833.58k   683859.97k   719003.99k   738915.67k

Orange Pi One Plus / H6 @ 1800 MHz:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     226657.97k   606014.83k  1013054.98k  1259576.66k  1355773.27k
aes-192-cbc     211655.34k   517779.82k   809443.75k   963041.96k  1019251.37k
aes-256-cbc     202708.41k   470698.97k   692581.21k   802039.13k   840761.34k

NanoPi Fire3 / Nexell S5P6818 @ 1400 MHz (4.14.40 64-bit kernel):
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      96454.85k   303549.92k   637307.56k   909027.59k  1041484.46k
aes-192-cbc      91930.59k   274220.78k   527673.43k   705704.40k   785708.37k
aes-256-cbc      89652.23k   254797.65k   460436.75k   594723.84k   648388.61k

ROCK64 / Rockchip RK3328 @ 1296 MHz:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     163161.40k   436259.80k   729289.90k   906723.33k   975929.34k
aes-192-cbc     152362.85k   375675.22k   582690.99k   693259.95k   733563.56k
aes-256-cbc     145928.50k   337163.26k   498586.20k   577371.48k   605145.77k

PineBook / Allwinner A64 @ 1152 MHz:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     144995.37k   387488.51k   648090.20k   805775.36k   867464.53k
aes-192-cbc     135053.95k   332235.56k   516605.95k   609853.78k   650671.45k
aes-256-cbc     129690.99k   300415.98k   443108.44k   513158.49k   537903.10k

Espressobin / Marvell Armada 3720 @ 1000 MHz:
type             16 bytes    64 bytes     256 bytes    1024 bytes   8192 bytes
aes-128-cbc      68509.24k   216097.11k   453277.35k   649243.99k   741862.06k
aes-192-cbc      65462.17k   194529.30k   375030.70k   503817.22k   559303.34k
aes-256-cbc      63905.67k   181436.03k   328664.06k   423431.51k   462012.42k

OPi PC2 / Allwinner H5 @ 816 MHz:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     102568.41k   274205.76k   458456.23k   569923.58k   613422.42k
aes-192-cbc      95781.66k   235775.72k   366295.72k   435745.79k   461294.25k
aes-256-cbc      91725.44k   211677.08k   313433.77k   362907.31k   380482.90k

Banana Pi R2 / MediaTek MT7623 @ 1040 MHz and MTK Crypto Engine active:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc        519.15k     1784.13k     6315.78k    25199.27k   124499.22k
aes-192-cbc        512.39k     1794.01k     6375.59k    25382.23k   118693.89k
aes-256-cbc        508.30k     1795.05k     6339.93k    25042.60k   112943.10k

MiQi / RK3288 @ 2000 MHz:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128 cbc      87295.72k    94739.03k    98363.39k    99325.95k    99562.84k

ODROID-HC1 / Samsung Exynos 5422 (A15 core @ 2000 MHz):
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      78690.05k    89287.85k    94056.79k    95104.34k    95638.87k
aes-192-cbc      69102.10k    77545.47k    81156.61k    81964.71k    82351.45k
aes-256-cbc      61715.85k    68172.80k    71120.73k    71710.72k    72040.45k

ODROID-C2 / Amlogic S905 @ 1752 MHz:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      51748.63k    59348.22k    62051.33k    62763.35k    62963.71k
aes-192-cbc      46511.57k    52507.95k    54599.08k    55151.27k    55312.38k
aes-256-cbc      42094.22k    46302.95k    47941.46k    48372.74k    48513.02k

NanoPi M3 / Nexell S5P6818 @ 1400 MHz (3.4.39 32-bit kernel):
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      44264.22k    54627.49k    58849.88k    59756.35k    60257.62k
aes-192-cbc      39559.11k    47999.32k    51095.30k    51736.15k    52158.46k
aes-256-cbc      35803.41k    42665.24k    44926.47k    45733.21k    45883.39k

Clearfog Pro / Marvell Armada 38x @ 1600 MHz:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      47352.87k    54746.43k    57855.57k    58686.12k    58938.71k
aes-192-cbc      41516.52k    47126.91k    49317.55k    49932.63k    50151.42k
aes-256-cbc      36960.26k    41269.63k    43042.65k    43512.15k    43649.71k

Raspberry Pi 3 / BCM2837 @ 1200 MHz:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      31186.04k    47189.70k    52744.87k    54331.73k    54799.02k
aes-192-cbc      30170.93k    40512.11k    44541.35k    45672.11k    45992.62k
aes-256-cbc      27073.50k    35401.37k    38504.70k    39369.39k    39616.51k

Banana Pi M3 / Allwinner A83T @ 1800 MHz:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      36122.38k    43447.94k    45895.34k    46459.56k    46713.51k
aes-192-cbc      32000.05k    37428.74k    39234.30k    39661.91k    39718.95k
aes-256-cbc      28803.39k    33167.72k    34550.53k    34877.10k    35042.65k

Banana Pi R2 / MediaTek MT7623 @ 1040 MHz:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      22082.67k    25522.92k    26626.22k    26912.77k    26995.37k
aes-192-cbc      19340.79k    21932.39k    22739.54k    22932.82k    23008.60k
aes-256-cbc      17379.62k    19425.11k    20058.03k    20223.66k    20267.01k

Edit: Added results for Pinebook and ODROID-HC1 ensuring both were running at max cpufreq

 

Edit 2: Added cpufreq settings for each tested device. Please note throttling dependencies and multi-threaded results below

 

Edit 3: Added Banana Pi M3 single thread performance above. Performance with 8 threads sucks since A83T throttles down to 1.2GHz within 10 minutes and the overall AES-256 score stays below 190000k.

 

Edit 4: Added EspressoBin numbers from here. Another nice example for the efficiency of ARMv8 crypto extensions.

 

Edit 5: Added NanoPi M3 numbers from there.

 

Edit 6: Added Clearfog Pro numbers (Cortex-A9 -- unfortunately OpenSSL currently doesn't make use of CESA crypto engine otherwise numbers would be 3 to 4 times higher)

 

Edit 7: Added Banana Pi R2 numbers from here (Cortex-A7; cpufreq scaling has always been broken, so the SoC only runs at 1040 MHz. Numbers might slightly improve once MTK manages to fix cpufreq scaling)

 

Edit 8: Added numbers for ARMADA8040 (A72) from CNX comment thread.

 

Edit 9: Added RK3288 (Cortex A17) numbers from here.

 

Edit 10: Added RPI 3 (BCM2837) numbers. Please be aware that these are not Raspbian numbers but made with 64-bit kernel and Debian arm64 userland. When using Raspbian you get lower numbers!

 

Edit 11: Added Allwinner H6 numbers from here.

 

Edit 12: Added RK3399 numbers from here.

 

Edit 13: Added new S5P6818 numbers since now with mainline 64-bit kernel ARMv8 crypto extensions are available


@tkaiser  Nice summary.  The Rock64 looks pretty good.  Do you have XU4 results to add as context?

 

Here are results for an i5-3220m (3.2GHz IVB core):

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     569283.29k   617659.88k   627125.93k   629601.96k   630164.14k
aes-192-cbc     479330.65k   508491.20k   514591.74k   514747.05k   517674.33k
aes-256-cbc     399388.34k   429790.57k   440986.54k   448194.56k   445876.91k

 

36 minutes ago, willmore said:

Do you have XU4 results to add as context?

 

I added them above already (for HC1, which shows better heat dissipation than XU4, but that doesn't matter since 'while true ; do openssl speed -elapsed -evp aes-256-cbc 2>&1 | grep "^aes-256-cbc" ; done' doesn't exceed 70°C reported SoC temperature after 15 minutes).

 

The summary above might not be 'benchmarking gone wrong' any more but still 'numbers without meaning' ;) 

 

Without the reported cpufreq the numbers don't tell much. Assuming all A53 cores perform identically at the same clockspeed, I calculated the clockspeeds from the Pinebook's cpufreq: for ROCK64 605 / 537 * 1152 --> 1297, and for OPi PC 2 380 / 537 * 1152 --> 815. In other words: with an Armbian image on PC2 that currently does not implement cpufreq scaling, we're running the benchmark with just 816 MHz (set by u-boot). Once cpufreq scaling is working, the numbers of RK3328, H5 and A64 devices will depend solely on cpufreq.
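
The estimate above can be reproduced with a small helper. This is only a sketch under the same assumption (identical cores whose AES score scales linearly with clockspeed); the function name estimate_mhz is made up for illustration, and the scores are the rounded values in thousands used in the calculation above.

```shell
# Hypothetical helper: estimate an unknown clockspeed from two AES scores,
# assuming both boards use the same core type and the score scales
# linearly with the clock.
#   $1 = score of the board with unknown clock (integer, e.g. 605)
#   $2 = score of the reference board (integer, e.g. 537)
#   $3 = known clock of the reference board in MHz (e.g. 1152)
estimate_mhz() {
	# Multiply first, then divide: shell arithmetic is integer only,
	# so the result is truncated to whole MHz.
	echo $(( $1 * $3 / $2 ))
}
```

'estimate_mhz 605 537 1152' prints 1297 and 'estimate_mhz 380 537 1152' prints 815, matching the numbers above.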

 

Now the important question: At which clockspeed did Amlogic's S905 run? And what do the numbers look like after we let the C2 try some throttling:

 timeout 1200 bash -c 'while true ; do openssl speed -elapsed -evp aes-256-cbc 2>&1 | grep "^aes-256-cbc" ; done'

 

1 minute ago, zador.blood.stained said:

AFAIK it doesn't have ARM crypto extensions so that should be the main reason for lower performance.

Of course that's the reason for the much lower numbers, but it would be interesting to know whether the C2 numbers were done at 1.75 GHz or the stock 1.5GHz, and how performance looks after running the benchmark continually for 20 minutes. That makes the benchmark behave more like a real world workload where high AES performance is needed not only for 15 seconds but for longer periods of time. If the SoC for example starts to throttle down after 3 minutes this should also be considered -- unlikely though since this test seems to be singlethreaded anyway.

44 minutes ago, tkaiser said:

this test seems to be singlethreaded anyway

 

It is indeed. So to get a more realistic idea about the AES encryption potential when more than one CPU core is involved I would suggest running:

tk@pinebook:~$ cat check-ssl-speed.sh
#!/bin/bash
while true; do
	for i in 0 1 2 3 ; do 
		openssl speed -elapsed -evp aes-256-cbc 2>/dev/null &
	done
	wait
done
tk@pinebook:~$ ./check-ssl-speed.sh | grep "^aes-256-cbc"

With Pinebook I'm throttled 'down' to 1056 MHz after a few minutes and the total AES-256 score remains below 2,000,000k: https://pastebin.com/hYDvaRdH

 

On ODROID-HC1 I prefixed with 'taskset -c 4-7' to let the stuff run on the big cores only. They throttled down to 1.5 GHz after some time and overall performance is slightly above 220,000k: https://pastebin.com/HbZVnp87
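
The total scores quoted here are the sum of the per-thread results of one pass. A minimal sketch of how that sum could be computed follows; the function name sum_aes_scores is hypothetical, and it assumes the column layout shown in the tables above (label plus five block sizes, so the 8192 bytes value is field 6). Other OpenSSL versions may print a different number of columns, in which case the column number needs adjusting.

```shell
# Hypothetical helper: read 'openssl speed' result lines on stdin and
# print the sum of one column for all aes-256-cbc lines.
#   $1 = field number to sum (assumed default 6 = the 8192 bytes column)
sum_aes_scores() {
	awk -v col="${1:-6}" '
		/^aes-256-cbc/ {
			v = $col
			sub(/k$/, "", v)   # strip the trailing "k" unit
			total += v
		}
		END { printf "%.2fk\n", total }
	'
}
```

To use it, save the output of a single pass of the four parallel openssl processes to a file and run 'sum_aes_scores < file'. Feeding it the endless loop above directly will not work, since the sum is only printed at end of input.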

 

Now back on-topic (ROCK64, RK3328, 28nm vs. 40nm with H5/A64): I would believe ROCK64 when making use of the ARM crypto extensions can remain on 1.3GHz all the time while calculating the stuff with 4 threads in parallel. @zador.blood.stained will you give it a try? :)

10 minutes ago, tkaiser said:

I would believe ROCK64 when making use of the ARM crypto extensions can remain on 1.3GHz all the time while calculating the stuff with 4 threads in parallel. @zador.blood.stained will you give it a try? :)

I'm not sure if the DRAM throughput affects the results in addition to the CPU clock speed (especially in multithreaded/multiprocess scenarios) so I'm not sure if we should push our benchmarking attempts in this direction. In addition I believe I already erased the card I used for tests, so I'll postpone them for now.

IMO storage benchmarks on different boards when using LUKS/cryptsetup with AES encryption would be a more real world scenario and we could see how disk encryption affects the usual NAS performance on, for example, XU4, Rock64, Clearfog (with the mainline kernel) and something like OPi Plus2E.

51 minutes ago, zador.blood.stained said:

In addition I believe I already erased the card I used for tests

 

Haha, same problem here -- running out of SD cards (in the meantime I could already collect some 'not booting' experiences with combinations of board + SD card adapter + eMMC modules: Marvell Armada 38x + Pine's FORESEE eMMC --> no boot).

54 minutes ago, zador.blood.stained said:

IMO storage benchmarks on different boards when using LUKS/cryptsetup with AES encryption would be a more real world scenario and we could see how disk encryption affects the usual NAS performance on, for example, XU4, Rock64, Clearfog (with the mainline kernel) and something like OPi Plus2E.

 

Agreed. But while we're at it, a 10 minute run of the multi-threaded 'openssl speed' benchmark should IMO also be done, at least to be able to educate users about the meaning of some numbers (IMO problem N° 1 with benchmarks: how do they correlate with real-world workloads?).

 

And then another use case would be interesting: Using such a board as OpenVPN/IPSec box -- so no storage influence but interesting real-world numbers. In case the 'application' can then benefit from ARM crypto extensions on H5 while not being able to use them with H2+/H3 the demand for a potential OPi R1 upgrade might increase ;) 


Okay, so I let them bake for a few hours.  The PC2 quickly climbed to 80C and then slowly up to 100C where it stayed.  Performance did not change during the run.  The C2 slowly climbed up to 49C and stayed there.  It also showed no performance change during the test.

 

I have no idea what the clock speed of the PC2 is.  It's whatever current mainline Armbian provides--which @tkaiser said was 815MHz.  So, we could expect that one to come up a little, but not much--100C seems to be pretty toasty.

5 minutes ago, willmore said:

Okay, so I let them bake for a few hours.  The PC2 quickly climbed to 80C and then slowly up to 100C where it stayed. 

I would recommend to kill the test and turn it off immediately. There is no DVFS or THS in 4.11 branch that we are using for H5, so the board may literally bake itself to death without even trying to throttle.

1 hour ago, willmore said:

I have no idea what the clock speed of the PC2 is.  It's whatever current mainline Armbian provides--which @tkaiser said was 815MHz

 

That was just doing the math: I knew the A64 was clocked at 1152 MHz and calculated the clockspeed from the values for Pinebook and OPi PC 2 --> 815. So I assumed PC2 is running at 816 MHz. In the meantime I tested with my only H5 board (OPi Zero Plus 2 H5): openssl speed -elapsed -evp aes-256-cbc --> 26782.38k (Huh? What's going on here? Debug output). I again did some math: I ran sysbench on OPi PC2 and ROCK64, took both execution times, naively assumed again that PC2 runs at 816 MHz, and then echo '11.2318 / 7.1657 * 816' | bc -l --> 1279.03049248503286489152 (ROCK64 was running at 1296 MHz). Why are my AES scores that low?
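
The sysbench cross-check works the other way round than the AES one: a shorter execution time means a higher clock, so the ratio of the times is inverted. A minimal sketch, assuming identical cores and execution time inversely proportional to clockspeed (the function name mhz_from_times is made up for illustration):

```shell
# Hypothetical helper: estimate a clockspeed from two benchmark run times,
# assuming run time is inversely proportional to the clock.
#   $1 = run time of the reference board in seconds (e.g. 11.2318)
#   $2 = run time of the board with unknown clock (e.g. 7.1657)
#   $3 = assumed clock of the reference board in MHz (e.g. 816)
mhz_from_times() {
	awk -v ref="$1" -v other="$2" -v mhz="$3" \
		'BEGIN { printf "%.2f\n", ref / other * mhz }'
}
```

'mhz_from_times 11.2318 7.1657 816' prints 1279.03, the same value as the bc call above, rounded to two decimals.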

 

Edit: Found a bug in armhwinfo. New armbianmonitor -u output here: http://sprunge.us/MdKL

 

Since I found a spare SD card I couldn't resist testing with ROCK64 again (my first board with 2GB and an el cheapo heatsink applied). Debian Stretch, 'OpenSSL 1.1.0f  25 May 2017', same numbers as with Jessie when running single threaded. When testing AES256 with 4 threads it starts at almost 2,400,000k and after some time throttles down to 2,100,000k (it was even below 2,050,000k, but that was due to 'rock64_health.sh -w' running in parallel, which is way too resource hungry in this mode updating every 0.5s): https://pastebin.com/Ck15UQv4
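
To compare throttling between boards it can help to express the loss as a percentage of the initial score. A small sketch, with the hypothetical function name throttle_drop_pct and the two rounded totals from above as example input:

```shell
# Hypothetical helper: percentage of performance lost to throttling.
#   $1 = total score at the start of the run (e.g. 2400000)
#   $2 = total score after the board has throttled (e.g. 2100000)
throttle_drop_pct() {
	awk -v start="$1" -v throttled="$2" \
		'BEGIN { printf "%.1f\n", (start - throttled) / start * 100 }'
}
```

'throttle_drop_pct 2400000 2100000' prints 12.5. Since the start value is only "almost" 2,400,000k, this should be read as a rough figure.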


A noob question. Rock64 has a USB3 port, and it is possible to attach a SATA-USB adapter there, like the one Pine64 sells in their store. Hardkernel has added such a bridge right onto the board with their new HC1. I am wondering (since I have no way to check this directly) how these bridges present themselves to the host. More specifically: do they expose themselves as AHCI/SATA controllers, the same way as their x86 brothers sitting on PCIe do? Everybody is talking about UAS here, so probably they don't expose their SATA nature (more than that SAT thing), but I am still totally ignorant with respect to USB internals, so I might be wrong in presuming that if they exposed themselves as SATA controllers sitting on a USB bus, there would be no need for UAS with them. If they don't act like their pci-ide/ahci x86 counterparts, then why? Why can a SATA controller on the SoC interconnect or PCIe bus be "native SATA", but not when put on USB?

20 minutes ago, valant said:

but I am a total ignorant with respect to USB internals

 

The more ignorant of those internals you are, the happier your life will be.  I wrote some simple drivers in assembly language for USB 2, and being totally honest, I kind of wish that standard would have sunk to the bottom of the ocean in the late 90's before it hit revision 2.0.

 

USB has some special bus modes to handle mass data traffic (UAS). All data on USB is of a "type", and it doesn't easily allow a type-agnostic transfer like PCI(e) does.  If the bridge were exposed as a SATA device, a software layer would have to do the translation, and that would most likely be less efficient.  Now, the experts will know more and (most probably) make me look foolish, but it's part of the job.  ;)
