
Posted

[Image: N1_plus_Test_Equipment.jpg -- ODROID N1 with the test equipment]

 

 

 

UPDATE: You'll find a preliminary performance overview at the end of the thread. Click here.

 

 

This is NOT an ODROID N1 review since it's way too early for that. The following focuses on just a small number of use cases the board might be used for: server stuff and everything that revolves around network, IO and internal limitations. If you want the hype instead, better join Hardkernel's vendor community over there: https://forum.odroid.com/viewforum.php?f=148

 

All numbers you find below are PRELIMINARY since it's way too early to benchmark this board. This is just an attempt to get some baseline numbers to better understand which use cases the device might be appropriate for, where to look further and which settings might need improvement.

 

Background info first

 

ODROID N1 is based on the Rockchip RK3399 SoC so we already know a lot since RK3399 isn't really new (see Chromebooks, countless TV boxes with this chip and dev boards like Firefly RK3399, ROCK960 and a lot of others... and there will be a lot more devices coming in 2018, like another board from China soon with an M.2 key M slot exposing all PCIe lanes).

 

What we already know is that the SoC is one of Rockchip's 'open source SoCs' so software support is already pretty good and the chip vendor itself actively upstreams software support. We also know RK3399 is not the greatest choice for compiling code (a use case bottlenecked by memory bandwidth and only 2 fast cores combined with 4 slow ones; for this use case 4 x A15 or A17 cores perform much better), that ARMv8 crypto extensions are supported (see a few posts below), and that the SoC performs nicely with Android and 'Desktop Linux' stuff (think about GPU and VPU acceleration). We also know that this SoC has 2 USB3 ports and implements PCIe 2.1 with a four-lane interface. But so far we don't know what the internal bottlenecks look like, so let's focus on that now.

 

The PCIe 2.1 x4 interface is said to support both Gen1 and Gen2 link speeds (2.5 vs. 5GT/s) but there was recently a change in the RK3399 datasheet (a downgrade from Gen2 to Gen1) and some mainline kernel patch descriptions seem to indicate that RK3399 is not always able to train for Gen2 link speeds. On ODROID N1 a single PCIe lane (x1), configured as either Gen1 or Gen2, is used to connect a dual-port SATA adapter. The ASMedia ASM1061 was the obvious choice since, while being a somewhat old design (AFAIK from 2010), it's cheap and 'fast enough', at least when combined with one or even two HDDs.

 

Since the PCIe implementation on these early N1 dev samples is fixed and limited we need to choose other RK3399 devices to get a clue about PCIe limitations (RockPro64, ROCK960 or the not yet announced other board from China). So let's focus on SATA and USB3 instead. While SATA on 'development boards' isn't anything new, it's often done with (sometimes really crappy) USB2 SATA bridges, recently sometimes with good USB3 SATA bridges (see ODROID HC1/HC2, Cloudmedia Transformer or Swiftboard) and sometimes it's even 'true' SATA:

 

 

All the above SoC families do 'native SATA' (the SoC itself implements SATA protocols and connectivity) but performance differs a lot, with 'Allwinner SATA' being the worst and only the Marvell implementations performing as expected (+500 MB/s sequential and also very high random IO performance, which is what you're looking for when using SSDs). As Armbian user you already know: this stuff is documented in detail, just read through this and that.

 

RK3399 is not SATA capable so we're talking here about PCIe attached SATA, which has 2 disadvantages: slightly bottlenecked performance and increased overall consumption. N1's SATA implementation and the way it's 'advertised' (rootfs on SATA) pose another challenge but this is something for a later post (the sh*tshow known from 'SD cards' over the last years now arriving at a different product category called 'SSD').

 

Benchmarking storage performance is challenging and most 'reviews' done on SBCs use inappropriate tools (see this nice bonnie/bonnie++ example), inappropriate settings (see all those dd and hdparm numbers that partially test filesystem buffers and caches and not storage) or focus only on irrelevant stuff (eg. sequential performance in 'worst case testing mode', only looking at one direction).

 

[Image: N1_and_4_SSDs.jpg -- ODROID N1 with 4 SSDs attached]

 

Some USB3 tests first

 

All SSDs I use for the test are powered externally and not by the N1 since I ran more than once into situations with board-powered SSDs where performance dropped a lot when some sort of underpowering occurred. The 2 USB3 enclosures above are powered by a separate 5V rail and the SATA attached SSDs by the dual-voltage PSU behind them. As expected USB3 storage can use the much faster UAS protocol (we know this from RK3328 devices like ROCK64 already, which use the same XHCI controller and most probably a nearly identical kernel) and the performance numbers match as well (with large block and file sizes we get close to 400 MB/s).
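A quick way to verify that an enclosure really uses UAS instead of plain usb-storage (just a hedged check, output obviously depends on the attached devices):

lsusb -t                            # UAS capable bridges show 'Driver=uas', otherwise 'Driver=usb-storage'
dmesg | grep -i 'uas\|usb-storage'  # same information from the kernel ring buffer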

 

We chose iozone for the simple reason of being able to compare with previous numbers; a more thorough benchmark would need some fio testing with different test sets. But it's only about getting a baseline now. Tests were done with Hardkernel's Debian Stretch image with some tweaks applied. The image relies on Rockchip's 4.4 BSP kernel (4.4.112) with some Hardkernel tweaks and I adjusted the following: first set both cpufreq governors to performance so as not to be affected by potentially wrong/weird cpufreq scaling behaviour. Then do static IRQ distribution for USB3 and PCIe on cpu1, cpu2 and cpu3 (all little cores, but while checking CPU utilization none of the cores was fully saturated so A53@1.5GHz is fine):

echo 2 >/proc/irq/226/smp_affinity
echo 4 >/proc/irq/227/smp_affinity
echo 8 >/proc/irq/228/smp_affinity
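The IRQ numbers 226-228 are specific to this board/kernel combination; a hedged way to look them up before pinning (interrupt names may differ on other kernels):

grep -iE 'xhci|pcie' /proc/interrupts
# then write the affinity mask (one bit per core: cpu1=2, cpu2=4, cpu3=8)
# to the matching /proc/irq/<number>/smp_affinity as shown above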

To avoid CPU core collisions the benchmark task itself has been sent to one of the two A72 cores:

taskset -c 5 iozone -e -I -a -s 100M -r 1k -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2

Unfortunately I currently only have crappy SSDs lying around (all cheap consumer SSDs: a Samsung EVO 840 and 750, a Samsung PM851 and an Intel 540). So we need to take the results with a grain of salt since those SSDs suck especially with continuous write tests (sequential write performance drops a lot after a short period of time).

 

First test is to determine whether the USB3 ports behave differently (AFAIK one of the two could also be configured as an OTG port and with some SBCs I've seen serious performance drops in such a mode). But nope, they perform identically:

EVO840 behind JMS567 (UAS active) on lower USB3 port (xhci-hcd:usb7, IRQ 228):
                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     6200     6569     7523     7512     4897     6584
          102400       4    23065    25349    34612    34813    23978    25231
          102400      16    78836    87689   105249   106777    78658    88240
          102400     512   302757   314163   292206   300964   292599   321848
          102400    1024   338803   346394   327101   339218   329792   351382
          102400   16384   357991   376834   371308   384247   383501   377039

EVO840 behind JMS567 (UAS active) on upper USB3 port (xhci-hcd:usb5, IRQ 227):
                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     6195     6545     7383     7383     4816     6518
          102400       4    23191    25114    34370    34716    23580    25199
          102400      16    78727    86695   104957   106634    76359    87610
          102400     512   307469   315243   293077   302678   293442   321779
          102400    1024   335772   336833   326940   339128   330298   350271
          102400   16384   366465   376863   371193   384503   383297   379898

Now attaching an EVO750 (not that fast) which performs pretty much identically behind the XHCI host controller and the JMS567 controller inside the enclosure:

EVO750 behind JMS567 (UAS active) on lower USB3 port (xhci-hcd:usb7, IRQ 228):
                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     6200     6569     7523     7512     4897     6584
          102400       4    23065    25349    34612    34813    23978    25231
          102400      16    78836    87689   105249   106777    78658    88240
          102400     512   302757   314163   292206   300964   292599   321848
          102400    1024   338803   346394   327101   339218   329792   351382
          102400   16384   357991   376834   371308   384247   383501   377039

(so USB3 is the bottleneck here; especially with random IO an EVO840 is much, much faster than an EVO750 but here they perform identically due to the massive USB protocol overhead)

 

Let's try both USB3 ports at the same time


First quick try was a BTRFS RAID-0 made with 'mkfs.btrfs -f -m raid0 -d raid0 /dev/sda1 /dev/sdb1'. Please note that BTRFS is not the best choice here since all (over)writes with blocksizes lower than btrfs' internal blocksize (4K default) are way slower compared to non CoW filesystems:

                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     2659     1680   189424   621860   435196     1663
          102400       4    21943    18762    24206    24034    18107    17505
          102400      16    41983    46379    62235    60665    52517    42925
          102400     512   180106   170002   143494   149187   138185   180238
          102400    1024   170757   185623   159296   156870   156869   179560
          102400   16384   231366   247201   340649   351774   353245   231721

Those are BS numbers, let's forget about them. Now trying the same with mdraid/ext4 (configuring a RAID 0 and putting an ext4 on it) and... the N1 simply powered down when executing mkfs.ext4. Adding 'coherent_pool=2M' to bootargs seems to do the job (and in between I created the mdraid0 with both SSDs connected through SATA):
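For reference, a minimal sketch of the mdraid0/ext4 setup described above (device names /dev/sda1 and /dev/sdb1 are assumptions):

# create a 2-disk RAID0, put ext4 on top and mount it
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
mkfs.ext4 /dev/md0
mkdir -p /mnt/MDRAID0
mount /dev/md0 /mnt/MDRAID0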

                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       4    25133    29444    38340    38490    23403    27947
          102400      16    85036    97638   113992   114834    79505    95274
          102400     512   306492   314124   295266   305411   289393   322493
          102400    1024   344588   343012   322018   332545   316320   357040
          102400   16384   384689   392707   371415   384741   388054   388908

Seems we're already talking about one real bottleneck here? We see nice improvements with small blocksizes which is an indication that RAID0 is doing its job. But with larger blocksizes we're not able to exceed the 400 MB/s barrier, so it seems both USB3 ports have to share bandwidth (comparable to the situation on ODROID XU4 where the two USB3 receptacles are connected to an internal USB3 hub which is connected to a single USB3 port of the Exynos SoC).

 

Edit: @Xalius used these results to look into RK3399 TRM (technical reference manual). Quoting ROCK64 IRC:

 

 


[21:12] <Xalius_> let me pull out that TRM again

[21:16] <Xalius_> the USB-C PHY seems to be an extra block, but I guess that is mostly because it can switch the aux function to display port

[21:16] <Xalius_> it's not obvious to me how that would change the bandwidth

[21:16] <Xalius_> unless in normal USB3 mode they somehow have a PHY where both hosts connect or sth?

[21:17] <Xalius_> also I don't think it matters wrt the pinout

[21:17] <Xalius_> maybe you can switch one port to the USB-C PHY anyways

[21:17] <Xalius_> even if not using any USB-C type things

[21:18] <Xalius_> like force it into host mode

[21:19] <Xalius_> tkaiser, "Simultaneous IN and OUT transfer for USB3.0, up to 8Gbps bandwidth"

[21:19] <Xalius_> just reading the USB3 part

[21:21] <Xalius_> it also has some Ethernet hardware accelerator

[21:21] <Xalius_> "Scheduling of multiple Ethernet packets without interrupt"

[21:22] <Xalius_> apparently each USB3 host shares bandwidth with one USB2 host

[21:22] <Xalius_> "Concurrent USB3.0/USB2.0 traffic, up to 8.48Gbps bandwidth"

[21:26] <Xalius_> tkaiser, they also have two sets of IRQs in the list

[21:26] <tkaiser> Xalius_: Yeah, with this kernel I see them both. And assigned them to different CPU cores already
 

 

Posted

SATA performance

 

As already said RK3399 is not SATA capable, so in reality we're talking here about RK3399 PCIe performance and the performance of the SATA controller Hardkernel chose (ASM1061).

 

I've destroyed the RAID-0 array from before, attached the EVO750 to SATA port 1 and the EVO840 to SATA port 2 (both externally powered), so let's test (same settings as before: IRQ affinity and sending iozone to cpu5):

EVO750 connected to SATA port 1 (ata1.00)
                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     7483     8366     8990     8997     5985     8320
          102400       4    26895    31233    33467    33536    22688    31074
          102400      16    87658    98748   103510   103772    75473    98533
          102400     512   319330   320934   309735   311915   283113   322654
          102400    1024   332979   338408   321312   321328   306621   336457
          102400   16384   343053   346736   325660   327009   318830   341269

EVO840 connected to SATA port 2 (ata2.00)
                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     7282     8225     9004     8639     5540     7857
          102400       4    25295    29532    31754    32422    22069    30526
          102400      16    85907    97049   102244   102615    77170    96130
          102400     512   308776   312344   305041   308835   299016   306654
          102400    1024   326341   327747   316543   321559   315103   321031
          102400   16384   365294   378264   385631   391119   390479   293734

If we compare with the USB3 numbers above we clearly see one of the many 'benchmarking gone wrong' occurrences. How on earth is the EVO750 faster when connected via USB3 than when accessed through SATA? Look at the sequential performance with 512K, 1M and 16M blocksizes: with USB3 we exceeded 380 MB/s read and are now stuck at ~325 MB/s -- that's impossible?!

 

The reason is pretty simple: after I destroyed the RAID0 I recreated the filesystems on both SSDs and mkfs.ext4 took ages. Looking at dmesg shows the problem:

[  874.771379] ata1.00: NCQ disabled due to excessive errors

Both SSDs got initialized with NCQ (native command queueing) and a maximum queue depth of 31:

[    2.498063] ata1.00: ATA-9: Samsung SSD 750 EVO 120GB, MAT01B6Q, max UDMA/133
[    2.498070] ata1.00: 234441648 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[    2.964660] ata2.00: ATA-9: Samsung SSD 840 EVO 120GB, EXT0BB0Q, max UDMA/133
[    2.964666] ata2.00: 234441648 sectors, multi 16: LBA48 NCQ (depth 31/32), AA

But then there were transmission errors and the kernel decided to give up on NCQ, which is responsible for trashing SATA performance. When I attached the SATA cables to the N1 I already expected trouble (one of the two connections felt somewhat 'loose') so looking into the dmesg output was mandatory: http://ix.io/Kzf
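A quick way to check whether NCQ is (still) active on both ports (a queue_depth of 31 means NCQ is in use, 1 means the kernel disabled it; device names are assumptions):

dmesg | grep -i ncq
cat /sys/block/sd[ab]/device/queue_depth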

 

Ok, shutting down the board and exchanging the SSDs so that now EVO840 is on port 1 and EVO750 on port 2:

EVO750 connected to SATA port 2 (ata2.00)
                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     7479     8257     8996     8997     5972     8305
          102400       4    26859    31206    33540    33580    22719    31026
          102400      16    87690    98865   103442   103715    75507    98374
          102400     512   319251   323358   308725   311769   283398   320156
          102400    1024   333172   338362   318633   322155   304734   332370
          102400   16384   379016   386131   387834   391267   389064   387225

EVO840 connected to SATA port 1 (ata1.00)
                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     7350     8238     8921     8925     5627     8167
          102400       4    26169    30599    33183    33313    22879    30418
          102400      16    85579    96564   102667   100994    76254    95562
          102400     512   312950   312802   309188   311725   303605   314411
          102400    1024   325669   324499   319510   321793   316649   324817
          102400   16384   373322   372417   385662   390987   390181   372922

Now performance is as expected (and with the ASM1061 you can't expect more -- 390 MB/s sequential transfer speed can be considered really great). But still... both SSDs seem to perform identically, which is just weird since the EVO840 is the much faster one. So let's have a look at a native SATA implementation on another ARM board: the Clearfog Pro. With the same EVO840, partially crappy settings (and not testing 1K block size) it looks like this -- random IO of course way better compared to the ASM1061:

Clearfog Pro with EVO840 connected to a native SATA port of the ARMADA 385:
                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       4    69959   104711   113108   113920    40591    76737
          102400      16   166789   174407   172029   215341   123020   159731
          102400     512   286833   344871   353944   304479   263423   269149
          102400    1024   267743   269565   286443   361535   353766   351175
          102400   16384   347347   327456   353394   389994   425475   379687

(you find all details here. On a side note: the Clearfog Pro can be configured to provide 3 native SATA ports and Solid-Run engineers tested with 3 fast SATA SSDs in parallel and were able to exceed 1,500 MB/s in total. That was in early 2016)

 

So now that we have both SSDs running with NCQ and maximum queue depth let's try RAID0 again:

                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     7082     7595     8545     8552     5593     7884
          102400       4    25434    29603    31858    31831    21195    29381
          102400      16    83270    93265    97376    97138    70859    93365
          102400     512   303983   297795   300294   286355   277441   301486
          102400    1024   330594   320820   316379   313175   314558   332272
          102400   16384   367334   367674   351361   366017   364117   351142

Nope, performance sucks. And the reason is the same. New dmesg output reveals that SATA port 1 still has a problem, so now it's the EVO840 that runs without NCQ and performance has to drop: http://ix.io/KA6

 

Carefully exchanging cables and checking contacts and another run with the SATA RAID0:

                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     7363     7990     8897     8901     6113     8176
          102400       4    26369    30720    33251    33310    23606    30484
          102400      16    85555    97111   102577   102953    78091    96233
          102400     512   306039   316729   309768   311106   294009   316353
          102400    1024   329348   339153   335685   333575   342699   346854
          102400   16384   382487   384749   385321   389949   390039   384479

Now everything is fine since we again reach the 390 MB/s. If we look closer at the numbers we see that RAID0 with fast SSDs is just a waste of resources since the ASM1061 is the real bottleneck here. There exists an almost twice as expensive variant called ASM1062 which can make use of 2 PCIe lanes and shows better overall performance. But whether this would really result in higher storage performance is a different question, since it could happen that a PCIe device attached with 2 lanes instead of one brings the link speed down to Gen1 (so zero performance gain) or that there exists an internal SoC bandwidth limitation.
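To see which link width and speed were actually negotiated one can query the PCIe link status registers (requires pciutils; treat this as a sketch since output formatting differs between lspci versions):

# LnkCap = what the device advertises, LnkSta = what was actually trained
lspci -vv | grep -iE 'lnkcap:|lnksta:'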

 

Since we can't test for this with the ODROID N1 samples right now we need to do more tests with other RK3399 devices. In the meantime I created one RAID0 out of 4 SSDs (as can be seen in the picture above -- 2 x USB3, 2 x SATA) and repeated the iozone test:

                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       4    25565    29387    33952    33814    19793    28524
          102400      16    82857    94170   101870   101376    63274    92038
          102400     512   283743   292047   292733   293601   275781   270178
          102400    1024   312713   312202   311117   311408   275342   320691
          102400   16384   469131   458924   616917   652571   619976   454828

We can clearly see that RAID0 is working (see the increased numbers with small blocksizes) but obviously there's an overall bandwidth limitation. As already said the SSDs I test with are cheap and crappy, so the write limitation is caused by my SSDs while the read limitation seems to be some sort of bandwidth bottleneck on the board or SoC (or kernel/drivers or current settings used!). Repeated the test with a new RAID0 made out of the two fastest SSDs, one connected via USB3, the other via SATA, and now with the PCIe power management policy set to performance (search for /sys/module/pcie_aspm/parameters/policy below):

                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       4    33296    40390    50845    51146    31154    39931
          102400      16   105127   120863   139497   140849    97505   120296
          102400     512   315177   319535   302748   308408   294243   317566
          102400    1024   529760   569271   561234   570950   546556   555642
          102400   16384   688061   708164   736293   754982   753050   711708

When testing with sequential transfers only, large block sizes and 500 MB test size we get 740/755 MB/s write/read. Given there is something like a 'per port group' bandwidth limitation this is as expected, but as already said: this is just a quick try to search for potential bottlenecks and it's way too early to draw any conclusions now. We need a lot more time to look into details.

 

On the bright side: the above numbers are a confirmation that certain use cases like 'NAS box with 4 HDDs' will not be a problem at all (as long as users are willing and able to accept that USB3 SATA with a good and UAS capable SATA bridge is not worse than PCIe attached SATA here). HDDs all show crappy random IO performance so all that counts is sequential IO, and the current bandwidth limitations of ~400 MB/s for both USB3 ports as well as both SATA ports are perfectly fine. People who want to benefit from ultra fast SSD storage might better look somewhere else.

Posted

More storage performance: eMMC and SD cards

 

The N1 has not only 2 SATA ports but also the usual SD card slot and the usual eMMC socket known from other ODROID boards. Hardkernel sells some of the best eMMC modules you can get for this connector and they usually also take care that SD cards can enter higher speed modes. This usually requires switching between 3.3V and 1.8V but at least the released schematics for this (early!) board revision do not mention 1.8V here.

 

Hardkernel shipped the dev sample with their new Samsung based orange eMMC (16 GB) but since this is severely limited wrt sequential write performance (as usual, flash memory modules with low capacity always suffer from this problem) we use the 64GB module to show the performance. Since the use case I'm interested in is 'rootfs' or 'OS drive', sequential performance is more or less irrelevant and all that really matters is random IO performance (especially writes at small block sizes). Test setup as before with the iozone task sent to cpu5:

Orange 64GB eMMC (Samsung):
                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     2069     1966     8689     8623     7316     2489
          102400       4    32464    36340    30699    30474    27776    31799
          102400      16    94637   100995    89970    90294    83993    96937
          102400     512   147091   151657   278646   278126   269186   146851
          102400    1024   143085   148288   287749   291479   275359   143229
          102400   16384   147880   149969   306523   306023   307040   147470

If we compare random IOPS at 4K and 16K block size it looks as follows (IOPS -- IO operations per second -- means we need to divide the KB/s numbers above by the block size; e.g. 29,400 KB/s at a 4K record size corresponds to roughly 7,350 IOPS). The numbers below are therefore not KB/s but IOPS:

                     4K read   4K write   16K read  16K write
             JMS567:   6000      6300       4925       5500
  ASM1061 powersave:   5700      7600       4750       6000
          16GB eMMC:   7250      7100       5025       2950
   32/64/128GB eMMC:   7450      7350       5200       5700
ASM1061 performance:   9200     15050       6625       9825

(Not so) surprisingly Hardkernel's eMMC modules are faster than an SSD with default settings (and we're talking about ok-ish consumer SSDs and not cheap crap). Some important notes:

  • 'JMS567' is the USB3-to-SATA chipset used for my tests. The above is not a 'USB3 number' but one made with a great JMicron chipset and UAS active (UAS == USB Attached SCSI, the basic requirement to get USB storage performance that does not totally suck). If you don't pay attention to the chipset you use, your USB3 storage performance can be orders of magnitude lower
  • 'ASM1061' is not a synonym for 'native SATA', it's just PCIe attached SATA and most probably one of the slowest implementations available. There are two numbers above since PCIe power management settings have an influence on both consumption and performance. When /sys/module/pcie_aspm/parameters/policy is set to performance instead of powersave, idle consumption increases by around 250mW but performance with small block sizes also improves a lot (see the sketch after this list)
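Switching the ASPM policy at runtime is just a sysfs write (effective immediately, not persistent across reboots):

cat /sys/module/pcie_aspm/parameters/policy        # the active policy is shown in brackets
echo performance >/sys/module/pcie_aspm/parameters/policy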

As a reference, here are iozone numbers for all orange Samsung based eMMC modules tested on the N1 (Hardkernel sent the numbers on request): https://pastebin.com/ePUCXyg6 (as can be seen the 16 GB module already performs great but for full performance better choose one of the larger modules)

 

So what about SD cards? 

 

Update: Hardkernel forgot to include a UHS patch in the kernel they provided with the developer samples, so once this is fixed the performance bottleneck with SD cards reported below should be gone: https://forum.odroid.com/viewtopic.php?f=153&t=30193#p215915

 

Update 2: already fixed with those 3 simple lines in the device-tree configuration (therefore the numbers below serve only as a 'historical reference' showing what happens with the slowest SD card speed mode -- for current performance with SDR104 mode see here and there)

 

As Armbian user you already know that 'SD card' is not a performance class but just a form factor and an interface specification. There is all the counterfeit crap, there are 'reputable brands' that produce SD cards that are slow as hell when it comes to random IO, and there are good performers that show even 100 times better random IO performance than eg. an average Kingston or PNY card: https://forum.armbian.com/topic/954-sd-card-performance/

 

Unfortunately in the past 'random IO' was not part of the SD Association's speed classes but this changed last year. In the meantime there's the 'A1 speed class' which specifies minimum random IO performance and now these cards even exist. I tried to buy a SanDisk Extreme Plus A1 but was too stupid and ordered a SanDisk Extreme A1 instead (without the 'Plus' which means extra performance and especially extra reliability). But since I saved a few bucks by accident and there was a 'SanDisk Ultra A1' offer... I bought two A1 cards today:

Fresh SanDisk Extreme A1 32GB SD card:
                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1      998      716     4001     3997     3049      740
          102400       4     3383     3455    10413    10435     9631     4156
          102400      16     8560     8607    17149    17159    17089    11949
          102400     512    21199    21399    22447    22457    22464    20571
          102400    1024    22075    22168    22912    22922    22919    21742
          102400   16384    22415    22417    23357    23372    23372    22460

Fresh SanDisk Ultra A1 32GB SD card:
                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1      683      718     3466     3467     2966      449
          102400       4     2788     3918     9821     9805     8763     2713
          102400      16     4212     7950    16577    16627    15765     7121
          102400     512    10069    14514    22301    22346    22253    13652
          102400    1024    14259    14489    22851    22892    22868    13664
          102400   16384    15254    14597    23262    23342    23340    14312

Slightly used SanDisk Extreme Plus (NO A1!) 16GB SD card:
                                                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1      614      679     3245     3245     2898      561
          102400       4     2225     2889     9367     9360     7820     2765
          102400      16     8202     8523    16836    16806    16807     7507
          102400     512    20545    21797    22429    22465    22485    21857
          102400    1024    22352    22302    22903    22928    22918    22125
          102400   16384    22756    22748    23292    23323    23325    22691

Oh well, performance is limited to the slowest SD card mode possible (4 bit, 50 MHz --> ~23 MB/s max) which also affects random IO performance, slightly at small blocksizes and severely at large blocksizes.
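A hedged way to check which speed mode the card actually negotiated is the mmc host's ios node in debugfs (debugfs must be mounted and the host index may differ):

cat /sys/kernel/debug/mmc0/ios     # 'timing spec' shows e.g. 'sd high-speed' or 'sd uhs SDR104'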

 

At least the N1 dev samples have a problem here. No idea whether this is a hardware limitation (no switching to 1.8V?) or just a settings problem. But I really hope Hardkernel addresses this since in the past I always enjoyed great performance with SD cards on the ODROIDs (due to Hardkernel being one of the few board makers taking care of such details)

 

Posted

BTW: Since checking out a new board without some kind of monitoring is just stupid... here's what it takes to get armbianmonitor to run with Hardkernel's Stretch (or Ubuntu later -- the needed RK3399 tweaks have been added long ago).

mkdir -p /etc/armbianmonitor/datasources
cd /etc/armbianmonitor/datasources
ln -s /sys/devices/virtual/thermal/thermal_zone0/temp soctemp
wget https://raw.githubusercontent.com/armbian/build/master/packages/bsp/common/usr/bin/armbianmonitor
mv armbianmonitor /usr/local/sbin/
chmod 755 /usr/local/sbin/armbianmonitor

Then it's just calling 'sudo armbianmonitor -m' to get a clue what's going on (throttling, big.LITTLE stuff, %iowait... everything included):

root@odroid:/home/odroid# armbianmonitor -m
Stop monitoring using [ctrl]-[c]
Time       big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.

23:14:25:  408/1200MHz  0.38   6%   2%   3%   0%   0%   0% 43.9°C  0/3
23:14:30:  408/ 408MHz  0.51   1%   0%   0%   0%   0%   0% 43.9°C  0/3
23:14:35:  600/ 408MHz  0.55   2%   0%   1%   0%   0%   0% 44.4°C  0/3
23:14:41:  408/ 408MHz  0.51   0%   0%   0%   0%   0%   0% 46.9°C  1/3
23:14:46: 1992/ 816MHz  0.63  33%   0%  33%   0%   0%   0% 52.8°C  1/3
23:14:51:  408/ 408MHz  0.74  16%   0%  16%   0%   0%   0% 42.8°C  0/3
23:14:56: 1992/ 600MHz  0.68   5%   4%   0%   0%   0%   0% 44.4°C  0/3
23:15:01:  600/1008MHz  0.86  45%   8%   0%   0%  36%   0% 42.8°C  0/3
23:15:07:  408/ 408MHz  0.95  19%   2%   0%   0%  16%   0% 42.8°C  0/3
23:15:12:  408/ 600MHz  1.04  23%   2%   0%   0%  20%   0% 43.3°C  0/3
23:15:17: 1200/ 600MHz  1.12  18%   4%   0%   0%  14%   0% 43.9°C  0/3
23:15:22: 1992/1512MHz  1.03  51%  18%  23%   0%   8%   0% 52.8°C  1/3
23:15:27: 1992/1512MHz  1.42  88%  20%  34%   0%  32%   0% 51.1°C  1/3
23:15:32: 1992/1512MHz  1.79  72%  16%  34%   0%  20%   0% 51.7°C  1/3
Time       big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
23:15:37: 1992/1512MHz  2.05  77%  16%  34%   0%  26%   0% 50.0°C  1/3
23:15:42: 1992/1512MHz  2.29  79%  21%  34%   0%  23%   0% 50.0°C  1/3
23:15:47: 1992/1512MHz  2.42  85%  24%  34%   0%  26%   0% 48.8°C  1/3
23:15:52:  408/ 408MHz  2.71  50%   8%  11%   0%  29%   0% 40.6°C  0/3
23:15:57:  408/ 816MHz  2.65  33%   2%   0%   0%  30%   0% 40.6°C  0/3
23:16:03: 1008/ 600MHz  2.60  18%   4%   0%   0%  14%   0% 40.6°C  0/3
23:16:08:  408/ 408MHz  2.79   3%   0%   0%   0%   2%   0% 40.6°C  0/3^C

root@odroid:/home/odroid# 

 

Posted

Gigabit Ethernet performance

 

RK3399 has an internal GbE MAC implementation combined with an external RTL8211 GbE PHY. I did only some quick tests which were well above 900 Mbits/sec, but since moving IRQs to one of the A72 cores didn't improve scores it's either my current networking setup (ODROID N1 connected directly to an older GbE switch I don't trust that much any more) or TX/RX delay adjustments being necessary.
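Such quick throughput tests are usually just iperf3 runs in both directions (a sketch only -- the server address is an assumption):

# on another GbE host: iperf3 -s
iperf3 -c 192.168.1.100 -t 30        # N1 transmitting
iperf3 -c 192.168.1.100 -t 30 -R     # N1 receiving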

 

Anyway: the whole process should be well known and is documented, so it's time for someone else to look into it. With RK SoCs it's pretty easy to test for this with DT overlays: https://github.com/ayufan-rock64/linux-build/blob/master/recipes/gmac-delays-test/range-test

 

And the final result might be some slight DT modifications that allow for 940 Mbits/sec in both directions with as little CPU utilization as possible. Example for RK3328/ROCK64: https://github.com/ayufan-rock64/linux-kernel/commit/2047dd881db53c15a952b1755285e817985fd556

 

Since RK3399 uses the same Synopsys DesignWare Ethernet implementation as currently almost every other GbE capable ARM SoC around, and since we get maximum throughput on RK3328 with adjusted settings... I'm pretty confident it will be the same on RK3399.

Posted
4 hours ago, tkaiser said:

This usually requires switching between 3.3V and 1.8V but at least released schematics for this (early!) board revision do not mention 1.8V here.

 

Yeah, oddly it looks like 3V0, which I believe falls within the required range. That said, I'm scratching my brain to remember whether VDD is constant at 3.3 volts and only the signalling voltage changes (SD_CLK, SD_CMD, Data_0..3). In that case it would be switched at the SoC.

Posted

AES crypto performance, checking for bogus clockspeeds, thermal thresholds

 

As Armbian user you already might know that almost all currently available 64 bit ARM SoCs licensed ARM's ARMv8 crypto extensions and that AES performance especially with small data chunks (think about VPN encryption) is something where A72 cores shine: https://forum.armbian.com/topic/4583-rock64/?do=findComment&comment=37829 (the only two exceptions are Raspberry Pi 3 and ODROID-C2 where the SoC makers 'forgot' to license the ARMv8 crypto extensions)
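Whether the crypto extensions are actually exposed to userspace can be checked via the CPU feature flags (aes, pmull, sha1 and sha2 should show up):

grep -m1 Features /proc/cpuinfo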

 

Let's have a look at ODROID N1 and A53@1.5GHz vs. A72@2GHz. I use the usual openssl benchmark that runs in a single thread. Once pinned to cpu1 (little core) and another time pinned to cpu5 (big core):

for i in 128 192 256 ; do taskset -c 1 openssl speed -elapsed -evp aes-${i}-cbc 2>/dev/null; done | grep cbc
for i in 128 192 256 ; do taskset -c 5 openssl speed -elapsed -evp aes-${i}-cbc 2>/dev/null; done | grep cbc

As usual monitoring happened in another shell, and when testing on the A72 I not only got a huge result variation but armbianmonitor also reported the 'cooling state' already reaching 1 -- see the last column 'C.St.' (nope, that's the PWM fan, see a few posts below):

Time       big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
06:00:44: 1992/1512MHz  0.46  16%   0%  16%   0%   0%   0% 51.1°C  1/3

So I added a huge and silent USB powered 5V fan to the setup blowing air over the board at a 45° angle to improve heat dissipation a bit (I hate those small and inefficient fansinks like the one on the XU4 and now on the N1 sample) and tried again. This time the cooling state remained at 0, the internal fan did not start, and we had no result variation any more (standard deviation low enough between multiple runs):

Time       big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
06:07:03: 1992/1512MHz  0.46   0%   0%   0%   0%   0%   0% 30.0°C  0/3
06:07:08: 1992/1512MHz  0.42   0%   0%   0%   0%   0%   0% 30.0°C  0/3
06:07:13: 1992/1512MHz  0.39   0%   0%   0%   0%   0%   0% 30.0°C  0/3
06:07:18: 1992/1512MHz  0.36   0%   0%   0%   0%   0%   0% 30.0°C  0/3
06:07:23: 1992/1512MHz  0.33   0%   0%   0%   0%   0%   0% 30.0°C  0/3
06:07:28: 1992/1512MHz  0.38  12%   0%  12%   0%   0%   0% 32.2°C  0/3
06:07:33: 1992/1512MHz  0.43  16%   0%  16%   0%   0%   0% 32.2°C  0/3
06:07:38: 1992/1512MHz  0.48  16%   0%  16%   0%   0%   0% 32.8°C  0/3
06:07:43: 1992/1512MHz  0.52  16%   0%  16%   0%   0%   0% 33.9°C  0/3
06:07:48: 1992/1512MHz  0.56  16%   0%  16%   0%   0%   0% 33.9°C  0/3
06:07:53: 1992/1512MHz  0.60  16%   0%  16%   0%   0%   0% 33.9°C  0/3
06:07:58: 1992/1512MHz  0.63  16%   0%  16%   0%   0%   0% 34.4°C  0/3
06:08:04: 1992/1512MHz  0.66  16%   0%  16%   0%   0%   0% 34.4°C  0/3
06:08:09: 1992/1512MHz  0.69  16%   0%  16%   0%   0%   0% 34.4°C  0/3
06:08:14: 1992/1512MHz  0.71  16%   0%  16%   0%   0%   0% 35.0°C  0/3

So these are the single threaded PRELIMINARY openssl results for ODROID N1 differentiating between A53 and A72 cores:

A53               16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     103354.37k   326225.96k   683938.47k   979512.32k  1119100.93k
aes-192-cbc      98776.57k   293354.45k   565838.51k   760103.94k   843434.67k
aes-256-cbc      96389.62k   273205.14k   495712.34k   638675.29k   696685.91k

A72               16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     377879.56k   864100.25k  1267985.24k  1412154.03k  1489756.16k
aes-192-cbc     317481.96k   779417.49k  1045567.57k  1240775.00k  1306637.65k
aes-256-cbc     270982.47k   663337.94k   963150.93k  1062750.21k  1122691.75k

The numbers look somewhat nice but need further investigation:

  • When we compared with other A53 and especially A72 SoCs a while ago (especially the A72 numbers made on a RK3399 TV box only clocking at 1.8 GHz) the A72 scores above seem too low with all test sizes (see the numbers here with AES-128 on a H96-Pro)
  • Cooling state 1 is entered pretty early (when zone0 already exceeds 50°C) -- this needs further investigation. And further benchmarking especially with multiple threads in parallel is useless until this is resolved/understood

So let's check with Willy Tarreau's 'mhz' tool whether the reported CPU clockspeeds are bogus (I'm still using the performance cpufreq governor so the cores should run at 2 GHz and 1.5 GHz on A72 and A53 respectively):

root@odroid:/home/odroid/mhz# taskset -c 1 ./mhz
count=645643 us50=21495 us250=107479 diff=85984 cpu_MHz=1501.775
root@odroid:/home/odroid/mhz# taskset -c 5 ./mhz
count=807053 us50=20330 us250=101641 diff=81311 cpu_MHz=1985.102

All fine so we need to have a look at memory bandwidth. Here are tinymembench numbers pinned to an A53 and here with an A72. As a reference some numbers made with other RK3399 devices few days ago on request: https://irclog.whitequark.org/linux-rockchip/2018-02-12#21298744;

 

One interesting observation is throttling behaviour in a special SoC engine affecting crypto. When cooling state 1 was reached the cpufreq still remained at 2 and 1.5 GHz respectively but AES performance dropped a lot. So the ARMv8 crypto engine is part of the BSP 4.4 kernel's throttling strategies and performance in such a case does not scale linearly with the reported cpufreq. In other words: for the next round of tests the thermal thresholds defined in DT should be lifted a lot.

 

Edit: Wrong assumption wrt openssl numbers on A72 cores -- see next post

Posted

Openssl and thermal update

 

I've been wrong before wrt 'cooling state 1' -- the result variation must have had a different reason. I decided to test with AES encryption running on all 6 CPU cores in parallel using a simple script testing only with AES-256:

root@odroid:/home/odroid# cat check-aes.sh 
#!/bin/bash
while true; do
	for i in 0 1 2 3 4 5 ; do 
		taskset -c ${i} openssl speed -elapsed -evp aes-256-cbc 2>/dev/null &
	done
	wait
done

Results as follows: https://pastebin.com/fHzJ5tJF (please note that the cpufreq governor was set to performance and how especially the A72 scores were lower in the beginning just to improve over time: with 16 bytes it was 309981.41k in the beginning and later 343045.14k and even slightly more)

 

Here the armbianmonitor output: https://pastebin.com/1hsmk63i (at '07:07:28' I stopped the huge 5V fan and the small fansink can cope with this load, though cooling state 2 is sometimes reached when the SoC temperature exceeds 55°C). So for whatever reason we still have a somewhat large result variation with this single benchmark which needs further investigation (especially whether benchmark behaviour relates to real-world use cases like VPN and full disk encryption)

Posted

Regarding cooling states - since this is an HMP device with a PWM fan, according to DT it should have 3 cooling devices - big cluster throttling, little cluster throttling and the fan.

armbianmonitor currently doesn't deal with this situation - it reads the cooling device 0 state each time. And since I see only 3 available cooling states in your output most likely it's the fan.

You can check /sys/devices/virtual/thermal/ to confirm that multiple cooling devices are used.

Posted
4 minutes ago, zador.blood.stained said:

armbianmonitor currently doesn't deal with this situation - it reads the cooling device 0 state each time. And since I see only 3 available cooling states in your output most likely it's the fan.

Correct:

root@odroid:/home/odroid# cat /sys/devices/virtual/thermal/cooling_device0/type 
pwm-fan

So everything I've written above about cooling state is BS since it's just showing the fansink starting to work :)

 

One should stop thinking while benchmarking: just collect numbers like a robot, check later whether the data makes sense, throw numbers away and test again and again and again. Fortunately I already figured out that the result variation with openssl on the A72 cores has a different reason. But whether these benchmark numbers tell us anything is questionable. It would need some real-world tests with VPN and full disk encryption, then pinning the tasks to a little or a big core, to get an idea of what's really going on and whether the numbers generated with a synthetic benchmark have any meaning for real tasks.

Posted

"cryptsetup benchmark" numbers may be interesting, but they also heavily depend on the cryptography related kernel configuration options, so these numbers should be accompanied by /proc/crypto contents and lsmod output after the test.

Posted

Also interesting - I see the "Dynamic Memory Controller" in the DT which has its own set of operating points and this table

		system-status-freq = <
			/*system status         freq(KHz)*/
			SYS_STATUS_NORMAL       800000
			SYS_STATUS_REBOOT       528000
			SYS_STATUS_SUSPEND      200000
			SYS_STATUS_VIDEO_1080P  300000
			SYS_STATUS_VIDEO_4K     600000
			SYS_STATUS_VIDEO_4K_10B 800000
			SYS_STATUS_PERFORMANCE  800000
			SYS_STATUS_BOOST        400000
			SYS_STATUS_DUALVIEW     600000
			SYS_STATUS_ISP          600000
		>;

A quick Google search points to this page so this method should be tested to monitor the frequency

Quote

cat /sys/kernel/debug/clk/clk_summary | grep dpll_ddr

assuming the kernel was compiled with DDR devfreq support

 

Edit: though I see status = "disabled"; in the dmc node so it may not be operational yet.

Posted

IO scheduler influence on SATA performance

 

I tried to add the usual Armbian tweaks to Hardkernel's Debian Stretch image but something went wrong (we usually set cfq for HDDs and noop for flash media from /etc/init.d/armhwinfo -- I simply forgot to load the script so it never got executed at boot):

root@odroid:/home/odroid# cat /sys/block/sd*/queue/scheduler
noop deadline [cfq] 
noop deadline [cfq] 
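For reference, the scheduler can be switched per device at runtime via sysfs (sda/sdb being the two SSDs here):

for dev in sda sdb ; do
	echo deadline >/sys/block/${dev}/queue/scheduler
done
cat /sys/block/sd*/queue/scheduler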

So let's use the mdraid0 made of EVO840 and EVO750 (to ensure parallel disk accesses) with an ext4 on top and check for NCQ issues first:

root@odroid:/mnt/MDRAID0# dmesg | grep -i ncq
[    2.007269] ahci 0000:01:00.0: flags: 64bit ncq sntf stag led clo pmp pio slum part ccc sxs 
[    2.536884] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
[    2.536897] ata1.00: 234441648 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[    2.537652] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
[    3.011571] ata2.00: 234441648 sectors, multi 1: LBA48 NCQ (depth 31/32), AA

No issues, we can use NCQ with maximum queue depth so let's test through the three available schedulers with performance cpufreq governor to avoid being influenced by cpufreq scaling behaviour:

cfq                                                           random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     7320     7911     8657     8695     5954     8106
          102400       4    25883    30470    33159    33169    23205    30464
          102400      16    85609    96712   101527   102396    77224    96583
          102400     512   311645   312376   301644   303945   289410   308194
          102400    1024   345891   338773   329284   330738   329926   332866
          102400   16384   382101   379907   383779   387747   386901   383664

deadline                                                      random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     6963     8307     8211     8402     5772     8483
          102400       4    24701    30999    34728    34653    23160    31728
          102400      16    87390    98898   105589    97539    78259    97638
          102400     512   306420   304645   298131   302033   286582   303119
          102400    1024   345178   345458   329122   333318   329688   340144
          102400   16384   381596   374789   383850   387551   386428   381956

noop                                                          random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     6995     8589     9340     8498     5763     8246
          102400       4    26011    31307    30267    32635    21445    30859
          102400      16    88185   100135    97252   105090    79601    91052
          102400     512   307553   312609   304311   307922   291425   308387
          102400    1024   344472   340192   322881   333104   332405   333082
          102400   16384   372224   373183   380530   386994   386273   379506

Well, this looks like result variation, but of course someone interested in this could do a real benchmark, testing each scheduler at least 30 times and then generating average values. In the past on slower ARM boards with horribly bottlenecked IO capabilities (think of those USB2-only boards that cannot even use USB Attached SCSI due to lacking kernel/driver support) we've seen severe performance impact based on the IO scheduler used, but in this situation it seems negligible.

 

If someone takes the time to benchmark through this it would be interesting to repeat the tests also with the ondemand governor, io_is_busy set to 1 of course, and then playing around with different values for up_threshold and sampling_down_factor, since if cpufreq scaling behaviour starts to vary based on the IO scheduler used, performance differences can be massive.

 

I just did a quick check how performance with ondemand cpufreq governor and the iozone benchmark varies between Stretch / Hardkernel defaults and our usual tweaks: https://github.com/armbian/build/blob/751aa7194f77eabcb41b19b8d19f17f6ea23272a/packages/bsp/common/etc/init.d/armhwinfo#L82-L94
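The 'usual tweaks' from the linked armhwinfo snippet boil down to something like the following (sysfs paths and values shown as an example only, they may differ depending on kernel and governor layout):

echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo 1 >/sys/devices/system/cpu/cpufreq/ondemand/io_is_busy
echo 25 >/sys/devices/system/cpu/cpufreq/ondemand/up_threshold
echo 10 >/sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor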

 

It makes quite a difference, but again the IO scheduler chosen still doesn't matter that much (adjusting io_is_busy, up_threshold and sampling_down_factor does):

cfq defaults                                                  random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     5965     6656     7197     7173     5107     6586
          102400       4    20864    24899    27205    27214    19421    24595
          102400      16    68376    79415    85409    85930    66138    77598
          102400     512   274000   268473   267356   269046   247424   272822
          102400    1024   310992   314672   299571   299065   298518   315823
          102400   16384   366152   376293   375176   379202   379123   370254

cfq with Armbian settings                                     random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     7145     7871     8600     8591     5996     7973
          102400       4    25817    29773    32174    32385    23021    29627
          102400      16    83848    94665    98502    98857    75576    93879
          102400     512   303710   314778   303135   309050   280823   300391
          102400    1024   335067   332595   327539   332574   323887   329956
          102400   16384   381987   373067   381911   386585   387089   381956

deadline defaults                                             random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     6231     6872     7750     7746     5410     6804
          102400       4    21792    25941    28752    28701    20262    25380
          102400      16    70078    84209    88703    87375    69296    80708
          102400     512   276422   276042   259416   271542   250835   271743
          102400    1024   305166   321265   300374   296094   311020   323350
          102400   16384   363016   373751   376570   377294   378730   377186

deadline with Armbian settings                                random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     7389     8018     9018     9047     6162     8233
          102400       4    26526    30799    33487    33603    23712    30838
          102400      16    85703    96066   105055   103831    77281    97086
          102400     512   302688   297832   292569   288282   278384   294447
          102400    1024   343165   340770   317211   320999   329411   330670
          102400   16384   380267   375233   388286   390289   391849   375236

noop defaults                                                 random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     6301     6900     7766     7779     5350     6841
          102400       4    21995    25884    28466    28540    20240    25664
          102400      16    69547    81721    88044    88596    68043    81277
          102400     512   281386   276749   262216   262762   255387   261948
          102400    1024   300716   314233   288672   298921   310456   307875
          102400   16384   376137   371625   376620   378136   379143   371308

noop with Armbian settings                                    random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     7409     8026     9030     9033     6193     8259
          102400       4    26562    30861    33494    33649    23676    30870
          102400      16    85819    96956   102372   101982    77890    97341
          102400     512   310007   303370   293432   297090   281048   301772
          102400    1024   330968   352003   328052   318009   333682   337339
          102400   16384   373958   375028   384865   386749   389401   376501

(but as already said: to get more insights each test has to be repeated at least 30 times and then average values need to be generated -- 'single shot' benchmarking is useless to generate meaningful numbers)

Posted
21 minutes ago, zador.blood.stained said:

cat /sys/kernel/debug/clk/clk_summary | grep dpll_ddr

root@odroid:/mnt/MDRAID0# grep dpll_ddr /sys/kernel/debug/clk/clk_summary
root@odroid:/mnt/MDRAID0# cat /sys/kernel/debug/clk/clk_summary | curl -F 'f:1=<-' http://ix.io
http://ix.io/KNS

 

Posted

7-zip

 

Running on all 6 cores in parallel (no throttling occurred, I'm starting to like the small fansink ;) ):

root@odroid:/tmp# 7zr b

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=C.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)

LE
CPU Freq:   401   400   401  1414  1985  1985  1985  1984  1985

RAM size:    3882 MB,  # CPU hardware threads:   6
RAM usage:   1323 MB,  # Benchmark threads:      6

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       4791   499    934   4661  |     100897   522   1647   8605
23:       4375   477    935   4458  |      98416   522   1631   8516
24:       4452   524    914   4787  |      95910   523   1610   8418
25:       4192   524    914   4787  |      92794   523   1579   8258
----------------------------------  | ------------------------------
Avr:             506    924   4673  |              523   1617   8449
Tot:             514   1270   6561

Now still 6 threads but pinned only to the little cores:

root@odroid:/tmp# taskset -c 0,1,2,3 7zr b

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=C.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)

LE
CPU Freq:  1492  1500  1499  1500  1499  1493  1498  1499  1499

RAM size:    3882 MB,  # CPU hardware threads:   6
RAM usage:   1323 MB,  # Benchmark threads:      6

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       2475   375    642   2408  |      64507   396   1387   5501
23:       2440   385    646   2487  |      60795   383   1374   5261
24:       2361   391    649   2539  |      58922   381   1359   5172
25:       2249   394    652   2568  |      58033   388   1332   5165
----------------------------------  | ------------------------------
Avr:             386    647   2501  |              387   1363   5275
Tot:             387   1005   3888

And now 6 threads but bound to the A72 cores:

root@odroid:/tmp# taskset -c 4,5 7zr b

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=C.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)

LE
CPU Freq:   400   401   498  1984  1985  1981  1985  1985  1985

RAM size:    3882 MB,  # CPU hardware threads:   6
RAM usage:   1323 MB,  # Benchmark threads:      6

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       2790   199   1364   2715  |      47828   200   2040   4079
23:       2630   199   1343   2680  |      46641   200   2020   4036
24:       2495   200   1344   2683  |      45505   200   1999   3994
25:       2366   200   1353   2702  |      43998   200   1959   3916
----------------------------------  | ------------------------------
Avr:             199   1351   2695  |              200   2005   4006
Tot:             200   1678   3350

 

Posted

Cpuminer test (heavy NEON optimizations)

 

And another test:

sudo apt install automake autoconf pkg-config libcurl4-openssl-dev libjansson-dev libssl-dev libgmp-dev make g++
git clone https://github.com/tkinjo1985/cpuminer-multi.git
cd cpuminer-multi/
./build.sh 
./cpuminer --benchmark

When running on all 6 cores this benchmark scores at 'Total: 8.80 kH/s' without throttling. After killing the big cores (echo 0 >/sys/devices/system/cpu/cpu[45]/online) I get scores up to 'Total: 4.69 kH/s' which is in the expected range since I got 3.9 kH/s on an overclocked A64 (also Cortex-A53, back then running at 1296MHz). And when bringing back the big cores and killing the littles we're at around 'Total: 4.10 kH/s':

root@odroid:/usr/local/src/cpuminer-multi# echo 1 >/sys/devices/system/cpu/cpu5/online
root@odroid:/usr/local/src/cpuminer-multi# echo 1 >/sys/devices/system/cpu/cpu4/online
root@odroid:/usr/local/src/cpuminer-multi# echo 0 >/sys/devices/system/cpu/cpu3/online
root@odroid:/usr/local/src/cpuminer-multi# echo 0 >/sys/devices/system/cpu/cpu2/online
root@odroid:/usr/local/src/cpuminer-multi# echo 0 >/sys/devices/system/cpu/cpu1/online
root@odroid:/usr/local/src/cpuminer-multi# echo 0 >/sys/devices/system/cpu/cpu0/online
root@odroid:/usr/local/src/cpuminer-multi# ./cpuminer --benchmark
** cpuminer-multi 1.3.3 by tpruvot@github **
BTC donation address: 1FhDPLPpw18X4srecguG3MxJYe4a1JsZnd (tpruvot)

[2018-02-16 10:41:28] 6 miner threads started, using 'scrypt' algorithm.
[2018-02-16 10:41:29] CPU #0: 0.54 kH/s
[2018-02-16 10:41:29] CPU #5: 0.54 kH/s
[2018-02-16 10:41:30] CPU #2: 0.44 kH/s
[2018-02-16 10:41:30] CPU #3: 0.45 kH/s
[2018-02-16 10:41:30] CPU #1: 0.44 kH/s
[2018-02-16 10:41:30] CPU #4: 0.44 kH/s
[2018-02-16 10:41:32] Total: 3.90 kH/s
[2018-02-16 10:41:33] Total: 3.95 kH/s
[2018-02-16 10:41:37] CPU #4: 0.73 kH/s
[2018-02-16 10:41:37] CPU #3: 0.65 kH/s
[2018-02-16 10:41:38] CPU #1: 0.60 kH/s
[2018-02-16 10:41:38] CPU #2: 0.68 kH/s
[2018-02-16 10:41:38] CPU #0: 0.59 kH/s
[2018-02-16 10:41:38] CPU #5: 0.81 kH/s
[2018-02-16 10:41:38] Total: 4.01 kH/s
[2018-02-16 10:41:43] CPU #3: 0.66 kH/s
[2018-02-16 10:41:43] CPU #4: 0.71 kH/s
[2018-02-16 10:41:44] CPU #5: 0.73 kH/s
[2018-02-16 10:41:44] Total: 4.10 kH/s
[2018-02-16 10:41:47] CPU #0: 0.68 kH/s
[2018-02-16 10:41:48] CPU #2: 0.67 kH/s
[2018-02-16 10:41:48] Total: 4.08 kH/s
[2018-02-16 10:41:48] CPU #1: 0.68 kH/s
[2018-02-16 10:41:53] CPU #3: 0.68 kH/s
[2018-02-16 10:41:53] CPU #5: 0.72 kH/s
[2018-02-16 10:41:53] Total: 4.13 kH/s
[2018-02-16 10:41:53] CPU #4: 0.68 kH/s
[2018-02-16 10:41:54] CPU #1: 0.65 kH/s
[2018-02-16 10:41:54] CPU #0: 0.68 kH/s
[2018-02-16 10:41:58] Total: 4.05 kH/s
[2018-02-16 10:41:58] CPU #2: 0.65 kH/s
[2018-02-16 10:42:03] CPU #1: 0.64 kH/s
[2018-02-16 10:42:03] CPU #3: 0.66 kH/s
[2018-02-16 10:42:03] CPU #0: 0.65 kH/s
[2018-02-16 10:42:03] CPU #5: 0.73 kH/s
[2018-02-16 10:42:03] Total: 4.02 kH/s
[2018-02-16 10:42:03] CPU #4: 0.71 kH/s
^C[2018-02-16 10:42:05] SIGINT received, exiting

With ODROID-XU4/HC1/HC2 it looks like this: when forced to run on the little cores cpuminer gets 2.43 khash/s (no throttling occurring); running on the big cores it starts with 8.2 khash/s at 2.0 GHz but even with the fansink on the XU4 cpufreq immediately drops down to 1.8 or even 1.6 GHz. At least that's what happens on my systems, maybe others have seen different behaviour.

 

Let's do a 'per core' comparison:

  • A15 @ 2.0GHz: 2.35 khash/sec
  • A72 @ 2.0GHz: 2.05 khash/sec
  • A7 @ 1.5 GHz: 0.61 khash/sec
  • A53 @ 1.5 GHz:  1.18 khash/sec

In other words: with such or similar workloads ('number crunching', NEON optimized stuff) an A15 core might be slightly faster than an A72 core (and since the Exynos has twice as many fast cores it performs better with such workloads), while there's a great improvement on the little side: an A53 performs almost twice as fast as an A7 at the same clockspeed, though this is mostly because this specific benchmark makes heavy use of NEON instructions and there the switch to the 64-bit/ARMv8 ISA makes a huge difference. Please also be aware that cpuminer depends heavily on memory bandwidth, so these cpuminer numbers are not a good representation of other workloads. This is just 'number cruncher' stuff where NEON can be used.

Posted

Cryptsetup benchmark

 

2 hours ago, zador.blood.stained said:

 

"cryptsetup benchmark" numbers may be interesting, but they also heavily depend on the cryptography related kernel configuration options, so these numbers should be accompanied by /proc/crypto contents and lsmod output after the test.

 

 

Here we go. Same numbers with all cores active or just the big ones:

# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       669588 iterations per second for 256-bit key
PBKDF2-sha256    1315653 iterations per second for 256-bit key
PBKDF2-sha512     485451 iterations per second for 256-bit key
PBKDF2-ripemd160  365612 iterations per second for 256-bit key
PBKDF2-whirlpool  134847 iterations per second for 256-bit key
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   661.7 MiB/s   922.4 MiB/s
 serpent-cbc   128b           N/A           N/A
 twofish-cbc   128b    80.0 MiB/s    81.2 MiB/s
     aes-cbc   256b   567.6 MiB/s   826.9 MiB/s
 serpent-cbc   256b           N/A           N/A
 twofish-cbc   256b    79.6 MiB/s    81.1 MiB/s
     aes-xts   256b   736.3 MiB/s   741.3 MiB/s
 serpent-xts   256b           N/A           N/A
 twofish-xts   256b    83.7 MiB/s    82.5 MiB/s
     aes-xts   512b   683.7 MiB/s   686.0 MiB/s
 serpent-xts   512b           N/A           N/A
 twofish-xts   512b    83.7 MiB/s    82.5 MiB/s

When killing the big cores it looks like this (all the time running with performance cpufreq governor):

# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       332670 iterations per second for 256-bit key
PBKDF2-sha256     623410 iterations per second for 256-bit key
PBKDF2-sha512     253034 iterations per second for 256-bit key
PBKDF2-ripemd160  193607 iterations per second for 256-bit key
PBKDF2-whirlpool   85556 iterations per second for 256-bit key
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   369.9 MiB/s   449.0 MiB/s
 serpent-cbc   128b           N/A           N/A
 twofish-cbc   128b    33.5 MiB/s    35.1 MiB/s
     aes-cbc   256b   323.9 MiB/s   414.7 MiB/s
 serpent-cbc   256b           N/A           N/A
 twofish-cbc   256b    33.5 MiB/s    35.1 MiB/s
     aes-xts   256b   408.4 MiB/s   408.7 MiB/s
 serpent-xts   256b           N/A           N/A
 twofish-xts   256b    36.1 MiB/s    36.4 MiB/s
     aes-xts   512b   376.6 MiB/s   377.3 MiB/s
 serpent-xts   512b           N/A           N/A
 twofish-xts   512b    35.9 MiB/s    36.3 MiB/s

Other information as requested: https://pastebin.com/hMhKUStN
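For anyone who wants to repeat this with the context zador asked for, a minimal collection sketch (output file names are just examples):

cryptsetup benchmark | tee cryptsetup-benchmark.txt
cat /proc/crypto > proc-crypto.txt
lsmod > lsmod.txt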

Posted

I just got my board and I'll try to make Armbian ASAP. 

Another piece of info: when running

./cpuminer --benchmark

I get a 0.9A draw at the power source. Nothing else except console is attached.

Posted
5 minutes ago, Igor said:

I get a 0.9A draw at the power source. Nothing else except console is attached.

Since ODROID N1 with current default settings has a pretty high 'ground' consumption (most probably related to both the ASM1061 and the DC-DC circuitry) we'd better talk about consumption differences. I get 3.2W at the wall in idle and 12.1W when running 'cpuminer --benchmark'. So that's 8.9W for '8.77 kH/s' or just about 1W per kH/s (12V PSU included!). Now let's try the same with ODROID XU4 ;)

 

To get an idea how much the ASM1061 adds to idle consumption I would assume that we need to change CONFIG_PCIE_ROCKCHIP and friends from y to m? Or use DT overlays to disable the respective DT nodes?

Posted
Just now, tkaiser said:

To get an idea how much the ASM1061 adds to idle consumption I would assume that we need to change CONFIG_PCIE_ROCKCHIP and friends from y to m? Or use DT overlays to disable the respective DT nodes?

Or just recompile the DT with dtc and reboot since loading overlays needs either kernel or u-boot patches.

Posted
2 minutes ago, zador.blood.stained said:

Or just recompile the DT with dtc and reboot

 

Hmm...

root@odroid:/media/boot# dtc -I dtb -O dts -o rk3399-odroidn1-linux.dts rk3399-odroidn1-linux.dtb 
Warning (unit_address_vs_reg): Node /usb@fe800000 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /usb@fe900000 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@0 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@1 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@2 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@3 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@4 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@5 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@6 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@7 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@8 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@9 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /phy@e220 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /efuse@ff690000/id has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /efuse@ff690000/cpul-leakage has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /efuse@ff690000/cpub-leakage has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /efuse@ff690000/gpu-leakage has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /efuse@ff690000/center-leakage has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /efuse@ff690000/logic-leakage has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /efuse@ff690000/wafer-info has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /gpio-keys/button@0 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /gpiomem has a reg or ranges property, but no unit name

root@odroid:/media/boot# cat rk3399-odroidn1-linux.dts | curl -F 'f:1=<-' http://ix.io
http://ix.io/KQc

Anyway, I backed the eMMC contents up already yesterday so nothing can go wrong :)

Posted

Well, just setting two nodes to disabled results in PCIe being gone but just ~150mW (mW not mA!) less consumption:

root@odroid:/media/boot# diff rk3399-odroidn1-linux.dts rk3399-odroidn1-linux-mod.dts
8c8
< 	model = "Hardkernel ODROID-N1";
---
> 	model = "Hardkernel ODROID-N1 low power";
1654c1654
< 		status = "okay";
---
> 		status = "disabled";
1682c1682
< 		status = "okay";
---
> 		status = "disabled";
root@odroid:/media/boot# cat /proc/device-tree/model ; echo
Hardkernel ODROID-N1 low power
root@odroid:/media/boot# lspci
root@odroid:/media/boot# 
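For completeness, the way back (recompiling the edited source into the dtb that gets loaded at boot) is just one more dtc call; a minimal sketch using the file names from above, backup included:

root@odroid:/media/boot# cp rk3399-odroidn1-linux.dtb rk3399-odroidn1-linux.dtb.orig
root@odroid:/media/boot# dtc -I dts -O dtb -o rk3399-odroidn1-linux.dtb rk3399-odroidn1-linux-mod.dts
root@odroid:/media/boot# reboot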

After reverting back to the original DT I have PCIe back and consumption increased by a whopping ~150mW ;)

root@odroid:/home/odroid# lspci
00:00.0 PCI bridge: Device 1d87:0100
01:00.0 IDE interface: ASMedia Technology Inc. ASM1061 SATA IDE Controller (rev 02)

 

Posted
1 hour ago, tkaiser said:

Since ODROID N1 with current default settings has a pretty high 'ground' consumption (most probably related to both the ASM1061 and the DC-DC circuitry) we'd better talk about consumption differences. I get 3.2W at the wall in idle and 12.1W when running 'cpuminer --benchmark'. So that's 8.9W for '8.77 kH/s' or just about 1W per kH/s (12V PSU included!).

 

Since we were already talking about power consumption I gave cpuburn-a53 a try. I had to manually start it on the big cluster as well ('taskset -c 4,5 cpuburn-a53 &') but once the tool ran on all 6 CPU cores the fan started to spin at its lowest level and the SoC temperature stabilized at 52.8°C:

Time       big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
13:00:34: 1992/1512MHz  8.44 100%   0%  99%   0%   0%   0% 52.8°C  1/3
13:00:42: 1992/1512MHz  8.40 100%   0%  99%   0%   0%   0% 52.8°C  1/3
13:00:51: 1992/1512MHz  8.41 100%   0%  99%   0%   0%   0% 52.8°C  1/3
13:00:59: 1992/1512MHz  8.42 100%   0%  99%   0%   0%   0% 52.8°C  1/3
13:01:08: 1992/1512MHz  8.39 100%   0%  99%   0%   0%   0% 52.8°C  1/3
13:01:17: 1992/1512MHz  8.40 100%   0%  99%   0%   0%   0% 52.8°C  1/3
13:01:25: 1992/1512MHz  8.41 100%   0%  99%   0%   0%   0% 52.8°C  1/3
13:01:33: 1992/1512MHz  8.43 100%   0%  99%   0%   0%   0% 52.8°C  1/3
13:01:42: 1992/1512MHz  8.40 100%   0%  99%   0%   0%   0% 52.8°C  1/3^C

My powermeter then also showed just 12.1W, so it seems that with such heavy NEON workloads and the RK3399 busy on all CPU cores we can't get the board to consume more than 9W above idle...
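For anyone who wants to repeat this, a minimal sketch of the procedure (assuming the cpuburn-a53 binary is in the PATH; 'armbianmonitor -m' produces the tables shown here):

taskset -c 0-3 cpuburn-a53 &    # keep the little cluster busy
taskset -c 4,5 cpuburn-a53 &    # and the big cluster as well
armbianmonitor -m               # watch clockspeeds/temperature, Ctrl-C to stop
killall cpuburn-a53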

 

Testing again with openssl and the crypto engine I see the powermeter reporting 13.2W maximum (that's 10W more compared to idle) while the fan works harder but the temperature stays below 60°C:

Time       big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
13:12:06: 1992/1512MHz  6.01 100%   0%  99%   0%   0%   0% 55.0°C  2/3
13:12:13: 1992/1512MHz  6.17  99%   0%  99%   0%   0%   0% 55.6°C  2/3
13:12:20: 1992/1512MHz  6.16 100%   0%  99%   0%   0%   0% 55.0°C  2/3
13:12:27: 1992/1512MHz  6.14 100%   0%  99%   0%   0%   0% 55.0°C  2/3
13:12:33: 1992/1512MHz  6.27  99%   0%  99%   0%   0%   0% 54.4°C  2/3
13:12:40: 1992/1512MHz  6.25 100%   0%  99%   0%   0%   0% 55.0°C  2/3
13:12:47: 1992/1512MHz  6.23  99%   0%  99%   0%   0%   0% 56.7°C  2/3

IMO this is pretty amazing and I have to admit that I'm starting to like the fansink Hardkernel put on this board. While it looks similar to the one on the XU4 I bought last year, this one is way less annoying. If you put the N1 into a cabinet (as I do with all IT stuff I don't need on my desk) you can't hear the thing.

Posted

Thermal update

 

Since I was curious why temperatures in idle and under load were that low, and to make sure that throttling works with the 4.4 BSP kernel we're currently using... I decided to remove the N1's heatsink:

 

N1_without_Heatsink.jpg

 

Looks good, so now let's see how the board performs without the heatsink applied. Since I had not the slightest idea whether and how throttling works I decided to let a huge fan assist in the beginning:

 

N1_and_Fan.jpg

 

The board booted up nicely, the small PWM fan started to blow air around, the large one provided some extra cooling, and I decided to run 'cpuminer --benchmark' again. To my surprise (I expected ODROID XU4 behaviour) the big cores were only throttled down to 1800 and later 1608 MHz after a couple of minutes, so at least I knew throttling was working. Then I decided to stop the 5V USB connected fan and let the benchmark run on its own (board lying flat on the table, neither heatsink nor fan involved).

 

After about half an hour cpuminer still reported a hash rate of 'Total: 6.60 kH/s' (all 6 cores involved) and armbianmonitor output showed the current throttling behaviour:

Time       big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
17:40:23: 1008/1512MHz  6.56 100%   0%   0%  99%   0%   0% 84.4°C  3/3
17:40:28: 1008/1512MHz  6.91 100%   0%   0%  99%   0%   0% 84.4°C  3/3
17:40:33: 1008/1512MHz  6.84 100%   0%   0%  99%   0%   0% 84.4°C  3/3
17:40:38:  816/1512MHz  6.77 100%   0%   0%  99%   0%   0% 85.0°C  3/3
17:40:43:  816/1512MHz  6.71 100%   0%   0%  99%   0%   0% 85.0°C  3/3
17:40:48: 1008/1512MHz  6.73 100%   0%   0%  99%   0%   0% 84.4°C  3/3
17:40:53: 1008/1512MHz  6.67 100%   0%   0%  99%   0%   0% 84.4°C  3/3
17:40:59: 1008/1512MHz  6.62 100%   0%   0%  99%   0%   0% 84.4°C  3/3
17:41:04: 1008/1512MHz  6.57 100%   0%   0%  99%   0%   0% 84.4°C  3/3
17:41:09: 1008/1512MHz  6.52 100%   0%   0%  99%   0%   0% 84.4°C  3/3
17:41:14: 1008/1512MHz  6.48 100%   0%   0%  99%   0%   0% 83.9°C  3/3
17:41:19: 1200/1512MHz  6.44 100%   0%   0%  99%   0%   0% 85.0°C  3/3
17:41:24: 1200/1512MHz  6.41 100%   0%   0%  99%   0%   0% 84.4°C  3/3
17:41:29: 1008/1512MHz  6.37 100%   0%   0%  99%   0%   0% 84.4°C  3/3
17:41:34: 1008/1512MHz  6.34 100%   0%   0%  99%   0%   0% 84.4°C  3/3
Time       big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
17:41:39: 1008/1512MHz  6.40 100%   0%   0%  99%   0%   0% 84.4°C  3/3
17:41:45: 1200/1512MHz  6.37 100%   0%   0%  99%   0%   0% 85.6°C  3/3
17:41:50: 1992/1512MHz  5.86  24%   0%   0%  23%   0%   0% 78.8°C  3/3
17:41:55: 1992/1512MHz  5.39   0%   0%   0%   0%   0%   0% 75.0°C  3/3

So the big cores were throttled down to as low as 816 MHz but the board was still running under full load and generated 6.60 kH/s. Before I stopped the benchmark I checked the powermeter: 8.2W. In other words: with these throttling settings (clocking only the big cores down) we're now talking about a 5W delta compared to idle and 6.6 kH/s. That's 1.3 kH/s per W consumed. Pretty amazing, especially when compared with ODROID XU4 or Tinkerboard...

 

After stopping the benchmark I put the board into an upright position and switched to the ondemand governor to watch the temperatures drop down to 45°C (full armbianmonitor output):

Time       big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
18:04:31:  408/ 408MHz  0.01   0%   0%   0%   0%   0%   0% 45.6°C  0/3
18:04:36:  408/ 408MHz  0.01   0%   0%   0%   0%   0%   0% 45.0°C  0/3
18:04:41:  408/ 408MHz  0.01   0%   0%   0%   0%   0%   0% 45.0°C  0/3
18:04:46:  408/ 408MHz  0.01   0%   0%   0%   0%   0%   0% 45.6°C  0/3
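(Switching governors is just a sysfs write; a minimal sketch assuming the usual cpufreq layout with one policy per cluster -- cpu0 covering the little cores, cpu4 the big ones:)

echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor    # little cluster
echo ondemand > /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor    # big cluster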

That's really impressive. But be warned: once you use Android on this thing or GPU acceleration works within Linux, operation without a heatsink won't be a good idea any more (the Mali in this SoC is quite capable). Anyway: with pure CPU workloads this all looks very nice and way more energy efficient than those beefy ARMv7 boards with Cortex-A15 or A17 cores.

Posted

Preliminary 'performance' summary

 

Based on the tests done above and elsewhere let's try to collect some performance data. GPU data below is missing for the simple reason that I'm not interested in anything GPU related (or in attaching a display at all). Besides being used for display stuff and 'retro gaming', RK3399's Mali T860 MP4 GPU is also OpenCL capable. If you search for results (ODROID N1's SoC has been available for some years now so you'll find a lot by searching for 'RK3399' -- for example here are some OpenCL/OpenCV numbers) please keep in mind that Hardkernel might use different clockspeeds for the GPU as well (with the CPU cores it's just like that: almost everywhere else the big/little cores are clocked at 1.8/1.4 GHz while the N1 settings use 2.0/1.5 GHz instead).

 

CPU horsepower

 

The situation with RK3399 is somewhat special since it's an HMP design combining two fast Cortex-A72 cores with four 'slow' A53 cores. So depending on which CPU core a job lands on, execution time can vary by a factor of 2. With Android or 'Desktop Linux' workloads this shouldn't be an issue since there things are mostly single-threaded and the scheduler will move these tasks to the big cores automagically if performance is needed.

 

With other workloads it differs:

  • People wanting to use RK3399 as part of a compile farm might be disappointed and still prefer ARM designs that feature four instead of two fast cores (e.g. RK3288 or Exynos 5422 -- for the reasons why see again the comments section on CNX)
  • For 'general purpose' server use cases the 7-zip scores are interesting since they give a rough estimate of how fast an RK3399 device will perform as a server (or how many tasks you can run in parallel). The overall score is 6,500 (see this comparison list) but due to the big.LITTLE design the big cluster scores around 3350 and the little cluster around 3900, so per core the big cores are almost twice as fast and single-threaded tasks that land on a big core finish almost twice as quickly. Keep this in mind when setting up your environment. Experimenting with cgroups and friends to assign certain tasks to specific CPU clusters will be worth the effort!
  • 'Number crunchers' who can make use of NEON instructions should look at the 'cpuminer --benchmark' results: we get a total rate of 8.80 kH/s when running on all 6 cores (big cores only: 4.10 kH/s, little cores only: 4.90 kH/s -- so again 'per core' performance is almost twice as good on the big cores), which is at the same performance level as an RK3288 (4 x A17) but gets outperformed by, for example, an ODROID XU4 at +10 kH/s since there the little cores add a bit to the result -- though that needs improved cooling, otherwise an XU4 will immediately throttle down. The RK3399 provides this performance with way lower consumption and heat generation!
  • Crypto performance: just awesome thanks to the ARMv8 Crypto Extensions being available and usable on all cores in parallel. Simply check the cryptsetup results above and our 'openssl speed' numbers (see the sketch right after this list) and keep in mind that if your crypto stuff can run in parallel (e.g. terminating a few different VPN sessions) you can almost add up the individual throughput numbers (and even with 6 threads in parallel at full clockspeed the RK3399 draws just 10W more compared to idle)
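A quick way to check that the ARMv8 crypto extensions are really being used is OpenSSL's built-in benchmark; a minimal sketch (exact numbers obviously depend on the OpenSSL build):

openssl speed -evp aes-256-cbc             # single thread
openssl speed -multi 6 -evp aes-256-cbc    # all six cores in parallel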

Talking about 'two fast and four slow CPU cores': the A53 cores are clocked at 1.5 GHz, so when comparing with RK3399's little sibling RK3328 with only 4 x A53 (ROCK64, Libre Computer Renegade or Swiftboard/Transformer), the RK3399 running only on its 'slow' cores already competes with or outperforms the RK3328 boards while still having 2 big cores available for heavy stuff. But since a lot of workloads are bottlenecked by memory bandwidth you should have a look at the tinymembench results collected above (and use some google-fu to compare with other devices)

 

Storage performance

 

N1 has 2 SATA ports provided by a PCIe attached ASM1061 controller and 2 USB3 ports directly routed to the SoC. The per-port bandwidth limitation, which also seems to apply per port group, is around 390 MB/s (regardless whether SATA or USB3 -- random IO performance with default settings is also pretty much the same). But this is not an overall internal SoC bottleneck since when testing with fast SSDs on a USB3 and a SATA port at the same time we got around ~750 MB/s combined. I just retested with an EVO840 on the N1 on both a SATA and a USB3 port (the latter in a good UAS capable enclosure) and as a comparison repeated the same test with a 'true NAS SoC': the Marvell Armada 385 on the Clearfog Pro which provides 'native SATA' from the SoC itself:

 

 

Same Samsung EVO840 used for the tests, same settings (for iozone command line see somewhere above)
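(For reference: the record sizes in the tables below correspond to an iozone call like the one quoted again near the end of this thread, mountpoint being just an example:)

cd /mnt/evo840
iozone -e -I -a -s 100M -r 1k -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2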
 


ODROID N1 USB3/JMS567                                         random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     7348     8214     9805    10012     5473     8085
          102400       4    26394    30872    41039    40473    20255    30509
          102400      16    68892    98586   120807   121118    66786    97474
          102400     512   327991   334624   312310   316452   305005   331188
          102400    1024   357135   365850   349055   354952   343348   359507
          102400   16384   376355   388326   395179   400291   399759   384052

ODROID N1 PCIe/ASM1061 powersave                              random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1     7585     8562     9322     9331     5907     8505
          102400       4    26400    31745    34586    34798    24039    31595
          102400      16    87201    99311   105977   106152    79099    99618
          102400     512   313662   316992   308216   310013   301521   308300
          102400    1024   327748   324230   319738   322929   317812   325224
          102400   16384   368813   369384   385862   390732   390612   379333

ODROID N1 PCIe/ASM1061 performance                            random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1    15218    19331    23617    23661    10690    18965
          102400       4    49071    65403    79028    79247    39287    64922
          102400      16   137845   168899   185766   186482   116789   166413
          102400     512   326117   332789   324468   326999   317332   328611
          102400    1024   330827   331303   326731   329246   325201   333325
          102400   16384   378331   368429   385870   392127   391348   371753

Clearfog Pro SATA                                             random    random
              kB  reclen    write  rewrite    read    reread    read     write
          102400       1    21853    37308    39815    39753    12597    35440
          102400       4    63930   121585   132720   133372    46210   118527
          102400      16   176397   262801   278098   289824   143121   265142
          102400     512   387158   404191   425735   432220   415117   386369
          102400    1024   376309   395735   450046   421499   432396   387842
          102400   16384   384486   389053   506038   509033   500409   402384



EVO840_Clearfog.jpg
 

 

If we look carefully at the numbers we see that USB3 slightly outperforms the ASM1061 when it comes to top sequential performance. The two ASM1061 result sets differ due to different settings of /sys/module/pcie_aspm/parameters/policy (defaults to powersave but can be changed to performance, which not only results in ~250mW higher idle consumption but also much better performance with small block sizes). So while USB3 seems to perform slightly better when looking only at mostly irrelevant sequential transfer speeds, it's better to attach disks to the SATA ports for a number of reasons:

  • With USB you need disk enclosures with good USB to SATA bridges that are capable of UAS --> 'USB Attached SCSI' (we can only recommend the following ones: ASMedia ASM1153/ASM1351, JMicron JMS567/JMS578 or VIA VL711/VL715/VL716 -- unfortunately, even when those chipsets are used, crappy firmwares sometimes need USB quirks or require UAS blacklisting and then performance sucks; a good example are Seagate USB3 disks)
  • When you use SSDs you want to be able to use TRIM (helps with retaining drive performance and increases longevity). With SATA attached SSDs this is not a problem but on USB ports it depends on a lot of stuff and usually does NOT work. If you understand just half of what's written here then you may consider SSDs on USB ports, otherwise better choose the SATA ports
  • And PCIe is also less 'expensive' since it needs fewer resources (lower CPU utilization with the disk on SATA ports and fewer interrupts to process; see the ~800k IRQs for SATA/PCIe vs. ~2 million for USB3 with exactly the same workload below):
226:        180     809128          0          0          0          0   ITS-MSI 524288 Edge      0000:01:00.0
226:          0          0          0          0          0          0   ITS-MSI 524288 Edge      0000:01:00.0
227:        277          0    2066085          0          0          0     GICv3 137 Level     xhci-hcd:usb5
228:          0          0          0          0          0          0     GICv3 142 Level     xhci-hcd:usb7
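(An easy way to reproduce this comparison yourself: snapshot the relevant /proc/interrupts counters before and after an identical benchmark run on each port type and diff them:)

grep -E 'ITS-MSI|xhci-hcd' /proc/interrupts > /tmp/irq.before
# ... run the identical iozone workload on the SATA or USB3 attached disk ...
grep -E 'ITS-MSI|xhci-hcd' /proc/interrupts > /tmp/irq.after
diff /tmp/irq.before /tmp/irq.after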

There are also eMMC and SD cards usable as storage. Wrt SD cards it's too early to talk about performance since at least on the N1 developer samples a kernel patch is still missing to remove the current SD card performance bottleneck, so for now the cards are driven in the slowest SD card speed mode (and I really hope this will change with the final N1 version later).

 

The eMMC performance is awesome! If we look only at random IO performance with smaller block sizes (that's the 'eMMC as OS drive' use case) then the Hardkernel eMMC modules starting at 32GB perform as fast as an SSD connected to the USB3 or SATA ports. On the SATA ports we get a nice speed boost by changing the ASPM (Active State Power Management) setting from the 'powersave' default to performance (+250mW idle consumption). Only then can an SSD behind a SATA port on the N1 outperform a Hardkernel eMMC module wrt random IO or 'OS drive' performance. But of course this has a price: when SATA or USB drives are used consumption is a lot higher.
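(For reference, checking and switching the ASPM policy is a one-liner -- the active policy is the one shown in brackets:)

cat /sys/module/pcie_aspm/parameters/policy                   # show current policy
echo performance > /sys/module/pcie_aspm/parameters/policy    # ~250mW more in idle, much better small-blocksize IO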

 

Network performance

 

Too early to report 'success' but I'm pretty confident we'll get Gigabit Ethernet fully saturated after applying some tweaks. With RK3328 it was the same situation in the beginning and maybe the same fixes that helped there will fix it with RK3399 on the N1 too. I would assume progress can be monitored here: https://forum.odroid.com/viewtopic.php?f=150&t=30126
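(If you want to check on your own hardware, a simple iperf3 run in both directions against another Gigabit Ethernet host running 'iperf3 -s' is enough -- the server address below is just a placeholder:)

iperf3 -c 192.168.1.100 -t 30        # transmit direction
iperf3 -c 192.168.1.100 -t 30 -R     # receive direction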

Posted

Storage performance update... what to use to store the rootfs on?

 

In the following I compare 4 good SD cards, 4 different eMMC modules Hardkernel sells for the N1 and 4 different SSD setups. As some background on why I chose to measure random IO with 1k, 4k and 16k block sizes please read the 'SD card performance 2018 update' first.

 

The following are IOPS numbers (IO operations per second) and important if we want to know how fast storage performs when used as an 'OS drive' (random IO performance is the most important factor here):

                                 1K w/r        4K w/r        16K w/r
 SanDisk Extreme Plus 16GB     566  2998     731  2738      557  2037    
     SanDisk Ultra A1 32GB     456  3171     843  2791      548  1777
   SanDisk Extreme A1 32GB     833  3289    1507  3281     1126  2113
          Samsung Pro 64GB    1091  4786    1124  3898      478  2296
 
          Orange eMMC 16GB    2450  7344    7093  7243     2968  5038
          Orange eMMC 32GB    2568  7453    7365  7463     5682  5203
          Orange eMMC 64GB    2489  7316    7950  6944     6059  5250
         Orange eMMC 128GB    2498  8337    7064  7197     5459  4909
        
            Intel 540 USB3    7076  4732    7053  4785     5342  3294
       Samsung EVO750 USB3    8043  6245    7622  5421     6175  4481
  Samsung EVO840 powersave    8167  5627    7605  5720     5973  4766
Samsung EVO840 performance   18742 10471   16156  9657    10390  7188
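(IOPS and iozone's raw KB/s figures are related via the record size -- KB/s divided by the record size in KB gives IOPS; a tiny sketch with made-up numbers:)

# e.g. 29000 KB/s at 4k record size ~= 7250 IOPS
awk -v kbps=29000 -v reclen_kb=4 'BEGIN { printf "%d IOPS\n", kbps/reclen_kb }'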

The SD cards I chose for this comparison all perform very well (an average no-name, Kingston, PNY, Verbatim or whatever other 'reputable' brand performs way lower wrt random IO!). But it can clearly be seen that Hardkernel's eMMC modules are a lot more performant. Regardless of size they all perform pretty similarly, though the small 16GB module is bottlenecked by a write performance limitation that also affects its 16k random IO write performance.

 

With SSDs it depends: I chose somewhat ok-ish consumer SSDs for the test, so in case you want to buy used SSDs or some 'great bargains' on Aliexpress or eBay be prepared that your numbers will look way worse. The SATA connected EVO840 is listed twice since performance with small blocksizes heavily depends on PCIe power management settings (default is powersave -- switching to performance increases idle consumption by around ~250mW but only then is a SATA connected SSD able to outperform Hardkernel's eMMC; that's important to know and also only applies to really performant SSDs -- cheap SSDs, especially those with small capacities, perform way lower)

 

Now let's look at sequential performance with large blocksizes (something that does NOT represent the 'OS drive' use case even remotely and is pretty irrelevant for almost all use cases except creation of stupid benchmark graphs):

                                   MB/s write     MB/s read
 SanDisk Extreme Plus 16GB             63            67
     SanDisk Ultra A1 32GB             20            66
   SanDisk Extreme A1 32GB             59            68
          Samsung Pro 64GB             61            66
        
          Orange eMMC 16GB             48           298            
          Orange eMMC 32GB            133           252
          Orange eMMC 64GB            148           306
         Orange eMMC 128GB            148           302
        
            Intel 540 USB3            325           370
       Samsung EVO750 USB3            400           395
  Samsung EVO840 powersave            375           385
Samsung EVO840 performance            375           385

We can see that N1's SD card interface seems to bottleneck sequential read performance of all tested cards at around ~67 MB/s. Write performance depends mostly on the cards themselves (all cheap cards like the tested SanDisk Ultra A1 32GB, currently $12 on Amazon, are limited here). The Hardkernel eMMC modules perform very well with sustained read performance at around 300 MB/s and write performance, depending on module size, at up to ~150 MB/s.

 

With SSDs it depends -- we have an interface limitation of around ~395 MB/s on the USB3 ports and a little bit lower on the SATA ports, but unless you buy rather expensive SSDs you won't be able to reach the board's bottleneck anyway. Please also keep in mind that the vast majority of consumer SSDs implements some sort of write caching and write performance drops drastically once a certain amount of data has been written (my Intel 540 then gets as slow as 60 MB/s, IIRC the EVO750 can sustain ~150 MB/s and the EVO840 ~180 MB/s).

 

Why aren't HDDs listed above? Because they're useless here: even enterprise HDDs show way too low random IO performance. These things are good to store 'cold data' on but never ever put your rootfs on them. They're outperformed at least 5 times by any recent A1 rated SD card, even crappy SSDs are at least 10 times faster, and Hardkernel's eMMC performs at least 50 times better.

 

So how to interpret the results above? If you want energy efficient and ok-ish performing storage for your rootfs (OS drive) then choose any of the currently available A1 rated SD cards from reputable vendors (choose more expensive ones for better performance/resilience, and choose larger capacities than needed if you fear your flash memory wearing out too fast). If you want top performance at the lowest consumption level choose Hardkernel's eMMC and keep in mind that the smallest module is somewhat write performance bottlenecked. Again: if you fear your flash memory wearing out too fast simply choose larger capacities than 'needed'.

 

If you want to waste huge amounts of energy while still being outperformed by Hardkernel's eMMC, buy a cheap SSD. Keep in mind that you need to disable PCIe power management (further increasing idle consumption) to be able to outperform eMMC storage, otherwise N1's SATA/PCIe implementation bottlenecks too much. So when do SSDs start to make sense? If you either really need higher performance than Hardkernel's eMMC modules and are willing to spend a serious amount of money on a good SSD, or the '1k random IO' use case really applies to you (e.g. running a database with insanely small record sizes that constantly updates at the storage layer).

 

But always keep in mind: unless you really choose a more expensive and high performing SSD you'll always get lower performance than with eMMC while consumption is at least 100 times higher. And always use SSDs on the SATA ports since only there can you get higher random IO performance compared to eMMC, and being able to benefit from TRIM is essential (for details why TRIM is a problem on USB ports see above). But keep in mind that internal SATA connectors are rated for 50 matings max, so be prepared to destroy connectors easily if you constantly change cables on those SATA ports :)
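(Whether discard/TRIM actually works on a given drive is easy to verify -- non-zero DISC-GRAN/DISC-MAX values mean the block layer can discard, and a manual fstrim confirms it end to end; device and mountpoint below are just examples:)

lsblk --discard /dev/sda
fstrim -v /mnt/ssd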

 

But what if you feel that any SATA attached storage (the cheapest SSD around or even HDDs) must be an improvement compared to eMMC or SD cards? Just use it, all of the above is about facts and not feelings. You should just make sure to never test your storage performance since that might hurt your feelings (it would be as easy as 'cd $ssd-mountpoint ; iozone -e -I -a -s 100M -r 1k -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2' but really don't do this if you want to hold on to one of the most common misbeliefs with consumer electronics today)

 

As a reference all IO benchmark results for SD cards, Hardkernel's eMMC modules and the SSD tests:

 

 

Posted

Just a miniature SATA/ASM1061 related material collection

 

  • multiple disks behind ASM1061 problem with Turris Omnia
  • Suggested 'fix' by Turris folks (slowing down PCIe): https://gitlab.labs.nic.cz/turris/turris-os-packages/merge_requests/48/diffs -- please note that the ASM106x firmware matters: their ASM1061 registers itself as class '0x010601' (AHCI 1.0) while the ASM1061 Hardkernel put on the N1 dev samples uses a firmware that reports class '0x010185' (IDE 1.0) instead. This doesn't matter wrt performance (there the chosen driver is what counts) but if code wants to differentiate based on PCIe device classes it of course has to match. Same with device IDs: these can be either '0x0611' (ASM1061) or '0x0612' (ASM1062) depending on firmware, not hardware (the Turris ASM1061 shows up as an ASM1062).
  • To disable NCQ and/or to set link speed negotiation limits you could adjust the 'setenv bootargs' line in /media/boot/boot.ini: for example setenv bootargs "${bootrootfs} libata.force=1.5,noncq" (see kernel cmdline parameters, could be interesting for SSD users in case NCQ and TRIM interfere)
  • To check SATA relevant dmesg output: dmesg | egrep -i "ahci|sata| ata|scsi|ncq" (mandatory prior and after any benchmarks!)
  • There's a newer firmware for the ASM1061 available -- a few steps are needed to be able to run the included binary but even then the update operation fails: dpkg --add-architecture armhf ; apt install binutils:armhf ; ./106flash ahci420g.rom (Hardkernel put an SPI flash for the ASM1061 on the PCB but the flash program stops with 'ASM106X SPI Flash ROM Write Linux V2.6.4 • Find 1 ASM106X Controller • Read_RomID Failed!!')