tkaiser Posted February 15, 2018

UPDATE: You'll find a preliminary performance overview at the end of the thread. Click here.

This is NOT an ODROID N1 review since it's way too early for that. The following focuses on just a small number of use cases the board might be used for: server stuff and everything that revolves around network, IO and internal limitations. If you want the hype instead better join Hardkernel's vendor community over there: https://forum.odroid.com/viewforum.php?f=148

All numbers you find below are PRELIMINARY since it's way too early to benchmark this board. This is just an attempt to get some baseline numbers to better understand which use cases the device might be appropriate for, where to look further into and which settings might need improvements.

Background info first

ODROID N1 is based on the Rockchip RK3399 SoC so we already know a lot since RK3399 isn't really new (see Chromebooks, countless TV boxes with this chip and dev boards like Firefly RK3399, ROCK960 and a lot of others... and there will be a lot more devices coming in 2018, like another board from China soon with an M.2 key M slot exposing all PCIe lanes).

What we already know is that the SoC is one of Rockchip's 'open source SoCs', so software support is already pretty good and the chip vendor itself actively upstreams it. We also know RK3399 is not the greatest choice for compiling code (a use case bottlenecked by memory bandwidth and only 2 fast cores combined with 4 slow ones; for this use case 4 x A15 or A17 cores perform much better), that the ARMv8 crypto extensions are supported (see a few posts below), and that the SoC performs nicely with Android and 'Desktop Linux' stuff (think of GPU and VPU acceleration). We also know that this SoC has 2 USB3 ports and implements PCIe 2.1 with a four-lane interface. But so far we don't know what the internal bottlenecks look like, so let's focus on this now.

The PCIe 2.1 x4 interface is said to support both Gen1 and Gen2 link speeds (2.5 vs. 5 GT/s) but there was recently a change in the RK3399 datasheet (downgrade from Gen2 to Gen1) and some mainline kernel patch descriptions seem to indicate that RK3399 is not always able to train for Gen2 link speeds. On ODROID N1 a single-lane PCIe link configured as either Gen1 or Gen2 is used, to which a dual-port SATA adapter is connected. The ASMedia ASM1061 was the obvious choice since, while being a somewhat old design (AFAIK from 2010), it's cheap and 'fast enough' at least when combined with one or even two HDDs. Since the PCIe implementation on these early N1 dev samples is fixed and limited we need to choose other RK3399 devices to get a clue about PCIe limitations (RockPro64, ROCK960 or the not yet announced other board from China). So let's focus on SATA and USB3 instead.

While SATA on 'development boards' is nothing new, it's often done with (sometimes really crappy) USB2 SATA bridges, recently sometimes with good USB3 SATA bridges (see ODROID HC1/HC2, Cloudmedia Transformer or Swiftboard) and sometimes it's even 'true' SATA:

- Allwinner A10/A20/R40/V40 (many SBC)
- AM572x Sitara (eg. BeagleBoard-X15 with 1 x eSATA and 1 x SATA on expansion header)
- Marvell Armada 38x (Clearfog Base, Clearfog Pro, Helios4)
- Marvell Armada 37x0 (EspressoBin)
- NXP i.MX6 (Cubox-i, the various Hummingboard versions, same with Wandboard and so on)

All the above SoC families do 'native SATA' (the SoC itself implements SATA protocols and connectivity) but performance differs a lot, with 'Allwinner SATA' being the worst and only the Marvell implementations performing as expected (+500 MB/s sequential and also very high random IO performance, which is what you're looking for when using SSDs). As Armbian user you already know: this stuff is documented in detail, just read through this and that.

RK3399 is not SATA capable and we're talking here about PCIe attached SATA which has 2 disadvantages: slightly bottlenecking performance while increasing overall consumption. N1's SATA implementation and how it's 'advertised' (rootfs on SATA) pose another challenge but this is something for a later post (the sh*tshow known from 'SD cards' the last years now arriving at a different product category called 'SSD').

Benchmarking storage performance is challenging and most 'reviews' done on SBCs use inappropriate tools (see this nice bonnie/bonnie++ example), inappropriate settings (see all those dd and hdparm numbers partially testing filesystem buffers and caches and not storage) or focus only on irrelevant stuff (eg. sequential performance in 'worst case testing mode' only looking at one direction).

Some USB3 tests first

All SSDs I use for the test are powered externally and not by N1 since I ran more than once into situations with board-powered SSDs where performance dropped a lot when some sort of underpowering occurred. The 2 USB3 enclosures above are powered by a separate 5V rail and the SATA attached SSDs by the dual-voltage PSU behind.

As expected USB3 storage can use the much faster UAS protocol (we know this from RK3328 devices like ROCK64 already, which uses the same XHCI controller and most probably a nearly identical kernel) and the performance numbers match (with large block and file sizes we get close to 400 MB/s). We chose iozone for the simple reason of being able to compare with previous numbers but a more thorough benchmark would need some fio testing with different test sets. But it's only about getting a baseline now.

Tests done with Hardkernel's Debian Stretch image with some tweaks applied. The image relies on Rockchip's 4.4 BSP kernel (4.4.112) with some Hardkernel tweaks and I adjusted the following: first set both cpufreq governors to performance to not be affected by potentially wrong/weird cpufreq scaling behaviour. Then do static IRQ distribution for USB3 and PCIe on cpu1, cpu2 and cpu3 (all little cores, but while checking CPU utilization none of the cores was fully saturated so A53@1.5GHz is fine):

echo 2 >/proc/irq/226/smp_affinity
echo 4 >/proc/irq/227/smp_affinity
echo 8 >/proc/irq/228/smp_affinity

To avoid CPU core collisions the benchmark task itself has been sent to one of the two A72 cores:

taskset -c 5 iozone -e -I -a -s 100M -r 1k -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2

Unfortunately I currently only have crappy SSDs lying around (all cheap consumer SSDs: Samsung EVO 840 and 750, a Samsung PM851 and an Intel 540). So we need to take the results with a grain of salt since those SSDs suck especially with continuous write tests (sequential write performance drops a lot after a short period of time).
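Side note: the IRQ numbers 226-228 above are specific to this kernel and boot; if someone wants to replicate the setup, something along these lines finds the relevant interrupts first and then pins one of them (a sketch -- adjust the grep pattern and the affinity bitmask to what /proc/interrupts actually shows on your board; 2=cpu1, 4=cpu2, 8=cpu3):

# show which IRQ numbers belong to USB3 (xhci) and PCIe on this kernel
grep -E 'xhci|pcie' /proc/interrupts

# example: pin IRQ 227 to cpu2 and verify
echo 4 > /proc/irq/227/smp_affinity
cat /proc/irq/227/smp_affinity_list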
First test is to determine whether the USB3 ports behave differently (AFAIK one of the two could also be configured as an OTG port and with some SBC I've seen serious performance drops in such a mode). But nope, they perform identically:

EVO840 behind JMS567 (UAS active) on lower USB3 port (xhci-hcd:usb7, IRQ 228):
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     6200     6569     7523     7512      4897      6584
102400       4    23065    25349    34612    34813     23978     25231
102400      16    78836    87689   105249   106777     78658     88240
102400     512   302757   314163   292206   300964    292599    321848
102400    1024   338803   346394   327101   339218    329792    351382
102400   16384   357991   376834   371308   384247    383501    377039

EVO840 behind JMS567 (UAS active) on upper USB3 port (xhci-hcd:usb5, IRQ 227):
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     6195     6545     7383     7383      4816      6518
102400       4    23191    25114    34370    34716     23580     25199
102400      16    78727    86695   104957   106634     76359     87610
102400     512   307469   315243   293077   302678    293442    321779
102400    1024   335772   336833   326940   339128    330298    350271
102400   16384   366465   376863   371193   384503    383297    379898

Now attaching an EVO750 (not that fast), which performs pretty much identically behind the XHCI host controller and the JMS567 controller inside the enclosure:

EVO750 behind JMS567 (UAS active) on lower USB3 port (xhci-hcd:usb7, IRQ 228):
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     6200     6569     7523     7512      4897      6584
102400       4    23065    25349    34612    34813     23978     25231
102400      16    78836    87689   105249   106777     78658     88240
102400     512   302757   314163   292206   300964    292599    321848
102400    1024   338803   346394   327101   339218    329792    351382
102400   16384   357991   376834   371308   384247    383501    377039

(so USB3 is the bottleneck here; especially with random IO an EVO840 is much, much faster than an EVO750, but here they perform identically due to the massive USB protocol overhead)

Let's try both USB3 ports at the same time

First quick try was a BTRFS RAID-0 made with 'mkfs.btrfs -f -m raid0 -d raid0 /dev/sda1 /dev/sdb1'. Please note that BTRFS is not the best choice here since all (over)writes with blocksizes lower than btrfs' internal blocksize (4K default) are way slower compared to non-CoW filesystems:

                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     2659     1680   189424   621860    435196      1663
102400       4    21943    18762    24206    24034     18107     17505
102400      16    41983    46379    62235    60665     52517     42925
102400     512   180106   170002   143494   149187    138185    180238
102400    1024   170757   185623   159296   156870    156869    179560
102400   16384   231366   247201   340649   351774    353245    231721

Those are BS numbers, let's forget about them. Now trying the same with mdraid/ext4, configuring a RAID 0 and putting an ext4 on it (commands sketched at the end of this post) and... N1 simply powered down when executing mkfs.ext4. Adding 'coherent_pool=2M' to bootargs seems to do the job (and I created the mdraid0 in between with both SSDs connected through SATA):

                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       4    25133    29444    38340    38490     23403     27947
102400      16    85036    97638   113992   114834     79505     95274
102400     512   306492   314124   295266   305411    289393    322493
102400    1024   344588   343012   322018   332545    316320    357040
102400   16384   384689   392707   371415   384741    388054    388908

Seems we're already talking about one real bottleneck here? We see nice improvements with small blocksizes which is an indication that RAID0 is doing its job.
But with larger blocksizes we're not able to exceed the 400 MB/s barrier, so it seems both USB3 ports have to share bandwidth (comparable to the situation on ODROID XU4 where the two USB3 receptacles are connected to an internal USB3 hub which is connected to one USB3 port of the Exynos SoC).

Edit: @Xalius used these results to look into the RK3399 TRM (technical reference manual). Quoting ROCK64 IRC:

[21:12] <Xalius_> let me pull out that TRM again
[21:16] <Xalius_> the USB-C PHY seems to be an extra block, but I guess that is mostly because it can switch the aux function to display port
[21:16] <Xalius_> it's not obvious to me how that would change the bandwidth
[21:16] <Xalius_> unless in normal USB3 mode they somehow have a PHY where both hosts connect or sth?
[21:17] <Xalius_> also I don't think it matters wrt the pinout
[21:17] <Xalius_> maybe you can switch one port to the USB-C PHY anyways
[21:17] <Xalius_> even if not using any USB-C type things
[21:18] <Xalius_> like force it into host mode
[21:19] <Xalius_> tkaiser, "Simultaneous IN and OUT transfer for USB3.0, up to 8Gbps bandwidth"
[21:19] <Xalius_> just reading the USB3 part
[21:21] <Xalius_> it also has some Ethernet hardware accelerator
[21:21] <Xalius_> "Scheduling of multiple Ethernet packets without interrupt"
[21:22] <Xalius_> apparently each USB3 host shares bandwidth with one USB2 host
[21:22] <Xalius_> "Concurrent USB3.0/USB2.0 traffic, up to 8.48Gbps bandwidth"
[21:26] <Xalius_> tkaiser, they also have two sets of IRQs in the list
[21:26] <tkaiser> Xalius_: Yeah, with this kernel I see them both. And assigned them to different CPU cores already
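Coming back to the mdraid/ext4 RAID0 used above: creating it boils down to something like this (a sketch -- device names /dev/sda1 and /dev/sdb1 are taken from the btrfs command above, chunk size left at mdadm's default, mount point as used later in this thread):

apt install mdadm
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
mkfs.ext4 /dev/md0     # this is the step that powered the board down without coherent_pool=2M
mkdir -p /mnt/MDRAID0
mount /dev/md0 /mnt/MDRAID0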
tkaiser Posted February 15, 2018 Author Posted February 15, 2018

SATA performance

As already said, RK3399 is not SATA capable, so in reality we're talking here about RK3399 PCIe performance and the performance of the SATA controller Hardkernel chose (ASM1061). I've destroyed the RAID-0 array from before, attached the EVO750 to SATA port 1 and the EVO840 to SATA port 2 (both externally powered), so let's test (same settings as before: IRQ affinity and sending iozone to cpu5):

EVO750 connected to SATA port 1 (ata1.00):
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     7483     8366     8990     8997      5985      8320
102400       4    26895    31233    33467    33536     22688     31074
102400      16    87658    98748   103510   103772     75473     98533
102400     512   319330   320934   309735   311915    283113    322654
102400    1024   332979   338408   321312   321328    306621    336457
102400   16384   343053   346736   325660   327009    318830    341269

EVO840 connected to SATA port 2 (ata2.00):
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     7282     8225     9004     8639      5540      7857
102400       4    25295    29532    31754    32422     22069     30526
102400      16    85907    97049   102244   102615     77170     96130
102400     512   308776   312344   305041   308835    299016    306654
102400    1024   326341   327747   316543   321559    315103    321031
102400   16384   365294   378264   385631   391119    390479    293734

If we compare with the USB3 numbers above we clearly see one of the many 'benchmarking gone wrong' occurrences. How on earth is the EVO750 connected via USB3 faster than when accessed through SATA (look at the sequential performance with 512K, 1M and 16M blocksizes: with USB3 we exceeded 380 MB/s read and are now stuck at ~325 MB/s -- that's impossible?!). The reason is pretty simple: after I destroyed the RAID0 I recreated the filesystems on both SSDs and mkfs.ext4 took ages. Looking at dmesg shows the problem:

[  874.771379] ata1.00: NCQ disabled due to excessive errors

Both SSDs got initialized with NCQ (native command queueing) and a maximum queue depth of 31:

[    2.498063] ata1.00: ATA-9: Samsung SSD 750 EVO 120GB, MAT01B6Q, max UDMA/133
[    2.498070] ata1.00: 234441648 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[    2.964660] ata2.00: ATA-9: Samsung SSD 840 EVO 120GB, EXT0BB0Q, max UDMA/133
[    2.964666] ata2.00: 234441648 sectors, multi 16: LBA48 NCQ (depth 31/32), AA

But then there were transmission errors and the kernel decided to give up on NCQ, which is responsible for trashing SATA performance.
When I attached the SATA cables to N1 I already expected trouble (one of the two connections felt somewhat 'loose') so looking into dmesg output was mandatory: http://ix.io/Kzf

Ok, shutting down the board and exchanging the SSDs so that now the EVO840 is on port 1 and the EVO750 on port 2:

EVO750 connected to SATA port 2 (ata2.00):
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     7479     8257     8996     8997      5972      8305
102400       4    26859    31206    33540    33580     22719     31026
102400      16    87690    98865   103442   103715     75507     98374
102400     512   319251   323358   308725   311769    283398    320156
102400    1024   333172   338362   318633   322155    304734    332370
102400   16384   379016   386131   387834   391267    389064    387225

EVO840 connected to SATA port 1 (ata1.00):
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     7350     8238     8921     8925      5627      8167
102400       4    26169    30599    33183    33313     22879     30418
102400      16    85579    96564   102667   100994     76254     95562
102400     512   312950   312802   309188   311725    303605    314411
102400    1024   325669   324499   319510   321793    316649    324817
102400   16384   373322   372417   385662   390987    390181    372922

Now performance is as expected (and with an ASM1061 you can't expect more -- 390 MB/s sequential transfer speed can be considered really great). But still... both SSDs seem to perform identically, which is just weird since the EVO840 is the much faster one. So let's have a look at a native SATA implementation on another ARM board: the Clearfog Pro. With the same EVO840, partially crappy settings (and not testing 1K block size) it looks like this -- random IO of course way better compared to the ASM1061:

Clearfog Pro with EVO840 connected to a native SATA port of the ARMADA 385:
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       4    69959   104711   113108   113920     40591     76737
102400      16   166789   174407   172029   215341    123020    159731
102400     512   286833   344871   353944   304479    263423    269149
102400    1024   267743   269565   286443   361535    353766    351175
102400   16384   347347   327456   353394   389994    425475    379687

(you find all details here. On a side note: the Clearfog Pro can be configured to provide 3 native SATA ports and Solid-Run engineers tested with 3 fast SATA SSDs in parallel and were able to exceed 1,500 MB/s in total. That was in early 2016)

So now that we have both SSDs running with NCQ and maximum queue depth let's try RAID0 again:

                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     7082     7595     8545     8552      5593      7884
102400       4    25434    29603    31858    31831     21195     29381
102400      16    83270    93265    97376    97138     70859     93365
102400     512   303983   297795   300294   286355    277441    301486
102400    1024   330594   320820   316379   313175    314558    332272
102400   16384   367334   367674   351361   366017    364117    351142

Nope, performance sucks. And the reason is the same. New dmesg output reveals that SATA port 1 still has a problem, so now the EVO840 runs without NCQ and performance has to drop: http://ix.io/KA6

Carefully exchanging cables and checking contacts, then another run with the SATA RAID0:

                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     7363     7990     8897     8901      6113      8176
102400       4    26369    30720    33251    33310     23606     30484
102400      16    85555    97111   102577   102953     78091     96233
102400     512   306039   316729   309768   311106    294009    316353
102400    1024   329348   339153   335685   333575    342699    346854
102400   16384   382487   384749   385321   389949    390039    384479

Now everything is fine since we again reach 390 MB/s. If we look closer at the numbers we see that RAID0 with fast SSDs is just a waste of resources since the ASM1061 is the real bottleneck here.
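If you run into the same symptoms, checking whether NCQ survived is quick (a sketch; drive letters assumed):

dmesg | grep -i ncq                       # look for 'NCQ disabled due to excessive errors'
cat /sys/block/sda/device/queue_depth     # 31 with NCQ active, drops to 1 once the kernel gave up on it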
There exists an almost twice as expensive variant called ASM1062 which can make use of 2 PCIe lanes and shows overall better performance. But whether this would really result in higher storage performance is a different question since it could happen that a PCIe device attached with 2 lanes instead of one brings the link speed down to Gen1 (so zero performance gain) or that there exists an internal SoC bandwidth limitation. Since we can't test for this with the ODROID N1 samples right now we need to do more tests with other RK3399 devices.

In the meantime I created one RAID0 out of 4 SSDs (as can be seen in the picture above -- 2 x USB3, 2 x SATA) and repeated the iozone test:

                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       4    25565    29387    33952    33814     19793     28524
102400      16    82857    94170   101870   101376     63274     92038
102400     512   283743   292047   292733   293601    275781    270178
102400    1024   312713   312202   311117   311408    275342    320691
102400   16384   469131   458924   616917   652571    619976    454828

We can clearly see that RAID0 is working (see the increased numbers with small blocksizes) but obviously there's an overall bandwidth limitation. As already said, the SSDs I test with are cheap and crappy, so the write limitation is caused by my SSDs while the read limitation seems to be some sort of bandwidth bottleneck on the board or SoC (or kernel/drivers or the current settings used!).

I repeated the test with a new RAID0 made out of the two fastest SSDs, one connected via USB3, the other via SATA, and now with PCIe power management set to performance (search for /sys/module/pcie_aspm/parameters/policy below):

                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       4    33296    40390    50845    51146     31154     39931
102400      16   105127   120863   139497   140849     97505    120296
102400     512   315177   319535   302748   308408    294243    317566
102400    1024   529760   569271   561234   570950    546556    555642
102400   16384   688061   708164   736293   754982    753050    711708

When testing with sequential transfers only, large block sizes and 500 MB test size we get 740/755 MB/s write/read. Given there is something like a 'per port group' bandwidth limitation this is as expected, but as already said: this is just a quick try to search for potential bottlenecks and it's way too early to draw any conclusions now. We need a lot more time to look into the details.

On the bright side: the above numbers are a confirmation that certain use cases like 'NAS box with 4 HDDs' will not be a problem at all (as long as users are willing and able to accept that USB3 SATA with a good and UAS capable SATA bridge is not worse compared to PCIe attached SATA here). HDDs all show crappy random IO performance so all that counts is sequential IO, and the current bandwidth limitations of ~400 MB/s for both USB3 ports as well as both SATA ports are perfectly fine. People who want to benefit from ultra fast SSD storage might better look somewhere else.
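For reference, switching the PCIe ASPM policy mentioned above is just a runtime sysfs write (a sketch; the consumption/performance trade-off is quantified in the next post):

cat /sys/module/pcie_aspm/parameters/policy          # the currently active policy is shown in brackets
echo performance > /sys/module/pcie_aspm/parameters/policy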
tkaiser Posted February 15, 2018 Author Posted February 15, 2018

More storage performance: eMMC and SD cards

The N1 has not only 2 SATA ports but also the usual SD card slot and the usual eMMC socket known from other ODROID boards. Hardkernel sells some of the best eMMC modules you can get for this connector and they usually also take care that SD cards can enter higher speed modes. This usually requires switching between 3.3V and 1.8V but at least the released schematics for this (early!) board revision do not mention 1.8V here.

Hardkernel shipped the dev sample with their new Samsung based orange eMMC (16 GB) but since this is severely limited wrt sequential write performance (as usual, flash memory modules with low capacity always suffer from this problem) we use the 64GB module to show the performance. Since the use case I'm interested in is 'rootfs' or 'OS drive', sequential performance is more or less irrelevant and all that really matters is random IO performance (especially writes at small block sizes). Test setup as before with the iozone task sent to cpu5:

Orange 64GB eMMC (Samsung):
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     2069     1966     8689     8623      7316      2489
102400       4    32464    36340    30699    30474     27776     31799
102400      16    94637   100995    89970    90294     83993     96937
102400     512   147091   151657   278646   278126    269186    146851
102400    1024   143085   148288   287749   291479    275359    143229
102400   16384   147880   149969   306523   306023    307040    147470

If we compare random IOPS at 4K and 16K block size it looks as follows (IOPS -- IO operations per second -- means we need to divide the KB/s numbers above by the block size!). The numbers below are not KB/s but IOPS:

                        4K read   4K write   16K read   16K write
JMS567:                    6000       6300       4925        5500
ASM1061 powersave:         5700       7600       4750        6000
16GB eMMC:                 7250       7100       5025        2950
32/64/128GB eMMC:          7450       7350       5200        5700
ASM1061 performance:       9200      15050       6625        9825

(Not so) surprisingly Hardkernel's eMMC modules are faster than an SSD with default settings (and we're talking about ok-ish consumer SSDs and not cheap crap). Some important notes:

- 'JMS567' is the USB3-to-SATA chipset used for my tests. The above is not a 'USB3 number' but one made with a great JMicron chipset and UAS active (UAS == USB Attached SCSI, the basic requirement to get storage performance with USB that does not totally suck). If you don't take care about the chipset you use, your USB3 storage performance can be magnitudes lower.
- 'ASM1061' is not a synonym for 'native SATA', it's just PCIe attached SATA and most probably one of the slowest implementations available. There are two numbers above since PCIe power management settings have an influence on both consumption and performance. When /sys/module/pcie_aspm/parameters/policy is set to performance instead of powersave, idle consumption increases by around 250mW but performance also improves a lot with small block sizes.

As a reference, here are iozone numbers for all orange Samsung based eMMC modules tested on N1 (Hardkernel sent the numbers on request): https://pastebin.com/ePUCXyg6 (as can be seen, the 16 GB module already performs great but for full performance better choose one of the larger modules)

So what about SD cards?
Update: Hardkernel forgot to include an UHS patch in the kernel they provided with the developer samples, so once this is fixed the performance bottleneck with SD cards reported below should be gone: https://forum.odroid.com/viewtopic.php?f=153&t=30193#p215915

Update 2: already fixed with those 3 simple lines in device-tree configuration (therefore the numbers below serve only as 'historical reference' of what happens with the slowest SD card speed mode -- for current performance with SDR104 mode see here and there)

As Armbian user you already know that 'SD card' is not a performance class but just a form factor and an interface specification. There is all the counterfeit crap, there exist 'reputable brands' that produce SD cards that are slow as hell when it comes to random IO, and there are good performers that show even 100 times better random IO performance than eg. an average Kingston or PNY card: https://forum.armbian.com/topic/954-sd-card-performance/

Unfortunately in the past 'random IO' was not part of the SD Association's speed classes but this changed last year. In the meantime there's the 'A1 speed class' which specifies minimum random IO performance and now these cards even exist. I tried to buy a SanDisk Extreme Plus A1 but was too stupid and ordered a SanDisk Extreme A1 instead (without the 'Plus' which means extra performance and especially extra reliability). But since I saved a few bucks by accident and there was a 'SanDisk Ultra A1' offer... I bought two A1 cards today:

Fresh SanDisk Extreme A1 32GB SD card:
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1      998      716     4001     3997      3049       740
102400       4     3383     3455    10413    10435      9631      4156
102400      16     8560     8607    17149    17159     17089     11949
102400     512    21199    21399    22447    22457     22464     20571
102400    1024    22075    22168    22912    22922     22919     21742
102400   16384    22415    22417    23357    23372     23372     22460

Fresh SanDisk Ultra A1 32GB SD card:
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1      683      718     3466     3467      2966       449
102400       4     2788     3918     9821     9805      8763      2713
102400      16     4212     7950    16577    16627     15765      7121
102400     512    10069    14514    22301    22346     22253     13652
102400    1024    14259    14489    22851    22892     22868     13664
102400   16384    15254    14597    23262    23342     23340     14312

Slightly used SanDisk Extreme Plus (NO A1!) 16GB SD card:
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1      614      679     3245     3245      2898       561
102400       4     2225     2889     9367     9360      7820      2765
102400      16     8202     8523    16836    16806     16807      7507
102400     512    20545    21797    22429    22465     22485     21857
102400    1024    22352    22302    22903    22928     22918     22125
102400   16384    22756    22748    23292    23323     23325     22691

Oh well, performance is limited to the slowest SD card mode possible (4 bit, 50 MHz --> ~23 MB/s max) which also affects random IO performance slightly (small blocksizes) to severely (large blocksizes). At least the N1 dev samples have a problem here. No idea whether this is a hardware limitation (no switching to 1.8V?) or just a settings problem. But I really hope Hardkernel addresses this since in the past I always enjoyed great performance with SD cards on the ODROIDs (due to Hardkernel being one of the few board makers taking care of such details)
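In case someone wants to verify which speed mode a card actually negotiated (useful for checking the UHS fix mentioned in the updates above), the mmc debugfs interface shows it -- a sketch, the host index (mmc0/mmc1) depends on the board and kernel:

mount -t debugfs none /sys/kernel/debug 2>/dev/null   # usually already mounted
cat /sys/kernel/debug/mmc*/ios                        # 'timing spec' line shows e.g. 'sd high-speed' vs. 'sd uhs SDR104'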
tkaiser Posted February 15, 2018 Author Posted February 15, 2018

BTW: Since checking out a new board without some kind of monitoring is just stupid... here's what it takes to get armbianmonitor to run with Hardkernel's Stretch (or Ubuntu later -- the needed RK3399 tweaks have been added long ago):

mkdir -p /etc/armbianmonitor/datasources
cd /etc/armbianmonitor/datasources
ln -s /sys/devices/virtual/thermal/thermal_zone0/temp soctemp
wget https://raw.githubusercontent.com/armbian/build/master/packages/bsp/common/usr/bin/armbianmonitor
mv armbianmonitor /usr/local/sbin/
chmod 755 /usr/local/sbin/armbianmonitor

Then it's just calling 'sudo armbianmonitor -m' to get a clue what's going on (throttling, big.LITTLE stuff, %iowait... everything included):

root@odroid:/home/odroid# armbianmonitor -m
Stop monitoring using [ctrl]-[c]
Time       big.LITTLE    load %cpu %sys %usr %nice %io %irq   CPU  C.St.
23:14:25:  408/1200MHz  0.38   6%   2%   3%   0%   0%   0%  43.9°C  0/3
23:14:30:  408/ 408MHz  0.51   1%   0%   0%   0%   0%   0%  43.9°C  0/3
23:14:35:  600/ 408MHz  0.55   2%   0%   1%   0%   0%   0%  44.4°C  0/3
23:14:41:  408/ 408MHz  0.51   0%   0%   0%   0%   0%   0%  46.9°C  1/3
23:14:46: 1992/ 816MHz  0.63  33%   0%  33%   0%   0%   0%  52.8°C  1/3
23:14:51:  408/ 408MHz  0.74  16%   0%  16%   0%   0%   0%  42.8°C  0/3
23:14:56: 1992/ 600MHz  0.68   5%   4%   0%   0%   0%   0%  44.4°C  0/3
23:15:01:  600/1008MHz  0.86  45%   8%   0%   0%  36%   0%  42.8°C  0/3
23:15:07:  408/ 408MHz  0.95  19%   2%   0%   0%  16%   0%  42.8°C  0/3
23:15:12:  408/ 600MHz  1.04  23%   2%   0%   0%  20%   0%  43.3°C  0/3
23:15:17: 1200/ 600MHz  1.12  18%   4%   0%   0%  14%   0%  43.9°C  0/3
23:15:22: 1992/1512MHz  1.03  51%  18%  23%   0%   8%   0%  52.8°C  1/3
23:15:27: 1992/1512MHz  1.42  88%  20%  34%   0%  32%   0%  51.1°C  1/3
23:15:32: 1992/1512MHz  1.79  72%  16%  34%   0%  20%   0%  51.7°C  1/3
Time       big.LITTLE    load %cpu %sys %usr %nice %io %irq   CPU  C.St.
23:15:37: 1992/1512MHz  2.05  77%  16%  34%   0%  26%   0%  50.0°C  1/3
23:15:42: 1992/1512MHz  2.29  79%  21%  34%   0%  23%   0%  50.0°C  1/3
23:15:47: 1992/1512MHz  2.42  85%  24%  34%   0%  26%   0%  48.8°C  1/3
23:15:52:  408/ 408MHz  2.71  50%   8%  11%   0%  29%   0%  40.6°C  0/3
23:15:57:  408/ 816MHz  2.65  33%   2%   0%   0%  30%   0%  40.6°C  0/3
23:16:03: 1008/ 600MHz  2.60  18%   4%   0%   0%  14%   0%  40.6°C  0/3
23:16:08:  408/ 408MHz  2.79   3%   0%   0%   0%   2%   0%  40.6°C  0/3^C
root@odroid:/home/odroid#
tkaiser Posted February 15, 2018 Author Posted February 15, 2018

Gigabit Ethernet performance

RK3399 has an internal GbE MAC implementation combined with an external RTL8211 GbE PHY. I did only some quick tests which were well above 900 Mbits/sec, but since moving IRQs to one of the A72 cores didn't improve the scores it's either my current networking setup (ODROID N1 connected directly to an older GbE switch I don't trust that much any more) or necessary TX/RX delay adjustments that hold it back. Anyway: the whole process is well known and documented so it's time for someone else to look into it. With RK SoCs it's pretty easy to test for this with DT overlays: https://github.com/ayufan-rock64/linux-build/blob/master/recipes/gmac-delays-test/range-test

And the final result might be some slight DT modifications that allow for 940 Mbits/sec in both directions with as little CPU utilization as possible. Example for RK3328/ROCK64: https://github.com/ayufan-rock64/linux-kernel/commit/2047dd881db53c15a952b1755285e817985fd556

Since RK3399 uses the same Synopsys DesignWare Ethernet implementation as currently almost every other GbE capable ARM SoC around, and since we get maximum throughput on RK3328 with adjusted settings... I'm pretty confident that this will be the same on RK3399.
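For anyone wanting to reproduce the quick throughput check in both directions, iperf3 does the job roughly like this (a sketch; 192.168.1.100 is a placeholder for whatever machine runs the server side on the other end of the GbE link):

# on the other GbE machine: iperf3 -s
iperf3 -c 192.168.1.100        # N1 transmits
iperf3 -c 192.168.1.100 -R     # N1 receives (reverse mode)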
TonyMac32 Posted February 16, 2018 Posted February 16, 2018

4 hours ago, tkaiser said: This usually requires switching between 3.3V and 1.8V but at least released schematics for this (early!) board revision do not mention 1.8V here.

Yeah, oddly it looks like 3V0, which I believe falls within the required range. That said, I'm scratching my brain to remember whether VDD stays constant at 3.3 volts and only the signalling voltage changes (SD_CLK, SD_CMD, DATA_0..3). In that case it would be switched at the SoC.
tkaiser Posted February 16, 2018 Author Posted February 16, 2018

AES crypto performance, checking for bogus clockspeeds, thermal thresholds

As Armbian user you already might know that almost all currently available 64-bit ARM SoCs licensed ARM's ARMv8 crypto extensions and that AES performance, especially with small data chunks (think about VPN encryption), is something where A72 cores shine: https://forum.armbian.com/topic/4583-rock64/?do=findComment&comment=37829 (the only two exceptions are Raspberry Pi 3 and ODROID-C2 where the SoC makers 'forgot' to license the ARMv8 crypto extensions)

Let's have a look at ODROID N1 and A53@1.5GHz vs. A72@2GHz. I use the usual openssl benchmark that runs in a single thread, once pinned to cpu1 (little core) and another time pinned to cpu5 (big core):

for i in 128 192 256 ; do taskset -c 1 openssl speed -elapsed -evp aes-${i}-cbc 2>/dev/null; done | grep cbc
for i in 128 192 256 ; do taskset -c 5 openssl speed -elapsed -evp aes-${i}-cbc 2>/dev/null; done | grep cbc

As usual, monitoring happened in another shell, and when testing on the A72 I not only got a huge result variation but armbianmonitor also reported 'cooling state' reaching 1 already -- see the last column 'C.St.' (nope, that's the PWM fan, see a few posts below):

Time       big.LITTLE    load %cpu %sys %usr %nice %io %irq   CPU  C.St.
06:00:44: 1992/1512MHz  0.46  16%   0%  16%   0%   0%   0%  51.1°C  1/3

So I added a huge and silent USB powered 5V fan to the setup blowing air over the board at a 45° angle to improve heat dissipation a bit (I hate those small and inefficient fansinks like the one on the XU4 and now on the N1 sample) and tried again. This time cooling state remained at 0, the internal fan did not start, and we had no result variation any more (standard deviation low enough between multiple runs):

Time       big.LITTLE    load %cpu %sys %usr %nice %io %irq   CPU  C.St.
06:07:03: 1992/1512MHz  0.46   0%   0%   0%   0%   0%   0%  30.0°C  0/3
06:07:08: 1992/1512MHz  0.42   0%   0%   0%   0%   0%   0%  30.0°C  0/3
06:07:13: 1992/1512MHz  0.39   0%   0%   0%   0%   0%   0%  30.0°C  0/3
06:07:18: 1992/1512MHz  0.36   0%   0%   0%   0%   0%   0%  30.0°C  0/3
06:07:23: 1992/1512MHz  0.33   0%   0%   0%   0%   0%   0%  30.0°C  0/3
06:07:28: 1992/1512MHz  0.38  12%   0%  12%   0%   0%   0%  32.2°C  0/3
06:07:33: 1992/1512MHz  0.43  16%   0%  16%   0%   0%   0%  32.2°C  0/3
06:07:38: 1992/1512MHz  0.48  16%   0%  16%   0%   0%   0%  32.8°C  0/3
06:07:43: 1992/1512MHz  0.52  16%   0%  16%   0%   0%   0%  33.9°C  0/3
06:07:48: 1992/1512MHz  0.56  16%   0%  16%   0%   0%   0%  33.9°C  0/3
06:07:53: 1992/1512MHz  0.60  16%   0%  16%   0%   0%   0%  33.9°C  0/3
06:07:58: 1992/1512MHz  0.63  16%   0%  16%   0%   0%   0%  34.4°C  0/3
06:08:04: 1992/1512MHz  0.66  16%   0%  16%   0%   0%   0%  34.4°C  0/3
06:08:09: 1992/1512MHz  0.69  16%   0%  16%   0%   0%   0%  34.4°C  0/3
06:08:14: 1992/1512MHz  0.71  16%   0%  16%   0%   0%   0%  35.0°C  0/3

So these are the single threaded PRELIMINARY openssl results for ODROID N1, differentiating between A53 and A72 cores:

A53
                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     103354.37k   326225.96k   683938.47k   979512.32k  1119100.93k
aes-192-cbc      98776.57k   293354.45k   565838.51k   760103.94k   843434.67k
aes-256-cbc      96389.62k   273205.14k   495712.34k   638675.29k   696685.91k

A72
                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     377879.56k   864100.25k  1267985.24k  1412154.03k  1489756.16k
aes-192-cbc     317481.96k   779417.49k  1045567.57k  1240775.00k  1306637.65k
aes-256-cbc     270982.47k   663337.94k   963150.93k  1062750.21k  1122691.75k

The numbers look somewhat nice but need further investigation:

- When we compared with other A53 and especially A72 SoCs a while ago (especially the A72 numbers made on a RK3399 TV box only clocking at 1.8 GHz) the A72 scores above seem too low with all test sizes (see the numbers here with AES-128 on a H96-Pro)
- Cooling state 1 is entered pretty early (when zone0 already exceeds 50°C) -- this needs further investigation. And further benchmarking, especially with multiple threads in parallel, is useless until this is resolved/understood

So let's check with Willy Tarreau's 'mhz' tool whether the reported CPU clockspeeds are bogus (I'm still using the performance cpufreq governor so it should run with 2 and 1.5 GHz on the A72 and A53 cores):

root@odroid:/home/odroid/mhz# taskset -c 1 ./mhz
count=645643 us50=21495 us250=107479 diff=85984 cpu_MHz=1501.775
root@odroid:/home/odroid/mhz# taskset -c 5 ./mhz
count=807053 us50=20330 us250=101641 diff=81311 cpu_MHz=1985.102

All fine, so we need to have a look at memory bandwidth. Here are tinymembench numbers pinned to an A53 and here with an A72. As a reference some numbers made with other RK3399 devices a few days ago on request: https://irclog.whitequark.org/linux-rockchip/2018-02-12#21298744

One interesting observation is throttling behaviour in a special SoC engine affecting crypto. When cooling state 1 was reached the cpufreq still remained at 2 and 1.5 GHz respectively but AES performance dropped a lot. So the ARMv8 crypto engine is part of the BSP 4.4 kernel's throttling strategies and performance in such a case does not scale linearly with the reported cpufreq. In other words: for the next round of tests the thermal thresholds defined in DT should be lifted a lot.

Edit: Wrong assumption wrt openssl numbers on A72 cores -- see next post
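Side note: the trip points the BSP kernel currently uses can be read from sysfs without decompiling the DT (a sketch; the number of zones and trips differs between kernels):

for z in /sys/class/thermal/thermal_zone*; do
    echo "== $z ($(cat $z/type))"
    grep . $z/trip_point_*_temp $z/trip_point_*_type
done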
tkaiser Posted February 16, 2018 Author Posted February 16, 2018

Openssl and thermal update

I've been wrong before wrt 'cooling state 1' -- the result variation must have had a different reason. I decided to test with AES encryption running on all 6 CPU cores in parallel using a simple script testing only with AES-256:

root@odroid:/home/odroid# cat check-aes.sh
#!/bin/bash
while true; do
	for i in 0 1 2 3 4 5 ; do
		taskset -c ${i} openssl speed -elapsed -evp aes-256-cbc 2>/dev/null &
	done
	wait
done

Results as follows: https://pastebin.com/fHzJ5tJF (please note that the cpufreq governor was set to performance and how especially the A72 scores were lower in the beginning only to improve over time: with 16 byte it was 309981.41k in the beginning and later 343045.14k and even slightly more)

Here the armbianmonitor output: https://pastebin.com/1hsmk63i (at '07:07:28' I stopped the huge 5V fan and the small fansink can cope with this load, though cooling state 2 is sometimes reached when the SoC temperature exceeds 55°C).

So for whatever reason we still have a somewhat large result variation with this single benchmark which needs further investigation (especially whether benchmark behaviour relates to real-world use cases like VPN and full disk encryption)
zador.blood.stained Posted February 16, 2018 Posted February 16, 2018 Regarding cooling states - since this is a HMP device with a PWM fan according to DT it should have 3 cooling devices - big cluster throttling, little cluster throttling and the fan. armbianmonitor currently doesn't deal with this situation - it reads the cooling device 0 state each time. And since I see only 3 available cooling states in your output most likely it's the fan. You can check /sys/devices/virtual/thermal/ to confirm that multiple cooling devices are used.
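A one-liner along these lines lists them all (a sketch using the sysfs paths mentioned above):

for c in /sys/devices/virtual/thermal/cooling_device*; do
    echo "$c: $(cat $c/type) cur_state=$(cat $c/cur_state) max_state=$(cat $c/max_state)"
done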
tkaiser Posted February 16, 2018 Author Posted February 16, 2018

4 minutes ago, zador.blood.stained said: armbianmonitor currently doesn't deal with this situation - it reads the cooling device 0 state each time. And since I see only 3 available cooling states in your output most likely it's the fan.

Correct:

root@odroid:/home/odroid# cat /sys/devices/virtual/thermal/cooling_device0/type
pwm-fan

So everything I've written above about cooling states is BS since it just shows the fansink starting to work. One should simply stop thinking while benchmarking: collect numbers like a robot, check later whether the data makes sense, throw numbers away and test again and again and again.

Fortunately I already figured out that the result variation with openssl on the A72 cores has a different reason. But whether these benchmark numbers tell us something is questionable. It would need some real-world tests with VPN and full disk encryption, then trying to pin the tasks to a little or a big core to get an idea of what's really going on and whether the numbers generated with a synthetic benchmark have any meaning for real tasks.
zador.blood.stained Posted February 16, 2018 Posted February 16, 2018 "cryptsetup benchmark" numbers may be interesting, but they also heavily depend on the cryptography related kernel configuration options, so these numbers should be accompanied by /proc/crypto contents and lsmod output after the test.
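Collecting all of that in one go could look like this (a sketch; ix.io used as pastebin like elsewhere in this thread):

cryptsetup benchmark
cat /proc/crypto | curl -F 'f:1=<-' http://ix.io
lsmod | curl -F 'f:1=<-' http://ix.io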
zador.blood.stained Posted February 16, 2018 Posted February 16, 2018

Also interesting - I see the "Dynamic Memory Controller" in the DT which has its own set of operating points and this table:

system-status-freq = <
	/*system status		freq(KHz)*/
	SYS_STATUS_NORMAL	800000
	SYS_STATUS_REBOOT	528000
	SYS_STATUS_SUSPEND	200000
	SYS_STATUS_VIDEO_1080P	300000
	SYS_STATUS_VIDEO_4K	600000
	SYS_STATUS_VIDEO_4K_10B	800000
	SYS_STATUS_PERFORMANCE	800000
	SYS_STATUS_BOOST	400000
	SYS_STATUS_DUALVIEW	600000
	SYS_STATUS_ISP		600000
>;

A quick Google search points to this page, so this method should be tested to monitor the frequency:

cat /sys/kernel/debug/clk/clk_summary | grep dpll_ddr

assuming the kernel was compiled with DDR devfreq support.

Edit: though I see status = "disabled"; in the dmc node so it may not be operational yet.
tkaiser Posted February 16, 2018 Author Posted February 16, 2018

IO scheduler influence on SATA performance

I tried to add the usual Armbian tweaks to Hardkernel's Debian Stretch image but something went wrong (we usually set cfq for HDDs and noop for flash media from /etc/init.d/armhwinfo -- I simply forgot to load the script so it never got executed at boot):

root@odroid:/home/odroid# cat /sys/block/sd*/queue/scheduler
noop deadline [cfq]
noop deadline [cfq]

So let's use the mdraid0 made of EVO840 and EVO750 (to ensure parallel disk accesses) with an ext4 on top and check for NCQ issues first:

root@odroid:/mnt/MDRAID0# dmesg | grep -i ncq
[    2.007269] ahci 0000:01:00.0: flags: 64bit ncq sntf stag led clo pmp pio slum part ccc sxs
[    2.536884] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
[    2.536897] ata1.00: 234441648 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[    2.537652] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
[    3.011571] ata2.00: 234441648 sectors, multi 1: LBA48 NCQ (depth 31/32), AA

No issues, we can use NCQ with maximum queue depth, so let's test through the three available schedulers with the performance cpufreq governor to avoid being influenced by cpufreq scaling behaviour:

cfq
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     7320     7911     8657     8695      5954      8106
102400       4    25883    30470    33159    33169     23205     30464
102400      16    85609    96712   101527   102396     77224     96583
102400     512   311645   312376   301644   303945    289410    308194
102400    1024   345891   338773   329284   330738    329926    332866
102400   16384   382101   379907   383779   387747    386901    383664

deadline
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     6963     8307     8211     8402      5772      8483
102400       4    24701    30999    34728    34653     23160     31728
102400      16    87390    98898   105589    97539     78259     97638
102400     512   306420   304645   298131   302033    286582    303119
102400    1024   345178   345458   329122   333318    329688    340144
102400   16384   381596   374789   383850   387551    386428    381956

noop
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     6995     8589     9340     8498      5763      8246
102400       4    26011    31307    30267    32635     21445     30859
102400      16    88185   100135    97252   105090     79601     91052
102400     512   307553   312609   304311   307922    291425    308387
102400    1024   344472   340192   322881   333104    332405    333082
102400   16384   372224   373183   380530   386994    386273    379506

Well, this looks like result variation, but of course someone interested in this could do a real benchmark testing each scheduler at least 30 times and then generating average values. In the past, on slower ARM boards with horribly bottlenecked IO capabilities (think about those USB2-only boards that cannot even use USB Attached SCSI due to lacking kernel/driver support), we've seen severe performance impact based on the IO scheduler used, but in this situation it seems negligible.

If someone takes the time to benchmark through this it would be interesting to repeat the tests also with the ondemand governor, io_is_busy set to 1 of course, and then playing around with different values for up_threshold and sampling_down_factor since, if cpufreq scaling behaviour starts to vary based on the IO scheduler used, performance differences can be massive.
I just did a quick check how performance with the ondemand cpufreq governor and the iozone benchmark varies between Stretch / Hardkernel defaults and our usual tweaks: https://github.com/armbian/build/blob/751aa7194f77eabcb41b19b8d19f17f6ea23272a/packages/bsp/common/etc/init.d/armhwinfo#L82-L94

Makes quite a difference, but again the IO scheduler chosen still doesn't matter that much (adjusting io_is_busy, up_threshold and sampling_down_factor does):

cfq defaults
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     5965     6656     7197     7173      5107      6586
102400       4    20864    24899    27205    27214     19421     24595
102400      16    68376    79415    85409    85930     66138     77598
102400     512   274000   268473   267356   269046    247424    272822
102400    1024   310992   314672   299571   299065    298518    315823
102400   16384   366152   376293   375176   379202    379123    370254

cfq with Armbian settings
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     7145     7871     8600     8591      5996      7973
102400       4    25817    29773    32174    32385     23021     29627
102400      16    83848    94665    98502    98857     75576     93879
102400     512   303710   314778   303135   309050    280823    300391
102400    1024   335067   332595   327539   332574    323887    329956
102400   16384   381987   373067   381911   386585    387089    381956

deadline defaults
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     6231     6872     7750     7746      5410      6804
102400       4    21792    25941    28752    28701     20262     25380
102400      16    70078    84209    88703    87375     69296     80708
102400     512   276422   276042   259416   271542    250835    271743
102400    1024   305166   321265   300374   296094    311020    323350
102400   16384   363016   373751   376570   377294    378730    377186

deadline with Armbian settings
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     7389     8018     9018     9047      6162      8233
102400       4    26526    30799    33487    33603     23712     30838
102400      16    85703    96066   105055   103831     77281     97086
102400     512   302688   297832   292569   288282    278384    294447
102400    1024   343165   340770   317211   320999    329411    330670
102400   16384   380267   375233   388286   390289    391849    375236

noop defaults
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     6301     6900     7766     7779      5350      6841
102400       4    21995    25884    28466    28540     20240     25664
102400      16    69547    81721    88044    88596     68043     81277
102400     512   281386   276749   262216   262762    255387    261948
102400    1024   300716   314233   288672   298921    310456    307875
102400   16384   376137   371625   376620   378136    379143    371308

noop with Armbian settings
                                              random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     7409     8026     9030     9033      6193      8259
102400       4    26562    30861    33494    33649     23676     30870
102400      16    85819    96956   102372   101982     77890     97341
102400     512   310007   303370   293432   297090    281048    301772
102400    1024   330968   352003   328052   318009    333682    337339
102400   16384   373958   375028   384865   386749    389401    376501

(but as already said: to get more insights each test has to be repeated at least 30 times and then average values need to be generated -- 'single shot' benchmarking is useless to generate meaningful numbers)
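The 'Armbian settings' above are roughly the following sysfs writes (a sketch -- the exact values live in the linked armhwinfo excerpt, the numbers below are illustrative; depending on the kernel the ondemand tunables may also sit per policy under policy0/policy4 instead of globally):

for g in /sys/devices/system/cpu/cpufreq/policy?/scaling_governor; do echo ondemand > $g; done
cd /sys/devices/system/cpu/cpufreq/ondemand
echo 1  > io_is_busy              # count time spent waiting on IO as busy time
echo 25 > up_threshold            # ramp the clock up earlier
echo 10 > sampling_down_factor    # stay at the higher clock longer before scaling back down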
tkaiser Posted February 16, 2018 Author Posted February 16, 2018 21 minutes ago, zador.blood.stained said: cat /sys/kernel/debug/clk/clk_summary | grep dpll_ddr root@odroid:/mnt/MDRAID0# grep dpll_ddr /sys/kernel/debug/clk/clk_summary root@odroid:/mnt/MDRAID0# cat /sys/kernel/debug/clk/clk_summary | curl -F 'f:1=<-' http://ix.io http://ix.io/KNS
tkaiser Posted February 16, 2018 Author Posted February 16, 2018

7-zip

Running on all 6 cores in parallel (no throttling occurred, I start to like the small fansink):

root@odroid:/tmp# 7zr b

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=C.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)

LE
CPU Freq:   401   400   401  1414  1985  1985  1985  1984  1985

RAM size:    3882 MB,  # CPU hardware threads:   6
RAM usage:   1323 MB,  # Benchmark threads:      6

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       4791   499    934   4661  |     100897   522   1647   8605
23:       4375   477    935   4458  |      98416   522   1631   8516
24:       4452   524    914   4787  |      95910   523   1610   8418
25:       4192   524    914   4787  |      92794   523   1579   8258
----------------------------------  | ------------------------------
Avr:             506    924   4673  |             523   1617   8449
Tot:             514   1270   6561

Now still 6 threads but pinned only to the little cores:

root@odroid:/tmp# taskset -c 0,1,2,3 7zr b

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=C.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)

LE
CPU Freq:  1492  1500  1499  1500  1499  1493  1498  1499  1499

RAM size:    3882 MB,  # CPU hardware threads:   6
RAM usage:   1323 MB,  # Benchmark threads:      6

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       2475   375    642   2408  |      64507   396   1387   5501
23:       2440   385    646   2487  |      60795   383   1374   5261
24:       2361   391    649   2539  |      58922   381   1359   5172
25:       2249   394    652   2568  |      58033   388   1332   5165
----------------------------------  | ------------------------------
Avr:             386    647   2501  |             387   1363   5275
Tot:             387   1005   3888

And now 6 threads but bound to the A72 cores:

root@odroid:/tmp# taskset -c 4,5 7zr b

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=C.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)

LE
CPU Freq:   400   401   498  1984  1985  1981  1985  1985  1985

RAM size:    3882 MB,  # CPU hardware threads:   6
RAM usage:   1323 MB,  # Benchmark threads:      6

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       2790   199   1364   2715  |      47828   200   2040   4079
23:       2630   199   1343   2680  |      46641   200   2020   4036
24:       2495   200   1344   2683  |      45505   200   1999   3994
25:       2366   200   1353   2702  |      43998   200   1959   3916
----------------------------------  | ------------------------------
Avr:             199   1351   2695  |             200   2005   4006
Tot:             200   1678   3350
tkaiser Posted February 16, 2018 Author Posted February 16, 2018

Cpuminer test (heavy NEON optimizations)

And another test:

sudo apt install automake autoconf pkg-config libcurl4-openssl-dev libjansson-dev libssl-dev libgmp-dev make g++
git clone https://github.com/tkinjo1985/cpuminer-multi.git
cd cpuminer-multi/
./build.sh
./cpuminer --benchmark

When running on all 6 cores this benchmark scores 'Total: 8.80 kH/s' without throttling. After killing the big cores (echo 0 >/sys/devices/system/cpu/cpu[45]/online) I get scores up to 'Total: 4.69 kH/s', which is the expected value since I got 3.9 kH/s on an overclocked A64 (also Cortex-A53, back then running at 1296 MHz). And when bringing back the big cores and killing the littles we're at around 'Total: 4.10 kH/s':

root@odroid:/usr/local/src/cpuminer-multi# echo 1 >/sys/devices/system/cpu/cpu5/online
root@odroid:/usr/local/src/cpuminer-multi# echo 1 >/sys/devices/system/cpu/cpu4/online
root@odroid:/usr/local/src/cpuminer-multi# echo 0 >/sys/devices/system/cpu/cpu3/online
root@odroid:/usr/local/src/cpuminer-multi# echo 0 >/sys/devices/system/cpu/cpu2/online
root@odroid:/usr/local/src/cpuminer-multi# echo 0 >/sys/devices/system/cpu/cpu1/online
root@odroid:/usr/local/src/cpuminer-multi# echo 0 >/sys/devices/system/cpu/cpu0/online
root@odroid:/usr/local/src/cpuminer-multi# ./cpuminer --benchmark
** cpuminer-multi 1.3.3 by tpruvot@github **
BTC donation address: 1FhDPLPpw18X4srecguG3MxJYe4a1JsZnd (tpruvot)

[2018-02-16 10:41:28] 6 miner threads started, using 'scrypt' algorithm.
[2018-02-16 10:41:29] CPU #0: 0.54 kH/s
[2018-02-16 10:41:29] CPU #5: 0.54 kH/s
[2018-02-16 10:41:30] CPU #2: 0.44 kH/s
[2018-02-16 10:41:30] CPU #3: 0.45 kH/s
[2018-02-16 10:41:30] CPU #1: 0.44 kH/s
[2018-02-16 10:41:30] CPU #4: 0.44 kH/s
[2018-02-16 10:41:32] Total: 3.90 kH/s
[2018-02-16 10:41:33] Total: 3.95 kH/s
[2018-02-16 10:41:37] CPU #4: 0.73 kH/s
[2018-02-16 10:41:37] CPU #3: 0.65 kH/s
[2018-02-16 10:41:38] CPU #1: 0.60 kH/s
[2018-02-16 10:41:38] CPU #2: 0.68 kH/s
[2018-02-16 10:41:38] CPU #0: 0.59 kH/s
[2018-02-16 10:41:38] CPU #5: 0.81 kH/s
[2018-02-16 10:41:38] Total: 4.01 kH/s
[2018-02-16 10:41:43] CPU #3: 0.66 kH/s
[2018-02-16 10:41:43] CPU #4: 0.71 kH/s
[2018-02-16 10:41:44] CPU #5: 0.73 kH/s
[2018-02-16 10:41:44] Total: 4.10 kH/s
[2018-02-16 10:41:47] CPU #0: 0.68 kH/s
[2018-02-16 10:41:48] CPU #2: 0.67 kH/s
[2018-02-16 10:41:48] Total: 4.08 kH/s
[2018-02-16 10:41:48] CPU #1: 0.68 kH/s
[2018-02-16 10:41:53] CPU #3: 0.68 kH/s
[2018-02-16 10:41:53] CPU #5: 0.72 kH/s
[2018-02-16 10:41:53] Total: 4.13 kH/s
[2018-02-16 10:41:53] CPU #4: 0.68 kH/s
[2018-02-16 10:41:54] CPU #1: 0.65 kH/s
[2018-02-16 10:41:54] CPU #0: 0.68 kH/s
[2018-02-16 10:41:58] Total: 4.05 kH/s
[2018-02-16 10:41:58] CPU #2: 0.65 kH/s
[2018-02-16 10:42:03] CPU #1: 0.64 kH/s
[2018-02-16 10:42:03] CPU #3: 0.66 kH/s
[2018-02-16 10:42:03] CPU #0: 0.65 kH/s
[2018-02-16 10:42:03] CPU #5: 0.73 kH/s
[2018-02-16 10:42:03] Total: 4.02 kH/s
[2018-02-16 10:42:03] CPU #4: 0.71 kH/s
^C[2018-02-16 10:42:05] SIGINT received, exiting

With ODROID-XU4/HC1/HC2 it looks like this: when forced to run on the little cores cpuminer gets 2.43 kH/s (no throttling occurring); running on the big cores it starts with 8.2 kH/s at 2.0 GHz but even with the fansink on the XU4 the cpufreq immediately drops down to 1.8 or even 1.6 GHz. At least that's what happens on my systems, maybe others have seen other behaviour.
Let's do a 'per core' comparison:

A15 @ 2.0 GHz: 2.35 khash/sec
A72 @ 2.0 GHz: 2.05 khash/sec
A7  @ 1.5 GHz: 0.61 khash/sec
A53 @ 1.5 GHz: 1.18 khash/sec

In other words: with such or similar workloads ('number crunching', NEON optimized stuff) an A15 core might be slightly faster than an A72 core (and since the Exynos has twice as many fast cores it performs better with such workloads), while there's a great improvement when looking at the little cores: an A53 performs almost twice as fast as an A7 at the same clockspeed, but this is due to this specific benchmark making heavy use of NEON instructions, and there switching to the 64-bit/ARMv8 ISA makes a huge difference.

Please also be aware that cpuminer is heavily dependent on memory bandwidth, so these cpuminer numbers are not a good representation of other workloads. This is just 'number cruncher' stuff where NEON can be used.
tkaiser Posted February 16, 2018 Author Posted February 16, 2018

Cryptsetup benchmark

2 hours ago, zador.blood.stained said: "cryptsetup benchmark" numbers may be interesting, but they also heavily depend on the cryptography related kernel configuration options, so these numbers should be accompanied by /proc/crypto contents and lsmod output after the test.

Here we go. Same numbers with all cores active or just the big ones:

# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1        669588 iterations per second for 256-bit key
PBKDF2-sha256     1315653 iterations per second for 256-bit key
PBKDF2-sha512      485451 iterations per second for 256-bit key
PBKDF2-ripemd160   365612 iterations per second for 256-bit key
PBKDF2-whirlpool   134847 iterations per second for 256-bit key
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   661.7 MiB/s   922.4 MiB/s
 serpent-cbc   128b           N/A           N/A
 twofish-cbc   128b    80.0 MiB/s    81.2 MiB/s
     aes-cbc   256b   567.6 MiB/s   826.9 MiB/s
 serpent-cbc   256b           N/A           N/A
 twofish-cbc   256b    79.6 MiB/s    81.1 MiB/s
     aes-xts   256b   736.3 MiB/s   741.3 MiB/s
 serpent-xts   256b           N/A           N/A
 twofish-xts   256b    83.7 MiB/s    82.5 MiB/s
     aes-xts   512b   683.7 MiB/s   686.0 MiB/s
 serpent-xts   512b           N/A           N/A
 twofish-xts   512b    83.7 MiB/s    82.5 MiB/s

When killing the big cores it looks like this (all the time running with the performance cpufreq governor):

# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1        332670 iterations per second for 256-bit key
PBKDF2-sha256      623410 iterations per second for 256-bit key
PBKDF2-sha512      253034 iterations per second for 256-bit key
PBKDF2-ripemd160   193607 iterations per second for 256-bit key
PBKDF2-whirlpool    85556 iterations per second for 256-bit key
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   369.9 MiB/s   449.0 MiB/s
 serpent-cbc   128b           N/A           N/A
 twofish-cbc   128b    33.5 MiB/s    35.1 MiB/s
     aes-cbc   256b   323.9 MiB/s   414.7 MiB/s
 serpent-cbc   256b           N/A           N/A
 twofish-cbc   256b    33.5 MiB/s    35.1 MiB/s
     aes-xts   256b   408.4 MiB/s   408.7 MiB/s
 serpent-xts   256b           N/A           N/A
 twofish-xts   256b    36.1 MiB/s    36.4 MiB/s
     aes-xts   512b   376.6 MiB/s   377.3 MiB/s
 serpent-xts   512b           N/A           N/A
 twofish-xts   512b    35.9 MiB/s    36.3 MiB/s

Other information as requested: https://pastebin.com/hMhKUStN
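If only a specific cipher matters (e.g. AES-XTS for LUKS full disk encryption), cryptsetup can benchmark just that one; combined with the core-offlining trick from the cpuminer post this allows a quick big vs. little comparison (a sketch):

# little cores only
echo 0 > /sys/devices/system/cpu/cpu4/online
echo 0 > /sys/devices/system/cpu/cpu5/online
cryptsetup benchmark -c aes-xts-plain64 -s 256
# bring the big cores back afterwards
echo 1 > /sys/devices/system/cpu/cpu4/online
echo 1 > /sys/devices/system/cpu/cpu5/online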
zador.blood.stained Posted February 16, 2018 Posted February 16, 2018

4 minutes ago, tkaiser said: Cryptsetup benchmark

Looks fast. Just to be sure, please grep for CRYPTO_USER in the config (in /boot or /proc/config.gz)
tkaiser Posted February 16, 2018 Author Posted February 16, 2018 6 minutes ago, zador.blood.stained said: please grep for CRYPTO_USER in the config (in /boot or /proc/config.gz) Impossible since neither exists https://github.com/hardkernel/linux/blob/ee38808d9fd0ea4e4db980c82ba717b09fb103ae/arch/arm64/configs/odroidn1_defconfig#L114
zador.blood.stained Posted February 16, 2018 Posted February 16, 2018

4 minutes ago, tkaiser said: https://github.com/hardkernel/linux/blob/ee38808d9fd0ea4e4db980c82ba717b09fb103ae/arch/arm64/configs/odroidn1_defconfig

# CONFIG_CRYPTO_USER is not set

Too bad. There may be room for improvement, both for cryptsetup and for a recent enough openssl with AF_ALG support.

Edit: CONFIG_CRYPTO_USER_API options are enabled, so I looked at the wrong option
Igor Posted February 16, 2018 Posted February 16, 2018

I just got my board and I'll try to make Armbian ASAP. Another piece of info - when running ./cpuminer --benchmark I get a 0.9A draw at the power source. Nothing else except the console is attached.
tkaiser Posted February 16, 2018 Author Posted February 16, 2018

5 minutes ago, Igor said: I get a 0.9A draw at the power source. Nothing else except console is attached.

Since ODROID N1 with current default settings has a pretty high 'ground' consumption (most probably related to both the ASM1061 and the DC-DC circuitry) we'd better talk about consumption differences. I get 3.2W at the wall in idle and 12.1W when running 'cpuminer --benchmark'. So that's 8.9W for '8.77 kH/s' or just about 1W per kH/s (12V PSU included!). Now let's try the same with ODROID XU4.

To get an idea how much the ASM1061 adds to idle consumption I would assume that we need to change CONFIG_PCIE_ROCKCHIP and friends from y to m? Or use DT overlays to disable the respective DT nodes?
zador.blood.stained Posted February 16, 2018 Posted February 16, 2018 Just now, tkaiser said: To get an idea how much the ASM1061 adds to idle consumption I would assume that we need to change CONFIG_PCIE_ROCKCHIP and friends from y to m? Or use DT overlays to disable the respective DT nodes? Or just recompile the DT with dtc and reboot since loading overlays needs either kernel or u-boot patches.
tkaiser Posted February 16, 2018 Author Posted February 16, 2018 2 minutes ago, zador.blood.stained said: Or just recompile the DT with dtc and reboot

Hmm...

root@odroid:/media/boot# dtc -I dtb -O dts -o rk3399-odroidn1-linux.dts rk3399-odroidn1-linux.dtb
Warning (unit_address_vs_reg): Node /usb@fe800000 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /usb@fe900000 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@0 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@1 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@2 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@3 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@4 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@5 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@6 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@7 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@8 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /thermal-zones/soc-thermal/trips/trip-point@9 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /phy@e220 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /efuse@ff690000/id has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /efuse@ff690000/cpul-leakage has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /efuse@ff690000/cpub-leakage has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /efuse@ff690000/gpu-leakage has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /efuse@ff690000/center-leakage has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /efuse@ff690000/logic-leakage has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /efuse@ff690000/wafer-info has a reg or ranges property, but no unit name
Warning (unit_address_vs_reg): Node /gpio-keys/button@0 has a unit name, but no reg property
Warning (unit_address_vs_reg): Node /gpiomem has a reg or ranges property, but no unit name
root@odroid:/media/boot# cat rk3399-odroidn1-linux.dts | curl -F 'f:1=<-' http://ix.io
http://ix.io/KQc

Anyway, I backed the eMMC contents up already yesterday so nothing can go wrong.
tkaiser Posted February 16, 2018 Author Posted February 16, 2018

Well, just setting two nodes to disabled results in PCIe being gone but only ~150mW (mW, not mA!) less consumption:

root@odroid:/media/boot# diff rk3399-odroidn1-linux.dts rk3399-odroidn1-linux-mod.dts
8c8
< model = "Hardkernel ODROID-N1";
---
> model = "Hardkernel ODROID-N1 low power";
1654c1654
< status = "okay";
---
> status = "disabled";
1682c1682
< status = "okay";
---
> status = "disabled";
root@odroid:/media/boot# cat /proc/device-tree/model ; echo
Hardkernel ODROID-N1 low power
root@odroid:/media/boot# lspci
root@odroid:/media/boot#

After reverting back to the original DT, PCIe is back and consumption increased by a whopping ~150mW:

root@odroid:/home/odroid# lspci
00:00.0 PCI bridge: Device 1d87:0100
01:00.0 IDE interface: ASMedia Technology Inc. ASM1061 SATA IDE Controller (rev 02)
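For anyone wanting to reproduce this, the whole round trip is just decompiling the DT on the boot partition, flipping the status properties of the two PCIe related nodes and compiling it back. A sketch using the filenames from the output above (keep a backup of the original dtb, as warned above):

cd /media/boot
cp rk3399-odroidn1-linux.dtb rk3399-odroidn1-linux.dtb.orig   # backup first
dtc -I dtb -O dts -o rk3399-odroidn1-linux.dts rk3399-odroidn1-linux.dtb
# edit the .dts and set status = "disabled" in the two PCIe nodes, then:
dtc -I dts -O dtb -o rk3399-odroidn1-linux.dtb rk3399-odroidn1-linux.dts
reboot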
tkaiser Posted February 16, 2018 Author Posted February 16, 2018 1 hour ago, tkaiser said: Since ODROID N1 with its current default settings has a pretty high 'ground' consumption (most probably related both to the ASM1061 and the DC-DC circuitry) we'd better talk about consumption differences. I get 3.2W at the wall in idle and 12.1W when running 'cpuminer --benchmark'. So that's 8.9W for '8.77 kH/s', or roughly 1W per kH/s (12V PSU included!).

Since we were already talking about power vs. consumption I gave cpuburn-a53 a try. I had to manually start it on the big cluster as well ('taskset -c 4,5 cpuburn-a53 &') but when the tool ran on all 6 CPU cores the fan started to spin at the lowest level and SoC temperature became stable at 52.8°C:

Time        big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
13:00:34: 1992/1512MHz  8.44 100%   0%  99%   0%   0%   0%  52.8°C  1/3
13:00:42: 1992/1512MHz  8.40 100%   0%  99%   0%   0%   0%  52.8°C  1/3
13:00:51: 1992/1512MHz  8.41 100%   0%  99%   0%   0%   0%  52.8°C  1/3
13:00:59: 1992/1512MHz  8.42 100%   0%  99%   0%   0%   0%  52.8°C  1/3
13:01:08: 1992/1512MHz  8.39 100%   0%  99%   0%   0%   0%  52.8°C  1/3
13:01:17: 1992/1512MHz  8.40 100%   0%  99%   0%   0%   0%  52.8°C  1/3
13:01:25: 1992/1512MHz  8.41 100%   0%  99%   0%   0%   0%  52.8°C  1/3
13:01:33: 1992/1512MHz  8.43 100%   0%  99%   0%   0%   0%  52.8°C  1/3
13:01:42: 1992/1512MHz  8.40 100%   0%  99%   0%   0%   0%  52.8°C  1/3^C

My powermeter then also showed just 12.1W, so it seems that with such heavy NEON workloads and RK3399 busy on all CPU cores we can't get the board to consume more than 9W above idle... Testing again with openssl and the crypto engine, the powermeter reports 13.2W maximum (that's 10W more compared to idle) while the fan works harder but temperature stays below 60°C:

Time        big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
13:12:06: 1992/1512MHz  6.01 100%   0%  99%   0%   0%   0%  55.0°C  2/3
13:12:13: 1992/1512MHz  6.17  99%   0%  99%   0%   0%   0%  55.6°C  2/3
13:12:20: 1992/1512MHz  6.16 100%   0%  99%   0%   0%   0%  55.0°C  2/3
13:12:27: 1992/1512MHz  6.14 100%   0%  99%   0%   0%   0%  55.0°C  2/3
13:12:33: 1992/1512MHz  6.27  99%   0%  99%   0%   0%   0%  54.4°C  2/3
13:12:40: 1992/1512MHz  6.25 100%   0%  99%   0%   0%   0%  55.0°C  2/3
13:12:47: 1992/1512MHz  6.23  99%   0%  99%   0%   0%   0%  56.7°C  2/3

IMO this is pretty amazing and I have to admit that I start to like the fansink Hardkernel put on this board. While it looks similar to the one on the XU4 I bought last year, this one is way less annoying. If you put the N1 into a cabinet (as I do with all IT stuff I don't need on my desk) you can't hear the thing.
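For the record, the 'openssl and the crypto engine' test mentioned above can be reproduced with openssl's built-in speed benchmark; pinning it per cluster should show the same big/little gap as cryptsetup. A sketch, assuming a reasonably recent openssl that supports the -evp, -elapsed and -multi switches:

# both A72 cores in parallel
taskset -c 4,5 openssl speed -elapsed -evp aes-128-cbc -multi 2
# all four A53 cores in parallel
taskset -c 0-3 openssl speed -elapsed -evp aes-128-cbc -multi 4
# all six cores at once (roughly the 13.2W scenario above)
openssl speed -elapsed -evp aes-128-cbc -multi 6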
tkaiser Posted February 16, 2018 Author Posted February 16, 2018

Thermal update

Since I was curious why temperatures in idle and under load were that low, and to be assured that throttling with the 4.4 BSP kernel we're currently using works... I decided to remove N1's heatsink:

Looks good, so now let's see how the board performs without the heatsink applied. Since I had not the slightest idea whether and how throttling works, I decided to let a huge fan assist in the beginning:

The board booted up nicely, the small PWM fan started to blow air around, the large one cooled things somewhat and I decided to run 'cpuminer --benchmark' again. To my surprise (I expected ODROID XU4 behaviour) the big cores were only throttled to 1800 and 1608 MHz after a couple of minutes, so at least I knew throttling was working. Then I decided to stop the 5V USB connected fan and let the benchmark run on its own (board lying flat on the table, neither heatsink nor fan involved). After about half an hour cpuminer still reported a hash rate of 'Total: 6.60 kH/s' (all 6 cores involved) and armbianmonitor output showed the current throttling behaviour:

Time        big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
17:40:23: 1008/1512MHz  6.56 100%   0%   0%  99%   0%   0%  84.4°C  3/3
17:40:28: 1008/1512MHz  6.91 100%   0%   0%  99%   0%   0%  84.4°C  3/3
17:40:33: 1008/1512MHz  6.84 100%   0%   0%  99%   0%   0%  84.4°C  3/3
17:40:38:  816/1512MHz  6.77 100%   0%   0%  99%   0%   0%  85.0°C  3/3
17:40:43:  816/1512MHz  6.71 100%   0%   0%  99%   0%   0%  85.0°C  3/3
17:40:48: 1008/1512MHz  6.73 100%   0%   0%  99%   0%   0%  84.4°C  3/3
17:40:53: 1008/1512MHz  6.67 100%   0%   0%  99%   0%   0%  84.4°C  3/3
17:40:59: 1008/1512MHz  6.62 100%   0%   0%  99%   0%   0%  84.4°C  3/3
17:41:04: 1008/1512MHz  6.57 100%   0%   0%  99%   0%   0%  84.4°C  3/3
17:41:09: 1008/1512MHz  6.52 100%   0%   0%  99%   0%   0%  84.4°C  3/3
17:41:14: 1008/1512MHz  6.48 100%   0%   0%  99%   0%   0%  83.9°C  3/3
17:41:19: 1200/1512MHz  6.44 100%   0%   0%  99%   0%   0%  85.0°C  3/3
17:41:24: 1200/1512MHz  6.41 100%   0%   0%  99%   0%   0%  84.4°C  3/3
17:41:29: 1008/1512MHz  6.37 100%   0%   0%  99%   0%   0%  84.4°C  3/3
17:41:34: 1008/1512MHz  6.34 100%   0%   0%  99%   0%   0%  84.4°C  3/3
Time        big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
17:41:39: 1008/1512MHz  6.40 100%   0%   0%  99%   0%   0%  84.4°C  3/3
17:41:45: 1200/1512MHz  6.37 100%   0%   0%  99%   0%   0%  85.6°C  3/3
17:41:50: 1992/1512MHz  5.86  24%   0%   0%  23%   0%   0%  78.8°C  3/3
17:41:55: 1992/1512MHz  5.39   0%   0%   0%   0%   0%   0%  75.0°C  3/3

So the big cores were throttled down to as low as 816 MHz but the board was still running under full load and generated 6.60 kH/s. Before I stopped the benchmark I checked the powermeter: 8.2W. In other words: with these throttling settings (clocking only the big cores down) we're now talking about a 5W delta compared to idle and 6.6 kH/s. That's 1.3 kH/s per W consumed. Pretty amazing, especially when comparing with ODROID XU4 or Tinkerboard... After stopping the benchmark I put the board into an upright position and switched to the ondemand governor to watch the temperatures drop down to 45°C (full armbianmonitor output):

Time        big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
18:04:31:  408/ 408MHz  0.01   0%   0%   0%   0%   0%   0%  45.6°C  0/3
18:04:36:  408/ 408MHz  0.01   0%   0%   0%   0%   0%   0%  45.0°C  0/3
18:04:41:  408/ 408MHz  0.01   0%   0%   0%   0%   0%   0%  45.0°C  0/3
18:04:46:  408/ 408MHz  0.01   0%   0%   0%   0%   0%   0%  45.6°C  0/3

That's really impressive. But be warned: once you use Android on this thing or GPU acceleration works within Linux, operation without a heatsink won't be a good idea (the Mali on this SoC is quite capable).
Anyway: with pure CPU workloads this all looks very nice and way more energy efficient than those beefy ARMv7 boards with Cortex-A15 or A17 cores.
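If you want to watch the throttling behaviour yourself without armbianmonitor, the relevant sysfs nodes are enough. A small sketch; on RK3399 the SoC temperature is usually thermal_zone0 and the big cluster is cpufreq policy of cpu4, but treat the exact indexes as assumptions and check on your own image:

while true ; do
    # SoC temperature (millidegrees), big cluster and little cluster clockspeeds (kHz)
    echo "$(cat /sys/class/thermal/thermal_zone0/temp)  \
$(cat /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq)  \
$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq)"
    sleep 5
done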
tkaiser Posted February 19, 2018 Author Posted February 19, 2018

Preliminary 'performance' summary

Based on the tests done above and elsewhere let's try to collect some performance data. GPU data is missing below for the simple reason that I'm not interested in anything GPU related (or in attaching a display at all). Besides being used for display stuff and 'retro gaming', RK3399's Mali T860 MP4 GPU is also OpenCL capable. If you search for results (ODROID N1's SoC has been available for some years now so you find a lot by searching for 'RK3399' -- for example here are some OpenCL/OpenCV numbers) please keep in mind that Hardkernel might use different clockspeeds for the GPU as well (with the CPU cores it's just like that: almost everywhere else the big/little cores are clocked at 1.8/1.4 GHz while the N1 settings use 2.0/1.5 GHz instead).

CPU horsepower

The situation with RK3399 is somewhat special since it's an HMP design combining two fast Cortex-A72 cores with four 'slow' A53. So depending on which CPU core a job lands on, execution time can vary by a factor of two. With Android or 'Desktop Linux' workloads this shouldn't be an issue since there things are mostly single-threaded and the scheduler will move these tasks to the big cores automagically if performance is needed. With other workloads it differs:

- People wanting to use RK3399 as part of a compile farm might be disappointed and still prefer ARM designs that feature four instead of two fast cores (eg. RK3288 or Exynos 5422 -- for reasons why see again the comments section on CNX)
- For 'general purpose' server use cases the 7-zip scores are interesting since they give a rough estimate of how fast an RK3399 device will perform as a server (or how many tasks you can run in parallel). The overall score is 6,500 (see this comparison list) but due to the big.LITTLE design we're talking about the big cluster scoring 3350 and the little cluster 3900. So tasks that execute on the big cores finish almost twice as fast. Keep this in mind when setting up your environment. Experimenting with cgroups and friends to assign certain tasks to specific CPU clusters will be worth the effort!
- 'Number crunchers' who can make use of NEON instructions should look at 'cpuminer --benchmark' results: we get a total rate of 8.80 kH/s when running on all 6 cores (big cores only: 4.10 kH/s, little cores only: 4.90 kH/s -- so again 'per core' performance is almost twice as good on the big cores), which is at the same performance level as an RK3288 (4 x A17) but gets outperformed by an ODROID XU4 for example at +10 kH/s since there the little cores add a little bit to the result. But this needs improved cooling, otherwise an XU4 will immediately throttle down. The RK3399 provides this performance with way lower consumption and heat generation!
- Crypto performance: just awesome due to the ARMv8 Crypto Extensions being available and usable on all cores in parallel. Simply check the cryptsetup results above and our 'openssl speed' numbers and keep in mind that if your crypto stuff can run in parallel (eg.
terminating a few different VPN sessions) you can almost add up the individual throughput numbers (and even with 6 threads in parallel at full clockspeed the RK3399 just draws 10W more compared to idle)
- Talking about 'two fast and four slow CPU cores': the A53 cores are clocked at 1.5GHz, so when comparing with RK3399's little sibling RK3328 with only 4 x A53 (ROCK64, Libre Computer Renegade or Swiftboard/Transformer), the RK3399 running only on its 'slow' cores will already match or outperform the RK3328 boards while still having 2 big cores available for heavy stuff. But since a lot of workloads are bottlenecked by memory bandwidth you should have a look at the tinymembench results collected above (and use some google-fu to compare with other devices)

Storage performance

N1 has 2 SATA ports provided by a PCIe attached ASM1061 controller and 2 USB3 ports directly routed to the SoC. The per-port bandwidth limitation, which also seems to apply to each port group as a whole, is around 390 MB/s (this applies to all ports regardless whether SATA or USB3 -- random IO performance with default settings is also pretty much the same). But this is not an overall internal SoC bottleneck, since when testing with fast SSDs on both USB3 and SATA ports at the same time we got numbers around ~750 MB/s.

I just retested with an EVO840 on the N1 on both the SATA and USB3 ports (the latter in a good UAS capable enclosure) and as a comparison repeated the same test with a 'true NAS SoC': the Marvell Armada 385 on Clearfog Pro which provides 'native SATA' from the SoC itself. Same Samsung EVO840 used for the tests, same settings (for the iozone command line see somewhere above):

ODROID N1 USB3/JMS567
                                                      random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     7348     8214     9805    10012      5473      8085
102400       4    26394    30872    41039    40473     20255     30509
102400      16    68892    98586   120807   121118     66786     97474
102400     512   327991   334624   312310   316452    305005    331188
102400    1024   357135   365850   349055   354952    343348    359507
102400   16384   376355   388326   395179   400291    399759    384052

ODROID N1 PCIe/ASM1061 powersave
                                                      random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1     7585     8562     9322     9331      5907      8505
102400       4    26400    31745    34586    34798     24039     31595
102400      16    87201    99311   105977   106152     79099     99618
102400     512   313662   316992   308216   310013    301521    308300
102400    1024   327748   324230   319738   322929    317812    325224
102400   16384   368813   369384   385862   390732    390612    379333

ODROID N1 PCIe/ASM1061 performance
                                                      random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1    15218    19331    23617    23661     10690     18965
102400       4    49071    65403    79028    79247     39287     64922
102400      16   137845   168899   185766   186482    116789    166413
102400     512   326117   332789   324468   326999    317332    328611
102400    1024   330827   331303   326731   329246    325201    333325
102400   16384   378331   368429   385870   392127    391348    371753

Clearfog Pro SATA
                                                      random    random
    kB  reclen    write  rewrite     read   reread      read     write
102400       1    21853    37308    39815    39753     12597     35440
102400       4    63930   121585   132720   133372     46210    118527
102400      16   176397   262801   278098   289824    143121    265142
102400     512   387158   404191   425735   432220    415117    386369
102400    1024   376309   395735   450046   421499    432396    387842
102400   16384   384486   389053   506038   509033    500409    402384

If we look carefully at the numbers we see that USB3 slightly outperforms the ASM1061 when it comes to top sequential performance.
The two ASM1061 numbers are due to different settings of /sys/module/pcie_aspm/parameters/policy (defaults to powersave but can be changed to performance, which not only results in ~250mW higher idle consumption but also a lot better performance with small block sizes). While USB3 seems to perform slightly better when looking only at irrelevant sequential transfer speeds, you'd better attach disks to the SATA ports for a number of reasons:

- With USB you need disk enclosures with good USB-to-SATA bridges that are capable of UAS --> 'USB Attached SCSI' (we can only recommend the following ones: ASMedia ASM1153/ASM1351, JMicron JMS567/JMS578 or VIA VL711/VL715/VL716 -- unfortunately, even if those chipsets are used, sometimes crappy firmwares need USB quirks or require UAS blacklisting and then performance sucks. A good example are Seagate USB3 disks)
- When you use SSDs you want to be able to use TRIM (helps with retaining drive performance and increases longevity). With SATA attached SSDs this is not a problem but on USB ports it depends on a lot of stuff and usually does NOT work. If you understand just half of what's written here then think about SSDs on USB ports, otherwise better choose the SATA ports here
- And PCIe is also less 'expensive' since it needs fewer resources (lower CPU utilization with the disk on SATA ports and fewer interrupts to process, see the 800k IRQs for SATA/PCIe vs. 2 million for USB3 with exactly the same workload below):

226:     180   809128        0        0        0        0   ITS-MSI 524288 Edge   0000:01:00.0
226:       0        0        0        0        0        0   ITS-MSI 524288 Edge   0000:01:00.0
227:     277        0  2066085        0        0        0   GICv3 137 Level       xhci-hcd:usb5
228:       0        0        0        0        0        0   GICv3 142 Level       xhci-hcd:usb7

There are also eMMC and SD cards usable as storage. Wrt SD cards it's too early to talk about performance since a necessary kernel patch is still missing to remove the current SD card performance bottleneck (at least the N1 developer samples only implement the slowest SD card speed mode and I really hope this will change with the final N1 version).

The eMMC performance is awesome! If we look only at random IO performance with smaller block sizes (that's the 'eMMC as OS drive' use case) then the Hardkernel eMMC modules starting at 32GB size perform as fast as an SSD connected to the USB3 or SATA ports. With the SATA ports we get a nice speed boost by changing the ASPM (Active State Power Management) setting from the 'powersave' default to performance (+250mW idle consumption). Only then can an SSD behind a SATA port on the N1 outperform a Hardkernel eMMC module wrt random IO or 'OS drive' performance. But of course this has a price: when SATA or USB drives are used, consumption is a lot higher.

Network performance

Too early to report 'success' but I'm pretty confident we'll get Gigabit Ethernet fully saturated after applying some tweaks. With RK3328 it was the same situation in the beginning and maybe the same fixes that helped there will fix it with RK3399 on the N1 too. I would assume progress can be monitored here: https://forum.odroid.com/viewtopic.php?f=150&t=30126
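Coming back to the ASPM policy mentioned above: since it makes such a difference for small-blocksize random IO, here is the trivial way to check and switch it at runtime (the change does not survive a reboot; making it permanent via pcie_aspm.policy=performance on the kernel cmdline should also work, but treat that as an assumption depending on kernel config):

cat /sys/module/pcie_aspm/parameters/policy        # shows e.g. default performance [powersave]
echo performance > /sys/module/pcie_aspm/parameters/policy
# then re-run the iozone test from above to reproduce the powersave vs. performance difference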
tkaiser Posted February 23, 2018 Author Posted February 23, 2018

Storage performance update... what to use to store the rootfs on?

In the following I compare 4 good SD cards with the 4 different eMMC modules Hardkernel sells for the N1 and with 4 different SSD setups. As background on why I chose to measure random IO with 1k, 4k and 16k block sizes please read the 'SD card performance 2018 update' first. The following are IOPS numbers (IO operations per second), which are important if we want to know how fast storage performs when used as an 'OS drive' (random IO performance is the most important factor here):

                                1K w/r        4K w/r       16K w/r
SanDisk Extreme Plus 16GB     566   2998     731   2738     557   2037
SanDisk Ultra A1 32GB         456   3171     843   2791     548   1777
SanDisk Extreme A1 32GB       833   3289    1507   3281    1126   2113
Samsung Pro 64GB             1091   4786    1124   3898     478   2296
Orange eMMC 16GB             2450   7344    7093   7243    2968   5038
Orange eMMC 32GB             2568   7453    7365   7463    5682   5203
Orange eMMC 64GB             2489   7316    7950   6944    6059   5250
Orange eMMC 128GB            2498   8337    7064   7197    5459   4909
Intel 540 USB3               7076   4732    7053   4785    5342   3294
Samsung EVO750 USB3          8043   6245    7622   5421    6175   4481
Samsung EVO840 powersave     8167   5627    7605   5720    5973   4766
Samsung EVO840 performance  18742  10471   16156   9657   10390   7188

The SD cards I chose for this comparison all perform very well (an average no-name, Kingston, PNY, Verbatim or whatever other 'reputable' brand performs way lower wrt random IO!). But it can be clearly seen that Hardkernel's eMMC modules are a lot more performant. Regardless of size they all perform pretty similarly, though the small 16GB module is bottlenecked by a write performance limitation that also affects 16k random IO write performance. With SSDs it depends: I chose somewhat ok-ish consumer SSDs for the test, so in case you want to buy used SSDs or some 'great bargains' on Aliexpress or eBay be prepared that your numbers will look way worse. The SATA connected EVO840 is listed twice since performance with small blocksizes heavily depends on the PCIe power management setting (the default is powersave -- switching to performance increases idle consumption by around ~250mW but only then is a SATA connected SSD able to outperform Hardkernel's eMMC. That's important to know and also only applies to really performant SSDs. Cheap SSDs, especially with small capacities, perform way lower).

Now let's look at sequential performance with large blocksizes (something that does NOT represent the 'OS drive' use case even remotely and is pretty irrelevant for almost all use cases except the creation of stupid benchmark graphs):

                            MB/s write   MB/s read
SanDisk Extreme Plus 16GB        63          67
SanDisk Ultra A1 32GB            20          66
SanDisk Extreme A1 32GB          59          68
Samsung Pro 64GB                 61          66
Orange eMMC 16GB                 48         298
Orange eMMC 32GB                133         252
Orange eMMC 64GB                148         306
Orange eMMC 128GB               148         302
Intel 540 USB3                  325         370
Samsung EVO750 USB3             400         395
Samsung EVO840 powersave        375         385
Samsung EVO840 performance      375         385

We can see that N1's SD card interface seems to bottleneck sequential read performance of all tested cards to around ~67 MB/s. Write performance depends mostly on the cards (all cheap cards like the tested SanDisk Ultra A1 32GB you currently get for $12 on Amazon are limited here). The Hardkernel eMMC modules perform very well with sustained read performance at around 300 MB/s and write performance, depending on module size, of up to ~150 MB/s.
With SSDs it depends -- we have an interface limitation of around ~395 MB/s on the USB3 ports and a little bit lower on the SATA ports, but unless you buy rather expensive SSDs you won't be able to reach the board's bottleneck anyway. Please also keep in mind that the vast majority of consumer SSDs implement some sort of write caching and write performance drops drastically once a certain amount of data is written (my Intel 540 then gets as slow as 60 MB/s, IIRC the EVO750 can achieve ~150 MB/s and the EVO840 180 MB/s).

Why aren't HDDs listed above? Because they're useless for this use case. Even enterprise HDDs show far too low random IO performance. These things are good for storing 'cold data' but never ever put your rootfs on them. They're outperformed at least 5 times over by any recent A1 rated SD card, even crappy SSDs are at least 10 times faster and Hardkernel's eMMC performs at least 50 times better.

So how to interpret the results above?

- If you want energy efficient and ok-ish performing storage for your rootfs (OS drive), choose any of the currently available A1 rated SD cards from reputable vendors (choose more expensive ones for better performance/resilience, choose larger capacities than needed if you fear your flash memory wearing out too fast).
- If you want top performance at the lowest consumption level, choose Hardkernel's eMMC and keep in mind that the smallest module is somewhat write-performance bottlenecked. Again: if you fear your flash memory wearing out too fast, simply choose larger capacities than 'needed'.
- If you want to waste huge amounts of energy while still being outperformed by Hardkernel's eMMC, buy a cheap SSD. Keep in mind that you need to disable PCIe power management (further increasing idle consumption) to be able to outperform eMMC storage, otherwise N1's SATA/PCIe implementation will bottleneck too much.

So when do SSDs start to make sense? If you either really need higher performance than Hardkernel's eMMC modules and are willing to spend a serious amount of money on a good SSD, or the '1k random IO' use case really applies to you (e.g. trying to run a database with insanely small record sizes that constantly updates at the storage layer). But always keep in mind: unless you choose a more expensive, high performing SSD you'll always get lower performance than eMMC while consumption is at least 100 times higher. And always use SSDs on the SATA ports, since only there can you get higher random IO performance compared to eMMC, and being able to benefit from TRIM is essential (for details why TRIM is a problem on USB ports see above). But keep in mind that internal SATA ports are rated for 50 matings max, so be prepared to destroy connectors easily if you constantly swap cables on those SATA ports.

But what if you feel that any SATA attached storage (the cheapest SSD around and even HDDs) must be an improvement compared to eMMC or SD cards? Just use it, all of the above is about facts and not feelings. You should only make sure to never ever test your storage performance since that might hurt your feelings (it would be as easy as 'cd $ssd-mountpoint ; iozone -e -I -a -s 100M -r 1k -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2' but really don't do this if you want to keep believing in one of the most common misbeliefs with consumer electronics today).

As a reference, all IO benchmark results for SD cards, Hardkernel's eMMC modules and the SSD tests: https://pastebin.com/2wxPWcWr https://pastebin.com/ePUCXyg6 https://pastebin.com/N5wEghn3
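If someone wants to double-check the random IO claims above with a second tool, fio reports IOPS directly instead of KB/s. A rough sketch, assuming fio is installed (apt install fio) and that the command is run from the SSD's mountpoint with a few GB of free space:

cd /path/to/ssd-mountpoint        # placeholder, use your real mountpoint
# mixed 70/30 random read/write with 4k blocks, bypassing the page cache
fio --name=randrw --filename=fio-testfile --rw=randrw --rwmixread=70 \
    --bs=4k --size=1G --ioengine=libaio --iodepth=4 --direct=1 \
    --runtime=60 --time_based
rm fio-testfile                   # clean up the test file afterwards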
tkaiser Posted February 26, 2018 Author Posted February 26, 2018

Just a miniature SATA/ASM1061 related material collection:

- multiple disks behind ASM1061 problem with Turris Omnia
- Suggested 'fix' by Turris folks (slowing down PCIe): https://gitlab.labs.nic.cz/turris/turris-os-packages/merge_requests/48/diffs -- please note that the ASM106x firmware matters: their ASM1061 registers itself as class '0x010601' (AHCI 1.0) while the ASM1061 Hardkernel put on the N1 dev samples uses a firmware that reports class '0x010185' (IDE 1.0) instead. This doesn't matter wrt performance since there the chosen driver is what counts, but if code wants to differentiate based on PCIe device classes this of course has to match. Same with device ids: these can be either '0x0611' (ASM1061) or '0x0612' (ASM1062) based on firmware and not hardware (the Turris ASM1061 shows up as an ASM1062).
- To disable NCQ and/or to set link speed negotiation limits you could adjust the 'setenv bootargs' line in /media/boot/boot.ini, for example: setenv bootargs "${bootrootfs} libata.force=1.5,noncq" (see kernel cmdline parameters, could be interesting for SSD users in case NCQ and TRIM interfere)
- To check SATA relevant dmesg output: dmesg | egrep -i "ahci|sata| ata|scsi|ncq" (mandatory prior to and after any benchmarks!)
- There's a newer firmware for the ASM1061 available -- to be able to use the included binary it would need a few steps but even then the update operation fails: dpkg --add-architecture armhf ; apt install binutils:armhf ; ./106flash ahci420g.rom (Hardkernel put an SPI flash for the ASM1061 on the PCB but the flash program stops with 'ASM106X SPI Flash ROM Write Linux V2.6.4 • Find 1 ASM106X Controller • Read_RomID Failed!!')
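To verify afterwards whether a libata.force= line in boot.ini was actually picked up, the negotiated SATA link speed, NCQ state and trained PCIe link can be read back at runtime. A sketch; the ata_link sysfs attributes and the LnkCap/LnkSta lines should be available with this 4.4 kernel, but treat the exact paths as assumptions:

dmesg | egrep -i "SATA link up|NCQ"          # e.g. 'SATA link up 6.0 Gbps' plus the NCQ queue depth
cat /sys/class/ata_link/link*/sata_spd       # per-link negotiated SATA speed
lspci -vv -d 1b21: | grep -iE 'LnkCap|LnkSta'   # ASMedia device: PCIe capability vs. actually trained link (run as root)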