tkaiser Posted August 30, 2016

Since I've seen some really weird disk/IO benchmarks made on SBCs over the last few days, and since both a new SBC and a new SSD arrived in the meantime, I thought let's give it a try with a slightly better test setup.

I tested with 4 different SoC/SBC: NanoPi M3 with an octa-core S5P6818 Samsung/Nexell SoC, ODROID-C2 featuring a quad-core Amlogic S905, Orange Pi PC with a quad-core Allwinner H3 and an old Banana Pi Pro with a dual-core A20. The device considered the slowest (dual-core A20 with just 960 MHz clockspeed) is in reality the fastest when it's about disk IO.

Since most if not all storage 'benchmarks' for SBC moronically focus on sequential transfer speeds only and completely forget that random IO is way more important on any SBC (it's not a digital camera or video recorder!) I tested this as well. And since it's also somewhat moronic to choose a disk that is itself the limit when you want to test the storage implementation of a computer, the main test device is a brand new Samsung SSD 750 EVO 120GB which I tested first on a PC to verify the SSD is OK and to get a baseline of what to expect.

Since NanoPi M3, ODROID-C2 and Orange Pi PC only feature USB 2.0 I tested with 2 different USB enclosures that are known to be USB Attached SCSI (UAS) capable. The nice thing about UAS is that while it's an optional USB feature that arrived together with USB 3.0, we can use it with more recent sunxi SoCs too when running mainline kernel (A20, H3, A64 -- all only USB 2.0 capable). When clicking on the link you can also see how different USB enclosures (to be more precise: the USB-to-SATA bridges inside them) perform. Keep that in mind when you see 'disk performance' numbers somewhere and people write that SBC A would be 2MB/s faster than SBC B -- the SBC is not the only thing responsible for the variation in numbers, this is for sure also influenced by both the disk used and the enclosure / USB-SATA bridge inside! The same applies to the kernel the SBC is running. So never trust any numbers you find on the internet that are the results of tests made at different times, with different disks or different enclosures. The numbers presented are just BS.

The two enclosures I tested with are equipped with a JMicron JMS567 and an ASMedia ASM1153. With sunxi SBCs running mainline kernel UAS will be used, with other SoCs/SBC or legacy kernels it will be USB Mass Storage instead. Banana Pi Pro is an exception since its SoC features true SATA (with limited sequential write speeds) which will outperform every USB implementation. And with this device I also used a rather fast SD card as well as a normal HDD connected to USB in a non-UASP-capable disk enclosure to show how badly this affects the important performance factors (again: random IO!)

I used iozone with 3 different runs (a sketch of the calls is at the end of this post):
1 MB test size with 1k, 2k and 4k record sizes
100 MB test size with 4k, 16k, 512k, 1024k and 16384k (16 MB) record sizes
4 GB test size with 4k and 1024k record sizes

The variation in results is interesting. If 4K results between 1 MB and 100 MB test size differ you know that your benchmark is not testing disk throughput but instead the (pretty small) disk cache. Using 4GB for sequential transfer speeds ensures that the whole amount of data exceeds DRAM size.

The results:

NanoPi M3 @ 1400 MHz / 3.4.39-s5p6818 / jessie / USB Mass Storage: Sequential transfer speeds with USB: 30MB/s with 1MB record size and just 7.5MB/s at 4K/100MB, lowest random IO numbers of all.
All USB ports are behind an USB hub and it's already known that performance on the USB OTG port is higher. Unfortunately my SSD with both enclosures prevented negotiating an USB connection on the OTG port since each time I connected the SSD the following happened: WARN::dwc_otg_hcd_hub_control:2544: Overcurrent change detected ) JMicron JMS567 random random kB reclen write rewrite read reread read write 1024 1 2396 2589 2616 2665 2666 2657 1024 2 4397 5033 5101 5323 5334 5296 1024 4 7454 7138 7495 8000 8000 7924 102400 4 7063 7476 7531 7570 7536 7573 102400 16 15812 17276 20397 20326 20421 16990 102400 512 25465 25454 29117 28545 29114 25501 102400 1024 25843 25401 29048 29279 29420 25899 102400 16384 26592 26600 31280 31306 30841 26472 4096000 4 28107 28145 29994 29795 4096000 1024 29253 29578 31328 31123 ASMedia ASM1153 random random kB reclen write rewrite read reread read write 1024 1 2281 2600 2594 2659 2663 2642 1024 2 4171 5047 5050 5299 4411 5236 1024 4 6929 7386 7381 7969 7985 7782 102400 4 7568 7968 7995 7999 7985 7993 102400 16 15939 18160 21231 21294 18241 18176 102400 512 26865 26985 29784 29873 29609 26958 102400 1024 27191 27273 30211 30290 30222 27271 102400 16384 28427 28479 32559 32655 32676 28533 4096000 4 29275 29483 31404 31175 4096000 1024 29182 29500 31392 31178 ODROID-C2 @ 1536 MHz / 3.14.74-odroidc2 / xenial / USB Mass Storage: Sequential transfer speeds with USB: ~39MB/s with 1MB record size and ~10.5MB/s at 4K/100MB. All USB ports are behind an USB hub and the performance numbers look like there's always some buffering involved (not true disk test but kernel's caches involved partially) JMicron JMS567 random random kB reclen write rewrite read reread read write 1024 1 2630 2562 2623 2665 2665 2658 1024 2 5177 4884 5109 5328 5330 5305 1024 4 9868 9352 10188 10636 10662 10543 102400 4 10639 10633 10647 10656 10658 10649 102400 16 21288 21246 21288 21327 21326 21319 102400 512 37439 37382 40035 40134 39815 37461 102400 1024 38051 38081 40874 40894 40773 38235 102400 16384 38512 38401 41363 41346 41298 38542 4096000 4 42327 42383 37781 37792 4096000 1024 42387 42563 37811 37743 ASMedia ASM1153 random random kB reclen write rewrite read reread read write 1024 1 2592 2564 2609 2662 2663 2643 1024 2 5147 4899 5055 5318 5321 5250 1024 4 9454 9397 9987 10596 10631 10319 102400 4 10634 10634 10646 10655 10656 10651 102400 16 21292 21248 21279 21324 21326 21307 102400 512 37118 37117 38804 38851 37478 37045 102400 1024 37782 37944 39151 39277 37829 37927 102400 16384 38062 37957 39299 39360 39405 38061 4096000 4 42186 42465 36276 36251 4096000 1024 41990 42020 36177 36174 Orange Pi PC @ 1296 MHz / 4.7.2-sun8i / xenial / UAS: Sequential transfer speeds with USB: ~40MB/s with 1MB record size and ~9MB/s at 4K/100MB, best random IO with very small files. 
All USB ports are independant (just like on Orange Pi Plus 2E where identical results will be achieved since same SoC and same settings when running Armbian) JMicron JMS567 random random kB reclen write rewrite read reread read write 1024 1 2965 3670 3898 3955 2655 3843 1024 2 4953 7366 7175 7817 5326 7545 1024 4 8606 9394 9968 10370 7989 10168 102400 4 8859 10293 10622 10644 7996 8645 102400 16 22642 24860 24839 22971 21334 22089 102400 512 37057 34611 40005 40184 39697 35039 102400 1024 37555 37682 40681 40739 40398 37713 102400 16384 36809 38030 41050 41183 41172 38063 4096000 4 39228 39266 40931 40941 4096000 1024 38889 39037 40939 40950 ASMedia ASM1153 random random kB reclen write rewrite read reread read write 1024 1 2431 2933 3079 3063 2664 2962 1024 2 4395 5716 5926 6262 5326 6007 1024 4 7442 7937 8351 8762 7990 8146 102400 4 7976 8352 7993 8000 7976 8060 102400 16 21294 21838 22744 21874 21321 21576 102400 512 36848 34647 39241 39386 38959 34878 102400 1024 37451 37531 40050 40248 39940 37685 102400 16384 36937 38107 40884 41145 41138 38181 4096000 4 39124 39329 39994 39975 4096000 1024 39122 39179 39884 39792 Banana Pi Pro @ 960 MHz / 4.6.3-sunxi / xenial / SATA-SSD vs. USB-HDD: This test setup is totally different since the SSD will be connected through SATA and I use a normal HDD in an UAS incapable disk enclosure to show how huge the performance differences are. SATA sequential transfer speeds are unbalanced for still unknown reasons: write/read ~40/170MB/s with 1MB record size, 16/44MB/s with 4K/100MB (that's huge compared to all the USB numbers above!). Best random IO numbers (magnitudes faster since no USB-to-SATA bottleneck as with every USB disk is present). The HDD test shows the worst numbers: Just 29MB/s sequential speeds at 1MB record size and only ~5MB/s with 4K/100MB. Also the huge difference between the tests with 1MB vs. 100MB data size with 4K record size shows clearly that with 1MB test size only the HDD's internal DRAM cache has been tested (no disk involved): this was not a disk test but a disk cache test only. 
SSD on SATA port random random kB reclen write rewrite read reread read write 1024 1 5416 7880 12172 12603 7071 7808 1024 2 8876 12552 20764 21865 13054 12337 1024 4 14467 18387 37819 42846 26363 17098 102400 4 14932 19301 42873 45622 24953 19840 102400 16 27841 31117 103168 103871 73178 31151 102400 512 38764 38931 188829 189697 175944 38861 102400 1024 39369 39437 207168 208312 199500 39406 102400 16384 39922 39889 217207 218838 218113 40048 4096000 4 40185 40168 181351 183561 4096000 1024 39714 39707 162229 162155 HDD on USB 2.0 port random random kB reclen write rewrite read reread read write 1024 1 1557 1718 1914 1682 1790 1782 1024 2 2553 3796 3035 3577 3995 3218 1024 4 5091 4287 4590 6395 6384 4208 102400 4 5019 5782 5829 5798 447 1172 102400 16 14832 16491 15018 15632 1716 4115 102400 512 30147 30755 29154 29686 20518 29839 102400 1024 30665 31523 28922 29804 17880 30775 102400 16384 32828 32296 30602 30819 29758 33069 4096000 4 32263 32633 26844 26893 4096000 1024 32010 32418 26609 26736 Samsung Pro 64 GB SD card random random kB reclen write rewrite read reread read write 1024 1 884 942 2684 2723 2695 987 1024 2 1494 1831 4636 4685 4748 1885 1024 4 3158 3273 8050 7617 7849 3097 102400 4 2617 2818 7848 7818 7812 2272 102400 16 6555 5656 13380 13396 13329 4932 102400 512 20756 20904 21966 21971 21969 20983 102400 1024 21415 21558 22204 22207 22200 21520 102400 16384 21768 21836 22418 22418 22413 21857 4096000 4 22922 22692 21888 21901 4096000 1024 22995 22593 21132 21110 Lessons to learn? HDDs are slow. Even that slow that they are the bottleneck and invalidate every performance test when you want to test the performance of the host (the SBC in question) With HDDs data size matters since you get different results depending on whether the benchmark runs inside the HDD's internal caches or not. SSDs behave here differently since they do not contain ultra-slow rotating platters but their different types of internal storage (DRAM cache and flash) do not perform that different When you have both USB and SATA not using the latter is almost all the time simply stupid (even if sequential write performance looks identical. Sequential read speeds are way higher, random IO will always be superiour and this is more important) It always depends on the use case in question. Imagine you want to set up a lightweight web server dealing with static contents on any SBC that features only USB. Most of the accessed files are rather small especially when you configure your web server to deliver all content already pre-compressed. So if you compare random reads with 4k and 16k record size and 100MB data size you'll notice that a good SD card will perform magnitudes faster! For small files (4k) it's ~110 IOPS (447 KB/s) vs. 1950 IOPS (7812 KB/s) so SD card is ~18 times faster, for 16k size it's ~110 IOPS (1716 KB/s) vs. 830 IOPS (13329 KB/s) so SD card is still 7.5 times faster than USB disk. File size has to reach 512K to let USB disk perform as good as the SD card! Please note that I used a Samsung Pro 64GB for this test. The cheaper EVO/EVO+ with 32 and 64GB show identical sequential transfer speeds while being a lot faster when it's about random IO with small files. So you save money and get better performance by choosing the cards that look worse by specs! Record size always matters. Most fs accesses on an SBC are not large data that will be streamed but small chunks of randomly read/written data. 
Therefore check random IO results with small record sizes since this is what matters, and have a look at the comparison of 1MB vs. 100 MB data size to get an idea of when you're only testing your disk's caches and when you're really testing the disk. If you compare random IO numbers from crap SD cards (Kingston, noname, Verbatim, noname, PNY, noname, Intenso, noname and so on) with the results above then even the slow HDD connected through USB can shine. But better SD cards exist, and so do some pretty fast eMMC implementations on some boards (ODROID-C2 being the best performer here). By comparing with the SSD results you get an idea how to improve performance when your workload depends on that (desktop Linux, web server, database server). Even a simple 'apt-get upgrade' when done after months without upgrades depends heavily on fast random IO (especially writes).

So by relying on the usual bullshit benchmarks only showing sequential transfer speeds, an HDD (30 MB/s) and an SD card (23 MB/s) seem to perform nearly identically while in reality the way more important random IO performance might differ a lot. And this depends solely on the SD card you bought and not on the SBC you use! For many server use cases where small file accesses happen, good SD cards or eMMC will be magnitudes faster than HDDs (again, it's mostly about random IO and not sequential transfer speeds). I personally used/tested SD cards that show only 37 KB/s when running the 16K random write test (some cheap Intenso crap). Compared to the same test when combining an A20 with a SATA SSD this is 'just' over 800 times slower (31000 KB/s). Compared to the best performer we currently know (EVO/EVO+ with 32/64GB) this is still 325 times slower (12000 KB/s). And this speed difference (again: random IO) will be responsible for an 'apt-get upgrade' with 200 packages taking hours on the Intenso card while finishing in less than a minute on the SATA disk and in 2 minutes with the good Samsung cards, given your Internet connection is fast enough.
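For reference, the three iozone runs described at the top of this post boil down to something like the following (a sketch: the 100 MB and 4 GB calls mirror the ones listed later in this thread, the exact flags of the 1 MB run are an assumption):

# 1 MB test size, 1k/2k/4k record sizes (mostly showing cache behaviour)
iozone -e -I -a -s 1m -r 1k -r 2k -r 4k -i 0 -i 1 -i 2
# 100 MB test size, 4k up to 16M record sizes, sequential plus random IO
iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2
# 4 GB test size so the data exceeds DRAM, sequential only, 4K and 1M records
iozone -a -g 4000m -s 4000m -i 0 -i 1 -r 4K -r 1024K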
arox Posted August 31, 2016

"some pretty fast eMMC implementations on some boards (ODROID-C2 being the best performer here)"

Odroid eMMC performance is very impressive, but so is the price: pretty much the price of an SSD! The eMMC of my BPI m2+ is quite good at small IO and bad at large. I suppose the controller inside the chip is optimized to do so. And that is what I need anyway for a system disk: the performance while handling numerous small files.

                                                random   random
        kB  reclen   write  rewrite    read  reread    read    write
    102400       4    4753     5525   15134   15134   12817     3385
    102400     512    9794     7042   61619   61412   61511     5389
    102400   16384    7093     7712   77026   77053   52573    10426

Question: can you tell me how much one can expect with 4K random reads over the network with an SSD/SATA/fast or USB Ethernet/good kernel? The target use case is diskless clients.
tkaiser Posted August 31, 2016

Odroid eMMC performance is very impressive, but so is the price: pretty much the price of an SSD!

Well, but cheap SSDs show less performance than these eMMC modules, and since we're solely talking about SBC here the question remains: how to connect an SSD? If you have an A20 device you're lucky since you can use SATA; if you have to rely on USB even fast SD cards might easily outperform any USB connected SSD as long as it's about small file sizes and random IO.

Regarding BPi M2+ -- it seems the eMMC they now use is way slower than before (I have PCB revision 1.0, maybe they used better eMMC there to get better reviews?). Please compare with post #45 here. And please open a separate thread regarding diskless clients since this is a totally unrelated subject (at least based on my experience so far the key to success is low latency and caching as much as possible on the client side)
arox Posted August 31, 2016

Well, but cheap SSDs show less performance than these eMMC modules, and since we're solely talking about SBC here the question remains: how to connect an SSD? If you have an A20 device you're lucky since you can use SATA

I have a Sandisk 64 GB SSD (quite cheap, good transfer rates but power consumption a bit high in comparison to Samsung SSDs). Installed on an Intel Atom 1.6 GHz with a 4.4.6 kernel (gentoo optimized), it makes SD, eMMC and HDD look ridiculous anyway:

                                                random   random
        kB  reclen   write  rewrite    read  reread    read    write
    102400       4   30645    36430   44139   44268   16871    13850
    102400     512  152591   147527  240342  255497  201405   109911
    102400   16384  164887   184238  266631  266543  264865   179944

(I need to compile nfs in order to benchmark NFS or iSCSI solutions)

BTW 1) Is there a solution to connect 2 SATA drives to one port?
BTW 2) My eMMC on BPI m2+ shows: Disk /dev/mmcblk0: 7.3 GiB, 7818182656 bytes, 15269888 sectors
If yours with PCB 1.0 is really 8 GiB, it is a clue that they effectively made some cost "optimization" ...
tkaiser Posted September 1, 2016

I have a Sandisk 64 GB SSD (quite cheap, good transfer rates but power consumption a bit high in comparison to Samsung SSDs). Installed on an Intel Atom 1.6 GHz with a 4.4.6 kernel (gentoo optimized), it makes SD, eMMC and HDD look ridiculous anyway

Nope, not really. The way iozone displays results might be confusing. Random IO is about IOPS (input/output operations per second); iozone shows KB/s instead, so you have to calculate it yourself by dividing the KB/s you get by the record size:

13850 KB/s at 4k record size: 3462 IOPS writing
109911 KB/s at 512k record size: 215 IOPS writing
179944 KB/s at 16384k record size: 11 IOPS writing

The higher the record sizes grow, the more the sequential transfer speed limit distorts the random IO numbers (see your read results: they're the same for random and sequential reads at 16MB record size), and there's also no use case for it (at least I know of no application that writes/reads chunks of 16 MB randomly to disk; as soon as chunks get that large we're talking about 'streaming' use cases, then it's about sequential performance, and with HDDs we would start to take care of disk fragmentation). Taking the use case into consideration, testing random IO is useful for smaller record sizes (1 - 128 KB).

And then please have a look at how the smallest (slowest) eMMC module from Hardkernel performs in an SD card adapter on an old, slow, boring Banana Pi (values almost as fast as your SSD, but the BPi is faster in reality -- see below):

8 GB used with Banana Pi
                                                random   random
        kB  reclen   write  rewrite    read  reread    read    write
    102400       4   11485    12921    7044    7060    7010    11957

And these are the 8GB and the 32 GB eMMC tested on ODROID-C2 itself:

8 GB
                                                random   random
        kB  reclen   write  rewrite    read  reread    read    write
    102400       4   21526    21546    9848    9788    9547    21010
    102400     512   43325    43320  108983  109771  109912    42450
    102400   16384   42912    42980  107567  107679  107656    42876

32 GB
    102400       4   21506    21871   10587   10589   10285    21568
    102400     512  119884   119129  120658  120204  120285   117944
    102400   16384  123984   124286  118559  118461  118260   121185

In all cases sequential and random results are the same, which means the benchmark doesn't tell the real truth since the random IO measurements are bottlenecked by transfer speeds (this applies especially to the test with ODROID's eMMC in an SD card adapter on Banana Pro since this really suffers from a slow SDIO implementation)! That means with our test setup we did not test the eMMC/SD-card individually but also tested the host's interface between SoC and storage. If you get these numbers in passive benchmarking mode you would now simply stop, create graphs and make a blog post ('benchmarking gone wrong' as usual). In active benchmarking mode you try to understand what's happening:

Random IO not visible due to the test file size being too large: we need more tests with lower file sizes.
The random IO results above do not reflect the IOPS possible since even at 4K record size sequential and random numbers are already identical. So we need to decrease the record sizes too (1k, 2k, 4k).
Also the 32 GB eMMC does not show higher random IO values; they only look better since sequential transfer speed is higher (more parallelism on the larger eMMC modules) and random IO gets bottlenecked at higher numbers (but is still 100% bottlenecked).

And the results we would get from a new round of tests have to be put in relation to reality (benchmarks per se are useless).
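If you want to do that translation yourself it's a one-liner (a sketch; 'iops' is just a hypothetical helper name, the example values are the ones from above):

# Translate an iozone result from KB/s into IOPS: divide by the record size in KB
iops() { awk -v kbps="$1" -v reclen="$2" 'BEGIN { printf "%.0f\n", kbps / reclen }'; }

iops 13850 4        # 4k random write:   ~3462 IOPS
iops 109911 512     # 512k random write: ~215 IOPS
iops 179944 16384   # 16M random write:  ~11 IOPS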
If you measure for example that your eMMC above shows 21568 KB/s at 4K (~5400 IOPS, that's already way more than your cheap SSD!) and then repeat the test with 2K and 1K and now get 15000KB/s (7500 IOPS at 2K) and 12000KB/s at 1K (here IOPS and KB/s are the same for obvious reasons), then you see the real potential of the eMMC. Does this matter for real world situations? Only if we remain in active benchmarking mode. If you know that your storage implementation jumps from 5400 IOPS at 4K to 12000 IOPS at 1K and you use a filesystem with 4K blocksize, well then...

The whole stuff is nothing for the average user. But stuff like this is something for us as Armbian community/project: raising awareness of the issues that really matter (that means constantly repeating that 99% of all benchmark numbers you find on the internet are crap; if it's about SBCs it gets close to 100%), and coming up with better defaults and enabling our users to choose the optimal devices for their use case. That's IMO the biggest advantage of Armbian: it enables you to choose between a huge variety of different SBC without having to fear that the software sucks. So you choose the device that perfectly fits your needs. On a beefy server you can throw in more hardware if you realize that you're bottlenecked. That's different on these SBCs: there you have to choose the combination of device and peripherals wisely, and then the correct settings, always having the use case in mind.
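Picking up the blocksize remark above: checking and choosing the filesystem block size is easy enough (a sketch with a hypothetical device name, ext4 as an example; 1k blocks only pay off if the workload really consists of such small accesses):

# Show the block size of an existing ext4 filesystem
tune2fs -l /dev/mmcblk0p1 | grep 'Block size'

# Create a filesystem with 1k blocks instead of the usual 4k (destroys existing data!)
mkfs.ext4 -b 1024 /dev/mmcblk0p1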
tkaiser Posted September 1, 2016

Now a real world example showing where you end up with the usual 'benchmarking gone wrong' approach. Imagine you need to set up a web server for static contents only: 100 GB of pure text files (not that realistic, but just to show you how important it is to look closer). The server will be behind a leased line limited to 100 Mbits/sec. Which SBC to choose?

The usual benchmark approaches tell you to measure sequential transfer speeds and nothing else (which is OK for streaming use cases where DVD images several GB in size are provided, but which is absolutely useless when we're talking about accessing 100 GB of small files in a random fashion -- the 'web server use case'). Then the usual benchmark tells you to measure throughput with iperf (web serving of small files is about latency, which is quite the opposite) and some silly moronic stuff measuring how fast your web server is on the loopback interface, with server and test tool on the same machine, not testing the network at all (how does that translate to any real world web server usage? Exactly: not at all).

If we rely on the passive benchmarking numbers and keep in mind that we have to serve 100 GB at a reasonable cost, we end up thinking about an externally connected HDD and a board with GbE (since iperf numbers look many times faster than Fast Ethernet) and a board that shows the highest page request numbers when testing on the local machine (where the whole 'benchmark' turns into a multi-threaded CPU test and has nothing to do with web serving at all). Please don't laugh, but that's how the usual SBC comparisons deal with this.

So from the list above you choose as storage implementation an external 500 GB HDD since USB performance looks ok-ish with all boards (+30 MB/s), and NanoPi M3 since iperf numbers look nice (GbE) and, most importantly, it will perform best on the loopback interface since it has the most and the fastest CPU cores. This way you end up with a really slow implementation since accessing these files is more or less only random IO. The usual 2.5" notebook HDD you use on the USB port achieves less than 100 IOPS (see the above result for the USB HDD on Banana Pro with a UASP incapable enclosure). By looking at iperf performance on the GbE interface you also overlooked that your web server is bottlenecked by the leased line at 100 Mbits/sec anyway.

What to do? Use HTTP transport stream compression, since text documents show a compression ratio of more than 1:3, many even 1:10 (every modern web server and most browsers support this). With this activated NanoPi now reads the text documents from disk and compresses them on the fly, and based on a 1:3 compression ratio we can stream 300 Mbits/sec through our 100 Mbits/sec line. Initially accessing files is still slow as hell (the lowest random IO performance possible by choosing a USB HDD) but at least once a file has been read from disk it can saturate the leased line.

So relying on passive benchmarking we chose a combination of devices (NanoPi M3 + 500 GB HDD) that costs +100$ considering also shipping/taxes and is slow as hell for the use case in question. If we stop relying on passive benchmarking, really look at our use case and switch on our brain we can not only save a lot of money but also improve performance by magnitudes.
With an active benchmarking approach we identify the bottlenecks first:

Leased line with 100 Mbits/sec only: we need to use HTTP content-stream compression to overcome this limitation
Random access to many files: we need to take care of random IO more than sequential transfer speeds
We need to tune our network settings to make the most out of the situation. Being able to use the most recent kernel version is important!
We're on an SBC and have to take care of CPU resources: so we use a web server with minimal resource usage and should find a way to avoid reading uncompressed contents from disk only to compress them on the fly, since this wastes CPU resources

So let's take an approach that would look horribly slow in the usual benchmarks but improves performance a lot: an Orange Pi One together with a Samsung EVO 64 GB as hardware, mainline kernel + btrfs + nginx + gzip_static configuration. Why and how does this work?

Orange Pi One has only Fast Ethernet and not GbE. Does this matter? Nope, since our leased line is limited to 100 Mbits/sec anyway
we know that the cheap EVO/EVO+ with 32/64 GB perform excellently when it's about random reads. At 4K we get 875 IOPS (3500 KB/s, see the comparison of results), that's 8 times faster than using an external USB HDD
we use pre-compressed contents: a cron job compresses each and every one of our static files and creates a compressed version with a .gz suffix (see the sketch below); if nginx communicates with browsers capable of that, it delivers the already compressed contents directly (no CPU cycles wasted, and if we configure nginx with the sendfile option not even time in userspace is wasted since the kernel shoves the file directly to the network interface!). Combine the sequential read limitation of SD cards on most boards (~23MB/s) with a 1:3 compression ratio and you end up at ~70MB/s with this trick. Twice as fast as uncompressed contents on a USB disk
unfortunately we would also need the uncompressed data on disk since some browsers (behind proxies) do not support content compression. How to deal with that? By using mainline kernel, btrfs and btrfs' own transparent file compression. So the 'uncompressed' files are also compressed, just at a lower layer, and while we now have each and every file twice on disk (SD card in fact) we only need 50 GB storage capacity for 100 GB of original contents based on a 1:3 compression ratio. Sequential reads are still twice as fast since decompression happens on the fly
not directly related to the filesystem, but by tweaking network settings for low latency and many concurrent connections we might be able to improve requests per second when many clients access in parallel by a factor of 2 as well, compared to an old smelly Android 3.x kernel we still have to use on many SBC (relationship with storage: if we tune network settings this way we need storage with high IOPS even more)

An Orange Pi One together with an EVO 64GB costs a fraction of NanoPi M3 + USB HDD, consumes nearly nothing and is magnitudes faster for the 'static files web server' use case if set up correctly. While the usual moronic benchmarks testing CPU horsepower, GbE throughput and sequential speeds would show exactly the opposite.
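A minimal sketch of the pre-compression part mentioned above (paths and file patterns are placeholders; it assumes nginx was built with the gzip_static module and that /var/www sits on the btrfs filesystem):

# Cron job: create .gz versions next to every static file so nginx
# ('gzip_static on;' together with 'sendfile on;') can ship them
# without compressing anything per request
find /var/www -type f \( -name '*.html' -o -name '*.css' -o -name '*.js' -o -name '*.txt' \) \
    -exec gzip -kf9 {} +

# The 'uncompressed' copies get transparent btrfs compression instead, e.g. via /etc/fstab:
#   /dev/mmcblk0p1  /var/www  btrfs  compress=lzo,noatime  0  0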
And you get this reduction in costs and this increase in performance just by no longer believing in all these 'benchmarking gone wrong' numbers spread everywhere and by switching to active benchmarking: testing the stuff that really matters, checking how that correlates with reality (your use case and the average workload) and then setting things up the right way.

Final note: of course an Orange Pi One is not the perfect web server due to its low amount of DRAM. The best way to overcome slow storage is to avoid accessing it at all. As soon as files are in Linux' filesystem cache the speed of the storage implementation doesn't matter any more. So with our web server use case in mind: if we do further active benchmarking and identify a set of files that are accessed most frequently we could add another Orange Pi One and a Pine64+ with 2GB. The new OPi One acts as load balancer and SSL accelerator for the second OPi One, the Pine64+ does SSL encryption on its own and holds the most frequently accessed 1.7 GB in RAM ('grep -r foobar /var/www' at startup in the background -- please keep in mind that this is still +5 GB in reality if we're talking about a 1:3 compression ratio. Simply by switching on our brain we get 5GB of contents cached in memory on a device that features only 2 GB physical RAM!). And the best part: both new boards do not even need local storage since they can be FEL booted from our first OPi One.
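The cache-warming trick from the last paragraph really is a single line at startup (a sketch; the path is the one used above, which subset of files counts as 'hot' is an assumption):

# e.g. at the end of /etc/rc.local: read the frequently accessed contents once in the
# background so they end up in Linux' filesystem cache before the first client asks
grep -r foobar /var/www > /dev/null 2>&1 &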
tkaiser Posted September 2, 2016 Author Posted September 2, 2016 Benchmark update using Hardkernel's 8GB eMMC module (the 'slowest' one): random random kB reclen write rewrite read reread read write 1024 1 35617 35215 1512415 1841116 1239598 35344 1024 2 35507 35198 1292961 1646334 1687075 35648 1024 4 36724 35185 1380217 1784509 1735476 35779 102400 1 29633 32851 2202289 2110115 1472110 34801 102400 2 35134 35175 2207269 2013723 1885752 35400 102400 4 35417 35990 2525770 2509185 1907850 35808 102400 16 33307 28865 1881489 1982886 1652146 35368 102400 512 35525 35771 1930324 1982236 1735916 35860 102400 1024 35311 35920 1723261 1609477 1584696 35660 102400 16384 35663 35824 1792496 1726325 1514769 35633 1024000 1 34314 35536 2383857 2426782 1569038 35260 1024000 2 35363 35579 1986551 2340796 1724408 35119 1024000 4 35747 35745 2324267 2304826 1833397 35608 That's unbelievable +35000 IOPS at 1k! And 2300 MB/s sequential read speeds!! Nope, that's just testing this eMMC in Hardkernel's SD card adapter on OS X using the most crappy SD card adapter available here. The write speeds are influenced by the SD card reader, the read speeds by OS X. Benchmarking gone wrong as usual. Same benchmark tool (iozone) on the 'host' (ODROID-C2) that can take advantage of its eMMC (two runs): random random kB reclen write rewrite read reread read write 1024 1 2511 2354 4247 4250 4251 2297 1024 1 2544 2323 4731 4752 4722 1924 1024 2 6377 5371 9333 9339 9306 5282 1024 2 6910 4622 9348 9464 9468 5356 1024 4 21320 21684 10321 10338 10324 21541 1024 4 12991 21595 10388 10358 10392 21683 102400 4 18418 19068 9962 9971 9772 16770 102400 4 17636 18918 10163 10170 10026 17350 102400 16 41393 43687 27893 27967 27767 31352 102400 16 52928 54990 27969 28023 27845 49379 102400 512 43365 43680 110151 111280 111112 35012 102400 512 43322 43833 109205 110200 110033 37354 102400 1024 42524 43946 110590 111825 111652 43173 102400 1024 43585 43619 109797 110620 110585 43282 102400 16384 44005 43925 108812 109023 108946 43949 102400 16384 43972 43956 107518 107723 107706 43914 4096000 4 48377 49500 115668 115871 4096000 4 49216 49682 116116 116321 4096000 1024 48043 49277 110049 110386 4096000 1024 49263 49738 109830 109415 Strange results since with low record sizes performance is low while sequential write and especial read speeds exceed the numbers from before. As usual: Don't trust in these numbers. Evaluate them. Based on the numbers above ODROID's eMMC modules seem to be not limited regarding random IO (always same as sequential) but seem to adopt a strategy where 'sequential vs random' is of no use any more. As an exercise for the reader. 
The same eMMC module tested with fio:

tk@odroidc2:~$ for i in 1 2 4 16 512 16384 ; do echo -e "Record size $(printf "%5s" ${i})k:\c" ; fio --rw=write --name=test --size=100M --direct=1 --bs=${i}k | grep "^ mmcblk0"; done
Record size     1k: mmcblk0: ios=0/131784, merge=0/0, ticks=0/62790, in_queue=62630, util=96.57%
Record size     2k: mmcblk0: ios=0/51138, merge=0/0, ticks=0/22800, in_queue=22710, util=95.04%
Record size     4k: mmcblk0: ios=0/24217, merge=0/0, ticks=0/5980, in_queue=5990, util=90.96%
Record size    16k: mmcblk0: ios=0/5774, merge=0/0, ticks=0/1300, in_queue=1310, util=83.55%
Record size   512k: mmcblk0: ios=0/792, merge=0/0, ticks=0/5550, in_queue=5590, util=94.71%
Record size 16384k: mmcblk0: ios=0/765, merge=0/0, ticks=0/139490, in_queue=140570, util=93.67%

tk@odroidc2:~$ for i in 1 2 4 16 512 16384 ; do echo -e "Record size $(printf "%5s" ${i})k:\c" ; fio --rw=write --name=test --size=16M --direct=1 --bs=${i}k | grep "^ mmcblk0"; done
Record size     1k: mmcblk0: ios=0/16356, merge=0/0, ticks=0/6740, in_queue=6750, util=95.03%
Record size     2k: mmcblk0: ios=0/8009, merge=0/0, ticks=0/2850, in_queue=2850, util=91.55%
Record size     4k: mmcblk0: ios=0/3605, merge=0/0, ticks=0/590, in_queue=590, util=76.52%
Record size    16k: mmcblk0: ios=0/609, merge=0/0, ticks=0/150, in_queue=150, util=57.03%
Record size   512k: mmcblk0: ios=0/48, merge=0/0, ticks=0/310, in_queue=350, util=53.23%
Record size 16384k: mmcblk0: ios=0/42, merge=0/0, ticks=0/2720, in_queue=13040, util=49.43%
tkaiser Posted September 10, 2016

Since I currently have a few disks lying around I decided to start a storage benchmark with our most beefy board in this regard. You might think I'm talking about an octa-core SBC with lots of DRAM. Nope, quite the opposite: it's just a dual-core Cortex-A9 with 1 GB DRAM: Solid-Run's Clearfog based on the ARMADA A388, a SoC specifically designed for storage and networking applications. I tested with the more expensive Clearfog Pro but the results below should be valid for the cheaper Clearfog Base too (no internal GbE switch and one mPCIe/mSATA port less than the Pro).

The Clearfogs have one USB 3.0 port (not UASP capable, at least with the current kernel), one M.2 slot and 1 or 2 MiniPCIe/mSATA slots. Both the M.2 and mSATA slots can be turned into normal SATA 3.0 ports by using simple/cheap mechanical converters (like this or that). I also tested whether cheap JMB321 based SATA port multipliers work: they don't (might be fixable with kernel patches, but to be honest using such a crappy PM with a Clearfog is no good idea anyway since these cheap PMs are slow and prone to severe data corruption when overheating).

I tested all 5 disks connected to the M.2 slot via a mechanical SATA converter (the connection is always established with the highest SATA version the drive supports), and the USB3 results are made with a JMS567 enclosure for the 3 SSDs and the 2.5" notebook HDD and an ASMedia ASM1051E based enclosure for the 3.5" Seagate Barracuda.

The disks:
Samsung EVO 750 with 120 GB, planar NAND
Samsung PM851 with 128 GB (EVO 840 OEM version using different firmware)
Samsung EVO 840 with 120 GB, '3D NAND'
Seagate Barracuda 7200.14 with 3TB, rotating at 7200 rpm
Seagate Momentus 5400.2 with 60GB: an old and boring 2.5" notebook HDD rotating at 5400 rpm

The Samsung SSDs are somewhat special since they implement 'TurboWrite': there's a smaller buffer on the SSD behaving like (expensive) MLC NAND, but as soon as this buffer is full, write speeds especially on the smaller SSDs drop down to pretty low values (for details see here).

For the test I used these two iozone calls:

iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2
iozone -a -g 4000m -s 4000m -i 0 -i 1 -r 4K -r 1024K

The first test simply uses a 100 MB test file size, iterates through 4K - 16M record sizes and also tests random IO. Since iozone lists random IO not as IOPS but as KB/s I 'translated' the values as I did before. The second iozone call simply reads/writes a 4GB file, one time with 4K record size, the other with 1M:

                Random IO in IOPS                Sequential IO in MB/s
                SATA             USB3            SATA           USB3
                4K read/write    4K read/write   1M read/write  1M read/write
EVO 750         6995/9008        4898/6073       341 / 214      260 / 169
PM851           6154/8621        4452/6240       339 / 133      254 / 134
EVO 840         10148/19184      6207/8734       507 / 116      250 / 146
Barracuda        288/4730         282/3730       172 / 182      152 / 171
Momentus         133/3528         131/3810        38 / 40        38 / 40

Results?
Obviously SATA 3.0 and USB 3.0 performance of the ARMADA 38x is pretty high: read speeds exceed 500 MB/s (and it's confirmed by Solid-Run engineers that Clearfog Pro can max out 3 x SATA 3.0 in parallel with fast SSDs)
The 2 HDDs do not get bottlenecked that much when accessed via USB 3.0 (in fact for the slow notebook HDD nothing changes at all, and the faster Barracuda is also not affected that much, both regarding random and sequential IO)
The really fast SSDs are heavily affected when it's about random IO: the fast EVO 840 is only half as fast behind a USB-to-SATA bridge compared to SATA (this might improve if/when ARMADA 38x is able to support UASP)
Sequential reads with USB 3.0 get bottlenecked at around 250-260MB/s (same SSD in same enclosure on a UASP capable MacBook running OS X: +400 MB/s)
Sequential write performance seems to be weird (EVO 840 for example being faster when USB 3.0 is used?)

So why does write performance look strange? Because my benchmark attempt sucks. The tester knows that these SSDs implement TurboWrite, therefore we need to test that individually to get the whole idea of how these disks behave. It's also important to know that for all 3 SSD families the members with the small capacities perform worse when it's about sequential writes, and even by choosing the slightly larger 250/256GB models write performance might be magnitudes faster (and then probably already gets bottlenecked by the host's SATA implementation!)

So let's look closer at how the 3 SSDs perform with 1GB, 2GB and 3GB test sizes (so that we're either not exceeding the TurboWrite buffer size at all, or only at the rewrite stage with the EVO 750 which has the largest buffer):

              kB  reclen   write  rewrite    read  reread
EVO 750
         1024000    1024  365707   367293  371606  374167
         2048000    1024  361948   222524  329422  358586
         3072000    1024  360469   141784  345507  361666
PM851
         1024000    1024  151648   150172  350515  377275
         2048000    1024  142591   130159  337665  353065
         3072000    1024  139294   133203  347399  356333
EVO 840
         1024000    1024  264257   423689  532733  545420
         2048000    1024  293179   418408  514973  517536
         3072000    1024  183027   239014  498971  514133

(so by testing with smaller amounts of data the sequential write performance is both higher and starts to vary a lot more between the 3 SSDs.
But still strange why EVO 840 gets close to SATA 3.0 maximum with sequential reads while the 2 other remain at ~360MB/s here since they're known to be able to also reach/exceed 500 MB/s) All measurements here in detail: Samsung EVO 750 120GB SATA: random random kB reclen write rewrite read reread read write 102400 4 49312 44849 57951 61595 27979 36032 102400 16 107381 99088 132436 84990 63057 63580 102400 512 199633 200496 151926 107135 110112 199579 102400 1024 164322 131950 206634 207893 199718 175748 102400 16384 149019 215600 352129 361961 361936 249733 4096000 4 259973 138573 379219 394284 4096000 1024 214324 139364 341349 354944 USB3: random random kB reclen write rewrite read reread read write 102400 4 25455 28627 29593 27165 19594 24292 102400 16 66261 73271 86946 59871 49397 54845 102400 512 165606 149851 119594 120572 115886 110490 102400 1024 111329 109174 167747 227486 176693 167386 102400 16384 195114 159980 268971 267432 271965 280738 4096000 4 195354 136365 261331 271969 4096000 1024 168939 132854 260745 264737 smartctl 6.5 2016-01-24 r4214 [armv7l-linux-3.10.102-marvell] (local build) Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Device Model: Samsung SSD 750 EVO 120GB Firmware Version: MAT01B6Q User Capacity: 120,034,123,776 bytes [120 GB] Sector Size: 512 bytes logical/physical Rotation Rate: Solid State Device Form Factor: 2.5 inches Device is: Not in smartctl database [for details use: -P showall] ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Fri Sep 9 22:29:29 2016 UTC SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 0) seconds. Offline data collection capabilities: (0x53) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 64) minutes. SCT capabilities: (0x003d) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. 
SMART Attributes Data Structure revision number: 1 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 22 12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 41 177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 1 179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0 181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0 182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0 183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0032 072 056 000 Old_age Always - 28 195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0 199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0 235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 28 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 368121129 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing 255 0 65535 Read_scanning was never started Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. Samsung PM851 128GB SATA: random random kB reclen write rewrite read reread read write 102400 4 47613 42879 72265 46837 24618 34483 102400 16 103046 95712 120942 82087 59907 68453 102400 512 130751 129812 106046 106573 104089 126059 102400 1024 130931 125081 168525 207537 200797 131278 102400 16384 122589 127418 218041 194790 326651 127126 4096000 4 137607 136550 374350 398187 4096000 1024 132604 136547 339007 354444 USB3: random random kB reclen write rewrite read reread read write 102400 4 27692 26188 29295 27991 17810 24962 102400 16 66344 61812 76393 62761 47696 53631 102400 512 121471 124919 209073 129169 153420 132398 102400 1024 111665 108868 227881 228360 215136 132017 102400 16384 139880 136557 217382 168550 244895 138823 4096000 4 135197 134535 262615 271436 4096000 1024 134266 133685 254154 264385 smartctl 6.5 2016-01-24 r4214 [armv7l-linux-3.10.102-marvell] (local build) Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Device Model: SAMSUNG MZ7TE128HMGR-00004 Firmware Version: EXT0200Q User Capacity: 128,035,676,160 bytes [128 GB] Sector Size: 512 bytes logical/physical Rotation Rate: Solid State Device Device is: Not in smartctl database [for details use: -P showall] ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Fri Sep 9 22:37:11 2016 UTC SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. 
Total time to complete Offline data collection: ( 0) seconds. Offline data collection capabilities: (0x53) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 70) minutes. SCT capabilities: (0x003d) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 1 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 5 Reallocated_Sector_Ct 0x0033 100 100 000 Pre-fail Always - 0 9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 13 12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 10 177 Wear_Leveling_Count 0x0013 100 100 000 Pre-fail Always - 100 179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 100 181 Program_Fail_Cnt_Total 0x0032 100 100 000 Old_age Always - 0 182 Erase_Fail_Count_Total 0x0032 100 100 000 Old_age Always - 0 183 Runtime_Bad_Block 0x0013 100 100 000 Pre-fail Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0032 073 066 000 Old_age Always - 27 195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0 199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0 235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 4 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 87499877 242 Total_LBAs_Read 0x0032 099 099 000 Old_age Always - 84185902 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short captive Completed without error 00% 0 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. 
Samsung EVO 840 120GB SATA: random random kB reclen write rewrite read reread read write 102400 4 69959 104711 113108 113920 40591 76737 102400 16 166789 174407 172029 215341 123020 159731 102400 512 286833 344871 353944 304479 263423 269149 102400 1024 267743 269565 286443 361535 353766 351175 102400 16384 347347 327456 353394 389994 425475 379687 4096000 4 178889 229355 514766 524118 4096000 1024 116531 144489 507594 515497 USB3: random random kB reclen write rewrite read reread read write 102400 4 28443 35069 35333 35517 24827 34943 102400 16 89961 103490 102403 102855 76218 104573 102400 512 194544 207168 205707 208724 195665 210345 102400 1024 213667 213615 207489 212151 204601 217966 102400 16384 334480 336913 288879 302995 302626 316323 4096000 4 204026 207895 251612 253293 4096000 1024 146072 147508 250735 257175 smartctl 6.5 2016-01-24 r4214 [armv7l-linux-3.10.102-marvell] (local build) Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Samsung based SSDs Device Model: Samsung SSD 840 EVO 120GB Firmware Version: EXT0BB0Q User Capacity: 120,034,123,776 bytes [120 GB] Sector Size: 512 bytes logical/physical Rotation Rate: Solid State Device Device is: In smartctl database [for details use: -P show] ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Sat Sep 10 05:22:48 2016 UTC SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 4200) seconds. Offline data collection capabilities: (0x53) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 70) minutes. SCT capabilities: (0x003d) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. 
SMART Attributes Data Structure revision number: 1 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 2995 12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 555 177 Wear_Leveling_Count 0x0013 097 097 000 Pre-fail Always - 33 179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0 181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0 182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0 183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0 187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0032 076 056 000 Old_age Always - 24 195 ECC_Error_Rate 0x001a 200 200 000 Old_age Always - 0 199 CRC_Error_Count 0x003e 099 099 000 Old_age Always - 67 235 POR_Recovery_Count 0x0012 099 099 000 Old_age Always - 363 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 11550661943 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 0 - # 2 Extended offline Completed without error 00% 296 - # 3 Short offline Completed without error 00% 21 - # 4 Short offline Completed without error 00% 0 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. Seagate Barracuda 7200.14 SATA: random random kB reclen write rewrite read reread read write 102400 4 15141 15241 15600 15677 533 14114 102400 16 33222 33098 36692 35654 1999 30093 102400 512 38373 37811 37025 37454 21181 38363 102400 1024 38404 37946 34787 34154 26649 38800 102400 16384 37804 37826 37997 38469 37217 37720 4096000 4 40422 39703 38836 38268 4096000 1024 40438 39757 38759 39286 USB3: random random kB reclen write rewrite read reread read write 102400 4 15491 16891 18276 19714 526 15242 102400 16 35054 33152 39822 39850 2011 30548 102400 512 39057 38851 40047 40074 23421 39925 102400 1024 38193 39700 40253 40106 28357 38624 102400 16384 38957 39678 39474 39498 38679 38770 4096000 4 40389 39791 38807 39390 4096000 1024 40472 39904 38923 39392 smartctl 6.5 2016-01-24 r4214 [armv7l-linux-3.10.102-marvell] (local build) Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Seagate Barracuda 7200.14 (AF) Device Model: ST3000DM001-9YN166 Serial Number: S1F0NSKL LU WWN Device Id: 5 000c50 05146d559 Firmware Version: CC4B User Capacity: 3,000,592,982,016 bytes [3.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Device is: In smartctl database [for details use: -P show] ATA Version is: ATA8-ACS T13/1699-D revision 4 SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Sat Sep 10 15:28:43 2016 UTC ==> WARNING: A firmware update for this drive may be available, see the following Seagate web pages: http://knowledge.seagate.com/articles/en_US/FAQ/207931en http://knowledge.seagate.com/articles/en_US/FAQ/223651en SMART support is: Available - device has SMART capability. 
SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED See vendor-specific Attribute list for marginal Attributes. General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 584) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 333) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x3085) SCT Status supported. SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 196751920 3 Spin_Up_Time 0x0003 092 092 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 345 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 066 060 030 Pre-fail Always - 4127450 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 290 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 141 183 Runtime_Bad_Block 0x0032 055 055 000 Old_age Always - 45 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 080 000 Old_age Always - 37 41 80 189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0022 074 043 045 Old_age Always In_the_past 26 (0 7 26 26 0) 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 62 193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1151 194 Temperature_Celsius 0x0022 026 057 000 Old_age Always - 26 (128 0 0 0 0) 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 194 000 Old_age Always - 275 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 160h+39m+18.444s 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 35603265679303 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3061425655745 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. 
Seagate Momentus 5400.2 SATA: random random kB reclen write rewrite read reread read write 102400 4 15141 15241 15600 15677 533 14114 102400 16 33222 33098 36692 35654 1999 30093 102400 512 38373 37811 37025 37454 21181 38363 102400 1024 38404 37946 34787 34154 26649 38800 102400 16384 37804 37826 37997 38469 37217 37720 4096000 4 40422 39703 38836 38268 4096000 1024 40438 39757 38759 39286 USB3: random random kB reclen write rewrite read reread read write 102400 4 15491 16891 18276 19714 526 15242 102400 16 35054 33152 39822 39850 2011 30548 102400 512 39057 38851 40047 40074 23421 39925 102400 1024 38193 39700 40253 40106 28357 38624 102400 16384 38957 39678 39474 39498 38679 38770 4096000 4 40389 39791 38807 39390 4096000 1024 40472 39904 38923 39392 smartctl 6.5 2016-01-24 r4214 [armv7l-linux-3.10.102-marvell] (local build) Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Seagate Momentus 5400.2 Device Model: ST96812AS Firmware Version: 7.01 User Capacity: 60,011,642,880 bytes [60.0 GB] Sector Size: 512 bytes logical/physical Device is: In smartctl database [for details use: -P show] ATA Version is: ATA/ATAPI-7 (minor revision not indicated) Local Time is: Sat Sep 10 06:08:43 2016 UTC SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 60) seconds. Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. No General Purpose Logging support. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 84) minutes. SCT capabilities: (0x0001) SCT Status supported. 
SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000e 100 253 006 Old_age Always - 0 3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 099 099 020 Old_age Always - 1190 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 065 060 030 Pre-fail Always - 3491083 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 864 10 Spin_Retry_Count 0x0013 100 100 034 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 167 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0022 076 056 045 Old_age Always - 24 (Min/Max 20/24) 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 68 193 Load_Cycle_Count 0x0032 097 097 000 Old_age Always - 6221 194 Temperature_Celsius 0x0022 024 044 000 Old_age Always - 24 (0 17 0 0 0) 195 Hardware_ECC_Recovered 0x001a 119 055 000 Old_age Always - 1416015 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0 202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age Always - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. Crosscheck with Banana Pro To see how the same disks perform with another device we support I used Banana Pro (you can use any other A20 based device, results will only differ slightly based on DRAM clockspeeds since IO performance with tablet SoCs depends on both clockspeed and slightly also memory bandwidth). I chose mainline kernel and the only difference compared to 'stock Armbian' was to use performance governor since currently we use 4.6 with vanilla images and the new schedutil cpufreq governor (showing superiour IO performance compared to ondemand now) will be available starting with 4.7: Random IO in IOPS Sequential IO in MB/S SATA USB2 SATA USB2 4K read/write 4K read/write 1M read/write 1M read/write EVO 750 2779/893 1277/619 106 / 39 34 / 35 PM851 2562/859 1283/614 109 / 39 34 / 34 EVO 840 3943/3478 1535/1526 122 / 37 34 / 34 Barracuda 259/728 237/600 104 / 38 34 / 34 Momentus 126/716 120/598 38 / 38 33 / 35 Results? A20's SATA implementation is only 2.0 and the SoC shows for whatever reasons a pretty limited write throughput (that's why all write tests perform identical, the ~38 MB/s are a simple host limit) A20's sequential SATA read speed is also limited but seems to depend a bit on the device in question (see EVO840's better score here) The random IO values are way lower but that's partially since at just 4K record size random IO gets tampered with interface limitations. Anyway if we look again at EVO 840 then it's 10148/19184 read/write IOPS with ARMADA 38x vs. just 3943/3478. 
This difference is huge, and with smaller record sizes it would be even larger! When having to rely on USB 2.0 all disks perform identically as far as sequential IO is concerned, since USB 2.0 is the bottleneck here. USB 2.0 also lowers random IO values: the faster the disk, the more it gets slowed down (with the ultra slow notebook HDD it makes almost no difference).

We also have to keep in mind how HDDs work: they use ZBR (zone bit recording) and store more data on the outer tracks, so they're faster as long as little capacity is used. For the Momentus tested above that means the disk is limited to 40 MB/s only when empty. As soon as data gets stored, sequential transfer speeds decrease, and most disks show just half the speed when written completely full. Fragmentation also becomes an issue on HDDs and slows things down further. This just being said since so many people start whining when they hear A20 devices are only able to write with ~40MB/s. Once you actually use the disk (put data on it) that's fine for most 2.5" disks anyway since the SATA interface is not the bottleneck (and 99 percent of all disk benchmarks unfortunately don't tell the truth since only empty disks get measured).

Anyway: I wanted to do this test for a long time since it's quite interesting how different disk devices interact with the host's SATA implementation (on ARM boards that is -- in x86 land nothing interesting happened in the last decade since all SATA host implementations are fast enough). This is something one should keep in mind when stumbling across 'SBC storage benchmarks' on the net: different disks (and, when USB is involved, different USB-to-SATA bridges) massively influence the storage performance attributed to the host.

Final words regarding the Clearfogs: they're known to show both excellent IO performance and network performance in parallel, even if the SoC is just a 'dual core ARMv7'. When comparing with the results from above (quad- and octa-core Cortex-A53 tablet/OTT SoCs) it's pretty easy to get the idea what really matters in this area. It's definitely not CPU horsepower but the SoC being optimized for the use case.

Update: Tried it also with kernel 4.4.20 but still no UASP and way lower performance -- kernel 3.10.102 as used for the tests above should be preferred when it's about storage performance:

root@armada:~# dmesg | egrep -i "usb|uas"
[   14.160047] orion-ehci f1058000.usb: EHCI Host Controller
[   14.160140] orion-ehci f1058000.usb: new USB bus registered, assigned bus number 5
[   14.160602] orion-ehci f1058000.usb: irq 41, io mem 0xf1058000
[   14.176679] orion-ehci f1058000.usb: USB 2.0 started, EHCI 1.00
[   14.177198] usb usb5: New USB device found, idVendor=1d6b, idProduct=0002
[   14.177208] usb usb5: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[   14.177214] usb usb5: Product: EHCI Host Controller
[   14.177220] usb usb5: Manufacturer: Linux 4.4.20-marvell ehci_hcd
[   14.177225] usb usb5: SerialNumber: f1058000.usb
[   14.198019] hub 5-0:1.0: USB hub found
[  966.388102] usb 4-1: new SuperSpeed USB device number 2 using xhci-hcd
[  966.409159] usb 4-1: New USB device found, idVendor=152d, idProduct=3562
[  966.409170] usb 4-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[  966.409176] usb 4-1: Product: AD TO BE II
[  966.409182] usb 4-1: Manufacturer: ADMKIV
[  966.409188] usb 4-1: SerialNumber: DB123456789699
[  966.413288] usb 4-1: USB controller f10f8000.usb3 does not support streams, which are required by the UAS driver.
[  966.413300] usb 4-1: Please try an other USB controller if you wish to use UAS.
[  966.413308] usb-storage 4-1:1.0: USB Mass Storage device detected
[  966.417544] scsi host4: usb-storage 4-1:1.0
[  966.487731] usbcore: registered new interface driver uas
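For anyone reproducing this: a quick way to check whether a USB disk actually ended up on the uas driver or fell back to plain USB Mass Storage (generic commands, nothing board specific assumed):

# 'Driver=uas' means UAS is active, 'Driver=usb-storage' means old Bulk-Only Transport
lsusb -t

# the kernel log also tells which driver claimed the device and why UAS was refused
dmesg | egrep -i 'uas|usb-storage'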
rodolfo Posted September 12, 2016 Posted September 12, 2016
Here's an interesting benchmark from a low power storage solution for OPI ONE:

Command line used: iozone -e -I -a -s 10m -r 4k -r 16k -r512k -i 0 -i 1 -i 2

                                                         random   random
      kB  reclen    write  rewrite     read   reread       read    write
   10240       4     3946     4287     9421     7887       5294     3984
   10240      16    11740    11651    21115    21087      14994     8719
   10240     512    24376    24454    35105    35191      32879    24729

Beats (by half a bit/s) SD cards in performance, eMMC in price and any SSD or HDD in power consumption. Great for power optimized systems running on battery. SanDisk ULTRA FIT USB3 Flash Drive. Tested with Armbian 5.16, rootfs on /dev/sda1 (USB flash). Max. power consumption (incl. WiFi, flash) during testing was 550mA, board running from battery.
tkaiser Posted September 12, 2016 Author Posted September 12, 2016
On 12.9.2016 at 0:05 PM, rodolfo said:
Here's an interesting benchmark from a low power storage solution for OPI ONE

Thanks for the numbers. OPi ONE means H3, which means USB 2.0 sequential transfer speeds bottlenecked by the host implementation (~35MB/s with legacy, ~40MB/s with vanilla kernel if UASP can be used -- I haven't found any thumb drive so far supporting that). It should also be noted that benchmarks with just 10 MB file size might not tell the truth since many USB thumb drives are prone to throttling: they perform nicely for 2-3 minutes and then drop down to laughable performance numbers (I've seen thumb drives slowing down from 80 MB/s to 2.x MB/s).

Anyway, back to 'real storage'. Since we now support another family of SBCs with native SATA (i.MX6 quad core based boards) let's have a look at this platform too. The Freescale/NXP i.MX6 features SATA 2.0, USB 2.0 and native GbE (the latter limited to ~400 Mbits/sec due to internal bus limitations). When I started testing IO throughput on SBCs I went with a Cubietruck and a Wandboard Quad (with kernel 3.0.35); SATA throughput was limited to 100/90 MB/s back then. Since we know that random IO matters a lot more for many use cases, let's have a look at how the Wandboard performs in this regard too, also using a recent kernel (I used 4.7.3-armv7-x2 and followed Robert C Nelson's instructions at https://eewiki.net/display/linuxonarm/Wandboard since we don't support the Wandboard in Armbian yet). So while I tested on an unsupported board, the results should be valid for all quad-core i.MX6 boards using vanilla kernel (only the quad-core i.MX6 has SATA!).

                 Random IO in IOPS          Sequential IO in MB/s
                 SATA          USB2         SATA        USB2
                 4K read/write              1M read/write
EVO 750          3138/2980     1594/1523    100 / 100   29 / 25
PM851            3056/3149     1473/1417    100 / 100   29 / 25
EVO 840          4141/5073     1664/1983    100 / 100   25 / 25
Barracuda         268/2330      245/1146    100 / 100   25 / 22
Momentus          130/2109      126/1273     38 /  40   27 / 25

Sequential USB and SATA speeds were not tested individually for every disk since the host's implementation is the bottleneck here and it would just be a waste of time; I only tested a few devices and used the appropriate values for all other sequential numbers above.

Interpreting the results:

Sequential SATA speeds are limited by i.MX6, you won't exceed 100 MB/s read/write.
USB 2.0 performance with i.MX6 is rather low; if you attach disks here, better choose SATA.
For whatever reason the EVO840 in the JMS567 enclosure shows 4MB/s less USB 2.0 read performance (keep such differences in mind if someone presents you a collection of 'SBC storage results' made with different disks and different enclosures and attributes the numbers to the SBC in question, forgetting about the influence the disk and the USB-to-SATA bridge have!)
i.MX6's USB controller seems to dislike the ASM1051E USB-to-SATA bridge used to test the 3.5" Seagate Barracuda; sequential transfer speeds are pretty low (again: keep this in mind when stumbling across 'SBC storage benchmarks' made with unknown equipment).
Random IO results with the SSDs are not as good as with the Clearfog but look way better than on A20. Main reason: random IO gets hampered by the sequential interface limitation, so if 1K chunks instead of 4K had been used, the random IO results of the 3 different SATA implementations would not differ as much as they do at 4K.

The latter is something we (as distro people) should keep in mind when choosing our defaults.
If we deal with SSDs, mainline kernel and btrfs for example, then due to both USB and SATA interface limitations choosing a smaller block size than the default 4K might result in way better performance when dealing with a lot of small files (to be confirmed with real world workloads since synthetic benchmarks tend to be misleading).

Now it's time to come to some conclusions regarding Armbian devices and disk usage. I do not differentiate between individual boards that much since all devices using the same SoC will perform more or less identically, unless memory bandwidth differs a lot (eg. NanoPi NEO with just 408 MHz while all other H3 devices use 624 MHz DRAM clockspeed -- to be confirmed) or strange design decisions limit the board's IO capabilities (eg. Orange Pi Plus and Plus 2 having an ultra slow USB-to-SATA bridge on the board limiting sequential transfer speeds to just 15/30 MB/s, while all 4 USB ports are behind an internal USB hub, so every cheap H3 device exposing USB ports directly outperforms these 2 more expensive H3 devices).

Armada 38x (Clearfog Pro/Base currently): Best storage performance, able to max out multiple SATA 3.0 lines in parallel, high random IO values (depends on the disks used). USB 3.0 is somewhat limited but for HDDs that's no deal breaker. Using mechanical adapters to convert M.2 and mPCIe/mSATA slots into real SATA ports you can attach 2 or 3 (Clearfog Pro) high performance SATA disks; the USB3 port is ok for one slower SSD or up to 2 3.5" or 3 2.5" disks (using a USB3 hub in between of course).

Allwinner A20 (see list of recommended devices): SATA throughput is limited but ok-ish, especially when used together with 2.5" disks. USB 2.0 performance is ok-ish; on most A20 boards 2 USB host ports and 1 OTG port are available (no shared bandwidth) so up to 4 disks are possible. With mainline kernel and appropriate enclosures UASP can be used and then both random and sequential performance increase (~40 MB/s).

Allwinner A10: IO performance might be as good as with A20, but since A10 lacks GbE and is a single-core SoC you might not be able to benefit from it.

NXP/Freescale i.MX6 (Hummingboard, Cubox-i, Wandboard): On boards with the quad-core CPU there's 1 SATA 2.0 port with limited sequential transfer speeds, 1 USB OTG port (usable as host port) and, depending on the board, 1 - 3 USB host ports. USB performance is pretty low and anyone thinking about NAS use cases should keep in mind that GbE is limited to ~400 Mbits/sec here (on some i.MX6 boards (m)PCIe is available which could be used to add a 2nd dedicated GbE NIC).

Allwinner H3/H5: No real SATA (the 'SATA port' on Orange Pi Plus and Plus 2 is an ultra slow USB-to-SATA bridge) but the SoC features 1 USB OTG and 3 USB host ports. It depends on the board in question how many of these ports are exposed and whether an internal USB hub is used or not. With mainline kernel UASP can be used, so GbE equipped H3/H5 boards that expose their USB ports directly make a nice low-cost NAS (covered already in a separate post).

All other boards we currently support are rather uninteresting when it's about storage performance, the ODROID-Cs being the exceptions when you need really high random IO performance and can afford their expensive eMMC modules. Some information regarding USB 3.0 performance of ODROID-XU4 is available here. Some USB3 information/numbers for Actions Semi's S500 here.
I tried to run the whole set of tests with my Roseapple Pi dev sample too but it was necessary to increase input voltage to 5.4V to prevent boot loops with USB3 disk connected (that's what you get when you can only power the board through crappy Micro USB!) and every test crashed so I consider USB3 with Roseapple Pi and LeMaker's Guitar as being broken (LeMaker's USB3 implementation broken anyway since they chose a standards violating pin scheme for the USB3 receptacle). Edit: Really impressive USB3 results with ODROID-XU4 in the meantime due to UASP enabled/fixed in HK's 4.9 kernel branch: http://xu4.keltike.de/performance/odroidxu4-with-and-without-uas-support/ All results collected with Wandboard Quad before as a reference: EVO 750 SATA random random kB reclen write rewrite read reread read write 1024 1 1674 1398 33091 139321 115147 1267 1024 2 3827 2739 36232 215040 192264 2561 1024 4 11813 9821 15210 16188 12720 7488 102400 4 13527 14588 16472 16029 12554 11921 102400 16 28064 27647 32029 31988 28320 25568 102400 512 55824 56316 46212 46300 45289 55679 102400 1024 61382 61426 52161 52336 51207 61328 102400 16384 70819 70556 93527 93750 93882 70720 EVO 750 USB2 random random kB reclen write rewrite read reread read write 1024 1 866 746 14196 137470 119096 751 1024 2 1978 1442 14393 215851 197607 1436 1024 4 5661 5311 7178 7585 6402 5268 102400 4 6260 6191 6895 6789 6377 6094 102400 16 11941 12047 14203 14111 13958 11506 102400 512 20059 19996 22195 22185 21971 20000 102400 1024 20409 20381 22814 22873 22837 20350 102400 16384 22705 22693 27661 27710 27685 22679 4096000 4 25133 25053 29143 29247 4096000 1024 25101 25044 29095 29766 PM851 SATA random random kB reclen write rewrite read reread read write 1024 1 1508 1260 31120 136081 117067 1264 1024 2 3430 2545 24522 217777 194678 2876 1024 4 10365 8721 13536 16064 16033 7020 102400 4 13347 14398 16243 16404 12223 12598 102400 16 27878 27524 32129 32262 28169 25248 102400 512 56968 56551 46874 47186 46878 56191 102400 1024 59024 58994 47177 47467 47338 58805 102400 16384 68729 68342 91944 92299 92458 68484 PM851 USB2 random random kB reclen write rewrite read reread read write 1024 1 820 740 11443 137356 118489 799 1024 2 1934 1459 12738 217590 195868 1520 1024 4 4931 4942 6595 7475 7463 4885 102400 4 6155 6104 7159 7004 5892 5667 102400 16 11622 11628 14132 14222 13436 11507 102400 512 19996 19969 22296 22353 22189 19961 102400 1024 20436 20417 22949 23034 22996 20333 102400 16384 22609 22560 27654 27729 27689 22514 EVO 840 SATA random random kB reclen write rewrite read reread read write 1024 1 3302 5266 6271 6618 6527 4957 1024 2 7764 9267 11440 12428 12227 8748 1024 4 12147 14361 18865 21864 21567 14000 102400 4 16767 20489 21208 21239 16563 20291 102400 16 44285 50872 44976 44989 39181 50656 102400 512 96689 98347 73832 74044 73459 98207 102400 1024 96753 98834 73374 73534 73313 99334 102400 16384 111089 128889 109333 109760 109738 127815 4096000 4 79955 134886 105653 106075 4096000 1024 100780 131599 100493 101337 EVO 840 USB2 random random kB reclen write rewrite read reread read write 1024 1 1702 2450 2551 2660 2657 2389 1024 2 3593 4295 4842 5229 5222 4115 1024 4 6023 6459 7245 7972 7959 6341 102400 4 7266 7902 7980 7984 6658 7934 102400 16 14178 15733 16016 16064 15996 15472 102400 512 22935 23053 26182 26255 26136 23053 102400 1024 23549 23671 27108 27131 27097 23716 102400 16384 23796 24704 28483 28517 28511 24697 4096000 4 25101 25219 25013 25113 4096000 1024 25219 25248 24378 24491 Barracuda SATA random random kB reclen 
write rewrite read reread read write 1024 1 1159 974 4970 133767 115407 878 1024 2 2858 2119 15525 207578 196621 1815 1024 4 8194 7686 10299 14218 10713 7791 102400 4 12908 12106 15717 15816 1073 9320 102400 16 24002 26542 31438 31734 3988 19055 102400 512 54652 54886 45274 45949 33188 54503 102400 1024 57768 57216 45870 46525 38651 57300 102400 16384 65332 65425 80825 82306 89452 64314 Barracuda USB2 random random kB reclen write rewrite read reread read write 1024 1 682 501 4419 137119 118642 596 1024 2 1380 1117 4574 213735 195085 1151 1024 4 3965 3724 4407 5553 5078 3780 102400 4 5128 5086 5336 5318 980 4584 102400 16 10770 10503 13145 12480 3447 9943 102400 512 19085 19026 21011 21080 17964 18770 102400 1024 19060 19117 21299 21513 19242 18787 102400 16384 22926 22850 26155 25255 25265 22868 4096000 4 22746 22640 25947 25679 4096000 1024 22581 22393 25873 25787 Momentus SATA random random kB reclen write rewrite read reread read write 1024 1 1010 777 6605 132165 115742 436 1024 2 2132 1643 5997 206998 195975 1070 1024 4 5334 3602 8201 10564 4152 4495 102400 4 7927 9441 10999 11034 522 8438 102400 16 21230 21626 25049 24639 1948 18737 102400 512 38716 37962 33651 33601 18513 38903 102400 1024 38719 37988 27804 27801 21447 38663 102400 16384 31710 32190 33497 34030 34066 32311 4096000 4 40193 38990 38438 38522 4096000 1024 40231 39216 38583 38431 Momentus USB2 random random kB reclen write rewrite read reread read write 1024 1 623 583 4782 137969 118381 380 1024 2 1435 1069 5062 214997 193608 1034 1024 4 3961 3116 5184 6381 5036 3965 102400 4 5159 5068 6354 6290 503 5091 102400 16 10902 10750 14029 13199 1875 10394 102400 512 19584 19481 21127 21117 17410 19641 102400 1024 20012 19991 21244 21380 19366 20146 102400 16384 22449 22363 26166 26183 26313 22191 4096000 4 24968 24726 28279 28311 4096000 1024 24970 24784 27559 27988
tkaiser Posted September 12, 2016 Author Posted September 12, 2016 And a final look at how HDDs work and why you should care. As already said all modern HDDs use zone bit recording, that means they show higher sequential transfer speeds on the outer tracks compared to the inner. When a HDD is empty data is written to these outer tracks so measuring a totally empty HDD will show 'best case' performance. As soon as you start to fill the HDD with data (which is obviously the only reason you attach a HDD to a host) data will be written to the inner tracks and sequential transfer speeds will slow down. How exactly depends on the drive in question (and the use case the drive has been made for). To demonstrate that I reverted back to the Clearfog Pro (SATA is no bottleneck there) and used the 2 HDD, created 10 partitions of nearly equal size and tested through the first, the last and one in the middle (emulating empty, full and half full capacity). Seagate Momentus (2.5", 5400 rpm, 60 GB): Sequential transfer speeds start with ~39 MB/s when empty, drop down to 34 MB/s when half of the capacity is used and end up at just 22 MB/s when the disk is nearly full (I used just 2 GB test file size so the write speeds are tampered slightly with Linux fs buffers): /mnt/empty 1024 1 1613 1336 7839 251656 209922 783 1024 2 3269 2298 7271 407964 361717 1573 1024 4 5657 7810 10225 15710 4060 6745 102400 4 15766 14710 15567 15627 530 14105 102400 16 32941 32630 37075 35280 2011 30115 102400 512 38435 37894 32740 29356 21070 38251 102400 1024 37622 38043 34885 34407 26804 38733 102400 16384 38131 37420 38057 38469 37374 37716 2048000 4 41877 41543 39181 39754 2048000 1024 41887 41397 39121 39723 /mnt/half-full 1024 1 1436 1134 5072 251538 210994 448 1024 2 2956 2045 8195 404429 361717 1270 1024 4 7475 5233 10238 15856 7345 7161 102400 4 15367 15266 15815 15904 535 13230 102400 16 31096 30177 34725 35191 2011 28391 102400 512 34110 33199 33635 33951 19654 33545 102400 1024 33283 33657 31310 31021 24765 34202 102400 16384 33772 33661 33714 34104 33161 33487 2048000 4 37189 35951 33942 34314 2048000 1024 37078 35895 33972 34355 /mnt/full 1024 1 1175 1018 3836 250613 209074 625 1024 2 2223 2065 5413 413900 366096 1384 1024 4 3924 6318 9634 16351 7138 4609 102400 4 15642 15316 15673 15759 500 13776 102400 16 21882 19901 23201 23448 1836 20109 102400 512 23027 22473 23357 23566 15241 22685 102400 1024 22827 22585 21969 21736 18059 22984 102400 16384 22699 22379 22583 22653 22327 22665 2048000 4 24608 23575 22234 22464 2048000 1024 24670 23518 22147 22446 Seagate Barracuda (3.5", 7200 rpm, 3 TB): Sequential transfer speeds start with ~170 MB/s when empty, increase slightly to 180 MB/s when half of the capacity is used and end up at 120 MB/s when the disk is nearly full (I again used just 2 GB test file size so the write speeds are tampered slightly with Linux fs buffers): /mnt/empty 1024 1 2459 2733 20758 450687 381781 2904 1024 2 4923 3325 17927 406843 360806 4064 1024 4 12340 12346 13771 29327 19628 10760 102400 4 45184 36575 45931 39182 1138 17316 102400 16 73604 81948 110793 76730 4236 53874 102400 512 155736 141373 100301 103667 45085 118979 102400 1024 153853 120599 135423 138861 62901 114639 102400 16384 107160 105440 120605 134529 134225 105122 2048000 4 183645 179076 171697 173854 2048000 1024 181433 182583 169158 173991 /mnt/half-full 1024 1 2928 2510 20796 252099 209706 1600 1024 2 4733 3242 13781 410616 363647 3390 1024 4 11214 11214 11251 28084 19536 10067 102400 4 44854 35828 42258 38080 1127 17715 102400 
16 91030 87158 109827 114544 4172 62722 102400 512 109817 132187 99452 96045 45056 112181 102400 1024 143043 104197 148451 135735 66263 106762 102400 16384 126271 110233 127666 142910 151631 105234 2048000 4 190118 187826 178616 180056 2048000 1024 191598 187306 179970 180271 /mnt/full 1024 1 2858 2508 20625 443841 374132 2260 1024 2 4551 3419 12341 411087 362481 3807 1024 4 10244 10242 11210 27866 16311 9476 102400 4 42817 33391 40694 37554 1119 16443 102400 16 73148 77292 109766 81873 4209 54197 102400 512 97568 107834 99448 99980 41625 106760 102400 1024 96071 106004 112473 109551 58857 105308 102400 16384 91855 90803 95528 98959 105164 88006 2048000 4 129782 125175 118716 120473 2048000 1024 129244 126688 119378 120973 So obviously the Barracuda uses a different ZBR strategy so let's test through all 10 partitions. This disk retains full performance over 2/3 the capacity and starts to slow down only when the last 1/3 of capacity will also be used: kB reclen write rewrite read reread 2048000 1024 183365 181698 174750 174246 2048000 1024 210040 209458 199876 200919 2048000 1024 196397 209302 199451 201650 2048000 1024 200381 198844 188294 192124 2048000 1024 186946 186665 178561 180075 2048000 1024 182293 179762 167753 172225 2048000 1024 173535 168544 159783 160495 2048000 1024 159402 158012 149518 150317 2048000 1024 146046 140609 133340 134079 2048000 1024 129375 127552 118809 121299 To be honest: I don't trust that much in these numbers since results with h2benchw (testing also through the whole capacity) show lower numbers and also that sequential performance degrades constantly over the whole capacity: http://www.tomshardware.com/reviews/4tb-3tb-hdd,3183-10.html Anyway: Since all HDDs use ZBR and nobody will use HDDs that are empty all the time we must keep in mind how these disks behave. They get slower when used (data stored on it) and fragmentation adds to this problem when you really stuff as much as possible on them. An average 2.5" HDD that shows up in moronic benchmark tests with a +90 MB/s score performs differently in reality: At half of the capacity we're already talking about just 70 MB/s and if the disk is used heavily and a lot of fragmentation happened then the average sequential access speed will already be at 50 MB/s or even below. That means that even any of our old A20 boards won't be a real bottleneck since with good settings the 1GB or 2GB DRAM will act as filesystem cache (so A20's SATA write bottleneck won't be that much of a problem) and SATA read speeds on A20 exceed 100 MB/s anyway. 
Unfortunately A20 also shows unbalanced GbE performance (in the opposite direction compared to SATA), but there's still some hope that A20's quad-core successor (called R40) uses the newer IP blocks already known from H3 boards, so maybe we will soon support another SoC family with a native SATA implementation that performs absolutely ok when we're talking about 2.5" disks (excluding stuff like WD's VelociRaptor).

As a reference, the stuff used to test the disks:

root@armada:~# cat /usr/local/test-slow-disk.sh
#!/bin/bash
cd /mnt/empty
pwd
iozone -e -I -a -s 1M -r 1k -r 2k -r 4k -i 0 -i 1 -i 2 | grep " 1024"
iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2 | grep " 1024"
iozone -a -g 2000m -s 2000m -i 0 -i 1 -r 4K -r 1024K | grep " 2048"
cd /mnt/half-full
pwd
iozone -e -I -a -s 1M -r 1k -r 2k -r 4k -i 0 -i 1 -i 2 | grep " 1024"
iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2 | grep " 1024"
iozone -a -g 2000m -s 2000m -i 0 -i 1 -r 4K -r 1024K | grep " 2048"
cd /mnt/full
pwd
iozone -e -I -a -s 1M -r 1k -r 2k -r 4k -i 0 -i 1 -i 2 | grep " 1024"
iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2 | grep " 1024"
iozone -a -g 2000m -s 2000m -i 0 -i 1 -r 4K -r 1024K | grep " 2048"

root@armada:~# cat /proc/partitions
major minor  #blocks  name
 179        0    7761920 mmcblk0
 179        1    7605648 mmcblk0p1
   8        0   58605120 sda
   8        1     204800 sda1
   8        2    5860512 sda2
   8        3    5860512 sda3
   8        4    5860512 sda4
   8        5    5860512 sda5
   8        6    5860512 sda6
   8        7    5860512 sda7
   8        8    5860512 sda8
   8        9    5860512 sda9
   8       10    5860512 sda10
   8       11    4344952 sda11

root@armada:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       7.2G  1.2G  5.9G  17% /
devtmpfs        503M     0  503M   0% /dev
tmpfs           503M     0  503M   0% /dev/shm
tmpfs           503M  7.8M  496M   2% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           503M     0  503M   0% /sys/fs/cgroup
tmpfs           503M     0  503M   0% /tmp
tmpfs           101M     0  101M   0% /run/user/1000
/dev/sda6       5.6G  1.1M  5.1G   1% /mnt/half-full
/dev/sda2       5.6G  2.7M  5.1G   1% /mnt/empty
/dev/sda11      4.2G  608K  3.8G   1% /mnt/full

root@armada:~# for i in 2 3 4 5 6 7 8 9 10 11 ; do
>   mkfs.btrfs -f /dev/sda${i}
>   mkdir /mnt/sda${i}
>   mount /dev/sda${i} /mnt/sda${i}
>   cd /mnt/sda${i}
>   iozone -a -g 2000m -s 2000m -i 0 -i 1 -r 1024K | grep " 2048"
> done
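The ten equally sized test partitions above were obviously prepared beforehand. A rough sketch of how such a layout could be created with parted follows -- the device name is a placeholder and the percentages are only illustrative, and this wipes the disk, so double-check before running anything like it:

DISK=/dev/sda                       # placeholder -- destroys all data on this disk!
parted -s "$DISK" mklabel gpt       # more than 4 partitions, so GPT instead of MBR
for i in $(seq 0 9); do
    parted -s "$DISK" mkpart "zone${i}" ext4 "$(( i * 10 ))%" "$(( (i + 1) * 10 ))%"
done
parted -s "$DISK" print

Each partition then still needs a filesystem and a mount point, exactly as the small for loop at the end of the listing above does.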
rodolfo Posted September 12, 2016 Posted September 12, 2016
> It should also be noted that benchmarks with just 10 MB file size might not tell the truth since many USB thumb drives are prone to throttling. They perform nice for 2-3 minutes and then drop down to laughable performance numbers (I've seen thumb drives slowing down from 80 MB/s to 2.x MB/s)

Common USB2 thumb drives (as most other USB2 peripherals) are pretty useless, not so the mentioned USB3 drive. The benchmark numbers were published after actually verifying the results (test loop running iostat for > 1 h).

> Anyway back to 'real storage'

Real storage for real use cases? Some people might be interested in cheap, fast, low power storage for their OPI ONE/LITE/PC appliances running on batteries, solar powered devices or simple low volume wireless NAS, private cloud .....
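Regarding 'verifying with an iostat loop': one simple way to catch a throttling thumb drive is to keep writing for several minutes while watching the real device throughput. A minimal sketch -- device node, mount point and amount of data are assumptions, adjust to the drive's size:

# terminal 1: report throughput of the thumb drive every 5 seconds
iostat -m 5 /dev/sda

# terminal 2: several minutes of continuous, uncached writes
for i in $(seq 1 20); do
    dd if=/dev/zero of=/mnt/usb/throttle-test bs=1M count=500 oflag=direct conv=fsync
done

If the MB/s column in iostat collapses after a few minutes, the drive throttles and the short 10 MB benchmark was indeed too optimistic.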
tkaiser Posted September 12, 2016 Author Posted September 12, 2016
> Real storage for real use cases ? Some people might be interested in cheap fast low power storage for their OPI ONE/LITE/PC appliances running on batteries, solar powered devices or simple low volume wireless-NAS, private cloud .....

Hmm... thanks for mentioning this specific SanDisk USB thumb drive since both price and performance are way better than the "Samsung Memory Fit USB Flash Drive Speicherstick 128GB" I bought 2 months ago (it shows really nice performance with USB3 on a MacBook but horribly low performance when used on a H3 device negotiating USB 2.0 -- no idea why, but at least it's interesting/alarming that a USB3 device performs that badly when used on a USB2 port).

But I still fail to understand the 'use case' here, since a cheap Samsung EVO with 32GB or 64GB outperforms your USB stick -- see below -- and on WiFi-only or Fast Ethernet devices like those you mentioned sequential performance is more or less irrelevant, isn't it? I ordered 4 x 64GB EVO on Friday for 52€ in total (shipping/VAT included) and don't seem to be able to order the thumb drive you mentioned with the same capacity for a lower price (at 128GB it gets interesting though, since the Samsung EVO SD cards with 128GB are more expensive and show less performance).

Samsung EVO 64GB with ten times the file size tested:

  102400       4     3233     3339     7547     7557     7561     3392
  102400      16    11326    12256    14628    14618    14636    12237
  102400     512    21248    21333    22682    22684    22680    21402

Random IO is slightly better when considering read rates, and the lower sequential performance shouldn't matter since at least I fail to understand which use case could benefit from sequential performance exceeding 10MB/s on small H3 devices. If we were talking about the larger H3 models featuring GbE it would make more sense to keep an eye on sequential storage performance (that's also the reason why I immediately stopped testing with the Roseapple Pi: 2 GB DRAM combined with one USB3 port but only Fast Ethernet is already a fail, so even if Actions Semi's S500 showed decent USB3 storage performance... where's the use case with just Fast Ethernet?)

BTW: We replaced a few NAS boxes (40 TB each) with BananaPi M2+ a few months ago, each using a 64GB EVO and two 64GB USB thumb drives in btrfs' raid-0 mode with transparent file compression. Local access is as fast as before (even though they're still running 4.6-rc1 with a rather old/slow version of montjoie's H3 Ethernet driver) but updating the contents is way faster (using btrfs' send/receive feature to/from a rather distant location on the other side of the world).

In other words: I totally agree that using flash storage might be a great option for specific use cases, especially when used with SBCs where we can make use of mainline kernel and advanced filesystem/raid features. But the purpose of this thread was to fight all those moronic SBC storage benchmarks available on the net that only focus on sequential transfer speeds and totally forget about all the important stuff (random IO, host vs. device bottlenecks, the fact that performance depends not only on the host but also on the disk and the USB bridge chip used, and so on).
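For the curious, the btrfs raid-0 plus transparent compression setup mentioned in the 'BTW' paragraph could look roughly like this -- device names, mount point and snapshot naming are made up for illustration, not copied from those boxes:

# stripe data (raid0) across an SD card partition and two USB sticks, mirror metadata
mkfs.btrfs -f -d raid0 -m raid1 /dev/mmcblk0p2 /dev/sda1 /dev/sdb1

# mounting any member device brings up the whole pool; lzo compression is cheap on small ARM cores
mount -o compress=lzo,noatime /dev/mmcblk0p2 /mnt/pool

# sync contents to a remote box using btrfs send/receive of a read-only snapshot
SNAP="/mnt/pool/snap-$(date +%F)"
btrfs subvolume snapshot -r /mnt/pool "$SNAP"
btrfs send "$SNAP" | ssh remote-box "btrfs receive /mnt/backup"

Incremental sends (btrfs send -p with a parent snapshot) are what make the repeated updates fast; left out above to keep the sketch short.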
rodolfo Posted September 12, 2016 Posted September 12, 2016 But I still fail to understand the 'use case' here since a cheap Samsung EVO with 32GB or 64GB outperforms your USB stick -- see below -- and on WiFi only or Fast Ethernet devices like those you mentioned sequential performance is irrelevant more or less, isn't it? I ordered 4 x 64GB EVO on friday for 52€ in total (shipping/VAT included) and seem not to be able to order the thumb drive you mentioned with same capacity for a lower price (but when looking at 128GB it gets interesting since the Samsung EVO SD cards with 128 are more expensive and show less performance). I've generally found SDcards less reliable than their USB counterparts and the physical handling of USB3 sticks is usually much smoother with PCs and notebooks. This is just a personal preference. For OPI ONE/LITE I put boot stuff on an old SDcard of any class/size and the rootfs on a USB3 stick. A fast flash disk does of course nicely complement a fast SDcard if you need to add low power storage. As I already mentioned, OPI ONE and OPI LITE both with USB flash and wifi run from simple dual-18650 battery pack. HDD and SSD will meet some serious limits there. You are of course perfectly right in pointing out the speed nonsense promoted in benchmark infomercials. An SBC will always be a carefully balanced matched system of storage, computing and I/O. It just happens that the small OPI H3 boards running with stock legacy Armbian show very pleasing balanced performance. 1
cmirra Posted September 16, 2016 Posted September 16, 2016
Cubietruck+ (H8) and CT Raid Subboard, 1x Samsung EVO 750 256GB SSD, 1x Corsair Force 120GB SSD. Disks contain ~1GB of previous test data in Raid1 mode (120GB array). The CT Raid Subboard is connected to the CT+ via the SATA connector; the subboard also offers a USB3 data connection. I would test additional RAID modes but this board requires soldering tiny surface mount resistors to change modes. Raid 1 is what I need for my application so I won't be able to test Raid0, JBOD or PM modes for a while. I'd expect all the IO to be terrible, thanks to the Cubietruck+.

Command line used: iozone -e -I -a -s 1M -r 1k -r 2k -r 4k -i 0 -i 1 -i 2

                                                         random   random
      KB  reclen    write  rewrite     read   reread       read    write
    1024       1     1644     1997     7363     7732       7008     2004
    1024       2     2727     3359    12774    13928      12051     3345
    1024       4     4565     5268    23506    26897      20566     5257

Command line used: iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2

                                                         random   random
      KB  reclen    write  rewrite     read   reread       read    write
  102400       4     6156     6876    25990    26144      25722     6284
  102400      16    10489    11175    58778    58498      57430    10719
  102400     512    12853    13055    86507    87503      86036    12977
  102400    1024    12916    14186    78472    85506      85974    12914
  102400   16384    12949    14395    94904    99168      97182    13241

Command line used: iozone -a -g 4000m -s 4000m -i 0 -i 1 -r 4K -r 1024K

      KB  reclen    write  rewrite     read   reread
 4096000       4     5765     7425    30598    29413
 4096000    1024

This last one took so long I had to terminate it. Got things to do.

Via USB2 on Windows 10 I got writes of ~40MB/s (according to Explorer). Will properly test with Windows 10 + SATA3/USB3 later -- but I think the moral of the story is that the CubieTruck+'s SATA->USB bridge is as bad as you heard. Via USB3 on Windows 10 I got the same results for the 50MB, 1GB and 4GB tests. I feel like these 4k results may not be so great, but I don't have benchmarks without the RAID board to crosscheck. Maybe later.
tkaiser Posted September 16, 2016 Author Posted September 16, 2016
> CT Raid Subboard is connected to CT+ via SATA connector. Subboard also offers USB3 data connection. I would test with additional Raid Modes but this board requires soldering tiny surface mount resistors to change modes.

LOL, really? Ok, this board is not interesting at all, no further tests/numbers needed. I was interested in RAID-0 performance since, due to the ultra slow GL830 USB-to-SATA bridge limiting real writes/reads to 15/30 MB/s max, this mode would be absolutely useless (it's easy to understand, but the average customer of this gadget most probably won't believe it until he sees numbers/graphs).

BTW: I would never use RAID-1 with a proprietary controller like this until I had tested how the controller deals with a data mismatch between the two drives. Easy test: create a text file containing "1", then attach both RAID members directly to a SATA port. If the filesystem is not accessible then it's already time to throw the gadget away. If it is accessible, modify the file to read "2" on one disk and "3" on the other. Then test what you get back once the 2 disks are RAID members again. If it's either "2" or "3" then throw the gadget away. It's really that easy.
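Spelled out as commands, that consistency check could look like this (mount points and device names are placeholders; the only point is to see which version of the file the RAID bridge hands back afterwards):

# 1. with the box assembled as RAID-1: write a known file
echo 1 > /mnt/raidbox/testfile; umount /mnt/raidbox

# 2. attach each member disk directly via SATA and make them disagree
mount /dev/sdb1 /mnt/member && echo 2 > /mnt/member/testfile && umount /mnt/member
mount /dev/sdc1 /mnt/member && echo 3 > /mnt/member/testfile && umount /mnt/member

# 3. reassemble the RAID box and check what comes back
cat /mnt/raidbox/testfile    # "2" or "3" means the controller silently picked a side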
tkaiser Posted April 20, 2017 Author Posted April 20, 2017
One final update regarding Roseapple Pi (using Actions Semi S500 just like LeMaker Guitar or the announced Cubieboard 6). Since I booted the board one last time anyway I thought let's give USB3 there one last try as well. I connected a Samsung PM851 in a JMS567 enclosure (with its own power supply!) to the USB3 port and had a look with the most recent 3.10.105 kernel:

root@roseapple:~# lsusb
Bus 002 Device 002: ID 152d:3562 JMicron Technology Corp. / JMicron USA Technology Corp.
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
root@roseapple:~# lsusb -t
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
    |__ Port 1: Dev 2, If 0, Class=Mass Storage, Driver=uas, 5000M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 480M

That looks nice since UAS seems to be usable. Let's give it a try with the 2 iozone calls from the Clearfog measurements above:

iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2

                                                         random   random
      kB  reclen    write  rewrite     read   reread       read    write
  102400       4    13525    16451    19141    24275      14287    16492
  102400      16    39343    48649    56409    63777      40203    45654
  102400     512    68873    75835    89871   102977      98620    94677
  102400    1024   115288   111747   170742   176837     172585   104936
  102400   16384   117025   105977   195316   196457     196582   117819

iozone -a -g 4000m -s 4000m -i 0 -i 1 -r 4K -r 1024K

      kB  reclen    write  rewrite     read   reread
 4096000       4   124421   132386   134795   134760
 4096000    1024   127135   134943   127559   128026

If you compare with the PM851 numbers made with the Clearfog above, it's obvious that the S500 numbers are not that great. And since S500 features only Fast Ethernet, at least for NAS use cases sequential transfer speeds are irrelevant anyway. I then tried to use an external VIA812 USB3 hub with integrated RTL8153 Gigabit Ethernet, but this only led to error messages, /dev/sda1 disappearing and the board failing to boot afterwards.

Fortunately this Roseapple Pi (formerly and more correctly called Lemon Pi) has never been sold; there exist just a few dev/review samples that were sent out around the globe. Maybe the above numbers help some future Cubieboard 6 owners who got tricked into believing CB6 would have 'real SATA'. Funnily enough, USB-SATA on Cubieboard 6 will be much faster than on older Cubieboards (using A20's 'real SATA' or the horrible GL830 USB-to-SATA bridge on Cubieboard 5), but for most use cases this won't help much since there's only Fast Ethernet on the board. So even when adding an RTL8153 Gigabit Ethernet dongle to one of the 2 USB2 ports, 'NAS performance' won't exceed that of Cubieboard 3 (the so called Cubietruck).
manuti Posted April 20, 2017 Posted April 20, 2017
It's not completely related to storage, but I'm following the progress of a sysadmin trying to find the best cheap VPS. He is preparing an automated testing approach and sharing the Ansible scripts and the results on GitHub. The guy is Joe http://joedicastro.com/ and the GitHub repo is https://github.com/joedicastro/vps-comparison Maybe it could be interesting to run the same tests on local servers and LAN, or from WAN using DynIP.
tkaiser Posted June 22, 2017 Author Posted June 22, 2017
On 12.9.2016 at 1:32 PM, tkaiser said:
Edit: Really impressive USB3 results with ODROID-XU4 in the meantime due to UASP enabled/fixed in HK's 4.9 kernel branch: http://xu4.keltike.de/performance/odroidxu4-with-and-without-uas-support/

The XU4 is easily outperformed by my new arrival, the ROCK64. Testing with a Samsung EVO840 (the same as above) with ext4, UAS enabled, a 4.4.70 kernel (ayufan build) and an external JMS567 disk enclosure:

iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2

                                                         random   random
      kB  reclen    write  rewrite     read   reread       read    write
  102400       4    19727    23938    25879    24231      18475    23886
  102400      16    67302    80160    87210    87008      68301    79484
  102400     512   278860   292958   292461   301834     292974   291585
  102400    1024   297825   310695   313499   324097     316276   313253
  102400   16384   325951   330928   340258   351640     352109   323207

That's sequential write performance of +325 MB/s, and reads are even better at ~345 MB/s. So now let's wait and see how pricing will look (I guess we'll get the 1GB variant for less than $25). Debug log here (I simply used the arm64 OMV rootfs from ODROID-C2): http://sprunge.us/HURC
tkaiser Posted June 22, 2017 Author Posted June 22, 2017 (edited)
Time for a small '2017 follow-up' on this issue, this time only focussing on 'disk performance' (HDD/SSD, not onboard storage like eMMC or SD cards). Since we all know that we don't need to differentiate between individual boards here but can look at board families (the SoC is the important thing), I only check the following 'families':

cheap Allwinner stuff running an A64 or H5 (that's the Pine64 numbers below, H3 boards score a bit lower)
Allwinner A20, R40 or V40 (that's the Banana Pi Pro numbers below; maybe random IO numbers look better with the R40/V40 boards but since those are only available from manufacturers also known as brain damaged retards that's not an option today)
i.MX6 (Wandboard Quad, not supported by Armbian but a few other boards use the same SoC)
Exynos 5422 (ODROID XU3 or XU4)
RK3328 (ROCK64)
Armada 38x (Clearfog Pro; the same numbers will apply to Turris Omnia, Clearfog Base or Helios4 soon)

Why are all other boards missing? Because they're uninteresting when it's about storage. SATA or USB3 are a must; the only exception below are the UAS capable USB2 Allwinner devices since they show a few more MB/s sequential throughput and way better random IO numbers compared to USB2 solutions that only support the old/anachronistic USB Mass Storage mode. All tests were done with Samsung EVO SSDs (my EVO840 used for all tests except the ODROID-XU4 results which were made with a much faster EVO850, and the Pine64 numbers which were made with a slower EVO750, so those random IO numbers should be multiplied by 1.3 or even 1.4):

                           Random IO in IOPS    Sequential IO in MB/sec
                           4K read/write        1M read/write
Pine64 (USB2/UAS)           2260/1948            42 / 41
Banana Pi Pro (SATA)        3943/3478           122 / 37
Wandboard Quad (SATA)       4141/5073           100 / 100
ODROID-XU4 (USB3/UAS)       4637/5126           262 / 282
ROCK64 (USB3/UAS)           4619/5972           311 / 297
Clearfog Pro (SATA)        10148/19184          507 / 448

Testing methodology is somewhat flawed, but I refuse to waste much of my time on proper benchmarking since it's only about getting an idea of what to expect. So as a summary:

USB2 SBC that lack UAS capabilities are simply too slow to even be listed here
UAS capable cheap Allwinner USB2 boards show ok-ish performance for low-end setups (think of NanoPi NEO2 as NAS for example)
SBC featuring 'real SATA' have to be differentiated: Armada 38x shows top performance like x86/x64 boards, i.MX6 is somewhat ok, A20/R40/V40 simply suck (better don't buy this crap any more)
USB3 SBC like ODROID-XU4/HC1 or ROCK64 show way better performance. You only have to take care of the USB downsides (XU4 with an internal USB hub and receptacle problems is in a bad position here) and ensure that you're using good USB-to-SATA bridges to connect SSDs or HDDs

If we also take the price/performance ratio into account then ROCK64 or other RK3328 boards that will follow are really hard to beat. The 1GB ROCK64 variant will most probably stay below the $25 margin, and you should keep in mind that more DRAM is pretty useless if you think about (NAS) performance. Only on devices where IO throughput is way lower than network throughput can more DRAM than needed improve NAS write performance, if settings are appropriate (since then amounts of data that fit into RAM will be written at network speed and flushed to disk later) -- see the sketch below.

Edited September 6, 2017 by tkaiser: Updated XU4/HC1 performance numbers since tested myself: https://forum.odroid.com/viewtopic.php?f=146&t=27548&start=50#p201175
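Regarding 'if settings are appropriate' above: what's usually meant is allowing Linux to keep more dirty data in RAM before forcing writeback, so write bursts arrive at network speed and get flushed to the slow disk afterwards. A hedged example -- the values are purely illustrative and trade data safety on power loss for burst speed:

# let up to half of RAM hold dirty pages before writers get throttled,
# start background writeback early so flushing happens continuously
sysctl -w vm.dirty_ratio=50
sysctl -w vm.dirty_background_ratio=10

# persist across reboots
cat >> /etc/sysctl.conf <<'EOF'
vm.dirty_ratio = 50
vm.dirty_background_ratio = 10
EOF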
infinity Posted June 22, 2017 Posted June 22, 2017 @tkaiserMan these numbers are absolutely awesome! Thanks for this first test to get an idea of this little piece of gold. I'm really looking forward to it to replace my current banana Pi (gigabit LAN and SATA) as my Seafile Cloud Server. Your thoughts about RAM sound sane. I guess 2GB will be enough for seafile and perhaps NAS altogether. What do you think about the capability to install armhf (instead of aarch64) software? Currently the Seafile server and some other stuff is mostly rpi2/3 compatible, which means it is armhf. I don't see any downside so far compared to aarch64. Ram usage is also lower with 32bit. I guess armhf stuff will be working on ROCK64 as well? In terms of USB3: As usb3 is specified to have a quite high capability of power supply... how would that be taken account by the ROCK64 board design? Any idea what the board might lift? I thought about bying a simple USB3toSATA bridge with UASP support and power a Crucial MX300 256GB without an external power supply (just leaving it away). Such an SSD should not consume too much, so I hope that the Rock64 USB3 port could handle it directly. Does anybody have something to say about the points I mentioned? On Topic: Somehow I totally feel you developers. You want to restrict development to worthy boards, which would also increase the quality of development for those few boards, because work is more focused. On the other hand I see boards like my old banana Pi and I really think that it would've never become what it is and as usable as it is without all your effort (which certainly started without much discussion before). From a user perspective it is a very hard topic that you discuss here. Obviously everybody in the community wants as much as possible. Unfortunately new builds/initial support matures after quite some time and then comes the breakthrough (which - sometimes - cannot be anticipated). Somehow it would also be a shame if nobody would even try to establish some initial support, thus nobody would know what the board is capable of :/. I really don't want to be in your situation to decide how this can be handled in the future. I'm just very thankful and happy that ARMBian was founded some years ago and that it became such a name in the ARM business About the ROCK64: One thing is clear: In my optinion the ROCK64 should be supported, as it combines so many good things. The interest is there, quite a few developers are working on it already, apparently. The more interesting is that two totally independent interest groups (ARMBian=Server/Cli aim and then the LibreELEC=Multimedia) show real interest and have put some effort already into it. This can push it very far and both camps can benefit one another. Really really great to see this!
tkaiser Posted March 19, 2018 Author Posted March 19, 2018
Early 2018 update

Time for another small update. It's 2018 now and since it seems Armbian will support a couple of RK3399 devices later this year, let's have a closer look at their storage performance. RK3399 provides 2 individual USB3 ports which seem to share a bandwidth limitation (you get close to 400MB/s with a fast SSD on each USB3 port, but with two SSDs accessed in parallel total bandwidth won't exceed 400MB/s). RK3399 also features a PCIe 2.1 implementation with 4 lanes that should operate at Gen2 link speed (that's 5GT/s per lane, so in total we'd be talking about 20GT/s if the SoC is able to cope with such high bandwidth). Rockchip changed their latest RK3399 TRM (technical reference manual) last year and downgraded the specs from Gen2 to Gen1 (2.5GT/s, or 10GT/s with 4 lanes), so there was some doubt whether there's an internal overall bandwidth limitation or something like that (see speculation and links here).

Fortunately a Theobroma engineer recently did a test using Theobroma Systems' RK3399-Q7 with a pretty fast Samsung EVO 960 NVMe SSD: https://irclog.whitequark.org/linux-rockchip/2018-03-14 -- it seems RK3399 is able to deal with storage access at up to 1.6GB/s (yes, GB/s and not MB/s). That's a plausible number: 4 lanes at 5 GT/s with 8b/10b encoding works out to roughly 2 GB/s of theoretical payload bandwidth. This is not only important if you want ultra fast NVMe storage (directly attached via PCIe and using a way more modern and efficient protocol than the ancient AHCI used by SATA) but also if RK3399 device vendors want to put PCIe attached SATA controllers on their boards. The ODROID guys chose to go with an ASM1061 (single lane) on their upcoming N1 since they feared switching to an x2 (dual lane) chip would only increase costs while providing no benefits. But Theobroma's test results are an indication that even x4 attached controllers using all PCIe lanes could make reasonable use of the full PCIe bandwidth of 20GT/s.

Below we'll now have a look at USB3/UAS performance and PCIe attached SATA using the ASM1061 (both done on an ODROID N1 developer sample some weeks ago). Those tests still use my usual EVO840 SATA SSD so results are comparable. You see two ASM1061 rows since one was made with active PCIe link state power management and the other without (more or less only affecting access patterns with small block sizes). Then of course beeble's NVMe SSD tests are listed (here fio and there iozone -- the numbers should also be valid for the other RK3399 devices where you can access all 4 PCIe lanes via M.2 key M or a normal PCIe slot: Rock960, NanoPC-T4 or RockPro64 (M.2 adapter needed then of course -- ayufan tested and got even better iozone numbers than beeble). And maybe later I'll add SATA and USB3 results from EspressoBin with latest bootloader/kernel. (For an explanation which boards represent which SoC and why please see my last post above.)

                           Random IO in IOPS    Sequential IO in MB/sec
                           4K read/write        1M read/write
RPi 2 under-volted          2033/2009            29 / 29
RPi 2                       2525/2667            30 / 30
Pine64 (USB2/UAS)           2836/2913            42 / 41
Banana Pi Pro (SATA)        3943/3478           122 / 37
Wandboard Quad (SATA)       4141/5073           100 / 100
ODROID-XU4 (USB3/UAS)       4637/5126           262 / 282
ROCK64 (USB3/UAS)           4619/5972           311 / 297
EspressoBin (SATA)          8493/16202          361 / 402
Clearfog Pro (SATA)        10148/19184          507 / 448
RK3399 (USB3/UAS)           5994/6308           330 / 340
ASM1061 powersave           6010/7900           320 / 325
ASM1061 performance         9820/16230          330 / 330
RK3399-Q7 (NVMe)           11640/36900         1070 / 1150

As we can see, RK3399 USB3 performance slightly improved compared to RK3328 (Rock64).
It should also be obvious that 'USB SATA', in this case USB3/SuperSpeed combined with a great UAS capable USB-to-SATA bridge (JMicron JMS567 or JMS578, ASMedia ASM1153 or ASM1351), is not really that much worse than either PCIe attached SATA or 'native SATA'. If it's about sequential performance, USB3 even slightly outperforms PCIe attached SATA. The 2 USB3 ports RK3399 provides, combined with great UAS capable bridges, are really worth a look for attaching storage.

NVMe obviously outperforms all SATA variants. And while combining an ultra fast and somewhat expensive NVMe SSD with a dev board is usually nothing that happens in the wild, at least it's important to know what the limitations look like. As we've seen from the RK3399-Q7 tests with fio and larger blocksizes, we get close to 1600 MB/s at the application layer, which is kind of impressive for devices of this type.

Another interesting thing is how NVMe helps with keeping performance up: this is /proc/interrupts after an iozone run bound to the 2 big cores (taskset -c 4-5): https://gist.github.com/kgoger/768e987eca09fdb9c02a85819e704122 -- the IRQ processing happens on the same cores automagically, no more IRQ affinity issues with all interrupts ending up on cpu0.

Edit 1: Replaced the Pine64 numbers made with an EVO750 last year with fresh ones done with a more recent mainline kernel and my usual EVO840
Edit 2: Added Raspberry Pi 2 results from here.
Edit 3: Added EspressoBin numbers from here.
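To reproduce the 'bound to the 2 big cores' part on RK3399 (where the two A72 cores are cpu4 and cpu5), the benchmark can be pinned there and the interrupt distribution checked afterwards -- the mount point is a placeholder:

# run iozone only on the big cluster
cd /mnt/nvme && taskset -c 4-5 iozone -e -I -a -s 100M -r 4k -r 1024k -i 0 -i 1 -i 2

# check which CPUs serviced the storage interrupts during the run
grep -i -E 'nvme|xhci|sata' /proc/interrupts

With legacy interrupt setups an IRQ can also be moved manually, e.g. echo 30 > /proc/irq/<nr>/smp_affinity (mask 0x30 = cpu4+cpu5), which is exactly the fiddling NVMe's per-core queues make unnecessary.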
fossxplorer Posted April 29, 2018 Posted April 29, 2018
Amazing results of RK3399. Thx for sharing such details with us. I'd like to order a RockPro64 when it becomes available again.
Jason Fisher Posted May 8, 2018 Posted May 8, 2018
On 4/29/2018 at 7:58 AM, fossxplorer said:
Amazing results of RK3399. Thx for sharing such details with us. I'd like to order RockPro64 when it comes available again. My use case is NAS, and i think this could be a good board for it. Pine only sell a PCIe-SATA adapter with 2 SATA ports, but an adapter with more SATA ports would be nice since it can handle 1.6GB/s. So for my spinning disks, i could use up to 8 disks obviously. Hope we will be able to compile ZFS on this, as i wasn't on another RK3328 device. EDIT1: Also i could imagine using this board as my desktop, but wonder if we have the necessary driver etc for GPU? That would also be awesome to have such a low power desktop.

I posted my ZFS solution here -- able to compile/use the standard zfs-dkms packages with some modifications.
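The linked solution isn't reproduced here, but on a Debian-based Armbian image the generic zfs-dkms route usually starts roughly like below. Package names are Debian's from the contrib component; the kernel headers matching the running Armbian kernel (typically installed via armbian-config rather than by the kernel's own name) are the part that tends to need the 'modifications' mentioned:

# DKMS build prerequisites plus the ZFS packages (headers for the running kernel required)
apt install build-essential zfs-dkms zfsutils-linux

# example pool: two disks mirrored, 4K-sector friendly, cheap compression
# (device names are placeholders)
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb
zfs set compression=lz4 tank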
tkaiser Posted May 9, 2018 Author Posted May 9, 2018
On 4/29/2018 at 4:58 PM, fossxplorer said:
RK3399 ... more SATA ports ... can handle 1.6GB/s

Nope, please look at the details and at how storage works. The 1.6GB/s above are just a confirmation that RK3399 has no crippled internal bandwidth (Hardkernel assumed such a thing and designed their ODROID N1 around the assumption that RK3399's PCIe implementation would be bottlenecked to 400 MB/s or something like that). These 1.6GB/s are just the proof that an x4 Gen2 connection can be established with RK3399 (4 x 5 GT/s with 8b10b encoding) and that an NVMe (!!!) SSD shows nicely high bandwidth numbers.

NVMe is not a protocol from the stone age (like SATA/AHCI), it's made for modern CPUs with multiple cores. If you attach an old SATA or SAS controller to the PCIe bus you will run into IRQ affinity issues for sure, and you have to keep in mind that you're now using the PCIe bus to talk to a separate storage controller that implements protocols from the last decade/century made for single-core CPU systems with spinning rust attached (that's what SCSI, PATA, SATA and AHCI are all about). Lots of overhead and inefficiency is involved, and when you then also try to combine a bunch of disks in special ways (e.g. RAID, which in my opinion is a really stupid setup at home) performance will drop drastically again.

RK3399 is a general purpose ARM SoC and not a NAS SoC. Those exist though -- look at Marvell Armada 7k/8k for example. These SoCs differentiate between AP (application processor, that's the part with the ARM cores) and CP (communication processor(s)) and all the relevant work is offloaded to the CP. That's why they're fast as NAS.

BTW: RK3399 and PCIe GPUs won't work, but please no off-topic discussions here. Let's focus on storage performance on SBC and try to avoid flooding this thread with babbling. So many 'educational' threads here meant as tutorials to share knowledge have already been destroyed by babbling.
JMCC Posted May 9, 2018

I thought it would be interesting to also have some numbers from an SSHD, more specifically a Firecuda 2.5" 2TB @ 5400rpm. The board is an HC1 with the latest firmware update from JMicron, and kernel 4.9.61. I ran two tests, one with a BTRFS partition and another with an EXT4 partition, both using the whole disk and only about 5% full. These are the results:

BTRFS:
                                                     random    random
        kB  reclen    write  rewrite     read   reread     read     write
    102400       4    11433    13426    18452    18513     1653      5039
    102400      16    35302    41951    56583    57930     6069     36474
    102400     512    95472   102089   114966   116459    46364    103092
    102400    1024   102495   102538   115515   116898    54691     89851
    102400   16384    78096    81337   114334   134513   110433     76181

EXT4:
                                                     random    random
        kB  reclen    write  rewrite     read   reread     read     write
    102400       4    12427    13485    19307    19430     1709      1887
    102400      16    38360    55768    70089    70510     6280     16125
    102400     512   106688   109609   122658   124822    47793     93470
    102400    1024   108331   109693   123302   124806    64158     98570
    102400   16384    80573    87536   120168   104656   121852     91237

        kB  reclen    write  rewrite     read   reread
   4096000       4   122603   123460   131950   131997
   4096000    1024   118265   118122   127649   127780

(I didn't do the 4GB file test with BTRFS.) I think it is interesting to see the big difference in IOPS at small write sizes between the two filesystems. It seems like, for some reason, BTRFS is using the SSD cache more efficiently than EXT4. @tkaiser any insight on this?
tkaiser Posted May 10, 2018 (Author)

17 hours ago, JMCC said: I think it is interesting to see the big difference in IOPS at small write sizes between the two filesystems. It seems like, for some reason, BTRFS is using the SSD cache more efficiently than EXT4.

Nope, since the same happens when testing with a normal HDD or SSD as well (see numbers and comment here). With btrfs writes the iozone -I switch (direct IO) has no effect, so in reality this is testing filesystem buffers. That's why the numbers are higher. You simply can't use iozone -I to test btrfs writes; instead you would need to increase the test filesize to 3 or 4 times the amount of DRAM to eliminate filesystem buffers (see the sketch below). Just another example that you always have to benchmark the benchmark first to be able to generate insights from it.

Two more notes:

Almost all btrfs code lives inside the kernel, so different kernels --> different behaviour and performance are possible.

Doing such tests without first switching to the performance cpufreq governor only produces numbers without meaning, since cpufreq governor behaviour, especially with small accesses, can become a significant factor (different filesystems might result in different cpufreq scaling behaviour, so while this is interesting too it doesn't tell you anything about 'filesystem performance').
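A minimal sketch of what 'test filesize vs. DRAM' means in practice (assuming a board with 2 GB DRAM and the filesystem under test mounted at /mnt/test -- both assumptions):

cd /mnt/test                        # assumed mountpoint of the filesystem under test
# with 2 GB DRAM choose a test file of 3-4 times that size (here 8 GB) so that
# filesystem buffers / page cache can no longer hide the real disk performance
iozone -e -a -s 8g -r 4k -r 1024k -i 0 -i 1 -i 2
# note: no -I (direct IO) here since it is ignored for btrfs writes anyway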
JMCC Posted May 10, 2018

Well, if that's the case, then looking at the rest of the numbers (with bigger filesizes) it seems like EXT4 performance is still better than BTRFS, at least with the current Armbian 4.9.y kernel and in a real-life scenario with the ondemand governor. Do you know if it is the opposite (BTRFS performing better than EXT4) with other kernels?
chwe Posted May 10, 2018 (edited)

7 hours ago, tkaiser said: Just another example that you always have to benchmark the benchmark first to be able to generate insights from it.

Wouldn't it make sense to write a small bash script which takes care of 'sane' settings and generates a 'report' together with board name, filesystem used, running kernel etc.? I remember you explained it once to me, and it seems you did it (at least) once more on CNX and now here again. I think in the long term you'd save some time by being able to see quickly whether a posted 'benchmark' was done with an appropriate board/setting or is just a collection of garbage.

Edit: Quote: but please no off-topic discussions here -- as soon as you decide whether or not to write such a script, I'll delete this post to keep the thread clean.

Edited May 10, 2018 by chwe
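Purely as an illustration of the kind of 'report header' described above (hypothetical, not an existing script; all fields are examples):

# hypothetical report header collected before the actual benchmark run
echo "Board:      $(tr -d '\0' < /proc/device-tree/model)"
echo "Kernel:     $(uname -r)"
echo "Filesystem: $(findmnt -n -o FSTYPE -T .)"
echo "Governor:   $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)"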
tkaiser Posted May 10, 2018 (Author)

5 hours ago, JMCC said: Do you know if it is the opposite (BTRFS performing better than EXT4) with other kernels?

More or less irrelevant, since btrfs is a modern filesystem utilizing modern concepts:

E.g. checksumming to ensure data integrity. This does not only 'waste' CPU cycles but also results in higher storage activity for the same tasks (since checksums have to be calculated, written, read, verified).

E.g. copy-on-write (CoW), which directly affects performance of write patterns that are the same size as or smaller than the filesystem's block size (usually 4K or larger), since every write is now in reality a 'read, modify, write' cycle: already existing data needs to be read from disk, then the new stuff is added, then the modified block is written to a new location, and only afterwards is the old reference deleted. That's why btrfs and other CoW filesystems show horribly low performance in all benchmarks writing small chunks of data. The same happens when running database benchmarks on btrfs with defaults --> horrible performance as expected, since a CoW filesystem is nothing you want to put database storage on (and if you disable CoW in btrfs then checksumming is gone as well and you're better off using ext4 or XFS anyway).

The (in)famous clickbait site dedicated to providing only numbers without meaning (Phoronix) periodically 'benchmarks' various filesystems inappropriately, so if you're after numbers visit https://www.phoronix.com/scan.php?page=article&item=linux414-fs-compare and similar posts.

3 hours ago, chwe said: Wouldn't it make sense to write a small bash script which takes care of 'sane' settings and generates a 'report' together with board name, filesystem used, running kernel etc.?

Unfortunately not, since passive benchmarking ('fire and forget' mode) never works. It only provides numbers without meaning and nice looking graphs but zero insights. With storage benchmarks, as with every other benchmark, only 'active benchmarking' works. And that requires some understanding, a lot of time and the will to throw away most of your work (+95% of all benchmark results, since usually something goes wrong, you have to find out what and then repeat). Almost nobody does this. On every benchmarked host, but especially on SBC with their weak ARM CPU cores and limited resources, you always need to monitor various resources in parallel (htop, 'iostat 5', 'vmstat 5' and so on), switch at least to the performance governor, and take care of process and IRQ affinity (watching htop), especially on boards with big.LITTLE implementations, since otherwise you end up with the usual passive benchmarking result: Casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.

Next problem: once you have generated numbers, you still have to generate insights from these numbers. Recently someone showed me this link as proof that bcache in Linux would be a great way to accelerate HDD access by using SSDs: http://www.accelcloud.com/2012/04/18/linux-flashcache-and-bcache-performance-testing/ True or not? What do the numbers tell? While almost everyone looking at those numbers and graphs will agree that bcache is great, in reality these benchmark numbers show exactly the opposite. But people ignore this since they prefer data over information and ignore what the numbers really tell them.
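A small sketch of what disabling CoW (mentioned above) looks like in practice; directory names and mountpoints are made up for illustration:

# disable CoW for a directory meant for databases or VM images;
# the flag only affects files created after it has been set
mkdir /mnt/btrfs/databases          # assumed btrfs mountpoint
chattr +C /mnt/btrfs/databases
lsattr -d /mnt/btrfs/databases      # should list the 'C' (No_COW) attribute

# or disable CoW (and with it checksumming) for the whole filesystem
mount -o remount,nodatacow /mnt/btrfs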
Back on topic (SBC storage and not filesystem performance): the only reasonable way to compare different boards is to use the same filesystem (ext4, since it's the most robust and not that prone to show different performance depending on kernel version) and the performance governor, eliminating all background tasks that could negatively impact performance and keeping at least an eye on htop. If you see one CPU core there being utilized at 100% you know that you've run into a CPU bottleneck and have to take this into account: either by accepting/believing that a certain SBC is simply too weak to deliver good storage performance since it's CPU bottlenecked, or by starting to improve settings, as in Armbian's or my 'active benchmarking' approach, with benchmarks then ending up as optimized settings --> there's a reason 'our' images sometimes perform even twice as fast on identical hardware as other SBC distros that don't care about what's relevant.
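A minimal sketch of that kind of setup before an iozone run (the cpufreq paths are the standard sysfs locations; the mountpoint is an assumption):

# switch every core to the performance governor so cpufreq scaling doesn't distort results
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# run htop, 'iostat 5' and 'vmstat 5' in parallel terminals while the benchmark runs
cd /mnt/test                        # assumed mountpoint of the ext4 filesystem under test
iozone -e -I -a -s 100m -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2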