tkaiser Posted June 24, 2016 Posted June 24, 2016 (edited) Check the following:
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
find /sys -iname "*temp*"
Also: this is a Cortex-A53 SoC, so 'armv7l' is already an indication that something is wrong. This is one of the few cases where running 'sysbench' might be interesting, just to see whether this device finishes within a few seconds (like ODROID-C2, Pine64 or other Cortex-A53 implementations that execute ARMv8 code do) or is as slow as an RPi 3 (also Cortex-A53, but used in 32-bit mode only). And while you run sysbench you should monitor /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq constantly. You can install RPi-Monitor this way:
apt-get install perl librrds-perl libhttp-daemon-perl libwww-perl libjson-perl libipc-sharelite-perl libfile-which-perl
wget -O /tmp/rpimonitor_2.10-1_all.deb https://github.com/XavierBerger/RPi-Monitor-deb/blob/master/packages/rpimonitor_2.10-1_all.deb?raw=true
dpkg -i /tmp/rpimonitor_2.10-1_all.deb
This will at least enable monitoring of CPU clockspeed (and therefore throttling). And if you find a temperature node somewhere below /sys/ you can adjust the relevant template to monitor thermal values too. Lowering temperatures is a complex process that requires deep knowledge, a huge amount of testing and a device worth the effort (based on what's known so far, this is not the case for this NanoPi M3). Edited July 8, 2016 by tkaiser Removed misleading Armbian lalala
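A minimal way to keep an eye on the clockspeed while sysbench runs (assuming the sysfs node above exists on this kernel) is a simple loop in a second shell:
# print the current clockspeed once per second; stop with Ctrl-C
while true; do
    cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
    sleep 1
done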
wildcat_paris Posted June 24, 2016 Posted June 24, 2016 (edited)
I bought a NanoPi M3, but the CPU is very hot and I cannot measure its temperature (lm-sensors does not work). Usually Armbian could make its temperature lower, as was done in the case of the Orange Pi PC.
I bought a NanoPi M1. The CPU is also hot, so I put a copper heatsink on it. Idle: 56°C, external copper heatsink 45°C (infrared sensor). Don't buy FriendlyARM boards with no proper power regulator (as tkaiser would put it very well). Try (may or may not work):
cat /sys/class/thermal/thermal_zone?/temp
Edited June 24, 2016 by wildcat_paris Brexit vote bug
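If those nodes exist they usually report millidegrees; a small sketch to print them as °C (note that they may not exist on the stock NanoPi M3 kernel -- later in this thread the temperature is read from /sys/class/hwmon/hwmon0/device/temp_label in plain degrees instead):
# convert millidegrees to °C for every thermal zone that exists
for z in /sys/class/thermal/thermal_zone*/temp; do
    awk '{ printf "%s: %.1f°C\n", FILENAME, $1/1000 }' "$z"
done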
hatahata Posted July 3, 2016 Posted July 3, 2016
About the NanoPi M3: cross-compiling is possible on DebianDog64 (probably on any 64-bit Intel host running Debian or a similar 64-bit distribution). See -> http://akita-8core.blogspot.jp/2016/06/nano-pi-m3.html
These are my kernel config changes:
diff .config .config-ori
53c53
< CONFIG_SWAP=y
---
> # CONFIG_SWAP is not set
96d95
< # CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not set
1801,1802c1800
< CONFIG_THERMAL=y
< CONFIG_THERMAL_HWMON=y
---
> # CONFIG_THERMAL is not set
Enabling CONFIG_THERMAL and CONFIG_THERMAL_HWMON may expose the CPU temperature, but that is only a possibility. --- regards
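A sketch of making the same changes from the kernel source tree without editing .config by hand (assuming FriendlyARM's 3.4 tree ships the usual scripts/config helper found in kernel trees of that era):
# run from the top of the kernel source tree; option names are given without the CONFIG_ prefix
scripts/config --enable SWAP --enable THERMAL --enable THERMAL_HWMON
make oldconfig   # answer any new prompts, then rebuild the uImage as usual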
hatahata Posted July 4, 2016 Posted July 4, 2016
This morning I tried to boot the NanoPi M3 with the rebuilt uImage. The NanoPi M3 boots, but:
swapon /SWAP
swapon: /SWAP: swapon failed: Function not implemented
--- regards
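'Function not implemented' usually means the running kernel was still built without swap support. A quick check of which config the booted kernel actually uses (a sketch, assuming either /proc/config.gz or an installed config file is available on this image):
# works only if the kernel was built with CONFIG_IKCONFIG_PROC
zgrep -E 'CONFIG_SWAP=|CONFIG_THERMAL=' /proc/config.gz 2>/dev/null
# otherwise look at the config shipped next to the installed kernel (path may differ)
grep -E 'CONFIG_SWAP=|CONFIG_THERMAL=' /boot/config-$(uname -r) 2>/dev/null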
tkaiser Posted July 7, 2016 Posted July 7, 2016
About the NanoPi M3: cross-compiling is possible on DebianDog64 (probably on any 64-bit Intel host running Debian or a similar 64-bit distribution)
So what? Cross-compiling on a 64-bit Ubuntu system is what we do all the time for all our 32-bit kernels / OS images. It doesn't mean anything whether you can use a 64-bit host or not. NanoPi M3 has only a 32-bit kernel, which is bad. The rest of your posting (the thermal stuff) I didn't understand. Fortunately others provide information on how to read out temperatures and how horribly this M3 overheats without a HUGE heatsink + ventilation: http://climbers.net/sbc/40-core-arm-cluster-nanopc-t3/ 1
hatahata Posted July 7, 2016 Posted July 7, 2016
I am surprised by your vast knowledge. I only followed 'Install Cross Compiler' from http://wiki.friendlyarm.com/wiki/index.php/NanoPi_M3#Boot_NanoPi_M3_from_SD_Card using a 64-bit CPU and 64-bit DebianDog. I finally understand that the NanoPi M3 is 64-bit hardware but the present OS is 32-bit. The NanoPi M3 indeed gets hotter and hotter and then suddenly reboots, so I made a device that carries the heat away; now there is no more rebooting. Because it doesn't work easily, it's fascinating. Because there is a mystery, a detective story gets read. Because there is a mountain, it gets climbed. So I think. --- regards http://akita-8core.blogspot.jp/2016/06/nano-pi-m3.html
wildcat_paris Posted July 7, 2016 Posted July 7, 2016
@hatahata
I am surprised by your vast knowledge. I made a device that carries the heat away; now there is no more rebooting.
TK is surely straight to the point and quite knowledgeable, don't be surprised. You made quite a remarkable DIY heatsink. Have you put thermal paste (even a cheap silicone one is enough) between the layers of copper / copper coins / aluminium?
hatahata Posted July 7, 2016 Posted July 7, 2016
Of course. In Japan the 10 yen coin (about $0.10) is made of copper and is the cheapest material; the best would be a silver coin. Are there coins made of copper in the USA? An acrylic plate is easily drilled through with a metal drill, and the aluminium cap presses the coins against the heatsink and so fixes the coins and the heatsink together. I use
tkaiser Posted July 8, 2016 Posted July 8, 2016
I learned recently that there are three different types of heatsinks regarding the fins on top:
Large distance between fins means: convection does the job.
Small distance between fins (like in your case) means: forced airflow is necessary.
Putting coins on top means: ineffective, since the air trapped there no longer dissipates heat but acts as an insulator.
Please run
sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=8
sysbench --test=cpu --cpu-max-prime=200000 run --num-threads=8
and report back the times they take, and monitor CPU clockspeed (throttling!) and /sys/class/hwmon/hwmon0/device/temp_label at the same time (as already outlined, you could simply install RPi-Monitor and then exchange the one line below /etc/rpimonitor/templates that reads out /sys/class/thermal/thermal_zone0/temp)
hatahata Posted July 8, 2016 Posted July 8, 2016
date ; sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=8 ; date
Fri Jul 8 15:24:54 JST 2016
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 8
Doing CPU performance benchmark
Threads started!
Done.
Maximum prime number checked in CPU test: 20000
Test execution summary:
    total time: 57.0156s
    total number of events: 10000
    total time taken by event execution: 455.8727
    per-request statistics:
        min: 45.43ms
        avg: 45.59ms
        max: 77.53ms
        approx. 95 percentile: 45.67ms
Threads fairness:
    events (avg/stddev): 1250.0000/0.87
    execution time (avg/stddev): 56.9841/0.02
Fri Jul 8 15:25:51 JST 2016
And while sysbench --test=cpu --cpu-max-prime=200000 run --num-threads=8 was running:
Fri Jul 8 15:32:58 JST 2016
87
fa@NanoPi3:~$ date ; cat /sys/class/hwmon/hwmon0/device/temp_label
Fri Jul 8 15:33:03 JST 2016
87
fa@NanoPi3:~$ date ; cat /sys/class/hwmon/hwmon0/device/temp_label
Fri Jul 8 15:33:06 JST 2016
88
fa@NanoPi3:~$ date ; cat /sys/class/hwmon/hwmon0/device/temp_label
Fri Jul 8 15:33:09 JST 2016
85
fa@NanoPi3:~$ date ; cat /sys/class/hwmon/hwmon0/device/temp_label
Fri Jul 8 15:33:14 JST 2016
88
fa@NanoPi3:~$ date ; cat /sys/class/hwmon/hwmon0/device/temp_label
Fri Jul 8 15:33:18 JST 2016
87
fa@NanoPi3:~$ date ; cat /sys/class/hwmon/hwmon0/device/temp_label
Fri Jul 8 15:33:23 JST 2016
88
fa@NanoPi3:~$ date ; cat /sys/class/hwmon/hwmon0/device/temp_label
Fri Jul 8 15:33:25 JST 2016
85
fa@NanoPi3:~$ date ; cat /sys/class/hwmon/hwmon0/device/temp_label
Fri Jul 8 15:33:35 JST 2016
86
fa@NanoPi3:~$ date ; cat /sys/class/hwmon/hwmon0/device/temp_label
The max is 88. Thanks for the advice on detecting the NanoPi M3's CPU temperature!
tkaiser Posted July 8, 2016 Posted July 8, 2016
execution time (avg/stddev): 56.9841/0.02
fa@NanoPi3:~$ date ; cat /sys/class/hwmon/hwmon0/device/temp_label
Fri Jul 8 15:33:14 JST 2016
88
To translate this: an ODROID-C2 (quad-core Cortex-A53) is able to finish the same test in 3.x seconds (so now you know it can make a huge difference to be able to execute ARMv8 code on ARMv8 cores -- NanoPi M3 is here as bad as RPi 3). And exceeding 88°C means this:
your heatsink approach doesn't work (compare with the measurements here -- 85° are also reached without any heatsink!)
throttling occurs and we don't know how far the clockspeed has been lowered. Without also monitoring the cpufreq stuff (as already outlined) your results are worthless.
Still: why don't you simply install RPi-Monitor, adjust one single line in a template (temperature) and start to test like an adult?
tkaiser Posted July 8, 2016 Posted July 8, 2016
Thanks for the advice on detecting the NanoPi M3's CPU temperature!
You never follow the advice you get (why?!). Read above, in the first post on this page:
find /sys -iname "*temp*"
Time to stop since it gets boring repeating the same stuff over and over again.
hatahata Posted July 8, 2016 Posted July 8, 2016
I imagine:
Banana Pi, 2 cores -> 25 degrees
ODROID-C2, 4 cores -> 55 degrees
NanoPi M3, 8 cores -> 85 degrees
so an ARM board with 16 cores -> 115 degrees
This is only imagination; in fact ODROID's 8-core board ($79) has active cooling. --- regards
PS: the NanoPi M3 has 3 problems:
1) temperature: tkaiser solved it
2) cannot use swap
3) cannot watch YouTube
mattelacchiato Posted July 16, 2016 Posted July 16, 2016
Hi! I'm thinking about buying this board as a mini NAS. Could somebody provide the max write performance (USB HDD) and network stats?
Write performance:
time (dd if=/dev/zero of=/path/to/usb-drive bs=1M count=1K && sync)
Network performance:
On the Pi: iperf -s
On your PC (wired): iperf -c <pi-IP>
Thanks a lot! Matthias
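A variant of the dd command above that isn't inflated by the page cache (a sketch, assuming the drive is mounted at /path/to/usb-drive and you write to a test file on it):
# flush data to the drive before dd reports the throughput
dd if=/dev/zero of=/path/to/usb-drive/testfile bs=1M count=1024 conv=fdatasync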
hatahata Posted July 20, 2016 Posted July 20, 2016
I tried it partially.
1)
# date ; dd if=/dev/zero of=/ma1/k-test bs=1M count=1K && sync ; date
Thu Jul 21 04:32:26 JST 2016
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 37.0172 s, 29.0 MB/s
Thu Jul 21 04:33:08 JST 2016
The hard disk is an old IDE one.
2)
iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
I only use 5 mm thick heat-conducting material and the aluminium cap. Then:
Thu Jul 21 04:46:25 JST 2016
temperature -> 60
             total       used       free     shared    buffers     cached
Mem:        872328     821296      51032      17908      21068     626484
When additionally using active cooling with a Raspberry Pi cooling fan:
Thu Jul 21 05:03:42 JST 2016
42
             total       used       free     shared    buffers     cached
Mem:        872328     823560      48768      17908      23628     626752
I think 8 cores are like an 8-cylinder engine: harder to control than a 2-cylinder one. But as more and more experience piles up, someday a breakthrough will occur, I believe. ---- regards
hatahata Posted July 22, 2016 Posted July 22, 2016
Hi all. When running sysbench --test=cpu --cpu-max-prime=200000 run --num-threads=8:
case 1) fan blowing from above: CPU temperature is 85~86 degrees
case 2) fan blowing from the lateral side: CPU temperature is 75~76 degrees
Lateral is much better.
Without fan:
Fri Jul 22 11:42:19 JST 2016
58
After starting the lateral side fan:
Fri Jul 22 11:47:39 JST 2016
41
fnecboy Posted July 26, 2016 Posted July 26, 2016
Thank you very much for your interest in FriendlyARM's product. As far as the M3's overheat issue is concerned, I suggest trying the M3's cooling package which includes a specifically designed heat sink and a cooling fan:
With these two accessories applied to the M3 the overheat issue will be greatly relieved.
tkaiser Posted July 26, 2016 Posted July 26, 2016
As far as the M3's overheat issue is concerned, I suggest trying the M3's cooling package which includes a specifically designed heat sink and a cooling fan
Thanks for pointing this out. Since a lot of people discussed the NanoPi M3 here I think it's ok to inform (potential) customers that FriendlyARM designed a specific heatsink + fan and has it in stock. Also good to see that you implemented a sane mounting solution and that you clearly speak of an 'overheat issue', so customers know that improved heat dissipation is necessary when thinking about sustained higher loads.
It would be interesting to get some numbers regarding efficiency with and without the fan, as with the NanoPC-T3 here: http://climbers.net/sbc/40-core-arm-cluster-nanopc-t3/
Apart from that you should be careful with product announcements here. Posts that look like spam trigger deletion and a blocked account pretty fast (Armbian mods try hard to keep the forums free from spam and normally act within minutes)
hatahata Posted July 27, 2016 Posted July 27, 2016
I found a page about the heatsink of the Parallella (http://www.rs-online.com/designspark/electronics/jpn/blog/content-1032). It also has a lateral side fan.
tkaiser Posted August 10, 2016 Posted August 10, 2016
As far as the M3's overheat issue is concerned, I suggest trying the M3's cooling package which includes a specifically designed heat sink and a cooling fan:
Thanks for adding this to your support package (arrived yesterday). I'm currently looking around a bit -- it's quite easy to adapt RPi-Monitor templates since SoC temperature and dvfs settings are available through sysfs:
root@NanoPi3:/# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_voltages
1275000 1225000 1175000 1125000 1100000 1075000 1050000 1025000 1000000 1000000 1000000
root@NanoPi3:/# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
1400000 1300000 1200000 1100000 1000000 900000 800000 700000 600000 500000 400000
root@NanoPi3:/# cat /sys/class/hwmon/hwmon0/device/temp_label
87
These 87°C are with only the heatsink applied and no fan, while running the sysbench cpu test and letting the SoC throttle down to 400 MHz. With the fan active cpufreq increases again and the result is only a 15 percent loss compared to the full performance when no throttling occurs. In other words: yes, both heatsink and fan really help, but NanoPi M3 has to be mounted vibration-free, otherwise the small fan sounds pretty annoying. Apart from that it seems you did a tremendous job regarding the kernel; almost everything is exposed through sysfs. I really start to like this board even if it doesn't match my normal use cases (due to lack of IO bandwidth)
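For reference, adapting the RPi-Monitor template mentioned earlier in the thread boils down to pointing it at this hwmon node instead of the usual thermal_zone path -- a rough sketch (the exact template directory/file name depends on the RPi-Monitor version, and note that temp_label reports plain degrees, so any divide-by-1000 scaling in the template has to be removed as well):
# adjust the template path to whatever your RPi-Monitor version installed
sed -i 's|/sys/class/thermal/thermal_zone0/temp|/sys/class/hwmon/hwmon0/device/temp_label|' \
    /etc/rpimonitor/template*/*.conf
# then restart the daemon so the change takes effect
/etc/init.d/rpimonitor restart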
tkaiser Posted August 10, 2016 Posted August 10, 2016
Some performance numbers: As we already know, NanoPi M3's SoC is prone to overheating, therefore throttling is an issue in case heavy workloads last longer ('longer' as in 'more than approx. 60 seconds'). So I ran a few tests again using sysbench since NanoPi M3 is the other Cortex-A53 board that makes no use of ARMv8 instruction sets, since it comes with a 32-bit only kernel + userland just like RPi 3 (sysbench cannot be used to compare different architectures since with ARMv8 code execution might be 15 times faster -- but we're producing numbers for ARMv7 code and can therefore compare with other 32-bit platforms)
Without a heatsink: 455.5495 seconds and 79°C when running sysbench on a single CPU core. With subsequent runs adding more threads the throttling threshold of 85°C is always hit while execution time is around ~260 seconds (should be at ~230 secs without throttling when running with 2 threads) and heavy throttling occurs, so not even 2 cores can run heavy workloads at the same time without performance impacts. A heatsink is therefore mandatory if you want to do anything heavier with this SBC.
With FriendlyARM's heatsink but the fan inactive: When utilizing only 1, 2 or 3 CPU cores no throttling occurred, but with 4 CPU cores fully active slight throttling started (test execution should finish within 114 seconds but the 3rd run with --num-threads=4 already took 122 seconds and the throttling threshold temperature of 85°C had been reached). All subsequent runs with --num-threads=5/6/7/8 took around ~122 seconds, so with this heatsink and without the fan being active you get pretty much the same ARMv7 performance as an RPi 3 (which maxes out at 120 seconds while remaining at 80°C, so that no throttling occurs if you use a small heatsink). But for shorter load bursts (lasting less than 3 minutes) or with active cooling NanoPi M3 is over twice as fast as RPi 3, so whether its CPU performance outperforms other SBCs depends on your workload and on whether you're willing to use an annoying fan.
Attached is the log of the different runs, each adding one more thread to the sysbench execution. I queried /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state before each new run so it's obvious how NanoPi M3's kernel throttles (interactive governor): at idle it remains at 400 MHz, when load increases it immediately jumps to 1.4 GHz, but when the SoC temperature hits 85°C cpufreq will be lowered down to 800 MHz or lower if that's not sufficient. 1000-1300 MHz are not used. So there's some room for improvements in FA's kernel, or maybe I'm just too inexperienced with this Nexell platform to know the tweaks.
Format is: 2 newlines, then in a single line time stamp and current SoC temperature, followed by time_in_state to compare before/after and then the execution times of three sysbench runs each thereby increasing the --num-threads= by one: tk@NanoPi3:~$ cat /var/log/performance.log Fri Jan 1 17:45:34 CST 2016 55C 1400000 216 1300000 0 1200000 1 1100000 0 1000000 0 900000 4 800000 4 700000 9 600000 0 500000 15 400000 1039 Now testing 1 thread: tk@NanoPi3:~$ cat /var/log/performance.log Fri Jan 1 17:45:34 CST 2016 55C 1400000 216 1300000 0 1200000 1 1100000 0 1000000 0 900000 4 800000 4 700000 9 600000 0 500000 15 400000 1039 Now testing 1 thread: execution time (avg/stddev): 19181615.5700/0.00 execution time (avg/stddev): 455.5304/0.00 execution time (avg/stddev): 455.4859/0.00 Wed Aug 10 18:14:21 CST 2016 64C 1400000 136900 1300000 0 1200000 1 1100000 0 1000000 0 900000 4 800000 4 700000 9 600000 0 500000 27 400000 1066 Now testing 2 threads: execution time (avg/stddev): 227.7513/0.00 execution time (avg/stddev): 227.7390/0.02 execution time (avg/stddev): 227.7543/0.00 Wed Aug 10 18:25:45 CST 2016 72C 1400000 205236 1300000 0 1200000 1 1100000 0 1000000 0 900000 4 800000 4 700000 9 600000 0 500000 27 400000 1066 Now testing 3 threads: execution time (avg/stddev): 151.9069/0.01 execution time (avg/stddev): 151.8217/0.02 execution time (avg/stddev): 151.8324/0.01 Wed Aug 10 18:33:20 CST 2016 79C 1400000 250804 1300000 0 1200000 1 1100000 0 1000000 0 900000 4 800000 4 700000 9 600000 0 500000 27 400000 1066 Now testing 4 threads: execution time (avg/stddev): 113.8653/0.01 execution time (avg/stddev): 115.5463/0.01 execution time (avg/stddev): 122.3739/0.02 Wed Aug 10 18:39:12 CST 2016 84C 1400000 283720 1300000 0 1200000 1 1100000 0 1000000 0 900000 4 800000 1828 700000 459 600000 0 500000 27 400000 1066 Now testing 5 threads: execution time (avg/stddev): 116.5424/0.02 execution time (avg/stddev): 122.9468/0.01 execution time (avg/stddev): 122.7339/0.02 Wed Aug 10 18:45:15 CST 2016 85C 1400000 302574 1300000 0 1200000 1 1100000 0 1000000 0 900000 4 800000 6922 700000 5669 600000 6614 500000 491 400000 1066 Now testing 6 threads: execution time (avg/stddev): 121.1502/0.01 execution time (avg/stddev): 122.4795/0.01 execution time (avg/stddev): 121.6814/0.02 Wed Aug 10 18:51:20 CST 2016 84C 1400000 315553 1300000 0 1200000 1 1100000 0 1000000 0 900000 4 800000 11274 700000 10020 600000 10964 500000 4926 400000 7143 Now testing 7 threads: execution time (avg/stddev): 121.2647/0.02 execution time (avg/stddev): 122.3777/0.03 execution time (avg/stddev): 123.4143/0.03 Wed Aug 10 18:57:27 CST 2016 86C 1400000 324586 1300000 0 1200000 1 1100000 0 1000000 1 900000 4 800000 14956 700000 13695 600000 14640 500000 8596 400000 20132 Now testing 8 threads: execution time (avg/stddev): 121.1366/0.03 execution time (avg/stddev): 122.2647/0.05 execution time (avg/stddev): 121.9314/0.03 Small fan active: Now with fan active slight throttling happens when running 8 threads since then throttling temperature (85°C) is reached. So in case you want to run really heavy stuff (cpuminer for example) or are able to utilize GPU cores too, this combination of heatsink + fan is not enough. 
Fri Jan 1 17:06:30 CST 2016 39C 1400000 239 1300000 0 1200000 0 1100000 0 1000000 0 900000 0 800000 4 700000 0 600000 8 500000 59 400000 1342 Now testing 1 thread: execution time (avg/stddev): 19198877.6893/0.00 execution time (avg/stddev): 455.4465/0.00 execution time (avg/stddev): 455.5070/0.00 Wed Aug 10 22:23:01 CST 2016 49C 1400000 137019 1300000 0 1200000 0 1100000 0 1000000 0 900000 0 800000 4 700000 0 600000 14 500000 84 400000 1376 Now testing 2 threads: execution time (avg/stddev): 227.7335/0.01 execution time (avg/stddev): 227.7450/0.02 execution time (avg/stddev): 227.7162/0.00 Wed Aug 10 22:34:24 CST 2016 54C 1400000 205349 1300000 0 1200000 0 1100000 0 1000000 0 900000 0 800000 4 700000 0 600000 14 500000 84 400000 1378 Now testing 3 threads: execution time (avg/stddev): 151.8363/0.00 execution time (avg/stddev): 151.8241/0.02 execution time (avg/stddev): 151.8284/0.02 Wed Aug 10 22:41:59 CST 2016 58C 1400000 250909 1300000 0 1200000 0 1100000 0 1000000 0 900000 0 800000 4 700000 0 600000 14 500000 84 400000 1379 Now testing 4 threads: execution time (avg/stddev): 113.8751/0.01 execution time (avg/stddev): 113.9169/0.01 execution time (avg/stddev): 113.8720/0.01 Wed Aug 10 22:47:41 CST 2016 64C 1400000 285084 1300000 0 1200000 0 1100000 0 1000000 0 900000 0 800000 5 700000 0 600000 14 500000 84 400000 1380 Now testing 5 threads: execution time (avg/stddev): 91.1338/0.01 execution time (avg/stddev): 91.1068/0.01 execution time (avg/stddev): 91.1270/0.02 Wed Aug 10 22:52:15 CST 2016 68C 1400000 312430 1300000 0 1200000 0 1100000 0 1000000 0 900000 2 800000 5 700000 0 600000 14 500000 84 400000 1380 Now testing 6 threads: execution time (avg/stddev): 75.9181/0.02 execution time (avg/stddev): 75.9481/0.02 execution time (avg/stddev): 75.9080/0.01 Wed Aug 10 22:56:03 CST 2016 74C 1400000 335219 1300000 0 1200000 0 1100000 0 1000000 0 900000 2 800000 5 700000 0 600000 14 500000 84 400000 1380 Now testing 7 threads: execution time (avg/stddev): 65.0871/0.01 execution time (avg/stddev): 65.0932/0.01 execution time (avg/stddev): 65.0849/0.01 Wed Aug 10 22:59:18 CST 2016 81C 1400000 354756 1300000 0 1200000 0 1100000 0 1000000 0 900000 2 800000 5 700000 0 600000 14 500000 84 400000 1380 Now testing 8 threads: execution time (avg/stddev): 57.6927/0.01 execution time (avg/stddev): 60.0639/0.02 execution time (avg/stddev): 61.0930/0.02 Wed Aug 10 23:02:17 CST 2016 84C 1400000 370834 1300000 0 1200000 0 1100000 0 1000000 0 900000 2 800000 1825 700000 0 600000 14 500000 84 400000 1380 Adding 2nd fan: Last try using a 2nd fan blowing air directly from the side towards M3's heatsink (through the fins) and therefore helping the small fan on the heatsink. It's the same that can be seen in post #10 here. 
root@NanoPi3:~# cat /var/log/performance_2_fans.log Fri Jan 1 16:00:15 CST 2016 38C 1400000 244 1300000 0 1200000 0 1100000 0 1000000 0 900000 0 800000 4 700000 0 600000 0 500000 26 400000 1291 Now testing 1 thread: execution time (avg/stddev): 19198878.1969/0.00 execution time (avg/stddev): 455.4915/0.00 execution time (avg/stddev): 455.5227/0.00 Wed Aug 10 21:16:45 CST 2016 39C 1400000 136944 1300000 0 1200000 0 1100000 0 1000000 0 900000 0 800000 4 700000 0 600000 0 500000 26 400000 1309 Now testing 2 threads: execution time (avg/stddev): 227.7456/0.00 execution time (avg/stddev): 227.7520/0.00 execution time (avg/stddev): 227.7420/0.01 Wed Aug 10 21:28:08 CST 2016 45C 1400000 205278 1300000 0 1200000 0 1100000 0 1000000 0 900000 0 800000 4 700000 0 600000 0 500000 26 400000 1310 Now testing 3 threads: execution time (avg/stddev): 151.9040/0.02 execution time (avg/stddev): 151.8259/0.02 execution time (avg/stddev): 151.8836/0.01 Wed Aug 10 21:35:44 CST 2016 49C 1400000 250853 1300000 0 1200000 0 1100000 0 1000000 0 900000 0 800000 4 700000 0 600000 0 500000 26 400000 1310 Now testing 4 threads: execution time (avg/stddev): 113.8652/0.01 execution time (avg/stddev): 113.9173/0.02 execution time (avg/stddev): 113.8786/0.01 Wed Aug 10 21:41:25 CST 2016 54C 1400000 285029 1300000 0 1200000 0 1100000 0 1000000 0 900000 1 800000 4 700000 0 600000 0 500000 26 400000 1310 Now testing 5 threads: execution time (avg/stddev): 91.1168/0.02 execution time (avg/stddev): 91.0957/0.02 execution time (avg/stddev): 91.1245/0.02 Wed Aug 10 21:45:59 CST 2016 55C 1400000 312376 1300000 0 1200000 0 1100000 0 1000000 0 900000 1 800000 4 700000 0 600000 0 500000 26 400000 1310 Now testing 6 threads: execution time (avg/stddev): 75.9273/0.02 execution time (avg/stddev): 75.9355/0.01 execution time (avg/stddev): 75.9105/0.02 Wed Aug 10 21:49:47 CST 2016 60C 1400000 335165 1300000 0 1200000 0 1100000 1 1000000 0 900000 1 800000 4 700000 0 600000 0 500000 26 400000 1310 Now testing 7 threads: execution time (avg/stddev): 65.0825/0.01 execution time (avg/stddev): 65.0925/0.01 execution time (avg/stddev): 65.0917/0.01 Wed Aug 10 21:53:02 CST 2016 64C 1400000 354703 1300000 0 1200000 0 1100000 1 1000000 0 900000 1 800000 4 700000 0 600000 0 500000 26 400000 1310 Now testing 8 threads: execution time (avg/stddev): 56.9910/0.01 execution time (avg/stddev): 57.0447/0.02 execution time (avg/stddev): 57.0200/0.01 Wed Aug 10 21:55:54 CST 2016 69C 1400000 371821 1300000 0 1200000 0 1100000 2 1000000 2 900000 1 800000 4 700000 0 600000 0 500000 26 400000 1489 No throttling occured and even with full load on 8 CPU cores SoC temp not exceeding 70°C. So this is good news since you could combine heatpad + heatsink (even FA's own) with one large/silent fan that blows enough air over the heatsink's surface (through the fins!) and are able to make use of the full octa-core power without annoying noise. Look at the picture with the cardboard roll above to get the idea (tested with small fan removed -- while running sysbench on all 8 cores SoC temperature never exceeded 78°C so using a large fan with controlled airflow you can remove the small fan on the heatsink and remain below throttling tresholds) BTW: M3 was powered through the 4-pin header (using FriendlyARM's convenient PSU-ONECOM board). 
Do not even think about powering it through Micro USB since this won't work with these workloads. The script used to do these measurements (called from /etc/rc.local) looks like this:
root@NanoPi3:~# cat /usr/local/bin/check-perf.sh
#!/bin/bash
for i in 1 2 3 4 5 6 7 8 ; do
	if [ -f /var/log/performance.log ]; then
		echo -e "\n\n$(date) $(cat /sys/class/hwmon/hwmon0/device/temp_label)C\n$(cat /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state)\nNow testing $i threads:" >>/var/log/performance.log
	else
		echo -e "$(date) $(cat /sys/class/hwmon/hwmon0/device/temp_label)C\n$(cat /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state)\nNow testing 1 thread:" >>/var/log/performance.log
	fi
	sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=${i} | grep "execution time" >>/var/log/performance.log
	sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=${i} | grep "execution time" >>/var/log/performance.log
	sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=${i} | grep "execution time" >>/var/log/performance.log
done
echo -e "\n\n$(date) $(cat /sys/class/hwmon/hwmon0/device/temp_label)C\n$(cat /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state)" >>/var/log/performance.log
tkaiser Posted August 28, 2016 Posted August 28, 2016 Since I like this board more and more another round of tests. NanoPi M3 is equipped with a GBit Ethernet interface obviously using the stmmac implementation combining an internal GbE MAC implementation in the SoC with RTL8211E external GBit PHY. First turn FriendlyARM's Debian distro into a server OS: sudo systemctl disable lightdm (that's all, if you want to reclaim space on the SD card you might want to deinstall the GUI stuff but that's not important for performance behaviour) Let's use iperf first to do some passive benchmarking. M3 and Client (x86 host capable of maxing out its GBit interfaces connected to lab switch): M3 --> Client: 730 Mbits/sec Client --> M3: 640 Mbits/sec root@armbian:/var/git/Armbian# iperf -s ------------------------------------------------------------ Server listening on TCP port 5001 TCP window size: 85.3 KByte (default) ------------------------------------------------------------ [ 4] local 192.168.83.115 port 5001 connected with 192.168.83.113 port 52681 [ ID] Interval Transfer Bandwidth [ 4] 0.0-10.0 sec 845 MBytes 706 Mbits/sec [ 5] local 192.168.83.115 port 5001 connected with 192.168.83.113 port 52682 [ 5] 0.0-10.0 sec 841 MBytes 702 Mbits/sec [ 4] local 192.168.83.115 port 5001 connected with 192.168.83.113 port 52683 [ 4] 0.0-10.0 sec 862 MBytes 721 Mbits/sec [ 5] local 192.168.83.115 port 5001 connected with 192.168.83.113 port 52684 [ 5] 0.0-10.1 sec 869 MBytes 726 Mbits/sec [ 4] local 192.168.83.115 port 5001 connected with 192.168.83.113 port 52685 [ 4] 0.0-10.0 sec 849 MBytes 710 Mbits/sec [ 5] local 192.168.83.115 port 5001 connected with 192.168.83.113 port 52686 [ 5] 0.0-300.0 sec 25.5 GBytes 730 Mbits/sec root@NanoPi3:~# iperf -s ------------------------------------------------------------ Server listening on TCP port 5001 TCP window size: 85.3 KByte (default) ------------------------------------------------------------ [ 4] local 192.168.83.113 port 5001 connected with 192.168.83.115 port 60970 [ ID] Interval Transfer Bandwidth [ 4] 0.0-10.0 sec 736 MBytes 616 Mbits/sec [ 5] local 192.168.83.113 port 5001 connected with 192.168.83.115 port 60972 [ 5] 0.0-10.0 sec 745 MBytes 624 Mbits/sec [ 4] local 192.168.83.113 port 5001 connected with 192.168.83.115 port 60974 [ 4] 0.0-10.0 sec 761 MBytes 637 Mbits/sec [ 5] local 192.168.83.113 port 5001 connected with 192.168.83.115 port 60976 [ 5] 0.0-10.0 sec 724 MBytes 606 Mbits/sec [ 4] local 192.168.83.113 port 5001 connected with 192.168.83.115 port 60978 [ 4] 0.0-10.0 sec 748 MBytes 626 Mbits/sec [ 5] local 192.168.83.113 port 5001 connected with 192.168.83.115 port 60980 [ 5] 0.0-300.0 sec 22.4 GBytes 643 Mbits/sec Now let's take a closer look what happened by using htop, limiting maximum cpufreq to the lower frequency so CPU cores remain at 400 MHz all the time and looking at /proc/interrupts and monitoring cpu clockspeeds: while true ; do cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq; sleep 1; done 5 minutes later: cpufreq matters, when running only at 400 MHz instead of 1400 MHz throughput is affected (so we must look at cpufreq scaling behaviour in the next step!) eth0 IRQs are spread accross all 8 CPU cores with this kernel (not that good, it's known that a fixed IRQ affinity helps on SMP systems with Ethernet loads) When testing client --> M3 performance iperf runs single threaded and maxes out 1 CPU core. This is obviously a limiting factor When testing M3 --> client performance iperf activity is spread accross all CPU cores. 
This leads to cpufreq remaining most of the times on the lowest CPU clockspeed (400 MHz) with interactive cpufreq governor which is obviously a limiting factor To address the last issue switching to performance governor would be a 'solution' or looking into interactive taking notice of this sort of activity and switching to maximum clockspeed. So let's try to improve Ethernet IRQ handling and also create an artificial bottleneck and see what happens. Only change made is this: echo 2 >/sys/class/net/eth0/queues/rx-0/rps_cpus Now iperf and iperf3 performance increases a lot. We're exceeding already 900 Mbits/sec in both directions: root@NanoPi3:~# iperf3 -c 192.168.83.115 -w 512k -l 512k Connecting to host 192.168.83.115, port 5201 [ 4] local 192.168.83.113 port 60061 connected to 192.168.83.115 port 5201 [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 0.00-1.00 sec 108 MBytes 904 Mbits/sec 0 273 KBytes [ 4] 1.00-2.00 sec 110 MBytes 924 Mbits/sec 0 273 KBytes [ 4] 2.00-3.00 sec 108 MBytes 909 Mbits/sec 0 273 KBytes [ 4] 3.00-4.02 sec 109 MBytes 896 Mbits/sec 0 273 KBytes [ 4] 4.02-5.01 sec 105 MBytes 891 Mbits/sec 0 273 KBytes [ 4] 5.01-6.02 sec 110 MBytes 912 Mbits/sec 0 273 KBytes [ 4] 6.02-7.00 sec 106 MBytes 903 Mbits/sec 0 273 KBytes [ 4] 7.00-8.00 sec 109 MBytes 917 Mbits/sec 0 273 KBytes [ 4] 8.00-9.00 sec 108 MBytes 909 Mbits/sec 0 273 KBytes [ 4] 9.00-10.00 sec 105 MBytes 883 Mbits/sec 0 273 KBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Retr [ 4] 0.00-10.00 sec 1.05 GBytes 905 Mbits/sec 0 sender [ 4] 0.00-10.00 sec 1.05 GBytes 905 Mbits/sec receiver root@armbian:/var/git/Armbian# iperf3 -s ----------------------------------------------------------- Server listening on 5201 ----------------------------------------------------------- Accepted connection from 192.168.83.113, port 60058 [ 5] local 192.168.83.115 port 5201 connected to 192.168.83.113 port 60059 [ ID] Interval Transfer Bandwidth [ 5] 0.00-1.00 sec 102 MBytes 851 Mbits/sec [ 5] 1.00-2.00 sec 107 MBytes 896 Mbits/sec [ 5] 2.00-3.00 sec 112 MBytes 937 Mbits/sec [ 5] 3.00-4.00 sec 106 MBytes 893 Mbits/sec [ 5] 4.00-5.00 sec 111 MBytes 927 Mbits/sec [ 5] 5.00-6.00 sec 106 MBytes 891 Mbits/sec [ 5] 6.00-7.00 sec 107 MBytes 901 Mbits/sec [ 5] 7.00-8.00 sec 105 MBytes 885 Mbits/sec [ 5] 8.00-9.00 sec 110 MBytes 925 Mbits/sec [ 5] 9.00-10.00 sec 110 MBytes 922 Mbits/sec [ 5] 10.00-10.06 sec 7.11 MBytes 941 Mbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Retr [ 5] 0.00-10.06 sec 1.06 GBytes 904 Mbits/sec 0 sender [ 5] 0.00-10.06 sec 1.06 GBytes 903 Mbits/sec receiver Now the most important thing to notice: 900 Mbits/sec reported by iperf are enough if we start to think about how synthetic benchmarks correlate with reality. Using iperf with default window sizes is a joke (way too small) so by further tuning this stuff also iperf performance numbers will improve. Are these better numbers important or even good? Not at all since real world applications behave differently (see here for an example how Windows' Explorer or OS X Finder tune their settings dynamically to get the idea how wrong it is to use iperf with default window sizes). Also iperf is limited by being bound to one CPU core when running as server task. 
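Side note: besides RPS, the Ethernet receive IRQ itself can be pinned to a single core -- a minimal sketch, assuming the IRQ number still has to be looked up first on this kernel (replace <irq> with the number shown by /proc/interrupts):
# find the IRQ line(s) used by eth0 (the interrupt name may differ on this stmmac kernel)
grep -i eth0 /proc/interrupts
# pin that IRQ to CPU1 (bitmask 2)
echo 2 > /proc/irq/<irq>/smp_affinity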
So what did we achieve with this single line added to /etc/rc.local: echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus We let all Ethernet RX IRQs be processed on a single CPU core (better performance) and also created a bottleneck which let the interactive governor behave 'better' and let cpufreq immediately jump from 400 MHz to 1400 Mhz which helps with benchmark numbers. What does this mean for real workloads running a web server or a NAS daemon? There the overall higher CPU activity would for sure lead to cpufreq scaling jumping to the maximum 1400 MHz and inventing a bottleneck as above might even negatively affect performance. This is something that has to be tested with real world workloads. Just better benchmark scores are not sufficient. What do we learn from that? Passive benchmarking as it's done by most people does only create numbers without meaning and is crap. Always. By looking at what's happening we are able to identify the bottlenecks that prevent synthetic benchmarks producing nice numbers. We now know how to improve numbers for this specific benchmark, we know that the tool in question sucks somehow (being maxed out by acting single threaded in a specific mode) and that the 'fix' for better benchmark scores might be counterproductive for real world workloads. So now we know that Gbit Ethernet on this board performs very well, we know that by influencing CPU affinity of Ethernet RX IRQs we affect performance in 2 different ways (better IRQ processing and better cpufreq scaling of interactive governor) and most importantly we know where to look at when more serious network testing starts with real world workloads and not silly stuff like iperf/iperf3 with default settings. BTW: Openend issue at FriendlyARM to let them know: https://github.com/friendlyarm/linux-3.4.y/issues/5 Update: I tested static delivery of web pages with nginx, this testfile with weighttp using this command line from my x86 box weighttp -n 100000 -c 20 -t 4 -k http://$target:80/testfile.js and with/without tweaking /sys/class/net/eth0/queues/rx-0/rps_cpus: No differences (result variation below 5 percent difference can be regarded identical). It's 4330 req/s on average (Pine64+ with BSP kernel, no network tuning at all and GbE connected to the same Gbit switch gets a +4800 req/s score, Pine64 testing my x86 host gets 4950 req/s but results are constant 496 req/s when one of the hosts is forced to use 100 MBits/sec network -- ethtool -s eth0 speed 100 duplex full -- so this test is bullshit anyway since it's not a webserver test but a throughput test similar to iperf) root@armbian:/usr/local/src/weighttp/build/default# ./weighttp -n 100000 -c 20 -t 4 -k 192.168.83.113:80/testfile.js weighttp 0.4 - a lightweight and simple webserver benchmarking tool starting benchmark... 
spawning thread #1: 5 concurrent requests, 25000 total requests spawning thread #2: 5 concurrent requests, 25000 total requests spawning thread #3: 5 concurrent requests, 25000 total requests spawning thread #4: 5 concurrent requests, 25000 total requests progress: 10% done progress: 20% done progress: 30% done progress: 40% done progress: 50% done progress: 60% done progress: 70% done progress: 80% done progress: 90% done progress: 100% done finished in 23 sec, 28 millisec and 421 microsec, 4342 req/s, 100135 kbyte/s requests: 100000 total, 100000 started, 100000 done, 100000 succeeded, 0 failed, 0 errored status codes: 100000 2xx, 0 3xx, 0 4xx, 0 5xx traffic: 2361295050 bytes total, 25295050 bytes http, 2336000000 bytes data root@armbian:/usr/local/src/weighttp/build/default# ./weighttp -n 100000 -c 20 -t 4 -k 192.168.83.113:80/testfile.js weighttp 0.4 - a lightweight and simple webserver benchmarking tool starting benchmark... spawning thread #1: 5 concurrent requests, 25000 total requests spawning thread #2: 5 concurrent requests, 25000 total requests spawning thread #3: 5 concurrent requests, 25000 total requests spawning thread #4: 5 concurrent requests, 25000 total requests progress: 10% done progress: 20% done progress: 30% done progress: 40% done progress: 50% done progress: 60% done progress: 70% done progress: 80% done progress: 90% done progress: 100% done finished in 23 sec, 208 millisec and 782 microsec, 4308 req/s, 99356 kbyte/s requests: 100000 total, 100000 started, 100000 done, 100000 succeeded, 0 failed, 0 errored status codes: 100000 2xx, 0 3xx, 0 4xx, 0 5xx traffic: 2361295050 bytes total, 25295050 bytes http, 2336000000 bytes data 1
tkaiser Posted August 29, 2016 Posted August 29, 2016
Some more performance numbers. As we've already seen, increasing Gbit Ethernet throughput to the max was a simple echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus added to /etc/rc.local since cpufreq governor behaviour negatively influenced throughput numbers. It will be interesting what has to be tweaked to also get the lowest latency since one possible use case for this beefy board is to run cluster workloads.
Let's try cpuminer first: grab https://sourceforge.net/projects/cpuminer/files/pooler-cpuminer-2.4.5.tar.gz/download and then do
sudo apt-get install libcurl4-gnutls-dev
./configure CFLAGS="-O3 -mfpu=neon"
make
./minerd --benchmark
I was not able to run the benchmark with my setup at 1.4 GHz; 1.3 GHz was the maximum since 85°C is already reached at that point and the M3's kernel's somewhat inefficient throttling starts (only switching between very low clockspeeds and 1400 MHz, leading to lower khash/s values compared to fixing the max clockspeed at 1.3 GHz). Running at 1300 MHz on 8 cores I got a whopping 9.96 khash/s score with this NEON optimized cpuminer version. Increasing the maximum cpufreq slowly was necessary since otherwise the board simply deadlocked (I would suspect the PSU is simply overloaded when cpuminer starts on 8 cores at 1400 MHz). So with better cooling +10.6 khash/s should be possible. As a comparison: with quad-core H3 (Orange Pi PC) at 1296 MHz we get 2.35 khash/s and with quad-core A64 (Pine64+ overclocked/overvolted to 1296 MHz) we get 3.9 khash/s.
Now Linpack/OpenBLAS with NEON optimizations: I followed these instructions: https://github.com/deater/performance_results/tree/master/build_instructions
With a freshly built Linpack with NEON optimizations I thought I'd start at 800 MHz cpufreq: 7.476 GFLOPS. Using 900 MHz I got 8.227 GFLOPS, and at 1000 MHz the M3 simply deadlocked -- most probably a sign that my PSU is too weak, since at 900 MHz the SoC temperature only reached 70°C so it was not a thermal issue (this Linpack version is known to power off SBCs with insufficient power supply 100 percent reliably). That means in case one uses a better PSU than mine and a more efficient heatsink+fan, exceeding 12 GFLOPS should be possible, but at the cost of insanely high consumption and huge efforts for cooling and power supply. As a comparison: with quad-core H3 (Orange Pi PC) at 1296 MHz we get 1.73 GFLOPS and with quad-core A64 (Pine64+ overclocked/overvolted to 1296 MHz) we get 3.4 GFLOPS, while an RPi 3 running at just 1.2 GHz achieves 3.6 GFLOPS.
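For reference, capping the maximum clockspeed as described above is just a sysfs write -- a sketch, assuming the cpufreq layout shown earlier in this thread (some kernels expose writable per-core nodes, others share one policy, hence the loop and the ignored errors):
# cap cpufreq at 1.3 GHz to stay below the 85°C throttling point under cpuminer
for cpu in /sys/devices/system/cpu/cpu[0-7]; do
    echo 1300000 > $cpu/cpufreq/scaling_max_freq 2>/dev/null
done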
tkaiser Posted August 30, 2016 Posted August 30, 2016 Today a look at IO and disk performance / features / capabilities The SoC used on NanoPi M3 (applies to M2 and NanoPC-T2/T3 too) has one USB OTG port available through the Micro USB connector and one host port connected to an internal USB hub (Genesys Logic GL852G). All ports are USB 2.0 and that means that all USB receptacles and the 2 USB ports available on a pin header are behind the internal USB hub and have to share bandwidth A quick test with an ordinary notebook disk confirms that USB performance on the host ports (behind the hub) is not that good, I got sequential speed results at around ~27 MB/s. Also the kernel only supports USB mass storage mode and not UASP so better no expectations to get high random IO numbers when connecting a SSD in an UASP capable enclosure. So let's try out the OTG port available through the Micro USB jack (obviously you need to power the board then through the 4 pin header next to the 40 pin GPIO header which is recommended anyway!). The OTG port is in device mode by default so let's take a short adapter cable and test the disk again. Good news, sequential USB performance on the Micro USB port is excellent (I used only hdparm since on the disk are HFS+ filesystems so with a more serious benchmark performance will be somewhere between 35 and 37 MB/s), also it's possible to query the disk there with SMART and even fire up SMART selftests, but using hdparm to control standby/sleeping behaviour didn't work and trying to spin the disk down immediately (hdparm -Y /dev/sda) ended with a kernel panic root@NanoPi3:~# lsusb Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub Bus 001 Device 002: ID 05e3:0610 Genesys Logic, Inc. 4-port hub Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub Bus 002 Device 003: ID 13fd:1840 Initio Corporation INIC-1608 SATA bridge Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub root@NanoPi3:~# cat /sys/bus/platform/devices/dwc_otg/otg_mode device root@NanoPi3:~# hdparm -t /dev/sda /dev/sda: Timing buffered disk reads: 118 MB in 3.03 seconds = 39.01 MB/sec root@NanoPi3:~# smartctl -t short /dev/sda smartctl 6.4 2014-10-07 r4002 [armv7l-linux-3.4.39-s5p6818] (local build) Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION === Sending command: "Execute SMART Short self-test routine immediately in off-line mode". Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful. Testing has begun. Please wait 2 minutes for test to complete. Test will complete after Tue Aug 30 16:28:16 2016 Use smartctl -X to abort test. root@NanoPi3:~# smartctl -a /dev/sda smartctl 6.4 2014-10-07 r4002 [armv7l-linux-3.4.39-s5p6818] (local build) Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Seagate Momentus SpinPoint M8 (AF) Device Model: ST1000LM024 HN-M101MBB Serial Number: S2RXJ9BDA05287 LU WWN Device Id: 5 0004cf 20b5c5a31 Firmware Version: 2AR10002 User Capacity: 1,000,204,886,016 bytes [1.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 5400 rpm Form Factor: 2.5 inches Device is: In smartctl database [for details use: -P show] ATA Version is: ATA8-ACS T13/1699-D revision 6 SATA Version is: SATA 3.0, 3.0 Gb/s (current: 1.5 Gb/s) Local Time is: Tue Aug 30 16:29:38 2016 CST SMART support is: Available - device has SMART capability. 
SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART Status command failed: scsi error medium or hardware error (serious) SMART overall-health self-assessment test result: PASSED Warning: This result is based on an Attribute check. General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (13380) seconds. Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 223) minutes. SCT capabilities: (0x003f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 0 2 Throughput_Performance 0x0026 252 252 000 Old_age Always - 0 3 Spin_Up_Time 0x0023 086 086 025 Pre-fail Always - 4457 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 203 5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0 8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 91 10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 252 252 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 40 191 G-Sense_Error_Rate 0x0022 252 252 000 Old_age Always - 0 192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0 194 Temperature_Celsius 0x0002 064 058 000 Old_age Always - 30 (Min/Max 21/42) 195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0 196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0036 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 3 223 Load_Retry_Count 0x0032 252 252 000 Old_age Always - 0 225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 501 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 90 - SMART Selective self-test log data structure revision number 0 Note: revision number not 1 implies that no selective self-test has ever been run SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Completed [00% left] (0-65535) 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. 
If Selective self-test is pending on power-up, resume after 0 minute delay.
Please note: USB IRQs are distributed across all CPU cores by default but love cpu0 for unknown reasons (so there is a chance that further performance improvements can be made by controlling IRQ distribution):
81: 7366 1527 1764 2337 3630 2880 2980 3620 GIC dwc_otg, dwc_otg_pcd, dwc_otg_hcd:usb2
82: 1038 576 387 944 617 472 410 658 GIC ehci_hcd:usb1, ohci_hcd:usb3
I also used the performance cpufreq governor since I only tested with hdparm (which should not be used as a benchmark since it's only able to read and 'benchmark' execution is way too short!) and when using interactive the benchmark results would have been tampered with too much by switching from the lowest clockspeed to the upper one. For real world workloads interactive is pretty fine since within ~0.5 seconds cpufreq will be increased to the max. Using 2 USB disks in a RAID-0 or RAID-1 should work when one disk is connected to Micro USB and the other to a normal host port. IMO only RAID-0 to increase performance (expect ~50MB/s which is ok-ish given the GbE performance) would make some sense since disk redundancy is not that useful on an SBC. Setting up a RAID-0 when connecting two disks to both USB host ports is useless of course since the ports have to share bandwidth.
TL;DR: USB performance on the host ports is somewhat limited (to be avoided anyway since all ports have to share bandwidth) but using disks connected to the Micro USB connector works by default, with high performance and a rather full feature set (at least SMART is possible if the USB enclosure also supports it!). Since UASP support is missing due to a horribly outdated kernel version you shouldn't expect wonders regarding random IO speeds (but this is something where only a few sunxi SoCs can shine since H3, A20 and A64 support UASP with USB 2.0 when running on mainline kernel)
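For completeness, a quick way to see which driver a USB disk is actually bound to (on this 3.4 kernel it will always be usb-storage since UASP support is missing, while a mainline kernel with a UASP capable enclosure would show uas instead):
# lists the USB topology including the driver bound to each device;
# look for 'Driver=usb-storage' vs 'Driver=uas' next to the disk enclosure
lsusb -t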
constantius Posted August 30, 2016 Posted August 30, 2016
Hi, a short question: will you make Armbian for the NanoPi M2 and M3 or not?
best regards Radek
tkaiser Posted August 30, 2016 Posted August 30, 2016
a short question: will you make Armbian for the NanoPi M2 and M3 or not?
The short answer was already given yesterday -- but in the wrong thread: http://forum.armbian.com/index.php/topic/1917-armbian-running-on-pine64-and-other-a64h5-devices/#entry14698
I don't know what Igor thinks (he also has 2 M3s on his desk) and at least I won't start with the M3. I only had a look into a few performance tunables, and here the board looks quite nice, so we know what to expect from an Armbian port or from this SBC in general. But IMO FriendlyARM's OS images are pretty nice, support all their peripherals (displays, Matrix stuff) out of the box in a perfect way, and all you have to do to get a headless operation mode is disable the start of the lightdm daemon (and a few performance tweaks are listed in this thread already). So there's not much to gain from an Armbian port anyway. And I fear if we started supporting M2/M3 (and therefore also NanoPC-T2/T3) we would have to focus on headless usage since I would find it frustrating to release desktop images that suck compared to FriendlyARM's because not all of the stuff is working (for example pairing one of the touchscreen LCDs from FA with M2/M3 is just connecting the ribbon cable, everything else works out of the box)
constantius Posted August 30, 2016 Posted August 30, 2016
"but FriendlyARM's OS images are pretty nice"
NanoPi M2: the Ubuntu 15.04 image does not work after updating to 15.10, Kali Linux has no sound, and Debian 8 only supports 720p resolution. To change the resolution to 1080p you are supposed to compile the kernel -- it does not work. Android works. NanoPi M3: the same as the M2, but without Kali and Ubuntu, which do not exist.
eternalWalker Posted August 30, 2016 Author Posted August 30, 2016
...and additionally, "Matrix" is not working (on the M3), and the kernel is only 32-bit.
tkaiser Posted August 30, 2016 Posted August 30, 2016
"but FriendlyARM's OS images are pretty nice" NanoPi M2: the Ubuntu 15.04 image does not work after updating to 15.10, Kali Linux has no sound, and Debian 8 only supports 720p resolution. To change the resolution to 1080p you are supposed to compile the kernel -- it does not work. Android works. NanoPi M3: the same as the M2, but without Kali and Ubuntu, which do not exist.
Well, then you won't gain that much from headless Armbian images anyway. I really don't care about display output, and if I did I would use the LCDs since I've never seen a board + LCD working that flawlessly out of the box. After dealing too long with OS images from other Chinese vendors (especially those from 'Team BPi') my expectations weren't that high, but dealing with their Debian images for M1 and M3 so far was impressive. Everything worked nearly perfectly (only upgrading from Jessie to Stretch on their M3 image failed, but hey, that's the 'testing' distribution, so why should I expect that to work without problems?)
"Matrix" is not working (on the M3), the kernel is only 32-bit
I was talking about hardware for which they provide comprehensive tutorials and the necessary code to get up and running. What are you talking about?
eternalWalker Posted August 30, 2016 Author Posted August 30, 2016
The software... A problem from the FriendlyARM forum. The question: "Hi Devs, please test the M3 with the latest Matrix from git. I am getting failures on initialising PWM & GPIO. Regards" And the answer (FATechsupport): "Unfortunately the Matrix code may not work with M2 for now and we haven't tested the code yet. We only tested the code for M1/NEO and Pi2/Fire/M2/T2" Isn't that a great answer?