
SBC consumption/performance comparisons




The following is the start of a series of tests regarding a minimized consumption mode for SBCs. The idea behind this is to provide Armbian settings that allow some of the SBCs we support to be used as cheap and low-power IoT nodes (or call it 'home automation' or anything else -- at least it's about low-power and headless usage).

 
I start with some consumption comparisons of various RPi models below, just to get a baseline of numbers and to verify that my consumption monitoring setup is OK. You also find a lot of questionable numbers on the net (measured with inappropriate equipment, comparing OS images with different settings, not taking cable resistance into account, wonky methodology only looking at current and forgetting about voltage fluctuations and so on), so I thought let's take a few RPi lying around and do my own measurements with an absolutely identical setup so the numbers I get can be compared reliably.
 
I used the most recent Raspbian Jessie Lite image currently available (2016-05-27-raspbian-jessie-lite.img) with the latest kernel (4.4.13), all upgrades applied, HDMI disabled in /etc/rc.local by '/usr/bin/tvservice -o' and the RPi powered through a USB port of a Banana Pro (my 'monitoring PSU' -- all values below are 30 min averages). All tests were done using the same Raspbian installation on the same SD card.
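For reference, disabling HDMI that way is just a one-liner added to /etc/rc.local before its final 'exit 0' (a minimal sketch of the relevant lines):
 
# /etc/rc.local (excerpt) -- switch the HDMI/composite output off at every boot
/usr/bin/tvservice -o
exit 0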
 
Memory throughput tests done using https://github.com/ssvb/tinymembench
 
CPU 'benchmarks' were done using sysbench (which is known to be unsuitable for comparing different CPU architectures, but since the RPi 3 has to run with an ARMv7 kernel and ARMv6 userland it's OK to use it here; it's also lightweight enough not to overload my 'monitoring PSU', and throttling could be prevented just by applying a cheap heatsink to the SoC). If we were able to run ARMv8 code on the RPi 3's SoC then sysbench would be completely useless, since the test would then not take 120 seconds but less than 10 (that's what you get from not using the ARMv8 instruction set).
 
I always used two sysbench runs, the first with '--cpu-max-prime=20000' to get some numbers to compare, the second running for at least an hour with '--cpu-max-prime=2000000' to get reliable consumption readings. With the RPi 3 applying a cheap heatsink was necessary to prevent throttling (cpufreq remained at 1200 MHz and SoC temperature at 80°C). The tests used all available CPU cores, so the results only apply to multi-threaded workloads (keep that in mind -- your 'real world' application requirements normally look different!):
sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$(grep -c '^processor' /proc/cpuinfo)
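The second, long-running invocation used for the consumption averages was the same apart from the higher prime limit (shown here for completeness, assuming an otherwise identical call):
 
sysbench --test=cpu --cpu-max-prime=2000000 run --num-threads=$(grep -c '^processor' /proc/cpuinfo)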
A few words regarding the RPi platform: all RPi use basically the same SoC. It's a Broadcom VideoCore IV SoC that boots a proprietary OS, combined with 1 to 4 ARM cores that are brought up later. RPi Zero/A/A+/B/B+ use the BCM2835 SoC which adds one ARMv6 core to the VideoCore VPU/GPU, the BCM2836 replaced this with a quad-core ARMv7 cluster, and on the latest BCM2837 design they replaced the Cortex-A7 cores with Cortex-A53 cores that currently have to run in 32-bit mode only.
 
The other limitations this platform suffers from are also due to this design (VideoCore VPU/GPU being the main part of the SoC and no further SoC development done except exchanging ARM cores and minor memory interface improvements):
  • only one single USB 2.0 OTG port available between SoC and the outside
  • only DDR2 DRAM possible and the maximum is 1GB (all RPi use LPDDR2 at 1.2V)
  • FAT partition needed where the proprietary VideoCore bootloader BLOBs are located
So how do some RPi provide Ethernet and 2 or 4 USB ports? They use an onboard component called LAN9512 (Fast Ethernet + 2 USB ports on RPi B -- not B+!) or LAN9514 (Fast Ethernet + 4 USB ports on RPi B+, 2 and 3). The RPi models that omit this component (RPi A+ and Zero) not surprisingly show the lowest consumption numbers. The same could've been true for the RPi A, but unfortunately the Raspberry Pi Foundation chose inefficient LDO (low-dropout) regulators to generate the 3.3V and 1.8V needed by various ICs on the boards, which transform power into heat on the first two models (so no numbers for RPi A and B here since they're not suitable for low-power operation due to this design flaw).
 
We can see below that disabling the LAN9514 hub/Ethernet combo makes a huge difference regarding consumption, which we should take into account if we start to compare with boards supported by Armbian (e.g. H3 boards that feature real Ethernet and 4 real USB ports). The same applies to RPi A+ or Zero when a USB-to-Ethernet dongle is connected, but here it heavily depends on the dongle in question. When using one of my Gbit Ethernet dongles (Realtek RTL8153 based) consumption increases by +1100mW regardless of whether buspower is 0 or 1; with a random Fast Ethernet adapter it makes a difference -- see below.
 
RPi Zero with nothing connected, doing nothing, just power led: 
  • echo 0 >/sys/devices/platform/soc/20980000.usb/buspower --> 365 mW
With a connected Apple USB-Fast-Ethernet dongle consumption is like this:
  • echo 0 >/sys/devices/platform/soc/20980000.usb/buspower --> 410 mW (no network)
  • echo 1 >/sys/devices/platform/soc/20980000.usb/buspower --> 1420 mW (network active, cable inserted but idling)
That means this USB-Ethernet dongle consumes 45mW when just connected (regardless of whether the RPi is completely powered off or buspower = 0), and as soon as a USB connection between dongle and RPi is negotiated plus an Ethernet connection on the other side, another whopping 1010 mW adds to overall consumption. Therefore choose your Ethernet dongle wisely when you deal with devices that lack native Ethernet capabilities ;)
 
Fortunately the RPi Zero exposes the SoC's one single OTG port as Micro USB with ID pin, so the Zero -- unlike all other RPi models -- can switch to a USB gadget role and we can use the USB OTG connection as a network connection using the g_ether module (quite simple in the meantime with most recent Raspbian images, just have a look at https://gist.github.com/gbaman/975e2db164b3ca2b51ae11e45e8fd40a). I'll cover performance and consumption numbers in this mode in a later post (covering idle and full load and also some camera scenarios since this is my only use case for any RPi: HW accelerated video encoding).
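For those who want to try it: based on the linked gist, enabling the g_ether gadget boils down to two small edits (just a sketch -- details may vary slightly between Raspbian releases):
 
# /boot/config.txt -- let the OTG port be driven by the dwc2 controller driver
dtoverlay=dwc2
 
# /boot/cmdline.txt -- append to the single existing line, right after 'rootwait'
modules-load=dwc2,g_ether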
 
Performance numbers: sysbench takes 915 seconds on the single core @ 1000 MHz, 800 mW reported (+435 mW compared to 'baseline'). And tinymembench looks like this:
 

 

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    169.8 MB/s
 C copy backwards (32 byte blocks)                    :    170.6 MB/s
 C copy backwards (64 byte blocks)                    :    168.2 MB/s
 C copy                                               :    185.9 MB/s (0.2%)
 C copy prefetched (32 bytes step)                    :    449.0 MB/s
 C copy prefetched (64 bytes step)                    :    273.3 MB/s
 C 2-pass copy                                        :    180.2 MB/s (4.0%)
 C 2-pass copy prefetched (32 bytes step)             :    313.4 MB/s
 C 2-pass copy prefetched (64 bytes step)             :    272.3 MB/s (2.7%)
 C fill                                               :    856.7 MB/s (4.1%)
 C fill (shuffle within 16 byte blocks)               :    856.8 MB/s (3.8%)
 C fill (shuffle within 32 byte blocks)               :    856.6 MB/s
 C fill (shuffle within 64 byte blocks)               :    856.9 MB/s
 ---
 standard memcpy                                      :    439.9 MB/s
 standard memset                                      :   1693.5 MB/s (3.7%)
 ---
 VFP copy                                             :    222.7 MB/s
 VFP 2-pass copy                                      :    198.1 MB/s (2.4%)
 ARM fill (STRD)                                      :    856.6 MB/s
 ARM fill (STM with 8 registers)                      :   1675.7 MB/s
 ARM fill (STM with 4 registers)                      :   1693.4 MB/s (3.7%)
 ARM copy prefetched (incr pld)                       :    440.0 MB/s (4.2%)
 ARM copy prefetched (wrap pld)                       :    270.4 MB/s
 ARM 2-pass copy prefetched (incr pld)                :    379.7 MB/s (3.7%)
 ARM 2-pass copy prefetched (wrap pld)                :    308.5 MB/s

==========================================================================
== Framebuffer read tests.                                              ==
==                                                                      ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled.       ==
== Writes to such framebuffers are quite fast, but reads are much       ==
== slower and very sensitive to the alignment and the selection of      ==
== CPU instructions which are used for accessing memory.                ==
==                                                                      ==
== Many x86 systems allocate the framebuffer in the GPU memory,         ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover,    ==
== PCI-E is asymmetric and handles reads a lot worse than writes.       ==
==                                                                      ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer    ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall    ==
== performance improvement. For example, the xf86-video-fbturbo DDX     ==
== uses this trick.                                                     ==
==========================================================================

 VFP copy (from framebuffer)                          :    223.8 MB/s
 VFP 2-pass copy (from framebuffer)                   :    207.0 MB/s
 ARM copy (from framebuffer)                          :    188.3 MB/s
 ARM 2-pass copy (from framebuffer)                   :    215.1 MB/s (3.5%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.8 ns          /     1.2 ns 
     32768 :   19.8 ns          /    32.4 ns 
     65536 :   32.6 ns          /    47.0 ns 
    131072 :   42.4 ns          /    57.3 ns 
    262144 :   98.1 ns          /   157.5 ns 
    524288 :  166.2 ns          /   293.0 ns 
   1048576 :  200.2 ns          /   364.7 ns 
   2097152 :  217.3 ns          /   401.7 ns 
   4194304 :  226.0 ns          /   420.5 ns 
   8388608 :  231.0 ns          /   430.4 ns 
  16777216 :  236.4 ns          /   442.2 ns 
  33554432 :  251.4 ns          /   473.6 ns 
  67108864 :  288.9 ns          /   548.7 ns
 
 
RPi B+: Nothing connected, doing nothing, just power led:
  • echo 0 >/sys/devices/platform/soc/20980000.usb/buspower --> 600 mW
  • echo 1 >/sys/devices/platform/soc/20980000.usb/buspower --> 985 mW
  • buspower = 1 and Ethernet cable connected --> 1200 mW
Performance: sysbench took 1311 seconds @ 700 MHz while 1160 mW consumption was reported (+175 mW compared to 'baseline')
 
This is tinymembench:
 

 

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    127.2 MB/s (0.1%)
 C copy backwards (32 byte blocks)                    :    130.5 MB/s
 C copy backwards (64 byte blocks)                    :    129.5 MB/s
 C copy                                               :    144.9 MB/s
 C copy prefetched (32 bytes step)                    :    368.9 MB/s
 C copy prefetched (64 bytes step)                    :    212.6 MB/s
 C 2-pass copy                                        :    137.8 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    248.5 MB/s (0.1%)
 C 2-pass copy prefetched (64 bytes step)             :    207.7 MB/s
 C fill                                               :    760.4 MB/s
 C fill (shuffle within 16 byte blocks)               :    760.5 MB/s
 C fill (shuffle within 32 byte blocks)               :    760.3 MB/s
 C fill (shuffle within 64 byte blocks)               :    760.5 MB/s
 ---
 standard memcpy                                      :    380.2 MB/s
 standard memset                                      :   1483.9 MB/s
 ---
 VFP copy                                             :    165.0 MB/s
 VFP 2-pass copy                                      :    145.1 MB/s
 ARM fill (STRD)                                      :    760.6 MB/s (1.3%)
 ARM fill (STM with 8 registers)                      :   1101.7 MB/s (0.1%)
 ARM fill (STM with 4 registers)                      :   1484.2 MB/s
 ARM copy prefetched (incr pld)                       :    380.3 MB/s
 ARM copy prefetched (wrap pld)                       :    205.3 MB/s
 ARM 2-pass copy prefetched (incr pld)                :    291.5 MB/s
 ARM 2-pass copy prefetched (wrap pld)                :    237.1 MB/s

==========================================================================
== Framebuffer read tests.                                              ==
==                                                                      ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled.       ==
== Writes to such framebuffers are quite fast, but reads are much       ==
== slower and very sensitive to the alignment and the selection of      ==
== CPU instructions which are used for accessing memory.                ==
==                                                                      ==
== Many x86 systems allocate the framebuffer in the GPU memory,         ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover,    ==
== PCI-E is asymmetric and handles reads a lot worse than writes.       ==
==                                                                      ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer    ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall    ==
== performance improvement. For example, the xf86-video-fbturbo DDX     ==
== uses this trick.                                                     ==
==========================================================================

 VFP copy (from framebuffer)                          :    169.6 MB/s
 VFP 2-pass copy (from framebuffer)                   :    153.8 MB/s
 ARM copy (from framebuffer)                          :    150.0 MB/s
 ARM 2-pass copy (from framebuffer)                   :    165.9 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.1 ns          /     0.1 ns 
     16384 :    1.7 ns          /     2.8 ns 
     32768 :   31.9 ns          /    51.7 ns 
     65536 :   51.6 ns          /    74.0 ns 
    131072 :   67.3 ns          /    90.6 ns 
    262144 :  132.4 ns          /   207.2 ns 
    524288 :  227.0 ns          /   396.9 ns 
   1048576 :  274.5 ns          /   495.7 ns 
   2097152 :  298.4 ns          /   546.5 ns 
   4194304 :  310.3 ns          /   572.2 ns 
   8388608 :  317.4 ns          /   587.5 ns 
  16777216 :  326.9 ns          /   606.8 ns 
  33554432 :  353.7 ns          /   660.5 ns 
  67108864 :  379.8 ns          /   712.8 ns 
 
 
RPi 2: Nothing connected, doing nothing, just power led:
  • echo 0 >/sys/devices/platform/soc/3f980000.usb/buspower --> 645 mW
  • echo 1 >/sys/devices/platform/soc/3f980000.usb/buspower --> 1005 mW
Performance: sysbench takes 192 seconds @ 900 MHz, 2140 mW reported (+1135 mW compared to 'baseline'). And tinymembench looks like:
 

 

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    244.6 MB/s
 C copy backwards (32 byte blocks)                    :    776.8 MB/s (1.1%)
 C copy backwards (64 byte blocks)                    :    980.5 MB/s
 C copy                                               :    706.7 MB/s (0.6%)
 C copy prefetched (32 bytes step)                    :    911.1 MB/s
 C copy prefetched (64 bytes step)                    :    951.9 MB/s (1.2%)
 C 2-pass copy                                        :    596.5 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    619.8 MB/s
 C 2-pass copy prefetched (64 bytes step)             :    629.3 MB/s (0.6%)
 C fill                                               :   1188.0 MB/s
 C fill (shuffle within 16 byte blocks)               :   1191.7 MB/s (0.4%)
 C fill (shuffle within 32 byte blocks)               :    400.2 MB/s (0.5%)
 C fill (shuffle within 64 byte blocks)               :    420.4 MB/s
 ---
 standard memcpy                                      :   1065.1 MB/s
 standard memset                                      :   1191.8 MB/s (0.1%)
 ---
 NEON read                                            :   1343.9 MB/s (0.5%)
 NEON read prefetched (32 bytes step)                 :   1370.5 MB/s
 NEON read prefetched (64 bytes step)                 :   1366.9 MB/s (0.4%)
 NEON read 2 data streams                             :    390.1 MB/s
 NEON read 2 data streams prefetched (32 bytes step)  :    727.2 MB/s (0.2%)
 NEON read 2 data streams prefetched (64 bytes step)  :    767.0 MB/s
 NEON copy                                            :    996.7 MB/s
 NEON copy prefetched (32 bytes step)                 :    961.7 MB/s (0.8%)
 NEON copy prefetched (64 bytes step)                 :   1033.2 MB/s
 NEON unrolled copy                                   :    954.4 MB/s (0.4%)
 NEON unrolled copy prefetched (32 bytes step)        :    925.9 MB/s
 NEON unrolled copy prefetched (64 bytes step)        :    985.7 MB/s
 NEON copy backwards                                  :    840.9 MB/s
 NEON copy backwards prefetched (32 bytes step)       :    845.5 MB/s (1.0%)
 NEON copy backwards prefetched (64 bytes step)       :    873.8 MB/s
 NEON 2-pass copy                                     :    625.4 MB/s
 NEON 2-pass copy prefetched (32 bytes step)          :    642.5 MB/s (0.3%)
 NEON 2-pass copy prefetched (64 bytes step)          :    648.8 MB/s (0.3%)
 NEON unrolled 2-pass copy                            :    588.9 MB/s
 NEON unrolled 2-pass copy prefetched (32 bytes step) :    578.9 MB/s (0.2%)
 NEON unrolled 2-pass copy prefetched (64 bytes step) :    611.2 MB/s (0.3%)
 NEON fill                                            :   1191.9 MB/s
 NEON fill backwards                                  :   1192.3 MB/s (0.1%)
 VFP copy                                             :    964.0 MB/s
 VFP 2-pass copy                                      :    587.0 MB/s (0.3%)
 ARM fill (STRD)                                      :   1190.8 MB/s (0.1%)
 ARM fill (STM with 8 registers)                      :   1192.1 MB/s
 ARM fill (STM with 4 registers)                      :   1192.2 MB/s (0.1%)
 ARM copy prefetched (incr pld)                       :    960.1 MB/s (0.7%)
 ARM copy prefetched (wrap pld)                       :    841.5 MB/s
 ARM 2-pass copy prefetched (incr pld)                :    633.0 MB/s
 ARM 2-pass copy prefetched (wrap pld)                :    606.7 MB/s (0.4%)

==========================================================================
== Framebuffer read tests.                                              ==
==                                                                      ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled.       ==
== Writes to such framebuffers are quite fast, but reads are much       ==
== slower and very sensitive to the alignment and the selection of      ==
== CPU instructions which are used for accessing memory.                ==
==                                                                      ==
== Many x86 systems allocate the framebuffer in the GPU memory,         ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover,    ==
== PCI-E is asymmetric and handles reads a lot worse than writes.       ==
==                                                                      ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer    ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall    ==
== performance improvement. For example, the xf86-video-fbturbo DDX     ==
== uses this trick.                                                     ==
==========================================================================

 NEON read (from framebuffer)                         :     61.7 MB/s (0.2%)
 NEON copy (from framebuffer)                         :     61.5 MB/s
 NEON 2-pass copy (from framebuffer)                  :     58.7 MB/s
 NEON unrolled copy (from framebuffer)                :     59.3 MB/s
 NEON 2-pass unrolled copy (from framebuffer)         :     58.2 MB/s (0.2%)
 VFP copy (from framebuffer)                          :    308.7 MB/s (0.7%)
 VFP 2-pass copy (from framebuffer)                   :    272.9 MB/s
 ARM copy (from framebuffer)                          :    208.3 MB/s
 ARM 2-pass copy (from framebuffer)                   :    180.5 MB/s (0.2%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    6.4 ns          /    11.6 ns 
    131072 :    9.9 ns          /    16.7 ns 
    262144 :   11.7 ns          /    19.0 ns 
    524288 :   14.7 ns          /    23.1 ns 
   1048576 :   88.7 ns          /   141.5 ns 
   2097152 :  134.3 ns          /   189.7 ns 
   4194304 :  158.0 ns          /   208.3 ns 
   8388608 :  171.5 ns          /   217.9 ns 
  16777216 :  181.8 ns          /   228.1 ns 
  33554432 :  191.8 ns          /   241.6 ns 
  67108864 :  207.1 ns          /   268.8 ns 
 
 
Raspberry Pi 3: nothing connected, doing nothing, just power led:
  • echo 0 >/sys/devices/platform/soc/3f980000.usb/buspower --> 770 mW
  • echo 1 >/sys/devices/platform/soc/3f980000.usb/buspower --> 1165 mW
  • buspower = 1 and Ethernet cable connected --> 1360 mW
Important: the RPi 3 idles at just ~130mW above RPi 2 level. Whether further savings are possible by disabling WiFi/BT is something that needs further investigation.
 
Performance: sysbench takes 120 seconds (constantly at 1200 MHz, 80°C), consumption reported is 3550 mW (+2385 mW compared to 'baseline') and tinymembench looks like:
 

 

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :   1345.8 MB/s (0.5%)
 C copy backwards (32 byte blocks)                    :   1334.3 MB/s (0.7%)
 C copy backwards (64 byte blocks)                    :   1333.5 MB/s (0.5%)
 C copy                                               :   1350.1 MB/s (0.4%)
 C copy prefetched (32 bytes step)                    :   1376.9 MB/s (0.3%)
 C copy prefetched (64 bytes step)                    :   1376.7 MB/s (0.5%)
 C 2-pass copy                                        :   1055.3 MB/s
 C 2-pass copy prefetched (32 bytes step)             :   1092.0 MB/s (0.2%)
 C 2-pass copy prefetched (64 bytes step)             :   1097.1 MB/s (0.3%)
 C fill                                               :   1732.6 MB/s
 C fill (shuffle within 16 byte blocks)               :   1735.9 MB/s
 C fill (shuffle within 32 byte blocks)               :   1733.1 MB/s
 C fill (shuffle within 64 byte blocks)               :   1731.9 MB/s
 ---
 standard memcpy                                      :   1372.2 MB/s (0.3%)
 standard memset                                      :   1737.6 MB/s (0.1%)
 ---
 NEON read                                            :   2254.5 MB/s
 NEON read prefetched (32 bytes step)                 :   2442.2 MB/s (0.6%)
 NEON read prefetched (64 bytes step)                 :   2420.1 MB/s
 NEON read 2 data streams                             :   2115.4 MB/s
 NEON read 2 data streams prefetched (32 bytes step)  :   2433.9 MB/s (0.3%)
 NEON read 2 data streams prefetched (64 bytes step)  :   2432.6 MB/s (0.3%)
 NEON copy                                            :   1327.8 MB/s (0.9%)
 NEON copy prefetched (32 bytes step)                 :   1376.1 MB/s
 NEON copy prefetched (64 bytes step)                 :   1379.9 MB/s (0.5%)
 NEON unrolled copy                                   :   1344.6 MB/s (0.3%)
 NEON unrolled copy prefetched (32 bytes step)        :   1369.6 MB/s
 NEON unrolled copy prefetched (64 bytes step)        :   1371.3 MB/s
 NEON copy backwards                                  :   1341.1 MB/s (0.5%)
 NEON copy backwards prefetched (32 bytes step)       :   1375.5 MB/s
 NEON copy backwards prefetched (64 bytes step)       :   1376.3 MB/s (0.4%)
 NEON 2-pass copy                                     :   1100.5 MB/s (0.3%)
 NEON 2-pass copy prefetched (32 bytes step)          :   1138.0 MB/s
 NEON 2-pass copy prefetched (64 bytes step)          :   1138.2 MB/s (0.2%)
 NEON unrolled 2-pass copy                            :   1075.5 MB/s
 NEON unrolled 2-pass copy prefetched (32 bytes step) :   1099.6 MB/s
 NEON unrolled 2-pass copy prefetched (64 bytes step) :   1100.1 MB/s
 NEON fill                                            :   1788.8 MB/s
 NEON fill backwards                                  :   1788.7 MB/s (0.2%)
 VFP copy                                             :   1342.4 MB/s (0.4%)
 VFP 2-pass copy                                      :   1070.1 MB/s (0.2%)
 ARM fill (STRD)                                      :   1786.8 MB/s (0.2%)
 ARM fill (STM with 8 registers)                      :   1789.1 MB/s (0.3%)
 ARM fill (STM with 4 registers)                      :   1787.8 MB/s (0.2%)
 ARM copy prefetched (incr pld)                       :   1373.3 MB/s
 ARM copy prefetched (wrap pld)                       :   1378.1 MB/s (0.4%)
 ARM 2-pass copy prefetched (incr pld)                :   1113.1 MB/s
 ARM 2-pass copy prefetched (wrap pld)                :   1108.8 MB/s

==========================================================================
== Framebuffer read tests.                                              ==
==                                                                      ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled.       ==
== Writes to such framebuffers are quite fast, but reads are much       ==
== slower and very sensitive to the alignment and the selection of      ==
== CPU instructions which are used for accessing memory.                ==
==                                                                      ==
== Many x86 systems allocate the framebuffer in the GPU memory,         ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover,    ==
== PCI-E is asymmetric and handles reads a lot worse than writes.       ==
==                                                                      ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer    ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall    ==
== performance improvement. For example, the xf86-video-fbturbo DDX     ==
== uses this trick.                                                     ==
==========================================================================

 NEON read (from framebuffer)                         :     73.4 MB/s (0.1%)
 NEON copy (from framebuffer)                         :     73.1 MB/s (0.2%)
 NEON 2-pass copy (from framebuffer)                  :     72.0 MB/s (0.2%)
 NEON unrolled copy (from framebuffer)                :     72.7 MB/s
 NEON 2-pass unrolled copy (from framebuffer)         :     71.2 MB/s (0.2%)
 VFP copy (from framebuffer)                          :    473.7 MB/s (0.4%)
 VFP 2-pass copy (from framebuffer)                   :    428.5 MB/s (1.1%)
 ARM copy (from framebuffer)                          :    260.1 MB/s (0.4%)
 ARM 2-pass copy (from framebuffer)                   :    242.5 MB/s (0.7%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    5.4 ns          /     9.2 ns 
    131072 :    8.2 ns          /    13.1 ns 
    262144 :    9.7 ns          /    14.8 ns 
    524288 :   11.0 ns          /    16.6 ns 
   1048576 :   75.2 ns          /   118.3 ns 
   2097152 :  110.9 ns          /   154.9 ns 
   4194304 :  134.4 ns          /   173.9 ns 
   8388608 :  146.8 ns          /   182.3 ns 
  16777216 :  154.7 ns          /   187.4 ns 
  33554432 :  159.7 ns          /   191.4 ns 
  67108864 :  162.6 ns          /   193.7 ns 
 
 
 
To sum it up:
 
There's not much magic involved regarding consumption of the various RPi models:
  • When it's about the 'do really nothing' use case, the RPi A+ most probably wins due to having half the amount of LPDDR2 DRAM compared to the RPi Zero, which is next. Both SBCs are dimensioned for light loads (only one USB port available that has to provide max 500mA by specs) and omit the LAN9514 IC (combined internal USB hub and Fast Ethernet adapter)
  • The first two models, RPi A and B, are not worth a look when it's about low consumption since they use inefficient LDO regulators to provide the different voltages, which wastes a lot of energy. Newer RPi models rely on better circuitry.
  • By accessing /sys/devices/platform/soc/*.usb/buspower consumption can be influenced on all models but it depends on what's connected to the USB port (see the USB-Ethernet adapter example on RPi Zero above)
  • On RPi B+, 2 and 3 cutting power to the LAN9514 saves ~400mW. When the LAN9514 negotiates an Ethernet connection, consumption increases by another ~200mW (so a total of just ~600mW more than with the chip powered down -- really not that bad!)
  • The energy savings from disabling HDMI and especially the onboard LEDs are not that great, but you can control this behaviour from userspace and get these savings 'for free', so why not disable stuff you don't need? (see the sketch after this list)
  • Consumption numbers for the 'everything disabled and doing nothing' (power cut to LAN9514!) use case do not differ that much. RPi Zero: 365 mW, RPi B+: 600 mW, RPi 2: 645 mW, RPi 3: 770 mW (still no idea whether disabling WiFi/BT on RPi 3 brings consumption down to B+/2 level)
  • When no network connectivity is needed at all, or only from time to time (e.g. every hour for a minute or something like that), RPi Zero and A+ can shine. If you need LAN or WiFi permanently you should keep in mind that this adds approx. +1000mW to your consumption, and then all the LAN9514-equipped 'larger' RPi models might be more energy efficient (!).
  • Even if the RPi 3 is not able to perform optimally (ARMv8 cores running an ARMv7 kernel and an ARMv6 userland) it might be an interesting replacement for a RPi B+ if you need the USB ports and Ethernet. You could limit maximum consumption by disabling CPU cores 2-4 and could still get lower overall consumption when running light workloads, since even with 1 CPU core active the RPi 3 is almost twice as fast as the single-core RPis (compare with the 'race to idle' concept: the faster work can be done, the earlier CPU cores can enter low-power states). EDIT: Disabling CPU cores on the RPi 3 does not help with consumption -- see post #5 below.
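Regarding the 'disable what you don't need' bullet above, a minimal sketch of what that looks like from userspace (LED names and available triggers differ between RPi models, so treat the paths as an example):
 
# switch the HDMI/composite output off
/usr/bin/tvservice -o
# disable the activity LED (led0) and -- where present -- the power LED (led1)
echo none >/sys/class/leds/led0/trigger
echo 0 >/sys/class/leds/led0/brightness
echo none >/sys/class/leds/led1/trigger
echo 0 >/sys/class/leds/led1/brightness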
And now to answer the question many might ask since I was talking all the time about various RPi models:
  • Q: Do you now port Armbian to Raspberry Pi?!
  • A: Nope
To be honest, there's no need for that. Raspbian running on Raspberries is really great (unlike the various crappy Raspbian images made e.g. for Banana Pis), RPi users are familiar with it, tens of thousands of tutorials are available and so on.
 
For me personally it was just important to verify some consumption numbers available on the net, to verify whether my readouts using the PMIC of an Allwinner SBC are correct (seems so) and to get an idea which energy savings level we should target with our new Armbian settings. Based on some experiments done with an Orange Pi Lite I'm pretty confident that we will soon have a couple of ultra-cheap H3 boards (that are actually available, unlike the RPi Zero which costs way more due to added shipping costs and the inability to order more than one at a time!) that outperform RPis when it's about consumption. At least when we're talking about networked setups and not only the 'does really nothing at all' use case :)
 
Remaining questions:
  • Why do they allow the RPi Zero to clock at up to 1 GHz by default while they limit the B+ to 700 MHz (compare performance and consumption numbers of both tinymembench and sysbench above)?
  • How does the RPi 3 behave consumption-wise when WiFi/BT are turned off?
  • What does consumption look like on the various RPi when average load is not close to 0 but some stuff has to be done? (I came across a lot of really broken Python or whatever scripts that try to read out sensors and increase load and consumption a lot.) This is an area where the RPi 3 (and maybe the 2 also) might shine since their SoCs consume only slightly more than the horribly outdated single-core BCM2835 and are able to finish stuff a lot faster (again: 'race to idle' -- entering low-power CPU states earlier helps with minimizing consumption if there is some constant load)

So:

- educational perspective : RPi is great

- power consumption for IoT : RPi is not the best choice. The BCM283x architecture is not going to age very well.

 

I hope for the RPi community (which I am part of) that the ties with Broadcom will come to an end sooner or later.


One quick test on a FriendlyARM NanoPi M3 (8 cores), stock Debian, kernel 3.4.39, current draw during the test: 1.35A (0.5A at idle)

 

sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$(grep -c '^processor' /proc/cpuinfo)
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 8

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000


Test execution summary:
    total time:                          57.0311s
    total number of events:              10000
    total time taken by event execution: 456.0994
    per-request statistics:
         min:                                 45.41ms
         avg:                                 45.61ms
         max:                                140.37ms
         approx.  95 percentile:              45.70ms

Threads fairness:
    events (avg/stddev):           1250.0000/1.22
    execution time (avg/stddev):   57.0124/0.01

root@NanoPi3:~# 

 

 


I am afraid (or not anymore) that the RPi Zero will be obsolete before being really available.

 

Well, if you compare the BOM of an RPi Zero and a B+ it's obvious that the latter might add not even $5 to the production costs (this is mass production and all these components are dirt cheap). So while they might not sell the Zero at a loss, they simply don't make enough money with it -- so why should they cannibalize sales of the B+ by making the Zero generally available?

 

Anyway, back to consumption testing -- still with RPi. Today I wanted to explore the possibility of dynamically limiting the maximum power consumption of the multi-core RPi (2 and 3). To my surprise this is not possible: you can only limit the maximum CPU core count by editing /boot/cmdline.txt and adding 'maxcpus=N', and every change needs a reboot: https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=99372

 
Even more of a surprise: the consumption numbers:
 
RPi 3: nothing connected, doing nothing, just power led, maxcpus=1 (single core):
  • echo 0 >/sys/devices/platform/soc/3f980000.usb/buspower --> 1120 mW
  • echo 1 >/sys/devices/platform/soc/3f980000.usb/buspower --> 1520 mW
Performance: sysbench took 486 seconds running on a single core (at 2640 mW, that's +1120mW compared to the baseline) and tinymembench looks like this in single-core mode:

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    407.7 MB/s (20.6%)
 C copy backwards (32 byte blocks)                    :    406.1 MB/s (2.8%)
 C copy backwards (64 byte blocks)                    :    408.7 MB/s (0.8%)
 C copy                                               :    408.8 MB/s (0.8%)
 C copy prefetched (32 bytes step)                    :    414.0 MB/s (0.8%)
 C copy prefetched (64 bytes step)                    :    413.2 MB/s (1.0%)
 C 2-pass copy                                        :    390.9 MB/s (17.3%)
 C 2-pass copy prefetched (32 bytes step)             :    410.9 MB/s (0.5%)
 C 2-pass copy prefetched (64 bytes step)             :    412.0 MB/s
 C fill                                               :    719.9 MB/s
 C fill (shuffle within 16 byte blocks)               :    719.9 MB/s
 C fill (shuffle within 32 byte blocks)               :    719.0 MB/s
 C fill (shuffle within 64 byte blocks)               :    716.8 MB/s
 ---
 standard memcpy                                      :    418.8 MB/s
 standard memset                                      :    719.1 MB/s
 ---
 NEON read                                            :    588.9 MB/s
 NEON read prefetched (32 bytes step)                 :    831.9 MB/s
 NEON read prefetched (64 bytes step)                 :    832.0 MB/s
 NEON read 2 data streams                             :    579.4 MB/s (0.3%)
 NEON read 2 data streams prefetched (32 bytes step)  :    821.4 MB/s
 NEON read 2 data streams prefetched (64 bytes step)  :    821.5 MB/s
 NEON copy                                            :    409.3 MB/s
 NEON copy prefetched (32 bytes step)                 :    416.3 MB/s
 NEON copy prefetched (64 bytes step)                 :    414.6 MB/s
 NEON unrolled copy                                   :    409.6 MB/s
 NEON unrolled copy prefetched (32 bytes step)        :    422.0 MB/s
 NEON unrolled copy prefetched (64 bytes step)        :    421.9 MB/s
 NEON copy backwards                                  :    407.5 MB/s
 NEON copy backwards prefetched (32 bytes step)       :    413.8 MB/s
 NEON copy backwards prefetched (64 bytes step)       :    412.3 MB/s
 NEON 2-pass copy                                     :    392.4 MB/s
 NEON 2-pass copy prefetched (32 bytes step)          :    415.3 MB/s
 NEON 2-pass copy prefetched (64 bytes step)          :    415.3 MB/s
 NEON unrolled 2-pass copy                            :    390.2 MB/s
 NEON unrolled 2-pass copy prefetched (32 bytes step) :    435.1 MB/s
 NEON unrolled 2-pass copy prefetched (64 bytes step) :    436.3 MB/s
 NEON fill                                            :    719.4 MB/s
 NEON fill backwards                                  :    719.3 MB/s
 VFP copy                                             :    413.1 MB/s
 VFP 2-pass copy                                      :    391.2 MB/s
 ARM fill (STRD)                                      :    715.6 MB/s
 ARM fill (STM with 8 registers)                      :    719.2 MB/s
 ARM fill (STM with 4 registers)                      :    718.1 MB/s
 ARM copy prefetched (incr pld)                       :    413.4 MB/s
 ARM copy prefetched (wrap pld)                       :    412.4 MB/s
 ARM 2-pass copy prefetched (incr pld)                :    411.8 MB/s (0.1%)
 ARM 2-pass copy prefetched (wrap pld)                :    405.7 MB/s

==========================================================================
== Framebuffer read tests.                                              ==
==                                                                      ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled.       ==
== Writes to such framebuffers are quite fast, but reads are much       ==
== slower and very sensitive to the alignment and the selection of      ==
== CPU instructions which are used for accessing memory.                ==
==                                                                      ==
== Many x86 systems allocate the framebuffer in the GPU memory,         ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover,    ==
== PCI-E is asymmetric and handles reads a lot worse than writes.       ==
==                                                                      ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer    ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall    ==
== performance improvement. For example, the xf86-video-fbturbo DDX     ==
== uses this trick.                                                     ==
==========================================================================

 NEON read (from framebuffer)                         :     20.4 MB/s
 NEON copy (from framebuffer)                         :     20.3 MB/s
 NEON 2-pass copy (from framebuffer)                  :     20.4 MB/s
 NEON unrolled copy (from framebuffer)                :     20.4 MB/s
 NEON 2-pass unrolled copy (from framebuffer)         :     20.3 MB/s
 VFP copy (from framebuffer)                          :    158.0 MB/s
 VFP 2-pass copy (from framebuffer)                   :    154.4 MB/s (0.2%)
 ARM copy (from framebuffer)                          :     78.8 MB/s
 ARM 2-pass copy (from framebuffer)                   :     79.0 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.2 ns          /     0.2 ns 
     65536 :    5.6 ns          /     9.4 ns 
    131072 :    8.5 ns          /    13.3 ns 
    262144 :   10.0 ns          /    15.0 ns 
    524288 :   14.2 ns          /    21.5 ns 
   1048576 :  203.3 ns          /   317.2 ns 
   2097152 :  309.3 ns          /   415.9 ns 
   4194304 :  364.5 ns          /   450.4 ns 
   8388608 :  390.3 ns          /   465.3 ns 
  16777216 :  403.0 ns          /   474.3 ns 
  33554432 :  409.4 ns          /   480.3 ns 
  67108864 :  412.6 ns          /   483.4 ns 

 
Weird, to say the least: I disabled 3 CPU cores (see below), but obviously that led to some sort of background activity since idle consumption compared to 'quad core' mode increased by a whopping ~360mW, while the difference between idle and full sysbench load is just 1120mW?
 
With 4 cores idle consumption was at 1165 mW and a 4-core sysbench run added 2385 mW. Based on this a single-core sysbench should add ~600mW so there clearly is something wrong. But it really looks like this:
pi@raspberrypi:~ $ cat /proc/cpuinfo 
processor	: 0
model name	: ARMv7 Processor rev 4 (v7l)
BogoMIPS	: 38.40
Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x0
CPU part	: 0xd03
CPU revision	: 4

Hardware	: BCM2709
Revision	: a02082
Serial		: 00000000ee6200a5
pi@raspberrypi:~ $ cat /boot/cmdline.txt 
dwc_otg.lpm_enable=0 console=serial0,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait maxcpus=1

To verify my measurements I simply removed 'maxcpus=1' from cmdline.txt and immediately rebooted: consumption was back to normal values (1155mW with LAN9514 enabled, 765mW with buspower=0):

 

[Screenshot: power monitoring readout after removing 'maxcpus=1' (2016-08-05, 15:19)]

 

So obviously the preferable way to limit maximum consumption, at least with RPi 3, is to rely on cpufreq scaling, limit max cpufreq for example to 600 MHz and leave the count of active CPU cores as is.
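Just as an illustration, limiting max cpufreq on the fly is a one-liner via the standard cpufreq sysfs interface (a sketch only -- whether the setting survives a reboot depends on the distro, for a permanent limit it belongs in a config file or startup script):

for f in /sys/devices/system/cpu/cpu[0-3]/cpufreq/scaling_max_freq ; do
        echo 600000 >${f}
done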

Link to comment
Share on other sites

@tkaiser

 


So obviously the preferable way to limit maximum consumption, at least with RPi 3, is to rely on cpufreq scaling, limit max cpufreq for example to 600 MHz and leave the count of active CPU cores as is.

 

 

So with my goldfish vocabulary: 4 cores @ 600MHz is about the power of 1 core @ 2400MHz (so somewhat """""better""""" than 1 core @ 1200MHz)

 

It is going to be nice for computation with NEON aarch64 instructions once kernel 4.8 is released for the RPi 3.

Link to comment
Share on other sites

So with my goldfish vocabulary: 4 cores @ 600MHz is about the power of 1 core @ 2400MHz (so somewhat """""better""""" than 1 core @ 1200MHz)

 

Nope. These sorts of comparisons are only valid if you have a workload that scales linearly with count of CPU cores or you are running moronically silly benchmarks as it's done all the time eg on Phoronix.

 

Real-world workloads look different, most of the stuff is single threaded and therefore will run slower on a 4 x 600 MHz system than on one with 1 x 1200.

 

The reason why I'm thinking about limiting maximum consumption is since we're talking about low-power modes/settings and 'IoT' use cases. Imagine you use one 5V/2A PSU to power 3 boards and imagine a worst case scenario where something went really wrong and on all SBCs countless processes are running with 100% CPU utilization. In such a situation 5V/2A might not be enough and at least one board may freeze/crash, most probably all of them since underpowering is a pretty reliable method to freeze/crash any SBC.

 

That's the reason why I check idle and 'full load' consumption individually and also check the role of individual consumers (the LAN9514 USB/Ethernet IC on the Raspberries can be considered one) and what happens when an Ethernet cable is inserted or not. Then you can easily calculate how much more consumption such a worst case scenario would mean and take counter-measures (such as limiting CPU cores or maximum cpufreq).

 

Just as an example how easily such worst case scenarios can be triggered: A few months ago we did a server migration at a customer's site and as some sort of burn-in test I let a few thousand images be converted on the new virtual machine while we were still tuning settings. A little mistake when setting up the cronjob and the conversion started every minute instead of every hour. Load was up to +300 pretty fast, but since this was Solaris x86 it was still possible to log in and recover from the problem in a shell (while true ; do pkill $converter; sleep 1; done -- and do the real work in another shell)

 

It should also be noted that most if not all recent SBCs support dynamic voltage frequency scaling (lowering the voltage the CPU cores are fed with at lower clockspeeds and increasing it with clockspeed -- not linearly but somewhat exponentially), so comparing full load at 600 MHz and 1200 MHz might double performance but consumption will be 2.x times higher (maybe even 3.x times -- depends on the dvfs settings used). This also has to be taken into account when settings are defined, and for proper low-power operation the workload has to be analyzed or simply some long-term consumption monitoring has to happen since too many factors are involved (dvfs, the specific workload, the 'race to idle' concept with CPU cores sitting in low-power states and so on).
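A quick back-of-the-envelope illustration (dynamic power scales roughly with f x V²; the 1.1V/1.3V values are the two VDD_CPUX levels used elsewhere in this thread, so treat this as an example calculation, not a measurement):

 600 MHz @ 1.1V:   600 x 1.1²  ≈  726 (arbitrary units)
1200 MHz @ 1.3V:  1200 x 1.3²  ≈ 2028 (arbitrary units)
--> twice the clockspeed ends up at roughly 2.8 times the dynamic CPU power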

Link to comment
Share on other sites

Real-world workloads look different, most of the stuff is single threaded and therefore will run slower on a 4 x 600 MHz system than on one with 1 x 1200.

 

@tk

 

I know that. That is why I added multiple quotes like in """""better""""" because it depends on the number of threads used concurrently.

 

At the same time, 4 cores can handle interrupts """"better"""" (faster) than only one core with a lot of tasks sharing it (especially when trying to """"mimic a little bit""""" FPGA-style parallel computation with realtime RT kernel patches to reduce kernel latency and handle I/O better)

 

edit: I know no SoC / microcontroller can replace an FPGA for massive parallel computing / massive parallel I/O handling (but well, you know me, I am a goldfish)

Link to comment
Share on other sites

Next round of tests this time using the same device but different OS images. I want to compare a stock Armbian desktop build (without any tweaks), the same Armbian installation with optimized settings and also the most advanced Linux image available for Orange Pis prior to Armbian starting to support H3 devices: That's IMO loboris' Ubuntu Mate image for Orange Pi PC (OrangePI-PC_Ubuntu_Vivid_Mate.img.xz).

 
Boris Lovošević (loboris) did a tremendous job developing the first useable OS images for the various H3 based Orange Pis but unfortunately stopped at the end of last year (to focus on real IoT stuff instead of H3 boards -- see http://www.emw3165.com/viewtopic.php?f=15&t=4330 for his more recent activities).
 
While we never used his kernel sources directly, a few of his patches still enhance our legacy H3/sun8i kernel. Unfortunately his OS images were made with overclocking in mind (reaching 1536 MHz clockspeed) and therefore contained relatively bad dvfs settings (overvolting the SoC badly, even above datasheet limits). That's the main reason H3 has been blamed as an overheating beast, but fortunately now we know better (members of the linux-sunxi community and the Armbian team did a lot of research on how to develop more suitable dvfs/thermal settings).
 
While the differences between our settings and the overclocking/overvolting attempts of his OS images were always visible when looking at temperatures/throttling behaviour, now I also have the equipment to measure precisely how consumption differs. The following measurements were made with an Orange Pi PC from last year wearing my standard H3 heatsink, with active networking since H3 contains a native Ethernet implementation and most people will use these devices with the network connected.
 
Orange Pi PC, only Ethernet connected, loboris' settings/image:
 
Consumption when being absolutely idle: 1545 mW
 
Performance: sysbench took 158 seconds after throttling kicked in (max cpufreq most of the time at 1200 MHz and sometimes below while SoC temperature was reported above 80°C all the time). Averaged consumption: 3115 mW (+1570 mW compared to baseline)
 
Orange Pi PC, only Ethernet connected, Armbian Jessie desktop:
 
Consumption when being absolutely idle: 1260 mW
 
Performance: sysbench took 149 seconds after throttling started to stabilize (that's faster than above) while consumption was at 3005 mW (+1745 mW compared to baseline but absolutely lower than with loboris' settings). How is it possible to get better performance at a lower consumption and temperature level? Because we use better dvfs settings and feed the CPU cores with a lower VDD_CPUX core voltage, which helps a lot with throttling. It's just the settings that differ, everything else is nearly the same, though we use another legacy kernel variant that behaves differently regarding HDMI -- as can be seen below loboris' kernel deactivates the whole HDMI engine 10 minutes after booting when no display is connected, so with a connected display the performance of his image would have been even worse since both temperatures and consumption would have been higher.
 
So with our defaults an Orange Pi PC already idles at the level of an RPi B+ or RPi 2 when the Raspberries are also operated with an Ethernet cable inserted (then their consumption increases by ~200 mW to 1200 mW and slightly more for the RPi 2). But the Orange Pi PC runs a full GUI desktop unlike Raspbian Lite, and we're talking about an SBC that features native Ethernet and 4 real USB ports (vs. just one single USB OTG port combined with hub and USB Ethernet adapter on the Raspberries). So what about optimizing Armbian settings for headless use?
 
Orange Pi PC, only Ethernet connected, Armbian with consumption-control enabled:
 
Consumption when being absolutely idle: 800 mW
 
Performance: sysbench takes 142 seconds, H3 constantly running at 1296 MHz, SoC temperature reached 74°C but no throttling happening and full load consumption was at 3100 mW (useless to compare with baseline consumption -- see below please)
 
So what did happen? Based on tests during the last days we already know that we can save ~200 mW with our sun8i legacy kernel when we disable both the HDMI and Mali400 engines (requires a reboot so only feasible for headless devices), that lowering DRAM clockspeed from our default 624 MHz to 264 MHz (still faster than RPi) saves an additional ~240 mW, and that disabling CPU cores saves ~10 mW per core in idle situations. That's the simple reason we can lower idle consumption from 1260 mW to below 800 mW with the H3 based Orange Pi PC while we still have a device that's faster than RPi B+/2, has active/idle Ethernet and 4 real USB ports available, while consuming only 2/3 of what RPi B+ or 2 need in the same situation (with no native Ethernet and only one real USB port available there)
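Just to cross-check the sum with the individual savings listed above: 1260 mW - ~200 mW (HDMI/Mali disabled) - ~240 mW (DRAM at 264 MHz instead of 624 MHz) - 3 x ~10 mW (3 CPU cores offline) ≈ 790 mW, which matches the measured ~800 mW nicely.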
 
Some consumption graphs:
 
1) Loboris Ubuntu Mate image. The consumption drop after 10 minutes due to no connected HDMI display can be seen twice: On the left and after the reboot at 11:45:
Loboris_Ubuntu_Mate.png
 
2) Armbian defaults:
 
Armbian_defaults.png
 
3) Armbian with optimized settings:
 
Armbian_optimized.png
A few words regarding these optimized settings: I disabled the start of the nodm daemon (therefore no X windows running), adjusted a few values in the fex file (disabling HDMI/Mali) and used a patched kernel to decrease the DRAM clock below Allwinner's default (they use 408 MHz as the lower limit but with this value energy savings aren't that much). Then I started the test from within /etc/rc.local by calling a script that lowered the DRAM clock, disabled CPU cores, waited 45 minutes to bring back CPU cores and DRAM clockspeed and then started the sysbench test:

root@orangepipc:/home/tk# cat /usr/local/bin/check-consumption.sh 
#!/bin/bash
# low-power phase: take 3 of the 4 CPU cores offline and lower the DRAM clock
echo 0 >/sys/devices/system/cpu/cpu3/online
echo 0 >/sys/devices/system/cpu/cpu2/online
echo 0 >/sys/devices/system/cpu/cpu1/online
echo 264000 >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq
sleep 2400
# full-speed phase: bring cores and DRAM clock back, then run sysbench in an endless loop
echo 1 >/sys/devices/system/cpu/cpu3/online
echo 1 >/sys/devices/system/cpu/cpu2/online
echo 1 >/sys/devices/system/cpu/cpu1/online
echo 624000 >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq
while true ; do
        sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$(grep -c '^processor' /proc/cpuinfo) >>/var/log/sysbench.log
done

 
What is done there manually (adjusting the count of active CPU cores and DRAM clockspeed) can be done automatically with future Armbian versions. We're thinking about allowing users to specify some sort of 'power control' (via a daemon or something like that) so that at least H3 boards with legacy kernel can be operated at really low consumption levels. I tested with an Orange Pi Lite already and was able to get below 500 mW idle consumption (with a not so useful 'nothing works except serial console' use case ;) ). But that's the area where such a 'power control' mechanism would make some sense: cheap H3 boards (OPi One/Lite, NanoPi NEO/Air) used as low-power IoT nodes.
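Just to illustrate the idea, such a 'power control' could be as primitive as the following loop (a rough sketch and nothing more -- it reuses the sysfs nodes from the script above, while the load threshold and polling interval are made up):

#!/bin/bash
# toy 'power control' loop: bring CPU cores and DRAM clock up only when there's work to do
while true ; do
        Load=$(awk -F. '{print $1}' /proc/loadavg)
        if [ ${Load} -ge 1 ]; then
                for i in 1 2 3 ; do echo 1 >/sys/devices/system/cpu/cpu${i}/online ; done
                echo 624000 >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq
        else
                for i in 1 2 3 ; do echo 0 >/sys/devices/system/cpu/cpu${i}/online ; done
                echo 264000 >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq
        fi
        sleep 10
done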
 
Personally I'm already amazed to be able to limit idle consumption of an Orange Pi PC (while having 4 real USB ports ready and Fast Ethernet active and used!) below 1W and that we start to get enough knowledge to control consumption behaviour in a way we could ensure that it does not exceed 1.5W for example. The only problem remaining then are inefficient PSUs (drawing 3W from the wall while the H3 board in question idles at 0.8W ;) )
 
Further readings:
Link to comment
Share on other sites

OPi_PC_with_fan.jpg

 
A few more words regarding the H3 SoC and settings: I just tried out the highest clockspeed possible with the legacy H3 kernel on the larger Orange Pi boards: 1536 MHz. To operate reliably at this clockspeed the VDD_CPUX core voltage has to be increased too (not possible on all H3 boards: those with a more primitive or no adjustable voltage regulator limit VDD_CPUX to 1.3V max).
 
With loboris' initial settings only switching between 1.3V and 1.5V happened (just 2 dvfs operating points defined for whatever reason), so I relied on our normal settings (using 7 values and maxing out at 1296 MHz @ 1.32V) and simply defined a few new operating points with increased voltages, maxing out at loboris' 1536 MHz @ 1.5V:
LV1_freq = 1536000000
LV1_volt = 1500
LV2_freq = 1440000000
LV2_volt = 1420
LV3_freq = 1344000000
LV3_volt = 1360
LV4_freq = 1296000000
LV4_volt = 1320
LV5_freq = 1008000000
LV5_volt = 1140
LV6_freq = 816000000
LV6_volt = 1020
LV7_freq = 480000000
LV7_volt = 980
Unlike our normal dvfs operating points covering 480-1296 MHz, which are tested and can be considered sane, the ones for 1344 MHz and above are just assumptions -- maybe the voltages could be lowered but this has to be confirmed by time-consuming reliability testing.
 
To let any heavy workload run on H3 with these settings and all CPU cores active a fan is necessary, so I added one to keep temperatures below 70°C (powered by another SBC nearby through its GPIO header since for this test I didn't want the fan's own consumption to add to my numbers). Results as follows:
  • Baseline consumption (idling and DRAM clockspeed at 624 MHz): 1015 mW
  • sysbench with 1 core @ 1536 MHz: 1940 mW (+920 mW above baseline)
  • sysbench with 2 cores @ 1536 MHz: 2845 mW (+1830 mW above baseline)
  • sysbench with 4 cores @ 1536 MHz: 4300 mW (+3280 mW above baseline)
Loboris_Settings.png
 
When running on all 4 cores at 1536 MHz sysbench execution time is as low as with RPi 3 (120 seconds) but this mode makes no sense at all because it requires an annoying fan (otherwise throttling would reduce clockspeeds to ~1164 MHz -- compare with the results above) and consumption is way too high due to the increased core voltage necessary for reliable operation with these overclocker settings (and if it's really about performance per watt the fan's consumption would have to be added too!).
 
So what is this test for? When running on 1 or 2 cores at 1536 MHz @ 1.5V no fan is needed. I got 47°C with a single core and 67°C with 2 cores when running sysbench, so even with really heavy workloads (eg. cpuburn-a7) enabling 1536 MHz while limiting the count of active CPU cores could make some sense in situations where single-threaded tasks need to run at the highest speed possible from time to time (again: compare with the 'race to idle' concept since finishing tasks in less time allows CPU cores to enter low-power modes earlier, which might reduce overall consumption in the end)
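At runtime this is just a few sysfs writes (a sketch, assuming the 1536 MHz operating point from the fex snippet above is in place and the usual cpufreq sysfs layout of the legacy kernel):

echo 0 >/sys/devices/system/cpu/cpu3/online
echo 0 >/sys/devices/system/cpu/cpu2/online
echo 1536000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq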
 
One purpose of this test was to find the limits of my power monitoring setup (I still use a Banana Pro to feed connected devices with power since I let Banana's AXP209 PMU monitor consumption -- a normal USB2 port should provide 500mA by spec but as this test shows the Banana Pro is able to provide even 860mA via one USB port), another one was to get consumption numbers for loboris' settings: ~18% more performance correlates with ~65% more consumption when comparing Armbian's regular upper 1296 MHz cpufreq limit with the 1536 MHz @ 1.5V dvfs operating point. 
 
Simple conclusion: If you look for best performance per watt then the highest clockspeeds are not economical since the increased VDD_CPUX core voltage leads to consumption that is way too high. But enabling these overclocking/overvolting settings might make some sense in situations where single-threaded workloads have to be finished in less time, and if the count of active CPU cores is limited to 1 or 2, short operations with these overvolted settings do not even require an annoying fan.
Link to comment
Share on other sites

Another few words why I used sysbench to do consumption measurements. While sysbench's cpu test is not able to produce meaningful numbers when comparing different architectures (see below) it can be used to roughly estimate 'worst case' situations when for whatever reasons the board gets under full load (scripts running amok and so on, simply a 'full load' condition).

 

Another unique feature of this sysbench test is that it is pretty much not related to memory performance at all and scales linearly with both CPU clockspeed and count of CPU cores (normal real world workloads differ a lot in all 3 areas!), so it might help to roughly estimate integer performance by doing simple calculations (double the count of CPU cores, increase or decrease clockspeeds) and to get an idea how integer performance correlates with consumption (as we've seen above, higher clockspeeds need higher voltages so consumption does not scale linearly with performance but goes up way more quickly)

 

The aforementioned 'feature' (not being dependent on memory performance since the cpu test simply calculates prime numbers) is also a major caveat since this specific workload is not that typical: for most normal tasks memory throughput matters more (especially on some Raspberries where playing around with clock settings can drastically improve performance since L2 cache and overall memory performance matter a lot for most workloads). Also sysbench results can not be compared across different CPU architectures since special CPU instructions might speed up prime number calculations by a factor of 15 while overall system performance is identical.

 

Sysbench used on RPi 2 (ARMv7) and RPi 3 (ARMv8) is already the best example. When running Raspbian the sysbench binary has been built with ARMv6 compiler settings. You get execution times of 192 secs (RPi 2 @ 900 MHz) or 120 secs (RPi 3 @ 1200 MHz). As soon as you switch to an ARMv7 userland (using Ubuntu Mate for example -- I chose ubuntu-mate-16.04-desktop-armhf-raspberry-pi.img) test execution gets faster: 168 secs on RPi 2 and 105 secs on RPi 3. This is just the result of different compiler switches using optimized CPU instructions. And if RPi 3 used a sysbench binary compiled with all options set for ARMv8 then the test would finish within 10 seconds -- another example of how moronic it is to combine ARMv8 CPU cores with an ARMv6 userland.

 

So what can we read from the numbers collected above for example all RPi results with sysbench? The following lists the test setup, sysbench execution time, then relative consumption increase compared to idle consumption (baseline) and then absolute consumption when running the specific test setup:

  • RPi Zero @ 1000 MHz: 915 sec, 435 mW, 800 mW
  • RPi B+ @ 700 MHz: 1311 sec, 175 mW, 1160 mW
  • RPi 2 @ 900 MHz: 192 sec, 1135 mW, 2140 mW
  • RPi 3 @ 1200 MHz: 120 sec, 2385 mW, 3550 mW
Please remember: this is just the result of using Raspbian defaults (that might be adjustable if you know what you're doing). The main purpose of these tests was to be able to calculate the consumption increase when taking worst case conditions into account (eg. scripts running amok) -- that's what the relative consumption increase is about. So if I plan to use an RPi B+ with USB peripherals connected that need an additional 2W I already know that I don't have to take consumption caused by CPU load into account (it's just 175 mW more!). But if we allowed the B+ to clock as high as the Zero (with increased voltages -- that's what causes the much higher consumption compared to the small increase in performance) then it gets interesting since CPU load is responsible for ~0.5W (435 mW) more at 1.0 GHz and it would read instead:
  • RPi B+ @ 1000 MHz: 915 sec, 435 mW, 1420 mW
When using an RPi 3 worst case conditions could result in ~2.5W (2385 mW) more consumption, so if peripherals are connected that need an additional 1W the PSU requirements for an Ethernet connected RPi 3 would look like (idle / full load): 2.5W - 4.5W, since idle consumption with Ethernet connected is already at 1360 mW, peripherals add another 1000 mW and CPU load might add another 2385 mW. But in case this is not wanted, a simple measure to reduce maximum consumption is to adjust cpufreq scaling settings to 600 MHz max and then we're talking about 2.5W - 3W. This is pretty irrelevant if you power your device from a good wall wart, but if you use PoE (and have to think about how to dimension step-down converters) or have to ensure minimal consumption then this matters.
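Expressed as PSU current at 5V that's simple arithmetic (rough numbers, cable and connector losses not included):

worst case:          (1360 + 1000 + 2385) mW ≈ 4.75 W  -->  4.75 W / 5 V ≈ 0.95 A
capped at 600 MHz:   ~3 W                              -->  3 W / 5 V ≈ 0.6 A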
 
Since I started to collect some numbers for 'performance per watt' comparisons (mostly to check energy efficiency on H3 boards with primitive voltage regulator switching between just 1.1V and 1.3V) also here as reference:
  • OPi PC / loboris: 158 sec, 1570 mW, 3115 mW
  • OPi PC / Armbian: 149 sec, 1745 mW, 3005 mW
  • OPi PC / optimized: 142 sec, - mW, 3100 mW
Please keep in mind that when you compare these numbers with those for the Raspberries above you have to add at least 200 mW to the RPi's overall consumption since this is what connected Ethernet on the larger RPi models adds (with RPi Zero you might need to add a whole watt unless you find a better USB-Ethernet adapter than my two)
 
And also as a reference: single core numbers for RPi 3 (higher idle consumption than when running on all 4 CPU cores for whatever reason!) and overclocked numbers for OPi PC:
  • RPi 3 @ 1200 MHz / ARMv6, 1 core: 486 sec, 1120 mW, 2640 mW
  • OPi PC @ 1536 MHz, 1 core: 482 sec, 920 mW, 1940 mW
  • OPi PC @ 1536 MHz, 2 cores: 241 sec, 1830 mW, 2845 mW
  • OPi PC @ 1536 MHz, 4 cores: 120 sec, 3280 mW, 4300 mW
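One simple way to compare 'performance per watt' from these numbers is to multiply execution time by average consumption to get the energy needed per sysbench run (rough values, the idle consumption of the rest of the board is not subtracted):

RPi 3 @ 1200 MHz:             120 s x 3.55 W ≈ 425 J
OPi PC @ 1296 MHz (Armbian):  149 s x 3.0 W  ≈ 450 J
OPi PC @ 1536 MHz (4 cores):  120 s x 4.3 W  ≈ 515 J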
This whole 'performance per watt' stuff gets really interesting when looking at Orange Pi One/Lite and NanoPi NEO/Air since they can only switch between 1.1V (912 MHz max) and 1.3V (1200 MHz) and I would suppose 'performance per watt' is way better when remaining at the lower voltage. Time will tell -- still no NanoPi dev samples arrived  :(

 

Further readings:

Link to comment
Share on other sites

Another surprising update regarding settings, this time again with a Raspberry: the RPi 3. I wanted to test whether using the DietPi OS image also helps with consumption. DietPi relies on Raspbian Lite (as far as I understood) but ships with optimized settings. So the RPi 3 was tested with 3 different OS images using the same SD card, the same connected peripherals (none) and the same settings (HDMI disabled, otherwise defaults):

 

Raspbian Debian Jessie Lite (2016-05-27-raspbian-jessie-lite.img):

 
Idle: 1165 mW. Performance: sysbench takes 120 seconds (constantly at 1200 MHz, 80°C), consumption reported is 3550 mW (+2385 mW compared to 'baseline')
 
Ubuntu Mate 16.04 (ARMv7 userland, latest 4.4 kernel):
 
Idle: 1150 mW. Performance: sysbench takes 105 seconds (no throttling occurred) and consumption increases to 3600 mW (+2450 mW compared to baseline). This can be considered identical to above except for the small performance boost due to being able to use ARMv7 code.
 
DietPi_v127_RPi-armv6-(Jessie).img:
 
I used default settings and only set HDMI to disabled from within dietpi-config, which does not only switch off HDMI but should also help with memory throughput (see here for the description -- and again: sysbench won't be affected by that)
 
Bildschirmfoto%202016-08-08%20um%2018.25
 
Idle: 1230 mW (that's surprisingly 75 mW more than above)
 
Performance: On the 1st run sysbench takes 121 seconds and consumption most probably increased up to 3615 mW (assuming +2385 mW compared to baseline as in the Raspbian test above), but then some sort of throttling happened (strange since checking /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq from time to time still reported 1.2 GHz all the time) and sysbench execution time increased to 169 secs while consumption only reached 3050 mW (+1820 mW compared to baseline).
 
Bildschirmfoto%202016-08-08%20um%2018.37
 
This is how execution times looked like over time:

    execution time (avg/stddev):   121.2137/0.02
    execution time (avg/stddev):   143.4227/0.01
    execution time (avg/stddev):   154.7651/0.02
    execution time (avg/stddev):   159.3969/0.03
    execution time (avg/stddev):   162.2143/0.02
    execution time (avg/stddev):   164.0288/0.01
    execution time (avg/stddev):   165.7774/0.03
    execution time (avg/stddev):   166.6406/0.02
    execution time (avg/stddev):   167.3383/0.02
    execution time (avg/stddev):   168.1799/0.03
    execution time (avg/stddev):   168.6029/0.02
    execution time (avg/stddev):  ~169.x

 
I then checked settings in dietpi-config and increased the throttling threshold from 75°C to 85°C. Now the 1st sysbench run took 120 secs (as it should without throttling) and then execution slowed down to 137 seconds over time:

    execution time (avg/stddev):   119.7105/0.02
    execution time (avg/stddev):   126.3084/0.01
    execution time (avg/stddev):   133.1608/0.02
    execution time (avg/stddev):   135.2913/0.02
    execution time (avg/stddev):   134.7278/0.02
    execution time (avg/stddev):   135.8305/0.01
    execution time (avg/stddev):   136.2153/0.02
    execution time (avg/stddev):   136.4005/0.01
    execution time (avg/stddev):   136.6043/0.01
    execution time (avg/stddev):   136.8146/0.01
    execution time (avg/stddev):   136.7731/0.02
    execution time (avg/stddev):   137.1757/0.03
    execution time (avg/stddev):  ~137.x

 
Average consumption was then at 3595 mW (+2365 mW) and the output of DietPi's cpu tool looked like this then:
 
root@DietPi:~# cpu

 

 ─────────────────────────────────────────────────────

 DietPi CPU Info

 Use dietpi-config to change CPU / performance options

 ─────────────────────────────────────────────────────

 Architecture |     armv7l

 Temp         |     Warning: 82'c | Reducing the life of your device.

 Governor     |     ondemand

 Throttle up  |     50% CPU usage

 

                 Current Freq    Min Freq   Max Freq

 CPU0         |      1200 Mhz      600 Mhz    1200 Mhz

 CPU1         |      1200 Mhz      600 Mhz    1200 Mhz

 CPU2         |      1200 Mhz      600 Mhz    1200 Mhz

 CPU3         |      1200 Mhz      600 Mhz    1200 Mhz

 

root@DietPi:~# uname -a

Linux DietPi 4.4.16-v7+ #899 SMP Thu Jul 28 12:40:33 BST 2016 armv7l GNU/Linux

 
Both the increased idle consumption (maybe some background daemons are permanently active in DietPi?) and the 'performance per watt' ratio are strange:
  • Raspbian, no throttling: 120 secs, +2385 mW
  • DietPi with 75°C settings: 169 secs, +1820 mW
  • DietPi with 85°C settings: 137 secs, +2365 mW
Why does Raspbian execute the benchmark without throttling at a constant 1200 MHz @ 80°C in 120 seconds while DietPi with the 85°C threshold increases execution time by 14% (120 sec vs. 137) while reporting 82°C and only saving 20 mW in this mode? One possible explanation would be that I did this test with Ethernet connected (which adds 200 mW to overall consumption but shouldn't affect SoC temperatures since the SoC has no Ethernet -- it's just the Ethernet PHY inside the LAN9514 IC that gets activated). It would be interesting to investigate further (temperature behaviour of the SoC with Ethernet plugged in or not) but it's not worth my time.
 
And then it seems like a miracle to me that a loss in performance of 14 percent (120 sec vs. 137) only saves 20 mW. Since I also wondered why CPU clockspeed was always reported as 1200 MHz I had a look at
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies

To my surprise only 600 and 1200 MHz are available, so we seem to be dealing with a pretty simple dvfs/cpufreq table defining only two states. As a comparison: with Armbian on the larger H3 boards we use 7 operating points that start at 480 MHz @ 980 mV and end at 1296 MHz @ 1320 mV, and we added that many cpufreq steps since this helps a lot with throttling (confirmed with both A64 and H3 devices).

 

So based on our research (using as many cpufreq and dvfs operating points as possible to make throttling more efficient) the RPi 3 settings look like 'throttling from hell'. The numbers with the 85°C threshold would indicate that the RPi 3 remained 28% of the time at 600 MHz and 72% at 1200 MHz (since execution time differs by 14%), but a consumption difference of just 20 mW means that voltage regulation isn't quick enough to cope with this fast switching between upper and lower clockspeed (and, if I understood correctly, also with dynamic clocking of DRAM).

 

With the 75°C setting it's a different picture: execution time increases by 41% (120 sec vs. 169), which would mean the CPU cores remained 82% of the time at the lower clockspeed and results in huge savings: 565 mW less.

 

I would believe adding a few more cpufreq operating points would already help (600, 800, 900, 1000, 1100, 1200 MHz) and if it's possible on RPi 3 to also use different supply voltages with these clockspeeds then throttling would get even more efficient. If the RPi 3 were an Orange Pi PC and we had to deal with this situation (only switching between 600 and 1200 MHz, losing 14% performance for a minimal benefit of less than 1% consumption savings) then simply adding one more dvfs operating point with slightly decreased voltage at 1150 MHz would already suffice. But I don't know whether that's possible on RPi 3 and how the exact mechanism to control voltages works there.

 

EDIT: Now everything is clear, since the ARM cores are no first class citizens on any RPi but just VideoCore's guests the kernel doesn't know what's really going on (LOL!). The VideoCore VPU has to be queried since in reality everything is controlled from there: https://github.com/raspberrypi/linux/issues/1320#issuecomment-191754676 (without using the  vcgencmd command talking to the master processor you get no idea what's really happening)
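For reference, these are the kind of queries that ask the VideoCore directly instead of trusting the cpufreq sysfs values (standard vcgencmd calls on recent Raspberry Pi firmware, listed here just as a hint):

vcgencmd measure_clock arm    # real ARM clockspeed as seen by the firmware
vcgencmd measure_volts core   # current core voltage
vcgencmd get_throttled        # bitmask showing throttling/undervoltage events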

 
And another strange symptom (for me): With all RPi 3 OS images after a 'shutdown -h now' the board still wasted between 380 and 450 mW (which is most likely a bug but who knows). Only physically cutting power really helps lowering 'off state' consumption to zero. But maybe on Raspberries one should use 'poweroff' instead?
Link to comment
Share on other sites

Another update regarding consumption / performance since I've been busy the last days playing around with the NanoPi NEO. The NEO is currently the smallest H3 board around, also featuring a simpler design than other H3 devices. For information regarding this board check the linux-sunxi wiki and/or the appropriate thread in the H3 forums.

 

We had to realize that reducing DRAM clockspeed helps with both lowering consumption and lowering the temperatures reported by the SoC's internal temperature sensor (which must sit somewhere near the memory controller). We also used a patch to decrease DRAM clockspeed below the legacy kernel's fixed minimum of 408 MHz. Since I was curious where the biggest consumption gains happen I let a script walk through all available DRAM clockspeeds (adjustable in 24 MHz steps above 372 MHz and 12 MHz steps below) and the surprising result is that the biggest gains are between 408 MHz and 456 MHz:

 

This is NanoPi NEO starting at 870 mW and ending up at 1340 mW (adjusting only the DRAM clockspeed on this H3 board is responsible for a 470 mW difference in idle consumption!) and the biggest step is between 408 MHz and 432 MHz (I wonder why FriendlyARM chose 432 and not 408 for their OS image):

 

Bildschirmfoto%202016-08-12%20um%2007.51

 

And this is Orange Pi Lite with identical settings starting at 510 mW and ending up at ~815 mW at 672 MHz (here the difference is just 305 mW; please note that I did not test above 624 MHz since this is the maximum DRAM speed we allow on all H3 boards anyway). Again the biggest step is between 408 MHz and 432 MHz: 130 mW (630 mW --> 760 mW)

 

Bildschirmfoto%202016-08-13%20um%2000.12
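For reference, the 'walking through DRAM clockspeeds' part mentioned above is nothing more than a loop over the same sunxi-ddrfreq sysfs node used before (a sketch -- the frequency list is abbreviated and the 30 minutes per step are an assumption to get stable averages):

#!/bin/bash
# step through a few DRAM clockspeeds and let the consumption logging average each step
for freq in 132000 264000 408000 432000 456000 528000 624000 ; do
        echo ${freq} >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq
        sleep 1800
done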

 

So what's different? Both boards share many details: the primitive voltage regulator only switching between 1.1V and 1.3V and the same amount of DRAM (I tested the 512MiB NEO version), but obviously some onboard components differ (LDO regulator on the NEO vs. buck converter on the Lite) and DRAM access is different: unlike all other H3 devices FriendlyARM chose a single bank design for the NEO.

 

Surprisingly the more primitive and lightweight looking NEO shows worse consumption numbers compared to the OPi Lite (unfortunately I don't have a One here any more to compare)

 

So what about performance? I also ran tinymembench on both boards from 132 MHz to 672 MHz DRAM clockspeed (fixed CPU settings, performance governor, 4 CPU cores active and running at 1200 MHz). Not so surprisingly the dual bank configuration is faster, especially at higher DRAM clockspeeds (but there is no performance bump around 432 MHz as one could imagine based on the consumption behaviour -- it's quite the opposite since 384 MHz - 504 MHz perform nearly identically with the single bank configuration):

DRAM clock     NanoPi NEO      OPi Lite   OPi Plus 2E
132 MHz:       135.8 MB/s    153.8 MB/s    154.4 MB/s
144 MHz:       147.8 MB/s    158.8 MB/s    171.4 MB/s
156 MHz:       156.0 MB/s    171.3 MB/s    176.9 MB/s
168 MHz:       179.2 MB/s    183.7 MB/s    188.6 MB/s
180 MHz:       188.5 MB/s    206.0 MB/s    201.5 MB/s
192 MHz:       194.5 MB/s    285.4 MB/s    232.4 MB/s
204 MHz:       202.7 MB/s    295.5 MB/s    300.4 MB/s
216 MHz:       199.7 MB/s    282.1 MB/s    314.9 MB/s
228 MHz:       196.9 MB/s    290.9 MB/s    292.9 MB/s
240 MHz:       198.9 MB/s    303.7 MB/s    328.3 MB/s
252 MHz:       199.3 MB/s    313.2 MB/s    366.7 MB/s
264 MHz:       216.5 MB/s    361.3 MB/s    350.3 MB/s
276 MHz:       217.9 MB/s    344.8 MB/s    388.1 MB/s
288 MHz:       231.7 MB/s    339.8 MB/s    410.3 MB/s
300 MHz:       235.7 MB/s    350.9 MB/s    398.6 MB/s
312 MHz:       250.0 MB/s    339.2 MB/s    377.9 MB/s
324 MHz:       262.4 MB/s    360.2 MB/s    389.1 MB/s
336 MHz:       271.5 MB/s    375.6 MB/s    409.6 MB/s
348 MHz:       271.3 MB/s    395.6 MB/s    421.7 MB/s
360 MHz:       299.3 MB/s    414.8 MB/s    402.9 MB/s
372 MHz:       339.4 MB/s    452.2 MB/s    398.4 MB/s
384 MHz:       428.5 MB/s    507.0 MB/s    432.7 MB/s
408 MHz:       433.5 MB/s    594.8 MB/s    580.5 MB/s
432 MHz:       436.8 MB/s    632.4 MB/s    606.9 MB/s
456 MHz:       421.1 MB/s    665.7 MB/s    667.5 MB/s
480 MHz:       434.4 MB/s    678.4 MB/s    685.4 MB/s
504 MHz:       431.8 MB/s    714.5 MB/s    719.7 MB/s
528 MHz:       448.9 MB/s    766.9 MB/s    756.2 MB/s
552 MHz:       454.3 MB/s    802.5 MB/s    798.2 MB/s
576 MHz:       458.6 MB/s    835.3 MB/s    838.4 MB/s
600 MHz:       465.8 MB/s    857.0 MB/s    882.1 MB/s
624 MHz:       484.8 MB/s    892.9 MB/s    905.5 MB/s
648 MHz:       506.1 MB/s    928.3 MB/s    938.0 MB/s
672 MHz:       539.3 MB/s    963.0 MB/s    965.7 MB/s

An archive with all tinymembench results for both boards and now also Plus 2E can be found here. The numbers above are the 'standard memcpy' results but tinymembench tests a lot more. Just as an example the 624 MHz results from OPi Lite:

 

 

Board: orangepilite, DRAM clockspeed: 624000

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    296.2 MB/s (2.0%)
 C copy backwards (32 byte blocks)                    :    998.6 MB/s
 C copy backwards (64 byte blocks)                    :   1030.7 MB/s
 C copy                                               :    963.7 MB/s
 C copy prefetched (32 bytes step)                    :    910.3 MB/s
 C copy prefetched (64 bytes step)                    :   1032.8 MB/s
 C 2-pass copy                                        :    800.9 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    783.0 MB/s
 C 2-pass copy prefetched (64 bytes step)             :    844.9 MB/s
 C fill                                               :   3984.5 MB/s (0.1%)
 C fill (shuffle within 16 byte blocks)               :   3969.5 MB/s
 C fill (shuffle within 32 byte blocks)               :    462.0 MB/s (4.7%)
 C fill (shuffle within 64 byte blocks)               :    489.1 MB/s (7.5%)
 ---
 standard memcpy                                      :    892.9 MB/s
 standard memset                                      :   3034.7 MB/s
 ---
 NEON read                                            :   1317.9 MB/s
 NEON read prefetched (32 bytes step)                 :   1496.9 MB/s
 NEON read prefetched (64 bytes step)                 :   1513.5 MB/s
 NEON read 2 data streams                             :    374.4 MB/s
 NEON read 2 data streams prefetched (32 bytes step)  :    720.3 MB/s
 NEON read 2 data streams prefetched (64 bytes step)  :    755.9 MB/s
 NEON copy                                            :   1038.5 MB/s
 NEON copy prefetched (32 bytes step)                 :   1132.5 MB/s
 NEON copy prefetched (64 bytes step)                 :   1193.4 MB/s
 NEON unrolled copy                                   :   1011.7 MB/s
 NEON unrolled copy prefetched (32 bytes step)        :   1059.0 MB/s
 NEON unrolled copy prefetched (64 bytes step)        :   1130.1 MB/s
 NEON copy backwards                                  :   1004.4 MB/s
 NEON copy backwards prefetched (32 bytes step)       :   1067.6 MB/s
 NEON copy backwards prefetched (64 bytes step)       :   1154.6 MB/s
 NEON 2-pass copy                                     :    897.3 MB/s
 NEON 2-pass copy prefetched (32 bytes step)          :    966.2 MB/s
 NEON 2-pass copy prefetched (64 bytes step)          :    997.6 MB/s
 NEON unrolled 2-pass copy                            :    778.5 MB/s
 NEON unrolled 2-pass copy prefetched (32 bytes step) :    745.1 MB/s
 NEON unrolled 2-pass copy prefetched (64 bytes step) :    806.3 MB/s
 NEON fill                                            :   3986.3 MB/s (0.1%)
 NEON fill backwards                                  :   3969.0 MB/s
 VFP copy                                             :   1022.1 MB/s
 VFP 2-pass copy                                      :    789.0 MB/s
 ARM fill (STRD)                                      :   3037.0 MB/s
 ARM fill (STM with 8 registers)                      :   3970.1 MB/s
 ARM fill (STM with 4 registers)                      :   3598.5 MB/s
 ARM copy prefetched (incr pld)                       :   1145.7 MB/s
 ARM copy prefetched (wrap pld)                       :   1053.3 MB/s
 ARM 2-pass copy prefetched (incr pld)                :    873.9 MB/s
 ARM 2-pass copy prefetched (wrap pld)                :    833.1 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    5.2 ns          /     9.0 ns 
    131072 :    8.1 ns          /    12.7 ns 
    262144 :    9.5 ns          /    14.1 ns 
    524288 :   11.5 ns          /    16.3 ns 
   1048576 :   85.6 ns          /   132.0 ns 
   2097152 :  128.7 ns          /   173.6 ns 
   4194304 :  150.8 ns          /   188.5 ns 
   8388608 :  163.7 ns          /   196.5 ns 
  16777216 :  173.4 ns          /   204.0 ns 
  33554432 :  183.1 ns          /   217.6 ns 
  67108864 :  195.7 ns          /   240.7 ns  

 

 

 

Edit: Added tinymembench results for the 2GB equipped OPi Plus 2E as well (CPU clocked at 1296 MHz). No differences compared to OPi Lite :)

Link to comment
Share on other sites

Another round of tests. This time it's about lowering peak consumption. With our default settings we allow pretty low idle consumption but at boot time we always have rather high consumption peaks compared to idle behaviour later. In case someone wants to use a really weak PSU or powers a couple of boards with one step-down converter (via PoE -- Power over Ethernet -- for example) then it's important to be able to control consumption peaks also.

 
With most if not all board/kernel combinations we have three places to control this behaviour:
  • u-boot: brings up the CPU cores, defines initial CPU and DRAM clockspeed
  • kernel defaults: as soon as the kernel takes over these settings are active (might rely on u-boot's settings, might use its own; minimum/maximum depends on the device tree or the fex file with Allwinner legacy kernels)
  • userspace: In Armbian we ship with cpufrequtils which controls minimum/maximum cpufreq settings and the governor used -- have a look at /etc/default/cpufrequtils (see the example below)
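As a reminder how that file looks like (the values here are just an example roughly matching the NEO settings mentioned below, the shipped defaults differ per board):

# /etc/default/cpufrequtils
ENABLE=true
MIN_SPEED=240000
MAX_SPEED=1200000
GOVERNOR=interactive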
So how do we get, for example, a NanoPi NEO to boot with as little peak consumption as possible with the legacy kernel? With our most recent NEO settings we bring up all 4 cores and define the CPU clockspeed in u-boot as low as 480 MHz. As soon as the kernel takes over we use the interactive governor and allow cpufreq scaling from 240 MHz up to 1200 MHz, and since booting is pretty CPU intensive the kernel will stay at 1008 MHz or above most of the time while booting, being responsible for consumption peaks that exceed idle consumption by 4-5 times. As soon as cpufrequtils takes over, behaviour can be controlled again (eg. lowering MAX_SPEED from 1296000 to just 240 MHz)
 
So the problem is the time between the kernel starting and the invocation of the cpufrequtils daemon, since our default 'interactive' cpufreq governor lets H3 run on all 4 cores at 1200 MHz on the NEO even if we defined the maximum cpufreq in normal operation mode to be 912 MHz (everything defined in /etc/default/cpufrequtils only becomes active once cpufrequtils has been started by systemd/upstart)
 
Since we can choose between a few different cpufreq governors with H3's legacy kernel I thought: let's try out the differences (leaving out the performance governor since it does the opposite of what we're looking for). I modified the cpufrequtils start script to do some monitoring (time of invocation and the cpufreq steps the kernel used before), added a script to log start times in a file so average values could be created later, let the board reboot automatically and exchanged the kernel after every 100 reboots to test 4 different settings: interactive, ondemand, powersave and userspace as default cpufreq governor.
 
To get an idea how changing the default cpufreq governor in the kernel config might influence other H3 boards I chose the strongest one to compare: the OPi Plus 2E. NanoPi NEO is configured to use 480 MHz cpufreq set by u-boot and to allow cpufreq scaling between 240 MHz and 1200 MHz. OPi Plus 2E uses 1008 MHz as cpufreq in u-boot and jumps between 480 MHz and 1296 MHz with our default settings.
 
So how do the 4 different cpufreq governors behave with both boards?
  • interactive: does the best job from a performance perspective since this governor switches pretty fast from lower clockspeeds to higher ones (also highest consumption peaks seen)
  • ondemand: In our tests cpufreq only switched between lowest allowed and highest clockspeed while remaining most of the times at the lowest (240/1200 on NEO and 480/1296 on Plus 2E). Please be aware that ondemand is considered broken.
  • powersave: With this setting cpufreq remains at the lowest allowed clockspeed (240 MHz on NEO and 480 MHz on Plus 2E)
  • userspace: No adjustments at all, simply re-using the clockspeed set by u-boot (480 MHz on NEO and 1008 MHz on Plus 2E)
Let's have a look how boot times changed. I simply monitored the time in seconds between start of the kernel and invocation of cpufrequtils (since this is the time span when changing the default cpufreq governor in kernel config matters):
                NEO      Plus 2E
Interactive:   10.06       9.93
Ondemand:      12.43      10.90
Powersave:     14.16      11.51
Userspace:     11.36      10.15
Shorter times correlate with higher peak consumption. So it's obvious that changing the default cpufreq governor for H3 boards from interactive to powersave would help a lot to reduce boot consumption. On the NEO this delays booting by ~4.1 seconds and on the Plus 2E by just ~1.65 seconds -- the reason is simple: the NEO boots at 240 MHz instead of remaining above 1008 MHz most of the time and the OPi Plus 2E boots at 480 MHz instead of 1200+ MHz.
 
But userspace is also interesting. This governor doesn't alter the cpufreq set by u-boot, therefore the NEO boots at 480 MHz and the OPi Plus 2E at 1008 MHz (also true for all other H3 devices except the overheating ones -- Beelink X2, Banana Pi M2+ and NanoPi M1 use 816 MHz instead) while delaying boot times by just 1.3 seconds (NEO) or 0.23 seconds (Plus 2E).
 
The 'less consumption' champion is clearly powersave, but since we want to maintain only one single kernel config for all H3 boards it might be the better idea to choose userspace instead as the default cpufreq governor in the sun8i legacy kernel config, since with this setting the NEO still reduces boot consumption a lot while other H3 devices aren't affected that much.
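In terms of the kernel config that's a one-line change (Kconfig symbol names as used in the sun8i legacy kernel's cpufreq menu, listed here as an assumption to illustrate the idea):

# CONFIG_CPU_FREQ_DEFAULT_GOV_INTERACTIVE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y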
 
All consumption numbers are just 'looking at powermeter while board boots'. My measurement setup using average values totally fails when it's about peak consumption. I already thought about using a RPi 3, its camera module, the motion daemon and an OCR solution to monitor my powermeter. But based on the information we already have (consumption numbers based on cpufreq/dvfs settings) it seems switching from interactive to userspace is a good idea to save peak current while booting. Though if anyone is after lowest consumption possible then choosing powersave is the better choice.
 
In case someone wants to test on his own here's the procedure and test logs:

I added these three lines to the start of /etc/init.d/cpufrequtils:
echo "cpufrequtils taking over" > /dev/kmsg
grep -v " 0$" /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state >>/var/log/boot-state.log
echo "$(date) $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor) $(cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq) $(cat /sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/cur_freq)" >>/var/log/boot-state.log
And then I added '/usr/local/bin/check-boot-time.sh &' to /etc/rc.local and this script looks like this:
root@nanopineo:/home/tk# cat /usr/local/bin/check-boot-time.sh 
#!/bin/bash
# check boot times: log how long it took until cpufrequtils took over,
# then install the kernel with the next default governor every 100 reboots
sleep 15
dmesg | grep cpufreq >>/var/log/boot-state.log
# extract the kernel timestamp of the 'cpufrequtils taking over' marker
BootTime=$(tail -n2 /var/log/boot-state.log | awk -F" " '/cpufrequtils taking over/ {print $2}' | sed 's/\]//')
echo ${BootTime} >>/var/log/boot-times.log
CountOfEntries=$(wc -l </var/log/boot-times.log)
if [ ${CountOfEntries} -eq 100 ]; then
	sleep 60
	dpkg -i /home/tk/boot-tests/linux-image-sun8i_5.17_armhf_ondemand.deb
	echo "--- ondemand ---" >>/var/log/boot-times.log
elif [ ${CountOfEntries} -eq 200 ]; then
	sleep 60
	dpkg -i /home/tk/boot-tests/linux-image-sun8i_5.17_armhf_powersave.deb
	echo "--- powersave ---" >>/var/log/boot-times.log
elif [ ${CountOfEntries} -eq 300 ]; then
	sleep 60
	dpkg -i /home/tk/boot-tests/linux-image-sun8i_5.17_armhf_userspace.deb
	echo "--- userspace ---" >>/var/log/boot-times.log
elif [ ${CountOfEntries} -gt 399 ]; then
	exit 0
fi
reboot
With this scripted setup the boards test themselves unattended through the 4 different settings, rebooting 400 times and providing logs that can be interpreted later:

 

 

Link to comment
Share on other sites

Since I realized that getting peak consumption values while booting is not a task that needs automation but can be done manually and in a short time (the count of samples can be rather low) I decided to give it a try. I let the NanoPi NEO boot 10 times each with the interactive, powersave and userspace governor (ondemand isn't interesting here) and simply watched my powermeter's display for the peak numbers shown. At the end of the test I used my usual consumption monitoring setup to get an idea how the numbers the powermeter provided (including the PSU's own consumption!) match the usual numbers when using a Banana Pro as 'monitoring PSU'.

 
Results as follows:
                boot time    peak consumption shown
interactive:    10.6824      9 x 2.8W, 1 x 2.9W
powersave:      14.7607      3 x 2.2W, 6 x 2.3W, 1 x 2.4W
userspace:      11.9503      10 x 2.4W

So based on this quick test the powersave governor doesn't help avoiding high consumption values since peak consumption values are pretty close to the results with userspace. On the other hand switching from interactive to powersave would increase boot time by ~4.1 seconds while userspace only delays boot times by ~1.3 seconds on the NEO. On all other H3 devices switching from interactive to userspace shouldn't matter at all since boot times are only slightly delayed -- see above (0.22 seconds more on OPi Plus 2E)

 
How would the boot behaviour of H3 devices currently supported by Armbian change when switching the default cpufreq governor in kernel config from interactive to userspace? Let's have a look at the cpufreq scaling behaviour before the cpufrequtils daemon is started (the short 2-3 second consumption peaks happen prior to cpufrequtils start!). The numbers given for interactive are meant as 'spent most of the time at'; when using userspace the board simply remains at the clockspeed set in the u-boot config until the cpufrequtils daemon is started:
                 interactive        userspace
NanoPi NEO        1008-1200            480

NanoPi M1,
Banana Pi M2+     1008-1200            816

Beelink X2,
OPi One/Lite      1008-1200           1008

All other OPi     1008-1296           1008
 
That means that on NanoPi M1 and BPi M2+ booting might be delayed by ~0.5 seconds, with X2 or OPi Lite/One we're talking about 0.3 seconds and with the larger Oranges it's even less when switching to userspace. Consumption savings on all these boards are negligible, but with the NanoPi NEO we get a reduction of peak consumption while booting of approx. 500 mW (not worth a look when powering a single NEO with a good PSU, but if a fleet of NEOs is powered through PoE then these 500 mW multiplied by the count of NEOs can make a huge difference when dimensioning the PSU's amperage)
 
The baseline of my tests was a NEO/512 powered through FriendlyARM's PSU-ONECOM with only Ethernet connected. Xunlong's 5V/3A PSU powered the PSU-ONECOM and sits in a 'Brennenstuhl PM 231E' powermeter reporting ~1.6W idle consumption with the userspace governor, all CPU cores active and default DRAM clockspeed --> meaning the board was idling at 480 MHz CPU clockspeed and 408 MHz DRAM clockspeed.
 
When I ran 'sysbench --test=cpu --cpu-max-prime=2000000 run --num-threads=4' the powermeter showed 2.4W consumption at 912 MHz and 3.0W when at 1008 MHz (the huge increase is due to VDD_CPUX switching from 1.1V to 1.3V). 
 
So how do these values translate to consumption measurements 'behind PSU'? I used my usual Banana Pro monitoring PSU setup and got 
  • 1190 mW reported when idling (1.6W according to powermeter)
  • 1980 mW when running sysbench at 912 MHz (2.4W according to powermeter)
  • 2720 mW when running sysbench at 1008 MHz (3.0W according to powermeter)
                    with PSU (powermeter)   w/o PSU (Banana Pro)
idle:                    1.6W                    1190 mW
sysbench @ 912MHz:       2.4W (+800mW)           1980 mW (+790mW)
sysbench @ 1008MHz:      3.0W (+1400mW)          2720 mW (+1530mW)
 
TL;DR: Switching from interactive to userspace as default cpufreq governor in sun8i kernel config helps reducing NanoPi NEO's peak consumption at booting by ~500mW while it does not delay booting times a lot (~1.3 seconds longer on NEO). With this change situation for all other H3 devices does not change much both regarding peak consumption and boot times. Switching to userspace seems reasonable to me since we can benefit a lot with NEO's low power mode while not negatively affecting other boards.
Link to comment
Share on other sites

Another interesting update on the relationship between consumption and performance: I used sysbench all the time to do some basic comparisons. The great thing about sysbench is that it does not depend on memory bandwidth at all, which on the other hand is bad when you compare with real world performance critical stuff, since every task that does not run solely within the CPU's internal caches (and that's the vast majority!) depends somehow on memory bandwidth.

 
Let's take a look at cpuminer, a bitcoin mining application that uses NEON instructions on ARM (pretty heavy compared to non-NEON stuff but not that much compared with cpuburn-a7) and features a benchmark mode reporting khash/s (kilo hashes per second), which is great to explore how memory bandwidth might influence the performance of memory dependent workloads.
 
Getting cpuminer up and running on Armbian (and most probably every other more recent Debian based armhf distro) is simple and takes only a minute: Get https://sourceforge.net/projects/cpuminer/files/pooler-cpuminer-2.4.5.tar.gz/download then untar it, change into cpuminer-2.4.5 dir and do
sudo apt-get install libcurl4-gnutls-dev
./configure CFLAGS="-O3 -mfpu=neon"
make
./minerd --benchmark
I decided to test with one Orange Pi limited to 1200 MHz clockspeed max, one able to reach 1296 MHz and NanoPi NEO since this is the first H3 device which really differs with regard to DRAM:
  • Single bank configuration -- on all other H3 SBCs so far always 2 DRAM chips are used. This affects memory bandwidth negatively and is maybe also responsible for overheating and higher consumption (just an assumption, but Olimex reported the same heat issues when starting with their H3 board prototypes and they also use a single bank DRAM config)
  • DRAM clocked at just 432 MHz by the vendor. Since we found out that lowering this clockspeed to 408 MHz barely affects performance while consumption decreases by a whopping 130 mW for these 24 MHz less, we decided to remain at 408 MHz in Armbian for the NEO
Since we also decided to limit the maximum CPU clockspeed to 912 MHz I tested the NEO with both 912 and 1200 MHz CPU clockspeed (912 MHz is the upper clockspeed where the SoC stays at the lower 1.1V core voltage; every higher clockspeed requires a switch to 1.3V which increases consumption massively!)
 
For the test also a small fan was needed in addition to heatsinks to prevent throttling. On the NEO I used FriendlyARM's own large and effective heatsink, on the Oranges my usual 50 Cent el cheapo heatsinks. Also important: I used the same Armbian image for all tests and our NEO settings also for OPi Lite (HDMI/Mali disabled being the most important tweak -- see below). When testing with Orange Pi PC I allowed 1296 MHz maximum clockspeed and also disabled HDMI/Mali for one test. So we have 5 columns:
  • NEO/912: HDMI/Mali disabled, 912 MHz cpufreq, single bank DRAM
  • NEO: HDMI/Mali disabled, 1200 MHz cpufreq, single bank DRAM
  • Lite: HDMI/Mali disabled, 1200 MHz cpufreq, dual bank DRAM
  • PC: HDMI/Mali disabled, 1296 MHz cpufreq, dual bank DRAM
  • PC with original settings: 1296 MHz cpufreq, dual bank DRAM
All results in hash/s by cpuminer-2.4.5 with NEON enabled.
DRAM clock  NEO/912  NEO   Lite   PC     PC with original settings
132 MHz:      922    997   1230   1259   1142
144 MHz:      979   1060   1296   1331   1210
156 MHz:     1024   1126   1358   1400   1280
168 MHz:     1070   1189   1410   1460   1349
180 MHz:     1109   1242   1460   1519   1409
192 MHz:     1149   1292   1510   1570   1489
204 MHz:     1180   1340   1550   1620   1558
216 MHz:     1210   1385   1591   1660   1610
228 MHz:     1239   1430   1628   1700   1660
240 MHz:     1260   1469   1670   1740   1702
252 MHz:     1290   1500   1700   1772   1742
264 MHz:     1320   1534   1730   1810   1780
276 MHz:     1343   1563   1760   1842   1810
288 MHz:     1368   1591   1780   1870   1840
300 MHz:     1380   1620   1800   1900   1870
312 MHz:     1400   1650   1820   1920   1890
324 MHz:     1410   1680   1840   1948   1920
336 MHz:     1421   1710   1867   1963   1940
348 MHz:     1440   1730   1890   1990   1963
360 MHz:     1450   1760   1912   2010   1989
372 MHz:     1460   1780   1930   2039   2012
384 MHz:     1470   1800   1940   2060   2039
408 MHz:     1500   1830   1960   2090   2070
432 MHz:     1530   1858   1982   2110   2090
456 MHz:     1540   1880   2000   2130   2110
480 MHz:     1551   1910   2011   2150   2130
504 MHz:     1560   1920   2020   2169   2149
528 MHz:     1570   1950   2029   2180   2160
552 MHz:     1580   1980   2052   2180   2170
576 MHz:     1594   2000   2079   2190   2180
600 MHz:     1600   2012   2100   2225   2200
624 MHz:     1611   2030   2109   2249   2230
648 MHz:     1620   2040   2110   2260   2239
672 MHz:     1624   2049   2120   2270   2247
Let's look at the results:
  • When comparing the two last columns (OPi PC with default settings vs. HDMI/Mali disabled) it's obvious that disabling both improves cpuminer/memory performance. Keeping in mind that on all cheap ARM SoCs CPU and GPU share access to DRAM, it's obvious that disabling the GPU cores frees CPU resources and increases the available memory bandwidth (the lower the DRAM clockspeed the more this makes a difference: at 132 MHz it's a 117 hash/s difference, at 672 MHz only 23 hash/s)
  • When looking at the first two columns the same can be observed. The only difference between both runs is H3 running at either 912 MHz or 1200 MHz on the NEO. At the lowest possible DRAM clockspeed the difference between both cpufreqs is just 75 hash/s while at the upper end it's 425 hash/s. More interesting: at NEO's default 408 MHz DRAM clockspeed the cpufreq difference results in 1500 vs. 1830 hash/s, which means increasing cpufreq by 32 percent from 912 MHz to 1200 MHz results in only a 22 percent performance gain, since for this workload DRAM access matters too
  • When comparing columns 3 and 4 (Lite and PC using the same DRAM but different CPU clockspeeds: 1200 vs. 1296 MHz), the memory bandwidth effect is also present. The lower the DRAM is clocked, the less difference the higher cpufreq makes
  • Same when looking at columns 2 and 3 (comparing NEO and Lite running at the same 1200 MHz CPU clockspeed but with single vs. dual bank DRAM config). At 132 MHz it's a difference of 233 hash/s between both and at 672 MHz it's only 71 hash/s
Testing through 132 - 672 MHz is only useful to get some understanding of what's going on and how low DRAM clockspeeds might affect the performance of specific workloads. Now let's have a look at realistic DRAM clockspeeds, and that's 408 MHz for the NEO and 624 MHz for the Orange Pis. The 408 MHz are the result of trusting the vendor's defaults and improving on them slightly (decreasing the clockspeed by 24 MHz saves 130 mW with only insignificant performance losses), and the 624 MHz are the result of community-based DRAM reliability testing for these boards (not trusting Allwinner's 672 MHz).
 
So how do the three boards compare when driven with optimized settings (Armbian defaults, but HDMI/Mali disabled on all boards as with the NEO defaults):
  • Orange Pi PC @ 1296/624 MHz: 2249 hash/s
  • Orange Pi Lite @ 1200/624 MHz: 2109 hash/s
  • NanoPi NEO @ 1200/408 MHz: 1830 hash/s
  • NanoPi NEO @ 912/408 MHz: 1500 hash/s
Please keep in mind that all these numbers above are the result of using active cooling. With just a heatsink everything looks different since then throttling occurs and strange things might happen -- the best example is the NEO: when trying to run the test with only FriendlyARM's heatsink, no fan and allowing 1200 MHz clockspeed, the board simply deadlocked after running cpuminer for 25 minutes, with a reported SoC temperature of just 64°C.
 
We chose the 912 MHz maximum for the NEO in Armbian for a reason: the NEO is simply not made for heavy stuff. I also experienced deadlocks within 2 minutes when trying to run our usual lima-memtester DRAM reliability test on the NEO, which heats up the SoC heavily and increases consumption a lot since it stresses the Mali400MP GPU to the max. Without an annoying fan it's impossible to run these workloads on the NEO.
 
TL;DR: Disabling HDMI/GPU helps with reducing consumption and temperatures. It also increases memory bandwidth since CPU and GPU cores have to share DRAM access. More memory bandwidth helps increase the performance of most tasks (even IO-bound tasks benefit on slow SoCs like H3). On SoCs that tend to overheat, disabling HDMI/GPU helps performance twice, since lower consumption/temperatures also delay throttling: when the SoC stays cooler, throttling kicks in later under heavy workloads :)
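For H3 boards running the legacy kernel, disabling HDMI/Mali is typically done in the fex hardware description (script.bin). A rough sketch with sunxi-tools, assuming script.bin lives in /boot (the exact path and available helper scripts differ between Armbian releases):
cd /boot
bin2fex script.bin script.fex       # convert the binary hardware description to editable text
# in script.fex set: hdmi_used = 0 in [hdmi_para] and mali_used = 0 in [mali_para]
fex2bin script.fex script.bin       # convert back, then reboot for the change to take effect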
 
On a related note: we've already measured how switching through different DRAM clockspeeds affects temperatures and consumption on NanoPi NEO when idle. We get a difference of 470 mW and 10°C (w/o heatsink) just by adjusting DRAM clockspeed between 132 and 672 MHz. What does it look like when running CPU-intensive tasks? I used the NEO, limited maximum cpufreq to 912 MHz (since the board deadlocked at 1200 MHz when running cpuminer for 25 minutes) and disabled the fan so that only FA's heatsink helps with heat dissipation:
Wed Aug 17 15:06:41 CEST 2016	132/912 MHz	0.913667    50
Wed Aug 17 15:09:42 CEST 2016	144/912 MHz	0.969667    50
Wed Aug 17 15:12:42 CEST 2016	156/912 MHz	1.01967     51
Wed Aug 17 15:15:43 CEST 2016	168/912 MHz	1.05967     51
Wed Aug 17 15:18:43 CEST 2016	180/912 MHz	1.1         52
Wed Aug 17 15:21:43 CEST 2016	192/912 MHz	1.14        53
Wed Aug 17 15:24:44 CEST 2016	204/912 MHz	1.17        53
Wed Aug 17 15:27:44 CEST 2016	216/912 MHz	1.20233     54
Wed Aug 17 15:30:45 CEST 2016	228/912 MHz	1.23933     54
Wed Aug 17 15:33:45 CEST 2016	240/912 MHz	1.265       55
Wed Aug 17 15:36:45 CEST 2016	252/912 MHz	1.29033     56
Wed Aug 17 15:39:45 CEST 2016	264/912 MHz	1.31967     56
Wed Aug 17 15:42:46 CEST 2016	276/912 MHz	1.34        57
Wed Aug 17 15:45:46 CEST 2016	288/912 MHz	1.36        57
Wed Aug 17 15:48:46 CEST 2016	300/912 MHz	1.38        58
Wed Aug 17 15:51:47 CEST 2016	312/912 MHz	1.398       58
Wed Aug 17 15:54:47 CEST 2016	324/912 MHz	1.40967     58
Wed Aug 17 15:57:47 CEST 2016	336/912 MHz	1.42        59
Wed Aug 17 16:00:47 CEST 2016	348/912 MHz	1.43187     59
Wed Aug 17 16:03:48 CEST 2016	360/912 MHz	1.44233     59
Wed Aug 17 16:06:48 CEST 2016	372/912 MHz	1.45067     59
Wed Aug 17 16:09:48 CEST 2016	384/912 MHz	1.467       60
Wed Aug 17 16:12:48 CEST 2016	408/912 MHz	1.49033     61
Wed Aug 17 16:15:49 CEST 2016	432/912 MHz	1.563       65
Wed Aug 17 16:18:49 CEST 2016	456/912 MHz	1.539       67
Wed Aug 17 16:21:49 CEST 2016	480/912 MHz	1.54967     68
Wed Aug 17 16:24:50 CEST 2016	504/912 MHz	1.55933     69
Wed Aug 17 16:27:50 CEST 2016	528/912 MHz	1.57        69
Wed Aug 17 16:30:50 CEST 2016	552/870 MHz	1.55531     69
Wed Aug 17 16:33:50 CEST 2016	576/842 MHz	1.51        70
Wed Aug 17 16:36:51 CEST 2016	600/831 MHz	1.492       70
Wed Aug 17 16:39:51 CEST 2016	624/823 MHz	1.48034     70
Wed Aug 17 16:42:51 CEST 2016	648/822 MHz	1.47862     70
Wed Aug 17 16:45:51 CEST 2016	672/822 MHz	1.47594     70

These are the raw logs I used, containing time stamp, DRAM and average CPU clockspeed over the last ~2.5 minutes, cpuminer khash/s value and SoC temperature in °C.
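They were produced with a simple loop. A minimal sketch of the idea follows -- the devfreq node name, the unit expected by set_freq and the cpuminer.log file are assumptions here and will differ depending on kernel and setup:
# assumes cpuminer is already running in the background and logging to cpuminer.log
DDRFREQ=/sys/class/devfreq/sunxi-ddrfreq      # node name is an assumption, check /sys/class/devfreq/
echo userspace > ${DDRFREQ}/governor
for CLOCK in 132 144 156 168 180 192 204 216 228 240 252 264 276 288 300 312 324 336 348 360 372 384 408 432 456 480 504 528 552 576 600 624 648 672 ; do
	echo $(( ${CLOCK} * 1000000 )) > ${DDRFREQ}/userspace/set_freq
	sleep 180                                 # roughly the ~3 minute interval visible in the logs
	TEMP=$(cat /sys/class/thermal/thermal_zone0/temp)             # °C on the legacy kernel, millidegrees on mainline
	KHASH=$(grep -o '[0-9.]*[[:space:]]*khash/s' cpuminer.log | tail -1)
	echo -e "$(date)\t${CLOCK} MHz\t${KHASH}\t${TEMP}"
done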

 

CPU clockspeed was set to the maximum, and while increasing DRAM clockspeed from 132 MHz up to 528 MHz the SoC temperature increased by 20°C (related to DRAM clockspeed alone!). When exceeding 528 MHz throttling occurred, so the SoC temperature remained at ~70°C while cpuminer's performance started to degrade: the increase in consumption and temperatures at higher DRAM clockspeeds forced cpufreq down and led to lower khash/s values above 528 MHz DRAM clock.

 

As a comparison the same test with OPi Lite (no fan, cheap heatsink, same settings, cpufreq limited to 912 MHz):

Wed Aug 17 17:16:30 CEST 2016	132/912 MHz	1.07867	50
Wed Aug 17 17:19:30 CEST 2016	144/912 MHz	1.129	49
Wed Aug 17 17:22:31 CEST 2016	156/912 MHz	1.16933	50
Wed Aug 17 17:25:31 CEST 2016	168/912 MHz	1.21	51
Wed Aug 17 17:28:31 CEST 2016	180/912 MHz	1.25033	51
Wed Aug 17 17:31:32 CEST 2016	192/912 MHz	1.28167	54
Wed Aug 17 17:34:32 CEST 2016	204/912 MHz	1.31167	53
Wed Aug 17 17:37:32 CEST 2016	216/912 MHz	1.33933	53
Wed Aug 17 17:40:32 CEST 2016	228/912 MHz	1.35967	53
Wed Aug 17 17:43:33 CEST 2016	240/912 MHz	1.38	54
Wed Aug 17 17:46:33 CEST 2016	252/912 MHz	1.4	54
Wed Aug 17 17:49:33 CEST 2016	264/912 MHz	1.43	54
Wed Aug 17 17:52:33 CEST 2016	276/912 MHz	1.45	56
Wed Aug 17 17:55:34 CEST 2016	288/912 MHz	1.46067	57
Wed Aug 17 17:58:34 CEST 2016	300/912 MHz	1.47233	54
Wed Aug 17 18:01:34 CEST 2016	312/912 MHz	1.49	56
Wed Aug 17 18:04:34 CEST 2016	324/912 MHz	1.5	56
Wed Aug 17 18:07:35 CEST 2016	336/912 MHz	1.51	56
Wed Aug 17 18:10:35 CEST 2016	348/912 MHz	1.52	57
Wed Aug 17 18:13:35 CEST 2016	360/912 MHz	1.53	56
Wed Aug 17 18:16:35 CEST 2016	372/912 MHz	1.532	56
Wed Aug 17 18:19:35 CEST 2016	384/912 MHz	1.54	57
Wed Aug 17 18:22:36 CEST 2016	408/912 MHz	1.547	56
Wed Aug 17 18:25:36 CEST 2016	432/912 MHz	1.57	62
Wed Aug 17 18:28:36 CEST 2016	456/912 MHz	1.59	61
Wed Aug 17 18:31:37 CEST 2016	480/912 MHz	1.6	62
Wed Aug 17 18:34:37 CEST 2016	504/912 MHz	1.65967	63
Wed Aug 17 18:37:37 CEST 2016	528/912 MHz	1.65967	62
Wed Aug 17 18:40:37 CEST 2016	552/912 MHz	1.65067	63
Wed Aug 17 18:43:38 CEST 2016	576/912 MHz	1.66	64
Wed Aug 17 18:46:38 CEST 2016	600/912 MHz	1.65933	62
Wed Aug 17 18:49:38 CEST 2016	624/912 MHz	1.66	63
Wed Aug 17 18:52:38 CEST 2016	648/912 MHz	1.65967	63
Wed Aug 17 18:55:39 CEST 2016	672/912 MHz	1.65933	63 

No throttling occurred, temperatures were lower and cpuminer performance higher. The OPi Lite also uses DDR3 @ 1.5V (not DDR3L @ 1.35V like the larger Orange Pi variants), so the most obvious remaining difference compared to the NEO is single vs. dual bank DRAM configuration. Maybe that's the reason Olimex reported such overheating problems when they started with their H3 boards a while ago (also using DDR3 in a single bank configuration)?

 

 

Link to comment
Share on other sites

LOL, today I did some testing with NanoPi NEO, kernel 4.7.2 and the new schedutil cpufreq governor. I let the following run to check thermal readouts after allowing 1200 MHz max cpufreq:

sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$(grep -c '^processor' /proc/cpuinfo)

To my surprise the result was just 117.5 seconds -- that's 'better' than RPi 3 with the same settings, and for Orange Pi PC, which is clocked higher (1.3 GHz vs. 1.2 GHz), I got the following just a few days ago: 'sysbench takes 142 seconds, H3 constantly running at 1296 MHz, SoC temperature reached 74°C but no throttling happening'

 

Wow!!! An increase in performance of ~30 percent just by using a new kernel! With a benchmark that should not be affected by the kernel version at all?! That's magic.

 
So I immediately tried out our 3.4.112 Xenial image. Same thermal readouts, same result: 117.5 seconds! What happened?
 
I had tried out Xenial 16.04 LTS with both the 4.7.2 and 3.4.112 kernels, while before I had always used Debian Jessie. Ok, downloaded our Jessie image for NanoPi NEO, executed the same sysbench call and got 153.5 seconds (which is the correct value given that no throttling occurred, max cpufreq was at 1200 MHz and an OPi PC clocked at 1296 MHz finishes in 142 seconds!)
 
What can we learn from this? Sysbench is used nearly everywhere to 'get an idea about CPU performance' while it is horrible crap for comparing different systems! You always have to ensure that you're using the very same sysbench binary, or at least one built with the exact same compiler version and settings! We get a whopping 30 percent performance increase just because the Ubuntu folks use different compiler switches/versions than the Debian folks:
 
Both binaries report 'sysbench 0.4.12':
 
Ubuntu Xenial Xerus:
root@nanopineo:~# file /usr/bin/sysbench
/usr/bin/sysbench: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, BuildID[sha1]=2df715a7bcb84cb03205fa3a5bc8474c6be1eac2, stripped
root@nanopineo:~# lsb_release -c
Codename: xenial
root@nanopineo:~# sysbench --version
sysbench 0.4.12

Debian Jessie:

root@nanopineo:~# file /usr/bin/sysbench
/usr/bin/sysbench: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 2.6.32, BuildID[sha1]=664005ab6bf45166f9882338db01b59750af0447, stripped
root@nanopineo:~# lsb_release -c
Codename: jessie
root@nanopineo:~# sysbench --version
sysbench 0.4.12
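A quick way to check which compiler produced a given binary is to dump its .comment section (optimization flags usually aren't recorded there unless the package was built with -frecord-gcc-switches):
readelf -p .comment /usr/bin/sysbench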

It's just the same effect when comparing sysbench numbers on RPi 2 or 3 running Raspbian vs. Ubuntu MATE -- see post #12 above (but there the difference is only 15 percent, so it seems either the Raspbian people don't use compiler switches as conservative as Debian Jessie's, or Ubuntu MATE for the Raspberries doesn't optimize as aggressively as our 16.04 packages from the Ubuntu repositories)

 

TL;DR: Never trust any sysbench numbers you find on the net if you don't know which compiler version and settings have been used. Sysbench is crap for comparing different systems. You can use sysbench's cpu test only for a very limited set of purposes: creating identical CPU utilization situations (to compare throttling settings as I did before in this thread), estimating multi-threaded results when adding/removing CPU cores, or testing CPU performance without memory bandwidth tainting the results (sysbench is so primitive that all its code runs inside the CPU caches!)
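As an example of the 'adding/removing CPU cores' use case: the very same binary can be pinned to a subset of cores with taskset and the run times compared, e.g.
sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4                  # all four cores
taskset -c 0,1 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=2   # same binary limited to two cores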

 

Everything else always requires using the exact same sysbench binary on the different systems to be able to compare. So no cross-platform comparisons are possible, no comparisons between systems running different OS images, and no comparisons between different CPU architectures. Using sysbench as a general purpose CPU benchmark is always just fooling yourself! :)

Link to comment
Share on other sites

Here are some of my results for PC2:

 

Equipment used:

-Orange Pi PC2

-Samsung EVO+ 32GB sd card

-5V/2A usb PSU

-KEWEISI usb power monitor

-Armbian 5.25@March 6

 

Results:

 

Idle at armbian desktop, no devices connected: 0.98W

Idle at armbian desktop, keyboard+mouse via ps/2 usb converter chip: 1.03W

Idle at armbian desktop, keyboard+mouse via ps/2 usb converter chip, usb wifi RTL8188ETV: 1.64W

Scrolling through the Armbian forums with Firefox, keyboard+mouse via ps/2 usb converter chip, usb wifi RTL8188ETV: 2.05-3.08W

Burn-in test with cpuburn-a53 @ 1.3 GHz, keyboard+mouse via ps/2 usb converter chip, usb wifi RTL8188ETV: 7.32W

Burn-in test with cpuburn-a53 @ 1.06 GHz, keyboard+mouse via ps/2 usb converter chip, usb wifi RTL8188ETV: 5.30W

 

I should add that with the 1.3 GHz test it throttled to 1.06 GHz within seconds.
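In case someone wants to reproduce this: the cpuburn tool used above is presumably ssvb's cpuburn-a53. If so, building it from the cpuburn-arm repo is a one-liner (a sketch, assuming that repo layout and gcc being installed):
git clone https://github.com/ssvb/cpuburn-arm.git
cd cpuburn-arm
gcc -o cpuburn-a53 cpuburn-a53.S     # assembles the A53 burn loop
./cpuburn-a53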

 

 

If anyone wants something else tested, please suggest it.

Link to comment
Share on other sites

So I did a quick and dirty power test for each freq/voltage point using StabilityTester's xhpl64 (https://github.com/ehoutsma/StabilityTester). Testing was done on the same equipment as last time, but with the addition of a fan to eliminate throttling. Highest power recorded. HDMI/keyboard+mouse connected.

 

480 MHz idle  0.9747W

480 MHz   1.792W
528 MHz   1.8432W
648 MHz   2.0951W
672 MHz   2.1462W
720 MHz   2.1973W
728 MHz   2.295W
792 MHz   2.346W
816 MHz   2.448W
864 MHz   2.652W
912 MHz   2.754W
936 MHz   2.8504W
960 MHz   3.054W
1008 MHz  3.2512W
1056 MHz  3.4544W
1040 MHz  3.8885W
1152 MHz  4.1915W
1200 MHz  4.6965W
1224 MHz  4.9995W
1248 MHz  5.2312W
1296 MHz  5.7456W
1368 MHz  7.0716W
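For anyone wanting to reproduce such a per-frequency table without StabilityTester, pinning each operating point manually with the userspace governor works too (a rough sketch; the list of available frequencies and the sysfs layout depend on kernel and cpufreq driver):
cd /sys/devices/system/cpu/cpu0/cpufreq
echo userspace > scaling_governor
for FREQ in $(cat scaling_available_frequencies) ; do
	echo ${FREQ} > scaling_setspeed
	echo "now at $(cat scaling_cur_freq) kHz -- start the load and note the power reading"
	sleep 120
done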

 

Link to comment
Share on other sites

If there are any real-world apps that load up all 4 cores and would give somewhat consistent results, I'm happy to take suggestions. I'm well aware that testing with xhpl64 isn't exactly realistic, but it's more of a worst-case scenario.

 

Link to comment
Share on other sites

19 hours ago, hojnikb said:

If there are any real-world apps that load up all 4 cores and would give somewhat consistent results, I'm happy to take suggestions. I'm well aware that testing with xhpl64 isn't exactly realistic, but it's more of a worst-case scenario.

 

I usually run "openssl speed rsa2048 -multi <#cores>" for this; the RSA code is carefully optimized to achieve a very high IPC on most CPUs and I've always managed to reach the highest power consumption with it. The only difficulty is that it doesn't run for long (~10s) so you have to measure quickly. Another benefit is that it comes pre-installed on most systems.
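To get a longer measurement window one can simply repeat the short run back to back while watching the meter, e.g. (nproc fills in the core count):
for i in $(seq 1 10); do openssl speed rsa2048 -multi $(nproc); done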

Link to comment
Share on other sites

4 hours ago, wtarreau said:

I usually run "openssl speed rsa2048 -multi <#cores>" for this; the RSA code is carefully optimized to achieve a very high IPC on most CPUs and I've always managed to reach the highest power consumption with it. The only difficulty is that it doesn't run for long (~10s) so you have to measure quickly. Another benefit is that it comes pre-installed on most systems.

All right, I might give this a try next time I'm fiddling with my boards :)

 

 

Link to comment
Share on other sites

On 05/08/2016 at 4:38 AM, tkaiser said:

Why do they allow RPi Zero to clock with up to 1 GHz by default when they limit B+ to 700 MHz (compare performance and consumption numbers of both tinymembench and sysbench above)?

 

It was the limit of manufacturing capability at the times of release.

 

The first Pi was released April 2012, the B+ was released July 2014, and the Zero was released November 2015.

 

In 2012, Broadcom could only make the SoC well enough that the ARM1176JZF-S could reliably reach 700MHz. Some units could be overclocked with good results but many could not.

 

By the end of 2015 - almost four years later - they had improved the precision of the manufacturing process so that 1000MHz was possible and reliable on all chips.

 

Link to comment
Share on other sites

1 hour ago, superjamie said:

 

It was the limit of manufacturing capability at the times of release.

 

The first Pi was released April 2012, the B+ was released July 2014, and the Zero was released November 2015.

 

In 2012, Broadcom could only make the SoC well enough that the ARM1176JZF-S could reliably reach 700MHz. Some units could be overclocked with good results but many could not.

 

By the end of 2015 - almost four years later - they had improved the precision of the manufacturing process so that 1000MHz was possible and reliable on all chips.

 

 

1 GHz was achievable on pretty much all boards; evidence of this is the option in raspi-config that allowed a 1 GHz setting. My sample went easily to 1150 MHz.
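For context -- and purely from memory, so treat the exact numbers as an assumption -- the 1 GHz 'Turbo' preset raspi-config offered translated into /boot/config.txt entries roughly like these:
# 'Turbo' overclock preset on the Pi 1 (values from memory, verify against your raspi-config)
arm_freq=1000
core_freq=500
sdram_freq=600
over_voltage=6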

Link to comment
Share on other sites

On 10/04/2017 at 11:44 PM, hojnikb said:

 

1 GHz was achievable on pretty much all boards; evidence of this is the option in raspi-config that allowed a 1 GHz setting. My sample went easily to 1150 MHz.

 

The configuration tool allowed users to try for 1 GHz but it definitely wasn't achievable on all boards. I had a first-batch 256 MiB RAM Pi 1 (which I bought the week they were released in 2012) and a later 512 MiB Pi 1, neither of which could reliably go past 900 MHz. I've spoken to other Pi 1 owners who could achieve 950 MHz or 1000 MHz, and one owner whose board couldn't even get past 850 MHz reliably.

 

If you had a Pi 1 reaching 1150 MHz, you were very lucky and your experience was definitely not typical of most users.

Link to comment
Share on other sites
