tkaiser Posted August 4, 2016

The following is the start of a series of tests regarding a minimized consumption mode for SBCs. The idea behind it is to provide Armbian settings that allow some of the SBCs we support to be used as cheap and low-power IoT nodes (call it 'home automation' or anything else -- at least it's about low-power and headless usage).

I start with some consumption comparisons of various RPi models below, just to get a baseline of numbers and to verify that my consumption monitoring setup is OK. Many of the numbers you find on the net are questionable (measured with inappropriate equipment, comparing OS images with different settings, not taking cable resistance into account, wonky methodology only looking at current and forgetting about voltage fluctuations, and so on), so I thought let's take a few RPi lying around and do some measurements of my own with an absolutely identical setup so the numbers I get can be compared reliably.

I used the most recent Raspbian Debian Jessie Lite image currently available (2016-05-27-raspbian-jessie-lite.img) with the latest kernel (4.4.13), all upgrades applied, HDMI disabled in /etc/rc.local by '/usr/bin/tvservice -o', and the RPi powered through a USB port of a Banana Pro (my 'monitoring PSU' -- all values below are 30 min averages). All tests were done using the same Raspbian installation on the same SD card.

Memory throughput tests were done using https://github.com/ssvb/tinymembench

CPU 'benchmarks' were done using sysbench (which is known to be unsuitable for comparing different CPU architectures, but since the RPi 3 has to run with an ARMv7 kernel and ARMv6 userland it's OK to use here; it's also lightweight enough not to overload my 'monitoring PSU', and throttling could be prevented just by applying a cheap heatsink to the SoC). If we were able to run ARMv8 code on the RPi 3's SoC, sysbench would be completely useless since the test would then not take 120 seconds but less than 10 (that's what you get from not using the ARMv8 instruction set).

I always used 2 sysbench runs: the first with '--cpu-max-prime=20000' to get some numbers to compare, the second running for at least an hour with '--cpu-max-prime=2000000' to get reliable consumption reporting. With the RPi 3 applying a cheap heatsink was necessary to prevent throttling (cpufreq remained at 1200 MHz and SoC temperature at 80°C). Tests used all available CPU cores, so the results only apply to multi-threaded workloads (keep that in mind, your 'real world' application requirements normally look different!):

sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$(grep -c '^processor' /proc/cpuinfo)

A few words regarding the RPi platform: all RPi use basically the same SoC. It's a Broadcom VideoCore IV SoC that boots a proprietary OS, combined with 1 to 4 ARM cores that are brought up later. RPi Zero/A/A+/B/B+ use the BCM2835 SoC which adds 1 ARMv6 core to the VideoCore VPU/GPU, the BCM2836 replaced this with a quad-core ARMv7 cluster, and on the latest BCM2837 design they replaced the Cortex-A7 cores with Cortex-A53 cores that currently have to run in 32-bit mode only.
The other limitations this platform suffers from are also due to this design (the VideoCore VPU/GPU being the main part of the SoC and no further SoC development done except exchanging ARM cores and minor memory interface improvements):

- only one single USB 2.0 OTG port available between SoC and the outside
- only DDR2 DRAM possible and the maximum is 1GB (all RPi use LPDDR2 at 1.2V)
- a FAT partition needed where the proprietary VideoCore bootloader BLOBs are located

So how do some RPi provide Ethernet and 2 or 4 USB ports? They use an onboard component called LAN9512 (Fast Ethernet + 2 USB ports on RPi B -- not B+!) or LAN9514 (Fast Ethernet + 4 USB ports on RPi B+, 2 and 3). The RPi models that omit this component (RPi A+ and Zero) not so surprisingly show the lowest consumption numbers. The same could've been true for the RPi A, but unfortunately the RPi foundation chose inefficient LDO (low-dropout) regulators to generate the 3.3V and 1.8V needed by various ICs on the boards, which transform power into heat on the first two models (so no numbers for RPi A and B here since they're not suitable for low-power operation due to this design flaw).

We can see below that disabling the LAN9514 hub/Ethernet combo makes a huge difference regarding consumption, which we should take into account if we start to compare with boards supported by Armbian (eg. H3 boards that feature real Ethernet and 4 real USB ports). The same applies to RPi A+ or Zero when a USB-to-Ethernet dongle is connected, but here it heavily depends on the dongle in question. When using one of my Gbit Ethernet dongles (Realtek RTL8153 based) consumption increases by +1100mW regardless of whether buspower is 0 or 1; with a random Fast Ethernet adapter it makes a difference -- see below.

RPi Zero with nothing connected, doing nothing, just power led:

echo 0 >/sys/devices/platform/soc/20980000.usb/buspower --> 365 mW

With a connected Apple USB-Fast-Ethernet dongle consumption is like this:

echo 0 >/sys/devices/platform/soc/20980000.usb/buspower --> 410 mW (no network)
echo 1 >/sys/devices/platform/soc/20980000.usb/buspower --> 1420 mW (network active, cable inserted but idling)

That means this USB-Ethernet dongle consumes 45mW when just connected (regardless of whether the RPi is completely powered off or buspower = 0), and as soon as a USB connection between dongle and RPi is negotiated plus an Ethernet connection on the other side, another whopping 1010 mW is added to overall consumption. Therefore choose your Ethernet dongle wisely when you deal with devices that lack native Ethernet capabilities.

Fortunately the RPi Zero exposes the SoC's one single OTG port as Micro USB with ID pin, so the Zero -- unlike all other RPi models -- can switch to a USB gadget role and we can use the USB OTG connection as a network connection using the g_ether module (quite simple in the meantime with most recent Raspbian images, just have a look at https://gist.github.com/gbaman/975e2db164b3ca2b51ae11e45e8fd40a -- see also the small sketch below). I'll cover performance and consumption numbers in this mode in a later post (covering idle and full load and also some camera scenarios since this is my only use case for any RPi: HW accelerated video encoding).

Performance numbers: sysbench takes 915 seconds on the single core @ 1000 MHz, 800 mW reported (+435 mW compared to 'baseline').
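Not measured here, but for reference: the g_ether gadget setup mentioned above boils down to roughly the following on recent Raspbian images (a hedged sketch based on the approach described in the linked gist -- exact overlay/module names may differ with your image version):

# /boot/config.txt -- load the dwc2 USB controller overlay
dtoverlay=dwc2
# /boot/cmdline.txt -- append after 'rootwait' so the Ethernet gadget is set up at boot
modules-load=dwc2,g_ether

After a reboot the Zero should then show up as an additional usb0 network interface on the host it is plugged into.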
And tinymembench looks like this: tinymembench v0.4.9 (simple benchmark for memory throughput and latency) ========================================================================== == Memory bandwidth tests == == == == Note 1: 1MB = 1000000 bytes == == Note 2: Results for 'copy' tests show how many bytes can be == == copied per second (adding together read and writen == == bytes would have provided twice higher numbers) == == Note 3: 2-pass copy means that we are using a small temporary buffer == == to first fetch data into it, and only then write it to the == == destination (source -> L1 cache, L1 cache -> destination) == == Note 4: If sample standard deviation exceeds 0.1%, it is shown in == == brackets == ========================================================================== C copy backwards : 169.8 MB/s C copy backwards (32 byte blocks) : 170.6 MB/s C copy backwards (64 byte blocks) : 168.2 MB/s C copy : 185.9 MB/s (0.2%) C copy prefetched (32 bytes step) : 449.0 MB/s C copy prefetched (64 bytes step) : 273.3 MB/s C 2-pass copy : 180.2 MB/s (4.0%) C 2-pass copy prefetched (32 bytes step) : 313.4 MB/s C 2-pass copy prefetched (64 bytes step) : 272.3 MB/s (2.7%) C fill : 856.7 MB/s (4.1%) C fill (shuffle within 16 byte blocks) : 856.8 MB/s (3.8%) C fill (shuffle within 32 byte blocks) : 856.6 MB/s C fill (shuffle within 64 byte blocks) : 856.9 MB/s --- standard memcpy : 439.9 MB/s standard memset : 1693.5 MB/s (3.7%) --- VFP copy : 222.7 MB/s VFP 2-pass copy : 198.1 MB/s (2.4%) ARM fill (STRD) : 856.6 MB/s ARM fill (STM with 8 registers) : 1675.7 MB/s ARM fill (STM with 4 registers) : 1693.4 MB/s (3.7%) ARM copy prefetched (incr pld) : 440.0 MB/s (4.2%) ARM copy prefetched (wrap pld) : 270.4 MB/s ARM 2-pass copy prefetched (incr pld) : 379.7 MB/s (3.7%) ARM 2-pass copy prefetched (wrap pld) : 308.5 MB/s ========================================================================== == Framebuffer read tests. == == == == Many ARM devices use a part of the system memory as the framebuffer, == == typically mapped as uncached but with write-combining enabled. == == Writes to such framebuffers are quite fast, but reads are much == == slower and very sensitive to the alignment and the selection of == == CPU instructions which are used for accessing memory. == == == == Many x86 systems allocate the framebuffer in the GPU memory, == == accessible for the CPU via a relatively slow PCI-E bus. Moreover, == == PCI-E is asymmetric and handles reads a lot worse than writes. == == == == If uncached framebuffer reads are reasonably fast (at least 100 MB/s == == or preferably >300 MB/s), then using the shadow framebuffer layer == == is not necessary in Xorg DDX drivers, resulting in a nice overall == == performance improvement. For example, the xf86-video-fbturbo DDX == == uses this trick. == ========================================================================== VFP copy (from framebuffer) : 223.8 MB/s VFP 2-pass copy (from framebuffer) : 207.0 MB/s ARM copy (from framebuffer) : 188.3 MB/s ARM 2-pass copy (from framebuffer) : 215.1 MB/s (3.5%) ========================================================================== == Memory latency test == == == == Average time is measured for random memory accesses in the buffers == == of different sizes. The larger is the buffer, the more significant == == are relative contributions of TLB, L1/L2 cache misses and SDRAM == == accesses. 
For extremely large buffer sizes we are expecting to see == == page table walk with several requests to SDRAM for almost every == == memory access (though 64MiB is not nearly large enough to experience == == this effect to its fullest). == == == == Note 1: All the numbers are representing extra time, which needs to == == be added to L1 cache latency. The cycle timings for L1 cache == == latency can be usually found in the processor documentation. == == Note 2: Dual random read means that we are simultaneously performing == == two independent memory accesses at a time. In the case if == == the memory subsystem can't handle multiple outstanding == == requests, dual random read has the same timings as two == == single reads performed one after another. == ========================================================================== block size : single random read / dual random read 1024 : 0.0 ns / 0.0 ns 2048 : 0.0 ns / 0.0 ns 4096 : 0.0 ns / 0.0 ns 8192 : 0.0 ns / 0.0 ns 16384 : 0.8 ns / 1.2 ns 32768 : 19.8 ns / 32.4 ns 65536 : 32.6 ns / 47.0 ns 131072 : 42.4 ns / 57.3 ns 262144 : 98.1 ns / 157.5 ns 524288 : 166.2 ns / 293.0 ns 1048576 : 200.2 ns / 364.7 ns 2097152 : 217.3 ns / 401.7 ns 4194304 : 226.0 ns / 420.5 ns 8388608 : 231.0 ns / 430.4 ns 16777216 : 236.4 ns / 442.2 ns 33554432 : 251.4 ns / 473.6 ns 67108864 : 288.9 ns / 548.7 ns RPi B+: Nothing connected, doing nothing, just power led: echo 0 >/sys/devices/platform/soc/20980000.usb/buspower --> 600 mW echo 1 >/sys/devices/platform/soc/20980000.usb/buspower --> 985 mW buspower = 1 and Ethernet cable connected --> 1200 mW Performance: sysbench took 1311 seconds @ 700 MHz while 1160 mW consumption has been reported (+175 mW compared to 'baseline') This is tinymembench: tinymembench v0.4.9 (simple benchmark for memory throughput and latency) ========================================================================== == Memory bandwidth tests == == == == Note 1: 1MB = 1000000 bytes == == Note 2: Results for 'copy' tests show how many bytes can be == == copied per second (adding together read and writen == == bytes would have provided twice higher numbers) == == Note 3: 2-pass copy means that we are using a small temporary buffer == == to first fetch data into it, and only then write it to the == == destination (source -> L1 cache, L1 cache -> destination) == == Note 4: If sample standard deviation exceeds 0.1%, it is shown in == == brackets == ========================================================================== C copy backwards : 127.2 MB/s (0.1%) C copy backwards (32 byte blocks) : 130.5 MB/s C copy backwards (64 byte blocks) : 129.5 MB/s C copy : 144.9 MB/s C copy prefetched (32 bytes step) : 368.9 MB/s C copy prefetched (64 bytes step) : 212.6 MB/s C 2-pass copy : 137.8 MB/s C 2-pass copy prefetched (32 bytes step) : 248.5 MB/s (0.1%) C 2-pass copy prefetched (64 bytes step) : 207.7 MB/s C fill : 760.4 MB/s C fill (shuffle within 16 byte blocks) : 760.5 MB/s C fill (shuffle within 32 byte blocks) : 760.3 MB/s C fill (shuffle within 64 byte blocks) : 760.5 MB/s --- standard memcpy : 380.2 MB/s standard memset : 1483.9 MB/s --- VFP copy : 165.0 MB/s VFP 2-pass copy : 145.1 MB/s ARM fill (STRD) : 760.6 MB/s (1.3%) ARM fill (STM with 8 registers) : 1101.7 MB/s (0.1%) ARM fill (STM with 4 registers) : 1484.2 MB/s ARM copy prefetched (incr pld) : 380.3 MB/s ARM copy prefetched (wrap pld) : 205.3 MB/s ARM 2-pass copy prefetched (incr pld) : 291.5 MB/s ARM 2-pass copy prefetched (wrap pld) : 237.1 MB/s 
========================================================================== == Framebuffer read tests. == == == == Many ARM devices use a part of the system memory as the framebuffer, == == typically mapped as uncached but with write-combining enabled. == == Writes to such framebuffers are quite fast, but reads are much == == slower and very sensitive to the alignment and the selection of == == CPU instructions which are used for accessing memory. == == == == Many x86 systems allocate the framebuffer in the GPU memory, == == accessible for the CPU via a relatively slow PCI-E bus. Moreover, == == PCI-E is asymmetric and handles reads a lot worse than writes. == == == == If uncached framebuffer reads are reasonably fast (at least 100 MB/s == == or preferably >300 MB/s), then using the shadow framebuffer layer == == is not necessary in Xorg DDX drivers, resulting in a nice overall == == performance improvement. For example, the xf86-video-fbturbo DDX == == uses this trick. == ========================================================================== VFP copy (from framebuffer) : 169.6 MB/s VFP 2-pass copy (from framebuffer) : 153.8 MB/s ARM copy (from framebuffer) : 150.0 MB/s ARM 2-pass copy (from framebuffer) : 165.9 MB/s ========================================================================== == Memory latency test == == == == Average time is measured for random memory accesses in the buffers == == of different sizes. The larger is the buffer, the more significant == == are relative contributions of TLB, L1/L2 cache misses and SDRAM == == accesses. For extremely large buffer sizes we are expecting to see == == page table walk with several requests to SDRAM for almost every == == memory access (though 64MiB is not nearly large enough to experience == == this effect to its fullest). == == == == Note 1: All the numbers are representing extra time, which needs to == == be added to L1 cache latency. The cycle timings for L1 cache == == latency can be usually found in the processor documentation. == == Note 2: Dual random read means that we are simultaneously performing == == two independent memory accesses at a time. In the case if == == the memory subsystem can't handle multiple outstanding == == requests, dual random read has the same timings as two == == single reads performed one after another. == ========================================================================== block size : single random read / dual random read 1024 : 0.0 ns / 0.0 ns 2048 : 0.0 ns / 0.0 ns 4096 : 0.0 ns / 0.0 ns 8192 : 0.1 ns / 0.1 ns 16384 : 1.7 ns / 2.8 ns 32768 : 31.9 ns / 51.7 ns 65536 : 51.6 ns / 74.0 ns 131072 : 67.3 ns / 90.6 ns 262144 : 132.4 ns / 207.2 ns 524288 : 227.0 ns / 396.9 ns 1048576 : 274.5 ns / 495.7 ns 2097152 : 298.4 ns / 546.5 ns 4194304 : 310.3 ns / 572.2 ns 8388608 : 317.4 ns / 587.5 ns 16777216 : 326.9 ns / 606.8 ns 33554432 : 353.7 ns / 660.5 ns 67108864 : 379.8 ns / 712.8 ns RPi 2: Nothing connected, doing nothing, just power led: echo 0 >/sys/devices/platform/soc/3f980000.usb/buspower --> 645 mW echo 1 >/sys/devices/platform/soc/3f980000.usb/buspower --> 1005 mW Performance: sysbench takes 192 seconds @ 900 MHz, 2140 mW reported (+1135 mW compared to 'baseline'). 
And tinymembench looks like: tinymembench v0.4.9 (simple benchmark for memory throughput and latency) ========================================================================== == Memory bandwidth tests == == == == Note 1: 1MB = 1000000 bytes == == Note 2: Results for 'copy' tests show how many bytes can be == == copied per second (adding together read and writen == == bytes would have provided twice higher numbers) == == Note 3: 2-pass copy means that we are using a small temporary buffer == == to first fetch data into it, and only then write it to the == == destination (source -> L1 cache, L1 cache -> destination) == == Note 4: If sample standard deviation exceeds 0.1%, it is shown in == == brackets == ========================================================================== C copy backwards : 244.6 MB/s C copy backwards (32 byte blocks) : 776.8 MB/s (1.1%) C copy backwards (64 byte blocks) : 980.5 MB/s C copy : 706.7 MB/s (0.6%) C copy prefetched (32 bytes step) : 911.1 MB/s C copy prefetched (64 bytes step) : 951.9 MB/s (1.2%) C 2-pass copy : 596.5 MB/s C 2-pass copy prefetched (32 bytes step) : 619.8 MB/s C 2-pass copy prefetched (64 bytes step) : 629.3 MB/s (0.6%) C fill : 1188.0 MB/s C fill (shuffle within 16 byte blocks) : 1191.7 MB/s (0.4%) C fill (shuffle within 32 byte blocks) : 400.2 MB/s (0.5%) C fill (shuffle within 64 byte blocks) : 420.4 MB/s --- standard memcpy : 1065.1 MB/s standard memset : 1191.8 MB/s (0.1%) --- NEON read : 1343.9 MB/s (0.5%) NEON read prefetched (32 bytes step) : 1370.5 MB/s NEON read prefetched (64 bytes step) : 1366.9 MB/s (0.4%) NEON read 2 data streams : 390.1 MB/s NEON read 2 data streams prefetched (32 bytes step) : 727.2 MB/s (0.2%) NEON read 2 data streams prefetched (64 bytes step) : 767.0 MB/s NEON copy : 996.7 MB/s NEON copy prefetched (32 bytes step) : 961.7 MB/s (0.8%) NEON copy prefetched (64 bytes step) : 1033.2 MB/s NEON unrolled copy : 954.4 MB/s (0.4%) NEON unrolled copy prefetched (32 bytes step) : 925.9 MB/s NEON unrolled copy prefetched (64 bytes step) : 985.7 MB/s NEON copy backwards : 840.9 MB/s NEON copy backwards prefetched (32 bytes step) : 845.5 MB/s (1.0%) NEON copy backwards prefetched (64 bytes step) : 873.8 MB/s NEON 2-pass copy : 625.4 MB/s NEON 2-pass copy prefetched (32 bytes step) : 642.5 MB/s (0.3%) NEON 2-pass copy prefetched (64 bytes step) : 648.8 MB/s (0.3%) NEON unrolled 2-pass copy : 588.9 MB/s NEON unrolled 2-pass copy prefetched (32 bytes step) : 578.9 MB/s (0.2%) NEON unrolled 2-pass copy prefetched (64 bytes step) : 611.2 MB/s (0.3%) NEON fill : 1191.9 MB/s NEON fill backwards : 1192.3 MB/s (0.1%) VFP copy : 964.0 MB/s VFP 2-pass copy : 587.0 MB/s (0.3%) ARM fill (STRD) : 1190.8 MB/s (0.1%) ARM fill (STM with 8 registers) : 1192.1 MB/s ARM fill (STM with 4 registers) : 1192.2 MB/s (0.1%) ARM copy prefetched (incr pld) : 960.1 MB/s (0.7%) ARM copy prefetched (wrap pld) : 841.5 MB/s ARM 2-pass copy prefetched (incr pld) : 633.0 MB/s ARM 2-pass copy prefetched (wrap pld) : 606.7 MB/s (0.4%) ========================================================================== == Framebuffer read tests. == == == == Many ARM devices use a part of the system memory as the framebuffer, == == typically mapped as uncached but with write-combining enabled. == == Writes to such framebuffers are quite fast, but reads are much == == slower and very sensitive to the alignment and the selection of == == CPU instructions which are used for accessing memory. 
== == == == Many x86 systems allocate the framebuffer in the GPU memory, == == accessible for the CPU via a relatively slow PCI-E bus. Moreover, == == PCI-E is asymmetric and handles reads a lot worse than writes. == == == == If uncached framebuffer reads are reasonably fast (at least 100 MB/s == == or preferably >300 MB/s), then using the shadow framebuffer layer == == is not necessary in Xorg DDX drivers, resulting in a nice overall == == performance improvement. For example, the xf86-video-fbturbo DDX == == uses this trick. == ========================================================================== NEON read (from framebuffer) : 61.7 MB/s (0.2%) NEON copy (from framebuffer) : 61.5 MB/s NEON 2-pass copy (from framebuffer) : 58.7 MB/s NEON unrolled copy (from framebuffer) : 59.3 MB/s NEON 2-pass unrolled copy (from framebuffer) : 58.2 MB/s (0.2%) VFP copy (from framebuffer) : 308.7 MB/s (0.7%) VFP 2-pass copy (from framebuffer) : 272.9 MB/s ARM copy (from framebuffer) : 208.3 MB/s ARM 2-pass copy (from framebuffer) : 180.5 MB/s (0.2%) ========================================================================== == Memory latency test == == == == Average time is measured for random memory accesses in the buffers == == of different sizes. The larger is the buffer, the more significant == == are relative contributions of TLB, L1/L2 cache misses and SDRAM == == accesses. For extremely large buffer sizes we are expecting to see == == page table walk with several requests to SDRAM for almost every == == memory access (though 64MiB is not nearly large enough to experience == == this effect to its fullest). == == == == Note 1: All the numbers are representing extra time, which needs to == == be added to L1 cache latency. The cycle timings for L1 cache == == latency can be usually found in the processor documentation. == == Note 2: Dual random read means that we are simultaneously performing == == two independent memory accesses at a time. In the case if == == the memory subsystem can't handle multiple outstanding == == requests, dual random read has the same timings as two == == single reads performed one after another. == ========================================================================== block size : single random read / dual random read 1024 : 0.0 ns / 0.0 ns 2048 : 0.0 ns / 0.0 ns 4096 : 0.0 ns / 0.0 ns 8192 : 0.0 ns / 0.0 ns 16384 : 0.0 ns / 0.0 ns 32768 : 0.0 ns / 0.0 ns 65536 : 6.4 ns / 11.6 ns 131072 : 9.9 ns / 16.7 ns 262144 : 11.7 ns / 19.0 ns 524288 : 14.7 ns / 23.1 ns 1048576 : 88.7 ns / 141.5 ns 2097152 : 134.3 ns / 189.7 ns 4194304 : 158.0 ns / 208.3 ns 8388608 : 171.5 ns / 217.9 ns 16777216 : 181.8 ns / 228.1 ns 33554432 : 191.8 ns / 241.6 ns 67108864 : 207.1 ns / 268.8 ns Raspberry Pi 3: nothing connected, doing nothing, just power led: echo 0 >/sys/devices/platform/soc/3f980000.usb/buspower --> 770 mW echo 1 >/sys/devices/platform/soc/3f980000.usb/buspower --> 1165 mW buspower = 1 and Ethernet cable connected --> 1360 mW Important: RPi 3 idles at just ~130mW above RPi 2 level. Whether further savings are possible by disabling WiFi/BT is something that would need further investigations. 
Performance: sysbench takes 120 seconds (constantly at 1200 MHz, 80°C), consumption reported is 3550 mW (+2385 mW compared to 'baseline') and tinymembench looks like: tinymembench v0.4.9 (simple benchmark for memory throughput and latency) ========================================================================== == Memory bandwidth tests == == == == Note 1: 1MB = 1000000 bytes == == Note 2: Results for 'copy' tests show how many bytes can be == == copied per second (adding together read and writen == == bytes would have provided twice higher numbers) == == Note 3: 2-pass copy means that we are using a small temporary buffer == == to first fetch data into it, and only then write it to the == == destination (source -> L1 cache, L1 cache -> destination) == == Note 4: If sample standard deviation exceeds 0.1%, it is shown in == == brackets == ========================================================================== C copy backwards : 1345.8 MB/s (0.5%) C copy backwards (32 byte blocks) : 1334.3 MB/s (0.7%) C copy backwards (64 byte blocks) : 1333.5 MB/s (0.5%) C copy : 1350.1 MB/s (0.4%) C copy prefetched (32 bytes step) : 1376.9 MB/s (0.3%) C copy prefetched (64 bytes step) : 1376.7 MB/s (0.5%) C 2-pass copy : 1055.3 MB/s C 2-pass copy prefetched (32 bytes step) : 1092.0 MB/s (0.2%) C 2-pass copy prefetched (64 bytes step) : 1097.1 MB/s (0.3%) C fill : 1732.6 MB/s C fill (shuffle within 16 byte blocks) : 1735.9 MB/s C fill (shuffle within 32 byte blocks) : 1733.1 MB/s C fill (shuffle within 64 byte blocks) : 1731.9 MB/s --- standard memcpy : 1372.2 MB/s (0.3%) standard memset : 1737.6 MB/s (0.1%) --- NEON read : 2254.5 MB/s NEON read prefetched (32 bytes step) : 2442.2 MB/s (0.6%) NEON read prefetched (64 bytes step) : 2420.1 MB/s NEON read 2 data streams : 2115.4 MB/s NEON read 2 data streams prefetched (32 bytes step) : 2433.9 MB/s (0.3%) NEON read 2 data streams prefetched (64 bytes step) : 2432.6 MB/s (0.3%) NEON copy : 1327.8 MB/s (0.9%) NEON copy prefetched (32 bytes step) : 1376.1 MB/s NEON copy prefetched (64 bytes step) : 1379.9 MB/s (0.5%) NEON unrolled copy : 1344.6 MB/s (0.3%) NEON unrolled copy prefetched (32 bytes step) : 1369.6 MB/s NEON unrolled copy prefetched (64 bytes step) : 1371.3 MB/s NEON copy backwards : 1341.1 MB/s (0.5%) NEON copy backwards prefetched (32 bytes step) : 1375.5 MB/s NEON copy backwards prefetched (64 bytes step) : 1376.3 MB/s (0.4%) NEON 2-pass copy : 1100.5 MB/s (0.3%) NEON 2-pass copy prefetched (32 bytes step) : 1138.0 MB/s NEON 2-pass copy prefetched (64 bytes step) : 1138.2 MB/s (0.2%) NEON unrolled 2-pass copy : 1075.5 MB/s NEON unrolled 2-pass copy prefetched (32 bytes step) : 1099.6 MB/s NEON unrolled 2-pass copy prefetched (64 bytes step) : 1100.1 MB/s NEON fill : 1788.8 MB/s NEON fill backwards : 1788.7 MB/s (0.2%) VFP copy : 1342.4 MB/s (0.4%) VFP 2-pass copy : 1070.1 MB/s (0.2%) ARM fill (STRD) : 1786.8 MB/s (0.2%) ARM fill (STM with 8 registers) : 1789.1 MB/s (0.3%) ARM fill (STM with 4 registers) : 1787.8 MB/s (0.2%) ARM copy prefetched (incr pld) : 1373.3 MB/s ARM copy prefetched (wrap pld) : 1378.1 MB/s (0.4%) ARM 2-pass copy prefetched (incr pld) : 1113.1 MB/s ARM 2-pass copy prefetched (wrap pld) : 1108.8 MB/s ========================================================================== == Framebuffer read tests. == == == == Many ARM devices use a part of the system memory as the framebuffer, == == typically mapped as uncached but with write-combining enabled. 
== == Writes to such framebuffers are quite fast, but reads are much == == slower and very sensitive to the alignment and the selection of == == CPU instructions which are used for accessing memory. == == == == Many x86 systems allocate the framebuffer in the GPU memory, == == accessible for the CPU via a relatively slow PCI-E bus. Moreover, == == PCI-E is asymmetric and handles reads a lot worse than writes. == == == == If uncached framebuffer reads are reasonably fast (at least 100 MB/s == == or preferably >300 MB/s), then using the shadow framebuffer layer == == is not necessary in Xorg DDX drivers, resulting in a nice overall == == performance improvement. For example, the xf86-video-fbturbo DDX == == uses this trick. == ========================================================================== NEON read (from framebuffer) : 73.4 MB/s (0.1%) NEON copy (from framebuffer) : 73.1 MB/s (0.2%) NEON 2-pass copy (from framebuffer) : 72.0 MB/s (0.2%) NEON unrolled copy (from framebuffer) : 72.7 MB/s NEON 2-pass unrolled copy (from framebuffer) : 71.2 MB/s (0.2%) VFP copy (from framebuffer) : 473.7 MB/s (0.4%) VFP 2-pass copy (from framebuffer) : 428.5 MB/s (1.1%) ARM copy (from framebuffer) : 260.1 MB/s (0.4%) ARM 2-pass copy (from framebuffer) : 242.5 MB/s (0.7%) ========================================================================== == Memory latency test == == == == Average time is measured for random memory accesses in the buffers == == of different sizes. The larger is the buffer, the more significant == == are relative contributions of TLB, L1/L2 cache misses and SDRAM == == accesses. For extremely large buffer sizes we are expecting to see == == page table walk with several requests to SDRAM for almost every == == memory access (though 64MiB is not nearly large enough to experience == == this effect to its fullest). == == == == Note 1: All the numbers are representing extra time, which needs to == == be added to L1 cache latency. The cycle timings for L1 cache == == latency can be usually found in the processor documentation. == == Note 2: Dual random read means that we are simultaneously performing == == two independent memory accesses at a time. In the case if == == the memory subsystem can't handle multiple outstanding == == requests, dual random read has the same timings as two == == single reads performed one after another. == ========================================================================== block size : single random read / dual random read 1024 : 0.0 ns / 0.0 ns 2048 : 0.0 ns / 0.0 ns 4096 : 0.0 ns / 0.0 ns 8192 : 0.0 ns / 0.0 ns 16384 : 0.0 ns / 0.0 ns 32768 : 0.0 ns / 0.0 ns 65536 : 5.4 ns / 9.2 ns 131072 : 8.2 ns / 13.1 ns 262144 : 9.7 ns / 14.8 ns 524288 : 11.0 ns / 16.6 ns 1048576 : 75.2 ns / 118.3 ns 2097152 : 110.9 ns / 154.9 ns 4194304 : 134.4 ns / 173.9 ns 8388608 : 146.8 ns / 182.3 ns 16777216 : 154.7 ns / 187.4 ns 33554432 : 159.7 ns / 191.4 ns 67108864 : 162.6 ns / 193.7 ns To sum it up: There's not much magic involved regarding consumption of the various RPi models: When it's about the 'do really nothing' use case then RPi A+ most probably wins due to half the amount of LPDDR2 DRAM compared to RPi Zero who is next. 
Both SBCs are dimensioned for light loads (only one USB port available that has to provide max 500mA by spec) and omit the LAN9514 IC (combined internal USB hub and Fast Ethernet adapter).

The first two models, RPi A and B, are not worth a look when it's about low consumption since they use inefficient LDO regulators to provide the different voltages, which wastes a lot of energy. Newer RPi models rely on better circuitry.

By writing to /sys/devices/platform/soc/*.usb/buspower consumption can be influenced on all models, but it depends on what's connected to the USB port (see the USB-Ethernet adapter example on the RPi Zero above). On RPi B+, 2 and 3 cutting power to the LAN9514 saves ~400mW. When the LAN9514 negotiates an Ethernet connection, consumption increases by ~200mW (which is just 600mW more in total and really not that bad!)

The energy savings from disabled HDMI and especially onboard LEDs are not that great, but you can control this behaviour from userspace and get these savings 'for free', so why not disable stuff you don't need? (see the small sketch below)

Consumption numbers for the 'everything disabled and doing nothing' (power cut to LAN9514!) use case do not differ that much: RPi Zero: 365 mW, RPi B+: 600 mW, RPi 2: 645 mW, RPi 3: 770 mW (still no idea whether disabling WiFi/BT on RPi 3 brings consumption down to B+/2 level).

When no network connectivity is needed at all, or only from time to time (eg. every hour for a minute or something like this), RPi Zero and A+ can shine. If you need LAN or WiFi permanently you should keep in mind that this adds approx. +1000mW to your consumption and then all the LAN9514-equipped 'larger' RPi models might be more energy efficient (!).

Even if the RPi 3 is not able to perform optimally (ARMv8 cores running an ARMv7 kernel and an ARMv6 userland) it might be an interesting replacement for a RPi B+ if you need the USB ports and Ethernet. You could limit maximum consumption by disabling CPU cores 2-4 and could still get less overall consumption when running light workloads, since even with 1 CPU core active the RPi 3 is almost twice as fast as the single-core RPis (compare with the 'race to idle' concept: the faster work can be done, the earlier CPU cores can enter low-power states). EDIT: Disabling CPU cores on RPi 3 does not help with consumption -- see post #5 below.
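As an aside, the 'free' savings mentioned above can be applied from userspace roughly like this (a minimal sketch; LED names and polarity differ between RPi models, some only expose led0):

# disable HDMI output (the same call used in /etc/rc.local for all tests above)
/usr/bin/tvservice -o
# turn off the green activity LED; repeat for led1 on models with a controllable power LED
echo none >/sys/class/leds/led0/trigger
echo 0 >/sys/class/leds/led0/brightness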
And now to answer the question many might ask since I was talking all the time about various RPi models:

Q: Do you now port Armbian to Raspberry Pi?!
A: Nope

To be honest, there's no need for that. Raspbian when running on Raspberries is really great (unlike the various crappy Raspbian images made eg. for Banana Pis), RPi users are familiar with it, tens of thousands of tutorials are available and so on. For me personally it was just important to verify some consumption numbers available on the net, to check whether my readouts using the PMIC of an Allwinner SBC are correct (seems so) and to get an idea which energy savings level we should target with our new Armbian settings.

Based on some experiments done with an Orange Pi Lite I'm pretty confident that we will soon have a couple of ultra cheap H3 boards (that are actually available, unlike the RPi Zero which costs way more due to added shipping costs and the inability to order more than one at a time!) that outperform RPis when it's about consumption -- at least when we're talking about networked setups and not only the 'does really nothing at all' use case.

Remaining questions:

- Why do they allow the RPi Zero to clock up to 1 GHz by default when they limit the B+ to 700 MHz (compare performance and consumption numbers of both tinymembench and sysbench above)?
- How does the RPi 3 behave consumption-wise when WiFi/BT are turned off?
- What does consumption look like on the various RPi when average load is not close to 0 but some stuff has to be done (I came across a lot of really broken python or whatever scripts that try to read out sensors and increase load and consumption a lot)? This is an area where RPi 3 (and maybe 2 also) might shine since their SoCs consume only slightly more than the horribly outdated single-core BCM2835 and are able to finish stuff a lot faster (again: 'race to idle' concept: entering low-power CPU states earlier helps with minimizing consumption if there is some constant load)

Further readings:

https://www.raspberrypi.org/documentation/configuration/config-txt.md
http://www.jeffgeerling.com/blogs/jeff-geerling/raspberry-pi-zero-conserve-energy
http://raspberrypi.stackexchange.com/questions/8498/disable-lan9512
http://raspberrypi.stackexchange.com/questions/43285/raspberry-pi-3-vs-pi-2-power-consumption-and-heat-dissipation
http://raspberrypi.stackexchange.com/questions/5033/how-much-energy-does-the-raspberry-pi-consume-in-a-day
https://learn.adafruit.com/introducing-the-raspberry-pi-model-b-plus-plus-differences-vs-model-b/power-supply

5
wildcat_paris Posted August 4, 2016

So:
- educational perspective: RPi is great
- power consumption for IoT: RPi is not the best choice. The BCM283x architecture is not going to age very well.

I hope for the RPi community (which I am part of) that the ties with Broadcom will come to an end sooner or later.
arox Posted August 4, 2016

I am afraid (or not anymore) that the RPi Zero will be obsolete before being really available. 1
Igor Posted August 5, 2016

One quick test on FriendlyARM NanoPi M3 (8 cores), stock Debian, kernel 3.4.39, current draw during test: 1.35A (0.5A in idle)

sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$(grep -c '^processor' /proc/cpuinfo)
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 8

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000

Test execution summary:
    total time:                          57.0311s
    total number of events:              10000
    total time taken by event execution: 456.0994
    per-request statistics:
         min:                                 45.41ms
         avg:                                 45.61ms
         max:                                140.37ms
         approx.  95 percentile:              45.70ms

Threads fairness:
    events (avg/stddev):           1250.0000/1.22
    execution time (avg/stddev):   57.0124/0.01

root@NanoPi3:~# 1
tkaiser Posted August 5, 2016 Author Posted August 5, 2016 I am afraid (or not anymore) that RPI zero will be obsolete before being really available. Well, if you compare the BOM of an RPi Zero and B+ it's obvious that the latter might add not even $5 to the production costs (this is mass production and all these components are dirt-cheap). So while they might not sell the Zero at a loss they simply don't make enough money with it so why should they try to cannibalize sales of B+ by making the Zero generally available? Anyway back to consumption testing -- still with RPi. Today I wanted to explore the possibility to dynamically limit maximum power consumption of the multi-core RPi (2 and 3). To my surprise this is not possible, you can only limit the maximum CPU core count by editing /boot/cmdline.txt and add 'maxcpus=N' and every change needs a reboot: https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=99372 Even more a surprise: the consumption numbers: RPi 3: nothing connected, doing nothing, just power led, maxcpus=1 (single core): echo 0 >/sys/devices/platform/soc/3f980000.usb/buspower --> 1120 mW echo 1 >/sys/devices/platform/soc/3f980000.usb/buspower --> 1520 mW Performance: sysbench took 486 seconds running on a single core (at 2640 mW, that's +1120mW compared to the baseline) and tinymembench looks like this in single-core mode: tinymembench v0.4.9 (simple benchmark for memory throughput and latency) ========================================================================== == Memory bandwidth tests == == == == Note 1: 1MB = 1000000 bytes == == Note 2: Results for 'copy' tests show how many bytes can be == == copied per second (adding together read and writen == == bytes would have provided twice higher numbers) == == Note 3: 2-pass copy means that we are using a small temporary buffer == == to first fetch data into it, and only then write it to the == == destination (source -> L1 cache, L1 cache -> destination) == == Note 4: If sample standard deviation exceeds 0.1%, it is shown in == == brackets == ========================================================================== C copy backwards : 407.7 MB/s (20.6%) C copy backwards (32 byte blocks) : 406.1 MB/s (2.8%) C copy backwards (64 byte blocks) : 408.7 MB/s (0.8%) C copy : 408.8 MB/s (0.8%) C copy prefetched (32 bytes step) : 414.0 MB/s (0.8%) C copy prefetched (64 bytes step) : 413.2 MB/s (1.0%) C 2-pass copy : 390.9 MB/s (17.3%) C 2-pass copy prefetched (32 bytes step) : 410.9 MB/s (0.5%) C 2-pass copy prefetched (64 bytes step) : 412.0 MB/s C fill : 719.9 MB/s C fill (shuffle within 16 byte blocks) : 719.9 MB/s C fill (shuffle within 32 byte blocks) : 719.0 MB/s C fill (shuffle within 64 byte blocks) : 716.8 MB/s --- standard memcpy : 418.8 MB/s standard memset : 719.1 MB/s --- NEON read : 588.9 MB/s NEON read prefetched (32 bytes step) : 831.9 MB/s NEON read prefetched (64 bytes step) : 832.0 MB/s NEON read 2 data streams : 579.4 MB/s (0.3%) NEON read 2 data streams prefetched (32 bytes step) : 821.4 MB/s NEON read 2 data streams prefetched (64 bytes step) : 821.5 MB/s NEON copy : 409.3 MB/s NEON copy prefetched (32 bytes step) : 416.3 MB/s NEON copy prefetched (64 bytes step) : 414.6 MB/s NEON unrolled copy : 409.6 MB/s NEON unrolled copy prefetched (32 bytes step) : 422.0 MB/s NEON unrolled copy prefetched (64 bytes step) : 421.9 MB/s NEON copy backwards : 407.5 MB/s NEON copy backwards prefetched (32 bytes step) : 413.8 MB/s NEON copy backwards prefetched (64 bytes step) : 412.3 MB/s NEON 2-pass copy : 392.4 MB/s 
NEON 2-pass copy prefetched (32 bytes step) : 415.3 MB/s NEON 2-pass copy prefetched (64 bytes step) : 415.3 MB/s NEON unrolled 2-pass copy : 390.2 MB/s NEON unrolled 2-pass copy prefetched (32 bytes step) : 435.1 MB/s NEON unrolled 2-pass copy prefetched (64 bytes step) : 436.3 MB/s NEON fill : 719.4 MB/s NEON fill backwards : 719.3 MB/s VFP copy : 413.1 MB/s VFP 2-pass copy : 391.2 MB/s ARM fill (STRD) : 715.6 MB/s ARM fill (STM with 8 registers) : 719.2 MB/s ARM fill (STM with 4 registers) : 718.1 MB/s ARM copy prefetched (incr pld) : 413.4 MB/s ARM copy prefetched (wrap pld) : 412.4 MB/s ARM 2-pass copy prefetched (incr pld) : 411.8 MB/s (0.1%) ARM 2-pass copy prefetched (wrap pld) : 405.7 MB/s ========================================================================== == Framebuffer read tests. == == == == Many ARM devices use a part of the system memory as the framebuffer, == == typically mapped as uncached but with write-combining enabled. == == Writes to such framebuffers are quite fast, but reads are much == == slower and very sensitive to the alignment and the selection of == == CPU instructions which are used for accessing memory. == == == == Many x86 systems allocate the framebuffer in the GPU memory, == == accessible for the CPU via a relatively slow PCI-E bus. Moreover, == == PCI-E is asymmetric and handles reads a lot worse than writes. == == == == If uncached framebuffer reads are reasonably fast (at least 100 MB/s == == or preferably >300 MB/s), then using the shadow framebuffer layer == == is not necessary in Xorg DDX drivers, resulting in a nice overall == == performance improvement. For example, the xf86-video-fbturbo DDX == == uses this trick. == ========================================================================== NEON read (from framebuffer) : 20.4 MB/s NEON copy (from framebuffer) : 20.3 MB/s NEON 2-pass copy (from framebuffer) : 20.4 MB/s NEON unrolled copy (from framebuffer) : 20.4 MB/s NEON 2-pass unrolled copy (from framebuffer) : 20.3 MB/s VFP copy (from framebuffer) : 158.0 MB/s VFP 2-pass copy (from framebuffer) : 154.4 MB/s (0.2%) ARM copy (from framebuffer) : 78.8 MB/s ARM 2-pass copy (from framebuffer) : 79.0 MB/s ========================================================================== == Memory latency test == == == == Average time is measured for random memory accesses in the buffers == == of different sizes. The larger is the buffer, the more significant == == are relative contributions of TLB, L1/L2 cache misses and SDRAM == == accesses. For extremely large buffer sizes we are expecting to see == == page table walk with several requests to SDRAM for almost every == == memory access (though 64MiB is not nearly large enough to experience == == this effect to its fullest). == == == == Note 1: All the numbers are representing extra time, which needs to == == be added to L1 cache latency. The cycle timings for L1 cache == == latency can be usually found in the processor documentation. == == Note 2: Dual random read means that we are simultaneously performing == == two independent memory accesses at a time. In the case if == == the memory subsystem can't handle multiple outstanding == == requests, dual random read has the same timings as two == == single reads performed one after another. 
== ========================================================================== block size : single random read / dual random read 1024 : 0.0 ns / 0.0 ns 2048 : 0.0 ns / 0.0 ns 4096 : 0.0 ns / 0.0 ns 8192 : 0.0 ns / 0.0 ns 16384 : 0.0 ns / 0.0 ns 32768 : 0.2 ns / 0.2 ns 65536 : 5.6 ns / 9.4 ns 131072 : 8.5 ns / 13.3 ns 262144 : 10.0 ns / 15.0 ns 524288 : 14.2 ns / 21.5 ns 1048576 : 203.3 ns / 317.2 ns 2097152 : 309.3 ns / 415.9 ns 4194304 : 364.5 ns / 450.4 ns 8388608 : 390.3 ns / 465.3 ns 16777216 : 403.0 ns / 474.3 ns 33554432 : 409.4 ns / 480.3 ns 67108864 : 412.6 ns / 483.4 ns Weird to say the least, I disabled 3 CPU cores -- see below -- but obviously that led to some sort of background activity since idle consumption compared to 'quad core' mode increased by a whopping ~360mW while difference between idle and sysbench is 1120mW? With 4 cores idle consumption was at 1165 mW and a 4-core sysbench run added 2385 mW. Based on this a single-core sysbench should add ~600mW so there clearly is something wrong. But it really looks like this: pi@raspberrypi:~ $ cat /proc/cpuinfo processor : 0 model name : ARMv7 Processor rev 4 (v7l) BogoMIPS : 38.40 Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32 CPU implementer : 0x41 CPU architecture: 7 CPU variant : 0x0 CPU part : 0xd03 CPU revision : 4 Hardware : BCM2709 Revision : a02082 Serial : 00000000ee6200a5 pi@raspberrypi:~ $ cat /boot/cmdline.txt dwc_otg.lpm_enable=0 console=serial0,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait maxcpus=1 To verify my measurements I simply removed 'maxcpus=1' from cmdline.txt and immediately rebooted: consumption back to normal values (1155mW with disabled LAN9514, 765mW with buspower=0): So obviously the preferrable way to limit maximum consumption at least with RPi 3 is to rely on cpufreq scaling, limit max cpufreq for example to 600 MHz and leave count of active CPU cores as is. 1
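A quick sketch of what capping cpufreq could look like from userspace, assuming the usual cpufreq sysfs interface is exposed by the kernel (on Raspbian one could alternatively set arm_freq in /boot/config.txt); values are in kHz and the writes need root:

# cap all cores at 600 MHz at runtime
for cpu in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_max_freq ; do
    echo 600000 >$cpu
done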
wildcat_paris Posted August 5, 2016

@tkaiser

So obviously the preferable way to limit maximum consumption at least with RPi 3 is to rely on cpufreq scaling, limit max cpufreq for example to 600 MHz and leave count of active CPU cores as is.

So with my goldfish vocabulary: 4 cores @ 600 MHz is roughly the power of 1 core @ 2400 MHz (so somehow """""better""""" than 1 core @ 1200 MHz). It is going to be nice to have NEON/AArch64 instructions for computation when kernel 4.8 is released for the RPi 3.
tkaiser (Author) Posted August 5, 2016

So with my goldfish vocabulary, 4 cores @ 600MHz so about the power of 1 core @2400MHz (so someway """""better""""" than 1 core @1200MHz)

Nope. These sorts of comparisons are only valid if you have a workload that scales linearly with the count of CPU cores, or if you are running moronically silly benchmarks as it's done all the time eg. on Phoronix. Real-world workloads look different: most of the stuff is single-threaded and will therefore run slower on a 4 x 600 MHz system than on one with 1 x 1200 MHz.

The reason why I'm thinking about limiting maximum consumption is that we're talking about low-power modes/settings and 'IoT' use cases. Imagine you use one 5V/2A PSU to power 3 boards, and imagine a worst case scenario where something goes really wrong and countless processes are running with 100% CPU utilization on all SBCs. In such a situation 5V/2A might not be enough and at least one board may freeze/crash -- most probably all of them, since underpowering is a pretty reliable method to freeze/crash any SBC.

That's the reason why I check idle and 'full load' consumption individually and also check the role of individual consumers (the LAN9514 USB/Ethernet IC on Raspberries can be considered one) and what happens when an Ethernet cable is inserted or not. Then you can easily calculate how much more consumption such a worst case scenario would mean and take counter-measures (such as limiting CPU cores or maximum cpufreq).

Just as an example of how easily such worst case scenarios can be triggered: a few months ago we did a server migration at a customer and as some sort of burn-in test I let a few thousand images be converted on the new virtual machine while we were still tuning settings. A little mistake when setting up the cronjob and the conversion started not every hour but every minute instead. So the load was up to +300 pretty fast, but since this was Solaris x86 it was still possible to log in and recover from the problem in a shell (while true ; do pkill $converter; sleep 1; done -- do the real work in another shell).

It should also be noted that most if not all recent SBCs support dynamic voltage frequency scaling (lowering the voltage the CPU cores are fed with at lower clockspeeds and increasing it with clockspeed -- not linearly but somewhat exponentially), so comparing full load at 600 MHz and 1200 MHz might double performance but consumption will be 2.x times more (maybe even 3.x times -- depends on the dvfs settings used). This also has to be taken into account when settings are defined, and for proper low-power operation the workload has to be analyzed or simply some long-term consumption monitoring has to happen since too many factors are involved (dvfs, the specific workload, the 'race to idle' concept and CPU cores sitting in low-power states and so on). 2
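To illustrate the worst case reasoning with the numbers measured above: three RPi 2 under full sysbench load would draw roughly 3 x 2.14 W ≈ 6.4 W, which a 5V/2A (10 W) PSU can still deliver, while three RPi 3 would already reach about 3 x 3.55 W ≈ 10.7 W and exceed it even before cable losses are taken into account -- exactly the kind of situation where capping cpufreq beforehand pays off.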
wildcat_paris Posted August 5, 2016

Real-world workloads look different, most of the stuff is single threaded and therefore will run slower on a 4 x 600 MHz system than on one with 1 x 1200.

@tk I know that. That is why I added multiple quotes as in """""better""""" because it depends on the number of threads used concurrently. At the same time, 4 cores can handle interrupts """"better"""" (faster) than only one core with a lot of tasks sharing the same core (especially when trying to """"mimic a little bit"""" FPGA-style parallel computation with realtime RT kernel patches to reduce kernel latency and handle I/O better).

edit: I know no SoC / microcontroller can replace an FPGA for massive parallel computing / massive parallel I/O handling (but well, you know me, I am a goldfish)
tkaiser (Author) Posted August 6, 2016

Next round of tests, this time using the same device but different OS images. I want to compare a stock Armbian desktop build (without any tweaks), the same Armbian installation with optimized settings, and also the most advanced Linux image available for Orange Pis before Armbian started to support H3 devices: that's IMO loboris' Ubuntu Mate image for Orange Pi PC (OrangePI-PC_Ubuntu_Vivid_Mate.img.xz).

Boris Lovošević (loboris) did a tremendous job developing the first usable OS images for the various H3 based Orange Pis but unfortunately stopped working on this at the end of last year (to focus on real IoT stuff instead of H3 boards -- see http://www.emw3165.com/viewtopic.php?f=15&t=4330 for his more recent activities). While we never used his kernel sources directly, a few of his patches still enhance our legacy H3/sun8i kernel. Unfortunately his OS images were made with overclocking in mind (reaching 1536 MHz clockspeed) and so they contained relatively bad dvfs settings (overvolting the SoC badly, even above datasheet limits). That's the main reason H3 has been blamed as an overheating beast, but fortunately now we know better (members of the linux-sunxi community and the Armbian team did a lot of research on how to develop more suitable dvfs/thermal settings). While the differences between our settings and the overclocking/overvolting attempts of his OS images were always visible when looking at temperatures/throttling behaviour, now I also have the equipment to measure precisely how consumption differs.

The following measurements were made with an Orange Pi PC from last year wearing my standard H3 heatsink, with active networking since H3 contains a native Ethernet implementation and most people will use these devices with a connected network.

Orange Pi PC, only Ethernet connected, loboris' settings/image:

Consumption when being absolutely idle: 1545 mW
Performance: sysbench took 158 seconds after throttling kicked in (max cpufreq most of the time at 1200 MHz and sometimes below, while SoC temperature was reported above 80°C all the time). Averaged consumption: 3115 mW (+1570 mW compared to baseline)

Orange Pi PC, only Ethernet connected, Armbian Jessie desktop:

Consumption when being absolutely idle: 1260 mW
Performance: sysbench took 149 seconds after throttling stabilized (that's faster than above) while consumption was at 3005 mW (+1745 mW compared to baseline, but lower in absolute terms than with loboris' settings).

How is it possible to get better performance at a lower consumption and temperature level? Since we use better dvfs settings and feed the CPU cores with a lower VDD_CPUX core voltage, which helps a lot with throttling. It's just the settings that differ; everything else is nearly the same, though we use another legacy kernel variant that shows a different behaviour regarding HDMI -- as can be seen below, loboris' kernel deactivates the whole HDMI engine 10 minutes after booting when no display is connected, so with a connected display the performance of his image would've been even worse since both temperatures and consumption would've been higher.

So with our defaults an Orange Pi PC already idles at RPi B+ and 2 level when the Raspberries are also operated with an Ethernet cable inserted (then their consumption increases by ~200 mW and is at 1200 mW, and slightly more when looking at RPi 2). But the Orange Pi PC runs a full GUI desktop unlike Raspbian Lite, and we speak about an SBC that features native Ethernet and 4 real USB ports (vs.
just one single USB OTG port combined with hub and USB Ethernet adapter on the Raspberries). So what about optimizing Armbian settings for headless use?

Orange Pi PC, only Ethernet connected, Armbian with consumption-control enabled:

Consumption when being absolutely idle: 800 mW
Performance: sysbench takes 142 seconds, H3 constantly running at 1296 MHz, SoC temperature reached 74°C but no throttling happening, and full load consumption was at 3100 mW (useless to compare with baseline consumption -- see below please)

So what did happen? Based on some tests over the last days we already know that we can save ~200 mW with our sun8i legacy kernel when we disable both HDMI and Mali400 engines (requires a reboot so only feasible for headless devices), we also know that if we lower DRAM clockspeed from our default 624 MHz to 264 MHz (still faster than RPi) we save an additional ~240 mW, and by disabling CPU cores we get ~10 mW less per core in idle situations.

So that's the simple reason we can lower idle consumption from 1260 mW to below 800 mW with the H3 based Orange Pi PC while we still have a device that's faster than RPi B+/2, has active/idle Ethernet and 4 real USB ports available, while consuming only 2/3 of what RPi B+ or 2 need in the same situation (with no native Ethernet available there and only one real USB port).

Some consumption graphs:

1) Loboris Ubuntu Mate image. The consumption drop after 10 minutes due to no connected HDMI display can be seen twice: on the left and after the reboot at 11:45:
2) Armbian defaults:
3) Armbian with optimized settings:

A few words regarding these optimized settings. I disabled the start of the nodm daemon (therefore no X windows running), adjusted a few values in the fex file (disabling HDMI/Mali) and used a patched kernel to decrease DRAM clock below Allwinner's default (they use 408 MHz but with this value energy savings aren't that much). And then I started the test from within /etc/rc.local by calling a script that lowered DRAM clock, disabled CPU cores, waited 40 minutes, brought back CPU cores and DRAM clockspeed and started the sysbench test:

root@orangepipc:/home/tk# cat /usr/local/bin/check-consumption.sh
#!/bin/bash
echo 0 >/sys/devices/system/cpu/cpu3/online
echo 0 >/sys/devices/system/cpu/cpu2/online
echo 0 >/sys/devices/system/cpu/cpu1/online
echo 264000 >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq
sleep 2400
echo 1 >/sys/devices/system/cpu/cpu3/online
echo 1 >/sys/devices/system/cpu/cpu2/online
echo 1 >/sys/devices/system/cpu/cpu1/online
echo 624000 >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq
while true ; do
	sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$(grep -c '^processor' /proc/cpuinfo) >>/var/log/sysbench.log
done

What is done there manually (adjusting the count of active CPU cores and DRAM clockspeed) can be done automatically with future Armbian versions. We're thinking about allowing to specify some sort of 'power control' (by a daemon or something like that) so we can then operate at least H3 boards with the legacy kernel at really low consumption levels. I already tested with an Orange Pi Lite and was able to get below 500 mW idle consumption (with a not so useful 'nothing works except serial console' use case). But that's the area where such a 'power control' mechanism would make some sense: cheap H3 boards (OPi One/Lite, NanoPi NEO/Air) used as low power IoT nodes.
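For completeness: hooking the script above into the boot process is just a matter of calling it from /etc/rc.local before the final 'exit 0' (a hypothetical invocation; the path is the one shown above, and the trailing '&' keeps rc.local from blocking on the endless sysbench loop):

/usr/local/bin/check-consumption.sh &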
Personally I'm already amazed to be able to get idle consumption of an Orange Pi PC (with 4 real USB ports ready and Fast Ethernet active and in use!) below 1W, and that we start to have enough knowledge to control consumption behaviour in a way that lets us ensure it does not exceed 1.5W, for example. The only remaining problem then is inefficient PSUs (drawing 3W from the wall while the H3 board in question idles at 0.8W)

Further readings:

http://forum.armbian.com/index.php/topic/1614-running-h3-boards-with-minimal-consumption/
http://forum.armbian.com/index.php/topic/1665-rfc-using-a20-board-with-armbian-as-powermeter/
http://forum.armbian.com/index.php/topic/1728-rfc-default-settings-for-nanopi-neoair/

2
tkaiser (Author) Posted August 7, 2016

A few more words regarding the H3 SoC and settings: I just tried out the highest clockspeed possible with the legacy H3 kernel on the larger Orange Pi boards: 1536 MHz. To operate reliably at this clockspeed the VDD_CPUX core voltage has to be increased too (not possible on the H3 boards that have a more primitive or non-adjustable voltage regulator and therefore limit VDD_CPUX to 1.3V max). With loboris' initial settings only switching between 1.3V and 1.5V happened (just 2 dvfs operating points defined for whatever reason), so I relied on our normal settings (using 7 values and maxing out at 1296 MHz @ 1.32V) and simply defined a few new operating points with increased voltages, maxing out at loboris' 1536 MHz @ 1.5V:

LV1_freq = 1536000000
LV1_volt = 1500
LV2_freq = 1440000000
LV2_volt = 1420
LV3_freq = 1344000000
LV3_volt = 1360
LV4_freq = 1296000000
LV4_volt = 1320
LV5_freq = 1008000000
LV5_volt = 1140
LV6_freq = 816000000
LV6_volt = 1020
LV7_freq = 480000000
LV7_volt = 980

Unlike our normal dvfs operating points covering 480-1296 MHz, which are tested and can be considered sane, the ones for 1344 MHz and above are just assumptions -- maybe the voltages could be lowered, but this has to be confirmed by time-consuming reliability testing. To let any heavy workload run on H3 with these settings and all CPU cores active a fan is necessary, so I added one to keep temperatures below 70°C (powered from the GPIO header of another SBC nearby since for this test I didn't want the fan's own consumption to add to my numbers). Results as follows:

Baseline consumption (idling, DRAM clockspeed at 624 MHz): 1015 mW
sysbench with 1 core @ 1536 MHz: 1940 mW (+920 mW above baseline)
sysbench with 2 cores @ 1536 MHz: 2845 mW (+1830 mW above baseline)
sysbench with 4 cores @ 1536 MHz: 4300 mW (+3280 mW above baseline)

When running on all 4 cores at 1536 MHz sysbench execution time is as low as with the RPi 3 (120 seconds), but this mode makes no sense at all because it requires an annoying fan (otherwise throttling would reduce clockspeeds to ~1164 MHz -- compare with the results above) and also consumption is way too high due to the increased core voltage necessary to get reliable operation with these overclocker settings (and if it's really about performance per watt the fan's consumption would have to be added too!).

So what is this test for? When running on 1 or 2 cores at 1536 MHz @ 1.5V no fan is needed. I got 47°C with a single core and 67°C with 2 cores when running sysbench, so even with really heavy workloads (eg. cpuburn-a7) enabling 1536 MHz while limiting the count of active CPU cores could make some sense in situations where single-threaded tasks need to run at the highest speed possible from time to time (again: compare with the 'race to idle' concept, since finishing tasks in less time allows CPU cores to enter low-power modes earlier, which might reduce overall consumption in the end).

One purpose of this test was to probe the limits of my power monitoring setup (I still use a Banana Pro to feed connected devices with power since I let Banana's AXP209 PMU monitor consumption -- a normal USB2 port should provide 500mA by spec, but as this test shows the Banana Pro is able to provide even 860mA via one USB port), another one was to get consumption numbers for loboris' settings: ~18% more performance correlates with ~65% more consumption when comparing Armbian's regular upper 1296 MHz cpufreq limit with the 1536 MHz @ 1.5V dvfs operating point.
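For anyone who wants to experiment with such operating points on the legacy sun8i kernel: they live in the dvfs table of the board's fex file. A hedged sketch of the usual round-trip with sunxi-tools (file location and section name may differ between images, so treat this as an assumption, not a recipe):

bin2fex /boot/script.bin >/tmp/script.fex
# edit the [dvfs_table] section (LV1_freq/LV1_volt and so on) in /tmp/script.fex
fex2bin /tmp/script.fex /boot/script.bin
# reboot so the legacy kernel picks up the new operating points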
Simple conclusion: If you're looking for the best performance per watt then the highest clockspeeds are not economical, since the increased VDD_CPUX core voltage drives consumption up disproportionately. But enabling these overclocking/overvolting settings might make some sense in situations where single-threaded workloads have to finish in less time, and as long as the count of active CPU cores is limited to 1 or 2, short operations with these overvolted settings don't even require an annoying fan.
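To illustrate this '1 or 2 cores at the highest clockspeed' mode, here's a minimal sketch using the standard Linux hotplug and cpufreq sysfs interfaces (nothing H3-specific; the 1536000 value obviously assumes the overvolted dvfs table from the previous post is in place, and the legacy kernel's own auto-hotplug mechanism may bring cores back online later, so this only shows the knobs involved):

#!/bin/bash
# take cores 2 and 3 offline so only cpu0/cpu1 stay active
echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu3/online
# allow the remaining cores to clock up to 1536 MHz (value in kHz)
echo 1536000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq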
tkaiser Posted August 8, 2016 Author Posted August 8, 2016 A few more words on why I used sysbench to do consumption measurements. While sysbench's cpu test is not able to produce meaningful numbers when comparing different architectures (see below), it can be used to roughly estimate 'worst case' situations where for whatever reason the board ends up under full load (scripts running amok and so on -- simply a 'full load' condition). Another unique feature of this sysbench test is that it is pretty much unrelated to memory performance and scales linearly with both CPU clockspeed and count of CPU cores (normal real-world workloads differ a lot in all 3 areas!), so it might help estimate integer performance roughly by doing simple calculations (double the count of CPU cores, increase or decrease clockspeeds) to get an idea how integer performance correlates with consumption (as we've seen above, higher clockspeeds need higher voltages, so consumption does not scale linearly with performance but goes up way more quickly). The aforementioned 'feature' (not dependent on memory performance since the cpu test simply calculates prime numbers) is also a major caveat, since this specific workload is not typical for most normal tasks where memory throughput matters more (especially on some Raspberries where playing around with clock settings can drastically improve performance since L2 cache and overall memory performance matter a lot for most workloads). Also, sysbench results can not be compared across different CPU architectures since special CPU-specific instructions might speed up prime number calculations by a factor of 15 while overall system performance is identical. Sysbench used on RPi 2 (ARMv7) and RPi 3 (ARMv8) is already the best example. When running Raspbian the sysbench binary has been built with ARMv6 compiler settings. You get execution times of 192 secs (RPi 2 @ 900 MHz) or 120 secs (RPi 3 @ 1200 MHz). As soon as you switch to an ARMv7 userland (using Ubuntu Mate for example -- I chose ubuntu-mate-16.04-desktop-armhf-raspberry-pi.img) test execution will be faster: 168 secs on RPi 2 and 105 secs on RPi 3. This is just the result of different compiler switches using optimized CPU instructions. And if RPi 3 used a sysbench binary compiled with all options set for ARMv8 then the test would finish within 10 seconds -- see this example here for how moronic it is to combine ARMv8 CPU cores with an ARMv6 userland. So what can we read from the numbers collected above, for example all RPi results with sysbench? The following lists the test setup, sysbench execution time, the relative consumption increase compared to idle consumption (baseline) and the absolute consumption when running the specific test setup:
RPi Zero @ 1000 MHz: 915 sec, 435 mW, 800 mW
RPi B+ @ 700 MHz: 1311 sec, 175 mW, 1160 mW
RPi 2 @ 900 MHz: 192 sec, 1135 mW, 2140 mW
RPi 3 @ 1200 MHz: 120 sec, 2385 mW, 3550 mW
Please remember: this is just the result of using Raspbian defaults (that might be adjustable if you know what you're doing). The main purpose of these tests was to be able to calculate the consumption increase under worst case conditions (eg. scripts running amok). That's what the relative consumption increase is about. So if I plan to use a RPi B+ with USB peripherals connected that need an additional 2W I already know that I don't have to take consumption caused by CPU load into account (it's just 175mW more!).
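Side note to make the 'race to idle' aspect explicit: multiplying the consumption increase by the execution time gives a rough 'energy above idle per sysbench run' figure. A quick back-of-the-envelope calculation using the numbers from the list above (nothing more than shell/awk arithmetic):

# energy above idle per sysbench run in joules (mW x s / 1000)
awk 'BEGIN {
    printf "RPi Zero: %.0f J\n",  915 *  435 / 1000
    printf "RPi B+:   %.0f J\n", 1311 *  175 / 1000
    printf "RPi 2:    %.0f J\n",  192 * 1135 / 1000
    printf "RPi 3:    %.0f J\n",  120 * 2385 / 1000
}'

The faster boards draw more power but for so much less time that the energy per run ends up in the same ballpark or lower.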
But in case we allowed the B+ to clock as high as the Zero (with increased voltages -- that's what causes the much higher consumption compared to the small increase in performance) then it gets interesting, since CPU load is responsible for ~0.5W (435 mW) more at 1.0GHz and the line would read instead:
RPi B+ @ 1000 MHz: 915 sec, 435 mW, 1420 mW
When using a RPi 3 worst case conditions could result in ~2.5W more consumption (2385 mW), so in case it is known that peripherals are connected that need an additional 1W, PSU requirements for an Ethernet-connected RPi 3 would look like this (idle / full load): 2.5W - 4.5W, since idle consumption with Ethernet connected is already at 1360 mW, peripherals add another 1000 mW and CPU load might add another 2385 mW. But... in case this is not wanted, a simple measure to reduce maximum consumption is to adjust cpufreq scaling settings to 600 MHz max and then we're talking about 2.5W - 3W. This is pretty irrelevant if you power your device from a good wall wart, but in case you use PoE (and have to think about how to dimension step-down converters) or have to ensure minimal consumption settings then this matters. Since I started to collect some numbers for 'performance per watt' comparisons (mostly to check energy efficiency on H3 boards with a primitive voltage regulator switching between just 1.1V and 1.3V), here they are as reference:
OPi PC / loboris: 158 sec, 1570 mW, 3115 mW
OPi PC / Armbian: 149 sec, 1745 mW, 3005 mW
OPi PC / optimized: 142 sec, - mW, 3100 mW
Please keep in mind that when you compare these numbers with those for Raspberries above you have to add at least 200mW to RPi overall consumption since this is what connected Ethernet on the larger RPi models adds (with RPi Zero you might need to add a whole watt unless you find a better USB-Ethernet adapter than my two). And also as a reference, single core numbers for RPi 3 (higher idle consumption than when running on all 4 CPU cores, for whatever reason!) and overclocked numbers for OPi PC:
RPi 3 @ 1200 MHz / ARMv6, 1 core: 486 sec, 1120 mW, 2640 mW
OPi PC @ 1536 MHz, 1 core: 482 sec, 920 mW, 1940 mW
OPi PC @ 1536 MHz, 2 cores: 241 sec, 1830 mW, 2845 mW
OPi PC @ 1536 MHz, 4 cores: 120 sec, 3280 mW, 4300 mW
This whole 'performance per watt' topic gets really interesting when looking at Orange Pi One/Lite and NanoPi NEO/Air since they can only switch between 1.1V (912 MHz max) and 1.3V (1200 MHz) and I would suppose 'performance per watt' is way better when remaining at the lower voltage. Time will tell -- still no NanoPi dev samples have arrived.
Further readings:
http://linuxonflash.blogspot.de/2015/02/a-look-at-raspberry-pi-2-performance.html
https://retroresolution.com/2016/03/24/overclocking-the-raspberry-pi-3-pragmatism-and-optimising-for-single-vs-multicore-performance/
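A quick sketch of what 'adjust cpufreq scaling settings to 600 MHz max' could look like on a Raspberry, assuming the stock cpufreq sysfs interface is available (on Raspbian the same can alternatively be done via arm_freq in /boot/config.txt, which I haven't verified here):

# cap all cores at 600 MHz (value in kHz, takes effect immediately)
for cpu in /sys/devices/system/cpu/cpu[0-9]*/cpufreq; do
    echo 600000 > ${cpu}/scaling_max_freq
done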
tkaiser Posted August 8, 2016 Author Posted August 8, 2016 Another surprising update regarding settings, this time again with a Raspberry: RPi 3. I wanted to test whether using the DietPi OS image also helps with consumption. DietPi relies on Raspbian Lite (as far as I understood) but ships with optimized settings. So the RPi 3 is tested with 3 different OS images using the same SD card, the same connected peripherals (none) and the same settings (HDMI disabled, otherwise defaults):
Raspbian Debian Jessie Lite (2016-05-27-raspbian-jessie-lite.img):
Idle: 1165 mW. Performance: sysbench takes 120 seconds (constantly at 1200 MHz, 80°C), consumption reported is 3550 mW (+2385 mW compared to 'baseline')
Ubuntu Mate 16.04 (ARMv7 userland, latest 4.4 kernel):
Idle: 1150 mW. Performance: sysbench takes 105 seconds (no throttling occurred) and consumption increases to 3600 mW (+2450 mW compared to baseline). This can be considered identical to the above except for the small performance boost due to being able to use ARMv7 code.
DietPi_v127_RPi-armv6-(Jessie).img: I used default settings and only set HDMI to disabled from within dietpi-config, which does not only switch off HDMI but should also help with memory throughput (see here for the description. And again: sysbench won't be affected by that).
Idle: 1230 mW (that's surprisingly 75 mW more than above)
Performance: On the 1st run sysbench takes 121 seconds and consumption most probably increased up to 3615 mW (assuming +2385 mW compared to baseline from the Raspbian test above), but then some sort of throttling happened (strange, since checking /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq from time to time still always reported 1.2GHz) and sysbench execution time increased to 169 secs while consumption only reached 3050 mW (+1820 mW compared to baseline). This is how execution times looked over time:
execution time (avg/stddev): 121.2137/0.02
execution time (avg/stddev): 143.4227/0.01
execution time (avg/stddev): 154.7651/0.02
execution time (avg/stddev): 159.3969/0.03
execution time (avg/stddev): 162.2143/0.02
execution time (avg/stddev): 164.0288/0.01
execution time (avg/stddev): 165.7774/0.03
execution time (avg/stddev): 166.6406/0.02
execution time (avg/stddev): 167.3383/0.02
execution time (avg/stddev): 168.1799/0.03
execution time (avg/stddev): 168.6029/0.02
execution time (avg/stddev): ~169.x
I then checked the settings in dietpi-config and increased the throttling threshold from 75°C to 85°C. Now the 1st sysbench run took 120 secs (as it should without throttling) and then slowed down to 137 seconds over time:
execution time (avg/stddev): 119.7105/0.02
execution time (avg/stddev): 126.3084/0.01
execution time (avg/stddev): 133.1608/0.02
execution time (avg/stddev): 135.2913/0.02
execution time (avg/stddev): 134.7278/0.02
execution time (avg/stddev): 135.8305/0.01
execution time (avg/stddev): 136.2153/0.02
execution time (avg/stddev): 136.4005/0.01
execution time (avg/stddev): 136.6043/0.01
execution time (avg/stddev): 136.8146/0.01
execution time (avg/stddev): 136.7731/0.02
execution time (avg/stddev): 137.1757/0.03
execution time (avg/stddev): ~137.x
Average consumption was then at 3595 mW (+2365 mW) and the output of DietPi's cpu tool looked like this:
root@DietPi:~# cpu
─────────────────────────────────────────────────────
 DietPi CPU Info
 Use dietpi-config to change CPU / performance options
─────────────────────────────────────────────────────
Architecture | armv7l
Temp | Warning: 82'c | Reducing the life of your device.
Governor | ondemand
Throttle up | 50% CPU usage
        Current Freq   Min Freq   Max Freq
CPU0 |  1200 Mhz       600 Mhz    1200 Mhz
CPU1 |  1200 Mhz       600 Mhz    1200 Mhz
CPU2 |  1200 Mhz       600 Mhz    1200 Mhz
CPU3 |  1200 Mhz       600 Mhz    1200 Mhz
root@DietPi:~# uname -a
Linux DietPi 4.4.16-v7+ #899 SMP Thu Jul 28 12:40:33 BST 2016 armv7l GNU/Linux
Both the increased idle consumption (maybe some background daemons are permanently active in DietPi?) and the 'performance per watt' ratio are strange:
Raspbian, no throttling: 120 secs, +2385 mW
DietPi with 75°C settings: 169 secs, +1820 mW
DietPi with 85°C settings: 137 secs, +2365 mW
Why does Raspbian execute the benchmark with no throttling at a constant 1200 MHz @ 80°C in 120 seconds while DietPi with the 85°C threshold increases execution time by 14% (120 sec vs. 137) while reporting 82°C and only saving 20 mW in this mode? One possible explanation would be that I did this test with Ethernet connected (which adds 200 mW to overall consumption but shouldn't affect SoC temperature since the SoC has no Ethernet -- it's just the Ethernet PHY inside the LAN9514 IC that gets activated). It would be interesting to investigate further (temperature behaviour of the SoC with Ethernet plugged in or not) but it's not worth my time. And then it seems like a miracle to me that a 14 percent loss in performance (120 sec vs. 137) only saves 20 mW. Since I also wondered why CPU clockspeed was always reported as 1200 MHz I had a look at /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies. To my surprise only 600 and 1200 MHz are available, so we seem to be dealing with a pretty simple dvfs/cpufreq table defining only two states. As a comparison: with Armbian on the larger H3 boards we use 7 operating points that start at 480 MHz @ 980 mV and end at 1296 MHz @ 1320 mV, and we added way more cpufreq steps since this helps a lot with throttling (confirmed with both A64 and H3 devices). So based on our research (using as many cpufreq and dvfs operating points as possible to make throttling more efficient) the RPi 3 settings look like 'throttling from hell'. The numbers with the 85°C threshold would indicate that the RPi 3 remained 28% of the time at 600 MHz and 72% at 1200 MHz (since execution time differs by 14%), but a consumption difference of just 20 mW means that voltage regulation isn't quick enough to cope with this fast switching between upper and lower clockspeed (and, if I understood correctly, neither is dynamic clocking of DRAM). With the 75°C settings it's a different picture: execution time increases by 41% (120 sec vs. 169) which would mean the CPU cores remained at the lower clockspeed 82% of the time, resulting in huge savings: 565 mW less. I would believe adding a few more cpufreq operating points would already help (600, 800, 900, 1000, 1100, 1200) and in case it's possible on the RPi 3 to also use different supply voltages with these clockspeeds then throttling would get even more efficient. If the RPi 3 were an Orange Pi PC and we had to deal with this situation (only switching between 600 and 1200 MHz, losing 14% performance with a minimal benefit of less than 1% consumption savings) then simply adding one more dvfs operating point with slightly decreased voltage at 1150 MHz would already suffice. But I don't know whether that's possible on the RPi 3 and how the exact mechanism to control voltages works there. EDIT: Now everything is clear -- since the ARM cores are no first class citizens on any RPi but just VideoCore's guests, the kernel doesn't know what's really going on (LOL!).
The VideoCore VPU has to be queried since in reality everything is controlled from there: https://github.com/raspberrypi/linux/issues/1320#issuecomment-191754676 (without using the vcgencmd command to talk to the master processor you get no idea what's really happening). And another symptom that is strange (to me): with all RPi 3 OS images the board still wasted between 380 and 450 mW after a 'shutdown -h now' (most likely a bug, but who knows). Only physically cutting power really brings 'off state' consumption down to zero. But maybe on Raspberries one should use 'poweroff' instead?
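For reference, a few of the vcgencmd queries that report what the firmware side is actually doing (available on Raspbian; the exact output format may vary so treat this as a sketch):

vcgencmd measure_clock arm     # real ARM clockspeed in Hz
vcgencmd measure_clock core    # VideoCore/core clockspeed
vcgencmd measure_volts core    # current core voltage
vcgencmd measure_temp          # SoC temperature
vcgencmd get_throttled         # bitmask showing under-voltage / frequency capping / throttling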
tkaiser Posted August 13, 2016 Author Posted August 13, 2016 Another update regarding consumption / performance since I've been busy the last days playing around with NanoPi NEO. The NEO is currently the smallest H3 board around, also featuring a simpler design than other H3 devices. For information regarding this board check the linux-sunxi wiki and/or the appropriate thread in the H3 forums. We had to realize that reducing DRAM clockspeed helps with both lowering consumption and temperatures when querying the SoC's internal temperature sensor (which must sit somewhere near the memory controller). We also used a patch to decrease DRAM clockspeed below the legacy kernel's fixed minimum of 408 MHz. Since I was curious where the biggest consumption gains happen I let a script walk through all available DRAM clockspeeds (adjustable in 24 MHz steps above 372 MHz and 12 MHz steps below) and the surprising result is that the biggest gains are between 408 MHz and 456 MHz. This is NanoPi NEO starting at 870 mW and ending up at 1340 mW (adjusting only the DRAM clockspeed on this H3 board is responsible for a 470 mW difference in idle consumption!) and the biggest step is between 408 MHz and 432 MHz (I wonder why FriendlyARM chose 432 and not 408 on their OS image). And this is Orange Pi Lite with identical settings starting at 510 mW and ending up at ~815 mW at 672 MHz (here the difference is just 305 mW; please note that I did not test above 624 MHz since this is the maximum DRAM speed we allow on all H3 boards anyway). Again the biggest step is between 408 MHz and 432 MHz: 130 mW (630mW --> 760mW). So what's different? Both boards share many details: the primitive voltage regulator only switching between 1.1V and 1.3V, the same amount of DRAM (I tested the 512MiB NEO version), but obviously some onboard components differ (LDO regulator on the NEO vs. buck converter on the Lite) and DRAM access is different: unlike all other H3 devices FriendlyARM chose a single bank design for the NEO. Surprisingly the more primitive and lightweight looking NEO shows worse consumption numbers compared to OPi Lite (unfortunately I have no One here any more to compare). So what about performance? I also ran tinymembench on both boards from 132 - 672 MHz DRAM clockspeed (fixed CPU settings, performance governor, 4 CPU cores active and running at 1200 MHz).
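For the record, a rough sketch of how such a DRAM clockspeed sweep can be scripted on the legacy kernel. It assumes the sunxi-ddrfreq devfreq node (the cur_freq path also shows up in the boot logging snippet further down) provides a userspace governor with a set_freq knob -- attribute names and units differ between kernels, so check cur_freq first before trusting this:

#!/bin/bash
DDRFREQ=/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq
echo userspace > ${DDRFREQ}/governor
for mhz in $(seq 132 24 672); do                            # steps simplified to 24 MHz throughout
    echo $((mhz * 1000)) > ${DDRFREQ}/userspace/set_freq    # assuming the driver expects kHz
    sleep 180                                               # settle, note consumption/temperature meanwhile
    echo "$(date) ${mhz} MHz -> $(cat ${DDRFREQ}/cur_freq)" >> /var/log/dram-sweep.log
done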
Not so surprisingly the dual bank configuration is faster, especially at higher DRAM clockspeeds (but there is no performance bump around 432 MHz as one might imagine based on the consumption behaviour -- quite the opposite, since 384 MHz - 504 MHz perform nearly identically with the single bank configuration):
DRAM clock   NanoPi NEO    OPi Lite      OPi Plus 2E
132 MHz:     135.8 MB/s    153.8 MB/s    154.4 MB/s
144 MHz:     147.8 MB/s    158.8 MB/s    171.4 MB/s
156 MHz:     156.0 MB/s    171.3 MB/s    176.9 MB/s
168 MHz:     179.2 MB/s    183.7 MB/s    188.6 MB/s
180 MHz:     188.5 MB/s    206.0 MB/s    201.5 MB/s
192 MHz:     194.5 MB/s    285.4 MB/s    232.4 MB/s
204 MHz:     202.7 MB/s    295.5 MB/s    300.4 MB/s
216 MHz:     199.7 MB/s    282.1 MB/s    314.9 MB/s
228 MHz:     196.9 MB/s    290.9 MB/s    292.9 MB/s
240 MHz:     198.9 MB/s    303.7 MB/s    328.3 MB/s
252 MHz:     199.3 MB/s    313.2 MB/s    366.7 MB/s
264 MHz:     216.5 MB/s    361.3 MB/s    350.3 MB/s
276 MHz:     217.9 MB/s    344.8 MB/s    388.1 MB/s
288 MHz:     231.7 MB/s    339.8 MB/s    410.3 MB/s
300 MHz:     235.7 MB/s    350.9 MB/s    398.6 MB/s
312 MHz:     250.0 MB/s    339.2 MB/s    377.9 MB/s
324 MHz:     262.4 MB/s    360.2 MB/s    389.1 MB/s
336 MHz:     271.5 MB/s    375.6 MB/s    409.6 MB/s
348 MHz:     271.3 MB/s    395.6 MB/s    421.7 MB/s
360 MHz:     299.3 MB/s    414.8 MB/s    402.9 MB/s
372 MHz:     339.4 MB/s    452.2 MB/s    398.4 MB/s
384 MHz:     428.5 MB/s    507.0 MB/s    432.7 MB/s
408 MHz:     433.5 MB/s    594.8 MB/s    580.5 MB/s
432 MHz:     436.8 MB/s    632.4 MB/s    606.9 MB/s
456 MHz:     421.1 MB/s    665.7 MB/s    667.5 MB/s
480 MHz:     434.4 MB/s    678.4 MB/s    685.4 MB/s
504 MHz:     431.8 MB/s    714.5 MB/s    719.7 MB/s
528 MHz:     448.9 MB/s    766.9 MB/s    756.2 MB/s
552 MHz:     454.3 MB/s    802.5 MB/s    798.2 MB/s
576 MHz:     458.6 MB/s    835.3 MB/s    838.4 MB/s
600 MHz:     465.8 MB/s    857.0 MB/s    882.1 MB/s
624 MHz:     484.8 MB/s    892.9 MB/s    905.5 MB/s
648 MHz:     506.1 MB/s    928.3 MB/s    938.0 MB/s
672 MHz:     539.3 MB/s    963.0 MB/s    965.7 MB/s
An archive with all tinymembench results for both boards and now also Plus 2E can be found here. The numbers above are the 'standard memcpy' results but tinymembench tests a lot more.
Just as an example the 624 MHz results from OPi Lite:
Board: orangepilite, DRAM clockspeed: 624000
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be         ==
==         copied per second (adding together read and writen          ==
==         bytes would have provided twice higher numbers)             ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the  ==
==         destination (source -> L1 cache, L1 cache -> destination)   ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in   ==
==         brackets                                                    ==
==========================================================================
C copy backwards : 296.2 MB/s (2.0%)
C copy backwards (32 byte blocks) : 998.6 MB/s
C copy backwards (64 byte blocks) : 1030.7 MB/s
C copy : 963.7 MB/s
C copy prefetched (32 bytes step) : 910.3 MB/s
C copy prefetched (64 bytes step) : 1032.8 MB/s
C 2-pass copy : 800.9 MB/s
C 2-pass copy prefetched (32 bytes step) : 783.0 MB/s
C 2-pass copy prefetched (64 bytes step) : 844.9 MB/s
C fill : 3984.5 MB/s (0.1%)
C fill (shuffle within 16 byte blocks) : 3969.5 MB/s
C fill (shuffle within 32 byte blocks) : 462.0 MB/s (4.7%)
C fill (shuffle within 64 byte blocks) : 489.1 MB/s (7.5%)
---
standard memcpy : 892.9 MB/s
standard memset : 3034.7 MB/s
---
NEON read : 1317.9 MB/s
NEON read prefetched (32 bytes step) : 1496.9 MB/s
NEON read prefetched (64 bytes step) : 1513.5 MB/s
NEON read 2 data streams : 374.4 MB/s
NEON read 2 data streams prefetched (32 bytes step) : 720.3 MB/s
NEON read 2 data streams prefetched (64 bytes step) : 755.9 MB/s
NEON copy : 1038.5 MB/s
NEON copy prefetched (32 bytes step) : 1132.5 MB/s
NEON copy prefetched (64 bytes step) : 1193.4 MB/s
NEON unrolled copy : 1011.7 MB/s
NEON unrolled copy prefetched (32 bytes step) : 1059.0 MB/s
NEON unrolled copy prefetched (64 bytes step) : 1130.1 MB/s
NEON copy backwards : 1004.4 MB/s
NEON copy backwards prefetched (32 bytes step) : 1067.6 MB/s
NEON copy backwards prefetched (64 bytes step) : 1154.6 MB/s
NEON 2-pass copy : 897.3 MB/s
NEON 2-pass copy prefetched (32 bytes step) : 966.2 MB/s
NEON 2-pass copy prefetched (64 bytes step) : 997.6 MB/s
NEON unrolled 2-pass copy : 778.5 MB/s
NEON unrolled 2-pass copy prefetched (32 bytes step) : 745.1 MB/s
NEON unrolled 2-pass copy prefetched (64 bytes step) : 806.3 MB/s
NEON fill : 3986.3 MB/s (0.1%)
NEON fill backwards : 3969.0 MB/s
VFP copy : 1022.1 MB/s
VFP 2-pass copy : 789.0 MB/s
ARM fill (STRD) : 3037.0 MB/s
ARM fill (STM with 8 registers) : 3970.1 MB/s
ARM fill (STM with 4 registers) : 3598.5 MB/s
ARM copy prefetched (incr pld) : 1145.7 MB/s
ARM copy prefetched (wrap pld) : 1053.3 MB/s
ARM 2-pass copy prefetched (incr pld) : 873.9 MB/s
ARM 2-pass copy prefetched (wrap pld) : 833.1 MB/s
==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers  ==
== of different sizes. The larger is the buffer, the more significant  ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM     ==
== accesses. For extremely large buffer sizes we are expecting to see  ==
== page table walk with several requests to SDRAM for almost every     ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if   ==
==         the memory subsystem can't handle multiple outstanding      ==
==         requests, dual random read has the same timings as two      ==
==         single reads performed one after another.                   ==
==========================================================================
block size : single random read / dual random read
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 5.2 ns / 9.0 ns
131072 : 8.1 ns / 12.7 ns
262144 : 9.5 ns / 14.1 ns
524288 : 11.5 ns / 16.3 ns
1048576 : 85.6 ns / 132.0 ns
2097152 : 128.7 ns / 173.6 ns
4194304 : 150.8 ns / 188.5 ns
8388608 : 163.7 ns / 196.5 ns
16777216 : 173.4 ns / 204.0 ns
33554432 : 183.1 ns / 217.6 ns
67108864 : 195.7 ns / 240.7 ns
Edit: Added tinymembench results for 2GB equipped OPi Plus 2E also (clocking with 1296 MHz). No differences compared to OPi Lite
tkaiser Posted August 14, 2016 Author Posted August 14, 2016 Another round of tests. This time it's about lowering peak consumption. With our default settings we allow pretty low idle consumption, but at boot time we always have rather high consumption peaks compared to the idle behaviour later. In case someone wants to use a really weak PSU or powers a couple of boards from one step-down converter (via PoE -- Power over Ethernet -- for example) then it's important to be able to control consumption peaks too. With most if not all board/kernel combinations we have three places to control this behaviour:
u-boot: brings up the CPU cores, defines initial CPU and DRAM clockspeed
kernel defaults: as soon as the kernel takes over these settings are active (might rely on u-boot's settings, might use its own settings; minimum/maximum depends on device tree or fex stuff on Allwinner legacy kernels)
userspace: in Armbian we ship with cpufrequtils which controls minimum/maximum cpufreq settings and the governor used -- have a look at /etc/default/cpufrequtils
So how do we get, for example, a NanoPi NEO to boot with as little peak consumption as possible with the legacy kernel? With our most recent NEO settings we bring up all 4 cores and define the CPU clockspeed in u-boot as low as 480 MHz. As soon as the kernel takes over we use the interactive governor and allow cpufreq scaling from 240 MHz up to 1200 MHz, and since booting is pretty CPU intensive the kernel will stay at 1008 MHz or above most of the time while booting, being responsible for consumption peaks that exceed idle consumption by 4-5 times. As soon as cpufrequtils takes over, behaviour can be controlled again (eg. setting MAX_SPEED from 1296000 down to just 240000). So the problem is the time between kernel start and invocation of the cpufrequtils daemon, since our default 'interactive' cpufreq governor lets H3 run on all 4 cores at 1200 MHz on the NEO even if we defined the maximum cpufreq in normal operation mode to be 912 MHz (everything defined in /etc/default/cpufrequtils only becomes active once cpufrequtils has been started by systemd/upstart). Since we can choose between a few different cpufreq governors with H3's legacy kernel I thought: let's try out the differences (leaving out the performance governor since this one does the opposite of what we're looking for). I modified the cpufrequtils startscript to do some monitoring (time of invocation and the cpufreq steps the kernel used before), added a script to log start times in a file to create average values later, let the board reboot automatically and exchanged the kernel after every 100 reboots to cycle through 4 different default cpufreq governors: interactive, ondemand, powersave and userspace. To get an idea how changing the default cpufreq governor in the kernel config might influence other H3 boards I chose the strongest one to compare: OPi Plus 2E. NanoPi NEO is configured to use 480 MHz cpufreq set by u-boot and to allow cpufreq scaling between 240 MHz and 1200 MHz. OPi Plus 2E uses 1008 MHz as cpufreq in u-boot and jumps between 480 MHz and 1296 MHz with our default settings. So how do the 4 different cpufreq governors behave on both boards?
interactive: does the best job from a performance perspective since this governor switches pretty fast from lower clockspeeds to higher ones (also the highest consumption peaks seen)
ondemand: in our tests cpufreq only switched between the lowest allowed and the highest clockspeed while remaining at the lowest most of the time (240/1200 on NEO and 480/1296 on Plus 2E).
Please be aware that ondemand is considered broken.
powersave: with this setting cpufreq remains at the lowest allowed clockspeed (240 MHz on NEO and 480 MHz on Plus 2E)
userspace: no adjustments at all, simply re-using the clockspeed set by u-boot (480 MHz on NEO and 1008 MHz on Plus 2E)
Let's have a look at how boot times changed. I simply monitored the time in seconds between start of the kernel and invocation of cpufrequtils (since this is the time span where changing the default cpufreq governor in the kernel config matters):
                NEO      Plus 2E
Interactive:    10.06    9.93
Ondemand:       12.43    10.90
Powersave:      14.16    11.51
Userspace:      11.36    10.15
Shorter times correlate with higher peak consumption. So it's obvious that changing the default cpufreq governor for H3 boards from interactive to powersave would help a lot in reducing boot consumption. On the NEO this delays booting by ~4.1 seconds and on Plus 2E by just ~1.65 seconds -- the reason is simple: NEO boots at 240 MHz instead of remaining above 1008 MHz most of the time, and OPi Plus 2E boots at 480 MHz instead of 1200+ MHz. But userspace is also interesting. This governor doesn't alter the cpufreq set by u-boot, so the NEO boots at 480 MHz and the OPi Plus 2E at 1008 MHz (also true for all other H3 devices except the overheating ones -- Beelink X2, Banana Pi M2+ and NanoPi M1 use 816 MHz instead) while delaying boot times by just 1.3 seconds (NEO) or 0.23 seconds (Plus 2E). The 'less consumption' champion is clearly powersave, but since we want to maintain only one single kernel config for all H3 boards it might be the better idea to choose userspace instead as default cpufreq governor in the sun8i legacy kernel config, since with this setting the NEO still reduces boot consumption a lot while other H3 devices aren't affected that much. All consumption numbers are just 'looking at the powermeter while the board boots'. My measurement setup using average values totally fails when it comes to peak consumption. I already thought about using a RPi 3, its camera module, the motion daemon and an OCR solution to monitor my powermeter. But based on the information we already have (consumption numbers based on cpufreq/dvfs settings) it seems switching from interactive to userspace is a good idea to save peak current while booting. Though if anyone is after the lowest consumption possible then choosing powersave is the better choice.
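As a reminder, the runtime side of this is the cpufrequtils config mentioned above. On Armbian /etc/default/cpufrequtils looks roughly like the following (the values shown here are just an example for a NEO limited to 912 MHz; check your own file rather than copying this blindly):

ENABLE=true
MIN_SPEED=240000
MAX_SPEED=912000
GOVERNOR=interactive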
In case someone wants to test on his own here's the procedure and test logs: I added these three lines to the start of /etc/init.d/cpufrequtils:
echo "cpufrequtils taking over" > /dev/kmsg
grep -v " 0$" /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state >>/var/log/boot-state.log
echo "$(date) $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor) $(cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq) $(cat /sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/cur_freq)" >>/var/log/boot-state.log
And then I added '/usr/local/bin/check-boot-time.sh &' to /etc/rc.local and this script looks like this:
root@nanopineo:/home/tk# cat /usr/local/bin/check-boot-time.sh
#!/bin/bash
# check boot times
sleep 15
dmesg | grep cpufreq >>/var/log/boot-state.log
BootTime=$(tail -n2 /var/log/boot-state.log | awk -F" " '/cpufrequtils taking over/ {print $2}' | sed 's/\]//')
echo ${BootTime} >>/var/log/boot-times.log
CountOfEntries=$(wc -l </var/log/boot-times.log)
if [ ${CountOfEntries} -eq 100 ]; then
    sleep 60
    dpkg -i /home/tk/boot-tests/linux-image-sun8i_5.17_armhf_ondemand.deb
    echo "--- ondemand ---" >>/var/log/boot-times.log
elif [ ${CountOfEntries} -eq 200 ]; then
    sleep 60
    dpkg -i /home/tk/boot-tests/linux-image-sun8i_5.17_armhf_powersave.deb
    echo "--- powersave ---" >>/var/log/boot-times.log
elif [ ${CountOfEntries} -eq 300 ]; then
    sleep 60
    dpkg -i /home/tk/boot-tests/linux-image-sun8i_5.17_armhf_userspace.deb
    echo "--- userspace ---" >>/var/log/boot-times.log
elif [ ${CountOfEntries} -gt 399 ]; then
    exit 0
fi
reboot
With this scripted setup the boards test unattended through 4 different settings, rebooting 400 times and providing logs that can be interpreted later:
Orange Pi Plus 2E: http://sprunge.us/JCCU
NanoPi NEO: http://sprunge.us/JIgV
tkaiser Posted August 15, 2016 Author Posted August 15, 2016 Since I realized that getting peak consumption values while booting is not a task that needs automation but can be done manually and in a short time (the count of samples can be rather low) I decided to give it a try. I let NanoPi NEO boot 10 times each with the interactive, powersave and userspace governor (ondemand isn't interesting here) and simply watched my powermeter's display for the peak numbers shown. At the end of the test I used my usual consumption monitoring setup to get an idea how the numbers the powermeter provided (including the PSU's own consumption!) match the usual numbers when using a Banana Pro as 'monitoring PSU'. Results as follows:
             boot time   peak consumption shown
interactive: 10.6824     9 x 2.8W, 1 x 2.9W
powersave:   14.7607     3 x 2.2W, 6 x 2.3W, 1 x 2.4W
userspace:   11.9503     10 x 2.4W
So based on this quick test the powersave governor doesn't help avoid high consumption values since its peak consumption values are pretty close to the results with userspace. On the other hand switching from interactive to powersave would increase boot time by ~4.1 seconds while userspace only delays boot times by ~1.3 seconds on the NEO. On all other H3 devices switching from interactive to userspace shouldn't matter at all since boot times are only slightly delayed -- see above (0.22 seconds more on OPi Plus 2E). How would boot behaviour of the H3 devices currently supported by Armbian change when switching the default cpufreq governor in the kernel config from interactive to userspace? Let's have a look at cpufreq scaling behaviour before the cpufrequtils daemon is started (the short 2-3 second consumption peaks happen prior to cpufrequtils start!). The numbers given for interactive are meant as 'spent most of the time at'; when using userspace the board simply remains at the clockspeed set in the u-boot config until the cpufrequtils daemon is started:
                            interactive   userspace
NanoPi NEO                  1008-1200     480
NanoPi M1, Banana Pi M2+    1008-1200     816
Beelink X2, OPi One/Lite    1008-1200     1008
All other OPi               1008-1296     1008
That means that on NanoPi M1 and BPi M2+ booting might be delayed by ~0.5 seconds, with X2 or OPi Lite/One we're talking about 0.3 seconds, and with the larger Oranges it's even less when switching to userspace. Consumption savings on all these boards are negligible, but with NanoPi NEO we get a reduction of peak consumption while booting of approx. 500mW (not worth a look when powering a single NEO with a good PSU, but if a fleet of NEOs is to be powered through PoE then these 500 mW multiplied by the count of NEOs can make a huge difference when dimensioning the PSU's amperage). The baseline of my tests was a NEO/512 powered through FriendlyARM's PSU-ONECOM with only Ethernet connected. Xunlong's 5V/3A PSU powered the PSU-ONECOM and sits in a 'Brennenstuhl PM 231E' powermeter reporting ~1.6W idle consumption with userspace governor, all CPU cores active and default DRAM clockspeed --> meaning the board was idling at 480 MHz CPU clockspeed and 408 MHz DRAM clockspeed. When I ran 'sysbench --test=cpu --cpu-max-prime=2000000 run --num-threads=4' the powermeter showed 2.4W consumption at 912 MHz and 3.0W at 1008 MHz (the huge increase is due to VDD_CPUX switching from 1.1V to 1.3V). So how do these values translate to consumption measurements 'behind the PSU'?
I used my usual Banana Pro monitoring PSU setup and got
1190 mW reported when idling (1.6W according to the powermeter)
1980 mW when running sysbench at 912 MHz (2.4W according to the powermeter)
2720 mW when running sysbench at 1008 MHz (3.0W according to the powermeter)
                      with PSU (powermeter)   w/o PSU (Banana Pro)
idle:                 1.6W                    1190 mW
sysbench @ 912MHz:    2.4W (+800mW)           1980 mW (+790mW)
sysbench @ 1008MHz:   3.0W (+1400mW)          2720 mW (+1530mW)
TL;DR: Switching from interactive to userspace as default cpufreq governor in the sun8i kernel config reduces NanoPi NEO's peak consumption while booting by ~500mW and does not delay boot times much (~1.3 seconds longer on the NEO). With this change the situation for all other H3 devices does not change much regarding either peak consumption or boot times. Switching to userspace seems reasonable to me since the NEO's low power mode benefits a lot while other boards aren't negatively affected.
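Side note: from the table above one can also estimate the PSU chain's efficiency at these load points (DC consumption divided by wall consumption, ignoring that the PSU-ONECOM board itself draws a little as well) -- a trivial shell calculation:

awk 'BEGIN {
    printf "idle:                %.0f%%\n", 100 * 1190 / 1600
    printf "sysbench @ 912MHz:   %.0f%%\n", 100 * 1980 / 2400
    printf "sysbench @ 1008MHz:  %.0f%%\n", 100 * 2720 / 3000
}'

As usual with cheap PSUs, efficiency is worst at very low load, which is exactly where an IoT node spends most of its time.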
tkaiser Posted August 17, 2016 Author Posted August 17, 2016 Another interesting update on the relationship between consumption and performance: I used sysbench all the time to do some basic comparisons. The great thing about sysbench is that it does not depend on memory bandwidth at all -- which on the other hand is bad when you compare real-world, performance-critical stuff, since every task that does not run exclusively inside the CPU caches (and that's the vast majority!) depends somehow on memory bandwidth. Let's take a look at cpuminer, a bitcoin mining application that uses NEON instructions on ARM (pretty heavy compared to non-NEON stuff but not that much compared with cpuburn-a7) and also features a benchmark mode reporting khash/s (kilo hashes per second), which is great to explore how memory bandwidth might influence performance of memory-dependent workloads. Getting cpuminer up and running on Armbian (and most probably every other more recent Debian based armhf distro) is simple and takes only a minute: get https://sourceforge.net/projects/cpuminer/files/pooler-cpuminer-2.4.5.tar.gz/download then untar it, change into the cpuminer-2.4.5 dir and do
sudo apt-get install libcurl4-gnutls-dev
./configure CFLAGS="-O3 -mfpu=neon"
make
./minerd --benchmark
I decided to test with one Orange Pi limited to 1200 MHz clockspeed max, one able to reach 1296 MHz, and NanoPi NEO since this is the first H3 device which really differs with regard to DRAM:
Single bank configuration -- on all other H3 SBC so far always 2 DRAM chips are used. This affects memory bandwidth negatively and is maybe also responsible for overheating and more consumption (just an assumption, but Olimex reported the same heat issues when starting with their H3 board prototypes and they also use a single bank DRAM config)
DRAM clocked at just 432 MHz by the vendor, and since we found out that when lowering this clockspeed to 408 MHz performance isn't affected that much but consumption decreases by a whopping 130mW with these 24 MHz less, we decided to remain at 408 MHz in Armbian for the NEO
Since we also decided to limit maximum CPU clockspeed to 912 MHz I tested the NEO with both 912 and 1200 MHz CPU clockspeed (912 MHz is the highest clockspeed where the SoC stays at the lower 1.1V core voltage; every higher clockspeed needs a switch to 1.3V which increases consumption massively!)
For the test a small fan was also needed in addition to heatsinks to prevent throttling. On the NEO I used FriendlyARM's own large and effective heatsink, on the Oranges my usual 50 cent el cheapo heatsinks. Also important: I used the same Armbian image for all tests and our NEO settings also for OPi Lite (HDMI/Mali disabled being the most important tweak -- see below). When testing with Orange Pi PC I allowed 1296 MHz maximum clockspeed and also disabled HDMI/Mali for one test. So we have 5 columns:
NEO/912: HDMI/Mali disabled, 912 MHz cpufreq, single bank DRAM
NEO: HDMI/Mali disabled, 1200 MHz cpufreq, single bank DRAM
Lite: HDMI/Mali disabled, 1200 MHz cpufreq, dual bank DRAM
PC: HDMI/Mali disabled, 1296 MHz cpufreq, dual bank DRAM
PC with original settings: 1296 MHz cpufreq, dual bank DRAM
All results in hash/s by cpuminer-2.4.5 with NEON enabled.
DRAM clock   NEO/912   NEO    Lite   PC     PC with original settings
132 MHz:     922       997    1230   1259   1142
144 MHz:     979       1060   1296   1331   1210
156 MHz:     1024      1126   1358   1400   1280
168 MHz:     1070      1189   1410   1460   1349
180 MHz:     1109      1242   1460   1519   1409
192 MHz:     1149      1292   1510   1570   1489
204 MHz:     1180      1340   1550   1620   1558
216 MHz:     1210      1385   1591   1660   1610
228 MHz:     1239      1430   1628   1700   1660
240 MHz:     1260      1469   1670   1740   1702
252 MHz:     1290      1500   1700   1772   1742
264 MHz:     1320      1534   1730   1810   1780
276 MHz:     1343      1563   1760   1842   1810
288 MHz:     1368      1591   1780   1870   1840
300 MHz:     1380      1620   1800   1900   1870
312 MHz:     1400      1650   1820   1920   1890
324 MHz:     1410      1680   1840   1948   1920
336 MHz:     1421      1710   1867   1963   1940
348 MHz:     1440      1730   1890   1990   1963
360 MHz:     1450      1760   1912   2010   1989
372 MHz:     1460      1780   1930   2039   2012
384 MHz:     1470      1800   1940   2060   2039
408 MHz:     1500      1830   1960   2090   2070
432 MHz:     1530      1858   1982   2110   2090
456 MHz:     1540      1880   2000   2130   2110
480 MHz:     1551      1910   2011   2150   2130
504 MHz:     1560      1920   2020   2169   2149
528 MHz:     1570      1950   2029   2180   2160
552 MHz:     1580      1980   2052   2180   2170
576 MHz:     1594      2000   2079   2190   2180
600 MHz:     1600      2012   2100   2225   2200
624 MHz:     1611      2030   2109   2249   2230
648 MHz:     1620      2040   2110   2260   2239
672 MHz:     1624      2049   2120   2270   2247
Let's look at the results:
When comparing the two last columns (OPi PC with default settings and with HDMI/Mali disabled) it's obvious that disabling both improves cpuminer/memory performance. Keeping in mind that on all cheap ARM SoCs CPU and GPU share access to DRAM, it's obvious that disabling the GPU cores frees CPU resources and the available memory bandwidth increases (the lower the DRAM clockspeed the bigger the difference: at 132 MHz it's 117 hash/s, at 672 MHz only 23 hash/s)
When looking at the first two columns the same can be observed. The only difference between both runs is H3 running at either 912 MHz or 1200 MHz on the NEO. At the lowest DRAM clockspeed possible the difference between both cpufreqs is just 75 hash/s while at the upper end it's 425 hash/s. More interesting: at NEO's default 408 MHz DRAM clockspeed the cpufreq difference results in 1500 vs. 1830 hash/s, which means increasing cpufreq by 32 percent from 912 MHz to 1200 MHz only results in a 22 percent performance gain, since for this workload DRAM access matters too
When comparing columns 3 and 4 (Lite and PC using the same DRAM but different CPU clockspeeds: 1200 vs. 1296 MHz), the memory bandwidth effect is also present. The lower the DRAM is clocked the less the higher cpufreq makes a difference
Same when looking at columns 2 and 3 (comparing NEO and Lite running at the same 1200 MHz CPU clockspeed but with single vs. dual bank DRAM config). At 132 MHz it's a difference of 233 hash/s between both and at 672 MHz it's only 71 hash/s
Testing through 132 - 672 MHz is only useful to get some understanding of what's going on and how low DRAM clockspeeds might affect performance of specific workloads. Now let's have a look at realistic DRAM clockspeeds, and that's 408 MHz for the NEO and 624 MHz for the Orange Pis. The 408 MHz are the result of trusting the vendor's defaults and improving them slightly (decreasing clockspeed by 24 MHz results in 130 mW less consumption with only insignificant performance losses) and the 624 MHz are the result of community based DRAM reliability testing for the board (not trusting Allwinner's 672 MHz).
So how do the three boards compare when driven with optimized settings (Armbian defaults but HDMI/Mali disabled on all boards, as with the NEO defaults)?
Orange Pi PC @ 1296/624 MHz: 2249 hash/s
Orange Pi Lite @ 1200/624 MHz: 2109 hash/s
NanoPi NEO @ 1200/408 MHz: 1830 hash/s
NanoPi NEO @ 912/408 MHz: 1500 hash/s
Please keep in mind that all the numbers above are the result of using active cooling. With just a heatsink everything looks different since then throttling occurs and strange things might happen -- the best example is the NEO: when trying to run the test with only FriendlyARM's heatsink, no fan and allowing 1200 MHz clockspeed, the board simply deadlocks after running cpuminer for 25 minutes at a reported SoC temperature of 64°C. We chose the 912 MHz maximum for the NEO in Armbian for a reason: the NEO is simply not made for heavy stuff. I also experienced deadlocks within 2 minutes when trying to run our usual lima-memtester DRAM reliability tests on the NEO, which heat up the SoC heavily and increase consumption a lot since we stress the Mali400MP GPU to the max. Without an annoying fan it's impossible to run these workloads on the NEO. TL;DR: Disabling HDMI/GPU helps reduce consumption and temperatures. It also increases memory bandwidth since CPU and GPU cores have to share DRAM access. More memory bandwidth helps increase performance of most tasks (even IO bound tasks benefit on slow SoCs like H3). On SoCs that tend to overheat, disabling HDMI/GPU helps performance twice, since lower consumption/temperatures also help with throttling: if the SoC stays cooler, throttling kicks in later when running heavy workloads. On a related note: we've already measured how switching through different DRAM clockspeeds affects temperatures and consumption on NanoPi NEO when idle. We get a difference of 470 mW and 10°C (w/o heatsink) just by adjusting DRAM clockspeed between 132 and 672 MHz. What does it look like when running CPU intensive tasks?
I used the NEO, limited maximum cpufreq to 912 MHz (since the board deadlocked at 1200 MHz when running cpuminer for 25 minutes) and disabled the fan so that only FA's heatsink helps with heat dissipation:
Wed Aug 17 15:06:41 CEST 2016 132/912 MHz 0.913667 50
Wed Aug 17 15:09:42 CEST 2016 144/912 MHz 0.969667 50
Wed Aug 17 15:12:42 CEST 2016 156/912 MHz 1.01967 51
Wed Aug 17 15:15:43 CEST 2016 168/912 MHz 1.05967 51
Wed Aug 17 15:18:43 CEST 2016 180/912 MHz 1.1 52
Wed Aug 17 15:21:43 CEST 2016 192/912 MHz 1.14 53
Wed Aug 17 15:24:44 CEST 2016 204/912 MHz 1.17 53
Wed Aug 17 15:27:44 CEST 2016 216/912 MHz 1.20233 54
Wed Aug 17 15:30:45 CEST 2016 228/912 MHz 1.23933 54
Wed Aug 17 15:33:45 CEST 2016 240/912 MHz 1.265 55
Wed Aug 17 15:36:45 CEST 2016 252/912 MHz 1.29033 56
Wed Aug 17 15:39:45 CEST 2016 264/912 MHz 1.31967 56
Wed Aug 17 15:42:46 CEST 2016 276/912 MHz 1.34 57
Wed Aug 17 15:45:46 CEST 2016 288/912 MHz 1.36 57
Wed Aug 17 15:48:46 CEST 2016 300/912 MHz 1.38 58
Wed Aug 17 15:51:47 CEST 2016 312/912 MHz 1.398 58
Wed Aug 17 15:54:47 CEST 2016 324/912 MHz 1.40967 58
Wed Aug 17 15:57:47 CEST 2016 336/912 MHz 1.42 59
Wed Aug 17 16:00:47 CEST 2016 348/912 MHz 1.43187 59
Wed Aug 17 16:03:48 CEST 2016 360/912 MHz 1.44233 59
Wed Aug 17 16:06:48 CEST 2016 372/912 MHz 1.45067 59
Wed Aug 17 16:09:48 CEST 2016 384/912 MHz 1.467 60
Wed Aug 17 16:12:48 CEST 2016 408/912 MHz 1.49033 61
Wed Aug 17 16:15:49 CEST 2016 432/912 MHz 1.563 65
Wed Aug 17 16:18:49 CEST 2016 456/912 MHz 1.539 67
Wed Aug 17 16:21:49 CEST 2016 480/912 MHz 1.54967 68
Wed Aug 17 16:24:50 CEST 2016 504/912 MHz 1.55933 69
Wed Aug 17 16:27:50 CEST 2016 528/912 MHz 1.57 69
Wed Aug 17 16:30:50 CEST 2016 552/870 MHz 1.55531 69
Wed Aug 17 16:33:50 CEST 2016 576/842 MHz 1.51 70
Wed Aug 17 16:36:51 CEST 2016 600/831 MHz 1.492 70
Wed Aug 17 16:39:51 CEST 2016 624/823 MHz 1.48034 70
Wed Aug 17 16:42:51 CEST 2016 648/822 MHz 1.47862 70
Wed Aug 17 16:45:51 CEST 2016 672/822 MHz 1.47594 70
These are the raw logs I used, containing time stamp, DRAM and average CPU clockspeed of the last ~2.5 minutes, cpuminer khash/s value and SoC temperature in °C. CPU clockspeed was set to maximum, and while increasing DRAM clockspeed from 132 MHz up to 528 MHz the SoC temperature increased by 20°C (related to DRAM clockspeed only!). When exceeding 528 MHz throttling occurred, so SoC temperature remained at ~70°C while cpuminer's performance started to degrade. The increase in consumption and temperatures with higher DRAM clockspeed slowed cpufreq down and led to lower khash/s values above 528 MHz DRAM clock.
As a comparison the same test with OPi Lite (no fan, cheap heatsink, same settings, cpufreq limited to 912 MHz):
Wed Aug 17 17:16:30 CEST 2016 132/912 MHz 1.07867 50
Wed Aug 17 17:19:30 CEST 2016 144/912 MHz 1.129 49
Wed Aug 17 17:22:31 CEST 2016 156/912 MHz 1.16933 50
Wed Aug 17 17:25:31 CEST 2016 168/912 MHz 1.21 51
Wed Aug 17 17:28:31 CEST 2016 180/912 MHz 1.25033 51
Wed Aug 17 17:31:32 CEST 2016 192/912 MHz 1.28167 54
Wed Aug 17 17:34:32 CEST 2016 204/912 MHz 1.31167 53
Wed Aug 17 17:37:32 CEST 2016 216/912 MHz 1.33933 53
Wed Aug 17 17:40:32 CEST 2016 228/912 MHz 1.35967 53
Wed Aug 17 17:43:33 CEST 2016 240/912 MHz 1.38 54
Wed Aug 17 17:46:33 CEST 2016 252/912 MHz 1.4 54
Wed Aug 17 17:49:33 CEST 2016 264/912 MHz 1.43 54
Wed Aug 17 17:52:33 CEST 2016 276/912 MHz 1.45 56
Wed Aug 17 17:55:34 CEST 2016 288/912 MHz 1.46067 57
Wed Aug 17 17:58:34 CEST 2016 300/912 MHz 1.47233 54
Wed Aug 17 18:01:34 CEST 2016 312/912 MHz 1.49 56
Wed Aug 17 18:04:34 CEST 2016 324/912 MHz 1.5 56
Wed Aug 17 18:07:35 CEST 2016 336/912 MHz 1.51 56
Wed Aug 17 18:10:35 CEST 2016 348/912 MHz 1.52 57
Wed Aug 17 18:13:35 CEST 2016 360/912 MHz 1.53 56
Wed Aug 17 18:16:35 CEST 2016 372/912 MHz 1.532 56
Wed Aug 17 18:19:35 CEST 2016 384/912 MHz 1.54 57
Wed Aug 17 18:22:36 CEST 2016 408/912 MHz 1.547 56
Wed Aug 17 18:25:36 CEST 2016 432/912 MHz 1.57 62
Wed Aug 17 18:28:36 CEST 2016 456/912 MHz 1.59 61
Wed Aug 17 18:31:37 CEST 2016 480/912 MHz 1.6 62
Wed Aug 17 18:34:37 CEST 2016 504/912 MHz 1.65967 63
Wed Aug 17 18:37:37 CEST 2016 528/912 MHz 1.65967 62
Wed Aug 17 18:40:37 CEST 2016 552/912 MHz 1.65067 63
Wed Aug 17 18:43:38 CEST 2016 576/912 MHz 1.66 64
Wed Aug 17 18:46:38 CEST 2016 600/912 MHz 1.65933 62
Wed Aug 17 18:49:38 CEST 2016 624/912 MHz 1.66 63
Wed Aug 17 18:52:38 CEST 2016 648/912 MHz 1.65967 63
Wed Aug 17 18:55:39 CEST 2016 672/912 MHz 1.65933 63
No throttling occurred, temperatures were lower and cpuminer performance was higher. OPi Lite also uses DDR3 @ 1.5V (not DDR3L @ 1.35V like the larger Orange Pi variants), so the most significant difference is single vs. dual bank DRAM configuration. Maybe that's the reason Olimex reported such overheating problems when they started with their H3 boards a while ago (also using DDR3 in a single bank configuration)?
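For completeness, a rough reconstruction of the kind of logging loop behind the two tables above. The DRAM and cpufreq paths are the ones used earlier in this thread; the thermal sysfs path is an assumption (it differs between mainline and the legacy 3.4 kernel, and the value may be in millidegrees) and the khash/s values were simply taken from cpuminer's own benchmark output, so this only logs clocks and temperature:

#!/bin/bash
# log timestamp, current DRAM clock, current CPU clock and SoC temperature every ~3 minutes
DDRFREQ=/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/cur_freq
CPUFREQ=/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
THERMAL=/sys/class/thermal/thermal_zone0/temp   # assumption -- adjust for your kernel
while true; do
    echo "$(date) $(cat ${DDRFREQ})/$(cat ${CPUFREQ}) $(cat ${THERMAL})" >> /var/log/dram-vs-temp.log
    sleep 180
done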
tkaiser Posted August 23, 2016 Author Posted August 23, 2016 LOL, today I did some testing with NanoPi NEO, kernel 4.7.2 and the new schedutil cpufreq governor. I let the following run to check thermal readouts after allowing 1200 MHz max cpufreq:
sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$(grep -c '^processor' /proc/cpuinfo)
To my surprise the result was just 117.5 seconds -- that's 'better' than RPi 3 with the same settings and better than Orange Pi PC even though the latter is clocked higher (1.3 GHz vs. 1.2 GHz). For the OPi PC I got the following a few days ago: 'sysbench takes 142 seconds, H3 constantly running at 1296 MHz, SoC temperature reached 74°C but no throttling happening'. Wow!!! An increase in performance of ~30 percent just by using a new kernel! With a benchmark that should not be affected by the kernel version at all?! That's magic. So I immediately tried out our 3.4.112 Xenial image. Same thermal readouts, same result: 117.5 seconds! What happened? I tried out Xenial 16.04 LTS with both the 4.7.2 and the 3.4.112 kernel -- and before I had always used Debian Jessie. OK, downloaded our Jessie image for NanoPi NEO, executed the same sysbench call and got 153.5 seconds (which is the correct value given that no throttling occurred, max cpufreq was at 1200 MHz and an OPi PC clocked at 1296 MHz finishes in 142 seconds!). What can we learn from this? Sysbench is used nearly everywhere to 'get an idea about CPU performance' while it is horrible crap for comparing different systems! You always have to ensure that you're using the very same sysbench binary. At the very least it has to be built with the exact same compiler settings and version! We get a whopping 30 percent performance increase just because the Ubuntu folks use different compiler switches/versions than the Debian folks. This is 2 times 'sysbench 0.4.12':
Ubuntu Xenial Xerus:
root@nanopineo:~# file /usr/bin/sysbench
/usr/bin/sysbench: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, BuildID[sha1]=2df715a7bcb84cb03205fa3a5bc8474c6be1eac2, stripped
root@nanopineo:~# lsb_release -c
Codename: xenial
root@nanopineo:~# sysbench --version
sysbench 0.4.12
Debian Jessie:
root@nanopineo:~# file /usr/bin/sysbench
/usr/bin/sysbench: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 2.6.32, BuildID[sha1]=664005ab6bf45166f9882338db01b59750af0447, stripped
root@nanopineo:~# lsb_release -c
Codename: jessie
root@nanopineo:~# sysbench --version
sysbench 0.4.12
The same effect shows up when comparing sysbench numbers on RPi 2 or 3 running Raspbian or Ubuntu Mate -- see post #12 above (but there the difference is only 15 percent, so it seems either the Raspbian people aren't using compiler switches as conservative as Debian Jessie's, or Ubuntu Mate for Raspberries doesn't optimize as much as our 16.04 packages from the Ubuntu repositories). TL;DR: Never trust any sysbench numbers you find on the net if you don't know which compiler settings and version have been used. Sysbench is crap for comparing different systems. You can use sysbench's cpu test only for a very limited set of purposes: creating identical CPU utilization situations (to compare throttling settings as I did earlier in this thread), estimating multi-threaded results when adding/removing CPU cores, or testing CPU performance without results being tampered with by memory bandwidth (sysbench is so primitive that all code runs inside the CPU caches!)
Everything else always requires using the exact same sysbench binary on the different systems to be able to compare. So no cross-platform comparisons are possible, no comparisons between systems running different OS images, no comparisons between different CPU architectures. Using sysbench as a general purpose CPU benchmark is always just fooling yourself!
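The only way to make sysbench numbers comparable across distros is therefore to build the very same source with explicit compiler flags yourself. Roughly like the following (a sketch only: the usual autotools dance for the 0.4.x series; the -mcpu/-mfpu values are just examples and configure options/package names may differ):

sudo apt-get install build-essential automake libtool
# inside an unpacked sysbench 0.4.12 source tree:
./autogen.sh
./configure CFLAGS="-O2 -mcpu=cortex-a7 -mfpu=neon-vfpv4" --without-mysql
make
./sysbench/sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4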
jobenvil Posted September 30, 2016 Posted September 30, 2016 Really appreciated that you always share your observations.
Kevin Kreger Posted October 8, 2016 Posted October 8, 2016 @tkaiser ... This is great. In fact, this is just the information we need to optimize our Orange Pi One which is running Android headless.
Magnets Posted December 26, 2016 Posted December 26, 2016 Have you done any testing on the Orange Pi zero? Will the peak consumption be similar to the Orange Pi PC?
billybangleballs Posted February 5, 2017 Posted February 5, 2017 @tkaiser Very much appreciate your documentation on power usage. Keep up the good work.
hojnikb Posted March 6, 2017 Posted March 6, 2017 Here are some of my results for the PC2:
Equipment used:
- Orange Pi PC2
- Samsung EVO+ 32GB SD card
- 5V/2A USB PSU
- KEWEISI USB power monitor
- Armbian 5.25 @ March 6
Results:
Idle at Armbian desktop, no devices connected: 0.98W
Idle at Armbian desktop, keyboard+mouse via PS/2-USB converter chip: 1.03W
Idle at Armbian desktop, keyboard+mouse via PS/2-USB converter chip, USB wifi RTL8188ETV: 1.64W
Scrolling through Armbian forums with Firefox, keyboard+mouse via PS/2-USB converter chip, USB wifi RTL8188ETV: 2.05-3.08W
Burn-in test with cpuburnA53 @ 1.3GHz, keyboard+mouse via PS/2-USB converter chip, USB wifi RTL8188ETV: 7.32W
Burn-in test with cpuburnA53 @ 1.06GHz, keyboard+mouse via PS/2-USB converter chip, USB wifi RTL8188ETV: 5.30W
I should add that with the 1.3GHz test it throttled within seconds to 1.06GHz. If anyone wants something else tested, please suggest.
hojnikb Posted March 6, 2017 Posted March 6, 2017 Is there a good ARM CPU benchmark that I can use for efficiency testing? I'm really interested in finding out at which clock the A53 cores are most efficient.
hojnikb Posted March 14, 2017 (edited) So I did a quick and dirty power test for each freq/voltage point using stabilityTester/xhpl64. Testing was done on the same equipment as last time, but with the addition of a fan to eliminate throttling. Highest power recorded. HDMI/keyboard+mouse connected.
480MHz idle: 0.9747W
480Mhz 1.792W
528Mhz 1.8432W
648Mhz 2.0951W
672Mhz 2.1462W
720Mhz 2.1973W
728Mhz 2.295W
792Mhz 2.346W
816Mhz 2.448W
864Mhz 2.652W
912Mhz 2.754W
936Mhz 2.8504W
960Mhz 3.054W
1008Mhz 3.2512W
1056Mhz 3.4544W
1040Mhz 3.8885W
1152Mhz 4.1915W
1200Mhz 4.6965W
1224Mhz 4.9995W
1248Mhz 5.2312W
1296Mhz 5.7456W
1368Mhz 7.0716W
Edited March 14, 2017 by tkaiser: Added link to https://github.com/ehoutsma/StabilityTester
hojnikb Posted March 15, 2017 If there are any real-world apps that load up all 4 cores and give somewhat consistent results, I'm happy to take suggestions. I'm well aware that testing with xhpl64 isn't exactly realistic, but it's more of a worst-case scenario.
wtarreau Posted March 16, 2017 Posted March 16, 2017 19 hours ago, hojnikb said: If there are any real world apps, that load up all 4 cores and would bring somewhat consistent results, i'm happy to take suggestions. I'm well aware, that testing with xhpl64 isn't exactly realistic, but more of a worst case scenario. I usually run "openssl speed rsa2048 -multi <#cores>" for this, the RSA code is carefully optimized to achieve a very high IPC on most CPUs and I always managed to achieve the highest power consumption with this. The only difficulty is that it doesn't last long (10s) so you have to measure quickly. Another benefit is that it often comes pre-installed on most systems.
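Since a single 'openssl speed' run only loads the CPU for a short time, wrapping it in a loop gives a power meter enough time to settle on an average -- a trivial sketch:

# keep all cores busy with RSA sign/verify for several minutes
for i in $(seq 1 20); do
    openssl speed rsa2048 -multi $(nproc) >/dev/null 2>&1
done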
hojnikb Posted March 16, 2017 4 hours ago, wtarreau said: I usually run "openssl speed rsa2048 -multi <#cores>" for this, the RSA code is carefully optimized to achieve a very high IPC on most CPUs and I always managed to achieve the highest power consumption with this. The only difficulty is that it doesn't last long (10s) so you have to measure quickly. Another benefit is that it often comes pre-installed on most systems. All right, I might give this a try the next time I'm fiddling with my boards.
superjamie Posted April 10, 2017 Posted April 10, 2017 On 05/08/2016 at 4:38 AM, tkaiser said: Why do they allow RPi Zero to clock with up to 1 GHz by default when they limit B+ to 700 MHz (compare performance and consumption numbers of both tinymembench and sysbench above)? It was the limit of manufacturing capability at the time of release. The first Pi was released April 2012, the B+ was released July 2014, and the Zero was released November 2015. In 2012, Broadcom could only make the SoC well enough that the ARM1176JZF-S could reliably reach 700MHz. Some units could be overclocked with good results but many could not. By the end of 2015 - almost four years later - they had improved the precision of the manufacturing process so that 1000MHz was possible and reliable on all chips.
hojnikb Posted April 10, 2017 Posted April 10, 2017 1 hour ago, superjamie said: It was the limit of manufacturing capability at the times of release. The first Pi was released April 2012, the B+ was released July 2014, and the Zero was released November 2015. In 2012, Broadcom could only make the SoC well enough that the ARM1176JZF-S could reliably reach 700MHz. Some units could be overclocked with good results but many could not. By the end of 2015 - almost four years later - they had improved the precision of the manufacturing process so that 1000MHz was possible and reliable on all chips. 1Ghz was achievable on pretty much all boards; evidence of this is the option in their raspi-config, which allowed for 1Ghz setting. My example went easily to 1150Mhz.
superjamie Posted April 30, 2017 Posted April 30, 2017 On 10/04/2017 at 11:44 PM, hojnikb said: 1Ghz was achievable on pretty much all boards; evidence of this is the option in their raspi-config, which allowed for 1Ghz setting. My example went easily to 1150Mhz. The configuration tool allowed users to try for 1GHz but it definitely wasn't achievable on all boards. I had a first-batch 256MiB RAM Pi 1 (which I bought the week they were released in 2012) and a later 512 MiB Pi 1, both of which could not reliably go past 900MHz. I've spoken to other Pi 1 owners who could achieve 950MHz or 1000MHz, and one owner whose board couldn't even get past 850MHz reliably. If you had a Pi 1 reaching 1150MHz, you were very lucky and your experience was definitely not typical of most users.