Running H3 boards with minimal consumption

38 posts in this topic

Since I wondered why FriendlyARM chose just a 432 MHz DRAM clock for their new NanoPi NEO (said to be an IoT node for lightweight stuff) and also wondered how low consumption can be configured on an H3 device, I decided to simply try it out.

 
Since I have no NanoPi NEO lying around (and FriendlyARM seems not to ship developer samples) I used an Orange Pi Lite instead: same amount of DRAM (but in dual bank configuration, therefore somewhat faster), same voltage regulator, but Wi-Fi instead of Ethernet. I adjusted the fex file to stay on the lower VDD_CPUX voltage (1.1V) at all times, disabled all unnecessary stuff (Wi-Fi, HDMI/Mali400 and so on, please see the modified fex settings), adjusted /etc/default/cpufrequtils to jump between 240 and 912 MHz cpufreq, and added the following to /etc/rc.local to make H3 as slow as an RPi Zero:
echo 0 >/sys/devices/system/cpu/cpu3/online
echo 0 >/sys/devices/system/cpu/cpu2/online
echo 0 >/sys/devices/system/cpu/cpu1/online
echo 408000 >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq
(disabling 3 CPU cores and limiting DRAM clock to 408 MHz -- the DRAM downclock from 672 MHz to 408 MHz alone is responsible for a whopping 200mW difference in consumption).
 
With this single core setup the OPi Lite remains at 800mW when idling at 912 MHz; when running 'sysbench --test=cpu --num-threads=1 --cpu-max-prime=20000 run', consumption increases by 300mW (and H3 at 912 MHz is still a bit faster than an RPi Zero at 900 MHz: 808 seconds vs. 930 seconds). Further reducing CPU clockspeed or disabling LEDs doesn't help that much, or at least my powermeter isn't precise enough to show it.
 
I find it already pretty nice to be able to limit consumption to 160mA (800mW) by disabling 3 CPU cores (easy to bring back when needed!), downclocking DRAM and limiting VDD_CPUX voltage to 1.1V. That means that on H3 devices featuring the more flexible SY8106A voltage regulator even lower consumption could be achieved, since VDD_CPUX voltage could be lowered further. And consumption might be reduced even more by disabling additional stuff. But that's something someone else with a multimeter has to test, since my equipment isn't precise enough.
 
To sum it up: by simply tweaking software settings (most not even needing a reboot but accessible from user space) the average idle consumption of an H3 device can be cut from 1.5W (300mA) almost in half. In this mode (one single CPU core active at 912 MHz and DRAM downclocked to 408 MHz) an H3 device is still faster than an RPi Zero while providing way more I/O and network bandwidth. And if settings are chosen wisely, performance can be increased a lot from user space (transforming a single core H3 @ 912 MHz into a quad-core H3 @ 1200/1296 MHz with faster DRAM, which translates to roughly 6 times the performance).
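The 'easy to bring back when needed' part can itself be scripted from user space. A minimal sketch, assuming the legacy sun8i kernel sysfs paths used above; the write guards make it a harmless no-op on other systems, and 624000 is Armbian's usual DRAM performance clock:

```shell
#!/bin/sh
# Sketch: undo the rc.local tweaks above and return to quad-core mode.
# Sysfs paths are those of the legacy sun8i kernel; the [ -w ] guards
# simply skip nodes that don't exist or aren't writable on this host.
for cpu in 1 2 3; do
    node="/sys/devices/system/cpu/cpu${cpu}/online"
    [ -w "$node" ] && echo 1 > "$node"
done
ddr="/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq"
[ -w "$ddr" ] && echo 624000 > "$ddr"
STATUS="performance mode requested"
echo "$STATUS"
```

The cpufreq governor then scales the cores up on demand; no reboot needed in either direction.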


I've tested OPi One/Lite with moderate loads running stably from very modest power banks. The superior performance compared to RasPis (Zero/A+/B+/2) clearly makes the OPis the boards of choice. Good to know settings can be further adjusted from user space.


The superior performance compared to RasPis (Zero/A+/B+/2) clearly makes the OPis the boards of choice.

 

Hmm... I'm somewhat sceptical since at least with RPi Zero you can decrease idle consumption a lot: https://www.raspberrypi.org/forums/viewtopic.php?f=63&t=127210

 

But as already said: That's stuff for someone else with more precise equipment. I just wanted to give some hints where to adjust settings and am already pretty happy to be able to get H3 devices idling below 1W while still being able to instantly increase performance by 500%-600% from user space at any time. 

 

Regarding use cases I already know that I'm somewhat limited :)

 

For RPi Zero exactly zero use cases come to my mind -- maybe it's aimed at the tinkerer stuff closer to Arduinos than to the Linux hosts I'm used to :)



Hmm... I'm somewhat sceptical since at least with RPi Zero you can decrease idle consumption a lot: https://www.raspberrypi.org/forums/viewtopic.php?f=63&t=127210

 

Idle consumption is just perfect for the null use case :) RasPi Zero/A+ are great as low power IoT gateways for sub-Arduino ATtinys operating on watchdog. Higher up in the food chain, OPis deliver all the Linux fun at reasonable consumption. "Power on demand" for OPis, idling at very low power and bursting to full blast when actually needed, is a very interesting strategy when running the boards from batteries. Thank you for the hints :) .



Very interesting. Just to check: when running at full capacity on 4 cores, what temperatures are we speaking of, 60 degrees?

 

What kind of heat sinks would you suggest for the Lite/One?


What about temperature if only one core is benchmarked?

Does the temperature readout differ if core 1, 2, 3 or 4 is benchmarked?


tkaiser said: "For RPi Zero exactly zero use cases come to my mind"

 

Well, I have an old Raspi A that serves a dozen relays, an IR transmitter, a buzzer, a Bluetooth remote, and other goodies. An 80286 would do the job. Same goes for another Pi acting as a wake-up device (lamp + buzzer + radio), and others for temp sensors...

 

So my personal use cases are low power, low consumption devices in a BT PAN network. Ideally I should use Arduinos and a mesh ZigBee or Z-Wave network, but that would cost much more time and money!

 

So I am very interested in the NanoPi NEO as the RPi Zero is so hard to find. Thanks for your work. Even if you don't have use cases for some boards, others will have needs that match the specs. (I don't use a computer anymore but a network of ARM nodes with specific roles: server, firewall, desktop...)



When running at full capacity on 4 cores, what temperatures are we speaking of, 60 degrees? What kind of heat sinks would you suggest for the Lite/One?

 

Why do people always talk about heat when it's about consumption? :)

 

These two issues are related but also somewhat independent: 'overheating' takes some time before throttling kicks in, while a load peak immediately requires way more power. In other words: maximum consumption depends on load peaks, while temperature depends on longer load periods.

 

For a low power device as outlined above you don't need a heatsink at all since temperatures aren't an issue. If you enable 4 CPU cores running at 912 MHz / 1.1V then most probably a heatsink is also not necessary. When increasing VDD_CPUX voltage and CPU clockspeeds and thinking about high constant load, then a heatsink gets interesting. All of this applies to Orange Pis -- we already know that other H3 devices suffer way more from overheating, since Xunlong seems to use large copper layers inside the PCB to spread the heat while other vendors seem to skimp on that.

 

The information provided above was meant to encourage people to do their own testing since this really helps with understanding.

  • grab the modified fex file, grab the original, do a diff --> start to understand where to fiddle around and do your own testing with your own devices
  • check below /sys/devices/system/cpu/, /sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/ and /sys/devices/system/cpu/cpu0/cpufreq/
  • use monitoring (we spent so much time to ease this: 'sudo armbianmonitor -r' to install RPi-Monitor, 'sudo armbianmonitor -m' for CLI based quick monitoring)
  • read through the process of trying to optimize performance depending on thermal behaviour: https://github.com/igorpecovnik/lib/issues/298
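The first bullet (decompile, diff, fiddle) can be sketched like this; bin2fex from sunxi-tools and the file locations are assumptions based on a typical Armbian legacy install, and /boot/modified.fex is a hypothetical file of your own:

```shell
#!/bin/sh
# Sketch: compare the currently active fex settings with a modified copy.
# bin2fex ships with sunxi-tools; /boot/script.bin is the usual Armbian
# location. On hosts without these the script just says so and exits.
SCRIPT_BIN=/boot/script.bin
MODIFIED_FEX=/boot/modified.fex
if command -v bin2fex >/dev/null 2>&1 && [ -r "$SCRIPT_BIN" ]; then
    bin2fex "$SCRIPT_BIN" /tmp/current.fex
    diff -u /tmp/current.fex "$MODIFIED_FEX" || true
else
    echo "skipping: bin2fex or $SCRIPT_BIN not available on this host"
fi
STATUS="done"
```

After editing, fex2bin (also from sunxi-tools) converts the text back into script.bin; back up the original first.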

Regarding heatsinks: I usually use those that can be seen here on H3 devices: http://linux-sunxi.org/Orange_Pi_One#Orange_Pi_Lite (sold on Aliexpress as GPU or DRAM coolers -- no idea why). But a heatsink is only a requirement if you're thinking about exactly the opposite of what we're talking about here: running constantly high loads at full CPU clockspeed. H3 is specified to run at up to 125°C; our default throttling settings kick in way earlier, but it's up to you to adjust them (ths_para and cooler_table sections in the fex files for the legacy kernel). So by increasing the thermal trip points and allowing H3 to run at higher temperatures without throttling you can also save the 50 cents for a heatsink ;)



What about temperature if only one core is benchmarked?

Does the temperature readout differ if core 1, 2, 3 or 4 is benchmarked?

 

You have an H3 device, so why don't you try it out yourself?

 

Anyway: CPU 0 can't be disabled for obvious reasons, and our in-kernel core-keeper has to be disabled for such experiments (check the last lines in the fex files above!). I used the following test script on an Orange Pi PC Plus (1296 MHz max), always testing CPU 0 together with one of the other cores:

 

 

root@orangepipcplus:~# cat /usr/local/bin/test-h3.sh
#!/bin/bash

# activate all cores and heat up the SoC
echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo 1 >/sys/devices/system/cpu/cpu3/online
echo 1 >/sys/devices/system/cpu/cpu2/online
echo 1 >/sys/devices/system/cpu/cpu1/online
sysbench --test=cpu --cpu-max-prime=5000 run --num-threads=4
echo 0 >/sys/devices/system/cpu/cpu3/online
echo 0 >/sys/devices/system/cpu/cpu2/online
sleep 0.2
# CPU0/1
sysbench --test=cpu --cpu-max-prime=40000 run --num-threads=2
echo 1 >/sys/devices/system/cpu/cpu2/online
sleep 0.2
echo 0 >/sys/devices/system/cpu/cpu1/online
sleep 0.2
# CPU0/2
sysbench --test=cpu --cpu-max-prime=40000 run --num-threads=2
echo 1 >/sys/devices/system/cpu/cpu3/online
sleep 0.2
echo 0 >/sys/devices/system/cpu/cpu2/online
sleep 0.2
# CPU0/3
sysbench --test=cpu --cpu-max-prime=40000 run --num-threads=2 

 

 

 

Thermal readouts are pretty much identical (between 50°C and 53°C), so the thermal sensor is placed correctly on the SoC (unlike e.g. the Broadcom SoC used on the various Raspberry Pis -- there the sensor is obviously placed at the VideoCore's location and shows misleading readouts when it's about CPU utilization).

 

[Image: H3_Thermal_CPU_Sensor.png]



Another board, another test. Today I checked an Orange Pi PC Plus running the most recent Armbian from eMMC. I used our default dvfs settings (allowing interactive switching between 480 and 1296 MHz CPU clockspeed) and tested what helps further reduce consumption:

  • Reducing DRAM clockspeed from our default 624MHz down to 408MHz --> 200mW less
  • Disabling HDMI/display/Mali400 in fex file --> 200mW less
  • Disabling CPU cores --> not able to measure the difference when being idle
  • Further undervolting at lower clockspeeds and allowing downclocking to 240 MHz --> not able to measure the difference when being idle

I left all 4 USB ports and Ethernet active (LAN connected), and in headless mode (display disabled) with reduced DRAM clock I end up with 1100mW idle consumption just using our defaults (4 CPU cores idling at 480 MHz). That means without any further manual tweaking the cpufreq scaling code in the kernel will increase clockspeeds when needed. Simply tested by starting cpuburn-a7: clockspeeds jump to 1.3 GHz, VDD_CPUX is increased to 1320mV according to our dvfs table, and board consumption jumps from 1100mW to 5000mW.

 

As a side effect, DRAM clockspeed was also automatically increased to our default 624 MHz, since Allwinner's budget cooling code in our legacy kernel adjusts DRAM clockspeed to the value defined in the fex file as soon as the first thermal trip point is reached. So in case performance is needed, DRAM clock also gets adjusted to performance values as a throttling side effect (but it remains at this value afterwards, so it would need either a cronjob resetting /sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq periodically or some investigation into whether DRAM clocking behaviour could be made more intelligent -- see below).
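Such a cronjob could be a drop-in file as simple as this (a sketch: the 5 minute interval is an arbitrary choice; the 408000 value and sysfs path are the ones from this post):

```shell
# /etc/cron.d/ddrfreq -- sketch: push the DRAM clock back down periodically
# in case a throttling event raised it to the fex-defined maximum
*/5 * * * * root echo 408000 > /sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq 2>/dev/null
```

Files in /etc/cron.d use the crontab format with an extra user field (root here); the 2>/dev/null keeps the job quiet on kernels without the sunxi-ddrfreq node.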

 

For me, being able to run an Orange Pi PC or PC Plus idling at 220mA / 1100mW while connected to the LAN with all USB ports ready is already ok. As can be seen above, DRAM clocking behaviour might be worth a look. In our sun8i legacy kernel the following options are currently not set:

# CONFIG_SUNXI_THERMAL_DYNAMIC is not set
# CONFIG_CPU_THERMAL is not set
# CONFIG_SUNXI_BUDGET_COOLING_VFTBL is not set

(I haven't had a look at what these are responsible for since I'm looking for volunteers.) Also, different governors for handling DRAM clockspeed might be usable (currently we only have 'userspace' active and can therefore set DRAM clockspeed only manually within the hardcoded boundaries -- minimum defined in kernel code, maximum in the fex file). And then it looks like a dram_dvfs_table entry in the fex file is supported. But I would suppose this is stuff for Allwinner SoCs with PMIC support.

 

Anyway: People interested in further decreasing consumption now know where the playground is located ;)

 

One final note: I won't test it now, but on GbE equipped H3 boards the PHY (RTL8211E on all 4 boards) is responsible for up to 180mA of additional consumption.



TL;DR: By making a small/cheap H3 device as slow and feature-less as an RPi Zero we're able to reduce idle consumption to 140mA or even lower. Tested with Orange Pi Lite, results should apply to other H3 devices equipped with the primitive voltage regulator only switching between 1.1V and 1.3V.

 

Based on the experiences with the OPi PC Plus (killing CPU cores at the lowest cpufreq isn't that effective) I created a tiny patch to be able to downclock DRAM even further (lib/patch/kernel/sun8i-default/0028-h3-limit-dram-clock-to-264-mhz.patch) and now get close to 700mW / 140mA idle consumption on Orange Pi Lite (without any network connectivity -- please keep that in mind!)

 

Fex settings are more conservative again (480-1200 MHz instead of 240-912 MHz): http://sprunge.us/ZHag

 

/etc/rc.local now looks like this:

echo 0 >/sys/devices/system/cpu/cpu3/online
echo 0 >/sys/devices/system/cpu/cpu2/online
echo 0 >/sys/devices/system/cpu/cpu1/online
echo 264000 >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq

If one decides not to disable the 3 additional CPU cores, idle consumption increases slightly (maybe 50mW), but on the other hand no tweaks are necessary to activate the CPU cores when needed, since the cpufreq scaling code in the kernel handles this. When I start cpuburn-a7 with all 4 cores active, consumption immediately reaches 4800mW when 1200 MHz is allowed. If one limits the upper clockspeed to just 912 MHz in /etc/default/cpufrequtils (MAX_SPEED=912000), which is the highest clockspeed still remaining at the lower VDD_CPUX voltage (1.1V on the small H3 devices), then it's just 2600mW instead.

 

So by configuring the board to stay on the lower voltage we lose only 25 percent of maximum full load performance but cut maximum consumption by 2.2W! This looks like a reasonable config if one wants to use the small/cheap H3 boards like NanoPi M1, NEO or OPi One/Lite as lightweight IoT nodes, since maximum performance is less important than low consumption.
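For reference, a matching /etc/default/cpufrequtils could look like this (a sketch: the ENABLE/GOVERNOR lines reflect common Armbian defaults, and the speed values are the ones discussed above; adjust per board):

```shell
# /etc/default/cpufrequtils -- sketch: cap cpufreq at 912 MHz, the highest
# operating point that still stays at 1.1V VDD_CPUX on the small H3 boards
ENABLE=true
GOVERNOR=interactive
MIN_SPEED=480000
MAX_SPEED=912000
```

Values are in kHz, matching the scaling_available_frequencies the kernel exposes; no reboot is needed, restarting the cpufrequtils service applies the new limits.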

 

BTW: the H3 SoC can deal with DDR2 and DDR3 memory. Downclocking from the advertised 672 MHz (which we at Armbian don't use anyway, since reliability testing on a couple of boards showed a few failures, so we limit this to 624 MHz on all H3 devices) to just 264 MHz sounds like a massive decrease in performance. In fact, downclocking the dual bank DDR3 on OPi Lite results in memory performance still comparable to any RPi (only DDR2 in single bank configuration). In 'RPi Zero simulation mode' with just a single CPU core running at 912 MHz and DRAM clocked at just 264 MHz, tinymembench reports this:

 

 

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    132.8 MB/s (3.7%)
 C copy backwards (32 byte blocks)                    :    456.9 MB/s (5.1%)
 C copy backwards (64 byte blocks)                    :    477.8 MB/s (4.6%)
 C copy                                               :    450.2 MB/s (2.4%)
 C copy prefetched (32 bytes step)                    :    481.0 MB/s (3.8%)
 C copy prefetched (64 bytes step)                    :    472.0 MB/s (6.9%)
 C 2-pass copy                                        :    438.3 MB/s (6.6%)
 C 2-pass copy prefetched (32 bytes step)             :    437.8 MB/s (2.0%)
 C 2-pass copy prefetched (64 bytes step)             :    444.8 MB/s
 C fill                                               :   1749.3 MB/s (6.5%)
 C fill (shuffle within 16 byte blocks)               :   1748.9 MB/s
 C fill (shuffle within 32 byte blocks)               :    187.6 MB/s (5.3%)
 C fill (shuffle within 64 byte blocks)               :    199.5 MB/s (8.1%)
 ---
 standard memcpy                                      :    302.7 MB/s (5.3%)
 standard memset                                      :   1747.9 MB/s (2.0%)
 ---
 NEON read                                            :    647.7 MB/s (5.6%)
 NEON read prefetched (32 bytes step)                 :    786.4 MB/s
 NEON read prefetched (64 bytes step)                 :    778.9 MB/s (3.5%)
 NEON read 2 data streams                             :    187.1 MB/s (6.7%)
 NEON read 2 data streams prefetched (32 bytes step)  :    360.6 MB/s (2.2%)
 NEON read 2 data streams prefetched (64 bytes step)  :    367.3 MB/s (5.2%)
 NEON copy                                            :    478.1 MB/s (4.5%)
 NEON copy prefetched (32 bytes step)                 :    510.3 MB/s (4.1%)
 NEON copy prefetched (64 bytes step)                 :    557.0 MB/s (8.0%)
 NEON unrolled copy                                   :    472.5 MB/s (6.9%)
 NEON unrolled copy prefetched (32 bytes step)        :    506.7 MB/s (7.6%)
 NEON unrolled copy prefetched (64 bytes step)        :    538.6 MB/s (3.4%)
 NEON copy backwards                                  :    458.2 MB/s (7.1%)
 NEON copy backwards prefetched (32 bytes step)       :    455.4 MB/s (2.1%)
 NEON copy backwards prefetched (64 bytes step)       :    547.8 MB/s
 NEON 2-pass copy                                     :    454.1 MB/s (6.9%)
 NEON 2-pass copy prefetched (32 bytes step)          :    502.4 MB/s (5.3%)
 NEON 2-pass copy prefetched (64 bytes step)          :    513.8 MB/s (4.0%)
 NEON unrolled 2-pass copy                            :    439.4 MB/s (6.6%)
 NEON unrolled 2-pass copy prefetched (32 bytes step) :    433.4 MB/s (2.1%)
 NEON unrolled 2-pass copy prefetched (64 bytes step) :    454.3 MB/s
 NEON fill                                            :   1748.6 MB/s (6.7%)
 NEON fill backwards                                  :   1747.2 MB/s
 VFP copy                                             :    474.7 MB/s (3.7%)
 VFP 2-pass copy                                      :    439.5 MB/s
 ARM fill (STRD)                                      :   1748.1 MB/s (4.3%)
 ARM fill (STM with 8 registers)                      :   1747.5 MB/s
 ARM fill (STM with 4 registers)                      :   1748.1 MB/s
 ARM copy prefetched (incr pld)                       :    535.4 MB/s (3.9%)
 ARM copy prefetched (wrap pld)                       :    455.7 MB/s (8.4%)
 ARM 2-pass copy prefetched (incr pld)                :    494.6 MB/s (4.8%)
 ARM 2-pass copy prefetched (wrap pld)                :    459.4 MB/s (3.4%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    6.9 ns          /    11.9 ns 
    131072 :   10.7 ns          /    16.8 ns 
    262144 :   12.7 ns          /    18.6 ns 
    524288 :   18.8 ns          /    26.3 ns 
   1048576 :  171.1 ns          /   265.6 ns 
   2097152 :  257.1 ns          /   346.9 ns 
   4194304 :  302.1 ns          /   378.1 ns 
   8388608 :  326.9 ns          /   394.7 ns 
  16777216 :  344.7 ns          /   411.9 ns 
  33554432 :  363.1 ns          /   433.5 ns 
  67108864 :  388.5 ns          /   483.3 ns 

 

 

 

That's not much of a difference compared to an RPi Zero, A, A+, B or B+: https://github.com/ssvb/tinymembench/wiki/Raspberry-Pi-(BCM2835)


Final update: since in most of our fex files we define all 4 USB ports the H3 features (3 x host, 1 x OTG) as active, I also tried disabling them completely (edit in the fex files: 'usb_used = 0'). With a 4.5V PSU I'm now at 500mW / 100mA in 'RPi Zero mode'.

 

Disabling all USB ports by default would be somewhat moronic, but at least we now know that by disabling USB ports some more mW can be saved on H3 boards with the legacy kernel (maybe the same can be achieved by poking sysfs nodes, just like on RPi).
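The 'poking sysfs nodes' idea could be explored with the generic Linux USB runtime power management interface; whether that actually saves anything on the legacy sun8i kernel is an open question (an untested sketch, not a measured result):

```shell
#!/bin/sh
# Sketch: request autosuspend for idle USB devices via the generic Linux
# runtime PM sysfs attribute power/control (not an H3-specific switch).
# If no such nodes exist or they aren't writable, the loop is a no-op.
for node in /sys/bus/usb/devices/*/power/control; do
    [ -w "$node" ] && echo auto > "$node"
done
STATUS="usb autosuspend requested"
echo "$STATUS"
```

Unlike the fex edit this needs no reboot and can be reverted by writing 'on' to the same nodes.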

 

Time to stop. For me, being able to let a fully featured Orange Pi PC idle at 1100mW is enough (all USB ports and Ethernet ready). Others who might 'misuse' H3 boards as Arduino replacements might be happy that consumption can be cut in half while still being faster than an RPi Zero.


700 mW for a single device with its own PSU is good: small 5V PSUs are typically rated at 0.3 W consumption without any load and at most 70% efficiency (and only when power demand is near nominal power). Those who want to build a cluster (or a group of devices) should also invest in a good PSU to power the whole cluster + disks + hubs + goodies at around 80% of nominal power.


Anyone want to play a bit? Here are .debs of a patched kernel with DRAM clockspeed as low as 132 MHz:

# cat lib/patch/kernel/sun8i-default/0028-h3-limit-dram-clock-to-132-mhz.patch 
diff --git a/drivers/devfreq/dramfreq/sunxi-ddrfreq.c b/drivers/devfreq/dramfreq/sunxi-ddrfreq.c
index c7c20b7..7581087 100755
--- a/drivers/devfreq/dramfreq/sunxi-ddrfreq.c
+++ b/drivers/devfreq/dramfreq/sunxi-ddrfreq.c
@@ -1666,7 +1666,7 @@ static __devinit int sunxi_ddrfreq_probe(struct platform_device *pdev)
 	if (sunxi_ddrfreq_min < SUNXI_DDRFREQ_MINFREQ_MIN)
 		sunxi_ddrfreq_min = sunxi_ddrfreq_max / 3;
 #elif defined(CONFIG_ARCH_SUN8IW7P1)
-	sunxi_ddrfreq_min = 408000;
+	sunxi_ddrfreq_min = 132000;
 #else
 	type = script_get_item("dram_para", "dram_tpr12", &val);
 	if (SCIRPT_ITEM_VALUE_TYPE_INT != type) {

Installation as usual -- unpack the tar archive and then run as root in the same directory:

dpkg -i linux-headers-sun8i_5.17_armhf.deb linux-image-sun8i_5.17_armhf.deb

Reducing DRAM clockspeed to just 132 MHz further reduces consumption, but performance is of course also affected. You would need to put the following into e.g. /etc/rc.local or a cron job:

echo 132000 >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq 2>/dev/null


Thanks for this nice work.

It's a shame that most boards lack a battery connector that would allow making the best use of this.


It's a shame that most boards lack a battery connector that would allow making the best use of this.

 

Huh? Different people, different needs! ;)

 

Maybe you remember, over half a year ago back in the Orange Pi forums? There were many people around proudly frying their boards with the most moronic settings ever. Using the performance cpufreq governor and loboris' insanely overvolted settings they were happy to have an unstable H3 board using an annoying fan and idling at 1536 MHz clockspeed.

 

Then we started to investigate how THS/thermal settings can be improved, increased performance while lowering temperatures/consumption, and now we're entering the next round, starting to understand how other settings behave (e.g. DRAM clockspeed -- I would've never imagined that we could lower the DRAM clock on H3 boards that much and gain energy savings while still being faster than the RPi's DRAM interface).

 

BTW: maybe some more useful stuff in another thread on how to switch off consumers currently not needed (not that much applies to H3 boards unfortunately): http://forum.armbian.com/index.php/topic/1631-tutorial-marriage-between-a20-and-h3-ups-mode-sunxi-pio-utility/page-0

 

I already tried switching off the RTL8211E PHY on GBit Ethernet equipped H3 boards, but to no avail:

ifconfig eth0 down
sunxi-pio -m PD06'<default><default><default><0>'
sleep 300
ifconfig eth0 up
sunxi-pio -m PD06'<default><default><default><1>'
sleep 300
ifconfig eth0 down
sunxi-pio -m PD06'<default><default><default><0>'
sleep 300
ifconfig eth0 up
sunxi-pio -m PD06'<default><default><default><1>'
sleep 300
ifconfig eth0 down
sunxi-pio -m PD06'<default><default><default><0>'

It makes no difference regarding consumption -- the PHY only starts to consume power when transmitting data.


"Different people, different needs!"

 

Well, my first Unix workstation was a MicroVAX (Digital Equipment). One tiny MIPS and it ran X11!

 

My new H3 BPi boasts:

SMP: Total of 4 processors activated (19200.00 BogoMIPS)

LOL

 

I should say my new BPi M2+ is not the best board I ever saw, but the eMMC is very valuable. It now replaces an RPi B as firewall/gateway with 10 times the CPU power and I/O bandwidth but half the consumption (2 Ethernet -- GbE + USB, Wi-Fi, Bluetooth, no display = 2.5 W; the PSU has 75% efficiency). The (2nd level) HTTP proxy cache can now run on the gateway thanks to eMMC.

 

I will test a lower DRAM frequency after reinstalling the mail gateways.


Hi,

 

I found I could save 300 mW and 3°C at idle by disabling 2 CPU cores, but I don't see any change from setting dram_clk = 480 in dram_para in script.bin.

 

Did I miss something?

 

(BPI M2+ 3.4.112-sun8i #8 SMP PREEMPT Mon Jun 20 12:54:33 CEST 2016 armv7l ARMv7 Processor rev 5 (v7l) sun8i GNU/Linux stock kernel)


I don't see any change from setting dram_clk = 480 in dram_para in script.bin

 

You should check

/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/cur_freq

Armbian uses mainline u-boot, and that's where the DRAM clock is specified (624 MHz by default). Our sun8i legacy kernel contains code to adjust DRAM clockspeed only in throttling situations, therefore what you specify in script.bin will only be used when one of the thermal trip points is reached. That's different from OS images that rely on Allwinner's outdated 2011.09 u-boot version, where they patched u-boot to read these settings from script.bin.

 

In other words: to change DRAM clockspeed you have to edit script.bin (to be prepared for throttling situations) and adjust the DRAM clock from user space through

echo 480000 >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq

As outlined above, I would prefer setting a pretty low value through 'userspace/set_freq' (by cron), then defining a low first thermal trip point and a high DRAM clockspeed (624 MHz) in script.bin. This combination leads to energy savings when the board idles around but will automagically increase DRAM clockspeed in case of a longer activity period (temperature rises --> kernel adjusts DRAM clockspeed).

 

BTW: the BPi M2+ is of course the worst H3 board available if you're interested in energy savings (the 'engineers' forgot to add switchable resistors to the SY8113B voltage regulator, therefore VDD_CPUX is always at 1.3V -- of course the 'engineers' got this wrong in the schematics, and as usual they never correct the mistakes they make).


@tkaiser

 

Thanks a lot. I saved another 200 mW and 3°C! I can now run my "gatekeeper" at less than 2W (no video, 2 CPU cores) -- initially more than 5W with an (untuned) RPi B with outdated software.

 

Benchmarking real world performance for my use case is not so easy and I have to think of a method. The usage is interactive (text filtering by Privoxy) and the final result depends on the network and complex pipelining. I already found in a first test that the cpufreq governor parameters were optimal in Armbian.

 

Energy saving (and heat dissipation) is a complex matter anyway, but not so important during winter! As a matter of fact, I save 20W by shutting down my ISP TV box and audio booster with relays, and I could save much more by driving my central heating circulator intelligently. I would like to have humidity sensors everywhere in the house, because a leak can cost you up to 1000 times the price of a board if you don't detect it. So my next project will be to find the best board for IoT projects, even if I must downgrade the performance of a 4 core, 1.5 GHz, 1 GB RAM board -- as manufacturers are not so much interested in providing cheap and light solutions.


A quick test shows that I can save 150 mW by downgrading Ethernet to 100 Mbit/s -- which is not surprising (ethtool -s eth0 speed 100 duplex full).

 

Modern switches boast adaptive power (depending on port usage and wire length). I currently use a 10 meter cable and will check the difference once I've moved the switch and can use a shorter cable. But I don't expect a difference for the BPi M2+ as its idle power consumption does not drop when I unplug the cable.

Good news: Adjusting DRAM clockspeed (which is not that much related to overall performance in most use cases!) seems to be way more efficient than lowering CPU clockspeed when it's about low idle consumption.
 
With a measurement setup that's known to be somewhat inaccurate (both absolute readings and linearity definitely not precise enough to provide numbers for general comparisons!) it's at least confirmed that lowering DRAM clockspeed shows way more savings compared to lowering CPU clockspeed... as long as we're talking about an idle system (tests with moderate and full load will show a different picture! Results will follow)
 
I managed to get consumption reported as low as 400mW when idling with 132 MHz DRAM clockspeed and a single CPU core at just 240 MHz (leftmost value). Then I made all 4 CPU cores active again and adjusted idle clockspeed to 912 MHz (the upper clockspeed that still remains at the lowest 1.1V VDD_CPUX core voltage on the more primitive H3 boards: OPi One/Lite and NanoPi M1/NEO), then I disabled CPU cores 1, 2 and 3 while remaining at 912 MHz, then H3 idled at 240 MHz with 4 cores active, then 3 and then 2 (from left to right; the last setting is missing on the graphs since it's a 30 min average value -- otherwise the numbers weren't precise enough -- which is why it looks somewhat like stairways):

 

 

[Screenshot: consumption graph, 2016-07-30]

 

In 'absolute' numbers (which you should not trust -- the consumption monitoring setup is known to fail in this regard; only relative comparisons are possible, but that's enough to get an idea of what to disable if in doubt):

 1 Core @ 240 MHz: 400 mW
 1 Core @ 912 MHz: 410 mW
4 Cores @ 240 MHz: 530 mW
4 Cores @ 912 MHz: 540 mW

 1 Core @ 240 MHz: 400 mW
2 Cores @ 240 MHz: 450 mW
3 Cores @ 240 MHz: 500 mW
4 Cores @ 240 MHz: 530 mW

So the difference between idling at 240 MHz and 912 MHz is pretty much nonexistent (the 10 mW difference measured is close to the measuring inaccuracy of my approach), but DRAM clockspeed matters, and so does the count of active 'engines' (CPU cores, display engine, USB ports and so on -- simply compare with the WiP approach to somehow reliably monitor consumption of connected devices in low power mode)
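Since the core count dominates these idle numbers, here is a small helper sketching the per-core echo commands used throughout this thread as one function. The sysfs layout is the standard CPU hotplug interface; cpu0 is never taken offline here (the thread never does either), so the argument ranges from 1 to 4.

```shell
# Keep exactly N CPU cores online on a quad-core H3 (sketch).
# SYSFS is overridable for testing; cpu0 always stays online.
SYSFS="${SYSFS:-/sys}"

set_cores() {
    n=$1   # desired number of online cores, 1..4
    for i in 1 2 3; do
        if [ "$i" -lt "$n" ]; then
            echo 1 > "$SYSFS/devices/system/cpu/cpu$i/online"
        else
            echo 0 > "$SYSFS/devices/system/cpu/cpu$i/online"
        fi
    done
}
```

For example 'set_cores 1' reproduces the single-core idle setup from the measurements above, and 'set_cores 4' brings everything back.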


 

Anyone wants to play a bit? Here are .debs of a patched kernel with DRAM clockspeed as low as 132 MHz:

# cat lib/patch/kernel/sun8i-default/0028-h3-limit-dram-clock-to-132-mhz.patch 
diff --git a/drivers/devfreq/dramfreq/sunxi-ddrfreq.c b/drivers/devfreq/dramfreq/sunxi-ddrfreq.c
index c7c20b7..7581087 100755
--- a/drivers/devfreq/dramfreq/sunxi-ddrfreq.c
+++ b/drivers/devfreq/dramfreq/sunxi-ddrfreq.c
@@ -1666,7 +1666,7 @@ static __devinit int sunxi_ddrfreq_probe(struct platform_device *pdev)
 	if (sunxi_ddrfreq_min < SUNXI_DDRFREQ_MINFREQ_MIN)
 		sunxi_ddrfreq_min = sunxi_ddrfreq_max / 3;
 #elif defined(CONFIG_ARCH_SUN8IW7P1)
-	sunxi_ddrfreq_min = 408000;
+	sunxi_ddrfreq_min = 132000;
 #else
 	type = script_get_item("dram_para", "dram_tpr12", &val);
 	if (SCIRPT_ITEM_VALUE_TYPE_INT != type) {

Installation as usual -- unpack the tar archive and then do as root in the same directory:

dpkg -i linux-headers-sun8i_5.17_armhf.deb linux-image-sun8i_5.17_armhf.deb

Reducing DRAM clockspeed to just 132 MHz further reduces consumption. But performance is of course also affected. You would need to put the following in e.g. /etc/rc.local or a cron job:

echo 132000 >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq 2>/dev/null
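To confirm the new clock actually took effect, the devfreq framework usually exposes a cur_freq node beside the governor directories. Whether the legacy sun8i kernel provides it at exactly this path is an assumption on my part -- adjust if your kernel differs:

```shell
# Read-back check for the DRAM clock (sketch). SYSFS is overridable
# for testing; the cur_freq node is a standard devfreq attribute.
SYSFS="${SYSFS:-/sys}"

dram_status() {
    ddr="$SYSFS/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq"
    echo "requested: $(cat "$ddr/userspace/set_freq")"
    echo "current:   $(cat "$ddr/cur_freq")"
}
```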

 

Is this patch included in the armbian build tools or do I need to enable it somehow?

 

I've just built Armbian kernel images for my NanoPi NEO on Devuan Jessie (yes, I know it's unsupported, but I got it working after some trial and error). It appears from my logs that this patch is not applied to the kernel I just built. I want to build a specific SPI LCD driver into the kernel as well as use the lower memory clocks.

 

Edit: never mind I didn't notice the file path you passed to cat above... I stuck it there and I expect that will work ;-)


I added a preliminary version of a new tool to our repo: h3consumption (please read there how to use/test this tool and provide feedback there also).

 

Two examples for usage and savings:

 

Orange Pi Plus 2E

 
With our defaults (and no WiFi connection established!) the board idles at 1650 mW and 'h3consumption -p' displays:

Active settings:

cpu       1296 mhz allowed, 1296 mhz possible, 4 cores active

dram      624 mhz

hdmi/gpu  active

usb ports active

eth0      1000Mb/s/Full, Link: yes

wlan0     unassociated  Nickname:"<WIFI@REALTEK>"
          Mode:Managed  Frequency=2.412 GHz  Access Point: Not-Associated   
          Sensitivity:0/0  
          Retry:off   RTS thr:off   Fragment thr:off
          Encryption key:off
          Power Management:off
          Link Quality:0  Signal level:0  Noise level:0
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:0  Invalid misc:0   Missed beacon:0

After executing 'h3consumption -c 1 -m 1296 -d 408 -g off -e fast', commenting out the 8189fs line in /etc/modules and a reboot, the board now idles at just 870 mW. Disabling HDMI/GPU, using Fast instead of Gbit Ethernet and the lower DRAM clock are mostly responsible for the 780 mW savings! Disabling CPU cores is more a measure to limit maximum/peak consumption and should be avoided when it's only about lowering idle consumption. Now 'h3consumption -p' displays:

Active settings:

cpu       1296 mhz allowed, 1296 mhz possible, 1 cores active

dram      408 mhz

hdmi/gpu  off

usb ports active

eth0      100Mb/s/Full, Link: yes

Please keep in mind that we're talking about a quad-core SBC with performance above RPi 2 level, featuring 2 GB DRAM and 16 GB eMMC, 4 real USB ports instead of 1, a real Ethernet port instead of none and onboard WiFi instead of none. It costs as much as an RPi 2 while idling way below it in the same mode (an RPi 2 with USB/Ethernet ready/connected needs at least 1200 mW while an OPi Plus 2E in the same mode with all 4 CPU cores active idles at ~900 mW). And if we need performance instead of lowest consumption we can get it from userspace within seconds. The following brings back all CPU cores, switches to Gbit Ethernet and enables WiFi:
for i in 3 2 1; do echo 1 >/sys/devices/system/cpu/cpu${i}/online; done
ethtool -s eth0 speed 1000 duplex full
modprobe 8189fs && sleep 0.5 && ifconfig wlan0 up

And if the work is done we can easily return to low-power mode again, still having an SBC with Ethernet ready/connected and 4 USB ports ready:

ifconfig wlan0 down && sleep 0.5 && rmmod -f 8189fs
ethtool -s eth0 speed 100 duplex full
for i in 3 2 1; do echo 0 >/sys/devices/system/cpu/cpu${i}/online; done
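The two sequences above can be wrapped into a pair of functions for convenience. This is my own sketch, not part of h3consumption; interface and module names (eth0, wlan0, 8189fs) follow the thread and may differ on your board. RUN can be set to 'echo' for a dry run that only prints the commands:

```shell
# Toggle between performance and low-power mode on an OPi Plus 2E (sketch).
# Run as root; set RUN=echo first for a dry run.
RUN="${RUN:-}"

performance_mode() {
    for i in 3 2 1; do $RUN sh -c "echo 1 >/sys/devices/system/cpu/cpu$i/online"; done
    $RUN ethtool -s eth0 speed 1000 duplex full
    $RUN modprobe 8189fs && sleep 0.5 && $RUN ifconfig wlan0 up
}

lowpower_mode() {
    $RUN ifconfig wlan0 down && sleep 0.5 && $RUN rmmod -f 8189fs
    $RUN ethtool -s eth0 speed 100 duplex full
    for i in 3 2 1; do $RUN sh -c "echo 0 >/sys/devices/system/cpu/cpu$i/online"; done
}
```

E.g. calling 'lowpower_mode' when the work is done returns the board to the ~870 mW idle state described above.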

Another example:

 

Orange Pi Lite

 

With our defaults (and no WiFi connection established!) the board idles at 1060 mW and 'h3consumption -p' displays:

 

 

Active settings:

cpu       1200 mhz allowed, 1200 mhz possible, 4 cores active

dram      624 mhz

hdmi/gpu  active

usb ports active

wlan0     unassociated  Nickname:"<WIFI@REALTEK>"
          Mode:Managed  Frequency=2.412 GHz  Access Point: Not-Associated   
          Sensitivity:0/0  
          Retry:off   RTS thr:off   Fragment thr:off
          Encryption key:off
          Power Management:off
          Link Quality:0  Signal level:0  Noise level:0
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:0  Invalid misc:0   Missed beacon:0

 

 

With 'h3consumption -D 132 -c1 -g off -u off' we switch into 'OPi Zero' mode by lowering DRAM clockspeed to just 132 MHz (experimental! Recommended minimum is 408 MHz), disabling all CPU cores but one, disabling HDMI/GPU and also all USB ports (Ethernet is already absent here). In this mode connectivity is clearly restricted (not a single USB port active, which saves ~125 mW), but if the Lite is just a data logger using some sensors and transmitting data every 24 hours it can happily operate in this 400 mW mode, switch on WiFi when needed, transmit the data and, by disabling WiFi, get back into the low-power state again. Perfectly controllable from userspace, it's just:

modprobe 8189fs && sleep 0.5 && ifconfig wlan0 up    # wifi enabled
ifconfig wlan0 down && sleep 0.5 && rmmod -f 8189fs  # wifi disabled

With these settings (nearly everything inside H3 disabled) the Lite is able to run at 400 mW. I made quick tests with NanoPi M1 (even ~20 mW less) and Orange Pi PC Plus (~20 mW more) and I think it's safe to assume that all the smaller H3 boards behave more or less the same here (the NEO being the one exception due to PCB design and maybe different DRAM configuration, and the GbE equipped boards showing higher base consumption due to the Gbit Ethernet PHY, more DRAM and so on)

 

Summary

 

Some h3consumption rules of thumb:

  • Disabling GPU/HDMI on headless H3 devices (-g off) is always a good idea since it lowers consumption by ~200 mW and also increases memory bandwidth which slightly improves performance
  • Lowering DRAM clockspeed is also responsible for some savings but slightly decreases performance of some workloads. Going below 408 MHz is experimental but will show even higher savings
  • Lowering peak and full load consumption can be done by limiting active CPU cores (-c 2 for example) or by limiting maximum cpufreq (-m 912 being the best possible value for H3 boards with the primitive voltage regulator since then NanoPi M1/NEO and OPi One/Lite will always remain at the lower 1.1V VDD_CPUX voltage which will result in huge savings under full load)
  • On the Gigabit Ethernet equipped boards only negotiating Fast Ethernet (-e fast) saves almost 400 mW consumption (and most probably even more on the other end of the cable since modern GbE switches also lower consumption with Fast instead of Gbit Ethernet)

Pro tip: Many of these settings are accessible from userspace and without a reboot. Just have a look at what h3consumption adds to /etc/rc.local to get the idea. In case an Orange Pi Plus 2E for example idles around most of the time since it is only used nightly to store backups from other machines, one can easily combine lowest idle consumption with full performance in active periods. Simply use 'h3consumption -c 1 -m 1296 -d 624 -g off -e fast'. This keeps our usual 624 MHz DRAM clockspeed, but to save energy in idle periods the following can be executed (script / cron job / whatever):

echo 132000 >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq
for i in 3 2 1; do echo 0 >/sys/devices/system/cpu/cpu${i}/online; done
ethtool -s eth0 speed 100 duplex full

And then prior to execution of the backup jobs that require high performance we switch back to performance settings:

echo 624000 >/sys/devices/platform/sunxi-ddrfreq/devfreq/sunxi-ddrfreq/userspace/set_freq
for i in 3 2 1; do echo 1 >/sys/devices/system/cpu/cpu${i}/online; done
ethtool -s eth0 speed 1000 duplex full
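If the switching should happen on a schedule rather than from the backup jobs themselves, two crontab entries are enough. The times and script paths below are made up for illustration -- wrap the two snippets above in scripts of your own:

```
# m h dom mon dow  command
55 1 * * *  /usr/local/bin/h3-performance.sh   # ramp up shortly before the nightly backups
0  4 * * *  /usr/local/bin/h3-lowpower.sh      # return to the low-power settings afterwards
```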


Addendum: On H3 boards with Fast Ethernet (internal PHY in H3 vs. external RTL8211E PHY on the GbE capable boards) enabling/disabling Ethernet with the network cable connected makes a difference of 200 mW. Doing the same on GbE boards (BPi M2+, OPi Plus, Plus 2 or Plus 2E -- tested with the last one) saves 570 mW. This can also be done using our new h3consumption tool, but then each change requires a reboot.

 

But if the driver has been built as a module then enabling/disabling Ethernet can be done from userspace just like with WiFi modules:

ifconfig eth0 down && rmmod -f sunxi_gmac             # disable Ethernet
modprobe sunxi_gmac && sleep 0.5 && ifconfig eth0 up  # enable Ethernet again

This change requires our build system: change CONFIG_SUNXI_GETH=y to CONFIG_SUNXI_GETH=m in config/kernel/sunxi linux-sun8i-default.config and of course add 'sunxi_gmac' to /etc/modules if the driver should be loaded at startup.

 

To summarize the idle consumption savings possible with h3consumption:

  • Disabling GPU/HDMI for true headless operation -210 mW: -g off
  • On Fast Ethernet boards -200 mW by disabling Ethernet: -e off (-570 mW on GbE boards)
  • On GbE boards -370 mW by switching from Gbit to Fast Ethernet: -e fast (without effect on Fast Ethernet devices)
  • Lower DRAM clockspeed to 408 MHz -150 mW: -d 408
  • Disable all USB ports -125mW: -u off (unfortunately 'all or nothing')
  • Limit count of CPU cores to 1 -30 mW: -c 1 (affects only peak/full consumption)
  • Limit maximum cpufreq to eg. 912 MHz 0 mW: -m 912 (affects only peak/full consumption)


Just disabling GPU/HDMI for headless use works great on OPi One/Lite. A customized x2go LXDE desktop on an OPi One accessed via WiFi or LAN uses just 85 MB RAM and "idles" around 900mW (LAN) - 1200mW (WLAN). That's just fine to run the boards from dirt cheap 18650 power bank hacks with passthrough charging.


I'm going to investigate this stuff tomorrow, but I wanted to ask about my main usage case. I have an OPi PC running a chess engine 24/7, on two cores at the moment, but it could use 4 soon. Anyway, it does a LOT of idling, so I'd like to reduce power there if possible, but obviously when a game starts it can require 100% DRAM and CPU for long periods. It doesn't think during the opening, and occasionally (due to the nature of chess engines) it is unable to come up with a move to ponder while waiting for the opponent. So it does sometimes go back to idle during a game, just not for long.

 

I don't think a cron solution for DRAM will work in this scenario, but clearly reducing DRAM speed is desirable if it can kick back up when needed. I don't need any USB ports active, though it's in a location where it's rarely useful to use the box for fdisk or a TTL connection via screen. HDMI is off, but I don't think it's done in the fex file. It gets not much network activity, just a relatively constant trickle.

 

What do you think?


What do you think?

 

Please read through the savings numbers above and post #9 here and try to get the 'big picture'. All the consumption numbers are measured between PSU and board, so they don't contain the consumption wasted by the PSU itself. So assume a crappy PSU that takes 3W from the wall while your Orange Pi PC idles.

 

With Armbian default settings it will idle at ~1200 mW; by disabling GPU/HDMI you get below 1000 mW, and if you start to adjust DRAM clockspeed further savings are possible (going as low as 408 MHz saves just another 150 mW). Whether DRAM clockspeed influences your application I don't know -- maybe you don't know either, since you never tested it? In case it has a benchmark mode, adjust DRAM clockspeed and re-test (look at the bottom of the linked thread to see how the moronic sysbench is affected by DRAM clockspeed -- not at all -- and how cpuminer is -- pretty much).

 

Since your device has only Fast Ethernet but requires network access, there's nothing to save on that front. You could save ~200 mW by disabling GPU/HDMI, which will also increase performance if applications depend on memory bandwidth. And by adjusting DRAM clockspeed you might get another ~150 mW less, which will slightly affect memory bandwidth in the other direction.

 

So there's not much to do regarding idle consumption in your setup: even with 300 mW less (25 percent savings compared to defaults), with a crappy PSU the whole setup will just consume 10 percent less since the PSU wastes more energy than the board.

 

For server use cases the rule of thumb is simple: Disable GPU/HDMI if not needed, this saves consumption and increases performance. GbE boards might switch to Fast Ethernet in idle periods, which is responsible for huge overall savings (board + switch port, if more recent Gbit Ethernet switches are used). All the other stuff is more or less extreme/experimental and for IoT use cases (e.g. limiting DRAM clockspeed to 132 MHz, or the count of active CPU cores or max cpufreq to also limit maximum consumption)



One funny thing with the BPi M2+ is the drop in CPU temp when you lower ... Ethernet speed from 1000 to 100 Mbit/s: something like 1.5°C if the board is in open air and 3°C in a closed case (conditions: no heatsink, aluminium case, 41°C at idle, 26°C ambient).

 

So I would say the PCB is responsible for half of it.

 

N.B. The effect is probably proportionally less important when temps rise to 60°C+ under active computing.

