Add armbian-test-reliability package to our repo?



On Orange Pi One, Lite and NanoPi M1 we currently have the ability to let the VDD_CPUX voltage switch between 1.1V and 1.3V. With our current settings we switch from 1.1V to 1.3V at 912 MHz CPU clockspeed. The higher the clockspeed, the higher VDD_CPUX (the voltage the CPU cores are fed with) has to be to let the SoC work reliably. But on the other hand, the higher VDD_CPUX is set, the hotter the SoC gets and the earlier throttling kicks in.

 
So knowing at which clockspeed H3 still works reliably at the lowest possible VDD_CPUX voltage (in the case of the aforementioned boards only 1.1V or 1.3V are available) can help with overall performance since throttling gets more efficient.
 
Some users reported that they can run their OPi Ones with modified settings at 1200 MHz while staying at 1.1V, and I just successfully tested 1344 MHz at 1.3V. Since it's known that these thresholds depend on the SoC/board in question, I quickly evaluated how to provide a tool that would enable users of OPi One, Lite and NanoPi M1 to check their board's limits by running a set of tests in an unattended way to determine

 

  • highest clockspeed at which the SoC works reliably at the lower VDD_CPUX voltage
  • highest clockspeed at which the SoC works reliably at the upper VDD_CPUX voltage
The problem here is reliability. Most overclockers don't care: they're happy to fry their boards idling around at insanely high clockspeeds (plain moronic, since when the board is idle clockspeeds and voltages can be adjusted to the minimum) and simply don't give a shit that when they put some load on their devices things get unstable, data corruption occurs, or tasks or even the whole system crash/hang.
 
What we try to provide are sane settings that don't affect reliability but provide better performance. Unfortunately this requires some amount of testing (and also adjusted settings, a heatsink and maybe even a fan). The old approach is outlined here: https://linux-sunxi.org/Hardware_Reliability_Tests#CPU (most people forget the most important part: run cpuburn-a7 in parallel on all CPU cores, since otherwise test results are 100% worthless)
 
In the meantime we found another tool that is pretty cool for detecting undervoltage situations: a highly optimized Linpack build: https://github.com/longsleep/build-pine64-image/pull/3#issuecomment-195928362 (with this approach we already optimised dvfs/cpufreq settings for Pine64 three months ago)
 
So all that's needed is an adjusted Armbian image (or tools)
  • that contains this specific Linpack build to run the tests
  • comes with adjusted THS settings to prevent throttling
  • comes with adjusted dvfs settings to stay as long as possible at the lower VDD_CPUX voltage
  • walks in an automated fashion through these clockspeeds: 912000 960000 1008000 1056000 1104000 1152000 1200000 1248000 1296000 1344000
  • checks /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state prior to a test and afterwards -- if anything changed then throttling occurred and the results have to be thrown away (warning message to the user: improve heat dissipation, otherwise the test is useless)
  • checks Linpack output for data corruption
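The walk-through and both checks from the list above could be sketched roughly like this (only a sketch: `run_linpack` is a hypothetical helper that must return non-zero when Linpack's output shows residual errors, and `CPUFREQ_DIR` can be overridden so the control flow can be dry-run off-board):

```shell
#!/bin/bash
# Sketch of the unattended test loop. run_linpack is a hypothetical
# helper that must return non-zero when Linpack reports data corruption.
CPUFREQ_DIR="${CPUFREQ_DIR:-/sys/devices/system/cpu/cpu0/cpufreq}"

test_clockspeeds() {
    for cpufreq in 912000 960000 1008000 1056000 1104000 1152000 1200000 1248000 1296000 1344000; do
        echo "$cpufreq" > "$CPUFREQ_DIR/scaling_max_freq"
        # snapshot time_in_state minus the frequency under test
        # (that counter grows by design while the test runs)
        grep -v "^$cpufreq " "$CPUFREQ_DIR/stats/time_in_state" > /tmp/tis.before
        if ! run_linpack; then
            echo "corruption at $cpufreq"   # first fail: adjust LV2_freq, reboot, retest at 1.3V
            return 1
        fi
        if ! grep -v "^$cpufreq " "$CPUFREQ_DIR/stats/time_in_state" | cmp -s /tmp/tis.before -; then
            echo "throttled at $cpufreq"    # results worthless: improve heat dissipation
            return 2
        fi
        echo "passed $cpufreq"
    done
}
```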
If Linpack reports corrupted data, then that specific clockspeed is known to not work reliably with just 1.1V. At this moment the dvfs settings have to be adjusted to the CPU clockspeed of the last working test, since we need some safety headroom in script.bin (the LV2_freq parameter as can be seen below), then the board has to be rebooted, and the current and the remaining clockspeeds have to be tested with 1.3V VDD_CPUX. When Linpack again detects an error, the maximum clockspeed this H3 can be driven at is also known.
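Picking the LV2_freq value out of the list of tested clockspeeds is simple bookkeeping; as a sketch (the function name `last_good_freq` is made up):

```shell
#!/bin/bash
# Sketch: given the first clockspeed (in kHz) at which Linpack reported
# corruption, emit the preceding one converted to Hz -- that is the
# value to write into script.bin as LV2_freq (safety headroom).
FREQS="912000 960000 1008000 1056000 1104000 1152000 1200000 1248000 1296000 1344000"

last_good_freq() {
    prev=""
    for f in $FREQS; do
        if [ "$f" = "$1" ]; then
            echo "${prev}000"   # kHz -> Hz
            return
        fi
        prev=$f
    done
}
```

With a first fail at 1056 MHz, `last_good_freq 1056000` prints 1008000000, matching the example that follows.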
 
So in case the first fail happens at 1056 MHz, then 'LV2_freq = 1008000000' has to be set in script.bin. After exchanging that, rebooting and testing again: if the next fail happens at 1344 MHz, then the last passing CPU clockspeed is the maximum clockspeed this H3 should be allowed to run at. So the aforementioned example results would allow the user to apply these three modifications to the productive script.bin he normally uses:
max_freq = 1296000000
LV1_freq = 1296000000
LV2_freq = 1008000000
And in /etc/default/cpufrequtils this can be set:
MAX_SPEED=1296000
(all other settings remain the same; it's just the adaptation based on this device-dependent test, which showed that this specific H3 becomes unstable with 1.1V at 1056 MHz and with 1.3V at 1344 MHz. The normal throttling settings aren't affected either, since only the reliability test itself uses the adjusted THS settings)
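Applying those three fex modifications plus the cpufrequtils change could be scripted like this (a sketch; `apply_results` is a made-up name, and the file paths are parameters so the logic can be dry-run on copies):

```shell
#!/bin/bash
# Sketch: patch the decompiled .fex and /etc/default/cpufrequtils with
# the per-board results. $3 = max reliable clockspeed in Hz (1.3V),
# $4 = last clockspeed that passed at 1.1V, in Hz (becomes LV2_freq).
apply_results() {
    fex="$1"; cpufrequtils="$2"; max_hz="$3"; lv2_hz="$4"
    sed -i -e "s/^max_freq = .*/max_freq = $max_hz/" \
           -e "s/^LV1_freq = .*/LV1_freq = $max_hz/" \
           -e "s/^LV2_freq = .*/LV2_freq = $lv2_hz/" "$fex"
    # cpufrequtils wants kHz, the fex file wants Hz
    sed -i "s/^MAX_SPEED=.*/MAX_SPEED=$((max_hz / 1000))/" "$cpufrequtils"
}
```

In reality the edited .fex still has to be compiled back with fex2bin afterwards (see the bin2fex/fex2bin steps further down).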
 
Different boards will fail at different clockspeeds (already confirmed! When we started to support OPi One in February I accidentally deactivated switching to the higher VDD_CPUX voltage, and several users reported that they got into trouble simply booting the board since kernel panics happened early in the boot stage!). So this is a tool that has to run on each board individually.
 
Prerequisites:
 
Replace these three sections in script.bin with the following:
[ths_para]
ths_used = 1
ths_trip1_count = 6
ths_trip1_0 = 95
ths_trip1_1 = 96
ths_trip1_2 = 97
ths_trip1_3 = 98
ths_trip1_4 = 99
ths_trip1_5 = 105
ths_trip1_6 = 0
ths_trip1_7 = 0
ths_trip1_0_min = 0
ths_trip1_0_max = 1
ths_trip1_1_min = 1
ths_trip1_1_max = 2
ths_trip1_2_min = 2
ths_trip1_2_max = 3
ths_trip1_3_min = 3
ths_trip1_3_max = 4
ths_trip1_4_min = 4
ths_trip1_4_max = 5
ths_trip1_5_min = 5
ths_trip1_5_max = 7
ths_trip1_6_min = 0
ths_trip1_6_max = 0
ths_trip2_count = 1
ths_trip2_0 = 105

[cooler_table]
cooler_count = 8
cooler0 = "1344000 4 4294967295 0"
cooler1 = "912000 4 4294967295 0"
cooler2 = "768000 4 4294967295 0"
cooler3 = "648000 4 4294967295 0"
cooler4 = "480000 4 4294967295 0"
cooler5 = "480000 3 4294967295 0"
cooler6 = "480000 2 4294967295 0"
cooler7 = "480000 1 4294967295 0"

[dvfs_table]
pmuic_type = 1
pmu_gpio0 = port:PL06<1><1><2><1>
pmu_level0 = 11300
pmu_level1 = 1100
max_freq = 1344000000
min_freq = 480000000
LV_count = 5
LV1_freq = 1344000000
LV1_volt = 1300
LV2_freq = 1296000000
LV2_volt = 1100
LV3_freq = 912000000
LV3_volt = 1100
LV4_freq = 648000000
LV4_volt = 1100
LV5_freq = 480000000
LV5_volt = 1100
Can be done by
  • bin2fex /boot/bin/orangepilite.bin /boot/bin/orangepilite_test.fex
  • [edit /boot/bin/orangepilite_test.fex]
  • fex2bin /boot/bin/orangepilite_test.fex /boot/bin/orangepilite_test.bin
  • ln -sf /boot/bin/orangepilite_test.bin /boot/script.bin
Then replace /etc/default/cpufrequtils with
ENABLE=true
MIN_SPEED=480000
MAX_SPEED=912000
GOVERNOR=performance
Package Linpack to be executed automatically.
 
Put the whole testing stuff either into check_first_login.sh so users just have to download this image, swap the SD card and let their board run the tests unattended. Or provide the whole thing as a tool that can be run on demand (downloading the Linpack package, adjusting script.bin, rebooting if necessary and adjusting the originally used script.bin with the new settings).
 
Important: When testing through the individual clockspeeds (912000 960000 1008000 1056000 1104000 1152000 1200000 1248000 1296000 1344000) the speed can be set with
echo $cpufreq >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
After the new clockspeed has been set, immediately save the contents of /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state, then run Linpack and compare again afterwards. If the contents differ, throttling happened and the results are worthless, since we wanted to test reliability at clockspeed X but throttling jumped in and in reality we partially tested a lower clockspeed.
 
Why do I write this here in developer forum?
  • Adjusting dvfs operating points on Orange Pi One, Lite or NanoPi M1 is easy since the voltage regulator here only switches between two voltages -- an easy place to start
  • IMO we should provide tools to improve the performance behaviour of our supported boards. Performance on nearly every more modern SoC we support is related to thermal issues. Throttling prevents CPU/GPU cores from running at full speed, and a high VDD_CPUX voltage makes this worse: the higher the voltage, the more heat, the more throttling, the less performance
  • VDD_CPUX voltage set too low affects reliability. Since the cheap SoCs we deal with aren't 'factory tested' that much, it's an individual job for the SBC's owner to test through possible dvfs operating points to determine the best voltage settings (as low as possible -- but so that reliability isn't affected)
  • The whole testing process when started from scratch is really time consuming (e.g. compiling the necessary Linpack version) and might be frustrating (please read through https://github.com/longsleep/build-pine64-image/pull/3 to get an idea how long it took to convince other devs to test correctly and to refrain from unlocking too high clockspeeds)
So it would be great if we as the Armbian team could provide a test image or bundled tools that ease this process. Nearly all more recent boards we support suffer from throttling issues, and enabling our users to fine-tune dvfs settings using Armbian also automagically means 'more performance' with full-load workloads.
 
IMO it would be nice to add an additional armbian-test-reliability package to our repo containing
  • scripts/tools to help testing
  • Linpack as used for undervoltage checking (this and the next tool can also be used on ARMv8/64-bit systems since the 32-bit variants are more demanding!)
  • cpuminer, since this tool contains a benchmark mode that makes testing through different settings easier: you get the result of different dvfs/throttling settings directly as a performance ratio
Also, if someone claims this task, starting to play with such a limited voltage regulator as present on the little Oranges eases getting results quickly. It gets a bit more complicated if we want to provide the same for the larger Oranges that use the SY8106A voltage regulator, which can adjust VDD_CPUX in 20mV steps (or any other board that would also benefit from 'per board optimized dvfs operating points').