NanoPi M4V2 randomly crashes



I have had no crashes for 5 days.

 

Welcome to Armbian 20.08.3 Buster with Linux 5.8.10-rockchip64

 

No end-user support: built from trunk

 

System load: 20%           Up time:       5 days 1:18

Memory usage: 34% of 3.71G  Zram usage:    6% of 1.86G  IP:            172.21.0.1 192.168.42.31

CPU temp:      48°C           Usage of /:    17% of 458G   storage/:      6% of 916G   storage temp: Always°C           

 

$ cat /etc/default/cpufrequtils

ENABLE=true

MIN_SPEED=600000

MAX_SPEED=1800000

GOVERNOR=performance
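Since this file is a plain shell-variable fragment (it gets sourced by the cpufrequtils init script), its values can be sanity-checked by sourcing a copy of it; a minimal sketch using the values above:

```shell
# Sketch: sanity-check a cpufrequtils-style config by sourcing a copy of it.
# A temp file mirrors the values above, so the real /etc/default/cpufrequtils
# is left untouched.
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
ENABLE=true
MIN_SPEED=600000
MAX_SPEED=1800000
GOVERNOR=performance
EOF

. "$cfg"                               # the init script does the same
[ "$MIN_SPEED" -le "$MAX_SPEED" ] || echo "error: MIN_SPEED exceeds MAX_SPEED"
echo "governor=$GOVERNOR range=$MIN_SPEED-$MAX_SPEED"
rm -f "$cfg"
```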


3 hours ago, aprayoga said:

We are still testing on Helios64 (with value 40000), so far with reboot and power cycle does not trigger any kernel crash.

 

 

Hi @aprayoga, thank you for your comments. I will check on this, though at the moment I don't know how to apply that 40000 value (I assume this is not a simple change to cpufrequtils!)

9 hours ago, aprayoga said:

We are still testing on Helios64 (with value 40000), so far with reboot and power cycle does not trigger any kernel crash.

@aprayoga Fingers crossed! :)

I remember playing with "regulator-ramp-delay" on the M4V2 before (after noticing slow big cpu cluster transitions), but I probably did not go that high, definitely did not see the post you mentioned, and was not successful.

I started some tests with 40000 right now.

 

@Pedro Lamas if you want to also try testing it...

Save the overlay below into a file on your M4V2 (let's name it ramp-delay-test.dts), run "sudo armbian-add-overlay ramp-delay-test.dts", and reboot your M4V2.

/dts-v1/;
/plugin/;

/ {
        compatible = "rockchip,rk3399";

        fragment@0 {
                target = <&vdd_cpu_b>;

                __overlay__ {
                        regulator-ramp-delay = <40000>;
                };
        };
};
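After rebooting, one way to check whether the new ramp delay propagated is to compare cpuinfo_transition_latency across all cores. A sketch (the `latencies` helper and the CPU_BASE variable are my own, parameterized only so the loop can be exercised against a mock tree; the exact values you see depend on kernel and board):

```shell
# Sketch: print cpuinfo_transition_latency per cpu; after the overlay above,
# the big-cluster values should change along with the regulator ramp delay.
CPU_BASE="${CPU_BASE:-/sys/devices/system/cpu}"

latencies() {
  for f in "$CPU_BASE"/cpu[0-9]*/cpufreq/cpuinfo_transition_latency; do
    [ -r "$f" ] || continue
    cpu=$(basename "$(dirname "$(dirname "$f")")")
    printf '%s %s\n' "$cpu" "$(cat "$f")"
  done
}

latencies
```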

 


Thanks for sharing that @piter75, following your instructions I've now added the overlay and set the governor back to "ondemand" with min 600000 and max 1800000.

 

I'll keep an eye on it and report back any crash - hopefully not though!!

 

Entering day 3 with "ondemand" governor, min 600000 and max 1800000, and the custom overlay @piter75 provided... no issues at all!! I think we're on to something here!! :)

 

I just woke up to find my M4V2 had crashed during the night... :(

 

As I had to manually reboot it, I upgraded the firmware via armbian-config and rebooted it again to make sure I'm using the latest available.

 

Another crash just now while I was pulling some images from Docker Hub...

 

Message from syslogd@localhost at Oct  7 18:13:28 ...
 kernel:[85530.199864] Internal error: Oops: 96000044 [#1] PREEMPT SMP

Message from syslogd@localhost at Oct  7 18:13:28 ...
 kernel:[85530.221095] Code: f94006e1 f9403fe2 f90004e1 d37cf400 (f9000027)

 


Good evening

My NanoPi M4V2 also suffers from frequent errors, both with 1800 and 2000 MHz max frequency. Minimum is 408 MHz and the governor is set to ondemand.

This is the message I got on the terminal (ssh connection) a few minutes ago:

Message from syslogd@nanoNas at Oct  9 19:35:18 ...
 kernel:[ 4724.003648] Internal error: Oops: 96000004 [#1] PREEMPT SMP

Message from syslogd@nanoNas at Oct  9 19:35:18 ...
 kernel:[ 4724.024320] Code: 9b355f82 91002060 8b02031b d503201f (f9401321)

Luckily enough, it did not break the ssh connection and I could recover some trace (end of dmesg):
 

[ 4724.003620] Unable to handle kernel paging request at virtual address 0000000080000020
[ 4724.003625] Mem abort info:
[ 4724.003627]   ESR = 0x96000004
[ 4724.003630]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 4724.003632]   SET = 0, FnV = 0
[ 4724.003633]   EA = 0, S1PTW = 0
[ 4724.003634] Data abort info:
[ 4724.003636]   ISV = 0, ISS = 0x00000004
[ 4724.003637]   CM = 0, WnR = 0
[ 4724.003641] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000c1c57000
[ 4724.003643] [0000000080000020] pgd=0000000000000000, p4d=0000000000000000
[ 4724.003648] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 4724.004147] Modules linked in: governor_performance zstd zram snd_soc_hdmi_codec snd_soc_rt5651 rc_cec dw_hdmi_i2s_audio dw_hdmi_cec snd_soc_rl6231 snd_soc_simple_card rockchip_vdec(C) rockchip_rga snd_soc_rockchip_spdif panfrost hantro_vpu(C) snd_soc_simple_card_utils v4l2_h264 videobuf2_dma_contig snd_soc_rockchip_i2s videobuf2_dma_sg v4l2_mem2mem videobuf2_vmalloc videobuf2_memops snd_soc_core videobuf2_v4l2 btsdio videobuf2_common snd_pcm_dmaengine videodev mc hci_uart gpu_sched rockchipdrm fusb302 dw_mipi_dsi tcpm dw_hdmi typec analogix_dp brcmfmac drm_kms_helper snd_pcm brcmutil btqca btrtl snd_timer btbcm cec btintel rc_core bluetooth snd soundcore cfg80211 drm rfkill sg drm_panel_orientation_quirks cpufreq_dt nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek dwmac_rk stmmac_platform stmmac mdio_xpcs
[ 4724.011464] CPU: 5 PID: 12945 Comm: md0_raid5 Tainted: G         C        5.8.13-rockchip64 #20.08.8
[ 4724.012261] Hardware name: FriendlyElec NanoPi M4 Ver2.0 (DT)
[ 4724.012765] pstate: 00000005 (nzcv daif -PAN -UAO BTYPE=--)
[ 4724.013269] pc : raid_run_ops+0x530/0x14c8 [raid456]
[ 4724.013707] lr : raid_run_ops+0x518/0x14c8 [raid456]
[ 4724.014142] sp : ffff800017e9ba10
[ 4724.014433] x29: ffff800017e9ba10 x28: 0000000000000003
[ 4724.014899] x27: ffff0000cb9750f8 x26: 0000000000000000
[ 4724.015364] x25: 0000000080000000 x24: ffff0000cb974b00
[ 4724.015830] x23: 00000000000001f0 x22: ffff0000cb974b68
[ 4724.016296] x21: 0000000000000158 x20: ffff0000cb974f08
[ 4724.016761] x19: fffffdffbff2ee38 x18: ffff0000e5bbd688
[ 4724.017226] x17: 0000000000000001 x16: 0000000000000af8
[ 4724.017691] x15: 2ab14ab4b2e69eb3 x14: 0000579552bdc338
[ 4724.018156] x13: 0000000000000183 x12: 0000000000000190
[ 4724.018621] x11: 000fffffffffffff x10: 0000000000000004
[ 4724.019086] x9 : ffff0000f77b0590 x8 : ffff0000f77afbc0
[ 4724.019551] x7 : 0000000000000001 x6 : ffff0000e0a79a10
[ 4724.020016] x5 : 0000000000000228 x4 : 0000000000000158
[ 4724.020481] x3 : 0000000000000000 x2 : 00000000000005f8
[ 4724.020946] x1 : ffff0000e32b8000 x0 : 0000000000000008
[ 4724.021412] Call trace:
[ 4724.021636]  raid_run_ops+0x530/0x14c8 [raid456]
[ 4724.022044]  handle_stripe+0x7c0/0x1f08 [raid456]
[ 4724.022461]  handle_active_stripes.isra.0+0x3a4/0x4d8 [raid456]
[ 4724.022982]  raid5d+0x300/0x5b0 [raid456]
[ 4724.023346]  md_thread+0x9c/0x188 [md_mod]
[ 4724.023715]  kthread+0x118/0x150
[ 4724.024002]  ret_from_fork+0x10/0x34
[ 4724.024320] Code: 9b355f82 91002060 8b02031b d503201f (f9401321)
[ 4724.024855] ---[ end trace 29dfe51ce6a12d3a ]---
[ 4724.025320] ------------[ cut here ]------------
[ 4724.025738] WARNING: CPU: 5 PID: 12945 at kernel/exit.c:720 do_exit+0x3c/0xa18
[ 4724.026368] Modules linked in: governor_performance zstd zram snd_soc_hdmi_codec snd_soc_rt5651 rc_cec dw_hdmi_i2s_audio dw_hdmi_cec snd_soc_rl6231 snd_soc_simple_card rockchip_vdec(C) rockchip_rga snd_soc_rockchip_spdif panfrost hantro_vpu(C) snd_soc_simple_card_utils v4l2_h264 videobuf2_dma_contig snd_soc_rockchip_i2s videobuf2_dma_sg v4l2_mem2mem videobuf2_vmalloc videobuf2_memops snd_soc_core videobuf2_v4l2 btsdio videobuf2_common snd_pcm_dmaengine videodev mc hci_uart gpu_sched rockchipdrm fusb302 dw_mipi_dsi tcpm dw_hdmi typec analogix_dp brcmfmac drm_kms_helper snd_pcm brcmutil btqca btrtl snd_timer btbcm cec btintel rc_core bluetooth snd soundcore cfg80211 drm rfkill sg drm_panel_orientation_quirks cpufreq_dt nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek dwmac_rk stmmac_platform stmmac mdio_xpcs
[ 4724.033673] CPU: 5 PID: 12945 Comm: md0_raid5 Tainted: G      D  C        5.8.13-rockchip64 #20.08.8
[ 4724.034470] Hardware name: FriendlyElec NanoPi M4 Ver2.0 (DT)
[ 4724.034974] pstate: 80000005 (Nzcv daif -PAN -UAO BTYPE=--)
[ 4724.035465] pc : do_exit+0x3c/0xa18
[ 4724.035773] lr : die+0x204/0x248
[ 4724.036056] sp : ffff800017e9b690
[ 4724.036348] x29: ffff800017e9b690 x28: ffff0000e32b8000
[ 4724.036813] x27: ffff0000cb9750f8 x26: 0000000000000000
[ 4724.037278] x25: 0000000080000000 x24: 0000000000000000
[ 4724.037743] x23: ffff0000e32b8000 x22: 0000000000000001
[ 4724.038208] x21: ffff800017e9b7a7 x20: 000000000000000b
[ 4724.038673] x19: ffff0000e32b8000 x18: 0000000000000010
[ 4724.039138] x17: 0000000000000001 x16: 0000000000000af8
[ 4724.039603] x15: ffff0000e32b84a8 x14: 0720072007200720
[ 4724.040068] x13: 0720072007200720 x12: 0720072007200720
[ 4724.040533] x11: 0720072007200720 x10: 0720072007200720
[ 4724.040997] x9 : 0720072007200720 x8 : 0720072007200720
[ 4724.041462] x7 : 0000000000000303 x6 : ffff0000f2e00f00
[ 4724.041927] x5 : 0000000000000001 x4 : ffff0000f77bc1d0
[ 4724.042392] x3 : 0000000000000000 x2 : 0000000000000000
[ 4724.042857] x1 : ffff0000f12bae48 x0 : ffff800017e9bdb0
[ 4724.043322] Call trace:
[ 4724.043540]  do_exit+0x3c/0xa18
[ 4724.043816]  die+0x204/0x248
[ 4724.044073]  die_kernel_fault+0x64/0x78
[ 4724.044411]  __do_kernel_fault+0x88/0x138
[ 4724.044764]  do_page_fault+0x198/0x468
[ 4724.045094]  do_translation_fault+0x64/0x88
[ 4724.045461]  do_mem_abort+0x40/0xa0
[ 4724.045769]  el1_sync_handler+0x104/0x110
[ 4724.046121]  el1_sync+0x7c/0x100
[ 4724.046417]  raid_run_ops+0x530/0x14c8 [raid456]
[ 4724.046825]  handle_stripe+0x7c0/0x1f08 [raid456]
[ 4724.047240]  handle_active_stripes.isra.0+0x3a4/0x4d8 [raid456]
[ 4724.047760]  raid5d+0x300/0x5b0 [raid456]
[ 4724.048125]  md_thread+0x9c/0x188 [md_mod]
[ 4724.048488]  kthread+0x118/0x150
[ 4724.048773]  ret_from_fork+0x10/0x34
[ 4724.049087] ---[ end trace 29dfe51ce6a12d3b ]---
[ 4724.049518] note: md0_raid5[12945] exited with preempt_count 1

 

And this is the version of Armbian I use:

Welcome to Armbian 20.08.9 Buster with Linux 5.8.13-rockchip64

 

I just copy the error part of dmesg but if necessary, I can provide the full output. Hope it helps


I'm hopeful we'll eventually resolve this issue. At least we can be fairly certain it's a software thing given legacy kernels are rock solid (pardon the pun).

 

One noteworthy thing: a Rock Pi 4 I have running Gentoo with a mainline kernel (5.8) is very stable, with uptimes of over a month.

 

The hardware is virtually identical (including the aforementioned regulator) and the default ramp delay is the same as on the M4V2:

 

regulator-ramp-delay = <1000>;

 

That it's a memory setup issue is still my best guess.


I got the very same message during the night, although I set the maximum frequency to 1608 MHz:

# cat /etc/default/cpufrequtils
ENABLE=true
MIN_SPEED=408000
MAX_SPEED=1608000
GOVERNOR=ondemand

The output of dmesg is also very similar to the one from yesterday evening.

 

I am ready to do some testing if it can help solve the issue. Just let me know what to try.


@piter75 having made the overlay change you suggested (be aware I'm now using the "userspace" governor fixed at 1200000), shouldn't cpu4 and cpu5 also return 40000 instead of the current 52000?

 

pedro@nanopim4v2:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency
40000
pedro@nanopim4v2:~$ cat /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_transition_latency
40000
pedro@nanopim4v2:~$ cat /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_transition_latency
40000
pedro@nanopim4v2:~$ cat /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_transition_latency
40000
pedro@nanopim4v2:~$ cat /sys/devices/system/cpu/cpu4/cpufreq/cpuinfo_transition_latency
52000
pedro@nanopim4v2:~$ cat /sys/devices/system/cpu/cpu5/cpufreq/cpuinfo_transition_latency
52000

 

Here's the full output for cpu4:

 

pedro@nanopim4v2:~$ pushd /sys/devices/system/cpu/cpu4/cpufreq && paste <(ls *) <(cat *) && popd
/sys/devices/system/cpu/cpu4/cpufreq ~
affected_cpus   4 5
cpuinfo_cur_freq        1200000
cpuinfo_max_freq        2016000
cpuinfo_min_freq        408000
cpuinfo_transition_latency     52000
related_cpus    4 5
scaling_available_frequencies  408000 600000 816000 1008000 1200000 1416000 1608000 1800000 2016000
scaling_available_governors    conservative userspace powersave ondemand performance schedutil
scaling_cur_freq        1200000
scaling_driver  cpufreq-dt
scaling_governor        userspace
scaling_max_freq        1200000
scaling_min_freq        1200000
scaling_setspeed        1200000
cat: stats: Is a directory

stats:
reset
time_in_state
total_trans
trans_table
~

 


So having the governor set to "userspace" with min and max speed set to "1200000" seems to make it completely stable.
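For reference, that pinning can be scripted. A sketch (assumes root; the calls are commented out, and `pin_freq` plus the SYSFS_BASE variable are my own, parameterized only so the write logic can be dry-run against a mock tree):

```shell
# Sketch: pin one cpufreq policy to a single frequency via the userspace
# governor, as described above (policy0 = LITTLE cluster, policy4 = big).
SYSFS_BASE="${SYSFS_BASE:-/sys/devices/system/cpu/cpufreq}"

pin_freq() {  # usage: pin_freq policy0 1200000
  p="$SYSFS_BASE/$1"
  echo userspace > "$p/scaling_governor"
  echo "$2" > "$p/scaling_max_freq"    # write max first in case min is being raised
  echo "$2" > "$p/scaling_min_freq"
  echo "$2" > "$p/scaling_setspeed"
}

# pin_freq policy0 1200000
# pin_freq policy4 1200000
```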

 

Any idea of what we can try next?

 

As the Helios64 shares the same CPU, I wonder whether it is running more stably than the M4V2... if it is, any chance of comparing the two and making the M4V2 settings match the ones on the Helios64?


I might as well chime in here... I've been battling this problem for nearly a year, all the while periodically checking various threads. I've gone through many hardware and software iterations to troubleshoot. I'm just going to dump everything up to this point; apologies for the long post.

 

I'm willing to try any builds, configs, and pretty much anything else anyone can think of, short of soldering, to figure out what is going on.

 

Setup:

  • 1x nanopi m4v2 w/ 4x SATA hat (friendlyelec) powered by a brick/barrel connector "ALITOVE 12V 8A 96W"
  • 3x nanopi m4v2 w/ 5v 4a power supplies from friendlyelec (tried multiple cables, currently using "UGREEN USB C Cable 5A Supercharge Type C to USB A" on all 3)
  • 1x nanopi m4 (v1) w/ 5v 4a power supply from friendlyelec (same cable: ugreen 5a type c to usb a)
  • All have dedicated 32gb eMMC
  • All have the friendlyelec heat sink, some with the factory pad, and some with a 20x20x1.2mm copper spacer + noctua thermal compound; copper spacer was added during a later iteration (FYI: there is no difference between stock blue pad from friendlyelec and the copper spacer + noctua, save your money!)
  • All have noctua NF-A6x25 5v 60mm fans (active cooling, powered via GPIO VDD_5V pin 2 + GND); this active cooling was added during a later iteration (FYI: definitely need active cooling, idles 10C cooler, and under load was something like 20-30C cooler, I'll have to post actual numbers another day)
  • All on the same circuit, shared w/ my ubiquiti networking gear (no problems with this other hardware)

Software/OS combinations I've tried:

  1. Stock armbian bionic server w/ 5.x kernel (prior to focal release)
  2. Stock armbian focal server w/ 5.x kernel
  3. Stock armbian focal server w/ custom compiled 5.x kernel
  4. Stock armbian focal server w/ custom compiled 5.x kernel + hacks to disable zram (I thought maybe kubernetes/docker was freaking out)
  5. Custom ubuntu focal arm cloud image, using armbian /boot and a custom 5.x kernel (frankenstein, I know, getting desperate)
  6. Custom ubuntu bionic arm cloud image, using armbian /boot and a custom kernel 5.x
  7. Custom ubuntu bionic arm cloud image, using armbian /boot and a custom kernel 4.4.213

 

Currently on the 4.4 + bionic cloud image, uptime is 1 day 20 minutes, no crashes. I'll give it a few more days to see what happens. (.20 below was giving me unrelated trouble, got a late start)

2020-10-17T23:50:20+0000
192.168.68.20: 23:50:22 up  1:25,  1 user,  load average: 0.11, 0.07, 0.01
192.168.68.21: 23:50:25 up 1 day, 24 min,  0 users,  load average: 0.77, 0.72, 0.72
192.168.68.22: 23:50:27 up 1 day, 24 min,  0 users,  load average: 0.73, 0.80, 0.81
192.168.68.23: 23:50:29 up 1 day, 24 min,  0 users,  load average: 0.68, 0.74, 0.75
192.168.68.24: 23:50:32 up 1 day, 24 min,  0 users,  load average: 0.83, 0.79, 0.82

 

On the 5.x kernel variations, all 5 nodes would end up in a crashed/locked state within 7 days, typically less, regardless of load. At this point I just let them idle and run a simple ssh command every second to execute uptime. Load average is 0.7 to 0.9. This current iteration (#7) is the first one with kernel 4.4.

 

Although I didn't check the governor profile until now, 4.4 shows "interactive" on all 6 cpus; here's some other cpu info too:

$ cat /sys/devices/system/cpu/cpu{0..5}/cpufreq/scaling_governor
interactive
interactive
interactive
interactive
interactive
interactive

$ cat /sys/devices/system/cpu/cpu{0..5}/cpufreq/cpuinfo_transition_latency 
40000
40000
40000
40000
465000
465000

$ sudo cat /sys/devices/system/cpu/cpu{0..5}/cpufreq/cpuinfo_{cur,min,max}_freq
408000
408000
1416000
408000
408000
1416000
408000
408000
1416000
408000
408000
1416000
816000
408000
1800000
816000
408000
1800000

 

Extra info:

  • Kernel configs are version controlled, can supply upon request
  • Kernel compilation and imaging are all scripted, and I'm set up for armbian dev (one day I'll put my custom scripts out on GH)
  • My u-boot experience is garbage
  • 2x UART CP2102 and 1x FTDI FT232RL IC usb modules are at my disposal
  • Got some microSD cards sitting around here somewhere...
  • I use cloud-init nocloud for config/first-boot seeding
  • The m4v2 w/ the SATA hat sometimes runs w/ a USB ASIX AX88179 dongle in bonded mode, but that periodically causes an unrelated kernel panic on 4.4 and 5.x (but that's another topic altogether)

I found the same problem: my NanoPi M4V2 randomly panics. After reducing the CPU frequency, it seems to be fixed.

 

heiher@hev-mpc:~$ cat /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq 
1416000
heiher@hev-mpc:~$ cat /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq 
1800000

 


@Pedro Lamas I don't use that file for configuration; my config is equivalent to:

echo 1416000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
echo 1800000 > /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq
echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
echo performance > /sys/devices/system/cpu/cpufreq/policy4/scaling_governor

 

On 9/30/2020 at 2:34 PM, Pedro Lamas said:

I just noticed something while reading the RK3399 spec sheet: the recommended maximum frequency of the A72 is actually 1.8 GHz, not 2.0 GHz as stated on the FriendlyARM website and wiki!

 

The RK3399K does indicate a recommended maximum of 2.0 GHz, but that is not the version in use on the NanoPi M4V2.

 

The Rock Pi 4 uses the same RK3399 SoC, and they specifically state the frequency of the A72 as 1.8 GHz.

 

I even found a commit in the Armbian codebase for the Helios64 (another board with the same RK3399 SoC) where the maximum is set to 1.8 GHz: https://github.com/armbian/build/pull/2191

 

I will leave my board for a couple more days with the "userspace" governor and min and max set to 1008000, and if there are no crashes, I will try the "ondemand" governor with min set to 1008000 and max to 1800000.

 
Following up on the above, I just opened this issue on GitHub, suggesting that CPUMAX for the RK3399 be set to 1800000, following the manufacturer's specifications.


I don't think it is linked to the max frequency, as I already had the problem when the max frequency was set to 1.6 GHz.

As Pedro Lamas noticed, with min and max frequency set to 1.2 GHz my NanoPi ran 3 days without a bug.

From what I have observed, it might be a memory problem when the frequency changes, maybe some kind of desynchronisation?

It happens on any of the cores, big or LITTLE.

I upgraded to the latest version (5.8.16-rockchip64 #20.08.14 SMP PREEMPT Tue Oct 20 22:37:51), but the bug is still there :angry:

To see whether it depends on the frequency itself or on the change of frequency, I have set min and max to 2 GHz and will let it run like that for some time...

4 hours ago, JackR said:

From what I have observed, it might be a memory problem when the frequency changes, maybe some kind of desynchronisation?

 

Is it possible that the A53 (little) cores cannot run at the highest design frequency (1.51 GHz)? I configured the A53 at 1.42 GHz and the A72 at 1.8 GHz, which looks good. I also have a NanoPi M4V1 running the official mainline kernel 5.9.1, whose default configuration is 1.42 GHz + 1.8 GHz, and I have never seen stability problems.

[heiher@hev-sbc ~]$ uname -a
Linux hev-sbc 5.9.1 #1 SMP Sun Oct 18 17:37:19 CST 2020 aarch64 GNU/Linux

[heiher@hev-sbc ~]$ cat /sys/devices/system/cpu/cpufreq/*/scaling_available_frequencies
408000 600000 816000 1008000 1200000 1416000 
408000 600000 816000 1008000 1200000 1416000 1608000 1800000

 


Same here:

I can confirm that since the last update I have had random crashes and Wi-Fi drops (I posted a topic about it).

I just changed the cpufrequtils settings and the random crashes disappeared, but the Wi-Fi is still being lost.

But now with this configuration I suspect I get some stalls when the workload is too heavy or the temperature too high, I don't know.

For example, a simple apt update on a 60 KB file can pause the system for, let's say, 1 or 2 minutes, and then things are fast again.

Strange behaviour.

 

Welcome to Armbian 20.08.14 Buster with Linux 5.8.16-rockchip64

System load:   2%               Up time:       4 days 11:57        
Memory usage:  8% of 3.71G      IP:            192.168.0.59
CPU temp:      49°C               Usage of /:    52% of 3.5G       


# cat /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq

1512000

# cat /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq

2016000

# cat /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq

1512000

# cat /sys/devices/system/cpu/cpufreq/policy4/scaling_min_freq

2016000

 

With this config, my NanoPi M4V2 has now been running at full speed (LITTLE at 1512 MHz, big at 2016 MHz) for 4.5 days without a single crash. So it seems to confirm that the instability is not linked to the frequency itself but happens when the frequency changes.

 

As hev mentioned, it doesn't happen on the NanoPi M4V1, and the only difference between V1 and V2 is the memory type: DDR3 for V1 and DDR4 for V2. As the observed crashes are memory errors on any of the cores, whatever the max frequency is set to, it really looks like desynchronisation between the cores and memory when the frequency is changed by the governor.

 

I am not expert enough to go deeper into this analysis, but I hope someone more skilled can start from here and write a patch.
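A hypothetical starting point for testing that theory (my own sketch, not something anyone in this thread has run): force rapid transitions with the userspace governor and see whether crashes appear much sooner than when pinned to either frequency. Assumes root; the call is commented out, and the base path is parameterized only for dry runs:

```shell
# Hypothetical reproducer sketch: toggle a cluster between two OPPs to force
# many voltage/frequency transitions. If the desynchronisation theory holds,
# an affected board should fail much sooner than when pinned to one value.
SYSFS_BASE="${SYSFS_BASE:-/sys/devices/system/cpu/cpufreq}"

cycle_freq() {  # usage: cycle_freq policy4 408000 1800000 <iterations>
  p="$SYSFS_BASE/$1"
  i=0
  while [ "$i" -lt "$4" ]; do
    echo "$2" > "$p/scaling_setspeed"
    echo "$3" > "$p/scaling_setspeed"
    i=$((i + 1))
  done
}

# requires scaling_governor=userspace on that policy first:
# cycle_freq policy4 408000 1800000 10000
```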

7 hours ago, JackR said:

So it seems to confirm that the instability is not linked to the frequency but happens when the frequency changes.

 

I noticed that the ondemand sampling_rate is low in 5.x compared to 4.4's default. Compare below:

 

$ uname -r
4.4.213-rk3399

$ sudo cat /sys/devices/system/cpu/cpufreq/policy{0,4}/ondemand/sampling_rate
40000
465000

 

vs

 

$ uname -r
5.8.15-rockchip64

$ sudo cat /sys/devices/system/cpu/cpufreq/policy{0,4}/ondemand/sampling_rate
10000
10000

 

It appears that 10000 is the default (or the minimum value possible). Per https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt, 10000 is too low given the cpuinfo_transition_latency:

$ uname -r
5.8.15-rockchip64

$ cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_transition_latency 
40000
40000
40000
40000
515000
515000

 

Given the documented formula cpuinfo_transition_latency * 750 / 1000, it should be (at a minimum) 30000 for cpu0-3, and 386250 for cpu4-5.
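That arithmetic, in shell form (mirrors the governors.txt formula quoted above; the `min_sampling_rate` helper name is my own):

```shell
# Minimum ondemand sampling_rate, per the documented formula:
#   cpuinfo_transition_latency * 750 / 1000
min_sampling_rate() { echo $(( $1 * 750 / 1000 )); }

min_sampling_rate 40000    # cpu0-3 -> 30000
min_sampling_rate 515000   # cpu4-5 -> 386250
```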

 

Using sysfs to configure cpufreq on boot with the revised sampling_rate seems to make it stable (although uptime is only about 2 days in). I ran periodic load testing to make the ondemand governor change frequency, and this config looks promising. Despite the previously mentioned calculations, I opted to align the sampling_rate with what was present in the 4.4 kernel:

$ tail -n+1 /etc/sysfs.d/*
==> /etc/sysfs.d/cpufreq-policy0.conf <==
devices/system/cpu/cpufreq/policy0/scaling_governor = ondemand
devices/system/cpu/cpufreq/policy0/scaling_max_freq = 1416000
devices/system/cpu/cpufreq/policy0/scaling_min_freq = 600000
devices/system/cpu/cpufreq/policy0/ondemand/sampling_rate = 40000

==> /etc/sysfs.d/cpufreq-policy4.conf <==
devices/system/cpu/cpufreq/policy4/scaling_governor = ondemand
devices/system/cpu/cpufreq/policy4/scaling_max_freq = 1800000
devices/system/cpu/cpufreq/policy4/scaling_min_freq = 600000
devices/system/cpu/cpufreq/policy4/ondemand/sampling_rate = 465000

$ for i in 0 1 2 3 4 5; do sudo cat /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_{governor,min_freq,max_freq}; done;
ondemand
600000
1416000
ondemand
600000
1416000
ondemand
600000
1416000
ondemand
600000
1416000
ondemand
600000
1800000
ondemand
600000
1800000

 

One of the two boards is already crashing; not sure if it's the board or what. Going to switch to the "performance" governor and see. Figured I'd share the sampling_rate delta info though.


While trying to determine whether one board has faulty RAM, I noticed that the ondemand governor w/ the revised frequencies and sampling rate isn't as stable as it appeared.

 

So I decided to run a test with memtester using most of the leftover RAM while the system was up, e.g. `memtester 3550M 7`:

  • All boards have max cpu freq set at 1416000 and 1800000
  • 1x m4v2 using 4.4 and governor "ondemand"
  • 3x m4v2 using 5.8 and governor "ondemand"
  • 1x m4 (v1) using 4.4 and governor "interactive"
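Sizing that allocation ("most of the leftover RAM") can be scripted from the "available" column of `free -m`; a sketch (the `memtester_mb` helper is my own, and the ~15M headroom mirrors the per-node sizing described further down):

```shell
# Sketch: derive a per-node memtester size (in MB) from `free -m`,
# taking the "available" column minus ~15M of headroom.
memtester_mb() { awk '/^Mem:/ { print $7 - 15 }'; }

# usage: memtester "$(free -m | memtester_mb)M" 7
```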

Results:

  • All 3 boards w/ 5.8 showed at least one memtester failure per loop (I ran 7 loops). 2 of the boards running 5.8 actually seized/froze, one on the 3rd loop and the other at the end of the 7th loop.
  • Both boards (2) w/ 4.4 showed zero memtester failures; all memtester tests passed with "ok" and the systems are still up and responsive.

In summary, we're still no closer than when this thread started: resetting CPU max/min frequencies, and changing the governor/dynamic cpu scaling doesn't actually help.

 

For completeness, here's one memtester result from a board running 5.8 (an earlier test); the last line in loop 2 is when the whole thing became unresponsive:

root@host:/home/ubuntu# memtester 3500M 7
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 3500MB (3670016000 bytes)
got  3500MB (3670016000 bytes), trying mlock ...locked.
Loop 1/7:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : testing  12FAILURE: 0x80000000 != 0x00000000 at offset 0x4a270988.
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : testing  24FAILURE: 0x05000000 != 0x85000000 at offset 0x297959c0.
  Bit Flip            : testing   1FAILURE: 0x80000001 != 0x00000001 at offset 0x07489960.
  Walking Ones        : ok         
  Walking Zeroes      : ok         
  8-bit Writes        : ok
  16-bit Writes       : |
Message from syslogd@pim402 at Oct 27 03:20:22 ...
 kernel:[ 5842.577847] Internal error: Oops: 96000047 [#1] PREEMPT SMP
|
Message from syslogd@pim402 at Oct 27 03:20:22 ...
 kernel:[ 5842.599503] Code: 51000401 8b0202c2 91002080 f861db01 (f8216844) 
/
Message from syslogd@pim402 at Oct 27 03:21:33 ...
 kernel:[ 5914.064316] Internal error: Oops: 96000004 [#2] PREEMPT SMP
-
Message from syslogd@pim402 at Oct 27 03:21:33 ...
 kernel:[ 5914.084728] Code: f9401800 d503233f d50323bf f85b8000 (f9400800) 
ok

Loop 2/7:
  Stuck Address       : ok         
  Random Value        : \

 

And here's a good run from a board running 4.4:

root@pim400:/home/ubuntu# memtester 2910M 7
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 2910MB (3051356160 bytes)
got  2910MB (3051356160 bytes), trying mlock ...locked.
Loop 1/7:
  Stuck Address       : ok         
  Random Value        : ok

... cut ...

Loop 7/7:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : ok         
  Walking Ones        : ok         
  Walking Zeroes      : ok         
  8-bit Writes        : ok
  16-bit Writes       : ok

Done.

 

What else can we try?


Thank you for your efforts @zamnuts this is some great info you are sharing!

 

I for one can only say that I've had no issues at all with the "userspace" governor and min and max frequency set to 1416000!

 

pedro@nanopim4v2:/sys/devices/system/cpu/cpufreq$ sudo cat /sys/devices/system/cpu/cpufreq/policy{0,4}/scaling_governor
userspace
userspace
pedro@nanopim4v2:/sys/devices/system/cpu/cpufreq$ sudo cat /sys/devices/system/cpu/cpufreq/policy{0,4}/cpuinfo_{cur,min,max}_freq
1416000
408000
1512000
1416000
408000
2016000

 

My M4V2 has been working for almost 6 days now with 19 Docker containers and no problems at all!

 

I understand that ideally one wants to use the "ondemand" governor, but I do wonder what would happen if you ran your tests with the "userspace" governor and fixed frequencies?


I ran some tests over the past few days; all nodes are stable using the "performance" governor and the 1.5 GHz / 2 GHz cpu frequencies.

 

Nodes:

  • 1x m4v2 on kernel 4.4.213 (w/ SATA hat)
  • 3x m4v2 on kernel 5.8.6
  • 1x m4 (v1) on kernel 4.4.213

sysfs configuration:

$ tail -n+1 /etc/sysfs.d/*
==> /etc/sysfs.d/cpufreq-policy0.conf <==
devices/system/cpu/cpufreq/policy0/scaling_governor = performance
devices/system/cpu/cpufreq/policy0/scaling_max_freq = 1512000
devices/system/cpu/cpufreq/policy0/scaling_min_freq = 408000

# enable sampling_rate if scaling_governor = ondemand
#devices/system/cpu/cpufreq/policy0/ondemand/sampling_rate = 40000

==> /etc/sysfs.d/cpufreq-policy4.conf <==
devices/system/cpu/cpufreq/policy4/scaling_governor = performance
devices/system/cpu/cpufreq/policy4/scaling_max_freq = 2016000
devices/system/cpu/cpufreq/policy4/scaling_min_freq = 408000

# enable sampling_rate if scaling_governor = ondemand
#devices/system/cpu/cpufreq/policy4/ondemand/sampling_rate = 465000

 

FYI, the 4.4 kernels can't set scaling_max_freq to 1512000 and 2016000, since scaling_available_frequencies tops out at 1416000 and 1800000 respectively. Setting scaling_max_freq above these just selects the maximum frequency available, so there's no adverse effect in the sysfs configuration. Also, when using any governor besides "ondemand" (e.g. "performance"), setting ondemand/sampling_rate causes sysfsutils to fail because the "ondemand" directory does not exist (and the remaining directives don't get set).
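That failure mode can be avoided by only writing the tunable when the governor actually exposes it; a sketch (`set_sampling_rate` and the policy-path argument are my own, for illustration):

```shell
# Sketch: guard the ondemand-only tunable so switching to "performance"
# doesn't abort the rest of the cpufreq configuration.
set_sampling_rate() {  # usage: set_sampling_rate <policy sysfs dir> <rate>
  d="$1/ondemand"
  if [ -d "$d" ]; then
    echo "$2" > "$d/sampling_rate"
  else
    echo "skipping $d: governor is not ondemand"
  fi
}
```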

 

Also, I have swap disabled (as you can see in the free -m output below); these nodes will form a k8s cluster (and soon, now that they're confidently stable!).

 

Lastly, I had to disable the "ondemand" service so the scaling_governor sysfs settings would persist across reboots (likely due to service startup order, i.e. sysfsutils runs before ondemand):

systemctl disable ondemand
systemctl mask ondemand 

 

Tests and results:

  • memtester with as much of the free memory as possible (~3400M on the v2s and 1700M on the v1)
    • Sequence: reboot, 7 loops, then reboot again, 7 loops
    • Memory allocation was done on a per-node basis, based on "available" minus about 15M, e.g.:
      $ free -m
                    total        used        free      shared  buff/cache   available
      Mem:           3800         175        3380           1         244        3467
      Swap:             0           0           0
    • Results: out of 14 total loops per node (70 total), only the very first loop on one 5.8.6 node had 2 failures, shown below; all other loops on all nodes passed all tests:
      Bit Spread          : testing  68FAILURE: 0x2800000000000000 != 0x2800000080000000 at offset 0x56f8c790.
      Bit Flip            : testing 215FAILURE: 0x84000000 != 0x04000000 at offset 0x55194dd0.
  • compiled Linux kernel 5.9.1 (without the NFS/network component), a recommended test from the U-Boot memory tester README:
    • "The best known test case to stress a system like that is to boot Linux with root file system mounted over NFS, and then build some larger software package natively (say, compile a Linux kernel on the system) - this will cause enough context switches, network traffic (and thus DMA transfers from the network controller), varying RAM use, etc. to trigger any weak spots in this area."

    • Used "make -j $(nproc)" to utilize all available cores

    • "monitored" each node with periodic resource usage dumps, also to ensure the governor/frequencies didn't change. Here's a snippet that shows free memory and CPU load averages:
      2020-10-29T04:33:17+0000  192.168.68.20: [  4.4.213-rk3399]  04:33:20 up  8:46,  1 user,  load average: 6.07, 5.87, 5.92, free:  105M, govs: performance performance, freqs: 1416000 1800000
      2020-10-29T04:33:20+0000  192.168.68.21: [5.8.6-rockchip64]  04:33:21 up  8:46,  1 user,  load average: 6.01, 6.02, 6.00, free:  480M, govs: performance performance, freqs: 1512000 2016000
      2020-10-29T04:33:21+0000  192.168.68.22: [5.8.6-rockchip64]  04:33:22 up  8:46,  1 user,  load average: 6.00, 6.06, 6.02, free:  401M, govs: performance performance, freqs: 1512000 2016000
      2020-10-29T04:33:22+0000  192.168.68.23: [5.8.6-rockchip64]  04:33:23 up  8:46,  1 user,  load average: 6.01, 6.03, 6.00, free:  439M, govs: performance performance, freqs: 1512000 2016000
      2020-10-29T04:33:23+0000  192.168.68.24: [  4.4.213-rk3399]  04:33:26 up  8:46,  1 user,  load average: 6.93, 6.73, 6.71, free:  101M, govs: performance performance, freqs: 1416000 1800000

    • Compilation completed successfully, and no errors/warnings in dmesg (kern.log), syslog, nor stdout/stderr (on the tty)
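
For reproducibility, the per-node memtester size follows directly from that free output: "available" minus roughly 15M of headroom, run for 7 loops. A small sketch of the arithmetic (the 3467M figure is taken from the free -m example above):

```shell
# Compute the memtester allocation from `free -m` "available",
# leaving ~15M of headroom, then print the resulting invocation.
available=3467   # MiB, from the example node's `free -m`
headroom=15
alloc=$(( available - headroom ))
echo "memtester ${alloc}M 7"   # prints: memtester 3452M 7
```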

 

It is worth noting that every memtester failure showed a discrepancy of exactly 0x80 (128) within one byte, i.e. a single flipped bit, every single one; there was also a pattern, with the same failure hex values duplicated across nodes/tests. Only the memory offset varied. Examples:

  • 0x2800000000000000 != 0x2800000080000000
  • 0x84000000 != 0x04000000
  • 0x00000028 != 0x80000028
  • 0x80000080 != 0x00000080
  • 0x80000001 != 0x00000001
  • 0x80000000000000 != 0x80000080000000
  • 0x5555555555555555 != 0x55555555d5555555
  • etc...
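
The single-bit nature is easy to confirm: XOR-ing each expected/actual pair isolates the differing bits, and for the example pairs above the result is a lone 0x80 within one byte of the word. A quick shell check:

```shell
# XOR each expected/actual memtester pair; a power-of-two result
# means exactly one bit differs between the two values.
for pair in "0x2800000000000000 0x2800000080000000" \
            "0x84000000 0x04000000" \
            "0x00000028 0x80000028" \
            "0x5555555555555555 0x55555555d5555555"; do
  set -- $pair
  printf '0x%x\n' $(( $1 ^ $2 ))
done
# prints 0x80000000 for every one of these pairs
```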

 

I might as well share my cloud-init configuration at this point; only the critical new-instance bootstrapping stuff is here, all other configs/software/patching/etc. is handled via Ansible. Here's "user-data":

#cloud-config
disable_root: true
mounts:
  - [ swap, null ]
ntp:
  enabled: true
  ntp_client: 'auto'
packages:
  - 'sysfsutils'
package_update: false
package_upgrade: false
preserve_hostname: false
runcmd:
  - [ systemctl, disable, ondemand ]
  - [ systemctl, mask, ondemand ]
ssh_authorized_keys:
  - 'ssh-rsa <PUBLIC KEY REDACTED>'
timezone: 'Etc/UTC'
write_files:
  - content: |
      devices/system/cpu/cpufreq/policy0/scaling_governor = performance
      devices/system/cpu/cpufreq/policy0/scaling_max_freq = 1512000
      devices/system/cpu/cpufreq/policy0/scaling_min_freq = 408000

      # kernel 4.4 max is 1416000
      #devices/system/cpu/cpufreq/policy0/scaling_max_freq = 1416000

      # enable sampling_rate if scaling_governor = ondemand
      #devices/system/cpu/cpufreq/policy0/ondemand/sampling_rate = 40000
    path: /etc/sysfs.d/cpufreq-policy0.conf
  - content: |
      devices/system/cpu/cpufreq/policy4/scaling_governor = performance
      devices/system/cpu/cpufreq/policy4/scaling_max_freq = 2016000
      devices/system/cpu/cpufreq/policy4/scaling_min_freq = 408000

      # kernel 4.4 max is 1800000
      #devices/system/cpu/cpufreq/policy4/scaling_max_freq = 1800000

      # enable sampling_rate if scaling_governor = ondemand
      #devices/system/cpu/cpufreq/policy4/ondemand/sampling_rate = 465000
    path: /etc/sysfs.d/cpufreq-policy4.conf

 

I'll start re-imaging these nodes and actually using them now. Will report back if this issue pops up again.

Still no solution for the "ondemand" governor :( - the CPUs simply don't like changing frequencies on the 5.x kernel.

Edited by zamnuts
note about 0x80

>I'm willing to try any builds, configs, and pretty much anything else anyone can think of, short of soldering,

Do some dtb hacking? All the memory timing is in there; maybe the memory is being pushed too far, too hard?

But it is plenty complicated, and looking at the 5.x dtb much of that detail is gone, so this applies mostly to 4.x.

