Nobby42 Posted October 1, 2020

I have had no crash for 5 days.

Welcome to Armbian 20.08.3 Buster with Linux 5.8.10-rockchip64
No end-user support: built from trunk
System load: 20% Up time: 5 days 1:18
Memory usage: 34% of 3.71G Zram usage: 6% of 1.86G
IP: 172.21.0.1 192.168.42.31
CPU temp: 48°C
Usage of /: 17% of 458G storage/: 6% of 916G storage temp: Always°C

$ cat /etc/default/cpufrequtils
ENABLE=true
MIN_SPEED=600000
MAX_SPEED=1800000
GOVERNOR=performance
Pedro Lamas (Author) Posted October 1, 2020

3 hours ago, aprayoga said:
We are still testing on Helios64 (with value 40000); so far, reboot and power cycle do not trigger any kernel crash.

Hi @aprayoga, thank you for your comments. I will look into this, though at the moment I don't know how I can apply that 40000 value (I assume this is not a simple change in cpufrequtils!)
piter75 Posted October 1, 2020

9 hours ago, aprayoga said:
We are still testing on Helios64 (with value 40000); so far, reboot and power cycle do not trigger any kernel crash.

@aprayoga Fingers crossed! I remember playing with "regulator-ramp-delay" on the M4V2 before (after noticing slow big cpu cluster transitions), but I probably did not go that high, definitely did not see the post you mentioned, and was not successful. I started some tests with 40000 right now.

@Pedro Lamas if you want to also try testing it: save the overlay below into a file on your M4V2 (let's name it ramp-delay-test.dts), run "sudo armbian-add-overlay ramp-delay-test.dts" and reboot your M4V2.

/dts-v1/;
/plugin/;

/ {
    compatible = "rockchip,rk3399";

    fragment@0 {
        target = <&vdd_cpu_b>;
        __overlay__ {
            regulator-ramp-delay = <40000>;
        };
    };
};
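After the reboot you can sanity-check that the overlay was actually picked up - something like this should do, assuming armbian-add-overlay placed the compiled overlay in /boot/overlay-user and registered it in armbianEnv.txt as it normally does:

$ ls /boot/overlay-user/
ramp-delay-test.dtbo
$ grep user_overlays /boot/armbianEnv.txt
user_overlays=ramp-delay-test
$ cat /sys/devices/system/cpu/cpu4/cpufreq/cpuinfo_transition_latency
# whether this last value directly tracks the new ramp delay is an open question
# (see the discussion further down)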
Pedro Lamas (Author) Posted October 1, 2020

Thanks for sharing that @piter75. Following your instructions I've now added the overlay and set the governor back to "ondemand" with min 600000 and max 1800000. I'll keep an eye on it and report back any crash - hopefully none though!!

Entering day 3 with the "ondemand" governor, min 600000, max 1800000, and the custom overlay @piter75 provided... no issues at all!! I think we're on to something here!!

I just woke up to find my M4V2 had crashed during the night... As I had to manually reboot it, I upgraded the firmware via armbian-config and rebooted it again to make sure I'm using the latest available.

Another crash just now while I was pulling some images from Docker Hub...

Message from syslogd@localhost at Oct 7 18:13:28 ...
kernel:[85530.199864] Internal error: Oops: 96000044 [#1] PREEMPT SMP
Message from syslogd@localhost at Oct 7 18:13:28 ...
kernel:[85530.221095] Code: f94006e1 f9403fe2 f90004e1 d37cf400 (f9000027)
JackR Posted October 9, 2020 (edited)

Good evening. My NanoPi M4V2 also suffers from frequent errors, both with 1800 and 2000 max frequency. Minimum is 408 and the governor is set to ondemand. This is the message I got on the terminal (ssh connection) some minutes ago:

Message from syslogd@nanoNas at Oct 9 19:35:18 ...
kernel:[ 4724.003648] Internal error: Oops: 96000004 [#1] PREEMPT SMP
Message from syslogd@nanoNas at Oct 9 19:35:18 ...
kernel:[ 4724.024320] Code: 9b355f82 91002060 8b02031b d503201f (f9401321)

Luckily enough, it did not break the ssh connection and I could recover some trace (end of dmesg):

Spoiler

[ 4724.003620] Unable to handle kernel paging request at virtual address 0000000080000020
[ 4724.003625] Mem abort info:
[ 4724.003627] ESR = 0x96000004
[ 4724.003630] EC = 0x25: DABT (current EL), IL = 32 bits
[ 4724.003632] SET = 0, FnV = 0
[ 4724.003633] EA = 0, S1PTW = 0
[ 4724.003634] Data abort info:
[ 4724.003636] ISV = 0, ISS = 0x00000004
[ 4724.003637] CM = 0, WnR = 0
[ 4724.003641] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000c1c57000
[ 4724.003643] [0000000080000020] pgd=0000000000000000, p4d=0000000000000000
[ 4724.003648] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 4724.004147] Modules linked in: governor_performance zstd zram snd_soc_hdmi_codec snd_soc_rt5651 rc_cec dw_hdmi_i2s_audio dw_hdmi_cec snd_soc_rl6231 snd_soc_simple_card rockchip_vdec(C) rockchip_rga snd_soc_rockchip_spdif panfrost hantro_vpu(C) snd_soc_simple_card_utils v4l2_h264 videobuf2_dma_contig snd_soc_rockchip_i2s videobuf2_dma_sg v4l2_mem2mem videobuf2_vmalloc videobuf2_memops snd_soc_core videobuf2_v4l2 btsdio videobuf2_common snd_pcm_dmaengine videodev mc hci_uart gpu_sched rockchipdrm fusb302 dw_mipi_dsi tcpm dw_hdmi typec analogix_dp brcmfmac drm_kms_helper snd_pcm brcmutil btqca btrtl snd_timer btbcm cec btintel rc_core bluetooth snd soundcore cfg80211 drm rfkill sg drm_panel_orientation_quirks cpufreq_dt nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek dwmac_rk stmmac_platform stmmac mdio_xpcs
[ 4724.011464] CPU: 5 PID: 12945 Comm: md0_raid5 Tainted: G C 5.8.13-rockchip64 #20.08.8
[ 4724.012261] Hardware name: FriendlyElec NanoPi M4 Ver2.0 (DT)
[ 4724.012765] pstate: 00000005 (nzcv daif -PAN -UAO BTYPE=--)
[ 4724.013269] pc : raid_run_ops+0x530/0x14c8 [raid456]
[ 4724.013707] lr : raid_run_ops+0x518/0x14c8 [raid456]
[ 4724.014142] sp : ffff800017e9ba10
[ 4724.014433] x29: ffff800017e9ba10 x28: 0000000000000003
[ 4724.014899] x27: ffff0000cb9750f8 x26: 0000000000000000
[ 4724.015364] x25: 0000000080000000 x24: ffff0000cb974b00
[ 4724.015830] x23: 00000000000001f0 x22: ffff0000cb974b68
[ 4724.016296] x21: 0000000000000158 x20: ffff0000cb974f08
[ 4724.016761] x19: fffffdffbff2ee38 x18: ffff0000e5bbd688
[ 4724.017226] x17: 0000000000000001 x16: 0000000000000af8
[ 4724.017691] x15: 2ab14ab4b2e69eb3 x14: 0000579552bdc338
[ 4724.018156] x13: 0000000000000183 x12: 0000000000000190
[ 4724.018621] x11: 000fffffffffffff x10: 0000000000000004
[ 4724.019086] x9 : ffff0000f77b0590 x8 : ffff0000f77afbc0
[ 4724.019551] x7 : 0000000000000001 x6 : ffff0000e0a79a10
[ 4724.020016] x5 : 0000000000000228 x4 : 0000000000000158
[ 4724.020481] x3 : 0000000000000000 x2 : 00000000000005f8
[ 4724.020946] x1 : ffff0000e32b8000 x0 : 0000000000000008
[ 4724.021412] Call trace:
[ 4724.021636] raid_run_ops+0x530/0x14c8 [raid456]
[ 4724.022044] handle_stripe+0x7c0/0x1f08 [raid456]
[ 4724.022461] handle_active_stripes.isra.0+0x3a4/0x4d8 [raid456]
[ 4724.022982] raid5d+0x300/0x5b0 [raid456]
[ 4724.023346] md_thread+0x9c/0x188 [md_mod]
[ 4724.023715] kthread+0x118/0x150
[ 4724.024002] ret_from_fork+0x10/0x34
[ 4724.024320] Code: 9b355f82 91002060 8b02031b d503201f (f9401321)
[ 4724.024855] ---[ end trace 29dfe51ce6a12d3a ]---
[ 4724.025320] ------------[ cut here ]------------
[ 4724.025738] WARNING: CPU: 5 PID: 12945 at kernel/exit.c:720 do_exit+0x3c/0xa18
[ 4724.026368] Modules linked in: governor_performance zstd zram snd_soc_hdmi_codec snd_soc_rt5651 rc_cec dw_hdmi_i2s_audio dw_hdmi_cec snd_soc_rl6231 snd_soc_simple_card rockchip_vdec(C) rockchip_rga snd_soc_rockchip_spdif panfrost hantro_vpu(C) snd_soc_simple_card_utils v4l2_h264 videobuf2_dma_contig snd_soc_rockchip_i2s videobuf2_dma_sg v4l2_mem2mem videobuf2_vmalloc videobuf2_memops snd_soc_core videobuf2_v4l2 btsdio videobuf2_common snd_pcm_dmaengine videodev mc hci_uart gpu_sched rockchipdrm fusb302 dw_mipi_dsi tcpm dw_hdmi typec analogix_dp brcmfmac drm_kms_helper snd_pcm brcmutil btqca btrtl snd_timer btbcm cec btintel rc_core bluetooth snd soundcore cfg80211 drm rfkill sg drm_panel_orientation_quirks cpufreq_dt nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek dwmac_rk stmmac_platform stmmac mdio_xpcs
[ 4724.033673] CPU: 5 PID: 12945 Comm: md0_raid5 Tainted: G D C 5.8.13-rockchip64 #20.08.8
[ 4724.034470] Hardware name: FriendlyElec NanoPi M4 Ver2.0 (DT)
[ 4724.034974] pstate: 80000005 (Nzcv daif -PAN -UAO BTYPE=--)
[ 4724.035465] pc : do_exit+0x3c/0xa18
[ 4724.035773] lr : die+0x204/0x248
[ 4724.036056] sp : ffff800017e9b690
[ 4724.036348] x29: ffff800017e9b690 x28: ffff0000e32b8000
[ 4724.036813] x27: ffff0000cb9750f8 x26: 0000000000000000
[ 4724.037278] x25: 0000000080000000 x24: 0000000000000000
[ 4724.037743] x23: ffff0000e32b8000 x22: 0000000000000001
[ 4724.038208] x21: ffff800017e9b7a7 x20: 000000000000000b
[ 4724.038673] x19: ffff0000e32b8000 x18: 0000000000000010
[ 4724.039138] x17: 0000000000000001 x16: 0000000000000af8
[ 4724.039603] x15: ffff0000e32b84a8 x14: 0720072007200720
[ 4724.040068] x13: 0720072007200720 x12: 0720072007200720
[ 4724.040533] x11: 0720072007200720 x10: 0720072007200720
[ 4724.040997] x9 : 0720072007200720 x8 : 0720072007200720
[ 4724.041462] x7 : 0000000000000303 x6 : ffff0000f2e00f00
[ 4724.041927] x5 : 0000000000000001 x4 : ffff0000f77bc1d0
[ 4724.042392] x3 : 0000000000000000 x2 : 0000000000000000
[ 4724.042857] x1 : ffff0000f12bae48 x0 : ffff800017e9bdb0
[ 4724.043322] Call trace:
[ 4724.043540] do_exit+0x3c/0xa18
[ 4724.043816] die+0x204/0x248
[ 4724.044073] die_kernel_fault+0x64/0x78
[ 4724.044411] __do_kernel_fault+0x88/0x138
[ 4724.044764] do_page_fault+0x198/0x468
[ 4724.045094] do_translation_fault+0x64/0x88
[ 4724.045461] do_mem_abort+0x40/0xa0
[ 4724.045769] el1_sync_handler+0x104/0x110
[ 4724.046121] el1_sync+0x7c/0x100
[ 4724.046417] raid_run_ops+0x530/0x14c8 [raid456]
[ 4724.046825] handle_stripe+0x7c0/0x1f08 [raid456]
[ 4724.047240] handle_active_stripes.isra.0+0x3a4/0x4d8 [raid456]
[ 4724.047760] raid5d+0x300/0x5b0 [raid456]
[ 4724.048125] md_thread+0x9c/0x188 [md_mod]
[ 4724.048488] kthread+0x118/0x150
[ 4724.048773] ret_from_fork+0x10/0x34
[ 4724.049087] ---[ end trace 29dfe51ce6a12d3b ]---
[ 4724.049518] note: md0_raid5[12945] exited with preempt_count 1

And this is the version of Armbian I use:

Welcome to Armbian 20.08.9 Buster with Linux 5.8.13-rockchip64

I just copied the error part of dmesg, but if necessary I can provide the full output. Hope it helps.

Edited December 7, 2020 by TRS-80
move long output into spoiler
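If it helps with analysis, traces like this can be turned into file/line information with the kernel's own scripts/decode_stacktrace.sh - roughly as below, assuming you have a vmlinux with debug symbols and a module directory that match the exact running 5.8.13-rockchip64 build (all paths here are placeholders):

$ dmesg > oops.txt
$ ./scripts/decode_stacktrace.sh /path/to/vmlinux /path/to/kernel-source /lib/modules/5.8.13-rockchip64 < oops.txt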
Kyra Posted October 9, 2020

I'm hopeful we'll eventually resolve this issue. At least we can be fairly certain it's a software thing, given the legacy kernels are rock solid (pardon the pun). One noteworthy thing: a Rock Pi 4 I have running Gentoo with a mainline kernel (5.8) is very stable, with uptimes over a month. The hardware is virtually identical (including the aforementioned regulator) and the default ramp delay is the same as the M4V2:

regulator-ramp-delay = <1000>;

That it's a memory setup issue is still my best guess.
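For anyone wanting to compare boards, the device tree the kernel actually booted with can be dumped back to source and grepped - something like this, assuming dtc from the device-tree-compiler package is installed:

$ sudo apt-get install device-tree-compiler
$ sudo dtc -I fs -O dts /proc/device-tree 2>/dev/null | grep -B2 -A2 regulator-ramp-delay
# note: dtc prints cell values in hex, so 1000 shows up as <0x3e8>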
JackR Posted October 10, 2020

I got the very same message during the night, although I set the maximum frequency at 1600:

# cat /etc/default/cpufrequtils
ENABLE=true
MIN_SPEED=408000
MAX_SPEED=1608000
GOVERNOR=ondemand

The output of dmesg is also very similar to the one from yesterday evening. I am ready to do some testing if it can help solve the issue - just let me know what.
Pedro Lamas (Author) Posted October 10, 2020

I've had enough crashes with the 40000 overlay change, so for now I've moved back to what I hope are safer governor settings:

pedro@nanopim4v2:~$ cat /etc/default/cpufrequtils
ENABLE=true
MIN_SPEED=1200000
MAX_SPEED=1200000
GOVERNOR=userspace
Pedro Lamas (Author) Posted October 10, 2020

@piter75 having made the overlay change you suggested (be aware I'm now using the "userspace" governor fixed to 1200000), shouldn't cpu4 and cpu5 return 40000 instead of the current 52000?

pedro@nanopim4v2:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency
40000
pedro@nanopim4v2:~$ cat /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_transition_latency
40000
pedro@nanopim4v2:~$ cat /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_transition_latency
40000
pedro@nanopim4v2:~$ cat /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_transition_latency
40000
pedro@nanopim4v2:~$ cat /sys/devices/system/cpu/cpu4/cpufreq/cpuinfo_transition_latency
52000
pedro@nanopim4v2:~$ cat /sys/devices/system/cpu/cpu5/cpufreq/cpuinfo_transition_latency
52000

Here's the full output for cpu4:

pedro@nanopim4v2:~$ pushd /sys/devices/system/cpu/cpu4/cpufreq && paste <(ls *) <(cat *) && popd
/sys/devices/system/cpu/cpu4/cpufreq ~
affected_cpus	4 5
cpuinfo_cur_freq	1200000
cpuinfo_max_freq	2016000
cpuinfo_min_freq	408000
cpuinfo_transition_latency	52000
related_cpus	4 5
scaling_available_frequencies	408000 600000 816000 1008000 1200000 1416000 1608000 1800000 2016000
scaling_available_governors	conservative userspace powersave ondemand performance schedutil
scaling_cur_freq	1200000
scaling_driver	cpufreq-dt
scaling_governor	userspace
scaling_max_freq	1200000
scaling_min_freq	1200000
scaling_setspeed	1200000
cat: stats: Is a directory
stats:	reset time_in_state total_trans trans_table
~
Pedro Lamas (Author) Posted October 15, 2020

So having the governor set to "userspace" with min and max speed set to "1200000" seems to make it completely stable. Any idea of what we can try next?

As the Helios64 shares the same CPU, I do wonder if it is running more stably than the M4V2... if it is, any chance of comparing the two and trying to make the M4V2 settings match the ones on the Helios64?
zamnuts Posted October 18, 2020 (edited)

I might as well chime in here... I've been battling this problem for nearly a year, all the while periodically checking various threads. I've gone through many hardware and software iterations to troubleshoot. I'm just going to do a dump of everything up to this point - apologies for the long post. I'm willing to try any builds, configs, and pretty much anything else anyone can think of, short of soldering, to figure out what is going on.

Setup:
- 1x NanoPi M4V2 w/ 4x SATA hat (FriendlyElec), powered by a brick/barrel connector "ALITOVE 12V 8A 96W"
- 3x NanoPi M4V2 w/ 5V 4A power supplies from FriendlyElec (tried multiple cables, currently using "UGREEN USB C Cable 5A Supercharge Type C to USB A" on all 3)
- 1x NanoPi M4 (v1) w/ 5V 4A power supply from FriendlyElec (same cable: UGREEN 5A Type C to USB A)
- All have dedicated 32GB eMMC
- All have the FriendlyElec heat sink, some with the factory pad and some with a 20x20x1.2mm copper spacer + Noctua thermal compound; the copper spacer was added during a later iteration (FYI: there is no difference between the stock blue pad from FriendlyElec and the copper spacer + Noctua, save your money!)
- All have Noctua NF-A6x25 5V 60mm fans (active cooling, powered via GPIO VDD_5V pin 2 + GND); this active cooling was added during a later iteration (FYI: you definitely need active cooling - it idles 10C cooler, and under load it was something like 20-30C cooler; I'll have to post actual numbers another day)
- All on the same circuit, shared w/ my Ubiquiti networking gear (no problems with this other hardware)

Software/OS combinations I've tried:
1. Stock Armbian bionic server w/ 5.x kernel (prior to the focal release)
2. Stock Armbian focal server w/ 5.x kernel
3. Stock Armbian focal server w/ custom compiled 5.x kernel
4. Stock Armbian focal server w/ custom compiled 5.x kernel + hacks to disable zram (I thought maybe kubernetes/docker was freaking out)
5. Custom Ubuntu focal arm cloud image, using the Armbian /boot and a custom 5.x kernel (frankenstein, I know, getting desperate)
6. Custom Ubuntu bionic arm cloud image, using the Armbian /boot and a custom 5.x kernel
7. Custom Ubuntu bionic arm cloud image, using the Armbian /boot and a custom 4.4.213 kernel

Currently on the 4.4 + bionic cloud image; uptime is 1 day 20 minutes, no crashes. I'll give it a few more days to see what happens. (.20 below was giving me unrelated trouble, got a late start)

2020-10-17T23:50:20+0000
192.168.68.20: 23:50:22 up 1:25, 1 user, load average: 0.11, 0.07, 0.01
192.168.68.21: 23:50:25 up 1 day, 24 min, 0 users, load average: 0.77, 0.72, 0.72
192.168.68.22: 23:50:27 up 1 day, 24 min, 0 users, load average: 0.73, 0.80, 0.81
192.168.68.23: 23:50:29 up 1 day, 24 min, 0 users, load average: 0.68, 0.74, 0.75
192.168.68.24: 23:50:32 up 1 day, 24 min, 0 users, load average: 0.83, 0.79, 0.82

On the 5.x kernel variations, all 5 nodes will end up in a crashed/locked state within 7 days, but typically less, regardless of load. At this point, I just let them idle and run a simple ssh command every second to execute uptime. Load average is 0.7 to 0.9. This current iteration (#7) is the first one with kernel 4.4.
Although I didn't check the governor profile until now, 4.4 is showing "interactive" on all 6 CPUs, and here's some other cpu info too:

$ cat /sys/devices/system/cpu/cpu{0..5}/cpufreq/scaling_governor
interactive
interactive
interactive
interactive
interactive
interactive
$ cat /sys/devices/system/cpu/cpu{0..5}/cpufreq/cpuinfo_transition_latency
40000
40000
40000
40000
465000
465000
$ sudo cat /sys/devices/system/cpu/cpu{0..5}/cpufreq/cpuinfo_{cur,min,max}_freq
408000
408000
1416000
408000
408000
1416000
408000
408000
1416000
408000
408000
1416000
816000
408000
1800000
816000
408000
1800000

Extra info:
- Kernel configs are version controlled, can supply upon request
- Kernel compilation and imaging is all scripted, and I'm set up for Armbian dev (one day I'll put my custom scripts out on GH)
- My u-boot experience is garbage
- 2x UART CP2102 and 1x FTDI FT232RL IC USB modules are at my disposal
- Got some microSD cards sitting around here somewhere...
- I use cloud-init NoCloud for config/first-boot seeding
- The M4V2 w/ the SATA hat sometimes runs w/ a USB ASIX AX88179 dongle in bonded mode, but that periodically causes an unrelated kernel panic on 4.4 and 5.x (that's another topic altogether)

Edited October 18, 2020 by zamnuts
more info, clarity
hev Posted October 21, 2020

I found the same problem - my NanoPi M4V2 randomly panics. After reducing the CPU frequency, it seems to be fixed.

heiher@hev-mpc:~$ cat /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
1416000
heiher@hev-mpc:~$ cat /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq
1800000
Pedro Lamas (Author) Posted October 21, 2020

@hev can you indicate your current governor settings for comparison (run "cat /etc/default/cpufrequtils")?
hev Posted October 21, 2020

@Pedro Lamas I don't use that file to configure it. My config is equivalent to:

echo 1416000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
echo 1800000 > /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq
echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
echo performance > /sys/devices/system/cpu/cpufreq/policy4/scaling_governor
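If you want those echoes to survive a reboot without pulling in cpufrequtils, one option is a small oneshot systemd unit - a minimal sketch (the unit name is arbitrary, and make sure nothing else like the ondemand service re-applies its own settings afterwards):

# /etc/systemd/system/cpufreq-fixed.service (hypothetical name)
[Unit]
Description=Pin cpufreq governor and max frequency

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 1416000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq'
ExecStart=/bin/sh -c 'echo 1800000 > /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq'
ExecStart=/bin/sh -c 'echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor'
ExecStart=/bin/sh -c 'echo performance > /sys/devices/system/cpu/cpufreq/policy4/scaling_governor'

[Install]
WantedBy=multi-user.target

Then enable it with:

$ sudo systemctl daemon-reload && sudo systemctl enable --now cpufreq-fixed.service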
Pedro Lamas (Author) Posted October 21, 2020

On 9/30/2020 at 2:34 PM, Pedro Lamas said:
I just noticed something while reading the RK3399 specsheet: the recommended maximum frequency of the A72 is actually 1.8GHz, not 2.0GHz as on the FriendlyARM website and wiki! The RK3399K does indicate a recommended maximum of 2.0GHz, but that is not the version in use on the NanoPi M4V2. The Rock Pi 4 uses the same RK3399 SoC and they specifically say the frequency of the A72 is 1.8GHz. I even found a commit in the Armbian codebase for the Helios64 (another board with the same RK3399 SoC) where the maximum is set to 1.8GHz: https://github.com/armbian/build/pull/2191
I will leave my board for a couple more days with the "userspace" governor and min and max set to 1008000, and if there are no crashes, I will try the "ondemand" governor with min set to 1008000 and max to 1800000.

Following up on the above, I just opened this issue on GitHub, suggesting that the CPUMAX for the RK3399 be set to 1800000 following the manufacturer specifications.
JackR Posted October 21, 2020

I don't think it is linked to the max frequency, as I already had the problem even though the max frequency was set at 1.6 GHz. As Pedro Lamas noticed, with min and max frequency set at 1.2 GHz my NanoPi ran 3 days without the bug. From what I have observed, it might be a memory problem when the frequency changes, maybe some kind of desynchronisation? It happens on any of the cores, big or LITTLE. I upgraded to the latest version (5.8.16-rockchip64 #20.08.14 SMP PREEMPT Tue Oct 20 22:37:51), but the bug is still there. In order to see whether it depends on the frequency itself or is linked to the change of frequency, I have set min and max at 2 GHz and will let it run like that for some time...
hev Posted October 22, 2020

4 hours ago, JackR said:
I don't think it is linked to the max frequency, as I already had the problem even though the max frequency was set at 1.6 GHz. As Pedro Lamas noticed, with min and max frequency set at 1.2 GHz my NanoPi ran 3 days without the bug. From what I have observed, it might be a memory problem when the frequency changes, maybe some kind of desynchronisation? It happens on any of the cores, big or LITTLE. I upgraded to the latest version (5.8.16-rockchip64 #20.08.14 SMP PREEMPT Tue Oct 20 22:37:51), but the bug is still there. In order to see whether it depends on the frequency itself or is linked to the change of frequency, I have set min and max at 2 GHz and will let it run like that for some time...

Is it possible that the A53 (little) cannot run at its highest design frequency (1.51GHz)? I configured the A53 to 1.42GHz and the A72 to 1.8GHz, which looks good. I also have a NanoPi M4V1; it's running the official mainline kernel 5.9.1, where the default configuration is 1.42GHz + 1.8GHz, and I have never found stability problems there.
hev Posted October 22, 2020

[heiher@hev-sbc ~]$ uname -a
Linux hev-sbc 5.9.1 #1 SMP Sun Oct 18 17:37:19 CST 2020 aarch64 GNU/Linux
[heiher@hev-sbc ~]$ cat /sys/devices/system/cpu/cpufreq/*/scaling_available_frequencies
408000 600000 816000 1008000 1200000 1416000
408000 600000 816000 1008000 1200000 1416000 1608000 1800000
camelator Posted October 26, 2020

Same here: I confirm that since the last update I had random crashes and wifi loss (I posted a topic about it). I just changed the cpufrequtils settings and the random crashes disappeared, but wifi is still being lost. With this configuration, though, I suspect I get some waiting time when the workload is too heavy or the temperature too high, I don't know. For example, a simple apt update on a 60KB file can pause the system for... let's say 1 or 2 minutes, and then things are fast again. Strange behaviour.
JackR Posted October 26, 2020

Welcome to Armbian 20.08.14 Buster with Linux 5.8.16-rockchip64
System load: 2% Up time: 4 days 11:57
Memory usage: 8% of 3.71G
IP: 192.168.0.59
CPU temp: 49°C
Usage of /: 52% of 3.5G

# cat /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
1512000
# cat /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq
2016000
# cat /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq
1512000
# cat /sys/devices/system/cpu/cpufreq/policy4/scaling_min_freq
2016000

With this config it has now been 4.5 days that my NanoPi M4V2 has been running at full speed (LITTLE at 1512, big at 2016 MHz) without a single crash. So it seems to confirm that the instability is not linked to the frequency itself but happens when the frequency changes. As hev mentioned, it doesn't happen on the NanoPi M4V1, and the only difference between V1 and V2 is the memory type: DDR3 for V1 and DDR4 for V2. As the observed crashes are memory errors on any of the cores, whatever the max frequency is set to, it really looks like a desynchronisation between the cores and memory when the frequency is changed by the governor. I am not expert enough to go deeper in this analysis, but I hope someone more skilled can start from here and write a patch.
zamnuts Posted October 27, 2020

7 hours ago, JackR said:
So it seems to confirm that the instability is not linked to the frequency itself but happens when the frequency changes.

I noticed that the ondemand sampling_rate is low in 5.x compared to 4.4's default. Compare below:

$ uname -r
4.4.213-rk3399
$ sudo cat /sys/devices/system/cpu/cpufreq/policy{0,4}/ondemand/sampling_rate
40000
465000

vs

$ uname -r
5.8.15-rockchip64
$ sudo cat /sys/devices/system/cpu/cpufreq/policy{0,4}/ondemand/sampling_rate
10000
10000

It appears that 10000 is a default (or the minimum value possible). Per https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt 10000 is too low given the cpuinfo_transition_latency:

$ uname -r
5.8.15-rockchip64
$ cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_transition_latency
40000
40000
40000
40000
515000
515000

Given the documented formula cpuinfo_transition_latency * 750 / 1000, it should be (at a minimum) 30000 for cpu0-3 and 386250 for cpu4-5. Using sysfs to configure cpufreq on boot with the revised sampling_rate seems to make it stable (although uptime is only about 2 days in). I ran periodic load testing to get the ondemand governor to change frequency, and this config looks promising. Despite the previously mentioned calculations, I opted to align the sampling_rate with what was present in the 4.4 kernel:

$ tail -n+1 /etc/sysfs.d/*
==> /etc/sysfs.d/cpufreq-policy0.conf <==
devices/system/cpu/cpufreq/policy0/scaling_governor = ondemand
devices/system/cpu/cpufreq/policy0/scaling_max_freq = 1416000
devices/system/cpu/cpufreq/policy0/scaling_min_freq = 600000
devices/system/cpu/cpufreq/policy0/ondemand/sampling_rate = 40000

==> /etc/sysfs.d/cpufreq-policy4.conf <==
devices/system/cpu/cpufreq/policy4/scaling_governor = ondemand
devices/system/cpu/cpufreq/policy4/scaling_max_freq = 1800000
devices/system/cpu/cpufreq/policy4/scaling_min_freq = 600000
devices/system/cpu/cpufreq/policy4/ondemand/sampling_rate = 465000

$ for i in 0 1 2 3 4 5; do sudo cat /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_{governor,min_freq,max_freq}; done;
ondemand
600000
1416000
ondemand
600000
1416000
ondemand
600000
1416000
ondemand
600000
1416000
ondemand
600000
1800000
ondemand
600000
1800000

One of two boards is already crashing, not sure if it's the board or what. Going to switch to the "performance" governor and see. Figured I'd share the sampling_rate delta info though.
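For anyone who would rather derive those minimums per board than hard-code them, the documented formula can be applied straight from sysfs - a quick sketch (run as root; the ondemand/sampling_rate file only exists while the ondemand governor is active on that policy):

#!/bin/sh
# For each cpufreq policy, compute the documented minimum ondemand sampling rate
# (cpuinfo_transition_latency * 750 / 1000) and write it back if possible.
for pol in /sys/devices/system/cpu/cpufreq/policy*; do
    lat=$(cat "$pol/cpuinfo_transition_latency")
    min_rate=$(( lat * 750 / 1000 ))
    echo "$pol: transition_latency=$lat -> sampling_rate>=$min_rate"
    [ -w "$pol/ondemand/sampling_rate" ] && echo "$min_rate" > "$pol/ondemand/sampling_rate"
done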
zamnuts Posted October 27, 2020 (edited)

While trying to determine whether one board had faulty RAM, I noticed that the ondemand governor w/ the revised frequencies and sampling rate isn't as stable as it appears. So I decided to run a test using memtester over most of the left-over RAM while the system was up, e.g. `memtester 3550M 7`:

- All boards have max cpu freq set at 1416000 and 1800000
- 1x m4v2 using 4.4 and governor "ondemand"
- 3x m4v2 using 5.8 and governor "ondemand"
- 1x m4 (v1) using 4.4 and governor "interactive"

Results:
- All 3 boards w/ 5.8 showed at least one memtester failure per loop (I ran 7 loops). 2 of the boards running 5.8 actually seized/froze, one on the 3rd loop and the other at the end of the 7th loop.
- Both boards (2) w/ 4.4 showed zero memtester failures; all memtester tests passed with "ok" and the systems are still up and responsive.

In summary, we're still no closer than when this thread started: resetting CPU max/min frequencies and changing the governor/dynamic cpu scaling doesn't actually help.

For completeness, here's one memtester result for a board running 5.8 (an earlier test); the last line on loop 2 is when the whole thing became unresponsive:

root@host:/home/ubuntu# memtester 3500M 7
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 3500MB (3670016000 bytes)
got 3500MB (3670016000 bytes), trying mlock ...locked.
Loop 1/7:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : testing  12FAILURE: 0x80000000 != 0x00000000 at offset 0x4a270988.
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : testing  24FAILURE: 0x05000000 != 0x85000000 at offset 0x297959c0.
  Bit Flip            : testing   1FAILURE: 0x80000001 != 0x00000001 at offset 0x07489960.
  Walking Ones        : ok
  Walking Zeroes      : ok
  8-bit Writes        : ok
  16-bit Writes       : |

Message from syslogd@pim402 at Oct 27 03:20:22 ...
kernel:[ 5842.577847] Internal error: Oops: 96000047 [#1] PREEMPT SMP
Message from syslogd@pim402 at Oct 27 03:20:22 ...
kernel:[ 5842.599503] Code: 51000401 8b0202c2 91002080 f861db01 (f8216844)
Message from syslogd@pim402 at Oct 27 03:21:33 ...
kernel:[ 5914.064316] Internal error: Oops: 96000004 [#2] PREEMPT SMP
Message from syslogd@pim402 at Oct 27 03:21:33 ...
kernel:[ 5914.084728] Code: f9401800 d503233f d50323bf f85b8000 (f9400800)
ok
Loop 2/7:
  Stuck Address       : ok
  Random Value        : \

And here's a good run from a board running 4.4:

root@pim400:/home/ubuntu# memtester 2910M 7
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 2910MB (3051356160 bytes)
got 2910MB (3051356160 bytes), trying mlock ...locked.
Loop 1/7:
  Stuck Address       : ok
  Random Value        : ok
  ... cut ...
Loop 7/7:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok
  8-bit Writes        : ok
  16-bit Writes       : ok

Done.

What else can we try?

Edited October 27, 2020 by zamnuts
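If anyone wants to reproduce the same memtester run, the size argument can be derived from "available" memory instead of being typed by hand - roughly like this (memtester from the distro package; the 100M headroom is an arbitrary margin):

#!/bin/sh
# Run memtester over most of the currently available RAM, leaving a small margin
# so the system itself does not start swapping or OOM-killing. Run as root so mlock works.
AVAIL_MB=$(awk '/MemAvailable/ { print int($2 / 1024) }' /proc/meminfo)
TEST_MB=$(( AVAIL_MB - 100 ))   # leave ~100 MB headroom (arbitrary)
LOOPS=7
echo "Testing ${TEST_MB}M of RAM for ${LOOPS} loops..."
memtester "${TEST_MB}M" "$LOOPS" | tee "memtester-$(date +%Y%m%dT%H%M%S).log"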
Pedro Lamas (Author) Posted October 27, 2020

Thank you for your efforts @zamnuts, this is some great info you are sharing! I for one can only say that I've had no issues at all with the "userspace" governor and min and max frequency set to 1416000!

pedro@nanopim4v2:/sys/devices/system/cpu/cpufreq$ sudo cat /sys/devices/system/cpu/cpufreq/policy{0,4}/scaling_governor
userspace
userspace
pedro@nanopim4v2:/sys/devices/system/cpu/cpufreq$ sudo cat /sys/devices/system/cpu/cpufreq/policy{0,4}/cpuinfo_{cur,min,max}_freq
1416000
408000
1512000
1416000
408000
2016000

My M4V2 has been working for almost 6 days now with 19 docker containers and no problems at all! I understand that ideally one wants to use the "ondemand" governor, but I do wonder what would happen if you ran your tests with the "userspace" governor and fixed frequencies?
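For reference, pinning both clusters like this can be done either through /etc/default/cpufrequtils (as posted earlier) or interactively with cpufreq-set from the cpufrequtils package - roughly as below; one CPU per cluster is enough, since cpu0-3 and cpu4-5 each share a policy (values are in kHz):

# Pin the little cluster (cpu0-3) and big cluster (cpu4-5) to a fixed 1416000 kHz
$ sudo cpufreq-set -c 0 -g userspace
$ sudo cpufreq-set -c 4 -g userspace
$ sudo cpufreq-set -c 0 -f 1416000
$ sudo cpufreq-set -c 4 -f 1416000
# Verify the resulting policies
$ cpufreq-info -c 0 -p
$ cpufreq-info -c 4 -p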
zamnuts Posted October 29, 2020 (edited)

I ran some tests over the past few days; all nodes are stable using the "performance" governor and the 1.5GHz/2GHz cpu frequencies.

Nodes:
- 1x m4v2 on kernel 4.4.213 (w/ SATA hat)
- 3x m4v2 on kernel 5.8.6
- 1x m4 (v1) on kernel 4.4.213

sysfs configuration:

$ tail -n+1 /etc/sysfs.d/*
==> /etc/sysfs.d/cpufreq-policy0.conf <==
devices/system/cpu/cpufreq/policy0/scaling_governor = performance
devices/system/cpu/cpufreq/policy0/scaling_max_freq = 1512000
devices/system/cpu/cpufreq/policy0/scaling_min_freq = 408000
# enable sampling_rate if scaling_governor = ondemand
#devices/system/cpu/cpufreq/policy0/ondemand/sampling_rate = 40000

==> /etc/sysfs.d/cpufreq-policy4.conf <==
devices/system/cpu/cpufreq/policy4/scaling_governor = performance
devices/system/cpu/cpufreq/policy4/scaling_max_freq = 2016000
devices/system/cpu/cpufreq/policy4/scaling_min_freq = 408000
# enable sampling_rate if scaling_governor = ondemand
#devices/system/cpu/cpufreq/policy4/ondemand/sampling_rate = 465000

FYI, the 4.4 kernels can't have scaling_max_freq set to 1512000 and 2016000, since their scaling_available_frequencies top out at 1416000 and 1800000 respectively. Setting scaling_max_freq above these just selects the maximum available frequency, so there's no adverse effect from the sysfs configuration. Also, when using any governor besides "ondemand" (e.g. "performance"), setting ondemand/sampling_rate causes sysfsutils to fail because the "ondemand" directory does not exist (and the other directives don't get set). Also, I have swap disabled (as you can see in the free -m output below); these nodes will form a k8s cluster (and soon, now that they're confidently stable!).

Lastly, I had to disable the "ondemand" service so the scaling_governor sysfs settings would persist across reboots (likely due to service startup order, i.e. sysfsutils before ondemand):

systemctl disable ondemand
systemctl mask ondemand

Tests and results:

1. memtester with as much free memory as possible (~3400M for the v2s and 1700M on the v1)
- Sequence: reboot, 7 loops, then reboot again, 7 loops
- Memory allocation was done on a per-node basis, based on "available" minus about 15M, e.g.:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3800         175        3380           1         244        3467
Swap:             0           0           0

- Results: out of 14 total loops per node (70 total), only the very first loop on a 5.8.6 kernel had 2 failures, as follows; all other loops on all nodes passed all tests:

Bit Spread : testing  68FAILURE: 0x2800000000000000 != 0x2800000080000000 at offset 0x56f8c790.
Bit Flip   : testing 215FAILURE: 0x84000000 != 0x04000000 at offset 0x55194dd0.

2. Compiled linux kernel 5.9.1 (without the NFS/network component), a recommended test from the u-boot memory tester readme: "The best known test case to stress a system like that is to boot Linux with root file system mounted over NFS, and then build some larger software package natively (say, compile a Linux kernel on the system) - this will cause enough context switches, network traffic (and thus DMA transfers from the network controller), varying RAM use, etc. to trigger any weak spots in this area."
- Used "make -j $(nproc)" to utilize all available cores
- "Monitored" each node with periodic resource usage dumps, also to ensure the governor/frequencies didn't change.
Here's a snippet that shows free memory and cpu load averages:

2020-10-29T04:33:17+0000 192.168.68.20: [ 4.4.213-rk3399] 04:33:20 up 8:46, 1 user, load average: 6.07, 5.87, 5.92, free: 105M, govs: performance performance, freqs: 1416000 1800000
2020-10-29T04:33:20+0000 192.168.68.21: [5.8.6-rockchip64] 04:33:21 up 8:46, 1 user, load average: 6.01, 6.02, 6.00, free: 480M, govs: performance performance, freqs: 1512000 2016000
2020-10-29T04:33:21+0000 192.168.68.22: [5.8.6-rockchip64] 04:33:22 up 8:46, 1 user, load average: 6.00, 6.06, 6.02, free: 401M, govs: performance performance, freqs: 1512000 2016000
2020-10-29T04:33:22+0000 192.168.68.23: [5.8.6-rockchip64] 04:33:23 up 8:46, 1 user, load average: 6.01, 6.03, 6.00, free: 439M, govs: performance performance, freqs: 1512000 2016000
2020-10-29T04:33:23+0000 192.168.68.24: [ 4.4.213-rk3399] 04:33:26 up 8:46, 1 user, load average: 6.93, 6.73, 6.71, free: 101M, govs: performance performance, freqs: 1416000 1800000

The compilation completed successfully, and there were no errors/warnings in dmesg (kern.log), syslog, nor stdout/stderr (on the tty).

It is worth noting that the memtester failures always showed a discrepancy of 0x80 (128) within an 8-bit/1-byte section - every single one - and there was also a pattern (duplicated failure hex values among all nodes/tests). The location varied. Examples:

0x2800000000000000 != 0x2800000080000000
0x84000000 != 0x04000000
0x00000028 != 0x80000028
0x80000080 != 0x00000080
0x80000001 != 0x00000001
0x80000000000000 != 0x80000080000000
0x5555555555555555 != 0x55555555d5555555
etc...

I might as well share my cloud-init configuration at this point; only the critical new-instance bootstrapping stuff is here, all other configs/software/patching/etc is handled via ansible. Here's "user-data":

#cloud-config
disable_root: true
mounts:
  - [ swap, null ]
ntp:
  enabled: true
  ntp_client: 'auto'
packages:
  - 'sysfsutils'
package_update: false
package_upgrade: false
preserve_hostname: false
runcmd:
  - [ systemctl, disable, ondemand ]
  - [ systemctl, mask, ondemand ]
ssh_authorized_keys:
  - 'ssh-rsa <PUBLIC KEY REDACTED>'
timezone: 'Etc/UTC'
write_files:
  - content: |
      devices/system/cpu/cpufreq/policy0/scaling_governor = performance
      devices/system/cpu/cpufreq/policy0/scaling_max_freq = 1512000
      devices/system/cpu/cpufreq/policy0/scaling_min_freq = 408000
      # kernel 4.4 max is 1416000
      #devices/system/cpu/cpufreq/policy0/scaling_max_freq = 1416000
      # enable sampling_rate if scaling_governor = ondemand
      #devices/system/cpu/cpufreq/policy0/ondemand/sampling_rate = 40000
    path: /etc/sysfs.d/cpufreq-policy0.conf
  - content: |
      devices/system/cpu/cpufreq/policy4/scaling_governor = performance
      devices/system/cpu/cpufreq/policy4/scaling_max_freq = 2016000
      devices/system/cpu/cpufreq/policy4/scaling_min_freq = 408000
      # kernel 4.4 max is 1800000
      #devices/system/cpu/cpufreq/policy4/scaling_max_freq = 1800000
      # enable sampling_rate if scaling_governor = ondemand
      #devices/system/cpu/cpufreq/policy4/ondemand/sampling_rate = 465000
    path: /etc/sysfs.d/cpufreq-policy4.conf

I'll start re-imaging these nodes and actually using them now. Will report back if this issue pops back up again. Still no solution for the "ondemand" governor - the CPUs simply don't like to change frequencies in the 5.x kernel.

Edited October 29, 2020 by zamnuts
note about 0x80
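Something like the loop below reproduces that monitoring output format, in case anyone wants to run the same kind of watchdog; the host list and interval are just placeholders, and it assumes key-based ssh to every node:

#!/bin/bash
# Poll each node and print uptime/load, available memory, current governors
# and max frequencies on one line per host.
HOSTS="192.168.68.20 192.168.68.21 192.168.68.22 192.168.68.23 192.168.68.24"
while true; do
    for h in $HOSTS; do
        printf '%s %s: ' "$(date -u +%Y-%m-%dT%H:%M:%S%z)" "$h"
        ssh -o ConnectTimeout=5 "$h" '
            gov0=$(cat /sys/devices/system/cpu/cpufreq/policy0/scaling_governor)
            gov4=$(cat /sys/devices/system/cpu/cpufreq/policy4/scaling_governor)
            max0=$(cat /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq)
            max4=$(cat /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq)
            freemb=$(awk "/MemAvailable/ {print int(\$2/1024)}" /proc/meminfo)
            echo "[$(uname -r)] $(uptime | xargs), free: ${freemb}M, govs: $gov0 $gov4, freqs: $max0 $max4"
        '
    done
    sleep 60
done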
wdtz Posted October 30, 2020

> I'm willing to try any builds, configs, and pretty much anything else anyone can think of, short of soldering

Do some dtb hacking? All the memory timing is in there; maybe the memory is pushed too far, too hard? But it is plenty complicated. Looking at 5.x, much of the detail is gone, so this applies to 4.x mostly.
JackR Posted November 8, 2020

I have tested another configuration: min frequency = 408 MHz on all cores, max = 1512 on LITTLE and max = 2016 on big, but with the governor set to conservative. It has been running for 13 days, no crash.

# cat /sys/devices/system/cpu/cpufreq/policy{0,4}/scaling_governor
conservative
conservative
# cat /sys/devices/system/cpu/cpufreq/policy{0,4}/cpuinfo_{min,max}_freq
408000
1512000
408000
2016000
# cat /sys/devices/system/cpu/cpufreq/policy{0,4}/conservative/sampling_rate
10000
10000
# uname -a
Linux nanoNas 5.8.16-rockchip64 #20.08.14 SMP PREEMPT Tue Oct 20 22:37:51 CEST 2020 aarch64 GNU/Linux
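The conservative governor exposes the same per-policy sampling_rate tunable shown above (plus up_threshold, down_threshold and freq_step), so the latency-based values zamnuts derived for ondemand could in principle be tried here too - untested, just an idea:

# Raise the conservative sampling rates toward the 4.4-era ondemand values
# (run as root; these files only exist while the conservative governor is active).
echo 40000  > /sys/devices/system/cpu/cpufreq/policy0/conservative/sampling_rate
echo 465000 > /sys/devices/system/cpu/cpufreq/policy4/conservative/sampling_rate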
JackR Posted November 16, 2020

I was too quick to report: although it ran for 13 days, the NanoPi has crashed twice this weekend with the setup shown in my previous post. So it is definitely a problem when the frequency changes. I'll switch back to a fixed frequency setup until some progress is made or a workaround is found.
Pedro Lamas (Author) Posted November 16, 2020

I've had a rock solid NanoPi M4V2 since I set the governor to userspace with min and max frequency set to 1416000 - absolutely no crashes at all! I can live with that!!
NicoD Posted November 16, 2020

I'm pretty stable with performance at 1.5/2GHz. With ondemand it crashed within an hour at most. I had to set it with armbian-config; with cpufreq-set -g performance it weirdly wasn't stable. So I'm happy. Not sure if it is 100% stable.
svenoone Posted December 7, 2020

Hey guys, I am following your discussion on this topic and just tried some of your tests. I have one of these running as a DIY NAS and no idea how to get it running stably because of this error. Is there some update or a workaround you know of?