zamnuts
1. I ran some tests over the past few days; all nodes are stable using the "performance" governor and the 1.5GHz/2GHz CPU frequencies.

Nodes:
- 1x M4V2 on kernel 4.4.213 (w/ SATA hat)
- 3x M4V2 on kernel 5.8.6
- 1x M4 (v1) on kernel 4.4.213

sysfs configuration:

    $ tail -n+1 /etc/sysfs.d/*
    ==> /etc/sysfs.d/cpufreq-policy0.conf <==
    devices/system/cpu/cpufreq/policy0/scaling_governor = performance
    devices/system/cpu/cpufreq/policy0/scaling_max_freq = 1512000
    devices/system/cpu/cpufreq/policy0/scaling_min_freq = 408000
    # enable sampling_rate if scaling_governor = ondemand
    #devices/system/cpu/cpufreq/policy0/ondemand/sampling_rate = 40000

    ==> /etc/sysfs.d/cpufreq-policy4.conf <==
    devices/system/cpu/cpufreq/policy4/scaling_governor = performance
    devices/system/cpu/cpufreq/policy4/scaling_max_freq = 2016000
    devices/system/cpu/cpufreq/policy4/scaling_min_freq = 408000
    # enable sampling_rate if scaling_governor = ondemand
    #devices/system/cpu/cpufreq/policy4/ondemand/sampling_rate = 465000

FYI, the 4.4 kernels can't have scaling_max_freq set to 1512000 and 2016000, since their scaling_available_frequencies top out at 1416000 and 1800000 respectively. Setting scaling_max_freq above these just selects the maximum available frequency, so there's no adverse effect from this sysfs configuration. Also, when using any governor besides "ondemand" (e.g. "performance"), setting ondemand/sampling_rate causes sysfsutils to fail, because the "ondemand" directory does not exist (and the remaining directives don't get set).

Also, I have swap disabled (as you can see in the free -m output below); these nodes will form a k8s cluster (and soon, now that they're confidently stable!).

Lastly, I had to disable the "ondemand" service so the scaling_governor sysfs settings would persist across reboots (likely due to service startup order, i.e. sysfsutils running before ondemand):

    systemctl disable ondemand
    systemctl mask ondemand

Tests and results:

memtester, with as much free memory as possible (~3400M on the v2s, ~1700M on the v1):
- Sequence: reboot, 7 loops, then reboot again, 7 loops.
- Memory allocation was done on a per-node basis, based on "available" minus about 15M, e.g.:

    $ free -m
                  total        used        free      shared  buff/cache   available
    Mem:           3800         175        3380           1         244        3467
    Swap:             0           0           0

- Results: out of 14 loops per node (70 total), only the very first loop on a 5.8.6 kernel had 2 failures, as follows; all other loops on all nodes passed all tests:

    Bit Spread          : testing  68FAILURE: 0x2800000000000000 != 0x2800000080000000 at offset 0x56f8c790.
    Bit Flip            : testing 215FAILURE: 0x84000000 != 0x04000000 at offset 0x55194dd0.

Compiled Linux kernel 5.9.1 (without the NFS/network component), a recommended test from the u-boot memory tester readme:

    "The best known test case to stress a system like that is to boot Linux with root
    file system mounted over NFS, and then build some larger software package natively
    (say, compile a Linux kernel on the system) - this will cause enough context
    switches, network traffic (and thus DMA transfers from the network controller),
    varying RAM use, etc. to trigger any weak spots in this area."

- Used "make -j $(nproc)" to utilize all available cores.
- "Monitored" each node with periodic resource usage dumps (a rough sketch of the polling loop follows), also to ensure the governor/frequencies didn't change.
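The monitoring was roughly a loop like the sketch below; the host list, ssh user, interval, and the remote one-liner are illustrative rather than the exact script, and it assumes key-based ssh with bash on the remote side:

    # poll each node for kernel, uptime/load, free memory, governors, and current freqs
    hosts="192.168.68.20 192.168.68.21 192.168.68.22 192.168.68.23 192.168.68.24"
    remote='echo "[$(uname -r)] $(uptime | xargs),'
    remote+=' free: $(free -m | awk "/^Mem:/{print \$4}")M,'
    remote+=' govs: $(cat /sys/devices/system/cpu/cpufreq/policy{0,4}/scaling_governor | xargs),'
    remote+=' freqs: $(cat /sys/devices/system/cpu/cpufreq/policy{0,4}/scaling_cur_freq | xargs)"'
    while sleep 60; do
      for h in $hosts; do
        echo "$(date -u +%Y-%m-%dT%H:%M:%S%z) $h: $(ssh "ubuntu@$h" "$remote")"
      done
    done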
Here's a snippet that shows free memory and CPU load averages:

    2020-10-29T04:33:17+0000 192.168.68.20: [ 4.4.213-rk3399] 04:33:20 up 8:46, 1 user, load average: 6.07, 5.87, 5.92, free: 105M, govs: performance performance, freqs: 1416000 1800000
    2020-10-29T04:33:20+0000 192.168.68.21: [5.8.6-rockchip64] 04:33:21 up 8:46, 1 user, load average: 6.01, 6.02, 6.00, free: 480M, govs: performance performance, freqs: 1512000 2016000
    2020-10-29T04:33:21+0000 192.168.68.22: [5.8.6-rockchip64] 04:33:22 up 8:46, 1 user, load average: 6.00, 6.06, 6.02, free: 401M, govs: performance performance, freqs: 1512000 2016000
    2020-10-29T04:33:22+0000 192.168.68.23: [5.8.6-rockchip64] 04:33:23 up 8:46, 1 user, load average: 6.01, 6.03, 6.00, free: 439M, govs: performance performance, freqs: 1512000 2016000
    2020-10-29T04:33:23+0000 192.168.68.24: [ 4.4.213-rk3399] 04:33:26 up 8:46, 1 user, load average: 6.93, 6.73, 6.71, free: 101M, govs: performance performance, freqs: 1416000 1800000

Compilation completed successfully, with no errors/warnings in dmesg (kern.log), syslog, or stdout/stderr (on the tty).

It is worth noting that the memtester failures always showed a discrepancy of 0x80 (128) within a single 8-bit/1-byte section, every single one, and there was also a pattern (duplicated failure hex values among all nodes/tests). The location varied. Examples (see the XOR check after this post):

    0x2800000000000000 != 0x2800000080000000
    0x84000000 != 0x04000000
    0x00000028 != 0x80000028
    0x80000080 != 0x00000080
    0x80000001 != 0x00000001
    0x80000000000000 != 0x80000080000000
    0x5555555555555555 != 0x55555555d5555555
    etc...

I might as well share my cloud-init configuration at this point; only the critical new-instance bootstrapping is here, all other configs/software/patching/etc. is handled via ansible. Here's "user-data":

    #cloud-config
    disable_root: true
    mounts:
      - [ swap, null ]
    ntp:
      enabled: true
      ntp_client: 'auto'
    packages:
      - 'sysfsutils'
    package_update: false
    package_upgrade: false
    preserve_hostname: false
    runcmd:
      - [ systemctl, disable, ondemand ]
      - [ systemctl, mask, ondemand ]
    ssh_authorized_keys:
      - 'ssh-rsa <PUBLIC KEY REDACTED>'
    timezone: 'Etc/UTC'
    write_files:
      - content: |
          devices/system/cpu/cpufreq/policy0/scaling_governor = performance
          devices/system/cpu/cpufreq/policy0/scaling_max_freq = 1512000
          devices/system/cpu/cpufreq/policy0/scaling_min_freq = 408000
          # kernel 4.4 max is 1416000
          #devices/system/cpu/cpufreq/policy0/scaling_max_freq = 1416000
          # enable sampling_rate if scaling_governor = ondemand
          #devices/system/cpu/cpufreq/policy0/ondemand/sampling_rate = 40000
        path: /etc/sysfs.d/cpufreq-policy0.conf
      - content: |
          devices/system/cpu/cpufreq/policy4/scaling_governor = performance
          devices/system/cpu/cpufreq/policy4/scaling_max_freq = 2016000
          devices/system/cpu/cpufreq/policy4/scaling_min_freq = 408000
          # kernel 4.4 max is 1800000
          #devices/system/cpu/cpufreq/policy4/scaling_max_freq = 1800000
          # enable sampling_rate if scaling_governor = ondemand
          #devices/system/cpu/cpufreq/policy4/ondemand/sampling_rate = 465000
        path: /etc/sysfs.d/cpufreq-policy4.conf

I'll start re-imaging these nodes and actually using them now, and will report back if this issue pops up again. Still no solution for the "ondemand" governor - the CPUs simply don't like changing frequencies on the 5.x kernel.
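To see the single-bit nature of those failures, XOR each expected/actual pair; plain bash arithmetic on the values quoted above is enough (nothing new here, just the arithmetic spelled out):

    # XOR each expected/actual pair from the failures above; every result is a
    # single set bit - 0x80 within one byte (in these samples bit 31, 0x80000000)
    for pair in "0x2800000000000000 0x2800000080000000" \
                "0x84000000 0x04000000" \
                "0x00000028 0x80000028" \
                "0x80000080 0x00000080" \
                "0x80000000000000 0x80000080000000" \
                "0x5555555555555555 0x55555555d5555555"; do
      set -- $pair
      printf '%18s ^ %18s = 0x%x\n' "$1" "$2" $(( $1 ^ $2 ))
    done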
2. While trying to determine faulty RAM on one board, I noticed that the ondemand governor w/ the revised frequencies and sampling rate isn't as stable as it appears. So I decided to run a test using memtester with most of the leftover RAM while the system was up, e.g. `memtester 3550M 7` (a sketch for deriving that size is at the end of this post):

- All boards have max CPU freq set at 1416000 and 1800000
- 1x M4V2 using 4.4 and governor "ondemand"
- 3x M4V2 using 5.8 and governor "ondemand"
- 1x M4 (v1) using 4.4 and governor "interactive"

Results:
- All 3 boards w/ 5.8 showed at least one memtester failure per loop (I ran 7 loops). 2 of the 5.8 boards actually seized/froze, one on the 3rd loop and the other at the end of the 7th loop.
- Both boards w/ 4.4 showed zero memtester failures; every memtester test passed with "ok", and the systems are still up and responsive.

In summary, we're still no closer than when this thread started: resetting CPU max/min frequencies and changing the governor/dynamic CPU scaling doesn't actually help.

For completeness, here's one memtester result from a board running 5.8 (an earlier test); the last line of loop 2 is when the whole thing became unresponsive:

    root@host:/home/ubuntu# memtester 3500M 7
    memtester version 4.3.0 (64-bit)
    Copyright (C) 2001-2012 Charles Cazabon.
    Licensed under the GNU General Public License version 2 (only).

    pagesize is 4096
    pagesizemask is 0xfffffffffffff000
    want 3500MB (3670016000 bytes)
    got  3500MB (3670016000 bytes), trying mlock ...locked.
    Loop 1/7:
      Stuck Address       : ok
      Random Value        : ok
      Compare XOR         : ok
      Compare SUB         : ok
      Compare MUL         : ok
      Compare DIV         : ok
      Compare OR          : ok
      Compare AND         : ok
      Sequential Increment: ok
      Solid Bits          : testing  12FAILURE: 0x80000000 != 0x00000000 at offset 0x4a270988.
      Block Sequential    : ok
      Checkerboard        : ok
      Bit Spread          : testing  24FAILURE: 0x05000000 != 0x85000000 at offset 0x297959c0.
      Bit Flip            : testing   1FAILURE: 0x80000001 != 0x00000001 at offset 0x07489960.
      Walking Ones        : ok
      Walking Zeroes      : ok
      8-bit Writes        : ok
      16-bit Writes       : |

    Message from syslogd@pim402 at Oct 27 03:20:22 ...
     kernel:[ 5842.577847] Internal error: Oops: 96000047 [#1] PREEMPT SMP

    Message from syslogd@pim402 at Oct 27 03:20:22 ...
     kernel:[ 5842.599503] Code: 51000401 8b0202c2 91002080 f861db01 (f8216844)

    Message from syslogd@pim402 at Oct 27 03:21:33 ...
     kernel:[ 5914.064316] Internal error: Oops: 96000004 [#2] PREEMPT SMP

    Message from syslogd@pim402 at Oct 27 03:21:33 ...
     kernel:[ 5914.084728] Code: f9401800 d503233f d50323bf f85b8000 (f9400800)
    ok
    Loop 2/7:
      Stuck Address       : ok
      Random Value        : \

And here's a good run from a board running 4.4:

    root@pim400:/home/ubuntu# memtester 2910M 7
    memtester version 4.3.0 (64-bit)
    Copyright (C) 2001-2012 Charles Cazabon.
    Licensed under the GNU General Public License version 2 (only).

    pagesize is 4096
    pagesizemask is 0xfffffffffffff000
    want 2910MB (3051356160 bytes)
    got  2910MB (3051356160 bytes), trying mlock ...locked.
    Loop 1/7:
      Stuck Address       : ok
      Random Value        : ok
    ... cut ...
    Loop 7/7:
      Stuck Address       : ok
      Random Value        : ok
      Compare XOR         : ok
      Compare SUB         : ok
      Compare MUL         : ok
      Compare DIV         : ok
      Compare OR          : ok
      Compare AND         : ok
      Sequential Increment: ok
      Solid Bits          : ok
      Block Sequential    : ok
      Checkerboard        : ok
      Bit Spread          : ok
      Bit Flip            : ok
      Walking Ones        : ok
      Walking Zeroes      : ok
      8-bit Writes        : ok
      16-bit Writes       : ok

    Done.

What else can we try?
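Aside, on sizing: a rough sketch of deriving a "most of the leftover RAM" figure per node from free -m, leaving the same ~15M-ish headroom mentioned above (the exact headroom and the sudo are illustrative):

    # size memtester to the "available" column of free -m, minus some headroom
    avail=$(free -m | awk '/^Mem:/{print $7}')   # "available" column, in MiB
    sudo memtester "$(( avail - 15 ))M" 7        # headroom keeps the system responsive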
3. I noticed that the ondemand sampling_rate is low in 5.x compared to 4.4's default; compare below:

    $ uname -r
    4.4.213-rk3399
    $ sudo cat /sys/devices/system/cpu/cpufreq/policy{0,4}/ondemand/sampling_rate
    40000
    465000

vs

    $ uname -r
    5.8.15-rockchip64
    $ sudo cat /sys/devices/system/cpu/cpufreq/policy{0,4}/ondemand/sampling_rate
    10000
    10000

It appears that 10000 is a default (or the minimum value possible). Per https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt, 10000 is too low given the cpuinfo_transition_latency:

    $ uname -r
    5.8.15-rockchip64
    $ cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_transition_latency
    40000
    40000
    40000
    40000
    515000
    515000

Given the documented formula cpuinfo_transition_latency * 750 / 1000, the sampling_rate should be (at a minimum) 30000 for cpu0-3 and 386250 for cpu4-5 (see the worked check after this post).

Using sysfs to configure cpufreq on boot with the revised sampling_rate seems to make it stable (although uptime is only about 2 days in). I ran periodic load testing to get the ondemand governor to change frequency, and this config looks promising. Despite the calculations above, I opted to align the sampling_rate with what was present in the 4.4 kernel:

    $ tail -n+1 /etc/sysfs.d/*
    ==> /etc/sysfs.d/cpufreq-policy0.conf <==
    devices/system/cpu/cpufreq/policy0/scaling_governor = ondemand
    devices/system/cpu/cpufreq/policy0/scaling_max_freq = 1416000
    devices/system/cpu/cpufreq/policy0/scaling_min_freq = 600000
    devices/system/cpu/cpufreq/policy0/ondemand/sampling_rate = 40000

    ==> /etc/sysfs.d/cpufreq-policy4.conf <==
    devices/system/cpu/cpufreq/policy4/scaling_governor = ondemand
    devices/system/cpu/cpufreq/policy4/scaling_max_freq = 1800000
    devices/system/cpu/cpufreq/policy4/scaling_min_freq = 600000
    devices/system/cpu/cpufreq/policy4/ondemand/sampling_rate = 465000

    $ for i in 0 1 2 3 4 5; do sudo cat /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_{governor,min_freq,max_freq}; done
    ondemand
    600000
    1416000
    ondemand
    600000
    1416000
    ondemand
    600000
    1416000
    ondemand
    600000
    1416000
    ondemand
    600000
    1800000
    ondemand
    600000
    1800000

One of the two boards is already crashing; not sure if it's the board or what. I'm going to switch to the "performance" governor and see. Figured I'd share the sampling_rate delta info, though.
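To spell out that floor calculation, a quick shell loop over both policies (paths assume the per-policy cpufreq sysfs layout shown above; the 750/1000 factor is the one cited from governors.txt):

    # minimum ondemand sampling_rate per policy: transition_latency * 750 / 1000
    for p in policy0 policy4; do
      lat=$(cat /sys/devices/system/cpu/cpufreq/$p/cpuinfo_transition_latency)
      echo "$p: sampling_rate floor = $(( lat * 750 / 1000 ))"
    done
    # with latencies of 40000 and 515000 this prints 30000 and 386250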
4. I might as well chime in here... I've been battling this problem for nearly a year, all the while periodically checking various threads, and I've gone through many hardware and software iterations to troubleshoot. Just going to do a dump of everything up to this point; apologies for the long post. I'm willing to try any builds, configs, and pretty much anything else anyone can think of, short of soldering, to figure out what is going on.

Setup:
- 1x NanoPi M4V2 w/ 4x SATA hat (FriendlyELEC), powered by a brick/barrel connector ("ALITOVE 12V 8A 96W")
- 3x NanoPi M4V2 w/ 5V 4A power supplies from FriendlyELEC (tried multiple cables; currently using "UGREEN USB C Cable 5A Supercharge Type C to USB A" on all 3)
- 1x NanoPi M4 (v1) w/ a 5V 4A power supply from FriendlyELEC (same cable: UGREEN 5A Type C to USB A)
- All have dedicated 32GB eMMC
- All have the FriendlyELEC heat sink, some with the factory pad and some with a 20x20x1.2mm copper spacer + Noctua thermal compound; the copper spacer was added during a later iteration (FYI: there is no difference between the stock blue pad from FriendlyELEC and the copper spacer + Noctua, save your money!)
- All have Noctua NF-A6x25 5V 60mm fans (active cooling, powered via GPIO VDD_5V pin 2 + GND); active cooling was added during a later iteration (FYI: you definitely need active cooling - it idles 10C cooler, and under load it was something like 20-30C cooler; I'll have to post actual numbers another day)
- All on the same circuit, shared w/ my Ubiquiti networking gear (no problems with that other hardware)

Software/OS combinations I've tried:
- Stock Armbian bionic server w/ 5.x kernel (prior to the focal release)
- Stock Armbian focal server w/ 5.x kernel
- Stock Armbian focal server w/ custom-compiled 5.x kernel
- Stock Armbian focal server w/ custom-compiled 5.x kernel + hacks to disable zram (I thought maybe kubernetes/docker was freaking out)
- Custom Ubuntu focal arm cloud image, using the Armbian /boot and a custom 5.x kernel (frankenstein, I know, getting desperate)
- Custom Ubuntu bionic arm cloud image, using the Armbian /boot and a custom 5.x kernel
- Custom Ubuntu bionic arm cloud image, using the Armbian /boot and a custom 4.4.213 kernel

Currently on the 4.4 + bionic cloud image; uptime is 1 day 20 minutes, no crashes. I'll give it a few more days to see what happens. (.20 below was giving me unrelated trouble, so it got a late start.)

    2020-10-17T23:50:20+0000
    192.168.68.20: 23:50:22 up  1:25,         1 user,  load average: 0.11, 0.07, 0.01
    192.168.68.21: 23:50:25 up 1 day, 24 min, 0 users, load average: 0.77, 0.72, 0.72
    192.168.68.22: 23:50:27 up 1 day, 24 min, 0 users, load average: 0.73, 0.80, 0.81
    192.168.68.23: 23:50:29 up 1 day, 24 min, 0 users, load average: 0.68, 0.74, 0.75
    192.168.68.24: 23:50:32 up 1 day, 24 min, 0 users, load average: 0.83, 0.79, 0.82

On the 5.x kernel variations, all 5 nodes would end up in a crashed/locked state within 7 days, but typically less, regardless of load. At this point, I just let them idle and run a simple ssh command every second to execute uptime. Load average is 0.7 to 0.9. This current iteration (#7) is the first one with kernel 4.4.
Although I didn't check the governor profile until now, 4.4 is showing "interactive" on all 6 CPUs. Here's some other CPU info too:

    $ cat /sys/devices/system/cpu/cpu{0..5}/cpufreq/scaling_governor
    interactive
    interactive
    interactive
    interactive
    interactive
    interactive
    $ cat /sys/devices/system/cpu/cpu{0..5}/cpufreq/cpuinfo_transition_latency
    40000
    40000
    40000
    40000
    465000
    465000
    $ sudo cat /sys/devices/system/cpu/cpu{0..5}/cpufreq/cpuinfo_{cur,min,max}_freq
    408000
    408000
    1416000
    408000
    408000
    1416000
    408000
    408000
    1416000
    408000
    408000
    1416000
    816000
    408000
    1800000
    816000
    408000
    1800000

Extra info:
- Kernel configs are version controlled; I can supply them upon request
- Kernel compilation and imaging are all scripted, and I'm set up for Armbian dev (one day I'll put my custom scripts out on GH)
- My u-boot experience is garbage
- 2x UART CP2102 and 1x FTDI FT232RL IC USB modules are at my disposal
- Got some microSD cards sitting around here somewhere...
- I use cloud-init NoCloud for config/first-boot seeding (a minimal seed sketch follows below)
- The M4V2 w/ the SATA hat sometimes runs w/ a USB ASIX AX88179 dongle in bonded mode, but that periodically causes an unrelated kernel panic on 4.4 and 5.x (that's another topic altogether)
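On the NoCloud seeding mentioned above: a minimal sketch of building a seed, assuming the standard setup where cloud-init's NoCloud datasource looks for a filesystem labeled "cidata" containing user-data and meta-data (image size, filenames on the host, and the instance-id/hostname values are illustrative):

    # build a small vfat "cidata" volume holding user-data and meta-data
    # (requires dosfstools and mtools; values are illustrative)
    truncate -s 4M seed.img
    mkfs.vfat -n cidata seed.img
    cat > meta-data <<'EOF'
    instance-id: nanopi-m4v2-01
    local-hostname: pim401
    EOF
    # user-data is a #cloud-config file like the one shared earlier in the thread
    mcopy -i seed.img user-data meta-data ::
    # attach/flash seed.img alongside the OS image; cloud-init discovers it
    # via the "cidata" filesystem label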