Schroedingers Cat Posted December 11, 2020 (edited)

I'm using Armbian 20.11.1 Focal (5.9.11-rockchip64) on my Helios64. Under sustained heavy load over Ethernet (SFTP to a 12 TB Ultrastar SATA disk) lasting more than about an hour, the entire system freezes. I'm not sure whether it actually freezes, but I lose all SSH connections and cannot reconnect via SSH until I restart the Helios64. It happened 4 times today and is fully reproducible. I haven't tested another drive yet.

What makes this difficult to debug is that nothing relevant is written to /var/log/syslog, kern.log, or faillog that would show what state the device is in.

Is this a known issue? How can I find out what the problem is?

Edited December 11, 2020 by Schroedingers Cat
Schroedingers Cat (Author) Posted December 13, 2020 (edited)

When I'm connected via COM/PuTTY, this is what happens when the system freezes while I monitor it with iotop:

Quote
    11385 ?sys sftp-user 0.00 B/s 98.59 M/s 0.00 % 0.00 % sftp-server
    4 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_par_gp]
    6 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker~-kblockd]
    8 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [mm_percpu_wq]
    9 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]
    10 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_preempt]
    11 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0]
    12 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [cpuhp/0]
    13 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [cpuhp/1]
    14 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/1]
    15 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/1]
    17 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker~-kblockd]
    18 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [cpuhp/2]
    19 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/2]
    20 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirq-/2]
    18 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [cpuhp/2]
    19 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/2]
    20 ?sys root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/2]
    keys: any: refresh q: quit i: ionice o: active p: procs a: accum
    sort: r: asc left: SWAPIN right: COMMAND home: TID end:
    [45859.904820] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
    [45859.905384] rcu: 2-...!: (16 ticks this GP) idle=892/1/0x4000000000000000 softirq=933974/933976 fqs=1
    [45859.906213] rcu: 3-...!: (8 GPs behind) idle=842/0/0x1 softirq=985077/985077 fqs=1
    [45859.906899] rcu: 5-...!: (36 ticks this GP) idle=e8a/1/0x4000000000000000 softirq=2724805/2724808 fqs=1

After that, I cannot send any commands via COM, so the system is definitely freezing. It also happens when copying to any other drive, and when fewer drives are connected.

Any idea what is causing this instability?
Edited December 13, 2020 by Schroedingers Cat
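Editorial note: in the RCU stall report quoted above, the number in front of the dash on each per-CPU line is the CPU that stalled. When a longer serial-console capture has to be searched, a small filter can pull those numbers out. The following is a minimal sketch; the function name `stalled_cpus` and the log file name in the usage note are hypothetical, and the pattern only assumes the line shape seen in this thread.

```shell
#!/bin/sh
# Print the CPU number from each per-CPU line of an RCU stall report.
# Reads the files given as arguments, or standard input when there are none.
stalled_cpus() {
    # Per-CPU lines look like "rcu: 2-...!: (16 ticks this GP) ...".
    # Keep only the digits between "rcu:" and the first "-".
    # The summary line ("rcu: INFO: ...") has no leading digits and is skipped.
    sed -n 's/.*rcu:[[:space:]]*\([0-9][0-9]*\)-[^:]*: (.*/\1/p' "$@"
}
```

For example, `stalled_cpus console.log | sort -u` would list each stalled CPU once, where `console.log` stands for whatever file the terminal program logged the serial session to.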
Werner Posted December 13, 2020

On 12/12/2020 at 12:00 AM, Schroedingers Cat said:
    What makes this difficult to debug is that under /var/log/syslog or kern.log or faillog nothing relevant is being written as to in what state the device is.

Logs are usually kept in RAM and only written out every few minutes, which drastically increases the lifespan of SD cards. The downside is that it is sometimes hard to track down issues. You could either disable log2ram and try to reproduce the problem, or connect a debug console, follow the output of dmesg, and wait until another freeze happens. That said, there does not necessarily need to be any output. Sometimes systems freeze without giving any clue whatsoever :/
Schroedingers Cat (Author) Posted December 15, 2020

Thanks for your response. I'm only allowed to answer after 24 hours, for some reason.

I disabled log2ram, but the log files still don't show anything interesting. The crash happened on December 15 around 16:00 and the restart was around 19:50.

Here's the relevant section from /var/log/syslog:

Quote
    Dec 15 14:15:01 localhost CRON[24499]: (root) CMD (/usr/lib/armbian/armbian-truncate-logs)
    Dec 15 14:15:01 localhost CRON[24500]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
    Dec 15 14:17:01 localhost CRON[24516]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
    Dec 15 14:25:01 localhost CRON[24571]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
    Dec 15 14:30:01 localhost CRON[24604]: (root) CMD (/usr/lib/armbian/armbian-truncate-logs)
    Dec 15 14:35:01 localhost CRON[24640]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
    Dec 15 14:45:01 localhost CRON[24706]: (root) CMD (/usr/lib/armbian/armbian-truncate-logs)
    Dec 15 14:45:01 localhost CRON[24707]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
    Dec 15 14:55:01 localhost CRON[24774]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
    Dec 15 15:00:01 localhost CRON[24808]: (root) CMD (/usr/lib/armbian/armbian-truncate-logs)
    Dec 15 15:05:01 localhost CRON[24842]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
    Dec 15 15:15:01 localhost CRON[24910]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
    Dec 15 15:15:01 localhost CRON[24911]: (root) CMD (/usr/lib/armbian/armbian-truncate-logs)
    Dec 15 15:17:01 localhost CRON[24927]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
    Dec 15 15:25:01 localhost CRON[24983]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
    Dec 15 15:30:01 localhost CRON[25017]: (root) CMD (/usr/lib/armbian/armbian-truncate-logs)
    Dec 15 19:50:59 localhost systemd-modules-load[417]: Inserted module 'lm75'
    Dec 15 19:50:59 localhost systemd-sysctl[433]: Not setting net/ipv4/conf/all/promote_secondaries (explicit setting exists).

/var/log/kern.log:

Quote
    Dec 14 00:17:14 localhost kernel: [27357.681768] NOHZ: local_softirq_pending 08
    Dec 14 01:27:26 localhost kernel: [31569.414047] NOHZ: local_softirq_pending 08
    Dec 14 02:40:35 localhost kernel: [35958.205471] NOHZ: local_softirq_pending 08
    Dec 14 07:35:34 localhost kernel: [53657.100804] NOHZ: local_softirq_pending 08
    Dec 14 13:46:37 localhost kernel: [75918.982748] NOHZ: local_softirq_pending 08
    Dec 15 19:50:59 localhost kernel: [ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]

Nothing interesting found in /var/log/dmesg.

Quote
    or connect debug console and follow the output of dmesg and wait until another freeze happens.

Do you mean connecting via COM/USB? I lose the connection when it happens, but I can try.

Is there anything else I can do to investigate this? Do you think my device is broken?
gprovost Posted December 16, 2020

@Schroedingers Cat Yes, please run your system with the serial console open and the command dmesg -w running, then copy the output here once the system crashes.
Werner Posted December 16, 2020

9 hours ago, Schroedingers Cat said:
    I'm only allowed to answer after 24 hours, for some reason.

Unfortunately, that is a necessary measure to fight spam bots. Anyway, you received a like, which should lift the restriction within 24h.
Schroedingers Cat (Author) Posted December 18, 2020

Support told me to do the following: run `armbian-config`, go to System -> CPU, and set:

    Minimum CPU speed = 1200000
    Maximum CPU speed = 1200000
    CPU governor = performance

I have now been writing to my HDD via SFTP for more than 1.5 days without an issue, so that seems to have solved it.
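Editorial note: the three settings above can also be written as a config fragment. This is only a sketch under an assumption: that `armbian-config` stores them in /etc/default/cpufrequtils with the variable names shown. Neither the path nor the names appear in this thread, so check the file on your own image before relying on it. The values are in kHz, so 1200000 corresponds to 1.2 GHz.

```shell
# Sketch of /etc/default/cpufrequtils (path and variable names are assumptions).
# Minimum and maximum are set to the same value so the clock never changes,
# which takes frequency scaling out of the picture.
ENABLE=true
MIN_SPEED=1200000
MAX_SPEED=1200000
GOVERNOR=performance
```

If the file is edited by hand rather than through `armbian-config`, a reboot is probably the simplest way to make sure the values are picked up.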
Schroedingers Cat (Author) Posted December 20, 2020

@Werner @gprovost I just installed the Armbian update to v20.11.4. How do I find out if it fixes the issue with the Rockchip?
gprovost Posted December 21, 2020

For now we haven't forced the CPU governor to performance; we still hope to find a fix that makes DVFS work properly. v20.11.4 doesn't change anything related to DVFS. It is just a rebuild after the Realtek r8152 driver was removed by mistake in the previous version.

BTW, you can check your current CPU governor settings with the following command: cpufreq-info
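Editorial note: if cpufreq-info is not installed, the same information can be read from the kernel's standard cpufreq files under /sys. A minimal sketch follows; the function name `show_cpufreq` is hypothetical, and the optional argument exists only so the function can be pointed at a test directory instead of the live /sys.

```shell
#!/bin/sh
# Print the cpufreq governor and frequency limits (in kHz) for every CPU.
# $1 is an optional sysfs root; it defaults to /sys.
show_cpufreq() {
    root="${1:-/sys}"
    for dir in "$root"/devices/system/cpu/cpu[0-9]*/cpufreq; do
        # Skip when the glob matched nothing or the CPU has no cpufreq support.
        [ -r "$dir/scaling_governor" ] || continue
        # Turn ".../cpu3/cpufreq" into "cpu3".
        cpu="${dir%/cpufreq}"
        cpu="${cpu##*/}"
        printf '%s governor=%s min=%s max=%s\n' \
            "$cpu" \
            "$(cat "$dir/scaling_governor")" \
            "$(cat "$dir/scaling_min_freq")" \
            "$(cat "$dir/scaling_max_freq")"
    done
}
```

Calling `show_cpufreq` with no argument reads the running system. With the settings described earlier in the thread applied, each line would be expected to show the performance governor with both limits at 1200000.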