Random system reboots


Mangix


19 hours ago, Mangix said:

I use qbittorrent in a docker container.

Okay, is there an easy way to install it? Then I might try it.

 

19 hours ago, Mangix said:

Again, it's a kernel issue. .66 is the last one that does not reboot. 8 hours uptime so far. With all future kernels, I can barely get 2 hours.

I believe you on the kernel issue.

 

I'm just thinking it might be something in user space triggering a kernel issue.

On 23.11.2020 at 12:25, FredK said:

I ordered a replacement PSU which was delivered and put in operation this morning. I'll give feedback in any case (spontaneous reboot or continuous operation).

Spontaneous reboot with new PSU after 26 hours of operation.

 

EDIT: System now upgraded to 5.8.18-mvebu Buster 20.11 (coming from 5.8.16-mvebu Buster 20.08.22)
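
For anyone else trying to pin down how long a board survives between spontaneous reboots, a heartbeat log saves having to watch the box. This is only a minimal sketch: the log path and the function names `heartbeat` and `uptime_before_reboots` are made up for illustration and are not part of Armbian. It assumes something like a cron job calls `heartbeat` once a minute.

```shell
#!/bin/sh
# Hypothetical log location; any path on persistent storage will do.
HEARTBEAT_LOG="${HEARTBEAT_LOG:-/var/log/heartbeat.log}"

# Append "<epoch seconds> <uptime seconds>" to the log.
# The first field of /proc/uptime is the uptime in seconds.
heartbeat() {
    printf '%s %s\n' "$(date +%s)" "$(cut -d' ' -f1 /proc/uptime)" >> "$HEARTBEAT_LOG"
}

# Print the last uptime recorded before each reboot as hours and minutes.
# A reboot shows up as an uptime value smaller than the previous one.
uptime_before_reboots() {
    awk 'NR > 1 && $2 < prev { printf "%dh%02dm\n", prev / 3600, (prev % 3600) / 60 }
         { prev = $2 }' "$1"
}
```

After a few days, `uptime_before_reboots "$HEARTBEAT_LOG"` lists one line per reboot, accurate to the heartbeat interval.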

Edited by FredK
5 hours ago, Mangix said:

@Heisath how do I build dev kernels? compile.sh only shows current and legacy.

You can build dev with

./compile.sh EXPERT=yes

26 minutes ago, gprovost said:

@Mangix I might have missed something, but which patches are you referring to exactly?

I assume he is referring to the series of patches we have for DFS support:

https://github.com/armbian/build/blob/master/patch/kernel/mvebu-current/800-Add_Armada_38x_support_for_clk-cpu.patch

https://github.com/armbian/build/blob/master/patch/kernel/mvebu-current/801-Use_shorter_register_definition_in_pmsu_c.patch

https://github.com/armbian/build/blob/master/patch/kernel/mvebu-current/802-Made_the_dynamic_frequency_scaling_support_more_generic.patch

https://github.com/armbian/build/blob/master/patch/kernel/mvebu-current/803-Armada_38x_Add_dynamic_frequency_scaling_support_in_pmsu.patch

https://github.com/armbian/build/blob/master/patch/kernel/mvebu-current/804-Update_Armada_38x_DT_for_dynamic_frequency_scaling.patch

 

These have been in there for a long time and might indeed cause the crashes. Probably they need to be adjusted (although they still apply fine).


Yeah, these seem to be exactly the same. The only difference is how the global timer is disabled: we remove the DT node, they disable its compilation.

 

Our way: https://github.com/armbian/build/blob/master/patch/kernel/mvebu-current/fix_time_drift_remove_global_timer.patch

Their way: https://github.com/hnyman/openwrt/commit/90113cd70f33449a68827e63501dcc688c14d007#diff-b7a0f3497875655ca3abc14fb540ca45c913b347a8b2c1efa2ae91b4fa5d9b39

 

EDIT: From the commit msg: "Note: upstream messages mention possible instability under heavy I/O."


I have updated our DFS patches with the OpenWrt ones. There were some small differences (probably not functional ones). The build compiles, DFS works and there is no time drift.

As @Mangix has a reliable way of causing a hang, it would be great if you could build an image based on the PR and test whether the OpenWrt patches still cause the crashes.

Afterwards I will either make a PR to remove the patches for legacy and current, or update the patches there to the OpenWrt ones as well :)


I thought Hannu moved to developing for ipq806x. Interesting...

 

I love how he notes instability under heavy I/O. That's exactly what I experience.

 

From what I see, patch 806 accomplishes the same as fix_time_drift_remove_global_timer.patch in a cleaner way.

 

Anyway, I will be waiting to confirm 24 hour uptime before I try anything else.

 

I also vote for removing these patches. We don't have these in OpenWrt. Stability is more important.

 

edit: on that last note, a PR like that for OpenWrt will be rejected. We have problems with having too many patches. We don't need any that have no chance of making it upstream eventually.

 

edit2: the Turris people have also more or less abandoned this patchset. They carry it in their OpenWrt fork, but they use mainline OpenWrt in newer versions.

 

edit3: I will note, this device has fans. I don't think temperature is ever a problem.


I also assume these new patches have the same problem. But as some lines have changed, they might be better adjusted to newer kernels.

The DFS patches were already in 4.14, 4.19 etc., and you had no problems prior to a specific 4.19 version. So it is not "just" a problem with the patches, but rather a problem with the patches after a specific kernel version. I assume the DFS patches don't fit as well anymore.

In any case, I'd like to be sure that DFS itself is not stable on mainline (and not just because of outdated patches) before we remove it.


Related: https://forum.openwrt.org/t/cpu-frequency-scaling-driver-for-mvebu-wrt3200acm-etc/2808/91

 

Not looking good.

 

edit: I got 18 hours uptime before I gave up. Now testing kernel 5.9 with that PR on GitHub. Hopefully this works.

 

dmesg shows this also:

 

debugfs: Directory 'cpu1' with parent 'opp' already present!

 

edit2: it seems this dev 5.9 kernel has broken PWM. Fans are going at full speed. Otherwise, I went hard at it for ~3 hours and I can't get it to reboot. We'll see if it survives 24 hours. Looks like the Turris people fixed something... or the last patch is what actually fixes things.

 

edit3: I got impatient. Flashed a freshly built kernel with a new dtb. Fan works correctly now.

 

edit4: bad news. Even these new patches cause freezing. Turns out this is easier to reproduce with btrfs scrub. It reboots within an hour.


I just merged the PR, so we now have DFS with the old patches on legacy and current, and the new, apparently better patches on dev.

You mentioned these new patches also freeze, but only with btrfs scrub. Can you do one more test? Compare 'btrfs scrub' without DFS vs. 'btrfs scrub' with the new patches. This should then give a definitive answer.
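
In case it helps to make the two runs comparable, here is a rough sketch of how such a test could be scripted. It assumes that a kernel with the DFS patches exposes a cpufreq directory under `/sys/devices/system/cpu/cpu0/` and a kernel without them does not. The mount point `/mnt/data`, the log path and the function names are placeholders. `btrfs scrub start -B` keeps the scrub in the foreground, so the script only continues once the scrub has finished.

```shell
#!/bin/sh
# Report whether a cpufreq driver is active, i.e. whether DFS is in play.
# Assumption: a kernel built without the DFS patches has no cpufreq directory.
dfs_state() {
    cpu_dir="${1:-/sys/devices/system/cpu/cpu0}"
    if [ -r "$cpu_dir/cpufreq/scaling_governor" ]; then
        echo "DFS active (governor: $(cat "$cpu_dir/cpufreq/scaling_governor"))"
    else
        echo "DFS not available"
    fi
}

# Run a foreground scrub on the given mount point and log the outcome.
# If the board freezes mid-scrub, the "finished" line is never written.
scrub_test() {
    mnt="${1:-/mnt/data}"              # placeholder mount point
    log="${2:-/root/scrub-test.log}"   # placeholder log file
    echo "$(date) start: kernel $(uname -r), $(dfs_state)" >> "$log"
    sync                               # get the start line onto disk first
    btrfs scrub start -B "$mnt" >> "$log" 2>&1
    rc=$?                              # save the scrub's exit code before running date
    echo "$(date) finished, exit code $rc" >> "$log"
}
```

Running `scrub_test` once on each kernel leaves a log in which a start line without a matching "finished" line marks a freeze, together with the kernel version and DFS state it happened on.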


I'm running btrfs scrub currently without the DFS patches. 2 hours uptime and counting. Old or new DFS patches do not make a difference. They both cause freezing.

 

edit: I should mention the reason I'm running btrfs scrub is because of all of these kernel freezes. I'm expecting to see errors. So far there are none. That's pretty impressive as there have been 100+ freezes.

 

Anyway I'm done with these DFS patches. Whether or not they get removed, I'm building my kernels without them.


Yeah, that is the reason why it was not removed until now. No one complained. Armbian has been past Linux kernel 4.19 for quite a while (hell, there was even a complete 5.4 release) and no one seems to have any issues. Igor, gprovost and myself all use Helios4 / clearfogpro on a daily basis (as NAS or whatever) and do not have / are unable to reproduce these problems... I think the DFS stuff only causes problems under a combination of many specific factors.
