Random system reboots


Mangix


My Helios4 keeps rebooting on its own. I have no idea what happens just before each reboot. I assume some kind of kernel panic.

 

Kernel 4.19.63 - works

4.19.84 - I believe works

4.19.104 and above - broken

 

Newer ones are broken as well. This is most likely an issue with some local patch.

 

Any way to view the history?


18 minutes ago, gprovost said:

Can you update /boot/armbianEnv.txt with the following in order to increase log output:
 


verbosity=7

extraargs=no_console_suspend ignore_loglevel

 

Then you will need to keep the serial console open and hope to catch something when it happens.
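
The two settings above could also be applied with a small script rather than by hand. This is only a minimal sketch, assuming /boot/armbianEnv.txt is a plain list of key=value lines; the helper name `set_env_values` and the sample file content are hypothetical and not part of Armbian.

```python
def set_env_values(text, updates):
    """Return text with each key in updates set to its value.

    Existing key=value lines are replaced in place (any previous value
    is overwritten); keys that are missing are appended at the end.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    for key, value in remaining.items():
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


# Hypothetical file content, used only to demonstrate the helper.
sample = "verbosity=1\nrootfstype=ext4\n"
print(set_env_values(sample, {
    "verbosity": "7",
    "extraargs": "no_console_suspend ignore_loglevel",
}), end="")
```

To use it on a real system, the text would be read from and written back to /boot/armbianEnv.txt as root, after making a backup of the file.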

 

Does it crash easily?

Yeah. Since I replaced kernel 4.19.63 to test newer ones, I can't get more than a few hours of uptime. Maybe 2-3.

 

As for serial, I don't have a spare laptop to hook up. Maybe I can think of something. I wonder if there's an Android app for this...


I found the issue. It's some local Armbian patch that's messing things up.

 

I recently cloned https://github.com/armbian/build and removed 5 pointless patches. That PR was merged. I built from that and got the same issue.

 

Then I deleted a bunch of patches from the mvebu-current directory.

 

So far with this kernel, I am not getting any issues. My git status currently is this:

 

```

    deleted:    patch/kernel/mvebu-current/0044-gpio-report-all-gpios-in-debugfs.patch
    deleted:    patch/kernel/mvebu-current/40-pci-add-irq-change-handler-sspl.patch
    deleted:    patch/kernel/mvebu-current/402-sfp-display-SFP-module-information.patch
    deleted:    patch/kernel/mvebu-current/412-ARM-dts-armada388-clearfog-emmc-on-clearfog-base.patch
    deleted:    patch/kernel/mvebu-current/92-mvebu-gpio-add_wake_on_gpio_support.patch
    deleted:    patch/kernel/mvebu-current/92-mvebu-gpio-remove-hardcoded-timer-assignment-2.patch
    deleted:    patch/kernel/mvebu-current/92-mvebu-gpio-remove-hardcoded-timer-assignment.patch
    deleted:    patch/kernel/mvebu-current/dts-disable-spi-flash-on-a388-microsom.patch
    deleted:    patch/kernel/mvebu-current/fix_time_drift_remove_global_timer.patch
    deleted:    patch/kernel/mvebu-current/general-increasing_DMA_block_memory_allocation_to_2048.patch
    deleted:    patch/kernel/mvebu-current/unlock_atheros_regulatory_restrictions.patch

```

 

My theory is that the pci patch or one of the GPIO ones is causing the issue.


@Mangix Just to make sure: you're using the master branch of github.com/armbian/build, building the current version (5.8.y), and your Helios4 freezes after 2-3 hours?

Did you only build the kernel packages and swap the kernel, or is this a clean/complete image?

 

Can you tell us about the load on your Helios4 during those times? Just idling? 2-3 hours should be easy enough to reproduce. I will take my build from yesterday (after merging your PR) and try with it.

 

If possible can you test some more and leave a few of the patches you removed in there? Just to narrow it down. 

The Helios4 is not using PCI or SFP so it would be good to leave those in there for testing. The GPIO / timer patches are needed for PWM fan support. 

 

@FredK Do you also have such a high frequency of reboots?


Sigh, false alarm. It still happens. I've learned that I can reproduce it by downloading with qBittorrent while also watching a video over a Samba share. This actually reminds me of the time I managed to reboot my Turris Omnia just by watching a video through a Samba share. I wonder if the same thing is happening here...

 

When I mentioned that I could do this with mvebu and Samba but not ksmbd, the DD-WRT developer told me it's a serious kernel issue if a userspace program can crash the kernel.

 

I only installed linux-image-current-mvebu_20.11.0-trunk_armhf.deb with dpkg -i.

 

I'm thinking of collecting a serial log again. Should I be running journalctl -f while doing so?
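
One way to make a serial capture easier to line up with a reboot is to timestamp every line as it arrives. A minimal sketch, assuming the console output is available as an iterable of lines (for example `sys.stdin` when a serial tool's output is piped in); the function name `stamp_lines` and the sample lines are hypothetical.

```python
import sys
from datetime import datetime


def stamp_lines(lines, now=datetime.now):
    """Yield each input line prefixed with the time it was read."""
    for line in lines:
        yield f"[{now().isoformat(timespec='seconds')}] {line.rstrip()}"


# Demonstration on made-up lines; pass sys.stdin instead to stamp
# whatever is piped into the script.
for stamped in stamp_lines(["first console line\n", "second console line\n"]):
    print(stamped, file=sys.stdout, flush=True)
```

With timestamps in the log, the last line before the boot messages restart shows roughly when the reset happened, which can then be compared against what the system was doing at that time.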

 

@gprovost watchdog is running, yes.


How long have both of you (@Mangix and @FredK) been running your Helios4 setups? We have had a lot of cases of faulty Helios4 PSUs (AC/DC power adapters) after one year of usage. The capacitors used in the PSU are not lasting for their rated hours :-/ We have completely changed PSU supplier for our new product, Helios64.

 

You can find a good replacement unit on Amazon: https://www.amazon.com/TAIFU-4-Pin-12V-8-33A-Replacement/dp/B07NCG1P8X

 

 


You can check here for deb files: https://beta.armbian.com/pool/main/l/   and   https://apt.armbian.com/pool/main/l/

There is not necessarily a release for every kernel version, but you can try your luck. Remember, when changing the kernel with dpkg, to update not only the image package but also the dtb package. There are sometimes changes there.
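
A quick sanity check before running dpkg -i is to confirm that the image and dtb files come from the same build. This is a minimal sketch, assuming the usual `<package>_<version>_<arch>.deb` file naming; the dtb file name below is an assumption modelled on the image file name mentioned in this thread, and both function names are hypothetical.

```python
def parse_deb_name(filename):
    """Split '<package>_<version>_<arch>.deb' into its three parts."""
    stem = filename.rsplit("/", 1)[-1]
    if not stem.endswith(".deb"):
        raise ValueError(f"not a .deb file: {filename}")
    package, version, arch = stem[:-len(".deb")].split("_")
    return package, version, arch


def same_build(image_deb, dtb_deb):
    """True if both files carry the same version and architecture."""
    _, image_version, image_arch = parse_deb_name(image_deb)
    _, dtb_version, dtb_arch = parse_deb_name(dtb_deb)
    return (image_version, image_arch) == (dtb_version, dtb_arch)


image = "linux-image-current-mvebu_20.11.0-trunk_armhf.deb"
dtb = "linux-dtb-current-mvebu_20.11.0-trunk_armhf.deb"  # assumed name
print(same_build(image, dtb))
```

The check only compares file names, so it cannot tell whether the packages were actually built from the same kernel source; it just catches the easy mistake of mixing files from two different downloads.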

 

Apart from that, there is not much difference between the different 4.19 versions on the Armbian side.


@Mangix The issue is that since your system is still able to start up, the PSU can provide 12V without load, but under load the voltage most probably drops well below 11V. So it's not trivial to test, to be honest. One way would be to run the system with fewer HDDs hooked up to see whether it stops resetting, but I guess that's not easy to do if you have a RAID array.

6 hours ago, gprovost said:

We have had a lot of cases of faulty Helios4 PSUs (AC/DC power adapters) after one year of usage. The capacitors used in the PSU are not lasting for their rated hours :-/ We have completely changed PSU supplier for our new product, Helios64.

 

Imagine getting this sort of honesty and transparency from some mass market mfr. of NAS (or almost anything nowadays).  I don't think so!


The more I think about this the more I think it's the kernel and not the power supply. I have 4 laptop hard drives connected to my Helios4. Those only use 5V. I also only started having issues when I swapped out the kernel.

 

Looks like it's time to figure out how to build an old kernel.


@FredK Yes, the 2 PSUs you listed are OK. Any 4-pin PSU replacement for Synology with 12V and at least 8A output will work, since we use the same pinout.

 

2 hours ago, Mangix said:

The more I think about this the more I think it's the kernel and not the power supply. I have 4 laptop hard drives connected to my Helios4. Those only use 5V. I also only started having issues when I swapped out the kernel.

Hmmm, you're right, but it still doesn't completely rule out a PSU issue. What model of 2.5" HDD are you using?

 

If the watchdog is disabled, the system won't be able to reboot/reset on its own when it hangs. So does your system still reboot on its own with the watchdog disabled?


With the watchdog disabled, it does not reboot. I have a serial log with journalctl -f running. I can't see anything interesting.

 

Anyway, I now conclude this is an upstream kernel issue. Given that I know kernel 4.19.84 works and .104 is broken, I will try to narrow the issue down.

 

edit: nope. .84 reboots as well. Given that I know .63 works (I had multiple months of uptime), I'll try versions between .63 and .84. Starting with .70.

 

FFS this kernel rebooted while I was installing a new one. Now I have a brick. I forgot how to reinstall it. chroot something.

 

@gprovost Western Digital Blues.


Progress update: kernel 4.19.70 fails. .65 works. Testing .67 now.

 

edit: .66 has not crashed yet. Will wait to see if it can stay alive for 12 hours.
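
The narrowing-down above is a manual bisection over 4.19 patch levels. A minimal sketch of that bookkeeping follows, using the results reported so far in this thread; the function name `next_version_to_test` is hypothetical. It assumes a version that survived the soak time really is good, which is a weak assumption for an intermittent reboot (4.19.84 was first believed to work and later rebooted).

```python
def next_version_to_test(results):
    """Pick the next patch level to test by bisection.

    results maps a 4.19.y patch level to True (stable) or False (reboots).
    Returns the midpoint between the newest good and the oldest bad
    version, or None once the two are adjacent.
    """
    good = max(v for v, ok in results.items() if ok)
    bad = min(v for v, ok in results.items() if not ok)
    if good >= bad:
        raise ValueError("inconsistent results: a good version is newer than a bad one")
    if bad - good == 1:
        return None  # `bad` is the first failing patch level
    return (good + bad) // 2


# Results reported in this thread before .66 and .67 were tested.
results = {63: True, 65: True, 70: False, 84: False, 104: False}
print(next_version_to_test(results))  # prints 67
```

Once the function returns None, the remaining work is to look at the commits between the two adjacent releases, which is a much shorter list than the full 4.19 history.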

 

I'm trying to build kernels from a specific commit, but it doesn't seem to work. I'm trying:

 

```

--- a/config/sources/families/mvebu.conf
+++ b/config/sources/families/mvebu.conf
@@ -10,7 +10,7 @@ fi
 case $BRANCH in
        legacy)
 
-               KERNELBRANCH='tag:v4.19.66'
+               KERNELBRANCH='commit:46b306f3cd7b47901382ca014eb1082b4b25db4a'
 
        ;;
```

 

Which gives

 

```

[ error ] ERROR in function compile_kernel [ compilation.sh:379 ]
[ error ] Error kernel menuconfig failed 
```

 

I'm trying to see which commit is responsible for the failure based on https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/?h=v4.19.158&ofs=9800

 

Current theory is this commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v4.19.158&id=46b306f3cd7b47901382ca014eb1082b4b25db4a

 

It says it's for 32-bit.


Hey,

 

I am not sure whether it is possible to select a KERNELBRANCH by commit. Maybe someone else knows?

 

May I ask what else you have installed, or whether you have tried with a clean image? If there is no private data/configuration on your Helios4, you could also upload an image of your SD card so we can try with your config.

 

I have been running a Helios4 for nearly 2 days now with no crashes, transferring some data (~400 GiB) with Samba.

Welcome to Armbian 20.11.0-trunk Buster with Linux 5.8.18-mvebu

No end-user support: built from trunk

System load:   12%              Up time:       1 day 21:13
Memory usage:  4% of 1.97G      IP:            192.168.42.127
CPU temp:      42°C             Ambient temp:  28°C
Usage of /:    6% of 29G        storage/:      50% of 117G

Last login: Sun Nov 22 08:53:30 2020 from 192.168.42.11
root@helios4:~# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.42.127  netmask 255.255.255.0  broadcast 192.168.42.255
        inet6 fe80::bd05:f16:95a0:fee  prefixlen 64  scopeid 0x20<link>
        inet6 2a03:4000:2b:63a:1098:42:0:1c1  prefixlen 128  scopeid 0x0<global>
        ether 0e:47:85:69:06:21  txqueuelen 1024  (Ethernet)
        RX packets 283087536  bytes 427927106434 (398.5 GiB)
        RX errors 61  dropped 0  overruns 0  frame 0
        TX packets 24727708  bytes 1401262789 (1.3 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 37

 
