Mangix, posted November 20, 2020
I seem to have issues on my Helios4 where it constantly reboots. I have no idea what happens just before. I assume some kind of kernel panic.
Kernel 4.19.63: works
Kernel 4.19.84: I believe works
Kernel 4.19.104 and above: broken
Newer ones are broken as well. This is most likely an issue with some local patch. Is there any way to view the patch history?

gprovost, posted November 20, 2020
Can you update /boot/armbianEnv.txt with the following in order to increase log output:
verbosity=7
extraargs=no_console_suspend ignore_loglevel
Then you will need to keep the serial console open and hope to catch something when it happens. Does it crash easily?

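As a rough illustration of the change suggested above, the following sketch sets the two keys in the env file without leaving duplicate entries behind. The helper name `set_env_key` is hypothetical (it is not an Armbian tool), and the sketch assumes a plain KEY=VALUE file; an updated key is moved to the end of the file.

```shell
#!/bin/sh
# set_env_key FILE KEY VALUE (hypothetical helper, not part of Armbian):
# drop any existing "KEY=..." line, then append "KEY=VALUE".
set_env_key() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    # Keep every line that does not define KEY; grep exits 1 when it
    # prints nothing (for example on an empty file), which is fine here.
    grep -v "^${key}=" "$file" > "$tmp" || true
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back through the original file so its permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage (as root, assuming the default Armbian location):
#   set_env_key /boot/armbianEnv.txt verbosity 7
#   set_env_key /boot/armbianEnv.txt extraargs "no_console_suspend ignore_loglevel"
```

The file is read at boot, so a reboot is needed before the extra log output shows up on the serial console.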
Mangix (Author), posted November 20, 2020
18 minutes ago, gprovost said:
> Can you update /boot/armbianEnv.txt with the following in order to increase log output: verbosity=7 extraargs=no_console_suspend ignore_loglevel Then you will need to keep the serial console open and hope to catch something when it happens. Does it crash easily?

Yeah. Once I got rid of kernel 4.19.63 to test newer ones, I can't get uptime longer than a few hours, maybe 2-3. As for serial, I don't have a spare laptop to hook up. Maybe I can think of something. I wonder if there's an Android app for this...

Mangix (Author), posted November 20, 2020
Just happened again. The serial output shows nothing useful: the first line is from when I connected, and after that there is just U-Boot. Looks like I'll have to compile my own kernel with most of the patches stripped out...

gprovost, posted November 20, 2020
When it hangs, does it automatically reboot? Do you have the watchdog service running (systemctl status watchdog.service)?

Mangix (Author), posted November 20, 2020
I found the issue. Some local Armbian patch is messing things up. I recently cloned https://github.com/armbian/build and removed 5 pointless patches. That PR was merged. So I built that and got the same issue. Then I deleted a bunch of patches from the mvebu-current directory. So far with this kernel, I am not getting any issues. My git status currently is this:
```
deleted: patch/kernel/mvebu-current/0044-gpio-report-all-gpios-in-debugfs.patch
deleted: patch/kernel/mvebu-current/40-pci-add-irq-change-handler-sspl.patch
deleted: patch/kernel/mvebu-current/402-sfp-display-SFP-module-information.patch
deleted: patch/kernel/mvebu-current/412-ARM-dts-armada388-clearfog-emmc-on-clearfog-base.patch
deleted: patch/kernel/mvebu-current/92-mvebu-gpio-add_wake_on_gpio_support.patch
deleted: patch/kernel/mvebu-current/92-mvebu-gpio-remove-hardcoded-timer-assignment-2.patch
deleted: patch/kernel/mvebu-current/92-mvebu-gpio-remove-hardcoded-timer-assignment.patch
deleted: patch/kernel/mvebu-current/dts-disable-spi-flash-on-a388-microsom.patch
deleted: patch/kernel/mvebu-current/fix_time_drift_remove_global_timer.patch
deleted: patch/kernel/mvebu-current/general-increasing_DMA_block_memory_allocation_to_2048.patch
deleted: patch/kernel/mvebu-current/unlock_atheros_regulatory_restrictions.patch
```
My theory is that the PCI patch or one of the GPIO ones is causing the issue.

gprovost, posted November 20, 2020
OK, we will have to look at it. @Heisath @aprayoga

FredK, posted November 20, 2020
For the past two or three weeks I have been experiencing spontaneous reboots as well. My system:
Linux helios4 5.8.16-mvebu #20.08.13
Capturing logs at the serial console didn't reveal any useful information, so I ran out of ideas. The output of "armbianmonitor -u" has been stored at http://ix.io/2EN3

Heisath, posted November 20, 2020
@Mangix Just to make sure: you're using the master branch of github.com/armbian/build, building the current version (5.8.y), and your Helios4 freezes after 2-3 hours? Did you only build the kernel packages and swap the kernel, or is this a clean, complete image? Can you tell us about the load on the Helios4 during those times? Is it just idling?

2-3 hours should be easy enough to reproduce. I will take my build from yesterday (after merging your PR) and try with it. If possible, can you test some more and leave a few of the patches you removed in there, just to narrow it down? The Helios4 is not using PCI or SFP, so it would be good to leave those patches in for testing. The GPIO / timer patches are needed for PWM fan support.

@FredK Do you also have such a high frequency of reboots?

gprovost, posted November 20, 2020
@Mangix @FredK Can you please confirm whether the watchdog service is running (systemctl status watchdog.service)? If it's not enabled but your system reboots on its own, then it's a bit strange. This could help to narrow down the issue.

Mangix (Author), posted November 20, 2020
Sigh, false alarm. It still happens. I've learned that I can reproduce it by downloading with qbittorrent while also watching a video through a Samba share.

This actually reminds me of the time I managed to reboot my Turris Omnia just by watching a video through a Samba share. I wonder if the same thing is happening here... When I mentioned that I could do this with mvebu and Samba but not ksmbd, the DD-WRT developer told me it's a serious kernel issue if a userspace program can crash the kernel.

I only installed linux-image-current-mvebu_20.11.0-trunk_armhf.deb with dpkg -i. I'm thinking of collecting a serial log again. Should I be running journalctl -f while doing so?

@gprovost watchdog is running, yes.

Mangix (Author), posted November 20, 2020
Watchdog is running. Should I disable it?

gprovost, posted November 20, 2020
Yes, please disable it and restart your system. If the system reboots (resets) on its own without the watchdog service running, then a possible reason is that the PSU is starting to fail and, during operation, the output voltage drops, resulting in a hard reset.

FredK, posted November 20, 2020
> @Heisath: Do you also have such a high frequency of reboots?

Between 6 hours and 2 days.

> @gprovost: Can you please confirm whether the watchdog service is running?

During log capturing at the serial console the watchdog service was disabled.

Mangix (Author), posted November 20, 2020
@gprovost That sounds bad. OTOH, it makes no sense that I started having these issues only once I got off kernel 4.19.63. It would be interesting to bisect to see where between 4.19.84 and 4.19.104 it started failing.

edit: actually, is there a .deb file for that version anywhere?

gprovost, posted November 20, 2020
How long have you both (@Mangix and @FredK) been running your Helios4 setup? We have had a lot of cases of faulty Helios4 PSUs (AC/DC power adapters) after one year of usage. The capacitors used in the PSU are not fulfilling their hour rating :-/ We have completely changed PSU supplier for our new product, Helios64.

You can find a good replacement unit on Amazon: https://www.amazon.com/TAIFU-4-Pin-12V-8-33A-Replacement/dp/B07NCG1P8X

Heisath, posted November 20, 2020
You can check here for deb files: https://beta.armbian.com/pool/main/l/ and https://apt.armbian.com/pool/main/l/
There is not necessarily a release for every kernel version, but you can try your luck. Remember, when changing the kernel with dpkg, to update not only the image package but also the dtb package. There are sometimes changes there. Apart from that, there is not much difference between the 4.19 versions from the Armbian side.

Mangix (Author), posted November 20, 2020
@gprovost I bought mine in the last batch. It's been running every day. Any way to test the PSU?

@Heisath Looks like no luck on the former. The latter throws an XML error.

gprovost, posted November 20, 2020
@Mangix The issue is that, since your system is still able to start up, the PSU can provide 12V without load, but under load the voltage most probably drops way below 11V. So it's not trivial to test, to be honest. One way would be to run the system with fewer HDDs hooked up to see if the system stops resetting, but I guess if you have a RAID array then that is not easy to do.

FredK, posted November 20, 2020
1 hour ago, gprovost said:
> You can find a good replacement unit on Amazon: https://www.amazon.com/TAIFU-4-Pin-12V-8-33A-Replacement/dp/B07NCG1P8X

@gprovost What about these two replacement PSUs?
https://www.amazon.de/dp/B07Q72NBQK
https://www.amazon.de/dp/B07RHSX3WR

TRS-80, posted November 20, 2020
6 hours ago, gprovost said:
> We have had a lot of cases of faulty Helios4 PSUs (AC/DC power adapters) after one year of usage. The capacitors used in the PSU are not fulfilling their hour rating :-/ We have completely changed PSU supplier for our new product, Helios64.

Imagine getting this sort of honesty and transparency from some mass market mfr. of NAS (or almost anything nowadays). I don't think so!

Mangix (Author), posted November 21, 2020
The more I think about this, the more I think it's the kernel and not the power supply. I have 4 laptop hard drives connected to my Helios4. Those only use 5V. I also only started having issues when I swapped out the kernel.

Looks like it's time to figure out how to build an old kernel.

gprovost, posted November 21, 2020
@FredK Yes, the 2 PSUs you listed are OK. Any 4-pin PSU replacement for Synology with 12V and at least 8A output will work, since we use the same pinout.

2 hours ago, Mangix said:
> The more I think about this, the more I think it's the kernel and not the power supply. I have 4 laptop hard drives connected to my Helios4. Those only use 5V. I also only started having issues when I swapped out the kernel.

Hmmm, you're right, but it still doesn't completely rule out a PSU issue. What model of 2.5" HDD are you using?

If the watchdog is disabled, the system won't be able to reboot/reset on its own when it hangs. So does your system still reboot on its own with the watchdog disabled?

Mangix (Author), posted November 21, 2020
With the watchdog disabled, it does not reboot. I have a serial log with journalctl -f running. I can't see anything interesting.

Anyway, I now conclude this is an upstream kernel issue. Given that I know kernel 4.19.84 works and .104 is broken, I will try to narrow the issue down.

edit: nope. .84 reboots as well. Given that I know .63 works (I had multiple months of uptime), I'll try versions between .63 and .84, starting with .70.

FFS, this kernel rebooted while I was installing a new one. Now I have a brick. I forgot how to reinstall it. chroot something.

@gprovost Western Digital Blues.

Mangix (Author), posted November 21, 2020
Progress update: kernel 4.19.70 fails. .65 works. Testing .67 now.

edit: .66 has not crashed yet. Will wait to see if it can stay alive for 12 hours.

I'm trying to compile kernels based on commit. It doesn't seem to work though. I'm trying
```
--- a/config/sources/families/mvebu.conf
+++ b/config/sources/families/mvebu.conf
@@ -10,7 +10,7 @@ fi
 case $BRANCH in
 	legacy)
-		KERNELBRANCH='tag:v4.19.66'
+		KERNELBRANCH='commit:46b306f3cd7b47901382ca014eb1082b4b25db4a'
 	;;
```
Which gives
```
[ error ] ERROR in function compile_kernel [ compilation.sh:379 ]
[ error ] Error kernel menuconfig failed
```
I'm trying to see which commit is responsible for the failure based on https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/?h=v4.19.158&ofs=9800

Current theory is this commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v4.19.158&id=46b306f3cd7b47901382ca014eb1082b4b25db4a
It says it's for 32-bit.

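The manual narrowing above (.63 good, .84 bad, then .70, .65, .67, .66) is essentially a binary search over 4.19.y patch levels. A minimal sketch of that search follows. The function name `first_bad` and the check callback are hypothetical, and the sketch assumes each verdict is reliable, which is a strong assumption for a crash that can take hours to show up: a "good" result really needs a long soak test.

```shell
#!/bin/sh
# first_bad GOOD BAD CHECK (hypothetical helper):
# binary search between a known-good and a known-bad patch level.
# CHECK N must exit 0 if kernel 4.19.N is stable and non-zero otherwise.
# Prints the lowest patch level that CHECK reports as bad.
first_bad() {
    good=$1
    bad=$2
    check=$3
    while [ $((bad - good)) -gt 1 ]; do
        mid=$(( (good + bad) / 2 ))
        if "$check" "$mid"; then
            good=$mid   # mid is stable, so the breakage is later
        else
            bad=$mid    # mid crashes, so the breakage is at mid or earlier
        fi
    done
    echo "$bad"
}

# Usage: supply your own check, for example one that asks you to install
# 4.19.N, run the workload for a fixed time, and answer y/n:
#   ask() { printf '4.19.%s stable? [y/n] ' "$1" >&2; read -r a; [ "$a" = y ]; }
#   first_bad 63 84 ask
```

Between .63 and .84 this needs at most five builds, compared with up to twenty when stepping through every patch level.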
Heisath, posted November 22, 2020
Hey, I am not sure if it is possible to select a kernel branch by commit. Maybe someone else knows?

May I ask what else you have installed, or whether you have tried with a clean image? If there is no private data or configuration on your Helios4, you could also upload an image of your SD card so we can try with your config.

I have been running a Helios4 for nearly 2 days now, with no crashes, transferring some data (~400 GiB) with Samba.

Welcome to Armbian 20.11.0-trunk Buster with Linux 5.8.18-mvebu
No end-user support: built from trunk
System load: 12%   Up time: 1 day 21:13
Memory usage: 4% of 1.97G   IP: 192.168.42.127
CPU temp: 42°C   Ambient temp: 28°C
Usage of /: 6% of 29G   storage/: 50% of 117G
Last login: Sun Nov 22 08:53:30 2020 from 192.168.42.11
root@helios4:~# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
    inet 192.168.42.127 netmask 255.255.255.0 broadcast 192.168.42.255
    inet6 fe80::bd05:f16:95a0:fee prefixlen 64 scopeid 0x20<link>
    inet6 2a03:4000:2b:63a:1098:42:0:1c1 prefixlen 128 scopeid 0x0<global>
    ether 0e:47:85:69:06:21 txqueuelen 1024 (Ethernet)
    RX packets 283087536 bytes 427927106434 (398.5 GiB)
    RX errors 61 dropped 0 overruns 0 frame 0
    TX packets 24727708 bytes 1401262789 (1.3 GiB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
    device interrupt 37

Mangix (Author), posted November 22, 2020
I use qbittorrent in a Docker container. It easily reboots the Helios4. Again, it's a kernel issue. .66 is the last one that does not reboot: 8 hours of uptime so far. With all later kernels, I can barely get 2 hours.

edit: just rebooted. I'm out of ideas at this point. I have a feeling it's a kernel configuration issue. I have no idea what config that 4.19.63 version has.