kratz00 Posted November 12, 2020

Hello,

For quite some time I have been experiencing system freezes. I already measured the voltages on the board; 12V and 5V are okay on both connectors. Attached you will find the armbianmonitor -U output.

I tried to capture kernel logs using information from another thread:

    sudo dmesg -n 7
    sudo dmesg -w

But I could not capture anything useful. Today the system froze while checking the RAID (the filesystem was not mounted).

    [ 168.224361] md: data-check of RAID array md0

Is there anything else I can do to shed some light on this?

Thanks and regards
-kratz00

Attachment: armbianmonitor.log
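One way to keep more of the kernel log across a hard freeze is to write every line to disk as soon as it appears. The following is only a minimal sketch: `persist_lines` is a hypothetical helper name, and whether the kernel manages to emit anything useful before the freeze is not guaranteed.

```shell
# Append each line read from stdin to a log file and flush it to disk
# right away, so that as much as possible survives a hard freeze.
# persist_lines is a hypothetical helper name, not an existing tool.
persist_lines() {
    logfile=$1
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$logfile"
        sync    # force buffered data out to storage after every line
    done
}
```

It could be used as `sudo dmesg -w | persist_lines /path/to/dmesg-live.log`, where the log path is an arbitrary example. Calling sync after every line is slow and adds write wear on an SD card, so this is meant for a debugging session only.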
gprovost Posted November 13, 2020

Hi,

By system freeze, do you mean the system hangs and you need to manually reset / power cycle it?

Is the watchdog service running?

    systemctl status watchdog.service

What is the temperature of the SoC under load?

    cat /dev/thermal-cpu/temp1_input

Just trying to rule out any thermal issue first.

Regards,
kratz00 Posted November 13, 2020 (Author)

Hi gprovost

3 hours ago, gprovost said:
> By system freeze, you mean the system hangs and you need to manually reset / power cycle it ?

Exactly. It is not reachable over the network and does not respond via the serial console.

3 hours ago, gprovost said:
> Is the watchdog service running ? systemctl status watchdog.service

It seems to be missing:

    kratz00@helios4:~$ systemctl status watchdog.service
    Unit watchdog.service could not be found.

3 hours ago, gprovost said:
> What the temperature of the SoC during load ? cat /dev/thermal-cpu/temp1_input Just trying to dismiss first any thermal issue.

The RAID check has been running for nearly an hour, load is high and the temperature is stable at around 55°C:

    root@helios4:~# uptime
     06:53:44 up 21 min, 1 user, load average: 2.00, 1.92, 1.27
    root@helios4:~# cat /dev/thermal-cpu/temp1_input
    55122

Regards
-kratz00
kratz00 Posted November 17, 2020 (Author)

Hi

I forgot to give an update. The system froze during the RAID check at 54.3%. The temperature was always between 54 and 57°C.

@gprovost Any ideas what might be wrong?

Regards
-kratz00
Mangix Posted November 19, 2020

Is this a freeze or a random reboot? I experienced the latter during the 5.4 kernel cycle. Actually it seems to be some regression made during the 4.19 series. I'm still on 4.19.63-mvebu for that reason.
kratz00 Posted November 19, 2020 (Author)

13 hours ago, Mangix said:
> Is this a freeze or a random reboot?

I would say freeze, as the system is not usable anymore (reachable neither via network nor via serial). The logs also do not indicate a reboot. If it were a reboot, I would expect the system to be in a usable state afterwards.
Mangix Posted November 20, 2020

Ah, not the same issue as mine then. I just tried the latest 4.19 and 5.4 kernel versions. Both reboot randomly. Unfortunately, now I need to find the old kernel that I was using...
fri.K Posted November 22, 2020

@Mangix, try 5.4.66-mvebu #20.08.3. I also had random reboots under heavy NFS load again after some updates. I had this problem previously, but after reinstalling the system from scratch on a spare SD card it was rock solid. I inserted the spare card again and it's stable again, but I have no time right now to test where the problem is.
kratz00 Posted November 22, 2020 (Author)

I guess in your cases the system also freezes and the watchdog service then reboots it. I am going to set up the system from scratch tomorrow using https://dl.armbian.com/helios4/archive/Armbian_20.08.13_Helios4_buster_current_5.8.16.img.xz
kratz00 Posted November 23, 2020 (Author)

Bought a new micro SD card and put https://dl.armbian.com/helios4/archive/Armbian_20.08.13_Helios4_buster_current_5.8.16.img.xz on it. Booted, set up the root password and directly started a RAID check (9.4% done so far). I am very excited to see what happens.
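For anyone wanting to reproduce the test, the following is a rough sketch of how a check can be started and its progress read through the standard Linux md interfaces (`sync_action` in sysfs and `/proc/mdstat`). The function names are hypothetical, and md0 is assumed because it is the array named in the log line earlier in this thread.

```shell
# Start a data-check on an md array (needs root).
# Defaults to md0, the array named in the log line earlier in this thread.
start_raid_check() {
    echo check > "/sys/block/${1:-md0}/md/sync_action"
}

# Print the progress percentage of a running check, taken from
# mdstat-style text (defaults to /proc/mdstat). Prints nothing when
# no check is in progress.
raid_check_progress() {
    sed -n 's/.*check *= *\([0-9.]*\)%.*/\1/p' "${1:-/proc/mdstat}"
}
```

Calling `raid_check_progress` periodically (for example from `watch`) shows how far the check got, which helps to tell whether the freeze always happens at a similar point or at random.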
kratz00 Posted November 24, 2020 (Author)

It froze again during the RAID check. I will try kernel 5.4.66-mvebu, as suggested by @fri.K. After that I am really out of ideas.
kratz00 Posted November 24, 2020 (Author)

@gprovost Not sure if this is expected or not, but since you asked before, this is running Armbian_20.08.13_Helios4_buster_current_5.8.16.img.xz without any changes.

    root@helios4:~# systemctl status watchdog.service
    Unit watchdog.service could not be found.
gprovost Posted November 24, 2020

Yes, the watchdog service is not installed by default.

    apt-get install watchdog
kratz00 Posted November 24, 2020 (Author)

@gprovost Thanks. Having a watchdog will restart the system in case it freezes, like it does for @Mangix and @fri.K. I would be more interested in helping to fix the underlying problem. How can I help?
kratz00 Posted November 24, 2020 (Author)

Short update: I am running 5.4.66-mvebu now, and it resulted in a system freeze after just a couple of minutes of running the RAID check. I am officially out of ideas.

The Helios4 was running fine for many years after I got it following the successful Kickstarter campaign. I cannot really pinpoint when it started freezing (I think it started after October 2019). I am also not sure whether it is a hardware or a software problem.
gprovost Posted November 25, 2020

I know you dismissed a PSU problem from the beginning, but how long have you been running the system with the same PSU?
kratz00 Posted November 25, 2020 (Author)

8 minutes ago, gprovost said:
> I know you dismissed PSU problem from the beginning, but how long you have been running the system with same PSU for ?

Since the beginning. It is still the original PSU I got with the Helios4 in February 2018.
Mangix Posted November 25, 2020

@kratz00 My theory is a kernel problem. I had months of uptime with kernel 4.19.63. Unfortunately, I don't have the original .deb file.
kratz00 Posted November 25, 2020 (Author)

@gprovost As I also have a Synology DS214+, I just swapped the PSUs; both devices are up and running. The Helios4 is running the RAID check again, so let us see what happens.

@Mangix If changing the PSU does not have the desired effect, I will try kernel 4.19.63. You can still find the image here: https://archive.armbian.com/helios4/archive/Armbian_5.91_Helios4_Debian_buster_next_4.19.63.7z
FredK Posted November 25, 2020

@kratz00 @Mangix @gprovost @Heisath See my post in the parallel thread: https://forum.armbian.com/topic/16038-random-system-reboots/?do=findComment&comment=113510

I used a different (newer) kernel and a new PSU, but I again got a spontaneous reboot.
TheLinuxBug Posted November 25, 2020

@kratz00 @FredK Actually, let me revise this after re-reading again.

Note that when you have 5 or more drives running on an Armada 3700 series SoC, as in the ESPRESSOBin, there are bugs in the DMA transaction process that can be hit. These trigger some weird events such as kswapd using 100% CPU and terminating all IO access, a kernel panic, drive failures reported by the SATA controller, or a freeze. I have seen this occur predominantly during a RAID reshape, but it can happen at other times as well.

On the ESPRESSOBin, one thing that can help get around this, if you are using mdadm and RAID5, is to use the XOR engine in the chip to handle DMA calculations, which helps to offload them under a lot of conditions. The module is 'marvell-cesa' on ESPRESSOBin; I would guess it is named similarly here. Especially with 5.x.y series kernels I have had to use this.

I am currently running a RAID of 7x3TB drives on an 8-port PCIe adapter placed into a mini-PCIe to PCIe x1 adapter attached to the mini-PCIe slot on the board. It is stable under 5.4.y or newer with that module inserted. Otherwise, under high load from NFS or the local system for extended periods, I first see errors where it fails to allocate DMA requests in time, and shortly after that system instability, usually resulting in one of the outcomes mentioned above.

Maybe, if you are not using that module, you could compile it and see if it helps with this issue in the newer kernels. Beyond that, if this isn't specifically caused by a change in kernel version, it did sound possible that it could be related to a bad SATA cable or insufficient power.

My 2 cents.

Cheers!
kratz00 Posted November 25, 2020 (Author)

The system froze again during the RAID check. I was using kernel 5.8.18-mvebu and the PSU from my Synology NAS.

I just prepared an SD card with the Armbian_5.91_Helios4_Debian_buster_next_4.19.63.img image. As suggested by @Mangix, this kernel might also be stable in my case.
karamike Posted November 25, 2020

I don't want this to be a me-too-post - but why else would I be here...

I have an old Helios4 with the original PSU, Armbian 20.08.17 Bionic, Linux 5.8.16-mvebu, running in JBOD mode. I've also recently started to experience random freezes while continuously accessing a single drive (picture indexing). At some point the Ethernet connection is lost: no files, no ssh, no pings.

Under these conditions I connected a laptop to the serial output of the Helios4. It doesn't react to any input but displays the following messages:

    [30482.836176] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [30482.842115] rcu: 1-...!: (0 ticks this GP) idle=87e/1/0x40000002 softirq=7995606/7995606 fqs=1
    [30492.851735] rcu: rcu_sched kthread starved for 131290 jiffies! g13530313 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
    [30492.862281] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
    [30492.871261] rcu: RCU grace-period kthread stack dump:
    [30545.855768] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [30545.861707] rcu: 1-...!: (0 ticks this GP) idle=87e/1/0x40000002 softirq=7995606/7995606 fqs=1

Does this hint at the kernel problem mentioned above?
TRS-80 Posted November 25, 2020

1 hour ago, karamike said:
> I don't want this to be a me-too-post

Every post is another potential data point, which may help to figure out what is going on. So thanks for your feedback.
Mangix Posted November 26, 2020

I compiled my own 4.19.63. So far, it's not rebooting. Fingers crossed that it can survive a day. If it does, kernel 4.19.64 and above are the problematic ones. The next step is to selectively revert potentially problematic commits and figure out which one is causing reboots.

Edit: I just realized that I never deleted the old cpufreq patches...
gprovost Posted November 26, 2020

19 hours ago, kratz00 said:
> As I also have a Synology DS214+, I just switched the PSUs, both are up and running.

This is a PSU for a 2-bay setup only (I think it is only rated 5A?). It might not be sufficient for 4 x 3.5" HDDs.

@TheLinuxBug The driver you are referring to is mv_xor (mv_cesa is for the hardware crypto engine). mv_xor is already built into Armbian for the mvebu family. A way to confirm that the Helios4 offloads XOR operations to the hardware engine is to check /proc/interrupts and look at the following 2 lines:

    47: 43007 0 GIC-0 54 Level f1060800.xor
    48: 40208 0 GIC-0 97 Level f1060900.xor

You're right about NFS, which has often been the culprit in bringing a system to its knees when not properly tuned.

@karamike Your log looks more like an overloaded system. What protocol are you using to access data over the network?

@Mangix Hope you manage to narrow down the root cause.
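The /proc/interrupts check described above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, and it assumes the two-CPU column layout of the lines quoted in the post.

```shell
# Print the XOR engine lines from an interrupts file
# (defaults to /proc/interrupts). Function names are hypothetical.
xor_irq_lines() {
    grep -E '\.xor[[:space:]]*$' "${1:-/proc/interrupts}"
}

# Sum the per-CPU interrupt counters on those lines. Assumes two CPU
# columns, as in the output quoted in the post above.
xor_irq_total() {
    xor_irq_lines "$1" | awk '{ total += $2 + $3 } END { print total + 0 }'
}
```

Running `xor_irq_total` before and after some RAID activity and comparing the two numbers gives a quick indication of whether the hardware XOR engine is being used: if the total grows, the engine is handling requests.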
kratz00 Posted November 27, 2020 (Author)

@gprovost Thanks for the hint. You are right, the Helios4 one does 12V/8A and the Synology one only 12V/6A. I will switch them back.

@Mangix You were also right: the RAID check finished successfully running kernel 4.19.63-mvebu #5.91 (with over 36 hours of uptime at the moment).

I saw in the other thread that the system freeze is related to the DFS patches, therefore this thread can be closed now. Thanks to everybody for your input and help with this problem.
karamike Posted November 27, 2020

@gprovost The files are being shared via netatalk to a Mac. That worked fine until now. The number of files within a single folder is quite large (up to 13,000 files). The hard drive is formatted with ext4 using the DIR_INDEX option to speed up file access.

The problem first appeared after switching one of the four drives from 2 TB to 4 TB. I copied the files from A to B, so the number of files did not change. But I had to reindex all files within a picture processing software on the Mac (creating thumbnails, determining picture properties, etc.).

I've installed the software watchdog, as mentioned above, but that did not help either. If the system freezes, a software watchdog does not have any chance to intervene.

After reading about PSU issues here, I disconnected one drive (data + supply voltage) and started the box with only 3 drives. After 10 hours or so of indexing the system froze again.

I wrote a simple shell script to display temperature and processor load every second. When the system froze the temperature was around 50 °C and the processor load slightly above 1.

I also noticed that the "blinking lights" (HD access, SMD LEDs) are all off - even when the box seems to work normally. Is this an indication of something?

Thanks
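The script itself was not posted. A minimal sketch of what such a monitor might look like is given below; the function names are hypothetical, the sensor path is the one used earlier in this thread, and the value is assumed to be in millidegrees Celsius (55122 was reported for roughly 55°C).

```shell
# Sensor path as used earlier in this thread; both paths can be overridden.
TEMP_FILE="${TEMP_FILE:-/dev/thermal-cpu/temp1_input}"
LOAD_FILE="${LOAD_FILE:-/proc/loadavg}"

# Convert millidegrees Celsius (e.g. 55122) to degrees with one decimal.
millideg_to_c() {
    awk -v m="$1" 'BEGIN { printf "%.1f", m / 1000 }'
}

# Print one sample: timestamp, SoC temperature and 1-minute load average.
sample() {
    temp_raw=$(cat "$TEMP_FILE")
    load1=$(cut -d ' ' -f 1 "$LOAD_FILE")
    printf '%s temp=%sC load1=%s\n' \
        "$(date '+%F %T')" "$(millideg_to_c "$temp_raw")" "$load1"
}

# Print one sample per second until interrupted with Ctrl-C.
monitor() {
    while :; do
        sample
        sleep 1
    done
}
```

Running `monitor` over a serial or ssh session leaves the last values before a freeze visible on the terminal; redirecting the output to a file on another machine would keep a longer history.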
Mangix Posted November 29, 2020

DFS patches were removed for current and legacy. dev still has them. I assume builds will be out soon if they're not out already.
kratz00 Posted December 28, 2020 (Author)

I have been running Linux 5.9.14-mvebu for over 14 days now (up 14 days, 20:12) without any problem. Thank you very much to all involved in tracking down and fixing the issue.