Random system freezes

kratz00 · November 12, 2020

Hello

For quite some time I experienced system freezes. I already measured the voltage on the board 12V and 5V are okay on both connectors.

Attached you will find the armbianmonitor -U output.

I tried to capture kernel logs using information from some other thread.

sudo dmesg -n 7
sudo dmesg -w

But I could not capture anything useful.

Today the system froze while checking the raid (filesystem was not mounted).

[  168.224361] md: data-check of RAID array md0

Is there anything else I can do to shed some light?

Thanks and regards

-kratz00

armbianmonitor.log

gprovost · November 13, 2020

Hi,

By system freeze, you mean the system hangs and you need to manually reset / power cycle it ?

Is the watchdog service running ? systemctl status watchdog.service

What the temperature of the SoC during load ? cat /dev/thermal-cpu/temp1_input

Just trying to dismiss first any thermal issue.

Regards,

kratz00 · November 13, 2020

Hi gprovost

3 hours ago, gprovost said:

By system freeze, you mean the system hangs and you need to manually reset / power cycle it ?

Exactly. Not reachable over the network. Does not respond via serial console.

3 hours ago, gprovost said:

Is the watchdog service running ? systemctl status watchdog.service

Seems it is missing:

kratz00@helios4:~$ systemctl status watchdog.service
Unit watchdog.service could not be found.

3 hours ago, gprovost said:

What the temperature of the SoC during load ? cat /dev/thermal-cpu/temp1_input

Just trying to dismiss first any thermal issue.

Raid check is nearly running for an hour, load is high and the temperature is stable around 55°C:

root@helios4:~# uptime
 06:53:44 up 21 min,  1 user,  load average: 2.00, 1.92, 1.27
root@helios4:~# cat /dev/thermal-cpu/temp1_input 
55122

Regards

-kratz00

kratz00 · November 17, 2020

Hi

Forgot to give an update.

The system froze during raid check at 54.3% .

The temperature was always between 54-57°C.

@gprovost

Any ideas what might be wrong?

Regards

-kratz00

Mangix · November 19, 2020

Is this a freeze or a random reboot?

I experienced the latter during the 5.4 kernel cycle. Actually it seems to be some regression made during the 4.19 series. I'm still on 4.19.63-mvebu for that reason.

kratz00 · November 19, 2020

13 hours ago, Mangix said:

Is this a freeze or a random reboot?

I would see freeze, as the system is not usable anymore (not reachable via network nor serial).

The logs also do not indicate a reboot.

If it would be a reboot I would expect the system to be in an usable state afterwards.

Mangix · November 20, 2020

ah not the same issue as mine then. I just tried the latest 4.19 and 5.4 kernel versions. both reboot randomly. Unfortunately now I need to find the old kernel that I was using...

fri.K · November 22, 2020

@Mangix, try

 5.4.66-mvebu #20.08.3

, I also had random reboots after some updates on heavy NFS loads again, previously I had this problem, but after reinstalling system from scratch to spare sd card system was rock solid. I inserted spare card again and it's stable again, but I have no time to test where the problem is now thought.

kratz00 · November 22, 2020

I guess in your cases the system also freezes and the watchdog service is rebooting the system then.

I am going to setup the system from scratch using https://dl.armbian.com/helios4/archive/Armbian_20.08.13_Helios4_buster_current_5.8.16.img.xz tomorrow.

kratz00 · November 23, 2020

Bought a new micro SD card and put https://dl.armbian.com/helios4/archive/Armbian_20.08.13_Helios4_buster_current_5.8.16.img.xz on it.

Booted, set up the root password and directly started a raid check (9.4% done so far).

I am very excited to see what happens.

kratz00 · November 24, 2020

It froze again during raid check

I will try kernel 5.4.66-mvebu, like suggested by @fri.K

After that I am really out of ideas

kratz00 · November 24, 2020

@gprovost

Not sure if this is expected or not, but since you asked before, this is running Armbian_20.08.13_Helios4_buster_current_5.8.16.img.xz without any changes.

root@helios4:~# systemctl status watchdog.service
Unit watchdog.service could not be found.

gprovost · November 24, 2020

Yes watchdog service is not installed by default.

apt-get install watchdog

kratz00 · November 24, 2020

@gprovostThanks. Having a watchdog will restart the system in case it freezes, like it does for @Mangixand @fri.K

I would be more interested in helping to fix the underlying problem. How can I help?

kratz00 · November 24, 2020

Short update running 5.4.66-mvebu now, resulted in a system freeze in just a couple of minutes running raid check.

I am officially out of ideas.

The Helios4 was running fine for many years when I got it after the successful Kickstarter campaign.

I can not really pin point when it started freezing (I think it started after October of 2019).

I am also not sure if it is a hardware or a software problem.

gprovost · November 25, 2020

I know you dismissed PSU problem from the beginning, but how long you have been running the system with same PSU for ?

kratz00 · November 25, 2020

8 minutes ago, gprovost said:

I know you dismissed PSU problem from the beginning, but how long you have been running the system with same PSU for ?

Since the beginning, it is still the original PSU I got with the Helios4 in February 2018.

Mangix · November 25, 2020

@kratz00my theory is a kernel problem. I had months of uptime with kernel 4.19.63. Unfortunately, I don't have the original .deb file.

kratz00 · November 25, 2020

@gprovost

As I also have a Synology DS214+, I just switched the PSUs, both are up and running.

The Helios4 is running raid check again, let us see what will happen

@Mangix

If changing the PSU does not have the desired effect, I will try kernel 4.19.63

You can still find the image here https://archive.armbian.com/helios4/archive/Armbian_5.91_Helios4_Debian_buster_next_4.19.63.7z

FredK · November 25, 2020

@kratz00 @Mangix @gprovost @Heisath

See my post in the parallel thread https://forum.armbian.com/topic/16038-random-system-reboots/?do=findComment&comment=113510

I used a different (newer) Kernel and a new PSU but I got again a spontaneous reboot.

TheLinuxBug · November 25, 2020

@kratz00 @FredK

Actually, let me revise this after re-reading again. To note that when you have 5 or more drives running on the Armada 3700 series SoC like in ESPRESSOBin there are bugs in the DMA transaction process that can be hit which will trigger some weird events such as kswapd using 100% CPU and terminating all IO access, Kernel Panic, Drive failures reports by sata controller or a freeze. I have seen this predominantly occur during a raid reshape but it can happen at other times as well. On the ESPRESSOBin one of the things that can help get around this is if you are using mdadm and raid5 you can use the xor engine in the chip to handle DMA calculations and this will help to offload it under a lot of conditions. The module is 'marvell-cesa' on ESPRESSOBin, I would guess its is names similarly here. Especially with 5.x.y series kernels I have had to use this. I am currently running a raid of 7x3TB drives on a 8 port PCIe adapter placed into a mini-pcie to PCIEx1 adapter slot attached to the mini-pcie on the board. It is stable under 5.4.y or newer with that module inserted, otherwise under high load from NFS or local system for extended periods I will see errors first where it will fail to allocate dma requests in time and then shortly after system instability, usually resulting in any of the above mentioned outcomes.

Maybe if you are not using that module you could compile it and see if it helps with this issue in the newer kernels.

Past that it did sound possible if this isn't specifically something caused by change in kernel version like it could be related to bad SATA cable or under power.

my 2 cents.

Cheers!

kratz00 · November 25, 2020

System froze again during raid check.

I was using kernel 5.8.18-mvebu and the PSU from my Synology NAS.

I just prepared a SD card with the Armbian_5.91_Helios4_Debian_buster_next_4.19.63.img image

As suggested by @Mangix this kernel might also be stable in my case.

karamike · November 25, 2020

I don't want this to be a me-too-post - but why else would I be here...

I have an old Helios4 with the original PSU, Armbian 20.08.17 Bionic, Linux 5.8.16-mvebu, running in JBOD mode.

I've also recently start to experience random freezes while continuously accessing a single drive (picture indexing). At some point the Ethernet connection is lost, no files, no ssh, no pings.

Under this conditions I've connected a laptop to the serial output of the Helios4.

It doesn't react to any input but displays the following messages:

[30482.836176] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[30482.842115] rcu: 	1-...!: (0 ticks this GP) idle=87e/1/0x40000002 softirq=7995606/7995606 fqs=1 
[30492.851735] rcu: rcu_sched kthread starved for 131290 jiffies! g13530313 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[30492.862281] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[30492.871261] rcu: RCU grace-period kthread stack dump:
[30545.855768] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[30545.861707] rcu: 	1-...!: (0 ticks this GP) idle=87e/1/0x40000002 softirq=7995606/7995606 fqs=1

Does this hint to the kernel problem mentioned above?

TRS-80 · November 25, 2020

1 hour ago, karamike said:

I don't want this to be a me-too-post

Every post is another potential data point, which may help to figure out what is going on. So thanks for your feedback.

Mangix · November 26, 2020

I compiled my own 4.19.63 . So far, it's not rebooting. Fingers crossed that it can survive a day.

If it does, kernel 4.19.64 and above are the problematic ones.

The next step is to selectively revert potentially problematic commits and figure out which one is causing reboots.

edit: I just realized that I never deleted the old cpufreq patches...

gprovost · November 26, 2020

19 hours ago, kratz00 said:

As I also have a Synology DS214+, I just switched the PSUs, both are up and running.

This is a PSU for only a 2-bay setup (I think only rated 5A ??). Could be not sufficient for a 4 x 3.5" HDD.

@TheLinuxBug The driver you are referring to is mv_xor (mv_cesa is for the hw crypto engine), mv_xor is already built-in Armbian for mvebu family

A way to confirm Helios4 offload XOR operation on the hw engine is to check /proc/interrupts and look at the following 2 lines

 47:      43007          0     GIC-0  54 Level     f1060800.xor
 48:      40208          0     GIC-0  97 Level     f1060900.xor

You right about NFS that has been often a culprit of bringing system down to his feet when not properly tuned.

@karamike Your log looks more like overload system. What protocol are you using to access data over the network ?

@Mangix Hope you manage to narrow down the root cause

kratz00 · November 27, 2020

@gprovostThanks for the hint. You are right, the Helios4 one does 12V/8A and the Synology one only 12V/6A. I will switch them back,

@MangixYou were also right, raid check finished successfully running kernel 4.19.63-mvebu #5.91 (with over 36 hours of uptime at the moment)

I saw in the other thread the system freeze is related to the DFS patches, therefor this thread can be closed now.

Thanks to everybody for your input and help with this problem.

karamike · November 27, 2020

@gprovost The files are being shared via netatalk to a Mac. That worked fine until now.

The number of files within a single folder is quite large (up to 13,000 files). The hard drive is formatted with ext4 using the DIR_INDEX option to speed up the file access.

The problem first appeared after switching from one of the four drives from 2 TB to 4 TB. I've copied the files from A to B, so the number of files did not change. But I had to reindex all files within a picture processing software on the Mac (creating thumbs, determine picture properties, etc.).

I've installed the software watchdog, as mentioned above, but that did not help either. If the system freezes a software watchdog does not have any chance to intervene.

After reading about PSU issues here, I disconnected one drive (data + supply voltage) and started the box with only 3 drives. After 10 hours or so of indexing the system did freeze again. I wrote a simple shell script to display temperature and processor load every second. When the system froze the temperature was around 50 °C and the process load slightly above 1.

I also noticed that the "blinking lights" (HD access, SMD LEDs) are all off - even when the box seems to work normally. Is this an indication of something?

Thanks

Mangix · November 29, 2020

DFS patches were removed for current and legacy. dev still has them. I assume builds will be out soon if they're not out already.

kratz00 · December 28, 2020

I am running Linux 5.9.14-mvebu for over 14 days now (up 14 days, 20:12) without any problem.

Thank you very much to all involved tracking down and fixing the issue.

Sign In

Random system freezes

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Important Information