Random system freezes


kratz00


Hello

 

For quite some time I have been experiencing system freezes. I already measured the voltages on the board: 12V and 5V are okay on both connectors.

Attached you will find the armbianmonitor -U output.

 

I tried to capture kernel logs using information from some other thread.

 

sudo dmesg -n 7
sudo dmesg -w

 

But I could not capture anything useful.
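One possible way to keep the last kernel messages before a freeze is to stream them to a second machine instead of watching them locally. The following is only a sketch and has not been tested on this board: `stamp_lines` is a hypothetical helper name, and the SSH call in the comment assumes the box is reachable as `helios4` and that `sudo` works there without a password prompt.

```shell
#!/bin/sh
# Append every line read from stdin to a log file, prefixed with the
# wall-clock time at which it arrived on the logging machine.
stamp_lines() {
    logfile="$1"
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$line" >> "$logfile"
    done
}

# Hypothetical usage, run on a second machine (hostname is an assumption):
#   ssh helios4 'sudo dmesg -w' | stamp_lines helios4-kernel.log
```

Because the file lives on the other machine, whatever was printed last before the hang is still there after the power cycle.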

Today the system froze while checking the raid (filesystem was not mounted).

 

[  168.224361] md: data-check of RAID array md0

 

Is there anything else I can do to shed some light?

 

 

Thanks and regards

-kratz00

armbianmonitor.log


Hi,

 

By system freeze, you mean the system hangs and you need to manually reset / power cycle it ?

 

Is the watchdog service running ? systemctl status watchdog.service

 

What is the temperature of the SoC under load? cat /dev/thermal-cpu/temp1_input

Just trying to rule out any thermal issue first.

 

Regards,


Hi gprovost

3 hours ago, gprovost said:

By system freeze, you mean the system hangs and you need to manually reset / power cycle it ?

Exactly. Not reachable over the network. Does not respond via serial console.

 

3 hours ago, gprovost said:

Is the watchdog service running ? systemctl status watchdog.service

Seems it is missing:

kratz00@helios4:~$ systemctl status watchdog.service
Unit watchdog.service could not be found.

 

3 hours ago, gprovost said:

What is the temperature of the SoC under load? cat /dev/thermal-cpu/temp1_input

Just trying to rule out any thermal issue first.

The RAID check has been running for nearly an hour, load is high, and the temperature is stable at around 55°C:

root@helios4:~# uptime
 06:53:44 up 21 min,  1 user,  load average: 2.00, 1.92, 1.27
root@helios4:~# cat /dev/thermal-cpu/temp1_input 
55122
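The raw value appears to be in millidegrees Celsius (55122 matches the roughly 55°C mentioned above). A small helper for converting it could look like this; the function name `millideg_to_celsius` is made up for illustration.

```shell
#!/bin/sh
# Convert a millidegree reading, as printed by temp1_input, to degrees
# Celsius with one decimal place. LC_ALL=C keeps the decimal point a dot.
millideg_to_celsius() {
    LC_ALL=C awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

# Hypothetical usage on the board itself:
#   millideg_to_celsius "$(cat /dev/thermal-cpu/temp1_input)"
```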

 

Regards

-kratz00

13 hours ago, Mangix said:

Is this a freeze or a random reboot?

I would say freeze, as the system is no longer usable (reachable neither via network nor via serial).

The logs also do not indicate a reboot.

If it were a reboot, I would expect the system to be in a usable state afterwards.


@Mangix, try 5.4.66-mvebu #20.08.3. I also had random reboots under heavy NFS load again after some updates. I had this problem previously, but after reinstalling the system from scratch on a spare SD card it was rock solid. I inserted the spare card again and it is stable again, but I have no time right now to find out where the problem is.


Short update: running 5.4.66-mvebu now resulted in a system freeze after just a couple of minutes of running the RAID check.

I am officially out of ideas.

The Helios4 had been running fine for many years since I got it after the successful Kickstarter campaign.

I cannot really pinpoint when it started freezing (I think it started after October 2019).

I am also not sure if it is a hardware or a software problem.

 


@kratz00 @FredK

 

Actually, let me revise this after re-reading. Note that when you have 5 or more drives running on an Armada 3700 series SoC, as in the ESPRESSOBin, there are bugs in the DMA transaction process that can be hit and that trigger weird events such as kswapd using 100% CPU and terminating all I/O access, a kernel panic, drive failures reported by the SATA controller, or a freeze. I have seen this occur predominantly during a RAID reshape, but it can happen at other times as well.

On the ESPRESSOBin, one thing that can help get around this, if you are using mdadm and RAID5, is to use the XOR engine in the chip to handle the DMA calculations, which offloads them under a lot of conditions. The module is 'marvell-cesa' on the ESPRESSOBin; I would guess it is named similarly here. I have had to use this especially with 5.x.y series kernels. I am currently running a RAID of 7x3TB drives on an 8-port PCIe adapter placed in a mini-PCIe to PCIe x1 adapter attached to the mini-PCIe slot on the board. It is stable under 5.4.y or newer with that module inserted. Otherwise, under high load from NFS or the local system for extended periods, I first see errors where it fails to allocate DMA requests in time, followed shortly by system instability, usually resulting in one of the outcomes mentioned above.

 

Maybe if you are not using that module you could compile it and see if it helps with this issue in the newer kernels.

 

Beyond that, if this isn't specifically caused by a change in kernel version, it does sound possible that it is related to a bad SATA cable or insufficient power.

 

my 2 cents.

 

Cheers!

 


I don't want this to be a me-too-post - but why else would I be here...

 

I have an old Helios4 with the original PSU, Armbian 20.08.17 Bionic, Linux 5.8.16-mvebu, running in JBOD mode.

 

I've also recently started to experience random freezes while continuously accessing a single drive (picture indexing). At some point the Ethernet connection is lost: no files, no ssh, no pings.

 

Under these conditions I connected a laptop to the serial output of the Helios4.

It doesn't react to any input but displays the following messages:

 

[30482.836176] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[30482.842115] rcu: 	1-...!: (0 ticks this GP) idle=87e/1/0x40000002 softirq=7995606/7995606 fqs=1 
[30492.851735] rcu: rcu_sched kthread starved for 131290 jiffies! g13530313 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[30492.862281] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[30492.871261] rcu: RCU grace-period kthread stack dump:
[30545.855768] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[30545.861707] rcu: 	1-...!: (0 ticks this GP) idle=87e/1/0x40000002 softirq=7995606/7995606 fqs=1

 

Does this hint at the kernel problem mentioned above?

 


I compiled my own 4.19.63. So far, it's not rebooting. Fingers crossed that it can survive a day.

 

If it does, kernel 4.19.64 and above are the problematic ones.

 

The next step is to selectively revert potentially problematic commits and figure out which one is causing reboots.

 

edit: I just realized that I never deleted the old cpufreq patches...

19 hours ago, kratz00 said:

As I also have a Synology DS214+, I just switched the PSUs, both are up and running.

 

This is a PSU for a 2-bay setup only (I think it is only rated for 5A?). It might not be sufficient for 4 x 3.5" HDDs.

 

 

@TheLinuxBug The driver you are referring to is mv_xor (mv_cesa is for the hardware crypto engine). mv_xor is already built into Armbian for the mvebu family.

 

One way to confirm that the Helios4 offloads XOR operations to the hardware engine is to check /proc/interrupts and look for the following two lines:

 

 47:      43007          0     GIC-0  54 Level     f1060800.xor
 48:      40208          0     GIC-0  97 Level     f1060900.xor
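To pull just those lines out and total the counters across CPUs, something like the following could be used. This is a rough sketch: the function name `xor_irq_totals` is made up, and it assumes the XOR engine lines end in `.xor` as in the output above.

```shell
#!/bin/sh
# Print "<device> <total interrupts>" for every line whose last column
# ends in ".xor". Reads /proc/interrupts unless another file is given.
xor_irq_totals() {
    awk '$NF ~ /\.xor$/ {
        total = 0
        # Per-CPU counters follow the "NN:" column and stop at the
        # first non-numeric field (the interrupt controller name).
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
        print $NF, total
    }' "${1:-/proc/interrupts}"
}
```

Running it twice a few minutes apart during a RAID check and comparing the totals should show whether the counters are still increasing, which would suggest the engine is in use.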

 

You are right about NFS, which has often been the culprit in bringing a system to its knees when not properly tuned.

 

 

@karamike Your log looks more like an overloaded system. What protocol are you using to access data over the network?

 

@Mangix Hope you manage to narrow down the root cause


@gprovost Thanks for the hint. You are right, the Helios4 one does 12V/8A and the Synology one only 12V/6A. I will switch them back.

@Mangix You were also right: the RAID check finished successfully running kernel 4.19.63-mvebu #5.91 (with over 36 hours of uptime at the moment).

 

I saw in the other thread that the system freeze is related to the DFS patches, therefore this thread can be closed now.

 

Thanks to everybody for your input and help with this problem.


@gprovost The files are being shared via netatalk to a Mac. That worked fine until now.

 

The number of files within a single folder is quite large (up to 13,000 files). The hard drive is formatted with ext4 using the DIR_INDEX option to speed up the file access.

 

The problem first appeared after switching one of the four drives from 2 TB to 4 TB. I copied the files from A to B, so the number of files did not change. But I had to reindex all files in a picture processing application on the Mac (creating thumbnails, determining picture properties, etc.).

 

I've installed the software watchdog, as mentioned above, but that did not help either. If the system freezes, a software watchdog does not have any chance to intervene.

 

After reading about PSU issues here, I disconnected one drive (data + supply voltage) and started the box with only 3 drives. After 10 hours or so of indexing, the system froze again. I wrote a simple shell script to display temperature and processor load every second. When the system froze, the temperature was around 50 °C and the processor load slightly above 1.
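For anyone wanting to do the same, a script of that kind could look roughly like this. It is not the script used above, just a minimal sketch: the function name `sample` and the log path in the comment are made up, and the temperature path is the one quoted earlier in the thread.

```shell
#!/bin/sh
# Print one sample: time of day, SoC temperature in millidegrees, and the
# 1-minute load average. The file paths are parameters so the function
# can also be tried against saved copies.
sample() {
    temp_file="${1:-/dev/thermal-cpu/temp1_input}"
    load_file="${2:-/proc/loadavg}"
    read -r temp < "$temp_file"
    read -r load1 rest < "$load_file"
    printf '%s temp=%s load1=%s\n' "$(date '+%H:%M:%S')" "$temp" "$load1"
}

# Hypothetical loop: append one sample per second and flush it to disk,
# so the last line written before a freeze survives the power cycle.
#   while sleep 1; do sample >> /var/tmp/freeze-monitor.log; sync; done
```

Writing to a file instead of the terminal means the readings just before the hang can still be looked at after the reset.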

 

I also noticed that the "blinking lights" (HD access, SMD LEDs) are all off - even when the box seems to work normally. Is this an indication of something?

 

Thanks
