Helios64 randomly dying - help!

gmerrall · July 9, 2021

My Helios64 NAS has suddenly started hanging at random. It doesn't appear to be an actual shutdown, as the front panel lights all stay on. I lose all access, including the serial console. The only way to recover is a long press on the power button to hard-reset the box.

I've cranked the verbosity up to 7 in /boot/armbianEnv.txt in case something is output on the serial console, but I've not seen anything useful yet. For obvious reasons it's hard to catch it happening. Because /var/log is cleared on every boot, I can't see whether anything was logged just before the hang.
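A minimal sketch for double-checking those boot settings, assuming the stock Armbian layout where /boot/armbianEnv.txt holds plain key=value lines with `verbosity` and `console` keys. The helper name `show_console_settings` is made up for illustration.

```shell
#!/bin/sh
# Print the boot-log related settings from an armbianEnv.txt-style file.
# Assumption: plain key=value lines, as on a stock Armbian install.
show_console_settings() {
    env_file="${1:-/boot/armbianEnv.txt}"
    grep -E '^(verbosity|console)=' "$env_file"
}

# On the NAS itself:
#   show_console_settings /boot/armbianEnv.txt
```

If `verbosity=7` does not show up, the change was probably not saved, or it was made in a different file.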

Not sure where to start diagnosing. One option is to remove the folder2ram mount for /var/log so that logs persist?

Output of "uname -a" in case it helps:

Linux helios64 5.10.43-rockchip64 #21.05.4 SMP PREEMPT Wed Jun 16 08:02:12 UTC 2021 aarch64 GNU/Linux

Kobol folks: I'm in Singapore if that helps any.

wurmfood · July 9, 2021

You can set up logging to go to a flash drive, but capturing the console is going to be the best bet.

Do you have another computer you can leave on with the serial console connected? For example, I have a small NUC where I keep picocom running in a tmux session, that way I don't lose anything. You can use the -g option to have picocom log to a file, as well.
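A minimal sketch of that setup, assuming the USB serial adapter shows up as /dev/ttyUSB0 on the logging machine, that the Helios64 console runs at 1500000 baud, and that the installed picocom supports -g. The function name, tmux session name and log path are placeholders.

```shell
#!/bin/sh
# Build the picocom command line for logging the serial console to a file.
# Assumptions: /dev/ttyUSB0 and the log path are placeholders; 1500000 baud
# is the usual Helios64 console speed, adjust if yours differs.
serial_log_cmd() {
    dev="${1:-/dev/ttyUSB0}"
    log="${2:-$HOME/helios64-serial.log}"
    printf 'picocom -b 1500000 -g %s %s\n' "$log" "$dev"
}

# To start it detached inside tmux on the logging machine:
#   tmux new-session -d -s helios64 "$(serial_log_cmd)"
# and to look at the live console later:
#   tmux attach -t helios64
```

Because the log file is written by picocom itself, the last lines before a hang are kept even if nobody is attached to the tmux session at the time.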

gmerrall · July 12, 2021

I had serial logging underway but the screen idea was a good one. Of course, after happening pretty much every day, it's now not happening at all. Typical!

digwer · July 16, 2021

I have the same problem as described: the device is powered on, all lights are on (however, not blinking), but it is not accessible through serial or SSH.
I will try leaving a Raspberry Pi connected via serial to capture the logs, because nothing appears in the systemd log.

Edited July 16, 2021 by digwer

jotapesse · July 16, 2021

Hi! Not sure if it's related to the original post but my Kobol Helios64 seems to have just stopped responding following a reboot. Already tried to power down and power up and the issue persists.

The hardware starts, the disks are initialised and it responds to ping, but it can no longer be accessed via SSH or the OpenMediaVault web admin page. My Helios64 is updated with the latest Armbian Buster 5.10.x and the latest OpenMediaVault 5.x. Armbian + OMV are installed on the internal eMMC, Docker apps on M.2 SATA port 1, and the other 4 HDDs are a ZFS setup on SATA ports 2 to 5 for data storage.

A few minutes before the reboot I noticed kernel errors from syslog printed in an SSH shell connected to my Helios64. They kept appearing at random, a few minutes apart, as follows:

Message from syslogd@helios64 at Jul 16 18:47:05 ...
 kernel:[111630.816643] Internal error: Oops: 96000004 [#8] PREEMPT SMP

Message from syslogd@helios64 at Jul 16 18:47:05 ...
 kernel:[111630.843686] Code: 14000011 f9400273 b40001f3 d1002274 (b9402280)

I then rebooted from the OMV web admin page and it stopped responding from there. Pings are OK, but SSH connections are refused and the OMV web admin page is dead. Any idea what this is about and how to sort it out?

meymarce · July 16, 2021

Have you guys tried the usual fixes of lowering the CPU clock or raising the voltage?

digwer · July 17, 2021

16 hours ago, jotapesse said:

Message from syslogd@helios64 at Jul 16 18:47:05 ...
 kernel:[111630.816643] Internal error: Oops: 96000004 [#8] PREEMPT SMP

Message from syslogd@helios64 at Jul 16 18:47:05 ...
 kernel:[111630.843686] Code: 14000011 f9400273 b40001f3 d1002274 (b9402280)

jotapesse, I think that kernel Oops is a different problem from gmerrall's and mine. In our case we don't have any kernel crashes, just a random hardware lockup. I would suggest connecting via serial, capturing the whole kernel crash and creating another thread.

meymarce, I haven't tried that. How should I do that?

meymarce · July 17, 2021

You can also try setting MAX_SPEED to 408000. However, I have had a stable system since I raised the voltage in /boot/boot.cmd.

jotapesse · July 17, 2021

11 hours ago, digwer said:

jotapesse, I think that kernel Oops is a different problem from gmerrall's and mine. In our case we don't have any kernel crashes, just a random hardware lockup. I would suggest connecting via serial, capturing the whole kernel crash and creating another thread.

Thanks! Yes, I believe that's the case. I guess something got corrupted on my install. Armbian booted fine from an SD card. I tried everything: copying all the files in the /boot directory, a filesystem check, chrooting in and doing a full update and upgrade. In the end I gave up and reinstalled everything from scratch. Let's see if it's stable in the long run... I'm getting a bit worried by all the reports I read here about instability, hangs, crashes, boot corruption, and custom CPU voltage and frequency modifications. I need it to work reliably as a NAS and apps server. Hopefully all of you will get yours sorted as well.

EPZ · April 30, 2023

Hi,

gmerrall and digwer, did you find a solution to this problem? I am facing the same problem and have no idea how to solve it.

(Also on Helios64, Armbian 22.02.1 with Debian and kernel 5.15.93)

Best regards,

EPZ

digwer · May 11, 2023

Hi @EPZ
Sorry for the late response.
I think this problem was solved by locking the CPU frequency with cpufrequtils:
```
root@helios64:/etc# cat default/cpufrequtils
ENABLE=true
MIN_SPEED=1008000
MAX_SPEED=1008000
GOVERNOR=performance
```

After locking the CPU frequency, check it here a few times:
```
root@helios64:/sys# grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:1008000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:1008000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:1008000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:1008000
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:1008000
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:1008000
```
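A minimal sketch that automates this check, assuming the standard cpufreq sysfs layout shown above. The function name `check_cpu_freq` is made up here, and the second argument exists only so the function can be dry-run against a fake directory.

```shell
#!/bin/sh
# Report any core whose current frequency differs from the expected value.
# Arguments: expected frequency in kHz, then an optional sysfs CPU directory
# (defaults to the real one; the override is only for dry runs).
check_cpu_freq() {
    expected="$1"
    cpu_root="${2:-/sys/devices/system/cpu}"
    status=0
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue
        cur=$(cat "$f")
        if [ "$cur" != "$expected" ]; then
            echo "MISMATCH $f: $cur (expected $expected)"
            status=1
        fi
    done
    return "$status"
}

# Example: sample five times, ten seconds apart.
#   for i in 1 2 3 4 5; do check_cpu_freq 1008000 && echo ok; sleep 10; done
```

The function returns non-zero as soon as any core reports a different value, so it can also be dropped into a cron job or a monitoring check.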

EPZ · May 28, 2023

Hi @digwer

Thanks for your answer. It has helped so far. Let's see how it goes in the long term.

A few questions for my better understanding:

1) What is the underlying issue, and why does this help?

2) What is the impact of this change on performance? (Main activity is file uploading/downloading through Nextcloud and git)

3) And the impact on power consumption? (My Helios is largely idle; the main activity is a request from Nextcloud every 30s to check whether there are any changes)

4) What are the limits/options for these settings while still fixing the issue?

Best regards and thanks again for your help

EPZ
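On question 4, a small sketch for listing what the kernel's cpufreq driver reports as selectable for each frequency policy. It only reads standard sysfs files and does not say which values are stable on a given board. `scaling_available_frequencies` may be absent with some drivers, and the helper name `list_cpufreq_options` is made up here.

```shell
#!/bin/sh
# List the frequencies (kHz) and governors each cpufreq policy offers.
# Argument: optional cpufreq sysfs directory (defaults to the real one;
# the override is only for dry runs).
list_cpufreq_options() {
    cpufreq_root="${1:-/sys/devices/system/cpu/cpufreq}"
    for p in "$cpufreq_root"/policy*; do
        [ -d "$p" ] || continue
        echo "${p##*/}:"
        for name in scaling_available_frequencies scaling_available_governors; do
            [ -r "$p/$name" ] && echo "  $name: $(cat "$p/$name")"
        done
    done
    return 0
}

# On the NAS itself:
#   list_cpufreq_options
```

Any MIN_SPEED or MAX_SPEED value in /etc/default/cpufrequtils would normally be picked from the frequencies listed this way.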
