
Helios64 randomly dying - help!


gmerrall


My Helios64 NAS has suddenly started freezing at random. It doesn't appear to be an actual shutdown, as the front panel lights all stay on. I lose all access, including the serial console. The only way to recover is a long press on the power button to force a power cycle.

 

I've cranked the verbosity up to 7 in /boot/armbianEnv.txt in case something is output on the serial console, but I haven't seen any useful output yet; for obvious reasons it's hard to catch it happening. Because /var/log is cleared on every boot, I can't see whether anything was logged just before the hang.

 

I'm not sure where to start diagnosing this. Would one option be to remove the folder2ram mount for /var/log so that the logs persist?

 

output of "uname -a" in case it helps

Linux helios64 5.10.43-rockchip64 #21.05.4 SMP PREEMPT Wed Jun 16 08:02:12 UTC 2021 aarch64 GNU/Linux

 

Kobol folks: I'm in Singapore if that helps any.


You can set up logging to go to a flash drive, but capturing the console is going to be the best bet.

Do you have another computer you can leave on with the serial console connected? For example, I have a small NUC where I keep picocom running in a tmux session, that way I don't lose anything. You can use the -g option to have picocom log to a file, as well.
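A rough sketch of that setup, in case it helps. It assumes the USB serial adapter shows up as /dev/ttyUSB0 on the logging machine; the function name, tmux session name and log path are made up for illustration. 1500000 baud is the Helios64 console speed, and -g is picocom's log-to-file option mentioned above. The function only prints the command, so it can be checked before anything is started (paths containing spaces would need extra quoting).

```shell
# Build the picocom command line used for console logging.
# Assumptions: /dev/ttyUSB0 and the log path are placeholders, adjust to your machine.
console_log_cmd() {
    port="${1:-/dev/ttyUSB0}"
    logfile="${2:-$HOME/helios64-console.log}"
    # -b sets the baud rate, -g makes picocom write everything it receives to a file
    printf 'picocom -b 1500000 -g %s %s\n' "$logfile" "$port"
}

# Usage on the always-on machine (commented out so sourcing this file starts nothing):
#   tmux new-session -d -s helios64 "$(console_log_cmd)"
#   tmux attach -t helios64    # detach again with Ctrl-b d
```

After a hang, the tail of the log file should show whatever the kernel printed last, if it printed anything at all.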


I have the same problem as described: the device is powered on and all lights are on (though not blinking), but it is not accessible through serial or SSH.
I will try leaving a Raspberry Pi connected via serial to capture the logs, because nothing appears in the systemd log.

Edited by digwer

Hi! Not sure if it's related to the original post, but my Kobol Helios64 seems to have just stopped responding following a reboot. I already tried powering it down and up again, and the issue persists.

 

The hardware starts, the disks are initialised and it responds to ping, but it can no longer be accessed via SSH or the OpenMediaVault web admin page. My Helios64 is updated with the latest Armbian Buster 5.10.x and the latest OpenMediaVault 5.x. Armbian + OMV are installed to the internal eMMC, Docker apps are installed to the M.2 SATA drive on port 1, and the other 4 HDDs are on SATA ports 2 to 5 in a ZFS setup for data storage.

 

A few minutes before the reboot I noticed kernel errors that syslog broadcast to an SSH shell connected to my Helios64. They kept appearing at random, a few minutes apart, as follows:

 

Message from syslogd@helios64 at Jul 16 18:47:05 ...
 kernel:[111630.816643] Internal error: Oops: 96000004 [#8] PREEMPT SMP

Message from syslogd@helios64 at Jul 16 18:47:05 ...
 kernel:[111630.843686] Code: 14000011 f9400273 b40001f3 d1002274 (b9402280)

 

I then rebooted from the OMV web admin page and it stopped responding from there. Pings are OK, but SSH connections are refused and the OMV web admin page is dead. Any idea what this is about and how to sort it out?


16 hours ago, jotapesse said:
Message from syslogd@helios64 at Jul 16 18:47:05 ...
 kernel:[111630.816643] Internal error: Oops: 96000004 [#8] PREEMPT SMP

Message from syslogd@helios64 at Jul 16 18:47:05 ...
 kernel:[111630.843686] Code: 14000011 f9400273 b40001f3 d1002274 (b9402280)

 

jotapesse, I think that kernel Oops is a different problem from gmerrall's and mine. In our case we don't have any kernel crashes, just random hardware lockups. I would suggest connecting via serial, capturing the whole kernel crash and creating another thread.
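For what it's worth, the hex number after "Oops:" on arm64 is the exception syndrome register (ESR) value, so it can be split into fields even before the full trace is captured. The sketch below is only an illustration (the function name is made up) and decodes just the top-level fields, not everything the register encodes.

```shell
# Split an arm64 ESR value (the hex number after "Oops:") into its main fields.
decode_esr() {
    esr="0x${1#0x}"                  # accept the value with or without a 0x prefix
    ec=$(( ($esr >> 26) & 0x3f ))    # exception class, bits 31:26
    il=$(( ($esr >> 25) & 0x1 ))     # instruction length bit, bit 25
    iss=$(( $esr & 0x1ffffff ))      # instruction specific syndrome, bits 24:0
    fsc=$(( $iss & 0x3f ))           # fault status code, only meaningful for aborts
    printf 'EC=0x%02x IL=%d ISS=0x%07x FSC=0x%02x\n' "$ec" "$il" "$iss" "$fsc"
}

# Usage:
#   decode_esr 96000004
```

For 96000004 this gives EC=0x25 and FSC=0x04, which as far as I can tell reads as a data abort taken in kernel mode with a level 0 translation fault, i.e. the kernel touched an unmapped address. That would fit a software crash rather than a silent hardware lockup, but the full trace from the serial console is still needed to say where it happened.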

meymarce, I haven't tried that. How should I do that?


11 hours ago, digwer said:

jotapesse, I think that kernel Oops is a different problem from gmerrall's and mine. In our case we don't have any kernel crashes, just random hardware lockups. I would suggest connecting via serial, capturing the whole kernel crash and creating another thread.

 

Thanks! Yes, I believe that's the case. I guess something got corrupted on my install. Armbian booted fine from an SD card. I tried everything: copying all the files in the /boot directory, a filesystem check, and chrooting in to update and fully upgrade. In the end I gave up and reinstalled everything from scratch. Let's see if it's stable in the long run... I'm getting a bit worried by all the reports I read here about instability, hangs, crashes, boot corruption, and custom CPU voltage and frequency modifications. I need it to work reliably as a NAS and apps server. Hopefully all of you will get yours sorted as well.


Hi,

 

gmerrall and digwer, did you find a solution to this problem? I am facing the same problem and have no idea how to solve it.

(Also on Helios64, Armbian 22.02.1 with Debian and kernel 5.15.93)

 

Best regards,

EPZ


Hi @EPZ
Sorry for the late response.
I think this problem was solved by locking the CPU frequency with cpufrequtils:
```
root@helios64:/etc# cat default/cpufrequtils
ENABLE=true
MIN_SPEED=1008000
MAX_SPEED=1008000
GOVERNOR=performance
```

After locking the CPU frequency, check the current values a few times:
```
root@helios64:/sys# grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:1008000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:1008000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:1008000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:1008000
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:1008000
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:1008000
```
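If you'd rather not eyeball the grep output each time, the same check can be wrapped in a small function. This is only a sketch: the function name is made up, and the optional second argument exists purely so it can be pointed at a different directory for testing.

```shell
# Check that every core reports the expected frequency (in kHz).
# Returns 0 if all cores match, 1 on any mismatch, 2 if no cpufreq entries were found.
check_locked_freq() {
    expected="$1"
    root="${2:-/sys/devices/system/cpu}"
    result=0
    found=0
    for f in "$root"/cpu*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue          # skips the unexpanded pattern if nothing matched
        found=$((found + 1))
        cur=$(cat "$f")
        if [ "$cur" != "$expected" ]; then
            echo "MISMATCH $f: $cur (expected $expected)"
            result=1
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no cpufreq entries found under $root"
        return 2
    fi
    return "$result"
}

# Usage, sampling a few times as suggested above:
#   for i in 1 2 3 4 5; do check_locked_freq 1008000 && echo "all cores locked"; sleep 10; done
```

Any MISMATCH line would mean the lock from /etc/default/cpufrequtils is not being applied to that core.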


Hi @digwer

Thanks for your answer. It has helped so far :) Let's see how it holds up in the long term.

 

A few questions for my better understanding:

1) What is the underlying issue, or why does this help?

2) What is the impact of this change on performance? (Main activity is file uploading/downloading through Nextcloud and git.)

3) And the impact on power consumption? (My Helios64 is largely idle; the main activity is a request from Nextcloud every 30 s to check whether there are any changes.)

4) What are the limits/options for these settings while still fixing the issue?

 

Best regards and thanks again for your help :)

EPZ 
