Setup: Helios64 with Armbian 20.08.8 Buster with Linux 4.4.213-rk3399. Will change to Linux 5.8 soon though. 5 x 8 TB drives. Not using battery.
I’ve had this issue ever since I started using my Helios. It first happened when I tried building a RAID with OpenMediaVault: it would crash and reboot during the build, sometimes after only 10 minutes, sometimes after 8+ hours. The same is true for video transcoding nowadays: it sometimes crashes after a little while, sometimes after several consecutive hours, but it never manages to run such a task for much longer than 12 hours. Downloading large files from the internet carries a similar risk of crashing it.
Also, the crash never shows up in the logs. For some reason, the timeframe around each crash was always missing. Nowadays it’s even worse: my logs only go up to December 11th, and even if I clear them in the OMV interface, the old entries are back after the next reboot and nothing new gets logged.
While the server is performing a demanding task, most figures on the SSH login screen are red: System Load at around 170%, CPU temp at 70°C, etc. CPU usage in the OMV interface is around 97%.
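Since the system logs do not survive the crash, one way to capture these figures right up to the moment of a reboot might be a small logger that writes each sample straight to disk. The following is only a minimal sketch: the thermal zone path is an assumption (the zone index may differ on the Helios64), and the log file location and the function names are hypothetical.

```python
import os
import time
from pathlib import Path

# Assumed sysfs path; the thermal zone index may differ on the Helios64.
THERMAL_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
# Hypothetical log location; it should sit on persistent storage, not a RAM disk.
LOG_PATH = Path("/var/tmp/helios_health.log")


def read_temp_celsius(path=THERMAL_PATH):
    """Return the temperature in degrees Celsius (sysfs reports millidegrees)."""
    return int(Path(path).read_text().strip()) / 1000.0


def format_sample(timestamp, load1, temp_c):
    """Build one log line from a timestamp, 1-minute load average and temperature."""
    return f"{timestamp} load1={load1:.2f} temp={temp_c:.1f}C\n"


def append_sample(log_path, line):
    """Append a line and force it to disk so it survives a sudden reboot."""
    with open(log_path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())


def log_forever(interval_s=5):
    """Sample load and temperature every interval_s seconds until interrupted."""
    while True:
        load1, _, _ = os.getloadavg()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        append_sample(LOG_PATH, format_sample(stamp, load1, read_temp_celsius()))
        time.sleep(interval_s)
```

Calling `log_forever()` in a separate session before starting a transcode would leave the last readings before a crash in the log file, which should at least show whether temperature or load was climbing at the time.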
At first I thought it was normal but a friend of mine told me servers should absolutely never reboot on their own; that this is an indication something is not right.
My impression from the above-described behavior is that somehow my machine isn’t able to limit the amount of CPU used. I expected the server to become slower under more CPU stress, but instead it seems to not regulate well at all and overwork itself. Or maybe some part of the software/hardware is faulty and will randomly cause a crash.
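Whether the CPU is being regulated at all could perhaps be checked by reading the frequency-scaling settings the kernel exposes. This is a minimal sketch, assuming the standard Linux cpufreq sysfs layout is available on this kernel; the function names are hypothetical.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location; assumed to be present on this kernel.
CPU_ROOT = Path("/sys/devices/system/cpu")


def read_cpufreq(cpu_root=CPU_ROOT):
    """Return {cpu_name: (governor, current_mhz, max_mhz)} for each core with cpufreq."""
    result = {}
    for freq_dir in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq")):
        governor = (freq_dir / "scaling_governor").read_text().strip()
        cur_khz = int((freq_dir / "scaling_cur_freq").read_text())
        max_khz = int((freq_dir / "scaling_max_freq").read_text())
        # sysfs reports kHz; convert to MHz for readability.
        result[freq_dir.parent.name] = (governor, cur_khz // 1000, max_khz // 1000)
    return result


def print_cpufreq(cpu_root=CPU_ROOT):
    """Print one line per core with its governor and frequencies."""
    for cpu, (gov, cur, mx) in read_cpufreq(cpu_root).items():
        print(f"{cpu}: governor={gov} current={cur} MHz max={mx} MHz")
```

Running `print_cpufreq()` while the machine is idle and again under load would show which governor is active and whether the current frequency ever drops below the maximum as the temperature rises.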
As you may have noticed, I’m rather new to a lot of things here. I have no idea how to even begin troubleshooting this, so I’ll need some pointers from you. What could be the cause of this, and what tests can I do to narrow it down?