pekkal Posted March 29, 2020
How to debug? After running since August with no issues, my Helios4 has now crashed twice during the past week: the OLED remains visible without being updated, the fan remains quiet (unlike after shutdown), and all services are down. After unplugging and replugging the power, the system starts and all services are OK, but obviously the logs are gone. After reboot, log.hdd has very little and dmesg shows no errors. How can I move logs to permanent storage, or what else can I do post mortem ... or is it time to rebuild the system on a new SD card? Regards, pekkal
taziden Posted March 29, 2020
Hi, the same thing has been happening to me for the past few weeks. Crashes are becoming more and more frequent, though: twice last night, and once again earlier today. I've switched to a brand new SD card, and crashes still occur. Neither atop, nor top, nor the logs (I even deactivated RAM logging in order to have the logs stored on the SD card) show me anything useful. The NAS becomes unresponsive, both on the network and on the serial port. The power LED is on, and the disk activity LEDs are either all lit up or all off, but most of the time they are off. The network (RJ45 port) LEDs are still blinking, though. The OLED remains visible without being updated and the fan remains quiet. Can it be a power adapter issue? It's only a few months old; the previous one died and I had to order a new one from Kobol.
Edit: it might have been a dust issue. I've just opened the case and cleaned out a bunch of dust. So far so good, I hope it was just this. Fingers crossed.
Edit 2: nope, the issue still occurs. I'll try with another PSU.
jimandroidpc Posted March 29, 2020
Yeah, this is the 2nd time I've run an update on OMV and had crashes, and now I can't log in to the web login even after trying to reset the password etc.
gprovost (Author) Posted March 30, 2020
@pekkal @taziden Yes, it could be a failing PSU. Unfortunately some PSUs use a brand of capacitor that doesn't live up to its advertised MTBF :-/ So you had better check with a voltmeter that you still get 12V between the (+) and (-) pins.
@jimandroidpc What kind of crash are you referring to? After the OMV update, can you SSH in or use the serial console? It seems your issue is a failed OMV upgrade, not something related to hardware. Is it just an update, or an upgrade (OMV4 to OMV5), that you are doing?
FredK Posted March 30, 2020
Release of openmediavault 5 (Usul) https://www.openmediavault.org/?p=2685
jimandroidpc Posted March 31, 2020
22 hours ago, gprovost said: "@pekkal @taziden Yes, it could be a failing PSU. Unfortunately some PSUs use a brand of capacitor that doesn't live up to its advertised MTBF :-/ So you had better check with a voltmeter that you still get 12V between the (+) and (-) pins. @jimandroidpc What kind of crash are you referring to? After the OMV update, can you SSH in or use the serial console? It seems your issue is a failed OMV upgrade, not something related to hardware. Is it just an update, or an upgrade (OMV4 to OMV5), that you are doing?"
Fixed with this: https://forum.openmediavault.org/index.php?thread/31630-solved-omv-accepts-login-but-brings-me-back-to-the-login-screen/
As much as I'd like it to be, OMV isn't really production-ready software with errors like these.
gprovost (Author) Posted March 31, 2020
On 3/29/2020 at 6:26 PM, pekkal said: "How can I move logs to permanent storage or what can I do otherwise post mortem ... or is it the time to rebuild the system on a new SD card?"
You can disable armbian-ramlog in /etc/default/armbian-ramlog; this way, hopefully, the next time it crashes you will see something in /var/log. You will need to reboot after disabling armbian-ramlog.
@taziden Maybe you can do the same in order to see what's happening.
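A minimal sketch of the step described above, assuming /etc/default/armbian-ramlog contains a line of the form ENABLED=true (an assumption about the file layout, so look at the file before changing it). The helper name disable_ramlog is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: set the ENABLED= line of an armbian-ramlog
# defaults file to false. Assumes the file has a line "ENABLED=...".
disable_ramlog() {
    conf="${1:-/etc/default/armbian-ramlog}"
    # Keep a backup so the change is easy to revert.
    cp "$conf" "$conf.bak" || return 1
    # Rewrite only the ENABLED= line; all other settings stay untouched.
    sed -i 's/^ENABLED=.*/ENABLED=false/' "$conf"
}

# Usage (as root), followed by a reboot so that /var/log is written
# to permanent storage from boot onwards:
#   disable_ramlog && reboot
```

Restoring the .bak copy and rebooting should bring RAM logging back if the extra SD card writes become a concern.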
taziden Posted March 31, 2020
5 hours ago, gprovost said: "You can disable armbian-ramlog in /etc/default/armbian-ramlog; this way, hopefully, the next time it crashes you will see something in /var/log. You will need to reboot after disabling armbian-ramlog. @taziden Maybe you can do the same in order to see what's happening."
I disabled it already. It was my first action :-) I disabled both the armbian-zram-config and armbian-ramlog services at the systemd level (systemctl disable). Today, one hour ago, the same issue occurred with a brand new PSU. I have a spare Helios4 (from batch3); if you haven't got any other suggestions, I'm going to put the disks in it and see if this behavior still occurs. With this I'd have changed everything except the OS.
devman Posted March 31, 2020
When I write large files, the system will (intermittently) spontaneously reboot. I don't know if it's related, but I've also noticed that often, when copying from Windows over to the NAS, the throughput will drop fairly quickly to 0, stay there for a minute or so, and then shoot back up. I checked the power supply and it's giving me ~12.5V.
gprovost (Author) Posted April 1, 2020
@taziden @devman Can you both run armbianmonitor -u and post the link here?
15 hours ago, taziden said: "I have a spare Helios4 (from batch3); if you haven't got any other suggestions, I'm going to put the disks in it and see if this behavior still occurs. With this I'd have changed everything except the OS."
Yeah, you could test that; it would be helpful to narrow down the issue.
11 hours ago, devman said: "When I write large files (intermittently) the system will spontaneously reboot."
Hmmm, it reboots rather than hangs? That is strange. How long in total have you been running your Helios4? Just in case, do you have the watchdog service running? ( systemctl status watchdog.service ) Just trying to eliminate a software issue.
devman Posted April 1, 2020
24 minutes ago, gprovost said: "Hmmm, it reboots rather than hangs? That is strange. How long in total have you been running your Helios4? Just in case, do you have the watchdog service running? ( systemctl status watchdog.service ) Just trying to eliminate a software issue."
Oh, definitely a reboot. The way I first noticed it was the sound of the fans all spinning up to 100%. It's one of the first batch units, so ~2 years now. Current uptime is a few hours since the last failure.
http://ix.io/2g1j
● watchdog.service - watchdog daemon
   Loaded: loaded (/lib/systemd/system/watchdog.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2020-04-01 02:17:21 HKT; 10h ago
  Process: 2232 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modp
  Process: 2234 ExecStart=/bin/sh -c [ $run_watchdog != 1 ] || exec /usr/sbin/watchdog $watchdog_options (code=exited,
 Main PID: 2236 (watchdog)
    Tasks: 1 (limit: 4776)
   Memory: 708.0K
   CGroup: /system.slice/watchdog.service
           └─2236 /usr/sbin/watchdog
Apr 01 02:17:21 helios4 watchdog[2236]: interface: no interface to check
Apr 01 02:17:21 helios4 watchdog[2236]: temperature: no sensors to check
Apr 01 02:17:21 helios4 watchdog[2236]: no test binary files
Apr 01 02:17:21 helios4 watchdog[2236]: no repair binary files
Apr 01 02:17:21 helios4 watchdog[2236]: error retry time-out = 60 seconds
Apr 01 02:17:21 helios4 watchdog[2236]: repair attempts = 1
Apr 01 02:17:21 helios4 watchdog[2236]: alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
Apr 01 02:17:21 helios4 watchdog[2236]: watchdog now set to 60 seconds
Apr 01 02:17:21 helios4 watchdog[2236]: hardware watchdog identity: Orion Watchdog
Apr 01 02:17:21 helios4 systemd[1]: Started watchdog daemon.
gprovost (Author) Posted April 1, 2020
@devman It would be great to reproduce the issue with a computer connected to the Helios4 serial console, to see what happens when the system crashes. A few things to set up first:
1. Disable the watchdog service: systemctl disable watchdog.service
2. Edit /boot/armbianEnv.txt and add: extraargs=ignore_loglevel
Reboot the system, connect to the serial console and log in, then run:
dmesg -n 7
dmesg -w
Then let it run until the system crashes, and hopefully you manage to catch something.
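The armbianEnv.txt edit from step 2 can be sketched as a small script. This is only an illustration: the function name add_ignore_loglevel is invented here, and the handling of an already existing extraargs= line (appending to it rather than adding a second line) is an assumption about what is wanted, not something stated in the post.

```shell
#!/bin/sh
# Hypothetical helper: make sure the kernel argument ignore_loglevel is
# set through an extraargs= line in an Armbian environment file.
add_ignore_loglevel() {
    env_file="${1:-/boot/armbianEnv.txt}"
    if grep -q '^extraargs=.*ignore_loglevel' "$env_file"; then
        return 0    # already present, nothing to do
    elif grep -q '^extraargs=' "$env_file"; then
        # Append to the existing extraargs line instead of adding a second one.
        sed -i 's/^extraargs=.*/& ignore_loglevel/' "$env_file"
    else
        # Make sure the file ends with a newline before appending a new line.
        if [ -n "$(tail -c1 "$env_file")" ]; then
            echo >> "$env_file"
        fi
        printf 'extraargs=ignore_loglevel\n' >> "$env_file"
    fi
}

# Usage (as root), then reboot and run the two dmesg commands
# from the serial console:
#   systemctl disable watchdog.service
#   add_ignore_loglevel
#   reboot
```

On the computer attached to the serial port, a terminal program that logs to a file helps keep the last messages before the hang, for example screen -L /dev/ttyUSB0 115200 (the device name is an assumption and depends on the host).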
AgentPete Posted April 1, 2020
I haven’t used my Helios4 for about a week. On logging in to the web interface today, it seems to accept my credentials, but then throws me back to the login page again. Not the same as a failed login attempt. Just no CP. I can access via SSH and have changed the web admin password, but it's still the same problem. No idea what to do – any suggestions appreciated. P.
gprovost (Author) Posted April 2, 2020
@AgentPete Are you talking about the OMV web interface? This has nothing to do with Helios4, but it looks like jimandroidpc above had the same issue and found the solution here: https://forum.openmediavault.org/index.php?thread/31630-solved-omv-accepts-login-but-brings-me-back-to-the-login-screen/
AgentPete Posted April 2, 2020
Thanks, I suspected it was nothing to do with Helios4, but had a nagging doubt re SD card. Link very much appreciated. P.
DavidGF Posted April 3, 2020
Hello there! I'm also having issues using the new kernel. The system just freezes, so there's no point in looking at logs (nothing there even if you disable RAM logs) or using the serial port (I didn't try, but the program that drives the screen stops working, so I can only assume the whole kernel went bananas!). This has indeed been happening since the kernel upgrade. It happened about 3 times yesterday and a couple more times in the two weeks since I upgraded. I think I might just downgrade to the old kernel for now. Also, given that it doesn't crash consistently and that my workload has not changed, I suspect this might be a race condition or something similar. I will try to get a diff of this kernel vs vanilla and stare at the code. I might get lucky, who knows! David
Igor Posted April 3, 2020
1 hour ago, DavidGF said: "Also having issues using the new kernel."
And how are we supposed to know what you are talking about if you don't provide any logs or version numbers? We developed a special diagnostic tool which provides anonymised data for analysis: armbianmonitor -u
My Helios has been running fine since the upgrade one month ago.
 _   _      _ _           _  _
| | | | ___| (_) ___  ___| || |
| |_| |/ _ \ | |/ _ \/ __| || |_
|  _  |  __/ | | (_) \__ \__   _|
|_| |_|\___|_|_|\___/|___/  |_|
Welcome to Armbian buster with Linux 4.19.104-mvebu
System load: 0.09 0.04 0.01 Up time: 29 days
DavidGF Posted April 3, 2020
Sorry for that! There you go: https://pastebin.com/SQhse3qy
Also, my /boot/config-4.19.104-mvebu shows that CONFIG_CRASH_DUMP and CONFIG_KEXEC are not set, which makes debugging this more complicated. Ideally I'd like to get a physical memory dump when this happens, since there are no logs that I can use.
FrancisTheodoreCatte Posted April 6, 2020 (edited)
On 4/1/2020 at 1:47 AM, gprovost said: "@devman It would be great to reproduce the issue with a computer connected to the Helios4 serial console, to see what happens when the system crashes. A few things to set up first: 1. Disable the watchdog service: systemctl disable watchdog.service 2. Edit /boot/armbianEnv.txt and add: extraargs=ignore_loglevel Reboot the system, connect to the serial console and log in, then run dmesg -n 7 and dmesg -w. Then let it run until the system crashes, and hopefully you manage to catch something."
After upgrading to the 4.19.104-mvebu kernel I also started experiencing the same complete CPU lockups, but here's the kicker: the lockups are continuing, even after downgrading to kernel 4.14.171-mvebu. An easy way for me to trigger it is by running
echo check > /sys/block/md0/md/sync_action
I managed to catch the CPU stall via the serial console: https://pastebin.com/C8yCQMAn
Here's the output from armbianmonitor -u: http://ix.io/2h03
Edit: the same lockup still happens when running an md data-check even after downgrading further, to 4.14.153-mvebu.
Edited April 6, 2020 by FrancisTheodoreCatte: more info
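For anyone trying to reproduce this, here is a small sketch built around the same sysfs file. The function name md_sync_action is made up for illustration, and md0 is simply the array name used in the post above; writing idle to sync_action is the standard md way to stop a running check.

```shell
#!/bin/sh
# Hypothetical wrapper around the md sysfs interface: start ("check")
# or stop ("idle") a data-check on an md array.
md_sync_action() {
    action="$1"
    md_dir="${2:-/sys/block/md0/md}"
    case "$action" in
        check|idle)
            echo "$action" > "$md_dir/sync_action"
            ;;
        *)
            echo "usage: md_sync_action check|idle [md_dir]" >&2
            return 2
            ;;
    esac
}

# Usage (as root):
#   md_sync_action check    # start the data-check that triggers the lockup
#   cat /proc/mdstat        # watch the progress of the check
#   md_sync_action idle     # abort the check
```

Running this with the serial console capture already active should make it more likely that the stall trace ends up in a log.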
Igor Posted April 6, 2020
1 hour ago, FrancisTheodoreCatte said: "After upgrading to the 4.19.104-mvebu kernel I also started experiencing the same complete CPU lockups"
Welcome to Armbian buster with Linux 4.19.104-mvebu
System load: 0.00 0.00 0.00 Up time: 33 days
Try setting a fixed speed of 1.6 GHz or 800 MHz.
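One way to try the fixed-speed suggestion is through the generic cpufreq sysfs files. This is a sketch under assumptions: the function name pin_cpu_freq is invented, the policy directories are assumed to live under /sys/devices/system/cpu/cpufreq, and the values 1600000 and 800000 are the kHz equivalents of the two speeds named above. The setting does not survive a reboot.

```shell
#!/bin/sh
# Hypothetical helper: pin every cpufreq policy to one frequency (in kHz)
# by setting scaling_min_freq and scaling_max_freq to the same value.
pin_cpu_freq() {
    khz="$1"
    root="${2:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$root"/policy*; do
        [ -d "$policy" ] || continue
        cur_max=$(cat "$policy/scaling_max_freq")
        # Write in an order that never leaves min above max, since the
        # kernel may reject such an intermediate state.
        if [ "$khz" -gt "$cur_max" ]; then
            echo "$khz" > "$policy/scaling_max_freq"
            echo "$khz" > "$policy/scaling_min_freq"
        else
            echo "$khz" > "$policy/scaling_min_freq"
            echo "$khz" > "$policy/scaling_max_freq"
        fi
    done
}

# Usage (as root):
#   pin_cpu_freq 1600000    # fixed 1.6 GHz
#   pin_cpu_freq 800000     # fixed 800 MHz
```

If the lockups stop at a fixed speed, that would point towards the frequency-switching path rather than the workload itself.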
DavidGF Posted April 7, 2020
What doesn't make sense is what Francis says about the issue persisting after downgrading, right? Does the previous kernel have the DVFS patches, or are they only in this kernel? I recall the old kernel would always run at either 1.6GHz or 800MHz, right? I think I've never had a crash when setting the max freq to 800MHz, which I do to reduce heat and noise while I'm not using the NAS actively (just serving). But from the trace it does look like a bug in the DVFS code. From the looks of it, the kernel is trying to signal the other CPUs (in this case just one other CPU) but gets into some weird loop triggered via interrupts? Looking at the patches it seems non-trivial, since the other CPU could be offline, or many other weird things, as well as race conditions, could happen. I hope it can be resolved soon with the help of the backtrace! Glad it is possible to get a dump via serial, too! Thanks both for your help!
FrancisTheodoreCatte Posted April 7, 2020
I've pinned the clock speed to 1.6GHz (800MHz makes Plex unhappy). I'll leave it for a while with the serial console open and see what happens. My experience has been at least a crash a day.
aprayoga Posted April 7, 2020
7 hours ago, FrancisTheodoreCatte said: "After upgrading to the 4.19.104-mvebu kernel I also started experiencing the same complete CPU lockups, but here's the kicker: the lockups are continuing, even after downgrading to kernel 4.14.171-mvebu. An easy way for me to trigger it is by running echo check > /sys/block/md0/md/sync_action I managed to catch the CPU stall via the serial console: https://pastebin.com/C8yCQMAn Here's the output from armbianmonitor -u: http://ix.io/2h03 Edit: the same lockup still happens when running an md data-check even after downgrading further, to 4.14.153-mvebu."
Thanks for providing the crash log. Last weekend I ran some tests with the following images:
- Buster NEXT upgraded to Buster CURRENT
- fresh Buster CURRENT
and loaded the system with
stress-ng -c 2 -P 70
It is supposed to keep the system busy and running at full speed (1.6 GHz). I ran each test for about 20 hours but did not encounter any issue. Looking at your log, it seems to be marvell_xor that triggers the crash. I will try your suggestion to trigger the crash.
gprovost (Author) Posted April 7, 2020
7 hours ago, DavidGF said: "Glad it is possible to get a dump via serial, too!"
It would be great if you could also catch and log your kernel crash, as @FrancisTheodoreCatte did.
taziden Posted April 7, 2020
On 4/1/2020 at 4:08 AM, gprovost said: "@taziden Can you run armbianmonitor -u and post the link here? Yeah, you could test that; it would be helpful to narrow down the issue."
Here is the output of armbianmonitor: http://ix.io/2h22
The same issue occurred with my spare Helios4 from batch3, using an unused PSU and unused fans.
DavidGF Posted April 7, 2020
Is marvell_xor used for anything other than RAID? I only use RAID 0/1, which doesn't have any kind of parity operation.
gprovost (Author) Posted April 7, 2020
35 minutes ago, DavidGF said: "Is marvell_xor used for anything other than RAID? I only use RAID 0/1, which doesn't have any kind of parity operation."
Yup, it is most probably not related to marvell_xor, just a coincidence, because we have 2 users (including you) who see the crash and don't have a RAID 5 or 6 setup.
8 hours ago, DavidGF said: "Does the previous kernel have the DVFS patches, or are they only in this kernel? I recall the old kernel would always run at either 1.6GHz or 800MHz, right?"
There were some changes back in January that were related to DVFS. We need to take another look at them.
gprovost (Author) Posted April 7, 2020
1 hour ago, taziden said: "The same issue occurred with my spare Helios4 from batch3, using an unused PSU and unused fans."
Any idea what kind of activity is happening on the system when it crashes?
taziden Posted April 7, 2020
2 hours ago, gprovost said: "Any idea what kind of activity is happening on the system when it crashes?"
I've tried disabling everything. Even just having the system running, without decrypting and mounting the RAID, eventually ends with the Helios being unresponsive. I'll do some more tests, like booting Arch, and will keep you posted.