Posted

How to debug?

 

After running since August with no issues, my Helios4 has now crashed twice during the past week: the OLED remains visible without being updated, the fan remains quiet (unlike after shutdown), and all services are down. After unplugging and replugging the power, the system starts and all services are OK, but obviously the logs are gone. After reboot, log.hdd has very little and dmesg shows no errors.

 

How can I move logs to permanent storage, or what else can I do post mortem ... or is it time to rebuild the system on a new SD card?

 

Regards,

pekkal

Posted

Hi,

 

The same thing has been happening to me for the past few weeks. Crashes are becoming more and more frequent, though: twice last night, and once again earlier today.

I've switched to a brand new SD card; crashes still occur.

Neither atop, nor top, nor the logs (I even deactivated RAM logging in order to have the logs stored on the SD card) show me anything useful.

The NAS becomes unresponsive, both on the network and on the serial port. The power LED is ON, and the disk activity LEDs are either all lit up or all off, but most of the time they are OFF.

The network (RJ45 port) LEDs are still blinking, though.

The OLED remains visible without being updated and the fan remains quiet.

Can it be a power adapter issue? It's only a few months old, previous one died and I had to order a new one from Kobol.

 

Edit : it might have been a dust issue. I've just opened the case and cleaned a bunch of dust. So far so good, I hope it was just this. Fingers crossed.

Edit2 : nope, issue still occurs :( I'll try with another PSU

Posted

Yeah, this is the 2nd time I've run an update on OMV and had crashes, and now I can't log in to the web login even after trying to reset the password, etc.


Posted

@pekkal @taziden Yes, it could be a failing PSU. Unfortunately some PSUs use a brand of capacitor that doesn't live up to its advertised MTBF :-/

 

So it's best to check with a voltmeter that you still get 12V between the (+) and (-) pins.

 

image.png.1bc36330a2eaca9476b56039002cc17d.png

 

@jimandroidpc What kind of crash are you referring to? After the OMV update, can you SSH in or use the serial console? It seems your issue is a failed OMV upgrade, not something hardware-related. Is it just an update, or an upgrade (OMV4 to OMV5), that you are doing?

Posted
22 hours ago, gprovost said:

@pekkal @taziden Yes, it could be a failing PSU. Unfortunately some PSUs use a brand of capacitor that doesn't live up to its advertised MTBF :-/

 

So it's best to check with a voltmeter that you still get 12V between the (+) and (-) pins.

 

image.png.1bc36330a2eaca9476b56039002cc17d.png

 

@jimandroidpc What kind of crash are you referring to? After the OMV update, can you SSH in or use the serial console? It seems your issue is a failed OMV upgrade, not something hardware-related. Is it just an update, or an upgrade (OMV4 to OMV5), that you are doing?

 

Fixed with this -- https://forum.openmediavault.org/index.php?thread/31630-solved-omv-accepts-login-but-brings-me-back-to-the-login-screen/

 

As much as I'd like it to be, OMV isn't really production-ready software with errors like these.

Posted
On 3/29/2020 at 6:26 PM, pekkal said:

How can I move logs to permanent storage or what can I do otherwise post mortem ... or is it the time to rebuild the system on a new SD card?

 

You can disable armbian-ramlog in /etc/default/armbian-ramlog; that way, hopefully, the next time it crashes you will see something in /var/log. You will need to reboot after disabling armbian-ramlog.
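
For anyone following along, the step above could be scripted roughly as below. This is only a sketch: it assumes the file contains a line of the form ENABLED=true, and the function name disable_ramlog is made up for illustration. Check the file by hand before relying on it.

```shell
#!/bin/sh
# Sketch: switch off Armbian's RAM logging so /var/log survives a hard crash.
# Assumption: the config file has a line of the form ENABLED=true.

# disable_ramlog FILE: rewrite the ENABLED= line in FILE to ENABLED=false.
# (Hypothetical helper name, not part of Armbian.)
disable_ramlog() {
    conf="$1"
    # Keep a backup copy next to the original before editing in place.
    cp "$conf" "$conf.bak" || return 1
    # GNU sed in-place edit; lines other than ENABLED= are left untouched.
    sed -i 's/^ENABLED=.*/ENABLED=false/' "$conf"
}

# Typical use on the NAS (as root), followed by the reboot mentioned above:
#   disable_ramlog /etc/default/armbian-ramlog
#   reboot
```

Writing logs straight to the SD card increases wear on it, so it is probably worth re-enabling RAM logging once the crash has been diagnosed.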

 

@taziden Maybe you can do the same in order to see what's happening.

Posted
5 hours ago, gprovost said:

 

You can disable armbian-ramlog in /etc/default/armbian-ramlog; that way, hopefully, the next time it crashes you will see something in /var/log. You will need to reboot after disabling armbian-ramlog.

 

@taziden Maybe you can do the same in order to see what's happening.

I disabled it already. It was my first action :-) I disabled both the armbian-zram-config and armbian-ramlog services at the systemd level (systemctl disable).

Today, one hour ago, the same issue occurred with a brand new PSU. I have a spare Helios4 (from batch 3); if you haven't got any other suggestions, I'm going to put the disks in it and see if this behavior still occurs. With that, I'll have changed everything except the OS.

Posted

When I write large files (intermittently) the system will spontaneously reboot.

 

I don't know if it's related, but I've also noticed that often, when copying from Windows over to the NAS, the throughput will drop fairly quickly to 0, stay there for a minute or so, and then shoot back up.

 

I checked the power supply and it's giving me ~12.5 V.

Posted

@taziden @devman Can you both run armbianmonitor -u and post the link here?

 

15 hours ago, taziden said:

I have a spare Helios4 (from batch 3); if you haven't got any other suggestions, I'm going to put the disks in it and see if this behavior still occurs. With that, I'll have changed everything except the OS.

 

Yeah you could test that then, it would be helpful to narrow down the issue.

 

11 hours ago, devman said:

When I write large files (intermittently) the system will spontaneously reboot.

 

Hmmm, it reboots rather than hangs? That is strange. How long in total have you been running your Helios4?

Just in case, do you have the watchdog service running? ( systemctl status watchdog.service ) Just trying to eliminate a software issue.

 

Posted
24 minutes ago, gprovost said:

Hmmm, it reboots rather than hangs? That is strange. How long in total have you been running your Helios4?

Just in case, do you have the watchdog service running? ( systemctl status watchdog.service ) Just trying to eliminate a software issue.

 

 

Oh, definitely reboot.  The way I first noticed it was the sound of the fans all spinning up to 100%

It's one of the first batch units, so ~2 years now.  Current uptime is a few hours since the last failure.

 

http://ix.io/2g1j

 

● watchdog.service - watchdog daemon
   Loaded: loaded (/lib/systemd/system/watchdog.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2020-04-01 02:17:21 HKT; 10h ago
  Process: 2232 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modp
  Process: 2234 ExecStart=/bin/sh -c [ $run_watchdog != 1 ] || exec /usr/sbin/watchdog $watchdog_options (code=exited,
 Main PID: 2236 (watchdog)
    Tasks: 1 (limit: 4776)
   Memory: 708.0K
   CGroup: /system.slice/watchdog.service
           └─2236 /usr/sbin/watchdog

Apr 01 02:17:21 helios4 watchdog[2236]: interface: no interface to check
Apr 01 02:17:21 helios4 watchdog[2236]: temperature: no sensors to check
Apr 01 02:17:21 helios4 watchdog[2236]: no test binary files
Apr 01 02:17:21 helios4 watchdog[2236]: no repair binary files
Apr 01 02:17:21 helios4 watchdog[2236]: error retry time-out = 60 seconds
Apr 01 02:17:21 helios4 watchdog[2236]: repair attempts = 1
Apr 01 02:17:21 helios4 watchdog[2236]: alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
Apr 01 02:17:21 helios4 watchdog[2236]: watchdog now set to 60 seconds
Apr 01 02:17:21 helios4 watchdog[2236]: hardware watchdog identity: Orion Watchdog
Apr 01 02:17:21 helios4 systemd[1]: Started watchdog daemon.

 

Posted

@devman It would be great to reproduce the issue with a computer connected to the Helios4 serial console, to see what happens when the system crashes.

A few things to set up first:

 

1. Disable watchdog service: systemctl disable watchdog.service

2. Edit /boot/armbianEnv.txt and add extraargs=ignore_loglevel

Reboot system

 

Connect to the serial console and log in, then run:

dmesg -n 7

dmesg -w

 

Then let it run until the system crashes, and hopefully you manage to catch something.
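
The setup steps above could be collected into a small script along these lines. Treat it as a sketch under assumptions: the helper name set_extraargs is invented for illustration, it overwrites any extraargs= line already present in the file, and the serial device name and baud rate in the last comment are guesses to adjust for your own adapter.

```shell
#!/bin/sh
# Sketch of the serial-console debugging setup described above.

# set_extraargs FILE VALUE: make sure FILE ends up with exactly one
# extraargs= line set to VALUE. (Hypothetical helper; note that it
# REPLACES any kernel arguments already listed on an extraargs= line.)
set_extraargs() {
    env_file="$1"
    value="$2"
    if grep -q '^extraargs=' "$env_file"; then
        sed -i "s|^extraargs=.*|extraargs=$value|" "$env_file"
    else
        # Add a newline first if the file does not already end with one.
        if [ -n "$(tail -c1 "$env_file")" ]; then
            echo >> "$env_file"
        fi
        printf 'extraargs=%s\n' "$value" >> "$env_file"
    fi
}

# On the Helios4 (as root):
#   systemctl disable watchdog.service
#   set_extraargs /boot/armbianEnv.txt ignore_loglevel
#   reboot
#
# After logging in on the serial console:
#   dmesg -n 7
#   dmesg -w
#
# On the attached computer, logging the session to a file means the last
# messages before the hang are kept. Device and baud rate are assumptions:
#   screen -L /dev/ttyUSB0 115200
```

Remember to re-enable the watchdog service afterwards (systemctl enable watchdog.service) if you were relying on it.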

Posted

I haven’t used my Helios4 for about a week. On logging in to the web interface today, it seems to accept my credentials, but then throws me back to the login page again. Not the same as a failed login attempt. Just no CP.

 

I can access via SSH, have changed web admin password, still the same problem. No idea what to do – any suggestions appreciated.

 

P.

Posted

Hello there!

I'm also having issues with the new kernel. The system just freezes, so there's no point in looking at logs (nothing there even if you disable RAM logging) or using the serial port (I didn't try, but the program that drives the screen stops working, so I can only assume the whole kernel went bananas!).

 

This has indeed been happening since the kernel upgrade. It happened about 3 times yesterday and a couple more times in the two weeks since I upgraded. I think I might just downgrade to the old kernel for now.

 

Also, given that it doesn't crash consistently and my workload has not changed, I suspect this might be a race condition or something similar. I will try to get a diff of this kernel vs vanilla and stare at the code. I might get lucky, who knows!

 

David

Posted
1 hour ago, DavidGF said:

Also having issues using the new kernel.


And how do we know what you are talking about without any logs or version numbers? :)

We developed a special diagnostic tool which provides anonymised data for analysis:

armbianmonitor -u

My Helios has been running fine since the upgrade one month ago.

 _   _      _ _           _  _   
| | | | ___| (_) ___  ___| || |  
| |_| |/ _ \ | |/ _ \/ __| || |_ 
|  _  |  __/ | | (_) \__ \__   _|
|_| |_|\___|_|_|\___/|___/  |_|  
                                 
Welcome to Armbian buster with Linux 4.19.104-mvebu

System load:   0.09 0.04 0.01  	Up time:       29 days

 

Posted

Sorry for that! There you go: https://pastebin.com/SQhse3qy

Also, my /boot/config-4.19.104-mvebu shows that CONFIG_CRASH_DUMP and CONFIG_KEXEC are not set, which makes debugging this more complicated.

Ideally I'd like to get a physical memory dump when this happens, since there are no logs that I can use.
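
For others who want to check the same options on their own kernel, here is a small sketch. The helper name kconfig_state is made up for illustration, and it assumes the usual kernel config layout where a disabled option appears as a "# CONFIG_FOO is not set" comment.

```shell
#!/bin/sh
# Sketch: report whether a kernel config option is enabled.

# kconfig_state FILE OPTION: print the option's value (e.g. "y" or "m"),
# or "not set" if it is disabled or absent. (Hypothetical helper name.)
kconfig_state() {
    cfg="$1"
    opt="$2"
    # Match "CONFIG_FOO=..." or "# CONFIG_FOO is not set", but not
    # longer names such as CONFIG_FOO_BAR.
    line=$(grep -E "^(# )?$opt[ =]" "$cfg" | head -n 1)
    case "$line" in
        "$opt="*) printf '%s\n' "${line#*=}" ;;
        *)        printf 'not set\n' ;;
    esac
}

# Example for the running kernel:
#   for opt in CONFIG_CRASH_DUMP CONFIG_KEXEC; do
#       printf '%s: %s\n' "$opt" "$(kconfig_state "/boot/config-$(uname -r)" "$opt")"
#   done
```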

Posted (edited)
On 4/1/2020 at 1:47 AM, gprovost said:

@devman It would be great to reproduce the issue with a computer connected to the Helios4 serial console, to see what happens when the system crashes.

A few things to set up first:

 

1. Disable watchdog service: systemctl disable watchdog.service

2. Edit /boot/armbianEnv.txt and add extraargs=ignore_loglevel

Reboot system

 

Connect to the serial console and log in, then run:

dmesg -n 7

dmesg -w

 

Then let it run until the system crashes, and hopefully you manage to catch something.

 

After upgrading to the 4.19.104-mvebu kernel I also started experiencing the same complete CPU lockups, but here's the kicker: the lockups are continuing, even after downgrading to kernel 4.14.171-mvebu.

 

An easy way for me to trigger it is by running echo check > /sys/block/md0/md/sync_action
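
If you want to try the same reproduction, the trigger can be wrapped so that it is easy to abort again. A rough sketch only: md_sync_action is a made-up helper name, md0 is the array from the post above and may be named differently on your system, and writing idle to sync_action is the standard md way to stop a running check.

```shell
#!/bin/sh
# Sketch: start or abort an md data-check through sysfs.

# md_sync_action FILE ACTION: write "check" or "idle" to an md
# sync_action file. (Hypothetical helper; refuses anything else so a
# typo cannot start a repair by accident.)
md_sync_action() {
    action_file="$1"
    action="$2"
    case "$action" in
        check|idle)
            printf '%s\n' "$action" > "$action_file"
            ;;
        *)
            echo "unsupported action: $action" >&2
            return 1
            ;;
    esac
}

# Start the check (as root) and follow its progress:
#   md_sync_action /sys/block/md0/md/sync_action check
#   watch -n 5 cat /proc/mdstat
#
# Abort it again:
#   md_sync_action /sys/block/md0/md/sync_action idle
```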

 

I managed to catch the CPU stall via the serial console:
https://pastebin.com/C8yCQMAn

 

Here's the output from armbianmonitor -u:

http://ix.io/2h03

 

edit:

same lockup still happens when running an md data-check even after downgrading further, to 4.14.153-mvebu.

Edited by FrancisTheodoreCatte
more info
Posted
1 hour ago, FrancisTheodoreCatte said:

After upgrading to the 4.19.104-mvebu kernel I also started experiencing the same complete CPU lockups

Welcome to Armbian buster with Linux 4.19.104-mvebu

System load:   0.00 0.00 0.00      Up time:       33 days        

Try setting a fixed speed of 1.6 GHz or 800 MHz.
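
One possible way to pin the clock for a test is through the generic cpufreq sysfs files, sketched below. The helper name pin_cpufreq is invented for illustration, the values are in kHz, and the setting does not survive a reboot. On Armbian the limits are normally applied at boot from /etc/default/cpufrequtils (MIN_SPEED and MAX_SPEED), so that file is the place for a lasting change; treat that detail as an assumption to verify on your install.

```shell
#!/bin/sh
# Sketch: pin a cpufreq policy to one frequency by setting min == max.

# pin_cpufreq POLICY_DIR KHZ: set both scaling limits to KHZ.
# (Hypothetical helper name.) The write order depends on direction,
# because some kernels reject a minimum above the current maximum.
pin_cpufreq() {
    policy="$1"
    khz="$2"
    cur_max=$(cat "$policy/scaling_max_freq") || return 1
    if [ "$khz" -le "$cur_max" ]; then
        # Going down (or staying): lower the minimum first.
        echo "$khz" > "$policy/scaling_min_freq" &&
        echo "$khz" > "$policy/scaling_max_freq"
    else
        # Going up: raise the maximum first.
        echo "$khz" > "$policy/scaling_max_freq" &&
        echo "$khz" > "$policy/scaling_min_freq"
    fi
}

# As root, pin every policy to 800 MHz (use 1600000 for 1.6 GHz):
#   for p in /sys/devices/system/cpu/cpufreq/policy*; do
#       pin_cpufreq "$p" 800000
#   done
```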

Posted

What doesn't make sense is that Francis says downgrading still causes this, right?

Does the previous kernel have the DVFS patches, or are they only in this kernel? I recall the old kernel would always run at either 1.6 GHz or 800 MHz, right?

I don't think I've ever had a crash when setting the max frequency to 800 MHz, which I do to reduce heat and noise while I'm not actively using the NAS (just serving). But from the trace it does look like a bug in the DVFS code.

From the looks of it, the kernel is trying to signal the other CPUs (in this case just one other CPU) but gets into some weird loop triggered via interrupts? Looking at the patches, it seems non-trivial, since the other CPU could be offline, and many other weird things, including race conditions, could happen.

I hope that it can be resolved soon with the help of the backtrace!

Glad it is also possible to get a dump via serial!

Thanks to both of you for your help!

Posted
7 hours ago, FrancisTheodoreCatte said:

 

After upgrading to the 4.19.104-mvebu kernel I also started experiencing the same complete CPU lockups, but here's the kicker: the lockups are continuing, even after downgrading to kernel 4.14.171-mvebu.

 

An easy way for me to trigger it is by running echo check > /sys/block/md0/md/sync_action

 

I managed to catch the CPU stall via the serial console:
https://pastebin.com/C8yCQMAn

 

Here's the output from armbianmonitor -u:

http://ix.io/2h03

 

edit:

same lockup still happens when running an md data-check even after downgrading further, to 4.14.153-mvebu.

 

Thanks for providing the crash log.
Last weekend I ran some tests with the following images:
 

  • Buster NEXT upgraded to buster CURRENT
  • fresh Buster CURRENT

and loaded the system with

stress-ng -c 2 -P 70

It is supposed to keep the system busy and running at full speed (1.6 GHz). I ran each test for about 20 hours but did not encounter any issue.

Looking at your log, it seems to be marvell_xor that triggers the crash. I will try your suggestion to reproduce it.

Posted
On 4/1/2020 at 4:08 AM, gprovost said:

@taziden Can you run armbianmonitor -u and post the link here.

 

Yeah you could test that then, it would be helpful to narrow down the issue.

 

Here is the output of armbianmonitor: http://ix.io/2h22

Same issue occurred with my spare Helios4 from batch3, using an unused PSU and unused fans.

Posted

Is marvell_xor used for anything other than RAID? I only use RAID 0/1, which doesn't have any kind of parity operation.

Posted
35 minutes ago, DavidGF said:

Is marvell_xor used for anything other than RAID? I only use RAID 0/1, which doesn't have any kind of parity operation.

 

Yup, most probably not related to marvell_xor, just a coincidence, because we have 2 users (including you) who see the crash and don't have a RAID 5 or 6 setup.

 

8 hours ago, DavidGF said:

Does the previous kernel have the DVFS patches, or are they only in this kernel? I recall the old kernel would always run at either 1.6 GHz or 800 MHz, right?

 

There were some changes back in January related to DVFS. We need to take another look at them.

 

 

Posted
1 hour ago, taziden said:

Same issue occurred with my spare Helios4 from batch3, using an unused PSU and unused fans.

 

Any idea what kind of activity is happening on the system when it crashes?

Posted
2 hours ago, gprovost said:

 

Any idea what kind of activity is happening on the system when it crashes?

I've tried disabling everything. Even just having the system running, without decrypting and mounting the RAID, eventually ends with the Helios being unresponsive.

I'll do some more tests, like booting Arch, and will keep you posted.
