

Posted
7 hours ago, gprovost said:

 

Yup, most probably not related to marvell_xor, just a coincidence, because yes, we have 2 users (including you) who see the crash and don't have a RAID 5 or 6 setup.

 

 

There were some changes back in January related to DVFS. We need to take another look at them.

 

 

 

Sorry, I haven't had a chance to get back to this. My system is running a 3-disk btrfs RAID 5.

Posted

Ok, with @aprayoga and @count-doku we ran CPU and I/O load tests on 3 different boards in order to try to reproduce any of the issues mentioned in the past days, but so far no crash after more than 24 hours of stress tests and uptime.

 

For now we are unable to figure out the common denominator between the setups of the following people:

@taziden

@devman

@FrancisTheodoreCatte

@DavidGF

@pekkal

 

The PSU issue was set aside after a few of you measured the output voltage and it was correct.

 

So in order to figure out whether it's software or hardware related, could you guys run your system on a clean, up-to-date Armbian Buster image and follow the instructions below?

(Ideally you use a new SD card to do this test.)

1. Don't play with the CPU frequency; let DVFS do its job.

2. Don't connect any USB devices.

3. Don't install the whole bunch of software you usually use (no OMV or Plex), just enough to read / write files over the network (via ssh / http / rsync or samba).

4. Edit /boot/armbianEnv.txt and add: extraargs=ignore_loglevel (you need to reboot the system afterwards)

5. Connect to the serial console and log in, then run:

dmesg -n 7

dmesg -w

6. Try to generate load by copying files to / from your Helios4. You can also use dd to generate a dummy file on your HDDs (see the example below).
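
For example, a dummy write plus read-back load could look something like this (the /mnt/hdd path is only a placeholder, adapt it to wherever your disks are mounted):

dd if=/dev/zero of=/mnt/hdd/testfile bs=1M count=4096 conv=fsync

dd if=/mnt/hdd/testfile of=/dev/null bs=1M

rm /mnt/hdd/testfile

The first dd writes a 4 GiB dummy file and forces it to disk, the second reads it back, and the rm cleans up afterwards.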

 

Hopefully (or not) you manage to reproduce the issue.

 

 

@devman your boot env file is corrupted; maybe there are other things not correct in your system. So it's time to do a fresh install.

 

[screenshot of the corrupted /boot/armbianEnv.txt]

 

it should be :

root@helios4:~$ cat /boot/armbianEnv.txt 
verbosity=1
eth1addr=0A:54:C9:1E:91:F0
spi_workaround=off
overlay_prefix=armada-388-helios4
rootdev=UUID=50c31f1b-67ba-4709-b677-f821bd52086f
rootfstype=ext4
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u

 

Posted

@gprovost (how do I refer to a person/userid in this forum?)

re: "how to debug"

 

I've had no further crashes since the two incidents ≈10 days ago. With little to back this up, I suspect the issue is related to the Mac OS NFS client implementation talking to the Armbian server:

- the issues in my system, which had been stable for 10 months, started after I began working from home (due to the coronavirus restrictions): I had my laptop NFS-mounted to the Helios for weeks

- the two crashes took place when I was not working: the Mac had gone to sleep. They did not appear when the Helios was under heavy load.

- now I (usually) unmount the NFS disks after use: no issues in the past 10 days

 

With this: if the NFS hypothesis is true, I would not find anything with the suggested steps. However, doing it with NFS shares mounted and my Mac OS client going to sleep might (in)validate my suspicion? I believe Macs don't sleep all that well: they keep waking up and checking things.

 

pekkal

 

Posted

I can provide more info: a similar freeze had only happened to me once previously (it ran for a year without any issues at all!). That one was during the hot summer where I live, so I suspect temperature under load could have played a factor here. It's in a hot spot in the house.

After the upgrade, which I believe was on March 14th, it happened on April 2nd, twice in the same day, and also on March 21st.

 

So far I've experimented with running it at a fixed 800 MHz, occasionally bumping it manually to 1.6 GHz whenever I need to copy stuff in bulk, but only for a few minutes.

I limit it by running:

# echo 800000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
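
To double-check that the cap took effect, the standard cpufreq sysfs entries can be read back (the first should now report 800000 and the second should stay at or below it):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq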

 

I think it has been running OK so far at 800 MHz; I will wait a month or two to see whether it stays stable. If that's the case, then I'll switch the max frequency back to 1.6 GHz and see whether the DVFS code has an issue, by letting it pick the right frequency on its own.

Does that sound good? It will be a long experiment, but I can't do more testing with it, since I need it to serve a variety of services (it actually serves production traffic, believe it or not).

Posted

Hi,

I was unable to reproduce the issue with the vanilla Armbian, and since I needed my usual system running, I booted from my previous SD card. After ~29h, I encountered the issue again. And this time, I had a trace on the serial console, similar to the one previously reported by @FrancisTheodoreCatte.

One thing I'm noticing in the console is that the traces continue to appear, every minute or every 30s. Here is an extract: https://paste.systemli.org/?00127e5455380d10#9eAznGLeoqN5CKHhK9nQ5aqx3PZy3oarrG39PnoCajQ7

Posted

@tuxd3v yes, 5.4 LTS kernel for all mvebu boards (including helios4) will be released with Armbian 20.05, coming in May.

 

Keep an eye on this thread for more info: 

 

Once there are Release Candidates you can help us with testing them.

Posted

@taziden Thanks a lot for the trace, very useful information. We are trying to trace back what could be the root cause. We're not sure how to interpret it yet, but yes, it's the same crash as @FrancisTheodoreCatte's. At first glance, the CPU detects via NMI that one core is stalled. We can again see the mv_xor tasklet in the trace, which according to the driver code will request a spin lock and maybe, for some reason, doesn't release the lock, making one of the cores stall. But this is still an assumption. Both of you have RAID 5 and 6, which offloads XOR operations to the mv_xor engine, so that's at least one common denominator between your two setups.

Posted

@gprovost FYI, I've just experienced the same issue (but no traces this time) using the Armbian 5.4 dev kernel for mvebu. I'll try downgrading to an older kernel and see what happens.

Posted

@taziden Is this using your so-called "previous SD" on which you upgraded to LK 5.4? Could you please provide the armbian-monitor -u output.

 

I think both of you ( @FrancisTheodoreCatte ) should run your systems with your usual OS / apps setup, but you should blacklist the mv_xor driver and see if it still crashes without it. Since this is a built-in driver, we need a different trick to disable it than the usual module blacklisting approach.

 

Edit /boot/armbianEnv.txt and add the following line:

extraargs=initcall_blacklist=mv_xor_driver_init
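
Note (this is an assumption on my side about how armbianEnv.txt is parsed): if you already added extraargs=ignore_loglevel for the earlier test, don't add a second extraargs line; put both arguments on the same line instead, e.g.:

extraargs=ignore_loglevel initcall_blacklist=mv_xor_driver_init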

Reboot your system and check that the mv_xor driver is effectively not loaded by looking at the interrupt list: cat /proc/interrupts

You should no longer see f1060800.xor and f1060900.xor

 

You can also check ls -la /sys/devices/platform/soc/soc:internal-regs/f1060900.xor

You should no longer see the symlink driver -> ../../../../../bus/platform/drivers/mv_xor

 

Clearly this is going to increase the load on the CPU a bit, without impacting performance too much. It would be super helpful if you could run this for a couple of days and check whether you still encounter crashes. At least this way we can maybe narrow down the culprit.

 

Posted

@taziden Please next time don't update your message to give us new information, we won't get notified if you do so.

 

Can you try to catch again a trace with the serial console ? Thanks.

Posted
23 minutes ago, gprovost said:

@taziden Please next time don't update your message to give us new information, we won't get notified if you do so.

 

Can you try to catch again a trace with the serial console ? Thanks.

Yes, I've been trying to each time, with extraargs=ignore_loglevel and dmesg -n 7; dmesg -w, but I don't always get a trace unfortunately :(

Posted

Sorry @gprovost, didn't see your message until today. Anecdotally, as soon as I reenabled the automatic CPU governor, the Helios4 kernel panicked within 8 hours. Unfortunately I didn't have the serial console open to catch a trace.

 

Anyway, I rebooted my Helios4 with the Marvell XOR driver blacklisted. Nothing shows up under /proc/interrupts related to XOR now. Running a serial console again to hopefully get a trace if it crashes.

 

Unrelated to the crashes, but I think I ran into a bug with the Debian Buster armhf build of btrfs-progs 4.20.1. Trying to delete subvolumes on my 17T btrfs volume is impossible-- btrfs subvolume delete gives me "ERROR: Could not statfs: Value too large for defined data type". I found a post from someone using Raspbian buster over on the Raspberry Pi forums with the same issue with large btrfs volumes: https://www.raspberrypi.org/forums/viewtopic.php?t=249873

 

I haven't yet tried btrfs-progs 4.7.3 from Stretch, as they did, to see if the problem persists. If anyone else running a Helios4 with Debian Buster and a large btrfs volume could try creating and deleting a subvolume (see the example below), I'd appreciate it. I'm assuming this would have to be filed as an Armbian bug report, or possibly upstream?
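
Something along these lines should be enough to test (the mount point is only an example, use wherever your btrfs volume is mounted):

btrfs subvolume create /mnt/yourvolume/testsub

btrfs subvolume delete /mnt/yourvolume/testsub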

Posted

I'm trying to run the OMV installer script on top of the Helios4 Armbian Buster, but it borks at lsb_release: command not found. Same thing if I try to run that command manually, even though lsb-release is installed. Any help?

Posted
3 minutes ago, Koen said:

I had been building on it for a while, it's gotten apt-get update / upgrade / dist-upgrade.


Aha, some older builds ... not everything is fixed with apt update. Try fixing this problem with:

apt remove linux-buster-root-current-helios4 lsb-release
apt install linux-buster-root-current-helios4

 

Posted

Hello everyone,

 

I am still confused and a bit annoyed by the kit fans. I got the type "A" fans, which are not able to stop completely.
My Helios is inactive most of the time, like most of them I guess ;)

I want to replace them with better fans, but this seems not as easy as I expected.

 

First, it's hard to find 70mm PWM fans.

Second, it's almost impossible to get info about the few fans on the market.

How do I know if a PWM fan can actually stop?

 

Can anyone recommend a fan?

I found this one on eBay (click here), would these help?

I do not mind the noise when the Helios is active, but I believe it should be silent when it's inactive and the hard drives are spun down.

 

Thanks in advance

Jeckyll

Posted
On 4/13/2020 at 7:48 AM, Heisath said:

@tuxd3v yes, 5.4 LTS kernel for all mvebu boards (including helios4) will be released with Armbian 20.05, coming in May.

 

Once there are Release Candidates you can help us with testing them.

Hello Heisath,

Thanks a lot!

 

I will  be a beta tester :)

 

Regards,

Posted
On 5/3/2020 at 1:23 AM, Jeckyll said:

First, it's hard to find 70mm PWM fans.

Second, it's almost impossible to get info about the few fans on the market.

How do I know if a PWM fan can actually stop?

 

If they don't state the fan type, you can still figure out whether the fan can stop, provided the PWM speed curve is available.

 

[chart: fan PWM speed curves]

 

On our blog we mentioned the reason why we put back the Type-A fan for Batch 3: https://blog.kobol.io/2019/03/18/wol-wiki/

 

Quote

When the system is put in suspend mode, the PWM feature controlling the fan speed is stopped. The fans will either spin at their lowest speed (Batch 1 fan) or stop spinning (Batch 2 fan). In the latter case, while this is not an issue for the SoC itself, which is designed to run with passive cooling, it might have a negative impact on the HDDs because the ambient temperature inside the case will build up. Therefore it is advised to ensure that when the system is suspended, the case ambient temperature will not exceed the operating temperature your HDDs are rated for.

 

So I would still recommend keeping a bit of active cooling when the system is idle.
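
If you want the fans to keep spinning slowly at idle instead of stopping completely, one possible approach (just a sketch, assuming the stock Armbian fancontrol setup on the Helios4; check the FCFANS line in your own /etc/fancontrol for the actual hwmon paths, they may differ) is to raise the MINPWM value in /etc/fancontrol, for example:

MINPWM=hwmon3/pwm1=45

then restart the service: systemctl restart fancontrol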

 

 

Posted
On 4/24/2020 at 5:28 AM, FrancisTheodoreCatte said:

Unrelated to the crashes, but I think I ran into a bug with the Debian Buster armhf build of btrfs-progs 4.20.1. Trying to delete subvolumes on my 17T btrfs volume is impossible-- btrfs subvolume delete gives me "ERROR: Could not statfs: Value too large for defined data type".

 

Isn't it related to the fact that the Helios4 is a 32-bit system, and therefore the max volume size that can be supported is 16TB?

Posted

Has anyone gotten Nextcloud with Docker working?

 

I downloaded the official Docker container from Docker Hub, but I can't seem to get it to connect to my local MariaDB instance. I keep getting Connection Refused.

Posted
On 4/9/2020 at 4:14 PM, gprovost said:

@devman your boot env file is corrupted; maybe there are other things not correct in your system. So it's time to do a fresh install.

 

Thanks, I made a fresh SD card and no problems for 2 weeks now.

 

Is Stretch / OMV4 still the recommended software, or should I be using Buster / OMV5?

Posted
2 hours ago, devman said:

Is Stretch / OMV4 still the recommended software, or should I be using Buster / OMV5?

 

OMV5 is officially stable, therefore you should go for Armbian Buster.

 

FYI, Armbian will soon release version 20.05 ;-)

Posted
5 hours ago, gprovost said:

 

OMV5 is officially stable, therefore you should go for Armbian Buster.

 

FYI, Armbian will soon release version 20.05 ;-)

Thanks, I'll just wait then

This topic is now closed to further replies.