devman Posted April 7, 2020 7 hours ago, gprovost said: Yup, most probably not related to marvell_xor, just a coincidence. Because yes, we have 2 users (including you) who see the crash and don't have a RAID 5 or 6 setup. There were some changes back in Jan related to DVFS. We need to re-look at it. Sorry, I haven't had a chance to get back to this. My system is running a 3-disk btrfs RAID 5.
FrancisTheodoreCatte Posted April 9, 2020 Update: two straight days with no lockups after setting the CPU clock to 1.6GHz.
gprovost Posted April 9, 2020 Author Ok, with @aprayoga and @count-doku we ran CPU and I/O load tests on 3 different setups in order to try to reproduce any of the issues mentioned in the past days, but so far no crash after more than 24 hours of stress tests and uptime. For now we are unable to figure out the common denominator between the setups of the following people: @taziden @devman @FrancisTheodoreCatte @DavidGF @pekkal A PSU issue was ruled out after a few of you measured the output voltage and it was correct. So in order to figure out whether it's software or hardware related, could you guys run your system on a clean, latest Armbian Buster image and follow these instructions? (Ideally use a new SD card for this test.)
1. Don't play with the CPU frequency, let DVFS do its job.
2. Don't connect any USB device.
3. Don't install the whole bunch of software you usually use (no OMV or Plex), just enough to read / write files over the network (via ssh / http / rsync or samba).
4. Edit /boot/armbianEnv.txt and add: extraargs=ignore_loglevel (needs a reboot to take effect)
5. Connect to the serial console and log in, then run: dmesg -n 7 dmesg -w
6. Try to generate load by copying files to / from your Helios4. You can also use dd to generate dummy files on your HDD.
Hopefully (or not) you manage to reproduce the issue. @devman your boot env file is corrupted; maybe other things are not correct in your system either, so time for a fresh install. It should be:
root@helios4:~$ cat /boot/armbianEnv.txt
verbosity=1
eth1addr=0A:54:C9:1E:91:F0
spi_workaround=off
overlay_prefix=armada-388-helios4
rootdev=UUID=50c31f1b-67ba-4709-b677-f821bd52086f
rootfstype=ext4
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u
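For step 6 of the post above, the dummy-file load can be scripted with dd; a minimal sketch (the path, filename and size below are examples only; point of= at a directory on your HDD array):

```shell
# Write a 64 MiB dummy file to generate I/O load (size and path are examples).
# conv=fsync makes dd flush the data to disk before it reports completion.
dd if=/dev/zero of=/tmp/loadtest.bin bs=1M count=64 conv=fsync
ls -lh /tmp/loadtest.bin
```

Running several of these in parallel against different disks, while watching the serial console, approximates the copy-over-network load described above.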
pekkal Posted April 9, 2020 @gprovost (how do I refer to a person/userid in this forum?) re: "how to debug" I've had no further crashes after the two incidents ≈10 days ago. With little to back this up, I suspect the issue is related to the Mac OS NFS client implementation talking to the Armbian server:
- the issues in my system, which had been stable for 10 months, started after I started working at home (due to the corona virus restrictions): I had my laptop NFS-mounted to the Helios for weeks
- the two crashes took place when I was not working: the Mac had gone to sleep. They did not appear when the Helios was under heavy load.
- Now I (usually) unmount the NFS disks after use: no issues for the past 10 days
With this: if the NFS hypothesis is true, I would not find anything with the suggested steps. However, doing it with NFS shares mounted on my Mac OS client while it goes to sleep might (in)validate my suspicion? I believe Macs don't sleep all that well: they keep waking up and checking things. pekkal
hatschi1000 Posted April 9, 2020 (edited) Never mind Edited April 12, 2020 by hatschi1000 deleted
DavidGF Posted April 12, 2020 I can provide more info: a freeze like this had only happened to me once previously (it ran for a year without any issues at all!). That one was during the hot summer where I live, so I suspect temperature under load could have played a factor here. It's in a hot spot in the house. After the upgrade, which I believe was March 14, it happened on April 2nd, twice in the day. Also on March 21st. So far I've been experimenting with running it at a fixed 800MHz, perhaps manually bumping it to 1.6G whenever I need to copy stuff in bulk, but only for a few minutes. I limit it by running: # echo 800000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq I think it has been running OK so far at 800MHz; I will wait a month or two to see whether it stays stable. If that's the case, then I'll switch to a max freq of 1.6G and see whether the DVFS code has an issue, by letting it pick the right freq on its own. Does that sound good? It will be a long experiment, but I can't do more testing with it, since I need it to serve a variety of services (it actually serves production traffic, believe it or not)
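The same cap can be applied across every cpufreq policy in one go. A hedged sketch (the value is in kHz, as in the command above; depending on the kernel, both Helios4 cores may share a single policy, in which case the loop just writes the same setting twice):

```shell
# Cap the maximum CPU frequency at 800 MHz (value is in kHz) on every
# cpufreq policy. Guarded so it is a no-op where the node isn't writable.
applied=0
for f in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_max_freq; do
    if [ -w "$f" ]; then
        echo 800000 > "$f" && applied=$((applied + 1))
    fi
done
echo "capped $applied cpufreq policies"
```

Note this is not persistent: it resets on reboot, so it would need to go in a startup script to keep the experiment running across restarts.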
taziden Posted April 12, 2020 Hi, I was unable to reproduce the issue with the vanilla Armbian, and I needed to have my usual system running, so I booted from my previous SD. After ~29h, I encountered the issue again. And this time, I had a trace on the serial console, similar to the one previously reported by @FrancisTheodoreCatte. One thing I'm noticing on the console is that the traces continue to appear, every minute or every 30s. Here is an extract: https://paste.systemli.org/?00127e5455380d10#9eAznGLeoqN5CKHhK9nQ5aqx3PZy3oarrG39PnoCajQ7
tuxd3v Posted April 12, 2020 Will there be a 5.4 LTS kernel for the Helios4? Thanks in Advance, Best Regards, tux
Heisath Posted April 13, 2020 @tuxd3v yes, the 5.4 LTS kernel for all mvebu boards (including helios4) will be released with Armbian 20.05, coming in May. Keep an eye on this thread for more info: Once there are Release Candidates you can help us with testing them.
gprovost Posted April 14, 2020 Author @taziden Thanks a lot for the trace, very useful information. We are trying to trace back what could be the root cause. Not sure yet how to interpret it, but yes, it's the same crash as @FrancisTheodoreCatte's. At first glance, the CPU detects via NMI that one core is stalled. We can see again the mv_xor tasklet in the trace, which according to the driver code will request a spin lock and maybe, for some reason, doesn't release the lock, making one of the cores stall. But this is still an assumption. Both of you have RAID 5 and 6, which will offload XOR operations onto the mv_xor engine, so that's at least one common denominator between your two setups.
taziden Posted April 14, 2020 @gprovost I've just experienced the same issue (but no traces this time) using the Armbian 5.4 dev kernel for mvebu, fyi. I'll try downgrading to an older kernel and see what happens.
gprovost Posted April 15, 2020 Author @taziden This is using your so-called "previous SD" on which you upgraded to LK 5.4? Could you please provide an armbianmonitor -u output. I think both of you ( @FrancisTheodoreCatte ) should run your system with your usual OS / apps setup, but blacklist the mv_xor driver and see if it still crashes without it. Since this is a built-in driver, we need to use a different trick to disable it than the usual module blacklisting approach. Edit /boot/armbianEnv.txt and add the following line: extraargs=initcall_blacklist=mv_xor_driver_init Reboot your system and check that the mv_xor driver is effectively not loaded by looking at the interrupt list: cat /proc/interrupts You should no longer see f1060800.xor and f1060900.xor. You can also check: ls -la /sys/devices/platform/soc/soc:internal-regs/f1060900.xor You should no longer see the symlink driver -> ../../../../../bus/platform/drivers/mv_xor Clearly this is going to increase the CPU load a bit, without impacting performance too much. It would be super helpful if you could run this for a couple of days and check whether you guys still encounter the crash. At least this way we can maybe narrow down the culprit.
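The interrupt check from the post above can be scripted; a small sketch (both engine names given above, f1060800.xor and f1060900.xor, contain ".xor", which is what the grep looks for):

```shell
# Report whether any XOR engine interrupts are still registered.
# With mv_xor blacklisted on a Helios4, the grep should find nothing.
if grep -q '\.xor' /proc/interrupts 2>/dev/null; then
    status="mv_xor still appears in the interrupt list"
else
    status="no mv_xor interrupts found"
fi
echo "$status"
```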
taziden Posted April 15, 2020 13 hours ago, gprovost said: @taziden This is using your so-called "previous SD" on which you upgraded to LK 5.4? Could you please provide an armbianmonitor -u output. Yes, here it is: http://ix.io/2i8W Done for the module blacklisting, wait & see :-) Edit: bad news, the issue reoccurred armbianmonitor : http://ix.io/2ic4
gprovost Posted April 16, 2020 Author @taziden Please don't update your message next time to give us new information, we won't get notified if you do so. Can you try to catch a trace again with the serial console? Thanks.
taziden Posted April 16, 2020 23 minutes ago, gprovost said: @taziden Please don't update your message next time to give us new information, we won't get notified if you do so. Can you try to catch a trace again with the serial console? Thanks. Yes, I've been trying to each time. With extraargs=ignore_loglevel, dmesg -n 7; dmesg -w but I don't always get a trace, unfortunately
FrancisTheodoreCatte Posted April 23, 2020 Sorry @gprovost, didn't see your message until today. Anecdotally, as soon as I re-enabled the automatic CPU governor, the Helios4 kernel panicked within 8 hours. Unfortunately I didn't have the serial console open to catch a trace. Anyway, I rebooted my Helios4 with the Marvell XOR driver blacklisted. Nothing shows up under /proc/interrupts related to XOR now. Running a serial console again to hopefully get a trace if it crashes. Unrelated to the crashes, but I think I ran into a bug with the Debian Buster armhf build of btrfs-progs 4.20.1. Trying to delete subvolumes on my 17T btrfs volume is impossible: btrfs subvolume delete gives me "ERROR: Could not statfs: Value too large for defined data type". I found a post from someone using Raspbian Buster over on the Raspberry Pi forums with the same issue on large btrfs volumes: https://www.raspberrypi.org/forums/viewtopic.php?t=249873 I haven't yet tried btrfs-progs 4.7.3 from Stretch, as they did, to see if the problem persists. If anyone else with a Helios4 running Debian Buster and a large btrfs volume could try creating and deleting a subvolume, I'd appreciate it. I'm assuming this would have to be filed as an Armbian bug report, or possibly upstream?
Koen Posted April 25, 2020 I'm trying to run the OMV installer script on top of the Helios4 Armbian Buster, but it borks at lsb_release: command not found, same if I try to run that command manually, though lsb-release is installed. Any help?
Koen Posted April 25, 2020 13 minutes ago, Igor said: Where did you download the image? URL to the image. It's the Armbian_19.11.3_Helios4_buster_current_4.19.84.7z directly from Armbian. So it must be https://dl.armbian.com/helios4/archive/Armbian_19.11.3_Helios4_buster_current_4.19.84.7z I had been building on it for a while; it's gotten apt-get update / upgrade / dist-upgrade.
Igor Posted April 25, 2020 3 minutes ago, Koen said: I had been building on it for a while, it's gotten apt-get update / upgrade / dist-upgrade. Aha, some older builds ... not everything is fixed with apt update. Try fixing this problem with: apt remove linux-buster-root-current-helios4 lsb-release apt install linux-buster-root-current-helios4
Jeckyll Posted May 2, 2020 Hello everyone, I am still confused and a bit annoyed by the kit fans. I got the type "A" fans, which are not able to completely stop. My Helios is inactive most of the time, like most, I guess. I want to replace them with better fans, but this seems not as easy as I expected. First, it's hard to find 70mm PWM fans. Second, it's almost impossible to get info about the few fans on the market. How do I know if a PWM fan can actually stop? Can anyone recommend a fan? I found this one on eBay (click here), would these help? I don't mind the noise when the Helios is active, but I believe it should be silent if it's inactive and the hard drives are spun down. Thanks in advance Jeckyll
tuxd3v Posted May 3, 2020 On 4/13/2020 at 7:48 AM, Heisath said: @tuxd3v yes, 5.4 LTS kernel for all mvebu boards (including helios4) will be released with Armbian 20.05, coming in May. Once there are Release Candidates you can help us with testing them. Hello Heisath, Thanks a lot! I will be a beta tester Regards,
gprovost Posted May 8, 2020 Author On 5/3/2020 at 1:23 AM, Jeckyll said: First, it's hard to find 70mm PWM fans. Second, it's almost impossible to get info about the few fans on the market. How do I know if a PWM fan can actually stop? If they don't state the fan type, you can still figure out whether the fan can stop from the PWM speed curve, if it's available. On our blog we mentioned the reason why we put back the Type-A fan for Batch 3. https://blog.kobol.io/2019/03/18/wol-wiki/ Quote When the system is put in suspend mode, the PWM feature controlling the fan speed is stopped. The fans will either spin at their lowest speed (Batch 1 fan) or stop spinning (Batch 2 fan). In the latter case, while this is not an issue for the SoC itself, which is designed to run with passive cooling, it might have a negative impact on the HDD peripherals because the ambient temperature inside the case will build up. Therefore it is advised to ensure that when the system is suspended, the case ambient temperature will not exceed the operating temperature your HDDs are rated for. So I would still recommend having a bit of active cooling when the system is idle.
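As a rough check on what the board is actually driving the fans with, the current PWM duty cycle can be read from sysfs; a sketch assuming the standard Linux hwmon layout (values range 0-255, where 0 means the fan header is driven fully off, which only a fan that can stop will honour):

```shell
# Print each hwmon PWM node and its current duty cycle (0 = off, 255 = full).
# Guarded so it only reports nodes that actually exist and are readable.
found=0
for pwm in /sys/class/hwmon/hwmon*/pwm[0-9]; do
    if [ -r "$pwm" ]; then
        printf '%s = %s\n' "$pwm" "$(cat "$pwm")"
        found=$((found + 1))
    fi
done
echo "scanned $found pwm nodes"
```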
gprovost Posted May 8, 2020 Author On 4/24/2020 at 5:28 AM, FrancisTheodoreCatte said: Unrelated to the crashes, but I think I ran into a bug with the Debian Buster armhf build of btrfs-progs 4.20.1. Trying to delete subvolumes on my 17T btrfs volume is impossible: btrfs subvolume delete gives me "ERROR: Could not statfs: Value too large for defined data type". Isn't it related to the fact that the Helios4 is a 32-bit system, and therefore the max volume size that can be supported is 16TB?
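The arithmetic behind that 16TB figure can be checked quickly. It is the usual 32-bit page cache limit: a 32-bit page index times the 4 KiB page size, giving 16 TiB, which a 17T volume would exceed. Whether this limit, rather than the 32-bit statfs structure itself, is the exact cause of the error above is still an assumption:

```shell
# 32-bit page index * 4 KiB page size = largest addressable offset on armhf.
limit=$(( (1 << 32) * 4096 ))
tib=$(( limit / (1024 * 1024 * 1024 * 1024) ))
echo "$limit bytes = $tib TiB"
```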
Mangix Posted May 8, 2020 Has anyone gotten Nextcloud with Docker working? I downloaded the official Docker container from Docker Hub, but I can't seem to get it to connect to my local MariaDB instance. I keep getting Connection Refused.
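Nothing here can be confirmed from the post alone, but a frequent cause of "Connection refused" with the official image is MYSQL_HOST pointing at localhost: inside the Nextcloud container, localhost is the container itself, not the host, so a MariaDB running elsewhere is unreachable at that address. A hypothetical compose sketch for illustration (service names and passwords are placeholders, and you would need armhf-capable builds of both images on a Helios4); the key detail is MYSQL_HOST naming the db service rather than 127.0.0.1:

```yaml
version: "3"
services:
  db:
    image: mariadb            # placeholder; pick an armhf-capable build
    environment:
      MYSQL_ROOT_PASSWORD: changeme
      MYSQL_DATABASE: nextcloud
      MYSQL_USER: nextcloud
      MYSQL_PASSWORD: changeme
  app:
    image: nextcloud
    ports:
      - "8080:80"
    environment:
      MYSQL_HOST: db          # the service name, NOT localhost/127.0.0.1
      MYSQL_DATABASE: nextcloud
      MYSQL_USER: nextcloud
      MYSQL_PASSWORD: changeme
```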
smith69085 Posted May 13, 2020 In case anyone is interested, a Helios 4 is for sale in the UK on eBay - http://rover.ebay.com/rover/1/710-53481-19255-0/1?icep_ff3=2&pub=5575378759&campid=5338273189&customid=&icep_item=264729497930&ipn=psmain&icep_vectorid=229508&kwid=902099&mtid=824&kw=lg&toolid=11111
Jeckyll Posted May 13, 2020 Thanks @gprovost for your reply. Ok, I will stick to fans that spin all the time. Maybe I'll find some less noisy ones
devman Posted May 14, 2020 On 4/9/2020 at 4:14 PM, gprovost said: @devman your boot env file is corrupted; maybe other things are not correct in your system either, so time for a fresh install. Thanks, I made a fresh SD card and there have been no problems for 2 weeks now. Is Stretch / OMV4 still the recommended software, or should I be using Buster / OMV5?
gprovost Posted May 14, 2020 Author 2 hours ago, devman said: Is Stretch / OMV4 still the recommended software, or should I be using Buster / OMV5? OMV5 is officially stable, therefore you should go for Armbian Buster. FYI, Armbian will soon release version 20.05 ;-)
devman Posted May 14, 2020 5 hours ago, gprovost said: OMV5 is officially stable, therefore you should go for Armbian Buster. FYI, Armbian will soon release version 20.05 ;-) Thanks, I'll just wait then