antsu

  Posts: 27


antsu's Achievements

  1. This time it survived a little longer, but sure enough, today I woke up to my H64 unreachable on the 2.5G port again:
     [255036.090052] xhci-hcd xhci-hcd.0.auto: xHCI host not responding to stop endpoint command.
     [255036.090061] xhci-hcd xhci-hcd.0.auto: USBSTS:
     [255036.103643] xhci-hcd xhci-hcd.0.auto: xHCI host controller not responding, assume dead
     [255036.103681] xhci-hcd xhci-hcd.0.auto: HC died; cleaning up
     [255036.103734] r8152 4-1.4:1.0 eth1: Stop submitting intr, status -108
     [255036.103774] r8152 4-1.4:1.0 eth1: get_registers -110
     [255036.103822] r8152 4-1.4:1.0 eth1: Tx status -108
     [255036.103831] r8152 4-1.4:1.0 eth1: Tx status -108
     [255036.103839] r8152 4-1.4:1.0 eth1: Tx status -108
     [255036.103850] r8152 4-1.4:1.0 eth1: Tx status -108
     [255036.103868] usb 3-1: USB disconnect, device number 2
     [255036.109516] usb 4-1: USB disconnect, device number 2
     [255036.109526] usb 4-1.1: USB disconnect, device number 3
     [255036.131518] usb 4-1.4: USB disconnect, device number 4
     [255429.047792] zio pool=backups vdev=/dev/disk/by-id/usb-WD_Elements_XXXX_XXXXXXXXXXXXXXX-0:0-part1 error=5 type=1 offset=3186896842752 size=4096 flags=180880
     [255429.047897] zio pool=backups vdev=/dev/disk/by-id/usb-WD_Elements_XXXX_XXXXXXXXXXXXXXX-0:0-part1 error=5 type=1 offset=270336 size=8192 flags=b08c1
     [255429.047937] zio pool=backups vdev=/dev/disk/by-id/usb-WD_Elements_XXXX_XXXXXXXXXXXXXXX-0:0-part1 error=5 type=1 offset=14000475086848 size=8192 flags=b08c1
     [255429.047966] zio pool=backups vdev=/dev/disk/by-id/usb-WD_Elements_XXXX_XXXXXXXXXXXXXXX-0:0-part1 error=5 type=1 offset=14000475348992 size=8192 flags=b08c1
     [255429.048323] WARNING: Pool 'backups' has encountered an uncorrectable I/O failure and has been suspended.
     [255433.136397] WARNING: Pool 'backups' has encountered an uncorrectable I/O failure and has been suspended.
I'm going back to using just the 1G port for now, since my H64 has been much more stable after the IO scheduler changes suggested by @ShadowDance in another thread, but I'm happy to do more tests if anyone wants to try to figure out what's happening.
  2. I was trying to avoid opening a new instability topic, but this one seems different enough from the others to deserve its own discussion. A few days ago I applied the 1G fix (pic attached) to the 2.5G port, which was not in use until then, since all my hardware is 1G max. I did this with the intention of using both ports: the 1G port goes to an isolated VLAN that serves VM images to my Proxmox hosts, and the 2.5G interface (running at 1G speed) serves files to the rest of my network. While the fix appears to work and the port is able to communicate at 1G speeds seemingly fine, after changing to this setup I'm experiencing frequent drops in the USB bus, which cause both the 2.5G port and my USB HDD (which has its own PSU) to disconnect, requiring a reboot to regain connectivity. When this happens, I get these messages in dmesg:
     [48657.415910] xhci-hcd xhci-hcd.0.auto: xHCI host not responding to stop endpoint command.
     [48657.415930] xhci-hcd xhci-hcd.0.auto: USBSTS:
     [48657.429522] xhci-hcd xhci-hcd.0.auto: xHCI host controller not responding, assume dead
     [48657.429604] xhci-hcd xhci-hcd.0.auto: HC died; cleaning up
     [48657.429952] r8152 4-1.4:1.0 eth1: Stop submitting intr, status -108
     [48657.430068] r8152 4-1.4:1.0 eth1: get_registers -110
     [48657.430193] r8152 4-1.4:1.0 eth1: Tx status -108
     [48657.430223] r8152 4-1.4:1.0 eth1: Tx status -108
     [48657.430241] r8152 4-1.4:1.0 eth1: Tx status -108
     [48657.430262] r8152 4-1.4:1.0 eth1: Tx status -108
     [48657.430327] usb 3-1: USB disconnect, device number 2
     [48657.431551] usb 4-1: USB disconnect, device number 2
     [48657.431572] usb 4-1.1: USB disconnect, device number 3
     [48657.467516] usb 4-1.4: USB disconnect, device number 4
     The problem seems to manifest more quickly if I push more data through that interface. I'm running a clean (new) install of Armbian 21.02.3, with OMV and ZFS. Full dmesg and boot log below.
  3. @ShadowDance It's a regular swap partition on sda1, not on a zvol. But thanks for the quick reply.
  4. New crash, but with different behaviour and conditions this time. It had a kernel oops at around 5 AM this morning, resulting in a graceful-ish reboot. From my limited understanding, the error seems to be related to swap (the line "Comm: swapper"), so I have disabled the swap partition for now and will continue to monitor. The major difference this time is that nothing was happening at that hour; the NAS was just sitting idle.
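For reference, disabling the swap partition as described amounts to something like the sketch below. The sda1 device name comes from the reply about the swap partition; the fstab edit is shown on a copy so it can be reviewed before being put in place.

```shell
# Disable the swap partition immediately (needs root; sda1 per the posts above)
SWAPDEV=/dev/sda1
swapoff "$SWAPDEV" 2>/dev/null || true
# Keep it off after reboot: comment out its fstab entry (shown on a copy)
if [ -r /etc/fstab ]; then
    sed "s|^$SWAPDEV|#$SWAPDEV|" /etc/fstab > /tmp/fstab.noswap
    # review /tmp/fstab.noswap, then move it over /etc/fstab
fi
```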
  5. @ShadowDance I think you're definitely on to something here! I just ran the Rsync jobs after setting the scheduler to none for all disks and it completed successfully without crashing. I'll keep an eye on it and report any other crashes, but for now thank you very much! Update: It's now a little over 3 hours into a ZFS scrub, and I restarted all my VMs simultaneously *while* doing the scrub and it has not rebooted nor complained about anything on dmesg. This is very promising! Update 2: Scrub finished without problems!
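The scheduler change credited above boils down to a couple of sysfs writes. This is a sketch only: the sd* device glob is an assumption about the disk names, and it needs root.

```shell
# Set the IO scheduler to "none" for every SATA disk (sketch; needs root)
for sched in /sys/block/sd*/queue/scheduler; do
    [ -w "$sched" ] && echo none > "$sched"
done
# Verify: the active scheduler is shown in [brackets]
cat /sys/block/sd*/queue/scheduler 2>/dev/null || true
```

This does not persist across reboots on its own; a udev rule is the usual way to make it stick.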
  6. @aprayoga As I mentioned, the crash happens when some Rsync jobs are run. I'll try to describe my setup as best as possible. My Helios64 is running LK 5.10.21 and OMV 5.6.2-1, and has a RAID-Z2 array plus a simple EXT4 partition. The drives are:
     - Bay 1: 500GB Crucial MX500.
       - Partition 1: 16GB swap partition, created while troubleshooting the crashes (to make sure they were not due to lack of memory). The crashes happen both with and without it.
       - Partition 2: 400GB EXT4 partition, used to serve VM/container images via NFS.
       - Partition 3: 50GB write cache for the ZFS pool.
     - Bays 2 through 5: 4x 4TB Seagate IronWolf (ST4000VN008-2DR1) in RAID-Z2, ashift=12, ZSTD compression (although most of the data is compressed with LZ4).
     My remote NAS is a Helios4, also running OMV, with the Rsync plugin in server mode. On the Helios64 I have 4 Rsync jobs for 4 different shared folders, 3 sending and 1 receiving. They are all scheduled to run on Sundays at 3 AM. The reason for starting all 4 jobs at the same time is to maximise bandwidth usage, since the backup happens over the internet (WireGuard running on the router, not on the H64), and the latency from where I am to where the Helios4 is (Brazil) is quite high. The transfers are incremental, so not a lot of data gets sent with each run, but I guess Rsync still needs to read all the files on the source to see what has changed, causing the high IO load. The data on these synced folders is very diverse in size and number of files; in total it's about 1.7TB. Another way I can reproduce the crash is by running a ZFS scrub. And as I mentioned in the previous post, on one occasion the NAS crashed when generating simultaneous IO from multiple VMs, which sit on an EXT4 partition on the SSD, so it doesn't seem limited to the ZFS pool.
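Outside the OMV plugin, the four parallel jobs described above would look roughly like this sketch. The share names, paths, and remote host are placeholders, not the real configuration.

```shell
# backup-sync.sh (sketch): run all four incremental syncs in parallel to
# saturate the WireGuard link; invoked from cron, e.g.:
#   0 3 * * 0  /usr/local/bin/backup-sync.sh
sync_shares() {
    for share in share1 share2 share3; do       # the three "send" jobs (names made up)
        rsync -a --delete "/srv/$share/" "helios4:/backups/$share/" &
    done
    rsync -a "helios4:/backups/share4/" "/srv/share4/" &   # the one "receive" job
    wait    # block until all four transfers finish
}
```

Starting all transfers with `&` and then `wait`-ing is what produces the simultaneous read load described in the post.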
  7. Sorry if I'm hijacking this thread, but the title fits the problem, and I didn't want to create yet another thread about stability issues. I'm also having similar problems with my Helios64. It will consistently crash and reboot when I/O is stressed. The two most common scenarios where this happens are when running an Rsync job that I use to sync my H64 with my remote backup NAS (a Helios4 that still works like a champ), and when performing a ZFS scrub. On one isolated occasion, I was also able to trigger a crash by executing an Ansible playbook that made several VMs generate I/O activity simultaneously (the VMs' disks are served by the H64). Since the Rsync crash is consistent and predictable, I managed to capture the crash output from the serial console on two occasions and under different settings, listed below:
     - Kernel 5.10.16, performance governor, frequency locked to 1.2GHz
     - Kernel 5.10.21, ondemand governor, dynamic stock frequencies (settings recommended in this and other stability threads)
     Happy to help with more information or tests.
  8. So, I went hunting for the PR where this got merged, found the reference to the kernel module "ledtrig-netdev", noticed the module wasn't loaded on my Helios64 (no idea why), loaded it via modprobe, restarted the service unit I mentioned in the post above, and now it works. I've added ledtrig-netdev to /etc/modules-load.d/modules.conf, so now it should hopefully work automagically on every boot.
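As a sketch, the steps above boil down to the following. The module name, LED path, and modules.conf path come straight from the posts; associating the trigger with eth0 (and the link/tx/rx attributes) is an assumption about what the systemd unit configures.

```shell
# Load the netdev LED trigger so "netdev" shows up in .../trigger
modprobe ledtrig-netdev 2>/dev/null || true
# Persist it across boots
MODCONF=/etc/modules-load.d/modules.conf
grep -qx ledtrig-netdev "$MODCONF" 2>/dev/null \
    || echo ledtrig-netdev >> "$MODCONF" 2>/dev/null || true
# Re-point the LED at eth0 activity (normally done by helios64-heartbeat-led.service)
LED=/sys/class/leds/helios64:blue:net
if [ -w "$LED/trigger" ]; then
    echo netdev > "$LED/trigger"
    echo eth0 > "$LED/device_name"     # blink on this interface's activity
    echo 1 > "$LED/link"; echo 1 > "$LED/tx"; echo 1 > "$LED/rx"
fi
```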
  9. Hi all. Some time ago, in the release notes of a past Armbian version, I remember reading that the network LED should now be enabled by default and associated with eth0. Despite always keeping my Helios64 as up to date as possible, mine never lit up, but I didn't think much of it. Today I decided to investigate why, and I'm not sure where to go from here. First things first: the LED works, so it's not a hardware issue. I can echo 1 and 0 to the LED path (/sys/class/leds/helios64:blue:net/brightness) and it lights on and off as expected. From what I can tell, this LED, along with the heartbeat one, is enabled by the systemd unit helios64-heartbeat-led.service. When I check the status of the unit, I get the following output:
     ● helios64-heartbeat-led.service - Enable heartbeat & network activity led on Helios64
        Loaded: loaded (/etc/systemd/system/helios64-heartbeat-led.service; enabled; vendor preset: enabled)
        Active: failed (Result: exit-code) since Sat 2021-03-13 00:30:52 GMT; 8min ago
       Process: 3740 ExecStart=/usr/bin/bash -c echo heartbeat | tee /sys/class/leds/helios64\:\:status/trigger (code=exited, status=0/SUCCESS)
       Process: 3744 ExecStart=/usr/bin/bash -c echo netdev | tee /sys/class/leds/helios64\:blue\:net/trigger (code=exited, status=1/FAILURE)
      Main PID: 3744 (code=exited, status=1/FAILURE)
     Mar 13 00:30:52 nas.lan systemd[1]: Starting Enable heartbeat & network activity led on Helios64...
     Mar 13 00:30:52 nas.lan bash[3740]: heartbeat
     Mar 13 00:30:52 nas.lan bash[3744]: netdev
     Mar 13 00:30:52 nas.lan bash[3744]: tee: '/sys/class/leds/helios64:blue:net/trigger': Invalid argument
     Mar 13 00:30:52 nas.lan systemd[1]: helios64-heartbeat-led.service: Main process exited, code=exited, status=1/FAILURE
     Mar 13 00:30:52 nas.lan systemd[1]: helios64-heartbeat-led.service: Failed with result 'exit-code'.
     Mar 13 00:30:52 nas.lan systemd[1]: Failed to start Enable heartbeat & network activity led on Helios64.
Looking into the unit's source, I can see the line where this fails:
     ExecStart=bash -c 'echo netdev | tee /sys/class/leds/helios64\\:blue\\:net/trigger'
And sure enough, there is no "netdev" inside /sys/class/leds/helios64:blue:net/trigger:
     $ cat /sys/class/leds/helios64:blue:net/trigger
     [none] usb-gadget usb-host kbd-scrolllock kbd-numlock kbd-capslock kbd-kanalock kbd-shiftlock kbd-altgrlock kbd-ctrllock kbd-altlock kbd-shiftllock kbd-shiftrlock kbd-ctrlllock kbd-ctrlrlock usbport mmc1 mmc2 disk-activity disk-read disk-write ide-disk mtd nand-disk heartbeat cpu cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 activity default-on panic stmmac-0:00:link stmmac-0:00:1Gbps stmmac-0:00:100Mbps stmmac-0:00:10Mbps rc-feedback gpio-charger-online tcpm-source-psy-4-0022-online
I'm not sure why it's not there or how that list gets populated. I'm running Armbian Buster 21.02.3 with LK 5.10.21-rockchip64. Any help will be much appreciated.
  10. Use kernel 5.8 and follow the instructions on the thread above.
  11. @jbergler Thank you very much for that. The steps just needed a small tweak, but it works perfectly. Running on kernel 5.8.16 right now! The small tweak is just to also install flex and bison before installing the kernel headers package (or run apt-get install -f if already installed).
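Spelled out, the tweak is just the following (Debian package names as given in the post; needs root):

```shell
# Install the build deps needed before the kernel headers package
PKGS="flex bison"
sudo apt-get install -y $PKGS 2>/dev/null || true
# If the headers package was already installed and its build failed:
sudo apt-get install -f 2>/dev/null || true
```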
  12. @jbergler Thanks again for the ZFS module you shared on the Helios64 support thread. It's been running for days without any problems. If it's not too much trouble, would you mind sharing the process you used to build it?
  13. A little advice for anyone planning to use nand-sata-install to install on the eMMC who has already installed and configured OMV: nand-sata-install will break OMV, but it's easy to fix if you know what's happening.
     - It will skip /srv when copying the root to avoid copying stuff from other mounted filesystems, but OMV 5 stores part of its config in there (the salt and pillar folders) and will throw a fit if they're not there when you boot from the eMMC. Simply copy these folders back from the microSD using your preferred method.
     - If you have NFS shares set in OMV, make sure to add the entry /export/* to the file /usr/lib/nand-sata-install/exclude.txt BEFORE running nand-sata-install, or it will try to copy the content of your NFS shares to the eMMC.
     - Lastly, if you're using ZFS, which by default mounts to /<pool_name>, make sure to add its mountpoint to /usr/lib/nand-sata-install/exclude.txt before running nand-sata-install as well.
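The precautions above, sketched as a short pre-flight script. The exclude file path and /export/* entry are from the post; the pool name and microSD mountpoint are placeholders.

```shell
# BEFORE running nand-sata-install (sketch; needs root)
EXCLUDE=/usr/lib/nand-sata-install/exclude.txt
if [ -w "$EXCLUDE" ]; then
    echo '/export/*' >> "$EXCLUDE"   # keep NFS share contents off the eMMC
    echo '/tank/*' >> "$EXCLUDE"     # ZFS pool mountpoint ("tank" is a placeholder)
fi
# AFTER first boot from eMMC: restore the OMV state skipped along with /srv
# (microSD assumed still mounted at /mnt/sdcard):
#   cp -a /mnt/sdcard/srv/salt /mnt/sdcard/srv/pillar /srv/
```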
  14. @RaSca Are your disks powered from the USB port itself, or do they have their own power supplies? Is your USB hub powered externally? To me it sounds like the disks are trying to draw more power than the board can provide, and thus failing. For the record, I have a 14TB WD Elements (powered by a 12V external PSU) connected to my H64 and running flawlessly, reaching the max speeds the disk can provide (~200MB/s reads).
  15. Idk if these questions are within the scope of this thread, but I'm curious about a few things regarding thermals on the Helios64, if someone from Kobol could clarify:
     - What is the expected temperature range for this SoC?
     - How does the heatsink interface with the SoC? Pads, paste?
     - Does the heatsink also make contact with any other chips around the CPU?
     - Would it benefit in any way from a "re-paste" with high-performance thermal paste?
     I could, of course, just disassemble mine and find out for myself, but I figured I would ask first.