ShadowDance

Everything posted by ShadowDance

  1. I was still seeing the issue on Armbian 20.11 Buster (LK 5.9) and it actually seemed worse (even with the 3 Gbps limit). So, given my suspicion of a power issue (the push from @TheLinuxBug helped too), I took the plunge and got a bit creative with my testing. Here's the TL;DR:

  • Stock power/harness has the issue even with only one drive connected
  • Stock power/harness has the issue with the UPS disconnected
  • Stock harness has the issue when powering via the UPS battery only (no PSU); this triggered the issue without even reading from the disk (idle disk, it happened immediately after boot in the initrd)
  • Separate PSU for the disk and a different SATA cable (no harness) == no issue

The 3 Gbps limit was removed for these tests, and I picked the ata4 port because I knew it could reliably reproduce the issue soon after boot. I've attached dmesgs from cases 2-4 and annotated some of them. I'm leaning towards a power issue here, considering cases 3 and 4. For reference, these drives are rated 0.60 A @ 5 V and 0.45 A @ 12 V. I would've liked to test the harness power and the harness SATA cable separately, but that's unfortunately quite hard, and I didn't have any SATA extension cables (female/male) that would have let me test the harness SATA cable independently of its power. In case someone finds it amusing, here's my testing rig (case 4):

Notably, I performed the 4th test 3 times. The first time I had the drive powered on before powering on the Helios64, but the link came up at 3 Gbps for some reason. I powered both down, powered on the Helios64 first, waited 2 seconds, then powered on the drive; this way the link came up at 6 Gbps and there were no issues when dd'ing. For good measure I repeated this a second time and still no issue. (A quick way to check the negotiated link speed is sketched after this post.)

@gprovost Is there anything else I could test? Or would the next step be to try a different harness, PSU, etc.?

@aprayoga I have indeed juggled them around, but my issue is with all SATA ports; it's unlikely that all of them are loose. And thanks for attempting to reproduce. I'll still need to check whether it happens with other drives too, but that's a task for another day.

2-dmesg-internal-harness-no-ups.txt
3-dmesg-no-psu-only-ups.txt
4-dmesg-external-psu-sata-cable.txt
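For reference, here's roughly how I checked which speed each link negotiated (a minimal sketch; the sysfs path depends on the kernel, so treat it as an example):

```
# Negotiated speed per port shows up in the kernel log as "SATA link up X.X Gbps".
dmesg | grep -i "SATA link up"

# On recent kernels the same information is exposed per link in sysfs.
cat /sys/class/ata_link/link*/sata_spd
```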
  2. Thanks for the reply @TheLinuxBug. It's not apparent from my top post, so to provide a bit more context: this happens even with no system load, and it happens with both four and five drives plugged in. Reading from only one drive at a time (while the rest are idle) also triggers the issue. Considering the four-drive scenario, I find it hard to believe that a single drive could be pulling too much power from the power rail (one rail powering 3 drives and the other 1 drive). Testing with a separate PSU is a good suggestion; in hindsight I should've done that before taking the NAS into "production" use. I also have some spare SATA cables I could try, but right now I won't be able to experiment. I have to see if I can find other drives on which I can reproduce the issue, because as it stands the SATA resets are not kind on my drives: they cause read errors, and even with double parity I don't want to risk it. (A ZFS scrub fixes the read errors by rewriting the then-unreadable sector; the commands are sketched below.)
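For anyone wondering what that repair cycle looks like in practice, it's just this (a minimal sketch; `rpool` is my pool name):

```
# Scrub the pool; ZFS rewrites any unreadable sectors from parity/redundancy.
zpool scrub rpool

# Check progress and whether any read/checksum errors were found and repaired.
zpool status -v rpool
```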
  3. @grek it's a bit of a dilemma. You must use compatible libs to build the zfs tools, in your case you must build them on Debian Buster, but the module build will fail due to old gcc. You can work around this by using the kernel modules (kmod) from a build using newer gcc (by building on e.g. Ubuntu Focal, Debian Bullseye/Sid, etc.) and building the tools separately on Buster. To achieve this we must force the build on Buster to either skip building the kmod or to trick it into building it "successfully". After a bunch of head-scratching, I found the latter to be easier. Here's a set of steps that should work for building the tools (sans functioning kmod), based largely on the steps posted earlier by jbergler:

mkdir /tmp/zfs-build; cd $_
apt-get download linux-headers-current-rockchip64
docker pull debian:buster-slim
docker run -it --rm -v $(pwd):/build debian:buster-slim

# Inside docker.
cd /build
apt-get update
apt-get install --yes alien autoconf automake bison build-essential dkms fakeroot flex gawk libaio-dev libattr1-dev libblkid-dev libelf-dev libffi-dev libssl-dev libtool libudev-dev python3 python3-cffi python3-dev python3-setuptools uuid-dev zlib1g-dev
dpkg -i linux-headers-*.deb

# Disable all STACKPROTECT options incompatible with GCC, this means the
# built kernel module (kmod) will be non-functional. That's OK since
# we're only interested in building the tools.
sed -i -e 's/\(.*STACKPROTECT.*=\)y/\1n/' /usr/src/linux-headers-$(uname -r)/.config

# Ignore the error(s) from the following:
(cd /usr/src/linux-headers-$(uname -r); make prepare)

apt-get install --no-install-recommends --yes git
git clone https://github.com/openzfs/zfs.git
cd zfs
git clean -xdf
git checkout zfs-2.0.0-rc6
sh autogen.sh
./configure --with-linux=/usr/src/linux-headers-$(uname -r)
make -s -j$(nproc)
make deb
mkdir ../zfs-2.0.0-rc6-buster
mv *.deb ../zfs-2.0.0-rc6-buster

Then go ahead and install:

# Cleanup packages (if installed).
modprobe -r zfs zunicode zzstd zlua zcommon znvpair zavl icp spl
apt remove --yes zfsutils-linux zfs-zed zfs-initramfs
apt autoremove --yes

# Install new packages.
dpkg -i /path/to/working/kmod-zfs-$(uname -r)*.deb
dpkg -i /tmp/zfs-build/zfs-2.0.0-rc6-buster/{libnvpair1,libuutil1,libzfs2,libzpool2,python3-pyzfs,zfs}_*.deb

And that should do it, until the next kernel update.

PS. I only tested zfs-2.0.0-rc5 when setting up my Helios64 the first time, currently using zfs-0.8.5, so I can't speak for the reliability.
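A quick sanity check after installing, to confirm the working kmod and the Buster-built tools actually agree (a sketch; assumes the kmod package matches the running kernel):

```
# Load the module and compare userland vs. kernel module versions.
modprobe zfs
zfs version
```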
  4. I had a hard drive reset again after a reboot and ~1h15min of uptime. This time without plugging in or out any drives, and with the drives limited to 3 Gbps. I had a ZFS scrub running since boot and I can't spot anything peculiar in my graphs, e.g. CPU utilization was a constant ~70%, RAM ~2.5 GiB used, etc. Here's the dmesg of it happening, pretty much the same as earlier times:

[ 4477.151219] ata1.00: exception Emask 0x2 SAct 0x3000000 SErr 0x400 action 0x6
[ 4477.151226] ata1.00: irq_stat 0x08000000
[ 4477.151229] ata1: SError: { Proto }
[ 4477.151235] ata1.00: failed command: READ FPDMA QUEUED
[ 4477.151243] ata1.00: cmd 60/58:c0:50:81:14/00:00:1d:00:00/40 tag 24 ncq dma 45056 in
                        res 40/00:c0:50:81:14/00:00:1d:00:00/40 Emask 0x2 (HSM violation)
[ 4477.151245] ata1.00: status: { DRDY }
[ 4477.151248] ata1.00: failed command: READ FPDMA QUEUED
[ 4477.151254] ata1.00: cmd 60/08:c8:a8:81:14/00:00:1d:00:00/40 tag 25 ncq dma 4096 in
                        res 40/00:c0:50:81:14/00:00:1d:00:00/40 Emask 0x2 (HSM violation)
[ 4477.151256] ata1.00: status: { DRDY }
[ 4477.151263] ata1: hard resetting link
[ 4477.635201] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 4477.636449] ata1.00: configured for UDMA/133
[ 4477.636488] sd 0:0:0:0: [sda] tag#24 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[ 4477.636493] sd 0:0:0:0: [sda] tag#24 Sense Key : 0x5 [current]
[ 4477.636497] sd 0:0:0:0: [sda] tag#24 ASC=0x21 ASCQ=0x4
[ 4477.636503] sd 0:0:0:0: [sda] tag#24 CDB: opcode=0x88 88 00 00 00 00 00 1d 14 81 50 00 00 00 58 00 00
[ 4477.636508] blk_update_request: I/O error, dev sda, sector 487883088 op 0x0:(READ) flags 0x700 phys_seg 11 prio class 0
[ 4477.636527] zio pool=rpool vdev=/dev/mapper/luks-ata-WDC_WD60EFRX-68L0BN1_WD-XXXXXXXXXXXX-part4 error=5 type=1 offset=248167702528 size=45056 flags=1808b0
[ 4477.636579] sd 0:0:0:0: [sda] tag#25 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
[ 4477.636584] sd 0:0:0:0: [sda] tag#25 Sense Key : 0x5 [current]
[ 4477.636587] sd 0:0:0:0: [sda] tag#25 ASC=0x21 ASCQ=0x4
[ 4477.636591] sd 0:0:0:0: [sda] tag#25 CDB: opcode=0x88 88 00 00 00 00 00 1d 14 81 a8 00 00 00 08 00 00
[ 4477.636595] blk_update_request: I/O error, dev sda, sector 487883176 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0
[ 4477.636605] zio pool=rpool vdev=/dev/mapper/luks-ata-WDC_WD60EFRX-68L0BN1_WD-XXXXXXXXXXXX-part4 error=5 type=1 offset=248167747584 size=4096 flags=1808b0
[ 4477.636638] ata1: EH complete
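In case it helps anyone else watching for these, the pattern I look for in the kernel log is simply (a sketch):

```
# Follow the kernel log and flag the telltale messages from these resets.
journalctl -k -f | grep -E "HSM violation|hard resetting link"
```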
  5. Ah sorry, I left out that instead of installing zfs-dkms, you should install the kmod-zfs-5.*-rockchip64_0.8.5-1_arm64.deb package. The only dkms package should be the dummy. (I've updated my earlier post to reflect this.)
  6. @SymbiosisSystems I take it you have a set of 0.8.5 modules built? They work fine with the 0.8.4 zfsutils-linux package, but it requires the zfs-dkms package, which will fail to build. We can work around this by installing a dummy package that provides zfs-dkms, so that we can then go ahead and install zfsutils-linux / zfs-zed / etc. from backports. Here's how you can create a dummy package:

apt-get install --yes equivs
mkdir zfs-dkms-dummy; cd $_
cat <<EOF >zfs-dkms
Section: misc
Priority: optional
Standards-Version: 3.9.2
Package: zfs-dkms-dummy
Version: 0.8.4
Maintainer: Me <me@localhost>
Provides: zfs-dkms
Architecture: all
Description: Dummy zfs-dkms package for when using built kmod
EOF
equivs-build zfs-dkms
dpkg -i zfs-dkms-dummy_0.8.4_all.deb

After this, you can go ahead and install (if not already installed) the 0.8.5 modules (kmod-zfs-5.*-rockchip64_0.8.5-1_arm64.deb) and zfsutils-linux.
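Before installing the userland packages, it's worth confirming the dummy really satisfies the dependency (a sketch; assumes the buster-backports repo is already enabled):

```
# The dummy should be installed and listed as providing zfs-dkms.
dpkg -s zfs-dkms-dummy | grep -E "Status|Provides"

# Then pull the userland from backports on top of the prebuilt kmod.
apt-get install -t buster-backports --yes zfsutils-linux zfs-zed
```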
  7. @privilegejunkie what specific model? My guess is that the answer is no, though. That's a relatively old article, and the WD utility mentioned there only applies to old Red drives of 4 TB and below. For WD drives that support IDLE3, you can try idle3ctl (apt install idle3-tools) to manage their idle setting; an example is sketched below. I've turned idle3 off on all my drives because they're usually idle for at most 5 minutes at a time, which led to constant sleep/wake cycling.
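Roughly what that looks like (a minimal sketch; /dev/sda is a placeholder, and the new setting only takes effect after the drive has been power-cycled):

```
apt install idle3-tools

# Read the drive's current idle3 timer value.
idle3ctl -g /dev/sda

# Disable the idle3 (head parking) timer entirely.
idle3ctl -d /dev/sda
```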
  8. If you created a separate partition for swap on the raw disk (not on top of RAID), I see only one reason it would lead to a kernel panic, and that's a bad disk: pages swapped out to a bad disk come back corrupted, which the kernel sees as memory corruption. Have you run a S.M.A.R.T. self-test (both short and long) on the disk that held the swap? A sketch of the commands is below. That said, I would expect swap to work on top of RAID as well, but it's an extra layer that might get congested under high memory pressure. ZFS, for instance, doesn't deal well with swap on top of it; there's at least one long-standing open issue about it.
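The check I have in mind (a minimal sketch; /dev/sdX is a placeholder for the disk that held the swap):

```
# Run the short self-test (a couple of minutes), then the long one (can take hours).
smartctl -t short /dev/sdX
smartctl -t long /dev/sdX

# Afterwards, review the self-test log plus reallocated/pending sector counts.
smartctl -a /dev/sdX
```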
  9. @gprovost thanks, that fancontrol config looks good. I thought I had read somewhere on this forum that the SoC was rated for operation at 85°C, which is why I previously commented that the fans need to run at full speed during heavy load. Glad to know this isn't the case. @retrack if you want to increase airflow to the SoC, an alternative approach would be to create the funnel at the front. The gap there is the size of an HDD, and the funnel could be more focused so that air flows better through the SoC heatsink. As an added bonus, this might also improve HDD cooling, because less (but more focused) air would be diverted to the SoC. A second option would be a 3D-printed mount for a small secondary fan sitting on the SoC heatsink; you could use one of the Noctua splitters to power it.
  10. I replaced the stock fans with Noctua NF-A8-PWM fans. These have slightly lower CFM (airflow) at 32.5 vs 35 (stock), but they have a high static pressure rating at 2.37 mm H₂O. For me they produce better cooling than the stock fans. Their noise level is also rated at 17.1 dB(A), so you will barely hear them even at full speed. I also had a pair of NF-R8-PWM (a discontinued model) from the old days which I tried. They are very close in CFM at 31.4 but have worse static pressure at 1.41 mm H₂O; these produced worse cooling than the stock fans. One additional change I made was to place the metal fan grills (finger protectors) on the outside, so that the fan can form a seal against the case. I think it was a small design miss to leave a gap between the case and the fan, because it can allow air to leak back inside. Aesthetically it's nicer to have the grill on the inside (IMO), so as an alternative fix one could design a 3D-printed piece to fill the gaps. It's also possible to adjust fan speed by modifying the `/etc/fancontrol` configuration (sketched below) and restarting the service (`systemctl restart fancontrol`), but I would not recommend this unless you're using better-than-stock fans. If the CPU is working at full throttle you will want the fans running at full speed to extract enough heat from the CPU area.
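For reference, the knobs in /etc/fancontrol look roughly like this; treat the hwmon paths and values as illustrative placeholders, not my actual config (a sketch):

```
# Back up the shipped config before touching it.
cp /etc/fancontrol /etc/fancontrol.bak

# /etc/fancontrol maps each fan's pwm output to a temperature source plus a PWM
# range, with entries along these lines (illustrative values only):
#   MINTEMP=hwmon3/pwm1=40     # below this temp the fan runs at MINPWM
#   MAXTEMP=hwmon3/pwm1=80     # at/above this temp the fan runs at MAXPWM
#   MINPWM=hwmon3/pwm1=60
#   MAXPWM=hwmon3/pwm1=255

# Apply changes.
systemctl restart fancontrol
```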
  11. This is a continuation of "ata1.00: failed command: READ FPDMA QUEUED" noted by @djurny. I've experienced the same issue, and have some additional data points to provide. My observations so far:

  • I'm using WDC WD60EFRX (68MYMN1 and 68L0BN1) drives
  • The drives previously worked without issue behind an ASMedia ASM1062 SATA controller, and I've also used some of them behind an ASM1542 (external eSATA enclosure)
  • I can reproduce the issue on a clean install of Armbian 20.08.21, both Buster and Focal
  • I can reproduce via a simple `dd` to `/dev/null` from the drive, so the filesystem does not seem to be the underlying cause (see the sketch after this post)
  • Every drive is affected (i.e. each SATA slot)
  • The point at which dd produces an error varies from SATA slot to SATA slot (it is not drive dependent); SATA slot 4 can reproducibly produce the error almost immediately after starting a read
  • The problem goes away when setting `extraargs=libata.force=3.0` in `/boot/armbianEnv.txt` [1]

[1] However, even with SATA limited to 3 Gbps, the problem did reappear when hot-plugging a drive. This reset happened on drive slot 3 when I hot-plugged a drive into slot 5, which seems weird to me considering they are supposed to be on different power rails. It may suggest there is a general problem with either the PSU or power delivery to the drives. Here's an excerpt from the reset:

[152957.354311] ata3.00: exception Emask 0x10 SAct 0x80000000 SErr 0x9b0000 action 0xe frozen
[152957.354318] ata3.00: irq_stat 0x00400000, PHY RDY changed
[152957.354322] ata3: SError: { PHYRdyChg PHYInt 10B8B Dispar LinkSeq }
[152957.354328] ata3.00: failed command: READ FPDMA QUEUED
[152957.354335] ata3.00: cmd 60/58:f8:00:f8:e7/01:00:71:02:00/40 tag 31 ncq dma 176128 in
                         res 40/00:f8:00:f8:e7/00:00:71:02:00/40 Emask 0x10 (ATA bus error)
[152957.354338] ata3.00: status: { DRDY }
[152957.354345] ata3: hard resetting link

The full dmesg from when the error happened is attached below.
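For anyone wanting to reproduce this, the `dd` test mentioned above is essentially the following (a sketch; /dev/sda is a placeholder for whichever drive you're testing):

```
# Sequential read straight off the raw device; when the issue hits, dmesg shows
# "failed command: READ FPDMA QUEUED" followed by a hard link reset.
dd if=/dev/sda of=/dev/null bs=1M status=progress

# Workaround that made the resets go away for me: cap the SATA links at 3 Gbps
# by adding (or extending) this line in /boot/armbianEnv.txt and rebooting:
#   extraargs=libata.force=3.0
```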
  12. That's great, thank you. I'll be more than happy to help test!
  13. @jbergler I had the same issue with panics on 20.08.21 Buster (also using ZFS). The problem went away after I switched from the ondemand to the performance governor via armbian-config (the manual equivalent is sketched below).
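If you'd rather not go through armbian-config, the change amounts to roughly this (a sketch; on Armbian the persistent setting normally lives in /etc/default/cpufrequtils):

```
# Check the current governor for each CPU policy.
cat /sys/devices/system/cpu/cpufreq/policy*/scaling_governor

# Switch to performance at runtime (lasts until reboot).
for p in /sys/devices/system/cpu/cpufreq/policy*; do
  echo performance > "$p/scaling_governor"
done

# To persist it, set GOVERNOR=performance in /etc/default/cpufrequtils and
# restart the cpufrequtils service.
```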
  14. Hi, I've gone ahead and installed Armbian Buster (20.08.21) with root on ZFS (SATA) and I would like to boot from ZFS as well. However, I noticed that u-boot isn't built with ZFS support enabled. I've worked around it for now by keeping /boot on eMMC, but would it be possible to enable ZFS support in future builds? It can be enabled by building u-boot with CONFIG_CMD_ZFS defined. The benefit is that /boot can be snapshotted, rolled back, mirrored/raided, etc. along with root. I would also be quite happy to build u-boot myself, but I have no experience with it and I'd like to be sure I can stay up to date with Helios64 patches, etc.

In case anyone is interested, I've also done a few simplistic benchmarks for encryption, using ZFS native encryption (aes-256-gcm) and ZFS on top of LUKS (aes-xts-plain64); the rough commands are sketched after this post:

  • ZFS native encryption: read ~60 MB/s, write ~65 MB/s
  • ZFS on LUKS: read ~210 MB/s, write >100 MB/s (sorry for the inaccurate number, write speeds were capped by /dev/urandom)

The pool was a single raidz2 vdev with 4 disks (one missing). Speeds were similar on both ZFS 0.8.5 and 2.0.0-rc5. Suffice it to say, LUKS performs a lot better; ZFS native encryption is also very heavy on the CPU (think full CPU utilization @ 80°C for as long as you're reading or writing). I'm hoping ZFS native encryption will gain better optimizations on ARM hardware in the future, but for now I'm going with LUKS.

PS. I've also had to limit SATA speed to 3 Gbps for my disks via extraargs=libata.force=3.0, similar to another user in this thread (I had the same issue, also with WD disks, and they were being reset left and right without it).
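The benchmark itself was nothing scientific; it was along these lines (a sketch from memory; the dataset path is a placeholder, and /dev/urandom is itself the write-speed bottleneck here):

```
# Write test: stream pseudo-random data into a file on the encrypted dataset.
dd if=/dev/urandom of=/tank/enctest/bigfile bs=1M count=4096 conv=fsync

# Read test: drop the page cache, then read the file back.
# (ZFS's own ARC can still serve reads from RAM, so use a file larger than
#  memory if you want to be sure you're hitting the disks.)
echo 3 > /proc/sys/vm/drop_caches
dd if=/tank/enctest/bigfile of=/dev/null bs=1M
```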