ShadowDance

  1. @RockBian hmm, that output looks more like there's a problem with the disk itself. At least, none of the SATA errors I've experienced have been logged by the disks themselves, but that could be a difference between manufacturers. I'd recommend running a short and a long self-test (via smartctl) on the disk.
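The short and long self-tests suggested above could be started roughly like this. It is a minimal sketch: `/dev/sdX` is a placeholder device, and the `DRY_RUN` switch and the function names are made up for illustration so that the commands can be previewed before running them as root.

```shell
#!/bin/sh
# Sketch only: DISK is a placeholder, replace /dev/sdX with the real device.
DISK="${DISK:-/dev/sdX}"
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start the short self-test.
smart_short() { run smartctl -t short "$DISK"; }

# Start the long self-test. Wait for the short test to finish first,
# since a drive runs only one self-test at a time.
smart_long() { run smartctl -t long "$DISK"; }

# Show the self-test log and the drive's own error log.
smart_results() {
    run smartctl -l selftest "$DISK"
    run smartctl -l error "$DISK"
}
```

With `DRY_RUN=1`, calling `smart_short` only prints `smartctl -t short /dev/sdX`. Set `DISK` to the real device and `DRY_RUN=0` to actually run the tests, then check `smart_results` once they have completed.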
  2. It does look like the same issue to me, and the filesystem does not matter: these SATA errors can show up simply by reading the disk. As a result, filesystem corruption is not unexpected, though ZFS does protect us from it. But it can't be ruled out that this is a disk error as well; perhaps the disk is nearing its end of life. Which brand and model of disk do you have?
  3. Changing CPU voltage didn't help either. However, I might have finally found a pattern to the pool suspension. On two occasions it seems to have happened during snapshot replication (from one machine to helios) via zrepl. I'm not sure if it's during a recv, destroy or hold/release, so I'm now doing more verbose logging of the zfs events in the hope that it will provide some clues. Edit: Could be just a normal snapshot/delete too, I will be able to verify next time it happens.
  4. Thanks for the hint, I was under the impression that VDD changes were no longer necessary so I haven't even considered them. But I think I'll try just that, it's a stab in the dark but does align with my suspicion about the CPU, and at this point I'll try just about anything.
  5. Thanks for replying and digging into the sources @meymarce, I'll try to comb over that code path and see if I can find any clues as to what's going on. This happens with both SATA firmwares (updated and original), currently running original. And I'm running Buster as well (and my drives are 5 x WD RED 6TB). Did you raise voltage to prevent SATA errors/resets? How did you go about doing that if I may ask? The issue doesn't require a lot of load as I've observed, it happens seemingly randomly (as can be seen in the following graphs for the last 7 days, "gaps" indicate a panic + time before I've rebooted).
  6. Hey, posting this in hopes that someone might have an idea as to why this is happening. I've been dealing with an issue with my ZFS pool for a while now where the pool gets suspended but there are no other error indicators.

         WARNING: Pool 'rpool' has encountered an uncorrectable I/O failure and has been suspended.

     I'm using ZFS on top of LUKS and used to have problems with my drives resetting due to SATA errors, but I haven't really seen that issue since I started using my own SATA cables and limiting SATA speed to 3 Gbps.

     My working theory for the last week has been that it's a problem with the CPU, perhaps some handover between big.LITTLE. So I've tried changing ZFS module options to pin workers to CPU cores, and I've also tried dm-crypt options that do this, but nothing has helped. So either the theory was wrong, or the tweaks did not change the faulty behavior. I also tried disabling the little cores, but the machine refused to boot as a result.

     With anywhere from two pool stalls per day to one per week, I'm pretty much at my wits' end and ready to throw in the towel with my Helios64. In addition, I still have random kernel stalls/panics originating from rcu or null pointer dereferences (on boot, usually). I'm not really interested in learning to debug the Linux kernel, so I might just throw money at the problem and retire the Helios unless someone has a solution for this. I do love the idea of open source hardware and wish the best success for Kobol and the Helios, but I wasn't quite ready to commit to this many problems.

     I've also tried to set the pool failure mode to panic (zpool set failmode=panic rpool) but it provides no useful output as far as I can tell:

         [22978.488772] Kernel panic - not syncing: Pool 'rpool' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
         [22978.490035] CPU: 1 PID: 1429 Comm: z_null_int Tainted: P OE 5.9.14-rockchip64 #20.11.4
         [22978.490833] Hardware name: Helios64 (DT)
         [22978.491182] Call trace:
         [22978.491416] dump_backtrace+0x0/0x200
         [22978.491743] show_stack+0x18/0x28
         [22978.492041] dump_stack+0xc0/0x11c
         [22978.492346] panic+0x164/0x364
         [22978.492962] zio_suspend+0x148/0x150 [zfs]
         [22978.493678] zio_done+0xbd0/0xec0 [zfs]
         [22978.494387] zio_execute+0xac/0x150 [zfs]
         [22978.494783] taskq_thread+0x278/0x460 [spl]
         [22978.495161] kthread+0x140/0x150
         [22978.495453] ret_from_fork+0x10/0x34
         [22978.495778] SMP: stopping secondary CPUs
         [22978.496134] Kernel Offset: disabled
         [22978.496443] CPU features: 0x0240022,2000200c
         [22978.496817] Memory Limit: none
         [22978.497098] ---[ end Kernel panic - not syncing: Pool 'rpool' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic. ]---

     It's not necessarily the Helios's fault either; this could very well be a bug in ZFS on ARM for all I know.
  7. @usefulnoise I'd start by trying all the different suggestions in this thread, e.g. limiting speed and disabling NCQ. If you're not giving ZFS raw disks (i.e. you're using partitions or dm-crypt), also make sure you've disabled the io schedulers on the disks. Example:

         libata.force=3.0G,noncq,noncqtrim

     Disabling NCQ TRIM (noncqtrim) is probably unnecessary, but NCQ TRIM gives no benefit with spinning disks anyway.

     If none of this helps, and you're sure the disks aren't actually faulty, I'd recommend trying the SATA controller firmware update (it didn't help me) or possibly experimenting with removing noise. Hook the PSU to a grounded wall socket, use 3rd party SATA cables, or try rerouting them. Possibly, if you're desperate, try removing the metal clips from the SATA cables (the clip that hooks into the motherboard socket). It shouldn't be a problem, but the clip could perhaps function as an antenna for noise.
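For reference, a sketch of how these settings might be applied on Armbian. The `extraargs=` line in `/boot/armbianEnv.txt` is the mechanism mentioned elsewhere in this thread. The function names and the optional directory argument are made up for illustration; the argument exists only so the functions can be tried against a scratch directory instead of the real `/sys/block`.

```shell
#!/bin/sh
# Kernel arguments from the post, as they might appear in /boot/armbianEnv.txt
# (takes effect after a reboot):
#   extraargs=libata.force=3.0G,noncq,noncqtrim

# Set the I/O scheduler to "none" for every sd* disk.
# $1 is the sysfs block directory and defaults to /sys/block.
set_scheduler_none() {
    root="${1:-/sys/block}"
    for sched in "$root"/sd*/queue/scheduler; do
        # Skip an unmatched glob or entries we cannot write to.
        [ -w "$sched" ] || continue
        echo none > "$sched"
    done
}

# Print the current scheduler setting of every sd* disk.
show_schedulers() {
    root="${1:-/sys/block}"
    for sched in "$root"/sd*/queue/scheduler; do
        [ -r "$sched" ] || continue
        printf '%s: %s\n' "$sched" "$(cat "$sched")"
    done
}
```

Run `set_scheduler_none` as root and confirm with `show_schedulers`. A value written to sysfs this way does not survive a reboot, so it would need to be reapplied at boot.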
  8. Thanks for the updated firmware. Unfortunately, I'm seeing the same as Wofferl: At first the programming didn't seem to work (I used balenaEtcher without unpacking the file, it seemed to understand that it was compressed though so went ahead with it), there was no reboot either. Here's the output: Then I tried flashing it again, unpacked it first this time around and the flashing worked as described:
  9. Finally got around to testing my Helios64 on the grounded wall outlet (alone), unfortunately there was no change. The disks behaved the same as before and I also had those kernel panics during boot that I seem to get ~3 out of 5 bootups. Yes sounds very reasonable, and this is my expectation too. I didn't put too much weight behind what I read, but over there one user had issues with his disks until he grounded the cases of the HDDs. Sounded strange but at this point I'm open to pretty much any strange solutions, haha. Could've just been poorly designed/manufactured HDDs too.
  10. Thanks for sharing, and it seems you are right. I've read some confusing information back and forth and to be honest was never really sure whether mine were SMR or CMR. Doesn't help that WD introduced the Plus-series and then claims all plain Red (1-6TB) are SMR. Good to know they're CMR and makes sense since I haven't noticed performance issues with them.
  11. That's great news, looking forward to it! Also nice! If you do figure out a trick we can implement ourselves (i.e. via soldering or, dare I say, foil wrapping) to reduce the noise there, let us know. Speaking of noise, I have two new thoughts:

        • My Helios64 PSU is not connected to a grounded wall outlet. Unfortunately, here in my apartment there are only two such outlets, in very inconvenient places, but I'll see if it's feasible to test it on one of them. Perhaps this introduces noise into the system? Perhaps other devices are leaking their noise via the grounded extension cord (I recently learned that using a grounded extension cord on a non-grounded outlet is not ideal, hadn't even thought to consider it before).
        • I read somewhere that the metal frame of the HDD cases should be connected to device ground, but since we're using plastic mount brackets I doubt this is the case? To be honest, I don't know if this is a real issue or not, it's just something I read on a forum.
  12. Are you sure about that? Generally CMR is considered good, SMR is what you'd want to stay away from. Is it something related to these specific drives?
  13. @Wofferl those are the exact same model as three of my disks (but mine aren't "Plus"). I've used these disks in another machine with ZFS and zero issues (ASM1062 SATA controller). So if we assume the problem is between SATA controller and disk, and while I agree with you that it's probably in part a disk issue, I'm convinced it's something that would be fixable in the SATA controller firmware. Perhaps these disks do something funny that the SATA controller doesn't expect? And based on all my testing so far, the SATA cable also plays a role, meaning perhaps there's a noise factor in play (as well). Side note: Western Digital really screwed us over with this whole SMR fiasco, didn't they? I'd be pretty much ready to throw these disks in the trash if it wasn't for the fact that they worked perfectly on another SATA controller. @grek glad it helped! By the way, I would still recommend changing the io scheduler to none because bfq is CPU intensive, and ZFS does its own scheduling. It probably won't fix the issues but might reduce some CPU overhead.
  14. I have some new things to report. I finally got around to replacing all SATA cables with 3rd party cables and using the power-only harness that the team at Kobol sent me. So I went ahead and re-enabled 6 Gbps, but to my disappointment I ran into failed command: READ FPDMA QUEUED errors again. Because of this I tried once again to disable NCQ (extraargs=libata.force=noncq) and, lo and behold, I didn't run into disk issues for the 1+ week I was testing it (kernel 5.9.14). This made me hopeful that maybe, just maybe, my previous test with NCQ disabled was bad, so I re-installed the new standard harness I received from Kobol, but unfortunately I immediately started having issues again (note that this is with the io scheduler set to none):

          ata4.00: failed command: READ DMA EXT

      @gprovost are you any closer to figuring out what could be wrong here, or do you have a potential fix in the pipeline?

      @grek I would recommend adding extraargs=libata.force=noncq to /boot/armbianEnv.txt and seeing if it helps. It might not completely fix the issue but could make things more stable.
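To compare runs with and without `noncq`, it can help to count how often these errors show up in the kernel log. A small sketch, assuming the messages keep the `ataN.NN: failed command:` wording quoted in the post; the function names are made up for illustration.

```shell
#!/bin/sh
# Count libata "failed command" lines in kernel log text read from stdin.
count_ata_errors() {
    # grep -c still prints 0 when nothing matches but exits non-zero;
    # "|| true" keeps callers running under "set -e" from aborting.
    grep -c 'ata[0-9]*\.[0-9]*: failed command:' || true
}

# List the distinct port/command pairs that failed, e.g.
# "ata4.00: failed command: READ DMA EXT".
list_ata_errors() {
    grep -o 'ata[0-9]*\.[0-9]*: failed command: .*' | sort -u
}
```

Typical use would be `dmesg | count_ata_errors` or `dmesg | list_ata_errors` after the machine has been under its usual load for a while.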
  15. @scottf007 I think it would be hard for anyone here to really answer whether it's worth it or not [for you]. In your situation, I'd try to evaluate whether or not you need the features that ZFS gives you. For instance, ZFS snapshots are something you never really need, until you do. When you find that you deleted some data a month ago and can still recover it from a snapshot, it's a great comfort. If that's something you value, btrfs could be an alternative and is already built into the kernel. If all you need is data integrity, you could consider dm-integrity+mdraid with the file system of your choice on top (EXT4, XFS, etc.). Skipping "raid" altogether would also be possible; LVM allows for great flexibility with disks.

      If you're worried about the amount of work you need to put in with ZFS, you can freeze the updates when you are satisfied with the stability of the system. Just run `sudo apt-mark hold linux-image-current-rockchip64 linux-dtb-current-rockchip64`, which prevents kernel/boot instruction updates, and you should not have ZFS break on you any time soon. Conversely, `unhold` once you're ready to deal with the future.

      For me personally, ZFS is totally worth it. I have it on two servers/NAS machines at home. I use ZFS native encryption on one, and LUKS+ZFS on the Helios64 (due to CPU capabilities). I also use a tool named zrepl for automatically creating, pruning and replicating snapshots. So, for instance, my most important datasets are backed up from my one machine to the Helios64 in raw mode; this means the data is safe, but not readable by the Helios64 without loading the encryption keys. I also run Armbian on the Helios64 straight off of ZFS (root on ZFS), which gives me the ability to easily roll back the system if, say, an update broke it.

      @hartraft it depends on your requirements/feature wishlist. RAID (mdraid), for instance, cannot guarantee data consistency (unless stacked with dm-integrity). What this means is that once data is written to the disk, it can still become corrupted and RAID can't catch it. ZFS guards against this via checksums on all data, i.e. once data is on disk, it is either not corrupted, or the corruption will be detected and is likely repairable from one of the redundant disks. ZFS also has support for snapshots, meaning you can easily recover deleted files from snapshots, etc. RAID does not support anything like this. Looking at mergerfs, it seems to lack these features as well, and it runs in user space (via FUSE), so it's not as integrated. SnapRaid is a backup program so not really comparable, and MooseFS I know nothing about, but it looks enterprise-y. The closest match-up for ZFS in terms of features is probably btrfs (in kernel) or bcachefs (I have never used this).
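As a sketch of the snapshot recovery described in the reply to @scottf007: a mounted ZFS dataset exposes its snapshots read-only under a hidden `.zfs/snapshot/` directory, so a deleted file can simply be copied back out. The dataset name, snapshot name and helper function below are hypothetical.

```shell
#!/bin/sh
# Taking and listing snapshots (hypothetical dataset name, needs a ZFS pool):
#   zfs snapshot rpool/data@2021-01-01
#   zfs list -t snapshot -r rpool/data

# Copy a single file back from a snapshot into the live dataset.
#   $1 = dataset mountpoint, $2 = snapshot name, $3 = path relative to mountpoint
restore_from_snapshot() {
    mnt="$1"
    snap="$2"
    rel="$3"
    src="$mnt/.zfs/snapshot/$snap/$rel"
    if [ ! -e "$src" ]; then
        echo "not found in snapshot: $src" >&2
        return 1
    fi
    # Recreate the parent directory in case it was deleted too.
    mkdir -p "$(dirname "$mnt/$rel")"
    # -p preserves mode and timestamps of the snapshot copy.
    cp -p "$src" "$mnt/$rel"
}
```

For example, `restore_from_snapshot /rpool/data 2021-01-01 docs/notes.txt` would copy `docs/notes.txt` as it existed in that snapshot back into the live dataset, overwriting any current file of the same name.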