ShadowDance

Members
  • Posts

    64
  • Joined

1 Follower

ShadowDance's Achievements

  1. @RockBian it wasn’t my intention to paint the issue as black and white. My statement was based on the fact that most reports have come from users of WD REDs. I personally believe this issue is far more prevalent than this thread makes it out to be. It’s simply that triggering the issue may require time and the right kind of workload. So either people don’t trigger the issue, or they don’t notice it unless they’re checking system logs. I hope your new disks work out for you. Either way, I never tried any other disks in my Helios and probably never will, considering there’s no further development and Armbian used to break seemingly every other day (a while back).
  2. @meymarce it’s most likely related, and not due to bad disks. All my disks have accumulated a number of UDMA_CRC_Error_Count errors due to the Helios. I’ve since completely stopped using the Helios and put the same disks inside an ASUSTOR. Zero issues. Basically, don’t put WD REDs in a Helios, or be prepared to have a bad time, heh.
  3. Same here, the instability of my Helios64 combined with Armbian not having a test-suite for it (and thus breaking it at any point) led me to splurge on hardware that cost 4x as much. A NAS should be out of sight and out of mind, not a constant source of worry.
  4. @RockBian hmm, that output looks more like there's a problem with the disk itself. At least none of the SATA errors I've experienced have been logged by the disks themselves, but that could be a difference between manufacturers. I'd recommend running a short and a long self-test (via smartctl) on the disk.
  5. It does look like the same issue to me, and the filesystem does not matter: these SATA errors can present themselves simply by reading the disk. As a result, filesystem corruption is not unexpected, though ZFS does protect us from it. But it can’t be discounted that this could be a disk error as well; perhaps it’s nearing its end-of-life. Which brand / model of disk do you have?
  6. Changing CPU voltage didn't help either. However I might have finally found a pattern to the pool suspension. During two observations it seems to have happened during snapshot replication (from one machine to helios) via zrepl. Not sure if it's during a recv, destroy or hold/release, so I'm now doing more verbose logging of the zfs events in hopes that that will provide some clues. Edit: Could be just normal snapshot/delete too, will be able to verify next time it happens.
  7. Thanks for the hint, I was under the impression that VDD changes were no longer necessary so I haven't even considered them. But I think I'll try just that, it's a stab in the dark but does align with my suspicion about the CPU, and at this point I'll try just about anything.
  8. Thanks for replying and digging into the sources @meymarce, I'll try to comb over that code path and see if I can find any clues as to what's going on. This happens with both SATA firmwares (updated and original), currently running original. And I'm running Buster as well (and my drives are 5 x WD RED 6TB). Did you raise voltage to prevent SATA errors/resets? How did you go about doing that if I may ask? The issue doesn't require a lot of load as I've observed, it happens seemingly randomly (as can be seen in the following graphs for the last 7 days, "gaps" indicate a panic + time before I've rebooted).
  9. Hey, posting this in hopes that someone might have an idea as to why this is happening. I've been dealing with an issue with my ZFS pool for a while now where the pool gets suspended but there are no other error indicators:

     WARNING: Pool 'rpool' has encountered an uncorrectable I/O failure and has been suspended.

     I'm using ZFS on top of LUKS and used to have problems with my drives resetting due to SATA errors, but I haven't really seen that issue since I started using my own SATA cables and limiting SATA speed to 3 Gbps. My working theory for the last week has been that it's a problem with the CPU, perhaps some handover between big.LITTLE. So I've tried changing ZFS module options to pin workers to CPU cores, and I've also tried dm-crypt options that do this, but nothing has helped. So either the theory was wrong, or the tweaks did not change the faulty behavior. I also tried disabling the little cores, but the machine refused to boot as a result.

     With anywhere from two pool stalls per day to one per week, I'm pretty much at my wits' end and ready to throw in the towel with my Helios64. In addition, I still have random kernel stalls/panics originating from rcu or null pointer dereferences (on boot, usually). I'm not really interested in learning to debug the Linux kernel, so I might just throw money at the problem and retire the Helios unless someone has a solution for this. I do love the idea of open source hardware and wish the best success for Kobol and the Helios, but I wasn't quite ready to commit to this many problems.

     I've also tried to set the pool failure mode to panic (zpool set failmode=panic rpool) but it provides no useful output as far as I can tell:

     [22978.488772] Kernel panic - not syncing: Pool 'rpool' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
     [22978.490035] CPU: 1 PID: 1429 Comm: z_null_int Tainted: P OE 5.9.14-rockchip64 #20.11.4
     [22978.490833] Hardware name: Helios64 (DT)
     [22978.491182] Call trace:
     [22978.491416] dump_backtrace+0x0/0x200
     [22978.491743] show_stack+0x18/0x28
     [22978.492041] dump_stack+0xc0/0x11c
     [22978.492346] panic+0x164/0x364
     [22978.492962] zio_suspend+0x148/0x150 [zfs]
     [22978.493678] zio_done+0xbd0/0xec0 [zfs]
     [22978.494387] zio_execute+0xac/0x150 [zfs]
     [22978.494783] taskq_thread+0x278/0x460 [spl]
     [22978.495161] kthread+0x140/0x150
     [22978.495453] ret_from_fork+0x10/0x34
     [22978.495778] SMP: stopping secondary CPUs
     [22978.496134] Kernel Offset: disabled
     [22978.496443] CPU features: 0x0240022,2000200c
     [22978.496817] Memory Limit: none
     [22978.497098] ---[ end Kernel panic - not syncing: Pool 'rpool' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic. ]---

     It's not necessarily the Helios's fault either, this could very well be a bug in ZFS on ARM for all I know.
  10. @usefulnoise I'd start by trying all the different suggestions in this thread, e.g. limiting speed and disabling NCQ. If you're not using raw disks (i.e. you're using partitions or dm-crypt), make sure you've disabled I/O schedulers on the disks, etc. Example: libata.force=3.0G,noncq,noncqtrim Disabling NCQ TRIM (noncqtrim) is probably unnecessary, but queued TRIM doesn't give any benefit with spinning disks anyway. If none of this helps, and you're sure the disks aren't actually faulty, I'd recommend trying the SATA controller firmware update (it didn't help me) or possibly experimenting with removing noise. Hook the PSU to a grounded wall socket, use 3rd party SATA cables, or try rerouting them. Possibly, if you're desperate, try removing the metal clips from the SATA cables (the clip that hooks into the motherboard socket). It shouldn't be a problem, but the clip could perhaps function as an antenna for noise.
  11. Thanks for the updated firmware. Unfortunately, I'm seeing the same as Wofferl: At first the programming didn't seem to work (I used balenaEtcher without unpacking the file, it seemed to understand that it was compressed though so went ahead with it), there was no reboot either. Here's the output: Then I tried flashing it again, unpacked it first this time around and the flashing worked as described:
  12. Finally got around to testing my Helios64 on the grounded wall outlet (alone), unfortunately there was no change. The disks behaved the same as before and I also had those kernel panics during boot that I seem to get in roughly 3 out of 5 boots. Yes, that sounds very reasonable, and this is my expectation too. I didn't put too much weight behind what I read, but over there one user had issues with his disks until he grounded the cases of the HDDs. Sounded strange, but at this point I'm open to pretty much any strange solution, haha. Could've just been poorly designed/manufactured HDDs too.
  13. Thanks for sharing, and it seems you are right. I've read some confusing information back and forth and to be honest was never really sure whether mine were SMR or CMR. It doesn't help that WD introduced the Plus-series and then claimed all plain Red (1-6TB) are SMR. Good to know they're CMR, and it makes sense since I haven't noticed performance issues with them.
  14. That's great news, looking forward to it! Also nice! If you do figure out a trick we can implement ourselves (i.e. via soldering or, dare I say, foil wrapping) to reduce the noise there, let us know. Speaking of noise, I have two new thoughts: My Helios64 PSU is not connected to a grounded wall-outlet. Unfortunately here in my apartment there are only two such outlets in very inconvenient places, but I'll see if it's feasible to test it out on one of them. Perhaps this introduces noise into the system? Perhaps other devices are leaking their noise via the grounded extension cord (I recently learned that using a grounded extension cord on a non-grounded outlet is not ideal, hadn't even thought to consider it before). I read somewhere that the metal frame of the HDD cases should be connected to device ground, but since we're using plastic mount brackets I doubt this is the case? To be honest I don't know if this is a real issue or not, just something I read on a forum.
  15. Are you sure about that? Generally CMR is considered good, SMR is what you'd want to stay away from. Is it something related to these specific drives?
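
For anyone who wants to confirm that the tweaks suggested to @usefulnoise above are actually in effect, here is a minimal Python sketch. It is not something from the thread: the function names are made up for illustration, and it assumes a Linux system where the kernel command line is readable at /proc/cmdline and each disk exposes /sys/block/<disk>/queue/scheduler with the active scheduler shown in square brackets. It also assumes that "disabled" means the scheduler is set to "none".

```python
import re
from pathlib import Path
from typing import List, Optional


def libata_force_options(cmdline: str) -> List[str]:
    """Return the comma-separated values given to libata.force= (empty if unset)."""
    for token in cmdline.split():
        if token.startswith("libata.force="):
            value = token.split("=", 1)[1]
            return [opt for opt in value.split(",") if opt]
    return []


def active_scheduler(scheduler_text: str) -> Optional[str]:
    """Return the scheduler shown in [brackets], e.g. 'mq-deadline kyber [none]'."""
    match = re.search(r"\[([^\]]+)\]", scheduler_text)
    return match.group(1) if match else None


def report() -> None:
    """Print the libata.force options and the active scheduler of each sd* disk."""
    opts = libata_force_options(Path("/proc/cmdline").read_text())
    print("libata.force:", ", ".join(opts) if opts else "(not set)")
    # Assumption: SATA disks show up as /sys/block/sda, /sys/block/sdb, ...
    for sched_file in sorted(Path("/sys/block").glob("sd*/queue/scheduler")):
        disk = sched_file.parent.parent.name
        print(disk, "->", active_scheduler(sched_file.read_text()))
```

Calling report() on the NAS itself should list the options that were booted with (for the example above: 3.0G, noncq, noncqtrim) and, per disk, which scheduler is active.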