ZFS: Pool 'rpool' has encountered an uncorrectable I/O failure and has been suspended.


ShadowDance

Hey, posting this in hopes that someone might have an idea as to why this is happening.

 

I've been dealing with an issue with my ZFS pool for a while now where the pool gets suspended but there are no other error indicators.

WARNING: Pool 'rpool' has encountered an uncorrectable I/O failure and has been suspended.

I'm using ZFS on top of LUKS and used to have problems with my drives resetting due to SATA errors, but I haven't really seen that issue since I started using my own SATA cables and limiting SATA speed to 3 Gbps.
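For reference, one way to pin SATA link speed at boot is the kernel's `libata.force` parameter; this is a hedged sketch (on Armbian the parameter can go in `/boot/armbianEnv.txt`, and the port-specific form depends on your controller):

```shell
# Limit all SATA links to 3 Gbps via a kernel boot parameter.
# On Armbian, append to /boot/armbianEnv.txt (example line):
#   extraargs=libata.force=3.0Gbps
# Per-port form, if only one link is flaky (port ID is illustrative):
#   extraargs=libata.force=1:3.0Gbps
```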

My working theory for last week has been that it's a problem with the CPU, perhaps some handover between big.LITTLE. So I've tried changing ZFS module options to pin workers to CPU cores, and I've also tried dm-crypt options that do this, but nothing has helped. So either the theory was wrong, or the tweaks did not change the faulty behavior. I also tried disabling the little cores, but the machine refused to boot as a result.
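For anyone wanting to reproduce the pinning tweaks mentioned above, here is a rough sketch; the parameter names are real, but treat the values and device name as illustrative assumptions rather than a known fix:

```shell
# Bind SPL/ZFS taskq worker threads to CPUs instead of letting the
# scheduler migrate them between big and little cores:
echo "options spl spl_taskq_thread_bind=1" | sudo tee /etc/modprobe.d/spl.conf

# Keep dm-crypt work on the CPU that submitted the I/O
# (cryptsetup >= 2.x performance flags; "cryptdata" is a placeholder
# mapping name, --persistent stores the flags in the LUKS2 header):
sudo cryptsetup refresh cryptdata \
    --perf-same_cpu_crypt --perf-submit_from_crypt_cpus --persistent
```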

 

With anywhere from two pool stalls per day to one per week, I'm pretty much at my wit's end and ready to throw in the towel with my Helios64. In addition, I still have random kernel stalls/panics originating from RCU or null-pointer dereferences (usually on boot). I'm not really interested in learning to debug the Linux kernel, so I might just throw money at the problem and retire the Helios unless someone has a solution for this. I do love the idea of open source hardware and wish the best success for Kobol and the Helios, but I wasn't quite ready to commit to this many problems.

 

I've also tried to set the pool failure mode to panic (zpool set failmode=panic rpool) but it provides no useful output as far as I can tell:
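For context, these are the three failmode values from zpool-props(8); panic is the most drastic but the only one that produces a kernel stack trace:

```shell
# failmode values:
#   wait     - block all I/O until the device recovers (the default)
#   continue - return EIO to new writes, allow reads where possible
#   panic    - crash the kernel and print a stack trace (used here)
zpool get failmode rpool
zpool set failmode=panic rpool
```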

[22978.488772] Kernel panic - not syncing: Pool 'rpool' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
[22978.490035] CPU: 1 PID: 1429 Comm: z_null_int Tainted: P           OE     5.9.14-rockchip64 #20.11.4
[22978.490833] Hardware name: Helios64 (DT)
[22978.491182] Call trace:
[22978.491416]  dump_backtrace+0x0/0x200
[22978.491743]  show_stack+0x18/0x28
[22978.492041]  dump_stack+0xc0/0x11c
[22978.492346]  panic+0x164/0x364
[22978.492962]  zio_suspend+0x148/0x150 [zfs]
[22978.493678]  zio_done+0xbd0/0xec0 [zfs]
[22978.494387]  zio_execute+0xac/0x150 [zfs]
[22978.494783]  taskq_thread+0x278/0x460 [spl]
[22978.495161]  kthread+0x140/0x150
[22978.495453]  ret_from_fork+0x10/0x34
[22978.495778] SMP: stopping secondary CPUs
[22978.496134] Kernel Offset: disabled
[22978.496443] CPU features: 0x0240022,2000200c
[22978.496817] Memory Limit: none
[22978.497098] ---[ end Kernel panic - not syncing: Pool 'rpool' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic. ]---

It's not necessarily the Helios's fault either; this could very well be a bug in ZFS on ARM for all I know.


Just took a quick look at the ZFS sources (on master though) and the only place I can see where this would get called looking at the callstack is https://github.com/openzfs/zfs/blob/ba91311561834774bc8fedfafb19ca1012c9dadd/module/zfs/zio.c#L4772

So it must have hit an error at this point already. 

Are you running the modified SATA firmware? Also, my system is still plagued by the panics on boot; the second attempt then works like a charm. The only fixes I applied were raising the voltage and my custom SATA harness. However, right now I have little to no load.

What distro are you running now? The other thread mentioned Buster and Focal (though IIRC they run the same kernel). I am running Buster with 4 Seagate IronWolf 10TB drives.

 


Thanks for replying and digging into the sources @meymarce, I'll try to comb over that code path and see if I can find any clues as to what's going on.

 

This happens with both SATA firmwares (updated and original), currently running original. And I'm running Buster as well (and my drives are 5 x WD RED 6TB).

 

Did you raise voltage to prevent SATA errors/resets? How did you go about doing that if I may ask?

 

The issue doesn't require a lot of load as far as I've observed; it happens seemingly randomly (as can be seen in the following graphs for the last 7 days, where "gaps" indicate a panic plus the time before I rebooted).

[Screenshot: monitoring graphs for the last 7 days showing the gaps]


The voltage thing was not specific to the SATA issues but to general system stability, and I may not have been specific enough, sorry. I meant the VDD fixes here:

Before that my system would go into random panics regardless of frequencies and governor. However I never really checked the callstacks.

This was also done upstream for the RockPro64 (same CPU) to improve general system stability. There was a post by @gprovost though to try it with a new kernel and the VDD fix removed, but I had no luck with that. With VDD raised I am all good.


Thanks for the hint, I was under the impression that VDD changes were no longer necessary, so I hadn't even considered them. But I think I'll try just that; it's a stab in the dark, but it does align with my suspicion about the CPU, and at this point I'll try just about anything.


Changing CPU voltage didn't help either. However, I might have finally found a pattern to the pool suspension. During two observations it seems to have happened during snapshot replication (from one machine to the Helios) via zrepl. I'm not sure if it's during a recv, destroy, or hold/release, so I'm now doing more verbose logging of ZFS events in hopes that it will provide some clues. Edit: It could be just a normal snapshot/delete too; I'll be able to verify next time it happens.
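For anyone wanting to do the same, this is roughly how I'm capturing the events (a sketch; the log path is just an example):

```shell
# Dump the ZFS events accumulated so far, verbosely:
zpool events -v

# Follow new events and append them to a log file in the background,
# so the events leading up to the next suspension are preserved:
zpool events -vf >> /var/log/zpool-events.log &
```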

