Stability issues on 20.08.21



Continuing the discussion from here

 

On a clean install of 20.08.21 I'm able to crash the box within a few hours of it being under load.

It appears as if the optimisations are being applied:

root@helios64:~# cat /proc/sys/net/core/rps_sock_flow_entries
32768
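
For completeness, the per-queue side of the RPS tweak can be checked too. A sketch, assuming eth0 is the interface Armbian tunes (the exact values it applies may differ):

# assuming eth0 is the interface Armbian tunes; the values it sets may differ
cat /sys/class/net/eth0/queues/rx-0/rps_cpus      # CPU mask used for packet steering
cat /sys/class/net/eth0/queues/rx-0/rps_flow_cnt  # per-queue flow table size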

 

The suggestion @ShadowDance made to switch to the performance governor hasn't helped.
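
For reference, switching every core to the performance governor can be done along these lines (a plain-sysfs sketch; /etc/default/cpufrequtils or armbian-config achieve the same thing):

# set the performance governor on all cores (not persistent across reboots)
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo performance > "$cpu/cpufreq/scaling_governor"
done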

 

Anecdotally, I seem to remember the crashes always mentioning page faults, and early on there was some discussion about memory timing. Is it possible this is still that issue?

 


 

On 11/11/2020 at 4:08 PM, jbergler said:

I also tried the suggestion to set a performance governor, and for shits and giggles I reduced the max cpu frequency, but that hasn’t made a difference.

System still locks up within a few hours.

What was the max cpu freq you set?

Could you try the performance governor at 1.2 GHz and at 816 MHz?

How did you load the system?

 

 

Did you encounter kernel crashes on 20.08.10?
 

On 11/13/2020 at 5:31 PM, aprayoga said:

Did you encounter kernel crashes on 20.08.10?

 

It's hard to say for sure; I never quite had a stable system, but back then I also wasn't generating the kind of load I am now.

 

On 11/13/2020 at 5:31 PM, aprayoga said:

What was the max cpu freq you set?

 

 

I had only reduced it by one step; I'm trying again now with the settings you suggested.

 

root@helios64:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq
performance
root@helios64:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq | uniq
816000
root@helios64:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq | uniq
1200000

 

The load I'm generating is running a zfs scrub on a 37TB pool across all five disks.
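
For anyone who wants to reproduce the load, it's just the standard scrub commands (the pool name below is a placeholder):

# start the scrub and watch its progress ("tank" stands in for the real pool name)
zpool scrub tank
zpool status -v tank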

 


After about an hour of the ZFS scrub the "bad PC value" error happened again; however, this time the system didn't hard lock.

A decent number of ZFS-related processes are stuck in uninterruptible I/O; I can't export the pool, etc.
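
A quick, generic way to see what's stuck (nothing ZFS-specific, just listing D-state tasks and dumping their kernel stacks):

# list processes in uninterruptible sleep (state D) with their wait channel
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
# if sysrq is enabled, dump kernel stacks of blocked tasks into dmesg
echo w > /proc/sysrq-trigger && dmesg | tail -n 100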

 

I did see the system crash like this occasionally without the cpufreq tweaks, so I'm not sure it tells us anything new.

I will try again.

 

Note: the relatively high uptime is from the system sitting idle for ~5 days before I put it under load again.

 

Kernel log:

[433046.690213] Unable to handle kernel paging request at virtual address f9ff8000091f3190

[433046.690218] Internal error: SP/PC alignment exception: 8a000000 [#1] PREEMPT SMP

[433046.690224] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge rfkill governor_performance zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) r8152 snd_soc_hdmi_codec snd_soc_rockchip_i2s snd_soc_core snd_pcm_dmaengine snd_pcm snd_timer panfrost snd gpu_sched soundcore leds_pwm gpio_charger pwm_fan rockchip_rga videobuf2_dma_sg hantro_vpu(C) rockchip_vdec(C) v4l2_h264 videobuf2_dma_contig videobuf2_vmalloc v4l2_mem2mem videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc fusb30x(C) zstd sg gpio_beeper cpufreq_dt zram sch_fq_codel lm75 ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek rockchipdrm analogix_dp dw_hdmi dwmac_rk dw_mipi_dsi stmmac_platform drm_kms_helper cec stmmac rc_core

[433046.690323] mdio_xpcs

[433046.690976] Mem abort info:

[433046.691593] drm drm_panel_orientation_quirks adc_keys

[433046.699701]   ESR = 0x86000004

[433046.700155] CPU: 5 PID: 248302 Comm: z_rd_int Tainted: P         C OE     5.8.17-rockchip64 #20.08.21

[433046.700433]   EC = 0x21: IABT (current EL), IL = 32 bits

[433046.701245] Hardware name: Helios64 (DT)

[433046.701718]   SET = 0, FnV = 0

[433046.702073] pstate: 40000005 (nZcv daif -PAN -UAO BTYPE=--)

[433046.702373]   EA = 0, S1PTW = 0

[433046.702850] pc : 0xb

[433046.703132] [f9ff8000091f3190] address between user and kernel address ranges

[433046.703334] lr : 0xb

[433046.704168] sp : ffff800019d53a40

[433046.704469] x29: ffff0000b604c000 x28: ffff0000f6c03a00

[433046.704946] x27: ffff000045281600 x26: 000000000000000b

[433046.705421] x25: ffff800011a10000 x24: 0000000000000000

[433046.705897] x23: 0000000000000000 x22: 0080000000000000

[433046.706374] x21: 0000000000042c00 x20: ffff000092ff8d88

[433046.706849] x19: ffff000045281600 x18: 00001e1e0a99c21b

[433046.707326] x17: 00000030510320ae x16: 000000fe01cf8d4b

[433046.707801] x15: 0000000000000000 x14: 0000000000000000

[433046.708277] x13: 0000000000000008 x12: ffff0000d8f2ea28

[433046.708753] x11: 0000000000000020 x10: 0000000000000001

[433046.709229] x9 : 0000000000000000 x8 : ffff00006fb62b00

[433046.709705] x7 : 0000000000000000 x6 : 000000000000003f

[433046.710181] x5 : 0000000000000040 x4 : 0000000000000000

[433046.710657] x3 : 0000000000000004 x2 : 0000000000000000

[433046.711133] x1 : ffff000000000000 x0 : ffff00006fb62a00

[433046.711610] Call trace:

[433046.711837] 0xb

[433046.712016] Code: bad PC value

[433046.712298] ---[ end trace ac904cdd631dd942 ]---

[433046.712714] Internal error: Oops: 86000004 [#2] PREEMPT SMP

[433046.713212] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge rfkill governor_performance zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) r8152 snd_soc_hdmi_codec snd_soc_rockchip_i2s snd_soc_core snd_pcm_dmaengine snd_pcm snd_timer panfrost snd gpu_sched soundcore leds_pwm gpio_charger pwm_fan rockchip_rga videobuf2_dma_sg hantro_vpu(C) rockchip_vdec(C) v4l2_h264 videobuf2_dma_contig videobuf2_vmalloc v4l2_mem2mem videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc fusb30x(C) zstd sg gpio_beeper cpufreq_dt zram sch_fq_codel lm75 ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek rockchipdrm analogix_dp dw_hdmi dwmac_rk dw_mipi_dsi stmmac_platform drm_kms_helper cec stmmac rc_core

[433046.713298] mdio_xpcs drm drm_panel_orientation_quirks adc_keys

[433046.721466] CPU: 4 PID: 248273 Comm: z_rd_int Tainted: P      D  C OE     5.8.17-rockchip64 #20.08.21

[433046.722281] Hardware name: Helios64 (DT)

[433046.722637] pstate: 80000005 (Nzcv daif -PAN -UAO BTYPE=--)

[433046.723135] pc : 0xf9ff8000091f3190

[433046.723464] lr : avl_find+0x68/0xc8 [zavl]

[433046.723833] sp : ffff800019c73a40

[433046.724134] x29: ffff800019c73a40 x28: ffff000080c2afa8

[433046.724611] x27: ffff0000b604c9a8 x26: ffff0000b604c9c8

[433046.725088] x25: ffff0000b6743090 x24: 0000000000000000

[433046.725565] x23: ffff800019c73af0 x22: ffff8000091f40d8

[433046.726041] x21: ffff00005c0be900 x20: ffff000056059e00

[433046.726517] x19: ffff000056059e00 x18: 000021ba4d598e5d

[433046.726994] x17: 0000003fa86bd6e8 x16: 0000014c01a9b0f1

[433046.727470] x15: 0000000000000000 x14: 0000000000000000

[433046.727946] x13: 0000000000000008 x12: ffff0000e5b3b028

[433046.728422] x11: 0000000000000100 x10: 0000000000000001

[433046.728898] x9 : 0000000000000000 x8 : 000000000023e0e8

[433046.729373] x7 : 000000000023e120 x6 : 0000000000000001

[433046.729849] x5 : 0000000000000001 x4 : 0000000000000000

[433046.730325] x3 : 0000000000000000 x2 : 0000000000000100

[433046.730801] x1 : 0000000000000000 x0 : 00000000ffffffff

[433046.731277] Call trace:

[433046.731503] 0xf9ff8000091f3190

[433046.731948] dsl_scan_prefetch+0x1a8/0x228 [zfs]

[433046.732490] dsl_scan_prefetch_dnode+0x8c/0x110 [zfs]

[433046.733068] dsl_scan_prefetch_cb+0x21c/0x268 [zfs]

[433046.733630] arc_read_done+0x20c/0x3f8 [zfs]

[433046.734140] zio_done+0x254/0xd40 [zfs]

[433046.734634] zio_execute+0xac/0x110 [zfs]

[433046.735016] taskq_thread+0x298/0x440 [spl]

[433046.735402] kthread+0x118/0x150

[433046.735700] ret_from_fork+0x10/0x34

[433046.736031] Code: bad PC value

[433046.736315] ---[ end trace ac904cdd631dd943 ]---

 


I've been testing my Helios64 as well. I'm running Armbian 20.08.21 Focal, but I also downloaded the Armbian kernel build script from GitHub and built linux-image-current-rockchip64-20.11.0-trunk, which is a 5.9.9 kernel. I installed that, then built OpenZFS 2.0.0-rc6. I then used syncoid to replicate 2.15 TB of snapshots to it while also running a scrub, and got the load average up to 10+. The machine ran through the night, so I think it might be stable. A few more days of testing will confirm this.
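
In case anyone wants to reproduce that setup, the steps were roughly as below. This is a sketch: the build options and the ZFS source directory are assumptions, not an exact record.

# build a current (5.9.x) kernel package with the Armbian build framework
git clone https://github.com/armbian/build
cd build
./compile.sh BOARD=helios64 BRANCH=current KERNEL_ONLY=yes

# then build OpenZFS 2.0.0-rc6 from source against the new kernel
# (headers for the running kernel must be installed first)
cd /usr/src/zfs-2.0.0-rc6
sh autogen.sh && ./configure && make -j$(nproc)
make install && ldconfig && depmod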

 

schu


I see a lot of stability-issue posts around this board. Do we know if this is purely a kernel issue, as described here: https://blog.kobol.io/2020/10/27/helios64-software-issue/?

Or is it maybe a combination of things, such as ZFS plus the latest kernel? My Helios64 was delivered today, and I plan on a RAID setup, but not with ZFS. So I guess I will see for myself soon. :)


I'll defer to the Kobol folks. In the previous mega-thread the statement was made that the issues should have been fixed in a new version that ensured the hardware tweaks were correctly applied, but for me things have never been properly stable, even on a vanilla install. The only semi-stable workaround has been to reduce the clock speed, which is fine for now.
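
For anyone wanting to do the same, the reduced clock can be made persistent through the cpufrequtils defaults that Armbian reads at boot. A sketch mirroring the 816 MHz / 1.2 GHz performance setup shown earlier in the thread:

# /etc/default/cpufrequtils
ENABLE=true
GOVERNOR=performance
MIN_SPEED=816000
MAX_SPEED=1200000
# then reboot, or restart the cpufrequtils service, for it to take effect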


5.9.9 with the Armbian patches is working well for me so far. I've scrubbed the pool 5-6 times, as well as run syncoid from my hypervisor every hour for the last two days. I'm mostly just looking for a stable backup system that supports ZFS, and it looks like this will work.
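
For context, the hourly replication is just a cron entry along these lines (host, dataset, and binary path are placeholders):

# /etc/cron.d/syncoid -- pull snapshots from the hypervisor every hour
0 * * * * root /usr/sbin/syncoid root@hypervisor:tank/vms backup/vms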


@jbergler

5.8.x and 5.9.x are working here as well, but I'm not using ZFS, just plain-vanilla mdadm RAID with LVM2, formatted as XFS.

If you have an extra set of HDDs, could you try building a new data pool with mdadm or LVM2 to test your setup?

Since you're getting memory-related errors, is there a way for you to run a memory test on your board?

Have you checked if the heatsink is seated properly over the components of the board?
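
To be concrete about the memory test: short of a full offline test, memtester can at least exercise a chunk of RAM from userspace. A rough sketch (the 2 GB size is arbitrary; leave enough of the board's 4 GB free for the OS):

apt install memtester
# lock and pattern-test 2 GB of RAM for 3 passes
memtester 2048M 3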

 


Did more testing over the weekend on 5.9.9. I was able to benchmark with fio on top of a ZFS dataset for hours, with the load average hitting 10+ while scrubbing the datastore. No issues. Right now the uptime is 3 days.
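
The fio runs were nothing exotic, roughly along these lines (dataset path and job parameters below are placeholders, not an exact record):

# mixed random read/write against a file on the ZFS dataset
fio --name=zfs-randrw --directory=/tank/fio --size=4G \
    --rw=randrw --bs=128k --numjobs=4 \
    --runtime=600 --time_based --group_reporting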

 

I'm actually a little surprised at the performance.  It's very decent for what it is. 

 

I wonder if the fact that I'm running ZFS and 5.9.9 while others are using mdadm and 5.8 is the difference. I'm not really planning on going backwards on either. If 5.9.9 works, then there's no need to build another kernel, and you would have to pry ZFS out of my cold dead hands. I've spent enough of my life layering encryption/compression on top of partitions on top of volume management on top of partitions on top of disks. ZFS is just better, and having performance-penalty-free snapshots that I can replicate to other hosts over SSH is the icing on the cake.
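
For anyone new to it, that replication is essentially just an incremental zfs send piped over SSH; syncoid automates exactly this. A sketch with placeholder names:

# send only the delta between two snapshots to another host
zfs snapshot tank/data@today
zfs send -i tank/data@yesterday tank/data@today | ssh backuphost zfs receive -F backup/data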

 

3 hours ago, akschu said:

I'm not really planning on going backwards on either. If 5.9.9 works, then there's no need to build another kernel, and you would have to pry ZFS out of my cold dead hands. I've spent enough of my life layering encryption/compression on top of partitions on top of volume management on top of partitions on top of disks. ZFS is just better, and having performance-penalty-free snapshots that I can replicate to other hosts over SSH is the icing on the cake.

 

Amen!

 

I have been following this forum with great interest and suspect it's only a matter of time until I buy one of these devices (or maybe I'll wait for an ECC version).

 

Thanks to everyone testing and contributing feedback toward getting these devices stable; I for one certainly appreciate it (and I am sure others do/will as well).
