jbergler Posted November 12, 2020 (edited)

Continuing the discussion from here.

On a clean install of 20.08.21 I'm able to crash the box within a few hours of it being under load. It appears that the optimisations are being applied:

root@helios64:~# cat /proc/sys/net/core/rps_sock_flow_entries
32768

The suggestion @ShadowDance made to switch to the performance governor hasn't helped. Anecdotally, I think I remember the crashes always mentioning page faults, and early on there was some discussion about memory timing. Is it possible this is still that issue?

Edited November 12, 2020 by jbergler: spelling and some extra details
aprayoga Posted November 13, 2020

On 11/11/2020 at 4:08 PM, jbergler said: I also tried the suggestion to set a performance governor, and for shits and giggles I reduced the max cpu frequency, but that hasn’t made a difference. System still locks up within a few hours.

What was the max cpu freq you set? Could you try with the performance governor at 1.2 GHz and at 816 MHz? How did you load the system? Did you encounter a kernel crash on 20.08.10?
jbergler Posted November 16, 2020 Author

On 11/13/2020 at 5:31 PM, aprayoga said: Did you encounter a kernel crash on 20.08.10?

It's hard to say for sure. I never quite had a stable system, but back then I also wasn't generating the kind of load I am now.

On 11/13/2020 at 5:31 PM, aprayoga said: What was the max cpu freq you set?

I had only reduced it one step. I'm trying again now with the settings you suggest.

root@helios64:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq
performance
root@helios64:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq | uniq
816000
root@helios64:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq | uniq
1200000

The load I'm generating is a zfs scrub on a 37TB pool across all five disks.
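The post only shows the cpufreq values being read back, not how they were set. A minimal sketch of one way to apply the same governor and frequency limits through the standard cpufreq sysfs files follows. The function name `set_cpufreq` and the `CPU_SYSFS_ROOT` override are hypothetical names introduced here for illustration, and the settings do not persist across a reboot.

```shell
#!/bin/sh
# Sketch: write a governor and min/max frequency (in kHz) to every CPU's
# cpufreq directory. set_cpufreq and CPU_SYSFS_ROOT are illustrative names,
# not part of Armbian; CPU_SYSFS_ROOT only exists so the function can be
# tried against a scratch directory instead of the real sysfs tree.
set_cpufreq() {
    governor=$1
    min_khz=$2
    max_khz=$3
    root=${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}

    for dir in "$root"/cpu[0-9]*/cpufreq; do
        # Skip anything that is not a cpufreq directory (or an unmatched glob).
        [ -d "$dir" ] || continue
        echo "$governor" > "$dir/scaling_governor"
        # Assumption: writing the maximum before the minimum is fine when
        # narrowing the range, as in this thread.
        echo "$max_khz" > "$dir/scaling_max_freq"
        echo "$min_khz" > "$dir/scaling_min_freq"
    done
}

# Usage (as root), with the values from the post:
#   set_cpufreq performance 816000 1200000
```

Reading the files back with the `cat ... | uniq` commands from the post is a quick way to confirm that every core picked up the same values.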
jbergler Posted November 16, 2020 Author

After about an hour of the ZFS scrub the "bad PC value" error happened again; however, this time the system didn't hard lock. A decent number of ZFS-related processes are stuck in uninterruptible I/O, I can't export the pool, etc. I did see the system crash like this occasionally without the cpufreq tweaks, so I'm not sure it tells us anything new. I will try again. Note: the relatively high uptime is from the system sitting idle for ~5 days before I put it under load again.

Spoiler
[433046.690213] Unable to handle kernel paging request at virtual address f9ff8000091f3190 [433046.690218] Internal error: SP/PC alignment exception: 8a000000 [#1] PREEMPT SMP [433046.690224] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge rfkill governor_performance zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) r8152 snd_soc_hdmi_codec snd_soc_rockchip_i2s snd_soc_core snd_pcm_dmaengine snd_pcm snd_timer panfrost snd gpu_sched soundcore leds_pwm gpio_charger pwm_fan rockchip_rga videobuf2_dma_sg hantro_vpu(C) rockchip_vdec(C) v4l2_h264 videobuf2_dma_contig videobuf2_vmalloc v4l2_mem2mem videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc fusb30x(C) zstd sg gpio_beeper cpufreq_dt zram sch_fq_codel lm75 ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek rockchipdrm analogix_dp dw_hdmi dwmac_rk dw_mipi_dsi stmmac_platform drm_kms_helper cec stmmac rc_core [433046.690323] mdio_xpcs [433046.690976] Mem abort info: [433046.691593] drm drm_panel_orientation_quirks adc_keys [433046.699701] ESR = 0x86000004 [433046.700155] CPU: 5 PID: 248302 Comm: z_rd_int Tainted: P C OE 5.8.17-rockchip64 #20.08.21 [433046.700433]
EC = 0x21: IABT (current EL), IL = 32 bits [433046.701245] Hardware name: Helios64 (DT) [433046.701718] SET = 0, FnV = 0 [433046.702073] pstate: 40000005 (nZcv daif -PAN -UAO BTYPE=--) [433046.702373] EA = 0, S1PTW = 0 [433046.702850] pc : 0xb [433046.703132] [f9ff8000091f3190] address between user and kernel address ranges [433046.703334] lr : 0xb [433046.704168] sp : ffff800019d53a40 [433046.704469] x29: ffff0000b604c000 x28: ffff0000f6c03a00 [433046.704946] x27: ffff000045281600 x26: 000000000000000b [433046.705421] x25: ffff800011a10000 x24: 0000000000000000 [433046.705897] x23: 0000000000000000 x22: 0080000000000000 [433046.706374] x21: 0000000000042c00 x20: ffff000092ff8d88 [433046.706849] x19: ffff000045281600 x18: 00001e1e0a99c21b [433046.707326] x17: 00000030510320ae x16: 000000fe01cf8d4b [433046.707801] x15: 0000000000000000 x14: 0000000000000000 [433046.708277] x13: 0000000000000008 x12: ffff0000d8f2ea28 [433046.708753] x11: 0000000000000020 x10: 0000000000000001 [433046.709229] x9 : 0000000000000000 x8 : ffff00006fb62b00 [433046.709705] x7 : 0000000000000000 x6 : 000000000000003f [433046.710181] x5 : 0000000000000040 x4 : 0000000000000000 [433046.710657] x3 : 0000000000000004 x2 : 0000000000000000 [433046.711133] x1 : ffff000000000000 x0 : ffff00006fb62a00 [433046.711610] Call trace: [433046.711837] 0xb [433046.712016] Code: bad PC value [433046.712298] ---[ end trace ac904cdd631dd942 ]--- [433046.712714] Internal error: Oops: 86000004 [#2] PREEMPT SMP [433046.713212] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge rfkill governor_performance zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) r8152 snd_soc_hdmi_codec snd_soc_rockchip_i2s snd_soc_core snd_pcm_dmaengine snd_pcm snd_timer panfrost snd gpu_sched soundcore leds_pwm gpio_charger 
pwm_fan rockchip_rga videobuf2_dma_sg hantro_vpu(C) rockchip_vdec(C) v4l2_h264 videobuf2_dma_contig videobuf2_vmalloc v4l2_mem2mem videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc fusb30x(C) zstd sg gpio_beeper cpufreq_dt zram sch_fq_codel lm75 ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek rockchipdrm analogix_dp dw_hdmi dwmac_rk dw_mipi_dsi stmmac_platform drm_kms_helper cec stmmac rc_core [433046.713298] mdio_xpcs drm drm_panel_orientation_quirks adc_keys [433046.721466] CPU: 4 PID: 248273 Comm: z_rd_int Tainted: P D C OE 5.8.17-rockchip64 #20.08.21 [433046.722281] Hardware name: Helios64 (DT) [433046.722637] pstate: 80000005 (Nzcv daif -PAN -UAO BTYPE=--) [433046.723135] pc : 0xf9ff8000091f3190 [433046.723464] lr : avl_find+0x68/0xc8 [zavl] [433046.723833] sp : ffff800019c73a40 [433046.724134] x29: ffff800019c73a40 x28: ffff000080c2afa8 [433046.724611] x27: ffff0000b604c9a8 x26: ffff0000b604c9c8 [433046.725088] x25: ffff0000b6743090 x24: 0000000000000000 [433046.725565] x23: ffff800019c73af0 x22: ffff8000091f40d8 [433046.726041] x21: ffff00005c0be900 x20: ffff000056059e00 [433046.726517] x19: ffff000056059e00 x18: 000021ba4d598e5d [433046.726994] x17: 0000003fa86bd6e8 x16: 0000014c01a9b0f1 [433046.727470] x15: 0000000000000000 x14: 0000000000000000 [433046.727946] x13: 0000000000000008 x12: ffff0000e5b3b028 [433046.728422] x11: 0000000000000100 x10: 0000000000000001 [433046.728898] x9 : 0000000000000000 x8 : 000000000023e0e8 [433046.729373] x7 : 000000000023e120 x6 : 0000000000000001 [433046.729849] x5 : 0000000000000001 x4 : 0000000000000000 [433046.730325] x3 : 0000000000000000 x2 : 0000000000000100 [433046.730801] x1 : 0000000000000000 x0 : 00000000ffffffff [433046.731277] Call trace: [433046.731503] 0xf9ff8000091f3190 [433046.731948] dsl_scan_prefetch+0x1a8/0x228 [zfs] [433046.732490] dsl_scan_prefetch_dnode+0x8c/0x110 [zfs] [433046.733068] 
dsl_scan_prefetch_cb+0x21c/0x268 [zfs] [433046.733630] arc_read_done+0x20c/0x3f8 [zfs] [433046.734140] zio_done+0x254/0xd40 [zfs] [433046.734634] zio_execute+0xac/0x110 [zfs] [433046.735016] taskq_thread+0x298/0x440 [spl] [433046.735402] kthread+0x118/0x150 [433046.735700] ret_from_fork+0x10/0x34 [433046.736031] Code: bad PC value [433046.736315] ---[ end trace ac904cdd631dd943 ]---
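The trace above prints `ESR = 0x86000004` next to `EC = 0x21: IABT (current EL), IL = 32 bits`. As a worked example, the sketch below splits an arm64 ESR value into its architectural fields (EC in bits 31:26, IL in bit 25, ISS in bits 24:0) so the kernel's decoding can be checked by hand. The function name `decode_esr` is a hypothetical helper, not a kernel or Armbian tool.

```shell
#!/bin/sh
# Sketch: decode the top-level fields of an arm64 ESR_ELx value.
# decode_esr is an illustrative helper name introduced for this example.
decode_esr() {
    esr=$(( $1 ))
    ec=$(( (esr >> 26) & 0x3f ))    # exception class, bits [31:26]
    il=$(( (esr >> 25) & 0x1 ))     # instruction length bit, bit 25 (1 = 32-bit)
    iss=$(( esr & 0x1ffffff ))      # instruction specific syndrome, bits [24:0]
    printf 'EC=0x%x IL=%d ISS=0x%x\n' "$ec" "$il" "$iss"
}

# Usage, with the value from the first oops:
#   decode_esr 0x86000004    # prints: EC=0x21 IL=1 ISS=0x4
```

EC 0x21 is an instruction abort taken at the current exception level, which fits the "bad PC value" lines: the CPU tried to fetch an instruction from an address that is not mapped.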
jbergler Posted November 20, 2020 Author

I had one more crash and another soft lockup, but otherwise the box is much more usable.

@aprayoga There is definitely still something not running right, even at the lower clock speeds. My limited knowledge suggests something memory related, but that's all I've got. If you'd like me to test anything else, let me know.
akschu Posted November 20, 2020 (edited)

I've been testing my Helios64 as well. I'm running Armbian 20.08.21 Focal, but I also downloaded the kernel build script from GitHub and built linux-image-current-rockchip64-20.11.0-trunk, which is a 5.9.9 kernel. I installed that, then built OpenZFS 2.0.0-rc6. I then used syncoid to send 2.15TB of snapshots to it while also running a scrub, and was able to get the load average up to 10+. The machine ran through the night, so I think it might be stable. A few more days of testing will validate this.

schu

Edited November 20, 2020 by akschu: speling
jmanes Posted November 20, 2020

I see a lot of posts about stability issues around this board. Do we know if this issue is related purely to the kernel, as was stated here: https://blog.kobol.io/2020/10/27/helios64-software-issue/? Or is this maybe a combination of things, such as ZFS and the latest kernel? My Helios64 was delivered today, and I plan on a RAID setup but not with ZFS. So I guess I will see for myself soon.
jbergler Posted November 20, 2020 Author

I'll defer to the Kobol folks. In the previous mega thread the statement was made that the issues should have been fixed in a new version that ensured the hardware tweaks were correctly applied. For me, things have never been properly stable, even on just a vanilla install. The only semi-stable solution has been to reduce the clock speed, which is fine for now.
akschu Posted November 22, 2020

5.9.9 with Armbian patches is working well for me so far. I've scrubbed the pool 5-6 times and have run syncoid from my hypervisor every hour for the last two days. I'm mostly just looking for a stable backup system that supports ZFS, and it looks like this will work.
SIGSEGV Posted November 22, 2020

@jbergler 5.8.x & 5.9.x are working here as well, but I'm not using ZFS, just plain vanilla mdadm RAID and LVM2 formatted as XFS. If you have an extra set of HDDs, could you try building a new data pool with mdadm or LVM2 to test your setup? Since you're getting memory-related errors, is there a way for you to run a memory test on your board? Have you checked whether the heatsink is seated properly over the components of the board?
barnumbirr Posted November 23, 2020

5.8.x had been running fine on my device for about 9 and a half days, then randomly crashed. No logs seem to have survived the crash, so this is going to be nearly impossible to debug.
akschu Posted November 23, 2020

Did more testing over the weekend on 5.9.9. I was able to benchmark with fio on top of a ZFS dataset for hours, with the load average hitting 10+ while scrubbing the datastore. No issues. Right now the uptime is 3 days.

I'm actually a little surprised at the performance. It's very decent for what it is.

I wonder if the fact that I'm running ZFS and 5.9.9 while others are using mdadm and 5.8 is the difference. I'm not really planning on going backwards on either. If 5.9.9 works then there is no need to build another kernel, and you would have to pry ZFS out of my cold dead hands. I've spent enough of my life layering encryption/compression on top of partitions on top of volume management on top of partitions on top of disks. ZFS is just better, and having performance-penalty-free snapshots that I can replicate to other hosts over SSH is the icing on the cake.
TRS-80 Posted November 23, 2020

3 hours ago, akschu said: I'm not really planning on going backwards on either. If 5.9.9 works then there is no need to build another kernel, and you would have to pry ZFS out of my cold dead hands. I've spent enough of my life layering encryption/compression on top of partitions on top of volume management on top of partitions on top of disks. ZFS is just better, and having performance-penalty-free snapshots that I can replicate to other hosts over SSH is the icing on the cake.

Amen! I have been following this forum with great interest and suspect it's only a matter of time until I buy one of these devices (or maybe wait for the ECC one). Thanks to everyone testing and contributing feedback toward getting these devices stable. I for one certainly appreciate it (I am sure others do/will as well).
aprayoga Posted November 25, 2020

@jbergler Could you try the attached u-boot? It contains an updated Rockchip blob (DDR driver & ATF).

Install it with

dpkg -i linux-u-boot-current-helios64_20.11.0-trunk_arm64.deb

After that, run armbian-config > System > Install > 5 Install/Update the bootloader on SD/eMMC

If you are using an SD card, make sure to clean the bootloader on the eMMC. You can run

dd if=/dev/zero of=/dev/mmcblk1 seek=64 count=30000

Power cycle the system. It should boot with the new bootloader.

Spoiler
DDR Version 1.24 20191016 RevNocRL In channel 0 CS = 0 MR0=0x18 MR4=0x1 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0 channel 1 CS = 0 MR0=0x18 MR4=0x1 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0 channel 0 training pass! channel 1 training pass! change freq to 416MHz 0,1 Channel 0: LPDDR4,416MHz Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB Channel 1: LPDDR4,416MHz Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB 256B stride channel 0 CS = 0 MR0=0x18 MR4=0x1 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0 channel 1 CS = 0 MR0=0x18 MR4=0x1 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0 channel 0 training pass! channel 1 training pass!
channel 0, cs 0, advanced training done channel 1, cs 0, advanced training done change freq to 856MHz 1,0 ch 0 ddrconfig = 0x101, ddrsize = 0x40 ch 1 ddrconfig = 0x101, ddrsize = 0x40 pmugrf_os_reg[2] = 0x32C1F2C1, stride = 0xD ddr_set_rate to 328MHZ ddr_set_rate to 666MHZ ddr_set_rate to 416MHZ, ctl_index 0 ddr_set_rate to 856MHZ, ctl_index 1 support 416 856 328 666 MHz, current 856MHz OUT Boot1 Release Time: May 29 2020 17:36:36, version: 1.26 CPUId = 0x0 ChipType = 0x10, 352 SdmmcInit=2 0 BootCapSize=100000 UserCapSize=14910MB FwPartOffset=2000 , 100000 mmc0:cmd5,20 SdmmcInit=0 0 BootCapSize=0 UserCapSize=15103MB FwPartOffset=2000 , 0 StorageInit ok = 67151 SecureMode = 0 SecureInit read PBA: 0x4 SecureInit read PBA: 0x404 SecureInit read PBA: 0x804 SecureInit read PBA: 0xc04 SecureInit read PBA: 0x1004 SecureInit read PBA: 0x1404 SecureInit read PBA: 0x1804 SecureInit read PBA: 0x1c04 SecureInit ret = 0, SecureMode = 0 atags_set_bootdev: ret:(0) GPT 0x3335db8 signature is wrong recovery gpt... GPT 0x3335db8 signature is wrong recovery gpt fail! Trust Addr:0x4000, 0x58334c42 No find bl30.bin No find bl32.bin Load uboot, ReadLba = 2000 Load OK, addr=0x200000, size=0xdd6b0 RunBL31 0x40000 @ 191346 us NOTICE: BL31: v1.3(debug):2803a2c8a NOTICE: BL31: Built : 14:31:03, May 19 2020 NOTICE: BL31: Rockchip release version: v1.1 INFO: GICv3 with legacy support detected. ARM GICV3 driver initialized in EL3 INFO: Using opteed sec cpu_context! INFO: boot cpu mask: 0 INFO: plat_rockchip_pmu_init(1191): pd status 3e INFO: BL31: Initializing runtime services WARNING: No OPTEE provided by BL2 boot loader, Booting device without OPTEE initialization. 
SMC`s destined for OPTEE will return SMC_UNK ERROR: Error initializing runtime service opteed_fast INFO: BL31: Preparing for EL3 exit to normal world INFO: Entry point address = 0x200000 INFO: SPSR = 0x3c9 U-Boot 2020.07-armbian (Nov 25 2020 - 07:14:05 +0700) SoC: Rockchip rk3399 Reset cause: POR DRAM: 3.9 GiB PMIC: RK808 SF: Detected w25q128 with page size 256 Bytes, erase size 4 KiB, total 16 MiB MMC: mmc@fe320000: 1, sdhci@fe330000: 0 Loading Environment from MMC... *** Warning - bad CRC, using default environment In: serial Out: serial Err: serial Model: Helios64 Revision: 1.2 - 4GB non ECC Net: eth0: ethernet@fe300000 scanning bus for devices... Hit any key to stop autoboot: 0 switch to partitions #0, OK mmc1 is current device Scanning mmc 1:1... Found U-Boot script /boot/boot.scr 3185 bytes read in 6 ms (517.6 KiB/s) ## Executing script at 00500000 Boot script loaded from mmc 1 166 bytes read in 5 ms (32.2 KiB/s) 14091886 bytes read in 600 ms (22.4 MiB/s) 27331072 bytes read in 1157 ms (22.5 MiB/s) 79946 bytes read in 13 ms (5.9 MiB/s) 2698 bytes read in 10 ms (262.7 KiB/s) Applying kernel provided DT fixup script (rockchip-fixup.scr) ## Executing script at 09000000 ## Loading init Ramdisk from Legacy Image at 06000000 ... Image Name: uInitrd Image Type: AArch64 Linux RAMDisk Image (gzip compressed) Data Size: 14091822 Bytes = 13.4 MiB Load Address: 00000000 Entry Point: 00000000 Verifying Checksum ... OK ## Flattened Device Tree blob at 01f00000 Booting using the fdt blob at 0x1f00000 Loading Ramdisk to f5176000, end f5ee662e ... OK Loading Device Tree to 00000000f50fa000, end 00000000f5175fff ... OK Starting kernel ...

Please take note of the binary versions in the boot log:

DDR Version 1.24 20191016 RevNocRL
NOTICE: BL31: Built : 14:31:03, May 19 2020
U-Boot 2020.07-armbian (Nov 25 2020 - 07:14:05 +0700)

Then try to trigger the kernel crash.
---
If you want to restore the original u-boot, you can run

apt install linux-u-boot-helios64-current=20.08.21

and update the u-boot using armbian-config.

---
There is a built-in memory tester in the Linux kernel. Just add this line to /boot/armbianEnv.txt:

extraargs=memtest=10

You can change the number of loops (10). The test takes quite some time to run. You can see the result using dmesg.

linux-u-boot-current-helios64_20.11.0-trunk_arm64.deb
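The memtest step can be scripted so that an existing `extraargs=` line is extended rather than duplicated. This is only a sketch under stated assumptions: the function name `add_memtest_arg` is hypothetical, the file is assumed to hold plain `key=value` lines ending in a newline, and `sed -i` is assumed to be GNU sed as shipped with Armbian.

```shell
#!/bin/sh
# Sketch: make sure an armbianEnv.txt-style file carries memtest=<passes>
# in its extraargs line. add_memtest_arg is an illustrative helper name.
add_memtest_arg() {
    env_file=$1
    passes=$2

    if grep -q '^extraargs=.*memtest=' "$env_file"; then
        # A memtest argument is already present: leave the file untouched.
        return 0
    elif grep -q '^extraargs=' "$env_file"; then
        # Append to the existing extraargs line, keeping its other arguments.
        sed -i "s/^extraargs=.*/& memtest=$passes/" "$env_file"
    else
        # No extraargs line yet: add one.
        echo "extraargs=memtest=$passes" >> "$env_file"
    fi
}

# Usage (as root), followed by a reboot:
#   add_memtest_arg /boot/armbianEnv.txt 10
# After the reboot, the test output can be searched for in the kernel log:
#   dmesg | grep -i memtest
```

Removing the argument again after testing avoids paying the test time on every boot.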
jbergler Posted November 25, 2020 Author (edited)

My first attempt with the new u-boot, and with the cpufreq tweaks removed, results in a new panic.

Spoiler
[588872.135762] reboot: Restarting system DDR Version 1.24 20191016 RevNocRL In soft reset SRX channel 0 CS = 0 MR0=0x18 MR4=0x1 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0 channel 1 CS = 0 MR0=0x18 MR4=0x1 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0 channel 0 training pass! channel 1 training pass! change freq to 416MHz 0,1 Channel 0: LPDDR4,416MHz Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB Channel 1: LPDDR4,416MHz Bus Width=32 Col=10 Bank=8 Row=16 CS=1 Die Bus-Width=16 Size=2048MB 256B stride channel 0 CS = 0 MR0=0x18 MR4=0x1 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0 channel 1 CS = 0 MR0=0x18 MR4=0x1 MR5=0x1 MR8=0x10 MR12=0x72 MR14=0x72 MR18=0x0 MR19=0x0 MR24=0x8 MR25=0x0 channel 0 training pass! channel 1 training pass!
channel 0, cs 0, advanced training done channel 1, cs 0, advanced training done change freq to 856MHz 1,0 ch 0 ddrconfig = 0x101, ddrsize = 0x40 ch 1 ddrconfig = 0x101, ddrsize = 0x40 pmugrf_os_reg[2] = 0x32C1F2C1, stride = 0xD ddr_set_rate to 328MHZ ddr_set_rate to 666MHZ ddr_set_rate to 416MHZ, ctl_index 0 ddr_set_rate to 856MHZ, ctl_index 1 support 416 856 328 666 MHz, current 856MHz OUT Boot1 Release Time: May 29 2020 17:36:36, version: 1.26 CPUId = 0x0 ChipType = 0x10, 447 SdmmcInit=2 0 BootCapSize=100000 UserCapSize=14910MB FwPartOffset=2000 , 100000 mmc0:cmd8,20 mmc0:cmd5,20 mmc0:cmd55,20 mmc0:cmd1,20 mmc0:cmd8,20 mmc0:cmd5,20 mmc0:cmd55,20 mmc0:cmd1,20 mmc0:cmd8,20 mmc0:cmd5,20 mmc0:cmd55,20 mmc0:cmd1,20 SdmmcInit=0 1 StorageInit ok = 69105 SecureMode = 0 SecureInit read PBA: 0x4 SecureInit read PBA: 0x404 SecureInit read PBA: 0x804 SecureInit read PBA: 0xc04 SecureInit read PBA: 0x1004 SecureInit read PBA: 0x1404 SecureInit read PBA: 0x1804 SecureInit read PBA: 0x1c04 SecureInit ret = 0, SecureMode = 0 atags_set_bootdev: ret:(0) GPT 0x3335db8 signature is wrong recovery gpt... GPT 0x3335db8 signature is wrong recovery gpt fail! Trust Addr:0x4000, 0x58334c42 No find bl30.bin No find bl32.bin Load uboot, ReadLba = 2000 Load OK, addr=0x200000, size=0xdd6b0 RunBL31 0x40000 @ 96897 us NOTICE: BL31: v1.3(debug):2803a2c8a NOTICE: BL31: Built : 14:31:03, May 19 2020 NOTICE: BL31: Rockchip release version: v1.1 INFO: GICv3 with legacy support detected. ARM GICV3 driver initialized in EL3 INFO: Using opteed sec cpu_context! INFO: boot cpu mask: 0 INFO: plat_rockchip_pmu_init(1191): pd status 3e INFO: BL31: Initializing runtime services WARNING: No OPTEE provided by BL2 boot loader, Booting device without OPTEE initialization. 
SMC`s destined for OPTEE will return SMC_UNK ERROR: Error initializing runtime service opteed_fast INFO: BL31: Preparing for EL3 exit to normal world INFO: Entry point address = 0x200000 INFO: SPSR = 0x3c9 U-Boot 2020.07-armbian (Nov 25 2020 - 07:14:05 +0700) SoC: Rockchip rk3399 Reset cause: RST DRAM: 3.9 GiB PMIC: RK808 SF: Detected w25q128 with page size 256 Bytes, erase size 4 KiB, total 16 MiB MMC: mmc@fe320000: 1, sdhci@fe330000: 0 Loading Environment from MMC... *** Warning - bad CRC, using default environment In: serial Out: serial Err: serial Model: Helios64 Revision: 1.2 - 4GB non ECC Net: eth0: ethernet@fe300000 scanning bus for devices... Hit any key to stop autoboot: 0 Card did not respond to voltage select! switch to partitions #0, OK mmc0(part 0) is current device Scanning mmc 0:1... Found U-Boot script /boot/boot.scr 3185 bytes read in 18 ms (171.9 KiB/s) ## Executing script at 00500000 Boot script loaded from mmc 0 193 bytes read in 15 ms (11.7 KiB/s) 16364137 bytes read in 1576 ms (9.9 MiB/s) 27331072 bytes read in 2614 ms (10 MiB/s) 79946 bytes read in 40 ms (1.9 MiB/s) 2698 bytes read in 32 ms (82 KiB/s) Applying kernel provided DT fixup script (rockchip-fixup.scr) ## Executing script at 09000000 ## Loading init Ramdisk from Legacy Image at 06000000 ... Image Name: uInitrd Image Type: AArch64 Linux RAMDisk Image (gzip compressed) Data Size: 16364073 Bytes = 15.6 MiB Load Address: 00000000 Entry Point: 00000000 Verifying Checksum ... OK ## Flattened Device Tree blob at 01f00000 Booting using the fdt blob at 0x1f00000 Loading Ramdisk to f4f4b000, end f5ee6229 ... OK Loading Device Tree to 00000000f4ecf000, end 00000000f4f4afff ... OK Starting kernel ... 
[ 16.090622] OF: graph: no port node found in /syscon@ff770000/usb2-phy@e450/otg-port [ 16.637382] r8152 2-1.4:1.0 (unnamed net_device) (uninitialized): netif_napi_add() called with weight 256 [ 24.585805] Unable to handle kernel NULL pointer dereference at virtual address 00000000000005cc [ 24.586591] Mem abort info: [ 24.586844] ESR = 0x96000004 [ 24.587120] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.587591] SET = 0, FnV = 0 [ 24.587865] EA = 0, S1PTW = 0 [ 24.588145] Data abort info: [ 24.588404] ISV = 0, ISS = 0x00000004 [ 24.588746] CM = 0, WnR = 0 [ 24.589014] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.589786] Mem abort info: [ 24.590038] ESR = 0x96000004 [ 24.590312] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.590781] SET = 0, FnV = 0 [ 24.591055] EA = 0, S1PTW = 0 [ 24.591336] Data abort info: [ 24.591594] ISV = 0, ISS = 0x00000004 [ 24.591934] CM = 0, WnR = 0 [ 24.592201] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.592973] Mem abort info: [ 24.593225] ESR = 0x96000004 [ 24.593499] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.593968] SET = 0, FnV = 0 [ 24.594241] EA = 0, S1PTW = 0 [ 24.594522] Data abort info: [ 24.594780] ISV = 0, ISS = 0x00000004 [ 24.595121] CM = 0, WnR = 0 [ 24.595388] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.596161] Mem abort info: [ 24.596412] ESR = 0x96000004 [ 24.596686] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.597155] SET = 0, FnV = 0 [ 24.597428] EA = 0, S1PTW = 0 [ 24.597709] Data abort info: [ 24.597967] ISV = 0, ISS = 0x00000004 [ 24.598308] CM = 0, WnR = 0 [ 24.598576] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.599349] Mem abort info: [ 24.599601] ESR = 0x96000004 [ 24.599874] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.600343] SET = 0, FnV = 0 [ 24.600616] EA = 0, S1PTW = 0 [ 24.600896] Data abort info: [ 24.601154] 
ISV = 0, ISS = 0x00000004 [ 24.601495] CM = 0, WnR = 0 [ 24.601762] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.602534] Mem abort info: [ 24.602784] ESR = 0x96000004 [ 24.603058] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.603527] SET = 0, FnV = 0 [ 24.603800] EA = 0, S1PTW = 0 [ 24.604081] Data abort info: [ 24.604339] ISV = 0, ISS = 0x00000004 [ 24.604680] CM = 0, WnR = 0 [ 24.604946] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.605718] Mem abort info: [ 24.605968] ESR = 0x96000004 [ 24.606242] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.606711] SET = 0, FnV = 0 [ 24.606984] EA = 0, S1PTW = 0 [ 24.607263] Data abort info: [ 24.607521] ISV = 0, ISS = 0x00000004 [ 24.607863] CM = 0, WnR = 0 [ 24.608129] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.608901] Mem abort info: [ 24.609153] ESR = 0x96000004 [ 24.609426] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.609895] SET = 0, FnV = 0 [ 24.610168] EA = 0, S1PTW = 0 [ 24.610448] Data abort info: [ 24.610706] ISV = 0, ISS = 0x00000004 [ 24.611048] CM = 0, WnR = 0 [ 24.611314] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.612086] Mem abort info: [ 24.612336] ESR = 0x96000004 [ 24.612610] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.613079] SET = 0, FnV = 0 [ 24.613352] EA = 0, S1PTW = 0 [ 24.613633] Data abort info: [ 24.613891] ISV = 0, ISS = 0x00000004 [ 24.614232] CM = 0, WnR = 0 [ 24.614500] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.615272] Mem abort info: [ 24.615522] ESR = 0x96000004 [ 24.615796] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.616265] SET = 0, FnV = 0 [ 24.616538] EA = 0, S1PTW = 0 [ 24.616819] Data abort info: [ 24.617077] ISV = 0, ISS = 0x00000004 [ 24.617418] CM = 0, WnR = 0 [ 24.617684] Unable to handle kernel NULL pointer dereference at virtual address 
0000000000000040 [ 24.618457] Mem abort info: [ 24.618707] ESR = 0x96000004 [ 24.618980] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.619450] SET = 0, FnV = 0 [ 24.619723] EA = 0, S1PTW = 0 [ 24.620004] Data abort info: [ 24.620262] ISV = 0, ISS = 0x00000004 [ 24.620603] CM = 0, WnR = 0 [ 24.620871] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.621644] Mem abort info: [ 24.621893] ESR = 0x96000004 [ 24.622167] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.622637] SET = 0, FnV = 0 [ 24.622910] EA = 0, S1PTW = 0 [ 24.623191] Data abort info: [ 24.623448] ISV = 0, ISS = 0x00000004 [ 24.623790] CM = 0, WnR = 0 [ 24.624056] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.624828] Mem abort info: [ 24.625080] ESR = 0x96000004 [ 24.625354] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.625823] SET = 0, FnV = 0 [ 24.626096] EA = 0, S1PTW = 0 [ 24.626377] Data abort info: [ 24.626635] ISV = 0, ISS = 0x00000004 [ 24.626976] CM = 0, WnR = 0 [ 24.627244] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.628016] Mem abort info: [ 24.628266] ESR = 0x96000004 [ 24.628539] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.629009] SET = 0, FnV = 0 [ 24.629282] EA = 0, S1PTW = 0 [ 24.629563] Data abort info: [ 24.629821] ISV = 0, ISS = 0x00000004 [ 24.630162] CM = 0, WnR = 0 [ 24.630428] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.631201] Mem abort info: [ 24.631451] ESR = 0x96000004 [ 24.631724] EC = 0x25: DABT (current EL), IL = 32 bits [ 24.632193] SET = 0, FnV = 0 [ 24.632466] EA = 0, S1PTW = 0 [ 24.632747] Data abort info: [ 24.633005] ISV = 0, ISS = 0x00000004 [ 24.633346] CM = 0, WnR = 0 [ 24.633612] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 [ 24.634384] Mem abort info: [ 24.634634] ESR = 0x96000004 [ 24.634908] EC = 0x25: DABT (current EL), IL = 32 bits [ 
24.635377] SET = 0, FnV = 0
[ 24.635650] EA = 0, S1PTW = 0
[ 24.635931] Data abort info:
[ 24.636189] ISV = 0, ISS = 0x00000004
[ 24.636530] CM = 0, WnR = 0
[ 24.636798] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040
[ 24.637570] Mem abort info:
[ 24.637820] ESR = 0x96000004
[ 24.638094] EC = 0x25: DABT (current EL), IL = 32 bits
[ 24.638563] SET = 0, FnV = 0
[ 24.638836] EA = 0, S1PTW = 0
[ 24.639116] Data abort info:
[ 24.639375] ISV = 0, ISS = 0x00000004
[ 24.639715] CM = 0, WnR = 0
[ 24.639982] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040
[ 24.640754] Mem abort info:
[ 24.641004] ESR = 0x96000004
[ 24.641277] EC = 0x25: DABT (current EL), IL = 32 bits
[ 24.641746] SET = 0, FnV = 0
[ 24.642019] EA = 0, S1PTW = 0
[ 24.642300] Data abort info:
[ 24.642558] ISV = 0, ISS = 0x00000004
[ 24.642899] CM = 0, WnR = 0
[ 24.643165] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040
[ 24.643938] Mem abort info:
[ 24.644188] ESR = 0x96000004
[ 24.644461] EC = 0x25: DABT (current EL), IL = 32 bits
[ 24.644930] SET = 0, FnV = 0
[ 24.645203] EA = 0, S1PTW = 0
[ 24.645484] Data abort info:
[ 24.645742] ISV = 0, ISS = 0x00000004
[ 24.646083] CM = 0, WnR = 0
[ 24.646351] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040
[ 24.647123] Mem abort info:
[ 24.647373] ESR = 0x96000004
[ 24.647646] EC = 0x25: DABT (current EL), IL = 32 bits
[ 24.648115] SET = 0, FnV = 0
[ 24.648388] EA = 0, S1PTW = 0
[ 24.648668] Data abort info:
[ 24.648926] ISV = 0, ISS = 0x00000004
[ 24.649267] CM = 0, WnR = 0
[ 24.649533] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040
[ 24.650305] Mem abort info:
[ 24.650555] ESR = 0x96000004
[ 24.650829] EC = 0x25: DABT (current EL), IL = 32 bits
[ 24.651298] SET = 0, FnV = 0
[ 24.651571] EA = 0, S1PTW = 0
[ 24.651852] Data abort info:
[ 24.652109] ISV = 0, ISS = 0x00000004
[ 24.652451] CM = 0, WnR = 0
[ 24.652717] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040
[ 24.653489] Mem abort info:
[ 24.653739] ESR = 0x96000004
[ 24.654013] EC = 0x25: DABT (current EL), IL = 32 bits
[ 24.654482] SET = 0, FnV = 0
[ 24.654755] EA = 0, S1PTW = 0
[ 24.655036] Data abort info:
[ 24.655294] ISV = 0, ISS = 0x00000004
[ 24.655635] CM = 0, WnR = 0
[ 24.656017] Insufficient stack space to handle exception!
[ 24.656021] ESR: 0x96000047 -- DABT (current EL)
[ 24.656022] FAR: 0xffff800011b9fff0
[ 24.656024] Task stack: [0xffff800011ba0000..0xffff800011ba4000]
[ 24.656026] IRQ stack: [0xffff800011ad8000..0xffff800011adc000]
[ 24.656028] Overflow stack: [0xffff0000f77932b0..0xffff0000f77942b0]
[ 24.656031] CPU: 4 PID: 0 Comm: swapper/4 Tainted: P C OE 5.8.17-rockchip64 #20.08.21
[ 24.656032] Hardware name: Helios64 (DT)
[ 24.656034] pstate: 80000085 (Nzcv daIf -PAN -UAO BTYPE=--)
[ 24.656036] pc : format_decode+0x4/0x4a8
[ 24.656037] lr : vsnprintf+0x8c/0x728
[ 24.656039] sp : ffff800011ba0020
[ 24.656040] x29: ffff800011ba0020 x28: ffff8000111db6b8
[ 24.656045] x27: ffff800011a1d238 x26: 0000000000000020
[ 24.656049] x25: 0000000000000000 x24: 00000000000003e0
[ 24.656053] x23: 00000000ffffffc8 x22: ffff800010ecb890
[ 24.656056] x21: ffff800011ba0350 x20: ffff800011a1d238
[ 24.656060] x19: ffff800011a1d618 x18: 0000000000000010
[ 24.656064] x17: 0000000000000001 x16: 0000000000000019
[ 24.656068] x15: ffff0000f6ea5ba8 x14: 0720072007200720
[ 24.656072] x13: 0720072007200720 x12: 0720072007200720
[ 24.656075] x11: ffff800011ba0350 x10: ffff800011ba0350
[ 24.656079] x9 : ffff800011ba0350 x8 : ffff800011ba0350
[ 24.656083] x7 : ffff800011ba0350 x6 : ffff800011ba0350
[ 24.656086] x5 : 0000000000000000 x4 : ffff0000f6ea5700
[ 24.656090] x3 : ffff800011ba00d0 x2 : ffff8000111db6b8
[ 24.656094] x1 : ffff800011ba00a0 x0 : ffff8000111db6b8
[ 24.656098] Kernel panic - not syncing: kernel stack overflow
[ 24.656100] SMP: stopping secondary CPUs
[ 24.656102] Kernel Offset: disabled
[ 24.656103] CPU features: 0x240022,2000600c
[ 24.656105] Memory Limit: none
And trying again
Spoiler
[ 19.133928] Unable to handle kernel paging request at virtual address ffff80000ee0257c
[ 19.134640] Mem abort info:
[ 19.134892] ESR = 0x86000006
[ 19.135169] EC = 0x21: IABT (current EL), IL = 32 bits
[ 19.135639] SET = 0, FnV = 0
[ 19.135913] EA = 0, S1PTW = 0
[ 19.136197] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000035ec000
[ 19.136789] [ffff80000ee0257c] pgd=00000000f7fff003, p4d=00000000f7fff003, pud=00000000f7ffe003, pmd=0000000000000000
[ 19.137730] Internal error: Oops: 86000006 [#1] PREEMPT SMP
[ 19.138224] Modules linked in: zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) r8152 snd_soc_hdmi_codec panfrost snd_soc_rockchip_i2s gpu_sched snd_soc_core leds_pwm snd_pcm_dmaengine pwm_fan gpio_charger snd_pcm hantro_vpu(C) snd_timer rockchip_rga rockchip_vdec(C) snd videobuf2_dma_sg soundcore v4l2_h264 videobuf2_dma_contig videobuf2_vmalloc fusb30x(C) v4l2_mem2mem videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc sg gpio_beeper cpufreq_dt sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace lm75 sunrpc ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek rockchipdrm analogix_dp dw_hdmi dw_mipi_dsi drm_kms_helper cec rc_core dwmac_rk stmmac_platform drm stmmac mdio_xpcs drm_panel_orientation_quirks adc_keys
[ 19.144940] CPU: 4 PID: 0 Comm: swapper/4 Tainted: P C OE 5.8.17-rockchip64 #20.08.21
[ 19.145721] Hardware name: Helios64 (DT)
[ 19.146073] pstate: 80000085 (Nzcv daIf -PAN -UAO BTYPE=--)
[ 19.146572] pc : 0xffff80000ee0257c
[ 19.146894] lr : _raw_spin_lock_irqsave+0x28/0xa0
[ 19.147313] sp : ffff800011adbeb0
[ 19.147609] x29: ffff800011adbeb0 x28: 0000000000000001
[ 19.148083] x27: ffff0000f6ea5700 x26: ffff800011adc000
[ 19.148555] x25: ffff800011501d20 x24: 0000000000000000
[ 19.149027] x23: 0000000000000000 x22: ffff0000f6ea5700
[ 19.149498] x21: ffff0000f77a8b40 x20: 0000000000000080
[ 19.149970] x19: ffff0000f77a8b40 x18: 0000000000000000
[ 19.150441] x17: 0000000000000001 x16: 0000000000000019
[ 19.150912] x15: 0000000000000006 x14: 000010670d7edc0e
[ 19.151384] x13: 00000000000003fd x12: 0000000000000006
[ 19.151855] x11: 0000000000000001 x10: 0000000000000a20
[ 19.152327] x9 : ffff800011ba3e70 x8 : ffff0000f6ea6180
[ 19.152798] x7 : 00000000ffffffff x6 : 00000000351d78da
[ 19.153270] x5 : 00ffffffffffffff x4 : 002b646607bcf500
[ 19.153741] x3 : 0000000000000000 x2 : 0000000000000001
[ 19.154212] x1 : 0000000000000000 x0 : 0000000000000000
[ 19.154684] Call trace:
[ 19.154906] 0xffff80000ee0257c
[ 19.155196] sched_ttwu_pending+0x58/0x168
[ 19.155566] flush_smp_call_function_queue+0xec/0x258
[ 19.156018] generic_smp_call_function_single_interrupt+0x14/0x20
[ 19.156561] handle_IPI+0x258/0x3e8
[ 19.156876] gic_handle_irq+0x154/0x158
[ 19.157220] el1_irq+0xb8/0x180
[ 19.157505] arch_cpu_idle+0x28/0x218
[ 19.157836] default_idle_call+0x1c/0x44
[ 19.158188] do_idle+0x210/0x288
[ 19.158478] cpu_startup_entry+0x28/0x68
[ 19.158830] secondary_start_kernel+0x140/0x178
[ 19.159239] Code: bad PC value
[ 19.159524] ---[ end trace 99042d0e071b2912 ]---
[ 19.159936] Kernel panic - not syncing: Fatal exception in interrupt
[ 19.160500] SMP: stopping secondary CPUs
[ 20.327519] SMP: failed to stop secondary CPUs 3-5
[ 20.327945] Kernel Offset: disabled
[ 20.328257] CPU features: 0x240022,2000600c
[ 20.328629] Memory Limit: none
[ 20.328915] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
Edited November 25, 2020 by jbergler more details
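For anyone comparing the ESR values across these traces, here is a minimal Python sketch that splits an AArch64 ESR value into its top-level fields (EC in bits 31:26, IL in bit 25, ISS in bits 24:0) and checks how far a fault address sits below a stack's lower bound. The names `decode_esr`, `EC_NAMES` and `bytes_below_stack` are hypothetical helpers made up for illustration, and only the two exception classes that appear in the logs above are labelled. It is a reading aid under those assumptions, not a diagnosis of the crash.

```python
# Hypothetical helpers for reading the ESR/FAR values printed in the oopses above.
# Only the exception classes seen in this thread are named; anything else is "unknown".
EC_NAMES = {
    0x21: "IABT (current EL)",  # instruction abort taken without a change in EL
    0x25: "DABT (current EL)",  # data abort taken without a change in EL
}


def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into EC, IL and ISS, plus two ISS sub-fields."""
    ec = (esr >> 26) & 0x3F      # exception class, bits 31:26
    il = (esr >> 25) & 0x1       # instruction length flag, bit 25
    iss = esr & 0x1FFFFFF        # instruction specific syndrome, bits 24:0
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "unknown"),
        "il_bits": 32 if il else 16,
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,  # write-not-read; only meaningful for data aborts
        "fsc": iss & 0x3F,        # fault status code, bits 5:0
    }


def bytes_below_stack(far: int, stack_low: int) -> int:
    """Return how many bytes the fault address lies below the stack's lower bound."""
    return stack_low - far


if __name__ == "__main__":
    for value in (0x96000004, 0x86000006, 0x96000047):
        print(hex(value), decode_esr(value))
    # FAR and task stack lower bound from the "Insufficient stack space" report.
    print(bytes_below_stack(0xFFFF800011B9FFF0, 0xFFFF800011BA0000))
```

Run against the values above, this reproduces the EC and ISS fields the kernel already printed, and shows that the fault address in the stack overflow report lies just below the task stack's lower bound.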
jbergler Posted November 25, 2020 Author Posted November 25, 2020 I cold booted the box, and now it seems to behave just fine. Will run some load testing overnight and report back.
jbergler Posted November 25, 2020 Author Posted November 25, 2020 Box locked up overnight, nothing on the console.
TheLinuxBug Posted November 25, 2020 Posted November 25, 2020 @jbergler Do you have an ATX power supply you can hook the drives up to, so you can test powering them that way? As I mentioned in a few other threads, I believe this may be a power delivery issue under load. Someone will need to test with an alternative power supply to confirm this, though, as I do not have a Helios64. I have 2x RockPi 4c with an M.2 to PCIe x4 adapter and an 8-port SATA card, one running a 6x3TB mdadm RAID and the other a 9x2TB mdadm RAID with one drive via USB 3.0, running 24/7, but I am using an actual ATX power supply to power all my drives, not a built-in power supply like the Helios has. Of all the people reporting this, someone will need to test and confirm.
--
root@rockpi-4c:~# uname -a
Linux rockpi-4c 5.8.6-rockchip64 #20.08.1 SMP PREEMPT Thu Sep 3 18:03:42 CEST 2020 aarch64 aarch64 aarch64 GNU/Linux
root@rockpi-4c:~# uptime
13:13:59 up 10 days, 21:11, 7 users, load average: 0.10, 0.09, 0.09
root@rockpi-4c:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid5 sdi[0] sdg[1] sdf[3] sdh[8] sda[4] sdc[6] sde[2] sdb[5] sdd[7]
      15627059200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      bitmap: 0/15 pages [0KB], 65536KB chunk
--
root@rockpi-4c:~# uname -a
Linux rockpi-4c 5.8.6-rockchip64 #20.08.2 SMP PREEMPT Fri Sep 4 20:23:22 CEST 2020 aarch64 aarch64 aarch64 GNU/Linux
root@rockpi-4c:~# uptime
13:15:55 up 7 days, 8:05, 6 users, load average: 0.64, 0.21, 0.07
root@rockpi-4c:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md127 : active raid5 sdc[2] sdb[1] sde[4] sda[0] sdd[7] sdf[6]
      14650670080 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6] [UUUUUU]
--
My 2 cents. Cheers!
jbergler Posted November 25, 2020 Author Posted November 25, 2020 33 minutes ago, TheLinuxBug said: @jbergler Do you have an ATX power supply you can hook the drives to and test powering them that way? I believe this may be a power delivery issue under load. I do not, unfortunately, but in the lead-up to the crashes I've experienced I haven't seen any errors that look like problems with the drives (at least not from what I can tell).
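One way to double-check the "no drive errors before the crash" observation is to scan a saved kernel log for typical SATA link and block-layer error messages. The sketch below is an assumption-laden example: the function name `find_drive_errors` and the pattern list are made up for illustration, the patterns cover only a few common libata and I/O error messages, and a clean result does not rule out a power delivery problem.

```python
import re
from typing import List

# Assumed patterns: a few common libata / block-layer error messages.
# This list is illustrative and not exhaustive.
DRIVE_ERROR_PATTERNS = [
    re.compile(r"ata\d+(\.\d+)?: .*(hard resetting link|failed command|exception Emask)"),
    re.compile(r"I/O error"),
]


def find_drive_errors(log_text: str) -> List[str]:
    """Return the log lines that match any of the assumed drive error patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in DRIVE_ERROR_PATTERNS)
    ]


if __name__ == "__main__":
    # Made-up sample lines for demonstration; in practice, pass in saved dmesg
    # or serial console output captured before the crash.
    sample = "\n".join(
        [
            "[ 100.000000] ata1: hard resetting link",
            "[ 19.159239] Code: bad PC value",
        ]
    )
    for hit in find_drive_errors(sample):
        print(hit)
```

Feeding it the serial console capture from before a lockup would show whether any link resets or I/O errors preceded the panic.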
TheLinuxBug Posted November 25, 2020 Posted November 25, 2020 41 minutes ago, jbergler said: I do not unfortunately, but I haven't seen any errors in the lead up to the crashes I've experienced that look like problems with the drives (at least not from what I can tell) Correct, though my idea is that something happens with power delivery that starves either the board or the drives. That is hard to prove one way or the other without using a different power supply for the hard drives. I could be wrong, but it would help to eliminate that as a possibility in these cases. My 2 cents. Cheers!