Meier Posted July 8, 2019

On several RockPro64 boards I experience infrequent freezes, mostly directly on boot, but also after longer uptimes (hours to days). I use the official power adapter and no additional hardware except a PCIe adapter for an SSD, which works flawlessly when operational. Looking at kern.log, there are quite a few errors and warnings, but they are all also present on a successful boot, so I don't think they cause the freeze directly. I recorded the freeze below on the latest Armbian Bionic release. FWIW, I think the same issue also occurred on the previous Debian Stretch image (I had various freezes there, but have not recorded any details yet). On an unsuccessful boot, the error occurs after about 8 seconds. A reboot (or two) usually fixes the issue, until the next time...

Jul 8 15:18:23 carol kernel: [ 8.528275] Unable to handle kernel NULL pointer dereference at virtual address 00000000
Jul 8 15:18:23 carol kernel: [ 8.530625] pgd = ffffffc0ead3f000
Jul 8 15:18:23 carol kernel: [ 8.532515] [00000000] *pgd=0000000000000000, *pud=0000000000000000
Jul 8 15:18:23 carol kernel: [ 8.534710] Internal error: Oops: 96000005 [#1] SMP
Jul 8 15:18:24 carol kernel: [ 8.536753] Modules linked in: af_packet iptable_nat nf_nat_ipv4 nf_nat nf_log_ipv4 nf_log_common xt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_conntrack nf_conntrack iptable_filter snd_soc_rockchip_hdmi_dp rk_vcodec ip_tables x_tables autofs4 phy_rockchip_pcie
Jul 8 15:18:24 carol kernel: [ 8.542598] CPU: 5 PID: 1044 Comm: find Not tainted 4.4.182-rockchip64 #1
Jul 8 15:18:24 carol kernel: [ 8.544959] Hardware name: Pine64 RockPro64 (DT)
Jul 8 15:18:24 carol kernel: [ 8.547155] task: ffffffc0eb247000 task.stack: ffffffc0e1c78000
Jul 8 15:18:24 carol kernel: [ 8.549491] PC is at do_dentry_open+0x234/0x2e4
Jul 8 15:18:24 carol kernel: [ 8.551687] LR is at do_dentry_open+0x288/0x2e4
Jul 8 15:18:24 carol kernel: [ 8.553852] pc : [<ffffff80081f2738>] lr : [<ffffff80081f278c>] pstate: a0000145
Jul 8 15:18:24 carol kernel: [ 8.556284] sp : ffffffc0e1c7bbc0
Jul 8 15:18:24 carol kernel: [ 8.558403] x29: ffffffc0e1c7bbc0 x28: ffffffc0eb247000
Jul 8 15:18:24 carol kernel: [ 8.560742] x27: 0000000000000000 x26: ffffffc0f26eb000
Jul 8 15:18:24 carol kernel: [ 8.563064] x25: 000000000000011d x24: ffffffc0e1cff690
Jul 8 15:18:24 carol kernel: [ 8.565376] x23: ffffff8008219cb8 x22: 0000000000000000
Jul 8 15:18:24 carol kernel: [ 8.567670] x21: 0000000000000000 x20: ffffffc0f27882b0
Jul 8 15:18:24 carol kernel: [ 8.569954] x19: ffffffc0e1cff680 x18: 0000007fb4979a70
Jul 8 15:18:24 carol kernel: [ 8.572191] x17: 0000007fb48e8848 x16: ffffff80081f3ea4
Jul 8 15:18:24 carol kernel: [ 8.574416] x15: 0000000000000000 x14: ffffffffffffffff
Jul 8 15:18:24 carol kernel: [ 8.576663] x13: 0000000000000000 x12: 0101010101010101
Jul 8 15:18:24 carol kernel: [ 8.578896] x11: 7f7f7f7f7f7f7f7f x10: 0000007fb4a8a140
Jul 8 15:18:24 carol kernel: [ 8.581115] x9 : 0000000000000000 x8 : ffffffc0e1cff7b8
Jul 8 15:18:24 carol kernel: [ 8.583390] x7 : 0000000000000000 x6 : ffffffc0f061d1e9
Jul 8 15:18:24 carol kernel: [ 8.585641] x5 : 0000000000000000 x4 : 00000000000055b1
Jul 8 15:18:24 carol kernel: [ 8.587874] x3 : 00000040eee4a000 x2 : ffffff8008219a20
Jul 8 15:18:24 carol kernel: [ 8.590114] x1 : ffffff8008c02140 x0 : 0000000000000000
Jul 8 15:18:24 carol kernel: [ 8.592341]
Jul 8 15:18:24 carol kernel: [ 8.592341] PC: 0xffffff80081f26b8:
Jul 8 15:18:24 carol kernel: [ 8.596219] 26b8 54fffd60 f940c680 f9001660 b4fffd20 aa1603e1 aa1303e0 940c0a2c 2a0003f6
Jul 8 15:18:24 carol kernel: [ 8.598764] 26d8 35000700 b9405261 d5033bbf f940ca80 b5000320 b50004b7 f9401660 f9402c17
Jul 8 15:18:24 carol kernel: [ 8.601305] 26f8 b5000457 b9405660 370004c0 b9405660 36080100 f9401661 f9400c22 b5000062
Jul 8 15:18:24 carol kernel: [ 8.603883] 2718 f9401421 b4000061 320e0000 b9005660 b9405260 12166c00 b9005260 f9409a60
...

Full kern.log boot log: https://pastebin.com/zcpxB1HQ

The full armbianmonitor output is here: https://pastebin.com/NkVAejC6

Any help is greatly appreciated!
Meier (Author) Posted July 8, 2019

Additional info: after roughly one hour of uptime the board started to fault repeatedly, but without crashing completely. SSH connections were closed, but a login was possible again later. Three specific errors occurred within a short interval, all logged in full here: https://pastebin.com/SAcUAGb2

Jul 8 16:59:31 carol kernel: [ 3752.234046] Unhandled fault: synchronous external abort (0x96000210) at 0xffffff8009d5401c
Jul 8 16:59:31 carol kernel: [ 3752.240736] Internal error: : 96000210 [#1] SMP
...
Jul 8 17:00:12 carol kernel: [ 3759.996389] BUG: spinlock lockup suspected on CPU#3, nvme/296
Jul 8 17:00:12 carol kernel: [ 3760.001966] lock: 0xffffff8009141870, .magic: dead4ead, .owner: nvme/296, .owner_cpu: 3
...
Jul 8 17:00:12 carol kernel: [ 3792.419942] Watchdog detected hard LOCKUP on cpu 3
Jul 8 17:00:12 carol kernel: [ 3792.420464] ------------[ cut here ]------------
Jul 8 17:00:12 carol kernel: [ 3792.430494] WARNING: at kernel/watchdog.c:352
Igor Posted July 8, 2019

I am just about to close the office until the end of the month; I will try to debug tomorrow if possible. Until then, try the attached quick fix - it's a kernel + dtb build with the most recent upstream fixes. I doubt this solves the problem, but it is worth trying.

linux-image-rockchip64_5.91_arm64.deb
linux-dtb-rockchip64_5.91_arm64.deb
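Installing the two packages should just be the usual dpkg route (a rough sketch, assuming both files are in the current directory):

# Install the test kernel and matching device tree package, then reboot into them.
sudo dpkg -i linux-image-rockchip64_5.91_arm64.deb linux-dtb-rockchip64_5.91_arm64.deb
sudo reboot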
Meier (Author) Posted July 9, 2019

Thanks Igor! I will try that today and let you know how it works out.

Update: it works fine so far after 3+ hours of uptime, also with a self-compiled image, but the intervals between freezes can be quite long.
Igor Posted July 10, 2019

Good. I hope this will be it! I took one RK3399 board (NanoPC T4) with me and it is serving as a real-world test -> KODI media center / web browser / VPN gateway / AP / file server. I am hoping to get three weeks of uptime. I also moved this topic to the RK3399 sub-forum since it fits better there.
Meier (Author) Posted July 12, 2019

Unfortunately, I still keep getting the freezes from time to time. Two things I noticed:

* When running `stress -i 4 -d 4` I can crash the board very reliably within ~3 minutes. But not just any board, only this one, and even without any additional peripherals like the SSD plugged in. As it runs from eMMC, it might be this particular eMMC module that causes the crash.
* This led me to build a current Armbian Ubuntu 18.04 image with `overlayroot` to eliminate all I/O to the eMMC (see the sketch below). The board that had been crashing has now been running for ~2 days.

I'll try to gather more data in case the boards crash with the build from the current master branch. Just FYI in case you're curious, this is the project I'm working on: https://github.com/digitalbitbox/bitbox-base
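For reference, on an Ubuntu 18.04 based image overlayroot can typically be set up along these lines (a sketch using the stock Ubuntu overlayroot package; not necessarily the exact steps I used):

# Sketch: put a tmpfs overlay over the root filesystem so nothing is
# written back to the eMMC between reboots.
sudo apt install overlayroot
echo 'overlayroot="tmpfs"' | sudo tee -a /etc/overlayroot.conf
sudo reboot
# After the reboot, / should show up as an overlay mount:
mount | grep overlay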
Myy Posted July 16, 2019

The first error *might* be related to the firmware file not being present, as printed just below the oops message. For this one, you could try this:

cd /tmp
wget https://raw.githubusercontent.com/wkennington/linux-firmware/master/rockchip/dptx.bin
cp /tmp/dptx.bin /lib/firmware/rockchip/dptx.bin

The methodology is taken from here: https://forum.pine64.org/showthread.php?tid=6510

Now, the spinlock seems to be NVMe related... When you boot correctly, does something like find / generate a freeze?

EDIT: Didn't read the whole thread correctly... With overlayroot enabled, are you also testing with stress -i 4 -d 4 ?
Meier (Author) Posted July 16, 2019

Thanks Myy for the pointers. I'll check whether the dptx.bin firmware helps prevent the boot oops message. Is there a way to tell how that binary file has been compiled, or to make sure it is legit?

Regarding stress testing with overlayroot enabled: that command immediately aborts as it fills up the available tmpfs within seconds. Good thought about find /, I'll try that.
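(For what it's worth, stress writes its scratch files into the current working directory, so running it from the SSD mount point and capping the per-worker file size should keep it off the tmpfs overlay; a rough sketch, assuming the SSD is mounted at /mnt/ssd:)

# Sketch: keep the hdd workers off the tmpfs overlay.
cd /mnt/ssd
stress -i 4 -d 4 --hdd-bytes 512M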
Myy Posted July 16, 2019

3 hours ago, Meier said:
Is there a way to tell how that binary file has been compiled, or to make sure it is legit?

That firmware seems to be part of the Closed McBlobby family: https://patchwork.kernel.org/patch/9225567/

However, a more legit source for this firmware would be: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/rockchip

Give the find / command a try and, if possible, try it on an NVMe drive.
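To check the copy you installed against the official tree, a quick checksum comparison should be enough (a sketch; the /plain/ URL path is an assumption about the cgit web interface):

# Fetch the same file from the official linux-firmware tree and compare
# checksums with the copy already installed under /lib/firmware.
wget -O /tmp/dptx-official.bin \
  https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/rockchip/dptx.bin
sha256sum /tmp/dptx-official.bin /lib/firmware/rockchip/dptx.bin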
Meier (Author) Posted July 18, 2019

The boot failure is currently no longer an issue after updating to Armbian Ubuntu 18.04 and using overlayroot, but the nvme spinlock keeps happening frequently. I noticed that it occurs mostly during intense writing to the SSD: the SSD heats up, the system freezes and never comes back up. Strangely enough, the SSD then stays under constant load and does not cool down.

I ran a dedicated stress test with the tool stressdisk using the following script (it includes our own fancontrol tool):

#!/bin/bash
# Install helper tools
apt update -y
apt install -y tmux unzip smartmontools
mkdir src && cd $_
# Fetch and install stressdisk
wget https://github.com/ncw/stressdisk/releases/download/v1.0.12/stressdisk_1.0.12_linux_arm64.zip
unzip stressdisk_1.0.12_linux_arm64.zip
chmod +x stressdisk
mv stressdisk /usr/sbin
# Fetch and install our fan control tool
wget https://github.com/digitalbitbox/bitbox-base/releases/download/wip/bbbfancontrol.tar.gz
tar xvf bbbfancontrol.tar.gz
chmod +x bbbfancontrol
mv bbbfancontrol /usr/sbin/
# Create the mount point and mount the SSD
mkdir -p /mnt/ssd
echo "/dev/nvme0n1p1 /mnt/ssd ext4 rw,nosuid,dev,noexec,noatime,nodiratime,auto,nouser,async,nofail 0 2" >> /etc/fstab
mount -a
mkdir -p /mnt/ssd/stressdisk
# Run SMART monitoring, the stress test, htop and fan control side by side in tmux
tmux new-session -d 'watch smartctl -a /dev/nvme0n1'
tmux split-window -h 'stressdisk cycle /mnt/ssd/stressdisk'
tmux split-window -v 'htop'
tmux split-window -v 'bbbfancontrol -v'
tmux -2 attach-session -d

Using the RockPro64 with Armbian and writing heavily to a Samsung SSD (connected with a PCIe M.2 adapter), I am able to consistently freeze the system within minutes. See video here:

Jul 18 08:35:24 rockpro64 kernel: Unhandled fault: synchronous external abort (0x96000210) at 0xffffff8009cc801c
Jul 18 08:35:24 rockpro64 kernel: Internal error: : 96000210 [#1] SMP
Jul 18 08:35:24 rockpro64 kernel: Modules linked in: af_packet lz4hc lz4hc_compress zlib snd_soc_rockchip_hdmi_dp lzo rk_vcodec zram ip_tables x_tables autofs4 phy_rockchip_pcie
Jul 18 08:35:24 rockpro64 kernel: CPU: 3 PID: 261 Comm: nvme Not tainted 4.4.182-rockchip64 #1
Jul 18 08:35:24 rockpro64 kernel: Hardware name: Pine64 RockPro64 (DT)
Jul 18 08:35:24 rockpro64 kernel: task: ffffffc0e4e13800 task.stack: ffffffc0dfb9c000
Jul 18 08:35:24 rockpro64 kernel: PC is at nvme_kthread+0xac/0x1d8
Jul 18 08:35:24 rockpro64 kernel: LR is at nvme_kthread+0x78/0x1d8
Jul 18 08:35:24 rockpro64 kernel: pc : [<ffffff80087564cc>] lr : [<ffffff8008756498>] pstate: 20000145
Jul 18 08:35:24 rockpro64 kernel: sp : ffffffc0dfb9fd60
Jul 18 08:35:24 rockpro64 kernel: x29: ffffffc0dfb9fd60 x28: ffffffc0df92ed00
Jul 18 08:35:24 rockpro64 kernel: x27: 0000000002080020 x26: ffffffc0ebaa0228
Jul 18 08:35:24 rockpro64 kernel: x25: ffffff80091ff0f0 x24: ffffff80091ff0f0
Jul 18 08:35:24 rockpro64 kernel: x23: ffffff8008755310 x22: ffffff80091ff0d8
Jul 18 08:35:24 rockpro64 kernel: x21: 0000000000000007 x20: ffffff8009cc801c
Jul 18 08:35:24 rockpro64 kernel: x19: ffffffc0ebd1d400 x18: 0000000000000000
...
Jul 18 08:35:24 rockpro64 kernel: Unhandled fault: synchronous external abort (0x96000210) at 0xffffff8009cc8000
Jul 18 08:35:24 rockpro64 kernel: Bad mode in Error handler detected, code 0xbf000002 -- SError
Jul 18 08:35:24 rockpro64 kernel: BUG: spinlock lockup suspected on CPU#3, nvme/261
Jul 18 08:35:24 rockpro64 kernel: lock: 0xffffff8009141870, .magic: dead4ead, .owner: nvme/261, .owner_cpu: 3
Jul 18 08:35:24 rockpro64 kernel: CPU: 3 PID: 261 Comm: nvme Not tainted 4.4.182-rockchip64 #1
Jul 18 08:35:24 rockpro64 kernel: Hardware name: Pine64 RockPro64 (DT)
Jul 18 08:35:24 rockpro64 kernel: Call trace:
Jul 18 08:35:24 rockpro64 kernel: [<ffffff80080882b0>] dump_backtrace+0x0/0x1bc
Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008088490>] show_stack+0x24/0x30
Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008587fec>] dump_stack+0x98/0xc0
Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008106164>] spin_dump+0x84/0xa4
Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008106300>] do_raw_spin_lock+0xdc/0x164
...

https://pastebin.com/1UeCiHDW

Using smartmontools and physical temperature measurements, we can observe that the first chip on the Samsung 970 EVO (1 TB) SSD reaches up to 95 degrees Celsius before the crash. With the exact same image, board and adapter I tried other M.2 SSDs as well. Interestingly, the chip on a Samsung 970 EVO (500 GB) got up to 107 degrees Celsius, but did not crash. The other SSDs did not get as hot and had no issues with the stress tests:

* Intel 660p (512 GB and 1 TB): ~80 degrees Celsius, physical measurement only
* Crucial P1 (512 GB and 1 TB): ~75 degrees Celsius, physical measurement only
* Western Digital Black (500 GB): ~70 degrees Celsius, physical measurement only

So this issue might be related to that specific SSD. I'm still surprised that the Samsung SSDs do not throttle at all; IMHO they should never get that hot in the first place.
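To correlate the drive temperature with the moment of the freeze, a simple logging loop can run alongside the stress test (a sketch; the device path and log location are assumptions):

#!/bin/bash
# Sketch: record the NVMe temperature once per second so the last reading
# before a freeze is preserved on persistent storage.
while true; do
    temp=$(smartctl -a /dev/nvme0n1 | grep -i '^Temperature:')
    echo "$(date '+%F %T') ${temp}" >> /root/nvme-temp.log
    sleep 1
done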
Myy Posted July 19, 2019

What does grep nvme_kthread /boot/System.map* return? Same thing for grep 0xffffff8009c /boot/System.map*?

EDIT: Apparently the key words here are: Unhandled fault: synchronous external abort

It seems to be related to something that "went wrong" in an operation not executed directly by the CPU.
https://stackoverflow.com/questions/27507013/synchronous-external-abort-on-arm
https://community.nxp.com/thread/496662

I don't know if that's the disk firing an abort request due to the very high temperature, or simply the disk boiling so much that PCIe operations are no longer carried out correctly.
Myy Posted July 19, 2019

That said, if it doesn't crash with other SSDs, maybe try to cool it down a little and see if that solves the issue? Also, does it generate the *same* problem inside a standard PC/laptop?

EDIT: Also, did you try to check for firmware upgrades for this specific drive? Maybe one would enable "throttling" automatically and avoid the boiling mess. Okay, there's no firmware for this one.

Anyway, if anybody else could try this (with a spare disk that isn't useful to you... at this temperature, the disk *might* suffer heavy damage), we could maybe put a warning about such issues on every board that supports NVMe. I was thinking that you could share this information with the Samsung Community, but their forums seem kind of dead.
Meier (Author) Posted July 24, 2019

Thanks for the suggestions, I will follow up shortly. First, some vacation... :-)
Meier (Author) Posted September 21, 2019

Apologies, I did not really follow up on that issue, as the following two measures prevented any further occurrences:

* Avoid using the mentioned Samsung SSD (currently using a Crucial P1)
* Run Ubuntu 18.04 from eMMC with overlayroot enabled

Not sure if it's both or only one measure, but no more freezes. Let's hope it stays that way.
Meier (Author) Posted September 23, 2019

Damn it, I came in this morning and had the kernel freeze again on one machine. Speak of the devil... I took a picture of the screen (see attachment), but it does not really hint at what went wrong as far as I can tell. Even scrolling back didn't show much additional info. I'm a bit at a loss how to capture what actually went wrong... Any hints?
Myy Posted March 31, 2020

If you still have this bug, is it possible to scroll up and get the beginning of the error? The main issue with kernel panics is that the kernel tends to output a stacktrace for each CPU, with the least used ones displayed at the bottom. So I still don't know what caused the NULL pointer dereference in the first kernel panic mentioned.
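Another way to capture the whole panic without relying on the screen is to stream kernel messages to a second machine with netconsole; a rough sketch (IP addresses, interface name and MAC are placeholders for your network):

# On the RockPro64: send kernel messages over UDP to 192.168.1.10:6666.
modprobe netconsole netconsole=6665@192.168.1.50/eth0,6666@192.168.1.10/aa:bb:cc:dd:ee:ff
# Raise the console log level so everything gets forwarded.
dmesg -n 8
# On the receiving machine: listen and keep the messages in a file
# (exact nc flags depend on the netcat variant installed).
nc -u -l 6666 | tee rockpro64-panic.log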
Myy Posted April 1, 2020

Does anyone have a spare hard drive/SSD and a RockPro64 who can fire a stress test on the disk for a few minutes? If that doesn't do anything, try a disk stress test + a CPU stress test.
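Something along these lines should do (a sketch with the stress tool; the mount point and duration are placeholders, and the hdd workers write their scratch files into the current directory):

# Disk-only stress for five minutes, run from the disk under test.
cd /mnt/ssd
stress --io 4 --hdd 4 --timeout 300s
# Then disk + CPU combined.
stress --cpu 4 --io 4 --hdd 4 --timeout 300s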