Meier

Infrequent RockPro64 freeze (kernel NULL pointer)


On several RockPro64 boards I experience infrequent freezes, mostly directly on boot, but also after some longer (hours, days) uptime. I use the official power adapter and no additional hardware except a PCIe adapter for an SSD, which works flawlessly when operational.

 

When looking at kern.log, there are quite a few errors and warnings, but all of them also appear in the log of a successful boot, so I don't think they cause the freeze directly.

 

So far, I have recorded the freeze on the latest Armbian Bionic image, which was released just recently. FWIW, I think the same issue also occurs on the previous Debian Stretch image (I had various freezes there, but have not recorded any details yet).

 

On an unsuccessful boot, the error occurs after about 8 seconds. A reboot (or two) usually fixes the issue, until the next time...

Jul  8 15:18:23 carol kernel: [    8.528275] Unable to handle kernel NULL pointer dereference at virtual address 00000000
Jul  8 15:18:23 carol kernel: [    8.530625] pgd = ffffffc0ead3f000
Jul  8 15:18:23 carol kernel: [    8.532515] [00000000] *pgd=0000000000000000, *pud=0000000000000000
Jul  8 15:18:23 carol kernel: [    8.534710] Internal error: Oops: 96000005 [#1] SMP
Jul  8 15:18:24 carol kernel: [    8.536753] Modules linked in: af_packet iptable_nat nf_nat_ipv4 nf_nat nf_log_ipv4 nf_log_common xt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_conntrack nf_conntrack iptable_filter snd_soc_rockchip_hdmi_dp rk_vcodec ip_tables x_tables autofs4 phy_rockchip_pcie
Jul  8 15:18:24 carol kernel: [    8.542598] CPU: 5 PID: 1044 Comm: find Not tainted 4.4.182-rockchip64 #1
Jul  8 15:18:24 carol kernel: [    8.544959] Hardware name: Pine64 RockPro64 (DT)
Jul  8 15:18:24 carol kernel: [    8.547155] task: ffffffc0eb247000 task.stack: ffffffc0e1c78000
Jul  8 15:18:24 carol kernel: [    8.549491] PC is at do_dentry_open+0x234/0x2e4
Jul  8 15:18:24 carol kernel: [    8.551687] LR is at do_dentry_open+0x288/0x2e4
Jul  8 15:18:24 carol kernel: [    8.553852] pc : [<ffffff80081f2738>] lr : [<ffffff80081f278c>] pstate: a0000145
Jul  8 15:18:24 carol kernel: [    8.556284] sp : ffffffc0e1c7bbc0
Jul  8 15:18:24 carol kernel: [    8.558403] x29: ffffffc0e1c7bbc0 x28: ffffffc0eb247000 
Jul  8 15:18:24 carol kernel: [    8.560742] x27: 0000000000000000 x26: ffffffc0f26eb000 
Jul  8 15:18:24 carol kernel: [    8.563064] x25: 000000000000011d x24: ffffffc0e1cff690 
Jul  8 15:18:24 carol kernel: [    8.565376] x23: ffffff8008219cb8 x22: 0000000000000000 
Jul  8 15:18:24 carol kernel: [    8.567670] x21: 0000000000000000 x20: ffffffc0f27882b0 
Jul  8 15:18:24 carol kernel: [    8.569954] x19: ffffffc0e1cff680 x18: 0000007fb4979a70 
Jul  8 15:18:24 carol kernel: [    8.572191] x17: 0000007fb48e8848 x16: ffffff80081f3ea4 
Jul  8 15:18:24 carol kernel: [    8.574416] x15: 0000000000000000 x14: ffffffffffffffff 
Jul  8 15:18:24 carol kernel: [    8.576663] x13: 0000000000000000 x12: 0101010101010101 
Jul  8 15:18:24 carol kernel: [    8.578896] x11: 7f7f7f7f7f7f7f7f x10: 0000007fb4a8a140 
Jul  8 15:18:24 carol kernel: [    8.581115] x9 : 0000000000000000 x8 : ffffffc0e1cff7b8 
Jul  8 15:18:24 carol kernel: [    8.583390] x7 : 0000000000000000 x6 : ffffffc0f061d1e9 
Jul  8 15:18:24 carol kernel: [    8.585641] x5 : 0000000000000000 x4 : 00000000000055b1 
Jul  8 15:18:24 carol kernel: [    8.587874] x3 : 00000040eee4a000 x2 : ffffff8008219a20 
Jul  8 15:18:24 carol kernel: [    8.590114] x1 : ffffff8008c02140 x0 : 0000000000000000 
Jul  8 15:18:24 carol kernel: [    8.592341] 
Jul  8 15:18:24 carol kernel: [    8.592341] PC: 0xffffff80081f26b8:
Jul  8 15:18:24 carol kernel: [    8.596219] 26b8  54fffd60 f940c680 f9001660 b4fffd20 aa1603e1 aa1303e0 940c0a2c 2a0003f6
Jul  8 15:18:24 carol kernel: [    8.598764] 26d8  35000700 b9405261 d5033bbf f940ca80 b5000320 b50004b7 f9401660 f9402c17
Jul  8 15:18:24 carol kernel: [    8.601305] 26f8  b5000457 b9405660 370004c0 b9405660 36080100 f9401661 f9400c22 b5000062
Jul  8 15:18:24 carol kernel: [    8.603883] 2718  f9401421 b4000061 320e0000 b9005660 b9405260 12166c00 b9005260 f9409a60
...

Full kern.log boot log:

https://pastebin.com/zcpxB1HQ

 

Please find attached the full armbianmonitor output here:

https://pastebin.com/NkVAejC6

 

Any help is greatly appreciated!

Board: Not on the list


Additional info: after a short uptime of ~1 h, the board started to fault repeatedly, but without crashing completely. SSH connections were closed, but logging in was possible again later.

 

Three specific errors occurred within a short interval; all are logged in full here: https://pastebin.com/SAcUAGb2

Jul  8 16:59:31 carol kernel: [ 3752.234046] Unhandled fault: synchronous external abort (0x96000210) at 0xffffff8009d5401c
Jul  8 16:59:31 carol kernel: [ 3752.240736] Internal error: : 96000210 [#1] SMP
...

Jul  8 17:00:12 carol kernel: [ 3759.996389] BUG: spinlock lockup suspected on CPU#3, nvme/296
Jul  8 17:00:12 carol kernel: [ 3760.001966]  lock: 0xffffff8009141870, .magic: dead4ead, .owner: nvme/296, .owner_cpu: 3
...

Jul  8 17:00:12 carol kernel: [ 3792.419942] Watchdog detected hard LOCKUP on cpu 3
Jul  8 17:00:12 carol kernel: [ 3792.420464] ------------[ cut here ]------------
Jul  8 17:00:12 carol kernel: [ 3792.430494] WARNING: at kernel/watchdog.c:352
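To compare a failed boot against a good one, it can help to pull only the fatal messages out of the log. The following is a minimal sketch, not part of the original setup: the helper names `fatal_events` and `count_fatal_events` are made up for illustration, the pattern list only covers the messages quoted in this thread, and the log path in the usage note is an assumption.

```shell
#!/bin/bash
# Fatal kernel messages quoted in this thread; extend the list as needed.
FATAL_PATTERNS='Unable to handle kernel|Internal error:|Unhandled fault:|spinlock lockup suspected|Watchdog detected hard LOCKUP'

# fatal_events LOGFILE: print every log line that matches one of the patterns.
fatal_events() {
    grep -E "$FATAL_PATTERNS" "$1"
}

# count_fatal_events LOGFILE: print how many such lines the log contains.
count_fatal_events() {
    grep -cE "$FATAL_PATTERNS" "$1"
}
```

For example, `fatal_events /var/log/kern.log` (path assumed) lists the oops, abort and lockup lines without the register dumps in between.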

 


Thanks Igor! Will try that today and let you know how it works out.

 

Update: works fine so far after 3+ hours of uptime, also with the self-compiled image, but the intervals between freezes can be quite long.


Good. I hope this will be it! I took one RK3399 board (NanoPC T4) with me and it is serving as real world test -> KODI media center / web browser / VPN gateway / AP / file server. I am looking/hoping to get three weeks of up-time ;)

 

I have also moved this topic to the RK3399 sub-forum, since it fits better here.


Unfortunately, I still keep getting the freezes from time to time. Two things I noticed:

  • When running `stress -i 4 -d 4`, I can crash the board very reliably in ~3 minutes. This only happens on this one board, not on the others, and it happens even without additional peripherals such as the SSD plugged in. As the board runs from eMMC, this particular eMMC module might be causing the crash.
  • This led me to build the latest Armbian Ubuntu 18.04 image with `overlayroot` to eliminate all I/O to the eMMC. The board that had been crashing has now been running for ~2 days.

I'll try to gather more data in case the boards crash with the build from the current master branch.

 

Just FYI in case you're curious: this is the project I'm working on https://github.com/digitalbitbox/bitbox-base.


The first error *might* be related to a missing firmware file, as reported just below the oops message.

For this one, you could try this:
 

Quote

cd /tmp
wget https://raw.githubusercontent.com/wkennington/linux-firmware/master/rockchip/dptx.bin
cp /tmp/dptx.bin /lib/firmware/rockchip/dptx.bin


The methodology is taken from here : https://forum.pine64.org/showthread.php?tid=6510

Now the spinlock seems to be NVMe related... When the board boots correctly, does something like find / trigger a freeze?

EDIT : Didn't read the whole thread correctly...

 

With overlayroot enabled, are you also testing with stress -i 4 -d 4 ?

 


Thanks Myy for the pointers. I'll check whether the dptx.bin firmware file helps prevent the boot oops message. Is there a way to tell how that binary file has been compiled, or to make sure it is legit?

 

Regarding the stress testing in overlayroot, this command immediately aborts as it fills up the available tmpfs within seconds. Good thought about find /, I'll try that.

3 hours ago, Meier said:

Is there a way to tell how that binary file has been compiled, or to make sure it is legit?

 

That firmware seems to be part of Closed McBlobby family : https://patchwork.kernel.org/patch/9225567/

 

However, a more legit source for this firmware would be : https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/rockchip
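Since the firmware is a binary blob, there is no way to see how it was built, but it is at least possible to check that the installed file is byte-identical to the copy in the linux-firmware tree. A minimal sketch, assuming `sha256sum` from coreutils is available; the helper name `same_sha256` and the clone path in the comment are made up for illustration.

```shell
#!/bin/bash
# same_sha256 FILE_A FILE_B: succeed only if both files exist and have the
# same SHA-256 digest.
same_sha256() {
    local a b
    a=$(sha256sum "$1" 2>/dev/null | cut -d' ' -f1)
    b=$(sha256sum "$2" 2>/dev/null | cut -d' ' -f1)
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example (assumes the upstream tree was cloned into ./linux-firmware):
#   git clone --depth 1 https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
#   same_sha256 /lib/firmware/rockchip/dptx.bin linux-firmware/rockchip/dptx.bin \
#       && echo "dptx.bin matches the linux-firmware copy"
```

A match only shows that the file is the same one distributed upstream; it says nothing about what the blob contains.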

 

Give the find / command a try and, if possible, try it on an NVMe drive.


The boot failure is currently no longer an issue after updating to Armbian Ubuntu 18.04 and using overlayroot, but the nvme spinlock lockup keeps happening frequently. I noticed that it mostly occurs during intense writing to the SSD: the SSD heats up, then the system freezes and never comes back up. Strangely enough, the SSD then remains under constant load and does not cool down.

 

I ran a dedicated stress test with the tool stressdisk, using the following script (it includes our own fancontrol tool):

#!/bin/bash
# Install the tools used below.
apt update -y
apt install -y tmux unzip smartmontools

# Fetch stressdisk and install it system-wide.
mkdir -p src && cd src
wget https://github.com/ncw/stressdisk/releases/download/v1.0.12/stressdisk_1.0.12_linux_arm64.zip
unzip stressdisk_1.0.12_linux_arm64.zip
chmod +x stressdisk
mv stressdisk /usr/sbin

# Fetch and install our own fan control tool.
wget https://github.com/digitalbitbox/bitbox-base/releases/download/wip/bbbfancontrol.tar.gz
tar xvf bbbfancontrol.tar.gz
chmod +x bbbfancontrol
mv bbbfancontrol /usr/sbin/

# Mount the SSD. The mount point is created first so that `mount -a`
# does not fail on a missing directory.
mkdir -p /mnt/ssd
echo "/dev/nvme0n1p1 /mnt/ssd ext4 rw,nosuid,dev,noexec,noatime,nodiratime,auto,nouser,async,nofail 0 2" >> /etc/fstab
mount -a

mkdir -p /mnt/ssd/stressdisk

# One tmux window with four panes: SMART data, stressdisk, htop, fan control.
tmux new-session -d 'watch smartctl -a /dev/nvme0n1'
tmux split-window -h 'stressdisk cycle /mnt/ssd/stressdisk'
tmux split-window -v 'htop'
tmux split-window -v 'bbbfancontrol -v'
tmux -2 attach-session -d

Using the RockPro64 with Armbian and writing heavily to a Samsung SSD (connected with a PCIe M.2 adapter), I am able to consistently freeze the system within minutes. See video here:

 

Jul 18 08:35:24 rockpro64 kernel: Unhandled fault: synchronous external abort (0x96000210) at 0xffffff8009cc801c
Jul 18 08:35:24 rockpro64 kernel: Internal error: : 96000210 [#1] SMP
Jul 18 08:35:24 rockpro64 kernel: Modules linked in: af_packet lz4hc lz4hc_compress zlib snd_soc_rockchip_hdmi_dp lzo rk_vcodec zram ip_tables x_tables autofs4 phy_rockchip_pcie
Jul 18 08:35:24 rockpro64 kernel: CPU: 3 PID: 261 Comm: nvme Not tainted 4.4.182-rockchip64 #1
Jul 18 08:35:24 rockpro64 kernel: Hardware name: Pine64 RockPro64 (DT)
Jul 18 08:35:24 rockpro64 kernel: task: ffffffc0e4e13800 task.stack: ffffffc0dfb9c000
Jul 18 08:35:24 rockpro64 kernel: PC is at nvme_kthread+0xac/0x1d8
Jul 18 08:35:24 rockpro64 kernel: LR is at nvme_kthread+0x78/0x1d8
Jul 18 08:35:24 rockpro64 kernel: pc : [<ffffff80087564cc>] lr : [<ffffff8008756498>] pstate: 20000145
Jul 18 08:35:24 rockpro64 kernel: sp : ffffffc0dfb9fd60
Jul 18 08:35:24 rockpro64 kernel: x29: ffffffc0dfb9fd60 x28: ffffffc0df92ed00 
Jul 18 08:35:24 rockpro64 kernel: x27: 0000000002080020 x26: ffffffc0ebaa0228 
Jul 18 08:35:24 rockpro64 kernel: x25: ffffff80091ff0f0 x24: ffffff80091ff0f0 
Jul 18 08:35:24 rockpro64 kernel: x23: ffffff8008755310 x22: ffffff80091ff0d8 
Jul 18 08:35:24 rockpro64 kernel: x21: 0000000000000007 x20: ffffff8009cc801c 
Jul 18 08:35:24 rockpro64 kernel: x19: ffffffc0ebd1d400 x18: 0000000000000000 
...
...
Jul 18 08:35:24 rockpro64 kernel: Unhandled fault: synchronous external abort (0x96000210) at 0xffffff8009cc8000
Jul 18 08:35:24 rockpro64 kernel: Bad mode in Error handler detected, code 0xbf000002 -- SError
Jul 18 08:35:24 rockpro64 kernel: BUG: spinlock lockup suspected on CPU#3, nvme/261
Jul 18 08:35:24 rockpro64 kernel:  lock: 0xffffff8009141870, .magic: dead4ead, .owner: nvme/261, .owner_cpu: 3
Jul 18 08:35:24 rockpro64 kernel: CPU: 3 PID: 261 Comm: nvme Not tainted 4.4.182-rockchip64 #1
Jul 18 08:35:24 rockpro64 kernel: Hardware name: Pine64 RockPro64 (DT)
Jul 18 08:35:24 rockpro64 kernel: Call trace:
Jul 18 08:35:24 rockpro64 kernel: [<ffffff80080882b0>] dump_backtrace+0x0/0x1bc
Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008088490>] show_stack+0x24/0x30
Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008587fec>] dump_stack+0x98/0xc0
Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008106164>] spin_dump+0x84/0xa4
Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008106300>] do_raw_spin_lock+0xdc/0x164
...
...

https://pastebin.com/1UeCiHDW

 

Using smartmontools and physical temperature measurements, we can observe that the first chip on the Samsung 970 EVO (1 TB) SSD reaches up to 95 degrees Celsius before crashing. With the exact same image, board and adapter, I tried other M.2 SSDs as well. Interestingly, the chip on a Samsung 970 EVO (500 GB) reached 107 degrees Celsius but did not crash.

 

Other SSDs did not get as hot and had no issues in the stress tests:
* Intel 660p (512 GB and 1 TB): ~80 degrees Celsius, physical measurement only
* Crucial P1 (512 GB and 1 TB): ~75 degrees Celsius, physical measurement only
* Western Digital Black (500 GB): ~70 degrees Celsius, physical measurement only

 

So this issue might be related to that specific SSD. I'm still surprised that the Samsung SSDs do not throttle at all; IMHO they should never get that hot in the first place.
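To see how hot the drive was right before a freeze, the temperature can be logged continuously instead of watched by eye. This is a rough sketch under assumptions: it expects the NVMe report of `smartctl -a` to contain a line of the form `Temperature: 95 Celsius`, and the function names `nvme_temp_c` and `log_nvme_temp` are made up for illustration.

```shell
#!/bin/bash
# nvme_temp_c: read `smartctl -a` output on stdin and print the drive
# temperature in degrees Celsius. Assumes a line "Temperature: <N> Celsius";
# per-sensor lines ("Temperature Sensor 1: ...") are deliberately not matched.
nvme_temp_c() {
    awk '/^Temperature:/ { print $2; exit }'
}

# log_nvme_temp [DEVICE] [INTERVAL]: print a timestamped reading every
# INTERVAL seconds until interrupted. Defaults are assumptions.
log_nvme_temp() {
    local dev=${1:-/dev/nvme0n1} interval=${2:-5}
    while true; do
        printf '%s %s\n' "$(date '+%F %T')" "$(smartctl -a "$dev" | nvme_temp_c)"
        sleep "$interval"
    done
}
```

Running `log_nvme_temp` over SSH and saving the output on the other machine keeps the last reading available after the board has frozen.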
