Infrequent RockPro64 freeze (kernel NULL pointer)


Meier

On several RockPro64 boards I experience infrequent freezes, mostly right at boot, but sometimes after longer uptimes (hours or days). I use the official power adapter and no additional hardware except a PCIe adapter for an SSD, which works flawlessly when operational.

 

Looking at kern.log, there are quite a few errors and warnings, but compared to a successful boot these are all present there as well. So I don't think they cause the freeze directly.

 

Most recently, I recorded the freeze on the latest Armbian Bionic release. FWIW, I think the same issue also occurred on the previous Debian Stretch image (it had various freezes, but I have not recorded any details from those).

 

On an unsuccessful boot, the error occurs after about 8 seconds. A reboot (or two) usually fixes the issue, until the next time...

Jul  8 15:18:23 carol kernel: [    8.528275] Unable to handle kernel NULL pointer dereference at virtual address 00000000
Jul  8 15:18:23 carol kernel: [    8.530625] pgd = ffffffc0ead3f000
Jul  8 15:18:23 carol kernel: [    8.532515] [00000000] *pgd=0000000000000000, *pud=0000000000000000
Jul  8 15:18:23 carol kernel: [    8.534710] Internal error: Oops: 96000005 [#1] SMP
Jul  8 15:18:24 carol kernel: [    8.536753] Modules linked in: af_packet iptable_nat nf_nat_ipv4 nf_nat nf_log_ipv4 nf_log_common xt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_conntrack nf_conntrack iptable_filter snd_soc_rockchip_hdmi_dp rk_vcodec ip_tables x_tables autofs4 phy_rockchip_pcie
Jul  8 15:18:24 carol kernel: [    8.542598] CPU: 5 PID: 1044 Comm: find Not tainted 4.4.182-rockchip64 #1
Jul  8 15:18:24 carol kernel: [    8.544959] Hardware name: Pine64 RockPro64 (DT)
Jul  8 15:18:24 carol kernel: [    8.547155] task: ffffffc0eb247000 task.stack: ffffffc0e1c78000
Jul  8 15:18:24 carol kernel: [    8.549491] PC is at do_dentry_open+0x234/0x2e4
Jul  8 15:18:24 carol kernel: [    8.551687] LR is at do_dentry_open+0x288/0x2e4
Jul  8 15:18:24 carol kernel: [    8.553852] pc : [<ffffff80081f2738>] lr : [<ffffff80081f278c>] pstate: a0000145
Jul  8 15:18:24 carol kernel: [    8.556284] sp : ffffffc0e1c7bbc0
Jul  8 15:18:24 carol kernel: [    8.558403] x29: ffffffc0e1c7bbc0 x28: ffffffc0eb247000 
Jul  8 15:18:24 carol kernel: [    8.560742] x27: 0000000000000000 x26: ffffffc0f26eb000 
Jul  8 15:18:24 carol kernel: [    8.563064] x25: 000000000000011d x24: ffffffc0e1cff690 
Jul  8 15:18:24 carol kernel: [    8.565376] x23: ffffff8008219cb8 x22: 0000000000000000 
Jul  8 15:18:24 carol kernel: [    8.567670] x21: 0000000000000000 x20: ffffffc0f27882b0 
Jul  8 15:18:24 carol kernel: [    8.569954] x19: ffffffc0e1cff680 x18: 0000007fb4979a70 
Jul  8 15:18:24 carol kernel: [    8.572191] x17: 0000007fb48e8848 x16: ffffff80081f3ea4 
Jul  8 15:18:24 carol kernel: [    8.574416] x15: 0000000000000000 x14: ffffffffffffffff 
Jul  8 15:18:24 carol kernel: [    8.576663] x13: 0000000000000000 x12: 0101010101010101 
Jul  8 15:18:24 carol kernel: [    8.578896] x11: 7f7f7f7f7f7f7f7f x10: 0000007fb4a8a140 
Jul  8 15:18:24 carol kernel: [    8.581115] x9 : 0000000000000000 x8 : ffffffc0e1cff7b8 
Jul  8 15:18:24 carol kernel: [    8.583390] x7 : 0000000000000000 x6 : ffffffc0f061d1e9 
Jul  8 15:18:24 carol kernel: [    8.585641] x5 : 0000000000000000 x4 : 00000000000055b1 
Jul  8 15:18:24 carol kernel: [    8.587874] x3 : 00000040eee4a000 x2 : ffffff8008219a20 
Jul  8 15:18:24 carol kernel: [    8.590114] x1 : ffffff8008c02140 x0 : 0000000000000000 
Jul  8 15:18:24 carol kernel: [    8.592341] 
Jul  8 15:18:24 carol kernel: [    8.592341] PC: 0xffffff80081f26b8:
Jul  8 15:18:24 carol kernel: [    8.596219] 26b8  54fffd60 f940c680 f9001660 b4fffd20 aa1603e1 aa1303e0 940c0a2c 2a0003f6
Jul  8 15:18:24 carol kernel: [    8.598764] 26d8  35000700 b9405261 d5033bbf f940ca80 b5000320 b50004b7 f9401660 f9402c17
Jul  8 15:18:24 carol kernel: [    8.601305] 26f8  b5000457 b9405660 370004c0 b9405660 36080100 f9401661 f9400c22 b5000062
Jul  8 15:18:24 carol kernel: [    8.603883] 2718  f9401421 b4000061 320e0000 b9005660 b9405260 12166c00 b9005260 f9409a60
...

Full kern.log boot log:

https://pastebin.com/zcpxB1HQ

 

The full armbianmonitor output can be found here:

https://pastebin.com/NkVAejC6

 

Any help is greatly appreciated!


Additional info: after about an hour of uptime, the board started faulting repeatedly, but without crashing completely. SSH connections were closed, but later a login was possible again.

 

Three specific errors in a short interval, all logged in full here: https://pastebin.com/SAcUAGb2

Jul  8 16:59:31 carol kernel: [ 3752.234046] Unhandled fault: synchronous external abort (0x96000210) at 0xffffff8009d5401c
Jul  8 16:59:31 carol kernel: [ 3752.240736] Internal error: : 96000210 [#1] SMP
...

Jul  8 17:00:12 carol kernel: [ 3759.996389] BUG: spinlock lockup suspected on CPU#3, nvme/296
Jul  8 17:00:12 carol kernel: [ 3760.001966]  lock: 0xffffff8009141870, .magic: dead4ead, .owner: nvme/296, .owner_cpu: 3
...

Jul  8 17:00:12 carol kernel: [ 3792.419942] Watchdog detected hard LOCKUP on cpu 3
Jul  8 17:00:12 carol kernel: [ 3792.420464] ------------[ cut here ]------------
Jul  8 17:00:12 carol kernel: [ 3792.430494] WARNING: at kernel/watchdog.c:352

 


Thanks Igor! Will try that today and let you know how it works out.

 

Update: works fine so far after 3+ hours of uptime, also with the self-compiled image, but the intervals between freezes can be quite long.


Good. I hope this will be it! I took one RK3399 board (NanoPC T4) with me and it is serving as a real-world test -> KODI media center / web browser / VPN gateway / AP / file server. I am hoping to get three weeks of uptime ;)

 

I am also moving this topic to the RK3399 sub-forum since it fits better there.


Unfortunately, I still keep getting the freezes from time to time. Two things I noticed:

  • When running `stress -i 4 -d 4` I can crash the board within ~3 minutes very reliably. Not on every board, though, just this one, and even without any additional peripherals like the SSD plugged in. As it's running on eMMC, it might be this particular eMMC that causes the crash.
  • This led me to build a current Armbian Ubuntu 18.04 image with `overlayroot` to eliminate all I/O to the eMMC (a rough setup sketch follows below). The board that had been crashing has now been running for ~2 days.
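
For reference, a minimal sketch of how overlayroot can be enabled on an Ubuntu 18.04 based image. This assumes the overlayroot package is available in the image's repositories and that the stock initramfs hooks are used, so treat it as an outline rather than the exact steps of my build:

# install the overlayroot package (assumption: available in the image's Ubuntu repos)
apt install -y overlayroot
# redirect all writes to a tmpfs overlay instead of the eMMC root filesystem
echo 'overlayroot="tmpfs"' >> /etc/overlayroot.conf
# the initramfs reads /etc/overlayroot.conf at boot, so a reboot activates the overlay
reboot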

I'll try to gather more data in case the boards crash with the build from the current master branch.

 

Just FYI in case you're curious: this is the project I'm working on https://github.com/digitalbitbox/bitbox-base.


The first error *might* be related to the firmware file not being present, as printed just below the oops message.

For this one, you could try this:
 

Quote

cd /tmp
wget https://raw.githubusercontent.com/wkennington/linux-firmware/master/rockchip/dptx.bin
cp /tmp/dptx.bin /lib/firmware/rockchip/dptx.bin


The methodology is taken from here: https://forum.pine64.org/showthread.php?tid=6510

Now the spinlock seems to be NVMe related... When you boot correctly, does something like `find /` generate a freeze?

EDIT: Didn't read the whole thread correctly...

 

With overlayroot enabled, are you also testing with `stress -i 4 -d 4`?

 


Thanks Myy for the pointers. I'll check whether the dptx.bin firmware helps prevent the oops message at boot. Is there a way to tell how that binary file has been compiled, or to make sure it is legit?

 

Regarding the stress testing with overlayroot, this command immediately aborts as it fills up the available tmpfs within seconds. Good thought about `find /`, I'll try that.
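
One way around the tmpfs limitation might be to run the disk workers from a directory on the SSD itself, since stress writes its temporary files into the current working directory. A sketch, assuming the SSD is mounted at /mnt/ssd:

# run the sync and disk-write workers from a directory on the SSD,
# so the writes hit the NVMe drive instead of the small tmpfs overlay
cd /mnt/ssd
stress -i 4 -d 4 --hdd-bytes 1G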


3 hours ago, Meier said:

Is there a way to tell how that binary file has been compiled, or to make sure it is legit?

 

That firmware seems to be part of the Closed McBlobby family: https://patchwork.kernel.org/patch/9225567/

 

However, a more legit source for this firmware would be: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/rockchip
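
If you want to sanity-check the blob you already installed, one option is to fetch the copy from the linux-firmware tree and compare checksums. A sketch, assuming cgit's plain view and the same file name in both trees:

# fetch the blob from the official linux-firmware tree (path assumed from the link above)
wget -O /tmp/dptx-upstream.bin "https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/rockchip/dptx.bin"
# compare it against the copy installed under /lib/firmware
sha256sum /tmp/dptx-upstream.bin /lib/firmware/rockchip/dptx.bin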

 

Give the `find /` command a try and, if possible, try it on an NVMe drive.


The boot failure is currently no longer an issue after updating to Armbian Ubuntu 18.04 and using overlayroot, but the NVMe spinlock keeps happening frequently. I noticed that it mostly occurs during intense writes to the SSD: the drive heats up, the system freezes and never comes back. Strangely enough, the SSD then remains under constant load and does not cool down.

 

I ran a dedicated stress test with the tool stressdisk, using the following script (it includes our own fan control tool):

#!/bin/bash
# install prerequisites
apt update -y
apt install -y tmux unzip smartmontools

# fetch the stressdisk tool and install it
mkdir src && cd $_
wget https://github.com/ncw/stressdisk/releases/download/v1.0.12/stressdisk_1.0.12_linux_arm64.zip
unzip stressdisk_1.0.12_linux_arm64.zip
chmod +x stressdisk
mv stressdisk /usr/sbin

# fetch our fan control tool and install it
wget https://github.com/digitalbitbox/bitbox-base/releases/download/wip/bbbfancontrol.tar.gz
tar xvf bbbfancontrol.tar.gz
chmod +x bbbfancontrol
mv bbbfancontrol /usr/sbin/

# mount the SSD (make sure the mount point exists first) and create the test directory
mkdir -p /mnt/ssd
echo "/dev/nvme0n1p1 /mnt/ssd ext4 rw,nosuid,dev,noexec,noatime,nodiratime,auto,nouser,async,nofail 0 2" >> /etc/fstab
mount -a
mkdir -p /mnt/ssd/stressdisk

# run SMART monitoring, the disk stress test, htop and fan control side by side in tmux
tmux new-session -d 'watch smartctl -a /dev/nvme0n1'
tmux split-window -h 'stressdisk cycle /mnt/ssd/stressdisk'
tmux split-window -v 'htop'
tmux split-window -v 'bbbfancontrol -v'
tmux -2 attach-session -d

Using the RockPro64 with Armbian and writing heavily to a Samsung SSD (connected with a PCIe M.2 adapter), I am able to consistently freeze the system within minutes. See video here:

 

Jul 18 08:35:24 rockpro64 kernel: Unhandled fault: synchronous external abort (0x96000210) at 0xffffff8009cc801c
Jul 18 08:35:24 rockpro64 kernel: Internal error: : 96000210 [#1] SMP
Jul 18 08:35:24 rockpro64 kernel: Modules linked in: af_packet lz4hc lz4hc_compress zlib snd_soc_rockchip_hdmi_dp lzo rk_vcodec zram ip_tables x_tables autofs4 phy_rockchip_pcie
Jul 18 08:35:24 rockpro64 kernel: CPU: 3 PID: 261 Comm: nvme Not tainted 4.4.182-rockchip64 #1
Jul 18 08:35:24 rockpro64 kernel: Hardware name: Pine64 RockPro64 (DT)
Jul 18 08:35:24 rockpro64 kernel: task: ffffffc0e4e13800 task.stack: ffffffc0dfb9c000
Jul 18 08:35:24 rockpro64 kernel: PC is at nvme_kthread+0xac/0x1d8
Jul 18 08:35:24 rockpro64 kernel: LR is at nvme_kthread+0x78/0x1d8
Jul 18 08:35:24 rockpro64 kernel: pc : [<ffffff80087564cc>] lr : [<ffffff8008756498>] pstate: 20000145
Jul 18 08:35:24 rockpro64 kernel: sp : ffffffc0dfb9fd60
Jul 18 08:35:24 rockpro64 kernel: x29: ffffffc0dfb9fd60 x28: ffffffc0df92ed00 
Jul 18 08:35:24 rockpro64 kernel: x27: 0000000002080020 x26: ffffffc0ebaa0228 
Jul 18 08:35:24 rockpro64 kernel: x25: ffffff80091ff0f0 x24: ffffff80091ff0f0 
Jul 18 08:35:24 rockpro64 kernel: x23: ffffff8008755310 x22: ffffff80091ff0d8 
Jul 18 08:35:24 rockpro64 kernel: x21: 0000000000000007 x20: ffffff8009cc801c 
Jul 18 08:35:24 rockpro64 kernel: x19: ffffffc0ebd1d400 x18: 0000000000000000 
...
...
Jul 18 08:35:24 rockpro64 kernel: Unhandled fault: synchronous external abort (0x96000210) at 0xffffff8009cc8000
Jul 18 08:35:24 rockpro64 kernel: Bad mode in Error handler detected, code 0xbf000002 -- SError
Jul 18 08:35:24 rockpro64 kernel: BUG: spinlock lockup suspected on CPU#3, nvme/261
Jul 18 08:35:24 rockpro64 kernel:  lock: 0xffffff8009141870, .magic: dead4ead, .owner: nvme/261, .owner_cpu: 3
Jul 18 08:35:24 rockpro64 kernel: CPU: 3 PID: 261 Comm: nvme Not tainted 4.4.182-rockchip64 #1
Jul 18 08:35:24 rockpro64 kernel: Hardware name: Pine64 RockPro64 (DT)
Jul 18 08:35:24 rockpro64 kernel: Call trace:
Jul 18 08:35:24 rockpro64 kernel: [<ffffff80080882b0>] dump_backtrace+0x0/0x1bc
Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008088490>] show_stack+0x24/0x30
Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008587fec>] dump_stack+0x98/0xc0
Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008106164>] spin_dump+0x84/0xa4
Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008106300>] do_raw_spin_lock+0xdc/0x164
...
...

https://pastebin.com/1UeCiHDW

 

Using smartmontools and physical temperature measurements, we can observe that the first chip on the Samsung 970 EVO (1 TB) SSD gets as hot as 95 degrees Celsius before crashing. With the exact same image, board and adapter I tried other M.2 SSDs as well. Interestingly, the chip on a Samsung 970 EVO (500 GB) got up to 107 degrees Celsius, but did not crash.
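
For reference, a minimal sketch of one way to log the temperature reported by the drive during such a run (assuming smartmontools is installed and the drive shows up as /dev/nvme0n1):

# append a timestamped temperature reading every 5 seconds while the stress test runs
while true; do
    date +%T >> /tmp/nvme-temp.log
    smartctl -a /dev/nvme0n1 | grep -i temperature >> /tmp/nvme-temp.log
    sleep 5
done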

 

Other SSDs did not get as hot and had no issues during the stress tests:
* Intel 660p (512 GB and 1 TB): ~80 degrees Celsius, physical measurement only
* Crucial P1 (512 GB and 1 TB): ~75 degrees Celsius, physical measurement only
* Western Digital Black (500 GB): ~70 degrees Celsius, physical measurement only

 

So this issue might be related to that specific SSD. I'm still surprised that the Samsung SSDs do not throttle at all; IMHO they should never get that hot in the first place.


What does `grep nvme_kthread /boot/System.map*` return?
Same thing for `grep 0xffffff8009c /boot/System.map*`?

 

EDIT: Apparently the key words here are: Unhandled fault: synchronous external abort

It seems to be related to something that "went wrong" in an operation not executed directly by the CPU.

 

https://stackoverflow.com/questions/27507013/synchronous-external-abort-on-arm

https://community.nxp.com/thread/496662

 

I don't know if that's the disk firing an abort request due to the very high temperature, or simply the disk boiling so much that PCIe operations are no longer carried out correctly.


That said, if it doesn't crash with other SSDs, maybe try to cool it down a little and see if that solves the issue?

Also, does it generate the *same* problem inside a standard PC/laptop?

 

EDIT: Also, did you check for firmware upgrades for this specific drive? Maybe it could enable "throttling" automatically and avoid the boiling mess.

Okay, there's no firmware for this one.

 

Anyway, if anybody else could try this (with a spare disk that isn't valuable to you... at this temperature, the disk *might* suffer heavy damage), we could maybe put a warning about such issues on every board that supports NVMe.

 

I was thinking that you could share this information with the Samsung community, but their forums seem kind of dead.


Apologies, I did not really follow up on that issue, as the following two measures prevented any further occurrences:

  • Avoid using the mentioned Samsung SSD (currently using a Crucial P1)
  • Run Ubuntu 18.04 from the eMMC with overlayroot enabled

Not sure if it's both measures or only one, but no more freezes. Let's hope it stays that way.
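
In case it helps to narrow it down to one of the two measures, whether the overlay is actually active can be checked quickly (a sketch; the exact mount names depend on the overlayroot configuration):

# verify that the root filesystem is an overlay backed by tmpfs
mount | grep -E 'overlayroot|overlay'
df -h /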


Damn it, came in this morning and had the kernel freeze again on one machine. Speak of the devil...

 

I took a picture of the screen (see attachment), but as far as I can tell it does not really hint at what went wrong. Even scrolling back didn't show much additional info. I'm a bit at a loss as to how to capture what actually went wrong... Any hints?

 

20190923_092015.jpg


If you still have this bug, is it possible to scroll up and get the beginning of the error?

 

The main issue with kernel panics is that the kernel tends to output a stack trace for each CPU... with the least-used ones displayed at the bottom.

 

So I still don't know what caused the NULL pointer dereference in the first kernel panic mentioned.
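
One thing that might help capture the next occurrence (an untested suggestion, not something I have verified on this board): make the systemd journal persistent, so the kernel messages leading up to a freeze survive the reboot. A hard lockup can still prevent the very last lines from reaching storage, and with overlayroot enabled the journal would land in the tmpfs overlay instead, but on a regular root filesystem it often narrows things down:

# store the journal on disk instead of in RAM only
mkdir -p /var/log/journal
systemctl restart systemd-journald
# after the next freeze and reboot, inspect the kernel messages of the previous boot
journalctl -k -b -1 | tail -n 100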


Does anyone have a spare hard drive/SSD and a RockPro64, and could fire a stress test at the disk for a few minutes?

If that doesn't do anything, try a disk stress test + a CPU stress test.
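
For example, something along these lines (a sketch using the stress tool; the mount point is only a placeholder):

# combined disk + CPU stress: run from a directory on the drive under test
cd /mnt/disk-under-test   # placeholder, adjust to wherever the drive is mounted
stress -c 4 -i 4 -d 4 --timeout 600   # 4 CPU, 4 sync and 4 disk-write workers for 10 minutes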

