To be honest, I can't imagine I'm the only one having these issues. Maybe our use of the RockPro64 board with a PCIe NVMe M.2 SSD is just not very common? For a long time we suspected the PCIe / SSD combination to be the cause, but we cannot really confirm this. The issues are consistent across dozens of RockPro64 boards and many SSD brands. I've also heard from users that the Pinebook Pro (which uses similar hardware) freezes under heavy load when run with an SSD.
The boot failure is no longer an issue after updating to Armbian Ubuntu 18.04 and using overlayroot, but the nvme spinlock still happens frequently. I noticed that it mostly occurs during intense writing to the SSD: the SSD heats up, the system freezes and never comes back up. Strangely enough, the SSD then remains under constant load and does not cool down.
I ran a dedicated stress test with the tool stressdisk, using the following script (it includes our own fancontrol tool):
#!/bin/bash
# Install the tools used for monitoring
apt update -y
apt install -y tmux unzip smartmontools
mkdir -p src && cd src
# Fetch stressdisk (arm64 build) and install it
wget https://github.com/ncw/stressdisk/releases/download/v1.0.12/stressdisk_1.0.12_linux_arm64.zip
unzip stressdisk_1.0.12_linux_arm64.zip
chmod +x stressdisk
mv stressdisk /usr/sbin
# Fetch our fan control tool and install it
wget https://github.com/digitalbitbox/bitbox-base/releases/download/wip/bbbfancontrol.tar.gz
tar xvf bbbfancontrol.tar.gz
chmod +x bbbfancontrol
mv bbbfancontrol /usr/sbin/
# Mount the SSD (the mount point must exist first) and create the test directory
mkdir -p /mnt/ssd
echo "/dev/nvme0n1p1 /mnt/ssd ext4 rw,nosuid,dev,noexec,noatime,nodiratime,auto,nouser,async,nofail 0 2" >> /etc/fstab
mount -a
mkdir -p /mnt/ssd/stressdisk
# Run SMART monitoring, stressdisk, htop and fan control side by side in tmux
tmux new-session -d 'watch smartctl -a /dev/nvme0n1'
tmux split-window -h 'stressdisk cycle /mnt/ssd/stressdisk'
tmux split-window -v 'htop'
tmux split-window -v 'bbbfancontrol -v'
tmux -2 attach-session -d
Using the RockPro64 with Armbian and writing heavily to a Samsung SSD (connected with a PCIe M.2 adapter), I am able to consistently freeze the system within minutes. See video here:
Using smartmontools and physical temperature measurements, we can observe that the first chip on the Samsung 970 EVO (1 TB) SSD reaches 95 degrees Celsius before crashing. With the exact same image, board and adapter I tried other M.2 SSDs as well. Interestingly, the chip on a Samsung 970 EVO (500 GB) reached 107 degrees Celsius, but did not crash.
Other SSDs did not get as hot and had no issues in the stress tests:
* Intel 660p (512 GB and 1 TB): ~80 degrees Celsius, physical measurement only
* Crucial P1 (512 GB and 1 TB): ~75 degrees Celsius, physical measurement only
* Western Digital Black (500 GB): ~70 degrees Celsius, physical measurement only
So this issue might be related to that specific SSD. I'm still surprised that the Samsung SSDs do not throttle at all; IMHO they should never get that hot in the first place.
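Since the system never recovers after a freeze, it can help to log the drive temperature continuously, so the last reading before the hang is on disk. The following is only a rough sketch, not what was used for the measurements above. It assumes the `Temperature:   NN Celsius` line that `smartctl -a` prints for NVMe drives. The function names and the log path are made up for illustration.

```shell
#!/bin/bash
# Sketch: extract the temperature from `smartctl -a` output of an NVMe drive.
# Reads the smartctl output on stdin and prints the first value found on a
# line starting with "Temperature:" (assumed NVMe output format).
parse_nvme_temp() {
    awk '/^Temperature:/ { print $2; exit }'
}

# Append "<unix timestamp> <temperature>" to a log file and sync, so that
# the last reading before a freeze is flushed to storage.
# $1 = device (e.g. /dev/nvme0n1), $2 = log file (hypothetical path).
log_nvme_temp() {
    local dev="$1" log="$2" temp
    temp="$(smartctl -a "$dev" | parse_nvme_temp)"
    echo "$(date +%s) ${temp:-unknown}" >> "$log"
    sync
}
```

It could be run next to stressdisk with something like `while sleep 5; do log_nvme_temp /dev/nvme0n1 /var/log/nvme-temp.log; done`, with the log file placed on storage that survives a reboot.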
Unfortunately, I still keep getting the freezes from time to time. Two things I noticed:
When running `stress -i 4 -d 4` I can very reliably crash the board in ~3 minutes. This does not happen with every board, just this one, and it happens even without any additional peripherals like the SSD plugged in. As it's running on eMMC, it might be this particular eMMC that causes the crash.
This led me to build a current Armbian Ubuntu 18.04 image with `overlayroot` to eliminate all I/O to the eMMC. The board that had been crashing has now been running for ~2 days.
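For anyone who wants to try the same thing: a minimal sketch of an `overlayroot` setup, assuming the Ubuntu `overlayroot` package and its `/etc/overlayroot.conf` file. The exact settings used in the image mentioned above are not given here, so the value below is an assumption.

```shell
# /etc/overlayroot.conf (assumed location, from the Ubuntu overlayroot package)
# Keep the root filesystem read-only and redirect all writes to a tmpfs in RAM,
# so nothing is written to the eMMC while the system is running.
overlayroot="tmpfs"
```

Note that with this setting, changes to the root filesystem are lost on reboot.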
I'll try to gather more data in case the boards crash with the build from the current master branch.
Just FYI in case you're curious: this is the project I'm working on https://github.com/digitalbitbox/bitbox-base.
Just wanted to let you know that we collaborated with Mender, and Armbian for RockPro64 is now fully supported by the Mender.io open source client-server manager for over-the-air software updates. It should be relatively straightforward to extend this approach to other boards as well.
The error is rather this one:
net/wireguard/ratelimiter.c:60:2: error: implicit declaration of function ‘call_rcu’; did you mean ‘call_srcu’? [-Werror=implicit-function-declaration]
call_rcu(&entry->rcu, entry_free);
^~~~~~~~
call_srcu
cc1: some warnings being treated as errors
You can try adding "WIREGUARD=no" to the command line; it will skip this part of the build.