NVME I/O QID Timeout & Disabled Controller with Bullseye on NanoPi M4v2

LiX · February 12, 2021

Armbianmonitor:

NanoPi M4v2 with the official NVMe hat kit, booting from eMMC, with:
OS: Armbian w/ Debian Bullseye, Linux 5.10.12-rockchip64
SSD: Kioxia XG6 (Thinkpad part)

This also happened with the 5.9 kernel. I have not tested any other OS.

On a cold boot, I always get kernel messages like:

Feb 11 01:14:34 localhost kernel: [    3.102606] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Feb 11 01:14:34 localhost kernel: [    3.106554] input: gpio-keys as /devices/platform/gpio-keys/input/input0
Feb 11 01:14:34 localhost kernel: [    3.107121] of_cfs_init
Feb 11 01:14:34 localhost kernel: [    3.107161] of_cfs_init: OK
Feb 11 01:14:34 localhost kernel: [    3.129048] dwmmc_rockchip fe310000.mmc: Successfully tuned phase to 193
Feb 11 01:14:34 localhost kernel: [    3.135392] mmc0: new ultra high speed SDR104 SDIO card at address 0001
Feb 11 01:14:34 localhost kernel: [    3.144242] usb 7-1: new high-speed USB device number 2 using xhci-hcd
Feb 11 01:14:34 localhost kernel: [    3.295224] usb 7-1: New USB device found, idVendor=2109, idProduct=2817, bcdDevice= 0.50
Feb 11 01:14:34 localhost kernel: [    3.295236] usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Feb 11 01:14:34 localhost kernel: [    3.295244] usb 7-1: Product: USB2.0 Hub
Feb 11 01:14:34 localhost kernel: [    3.295253] usb 7-1: Manufacturer: VIA Labs, Inc.
Feb 11 01:14:34 localhost kernel: [    3.355660] hub 7-1:1.0: USB hub found
Feb 11 01:14:34 localhost kernel: [    3.355851] hub 7-1:1.0: 4 ports detected
Feb 11 01:14:34 localhost kernel: [    3.421534] random: crng init done
Feb 11 01:14:34 localhost kernel: [    3.450423] usb 8-1: new SuperSpeed Gen 1 USB device number 2 using xhci-hcd
Feb 11 01:14:34 localhost kernel: [    3.542849] usb 8-1: New USB device found, idVendor=2109, idProduct=0817, bcdDevice= 0.50
Feb 11 01:14:34 localhost kernel: [    3.542860] usb 8-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Feb 11 01:14:34 localhost kernel: [    3.542869] usb 8-1: Product: USB3.0 Hub
Feb 11 01:14:34 localhost kernel: [    3.542877] usb 8-1: Manufacturer: VIA Labs, Inc.
Feb 11 01:14:34 localhost kernel: [    3.563553] hub 8-1:1.0: USB hub found
Feb 11 01:14:34 localhost kernel: [    3.563793] hub 8-1:1.0: 4 ports detected
Feb 11 01:14:34 localhost kernel: [   64.480292] nvme nvme0: I/O 12 QID 0 timeout, disable controller
Feb 11 01:14:34 localhost kernel: [   64.588242] nvme nvme0: Device shutdown incomplete; abort shutdown
Feb 11 01:14:34 localhost kernel: [   64.588624] nvme nvme0: Identify Controller failed (-4)
Feb 11 01:14:34 localhost kernel: [   64.588636] nvme nvme0: Removing after probe failure status: -5
Feb 11 01:14:34 localhost kernel: [   64.607118] Freeing unused kernel memory: 4352K
Feb 11 01:14:34 localhost kernel: [   64.620362] Run /init as init process

As the log shows, the NVMe disk is removed after the 60-second timeout.
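To check the timing across several boots, a minimal sketch like the following can pull the timeout events out of a saved kernel log and measure the gap between the "enabling device" line and the first timeout. The function names and the regular expressions are my own assumptions, written against the log lines quoted above; they are not part of any Armbian tool.

```python
import re

# Sample lines copied from the cold-boot log in this post.
SAMPLE = """\
Feb 11 01:14:34 localhost kernel: [    3.102606] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Feb 11 01:14:34 localhost kernel: [    3.421534] random: crng init done
Feb 11 01:14:34 localhost kernel: [   64.480292] nvme nvme0: I/O 12 QID 0 timeout, disable controller
Feb 11 01:14:34 localhost kernel: [   64.588636] nvme nvme0: Removing after probe failure status: -5
""".splitlines()

# "[   64.480292] message" -> (seconds since boot, message)
STAMP_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)$")
ENABLE_RE = re.compile(r"^nvme \S+: enabling device")
TIMEOUT_RE = re.compile(r"^nvme (nvme\d+): I/O (\d+) QID (\d+) timeout, (.+)$")


def parse(lines):
    """Yield (seconds_since_boot, message) for every line with a [ts] stamp."""
    for line in lines:
        match = STAMP_RE.search(line)
        if match:
            yield float(match.group(1)), match.group(2)


def nvme_timeouts(lines):
    """Return one dict per 'I/O n QID n timeout' message."""
    events = []
    for stamp, message in parse(lines):
        match = TIMEOUT_RE.match(message)
        if match:
            events.append({
                "time": stamp,
                "ctrl": match.group(1),
                "tag": int(match.group(2)),
                "qid": int(match.group(3)),  # QID 0 is the NVMe admin queue
                "action": match.group(4),
            })
    return events


def delay_after_enable(lines):
    """Seconds from 'enabling device' to the first timeout, or None."""
    enabled_at = None
    for stamp, message in parse(lines):
        if enabled_at is None and ENABLE_RE.match(message):
            enabled_at = stamp
        elif enabled_at is not None and TIMEOUT_RE.match(message):
            return stamp - enabled_at
    return None


if __name__ == "__main__":
    for event in nvme_timeouts(SAMPLE):
        print(event)
    print("delay after enable:", delay_after_enable(SAMPLE))
```

The same functions should also accept bare dmesg lines without the syslog prefix, since only the bracketed timestamp is matched.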

Currently my way to recover is to reboot (the attached link contains the dmesg from a reboot where it works); after that there is no problem using the SSD. I've also tried adding bootdelay to armbianEnv.txt in /boot, but it doesn't seem to help.

Is this a bug or a compatibility issue? Is it possible to add the nvme0 drive back without a reboot? (I've tried the PCI rescan sysfs file, but that didn't seem to help either.)

BTW, the unsafe_shutdowns counter in the NVMe smart-log also increases.
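For anyone who wants to repeat the rescan attempt, here is a rough sketch of it. It writes "1" to the standard PCI sysfs nodes /sys/bus/pci/devices/<address>/remove and /sys/bus/pci/rescan. The remove step is an assumption on my side (the post only mentions the rescan file), the address 0000:01:00.0 is taken from the log above, and the function name and the sysfs_root parameter are hypothetical helpers so the code can be tried against a fake directory first. It needs root on a real system and, given the result reported above, may well not bring the drive back.

```python
from pathlib import Path


def remove_and_rescan(address="0000:01:00.0", sysfs_root="/sys"):
    """Drop the PCI device (if still present) and ask the kernel to rescan.

    Returns the list of sysfs files that were written, in order.
    """
    root = Path(sysfs_root)
    remove_node = root / "bus" / "pci" / "devices" / address / "remove"
    rescan_node = root / "bus" / "pci" / "rescan"
    written = []

    # Assumption: removing the stale device first gives the rescan a
    # chance to re-probe it. Skip quietly if the device is already gone.
    if remove_node.exists():
        remove_node.write_text("1")
        written.append(str(remove_node))

    rescan_node.write_text("1")
    written.append(str(rescan_node))
    return written


if __name__ == "__main__":
    for path in remove_and_rescan():
        print("wrote 1 to", path)
```

Passing a temporary directory as sysfs_root lets you dry-run the logic without touching the real bus.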
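To track that counter between boots, something like the sketch below could be used. It assumes nvme-cli is installed and that its JSON output contains an "unsafe_shutdowns" key; the device path /dev/nvme0, the function names, and the sample numbers are illustrative assumptions, not values from my system.

```python
import json
import subprocess


def read_smart_log(device="/dev/nvme0"):
    """Return the raw JSON text of `nvme smart-log` (usually needs root)."""
    result = subprocess.run(
        ["nvme", "smart-log", device, "--output-format=json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def unsafe_shutdowns(smart_log_json):
    """Extract the unsafe_shutdowns counter from smart-log JSON text."""
    return int(json.loads(smart_log_json)["unsafe_shutdowns"])


def counter_increase(before_json, after_json):
    """How many unsafe shutdowns were recorded between two snapshots."""
    return unsafe_shutdowns(after_json) - unsafe_shutdowns(before_json)


if __name__ == "__main__":
    # Hypothetical snapshots, only to show the comparison.
    before = '{"unsafe_shutdowns": 41, "power_cycles": 120}'
    after = '{"unsafe_shutdowns": 42, "power_cycles": 121}'
    print("new unsafe shutdowns:", counter_increase(before, after))
```

Saving the output of read_smart_log() before a cold boot and again after a failed probe would show whether each failure adds one to the counter.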

pkfox · February 12, 2021

What SSD are you using? I've been using Samsung 970 EVO for a couple of years without any issues

LiX · February 13, 2021

23 hours ago, pkfox said:

What SSD are you using? I've been using Samsung 970 EVO for a couple of years without any issues

It's a KIOXIA XG6... I now guess it may be a compatibility or driver issue, since the system dropped it yesterday after it had been idle for hours...

Quote

[ 244.836170] sd 0:0:0:0: [sda] Attached SCSI disk
[10169.406639] nvme nvme0: I/O 29 QID 0 timeout, reset controller
[10329.152079] nvme nvme0: I/O 13 QID 0 timeout, disable controller
[10334.244085] nvme nvme0: Device shutdown incomplete; abort shutdown
[10334.260337] nvme nvme0: could not set timestamp (-4)
[10334.260346] nvme nvme0: Removing after probe failure status: -4
[10334.320623] nvme nvme0: failed to set APST feature (-19)

i5Js · February 14, 2021

On 2/12/2021 at 11:23 PM, pkfox said:

What SSD are you using? I've been using Samsung 970 EVO for a couple of years without any issues

Which kernel are you using? Because my board is rebooting unexpectedly almost every day...

LiX · February 15, 2021

On 2/14/2021 at 12:18 AM, i5Js said:

Which kernel are you using? Because my board is rebooting unexpectedly almost every day...

Hi, it's 5.10.12-rockchip64 #21.02.1, and it seems my situation has improved since I increased bootdelay in armbianEnv.txt from 3 to 10: I got a 0/2 failure rate over the past 2 days.

pkfox · February 20, 2021

On 2/14/2021 at 10:18 AM, i5Js said:

Which kernel are you using? Because my board is rebooting unexpectedly almost every day...

I have a few NanoPi M4 v1 and v2 boards. Most of them use the 5.10.16-rockchip64 kernel, but one runs 4.4.213-rk3399 because I haven't got around to updating it. All of the boards have the NVMe hat and a Samsung 970 SSD fitted.

i5Js · February 21, 2021

On 2/16/2021 at 12:23 AM, LiX said:

Hi, it's 5.10.12-rockchip64 #21.02.1, and it seems my situation has improved since I increased bootdelay in armbianEnv.txt from 3 to 10: I got a 0/2 failure rate over the past 2 days.

Can you share your armbianEnv.txt? I've just checked mine and it doesn't have that setting.

LiX · February 25, 2021

On 2/21/2021 at 3:26 AM, i5Js said:

Can you share your armbianEnv.txt? I've just checked mine and it doesn't have that setting.

Hi, I added it myself without checking the documentation... Moreover, I run headless, so I can't confirm whether it actually works. I've since removed it, because I still had about a 1/3 failure rate with it.
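If you want to check whether a key such as bootdelay is present in your own /boot/armbianEnv.txt, a small sketch along these lines should do. It treats the file as plain key=value lines; the function name and the sample content are hypothetical, and, as said above, it is not confirmed that the boot script honors bootdelay at all.

```python
def parse_env(text):
    """Parse key=value lines, ignoring blanks and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    # Hypothetical file content, only for illustration.
    sample = "verbosity=1\n# added by hand\nbootdelay=10\n"
    env = parse_env(sample)
    print("bootdelay set:", "bootdelay" in env, env.get("bootdelay"))
```

On a real board you would pass the contents of /boot/armbianEnv.txt instead of the sample string.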

Lennyz1988 · February 25, 2021

On 2/14/2021 at 11:18 AM, i5Js said:

Which kernel are you using? Because my board is rebooting unexpectedly almost every day...

Try the latest. I also used to have lots of crashes, but not anymore. You can download this kernel through armbian-config.

Armbian 21.02.2 Buster with Linux 5.10.16-rockchip64

LiX · March 1, 2021

On 2/25/2021 at 3:39 AM, Lennyz1988 said:

Try the latest. I also used to have lots of crashes, but not anymore. You can download this kernel through armbian-config.

Armbian 21.02.2 Buster with Linux 5.10.16-rockchip64

Actually, I think it got worse with 5.10.16: I had an almost 100% failure rate yesterday. It froze my $home, since that is on the NVMe, with the following messages:

Feb 27 13:14:48 localhost lightdm[2532]: Error getting user list from org.freedesktop.Accounts: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name org.freedesktop.Accounts was not provided by any .service files
Feb 27 13:14:51 localhost kernel: [   21.178369] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
Feb 27 13:18:55 localhost kernel: [  265.927366] Modules linked in: governor_performance algif_hash algif_skcipher af_alg bnep zram zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) snd_soc_hdmi_codec snd_soc_simple_card rockchip_rga hci_uart hantro_vpu(C) snd_soc_simple_card_utils rc_cec rockchip_vdec(C) btqca dw_hdmi_i2s_audio snd_soc_rt5651 dw_hdmi_cec videobuf2_dma_sg panfrost btrtl btsdio v4l2_h264 snd_soc_rockchip_i2s snd_soc_rl6231 videobuf2_dma_contig snd_soc_rockchip_spdif gpu_sched v4l2_mem2mem videobuf2_vmalloc btbcm fusb302 btintel snd_soc_core bluetooth videobuf2_memops videobuf2_v4l2 snd_pcm_dmaengine videobuf2_common tcpm snd_pcm videodev snd_timer brcmfmac snd mc soundcore brcmutil typec cfg80211 rfkill cpufreq_dt nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 rockchipdrm realtek dw_hdmi dw_mipi_dsi analogix_dp drm_kms_helper dwmac_rk stmmac_platform cec stmmac rc_core pcs_xpcs drm drm_panel_orientation_quirks
Feb 27 13:18:55 localhost kernel: [  265.935437] CPU: 3 PID: 312 Comm: kworker/3:1H Tainted: P         C OE     5.10.16-rockchip64 #21.02.2
Feb 27 13:18:55 localhost kernel: [  265.936262] Hardware name: FriendlyElec NanoPi M4 Ver2.0 (DT)
Feb 27 13:18:55 localhost kernel: [  265.936798] Workqueue: kblockd blk_mq_timeout_work
Feb 27 13:18:55 localhost kernel: [  265.937246] pstate: 20000005 (nzCv daif -PAN -UAO -TCO BTYPE=--)
Feb 27 13:18:55 localhost kernel: [  265.937794] pc : nvme_timeout+0x48/0x370
Feb 27 13:18:55 localhost kernel: [  265.938159] lr : blk_mq_check_expired+0x210/0x230
Feb 27 13:18:55 localhost kernel: [  265.938584] sp : ffff800012603bd0
Feb 27 13:18:55 localhost kernel: [  265.938890] x29: ffff800012603bd0 x28: ffff000040816a20
Feb 27 13:18:55 localhost kernel: [  265.939388] x27: ffff000043f8fec8 x26: 00000000000000c0
Feb 27 13:18:55 localhost kernel: [  265.939884] x25: ffff000044b7e810 x24: ffff000044b7e700
Feb 27 13:18:55 localhost kernel: [  265.940379] x23: ffff0000f1bc5a40 x22: ffff0000449bac00
Feb 27 13:18:55 localhost kernel: [  265.940875] x21: ffff8000118b9948 x20: ffff00004454f000
Feb 27 13:18:55 localhost kernel: [  265.941370] x19: ffff800012a4201c x18: 0000000000000000
Feb 27 13:18:55 localhost kernel: [  265.941865] x17: 0000000000000000 x16: 0000000000000000
Feb 27 13:18:55 localhost kernel: [  265.942360] x15: 0000000000000001 x14: 00000000000000e5
Feb 27 13:18:55 localhost kernel: [  265.942855] x13: 000000000006e4a8 x12: 0000000000000000
Feb 27 13:18:55 localhost kernel: [  265.943349] x11: 0000000000000000 x10: 0000000000000a40
Feb 27 13:18:55 localhost kernel: [  265.943844] x9 : ffff800012603d20 x8 : fefefefefefefeff
Feb 27 13:18:55 localhost kernel: [  265.944339] x7 : ffff800012603da0 x6 : ffff000044ae07e0
Feb 27 13:18:55 localhost kernel: [  265.944834] x5 : 0000000000000002 x4 : 0000000000000001
Feb 27 13:18:55 localhost kernel: [  265.945328] x3 : 0000000000000000 x2 : ffff800010987d90
Feb 27 13:18:55 localhost kernel: [  265.945822] x1 : 0000000000000000 x0 : 0000000000000000
Feb 27 13:18:55 localhost kernel: [  265.946315] Call trace:
Feb 27 13:18:55 localhost kernel: [  265.946554]  nvme_timeout+0x48/0x370
Feb 27 13:18:55 localhost kernel: [  265.946889]  blk_mq_check_expired+0x210/0x230
Feb 27 13:18:55 localhost kernel: [  265.947293]  bt_iter+0x60/0x70
Feb 27 13:18:55 localhost kernel: [  265.947585]  blk_mq_queue_tag_busy_iter+0x1e4/0x338
Feb 27 13:18:55 localhost kernel: [  265.948030]  blk_mq_timeout_work+0x17c/0x1a8
Feb 27 13:18:55 localhost kernel: [  265.948428]  process_one_work+0x1ec/0x4d0
Feb 27 13:18:55 localhost kernel: [  265.948801]  worker_thread+0x48/0x478
Feb 27 13:18:55 localhost kernel: [  265.949143]  kthread+0x140/0x150
Feb 27 13:18:55 localhost kernel: [  265.949449]  ret_from_fork+0x10/0x34
Feb 27 13:18:55 localhost kernel: [  265.950347] ---[ end trace b4d79516f26f1c5d ]---
Feb 27 13:17:08 localhost kernel: [    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]

and I can't shut down the (headless) system properly; I can only unplug the power.

I am going to try linux-image-legacy-rockchip64 (4.4.213-rockchip64) next.

Werner · March 1, 2021

Moved to p2p because unsupported userspace.

LiX · March 2, 2021

Sorry guys, I just realized this issue is most probably related to OPAL SED (which I enabled shortly after I got the SSD). I will try to file a bug at kernel.org. Thanks for your time.
