XU4 reboots on luksOpen of 3TB disk


Rene Milk

My System:

ODROID XU4

Armbian stable: Linux odroidxu4 4.9.56-odroidxu4

USB 3 HDD cage with 3 SATA drives, connected to the XU4 via a USB 2 hub (the XU4's onboard USB 3 ports do not recognize the drive cage and never have)

 

The drives are 1.5TB, 3TB and 4TB in size. Each has a single full-size partition and is encrypted with LUKS/dm-crypt. I'm switching from a Debian 8 system on the XU4, which had no problem with disk decryption.

After the initial setup on Armbian I'm now trying to use those disks. I can successfully decrypt and mount the 1.5 TB disk.  But when I do  

cryptsetup luksOpen /dev/disk/by-uuid/9ef4da01-8c26-48f1-a945-d2f419068e2f somename

the system reboots after I've entered my passphrase and a few seconds of disk activity. I get no error, I see no new messages with dmesg.  I can reliably reproduce this behaviour for both the 3TB and 4TB disks.  

Any help on how to identify/fix the problem would be greatly appreciated. 

48 minutes ago, Rene Milk said:

switching from a Debian 8 system on the XU4

Which kernel version and which settings were used there?

 

48 minutes ago, Rene Milk said:

Any help on how to identify/fix the problem would be greatly appreciated. 

Well, when I hear about spontaneous reboots, my first step is to check for underpowering. What happens if you do the following:

sudo armbianmonitor -p
minerd --benchmark

(this will install cpuminer, which we use here as a tool to test for throttling and underpowering issues)

The Debian 8 distribution was called odrobian, running a 3.10.96-30 kernel.  For what settings are you looking exactly? Kernel boot parameters? 

 

I've canceled the benchmark after running for 6 minutes because I'm about to leave home. Does that exclude underpowering as an issue then? 

1 hour ago, Rene Milk said:

For what settings are you looking exactly?

 

None exactly (I have to admit that my question was somewhat stupid). Background info: we've seen more than once that a better kernel and better settings lead to higher performance, which in turn leads to higher power consumption, which then triggers instabilities on devices with powering issues. That's not Armbian-related but a general observation (e.g. the community Android builds for devices like Pine64 suffer from the 'same' problem since all possible performance tweaks were enabled: users report instabilities on boards running the community Android while the vendor Android runs flawlessly). The cause can be observed both in benchmark scores and with a power meter between wall and device.

 

1 hour ago, Rene Milk said:

I've canceled the benchmark after running for 6 minutes because I'm about to leave home. Does that exclude underpowering as an issue then?

 

Not entirely, but now I'm thinking more about software issues. If you try again, can you please first replace the contents of /usr/bin/armbianmonitor with https://raw.githubusercontent.com/armbian/build/master/packages/bsp/common/usr/bin/armbianmonitor and then open two more shells and execute in one

armbianmonitor -m 0.5

And in the other

watch -n 0.1 "dmesg | tail -n $((LINES-6))"

Edit: those two shells should be via SSH to have the messages present when the board reboots again.

Every 0.1s: dmesg | tail -n 48                Sat Nov 11 20:29:37 2017

[   19.832650] usbcore: registered new interface driver btusb
[   20.195266] ads7846 spi1.1: touchscreen, irq 149
[   20.195887] input: ADS7846 Touchscreen as /devices/platform/soc:/12d30000.spi:/spi_master/spi1/spi1.1/input/input5
[   20.698275] systemd-journald[535]: Received request to flush runtime journal from PID 1
[   21.488065] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[   21.488073] Bluetooth: BNEP filters: protocol multicast
[   21.488087] Bluetooth: BNEP socket layer initialized
[   22.438818] IPv6: ADDRCONF(NETDEV_UP): enx001e0630202f: link is not ready
[   22.470115] IPv6: ADDRCONF(NETDEV_UP): enx001e0630202f: link is not ready
[   22.491979] r8152 6-1:1.0 enx001e0630202f: carrier on
[   22.492127] IPv6: ADDRCONF(NETDEV_CHANGE): enx001e0630202f: link becomes ready
[   27.295289] fuse init (API version 7.26)
[   27.591322] Bluetooth: RFCOMM TTY layer initialized
[   27.591347] Bluetooth: RFCOMM socket layer initialized
[   27.591365] Bluetooth: RFCOMM ver 1.11
[  410.575019] NET: Registered protocol family 38
[  410.587529] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[  410.594131] pgd = c0003000
[  410.596817] [00000010] *pgd=80000040004003, *pmd=00000000
[  410.602192] Internal error: Oops: 207 [#1] PREEMPT SMP THUMB2
[  410.607909] Modules linked in: algif_skcipher af_alg rfcomm fuse cpufreq_powersave cpufreq_conservative cpufreq_userspace cpufreq_ondemand bnep ads7846 spidev joydev btusb btbcm btintel bluetooth rfkill spi_s3c64xx w1_gpio wire pwm_fan s5p_sss exynos_gpiomem uio_pdrv_genirq uio ipv6 uas hid_cherry hid_logitech_hidpp hid_logitech_dj
[  410.637202] CPU: 5 PID: 35 Comm: ksoftirqd/5 Not tainted 4.9.56-odroidxu4 #5
[  410.644218] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[  410.650286] task: ee8d76c0 task.stack: ee970000
[  410.654795] PC is at memcpy+0x68/0x2dc
[  410.658518] LR is at s5p_tasklet_cb+0x116/0x1f4 [s5p_sss]
[  410.663888] pc : [<c0572f20>]    lr : [<bf938843>]    psr: 000700b3
               sp : ee971e94  ip : 00000010  fp : c1002080
[  410.675328] r10: c1002080  r9 : 00000000  r8 : 60070013
[  410.680525] r7 : ec110744  r6 : ed439a68  r5 : 00000020  r4 : ed439a10
[  410.687024] r3 : ec110740  r2 : fffffff0  r1 : 00000010  r0 : f2d64230
[  410.693524] Flags: nzcv  IRQs off  FIQs on  Mode SVC_32  ISA Thumb  Segment user
[  410.700889] Control: 70c5387d  Table: 6e90f5c0  DAC: 55555555
[  410.706609] Process ksoftirqd/5 (pid: 35, stack limit = 0xee970210)
[  410.712848] Stack: (0xee971e94 to 0xee972000)
[  410.717180] 1e80:                                              00000020 ed439a68 ec110744
[  410.725326] 1ea0: 60070013 f2d64230 ed439a10 bf938843 bf93872d ed439a3c ed439a40 00000000
[  410.733472] 1ec0: ee971ed0 ee970000 c0e602c0 c022cf5d c022cf15 40000006 00000000 00000006
[  410.741617] 1ee0: c1002098 ee970000 00000100 c022d0a9 ee971f18 00000000 ee971ef0 c1059000
[  410.749762] 1f00: 0000000a 00005667 c1002d00 04208040 00000000 ee970000 00000000 ee911440
[  410.757908] 1f20: c100a470 ffffe000 c1003114 00000000 00000000 c022d1c7 c022d1a1 c02403e3
[  410.766053] 1f40: 00000000 00040938 00000000 ee911a80 ee970000 ee911440 c024031d 00000000
[  410.774199] 1f60: 00000000 c023d917 00000001 00000005 ee911440 00000000 00030003 ee971f7c
[  410.782344] 1f80: ee971f7c 00000000 00000000 ee971f8c ee971f8c 00040938 ee911a80 c023d85d
[  410.790492] 1fa0: 00000000 00000000 00000000 c02158e1 00000000 00000000 00000000 00000000
[  410.798638] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  410.806782] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 ffffffff ffffffff
[  410.814935] [<c0572f20>] (memcpy) from [<bf938843>] (s5p_tasklet_cb+0x116/0x1f4 [s5p_sss])

root@odroidxu4:~# armbianmonitor -m 0.5
Stop monitoring using [ctrl]-[c]
Time       big.LITTLE   load %cpu %sys %usr %nice %io %irq   CPU  C.St.
20:29:25:  600/ 600MHz  2.05   2%   0%   0%   0%   1%   0% 59.0°C  0/12
20:29:31: 2000/ 600MHz  2.04   2%   0%   0%   0%   1%   0% 65.0°C  0/12
20:29:36: 2000/ 600MHz  2.04   2%   0%   0%   0%   1%   0% 68.0°C  0/12

Turns out that 'watch+dmesg' is more informative than 'dmesg -w', thanks! So what's next to check? Should I try another armbian kernel?

Looks like the hardware crypto driver is crashing. It needs more debugging to test what exactly crashes it and why (there is no memcpy call in s5p_tasklet_cb as far as I see so it must be an inlined function, possibly not even related to HW crypto).
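
One way to see what the s5p_sss module contributes is /proc/crypto, which lists every registered algorithm together with the driver and module providing it. The helper below is only a minimal sketch: the function name crypto_algs_from_module is made up for illustration, and it does nothing more than filter text in the /proc/crypto layout read from stdin.

```shell
# Hypothetical helper: list "algorithm driver" pairs registered by one module.
# Reads /proc/crypto-formatted text on stdin; $1 is the module name.
crypto_algs_from_module() {
    awk -v mod="$1" '
        $1 == "name"   { name = $3 }      # line looks like "name   : cbc(aes)"
        $1 == "driver" { driver = $3 }    # line looks like "driver : <driver name>"
        $1 == "module" && $3 == mod { print name, driver }
    '
}

# Usage on the board (read-only):
#   crypto_algs_from_module s5p_sss < /proc/crypto
```

If the AES modes used by the LUKS containers show up in that list, it would fit the theory that dm-crypt requests end up in the hardware driver. Whether unloading the module makes luksOpen succeed was not tested in this thread.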

Also it would be good to try to reproduce this on a newer kernel (4.9.61).

It would be ideal to reproduce it on a fresh image with minimal steps, like installing a new image, creating a LUKS container with specific parameters and trying to mount it. If all disks have the same encryption parameters it could be hard to reproduce because of hardware requirements since it would be hard to find a spare 3+TB HDD for tests.
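
A file-backed container is one possible way to attempt such a minimal reproduction without a spare disk. This is only a sketch under assumptions: the function name, the image path and the mapping name luks_repro are hypothetical, the cipher and key size should be copied from the affected disks, and it is unknown whether a small container triggers the same crash as the 3TB and 4TB drives.

```shell
# Hypothetical helper: create a file-backed LUKS container and open it.
# $1 = image file, $2 = size (e.g. 1G), $3 = cipher spec, $4 = key size in bits.
# Needs root; luksFormat asks for confirmation and a passphrase.
make_test_container() {
    img="$1"; size="$2"; cipher="$3"; keybits="$4"
    truncate -s "$size" "$img" || return 1      # sparse backing file
    cryptsetup luksFormat --cipher "$cipher" --key-size "$keybits" "$img" || return 1
    cryptsetup luksOpen "$img" luks_repro       # the step that crashes on the real disks
}

# Example call (values are placeholders, not taken from the affected disks):
#   make_test_container /root/luks-test.img 1G aes-xts-plain64 256
```

If this opens cleanly while the real disks still crash the board, disk size or the USB path would become more interesting than the cipher settings.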

 

11 minutes ago, Rene Milk said:

So first venue would be to recompile the dm-crypt kernel module in debug mode?

No, that would mean recompiling the kernel with debug info (CONFIG_DEBUG_INFO), reproducing the crash, and using addr2line on the uncompressed debug image to convert the stack trace to source files and lines. Capturing the whole crash log with the full stack trace may require a serial console.
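
For resolving the crash location, the useful parts of the oops are the symbol+offset pairs on the "PC is at" and "LR is at" lines. A rough sketch for pulling them out of a saved log, assuming the log format posted earlier in this thread; the function name oops_location is invented here:

```shell
# Hypothetical helper: print "<register> <symbol+offset> <module>" for the
# "PC is at" / "LR is at" lines of an ARM kernel oops read from stdin.
# Locations without a "[module]" suffix are reported as "vmlinux".
oops_location() {
    awk '/ (PC|LR) is at / {
        for (i = 1; i <= NF; i++) if ($i == "at") break
        split($(i + 1), loc, "/")     # drop the "/0x<function size>" suffix
        mod = $(i + 2)                # "[module]" or empty for built-in code
        gsub(/\[|\]/, "", mod)
        print $(i - 2), loc[1], (mod == "" ? "vmlinux" : mod)
    }'
}

# Usage, with the oops saved to a file (file name is a placeholder):
#   oops_location < oops.txt
```

With a kernel built with CONFIG_DEBUG_INFO, a location inside a module can then be looked up with something like gdb -batch -ex 'list *(s5p_tasklet_cb+0x116)' s5p_sss.ko, where the path to the unstripped .ko depends on the build tree.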

 

14 minutes ago, Rene Milk said:

And second would be to compile the newer kernel myself entirely? Or is this available as a package already? 

Should be available in the beta repository, but switching production systems to it is not a good idea in general.

4 minutes ago, zador.blood.stained said:

It would be ideal to reproduce it on a fresh image with minimal steps, like installing a new image, creating a LUKS container with specific parameters and trying to mount it. If all disks have the same encryption parameters it could be hard to reproduce because of hardware requirements since it would be hard to find a spare 3+TB HDD for tests.

Indeed, I don't have a spare drive to test this. I'm sure, though, that apart from the different hardware and size, the containers are identical with respect to cipher and filesystem setup.
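
That the containers really match can be checked from the LUKS headers. A small sketch, assuming LUKS1 headers and the field labels that cryptsetup luksDump prints for them; the helper name luks_params is made up:

```shell
# Hypothetical helper: keep only the header fields describing the encryption setup.
# Reads `cryptsetup luksDump` output (LUKS1 layout assumed) on stdin.
luks_params() {
    grep -E '^(Version|Cipher name|Cipher mode|Hash spec|MK bits):' | tr -s ' \t' ' '
}

# Usage on the board, once per disk (device names are placeholders):
#   cryptsetup luksDump /dev/sdX1 | luks_params > working.txt
#   cryptsetup luksDump /dev/sdY1 | luks_params > failing.txt
#   diff working.txt failing.txt
```

An empty diff would confirm that the working 1.5TB disk and the failing disks use the same cipher, mode, hash and key size.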

 

6 minutes ago, zador.blood.stained said:

Should be available in the beta repository, but switching production systems to it is not a good idea in general.

Well, it's not really a production system yet, since I cannot use it for its intended purpose anyhow. So I'm up for trying that if you think it would be useful. I'd try that first, before recompiling the kernel (for which I'd need some rough pointers, btw). I could also easily check with the legacy Armbian image.

So I've checked with the legacy 3.10.106 kernel image and there all disks unlock fine. 

 

After switching to the beta repository on the mainline image, I still get a kernel panic: 

[  119.634984] NET: Registered protocol family 38
[  119.647551] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[  119.654148] pgd = c0003000
[  119.656834] [00000010] *pgd=80000040004003, *pmd=00000000
[  119.662208] Internal error: Oops: 207 [#1] PREEMPT SMP THUMB2
[  119.667927] Modules linked in: algif_skcipher af_alg rfcomm fuse cpufreq_powersave cpufreq_conservative cpufreq_userspace bnep ads7846 spidev joydev btusb btbcm btintel bluetooth rfkill spi_s3c64xx s5p_sss w1_gpio pwm_fan wire uio_pdrv_genirq uio exynos_gpiomem ipv6 uas hid_cherry hid_logitech_hidpp hid_logitech_dj
[  119.695744] CPU: 6 PID: 40 Comm: ksoftirqd/6 Not tainted 4.9.61-odroidxu4 #20
[  119.702847] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[  119.708913] task: ee9a3e80 task.stack: ee9c6000
[  119.713430] PC is at memcpy+0x68/0x2dc
[  119.717151] LR is at s5p_tasklet_cb+0x116/0x1f4 [s5p_sss]
[  119.722520] pc : [<c0575660>]    lr : [<bf966843>]    psr: 000700b3
               sp : ee9c7e94  ip : 00000010  fp : c1002080
[  119.733957] r10: c1002080  r9 : 00000000  r8 : 60070013
[  119.739158] r7 : ecb2cf44  r6 : edc7ea68  r5 : 00000020  r4 : edc7ea10
[  119.745655] r3 : ecb2cf40  r2 : fffffff0  r1 : 00000010  r0 : f2de3230
[  119.752156] Flags: nzcv  IRQs off  FIQs on  Mode SVC_32  ISA Thumb  Segment user
[  119.759523] Control: 70c5387d  Table: 6cbb0cc0  DAC: 55555555
[  119.765241] Process ksoftirqd/6 (pid: 40, stack limit = 0xee9c6210)

I'm probably going to be happy with the legacy image for now, but I'm willing to help debug the problem with the newer kernel, if you guys think it's worth it. 

 

Edit: I don't have a serial console kit, just a TV connected with HDMI

Edited by Rene Milk

Turns out I couldn't use the 3.10 based image after all, because Kodi doesn't work. 

So I've tried Arch Linux ARM with a 4.9.14 kernel, which has the same issue with LUKS. What works for me currently is an Ubuntu 16.04.2 image directly from Hardkernel, which runs 4.9.27-35.

This topic is now closed to further replies.