Rene Milk Posted November 11, 2017

My system:
ODROID XU4
Armbian stable: Linux odroidxu4 4.9.56-odroidxu4
USB-3 HDD cage with 3 SATA drives, connected to the XU4 via a USB2 hub (the XU4's onboard USB3 does not recognize the drive cage and never has)

The drives are 1.5TB, 3TB and 4TB in size. Each has a single full-size partition and is encrypted with luks/dm-crypt. I'm switching from a Debian 8 system on the XU4, which had no problem with disk decryption. After the initial setup on Armbian I'm now trying to use those disks. I can successfully decrypt and mount the 1.5TB disk. But when I run

cryptsetup luksOpen /dev/disk/by-uuid/9ef4da01-8c26-48f1-a945-d2f419068e2f somename

the system reboots after I've entered my passphrase and a few seconds of disk activity. I get no error and I see no new messages with dmesg. I can reliably reproduce this behaviour for both the 3TB and 4TB disks.

Any help on how to identify/fix the problem would be greatly appreciated.
tkaiser Posted November 11, 2017

48 minutes ago, Rene Milk said:
"switching from a Debian 8 system on the XU4"

Which kernel version and which settings were used there?

48 minutes ago, Rene Milk said:
"Any help on how to identify/fix the problem would be greatly appreciated."

Well, when I hear about spontaneous reboots my first step is to check for underpowering. What happens if you run the following:

sudo armbianmonitor -p minerd --benchmark

(this will install cpuminer, which we use here as a tool to test for throttling and also underpowering issues)
Rene Milk (Author) Posted November 11, 2017

The Debian 8 distribution was called odrobian, running a 3.10.96-30 kernel. Which settings exactly are you looking for? Kernel boot parameters?

I canceled the benchmark after it had run for 6 minutes because I'm about to leave home. Does that rule out underpowering as the issue?
tkaiser Posted November 11, 2017

1 hour ago, Rene Milk said:
"For what settings are you looking exactly?"

None exactly (I have to admit that my question was somewhat stupid). Background info: we've seen more than once that a better kernel and better settings lead to higher performance, which in turn leads to higher consumption, which on devices with powering issues then triggers instabilities. That's not Armbian related but a general observation (e.g. the community Android builds for devices like Pine64 suffer from the 'same' problem since enabling all performance tweaks possible: users report instabilities on their board running the community Android while the vendor Android runs flawlessly). The reason can be observed by looking at both benchmark scores and a power meter between wall and device.

1 hour ago, Rene Milk said:
"I've canceled the benchmark after running for 6 minutes because I'm about to leave home. Does that exclude underpowering as an issue then?"

Not entirely, but I now lean more towards a software issue. If you try again, can you please first replace the contents of /usr/bin/armbianmonitor with https://raw.githubusercontent.com/armbian/build/master/packages/bsp/common/usr/bin/armbianmonitor and then open two more shells and execute in one

armbianmonitor -m 0.5

and in the other

watch -n 0.1 "dmesg | tail -n $((LINES-6))"

Edit: those two shells should be opened via SSH so that the messages are still visible when the board reboots again.
Rene Milk (Author) Posted November 11, 2017

Every 0.1s: dmesg | tail -n 48    Sat Nov 11 20:29:37 2017

[ 19.832650] usbcore: registered new interface driver btusb
[ 20.195266] ads7846 spi1.1: touchscreen, irq 149
[ 20.195887] input: ADS7846 Touchscreen as /devices/platform/soc:/12d30000.spi:/spi_master/spi1/spi1.1/input/input5
[ 20.698275] systemd-journald[535]: Received request to flush runtime journal from PID 1
[ 21.488065] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[ 21.488073] Bluetooth: BNEP filters: protocol multicast
[ 21.488087] Bluetooth: BNEP socket layer initialized
[ 22.438818] IPv6: ADDRCONF(NETDEV_UP): enx001e0630202f: link is not ready
[ 22.470115] IPv6: ADDRCONF(NETDEV_UP): enx001e0630202f: link is not ready
[ 22.491979] r8152 6-1:1.0 enx001e0630202f: carrier on
[ 22.492127] IPv6: ADDRCONF(NETDEV_CHANGE): enx001e0630202f: link becomes ready
[ 27.295289] fuse init (API version 7.26)
[ 27.591322] Bluetooth: RFCOMM TTY layer initialized
[ 27.591347] Bluetooth: RFCOMM socket layer initialized
[ 27.591365] Bluetooth: RFCOMM ver 1.11
[ 410.575019] NET: Registered protocol family 38
[ 410.587529] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[ 410.594131] pgd = c0003000
[ 410.596817] [00000010] *pgd=80000040004003, *pmd=00000000
[ 410.602192] Internal error: Oops: 207 [#1] PREEMPT SMP THUMB2
[ 410.607909] Modules linked in: algif_skcipher af_alg rfcomm fuse cpufreq_powersave cpufreq_conservative cpufreq_userspace cpufreq_ondemand bnep ads7846 spidev joydev btusb btbcm btintel bluetooth rfkill spi_s3c64xx w1_gpio wire pwm_fan s5p_sss exynos_gpiomem uio_pdrv_genirq uio ipv6 uas hid_cherry hid_logitech_hidpp hid_logitech_dj
[ 410.637202] CPU: 5 PID: 35 Comm: ksoftirqd/5 Not tainted 4.9.56-odroidxu4 #5
[ 410.644218] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 410.650286] task: ee8d76c0 task.stack: ee970000
[ 410.654795] PC is at memcpy+0x68/0x2dc
[ 410.658518] LR is at s5p_tasklet_cb+0x116/0x1f4 [s5p_sss]
[ 410.663888] pc : [<c0572f20>] lr : [<bf938843>] psr: 000700b3
sp : ee971e94 ip : 00000010 fp : c1002080
[ 410.675328] r10: c1002080 r9 : 00000000 r8 : 60070013
[ 410.680525] r7 : ec110744 r6 : ed439a68 r5 : 00000020 r4 : ed439a10
[ 410.687024] r3 : ec110740 r2 : fffffff0 r1 : 00000010 r0 : f2d64230
[ 410.693524] Flags: nzcv IRQs off FIQs on Mode SVC_32 ISA Thumb Segment user
[ 410.700889] Control: 70c5387d Table: 6e90f5c0 DAC: 55555555
[ 410.706609] Process ksoftirqd/5 (pid: 35, stack limit = 0xee970210)
[ 410.712848] Stack: (0xee971e94 to 0xee972000)
[ 410.717180] 1e80:                                              00000020 ed439a68 ec110744
[ 410.725326] 1ea0: 60070013 f2d64230 ed439a10 bf938843 bf93872d ed439a3c ed439a40 00000000
[ 410.733472] 1ec0: ee971ed0 ee970000 c0e602c0 c022cf5d c022cf15 40000006 00000000 00000006
[ 410.741617] 1ee0: c1002098 ee970000 00000100 c022d0a9 ee971f18 00000000 ee971ef0 c1059000
[ 410.749762] 1f00: 0000000a 00005667 c1002d00 04208040 00000000 ee970000 00000000 ee911440
[ 410.757908] 1f20: c100a470 ffffe000 c1003114 00000000 00000000 c022d1c7 c022d1a1 c02403e3
[ 410.766053] 1f40: 00000000 00040938 00000000 ee911a80 ee970000 ee911440 c024031d 00000000
[ 410.774199] 1f60: 00000000 c023d917 00000001 00000005 ee911440 00000000 00030003 ee971f7c
[ 410.782344] 1f80: ee971f7c 00000000 00000000 ee971f8c ee971f8c 00040938 ee911a80 c023d85d
[ 410.790492] 1fa0: 00000000 00000000 00000000 c02158e1 00000000 00000000 00000000 00000000
[ 410.798638] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 410.806782] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 ffffffff ffffffff
[ 410.814935] [<c0572f20>] (memcpy) from [<bf938843>] (s5p_tasklet_cb+0x116/0x1f4 [s5p_sss])

root@odroidxu4:~# armbianmonitor -m 0.5
Stop monitoring using [ctrl]-[c]
Time       big.LITTLE     load  %cpu %sys %usr %nice %io %irq  CPU     C.St.
20:29:25:   600/ 600MHz   2.05   2%   0%   0%   0%    1%  0%   59.0°C  0/12
20:29:31:  2000/ 600MHz   2.04   2%   0%   0%   0%    1%  0%   65.0°C  0/12
20:29:36:  2000/ 600MHz   2.04   2%   0%   0%   0%    1%  0%   68.0°C  0/12

Turns out that 'watch+dmesg' is more informative than 'dmesg -w', thanks! So what's next to check? Should I try another Armbian kernel?
zador.blood.stained Posted November 11, 2017

Looks like the hardware crypto driver is crashing. It needs more debugging to find out what exactly crashes it and why (as far as I can see there is no memcpy call in s5p_tasklet_cb, so it must come from an inlined function, possibly not even related to HW crypto). It would also be good to try to reproduce this on a newer kernel (4.9.61).
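One way to see whether the hardware crypto driver is in play at all is to list what the s5p_sss module (the one named in the oops) registers with the kernel crypto API. This is only a minimal sketch, assuming a Linux system where /proc/crypto is readable; the helper name crypto_entries_for_module is made up for illustration and is not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print "name driver priority" for every entry in
# /proc/crypto-formatted input (read from stdin) that is provided by the
# kernel module given as $1.
crypto_entries_for_module() {
    awk -v mod="$1" '
        $1 == "name"     { name = $3 }
        $1 == "driver"   { driver = $3 }
        $1 == "module"   { module = $3 }
        # "priority" follows name/driver/module within each entry
        $1 == "priority" { if (module == mod) print name, driver, $3 }
    '
}

# Only query the live system when /proc/crypto is actually available.
if [ -r /proc/crypto ]; then
    crypto_entries_for_module s5p_sss < /proc/crypto
fi
```

The kernel crypto API prefers the implementation with the highest priority for a given algorithm name, so comparing these priorities with those of the software AES entries hints at which one dm-crypt would end up using. Whether that is what happens on this board is an assumption to verify, not something established here.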
Rene Milk (Author) Posted November 11, 2017

So the first avenue would be to recompile the dm-crypt kernel module in debug mode? And the second would be to compile the newer kernel entirely myself? Or is it already available as a package?
zador.blood.stained Posted November 11, 2017

It would be ideal to reproduce it on a fresh image with minimal steps, like installing a new image, creating a LUKS container with specific parameters and trying to mount it. If all disks have the same encryption parameters, it could be hard to reproduce because of the hardware requirements: it would be hard to find a spare 3+TB HDD for tests.

11 minutes ago, Rene Milk said:
"So first venue would be to recompile the dm-crypt kernel module in debug mode?"

No. The way to go is to recompile the kernel with debug info (CONFIG_DEBUG_INFO), reproduce the crash, and use addr2line on the uncompressed debug image to convert the stack trace to source files and lines. It may require a serial console to capture the whole crash log with the full stack trace.

14 minutes ago, Rene Milk said:
"And second would be to compile the newer kernel myself entirely? Or is this available as a package already?"

It should be available in the beta repository, but switching production systems to it is not a good idea in general.
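The addr2line step described above can be sketched roughly as follows. The helper name oops_addrs and the file names oops.log and vmlinux are placeholders for illustration; the sketch assumes the oops text was saved to a file and that vmlinux is the uncompressed image from a CONFIG_DEBUG_INFO build of exactly the kernel that crashed.

```shell
#!/bin/sh
# Hypothetical helper: read oops text on stdin and print each bracketed
# code address of the form [<xxxxxxxx>] once, in order of first appearance.
oops_addrs() {
    grep -oE '\[<[0-9a-f]{8}>\]' |
        sed -E 's/^\[<//; s/>\]$//' |
        awk '!seen[$0]++'
}

# Usage sketch (paths are assumptions, adjust to your build tree):
#   addr2line -f -i -e vmlinux $(oops_addrs < oops.log)
# -f prints function names, -i expands inlined frames, -e selects the image.
```

Addresses that fall inside a loaded module (the bf...... values next to [s5p_sss] in the trace) will most likely not resolve against vmlinux; for those, the function+offset form from the trace has to be looked up in the module's own object file built with debug info.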
Rene Milk (Author) Posted November 11, 2017

4 minutes ago, zador.blood.stained said:
"It would be ideal to reproduce it on a fresh image with minimal steps, like installing a new image, creating a LUKS container with specific parameters and trying to mount it. If all disks have the same encryption parameters it could be hard to reproduce because of hardware requirements since it would be hard to find a spare 3+TB HDD for tests."

Indeed, I don't have a spare drive to test this. I'm sure though that, apart from the different hardware and size, the containers are identical with respect to cipher and filesystem setup.

6 minutes ago, zador.blood.stained said:
"Should be available in the beta repository, but switching production systems to it is not a good idea in general."

Well, it's not really a production system yet, since I cannot use it for its intended purpose anyhow. So I'm willing to try that if you think it would be useful. I'd try this first, before recompiling the kernel (for which I'd need some rough pointers, btw). I could also easily check with the legacy Armbian image.
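Whether the containers really share the same encryption parameters can be checked by comparing their LUKS headers. A minimal sketch, assuming LUKS1 containers (so that cryptsetup luksDump prints fields such as "Cipher name:" and "MK bits:"); the helper name luks_params and the temporary file names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: keep only the parameter lines of a LUKS1 header dump
# read from stdin, dropping digests, salts and key slot details.
luks_params() {
    grep -E '^(Version|Cipher name|Cipher mode|Hash spec|Payload offset|MK bits):'
}

# Usage sketch (device paths are placeholders, run as root):
#   cryptsetup luksDump /dev/disk/by-uuid/<uuid-of-1.5TB-disk> | luks_params > small.txt
#   cryptsetup luksDump /dev/disk/by-uuid/<uuid-of-3TB-disk>   | luks_params > big.txt
#   diff small.txt big.txt && echo "same parameters"
```

If the two files differ, the differing parameters are a natural starting point for the minimal reproduction on a fresh image suggested earlier in the thread.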
Rene Milk (Author) Posted November 11, 2017 (edited November 11, 2017 by Rene Milk)

So I've checked with the legacy 3.10.106 kernel image and there all disks unlock fine. After switching to the beta repository on the mainline image, I still get a kernel panic:

[ 119.634984] NET: Registered protocol family 38
[ 119.647551] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[ 119.654148] pgd = c0003000
[ 119.656834] [00000010] *pgd=80000040004003, *pmd=00000000
[ 119.662208] Internal error: Oops: 207 [#1] PREEMPT SMP THUMB2
[ 119.667927] Modules linked in: algif_skcipher af_alg rfcomm fuse cpufreq_powersave cpufreq_conservative cpufreq_userspace bnep ads7846 spidev joydev btusb btbcm btintel bluetooth rfkill spi_s3c64xx s5p_sss w1_gpio pwm_fan wire uio_pdrv_genirq uio exynos_gpiomem ipv6 uas hid_cherry hid_logitech_hidpp hid_logitech_dj
[ 119.695744] CPU: 6 PID: 40 Comm: ksoftirqd/6 Not tainted 4.9.61-odroidxu4 #20
[ 119.702847] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 119.708913] task: ee9a3e80 task.stack: ee9c6000
[ 119.713430] PC is at memcpy+0x68/0x2dc
[ 119.717151] LR is at s5p_tasklet_cb+0x116/0x1f4 [s5p_sss]
[ 119.722520] pc : [<c0575660>] lr : [<bf966843>] psr: 000700b3
sp : ee9c7e94 ip : 00000010 fp : c1002080
[ 119.733957] r10: c1002080 r9 : 00000000 r8 : 60070013
[ 119.739158] r7 : ecb2cf44 r6 : edc7ea68 r5 : 00000020 r4 : edc7ea10
[ 119.745655] r3 : ecb2cf40 r2 : fffffff0 r1 : 00000010 r0 : f2de3230
[ 119.752156] Flags: nzcv IRQs off FIQs on Mode SVC_32 ISA Thumb Segment user
[ 119.759523] Control: 70c5387d Table: 6cbb0cc0 DAC: 55555555
[ 119.765241] Process ksoftirqd/6 (pid: 40, stack limit = 0xee9c6210)

I'm probably going to be happy with the legacy image for now, but I'm willing to help debug the problem with the newer kernel, if you guys think it's worth it.

Edit: I don't have a serial console kit, just a TV connected via HDMI.
Rene Milk (Author) Posted November 15, 2017

Turns out I couldn't use the 3.10-based image after all, because Kodi doesn't work. So I've tried Arch Linux ARM with a 4.9.14 kernel, which has the same issue with LUKS. What works for me currently is an Ubuntu 16.04.2 image directly from Hardkernel, which runs 4.9.27-35.
suchende Posted December 25, 2017

I remember this problem with the Cubietruck and the Armbian 4.9 kernel. With a self-compiled 4.10 kernel it works. The same problem occurs with the ODROID HC1. The self-compiled 4.14 kernel works flawlessly.