Meier

Members
  • Content Count

    17

About Meier

  • Rank
    Member


  1. Thanks for all the hints, I will try this out. To be honest, I can't imagine being the only one having these issues. Maybe our use of the RockPro64 board with a PCIe NVMe M.2 SSD is just not very common? We have long suspected the PCIe / SSD combination to be the cause, but cannot really confirm this. The issues are consistent across dozens of RockPro64 boards and many SSD brands. I've also heard from users that the Pinebook Pro (which uses similar hardware) freezes under heavy load when run with an SSD.
  2. This is a last-ditch effort to figure out how we can continue to build our project on top of Armbian.

     Goal

     We are building an appliance on the RockPro64 board, with eMMC and a PCIe SSD attached, using a custom-built Armbian Ubuntu 18.04 image with extensive post-configuration (packages, applications, overlayroot...) within the boot process. The image is then processed by Mender to create a true dual-root filesystem over-the-air update system with fallback (we invested a lot of resources to extend Mender for Armbian and RockPro64). This build process works perfectly and is exactly what we need.

     Issue

     The reliability, however, is bad. In an older topic of mine, I documented our issues of regular device reboots due to kernel panics and infrequent freezes. We never really figured out what causes these crashes. It depends on the individual hardware, but on average reboots happen every few days, and freezes that need a manual power cycle occur every one or two weeks. This is also the case when using the official images.

     Workaround

     When we use the exact same custom-built Armbian image but replace the kernel with an Ayufan kernel, the images are stable. The devices run for weeks without reboots or freezes, independent of the individual hardware.

     Options

     This is obviously a very hacky way of creating a Linux image. So we have two options:
       • (preferred) continue to use Armbian, but with a stable kernel
       • switch to Ayufan and eat all the costs: implement a new build system, redesign post-configuration, extend the Mender solution

     The way forward

     I am not a Linux kernel expert. I am also not sure how exactly the two projects Armbian and Ayufan are related, but it seems that Armbian uses some of Ayufan's resources in the build process. For me the question then becomes where the differences originate. Is this something that can be figured out, and that people are willing to look into? How can I support that by running tests, providing logs or assisting in other non-source-code tasks? Any help is appreciated.
  3. I am experiencing frequent kernel panics with the RockPro64 board and suspect a power consumption issue with the SSD (PCIe NVMe M.2). I tested a variety of SSDs, with the Crucial P1 being the most reliable. Is it possible to manually throttle an SSD within Armbian, forcing it to consume less power? In my case, peak performance is not really important. This would allow me to further validate my hypothesis. Thanks in advance for any suggestions!
  4. Damn it, came in this morning and had the kernel freeze again on one machine. Speak of the devil... I took a picture of the screen (see attachment), but as far as I can tell it does not really hint at what went wrong. Even scrolling back didn't show much additional info. I'm a bit at a loss as to how to capture what actually went wrong... Any hints?
  5. Apologies, I did not really follow up on that issue, as the following two measures prevented any further occurrences:
       • Avoid using the mentioned Samsung SSD (currently using Crucial P1)
       • Run the eMMC as Ubuntu 18.04 with overlayroot enabled
     Not sure if it's both or only one measure, but no more freezes. Let's hope it stays that way.
  6. Thanks for the suggestions, will follow up shortly. But first, some vacation... :-)
  7. The boot failure is currently no longer an issue after updating to Armbian Ubuntu 18.04 and using overlayroot, but the nvme spinlock lockup keeps happening frequently. I noticed that it happens mostly during intense writing to the SSD: the SSD heats up, the system freezes and never comes back up. Strangely enough, the SSD is then still under constant load and does not cool down. I ran a dedicated stress test with the tool stressdisk using the following script (it includes our own fancontrol tool):

         #!/bin/bash
         # install tools
         apt update -y
         apt install -y tmux unzip smartmontools
         mkdir src && cd $_

         # stressdisk
         wget https://github.com/ncw/stressdisk/releases/download/v1.0.12/stressdisk_1.0.12_linux_arm64.zip
         unzip stressdisk_1.0.12_linux_arm64.zip
         chmod +x stressdisk
         mv stressdisk /usr/sbin

         # fan control
         wget https://github.com/digitalbitbox/bitbox-base/releases/download/wip/bbbfancontrol.tar.gz
         tar xvf bbbfancontrol.tar.gz
         chmod +x bbbfancontrol
         mv bbbfancontrol /usr/sbin/

         # mount the SSD
         echo "/dev/nvme0n1p1 /mnt/ssd ext4 rw,nosuid,dev,noexec,noatime,nodiratime,auto,nouser,async,nofail 0 2" >> /etc/fstab
         mount -a
         mkdir -p /mnt/ssd/stressdisk

         # run the stress test and monitoring side by side
         tmux new-session -d 'watch smartctl -a /dev/nvme0n1'
         tmux split-window -h 'stressdisk cycle /mnt/ssd/stressdisk'
         tmux split-window -v 'htop'
         tmux split-window -v 'bbbfancontrol -v'
         tmux -2 attach-session -d

     Using the RockPro64 with Armbian and writing heavily to a Samsung SSD (connected via a PCIe M.2 adapter), I am able to consistently freeze the system within minutes.
     See video here:

         Jul 18 08:35:24 rockpro64 kernel: Unhandled fault: synchronous external abort (0x96000210) at 0xffffff8009cc801c
         Jul 18 08:35:24 rockpro64 kernel: Internal error: : 96000210 [#1] SMP
         Jul 18 08:35:24 rockpro64 kernel: Modules linked in: af_packet lz4hc lz4hc_compress zlib snd_soc_rockchip_hdmi_dp lzo rk_vcodec zram ip_tables x_tables autofs4 phy_rockchip_pcie
         Jul 18 08:35:24 rockpro64 kernel: CPU: 3 PID: 261 Comm: nvme Not tainted 4.4.182-rockchip64 #1
         Jul 18 08:35:24 rockpro64 kernel: Hardware name: Pine64 RockPro64 (DT)
         Jul 18 08:35:24 rockpro64 kernel: task: ffffffc0e4e13800 task.stack: ffffffc0dfb9c000
         Jul 18 08:35:24 rockpro64 kernel: PC is at nvme_kthread+0xac/0x1d8
         Jul 18 08:35:24 rockpro64 kernel: LR is at nvme_kthread+0x78/0x1d8
         Jul 18 08:35:24 rockpro64 kernel: pc : [<ffffff80087564cc>] lr : [<ffffff8008756498>] pstate: 20000145
         Jul 18 08:35:24 rockpro64 kernel: sp : ffffffc0dfb9fd60
         Jul 18 08:35:24 rockpro64 kernel: x29: ffffffc0dfb9fd60 x28: ffffffc0df92ed00
         Jul 18 08:35:24 rockpro64 kernel: x27: 0000000002080020 x26: ffffffc0ebaa0228
         Jul 18 08:35:24 rockpro64 kernel: x25: ffffff80091ff0f0 x24: ffffff80091ff0f0
         Jul 18 08:35:24 rockpro64 kernel: x23: ffffff8008755310 x22: ffffff80091ff0d8
         Jul 18 08:35:24 rockpro64 kernel: x21: 0000000000000007 x20: ffffff8009cc801c
         Jul 18 08:35:24 rockpro64 kernel: x19: ffffffc0ebd1d400 x18: 0000000000000000
         ...
         ...
         Jul 18 08:35:24 rockpro64 kernel: Unhandled fault: synchronous external abort (0x96000210) at 0xffffff8009cc8000
         Jul 18 08:35:24 rockpro64 kernel: Bad mode in Error handler detected, code 0xbf000002 -- SError
         Jul 18 08:35:24 rockpro64 kernel: BUG: spinlock lockup suspected on CPU#3, nvme/261
         Jul 18 08:35:24 rockpro64 kernel: lock: 0xffffff8009141870, .magic: dead4ead, .owner: nvme/261, .owner_cpu: 3
         Jul 18 08:35:24 rockpro64 kernel: CPU: 3 PID: 261 Comm: nvme Not tainted 4.4.182-rockchip64 #1
         Jul 18 08:35:24 rockpro64 kernel: Hardware name: Pine64 RockPro64 (DT)
         Jul 18 08:35:24 rockpro64 kernel: Call trace:
         Jul 18 08:35:24 rockpro64 kernel: [<ffffff80080882b0>] dump_backtrace+0x0/0x1bc
         Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008088490>] show_stack+0x24/0x30
         Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008587fec>] dump_stack+0x98/0xc0
         Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008106164>] spin_dump+0x84/0xa4
         Jul 18 08:35:24 rockpro64 kernel: [<ffffff8008106300>] do_raw_spin_lock+0xdc/0x164
         ...
         ...

     Full log: https://pastebin.com/1UeCiHDW

     Using smartmontools and physical temperature measurements, we can observe that the first chip on the Samsung 970 EVO (1 TB) SSD reaches 95 degrees Celsius before crashing. With the exact same image, board and adapter I tried other M.2 SSDs as well. Interestingly, the chip on a Samsung 970 EVO (500 GB) reached 107 degrees Celsius but did not crash. Other SSDs did not get as hot and had no issues with the stress tests:

     * Intel 660p (512 GB and 1 TB): ~80 degrees Celsius, physical measurement only
     * Crucial P1 (512 GB and 1 TB): ~75 degrees Celsius, physical measurement only
     * Western Digital Black (500 GB): ~70 degrees Celsius, physical measurement only

     So this issue might be related to that specific SSD. I'm still surprised that the Samsung SSDs do not throttle at all; IMHO they should never get that hot in the first place.
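The temperature readings in this post come from watching `smartctl -a` by hand. A minimal sketch of how they could be logged over time instead is shown here. It assumes the NVMe report contains a line of the form `Temperature: 45 Celsius`; the function names, the device path argument and the log file are hypothetical and not part of the original test setup.

```shell
#!/bin/bash
# Sketch only: extract the NVMe temperature from `smartctl -a` output.
# Assumption: the report has a line like "Temperature:    45 Celsius".

# Reads smartctl output on stdin and prints the temperature in degrees Celsius.
# Only the first line starting with "Temperature:" is used, so per-sensor
# lines such as "Temperature Sensor 1:" are ignored.
parse_nvme_temp() {
    awk '/^Temperature:/ { print $2; exit }'
}

# Hypothetical helper: appends "<epoch seconds> <temperature>" to a log file
# every few seconds until interrupted.
# $1 = device, $2 = log file, $3 = interval in seconds (default 5)
log_nvme_temp() {
    local dev="$1" logfile="$2" interval="${3:-5}"
    while true; do
        printf '%s %s\n' "$(date +%s)" "$(smartctl -a "$dev" | parse_nvme_temp)" >> "$logfile"
        sync   # flush to storage so the last reading is more likely to survive a freeze
        sleep "$interval"
    done
}
```

It could be started in an additional tmux pane, for example with `log_nvme_temp /dev/nvme0n1 /root/nvme-temp.log`. The log file should live on storage that is not the SSD under test and that survives a power cycle, which is not the case for a tmpfs-backed overlayroot.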
  8. Thanks Myy for the pointers. I'll check whether the dptx.bin driver helps prevent the boot oops message. Is there a way to tell how that binary file has been compiled, or to make sure it is legit? Regarding the stress testing in overlayroot, this command aborts immediately, as it fills up the available tmpfs within seconds. Good thought about `find /`, I'll try that.
  9. Unfortunately, I still keep getting the freezes from time to time. Two things I noticed:
       • When running `stress -i 4 -d 4` I can crash the board very reliably in ~3 minutes. Not just any board, only this one, but even without any additional peripherals like the SSD plugged in. As it's running on eMMC, it might be this particular eMMC that causes the crash.
       • This led me to build the latest Armbian Ubuntu 18.04 image with `overlayroot` to eliminate all I/O to the eMMC. The board that had been crashing has now been running for ~2 days.
     I'll try to gather more data in case the boards crash with the build from the current master branch. Just FYI in case you're curious: this is the project I'm working on https://github.com/digitalbitbox/bitbox-base.
  10. Thanks Igor! Will try that today and let you know how it works out. Update: works fine so far, after 3+ hours uptime, also with the self-compiled image, but intervals between freezes can be quite long.
  11. Additional info: after a short uptime of ~1h the board started to fault repeatedly, but without crashing completely. SSH connections were closed, but later a login was possible again. Three specific errors occurred at short intervals, all logged in full here: https://pastebin.com/SAcUAGb2

          Jul 8 16:59:31 carol kernel: [ 3752.234046] Unhandled fault: synchronous external abort (0x96000210) at 0xffffff8009d5401c
          Jul 8 16:59:31 carol kernel: [ 3752.240736] Internal error: : 96000210 [#1] SMP
          ...
          Jul 8 17:00:12 carol kernel: [ 3759.996389] BUG: spinlock lockup suspected on CPU#3, nvme/296
          Jul 8 17:00:12 carol kernel: [ 3760.001966] lock: 0xffffff8009141870, .magic: dead4ead, .owner: nvme/296, .owner_cpu: 3
          ...
          Jul 8 17:00:12 carol kernel: [ 3792.419942] Watchdog detected hard LOCKUP on cpu 3
          Jul 8 17:00:12 carol kernel: [ 3792.420464] ------------[ cut here ]------------
          Jul 8 17:00:12 carol kernel: [ 3792.430494] WARNING: at kernel/watchdog.c:352
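To compare boards or kernels, it may help to count how often each of the three error types from this post shows up in a saved kernel log. The following is a minimal sketch under that assumption; the function name `count_faults` is hypothetical, and the search strings are taken verbatim from the log excerpt in the post.

```shell
#!/bin/bash
# Sketch only: count occurrences of the three fault signatures from the
# excerpt (external abort, spinlock lockup, watchdog hard lockup).

# $1 = path to a kernel log file, e.g. a saved copy of kern.log
# Prints one "<count><TAB><signature>" line per signature.
count_faults() {
    local logfile="$1" pattern
    for pattern in \
        'synchronous external abort' \
        'spinlock lockup suspected' \
        'hard LOCKUP'
    do
        # grep -c prints 0 when nothing matches, so every signature is listed
        printf '%s\t%s\n' "$(grep -c -- "$pattern" "$logfile")" "$pattern"
    done
}
```

Running `count_faults /var/log/kern.log` on each device after a test run gives a rough per-board comparison without reading the full traces.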
  12. On several RockPro64 boards I experience infrequent freezes, mostly directly on boot, but also after longer uptime (hours, days). I use the official power adapter and no additional hardware except a PCIe adapter for an SSD, which works flawlessly when operational. Looking at kern.log, there are quite a few errors and warnings, but compared with a successful boot, all of them are present there as well. So I don't think they cause the freeze directly. I recorded this freeze on the latest Armbian Bionic, which was released just recently. FWIW, I think the same issue also occurs on the previous Debian Stretch image (I had various freezes, but have not recorded any details yet). On an unsuccessful boot, the error occurs after about 8 seconds. A reboot (or two) usually fixes the issue, until the next time...

          Jul 8 15:18:23 carol kernel: [ 8.528275] Unable to handle kernel NULL pointer dereference at virtual address 00000000
          Jul 8 15:18:23 carol kernel: [ 8.530625] pgd = ffffffc0ead3f000
          Jul 8 15:18:23 carol kernel: [ 8.532515] [00000000] *pgd=0000000000000000, *pud=0000000000000000
          Jul 8 15:18:23 carol kernel: [ 8.534710] Internal error: Oops: 96000005 [#1] SMP
          Jul 8 15:18:24 carol kernel: [ 8.536753] Modules linked in: af_packet iptable_nat nf_nat_ipv4 nf_nat nf_log_ipv4 nf_log_common xt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_conntrack nf_conntrack iptable_filter snd_soc_rockchip_hdmi_dp rk_vcodec ip_tables x_tables autofs4 phy_rockchip_pcie
          Jul 8 15:18:24 carol kernel: [ 8.542598] CPU: 5 PID: 1044 Comm: find Not tainted 4.4.182-rockchip64 #1
          Jul 8 15:18:24 carol kernel: [ 8.544959] Hardware name: Pine64 RockPro64 (DT)
          Jul 8 15:18:24 carol kernel: [ 8.547155] task: ffffffc0eb247000 task.stack: ffffffc0e1c78000
          Jul 8 15:18:24 carol kernel: [ 8.549491] PC is at do_dentry_open+0x234/0x2e4
          Jul 8 15:18:24 carol kernel: [ 8.551687] LR is at do_dentry_open+0x288/0x2e4
          Jul 8 15:18:24 carol kernel: [ 8.553852] pc : [<ffffff80081f2738>] lr : [<ffffff80081f278c>] pstate: a0000145
          Jul 8 15:18:24 carol kernel: [ 8.556284] sp : ffffffc0e1c7bbc0
          Jul 8 15:18:24 carol kernel: [ 8.558403] x29: ffffffc0e1c7bbc0 x28: ffffffc0eb247000
          Jul 8 15:18:24 carol kernel: [ 8.560742] x27: 0000000000000000 x26: ffffffc0f26eb000
          Jul 8 15:18:24 carol kernel: [ 8.563064] x25: 000000000000011d x24: ffffffc0e1cff690
          Jul 8 15:18:24 carol kernel: [ 8.565376] x23: ffffff8008219cb8 x22: 0000000000000000
          Jul 8 15:18:24 carol kernel: [ 8.567670] x21: 0000000000000000 x20: ffffffc0f27882b0
          Jul 8 15:18:24 carol kernel: [ 8.569954] x19: ffffffc0e1cff680 x18: 0000007fb4979a70
          Jul 8 15:18:24 carol kernel: [ 8.572191] x17: 0000007fb48e8848 x16: ffffff80081f3ea4
          Jul 8 15:18:24 carol kernel: [ 8.574416] x15: 0000000000000000 x14: ffffffffffffffff
          Jul 8 15:18:24 carol kernel: [ 8.576663] x13: 0000000000000000 x12: 0101010101010101
          Jul 8 15:18:24 carol kernel: [ 8.578896] x11: 7f7f7f7f7f7f7f7f x10: 0000007fb4a8a140
          Jul 8 15:18:24 carol kernel: [ 8.581115] x9 : 0000000000000000 x8 : ffffffc0e1cff7b8
          Jul 8 15:18:24 carol kernel: [ 8.583390] x7 : 0000000000000000 x6 : ffffffc0f061d1e9
          Jul 8 15:18:24 carol kernel: [ 8.585641] x5 : 0000000000000000 x4 : 00000000000055b1
          Jul 8 15:18:24 carol kernel: [ 8.587874] x3 : 00000040eee4a000 x2 : ffffff8008219a20
          Jul 8 15:18:24 carol kernel: [ 8.590114] x1 : ffffff8008c02140 x0 : 0000000000000000
          Jul 8 15:18:24 carol kernel: [ 8.592341]
          Jul 8 15:18:24 carol kernel: [ 8.592341] PC: 0xffffff80081f26b8:
          Jul 8 15:18:24 carol kernel: [ 8.596219] 26b8 54fffd60 f940c680 f9001660 b4fffd20 aa1603e1 aa1303e0 940c0a2c 2a0003f6
          Jul 8 15:18:24 carol kernel: [ 8.598764] 26d8 35000700 b9405261 d5033bbf f940ca80 b5000320 b50004b7 f9401660 f9402c17
          Jul 8 15:18:24 carol kernel: [ 8.601305] 26f8 b5000457 b9405660 370004c0 b9405660 36080100 f9401661 f9400c22 b5000062
          Jul 8 15:18:24 carol kernel: [ 8.603883] 2718 f9401421 b4000061 320e0000 b9005660 b9405260 12166c00 b9005260 f9409a60
          ...

      Full kern.log boot log: https://pastebin.com/zcpxB1HQ

      Please find the full armbianmonitor output here: https://pastebin.com/NkVAejC6

      Any help is greatly appreciated!
  13. Just wanted to let you know that we collaborated with Mender, and Armbian for RockPro64 is now fully supported by the Mender.io open source client-server manager for over-the-air software updates. It should be relatively straightforward to extend this approach to other boards as well. https://mender.io/ https://github.com/mendersoftware/mender-convert/pull/103
  14. I am looking to build a stable appliance that is updateable on demand / over-the-air in an atomic way. Ideally, it uses a dual-partition setup, where there is an active rootfs and an inactive one. On update, a new rootfs is streamed directly to the inactive partition, the device reboots into the new rootfs and, if everything runs according to plan, the new rootfs is committed as the new active one. If the reboot fails, the device reverts to the previous (still active) rootfs. This functionality is very common with embedded devices, many of them using the Yocto project. I'd like to try Armbian instead, building the image from source with my own userpatches and including a software package like [swupdate](https://github.com/sbabic/swupdate) or [rauc](https://github.com/rauc/rauc/) for the update functionality. The challenge for me is mainly to get the correct disk image (dual partition and bootloader configuration). I was not able to find any documentation, guide or project that tries something similar, integrating an updater into the Armbian build process. Is anyone aware of other projects where I could get inspiration, or is this really a first?
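The commit-or-revert decision described in this post can be sketched as a small piece of shell logic. This is an illustration of the general A/B scheme only, not the way swupdate, rauc or Mender implement it. The slot names `A` and `B`, the function names and the `ok` result value are all hypothetical.

```shell
#!/bin/bash
# Sketch only: dual-rootfs (A/B) slot selection.
# "A" and "B" are hypothetical names for the two rootfs partitions.

# Prints the slot that is currently NOT active, i.e. the update target.
inactive_slot() {
    case "$1" in
        A) echo B ;;
        B) echo A ;;
        *) echo "unknown slot: $1" >&2; return 1 ;;
    esac
}

# Prints the slot that should be active after an update attempt.
# $1 = slot that was active before the update
# $2 = "ok" if the device booted the new rootfs and passed its checks
slot_after_update() {
    local active="$1" result="$2"
    if [ "$result" = "ok" ]; then
        inactive_slot "$active"   # commit: the new rootfs becomes the active one
    else
        echo "$active"            # revert: keep the previous rootfs
    fi
}
```

In a real setup the resulting slot would be written to a bootloader variable, and the revert case is typically handled by the bootloader itself, since a failed boot may never reach userspace.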