Jump to content

mikhailai

Members
  • Posts

    10
  • Joined

  • Last visited

Recent Profile Visitors

The recent visitors block is disabled and is not being shown to other users.

  1. I saw the reply on the bug, but decided to just send a patch (that refers to the bug) and copy it to the correct lists. I'll follow up if there would be no replies to the patch.
  2. Ok, I've filed the kernel bug: https://bugzilla.kernel.org/show_bug.cgi?id=219089 And submitted the patch: https://lore.kernel.org/linux-arm-kernel/20240729165036.7368-1-mikhailai@gmail.com/ Let's see whether anyone would pay attention to it. Unfortunately, I'm not sure if I ended-up CC-ing all the right people. The merge window seem to have just closed, so it may take 2 months before getting pulled into the mainline + whatever time it would take to make it into release branches. Also, I'm not really sure where non-vanilla core or compiler would come into play. I could certainly reproduce the problem with vanilla 6.10 (mainline) Linux kernel. I DID use Ubuntu compiler (on Linux Mint), but I'm 99.99% certain it'll behave identically if compiled on Debian with the same GCC version.
  3. Hi y'all. Sorry for disappearing for a while: it really took some time to investigate, but now I'm pretty sure I've found the root cause. TLDR: My analysis above is wrong. The bug is present in v6.1.87, and change to "drivers/char/random.c" has nothing to do with it: just accidentally happens to trigger the bug. The problem only manifests itself when the ftrace "mcount" call instruction for _raw_spin_unlock_irqrestore function in the kernel code straddles the instruction cache lines. This happens when "_raw_spin_unlock_irqrestore" address ends on (hex): 1c, 3c, 5c, ... fc (see System.map for that). Due to above arbitrary changes to the kernel code may trigger this problem to appear or disappear. In other words, the hang may look fixed, but then show up later. It is present through the whole 6.1.y kernel branch, as well as 6.6.y branch. I did not check the mainline or earlier branches. The problem does not appear when the kernel is compiled with GCC 9, which is a default cross-compiler on Ubuntu 20.04 (Focal). See the bottom of this post for the correct fix. I'll try to get into the upstream Linux kernel. LONG VERSION: Problem: Linux kernel hangs in early boot on 32-bit ARM platform, when ftrace 4-byte "mcount" function call location for "_raw_spin_unlock_irqrestore" function straddles icache lines. The problem persist through the whole 6.1.y kernel branch and likely beyond. Could also reproduce it in the 6.6.y branch with a bit more "nop" placement (see below). ROOT CAUSE ANALYSIS: The hang is inside: start_kernel -> ftrace_init -> ftrace_process_locs -> ftrace_update_code. It hangs when it updates the ftrace location (by calling "ftrace_nop_initialize") for the entry for: _raw_spin_unlock_irqrestore The reason is the following: "ftrace_nop_initialize" calls "ftrace_init_nop", which on 32-bit ARM goes to "ftrace_make_nop". "ftrace_make_nop" calls "ftrace_modify_code" that calls "__patch_text", that in-turn calls "__patch_text_real" (defined in "arch/arm/kernel/patch.c") with remap=true. After writing the actual instruction, "__patch_text_real" does the following: if (waddr != addr) { flush_kernel_vmap_range(waddr, twopage ? size / 2 : size); patch_unmap(FIX_TEXT_POKE0, &flags); } flush_icache_range((uintptr_t)(addr), (uintptr_t)(addr) + size); The "patch_unmap" calls the above-mentioned "_raw_spin_unlock_irqrestore". Hereby lies the problem. If it's patching the "_raw_spin_unlock_irqrestore", it invokes the function BEFORE flushing the icache, so there is a possibility of that function having an invalid code created by the combination of the updated and non-updated pieces of the instruction residing in different cache lines. The occurrence of the error strongly depends on other factors: that's why it worked for earlier 6.1.y kernels. Necessary factors: The ftrace location for "_raw_spin_unlock_irqrestore" is NOT 4-byte aligned and 4 bytes at this location straddle the instruction cache line (0x20) boundaries. I.e. the pg->records[i]->ip (hex) value ends on: 0x1e, 0x3e, 0x5e, ... 0xfe. For that function, this value is offset from the function address by 2 bytes. The previous Ftrace entry needs to be updated as well. That is probably needed to get the icache into inconsistent state. For the reproduced hangs, the previous entry is inside the "_raw_write_unlock_irqrestore" (unlike _raw_spin_unlock_irqrestore, it is NOT being invoked when "ftrace_update_code" is executing). The problem is present for (cross-compiler) GCC 10, 11, 12. It does not happen when the kernel is compiled with GCC 9, even when condition (1) is satisfied. Not sure what is the reason: could be different code or condition (2) being different, leading to cache NOT get into an inconsistent state. Note, the default cross-compiler on Ubuntu 22.04 (Jammy) is GCC 11, while the default compiler on Ubuntu 20.04 (Focal) is GCC 9. Note, the condition (1) can be achieved by increasing/decreasing code size of certain functions. The following algorithm can be used. Add 4 "nop" instructions at a time to "drivers/char/random.c", "try_to_generate_entropy" function, until "_raw_spin_unlock_irqrestore" address ends on -x8, or -xC, where "x" is odd. E.g. ...1c, ...3c, ...5c, etc. E.g. asm("nop;nop;nop;nop; "); If it ends on 8, add 2 more "nop" instructions to one of the lock functions inside the "__lock_text_start" section: see the System.map on which one comes first/earlier. PROPOSED FIX: The fix is really simple: just swap the order of "patch_unmap" and "flush_icache_range" in the above code snippet (from "arch/arm/kernel/patch.c", "__patch_text_real" function). I.e. replace the above code snippet with: if (waddr != addr) flush_kernel_vmap_range(waddr, twopage ? size / 2 : size); flush_icache_range((uintptr_t)(addr), (uintptr_t)(addr) + size); /* Can only call 'patch_unmap' after flushing dcache and icache, * because it calls 'raw_spin_unlock_irqrestore', but that may * happen to be the very function we're currently patching * (as it happens during the ftrace init). */ if (waddr != addr) patch_unmap(FIX_TEXT_POKE0, &flags);
  4. True, but the "random.c" on the 6.6 branch contains bunch of other changes not present in 6.1 (15 commits to be precise). I suppose the change "random: handle creditable entropy from atomic process context" woks well with these commits, but is broken without some of these changes. In fact, I kind-of confirmed that, per my comment below. Overall, this looks plausible. The change was originally done and tested on the mainline, with all other changes being present. Then it was cherry-picked into 6.6 and 6.1 branches, where it received more limited testing that did not catch the problem. I'm guessing the problem does not show up on x86 and shows up on armhf. It could be timing dependent, so only shows up under specific circumstances. I'm hoping there would be just a few commits (ideally just one) that could be cherry-picked into 6.1 branch to make it work.
  5. Ok, returning to the original question. I did some dissection, and the problem appears to be a 6.1.x kernel bug as opposed to something being broken on the Armbian side. Disclaimer: I did not use a proper Armbian build; rather just took the kernel code from "linux-6.1.y" branch and used "config-6.1.77-legacy-sunxi". So here are my results: The v6.1.87 is booting fine: the same way as "linux-image-legacy-sunxi" version 24.2.1. The v6.1.88 is broken with the same symptoms as "linux-image-legacy-sunxi" version 24.5.1. The culprit is the following commit: 07b37f227c8daa27e68f57b1c691fab34a06731e (HEAD) random: handle creditable entropy from atomic process context commit 07b37f227c8daa27e68f57b1c691fab34a06731e Author: Jason A. Donenfeld <Jason@zx2c4.com> Date: Wed Apr 17 13:38:29 2024 +0200 random: handle creditable entropy from atomic process context commit e871abcda3b67d0820b4182ebe93435624e9c6a4 upstream. The entropy accounting changes a static key when the RNG has initialized, since it only ever initializes once. Static key changes, however, cannot be made from atomic context, so depending on where the last creditable entropy comes from, the static key change might need to be deferred to a worker. Previously the code used the execute_in_process_context() helper function, which accounts for whether or not the caller is in_interrupt(). However, that doesn't account for the case where the caller is actually in process context but is holding a spinlock. This turned out to be the case with input_handle_event() in drivers/input/input.c contributing entropy: [<ffffffd613025ba0>] die+0xa8/0x2fc [<ffffffd613027428>] bug_handler+0x44/0xec [<ffffffd613016964>] brk_handler+0x90/0x144 [<ffffffd613041e58>] do_debug_exception+0xa0/0x148 [<ffffffd61400c208>] el1_dbg+0x60/0x7c [<ffffffd61400c000>] el1h_64_sync_handler+0x38/0x90 [<ffffffd613011294>] el1h_64_sync+0x64/0x6c [<ffffffd613102d88>] __might_resched+0x1fc/0x2e8 [<ffffffd613102b54>] __might_sleep+0x44/0x7c [<ffffffd6130b6eac>] cpus_read_lock+0x1c/0xec [<ffffffd6132c2820>] static_key_enable+0x14/0x38 [<ffffffd61400ac08>] crng_set_ready+0x14/0x28 [<ffffffd6130df4dc>] execute_in_process_context+0xb8/0xf8 [<ffffffd61400ab30>] _credit_init_bits+0x118/0x1dc [<ffffffd6138580c8>] add_timer_randomness+0x264/0x270 [<ffffffd613857e54>] add_input_randomness+0x38/0x48 [<ffffffd613a80f94>] input_handle_event+0x2b8/0x490 [<ffffffd613a81310>] input_event+0x6c/0x98 According to Guoyong, it's not really possible to refactor the various drivers to never hold a spinlock there. And in_atomic() isn't reliable. So, rather than trying to be too fancy, just punt the change in the static key to a workqueue always. There's basically no drawback of doing this, as the code already needed to account for the static key not changing immediately, and given that it's just an optimization, there's not exactly a hurry to change the static key right away, so deferal is fine. Reported-by: Guoyong Wang <guoyong.wang@mediatek.com> Cc: stable@vger.kernel.org Fixes: f5bda35fba61 ("random: use static branch for crng_ready()") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> diff --git a/drivers/char/random.c b/drivers/char/random.c index 5d1c8e1c99b5..fd57eb372d49 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -683,7 +683,7 @@ static void extract_entropy(void *buf, size_t len) static void __cold _credit_init_bits(size_t bits) { - static struct execute_work set_ready; + static DECLARE_WORK(set_ready, crng_set_ready); unsigned int new, orig, add; unsigned long flags; @@ -699,8 +699,8 @@ static void __cold _credit_init_bits(size_t bits) if (orig < POOL_READY_BITS && new >= POOL_READY_BITS) { crng_reseed(); /* Sets crng_init to CRNG_READY under base_crng.lock. */ - if (static_key_initialized) - execute_in_process_context(crng_set_ready, &set_ready); + if (static_key_initialized && system_unbound_wq) + queue_work(system_unbound_wq, &set_ready); wake_up_interruptible(&crng_init_wait); kill_fasync(&fasync, SIGIO, POLL_IN); pr_notice("crng init done\n"); @@ -870,8 +870,8 @@ void __init random_init(void) /* * If we were initialized by the cpu or bootloader before jump labels - * are initialized, then we should enable the static branch here, where - * it's guaranteed that jump labels have been initialized. + * or workqueues are initialized, then we should enable the static + * branch here, where it's guaranteed that these have been initialized. */ if (!static_branch_likely(&crng_is_ready) && crng_init >= CRNG_READY) crng_set_ready(NULL); The code change is rather simple: it switches from using "execute_in_process_context" to "queue_work", but that switch is causing the lock-up. I don't have enough knowledge to debug why it is happening: suspect some sort of a deadlock. I've tried taking the "random.c" from the 6.6.34 kernel and doing hacky modifications to get to to compile on 6.1.y: that fixed the problem, so I'm guessing the "random.c" on the 6.1.y branch is not in a good state. Does anyone have suggestions on how to proceed from here?
  6. I can try doing one-off fix for the current Armbian release, but I cannot commit to support these patches going forward: I'm very short on time right now. LMK if you're still interested giving me the information. I guess I should start off with reading documentation on building the Armbian (never built any image).
  7. The Linux kernel contained in the latest "linux-image-legacy-sunxi" (version 24.5.1) package appears to broken to the point of locking-up right from the start. It prints "Starting kernel ...", and no more messages appear even with "verbosity=7" set in the "armbianEnv.txt". The "linux-image-legacy-sunxi" version 24.2.1 boots just fine. Here are the steps to reproduce the problem. I've done this on "Orange Pi One" board, but exactly the same issue occurs on (community maintained) Banana Pi M1. 1. Download and write the Armbian image to a MicroSD card. 2. Connect the serial console, boot the board, finish setup, do all the upgrades: everything works fine at this point. 3. Set "verbosity=7" in the "armbianEnv.txt", reboot and observe the kernel messages. At this point, the "linux-image-current-sunxi", version 24.5.1 (kernel 6.6.31) is installed. 4. Install "armbian-config" and use it to switch to "linux-image-legacy-sunxi=24.2.1 (6.1.77)". Observe that the board boots up fine. 5. Now switch to "linux-image-legacy-sunxi=24.5.1 (6.1.92)". The boot process now gets stuck at "Starting kernel ..." message. So as a summary: * "linux-image-current-sunxi" version 24.5.1 with 6.6.31 kernel: boots fine. * "linux-image-legacy-sunxi" version 24.2.1 with "6.1.77" kernel: boots fine. * "linux-image-legacy-sunxi" version 24.5.1 with "6.1.92" kernel: broken: stuck at "Starting kernel ..." message. I wonder if anyone could check what could have happened with "linux-image-legacy-sunxi" in the latest Armbian build.
  8. Tried the verbosity=7, the same effect. With 6.1.77 kernel: Working FDT set to 49630000 Starting kernel ... [ 0.000000] Booting Linux on physical CPU 0x0 [ 0.000000] Linux version 6.1.77-legacy-sunxi (armbian@next) (arm-linux-gnueabihf-gcc (Ubuntu 11.4.0-1ubuntu1~ 22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #1 SMP Fri Feb 23 08:12:53 UTC 2024 [ 0.000000] CPU: ARMv7 Processor [410fc074] revision 4 (ARMv7), cr=50c5387d [ 0.000000] CPU: div instructions available: patching division code [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache [ 0.000000] OF: fdt: Machine model: LeMaker Banana Pi ...... With 6.1.92 kernel from "linux-image-legacy-sunxi" version 24.5.1 Working FDT set to 49630000 Starting kernel ... I tried the Device Trees from both "linux-dtb-legacy-sunxi" and "linux-dtb-current-sunxi": makes no difference. This again points to the "linux-image-legacy-sunxi" version 24.5.1 being broken. I wonder if someone with more authority can try running the kernel from that package on any Allwinner A20 board, or any Allwinner H3 Board (e.g. Orange Pi One, Orange Pi PC Plus), and confirm or disprove my claim.
  9. I have a similar issue on "Banana PI M1", so I can provide more details. Basically, what I've found out is the linux kernel contained in the latest "linux-image-legacy-sunxi" (version 24.5.1) package appears to broken to the point of locking-up right from the start. It prints "Starting kernel...", and no more messages appear even with "verbosity=6" setting in the "armbianEnv.txt". * "linux-image-current-sunxi" version 24.5.1 with 6.6.31 kernel: boots fine. * "linux-image-legacy-sunxi" version 24.2.1 with "6.1.77" kernel: boots fine. * "linux-image-legacy-sunxi" version 24.5.1 with "6.1.92" kernel: broken as described above. In the "boots fine" cases, the kernel prints multiple other messages, before starting the OS boot. Here are the steps to reproduce. I've done it for "Banana PI M1", but I think any board with Allwinner "Cortex-A7"-based SOC should work. 1. Download and write the Armbian image to a MicroSD card. 2. Connect the serial console, boot the board, finish setup, do all the upgrades: assume everything works fine at this point. Double-check that "linux-image-current-sunxi" is what's being used. 3. Set "verbosity=6" in the "armbianEnv.txt", reboot and observe the kernel messages. 4. Install "armbian-config" and use it to switch to the legacy kernel. After step (4) reboots, observe the boot process gets stuck at "Starting kernel..." message. What I'm guessing could've happened is something went wrong during the build process for that specific target (e.g. data corruption, wrong architecture being used).
  10. Actually, there is a better way to eliminate this CPU load without disabling the ADC driver and losing the ability to access the SOC temperature. The Linux thermal management periodically polls the SOC temperature, and you can disable it by adding the following line to /etc/rc.local: echo disabled > /sys/devices/virtual/thermal/thermal_zone0/mode Details/background: The A20 ADC driver is very inefficient: it seems to be using a busy loop (as mentioned here: https://linux-sunxi.narkive.com/1UetD5rh/bug-report-kworker-issue-for-mailline-kernel-4-12) The Linux thermal management periodically polls the SOC temperature (runs the ADC) at around 0.9Hz. I presume this would be useful for controlling the CPU fan, but probably not applicable to most A20 systems. Disabling this polling (as above) removes the CPU load, while still allowing inefficient one-off reads of the CPU temperature, as well as other ADC usage.
×
×
  • Create New...

Important Information

Terms of Use - Privacy Policy - Guidelines