
linux-image-legacy-sunxi=24.5.1 (kernel 6.1.92) is broken: stuck at "Starting kernel ..."



The Linux kernel contained in the latest "linux-image-legacy-sunxi" package (version 24.5.1) appears to be broken to the point of locking up right at the start. It prints "Starting kernel ...", and no further messages appear even with "verbosity=7" set in "armbianEnv.txt". The "linux-image-legacy-sunxi" version 24.2.1 boots just fine.

 

Here are the steps to reproduce the problem. I did this on an "Orange Pi One" board, but exactly the same issue occurs on the (community-maintained) Banana Pi M1.

1. Download and write the Armbian image to a MicroSD card.

2. Connect the serial console, boot the board, finish setup, do all the upgrades: everything works fine at this point.

3. Set "verbosity=7" in the "armbianEnv.txt", reboot and observe the kernel messages. At this point, the "linux-image-current-sunxi", version 24.5.1 (kernel 6.6.31) is installed.

4. Install "armbian-config" and use it to switch to "linux-image-legacy-sunxi=24.2.1 (6.1.77)". Observe that the board boots up fine.

5. Now switch to "linux-image-legacy-sunxi=24.5.1 (6.1.92)". The boot process now gets stuck at "Starting kernel ..." message.
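
For step 3, a minimal shell sketch of the edit. `set_verbosity` is a hypothetical helper (not an Armbian tool), and the path `/boot/armbianEnv.txt` in the usage note is an assumption about where the file lives on the board:

```shell
#!/bin/sh
# set_verbosity FILE LEVEL
# Set "verbosity=LEVEL" in an armbianEnv.txt-style file:
# replace the existing line if there is one, otherwise append it.
set_verbosity() {
    env_file=$1
    level=$2
    if grep -q '^verbosity=' "$env_file"; then
        # Rewrite through a temporary copy so the edit is all-or-nothing.
        sed "s/^verbosity=.*/verbosity=$level/" "$env_file" > "$env_file.tmp" &&
            mv "$env_file.tmp" "$env_file"
    else
        printf 'verbosity=%s\n' "$level" >> "$env_file"
    fi
}
```

On the board this would be run as root, e.g. `set_verbosity /boot/armbianEnv.txt 7`, followed by a reboot with the serial console attached.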

 

So as a summary:

*  "linux-image-current-sunxi" version 24.5.1 with 6.6.31 kernel: boots fine.

*  "linux-image-legacy-sunxi" version 24.2.1 with "6.1.77" kernel: boots fine.

*  "linux-image-legacy-sunxi" version 24.5.1 with "6.1.92" kernel: broken: stuck at "Starting kernel ..." message.

 

I wonder if anyone could check what could have happened with "linux-image-legacy-sunxi" in the latest Armbian build.


31 minutes ago, mikhailai said:

I wonder if anyone could check what could have happened with "linux-image-legacy-sunxi" in the latest Armbian build.

The last time these patches were changed:

Date:   Wed Mar 27 20:50:41 2024

Obviously, patches need to be rebased to the new kernel version and conflicts need to be fixed.

 

If you are ready to volunteer to support these patches, I can tell you how to do it.

 

Regards.


I can try doing a one-off fix for the current Armbian release, but I cannot commit to supporting these patches going forward: I'm very short on time right now. Let me know if you're still interested in giving me the information. I guess I should start by reading the documentation on building Armbian (I have never built an image).


54 minutes ago, mikhailai said:

I can try doing a one-off fix for the current Armbian release

That's enough.
There is no need to build an image.

It is enough to build the kernel package, install it on the system and check that it works.

I'll write the instructions.


On 6/17/2024 at 10:52 AM, mikhailai said:

stuck at "Starting kernel ..." message.

 

I did manage to build  a minimal legacy image (24.8.0-trunk, sunxi-legacy:6.1.94) from the current Armbian build system and it gets stuck at the "Starting kernel" message.

putty.txt


54 minutes ago, Stephen Graf said:

I did manage to build  a minimal legacy image (24.8.0-trunk, sunxi-legacy:6.1.94)

Could you publish the part of the kernel build log that reports on applying the patches?

 

9 hours ago, Stephen Graf said:

I just tried to build a legacy image for orangepione and it fails.

We don't need that build path.

 

Force the build system to always build the kernel package:

./compile.sh test ARTIFACT_IGNORE_CACHE="yes" kernel

Configuration file:

~/build$ cat userpatches/config-test.conf 
display_alert "Common settings for Armbian OS images" "setting default values" "info"
#declare -g USE_MAINLINE_GOOGLE_MIRROR="yes"
declare -g SYNC_CLOCK="no"
declare -g INSTALL_HEADERS="no"
declare -g WIREGUARD="no"
declare -g VENDOR="Armbian_community"
declare -g VENDORURL="https://github.com/armbian/build"
declare -g VENDORDOCS="https://docs.armbian.com"
declare -g VENDORSUPPORT="https://community.armbian.com/"
declare -g VENDORPRIVACY="https://duckduckgo.com/"
declare -g VENDORBUGS="https://github.com/armbian/community/issues"
declare -g VENDORLOGO="armbian-logo"
declare -g MAINTAINERMAIL=info@armbian.com
declare -g MAINTAINER="The-going"
declare -g COMPRESS_OUTPUTIMAGE="sha,img,xz"
declare -g IMAGE_XZ_COMPRESSION_RATIO=5

declare -g EXPERT="yes"
#declare -g KERNEL_CONFIGURE=yes
#declare -g DONT_BUILD_ARTIFACTS="firmware,full_firmware,fake_ubuntu_advantage_tools,armbian-config,armbian-zsh,armbian-plymouth-theme"

#Upload the log file to the armbian website.
#SHARE_LOG=yes
#ARTIFACT_IGNORE_CACHE="yes"

KERNEL_GIT=shallow

RELEASE=bookworm
BOARD=bananapim64

BRANCH=current

BUILD_DESKTOP=no
BUILD_MINIMAL=yes

 

P.S.

Edit BOARD=XXXX and BRANCH=YYYYY to match your board and kernel branch.
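
For the case discussed in this thread (legacy kernel on the Orange Pi One), the two lines to change would presumably look like the fragment below. The board name `orangepione` is taken from the quoted post above; that `legacy` is the right branch value is an assumption based on the package name, and the rest of the file stays as shown:

```shell
# userpatches/config-test.conf (only the changed lines)
BOARD=orangepione   # board name as quoted earlier in this post
BRANCH=legacy       # assumed value for the 6.1.y "legacy" kernel under test
```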


14 hours ago, going said:
./compile.sh test ARTIFACT_IGNORE_CACHE="yes" kernel

@going

I compiled with your test script for legacy orangepione.  There was no image produced. The curl command to upload the log file did not work and uploading the log file to this message also failed.

I did pull the attached patches section from the log file.

 

Can I email the files to you directly?

build_log_patches.txt


Ok, returning to the original question. I did some dissection, and the problem appears to be a 6.1.x kernel bug as opposed to something being broken on the Armbian side.

Disclaimer: I did not use a proper Armbian build; rather just took the kernel code from "linux-6.1.y" branch and used "config-6.1.77-legacy-sunxi".

 

So here are my results:

  • v6.1.87 boots fine, the same way as "linux-image-legacy-sunxi" version 24.2.1.
  • v6.1.88 is broken, with the same symptoms as "linux-image-legacy-sunxi" version 24.5.1.
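
The search over point releases can be sketched as a plain binary search. This is only an illustration: `find_first_bad` and `boots_ok` are hypothetical names, and `boots_ok` hard-codes the outcome reported above instead of actually building and booting a kernel:

```shell
#!/bin/sh
# find_first_bad GOOD BAD CHECK
# Binary search over 6.1.N patch levels. GOOD is known to boot, BAD is known
# to hang; CHECK is a command that gets N and returns 0 if 6.1.N boots.
find_first_bad() {
    good=$1
    bad=$2
    check=$3
    while [ $((bad - good)) -gt 1 ]; do
        mid=$(( (good + bad) / 2 ))
        if "$check" "$mid"; then
            good=$mid      # 6.1.mid boots: the breakage is later
        else
            bad=$mid       # 6.1.mid hangs: the breakage is here or earlier
        fi
    done
    echo "$bad"
}

# Hypothetical stand-in for "build v6.1.N, install it, watch the serial console".
# It encodes the result from this post so that the sketch runs offline.
boots_ok() {
    [ "$1" -le 87 ]
}

find_first_bad 77 92 boots_ok   # prints 88
```

Within a single point release, `git bisect start v6.1.88 v6.1.87` followed by `git bisect good` or `git bisect bad` after each boot test narrows the range down to one commit in the same way.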

The culprit is the following commit:

07b37f227c8daa27e68f57b1c691fab34a06731e (HEAD) random: handle creditable entropy from atomic process context

commit 07b37f227c8daa27e68f57b1c691fab34a06731e
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Wed Apr 17 13:38:29 2024 +0200

    random: handle creditable entropy from atomic process context
    
    commit e871abcda3b67d0820b4182ebe93435624e9c6a4 upstream.
    
    The entropy accounting changes a static key when the RNG has
    initialized, since it only ever initializes once. Static key changes,
    however, cannot be made from atomic context, so depending on where the
    last creditable entropy comes from, the static key change might need to
    be deferred to a worker.
    
    Previously the code used the execute_in_process_context() helper
    function, which accounts for whether or not the caller is
    in_interrupt(). However, that doesn't account for the case where the
    caller is actually in process context but is holding a spinlock.
    
    This turned out to be the case with input_handle_event() in
    drivers/input/input.c contributing entropy:
    
      [<ffffffd613025ba0>] die+0xa8/0x2fc
      [<ffffffd613027428>] bug_handler+0x44/0xec
      [<ffffffd613016964>] brk_handler+0x90/0x144
      [<ffffffd613041e58>] do_debug_exception+0xa0/0x148
      [<ffffffd61400c208>] el1_dbg+0x60/0x7c
      [<ffffffd61400c000>] el1h_64_sync_handler+0x38/0x90
      [<ffffffd613011294>] el1h_64_sync+0x64/0x6c
      [<ffffffd613102d88>] __might_resched+0x1fc/0x2e8
      [<ffffffd613102b54>] __might_sleep+0x44/0x7c
      [<ffffffd6130b6eac>] cpus_read_lock+0x1c/0xec
      [<ffffffd6132c2820>] static_key_enable+0x14/0x38
      [<ffffffd61400ac08>] crng_set_ready+0x14/0x28
      [<ffffffd6130df4dc>] execute_in_process_context+0xb8/0xf8
      [<ffffffd61400ab30>] _credit_init_bits+0x118/0x1dc
      [<ffffffd6138580c8>] add_timer_randomness+0x264/0x270
      [<ffffffd613857e54>] add_input_randomness+0x38/0x48
      [<ffffffd613a80f94>] input_handle_event+0x2b8/0x490
      [<ffffffd613a81310>] input_event+0x6c/0x98
    
    According to Guoyong, it's not really possible to refactor the various
    drivers to never hold a spinlock there. And in_atomic() isn't reliable.
    
    So, rather than trying to be too fancy, just punt the change in the
    static key to a workqueue always. There's basically no drawback of doing
    this, as the code already needed to account for the static key not
    changing immediately, and given that it's just an optimization, there's
    not exactly a hurry to change the static key right away, so deferal is
    fine.
    
    Reported-by: Guoyong Wang <guoyong.wang@mediatek.com>
    Cc: stable@vger.kernel.org
    Fixes: f5bda35fba61 ("random: use static branch for crng_ready()")
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 5d1c8e1c99b5..fd57eb372d49 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -683,7 +683,7 @@ static void extract_entropy(void *buf, size_t len)
 
 static void __cold _credit_init_bits(size_t bits)
 {
-	static struct execute_work set_ready;
+	static DECLARE_WORK(set_ready, crng_set_ready);
 	unsigned int new, orig, add;
 	unsigned long flags;
 
@@ -699,8 +699,8 @@ static void __cold _credit_init_bits(size_t bits)
 
 	if (orig < POOL_READY_BITS && new >= POOL_READY_BITS) {
 		crng_reseed(); /* Sets crng_init to CRNG_READY under base_crng.lock. */
-		if (static_key_initialized)
-			execute_in_process_context(crng_set_ready, &set_ready);
+		if (static_key_initialized && system_unbound_wq)
+			queue_work(system_unbound_wq, &set_ready);
 		wake_up_interruptible(&crng_init_wait);
 		kill_fasync(&fasync, SIGIO, POLL_IN);
 		pr_notice("crng init done\n");
@@ -870,8 +870,8 @@ void __init random_init(void)
 
 	/*
 	 * If we were initialized by the cpu or bootloader before jump labels
-	 * are initialized, then we should enable the static branch here, where
-	 * it's guaranteed that jump labels have been initialized.
+	 * or workqueues are initialized, then we should enable the static
+	 * branch here, where it's guaranteed that these have been initialized.
 	 */
 	if (!static_branch_likely(&crng_is_ready) && crng_init >= CRNG_READY)
 		crng_set_ready(NULL);

 

The code change is rather simple: it switches from "execute_in_process_context" to "queue_work", but that switch causes the lock-up. I don't know enough to debug why this happens; I suspect some sort of deadlock.

 

I've tried taking the "random.c" from the 6.6.34 kernel and making hacky modifications to get it to compile on 6.1.y: that fixed the problem, so I'm guessing the "random.c" on the 6.1.y branch is not in a good state.

 

Does anyone have suggestions on how to proceed from here?


2 hours ago, mikhailai said:

Does anyone have suggestions on how to proceed from here?

Analysis:

linux-stable> git log --pretty=oneline v6.1.87..07b37f227c8daa27e68f57b1c691fab34a06731e | wc -l
8

 

Maybe we could do the following:

1) Freeze the legacy kernel at version 6.1.87.

diff --git a/config/sources/families/include/sunxi64_common.inc b/config/sources/families/include/sunxi64_common.inc
index 18775666..e37fe516 100644
--- a/config/sources/families/include/sunxi64_common.inc
+++ b/config/sources/families/include/sunxi64_common.inc
@@ -25,6 +25,7 @@ case $BRANCH in
 
        legacy)
                declare -g KERNEL_MAJOR_MINOR="6.1" # Major and minor versions of this kernel.
+               declare -g KERNELBRANCH="tag:v6.1.78"
                ;;
 
        current)
diff --git a/config/sources/families/include/sunxi_common.inc b/config/sources/families/include/sunxi_common.inc
index 93b14ab8..f6261767 100644
--- a/config/sources/families/include/sunxi_common.inc
+++ b/config/sources/families/include/sunxi_common.inc
@@ -26,6 +26,7 @@ case $BRANCH in
 
        legacy)
                declare -g KERNEL_MAJOR_MINOR="6.1" # Major and minor versions of this kernel.
+               declare -g KERNELBRANCH="tag:v6.1.78"
                ;;
 
        current)

 

2) Rework the patches (re-extract them) for this kernel version.

3) Leave this kernel in that state and eliminate the root cause in the current 6.6 kernel, if it is present there.

 

 


On 17.06.2024 at 20:52, mikhailai said:

"linux-image-current-sunxi" version 24.5.1 with 6.6.31 kernel: boots fine.

 

3 hours ago, mikhailai said:

The culprit is the following commit:

07b37f227c8daa27e68f57b1c691fab34a06731e (HEAD) random: handle creditable entropy from atomic process context

This patch in the 6.6 kernel is present after the v6.6.28 tag 998f52a860555a9f02242bc0a4b3e9b47d47dc11

I think the problem lies elsewhere.


On 20.06.2024 at 02:21, Stephen Graf said:

Cut the log file by taking out all the kernel build log entries.

 

https://paste.armbian.com/ibamekatak

 

Summary: kernel patching: 498 total patches; 498 applied; 81 with problems; 80 needs_rebase; 4 not_mbox

 

This line shows that problems exist but says nothing about what kind of problems they are. A line offset? Fuzz?

In such cases a hunk may get applied to a different node in the DTS or to a different function in the C code.

Only someone who reads both the source file and the patch file can detect the problem.
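
The counters in the summary line quoted above can at least be pulled out mechanically, for example to compare two builds. A small sketch, assuming the line has exactly the format shown; `patch_stat` is a hypothetical helper, not part of the Armbian build system:

```shell
#!/bin/sh
# patch_stat LINE LABEL
# Print the number that directly precedes LABEL in a "kernel patching"
# summary line; prints nothing if LABEL does not occur.
patch_stat() {
    printf '%s\n' "$1" | sed -n "s/.*[:;] *\([0-9][0-9]*\) $2.*/\1/p"
}

summary='Summary: kernel patching: 498 total patches; 498 applied; 81 with problems; 80 needs_rebase; 4 not_mbox'

patch_stat "$summary" 'with problems'   # prints 81
patch_stat "$summary" 'needs_rebase'    # prints 80
```

This only tells how many patches are affected, not which ones or why; that still requires reading the per-patch entries in the log.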


6 hours ago, going said:

This patch in the 6.6 kernel is present after the v6.6.28 tag 998f52a860555a9f02242bc0a4b3e9b47d47dc11

I think the problem lies elsewhere.

True, but the "random.c" on the 6.6 branch contains a bunch of other changes not present in 6.1 (15 commits, to be precise). I suppose the change "random: handle creditable entropy from atomic process context" works well with these commits, but is broken without some of them.

 

In fact, I kind-of confirmed that, per my comment below.

10 hours ago, mikhailai said:

I've tried taking the "random.c" from the 6.6.34 kernel and making hacky modifications to get it to compile on 6.1.y: that fixed the problem, so I'm guessing the "random.c" on the 6.1.y branch is not in a good state.

 

Overall, this looks plausible. The change was originally done and tested on the mainline, with all other changes being present. Then it was cherry-picked into 6.6 and 6.1 branches, where it received more limited testing that did not catch the problem. I'm guessing the problem does not show up on x86 and shows up on armhf. It could be timing dependent, so only shows up under specific circumstances.

 

I'm hoping there would be just a few commits (ideally just one) that could be cherry-picked into 6.1 branch to make it work.
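
One way to get the candidate list is to ask git which commits touching `drivers/char/random.c` are reachable from a 6.6 tag but not from a 6.1 one. A sketch, to be run inside a linux-stable clone; `commits_missing_from` is a hypothetical helper, and the tag names in the usage comment are only examples built from versions mentioned in this thread:

```shell
#!/bin/sh
# commits_missing_from OLD NEW FILE
# List non-merge commits that touch FILE and are reachable from NEW
# but not from OLD (a plain "git log OLD..NEW -- FILE").
commits_missing_from() {
    git log --oneline --no-merges "$1..$2" -- "$3"
}

# Usage inside a linux-stable checkout (tags assumed to be fetched):
#   commits_missing_from v6.1.94 v6.6.34 drivers/char/random.c
```

Backported commits get new hashes on each stable branch, so a change that is already present in 6.1.y can still show up in this list; the subjects have to be compared by hand.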


50 minutes ago, mikhailai said:

I'm hoping there would be just a few commits (ideally just one) that could be cherry-picked into 6.1 branch to make it work.

Okay, I get it.
Can we just take these few patches from the 6.6 kernel and add them to the 6.1 kernel?

It is better if they are in the form in which they already exist in 6.6.

I mean the ones you have already tested.
