
Helios64 - freeze whatever the kernel is.



Posted

No (and I'm using mdadm, not ZFS). So far, fortunately (otherwise I'd get mad about losing time and data just because of these errors ...), I haven't encountered a broken RAID or any errors (`mdadm --misc --detail` is fine, no errors in `dmesg -T`, and so on).

 

Indeed, maybe there are several different problems ... anyway, it's far from stable in its current state :(

 

Also, I don't think I have an overheating issue; my sensor readings (even when copying files, ...) are:

 

  Quote

lm75-i2c-2-4c
Adapter: rk3x-i2c
temp1:        +37.4°C  (high = +80.0°C, hyst = +75.0°C)

cpu-virtual-0
Adapter: Virtual device
temp1:        +46.2°C  (crit = +100.0°C)

tcpm_source_psy_4_0022-i2c-4-22
Adapter: rk3x-i2c
in0:          +0.00 V  (min =  +0.00 V, max =  +0.00 V)
curr1:        +0.00 A  (max =  +0.00 A)

gpu-virtual-0
Adapter: Virtual device
temp1:        +46.2°C  (crit = +95.0°C)


 

 

Posted

Hi @Seneca, @SR-G

 

Could you add the following lines to the beginning of /boot/boot.cmd:

regulator dev vdd_log
regulator value 930000
regulator dev vdd_center
regulator value 950000

then run

mkimage -C none -A arm -T script -d /boot/boot.cmd /boot/boot.scr

and reboot. Then check whether it improves stability.
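For anyone who prefers to script it, the steps above could look roughly like this. It is only a sketch: the helper name `prepend_vdd_tweak` and the `.bak` backup suffix are my own assumptions, while the four regulator lines and the mkimage command are copied from the post.

```shell
#!/bin/sh
# Sketch: prepend the four regulator lines to a boot.cmd file.
# prepend_vdd_tweak is a hypothetical helper name, not an Armbian tool.
prepend_vdd_tweak() {
    cmd_file="$1"
    # Do nothing if the tweak is already present.
    if grep -q '^regulator dev vdd_log$' "$cmd_file"; then
        return 0
    fi
    # Keep a backup (the .bak suffix is an arbitrary choice).
    cp "$cmd_file" "$cmd_file.bak" || return 1
    {
        printf '%s\n' 'regulator dev vdd_log' 'regulator value 930000' \
                      'regulator dev vdd_center' 'regulator value 950000'
        cat "$cmd_file.bak"
    } > "$cmd_file"
}

# On the NAS itself (as root), one would then run:
#   prepend_vdd_tweak /boot/boot.cmd
#   mkimage -C none -A arm -T script -d /boot/boot.cmd /boot/boot.scr
#   reboot
```

Calling the helper a second time leaves the file unchanged, so it should be safe to re-run.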

 

 


 

You could also use the attached boot.scr


Posted

OK, I just applied the suggested modifications this morning (by modifying the boot.cmd file and regenerating the boot.scr file).

The reboot went fine.

Let's see ...

 

Also, what exactly are these values? Are they related to the CPU speed and, if so, how do they differ from what is applied when changing the CPU governor configuration through armbian-config?

 

Previous armbianEnv.txt (for reference) (untouched)

 

```

verbosity=1
bootlogo=false
overlay_prefix=rockchip
rootdev=UUID=a79a14c0-3cf4-4fb9-a6c6-838571351371
rootfstype=ext4
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u

```

 

Previous boot.cmd file, for reference (modified as requested with the 4 new lines)

 

```

# DO NOT EDIT THIS FILE
#
# Please edit /boot/armbianEnv.txt to set supported parameters
#

setenv load_addr "0x9000000"
setenv overlay_error "false"
# default values
setenv rootdev "/dev/mmcblk0p1"
setenv verbosity "1"
setenv console "both"
setenv bootlogo "false"
setenv rootfstype "ext4"
setenv docker_optimizations "on"
setenv earlycon "off"

echo "Boot script loaded from ${devtype} ${devnum}"

if test -e ${devtype} ${devnum} ${prefix}armbianEnv.txt; then
    load ${devtype} ${devnum} ${load_addr} ${prefix}armbianEnv.txt
    env import -t ${load_addr} ${filesize}
fi

if test "${logo}" = "disabled"; then setenv logo "logo.nologo"; fi

if test "${console}" = "display" || test "${console}" = "both"; then setenv consoleargs "console=tty1"; fi
if test "${console}" = "serial" || test "${console}" = "both"; then setenv consoleargs "console=ttyS2,1500000 ${consoleargs}"; fi
if test "${earlycon}" = "on"; then setenv consoleargs "earlycon ${consoleargs}"; fi
if test "${bootlogo}" = "true"; then setenv consoleargs "bootsplash.bootfile=bootsplash.armbian ${consoleargs}"; fi

# get PARTUUID of first partition on SD/eMMC the boot script was loaded from
if test "${devtype}" = "mmc"; then part uuid mmc ${devnum}:1 partuuid; fi

setenv bootargs "root=${rootdev} rootwait rootfstype=${rootfstype} ${consoleargs} consoleblank=0 loglevel=${verbosity} ubootpart=${partuuid} usb-storage.quirks=${usbstoragequirks} ${extraargs} ${extraboardargs}"

if test "${docker_optimizations}" = "on"; then setenv bootargs "${bootargs} cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory swapaccount=1"; fi

load ${devtype} ${devnum} ${ramdisk_addr_r} ${prefix}uInitrd
load ${devtype} ${devnum} ${kernel_addr_r} ${prefix}Image

load ${devtype} ${devnum} ${fdt_addr_r} ${prefix}dtb/${fdtfile}
fdt addr ${fdt_addr_r}
fdt resize 65536
for overlay_file in ${overlays}; do
    if load ${devtype} ${devnum} ${load_addr} ${prefix}dtb/rockchip/overlay/${overlay_prefix}-${overlay_file}.dtbo; then
        echo "Applying kernel provided DT overlay ${overlay_prefix}-${overlay_file}.dtbo"
        fdt apply ${load_addr} || setenv overlay_error "true"
    fi
done
for overlay_file in ${user_overlays}; do
    if load ${devtype} ${devnum} ${load_addr} ${prefix}overlay-user/${overlay_file}.dtbo; then
        echo "Applying user provided DT overlay ${overlay_file}.dtbo"
        fdt apply ${load_addr} || setenv overlay_error "true"
    fi
done
if test "${overlay_error}" = "true"; then
    echo "Error applying DT overlays, restoring original DT"
    load ${devtype} ${devnum} ${fdt_addr_r} ${prefix}dtb/${fdtfile}
else
    if load ${devtype} ${devnum} ${load_addr} ${prefix}dtb/rockchip/overlay/${overlay_prefix}-fixup.scr; then
        echo "Applying kernel provided DT fixup script (${overlay_prefix}-fixup.scr)"
        source ${load_addr}
    fi
    if test -e ${devtype} ${devnum} ${prefix}fixup.scr; then
        load ${devtype} ${devnum} ${load_addr} ${prefix}fixup.scr
        echo "Applying user provided fixup script (fixup.scr)"
        source ${load_addr}
    fi
fi
booti ${kernel_addr_r} ${ramdisk_addr_r} ${fdt_addr_r}

# Recompile with:
# mkimage -C none -A arm -T script -d /boot/boot.cmd /boot/boot.scr
```

 

 

```

(...)

lrwxrwxrwx 1 root root   25 2021-01-10 14:30 uInitrd -> uInitrd-5.9.14-rockchip64
-rw-r--r-- 1 root root 3,2K 2021-02-01 09:13 boot.cmd
-rw-rw-r-- 1 root root 3,3K 2021-02-01 09:13 boot.scr
-rw-r--r-- 1 root root  166 2021-02-01 09:15 armbianEnv.txt
```

Posted
  On 2/1/2021 at 8:18 AM, SR-G said:

Also, what exactly are these values? Are they related to the CPU speed and, if so, how do they differ from what is applied when changing the CPU governor configuration through armbian-config?


 

These voltages affect various subsystems. The CPU governor works at a higher level: it is the algorithm that decides when to apply a given CPU speed from a predefined table.
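To see the governor side of this on a running system, something like the sketch below could be used. The helper name `show_cpufreq` is made up for illustration; the files it reads are the standard Linux cpufreq sysfs entries (values in kHz), and I'm assuming the board's cpufreq driver exposes `scaling_available_frequencies`.

```shell
#!/bin/sh
# Sketch: print the governor and frequency limits for each cpufreq policy.
# show_cpufreq is a hypothetical helper; the optional argument overrides the
# standard sysfs location (useful for trying it out on a copy).
show_cpufreq() {
    root="${1:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$root"/policy*; do
        [ -d "$policy" ] || continue
        printf '%s: governor=%s min=%s max=%s\n' \
            "$(basename "$policy")" \
            "$(cat "$policy/scaling_governor" 2>/dev/null)" \
            "$(cat "$policy/scaling_min_freq" 2>/dev/null)" \
            "$(cat "$policy/scaling_max_freq" 2>/dev/null)"
        # The predefined table of speeds the governor picks from.
        printf '  table: %s\n' \
            "$(cat "$policy/scaling_available_frequencies" 2>/dev/null)"
    done
}

# Usage on the NAS:
#   show_cpufreq
```

None of this touches the regulator voltages set in boot.cmd; it only shows which speed table and policy the governor is working with.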

Posted (edited)

So I applied these parameters on 2021/02/01, and today (2 days later) I just got another freeze. This time there was some IO, but not a lot (I was uploading files from the NAS to the cloud at 40MB/s). It's of course not the first time I've had IO running for hours; it's just that until now most freezes had happened without IO.

 

No red LED blinking this time once frozen, and all HDD LEDs are on but not blinking.

 

I had an open SSH connection and nothing was printed there; it's just frozen (pings from another host go unanswered, and so on).


 

dmesg -T (during reboot after freeze):

The MMC errors are new.

I don't know about the error saying the voltage can't be read.

 

Edit: note that I was still in "performance" mode for the CPU governor, with the same min/max values (in https://forum.armbian.com/topic/16944-crazy-instability/ it is suggested not to use that mode, so I have just reverted to the lowest possible min value, the highest possible max value, and powersave mode).

 

Edited by aprayoga
move kernel log inside spoiler
Posted

So I don't know if it's enough to say that everything is now under control, but for now my uptime is over 12 days (before, I had at least one freeze per week, and often more than that).

 

  Quote

uptime
 15:56:12 up 12 days, 22:10,  3 users,  load average: 4,24, 4,12, 3,64
 


 

So it's still a little too soon to be sure.

 

Also, for now I'm avoiding installing the 5.10 kernel and the corresponding reboot.

Posted

Is this the same issue I'm having over in this thread?

 

The armbianEnv.txt on my microSD card looked like this

 

verbosity=1
bootlogo=false
overlay_prefix=rockchip
rootdev=UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
rootfstype=ext4
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u,0x0bc2:0x3322:u

 

I removed the last usbstoragequirks entry, as it didn't seem to be present in anyone else's example I've seen. It made no difference.

Posted

As for why your NAS froze: maybe it is indeed the same issue (it's quite possible), but sadly you are also running into other problems during the reboot (which I haven't encountered on my side, fingers crossed ...). Of course, this many freezes and reboots can't be good for the operating system on disk, or even for the hardware (HDDs)...

 

As a first step, I would suggest trying a fresh install on a second SD card to see if everything boots up nicely ...

Posted
  On 2/16/2021 at 3:27 PM, SR-G said:

So I don't know if it's enough to say that everything is now under control, but for now my uptime is over 12 days (before, I had at least one freeze per week, and often more than that).

 

 

So it's still a little too soon to be sure.

 

Also, for now I'm avoiding installing the 5.10 kernel and the corresponding reboot.


 

So uptime is now 26 days; it seems better.

 

  Quote

 10:11:49 up 26 days, 16:26,  2 users,  load average: 0,13, 0,14, 0,06


 

Posted

Could you guys update to the latest kernel (linux-image-current-rockchip64_21.02.3_arm64 / LK 5.10.21) and, if you were in performance mode, revert to the ondemand governor with the full frequency range?

 

Also, remove any vdd voltage tweak you might have put in /boot/boot.cmd. If you had the vdd tweak, you will need to rebuild the u-boot boot script after removing the lines from /boot/boot.cmd:

 

mkimage -C none -A arm -T script -d /boot/boot.cmd /boot/boot.scr

 

Let us know whether the new kernel improves stability.
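Reverting the vdd tweak could be sketched as below. The helper name `remove_vdd_tweak` and the `.bak` suffix are my own, and the sketch assumes the only `regulator dev` / `regulator value` lines in boot.cmd are the ones added for the tweak (the stock boot.cmd posted earlier in this thread has none).

```shell
#!/bin/sh
# Sketch: delete every "regulator dev ..." / "regulator value ..." line.
# remove_vdd_tweak is a hypothetical helper name, not an Armbian tool.
remove_vdd_tweak() {
    cmd_file="$1"
    # Keep a backup (the .bak suffix is an arbitrary choice).
    cp "$cmd_file" "$cmd_file.bak" || return 1
    sed -e '/^regulator dev /d' -e '/^regulator value /d' \
        "$cmd_file.bak" > "$cmd_file"
}

# On the NAS itself (as root), followed by the rebuild from the post:
#   remove_vdd_tweak /boot/boot.cmd
#   mkimage -C none -A arm -T script -d /boot/boot.cmd /boot/boot.scr
```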

Posted
  On 3/10/2021 at 5:13 AM, gprovost said:

Could you guys update to the latest kernel (linux-image-current-rockchip64_21.02.3_arm64 / LK 5.10.21) and, if you were in performance mode, revert to the ondemand governor with the full frequency range?

 

Also, remove any vdd voltage tweak you might have put in /boot/boot.cmd. If you had the vdd tweak, you will need to rebuild the u-boot boot script after removing the lines from /boot/boot.cmd:

 

mkimage -C none -A arm -T script -d /boot/boot.cmd /boot/boot.scr

 

Let us know whether the new kernel improves stability.


I did a dist-upgrade yesterday, and I've had two freezes since then, (probably) due to high IO. But I've now reverted the cpufreq tweaks (ondemand + min/max 400000/1800000), and I never enabled the vdd tweaks in boot.cmd, so there wasn't anything to revert there. So we'll see how it goes; I'll try to put a lot of IO load on it in the coming days.

5.10.21-rockchip64 #21.02.3 SMP PREEMPT Mon Mar 8 01:05:08 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux

 

Posted

My system was stable for a long time (~3-4 weeks) and then the other day it soft locked with a panic (trace was in ZFS).

The rest of the system was still vaguely usable. "Great, this has happened before," I thought, so I rebooted, and then could not get it to finish booting.

 

Every time, one of two things would happen as the zfs pool was mounted.

1) system would silently lock up, no red led, no panic on console, nothing

2) system would panic, red led started flashing.

 

The only way I've been able to get the system to boot is by unplugging the disks, waiting for the system to boot and then plugging the disks back in and mounting them.

Even then the system crashes again within a short period of time (maybe because the ZFS is trying to scrub following the crash)

 

I've upgraded to 21.02.3 / 5.10.21

I never had the vdd tweaks applied, but I've tried both with and without them.

I've explicitly run the bootloader update step in armbian-config (the bootloader was from Nov, now Mar 8).

 

I'm relatively confident the issue I'm seeing is related to the others here. More often than not the panics are page faults (null pointer, address between kernel and user space, could not execute from non-execute memory), which seems plausible given the focus on voltage tuning.

 

Any ideas?

I can make an effort to collect boot logs if that's helpful, but given the frequency of these reports it seems like this is a relatively widespread issue.

Posted

@jbergler could you edit /boot/armbianEnv.txt and add or modify the following lines:

 

verbosity=7
console=serial
extraargs=earlyprintk ignore_loglevel

 

This should make the serial console more verbose and send the systemd service output to serial.
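If you'd rather not hand-edit the file, a small helper along these lines could set the three keys. `set_env_kv` is a made-up name, and the sketch assumes the order of lines in armbianEnv.txt does not matter, since it moves the key to the end of the file. Note that it replaces any existing `extraargs` value rather than appending to it.

```shell
#!/bin/sh
# Sketch: set KEY=VALUE in an armbianEnv.txt-style file, replacing any
# existing line for that key. set_env_kv is a hypothetical helper name.
set_env_kv() {
    env_file="$1"
    key="$2"
    value="$3"
    tmp="$env_file.tmp"
    # Keep every line that does not set this key ...
    grep -v "^${key}=" "$env_file" > "$tmp" || true
    # ... then append the new setting.
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    mv "$tmp" "$env_file"
}

# On the NAS itself (as root), back up the file first, then:
#   set_env_kv /boot/armbianEnv.txt verbosity 7
#   set_env_kv /boot/armbianEnv.txt console serial
#   set_env_kv /boot/armbianEnv.txt extraargs "earlyprintk ignore_loglevel"
```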

 

Could you also post step-by-step instructions to set up and reproduce the crash?

Maybe also the details of the ZFS pool and the HDD models and sizes.

 

Posted

@aprayoga verbosity was already up, but I've added the other args.

 

I'm not going to provoke the system since it's somewhat stable again and it's in use, but in terms of a repro here's the setup.

2x 8TB + 3x 12TB drives.

tank0 5x8TB raidz1

tank1 3x4TB raidz1 (this tank isn't mounted currently)

 

If I want to crash the box I can start a zfs scrub on tank0.
After some time (<~6 hours) the box crashes. On boot, if a scrub was in progress, the box won't finish booting.

 


 

 

Posted

@jbergler I recently noticed that the armbian-hardware-optimization script for Helios64 changes the IO scheduler to `bfq` for spinning disks. For ZFS, however, we should be using `none`, because ZFS has its own scheduler. Normally ZFS would change the scheduler itself, but that only happens if you're using raw disks (not partitions) and if you import the zpool _after_ the hardware optimization script has run.

 

You can try changing it (e.g. `echo none >/sys/block/sda/queue/scheduler`) for each ZFS disk and see if anything changes. I still haven't figured out if this is a cause for any problems, but it's worth a shot.
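To do that for several disks in one go, a loop like the following sketch might help. The helper name `set_scheduler_none` is made up, and the disk names in the usage line (sda to sde) are only placeholders; substitute whichever devices actually back your zpool.

```shell
#!/bin/sh
# Sketch: write "none" to the IO scheduler file of each listed disk.
# set_scheduler_none is a hypothetical helper; the first argument is the
# sysfs block directory (normally /sys/block), the rest are disk names.
set_scheduler_none() {
    sys_block="$1"
    shift
    for disk in "$@"; do
        sched="$sys_block/$disk/queue/scheduler"
        if [ -w "$sched" ]; then
            echo none > "$sched"
            # Show what the kernel now reports for this disk.
            printf '%s: %s\n' "$disk" "$(cat "$sched")"
        else
            printf '%s: cannot write %s\n' "$disk" "$sched" >&2
        fi
    done
}

# On the NAS itself (as root); disk names are placeholders:
#   set_scheduler_none /sys/block sda sdb sdc sdd sde
```

A value written this way does not survive a reboot, so it would need to be reapplied after each boot while testing.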

Posted
  On 3/12/2021 at 12:38 PM, Seneca said:

Just an update from yesterday, no freezes or crashes yet, even though quite heavy IO and CPU.


I've tried to provoke a system freeze with high cpu and IO, but it seems stable for now.

20:05:49 up 4 days, 23:47,  1 user,  load average: 0,14, 0,15, 0,11

I'll update this thread if the issue reoccurs.

Posted

Okay, so I got a freeze today. So even my previous setup (as described in my earlier posts) was not 100% stable (but still way better than at first).

Posted

And a second freeze one hour after the first one (blinking red light), while upgrading the kernel. Now of course nothing boots up.

Posted

And today (after having lost 2 hours yesterday reinstalling the system): yet another freeze, this time with the latest image / kernel and the default out-of-the-box configuration.

 

This is really starting to get insane, and the device is nearly unusable.

Posted

Many more freezes in the meantime.

 

Now (with the latest kernel) I'm unable to get a stable system whatever I do:

- latest kernel

- boot.scr put back

- same min and max freq

- governor on "performance" or "schedutil" or whatever

 

I always get freezes.

 

I'm at the point where I'm about to be DISGUSTED by this NAS. I've never lost so much time on an electronic device.

 

How long is it expected to take before this NAS is stable?

Is it only being worked on by KOBOL?

How many people have a stable NAS versus an unstable one?

Is my device faulty in any way?

What is KOBOL's refund policy?

Posted

I've had a 100% stable system for 16 days now, no cpufreq tweaks, only vanilla armbian config (+zram though).

17:32:07 up 16 days, 20:13,  1 user,  load average: 0,12, 0,17, 0,14

Using this kernel:

5.10.21-rockchip64

And I've _never_ gotten the red led blink when my system froze. Have you tried different or fewer drives?

Posted
  On 3/28/2021 at 3:29 PM, SR-G said:

How long is it expected to take before this NAS is stable?

Is it only being worked on by KOBOL?

How many people have a stable NAS versus an unstable one?

Is my device faulty in any way?

What is KOBOL's refund policy?


 

I gave up a few months ago.

However, I have a friend who is quite happy with his: OMV with RAID5, nothing else.

 

I got mine somewhat stable by reducing the CPU frequency (same value for min and max). But I have strict stability requirements for my NAS and also want it to run some services, so I have moved on.

Sad for me, but I still think there are good, working units out there.

Posted

I had a stable system (with the previous kernel) for 30 days, then one freeze, then a corrupted system, then a full reinstall, and now several freezes per day (at first with the vanilla Armbian config).

 

Same kernel as you:

Linux helios64 5.10.21-rockchip64 #21.02.3 SMP PREEMPT Mon Mar 8 01:05:08 UTC 2021 aarch64 GNU/Linux
 

I can't test different drives; I have 5 Western Digital drives plugged in as a RAID5 array.

Posted

Mine has been pretty stable for the last 2 months. Last restart was to update to Armbian 21.02.2 Buster with 5.10.16 kernel 24 days ago. I applied the Cpu freq mod for the previous kernel, and upgraded with apt, no fresh install. Cpu freq mod is still in place I assume. Device has been completely stable since that mod, and I am not undoing it for the time being. Reliability > everything else.

 

I'm using the 2.5G port exclusively with a 2.5G switch.

There are 4 4TB Seagate Ironwolf drives and an old (very old) Sandisk Extreme SSD in there.

No OMV or ZFS. Just LVM Raid5 with SSD rw cache.

No docker or VMs running.

Cockpit for management.

Samba for file sharing.

Borg for backups.

 

 

 

Posted

Honestly, we don't yet know the root cause of this difference in behavior.

Just to say, we acknowledge that there are a few boards that don't seem to show improvement. We are not ignoring the issue and are still looking into it.

Posted

Mine is still crashing like clockwork every 24 hours, generally leaving me with a corrupted OS and data.

 

I know you are a small team and that you had to tackle many obstacles to release the Helios64. Believe me, I'm a fan of your work, your product and Armbian in general, and I'm aware of the effort every person involved puts into this project.

 

But I would appreciate a bit more news about the current status of development.

 

As far as I can see, there is no status or development overview of the current issues, neither on your blog nor on Twitter.

 

The only information we get is in various topics across the Armbian forum.

 

Would it be possible to let us know more about the ongoing research, and to keep us informed about progress and the solutions being pursued?

 

I also wonder whether these issues are general RK3399 problems (governor, etc.) or whether they apply specifically to the Helios64.

 

Thank you in advance for your reply.
