
Posted (edited)

Continuing the discussion from here

 

On a clean install of 20.08.21 I'm able to crash the box within a few hours of it being under load.

It appears as if the optimisations are being applied:

root@helios64:~# cat /proc/sys/net/core/rps_sock_flow_entries
32768

 

The suggestion @ShadowDance made to switch to the performance governor hasn't helped.
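For anyone wanting to reproduce that tweak, the governor can be switched for all cores through sysfs (a minimal sketch, assuming the standard cpufreq layout; `cpufreq-set` from cpufrequtils does the same thing):

```shell
# Set the performance governor on every core (requires root).
# Sketch only -- paths assume the standard Linux cpufreq sysfs layout.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo performance > "$cpu/cpufreq/scaling_governor"
done
```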

 

Anecdotally, I think I remember the crashes always mentioning page faults, and early on there was some discussion about memory timing. Is it possible this continues to be that issue?

 

Edited by jbergler
spelling and some extra details
Posted

 

  On 11/11/2020 at 9:08 AM, jbergler said:

I also tried the suggestion to set a performance governor, and for shits and giggles I reduced the max cpu frequency, but that hasn’t made a difference.

System still locks up within a few hours.


What was the max cpu freq you set?

Could you try with performance governor at 1.2GHz and at 816 MHz?
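A quick way to apply those caps via sysfs (values are in kHz, so 1200000 = 1.2 GHz and 816000 = 816 MHz; sketch only, run as root, adjust MAX_KHZ per test run):

```shell
# Cap the maximum CPU frequency on all cores.
# MAX_KHZ is a placeholder for this test run: 1200000 (1.2 GHz) or 816000 (816 MHz).
MAX_KHZ=1200000
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo "$MAX_KHZ" > "$cpu/cpufreq/scaling_max_freq"
done
```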

How did you load the system?

 

 

Did you encounter a kernel crash on 20.08.10?
 

Posted
  On 11/13/2020 at 4:31 AM, aprayoga said:

Did you encounter a kernel crash on 20.08.10?


 

It's hard to say for sure; I never quite had a stable system, but back then I also wasn't generating the kind of load I am now.

 

  On 11/13/2020 at 4:31 AM, aprayoga said:

What was the max cpu freq you set?

 


 

I had only reduced it one step; I'm trying again now with the settings you suggest.

 

root@helios64:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq
performance
root@helios64:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq | uniq
816000
root@helios64:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq | uniq
1200000

 

The load I'm generating is a zfs scrub on a 37TB pool across all five disks.
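For reference, that load can be started and monitored like this (the pool name `tank` is a placeholder, not the actual pool name used here):

```shell
# Start a scrub across the whole pool; "tank" is a placeholder pool name.
zpool scrub tank
# Check scan progress and any checksum errors it has found so far.
zpool status tank
```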

 

Posted

After about an hour of the ZFS scrub the "bad PC value" error happened again; however, this time the system didn't hard-lock.

A decent number of ZFS-related processes are stuck in uninterruptible IO, I can't export the pool, etc.

 

I did see the system crash like this occasionally without the cpufreq tweaks, so I'm not sure it tells us anything new.

I will try again.

 

Note: the relatively high uptime is from the system sitting idle for ~5 days before I put it under load again.

 


 

Posted

I had 1 more crash and another soft lockup, but otherwise the box is much more usable.

 

@aprayoga Definitely still something not running right, even at the lower clock speeds.
My limited knowledge suggests something memory related, but that's all I've got. If you'd like me to test anything else, let me know.

Posted (edited)

I've been testing my Helios64 as well. I'm running Armbian 20.08.21 Focal, but I also downloaded the kernel builder script thingy from GitHub and built linux-image-current-rockchip64-20.11.0-trunk, which is a 5.9.9 kernel. I installed that, then built OpenZFS 2.0.0-rc6. I then proceeded to syncoid 2.15TB of snapshots to it while also running a scrub, and was able to get the load average up to 10+. The machine ran through the night, so I think it might be stable. A few more days of testing will validate this.

 

schu

Edited by akschu
speling
Posted

I'll defer to the Kobol folks. In the previous mega thread the statement was made that the issues should have been fixed in a new version that ensured the hardware tweaks were correctly applied, but for me things have never been properly stable, even on a vanilla install. The only semi-stable solution has been to reduce the clock speed, which is fine for now.

Posted

5.9.9 with the Armbian patches is working well for me so far. I've scrubbed the pool 5-6 times, as well as run syncoid from my hypervisor every hour for the last two days. I'm mostly just looking for a stable backup system that supports ZFS, and it looks like this will work.

Posted

@jbergler

5.8.x & 5.9.x are working here as well, but I'm not using ZFS, just plain vanilla mdadm RAID and LVM2 formatted as XFS.

If you have an extra set of HDDs, could you try building a new data pool with mdadm or LVM2 to test your setup?

Since you're getting memory related errors, is there a way for you to run a memory test on your board?

Have you checked if the heatsink is seated properly over the components of the board?

 

Posted

5.8.x had been running fine on my device for about 9 and a half days, then randomly crashed. No logs seem to have survived the crash, so this is going to be nearly impossible to debug.

Posted

Did more testing over the weekend on 5.9.9. I was able to benchmark with fio on top of a ZFS dataset for hours with the load average hitting 10+ while scrubbing the datastore. No issues. Right now the uptime is 3 days.
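A rough reproduction of that kind of load is a random-write fio job pointed at a dataset while a scrub runs (the directory, sizes, and job parameters below are illustrative, not the exact job used above):

```shell
# Illustrative fio job against a ZFS dataset mountpoint (path is a placeholder).
# Run alongside "zpool scrub" to combine benchmark and scrub load.
fio --name=stress --directory=/tank/fiotest \
    --rw=randwrite --bs=128k --size=4g --numjobs=4 \
    --ioengine=psync --runtime=600 --time_based --group_reporting
```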

 

I'm actually a little surprised at the performance.  It's very decent for what it is. 

 

I wonder if the fact that I'm running ZFS and 5.9.9 while others are using mdadm and 5.8 is the difference. I'm not really planning on going backwards on either: if 5.9.9 works there's no need to build another kernel, and you would have to pry ZFS out of my cold dead hands. I've spent enough of my life layering encryption/compression on top of partitions on top of volume management on top of partitions on top of disks. ZFS is just better, and having performance-penalty-free snapshots that I can replicate to other hosts over SSH is the icing on the cake.

 

Posted
  On 11/23/2020 at 4:31 PM, akschu said:

I'm not really planning on going backwards on either. If 5.9.9 works then no need to build another kernel, and you would have to pry ZFS out of my cold dead hands. I've spent enough of my life layering encryption/compression on top of partitions on top of volume management on top of partitions on top of disks. ZFS is just better, and having performance-penalty-free snapshots that I can replicate to other hosts over SSH is the icing on the cake.


 

Amen!

 

I have been following this forum with great interest and suspect it's only a matter of time until I buy one of these devices (or maybe wait for an ECC one).

 

Thanks to everyone testing and contributing feedback toward getting these devices stable, I for one certainly appreciate it (I am sure others do/will as well).

Posted

@jbergler Could you try the attached u-boot? This u-boot contains updated Rockchip blobs (DDR driver & ATF).

Install it with:

 

dpkg -i linux-u-boot-current-helios64_20.11.0-trunk_arm64.deb

After that, run armbian-config > System > Install > 5 Install/Update the bootloader on SD/eMMC

 

If you are using an SD card, make sure to clean the bootloader on the eMMC. You can run:

dd if=/dev/zero of=/dev/mmcblk1 seek=64 count=30000

 

Power cycle the system. The system should boot with the new bootloader.

 


Please take note of the binaries' versions:

DDR Version 1.24 20191016 RevNocRL
NOTICE:  BL31: Built : 14:31:03, May 19 2020
U-Boot 2020.07-armbian (Nov 25 2020 - 07:14:05 +0700)

 

Try to trigger the kernel crash.

 

---

If you want to restore the original u-boot you can run

apt install linux-u-boot-helios64-current=20.08.21

and update the u-boot using armbian-config

---

 

There is a built-in memory tester in the Linux kernel; just add this line to /boot/armbianEnv.txt:

extraargs=memtest=10

You can change the number of loops (10). The test takes quite some time to run.

You can see the result using dmesg.
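To pull those results out of the kernel ring buffer afterwards (the grep pattern is an assumption; the exact log tag may differ by kernel version):

```shell
# Filter the built-in memtest output from the kernel log.
dmesg | grep -i memtest
```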

 

 

 

 

linux-u-boot-current-helios64_20.11.0-trunk_arm64.deb

Posted (edited)

Initial attempt with the new u-boot, and with the cpufreq tweaks removed, results in a new panic.


 

 

And trying again


 

Edited by jbergler
more details
Posted

@jbergler Do you have an ATX power supply you can hook the drives to and test powering them that way?

 

As I mentioned in a few other threads, including:

I believe this may be a power delivery issue under load. Someone will need to test using an alternative power supply to confirm this, though, as I do not have a Helios64. I have 2x RockPi 4c with m.2-to-PCIe-x4 adapters and 8-port SATA cards, one running a 6x3TB mdadm RAID, the other a 9x2TB mdadm RAID with one drive via USB 3.0, both running 24/7; but I am using an actual ATX power supply to power all my drives, not a built-in power supply like the Helios has. Of all the people reporting this, someone will need to test and confirm.

 

--

root@rockpi-4c:~# uname -a
Linux rockpi-4c 5.8.6-rockchip64 #20.08.1 SMP PREEMPT Thu Sep 3 18:03:42 CEST 2020 aarch64 aarch64 aarch64 GNU/Linux
root@rockpi-4c:~# uptime
 13:13:59 up 10 days, 21:11,  7 users,  load average: 0.10, 0.09, 0.09

root@rockpi-4c:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid5 sdi[0] sdg[1] sdf[3] sdh[8] sda[4] sdc[6] sde[2] sdb[5] sdd[7]
      15627059200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      bitmap: 0/15 pages [0KB], 65536KB chunk

--

root@rockpi-4c:~# uname -a
Linux rockpi-4c 5.8.6-rockchip64 #20.08.2 SMP PREEMPT Fri Sep 4 20:23:22 CEST 2020 aarch64 aarch64 aarch64 GNU/Linux
root@rockpi-4c:~# uptime
 13:15:55 up 7 days,  8:05,  6 users,  load average: 0.64, 0.21, 0.07
root@rockpi-4c:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md127 : active raid5 sdc[2] sdb[1] sde[4] sda[0] sdd[7] sdf[6]
      14650670080 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6] [UUUUUU]

--

My 2 cents.

 

Cheers!

Posted
  On 11/25/2020 at 6:11 PM, TheLinuxBug said:

@jbergler Do you have an ATX power supply you can hook the drives to and test powering them that way?

I believe this may be a power delivery issue under load.


 

I do not, unfortunately, but I haven't seen any errors in the lead-up to the crashes I've experienced that look like problems with the drives (at least not from what I can tell).

Posted
  On 11/25/2020 at 6:45 PM, jbergler said:

 

I do not, unfortunately, but I haven't seen any errors in the lead-up to the crashes I've experienced that look like problems with the drives (at least not from what I can tell).


Correct, though my idea is that something happens with power delivery that starves either the board or the drives; that is hard to prove one way or the other without using a different power supply for the hard drives.

 

I could be wrong, though; it would help to eliminate that as a possibility in these cases.

 

my 2 cents.

 

Cheers!

This topic is now closed to further replies.
