jbergler

  • Content Count: 23

About jbergler

  • Rank: Member
  1. @aprayoga verbosity was already up, but I've added the other args. I'm not going to provoke the system since it's somewhat stable again and it's in use, but in terms of a repro, here's the setup: 2x 8TB + 3x 12TB drives; tank0 is a 5x8TB raidz1 and tank1 is a 3x4TB raidz1 (tank1 isn't mounted currently). If I want to crash the box I can start a ZFS scrub on tank0. After some time (under ~6 hours) the box crashes. On boot, if a scrub was in progress, the box won't finish booting. A rough repro sketch follows below.
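     Roughly, with the pool names above (a minimal sketch; the scrub itself is the only trigger I need):
     root@helios64:~# zpool status tank0                 # confirm the pool is healthy before starting
     root@helios64:~# zpool scrub tank0                  # this is what eventually brings the box down
     root@helios64:~# watch -n 60 'zpool status tank0'   # watch progress until it locks up or panics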
  2. My system was stable for a long time (~3-4 weeks) and then the other day it soft-locked with a panic (the trace was in ZFS). The rest of the system was still vaguely usable; great, this has happened before, I thought, so I rebooted and could not get it to finish booting. Every time, one of two things would happen as the ZFS pool was mounted: 1) the system would silently lock up, no red LED, no panic on the console, nothing; 2) the system would panic and the red LED started flashing. The only way I've been able to get the system to boot is by unplugging the disks, waiting for the sys
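     An alternative to physically unplugging the disks might be to keep the pool from being imported at boot at all, e.g. by masking the standard ZFS-on-Linux units (a sketch, assuming those units are what triggers the import on this image):
     root@helios64:~# systemctl mask zfs-import-cache.service zfs-import-scan.service zfs-mount.service
     root@helios64:~# reboot
     root@helios64:~# systemctl unmask zfs-import-cache.service zfs-import-scan.service zfs-mount.service   # once done debugging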
  3. The problem here is that it's not possible to compile the module on Debian because of how the kernel has been built. I reported the issue here, and while I could *fix* it, it really strikes me as something the core Armbian team needs to weigh in on. One option is to use an older GCC in the build system; the other is to disable per-task stack protection in the kernel. Neither seems like a great choice to me.
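     For anyone wanting to see the mismatch on their own box, comparing the compiler the kernel was built with against the one available for the module build, plus the relevant config option, should show it (a sketch; the config option name is assumed from mainline arm64):
     root@helios64:~# cat /proc/version                              # GCC the kernel was built with
     root@helios64:~# gcc --version | head -1                        # GCC available for building the module
     root@helios64:~# grep STACKPROTECTOR /boot/config-$(uname -r)   # e.g. CONFIG_STACKPROTECTOR_PER_TASK=y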
  4. I do not, unfortunately, but I haven't seen any errors in the lead-up to the crashes I've experienced that look like drive problems (at least not from what I can tell).
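     For reference, the checks I'd base that on (a sketch; device names are placeholders):
     root@helios64:~# smartctl -H /dev/sda                                          # overall SMART health verdict, per drive
     root@helios64:~# smartctl -A /dev/sda | grep -Ei 'realloc|pending|uncorrect'   # counters that usually precede failures
     root@helios64:~# zpool status -v                                               # ZFS read/write/checksum error counters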
  5. Box locked up overnight, nothing on the console.
  6. I cold-booted the box, and now it seems to behave just fine. I'll run some load testing overnight and report back.
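     One way to generate the kind of sustained mixed I/O I have in mind for the overnight run (a sketch using fio; path, size and duration are placeholders):
     root@helios64:~# apt install fio
     root@helios64:~# fio --name=soak --directory=/tank0/scratch --rw=randrw --bs=128k --size=4G --numjobs=4 --time_based --runtime=8h --group_reporting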
  7. The initial attempt with the new u-boot and with the cpufreq tweaks removed results in a new panic. Trying again.
  8. I'll defer to the Kobol folks. In the previous mega-thread the statement was made that the issues should have been fixed in a new version that ensured the hardware tweaks were being applied correctly, but for me things have never been properly stable, even on just a vanilla install. The only semi-stable solution has been to reduce the clock speed, which is fine for now.
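     For anyone wanting to do the same, the clock cap can be made persistent via /etc/default/cpufrequtils (a sketch with the values I'm running, assuming this image honours that file):
     root@helios64:~# cat /etc/default/cpufrequtils
     ENABLE=true
     MIN_SPEED=816000
     MAX_SPEED=1200000
     GOVERNOR=performance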
  9. I had one more crash and another soft lockup, but otherwise the box is much more usable. @aprayoga There's definitely still something not running right, even at the lower clock speeds. My limited knowledge suggests something memory-related, but that's all I've got. If you'd like me to test anything else, let me know.
  10. After about an hour of the ZFS scrub the "bad PC value" error happened again; however, this time the system didn't hard-lock. A decent number of ZFS-related processes are stuck in uninterruptible IO, I can't export the pool, etc. I did see the system crash like this occasionally without the cpufreq tweaks, so I'm not sure it tells us anything new. I will try again. Note: the relatively high uptime is from the system sitting idle for ~5 days before I put it under load again.
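      One way to see which tasks are stuck and where they're blocked, nothing Helios64-specific (a sketch):
      root@helios64:~# ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'   # tasks in uninterruptible sleep and what they wait on
      root@helios64:~# echo w > /proc/sysrq-trigger                      # dump blocked-task backtraces into the kernel log
      root@helios64:~# dmesg | tail -n 100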
  11. Out of curiosity, what is the (web?) interface in your screenshot?
  12. It's hard to say for sure; I never quite had a stable system, but back then I also wasn't generating the kind of load I am now. I had only reduced it one step; I'm trying again now with the settings you suggest.
      root@helios64:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq
      performance
      root@helios64:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq | uniq
      816000
      root@helios64:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq | uniq
      1200000
      The load I'm generating is running a zfs scrub on a 37TB pool across all
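      For reference, those values can be applied by hand per cpufreq policy (a sketch; typically one policy per CPU cluster on this SoC):
      root@helios64:~# for p in /sys/devices/system/cpu/cpufreq/policy*; do echo performance > $p/scaling_governor; echo 1200000 > $p/scaling_max_freq; done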
  13. Continuing the discussion from here. On a clean install of 20.08.21 I'm able to crash the box within a few hours of it being under load. It appears as if the optimisations are being applied:
      root@helios64:~# cat /proc/sys/net/core/rps_sock_flow_entries
      32768
      The suggestion @ShadowDance made to switch to the performance governor hasn't helped. Anecdotally, I think I remember the crashes always mentioning page faults, and early on there was some discussion about memory timing. Is it possible this continues to be that issue?
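      To at least partially test the memory theory, a userspace RAM soak is easy to run, though it won't catch DRAM timing problems that only show up under DVFS or real load (a sketch; size and loop count are placeholders):
      root@helios64:~# apt install memtester
      root@helios64:~# memtester 1024M 3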
  14. root@helios64:~# cat /proc/sys/net/core/rps_sock_flow_entries
      32768
      I also tried the suggestion to set a performance governor, and for shits and giggles I reduced the max CPU frequency, but that hasn't made a difference; the system still locks up within a few hours. I did finally manage to get the serial console to print something meaningful during boot, and one thing that stands out is this:
      Loading Environment from MMC... *** Warning - bad CRC, using default environment
      Full boot log is below. Yes,
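      The bad-CRC warning by itself is usually benign: it means U-Boot found no valid saved environment on the MMC and fell back to its built-in defaults. If it needs clearing, a fresh environment can be written from the U-Boot serial console (a sketch, assuming saveenv is enabled in this U-Boot build):
      => env default -a
      => saveenv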
  15. I'm still seeing regular panics, to the point where the box won't stay up for more than a few hours. To ensure it was in a clean state, I re-installed 20.08.21 focal and only added samba + zfs back.