
SR-G

Members
  • Posts

    37
Everything posted by SR-G

  1. Well, my HELIOS64 board died a long time ago (with no possible replacement), but before that, about the freezes:
     - in the end the root cause of the problem was never corrected, but it had been identified at some point as being related to the CPU frequency changes / governor
     - hence the workaround of stabilizing the board with a fixed governor and a lower speed of 1.2 GHz
     - as far as i remember, this nearly made the issue disappear (not sure however what you are calling "DSF" + this had nothing to do with disks, SSD or regular HDD)
  2. cancelled - power unit cable was not correctly plugged in and NAS was running on battery, hence the automated shutdowns, i think (but it's not clearly printed / displayed anywhere, i would say that it could have helped).
  3. NAS was on during the night and this morning: no more RAID5 (with 5 WD HDD). After reboot, i see that two disks are missing:

     ```
     [Sun Jul 18 11:49:11 2021] ata1: illegal qc_active transition (00000000->00000001)
     [Sun Jul 18 11:49:14 2021] ata2: softreset failed (device not ready)
     [Sun Jul 18 11:49:23 2021] ata1: illegal qc_active transition (00000000->00000001)
     [Sun Jul 18 11:49:24 2021] ata2: softreset failed (device not ready)
     [Sun Jul 18 11:49:38 2021] ata1: illegal qc_active transition (00000000->00000001)
     [Sun Jul 18 11:49:38 2021] ata1.00: failed to set xfermode (err_mask=0x40)
     [Sun Jul 18 11:49:56 2021] ata2: illegal qc_active transition (00000000->00000001)
     [Sun Jul 18 11:49:56 2021] ata2: illegal qc_active transition (00000000->00000001)
     [Sun Jul 18 11:49:57 2021] ata1: illegal qc_active transition (00000000->00000001)
     [Sun Jul 18 11:50:06 2021] ata1: illegal qc_active transition (00000000->00000001)
     ```

     ```
     11:49 root@helios64 /mnt/internal# ll /dev/sd*
     brw-rw---- 1 root disk 8, 32 2021-07-18 11:48 /dev/sdc
     brw-rw---- 1 root disk 8, 33 2021-07-18 11:48 /dev/sdc1
     brw-rw---- 1 root disk 8, 48 2021-07-18 11:48 /dev/sdd
     brw-rw---- 1 root disk 8, 49 2021-07-18 11:48 /dev/sdd1
     brw-rw---- 1 root disk 8, 64 2021-07-18 11:48 /dev/sde
     brw-rw---- 1 root disk 8, 65 2021-07-18 11:48 /dev/sde1
     ```

     What's happening? Is it also a PSU that failed (like for the HELIOS4 NAS ...)? The PSU light seems OK (blue LED on the power supply itself, not flashing).
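When an excerpt like the dmesg output in this post gets long, it can help to count the messages per ATA port to see which links are affected. The following is only a minimal sketch: the function name `summarize_ata_errors` and the regular expression are my own, and the sample lines are copied from the log in the post.

```python
import re
from collections import Counter

# Matches the "ataN:" or "ataN.MM:" prefix found in the kernel messages above.
ATA_LINE = re.compile(r"\bata(\d+)(?:\.\d+)?: (.+)$")


def summarize_ata_errors(lines):
    """Count kernel messages per ATA port, ignoring lines without an ataN prefix."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line.strip())
        if match:
            counts[f"ata{match.group(1)}"] += 1
    return dict(counts)


# Three lines taken from the dmesg excerpt in the post.
sample = [
    "[Sun Jul 18 11:49:11 2021] ata1: illegal qc_active transition (00000000->00000001)",
    "[Sun Jul 18 11:49:14 2021] ata2: softreset failed (device not ready)",
    "[Sun Jul 18 11:49:38 2021] ata1.00: failed to set xfermode (err_mask=0x40)",
]
ata_counts = summarize_ata_errors(sample)
print(ata_counts)
```

On a real system the lines could come from `dmesg -T` or `journalctl -k` piped into the script. The count says nothing about the cause (PSU, cable or disk); it only shows which ports are complaining.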
  4. So with a new PSU (30€ on amazon) my HELIOS4 NAS is working again.
  5. Hello, the HELIOS4 system was running (and doing nothing) when it suddenly crashed. After restart, nothing loaded automatically (system stuck in emergency mode). Once the USB cable was plugged in, i managed to log into the system only once, and discovered that no disks were mounted anymore (not even available / detected by the board). It's not so easy, as there seem to be some freezes, maybe due to the hardware errors related to the missing HDD links.

     13:00 root@helios4 ~# ll /dev/sd*
     zsh: no matches found: /dev/sd*

     Is this a failing PSU (like in other threads)? To be noted: the PSU has a blinking green light ... i can't remember if it was like that before?

     In journalctl:

     May 26 12:57:03 helios4 kernel: ata1: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:04 helios4 kernel: ata1: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:06 helios4 kernel: ata1: COMRESET failed (errno=-32)
     May 26 12:57:06 helios4 kernel: ata1: reset failed (errno=-32), retrying in 8 se
     May 26 12:57:06 helios4 kernel: ata2: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:07 helios4 kernel: ata2: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:08 helios4 kernel: ata2: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:11 helios4 kernel: ata2: COMRESET failed (errno=-32)
     May 26 12:57:11 helios4 kernel: ata2: reset failed (errno=-32), retrying in 8 se
     May 26 12:57:12 helios4 kernel: ata3: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:12 helios4 kernel: ata4: COMRESET failed (errno=-32)
     May 26 12:57:12 helios4 kernel: ata4: reset failed (errno=-32), retrying in 8 se
     May 26 12:57:13 helios4 kernel: ata3: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:14 helios4 kernel: ata3: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:15 helios4 kernel: ata3: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:15 helios4 kernel: ata1: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:16 helios4 kernel: ata3: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:16 helios4 kernel: ata3: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:17 helios4 kernel: ata1: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:17 helios4 kernel: ata3: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:17 helios4 kernel: ata1: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:18 helios4 kernel: ata3: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:19 helios4 kernel: ata3: SATA link down (SStatus 0 SControl 300)
     May 26 12:57:19 helios4 kernel: ata3: SATA link down (SStatus 0 SControl 300)

     Full boot log :
  6. So on my side, after my latest reinstallation (due to a corrupted OS):
     - with the default installation / configuration out of the box, i had one freeze every 24h
     - by switching to "powersave" or to "performance" mode, with min CPU frequency = max CPU frequency = either 1.8 GHz or 1.6 GHz: still the same (one freeze per day)
     - by switching to "performance" mode, with min CPU frequency = max CPU frequency = 1.4 GHz, it now seems more stable (uptime = 5 days for now)

     So my gut feeling is really that these issues:
     - are mainly related to the cpufreq mechanism
     - and are probably related to what has been nicely spotted before (by Vin): the fact that 2 cores have a different max frequency range (as expected per the specs, but maybe with a corner case in the cpufreq governor)
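For reference, the "min CPU frequency = max CPU frequency" setup described in this post can also be applied by hand through the standard Linux cpufreq sysfs files (`scaling_governor`, `scaling_min_freq`, `scaling_max_freq`). This is only a sketch under assumptions: `pin_cpu_frequency` is a hypothetical helper, the fake tree and its starting values are made up so the code can be tried without root, and values written this way do not survive a reboot (armbian-config remains the persistent route).

```python
import tempfile
from pathlib import Path

CPUFREQ_ROOT = Path("/sys/devices/system/cpu/cpufreq")


def pin_cpu_frequency(khz, governor="performance", root=CPUFREQ_ROOT):
    """Set governor and min = max = khz on every policy under root.

    Returns the names of the policies that were touched.
    """
    touched = []
    for policy in sorted(Path(root).glob("policy*")):
        (policy / "scaling_governor").write_text(governor)
        current_max = int((policy / "scaling_max_freq").read_text())
        # Keep min <= max at every step: when raising, write max first;
        # when lowering, write min first.
        order = ["scaling_max_freq", "scaling_min_freq"]
        if khz < current_max:
            order.reverse()
        for name in order:
            (policy / name).write_text(str(khz))
        touched.append(policy.name)
    return touched


# Dry run on a fake sysfs tree (hypothetical starting values).
fake_root = Path(tempfile.mkdtemp())
for name in ("policy0", "policy4"):
    policy_dir = fake_root / name
    policy_dir.mkdir()
    (policy_dir / "scaling_governor").write_text("ondemand")
    (policy_dir / "scaling_min_freq").write_text("408000")
    (policy_dir / "scaling_max_freq").write_text("1800000")

touched = pin_cpu_frequency(1400000, root=fake_root)
print(touched)
```

On the real board the call would be `pin_cpu_frequency(1400000)` as root. Note that each policy only accepts frequencies within its own supported range, so the same request may end up with different effective values per policy.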
  7. I'm a bit confused to not have the same values on my side for all policies ... Whereas : (> i set min = max = 1.6 GHz through armbian-config) What are these two different policies in /cpufreq/? (policy0 + policy4 in /cpufreq/ on my side) Is it like "policy0" is used by the "performance" governor mode and "policy4" by "powersave"? (in which case it would make sense for me to have different values)
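As far as the kernel's cpufreq layout goes, a `policyN` directory describes one group of cores that share a clock (N being the first CPU of the group), not a governor mode; each policy has its own governor and its own min/max. A quick way to check which cores belong to which policy is to read `related_cpus` next to the scaling values. The sketch below is an illustration only: `describe_policies` is a hypothetical helper, and the fake tree assumes the usual RK3399 split (four little cores, two big cores) with made-up frequencies.

```python
import tempfile
from pathlib import Path

CPUFREQ_ROOT = Path("/sys/devices/system/cpu/cpufreq")
FIELDS = ("related_cpus", "scaling_governor", "scaling_min_freq", "scaling_max_freq")


def describe_policies(root=CPUFREQ_ROOT):
    """Return {policy name: {field: value}} for every cpufreq policy under root."""
    report = {}
    for policy in sorted(Path(root).glob("policy*")):
        report[policy.name] = {
            field: (policy / field).read_text().strip()
            for field in FIELDS
            if (policy / field).is_file()
        }
    return report


# Fake tree for a dry run; the values are assumptions, not taken from the post.
fake_cpufreq = Path(tempfile.mkdtemp())
fake_values = {
    "policy0": ("0 1 2 3", "performance", "1416000", "1416000"),
    "policy4": ("4 5", "performance", "1608000", "1608000"),
}
for name, values in fake_values.items():
    (fake_cpufreq / name).mkdir()
    for field, value in zip(FIELDS, values):
        (fake_cpufreq / name / field).write_text(value + "\n")

policy_report = describe_policies(fake_cpufreq)
for name, fields in policy_report.items():
    print(name, fields)
```

Calling `describe_policies()` without arguments reads the real sysfs tree and needs no root. If one policy cannot reach the frequency requested in armbian-config, that would be one possible explanation for seeing different values per policy.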
  8. On my side:
     - OS (debian helios64 image) installed on SD card, the SD card is a samsung one (128G)
     - 5x Western Digital HDD (all the same) WDBBGB0140HBK-EESN 14TB, plugged in a regular way sda > sde (and so obviously no M.2 plugged in)
     - I have the internal battery plugged in

     At OS level:
     - docker with netdata container (and nothing else for now)
     - mdadm activated for the RAID-5 array
     - SMBFS with a few shares
     - NFS with shares mounted on other servers (crashes were already happening before switching to NFS instead of SSHFS for regular accesses)

     On the "workaround" side, at this time:
     - latest kernel 5.10.21
     - with patched boot.scr
     - governor = powersave, min speed = max speed = 1.6 GHz (and not 1.8 GHz)

     It seems to be the "least problematic" configuration (one crash every two days and not every day ...)

     About load:
     - rclone each day for a few hours to mirror everything to the cloud (limited to 750GB per day so it takes quite some days) - nearly no freezes during this
     - borgbackup each night, fetching some files to back up from another server through NFS - i suspect some freezes there
     - some NFS shares being accessed for various tasks all the time (sometimes with a lot of IO) - i suspect some freezes there
     - all this is quite reasonable in terms of load and is not generating a lot of IO in the end

     Helios64 board ordered on 2020, jan 12 (order 1312), sent on 2020, sept 21 and received sometime around the beginning of october (NAS not installed before december).

     By the way i have also had a Helios4 board for a long time, and i never got any freeze with it.
  9. I've had a stable system (with the previous kernel) for 30 days, then one freeze, then a corrupted system, then a full reinstall, and now several freezes per day (at first with the vanilla armbian config). Same kernel as you: Linux helios64 5.10.21-rockchip64 #21.02.3 SMP PREEMPT Mon Mar 8 01:05:08 UTC 2021 aarch64 GNU/Linux. I can't test different drives, i have 5 Western Digital drives plugged in as a RAID5 array.
  10. Many additional freezes in the meantime. Now (with the latest kernel) i'm unable to get a stable situation whatever i do:
      - latest kernel
      - boot.scr put back
      - same min and max freq
      - governor on "performance" or "schedutil" or whatever

      I always get freezes. I'm at the point where i'm about to be DISGUSTED by this NAS - i've never lost so much time with an electronic device. What is the expected delay before having something stable for this NAS? Is it only worked on by KOBOL? How many people have a stable NAS versus an unstable NAS? Is my device faulty in any way? What is the refund policy of KOBOL?
  11. And another freeze this night (still with fresh install / latest kernel + modified boot.scr put back in place).
  12. And (after having lost 2 hours yesterday to reinstall the system), today : yet another freeze (this time with the latest image / kernel and default out-of-the-box configuration). This really starts to be insane and nearly unusable.
  13. And a second freeze one hour after the first one (blinking red light), while upgrading the kernel. Now of course nothing boots up.
  14. Okay so i got a freeze today, so even in my previous situation (as described in previous posts) it was not 100% stable (but still way better than at first).
  15. So 26 days as uptime now - it seems better.
  16. About why your NAS has been frozen, maybe indeed (it's quite possible), but sadly in addition you are encountering some other issues during the reboot (that I haven't encountered on my side - crossing fingers on that topic ...). But of course these many freezes and reboots can't be good in any way for the operating system on disk or even for the hardware (HDD)... I would suggest trying a fresh reinstall on a second SD card to see if everything boots up nicely, as a first step ...
  17. So i don't know if it's enough to say that everything is now under control, but for now my uptime is +12 days (before, i had at least one freeze per week, and often more than that). So a little bit too soon to be sure. + for now i'm avoiding 5.10 kernel installation and corresponding reboot.
  18. Mhh so (per the other thread) i tried the 2nd proposal (without luck: i still have freezes, as stated in the other thread) BUT nothing was said about reverting the CPU governor settings (so i was still in "performance" mode with the same min/max values). To what should i revert the CPU governor values? (min possible value / max possible value + powersave?)
  19. So i applied these parameters on 2021/02/01 and today (2 days later) i just got another freeze (this time i had IO but not a lot - was uploading files from NAS to cloud at 40MB/s - but it's of course not the first time i have some IO for hours, it's just that until now most freezes have happened without IO). No RED LED blinking this time once frozen + all HDD LEDs are ON but are not blinking. I had an open SSH connection and nothing was printed there, it's just frozen (ping from another host not answered, and so on). dmesg -T (during reboot after freeze): the MMC errors are new. I don't know about the error about the voltage that can't be read. edit: to be noted, i was still in "performance" mode for the CPU governor, with the same min/max possible values (here: https://forum.armbian.com/topic/16944-crazy-instability/ it is suggested to not be in that mode -> i have just reverted to min possible value / max possible value + powersave mode)
  20. Ok i just applied the suggested modifications this morning (by modifying the boot.cmd file + regenerating the boot.scr file). Reboot has been OK. Let's see ... + what exactly are these values / are they related to the CPU speed and if yes, how are they different from what is applied when modifying the CPU governor configuration through armbian-config?

      Previous armbianEnv.txt (for reference) (untouched):

      ```
      verbosity=1
      bootlogo=false
      overlay_prefix=rockchip
      rootdev=UUID=a79a14c0-3cf4-4fb9-a6c6-838571351371
      rootfstype=ext4
      usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u
      ```

      Previous boot.cmd file, for reference (modified as requested with the 4 new lines):

      ```
      # DO NOT EDIT THIS FILE
      #
      # Please edit /boot/armbianEnv.txt to set supported parameters
      #

      setenv load_addr "0x9000000"
      setenv overlay_error "false"
      # default values
      setenv rootdev "/dev/mmcblk0p1"
      setenv verbosity "1"
      setenv console "both"
      setenv bootlogo "false"
      setenv rootfstype "ext4"
      setenv docker_optimizations "on"
      setenv earlycon "off"

      echo "Boot script loaded from ${devtype} ${devnum}"

      if test -e ${devtype} ${devnum} ${prefix}armbianEnv.txt; then
          load ${devtype} ${devnum} ${load_addr} ${prefix}armbianEnv.txt
          env import -t ${load_addr} ${filesize}
      fi

      if test "${logo}" = "disabled"; then setenv logo "logo.nologo"; fi

      if test "${console}" = "display" || test "${console}" = "both"; then setenv consoleargs "console=tty1"; fi
      if test "${console}" = "serial" || test "${console}" = "both"; then setenv consoleargs "console=ttyS2,1500000 ${consoleargs}"; fi
      if test "${earlycon}" = "on"; then setenv consoleargs "earlycon ${consoleargs}"; fi
      if test "${bootlogo}" = "true"; then setenv consoleargs "bootsplash.bootfile=bootsplash.armbian ${consoleargs}"; fi

      # get PARTUUID of first partition on SD/eMMC the boot script was loaded from
      if test "${devtype}" = "mmc"; then part uuid mmc ${devnum}:1 partuuid; fi

      setenv bootargs "root=${rootdev} rootwait rootfstype=${rootfstype} ${consoleargs} consoleblank=0 loglevel=${verbosity} ubootpart=${partuuid} usb-storage.quirks=${usbstoragequirks} ${extraargs} ${extraboardargs}"

      if test "${docker_optimizations}" = "on"; then setenv bootargs "${bootargs} cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory swapaccount=1"; fi

      load ${devtype} ${devnum} ${ramdisk_addr_r} ${prefix}uInitrd
      load ${devtype} ${devnum} ${kernel_addr_r} ${prefix}Image
      load ${devtype} ${devnum} ${fdt_addr_r} ${prefix}dtb/${fdtfile}
      fdt addr ${fdt_addr_r}
      fdt resize 65536

      for overlay_file in ${overlays}; do
          if load ${devtype} ${devnum} ${load_addr} ${prefix}dtb/rockchip/overlay/${overlay_prefix}-${overlay_file}.dtbo; then
              echo "Applying kernel provided DT overlay ${overlay_prefix}-${overlay_file}.dtbo"
              fdt apply ${load_addr} || setenv overlay_error "true"
          fi
      done

      for overlay_file in ${user_overlays}; do
          if load ${devtype} ${devnum} ${load_addr} ${prefix}overlay-user/${overlay_file}.dtbo; then
              echo "Applying user provided DT overlay ${overlay_file}.dtbo"
              fdt apply ${load_addr} || setenv overlay_error "true"
          fi
      done

      if test "${overlay_error}" = "true"; then
          echo "Error applying DT overlays, restoring original DT"
          load ${devtype} ${devnum} ${fdt_addr_r} ${prefix}dtb/${fdtfile}
      else
          if load ${devtype} ${devnum} ${load_addr} ${prefix}dtb/rockchip/overlay/${overlay_prefix}-fixup.scr; then
              echo "Applying kernel provided DT fixup script (${overlay_prefix}-fixup.scr)"
              source ${load_addr}
          fi
          if test -e ${devtype} ${devnum} ${prefix}fixup.scr; then
              load ${devtype} ${devnum} ${load_addr} ${prefix}fixup.scr
              echo "Applying user provided fixup script (fixup.scr)"
              source ${load_addr}
          fi
      fi

      booti ${kernel_addr_r} ${ramdisk_addr_r} ${fdt_addr_r}

      # Recompile with:
      # mkimage -C none -A arm -T script -d /boot/boot.cmd /boot/boot.scr
      ```

      ```
      (...)
      lrwxrwxrwx 1 root root 25 2021-01-10 14:30 uInitrd -> uInitrd-5.9.14-rockchip64
      -rw-r--r-- 1 root root 3,2K 2021-02-01 09:13 boot.cmd
      -rw-rw-r-- 1 root root 3,3K 2021-02-01 09:13 boot.scr
      -rw-r--r-- 1 root root 166 2021-02-01 09:15 armbianEnv.txt
      ```
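To illustrate how the two files in this post relate: boot.cmd first sets default values with `setenv`, then imports armbianEnv.txt, so any key present there replaces the default. The sketch below mimics that override order in Python. It is a simplified illustration, not U-Boot's actual parser: `parse_env` is a hypothetical helper, and only a few of the defaults from the boot.cmd above are reproduced.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, splitting on the first '=' only.

    Splitting once matters for values such as rootdev=UUID=..., which
    contain '=' themselves. Blank lines and lines without '=' are skipped.
    """
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key] = value
    return env


# A subset of the "default values" set in boot.cmd.
DEFAULTS = {
    "rootdev": "/dev/mmcblk0p1",
    "verbosity": "1",
    "console": "both",
    "bootlogo": "false",
    "rootfstype": "ext4",
}

# Content of armbianEnv.txt as quoted in the post.
ARMBIAN_ENV = """\
verbosity=1
bootlogo=false
overlay_prefix=rockchip
rootdev=UUID=a79a14c0-3cf4-4fb9-a6c6-838571351371
rootfstype=ext4
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u
"""

# Later entries win, like the import that follows the setenv defaults.
effective = {**DEFAULTS, **parse_env(ARMBIAN_ENV)}
print(effective["rootdev"])  # UUID=a79a14c0-3cf4-4fb9-a6c6-838571351371
```

Keys that armbianEnv.txt does not mention (here `console`) keep the value from boot.cmd, which is why the header of boot.cmd points to armbianEnv.txt for supported parameters.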
  21. No (and i'm using MDADM, not ZFS). Until now, fortunately, i haven't encountered a broken RAID or any errors (mdadm --misc --detail is fine, no errors in dmesg -T, and so on), otherwise i would get mad about losing time and data just because of these errors ... Indeed maybe there are different problems ... anyway it's far from being stable in the current state. Also i don't think i have an overheating issue, my sensors are (even when copying files, ...) :
  22. @Seneca I would suggest to upgrade to kernel 5.9.14 : i have freezes with both, but i had way more freezes with 5.8.14 compared to 5.9.14 Otherwise, on my side, it is 100% clear that in my situation the freezes are not related to disk I/O : i have nearly nothing in crontab, and i'm encountering some freezes during the night, with 0 CPU activity / 0 disk usage (and as said before, i have nearly no processes configured outside MDADM, SSH, and NETDATA).
  23. I would be fine with this solution on my side if it was working, but on my side it's not enough to have the same frequency for the min and max values (with 1.8 GHz; i haven't tried lower values) + "performance" or "governor" mode.
  24. Yes, as stated in previous message, i tried "conservative" and then "performance" governor mode (which i'm still running on), with same (max) values for both min/max CPU speed, without any real benefits (maybe this is enough for some users, but clearly not in my situation). And i really don't have a lot of things installed : SMB, SSH, docker (with one container : netdata), MDADM, and that's all (no extra containers, no OMV, ...). And no "load" from a CPU point of view, and not a lot of RAID I/O (at least not during most of the crashes). And same, i'm not always in a setup with a usb cable connected (but i would suspect these freezes to be the same ones than the first i encountered and for which some traces are at the beginning of this post).