djurny
Everything posted by djurny
-
Hi, Short update. After looking at the most recent kernel freezes I experienced on 5.9.13-rockchip64 #trunk.16 and reading through a thread meant for Helios4, I decided to stop using the 'conservative' cpufreq governor. I switched to either 'powersave' or 'performance', depending on workload, so that the frequency is no longer changed on-the-fly. The box has been running smoothly since the last power cycle. As the system is nice and stable now, I will do some more maintenance before upgrading to the latest advised configuration. Groetjes, See the last Oops, mentioning something about trying to set a regulator voltage, triggered by the cpufreq-dt module: (Note that the system did not fully freeze; parts of the system continued service.) After posting this, I configured cpufreq to use the 'schedutil' governor, and after roughly 3 hours of load it froze up with one of the other patterns observed before. Will change back to 'powersave' and give it some load again. Not sure if there is a correlation here. Groetjes,
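For reference, a minimal sketch of pinning the governor as described above, assuming the standard sysfs cpufreq interface (paths can differ per kernel; run as root). On Armbian the same setting can usually be made persistent via /etc/default/cpufrequtils.

```shell
# Sketch: pin the cpufreq governor to 'powersave' on all policies so the
# frequency is no longer changed on-the-fly (requires root on a real box).
for g in /sys/devices/system/cpu/cpufreq/policy*/scaling_governor; do
    [ -w "$g" ] && echo powersave > "$g"
done
# Verify the active governor per policy:
cat /sys/devices/system/cpu/cpufreq/policy*/scaling_governor 2>/dev/null || true
```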
-
Hi, Also here a kernel Oops after some load:

[82182.500900] Unable to handle kernel paging request at virtual address ffff800011b14000
[..]
[82182.505753] Internal error: Oops: 96000007 [#1] PREEMPT SMP
[..]
[82182.526948] Call trace:
[82182.527414] x1 : ffff800011532db0 x0 : 00000000ffffffea
[82182.527921] gic_handle_irq+0x124/0x158
[82182.528384] Call trace:
[82182.528883] el1_irq+0xb8/0x180
[82182.529350] __handle_domain_irq+0xc4/0x108
[82182.529593] arch_cpu_idle+0x14/0x20
[82182.530058] Code: f822683a a94153f3 a9425bf5 a94363f7 (a9446bf9)
[82182.530425] do_idle+0x210/0x260
[82182.530640] ---[ end trace c165b2007f1cb8d2 ]---
[82182.530946] cpu_startup_entry+0x28/0x60
[82182.531312] Kernel panic - not syncing: Attempted to kill the idle task!
[82182.531657] rest_init+0xd8/0xe8
[82182.532193] SMP: stopping secondary CPUs
[82182.532503] arch_call_rest_init+0x10/0x1c
[82182.534913] start_kernel+0x80c/0x848
[82182.535258] ---[ end trace c165b2007f1cb8d3 ]---
[82182.535692] Kernel Offset: disabled
[82182.536009] CPU features: 0x0240022,2000200c
[82182.536388] Memory Limit: none
[82182.536675] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

It appears that the system had been idling for some hours before the page fault occurred, so this Oops seems unrelated to system load. Is there anything I can collect or try, to see if this will improve? Thanks, Groetjes,
-
Hi all, Something to share for those who use the USB-C serial console from another Linux host: install and use 'tio' to connect to the serial console instead of minicom. It supports 1500000 baud and can easily be used inside GNU screen (minicom gets a meta-key conflict by default; CTRL-A is the default meta key for both GNU screen and minicom). Minicom resulted in regular errors posted in syslog by the ftdi_sio kernel module. I did not run any strace to find out which syscall causes it, but in short, tio appears not to treat the tty as a modem: no errors pop up in syslog. Hopefully the serial consoles will remain up now. One caveat: I did not find a way to send a BREAK over serial using tio. This is handy in case the kernel freezes up, as sometimes you will still have the opportunity to trigger a magic-sysrq reboot (BREAK + b = initiate a reboot of the kernel; also see magic sysrq & REISUB). Groetjes,
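A minimal sketch of the tio invocation described above; /dev/ttyUSB0 is an assumption, substitute your USB-serial device:

```shell
# Sketch: connect to the Helios64 serial console with tio at 1500000 baud.
# /dev/ttyUSB0 is an assumed device node; check dmesg after plugging in.
tio --baudrate 1500000 /dev/ttyUSB0
# tio's command prefix is CTRL-T (e.g. CTRL-T q to quit), which does not
# clash with GNU screen's CTRL-A the way minicom's default meta key does.
```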
-
Hi, A short update: unfortunately the kernel has crashed again, but only after a couple of days, so there is improvement :-) No serial console output, as the USB-serial connection on my Pi stopped responding (will open another thread on this, though it is not really Helios64 related). Will restart some loading and try a different USB-serial setup; hopefully both will not crash (that often) anymore. Groetjes,
-
Hi, I've also experienced almost hourly instabilities when running some load on my Helios64 box. Tried several kernels, each with their own Oops/BUG pattern. See below for an overview; it's not exhaustive. In the end I did the following and the box is now running some load (snapraid scrub on ~12TiB of data) without any issue:
- Enabled the daily-built kernel, now running Linux kobol0 5.9.11-rockchip64 #trunk.2 SMP PREEMPT Sun Nov 29 00:29:16 CET 2020 aarch64 GNU/Linux. Why: every kernel had its own pattern, either do_undefinstr or XHCI hangup or page fault. Assumed the latest and greatest has the most fixes.
- Enabled the i2c dtb overlays. Why: some of the kernels showed an IRQ related to i2c in the Oops/BUG. Thought I might find something in the dtb related to i2c and just enable it, to see if that might fix something.
- Moved the rootfs from a USB stick to a SATA SSD in slot 4. Why: some of the kernels had a repeatedly hanging XHCI controller, so I tried to remove some USB devices from the controller, to see if the amount of load on the controller itself might be a vector (, Victor).
- Also removed tlp and set SATA link power management to max_performance (hat tip @gprovost).
It's a weak investigation, as I fiddled with multiple things at once, trying to get things going quickly (I do not have as much spare time to spend on this as I would like). Still, perhaps this will trigger someone or give some more angles to fiddle with for others. Fingers crossed. Looking good so far:
djurny@kobol0:~$ uname -a
Linux kobol0 5.9.11-rockchip64 #trunk.2 SMP PREEMPT Sun Nov 29 00:29:16 CET 2020 aarch64 GNU/Linux
djurny@kobol0:~$ uptime
 07:26:58 up 2 days, 10:40, 7 users, load average: 1.73, 1.76, 1.74
djurny@kobol0:~$
(The box has been running rdfind, xfs_fsr, snapraid scrub & check for the last 2 days, in that order.) Groetjes,
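The tlp/SATA step above can also be done directly via sysfs; a hedged sketch (host numbering is machine-specific, run as root):

```shell
# Sketch: force max_performance SATA link power management on all SCSI
# hosts, the same effect as removing tlp's power-saving policy.
for f in /sys/class/scsi_host/host*/link_power_management_policy; do
    [ -w "$f" ] && echo max_performance > "$f"
done
# Show the resulting policy per host:
cat /sys/class/scsi_host/host*/link_power_management_policy 2>/dev/null || true
```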
-
Hi, It's only the "HDD x Activity" LEDs that are cycling; the other LEDs are not showing this cycling. The cycling speed appears to increase with CPU frequency, just like how the "System Activity" LED (heartbeat trigger) frequency increases when the CPU frequency increases. I made a video of the effect but had some trouble uploading it (>20MiB), so it is here: https://streamable.com/v8wa36. Note that I do not have any trouble with this effect, just wondering if this is by design and how I can customize it :-) Groetjes,
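For customizing such effects, a hedged sketch of the generic kernel LED-trigger interface; the LED name used here is an assumption, list /sys/class/leds to find the real names on the Helios64:

```shell
# Sketch: inspect and change an LED trigger via the sysfs LED class API.
# 'helios64:blue:ata1' is an assumed name; run as root to change triggers.
ls /sys/class/leds/ 2>/dev/null || true
LED=/sys/class/leds/helios64:blue:ata1
[ -e "${LED}/trigger" ] && cat "${LED}/trigger"       # active trigger shown in [brackets]
[ -w "${LED}/trigger" ] && echo none > "${LED}/trigger"  # disable any cycling effect
true
```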
-
Hi, Only the USB HDD (sdc) is configured to spin down after some minutes; the others were not explicitly configured to enter standby/sleep mode. Note that all of the HDD status LEDs are cycling, not just a few. hdparm shows the following strange values:

Device:  Used as:  Interface:  Media:      hdparm -C says:
sdb      rootfs    USB         USB stick   drive state is: standby
sdg      swap      SATA        SSD         drive state is: active/idle
sda      data0     SATA        HDD         drive state is: active/idle
sdd      data1     SATA        HDD         drive state is: active/idle
sde      data2     SATA        HDD         drive state is: active/idle
sdf      data3     SATA        HDD         drive state is: active/idle
sdc      parity0   USB         HDD         drive state is: active/idle

I've never heard that a USB media stick can be put in standby mode? Even stranger is that '/' is running off of this USB device. See below for more details. Groetjes,
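For reference, a minimal sketch of the hdparm commands involved in querying and configuring spindown for a drive like the USB HDD above (run as root; device name is an example):

```shell
# Sketch: query power state and set a standby (spindown) timeout.
# -C reports the current power state; -S sets the standby timeout, where
# values 1..240 are multiples of 5 seconds, so 120 = 600 s = 10 minutes.
command -v hdparm >/dev/null || exit 0   # hdparm not installed
hdparm -C /dev/sdc            # e.g. "drive state is: active/idle"
hdparm -S 120 /dev/sdc        # spin down after 10 minutes of idle
```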
-
Hi, Not sure if anyone has seen the same as I have been seeing on my box: the HDD status LEDs are cycling in brightness, from 0% to 100% and back to 0%, continuously. It's a smooth transition, no flashy circus. I might have triggered this myself, but no idea how. Is this a new feature? I tried to make some video of it, but it is quite difficult to see. Groetjes,
-
Hi @JeffDwork, I have used snapraid for testing, plus running md5sums on all the content on the disks: once to 'sync' (create MD5 hashes), and subsequent runs to 'scrub' (check MD5 hashes). This gave me a warm feeling about how fast the system can calculate hashes and how fast the disk I/O is. That in turn gives a good indication for scheduling maintenance actions, e.g. if a 'scrub' takes 12 hours, I need to make sure it does not push out or overlap other scheduled maintenance actions, etc. Overall it will depend on what you care about the most: CPU performance/temperature, disk throughput, filesystem reliability, system stability, or perhaps other factors. Groetjes,
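A self-contained sketch of the hashing benchmark described above; the throwaway directory under /tmp is only there so the snippet runs as-is, point 'find' at a real data mount for an actual measurement:

```shell
# Sketch: time an MD5 pass over a directory tree to gauge hashing speed
# plus disk I/O. /tmp/md5demo is a throwaway example tree.
mkdir -p /tmp/md5demo
echo hello > /tmp/md5demo/a.txt
start=$(date +%s)
find /tmp/md5demo -type f -print0 | xargs -0 md5sum > /tmp/md5s.txt
end=$(date +%s)
echo "hash pass took $((end - start)) seconds"
cat /tmp/md5s.txt
```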
-
Hi, Both are Zyxel switches: one is the 16-port GS1100-16 and the other is the 8-port GS-108B v3. Both are Gbps-capable, as shown by other devices connected to the same switch:

Helios4:
	Link partner advertised link modes:  10baseT/Half 10baseT/Full
	                                     100baseT/Half 100baseT/Full
	                                     1000baseT/Full

Just now, I tried yet another set of cables to connect to the Helios64 box, and it seems I have bought several batches of cat "5e" cables, with stress on "5e". Looks like a cabling issue still. I never noticed this, as most are connected to either Raspberry Pi2b or OrangePi Zero devices (see the ethtool output from one of the Pis below, connected to the same 16-port Gbps-capable switch). I already ordered another batch of [apparently] shielded cat 6 cables; hopefully they are indeed shielded cat 6 and not cat "6".

Raspberry Pi2b:
	Link partner advertised link modes:  10baseT/Half 10baseT/Full
	                                     100baseT/Half 100baseT/Full

Please disregard my previous post. Thanks, Groetjes,
-
Hi, Are there any plans to make a toddler-proof version of the front grille that covers the buttons? Currently I have applied some lo-fi containment by simply flipping the front grille so it covers the front panel. Perhaps some snap-in plexiglass for the panel cutout, with a little doorknob type of thing? I have not checked whether the buttons can be disabled in software yet (https://wiki.kobol.io/helios64/button/); perhaps the PMIC can be programmed from user space? Groetjes,
-
Hi, After fixing the LED issue, I started to test whether snapraid is working. On the Helios4, snapraid ran into some issues due to the number of files on the snapraid "array": 32-bit addressing constraints caused snapraid to bork out regularly. No matter the snapraid configuration tweaking/trial & error applied, it kept on requiring more than 4GB of address space. After running "sync" and "scrub" for the first time on the Helios64, I noticed a more than comfortable amount of alleged ATA I/O errors like below:

ata1.00: failed command: READ FPDMA QUEUED

After some searching around on the internet, it appeared that by limiting the SATA link speed, these errors can be prevented. Checking other server deployments, this behavior was also seen in an 8-disk mdadm RAID setup, where [new] WD Blue disks also showed these READ FPDMA QUEUED errors, which disappeared after ATA error handling started to turn down SATA link speeds to 3Gbps. To test this out, I added the following to /boot/armbianEnv.txt:

extraargs=libata.force=3.0

Upon rebooting the box, it appears that libata indeed limited the SATA link speed for all drives to 3Gbps:

Oct 29 22:01:59 localhost kernel: [    3.143259] ata1: FORCE: PHY spd limit set to 3.0Gbps
Oct 29 22:01:59 localhost kernel: [    3.143728] ata1: SATA max UDMA/133 abar m8192@0xfa010000 port 0xfa010100 irq 238
Oct 29 22:01:59 localhost kernel: [    3.143736] ata2: FORCE: PHY spd limit set to 3.0Gbps
Oct 29 22:01:59 localhost kernel: [    3.144192] ata2: SATA max UDMA/133 abar m8192@0xfa010000 port 0xfa010180 irq 239
Oct 29 22:01:59 localhost kernel: [    3.144199] ata3: FORCE: PHY spd limit set to 3.0Gbps
Oct 29 22:01:59 localhost kernel: [    3.144654] ata3: SATA max UDMA/133 abar m8192@0xfa010000 port 0xfa010200 irq 240
Oct 29 22:01:59 localhost kernel: [    3.144661] ata4: FORCE: PHY spd limit set to 3.0Gbps
Oct 29 22:01:59 localhost kernel: [    3.145115] ata4: SATA max UDMA/133 abar m8192@0xfa010000 port 0xfa010280 irq 241
Oct 29 22:01:59 localhost kernel: [    3.145122] ata5: FORCE: PHY spd limit set to 3.0Gbps
Oct 29 22:01:59 localhost kernel: [    3.145603] ata5: SATA max UDMA/133 abar m8192@0xfa010000 port 0xfa010300 irq 242

Redoing the snapraid scrub, the READ FPDMA QUEUED errors had indeed disappeared. As the disks in the box are WD Red HDDs, there is not really a point in having a 6Gbps (~600MB/s) SATA link speed anyway; disk performance is rated at less than 300MB/s throughput. (Occasionally it tops out around 130MiB/s sustained sequential reads for large files.) Note that YMMV. Groetjes,
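A hedged sketch of applying that kernel parameter on an Armbian box (run as root; if your armbianEnv.txt already contains an extraargs line, extend that line instead of appending a second one):

```shell
# Sketch: cap the SATA link speed at 3Gbps via a kernel parameter in
# /boot/armbianEnv.txt, with a backup first.
if [ -f /boot/armbianEnv.txt ]; then
    cp /boot/armbianEnv.txt /boot/armbianEnv.txt.bak
    echo 'extraargs=libata.force=3.0' >> /boot/armbianEnv.txt
fi
# After a reboot, verify that libata honoured the limit:
dmesg | grep 'FORCE: PHY spd limit' || true
```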
-
D'oh. Looks like that is indeed the case. Will plan to add some clearance for the front panel during the next scheduled downtime.
-
Hi all, In the last few days I finally found some time to migrate my Helios4 backup node to Helios64. In the beginning I had some trouble accessing the serial console, but this was resolved in the end. I want to ask if anyone has issues with the red disk status LEDs? It seems like 3 of them are not responding to setting values in /sys/class/leds/helios64:red:ata?-err/brightness. Looks a bit odd with almost all of the error LEDs on. Groetjes,
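For anyone wanting to reproduce the check, a minimal sketch of toggling those LEDs via sysfs (run as root on the Helios64; the glob follows the path given above):

```shell
# Sketch: blink each red ata-error LED in turn to see which ones respond.
for led in /sys/class/leds/helios64:red:ata?-err; do
    [ -w "${led}/brightness" ] || continue
    echo 1 > "${led}/brightness"
    sleep 1
    echo 0 > "${led}/brightness"
done
true
```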
-
Hi, A quick update to all who have Helios4 boxes that freeze up. After board replacement and switching PSUs, the box still froze from time to time. After the last freeze, I decided to move the rootfs to a USB stick, to rule out anything related to the SDcard. The SDcard is a SanDisk Ultra 32GB class 10 A1 UHS-I HC1 type of card, picture below. After using the SDcard for booting kernel + initrd only, the box has been going strong for quite a while, both under load and when idling around:

07:39:29 up 21 days, 29 min, 6 users, load average: 0.14, 0.14, 0.10

Note that the uptime is actually more than shown; the box was rebooted due to unrelated issues and some planned downtime. Hope this will help some of you. Groetjes,
-
Hi, Yesterday evening, the second box froze up again. This time I had the other PSU connected, to rule out any PSU failure.

Time        CPU      load  %cpu %sys %usr %nice %io %irq   CPU    C.St.
23:40:16: 1600MHz    0.10    5%   2%   0%   0%   2%   0%  67.5°C  0/0
23:40:21: 1600MHz    0.09   15%   6%   0%   3%   3%   1%  68.0°C  0/0
23:40:26: 1600MHz    0.09    7%   2%   0%   0%   1%   1%  67.5°C  0/0
23:40:31:  800MHz    0.16   13%   6%   0%   4%   2%   0%  67.0°C  0/0
23:40:36: 1600MHz    0.23    6%   2%   0%   1%   2%   0%  68.0°C  0/0
23:40:41: 1600MHz    0.21   14%   6%   0%   3%   3%   0%  67.5°C  0/0
23:40:47: 1600MHz    0.19    6%   2%   0%   1%   1%   0%  67.5°C  0/0

Same symptoms. The first box is still going strong after the PSU swap:

11:37:11 up 3 days, 14:18, 8 users, load average: 1.43, 1.55, 1.60

Groetjes,
-
Hi, They are both from batch #2. I will try to swap the PSU with the other box over the weekend and see if it still happens. Thanks, Groetjes,
-
Bad news... It took almost a week, but my 2nd box froze today. Last output on the serial console:

Time        CPU      load  %cpu %sys %usr %nice %io %irq   CPU    C.St.
17:41:22:  800MHz    0.19   10%   2%   0%   1%   5%   0%  67.0°C  0/0
17:41:28: 1600MHz    0.17   14%   6%   0%   4%   3%   1%  67.5°C  0/0
17:41:33:  800MHz    0.16    6%   2%   0%   1%   1%   0%  67.0°C  0/0
17:41:38: 1600MHz    0.31   15%   6%   0%   3%   2%   1%  67.5°C  0/0
17:41:43: 1600MHz    0.36    7%   2%   0%   1%   2%   1%  68.0°C  0/0
17:41:48: 1600MHz    0.33   15%   6%   0%   5%   1%   0%  67.5°C  0/0
17:41:53: 1600MHz    0.31    6%   2%   0%   1%   1%   0%  68.5°C  0/0

Same symptoms: NIC is blinking, heartbeat LED is off, serial console is unresponsive. Any thoughts? Groetjes,
-
Hi, Box #0 (4x WD Red 4TB):

/dev/sda  30  WDC WD40EFRX-68N32N0  WD-WCC7K1KS6EUY
/dev/sdb  32  WDC WD40EFRX-68N32N0  WD-WCC7K2KCR8J9
/dev/sdc  31  WDC WD40EFRX-68N32N0  WD-WCC7K1RVY70H
/dev/sdd  31  WDC WD40EFRX-68N32N0  WD-WCC7K6FE2Y7D

Box #1 (4x WD Blue 2TB):

/dev/sda  32  WDC WD20EZRZ-00Z5HB0  WD-WCC4M2VFJPP7
/dev/sdb  31  WDC WD20EZRZ-00Z5HB0  WD-WCC4M0JNZ8EX
/dev/sdc  31  WDC WD20EZRZ-00Z5HB0  WD-WCC4M6HPXZFZ
/dev/sdd  29  WDC WD20EZRZ-00Z5HB0  WD-WCC4M6HPX4AP

No other devices are connected to either box. The 2nd column is the HDD-reported temperature in Celsius; it's always around 30~35. Groetjes,
-
Hi, My second Helios4 box also randomly hangs/stalls at unexpected times. It does not seem load-dependent, as today it stalled while the system was not doing anything (all load-intensive tasks run overnight). The Helios4 boxen are both connected to a Raspberry Pi2b via USB/serial console to check if anything odd happened, but unfortunately the serial console of the stalled box always shows nothing but the login prompt. No Oops, no "BUG on" or any other error message. Symptoms:
- The heartbeat LED stops blinking.
- NIC LEDs still show activity.
- Fans remain at a constant speed, no longer regulated.
- Serial console is unresponsive.
- /var/log/* and /var/log.hdd/* show nothing out of the ordinary; the logging just stops at some point. Most likely armbian-ramlog prevents the interesting bits from being flushed to /var/log.hdd/...
I will now watch (a customized) sensors + armbianmonitor -m on the serial console. Also, I will disable armbian-ramlog for now. Do you have any idea what else I can redirect to the serial console periodically, to check if anything out of the ordinary is happening? Groetjes,
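One way to do the periodic redirect, sketched under assumptions: /dev/ttyS0 is a placeholder for the board's console device, and the snippet is meant to be run from cron or a systemd timer (e.g. every minute) so the last readings survive a freeze:

```shell
# Sketch: push a one-shot monitoring snapshot to the serial console.
# Timestamp, thermal sensors, and load; errors are swallowed so a missing
# sensors binary or console device does not kill the cron job.
{ date; sensors 2>/dev/null; uptime; } > /dev/ttyS0 2>/dev/null || true
```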
-
@gprovost, indeed /proc/irq/${IRQ}/smp_affinity. What I saw irqbalance do is perform a one-shot assignment of all IRQs to CPU1 (from CPU0) and then... basically nothing. The "cpu mask" shown by irqbalance is '2', which led me to the assumption that it does not take CPU0 into account as a CPU it can use for the actual balancing. So all IRQs are "balanced" onto one CPU only. Overall, the logic for spreading the IRQ assignment over the two CPU cores was: when CPU1 is handling ALL incoming IRQs (SATA, XOR and CESA, for all disks in a [software] RAID setup), CPU1 will be the bottleneck of all transactions. The benefit of the one-shot 'balancing' is that you do not really need to worry about ongoing IRQ handler migration from CPUx to CPUy; the same CPU handles the same IRQ all the time, so nothing needs to migrate continuously. Any further down the rabbit hole would require me to go back to my college textbooks. Groetjes,
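The manual pinning described above can be sketched as follows; IRQ=27 is a hypothetical example, pick the real SATA/XOR/CESA IRQ numbers from /proc/interrupts:

```shell
# Sketch: compute the affinity bitmask for a given CPU and write it to
# /proc/irq/${IRQ}/smp_affinity (requires root on a real box).
CPU=1
IRQ=27
MASK=$(printf '%x' $((1 << CPU)))   # CPU1 -> bitmask 0x2, CPU0 would be 0x1
[ -w "/proc/irq/${IRQ}/smp_affinity" ] && echo "${MASK}" > "/proc/irq/${IRQ}/smp_affinity"
echo "affinity mask for CPU${CPU}: ${MASK}"
```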
