Declan Posted May 16, 2020

New to this forum; it seems like the best place to ask this. My Helios4 encountered some errors and now will not boot. It runs OMV from the SD card supplied with the kit and has 4 data drives, synced with SnapRAID.

The first sign was that monit sent me emails about two discs.

Parity disc 1:
Filesystem flags changed
Service filesystem_srv_dev-disk-by-label-Parity1
Description: filesystem flags changed to 0xc00

Status failed
Service mountpoint_srv_dev-disk-by-label-Parity1
Description: status failed (1) -- /srv/dev-disk-by-label-Parity1 is not a mountpoint

Data disc 2:
Filesystem flags changed
Service filesystem_srv_dev-disk-by-label-Data2
Description: filesystem flags changed to 0x1009

These emails were sent just after midnight, which has me thinking that a daily service may have changed something, possibly unattended-upgrades?

It now does not fully boot. When connected to the serial port, this appears to be the issue:
[ TIME ] Timed out waiting for device dev-disk-by\x2dlabel-Data2.device

All the discs time out and do not seem to power on, which has me thinking it might be a power supply issue.

Serial output: https://paste.systemli.org/?d42d768fe6789c45#7HrQSfMdpemMUeQMX3hSJeXHfw1yNeDGdXM9iHpt88CE
Serial journalctl -xb output: https://paste.systemli.org/?70aaaa320420d245#Cu1PPmKjpTbodMdjcprefi3TcAaXXnPwpFpNZccxtdid

I have tried to figure it out but am stuck. Does anyone have any ideas? Happy to provide more information if needed.

Thanks in advance
gprovost (Author) Posted May 17, 2020

@Declan Yes, most likely a faulty / dying PSU. The output voltage of the PSU must have dropped well below 12V while staying above 5V, which would explain why the system still powers up.

By any chance, do you have a voltmeter? Could you measure the DC voltage on the Molex power connector shown in the photo below?

Expected measured values:
on 5V rail: 4.90 V - 5.20 V
on 12V rail: 11.90 V - 12.5 V

If 12V is outside that range, the power supply is faulty.

As you have seen in this thread's history, a few users have experienced the same kind of issue. Unfortunately, the brand of capacitors used in the PSU seems to have a higher fault rate than expected. This is something we addressed in our new project (Helios64) by completely changing the capacitor brand.
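The range check described above can be written down as a small Python sketch. It only encodes the two ranges quoted in the post; the function names and the example readings are made up for illustration and are not part of any Helios4 tooling.

```python
from __future__ import annotations

# Acceptable DC ranges quoted in the post above, in volts.
EXPECTED_RANGES = {
    "5V": (4.90, 5.20),
    "12V": (11.90, 12.50),
}


def check_rail(rail: str, measured_volts: float) -> bool:
    """Return True if one measurement is inside the expected range."""
    low, high = EXPECTED_RANGES[rail]
    return low <= measured_volts <= high


def check_readings(readings: dict[str, list[float]]) -> dict[str, bool]:
    """A rail passes only if every sample taken on it is in range."""
    return {
        rail: all(check_rail(rail, volts) for volts in samples)
        for rail, samples in readings.items()
    }


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    example = {"5V": [5.08], "12V": [12.1, 11.2]}
    for rail, ok in check_readings(example).items():
        print(f"{rail} rail: {'OK' if ok else 'out of range'}")
```

Taking several samples per rail matters because a failing supply may read fine at one instant and sag the next, especially once drives are connected.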
Koen Posted May 17, 2020

@gprovost Is the Helios4 supposed to keep an accurate clock when unplugged? I keep having issues on my boards. Is there anything I could do or check?
gprovost (Author) Posted May 18, 2020

@Koen The Helios4 does have a battery-powered RTC; however, until now it wasn't used in Armbian because the builds use fake-hwclock, a generic solution for the many SBCs that don't have an RTC at all.

Anyhow, it's something we addressed recently, to be part of the Armbian 20.05 release: https://github.com/armbian/build/commit/e3dd8abedb2216e9ce30e74827ccfe5c13a12a5f

However, the change won't get applied on an already installed system. So here is how to make your Helios4 use the RTC clock:

1. Delete the fake-hwclock package

apt-get purge fake-hwclock

2. Edit /lib/udev/hwclock-set and comment out the following lines at the beginning of the file

if [ -e /run/systemd/system ] ; then
    exit 0
fi

That's all.

HOW TO TEST:

1. boot your system with network
2. let the system sync time over ntp
3. check that both system time and rtc time are updated

$> timedatectl status
Local time: Thu 2020-05-07 14:46:48 +08
Universal time: Thu 2020-05-07 06:46:48 UTC
RTC time: Thu 2020-05-07 06:46:48
[...]

4. poweroff the system and disconnect the PSU
5. wait 10 min
6. remove the network cable
7. start the system without network (you don't want chrony to sync the time)
8. check the system is still at the current time

$> timedatectl status
Local time: Thu 2020-05-07 14:58:30 +08
Universal time: Thu 2020-05-07 06:58:30 UTC
RTC time: Thu 2020-05-07 06:58:30
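Steps 3 and 8 of the test above come down to comparing the "Universal time" and "RTC time" lines. A minimal Python sketch of that comparison follows. It assumes the output format shown in the post and that the RTC holds UTC; the function name is hypothetical, and other timedatectl versions may print these lines differently.

```python
import re
from datetime import datetime

# Output copied from the post above.
SAMPLE = """Local time: Thu 2020-05-07 14:46:48 +08
Universal time: Thu 2020-05-07 06:46:48 UTC
RTC time: Thu 2020-05-07 06:46:48
"""

_STAMP = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
_FORMAT = "%Y-%m-%d %H:%M:%S"


def rtc_drift_seconds(status_text: str) -> float:
    """Return RTC time minus universal time, in seconds."""
    utc = re.search(r"Universal time:\s*\w+ " + _STAMP, status_text)
    rtc = re.search(r"RTC time:\s*\w+ " + _STAMP, status_text)
    if utc is None or rtc is None:
        raise ValueError("Universal time or RTC time line not found")
    utc_time = datetime.strptime(utc.group(1), _FORMAT)
    rtc_time = datetime.strptime(rtc.group(1), _FORMAT)
    return (rtc_time - utc_time).total_seconds()


if __name__ == "__main__":
    print(f"RTC drift: {rtc_drift_seconds(SAMPLE):+.0f} s")
```

On a live system the text could come from running `timedatectl status` and capturing its output; after the offline boot in step 7, a drift of a second or two is expected, while hours or years would suggest the RTC is not being used.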
Declan Posted May 18, 2020

@gprovost Thanks for the suggestions. Using a voltmeter, both Molex connectors measured:

5V rail: 5.15-5.21 V
12V rail: 12.27 V

Therefore it's not the power supply?
gprovost (Author) Posted May 19, 2020

10 hours ago, Declan said:
Therefore it's not the power supply?

Could you repeat the measurement with 2x HDDs connected to one of the Molex headers?

Also, please provide the output link generated by armbianmonitor -u
Declan Posted May 19, 2020

15 hours ago, gprovost said:
Could you do the measure with 2x HDD connected to one of the Molex header ? Also please provide the output link generated by armbianmonitor -u

With 2 HDDs plugged in:

5V: 5.08
12V: fluctuated within a range of 10.09 - 12.2, with the HDDs sounding like they were repeatedly failing to start up

Also, armbianmonitor -U output: https://paste.systemli.org/?7b660a360e84f4ad#99fbr5bfjwR6yij3bcKv1xufgu541hWvo4CV9a9PPr6w

I now believe it is the power supply. Do you have a recommendation for a replacement power supply?
gprovost (Author) Posted May 20, 2020

@Declan Yes, it's a failing PSU. Right now it's difficult for us to send spare parts because of the lockdown in Singapore. Plus, it will most probably be faster and cheaper (if you account for shipping fees) to order a substitute model on Amazon.

Here is the one we recommend and have tested: https://www.amazon.com/dp/B07NCG1P8X

I'm not sure what your country of residence is, so you might have to look for the same model on the marketplace for your country.

Hope it helps.
DavidGF Posted May 21, 2020

So, reporting back for my Helios. I haven't had any issues using fixed clock freqs. I set it to 800MHz and bump it to a fixed 1.6GHz if I need to do more stuff on it. That also helps the fans run quieter, so not so bad in the end.

Now, an issue I've been having every month or so is loss of network connectivity. The SPI screen shows the MAC addr when that happens (instead of the IP). The IP is fixed, so no shenanigans there. I also have other devices on the wired connection, so I'm assuming it's not the router nor the cabling. A log looks like this:

[82229.270768] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[82266.140770] mvneta f1070000.ethernet eth0: Link is Down
[82269.204030] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[82273.301606] mvneta f1070000.ethernet eth0: Link is Down
[82275.346281] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[82277.400309] mvneta f1070000.ethernet eth0: Link is Down
[82281.490395] mvneta f1070000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[82525.192593] mvneta f1070000.ethernet eth0: Link is Down
[82527.234861] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[82537.477555] mvneta f1070000.ethernet eth0: Link is Down
[82545.668087] mvneta f1070000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[159979.389745] hrtimer: interrupt took 22241 ns
[280304.142020] mvneta f1070000.ethernet eth0: Link is Down
[280307.208683] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[428513.940938] TCP: request_sock_TCP: Possible SYN flooding on port 6881. Sending cookies. Check SNMP counters.
[852021.596335] TCP: request_sock_TCP: Possible SYN flooding on port 2000. Sending cookies. Check SNMP counters.
[1116437.428694] mvneta f1070000.ethernet eth0: Link is Down
[1116439.474454] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[1116447.664646] mvneta f1070000.ethernet eth0: Link is Down
[1116818.324191] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx

As you can see, the last interruption lasted a few minutes and was only resolved after I manually unplugged and replugged the eth cable. But it also happens every now and then for a few seconds, which is not super noticeable because TCP hides those 4-5s interruptions.

Any ideas?
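The length of each interruption can be read off the kernel timestamps by pairing every "Link is Down" with the next "Link is Up". A small Python sketch, assuming the mvneta message format shown in the log above; the function name is hypothetical.

```python
import re

# Last four link events from the log above.
SAMPLE = """[1116437.428694] mvneta f1070000.ethernet eth0: Link is Down
[1116439.474454] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[1116447.664646] mvneta f1070000.ethernet eth0: Link is Down
[1116818.324191] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
"""

LINK_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+mvneta \S+ \w+: Link is (Up|Down)")


def link_outages(log_text):
    """Return (down_timestamp, seconds_down) for each Down -> Up pair."""
    outages = []
    down_at = None
    for match in LINK_RE.finditer(log_text):
        timestamp = float(match.group(1))
        if match.group(2) == "Down":
            if down_at is None:
                down_at = timestamp
        elif down_at is not None:
            outages.append((down_at, round(timestamp - down_at, 3)))
            down_at = None
    return outages


if __name__ == "__main__":
    for down_at, seconds in link_outages(SAMPLE):
        print(f"down at {down_at:.0f} s uptime for {seconds:.1f} s")
```

For the sample, this gives one outage of about 2 seconds and one of about 371 seconds, which matches the "few minutes" interruption described in the post.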
gprovost (Author) Posted May 22, 2020

@DavidGF Are you 100% sure it's not wiring related? Could you change the cable and maybe also swap ports on your router/switch? I ask because in your trace there are 2 occurrences where the PHY only negotiated the link at 100Mbps.

[82281.490395] mvneta f1070000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[82545.668087] mvneta f1070000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
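Spotting those downgraded negotiations in a long log is easy to automate. The Python sketch below lists every "Link is Up" event that came up slower than an expected speed. The message format is taken from the log in this thread; the function name and the 1000 Mbps default are assumptions for illustration.

```python
import re

SAMPLE = """[82275.346281] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[82281.490395] mvneta f1070000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[82545.668087] mvneta f1070000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
"""

SPEED_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+mvneta \S+ \w+: Link is Up - (\d+)([MG])bps/\w+"
)


def downgraded_links(log_text, expected_mbps=1000):
    """Return (timestamp, mbps) for links negotiated below expected_mbps."""
    slow = []
    for timestamp, value, unit in SPEED_RE.findall(log_text):
        mbps = int(value) * (1000 if unit == "G" else 1)
        if mbps < expected_mbps:
            slow.append((float(timestamp), mbps))
    return slow


if __name__ == "__main__":
    for timestamp, mbps in downgraded_links(SAMPLE):
        print(f"[{timestamp}] link came up at only {mbps} Mbps")
```

A gigabit link falling back to 100 Mbps is commonly associated with a cable or connector problem, which is why swapping the cable and the switch port is a sensible first test.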
Werner Posted May 22, 2020

14 hours ago, DavidGF said:
So reporting back for my Helios.

Small tip: You can use the <> code function in the post editor to make your console output more awesome
Declan Posted May 25, 2020

@gprovost I ordered a UK power supply: https://www.amazon.co.uk/gp/product/B07VL83VCP

It arrived yesterday, and so far everything is back up and running.

Very much appreciate the help troubleshooting, thanks
gprovost (Author) Posted May 26, 2020

15 hours ago, Declan said:
Arrived yesterday, so far everything is back up and running.

Good to hear ;-)
DavidGF Posted May 26, 2020

@gprovost I changed the router port and it seemed OK for a couple of days, but now it is happening again (very short, like 2 sec). The only thing left to try is indeed the cable, but it's a rather good cable and not longer than 10m, so it would be strange.

What I don't get is why it fails without recovering every month or so and I have to reconnect the cable manually. Shouldn't it keep trying to re-negotiate the eth link? So weird IMHO.
DavidGF Posted May 26, 2020

Wait, there's more: this time there's a SATA issue in the middle, so perhaps they are somehow related? Here comes the syslog:

[155235.130866] mvneta f1070000.ethernet eth0: Link is Down
[155238.204512] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
[189215.220539] mvneta f1070000.ethernet eth0: Link is Down
[189217.269545] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
[190094.777185] mvneta f1070000.ethernet eth0: Link is Down
[190097.845100] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
[191713.602453] mvneta f1070000.ethernet eth0: Link is Down
[191719.750764] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
[192245.025907] mvneta f1070000.ethernet eth0: Link is Down
[192247.066877] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
[195015.773833] mvneta f1070000.ethernet eth0: Link is Down
[195020.893592] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
[195093.594020] mvneta f1070000.ethernet eth0: Link is Down
[201080.502476] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
[201080.592561] ata2.00: exception Emask 0x10 SAct 0x40000000 SErr 0x380000 action 0x6 frozen
[201080.592568] ata2.00: irq_stat 0x08000000, interface fatal error
[201080.592575] ata2: SError: { 10B8B Dispar BadCRC }
[201080.592585] ata2.00: failed command: READ FPDMA QUEUED
[201080.592602] ata2.00: cmd 60/00:f0:28:5b:f6/01:00:19:01:00/40 tag 30 ncq dma 131072 in res 40/00:f0:28:5b:f6/00:00:19:01:00/40 Emask 0x10 (ATA bus error)
[201080.592607] ata2.00: status: { DRDY }
[201080.592616] ata2: hard resetting link
[201081.067429] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[201081.069568] ata2.00: configured for UDMA/133
[201081.069591] ata2: EH complete
[201081.098432] ata2.00: exception Emask 0x10 SAct 0x1 SErr 0x300000 action 0x6 frozen
[201081.098438] ata2.00: irq_stat 0x08000000, interface fatal error
[201081.098445] ata2: SError: { Dispar BadCRC }
[201081.098452] ata2.00: failed command: READ FPDMA QUEUED
[201081.098468] ata2.00: cmd 60/00:00:28:5b:f6/01:00:19:01:00/40 tag 0 ncq dma 131072 in res 40/00:00:28:5b:f6/00:00:19:01:00/40 Emask 0x10 (ATA bus error)
[201081.098473] ata2.00: status: { DRDY }
[201081.098485] ata2: hard resetting link
[201081.581866] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[201081.584000] ata2.00: configured for UDMA/133
[201081.584021] ata2: EH complete
[201081.608251] ata2.00: exception Emask 0x10 SAct 0x40000 SErr 0x300000 action 0x6 frozen
[201081.608257] ata2.00: irq_stat 0x08000000, interface fatal error
[201081.608263] ata2: SError: { Dispar BadCRC }
[201081.608273] ata2.00: failed command: READ FPDMA QUEUED
[201081.608289] ata2.00: cmd 60/00:90:28:5b:f6/01:00:19:01:00/40 tag 18 ncq dma 131072 in res 40/00:90:28:5b:f6/00:00:19:01:00/40 Emask 0x10 (ATA bus error)
[201081.608294] ata2.00: status: { DRDY }
[201081.608302] ata2: hard resetting link
[201083.575530] mvneta f1070000.ethernet eth0: Link is Down
[201092.108849] ata2: softreset failed (1st FIS failed)
[201092.108856] ata2: hard resetting link
[201093.085525] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[201093.087647] ata2.00: configured for UDMA/133
[201093.087669] ata2: EH complete
[201123.506839] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
[201123.888855] ata2: limiting SATA link speed to 1.5 Gbps
[201123.888866] ata2.00: exception Emask 0x10 SAct 0x80 SErr 0x380000 action 0x6 frozen
[201123.888871] ata2.00: irq_stat 0x08000000, interface fatal error
[201123.888878] ata2: SError: { 10B8B Dispar BadCRC }
[201123.888886] ata2.00: failed command: READ FPDMA QUEUED
[201123.888903] ata2.00: cmd 60/00:38:08:c1:ef/01:00:19:01:00/40 tag 7 ncq dma 131072 in res 40/00:38:08:c1:ef/00:00:19:01:00/40 Emask 0x10 (ATA bus error)
[201123.888908] ata2.00: status: { DRDY }
[201123.888918] ata2: hard resetting link
[201124.363551] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[201124.365795] ata2.00: configured for UDMA/133
[201124.365821] ata2: EH complete
[201132.723242] mvneta f1070000.ethernet eth0: Link is Down
[201135.794425] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
[201138.865790] mvneta f1070000.ethernet eth0: Link is Down
[201141.933754] mvneta f1070000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
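To get an overview of a log like this, the SError flags can be tallied per port. In libata's output, 10B8B, Dispar and BadCRC name link-level errors (10b-to-8b decode, disparity and CRC), so they describe the link between controller and drive rather than the disk surface. The Python sketch below is only an illustration built on the message format seen above; the function name is hypothetical.

```python
import re
from collections import Counter

SAMPLE = """[201080.592575] ata2: SError: { 10B8B Dispar BadCRC }
[201081.098445] ata2: SError: { Dispar BadCRC }
[201123.888855] ata2: limiting SATA link speed to 1.5 Gbps
"""

SERROR_RE = re.compile(r"(ata\d+): SError: \{([^}]*)\}")
LIMIT_RE = re.compile(r"(ata\d+): limiting SATA link speed to ([\d.]+) Gbps")


def summarize_sata_errors(log_text):
    """Return (SError flag counts per port, link speed limits per port)."""
    flags = {}
    for port, names in SERROR_RE.findall(log_text):
        flags.setdefault(port, Counter()).update(names.split())
    limits = {port: float(gbps) for port, gbps in LIMIT_RE.findall(log_text)}
    return flags, limits


if __name__ == "__main__":
    flag_counts, speed_limits = summarize_sata_errors(SAMPLE)
    for port, counts in flag_counts.items():
        print(port, dict(counts), "limited to", speed_limits.get(port), "Gbps")
```

Errors of this kind, all on one port, are often a hint to look at the cable, connector or power feeding that drive, although a log alone cannot prove the cause.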
Heisath Posted May 27, 2020

A shot in the dark: another failing power supply? Have you measured, or can you measure, the voltages on the Molex connector while the device is running with disks attached? A picture showing how to do it was posted by gprovost earlier in this thread.
DavidGF Posted May 28, 2020

What kind of power supply failure are we looking for here? Lower voltage than expected? I assume most of this stuff is working on the 3.3V rail, which should be quite isolated from fluctuations on the 12V/5V rails. Also, a multimeter won't be able to measure high-frequency power drops or noise on the DC supply. I guess I'm just a bit surprised that this could really be the issue.
Heisath Posted May 29, 2020

Yeah, a 3.3V failure or higher-frequency noise is hard(er) to find. But you could at least measure 5V and 12V while the board is running with HDDs attached. If all looks fine there, we can investigate further, but maybe the supply brick is broken.

Check this post on where and how to measure:
gprovost (Author) Posted May 29, 2020

@DavidGF It is indeed strange to see this Ethernet link issue together with a SATA link issue. Yeah, I'm not sure about the DC supply being the cause, because a small voltage drop on 12V shouldn't impact the Ethernet, which is on the 3.3V rail. I don't have an idea right now, but it looks more like a hardware issue.

Most probably unrelated, but by any chance do you have SPI enabled in your /boot/armbianEnv.txt?
DavidGF Posted May 29, 2020

My env looks like:

root@helios4:~# cat /boot/armbianEnv.txt
verbosity=1
eth1addr=0A...redacted
spi_workaround=off
overlay_prefix=armada-388-helios4
rootdev=UUID=[redacted]
rootfstype=ext4
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u

I think that's disabled, right? I'm using the NOR flash for u-boot though, but my understanding is that once u-boot boots the kernel, the SPI bus is not used anymore, so it should be "off", right?
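A file like this is a flat list of key=value pairs, so checking one setting programmatically is straightforward. The Python sketch below parses the text posted above and reads the spi_workaround key; it assumes that "SPI enabled" in this thread refers to spi_workaround=on, and the function names are hypothetical.

```python
# Contents copied from the post above (values redacted by the poster).
SAMPLE = """verbosity=1
eth1addr=0A...redacted
spi_workaround=off
overlay_prefix=armada-388-helios4
rootdev=UUID=[redacted]
rootfstype=ext4
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u
"""


def parse_env(text):
    """Parse key=value lines, skipping blanks and '#' comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, since values may contain '='.
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def spi_enabled(env):
    """Assumption: SPI counts as enabled when spi_workaround is 'on'."""
    return env.get("spi_workaround", "off").lower() == "on"


if __name__ == "__main__":
    settings = parse_env(SAMPLE)
    print("SPI enabled:", spi_enabled(settings))
```

Splitting on the first "=" only keeps values such as rootdev=UUID=... intact.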
DavidGF Posted June 1, 2020

Oh, while we are at it: I haven't found info on the heatsink (I guess from your blog posts this is something more "custom made"?). Could you please share what kind of screws go in its holes? I got a 5cm fan (that seems to work well on 3.3V) and I'm trying to hold it with screws, but can't find the right size.

And is there any thread on typical temperature ranges for these things? I never know whether mine is too hot or too cold. I also don't like the fact that fancontrol uses the SoC temp for the big fans, since they should only try to keep the HDDs cool, and they don't affect the SoC temperature that much (I think!)

Thanks!
gprovost (Author) Posted June 2, 2020

@DavidGF Honestly, right now I have no idea what could be causing the issue you report. If things persist and everything points to a hardware issue, then we could arrange a board exchange. We can discuss that in PM.

3 hours ago, DavidGF said:
Could you please share what kind of screws go on their holes?

Heatsink screw holes are M2.5

3 hours ago, DavidGF said:
And is there any thread on typical temperature ranges for these things? I never know whether mine is too hot or too cold. I also don't like the fact that fancontrol uses the SoC temp for the big fans, since it should only try to keep HDDs cool, and it doesn't affect that much the SoC temperature (I think!)

The Marvell Armada 388 SoC is designed to run at high ambient temperature, as you can see in the table below from the hardware datasheet (FYI, Helios4 uses the Commercial variant). The temperature of this SoC (Tj) is expected to be higher than that of the average SoC, so seeing it around 70-80C when the system is loaded is quite normal.

I live in a very humid and hot country, with a daily ambient average of 30C and 70% humidity. My personal Helios4 at idle shows CPU: 55-60C and HDD (4x WD Red): 35-38C

I agree with you that controlling the 2x fans based on SoC temperature wasn't the smartest approach. We should instead have used hddtemp + fancontrol to control the temperature. Here is a pointer on how to do it: https://unix.stackexchange.com/questions/499409/adjust-fan-speed-via-fancontrol-according-to-hard-disk-temperature-hddtemp

I would use at least the HDD that is located on top of the SoC as the main source of temperature to control the case fan.
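The idea of driving the case fans from HDD temperature can be illustrated with a simple linear fan curve. This Python sketch is not fancontrol's algorithm and does not touch any hardware: every threshold and PWM value in it is a made-up assumption, and using the hottest drive as the input is one possible choice (the post above suggests the drive sitting on top of the SoC).

```python
def pwm_for_hdd_temp(temp_c, min_temp=35.0, max_temp=50.0,
                     min_pwm=80, max_pwm=255):
    """Map an HDD temperature in Celsius to a PWM value (0-255 scale).

    Below min_temp the fan stays at min_pwm, above max_temp it runs
    at max_pwm, and in between the value rises linearly.
    All defaults are illustrative, not recommended settings.
    """
    if temp_c <= min_temp:
        return min_pwm
    if temp_c >= max_temp:
        return max_pwm
    fraction = (temp_c - min_temp) / (max_temp - min_temp)
    return round(min_pwm + fraction * (max_pwm - min_pwm))


def pwm_for_drives(temps_c, **curve):
    """Use the hottest drive as the control input (an assumption)."""
    return pwm_for_hdd_temp(max(temps_c), **curve)


if __name__ == "__main__":
    # Hypothetical drive temperatures in Celsius.
    print(pwm_for_drives([36, 38, 35, 37]))
```

With real hardware, the temperatures would come from a tool such as hddtemp and the result would be written to the fan's PWM control, which is what the fancontrol setup linked above takes care of.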
alexcp Posted June 14, 2020

Hello,

My Helios4's power brick died. What's a good replacement? I almost ordered a Mean Well GST120A-R7B but realized its Mini-DIN connector has a different pinout.

Also, Helios4 is end-of-life. What does that mean in practice in terms of software updates and support?
gprovost (Author) Posted June 15, 2020

23 hours ago, alexcp said:
My Helios4's power brick died. What's a good replacement?

Not sure which country you are in, but on Amazon you can find the following good replacement: https://www.amazon.com/dp/B07NCG1P8X

You might have to look for the same product ref. on the marketplace for your country.

23 hours ago, alexcp said:
Also, Helios4 is end-of-life. What does it practically mean in terms of software updates and support?

While our bandwidth is certainly more focused on Helios64, we are still supporting Helios4, with no plan to change that ;-) For instance, the latest Armbian 20.05 Kagu still actively includes Helios4. OK, we still need to put a post on our blog about that and link the image in our wiki :/
alexcp Posted June 16, 2020

Thank you for the link and for continuing support. I realize you cannot focus on Helios4 forever, but it is good to know there will be updates to the box that keeps my data safe.
DavidGF Posted June 16, 2020

Thanks for your responses! So far I can report that my Helios crashes every day unless I do the fixed-frequency clock thingy. The workload is qbittorrent plus frequent ssh use through sshfs or the like. I only notice because I mount some stuff on my PC; otherwise I tend not to notice the crashes, since the watchdog does a pretty good job of restarting it (usually < 60s), so you only see a 2min interruption (which external monitoring does not notice most of the time).

Couple that with the ethernet issues (which, yeah, I haven't ruled out the cable...) and I can say I'm quite disappointed with this device. I used to have a Zyxel NSA320 and it worked so well for so many years (and counting!) even though it was very slow.

Overall I think it might be safer to go for an Intel platform; they are usually rock-solid reliable, with upstream Linux support, etc.

Anyway, thanks for your software efforts and support, appreciated.

PS Forgot to ask, does the new Debian Focal armbian release ship any new kernel? I hope there are some fixes soon
gprovost (Author) Posted June 17, 2020

@DavidGF Actually, something I forgot to ask before, but maybe you already checked: can you confirm that there is a thermal pad between the SoM and the heatsink? You will have to unscrew the heatsink for that. We had one case of a unit missing the thermal pad, and it resulted in frequent system hangs.

Sorry to hear you are disappointed by the device and that we haven't figured out the issue yet. Unfortunately, we are unable to reproduce your issue.

9 hours ago, DavidGF said:
PS Forgot to ask, does the new Debian Focal armbian release ship any new kernel? I hope there's some fixes soon

Yes, it's running kernel 5.4. We still need to put a post on our blog about that and link the new image in our wiki :/ But you can find the images directly in the Armbian download section: https://www.armbian.com/helios4/
NickS Posted June 19, 2020

I just happened to be in my office when I heard a sharp crack from one of my 2 Helios4 servers, which were both up and running at the time. I noticed that the LED display on one had frozen and could not get that server to respond. I power-cycled it and, although it came up with its LED shining, on investigation there was no power to the HDDs or fans. Surprising that it booted at all, but it did! I swapped power supplies and the issue moved with the PSU, so it looks like I have a deadish PSU. I pulled it apart and can see that 2 of the 1000 µF capacitors have blown. I would have replaced them, but the massive heat sink is in the way and is soldered to too many components to bother. I'll buy a new one.

Lesson 1: I immediately suspected the power supply, as other users have been complaining about them nearing end-of-life - so thanks to those users for writing on this forum.

Lesson 2: These PSUs can partially fail and may at first appear to be OK.

Lesson 3: My database was backed up on the surviving server :} Good job, because apart from that being corrupted everything else is OK. Restored the db and the server is back up - but I can only run one at a time until I source another PSU.

Kind regards ... Nick
gprovost (Author) Posted June 22, 2020

@NickS Thanks for sharing your experience and actions regarding a dying PSU.

Just to mention it again here: for Helios64 we have completely changed the PSU supplier to a very reputable one. This way we should no longer experience these prematurely dying PSUs.
Janne Posted June 23, 2020

Another dead Helios4 PSU here. I noticed the ethernet light blinking and heard a clicking sound. I immediately knew what the problem was and removed the PSU.

Now I would like to check whether any of the drives (or the main board) still work. I could grab an ATX power supply and take 12V from there, but I don't know the polarity of the Mini-DIN connector.