technik007_cz Posted June 2, 2018
Hi, I recently received two identical boards. One works great; the other suffers from unexpected shutdowns. Rpimonitor has never recorded temperatures as high as those in the log below; the maximum it recorded was no more than 90°C. The SoC was at 100% utilization with ffmpeg video conversion and other tasks. I run these boards on Jessie or Bionic and have tested the mainline stable and beta kernels. The boards have a passive heatsink; I put a quiet, slow 80x80 fan on top of the heatsink and replaced the thermal compound with a silver one, but that did not solve the problem.

cat /var/log/syslog | grep critical
May 27 23:32:30 localhost kernel: [ 10.113037] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 27 23:32:30 localhost kernel: [ 10.125343] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 00:05:44 localhost kernel: [ 1774.216599] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 00:05:44 localhost kernel: [ 1774.226029] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 21:36:00 localhost kernel: [ 63.205422] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 21:36:00 localhost kernel: [ 63.214815] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 21:36:00 localhost kernel: [ 63.482182] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 21:36:00 localhost kernel: [ 63.488590] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 21:36:00 localhost kernel: [ 63.742200] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 21:36:00 localhost kernel: [ 63.767258] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 21:36:01 localhost kernel: [ 64.002200] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 21:36:01 localhost kernel: [ 64.027199] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 22:00:00 localhost kernel: [ 10.465442] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 22:00:00 localhost kernel: [ 10.475037] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 22:00:00 localhost kernel: [ 10.549210] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 22:00:00 localhost kernel: [ 10.564189] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 22:00:00 localhost kernel: [ 10.571505] thermal thermal_zone0: critical temperature reached(228 C),shutting down
Jun 2 00:21:47 localhost kernel: [ 71.572150] thermal thermal_zone0: critical temperature reached(228 C),shutting down
Jun 2 00:21:47 localhost kernel: [ 71.589459] thermal thermal_zone1: critical temperature reached(234 C),shutting down
Jun 2 01:46:04 localhost kernel: [ 12.220382] thermal thermal_zone0: critical temperature reached(228 C),shutting down
Jun 2 01:46:04 localhost kernel: [ 12.230128] thermal thermal_zone1: critical temperature reached(234 C),shutting down
Jun 2 19:36:54 localhost kernel: [ 1502.078615] thermal thermal_zone1: critical temperature reached(234 C),shutting down
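A quick way to see what the kernel is actually reporting is to tally the critical-temperature messages per zone. The sketch below is only an illustration: the function name `summarize_critical` and the regular expression are assumptions fitted to the message format in the log above, not part of any tool mentioned in this thread.

```python
import re
from collections import defaultdict

# Assumed message format, taken from the syslog excerpt above:
# "... thermal thermal_zone0: critical temperature reached(228 C),shutting down"
CRITICAL_RE = re.compile(
    r"thermal (thermal_zone\d+): critical temperature reached\((-?\d+) C\)"
)


def summarize_critical(lines):
    """Return {zone: {"count": n, "temps": sorted distinct readings in °C}}."""
    events = defaultdict(list)
    for line in lines:
        match = CRITICAL_RE.search(line)
        if match:
            events[match.group(1)].append(int(match.group(2)))
    return {
        zone: {"count": len(temps), "temps": sorted(set(temps))}
        for zone, temps in events.items()
    }
```

Feeding it the lines of /var/log/syslog (for example `summarize_critical(open("/var/log/syslog"))`) shows at a glance whether a zone ever reports more than one distinct critical value.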
rooted Posted June 3, 2018
The reported temperature can't be correct; the board would have shut down much sooner. See this: https://github.com/hardkernel/linux/blob/378d8f85c4f4708fc4266689283fa0202ca700a3/drivers/thermal/thermal_core.c While I know this is for 3.10.y, it still seems to apply here.
technik007_cz Posted June 3, 2018 Author
12 hours ago, rooted said: The reported temperature can't be correct; the board would have shut down much sooner.
Maybe it's incorrect, but it shuts the board down anyway. Secondly, why does only one board have this issue, even though the system is cloned and both boards are connected to the same power rail (rated 5V 10A)?
rooted Posted June 3, 2018
I'm not saying there isn't a problem, I am just saying the board will shut down somewhere around a report of 95 or 100°C. You can echo a false temperature like this and the board will shut down:
echo 120000 | sudo tee /sys/devices/virtual/thermal/thermal_zone0/emul_temp
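The value written to `emul_temp` above is in millidegrees Celsius (120000 means 120°C), and each zone's `temp` file in the same sysfs directory uses the same unit. As a rough sketch, the readings of all zones can be polled and sanity-checked like this; the function names and the plausibility window of 0 to 125°C are assumptions chosen for illustration, not kernel limits.

```python
from pathlib import Path


def read_zone_temps(base="/sys/devices/virtual/thermal"):
    """Return {zone_name: temperature in °C} for every thermal_zone* under base."""
    temps = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone has no readable temp file or returned garbage
        temps[zone.name] = millideg / 1000.0
    return temps


def implausible(temps, low=0.0, high=125.0):
    """Return the readings outside the window [low, high] (bounds are assumptions)."""
    return {zone: t for zone, t in temps.items() if not low <= t <= high}
```

Logging the output of `implausible(read_zone_temps())` once a second would catch values such as 228°C as soon as a zone reports them.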
technik007_cz Posted June 3, 2018 Author
2 hours ago, rooted said: a report of 95 or 100°C.
And what if the temperature spikes above this limit for a couple of ms during heavy load? What I could test is a better cooling block/heatsink with more heat absorption capacity, but I have no idea where to get a better one. I made a USB to 3-pin fan adapter with a step-up converter boosting 5V to 12V, and have started testing again.
zador.blood.stained Posted June 3, 2018
19 minutes ago, technik007_cz said: And what if the temperature spikes above this limit for a couple of ms during heavy load?
+120°C in a couple of ms? Highly unlikely, and your syslog for some reason always contains 228°C for thermal_zone0 and 234°C for thermal_zone1, while, for example, the mainline kernel defines the critical temp at 115°C.
TonyMac32 Posted June 4, 2018
4 hours ago, zador.blood.stained said: +120°C in a couple of ms? Highly unlikely
You could say that... To gain 120 degrees C in, let's say, 10 ms with Si's Cp = 711.75 J/(kg·K), assuming say 1 g of silicon (really gross estimation, but I'm having some fun here), it should take something to the tune of 8500 watts of power, unless I forgot to carry a decimal somewhere... I have some pretty hefty equipment at work and I can't pull that off without a much broader "thermal event" (and probably getting fired).
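Written out with the same rough assumptions as the post (1 g of silicon, a 120 K rise in 10 ms, no heat lost to the heatsink), the estimate works out as follows:

```latex
\begin{align*}
Q &= m\,c_p\,\Delta T
   = 0.001\ \mathrm{kg} \times 711.75\ \mathrm{J/(kg\,K)} \times 120\ \mathrm{K}
   \approx 85.4\ \mathrm{J} \\
P &= \frac{Q}{t} = \frac{85.4\ \mathrm{J}}{0.010\ \mathrm{s}} \approx 8.5\ \mathrm{kW}
\end{align*}
```

So the figure of roughly 8500 W holds, with no decimal lost along the way.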
technik007_cz Posted June 4, 2018 Author
13 hours ago, zador.blood.stained said: +120°C in a couple of ms? Highly unlikely,
I checked the graphs "active cooler (ODROID-XU4) VS passive cooler (ODROID-XU4Q)" on www.hardkernel.com. You are right @TonyMac32, it takes a couple of seconds, not milliseconds, before the SoC temperature starts hitting the roof. Because there is still a small possibility that the heatsink is not attached properly (Hardkernel decided not to place the SoC under the middle of the heatsink), it is not clear whether this is a software or a hardware issue.
zador.blood.stained Posted June 4, 2018
1 minute ago, technik007_cz said: Because there is still a small possibility that the heatsink is not attached properly (Hardkernel decided not to place the SoC under the middle of the heatsink), it is not clear whether this is a software or a hardware issue.
It could be tested with tools like stress or cpuburn, though I'm not sure what kind of cpuburn should be used on A15 cores.
technik007_cz Posted June 4, 2018 Author
It is still too early to confirm what was behind the problem. But, you will not believe it, I think the network problems I found could have kept overloading these devices. The NanoPi M3 also stopped hanging or lagging when I was on the terminal after this. I need more than 24 hours to confirm this.
technik007_cz Posted June 7, 2018 Author
I have not experienced any shutdowns caused by high temperatures since Monday. The only thing I needed to do was clone the system again, because one board did not allow me to log in. However, I have been getting false temperature readings since Wednesday. I got 87°C on zones 0 and 1, and the reading stopped updating until I rebooted the board. (This is the second time I have seen this, but the first was on a Home Cloud One with the same SoC.) And I got a temperature below zero, yes, really: -27°C on zone 0. Now I am getting 91°C there, even though the other zones 1-4 show less than 60°C, which is very weird.
technik007_cz Posted June 7, 2018 Author
The previous two images show the temperature dropping to -27°C. The next picture shows the current situation. The maximum available clock is limited to 1300MHz for the LITTLE cores and 1900MHz for the big ones. Why did the clock drop to 600MHz for these? There is a script running in the background that throttles the frequency when the temperature is higher than 80°C. I am waiting until the job on this machine finishes, then I will reboot it and try other tests/tasks.
rooted Posted June 8, 2018
Thermal throttling is why the clock drops; you seem to have a heatsink contact issue. That said, 1900MHz on the big cores will always see throttling, even with proper cooling.
technik007_cz Posted June 18, 2018 Author
On 6/8/2018 at 4:22 PM, rooted said: heatsink contact issue
I think this is a hardware failure, because I put silver thermal compound between the SoC and the heatsink. I found a way to run this board: I simply wrote a script which shuts down cores 3 and 4 (one LITTLE and one big) during boot. The performance penalty is about 25%, but the board ran for 2 days without an overheating shutdown, and this is a big, very big step forward. What I learnt is that this board does not like Ubuntu Bionic and hates Ubuntu Bionic with a btrfs filesystem. (I used a known-working system from the second board, which runs next to it without issues and is powered by the same step-down converter, to exclude microSD card or power supply issues.) This is because Bionic for some reason demands more processing power, and that triggered the overheating protection. What does work is Ubuntu Xenial with cores 3 and 4 off. During the tests I had a USB meter plugged in showing the board's actual voltage; it has an acoustic low-voltage alarm, which has helped me many times to detect undervoltage conditions.
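The script itself was not posted. A minimal sketch of the idea, assuming the cores are taken offline through the kernel's CPU hotplug files (`/sys/devices/system/cpu/cpuN/online`) and that "cores 3 and 4" means `cpu3` and `cpu4` in the kernel's zero-based numbering, could look like this; the function name `set_cores_online` is made up for the example.

```python
from pathlib import Path


def set_cores_online(cores, online, base="/sys/devices/system/cpu"):
    """Write 1 or 0 to cpuN/online for each core; return the cores that were changed."""
    value = "1" if online else "0"
    changed = []
    for n in cores:
        control = Path(base) / f"cpu{n}" / "online"
        if not control.exists():
            continue  # some cores (often cpu0) expose no hotplug control
        control.write_text(value)
        changed.append(n)
    return changed
```

Run as root at boot, `set_cores_online([3, 4], online=False)` would leave six of the eight cores active, which is in line with the roughly 25% performance penalty reported above.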
rooted Posted June 21, 2018
3 is a little core. I think the issue may be the 1500MHz step on the little cores; try forcing them to 1400MHz.
technik007_cz Posted June 27, 2018 Author
On 6/21/2018 at 1:57 AM, rooted said: 3 is a little core. I think the issue may be the 1500MHz step on the little cores; try forcing them to 1400MHz.
That was the result of testing a different kernel version, which I replaced the same day, and the problems did not go away. I think I tested all kernels with no improvement.
I have run those 2 boards for a few days without stability issues after:
- setting the memory frequency to the lowest value found in boot.ini, which is 633MHz
- limiting the max CPU frequency to 1200MHz for the LITTLE cores and 1600MHz for the big cores (thanks @rooted)
AND these boards run with all 8 cores turned on.
Note: The first board reports temperatures correctly, but the second one still reports high temperatures of 85-100°C, more often close to 100°C, which pushes the frequency of the big cores down to the 600-1000MHz range. I am going to test higher memory frequencies.
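The post does not say how the frequency limits were applied. One way to sketch it, assuming the standard cpufreq sysfs file `scaling_max_freq` (which takes kHz) and assuming that `cpu0`-`cpu3` are the LITTLE cores and `cpu4`-`cpu7` the big ones on this SoC, is shown below; `cap_max_freq` and the two core ranges are names invented for the example.

```python
from pathlib import Path

# Assumption: cpu0-3 are the LITTLE cores and cpu4-7 the big cores.
LITTLE_CORES = range(0, 4)
BIG_CORES = range(4, 8)


def cap_max_freq(cores, mhz, base="/sys/devices/system/cpu"):
    """Write an upper frequency limit (converted to kHz) for each core; return cores updated."""
    khz = str(int(mhz * 1000))
    updated = []
    for n in cores:
        node = Path(base) / f"cpu{n}" / "cpufreq" / "scaling_max_freq"
        if not node.exists():
            continue  # core offline or cpufreq not available for it
        node.write_text(khz)
        updated.append(n)
    return updated
```

Called as root with `cap_max_freq(LITTLE_CORES, 1200)` and `cap_max_freq(BIG_CORES, 1600)`, this would apply the limits from the list above until the next reboot.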
technik007_cz Posted July 17, 2018 Author
Both boards are doing well. Each board unexpectedly turned off only once in a 3-week period. It reminds me of fighting with the Odroid U3, although I have not experienced the same with the Odroid HC1 board, and I have run six of them in the past. Perhaps a combination of a different heatsink and CPU throttling is behind it, I do not know. But XU4/HC1/U3 performance is still above everything I know in the low-power board world, and this is the reason why I will not give up.
technik007_cz Posted September 18, 2018 Author
This is (probably) my last comment, because all these boards (XU4) were damaged during a fire in my room. Fortunately the fire was extinguished quickly and I only lost some electronics. What I found out is that the XU4 and HC1 cannot run ffmpeg at the maximum CPU frequency of 2GHz; the maximum stable frequency is 1.6GHz, and it must be limited by the cpufrequtils settings in the /etc/default/cpufrequtils file, otherwise unexpected shutdowns occur. I am not going to replace these XU4 boards for this reason, keeping only 2x Orange Pi PC, Orange Pi PC plus, Olimex Micro and four Nanopi2. This is not the only reason: it is also a very current-hungry board (4A or 6A), causing voltage drops and therefore SATA errors due to bus resets. I have never had these issues with the Orange Pi PC family or the Olimex Micro, and I am happy with what the community has done to support them.
rooted Posted September 29, 2018
Sorry to hear of the fire, glad your entire home didn't burn. I know from experience.
technik007_cz Posted December 5, 2018 Author
Not the entire house, fortunately only part of the room. However, that room needed new paint, flooring and furniture; everything was black. The fire brigade believed it was caused by a faulty charging device/circuit charging a power bank.