technik007_cz

Odroid XU4: critical temperature reached(228 C)


Hi, I recently received two identical boards. One works great; the second suffers from unexpected shutdowns. RPi-Monitor has never recorded temperatures this high; the maximum it has recorded is no more than 90°C. The SoC has been running at 100% load with ffmpeg video conversion and other tasks.

I run these boards on Jessie or Bionic and have tested the mainline stable and beta kernels. The boards have passive heatsinks, but I put a quiet, slow 80x80 fan on top of the heatsink and replaced the thermal compound with a silver one; neither solved the problem.

 


cat /var/log/syslog | grep critical
May 27 23:32:30 localhost kernel: [   10.113037] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 27 23:32:30 localhost kernel: [   10.125343] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 00:05:44 localhost kernel: [ 1774.216599] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 00:05:44 localhost kernel: [ 1774.226029] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 21:36:00 localhost kernel: [   63.205422] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 21:36:00 localhost kernel: [   63.214815] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 21:36:00 localhost kernel: [   63.482182] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 21:36:00 localhost kernel: [   63.488590] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 21:36:00 localhost kernel: [   63.742200] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 21:36:00 localhost kernel: [   63.767258] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 21:36:01 localhost kernel: [   64.002200] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 21:36:01 localhost kernel: [   64.027199] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 22:00:00 localhost kernel: [   10.465442] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 22:00:00 localhost kernel: [   10.475037] thermal thermal_zone1: critical temperature reached(234 C),shutting down
May 28 22:00:00 localhost kernel: [   10.549210] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 22:00:00 localhost kernel: [   10.564189] thermal thermal_zone0: critical temperature reached(228 C),shutting down
May 28 22:00:00 localhost kernel: [   10.571505] thermal thermal_zone0: critical temperature reached(228 C),shutting down
Jun  2 00:21:47 localhost kernel: [   71.572150] thermal thermal_zone0: critical temperature reached(228 C),shutting down
Jun  2 00:21:47 localhost kernel: [   71.589459] thermal thermal_zone1: critical temperature reached(234 C),shutting down
Jun  2 01:46:04 localhost kernel: [   12.220382] thermal thermal_zone0: critical temperature reached(228 C),shutting down
Jun  2 01:46:04 localhost kernel: [   12.230128] thermal thermal_zone1: critical temperature reached(234 C),shutting down
Jun  2 19:36:54 localhost kernel: [ 1502.078615] thermal thermal_zone1: critical temperature reached(234 C),shutting down
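A quick way to see whether the reported values ever change is to count the messages per zone and per reported temperature. This is only a sketch: the function name `summarize_critical` is made up for illustration, and it assumes the kernel message format shown in the log above.

```shell
# Count critical-temperature messages per (zone, reported value) pair.
# $1 is the log file to scan, for example /var/log/syslog.
summarize_critical() {
    sed -n 's/.*\(thermal_zone[0-9]*\): critical temperature reached(\([0-9]*\) C).*/\1 \2/p' "$1" |
        sort | uniq -c
}

# Example (not run here):
# summarize_critical /var/log/syslog
```

If every line for a zone carries exactly the same value, as in the log above, that hints at a fixed bogus reading rather than a real, varying temperature.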

 

12 hours ago, rooted said:

The reported temperature can't be correct, it would have shut down much sooner.

 

Maybe it's incorrect, but it shuts the board down anyway.
Secondly, why does only one board have this issue, even though the system is cloned and both boards are connected to the same power rail (rated 5V 10A)?


I'm not saying there isn't a problem; I'm just saying the board will shut down somewhere around a report of 95 or 100°C.

You can echo a false temperature like this and the board will shut down.

 

echo 120000 | sudo tee /sys/devices/virtual/thermal/thermal_zone0/emul_temp

2 hours ago, rooted said:

a report of 95 or 100°C.

And what if the temperature spikes above this limit for a couple of ms during heavy load? What I could test is a better cooling block/heatsink with better heat absorption, but I have no idea where to get one.

I made a USB to 3-pin fan adapter with a built-in step-up converter boosting 5V to 12V, and have started testing again.

19 minutes ago, technik007_cz said:

And what if the temperature spikes above this limit for a couple of ms during heavy load?

+120°C in a couple of ms? Highly unlikely, and for some reason your syslog always shows 228°C for thermal_zone0 and 234°C for thermal_zone1, while the mainline kernel, for example, defines the critical temperature at 115°C.
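To compare what the running kernel actually uses as trip points with what the zones currently report, the standard sysfs thermal files can be listed. A minimal sketch, assuming the usual `/sys/class/thermal` layout; `THERMAL_ROOT` and the function names are illustrative additions so the script can also be pointed at a copy of the tree.

```shell
#!/bin/sh
# Print the current temperature and every trip point of each thermal zone.
# THERMAL_ROOT is an assumed override for testing; leave it unset on a board.
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"

# sysfs reports millidegrees Celsius; convert to whole degrees.
milli_to_c() {
    echo $(( $1 / 1000 ))
}

show_zones() {
    for zone in "$THERMAL_ROOT"/thermal_zone*; do
        # Skip zones whose temperature cannot be read.
        now=$(cat "$zone/temp" 2>/dev/null) || continue
        [ -n "$now" ] || continue
        echo "$(basename "$zone"): now $(milli_to_c "$now") C"
        for trip in "$zone"/trip_point_*_temp; do
            limit=$(cat "$trip" 2>/dev/null) || continue
            [ -n "$limit" ] || continue
            kind=$(cat "${trip%_temp}_type" 2>/dev/null)
            echo "  ${kind:-unknown}: $(milli_to_c "$limit") C"
        done
    done
}

show_zones
```

A `critical` trip point far below the 228°C seen in the log would support the idea that the sensor reading, not the threshold, is what is off.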

4 hours ago, zador.blood.stained said:

+120°C in a couple of ms? Highly unlikely

You could say that...  :D

 

To gain 120 degrees C in, let's say, 10 ms with Si's Cp = 711.75 J/(kg·K), assuming say 1 g of silicon (really gross estimation, but I'm having some fun here), it should take something to the tune of 8500 watts of power, unless I forgot to carry a decimal somewhere...  I have some pretty hefty equipment at work and I can't pull that off without a much broader "thermal event" (and probably getting fired).
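The estimate can be checked with P = m · c · ΔT / t, using only the figures from the post (1 g of silicon, 711.75 J/(kg·K), 120 K, 10 ms). The helper name `required_watts` is just for illustration.

```shell
# P = m * c * dT / t
# $1 mass in kg, $2 specific heat in J/(kg*K), $3 temperature rise in K, $4 time in s
required_watts() {
    awk -v m="$1" -v c="$2" -v dt="$3" -v t="$4" \
        'BEGIN { printf "%.0f\n", m * c * dt / t }'
}

required_watts 0.001 711.75 120 0.010   # prints 8541
```

That is 85.41 J delivered in 10 ms, so roughly 8.5 kW, which matches the figure above.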

13 hours ago, zador.blood.stained said:

+120°C in a couple of ms? Highly unlikely,

I checked the graphs "active cooler (ODROID-XU4) VS passive cooler (ODROID-XU4Q)" on www.hardkernel.com.

You are right @TonyMac32, it takes a couple of seconds before the SoC temperature starts hitting the roof, not milliseconds.

Because there is still a small possibility that the heatsink is not attached properly (Hardkernel decided not to put the SoC under the middle of the heatsink), it is not clear whether this is a software or a hardware issue.

1 minute ago, technik007_cz said:

Because there is still a small possibility that the heatsink is not attached properly (Hardkernel decided not to put the SoC under the middle of the heatsink), it is not clear whether this is a software or a hardware issue.

It could be tested by tools like stress or cpuburn, though I'm not sure what kind of cpuburn should be used on A15 cores.
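One way to run such a test is to start the load in the background and record the highest reading of a zone while it runs. The following is a sketch under assumptions: `max_temp` is a made-up helper, the zone path is the usual sysfs one, and the `stress` invocation in the comment is only an example load.

```shell
# Sample a thermal zone file N times and print the highest reading
# in whole degrees Celsius.
# $1 = path to a zone's temp file, $2 = number of samples, $3 = seconds between samples
max_temp() {
    zone_file="$1"
    samples="$2"
    interval="$3"
    max=""
    i=0
    while [ "$i" -lt "$samples" ]; do
        # Unreadable samples are skipped instead of aborting the run.
        if t=$(cat "$zone_file" 2>/dev/null) && [ -n "$t" ]; then
            if [ -z "$max" ] || [ "$t" -gt "$max" ]; then
                max="$t"
            fi
        fi
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
    if [ -n "$max" ]; then
        echo $((max / 1000))
    fi
}

# Example (not run here): load all 8 cores for two minutes and watch zone 0.
# stress --cpu 8 --timeout 120 &
# max_temp /sys/class/thermal/thermal_zone0/temp 120 1
```

Running the same command on the good and the bad board under identical load should show whether the heatsink contact really differs.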


It is still too early to confirm what was behind the problem. But, believe it or not, I think the network problems I found could have kept overloading these devices. The NanoPi M3 also stopped hanging or lagging in the terminal after this.

I need more than 24 hours to confirm this.


I have not experienced any shutdowns caused by high temperatures since Monday.

The only thing I needed to do was clone the system again, because one board did not allow me to log in.

However, I have been getting false temperature readings since Wednesday. I got 87°C on zones 0 and 1, and the reading stopped updating until I rebooted the board. (This is the second time I have seen this; the first time was on a Home Cloud One with the same SoC.)

And I got a temperature below zero, yes, really: -27°C on zone 0. Right now zone 0 shows 91°C even though the other zones 1-4 are below 60°C, which is very weird.


You can see the temperature drop to -27°C on the previous 2 images.

The next picture shows the current situation.

[Screenshot_2018-06-07_16-31-17.png]

The maximum available clock is limited to 1300 MHz for the LITTLE cores and 1900 MHz for the big ones. Why did the clock drop to 600 MHz for these? There is a script running in the background that throttles the frequency when the temperature is higher than 80°C.
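The script itself was not posted, so the following is only a hypothetical sketch of that kind of throttling: above 80°C it caps the big cluster at 600 MHz, otherwise it allows 1900 MHz. The variable and function names are invented, and the sysfs paths in the comment assume that cpu4 belongs to the big cluster.

```shell
# Hypothetical thresholds; the original script was not shown in the thread.
LIMIT_MILLIC=80000      # 80 C, in millidegrees as reported by sysfs
THROTTLED_KHZ=600000    # cap applied when too hot
FULL_KHZ=1900000        # cap applied otherwise

# Print the frequency cap (kHz) for a temperature given in millidegrees.
choose_max_khz() {
    if [ "$1" -gt "$LIMIT_MILLIC" ]; then
        echo "$THROTTLED_KHZ"
    else
        echo "$FULL_KHZ"
    fi
}

# One pass: read the temperature file ($1) and write the cap to the
# scaling_max_freq file ($2).
apply_once() {
    temp=$(cat "$1") || return 1
    choose_max_khz "$temp" > "$2"
}

# Example loop (as root, not run here):
# while sleep 5; do
#     apply_once /sys/class/thermal/thermal_zone0/temp \
#                /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq
# done
```

With a script like this, a stuck or bogus reading above the limit is enough to keep the big cores pinned at the low cap.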

 

I am waiting for the job on this machine to finish, and then I will reboot it and try other tests/tasks.


Thermal throttling is why the clock drops; you seem to have a heatsink contact issue. That said, 1900 MHz on the big cores will always see throttling, even with proper cooling.

On 6/8/2018 at 4:22 PM, rooted said:

heatsink contact issue

I think this is a hardware failure, because I put silver thermal compound between the SoC and the heatsink.

 

I found a way to run this board. I simply wrote a script which shuts down cores 3 and 4 (one LITTLE and one big) during boot. The performance penalty is about 25%, but the board ran for 2 days without an overheating shutdown, and this is a big, very big step forward.
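A minimal sketch of such a boot script, using the kernel's CPU hotplug files. The helper name `set_core` and the `CPU_ROOT` override are illustrative, and the core numbers are taken from the post (0-based sysfs numbering assumed).

```shell
# CPU_ROOT is an assumed override for testing; leave it unset on a board.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# Bring a core online (1) or take it offline (0).
# $1 = core number, $2 = 1 or 0
set_core() {
    f="$CPU_ROOT/cpu$1/online"
    if [ ! -w "$f" ]; then
        echo "cannot write $f" >&2
        return 1
    fi
    echo "$2" > "$f"
}

# At boot, as root (not run here):
# set_core 3 0
# set_core 4 0
```

Writing 1 to the same files brings the cores back without a reboot.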

 

What I learnt: this board does not like Ubuntu Bionic, and hates Ubuntu Bionic with a btrfs filesystem. (I used a working system from the second board, which has no issues, runs right next to it and is powered by the same step-down converter, to exclude microSD card or power supply issues.) This is because Bionic, for some reason, asks for more processing power, which triggered the overheating protection.

What does work is Ubuntu Xenial with cores 3 and 4 off.

During the tests I had a USB meter plugged in that shows the board's actual voltage and has an acoustic low-voltage alarm, which has helped me many times to detect undervoltage conditions.

On 6/21/2018 at 1:57 AM, rooted said:

3 is a little core.

I think the issue may be the 1500mhz step on the little cores, try forcing them to 1400mhz.

This was the result of testing a different kernel version, which was replaced the same day, and the problems did not go away. I think I tested all kernels with no improvement.

 

I have run those 2 boards for a few days without stability issues after:

  • setting the memory frequency to the lowest value found in boot.ini, which is 633 MHz
  • limiting the max CPU frequency to 1200 MHz for the LITTLE cores and 1600 MHz for the big cores (thanks @rooted)
  • AND these boards run with all 8 cores turned on
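The frequency caps from the list above can be applied at runtime through the cpufreq sysfs files. This is a sketch, assuming that cpu0 belongs to the LITTLE cluster, cpu4 to the big one, and that each cluster shares a single cpufreq policy; `cap_mhz` and `CPU_ROOT` are made-up names.

```shell
# CPU_ROOT is an assumed override for testing; leave it unset on a board.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# Cap the maximum frequency of the policy a core belongs to.
# $1 = core number, $2 = cap in MHz (sysfs expects kHz)
cap_mhz() {
    f="$CPU_ROOT/cpu$1/cpufreq/scaling_max_freq"
    if [ ! -w "$f" ]; then
        echo "cannot write $f" >&2
        return 1
    fi
    echo $(( $2 * 1000 )) > "$f"
}

# As root (not run here):
# cap_mhz 0 1200   # LITTLE cluster
# cap_mhz 4 1600   # big cluster
```

Values written this way are lost on reboot, so they need to be reapplied from a boot script or set through a persistent mechanism.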

Note: The first board reports temperatures correctly, but the second one still reports high temperatures of 85-100°C, most often 100°C, which pushes the frequency of the big cores down to the 600-1000 MHz range.

 

I am going to test higher memory frequencies.

 


Both boards are doing well. Each board unexpectedly turned off only once in a 3-week period.

It reminds me of fighting with the Odroid U3, even though I have not experienced the same with the Odroid HC1 board, and I have run six of them in the past. Perhaps a combination of a different heatsink and CPU throttling is behind it; I do not know.

But XU4/HC1/U3 performance is still above everything I know in the world of low-power boards, and this is the reason why I will not give up.


This is (probably) my last comment, because all these boards (XU4) were damaged in a fire in my room. Fortunately the fire was extinguished quickly and I only lost some electronics.

What I found out: the XU4 and HC1 cannot run ffmpeg at the maximum CPU frequency of 2 GHz. The maximum stable frequency is 1.6 GHz, and it must be limited by the cpufrequtils settings in the /etc/default/cpufrequtils file, otherwise unexpected shutdowns occur.
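For illustration, a /etc/default/cpufrequtils along these lines would express the 1.6 GHz cap. Only the 1.6 GHz maximum comes from the post; the governor and the minimum speed are assumptions, and the values are in kHz.

```shell
# /etc/default/cpufrequtils (sketch; only MAX_SPEED reflects the post)
ENABLE=true
GOVERNOR=ondemand     # assumed governor
MIN_SPEED=600000      # assumed lower bound, 600 MHz
MAX_SPEED=1600000     # 1.6 GHz cap reported as stable
```

The cpufrequtils service has to be restarted (or the board rebooted) for the file to take effect.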

For this reason I am not going to replace these XU4 boards; I am keeping only 2x Orange Pi PC, an Orange Pi PC Plus, an Olimex Micro and four NanoPi2.

This is not the only reason: the board also demands very high current (4A or 6A), causing voltage drops and therefore SATA errors due to bus resets.

I have never had these issues with the Orange Pi PC board family or the Olimex Micro, and I am happy with what the community has done to support them.

