Shen En Chen Posted March 29, 2020

I have a Tinker Board S and have been using both Tinker OS and Armbian for a while. I recently upgraded to 5.4.28-rockchip. I have noticed that video freezes randomly and frequently while I watch cnn.com or other media. At first I thought Firefox or Chromium was crashing, but I later realized it was a network issue. Other devices could still reach the internet. ifconfig still showed the network devices as properly connected, yet I could not ping any host. Disconnecting and reconnecting the network does not fix the issue; the only fix is a reboot.

How can I troubleshoot this issue? I tried to run armbianmonitor, but at that time there was no network at all.

http://ix.io/2fVR
This is the armbianmonitor output after a reboot, with no problem present, as a baseline reference.
Myy Posted March 30, 2020

In armbianmonitor, both eth0 and wlan0 appear to be down, so how do you connect to the Internet? Ethernet cable or wireless?

    5: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
        link/ether 0c:9d:92:0c:b0:ce brd ff:ff:ff:ff:ff:ff
    6: wlan0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
        link/ether d0:c5:d3:5a:af:db brd ff:ff:ff:ff:ff:ff

When the problem occurs, if you run the following (replace eth0 with wlan0 if you're connected through WiFi):

    ip link set eth0 down
    ip link set eth0 up

can you ping 1.1.1.1? Also, what's the temperature of the CPU when the problem occurs?
Shen En Chen (Author) Posted March 30, 2020

I use an Ethernet cable; wireless is not set up. When I uploaded the armbianmonitor output, the network was up over Ethernet, so I don't know why eth0 and wlan0 show as down there. When the network was down, ping 1.1.1.1 reported unreachable. At this time, I do not know how to check the CPU temperature in Armbian. ifdown and ifup did not work. The next time the network goes down, I will try:

    ip link set eth0 down
    ip link set eth0 up

Thanks!
Shen En Chen (Author) Posted March 31, 2020

Just had another crash. I was unable to upload with armbianmonitor -u, so I have attached the output of armbianmonitor -U here. I found the /sys/class/thermal zones for temperature. While I monitored them, the temperature was highest at 69.x C in zone0, and 65.x C in zone0.

Running ip link set eth0 down and then ip link set eth0 up brought the network back. However, the system became very slow and nearly unusable. Then, after a few minutes, the system crashed completely, and I had to power it off and on to get back to normal. After the reboot, the temperature is 42 C. I am not sure whether the high temperature is causing the issue. Is there any other way to troubleshoot? Thanks

armbianmonitor.txt
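For anyone else looking for the temperature, here is a minimal sketch of reading the /sys/class/thermal zones mentioned above. It assumes the usual sysfs layout, where each thermal_zoneN directory has a temp file in millidegrees Celsius and a type file; the function names millideg_to_c and print_thermal_zones are made up for this example, and the number of zones depends on the board.

```shell
#!/bin/sh
# Convert a millidegree-Celsius reading (the unit used by sysfs) to degrees, one decimal.
millideg_to_c() {
    printf '%d.%d\n' "$(( $1 / 1000 ))" "$(( ($1 % 1000) / 100 ))"
}

# Print every thermal zone the kernel exposes; zones that cannot be read are skipped.
print_thermal_zones() {
    for zone in /sys/class/thermal/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        raw=$(cat "$zone/temp" 2>/dev/null) || continue
        [ -n "$raw" ] || continue
        printf '%s (%s): %s C\n' "${zone##*/}" \
            "$(cat "$zone/type" 2>/dev/null)" "$(millideg_to_c "$raw")"
    done
}

print_thermal_zones
```

As a quick check, millideg_to_c 69123 prints 69.1. Running print_thermal_zones in a loop with sleep gives a simple temperature log.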
Myy Posted March 31, 2020

Hmm, it seems that panfrost might be the culprit. These "Purging ... bytes" come from the panfrost driver...

    $ grep "Purging " * -r
    drivers/s390/char/vmlogrdr.c: * Purging has to be done as separate step, because recording
    drivers/gpu/drm/msm/msm_gem_shrinker.c: pr_info_ratelimited("Purging %lu bytes\n", freed << PAGE_SHIFT);
    drivers/gpu/drm/msm/msm_gem_shrinker.c: pr_info_ratelimited("Purging %u vmaps\n", unmapped);
    drivers/gpu/drm/i915/gem/i915_gem_shrinker.c: pr_info("Purging GPU memory, %lu pages freed, "
    drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c: pr_info_ratelimited("Purging %lu bytes\n", freed << PAGE_SHIFT);
    drivers/scsi/ibmvscsi/ibmvfc.c: ibmvfc_dbg(vhost, "Purging all requests\n");
    drivers/scsi/qla2xxx/qla_mbx.c: "Chip Reset in progress. Purging Mbox cmd=0x%x.\n",
    fs/ocfs2/dlm/dlmthread.c: mlog(0, "%s: Purging res %.*s, master %d\n", dlm->name,
    fs/coda/cache.c:/* Purging dentries and children */
    tools/perf/builtin-buildid-cache.c: pr_debug("Purging %s: %s\n", pathname, err ? "FAIL" : "Ok");

Maybe the GPU, or the GPU driver, starts acting up and crashes the whole machine after a while... However, I have no clear idea how to test that. The best way would be to load a WebGL benchmark and let it run with the network unplugged, to be sure that it's the real culprit.

Try running the glmark2-es2 benchmark for an hour, maybe? Or some other OpenGL ES benchmark, and see if the slowdown and crash happen?

Two of the first results for "WebGL benchmark" on Google:
http://luic.github.io/WebGL-Performance-Benchmark/benchmark/cubes.html
https://crypt-webgl.unigine.com/game.html

Maybe the panfrost spam is just a red herring, I don't know.
Myy Posted March 31, 2020

Looking at the logs again, two things are certain:
- The RAM got eaten by some process, and so did the swap.
- The CPU is overheating.

Try running a system monitor while watching a video, and keep an eye on the RAM consumed. If it goes up constantly until everything slows to a crawl and then crashes, we know it's due to a memory leak :/ .

Given how Panfrost is invoking memory "shrinking" like crazy, I don't know if it's the real culprit or if it is just panicking.
Shen En Chen (Author) Posted April 1, 2020

On 3/31/2020 at 1:27 AM, Myy said: Hmm, it seems that panfrost might be the culprit. [...] Try running glmark2-es2 benchmark for an hour maybe ? or some other OpenGL ES benchmark and see if the slow down and crash happen ? [...]

Running glmark2-es2 and http://luic.github.io/WebGL-Performance-Benchmark/benchmark/cubes.html was not enough to push the temperature beyond 55 C.
I have noticed that if I turn on the VNC server in the background and watch streaming video, such as cnn.com or high-quality YouTube, the chances of a crash are high. Is there any way to see what is eating the memory? Thanks!
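One possible way to answer the "what is eating the memory" question is to list processes by resident memory. The sketch below assumes the procps version of ps (the -o pid=,rss=,comm= form prints those columns without a header, with RSS in KiB); top_rss is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Read "PID RSS_KiB COMMAND" lines on stdin and print the N largest by
# resident memory, converted to MiB. N defaults to 10.
top_rss() {
    sort -k2,2nr | head -n "${1:-10}" |
        awk '{ pid = $1; rss = $2; $1 = ""; $2 = ""; sub(/^ +/, "")
               printf "%s %.1f MiB %s\n", pid, rss / 1024, $0 }'
}

# Header-less ps output: PID, resident set size in KiB, command name.
ps -eo pid=,rss=,comm= | top_rss 10
```

Running this every few seconds while a video plays should show whether one process (the browser, the VNC server, or something else) keeps growing. Note that RSS counts shared pages for each process, so the numbers are an approximation.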
Shen En Chen (Author) Posted April 2, 2020

Today I monitored the temperature and RAM usage every 10 seconds while watching a YouTube video at 1080p. After a while, the network dropped like before. During this time, the highest temperature was 68 C. Both armbianmonitor and ramtempck-20200401 are attached.

    while :; do
        file=ramtempck-20200401
        touch "$file"
        date >> "$file"
        temp >> "$file"
        free -h >> "$file"
        sleep 10
    done

Can we tell whether the temperature or a RAM leak is causing the issue?

armbianmonitor.20200401.txt
ramtempck-20200401
Myy Posted April 2, 2020

Hmm, there's no RAM leak visible in the logs... However, the panfrost driver is still spamming like crazy.

Once the network drop happens, if you bring the interface down and up again, does the system become unstable again?

Anyway, there seem to be three issues:
* Panfrost spamming
* Network drops
* Potential memory leak leading to system crashes

I'll try to set up a network test and see if that also happens with 5.6 kernels.
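To put a number on the panfrost spam, a rough sketch is to count the kernel-log lines that match the "Purging %lu bytes" format from the grep output earlier in the thread. count_purging is a hypothetical helper; the pattern would also match the same message from the msm driver, which is assumed not to be in use on this board.

```shell
#!/bin/sh
# Count lines on stdin that match the shrinker message "Purging <n> bytes".
# grep exits non-zero when nothing matches, so "|| true" keeps the status clean.
count_purging() {
    grep -Ec 'Purging [0-9]+ bytes' || true
}

# dmesg may require root on some systems; errors are silenced here.
dmesg 2>/dev/null | count_purging
```

Comparing the count before and after a video session, or right after a network drop, may show whether the messages line up with the failures. The message is rate limited in the driver, so the count is a lower bound.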
Shen En Chen (Author) Posted April 4, 2020

@Myy Yes, after bringing the interface down and up, I have network again, but the system is still unstable. The network still drops randomly.