sfx2000

Placemaker - H5 crashing under SMP load


Just wanted to note this - target NanoPi NEO2 - task at hand is Byte-UnixBench....

 

https://github.com/sfx2000/byte-unixbench

 

more to follow...

 


[ 2708.074105] Unable to handle kernel paging request at virtual address f7ff8000358ddbf0
[ 2708.082298] Mem abort info:
[ 2708.085153]   ESR = 0x96000004
[ 2708.088834]   Exception class = DABT (current EL), IL = 32 bits
[ 2708.094826]   SET = 0, FnV = 0
[ 2708.097899]   EA = 0, S1PTW = 0
[ 2708.101086] Data abort info:
[ 2708.104010]   ISV = 0, ISS = 0x00000004
[ 2708.107886]   CM = 0, WnR = 0
[ 2708.110879] [f7ff8000358ddbf0] address between user and kernel address ranges
[ 2708.118043] Internal error: Oops: 96000004 [#1] SMP
[ 2708.122918] Modules linked in: zram sun8i_codec_analog snd_soc_simple_card sun8i_adda_pr_regmap snd_soc_simple_card_utils sun4i_i2s snd_soc_core snd_pcm_dmaengine snd_pcm snd_timer snd soundcore lima gpu_sched sun4i_gpadc_iio industrialio cpufreq_dt usb_f_acm u_serial g_serial libcomposite realtek dwmac_sun8i i2c_mv64xxx mdio_mux
[ 2708.152127] CPU: 3 PID: 28677 Comm: sort Not tainted 5.3.9-sunxi64 #19.11.3
[ 2708.159080] Hardware name: FriendlyARM NanoPi NEO 2 (DT)
[ 2708.164389] pstate: 60000005 (nZCv daif -PAN -UAO)
[ 2708.169187] pc : unlink_file_vma+0x1c/0x58
[ 2708.173283] lr : free_pgtables+0xe4/0x138
[ 2708.177288] sp : ffff000017e4bc60
[ 2708.180599] x29: ffff000017e4bc60 x28: ffff800035c50d00
[ 2708.185908] x27: 0000000000000000 x26: 0000000000000000
[ 2708.191216] x25: 0000000056000000 x24: 0000000000000000
[ 2708.196524] x23: 0000000000000000 x22: ffff000017e4bd08
[ 2708.201832] x21: 0000ffff98ac4000 x20: f7ff8000358ddb00
[ 2708.207140] x19: ffff8000354503e8 x18: 0000000000000000
[ 2708.212448] x17: 0000000000000000 x16: 0000000000000000
[ 2708.217756] x15: 0000000000000000 x14: 0000000000000000
[ 2708.223064] x13: 0000000000000000 x12: ffff800037ddb840
[ 2708.228372] x11: 0000000000000000 x10: ffff000010e761ce
[ 2708.233679] x9 : 0000000000000000 x8 : ffff800035ae1440
[ 2708.238988] x7 : 0000ffffb615c000 x6 : 00000000002f6749
[ 2708.244295] x5 : 0000000000000000 x4 : 0000000000000000
[ 2708.249603] x3 : 0000000000000001 x2 : ffffffffffffffff
[ 2708.254911] x1 : 0000000000000002 x0 : ffff8000354503e8
[ 2708.260219] Call trace:
[ 2708.262666]  unlink_file_vma+0x1c/0x58
[ 2708.266413]  free_pgtables+0xe4/0x138
[ 2708.270074]  exit_mmap+0xd4/0x160
[ 2708.273390]  mmput+0x60/0x150
[ 2708.276356]  do_exit+0x330/0xa88
[ 2708.279581]  do_group_exit+0x34/0xd0
[ 2708.283153]  __arm64_sys_exit_group+0x14/0x18
[ 2708.287511]  el0_svc_common.constprop.0+0x88/0x150
[ 2708.292300]  el0_svc_handler+0x20/0x80
[ 2708.296048]  el0_svc+0x8/0xc
[ 2708.298932] Code: f9405014 b40001d4 a9025bf5 aa0003f3 (f9407a96)
[ 2708.305022] ---[ end trace 57dc40848a0a1b21 ]---
[ 2708.309668] Fixing recursive fault but reboot is needed!

 


Hi @sfx2000, if this is reproducible, please try removing "cpu-clock-1.3GHz-1.3v" from your armbianEnv.txt "overlays=" line, and see if the problem still occurs.  If it does not, then pushing to 1.3GHz at 1.3v may be too much for your board...
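For anyone following along, dropping a single overlay from the "overlays=" line can be scripted. This is only a sketch: the helper name remove_overlay is made up for illustration (it is not an Armbian tool), and it assumes the space-separated "overlays=" format used in armbianEnv.txt. Try it on a copy first, and reboot afterwards for the change to take effect.

```sh
#!/bin/sh
# Hypothetical helper (not part of Armbian): drop one overlay name from the
# space-separated "overlays=" line of an armbianEnv.txt-style file.
remove_overlay() {
    file="$1"; name="$2"
    # Rebuild the overlays= line without the given name; all other lines
    # pass through unchanged.
    awk -v drop="$name" '
        /^overlays=/ {
            sub(/^overlays=/, "")
            n = split($0, items, " ")
            out = "overlays="
            sep = ""
            for (i = 1; i <= n; i++)
                if (items[i] != drop) { out = out sep items[i]; sep = " " }
            print out
            next
        }
        { print }
    ' "$file" > "$file.new" && mv "$file.new" "$file"
}

# Example (as root, after making a backup):
#   cp /boot/armbianEnv.txt /boot/armbianEnv.txt.bak
#   remove_overlay /boot/armbianEnv.txt cpu-clock-1.3GHz-1.3v
```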

 

I've attached a new test overlay to this post that enables 1.2GHz at 1.3v; let me know if it fixes the problem (assuming that the 1.3GHz clock is the problem :wacko:).  If this works then I can add it to the mainline.

sun50i-h5-cpu-clock-1.2GHz-1.3v.dtbo

Edited by 5kft
(attached 1.2GHz overlay)


been on biz travel over the last few days - jobby job stuff

 

Rolled back the overlay that was in place, and Neo2 is again stable.

 

I'll check out the update for the overlay...

 

I'm in recovery mode - 3 cities in 4 days across the USA - more time in planes than in meetings,

13 hours ago, sfx2000 said:

been on biz travel over the last few days - jobby job stuff

 

I'm in recovery mode - 3 cities in 4 days across the USA - more time in planes than in meetings,

 

...ugh...I can relate...  But it's so rewarding though, right??  ;)

 

13 hours ago, sfx2000 said:

Rolled back the overlay that was in place, and Neo2 is again stable.

 

I'll check out the update for the overlay...

 

Ah...then the overlay should help - keep me posted.  Thanks!

13 hours ago, 5kft said:

Ah...then the overlay should help - keep me posted.  Thanks!

 

Installed the overlay...

 


sfx@nano2:~$ cat /boot/armbianEnv.txt
verbosity=1
console=both
overlay_prefix=sun50i-h5
overlays=gpio-regulator-1.3v i2c0 usbhost1 usbhost2 cpu-clock-1.2GHz-1.3v
rootdev=UUID=ee092379-cddf-4aff-beb3-7cc62d0fe9bd
rootfstype=ext4
extraargs=net.ifnames=0
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u

 

 

sfx@nano2:~$ ls -l /sys/class/leds
total 0
lrwxrwxrwx 1 root root 0 Jan  1  1970 nanopi:green:status -> ../../devices/platform/leds/leds/nanopi:green:status
lrwxrwxrwx 1 root root 0 Jan  1  1970 nanopi:red:pwr -> ../../devices/platform/leds/leds/nanopi:red:pwr

 

 

sfx@nano2:~$ cat /etc/default/cpufrequtils
# WARNING: this file will be replaced on board support package (linux-root-...) upgrade
ENABLE=true
MIN_SPEED=480000
#MIN_SPEED=120000
#MAX_SPEED=1000000
MAX_SPEED=1200000
GOVERNOR=schedutil

 

sfx@nano2:~$ cpufreq-info -c1
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 1:
  driver: cpufreq-dt
  CPUs which run at the same hardware frequency: 0 1 2 3
  CPUs which need to have their frequency coordinated by software: 0 1 2 3
  maximum transition latency: 5.44 ms.
  hardware limits: 480 MHz - 1.20 GHz
  available frequency steps: 480 MHz, 648 MHz, 816 MHz, 960 MHz, 1.01 GHz, 1.10 GHz, 1.20 GHz
  available cpufreq governors: conservative, userspace, powersave, ondemand, performance, schedutil
  current policy: frequency should be within 480 MHz and 1.20 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
  current CPU frequency is 816 MHz (asserted by call to hardware).
  cpufreq stats: 480 MHz:46.69%, 648 MHz:11.69%, 816 MHz:25.12%, 960 MHz:4.96%, 1.01 GHz:0.80%, 1.10 GHz:1.19%, 1.20 GHz:9.55%  (3952)

 

Outcome is not good when stressing the board with the new overlay using

openssl speed -multi 4

Board crashes with a hard hangup - have to pull power.
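Since a hard hang leaves nothing on the console, it can help to log temperature and clock while the test runs. A rough sketch follows; the function names are invented here, and it assumes thermal_zone0 is the CPU zone and that cpu0's scaling_cur_freq is readable, which may differ per board and kernel.

```sh
#!/bin/sh
# Print one "<temp>C <freq>MHz" sample from the given sysfs files.
# The kernel reports temperature in millidegrees C and frequency in kHz.
sample() {
    temp_file="$1"; freq_file="$2"
    t=$(cat "$temp_file"); f=$(cat "$freq_file")
    echo "$((t / 1000))C $((f / 1000))MHz"
}

# Run the stress test in the background and append a sample every 5 seconds,
# calling sync so the last lines have a chance to survive a hard hang.
stress_and_log() {
    log="$1"
    openssl speed -multi 4 >/dev/null 2>&1 &
    pid=$!
    while kill -0 "$pid" 2>/dev/null; do
        # Assumed paths: thermal_zone0 = CPU zone, cpu0 = shared cpufreq policy.
        sample /sys/class/thermal/thermal_zone0/temp \
               /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq >> "$log"
        sync
        sleep 5
    done
}

# Example:  stress_and_log /root/stress.log
```

After a crash and power cycle, the tail of the log shows roughly what temperature and frequency the board was at before it hung.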

 

Should note this is a v1.1 board, which I didn't mention earlier. This is the kit with the OLED hat and the cute little aluminum case...

20 hours ago, sfx2000 said:

Outcome is not good when stressing the board with the new overlay

 

OK, thanks for the info!  I ran the openssl test on a number of boards (two NEO2 v1.1s - one 512MB, one 1GB; also a modified Orange Pi Zero Plus2 H5).  I was able to get it to consistently crash at 1.3GHz on one of the NEO2s, and reducing it to 1.2GHz with the overlay worked (I let it run for 5-10 min on each test).  The other NEO2 worked OK at 1.3GHz; I didn't get enough run time on the Orange Pi Zero Plus2 because it kept overheating (critical shutdown at 100°C).  I ran the tests a few times and the behavior was consistent.  I need to dig up a bigger heatsink/temporary fan for the Orange Pi to really test it...I have a number of other boards I could test this on, but I have to install heatsinks on them...

 

Given that the 1.2GHz overlay solved the problem for my failing NEO2 (reproducible), I think that I'll go ahead and add it to the mainline.  Using the overlay eliminates the need to edit /etc/default/cpufrequtils, just add the overlay to /boot/armbianEnv.txt.

 

Unfortunately overclocking is pretty much "luck of the draw" in terms of the CPU...if your board is crashing at 1.2GHz/1.3v there isn't a lot we can do about it at that clock rate :(  As a test you could try using the 1.3GHz overlay and reducing the MAX_SPEED in /etc/default/cpufrequtils to 1152000 or 1104000 (the file takes values in kHz) and see if those are stable?
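A minimal sketch of that MAX_SPEED change, assuming the file format shown earlier in the thread (values in kHz, one uncommented MAX_SPEED= line). The helper name set_max_speed is invented for illustration, and the service restart shown in the comment is an assumption about the installed image; a reboot works as well.

```sh
#!/bin/sh
# Hypothetical helper: set the active (uncommented) MAX_SPEED line to a new
# value in kHz. Commented-out "#MAX_SPEED=" lines are left untouched.
set_max_speed() {
    file="$1"; khz="$2"
    sed "s/^MAX_SPEED=.*/MAX_SPEED=$khz/" "$file" > "$file.new" \
        && mv "$file.new" "$file"
}

# Example (as root):
#   set_max_speed /etc/default/cpufrequtils 1104000
#   systemctl restart cpufrequtils   # assumed service name; or just reboot
#   cpufreq-info -c1                 # check the "current policy" line
```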

 

 


In case it is helpful to users, I've checked in the new 1.2GHz max overclock overlay:  https://github.com/armbian/build/commit/74c6adec7411ef4d6dfa2115d21378c84aecb488

 

Use it just like the 1.3GHz overclock overlay; no need to edit the default "/etc/default/cpufrequtils".  E.g., relevant excerpt from "/boot/armbianEnv.txt" for NEO2 v1.1:

    ...
overlay_prefix=sun50i-h5
overlays=usbhost1 usbhost2 gpio-regulator-1.3v cpu-clock-1.2GHz-1.3v
    ...

Limiting the overclock to 1.2GHz on one of my NEO2s makes it completely stable as compared to 1.3GHz (e.g., using the "openssl speed -multi 4" SMP test).  If anyone is interested, it'd be easy enough for me to add a 1.1GHz/1.3v overlay as well, just let me know :)


self deleted for brevity/clarity...

 

 

 

Edited by sfx2000
thread management


ok - so the stock FA image also crashes on the same test - it behaves differently than the Armbian image, as it kills off the threads when it tries to do a privileged memory access...

 

since we're working with armv8-a, we have kernel space (EL1) and user space (EL0) - hence the data abort, as the memory is marked as EL1, and an EL0 task cannot access that. I think that overclocking the CPU exposes a bug that is latent, even without the overlay, and this goes not to device tree, but to uboot and DDR ram init vectors there.
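For reference, the ESR value from the Armbian oops in the first post can be split into its fields with plain shell arithmetic. This is just a sketch using the architectural field layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0); the function name is made up. EC 0x25 is a data abort taken without a change in exception level, matching the "DABT (current EL)" line in the log, and DFSC 0x04 is a level 0 translation fault. The last line compares the top 16 bits of x20 (f7ff...) with the ffff prefix seen on the other kernel pointers in the dump; treating x20 as a corrupted ffff... pointer is only a guess from the register dump.

```sh
#!/bin/sh
# Split an ARMv8 ESR_ELx value into its top-level fields:
#   EC  = bits [31:26] (exception class)
#   IL  = bit  [25]    (1 = 32-bit instruction)
#   ISS = bits [24:0]  (for data aborts: WnR is bit 6, DFSC is bits [5:0])
decode_esr() {
    esr=$(($1))
    printf 'EC=0x%x IL=%d ISS=0x%x WnR=%d DFSC=0x%x\n' \
        $(( (esr >> 26) & 0x3f )) $(( (esr >> 25) & 1 )) \
        $(( esr & 0x1ffffff )) $(( (esr >> 6) & 1 )) $(( esr & 0x3f ))
}

decode_esr 0x96000004

# Top 16 bits of x20 versus the ffff prefix of the other kernel pointers.
# A result with a single bit set means the two prefixes differ by one bit.
printf 'differing bits in top 16: 0x%04x\n' $(( 0xf7ff ^ 0xffff ))
```

A single differing bit in a pointer would be consistent with the memory-instability theory, though it does not prove it.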

 

The stress test (openssl) can show the bug, but it isn't the real problem; the overlay just makes it happen faster. Getting the board temperature to around 60°C, which on a small board like this affects not only the SoC but also the DDR, can accelerate the issue, as some DDR can get a bit unstable at that temperature.

 

I don't have much time right now to debug further, as I'm in the middle of sfx's North America Tour - last week Austin, TX, next week Miami, FL, Atlanta, GA, Denver, CO, and a short trip to Salt Lake City, UT - about a week of downtime in the SAN, then back to Austin for a week.

 

@5kft  -- Gnarly problem to sort, eh? But spending time might help other AW H5 targets...

 

@Igor -- something to watch maybe


@sfx2000 - very nice sleuthing!  I've spent some time with this, and was able to repro the crash consistently by dialing the DDR clock up a bit.  I dialed the clock down and have been testing, and it is completely stable, even when overclocked - the openssl test can run to completion now, multiple times.  (Note:  I tested all of this on my "problematic" NEO2, which can only support a maximum overclock to 1.2GHz.)

 

@Igor - it seems quite clear that the default 624MHz DDR clock is too high for this board.  I've bisected the rates - e.g., 576MHz works great where 624MHz would crash intermittently.  Now 576MHz may not actually be low enough, but at this point I think I'm going to go ahead and lower the H5 clocks to 576MHz (for the boards I have and can test), which will be more stable than the current 624MHz - please let me know if you disagree.  The real driver for this is that even without overclocking sfx2000's board crashes with this test.

 

Also, obviously there is variance in boards/DDR...I'm also going to do some more research and testing and see if there isn't a more scientific process that we can use to determine the best DDR clock rate to use here...

 

 

32 minutes ago, 5kft said:

Also, obviously there is variance in boards/DDR...I'm also going to do some more research and testing and see if there isn't a more scientific process that we can use to determine the best DDR clock rate to use here...


I'm interested in this as my Opi Prime has been bulletproof for months with the existing configs..   Let me know how I can help test..I also have a PC2 that i can use for testing

 

14 minutes ago, lanefu said:

I'm interested in this as my Opi Prime has been bulletproof for months with the existing configs..   Let me know how I can help test..I also have a PC2 that i can use for testing

 

Great!  It'd be interesting for you to try "openssl speed -multi 4" on your Opi Prime, and see if it makes it all the way through the run successfully without crashing (make sure you have sufficient cooling, and patience :)).

 

Personally I want a completely stable platform, which would mean taking the most conservative route (e.g., possibly reducing the DDR clock even further).  I just got a crash on one board at 1.2GHz/576MHz, but it works fine if it isn't overclocked.  However, if the overclock is just exposing a latent issue (as per @sfx2000's comment above), then we'd need to go lower...  Any/all thoughts are appreciated regarding this...!

 

5 hours ago, 5kft said:

Personally I want a completely stable platform, which would mean taking the most conservative route (e.g., possibly reducing the DDR clock even further).  I just got a crash on one board at 1.2GHz/576MHz, but it works fine if it isn't overclocked.  However, if the overclock is just exposing a latent issue (as per @sfx2000's comment above), then we'd need to go lower...  Any/all thoughts are appreciated regarding this...!

 

Looks like Sunxi-mainline-kernel-4.14 is 504 MHz for the DDR clock for H5 on Neo2

 

http://wiki.friendlyarm.com/wiki/images/a/af/Sunxi-mainline-kernel-4.14-features.xlsx

 

 

34 minutes ago, sfx2000 said:

Looks like Sunxi-mainline-kernel-4.14 is 504 MHz for the DDR clock for H5 on Neo2

 

Yes - the posts I mention above have some history regarding this (e.g., patch to bring the mainline 672 down to the FA 504).  It looks like there was an original desire to run it at 504, then at some point in late 2017 people started overclocking, and it seemed to work...


On my NEO2s, I couldn't get any failures at the default CPU clock rate (1.0GHz) at 576.  I'd like to go conservative here for stability, but I'm not sure how people would feel going to 504 if 576 works fine.  Can you confirm the clocks for the FA firmware that you tested with?  It's worrisome if you could repro and it's at DDR 504/CPU 1.0GHz.

 

BTW, I hammered my NEO2 Blacks running with the default (624), and they work without any issue - default clock goes to 1.37GHz.  These boards are different as they use 32-bit DDR (two parts).

 

5 minutes ago, 5kft said:

Yes - the posts I mention above have some history regarding this (e.g., patch to bring the mainline 672 down to the FA 504).  It looks like there was an original desire to run it at 504, then at some point in late 2017 people started overclocking, and it seemed to work...


On my NEO2s, I couldn't get any failures at the default clock rate (1.0GHz) at 576.  I'd like to go conservative here for stability, but I'm not sure how people would feel going to 504 if 576 works fine.  Can you confirm the clocks for the FA firmware that you tested with?  It's worrisome if you could repro and it's at DDR 504/CPU 1.0GHz.

 

BTW, I hammered my NEO2 Blacks running with the default (624), and they work without any issue - default clock goes to 1.37GHz.  These boards are different as they use 32-bit DDR (two parts).

 

I did the FA image as a quick test, and then overwrote that card with the current Armbian to get a baseline - as my armbian card is a few months old (custom work for the oled hat, other things)

 

IIRC, FA was pushing CPU up to 1.2, but didn't note the DDR clocks. Anyways, link to that image is in the thread

 

I agree that stability is better than overall performance - stability is its own performance benchmark. I'd rather err on the side of safety.

 

Neo2 Black - different board, eh?

16 minutes ago, 5kft said:

Yes - the posts I mention above have some history regarding this (e.g., patch to bring the mainline 672 down to the FA 504).  It looks like there was an original desire to run it at 504, then at some point in late 2017 people started overclocking, and it seemed to work...

 

Seems like it doesn't under certain loads...

2 minutes ago, sfx2000 said:

I agree that stability is better than overall performance - stability is its own performance benchmark. I'd rather err on the side of safety.

 

Agreed.  @Igor, @martinayotte, @lanefu - apologies for the spam, but am looking for your thoughts...should we drop back to the FA mainline u-boot DDR clock rate for the NEO2, NEO Plus2, etc.?

 

5 minutes ago, sfx2000 said:

Neo2 Black - different board, eh?

 

Yeah, the Black is great - it's a NEO Core2 base with a better regulator, plus eMMC.  I use a number of these now.  I like these boards because they are so tiny, low power, and the performance is great :)  The little metal cases are awesome too.

3 minutes ago, sfx2000 said:

Seems like it doesn't under certain loads...

 

Exactly...  Thanks again for looking further into this!

1 minute ago, 5kft said:

thoughts...should we drop back to the FA mainline u-boot DDR clock rate for the NEO2, NEO Plus2, etc.?

 

If we adjusted would it be possible to ship overlays for the "unstable" speeds?

2 minutes ago, lanefu said:

If we adjusted would it be possible to ship overlays for the "unstable" speeds?

 

Unfortunately not - the DDR clocks are set in u-boot...  However users could feel free to build their own u-boots that use higher DDR clocks.
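For those who do want to experiment with their own u-boot build, mainline u-boot sets the sunxi DRAM clock (in MHz) through the CONFIG_DRAM_CLK option in the board defconfig. The helpers below are a sketch with invented names; the defconfig file name in the example is an assumption, and how the Armbian build system patches this value is not shown here.

```sh
#!/bin/sh
# Print the DRAM clock (MHz) set in a u-boot defconfig, or nothing if unset.
dram_clk() {
    sed -n 's/^CONFIG_DRAM_CLK=\([0-9][0-9]*\)$/\1/p' "$1"
}

# Rewrite the DRAM clock in place; $1 is the defconfig path, $2 the MHz value.
set_dram_clk() {
    sed "s/^CONFIG_DRAM_CLK=.*/CONFIG_DRAM_CLK=$2/" "$1" > "$1.new" \
        && mv "$1.new" "$1"
}

# Example, inside a u-boot source tree (defconfig name assumed):
#   dram_clk configs/nanopi_neo2_defconfig
#   set_dram_clk configs/nanopi_neo2_defconfig 576
```

Anything above the shipped default should be treated as untested until it survives a long SMP stress run on the specific board.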

23 minutes ago, 5kft said:

 

Unfortunately not - the DDR clocks are set in u-boot...  However users could feel free to build their own u-boots that use higher DDR clocks.

 

In any event, the overall performance difference between the "stable" and "unstable" clocks isn't enough to justify the risks, unless one looks at specific benchmarks...

 

I'm not into benchmarking - I've got an interest in operational usage of the device.

 

just my thoughts...

 

sfx

6 hours ago, 5kft said:

Great!  It'd be interesting for you to try "openssl speed -multi 4" on your Opi Prime, and see if it makes it all the way through the run successfully without crashing (make sure you have sufficient cooling, and patience :)).

 

Well.... that seemed to take it down pretty quick
image.png
 

6 minutes ago, lanefu said:

Well.... that seemed to take it down pretty quick

 

Yep...

5 minutes ago, lanefu said:

Well.... that seemed to take it down pretty quick

 

OK well that answers that...  I think it's clear that the memory tests used back in 2017 weren't sufficient to determine the stability of this clock.  Why don't I go ahead and set it to 504MHz as that's the FA default, then if desired people could look at overclocking this further.

3 minutes ago, 5kft said:

OK well that answers that...  I think it's clear that the memory tests used back in 2017 weren't sufficient to determine the stability of this clock.  Why don't I go ahead and set it to 504MHz as that's the FA default, then if desired people could look at overclocking this further.

 

Sounds good - and folks should test around the 504MHz DDR clock, along with the overlay to upclock on boards that support the 1.3V regulator... both at 1.2 and 1.3 GHz.


Changes checked in:  https://github.com/armbian/build/commit/42201fd3fc1386c6dc8785c4f85db35289bfe2db

 

After building a new u-boot, you can copy it to your board, then install it to the filesystem via "dpkg -i ...":

root@nanopineo2:~/tmp# dpkg -i linux-u-boot-current-nanopineo2_20.05.0-trunk_arm64.deb
(Reading database ... 33567 files and directories currently installed.)
Preparing to unpack linux-u-boot-current-nanopineo2_20.05.0-trunk_arm64.deb ...
Unpacking linux-u-boot-nanopineo2-current (20.05.0-trunk) over (20.05.0-trunk) ...
Setting up linux-u-boot-nanopineo2-current (20.05.0-trunk) ...
root@nanopineo2:~/tmp#

Then, to install it to your SD/eMMC, run "armbian-config".  In the menu, select "System and security settings", then "Install to/update boot loader", then "Install/Update the bootloader on SD/eMMC", then "Yes" at the WARNING prompt.  Exit armbian-config, then reboot.

 

3 hours ago, 5kft said:

Then, to install it to your SD/eMMC, run "armbian-config".  In the menu, select "System and security settings", then "Install to/update boot loader", then "Install/Update the bootloader on SD/eMMC", then "Yes" at the WARNING prompt.  Exit armbian-config, then reboot.

 

I built u-boot and installed the package, then installed the latest armbian-config nightly.  I can't find "Install to/update boot loader" in armbian-config.  Did you mean nand-sata-install?


My board - still hangs up hard with the 1.2GHz overlay on the openssl stress test...

 

verbosity=1
console=both
overlay_prefix=sun50i-h5
overlays=usbhost1 usbhost2 gpio-regulator-1.3v i2c0 cpu-clock-1.2GHz-1.3v
rootdev=UUID=c87503e2-838a-42db-8208-a5293ae03ad5
rootfstype=ext4
extraargs=net.ifnames=0
usbstoragequirks=0x2537:0x1066:u,0x2537:0x1068:u

 

Getting better overall performance though with the lower clocks without the overlay...

 

Screenshots with the overlay below... hard crash on the 1.2GHz overlay...

 

 

1682476429_ScreenShot2020-03-08at8_08_12PM.png.c614d5bb024a75901e2a4eaabc141edb.png
1221434282_ScreenShot2020-03-08at8_07_00PM.png.aeee7c7aba3d483f49e097ac1a924d5c.png

 

sun50i-h5-cpu-clock-1.2GHz-1.3v.dtbo
linux-u-boot-current-nanopineo2_20.05.0-trunk_arm64.deb

