Posted

I've submitted a pull request to Igor on the Armbian GitHub to patch both the Lime2 and Lime2-eMMC boards to 384 MHz.
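
For context, the change itself is essentially a one-line setting in the U-Boot board configuration, something like this (illustrative only, not the literal contents of the PR):

# DRAM clock in MHz, set in the board defconfig (e.g. configs/A20-OLinuXino-Lime2_defconfig)
CONFIG_DRAM_CLK=384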

 

Could you please provide tinymembench numbers made with 384 MHz? It's pretty easy since it's just

git clone https://github.com/ssvb/tinymembench
cd tinymembench/
make
./tinymembench

On my Lime2 it looks like this (and, as already said, I've no clue which DRAM clock speed I specified when I built the image months ago, so I'm curious how your numbers compare to mine ;) ):

 

 

tk@lime2:~/tinymembench$ ./tinymembench 
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    242.7 MB/s (0.4%)
 C copy backwards (32 byte blocks)                    :    742.0 MB/s (0.5%)
 C copy backwards (64 byte blocks)                    :    769.2 MB/s
 C copy                                               :    768.7 MB/s (0.5%)
 C copy prefetched (32 bytes step)                    :    799.1 MB/s
 C copy prefetched (64 bytes step)                    :    833.3 MB/s
 C 2-pass copy                                        :    688.0 MB/s (0.6%)
 C 2-pass copy prefetched (32 bytes step)             :    736.5 MB/s
 C 2-pass copy prefetched (64 bytes step)             :    752.6 MB/s
 C fill                                               :   2021.5 MB/s (0.6%)
 C fill (shuffle within 16 byte blocks)               :   2021.7 MB/s
 C fill (shuffle within 32 byte blocks)               :    313.2 MB/s
 C fill (shuffle within 64 byte blocks)               :    327.3 MB/s
 ---
 standard memcpy                                      :    519.4 MB/s (0.4%)
 standard memset                                      :   2020.9 MB/s (0.8%)
 ---
 NEON read                                            :   1180.9 MB/s
 NEON read prefetched (32 bytes step)                 :   1334.8 MB/s
 NEON read prefetched (64 bytes step)                 :   1345.1 MB/s (0.4%)
 NEON read 2 data streams                             :    338.5 MB/s
 NEON read 2 data streams prefetched (32 bytes step)  :    644.8 MB/s
 NEON read 2 data streams prefetched (64 bytes step)  :    675.3 MB/s (0.5%)
 NEON copy                                            :    784.7 MB/s
 NEON copy prefetched (32 bytes step)                 :    861.0 MB/s (1.0%)
 NEON copy prefetched (64 bytes step)                 :    876.6 MB/s
 NEON unrolled copy                                   :    849.6 MB/s
 NEON unrolled copy prefetched (32 bytes step)        :    794.0 MB/s
 NEON unrolled copy prefetched (64 bytes step)        :    828.6 MB/s (0.6%)
 NEON copy backwards                                  :    754.7 MB/s
 NEON copy backwards prefetched (32 bytes step)       :    840.3 MB/s
 NEON copy backwards prefetched (64 bytes step)       :    862.2 MB/s (1.1%)
 NEON 2-pass copy                                     :    729.5 MB/s
 NEON 2-pass copy prefetched (32 bytes step)          :    769.9 MB/s
 NEON 2-pass copy prefetched (64 bytes step)          :    782.6 MB/s (0.6%)
 NEON unrolled 2-pass copy                            :    652.1 MB/s
 NEON unrolled 2-pass copy prefetched (32 bytes step) :    612.2 MB/s
 NEON unrolled 2-pass copy prefetched (64 bytes step) :    655.0 MB/s (0.5%)
 NEON fill                                            :   2020.6 MB/s
 NEON fill backwards                                  :   2020.7 MB/s
 VFP copy                                             :    856.7 MB/s (1.7%)
 VFP 2-pass copy                                      :    660.6 MB/s
 ARM fill (STRD)                                      :   2020.2 MB/s
 ARM fill (STM with 8 registers)                      :   2021.2 MB/s
 ARM fill (STM with 4 registers)                      :   2021.1 MB/s (0.7%)
 ARM copy prefetched (incr pld)                       :    830.9 MB/s
 ARM copy prefetched (wrap pld)                       :    793.4 MB/s
 ARM 2-pass copy prefetched (incr pld)                :    730.1 MB/s (0.6%)
 ARM 2-pass copy prefetched (wrap pld)                :    704.7 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    6.6 ns          /    11.3 ns 
    131072 :   10.1 ns          /    15.9 ns 
    262144 :   13.6 ns          /    20.0 ns 
    524288 :  105.5 ns          /   166.9 ns 
   1048576 :  156.9 ns          /   219.7 ns 
   2097152 :  189.3 ns          /   245.3 ns 
   4194304 :  206.3 ns          /   256.5 ns 
   8388608 :  217.1 ns          /   264.7 ns 
  16777216 :  228.6 ns          /   278.7 ns 
  33554432 :  245.5 ns          /   307.7 ns 
  67108864 :  275.5 ns          /   366.9 ns  

 

 

Posted

I can confirm that 432 MHz is stable for the Cubieboard 2.
My Cubieboard 2 is doing a stress test right now and it's stable. The test has already been running for 5 hours and I will leave it running for the rest of the day.

At 480 MHz, the stress test failed within 1 hour.
I used 432 MHz instead of 384 MHz because of this topic: https://groups.google.com/forum/#!topic/cubieboard/9WMBFAL7JBE
I will upload my u-boot here if anyone is interested.

 

u-boot-sunxi-(432MHz).zip
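
For anyone who wants to try it: a sunxi u-boot image is normally written to the raw SD card with dd (assuming the archive contains the usual u-boot-sunxi-with-spl.bin and the card shows up as /dev/mmcblk0; double-check the device node before writing):

dd if=u-boot-sunxi-with-spl.bin of=/dev/mmcblk0 bs=1024 seek=8
sync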

Posted

I can confirm that 432 MHz is stable for the Cubieboard 2.

My Cubieboard 2 is doing a stress test right now and it's stable.

 

Stable? Not crashing and not corrupting memory are two different things. Did it survive a 24h lima-memtester run?
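
For reference, getting and running lima-memtester is similar to the tinymembench steps above (assuming the repository builds with a plain make; the size argument is the amount of memory to test, as used later in this thread):

git clone https://github.com/ssvb/lima-memtester
cd lima-memtester/
make
./lima-memtester 100M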

Posted

I'm using just memtester, because when I run lima-memtester I get this error:

 

Please remove 'sunxi_no_mali_mem_reserve' option from
your kernel command line. Otherwise the mali kernel
driver may be non-functional and actually knock down
your system with some old linux-sunxi kernels.
Aborted
 
The kernel I use is the one from the Armbian repository (5.20). I did not change the kernel.
Strange, because I'm not using a CLI image. There should be a memory reservation for the GPU, right?
 
For the benchmark I'm using:
stress -c 2
memtester 100M
openssl speed -multi 5
 
I will leave it running for today.
 
Edit: It turns out that my boot.scr had the Mali memory reservation disabled. I don't know why, though. Running lima-memtester now. I'll come back and report tomorrow.
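
For reference, a quick way to check whether the option is currently active and whether it comes from the boot script (assuming the usual Armbian /boot layout):

grep -o sunxi_no_mali_mem_reserve /proc/cmdline
grep -r sunxi_no_mali_mem_reserve /boot/
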
Posted

Hi,

 

I installed my a20lime2 in October 2015 (it was Armbian_4.5_Lime2_Debian_jessie_3.4.109) and, although it is not highly loaded, it has worked since then without a single problem or crash. So I was quite astonished by the problems reported in this thread. I frequently updated the system using aptitude, so I tried to find out why my system runs stably.

 

I have a 2,5" HDD connected via SATA and I had done a rsync of around 600GB to the a20lime2 short time ago and it succeeded without a problem.

 

Anyway, I found out that although the deb packages for the kernel, firmware and root filesystem were installed and therefore updated regularly, the u-boot deb package was not installed! So I'm still using the u-boot from the installation image:

a20lime2:~/tinymembench# dd if=/dev/mmcblk0 bs=48K count=1 | strings | grep -i "U-Boot"
1+0 records in
1+0 records out
49152 bytes (49 kB) copied, 0.0081478 s, 6.0 MB/s
U-Boot
U-Boot SPL 2015.07-armbian-sun7i (Oct 11 2015 - 16:53:01)
U-Boot 2015.07-armbian-sun7i for

I thought the missing u-boot deb package was my fault, but I checked my Armbian a10lime (first installed with Armbian_5.00_Lime-a10_Debian_jessie_3.4.110) and found it missing there, too. It still had U-Boot SPL 2016.01-armbian-sun7i (Feb 10 2016 - 20:08:59) installed. So either I did something systematically wrong, or the deb package was not installed on these old images.
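
For anyone who wants to compare: a simple way to check whether any u-boot deb package is installed at all (package names may differ between Armbian releases):

dpkg -l | grep -i u-boot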

 

Maybe this can explain why users with old and updated installations (like me) don't see the problem.

 

I added my tinymembench results at the bottom. Comparing the results confirms that my board is running at a lower DRAM speed than tkaiser's.

 

Bye,

wahlm

 

 

 

a20lime2:~/tinymembench# ./tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================

C copy backwards : 230.6 MB/s (0.9%)
C copy backwards (32 byte blocks) : 704.1 MB/s (1.3%)
C copy backwards (64 byte blocks) : 723.9 MB/s (1.7%)
C copy : 594.2 MB/s (0.2%)
C copy prefetched (32 bytes step) : 695.9 MB/s (0.7%)
C copy prefetched (64 bytes step) : 700.7 MB/s (0.8%)
C 2-pass copy : 553.2 MB/s (0.2%)
C 2-pass copy prefetched (32 bytes step) : 590.4 MB/s
C 2-pass copy prefetched (64 bytes step) : 603.4 MB/s (0.5%)
C fill : 1571.4 MB/s (0.6%)
C fill (shuffle within 16 byte blocks) : 1572.7 MB/s (0.5%)
C fill (shuffle within 32 byte blocks) : 301.7 MB/s (2.4%)
C fill (shuffle within 64 byte blocks) : 316.4 MB/s (2.7%)
---
standard memcpy : 454.3 MB/s
standard memset : 1571.7 MB/s (0.7%)
---
NEON read : 963.8 MB/s
NEON read prefetched (32 bytes step) : 1131.9 MB/s
NEON read prefetched (64 bytes step) : 1145.7 MB/s (0.2%)
NEON read 2 data streams : 324.0 MB/s (0.7%)
NEON read 2 data streams prefetched (32 bytes step) : 603.5 MB/s
NEON read 2 data streams prefetched (64 bytes step) : 635.2 MB/s (0.6%)
NEON copy : 633.8 MB/s (0.7%)
NEON copy prefetched (32 bytes step) : 719.1 MB/s
NEON copy prefetched (64 bytes step) : 738.2 MB/s (0.8%)
NEON unrolled copy : 640.0 MB/s (0.6%)
NEON unrolled copy prefetched (32 bytes step) : 675.4 MB/s (0.7%)
NEON unrolled copy prefetched (64 bytes step) : 711.1 MB/s (0.7%)
NEON copy backwards : 715.8 MB/s (0.8%)
NEON copy backwards prefetched (32 bytes step) : 733.0 MB/s (0.7%)
NEON copy backwards prefetched (64 bytes step) : 759.4 MB/s
NEON 2-pass copy : 594.2 MB/s
NEON 2-pass copy prefetched (32 bytes step) : 647.0 MB/s (0.7%)
NEON 2-pass copy prefetched (64 bytes step) : 660.9 MB/s (0.7%)
NEON unrolled 2-pass copy : 526.7 MB/s (0.3%)
NEON unrolled 2-pass copy prefetched (32 bytes step) : 506.7 MB/s (0.5%)
NEON unrolled 2-pass copy prefetched (64 bytes step) : 542.1 MB/s (0.4%)
NEON fill : 1568.9 MB/s
NEON fill backwards : 1652.2 MB/s
VFP copy : 650.8 MB/s (0.7%)
VFP 2-pass copy : 525.8 MB/s (0.4%)
ARM fill (STRD) : 1570.9 MB/s (0.7%)
ARM fill (STM with 8 registers) : 1570.5 MB/s (0.8%)
ARM fill (STM with 4 registers) : 1571.3 MB/s
ARM copy prefetched (incr pld) : 732.3 MB/s (0.7%)
ARM copy prefetched (wrap pld) : 627.4 MB/s
ARM 2-pass copy prefetched (incr pld) : 628.6 MB/s (0.7%)
ARM 2-pass copy prefetched (wrap pld) : 585.3 MB/s

==========================================================================
== Framebuffer read tests. ==
== ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled. ==
== Writes to such framebuffers are quite fast, but reads are much ==
== slower and very sensitive to the alignment and the selection of ==
== CPU instructions which are used for accessing memory. ==
== ==
== Many x86 systems allocate the framebuffer in the GPU memory, ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover, ==
== PCI-E is asymmetric and handles reads a lot worse than writes. ==
== ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall ==
== performance improvement. For example, the xf86-video-fbturbo DDX ==
== uses this trick. ==
==========================================================================

NEON read (from framebuffer) : 45.0 MB/s
NEON copy (from framebuffer) : 44.3 MB/s
NEON 2-pass copy (from framebuffer) : 43.6 MB/s
NEON unrolled copy (from framebuffer) : 43.7 MB/s (0.3%)
NEON 2-pass unrolled copy (from framebuffer) : 43.3 MB/s
VFP copy (from framebuffer) : 240.7 MB/s (0.5%)
VFP 2-pass copy (from framebuffer) : 248.4 MB/s (0.5%)
ARM copy (from framebuffer) : 165.7 MB/s (0.7%)
ARM 2-pass copy (from framebuffer) : 150.3 MB/s

==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================

block size : single random read / dual random read
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 6.8 ns / 10.8 ns
131072 : 10.7 ns / 15.1 ns
262144 : 14.1 ns / 19.5 ns
524288 : 112.3 ns / 176.5 ns
1048576 : 167.1 ns / 233.7 ns
2097152 : 201.2 ns / 260.7 ns
4194304 : 219.5 ns / 272.0 ns
8388608 : 230.4 ns / 280.6 ns
16777216 : 242.4 ns / 294.9 ns
33554432 : 260.8 ns / 328.5 ns
67108864 : 297.0 ns / 398.2 ns

 

Posted

30 hours running lima-memtester and it survived. No memory errors. That companion cube was making me dizzy already.  :D

I think 432 MHz is pretty stable for CB2.

Just a reminder: at 480 MHz it couldn't survive memtester for more than 1 hour.

Posted

30 hours running lima-memtester and it survived. No memory errors. That companion cube was making me dizzy already.  :D

I think 432 MHz is pretty stable for CB2.

Just a reminder: at 480 MHz it couldn't survive memtester for more than 1 hour.

Thanks a lot!

 

I had a CB2 board myself (but not anymore) and at least my board passed the lima-memtester test with the DRAM clocked at 480 MHz. My board was an early one and had GT memory chips (GT8UB256M16BP-BG). But SK Hynix chips (H5TQ4G63AFR-PBC) have also been observed in the wild later, according to the pictures on the https://linux-sunxi.org/Cubieboard2 page. Sometimes the choice of DRAM chips affects reliability. I observed different DRAM reliability behaviour with NANYA and HYNIX chips on LinkSprite pcDuino2 boards. And Olimex also reported reliability problems after changing the DRAM vendor on some of their boards.

 

Could you please check the DRAM chip markings on your board? Also, it would be great if you could test the board at different DRAM clock speed steps (432, 456 and 480) with lima-memtester and find the exact crossover point where things become unreliable. There is no need to run lima-memtester for 30 hours except for the final verification; usually something between 20 minutes and 1 hour is enough. Comparing the lima-memtester results with your old method is also interesting (do they have similar sensitivity for detecting errors? Is one of them faster at spotting problems?).

 

Naturally, it's bad news that your Cubieboard2 fails with the mainline U-Boot default settings. We clearly need to fix this, but we also need to do a bit of investigation to see what is going on, whether the DRAM chip vendor makes a difference, and what the maximum reliable clock frequency is on different boards. A perfect example of such an investigation is https://linux-sunxi.org/Orange_Pi_PC#DRAM_clock_speed_limit
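
For the record, rebuilding the mainline U-Boot with a different DRAM clock for such a test is just a matter of changing CONFIG_DRAM_CLK before compiling. One way to do it (the cross-toolchain prefix and the 456 value are just examples):

make CROSS_COMPILE=arm-linux-gnueabihf- Cubieboard2_defconfig
sed -i 's/^CONFIG_DRAM_CLK=.*/CONFIG_DRAM_CLK=456/' .config
make CROSS_COMPILE=arm-linux-gnueabihf- -j4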

Posted

I will verify the DRAM chip when I arrive home today.

When I used Cubian, there was an update that increased a voltage  due to some stability issues. See: http://cubian.org/2014/07/05/resolve-stability-issue-on-cb2/

Maybe it's the same issue we are facing here? I never had this problem with Cubian, nor did I change the u-boot.

When I arrive home, I will test Cubian with tinymembench so we can check at what speed and voltages Cubian is running.

 

Edit: I found this: https://github.com/maxnet/a10-meminfo

Looks like it can dump the DRAM settings and it also works on A20 boards. (I found it while re-reading the pages linked from the link I posted above.)
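
For anyone who wants to try it, building and running it should be as simple as this (assuming the repository provides a Makefile; otherwise compiling the single C source with gcc should do, and the tool needs to run as root because it reads the DRAM controller registers):

git clone https://github.com/maxnet/a10-meminfo
cd a10-meminfo/
make
sudo ./a10-meminfo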

Posted

Hi,

 

I just tried a10-meminfo on my stable a20lime2 and a10lime with the old u-boot versions (see below). The funny thing is that the tool reports 480 MHz for both systems!

 

I tried running lima-memtester, but both systems are servers/headless and use sunxi_no_mali_mem_reserve on the kernel command line, so lima-memtester terminates right at startup.

 

The question is why my a20lime2 tinymembench results show lower performance than tkaiser's. Maybe there are some other parameters besides dram_clk involved? Or does it depend on the type/vendor of the RAM used on the board?

 

Bye,

wahlm

root@a10lime:~/a10-meminfo# ./a10-meminfo
dram_clk          = 480
dram_type         = 3
dram_rank_num     = 1
dram_chip_density = 4096
dram_io_width     = 16
dram_bus_width    = 16
dram_cas          = 6
dram_zq           = 0x7b
dram_odt_en       = 0
dram_tpr0         = 0x30926692
dram_tpr1         = 0x1090
dram_tpr2         = 0x1a0c8
dram_tpr3         = 0x0
dram_emr1         = 0x4
dram_emr2         = 0x0
dram_emr3         = 0x0

a20lime2:~/a10-meminfo# ./a10-meminfo
dram_clk          = 480
dram_type         = 3
dram_rank_num     = 1
dram_chip_density = 4096
dram_io_width     = 16
dram_bus_width    = 32
dram_cas          = 9
dram_zq           = 0x7b
dram_odt_en       = 0
dram_tpr0         = 0x42d899b7
dram_tpr1         = 0xa090
dram_tpr2         = 0x22a00
dram_tpr3         = 0x0
dram_emr1         = 0x4
dram_emr2         = 0x10
dram_emr3         = 0x0

Posted

Hi,

 

As an addition, here are the a10-meminfo results from my CubieBoard 1 and CubieTruck, both running Cubian. At least the CubieTruck shows a lower dram_clk.

Both systems have been running 24/7 for a long time (1-2 years) without problems and have 2.5" HDDs attached via SATA.

 

But it has to be verified that the results of a10-meminfo are reliable...

 

Bye,

wahlm

root@cubie:~/a10-meminfo# ./a10-meminfo
dram_clk          = 480
dram_type         = 3
dram_rank_num     = 1
dram_chip_density = 4096
dram_io_width     = 16
dram_bus_width    = 32
dram_cas          = 6
dram_zq           = 0x7b
dram_odt_en       = 0
dram_tpr0         = 0x30926692
dram_tpr1         = 0x1090
dram_tpr2         = 0x1a0c8
dram_tpr3         = 0x0
dram_emr1         = 0x0
dram_emr2         = 0x0
dram_emr3         = 0x0

root@ctruck:~/a10-meminfo# ./a10-meminfo
dram_clk          = 432
dram_type         = 3
dram_rank_num     = 1
dram_chip_density = 8192
dram_io_width     = 16
dram_bus_width    = 32
dram_cas          = 9
dram_zq           = 0x7f
dram_odt_en       = 0
dram_tpr0         = 0x42d899b7
dram_tpr1         = 0xa090
dram_tpr2         = 0x22a00
dram_tpr3         = 0x0
dram_emr1         = 0x4
dram_emr2         = 0x10
dram_emr3         = 0x0

Posted

 

The question is why my a20lime2 tinymembench results show lower performance than tkaiser's. Maybe there are some other parameters besides dram_clk involved? Or does it depend on the type/vendor of the RAM used on the board?

Some of the memory bandwidth may be drained by the screen refresh. Even if you don't have any monitor connected, the board still might be trying to send some data over a non-connected HDMI interface and waste some memory bandwidth. If you are using an older 3.4 kernel, then you can try to run "echo 1 > /sys/devices/platform/disp/graphics/fb0/blank" command before running the benchmark.

Posted

Hi,

 

Just remove this kernel cmdline option. It serves no purpose anyway: https://github.com/linux-sunxi/linux-sunxi/commit/90e6c43fe04755947e252c3796c6b5c00e47df02

 

OK, thanks for the hint! From what I remember, I did not set that option, so it is/was part of the Armbian default config. Anyway, I removed it from the kernel cmdline of my a20lime2 and ran lima-memtester 100M without a problem for about 80 minutes. As it is a headless system I could not check the cube animation, but no error was reported on the (ssh) console and 5 full loops had completed by that time.
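
For reference, on these legacy Armbian images removing the option means editing the bootargs in /boot/boot.cmd and regenerating the compiled boot script, roughly like this (paths are the usual Armbian defaults; adjust if your image differs):

nano /boot/boot.cmd    # remove sunxi_no_mali_mem_reserve from the bootargs line
mkimage -C none -A arm -T script -d /boot/boot.cmd /boot/boot.scr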

 

I then cancelled the test to check the DRAMs on the board. It has Samsung K4B4G1646D-BCK0 (date code 516) chips assembled. In addition, I saw that I had placed a heatsink on the A20 (I forgot about that...). The board is housed in the standard Olimex plastic case. My board is Rev. C.

 

Bye,

wahlm

Posted

Hi,

 

I would suggest to use an improved version of a10-meminfo from https://github.com/ssvb/a10-dram-tools

 

Again, thanks for the hint. It seems I missed a lot of already available info :( . But my boards have worked reliably up to now, so there was no need to dig into an investigation.

Anyway, here are the results of the improved version of a10-meminfo for the a20lime2. It seems the values already reported by the older version are still reliable.

 

Bye,

wahlm

a20lime2:~/a10-dram-tools# ./a10-meminfo 
dram_clk          = 480
mbus_clk          = 300
dram_type         = 3
dram_rank_num     = 1
dram_chip_density = 4096
dram_io_width     = 16
dram_bus_width    = 32
dram_cas          = 9
dram_zq           = 0x7b (0x5294a00)
dram_odt_en       = 0
dram_tpr0         = 0x42d899b7
dram_tpr1         = 0xa090
dram_tpr2         = 0x22a00
dram_tpr3         = 0x0
dram_emr1         = 0x4
dram_emr2         = 0x10
dram_emr3         = 0x0
dqs_gating_delay  = 0x05050505
active_windowing  = 0

Posted

Hi,

 

Some of the memory bandwidth may be drained by the screen refresh. Even if you don't have any monitor connected, the board still might be trying to send some data over a non-connected HDMI interface and waste some memory bandwidth. If you are using an older 3.4 kernel, then you can try to run "echo 1 > /sys/devices/platform/disp/graphics/fb0/blank" command before running the benchmark.

 

Yes, I still use the 3.4 kernel, and blanking the HDMI output improved the benchmark results a lot. The values are now even better than tkaiser's. Maybe his results were influenced by HDMI output, too.

 

Find my new tinymembench results for the a20lime2 below.

 

Bye,

wahlm

 

 

 

a20lime2:~/tinymembench# echo 1 > /sys/devices/platform/disp/graphics/fb0/blank
a20lime2:~/tinymembench# ./tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================

C copy backwards : 257.2 MB/s (1.2%)
C copy backwards (32 byte blocks) : 835.4 MB/s
C copy backwards (64 byte blocks) : 881.3 MB/s
C copy : 827.9 MB/s
C copy prefetched (32 bytes step) : 856.1 MB/s
C copy prefetched (64 bytes step) : 871.6 MB/s
C 2-pass copy : 690.5 MB/s
C 2-pass copy prefetched (32 bytes step) : 726.2 MB/s
C 2-pass copy prefetched (64 bytes step) : 716.6 MB/s
C fill : 2037.7 MB/s
C fill (shuffle within 16 byte blocks) : 2037.8 MB/s
C fill (shuffle within 32 byte blocks) : 355.7 MB/s (3.9%)
C fill (shuffle within 64 byte blocks) : 381.4 MB/s (4.8%)
---
standard memcpy : 540.2 MB/s
standard memset : 2037.4 MB/s
---
NEON read : 1202.0 MB/s
NEON read prefetched (32 bytes step) : 1367.7 MB/s
NEON read prefetched (64 bytes step) : 1359.0 MB/s
NEON read 2 data streams : 353.0 MB/s
NEON read 2 data streams prefetched (32 bytes step) : 673.5 MB/s
NEON read 2 data streams prefetched (64 bytes step) : 708.1 MB/s
NEON copy : 897.3 MB/s
NEON copy prefetched (32 bytes step) : 898.8 MB/s
NEON copy prefetched (64 bytes step) : 943.3 MB/s
NEON unrolled copy : 920.6 MB/s
NEON unrolled copy prefetched (32 bytes step) : 843.2 MB/s
NEON unrolled copy prefetched (64 bytes step) : 880.7 MB/s
NEON copy backwards : 850.2 MB/s
NEON copy backwards prefetched (32 bytes step) : 876.2 MB/s
NEON copy backwards prefetched (64 bytes step) : 919.8 MB/s
NEON 2-pass copy : 754.6 MB/s
NEON 2-pass copy prefetched (32 bytes step) : 794.8 MB/s
NEON 2-pass copy prefetched (64 bytes step) : 813.1 MB/s
NEON unrolled 2-pass copy : 677.1 MB/s
NEON unrolled 2-pass copy prefetched (32 bytes step) : 633.9 MB/s
NEON unrolled 2-pass copy prefetched (64 bytes step) : 677.5 MB/s
NEON fill : 2037.9 MB/s
NEON fill backwards : 2037.7 MB/s
VFP copy : 927.5 MB/s
VFP 2-pass copy : 685.8 MB/s
ARM fill (STRD) : 2037.8 MB/s
ARM fill (STM with 8 registers) : 2037.9 MB/s
ARM fill (STM with 4 registers) : 2038.0 MB/s
ARM copy prefetched (incr pld) : 914.2 MB/s
ARM copy prefetched (wrap pld) : 861.5 MB/s
ARM 2-pass copy prefetched (incr pld) : 764.6 MB/s
ARM 2-pass copy prefetched (wrap pld) : 729.8 MB/s

==========================================================================
== Framebuffer read tests. ==
== ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled. ==
== Writes to such framebuffers are quite fast, but reads are much ==
== slower and very sensitive to the alignment and the selection of ==
== CPU instructions which are used for accessing memory. ==
== ==
== Many x86 systems allocate the framebuffer in the GPU memory, ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover, ==
== PCI-E is asymmetric and handles reads a lot worse than writes. ==
== ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall ==
== performance improvement. For example, the xf86-video-fbturbo DDX ==
== uses this trick. ==
==========================================================================

NEON read (from framebuffer) : 50.2 MB/s
NEON copy (from framebuffer) : 49.2 MB/s
NEON 2-pass copy (from framebuffer) : 48.7 MB/s
NEON unrolled copy (from framebuffer) : 47.5 MB/s
NEON 2-pass unrolled copy (from framebuffer) : 48.0 MB/s
VFP copy (from framebuffer) : 242.6 MB/s
VFP 2-pass copy (from framebuffer) : 275.5 MB/s
ARM copy (from framebuffer) : 176.3 MB/s
ARM 2-pass copy (from framebuffer) : 166.5 MB/s

==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================

block size : single random read / dual random read
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 6.2 ns / 10.8 ns
131072 : 9.7 ns / 15.1 ns
262144 : 13.1 ns / 19.0 ns
524288 : 104.0 ns / 164.7 ns
1048576 : 154.8 ns / 216.7 ns
2097152 : 187.0 ns / 241.8 ns
4194304 : 203.7 ns / 252.2 ns
8388608 : 214.2 ns / 260.3 ns
16777216 : 225.6 ns / 273.7 ns
33554432 : 242.0 ns / 301.4 ns
67108864 : 272.5 ns / 361.8 ns

 

 

Posted

 If you are using an older 3.4 kernel, then you can try to run "echo 1 > /sys/devices/platform/disp/graphics/fb0/blank" command before running the benchmark.

 

Is there something similar when using the mainline kernel? BTW: my Lime2 is running at a 480 MHz DRAM clock speed:

 

 

root@lime2:~/a10-dram-tools# a10-meminfo 
dram_clk          = 480
mbus_clk          = 300
dram_type         = 3
dram_rank_num     = 1
dram_chip_density = 4096
dram_io_width     = 16
dram_bus_width    = 32
dram_cas          = 9
dram_zq           = 0x7b (0x5294a00)
dram_odt_en       = 0
dram_tpr0         = 0x42d899b7
dram_tpr1         = 0xa090
dram_tpr2         = 0x22a00
dram_tpr3         = 0x0
dram_emr1         = 0x4
dram_emr2         = 0x10
dram_emr3         = 0x0
dqs_gating_delay  = 0x05060505
active_windowing  = 0 

 

 

Posted

 

[attached photo of the board showing the DRAM chip]

 

 

It's an SK Hynix chip, H5TQ4G63AFR-PBC.

Above it is an I2C RTC module I use to sync the time.

 

And here is the output of a10-meminfo (improved version):

 

 

 

 ./a10-meminfo
dram_clk          = 432
mbus_clk          = 400
dram_type         = 3
dram_rank_num     = 1
dram_chip_density = 4096
dram_io_width     = 16
dram_bus_width    = 32
dram_cas          = 9
dram_zq           = 0x7f (0x5294a00)
dram_odt_en       = 0
dram_tpr0         = 0x42d899b7
dram_tpr1         = 0xa090
dram_tpr2         = 0x22a00
dram_tpr3         = 0x0
dram_emr1         = 0x4
dram_emr2         = 0x10
dram_emr3         = 0x0
dqs_gating_delay  = 0x05050505
active_windowing  = 0 

 

 

 

And here is the output of tinymembench:

 

 

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    238.4 MB/s (0.5%)
 C copy backwards (32 byte blocks)                    :    671.9 MB/s (1.1%)
 C copy backwards (64 byte blocks)                    :    681.0 MB/s (0.7%)
 C copy                                               :    535.3 MB/s (0.2%)
 C copy prefetched (32 bytes step)                    :    586.9 MB/s (0.7%)
 C copy prefetched (64 bytes step)                    :    601.3 MB/s (1.4%)
 C 2-pass copy                                        :    524.5 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    529.4 MB/s (0.4%)
 C 2-pass copy prefetched (64 bytes step)             :    525.5 MB/s (0.3%)
 C fill                                               :   1655.1 MB/s (1.1%)
 C fill (shuffle within 16 byte blocks)               :   1659.7 MB/s (1.3%)
 C fill (shuffle within 32 byte blocks)               :    321.6 MB/s (1.9%)
 C fill (shuffle within 64 byte blocks)               :    345.3 MB/s (1.9%)
 ---
 standard memcpy                                      :    517.3 MB/s (1.0%)
 standard memset                                      :   1644.4 MB/s (0.6%)
 ---
 NEON read                                            :   1039.0 MB/s (1.4%)
 NEON read prefetched (32 bytes step)                 :   1087.6 MB/s (0.8%)
 NEON read prefetched (64 bytes step)                 :   1104.4 MB/s (1.1%)
 NEON read 2 data streams                             :    351.8 MB/s (0.8%)
 NEON read 2 data streams prefetched (32 bytes step)  :    666.7 MB/s (0.8%)
 NEON read 2 data streams prefetched (64 bytes step)  :    705.1 MB/s (0.4%)
 NEON copy                                            :    533.8 MB/s (0.4%)
 NEON copy prefetched (32 bytes step)                 :    578.8 MB/s (1.1%)
 NEON copy prefetched (64 bytes step)                 :    594.1 MB/s (1.3%)
 NEON unrolled copy                                   :    532.4 MB/s (0.7%)
 NEON unrolled copy prefetched (32 bytes step)        :    532.0 MB/s (0.4%)
 NEON unrolled copy prefetched (64 bytes step)        :    536.4 MB/s (0.6%)
 NEON copy backwards                                  :    693.6 MB/s (1.0%)
 NEON copy backwards prefetched (32 bytes step)       :    718.3 MB/s (0.9%)
 NEON copy backwards prefetched (64 bytes step)       :    744.9 MB/s (1.2%)
 NEON 2-pass copy                                     :    528.9 MB/s (0.7%)
 NEON 2-pass copy prefetched (32 bytes step)          :    532.0 MB/s (0.4%)
 NEON 2-pass copy prefetched (64 bytes step)          :    535.1 MB/s (0.9%)
 NEON unrolled 2-pass copy                            :    514.9 MB/s (0.2%)
 NEON unrolled 2-pass copy prefetched (32 bytes step) :    522.3 MB/s (0.8%)
 NEON unrolled 2-pass copy prefetched (64 bytes step) :    524.2 MB/s
 NEON fill                                            :   1660.5 MB/s (1.0%)
 NEON fill backwards                                  :   1726.6 MB/s (1.6%)
 VFP copy                                             :    534.7 MB/s (1.2%)
 VFP 2-pass copy                                      :    524.2 MB/s (0.9%)
 ARM fill (STRD)                                      :   1638.8 MB/s (1.4%)
 ARM fill (STM with 8 registers)                      :   1660.5 MB/s (0.7%)
 ARM fill (STM with 4 registers)                      :   1661.8 MB/s (0.3%)
 ARM copy prefetched (incr pld)                       :    567.5 MB/s (0.4%)
 ARM copy prefetched (wrap pld)                       :    594.5 MB/s (1.9%)
 ARM 2-pass copy prefetched (incr pld)                :    518.7 MB/s (0.3%)
 ARM 2-pass copy prefetched (wrap pld)                :    518.3 MB/s (0.3%)

==========================================================================
== Framebuffer read tests.                                              ==
==                                                                      ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled.       ==
== Writes to such framebuffers are quite fast, but reads are much       ==
== slower and very sensitive to the alignment and the selection of      ==
== CPU instructions which are used for accessing memory.                ==
==                                                                      ==
== Many x86 systems allocate the framebuffer in the GPU memory,         ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover,    ==
== PCI-E is asymmetric and handles reads a lot worse than writes.       ==
==                                                                      ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer    ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall    ==
== performance improvement. For example, the xf86-video-fbturbo DDX     ==
== uses this trick.                                                     ==
==========================================================================

 NEON read (from framebuffer)                         :     49.1 MB/s (0.4%)
 NEON copy (from framebuffer)                         :     48.8 MB/s (1.4%)
 NEON 2-pass copy (from framebuffer)                  :     48.2 MB/s (0.5%)
 NEON unrolled copy (from framebuffer)                :     47.7 MB/s (0.3%)
 NEON 2-pass unrolled copy (from framebuffer)         :     47.6 MB/s (0.3%)
 VFP copy (from framebuffer)                          :    260.7 MB/s
 VFP 2-pass copy (from framebuffer)                   :    270.1 MB/s (0.3%)
 ARM copy (from framebuffer)                          :    178.1 MB/s (0.6%)
 ARM 2-pass copy (from framebuffer)                   :    163.8 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.5 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.5 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.5 ns
     65536 :    6.5 ns          /    11.4 ns
    131072 :   10.5 ns          /    16.2 ns
    262144 :   15.7 ns          /    24.7 ns
    524288 :  107.3 ns          /   166.7 ns
   1048576 :  156.1 ns          /   215.9 ns
   2097152 :  187.9 ns          /   240.9 ns
   4194304 :  204.3 ns          /   251.5 ns
   8388608 :  215.0 ns          /   259.8 ns
  16777216 :  228.0 ns          /   274.9 ns
  33554432 :  244.6 ns          /   305.7 ns
  67108864 :  272.7 ns          /   361.9 ns

 

 

Posted

Could you please provide tinymembench numbers made with 384 MHz? It's pretty easy since it's just

 

Sorry, I missed this over the weekend when I was answering from home rather than work, across both the forum and the pull request. Not sure if it's still of use to you, but here are the 384 MHz numbers:

 

 

 

root@lime2:~/tinymembench# ./tinymembench 

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

 

==========================================================================

== Memory bandwidth tests                                               ==

==                                                                      ==

== Note 1: 1MB = 1000000 bytes                                          ==

== Note 2: Results for 'copy' tests show how many bytes can be          ==

==         copied per second (adding together read and writen           ==

==         bytes would have provided twice higher numbers)              ==

== Note 3: 2-pass copy means that we are using a small temporary buffer ==

==         to first fetch data into it, and only then write it to the   ==

==         destination (source -> L1 cache, L1 cache -> destination)    ==

== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==

==         brackets                                                     ==

==========================================================================

 

 C copy backwards                                     :    218.2 MB/s

 C copy backwards (32 byte blocks)                    :    643.6 MB/s

 C copy backwards (64 byte blocks)                    :    660.4 MB/s (0.2%)

 C copy                                               :    661.1 MB/s (0.2%)

 C copy prefetched (32 bytes step)                    :    692.3 MB/s

 C copy prefetched (64 bytes step)                    :    721.4 MB/s (0.2%)

 C 2-pass copy                                        :    655.1 MB/s (0.2%)

 C 2-pass copy prefetched (32 bytes step)             :    702.7 MB/s (0.2%)

 C 2-pass copy prefetched (64 bytes step)             :    715.9 MB/s

 C fill                                               :   2024.9 MB/s (0.3%)

 C fill (shuffle within 16 byte blocks)               :   2022.6 MB/s

 C fill (shuffle within 32 byte blocks)               :    281.8 MB/s

 C fill (shuffle within 64 byte blocks)               :    283.4 MB/s

 ---

 standard memcpy                                      :    442.0 MB/s (0.3%)

 standard memset                                      :   2024.2 MB/s (0.4%)

 ---

 NEON read                                            :   1088.3 MB/s (0.2%)

 NEON read prefetched (32 bytes step)                 :   1229.0 MB/s

 NEON read prefetched (64 bytes step)                 :   1236.1 MB/s (0.2%)

 NEON read 2 data streams                             :    308.0 MB/s

 NEON read 2 data streams prefetched (32 bytes step)  :    585.0 MB/s

 NEON read 2 data streams prefetched (64 bytes step)  :    616.7 MB/s

 NEON copy                                            :    676.7 MB/s (0.2%)

 NEON copy prefetched (32 bytes step)                 :    750.9 MB/s (0.3%)

 NEON copy prefetched (64 bytes step)                 :    760.8 MB/s (0.2%)

 NEON unrolled copy                                   :    740.9 MB/s (0.3%)

 NEON unrolled copy prefetched (32 bytes step)        :    713.5 MB/s (0.2%)

 NEON unrolled copy prefetched (64 bytes step)        :    732.4 MB/s (0.2%)

 NEON copy backwards                                  :    656.7 MB/s (0.2%)

 NEON copy backwards prefetched (32 bytes step)       :    733.4 MB/s (0.3%)

 NEON copy backwards prefetched (64 bytes step)       :    759.0 MB/s (0.3%)

 NEON 2-pass copy                                     :    692.1 MB/s (0.2%)

 NEON 2-pass copy prefetched (32 bytes step)          :    732.7 MB/s (0.3%)

 NEON 2-pass copy prefetched (64 bytes step)          :    754.8 MB/s (0.8%)

 NEON unrolled 2-pass copy                            :    618.5 MB/s

 NEON unrolled 2-pass copy prefetched (32 bytes step) :    587.1 MB/s (0.1%)

 NEON unrolled 2-pass copy prefetched (64 bytes step) :    622.9 MB/s (0.2%)

 NEON fill                                            :   2021.7 MB/s

 NEON fill backwards                                  :   2025.3 MB/s (0.3%)

 VFP copy                                             :    741.7 MB/s

 VFP 2-pass copy                                      :    628.0 MB/s (0.2%)

 ARM fill (STRD)                                      :   2024.2 MB/s (0.4%)

 ARM fill (STM with 8 registers)                      :   2024.6 MB/s (0.3%)

 ARM fill (STM with 4 registers)                      :   2025.4 MB/s (0.4%)

 ARM copy prefetched (incr pld)                       :    708.7 MB/s (0.2%)

 ARM copy prefetched (wrap pld)                       :    695.8 MB/s (0.2%)

 ARM 2-pass copy prefetched (incr pld)                :    703.1 MB/s (0.2%)

 ARM 2-pass copy prefetched (wrap pld)                :    673.9 MB/s (0.2%)

 

==========================================================================

== Memory latency test                                                  ==

==                                                                      ==

== Average time is measured for random memory accesses in the buffers   ==

== of different sizes. The larger is the buffer, the more significant   ==

== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==

== accesses. For extremely large buffer sizes we are expecting to see   ==

== page table walk with several requests to SDRAM for almost every      ==

== memory access (though 64MiB is not nearly large enough to experience ==

== this effect to its fullest).                                         ==

==                                                                      ==

== Note 1: All the numbers are representing extra time, which needs to  ==

==         be added to L1 cache latency. The cycle timings for L1 cache ==

==         latency can be usually found in the processor documentation. ==

== Note 2: Dual random read means that we are simultaneously performing ==

==         two independent memory accesses at a time. In the case if    ==

==         the memory subsystem can't handle multiple outstanding       ==

==         requests, dual random read has the same timings as two       ==

==         single reads performed one after another.                    ==

==========================================================================

 

block size : single random read / dual random read

      1024 :    0.0 ns          /     0.0 ns 

      2048 :    0.0 ns          /     0.0 ns 

      4096 :    0.0 ns          /     0.0 ns 

      8192 :    0.0 ns          /     0.0 ns 

     16384 :    0.0 ns          /     0.0 ns 

     32768 :    0.0 ns          /     0.0 ns 

     65536 :    6.6 ns          /    11.3 ns 

    131072 :   10.2 ns          /    15.9 ns 

    262144 :   25.3 ns          /    44.7 ns 

    524288 :  118.3 ns          /   187.8 ns 

   1048576 :  175.8 ns          /   246.7 ns 

   2097152 :  211.5 ns          /   274.5 ns 

   4194304 :  230.3 ns          /   286.2 ns 

   8388608 :  242.4 ns          /   295.4 ns 

  16777216 :  254.4 ns          /   310.0 ns 

  33554432 :  274.2 ns          /   344.0 ns 

  67108864 :  308.0 ns          /   411.1 ns

 

 

Posted
I installed my a20lime2 in October 2015 (it was Armbian_4.5_Lime2_Debian_jessie_3.4.109) and, although it is not highly loaded, it has worked since then without a single problem or crash. So I was quite astonished by the problems reported in this thread. I frequently updated the system using aptitude, so I tried to find out why my system runs stably.

So if I understand it correctly, you don't have any reliability problems. This is not particularly surprising. For example, "only" around 20% of the Orange Pi PC boards fail when the DRAM is clocked at 672 MHz: see https://linux-sunxi.org/Orange_Pi_PC#DRAM_clock_speed_limit

Not every board is exactly identical. Some of them are overclockable, others not so much. The question is whether the happy 80% majority is willing to let these 20% of losers suffer ;)

Posted
It's an SK Hynix chip, H5TQ4G63AFR-PBC.

Above it is an I2C RTC module I use to sync the time.

 

And here is the output of a10-meminfo (improved version):

 ./a10-meminfo
dram_clk          = 432
mbus_clk          = 400
dram_type         = 3
dram_rank_num     = 1
dram_chip_density = 4096
dram_io_width     = 16
dram_bus_width    = 32
dram_cas          = 9
dram_zq           = 0x7f (0x5294a00)
dram_odt_en       = 0
dram_tpr0         = 0x42d899b7
dram_tpr1         = 0xa090
dram_tpr2         = 0x22a00
dram_tpr3         = 0x0
dram_emr1         = 0x4
dram_emr2         = 0x10
dram_emr3         = 0x0
dqs_gating_delay  = 0x05050505
active_windowing  = 0 

I see that the MBUS clock speed is set to 400MHz instead of 300MHz. This should work fine, as long as the DCDC3 voltage is set to 1.30V instead of 1.25V

You can check the DCDC3 voltage information in the dmesg log. And there was a patch for the 3.4 kernel to ensure that the kernel does not try to change the DCDC3 voltage, previously configured by the bootloader - https://github.com/linux-sunxi/linux-sunxi/commit/5052b83aa44dc16d6662d8d9d936166c139ad8c5
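
A quick way to pull just the relevant regulator lines out of the log on the 3.4 sunxi kernel (the axp20_buck3 regulator corresponds to DCDC3 on these kernels):

dmesg | grep -E 'axp20_(buck|ldo)'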

 

Please note that the mainline U-Boot currently uses a 300MHz MBUS clock speed for Cubieboard2 / Cubietruck. There were so many broken 3.4 kernel forks that it was better to be safe than sorry :)

 

Regarding the DRAM performance with different DRAM / MBUS clock speeds, some information can be found here: https://linux-sunxi.org/A10_DRAM_Controller_Performance

Posted

Is there something similar when using the mainline kernel?

The mainline kernel with simplefb just uses the framebuffer passed over from the bootloader, but can't change its configuration. For the kernel it is just a memory buffer where one can write pixel data. The framebuffer setup is done in the U-Boot bootloader. I believe that the framebuffer is initialized by U-Boot only when something is detected to be connected to HDMI and queried via EDID, but I may be wrong.
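
A couple of generic ways to see whether such a simplefb was actually handed over by U-Boot on a mainline kernel (assuming framebuffer support is compiled in):

dmesg | grep -i 'simple-framebuffer'
cat /proc/fb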

Posted

I see that the MBUS clock speed is set to 400MHz instead of 300MHz. This should work fine, as long as the DCDC3 voltage is set to 1.30V instead of 1.25V

You can check the DCDC3 voltage information in the dmesg log. And there was a patch for the 3.4 kernel to ensure that the kernel does not try to change the DCDC3 voltage, previously configured by the bootloader - https://github.com/linux-sunxi/linux-sunxi/commit/5052b83aa44dc16d6662d8d9d936166c139ad8c5

 

Please note that the mainline U-Boot currently uses a 300MHz MBUS clock speed for Cubieboard2 / Cubietruck. There were so many broken 3.4 kernel forks that it was better to be safe than sorry :)

 

Regarding the DRAM performance with different DRAM / MBUS clock speeds, some information can be found here: https://linux-sunxi.org/A10_DRAM_Controller_Performance

 

 

I think Igor took care of it:

[    1.529300] axp20_ldo1: 1300 mV 
[    1.534695] usb 2-1: new high-speed USB device number 2 using sw-ehci
[    1.539955] axp20_ldo2: 1800 <--> 3300 mV at 3000 mV 
[    1.545230] axp20_ldo3: 700 <--> 3500 mV at 2800 mV 
[    1.550356] axp20_ldo4: 1250 <--> 3300 mV at 2800 mV 
[    1.555716] axp20_buck2: 700 <--> 2275 mV at 1450 mV 
[    1.571542] axp20_buck3: 700 <--> 3500 mV at 1300 mV 
[    1.576398] axp20_ldoio0: 1800 <--> 3300 mV at 2800 mV 

I used the mainline U-Boot to build my U-Boot. The only thing I changed was the DRAM speed, to 432 MHz.

I wonder why the MBUS is running at 400 MHz.

Posted
I used the mainline U-Boot to build my U-Boot. The only thing I changed was the DRAM speed, to 432 MHz.

I wonder why the MBUS is running at 400 MHz.

Are you sure about this? Which U-Boot repository are you using?

 

If I build the mainline U-Boot for Cubieboard2 via

 

    git clone git://git.denx.de/u-boot.git
    cd u-boot
    make CROSS_COMPILE=arm-linux-gnueabihf- Cubieboard2_defconfig
    make CROSS_COMPILE=arm-linux-gnueabihf- -j8

Then the ".config" file contains CONFIG_DRAM_MBUS_CLK=300. And everyone else seems to have 300MHz MBUS clock speed, based on the logs posted here.

Posted

But yes, the DCDC3 voltage seems to be correctly set to 1.30V, so the high MBUS clock speed should be OK.

 

The culprit is probably the SK Hynix DRAM chips. I think that the Cubieboard2 was initially developed and tested with DRAM chips from GT, but the change to a different DRAM chip vendor may now require different, more conservative settings. Please try to run lima-memtester at the 432, 456 and 480 MHz DRAM clock speeds and find the last clock speed where the test passes and the first one where it starts failing.

Posted

Are you sure about this? Which U-Boot repository are you using?

 

If I build the mainline U-Boot for Cubieboard2 via

    git clone git://git.denx.de/u-boot.git
    cd u-boot
    make CROSS_COMPILE=arm-linux-gnueabihf- Cubieboard2_defconfig
    make CROSS_COMPILE=arm-linux-gnueabihf- -j8

Then the ".config" file contains CONFIG_DRAM_MBUS_CLK=300. And everyone else seems to have 300MHz MBUS clock speed, based on the logs posted here.

 

I'm using this repository: https://github.com/linux-sunxi/u-boot-sunxi

It's the only one I managed to get working on my Cubie.

Is it dangerous to run the MBUS at 400 MHz @ 1300 mV? Can this cause any damage to the board?

Also, it looks like I don't understand what "mainline" means. Could you please explain it to me? I read "mainline" many times here, but I don't know if they all refer to the same thing.

 

 

But yes, the DCDC3 voltage seems to be correctly set to 1.30V, so the high MBUS clock speed should be OK.

 

The culprit is probably the SK Hynix DRAM chips. I think that the Cubieboard2 was initially developed and tested with DRAM chips from GT, but the change to a different DRAM chip vendor may now require different, more conservative settings. Please try to run lima-memtester at the 432, 456 and 480 MHz DRAM clock speeds and find the last clock speed where the test passes and the first one where it starts failing.

 

I will test this on the weekend. This week I have some tests at university and I will be very busy.

I will only need to test 456, since 432 and 480 have already been tested.

Posted

Hi,

 

So if I understand it correctly, you don't have any reliability problems. This is not particularly surprising. For example, "only" around 20% of the Orange Pi PC boards fail when the DRAM is clocked at 672 MHz: see https://linux-sunxi.org/Orange_Pi_PC#DRAM_clock_speed_limit

Not every board is exactly identical. Some of them are overclockable, others not so much. The question is whether the happy 80% majority is willing to let these 20% of losers suffer ;)

 

From what I read in the GitHub PR, it seems I am among the lucky 10% with an a20lime2 board running reliably at 480 MHz :D. But I always prefer reliability to speed, so I have no problem with the configuration being changed so that 100% of the boards run reliably! My board even gets more reliable then ;).

 

Bye,

wahlm

Posted

I'm using this repository: https://github.com/linux-sunxi/u-boot-sunxi

It's the only one I managed to get working on my Cubie.

It is very old, and all the development moved to the mainline U-Boot at least 2 years ago. The mainline U-Boot should work fine, and I believe that it is what Armbian already uses.

 

Is it dangerous to run the MBUS at 400 MHz @ 1300 mV? Can this cause any damage to the board?

MBUS should work at 400MHz too, I believe that this information is from Allwinner - https://irclog.whitequark.org/linux-sunxi/2013-11-06#5472690;

Moreover, the A20 datasheet specifies some voltage limits and for VDD-DLL it is 1.4V

 

Also, it looks like I don't understand what "mainline" means. Could you please explain it to me? I read "mainline" many times here, but I don't know if they all refer to the same thing.

Mainline is the only true U-Boot from http://git.denx.de/?p=u-boot.git;a=summary

All the other U-Boot variants are forked versions with local patches.

 

I will test this on the weekend. This week I have some tests at university and I will be very busy.

I will only need to test 456, since 432 and 480 have already been tested.

Did you have time to test anything?

Posted

Did you have time to test anything?

 

Yes, I tested it this weekend. At 456 MHz, the test failed after 4 hours of running. I used lima-memtester.
