Da Xue Posted March 19, 2018

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
==         copied per second (adding together read and writen ==
==         bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the ==
==         destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
==         brackets ==
==========================================================================

C copy backwards : 1760.1 MB/s (0.9%)
C copy backwards (32 byte blocks) : 1620.9 MB/s (0.7%)
C copy backwards (64 byte blocks) : 1581.7 MB/s (1.0%)
C copy : 1639.6 MB/s (0.7%)
C copy prefetched (32 bytes step) : 1280.2 MB/s
C copy prefetched (64 bytes step) : 1580.7 MB/s (0.5%)
C 2-pass copy : 1938.4 MB/s (0.4%)
C 2-pass copy prefetched (32 bytes step) : 1429.9 MB/s (0.2%)
C 2-pass copy prefetched (64 bytes step) : 1432.1 MB/s (0.3%)
C fill : 7627.5 MB/s (0.4%)
C fill (shuffle within 16 byte blocks) : 7629.8 MB/s (0.5%)
C fill (shuffle within 32 byte blocks) : 7640.7 MB/s
C fill (shuffle within 64 byte blocks) : 7635.3 MB/s (0.5%)
---
standard memcpy : 1616.5 MB/s
standard memset : 7604.0 MB/s (0.4%)
---
NEON LDP/STP copy : 1903.9 MB/s (0.3%)
NEON LDP/STP copy pldl2strm (32 bytes step) : 1479.1 MB/s (0.4%)
NEON LDP/STP copy pldl2strm (64 bytes step) : 1615.2 MB/s (0.2%)
NEON LDP/STP copy pldl1keep (32 bytes step) : 1981.1 MB/s
NEON LDP/STP copy pldl1keep (64 bytes step) : 2014.8 MB/s (0.2%)
NEON LD1/ST1 copy : 1862.7 MB/s (0.3%)
NEON STP fill : 7603.9 MB/s
NEON STNP fill : 2310.3 MB/s (0.4%)
ARM LDP/STP copy : 1931.4 MB/s (0.2%)
ARM STP fill : 7610.2 MB/s (0.8%)
ARM STNP fill : 2339.1 MB/s (0.6%)

==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if ==
==         the memory subsystem can't handle multiple outstanding ==
==         requests, dual random read has the same timings as two ==
==         single reads performed one after another. ==
==========================================================================

block size : single random read / dual random read
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.1 ns / 0.1 ns
65536 : 4.9 ns / 8.5 ns
131072 : 7.6 ns / 11.9 ns
262144 : 12.3 ns / 18.2 ns
524288 : 56.9 ns / 89.5 ns
1048576 : 84.7 ns / 119.7 ns
2097152 : 98.9 ns / 131.8 ns
4194304 : 110.7 ns / 141.9 ns
8388608 : 117.4 ns / 147.9 ns
16777216 : 122.4 ns / 152.3 ns
33554432 : 126.5 ns / 155.6 ns
67108864 : 139.5 ns / 179.4 ns

@tkaiser From Ubuntu 16.04 image running Rockchip's 4.4 kernel.
Da Xue Posted March 19, 2018

For comparison's sake, I got this from my AML-S905X-CC 2GB: https://drive.google.com/file/d/129ug_8iuMMmLqP5yKfpL-lzMUWTjWz_P/view?usp=sharing

Someone uploaded ROCK64 tinymembench results here, which I can't verify since I don't have a board: https://forum.pine64.org/showthread.php?tid=4687&pid=28879#pid28879

The Renegade's DDR4-2133 memset throughput is about 33% higher than that of the ROCK64's LPDDR3-1600, which is to be expected given the 2133/1600 ratio of the data rates.
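As a rough sanity check on the 33% figure, the sketch below compares the nominal data-rate ratio of DDR4-2133 to LPDDR3-1600 with a measured memset ratio. The Renegade number is the `standard memset` result from the first post. The ROCK64 number is taken from the aarch64 ROCK64 run posted further down in this thread and is only a stand-in, since the linked pine64 result is not reproduced here. All variable names are illustrative.

```python
# Nominal data rates in MT/s, taken from the memory type names in the post.
ddr4_rate = 2133      # Renegade (ROC-RK3328-CC), DDR4-2133
lpddr3_rate = 1600    # ROCK64, LPDDR3-1600

# 'standard memset' results in MB/s as reported by tinymembench in this thread.
renegade_memset = 7604.0   # first post, aarch64, Rockchip 4.4 kernel
rock64_memset = 5680.6     # later ROCK64 aarch64 run (assumed to be comparable)

nominal_gain = ddr4_rate / lpddr3_rate - 1
measured_gain = renegade_memset / rock64_memset - 1

print(f"nominal data-rate gain: {nominal_gain:.1%}")
print(f"measured memset gain:   {measured_gain:.1%}")
```

Both values come out at roughly one third, which is consistent with memset being limited mostly by DRAM write bandwidth on these boards.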
Kwiboo Posted March 19, 2018

@Da Xue did you run tinymembench with or without HDMI connected? Having HDMI output active has a rather big impact on memory performance.

I have added my ROCK64 arm tinymembench runs at 786 MHz vs 933 MHz without HDMI/framebuffer connected below. I will add my ROC-RK3328-CC 933 MHz/1066 MHz arm/aarch64 numbers and ROCK64 aarch64 numbers later.

ROCK64 linux v4.16-rc5 rk3328_ddr_786MHz_v1.12.bin NO-HDMI

LibreELEC (community): devel-20180315130549-r28356-g63abb08 (RK3328.arm)
LibreELEC:~ # cat /sys/kernel/debug/clk/clk_summary |grep clk_ddr
clk_ddrmon 0 0 0 24000000 0 0
pclk_ddr 3 3 0 98304000 0 0
pclk_ddr_grf 1 1 0 98304000 0 0
pclk_ddrstdby 0 0 0 98304000 0 0
pclk_ddr_mon 1 1 0 98304000 0 0
pclk_ddr_msch 1 1 0 98304000 0 0
pclk_ddrupctl 0 0 0 98304000 0 0
pclk_ddrphy 1 1 0 75000000 0 0
clk_ddr 2 2 0 1572000000 0 0
aclk_ddrupctl 0 0 0 1572000000 0 0
clk_ddrupctl 1 1 0 1572000000 0 0
clk_ddrmsch 1 1 0 1572000000 0 0
LibreELEC:~ # ./tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
==         copied per second (adding together read and writen ==
==         bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the ==
==         destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
==         brackets ==
==========================================================================

C copy backwards : 1642.7 MB/s (1.4%)
C copy backwards (32 byte blocks) : 1527.2 MB/s (1.5%)
C copy backwards (64 byte blocks) : 1568.9 MB/s (1.0%)
C copy : 1582.5 MB/s (0.8%)
C copy prefetched (32 bytes step) : 1793.4 MB/s
C copy prefetched (64 bytes step) : 1776.3 MB/s
C 2-pass copy : 1617.7 MB/s (0.2%)
C 2-pass copy prefetched (32 bytes step) : 1619.7 MB/s
C 2-pass copy prefetched (64 bytes step) : 1668.2 MB/s
C fill : 5843.3 MB/s
C fill (shuffle within 16 byte blocks) : 5843.2 MB/s
C fill (shuffle within 32 byte blocks) : 5842.9 MB/s
C fill (shuffle within 64 byte blocks) : 5842.8 MB/s
---
standard memcpy : 1765.6 MB/s (0.9%)
standard memset : 3444.7 MB/s
---
NEON read : 2797.5 MB/s (0.7%)
NEON read prefetched (32 bytes step) : 4143.6 MB/s
NEON read prefetched (64 bytes step) : 4430.5 MB/s
NEON read 2 data streams : 2511.4 MB/s
NEON read 2 data streams prefetched (32 bytes step) : 4163.7 MB/s
NEON read 2 data streams prefetched (64 bytes step) : 4399.6 MB/s
NEON copy : 1644.8 MB/s (0.3%)
NEON copy prefetched (32 bytes step) : 1782.9 MB/s
NEON copy prefetched (64 bytes step) : 1785.2 MB/s (0.2%)
NEON unrolled copy : 1816.6 MB/s (0.7%)
NEON unrolled copy prefetched (32 bytes step) : 2118.3 MB/s
NEON unrolled copy prefetched (64 bytes step) : 2067.9 MB/s
NEON copy backwards : 1805.6 MB/s (0.3%)
NEON copy backwards prefetched (32 bytes step) : 1902.1 MB/s
NEON copy backwards prefetched (64 bytes step) : 1893.7 MB/s
NEON 2-pass copy : 1770.3 MB/s
NEON 2-pass copy prefetched (32 bytes step) : 1874.8 MB/s
NEON 2-pass copy prefetched (64 bytes step) : 1898.7 MB/s
NEON unrolled 2-pass copy : 1638.9 MB/s
NEON unrolled 2-pass copy prefetched (32 bytes step) : 1567.9 MB/s
NEON unrolled 2-pass copy prefetched (64 bytes step) : 1664.5 MB/s
NEON fill : 5849.6 MB/s
NEON fill backwards : 5849.4 MB/s
VFP copy : 1762.0 MB/s (1.4%)
VFP 2-pass copy : 1767.1 MB/s
ARM fill (STRD) : 3444.6 MB/s
ARM fill (STM with 8 registers) : 5838.8 MB/s
ARM fill (STM with 4 registers) : 5102.5 MB/s
ARM copy prefetched (incr pld) : 1773.0 MB/s
ARM copy prefetched (wrap pld) : 1762.1 MB/s
ARM 2-pass copy prefetched (incr pld) : 1594.7 MB/s
ARM 2-pass copy prefetched (wrap pld) : 1592.5 MB/s

==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if ==
==         the memory subsystem can't handle multiple outstanding ==
==         requests, dual random read has the same timings as two ==
==         single reads performed one after another. ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 5.3 ns / 9.0 ns
131072 : 8.1 ns / 12.6 ns
262144 : 10.2 ns / 15.1 ns
524288 : 67.7 ns / 106.7 ns
1048576 : 103.0 ns / 142.6 ns
2097152 : 121.4 ns / 155.5 ns
4194304 : 137.1 ns / 168.1 ns
8388608 : 145.5 ns / 175.4 ns
16777216 : 151.6 ns / 181.5 ns
33554432 : 154.0 ns / 185.8 ns
67108864 : 166.0 ns / 207.5 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 5.3 ns / 8.9 ns
131072 : 8.1 ns / 12.4 ns
262144 : 10.2 ns / 14.7 ns
524288 : 67.6 ns / 106.5 ns
1048576 : 103.0 ns / 142.5 ns
2097152 : 120.9 ns / 154.9 ns
4194304 : 130.4 ns / 159.6 ns
8388608 : 135.6 ns / 161.5 ns
16777216 : 138.3 ns / 162.4 ns
33554432 : 139.7 ns / 162.8 ns
67108864 : 140.3 ns / 163.0 ns

ROCK64 linux v4.16-rc5 rk3328_ddr_933MHz_v1.12.bin NO-HDMI

LibreELEC (community): devel-20180314065621-r28356-g63abb08 (RK3328.arm)
LibreELEC:~ # cat /sys/kernel/debug/clk/clk_summary |grep clk_ddr
clk_ddrmon 0 0 0 24000000 0 0
pclk_ddr 3 3 0 98304000 0 0
pclk_ddr_grf 1 1 0 98304000 0 0
pclk_ddrstdby 0 0 0 98304000 0 0
pclk_ddr_mon 1 1 0 98304000 0 0
pclk_ddr_msch 1 1 0 98304000 0 0
pclk_ddrupctl 0 0 0 98304000 0 0
pclk_ddrphy 1 1 0 75000000 0 0
clk_ddr 2 2 0 1848000000 0 0
aclk_ddrupctl 0 0 0 1848000000 0 0
clk_ddrupctl 1 1 0 1848000000 0 0
clk_ddrmsch 1 1 0 1848000000 0 0
LibreELEC:~ # ./tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
==         copied per second (adding together read and writen ==
==         bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the ==
==         destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
==         brackets ==
==========================================================================

C copy backwards : 1847.4 MB/s (1.4%)
C copy backwards (32 byte blocks) : 1602.5 MB/s (0.9%)
C copy backwards (64 byte blocks) : 1693.4 MB/s (1.4%)
C copy : 1728.3 MB/s (1.7%)
C copy prefetched (32 bytes step) : 1864.7 MB/s
C copy prefetched (64 bytes step) : 1881.8 MB/s
C 2-pass copy : 1741.5 MB/s
C 2-pass copy prefetched (32 bytes step) : 1738.7 MB/s
C 2-pass copy prefetched (64 bytes step) : 1782.4 MB/s
C fill : 6862.1 MB/s
C fill (shuffle within 16 byte blocks) : 6861.9 MB/s
C fill (shuffle within 32 byte blocks) : 6862.2 MB/s
C fill (shuffle within 64 byte blocks) : 6862.1 MB/s
---
standard memcpy : 1780.1 MB/s (1.3%)
standard memset : 3444.8 MB/s
---
NEON read : 2944.6 MB/s (0.7%)
NEON read prefetched (32 bytes step) : 4191.8 MB/s
NEON read prefetched (64 bytes step) : 4554.5 MB/s
NEON read 2 data streams : 2841.0 MB/s
NEON read 2 data streams prefetched (32 bytes step) : 4230.7 MB/s
NEON read 2 data streams prefetched (64 bytes step) : 4572.2 MB/s
NEON copy : 1836.4 MB/s (0.5%)
NEON copy prefetched (32 bytes step) : 1948.9 MB/s (0.2%)
NEON copy prefetched (64 bytes step) : 1970.8 MB/s (0.2%)
NEON unrolled copy : 2000.7 MB/s (0.5%)
NEON unrolled copy prefetched (32 bytes step) : 2345.4 MB/s
NEON unrolled copy prefetched (64 bytes step) : 2362.7 MB/s
NEON copy backwards : 1997.4 MB/s (0.3%)
NEON copy backwards prefetched (32 bytes step) : 2089.9 MB/s
NEON copy backwards prefetched (64 bytes step) : 2086.2 MB/s (0.2%)
NEON 2-pass copy : 1910.8 MB/s
NEON 2-pass copy prefetched (32 bytes step) : 2001.1 MB/s
NEON 2-pass copy prefetched (64 bytes step) : 2093.7 MB/s
NEON unrolled 2-pass copy : 1744.3 MB/s
NEON unrolled 2-pass copy prefetched (32 bytes step) : 1574.2 MB/s
NEON unrolled 2-pass copy prefetched (64 bytes step) : 1703.4 MB/s
NEON fill : 6876.7 MB/s
NEON fill backwards : 6876.4 MB/s
VFP copy : 1950.3 MB/s (1.4%)
VFP 2-pass copy : 1886.4 MB/s
ARM fill (STRD) : 3444.9 MB/s
ARM fill (STM with 8 registers) : 6648.4 MB/s
ARM fill (STM with 4 registers) : 5115.9 MB/s
ARM copy prefetched (incr pld) : 1712.9 MB/s (0.2%)
ARM copy prefetched (wrap pld) : 1758.2 MB/s (0.3%)
ARM 2-pass copy prefetched (incr pld) : 1575.0 MB/s
ARM 2-pass copy prefetched (wrap pld) : 1574.2 MB/s

==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if ==
==         the memory subsystem can't handle multiple outstanding ==
==         requests, dual random read has the same timings as two ==
==         single reads performed one after another. ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 5.3 ns / 9.0 ns
131072 : 8.1 ns / 12.6 ns
262144 : 10.1 ns / 14.7 ns
524288 : 63.5 ns / 99.9 ns
1048576 : 96.4 ns / 133.6 ns
2097152 : 113.6 ns / 145.8 ns
4194304 : 128.4 ns / 158.4 ns
8388608 : 136.9 ns / 165.8 ns
16777216 : 141.6 ns / 171.8 ns
33554432 : 145.3 ns / 176.3 ns
67108864 : 155.5 ns / 195.7 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 5.3 ns / 8.9 ns
131072 : 8.1 ns / 12.4 ns
262144 : 10.1 ns / 14.7 ns
524288 : 63.4 ns / 99.8 ns
1048576 : 96.5 ns / 133.7 ns
2097152 : 113.2 ns / 145.4 ns
4194304 : 121.8 ns / 149.8 ns
8388608 : 126.4 ns / 151.7 ns
16777216 : 128.9 ns / 152.4 ns
33554432 : 130.2 ns / 152.9 ns
67108864 : 130.9 ns / 153.0 ns
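To put the two ROCK64 runs above side by side, here is a small sketch that computes how much a handful of results improve when going from the 786 MHz to the 933 MHz DDR blob, next to the clock increase itself. The numbers are copied from the two runs in this post; the selection of rows and the variable names are my own choice for illustration.

```python
# DDR clock settings of the two ROCK64 runs, in MHz (from the blob names).
clock_low, clock_high = 786, 933

# (786 MHz result, 933 MHz result) in MB/s, copied from the two runs above.
results = {
    "C fill": (5843.3, 6862.1),
    "NEON fill": (5849.6, 6876.7),
    "NEON read": (2797.5, 2944.6),
    "standard memcpy": (1765.6, 1780.1),
    "standard memset": (3444.7, 3444.8),
}

# Relative gain of the clock and of each benchmark row.
clock_gain = clock_high / clock_low - 1
gains = {name: high / low - 1 for name, (low, high) in results.items()}

print(f"DDR clock gain: {clock_gain:.1%}")
for name, gain in gains.items():
    print(f"{name:16s} {gain:+.1%}")
```

The fill tests scale almost in line with the DDR clock, while `standard memset` in this armhf build barely moves. It sits at the same value as the `ARM fill (STRD)` row in both runs, so it may be limited by the store pattern rather than by DRAM speed.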
Da Xue Posted March 19, 2018

clk_ddrmon 0 0 24000000 0 0
pclk_ddr 3 3 98304000 0 0
pclk_ddr_grf 1 1 98304000 0 0
pclk_ddrstdby 0 0 98304000 0 0
pclk_ddr_mon 1 1 98304000 0 0
pclk_ddr_msch 1 1 98304000 0 0
pclk_ddrupctl 0 0 98304000 0 0
pclk_ddrphy 1 1 75000000 0 0
sclk_ddrc 2 2 1056000000 0 0
aclk_ddrupctl 0 0 1056000000 0 0
clk_ddrupctl 1 1 1056000000 0 0
clk_ddrmsch 1 1 1056000000 0 0

My testing is done on Rockchip's Linux 4.4.114 kernel with DDR4 timing adjustments. I am not that familiar with tinymembench, but why are our outputs different?
Kwiboo Posted March 19, 2018

My build of tinymembench is for armhf and includes GitHub PR9 and PR10. I will post my aarch64 numbers once I have run all possible combos of arm/aarch64 + 786/933/1066 MHz + 4.4/4.16 Linux + rock64/roc-rk3328-cc with HDMI not connected.
pfeerick Posted March 19, 2018

Don't know if it is of any interest, but since it's tinymembench and a rock64, here goes :-P. Running the ayufan-xential-linux-0.7.x image with docker and stuff in the background. Compiled tinymembench with default parameters. rock64 v2 board w/ 4GB memory. No HDMI plugged in, headless box.

rock64@rock64:~/tinymembench$ uname -a
Linux rock64 4.4.114-rockchip-ayufan-193 #1 SMP Sun Mar 4 20:24:21 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux
rock64@rock64:~/tinymembench$ sudo cat /sys/kernel/debug/clk/clk_summary |grep clk_ddr
clk_ddrmon 0 0 24000000 0 0
pclk_ddr 3 3 98304000 0 0
pclk_ddr_grf 1 1 98304000 0 0
pclk_ddrstdby 0 0 98304000 0 0
pclk_ddr_mon 1 1 98304000 0 0
pclk_ddr_msch 1 1 98304000 0 0
pclk_ddrupctl 0 0 98304000 0 0
pclk_ddrphy 1 1 75000000 0 0
sclk_ddrc 2 2 786000000 0 0
aclk_ddrupctl 0 0 786000000 0 0
clk_ddrupctl 1 1 786000000 0 0
clk_ddrmsch 1 1 786000000 0 0
rock64@rock64:~/tinymembench$ ./tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
==         copied per second (adding together read and writen ==
==         bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the ==
==         destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
==         brackets ==
==========================================================================

C copy backwards : 1438.1 MB/s (1.9%)
C copy backwards (32 byte blocks) : 1449.3 MB/s (3.1%)
C copy backwards (64 byte blocks) : 1354.0 MB/s (0.3%)
C copy : 1360.7 MB/s (1.5%)
C copy prefetched (32 bytes step) : 1280.4 MB/s
C copy prefetched (64 bytes step) : 1447.5 MB/s (2.6%)
C 2-pass copy : 1645.8 MB/s (2.5%)
C 2-pass copy prefetched (32 bytes step) : 1214.7 MB/s
C 2-pass copy prefetched (64 bytes step) : 1165.4 MB/s
C fill : 5678.7 MB/s
C fill (shuffle within 16 byte blocks) : 5681.5 MB/s (3.1%)
C fill (shuffle within 32 byte blocks) : 5681.5 MB/s
C fill (shuffle within 64 byte blocks) : 5685.2 MB/s (3.0%)
---
standard memcpy : 1330.4 MB/s
standard memset : 5680.6 MB/s (3.0%)
---
NEON LDP/STP copy : 1530.9 MB/s
NEON LDP/STP copy pldl2strm (32 bytes step) : 1261.7 MB/s (2.8%)
NEON LDP/STP copy pldl2strm (64 bytes step) : 1466.2 MB/s
NEON LDP/STP copy pldl1keep (32 bytes step) : 1668.5 MB/s
NEON LDP/STP copy pldl1keep (64 bytes step) : 1674.7 MB/s (2.9%)
NEON LD1/ST1 copy : 1511.5 MB/s
NEON STP fill : 5681.3 MB/s
NEON STNP fill : 2242.6 MB/s (1.9%)
ARM LDP/STP copy : 1531.0 MB/s
ARM STP fill : 5681.9 MB/s (2.8%)
ARM STNP fill : 2251.2 MB/s (2.3%)

==========================================================================
== Framebuffer read tests. ==
== ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled. ==
== Writes to such framebuffers are quite fast, but reads are much ==
== slower and very sensitive to the alignment and the selection of ==
== CPU instructions which are used for accessing memory. ==
== ==
== Many x86 systems allocate the framebuffer in the GPU memory, ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover, ==
== PCI-E is asymmetric and handles reads a lot worse than writes. ==
== ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall ==
== performance improvement. For example, the xf86-video-fbturbo DDX ==
== uses this trick. ==
==========================================================================

NEON LDP/STP copy (from framebuffer) : 305.5 MB/s
NEON LDP/STP 2-pass copy (from framebuffer) : 288.1 MB/s
NEON LD1/ST1 copy (from framebuffer) : 80.3 MB/s
NEON LD1/ST1 2-pass copy (from framebuffer) : 79.2 MB/s
ARM LDP/STP copy (from framebuffer) : 157.3 MB/s (2.0%)
ARM LDP/STP 2-pass copy (from framebuffer) : 152.5 MB/s (1.9%)

==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if ==
==         the memory subsystem can't handle multiple outstanding ==
==         requests, dual random read has the same timings as two ==
==         single reads performed one after another. ==
==========================================================================

block size : single random read / dual random read
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.1 ns / 0.1 ns
65536 : 5.3 ns / 9.0 ns
131072 : 8.2 ns / 12.5 ns
262144 : 11.5 ns / 16.9 ns
524288 : 67.3 ns / 106.0 ns
1048576 : 101.9 ns / 143.6 ns
2097152 : 120.3 ns / 159.0 ns
4194304 : 133.9 ns / 171.1 ns
8388608 : 142.0 ns / 178.2 ns
16777216 : 147.6 ns / 183.7 ns
33554432 : 152.0 ns / 188.1 ns
67108864 : 168.1 ns / 215.8 ns
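For anyone who wants to compare these dumps without eyeballing them, here is a minimal sketch of a parser for the bandwidth rows. It assumes the tool's normal one-row-per-line terminal output saved to text, and the regular expression and the `parse_bandwidth` helper are my own, not part of tinymembench.

```python
import re

# Matches rows such as "C copy backwards : 1760.1 MB/s (0.9%)".
# The trailing "(x%)" only appears when the deviation exceeds 0.1%.
ROW_RE = re.compile(
    r"^\s*(?P<name>.+?)\s*:\s*(?P<rate>\d+(?:\.\d+)?) MB/s"
    r"(?:\s*\((?P<dev>\d+(?:\.\d+)?)%\))?\s*$"
)

def parse_bandwidth(text):
    """Return {test name: (MB/s, deviation % or None)} for bandwidth rows."""
    rows = {}
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            dev = match.group("dev")
            rows[match.group("name")] = (
                float(match.group("rate")),
                float(dev) if dev is not None else None,
            )
    return rows

# A few rows from the ROCK64 aarch64 run above, in the tool's usual layout.
sample = """\
 C copy                                               :   1360.7 MB/s (1.5%)
 standard memcpy                                      :   1330.4 MB/s
 standard memset                                      :   5680.6 MB/s (3.0%)
 NEON LD1/ST1 copy (from framebuffer)                 :     80.3 MB/s
"""
parsed = parse_bandwidth(sample)
print(parsed["standard memset"])
```

Banner lines and latency rows are skipped because they do not end in an `MB/s` value, so the same function can be fed a whole saved log.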