wildcat_paris Posted December 29, 2015 (edited)

As tkaiser would say, the Lamobo-R1 has a terrible design but the concept is good (multiple uses are possible: networking, miscellaneous services). But let's be honest, perfection is when you are dead.

Like sleeping & coffee some weeks ago, I was wondering (rightly or wrongly) whether a DMA config for the Lamobo-R1 GMAC would help with the bandwidth issues. One CPU core sits at 100% with all the gazillions of interrupts to handle.

DMA for the A20 is now in the mainline kernel, but I am missing some technical glue. I have read an example in the kernel 4.3 documentation with a DMA config (DTS) for DWMAC (*if* applicable, not only for STMMAC, but also for A20-GMAC ???)

links:
- stmmac kernel 4.3 documentation
- dma buffer API kernel 4.3
- stmmac.txt full gmac example (if applicable for DWMAC)
- sun7i-a20.dtsi from kernel 4.3 => more dma bindings config than audio?
- sun7i-a20-gmac.txt
- kernel 4.3 device tree binding list
- kernel 4.3 dwmac-sunxi.c with static tx_coe config but no rx_coe?
- kernel 4.3 all possible feature and dma config stmicro/stmmac/common.h

I have tried the recipe provided by the Lamobo-R1 OpenWrt support crew: the FIFO buffer is not enough, and the STMMAC code has changed between 4.1 and 4.3, so the patch is not portable. I have tried the coe/roe fix with no luck.

links:
- "db260179" openwrt posts
- "db260179" last post on openwrt release for lamobo-r1
- sunxi/patches-4.1/306-dt-sun7i-lamobor1-GMAC.patch
- A20 GMAC tx_coe checksum patch? tx patch existing on Armbian
- 8192cu wifi hostapd => wpad

At least I was able to set the clock to 1 GHz. With the CPU handling all the interrupts, this gives a small bandwidth gain (the CPU is able to handle more interrupts).
NOTE: use the "performance" governor as default (note: get a fan, even if the board is not heating much; as of 2015/01/03 available at http://www.voc-electronics.com/a-37420681/gpio-extensions/picoolfan/)

    --- v4.3.3/arch/arm/boot/dts/sun7i-a20.dtsi 2015-12-24 19:45:36.704310828 +0100
    +++ v4.3.3/arch/arm/boot/dts/sun7i-a20.dtsi 2015-12-25 22:59:41.876408694 +0100
    @@ -98,9 +98,11 @@
                 device_type = "cpu";
                 reg = <0>;
                 clocks = <&cpu>;
    +            #clock-frequency = <960000000>;
                 clock-latency = <244144>; /* 8 32k periods */
                 operating-points = <
                     /* kHz uV */
    +                1008000 1450000
                     960000 1400000
                     912000 1400000
                     864000 1300000
    @@ -117,6 +119,8 @@
             cpu@1 {
                 compatible = "arm,cortex-a7";
                 device_type = "cpu";
    +            #clock-frequency = <960000000>;
    +            clocks = <&cpu>;
                 reg = <1>;
             };
         };

If only I could get my hands on the possible DTS (U-Boot? kernel?) DWMAC DMA glue, as in the kernel example (no link at hand for now, see later):
http://lxr.free-electrons.com/source/Documentation/devicetree/bindings/net/stmmac.txt

Patches to look at:
https://github.com/db260179/openwrt-bpi-r1/blob/bpi-r1-4.1.12/package/kernel/hostap-driver/patches/001-fix-txpower.patch
https://github.com/db260179/openwrt-bpi-r1/tree/bpi-r1-4.1.12/target/linux/sunxi

Edited January 3, 2016 by wildcat_paris
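For anyone wanting to try the governor note above, here is a minimal sketch of switching every core to the "performance" governor through the standard cpufreq sysfs files. The function name set_governor and the CPU_SYSFS override are made up for illustration, and whether the 1008 MHz operating point is actually offered depends on the patched device tree being in use.

```shell
#!/bin/sh
# Root of the cpufreq sysfs tree. Overridable (an assumption of this sketch)
# so the function can be tried against a scratch directory first.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

# set_governor GOVERNOR
# Write GOVERNOR into every cpuN/cpufreq/scaling_governor file that exists.
# Needs root when pointed at the real sysfs tree.
set_governor() {
    gov="$1"
    for f in "$CPU_SYSFS"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue      # skip if the glob matched nothing
        echo "$gov" > "$f"
    done
}

# Typical use on the board (as root):
#   set_governor performance
#   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```

The setting does not survive a reboot, so it would have to be reapplied from a boot script or the distribution's cpufreq configuration.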
zador.blood.stained Posted December 29, 2015

wildcat_paris wrote: "I have read an example in the kernel 4.3 doc with DMA config (DTS) for DWMAC (*if* applicable, not only for STMMAC, but also for A20-GMAC ???)"

About the A20 DMA engine (quote from here):

"AFAIK, the GMAC, USB, and SATA subsystems use their own DMA system, so they already use DMA and aren't affected by the dmaengine patches. The dmaengine patches are useful for the audio support, and could be useful for the security (encrypt/decrypt) chip support, and a few other such things."

wildcat_paris wrote: "As tkaiser would say, Lamobo-r1 has terrible design but the concept is good"

IMHO slapping a cheap switch on a board with a SoC originally designed for tablets, with a single Ethernet interface that doesn't support HW checksum offloading, and calling it a "router" is not exactly a good concept.
tkaiser Posted December 29, 2015

zador.blood.stained wrote: "IMHO slapping cheap switch on a board with SoC originally designed for tablets, with single ethernet interface that doesn't support HW checksum offloading and calling it a "router" is not exactly a good concept"

True. That's why I wrote 'The idea the R1 is based on is good'. But you're absolutely right: the SoC, the board and the single layer 2 switch for both WAN and LAN ports are wrong. The new Marvell ARMADAs with 2 or 3 independent GbE interfaces seem to be far better suited. 13 days left: https://www.indiegogo.com/projects/turris-omnia-hi-performance-open-source-router#/

You get just the board for $99 + shipping. Given the state of the R1's Wi-Fi (unusable crap) it's simply a no-brainer to throw the R1 into the bin and pledge. I believe the 500K stretch goal will also be reached, so you also get a *good* metal case for this board for the same price.

Apart from that: thanks for all the useful links. I will look through them (next year), not with the R1 in mind but focused more on the A20's GMAC and SATA performance in general (still hoping for the quad-core A20 successor Olimex spread rumours about).
Tido Posted December 30, 2015

US$ 99.- = WITHOUT: case, power supply, antennas, Wi-Fi cards and cooler. = totally useless
tkaiser Posted January 2, 2016

Tido wrote: "US$ 99.- = WITHOUT: case, power supply, antennas, Wi-Fi cards and cooler. = totally useless"

C'mon Tido, just think about it. With the R1 you also get no usable Wi-Fi (the module wasting one USB port is simply crap), no PSU and no enclosure. On top of that you have to solder a sane DC-IN solution or get the right cables, since no PSU on this planet features an appropriate connector (for the battery connector). And all commercially available enclosures ignore the thermal problems, so you have to build your own.

Also: different people, different use cases. I'm not interested in Wi-Fi right now. I want to combine the board with a 3.5" SATA disk (using this adapter) and therefore also need a special PSU solution (5V/12V).

It's a no-brainer to NOT spend 70 bucks on the Lamobo crapboard but to invest in the Omnia for a few bucks more instead.
wildcat_paris (Author) Posted January 4, 2016

A roughly 10% to 15% improvement in network bandwidth with the Lamobo-R1: small, but still something.

I was wondering why only one CPU/softirq thread was working. After reading the topic "Linux: scaling softirq among many CPU cores" (http://natsys-lab.blogspot.fr/2012/09/linux-scaling-softirq-among-many-cpu.html), both threads/CPUs are now working (thread 1 at 90%, thread 2 at 5%):

    echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
    echo 2 > /sys/class/net/eth0/queues/tx-0/xps_cpus

In addition, the A20 is patched to run @1008MHz with the "performance" governor.

Temperatures: SoC 37-39°C (fan is running @40°C => PWM @25%), AXP209 +48.0°C +5.02 V/+0.99 A, /dev/sda: SanDisk SDSSDP128G: 42°C due to the B53 chip.

Bandwidth with the performance governor is better: there is no need to wait for the CPU to scale the frequency up (as with ondemand/conservative), and the SoC doesn't heat up much more.

From a PC (http://beta.speedtest.net/result/4968734194) going through the Lamobo-R1 to the Internet:
- RX 199 Mbit/s => 230 Mbit/s, around 15% better (the Internet link RX is 500 Mbit/s MAX)
- TX 195 Mbit/s => 214 Mbit/s, so about 10% better (but the Internet link TX is only a little above 200 Mbit/s MAX)

On the Lamobo-R1 itself, throughput goes from 37 MB/s to 50 MB/s (RX = 438 Mbit/s; the Internet link RX is nominally 500 Mbit/s, usually 450-470 Mbit/s):

    gr@bpi:~$ wget -O /dev/null ftp://ftp.oleane.net/ubuntu-cd/wily/ubuntu-15.10-desktop-amd64.iso
    --2016-01-04 23:10:46-- ftp://ftp.oleane.net/ubuntu-cd/wily/ubuntu-15.10-desktop-amd64.iso
     => ‘/dev/null’
    Resolving ftp.oleane.net (ftp.oleane.net)... 194.2.0.36, 2a01:c910:0:1::c202:24
    Connecting to ftp.oleane.net (ftp.oleane.net)|194.2.0.36|:21... connected.
    Logging in as anonymous ... Logged in!
    ==> SYST ... done. ==> PWD ... done.
    ==> TYPE I ... done. ==> CWD (1) /ubuntu-cd/wily ... done.
    ==> SIZE ubuntu-15.10-desktop-amd64.iso ... 1178386432
    ==> PASV ... done. ==> RETR ubuntu-15.10-desktop-amd64.iso ... done.
    Length: 1178386432 (1.1G) (unauthoritative)

    100%[================================================================================================>] 1,178,386,432 56.7MB/s in 21s

    2016-01-04 23:11:08 (53.6 MB/s) - ‘/dev/null’ saved [1178386432]

From an Odroid XU4 through the Lamobo-R1, RX max @224 Mbit/s:

    gr@odroid:~$ wget -O /dev/null ftp://ftp.oleane.net/ubuntu-cd/wily/ubuntu-15.10-desktop-amd64.iso
    --2016-01-04 23:20:57-- ftp://ftp.oleane.net/ubuntu-cd/wily/ubuntu-15.10-desktop-amd64.iso
     => ‘/dev/null’
    Resolving ftp.oleane.net (ftp.oleane.net)... 194.2.0.36, 2a01:c910:0:1::c202:24
    Connecting to ftp.oleane.net (ftp.oleane.net)|194.2.0.36|:21... connected.
    Logging in as anonymous ... Logged in!
    ==> SYST ... done. ==> PWD ... done.
    ==> TYPE I ... done. ==> CWD (1) /ubuntu-cd/wily ... done.
    ==> SIZE ubuntu-15.10-desktop-amd64.iso ... 1178386432
    ==> PASV ... done. ==> RETR ubuntu-15.10-desktop-amd64.iso ... done.
    Length: 1178386432 (1.1G) (unauthoritative)

    ubuntu-15.10-desktop-amd64.iso 100%[==================================================================>] 1.10G 28.2MB/s in 40s

    2016-01-04 23:21:38 (27.8 MB/s) - ‘/dev/null’ saved [1178386432]
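The value written to rps_cpus and xps_cpus in the post above is a hexadecimal CPU bitmask (bit N stands for CPU N), so 2 selects CPU1 only and 3 would select both cores of the A20. A small sketch for building such a mask, with cpus_to_mask being a made-up helper name rather than anything shipped with the kernel:

```shell
#!/bin/sh
# cpus_to_mask CPU [CPU...]
# Print the hexadecimal bitmask with one bit set per listed CPU index,
# in the form expected by the rps_cpus / xps_cpus sysfs files.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# Examples:
#   cpus_to_mask 1      prints 2  (CPU1 only, as used in the post)
#   cpus_to_mask 0 1    prints 3  (both cores)
# Applying it (as root) would look like:
#   cpus_to_mask 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

With the mask set to 2, the hardware interrupt stays where it was while the receive protocol processing is steered to CPU1, which matches the uneven 90%/5% split reported above.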
zador.blood.stained Posted January 4, 2016

Tested the DTS and driver patch today on a Cubietruck, kernel 4.4-rc8, without the A20 speed patch. Maybe there is a small improvement, ~800 Mbps -> ~900 Mbps, with some extra tweaks left over from before.

Distributing TX and RX interrupts helps in synthetic tests, but in real-world scenarios (e.g. a Samba file transfer that hogs a single CPU core at 100% usage) it won't help much, and for me it even made file transfer speeds worse when I tested it before.

Edit: jumbo frames are still broken for me.
zador.blood.stained Posted January 5, 2016

Now, after thinking more about my test results and @wildcat_paris' test methods:
- an iperf3 TCP test is not the best way to test raw Ethernet performance; I'll try to do more iperf3 tests with UDP later on fresh Armbian images;
- enabling jumbo frames with "bugged_jumbo=1" didn't cause a driver lockup like it did before, so I'll have to check other things.

wildcat_paris wrote: "from PC going through the lamobo-r1 to the Internet"

Depending on your firewall setup, your speed improvements may be the result of the higher CPU frequency and not of the A20 GMAC patches, even though it counts as a "real world scenario" test.

Edit: did some tests with iperf3. For me these are the top results; the stmmac patches didn't have any noticeable effect.

TCP, from Win8.1:

    c:\Program Files\Tools>iperf3 -4 -c cubietruck.lan -i 0 -b 1000M
    Connecting to host cubietruck.lan, port 5201
    [  4] local 192.168.1.101 port 1985 connected to 192.168.1.105 port 5201
    [ ID] Interval           Transfer     Bandwidth
    [  4]   0.00-10.00  sec  1.06 GBytes   915 Mbits/sec
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bandwidth
    [  4]   0.00-10.00  sec  1.06 GBytes   915 Mbits/sec   sender
    [  4]   0.00-10.00  sec  1.06 GBytes   915 Mbits/sec   receiver

    iperf Done.

UDP, from Ubuntu Wily. I didn't bother to increase the iperf buffer sizes to decrease packet loss.

    ➜ armbian % _ iperf3 -4 -c cubietruck.lan -i 0 -b 1000M -u
    Connecting to host cubietruck.lan, port 5201
    [  4] local 192.168.1.102 port 34763 connected to 192.168.1.105 port 5201
    [ ID] Interval           Transfer     Bandwidth       Total Datagrams
    [  4]   0.00-10.00  sec  1.10 GBytes   949 Mbits/sec  144760
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
    [  4]   0.00-10.00  sec  1.10 GBytes   949 Mbits/sec  0.053 ms  53575/144760 (37%)
    [  4] Sent 144760 datagrams

    iperf Done.

Edit 2: I feel that UDP testing with such high packet loss may not be useful; I will try to redo it with higher buffer sizes.
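To put the UDP figures above into perspective, the loss ratio and a rough estimate of what actually arrived can be worked out from the Lost/Total counters. This is only a back-of-the-envelope sketch: it assumes the reported 949 Mbits/sec is the sender-side rate and that delivered throughput is simply that rate times the fraction of datagrams received. Both helper names are invented for the example.

```shell
#!/bin/sh
# udp_loss_pct LOST TOTAL
# Percentage of datagrams lost, one decimal place.
udp_loss_pct() {
    LC_ALL=C awk -v lost="$1" -v total="$2" \
        'BEGIN { printf "%.1f\n", lost / total * 100 }'
}

# udp_delivered_mbps RATE_MBPS LOST TOTAL
# Estimated delivered rate: sender rate scaled by the received fraction
# (an approximation, see the assumptions in the text above).
udp_delivered_mbps() {
    LC_ALL=C awk -v rate="$1" -v lost="$2" -v total="$3" \
        'BEGIN { printf "%.1f\n", rate * (total - lost) / total }'
}

# Figures from the iperf3 UDP run in the post:
#   udp_loss_pct 53575 144760             prints 37.0
#   udp_delivered_mbps 949 53575 144760   prints 597.8
```

Under those assumptions roughly 600 Mbit/s got through, which is well below the TCP result and supports the suspicion that the UDP run says more about buffer sizes than about the GMAC.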
wildcat_paris (Author) Posted January 5, 2016

@zador I had already patched the clock to 1 GHz before using 2 threads for I/O.

You mention stmmac patches: are they included in Armbian next/dev, or are they the changes from OpenWrt? Thanks.
zador.blood.stained Posted January 5, 2016

I tried this patch from OpenWRT. It (and other tweaks like increasing the stmmac buffer sizes) may reduce CPU load, but that is harder to measure than network speed. These patches are not present in the mainline kernel.

Just to try to identify bottlenecks in your setup, I would recommend testing network performance with iperf3:
- between the Lamobo-R1 and any device on the LAN;
- the same without any iptables rules;
- the same with an unconfigured switch (no VLANs);
- the same with the simple tweak "sudo ethtool -K eth0 gso off".

Since TCP window autoscaling is enabled by default and affects test results, I would recommend running each test at least 3 times.
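The advice to repeat each test at least three times is easy to script. The following is a sketch only: run_rounds and the IPERF3 override are hypothetical names, the host name in the usage line is a placeholder, and an iperf3 server (iperf3 -s) is assumed to be running on the target already.

```shell
#!/bin/sh
# Binary to run; overridable so the loop can be dry-run without a server.
IPERF3="${IPERF3:-iperf3}"

# run_rounds HOST [ROUNDS]
# Run the same 10-second IPv4 TCP test against HOST several times in a row
# (default 3), printing a marker line before each round.
run_rounds() {
    host="$1"
    rounds="${2:-3}"
    i=1
    while [ "$i" -le "$rounds" ]; do
        echo "== round $i/$rounds against $host =="
        "$IPERF3" -4 -c "$host" -t 10
        i=$((i + 1))
    done
}

# Usage, once per scenario from the list above (baseline, no iptables rules,
# unconfigured switch, GSO off):
#   run_rounds lamobo.lan 3
```

Comparing the three rounds within a scenario before comparing scenarios with each other helps separate real differences from run-to-run noise.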
wildcat_paris (Author) Posted January 8, 2016

@zador yes, thanks for the idea. Testing with "iperf3" (with different TCP window size values) gives the values from the OpenWRT patches. I have also tested the public servers.

"Tweaks to STMMAC and U-boot - stmmac driver: I have tweaked the driver to enable RX checksum and improve TX rate (still not full gigabit speed) to maintain a 400Mbit/s rate for TX and 900Mbit/s rate for RX."

But it only works for Internet <=> Lamobo-R1. Using the Lamobo-R1 as a "router" for other machines on the LAN limits the bandwidth to 236/205 Mbit/s (as in my previous tests), which is more or less 400 Mbit/s / 2 (+ iptables in the middle).

I will soon be testing the XU4 with an extra USB3/GMAC adapter to act as a simple router. The XU4 was my plan B for routing when I ordered it. (OK, my AMD Phenom II 965 PC with 2 GMACs works very well as a router, but its power consumption is terrible for a 24/7 machine.)

Private joke: with your spoiler, you made my "coffee pouring out of ..." my nose while laughing out loud.