
SSH speed greatly improved via ProxyCommand - but why?


Thanassis


I've just asked this question on the UNIX StackExchange - but I am hoping that the Armbian wizards may have a better idea of what is happening.

 

The setup:

 

- Machine 1 is an Arch Linux laptop, on which `ssh` is spawned, connecting to an Armbian-running SBC (an Orange Pi Zero).
- The SBC itself is connected via Ethernet to a DSL router and has the IP address 192.168.1.150.
- The laptop is connected to the same router over WiFi.
- There's also another laptop (Machine 2) connected via Ethernet to the DSL router.

 

When benchmarked with `iperf3`, the link between the laptop and the SBC delivers less than the theoretical 56 MBits/sec - as expected, since this is a WiFi connection in a very crowded 2.4GHz band *(apartment building)*.

 

More specifically: after running `iperf3 -s` on the SBC, the following commands are executed on the laptop:

 

    # iperf3 -c 192.168.1.150
    Connecting to host 192.168.1.150, port 5201
    [  5] local 192.168.1.89 port 57954 connected to 192.168.1.150 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-1.00   sec  2.99 MBytes  25.1 Mbits/sec    0    112 KBytes       
    ...
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate         Retr
    [  5]   0.00-10.00  sec  28.0 MBytes  23.5 Mbits/sec    5             sender
    [  5]   0.00-10.00  sec  27.8 MBytes  23.4 Mbits/sec                  receiver
    
    iperf Done.
    
    # iperf3 -c 192.168.1.150 -R
    Connecting to host 192.168.1.150, port 5201
    Reverse mode, remote host 192.168.1.150 is sending
    [  5] local 192.168.1.89 port 57960 connected to 192.168.1.150 port 5201
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec  3.43 MBytes  28.7 Mbits/sec                  
    ...                
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate         Retr
    [  5]   0.00-10.00  sec  39.2 MBytes  32.9 Mbits/sec  375             sender
    [  5]   0.00-10.00  sec  37.7 MBytes  31.6 Mbits/sec                  receiver

 

So basically, uploading to the SBC reaches about 24 MBits/sec, and downloading from it (`-R`) reaches about 32 MBits/sec.

Given that, let's see how SSH fares - that is, how fast this command runs on the laptop:

 

    # cat /dev/urandom | \
        pv -ptebar | \
        ssh  root@192.168.1.150 'cat >/dev/null'
    20.3MiB 0:00:52 [ 315KiB/s] [ 394KiB/s]

 

Well, that's an abysmal speed! Much slower than the expected link speed...

 

In case you are not familiar with `pv -ptebar`: it displays the current and average rate of the data going through it. In this case, we see that reading from `/dev/urandom` and sending the data over SSH to the SBC averages roughly 400 KBytes/sec - i.e. about 3.2 MBits/sec, far less than the expected 24 MBits/sec.

 

Why is our link running at 13% of its capacity?
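
For the record, here's the arithmetic behind that figure:

    # pv reports an average of ~394 KiB/s:
    #   394 * 1024 * 8 ≈ 3,227,648 bits/sec ≈ 3.2 MBits/sec
    # versus the ~24 MBits/sec that iperf3 measured for uploads:
    #   3.2 / 24 ≈ 0.13  ->  roughly 13% of the link's capacity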

 

Is it perhaps our `/dev/urandom`'s fault?

 

    # cat /dev/urandom | pv -ptebar > /dev/null
    834MiB 0:00:04 [ 216MiB/s] [ 208MiB/s]

 

Nope, definitely not.

 

Is it perhaps the SBC itself? Perhaps it's too slow to keep up?

 

Let's try running the same SSH command (i.e. sending data to the SBC), but this time from another machine (Machine 2) that is connected over Ethernet:

 

    # cat /dev/urandom | \
        pv -ptebar | \
        ssh  root@192.168.1.150 'cat >/dev/null'
    240MiB 0:00:31 [10.7MiB/s] [7.69MiB/s]

 

Nope, this works fine - the SSH daemon on the SBC can easily handle the ~11 MBytes/sec (i.e. roughly the 100 MBits/sec) that its Ethernet link provides.

And is the CPU of the SBC loaded while doing this?

 

    dstat -clnv --fs --vm --top-bio --top-cpu --top-io
    Terminal width too small, trimming output.
    ----total-cpu-usage---- ---load-avg--- -net/total- ---procs--- ------memory-usage----- ---paging-- -dsk/total- ---system-- --filesystem->
    usr sys idl wai hiq siq| 1m   5m  15m | recv  send|run blk new| used  buff  cach  free|  in   out | read  writ| int   csw |files  inodes>
      2   1  97   0   0   0|0.56 0.19 0.16|   0     0 |  0   0 3.2|76.4M 36.5M  250M  132M|   0     6B|  14k   17k| 480   481 |  864  13778 >
     14   7  76   0   0   3|0.60 0.20 0.16|8188k  201k|1.0   0   0|76.5M 36.5M  250M  131M|   0     0 |   0     0 |9730  4647 |  864  13789 >
     13   4  78   0   0   4|0.60 0.20 0.16|6223k  151k|1.0   0   0|76.4M 36.5M  250M  132M|   0     0 |   0     0 |7378  2869 |  864  13789 >
     20   3  73   0   0   4|0.60 0.20 0.16|  10M  160k|1.0   0   0|77.6M 36.5M  250M  130M|   0     0 |   0     0 |  10k 2285 |  864  13790 >
     28  10  58   0   0   3|0.60 0.20 0.16|8025k  163k|1.0   0  15|76.5M 36.5M  250M  131M|   0     0 |   0     0 |9946  4888 |  896  13789 >
     16   5  75   0   0   4|0.60 0.20 0.16|8173k  194k|1.0   0   0|76.6M 36.5M  250M  131M|   0     0 |   0     0 |9500  4125 |  896  13789 >
     16   3  75   0   0   6|0.55 0.20 0.16|8544k  204k|1.0   0   0|76.8M 36.5M  250M  131M|   0     0 |   0    56k|  10k 4289 |  896  13789 >
     13   6  76   0   0   5|0.55 0.20 0.16|8478k  203k|1.0   0   0|76.7M 36.5M  250M  131M|   0     0 |   0     0 |9692  5185 |  896  13789 >
     19   4  73   0   0   4|0.55 0.20 0.16|8786k  209k|1.0   0   0|76.3M 36.5M  250M  132M|   0     0 |   0     0 |  10k 3202 |  896  13789 >
     27   8  61   0   0   3|0.55 0.20 0.16|8254k  199k|1.0   0  15|76.6M 36.5M  250M  131M|   0     0 |   0     0 |  11k 3632 |  928  13789 >
     15   4  75   0   0   5|0.55 0.20 0.16|8035k  192k|1.0   0   0|76.5M 36.5M  250M  131M|   0     0 |   0     0 |9286  5008 |  928  13789 >
     16   7  74   0   0   3|0.58 0.21 0.17|8582k  205k|1.0   0   0|76.6M 36.5M  250M  131M|   0     0 |   0     0 |9927  4507 |  928  13789 >
     17   3  75   0   0   5|0.58 0.21 0.17|8589k  204k|1.0   0   0|76.6M 36.5M  250M  131M|   0     0 |   0     0 |9952  4272 |  928  13789 >

 

Nope - it's easily handling it.

 

So network-wise (as per `iperf3`) we should be able to go 10x faster, the SBC's CPU can easily accommodate the load... and no other kind of I/O (e.g. drives) is involved.
 
What the heck is happening?

 

Let's try plain old netcat connections - do they run as fast as we'd expect?

 

In the SBC:

 

    # nc -l -p 9988 | pv -ptebar > /dev/null

 

In the laptop:

 

    # cat /dev/urandom | pv -ptebar | nc 192.168.1.150 9988
    117MiB 0:00:33 [3.82MiB/s] [3.57MiB/s]

 

It works! And it runs at the expected speed - much better, roughly 10x better, than SSH managed.

 

So what happens if I run SSH with a ProxyCommand that uses nc?

 

    # cat /dev/urandom | \
        pv -ptebar | \
        ssh -o "ProxyCommand nc %h %p" root@192.168.1.150 'cat >/dev/null'
    101MiB 0:00:30 [3.38MiB/s] [3.33MiB/s]

 

Works! 10x speed.

 

Now I am a bit confused - when using a "naked" `nc` as the `ProxyCommand`, aren't you basically doing the exact same thing that SSH does itself, i.e. creating a socket, connecting to the SBC's port 22, and then shoveling the SSH protocol over it?
   
Why is there this huge difference in the resulting speed?

 

P.S. This was not an *academic exercise* - my `borg` backup runs 10 times faster because of this. I just don't know why :-)
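
For reference, this is one way to pass the same ProxyCommand trick to those tools (the directory and repository paths below are just placeholders):

    # rsync: pass the proxied ssh via -e/--rsh
    rsync -a -e 'ssh -o "ProxyCommand nc %h %p"' mydir/ root@192.168.1.150:/backup/
    
    # borgbackup: point BORG_RSH at the same ssh invocation
    BORG_RSH='ssh -o "ProxyCommand nc %h %p"' \
        borg create ssh://root@192.168.1.150/backup/repo::archive mydir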


Same result with mbuffer (assuming you meant using it right after "cat" when you referred to using it in the laptop):

 

    $ cat /dev/urandom | mbuffer | ssh  root@192.168.1.150 'cat >/dev/null'
    in @  0.0 kiB/s, out @  0.0 kiB/s, 9216 kiB total, buffer  99% full^C
    mbuffer: error: outputThread: error writing to <stdout> at offset 0x9a3000: Broken pipe
    mbuffer: warning: error during output to <stdout>: Broken pipe
    summary: 9216 kiByte in 24.1sec - average of  382 kiB/s
    
    $ cat /dev/urandom | mbuffer | ssh -o "ProxyCommand nc %h %p" root@192.168.1.150 'cat >/dev/null'
    in @  0.0 kiB/s, out @  0.0 kiB/s, 44.0 MiB total, buffer  99% full^C
    packet_write_wait: Connection to UNKNOWN port 65535: Broken pipe
    mbuffer: error: outputThread: error writing to <stdout> at offset 0x2cfc000: Broken pipe
    mbuffer: warning: error during output to <stdout>: Broken pipe
    summary: 44.0 MiByte in 22.1sec - average of 2037 kiB/s

 

 

This was just a quick test, hitting Ctrl-C after around 20sec.

As you can see, the ProxyCommand makes it much faster...


10 minutes ago, Thanassis said:

assuming you meant using it right after "cat" when you referred to using it in the laptop

 

IMO the best idea is to do a web search for 'mbuffer ssh' for some examples and explanations. Most probably mbuffer works fine for your use case without SSH being involved at all.
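
For reference, a minimal sketch of what that could look like - mbuffer's built-in network mode, with no SSH involved (reusing port 9988 from the nc test above):

    # On the SBC (receiver): listen on a TCP port
    mbuffer -I 9988 > /dev/null
    
    # On the laptop (sender): connect to it directly
    cat /dev/urandom | pv -ptebar | mbuffer -O 192.168.1.150:9988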


Thanks, tkaiser.

 

My use case was backup scenarios via borgbackup/rsync - and while debugging the slow speeds I was experiencing, I noticed that SSH (which both tools use for the remote connection) performs abysmally - far slower than any direct socket connection.

 

I tried disabling compression (`-o "Compression No"`) and disabling encryption (`-o "Cipher None"`) - nothing, same abysmal speeds. So I then reproduced the issue outside the backup tools, with a simple ssh piping into cat - that's why I used that example in the report (it is much easier to report than explaining borg or rsync, and it excludes a lot of variables).

 

My point being that the example above is not my actual use case - indeed, if I just wanted to ferry data across I'd simply "tar | nc", as I've been doing for years. But these tools (both rsync and borgbackup) use SSH as their transport mechanism - and as you saw in the report above, for reasons as yet unknown, SSH falls on its face when my laptop (Arch) speaks to the OPi Zero (Armbian). The ProxyCommand should not have made any difference - except perhaps for buffering somehow differently? Then again, mbuffer doesn't seem to impact things at all, with various sizes and used in both sender and receiver:

 

    $ for i in 8 16 32 64 128 ; do echo -e "\n Testing with buffer of ${i}MB\n " ; \
          cat /dev/urandom | dd bs=1M count=20 2>/dev/null | \
          mbuffer -s 1M -m ${i}M | ssh root@192.168.1.150 "mbuffer -s 1M -m ${i}M >/dev/null" ; done
    
     Testing with buffer of 8MB
     
    in @  0.0 kiB/s, out @ 4091 kiB/s, 2048 kiB total, buffer   0% full
    summary: 2560 kiByte in  2.4sec - average of 1054 kiB/s
    in @  0.0 KiB/s, out @  0.0 KiB/s, 2048 KiB total, buffer   0% full
    summary: 2560 KiByte in  5.9sec - average of  435 KiB/s
    
     Testing with buffer of 16MB
     
    in @  0.0 kiB/s, out @  0.0 kiB/s, 2048 kiB total, buffer   0% full
    summary: 2688 kiByte in  3.5sec - average of  771 kiB/s
    in @  0.0 KiB/s, out @  0.0 KiB/s, 2048 KiB total, buffer   0% full
    summary: 2688 KiByte in  6.4sec - average of  420 KiB/s
    
     Testing with buffer of 32MB
     
    in @  0.0 kiB/s, out @ 4091 kiB/s, 2048 kiB total, buffer   0% full
    summary: 2688 kiByte in  3.0sec - average of  885 kiB/s
    in @  0.0 KiB/s, out @  0.0 KiB/s, 2048 KiB total, buffer   0% full
    summary: 2688 KiByte in  3.6sec - average of  752 KiB/s
    
     Testing with buffer of 64MB
     
    in @  0.0 kiB/s, out @  0.0 kiB/s, 2048 kiB total, buffer   0% full
    summary: 2560 kiByte in  3.7sec - average of  685 kiB/s
    in @  0.0 KiB/s, out @  0.0 KiB/s, 2048 KiB total, buffer   0% full
    summary: 2560 KiByte in  4.4sec - average of  583 KiB/s
    
     Testing with buffer of 128MB
     
    in @  0.0 kiB/s, out @  0.0 kiB/s, 2048 kiB total, buffer   0% full
    summary: 2560 kiByte in  3.6sec - average of  705 kiB/s
    in @  0.0 KiB/s, out @  0.0 KiB/s, 2048 KiB total, buffer   0% full
    summary: 2560 KiByte in  5.3sec - average of  480 KiB/s
    

In comparison, telling SSH to use nc as the ProxyCommand - and removing mbuffer from the receiver, so that only one instance exists (on the transmitter) - gives almost an order of magnitude faster transfers, which aren't impacted by the sender's mbuffer at all (all runs report approx. 2 MBytes/sec):

 

    $ for i in 8 16 32 64 128 ; do echo -e "\n Testing with buffer of ${i}MB\n " ; \
          cat /dev/urandom | dd bs=1M count=50 2>/dev/null | \
          mbuffer -s 1M -m ${i}M | ssh -o "ProxyCommand nc %h %p" root@192.168.1.150 "cat >/dev/null" ; done
    
     Testing with buffer of 8MB
     
    in @  0.0 kiB/s, out @ 4091 kiB/s, 4096 kiB total, buffer  25% full
    summary: 6400 kiByte in  3.1sec - average of 2055 kiB/s
    
     Testing with buffer of 16MB
     
    in @  0.0 kiB/s, out @ 4091 kiB/s, 6144 kiB total, buffer   0% full
    summary: 6400 kiByte in  3.4sec - average of 1907 kiB/s
    
     Testing with buffer of 32MB
     
    in @  0.0 kiB/s, out @ 4091 kiB/s, 4096 kiB total, buffer   6% full
    summary: 6528 kiByte in  3.6sec - average of 1830 kiB/s
    
     Testing with buffer of 64MB
     
    in @  0.0 kiB/s, out @ 4091 kiB/s, 6144 kiB total, buffer   0% full
    summary: 6400 kiByte in  3.3sec - average of 1965 kiB/s
    
     Testing with buffer of 128MB
     
    in @  0.0 kiB/s, out @ 6137 kiB/s, 5120 kiB total, buffer   1% full
    summary: 6400 kiByte in  3.3sec - average of 1932 kiB/s
    

In plain words - mbuffer is irrelevant.

The only thing that matters is the use of nc as the ProxyCommand - which shouldn't impact the speed at all, yet it does, immensely :-)

 

 

 


I went through many tests:

 

Recording packets with tcpdump and comparing the contents in Wireshark

 

    # tcpdump -i wlan0 -w good.ssh & \
         cat signature | ssh -o "ProxyCommand nc %h %p" \
            root@192.168.1.150 'cat | md5sum' ; \
         killall tcpdump
    # tcpdump -i wlan0 -w bad.ssh & \
         cat signature | ssh root@192.168.1.150 'cat | md5sum' ; \
         killall tcpdump

 

There was no difference of any importance in the recorded packets.

 

Checking for traffic shaping

 

I had no idea about this - but after looking at the `tc` manpage, I was able to verify that:

 

- `tc filter show` returns nothing
- `tc class show` returns nothing
- `tc qdisc show` returns these:

   

    qdisc noqueue 0: dev lo root refcnt 2
    qdisc noqueue 0: dev docker0 root refcnt 2
    qdisc fq_codel 0: dev wlan0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn

 

...which don't seem to differentiate between "ssh" and "nc" - in fact, I am not even sure whether traffic shaping can operate at the process level (I'd expect it to work on addresses, ports, or the Differentiated Services field in the IP header).

 

A Debian chroot, to avoid potential "cleverness" in the Arch Linux SSH client

 

Nope, same results.

 

Finally - Nagle.

 

Performing an strace on the sender...

 

    pv data | strace -T -ttt -f ssh 192.168.1.150 'cat | md5sum' 2>bad.log

 

...and looking at what exactly happens on the socket that transmits the data, I noticed this "setup" before the actual transmission starts:

 

    1522665534.007805 getsockopt(3, SOL_TCP, TCP_NODELAY, [0], [4]) = 0 <0.000025>
    1522665534.007899 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0 <0.000021>

 

This configures the SSH socket to disable Nagle's algorithm. You can Google and read all about it - but what it means is that SSH is giving priority to responsiveness over bandwidth: it instructs the kernel to transmit anything written to this socket immediately, instead of "delaying" small writes until previously sent data has been acknowledged by the remote.
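
For illustration only (this is not OpenSSH's code, just the equivalent of the call captured above), disabling Nagle on a TCP socket boils down to this:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    
    /* Equivalent of the setsockopt() seen in the strace: TCP_NODELAY=1
     * disables Nagle, so small writes are sent in their own segments
     * instead of being coalesced while earlier data is still unacknowledged. */
    static int disable_nagle(int fd)
    {
        int one = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }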

 

What this means, in plain terms, is that in its default configuration, SSH is NOT a good way to transport bulk data - not when the link used is a slow one (which is the case for many WiFi links). If "half-empty" packets are sent over the slow link, bandwidth is wasted!

 

To prove that this was indeed the culprit, I used LD_PRELOAD to intercept and "drop" this specific setsockopt call:

 

    $ cat force_nagle.c

 

    #include <stdio.h>
    #include <dlfcn.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    
    /* Pointer to the real setsockopt(), resolved lazily via dlsym(). */
    int (*osetsockopt) (int socket, int level, int option_name,
               const void *option_value, socklen_t option_len) = NULL;
    
    int setsockopt(int socket, int level, int option_name,
               const void *option_value, socklen_t option_len)
    {
        int ret;
        if (!osetsockopt) {
            osetsockopt = dlsym(RTLD_NEXT, "setsockopt");
        }
    
        /* Swallow any attempt to set TCP_NODELAY (i.e. to disable Nagle)
         * and pretend it succeeded. */
        if (option_name == TCP_NODELAY) {
            puts("No, Mr Nagle stays.");
            return 0;
        }
        /* Everything else goes through to the real setsockopt(). */
        ret = osetsockopt(socket, level, option_name, option_value, option_len);
        return ret;
    }

 

    $ gcc -fPIC -D_GNU_SOURCE -shared -o force_nagle.so force_nagle.c -ldl

    $ pv /dev/shm/data | LD_PRELOAD=./force_nagle.so ssh root@192.168.1.150 'cat >/dev/null'
    No, Mr Nagle stays.
    No, Mr Nagle stays.
     100MiB 0:00:29 [3.38MiB/s] [3.38MiB/s] [================================>] 100%   

 

There - perfect speed (3.38 MiB/s is about 28 MBits/sec - i.e. just as fast as iperf3).

 

Moral of the story

 

Never give up :-)

 

And if you do use tools like rsync or borgbackup that transport their data over SSH, and your link is a slow or high-latency one, either use ProxyCommand to switch SSH to using nc, or force SSH to keep Nagle enabled (as shown above).
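
For example, the ProxyCommand workaround can be made permanent with a host entry in `~/.ssh/config` (the alias name below is just an example):

    Host opizero
        HostName 192.168.1.150
        User root
        # nc owns the TCP connection; ssh only talks to nc over a pipe, so
        # its TCP_NODELAY setsockopt never reaches the real socket and
        # Nagle stays enabled on the wire.
        ProxyCommand nc %h %p

That, presumably, is also why the ProxyCommand made such a difference in the first place: the TCP socket belongs to nc, not to ssh.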

