Helios4 - Cryptographic Engines And Security Accelerator (CESA) Benchmarking

gprovost · October 18, 2018

For a while now I've had this on my TODO list: write a guide on how to activate and use the Marvell Cryptographic Engines And Security Accelerator (CESA) on Helios4.

Previously I already shared some numbers related to the CESA engine while using @tkaiser's sbc-bench tool. I also shared some findings on the openssl support for the kernel modules (cryptodev and af_alg) that interact with the CESA engine. My conclusions were:

1. performance wise : effectively cryptodev performs slightly better than af_alg.

2. openssl / libssl support : very messy and broken, it all depends which version of openssl you use.

Since many Debian Stretch apps depend on "old" libssl (1.0.2), I felt taking the cryptodev approach was the best way since it could expose all encryption and authentication algorithms supported by the cesa engine... even though it requires some patching in openssl. Plus cryptodev implementation in new LTS openssl version 1.1.1 has been completely reworked, so long term it should be the right way.

Anyhow, I'm not going to describe the step-by-step setup here; I'm already writing a page on our wiki for that, and once it's ready I will post the link here. I also won't risk discussing the relevance of some of the ciphers; that deserves a topic of its own.

I'm just going to share benchmark numbers for a concrete use case, which is HTTPS file download:

So I set up Apache2 on my Helios4 to serve a 1GB file hosted on an SSD drive.

Then I did 3 batches of download tests; for each batch I configured Apache2 to use a specific cipher that I know is supported by the CESA engine.

AES_128_CBC_SHA
AES_128_CBC_SHA256
AES_256_CBC_SHA256

For each batch, I do the following 3 tests :

1. download without cryptodev module loaded (100% software encryption)

2. download with cryptodev loaded and libssl (openssl) compiled with -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS

3. download with cryptodev loaded and libssl (openssl) compiled only with -DHAVE_CRYPTODEV, which means hashing operations will still be done 100% in software.

Here are the results :

[Image: table of benchmark results]

Note: CPU utilization is for both cores. Each test is a single process running on a single core, so when you see CPU utilization around 50% (User% + Sys%) it means the core used for the test is fully loaded. Maybe I should have reported the number just for the core used, which would be more or less x2 the value you see in the table.
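To make the x2 rule of thumb concrete, here is a minimal sketch that converts a two-core average into an estimate for the single core running the test. The helper name per_core_load is made up for illustration, and the result is only an estimate since it assumes all of the load sits on one core.

```shell
# Convert a utilization figure averaged over N cores into the load on the
# single core that actually ran the test. The core count defaults to 2.
per_core_load() {
    # $1 = User %, $2 = Sys %, $3 = number of cores (optional)
    awk -v u="$1" -v s="$2" -v n="${3:-2}" 'BEGIN { printf "%.1f\n", (u + s) * n }'
}

# AES_128_GCM_SHA256 reference run from this post: User 42.9 %, Sys 7.2 %
per_core_load 42.9 7.2
```

A result close to 100 means the core doing the work is saturated.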

For reference:

Using AES_128_GCM_SHA256 (Default Apache 2.4 TLS cipher. GCM mode is not something that can be accelerated by CESA.)

CPU Utilization : User %42.9, Sys %7.2
Throughput : 30.6MB/s

No HTTPS
CPU Utilization : User %1.0, Sys %29.8
Throughput : 112MB/s

CONCLUSION

1. Hashing operations are slower on the CESA engine than on the CPU itself, so HW encryption with hashing performs worse than 100% software encryption.

2. HW encryption without hashing provides a 30 to 50% throughput increase while decreasing the load on the CPU by 20 to 30%.

3. Still pondering if it's worth the effort to encourage people to make the move... but I think it's still a cool improvement.

gprovost · October 19, 2018

I configured SSH to work with cryptodev and use cipher AES-CBC-128.

I did an scp download of the same 1GB file and got the following performance:

Throughput : 56.6MB/s

CPU Utilization : User %12.3, Sys %31.2

Pretty good :-)

Important note :

As concluded in the previous post, in the case of Helios4, using cryptodev only for encryption and not for authentication (authentication involves hashing) is the only mode that provides some network performance and CPU load improvement. The other advantage of this mode is that cryptodev will be completely skipped by sshd. Otherwise sshd raises an exception during authentication, because cryptodev tries to make some ioctl() calls that are forbidden by the seccomp filter in the sshd sandbox.

If you still want to test using cryptodev for ssh, the easy workaround is to use normal privilege separation in sshd instead of the sandbox (UsePrivilegeSeparation yes). Then, as for apache2, you will have to force the use of a cipher that is supported by the CESA engine (e.g. aes128-cbc)... and most probably you will also have to do the same on the client side.

Disclaimer: The sshd tweaking is not recommended for security reasons. Only experiment with it if you know what you are doing.

For reference, with cipher algorithms not supported by CESA:

AES128-CTR

CPU Utilization : User %39.1, Sys %16.4

Throughput : 39.9MB/s

CHACHA20-POLY1305 (default cipher for ssh)

CPU Utilization : User %40.6, Sys %17.0

Throughput : 29.8MB/s

gprovost · November 19, 2018

Finally found the time to finish the CESA page on our Helios4 Wiki. It's not as exhaustive as it should be, but enough to help people experiment.

https://wiki.kobol.io/cesa/

For those who are interested, please have a look... any comments welcome ;-)

markbirss · November 20, 2018

@gprovost thank you for the effort with the guide.

Could you possibly include some CJDNS benchmarks ?

gprovost · November 22, 2018

On 11/20/2018 at 8:07 PM, markbirss said:

Could you possibly include some CJDNS benchmarks ?

I don't know much about CJDNS. Does CJDNS support cryptodev or AF_ALG ?

markbirss · November 27, 2018

@gprovost refer to these links for details on how cjdns uses its own CryptoAuth protocol with ed25519, poly1305, and salsa20.

https://github.com/cjdelisle/cjdns/blob/master/doc/Whitepaper.md

https://github.com/hyperboria/bugs/issues/112

The idea is to see whether the encryption that CJDNS uses benefits from cryptodev HW acceleration or not.

gprovost · November 29, 2018

I already looked at the white paper but I don't have much time now to dig further. I'm not sure if CJDNS interfaces with the Kernel Crypto API.

Plus anyway the Marvell CESA engines don't support SALSA20 stream encryption... so i don't think CJDNS crypto can be accelerated.

markbirss · November 29, 2018

Ok, understood.

Koen · December 6, 2018

It could be interesting to see the test repeated while on a LUKS encrypted filesystem?

gprovost · December 7, 2018

@Koen Good point, will do a benchmark one of these days.

djurny · December 28, 2018

Hi, I would be interested in the LUKS benchmark results as well, with and without using the CESA.

Currently I'm trying to get my LUKS encrypted volumes to perform a bit better on the Helios4. On my previous box (AMD Athlon X2) I saw numbers above 80 MiB/sec for disk I/O when performing a 'snapraid diff/sync/scrub' on the same LUKS encrypted volumes. The drives themselves were the I/O bottleneck. On the Helios4, those numbers have dropped significantly: ~30MiB/sec for the same actions.

On the Helios4, with 'marvell-cesa' and 'mv_cesa' modules loaded, 'cryptsetup benchmark' shows:

/* snip */
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   101.3 MiB/s   104.2 MiB/s
 serpent-cbc   128b    27.8 MiB/s    29.4 MiB/s
 twofish-cbc   128b    39.7 MiB/s    44.2 MiB/s
     aes-cbc   256b    91.7 MiB/s    94.1 MiB/s
 serpent-cbc   256b    28.0 MiB/s    32.0 MiB/s
 twofish-cbc   256b    39.7 MiB/s    44.3 MiB/s
     aes-xts   256b    63.4 MiB/s    55.0 MiB/s
 serpent-xts   256b    27.6 MiB/s    31.8 MiB/s
 twofish-xts   256b    43.3 MiB/s    44.0 MiB/s
     aes-xts   512b    47.9 MiB/s    41.6 MiB/s
 serpent-xts   512b    29.8 MiB/s    31.8 MiB/s
 twofish-xts   512b    43.2 MiB/s    44.0 MiB/s

Without the CESA modules loaded, the aes-cbc performance drops significantly:

/* snip */
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b    25.1 MiB/s    56.2 MiB/s
 serpent-cbc   128b    28.0 MiB/s    31.9 MiB/s
 twofish-cbc   128b    39.7 MiB/s    44.3 MiB/s
     aes-cbc   256b    19.1 MiB/s    42.1 MiB/s
 serpent-cbc   256b    27.9 MiB/s    29.2 MiB/s
 twofish-cbc   256b    39.5 MiB/s    44.2 MiB/s
     aes-xts   256b    63.2 MiB/s    55.3 MiB/s
 serpent-xts   256b    29.8 MiB/s    31.8 MiB/s
 twofish-xts   256b    43.5 MiB/s    44.2 MiB/s
     aes-xts   512b    48.0 MiB/s    41.6 MiB/s
 serpent-xts   512b    27.3 MiB/s    31.7 MiB/s
 twofish-xts   512b    43.3 MiB/s    44.1 MiB/s

This already hints at the fact that 'dm-crypt' is not using the CESA.
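To compare two such runs row by row, a small sketch like the following can help. It assumes both `cryptsetup benchmark` outputs were saved to files in the format shown above; the function name cesa_speedup and the file names are made up for illustration.

```shell
# Print the with-CESA / without-CESA speed ratio for every row that appears
# in two saved `cryptsetup benchmark` outputs.
# $1 = output captured with marvell_cesa loaded, $2 = output captured without.
cesa_speedup() {
    awk '
        # Data rows look like: aes-cbc   128b   101.3 MiB/s   104.2 MiB/s
        $2 ~ /^[0-9]+b$/ && $4 == "MiB/s" {
            key = $1 " " $2
            if (FNR == NR) {            # first file: remember the CESA figures
                enc[key] = $3; dec[key] = $5
            } else if ((key in enc) && $3 > 0 && $5 > 0) {
                printf "%-12s %-5s enc x%.2f  dec x%.2f\n", $1, $2, enc[key] / $3, dec[key] / $5
            }
        }
    ' "$1" "$2"
}

# Typical use (file names are placeholders):
#   cryptsetup benchmark > with-cesa.txt
#   sudo modprobe -r marvell_cesa
#   cryptsetup benchmark > without-cesa.txt
#   cesa_speedup with-cesa.txt without-cesa.txt
```

Rows with a ratio close to x1.00 are not affected by the CESA modules; only the rows well above that are being offloaded.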

After some checking, I would need to reencrypt my LUKS drives; they're using aes-xts-sha1, which is not supported by the CESA according to the Helios4 Wiki. The benchmark results shown by 'cryptsetup benchmark' show only an improvement for the aes-cbc algorithms, so first test will be to see what LUKS will do with 128bit aes-cbc-sha1 instead of 256bit aes-xts-sha1.

Groetjes,

p.s. I'm in no way a cryptography expert, so some of the terms might not be hitting the mark completely

Edited December 28, 2018 by djurny
Redundant statements, some typos.

djurny · December 29, 2018

Ok, I've made some progress already: it turns out that LUKS performance gets a boost when using cipher 'aes-cbc-essiv:sha256'. Key size did not show any big difference when testing simple filesystem performance (~102MiB/sec vs ~103MiB/sec). The internets say that there is some concern about using cbc vs xts, but after some reading it looks like it's not necessarily a concern related to the privacy of the data, but more of a data integrity issue: attackers are able to corrupt encrypted content in certain scenarios. For my use case, this is not a problem at all, so I'm happy to see the performance boost from using CESA!

Test setup & playbook:

Make sure 'marvell-cesa' and 'mv_cesa' are modprobe'd.
/dev/sdX1 - optimally aligned using parted.
cryptsetup luksFormat /dev/sdX1
cryptsetup luksOpen /dev/sdX1 decrypt
mkfs.xfs /dev/mapper/decrypt
mount /dev/mapper/decrypt /mnt
Check /proc/interrupts
dd if=/dev/zero of=/mnt/removeme bs=1048576 count=16384 conv=fsync
Check /proc/interrupts if crypto has gotten any interrupts.
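The before/after check of /proc/interrupts can be scripted. This is a minimal sketch, assuming the CESA interrupt lines contain the string "crypto" (they show up as f1090000.crypto on this board); crypto_irq_total is a made-up helper name.

```shell
# Sum the per-CPU counters of every /proc/interrupts line whose name
# contains "crypto".
# $1 = file to read, defaults to /proc/interrupts.
crypto_irq_total() {
    awk '/crypto/ {
        # Field 1 is "NN:", the following numeric fields are per-CPU counts.
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
    }
    END { print total + 0 }' "${1:-/proc/interrupts}"
}

# Typical use around the dd run from the playbook:
#   before=$(crypto_irq_total)
#   dd if=/dev/zero of=/mnt/removeme bs=1048576 count=16384 conv=fsync
#   after=$(crypto_irq_total)
#   echo "crypto interrupts during test: $((after - before))"
```

A difference near zero means the cipher in use is not being handled by the CESA.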

Averaged throughput measurement:

Cipher                 Key size   Throughput   Interrupts for crypto
aes-xts-plain64:sha1   256 bits   78.0 MB/s    11 + 8
aes-cbc-essiv:sha256   256 bits   102 MB/s     139421 + 144352
aes-cbc-essiv:sha256   128 bits   103 MB/s     142679 + 152079

Next steps;

Copy all content from the aes-xts LUKS volumes to the aes-cbc LUKS volumes,
Run snapraid check/diff/sync and check disk throughput.

Comments are welcome,

Groetjes,

Edited December 29, 2018 by djurny
Table markup flattens somehow?

gprovost · January 3, 2019

@djurny Thanks for the benchmark, very useful numbers. It reminds me I need to indicate on the wiki the correct cipher to use with cryptsetup (preference for aes-cbc-essiv over aes-cbc-plain*) when people create their LUKS device, since on the latest version of cryptsetup the default cipher is aes-xts-plain64.

Time to combine both benchmark tests together as suggested by @Koen

BTW you don't need to load mv_cesa module. This is an old module which is replaced by marvell_cesa.

Koen · January 4, 2019

This is very useful information, as i'm planning to have boot root (SD) and data (SATA mirror) encrypted, with BTRFS on top. Better get started the good way.

@djurny : did you come across good links explaining the differences / risks of cbc versus xts, or even essiv versus plain64 ?

Found this guide for the root fs :

And the data fs i should be able to do with a keyfile on the rootfs. I think it needs to be 2x LUKS and BTRFS "mirror" on top, so i could actually benefit from the self healing functionality, in case of a scrub.

@gprovost : am I correct to understand the CESA will be used automatically by dm-crypt, if aes-cbc-essiv (or another supported cipher) is used ?
Also looking forward to read updated performance numbers, to understand if it would be worth modifying the openssl libraries or not.

djurny · January 4, 2019

@Koen, do note that on Armbian, the CESA modules are not loaded by default, so even if you choose aes-cbc-essiv but do not load the appropriate kernel module, the CESA is not utilized. (See the cryptsetup benchmark results with and without marvell_cesa/mv_cesa.)
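A quick way to verify this before benchmarking is to look the module up in /proc/modules. The sketch below assumes the usual /proc/modules layout (module name in the first column); module_loaded is a made-up helper name.

```shell
# Return success if a kernel module is listed as loaded.
# $1 = module name ("marvell-cesa" and "marvell_cesa" are treated alike),
# $2 = modules list to read, defaults to /proc/modules.
module_loaded() {
    name=$(printf '%s' "$1" | tr '-' '_')
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Typical use:
#   module_loaded marvell_cesa || sudo modprobe marvell_cesa
```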

For the XTS vs CBC I will try to find & list the articles I found on the interwebs.

@gprovost, thanks for the tip, I'll refrain from loading mv_cesa.

matzman666 · January 6, 2019

I also want to share my numbers because I think they show some interesting findings.

My setup: Helios4 & Raid5 with 4x Seagate Ironwolf 4TB (ST4000VN008)

To ensure comparability all numbers were obtained using the same command as used by @djurny: dd if=/dev/zero of=/mnt/removeme bs=1048576 count=16384 conv=fsync

First of all some numbers which show how important correct alignment is (no encryption yet, filesystem was btrfs):

md-raid5 using the whole disk created via OMV or the instructions found in the wiki here:
- Write-Throughput: ~76 MB/s
md-raid5 using optimally aligned partitions (created via parted -a optimal /dev/sdX mkpart primary 0% 100%):
- Write-Throughput: ~104 MB/s

That is a difference of about ~26%! Based on these numbers I would not recommend using the whole disk when creating a md-raid5. Using partitions is not supported at all by OMV, so the raid has to be created on the command-line.
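To check whether an existing partition is aligned, its start sector can be read from sysfs and tested against a boundary. This is a sketch only: the 1 MiB default boundary is my assumption of what "optimal" means here, and partition_aligned is a made-up helper name.

```shell
# Check whether a partition start is aligned to a given boundary.
# $1 = start sector as reported by /sys/block/<disk>/<part>/start
#      (always counted in 512-byte units),
# $2 = alignment in bytes (optional; defaults to 1 MiB, an assumption that
#      also satisfies 4 KiB physical sectors).
partition_aligned() {
    [ $(( $1 * 512 % ${2:-1048576} )) -eq 0 ]
}

# Typical use (sda/sda1 are placeholders):
#   partition_aligned "$(cat /sys/block/sda/sda1/start)" && echo aligned
```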

Now my numbers when using encryption:

md-raid5 & luks (using aes-cbc-essiv:sha256) & xfs:
- Write-Throughput: ~72 MB/s
md-raid5 & luks (using aes-cbc-essiv:sha256) & btrfs:
- Write-Throughput: ~66 MB/s
md-raid5 & luks (using aes-cbc-essiv:sha256 with marvell_cesa kernel module unloaded) & btrfs:
- Write-Throughput: ~34 MB/s
md-raid5 & luks2 (using aes-cbc-essiv:sha256) & btrfs:
- Write-Throughput: ~73 MB/s

Looking at the numbers I see a performance loss of about 30% when using encryption. Hardware encryption is working and definitely speeds up encryption because when I unload the kernel module I see a performance loss of about 60% compared to the unencrypted case.

When creating the luks partition via the OMV encryption plugin aes-xts is used by default which is not supported by marvell_cesa, and there is no way to configure a different encryption algorithm on the web-gui. To be able to use aes-cbc the luks partition has to be created via the commandline.

Using luks2 instead of luks gives a bit of a performance boost. Luks2 is only supported by the Ubuntu image, the Debian image has the usual Debian problem: too old packages.

Based on these numbers I am ditching Debian and OMV and am moving to Ubuntu. OMV is of little use, because for best performance I have to set up everything via the commandline. Also luks2 is more future-proof and results in better performance.

Edit2:

I played a bit more with luks2, and found something very interesting: With luks2 you can also change the sector size.

The default sector size is 512 Byte, but when I change it to 4K, then I see massive performance improvements:

md-raid5 & luks2 (using aes-cbc-essiv:sha256 and 4K sectors) & btrfs:
- Write-Throughput: ~ 99 MB/s

That's only a performance loss of ~5% compared to the unencrypted case!?! Before I had a performance loss of at least ~30%. I double-checked everything, and the numbers are real. This means luks2 is definitely the way to go.
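For reference, the loss percentages can be recomputed from the write throughputs quoted in this post. A minimal sketch, taking the ~104 MB/s aligned, unencrypted btrfs run as the baseline (note the 72 MB/s figure was measured on xfs, so that comparison is only approximate); throughput_loss is a made-up helper name.

```shell
# Percentage of throughput lost relative to an unencrypted baseline.
# $1 = baseline MB/s, $2 = measured MB/s
throughput_loss() {
    awk -v b="$1" -v m="$2" 'BEGIN { printf "%.1f\n", (b - m) / b * 100 }'
}

# Encrypted results quoted in this post, against the ~104 MB/s baseline.
for mbps in 72 66 34 73 99; do
    printf '%3s MB/s -> %s%% loss\n' "$mbps" "$(throughput_loss 104 "$mbps")"
done
```

With these inputs the 4K-sector luks2 run comes out just under 5%, and the runs with 512-byte sectors land around 30 to 37%, in line with the rounded figures given in the post.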

To sum up. To get the best performance out of an encrypted raid5, you need to:

1. Install the Ubuntu image. This means you cannot use OMV, but that's the price to pay for best performance.
2. Create a single partition on each of your disks that is optimally aligned.
- parted /dev/sdX mklabel gpt
- parted -a optimal /dev/sdX mkpart primary 0% 100%
3. Create your raid with mdadm and pass the partitions you created in the second step.
- mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
4. Create a luks2 partition using aes-cbc-essiv:sha256 and a sector size of 4K.
- cryptsetup -s 256 -c aes-cbc-essiv:sha256 --sector-size 4096 --type luks2 luksFormat /dev/md0
5. Create your file-system on top of your encrypted raid device.
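The recipe above can be sketched as a dry-run helper that only prints the commands for a given set of disks, so they can be reviewed before anything destructive is run. The function name print_raid5_luks2_plan is made up, the md device is assumed to be /dev/md0 as in the post, and the filesystem step is left out because the post does not prescribe one.

```shell
# Print (do not run) the commands for the encrypted RAID5 recipe.
# Arguments are whole-disk names without /dev/, e.g. sda sdb sdc sdd.
print_raid5_luks2_plan() {
    count=$#
    parts=""
    for disk in "$@"; do
        echo "parted /dev/$disk mklabel gpt"
        echo "parted -a optimal /dev/$disk mkpart primary 0% 100%"
        parts="$parts /dev/${disk}1"
    done
    echo "mdadm --create /dev/md0 --level=5 --raid-devices=$count$parts"
    echo "cryptsetup -s 256 -c aes-cbc-essiv:sha256 --sector-size 4096 --type luks2 luksFormat /dev/md0"
}

print_raid5_luks2_plan sda sdb sdc sdd
```

Printing instead of executing is a deliberate choice here: every one of these commands destroys existing data on the target disks.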

On 12/29/2018 at 7:15 AM, djurny said:

Averaged throughput measurement:

Cipher: aes-xts-plain64:sha1, key size: 256bits
Throughput => 78.0 MB/s,

Interrupts for crypto => 11 + 8,

Cipher: aes-cbc-essiv:sha256, key size: 256bits
Throughput => 102 MB/s,

Interrupts for crypto => 139421 + 144352,

Cipher: aes-cbc-essiv:sha256, key size: 128bits
Throughput => 103 MB/s

Interrupts for crypto => 142679 + 152079.

Do you have any numbers for the unencrypted case? I am curious because I want to know if you also see a performance loss of about 30%.

Edit:

Should be parted -a not parted -o.

gprovost · January 7, 2019

@matzman666 Very interesting findings and well broken-down tests :thumbup:

1. I wasn't aware that with LUKS2 you can define sector size bigger than 512b, therefore decreasing the number of IO operations.

2. I never investigated how RAID perf can be impacted with today's HDDs that support a block size of 4K (Advanced Format). A non-aligned 4K disk can effectively show a significant performance penalty. But I'm still a bit surprised that you see that much perf difference (26%) with and without disk alignment in your tests without encryption.

https://raid.wiki.kernel.org/index.php/A_guide_to_mdadm#Blocks.2C_chunks_and_stripes

Would you be interested in helping us improve our wiki, on the disk alignment part and then on the disk encryption part ?

https://wiki.kobol.io/mdadm/

https://wiki.kobol.io/cesa/

djurny · January 13, 2019

@matzman666 Sorry, no measurements. From memory the numbers for raw read performance were way above 100MB/sec according to 'hdparm -t'. Currently my box is live, so no more testing with empty FSes on unencrypted devices for now. Perhaps someone else can help out?

-[ edit ]-

So my 2nd box is alive. The setup is slightly different and not yet complete.

I quickly built cryptsetup 2.x from sources on Armbian, was not as tough as I expected - pretty straightforward: configure, correct & configure, correct & ...

cryptsetup 2.x requires the following packages to be installed:

uuid-dev
libdevmapper-dev
libpopt-dev
pkg-config
libgcrypt-dev
libblkid-dev

Not sure about these ones, but I installed them anyway:

libdmraid-dev
libjson-c-dev

Build & install:

Download cryptsetup 2.x via https://gitlab.com/cryptsetup/cryptsetup/tree/master.
Unxz the tarball.
./configure --prefix=/usr/local
make
sudo make install
sudo ldconfig

Configuration:

2x WD Blue 2TB HDDs.
4KiB sector aligned GPT partitions.
mdadm RAID6 (degraded).

Test:

Write 4GiB worth of zeroes; dd if=/dev/zero of=[dut] bs=4096 count=1048576 conv=fsync.
- directly to mdadm device.
- to a file on an XFS filesystem on top of an mdadm device.
- directly to LUKS2 device on top of an mdadm device (512B/4096B sector sizes for LUKS2).
- to a file on an XFS filesystem on top of a LUKS2 device on top of an mdadm device (512B/4096B sector sizes for LUKS2).

Results:

Caveat:

CPU load is high: >75% due to mdadm using CPU for parity calculations. If using the box as just a fileserver for a handful of clients, this should be no problem. But if more processing is done besides serving up files, e.g. transcoding, (desktop) applications, this might become problematic.
RAID6 under test was in degraded mode. I don't have enough disks to have a fully functional RAID6 array yet. No time to tear down my old server yet. Having a full RAID6 array might impact parity calculations and add 2x more disk I/O to the mix.

I might consider re-encrypting the disks on my first box, to see if LUKS2 w/4KiB sectors will increase the SnapRAID performance over the LUKS(1) w/512B sectors. Currently it takes over 13 hours to scrub 50% on a 2-parity SnapRAID configuration holding less than 4TB of data.

-[ update ]-

Additional test:

Write/read 4GiB worth of zeroes to a file on an XFS filesystem test on armbian/linux packages 5.73 (upgraded from 5.70)
- for i in {0..9} ;
  do time dd if=/dev/zero of=removeme.${i} bs=4096 count=$(( 4 * 1024 * 1024 * 1024 / 4096 )) conv=fsync;
  dd if=removeme.$(( 9 - ${i} )) of=/dev/null bs=4096 ;
  done 2>&1 | egrep '.*bytes.*copied.*'
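The filtered dd summary lines from a loop like this can be averaged with a small helper. The sketch below assumes GNU dd's summary format (byte count first, elapsed seconds right before the "s," token) and recomputes MiB/s from those two values; dd_average_mibs is a made-up name. Since the loop alternates writes and reads, feed it only the lines of one kind at a time.

```shell
# Average the throughput over several dd summary lines read from stdin.
# Expected line format (GNU dd), for example:
#   4294967296 bytes (4.3 GB, 4.0 GiB) copied, 41.5 s, 103 MB/s
dd_average_mibs() {
    awk '/bytes.*copied/ {
        # The elapsed time is the field right before the "s," token.
        for (i = 2; i <= NF; i++)
            if ($i == "s," && $(i - 1) > 0) { sum += $1 / $(i - 1) / 1048576; n++ }
    }
    END { if (n) printf "%.1f MiB/s over %d runs\n", sum / n, n }'
}

# Typical use (writes.log is a placeholder holding the write summary lines):
#   dd_average_mibs < writes.log
```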

Results:

[Image: table of write/read throughput results]

The write throughput appears to be slightly higher, as now the XOR HW engine is being used - but it could just as well be measurement noise.

CPU load is still quite high during this new test:

%Cpu0  :  0.0 us, 97.5 sy,  1.9 ni,  0.3 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.3 us, 91.4 sy,  1.1 ni,  0.0 id,  0.0 wa,  0.0 hi,  7.2 si,  0.0 st
<snip>
 176 root      20   0       0      0      0 R  60.9  0.0  14:59.42 md0_raid6                                                    
1807 root      30  10    1392    376    328 R  45.7  0.0   0:10.97 dd
9087 root       0 -20       0      0      0 D  28.5  0.0   1:16.53 kworker/u5:1
  34 root       0 -20       0      0      0 D  19.0  0.0   0:53.93 kworker/u5:0                                                     
 149 root     -51   0       0      0      0 S  14.4  0.0   3:19.35 irq/38-f1090000                                               
 150 root     -51   0       0      0      0 S  13.6  0.0   3:12.89 irq/39-f1090000                                               
5567 root      20   0       0      0      0 S   9.5  0.0   1:09.40 dmcrypt_write
<snip>

Will update this again once the RAID6 array set up is complete.

Groetjes,

djurny · February 24, 2019

L.S.,

A quick update on the anecdotal performance of LUKS2 over LUKS;

md5sum'ing ~1 TiB of datafiles on LUKS:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.15   20.01   59.48    2.60    0.00   17.76
Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb             339.20        84.80         0.00        848          0
dm-2            339.20        84.80         0.00        848          0

md5sum'ing ~1 TiB of datafiles on LUKS2:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.05   32.37   36.32    0.75    0.00   30.52
Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdd             532.70       133.18         0.00       1331          0
dm-0            532.80       133.20         0.00       1332          0

sdb:

sdb1 optimally aligned using parted.
- LUKS(1) w/aes-cbc-essiv:sha256, w/256 bit key size.
- XFS with 4096 Bytes sector size.
- xfs_fsr'd regularly, negligible file fragmentation.

sdd:

sdd1 optimally aligned using parted.
- LUKS2 w/aes-cbc-essiv:sha256, w/256 bit key size and w/4096 Bytes sector size, as @matzman666 suggested.
- XFS with 4096 Bytes sector size.

Content-wise: sdd1 is a file-based copy of sdb1 (about to wrap up the migration from LUKS(1) to LUKS2).

Overall a very nice improvement!

Groetjes,

p.s. Not sure if it added to the performance, but I also spread out the IRQ assignments over both CPUs, making sure that each CESA and XOR engine have their own CPU. Originally I saw that all IRQs are handled by the one and same CPU. For reasons yet unclear, irqbalance refused to dynamically reallocate the IRQs over the CPUs. Perhaps the algorithm used by irqbalance does not apply well to ARM or the Armada SoC (- initial assumption is something with cpumask being reported as '00000002' causing irqbalance to only balance on- and to CPU1?).

gprovost · February 25, 2019

On 2/24/2019 at 9:10 AM, djurny said:

Not sure if it added to the performance, but I also spread out the IRQ assignments over both CPUs, making sure that each CESA and XOR engine have their own CPU.

Interesting point. Are you using smp_affinity to pin an IRQ to a specific CPU ?

Some reference for this topic : https://www.arm.com/files/pdf/AT-_Migrating_Software_to_Multicore_SMP_Systems.pdf

I'm honestly not sure you will gain in performance; you might see some improvement in the case of high load... but then I would have imagined that irqbalance (if installed) would kick in and start balancing IRQs if really needed. I will need to play around to check that.

djurny · February 25, 2019

@gprovost, indeed /proc/irq/${IRQ}/smp_affinity. What I saw irqbalance do is perform a one-shot assignment of all IRQs to CPU1 (from CPU0) and then...basically nothing. The "cpu mask" shown by irqbalance is '2', which led me to the assumption that it is not taking CPU0 into account as a CPU it can use for the actual balancing. So all are "balanced" to one CPU only.

Overall, the logic for the IRQ assignment spread over the two CPU cores was: When CPU1 is handling ALL incoming IRQs, for both SATA, XOR and CESA for all disks in a [software] RAID setup, then CPU1 will be the bottleneck of all transactions. The benefit of the one-shot 'balancing' is that you don't really need to worry about the ongoing possible IRQ handler migration from CPUx to CPUy, as the same CPU will handle the same IRQ all the time, no need to migrate anything from CPUx to CPUy continuously.
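For anyone who wants to experiment with manual pinning, here is a minimal sketch of the mask arithmetic behind /proc/irq/${IRQ}/smp_affinity. The mask is a hexadecimal bitmap with one bit per CPU, so '2' selects CPU1 only. The function names are made up for illustration, and the IRQ numbers to use have to be looked up in /proc/interrupts on the actual system.

```shell
# Hex smp_affinity mask that selects a single CPU (CPU0 -> 1, CPU1 -> 2).
cpu_to_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

# List the CPU numbers selected by a hex mask (3 -> "0 1").
mask_to_cpus() {
    mask=$(( 0x$1 ))
    cpu=0
    cpus=""
    while [ "$mask" -gt 0 ]; do
        [ $(( mask & 1 )) -eq 1 ] && cpus="$cpus $cpu"
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "${cpus# }"
}

# Pin one IRQ to one CPU (needs root). $1 = IRQ number, $2 = CPU number.
pin_irq() {
    cpu_to_mask "$2" > "/proc/irq/$1/smp_affinity"
}
```

Decoding the reported mask this way confirms the reading above: `mask_to_cpus 00000002` lists only CPU 1.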

Any further down the rabbit hole would require me to go back to my college textbooks .

Groetjes,

gprovost · February 26, 2019

@djurny Yeah, my test with irqbalance shows that at startup it pins some IRQs to CPU1 and some to both (see table below).

I don't witness any dynamic change while stressing the system. When the SMP affinity is set to both CPUs (value = 3), the IRQ is allocated by default to CPU0. So with irqbalance, the only good thing I see is that the network device interrupts will not be on the same CPU as ATA, XOR and CESA.

But you're right, maybe manually pinning the IRQs to certain CPUs would help performance under heavy load. Before doing such tuning, though, there is a need to define a proper benchmark to measure the positive and negative impact of such tuning.

[Image: table of IRQ affinity assignments with irqbalance]

gprovost · March 18, 2019

Just for info I updated the libssl1.0.2 cryptodev patch to work with latest libssl1.0.2 debian deb (libssl1.0.2_1.0.2r)

https://wiki.kobol.io/cesa/#network-application-encryption-acceleration

So if some people wanna test offloading crypto operations while using apache2 or ssh, for example ;-) You can also find a pre-built dpkg here if you want to skip the compile steps.

gprovost · July 30, 2019

https://wiki.kobol.io/cesa has been updated to be ready for the upcoming Debian Buster release. For Debian Buster we will use AF_ALG as the Crypto API userspace interface. We chose AF_ALG for Debian Buster because it doesn't require any patching or recompiling, but it comes with some limitations. While benchmarks show a throughput improvement with AF_ALG in some cases, the CPU load is not improved compared to cryptodev or 100% software encryption. This will require further investigation.

So far I haven't succeeded in enabling cryptodev in the Debian version of libssl1.1.1... It needs further work, and it's not really our priority right now.

Also for Debian Stretch I updated the prebuilt libssl1.0.2 Debian deb to the latest security update release (libssl1.0.2_1.0.2s). It can be found here.

DavidGF · June 30, 2020

Not sure whether it's a good idea to bump this thread but just my 2cts on this:

CESA is quite good for use cases such as LUKS. Obviously you will need to use aes-cbc (use 128 bits for maximum perf).

In this mode the CPU usage is quite low (you can see two kernel workers doing 10% on each CPU and a ton of interrupts, but other than that it works well), which is awesome for keeping load off the CPU, which is already limited. Compared to using pure CPU, in my current setup I get:

- Plain access to disk: 180MB/s

- LUKS2 + CESA: 140MB/s

- LUKS2 (no CESA): 52MB/s

I tuned LUKS using sector size= 4096 and keysize=128, as well as using aes-cbc-plain64.

For me this is quite good already even tho CESA only supports some "relatively old" ciphers.

Hope this helps other people!

sfx2000 · July 11, 2020

With CESA - many assumptions based on Armada 38x - where things in scope looked very good on ARM-V7A - both the Marvell cores and the later ARM-a9's... Armada is mixed up here... let's just say that 38x has massive bandwidth inside the chip - MV3720/Mochi doesn't...

MV3720 - always a tradeoff - more CPU and throughput, or offload to the CESA units, which is probably similar to other "off-loads" for dedicated accelerators

Personally - I would not recommend CESA here,,, I would recommend core here as the is CPU/MEM, and CESA is limited, and so is the bus within the chip itself.

jsr · April 2, 2021

Hi there!

I have been using LUKS2 encrypted disks on my helios4, and I've noticed that the encryption/decryption is no longer making use of the CESA accelerator. I don't see the relevant interrupts counting up:

 48:          0          0     GIC-0  51 Level     f1090000.crypto
 49:          0          0     GIC-0  52 Level     f1090000.crypto

and my decryption speed is a lot lower (~45 MB/s) than it used to be (~100 MB/s). This used to be working on this OS install (Armbian Debian stretch, apt dist-upgrade'd to buster 21.02.3), and I wonder if there may have been a regression somewhere along the line.

I've also tested using matzman666's examples above for setting up a LUKS2 encrypted disk, and I still have the same issue: I don't see any CESA HW acceleration with LUKS2. I checked and I do have the marvell_cesa module loaded. Here is the cryptsetup command:

cryptsetup -s 256 -c aes-cbc-essiv:sha256 --sector-size 4096 --type luks2 luksFormat /dev/mmcblk0p2

I have also tried a fresh install of the latest Armbian Debian buster 21.02.3 and Armbian Ubuntu focal 21.02.3, and I still have no CESA acceleration.

Has anyone seen this issue? How do we get LUKS2 to use the CESA hardware these days?

Regards,

James

lanefu · April 2, 2021

@TheLinuxBug do you have any thoughts?

djurny · April 3, 2021

@jsr, can you also check /proc/cryptinfo ? If I remember correctly, the mv/marvell module should show up in cryptinfo.

Edit: Also noticed you are LUKSing an mmc device. Not sure about the speed of the device used in your setup, but my SD cards hardly reach 30MiB/s, even if CESA would work. Your throughput might be limited by an additional bottleneck.

jsr · April 3, 2021

@djurny, I have a /proc/crypto (but not /proc/cryptinfo), link at the bottom.

Apologies, I should have explained using the mmc device: that was used just to confirm the issue still existed when starting afresh using matzman666's instructions. I had some spare space available on that mmc device, and when using it I was only checking the CESA interrupt count in /proc/interrupts. My real storage is a pair of hard drives that each can read sequentially at ~180 MB/s (tested with dd).

uname -a:

Linux helios4 5.10.21-mvebu #21.02.3 SMP Mon Mar 8 00:59:48 UTC 2021 armv7l GNU/Linux

/proc/crypto:

https://pastebin.com/QuSpD5u5
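A long /proc/crypto listing is easier to read when filtered by driver name. The sketch below prints name, driver and priority for every entry whose driver matches a pattern; when several drivers implement the same algorithm, the kernel crypto API prefers the one with the highest priority. The helper name crypto_drivers is made up, and the '^mv-' prefix for the Marvell CESA drivers is an assumption to verify against the actual listing.

```shell
# List the algorithms in a /proc/crypto style file whose driver name matches
# a pattern, together with their priority.
# $1 = driver pattern (regex), $2 = file to read, defaults to /proc/crypto.
crypto_drivers() {
    awk -F '[ \t]*:[ \t]*' -v pat="$1" '
        $1 == "name"     { name = $2 }
        $1 == "driver"   { driver = $2 }
        $1 == "priority" { if (driver ~ pat) print name, driver, $2 }
    ' "${2:-/proc/crypto}"
}

# Typical use (the "^mv-" prefix is an assumption, see above):
#   crypto_drivers '^mv-'
#   crypto_drivers 'aes'
```

If no CESA-backed cbc(aes) entry shows up, or it has a lower priority than a software implementation, dm-crypt would not be expected to pick it.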
