gprovost Posted October 18, 2018

It's been on my TODO list for a while: write a guide on how to activate and use the Marvell Cryptographic Engines And Security Accelerator (CESA) on Helios4.

Previously I already shared some numbers on the CESA engine while using @tkaiser's sbc-bench tool. I also shared some findings on the OpenSSL support for the kernel modules (cryptodev and af_alg) that interact with the CESA engine. My conclusions were:

1. Performance-wise: cryptodev effectively performs slightly better than af_alg.
2. OpenSSL / libssl support: very messy and broken; it all depends on which version of OpenSSL you use.

Since many Debian Stretch apps depend on the "old" libssl (1.0.2), I felt the cryptodev approach was the best way to go, since it can expose all the encryption and authentication algorithms supported by the CESA engine... even though it requires some patching in OpenSSL. Plus the cryptodev implementation in the new LTS OpenSSL version 1.1.1 has been completely reworked, so long term it should be the right approach.

Anyhow, I'm not going to describe the step-by-step setup here; I'm already writing a page on our wiki for that, and once it's ready I will post the link here. I also won't risk discussing the relevance of some of the ciphers, as that deserves a topic of its own. I'm just going to share benchmark numbers on a concrete use case: HTTPS file download.

I set up Apache2 on my Helios4 to serve a 1GB file hosted on an SSD drive. Then I did 3 batches of download tests; for each batch I configured Apache2 to use a specific cipher that I know is supported by the CESA engine:

- AES_128_CBC_SHA
- AES_128_CBC_SHA256
- AES_256_CBC_SHA256

For each batch, I ran the following 3 tests:

1. Download without the cryptodev module loaded (100% software encryption).
2. Download with cryptodev loaded and libssl (OpenSSL) compiled with -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS.
3. Download with cryptodev loaded and libssl (OpenSSL) compiled only with -DHAVE_CRYPTODEV, which means hashing operations are still done 100% in software.

Here are the results:

Note: CPU utilization is for both cores. Each test is a single process running on a single core, so when you see CPU utilization around 50% (User% + Sys%) it means the core used for the test is fully loaded. Maybe I should have reported the numbers just for the core used, which would be roughly 2x the values you see in the table.

For reference:

AES_128_GCM_SHA256 (default Apache 2.4 TLS cipher; GCM mode is not something that can be accelerated by CESA)
CPU Utilization: User 42.9%, Sys 7.2%
Throughput: 30.6 MB/s

No HTTPS
CPU Utilization: User 1.0%, Sys 29.8%
Throughput: 112 MB/s

CONCLUSION

1. Hashing operations are slower on the CESA engine than on the CPU itself, so HW encryption with HW hashing is less performant than 100% software encryption.
2. HW encryption without HW hashing provides a 30 to 50% throughput increase while decreasing the CPU load by 20 to 30%.
3. Still pondering whether it's worth the effort to encourage people to make the move... but I think it's still a cool improvement.
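For anyone who wants to reproduce one of the batches, forcing the cipher in Apache2 looks roughly like this (a minimal sketch assuming the Debian config layout; swap in the OpenSSL cipher name for each batch, e.g. AES128-SHA256 or AES256-SHA256, and reload Apache2 afterwards):

  # /etc/apache2/mods-available/ssl.conf (path assumed, Debian layout)
  SSLProtocol         -all +TLSv1.2
  SSLCipherSuite      AES128-SHA          # OpenSSL name for TLS_RSA_WITH_AES_128_CBC_SHA
  SSLHonorCipherOrder on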
gprovost Posted October 19, 2018

I configured SSH to work with cryptodev and use the cipher AES-CBC-128. I did an scp download of the same 1GB file and got the following performance:

Throughput: 56.6 MB/s
CPU Utilization: User 12.3%, Sys 31.2%

Pretty good :-)

Important note: As concluded in the previous post, in the case of the Helios4, using cryptodev only for encryption and not for authentication (authentication involves hashing) is the only mode that provides some network performance and CPU load improvement. The other advantage of this mode is that cryptodev will be completely skipped by sshd... which is just as well, because otherwise sshd will raise an exception during authentication: cryptodev tries to make some ioctl() calls that are forbidden by the seccomp filter in the sshd sandbox. If you still want to test using cryptodev for ssh, the easy workaround is to use normal privilege separation in sshd instead of the sandbox (UsePrivilegeSeparation yes). Then, as for Apache2, you will have to force a cipher that is supported by the CESA engine (e.g. aes128-cbc)... and most probably you will also have to do the same on the client side.

Disclaimer: The sshd tweaking is not recommended for security reasons. Only experiment with it if you know what you are doing.

For reference, with cipher algorithms not supported by CESA:

AES128-CTR
CPU Utilization: User 39.1%, Sys 16.4%
Throughput: 39.9 MB/s

CHACHA20-POLY1305 (default cipher for ssh)
CPU Utilization: User 40.6%, Sys 17.0%
Throughput: 29.8 MB/s
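For reference, the tweak above boils down to something like the following (a sketch only, not a recommendation, and it assumes the Debian Stretch era OpenSSH where UsePrivilegeSeparation still exists; newer OpenSSH releases removed that option and may also need the CBC cipher explicitly re-enabled):

  # /etc/ssh/sshd_config on the Helios4, then restart sshd
  Ciphers aes128-cbc,aes256-cbc
  UsePrivilegeSeparation yes

  # client side: force the matching CESA-friendly cipher for the transfer
  scp -c aes128-cbc user@helios4:/path/to/file.bin .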
gprovost Posted November 19, 2018

Finally found the time to finish the CESA page on our Helios4 Wiki. It's not as exhaustive as it should be, but enough to help people experiment. https://wiki.kobol.io/cesa/

For the ones who are interested, please have a look... any comments are welcome ;-)
markbirss Posted November 20, 2018

@gprovost thank you for the effort with the guide. Could you possibly include some CJDNS benchmarks?
gprovost Posted November 22, 2018

On 11/20/2018 at 8:07 PM, markbirss said: Could you possibly include some CJDNS benchmarks?

I don't know much about CJDNS. Does CJDNS support cryptodev or AF_ALG?
markbirss Posted November 27, 2018

@gprovost refer to these links for details on how cjdns uses its own CryptoAuth protocol with ed25519, poly1305, and salsa20:

https://github.com/cjdelisle/cjdns/blob/master/doc/Whitepaper.md
https://github.com/hyperboria/bugs/issues/112

The idea is to see whether the encryption cjdns does benefits from cryptodev HW acceleration.
gprovost Posted November 29, 2018

I already looked at the white paper, but I don't have much time right now to dig further. I'm not sure whether CJDNS interfaces with the Kernel Crypto API. In any case, the Marvell CESA engine doesn't support Salsa20 stream encryption... so I don't think CJDNS crypto can be accelerated.
markbirss Posted November 29, 2018

Ok, understood.
Koen Posted December 6, 2018

It could be interesting to see the test repeated on a LUKS-encrypted filesystem?
gprovost Posted December 7, 2018

@Koen Good point, I will do a benchmark one of these days.
djurny Posted December 28, 2018

Hi, I would be interested in the LUKS benchmark results as well, with and without using the CESA. Currently I'm trying to get my LUKS encrypted volumes to perform a bit better on the Helios4. On my previous box (AMD Athlon X2) I saw numbers above 80 MiB/sec for disk I/O when performing a 'snapraid diff/sync/scrub' on the same LUKS encrypted volumes. The drives themselves were the I/O bottleneck. On the Helios4, those numbers have dropped significantly: ~30 MiB/sec for the same actions.

On the Helios4, with the 'marvell-cesa' and 'mv_cesa' modules loaded, 'cryptsetup benchmark' shows:

  /* snip */
  # Algorithm   |  Key  |  Encryption  |  Decryption
    aes-cbc       128b    101.3 MiB/s    104.2 MiB/s
    serpent-cbc   128b     27.8 MiB/s     29.4 MiB/s
    twofish-cbc   128b     39.7 MiB/s     44.2 MiB/s
    aes-cbc       256b     91.7 MiB/s     94.1 MiB/s
    serpent-cbc   256b     28.0 MiB/s     32.0 MiB/s
    twofish-cbc   256b     39.7 MiB/s     44.3 MiB/s
    aes-xts       256b     63.4 MiB/s     55.0 MiB/s
    serpent-xts   256b     27.6 MiB/s     31.8 MiB/s
    twofish-xts   256b     43.3 MiB/s     44.0 MiB/s
    aes-xts       512b     47.9 MiB/s     41.6 MiB/s
    serpent-xts   512b     29.8 MiB/s     31.8 MiB/s
    twofish-xts   512b     43.2 MiB/s     44.0 MiB/s

Without the CESA modules loaded, the aes-cbc performance drops significantly:

  /* snip */
  # Algorithm   |  Key  |  Encryption  |  Decryption
    aes-cbc       128b     25.1 MiB/s     56.2 MiB/s
    serpent-cbc   128b     28.0 MiB/s     31.9 MiB/s
    twofish-cbc   128b     39.7 MiB/s     44.3 MiB/s
    aes-cbc       256b     19.1 MiB/s     42.1 MiB/s
    serpent-cbc   256b     27.9 MiB/s     29.2 MiB/s
    twofish-cbc   256b     39.5 MiB/s     44.2 MiB/s
    aes-xts       256b     63.2 MiB/s     55.3 MiB/s
    serpent-xts   256b     29.8 MiB/s     31.8 MiB/s
    twofish-xts   256b     43.5 MiB/s     44.2 MiB/s
    aes-xts       512b     48.0 MiB/s     41.6 MiB/s
    serpent-xts   512b     27.3 MiB/s     31.7 MiB/s
    twofish-xts   512b     43.3 MiB/s     44.1 MiB/s

This already hints that 'dm-crypt' is not using the CESA for my current volumes. After some checking, I would need to re-encrypt my LUKS drives; they're using aes-xts-sha1, which is not supported by the CESA according to the Helios4 Wiki. The 'cryptsetup benchmark' results show an improvement only for the aes-cbc algorithms, so the first test will be to see what LUKS does with 128-bit aes-cbc-sha1 instead of 256-bit aes-xts-sha1.

Groetjes,

p.s. I'm in no way a cryptography expert, so some of the terms might not be hitting the mark completely.
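For anyone who wants to reproduce the comparison, it boils down to something like this (a sketch; rmmod will refuse to unload the driver if an open dm-crypt mapping is still using it, so a reboot without the module may be needed for the second run):

  sudo modprobe marvell_cesa
  cryptsetup benchmark          # numbers with the CESA driver available
  sudo rmmod marvell_cesa
  cryptsetup benchmark          # software-only numbers for comparison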
djurny Posted December 29, 2018

Ok, I've made some progress already: it turns out that LUKS performance gets a boost when using the cipher 'aes-cbc-essiv:sha256'. Key size did not really show any big difference when testing simple filesystem performance (~102 MiB/sec vs ~103 MiB/sec). The internets say that there is some concern about using cbc vs xts, but after some reading it looks like it's not necessarily a concern related to the privacy of the data, but more of a data integrity issue: attackers are able to corrupt encrypted content in certain scenarios. For my use case this is not a problem at all, so I'm happy to see the performance boost from using CESA!

Test setup & playbook:

- Make sure 'marvell-cesa' and 'mv_cesa' are modprobe'd.
- /dev/sdX1 - optimally aligned using parted.
- luksFormat /dev/sdX1
- luksOpen /dev/sdX1 decrypt
- mkfs.xfs /dev/mapper/decrypt
- mount /dev/mapper/decrypt /mnt
- Check /proc/interrupts
- dd if=/dev/zero of=/mnt/removeme bs=1048576 count=16384 conv=fsync
- Check /proc/interrupts to see if crypto has gotten any interrupts.

Averaged throughput measurement:

  Cipher: aes-xts-plain64:sha1, key size: 256 bits   Throughput => 78.0 MB/s   Interrupts for crypto => 11 + 8
  Cipher: aes-cbc-essiv:sha256, key size: 256 bits   Throughput => 102 MB/s    Interrupts for crypto => 139421 + 144352
  Cipher: aes-cbc-essiv:sha256, key size: 128 bits   Throughput => 103 MB/s    Interrupts for crypto => 142679 + 152079

Next steps:

- Copy all content from the aes-xts LUKS volumes to the aes-cbc LUKS volumes.
- Run snapraid check/diff/sync and check disk throughput.

Comments are welcome,

Groetjes,
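As concrete commands, the playbook above would look roughly like this (a sketch; /dev/sdX1 is a placeholder, the partition is assumed to already exist and be optimally aligned, and luksFormat will of course wipe it):

  sudo modprobe marvell_cesa
  sudo cryptsetup -c aes-cbc-essiv:sha256 -s 128 luksFormat /dev/sdX1
  sudo cryptsetup luksOpen /dev/sdX1 decrypt
  sudo mkfs.xfs /dev/mapper/decrypt
  sudo mount /dev/mapper/decrypt /mnt
  grep crypto /proc/interrupts        # note the counters
  sudo dd if=/dev/zero of=/mnt/removeme bs=1048576 count=16384 conv=fsync
  grep crypto /proc/interrupts        # counters should have increased if CESA was used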
gprovost Posted January 3, 2019

@djurny Thanks for the benchmark, very useful numbers. It reminds me that I need to indicate on the wiki the correct cipher to use with cryptsetup (preference for aes-cbc-essiv over aes-cbc-plain*) when people create their LUKS device, since on the latest versions of cryptsetup the default cipher is aes-xts-plain64.

Time to combine both benchmark tests together as suggested by @Koen

BTW you don't need to load the mv_cesa module. That is an old module which has been replaced by marvell_cesa.
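For the record, loading only the new driver, now and at every boot, can be done with something like this (the .conf filename is just an example):

  sudo modprobe marvell_cesa
  echo marvell_cesa | sudo tee /etc/modules-load.d/marvell_cesa.conf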
Koen Posted January 4, 2019

This is very useful information, as I'm planning to have the boot/root (SD) and data (SATA mirror) encrypted, with BTRFS on top. Better to get started the right way.

@djurny: did you come across good links explaining the differences / risks of cbc versus xts, or even essiv versus plain64?

Found this guide for the root fs:

And the data fs I should be able to do with a keyfile on the root fs. I think it needs to be 2x LUKS with a BTRFS "mirror" on top, so I could actually benefit from the self-healing functionality in case of a scrub.

@gprovost: am I correct to understand that CESA will be used automatically by dm-crypt, if aes-cbc-essiv (or another supported cipher) is used? Also looking forward to reading updated performance numbers, to understand whether it would be worth modifying the openssl libraries or not.
djurny Posted January 4, 2019

@Koen, do note that on Armbian the CESA modules are not loaded by default, so even if you choose aes-cbc-essiv, if you do not load the appropriate kernel module the CESA is not utilized. (See the cryptsetup benchmark results with and without marvell_cesa/mv_cesa.) For the XTS vs CBC question I will try to find & list the articles I found on the interwebs.

@gprovost, thanks for the tip, I'll refrain from loading mv_cesa.
matzman666 Posted January 6, 2019

I also want to share my numbers because I think they show some interesting findings. My setup: Helios4 & RAID5 with 4x Seagate Ironwolf 4TB (ST4000VN008). To ensure comparability, all numbers were obtained using the same command as used by @djurny:

  dd if=/dev/zero of=/mnt/removeme bs=1048576 count=16384 conv=fsync

First of all, some numbers which show how important correct alignment is (no encryption yet, filesystem was btrfs):

- md-raid5 using the whole disks, created via OMV or the instructions found in the wiki: Write throughput: ~76 MB/s
- md-raid5 using optimally aligned partitions (created via parted -a optimal /dev/sdX mkpart primary 0% 100%): Write throughput: ~104 MB/s

That is a difference of about ~26%! Based on these numbers I would not recommend using the whole disk when creating an md-raid5. Using partitions is not supported at all by OMV, so the raid has to be created on the command line.

Now my numbers when using encryption:

- md-raid5 & LUKS (aes-cbc-essiv:sha256) & xfs: Write throughput: ~72 MB/s
- md-raid5 & LUKS (aes-cbc-essiv:sha256) & btrfs: Write throughput: ~66 MB/s
- md-raid5 & LUKS (aes-cbc-essiv:sha256, with the marvell_cesa kernel module unloaded) & btrfs: Write throughput: ~34 MB/s
- md-raid5 & LUKS2 (aes-cbc-essiv:sha256) & btrfs: Write throughput: ~73 MB/s

Looking at the numbers I see a performance loss of about 30% when using encryption. Hardware encryption is working and definitely speeds up encryption, because when I unload the kernel module I see a performance loss of about 60% compared to the unencrypted case. When creating the LUKS partition via the OMV encryption plugin, aes-xts is used by default, which is not supported by marvell_cesa, and there is no way to configure a different encryption algorithm in the web GUI. To be able to use aes-cbc, the LUKS partition has to be created via the command line. Using LUKS2 instead of LUKS gives a bit of a performance boost. LUKS2 is only supported by the Ubuntu image; the Debian image has the usual Debian problem: too old packages. Based on these numbers I am ditching Debian and OMV and moving to Ubuntu. OMV is of little use, because for best performance I have to set up everything via the command line. Also, LUKS2 is more future-proof and results in better performance.

Edit2: I played a bit more with LUKS2 and found something very interesting: with LUKS2 you can also change the sector size. The default sector size is 512 bytes, but when I change it to 4K, I see massive performance improvements:

- md-raid5 & LUKS2 (aes-cbc-essiv:sha256 and 4K sectors) & btrfs: Write throughput: ~99 MB/s

That's only a performance loss of ~5% compared to the unencrypted case!?! Before, I had a performance loss of at least ~30%. I double-checked everything, and the numbers are real. This means LUKS2 is definitely the way to go.

To sum up, to get the best performance out of an encrypted RAID5 you need to:

1. Install the Ubuntu image. This means you cannot use OMV, but that's the price to pay for best performance.
2. Create a single partition on each of your disks that is optimally aligned:
   parted /dev/sdX mklabel gpt
   parted -a optimal /dev/sdX mkpart primary 0% 100%
3. Create your raid with mdadm and pass the partitions you created in the second step:
   mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
4. Create a LUKS2 partition using aes-cbc-essiv:sha256 and a sector size of 4K:
   cryptsetup -s 256 -c aes-cbc-essiv:sha256 --sector-size 4096 --type luks2 luksFormat /dev/md0
5. Create your file system on top of your encrypted raid device (see the sketch at the end of this post).

On 12/29/2018 at 7:15 AM, djurny said:
  Averaged throughput measurement:
  Cipher: aes-xts-plain64:sha1, key size: 256 bits   Throughput => 78.0 MB/s   Interrupts for crypto => 11 + 8
  Cipher: aes-cbc-essiv:sha256, key size: 256 bits   Throughput => 102 MB/s    Interrupts for crypto => 139421 + 144352
  Cipher: aes-cbc-essiv:sha256, key size: 128 bits   Throughput => 103 MB/s    Interrupts for crypto => 142679 + 152079

Do you have any numbers for the unencrypted case? I am curious because I want to know if you also see a performance loss of about 30%.

Edit: Should be parted -a, not parted -o.
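A minimal sketch of that last step, assuming btrfs and placeholder names for the mapping and mount point:

  cryptsetup luksOpen /dev/md0 cryptraid
  mkfs.btrfs /dev/mapper/cryptraid
  mount /dev/mapper/cryptraid /srv/data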
gprovost Posted January 7, 2019

@matzman666 Very interesting findings and nicely broken-down tests.

1. I wasn't aware that with LUKS2 you can define a sector size bigger than 512B, thereby decreasing the number of IO operations.

2. I never investigated how RAID performance can be impacted by today's HDDs that support a block size of 4K (Advanced Format). A non-aligned 4K disk can effectively show a significant performance penalty. But I'm still a bit surprised that you see that much performance difference (26%) with and without disk alignment in your tests without encryption. https://raid.wiki.kernel.org/index.php/A_guide_to_mdadm#Blocks.2C_chunks_and_stripes

Would you be interested in helping us improve our wiki, first on the disk alignment part and then on the disk encryption part?

https://wiki.kobol.io/mdadm/
https://wiki.kobol.io/cesa/
djurny Posted January 13, 2019

@matzman666 Sorry, no measurements. From memory, the numbers for raw read performance were way above 100 MB/sec according to 'hdparm -t'. Currently my box is live, so no more testing with empty filesystems on unencrypted devices for now. Perhaps someone else can help out?

-[ edit ]-

So my 2nd box is alive. The setup is slightly different and not yet complete. I quickly built cryptsetup 2.x from sources on Armbian; it was not as tough as I expected - pretty straightforward: configure, correct & configure, correct & ...

cryptsetup 2.x requires the following packages to be installed:

- uuid-dev
- libdevmapper-dev
- libpopt-dev
- pkg-config
- libgcrypt-dev
- libblkid-dev

Not sure about these ones, but I installed them anyway:

- libdmraid-dev
- libjson-c-dev

Build & install:

- Download cryptsetup 2.x via https://gitlab.com/cryptsetup/cryptsetup/tree/master.
- Unxz the tarball.
- ./configure --prefix=/usr/local
- make
- sudo make install
- sudo ldconfig

Configuration:

- 2x WD Blue 2TB HDDs.
- 4KiB sector aligned GPT partitions.
- mdadm RAID6 (degraded).

Test: write 4GiB worth of zeroes; dd if=/dev/zero of=[dut] bs=4096 count=1048576 conv=fsync

- directly to the mdadm device.
- to a file on an XFS filesystem on top of the mdadm device.
- directly to a LUKS2 device on top of the mdadm device (512B/4096B sector sizes for LUKS2).
- to a file on an XFS filesystem on top of a LUKS2 device on top of the mdadm device (512B/4096B sector sizes for LUKS2).

Results:

Caveats:

- CPU load is high: >75% due to mdadm using the CPU for parity calculations. If using the box as just a fileserver for a handful of clients, this should be no problem. But if more processing is done besides serving up files, e.g. transcoding or (desktop) applications, this might become problematic.
- The RAID6 under test was in degraded mode. I don't have enough disks for a fully functional RAID6 array yet; no time to tear down my old server. Having a full RAID6 array might impact parity calculations and add 2x more disk I/O to the mix.

I might consider re-encrypting the disks on my first box, to see if LUKS2 w/4KiB sectors will increase the SnapRAID performance over LUKS(1) w/512B sectors. Currently it takes over 13 hours to scrub 50% of a 2-parity SnapRAID configuration holding less than 4TB of data.

-[ update ]-

Additional test: write/read 4GiB worth of zeroes to a file on an XFS filesystem, on armbian/linux packages 5.73 (upgraded from 5.70):

  for i in {0..9} ; do time dd if=/dev/zero of=removeme.${i} bs=4096 count=$(( 4 * 1024 * 1024 * 1024 / 4096 )) conv=fsync; dd if=removeme.$(( 9 - ${i} )) of=/dev/null bs=4096 ; done 2>&1 | egrep '.*bytes.*copied.*'

Results: The write throughput appears to be slightly higher, as now the XOR HW engine is being used - but it could just as well be measurement noise. CPU load is still quite high during this new test:

  %Cpu0  :  0.0 us, 97.5 sy,  1.9 ni,  0.3 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu1  :  0.3 us, 91.4 sy,  1.1 ni,  0.0 id,  0.0 wa,  0.0 hi,  7.2 si,  0.0 st
  <snip>
   176 root      20   0       0      0      0 R  60.9  0.0  14:59.42 md0_raid6
  1807 root      30  10    1392    376    328 R  45.7  0.0   0:10.97 dd
  9087 root       0 -20       0      0      0 D  28.5  0.0   1:16.53 kworker/u5:1
    34 root       0 -20       0      0      0 D  19.0  0.0   0:53.93 kworker/u5:0
   149 root     -51   0       0      0      0 S  14.4  0.0   3:19.35 irq/38-f1090000
   150 root     -51   0       0      0      0 S  13.6  0.0   3:12.89 irq/39-f1090000
  5567 root      20   0       0      0      0 S   9.5  0.0   1:09.40 dmcrypt_write
  <snip>

Will update this again once the RAID6 array setup is complete.

Groetjes,
djurny Posted February 24, 2019

L.S.,

A quick update on the anecdotal performance of LUKS2 over LUKS.

md5sum'ing ~1 TiB of data files on LUKS:

  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.15   20.01   59.48    2.60    0.00   17.76

  Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
  sdb             339.20        84.80         0.00        848          0
  dm-2            339.20        84.80         0.00        848          0

md5sum'ing ~1 TiB of data files on LUKS2:

  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.05   32.37   36.32    0.75    0.00   30.52

  Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
  sdd             532.70       133.18         0.00       1331          0
  dm-0            532.80       133.20         0.00       1332          0

sdb:
- sdb1 optimally aligned using parted.
- LUKS(1) w/aes-cbc-essiv:sha256, w/256-bit key size.
- XFS with 4096-byte sector size.
- xfs_fsr'd regularly, negligible file fragmentation.

sdd:
- sdd1 optimally aligned using parted.
- LUKS2 w/aes-cbc-essiv:sha256, w/256-bit key size and w/4096-byte sector size, as @matzman666 suggested.
- XFS with 4096-byte sector size.

Content-wise, sdd1 is a file-based copy of sdb1 (about to wrap up the migration from LUKS(1) to LUKS2). Overall a very nice improvement!

Groetjes,

p.s. Not sure if it added to the performance, but I also spread out the IRQ assignments over both CPUs, making sure that each CESA and XOR engine has its own CPU. Originally I saw that all IRQs were handled by one and the same CPU. For reasons yet unclear, irqbalance refused to dynamically reallocate the IRQs over the CPUs. Perhaps the algorithm used by irqbalance does not apply well to ARM or the Armada SoC (initial assumption is something with cpumask being reported as '00000002', causing irqbalance to only balance on- and to CPU1?).
gprovost Posted February 25, 2019

On 2/24/2019 at 9:10 AM, djurny said: Not sure if it added to the performance, but I also spread out the IRQ assignments over both CPUs, making sure that each CESA and XOR engine have their own CPU.

Interesting point. Are you using smp_affinity to pin an IRQ to a specific CPU? Some reference for this topic: https://www.arm.com/files/pdf/AT-_Migrating_Software_to_Multicore_SMP_Systems.pdf

I'm honestly not sure you will gain performance; you might see some improvement in the case of high load... but then I would have imagined that irqbalance (if installed) would kick in and start balancing IRQs if really needed. I will need to play around to check that.
djurny Posted February 25, 2019

@gprovost, indeed, /proc/irq/${IRQ}/smp_affinity. What I saw irqbalance do is perform a one-shot assignment of all IRQs to CPU1 (from CPU0) and then... basically nothing. The "cpu mask" shown by irqbalance is '2', which led me to the assumption that it is not taking CPU0 into account as a CPU it can use for the actual balancing. So all IRQs are "balanced" onto one CPU only.

Overall, the logic for spreading the IRQ assignments over the two CPU cores was: when CPU1 is handling ALL incoming IRQs, for SATA, XOR and CESA, for all disks in a [software] RAID setup, then CPU1 will be the bottleneck of all transactions. The benefit of the one-shot 'balancing' is that you don't really need to worry about ongoing IRQ handler migration from CPUx to CPUy: the same CPU will handle the same IRQ all the time, so there is no need to migrate anything from CPUx to CPUy continuously. Any further down the rabbit hole would require me to go back to my college textbooks.

Groetjes,
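For anyone who wants to try the same spread, a minimal sketch (the IRQ numbers are examples; check /proc/interrupts for the two f1090000.crypto lines on your own system, and note that smp_affinity takes a hexadecimal CPU mask, so 1 = CPU0 and 2 = CPU1):

  grep crypto /proc/interrupts
  echo 1 | sudo tee /proc/irq/48/smp_affinity    # first CESA engine  -> CPU0
  echo 2 | sudo tee /proc/irq/49/smp_affinity    # second CESA engine -> CPU1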
gprovost Posted February 26, 2019

@djurny Yeah, my test with irqbalance shows that at startup it pins some IRQs to CPU1 and some to both (see table below). I don't witness any dynamic change while stressing the system. When the SMP affinity is set to both CPUs (value = 3), the IRQ is allocated by default to CPU0. So with irqbalance, the only good thing I see is that the network device interrupt will not be on the same CPU as SATA, XOR and CESA.

But you're right, manually pinning the IRQs to specific CPUs might help performance in case of heavy load. Before doing such tuning, though, we need to define a proper benchmark to measure the positive and negative impact of that tuning.
gprovost Posted March 18, 2019

Just for info, I updated the libssl1.0.2 cryptodev patch to work with the latest libssl1.0.2 Debian deb (libssl1.0.2_1.0.2r): https://wiki.kobol.io/cesa/#network-application-encryption-acceleration

So if some people want to test offloading crypto operations while using Apache2 or SSH for example ;-) You can also find a pre-built dpkg here if you want to skip the compile steps.
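A quick way to sanity-check that the patched libssl really hits the engine is something along these lines (a sketch; it assumes the cryptodev module and the patched libssl1.0.2 are installed, and that the openssl binary is linked against that patched library):

  sudo modprobe cryptodev
  grep crypto /proc/interrupts       # note the two f1090000.crypto counters
  openssl speed -evp aes-128-cbc
  grep crypto /proc/interrupts       # the counters should have climbed if offload is active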
gprovost Posted July 30, 2019

https://wiki.kobol.io/cesa has been updated to be ready for the upcoming Debian Buster release. For Debian Buster we will use AF_ALG as the Crypto API userspace interface. We chose AF_ALG for Debian Buster because it doesn't require any patching or recompiling, but it comes with some limitations. While benchmarks show a throughput improvement with AF_ALG in some cases, the CPU load is not improved compared to cryptodev or 100% software encryption. This will require further investigation.

So far I haven't succeeded in enabling cryptodev in the Debian version of libssl1.1.1... It needs further work, and it's not really our priority right now.

Also, for Debian Stretch I updated the pre-built libssl1.0.2 Debian deb to the latest security update release (libssl1.0.2_1.0.2s). It can be found here.
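For the AF_ALG route, a quick test on Buster could look like the following (a sketch under the assumption that the afalg engine shipped with Debian's libssl1.1 is the userspace path being used, and that marvell_cesa is loaded; the wiki page remains the authoritative procedure):

  sudo modprobe marvell_cesa
  openssl engine -c afalg                        # is the engine available, and does it list AES-CBC?
  openssl speed -engine afalg -evp aes-128-cbc
  grep crypto /proc/interrupts                   # counters should climb if the CESA handled the work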
DavidGF Posted June 30, 2020

Not sure whether it's a good idea to bump this thread, but just my 2 cents on this: CESA is quite good for use cases such as LUKS. Obviously you will need to use aes-cbc (use 128 bits for maximum performance). In this mode the CPU usage is quite low (you can see two kernel workers doing 10% on each CPU and a ton of interrupts, but other than that it works well), which is great for keeping load off a CPU that is already limited. For comparison, in my current setup I get:

- Plain access to disk: 180 MB/s
- LUKS2 + CESA: 140 MB/s
- LUKS2 (no CESA, pure CPU): 52 MB/s

I tuned LUKS using a sector size of 4096 and a key size of 128, as well as aes-cbc-plain64. For me this is quite good already, even though CESA only supports some "relatively old" ciphers. Hope this helps other people!
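That tuning translates to roughly this luksFormat invocation (a sketch; the device path is a placeholder, and note that earlier posts in this thread preferred aes-cbc-essiv over aes-cbc-plain64):

  cryptsetup luksFormat --type luks2 -c aes-cbc-plain64 -s 128 --sector-size 4096 /dev/sdX1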
sfx2000 Posted July 11, 2020

With CESA, many assumptions are based on the Armada 38x, where things in scope looked very good on ARMv7-A - both the Marvell cores and the later Cortex-A9s... Armada gets mixed up here... let's just say that the 38x has massive bandwidth inside the chip, while the MV3720/MoChi doesn't...

On the MV3720 it's always a tradeoff: more CPU and throughput, or offload to the CESA units, which is probably similar to other "offloads" to dedicated accelerators.

Personally I would not recommend CESA there; I would recommend staying on the cores, as the constraint is CPU/memory, CESA is limited, and so is the bus within the chip itself.
jsr Posted April 2, 2021

Hi there! I have been using LUKS2 encrypted disks on my Helios4, and I've noticed that encryption/decryption is no longer making use of the CESA accelerator. I don't see the relevant interrupts counting up:

  48:          0          0     GIC-0  51 Level     f1090000.crypto
  49:          0          0     GIC-0  52 Level     f1090000.crypto

and my decryption speed is a lot lower (~45 MB/s) than it used to be (~100 MB/s). This used to work on this OS install (Armbian Debian Stretch, apt dist-upgrade'd to Buster 21.02.3), and I wonder if there may have been a regression somewhere along the line.

I've also tested using matzman666's examples above for setting up a LUKS2 encrypted disk, and I still have the same issue: I don't see any CESA HW acceleration with LUKS2. I checked and I do have the marvell_cesa module loaded. Here is the cryptsetup command:

  cryptsetup -s 256 -c aes-cbc-essiv:sha256 --sector-size 4096 --type luks2 luksFormat /dev/mmcblk0p2

I have also tried a fresh install of the latest Armbian Debian Buster 21.02.3 and Armbian Ubuntu Focal 21.02.3, and I still have no CESA acceleration. Has anyone seen this issue? How do we get LUKS2 to use the CESA hardware these days?

Regards, James
lanefu Posted April 2, 2021

@TheLinuxBug do you have any thoughts?
djurny Posted April 3, 2021

@jsr, can you also check /proc/cryptinfo? If I remember correctly, the mv/marvell module should show up in there.

Edit: Also noticed you are LUKSing an mmc device. Not sure about the speed of the device used in your setup, but my SD cards hardly reach 30 MiB/s, even with CESA working. Your throughput might be limited by an additional bottleneck.
jsr Posted April 3, 2021

@djurny, I have a /proc/crypto (but not /proc/cryptinfo), link at the bottom. Apologies, I should have explained the use of the mmc device: it was used just to confirm the issue still exists when starting afresh using matzman666's instructions. I had some spare space available on that mmc device, and when using it I was only checking the CESA interrupt count in /proc/interrupts. My real storage is a pair of hard drives that can each read sequentially at ~180 MB/s (tested with dd).

uname -a:
  Linux helios4 5.10.21-mvebu #21.02.3 SMP Mon Mar 8 00:59:48 UTC 2021 armv7l GNU/Linux

/proc/crypto: https://pastebin.com/QuSpD5u5
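For anyone debugging the same thing, a few quick checks that may help narrow it down (a sketch; the mv-* driver names are how the mainline marvell-cesa driver registers its algorithms, and dm-crypt will only use them if they are the highest-priority cbc(aes) implementation on the system):

  lsmod | grep marvell                        # is the driver actually loaded?
  grep 'mv-' /proc/crypto                     # are mv-cbc-aes / mv-sha* registered?
  grep -A3 'name.*: cbc(aes)' /proc/crypto    # compare priorities of the competing implementations
  grep crypto /proc/interrupts                # the f1090000.crypto counters should move during dm-crypt I/O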