zfs read vs. write performance


Alexander Eiblinger

Hi,

 

I have successfully installed my new Helios64 and enabled ZFS. Everything works so far, but it seems I'm missing something "important" with regard to ZFS.

 

I did the following tests:

I took one of my WD 4TB Plus drives (in slot 3) and formatted the disk with a) ext4 (mkfs.ext4 /dev/sdc) b) btrfs (mkfs.btrfs /dev/sdc) and c) zfs (zpool create test /dev/sdc).

On each of these three file systems I ran "dd if=/dev/zero of=test2.bin bs=1GB count=5" to measure write performance and "dd if=test2.bin of=/dev/null" to measure read performance (each test was run 3 times and the results averaged).

 

The results I got look like this:

a) ext4 ... avg write 165MB/s - read 180MB/s

b) btrfs ... avg write 180MB/s - read 192MB/s

c) zfs ... avg write 140MB/s - read 45MB/s (!)

 

So while ext4 and btrfs produce pretty much the same (and expected) results for sequential read / write, ZFS seems to be significantly slower in write and dramatically slower in read performance.

 

I have to admit, I'm new to ZFS ... but is this expected / normal?

ZFS has tons of parameters to tune performance - are there any settings which are "needed" to achieve almost the same values as e.g. with btrfs?
(Honestly, I expected ZFS to perform on par with btrfs - as ZFS should be the more "mature" file system ...)

 

thanks!
A


For spinning disks, you should use ashift 12 (and ashift 13 for SSDs) when creating your ZFS pool. It can't be changed after the fact and should match the physical block size of your HDD (12 = 4096 bytes). ZFS read speed also benefits from additional disks: if you created e.g. a mirror across two disks or a RAIDZ1 across 5 disks, you should see pretty good performance. Also, to make the above test unfair (in favor of ZFS), you could enable compression (-O compression=lz4), since the test file consists entirely of zeroes and compresses to almost nothing.

zpool create -o ashift=12 ...
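
The relation between ashift and sector size mentioned above (12 = 4096) is just a base-2 logarithm. A minimal sketch of that arithmetic follows; `ashift_for` is a hypothetical helper name for illustration, not part of any ZFS tool, and the sector size you feed it would have to come from your own drive (for example from `lsblk -o NAME,PHY-SEC,LOG-SEC`).

```shell
#!/bin/sh
# ashift is the base-2 logarithm of the sector size ZFS assumes for a vdev.
# ashift_for is a hypothetical helper, not a ZFS command.
ashift_for() {
    size=$1
    shift_val=0
    # Halve the size until it reaches 1, counting the steps.
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        shift_val=$((shift_val + 1))
    done
    echo "$shift_val"
}

ashift_for 512     # 9
ashift_for 4096    # 12
ashift_for 8192    # 13
```

The helper assumes the input is a power of two, which is the case for the sector sizes discussed here.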

 

Personally I use:

zpool create \
        -o ashift=12 \
        -O acltype=posixacl -O canmount=off -O compression=lz4 \
        -O dnodesize=auto -O normalization=formD -O relatime=on \
        -O xattr=sa \
        ...

 


Thank you for your answer.

I know about ashift=12; my tests were run with this setting. I also tried your settings (without compression!), but they make no real difference.

Using three disks brings things up to ~80MB/s - which is still unacceptable.

 

But I think I found my problem:

It is actually related to the block size - not the block size of the ZFS pool, but the block size dd is using:

 

This is what I'm usually doing:
 

root@helios64:/test# dd if=test2.bin of=/dev/null
9765625+0 records in
9765625+0 records out
5000000000 bytes (5.0 GB, 4.7 GiB) copied, 107.466 s, 46.5 MB/s

 

dd is using a block size of 512 bytes here (its default). For ext4 / btrfs this seems to be no issue - but ZFS has a problem with it.
If I explicitly use a block size of 4096, I get these results:

 

root@helios64:/test# dd if=test2.bin of=/dev/null bs=4096
1220703+1 records in
1220703+1 records out
5000000000 bytes (5.0 GB, 4.7 GiB) copied, 27.4704 s, 182 MB/s

 

Which gives the expected figures!

 

So as so often: "Layer 8 problem"  - the problem sits in front of the screen. 
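
The record counts in the two dd runs above follow directly from the file size and the block size. As a rough sketch of that arithmetic (the helper name `calls_for` is made up for illustration, and a shell with 64-bit integer arithmetic is assumed):

```shell
#!/bin/sh
# Number of blocks dd has to read for a file of a given size.
# Sizes are taken from the dd output above; calls_for is a hypothetical helper.
# Assumes the shell does 64-bit arithmetic (true for bash and dash on 64-bit Linux).
calls_for() {
    file_bytes=$1
    bs=$2
    # Round up: a trailing partial block still costs one read.
    echo $(( (file_bytes + bs - 1) / bs ))
}

calls_for 5000000000 512     # 9765625, matches "9765625+0 records in"
calls_for 5000000000 4096    # 1220704, matches "1220703+1 records in"
```

With bs=4096, dd issues one eighth as many reads for the same 5 GB, which fits the jump from 46.5 MB/s to 182 MB/s seen above.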


Read speeds are quite hard to measure. If you test the read speed of the same file multiple times, ZFS will cache it.
I'm getting up to 1 GB/s read speed after reading the same files 4 times in a row. Your file might also be cached if you just wrote it.

By the way if someone knows how to flush that cache that would be helpful to run tests.
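
Two approaches that are commonly used to get cold-cache reads on ZFS, offered here as a suggestion and not something confirmed in this thread: export and re-import the pool, or restrict the test dataset to metadata-only caching via the `primarycache` property. The sketch below only prints the commands so that nothing gets exported by accident; `POOL` and `DATASET` are placeholders ("test" is the pool name from the first post), and `cold_cache_cmds` is a made-up helper name.

```shell
#!/bin/sh
# Prints the commands instead of running them.
# POOL and DATASET are placeholders; adjust them to your own setup.
POOL=${POOL:-test}
DATASET=${DATASET:-$POOL}

cold_cache_cmds() {
    # Option 1: export and re-import the pool, which drops its cached data.
    echo "zpool export $POOL"
    echo "zpool import $POOL"
    # Option 2: cache only metadata for the dataset used in the benchmark.
    echo "zfs set primarycache=metadata $DATASET"
}

cold_cache_cmds
```

If you use the second option, remember to set `primarycache` back to `all` after testing.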


23 hours ago, tionebrr said:

Read speeds are quite hard to measure. If you test the read speed of the same file multiple times, ZFS will cache it.
I'm getting up to 1 GB/s read speed after reading the same files 4 times in a row. Your file might also be cached if you just wrote it.
 

 

That's why I wrote / read 5 GB ... the Helios64 has "only" 4 GB RAM, so if 5 GB are read / written, the cache should have no usable copy of the data.


Hello,

 

I have similar issues on my Helios64 and I don't see how to fix them.
I'm trying to replace my classic 4-disk RAID5 on ext4 with an equivalent raidz1.

 

What I've done so far:

 

1 - Added an 8TB disk in my 5th slot and formatted it with btrfs

2 - Backed up all my data onto it with rsync. I got an average write speed of 105-110MB/s on big files (HD movies, usually around 10-15GB each), which is really OK for me: my NAS sits on a 1Gb network, so more isn't necessary.

3 - Created the pool the way it's described in the helios NFS tutorial (I've since created a few datasets with different recordsize values)

4 - rsynced my data back onto the new raidz

 

And I get very poor results.

Initially I got around 35 MB/s. After tinkering a lot with recordsize (now it's 1M) and other settings I managed to get up to 55 MB/s, but that's still way too slow for my taste.

 

I've benchmarked a few use cases with a single 5GB file, and here are the results.

It seems that when rsync works on only one file, it goes slightly faster (74MB/s).

raidz -> btrfs
root@helios64:/mypool/video# rsync -av --progress pipot/ /srv/dev-disk-by-id-ata-ST8000NM0055-1RM112_ZA1K2SN4-part1/backup/video/pipot/
sending incremental file list
...
sent 4,916,400,124 bytes  received 138 bytes  95,464,082.76 bytes/sec
total size is 4,915,200,000  speedup is 1.00


btrfs -> raidz
root@helios64:/mypool/video# rsync -av --progress /srv/dev-disk-by-id-ata-ST8000NM0055-1RM112_ZA1K2SN4-part1/backup/video/pipot/ ./pipot2/
sending incremental file list
...
sent 4,916,400,124 bytes  received 69 bytes  73,930,829.97 bytes/sec
total size is 4,915,200,000  speedup is 1.00


raidz -> raidz
root@helios64:/mypool/video# rsync -av --progress pipot/ /pipot3/
sending incremental file list
...
sent 4,916,400,124 bytes  received 68 bytes  57,501,756.63 bytes/sec
total size is 4,915,200,000  speedup is 1.00


btrfs -> btrfs
root@helios64:/mypool/video# rsync -av --progress /srv/dev-disk-by-id-ata-ST8000NM0055-1RM112_ZA1K2SN4-part1/backup/video/pipot/ /srv/dev-disk-by-id-ata-ST8000NM0055-1RM112_ZA1K2SN4-part1/backup/video/pipot2/
...
sent 4,916,400,124 bytes  received 139 bytes  68,760,842.84 bytes/sec
total size is 4,915,200,000  speedup is 1.00

 

I've also experimented with dd, and I get results that match Alexander's. It really seems to be a read speed issue: when I created the original test file from /dev/zero, the write speed already maxed out at a block size of around 32k.

Creating the test file from /dev/zero
root@helios64:/mypool/video# dd if=/dev/zero of=testfile1  count=150000 bs=32k status=progress
4791074816 bytes (4.8 GB, 4.5 GiB) copied, 12 s, 399 MB/s
150000+0 records in
150000+0 records out
4915200000 bytes (4.9 GB, 4.6 GiB) copied, 12.2836 s, 400 MB/s


Just copying the file
bs 512 (I interrupted it before it finished)
root@helios64:/mypool/video# dd if=testfile1 of=testfile2 bs=512 status=progress
17506304 bytes (18 MB, 17 MiB) copied, 24 s, 729 kB/s

bs 4k -> 5.8 MB/s
bs 8k -> 10.9 MB/s
bs 16k -> 22.8 MB/s
bs 32k -> 43.5 MB/s
bs 64k -> 76.0 MB/s
bs 128k -> 129 MB/s
bs 512k -> 264 MB/s
bs 1M -> 371 MB/s
bs 4M -> 397 MB/s
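
One way to read the figures listed above is to divide each throughput by its block size, which gives the number of dd iterations (one read plus one write) completed per second. The sketch below does only that arithmetic; `ops_per_sec` is a made-up helper name, and the inputs are the numbers from this post, taken as decimal MB the way dd reports them.

```shell
#!/bin/sh
# Throughput (bytes/s) divided by block size (bytes) = dd iterations per second.
# ops_per_sec is a hypothetical helper; figures are the ones from the list above.
ops_per_sec() {
    LC_ALL=C awk -v mbps="$1" -v bs="$2" \
        'BEGIN { printf "%.0f\n", mbps * 1000000 / bs }'
}

# Columns: label, measured MB/s, block size in bytes.
while read -r label mbps bs; do
    printf '%-5s %s iterations/s\n' "$label" "$(ops_per_sec "$mbps" "$bs")"
done <<'EOF'
512 0.729 512
4k 5.8 4096
8k 10.9 8192
16k 22.8 16384
32k 43.5 32768
128k 129 131072
1M 371 1048576
EOF
```

For block sizes up to 32k the result stays at roughly 1,300 to 1,400 iterations per second, which suggests (without proving it) that in this range the copy is limited by a fixed cost per read/write call and not by disk bandwidth.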

 

So how can I make rsync (or cp, for that matter) perform the same way as dd?

 

Thanks


Definitely don't use dedupe. Compression seems to work fine for me, but most of my data so far isn't very compressible. When I've done tests using things that are mostly text it works out well. For most things that are binary or encrypted, it doesn't seem to do much. I'm getting a compressratio of only 1.01 using either zstd or lz4 on most of my datasets. If you know you're going to have places that are going to benefit from compression, use it there, but I'm tempted to turn it off for a lot of my stuff (media, encrypted backups).


@gprovost regarding compression, there shouldn't be any RAM constraints to consider: ZFS compression operates on recordsize-sized chunks (i.e. between 4 KiB and 1 MiB with ashift=12). Personally I think it's a good idea to set LZ4 as the default compression for the entire pool and then adjust on a per-dataset basis where needed. LZ4 is very cheap on CPU and can give some easy savings (see below). I would advise against Gzip, as its CPU overhead is quite significant. If higher compression is required, OpenZFS 2.0+ with Zstandard (zstd) might be a better alternative, as it can achieve Gzip-level compression at a much lower (de)compression cost.

As @wurmfood pointed out, however, media or encrypted data rarely benefits from compression; the only exception is data that contains padding or zeroes, which compresses well under LZ4. So that's a good example of when not to use compression. Also keep in mind that disabling compression will not decompress existing data: uncompressed, LZ4-compressed, Gzip-compressed, etc. data can coexist in one dataset. A full rewrite is needed to change the compression of all the data.

 

As for deduplication, unless the zpool is extremely small, dedup is not really an option on the Helios64 as it requires very large amounts of RAM. The man page recommends at least 1.25 GiB of RAM per 1 TiB of storage.
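
To put the 1.25 GiB per 1 TiB guideline quoted above into numbers, here is a small sketch of the multiplication. The helper name `dedup_ram_gib` and the example pool sizes are assumptions for illustration only.

```shell
#!/bin/sh
# Recommended RAM for dedup according to the 1.25 GiB per 1 TiB rule quoted above.
# dedup_ram_gib is a hypothetical helper; the pool sizes are example values.
dedup_ram_gib() {
    LC_ALL=C awk -v tib="$1" 'BEGIN { printf "%.2f\n", tib * 1.25 }'
}

dedup_ram_gib 4     # 5.00
dedup_ram_gib 16    # 20.00
```

By that rule a pool of only 4 TiB would already call for more than the 4 GB of RAM the Helios64 has.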

 

For reference, some of my space savings from using compression:

 

NAME                                          USED  COMPRESS        RATIO
rpool/ROOT/debian                            15.5G  lz4             1.73x
rpool/data/service                           18.7G  lz4             2.08x
rpool/data/service/avahi                      208K  lz4             1.00x
rpool/data/service/grafana                   48.4M  lz4             1.83x
rpool/data/service/loki                       736K  lz4             1.00x
rpool/data/service/mariadb                    676M  lz4             3.25x
rpool/data/service/mariadb/db                 214M  lz4             1.88x
rpool/data/service/mariadb/dump               455M  gzip            3.88x
rpool/data/service/prometheus                10.9G  lz4             2.29x
rpool/data/service/promtail                   768K  lz4             1.00x
rpool/data/service/samba                     2.73M  lz4             6.04x
rpool/data/service/unifi                     3.41G  lz4             1.34x
rpool/data/service/unifi/db                  2.38M  lz4             1.00x
rpool/var/log                                18.0G  lz4             2.26x

 


4 hours ago, ShadowDance said:

As for deduplication, unless the zpool is extremely small, dedup is not really an option on the Helios64 as it requires very large amounts of RAM. The man page recommends at least 1.25 GiB of RAM per 1 TiB of storage.

 

Yes, exactly, and I think it's important to recommend disabling it at pool creation.


21 hours ago, gprovost said:

 

Yes, exactly, and I think it's important to recommend disabling it at pool creation.

 

I believe it's not enabled by default. You have to choose to set it. At least, none of my pools have been created with it and I never specified not to.

 

As a side note, I double-checked the compression on all of my datasets and noticed that some of my docker data sees massive compression with zstd on. Most are in the 1-3x range, but I have several in the 8-9x range.
