I've heard quite often that it doesn't matter, and even that 4kn is faster or that 512e comes with "a performance hit". However, I have yet to see any direct comparisons with tools and parameters I trust (such as fio), so I decided to run my own. I'll try to keep this concise.
Obligatory necessities:
Hardware: Asus Z11PA-U12-10G-2S with Xeon 4210R, 192GB DDR4 ECC.
Controller: onboard SATA
OS: Red Hat 8.4, kernel 4.18.0-305.19.1.el8_4.x86_64
NO GUI (not installed): boot to multi-user.target (runlevel 3 if you like)
Test tool: fio 3.19
Disks: 2 x WD Gold 8TB Enterprise Class Hard Disk Drive - 7200 RPM Class SATA 6 Gb/s 256MB Cache 3.5 Inch - WD8003FRYZ
Only user and optional background processes were stopped. Nothing else was changed, tuned or tweaked.
The server is running Red Hat Virtualization (oVirt, if you will); all VMs were stopped, the server was put into maintenance mode, and its processes were stopped.
Partitions were aligned at sector 2048 with gdisk using all defaults (n, enter, enter, enter, enter, w); a non-interactive equivalent is sketched after the block-size check below.
Partitions were formatted with ext4, no special parameters.
Disks checked and confirmed physical and logical blocksize:
Bash:
# cat /sys/block/sdb/queue/logical_block_size
512
# cat /sys/block/sdb/queue/physical_block_size
4096
# cat /sys/block/sdc/queue/logical_block_size
4096
# cat /sys/block/sdc/queue/physical_block_size
4096
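For anyone who wants to reproduce the prep steps non-interactively, something along these lines should be equivalent. This is only a sketch: sgdisk stands in for the interactive gdisk session, and /dev/sdX is a placeholder for the disk being prepared.
Bash:
# Sketch of the partition/format steps above (placeholder device; destroys data on it)
sgdisk --new=1:0:0 /dev/sdX   # one partition spanning the disk, default 2048-sector (1 MiB) alignment
mkfs.ext4 /dev/sdX1           # ext4 with no special parameters, as used for the tests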
TL;DR:
512e is ~16% faster with 4 threads of a random mixed workload of 70% read / 30% write at 16k blocksize. This increases to ~19% with 8 threads and a higher queue depth at 4k blocksize. Sequential 1M reads are 8% slower on 512e than on 4kn, and sequential 1M writes are 3% slower.
I did 4 runs with fio. Actually, I did a LOT more runs, but I'll spare you those; these are the averages:
Bash:
# RUN 1
fio --filename=fiotest --size=4GB --rw=randrw --rwmixread=70 --rwmixwrite=30 --bs=16k --ioengine=libaio --iodepth=16 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --direct=1 --end_fsync=1
# RUN 2
fio --filename=fiotest --size=4GB --rw=randrw --rwmixread=70 --rwmixwrite=30 --bs=4k --ioengine=libaio --iodepth=32 --runtime=120 --numjobs=8 --time_based --group_reporting --name=iops-test-job --direct=1 --end_fsync=1
# RUN 3
fio --filename=fiotest --size=40GB --rw=read --bs=1M --ioengine=libaio --iodepth=1 --runtime=240 --numjobs=1 --time_based --group_reporting --name=iops-test-job --direct=1 --end_fsync=1
# RUN 4
fio --filename=fiotest --size=40GB --rw=write --bs=1M --ioengine=libaio --iodepth=1 --runtime=240 --numjobs=1 --time_based --group_reporting --name=iops-test-job --direct=1 --end_fsync=1
Obligatory disclaimer:
These are my own results, YMMV. Take it as you please.
And I'll say this beforehand: I've tested and confirmed that using direct=1 and end_fsync=1 on ext4 and xfs filesystems negates caching and buffering. In other words, there is no need to use a test size twice the size of memory. This is different on ZFS, for example, where the ARC works differently (also tested and confirmed).
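As an aside, for anyone who prefers job files to one-liners, RUN 1 translates roughly to the ini-style file below. It just restates the same parameters already shown above, nothing new.
Code:
; run1.fio - job-file form of RUN 1 (same parameters as the command line above)
[iops-test-job]
filename=fiotest
size=4GB
rw=randrw
rwmixread=70
rwmixwrite=30
bs=16k
ioengine=libaio
iodepth=16
numjobs=4
runtime=120
time_based
group_reporting
direct=1
end_fsync=1
Run it with fio run1.fio.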
Now for the numbers:
| | 512e read IOPS | 512e read MiB/s | 512e write IOPS | 512e write MiB/s | 4kn read IOPS | 4kn read MiB/s | 4kn write IOPS | 4kn write MiB/s |
|---|---|---|---|---|---|---|---|---|
| RUN 1 | 359 | 5.62 | 156 | 2.45 | 298 | 4.67 | 130 | 2.04 |
| RUN 2 | 374 | 1.46 | 162 | 0.63 | 305 | 1.19 | 132 | 0.52 |
| RUN 3 | 175 | 176.00 | - | - | 189 | 190.00 | - | - |
| RUN 4 | - | - | 195 | 196.00 | - | - | 201 | 201.00 |
mdadm RAID 0 results
So you thought I was done? Ha! While I was at it, I decided to test a RAID 0 with different chunk sizes on 512e disks. I reformatted the 4kn drive to 512e with HUGO, then checked and confirmed:
Bash:
# cat /sys/block/sdc/queue/logical_block_size
512
# cat /sys/block/sdc/queue/physical_block_size
4096
All arrays were created the same way, except for chunk size:
Bash:
# mdadm --create --verbose /dev/md/mdraid0_backup --run --chunk=16K --metadata=1.2 --raid-devices=2 --level=0 /dev/sdb /dev/sdc
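If you want to repeat the whole series, the stop/recreate/format cycle per chunk size can be scripted. A rough sketch follows; the device names and mount point are placeholders, and it destroys everything on the member disks.
Bash:
# Rough sketch: recreate the RAID 0 with each chunk size, reformat, then rerun the fio jobs
for chunk in 16K 64K 128K 1M; do
    mdadm --stop /dev/md/mdraid0_backup 2>/dev/null        # stop the previous array, if any
    mdadm --create --verbose /dev/md/mdraid0_backup --run --chunk=$chunk \
          --metadata=1.2 --raid-devices=2 --level=0 /dev/sdb /dev/sdc
    mkfs.ext4 -F /dev/md/mdraid0_backup                    # same ext4 as above, -F to overwrite in a script
    mount /dev/md/mdraid0_backup /mnt/test
    # ... run the four fio jobs from above against /mnt/test ...
    umount /mnt/test
done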
All filesystems were created the same way as above, with the exception of the xfs test (details below).
The numbers:
| chunk=16K | read IOPS | read MiB/s | write IOPS | write MiB/s |
|---|---|---|---|---|
| RUN 1 | 733 | 11.50 | 317 | 4.97 |
| RUN 2 | 767 | 3 | 328 | 1.28 |
| RUN 3 | 398 | 398 | - | - |
| RUN 4 | - | - | 405 | 406 |

| chunk=64K | read IOPS | read MiB/s | write IOPS | write MiB/s |
|---|---|---|---|---|
| RUN 1 | 741 | 11.60 | 321 | 5.02 |
| RUN 2 | 774 | 3.03 | 331 | 1.30 |
| RUN 3 | 404 | 405 | - | - |
| RUN 4 | - | - | 407 | 408 |

| chunk=128K | read IOPS | read MiB/s | write IOPS | write MiB/s |
|---|---|---|---|---|
| RUN 1 | 746 | 11.70 | 322 | 5.05 |
| RUN 2 | 767 | 3 | 328 | 1.28 |
| RUN 3 | 404 | 404 | - | - |
| RUN 4 | - | - | 414 | 414 |

| chunk=1M | read IOPS | read MiB/s | write IOPS | write MiB/s |
|---|---|---|---|---|
| RUN 1 | 739 | 11.50 | 321 | 5.02 |
| RUN 2 | 774 | 3.03 | 331 | 1.30 |
| RUN 3 | 404 | 405 | - | - |
| RUN 4 | - | - | 407 | 408 |
For good measure, one more run on XFS with 64K chunks:
This time I did use a few parameters to format the filesystem, as shown here:
Code:
# mkfs.xfs -d su=128K -d sw=2 /dev/md/mdraid0_backup -f
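To double-check that the stripe geometry took, xfs_info on the mounted filesystem reports sunit/swidth in 4 KiB filesystem blocks, so su=128K, sw=2 should show up as sunit=32, swidth=64. The mount point below is just an example.
Bash:
# Verify stripe alignment after mounting (mount point is only an example)
mount /dev/md/mdraid0_backup /mnt/test
xfs_info /mnt/test | grep -E 'sunit|swidth'   # expect sunit=32 swidth=64 blks for su=128K, sw=2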
| chunk=64K (XFS) | read IOPS | read MiB/s | write IOPS | write MiB/s |
|---|---|---|---|---|
| RUN 1 | 772 | 12.10 | 334 | 5.22 |
| RUN 2 | 813 | 3.18 | 347 | 1.36 |
| RUN 3 | 425 | 425 | - | - |
| RUN 4 | - | - | 427 | 427 |
My own conclusion after all of this is that I will run 512e disks in an mdadm RAID with 64k chunks, formatted with XFS as shown here.
Today I will reformat all my 8TB Ultrastars to 512e and recreate a RAID 10 array. I'll do a few more fio runs after that just to make sure it's in line with these tests on the WDs.
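For the record, the planned rebuild would look something like this. The device and array names are placeholders, four disks are assumed, and sw=2 reflects the default near-2 layout (two data stripes) with su matched to the 64K chunk.
Bash:
# Sketch of the planned RAID 10 rebuild (placeholder devices, 4 disks assumed)
mdadm --create --verbose /dev/md/mdraid10_vmstore --run --chunk=64K \
      --metadata=1.2 --raid-devices=4 --level=10 /dev/sd[b-e]
mkfs.xfs -f -d su=64K -d sw=2 /dev/md/mdraid10_vmstore   # sw=2: two data stripes in a 4-disk RAID 10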
Hope you appreciate my efforts!