Best practice - SSDs, hardware RAID and filesystems


Gb.

New Member
Hi!

Following this thread (Supermicro 216BE1C-R741JBOD), I have a few questions.
The setup:
- host system with ESXi 6.5 update 1,
- LSI 9361-8i without CacheVault, hardware RAID10 with 8 SSDs on the host system; this is the datastore for ESXi,
- LSI 9480-8i8e without CacheVault, connected to the Supermicro JBOD 216BE1C-R741JBOD, hardware RAID6 with 24 SSDs; this RAID card is passed through by ESXi,
- Ubuntu Server 17.10 (I think the upgrade to 18.04 LTS will be smoother from 17.10 than from 16.04).

The goal is to use this server as a file server / cloud (probably Nextcloud; currently using ownCloud).
I am now wondering which filesystem to use on top of an SSD hardware RAID.
Until now, I assumed btrfs was the best fit for this setup.
And in any case, which mount options should I use (like -o ssd_spread, -o discard or -o ssd)?
Is it still necessary to align the partitions manually?
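To make the question concrete, the kind of fstab entry I have in mind would look roughly like this (purely illustrative; the device, mount point and option mix are placeholders, not a recommendation):

/dev/sda1  /srv/data  btrfs  defaults,noatime,ssd  0  0

(discard or ssd_spread would go in the same options field.)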

Thanks!
 

gigatexal

I'm here to learn
It might not be SSD-aware, but I still put ZFS on my SSDs since I like the compression and bitrot protection. For e-peen numbers and such you could just do mdadm software RAID and put something like XFS on top. It depends on your use case, really. What is, on average, the size of the files you're working with? Large, small? More questions need answering before a good recommendation can be made.
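If you go the mdadm + XFS route, the basic shape is something like this (a minimal sketch with hypothetical device names; adjust the RAID level, device list and mount point to your setup):

sudo mdadm --create /dev/md0 --level=10 --raid-devices=8 /dev/sd[b-i]
sudo mkfs.xfs /dev/md0
sudo mount -o noatime,discard /dev/md0 /mnt/data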
 

BackupProphet

Well-Known Member
I also put my SSDs on ZFS; the compression is awesome. It feels like I get much larger drives. I run them as RAID 0 though, so no self-healing here. The reason is that I have never had a failed enterprise-grade SSD or a bad checksum. I may change that opinion one day, though. And I still have backups :D
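For reference, a striped (RAID 0 style) pool with LZ4 compression is about as simple as it gets (device names are placeholders):

zpool create tank /dev/sdb /dev/sdc /dev/sdd /dev/sde
zfs set compression=lz4 tank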
 

whitey

Moderator
I did this as well (striped SSD pools, no redundancy) for a good 6+ months with no ill effects on HGST enterprise SSDs. I chickened out recently and converted both pools back (they are on separate systems with replication between the two, so if a drive did die, oh well, plenty of backups). I've never had any issues or failures on my enterprise-class devices either, but I took them back to raidz with a SLOG, and I'm still happy with the performance and sleep a little better at night.
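For anyone wanting to do the same, adding a SLOG to an existing pool is just a log vdev (the device name here is a placeholder for a fast SSD/NVMe with power-loss protection):

zpool add tank log /dev/nvme0n1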
 

Gb.

New Member
Hello there,

I have run a few tests, and this is the final setup:
- @zer0gravity: 1 TB Samsung 850 Evo drives,
- @gigatexal: we have different kinds of files, some huge (we do video work, which can be 4K, 12-bit, etc.) and a lot of small files (documents, etc.),
- host system with ESXi 6.5 update 1,
- the virtualized system has 12 vCPUs and 16 GB of memory,
- LSI 9361-8i without CacheVault, hardware RAID10 with 8 SSDs on the host system; this is the datastore for ESXi,
- LSI 9400 HBA connected to the Supermicro JBOD 216BE1C-R741JBOD with two 12 Gbps cables, passed through in ESXi,
- 24 disks, in this configuration (4 raidz2 vdevs of 6 disks each, 24 disks total, ~15 TB):

zpool create -o ashift=12 tank1 \
  raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg \
  raidz2 /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm \
  raidz2 /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds \
  raidz2 /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy
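For the compression on/off comparison below, the property is toggled and checked per dataset with the usual ZFS commands:

zfs set compression=lz4 tank1
zfs get compression,compressratio tank1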

I have only one HBA connected to the JBOD. dd tests with compression disabled on ZFS:

Write:
sudo dd if=/dev/zero of=/tank1/tempfile bs=1M count=16384 conv=fdatasync,notrunc status=progress
-> 1.9 GB/s

Read (after dropping caches):
sudo sh -c "sync && echo 3 > /proc/sys/vm/drop_caches"
dd if=/tank1/tempfile of=/dev/null bs=1M count=16384 status=progress
-> 3.1 GB/s

dd tests with LZ4 compression enabled on ZFS:

Write:
sudo dd if=/dev/zero of=/tank1/tempfile bs=1M count=16384 conv=fdatasync,notrunc status=progress
-> 3.7 GB/s

Read (after dropping caches):
sudo sh -c "sync && echo 3 > /proc/sys/vm/drop_caches"
dd if=/tank1/tempfile of=/dev/null bs=1M count=16384 status=progress
-> 6.3 GB/s

These tests only show the maximum throughput reachable with a zero-filled file (which LZ4 compresses almost entirely away); real-world workloads won't hit these speeds.
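To get numbers that compression cannot inflate, the write test could be repeated with incompressible data, for example with fio's --refill_buffers (the same flag used in the benchmark further down; the file name is just a placeholder):

fio --name=seqwrite --filename=/tank1/fio-seqwrite.tmp --rw=write --bs=1M --size=16G --ioengine=libaio --refill_buffers

dd from /dev/urandom is another option, but then the random number generator itself tends to become the bottleneck.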

The plan is to connect a second HBA to the backplane, which should roughly double the bandwidth between the host and the backplane.
Anyway, we are currently limited by the NIC, which is only 10 Gbps.
 

acquacow

Well-Known Member
Yeah, never use dd for I/O testing; always use fio.

Also, my recommendation is mdadm software RAID over any hardware RAID with SSDs.
 

Gb.

New Member
Hello all.

@gigatexal: I am not familiar with fio at all, I apologize for that, I am still learning every day. I have read a bit about it and came up with this:

Write:
fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=./test --filename=test --bs=4k --iodepth=1 --size=4G --readwrite=randwrite
./test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.16
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/391.1MB/0KB /s] [0/100K/0 iops] [eta 00m:00s]
./test: (groupid=0, jobs=1): err= 0: pid=4162: Mon Nov 20 14:55:35 2017
write: io=4096.0MB, bw=168095KB/s, iops=42023, runt= 24952msec
cpu : usr=3.28%, sys=54.52%, ctx=81171, majf=0, minf=6
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=168094KB/s, minb=168094KB/s, maxb=168094KB/s, mint=24952msec, maxt=24952msec​

Read:
fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=./test --filename=test --bs=4k --iodepth=1 --size=4G --readwrite=randread
./test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.16
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [1001MB/0KB/0KB /s] [256K/0/0 iops] [eta 00m:00s]
./test: (groupid=0, jobs=1): err= 0: pid=4330: Mon Nov 20 14:56:58 2017
read : io=4096.0MB, bw=1026.4MB/s, iops=262735, runt= 3991msec
cpu : usr=10.00%, sys=89.95%, ctx=159, majf=0, minf=9
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=1048576/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: io=4096.0MB, aggrb=1026.4MB/s, minb=1026.4MB/s, maxb=1026.4MB/s, mint=3991msec, maxt=3991msec​

I am not sure how to read this, and I'm not even sure these are the right tests.
About the IOPS results, assuming they are correct (W=42,023, R=262,735): I understand that for random writes each raidz vdev performs roughly like its slowest single disk. Since I have 4 striped vdevs, should I expect the performance of about 4 striped disks? If so, how do these results look?
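As a rough sanity check (the per-disk figure is an assumption for illustration, not something I measured):

random-write IOPS of the pool ≈ number of vdevs × random-write IOPS of a single disk
≈ 4 × ~10,000-20,000 sustained 4K write IOPS per SATA SSD ≈ 40,000-80,000 IOPS

The measured ~42k is in that ballpark, but with iodepth=1 and a single job the test is mostly latency-bound, so a deeper queue or more jobs would be needed to really exercise all four vdevs in parallel.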


Results for the command line from Fio - Flexible I/O Tester Synthetic Benchmark | StorageReview.com (modified to use my 12 cores and with --direct removed, since it is not supported by ZFS):

fio --filename=test.io --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=12 --runtime=60 --group_reporting --name=4ktest --size=4G
4ktest: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
fio-2.16
Starting 12 processes
Jobs: 3 (f=1): [_(1),f(1),_(6),r(1),_(2),f(1)] [100.0% done] [895.1MB/0KB/0KB /s] [229K/0/0 iops] [eta 00m:00s]
4ktest: (groupid=0, jobs=12): err= 0: pid=6224: Mon Nov 20 15:21:01 2017
read : io=49152MB, bw=988640KB/s, iops=247159, runt= 50910msec
slat (usec): min=2, max=28346, avg=47.62, stdev=154.53
clat (usec): min=0, max=28741, avg=725.08, stdev=602.39
lat (usec): min=4, max=29553, avg=772.71, stdev=622.33
clat percentiles (usec):
| 1.00th=[ 63], 5.00th=[ 69], 10.00th=[ 75], 20.00th=[ 98],
| 30.00th=[ 398], 40.00th=[ 490], 50.00th=[ 620], 60.00th=[ 764],
| 70.00th=[ 948], 80.00th=[ 1176], 90.00th=[ 1496], 95.00th=[ 1816],
| 99.00th=[ 2480], 99.50th=[ 2736], 99.90th=[ 3568], 99.95th=[ 4320],
| 99.99th=[11584]
lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=20.40%
lat (usec) : 250=4.49%, 500=16.10%, 750=18.34%, 1000=13.06%
lat (msec) : 2=24.45%, 4=3.11%, 10=0.05%, 20=0.01%, 50=0.01%
cpu : usr=1.37%, sys=97.17%, ctx=40108, majf=0, minf=112
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=12582912/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
READ: io=49152MB, aggrb=988639KB/s, minb=988639KB/s, maxb=988639KB/s, mint=50910msec, maxt=50910msec​
 