Best practice - SSDs, hardware RAID and filesystems


Gb.

New Member
Hi!

Following this thread (Supermicro 216BE1C-R741JBOD), I have a few questions.
The setup:
- host system with ESXi 6.5 update 1,
- LSI 9361-8i without CacheVault, hardware RAID10 with 8 SSDs on the host system; this is the datastore for ESXi,
- LSI 9480-8i8e without CacheVault, connected to the Supermicro JBOD 216BE1C-R741JBOD, hardware RAID6 with 24 SSDs; this RAID card is passed through by ESXi,
- Ubuntu Server 17.10 (I think the upgrade to 18.04 LTS will be smoother from 17.10 than from 16.04).

The goal is to use this server as a file server / cloud (probably Nextcloud; currently using ownCloud).
I am now wondering which filesystem to use on top of an SSD hardware RAID.
Until now, I assumed btrfs was the best fit for this setup.
And in any case, which mount options should I use (like -o ssd_spread, -o discard or -o ssd)?
Is it still necessary to align the partitions manually?
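To make the question concrete, the kind of fstab entry I have in mind would look roughly like this (purely illustrative; the device, mount point and option mix are placeholders, not a recommendation):

/dev/sda1  /srv/data  btrfs  defaults,noatime,ssd  0  0

(discard or ssd_spread would go in the same options field.)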

Thanks!
 

gigatexal

I'm here to learn
It might not be SSD-aware, but I still put ZFS on my SSDs since I like the compression and bitrot protection. For e-peen numbers and such you could just do mdadm software RAID and put something like XFS on top. It depends on your use case, really. What is, on average, the size of the files you're working with? Large, small? More questions need answering before a good recommendation can be made.
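If you go the mdadm + XFS route, the basic shape is something like this (a minimal sketch with hypothetical device names; adjust the RAID level, device list and mount point to your setup):

sudo mdadm --create /dev/md0 --level=10 --raid-devices=8 /dev/sd[b-i]
sudo mkfs.xfs /dev/md0
sudo mount -o noatime,discard /dev/md0 /mnt/data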
 

BackupProphet

Well-Known Member
I also put my SSDs on ZFS; the compression is awesome. It feels like I get much larger drives. I run them as RAID 0 though, so no self-healing here. The reason is that I have never had a failed enterprise-grade SSD or a bad checksum. I may change that opinion one day, though. And I still have backups :D
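For reference, a striped (RAID 0 style) pool with LZ4 compression is about as simple as it gets (device names are placeholders):

zpool create tank /dev/sdb /dev/sdc /dev/sdd /dev/sde
zfs set compression=lz4 tank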
 

whitey

Moderator
I did this as well (striped SSD pools, no redundancy) for a good 6+ months with no ill effects on HGST enterprise SSDs. I chickened out recently and converted both pools back (they are on separate systems with replication between the two, so if a drive did die, oh well, plenty of backups). I've never had any issues or failures on my enterprise-class devices either, but I took them back to raidz with a SLOG, and I'm still happy with the performance and sleep a little better at night.
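For anyone wanting to do the same, adding a SLOG to an existing pool is just a log vdev (the device name here is a placeholder for a fast SSD/NVMe with power-loss protection):

zpool add tank log /dev/nvme0n1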
 

Gb.

New Member
Hello there,

I have run a few tests, and this is the final setup:
- @zer0gravity: 1 TB Samsung 850 Evo drives,
- @gigatexal: we have different kinds of files, some huge (we do video work, which can be 4K, 12-bit, etc.) and a lot of small files (documents, etc.),
- host system with ESXi 6.5 update 1,
- the virtualized system has 12 vCPUs and 16 GB of memory,
- LSI 9361-8i without CacheVault, hardware RAID10 with 8 SSDs on the host system; this is the datastore for ESXi,
- LSI 9400 HBA connected to the Supermicro JBOD 216BE1C-R741JBOD with two 12 Gbps cables, passed through in ESXi,
- 24 disks, in this configuration (4 raidz2 vdevs of 6 disks each, 24 disks total, ~15 TB):

zpool create -o ashift=12 tank1 \
  raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg \
  raidz2 /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm \
  raidz2 /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds \
  raidz2 /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy
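For the compression on/off comparison below, the property is toggled and checked per dataset with the usual ZFS commands:

zfs set compression=lz4 tank1
zfs get compression,compressratio tank1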

I have only one HBA connected to the JBOD. dd tests with compression disabled on ZFS:

Write:
sudo dd if=/dev/zero of=/tank1/tempfile bs=1M count=16384 conv=fdatasync,notrunc status=progress
-> 1.9 GB/s

Read (after dropping caches):
sudo sh -c "sync && echo 3 > /proc/sys/vm/drop_caches"
dd if=/tank1/tempfile of=/dev/null bs=1M count=16384 status=progress
-> 3.1 GB/s

dd tests with LZ4 compression enabled on ZFS:

Write:
sudo dd if=/dev/zero of=/tank1/tempfile bs=1M count=16384 conv=fdatasync,notrunc status=progress
-> 3.7 GB/s

Read (after dropping caches):
sudo sh -c "sync && echo 3 > /proc/sys/vm/drop_caches"
dd if=/tank1/tempfile of=/dev/null bs=1M count=16384 status=progress
-> 6.3 GB/s

These tests only show the maximum throughput reachable with a zero-filled file (which LZ4 compresses almost entirely away); real-world workloads won't hit these speeds.
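To get numbers that compression cannot inflate, the write test could be repeated with incompressible data, for example with fio's --refill_buffers (the same flag used in the benchmark further down; the file name is just a placeholder):

fio --name=seqwrite --filename=/tank1/fio-seqwrite.tmp --rw=write --bs=1M --size=16G --ioengine=libaio --refill_buffers

dd from /dev/urandom is another option, but then the random number generator itself tends to become the bottleneck.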

The plan is to connect a second HBA to the backplane, which should roughly double the bandwidth between the host and the backplane.
Anyway, we are currently limited by the NIC, which is only 10 Gbps.
 

acquacow

Well-Known Member
Yeah, never use dd for I/O testing; always use fio.

Also, my recommendation is mdadm software RAID over any hardware RAID with SSDs.
 

Gb.

New Member
Hello all.

@gigatexal: I am not familiar with fio at all, I apologize for that, I am still learning every day. I have read a bit about it and came up with this:

Write:
fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=./test --filename=test --bs=4k --iodepth=1 --size=4G --readwrite=randwrite
./test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.16
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/391.1MB/0KB /s] [0/100K/0 iops] [eta 00m:00s]
./test: (groupid=0, jobs=1): err= 0: pid=4162: Mon Nov 20 14:55:35 2017
write: io=4096.0MB, bw=168095KB/s, iops=42023, runt= 24952msec
cpu : usr=3.28%, sys=54.52%, ctx=81171, majf=0, minf=6
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=168094KB/s, minb=168094KB/s, maxb=168094KB/s, mint=24952msec, maxt=24952msec​

Read:
fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=./test --filename=test --bs=4k --iodepth=1 --size=4G --readwrite=randread
./test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.16
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [1001MB/0KB/0KB /s] [256K/0/0 iops] [eta 00m:00s]
./test: (groupid=0, jobs=1): err= 0: pid=4330: Mon Nov 20 14:56:58 2017
read : io=4096.0MB, bw=1026.4MB/s, iops=262735, runt= 3991msec
cpu : usr=10.00%, sys=89.95%, ctx=159, majf=0, minf=9
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=1048576/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: io=4096.0MB, aggrb=1026.4MB/s, minb=1026.4MB/s, maxb=1026.4MB/s, mint=3991msec, maxt=3991msec​

I am not sure how to read this, and I'm not even sure these are the right tests.
About the IOPS results, assuming they are correct (W=42,023, R=262,735): I understand that for random writes each raidz vdev performs roughly like its slowest single disk. Since I have 4 striped vdevs, should I expect the performance of about 4 striped disks? If so, how do these results look?
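As a rough sanity check (the per-disk figure is an assumption for illustration, not something I measured):

random-write IOPS of the pool ≈ number of vdevs × random-write IOPS of a single disk
≈ 4 × ~10,000-20,000 sustained 4K write IOPS per SATA SSD ≈ 40,000-80,000 IOPS

The measured ~42k is in that ballpark, but with iodepth=1 and a single job the test is mostly latency-bound, so a deeper queue or more jobs would be needed to really exercise all four vdevs in parallel.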


Results for the command line from Fio - Flexible I/O Tester Synthetic Benchmark | StorageReview.com (modified to use my 12 cores and with --direct removed, since it is not supported by ZFS):

fio --filename=test.io --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=12 --runtime=60 --group_reporting --name=4ktest --size=4G
4ktest: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
fio-2.16
Starting 12 processes
Jobs: 3 (f=1): [_(1),f(1),_(6),r(1),_(2),f(1)] [100.0% done] [895.1MB/0KB/0KB /s] [229K/0/0 iops] [eta 00m:00s]
4ktest: (groupid=0, jobs=12): err= 0: pid=6224: Mon Nov 20 15:21:01 2017
read : io=49152MB, bw=988640KB/s, iops=247159, runt= 50910msec
slat (usec): min=2, max=28346, avg=47.62, stdev=154.53
clat (usec): min=0, max=28741, avg=725.08, stdev=602.39
lat (usec): min=4, max=29553, avg=772.71, stdev=622.33
clat percentiles (usec):
| 1.00th=[ 63], 5.00th=[ 69], 10.00th=[ 75], 20.00th=[ 98],
| 30.00th=[ 398], 40.00th=[ 490], 50.00th=[ 620], 60.00th=[ 764],
| 70.00th=[ 948], 80.00th=[ 1176], 90.00th=[ 1496], 95.00th=[ 1816],
| 99.00th=[ 2480], 99.50th=[ 2736], 99.90th=[ 3568], 99.95th=[ 4320],
| 99.99th=[11584]
lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=20.40%
lat (usec) : 250=4.49%, 500=16.10%, 750=18.34%, 1000=13.06%
lat (msec) : 2=24.45%, 4=3.11%, 10=0.05%, 20=0.01%, 50=0.01%
cpu : usr=1.37%, sys=97.17%, ctx=40108, majf=0, minf=112
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=12582912/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
READ: io=49152MB, aggrb=988639KB/s, minb=988639KB/s, maxb=988639KB/s, mint=50910msec, maxt=50910msec​
 