STH Hosting v6 - New cluster in progress


Patrick

Administrator
Staff member
After a bit of a false start last week, the new cluster is taking shape in Fremont (if you are reading this in November 2015, the site is probably still being served out of Las Vegas). If you were wondering what happened, the previous version of this cluster was taken down by 4x Kingston V200 240GB boot devices dying within 72 hours across three different nodes.

Cluster nodes:
  • 3x Intel Xeon D-1540 w/ 64GB RAM nodes
  • 1x Intel Xeon D-1520 w/ 64GB RAM
  • 1x Dual Intel Xeon E5-2699 V3 w/ 128GB RAM with 4x NVMe SSDs
Coming soon:
  • 2x Dual Intel Xeon E5 V3 w/ 128GB RAM with 4x NVMe SSDs and LSI 12Gbps SAS
  • 1x Braswell-based server for Zabbix
  • 1x Dedicated ZFS mirror
Cluster distribution: Proxmox VE 4.0 w/ Ceph Hammer
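
For anyone following along at home, here is a quick sketch of how to sanity-check a Proxmox VE 4.0 + Ceph Hammer cluster from any node. These are stock Proxmox/Ceph commands, nothing specific to this deployment:

Code:
# Proxmox cluster membership and quorum
pvecm status

# Ceph health, monitor quorum, and where the OSDs live
ceph -s
ceph osd tree

# Capacity and per-pool usage
ceph df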

Here is the progress as of 22 November 2015:
[Screenshot: Proxmox VE view of the cluster as of 22 November 2015]

The STHwp1 and STHf1 VMs are sitting on pairs of NVMe SSDs.
  • STHwp1 is on the Intel DC P3600 400GB ZFS mirror
  • STHf1 is on the Samsung XS1715 800GB ZFS mirror
Both are snapshots taken earlier this week and have the MariaDB servers running inside their respective VMs. They will get updated pre-cutover. The forums will likely go down for 1-2 minutes when the change eventually happens.
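
For reference, the NVMe mirrors above are plain ZFS mirrors across a pair of drives. A minimal sketch, using example pool names and device paths rather than the actual ones in use here:

Code:
# Create a mirrored pool across two NVMe devices (names are examples)
zpool create -o ashift=12 wp-nvme mirror /dev/nvme0n1 /dev/nvme1n1

# Confirm both sides of the mirror are ONLINE
zpool status wp-nvme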

The goal is to create second copies of these VMs on different NVMe drives/ servers, then have Ceph-backed VMs as the third copies. Finally, the goal will be to have machines sent to different edge locations that will be slaves/ DR sites for this setup.

Linux-Bench and a few of the other sites will be moving over eventually as well. Linux-Bench is currently "stuck" on the old Hyper-V architecture remnants in Las Vegas. The Ubuntu VM it is on is very picky about being moved.

The current plan is to burn in the cluster until the end of the month, then transition to it as the main hosting cluster. Of course, the last cluster died just after cutting over only the WP VM, so I am now extremely conservative about moving the forums VM.

More to come on this project!
 

Patrick

Administrator
Staff member
The hard part is, of course, getting scaling with the DB and underlying FS.
Of course!

BTW: here is the basic write performance test:

Code:
root@fmt-pve-01:/home/test# rados bench -p sthbench 10 write --no-cleanup
 Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
 Object prefix: benchmark_data_fmt-pve-01_142409
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       126       110   439.902       440 0.0758275 0.0960602
     2      16       253       237   473.924       508  0.235081   0.13006
     3      16       363       347   462.597       440  0.050596  0.127156
     4      16       461       445   444.941       392   0.26772  0.138946
     5      16       568       552   441.543       428 0.0446385  0.141567
     6      16       662       646   430.613       376 0.0226359  0.143405
     7      16       749       733   418.805       348  0.230116  0.148113
     8      16       862       846   422.948       452  0.086096  0.149712
     9      16       968       952   423.061       424 0.0496367  0.149372
    10      16      1096      1080   431.949       512  0.259649  0.146058
 Total time run:         10.269571
Total writes made:      1097
Write size:             4194304
Bandwidth (MB/sec):     427.282

Stddev Bandwidth:       139.368
Max bandwidth (MB/sec): 512
Min bandwidth (MB/sec): 0
Average Latency:        0.149753
Stddev Latency:         0.144999
Max latency:            1.14385
Min latency:            0.0226359
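
The --no-cleanup flag leaves the benchmark objects in the pool so the seq/rand read tests in the next post have something to read back. For reference, the throwaway bench pool can be created and later emptied along these lines (the PG count is just an example, not necessarily what was used here):

Code:
# Create a throwaway benchmark pool (PG count is an example)
ceph osd pool create sthbench 128 128

# After the read tests, remove the benchmark objects
rados -p sthbench cleanup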
 

Patrick

Administrator
Staff member
And some more:

Code:
root@fmt-pve-01:/home/test# rados bench -p sthbench 10 seq
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       292       276   1103.81      1104  0.554277 0.0509578
     2      16       617       601   1201.83      1300 0.0149186 0.0501401
     3      16       854       838   1117.18       948 0.00424673 0.0561402
     4      16      1097      1081   1080.86       972 0.0255477 0.0570432
 Total time run:        4.164682
Total reads made:     1097
Read size:            4194304
Bandwidth (MB/sec):    1053.622

Average Latency:       0.0606427
Max latency:           0.629765
Min latency:           0.00334897
Code:
root@fmt-pve-01:/home/test# rados bench -p sthbench 10 rand
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       301       285    1139.7      1140   0.41414 0.0462429
     2      16       400       384   767.848       396 0.00819223   0.07241
     3      16       619       603   803.875       876  0.037472 0.0786816
     4      16       952       936   935.878      1332 0.0917215 0.0675853
     5      16      1294      1278   1022.28      1368 0.00737993 0.0612788
     6      16      1608      1592   1061.21      1256  0.037465 0.0599269
     7      16      1965      1949   1113.59      1428 0.0309193 0.0568868
     8      16      2216      2200   1099.88      1004 0.00898325 0.0567045
     9      16      2276      2260   1004.33       240 0.00446445 0.0607713
    10      16      2411      2395   957.897       540 0.0157556   0.06538
Total time run:        10.237513
Total reads made:     2411
Read size:            4194304
Bandwidth (MB/sec):    942.026

Average Latency:       0.0678577
Max latency:           1.27867
Min latency:           0.00334876
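
These runs all use the default 4MB object size, which is more of a throughput test than what the MariaDB VMs will actually do. A small-block variant is worth running too; here is a sketch using the standard rados bench flags (the block size and queue depth are just example values):

Code:
# 4KB writes, 16 in flight, 10 seconds, keep the objects for read-back
rados bench -p sthbench 10 write -b 4096 -t 16 --no-cleanup

# Random reads of the same objects
rados bench -p sthbench 10 rand -t 16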
 

PigLover

Moderator
A few questions about your Ceph configuration. This should probably be its own thread since it is generic to Ceph, or perhaps Proxmox + Ceph, and not directly related to your hosting re-deploy. But you have the foil for my questions in the Proxmox image in post 1.

How have you set up the Ceph pool(s), to wit:
  • Are you running MON on all 5 nodes or just three of them? And why?
  • I assume one single pool for VM images using RBD?
  • Are you using Ceph for anything other than VM images? Ceph filesystem is intriguing but immature...
  • What replication model? (The default would be replica 2, giving you 3 total.)
  • You are using a variety of unequal-sized OSDs. Do you foresee - or have you had - any issues caused by that?
  • Did you leave the min replicas the same as the replica count (meaning all three writes have to commit to journal before ack)? Or did you lower it to improve availability?
  • Did you do anything with the CRUSH maps to ensure that replicas are spread across hosts rather than OSDs? (Noting that one host has 4 OSDs, it's possible that all replicas of an object land on one host - which would be bad...)
  • Have you tested failing a host or failing an OSD and checked performance while the cluster is rebalancing?
I believe that if you used the ceph-deploy scripts and just took the defaults, you'd get pools with "replica 2" (3 copies) and hosts as the failure domain. But I'm guessing you used pve-ceph and/or the GUI to deploy it, and I am curious about how the CRUSH map came out.
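
For anyone wanting to check the same things on their own cluster, the pool settings and the CRUSH map can be pulled like this (the pool name "rbd" is just an example):

Code:
# Replica count and minimum replicas for a pool (pool name is an example)
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# Dump and decompile the CRUSH map to inspect the failure-domain rules
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt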
 

Patrick

Administrator
Staff member
Many questions!

  • All 5. I read a few posts that suggested running a monitor on every node. I also saw what happened when 3 nodes went down last week.
  • Yes
  • Not just yet. Container and VM images right now.
  • I will be using 2 replicas when all is done (see the sketch after this list)
  • Not yet. As the number of devices grows, the unevenness worries me less. I am trying to get ~2TB per node across 7+ nodes.
  • Min is set to 1 since the workload is not that write-heavy
  • Not yet. But I sent you the crushmap. That is phase 2.
  • I did quite a bit of that on the previous version. It was still around 900MB/s reads and 600MB/s writes, but the issue I had was too many disks/ OSDs on the same host. Also, I had a mixed HDD/ SSD pool and did not have proper journals for the HDDs. The goal now is to use all SSDs. Overall rebalancing time was not too bad, but it was over 10-20 minutes when 2TB+ went offline out of the 14TB capacity I had. The biggest issue was losing too many nodes when the 4x Kingston V200s died.
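
For reference, the replica and min settings above boil down to a couple of pool commands, and host-level replica placement is just the chooseleaf step in the CRUSH rule. A sketch with an example pool name, not necessarily the production pool:

Code:
# Two copies total, ack once one copy has committed (pool name is an example)
ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1

# In the decompiled CRUSH map, this is the line that spreads replicas
# across hosts rather than OSDs:
#   step chooseleaf firstn 0 type host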

What I "really" want is to set up cross-datacenter Ceph, but that looks very scary.
 

PigLover

Moderator
Thanks. And thanks for sharing the Crush map via IM. I'm following this with some interest as I am working on a somewhat larger (understatement) deployment of Ceph-based storage.

I'm really interested in how it performs when using an erasure-coded HDD-based object store with an SSD-based cache tier. Most of my interest is in failure-mode performance. Unfortunately, I can't configure a large enough cluster in a home lab environment to make this interesting (what I want needs about 45 nodes to test properly). I have some staff doing this in a work environment, but it's no fun when all I get are project updates and status reports. I want it first hand.
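
Roughly, that kind of setup in Hammer is an erasure-coded base pool with a replicated cache pool tiered in front of it. The sketch below uses example pool names and PG counts, and skips the custom CRUSH rules needed to actually pin the base pool to HDDs and the cache to SSDs:

Code:
# Erasure-coded base pool (default EC profile) plus a replicated cache pool
ceph osd pool create ec-base 256 256 erasure
ceph osd pool create ssd-cache 128 128

# Put the cache pool in front of the EC pool in writeback mode
ceph osd tier add ec-base ssd-cache
ceph osd tier cache-mode ssd-cache writeback
ceph osd tier set-overlay ec-base ssd-cache

# Cache eviction also needs a hit set (and size targets) configured
ceph osd pool set ssd-cache hit_set_type bloom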
 

Patrick

Administrator
Staff member
Depending on how fast I can get the new nodes up, and before I transfer everything over, maybe I can get you onto the setup in Fremont?
 


Naeblis

Active Member
They are all chassis specific. The Intel S2600WTT NVMe riser/ backplane kit info is here. The unit in the datacenter right now has a 4-port LSI mezzanine card connected to the 4 right side slots of the NVMe backplane. The two others that will be ready to go next week will be the LSI SAS3008 mezzanine card.

Are you going to put a second cage in there? OEM XS has them and she took a b/o of $250.

This is the RAID card I used. 8 ports, b/o of $125.

Intel RMS3JC080 Integrated RAID Module New Bulk Packaging
 

Patrick

Administrator
Staff member