STH Hosting v6 - New cluster in progress


Patrick

Administrator
Staff member
After a bit of a false start last week, the new cluster is taking shape in Fremont (if you are reading this in November 2015, the site is probably still being served out of Las Vegas). If you were wondering what happened, the previous version of this cluster was taken down by 4x Kingston V200 240GB boot devices dying within 72 hours across three different nodes.

Cluster nodes:
  • 3x Intel Xeon D-1540 w/ 64GB RAM nodes
  • 1x Intel Xeon D-1520 w/ 64GB RAM
  • 1x Dual Intel Xeon E5-2699 V3 w/ 128GB RAM with 4x NVMe SSDs
Coming soon:
  • 2x Dual Intel Xeon E5 V3 w/ 128GB RAM with 4x NVMe SSDs and LSI 12Gbps SAS
  • 1x Braswell-based server for Zabbix
  • 1x Dedicated ZFS mirror
Cluster distribution: Proxmox VE 4.0 w/ Ceph Hammer
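
For anyone following along at home, here is a quick sketch of how to sanity-check a Proxmox VE 4.0 + Ceph Hammer cluster from any node. These are stock Proxmox/Ceph commands, nothing specific to this deployment:

Code:
# Proxmox cluster membership and quorum
pvecm status

# Ceph health, monitor quorum, and where the OSDs live
ceph -s
ceph osd tree

# Capacity and per-pool usage
ceph df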

Here is the progress as of 22 November 2015:
[Screenshot: Proxmox VE view of the cluster as of 22 November 2015]

The STHwp1 and STHf1 VMs are sitting on pairs of NVMe SSDs.
  • STHwp1 is on the Intel DC P3600 400GB ZFS mirror
  • STHf1 is on the Samsung XS1715 800GB ZFS mirror
Both are snapshots taken earlier this week and have the MariaDB servers running inside their respective VMs. They will get updated pre-cutover. The forums will likely go down for 1-2 minutes when the change eventually happens.
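
For reference, the NVMe mirrors above are plain ZFS mirrors across a pair of drives. A minimal sketch, using example pool names and device paths rather than the actual ones in use here:

Code:
# Create a mirrored pool across two NVMe devices (names are examples)
zpool create -o ashift=12 wp-nvme mirror /dev/nvme0n1 /dev/nvme1n1

# Confirm both sides of the mirror are ONLINE
zpool status wp-nvme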

The goal is to create second copies of these VMs on different NVMe drives/ servers, then have Ceph-backed VMs as the third copies. Finally, the goal will be to have machines sent to different edge locations that will be slaves/ DR sites for this setup.

Linux-Bench and a few of the other sites will be moving over eventually as well. Linux-Bench is currently "stuck" on the old Hyper-V architecture remnants in Las Vegas. The Ubuntu VM it is on is very picky about being moved.

The current plan is to burn in the cluster until the end of the month, then transition to it as the main hosting cluster. Of course, the last cluster died just after cutting over only the WP VM, so I am now extremely conservative about moving the forums VM.

More to come on this project!
 

Patrick

Administrator
Staff member
The hard part is, of course, getting scaling with the DB and underlying FS.
Of course!

BTW: here is the basic write performance test:

Code:
root@fmt-pve-01:/home/test# rados bench -p sthbench 10 write --no-cleanup
 Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
 Object prefix: benchmark_data_fmt-pve-01_142409
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       126       110   439.902       440 0.0758275 0.0960602
     2      16       253       237   473.924       508  0.235081   0.13006
     3      16       363       347   462.597       440  0.050596  0.127156
     4      16       461       445   444.941       392   0.26772  0.138946
     5      16       568       552   441.543       428 0.0446385  0.141567
     6      16       662       646   430.613       376 0.0226359  0.143405
     7      16       749       733   418.805       348  0.230116  0.148113
     8      16       862       846   422.948       452  0.086096  0.149712
     9      16       968       952   423.061       424 0.0496367  0.149372
    10      16      1096      1080   431.949       512  0.259649  0.146058
 Total time run:         10.269571
Total writes made:      1097
Write size:             4194304
Bandwidth (MB/sec):     427.282

Stddev Bandwidth:       139.368
Max bandwidth (MB/sec): 512
Min bandwidth (MB/sec): 0
Average Latency:        0.149753
Stddev Latency:         0.144999
Max latency:            1.14385
Min latency:            0.0226359
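
The --no-cleanup flag leaves the benchmark objects in the pool so the seq/rand read tests in the next post have something to read back. For reference, the throwaway bench pool can be created and later emptied along these lines (the PG count is just an example, not necessarily what was used here):

Code:
# Create a throwaway benchmark pool (PG count is an example)
ceph osd pool create sthbench 128 128

# After the read tests, remove the benchmark objects
rados -p sthbench cleanup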
 

Patrick

Administrator
Staff member
And some more:

Code:
root@fmt-pve-01:/home/test# rados bench -p sthbench 10 seq
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       292       276   1103.81      1104  0.554277 0.0509578
     2      16       617       601   1201.83      1300 0.0149186 0.0501401
     3      16       854       838   1117.18       948 0.00424673 0.0561402
     4      16      1097      1081   1080.86       972 0.0255477 0.0570432
 Total time run:        4.164682
Total reads made:     1097
Read size:            4194304
Bandwidth (MB/sec):    1053.622

Average Latency:       0.0606427
Max latency:           0.629765
Min latency:           0.00334897
Code:
root@fmt-pve-01:/home/test# rados bench -p sthbench 10 rand
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       301       285    1139.7      1140   0.41414 0.0462429
     2      16       400       384   767.848       396 0.00819223   0.07241
     3      16       619       603   803.875       876  0.037472 0.0786816
     4      16       952       936   935.878      1332 0.0917215 0.0675853
     5      16      1294      1278   1022.28      1368 0.00737993 0.0612788
     6      16      1608      1592   1061.21      1256  0.037465 0.0599269
     7      16      1965      1949   1113.59      1428 0.0309193 0.0568868
     8      16      2216      2200   1099.88      1004 0.00898325 0.0567045
     9      16      2276      2260   1004.33       240 0.00446445 0.0607713
    10      16      2411      2395   957.897       540 0.0157556   0.06538
Total time run:        10.237513
Total reads made:     2411
Read size:            4194304
Bandwidth (MB/sec):    942.026

Average Latency:       0.0678577
Max latency:           1.27867
Min latency:           0.00334876
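
These runs all use the default 4MB object size, which is more of a throughput test than what the MariaDB VMs will actually do. A small-block variant is worth running too; here is a sketch using the standard rados bench flags (the block size and queue depth are just example values):

Code:
# 4KB writes, 16 in flight, 10 seconds, keep the objects for read-back
rados bench -p sthbench 10 write -b 4096 -t 16 --no-cleanup

# Random reads of the same objects
rados bench -p sthbench 10 rand -t 16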
 

PigLover

Moderator
A few questions about your Ceph configuration. This should probably be its own thread since it is generic to Ceph, or perhaps Proxmox + Ceph, and not directly related to your hosting re-deploy. But you have the foil for my questions in the Proxmox image in post 1.

How have you set up the Ceph pool(s), to wit:
  • Are you running MON on all 5 nodes or just three of them? And why?
  • I assume one single pool for VM images using RBD?
  • Are you using Ceph for anything other than VM images? Ceph filesystem is intriguing but immature...
  • What replication model? (The default would be replica 2, giving you 3 total.)
  • You are using a variety of unequal-sized OSDs. Do you foresee - or have you had - any issues caused by that?
  • Did you leave the min replicas the same as the replica count (meaning all three writes have to commit to journal before ack)? Or did you lower it to improve availability?
  • Did you do anything with the CRUSH maps to ensure that replicas are spread across hosts rather than OSDs? (Noting that one host has 4 OSDs, it's possible that all replicas of an object land on one host - which would be bad...)
  • Have you tested failing a host or failing an OSD and checked performance while the cluster is rebalancing?
I believe that if you used the ceph-deploy scripts and just took the defaults, you'd get pools with "replica 2" (3 copies) and hosts as the failure domain. But I'm guessing you used pve-ceph and/or the GUI to deploy it, and I am curious about how the CRUSH map came out.
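
For anyone wanting to check the same things on their own cluster, the pool settings and the CRUSH map can be pulled like this (the pool name "rbd" is just an example):

Code:
# Replica count and minimum replicas for a pool (pool name is an example)
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# Dump and decompile the CRUSH map to inspect the failure-domain rules
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt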
 

Patrick

Administrator
Staff member
Many questions!

  • All 5. I read a few posts that suggested running a monitor on every node. I also saw what happened when 3 nodes went down last week.
  • Yes
  • Not just yet. Container and VM images right now.
  • I will be using 2 replicas when all is done (see the sketch after this list)
  • Not yet. As the number of devices grows, the unevenness worries me less. I am trying to get ~2TB per node across 7+ nodes.
  • Min is set to 1 since the workload is not that write-heavy
  • Not yet. But I sent you the crushmap. That is phase 2.
  • I did quite a bit of that on the previous version. It was still around 900MB/s reads and 600MB/s writes, but the issue I had was too many disks/ OSDs on the same host. Also, I had a mixed HDD/ SSD pool and did not have proper journals for the HDDs. The goal now is to use all SSDs. Overall rebalancing time was not too bad, but it was over 10-20 minutes when 2TB+ went offline out of the 14TB capacity I had. The biggest issue was losing too many nodes when the 4x Kingston V200s died.
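
For reference, the replica and min settings above boil down to a couple of pool commands, and host-level replica placement is just the chooseleaf step in the CRUSH rule. A sketch with an example pool name, not necessarily the production pool:

Code:
# Two copies total, ack once one copy has committed (pool name is an example)
ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1

# In the decompiled CRUSH map, this is the line that spreads replicas
# across hosts rather than OSDs:
#   step chooseleaf firstn 0 type host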

What I "really" want is to set up cross-datacenter Ceph, but that looks very scary.
 

PigLover

Moderator
Thanks. And thanks for sharing the Crush map via IM. I'm following this with some interest as I am working on a somewhat larger (understatement) deployment of Ceph-based storage.

I'm really interested in how it performs when using an erasure-coded HDD-based object store with an SSD-based cache tier. Most of my interest is in failure-mode performance. Unfortunately, I can't configure a large enough cluster in a home lab environment to make this interesting (what I want needs about 45 nodes to test properly). I have some staff doing this in a work environment, but it's no fun when all I get are project updates and status reports. I want it first hand.
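
Roughly, that kind of setup in Hammer is an erasure-coded base pool with a replicated cache pool tiered in front of it. The sketch below uses example pool names and PG counts, and skips the custom CRUSH rules needed to actually pin the base pool to HDDs and the cache to SSDs:

Code:
# Erasure-coded base pool (default EC profile) plus a replicated cache pool
ceph osd pool create ec-base 256 256 erasure
ceph osd pool create ssd-cache 128 128

# Put the cache pool in front of the EC pool in writeback mode
ceph osd tier add ec-base ssd-cache
ceph osd tier cache-mode ssd-cache writeback
ceph osd tier set-overlay ec-base ssd-cache

# Cache eviction also needs a hit set (and size targets) configured
ceph osd pool set ssd-cache hit_set_type bloom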
 

Patrick

Administrator
Staff member
Depending on how fast I can get the new nodes up, and before I transfer everything over, maybe I can get you onto the setup in Fremont?
 


Naeblis

Active Member
They are all chassis specific. The Intel S2600WTT NVMe riser/ backplane kit info is here. The unit in the datacenter right now has a 4-port LSI mezzanine card connected to the 4 right side slots of the NVMe backplane. The two others that will be ready to go next week will be the LSI SAS3008 mezzanine card.

Are you going to put a second cage in there? OEM XS has them and she took a b/o of $250.

This is the RAID card I used. 8 ports, b/o of $125.

Intel RMS3JC080 Integrated RAID Module New Bulk Packaging
 

Patrick

Administrator
Staff member