Advice needed for an Always On SQL server build

dougsk · Jul 22, 2013

Hi STH,

I'll call this my first post, although actually it's my second. I don't recall how I stumbled in here, but I've been a somewhat dedicated lurker. You all rock! You're doing really interesting things here with in reality shoestring budgets which sounds eerily like my day job. Since some of your home labs resemble my work, I figured I'd ask for a recommendation, if that's okay?

I'm currently run a SQL 2005 EE instance on a quad socket dual core Opterons (875s yes that old), connected via mutiple 1gbe to an iSCSI san using fail over clustering to another DL585. I'm using MPIO as opposed to using LAG groups however so many operations are single threaded so I rarely see the benefit of the other NIC. I've got 32GB of RAM in the current box and I'm occasionally RAM bound, never CPU bound, frequently disk bound. I've pulled some performance numbers; SkyDrive and GoogleDocs same data in either one.

Now I went and replicated our SQL data onto my gaming rig, an Intel 2500K with an single OCZ Vertex 3. A query of unordered audit log data I run in production takes about 15 minutes to complete (peaks at ~2200IOPS). This same query takes about 30 seconds on the Vertex SSD and hits 33000 READ IOPS. 30 seconds versus 15 minutes! Holy Cow!

In light of that, I've been given a budget of around 50K to change the slow as molasses boxes out. How would you spend it?

I'm thinking about using SQL Server Always ON and DAS, because when I try pricing SAN solutions the effective performance is usually abysmal until you start spending big money. This needs to be part of an HA cluster, so about 25K/box. I need about 1.5TB (actually only about 1TB with 30% free) for my 'five year' estimated growth plan for the high IOPS databases (Northwind and NorthwindAudit in the spreadsheet on Skydrive) and the rest can sit on the existing ISCSI SAN or slow DAS.

So for instance lets say I use something like a Dell R720XD or DL380 25SFF chassis, I wouldn't need an external sas jbod. If I went DL360 or R410, the JBOD I'm looking at using is probably the DataOn DNS-1640 Single Controller for $3895. But open to suggestion.

Let's say I went with 800GB S3700 Intel MLC drives at $2100/each then I'd need 4 in a Raid 10 (1.5TB) to achieve the usable space I'd want, so that's $8400. Or I could look at and believe me I'll be ordering some spares, I could buy 16 of the 240GB Intel 520 MLC drives in a Raid 60 to achieve similar capacity (1.7TB +2 HS). The Intel 520 goes for $249 each so $3984, but I'd need that JBOD since I'll need some cheap slow local storage and this eats up all the drives, so that's another $3895 so $7879. Now when we use the IOPS calculator things get a little different.

S3700 800GB Raid 10, Stripe =64Kb, 75000 Read and 36000 write IOPS 60/40 read/write = 66000 Random IOPS
520 240GB RAID 60, Stripe =64Kb, 75000 Read 36000 and write IOPS, 60/40 Read/write = 107000 Random IOPS
I save about $500 bucks and I got about 40% faster using consumer drives. Assumes 64KB average IO size, Stripe = 64KB, SSD Write IOPS = 36000, SSD Read IOPS = 75000

Let's say I end up using an LSI 9286 PCIe card $800 without Cachecade and $1250 with to connect to the JBOD. Ah cachecade, here's a good question, would Cachecade 2.0 even be worthwhile for piles of random IO or is it more like that Seagate Hybrid drive, neat for like the first five seconds, but only cool if you've never used a real ssd? I can't see using a 15K FusionIO or WarpDrive, but dunno that could be useful too.

Anyhow math exercises aside, how would you do it?

MiniKnight · Jul 22, 2013

Welcome again!!!

Can I say that this is a GREAT post.

After reading here is a stupid question: how hot is all that data? Seems like peaks high, average low so is it like a few reports banging the disks?

Here is a question for you, ever think about doing 256+ gb of ram? I feel like I would make a big ram pool, use ssd as needed and leave cold data on disk.

Storage caching is very cheap these days and very mature.

In summary... dba will have great insight as he is a dB guru

dba · Jul 22, 2013

Don't spend your money on CPU - you don't seem to need much. I'd bet that one recent Xeon CPU with four cores would be more than enough, though I might spring for six given your budget.
Don't spend your money on a SAN - it seems like you can live with replication instead, which is far cheaper.
Don't spend your money on RAM - if 32GB is working fine now, just get 64GB in your new box, even though it will probably have slots for 1TB. If I saw more big reads or massive write spikes then I could see spending more $ on RAM.

Spend your money on SSD drives, specifically moderately priced MLC SSD drives with supercaps and firmware that you trust.

Looking at your DB, you have only around 9GB/day of writes. Assuming mirrored storage, that's 18GB/day of writes to storage. At that level of load, you are in no danger of using up even the cheapest consumer-grade SSD drives and therefore do not need to spend extra money on special drives. On the other hand, you have a pretty good budget and can afford to buy peace of mind. On balance, I think that the Intel S3500 drives would be my choice for you.

I'd buy as many drives as would fit in my chosen server chassis - and I'd pick a chassis with lots of 2.5" slots. With a 25-drive chassis, I'd probably buy the 240GB or 300GB size. With a smaller chassis, I'd buy larger drives, giving up some performance in order to avoiding the need to buy a separate JBOD chassis.

Put the drives into RAID10 - I don't like RAID5 or RAID6 with SSD drives, although your write load is low enough that you could get away with it.

dba · Jul 22, 2013

MiniKnight said:
Welcome again!!!

Can I say that this is a GREAT post.

After reading here is a stupid question: how hot is all that data? Seems like peaks high, average low so is it like a few reports banging the disks?

Here is a question for you, ever think about doing 256+ gb of ram? I feel like I would make a big ram pool, use ssd as needed and leave cold data on disk.

Storage caching is very cheap these days and very mature.

In summary... dba will have great insight as he is a dB guru

Thanks for the compliment, MiniKnight. I don't consider myself a guru, except perhaps within some very narrow realms.

RAM is often the cheapest way to buy speed, but probably not in this case. On the face of it, this looks like an OLTP database, with tiny writes and small indexed reads, and one with rather low load at that. I suspect that it would be pretty easy to add too much RAM, leaving most of it unused or just caching data that really didn't need to be cached given the SSD drives.

mrkrad · Jul 23, 2013

sql server setup properly will use every drop of ram possible and EE reads ahead more aggressively than standard edition.

you are always best to slice and dice your drives up otherwise you may not reach maximum queue depth for ssd in raid-10

jtreble · Jul 23, 2013

dougsk said:
... and the rest can sit on the existing ISCSI SAN or slow DAS ...

I'm not sure I understand this aspect of your plan - is there an opportunity here for these to be consolidated?

dougsk · Jul 23, 2013

MiniKnight said:
how hot is all that data? Seems like peaks high, average low so is it like a few reports banging the disks?

Here is a question for you, ever think about doing 256+ gb of ram?

The data isn't all hot that's for sure, but it all needs to be available, and determining what data is hot has turned out to be problematic. SSRS can definitely be a contention point at times, but usually isn't the problem. I have about 250 active users and for about eight hours on Fridays, it's somewhat ugly, otherwise it's tolerable. At the application level we've taken to killing any query > 30s, forcing the user to resubmit, which honestly has a snowball effect all of it's own. Usually if the user executes again the result will show up. But that's an ugly hack to hide IO inadequacies from the users. I haven't really thought about using a RamCache mostly cause we've been really happy with the ssd performance we're seeing in test.

I may have data that's five years old that needs to be hot for six weeks and then put to bed for a year. Changes to Northwind are tracked in Northwind_audit via triggers without the use of the service broker, so although I don't actually care about speed of northwind_audit, it looks like I'm stuck chuck and need it on fast storage. I can however cull that database of old records and mount the old records in another db on slow storage if I need to do a historical find out if it was Mr Plum in the Library with a Candlestick. That should alleviate some of the size pressure on the SSD array.

I've taken a cursory look at partitioned tables, but it seems a bit complex and totally avoids the KISS principle, especially since we can't really tell what information to make slow within the Northwind database. The Adventure Works databases are currently culled into an archive system and we've had to build tools to bring records from the archive to the Active database. For there it works, but for the Northwind database I don't think we could actually build the tools to do that.

dba said:
Don't spend your money on CPU
Don't spend your money on a SAN
Don't spend your money on RAM
...
Put the drives into RAID10
...
OLTP database, with tiny writes and small indexed reads
...
would be pretty easy to add too much RAM

Noted, thank you. Specifically on the CPU point, I'm looking at trading clock speed for cores. Assuming licensing is a non issue, I was thinking I'd rather go four core with higher clocks than 6 cores with lower clock, does that sound right?

Nailed it in one, yep it is OLTP, a point of sale system in fact. Dashes to my 128GB of RAM, but, but .... awe shucks

Saves about $700 x 2, so that's good!

MKRAD said:
you are always best to slice and dice your drives up otherwise you may not reach maximum queue depth for ssd in raid-10

Just to clarify, the more spindles on the same volume to reduce queue depths the better, correct?

jtreble said:
I'm not sure I understand this aspect of your plan - is there an opportunity here for these to be consolidated?

Currently this is the primary iscsi storage, the one that only gets 2200iops. I have 14TB raw 9TB usuable. I'll need some bulk storage, and if I can't get bulkstorage into the same chassis as the SSDs, then I'll have to iSCSI mount those volumes for slow storage from here. Basically a place to push backups and restores and maybe the AdventureWorks and AdventureWorks Archive dbs here.

dba · Jul 23, 2013

Cores versus MHz: It's not a big deal in your case - the SSD drives are your key to great performance and even a cheapo Xeon E5-2609 CPU would probably do the trick for you. That said, let's just pretend that you have $700 to spend on CPU and want to get the most performance for your dollar.

Look at the similarly priced Xeon E5-2643 quad core CPU at 3.3GHz versus the E5-2640 six core CPU at 2.5GHz - both are roughly $700 CPUs. The 2640 has 50% more cores while the 2643 is over 30% faster per core.

A reasonable rule of thumb says that the faster cores in the 2643 will make for a more responsive database (faster queries) under low loads while the more numerous cores in the 2640 will win out with greater overall throughput under higher loads. Let's just guess that the crossover point is between four and eight simultaneous non-parallel OLTP queries.

So - proposed rule of thumb for you - if your peak hour during the day sees more than a handful of simultaneous queries (not simultaneous users) then go for more cores. If not, go for GHz.

gigatexal · Jul 23, 2013

mrkrad said:
sql server setup properly will use every drop of ram possible and EE reads ahead more aggressively than standard edition.

you are always best to slice and dice your drives up otherwise you may not reach maximum queue depth for ssd in raid-10

are you saying partition the ssds? how does that affect queue depth?

Also you have come to the best place to get this answered.

dba · Jul 23, 2013

gigatexal said:
are you saying partition the ssds? how does that affect queue depth?

Also you have come to the best place to get this answered.

I really doubt that dougsk will have a queue depth problem after going SSD. Personally, I would not partition the SSD drives.

An aside: If I'm reading the posts correctly, the the data that dougsk manages is a blend of data "in use now" and data rarely used, but there is no simple rule that defines which data is hot and which is not - and it can change at any time - so no simple partitioning strategy will help. I've been playing with Oracle 12c, and one of the new features is the ability to keep a "heat map" representation of how often a given chunk of data is accessed - down to the row level even - and to use that information to automatically migrate data, for example from a SATA SAN partition to an SSD partition and back. I'm sure the Microsoft is working on something similar for their database. More to the point, it seems like the upcoming Windows 2012 R2 release of Storage Spaces will bring exactly this kind of functionality to our file servers.

mrkrad · Jul 23, 2013

L5639 are $135, for $599 you get a server with 48gb of ram and 2 drives (and two of those cpu's). Impossible for anything newer to come close. Available in blades $599 or C6100 $599 or the venerable DL380 $1000~ - this is the deal of the century.

With esxi you have a per vm , per target/lun queue depth. Operating systems themselves may have a limit as well.

So if you think about it, sql server is a giant cache. 64gb of compressed SQL is 64gb of compressed buffers (put those 12 cores to work!). It should never read the same block twice if it doesn't exhaust its cache. If you can roll back to SQL 2008 you can do say 512GB of ram cache (!!) which can be hella cheaper with slow storage.

Then you have TEMPDB, and LOGS. Logs are linear and benchmarks always use a ton of drives to reach maximum performance since the writes are synchronous.

TEMPDB happens when you hit the limit of the governor. Default it is very small but you can tweak this up with EE to use more query ram. If it expects you will spill over the default ram allotment it will use tempb (ouch!).

So if you have a file system spread out (oracle is super slick with its SOFTware raid-1) you can dispatch i/o against each core , much like networking.

Two raid controllers, can dispatch twice the iops and extend the queue depth.

DBA has 11? raid controllers across 4 pci-e bus! This is key since each controller can do its own work.

SSD raid, well sucks. I have found raid-10 with LSI scales linear, but not random, definitely not random write.

I'd say the setup DBA has is most excellent, 4 cpu's each with their own bus, as many raid cards as possible, each raid-1 volumes. Plus a metric ton of ram (again SQL 2008 is FAR better at this).

If the database is all in ram, well it won't read the drives again. It will simply write to tempdb (SSD) and logs (good use for 15K SAS drives and write-back cache).

C: o/s page (raid-1)
D: database (tables active group 1 ssd raid-1 two big drives)
E: database (tables active group 2 ssd raid-1 two big drives)
... repeat splitting up databases into file groups based on the tables.
S: tempdb (two ssd raid-1 - size might be 512gb?)
T: LOG for database active group1
U: LOG for database active group2
V: Database partition group 1 (partition on say year 2012)
W: Database partition group 2 (partition on say year 2011)

(rinse/repeat for SSRS/SSAS)

The idea is to allow your cores to dispatch as much i/o as possible concurrently. Due to new licensing, you may wish to find higher power quad cores!

Use the cheap ram to read-cache. ram is cheap. seriously. SQL EE reads ahead far more aggressively than standard.

So if your core database is all in ram, what does the core database drive speed really matter? (it doesn't)

If the amount of ram is so high that the query cache never needs to dump to TEMPDB, you are golden. If not, prepare tempdb (for each database,ssrs,ssas) for the iops. SQL EE allows simple tweaks, say adjusting the 25% to 45% which reduces Dynamics AX ERP tempdb writes by 90% !! sick !!

Take a look at the big benchmarks, you'll notice they use a ton of regular drives for log (raid-0?) - not ssd.

Virtualization adds extra care since it has serious queue depth issues only when more than 1 vm is on a host.

For instance: PVSCSI is QD32 default, but tweakable to 255 ! Also esxi 5.1u1 penalizes reads that span more than 2000 sectors (meaningless with ssd caching!).

I assume DB2,oracle, SQLOS(sql server) are all slightly different in their own way.

this pissed me off to no end "Beginning with SQL Server 2008 R2, SQL Server Standard edition supports a maximum of 64 gigabytes of system memory. In SQL Server versions earlier than SQL Server 2008 R2, SQL Server Standard supported the operating system maximum memory. "

WTF?!?!

dougsk · Aug 2, 2013

Sorry to post and run, I got caught out with a pile of little one alarms to put out

mrkrad said:
SSD raid, well sucks. I have found raid-10 with LSI scales linear, but not random, definitely not random write.

...

If the amount of ram is so high that the query cache never needs to dump to TEMPDB, you are golden. If not, prepare tempdb (for each database,ssrs,ssas) for the iops. SQL EE allows simple tweaks, say adjusting the 25% to 45% which reduces Dynamics AX ERP tempdb writes by 90% !! sick !!

...

this pissed me off to no end "Beginning with SQL Server 2008 R2, SQL Server Standard edition supports a maximum of 64 gigabytes of system memory. In SQL Server versions earlier than SQL Server 2008 R2, SQL Server Standard supported the operating system maximum memory. "

WRT SSD raid, I'll go do some research, that's very good to know thank you. WRT TEMPDB, I'll go see if I can't poll some data points there and larn me sumthin.

We're stuck with EE. I know we're using some of the features that are only supported there and the rewrite and testing required to leave would probably be more expensive than just paying for EE. Meh.

I updated posted datapoints on the Sky and Google Drive on another worksheet. The data here is derived from a perfmon run during our weekly peak. It's a bit curious in that I had a data read outlier event that actually lasted about twenty minutes that threw the numbers way out. They don't make sense with that event in there because the amount of data read seems to be well outside the scope of the hardware we have.

Thanks again everyone for the great responses.

mrkrad · Aug 2, 2013

Well resource governor kicks ass.

Plus you know standard edition really isn't mean to be virtualized. IIRC it doesn't really support hot-add/remove ram

Search

Advice needed for an Always On SQL server build

dougsk

New Member

MiniKnight

Well-Known Member

dba

Moderator

dba

Moderator

mrkrad

Well-Known Member

jtreble

Member

dougsk

New Member

dba

Moderator

gigatexal

I'm here to learn

dba

Moderator

mrkrad

Well-Known Member

dougsk

New Member

mrkrad

Well-Known Member