PCIe NVMe HBA FYI

AJXCR

Active Member
Jan 20, 2017
565
95
28
32
> Who's trying to use NVMe downstream of the PCH?

Mostly DIY prosumers who buy motherboards with
2 x M.2 slots, or 3 x M.2 slots, and configure those
multiple M.2 SSDs in a RAID-0 array.

ASRock has a motherboard with 3 x M.2 slots:

The ASRock Z170 Extreme7+ Review: When You Need Triple M.2 x4 in RAID

Problem is, those multiple M.2 slots are ALL
downstream of the DMI 3.0 link on ALL
of the latest model motherboards e.g. Intel Z270 chipsets.

So, if you know where to look, a RAID-0 array
of 2 x Samsung 960 Pro SSDs topped out at ~3,500 MB/second,
and a single Samsung 960 Pro SSD was not too far behind,
because all were downstream of the DMI 3.0 link.
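
The ceiling described above can be sketched in a few lines of Python. This is a rough model only: it assumes RAID-0 scales linearly until it hits the uplink, it uses the 8.125 wire-bits-per-byte figure that appears later in this thread for a x4 link at 8 GT/s, and the 3,500 MB/second per-drive figure is an assumed rated sequential read, not a measurement. The function and constant names are made up for illustration.

```python
# Theoretical payload ceiling of a x4 link at 8 GT/s with 128b/130b encoding
# (130 wire bits carry 16 payload bytes, i.e. 8.125 bits per byte).
DMI3_X4_MB_S = 8000 / 8.125 * 4


def raid0_ceiling_mb_s(drives: int, per_drive_mb_s: float,
                       uplink_mb_s: float = DMI3_X4_MB_S) -> float:
    """Best-case sequential throughput of a RAID-0 array behind one uplink.

    Assumes perfect striping, so the array is limited either by the sum of
    the member drives or by the shared uplink, whichever is smaller.
    """
    return min(drives * per_drive_mb_s, uplink_mb_s)


if __name__ == "__main__":
    ASSUMED_DRIVE_MB_S = 3500.0  # assumption: rated sequential read of one SSD
    for n in (1, 2, 3):
        print(f"{n} drive(s): {raid0_ceiling_mb_s(n, ASSUMED_DRIVE_MB_S):,.1f} MB/s")
```

In this model a single drive is already close to the uplink limit, so adding a second or third drive behind the same DMI link buys almost nothing. Measured numbers land somewhat below the theoretical ceiling because of protocol overhead.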
So the World of Warcraft crowd :rolleyes:
 
Jun 24, 2015
140
13
18
72
So, with a cabling topology like the one in the following photograph,
and with all M.2 slots located downstream of the DMI link,
nobody has been able to exceed MAX HEADROOM imposed by that DMI link:



So, it seemed to me that motherboard and storage vendors
stopped promoting such RAID-0 arrays, once they became
aware of the MAX HEADROOM imposed by upstream
DMI links, either DMI 2.0 or DMI 3.0.

As far as I know, Intel have no intentions of increasing the
number of lanes in their future DMI links; what they are
probably anticipating is the 16 GT/s transfer rate that comes with
the PCIe 4.0 standard.

16G / 8.125 x 4 = 7,876.9 MB/second (x4 DMI 4.0 @ 16 GT/s)
(8.125 = 130 bits on the wire per 16 bytes of payload, i.e. 128b/130b encoding)
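
The MAX HEADROOM arithmetic used throughout this thread can be written as a small helper. A minimal sketch, assuming 128b/130b encoding (130 bits on the wire for every 16 payload bytes) and ignoring packet and protocol overhead; the function name is hypothetical.

```python
def link_headroom_mb_s(gt_per_s: float, lanes: int) -> float:
    """Upper bound on payload throughput, in MB/s, for a 128b/130b link.

    gt_per_s: per-lane transfer rate in GT/s (8 for PCIe 3.0, 16 for PCIe 4.0)
    lanes:    number of lanes in the link (x4, x8, x16, ...)
    """
    wire_bits_per_payload_byte = 130 / 16      # 8.125
    megatransfers_per_s = gt_per_s * 1000      # one bit per transfer per lane
    return megatransfers_per_s / wire_bits_per_payload_byte * lanes


if __name__ == "__main__":
    for label, rate, lanes in [
        ("x4  @  8 GT/s", 8, 4),
        ("x16 @  8 GT/s", 8, 16),
        ("x4  @ 16 GT/s", 16, 4),
    ]:
        print(f"{label}: {link_headroom_mb_s(rate, lanes):,.1f} MB/s")
```

The last two rows reproduce the 15,753.8 and 7,876.9 MB/second figures quoted in this thread.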
 

AJXCR

Active Member
Jan 20, 2017
565
95
28
32
While interesting, it just doesn't seem all that relevant to the technical non-gaming/case lighting crowd. Not only would you have to look at the limits of the DMI interface, but also account for all of the other devices operating on the PCH, correct?
 

AJXCR

Active Member
Jan 20, 2017
565
95
28
32
I've often thought a board sans USB/SATA/onboard LAN/audio would be preferable anyway... just give me more PCIe slots and quit allocating lanes to devices I'll never use.
 
Jun 24, 2015
140
13
18
72
> So the World of Warcraft crowd :rolleyes:

Right! Gamers who prefer to post questions on User Forums,
rather than to read manuals and do the hard research.

Just look at how AMD are pitching Ryzen CPUs,
and none of their AM4 motherboards do anything serious
to advance the state of storage technology.

I wrote AMD's CEO to propose that they OEM Highpoint's 3840A
and WOW the world with 4 x NVMe U.2 SSDs in RAID-0 mode.

(Remember how Patrick Kennedy learned that Highpoint's CEO
was seeking to OEM that AIC?)

Highpoint's own calculations were within 1% of my own
predictions for MAX HEADROOM: 8G / 8.125 x 16 = 15,753.8 MB/second

What a feather in their cap, if AMD had been THE FIRST
to break the 10,000 MB/second barrier with such a RAID-0 array
of NVMe SSDs. That speed exceeds the raw bandwidth of DDR2 DRAM
(e.g. DDR2-800 x 8 = 6,400 MB/second) !!

AMD just ignored me.

Good night.
 

AJXCR

Active Member
Jan 20, 2017
565
95
28
32
I hope you don't mind a few dumb questions:

The trays with blue handles are NVMe, correct?:

http://www.servethehome.com/wp-content/uploads/2015/06/Intel-A2U44X25NVMEDK-hot-swap-cage-front.jpg


You are going to need 2 x AICs to control 8 x NVMe M.2 SSDs, correct?

1 of 2:
1 x AIC --> 1 x NVMe backplane @ 4 x NVMe M.2

2 of 2:
1 x AIC --> 1 x NVMe backplane @ 4 x NVMe M.2

CORRECT?

One more dumb question:

Does your x8 AIC work in pairs?

In case you haven't already considered this,
you'll need to ensure that your chipset assigns
all 8 PCIe 3.0 lanes to each of the two AICs.

Sometimes, a chipset will make its own decisions
about lane assignment, and sometimes
an x8 edge connector only gets x4 lanes assigned.
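
One way to check for this on a Linux host is to compare the negotiated link width with the maximum width each device supports. A minimal sketch, assuming the standard sysfs attributes `current_link_width` and `max_link_width` under `/sys/bus/pci/devices`; the function names are made up for illustration, and other operating systems need a different tool.

```python
from pathlib import Path


def read_link_widths(sysfs_root: str = "/sys/bus/pci/devices") -> dict:
    """Return {pci_address: (current_width, max_width)} for PCIe devices."""
    root = Path(sysfs_root)
    widths = {}
    if not root.is_dir():
        return widths  # not Linux, or sysfs is not mounted
    for dev in sorted(root.iterdir()):
        current = dev / "current_link_width"
        maximum = dev / "max_link_width"
        if not (current.is_file() and maximum.is_file()):
            continue  # device does not expose link attributes
        try:
            widths[dev.name] = (int(current.read_text().strip()),
                                int(maximum.read_text().strip()))
        except (OSError, ValueError):
            continue  # unreadable or non-numeric value, skip it
    return widths


def downgraded(widths: dict) -> dict:
    """Keep only devices that trained at fewer lanes than they support."""
    return {addr: (cur, mx) for addr, (cur, mx) in widths.items() if 0 < cur < mx}


if __name__ == "__main__":
    for addr, (cur, mx) in downgraded(read_link_widths()).items():
        print(f"{addr}: running x{cur}, capable of x{mx}")
```

An x8 AIC that shows up as running x4 here has been given only half of its lanes by the slot or by the firmware's lane assignment.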


p.s. Maybe a photo of the backplane, for my benefit? (EDIT: see below ...)
Being low on funds, I can't afford to purchase
any of this stuff.

Thanks!!
Paul,

I'm sorry, not sure how I missed this post. Pics of the backplanes from each of the kits attached:




This configuration manual is incredibly comprehensive..

http://www.intel.com/content/dam/su...server-products/S2600WT_Config_Guide_2_10.pdf
 

AJXCR

Active Member
Jan 20, 2017
565
95
28
32
> What a feather in their cap, if AMD had been THE FIRST
> to break the 10,000 MB/second barrier with such a RAID-0 array
> of NVMe SSDs.

Couldn't you simply use HBAs to RAID maybe 5 Intel 750s/DC P3700s together (in software) and easily break the 10,000 MB/s mark?
 
Jun 24, 2015
140
13
18
72
> While interesting, it just doesn't seem all that relevant to the technical non-gaming/case lighting crowd. Not only would you have to look at the limits of the DMI interface, but also account for all of the other devices operating on the PCH, correct?

EXACTLY!

Just go back to your SM chipset diagram above
and witness all of the different low-speed devices
that are served by the PCH, e.g. SATA / USB 2.0 / USB 3.0 / PCIe x1 / PCIe x4.

ALL OF THE LATTER ARE FED BY THE PCH
AND THE PCH IS FED BY A SINGLE DMI 3.0 LINK.

So, arguendo, if you were to hang an NVMe M.2 device
off that same PCH, all other devices would have to contend
for whatever is left while that M.2 SSD HOGS nearly all of the
available bandwidth of the DMI link during data transmission.

It's kind of ironic, because PCI-Express was invented and
designed to overcome the same kind of HOGGING
that occurred on the old legacy PCI bus: that entire
PCI bus was dedicated to any single device that
needed to transmit data over that PCI bus:
all other devices just needed to wait until the
PCI bus was available again.
 
Jun 24, 2015
140
13
18
72
> Couldn't you simply use HBAs to RAID maybe 5 Intel 750s/DC P3700s together (in software) and easily break the 10,000 MB/s mark?

Yes, that's the whole idea.

BUT, you are saying that as someone who already
has all of the necessary hardware in hand.

My own objective is to encourage industry players,
particularly Intel, to realize that PC / workstation-class prosumers
do NOT have the time, space or money to invest in large Intel
server motherboards, just so they can have a compatible H/W host
for a proprietary riser card and an NVMe RAID controller with
an x16 edge connector.

Maybe SM can be persuaded to agree with this strategy?
SM do offer products to the workstation crowd.

I believe Broadcom are also realizing that they have
implemented a proprietary solution with their "U.2 enabler cable".

Can you imagine a neat 5.25" IcyDock 4-in-1 enclosure,
with an NVMe backplane? There are literally ZILLIONS
of empty 5.25" drive bays in tower and mid-tower chassis
worldwide.

If we use the historical success of SATA NAND flash SSDs as a model,
the ONLY big change in the wiring topology is the effective
"bundling" of x4 PCIe 3.0 lanes into a single U.2 cable.

Aside from THAT, almost everything else should fit nicely inside
tower and mid-tower chassis e.g.

(a) NVMe RAID controller w/ x16 edge connector, 4 x U.2 ports, like Highpoint 3840A
(b) 4 x U.2 cables meeting general specifications (NOT proprietary)
(c) 4 x 2.5" NVMe SSDs (or M.2 SSDs in a 2.5" enclosure, like Syba)
(d) optionally, an AIC approach like Highpoint's SSD7101A (no cables / no enclosures)

The RAID logic should support all modern RAID modes,
either by means of a dedicated IOC (input-output controller)
or a software driver that does the same (like Highpoint's approach).

Lastly, offer an option to make that NVMe RAID controller
either bootable or not bootable (i.e. ENABLE INT13 / DISABLE INT13)
at the option of the system administrator:

The latter enables the all-important goal of installing and hosting all modern
Operating Systems in a primary partition dedicated to the OS.

These specs will make possible much higher NVMe throughput
for the DIY / prosumer crowd, of which there are literally
millions worldwide.

And, if I am understanding Intel's direction with their 2.5" Optanes,
those low-latency SSDs should fit nicely into the above architectures
and also smoke tires during normal operations.

Lastly, the raw per-lane transfer rate DOUBLES (from 8 GT/s to 16 GT/s) when PCIe 4.0 arrives.

(End of rant :)
 
Jun 24, 2015
140
13
18
72
Concept Product - CP021 Discussions

Concept Product - CP021 Discussions - OEM / Concept Products - ICY DOCK Forums

IcyDock's reply, August 1, 2016:

Thanks for the input. We do hear many voices asking us to make a 4x NVME SSD cage in a 5.25" bay. However, since the market is still new and a lot of uncertainty still exist, we are in the process of gathering more information on this. Could you tell us what is your application or intended application on this? Also, what other suggestions/feedback do you have on the current 4 bay version of 2x NVME + 2x SATA/SAS? Perhaps the rear connection type: U.2 or MiniSAS?
 

edge

Member
Apr 22, 2013
97
28
18
Paul,

How many cores does it take to consume the IO that such a system could generate? Current Intel cores have a max consumption rate of around 280 MB/sec when doing a relatively simple table scan with one join and an aggregation in SQL Server.

While you make good architectural points, I think the manufacturers have a valid question: What is the use case?
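
To put a rough number on that question: taking the 280 MB/sec per-core figure above at face value, and the 15,753.8 MB/second x16 PCIe 3.0 headroom quoted earlier in the thread as the array's best case, the core count works out as follows. A back-of-the-envelope sketch only; it assumes the workload scales linearly across cores, and the function name is hypothetical.

```python
import math


def cores_needed(array_mb_s: float, per_core_mb_s: float) -> int:
    """Whole cores required to consume array_mb_s at per_core_mb_s each."""
    if per_core_mb_s <= 0:
        raise ValueError("per_core_mb_s must be positive")
    return math.ceil(array_mb_s / per_core_mb_s)


if __name__ == "__main__":
    X16_HEADROOM_MB_S = 15753.8   # x16 PCIe 3.0 figure quoted in this thread
    PER_CORE_MB_S = 280.0         # table-scan consumption rate quoted above
    print(cores_needed(X16_HEADROOM_MB_S, PER_CORE_MB_S), "cores")
```

With these inputs the answer is in the high fifties, which is well beyond a typical single-socket workstation.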
 
Jun 24, 2015
140
13
18
72
> What is the use case?

I realize that a lot of data centers will almost automatically
recommend a "cloud" solution for the following; however,
sometimes a data set really does warrant absolute privacy,
particularly when we are talking about confidential information
e.g. proprietary R&D, patent research, legal investigations, etc.

With that said, consider my own personal situation:

I designed and built a Windows workstation with 16GB DRAM,
and configured 13GB as a very functional ramdisk using
RamDisk Plus from www.superspeed.com .

It's configured to SAVE and RESTORE the entire ramdisk
at SHUTDOWN and STARTUP, respectively. I am using
a RAID-0 of 4 x SanDisk Extreme Pro SSDs to store
that ramdisk image automatically, to minimize boot-up time.

But, that ramdisk is now running out of space, and
my motherboard cannot support more than 16GB of DDR2-800 DRAM.

By moving that ramdisk to a fast RAID-0 of NVMe SSDs,
I should be able to experience the same performance level,
e.g. when searching and/or indexing that data set,
and have plenty of extra room for normal growth.

Another routine task is to write a drive image of the C: system partition.
From experience, I prefer to run that task when no other programs
are running. The sooner that drive image is written, the better,
because I can go back to using that workstation for everything
else I do with it, on a daily basis e.g. Internet access, email,
website maintenance, etc.

Now, to extrapolate to a small/medium sized organization e.g. 100 workers:
if we shave just one minute from each hour of every work day,
and we do that for all 100 people in that organization,
you may be astounded by the cumulative amount of time
that is saved over one full year of 2,000 FTE hours per year per worker
(each worker is on the job 50 weeks/year x 40 hours per week).

By my arithmetic, saving just 1 minute per hour, on average,
ends up saving about 1.67 person-years (full time equivalent)
in an organization with 100 personnel:
100 workers x 2,000 hours x 1/60 = ~3,333 hours per year.

So, what professional organization could NOT benefit
from having the equivalent of 1 to 2 full-time professional personnel
each working for an entire calendar year withOUT any pay,
benefits or other overhead?

You might argue that I am "cherry picking" here;
however, it is a fact that very small improvements
can result in accumulating large benefits over time.

I also believe that is the reason why high-performance
servers have become so valuable in any medium to
large organizations.

Of course, I am not too concerned about enhancing
the productivity of gamers; I am concerned about
enhancing the productivity of professionals whose
time is very productive and consequently very valuable e.g.
civil engineers, database managers, research scientists, etc.

I seem to recall that even Bill Gates was heard to say,
at one (distant) point in the past, that no one would
ever need more than 640K of RAM.
 
Jun 24, 2015
140
13
18
72
> How many cores does it take to consume the IO that such a system could generate?

Wow, that's a really good, and really challenging, question :)

I am certainly not expert enough to provide the last word here.

A very rough answer to that question goes like this:

If we view a single 64-bit CPU core as a radio frequency broadcaster, THEN:

64 bits @ 4.0 GHz = 256 Gigabits per second / 8 bits per byte = 32 GB/second

So, I honestly expect that a single modern CPU core, cycling at 4 GHz+,
will be able to saturate the raw bandwidth of a RAID-0 array of 4 x NVMe SSDs,
even when all of the necessary latency, firmware and software overheads
are taken into account.
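
The same back-of-the-envelope comparison in Python. This is only the crude model described above (one 64-bit word moved per clock cycle, which real workloads do not approach), set against the 8G / 8.125 x 16 headroom figure from earlier in the thread. The names are made up for illustration.

```python
def core_ceiling_gb_s(register_bits: int, clock_ghz: float) -> float:
    """Crude upper bound: one register-wide word moved per clock cycle."""
    gigabits_per_s = register_bits * clock_ghz
    return gigabits_per_s / 8  # 8 bits per byte


def x16_pcie3_headroom_gb_s() -> float:
    """x16 link at 8 GT/s with 128b/130b encoding, in GB/s."""
    return 8 / 8.125 * 16


if __name__ == "__main__":
    core = core_ceiling_gb_s(64, 4.0)
    array = x16_pcie3_headroom_gb_s()
    print(f"core model: {core:.1f} GB/s, x16 headroom: {array:.2f} GB/s, "
          f"ratio: {core / array:.2f}x")
```

On paper the single-core ceiling is roughly twice the x16 headroom; whether any real workload gets near that ceiling is exactly what the question is asking.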

(I'm impressed by the engineering elegance that obtains
with four NVMe SSDs @ x4 PCIe 3.0 lanes == x16 edge connector.)


Another way of answering that question is to ask another loaded question:
why bother increasing the raw bandwidth and JEDEC timings
of modern DDR4 DRAM, if doing so "saturates" a multi-core CPU?



I recall the reaction of a new user to the CPU utilization
that was measured doing routine benchmarks on ramdisks:
he objected, "Look at how much CPU time is required?" --
not realizing why the same benchmarks on rotating platters
showed how the CPU was idling most of the time.

Even if I am wrong about the above, modern workstation CPUs
come with at least 4 cores, probably also with hyperthreading:
proper scheduling by the OS should assign as many cores
as are necessary to process all tasks waiting on a "wait list"
to have their chance to compute in one of those cores.

A more realistic analysis will take into account the effects
of direct memory access ("DMA"), insofar as AICs
are allowed to perform DMA with minimal
intervention of any given CPU core.

And, we haven't even begun to consider the likely effects
of "affinity" tuning (i.e. assigning certain high-priority tasks
to a specific CPU core).
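
For anyone who wants to experiment with affinity, here is a minimal sketch using Python's `os.sched_setaffinity`, which is available on Linux only. The helper name is hypothetical, and on Windows the equivalent is set through Task Manager or the Win32 API instead.

```python
import os


def pin_current_process(cores) -> set:
    """Restrict the calling process to the given CPU cores (Linux only).

    Cores the process is not currently allowed to use are ignored.
    Returns the affinity set actually in effect afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity via os is Linux-only")
    allowed = os.sched_getaffinity(0)       # 0 means the calling process
    requested = set(cores) & allowed
    if not requested:
        raise ValueError(f"none of {sorted(cores)} are in {sorted(allowed)}")
    os.sched_setaffinity(0, requested)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    first_core = min(os.sched_getaffinity(0))
    print("now pinned to:", pin_current_process([first_core]))
```

Pinning an I/O-heavy task to one core keeps it from migrating between cores, but it also means that one core becomes the limit for that task.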
 

edge

Member
Apr 22, 2013
97
28
18
Well, my point was more about balanced systems.

I am not a big "cloud" proponent - it is just rehashed time sharing and there are reasons everyone dumped time shared systems. I have designed several private cloud (virtual private cloud) and hybrid cloud solutions for large corporations and they (corporations) will never be completely public cloud based due to regulations (financials and health care have a particular penchant for owning complete control of the data).

With database servers, I used to be IO starved. Since we've finally gotten away from spinning iron, I am finding my systems CPU starved. In 2014, it took me 180 cores to consume 17GB/sec of IO when doing large block IO (theoretically the system could do 21GB/sec but I never saw it). I will be using NVMe JBOF and SDS in my upcoming database VPC design. I haven't gotten that in the lab yet and I don't expect to for a while, but I do not expect IO latency / throughput to be the concern it has been in the past. I'll be interested to learn how many OLTP IOPs a current generation proc can consume (Lenovo and Fujitsu are the only ones still publishing TPC numbers). DR replication and network latency are probably going to be the prime bottleneck. As always, removing one bottleneck just reveals the next one.

In terms of a server (file, web, other) being utilized by hundreds of people, all the storage throughput in the world does you no good if you can't get it off the server to the end users at the same rate. Of course, you can circumvent some of that with VDI if you can get your users to give up personal systems but east/west communications between servers will still be a bottleneck (unless you go UCS which makes everything north/south which is much worse).

In my home office, I very rarely wait on a system to boot, primarily because all my systems are in VMs and the host servers rarely reboot (patch Tuesday for the MS Hyper-V nodes, less often for the CentOS KVM nodes); live migration means VMs never go down. The guest VMs get their updates scheduled at night so I never see that either. My wait time in the morning is the 6 to 8 seconds it takes my docked laptop to come out of sleep. My other wait is while connecting to less often used VMs (I am only just now upgrading the house backbone to 10G). Backups are just snapshots: literally no wait at all, and again scheduled off hours. Nothing like having your own personal cloud.

So, if the use case is system image copies within a server or workstation, I think most OEMs will have deaf ears.
 

edge

Member
Apr 22, 2013
97
28
18
> So, I honestly do not expect that a single modern CPU core, cycling at 4 GHz+,
> will be UNable to saturate the raw bandwidth of a RAID-0 array of 4 x NVMe SSDs,
> even when all of the necessary latency, firmware and software overheads
> are taken into account.

My experience in the lab, in designing Fast Track Data Warehouse systems, in working with HP SuperdomeX systems, and Parallel Data Warehouse systems, leads me to think otherwise. Simply put, I expect it to take a two socket if not a four socket system to consume the IO of 8 NVMe drives in RAID 1 (really, who uses RAID 0).

You seem to think I am arguing against developing high throughput storage systems; that is not the case. I am simply asking what processes can take advantage of them.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,053
1,598
113
CA
> My experience in the lab, in designing Fast Track Data Warehouse systems, in working with HP SuperdomeX systems, and Parallel Data Warehouse systems, leads me to think otherwise. Simply put, I expect it to take a two socket if not a four socket system to consume the IO of 8 NVMe drives in RAID 1 (really, who uses RAID 0).
>
> You seem to think I am arguing against developing high throughput storage systems; that is not the case. I am simply asking what processes can take advantage of them.

@edge another member here and I tested, I believe, 11 or 12 NVMe drives in a single Intel chassis with 2 of the kits and AICs, and I believe we got to around 22GB/s on-host. He then tested at home with dual-port 56Gbit IB and I think he got 19GB/s off-host. There are some old threads (last year, if I recall) where he posted some benchmark numbers.

BUT, what you're doing drastically affects the CPU utilization. When dealing with that much performance, I've found there's a lot more than just the drives that could be affecting the CPU. The file system (ZFS in my case) can take a huge % of CPU, networking (IB or Ethernet) affects it differently, and then other misc. OS duties, maybe even logging, all start to really ramp up CPU when IOPs are HIGH.

(All of my usage/testing/experience is on 'small' configurations, i.e. less than a full rack, so I'm sure things are a bit different when doing entire-company high-performance racks. Anything else you can share @edge, please do :) For my usages I ultimately ditched the SAN approach so I didn't have to worry (as much) about network performance, i.e. moving the very high performance NVMe to the actual DB host, on-host storage for VMs, eliminating redundancy for the hosts that house workers that are irrelevant to backup, etc.)

Paul A is the only one I know of talking about RAID0 NVME which is likely why everyone is not giving him the time of day... or once they hear his current workstation is running DDR2 they likely don't take him too seriously either.

Let's face it, his use case is super niche, and anyone in that niche who actually needs the IOPs and bandwidth can hire a professional right now to deliver what he wants performance-wise. Who cares that it takes a second AIC to add 2 more drives? If you really need the IOPs, Intel, SuperMicro and others offer the NVMe HBA and there are great software RAID solutions that can be done. I also don't think he's done any testing, because 4x or 8x NVMe won't have the performance of RAM, and if a 13GB RAM disk sufficed and you REALLY need high performance then use another RAM disk with DDR4 and 16GB RDIMMs, or, as I suggested to him before, use a properly configured database, prime the cache, and then run your queries against RAM... or write a basic script to load it into a RAM-based storage engine and off-load to persistent storage on a schedule... There are honestly a lot of viable solutions to the "made up" problem Paul A. continually references by not having the ability to use 4x NVMe from 1 AIC. I feel the same way about Optane 32GB and 16GB: they're not solutions, they're band-aids until someone actually does it the right way, be it configuration, tuning, upgrading RAM, whatever is 'needed'... band-aiding with a 32GB low-capacity drive where data is better suited for even higher performance in RAM is silly.
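
As a rough illustration of the "load it into a RAM-based storage engine, and off-load to persistent storage on a schedule" idea, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in chosen for the example (the post does not name an engine), the function names are hypothetical, and a real setup would call `persist` from a timer or scheduled job.

```python
import sqlite3


def load_into_ram(disk_path: str) -> sqlite3.Connection:
    """Copy an on-disk SQLite database into a fresh in-memory database."""
    disk = sqlite3.connect(disk_path)
    ram = sqlite3.connect(":memory:")
    disk.backup(ram)          # source.backup(target) copies every page
    disk.close()
    return ram


def persist(ram: sqlite3.Connection, disk_path: str) -> None:
    """Write the current in-memory state back to persistent storage."""
    disk = sqlite3.connect(disk_path)
    ram.backup(disk)
    disk.close()


if __name__ == "__main__":
    import os
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), "dataset.db")  # throwaway demo file
    seed = sqlite3.connect(path)
    seed.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    seed.commit()
    seed.close()

    ram_db = load_into_ram(path)
    ram_db.execute("INSERT INTO docs (body) VALUES ('indexed in RAM')")
    ram_db.commit()
    persist(ram_db, path)     # in practice, run this on a schedule
```

All queries run against RAM between off-loads, so anything written since the last `persist` call is lost on a crash; how often to off-load is the trade-off to tune.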
 
Jun 24, 2015
140
13
18
72
> Simply put, I expect it to take a two socket if not a four socket system to consume the IO of 8 nvme in raid 1 (really, who uses raid 0).

I'm truly looking forward to reading how your 8 x Syba caddies work out for your needs.

Plus, by choosing Intel you can expect proper technical support.


This is quite a statement:

"I expect it to take a two socket if not a four socket system to consume the IO of 8 nvme in raid 1."

Four sockets translate into how many discrete CPU cores?


Software overhead must account for a LOT of that expectation, yes?

In a similar vein, see the contribution of "Software" in this graph prepared by Allyn Malventano here:

Intel Optane SSD DC P4800X 375GB Review - Enterprise 3D XPoint | PC Perspective




But, Allyn's Optane measurements were done remotely,
which he does explain.

And, MANY THANKS for your willingness to share so many professional details.
 
Jun 24, 2015
140
13
18
72
Dear "Friends":

p.s. I would already have a much more modern workstation,
if rogue federal agents had not retaliated at me for whistleblowing.

We deploy RAID-0 as a cheap way of "wear leveling"
and it has worked out extremely well for my needs.

And, the (aging) workstation that I do have would have been long gone
if our Trustee had not rescued all of my research hardware
while I was in solitary confinement and being threatened
with brain-damaging drugs. See Washington v. Harper for details
and/or the documented ordeal of Dr. Sell at the USMCFP in Missouri.

But, I'm guessing that no one here has any desire to read any more
about that ordeal. It did delay my patent award by 2 YEARS.
And, yes, my bias is individual productivity (workstations).

Lastly, I don't appreciate being patronized, which is why
I will SIGN OFF now.

Thanks, everyone. I've learned a lot.