L40S vs RTX 6000 ADA - for LLMs


justinjja1

New Member
Apr 25, 2016
We are currently developing something that does summarization with an LLM.

Saw the STH vid recommending the L40S...
But it seems to have more or less the same specs as the RTX 6000 Ada at a 50% higher cost.
What are the pros of the L40S over the RTX 6000 Ada?

Not sure if I can post links but:
 

stubaBF

New Member
Dec 30, 2023
It all depends on the precision you will run your model at; for example, it's relevant to know that at INT4 the L40S doesn't have any advantage over the L40.
If you seek a GPU for R&D, IMO the RTX 6000 Ada is the better overall solution. However, if you seek a GPU for production, I highly suggest you run real-case benchmarks with your specific LLM in the cloud before making any purchase.
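As a rough illustration of what such a real-case benchmark can look like, here is a minimal timing sketch assuming an OpenAI-compatible completions endpoint (for example, one served by vLLM) running on the cloud GPU under test; the URL, model name, and prompt are placeholders, not anything from this thread.

```python
import time
import requests

# Hypothetical OpenAI-compatible completions endpoint (e.g. a vLLM server
# on the cloud GPU under test). URL, model name, and prompt are placeholders.
ENDPOINT = "http://localhost:8000/v1/completions"
MODEL = "your-summarization-model"
PROMPT = "Summarize the following document:\n" + "some long document text ... " * 50

def bench(n_runs: int = 5, max_tokens: int = 256) -> None:
    """Measure end-to-end generation throughput in tokens/second."""
    for i in range(n_runs):
        start = time.perf_counter()
        resp = requests.post(ENDPOINT, json={
            "model": MODEL,
            "prompt": PROMPT,
            "max_tokens": max_tokens,
            "temperature": 0.0,
        }, timeout=300)
        resp.raise_for_status()
        elapsed = time.perf_counter() - start
        completion_tokens = resp.json()["usage"]["completion_tokens"]
        print(f"run {i}: {completion_tokens / elapsed:.1f} tokens/s "
              f"({completion_tokens} tokens in {elapsed:.2f}s)")

if __name__ == "__main__":
    bench()
```

Run the same script against each candidate card with your real prompts and output lengths; the relative numbers matter more than the absolute ones.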

Within our infrastructure, we employ two RTX 6000 Ada cards in our local server for research and development, and they have proven to be a robust choice for a diverse range of computational needs.

Based on our testing with the ML model products we currently have in production, the performance differential between the L40S and the RTX 6000 Ada is generally not a decisive factor.

Nevertheless, when designing a server setup for production, it is imperative to consider factors beyond mere performance metrics, including power consumption, physical footprint, availability, overall costs, and other relevant considerations.

PS: note that the RTX 6000 Ada datasheet in the link you provided is outdated. Also, run benchmarks with current drivers to see real-world performance.
 

bayleyw

Active Member
Jan 8, 2014
The L40S and RTX 6000 Ada are the same die at roughly the same MSRP - the RTX 6000 Ada is $9,999 MSRP, currently discounted to about $8K, and the L40S is ??? MSRP (it's a passively cooled server part) and currently goes for $11.5K if you can find it.

If you're doing development, get the RTX 6000 Ada; if you're a lunatic and hosting your own inference servers (don't!), get the L40S. If you're clever, buy some used RTX A6000s, save $5K per card, and get about the same performance, since inference is bandwidth bound anyway. Or save a lot of money and get some used RTX 3090s if your workload fits in 24GB - up to Yi-34B fits at Q4 quantization, leaving plenty of room for context, insofar as you trust Yi-34B...
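To put the 24GB sizing in rough numbers, here is a back-of-the-envelope sketch for a ~34B model at ~4-bit quantization plus a GQA-style KV cache; the model shape and bits-per-weight are approximations for illustration, not measured figures.

```python
# Back-of-the-envelope VRAM estimate for a ~34B model at ~4-bit quantization.
# All figures below are rough assumptions for illustration, not measurements.

def weight_gb(params_b: float, bits_per_weight: float = 4.85) -> float:
    # Q4_K_M-style formats average roughly 4.8-4.9 bits/weight once scales
    # and zero-points are included.
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    # K and V per layer per token, stored in FP16 by default.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

# Yi-34B-ish shape (assumed): 60 layers, 8 KV heads (GQA), head_dim 128.
weights = weight_gb(34)
kv = kv_cache_gb(60, 8, 128, context_tokens=4096)
print(f"weights ~{weights:.1f} GB, 4k-token KV cache ~{kv:.1f} GB, "
      f"total ~{weights + kv:.1f} GB of 24 GB")
```

Real headroom also has to cover activations and framework overhead, but the arithmetic is roughly why a Q4 34B model squeezes into a 24GB card.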
 

Hazily2019

Active Member
Jan 10, 2023
The L40S and ADA 6000 are the same card. One is for servers and one is for workstations. There is hardly a difference between the two. The L40S has a lead time of ~6 weeks and the ADA 6000 has a lead time of ~3 months.

The L40S and ADA 6000 can't do FP64 calculations at any useful rate, so things like math modeling are out of the question. If you want to build a chatbot or work in FP32, these cards are perfect.
 

leo_silicon_alley

New Member
Mar 14, 2024
www.siliconalleyai.com
It's a shame that there's a huge queue to get your hands on H100 clusters, as this leaves only the top 0.1% of tech companies as able buyers, allowing them to single-handedly corner the AI marketplace. New AI app? They got to it first. New AI platform? Already in the pipeline courtesy of Microsoft, Meta, and Google. I couldn't stop laughing when @bayleyw said not to host your own inference servers. Many valid points there, but if you know your users are isolated to a specific geographic location, and you're allowing a drip-release access methodology to a select group of clients, i.e., only AI researchers that work at Ivy League universities, then scalability isn't an issue. If it's a private server designed to only provide training and inference for a dedicated department, scaling becomes even easier to manage. @Hazily2019 the boys over at Silicon Alley have the Nvidia RTX 6000 ADA 48GB in stock, if you don't wanna wait 3+ months.
 

kgcdctx

New Member
Dec 29, 2023
We (stardog.ai) use both, but L40S for production and 6000 Ada for developers.

But for our workload the GH200 is 5x better than the L40S in $ per 1K tokens per second. That's on an unoptimized inference stack, so I expect that to creep up to 6 or even 7x better.

YMMV.
 

bayleyw

Active Member
Jan 8, 2014
It's a shame that there's a huge queue to get your hands on H100 clusters, as this leaves only the top 0.1% of tech companies as able buyers, allowing them to single-handedly corner the AI marketplace. New AI app? They got to it first. New AI platform? Already in the pipeline courtesy of Microsoft, Meta, and Google. I couldn't stop laughing when @bayleyw said not to host your own inference servers. Many valid points there, but if you know your users are isolated to a specific geographic location, and you're allowing a drip-release access methodology to a select group of clients, i.e., only AI researchers that work at Ivy League universities, then scalability isn't an issue. If it's a private server designed to only provide training and inference for a dedicated department, scaling becomes even easier to manage. @Hazily2019 the boys over at Silicon Alley have the Nvidia RTX 6000 ADA 48GB in stock, if you don't wanna wait 3+ months.
you don't host your own inference servers not because it's not scalable, but because it's not elastic. you have to buy sufficient hardware to support your target latency at peak load, which means off-peak your *extremely expensive* servers are idling. also, from a business standpoint, you're putting down upfront capital and hoping to recoup the costs over the next 12-36 months rather than making your customers pay for your cloud fees.
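To make the elasticity point concrete, here is a toy buy-vs-rent comparison; every figure in it (card price, power cost, cloud rate, amortization window) is an assumed illustrative number, not something quoted in this thread.

```python
# Illustrative break-even sketch for "buy vs rent" inference capacity.
# Every number below is an assumption for illustration, not a quote.

CARD_COST = 8000            # one RTX 6000 Ada class card, USD (assumed)
AMORTIZATION_MONTHS = 24    # recoup window in the 12-36 month range above
POWER_COST_PER_HOUR = 0.10  # ~300W card + host share at ~$0.25/kWh (assumed)
CLOUD_RATE_PER_HOUR = 1.80  # assumed on-demand rate for a comparable GPU

HOURS_PER_MONTH = 730
owned_cost_per_hour = (CARD_COST / (AMORTIZATION_MONTHS * HOURS_PER_MONTH)
                       + POWER_COST_PER_HOUR)

for utilization in (0.10, 0.25, 0.50, 0.90):
    # Owned hardware costs the same whether busy or idle; cloud bills only busy hours.
    cloud_cost_per_hour = CLOUD_RATE_PER_HOUR * utilization
    print(f"utilization {utilization:4.0%}: owned ${owned_cost_per_hour:.2f}/h "
          f"vs cloud ${cloud_cost_per_hour:.2f}/h")
```

With these assumed numbers the owned card only wins once it is kept reasonably busy; at low utilization the pay-per-busy-hour cloud model comes out ahead, which is the elasticity argument in miniature.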

there's a small exception: nvidia no longer lets hosts advertise consumer cards, so you can save a lot of money with a rack full of 4090s, 3090s, or 2080 Ti 22G cards. all of the GDDR6(X) based cards have similar bandwidth, so for use cases like low-batch-size LLM inference, performance is about the same, but of course all will be ~5x slower than the GH200, which is an HBM3e based part with huge bandwidth. the consumer cards are also less lucrative if your workload is mostly input tokens, since prompt processing is compute bound (and mostly-input use cases are getting more and more common as the industry shifts towards RAG).
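A quick way to see the bandwidth-bound point: at batch size 1, every generated token streams the full set of weights from VRAM, so decode speed is capped at roughly memory bandwidth divided by model size. The sketch below uses approximate spec-sheet bandwidth figures; treat them as assumptions.

```python
# Rough decode-speed ceiling at batch size 1: each generated token reads all
# model weights once, so tokens/s <= memory bandwidth / model bytes.
# Bandwidth numbers are approximate spec-sheet figures (assumptions).

BANDWIDTH_GBPS = {
    "RTX 3090 (GDDR6X)":     936,
    "RTX 4090 (GDDR6X)":     1008,
    "RTX A6000 (GDDR6)":     768,
    "RTX 6000 Ada (GDDR6)":  960,
    "L40S (GDDR6)":          864,
    "GH200 (HBM3e)":         4900,
}

MODEL_GB = 20.0  # e.g. a ~34B model at ~4-bit quantization (assumed)

for gpu, bw in BANDWIDTH_GBPS.items():
    print(f"{gpu:24s} ~{bw / MODEL_GB:6.1f} tokens/s ceiling")
```

Real stacks land well below these ceilings, but the ratio between the GDDR6(X) cards and the HBM3e GH200 lines up with the ~5x figure above.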
 

kgcdctx

New Member
Dec 29, 2023
Or you can implement simple job scheduling and finetune on customer data during off-peak. Keep GPUs earning $ and enjoy a cost-of-money capital base instead of paying hyperscalers.
 

bayleyw

Active Member
Jan 8, 2014
Or you can implement simple job scheduling and finetune on customer data during off-peak. Keep GPUs earning $ and enjoy a cost-of-money capital base instead of paying hyperscalers.
sure, if you have a revenue model which supports finetuning (some do) but many (most?) genAI apps are inference only