L40S vs RTX 6000 ADA - for LLMs


justinjja1

New Member
Apr 25, 2016
We are currently developing something that does summarization with an LLM.

Saw the STH vid recommending the L40S...
But it seems to have more or less the same specs as the RTX 6000 Ada at a 50% higher cost.
What are the pros of the L40S over the RTX 6000 Ada?

Not sure if I can post links but:
 

stubaBF

New Member
Dec 30, 2023
It all depends on the precision you will run your model at; for example, it's relevant to know that at INT4 the L40S doesn't have any advantage over the L40.
If you seek a GPU for R&D, IMO the RTX 6000 Ada is the better overall solution. However, if you seek a GPU for production, I highly suggest you run real-case benchmarks with your specific LLM in the cloud before making any purchase.
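As a rough illustration of what such a real-case benchmark can look like, here is a minimal timing sketch assuming an OpenAI-compatible completions endpoint (for example, one served by vLLM) running on the cloud GPU under test; the URL, model name, and prompt are placeholders, not anything from this thread.

```python
import time
import requests

# Hypothetical OpenAI-compatible completions endpoint (e.g. a vLLM server
# on the cloud GPU under test). URL, model name, and prompt are placeholders.
ENDPOINT = "http://localhost:8000/v1/completions"
MODEL = "your-summarization-model"
PROMPT = "Summarize the following document:\n" + "some long document text ... " * 50

def bench(n_runs: int = 5, max_tokens: int = 256) -> None:
    """Measure end-to-end generation throughput in tokens/second."""
    for i in range(n_runs):
        start = time.perf_counter()
        resp = requests.post(ENDPOINT, json={
            "model": MODEL,
            "prompt": PROMPT,
            "max_tokens": max_tokens,
            "temperature": 0.0,
        }, timeout=300)
        resp.raise_for_status()
        elapsed = time.perf_counter() - start
        completion_tokens = resp.json()["usage"]["completion_tokens"]
        print(f"run {i}: {completion_tokens / elapsed:.1f} tokens/s "
              f"({completion_tokens} tokens in {elapsed:.2f}s)")

if __name__ == "__main__":
    bench()
```

Run the same script against each candidate card with your real prompts and output lengths; the relative numbers matter more than the absolute ones.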

Within our infrastructure, we employ two RTX 6000 Ada cards in our local server for research and development, and they have proven to be a robust choice for a diverse range of computational needs.

Based on our testing with the ML model products we currently have in production, the performance differential between the L40S and the RTX 6000 Ada is generally not a decisive factor.

Nevertheless, when designing a server setup for production, it is imperative to consider factors beyond mere performance metrics, including power consumption, physical footprint, availability, overall costs, and other relevant considerations.

PS: note that the RTX 6000 Ada datasheet in the link you provided is outdated. Also, run benchmarks with current drivers to see real-world performance.
 

bayleyw

Active Member
Jan 8, 2014
The L40S and RTX 6000 Ada are the same die at roughly the same MSRP - the RTX 6000 Ada is $9,999 MSRP, currently discounted to about $8K, and the L40S is ??? MSRP (it's a passively cooled server part) and currently goes for $11.5K if you can find it.

If you're doing development, get the RTX 6000 Ada; if you're a lunatic and hosting your own inference servers (don't!), get the L40S. If you're clever, buy some used RTX A6000s, save $5K per card, and get about the same performance, since inference is bandwidth bound anyway. Or save a lot of money and get some used RTX 3090s if your workload fits in 24GB - up to Yi-34B fits at Q4 quantization, leaving plenty of room for context, insofar as you trust Yi-34B...
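To put the 24GB sizing in rough numbers, here is a back-of-the-envelope sketch for a ~34B model at ~4-bit quantization plus a GQA-style KV cache; the model shape and bits-per-weight are approximations for illustration, not measured figures.

```python
# Back-of-the-envelope VRAM estimate for a ~34B model at ~4-bit quantization.
# All figures below are rough assumptions for illustration, not measurements.

def weight_gb(params_b: float, bits_per_weight: float = 4.85) -> float:
    # Q4_K_M-style formats average roughly 4.8-4.9 bits/weight once scales
    # and zero-points are included.
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    # K and V per layer per token, stored in FP16 by default.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

# Yi-34B-ish shape (assumed): 60 layers, 8 KV heads (GQA), head_dim 128.
weights = weight_gb(34)
kv = kv_cache_gb(60, 8, 128, context_tokens=4096)
print(f"weights ~{weights:.1f} GB, 4k-token KV cache ~{kv:.1f} GB, "
      f"total ~{weights + kv:.1f} GB of 24 GB")
```

Real headroom also has to cover activations and framework overhead, but the arithmetic is roughly why a Q4 34B model squeezes into a 24GB card.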
 

Hazily2019

Active Member
Jan 10, 2023
The L40S and ADA 6000 are the same card. One is for servers and one is for workstations. There is hardly a difference between the two. The L40S has a lead time of ~6 weeks and the ADA 6000 has a lead time of ~3 months.

The L40S and ADA 6000 can't do FP64 calculations at any useful rate, so things like math modeling are out of the question. If you want to build a chatbot or work in FP32, these cards are perfect.
 

leo_silicon_alley

New Member
Mar 14, 2024
www.siliconalleyai.com
It's a shame that there's a huge queue to get your hands on H100 clusters, as this leaves only the top 0.1% of tech companies as able buyers, allowing them to single-handedly corner the AI marketplace. New AI app? They got to it first. New AI platform? Already in the pipeline courtesy of Microsoft, Meta, and Google. I couldn't stop laughing when @bayleyw said not to host your own inference servers. Many valid points there, but if you know your users are isolated to a specific geographic location, and you're allowing a drip-release access methodology to a select group of clients, i.e., only AI researchers that work at Ivy League universities, then scalability isn't an issue. If it's a private server designed to only provide training and inference for a dedicated department, scaling becomes even easier to manage. @Hazily2019 the boys over at Silicon Alley have the Nvidia RTX 6000 ADA 48GB in stock, if you don't wanna wait 3+ months.
 

kgcdctx

New Member
Dec 29, 2023
We (stardog.ai) use both, but L40S for production and 6000 Ada for developers.

But for our workload the GH200 is 5x better than the L40S in $ per 1K tokens per second. That's on an unoptimized inference stack, so I expect that to creep up to 6 or even 7x better.

YMMV.
 

bayleyw

Active Member
Jan 8, 2014
It's a shame that there's a huge queue to get your hands on H100 clusters, as this leaves only the top 0.1% of tech companies as able buyers, allowing them to single-handedly corner the AI marketplace. New AI app? They got to it first. New AI platform? Already in the pipeline courtesy of Microsoft, Meta, and Google. I couldn't stop laughing when @bayleyw said not to host your own inference servers. Many valid points there, but if you know your users are isolated to a specific geographic location, and you're allowing a drip-release access methodology to a select group of clients, i.e., only AI researchers that work at Ivy League universities, then scalability isn't an issue. If it's a private server designed to only provide training and inference for a dedicated department, scaling becomes even easier to manage. @Hazily2019 the boys over at Silicon Alley have the Nvidia RTX 6000 ADA 48GB in stock, if you don't wanna wait 3+ months.
you don't host your own inference servers not because it's not scalable, but because it's not elastic. you have to buy sufficient hardware to support your target latency at peak load, which means off-peak your *extremely expensive* servers are idling. also, from a business standpoint, you're putting down upfront capital and hoping to recoup the costs over the next 12-36 months rather than making your customers pay for your cloud fees.
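To make the elasticity point concrete, here is a toy buy-vs-rent comparison; every figure in it (card price, power cost, cloud rate, amortization window) is an assumed illustrative number, not something quoted in this thread.

```python
# Illustrative break-even sketch for "buy vs rent" inference capacity.
# Every number below is an assumption for illustration, not a quote.

CARD_COST = 8000            # one RTX 6000 Ada class card, USD (assumed)
AMORTIZATION_MONTHS = 24    # recoup window in the 12-36 month range above
POWER_COST_PER_HOUR = 0.10  # ~300W card + host share at ~$0.25/kWh (assumed)
CLOUD_RATE_PER_HOUR = 1.80  # assumed on-demand rate for a comparable GPU

HOURS_PER_MONTH = 730
owned_cost_per_hour = (CARD_COST / (AMORTIZATION_MONTHS * HOURS_PER_MONTH)
                       + POWER_COST_PER_HOUR)

for utilization in (0.10, 0.25, 0.50, 0.90):
    # Owned hardware costs the same whether busy or idle; cloud bills only busy hours.
    cloud_cost_per_hour = CLOUD_RATE_PER_HOUR * utilization
    print(f"utilization {utilization:4.0%}: owned ${owned_cost_per_hour:.2f}/h "
          f"vs cloud ${cloud_cost_per_hour:.2f}/h")
```

With these assumed numbers the owned card only wins once it is kept reasonably busy; at low utilization the pay-per-busy-hour cloud model comes out ahead, which is the elasticity argument in miniature.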

there's a small exception: nvidia no longer lets hosts advertise consumer cards, so you can save a lot of money with a rack full of 4090s, 3090s, or 2080 Ti 22G cards. all of the GDDR6(X) based cards have similar bandwidth, so for use cases like low-batch-size LLM inference, performance is about the same, but of course all will be ~5x slower than the GH200, which is an HBM3e based part with huge bandwidth. the consumer cards are also less lucrative if your workload is mostly input tokens, since prompt processing is compute bound (and mostly-input use cases are getting more and more common as the industry shifts towards RAG).
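A quick way to see the bandwidth-bound point: at batch size 1, every generated token streams the full set of weights from VRAM, so decode speed is capped at roughly memory bandwidth divided by model size. The sketch below uses approximate spec-sheet bandwidth figures; treat them as assumptions.

```python
# Rough decode-speed ceiling at batch size 1: each generated token reads all
# model weights once, so tokens/s <= memory bandwidth / model bytes.
# Bandwidth numbers are approximate spec-sheet figures (assumptions).

BANDWIDTH_GBPS = {
    "RTX 3090 (GDDR6X)":     936,
    "RTX 4090 (GDDR6X)":     1008,
    "RTX A6000 (GDDR6)":     768,
    "RTX 6000 Ada (GDDR6)":  960,
    "L40S (GDDR6)":          864,
    "GH200 (HBM3e)":         4900,
}

MODEL_GB = 20.0  # e.g. a ~34B model at ~4-bit quantization (assumed)

for gpu, bw in BANDWIDTH_GBPS.items():
    print(f"{gpu:24s} ~{bw / MODEL_GB:6.1f} tokens/s ceiling")
```

Real stacks land well below these ceilings, but the ratio between the GDDR6(X) cards and the HBM3e GH200 lines up with the ~5x figure above.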
 

kgcdctx

New Member
Dec 29, 2023
Or you can implement simple job scheduling and finetune on customer data during off-peak. Keep GPUs earning $ and enjoy a cost-of-money capital base instead of paying hyperscalers.
 

bayleyw

Active Member
Jan 8, 2014
Or you can implement simple job scheduling and finetune on customer data during off-peak. Keep GPUs earning $ and enjoy a cost-of-money capital base instead of paying hyperscalers.
sure, if you have a revenue model which supports finetuning (some do) but many (most?) genAI apps are inference only