Nvidia's Chat with RTX is an AI chatbot that runs locally on your PC

operator-name · 2024-02-13T17:56:09 1707846969

This looks quite cool! It's basically a tech demo for TensorRT-LLM, a framework that amongst other things optimises inference time for LLMs on Nvidia cards. Their base repo supports quite a few models.

Previously there was TensorRT for Stable Diffusion[1], which provided pretty drastic performance improvements[2] at the cost of customisation. I don't forsee this being as big of a problem with LLMs as they are used "as is" and augmented with RAG or prompting techniques.

[1]: https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT [2]: https://reddit.com/r/StableDiffusion/comments/17bj6ol/hows_y...

operator-name · 2024-02-13T18:49:59 1707850199

Having installed this, this is an incredibly then wrapper around the following github repos:

https://github.com/NVIDIA/trt-llm-rag-windows https://github.com/NVIDIA/TensorRT-LLM

It's quite a thin wrapper around putting both projects into %LocalAppData%, along with a miniconda environment with the correct dependnancies installed. Also for some reason the LLaMA 13b (24.5GB) and Ministral 7b (13.6GB) but only installed Ministral?

Ministral 7b runs about as accurate as I remeber, but responses are faster than I can read. This seems at the cost of context and variance/temperature - although it's a chat interface the implementation doesn't seem to take into account previous questions or answers. Asking it the same question also gives the same answer.

The RAG (llamaindex) is okay, but a little suspect. The installation comes with a default folder dataset, containing text files of nvidia marketing materials. When I tried asking questions about the files, it often cites the wrong file even if it gave the right answer.

kkielhofner · 2024-02-13T19:09:14 1707851354

The wrapping of TensorRT-LLM alone is significant.

I’ve been working with it for a while and it’s… Rough.

That said it is extremely fast. With TensorRT-LLM and Triton Inference Server with conservation performance settings I get roughly 175 tokens/s on an RTX 4090 with Mistral-Instruct 7B. Following commits, PRs, etc I expect this to increase significantly in the future.

I’m actually working on a project to better package Triton and TensorRT-LLM and make it “name and model and press enter” level usable with support for embeddings models, Whisper, etc.

FirmwareBurner · 2024-02-13T20:02:13 1707854533

>LLaMA 13b (24.5GB) and Ministral 7b (13.6GB)

But the HW requirements state 8GB of VRAM. How do those models fit in that?

av3csr · 2024-02-13T23:11:19 1707865879

They are int4 quantized

Kranar · 2024-02-14T03:54:36 1707882876

Does int4 mean 4 bits per integer, or 4 bytes/32-bits.

If it means that weights for an LLM can be 4 bits well that's just mind boggling.

sillysaurusx · 2024-02-14T04:40:50 1707885650

Four bits per parameter. (A parameter is what you call an integer here.)

I was skeptical of it for some time, but it seems to work because individual parameters don’t encode much information. The knowledge is embedded thanks to having a massive number of low bit parameters.

justahuman74 · 2024-02-14T04:42:20 1707885740

4 bits

randerson · 2024-02-13T22:48:25 1707864505

The Creative Labs sound cards of the early 90's came with Dr. Sbaitso, an app demoing their text-to-speech engine by pretending to an AI psychologist. Someone needs to remake that!

crtified · 2024-02-14T02:10:36 1707876636

Awhile back I engaged in the utterly banal 5-minute pursuit of having Dr. Sbaitso speak with ChatGPT.

It did not go well. The generation gap was perhaps even starker than it is between real people.

https://forums.overclockers.com.au/threads/chatgpt-vs-dr-sba...

pests · 2024-02-14T05:36:45 1707889005

I found the early response of ChatGPT claiming it's not Dr.Sbaito a weird response

lysp · 2024-02-14T01:44:33 1707875073

It's actually been ported to web - along with a lot of other dos based games.

https://classicreload.com/dr-sbaitso.html

swozey · 2024-02-14T04:35:22 1707885322

This is hilarious.

Ok, tell me your problem $name.

"I'm sad."

Do you enjoy being sad?

"No"

Are you sure?

"Yes"

That should solve your problem. Lets move on to discuss about some other things.

Also wow that creative app/sound/etc brings back memories.

davedunkin · 2024-02-13T23:09:40 1707865780

Levels is making something like that. https://twitter.com/levelsio/status/1756396158652432695

I would really want it to have that Dr. Sbaitso voice, though, telling me how to be fitter, happier, more productive.

sedatk · 2024-02-14T02:35:17 1707878117

Somebody actually integrated ChatGPT with Dr.Sbaitso: https://bert.org/2023/01/06/chatgpt-in-dr-sbaitso/

dcist · 2024-02-14T00:05:48 1707869148

Yes, I remember Dr. Sbaitso very, very well. I spent many hours with it as a kid and thought it was tons of fun. To be frank, Dr. Sbaitso is why I was underwhelmed when chatbots were hyped in the early 2010s. I couldn't understand why anyone would be excited about 90s tech.

Mistletoe · 2024-02-14T02:55:38 1707879338

Chatting with ALICE is what has tempered my ChatGPT hype. It was neat and seemed like magic, but I think it was in the 90s when I tried it. I'm sure for new people it feels like an unprecedented event to talk to a computer and it seem sentient.

https://www.pandorabots.com/pandora/talk?botid=b8d616e35e36e...

Like other bogus things like tarot or horoscopes, it's amazing what you can discover when you talk about something, it asks you questions, and what you want or desire eventually floats to the surface. And now people are even more lonely...

>Human: do you like video games

>A.L.I.C.E: Not really, but I like to play the Turing Game.

jhbadger · 2024-02-14T02:54:18 1707879258

Really just 1960s tech. Dr. Sbaitso was really just a version of ELIZA with the only new part being the speech synthesis.

McAtNite · 2024-02-13T16:10:16 1707840616

I’m struggling to understand the point of this. It appears to be a more simplified way of getting a local LLM running on your machine, but I expect less technically inclined users would default to using the AI built into Windows while the more technical users will leverage llama.cpp to run whatever models they are interested in.

Who is the target audience for this solution?

operator-name · 2024-02-13T17:43:56 1707846236

This is a tech demo for TensorRT, which is ment to greatly improve inference time for compatible models.

brucethemoose2 · 2024-02-13T23:53:04 1707868384

> the more technical users will leverage llama.cpp to run whatever models they are interested in.

Llama.cpp is much slower, and does not have built-in RAG.

TRT-LLM is a finicky deployment grade framework, and TBH having it packaged into a one click install with llama index is very cool. The RAG in particular is beyond what most local LLM UIs do out-of-the-box.

dkarras · 2024-02-13T18:54:25 1707850465

>It appears to be a more simplified way of getting a local LLM running on your machine

No, it answers questions from the documents you provide. Off the shelf local LLMs don't do this by default. You need a RAG stack on top of it or fine tune with your own content.

westurner · 2024-02-13T21:31:45 1707859905

From "Artificial intelligence is ineffective and potentially harmful for fact checking" (2023) https://news.ycombinator.com/item?id=37226233 : pdfgpt, knowledge_gpt, elasticsearch :

> Are LLM tools better or worse than e.g. meilisearch or elasticsearch for searching with snippets over a set of document resources?

> How does search compare to generating things with citations?

pdfGPT: https://github.com/bhaskatripathi/pdfGPT :

> PDF GPT allows you to chat with the contents of your PDF file by using GPT capabilities.

GH "pdfgpt" topic: https://github.com/topics/pdfgpt

knowledge_gpt: https://github.com/mmz-001/knowledge_gpt

From https://news.ycombinator.com/item?id=39112014 : paperai

neuml/paperai: https://github.com/neuml/paperai :

> Semantic search and workflows for medical/scientific papers

RAG: https://news.ycombinator.com/item?id=38370452

Google Desktop (2004-2011): https://en.wikipedia.org/wiki/Google_Desktop :

> Google Desktop was a computer program with desktop search capabilities, created by Google for Linux, Apple Mac OS X, and Microsoft Windows systems. It allowed text searches of a user's email messages, computer files, music, photos, chats, Web pages viewed, and the ability to display "Google Gadgets" on the user's desktop in a Sidebar

GNOME/tracker-miners: https://gitlab.gnome.org/GNOME/tracker-miners

src/miners/fs: https://gitlab.gnome.org/GNOME/tracker-miners/-/tree/master/...

SPARQL + SQLite: https://gitlab.gnome.org/GNOME/tracker-miners/-/blob/master/...

https://news.ycombinator.com/item?id=38355385 : LocalAI, braintrust-proxy; promptfoo, chainforge, mixtral

qdequelen · 2024-02-15T05:59:02 1707976742

> Are LLM tools better or worse than e.g. meilisearch or elasticsearch for searching with snippets over a set of document resources?

Absolutely worse, LLM are not made for it at all.

fortran77 · 2024-02-13T17:25:02 1707845102

It seems really clear to me! I downloaded it, pointed it to my documents folder, and started running it. It's nothing like the "AI built into Windows" and it's much easier than dealing with rolling my own.

SirMaster · 2024-02-13T17:02:38 1707843758

This lets you run Mistral or Llama 2, so whomever has an RTX card and wants to run either of those models?

And perhaps they will add more models in the future?

pquki4 · 2024-02-13T21:54:52 1707861292

I don't think your comment answers the question? Basically, those who bother to know underlying model's name can already run their model without this tool from nvidia?

ls612 · 2024-02-14T05:51:02 1707889862

It will run a lot faster by using the tensor (Ray Tracing) cores than the standard CUDA cores.

McAtNite · 2024-02-13T17:09:23 1707844163

I suppose I’m just struggling to see the value add. Ollama already makes it dead simple to get a local LLM running, and this appears to be a more limited vendor locked equivalent.

From my point of view the only person who would be likely to use this would be the small slice of people who are willing to purchase an expensive GPU, know enough about LLMs to not want to use CoPilot, but don’t know enough about them to know of the already existing solutions.

kkielhofner · 2024-02-13T17:28:55 1707845335

With all due respect this comment has fairly strong (and infamous) HN Dropbox thread vibes.

It's an Nvidia "product", published and promoted via their usual channels. This is co-sign/official support from Nvidia vs "Here's an obscure name from a dizzying array of indistinguishable implementations pointing to some random open source project website and Github repo where your eyes will glaze over in seconds".

Completely different but wider and significantly less sophisticated audience. The story link is on The Verge and because this is Nvidia it will also get immediately featured in every other tech publication, website, subreddit, forum, twitter account, youtube channel, etc.

This will get more installs and usage in the next 72 hours than the entire Llama/open LLM ecosystem has had in its history.

McAtNite · 2024-02-13T17:41:15 1707846075

Unfortunately I’m not aware of the reference to the HN Dropbox thread.

I suppose my counter point is only that the user base that relies on simplified solutions is largely already addressed with the wide number of cloud offerings from OpenAi, Microsoft, Google, whatever other random company has popped up. Realistically I don’t know if the people who don’t want to use those, but also don’t want to look at GitHub pages is really that wide of an audience.

You could be right though. I could be out of touch with reality on this one, and people will rush to use the latest software packaged by a well known vendor.

thecal · 2024-02-13T17:54:09 1707846849

It is probably the most famous HN comment ever made and comes up often. It is a dismissive response to Dropbox years ago:

https://news.ycombinator.com/item?id=9224

McAtNite · 2024-02-13T18:22:42 1707848562

Thanks for the explanation. I guess my only hope for not looking like I had a bad opinion is people’s intertia to move beyond CoPilot.

anonymousab · 2024-02-13T18:22:06 1707848526

> the user base that relies on simplified solutions is largely already addressed

There is a wide spectrum of users for which a more white-labelled locally-runnable solution might be exactly what they're looking for. There's much more than just the two camps of "doesn't know what they're doing" and "technically inclined and knows exactly what to do" with LLMs.

pquki4 · 2024-02-13T22:00:19 1707861619

Anyone who bothers to distinguish a product from Microsoft/nvidia/meta/someone else already know what they are doing.

Most users don't care whether whether the model is run, online or local. They go to ChatGPT or Bing/Copilot to get answers, as long as they are free. Well, if it becomes a (mandatory) subscription, they are more likely to pay for it rather than figure out how to run a local LLM.

Sounds like you are the one who's not getting the message.

So basically the only people who runs a local LLM are those who are interested enough in this. Any why would brand name matter? What matters is whether a model is good, whether it can run on a specific machine and how fast it is etc, and there are objectives for it. People who run local LLM don't automatically choose Nvidia's product over something just because nvidia is famous.

kkielhofner · 2024-02-14T15:17:04 1707923824

I'll try again.

Have you ever tried to use ChatGPT alone to work with documents? In terms of the free/ready to use product it's very painful. Give it a URL to a PDF (or something) and assuming it can load it (often can't) you can "chat" with it. One document at a time...

This is for the (BIG) world of Nvidia Windows desktop users (most of whom are fanboys who will install anything Nvidia announces that sounds cool) who don't know what an LLM is. They certainly wouldn't know/have the inclination to wander into /r/LocalLLaMA or some place to try to sort through a bunch of random projects with obscure names that are peppered with jargon and references to various models they've also never heard of or know the difference between. Then the next issue is figuring out the RAG aspects, which is an entirely different challenge.

This is a Windows desktop installer that picks one of two models automatically depending on how much VRAM you have, loads them to run on your GPU using one of the fastest engines out there, and then allows you to load your own local content and interact with it in a UI that just pops up after you double-click the installer. It's green and peppered with Nvidia branding everywhere. They love it.

What the Nvidia Windows desktop users will be able to understand is "WOW, look it's using my own GPU for everything according to my process manager. I just made my own ChatGPT and can even chat with my own local documents. Nvidia is amazing!"

> why would brand name matter?

Do you know anything about humans? Brands make a HUGE difference.

> People who run local LLM don't automatically choose Nvidia's product over something just because nvidia is famous.

/r/LocalLLaMA is currently filled with people ranting and raving about this even though it's inferior (other than ease of use and brand halo) to much of the technology that has been discussed there since forever.

Again - humans spend many billions and billions of dollars choosing products that are inferior solely because of the name/brand.

Capricorn2481 · 2024-02-13T18:21:26 1707848486

I have no idea what you're talking about and am waiting for an answer to OPs question. Downloading text-generation-webui takes a minute, let's you use any model and get going. I don't really understand what this Nvidia thing adds? It seems even more complicated than the open source offerings.

I don't really care how many installs it gets, does it do anything differently or better?

kkielhofner · 2024-02-14T15:34:42 1707924882

> Downloading text-generation-webui takes a minute, let's you use any model and get going.

What you're missing here is you're already in this area deep enough to know what ooogoababagababa text-generation-webui is. Let's back out to the "average Windows desktop user who knows they have an Nvidia card" level. Assuming they even know how to find it:

1) Go to https://github.com/oobabooga/text-generation-webui?tab=readm...

2) See a bunch of instructions opening a terminal window and running random batch/powershell scripts. Powershell, etc will likely prompt you with a scary warning. Then you start wondering who ooobabagagagaba is...

3) Assuming you get this far (many users won't even get to step 1) you're greeted with a web interface[0] FILLED to the brim with technical jargon and extremely overwhelming options just to get a model loaded, which is another mind warp because you get to try to select between a bunch of random models with no clear meaning and non-sensical/joke sounding names from someone called "TheBloke". Ok... Oh yeah, what's a "model"? GGUF? GPTQ? AWQ? Exllama? Prompt format? Transformers? Tokens? Temperature? Repeat for dozens of things you're familiar with but are meaningless to them.

Let's say you somehow braved this gauntlet and get this far now you get to chat with it. Ok, what about my local documents? text-generation-webui itself has nothing for that. Repeat this process over the 10 random open source projects from a bunch of names you've never heard of in an attempt to accomplish that.

This is "I saw this thing from Nvidia explode all over media, twitter, youtube, etc. I downloaded it from Nvidia, double-clicked, pointed it at a folder with documents, and it works".

That's the difference and it's very significant.

[0] - https://raw.githubusercontent.com/oobabooga/screenshots/main...

tracerbulletx · 2024-02-13T19:01:12 1707850872

It's a different inference engine with different capabilities. It should be a lot faster on Nvidia cards. I don't have comp benchmarks for llama.cpp but if you find some compare them to this.

https://nvidia.github.io/TensorRT-LLM/performance.html https://github.com/lapp0/lm-inference-engines/

sevagh · 2024-02-13T18:39:22 1707849562

It brings more authority than "oh just use <string of gibberish from the frontpage of hn>"

Capricorn2481 · 2024-02-13T23:15:57 1707866157

That tells you how it might affect people's perception of it, not whether it's better in any way.

sevagh · 2024-02-13T23:19:31 1707866371

Sure, it's just disingenuous to pretend that authority doesn't matter.

Capricorn2481 · 2024-02-14T01:23:03 1707873783

Disingenuous to what? I'm asking what it brings someone who can already use an open source solution. I feel like you're just trying to argue for the sake of it.

SirMaster · 2024-02-13T17:13:10 1707844390

I just looked up Ollama and it doesn't look like it supports Windows. (At least not yet)

McAtNite · 2024-02-13T17:16:52 1707844612

Oh my apologies for the wild goose chase. I thought they had added support for Windows already. Should be possible to run it through WSL, but I suppose that’s a solid point for Nvidia in this discussion.

SirMaster · 2024-02-13T17:22:56 1707844976

I think there's a market for a user who is not very computer savvy who at least understands how to use LLMs and would potentially run a chat one on their GPU especially if it's just a few clicks to turn on.

dist-epoch · 2024-02-13T17:44:21 1707846261

There are developers which fail to install Ollama/CUDA/Python/create-venv/download-models on their computer after many hours of trying.

You think a regular user has any chance?

McAtNite · 2024-02-13T17:47:33 1707846453

Not really. I expect those users will just use copilot.

se4u · 2024-02-13T17:39:39 1707845979

You are forgetting about developers who may want to develop on top of something stable and with long term support. That's a big market.

McAtNite · 2024-02-13T17:43:50 1707846230

Would they not prefer to develop for CoPilot? In comparison this seems niche.

imtringued · 2024-02-14T08:22:39 1707898959

>people who are willing to purchase an expensive GPU,

Codeword for people who have hardware specialized and suitable for AI.

ribosometronome · 2024-02-13T18:43:17 1707849797

Gamers who bought an expensive card and see this advertised to them in Nvidia's Geforce app?

papichulo2023 · 2024-02-13T16:31:51 1707841911

Does windows uses the pc's gpu or just cpu or cloud?

robotnikman · 2024-02-13T16:54:01 1707843241

If they are talking about the Bing AI, just using whatever OpenAI has in the cloud

McAtNite · 2024-02-13T17:02:23 1707843743

I’m referring to CoPilot which for your average non technical user who doesn’t care whether something is local or not has the huge benefit of not requiring the purchase an expensive GPU.

zamadatix · 2024-02-13T17:09:31 1707844171

Never underestimate people's interest in running something which lets them generate crass jokes about their friends or smutty conversation when hosted solutions like CoPilot could never allow such non-puritan morals. If this delivers on being the easiest way to run local models quickly then many people will be interested.

joenot443 · 2024-02-13T19:04:55 1707851095

The immediate value prop here is the ability to load up documents to train your model on the fly. 6mos ago I was looking for a tool to do exactly this and ended up deciding to wait. Amazing how fast this wave of innovation is happening.

seydor · 2024-02-13T16:54:31 1707843271

Windows users who haven't bought an Nvidia card yet

tuananh · 2024-02-13T17:21:31 1707844891

this is exactly what i want: a personal assistant.

a personal assistant to monitor everything i do on my machine, ingest it and answer question when i need.

it's not there yet (still need to manually input url, etc...) though but it's very much feasible.

mistermann · 2024-02-13T17:44:54 1707846294

I'd like something that monitors my history on all browsers (mobile and desktop, and dedicated client apps like substance, Reddit, etc) and then ingests the articles (and comments, other links with some depth level maybe) and then allows me to ask questions....that would be amazing.

tuananh · 2024-02-13T17:58:21 1707847101

yes, i want that too. not sure if anyone is building sth like this?

majestic5762 · 2024-02-13T18:03:25 1707847405

rewind.ai

lunatuna · 2024-02-13T23:21:53 1707866513

Tried rewind, it does an amazing job grabbing everything locally and the search is great. With the addition of Kin it would be an easy buy.

majestic5762 · 2024-02-13T17:56:13 1707846973

mykin.ai is building this with privacy in mind. Runs small models on-device, while large ones in confidential VMs in the cloud.

Xeyz0r · 2024-02-13T17:38:59 1707845939

But it sounds kinda creepy don't you think?

gmueckl · 2024-02-13T17:46:21 1707846381

You'd be the one controlling the off-switch and the physical storage devices for the data. I'd think that this fact takes most of the potential creep out. What am I not seeing here?

Capricorn2481 · 2024-02-13T18:26:27 1707848787

> You'd be the one controlling the off-switch and the physical storage devices for the data

Based on what? The CPU is a physical storage device on my PC but it still can phone home and has backdoors.

Is there any reason to think Nvidia isn't collecting my data?

pixl97 · 2024-02-13T18:46:35 1707849995

If you're on linux just monitor and block any traffic to random addresses.

If you're on Windows, what makes you think they are not already?

Capricorn2481 · 2024-02-13T23:14:26 1707866066

The Intel backdoor is at the Kernel. OS has nothing to do with it.

> what makes you think they are not already

That is the point

chollida1 · 2024-02-13T17:51:38 1707846698

> But it sounds kinda creepy don't you think?

is the bash history command creepy?

Is your browsers history command creepy?

Nullabillity · 2024-02-13T20:27:34 1707856054

Yes to both?

But those also don't try to reinterpret what I wrote.

spullara · 2024-02-13T17:43:11 1707846191

it is all local so, no?

autoexec · 2024-02-13T17:59:46 1707847186

It generates responses locally, but does your data stay local? It's fine if you only ever use it on a device that you leave offline 100% of the time, but otherwise I'd pay close attention to what it's doing. Nvidia doesn't have a great track record when it comes to privacy (for example: https://news.ycombinator.com/item?id=12884762).

operator-name · 2024-02-13T18:14:58 1707848098

The source is available, minus the installer. You could always use the base repo after verifying it:

https://github.com/NVIDIA/trt-llm-rag-windows

tuananh · 2024-02-13T17:58:03 1707847083

if it's 100% local then fine.

yuck39 · 2024-02-13T17:24:16 1707845056

Interesting. Since you are running it locally do they still have to put up all the legal guardrails that we see from Chat GPT and the like?

dist-epoch · 2024-02-13T17:55:09 1707846909

Yes, because otherwise there would be news articles "NVIDIA installs racist/sexist/... LLM on users computers"

phone8675309 · 2024-02-13T22:42:21 1707864141

Gaming company

Gaming LLM

Checks out

mchinen · 2024-02-13T17:26:16 1707845176

Given that you can pick llama or mistral in the NVIDIA interface, I'm curious if this is built around ollama or reimplementing something similar. The file and URL retrieval is a nice addition in any case.

navjack27 · 2024-02-13T14:33:48 1707834828

30 and 40 series only? My 2080 Ti scoffs at the artificial limitation

andy_xor_andrew · 2024-02-13T17:07:28 1707844048

so they branded this "Chat with RTX", using the RTX branding. Which, originally, meant "ray tracing". And the full title of your 2080 Ti is the "RTX 2080 Ti".

So, reviewing this...

- they are associating AI with RTX (ray tracing) now (??)

- your RTX card cannot chat with RTX (???)

wat

a13o · 2024-02-13T18:37:17 1707849437

The marketing whiff on ray tracing happened long ago. DLSS is the killer app on RTX cards, another 'AI'-enabled workload.

startupsfail · 2024-02-13T17:17:36 1707844656

No support for bf16 in a card that was released more than 5 years ago, I guess? Support starts with Ampere?

Although you’d realistically need 5-6 bit quantization to get anything large/usable enough running on a 12GB card. And I think it’s just CUDA then, so you should be able to use 2080 Ti.

nottorp · 2024-02-13T19:09:28 1707851368

That was my first question, does it display pretty ray traced images instead of answers?

Havoc · 2024-02-15T00:37:38 1707957458

RTX is a brand more than ray tracing.

It is largely an arbitrary generational limit

0x457 · 2024-02-13T17:51:10 1707846670

> I pull my PC with Intel 8086 out of closet

> I try to run windows 10 on it

> It doesn't work

> pff, Intel cpu cannot run OS meant for intel CPUs

wat

Jokes aside, nvidia been using RTX branding for products that use Tensor Cores for a long-time now. Limitation due to 1st gen tensor cores not supporting precisions required.

operator-name · 2024-02-13T18:12:41 1707847961

Yeah, seems a bit odd because the TensorRT-LLM repo lists Turing as supported architecture.

https://github.com/NVIDIA/TensorRT-LLM?tab=readme-ov-file#pr...

speckx · 2024-02-13T17:42:51 1707846171

I, too, was hoping that my 2080 Ti from 2019 would suffice. =(

phone8675309 · 2024-02-13T17:00:22 1707843622

Don't worry, they'll be happy to charge you $750 for an entry level card next generation that can run this.

haunter · 2024-02-13T17:44:50 1707846290

Cheapest 40xx is $288

https://pcpartpicker.com/products/video-card/#c=552&sort=pri...

Chepest 8GB 30xx is $220

https://pcpartpicker.com/products/video-card/#sort=price&c=5...

tekla · 2024-02-13T17:02:47 1707843767

Yes peasants, Nvidia requires you to buy the latest and greatest expensive luxury gear, and you will BEG for it.

nickthegreek · 2024-02-13T17:33:22 1707845602

You can use an older 3-series card. No latest & greatest required.

nickthegreek · 2024-02-13T17:31:38 1707845498

A 4060 8gb is $300.

redder23 · 2024-02-13T23:37:28 1707867448

> and all you need is an RTX 30- or 40-series GPU with at least 8GB of VRAM

Smells like artificial restriction to me. I have a 2080 Ti with 8GB of VRAM that is still perfectly fine for gaming. I play in 3440x1440 and the modern games need DLSS/FSR on quality for nice 60++ - 90 FPS. That is perfectly enough for me and I have not had a game, even UE5 games where I really thought I really NEED a new one. I bet that card is totally capable of running that chatbot.

They do the same with frame generation. There they even require you a 40 series card. That is ridiculous to me as these cards are so fast that you do not even need any frame generation. The slower cards are the ones that would benefit from it most so they just lock it down artificially to boost their sales.

magicalhippo · 2024-02-13T23:56:25 1707868585

> 2080 Ti with 8GB of VRAM

Sure you don't mean 11GB[1]? Or did they make other variants? FWIW I have a 2080 Ti with 11GB, been considering upgrading but thinking I'll wait til 5xxx.

[1]: https://www.techpowerup.com/gpu-specs/geforce-rtx-2080-ti.c3...

redder23 · 2024-02-14T09:23:00 1707902580

Yes of course, I have a 11GB of VRAM.

My next card will be an AMD one. I like that they are open sourcing most of their stuff and I think they play better with Linux Wine/Proton. FSR 3 also not artificially restricts cards and runs even on the competition. I read today about at open source API that takes CUDA calls and runs them on AMD or everywhere. I am sure there will be some cool open source projects that do all kinds of things if I ever even need them.

htrp · 2024-02-13T17:45:55 1707846355

Are there benchmarks on how much faster TensorRT vs native torch/cuda?

fisf · 2024-02-13T18:11:33 1707847893

https://nvidia.github.io/TensorRT-LLM/performance.html

It was one of the fastest backends last time I checked (with vLLM and lmdeploy being comparable), but the space moves fast. It uses cuda under the hood, torch is not relevant in this context.

operator-name · 2024-02-13T18:10:26 1707847826

I found some official benchmarks for enterprise GPUs, but no comparison data. I couldn't find any benchmarks for commercial GPUs.

https://nvidia.github.io/TensorRT-LLM/performance.html

mdrzn · 2024-02-13T14:49:55 1707835795

Requirement:

NVIDIA GeForce™ RTX 30 or 40 Series GPU or NVIDIA RTX™ Ampere or Ada Generation GPU with at least 8GB of VRAM"

strangecasts · 2024-02-13T15:46:56 1707839216

Unfortunately the download is taking its time - which kind of base model is it using and what techniques (if any) are they using to offload weights?

Since the demo is 35 GB, my first assumption was it's bundling a ~13B parameter model, but if the requirement is 8 GB VRAM, I assume they're either doing quantization on the user's end or offloading part of the model to the CPU.

(I also hope that Windows 11 is a suggested and not a hard requirement)

operator-name · 2024-02-13T19:01:09 1707850869

For some reason it's actually bundling both LLaMA 13b (24.5GB) and Ministral 7b (13.6GB), but only installed Ministral 7b. I have a 3070ti 8GB, so maybe it installs the other one if you have more VRAM?

ReFruity · 2024-02-13T20:28:27 1707856107

I have 3070 and when I choose LLaMA in config it just changes it back to Mistral on launch

nottorp · 2024-02-13T19:11:22 1707851482

8 Gb minimum? So they're excluding the new 3050 6 Gb that is only powered from pcie?

anon115 · 2024-02-13T23:01:35 1707865295

System Requirements Platform Windows GPU NVIDIA GeForce™ RTX 30 or 40 Series GPU or NVIDIA RTX™ Ampere or Ada Generation GPU with at least 8GB of VRAM RAM 16GB or greater OS Windows 11 Driver 535.11 or later ----yeah not in this lifetime baby...

jjcm · 2024-02-13T23:28:42 1707866922

FWIW, based off of Steam's hardware stats from this January:

54% have at least that much VRAM

79% have at least that much RAM

44% are on Windows 11

36% have the required video card

breakds · 2024-02-14T06:44:13 1707893053

Good point. Maybe Nvidia should also just publish this on Steam ...

krunck · 2024-02-13T17:59:24 1707847164

Does this communicate with other people's (cloud) computers at all?

temp_user · 2024-02-14T03:03:54 1707879834

So, there should be something equivalent for Linux right? Ill be thankful to the person that points me to the right github repository, I am new to local LLM.

alecco · 2024-02-14T08:28:58 1707899338

https://github.com/NVIDIA/TensorRT-LLM

mleroy · 2024-02-13T21:05:37 1707858337

I actually thought AMD would release something like that. But somehow they don't seem to see their chance.

imtringued · 2024-02-14T08:26:51 1707899211

On what hardware with what software?

Do you think people got an MI300 lying around?

AMD's GPUs simply weren't meant for GPGPU.

9front · 2024-02-13T18:32:11 1707849131

It's a 35.1GB download!

politelemon · 2024-02-13T18:34:57 1707849297

It's the L in LLM!

spullara · 2024-02-13T18:24:53 1707848693

So you download a 35G zipfile, then extract that, then run the setup where you can decide not to install one of the models. Why doesn't it just download the model you want to use? Who did this?

Oh and now the install failed with no error message. Lovely.

9front · 2024-02-13T19:33:26 1707852806

In my case the installer finished okay. It pulled in miniconda, nvidia-toolkit and "Mistral 7b int4". Upon launch it opens a gradio-chat session. For the datasets it supports .txt, .pdf, .doc files. There's a "YouTube URL" option and "AI model default".

When asked "What is ChatWithRTX" the reply was:

    Chat with RTX is not mentioned in the given context information. It is 
    possible that it is a separate event or feature that is related to the 
    NVIDIA Lounge at LTX 2023 or QuakeCon 2023, where attendees can interact 
    with a virtual assistant or chatbot powered by NVIDIA's GeForce RTX 
    technology. However, without more information, it is impossible to provide a 
    more detailed answer.
    Reference files: portal-prelude-rtx-inside-the-game-interview.txt

xyst · 2024-02-14T00:03:08 1707868988

Chatbots are hot again.

2010-2011, vibes.

amelius · 2024-02-13T17:21:27 1707844887

With only 8GB of VRAM, it can't be that good ...

speed_spread · 2024-02-13T18:00:15 1707847215

Newer models such as Phi2 run comfortably with 4GB and are good enough to be useful for casual interaction. Sticking with local inference, multiple small models tuned for specific usage scenarios is where it's at.

v4lheru · 2024-02-13T20:08:59 1707854939

i have 3050 and its failing for me. do i have to install it on Windows Drive - cause i don't have that much space there?

haunter · 2024-02-13T21:34:27 1707860067

Do you have an 8GB 3050?

RockRobotRock · 2024-02-13T17:36:53 1707845813

Why can't this run on older devices?

Legend2440 · 2024-02-13T17:43:02 1707846182

It's an LLM, older devices don't have the juice.

Newer devices only barely have the juice.

RockRobotRock · 2024-02-13T18:01:21 1707847281

What does that mean, though? Is it a VRAM thing? I have a 20 series card with 11 GB and okay performance in CUDA for things like OpenAI Whisper.

I think it could run it, albeit slowly.

imtringued · 2024-02-14T08:30:54 1707899454

It's always a VRAM thing from this point on. Compute will always be abundant in relation to memory capacity and bandwidth. The only places were this doesn't count is I low power situations such as embedded, where you might intentionally choose a small model to save power.

RockRobotRock · 2024-02-15T03:28:20 1707967700

Thanks for the insight. Time for an upgrade.

smcleod · 2024-02-14T01:40:15 1707874815

MS Windows only it seems

temp_user · 2024-02-14T05:46:19 1707889579

So I am looking for the Linux version of this...

vdaea · 2024-02-13T15:44:38 1707839078

I suppose this app despite running locally will also be heavily censored.

Is there some local chatbot application like this, for Windows, that isn't hell to set up and that is not censored?

theshrike79 · 2024-02-13T17:03:26 1707843806

https://lmstudio.ai

You can use it to directly search any models, download and run 100% locally.

M-series Macs are the simplest, they Just Work. Even faster if you tick the GPU box.

Windows needs the right kind of GPU to get workable speed.

thejohnconway · 2024-02-13T17:30:31 1707845431

Can that interact with files on your computer, like they show in the video?

operator-name · 2024-02-13T18:04:32 1707847472

Last time I used it, LM Studio doesn't include RAG.

mrdatawolf · 2024-02-13T19:35:47 1707852947

I just installed it yesterday and you are right it does not seem to have RAG but you can use something like anythingLLM to do the rag work and ut has built in integration with studio LM.

Const-me · 2024-02-13T18:41:29 1707849689

You could try my Mistral implementation: https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral...

purpleflame1257 · 2024-02-13T17:18:45 1707844725

You can start with Kobold.cpp which should handhold you through the process.

fortran77 · 2024-02-13T17:26:11 1707845171

Trying it now. Doesn't seem censored to me.

dvngnt_ · 2024-02-13T15:53:13 1707839593

it uses Mistral or Llama 2

bugbuddy · 2024-02-13T19:32:36 1707852756

This is amazing and it shows that Nvidia is at least 3 decades ahead of the competitors. Imagine this turning into a powerful agent that can answer everything about your life. It will revolutionize life as we know it. This is why Nvidia stock is green and everything else is red today. I am glad that I went all in on the green team. I wished I could get more leveraged at this point.

dingnuts · 2024-02-13T19:37:23 1707853043

calm down NVIDIA marketing department, 30 years is a long time

gruturo · 2024-02-14T09:05:42 1707901542

3 decades might be how long it takes until this is running locally in your glasses, although we may hit some hard limit in silicon before we get there at all.

But AI models are already running on tablets (not necessarily on Nvidia hardware) and I expect some phone to ship with them within a year (maybe as a stunt, I guess it would be a few years more before this is practical).