simianwords's comments

I would really like a benchmark purely focused on diagnosis: symptoms and patient history vs. the real diagnosis. Maybe name it House M.D. 1.0 or something.

The other stuff is good to have, but ultimately a model that focuses on diagnosing medical conditions is going to be the most useful. Look - we aren't going to replace doctors anytime soon, but it is good to have a second opinion from an LLM purely for diagnosis. I would hope it captures patterns that weren't observed before. This is exactly the sort of game that AI can beat a human at - large-scale pattern recognition.


I agree - we should exercise a bit of caution here. There is no way they would release a benchmark that makes their model look bad. But then again, we know their models are among the best for other uses, so it's not a big leap to accept this benchmark.

how is that relevant here?

it helps explain why there are fewer people talking about them than about Gemini or Llama?

fewer people using them.


You can't download Gemini's weights either, so it's not relevant as a comparison against Gemini.

I think the actually-relevant issue here is that until last month there wasn't API access for Grok 3, so no one could test or benchmark it, and you couldn't integrate it into tools that you might want to use it with. They only allowed Grok 2 in their API, and Grok 2 was a pretty bad model.


lol sorry, mixed them up with Gemma 3, which feels like the open, lesser cousin of the Gemini 2.5/2.0 models

I can guarantee you none of my friends (not in tech) use “downloading weights” as an input to select an LLM application.

isn't chatgpt the most used or most popular model?

Yes, OpenAI has a first-mover advantage, and Claude seems to be a close second with closed models too, so open weights aren't a requirement for success. But in an already crowded market (Grok's prospect), their proposition competes neither with the top-tier closed models nor with the maybe less capable but more available, battle-tested open ones you can freely run locally.

It's not.

Also, only one out of the ten models benchmarked has open weights, so I'm not sure what GP is arguing for.


> in terms of how much other models (gemini, llama, etc) are in the news.

not talking about TFA or benchmarks but the news coverage/user sentiment ...


I don't think any of the current consumer LLM tools use embeddings for web search. Instead they do it at the text level.

The evidence for this is the CoT summary in ChatGPT - I have seen cases where the LLM uses quoted terms to grep the web.

Embeddings seem good in theory, but in practice it's probably best to ask an LLM to do a deep search instead, giving it instructions like "use synonyms and common typos and grep".
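
To make that concrete, here's a rough sketch of the kind of thing I mean, purely text-level with no embeddings. The prompt wording, the model name, and the search_web() stub are all my own placeholders, not anything a provider actually documents:

    # Rough sketch: text-level "deep search" driven by an LLM, no embeddings involved.
    from openai import OpenAI

    client = OpenAI()

    def search_web(query: str) -> list[str]:
        # Placeholder for whatever search backend you have (Bing API, Google CSE, ...).
        return []

    def deep_search(question: str) -> list[str]:
        # Ask the model for keyword queries, including synonyms and common typos.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any chat-capable model works here
            messages=[{
                "role": "user",
                "content": "Generate 5 short web search queries for the question below. "
                           "Use synonyms and common typos; put must-have terms in quotes.\n\n"
                           f"Question: {question}",
            }],
        )
        queries = [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]
        results = []
        for q in queries:
            results.extend(search_web(q))
        return results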

Does anyone know of a live example of a consumer product using embeddings?


My understanding is that modern search engines are using embeddings / vector search under the hood.

So even if LLM's aren't directly passing a vector to the search engine, my assumption is that the search engine is converting to a vector and searching.

"You interact with embeddings every time you complete a Google Search" from https://cloud.google.com/vertex-ai/generative-ai/docs/embedd...


Fair, and maybe the key point here is that it uses embeddings to help with the search results, alongside many manual heuristics. I hardly think Google search works just by dumping embeddings, doing KNNs, and calling it a day.
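
For contrast, the "dump embeddings and do KNNs" version would look roughly like this; embed() is a stand-in for a real embedding model, and real search stacks layer plenty of ranking heuristics on top:

    # Minimal sketch of embedding retrieval: embed everything, rank by cosine similarity.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Stand-in for a real embedding model; returns a fixed-size vector.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(64)

    def knn_search(query: str, documents: list[str], k: int = 5) -> list[str]:
        q = embed(query)
        doc_vecs = np.stack([embed(d) for d in documents])
        # Cosine similarity between the query and every document, take the top k.
        sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        top = np.argsort(-sims)[:k]
        return [documents[i] for i in top]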

I believe they use the LLMs to generate a set of things to search for and then run those through existing search engines, which are totally opaque and use whatever array of techniques SOTA search engines use. They are almost certainly not "grepping" the internet.

Yes, that's what I meant, thanks for clarifying. The grepping part is definitely done, at least in spirit, where the CoT includes quotes. If I were searching for the top 10 cars manufactured in South America, for example, the CoT might show:

"Brazil" car manufacture

This forces "Brazil" to be included in the keywords; at least that's how Google works (or used to).


Why? There are many tasks where AI beats humans. Humans are also prone to bias, fatigue, etc.

Although I would still agree that there would need to be a mechanism for escalation to a human.


Because insurance companies aren't in the business of giving you money. They're in the business of trying not to.

how does AI change this part?

>Right now, most LLMs with web search grounding are still in Stage 1: they can retrieve content, but their ability to assess quality, trustworthiness, and semantic ranking is still very limited.

Why do you think it is limited? Imagine you show an LLM a link with its details and ask whether it is trustworthy or high quality w.r.t. the query - why can't it answer that?


I don't think the limit is in what LLMs can evaluate - given the right context, they're good at assessing quality. The problem is what actually gets retrieved and surfaced in the first place. If the upstream search doesn't rank high-quality or relevant material well, the LLM never sees it. It's not a judgment problem, more of a selection problem.

What I mean is that more engineering work is needed upstream to give the LLM well-processed search results.

Not sure I understand -- LLMs are pretty good at assessing the quality of search results. If an LLM can bulk-assess a bunch of results it can get pretty far, probably more efficiently than a human hand-checking all the results.

Can anyone answer this question: are they using a custom, homemade web index, or are they using the Bing/Google APIs?

Also, I'm quite sure they don't use vector embeddings for web search; it's done purely in text space. I think the same holds for all LLM web search tools. They all seem to work well -- maybe we don't need embeddings for RAG and grepping works well enough?
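
By "purely in text space" I mean something like the toy ranking below - keyword matching over the retrieved documents, no vectors anywhere. A real system would use BM25 or a proper search index, but the principle is the same:

    # Toy text-space retrieval: score documents by how often they contain the query keywords.
    def grep_rank(query: str, documents: list[str], k: int = 5) -> list[str]:
        keywords = [w.lower() for w in query.split() if len(w) > 2]

        def score(doc: str) -> int:
            text = doc.lower()
            return sum(text.count(w) for w in keywords)

        return sorted(documents, key=score, reverse=True)[:k]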


How would they train it on google code without revealing internal IP?

Google has 2.8k public repositories on just their main github account (github.com/google).

Even if they're not training on their closed source internal codebases (which they most certainly are, but fair point that they're probably not releasing the models trained on that data), it definitely seems like they have a lot of Go code in the training data.


but so do their competitors?

I was a bit skeptical at first, but I think this is a good idea. Although I'm not convinced by the max_depth parameter: in real life you rarely know what type your dependencies are if they are loaded at runtime. This is kind of why we explicitly mock our dependencies.

On a side note: I have wondered whether LLMs are particularly good with functional languages. Imagine if your code consisted entirely of pure functions with no side effects: you pass in every parameter required and use no static methods/variables and no OOP concepts like inheritance. I imagine every program can be converted into such a form, the tradeoff being human readability.
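
A toy illustration of the conversion I'm imagining - the same logic written with hidden mutable state, then as a pure function that takes everything it needs as parameters:

    # Stateful / OOP style: behaviour depends on hidden instance state.
    class Cart:
        def __init__(self):
            self.items = []
            self.discount = 0.0

        def add(self, price: float) -> None:
            self.items.append(price)

        def total(self) -> float:
            return sum(self.items) * (1 - self.discount)

    # Pure-function style: output depends only on the inputs, nothing is mutated.
    def cart_total(items: tuple[float, ...], discount: float) -> float:
        return sum(items) * (1 - discount)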


Am I the only one who prefers a more serious approach to prefix caching? It is a powerful tool, and having an endpoint dedicated to it, with TTLs controlled via parameters, seems like the best approach.
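
Something like the sketch below is what I have in mind. Every endpoint and parameter name here is made up for illustration - it's the shape of the API I care about, not any provider's actual interface:

    # Hypothetical explicit prefix-cache API; all names below are invented for illustration.
    import requests

    API = "https://api.example-llm-provider.com/v1"

    # Create a cache entry for a large, stable prefix and control its lifetime directly.
    cache = requests.post(f"{API}/prompt_caches", json={
        "model": "some-model",
        "content": open("system_prompt_and_docs.txt").read(),
        "ttl_seconds": 3600,  # I decide how long it lives, not the provider
    }).json()

    # Later requests reference the cached prefix explicitly instead of resending it.
    answer = requests.post(f"{API}/chat", json={
        "model": "some-model",
        "cache_id": cache["id"],
        "messages": [{"role": "user", "content": "First question about the docs"}],
    }).json()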

On the other hand, the first two approaches, from OpenAI and Anthropic, are frankly bad. Automatically detecting what should be prefix cached? Yuck! And I can't even set my own TTLs in the Anthropic API (feel free to correct me - that's what a quick search suggested).

Serious features require serious approaches.


> Automatically detecting what should be prefix cached? Yuck!

Why don't you like that? I absolutely love it.


I meant that this is the only way to control prefix caching. I consider this a serious feature - if I were building an application that uses prefix caching, I would not consider OpenAI at all: I can't control what gets cached or for how long.

Wouldn't you want to give more power to the developer? Prefix caching seems like an important enough concept to leak to the end user.


Gemini's approach to prefix caching requires me to pay per hour for keeping the cache populated. I have to do pretty sophisticated price modeling and load prediction to use that effectively.

Anthropic require me to add explicit cache breakpoints to my prompts, which charge for writes to the cache. If I get that wrong it can be more expensive than if I left caching turned off entirely.
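
If I remember the shape of it correctly, the breakpoint is a cache_control marker on a content block - roughly like this, with the details (model name, exact fields) to be checked against their docs:

    # Approximate shape of Anthropic's explicit cache breakpoints (details may be off).
    import anthropic

    client = anthropic.Anthropic()
    big_context = open("reference_docs.txt").read()

    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumption: any recent Claude model
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": big_context,
            # The breakpoint: everything up to this block gets written to the cache.
            # Cache writes are billed at a premium, which is where it can backfire.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": "Summarise the key points."}],
    )
    print(message.content[0].text)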

With OpenAI I don't have to do any planning or optimistic guessing at all: if my app gets a spike in traffic the caching kicks in automatically and saves me money.


That's fair - I have some app ideas for which I would like control over prefix caching. For example, you may want to prompt-cache entire chunks of enterprise data that don't change too often; the whole RAG application would be built on this concept, and paying per hour for caching is sensible there.

>With OpenAI I don't have to do any planning or optimistic guessing at all: if my app gets a spike in traffic the caching kicks in automatically and saves me money.

I think these are completely different use cases. How is this any different from just having a Redis cache sitting in front of the LLM provider?

Fundamentally, I feel like prompt caching is something I want to control rather than have happen automatically; I want to use the information I have about my (future) access patterns to save costs. For instance, I might prompt-cache a whole PDF and ask multiple questions against it. If I choose to prompt-cache the PDF, I can save a non-trivial number of processed tokens. How does OpenAI's automatic approach help me here?
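
To put rough numbers on the PDF example (every figure below is made up purely for illustration):

    # Back-of-envelope: cache one large PDF prefix, then ask several questions against it.
    # All numbers below are invented for illustration.
    pdf_tokens = 50_000      # tokens in the cached PDF prefix
    questions = 20           # questions asked against the same PDF
    price_input = 1.00       # $ per 1M uncached input tokens (made up)
    price_cached = 0.25      # $ per 1M cached input tokens (made up)

    without_cache = questions * pdf_tokens * price_input / 1e6
    with_cache = (pdf_tokens * price_input                      # first request pays full price
                  + (questions - 1) * pdf_tokens * price_cached) / 1e6

    print(f"without cache: ${without_cache:.2f}, with cache: ${with_cache:.2f}")
    # With these made-up prices: $1.00 vs roughly $0.29 - the saving comes from
    # knowing up front that the PDF prefix will be reused.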

