I would really like a benchmark purely focused on diagnosis: symptoms and patient history vs. the real diagnosis. Maybe name the model House M.D. 1.0 or something.
The other stuff is good to have, but ultimately a model that focuses on diagnosing medical conditions is going to be the most useful. Look - we aren't going to replace doctors anytime soon, but it is good to have a second opinion from an LLM purely for diagnosis. I would hope it captures patterns that weren't observed before. This is exactly the sort of game that AI can beat a human at - large-scale pattern recognition.
I agree - we should exercise a bit of caution here. There is no way they would release a benchmark that makes their model look bad. But then again, we know their models are among the best for other uses, so it's not a big leap to accept this benchmark.
You can't download Gemini's weights either, so it's not relevant as a comparison against Gemini.
I think the actually-relevant issue here is that until last month there wasn't API access for Grok 3, so no one could test or benchmark it, and you couldn't integrate it into tools that you might want to use it with. They only allowed Grok 2 in their API, and Grok 2 was a pretty bad model.
Yes, OpenAI has a first-mover advantage, and Claude seems close behind as a second player with closed models too, so open weights are not a requirement for success. But in an already crowded market (Grok's prospect), their proposition competes with neither the top-tier closed models nor the maybe less capable but more available, battle-tested open ones that are free to run locally.
I don't think any of the current consumer LLM tools use embeddings for web search. Instead they do it at the text level.
The evidence for this is the CoT summary in ChatGPT - I have seen cases where the LLM uses quoted strings to grep the web.
Embeddings seem good in theory, but in practice it's probably best to ask an LLM to do a deep search instead, giving it instructions like "use synonyms and common typos and grep" - something like the sketch below.
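As a rough illustration, such instructions might look like this (purely hypothetical wording and constant name - not any product's actual prompt):

    # Hypothetical "deep search" steering prompt for an LLM with a search tool.
    # The wording below is made up for illustration.
    DEEP_SEARCH_INSTRUCTIONS = """\
    When searching the web:
    - run several query variants: synonyms, common misspellings, broader and
      narrower phrasings;
    - put must-match terms in quotes so the engine treats them grep-style;
    - merge and deduplicate the results before answering.
    """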
Does anyone know of a live example of a consumer product using embeddings?
My understanding is that modern search engines use embeddings / vector search under the hood.
So even if LLMs aren't directly passing a vector to the search engine, my assumption is that the search engine converts the query to a vector and searches on that.
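As a toy sketch of that under-the-hood step (the word-hashing embed() below captures none of the semantics a real learned embedding model would - it only shows the vectorize-then-nearest-neighbour mechanics):

    import math

    # Toy stand-in for a real embedding model: hash words into a small vector.
    def embed(text: str, dims: int = 64) -> list[float]:
        vec = [0.0] * dims
        for word in text.lower().split():
            vec[hash(word) % dims] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    # Cosine similarity; the vectors are already unit-normalized.
    def cosine(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    docs = [
        "cars manufactured in Brazil",
        "top electric vehicles in Europe",
        "South American automotive industry overview",
    ]
    doc_vecs = [embed(d) for d in docs]

    # "Search": embed the query, then rank documents by similarity (KNN).
    q = embed("car manufacturing South America")
    for doc, vec in sorted(zip(docs, doc_vecs), key=lambda dv: -cosine(q, dv[1])):
        print(round(cosine(q, vec), 3), doc)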
Fair, and maybe the key point here is that it uses embeddings to help with the search results alongside many manual heuristics. I hardly think Google search works by just dumping embeddings, doing KNN, and calling it a day.
I believe they use the LLMs to generate a set of things to search for and then run those through existing search engines, which are totally opaque and use whatever array of techniques SOTA search engines use. They are almost certainly not "grepping" the internet.
yes, that's what i meant - thanks for clarifying. the grepping part is definitely done, at least in spirit, when the CoT includes quotes. if i were searching for the top 10 cars manufactured in South America, for example, the CoT might show:
"Brazil" car manufacture
This forces Brazil to be included in the keywords - at least that's how google works (or used to?).
>Right now, most LLMs with web search grounding are still in Stage 1: they can retrieve content, but their ability to assess quality, trustworthiness, and semantic ranking is still very limited.
Why do you think it is limited? Imagine you show an LLM a link with its details and ask whether it is trustworthy or high quality w.r.t. the query - why couldn't it answer that?
I don't think the limit is in what LLMs can evaluate - given the right context, they're good at assessing quality. The problem is what actually gets retrieved and surfaced in the first place. If the upstream search doesn't rank high-quality or relevant material well, the LLM never sees it. It's not a judgment problem, more of a selection problem.
Not sure I understand - LLMs are pretty good at assessing the quality of search results. If an LLM can bulk-assess a bunch of results, it can get pretty far, probably more efficiently than a human hand-checking every result.
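As a sketch of what that bulk assessment could look like (this uses the OpenAI chat completions API; the model name and prompt wording are illustrative assumptions):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Score a whole batch of search results in one call - much cheaper than
    # a human hand-checking each link.
    def assess_results(query: str, results: list[dict]) -> str:
        listing = "\n".join(
            f"{i + 1}. {r['title']} - {r['snippet']}" for i, r in enumerate(results)
        )
        prompt = (
            f"Query: {query}\n\nSearch results:\n{listing}\n\n"
            "For each result, rate relevance and trustworthiness 1-5, "
            "with a one-line justification."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content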
Can anyone answer this question: are they using a custom home-made web index, or are they using the Bing/Google API?
Also, I'm quite sure they don't use vector embeddings for web search; it's done purely in text space. I think the same holds for all LLM web-search tools. They all seem to work well - maybe we don't need embeddings for RAG and grepping works well enough?
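For contrast with the embedding sketch above, pure text-space ("grep") retrieval is something like this toy scorer - terms and quoted phrases, no vectors anywhere (illustrative only):

    import re

    # Score a document by how many query terms it contains; quoted phrases
    # must match exactly and are weighted more heavily.
    def grep_score(query: str, doc: str) -> int:
        phrases = re.findall(r'"([^"]+)"', query)
        terms = re.sub(r'"[^"]+"', "", query).lower().split()
        text = doc.lower()
        score = sum(term in text for term in terms)
        score += sum(3 for p in phrases if p.lower() in text)
        return score

    docs = [
        "Brazil's car manufacturing sector grew 8% last year.",
        "A guide to European electric vehicles.",
    ]
    query = '"Brazil" car manufacture'
    print(max(docs, key=lambda d: grep_score(query, d)))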
Google has 2.8k public repositories on just their main github account (github.com/google).
Even if they're not training on their closed source internal codebases (which they most certainly are, but fair point that they're probably not releasing the models trained on that data), it definitely seems like they have a lot of Go code in the training data.
I was a bit skeptical at first, but I think this is a good idea. Although I'm not convinced by the max_depth parameter: in real life you rarely know the concrete types of your dependencies if they are loaded at run time. This is partly why we mock dependencies explicitly.
On a side note: I have wondered whether LLMs are particularly good with functional languages. Imagine if your code consisted entirely of pure functions with no side effects: you pass in every parameter required and use no static methods/variables and no OOP concepts like inheritance. I imagine every program can be converted into such a form, the tradeoff being human readability.
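To make that concrete, here's a minimal sketch of the same logic written both ways (illustrative only):

    # Stateful / OOP style: the result depends on hidden mutable state.
    class Account:
        def __init__(self, balance: float):
            self.balance = balance

        def apply_interest(self, rate: float) -> None:
            self.balance *= 1 + rate  # mutates the instance in place

    # Pure style: every input is an explicit parameter, a new value is
    # returned, and there are no side effects to reason about.
    def apply_interest(balance: float, rate: float) -> float:
        return balance * (1 + rate)

    assert apply_interest(100.0, 0.05) == 105.0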
Am I the only one who prefers a more serious approach to prefix caching? It is a powerful tool, and having a dedicated endpoint with TTLs controllable via parameters seems like the best approach.
On the other hand, the first two approaches, from OpenAI and Anthropic, are frankly bad. Automatically detecting what should be prefix-cached? Yuck! And I can't even set my own TTLs in the Anthropic API (feel free to correct me - this is what a quick search suggested).
I meant that this is the only way to control prefix caching. I consider this a serious feature - if I were building an application around prefix caching I would not consider OpenAI at all, since I can't control what gets cached or for how long.
Wouldn't you want to give more power to the developer? Prefix caching seems like an important enough concept to leak to the end user.
Gemini's approach to prefix caching requires me to pay per hour for keeping the cache populated. I have to do pretty sophisticated price modeling and load prediction to use that effectively.
Anthropic requires me to add explicit cache breakpoints to my prompts, which charge for writes to the cache. If I get that wrong, it can be more expensive than leaving caching off entirely.
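(For anyone who hasn't used it, the breakpoints look roughly like this - a sketch based on Anthropic's documented cache_control parameter; the model name and file are illustrative:)

    import anthropic

    client = anthropic.Anthropic()
    big_context = open("enterprise_docs.txt").read()  # large, stable prefix

    # Everything up to the cache_control block is written to the cache on the
    # first call (billed at a premium) and read back cheaply on later calls.
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": big_context,
                "cache_control": {"type": "ephemeral"},  # explicit breakpoint
            }
        ],
        messages=[{"role": "user", "content": "Summarize section 3."}],
    )
    print(response.content[0].text)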
With OpenAI I don't have to do any planning or optimistic guessing at all: if my app gets a spike in traffic the caching kicks in automatically and saves me money.
that's fair - i have some app ideas for which i would like control over prefix caching. for example, you may want to cache entire chunks of enterprise data that don't change too often. a whole RAG application could be built over this concept - paying per hour for caching is sensible there.
>With OpenAI I don't have to do any planning or optimistic guessing at all: if my app gets a spike in traffic the caching kicks in automatically and saves me money.
i think these are completely different use cases. isn't this just like having a redis sitting in front of the LLM provider?
fundamentally i feel like prompt caching is something i want to control, not have happen automatically; i want to use the information i have about my (future) access patterns to save costs. for instance i might prompt-cache a whole PDF and ask multiple questions against it. if i choose to cache the PDF, i save a non-trivial number of tokens processed.
how can OpenAI's automatic approach help me here?