I can't understand your methods without example prompts or code, so it's hard for me to interpret the data in figure 6. It will be important to document the methodology carefully to avoid concerns that your "text response" methodology is unfairly punishing other models.
In any case, since the methodology that Anthropic applied is documented and straightforward, it would be possible to do an apples to apples comparison with your model.
(I'm also very curious to know how 3.5 Sonnet performs.)
Is your text methodology based on CoT (like the "PubMedQA training dataset enriched with CoT" you trained on) or a forced single token completion like Anthropic used in their evaluation? In the latter case, I'm not sure how "text responses" differ from log probabilities at Temperature T=0 (i.e., isn't the most likely token always going to be the text response?)
A few thoughts -- with some color on 'the why,' because we'd love to get your input on how best to get the story and data across. Any thoughts you have would be great.
So, on method: we did NOT force single-token responses. Our goal was to answer "if we use [model x] to serve an app for [this task], how accurate would it be?" -- so we wanted to get as close as possible to pasting the prompt in directly and just grading whether the output was correct. In some cases, that works directly; in others, we had to lightly adjust the system prompt (e.g. "Answer ONLY yes, no, or maybe"); and in some cases, it took significant effort (e.g. to parse stubbornly verbose responses).
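To make the shape of that concrete, here's a rough sketch of the grading loop we mean (illustrative only -- `query_model` is a stand-in for whatever client calls each model, not our actual harness):

```python
import re

# System prompt used for models that need a nudge toward short answers.
SYSTEM_PROMPT = "Answer ONLY yes, no, or maybe."

def grade_pubmedqa_item(context: str, question: str, gold: str, query_model) -> bool:
    """Paste the item in, generate a text response, and grade it."""
    prompt = f"{context}\n\nQuestion: {question}"
    reply = query_model(system=SYSTEM_PROMPT, user=prompt, temperature=0)
    # Light normalization; the stubbornly verbose models need far more parsing than this.
    match = re.search(r"\b(yes|no|maybe)\b", reply.lower())
    return bool(match) and match.group(1) == gold.lower()
```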
For models like GPT-4o, Llama3-70B, and Sonnet that have great instruction-following behavior, this works in a straightforward way (and is something we should be able to just add in an appendix). We were surprised by how hard it was for a fair number of the domain-specific models with great log-prob benchmark results on the leaderboard -- ultimately there's a huge gap between numbers saying 'this is a great medical AI model!' and the ability to use it in production -- and to us that was an important part of the story.
For the set of models where a ton of engineering was required to get workable responses, sharing code is the best we can do. I worry a little about rabbit-holing on the details of how we could improve tuning or output parsing, because if a model requires this much bespoke effort to work on a task it was built to perform (at least in log-prob terms), the point still stands that you couldn't be confident using it across different types of tasks.
Stepping back, for us this method supported our experience that benchmark performance is pretty disconnected from how a model does with records. This behavior was a big piece of that puzzle that we wanted to show. There's some nuance, though, in how we get this across without getting tied up in the details of and options for benchmark hacking.
To your question about the difference between our results and log-prob at T=0: behaviorally, think of a model like Grok that is tuned to be funny -- perhaps it heavily downweights 'yes' or 'no' on a task like this in favor of saying something entertaining. It may have excellent log-probability benchmark performance, but it would be a much worse choice to power your app than the benchmark scores suggest. We wanted our accuracy numbers to reflect that production reality.
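In code terms (again just a sketch -- `score_completion` and `query_model` are stand-ins), the difference looks like this: restricted log-prob scoring always crowns a winner among the allowed labels, while the text response may never contain one:

```python
def logprob_answer(prompt: str, score_completion) -> str:
    # Restricted-choice scoring: argmax over the allowed labels only. Even if
    # the model would much rather say something else, one of these always "wins".
    options = ["yes", "no", "maybe"]
    return max(options, key=lambda opt: score_completion(prompt, opt))

def text_answer(prompt: str, query_model):
    # Free-text greedy generation: if the model's preferred continuation is a
    # joke, a hedge, or a paragraph, there may be no usable label at all.
    reply = query_model(user=prompt, temperature=0).strip().lower()
    return reply if reply in {"yes", "no", "maybe"} else None
```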
And to your comment about using the phrase state-of-the-art: we _didn't_ want to say "you can get the best model for PubMedQA by doing xyz like we did"; instead, we wanted to say "even if you fully invest in getting great benchmark performance, it doesn't do much for your ability to work with records." So for us, state-of-the-art is shorthand for "we appropriately exhausted what one can do to tune benchmark performance, and here's a top-line number that shows it, so we can stand by the relationship we see between benchmarks and performance on records."
Finally, a note on something I was seeing yesterday while pawing through some structuring and abstraction tasks that GPT-4o got wrong but LLMD did well. It really is amazing how many different pockets of necessary domain/contextual bias the records are teaching the model. One obvious example: GPT-4o is undertrained at interpreting whether "lab" means "lab test" or "laboratory facility." LLMD has picked up on the association that a task asking for a reference range is referring to a lab test, and that behavior is coming from pre-training and instruction fine-tuning (I suspect more the latter). In contrast, if we don't tune the prompt to be explicit, GPT-4o will start dropping street names into the lab-name outputs, etc.
To me, the implication is that you could take a whack-a-mole approach, load the prompt with ultra-precise instructions, and improve performance on records. But based on what we saw in the paper, that likely _only_ works on the big models like GPT-4o and Sonnet, not on the domain models that are so hard to coerce into giving reasonable responses. And there's a long tail of such things that would drown you, so you really have no choice but to train on records data. Another tiny example we saw a few weeks ago, with a huge impact on app-level performance: the unit for the MCV test is so often wrong in records, but it can safely be assumed to be fL in most cases. We'd need to add tons of rules like that if we didn't have records to train on.
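As an illustration (a sketch, not code from our pipeline), this is the kind of one-off rule you'd end up hard-coding if you couldn't train on records:

```python
def normalize_unit(test_name: str, unit):
    # One of a long tail of such rules: MCV results in records frequently carry
    # a wrong or missing unit, but fL is almost always the right reading.
    if test_name.strip().lower() == "mcv" and (not unit or unit.lower() != "fl"):
        return "fL"
    return unit
```

Multiply that by every test, abbreviation, and formatting quirk in the wild, and prompt-side whack-a-mole stops being viable.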
tl;dr: you need to train on records. If you can't, and you have a very well-defined purpose/input space, use a big model like GPT-4o and load up the prompt to be very precise -- that should work well. Pursuing benchmark performance doesn't get you much practically. If you need to work in an unconstrained environment, you have to train on records to pick up all those small biases that matter.
There's so much good stuff here, and I agree it's an important message for you to get across.
I think trying to convey these ideas through a quantitative benchmark result (particularly a benchmark which has a clear common interpretation that you're essentially redefining) risks 1) misleading readers, and 2) failing to convey the rich and detailed analysis you've included here in your HN comment.
I'd suggest you restrict your quantitative PubMedQA analysis to reporting previously published numbers for other models (so you're not in the position of having to defend choices that might cripple other models), or to a very straightforward log-probs analysis if no outside numbers are available (making it clear which numbers you've produced vs. sourced externally). Then separately explain that many of the small models with high benchmark scores exhibit poor instruction-following capabilities (which won't surprise many readers, since these models aren't necessarily tuned or evaluated for that), and you can make the point that some of them are so poor at instruction following that they're very hard to deploy in contexts that require it; you could even demonstrate that they're only able to follow an instruction to "conclude answers with 'Final Answer: [ABCDE]'" on x% of questions, given a standard prompt that you've created and published. In other words, if it's clear that the problem is instruction following, analyze that.
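Something as simple as this would make that concrete (a sketch; the exact instruction string and regex are just one choice):

```python
import re

# Fraction of responses that comply with "conclude answers with 'Final Answer: [ABCDE]'".
FINAL_ANSWER_RE = re.compile(r"Final Answer:\s*\[?([A-E])\]?\s*$", re.IGNORECASE)

def instruction_following_rate(responses):
    followed = sum(1 for r in responses if FINAL_ANSWER_RE.search(r.strip()))
    return followed / len(responses) if responses else 0.0
```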
(Not all abstraction pipelines leveraging an LLM need it to exhibit instruction following, and in your own case, I'm not sure you can claim that your model follows instructions well on the basis of its PubMedQA or abstraction performance, since you've fine-tuned on (prompt, answer) pairs in both domains. You'd need a different baseline for comparison to really explore this claim.)
Then I'd suggest creating a detailed table of wrong/surprising stuff that frontier models don't understand about healthcare data, but which your model does understand. Categorize them, show examples in the table, and explain them in narrative much like you've done here.