There's so much good stuff here, and I agree it's an important message for you to get across.

I think trying to convey these ideas through a quantitative benchmark result (particularly a benchmark which has a clear common interpretation that you're essentially redefining) risks 1) misleading readers, and 2) failing to convey the rich and detailed analysis you've included here in your HN comment.

I'd suggest you restrict your quantitative PubMedQA analysis to previously published numbers for other models (so you're not in the position of defending choices that might cripple those models), or to a very straightforward log-probs analysis if no outside numbers are available (making it clear which numbers you produced and which you sourced externally). Then, separately, explain that many of the small models with high benchmark scores exhibit poor instruction following (which won't surprise many readers, since these models aren't necessarily tuned or evaluated for that), and make the point that some of them are so poor at instruction following that they're very hard to deploy in contexts that require it. You could even demonstrate that they're only able to follow an instruction to "conclude answers with 'Final Answer: [ABCDE]'" on x% of questions, given a standard prompt you've created and published. In other words, if it's clear that the problem is instruction following, analyze that.
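
If you go the log-probs route, the scoring can be as simple as comparing the summed log-probability of each answer option conditioned on the question. Here's a minimal sketch, assuming a Hugging Face causal LM and the PubMedQA yes/no/maybe label set; the model name is a placeholder, and it assumes the prompt's tokenization is a prefix of the prompt+option tokenization:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  MODEL_NAME = "your-org/your-model"  # placeholder: the checkpoint under evaluation

  tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
  model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
  model.eval()

  def option_logprob(prompt: str, option: str) -> float:
      """Summed log-prob of `option` tokens conditioned on `prompt`."""
      prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
      full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
      with torch.no_grad():
          logits = model(full_ids).logits
      # log-prob of each token given the preceding context
      log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
      token_lp = log_probs.gather(2, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
      # score only the option's tokens (those after the prompt)
      return token_lp[0, prompt_len - 1:].sum().item()

  def predict(prompt: str, options=(" yes", " no", " maybe")) -> str:
      """Pick the PubMedQA label with the highest conditional log-prob."""
      return max(options, key=lambda o: option_logprob(prompt, o)).strip()

And the instruction-following check against your published standard prompt can be a plain pattern match over model completions; the regex and helper names below are illustrative:

  import re

  # Matches a completion that ends with the instructed format, e.g. "Final Answer: C"
  FINAL_ANSWER_RE = re.compile(r"Final Answer:\s*\[?([A-E])\]?\s*$", re.IGNORECASE)

  def follows_format(completion: str) -> bool:
      """True if the completion concludes with 'Final Answer: <A-E>' as instructed."""
      return bool(FINAL_ANSWER_RE.search(completion.strip()))

  def format_following_rate(completions: list[str]) -> float:
      """Fraction of completions that obey the output-format instruction."""
      return sum(follows_format(c) for c in completions) / max(len(completions), 1)

Reporting that rate per model makes the instruction-following failure concrete without entangling it with the accuracy numbers.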

(Not all abstraction pipelines leveraging an LLM need it to exhibit instruction following, and in your own case, I'm not sure you can claim that your model follows instructions well on the basis of its PubMedQA or abstraction performance, since you've fine-tuned on (prompt, answer) pairs in both domains. You'd need a different baseline for comparison to really explore this claim.)

Then I'd suggest creating a detailed table of wrong or surprising things that frontier models don't understand about healthcare data, but which your model does understand. Categorize these failure modes, show examples in the table, and explain them in narrative form, much as you've done here.
