I think that's very true -- and it felt like one of the real opportunities we had in the paper: that we have real production tasks whose results we need to stand behind, and so we can try to explain and show examples of what matters in that context.
One of the sentences near the end that speaks to this is "...[this shows] a case where the type of medical knowledge reflected in common benchmarks is little help getting basic, fundamental questions about a patient right." Point being that you can train on every textbook under the sun, but if you can't say which hospital a record came from, or which date a visit happened as the patient thinks of it, you're toast -- and those seemingly throwaway questions are way harder to get right than people realize. NER can find the dates in a record no problem, but intuitively mapping out how dates are printed in EHR software and how they reflect the workflow of an institution is the critical step needed to pick the right one as the visit date -- that's a whole new world of knowledge that the LLM needs to know, which is not characterized when just comparing results on medical QA.
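To make that concrete, here's a toy sketch -- not from the paper, and the page contents, labels, and heuristics are all invented -- of the gap between "find the dates" and "pick the visit date":

    import re

    # Toy page from a printed EHR chart; the contents and labels are made up.
    # A single page can carry a print date, a birth date, lab collection dates,
    # and the actual encounter date, often with vendor-specific labeling.
    page = """
    Printed: 03/15/2024 14:02            Page 1 of 12
    Patient: DOE, JANE   DOB: 07/22/1961
    Date of Service: 01/09/2024
    CBC collected 01/08/2024 06:30
    """

    # What a generic NER / regex pass gives you: every date, no ranking.
    all_dates = re.findall(r"\d{2}/\d{2}/\d{4}", page)
    # ['03/15/2024', '07/22/1961', '01/09/2024', '01/08/2024']

    # The part that needs vendor/institution knowledge. These label-to-meaning
    # mappings are invented here; in practice they differ per EHR vendor, per
    # institution, sometimes per report template.
    def guess_visit_date(text):
        for label in ("Date of Service", "Encounter Date", "Visit Date", "Admit Date"):
            m = re.search(rf"{label}:\s*(\d{{2}}/\d{{2}}/\d{{4}})", text)
            if m:
                return m.group(1)
        return None  # punt to human review rather than guess

    print(guess_visit_date(page))  # 01/09/2024 -- not the print date, not the lab date

The findall line is the part that benchmarks reward; the label list is the part that decides whether the patient actually sees the right visit date.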
Giving examples of the crazy things we have to contend with is something I can (and will!) gladly talk about for hours...
One other interesting comment in there -- the note about how people think the worst records to deal with are the old handwritten notes. But actually, content-wise they tend to be very to-the-point. Clean printouts from EHR software have so much extra junk and redundancy that you end up with much lower SNR. Even just structuring a single EHR record can require you to look across many pages and do tons of filtering that doesn't come into play on the old handwritten notes (once you get past OCR).
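As a rough, entirely made-up illustration of the kind of cross-page filtering I mean, before you can even start structuring the content:

    import re

    # Invented three-page progress note, just to show the shape of the problem:
    # every page repeats the facility banner, pagination, and the med list.
    pages = [
        "MERCY GENERAL HOSPITAL\nProgress Note   Page 1 of 3\nCC: chest pain x 2 days\nMeds: lisinopril 10mg, metformin 500mg",
        "MERCY GENERAL HOSPITAL\nProgress Note   Page 2 of 3\nMeds: lisinopril 10mg, metformin 500mg\nA/P: r/o ACS, serial troponins",
        "MERCY GENERAL HOSPITAL\nProgress Note   Page 3 of 3\nMeds: lisinopril 10mg, metformin 500mg\nPlan reviewed with patient",
    ]

    seen = set()
    content = []
    for page in pages:
        for line in page.splitlines():
            # Drop per-page boilerplate: facility banner, "Page X of Y" lines.
            if line.isupper() or re.search(r"Page \d+ of \d+", line):
                continue
            # Drop lines repeated verbatim across pages (the med list, etc.).
            if line in seen:
                continue
            seen.add(line)
            content.append(line)

    print("\n".join(content))
    # CC: chest pain x 2 days
    # Meds: lisinopril 10mg, metformin 500mg
    # A/P: r/o ACS, serial troponins
    # Plan reviewed with patient

None of that filtering exists for a three-line handwritten note.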
Long way of saying: I feel for today's clinicians. EHRs were supposed to solve all problems, but they've also made things harder in a lot of ways.
Have you seen/heard of Abridge[0]? Long story short their secret sauce comes in two main forms:
1. Accurate speech rec, diarization, etc to record a clinician-patient encounter. No notes, no scribes, no "physician staring at Epic when they should be looking at and talking to you".
2. Parsing of transcripts to correctly and accurately populate the patient EHR record - including various structured fields, etc.
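Purely as a toy -- the drug list, regex, and field names below are all made up, and definitely not their actual pipeline -- "populating structured fields" from a transcript means something like:

    import re

    utterance = "Let's increase your lisinopril to 20 milligrams once daily."

    # Minimal sketch of going from a transcript line to something structured
    # enough to land in an EHR field.
    m = re.search(r"(lisinopril|metformin|atorvastatin)\s+to\s+(\d+)\s*(?:milligrams|mg)",
                  utterance, re.I)
    if m:
        order = {
            "medication": m.group(1).lower(),
            "dose_mg": int(m.group(2)),
            "frequency": "daily" if "daily" in utterance.lower() else None,
        }
        print(order)
        # {'medication': 'lisinopril', 'dose_mg': 20, 'frequency': 'daily'}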
Needless to say you're in this space so I don't have to tell you - every Epic/Cerner install is basically a snowflake so there's a lot going on here, especially at scale.