> “ Question: Write a detailed radiology note based on the chest X-ray.
Gold Answer: AP upright and lateral views of the chest were provided. Left chest wall pacer pack is again seen with leads extending into the right heart. ”
The bit about a “wall pacer pack is again seen…” leads me to believe this was based on another doctor’s note about a similar-looking X-ray, which was probably paired with other information like another scan at the time. That would be problematic imo.
The Gold Answer is not the output of the model but the expected answer in the benchmark. Probably the benchmark contains multiple consecutive images of the same patient.
It’s problematic because the LLM is describing another person’s scan and not the one presented to it. It should at least present the other scan as part of its workings, along with the percentage difference between the two. Finding a similar-looking scan is very useful, no doubt, but if the result is hallucinated then it is far less so. Dangerous, even. There is no confidence percentage, and there should be.
I've been sorta following together.ai for a while. Cool company. Is this available to be used by anyone atm? Could I potentially use the model to look at my own chest X-rays (I've had a lot)?
I have been testing out LLMs with the together.ai API, but I can't figure out how to use the multimodal models with it. I don't see any in their model list.
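For what it's worth, this is the shape of request I'd expect to work if a multimodal model were actually exposed, since their chat endpoint is OpenAI-compatible. Untested sketch; the model name is a placeholder and I'm assuming image_url content parts are accepted:

```python
# Untested sketch, not an official example: Together's chat endpoint is
# OpenAI-compatible, so I'd expect a vision model (if one is exposed) to
# accept OpenAI-style image_url content parts. The model name below is a
# placeholder.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="YOUR_TOGETHER_API_KEY",
)

# Encode a local image as a data URL so it can be passed inline.
with open("chest_xray.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="SOME-MULTIMODAL-MODEL",  # placeholder; whatever vision model they list
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one or two sentences."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

If that request gets rejected, it's probably a sign the model in question doesn't take images at all.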
Is there a demo or API to test the model? There are so many vision-language models these days that it's hard to say which one is better; in many cases they also use different benchmarks.
Both the ‘gold’ answer and the model reference a PA and an AP view respectively, as well as a lateral chest radiograph. The picture only contains a lateral radiograph, though.
I don't have exact domain knowledge but I'm fairly certain this type of tech has already been employed to do some of the heavy lifting for radiologists reviewing imaging results.
If image generation gets to be near perfect, it might have a larger impact on communication than GPT does; no paragraph beats a good diagram, but drawing is always hard.
I can't speak for others obviously, but this sort of caption is nauseous:
> In the heart of a vibrant skatepark, a skateboarder is caught in a moment of pure exhilaration. The skateboarder, dressed in a black t-shirt adorned with a yellow graphic and black pants, is suspended in mid-air, performing an impressive trick on a concrete ramp. The skateboarder's arms are outstretched, adding balance to the daring stunt. The skatepark itself is a concrete playground, with the skateboarder's ramp being the main focus. In the background, palm trees sway gently, adding a touch of nature to the urban setting. A few spectators can be seen in the distance, their attention riveted on the airborne skateboarder. The image captures not just a moment, but a story of skill, courage, and the joy of skateboarding.
This seems a lot more like a puff piece from a local publisher trying to fill space, or a description of a stock photo for an advertiser, than an accurate description from one human to another.
It's clearly the edge of a skatepark. Not "the heart."
etc. etc... others have gone through it extensively so I won't do that again, but it's full of gratuitous added content that does not match what's in the picture at all. It seems to be aping some writing style, not going for accuracy.
What's more remarkable to me is that the authors of this do not seem to notice.
It's bizarre that you would create a project using an approach and then when assessing that project prior to publication, you would just glance at the paragraph without even reading it, and say "looks great" and move on. Details matter. This is absolute crap and we need researchers who can discern crap, not just accept anything.
Further down in the post they ask why an image showing a dog as the Mona Lisa is funny. It's actually not funny. It's an old trope, and, as such, super dull. They should realize it's funny only to a subset of people, but they don't seem to even realize that much. This team needs to get out more.
Came to say the same. It might be the task "describe the picture" that puts it into that mode; however, I still hope that no human being would really write such tosh.
I really am very pro ai describing images, but it's the editorializations like "The image captures not just a moment, but a story of skill, courage, and the joy of skateboarding." that struck me as—idk, quite odd and uncanny-valley like.
It's not only nauseous to read but it's also too speculative, e.g. how do you know the trick is impressive? How do you know the skateboarder is exhilarated?
Yeah, this is my number one complaint with all recent open source vision models, and it seems like it is only getting worse. It's verbose to the point of parody, making it extremely difficult to evaluate what it can actually _see_, and what it's just dumbly markov-chaining based on previous text tokens.
In GPT4V, you can prompt around this if you know about it (a rough sketch of what I mean follows this comment), but none of the people collecting datasets for open models appear to know or care to apply that, and so we just get this default GPT4V contamination everywhere.
The only vision model I enjoy is Google Gemini, simply because it will give you a no-nonsense caption. Of course it still hallucinates things that are not there, but getting a color or object wrong is orders of magnitude less bad than having 3 sentences that have nothing to do with the image.
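Roughly what I mean by prompting around it, as an untested sketch (the model name and image URL are placeholders, and I'm assuming the standard OpenAI-style vision request format):

```python
# Untested sketch of "prompting around it": pin the model to terse,
# observable facts. Model name and image URL are placeholders; I'm assuming
# the standard OpenAI-style vision request format.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[
        {
            "role": "system",
            "content": (
                "Describe only what is literally visible in the image, in at "
                "most two plain sentences. No mood, no story, no speculation "
                "about feelings or intent."
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Caption this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/skatepark.jpg"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)
```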
That's the price of getting an image that accurately represents what you had in mind. Otherwise you could just prompt it with "skateboarder in a skatepark".
"vibrant" could apply to the activity, not the physical structure. There's 4 other people in the back of the photo - if you assumed it was a tiny slice of the park, you could say it was vibrant.
(to me it doesn't look particularly vibrant re: activity but this is just one small corner and I will allow the ML some leeway in its floridity.)
The paragraph said it is the heart of the park, not a "tiny slice."
If you're going to praise the paragraph, at least choose a word that's defensible. Like ok, it got "the" right.
Surprising that you would defend that particular word. Far from being vibrant, the place looks dead, frankly. You can see it in the bored faces of the three people staring off in random directions with disinterested stances. Even the guy who's walking looks like he's just shuffling along.
"I can see how, if you look at it in just the right light, you might think there is a little blue in there."
It doesn't accurately represent the photo, hence the issue at hand. In many ways "skateboarder in a park" is more accurate than the large number of small inaccuracies this description manages to accrue. (Like many humans, but still the verbosity is very odd for the little detail it actually conveys!)
I'm not trying to argue against the idea of AI-generated titling, just that the product is far inferior to what even mild dilettantes of the field might expect from the advertised capabilities.
This isn't a text-to-image model; it's an image captioning model. The images in the figures are confusingly labeled, I think, since it's the caption that's generated, not the image.