
> So yes, LLMs will make mistakes, but humans do too

Are you using LLMs, though? Because pretty much all of these systems are fairly conventional classifiers, the kind of thing that would've simply been called Machine Learning 2-3 years ago.

The "AI hype is real because medical AI is already in use" argument (and it's siblings) perform a rhetorical trick by using two definitions of AI. "AI (Generative AI) hype is real because medical AI (ML classifiers) is already in use" is a non-sequitur.

Image classifiers are very narrow intelligences, which makes them easy to understand and use as tools. We know exactly what their failure modes are and can put hard measurements on them. We can even dissect these models to learn why they are making certain classifications and either improve our understanding of medicine or improve the model.
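
To be concrete about "hard measurements": with a narrow classifier you can hold out a labelled test set and count the failure modes directly. A minimal illustrative sketch with scikit-learn (the labels here are made up):

    from sklearn.metrics import confusion_matrix, precision_score, recall_score

    # Made-up ground-truth labels and predictions for a held-out test set.
    y_true = [0, 0, 1, 1, 1, 0, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

    # The confusion matrix counts the failure modes exactly:
    # false positives and false negatives, not vibes.
    print(confusion_matrix(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))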

...

Basically none of this applies to Generative AI. The big problem with LLMs is that they're simply not general-intelligence systems capable of accurately and robustly modelling their inputs. For example, where an anti-fraud classifier operates directly on the financial transaction data, an LLM summarizing a business report doesn't "understand" finance: it doesn't know which details are important or which are unusual in the specific context. It just stochastically throws away information.




Yes, I am. These LLMs/VLMs are much more robust at NLP/CV tasks than any of the application-specific models we used to train 2-3 years ago.

I also wasted a lot of time building complex OCR pipelines that required dewarping/image normalization, detection, bounding-box alignment, text recognition, layout analysis, etc., and now open models like Qwen VL obliterate them with an end-to-end transformer model that can be defined in like 300 lines of PyTorch code.
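
For a sense of how small the application code gets, here's a minimal sketch of using Qwen2-VL for OCR through Hugging Face transformers. The model name, prompt, and image path are placeholders and the pattern follows the public model card, not my production pipeline:

    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    model_id = "Qwen/Qwen2-VL-7B-Instruct"  # placeholder checkpoint
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto")
    processor = AutoProcessor.from_pretrained(model_id)

    # One prompt replaces the old detect -> align -> recognize -> layout pipeline.
    messages = [{"role": "user", "content": [
        {"type": "image", "image": "file:///path/to/scanned_page.png"},
        {"type": "text", "text": "Transcribe all text in this document, preserving reading order."},
    ]}]

    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    images, videos = process_vision_info(messages)
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)

    out = model.generate(**inputs, max_new_tokens=1024)
    print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                 skip_special_tokens=True)[0])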


Different tasks, then? If you are using VLMs in the context of medical imaging, I have concerns. That is not a place for hallucination-prone AI.

But yes, the transformer model itself isn't useless; what matters is where you apply it. OCR, image description, etc. are all the kind of narrow-intelligence tasks that lend themselves well to the fuzzy nature of AI/ML.


The world is a fuzzy place; most things are not binary.

I haven't worked in medical imaging in a while, but VLMs make for much better diagnostic tools than task-specific classifiers or segmentation models, which tend to find hacks in the data to cheat on the objective they're optimized for.

The next-token objective turns out to give much better vision supervision than things like CLIP or classification losses (e.g. https://arxiv.org/abs/2411.14402).

I spent the last few years working on large-scale food recognition models, and my multi-label classification models had no chance of competing with GPT-4 Vision, which was trained on all of the internet and has an amazing prior thanks to its vast knowledge of facts about food (recipes, menus, ingredients, etc.).
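
Purely as an illustrative sketch of the comparison: where the classifier had a fixed label set, the VLM can just be queried. This uses the OpenAI Python SDK; the model name, prompt, and image URL are placeholders, not the actual evaluation setup:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Placeholder URL; any reachable photo of a dish works.
    image_url = "https://example.com/photo_of_dish.jpg"

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List the dish and its likely ingredients as short labels."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    print(response.choices[0].message.content)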

Same goes for other areas like robotics: we'd seen very little progress outside of simulation up until about a year ago, when people took pretrained VLMs and tuned them to predict robot actions, beating all previous methods by a large margin (google Vision-Language-Action models). It turns out you need a good foundation model with a core understanding of the world before you can train a robot to do general tasks.



