Hacker News
Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models (arxiv.org)
83 points by milliondreams 10 months ago | hide | past | favorite | 7 comments



Mini-Gemini is a bit of a confusing name.

Reminds me of how DALL·E Mini came out three years ago and eventually had to rename itself to Craiyon https://github.com/borisdayma/dalle-mini




Is this based on LLaVA 1.6? Not to be too lazy, but maybe someone could link to a comparison with that, if there is one?


Excited to see how this does on OpenCompass!


The paper introduces Mini-Gemini, a framework aimed at enhancing Vision Language Models (VLMs) to close the performance gap with advanced models like GPT-4 and Gemini. It focuses on improving visual token resolution, curating high-quality datasets for better image comprehension, and expanding the operational scope of VLMs. Mini-Gemini supports a range of large language models and has shown strong performance on zero-shot benchmarks. The code and models are publicly available.
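For anyone curious what "improving visual token resolution" means mechanically: the paper's patch info mining step lets low-resolution visual tokens act as queries that attend over high-resolution patch features from a second vision encoder, enriching the tokens without increasing their count. A minimal numpy sketch of that cross-attention idea (the shapes, the shared projection-free Q/K/V, and the residual add here are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_info_mining(low_res_tokens, high_res_patches):
    """Enrich low-res visual tokens with high-res detail via cross-attention.

    low_res_tokens:  (N_low, d)  -- queries (kept as the final token count)
    high_res_patches: (N_high, d) -- keys/values from a high-res encoder
    """
    d = low_res_tokens.shape[-1]
    Q = low_res_tokens
    K = V = high_res_patches
    # (N_low, N_high) attention over high-res patches
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    # Residual enrichment: token count stays at N_low
    return low_res_tokens + attn @ V

# Illustrative sizes (hypothetical, not from the paper):
rng = np.random.default_rng(0)
low = rng.normal(size=(576, 64))    # e.g. a 24x24 grid of low-res tokens
high = rng.normal(size=(2304, 64))  # e.g. a 48x48 grid of high-res patches
enriched = patch_info_mining(low, high)
print(enriched.shape)  # (576, 64): same token count, higher-res information
```

The point of the design is that the LLM still only sees N_low visual tokens, so sequence length (and compute) stays fixed while each token is informed by the finer-grained features.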


WTF is a "Multi-modality Vision Language Model"? Does it mean:

- a program where you give it a text description, and it outputs a picture

- a program where you give it a picture, and it outputs a text description

- both of the above

- something else

?



