Hacker News
Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models (arxiv.org)
83 points by milliondreams 10 months ago | hide | past | favorite | 7 comments



Mini-Gemini is a bit of a confusing name.

Reminds me of how DALL·E Mini came out three years ago and eventually had to rename itself to Craiyon https://github.com/borisdayma/dalle-mini




Is this based on LLaVA 1.6? Not to be too lazy, but maybe someone could link to a comparison with that, if there is one?


Excited to see how this does on OpenCompass!


The paper introduces Mini-Gemini, a framework aimed at enhancing Vision Language Models (VLMs) to close the performance gap with advanced models like GPT-4 and Gemini. It focuses on improving visual token resolution, curating high-quality datasets for better image comprehension, and expanding the operational scope of VLMs. Mini-Gemini supports a range of large language models and has shown strong performance on zero-shot benchmarks. The code and models are publicly available.
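For anyone curious what "improving visual token resolution" means mechanically: the paper's patch info mining step lets low-resolution visual tokens act as queries that attend over high-resolution patch features from a second vision encoder, enriching the tokens without increasing their count. A minimal numpy sketch of that cross-attention idea (the shapes, the shared projection-free Q/K/V, and the residual add here are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_info_mining(low_res_tokens, high_res_patches):
    """Enrich low-res visual tokens with high-res detail via cross-attention.

    low_res_tokens:  (N_low, d)  -- queries (kept as the final token count)
    high_res_patches: (N_high, d) -- keys/values from a high-res encoder
    """
    d = low_res_tokens.shape[-1]
    Q = low_res_tokens
    K = V = high_res_patches
    # (N_low, N_high) attention over high-res patches
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    # Residual enrichment: token count stays at N_low
    return low_res_tokens + attn @ V

# Illustrative sizes (hypothetical, not from the paper):
rng = np.random.default_rng(0)
low = rng.normal(size=(576, 64))    # e.g. a 24x24 grid of low-res tokens
high = rng.normal(size=(2304, 64))  # e.g. a 48x48 grid of high-res patches
enriched = patch_info_mining(low, high)
print(enriched.shape)  # (576, 64): same token count, higher-res information
```

The point of the design is that the LLM still only sees N_low visual tokens, so sequence length (and compute) stays fixed while each token is informed by the finer-grained features.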


WTF is a "Multi-modality Vision Language Model"? Does it mean:

- a program where you give it a text description, and it outputs a picture

- a program where you give it a picture, and it outputs a text description

- both of the above

- something else

?



