Don't fuck with the architecture for no reason. Just fucking don't. If you really, really want to, ALWAYS keep a baseline run where the architecture was not fucked with, trained otherwise identically, so you can compare. You'll see why.
The purpose of using a base model in the first place is to reuse existing learned representations, so the model only has to learn your specific task. You propose starting the run off by kicking the base model in the balls and forcing it to relearn a lot of the things that lie at its foundation. While not even doing a full fine-tune. And with a dataset that's VERY small for a heavy-duty tuning run. I'm not saying it can't work - but I am saying that you'll suffer trying to make it work.
Anything fancy you try during training? Less of a minefield, but, again: keep a baseline to compare against. 9 out of 10 fancy training ideas fail to outperform the baseline, and quite a few of those 9 underperform it noticeably. For my first run, I'd maybe implement known-good basics like curriculum learning if possible, but nothing fancier than that.
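A minimal sketch of what "curriculum learning" means here, assuming a Hugging Face-style tokenizer and a list-of-dicts dataset (the names are placeholders, and token count is only a cheap difficulty proxy):

# Curriculum learning sketch: feed samples roughly easy-to-hard.
# Token count stands in for "difficulty"; swap in whatever signal
# actually tracks difficulty for your task.
def curriculum_order(samples, tokenizer):
    """Sort samples shortest-first as a crude easy-to-hard ordering."""
    return sorted(samples, key=lambda s: len(tokenizer.encode(s["text"])))

# ordered = curriculum_order(raw_samples, tokenizer)
# Then train WITHOUT shuffling (e.g. a sequential sampler), or the
# curriculum ordering gets thrown away on the first epoch.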
"Softened targets" with semantic similarity off a dictionary might work to improve sample efficiency early into the run, but it's the kind of thing that might hobble your performance further into the run because your dictionary assumptions are worse than what the model could learn on its own, so taper this off at least? POS-tagging might improve things, in a similar way, but only if you find a decent way to feed the known-good tags into the model, which may be as simple as "put the tags in the square bracket after the words with a "this is a POS-tagged text" next to the text, then mask". The "extra POS head" may work but it might be harder to make that work than to rotate the tags into the corpus naively?
Keep in mind that those are suggestions I make based on VIBES ONLY, and the only way to know if those vibes are on point or wildly off base is to actually try those runs, because that's how applied ML is.
So if you want to get fancy, start off with a small model that's cheap and fast to tune, make sure you can validate performance at least somewhat, and be ready to experiment with your runs a lot.
What about deleting vision layers (e.g. the "multi_modal_projector" and the "vision_tower.vision_model" layers, assuming I go with Gemma 3), since I need just language generation? Would that also be considered a "kick in the balls", or a useful trimming?
Should be safe, as long as none of that is load-bearing. If it's the usual naive "massage the image into a hundred tokens and throw them into the context" vision implementation, nothing bad should happen from removing those layers or just freezing them.
I've seen "cut off the unused vision inputs" done for older multimodal models, just not for the newer Gemma 3.