Don't fuck with the architecture for no reason. Just fucking don't. If you really, really want to, ALWAYS keep a baseline run where the architecture was not fucked with, trained otherwise identically, so you can compare. You'll see why.
The purpose of using a base model in the first place is to reuse existing learned representations, so the model only has to learn your specific task. You propose starting the run off by kicking the base model in the balls and forcing it to relearn a lot of the things that lie at its foundation. While not even doing a full fine-tune. And with a dataset that's VERY small for a heavy-duty tuning run. I'm not saying it can't work - but I am saying that you'll suffer trying to make it work.
Anything fancy you try during training? Less of a minefield, but, again: keep a baseline to compare against. 9 out of 10 fancy training ideas fail to outperform the baseline, and quite a few of those 9 underperform it noticeably. For my first run, I'd maybe implement known-good basics like curriculum learning if possible, but nothing fancier than that.
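A minimal sketch of what "curriculum learning" means here, assuming a Hugging Face-style tokenizer and a list-of-dicts dataset (the names are placeholders, and token count is only a cheap difficulty proxy):

# Curriculum learning sketch: feed samples roughly easy-to-hard.
# Token count stands in for "difficulty"; swap in whatever signal
# actually tracks difficulty for your task.
def curriculum_order(samples, tokenizer):
    """Sort samples shortest-first as a crude easy-to-hard ordering."""
    return sorted(samples, key=lambda s: len(tokenizer.encode(s["text"])))

# ordered = curriculum_order(raw_samples, tokenizer)
# Then train WITHOUT shuffling (e.g. a sequential sampler), or the
# curriculum ordering gets thrown away on the first epoch.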
"Softened targets" with semantic similarity off a dictionary might work to improve sample efficiency early into the run, but it's the kind of thing that might hobble your performance further into the run because your dictionary assumptions are worse than what the model could learn on its own, so taper this off at least? POS-tagging might improve things, in a similar way, but only if you find a decent way to feed the known-good tags into the model, which may be as simple as "put the tags in the square bracket after the words with a "this is a POS-tagged text" next to the text, then mask". The "extra POS head" may work but it might be harder to make that work than to rotate the tags into the corpus naively?
Keep in mind that those are suggestions I make based on VIBES ONLY, and the only way to know if those vibes are on point or wildly off base is to actually try those runs, because that's how applied ML is.
So if you want to get fancy, start off with a small model that's cheap and fast to tune, make sure you can validate performance at least somewhat, and be ready to experiment with your runs a lot.
What about deleting vision layers (e.g. the "multi_modal_projector" and the "vision_tower.vision_model" layers, assuming I go with Gemma 3), since I need just language generation? Would that also be considered a "kick in the balls", or a useful trimming?
Should be safe, as long as none of that is load-bearing. If it's the usual naive "massage the image into a hundred tokens and throw them into the context" vision implementation, nothing bad should happen from removing those layers or just freezing them.
I've seen "cut off the unused vision inputs" done for older multimodal models, just not for the newer Gemma 3.