No, tagging of people is already handled by another model. Gemma just describes what's in the image and produces a comma-separated list of keywords. No additional training is required, just a few tweaks to the prompt so that it outputs only the description, without any "fluff". E.g. it normally prepends such outputs with "Here's a description of the image:" unless you really insist that it should output only the description. I suppose I could use constrained decoding into JSON or something to achieve the same, but I didn't mess with that.
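Roughly what that looks like in practice, as a minimal sketch: this assumes the model is served locally through Ollama (the serving setup, model tag, and exact prompt wording are my assumptions, not the only way to do it), with the prompt itself doing the "no fluff" enforcement.

    # Minimal sketch: ask a locally served Gemma 3 (via Ollama here, an assumption)
    # for a bare description plus keyword list, no preamble.
    import base64
    import requests

    PROMPT = (
        "Describe this photo in one short paragraph, then give a comma-separated "
        "list of keywords. Output only the description and the keywords. "
        "Do not add any preamble such as 'Here's a description of the image:'."
    )

    def describe(image_path: str, model: str = "gemma3:12b") -> str:
        with open(image_path, "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode()
        resp = requests.post(
            "http://localhost:11434/api/generate",  # Ollama's default endpoint
            json={"model": model, "prompt": PROMPT,
                  "images": [img_b64], "stream": False},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["response"].strip()

    print(describe("kitten.jpg"))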
On some images where Gemma 3 struggles, Mistral Small produces better descriptions, BTW, but it seems harder to make it follow my instructions exactly.
I'm looking forward to the day when I can also do this with videos, a lot of which I likewise have no interest in uploading to someone else's computer.
Search is indeed hit-and-miss. Immich, for instance, currently does absolutely nothing with the EXIF "description" field, so I store textual descriptions on the side as well. I have found Immich's search by image embeddings to be pretty weak at recall, and even weaker at ranking. IIRC Lightroom Classic (which I also use, but haven't found a way to automate this for short of writing an extension) does search that field, but its ranking is a bit of a dumpster fire, so your best bet is searching uncommon terms or constraining the search by metadata (e.g. not just "black kitten" but "black kitten AND 2025"). I expect this to improve significantly over time - it's a fairly obvious thing to add given the available tech.
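For reference, writing the description into the image and into a sidecar can look something like this. A minimal sketch assuming exiftool is installed; the choice of tags (EXIF ImageDescription, XMP dc:description, IPTC caption) and the sidecar naming are my assumptions, not necessarily what anyone's actual pipeline uses.

    # Minimal sketch: write the generated description into the common
    # caption/description metadata fields (so tools like Lightroom Classic
    # can search it) and into a sidecar text file for tools that ignore
    # the description field entirely (e.g. Immich today).
    import subprocess
    from pathlib import Path

    def store_description(image: Path, description: str) -> None:
        subprocess.run(
            [
                "exiftool",
                "-overwrite_original",
                f"-EXIF:ImageDescription={description}",
                f"-XMP-dc:Description={description}",
                f"-IPTC:Caption-Abstract={description}",
                str(image),
            ],
            check=True,
        )
        # Side-by-side text file as a fallback.
        image.with_suffix(image.suffix + ".txt").write_text(
            description, encoding="utf-8"
        )

    store_description(
        Path("kitten.jpg"),
        "A black kitten sleeping on a windowsill. black kitten, sleeping, windowsill",
    )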