Nice. I'd try to avoid the puzzles with only two options; in most (?) of them it's obvious that you must pick the small object and drag it into the big one. Like
[chair] [room]
It makes sense to move the chair into the room, but not in the other direction.
Something simple, like using the number of images as the difficulty measure, may be good enough for a first version. Are you randomizing the questions? Perhaps start with some easy examples with 1 object and 1 place, then 2 small objects and 1 place, then 2 small objects and 2 places, then 3 small objects and 2 places, ... Or something like that. (I remember a few other options like "cut the apple" that have no "places", so take my recommendation only as a rough idea.)
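To make that ordering concrete, here is a rough Python sketch of the progression. The function name `progression` is made up for illustration, and continuing the pattern past "3 objects, 2 places" by alternating which count grows is my assumption, not something the app necessarily does.

```python
def progression(steps):
    """Return (objects, places) pairs, growing one count at a time.

    Assumption: the pattern alternates, adding an object first,
    then a place, then an object again, and so on.
    """
    objects, places = 1, 1
    levels = [(objects, places)]
    for i in range(1, steps):
        if i % 2 == 1:
            objects += 1   # odd steps add a small object
        else:
            places += 1    # even steps add a place
        levels.append((objects, places))
    return levels


print(progression(4))
```

Exercises without "places" (like "cut the apple") would need their own difficulty measure, so this only covers the drag-into-place kind.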
Yes, solid ideas, I was thinking along similar lines.
There are two possible meta approaches for this:
a) hard-code the whole progression, which is tedious and not very adaptive to the learner's level
b) use algorithmic exercise/question selection. Then, suddenly, the degrees of freedom explode: You had 5 exercises in a row with 6 images, so maybe you'd want something easier. It's been a long time since we practiced anything with "apple". But "cutting" was just practiced, so it would be boring to bring it up again. So here is a set with all possible "apple" questions in the database. Some include words not yet practiced. Which one do we pick? And so on. It's a fascinating problem, far more complex than I anticipated. And if you use simple shortcuts, you quickly end up in very boring loops :)
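To give a feel for option b), here is a minimal Python sketch of score-based selection. It is not the actual implementation: the `Exercise` fields, the weights, and the "6 images counts as hard" threshold are all assumptions picked for illustration. Each candidate gets points for words that have not come up in a while, and loses points for words that were just practiced, for words never seen before, and for image count when the recent streak was hard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exercise:
    name: str
    words: frozenset   # vocabulary used by the exercise
    num_images: int    # crude difficulty proxy


def score(ex, now, last_seen, recent_images):
    """Higher is better. All weights are arbitrary illustrative values."""
    s = 0.0
    for w in ex.words:
        if w not in last_seen:
            s -= 3.0                      # word not yet practiced
        else:
            gap = now - last_seen[w]
            if gap <= 1:
                s -= 2.0                  # just practiced, would be boring
            else:
                s += 0.5 * min(gap, 10)   # due for review, capped
    # After a hard streak, prefer exercises with fewer images.
    if recent_images and sum(recent_images) / len(recent_images) >= 6:
        s -= 0.5 * ex.num_images
    return s


def pick(candidates, now, last_seen, recent_images):
    """Return the best-scoring candidate."""
    return max(candidates, key=lambda ex: score(ex, now, last_seen, recent_images))


candidates = [
    Exercise("cut the apple", frozenset({"cut", "apple"}), 6),
    Exercise("apple in basket", frozenset({"apple", "basket"}), 2),
    Exercise("peel the apple", frozenset({"peel", "apple"}), 2),
]
last_seen = {"apple": 0, "cut": 19, "basket": 12}  # step at which each word was last practiced
best = pick(candidates, now=20, last_seen=last_seen, recent_images=[6] * 5)
print(best.name)
```

Always taking the top score is exactly the kind of simple shortcut that ends in loops, so a real version would probably add some randomness or a cooldown per exercise.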