Unless you are writing an optimizing compiler, you just have to understand the target language well enough to write snippets that resemble the operations of your source or intermediate language, and that can be surprisingly little.
Anecdotally: I once wrote an AXP21164 back end with no prior knowledge of the architecture, working only from the reference manual. It was not pretty or fast, but it worked.
Isolated anecdote, but I work on a JVM-based language, and my colleague is having to write a register allocator because our input language is in SSA form and so has a huge number of locals per method. We need to map these onto a smaller number of JVM locals (which behave like registers); otherwise the JVM frames are huge.
Also, if you aren't writing a register allocator because you're targeting the JVM or LLVM, then someone else is doing that work on your behalf. We can't forget these skills.
Doesn't a modern JIT collapse JVM locals into a smaller actual stack frame already? Compilers have been doing this for local variables with non-overlapping liveness for decades.
The JVM can't do this because of debugging. If you attach a JVM debugger and examine a local variable whose storage has been reused, what would you see? Junk from some other local variable. It's one cost of always-on debugging.
Normally the cost isn't too bad, and of course locals are spilled to the stack rather than kept in registers, but if the language you are implementing has thousands of locals you can even hit the limit on frame size. We've seen this in practice in more than one language and application.
Also, only some frames are JIT-compiled - what about the rest?
In the JVM, the start_pc/length of each LocalVariableTable entry lets you have more than one name for a given index in the stack frame, exactly so that indices can be reused for variables whose liveness doesn't overlap. So the debugging and non-JIT objections don't apply.
But if you're really running into the 64k limit on max_locals even with index reuse, then you're out of luck.
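As a concrete illustration: javac itself typically reuses a local slot for variables declared in disjoint scopes, emitting a separate LocalVariableTable entry (same index, different start_pc/length) for each name, which is why the debugger still shows the right name at each pc. A minimal sketch (class and method names are made up for the example):

```java
// Two locals in disjoint scopes: javac is free to give `a` and `b` the same
// slot index, with two LocalVariableTable entries covering different pc ranges.
public class DisjointScopes {
    static int f(int x) {
        {
            int a = x + 1;   // live only inside this block
            x = a * 2;
        }
        {
            int b = x - 3;   // scope disjoint from a's, so b can share a's slot
            x = b * b;
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(f(5)); // (5+1)*2 = 12, then (12-3)^2 = 81 -> prints 81
    }
}
```

Running javap -l on the compiled class shows the two entries; the semantics of the method are unaffected by whether the slot is shared.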
But that's my point - the person emitting the bytecode needs to be the one doing the work to reuse local variable indices. And the algorithm for doing that is register allocation. So even if you target the JVM you may still need to know how to do register allocation; it isn't some historical piece of trivia.
For every local variable index, the JVM needs to keep the value live for the duration of the method, because that's what the debugger will show.
And yes we have seen the 64k limit blown by real programs in the wild that aren't designed to be awkward (I'm in the VM research group at Oracle).
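For what it's worth, the index-reuse pass being described can be sketched as a linear scan over live intervals. This is only a sketch under simplifying assumptions: every value is assumed to have one contiguous live range, and two-slot long/double values, live-range holes, etc. are ignored; the names (Interval, assignSlots) are illustrative, not from any real compiler.

```java
import java.util.*;

// Linear-scan slot assignment: values whose live ranges don't overlap
// may share a slot, so the slot count can be far below the value count.
public class SlotAlloc {
    public static final class Interval {
        final int start, end; // live range [start, end)
        public Interval(int start, int end) { this.start = start; this.end = end; }
    }

    public static int[] assignSlots(Interval[] vals) {
        // process values in order of increasing start point
        Integer[] order = new Integer[vals.length];
        for (int i = 0; i < vals.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingInt(i -> vals[i].start));

        int[] slot = new int[vals.length];
        List<Integer> slotEnds = new ArrayList<>(); // end point of each slot's current occupant, -1 if free
        Deque<Integer> free = new ArrayDeque<>();
        for (int i : order) {
            // expire occupants whose range ended before this value starts
            for (int s = 0; s < slotEnds.size(); s++) {
                if (slotEnds.get(s) >= 0 && slotEnds.get(s) <= vals[i].start) {
                    slotEnds.set(s, -1);
                    free.push(s);
                }
            }
            // reuse a freed slot if possible, else open a new one
            int s = free.isEmpty() ? slotEnds.size() : free.pop();
            if (s == slotEnds.size()) slotEnds.add(vals[i].end);
            else slotEnds.set(s, vals[i].end);
            slot[i] = s;
        }
        return slot;
    }

    public static void main(String[] args) {
        // three values, but v0 dies before v2 is born, so two slots suffice
        Interval[] vals = {
            new Interval(0, 4),  // v0
            new Interval(2, 8),  // v1 overlaps v0
            new Interval(5, 9),  // v2 can reuse v0's slot
        };
        System.out.println(Arrays.toString(assignSlots(vals))); // prints [0, 1, 0]
    }
}
```

Emitting the matching LocalVariableTable entries (one per value, covering its range at its assigned slot) is then bookkeeping on top of this.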
Furthermore, the semantics of a language should be tailored to its target. Some execution patterns work well on the JVM; some don't. The same goes for JavaScript and other targets.
Inventing whatever semantics you have in mind without considering how they are going to be compiled is a recipe for a slow language.