I'm guessing that they want to avoid the overhead of handling the fault, context switching to the kernel, emulating the instruction (which may be tricky if it involves a memory access), and context switching back to the application. Instead they can just emit the code into the address space of the process and patch the faulting instruction with a jump to it. Maybe they can also "JIT" the instruction to emit code optimized for that particular invocation.
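To make the guess concrete, here's a toy model in Python. It is not MIPS and not what the kernel actually does: the opcodes ("add", "fmul", "softmul", "jmp", "halt") and the `run` function are made up for illustration. It just counts how many times a repeatedly executed unsupported instruction would trap with and without patching it into a jump to an emitted stub.

```python
# Toy model only: a list of (opcode, operand) tuples stands in for the
# process's memory. All opcode names here are hypothetical.

def run(program, iterations, patch):
    """Execute `program` `iterations` times; return (accumulator, traps)."""
    mem = list(program)  # private copy, so patches do not leak to the caller
    acc, traps = 1, 0
    for _ in range(iterations):
        pc = 0
        while mem[pc][0] != "halt":
            op, arg = mem[pc]
            if op == "add":
                acc += arg
                pc += 1
            elif op == "softmul":      # software routine the "CPU" can run
                acc *= arg
                pc += 1
            elif op == "jmp":
                pc = arg
            elif op == "fmul":         # unsupported instruction: faults
                traps += 1             # stands in for fault + two context switches
                acc *= arg             # "kernel" emulates it this one time
                if patch:
                    stub = len(mem)    # emit a stub into the same address space
                    mem += [("softmul", arg), ("jmp", pc + 1)]
                    mem[pc] = ("jmp", stub)  # patch the faulting instruction
                pc += 1
            else:
                raise ValueError(f"unknown opcode {op!r}")
    return acc, traps


PROGRAM = [("add", 1), ("fmul", 2), ("halt", None)]

print(run(PROGRAM, 3, patch=False))  # (22, 3): one trap per execution
print(run(PROGRAM, 3, patch=True))   # (22, 1): only the first execution traps
```

Same result either way, but with patching the trap cost is paid once instead of on every pass through the instruction.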

That's just a guess though. I only skimmed the paper and I don't think they really explain it. They link to this page, but it doesn't give any details: https://www.linux-mips.org/wiki/Floating_point#The_Linux_ker...

In particular, I'm not sure why you'd want to put it on the stack instead of in a separately allocated page dedicated to that purpose (besides "it was already there so we used it").