Right. The improvement that this work brings is that it performs the function split very late and at a low level.
Basically, while the previous outliner split a function into two functions (the hot one literally calling the cold one as needed) this new thing takes a single function and splits it into two parts connected to each other by jumps. The cold part of the function isn't really a function --- it's just a group of basic blocks that happen to be located far away from the other group of basic blocks.
By avoiding the call into the cold function and the return to the hot function, the generated code can be smaller and more register-efficient.
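To make the difference concrete, here's a rough source-level analogy in C. It's only a sketch: the real splitting happens on machine basic blocks late in the backend, not in source, and the names (`clamp_cold`, `sum_outlined`, `sum_split`, `clamp_events`) are made up for illustration. It also assumes GCC or Clang for `__attribute__` and `__builtin_expect`.

```c
#include <stddef.h>

/* Hypothetical cold path: clamp a negative input to zero and count the event. */
static int clamp_events = 0;

/* Outliner-style shape: the cold code is a separate function, reached by a
   real call and left by a real return. */
__attribute__((noinline, cold))
static int clamp_cold(int x) {
    (void)x;
    clamp_events++;
    return 0;
}

int sum_outlined(const int *v, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        int x = v[i];
        if (__builtin_expect(x < 0, 0))
            x = clamp_cold(x);  /* call into the cold function, then return */
        total += x;
    }
    return total;
}

/* Splitting-style shape: one function. The unlikely block is only a branch
   target, so the compiler is free to place it far from the hot loop and
   connect the two with plain jumps. */
int sum_split(const int *v, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        int x = v[i];
        if (__builtin_expect(x < 0, 0)) {
            clamp_events++;
            x = 0;
        }
        total += x;
    }
    return total;
}
```

Both versions compute the same result. The point is just that in the second one there is no call boundary for the compiler to respect, so values like `total` and `i` can stay in whatever registers they already occupy.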
That'll make it more stack-efficient too, since it doesn't have to go through the same dance as a function call. The old outliner probably didn't adhere to the C ABI for those calls, but it would still have to stash at least a return address, and I suspect save some registers, around the call.
There’s no need, though: the entry point and return address are unique; it’s literally code that is sliced out, jumped to, and it jumps back to the function it was cut out of. The only thing you’d need to save is a register or two if you can’t make the jump without doing some math.
x86 should always be able to jump directly, and ARM (AArch64) sets aside x16 and x17 just for this kind of math. But all jumps should be PC-relative, so you shouldn't have to clobber anything anyway.
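For a sense of when "doing some math" would even come up, here's a small C sketch of the reach of the plain PC-relative jumps. The function names are hypothetical. The ranges are the architectural ones: x86-64 `jmp rel32` takes a signed 32-bit displacement from the end of the 5-byte instruction, and the AArch64 `B` instruction takes a signed 26-bit word offset from the instruction itself, so roughly ±128 MiB.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86-64 `jmp rel32`: signed 32-bit displacement measured from the end of
   the 5-byte instruction. */
bool x86_jmp_rel32_reaches(uint64_t insn_addr, uint64_t target) {
    int64_t disp = (int64_t)(target - (insn_addr + 5));
    return disp >= INT32_MIN && disp <= INT32_MAX;
}

/* AArch64 `B`: signed 26-bit word offset from the instruction itself, i.e.
   a 4-byte-aligned byte offset in [-128 MiB, +128 MiB). */
bool a64_b_reaches(uint64_t insn_addr, uint64_t target) {
    int64_t disp = (int64_t)(target - insn_addr);
    return (disp % 4 == 0)
        && disp >= -(INT64_C(1) << 27)
        && disp <  (INT64_C(1) << 27);
}
```

Only when the cold blocks land outside that range does a longer sequence (and a scratch register such as x16 or x17) become necessary.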
Might be confusing to debuggers if the address range of a single function is discontiguous. Does the cold portion get an independent symbol with a derived name, e.g. "Blocks"?