Update: Looking further, a lot of space seems to be wasted in the disas output of the factorial example. It is 32+1=33 bytes long, or even 48 bytes if the next function is aligned on a 16-byte boundary. There are two NOPs in there, taking 6 and 4 bytes. So at least 10/33, or about 30%, of the code is padding, traded for aligning jump targets on 8-byte (or 16-byte?) boundaries. This seems like a big waste of L1I cache.
Is this a normal ratio? Does -Os do much better?
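
To make the arithmetic above easy to recheck, here is a minimal sketch that recomputes the padding share from the byte counts quoted in this post (33 bytes of code, NOPs of 6 and 4 bytes). The 16-byte function alignment is an assumption, as are the helper names `round_up` and `padding_ratio`, which are made up for illustration.

```python
def round_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    return -(-size // alignment) * alignment


def padding_ratio(padding_bytes, total_bytes):
    """Fraction of total_bytes that is padding."""
    return padding_bytes / total_bytes


FUNCTION_BYTES = 32 + 1   # length of the function as counted above
NOP_BYTES = 6 + 4         # the two multi-byte NOPs inside the function
ALIGNMENT = 16            # assumed alignment of the next function

# Space the function occupies if the next one starts on an aligned address.
footprint = round_up(FUNCTION_BYTES, ALIGNMENT)

# Padding inside the function only.
inner_ratio = padding_ratio(NOP_BYTES, FUNCTION_BYTES)

# Inner NOPs plus the tail padding up to the next aligned address.
tail_bytes = footprint - FUNCTION_BYTES
total_ratio = padding_ratio(NOP_BYTES + tail_bytes, footprint)

print(f"aligned footprint: {footprint} bytes")
print(f"NOP share of the function: {inner_ratio:.1%}")
print(f"padding share of the aligned footprint: {total_ratio:.1%}")
```

Under those assumptions the NOPs alone are about 30% of the function, and once the tail padding to the next 16-byte boundary is counted, roughly half of the 48-byte footprint is padding.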