"Most" embedded ARM? Cortex-A8 and smaller do not have OoO execution. Cortex-A9 is a 32-bit up-to-quad core CPU with clock of 800MHz-2GHz and 128k-8MB of cache. That's pretty big. I guess a lot of this is subject to opinion, but I don't think smartphones with GBs of RAM when I think of embedded systems.
All of Cortex-M is in-order, and only the M7 (still somewhat exotic) has a real cache, though silicon vendors often do some modest magic to conceal flash wait states.
Alignment requirements are modest and consequences are predictable. Etc.
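To make the alignment point concrete, here's a minimal sketch in C, assuming you're pulling a 32-bit field out of a byte buffer at an arbitrary offset. `load_u32` and `load_le32` are names I made up for illustration, not anything from a vendor library. Going through `memcpy` (or assembling the value from bytes) means the code doesn't care whether a given core tolerates an unaligned word access, and compilers will typically reduce it to a plain load where that's safe.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value in native byte order from
 * any address, aligned or not. memcpy avoids dereferencing a
 * misaligned uint32_t pointer, which is undefined behavior in C and
 * may fault on some cores. */
static inline uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: read a little-endian 32-bit value byte by byte.
 * Independent of both alignment and the CPU's own endianness. */
static inline uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Something like `load_le32(&rx_buf[1])` then works the same whether or not `rx_buf + 1` happens to be word-aligned (`rx_buf` being whatever receive buffer you have).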
About the most complicated performance analysis you get is working out how much the CPU fights over memory with your DMA engine. And even that you can ignore if you're using a tightly coupled memory...
Simple predictability generally ends at 16-bit CPUs, and even those can be tricky if it's, say, m68k.