Actually, Intel explicitly broke backwards compatibility starting with the Pentium, by adding the hardware to make self-modifying code (SMC) work without additional effort. The 486 and below needed an explicit branch to flush the prefetch queue, and this effect has been exploited for various anti-debugging tricks and even this amazing 8088-only optimisation:

https://news.ycombinator.com/item?id=9340231

An instruction for flushing an entire region would potentially have a very long execution time, which is awkward because you would want to be able to interrupt and resume it. So it would need "how far have I got" state stored somewhere.

x86 has the REP prefix for this purpose; used with certain instructions (the string instructions), it repeats the instruction, decrementing a count register (CX/ECX/RCX) each time, until the count reaches zero. The earlier implementations simply didn't update the instruction pointer while the count was still nonzero, so the CPU would repeatedly fetch and execute the same instruction, and it's interruptible between each step. The register holds how many iterations remain, so it doubles as the "how far have I got" state. Once the count reaches zero, the instruction pointer moves on to the next instruction. Modern x86 handles this by generating uops in the decoder instead, but the basic functionality is the same.
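
To make the "don't advance the instruction pointer" trick concrete, here is a rough Python sketch of a toy CPU running REP MOVSB. It is a simplified model, not how any real core is built: the `Cpu` class, `step_rep_movsb` and `copy_with_interrupt` are names I made up for illustration, the direction flag is assumed clear, and segments are ignored.

```python
from dataclasses import dataclass, field

REP_MOVSB_LEN = 2  # F3 A4: the REP prefix byte plus the MOVSB opcode byte


@dataclass
class Cpu:
    # Hypothetical toy register file; only what REP MOVSB needs.
    ip: int = 0
    cx: int = 0
    si: int = 0
    di: int = 0
    mem: bytearray = field(default_factory=lambda: bytearray(64))


def step_rep_movsb(cpu: Cpu) -> None:
    """Model one fetch-and-execute of a REP MOVSB sitting at cpu.ip."""
    if cpu.cx == 0:
        # Nothing left to do: fall through to the next instruction.
        cpu.ip += REP_MOVSB_LEN
        return
    cpu.mem[cpu.di] = cpu.mem[cpu.si]  # copy one byte
    cpu.si += 1
    cpu.di += 1
    cpu.cx -= 1
    if cpu.cx == 0:
        cpu.ip += REP_MOVSB_LEN
    # Otherwise ip is left alone, so the same instruction is fetched again.


def copy_with_interrupt(data: bytes, interrupt_at: int):
    """Copy `data` from offset 0 to offset 32, recording the state an
    interrupt arriving before step `interrupt_at` would see."""
    cpu = Cpu(ip=0x100, cx=len(data), si=0, di=32)
    cpu.mem[:len(data)] = data
    saved = None
    steps = 0
    while cpu.ip == 0x100:
        if steps == interrupt_at:
            # The return address still points at the REP MOVSB itself and
            # cx says how much is left, so resuming needs no extra state.
            saved = (cpu.ip, cpu.cx)
        step_rep_movsb(cpu)
        steps += 1
    return cpu, saved
```

Calling `copy_with_interrupt(b"hello", 2)` copies all five bytes, and the recorded state shows the instruction pointer still at the REP MOVSB with three iterations left in the count register. The resume information lives entirely in the ordinary registers, which is what makes a long-running instruction cheap to interrupt.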