Afaik it's <120KB/s with all the tricks. The 6502 was hand designed and brain optimized for clever use of the available silicon real estate; roughly 20% of CPU bus cycles are dead/bogus/useless. RTS wastes 3 of its 6 cycles, RTI 2 of 6, JSR 1 of 6, all increments at least 1 cycle, etc. Sad to think the state machine handling DMA transfers in the REU is probably less than 50 macrocells, and Commodore ran its own fab; they could have built REU DMA into the C128 and it would have cost cents.
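For anyone who wants to check the ballpark, here is a rough back-of-the-envelope sketch of plain CPU copy speed. The cycle counts are the standard NMOS 6502 timings; the 1 MHz clock, the two loop shapes, and the helper name `copy_rate_kib_per_s` are my own assumptions for illustration, not something measured on hardware.

```python
# Assumed clock: a round 1 MHz (real machines differ slightly).
CLOCK_HZ = 1_000_000


def copy_rate_kib_per_s(cycles_per_byte, clock_hz=CLOCK_HZ):
    """Bytes copied per second, expressed in KiB (1 KiB = 1024 bytes)."""
    return clock_hz / cycles_per_byte / 1024


# Simple indexed loop, per byte:
#   LDA src,X (4) + STA dst,X (5) + INX (2) + BNE taken (3)
SIMPLE_LOOP_CYCLES = 4 + 5 + 2 + 3

# Fully unrolled copy, per byte:
#   LDA abs (4) + STA abs (4), no loop overhead at all
UNROLLED_CYCLES = 4 + 4

print(f"simple loop: {copy_rate_kib_per_s(SIMPLE_LOOP_CYCLES):.1f} KiB/s")
print(f"unrolled:    {copy_rate_kib_per_s(UNROLLED_CYCLES):.1f} KiB/s")
```

The unrolled case is an upper bound under these assumptions. On a C64 the video chip also steals cycles from the CPU, which would push the real number lower.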
Transfer Alternate Increment (TAI), Transfer Increment Alternate (TIA), Transfer Decrement Decrement (TDD), Transfer Increment Increment (TII): pretty much x86 'rep movsb', except not great at 6 cycles per byte (~160KB/s at roughly 1MHz). For contrast, the 5 years older 80286 already did 'rep movsw' at 2 cycles per byte. 6 years later the Pentium did 'rep movsd' at 4 bytes per cycle. Nowadays Cannonlake can do 'rep movsb' a full cacheline at a time at full cache/memory controller speed.
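To put those per-byte costs side by side, here is a small sketch that converts each one to throughput at a common clock. The cycles-per-byte figures are the ones quoted above; the 1 MHz reference clock is purely an assumption to make the numbers comparable (the real chips ran at very different speeds), and setup overhead per instruction is ignored.

```python
# Hypothetical common clock, used only to compare per-cycle efficiency.
CLOCK_HZ = 1_000_000

# Steady-state cycles per byte, taken from the figures quoted above.
CYCLES_PER_BYTE = {
    "block transfer (TII etc.)": 6.0,
    "80286 rep movsw": 2.0,
    "Pentium rep movsd": 0.25,  # 4 bytes per cycle
}


def kib_per_s(cycles_per_byte, clock_hz=CLOCK_HZ):
    """Throughput in KiB/s for a given per-byte cycle cost."""
    return clock_hz / cycles_per_byte / 1024


for name, cpb in CYCLES_PER_BYTE.items():
    print(f"{name}: {kib_per_s(cpb):.0f} KiB/s per MHz of clock")
```

At this assumed clock the 6 cycles per byte case comes out at about 163 KiB/s, which lines up with the ~160KB/s figure.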
“The 100 MHz 6502” does a different clever thing - it copies all the dedicated RAM and ROM into the FPGA's own memory. Then it can perform 7 to 25 instructions before the next external read/write cycle!