
It could be faster than some alternatives under some circumstances. A loop would probably be more code (== icache pressure) and would consume at least a register and a BTB entry.



It's certainly been a while since I last optimized for the original Pentium architecture. Still faintly remember U & V pipes, unexplained causes for stalls, etc.

As is still true today, it would likely depend on the particular algorithm and data set. I'd be surprised if you can't do better than 4 cycles per char for sufficiently long strings. For short strings, REP SCASB most likely wins due to setup costs. (Actually that article's skipto method would of course have used REP CMPSB, but that's just splitting hairs.)

Remember that even original Pentium could execute up to two instructions per clock. Unless you messed up with those damn U & V pipes. :-)

The hypothetical faster-than-rep solution would need to process data in 32-bit chunks, faux vector style.
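For illustration, here is a minimal C sketch of what "32-bit chunks, faux vector style" could look like for a byte search. It is not taken from the article: the names `has_zero_byte` and `find_byte_swar` are made up for this example, and it only shows the idea, not anything tuned for the U & V pipes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero iff at least one byte of v is 0x00 (classic "haszero" bit trick). */
static uint32_t has_zero_byte(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* Hypothetical helper: index of the first byte equal to c in buf[0..len),
 * or len if there is none. */
static size_t find_byte_swar(const unsigned char *buf, size_t len, unsigned char c)
{
    const uint32_t pattern = 0x01010101u * c;  /* c replicated into all four bytes */
    size_t i = 0;

    /* Main loop: examine four bytes per iteration. */
    while (len - i >= 4) {
        uint32_t word;
        memcpy(&word, buf + i, sizeof word);   /* alignment-safe 32-bit load */
        if (has_zero_byte(word ^ pattern))     /* a matching byte XORs to 0x00 */
            break;
        i += 4;
    }

    /* Finish byte by byte: either the word that contained a hit, or the tail. */
    while (i < len && buf[i] != c)
        i++;
    return i;
}
```

The word test only says "a match is somewhere in these four bytes", so the final byte loop pins down the exact position. That also keeps the sketch independent of endianness.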


You would be surprised by what is happening in modern computers. I don't know about REP SCASB, but IIRC, REP MOVSB is now an insanely efficient way to memcpy on recent Intel microarchitectures (not necessarily always the fastest, but fast enough for tons of scenarios, and very icache-friendly). But it might be less interesting on some other x86 processors.
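As a rough sketch of what a REP MOVSB copy looks like from C, assuming GCC or Clang inline assembly on x86 (the function name `copy_rep_movsb` is made up here, and other targets simply fall back to `memcpy`):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes from src to dst with a single REP MOVSB.
 * Assumes the direction flag is clear, as the usual x86 ABIs guarantee. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    void *d = dst;
    /* RDI/EDI = destination, RSI/ESI = source, RCX/ECX = byte count. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return dst;
#else
    return memcpy(dst, src, n);  /* non-x86 fallback so the sketch still builds */
#endif
}
```

Whether this beats a tuned library memcpy depends on the CPU and the copy size, so treat it as something to measure rather than a drop-in replacement.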

It makes sense to delegate some of the micro-optimizations to the hardware.

But regular scalar instructions are also optimized like crazy. Write a small loop, and your state of the art microarch might sort of unroll it by using register renaming and speculative execution, so sometimes basically multiple iterations are executed at the same time (and on top of that you sometimes get uOP cache locking, which then improves energy and hyperthreading efficiency).


I might have been surprised, but I still optimize for modern hardware. :-)

Yes, REP MOVSB is fast at least on Intel CPUs nowadays.


> The hypothetical faster-than-rep solution would need to process data in 32-bit chunks, faux vector style.

Or with real vector style with vectorized instructions?


The original Intel Pentium (1993) did not have any vectorized instructions.



