Mostly by eliminating a lot of individual copying steps. It also took understanding how the texture swizzling worked and writing directly to that format, combining all the steps into a single pass, and writing very good code that used MMX. Optimizations like these often add speed at the expense of generality, but I still think they are useful.
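To make the "write directly to the swizzled format" idea concrete, here is a minimal sketch in plain scalar C (no MMX). It rests on assumptions that are not stated above: the texture is square with a power-of-two size, the swizzle is a Morton (Z-order) layout, and the source pixels are RGB565 being expanded to 32-bit ARGB. The names `spread_bits`, `swizzled_index`, and `convert_to_swizzled` are hypothetical. The point is only that conversion and swizzling happen in one loop, with no intermediate linear buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Spread the low 16 bits of v so they land on the even bit positions. */
static uint32_t spread_bits(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Assumed layout: Morton order, x bits in even positions, y bits in odd. */
static uint32_t swizzled_index(uint32_t x, uint32_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}

/* Expand one RGB565 pixel to opaque ARGB8888, replicating high bits. */
static uint32_t rgb565_to_argb(uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1Fu;
    uint32_t g = (p >> 5) & 0x3Fu;
    uint32_t b = p & 0x1Fu;
    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return 0xFF000000u | (r << 16) | (g << 8) | b;
}

/*
 * Convert a linear RGB565 image and store each pixel straight into its
 * swizzled slot. size must be a power of two; src_pitch is in pixels.
 * A multi-step version would convert into a temporary linear buffer,
 * then copy that buffer into swizzled order; this does both at once.
 */
void convert_to_swizzled(const uint16_t *src, size_t src_pitch,
                         uint32_t *dst, uint32_t size)
{
    for (uint32_t y = 0; y < size; ++y) {
        const uint16_t *row = src + (size_t)y * src_pitch;
        uint32_t y_bits = spread_bits(y) << 1; /* hoisted out of inner loop */
        for (uint32_t x = 0; x < size; ++x) {
            dst[y_bits | spread_bits(x)] = rgb565_to_argb(row[x]);
        }
    }
}
```

This also shows the generality trade-off: the routine only handles one source format, one destination format, and square power-of-two sizes, which is the price of folding everything into a single pass.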