I'm not sure what the latency is for moving data between SSE and GP registers on modern x86 processors. If I remember correctly, late-1990s x86 was notoriously slow at moving data between GP registers and the x87 FP stack. I could certainly believe that on some architectures, moving data between vector registers and GP registers is slower than an L1 cache access.
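To make the two paths concrete, here is a rough sketch in C using SSE2 intrinsics. It only shows what a direct register move and a trip through memory look like; it does not measure anything. The function names `roundtrip_direct` and `roundtrip_memory` are made up for illustration, and an optimizing compiler is free to turn either one into different instructions than the comments suggest, so check the generated assembly before drawing conclusions. A scalar fallback is included only so the file still builds on non-x86 targets.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Direct path: GP register -> XMM register -> GP register.
 * These intrinsics typically compile to MOVD, but that is up to the compiler. */
static int32_t roundtrip_direct(int32_t x) {
    __m128i v = _mm_cvtsi32_si128(x);  /* GP -> low 32-bit lane of an XMM register */
    v = _mm_add_epi32(v, v);           /* a little work in the vector domain */
    return _mm_cvtsi128_si32(v);       /* low lane of XMM -> GP */
}

/* Memory path: the value goes through a small stack buffer in each direction.
 * The buffer will normally sit in L1, which is the comparison point in question. */
static int32_t roundtrip_memory(int32_t x) {
    int32_t buf[4] = { x, 0, 0, 0 };
    __m128i v = _mm_loadu_si128((const __m128i *)buf);  /* memory -> XMM */
    v = _mm_add_epi32(v, v);
    _mm_storeu_si128((__m128i *)buf, v);                /* XMM -> memory */
    return buf[0];                                      /* memory -> GP */
}
#else
/* Hypothetical scalar fallback for non-SSE2 targets: same result, no vector registers. */
static int32_t roundtrip_direct(int32_t x) { return (int32_t)((uint32_t)x * 2u); }
static int32_t roundtrip_memory(int32_t x) { return (int32_t)((uint32_t)x * 2u); }
#endif
```

Both functions double their argument, so `roundtrip_direct(21)` and `roundtrip_memory(21)` each return 42. To compare latency you would need to put each one in a long dependent chain and time it on the specific microarchitecture you care about, since the answer varies from one to the next.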