I don't think anyone knows yet why transformers work. "Attention is moving vectors in embedding space" does not work as an explanation. At the moment we know how they multiply vectors, pass them through feed-forward layers, and so on, but let's not pretend that we understand how they "work".
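To be clear, the mechanics themselves are not the mystery. Here is a minimal NumPy sketch of scaled dot-product attention (toy shapes, random weights, function names of my own choosing, purely for illustration): a handful of matrix multiplies and a softmax. The open question is why stacking this operation at scale produces the behaviour it does.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # The "mechanical" part we do understand: project the inputs,
    # take scaled dot products, normalize, and mix the value vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # weighted average of value vectors

# Toy example: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): each token's vector is now a mixture of the others
```

Every line of that is transparent; none of it tells you why the trained weights end up doing what they do.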