I can see how this wouldn’t be covered in an undergrad CS education. I took only a single computer architecture class, which was extremely limited in scope. The only reason I knew about vectorization during undergrad is that a friend mentioned it to me once.
Are you saying that the majority of the speedup is from caches and then there's a secondary, much smaller speedup from the vectorization? Or are you saying all the speedup is from caches and I'm off base here with vectorization?