a polyhedral compiler wouldn't find this either - polyhedral compilation is for finding optimal schedules for loop nests, i.e., the order in which independent (wrt dataflow) iterations run. as far as i know, a transpose can't be expressed in the polyhedral model.
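to make "schedule" concrete, here's a toy sketch (mine, not how any real polyhedral compiler is implemented): the same set of independent iterations run under two different orders. `run` and the lambda "schedules" are made-up names for illustration; a real tool would represent the schedule as an affine map over the iteration domain rather than a sort key.

```python
from itertools import product

def run(schedule, a, n, m):
    """Compute out[i][j] = 2 * a[i][j], visiting iterations in `schedule` order."""
    out = [[0] * m for _ in range(n)]
    # The iteration domain is every (i, j) pair; the schedule only reorders it.
    order = sorted(product(range(n), range(m)), key=schedule)
    for i, j in order:
        out[i][j] = 2 * a[i][j]
    return out, order

a = [[1, 2, 3], [4, 5, 6]]

# Original order: i outer, j inner.
res_ij, order_ij = run(lambda p: (p[0], p[1]), a, 2, 3)
# Interchanged order: j outer, i inner.
res_ji, order_ji = run(lambda p: (p[1], p[0]), a, 2, 3)

# Iterations are independent, so both orders give the same result.
print(res_ij == res_ji, order_ij != order_ji)
```

only the visiting order changes here; the array `a` itself is never rearranged in memory, which is the distinction i'm drawing above.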
Hmm, I thought GCC's polyhedral optimizations included a loop transposition, but it turns out I was remembering an old "-floop-transpose" flag that seems to exist only in old Apple GCC, to get a SPEC win…